Opinion
Enzyme optimization: moving from blind evolution to statistical exploration of sequence–function space
Richard J. Fox and Gjalt W. Huisman
Codexis, Inc., 200 Penobscot Drive, Redwood City, CA 94063, USA
Directed evolution is a powerful tool for the creation
of commercially useful enzymes, particularly those
approaches that are based on in vitro recombination
methods, such as DNA shuffling. Although these types
of search algorithms are extraordinarily efficient compared with purely random methods, they do not explicitly represent or interrogate the genotype–phenotype
relationship and are essentially blind in nature. Recently,
however, researchers have begun to apply multivariate
statistical techniques to model protein sequence–function relationships and guide the evolutionary process by
rapidly identifying beneficial diversity for recombination. In conjunction with state-of-the-art library generation methods, the statistical approach to sequence
optimization is now being used routinely to create
enzymes efficiently for industrial applications.
collecting diversity from homologous enzymes has proven
useful in the discovery of improved enzymes [9–11]. Semirational approaches to enzyme engineering that are based
on structural information have also been shown to help
reduce the search space of useful mutations to those that
are more likely to improve specific functions [11,12]. For
example, stereoselectivity is more commonly affected by
changes near the active site [13], a fact exploited by Reetz
and others to identify beneficial mutations [14]. Conversely,
the effects of mutations on properties such as activity and
thermostability seem to be uncorrelated with their distance
from the active site [13], and mutations distal from the
active site have been shown to be useful sources of diversity
for improving the catalytic efficiency of enzymes [15],
Introduction
Biocatalysis has undergone explosive growth over the past
several years, playing increasingly key roles in chemical
and bioindustrial manufacturing and process development, and in enabling green chemistries that help to
protect the environment [1–4]. Enzymes isolated directly
from nature rarely exhibit the ideal combination of traits
and activities required for industrial use. Ideally, a perfect
‘designer’ enzyme, with the appropriate activities and
traits, could be created ex novo based on intimate knowledge of the relationship of enzyme structure and function.
This admirable goal, however, is neither presently achievable nor is it likely to be in the near future [5–8]. Given the
promising, but presently limited, success of purely rational
approaches to enzyme optimization, non-rational or semirational approaches will continue to play the dominant role
in the creation of new enzymes and in further expanding
the impact of biocatalysis (Figure 1). This leads to a view of
enzyme optimization in which hypotheses regarding
genetic diversity (see Glossary; identified in a multitude
of ways) are tested with screens that resemble the ultimate
application. The screens then serve as fitness functions for
search algorithms used to sift through that diversity.
Diversity: differences between amino acid or DNA sequences that are present
in a library of variants.
DNA shuffling: the original in vitro recombination technique, widely regarded
as revolutionizing the field of directed molecular evolution [17]. DNA shuffling
consists of random fragmentation of related parental DNA templates, followed
by primerless PCR recombination. The resulting libraries are composed of
highly chimeric progeny. Variants with improved function are carried forward
into subsequent rounds of evolution, fixing beneficial mutations in the
population but discarding deleterious ones.
Fitness landscape (or sequence–function landscape): a conceptual model used
to visualize or study the relationship between genotypes and phenotypes
(functions or properties of interest). Often conceived of as mountain ranges,
the ‘height’ of the landscape at a given genotype coordinate corresponds to the
phenotype. Correlated (or smooth) fitness landscapes indicate that nearby
locations in genotype space correspond to similar phenotypes.
Machine learning: a subfield of artificial intelligence, which, in the case of
supervised learning, is concerned with mapping inputs to desired outputs. The
inputs are often highly dimensional, with many variables (also known as
features or predictors), and the goal of learning is to produce robust, predictive
models based on a set of training data. The models can be used to make new
predictions or interrogated to infer which input variables are important for
increased output or response.
ProSAR: protein sequence activity relationships, a technique inspired by QSAR
and other machine learning applications, which is used to infer the effects of
mutations in a combinatorial library of proteins on a function or property of
interest [31]. The models derived from the learning or training phase can be used
to classify mutations as beneficial, deleterious or neutral and these classifications
can be used to design improved variants in subsequent rounds of evolution.
QSAR: quantitative structure activity relationships, a machine-learning-based
approach to small molecule drug design that is used extensively in medicinal
chemistry to determine which molecular features correlate with improved drug
activity [29].
Recombination: the formation of new combinations of diversity in progeny
sequences that are not present in the parents.
Semi-synthetic shuffling: a next generation method of creating combinatorial
libraries where DNA oligonucleotides are used to carry mutations that can
freely recombine with a parental DNA template [27,28], giving significantly
higher incorporation rates of mutations compared with traditional DNA
shuffling.
Diversity generation
Generation of genetic diversity is a crucial component of the
optimization process and has received a great deal of attention over the years. In addition to random mutagenesis,
Glossary
Corresponding author: Fox, R.J. ([email protected]).
0167-7799/$ – see front matter © 2008 Elsevier Ltd. All rights reserved. doi:10.1016/j.tibtech.2007.12.001 Available online 28 January 2008
making rational hypothesis generation for these remote
positions more challenging.
Although specific properties (such as stereoselectivity)
are often individually necessary for practical biocatalysts,
often only a collection of properties can be considered
jointly sufficient to achieve commercial viability. Biocatalysts must be optimized to function in non-natural environments, which include wide fluctuations in process
conditions from high substrate to high product concentrations, and exposure to solvents over prolonged periods
of time [16]. This multitude of objectives makes the task of
identifying beneficial diversity non-trivial and mutations
that span the entire structure of the enzyme must be
considered. Searching through these numerous possibilities to create improved variants has been the raison d’être
of recombination-based directed evolution [17].
This view of enzyme optimization separates the task of
diversity generation from the search process and enables
numerous methods of sourcing diversity to be applied, including information derived from computational and
rational design [11,18,19]. In this regard rational design
and directed evolution are not competing or mutually exclusive technologies – they are complementary approaches that
address orthogonal considerations.
Recombination
Owing to the large number of possible variants, enzyme
optimization requires efficient and robust search algorithms to rapidly sift through the myriad possibilities.
Researchers have begun to look for ways to improve this
process. The ability to obtain a highly improbable outcome
through a blind, evolutionary algorithm, such as DNA
shuffling, is truly awe inspiring (Box 1). Nevertheless,
the fact that natural selection is indeed blind led Darwin
himself to question its overall efficiency: ‘What a book a
Devil’s Chaplain might write on the clumsy, wasteful,
blundering, low and horridly cruel works of nature’ [20].
As powerful as the probability sieve is, the fact that it does
not make use of more of the available information present
in a population of variants leads us naturally to ask
whether we can make the process of artificial selection a
little less ‘clumsy, wasteful and blundering’.
Before the advent of DNA shuffling, directed evolution
was limited to iterative rounds of random mutagenesis,
where the best sequence is identified at each round of
evolution and used as a template for additional mutagenesis. Although this strategy is simple to execute, it has the
significant drawback of discarding numerous beneficial
mutations at each round of evolution. Such asexual evolutionary methods are generally regarded as less efficient
than algorithms that recombine beneficial mutations
[21].
Given the superior efficiency of sexual or recombination-based methods, significant efforts have been devoted to
developing new ways of creating combinatorial libraries
[22,23]. Although these efforts have played a role in advancing the field of directed evolution, they have generally not
sought to alter fundamentally the nature of the search
strategy itself [24,25]. The question that has become
increasingly important is: what is the most efficient way
to identify and recombine beneficial diversity?
Trends in Biotechnology
Vol.26 No.3
Box 1. Evolution as a probability sieve
A favorite aphorism of the legendary R.A. Fisher (cofounder of the
neo-Darwinian synthesis and father of modern statistics) was that,
‘Natural selection is a mechanism for generating an exceedingly
high degree of improbability’ [43]. Although the power of recursive
recombination has been known for some time within the protein
engineering community [44,45], the magnitude of the effective
search power is often not fully appreciated. By way of example,
consider a bit string consisting of zeroes and ones of length N = 40
whose bits are functionally coupled to K = 1 other positions. The
fitness of such a bit string is determined by the Kauffman NK
landscape, a popular fitness landscape that has been widely used to
study evolutionary algorithms [46]. In our case, we can imagine the
bit string corresponds to the variable positions in a combinatorial
library of enzymes, with only two amino acid options (corresponding to choices zero and one) at each position to keep things simple.
The parameter K = 1 serves as a measure of the epistatic coupling
between residues, indicating each mutated position is functionally
coupled to one other mutated position. This corresponds to a
roughly additive landscape, which has been shown to be the case
for a variety of proteins [47–53]. Note that the functional coupling
between all residues in the protein is expected to be much greater
but we are only concerned with a combinatorial search within a
subset of mutations under consideration. Imagine the goal is to
obtain the highest fitness possible with the least effort. An
exhaustive search of all 2^40 possibilities is not tenable. Yet a simple
genetic algorithm, where the top ten solutions from a population of
1000 variants are bred together with a uniform crossover operator
over four rounds of evolution (simulated on 1000 random NK
landscapes), yields >98% of the theoretically optimal fitness gain
[54]. In other words, a vast sequence space of roughly 10^12 can be effectively
searched by examining only 4000 solutions. A common fear among
protein engineers is the idea that they might miss a mutation or
combination of mutations that could increase function, if only they
were clever or exhaustive enough to find them [55]. This kind of
thinking is particularly popular in motivating the need for rational
enzyme designs, where large library sizes are screened in silico
under the pretense that such large libraries could never be screened
in vivo. But Voltaire’s incisive observation that ‘the best is the
enemy of the good’ [56] is worth considering here. The facile
example given above demonstrates it might not be necessary or
desirable to search for the optimal solution if many acceptable
solutions can be obtained efficiently with algorithms that work
exceedingly well on the less rugged portions of fitness landscapes
[57,58]. In the end, protein engineers must always question what the
increased return is (if any) for a given approach and whether it is
worth the investment.
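The genetic algorithm described in this box can be sketched in a few lines. This is a minimal illustration and not the simulation reported in [54]: the choice of coupling partner for each position, the uniform random contribution tables and all function names are assumptions made here, and the sketch does not compute the theoretical optimum, so it does not reproduce the >98% figure.

```python
import random

def make_landscape(n, rng):
    # K = 1: every position is coupled to one other, randomly chosen position
    # (assumption: the source does not say how partners are assigned).
    partners = [rng.choice([j for j in range(n) if j != i]) for i in range(n)]
    # One random contribution per (own bit, partner bit) pair at each position.
    tables = [[[rng.random(), rng.random()], [rng.random(), rng.random()]]
              for _ in range(n)]
    return partners, tables

def fitness(bits, partners, tables):
    # Mean of the per-position contributions, so fitness lies in [0, 1].
    return sum(tables[i][b][bits[partners[i]]]
               for i, b in enumerate(bits)) / len(bits)

def uniform_crossover(a, b, rng):
    # Each position is inherited from either parent with equal probability.
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]

def run_ga(n=40, pop_size=1000, n_parents=10, rounds=4, seed=0):
    rng = random.Random(seed)
    partners, tables = make_landscape(n, rng)
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best_per_round, screened = [], 0
    for _ in range(rounds):
        # "Screen" the whole population and rank it by fitness.
        scored = sorted(((fitness(v, partners, tables), v) for v in population),
                        key=lambda pair: pair[0], reverse=True)
        screened += len(population)
        best_per_round.append(scored[0][0])
        # Breed the next population from the top variants only.
        parents = [v for _, v in scored[:n_parents]]
        population = []
        for _ in range(pop_size):
            a, b = rng.sample(parents, 2)
            population.append(uniform_crossover(a, b, rng))
    return best_per_round, screened

best_per_round, screened = run_ga()
print(screened, [round(f, 3) for f in best_per_round])
```

With the defaults, four rounds of 1000 variants give the 4000 screened solutions mentioned above, against a space of 2^40 bit strings.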
Machine-learning guided evolution
In the past decade, protein engineers were content to let
the gentle recombination of DNA shuffling [17] serve as the
mechanism for sifting through mutations that result in
more fit variants. But, as Moore and coworkers observed,
there is a statistical preference for the absence of mutation
in the progeny when recombining multiple parents [26]. In
the simplest example, consider ten parents each carrying
one mutation. The progeny from shuffling these ten
parents will have, on average, only a one in ten chance
of having any one mutation at a given position and the
probability of finding five or more mutations together is
less than one in 600. With the advent of semi-synthetic
shuffling [27,28] (whereby oligonucleotides carrying
specific mutations are spiked into a PCR reaction at a high
concentration relative to the parent template), protein
engineers have been able to avoid this disadvantage by
enabling the free recombination of desired changes,
resulting in a higher representation of progeny in the
library carrying multiple mutations. Unfortunately, this
Figure 1. Accelerating impact of biocatalysis. Before 1970 biocatalysis was generally limited to those applications using enzymes found in nature (e.g. chymosin from calf
and sheep stomach for the manufacture of cheese). With the advent of gene level random and site-directed mutagenesis in the 1980s, enzyme optimization via rational and
directed evolutionary approaches became possible. Recombination-based approaches, such as DNA shuffling, were used to accelerate the process of directed evolution in
the 1990s. In the current decade, computational library design coupled with directed evolution has proven to be a useful strategy along with statistical modeling of
sequence–function relationships.
approach raises another problem: how to decide which
mutations are worth recombining? Arbitrarily including
all acceptable mutations gives rise to large libraries that
might contain only a relatively small fraction of functional,
much less improved, enzymes. Conversely, smaller
libraries will usually contain more functional enzymes
but the ratio of beneficial to deleterious mutations
becomes the crucial factor in determining the fraction of
improved variants. The answer to identifying which
mutations are worth recombining is trivial when a
mutation occurs alone in the context of a parent enzyme
(as is typical with low dosage random mutagenesis). However, in a combinatorial library a given mutation is almost
always seen in the context of other programmed or random
changes.
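The ten-parent shuffling example earlier in this section can be checked with a short binomial calculation. The sketch below assumes that each of the ten mutated positions is inherited independently with probability 1/10; the function name and the alternative incorporation rate in the last line are illustrative choices, not values from the source.

```python
from math import comb

def prob_at_least(k, n_positions=10, p=0.1):
    """Probability that a progeny carries k or more of the n mutations,
    assuming each is incorporated independently with probability p."""
    return sum(comb(n_positions, j) * p**j * (1 - p)**(n_positions - j)
               for j in range(k, n_positions + 1))

p_five_or_more = prob_at_least(5)
print(p_five_or_more, 1 / p_five_or_more)

# Hypothetical higher incorporation rate, for comparison only.
print(prob_at_least(5, p=0.5))
```

Under these assumptions the probability of five or more mutations is about 0.0016, or roughly one in 612, consistent with the "less than one in 600" stated above. Raising the per-position incorporation rate, which is what semi-synthetic shuffling aims to do, increases this tail probability sharply.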
The task of inferring the effects of individual changes in
combinatorial libraries is ideally suited to the well developed, yet still vibrant, field of machine learning. The
problem is akin to the inverse of the drug design problem,
where structural and chemical features of the molecules
are used to build predictive models that correlate with
observed drug activities. These so-called quantitative
structure activity relationship (QSAR) drug design models
[29] can then be interrogated to determine which features
of the small molecule are important for conferring
improved function against its protein target (typically
the binding of a ligand to a receptor or an inhibitor to
an enzyme). The widespread use of QSAR in medicinal
chemistry has recently prompted interest in the protein
engineering community to use similar techniques to determine which features of an enzyme are important for
imparting improved function against a target substrate.
One such enzyme-based optimization algorithm
inspired by QSAR methodology is known as the protein
sequence-activity relationship (ProSAR) algorithm, the
full details of which are described elsewhere [30–32].
Briefly, ProSAR consists of training a statistical model
on a set of sequence–function data to classify individual
mutations (or elements, patterns and motifs at the DNA or
protein level) as beneficial, neutral or deleterious. These
classifications are then used to design subsequent
libraries, which are enriched for beneficial mutations
and purged of deleterious ones (Figure 2). Typically, an
improved sequence is identified at the beginning of every
round of evolution and used as a parent template for
subsequent combinatorial libraries. As is typically done
with QSAR models, the impact of a mutation on function is
generally assumed to be additive with respect to other
mutations (Box 2) – the overall goal is to establish important trends rather than obtain high resolution details of the
local sequence–function landscape. Essentially, the algorithm is used to unmask the effects of individual mutations
so that a new parent template can be inspected to see which
beneficial mutations it lacks or which deleterious
mutations are present – these mutations can then be added
to, or removed from, the next round library designs
(Figure 2). The process proceeds recursively, adding new
diversity discovered from any number of suitable sources.
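The library design step described in this paragraph can be illustrated with simple set operations. This is a sketch under stated assumptions, not the published ProSAR implementation [30–32]: the mutation labels M1 to M5, the function name and the returned dictionary layout are hypothetical.

```python
def design_next_round(parent, classification):
    """Compare a parent template with the model's classifications.

    parent: set of mutations present in the new parent template.
    classification: mutation -> "beneficial", "neutral" or "deleterious".
    """
    beneficial = {m for m, label in classification.items() if label == "beneficial"}
    deleterious = {m for m, label in classification.items() if label == "deleterious"}
    return {
        "fix": sorted(parent & beneficial),      # beneficial, already in the parent
        "add": sorted(beneficial - parent),      # beneficial, missing from the parent
        "revert": sorted(parent & deleterious),  # deleterious, present in the parent
    }

# Hypothetical classifications from one round of modeling.
classification = {
    "M1": "beneficial",
    "M2": "deleterious",
    "M3": "neutral",
    "M4": "beneficial",
    "M5": "deleterious",
}
parent = {"M1", "M2", "M3"}
plan = design_next_round(parent, classification)
print(plan)
```

Neutral mutations are left out of the plan here for brevity; in practice they could be carried into the next library for re-testing in a new context, alongside fresh diversity from other sources.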
The kind of sequence–activity modeling used in the
ProSAR algorithm has subsequently been adopted by other
researchers to accelerate enzyme optimization [33–35].
Such modeling can be conducted in any number of ways
that correlate features of a molecule with some desired
property or response of interest. Classical machine-learning techniques, such as linear regression (first hit upon by
Darwin’s cousin Francis Galton over 100 years ago), can be
used when there are relatively few features compared with
the number of observations (response measurements).
Figure 2. ProSAR driven enzyme evolution. (a) A combinatorial library of enzymes is constructed by stochastically loading a DNA parent template (long horizontal lines)
with mutations of unknown or uncertain effects (yellow spheres on structure and black circles on backbone). (b) The library is screened and a subset of functionally diverse
variants is sequenced and the data are subjected to statistical analysis [31]. (c) The analysis uncovers the effects of various mutations and classifies them as beneficial
(green), deleterious (red) or neutral (white). (d) Typically, the most active variant from one round is selected as a parent template for the next round library design(s). (e)
Beneficial mutations are taken forward to the next round. These mutations are incorporated by oligonucleotide-based semi-synthetic shuffling [27,28] (short horizontal
lines). Deleterious mutations present in the new parent template are removed by reverting to the previous template’s codon (short lines without circles). (f) As the
population of mutations is sifted down, diversity is maintained with the addition of new mutations from any number of suitable sources, including random mutagenesis,
homologous enzymes or semi-rational/computational studies.
Alternatively, modern techniques such as partial least
squares regression [36] can be used when there exists a
paucity of observations and many features. The precise
algorithms used to make inferences about mutational
effects are not coupled in any way to the particular problem of sequence–function modeling, enabling developments in the field of machine learning to be fully leveraged
for the task of enzyme engineering should they be made
available.
It is worth noting that, although the goal of the statistical modeling is to create a local map of the sequence–
function landscape, the global fitness landscape might be
highly degenerate in terms of acceptable solutions. There
might be branch points in the search space that lead to
different local optima, but as long as the algorithm continues to make progress in the directions of the highest
gradients, such degeneracies pose no special problems
(Figure 3).
In addition to the multidimensional nature of the input
space (the mutational diversity), the functional output is
often multidimensional as well, where multiple properties
such as activity, selectivity, stability and tolerance need to
be optimized. Optimization on such multidimensional
outputs or properties of interest poses its own special
challenges to evolutionary search algorithms [37]; however, the statistical approach of separating the effects
of individual mutations has its advantages in these
situations as well (Figure 4). To the extent that beneficial
mutations can be identified to contribute to various properties independently, they can be freely recombined to
improve one or more properties at a time without adversely
affecting the others. It is also worth noting that as the
number of objectives grows, considerations regarding the
screening capacity become increasingly important. Getting
the most information out of the fewest number of assays and
using that information to design high quality libraries is
crucial to reducing the level of effort required to optimize
enzymes. The statistical optimization techniques described
here are well suited to such situations because only a
relatively small number of functionally diverse variants
are required to build models for all the objectives. Unlike
some traditional directed evolution approaches, deep
screening of libraries to find the best variants is not necessary – it is sufficient to identify beneficial mutations and
move them forward into the next round, thus emphasizing
library quality over quantity to effect improvements in
enzyme function [38].
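The idea of separating mutational effects per property can be sketched as a simple filter. The example assumes that one model has already been fitted per property and that each mutation therefore has an estimated effect on activity and on selectivity; the effect values, the mutation labels and the tolerance that defines "neutral" are all hypothetical.

```python
def select_mutations(effects, tolerance=0.05):
    """Keep mutations that help at least one property and harm none.

    effects: mutation -> tuple of estimated effects, one per property,
             for example (activity, selectivity) regression coefficients.
    """
    selected = {}
    for mutation, values in effects.items():
        signs = tuple("+" if v > tolerance else "-" if v < -tolerance else "0"
                      for v in values)
        if "+" in signs and "-" not in signs:
            selected[mutation] = signs
    return selected

# Hypothetical (activity, selectivity) effects for six mutations.
effects = {
    "M1": (0.40, 0.30),    # (+,+)
    "M2": (0.50, 0.00),    # (+,0)
    "M3": (0.60, -0.40),   # (+,-): conflicting, left out
    "M4": (0.01, 0.20),    # (0,+)
    "M5": (-0.30, -0.20),  # (-,-)
    "M6": (0.00, 0.02),    # (0,0)
}
chosen = select_mutations(effects)
print(chosen)
```

The same filter extends to more than two properties, because it only checks that at least one sign is positive and none is negative.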
The first application of a machine-learning-based
approach to create an industrially useful enzyme was
demonstrated by evolving a halohydrin dehalogenase to
improve the volumetric productivity per unit catalyst loading of a biocatalytic process 4000-fold [32]. During the
course of the evolution, the ProSAR algorithm significantly
outperformed traditional DNA shuffling formats owing to
Box 2. Modeling additivity of mutational effects
The statistical models are typically written so that the mutations are
assumed to make additive contributions to the property of interest
(activity, thermostability, enantioselectivity/specificity, etc.):
$$y = \sum_{i=1}^{N} c_i x_i + y_0$$
where $y$ is the predicted response (function or property of interest),
$N$ the number of unique mutations, $c_i$ the regression coefficient for
mutation $i$, $x_i$ a binary digit indicating the presence or absence (1 or 0)
of mutation $i$, and $y_0$ the mean of the measured response. The
assumption of rough additivity is supported by several studies that
have examined enzymes with multiple mutations [47–53,59].
Although it is straightforward to construct models that include
interacting mutations, they usually require significantly more
sequence–activity data to prevent the overfitting that leads to poor
predictive power [30]. Although the additivity assumption might be
violated in practice, there are at least two reasons to suggest why
deviations from it do not generally pose significant problems for a
search algorithm that does not explicitly include nonlinear effects.
First, as noted by Stephanopoulos and coworkers [58], it might be
sufficient to simply optimize over regions of the sequence–function
landscape that are not overly rugged, where ruggedness implies that
only particular combinations of mutations will prove functional or
beneficial. Although it is possible that useful combinations might be
overlooked by a given search algorithm, as long as acceptable
improvements can be obtained with high efficiency it does not matter
that ever more optimal solutions could be obtained in principle.
Second, nonlinear effects, although not directly present in additive
statistical models, might still cast shadows into lower dimensions
that can be exploited by a search algorithm. As long as the average
contribution to fitness of a given mutation over a variety of contexts
is neutral to beneficial it is likely to get incorporated into the next
round library. For example, during the evolution of a halohydrin
dehalogenase [32], a particular combination of mutations was found
to be beneficial: G177A/V178C (as inferred from observation and
from statistical models that did explicitly incorporate a cross-product
term [30]). An additive model based on the same set of data showed
that G177A was, on average, beneficial, whereas V178C was essentially neutral. In these situations the search algorithm will fix
beneficial mutations already present in the new parent template
(e.g. 177A) while examining statistically neutral mutations that might
be conditionally beneficial (e.g. 178C). The next round would then be
free to discover the 177A/178C interaction because 177A is fixed and
178C is beneficial in this context. Although particular regions of a
fitness landscape might not always lend themselves to this kind of
separable attack on pairs of interacting mutations, the above idea
demonstrates that use of an additive model can enable progress,
even on semi-rugged surfaces, a result similar to that observed by
using genetic algorithms that do not explicitly respect interactions
(Box 1). Essentially, the search algorithm uses information from the
local landscape to climb Mount Improbable [42] by exploring directions with the highest gradients (Figure 3).
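As a concrete illustration of the additive model in this box, the sketch below fits the coefficients by ordinary least squares on simulated data and then classifies each mutation. Several things are assumptions made for the example: the data are synthetic, the constant term is fitted as an intercept (the box defines it as the mean measured response), plain least squares stands in for whatever regression method is used in practice (the main text mentions partial least squares for sparse data), and the classification threshold is arbitrary.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_additive(X, y):
    """Least-squares fit of y = y0 + sum_i c_i * x_i via the normal equations."""
    rows = [[1.0] + [float(v) for v in x] for x in X]  # leading 1 fits y0
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return beta[0], beta[1:]

def classify(coefs, threshold=0.1):
    # Threshold is an arbitrary illustrative cutoff, not a published value.
    return ["beneficial" if c > threshold
            else "deleterious" if c < -threshold
            else "neutral" for c in coefs]

# Synthetic library: 200 variants, 6 mutations, each present with probability 1/2.
rng = random.Random(1)
true_effects = [0.8, 0.5, 0.0, -0.6, 0.02, -0.9]
X = [[rng.randint(0, 1) for _ in true_effects] for _ in range(200)]
y = [1.0 + sum(c * x for c, x in zip(true_effects, row)) + rng.gauss(0, 0.05)
     for row in X]

y0, coefs = fit_additive(X, y)
labels = classify(coefs)
print(round(y0, 2), [round(c, 2) for c in coefs], labels)
```

Because every variant carries several mutations at once, no single measurement isolates one mutation; the regression recovers the individual contributions from the population as a whole.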
its ability not only to rapidly identify beneficial mutations
in hits, but also to identify such mutations in variants with
reduced function that would be overlooked by traditional
methods. Recently, the approach was further validated by
Arnold and co-workers who used statistical models of
sequence–function relationships to improve the half-life
of a cytochrome P450 monooxygenase over 100-fold [33]. In
another recent example, Gustafsson and co-workers were
able to utilize sequence–activity modeling to improve the
activity of a proteinase K 20-fold [34]. Despite its recent
introduction into the toolbox of protein engineers, the
statistical approach to enzyme optimization is already
creating opportunities to leverage sequence/structure–
function relationships in ways that promise to further
accelerate industrial biocatalysis (Box 3).
Figure 3. Climbing Mount Improbable. Enzyme optimization on sequence–function
landscapes can be thought of as climbing a mountain of improbability in a high
dimensional space [42] (Box 1), where the dimensions consist of mutated positions
in an alignment of related variants. Most random solutions are unlikely to lead to
fitter variants. Notoriously difficult to visualize in only three dimensions, the image
shows a simplified version of such a landscape, with two variable dimensions
corresponding to the genotype of a variant and a vertical dimension corresponding
to some phenotypic response (function or property) of interest. The statistical
modeling constructs a map of the local fitness landscape around the point p. The
map is then used to explore directions with the highest gradients (a,b,c). Because the
effects of mutations are evaluated around the local fitness landscape, greedy
extrapolation along the single highest gradient (synthesizing just one or a few of the
best predicted variants) might result in the recombination of conflicting mutations
that lead to variants with suboptimal or reduced function, envisaged by the saddle
point on the mountain pass in the direction of b or the valley on the other side.
However, because the algorithm stochastically explores a variety of higher gradient
directions, it protects against this possibility and superior solutions (a,c) can usually
be obtained provided the landscape is not overly rugged.
Although the current methods for machine-learning
guided evolution have shown promise, researchers have
only begun to explore their limitations and future possibilities. For example, the current methodology is concerned
with identifying the effects of existing mutations, but what
about mutations that have not been observed yet? Can
statistical models based on features derived from in silico
enzyme models be used to predict their effects? Such a
hybrid strategy between statistical modeling and computational design could be used to help address the crucial
Box 3. New opportunities
The machine-learning-based approach to protein engineering is
already opening up new frontiers in biocatalysis. For example,
information derived from different optimization programs using
enzymes of known sequence as a starting point can be used to
create a panel of related, but functionally diverse enzymes that, as a
population, are capable of performing a given reaction on a wide
variety of substrates. Based on the information obtained from the
statistical models, each individual in this panel can be pre-tuned to
be stable to chemical process conditions as well as manufacturable
at large-scale. These enzyme panels can then be used to rapidly
identify starting activities along with beneficial diversity for new
substrates for which enzymes have not been available before, as
well as provide excellent starting points for new evolution projects
at the same time. Because early stage drug development requires
rapid evaluation of chemo- and biocatalytic routes (on the order of
days), rapid identification of enzyme variants with sufficient starting
activity and enantioselectivity using such panels is extremely
valuable [4]. This enzyme panel concept leads us to believe that
the impact of biocatalysis (Figure 1) will continue to increase in the
near future.
Figure 4. Multiobjective optimization. Simultaneous optimization on multiple objectives using traditional recombination methods is generally hampered by the presence of
conflicting mutations for different properties. For example, enzymes performing the enantioselective reduction of a ketone might have increased selectivity but reduced
activity or vice versa. In these situations, it is not generally clear which variants to recombine or which property should take precedence. Conversely, because statistical
models are trained on each property of interest, they can be used to infer the effects of individual mutations. The graph shows the results of such modeling, where the effect
of each mutation is plotted for each property of interest. The four quadrants correspond to combinations of beneficial (+)/deleterious (−) mutations on activity/selectivity.
Mutations in the (activity,selectivity) = (+,+) quadrant are the most desirable to take forward into the next round of evolution, but mutations conferring increased function for
at least one property that does not adversely affect the other properties are also good candidates; for example, (+,0) or (0,+) mutations.
task of diversity generation [39,40]. Another open question
is to what extent does context play a role in the predicted
effects of mutations? Although the models generally
assume an additive, linear contribution for each mutation
(Box 2), the assumption is likely to break down as the
enzyme becomes heavily mutated [41]. Answers to these
questions (which will probably come in the form of trends
rather than hard and fast rules) will hopefully serve to
further the goal of obtaining ever more efficient enzyme
optimization strategies, as well as shed light on the
mechanistic principles of enzymatic catalysis.
Concluding remarks
Biocatalysis is quickly becoming a real option for
pharmaceutical and bioindustrial manufacturing. In
the past, efforts to create beneficial diversity have played
a crucial role in improving enzyme performance, but the
search algorithm used to actually sift through that
diversity has, until recently, received little attention.
Well-established statistical methods used in disparate
fields from finance to engineering are ideally suited to
learning from the kind of high-dimensional data produced in enzyme engineering experiments. Learning
about the local sequence–function landscape has now
been applied to accelerate the optimization process
beyond that which was possible with blind algorithms
such as classical DNA shuffling. With this development,
we are confident that advanced evolution methods will
enable biocatalysis to fulfill its long-held promise to
create numerous, commercially attractive, enzymatic
processes for next-generation fuels, industrial chemicals
and pharmaceuticals.
Acknowledgements
We thank Petra Gross for inviting us to write on this topic and three
anonymous reviewers for constructive comments and suggestions. We are
especially thankful to Lori Giver, Michael Clay, John Grate, Jim Lalonde,
Russell Sarmiento and Jennifer Jones for their critical review of the
manuscript, and for helpful advice and discussions.
References
1 Grate, J. (2006) Directed Evolution of Three Biocatalysts to Produce the
Key Chiral Building Block for Atorvastatin, the Active Ingredient in
Lipitor®. 2006 Presidential Green Chemistry Challenge Award: Greener
Reaction Conditions Award. United States Environmental Protection
Agency, Washington, D.C., June 26–30 (http://www.epa.gov/gcc/pubs/
pgcc/winners/grca06.html)
2 Schoemaker, H.E. et al. (2003) Dispelling the myths—biocatalysis in
industrial synthesis. Science 299, 1694–1697
3 Thayer, A. (2006) Enzymes at work. Chem. Eng. News 84, 15–25
4 Pollard, D.J. and Woodley, J.M. (2007) Biocatalysis for pharmaceutical
intermediates: the future is now. Trends Biotechnol. 25, 66–73
5 Dwyer, M.A. et al. (2004) Computational design of a biologically active
enzyme. Science 304, 1967–1971
6 Park, H.S. et al. (2006) Design and evolution of new catalytic activity
with an existing protein scaffold. Science 311, 535–538
7 Robertson, M.P. and Scott, W.G. (2007) Biochemistry: designer
enzymes. Nature 448, 757–758
8 Tawfik, D.S. (2006) Biochemistry. Loop grafting and the origins of
enzyme species. Science 311, 475–476
9 Castle, L.A. et al. (2004) Discovery and directed evolution of a
glyphosate tolerance gene. Science 304, 1151–1154
10 Crameri, A. et al. (1998) DNA shuffling of a family of genes from diverse
species accelerates directed evolution. Nature 391, 288–291
11 Chaparro-Riggers, J.F. et al. (2007) Better library design: data-driven
protein engineering. Biotechnol. J. 2, 180–191
12 Chica, R.A. et al. (2005) Semi-rational approaches to engineering
enzyme activity: combining the benefits of directed evolution and
rational design. Curr. Opin. Biotechnol. 16, 378–384
13 Morley, K.L. and Kazlauskas, R.J. (2005) Improving enzyme
properties: when are closer mutations better? Trends Biotechnol. 23,
231–237
14 Reetz, M.T. et al. (2006) Directed evolution of enantioselective
enzymes: iterative cycles of CASTing for probing protein-sequence
space. Angew. Chem. Int. Ed. Engl. 45, 1236–1241
15 Siehl, D.L. et al. (2007) The molecular basis of glyphosate resistance
by an optimized microbial acetyltransferase. J. Biol. Chem. 282,
11446–11455
16 Rubin-Pitel, S.B. and Zhao, H. (2006) Recent advances in biocatalysis
by directed enzyme evolution. Comb. Chem. High Throughput Screen.
9, 247–257
17 Stemmer, W.P. (1994) Rapid evolution of a protein in vitro by DNA
shuffling. Nature 370, 389–391
18 Voigt, C.A. et al. (2001) Computational method to reduce the search
space for directed protein evolution. Proc. Natl. Acad. Sci. U. S. A. 98,
3778–3783
19 Treynor, T.P. et al. (2007) Computationally designed libraries of
fluorescent proteins evaluated by preservation and diversity of
function. Proc. Natl. Acad. Sci. U. S. A. 104, 48–53
20 Darwin, C. (1856) Letter to J.D. Hooker, 13 July (http://
www.darwinproject.ac.uk/darwinletters/calendar/entry-1924.html)
21 Giver, L. and Arnold, F.H. (1998) Combinatorial protein design by in
vitro recombination. Curr. Opin. Chem. Biol. 2, 335–338
22 Yuan, L. et al. (2005) Laboratory-directed protein evolution. Microbiol.
Mol. Biol. Rev. 69, 373–392
23 Huisman, G.W. and Lalonde, J.J. (2006) Enzyme evolution for chemical
process applications. In Biocatalysis in the Pharmaceutical and
Biotechnology Industries (Patel, R.N., ed.), pp. 717–742, CRC Press
24 Trefzer, A. et al. (2007) Biocatalytic conversion of avermectin to
4″-oxo-avermectin: improvement of cytochrome P450 monooxygenase
specificity by directed evolution. Appl. Environ. Microbiol. 73, 4317–
4325
25 Wong, T.S. et al. (2007) Steering directed protein evolution: strategies
to manage combinatorial complexity of mutant libraries. Environ.
Microbiol. 9, 2645–2659
26 Moore, J.C. et al. (1997) Strategies for the in vitro evolution of protein
function: enzyme evolution by random recombination of improved
sequences. J. Mol. Biol. 272, 336–347
27 Stemmer, W.P. (1994) DNA shuffling by random fragmentation and
reassembly: in vitro recombination for molecular evolution. Proc. Natl.
Acad. Sci. U. S. A. 91, 10747–10751
28 Stutzman-Engwall, K. et al. (2005) Semi-synthetic DNA shuffling of
aveC leads to improved industrial scale production of doramectin by
Streptomyces avermitilis. Metab. Eng. 7, 27–37
29 Kubinyi, H. (1997) QSAR and 3D QSAR in drug design Part 1:
methodology. Drug Discov. Today 2, 457–467
30 Fox, R. (2005) Directed molecular evolution by machine learning
and the influence of nonlinear interactions. J. Theor. Biol. 234,
187–199
31 Fox, R. et al. (2003) Optimizing the search algorithm for protein
engineering by directed evolution. Protein Eng. 16, 589–597
32 Fox, R.J. et al. (2007) Improving catalytic function by ProSAR-driven
enzyme evolution. Nat. Biotechnol. 25, 338–344
33 Li, Y. et al. (2007) A diverse family of thermostable cytochrome P450s
created by recombination of stabilizing fragments. Nat. Biotechnol. 25,
1051–1056
34 Liao, J. et al. (2007) Engineering proteinase K using machine learning
and synthetic genes. BMC Biotechnol. 7, 16
35 Gustafsson, C. et al. (2003) Putting the engineering back into protein
engineering: bioinformatic approaches to catalyst design. Curr. Opin.
Biotechnol. 14, 366–370
36 de Jong, S. (1993) SIMPLS: an alternative approach to partial least
squares regression. Chemometr. Intell. Lab. Syst. 18, 251–263
37 Deb, K. (1999) Multi-objective genetic algorithms: problem difficulties
and construction of test problems. Evol. Comput. 7, 205–230
38 Lutz, S. and Patrick, W.M. (2004) Novel methods for directed evolution
of enzymes: quality, not quantity. Curr. Opin. Biotechnol. 15, 291–297
39 Lushington, G.H. et al. (2007) Whither combine? New opportunities for
receptor-based QSAR. Curr. Med. Chem. 14, 1863–1877
40 Masso, M. and Vaisman, I.I. (2007) Accurate prediction of enzyme
mutant activity based on a multibody statistical potential.
Bioinformatics 23, 3155–3161
41 Hayashi, Y. et al. (2006) Experimental rugged fitness landscape in
protein sequence space. PLoS ONE 1, e96
42 Dawkins, R. (1996) Climbing Mount Improbable, W.W. Norton &
Company
43 Edwards, A.W. (2000) The genetical theory of natural selection.
Genetics 154, 1419–1426
44 Arkin, A.P. and Youvan, D.C. (1992) An algorithm for protein
engineering: simulations of recursive ensemble mutagenesis. Proc.
Natl. Acad. Sci. U. S. A. 89, 7811–7815
45 Youvan, D.C. (1995) Searching sequence space. Biotechnology 13,
722–723
46 Kauffman, S. (1993) The Origins of Order, Oxford University Press
47 Aita, T. et al. (2002) Surveying a local fitness landscape of a protein
with epistatic sites for the study of directed evolution. Biopolymers 64,
95–105
48 Aita, T. et al. (2001) A cross-section of the fitness landscape of
dihydrofolate reductase. Protein Eng. 14, 633–638
49 Benos, P.V. et al. (2002) Additivity in protein-DNA interactions: how
good an approximation is it? Nucleic Acids Res. 30, 4442–4451
50 Lu, S.M. et al. (2001) Predicting the reactivity of proteins from their
sequence alone: Kazal family of protein inhibitors of serine proteinases.
Proc. Natl. Acad. Sci. U. S. A. 98, 1410–1415
51 Sandberg, W.S. and Terwilliger, T.C. (1993) Engineering multiple
properties of a protein by combinatorial mutagenesis. Proc. Natl.
Acad. Sci. U. S. A. 90, 8367–8371
52 Vajdos, F.F. et al. (2002) Comprehensive function maps of the
antigen-binding site of an anti-ErbB2 antibody obtained with shotgun scanning
mutagenesis. J. Mol. Biol. 320, 415–428
53 Wells, J.A. (1990) Additivity of mutational effects in proteins.
Biochemistry 29, 8509–8517
54 Weinberger, E.D. (1996) NP completeness of Kauffman’s NK model, a
tunably rugged fitness landscape. Santa Fe Institute T.R. 96-02-003
55 Kazlauskas, R. (2005) Biological chemistry: enzymes in focus. Nature
436, 1096–1097
56 Voltaire (1772) La Bégueule, Œuvres Complètes de Voltaire, Garnier
Frères
57 Kauffman, S. (2000) Prolegomenon to a general biology. In
Investigations, pp. 1–22, Oxford University Press
58 Styczynski, M.P. et al. (2006) The intelligent design of evolution. Mol.
Syst. Biol. 2, 2006.0020
59 Yoshikuni, Y. et al. (2006) Designed divergent evolution of enzyme
function. Nature 440, 1078–1082