Opinion

Enzyme optimization: moving from blind evolution to statistical exploration of sequence–function space

Richard J. Fox and Gjalt W. Huisman
Codexis, Inc., 200 Penobscot Drive, Redwood City, CA 94063, USA
Corresponding author: Fox, R.J. ([email protected]).
0167-7799/$ – see front matter © 2008 Elsevier Ltd. All rights reserved. doi:10.1016/j.tibtech.2007.12.001 Available online 28 January 2008

Directed evolution is a powerful tool for the creation of commercially useful enzymes, particularly those approaches that are based on in vitro recombination methods, such as DNA shuffling. Although these types of search algorithms are extraordinarily efficient compared with purely random methods, they do not explicitly represent or interrogate the genotype–phenotype relationship and are essentially blind in nature. Recently, however, researchers have begun to apply multivariate statistical techniques to model protein sequence–function relationships and guide the evolutionary process by rapidly identifying beneficial diversity for recombination. In conjunction with state-of-the-art library generation methods, the statistical approach to sequence optimization is now being used routinely to create enzymes efficiently for industrial applications.

Introduction
Biocatalysis has undergone explosive growth over the past several years, playing increasingly key roles in chemical and bioindustrial manufacturing and process development, and in enabling green chemistries that help to protect the environment [1–4]. Enzymes isolated directly from nature rarely exhibit the ideal combination of traits and activities required for industrial use. Ideally, a perfect ‘designer’ enzyme, with the appropriate activities and traits, could be created ex novo based on intimate knowledge of the relationship of enzyme structure and function. This admirable goal, however, is neither presently achievable nor is it likely to be in the near future [5–8]. Given the promising, but presently limited, success of purely rational approaches to enzyme optimization, non-rational or semirational approaches will continue to play the dominant role in the creation of new enzymes and in further expanding the impact of biocatalysis (Figure 1). This leads to a view of enzyme optimization in which hypotheses regarding genetic diversity (see Glossary; identified in a multitude of ways) are tested with screens that resemble the ultimate application. The screens then serve as fitness functions for search algorithms used to sift through that diversity.

Glossary
Diversity: differences between amino acid or DNA sequences that are present in a library of variants.
DNA shuffling: the original in vitro recombination technique, widely regarded as revolutionizing the field of directed molecular evolution [17]. DNA shuffling consists of random fragmentation of related parental DNA templates, followed by primerless PCR recombination. The resulting libraries are composed of highly chimeric progeny. Variants with improved function are carried forward into subsequent rounds of evolution, fixing beneficial mutations in the population but discarding deleterious ones.
Fitness landscape (or sequence–function landscape): a conceptual model used to visualize or study the relationship between genotypes and phenotypes (functions or properties of interest). Often conceived of as mountain ranges, the ‘height’ of the landscape at a given genotype coordinate corresponds to the phenotype. Correlated (or smooth) fitness landscapes indicate that nearby locations in genotype space correspond to similar phenotypes.
Machine learning: a subfield of artificial intelligence, which, in the case of supervised learning, is concerned with mapping inputs to desired outputs. The inputs are often highly dimensional, with many variables (also known as features or predictors), and the goal of learning is to produce robust, predictive models based on a set of training data. The models can be used to make new predictions or interrogated to infer which input variables are important for increased output or response.
ProSAR: protein sequence activity relationships, a technique inspired by QSAR and other machine learning applications, which is used to infer the effects of mutations in a combinatorial library of proteins on a function or property of interest [31]. The models derived from the learning or training phase can be used to classify mutations as beneficial, deleterious or neutral and these classifications can be used to design improved variants in subsequent rounds of evolution.
QSAR: quantitative structure activity relationships, a machine-learning-based approach to small molecule drug design that is used extensively in medicinal chemistry to determine which molecular features correlate with improved drug activity [29].
Recombination: the formation of new combinations of diversity in progeny sequences that are not present in the parents.
Semi-synthetic shuffling: a next generation method of creating combinatorial libraries where DNA oligonucleotides are used to carry mutations that can freely recombine with a parental DNA template [27,28], giving significantly higher incorporation rates of mutations compared with traditional DNA shuffling.

Diversity generation
Generation of genetic diversity is a crucial component of the optimization process and has received a great deal of attention over the years. In addition to random mutagenesis, collecting diversity from homologous enzymes has proven useful in the discovery of improved enzymes [9–11]. Semirational approaches to enzyme engineering that are based on structural information have also been shown to help reduce the search space of useful mutations to those that are more likely to improve specific functions [11,12]. For example, stereoselectivity is more commonly affected by changes near the active site [13], a fact exploited by Reetz and others to identify beneficial mutations [14]. Conversely, the effect of mutations on properties such as activity and thermostability seems to be uncorrelated with their distance from the active site [13], and mutations distal from the active site have been shown to be useful sources of diversity for improving the catalytic efficiency of enzymes [15], making rational hypothesis generation for these remote positions more challenging. Although specific properties (such as stereoselectivity) are often individually necessary for practical biocatalysts, often only a collection of properties can be considered jointly sufficient to achieve commercial viability. Biocatalysts must be optimized to function in non-natural environments, which include wide fluctuations in process conditions from high substrate to high product concentrations, and exposure to solvents over prolonged periods of time [16]. This multitude of objectives makes the task of identifying beneficial diversity non-trivial and mutations that span the entire structure of the enzyme must be considered. Searching through these numerous possibilities to create improved variants has been the raison d’être of recombination-based directed evolution [17]. This view of enzyme optimization separates the task of diversity generation from the search process and enables numerous methods of sourcing diversity to be applied, including information derived from computational and rational design [11,18,19].
In this regard, rational design and directed evolution are not competing or mutually exclusive technologies – they are complementary approaches that address orthogonal considerations.

Recombination
Owing to the large number of possible variants, enzyme optimization requires efficient and robust search algorithms to rapidly sift through the myriad possibilities. Researchers have begun to look for ways to improve this process. The ability to obtain a highly improbable outcome through a blind, evolutionary algorithm, such as DNA shuffling, is truly awe-inspiring (Box 1). Nevertheless, the fact that natural selection is indeed blind led Darwin himself to question its overall efficiency: ‘What a book a Devil’s Chaplain might write on the clumsy, wasteful, blundering, low and horridly cruel works of nature’ [20]. As powerful as the probability sieve is, the fact that it does not make use of more of the available information present in a population of variants leads us naturally to ask whether we can make the process of artificial selection a little less ‘clumsy, wasteful and blundering’.

Before the advent of DNA shuffling, directed evolution was limited to iterative rounds of random mutagenesis, where the best sequence is identified at each round of evolution and used as a template for additional mutagenesis. Although this strategy is simple to execute, it has the significant drawback of discarding numerous beneficial mutations at each round of evolution. Such asexual evolutionary methods are generally regarded as less efficient than algorithms that recombine beneficial mutations [21]. Given the superior efficiency of sexual or recombination-based methods, significant efforts have been devoted to developing new ways of creating combinatorial libraries [22,23]. Although these efforts have played a role in advancing the field of directed evolution, they have generally not sought to alter fundamentally the nature of the search strategy itself [24,25].
The question that has become increasingly important is: what is the most efficient way to identify and recombine beneficial diversity?

Trends in Biotechnology Vol.26 No.3

Box 1. Evolution as a probability sieve
A favorite aphorism of the legendary R.A. Fisher (cofounder of the neo-Darwinian synthesis and father of modern statistics) was that, ‘Natural selection is a mechanism for generating an exceedingly high degree of improbability’ [43]. Although the power of recursive recombination has been known for some time within the protein engineering community [44,45], the magnitude of the effective search power is often not fully appreciated. By way of example, consider a bit string consisting of zeroes and ones of length N = 40 whose bits are functionally coupled to K = 1 other positions. The fitness of such a bit string is determined by the Kauffman NK landscape, a popular fitness landscape that has been widely used to study evolutionary algorithms [46]. In our case, we can imagine the bit string corresponds to the variable positions in a combinatorial library of enzymes, with only two amino acid options (corresponding to choices zero and one) at each position to keep things simple. The parameter K = 1 serves as a measure of the epistatic coupling between residues, indicating each mutated position is functionally coupled to one other mutated position. This corresponds to a roughly additive landscape, which has been shown to be the case for a variety of proteins [47–53]. Note that the functional coupling between all residues in the protein is expected to be much greater but we are only concerned with a combinatorial search within a subset of mutations under consideration. Imagine the goal is to obtain the highest fitness possible with the least effort. An exhaustive search of all 2^40 possibilities is not tenable.
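To make the thought experiment concrete, the sketch below builds one random NK landscape with N = 40 and K = 1 and runs the kind of simple genetic algorithm this box describes (a population of 1000, the top ten bred by uniform crossover, four rounds). It is only an illustration and not the simulation behind the figures quoted from [54]: the random choice of coupling partner, the uniform random contribution tables, the seed and all function names are assumptions made here.

```python
import random

N = 40                           # bit-string length, as in Box 1
POP, TOP, ROUNDS = 1000, 10, 4   # population size, parents kept, rounds

def make_nk_landscape(rng):
    """Return a fitness function for one random NK landscape with K = 1."""
    # Assumption: each site is coupled to one randomly chosen other site.
    partners = [rng.choice([j for j in range(N) if j != i]) for i in range(N)]
    # Assumption: contributions are uniform random numbers in [0, 1),
    # indexed by the site's own bit and by its partner's bit.
    tables = [[[rng.random(), rng.random()], [rng.random(), rng.random()]]
              for _ in range(N)]

    def fitness(bits):
        total = sum(tables[i][bits[i]][bits[partners[i]]] for i in range(N))
        return total / N         # mean contribution, so fitness is in [0, 1)

    return fitness

def uniform_crossover(a, b, rng):
    """Take each bit from parent a or parent b with equal probability."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]

def evolve(seed=0):
    rng = random.Random(seed)
    fitness = make_nk_landscape(rng)
    population = [[rng.randint(0, 1) for _ in range(N)] for _ in range(POP)]
    best_per_round, evaluated = [], 0
    for round_index in range(ROUNDS):
        ranked = sorted(population, key=fitness, reverse=True)
        evaluated += len(population)
        best_per_round.append(fitness(ranked[0]))
        if round_index < ROUNDS - 1:
            parents = ranked[:TOP]           # breed only the top ten
            population = []
            for _ in range(POP):
                a, b = rng.sample(parents, 2)
                population.append(uniform_crossover(a, b, rng))
    return best_per_round, evaluated

best_per_round, evaluated = evolve()
print(evaluated, [round(f, 3) for f in best_per_round])
```

Four rounds of 1000 variants amount to 4000 fitness evaluations, against 2^40 (about 1.1 × 10^12) possible bit strings.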
Yet a simple genetic algorithm, where the top ten solutions from a population of 1000 variants are bred together with a uniform crossover operator over four rounds of evolution (simulated on 1000 random NK landscapes), yields >98% of the theoretically optimal fitness gain [54]. In other words, a vast sequence space of 10^12 can be effectively searched by examining only 4000 solutions. A common fear among protein engineers is the idea that they might miss a mutation or combination of mutations that could increase function, if only they were clever or exhaustive enough to find them [55]. This kind of thinking is particularly popular in motivating the need for rational enzyme design, where large library sizes are screened in silico under the pretense that such large libraries could never be screened in vivo. But Voltaire’s incisive observation that ‘the best is the enemy of the good’ [56] is worth considering here. The facile example given above demonstrates it might not be necessary or desirable to search for the optimal solution if many acceptable solutions can be obtained efficiently with algorithms that work exceedingly well on the less rugged portions of fitness landscapes [57,58]. In the end, protein engineers must always question what the increased return is (if any) for a given approach and whether it is worth the investment.

Machine-learning guided evolution
In the past decade, protein engineers were content to let the gentle recombination of DNA shuffling [17] serve as the mechanism for sifting through mutations that result in more fit variants. But, as Moore and coworkers observed, there is a statistical preference for the absence of mutation in the progeny when recombining multiple parents [26]. In the simplest example, consider ten parents each carrying one mutation.
The progeny from shuffling these ten parents will have, on average, only a one in ten chance of having any one mutation at a given position and the probability of finding five or more mutations together is less than one in 600. With the advent of semi-synthetic shuffling [27,28] (whereby oligonucleotides carrying specific mutations are spiked into a PCR reaction at a high concentration relative to the parent template), protein engineers have been able to avoid this disadvantage by enabling the free recombination of desired changes, resulting in a higher representation of progeny in the library carrying multiple mutations.

Figure 1. Accelerating impact of biocatalysis. Before 1970 biocatalysis was generally limited to those applications using enzymes found in nature (e.g. chymosin from calf and sheep stomach for the manufacture of cheese). With the advent of gene level random and site-directed mutagenesis in the 1980s, enzyme optimization via rational and directed evolutionary approaches became possible. Recombination-based approaches, such as DNA shuffling, were used to accelerate the process of directed evolution in the 1990s. In the current decade, computational library design coupled with directed evolution has proven to be a useful strategy along with statistical modeling of sequence–function relationships.

Unfortunately, this approach raises another problem: how to decide which mutations are worth recombining? Arbitrarily including all acceptable mutations gives rise to large libraries that might contain only a relatively small fraction of functional, much less improved, enzymes. Conversely, smaller libraries will usually contain more functional enzymes but the ratio of beneficial to deleterious mutations becomes the crucial factor in determining the fraction of improved variants.
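The ten-parent shuffling example can be checked with a short calculation. The sketch below assumes the simplest possible model, which the text implies but does not spell out: the ten mutations sit at ten different positions, and each position in a progeny sequence is inherited independently from a parent chosen uniformly at random, so each mutation is present with probability 1/10. The function name is hypothetical.

```python
from math import comb

def prob_at_least(k_min, n_parents):
    """P(progeny carries >= k_min mutations) when n_parents parents each
    contribute one distinct mutation and every position is inherited
    independently from a uniformly chosen parent (binomial model)."""
    p = 1.0 / n_parents          # chance that a given mutation is inherited
    return sum(
        comb(n_parents, k) * p**k * (1.0 - p) ** (n_parents - k)
        for k in range(k_min, n_parents + 1)
    )

p_five_or_more = prob_at_least(5, 10)
print(f"P(>=5 mutations) = {p_five_or_more:.6f}")   # about 0.001635
```

Under these assumptions the probability is roughly 1 in 612, consistent with the "less than one in 600" figure quoted in the text.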
The answer to identifying which mutations are worth recombining is trivial when a mutation occurs alone in the context of a parent enzyme (as is typical with low dosage random mutagenesis). However, in a combinatorial library a given mutation is almost always seen in the context of other programmed or random changes. The task of inferring the effects of individual changes in combinatorial libraries is ideally suited to the well developed, yet still vibrant, field of machine learning. The problem is akin to the inverse of the drug design problem, where structural and chemical features of the molecules are used to build predictive models that correlate with observed drug activities. These so-called quantitative structure activity relationship (QSAR) drug design models [29] can then be interrogated to determine which features of the small molecule are important for conferring improved function against its protein target (typically the binding of a ligand to a receptor or an inhibitor to an enzyme). The widespread use of QSAR in medicinal chemistry has recently prompted interest in the protein engineering community to use similar techniques to determine which features of an enzyme are important for imparting improved function against a target substrate.

One such enzyme-based optimization algorithm inspired by QSAR methodology is known as the protein sequence-activity relationship (ProSAR) algorithm, the full details of which are described elsewhere [30–32]. Briefly, ProSAR consists of training a statistical model on a set of sequence–function data to classify individual mutations (or elements, patterns and motifs at the DNA or protein level) as beneficial, neutral or deleterious. These classifications are then used to design subsequent libraries, which are enriched for beneficial mutations and purged of deleterious ones (Figure 2).
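The training-and-classification step can be illustrated with a toy calculation. The sketch below is not the ProSAR implementation described in [30–32]; it assumes an additive model of the form y = Σ c_i x_i + y_0 (the form given in Box 2), fits it by ordinary least squares, treats y_0 as a fitted intercept, and labels mutations with an arbitrary cutoff. The data, the cutoff and all function names are hypothetical.

```python
from itertools import product

def solve_linear_system(a, b):
    """Solve a·z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design: effects are not identifiable")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for k in range(col, n + 1):
                    m[r][k] -= factor * m[col][k]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_additive_model(variants, responses):
    """Least-squares fit of y = sum_i c_i x_i + y0 via the normal equations.
    Returns (coefficients, y0)."""
    rows = [list(x) + [1.0] for x in variants]       # last column: intercept
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, responses)) for i in range(p)]
    solution = solve_linear_system(xtx, xty)
    return solution[:-1], solution[-1]

def classify(coefficients, cutoff):
    """Label each mutation by the sign and size of its fitted coefficient."""
    labels = []
    for c in coefficients:
        if c > cutoff:
            labels.append("beneficial")
        elif c < -cutoff:
            labels.append("deleterious")
        else:
            labels.append("neutral")
    return labels

# Hypothetical, noise-free toy library: three mutations, all 8 combinations.
true_effects = [0.5, -0.4, 0.0]
parent_activity = 1.0
variants = list(product([0, 1], repeat=3))
responses = [parent_activity + sum(c * x for c, x in zip(true_effects, v))
             for v in variants]

coefficients, y0 = fit_additive_model(variants, responses)
labels = classify(coefficients, cutoff=0.1)          # cutoff is arbitrary
print(labels)   # ['beneficial', 'deleterious', 'neutral']
```

With real screening data the responses are noisy and the library is only a sparse sample of the possible combinations, so a statistical criterion for significance would be a more defensible basis for classification than a fixed cutoff.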
Typically, an improved sequence is identified at the beginning of every round of evolution and used as a parent template for subsequent combinatorial libraries. As is typically done with QSAR models, the impact of a mutation on function is generally assumed to be additive with respect to other mutations (Box 2) – the overall goal is to establish important trends rather than obtain high resolution details of the local sequence–function landscape. Essentially, the algorithm is used to unmask the effects of individual mutations so that a new parent template can be inspected to see which beneficial mutations it lacks or which deleterious mutations are present – these mutations can then be added to, or removed from, the next round library designs (Figure 2). The process proceeds recursively, adding new diversity discovered from any number of suitable sources.

The kind of sequence–activity modeling used in the ProSAR algorithm has subsequently been adopted by other researchers to accelerate enzyme optimization [33–35]. Such modeling can be conducted in any number of ways that correlate features of a molecule with some desired property or response of interest. Classical machine-learning techniques, such as linear regression (first hit upon by Darwin’s cousin Francis Galton over 100 years ago), can be used when there are relatively few features compared with the number of observations (response measurements). Alternatively, modern techniques such as partial least squares regression [36] can be used when there exists a paucity of observations and many features. The precise algorithms used to make inferences about mutational effects are not coupled in any way to the particular problem of sequence–function modeling, enabling developments in the field of machine learning to be fully leveraged for the task of enzyme engineering should they be made available.

Figure 2. ProSAR driven enzyme evolution. (a) A combinatorial library of enzymes is constructed by stochastically loading a DNA parent template (long horizontal lines) with mutations of unknown or uncertain effects (yellow spheres on structure and black circles on backbone). (b) The library is screened and a subset of functionally diverse variants is sequenced and the data are subjected to statistical analysis [31]. (c) The analysis uncovers the effects of various mutations and classifies them as beneficial (green), deleterious (red) or neutral (white). (d) Typically, the most active variant from one round is selected as a parent template for the next round library design(s). (e) Beneficial mutations are taken forward to the next round. These mutations are incorporated by oligonucleotide-based semi-synthetic shuffling [27,28] (short horizontal lines). Deleterious mutations present in the new parent template are removed by reverting to the previous template’s codon (short lines without circles). (f) As the population of mutations is sifted down, diversity is maintained with the addition of new mutations from any number of suitable sources, including random mutagenesis, homologous enzymes or semi-rational/computational studies.

It is worth noting that, although the goal of the statistical modeling is to create a local map of the sequence–function landscape, the global fitness landscape might be highly degenerate in terms of acceptable solutions. There might be branch points in the search space that lead to different local optima, but as long as the algorithm continues to make progress in the directions of the highest gradients, such degeneracies pose no special problems (Figure 3). In addition to the multidimensional nature of the input space (the mutational diversity), the functional output is often multidimensional as well, where multiple properties such as activity, selectivity, stability and tolerance need to be optimized.
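When a separate model is trained for each property, the per-mutation effects can be combined with a simple rule. The sketch below follows the quadrant scheme of Figure 4 for two properties (activity and selectivity): mutations in the (+,+) quadrant are kept, as are (+,0) and (0,+) mutations. The effect values, the tolerance that defines "neutral" and the function names are all hypothetical.

```python
def sign(effect, tol):
    """Reduce a fitted effect to '+', '-' or '0' (neutral within +/- tol)."""
    if effect > tol:
        return "+"
    if effect < -tol:
        return "-"
    return "0"

def quadrant(activity_effect, selectivity_effect, tol=0.05):
    """Quadrant label in the (activity, selectivity) plane of Figure 4."""
    return (sign(activity_effect, tol), sign(selectivity_effect, tol))

def carry_forward(q):
    """Keep a mutation that helps at least one property and harms none."""
    return "+" in q and "-" not in q

# Hypothetical per-mutation effects: (activity, selectivity).
effects = {
    "A": (0.4, 0.3),    # (+,+) most desirable
    "B": (0.5, 0.0),    # (+,0) acceptable
    "C": (0.6, -0.5),   # (+,-) conflicting, not carried forward
    "D": (0.0, 0.0),    # (0,0) neutral for both
}
selected = [name for name, e in effects.items() if carry_forward(quadrant(*e))]
print(selected)   # ['A', 'B']
```

The same rule extends to more than two properties by adding one sign per property to the tuple.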
Optimization on such multidimensional outputs or properties of interest poses its own special challenges to evolutionary search algorithms [37]; however, the statistical approach of separating the effects of individual mutations has its advantages in these situations as well (Figure 4). To the extent that beneficial mutations can be identified to contribute to various properties independently, they can be freely recombined to improve one or more properties at a time without adversely affecting the others. It is also worth noting that as the number of objectives grows, considerations regarding the screening capacity become increasingly important. Getting the most information out of the fewest number of assays and using that information to design high quality libraries is crucial to reducing the level of effort required to optimize enzymes. The statistical optimization techniques described here are well suited to such situations because only a relatively small number of functionally diverse variants are required to build models for all the objectives. Unlike some traditional directed evolution approaches, deep screening of libraries to find the best variants is not necessary – it is sufficient to identify beneficial mutations and move them forward into the next round, thus emphasizing library quality over quantity to effect improvements in enzyme function [38].

Box 2. Modeling additivity of mutational effects
The statistical models are typically written so that the mutations are assumed to make additive contributions to the property of interest (activity, thermostability, enantioselectivity/specificity, etc.):

$$ y = \sum_{i=1}^{N} c_i x_i + y_0 $$

where $y$ is the predicted response (function or property of interest), $N$ the number of unique mutations, $c_i$ the regression coefficient for mutation $i$, $x_i$ a binary digit indicating the presence or absence (1 or 0) of mutation $i$, and $y_0$ the mean of the measured response. The assumption of rough additivity is supported by several studies that have examined enzymes with multiple mutations [47–53,59]. Although it is straightforward to construct models that include interacting mutations, they usually require significantly more sequence–activity data to prevent the overfitting that leads to poor predictive power [30]. Although the additivity assumption might be violated in practice, there are at least two reasons to suggest why deviations from it do not generally pose significant problems for a search algorithm that does not explicitly include nonlinear effects. First, as noted by Stephanopoulos and coworkers [58], it might be sufficient to simply optimize over regions of the sequence–function landscape that are not overly rugged, where ruggedness implies that only particular combinations of mutations will prove functional or beneficial. Although it is possible that useful combinations might be overlooked by a given search algorithm, as long as acceptable improvements can be obtained with high efficiency it does not matter that ever more optimal solutions could be obtained in principle. Second, nonlinear effects, although not directly present in additive statistical models, might still cast shadows into lower dimensions that can be exploited by a search algorithm. As long as the average contribution to fitness of a given mutation over a variety of contexts is neutral to beneficial it is likely to get incorporated into the next round library. For example, during the evolution of a halohydrin dehalogenase [32], a particular combination of mutations was found to be beneficial: G177A/V178C (as inferred from observation and from statistical models that did explicitly incorporate a cross-product term [30]). An additive model based on the same set of data showed that G177A was, on average, beneficial, whereas V178C was essentially neutral. In these situations the search algorithm will fix beneficial mutations already present in the new parent template (e.g. 177A) while examining statistically neutral mutations that might be conditionally beneficial (e.g. 178C). The next round would then be free to discover the 177A/178C interaction because 177A is fixed and 178C is beneficial in this context. Although particular regions of a fitness landscape might not always lend themselves to this kind of separable attack on pairs of interacting mutations, the above idea demonstrates that use of an additive model can enable progress, even on semi-rugged surfaces, a result similar to that observed by using genetic algorithms that do not explicitly respect interactions (Box 1). Essentially, the search algorithm uses information from the local landscape to climb Mount Improbable [42] by exploring directions with the highest gradients (Figure 3).

The first application of a machine-learning-based approach to create an industrially useful enzyme was demonstrated by evolving a halohydrin dehalogenase to improve the volumetric productivity per unit catalyst loading of a biocatalytic process 4000-fold [32]. During the course of the evolution, the ProSAR algorithm significantly outperformed traditional DNA shuffling formats owing to its ability not only to rapidly identify beneficial mutations in hits, but also to identify such mutations in variants with reduced function that would be overlooked by traditional methods. Recently, the approach was further validated by Arnold and co-workers who used statistical models of sequence–function relationships to improve the half-life of a cytochrome P450 monooxygenase over 100-fold [33].
In another recent example, Gustafsson and co-workers were able to utilize sequence–activity modeling to improve the activity of a proteinase K 20-fold [34]. Despite its recent introduction into the toolbox of protein engineers, the statistical approach to enzyme optimization is already creating opportunities to leverage sequence/structure–function relationships in ways that promise to further accelerate industrial biocatalysis (Box 3).

Figure 3. Climbing Mount Improbable. Enzyme optimization on sequence–function landscapes can be thought of as climbing a mountain of improbability in a high dimensional space [42] (Box 1), where the dimensions consist of mutated positions in an alignment of related variants. Most random solutions are unlikely to lead to fitter variants. Notoriously difficult to visualize in only three dimensions, the image shows a simplified version of such a landscape, with two variable dimensions corresponding to the genotype of a variant and a vertical dimension corresponding to some phenotypic response (function or property) of interest. The statistical modeling constructs a map of the local fitness landscape around the point p. The map is then used to explore directions with the highest gradients (a,b,c). Because the effects of mutations are evaluated around the local fitness landscape, greedy extrapolation along the single highest gradient (synthesizing just one or a few of the best predicted variants) might result in the recombination of conflicting mutations that lead to variants with suboptimal or reduced function, envisaged by the saddle point on the mountain pass in the direction of b or the valley on the other side. However, because the algorithm stochastically explores a variety of higher gradient directions, it protects against this possibility and superior solutions (a,c) can usually be obtained provided the landscape is not overly rugged.

Although the current methods for machine-learning guided evolution have shown promise, researchers have only begun to explore their limitations and future possibilities. For example, the current methodology is concerned with identifying the effects of existing mutations, but what about mutations that have not been observed yet? Can statistical models based on features derived from in silico enzyme models be used to predict their effects? Such a hybrid strategy between statistical modeling and computational design could be used to help address the crucial task of diversity generation [39,40]. Another open question is to what extent does context play a role in the predicted effects of mutations? Although the models generally assume an additive, linear contribution for each mutation (Box 2), the assumption is likely to break down as the enzyme becomes heavily mutated [41]. Answers to these questions (which will probably come in the form of trends rather than hard and fast rules) will hopefully serve to further the goal of obtaining ever more efficient enzyme optimization strategies, as well as shed light on the mechanistic principles of enzymatic catalysis.

Box 3. New opportunities
The machine-learning-based approach to protein engineering is already opening up new frontiers in biocatalysis. For example, information derived from different optimization programs using enzymes of known sequence as a starting point can be used to create a panel of related, but functionally diverse enzymes that, as a population, are capable of performing a given reaction on a wide variety of substrates. Based on the information obtained from the statistical models, each individual in this panel can be pre-tuned to be stable to chemical process conditions as well as manufacturable at large-scale. These enzyme panels can then be used to rapidly identify starting activities along with beneficial diversity for new substrates for which enzymes have not been available before, as well as provide excellent starting points for new evolution projects at the same time. Because early stage drug development requires rapid evaluation of chemo- and biocatalytic routes (on the order of days), rapid identification of enzyme variants with sufficient starting activity and enantioselectivity using such panels is extremely valuable [4]. This enzyme panel concept leads us to believe that the impact of biocatalysis (Figure 1) will continue to increase in the near future.

Figure 4. Multiobjective optimization. Simultaneous optimization on multiple objectives using traditional recombination methods is generally hampered by the presence of conflicting mutations for different properties. For example, enzymes performing the enantioselective reduction of a ketone might have increased selectivity but reduced activity or vice versa. In these situations, it is not generally clear which variants to recombine or which property should take precedence. Conversely, because statistical models are trained on each property of interest, they can be used to infer the effects of individual mutations. The graph shows the results of such modeling, where the effect of each mutation is plotted for each property of interest. The four quadrants correspond to combinations of beneficial(+)/deleterious(−) mutations on activity/selectivity. Mutations in the (activity,selectivity) = (+,+) quadrant are the most desirable to take forward into the next round of evolution, but mutations conferring increased function for at least one property that do not adversely affect the other properties are also good candidates; for example, (+,0) or (0,+) mutations.

Concluding remarks
Biocatalysis is quickly becoming a real option for pharmaceutical and bioindustrial manufacturing.
In the past, efforts to create beneficial diversity have played a crucial role in improving enzyme performance, but the search algorithm used to actually sift through that diversity has, until recently, received little attention. Well-established statistical methods used in disparate fields from finance to engineering are ideally suited to learning from the kind of high-dimensional data produced in enzyme engineering experiments. Learning about the local sequence–function landscape has now been applied to accelerate the optimization process beyond that which was possible with blind algorithms such as classical DNA shuffling. With this development, we are confident that advanced evolution methods will enable biocatalysis to fulfill its long-held promise to create numerous, commercially attractive, enzymatic processes for next-generation fuels, industrial chemicals and pharmaceuticals.

Acknowledgements
We thank Petra Gross for inviting us to write on this topic and three anonymous reviewers for constructive comments and suggestions. We are especially thankful to Lori Giver, Michael Clay, John Grate, Jim Lalonde, Russell Sarmiento and Jennifer Jones for their critical review of the manuscript, and for helpful advice and discussions.

References
1 Grate, J. (2006) Directed Evolution of Three Biocatalysts to Produce the Key Chiral Building Block for Atorvastatin, the Active Ingredient in Lipitor®. 2006 Presidential Green Chemistry Challenge Award: Greener Reaction Conditions Award. United States Environmental Protection Agency, Washington, D.C., June 26–30 (http://www.epa.gov/gcc/pubs/pgcc/winners/grca06.html)
2 Schoemaker, H.E. et al. (2003) Dispelling the myths—biocatalysis in industrial synthesis. Science 299, 1694–1697
3 Thayer, A. (2006) Enzymes at work. Chem. Eng. News 84, 15–25
4 Pollard, D.J. and Woodley, J.M. (2007) Biocatalysis for pharmaceutical intermediates: the future is now. Trends Biotechnol. 25, 66–73
5 Dwyer, M.A. et al. (2004) Computational design of a biologically active enzyme. Science 304, 1967–1971
6 Park, H.S. et al. (2006) Design and evolution of new catalytic activity with an existing protein scaffold. Science 311, 535–538
7 Robertson, M.P. and Scott, W.G. (2007) Biochemistry: designer enzymes. Nature 448, 757–758
8 Tawfik, D.S. (2006) Biochemistry. Loop grafting and the origins of enzyme species. Science 311, 475–476
9 Castle, L.A. et al. (2004) Discovery and directed evolution of a glyphosate tolerance gene. Science 304, 1151–1154
10 Crameri, A. et al. (1998) DNA shuffling of a family of genes from diverse species accelerates directed evolution. Nature 391, 288–291
11 Chaparro-Riggers, J.F. et al. (2007) Better library design: data-driven protein engineering. Biotechnol. J. 2, 180–191
12 Chica, R.A. et al. (2005) Semi-rational approaches to engineering enzyme activity: combining the benefits of directed evolution and rational design. Curr. Opin. Biotechnol. 16, 378–384
13 Morley, K.L. and Kazlauskas, R.J. (2005) Improving enzyme properties: when are closer mutations better? Trends Biotechnol. 23, 231–237
14 Reetz, M.T. et al. (2006) Directed evolution of enantioselective enzymes: iterative cycles of CASTing for probing protein-sequence space. Angew. Chem. Int. Ed. Engl. 45, 1236–1241
15 Siehl, D.L. et al. (2007) The molecular basis of glyphosate resistance by an optimized microbial acetyltransferase. J. Biol. Chem. 282, 11446–11455
16 Rubin-Pitel, S.B. and Zhao, H. (2006) Recent advances in biocatalysis by directed enzyme evolution. Comb. Chem. High Throughput Screen. 9, 247–257
17 Stemmer, W.P. (1994) Rapid evolution of a protein in vitro by DNA shuffling. Nature 370, 389–391
18 Voigt, C.A. et al. (2001) Computational method to reduce the search space for directed protein evolution. Proc. Natl. Acad. Sci. U. S. A. 98, 3778–3783
19 Treynor, T.P. et al. (2007) Computationally designed libraries of fluorescent proteins evaluated by preservation and diversity of function. Proc. Natl. Acad. Sci. U. S. A. 104, 48–53
20 Darwin, C. (1856) Letter to J.D. Hooker, 13 July (http://www.darwinproject.ac.uk/darwinletters/calendar/entry-1924.html)
21 Giver, L. and Arnold, F.H. (1998) Combinatorial protein design by in vitro recombination. Curr. Opin. Chem. Biol. 2, 335–338
22 Yuan, L. et al. (2005) Laboratory-directed protein evolution. Microbiol. Mol. Biol. Rev. 69, 373–392
23 Huisman, G.W. and Lalonde, J.J. (2006) Enzyme evolution for chemical process applications. In Biocatalysis in the Pharmaceutical and Biotechnology Industries (Patel, R.N., ed.), pp. 717–742, CRC Press
24 Trefzer, A. et al. (2007) Biocatalytic conversion of avermectin to 4″-oxo-avermectin: improvement of cytochrome p450 monooxygenase specificity by directed evolution. Appl. Environ. Microbiol. 73, 4317–4325
25 Wong, T.S. et al. (2007) Steering directed protein evolution: strategies to manage combinatorial complexity of mutant libraries. Environ. Microbiol. 9, 2645–2659
26 Moore, J.C. et al. (1997) Strategies for the in vitro evolution of protein function: enzyme evolution by random recombination of improved sequences. J. Mol. Biol. 272, 336–347
27 Stemmer, W.P. (1994) DNA shuffling by random fragmentation and reassembly: in vitro recombination for molecular evolution. Proc. Natl. Acad. Sci. U. S. A. 91, 10747–10751
28 Stutzman-Engwall, K. et al. (2005) Semi-synthetic DNA shuffling of aveC leads to improved industrial scale production of doramectin by Streptomyces avermitilis. Metab. Eng. 7, 27–37
29 Kubinyi, H. (1997) QSAR and 3D QSAR in drug design Part 1: methodology. Drug Discov. Today 2, 457–467
30 Fox, R. (2005) Directed molecular evolution by machine learning and the influence of nonlinear interactions. J. Theor. Biol. 234, 187–199
31 Fox, R. et al. (2003) Optimizing the search algorithm for protein engineering by directed evolution. Protein Eng. 16, 589–597
32 Fox, R.J. et al. (2007) Improving catalytic function by ProSAR-driven enzyme evolution. Nat. Biotechnol. 25, 338–344
33 Li, Y. et al. (2007) A diverse family of thermostable cytochrome P450s created by recombination of stabilizing fragments. Nat. Biotechnol. 25, 1051–1056
34 Liao, J. et al. (2007) Engineering proteinase K using machine learning and synthetic genes. BMC Biotechnol. 7, 16
35 Gustafsson, C. et al. (2003) Putting the engineering back into protein engineering: bioinformatic approaches to catalyst design. Curr. Opin. Biotechnol. 14, 366–370
36 de Jong, S. (1993) SIMPLS: an alternative approach to partial least squares regression. Chemometr. Intell. Lab. Syst. 18, 251–263
37 Deb, K. (1999) Multi-objective genetic algorithms: problem difficulties and construction of test problems. Evol. Comput. 7, 205–230
38 Lutz, S. and Patrick, W.M. (2004) Novel methods for directed evolution of enzymes: quality, not quantity. Curr. Opin. Biotechnol. 15, 291–297
39 Lushington, G.H. et al. (2007) Whither combine? New opportunities for receptor-based QSAR. Curr. Med. Chem. 14, 1863–1877
40 Masso, M. and Vaisman, I.I. (2007) Accurate prediction of enzyme mutant activity based on a multibody statistical potential. Bioinformatics 23, 3155–3161
41 Hayashi, Y. et al. (2006) Experimental rugged fitness landscape in protein sequence space. PLoS ONE 1, e96
42 Dawkins, R. (1996) Climbing Mount Improbable, W.W. Norton & Company
43 Edwards, A.W. (2000) The genetical theory of natural selection. Genetics 154, 1419–1426
44 Arkin, A.P. and Youvan, D.C. (1992) An algorithm for protein engineering: simulations of recursive ensemble mutagenesis. Proc. Natl. Acad. Sci. U. S. A. 89, 7811–7815
45 Youvan, D.C. (1995) Searching sequence space. Biotechnology 13, 722–723
46 Kauffman, S. (1993) The Origins of Order, Oxford University Press
47 Aita, T. et al. (2002) Surveying a local fitness landscape of a protein with epistatic sites for the study of directed evolution. Biopolymers 64, 95–105
48 Aita, T. et al. (2001) A cross-section of the fitness landscape of dihydrofolate reductase. Protein Eng. 14, 633–638
49 Benos, P.V. et al. (2002) Additivity in protein-DNA interactions: how good an approximation is it? Nucleic Acids Res. 30, 4442–4451
50 Lu, S.M. et al. (2001) Predicting the reactivity of proteins from their sequence alone: Kazal family of protein inhibitors of serine proteinases. Proc. Natl. Acad. Sci. U. S. A. 98, 1410–1415
51 Sandberg, W.S. and Terwilliger, T.C. (1993) Engineering multiple properties of a protein by combinatorial mutagenesis. Proc. Natl. Acad. Sci. U. S. A. 90, 8367–8371
52 Vajdos, F.F. et al. (2002) Comprehensive function maps of the antigen-binding site of an anti-ErbB2 antibody obtained with shotgun scanning mutagenesis. J. Mol. Biol. 320, 415–428
53 Wells, J.A. (1990) Additivity of mutational effects in proteins. Biochemistry 29, 8509–8517
54 Weinberger, E.D. (1996) NP completeness of Kauffman’s NK model, a tunably rugged fitness landscape. Santa Fe Institute T.R. 96-02-003
55 Kazlauskas, R. (2005) Biological chemistry: enzymes in focus. Nature 436, 1096–1097
56 Voltaire (1772) La Bégueule, Œuvres Complètes de Voltaire, Garnier Frères
57 Kauffman, S. (2000) Prolegomenon to a general biology. In Investigations, pp. 1–22, Oxford University Press
58 Styczynski, M.P. et al. (2006) The intelligent design of evolution. Mol. Syst. Biol. 2, 2006.0020
59 Yoshikuni, Y. et al. (2006) Designed divergent evolution of enzyme function. Nature 440, 1078–1082

Trends in Biotechnology Vol.26 No.3
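
Illustrative sketch. The multi-objective analysis described in the box on conflicting mutations (one additive, linear model per property, followed by sorting mutations into the (activity, selectivity) quadrants) might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: ordinary least squares stands in for the partial least squares regression cited in the reference list [36], the mutation names and effect sizes are invented, noise-free synthetic data, and the tolerance `tol` that defines a neutral (0) effect is an arbitrary choice.

```python
import itertools

import numpy as np


def fit_additive_effects(X, y):
    """Least-squares fit of y ~ b0 + X @ b; returns the per-mutation terms b."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])  # intercept column first
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1:]  # drop the intercept (the unmutated backbone)


def sign_label(effect, tol):
    """Map a fitted effect to '+', '-' or '0' using a neutral band of +/- tol."""
    if effect > tol:
        return "+"
    if effect < -tol:
        return "-"
    return "0"


def classify_mutations(names, X, activity, selectivity, tol=0.05):
    """Fit one model per property and return {mutation: (activity, selectivity)}."""
    act = fit_additive_effects(X, activity)
    sel = fit_additive_effects(X, selectivity)
    return {
        name: (sign_label(a, tol), sign_label(s, tol))
        for name, a, s in zip(names, act, sel)
    }


def candidates(labels):
    """Mutations that help at least one property and hurt none."""
    return [
        name
        for name, (a, s) in labels.items()
        if "+" in (a, s) and "-" not in (a, s)
    ]


# Hypothetical mutations and synthetic, noise-free effects (assumptions
# made for illustration only; none of these values come from the article).
names = ["A10V", "G45S", "L77F", "K120R"]
true_activity = np.array([0.5, 0.4, -0.3, 0.0])
true_selectivity = np.array([0.3, -0.2, 0.0, 0.25])

# Each row is one variant; a 1 means the mutation is present. All 16
# combinations of the four mutations are used so the fit is well determined.
X = np.array(list(itertools.product([0, 1], repeat=len(names))))
activity = 1.0 + X @ true_activity
selectivity = 1.0 + X @ true_selectivity

labels = classify_mutations(names, X, activity, selectivity)
picks = candidates(labels)
print(labels)
print(picks)
```

With real screening data the fitted coefficients would be noisy, so both the width of the neutral band and the additivity assumption itself would need to be checked, for example by cross-validation, before choosing which mutations to carry into the next round.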