Vol 443 | 19 October 2006 | doi:10.1038/nature05233 LETTERS A linguistic model for the rational design of antimicrobial peptides Christopher Loose1*, Kyle Jensen1,2,3*, Isidore Rigoutsos1,4 & Gregory Stephanopoulos1 Antimicrobial peptides (AmPs) are small proteins that are used by the innate immune system to combat bacterial infection in multicellular eukaryotes1. There is mounting evidence that these peptides are less susceptible to bacterial resistance than traditional antibiotics and could form the basis for a new class of therapeutic agents2. Here we report the rational design of new AmPs that show limited homology to naturally occurring proteins but have strong bacteriostatic activity against several species of bacteria, including Staphylococcus aureus and Bacillus anthracis. These peptides were designed using a linguistic model of natural AmPs: we treated the amino-acid sequences of natural AmPs as a formal language and built a set of regular grammars to describe this language. We used this set of grammars to create new, unnatural AmP sequences. Our peptides conform to the formal syntax of natural antimicrobial peptides but populate a previously unexplored region of protein sequence space. AmPs are ubiquitous among multicellular eukaryotes and are found in diverse contexts including frog skin, scorpion venom and human sweat. Recent studies show that some AmPs are active against pathogens that are resistant to traditional antibiotics such as penicillin, tetracycline and vancomycin3. In addition to their antibiotic uses, AmPs might have other interesting clinical applications: they act as adjuvants for the adaptive immune system4 and might be useful in treating certain cancers5. The many disease-relevant actions of AmPs are a consequence of their broad ability to distinguish eukaryotic cells from pathogenic invaders. In general, AmPs have a net positive charge and an amphipathic three-dimensional structure that gives the peptides an electrostatic affinity to the outer leaflet of the microbial membrane6. This affinity leads to binding, disruption of the membrane and, ultimately, microbial cell death7. Our preliminary studies of natural AmPs indicated that their amphipathic structure gives rise to a modularity among the different AmP amino-acid sequences. The repeated usage of sequence modules—which might be a relic of evolutionary divergence and radiation—is reminiscent of phrases in a natural language, such as English. For example, the pattern QxEAGxLxKxxK (where ‘x’ is any amino acid) is found in more than 90% of the insect AmPs known as cecropins. On the basis of this observation, we modelled the AmP sequences as a formal language—a set of sentences using words from a fixed vocabulary. In this case, the vocabulary is the set of naturally occurring amino acids, represented by their one-letter symbols8. We conjectured that the ‘language of AmPs’ could be described by a set of regular grammars. Regular grammars are, in essence, simple rules that describe the allowed arrangements of words. These grammars, such as the cecropin pattern mentioned previously, are commonly written as regular expressions and are widely used to describe patterns in nucleotide and amino-acid sequences9,10. To find a set of regular grammars to describe AmPs we used the Teiresias pattern discovery tool11. With Teiresias, we derived a set of 684 regular grammars that occur commonly in 526 well-characterized eukaryotic AmP sequences from the Antimicrobial Peptide Database (APD)12 (see Methods). Together, these ,700 grammars describe the ‘language’ of the AmP sequences. In this linguistic metaphor, the peptide sequences are analogous to sentences and the individual amino acids are analogous to the words in a sentence. Each grammar describes a common arrangement of amino acids, similar to popular phrases in English. For example, the frog AmP brevinin1E contains the amino-acid sequence fragment PKIFCKITRK, which matches the grammar P[KAYS][ILN][FGI]C[KPSA][IV][TS][RKC] [KR] from our database (the bracketed expression [KAYS] indicates that, at the second position in the grammar, lysine, alanine, tyrosine or serine is equally acceptable). On the basis of this match, we would say that the brevinin-1E fragment is ‘grammatical’. By design, each grammar in this set of ,700 grammars is ten amino-acids long and is specific to AmPs—at least 80% of the matches for each grammar in Swiss-Prot/TrEMBL13 (the APD is a subset of Swiss-Prot/TrEMBL) are found in peptides annotated as AmPs. To design unnatural AmPs, we combinatorially enumerated all grammatical sequences of length twenty (see Methods) in which each window of size ten in the 20-mers was matched by one of the ,700 grammars. We chose this length because we could easily chemically synthesize 20-mers and this length is close to the median length of AmPs in the APD. From this set, we removed any 20-mers that had six or more amino acids in a row in common with a naturally occurring AmP; we then clustered the remaining sequences on the basis of similarity, choosing 42 to synthesize and assay for antimicrobial activity. This process is illustrated in Fig. 1 and the designed sequences are shown as Supplementary Table 1. For each of the 42 synthetic peptides, we also designed a shuffled sequence, in which the order of the amino acids was rearranged randomly so that the sequence did not match any grammars. These shuffled peptides had the same amino-acid composition as their synthetic counterparts and thus, the same molecular weight, charge and isoelectric point (bulk physiochemical factors that are often correlated with antimicrobial activity). We hypothesized that because the shuffled sequences were ‘ungrammatical’ they would have no antimicrobial activity, despite having the same bulk physiochemical characteristics. In addition, we selected eight peptides from the APD as positive controls and six 20-mers from non-antimicrobial proteins as negative controls. A complete list of designed and shuffled sequences, along with controls, is provided in the Supplementary Information. 1 Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA. 2Harvard–-MIT Health Sciences and Technology, Cambridge, Massachusetts 02139, USA. 3Agrivida, 411 Massachusetts Ave B1, Cambridge, Massachusetts 02139, USA. 4IBM Research Division, Thomas J.Watson Research Center, Yorktown Heights, New York 10598, USA. *These authors contributed equally to this work. 867 ©2006 Nature Publishing Group LETTERS NATURE | Vol 443 | 19 October 2006 Natural AmPs ~700 grammars K[QR]QTHA R[LM]GAG RLLM[RQ]QTHA K[QR]QTHA K[QR]QTHA K[QR]Q QRT Designed sequences of overlapping grammars T T T I A L L LL I LL G G ST P P Scored population of designed sequences Experimentally tested designed sequences Grammar 1 MKSTGPLLSALLLAVTAGGDP Grammar 2 S P A G I I F V G FL LA A L VAA S L D T Q Filtered to reduce redundacy and similarity to natural AmPs Figure 1 | A schematic of the in silico peptide design strategy. Grammars are induced from the set of natural AmP sequences using Teiresias. Overlapping grammars are stitched together to create new 20-amino-acid sequences that correspond to the antimicrobial syntax. Designed sequences are scored on the basis of the prevalence of the underlying grammar in natural peptides. After removing designs with six or more consecutive amino acids in common with a natural peptide, a representative set of sequences is selected for experimental validation. We characterized the activity of each synthetic AmP using a broth microdilution assay described elsewhere14. This assay measures the minimum inhibitory concentration (MIC) at which the peptide inhibits growth of the target organism. Table 1 shows the MICs of synthetic peptides against Bacillus cereus and Escherichia coli, as representative gram-positive and gram-negative bacteria. Of the 40 soluble (two of the designed and four of the shuffled peptides were insoluble) designed peptides, 18 had an MIC of 256 mg ml–1 or less against at least one of the bacterial targets. Only two of the soluble, shuffled peptides showed antibacterial activity. Thus, the activity is not an artefact of molecular weight, charge or isoelectric point. Of the negative controls—six peptides randomly selected from the middle of non-antimicrobial proteins from Swiss-Prot/TrEMBL— none had antimicrobial activity, whereas six of the eight naturallyoccurring AmPs in the positive control group did. To validate MIC determinations, we measured optical density at varying concentrations of a representative set of designed, shuffled, and natural peptides. Supplementary Fig. 1 shows that peptides inhibit growth at the MICs determined by visual inspection. In addition, plating of these samples confirmed that these peptide-treated bacterial populations were generally not viable at concentrations of peptides above the MIC. Two of the designed peptides, D28 (FLGVVFKLASKVFPAVFGKV) and D51 (FLFRVASKVFPALIGKFKKK), inhibited B. cereus growth at 16 mg ml–1, which is close to the MICs of the strong positive controls melittin and cecropin-melittin hybrid (8 mg ml–1). Peptides with activity against gram-positive bacteria are particularly exciting because of the prevalence of drug-resistant nosocomial S. aureus and the threat of bioterror agents such as B. anthracis, or anthrax. Therefore, we assayed seven designed peptides that had such activity, including the highly active D28 and D51 peptides, against the Smith diffuse strain of S. aureus and the Sterne strain of B. anthracis. As shown in Table 2, all seven peptides were active against both bacteria at 256 mg ml–1 or less, whereas only one of the seven shuffled controls was active (the only active shuffled peptide had MICs of 128 mg ml–1 against S. aureus and 256 mg ml–1 against B. anthracis). Moreover, two designed peptides, D28 and D51, had MICs of 16 mg ml–1 against Bacillus anthracis, which is equivalent to the activity of Cecropinmelittin hybrid, a strong natural AmP. D28 also had an MIC of 8 mg ml–1 against S. aureus. In an attempt to improve the antimicrobial activity of our designed peptides, we optimized our best candidate, peptide D28, using a heuristic approach. We created 44 variants of D28 by introducing mutations that were selected to increase positive charge, increase hydrophobicity, remove an interior proline residue, or improve segregation of positive and hydrophobic residues based on a helical projection. As shown in Supplementary Table 2, 18 of the 44 D28 variants showed improved activity against E. coli, B. cereus or S. aureus. Many of the D28 variants with improved activity against B. cereus included a mutation at an internal proline, either to lysine or glycine. D28 and six of its variants were assayed for bactericidal activity, and all had activity within a twofold dilution of their MIC. One variant (R8) had MICs of 16 mg ml–1 against E. coli, 8 mg ml–1 against B. cereus and 4 mg ml–1 against S. aureus (relative to 64, 16 and 8 mg ml–1, respectively, for D28). Our linguistic approach to designing synthetic AmPs might be successful because of the pronounced modular nature of natural AmP amino-acid sequences. As we have shown, this approach can be used to expand the AmP sequence space rationally without using structure–activity information or complex simulations of the interactions of a peptide with a membrane. The peptides that we designed are different from previously designed synthetic AmPs15 in that they bear limited homology to any known protein, which might be desirable for AmPs used in clinical settings. Some researchers argue that widespread clinical use of AmPs that are too similar to human AmPs will inevitably elicit bacterial resistance, compromising our own natural defences and posing a threat to public health16. In addition, using our approach to develop an arsenal of diverse antimicrobial agents would further reduce concerns about the development of antimicrobial resistance. We hope that this approach will help to expand the diversity of known AmPs well beyond those found in nature, possibly leading to new candidates for AmP-based antibiotic therapeutics. Our designed AmPs show some degree of homology with natural AmPs because the grammars are based on native sequences. Peptide D28, for example, was matched by grammars derived from 11 natural AmPs, including brevinin, temporin and ponericin. However, Smith-Waterman alignments of our designed peptides against all natural AmPs in the Swiss-Prot/TrEMBL database reveal that the degree of homology is, by design (see Methods), limited. In particular, our two most active peptides, D51 and D28, have only 50 and 60% sequence identity, respectively, with the nearest natural AmP. Our linguistic design approach might be most valuable as a method for rationally constraining a sequence-based search for new AmPs. Diverse leads generated by our algorithms could be optimized using approaches described in the literature17. But the Table 1 | MICs of peptides against bacterial targets 8 16 32 64 128 Class Designed Shuffled Natural AmPs E. coli (gram-negative) MIC MIC #256 mg ml–1 #64 mg ml–1 16/40 1/38 6/8 4/40 0/38 6/8 B. cereus E. coli or B. cereus (gram-positive) MIC MIC MIC #256 mg ml–1 #64 mg ml–1 #256 mg ml–1 8/40 2/38 4/8 4/40 1/38 3/8 18/40 2/38 6/8 Table 2 | MICs of peptides against S. aureus and B. anthracis MIC (mg ml–1) 256 .256 S. aureus B. anthracis D28 D51 D28, D51 D22 D63, S51 D5, D35, D43 S5, S22, S28 S35, S43, S63 D, designed on the basis of motifs; S, shuffled control. 868 ©2006 Nature Publishing Group D22 D5, D35, D43, D63 S51 S5, S22, S28, S35, S43, S63 LETTERS NATURE | Vol 443 | 19 October 2006 linguistic approach described here has a number of limitations. First, sequence families that are poorly conserved on an amino-acid level would not benefit from this approach. Second, we suspect that the small size of AmPs is helpful. Owing to the simple nature of regular grammars, they would be less useful for designing larger proteins and, in particular, proteins with complex tertiary or quaternary structures. 3. 4. 5. 6. 7. 8. METHODS Computational design of unnatural AmPs. To design unnatural AmPs, we combinatorially enumerated all grammatical sequences based on the set of ,700 grammars (see Supplementary Materials, Methods, Derivation of grammars). First, for each grammar, we wrote out all possible grammatical aminoacid sequences. So, for example, the grammar [IVL]K[TEGDK]V[GA]K [AELNH][VA][GA]K produced 600 sequences, where 3 3 5 3 2 3 5 3 2 3 2 5 600, owing to the option of choosing one of many amino acids at each bracketed position. There are roughly 3 million such 10-mers that correspond to antimicrobial patterns. Then we wrote out all possible 20-amino-acid sequences for which each window of 10 amino acids is found in the set of 3 million 10-mers. From this set, we removed any 20-mers that had six or more amino acids in a row in common with a naturally occurring AmP. There are roughly 12 million such 20-mers, each of which is a ‘tiling’ of ten 10-mers. We clustered these 12 million sequences using the Mcd-hit software18 at 70% identity. From these clusters, we chose 42 high scoring sequences to test experimentally (see Supplementary Materials, Methods, Scoring of sequences)19. These sequences have varying degrees of similarity to naturally occurring AmPs, as determined by sequence alignment and discussed in the text. Peptide synthesis and MIC determination. Fmoc (fluorenylmethoxycarbonyl) chemistry was used to synthesize peptides on the Intavis Multipep Synthesizer (Intavis LLC) at the MIT Biopolymers Lab, and confirmed using MALDI-TOF Mass Spectrometry and recombinantly produced AmP standards. The MIC was measured with an assay based on the NCCLS M26A and the Hancock assay for cationic peptides17. See Methods in Supplementary Materials for details on synthesis and antimicrobial assay. Received 1 May; accepted 4 September 2006. 1. 2. Zasloff, M. Antimicrobial peptides of multicellular organisms. Nature 415, 389–-395 (2002). Hancock, R. E. & Patrzykat, A. Clinical development of cationic antimicrobial peptides: from natural to novel antibiotics. Curr. Drug Targets Infect. Disord. 2, 79–-83 (2002). 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. Tiozzo, E., Rocco, G., Tossi, A. & Romeo, D. Wide-spectrum antibiotic activity of synthetic, amphipathic peptides. Biochem. Biophys. Res. Commun. 249, 202–-206 (1998). Biragyn, A. et al. Toll-like receptor 4-dependent activation of dendritic cells by b-defensin 2. Science 298, 1025–-1029 (2002). Ellerby, H. M. et al. Anti-cancer activity of targeted pro-apoptotic peptides. Nature Med. 5, 1032–-1038 (1999). Giangaspero, A., Sandri, L. & Tossi, A. Amphipathic a helical antimicrobial peptides. Eur. J. Biochem. 268, 5589–-5600 (2001). Shai, Y. Mode of action of membrane active antimicrobial peptides. Biopolymers 66, 236–-248 (2002). Jurafsky, D. & Martin, J. H. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (Prentice Hall, Upper Saddle River, New Jersey, 2000). Searls, D. B. The language of genes. Nature 420, 211–-217 (2002). Hofmann, K., Bucher, P., Falquet, L. & Bairoch, A. The PROSITE database, its status in 1999. Nucleic Acids Res. 27, 215–-219 (1999). Rigoutsos, I. & Floratos, A. Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm. Bioinformatics 14, 55–-67 (1998). Wang, Z. & Wang, G. APD: the Antimicrobial Peptide Database. Nucleic Acids Res. 32, D590–-D592 (2004). Bairoch, A. & Apweiler, R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28, 45–-48 (2000). Wu, M. & Hancock, R. E. W. Interaction of the cyclic antimicrobial cationic peptide bactenecin with the outer and cytoplasmic membrane. J. Biol. Chem. 274, 29–-35 (1999). Tossi, A., Sandri, L. & Giangaspero, A. Amphipathic, a-helical antimicrobial peptides. Biopolymers 55, 4–-30 (2000). Bell, G. & Gouyon, P-H. Arming the enemy: the evolution of resistance to selfproteins. Microbiology 149, 1367–-1375 (2003). Hilpert, K., Volkmer-Engert, R., Walter, J. & Hancock, R. E. W. High-throughput generation of small antibacterial peptides with improved activity. Nature Biotechnol. 23, 1008–-1012 (2005). Li, W., Jaroszewski, L. & Godzik, A. Clustering of highly homologous sequences to reduce the size of large protein database. Bioinformatics 17, 282–-283 (2001). Maizel, J. V. & Lenk, R. P. Enhanced graphic matrix analysis of nucleic acid and protein sequences. Proc. Natl Acad. Sci. USA 78, 7665–-7669 (1981). Supplementary Information is linked to the online version of the paper at www.nature.com/nature. Acknowledgements The authors would like to thank M. Zasloff, K. D. Wittrup, R. Berwick, and G. Georgiou for valuable input on the draft manuscript, and J. Moxley for figure preparation. The authors gratefully acknowledge the support of the Singapore-MIT Alliance, the NIH, and the Fannie and John Hertz Foundation. Author Information Reprints and permissions information is available at www.nature.com/reprints. The authors declare no competing financial interests. Correspondence and requests for materials should be addressed to G.S. ([email protected]). 869 ©2006 Nature Publishing Group
© Copyright 2026 Paperzz