Causes and Effects of N-Terminal Codon Bias in Bacterial Genes
Mikk Eelmets
Journal Club 21.02.2014

Introduction
• Ribosomes were first observed in the mid-1950s (Nobel Prize in 1974)
• Nobel Prize in 2009 for determining the detailed structure and mechanism of the ribosome

There are still many questions without an answer
• Many organisms are enriched for poorly adapted codons at the N terminus of their genes. Why?
• Which mechanisms are causal for expression changes?

Strategies: study genome-wide sequence correlations, study conservation among species, use synthetic genes

Genetic code

Design of experiment
FlowSeq
Sharon E et al. Nat Biotechnol. 2012 May 20;30(6):521-30

[...] grew the entire pooled library to early exponential phase and performed multiplexed measurements of the steady-state DNA, RNA, and protein levels. We used sequencing, DNASeq and RNASeq, to obtain steady-state DNA and RNA levels, respectively, across the library (12). For obtaining protein levels, we used FlowSeq, which combines fluorescence-activated cell sorting and high-throughput DNA sequencing and is similar in design to recently published work (34, 35). Briefly, we sorted cells into 12 log-spaced bins of varying GFP/mCherry ratios; isolated, amplified, and barcoded DNA from each of the bins; and then used high-throughput sequencing to count the number of constructs that fell into each bin (Fig. 1 A and D). Using the read counts from each of the bins, we reconstructed the average [...] and S5). RNA level calculations also showed high concordance between technical replicates (R² = 0.995; Fig. S5). Overall, RNA levels varied by three orders of magnitude, but within a single promoter, the coefficient of variation was only 0.63 (Fig. 2, Left and Fig. S6). RNASeq data also allowed us to identify dominant transcriptional start sites for most promoters (Fig. S7). Eighty-seven percent of all promoters had one dominant start position (>60% of all mapped reads). Two promoters (marked with asterisks in Fig. S7) had very few uniquely mapping reads, did not show a strong start site, and showed unrealistic translation efficiency calculations. These observations indicated that we were missing most of the RNA (but not protein) reads from these promoters, possibly because of transcription starting after [...]

Design of experiment
FlowSeq
[Figure: FlowSeq workflow. 114 promoters + 111 ribosome binding sites; library of 12,653 synthesized constructs; transform, DNAseq & RNAseq; sort on sfGFP/mCherry ratio; barcode & sequence; estimate protein levels for each construct.]
Kosuri S et al. Proc Natl Acad Sci U S A. 2013 Aug 20;110(34):14024-9

Design of experiment
Expression System and Library Design
Fig. S1. Diagram of the Expression System and Library Design. Each library construct contained a promoter, RBS, and 11-codon N-terminal peptide (including the initiating 'ATG'). The peptide sequences correspond to the N terminus of 137 natural E. coli genes. For each promoter/RBS/peptide combination, we encoded 13 codon variants including the most common codons (C), most rare codons (R), wild-type sequence (wt), and codon variants with variable secondary structures (ΔG). The library was cloned in-frame with superfolder GFP (sfGFP). The GFP expression level is compared via relative fluorescence to a constitutively co-expressed mCherry protein.
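The FlowSeq readout described earlier (cells sorted into 12 log-spaced bins, then read counts per bin for each construct) can be turned into a single expression estimate in several ways. The sketch below is one simple possibility, not the published pipeline: it takes a read-weighted geometric mean of the bin centres. The function name, the bin centre values, the optional rescaling by sorted cells per bin, and the example counts are all assumptions made for illustration.

```python
import math


def flowseq_mean_ratio(read_counts, bin_centers, cells_per_bin=None, reads_per_bin=None):
    """Estimate one construct's mean GFP/mCherry ratio from its per-bin read counts.

    read_counts   -- reads of this construct observed in each sorted bin
    bin_centers   -- representative GFP/mCherry ratio of each bin (must be > 0)
    cells_per_bin -- optional: number of cells sorted into each bin
    reads_per_bin -- optional: total sequencing reads obtained from each bin
    Returns None if the construct was not observed in any bin.
    """
    if len(read_counts) != len(bin_centers):
        raise ValueError("read_counts and bin_centers must have the same length")

    weights = []
    for k, reads in enumerate(read_counts):
        weight = float(reads)
        if cells_per_bin is not None and reads_per_bin is not None:
            # Assumption: rescale reads so that each bin contributes in
            # proportion to the number of cells that were sorted into it.
            weight = weight / reads_per_bin[k] * cells_per_bin[k]
        weights.append(weight)

    total = sum(weights)
    if total == 0:
        return None

    # Bins are log-spaced, so average in log space (geometric mean).
    log_mean = sum(w * math.log(c) for w, c in zip(weights, bin_centers)) / total
    return math.exp(log_mean)


# Hypothetical bin centres: 12 bins spaced by factors of two.
bin_centers = [2.0 ** k for k in range(-6, 6)]
# Hypothetical read counts for one construct, symmetric around the bin at 0.5.
counts = [0, 0, 0, 5, 20, 50, 20, 5, 0, 0, 0, 0]
print(flowseq_mean_ratio(counts, bin_centers))
```

Averaging in log space matches the log-spaced binning; a construct whose reads fall mostly in the lowest or highest bin would be poorly estimated this way, which is consistent with the slides restricting analysis to constructs within the quantitative range.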
Changing synonymous codon usage in the 11-amino-acid N-terminal peptide resulted in a mean 60-fold increase in protein abundance [...]. Among constructs that used the most common or the most rare synonymous codon in E. coli for every amino acid, the rare codon constructs displayed a mean 14-fold (median 4-fold) increase in expression over the common ones.

Protein expression of the reporter library
• Protein levels: 7327 (51.5%) constructs were within the quantitative range of the assay
• The authors used UNAFOLD software to predict the free energy of folding for each RBS-codon variant
• 10-AA N-terminal peptides from 137 genes

Protein expression of the library (as measured by the sfGFP:mCherry ratio) covers a ~200-fold range. The 13-member codon variant sets are grouped into columns by promoter/RBS combination (top). Codon variants include C, R, wild-type sequence (wt), and 10 sequences with varying secondary structure (∆G).

RNA and DNA Abundance
Fig. S4. RNA and DNA Abundance. (A) Relative RNA abundances (ratio of RNA to DNA contig counts) for each construct are displayed grouped by promoter and RBS identities. (B) Same plot as in (A) but displaying DNA abundances for each construct. Labels at the right are the same as in Figure 1E, where promoter identity is the first column and RBS identity is the second column. DNA abundances varied due to differences in DNA synthesis efficiencies as well as lower growth rate for very highly expressed genes.

Gene expression measurements of the reporter library
[Figure 1 panels: (A) fold change in protein expression by codon variant (R, C); (B) number of N-terminal peptides against fold change in expression from C to R, shown separately for strong, mid, weak, and WT RBS.]

[...] determined by FlowSeq; 7327 (51.5%) constructs were within the quantitative range of our assay [coefficient of determination (R²) = 0.955, P < 2 × [...]

Fig. 1. Gene expression measurements of the reporter library. (A) N-terminal peptide sequences encoding the most rare (R) codon variants show increased expression when compared to the most common ones (C) (mean 14-fold; median 4-fold). (B) Fold change in expression between C and R codon variants is largely independent of RBS strength. WT, wild type.
(C) Protein expression of the library (as measured by the sfGFP:mCherry ratio) covers a ~200-fold range. The 13-member codon variant sets are grouped into columns by promoter/RBS combination (top). Codon variants include C, R, wild-type sequence (wt), and 10 sequences with varying secondary structure (∆G). Not shown are two additional low-promoter panels, which were mostly outside the quantitative FlowSeq range. Dark gray squares had insufficient data, and light gray squares correspond to duplicate constructs.

Choice of codon and Expression
[Figure 2 panels: (A) mean fold change in expression due to codon substitution, one bar per codon, grouped by amino acid and colored by relative synonymous codon usage (RSCU); (B) the same values against log2 RSCU; (C) the same values against codon frequency enrichment at the N terminus of E. coli genes. Panel annotations: R² = 0.41, p < 10⁻¹⁶; R² = 0.73, p = 10⁻⁹.]

Fig. 2. Rare codons generally increase expression levels. (A) The average fold change in expression is correlated with the choice of codon. The y axis is the slope of a linear model linking codon use to expression change. Codons are sorted left to right by increasing genomic frequency and colored according to their relative synonymous codon usage (RSCU) in E. coli. (P values after Bonferroni correction: *P < 0.05, **P < 0.005, ***P < 0.001). (B) The individual codon slopes (y axis) as in (A) show an inverse relationship with RSCU (x axis). (C) The individual codon slopes correlate with enrichment of codons at the N terminus of genes in E. coli.

25 OCTOBER 2013 VOL 342 SCIENCE www.sciencemag.org 476

[...] synonymous codon with expression changes (Fig. 2A and fig. S8). For most amino acids, we found a link between the rarity of the codon and increased expression (Fig. 2B).

Relative Synonymous Codon Usage (RSCU)
$$\mathrm{RSCU}_i = \frac{X_i}{\frac{1}{n}\sum_{j=1}^{n} X_j}$$
n = number of synonymous codons (1 ≤ n ≤ 6) for the amino acid under study, X_i = number of occurrences of codon i. If the synonymous codons of an amino acid are used with equal frequencies, their RSCU values will equal 1.
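The RSCU definition on this slide (observed count of a codon divided by the mean count over the n synonymous codons of its amino acid) can be sketched in a few lines. The leucine codon list is the standard genetic code; the counts are hypothetical toy numbers, not E. coli data, and the function name is chosen here for illustration.

```python
import math

# The six leucine codons of the standard genetic code.
LEU_CODONS = ["CTG", "TTA", "TTG", "CTC", "CTT", "CTA"]


def rscu(counts, synonymous_codons):
    """Return {codon: RSCU} for one amino acid.

    counts            -- mapping codon -> number of occurrences (X_i)
    synonymous_codons -- the n codons that encode the amino acid
    """
    n = len(synonymous_codons)
    observed = [counts.get(codon, 0) for codon in synonymous_codons]
    total = sum(observed)
    if total == 0:
        raise ValueError("amino acid does not occur in the counted sequences")
    expected = total / n  # count each codon would have under equal usage
    return {codon: x / expected for codon, x in zip(synonymous_codons, observed)}


# Hypothetical counts, for illustration only.
toy_counts = {"CTG": 50, "TTA": 14, "TTG": 13, "CTC": 10, "CTT": 10, "CTA": 3}
leu_rscu = rscu(toy_counts, LEU_CODONS)

for codon in LEU_CODONS:
    # Fig. 2B plots log2 RSCU, so print both scales.
    print(codon, round(leu_rscu[codon], 2), round(math.log2(leu_rscu[codon]), 2))
```

With these toy counts the most used codon has RSCU 3 and the least used about 0.18. The RSCU values of one amino acid always sum to n, so equal usage gives 1 for every codon, as stated on the slide.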
There is a strong correlation [...] 0.00654), and did not significantly correlate with codons that increased protein expression in our data set (R² = 0.024, P = 0.124). Others have proposed that slow ribosome progression at the N [...]

Choice of codon and Expression
[Figure 3 panels A and B: fold change in protein expression against (A) change in %GC from the average variant and (B) ∆∆G from the average variant (kcal/mol).]
Rare codons alter expression by reducing mRNA secondary structure. (A) Expression changes are correlated with relative changes in %GC content. Each boxplot includes ±2% of centered value. (B) Expression increases correlate to relative increases in free energy of folding at the front of the transcript (ΔΔG). Each boxplot includes ±2 kcal/mol of centered value.

Choice of codon and Expression
[Figure 3 panels C and D: mean fold change in expression due to codon substitution against (C) codon substitution ∆∆G (kcal/mol), R² = 0.87, p < 10⁻¹⁶, and (D) log2 RSCU after controlling for ∆∆G, R² = 0.05, p = 0.197.]
Rare codons alter expression by reducing mRNA secondary structure. (C) Individual codon slopes (same as Fig. 2A y axis) correlate with the ΔΔG per individual codon substitution. (D) After controlling for ΔΔG with a multiple linear regression, there is no longer any relation between individual codon slopes and RSCU (compare with Fig. 2B).

[...] NUPACK (24). We found that increased secondary structure was correlated with decreased expression, which explained more variation than any other variable we measured (R² = 0.34, P < 2 × 10⁻¹⁶) (Fig. 3B). We made a similar linear regression model relating individual codon substitution [...]

Choice of codon and Expression
[Figure 3 panel E: ∆∆G from the average codon variant against ∆tAI from the average codon variant, colored by fold change in protein level; shaded boxes mark the ∆tAI = 0 and ∆∆G = 0 subsets.]
The classical translational efficiency cTE_i for each codon is the tRNA adaptation index (tAI)
• W_i: absolute adaptiveness
• n_i: the number of tRNA isoacceptors recognizing codon i
• s_ij: a selective constraint on the efficiency of the codon-anticodon coupling
• tGCN_ij: gene copy number of the tRNA
(E) The ΔΔG versus change in tAI is plotted for all constructs within the quantitative range. Constructs are colored by their relative fold change in expression from the average codon variant within the set.
Choice of codon and Expression
[Figure 3 panel F: fold change in protein expression against ∆∆G from the average codon variant for the ∆tAI = 0 subset (R² = 0.39, p < 10⁻¹⁶), and against ∆tAI from the average codon variant for the ∆∆G = 0 subset (R² = 0, p = 0.773).]
(F) Subsets of constructs corresponding to the shaded boxes in (E). (Left) Points with constant codon adaptation and varied secondary structure, (right) points with constant secondary structure and varied codon adaptation.

[...] left to uncover, and the extent to which codon usage beyond the N-terminal region alters gene expression remains unresolved (8, 14). The N terminus of genes in almost all bacteria displays reduced secondary structure, but enrichment of poorly adapted N-terminal codons is only found in bacteria with GC content of at least 50% (3). Recent work further shows that AT-rich [...] would expect a disproportionate enrichment of A over T due to G-U wobble pairing.
Indeed, nucleotide triplets with A at the wobble position were more consistently correlated with expression in our data set and with enrichment at the N terminus of E. coli genes (fig. S11). Kudla et al. show that local RNA structure in the region between –4 and +38 of translation start [...]

References and Notes
1. J. B. Plotkin, G. Kudla, Nat. Rev. Genet. 12, 32–42 (2011).
2. T. Tuller et al., Cell 141, 344–354 (2010).
3. M. Allert, J. C. Cox, H. W. Hellinga, J. Mol. Biol. 402, 905–918 (2010).
4. S. Pechmann, J. Frydman, Nat. Struct. Mol. Biol. 20, 237–243 (2013).
5. K. Bentele, P. Saffert, R. Rauscher, Z. Ignatova, N. Blüthgen, Mol. Syst. Biol. 9, 675 (2013).
6. G.-W. Li, E. Oh, J. S. Weissman, Nature 484, 538–541 (2012).
7. W. Gu, T. Zhou, C. O. Wilke, PLOS Comput. Biol. 6, e1000664 (2010).
8. G. Kudla, A. W. Murray, D. Tollervey, J. B. Plotkin, Science 324, 255–258 (2009).
9. M. dos Reis, R. Savva, L. Wernisch, Nucleic Acids Res. 32, 5036–5044 (2004).
10. P. Shah, Y. Ding, M. Niemczyk, G. Kudla, J. B. Plotkin, Cell 153, 1589–1601 (2013).
11. M. Welch et al., PLOS ONE 4, e7002 (2009).
12. M. Zhou et al., Nature 495, 111–115 (2013).
13. S. Navon, Y. Pilpel, Genome Biol. 12, R12 (2011).
14. T. Tuller, Y. Y. Waldman, M. Kupiec, E. Ruppin, Proc. Natl. Acad. Sci. U.S.A. 107, 3645–3650 (2010).
15. A. R. Subramaniam, T. Pan, P. Cluzel, Proc. Natl. Acad. Sci. U.S.A. 110, 2419–2424 (2013).
16. E. M. LeProust et al., Nucleic Acids Res. 38, 2522–2540 (2010).
17. J.-D. Pédelacq, S. Cabantous, T. Tran, T. C. Terwilliger, G. S. Waldo, Nat. Biotechnol. 24, 79–88 (2006).
18. N. C. Shaner et al., Nat. Biotechnol. 22, 1567–1572 (2004).
19. S. Kosuri et al., Proc. Natl. Acad. Sci. U.S.A. 110, 14024–14029 (2013).
20. Y. Yamazaki, H. Niki, J.-I. Kato, Methods Mol. Biol. 416, 385–389 (2008).
21. D. L. Hartl, E. N. Moriyama, S. A. Sawyer, Genetics 138, 227–234 (1994).
22. M. Gouy, C. Gautier, Nucleic Acids Res. 10, 7055–7074 (1982).
23. P. M. Sharp, W. H. Li, Nucleic Acids Res. 15, 1281–1295 (1987).
24. J. N. Zadeh et al., J. Comput. Chem. 32, 170–173 (2011).
25. D. Voges, M. Watzele, C. Nemetz, S. Wizemann, B. Buchberger, Biochem. Biophys. Res. Commun. 318, 601–614 (2004).
26. M. H. de Smit, J. van Duin, Proc. Natl. Acad. Sci. U.S.A. 87, 7668–7672 (1990).
27. J. R. Coleman et al., Science 320, 1784–1787 (2008).
28. A. A. Komar, Trends Biochem. Sci. 34, 16–24 (2009).

Acknowledgments: We thank J. C. Way, E. R. Daugharthy, and R. T. Sauer for comments. The research was supported by the U.S. Department of Energy (DE-FG02-02ER63445 to G.M.C.), NSF SynBERC (SA5283-11210 to G.M.C.), Office of Naval Research (N000141010144 to G.M.C. and S.K.), Agilent [...]

Hybridization probabilities and Expression
A multiple linear regression model that combines promoter and RBS choice, as well as N-terminal secondary structure and GC content, still explains only 54% of variation in expression levels.

[Figure 4 panels: (top) relative hybridization probability (−30% to +30%) in 10-bp windows across the transcript (−20 to +60), covering the RBS, the variable codons, and GFP; mean ± 1 SD for the best 5% of constructs (422 constructs) and the worst 5% of constructs (422 constructs); (bottom) −log10(p-value) per window.]

Fig. 4. mRNA structure downstream of start codon is most correlated with reduced expression. Relative hybridization probabilities averaged in 10-nucleotide windows are plotted against their correlation with expression change as a function of position (–20 to +60 from ATG). (Top) The best and worst 5% of constructs, as ranked by relative expression within a codon variant set, are grouped and plotted as blue and red ribbons, respectively. The ribbon tops and bottoms are one standard deviation from the mean, which is shown as a solid line. (Bottom) The P value for linear regressions, correlating hybridization probabilities within each window to expression fold change in all constructs.

Hybridization probabilities were calculated using UNAFOLD software.

Conclusion
• Using N-terminal rare codons instead of common ones increases expression by ~14-fold (median 4-fold)
• Reduced RNA structure, not codon rarity itself, is responsible for expression increases
• This knowledge helps to better predict how to synthesize genes that make enzymes or drugs in a bacterial cell

Reference
Daniel B. Goodman, George M. Church, Sriram Kosuri. "Causes and Effects of N-Terminal Codon Bias in Bacterial Genes". Science. 2013 Oct 25;342(6157):475-9

THANK YOU

Flow Cytometry Controls
Fig. S5. Flow Cytometry Controls. FlowSeq estimates of fluorescence ratio correlate well (R² = 0.955, p < 2×10⁻¹⁶) with individually measured fluorescence ratios from sequence-verified clones. 51 constructs that were outside of the quantitative FlowSeq range are shown in red.
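The R² quoted for the flow cytometry controls is a coefficient of determination. For a simple linear fit of one measurement against another it equals the squared Pearson correlation, which the sketch below computes. Comparing the two measurements on a log2 scale is an assumption made here, the function name is illustrative, and the ratio values are invented.

```python
import math


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("R^2 is undefined when one variable is constant")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical GFP/mCherry ratios for four clones, measured two ways.
flowseq_ratio = [0.1, 0.5, 2.0, 8.0]
individual_ratio = [0.2, 1.0, 4.0, 16.0]  # exactly 2x the FlowSeq values

log_flowseq = [math.log2(v) for v in flowseq_ratio]
log_individual = [math.log2(v) for v in individual_ratio]
print(r_squared(log_flowseq, log_individual))
```

In this toy case the two methods differ only by a constant factor, so R² is 1 even though the values disagree: a high R² shows that the ranking and spacing of constructs agree, not that the absolute ratios match.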