bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. 1 Sequence features explain most of the mRNA 2 stability variation across genes 3 4 Jun cheng1,2, Žiga Avsec1,2, Julien Gagneur1,2,* 5 1. Department of Informatics, Technical University of Munich, Boltzmannstraße 3, 6 85748 Garching, Germany 7 2. Graduate School of Quantitative Biosciences (QBM), Ludwig-Maximilians- 8 Universität München, Germany 9 * Correspondence: [email protected] 10 11 Abstract 12 The stability of messenger RNA (mRNA) is one of the major determinants of gene 13 expression. Although a wealth of sequence elements and mechanisms regulating 14 mRNA stability has been described, their quantitative contributions in determining 15 mRNA half-life is unknown. Here, we built quantitative models for two eukaryotic 16 genomes Saccharomyces cerevisiae and Schizosaccharomyces pombe that, for the 17 first time, explain most of the half-life variation between genes based on mRNA 18 sequence alone. The models integrate known functional cis-regulatory elements, 19 identify novel ones, and quantify their contributions at single-nucleotide resolution. 20 We show quantitatively that codon usage is the major determinant of mRNA stability, 21 and that this effect depends on canonical mRNA degradation pathways. Altogether, 22 our results integrate and quantitatively delineate mRNA stability cis-regulatory 23 elements and provide a methodology that can serve as a scaffold to study the 24 function of cis-regulatory elements and to discover novel ones. 25 26 Introduction 27 The stability of messenger RNAs is an important aspect of gene regulation. It 28 influences the overall cellular mRNA concentration, as mRNA steady-state levels are 29 the ratio of synthesis and degradation rate. Moreover, low stability confers high 30 turnover to mRNA and therefore the capacity to rapidly reach a new steady-state 31 level in response to a transcriptional trigger. Hence, stress genes, which must rapidly crPage 1 | 22 bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. 32 respond to environmental signals, show low stability (Zeisel et al, 2011; Rabani et al, 33 2014). In contrast, high stability provides robustness to variations in transcription. 34 Accordingly, a wide range of mRNA-half-lives is observed in eukaryotes, with typical 35 variations in a given genome spanning one to two orders of magnitude 36 (Schwanhäusser et al, 2011; Schwalb et al, 2016; Eser et al, 2016). 37 38 How mRNA stability is encoded in a gene sequence has long been a subject of 39 study. Cis-regulatory elements (CREs) affecting mRNA stability are mainly encoded 40 in the mRNA itself. They include but are not limited to secondary structure, sequence 41 motifs present in the 3’UTR including binding sites of RNA-binding proteins (Olivas & 42 Parker, 2000; Vasudevan & Peltz, 2001), and, in higher eukaryotes, microRNAs (Lee, 43 1993). Moreover, codon usage, which affects translation elongation rate, also 44 regulates mRNA stability (Hoekema et al, 1987; Presnyak et al, 2015; Boël et al, 45 2016; Mishima & Tomari, 2016). For instance, inserting strong secondary structure 46 elements in the 5’UTR or modifying the translation start codon context strongly 47 destabilizes the long-lived PGK1 mRNA in S. cerevisiae (Muhlrad et al, 1995; 48 LaGrandeur & Parker, 1999). 49 50 Since the RNA degradation machineries are well conserved among eukaryotes, the 51 pathways have been extensively studied using Saccharomyces cerevisiae as a 52 model organism (Caponigro & Parker, 1996; Parker, 2012). The general mRNA 53 degradation pathway starts with the removal of the poly(A) tail by the Pan2/Pan3 54 (Boeck et al, 1996; Brown et al, 1996) and Ccr4/Not complexes (Tucker et al, 2001; 55 Collart, 2003; Yamashita et al, 2005). Subsequently, mRNA is subjected to 56 decapping carried out by Dcp2 and promoted by several factors including Dhh1 and 57 Pat1 (Pilkington & Parker, 2008; She et al, 2008). The decapped and deadenylated 58 mRNA can be degraded rapidly in the 3’ to 5’ direction by the exosome (Anderson & 59 Parker, 1998) or in the 5’ to 3’ direction by Xrn1 (Muhlrad et al, 1994; Hsu & Stevens, 60 1993). Further mRNA degradation pathways are triggered when aberrant 61 translational status is detected, including Nonsense-mediated decay (NMD), No-go 62 decay (NGD) and Non-stop decay (NSD) (Garneau et al, 2007; Parker, 2012). 63 64 Despite all this knowledge, prediction of mRNA half-life from a gene sequence is still 65 not established. Most of the mechanistic studies could only be performed on crPage 2 | 22 bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. 66 individual genes and it is unclear how the effects generalize genome-wide. In 67 addition, neither the contribution of different CREs to the overall stability and to the 68 variation between genes has been systematically studied, nor have the dependences 69 of CREs to distinct mRNA degradation pathways. 70 71 Here, we used an integrative approach where we mathematically modelled mRNA 72 half-life as a function of its sequence and applied it to S .cerevisiae. For the first time, 73 our model can explain most of the between-gene half-life variance from sequence 74 alone. We used a semi-mechanistic model which allowed us to interpret it in terms of 75 individual sequence features in the 5’UTR, coding region, and 3’UTR. Our approach 76 de novo recovered known cis-regulatory elements and identified novel ones. 77 Quantification of the respective contributions revealed that codon usage is the major 78 contributor to mRNA stability. Interpretability of the model also allowed studying how 79 the function of CREs depends on different mRNA degradation pathways. The 80 application of the modeling approach to S. pombe supports the generality of these 81 findings. 82 83 RESULTS 84 85 Regression reveals novel mRNA sequence features associated with 86 mRNA stability 87 To study cis-regulatory determinants of mRNA stability in S. cerevisiae, we chose the 88 dataset by Sun and colleagues (Sun et al, 2013), which provides genome-wide half- 89 life measurements for 4,388 expressed genes of a wild type lab strain (Fig 1). The 90 same study also measured genome-wide mRNA half-lives for 34 strains knocked out 91 for RNA degradation pathway genes (Table EV1), allowing for investigating how the 92 association of candidate CREs with half-life depends on RNA degradation pathways. 93 When applicable, associations were also investigated in the S. pombe genome, using 94 half-life values for 3,614 expressed mRNAs in a wild-type lab strain from Eser and 95 colleagues (Eser et al, 2016). We considered sequence features within 5 overlapping 96 regions: the 5’UTR, the start codon context, the coding sequence, the stop codon 97 context and the 3’UTR. Half-life associated negatively with 5’UTR length (Spearman 98 ρ = -0.17, P = 3 × 10-12 for S. cerevisiae and Spearman ρ = -0.27, P = 2 × 10-11 for S. crPage 3 | 22 bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. 99 pombe), with the coding sequence (Spearman ρ = -0.23, P < 2 × 10-16 for S. 100 cerevisiae and Spearman ρ = -0.32, P = 4 × 10-15 for S. pombe), and with 3’UTR 101 length (Spearman ρ = -0.06, P = 1 × 10-8 for S. cerevisiae and Spearman ρ = -0.23, P 102 = 6 × 10-12 for S. pombe) (Fig EV1A, B, C, D, E, F). Moreover, S .cerevisiae half-life 103 was found to be negatively associated with the GC-content of the 5’UTR (Spearman 104 ρ = -0.11, P = 3 × 10-14) and of the 3’UTR (Spearman ρ = -0.17, P < 2 × 10-16), 105 whereas the GC-content of the CDS region correlated positively with half-life 106 (Spearman ρ = 0.27, P < 2 × 10-16) (Fig EV2A, C, E). In contrast, S. pombe showed 107 consistent positive association between GC-content and half-life (Spearman ρ = 0.16, 108 0.24, 0.05 and P < 2 × 10-16, P < 2 × 10-16, P = 0.01 respectively for 5’UTR, CDS and 109 3’UTR, Fig EV2B, D, F). Also, secondary structure in 5’UTR (Materials and Methods) 110 associated with RNA instability (Spearman ρ = -0.22, P = 8 × 10-12 for S. cerevisiae 111 and Spearman ρ = -0.24, P < 2 × 10-16 for S. pombe, Fig EV1G, H). 112 113 Novel candidate cis-regulatory elements were searched for using a robust k-mer 114 based regression, investigating all 3- to 8-mers, and followed by k-mer assembly 115 (Materials and Methods). This recovered de novo the Puf3 binding motif TGTAAATA 116 in 3’UTR (Gerber et al, 2004; Gupta et al, 2014), a well-studied CRE that confers 117 RNA instability, as well as the Whi3 binding motif TGCAT (Colomina et al, 2008; Cai 118 & Futcher, 2013). Two new motifs were found: AAACAAA in 5’UTR, and ATATTC in 119 3’UTR (Fig 1A). Notably, the motif TGCAT, was found in 21% of the genes (938 out 120 of 4,388) and significantly co-occurred with the other two destabilizing motifs found in 121 3’UTR: Puf3 and the Whi3 binding site motifs (Fig 1B). Except for AAACAAA and 122 TTTTTTA, all motifs associated with shorter half-lives (Fig 1A). 123 124 In the following subsections, we describe first the findings for each of the 5 gene 125 regions and then a model that integrates all these sequence features. 126 127 Upstream AUGs destabilize mRNAs by triggering nonsense- 128 mediated decay 129 Genome-wide, occurrence of an upstream AUG (uAUG) associated with 1.37 fold 130 shorter half-life (fold change between medians, P < 2.2 × 10-16), and the effect 131 strengthened for genes with two or more AUGs (Fig 2A, B). Among the 34 knock-out crPage 4 | 22 bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. 132 strains, the association between uAUG and shorter half-life was almost lost only for 133 mutants of the two essential components of the nonsense-mediated mRNA decay 134 (NMD) UPF2 and UPF3 (Leeds et al, 1992; Cui et al, 1995), and for the general 5’-3’ 135 exonuclease Xrn1 (Fig 2A). The dependence on NMD suggested that the association 136 between uAUG and shorter half-life might be due to the occurrence of a premature 137 stop codon. Indeed, the association of uAUG with decreased half-lives was only 138 found for genes with a premature stop codon cognate with the uAUG (Fig 2C). This 139 held not only for cognate premature stop codons within the 5’UTR, leading to a 140 potential upstream ORF, but also for cognate premature stop codons within the ORF, 141 which occurred almost always for uAUG out-of-frame with the main ORF (Fig 2C). 142 This finding likely holds for many other eukaryotes as we found the same trends in S. 143 pombe (Fig 2D). This observations are consistent with single-gene studies 144 demonstrating that translation of upstream ORFs can lead to RNA degradation by 145 nonsense-mediated decay (Gaba et al, 2005; Barbosa et al, 2013). Altogether, these 146 results show that uAUGs are mRNA destabilizing elements as they almost surely 147 match with a cognate premature stop codon, which, whether in frame or not with the 148 gene, and within the UTR or in the coding region, trigger NMD. 149 150 Translation initiation predicts mRNA stability 151 Several sequence features in the 5’UTR associated significantly with mRNA half-life. 152 153 First, the folding energy of the 5’ initiation sequence, defined as the 5’UTR and the 154 first 10 codons (Zur & Tuller, 2013), positively correlated with mRNA stability in S. 155 cerevisiae (Spearman ρ = -0.22, P = 8 × 10-12 for S. cerevisiae and Spearman ρ = - 156 0.24, P < 2 × 10-16 for S. pombe, Fig EV1G, H). The mRNA destabilizing function of 157 strong secondary structure elements had been demonstrated for the gene PGK1 158 (LaGrandeur & Parker, 1999; Muhlrad et al, 1995). Our genome-wide analysis 159 indicates that this is a general phenomenon. Possibly, strong secondary structure in 160 the 5’ UTR interferes with ribosome scanning, thus slows down translation initiation, 161 and therefore decreases the level of mRNA protection by the translation machinery 162 (Roy & Jacobson, 2013; Huch & Nissan, 2014). 163 crPage 5 | 22 bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. 164 Second, longer 5’UTRs associated with less stable mRNAs (Spearman ρ = -0.17, P = 165 3 × 10-12 for S. cerevisiae and Spearman ρ = -0.27, P = 2 × 10-11 for S. pombe, Fig 166 EV1A, B). In mouse cells, mRNA isoforms with longer 5’UTR are translated with 167 lower efficiency (Wang et al, 2016), possibly because longer 5’UTR generally harbor 168 more translation repression elements. Hence, longer 5’UTR may confer mRNA 169 instability by decreasing translation initiation and therefore decreasing the protection 170 by the translation machinery. 171 172 Third, a significant association between the third nucleotide before the start codon 173 and mRNA half-life was observed (Fig 3A). The median half-life correlated with the 174 nucleotide frequency (Fig EV3A), leading to 1.28 fold difference between the 2,736 175 genes with an adenosine, the most frequent nucleotide at this position, and the 360 176 genes with a cytosine, the least frequent nucleotide at this position (P = 1.7x10-11). 177 The same correlation was also found to be significant for S. pombe (Fig EV3B, C). 178 Functional effect of the start codon context on mRNA stability had been established 179 as the long-lived PGK1 mRNA was significantly destabilized when substituting the 180 sequence context around its start codon with the one from the short-lived MFA2 181 mRNA (LaGrandeur & Parker, 1999). Our genome-wide analysis indicates that this 182 effect generalizes to other genes. Possibly the start codon context, which controls 183 translation initiation efficiency (Kozak, 1987; Dvir et al, 2013), affects thereby mRNA 184 stability. 185 186 Altogether, these findings indicate that 5’UTR elements, including the start codon 187 context, may affect mRNA stability by altering translation initiation. De novo search 188 for regulatory motifs identified AAACAAA to be significantly associated with longer 189 mRNA half-lives. However, further analysis did not provide strong support for this 190 motif, indicating it might be merely correlative (Fig EV5). 191 192 Codon usage regulates mRNA stability through common mRNA 193 decay pathways 194 Codon usage has been reported to affect mRNA stability through a translation- 195 dependent manner (Presnyak et al, 2015; Mishima & Tomari, 2016; Bazzini et al, 196 2016). However, it is unclear on which mRNA degradation pathway it depends. crPage 6 | 22 bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. 197 198 The species-specific tRNA adaptation index (sTAI) was used to quantify the level of 199 codon usage bias (Sabi & Tuller, 2014). First, we confirmed on this independent half- 200 life dataset the strong correlation between codon optimality and mRNA stability in S. 201 cerevisiae (Fig 3C, Spearman ρ = 0.55, P < 2.2x10-16). Using the out-of-folds 202 explained variance as a summary statistics, we assessed its variation across different 203 gene knockouts (Materials and Methods). The effect of codon usage exclusively 204 depended on the genes from the common deadenylation- and decapping-dependent 205 5’ to 3’ mRNA decay pathway and the NMD pathway (Fig 3D). In particular, all 206 assayed genes of the Ccr4-Not complex, including CCR4, NOT3, CAF40 and POP2, 207 are required for wild-type level effects of codon usage on mRNA decay. Among them, 208 CCR4 has the largest effect. This confirmed recent studies showing that accelerated 209 decay of non-optimal codon genes requires deadenylation activities of Ccr4-Not 210 (Presnyak et al, 2015; Mishima & Tomari, 2016). Moreover, our results confirmed the 211 dependence on Dhh1 (Radhakrishnan et al, 2016), but also on its interacting partner 212 Pat1. In contrast to genes of the Ccr4-Not complex, PAN2/3 genes which encode 213 also deadenylation enzymes, were not found to be essential for the coupling between 214 codon usage and mRNA decay (Fig 3D). Notably, we did not observe any change of 215 effect upon knockout of DOM34 and HBS1 (Figure EV5), which are essential genes 216 for the No-Go decay pathway. This implies that the effect of codon usage is unlikely 217 due to stalled ribosomes at non-optimal codons. 218 219 Our systematic analysis revealed two additional novel dependencies: First, on the 220 common 5’ to 3’ exonuclease Xrn1, and second, on UPF genes, which are essential 221 players of NMD (Fig 3D). Previous studies have shown that NMD is more than just a 222 RNA surveillance pathway, but rather one of the general mRNA decay mechanisms 223 that target a wide range of mRNAs, including aberrant and normal ones (He et al, 224 2003; Hug et al, 2015). Altogether, our analysis strongly indicates that, the so-called 225 “codon-mediated decay” is not an mRNA decay pathway itself, but a regulatory 226 mechanism of the common mRNA decay pathways. 227 228 Stop codon context associates with mRNA stability 229 crPage 7 | 22 bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. 230 Linear regression against the 6 bases 5’ and 3’ of the stop codon revealed the first 231 nucleotide 3’ of the stop codon to most strongly associate with mRNA stability. This 232 association was observed for each of the three possible stop codons, and for each 233 codon a cytosine significantly associated with lower half-life (Fig. 3B). This also held 234 for S. pombe (Fig EV3D). A cytosine following the stop codon structurally interferes 235 with stop codon recognition (Brown et al, 2015), thereby leading to stop codon read- 236 through events (Bonetti et al, 1995; McCaughan et al, 1995). Of all combinations, 237 TGA-C is known to be the leakiest stop codon context (Jungreis et al, 2011) and also 238 associated with shortest mRNA half-life (Fig. 3B). These results are consistent with 239 non-stop decay, a mechanism that triggers exosome-dependent RNA degradation 240 when the ribosome reaches the poly(A) tail. Consistent with this interpretation, 241 mRNAs with additional in-frame stop codons in the 3’UTR, which are over- 242 represented in yeast (Williams et al, 2004), exhibited significantly higher half-life (Fig 243 EV3E, F). However, the association between the stop codon context and half-life was 244 not weakened in mutants of the Ski complex, which is required for the cytoplasmic 245 functions of the exosome (Fig EV5). These results strongly indicate that the fourth 246 nucleotide after the stop codon is an important determinant of mRNA stability, likely 247 because of translational read-through. 248 249 Sequence motifs in 3’UTR 250 Four motifs in the 3’UTR were found to be significantly associated with mRNA 251 stability (Fig 5A, FDR < 0.1, Materials and Methods). This analysis recovered three 252 described motifs: the Puf3 binding motif TGTAAATA (Gerber et al, 2004), the Whi3 253 binding motif TGCAT (Colomina et al, 2008; Cai & Futcher, 2013), and a poly(U) 254 motif TTTTTTA, which is likely bound by Pub1 (Duttagupta et al, 2005), or is part of 255 the long poly(U) stretch that form a looping structure with poly(A) tail (Geisberg et al, 256 2014). We also identified a novel motif, ATATTC, that associated with lower mRNA 257 half-life. 258 259 Four lines of evidence supported the potential functionality of the new motif. First, it 260 preferentially localizes in the vicinity of the poly(A) site (Fig 5B), and functionally 261 depends on Ccr4 (Fig EV5), suggesting a potential interaction with deadenylation 262 factors. Second, single nucleotide deviations from the consensus sequence of the crPage 8 | 22 bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. 263 motif associated with decreased effects on half-life (Fig 5C, linear regression allowing 264 for one mismatch, Materials and Methods). Moreover, the flanking nucleotides did not 265 show further associations indicating that the whole lengths of the motifs were 266 recovered (Fig 5C). Third, when allowing for one mismatch, the motif still showed 267 strong preferences (Fig 5D). Fourth, the motif instances were more conserved than 268 their flanking bases (Fig 5E). 269 270 Consistent with the role of Puf3 in recruiting deadenylation factors, Puf3 binding motif 271 localized preferentially close to the poly(A) site (Fig 5B). The effect of the Puf3 motifs 272 was significantly lower in the knockout of PUF3 (Fig EV5). We also found a 273 significant dependence on the deadenylation (CCR4, POP2) and decapping (DHH1, 274 PAT1) pathways (Fig EV5), consistent with previous single gene experiment showing 275 that Puf3 binding promotes both deadenylation and decapping (Olivas & Parker, 276 2000; Goldstrohm et al, 2007). Strikingly, Puf3 binding motif switched to a 277 stabilization motif in the absence of Puf3 and Ccr4, suggesting that deadenylation of 278 Puf3 motif containing mRNAs is not only facilitated by Puf3 binding, but also depends 279 on it. 280 281 Whi3 plays an important role in cell cycle control (Garí et al, 2001). Binding of Whi3 282 leads to destabilization of the CLN3 mRNA (Cai & Futcher, 2013). A subset of yeast 283 genes are up-regulated in the Whi3 knockout strain (Cai & Futcher, 2013; Holmes et 284 al, 2013). However, it was so far unclear whether Whi3 generally destabilizes mRNAs 285 upon its binding. Our analysis showed that mRNAs containing the Whi3 binding motif 286 (TGCAT) have significantly shorter half-life (FDR = 6.9x10-04). Surprisingly, this 287 binding motif is extremely widespread, with 896 out of 4,388 (20%) genes that we 288 examined containing the motif on the 3’UTR region, suggesting possible broader and 289 genome-wide role of Whi3 and its binding motif. However, no significant genetic 290 dependence of the effect of the Whi3 binding motif was found (Fig EV5). 291 292 The mRNAs harboring the TTTTTTA motif tended to be more stable (Fig 5A). No 293 positional preferences was observed for this motif (Fig 5B), so as the poly(U) stretch 294 reported before (Geisberg et al, 2014). Effects of this motif depends on genes from 295 Ccr4-Not complex and Xrn1 (Fig EV5). 296 crPage 9 | 22 bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. 297 60% between-gene half-life variance can be explained by sequence 298 features 299 We next asked how well one could predict mRNA half-life from these mRNA 300 sequence features, and what their respective contributions were when considered 301 jointly. To this end, we performed a multivariate linear regression of the logarithm of 302 the half-life against the identified sequence features. The predictive power of the 303 model on unseen data was assessed using 10-fold cross validation (Material and 304 Methods, Table EV2). Altogether, 60% of S. cerevisiae half-life variance in the 305 logarithmic scale can be explained by simple linear combinations of the above 306 sequence features (Fig 6A). The median out-of-folds error across genes is 30%. 307 Prediction accuracy of half-life of 30% is remarkably high because it is in the order of 308 magnitude of the expression variation that is typically physiologically tolerated, and it 309 is also about the amount of variation observed between replicate experiments (Eser 310 et al, 2016). The same model applied to S. pombe explained 42% of the total half-life 311 variance. Because the measures also entail measurement noise, these numbers are 312 conservative underestimations of the total biological variance explained by our 313 model. 314 315 The uAUG, 5’UTR length, 5’UTR GC content, 61 coding codons, CDS length, all four 316 3’UTR motifs, and 3’UTR length remained significantly associated with half-life in the 317 joint model indicating that they contributed independently to half-life (the complete list 318 of features and their p-values are given in Table EV3). In contrast, start codon 319 context, stop codon context, 5’ folding energy, the 5’UTR motif AAACAAA, and 320 3’UTR GC content dropped below the significance when considered in the joint 321 model (Materials and Methods). This loss of statistical significance may be due to 322 lack of statistical power or because the marginal association of these sequence 323 features with half-life is a consequence of a correlation with other sequence features. 324 Among all sequence features, codon usage as a group is the best predictor both in a 325 univariate model (55.90%) and in the joint model (44.23%) (Figure 6B). This shows 326 that, quantitatively, codon usage is the major determinant of mRNA stability in yeast. 327 328 The variance analysis quantifies the contribution of each sequence feature to the 329 variation across genes. Features that vary a lot between genes, such as UTR length crPage 10 | 22 bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. 330 and codon usage, favorably contribute to the variation. This does not reflect, 331 however, the effect on a given gene of elementary sequence variations in these 332 features. For instance, a single-nucleotide variant can lead to the creation of an 333 uAUG with a strong effect on half-life, but a single nucleotide variant in the coding 334 sequence may have little impact on overall codon usage. We used the joint model to 335 assess the sensitivity of each feature as median fold-change across genes upon 336 single-nucleotide variants, simulating single-nucleotide deletions for the length 337 features and single nucleotide substitutions for the remaining ones (Materials and 338 Methods). Single-nucleotide variations typically altered half-life by less than 10%. The 339 largest effects were observed in the 3’UTR motifs and uAUG (Fig 6D). Notably, 340 although codon usage was the major contributor to the variance, synonymous 341 variation on codons typically affected half-life by less than 2% (Fig 6D; Fig EV6). For 342 those synonymous variations that changed half-life by more than 2%, most of them 343 were variations that involved the most non-optimized codons CGA or ATA (Fig EV6, 344 Presnyak et al. 2015). 345 Altogether, our results show that most of yeast mRNA half-life variation can be 346 predicted from mRNA sequence alone, with codon usage being the major contributor. 347 However, single-nucleotide variation at 3’UTR motif or uAUG had the largest 348 expected effect on mRNA stability. 349 350 351 DISCUSSION 352 We systematically searched for mRNA sequence features associating with mRNA 353 stability and estimated their effects at single-nucleotide resolution in a joint model. 354 Overall, the joint model showed that 60% of the variance could be predicted from 355 mRNA sequence alone in S. cerevisiae. This analysis showed that translation-related 356 features, in particular codon usage, contributed most to the explained variance. This 357 findings strengthens further the importance of the coupling between translation and 358 mRNA degradation (Roy & Jacobson, 2013; Huch & Nissan, 2014). Moreover, we 359 assessed the RNA degradation pathway dependencies of each sequence feature. 360 Remarkably, we identified that codon-mediated decay is a regulatory mechanism of 361 the canonical decay pathways, including deadenylation- and decapping-dependent 5’ 362 to 3’ decay and NMD (Figure 6E). crPage 11 | 22 bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. 363 364 Integrative analyses of cis-regulatory elements as we used here complement 365 mechanistic single-gene studies for important aspects. It allows assessing genome- 366 wide the importance of CREs that have been reported previously with single-gene 367 experiments, as Vogel and colleagues (Vogel et al, 2010) had shown when modeling 368 protein abundance from mRNA levels and sequence. Also, single-nucleotide effect 369 prediction can more precisely supports the interpretation of genetic variants, including 370 mutations in non-coding region as well as synonymous transitions in coding region. 371 Furthermore, such integrative analyses can be combined with a search for novel 372 sequence features, as we did here with k-mers, allowing the identification of novel 373 candidate cis-regulatory elements. An alternative approach to the modeling of 374 endogenous sequence is to use large-scale perturbation screens as developed by 375 the group of Segal (Dvir et al, 2013; Shalem et al, 2015). Although very powerful to 376 dissect known cis-regulatory elements or to investigate small variations around select 377 genes, the sequence space is so large that these large-scale perturbation screens 378 cannot uncover all regulatory motifs. It would be interesting to combine both 379 approaches and design large-scale validation experiments guided by insights coming 380 from modeling of endogenous sequences as we developed here. 381 382 Causality cannot be proven through a regression analysis approach. However, we 383 provided several complementary analyses to further assess the potential functionality 384 of candidate CREs. These include conservation, positional preferences, and epistasis 385 analyses to assess the dependencies on RNA degradation pathways. One of the 386 motif we found by regression, AAACAAA in 5’UTR, was neither supported by these 387 complementary analyses, nor it remained significant in the joint model. We think this 388 motifs is most likely correlative. In contrast, the ATATTC motif in 3’UTR is strongly 389 supported by these complementary analyses and is also significant in the joint model. 390 Two of the most interesting sequence features that we identified but still need to be 391 functionally assayed are the start codon context and 5’ sequence free energy. Given 392 their established effect on translation initiation (Kozak, 1986; Dvir et al, 2013; Tuller & 393 Zur, 2014), the general coupling between translation and mRNA degradation (Huch & 394 Nissan, 2014; Roy & Jacobson, 2013), as well as several observations directly on 395 mRNA stability for single genes (Muhlrad et al, 1995; Barnes, 1998; LaGrandeur & 396 Parker, 1999; Schwartz & Parker, 1999), they are very likely to be functional on most crPage 12 | 22 bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. 397 genes. Consistent with this hypothesis, large scale experiments that perturb 5’ 398 sequence secondary structure and start codon context indeed showed a wide range 399 of mRNA level changes in the direction that we would predict (Dvir et al, 2013). 400 Altogether, such integrative approaches allow the identification of candidate 401 regulatory elements that could be functionally tested later on. 402 403 We are not aware of previous studies that systematically assessed the effects of cis- 404 regulatory elements in the context of knockout backgrounds, as we did here. This 405 part of our analysis turned out to be very insightful. We recovered results from recent 406 studies about the dependencies of the effect of codon usage on RNA stability on the 407 Ccr4-Not complex and Dhh1, but also identified important novel ones including NMD 408 factors, Pat1 and Xrn1. With the growing availability of knockout or mutant 409 background in model organisms and human cell lines, we anticipate this approach to 410 become a fruitful methodology to unravel regulatory mechanisms. 411 412 Materials and Methods 413 414 Data and Genomes 415 Wild-type and knockout genome-wide S. cerevisiae half-life data were obtained from 416 Sun et al. 2013. S. cerevisiae gene boundaries were taken from the boundaries of 417 the most abundant isoform quantified by Pelechano et al. 2013. Reference genome 418 fasta file and genome annotation were obtained from the Ensembl database (release 419 79). UTR regions were defined by subtracting out gene body (exon and introns from 420 the Ensembl annotation) from the gene boundaries. 421 Genome-wide half-life data of S. pombe as well as refined transcription unit 422 annotation were obtained from Eser et al. 2016. Reference genome version 423 ASM294v2.26 was used to obtain sequence information. 424 For both half-life datasets, only mRNAs with mapped 5’UTR and 3’UTR were 425 considered. mRNAs that have 5’UTR length shorter than 6nt were further filtered out, 426 so that start codon context could always be defined. Half-life outliers of S. pombe 427 (half-life less than 1 or larger than 250 mins) were removed. crPage 13 | 22 bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. 428 Codon-wise species-specific tRNA adaptation index (sTAI) of yeasts were obtained 429 from Sabi & Tuller 2014. Gene-wise sTAIs were calculated as the geometric mean of 430 sTAIs of all its codons (stop codon excluded). 431 432 Analysis of knockout strains 433 The effect level of an individual sequence feature was compared against the wild- 434 type with Wilcoxon rank-sum test followed by multiple hypothesis testing p-value 435 correction (FDR < 0.1). The sequence feature effect levels were defined as follows 436 for different classes of sequence features: 437 uAUG: 438 Motifs: 439 half-life of mRNAs that has zero and one instance of the motif respectively. 440 Codon usage: For each knockout or wild-type, a linear model was fitted with all 441 coding codons as covariates. The effect size of codon usage (joint effect of all 442 codons) on half-life was defined by the explained variance of out-of-sample 443 predictions. 444 Start codon -3 position: 445 mRNAs with base adenine and cytosine at start codon -3 position respectively. 446 Stop codon +1 position: 447 half-life of mRNAs that has stop codon TGA followed by guanine and cytosine 448 respectively. 449 5’ folding energy: genome-wide Spearman rank correlation between 5’ folding energy 450 and half-life. (𝑚𝑒𝑑𝑖𝑎𝑛(𝐻𝑤𝑖𝑡ℎ𝑢𝐴𝑈𝐺 )− 𝑚𝑒𝑑𝑖𝑎𝑛(𝐻𝑤𝑖𝑡ℎ𝑜𝑢𝑡𝑢𝐴𝑈𝐺 )) 𝑚𝑒𝑑𝑖𝑎𝑛(𝐻𝑤𝑖𝑡ℎ𝑜𝑢𝑡𝑢𝐴𝑈𝐺 ) (𝑚𝑒𝑑𝑖𝑎𝑛(𝐻𝑐𝑜𝑢𝑛𝑡=1 )− 𝑚𝑒𝑑𝑖𝑎𝑛(𝐻𝑐𝑜𝑢𝑛𝑡=0 )) 𝑚𝑒𝑑𝑖𝑎𝑛(𝐻𝑐𝑜𝑢𝑛𝑡=0 ) where 𝐻 stands for half-life. where 𝐻𝑐𝑜𝑢𝑛𝑡=0 and 𝐻𝑐𝑜𝑢𝑛𝑡=1 stands for the (𝑚𝑒𝑑𝑎𝑖𝑛(𝐻𝐴 )−𝑚𝑒𝑑𝑖𝑎𝑛(𝐻𝑐 )) 𝑚𝑒𝑑𝑖𝑎𝑛(𝐻𝐴 ) where 𝐻𝐴 and 𝐻𝐶 is the half-life of (𝑚𝑒𝑑𝑖𝑎𝑛(𝐻𝑇𝐺𝐴𝐺 )−𝑚𝑒𝑑𝑖𝑎𝑛(𝐻𝑇𝐺𝐴𝐶 )) 𝑚𝑒𝑑𝑖𝑎𝑛(𝐻𝑇𝐺𝐴𝐺 ) where 𝐻𝑇𝐺𝐴𝐺 and 𝐻𝑇𝐺𝐴𝐶 is the 451 452 Motif discovery 453 Motif discovery was conducted similarly to Eser et al. 2016 for the 5’UTR, the CDS 454 and the 3’UTR regions. A linear mixed effect model was used to assess the effect of 455 each individual k-mer (fixed effect) while controlling the effects of the others (random 456 effect) and for the region length as a covariate. For CDS we also used codons as 457 further covariates. We systematically tested the effects of all possible k-mers with 458 length from 3 to 8. The linear mixed model for motif discovery was fitted with GEMMA crPage 14 | 22 bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. 459 software (Zhou et al, 2013). The p-values were corrected for multiple testing using 460 Benjamini-Hochberg’s FDR. Motifs were subsequently manually assembled based on 461 overlapping significant k-mers. 462 463 Folding energy calculation 464 5’ ensemble folding energy was calculated with RNAfold from ViennaRNA version 465 2.1.9 (Lorenz et al, 2011), with default parameters. 5’ sequence was defined as 466 5’UTR sequence plus the first 10 codons within the coding region. Genes with 467 5’UTRs that are shorter than 30 bps were excluded from the analysis. 468 469 S. cerevisiae conservation analysis 470 The phastCons (Siepel et al, 2005) conservation track for S. cerevisiae was 471 downloaded from the UCSC Genome browser 472 (http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/phastCons7way/). Motif single- 473 nucleotide level conservation scores were computed as the mean conservation score 474 of each nucleotide (including 2 extended nucleotide at each side of the motif) across 475 all motif instances genome-wide (removing NA values). 476 477 Linear model for genome-wide half-life prediction (joint model) 478 Multivariate linear regression models were used to predict genome-wide mRNA half- 479 life on the logarithmic scale from sequence features. Only mRNAs that contain all 480 features were used to fit the models, resulting with 3,862 mRNAs for S. cerevisiae 481 and 3,130 mRNAs for S. pombe. Linear models were first fitted with all sequence 482 features as covariates, non-significant sequence features were then removed from 483 the final prediction models, ending up with 70 features for S. cerevisiae model and 73 484 features for S. pombe (each single coding codon was fitted as a single covariate). L1 485 or L2 regularization were not necessary, as they did not improve the out-of-fold 486 prediction accuracy (tested with glmnet R package (Friedman et al, 2010)). Out-of- 487 fold prediction was applied with 10-fold cross validation for any prediction task in this 488 study. A complete list of model features and their p-values for both yeast species are 489 provided in Table EV2. 490 491 Variance explained for linear model crPage 15 | 22 bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. 492 The percentage of explained variance for a linear model was calculated by 100 ∗ 493 (1 − 𝑣𝑎𝑟𝑟𝑒𝑠𝑖𝑑𝑢𝑎𝑙 / 𝑣𝑎𝑟𝑡𝑜𝑡𝑎𝑙 ), where 𝑣𝑎𝑟𝑟𝑒𝑠𝑖𝑑𝑢𝑎𝑙 and 𝑣𝑎𝑟𝑡𝑜𝑡𝑎𝑙 are the variance of 494 regression residual and the total half-life variance respectively. 495 496 Analysis of sequence feature contribution 497 The contribution of each sequence feature was analyzed individually as a univariate 498 regression and also jointly in a multivariate regression model. We quantified the 499 feature contributions in 10-fold cross-validation. The contribution of each feature 500 individually was calculated as the variance explained by a univariate model. We then 501 added the features in a descending order of their individual explained variance to a 502 joint model, and recorded the cumulative variance explained as adding more 503 features. As compensation to the issue that the additional variance explained by a 504 single feature to a joint model depends on the order of adding it, the drop of variance 505 explained upon leaving out one feature separately from the full model was also 506 assessed. 507 508 Single-nucleotide variant effect prediction 509 The effects of single-nucleotide variation (except for motifs) were predicted by 510 introducing a single nucleotide perturbation into the full prediction model for each 511 gene, and summarizing the effect with the median half-life change across all genes. 512 For example, the effect of a single CGT to CGA transition was assessed by 513 decreasing the count of CGT for each gene by one while increasing the count of CGA 514 for each gene (only perturbing genes that had at least one CGT). In the case of 515 length and GC content, we decreased the length or number of GC count by one for 516 each gene. Note that the side effects of varying the length, such as disrupting 517 reading-frame, were not considered. Only synonymous transitions were considered 518 for codons. 519 520 Motif single-nucleotide variant analysis 521 Effects of motif single-nucleotide variants were predicted with linear model modified 522 from (Eser et al, 2016). When assessing the effect of a given motif variation, we 523 controlled for the effect of all other sequence features using a linear model with the 524 other features as covariates. 525 crPage 16 | 22 bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. 526 Code availability 527 Analysis scripts are available at: https://github.com/s6juncheng/Cheng_et_al_2016 528 529 Acknowledgements 530 We are thankful to Fabien Bonneau (Max Planck Institute of Biochemistry) for helpful 531 discussions on motifs and RNA degradation pathways, as well as useful feedback on 532 the manuscript. We thank Björn Schwalb for communication on analyzing the 533 knockout data. We thank Vicente Yépez for useful feedback on the manuscript. JC 534 and ŽA is supported by a DFG fellowship through QBM. JG was supported by the 535 Bundesministerium für Bildung und Forschung, Juniorverbund in der Systemmedizin 536 “mitOmics” (grant FKZ 01ZX1405A). 537 538 References 539 540 Anderson JS & Parker RP (1998) The 3’ to 5' degradation of yeast mRNAs is a general 541 mechanism for mRNA turnover that requires the SKI2 DEVH box protein and 3' to 5' 542 exonucleases of the exosome complex. EMBO J. 17: 1497–506 543 544 545 546 Barbosa C, Peixeiro I & Romão L (2013) Gene Expression Regulation by Upstream Open Reading Frames and Human Disease. PLoS Genet. 9: e1003529 Barnes CA (1998) Upf1 and Upf2 proteins mediate normal yeast mRNA degradation when translation initiation is limited. Nucleic Acids Res. 26: 2433–2441 547 Bazzini AA, Viso F, Moreno-mateos MA, Johnstone TG & Charles E (2016) Codon identity 548 regulates mRNA stability and translation efficiency during the maternal-to-zygotic 549 transition. EMBO J.: 1–17 550 Boeck R, Tarun S, Rieger M, Deardorff JA, Mu S & Sachs AB (1996) The Yeast Pan2 551 Protein Is Required for Poly ( A ) -binding Protein-stimulated Poly ( A ) -nuclease Activity 552 J. Biol. Chem.. 271: 432–438 553 Boël G, Letso R, Neely H, Price WN, Wong K, Su M, Luff JD, Valecha M, Everett JK, Acton 554 TB, Xiao R, Montelione GT, Aalberts DP & Hunt JF (2016) Codon influence on protein 555 expression in E. coli correlates with mRNA levels. Nature 529: 358–363 556 Bonetti B, Fu L, Moon J & Bedwell DM (1995) The efficiency of translation termination is 557 determined by a synergistic interplay between upstream and downstream sequences in 558 Saccharomyces cerevisiae. J. Mol. Biol. 251: 334–45 559 560 Brown A, Shao S, Murray J, Hegde RS & Ramakrishnan V (2015) Structural basis for stop codon recognition in eukaryotes. Nature 524: 493–6 crPage 17 | 22 bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. 561 Brown CE, Tarun SZ, Boeck R, Sachs AB, Sachs AB & Chem JB (1996) PAN3 Encodes a 562 Subunit of the Pab1p-Dependent Poly ( A ) Nuclease in Saccharomyces cerevisiae. 563 Mol. Cell. Biol. 16: 5744–5753 564 Cai Y & Futcher B (2013) Effects of the Yeast RNA-Binding Protein Whi3 on the Half-Life and 565 Abundance of CLN3 mRNA and Other Targets. PLoS One 8: e84630 566 Caponigro G & Parker R (1996) Mechanisms and control of mRNA turnover in 567 568 569 Saccharomyces cerevisiae. Microbiol. Rev. 60: 233–249 Collart MA (2003) Global control of gene expression in yeast by the Ccr4-Not complex. Gene 313: 1–16 570 Colomina N, Ferrezuelo F, Wang H, Aldea M & Garí E (2008) Whi3, a developmental 571 regulator of budding yeast, binds a large set of mRNAs functionally related to the 572 endoplasmic reticulum. J. Biol. Chem. 283: 28670–28679 573 Cui Y, Hagan KW, Zhang S & Peltz SW (1995) Identification and characterization of genes 574 that are required for the accelerated degradation of mRNAs containing a premature 575 translational termination codon. Genes Dev. 9: 423–36 576 Duttagupta R, Tian B, Wilusz CJ, Danny T, Soteropoulos P, Ouyang M, Joseph P, Peltz SW, 577 Khounh DT & Dougherty JP (2005) Global Analysis of Pub1p Targets Reveals a 578 Coordinate Control of Gene Expression through Modulation of Binding and Stability 579 Global Analysis of Pub1p Targets Reveals a Coordinate Control of Gene Expression 580 through Modulation of Binding and Stability Mol. Cell. Biol.. 25: 5499–5513 581 Dvir S, Velten L, Sharon E, Zeevi D, Carey LB, Weinberger A & Segal E (2013) Deciphering 582 the rules by which 5 ’UTR sequences affect protein expression in yeast. Proc. Natl. 583 Acad. Sci. U. S. A. 110: E2792–E2801 584 Eser P, Wachutka L, Maier KC, Demel C, Boroni M, Iyer S, Cramer P & Gagneur J (2016) 585 Determinants of RNA metabolism in the Schizosaccharomyces pombe genome. Mol. 586 Syst. Biol. 12: 857 587 588 589 Friedman J, Hastie T & Tibshirani R (2010) Regularization Paths for Generalized Linear Models via Coordinate Descent. J. Stat. Softw. 33: 1–22 Gaba A, Jacobson A & Sachs MS (2005) Ribosome occupancy of the yeast CPA1 upstream 590 open reading frame termination codon modulates nonsense-mediated mRNA decay. 591 Mol. Cell 20: 449–460 592 Garí E, Volpe T, Wang H, Gallego C, Futcher B & Aldea M (2001) Whi3 binds the mRNA of 593 the G1 cyclin CLN3 to modulate cell fate in budding yeast. Genes Dev. 15: 2803–2808 594 Garneau NL, Wilusz J & Wilusz CJ (2007) The highways and byways of mRNA decay. Nat. 595 596 597 Rev. Mol. Cell Biol. 8: 113–126 Geisberg J V., Moqtaderi Z, Fan X, Ozsolak F & Struhl K (2014) Global analysis of mRNA isoform half-lives reveals stabilizing and destabilizing elements in yeast. Cell 156: 812– crPage 18 | 22 bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. 598 599 824 Gerber AP, Herschlag D & Brown PO (2004) Extensive association of functionally and 600 cytotopically related mRNAs with Puf family RNA-binding proteins in yeast. PLoS Biol. 601 2: e79 602 603 604 Goldstrohm AC, Seay DJ, Hook BA & Wickens M (2007) PUF protein-mediated deadenylation is catalyzed by Ccr4p. J. Biol. Chem. 282: 109–114 Gupta I, Clauder-Münster S, Klaus B, Järvelin AI, Aiyar RS, Benes V, Wilkening S, Huber W, 605 Pelechano V & Steinmetz LM (2014) Alternative polyadenylation diversifies post- 606 transcriptional regulation by selective RNA-protein interactions. Mol. Syst. Biol. 10: 719 607 He F, Li X, Spatrick P, Casillo R, Dong S & Jacobson A (2003) Genome-Wide Analysis of 608 mRNAs Regulated by the Nonsense-Mediated and 5'to 3' mRNA Decay Pathways in 609 Yeast. Mol. Cell 12: 1439–1452 610 Hoekema A, Kastelein ROBA, Vasser M & Boer HADE (1987) Codon Replacement in the 611 PGKI Gene of Saccharomyces cerevisiae : Experimental Approach To Study the Role of 612 Biased Codon Usage in Gene Expression. 7: 2914–2924 613 Holmes KJ, Klass DM, Guiney EL & Cyert MS (2013) Whi3, an S. cerevisiae RNA-binding 614 protein, is a component of stress granules that regulates levels of its target mRNAS. 615 PLoS One 8: 10–12 616 Hsu CL & Stevens A (1993) Yeast cells lacking 5’-->3' exoribonuclease 1 contain mRNA 617 species that are poly(A) deficient and partially lack the 5' cap structure. Mol. Cell. Biol. 618 13: 4826–4835 619 620 621 622 623 Huch S & Nissan T (2014) Interrelations between translation and general mRNA degradation in yeast. Wiley Interdiscip. Rev. RNA 5: 747–763 Hug N, Longman D & Cáceres JF (2015) Mechanism and regulation of the nonsensemediated decay pathway. Nucleic Acids Res. 44: 1483–1495 Jungreis I, Lin MF, Spokony R, Chan CS, Negre N, Victorsen A, White KP & Kellis M (2011) 624 Evidence of abundant stop codon readthrough in Drosophila and other metazoa. 625 Genome Res. 21: 2096–2113 626 627 628 629 630 Kozak M (1986) Point mutations define a sequence flanking the AUG initiator codon that modulates translation by eukaryotic ribosomes. Cell 44: 283–292 Kozak M (1987) At least six nucleotides preceding the AUG initiator codon enhance translation in mammalian cells. J. Mol. Biol. 196: 947–950 LaGrandeur T & Parker R (1999) The cis acting sequences responsible for the differential 631 decay of the unstable MFA2 and stable PGK1 transcripts in yeast include the context of 632 the translational start codon. RNA 5: 420–433 633 634 Lee RC (1993) The C . elegans Heterochronic Gene lin-4 Encodes Small RNAs with Antisense Complementarity to & II-14. Cell 75: 843–854 crPage 19 | 22 bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. 635 636 Leeds P, Wood JM, Lee B & Culbertson MR (1992) Gene Products That Promote mRNA Turnover in Saccharomyces cerevisiaet. Mol. Cell. Biol. 12: 2165–2177 637 Lorenz R, Bernhart SH, Höner zu Siederdissen C, Tafer H, Flamm C, Stadler PF, Hofacker 638 IL, Thirumalai D, Lee N, Woodson S, Klimov D, Tinoco I, Uhlenbeck O, Levine M, 639 Tinoco I, Borer P, Dengler B, Levine N, Uhlenbeck O, Crothers D, et al (2011) 640 ViennaRNA Package 2.0. Algorithms Mol. Biol. 6: 26 641 McCaughan KK, Brown CM, Dalphin ME, Berry MJ & Tate WP (1995) Translational 642 termination efficiency in mammals is influenced by the base following the stop codon. 643 Proc. Natl. Acad. Sci. U. S. A. 92: 5431–5435 644 645 646 Mishima Y & Tomari Y (2016) Codon Usage and 3′ UTR Length Determine Maternal mRNA Stability in Zebrafish. Mol. Cell 61: 874–885 Muhlrad D, Decker CJ & Parker R (1994) Deadenylation of the unstable mRNA encoded by 647 the yeast MFA2 gene leads to decapping followed by 5’ ??? 3' digestion of the 648 transcript. Genes Dev. 8: 855–866 649 650 651 652 Muhlrad D, Decker CJ & Parker ROY (1995) Turnover Mechanisms of the Stable Yeast PGK1 mRNA. Mol. Cell. Biol. 15: 2145–2156 Olivas W & Parker R (2000) The Puf3 protein is a transcript-specific regulator of mRNA degradation in yeast. EMBO J. 19: 6602-6611 653 Parker R (2012) RNA degradation in Saccharomyces cerevisae. Genetics 191: 671–702 654 Pelechano V, Wei W & Steinmetz LM (2013) Extensive transcriptional heterogeneity 655 656 657 658 revealed by isoform profiling. Nature 497: 127–31 Pilkington GR & Parker R (2008) Pat1 contains distinct functional domains that promote Pbody assembly and activation of decapping. Mol. Cell. Biol. 28: 1298–1312 Presnyak V, Alhusaini N, Chen Y-H, Martin S, Morris N, Kline N, Olson S, Weinberg D, Baker 659 KE, Graveley BR & Coller J (2015) Codon Optimality Is a Major Determinant of mRNA 660 Stability. Cell 160: 1111–1124 661 Rabani M, Raychowdhury R, Jovanovic M, Rooney M, Stumpo DJ, Pauli A, Hacohen N, 662 Schier AF, Blackshear PJ, Friedman N, Amit I & Regev A (2014) High-Resolution 663 Sequencing and Modeling Identifies Distinct Dynamic RNA Regulatory Strategies. Cell 664 159: 1698–1710 665 Radhakrishnan A, Chen Y-H, Martin S, Alhusaini N, Green R & Coller J (2016) The DEAD- 666 Box Protein Dhh1p Couples mRNA Decay and Translation by Monitoring Codon 667 Optimality. Cell 167: 122–128.e9 668 669 670 671 Roy B & Jacobson A (2013) The intimate relationships of mRNA decay and translation. Trends Genet. 29: 691–699 Sabi R & Tuller T (2014) Modelling the Efficiency of Codon-tRNA Interactions Based on Codon Usage Bias. DNA Res.: 1–15 crPage 20 | 22 bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. 672 673 674 675 676 Schwalb B, Michel M, Zacher B, Demel C, Tresch A & Gagneur J (2016) TT-seq maps the human transient transcriptome. Science 352: 1225-1228 Schwanhäusser B, Busse D & Li N (2011) Global quantification of mammalian gene expression control. Nature 473: 337–342 Schwartz DC & Parker R (1999) Mutations in translation initiation factors lead to increased 677 rates of deadenylation and decapping of mRNAs in Saccharomyces cerevisiae. Mol. 678 Cell. Biol. 19: 5247–5256 679 Shalem O, Sharon E, Lubliner S, Regev I, Lotan-Pompan M, Yakhini Z & Segal E (2015) 680 Systematic Dissection of the Sequence Determinants of Gene 3’ End Mediated 681 Expression Control. PLOS Genet. 11: e1005147 682 She M, Decker CJ, Svergun DI, Round A, Chen N, Muhlrad D, Parker R & Song H (2008) 683 Structural Basis of Dcp2 Recognition and Activation by Dcp1. Mol. Cell 29: 337–349 684 Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth 685 J, Hillier LW, Richards S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ, Miller W & 686 Haussler D (2005) Evolutionarily conserved elements in vertebrate, insect, worm, and 687 yeast genomes. Genome Res. 15: 1034–1050 688 Sun M, Schwalb B, Pirkl N, Maier KC, Schenk A, Failmezger H, Tresch A & Cramer P (2013) 689 Global analysis of Eukaryotic mRNA degradation reveals Xrn1-dependent buffering of 690 transcript levels. Mol. Cell 52: 52–62 691 Tucker M, Valencia-sanchez MA, Staples RR, Chen J, Denis CL & Parker R (2001) The 692 Transcription Factor Associated Ccr4 and Caf1 Proteins Are Components of the Major 693 Cytoplasmic mRNA Deadenylase in Saccharomyces cerevisiae. Cell 104: 377–386 694 Tuller T & Zur H (2014) Multiple roles of the coding sequence 5’ end in gene expression 695 696 697 698 regulation. Nucleic Acids Res. 43: 13–28 Vasudevan S & Peltz SW (2001) Regulated ARE-mediated mRNA decay in Saccharomyces cerevisiae. Mol. Cell 7: 1191–1200 Vogel C, Abreu R de S, Ko D, Le S-YY, Shapiro BA, Burns SC, Sandhu D, Boutz DR, 699 Marcotte EM & Penalva LO (2010) Sequence signatures and mRNA concentration can 700 explain two-thirds of protein abundance variation in a human cell line. Mol. Syst. Biol. 6: 701 400 702 703 704 Wang X, Hou J, Quedenau C, Chen W (2016) Pervasive isoform-specific translational regulation via alternative transcription start sites in mammals. Mol. Syst. Biol. 12: 875 Williams I, Richardson J, Starkey A & Stansfield I (2004) Genome-wide prediction of stop 705 codon readthrough during translation in the yeast Saccharomyces cerevisiae. Nucleic 706 Acids Res. 32: 6605–6616 707 708 Yamashita A, Chang T-C, Yamashita Y, Zhu W, Zhong Z, Chen C-YA & Shyu A-B (2005) Concerted action of poly(A) nucleases and decapping enzyme in mammalian mRNA crPage 21 | 22 bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. 709 turnover. Nat. Struct. Mol. Biol. 12: 1054–1063 710 Zeisel A, Köstler WJ, Molotski N, Tsai JM, Krauthgamer R, Jacob-Hirsch J, Rechavi G, Soen 711 Y, Jung S, Yarden Y & Domany E (2011) Coupled pre-mRNA and mRNA dynamics 712 unveil operational strategies underlying transcriptional responses to stimuli. Mol. Syst. 713 Biol. 7: 529 714 715 716 717 Zhou X, Carbonetto P & Stephens M (2013) Polygenic Modeling with Bayesian Sparse Linear Mixed Models. PLoS Genet. 9: e1003264 Zur H & Tuller T (2013) New Universal Rules of Eukaryotic Translation Initiation Fidelity. 9: e1003136 718 crPage 22 | 22 bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. Sequence features explain most of the mRNA stability variation across genes Jun Cheng, Žiga Avsec, Julien Gagneur November 2016 1 Main Figures Data Genome-wide mRNA half-lives in S. cerevisiae: wild-type + 34 strains knocked out for RNA degradation pathway genes (Sun et al. Mol. Cell. 2013) Analysis Dependence on degradation pathways Cis-regulatory elements AUG AAA AUG …CGA…GGU… UGAC AUAUUC AAAAAA… • Novel elements discovery • Integrate novel and known elements • Single-nucleotide effect estimation Integrative model Out-of-sample predictions Relative contributions CRE1 Single-nucleotide effects ...CATACGATA... ...CATAGGATA... CRE2 CRE3 Fold-change Percentage of variance explained by (%) Figure 1: Study overview. The goal of this study is to discover and integrate cis-regulatory mRNA elements affecting mRNA stability and assess their dependence on mRNA degradation pathways. Data) We obtained S. cerevisiae genome-wide half-life data from wild-type (WT) as well as from 34 knockout strains from Sun et al. 2013. Each of the knockout strains has one gene closely related to mRNA degradation pathways knocked out. Analysis) We systematically searched for novel sequence features associating with half-life from 5’UTR, start codon context, CDS, stop codon context, and 3’UTR. Effects of previously reported cis-regulatory elements were also assessed. Moreover, we assessed the dependencies of different sequence features on degradation pathways by analyzing their effects in the knockout strains. Integrative model) We build a statistical model to predict genome-wide half-life solely from mRNA sequence. This allowed the quantification of the relative contributions of the sequence features to the overall variation across genes and assessing the sensitivity of mRNA stability with respect to single-nucleotide variants. bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. A Stabilizing elements Destabilizing elements Elements effect sign context dependent TSS CDS 5' UTR Stop codon context Start codon context ATG AAACAAA Secondary structure GC-content Length B 3' UTR Codon usage GC-content Length 5' UTR TGTAAATA (Puf3) TGCAT (Whi3) TTTTTTA ATATTC GC-content Length 3' UTR ATG (uAUG) AAACAAA log10(FDR) Enriched ATATTC 4 2 TGCAT 0 -2 -4 TGTAAATA Depleted TTTTTTA TGTAAATA TGCAT ATATTC AAACAAA ATG (uAUG) TTTTTTA 0 200 400 600 800 1000 Number of mRNAs Figure 2: Overview of identified or collected sequence features. (A) Sequence features that were identified or collected from different sequence regions in this study. When applicable, stabilizing elements are shown in blue, destabilizing in red. (B) Co-occurrence significance (FDR, Fisher test p-value corrected with Benjamini-Hochberg) between different motifs (left). Number of occurrences among the 4,388 mRNAs (right). bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. A WT xrn1Δ upf2Δ upf3Δ 40 3631 20 3648 3647 10 739 3648 738 739 5 P < 2e-16 Median FC = 1.37 1 no uAUG P = 0.022 Median FC = 1.07 with uAUG no uAUG S. cerevisiae B 501 238 5 20 1 1 Number of uAUGs >=2 3649 203 P = 0.75 519 5 no PTC no uAUG in-frame no PTC with PTC out-frame PTC in CDS P = 1.7e-13 150 P = 0.098 17 10 with uAUG S. pombe D 40 1 0 no uAUG P < 2e-16 P = 0.53 Half-life [min] 10 with uAUG P = 8e-13 P = 0.013 3648 P = 8e-04 Median FC = 1.05 P < 2e-16 40 20 no uAUG S. cerevisiae 150 P < 2e-16 150 P = 1e-04 Median FC = 1.09 with uAUG C P < 2e-16 half-life [min] 738 uORF Half-life [min] Half-life [min] 150 P = 0.99 40 20 2051 11 10 1377 196 5 1 no PTC no uAUG in-frame no PTC with PTC out-frame PTC in CDS uORF Figure 3: Upstream AUG codon (uAUG) destabilize mRNA. (A) Distribution of mRNA half-life for mRNAs without uAUG (left) and with at least one uAUG (right) in, from left to right: wild type, XRN1, UPF2 and UPF3 knockout S. cerevisiae strains. Median fold-change (Median FC) calculated by dividing the median of the group without uAUG with the group with uAUG. (B) Distribution of mRNA half-lives for mRNAs with zero (left), one (middle), or more (right) uAUGs in S. cerevisiae. (C) Distribution of mRNA half-lives for S. cerevisiae mRNAs with, from left to right: no uAUG, with one in-frame uAUG but no cognate premature termination codon, with one out-of-frame uAUG and one cognate premature termination codon in the CDS, and with one uAUG and one cognate stop codon in the 5’UTR (uORF). (D) Same as in (C) for S. pombe mRNAs. All p-values were calculated with Wilcoxon rank-sum test. Numbers in the boxes indicate number of members in the corresponding group. Boxes represent quartiles, whiskers extend to the highest or lowest value within 1.5 times the interquartile range and horizontal bars in the boxes represent medians. Data points falling further than 1.5-fold the interquartile distance are considered outliers and are shown as dots. bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. S. cerevisiae P = 9.2e-08 40 2736 872 419 360 5 1 T C Half-life [min] 5 20 100 Deadenylation Decapping Exosome RBP Surveillance WT XRN1 Knockouts S. cerevisiae D P = 1.8e-09 P = 0.0049 P = 7.5e-07 150 Half-life [min] Spearman rho=0.55 R-squared=0.3 * Stop_codon 40 20 486 10 798 624 151 227 378 TAA 294 141 248 491 442 108 TAG TGA 5 TGAC TGAT TGAA stop codon\+1 TGAG TAGC TAGT TAGA 0.35 0.45 0.55 0.70 S. cerevisiae-specific tAI TAGG 1 1 TAAC B G start codon\-3 Group 0 TAAT A ** ** 20 TAAA 10 * 40 *** TAAG 20 60 WT ccr4Δ caf40Δ not3Δ pop2Δ pan2Δ pan3Δ dhh1Δ pat1Δ lsm1Δ dcs1Δ lsm7Δ lsm6Δ scd6Δ edc3Δ dcs2Δ edc2Δ rrp6Δ rrp47Δ ski3Δ ski2Δ ski7Δ ski8Δ pub1Δ puf4Δ puf6Δ puf3Δ puf5Δ puf1Δ puf2Δ upf3Δ upf2Δ hbs1Δ dom34Δ xrn1Δ P = 6.9e-12 150 half-life [min] C P = 1.7e-11 Codon-explained-variance (%) A Figure 4: Translation initiation, elongation and termination features associate with mRNA half-life. (A) Distribution of half-life for mRNAs grouped by the third nucleotide before the start codon. Group sizes (numbers in boxes) show that nucleotide frequency at this position positively associates with half-life. (B) mRNA half-life (y-axis) versus species-specific tRNA adaptation index (sTAI) (x-axis). (C) mRNA half-life explained variance (y-axis, Materials and Methods) in wild-type (WT) and across all 34 knockout strains (grouped according to their functions). Each blue dot represents one replicate, bar heights indicate means across replicates. Bars with a red star are significantly different from the wild type level (FDR <0.1, Wilcoxon rank-sum test, followed by Benjamini-Hochberg correction). (D) Distribution of half-life for mRNAs grouped by the stop codon and the following nucleotide. Colors represent three different stop codons (TAA, TAG and TGA), within each stop codon group, boxes are shown in G, A, T, C order of their following base. Only the P-values for the most drastic pairwise comparisons (A versus C within each stop codon group) are shown. All p-values in boxplots were calculated with Wilcoxon rank-sum test. Boxplots computed as in Fig 3. bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. A ATATTC FDR=1.96e−05 TGCAT FDR=6.94e−04 TGTAAATA FDR=3.2e−05 TTTTTTA FDR=0.086 150 Half-life [min] 40 20 3797 10 529 62 1 >=2 3492 4179 769 100 27 1 2 >=3 94 610 3684 209 5 1 0 Motif density B 0.4 0.3 0.2 0.1 0.0 0 ATATTC 150 100 50 TGCAT 0 150 100 50 Effects on half-life base ATATTC >=1 0 TGTAAATA 0 150 100 Distance from poly(A) site C 1.3 1.2 1.1 1.0 0.9 0 Occurence per transcript A C G TGCAT 50 1 >=2 TTTTTTA 0 150 100 50 0 T TGTAAATA TTTTTTA -2 -1 1 2 3 4 5 6 +1 +2 -2 -1 1 2 3 4 5 +1 +2 -2 -1 1 2 3 4 5 6 7 8 +1 +2 -2 -1 1 2 3 4 5 6 7 +1 +2 -2 -1 1 2 3 4 5 6 +1 +2 -2 -1 1 2 3 4 5 +1 +2 -2 -1 1 2 3 4 5 6 7 8 +1+2 -2 -1 1 2 3 4 5 6 7 +1 +2 Frequency D 1.00 0.75 0.50 0.25 0.00 conservation E ATATTC 0.70 0.65 0.60 0.55 -2 -1 1 2 3 4 5 6 +1 +2 0.325 0.300 0.275 0.250 0.225 Position TGCAT TGTAAATA 0.9 0.22 0.7 2 3 4 5 +1 +2 TTTTTTA 0.24 0.8 -2 -1 1 0.26 -2 -1 1 2 3 4 5 6 7 8 +1+2 Position 0.20 -2 -1 1 2 3 4 5 6 7 +1 +2 Figure 5: 3’UTR half-life determinant motifs in S. cerevisiae. (A) Distribution of half-lives for mRNAs grouped by the number of occurrence(s) of the motif ATATTC, TGCAT (Whi3), TGTAAATA (Puf3) and TTTTTTA respectively in their 3’UTR sequence. Numbers in the boxes represent the number of members in each box. FDR were reported from the linear mixed effect model (Materials and Methods). (B) Fraction of transcripts containing the motif (y-axis) within a 20-bp window centered at a position (x-axis) with respect to poly(A) site for different motifs (facet titles). Positional bias was not observed when aligning 3’UTR motifs with respect to the stop codon. (C) Prediction of the relative effect on half-life (y-axis) for single-nucleotide substitution in the motif with respect to the consensus motif (y=1, horizontal line). The motifs were extended 2 bases at each flanking site (positions +1, +2, -1, -2). (D) Nucleotide frequency within motif instances, when allowing for one mismatch compared to the consensus motif. (E) Mean conservation score (phastCons, Materials and Methods) of each base in the consensus motif with 2 flanking nucleotides (y-axis). bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. S. cerevisiae S. pombe R-squared=0.60 R-squared=0.42 50 20 1 5 20 50 Measured half-life [min] 100 B 5 Measured half-life [min] A 50 100 1 Predicted half-life [min] C Individual Cumulative Drop Codon usage CREs 55.90 55.92 44.23 uAUG 4.46 56.20 0.05 3' UTR motifs 5.53 57.59 1.29 UTR5_length 4.24 58.41 0.56 CDS_length 3.70 59.82 1.58 GC_content_UTR5 1.27 60.09 0.22 0.28 60.17 0.10 50 Effect of single-nucleotide variations 1.3 1.2 1.1 1.0 U C D S_ UTR3_length 20 Predicted half-life [min] D Contribution of each sequence features (%) 5 le ng TR th 3_ le ng U TR th 5_ le ng th G C _c co on do te n nt _U TR 5 uA U G TT TT TT A TG CA T AT AT TC TG TA AA TA 20 Expected half-life fold-change 5 E Figure 6: Genome-wide prediction of mRNA half-lives from sequence features and analysis of the contributions. (A-B) mRNA half-lives predicted (x-axis) versus measured (y-axis) for S. cerevisiae (A) and S. pombe (B) respectively. (C) Contribution of each sequence feature individually (Individual), cumulatively when sequentially added into a combined model (Cumulative) and explained variance drop when each single feature is removed from the full model separately (Drop). Values reported are the mean of 100 times cross-validated evaluation (Materials and Methods). (D) Expected half-life fold-change of single-nucleotide variations on sequence features. For length and GC, dot represent median half-life fold change of one nucleotide shorter or one G/C to A/T transition respectively. For codon usage, each dot represents median half-life fold-change of one type of synonymous mutation, all kinds of synonymous mutations are considered. For uAUG, each dot represents median half-life foldchange of mutating out one uAUG. For motifs, each dot represents median half-life fold-change of one type of nucleotide transition at one position on the motif (Materials and Methods). Medians are calculated across all mRNAs. (E) Overview of conclusions. 2 bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. ExpandednotView bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. S. cerevisiae S. pombe Spearman rho=-0.17 Spearman rho=-0.27 B 50 10 half-life [min] 200 50 5 10 1 1 2 2 half-life [min] P = 2.1e-11 200 P = 3.4e-12 5 A 500 2000 10 20 S. cerevisiae Spearman rho=-0.23 C 500 2000 P = 4.2e-15 200 half-life [min] 200 50 5 10 half-life [min] 1 1 2 2 200 S. pombe Spearman rho=-0.32 D P < 2e-16 50 5'UTR length 50 50 100 5'UTR length 10 20 5 10 500 1000 5000 CDS length 100 S. cerevisiae Spearman rho=-0.06 P = 5.7e-12 200 half-life [min] 200 50 5 10 1 1 2 2 half-life [min] S. pombe Spearman rho=-0.23 F P = 1.4e-08 2000 5000 50 E 500 CDS length 10 200 5 100 500 10 50 1000 S. pombe Spearman rho=-0.24 H P < 2e-16 -50 half-life [min] -100 -1 -2 -10 200 3'UTR length -200 P = 7.8e-12 5 -20 50 3'UTR length S. cerevisiae Spearman rho=-0.21 G half-life [min] 5 -5 1 1e-02 1e-01 1e+00 1e+01 5' Free Energy 1e+02 2 5 10 20 50 200 500 5' Free Energy Figure EV1: Length of 5’UTR, CDS and 3’UTR as well as 5’ folding energy correlate with mRNA half-life. (A-B) 5’UTR length (x-axis) versus half-life (y-axis) for S. cerevisiae (A) and S. pombe (B). (C-D) CDS length (x-axis) versus half-life (y-axis) for S. cerevisiae (C) and S. pombe (D). (E-F) 3’UTR length (x-axis) versus half-life (y-axis) for S. cerevisiae (E) and S. pombe (F). (G-H) 5’ free energy (x-axis) versus half-life (y-axis) for S. cerevisiae (G) and S. pombe (H). bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. S. cerevisiae Spearman rho=-0.11 A S. pombe Spearman rho=0.16 P < 2e-16 1 1 2 2 5 5 10 10 50 50 half-life [min] half-life [min] 200 200 P = 2.8e-14 0.1 0.2 0.3 0.4 0.6 GC content 5'UTR 0.2 0.3 0.4 0.5 0.7 GC content 5'UTR S. cerevisiae Spearman rho=0.27 S. pombe Spearman rho=0.24 P < 2e-16 1 1 2 2 5 5 10 10 50 50 half-life [min] half-life [min] 200 200 P < 2e-16 0.40 0.45 0.55 0.30 S. cerevisiae Spearman rho=-0.17 0.35 0.40 0.45 0.50 GC content CDS S. pombe Spearman rho=0.05 P < 2e-16 P = 0.012 1 1 2 2 5 5 10 10 50 50 half-life [min] half-life [min] 200 E 0.35 GC content CDS 200 0.30 0.2 0.3 0.4 GC content 3'UTR 0.5 0.15 0.20 0.25 0.30 GC content 3'UTR 0.40 Figure EV2: GC content of 5’UTR, CDS and 3’UTR correlate with mRNA half-life (A-B) 5’UTR GC content (x-axis) versus half-life (y-axis) for S. cerevisiae (A) and S. pombe. (C-D) CDS GC content (x-axis) versus half-life (y-axis) for S. cerevisiae (C) and S. pombe (D). (E-F) 3’UTR GC content (x-axis) versus half-life (y-axis) for S. cerevisiae (E) and S. pombe (F). Information content bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. A S. cerevisiae 0.6 0.5 0.4 0.3 0.2 0.1 0 −6 −5 −4 −3 −2 −1 A Information content B T G 1 2 3 4 5 6 1 2 3 4 5 6 S. pombe 0.6 0.5 0.4 0.3 0.2 0.1 0 −6 −5 −4 −3 −2 −1 A T G Position C F S. cerevisiae 150 P = 7.5e−05 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 10 2241 S. pombe P = 0.013 ● 150 ● ● ● ● ● ● ● ● ● ● ● ● 40 20 stop codon\+1 2146 5 TGAC T ● TGAA G start codon\+3 E half−life [min] ● 46 ● ● ● ● TGAG A ● ● ● ● ● ● TAAG ● ● ● ● ● ● 148 284 241 102 ● 1 1 P = 0.0076 TAGC 5 232 TAGT 505 316 118 166 TAGA 10 294 692 TAAC 578 321 948 TAAT 2237 40 20 10 5 TAAA half−life [min] 40 P = 0.0091 TGAT 150 ● ● 20 P = 8.2e−08 half−life [min] ● P = 8e−06 P < 2e−16 P = 7.6e−05 half−life [min] 150 S. Pombe D TAGG S. Pombe C ● ● ● 40 20 10 1840 1774 5 ● ● ● ● ● ● 1 FALSE TRUE 3'UTR with in−frame stop codon ● ● ● ● ● ● 1 FALSE TRUE 3'UTR with in−frame stop codon Figure EV3: Start codon context (Kozak sequence) and S. pombe half-life versus start and stop codon context. (A-B) Start codon context (Kozak sequence) generated from 4388 S. cerevisiae genes (A) and 3713 S. pombe genes (B). (C) Distribution of half-life for mRNAs grouped by the third nucleotide before the start codon for S. pombe. Group sizes (numbers in boxes) show that nucleotide frequency at this position positively associates with half-life. (D) Distribution of half-life for mRNAs grouped by the stop codon and the following nucleotide for S. pombe. Colors represent three different stop codons (TAA, TAG and TGA), within each stop codon group, boxes are shown in G, A, T, C order of their following base. Only the P-values for the most drastic pairwise comparisons (A versus C within each stop codon group) are shown. (E) Distribution of half-life for mRNAs grouped by with or without additional 3’UTR in-frame stop codon for S. cerevisiae. 30 bases window after the main stop codon was considered. (F) Same as (E) for S. pombe. All p-values in boxplot were calculated with Wilcoxon rank-sum test. Boxplots computed as in Fig 3. bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. AAACAAA FDR=0.049 A half-life (mins) 150 40 20 228 4160 10 5 1 0 effects on half-life B Frequency C AAAAACAAAAA 1.3 1.2 1.1 1.0 0.9 0.8 1.00 0.75 0.50 0.25 0.00 base A C G T 1 2 3 4 5 6 7 positions 8 9 10 11 -2 -1 1 2 3 4 5 6 7 +1 +2 D conservation >= 1 AAACAAA 0.16 0.14 0.12 0.10 0.08 -2 -1 1 2 3 4 5 Position 6 7 +1 +2 Figure EV4: S. cerevisiae 5’UTR mRNA half-life associated motif. (A) Distribution of half-lives for mRNAs grouped by the number of occurrence(s) of the motif AAACAAA in their 5’UTR sequence. Numbers in the boxes represent the number of members in each box. FDR were reported from the linear mixed effect model (Materials and Methods). (B) Prediction of the relative effect on half-life (y-axis) for single-nucleotide substitution in the motif with respect to the consensus motif (y=1, horizontal line). The motifs were extended 2 bases at each flanking site (positions +1, +2, -1, -2). (C) Nucleotide frequency within motif instances, when allowing for one mismatch compared to the consensus motif. (D) Mean conservation score (phastCons, Materials and Methods) of each base in the consensus motif with 2 flanking nucleotides (y-axis). bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. xrn1Δ dom34Δ hbs1Δ upf2Δ upf3Δ puf2Δ puf1Δ puf5Δ puf3Δ puf6Δ puf4Δ pub1Δ ski8Δ ski7Δ ski2Δ ski3Δ rrp47Δ rrp6Δ edc2Δ dcs2Δ edc3Δ scd6Δ lsm6Δ lsm7Δ dcs1Δ lsm1Δ pat1Δ dhh1Δ pan3Δ pan2Δ pop2Δ not3Δ caf40Δ ccr4Δ WT * * * * * * * * * * * * * * * * * * * * * * * Effects * * * 0.2 0.0 -0.2 * * * * * * 0.4 * * * * * * * * * * * * * * * * * * * * * * * * * * * * A TT TT TT AT C TG TC A AT AT AT A AG TA TG e TG - g AC usa TG on d A co C-3 gy St ner A e A 5' CA A AA G U uA Figure EV5: Summary of CREs effect changes across all 34 knockouts comparing with WT. Colour represent the relative effect size (motifs, St-3 C-A, TGAG-TGAC, uAUG), correlation (5’ folding energy) or explained variance (codon usage) upon knockout of different genes (y-axis) (Materials and Methods for detailed description). Wild-type label is shown in the bottom (WT) P-values calculated with Wilcoxon rank-sum test by comparing each mutant to wild-type level, multiple testing p-values corrected with Bonferroni & Hochberg (FDR). Stars indicating significance of statistical testing (FDR <0.1). 5’ energy: correlation of 5’end (5’UTR plus first 10 codons) folding energy with mRNA half-lives; St-3 C-A: relative median half-life difference between genes with cytosine and adenine at start codon -3 position; TGAG-TGAC: relative median half-life difference between genes with stop codon +1 TAAG and TGAC. codon usage: codon usage explained mRNA half-lives variance. uAUG: relative median half-life difference between genes without and with upstream AUG in the 5’UTR (Materials and Methods) codons bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. AAT AAC TAT GGC GGA GAT GGG TAC GAC TTT TGT TTC TGC CTC CTA AGC AGT GCG GCT GAG GGT CAT CTT TCG CAC GAA TCT TCA GCC CTG GCA CAG ACG CAA ACT ACC TTA CCT CCC CGG ACA AGG CGC TCC ATC ATT CCA CCG AAA AAG AGA TTG CGA CGT ATA AAC AAT TAC GGT GGA GGG GGT GGC GGG GAC GGT GGA GGC TAT GAT TTC TGC TTT TGT CTT CTA CTG CTGCTC CTT TTA AGT AGC GCTGCC GCA GCC GCG GCA GAA GGC GGA GGG CAC CTG CTACTC TCC TCT TCA CAT GAG TCC TCG TCA TCC TCT TCG GCT GCG GCA TTG CTA CTC CTT GCC GCT GCG CAA ACT ACC ACA CAG ACC ACG ACA ACT ACG ACA TTG CTA CCA CCC CCG CCA CCT CCG CGACGT CGC AGG ACC ACT ACG CGG AGA CGA CGT CGG TCT TCG TCA ATA ATT ATC ATA CCT CCC CCG CCA CCT CCC AAG AAA CGA AGG CTG TTA CGG CGC AGA CGC CGG ATC ATT 1.00 1.01 1.02 Expected half-life fold-change CGT CGA 1.03 Figure EV6: Predicted effects of synonymous codon transitions on half-life. Expected half-life fold-change (x-axis) at each synonymous codon transitions. Each row represent transition from one codon (y-axis) to its synonymous partners. Only synonymous codons that differ by one base were considered.
© Copyright 2026 Paperzz