Sequence features explain most of the mRNA stability

bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
1
Sequence features explain most of the mRNA
2
stability variation across genes
3
4
Jun cheng1,2, Žiga Avsec1,2, Julien Gagneur1,2,*
5
1. Department of Informatics, Technical University of Munich, Boltzmannstraße 3,
6
85748 Garching, Germany
7
2. Graduate School of Quantitative Biosciences (QBM), Ludwig-Maximilians-
8
Universität München, Germany
9
* Correspondence: [email protected]
10
11
Abstract
12
The stability of messenger RNA (mRNA) is one of the major determinants of gene
13
expression. Although a wealth of sequence elements and mechanisms regulating
14
mRNA stability has been described, their quantitative contributions in determining
15
mRNA half-life is unknown. Here, we built quantitative models for two eukaryotic
16
genomes Saccharomyces cerevisiae and Schizosaccharomyces pombe that, for the
17
first time, explain most of the half-life variation between genes based on mRNA
18
sequence alone. The models integrate known functional cis-regulatory elements,
19
identify novel ones, and quantify their contributions at single-nucleotide resolution.
20
We show quantitatively that codon usage is the major determinant of mRNA stability,
21
and that this effect depends on canonical mRNA degradation pathways. Altogether,
22
our results integrate and quantitatively delineate mRNA stability cis-regulatory
23
elements and provide a methodology that can serve as a scaffold to study the
24
function of cis-regulatory elements and to discover novel ones.
25
26
Introduction
27
The stability of messenger RNAs is an important aspect of gene regulation. It
28
influences the overall cellular mRNA concentration, as mRNA steady-state levels are
29
the ratio of synthesis and degradation rate. Moreover, low stability confers high
30
turnover to mRNA and therefore the capacity to rapidly reach a new steady-state
31
level in response to a transcriptional trigger. Hence, stress genes, which must rapidly
crPage 1 | 22
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
32
respond to environmental signals, show low stability (Zeisel et al, 2011; Rabani et al,
33
2014). In contrast, high stability provides robustness to variations in transcription.
34
Accordingly, a wide range of mRNA-half-lives is observed in eukaryotes, with typical
35
variations in a given genome spanning one to two orders of magnitude
36
(Schwanhäusser et al, 2011; Schwalb et al, 2016; Eser et al, 2016).
37
38
How mRNA stability is encoded in a gene sequence has long been a subject of
39
study. Cis-regulatory elements (CREs) affecting mRNA stability are mainly encoded
40
in the mRNA itself. They include but are not limited to secondary structure, sequence
41
motifs present in the 3’UTR including binding sites of RNA-binding proteins (Olivas &
42
Parker, 2000; Vasudevan & Peltz, 2001), and, in higher eukaryotes, microRNAs (Lee,
43
1993). Moreover, codon usage, which affects translation elongation rate, also
44
regulates mRNA stability (Hoekema et al, 1987; Presnyak et al, 2015; Boël et al,
45
2016; Mishima & Tomari, 2016). For instance, inserting strong secondary structure
46
elements in the 5’UTR or modifying the translation start codon context strongly
47
destabilizes the long-lived PGK1 mRNA in S. cerevisiae (Muhlrad et al, 1995;
48
LaGrandeur & Parker, 1999).
49
50
Since the RNA degradation machineries are well conserved among eukaryotes, the
51
pathways have been extensively studied using Saccharomyces cerevisiae as a
52
model organism (Caponigro & Parker, 1996; Parker, 2012). The general mRNA
53
degradation pathway starts with the removal of the poly(A) tail by the Pan2/Pan3
54
(Boeck et al, 1996; Brown et al, 1996) and Ccr4/Not complexes (Tucker et al, 2001;
55
Collart, 2003; Yamashita et al, 2005). Subsequently, mRNA is subjected to
56
decapping carried out by Dcp2 and promoted by several factors including Dhh1 and
57
Pat1 (Pilkington & Parker, 2008; She et al, 2008). The decapped and deadenylated
58
mRNA can be degraded rapidly in the 3’ to 5’ direction by the exosome (Anderson &
59
Parker, 1998) or in the 5’ to 3’ direction by Xrn1 (Muhlrad et al, 1994; Hsu & Stevens,
60
1993). Further mRNA degradation pathways are triggered when aberrant
61
translational status is detected, including Nonsense-mediated decay (NMD), No-go
62
decay (NGD) and Non-stop decay (NSD) (Garneau et al, 2007; Parker, 2012).
63
64
Despite all this knowledge, prediction of mRNA half-life from a gene sequence is still
65
not established. Most of the mechanistic studies could only be performed on
crPage 2 | 22
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
66
individual genes and it is unclear how the effects generalize genome-wide. In
67
addition, neither the contribution of different CREs to the overall stability and to the
68
variation between genes has been systematically studied, nor have the dependences
69
of CREs to distinct mRNA degradation pathways.
70
71
Here, we used an integrative approach where we mathematically modelled mRNA
72
half-life as a function of its sequence and applied it to S .cerevisiae. For the first time,
73
our model can explain most of the between-gene half-life variance from sequence
74
alone. We used a semi-mechanistic model which allowed us to interpret it in terms of
75
individual sequence features in the 5’UTR, coding region, and 3’UTR. Our approach
76
de novo recovered known cis-regulatory elements and identified novel ones.
77
Quantification of the respective contributions revealed that codon usage is the major
78
contributor to mRNA stability. Interpretability of the model also allowed studying how
79
the function of CREs depends on different mRNA degradation pathways. The
80
application of the modeling approach to S. pombe supports the generality of these
81
findings.
82
83
RESULTS
84
85
Regression reveals novel mRNA sequence features associated with
86
mRNA stability
87
To study cis-regulatory determinants of mRNA stability in S. cerevisiae, we chose the
88
dataset by Sun and colleagues (Sun et al, 2013), which provides genome-wide half-
89
life measurements for 4,388 expressed genes of a wild type lab strain (Fig 1). The
90
same study also measured genome-wide mRNA half-lives for 34 strains knocked out
91
for RNA degradation pathway genes (Table EV1), allowing for investigating how the
92
association of candidate CREs with half-life depends on RNA degradation pathways.
93
When applicable, associations were also investigated in the S. pombe genome, using
94
half-life values for 3,614 expressed mRNAs in a wild-type lab strain from Eser and
95
colleagues (Eser et al, 2016). We considered sequence features within 5 overlapping
96
regions: the 5’UTR, the start codon context, the coding sequence, the stop codon
97
context and the 3’UTR. Half-life associated negatively with 5’UTR length (Spearman
98
ρ = -0.17, P = 3 × 10-12 for S. cerevisiae and Spearman ρ = -0.27, P = 2 × 10-11 for S.
crPage 3 | 22
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
99
pombe), with the coding sequence (Spearman ρ = -0.23, P < 2 × 10-16 for S.
100
cerevisiae and Spearman ρ = -0.32, P = 4 × 10-15 for S. pombe), and with 3’UTR
101
length (Spearman ρ = -0.06, P = 1 × 10-8 for S. cerevisiae and Spearman ρ = -0.23, P
102
= 6 × 10-12 for S. pombe) (Fig EV1A, B, C, D, E, F). Moreover, S .cerevisiae half-life
103
was found to be negatively associated with the GC-content of the 5’UTR (Spearman
104
ρ = -0.11, P = 3 × 10-14) and of the 3’UTR (Spearman ρ = -0.17, P < 2 × 10-16),
105
whereas the GC-content of the CDS region correlated positively with half-life
106
(Spearman ρ = 0.27, P < 2 × 10-16) (Fig EV2A, C, E). In contrast, S. pombe showed
107
consistent positive association between GC-content and half-life (Spearman ρ = 0.16,
108
0.24, 0.05 and P < 2 × 10-16, P < 2 × 10-16, P = 0.01 respectively for 5’UTR, CDS and
109
3’UTR, Fig EV2B, D, F). Also, secondary structure in 5’UTR (Materials and Methods)
110
associated with RNA instability (Spearman ρ = -0.22, P = 8 × 10-12 for S. cerevisiae
111
and Spearman ρ = -0.24, P < 2 × 10-16 for S. pombe, Fig EV1G, H).
112
113
Novel candidate cis-regulatory elements were searched for using a robust k-mer
114
based regression, investigating all 3- to 8-mers, and followed by k-mer assembly
115
(Materials and Methods). This recovered de novo the Puf3 binding motif TGTAAATA
116
in 3’UTR (Gerber et al, 2004; Gupta et al, 2014), a well-studied CRE that confers
117
RNA instability, as well as the Whi3 binding motif TGCAT (Colomina et al, 2008; Cai
118
& Futcher, 2013). Two new motifs were found: AAACAAA in 5’UTR, and ATATTC in
119
3’UTR (Fig 1A). Notably, the motif TGCAT, was found in 21% of the genes (938 out
120
of 4,388) and significantly co-occurred with the other two destabilizing motifs found in
121
3’UTR: Puf3 and the Whi3 binding site motifs (Fig 1B). Except for AAACAAA and
122
TTTTTTA, all motifs associated with shorter half-lives (Fig 1A).
123
124
In the following subsections, we describe first the findings for each of the 5 gene
125
regions and then a model that integrates all these sequence features.
126
127
Upstream AUGs destabilize mRNAs by triggering nonsense-
128
mediated decay
129
Genome-wide, occurrence of an upstream AUG (uAUG) associated with 1.37 fold
130
shorter half-life (fold change between medians, P < 2.2 × 10-16), and the effect
131
strengthened for genes with two or more AUGs (Fig 2A, B). Among the 34 knock-out
crPage 4 | 22
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
132
strains, the association between uAUG and shorter half-life was almost lost only for
133
mutants of the two essential components of the nonsense-mediated mRNA decay
134
(NMD) UPF2 and UPF3 (Leeds et al, 1992; Cui et al, 1995), and for the general 5’-3’
135
exonuclease Xrn1 (Fig 2A). The dependence on NMD suggested that the association
136
between uAUG and shorter half-life might be due to the occurrence of a premature
137
stop codon. Indeed, the association of uAUG with decreased half-lives was only
138
found for genes with a premature stop codon cognate with the uAUG (Fig 2C). This
139
held not only for cognate premature stop codons within the 5’UTR, leading to a
140
potential upstream ORF, but also for cognate premature stop codons within the ORF,
141
which occurred almost always for uAUG out-of-frame with the main ORF (Fig 2C).
142
This finding likely holds for many other eukaryotes as we found the same trends in S.
143
pombe (Fig 2D). This observations are consistent with single-gene studies
144
demonstrating that translation of upstream ORFs can lead to RNA degradation by
145
nonsense-mediated decay (Gaba et al, 2005; Barbosa et al, 2013). Altogether, these
146
results show that uAUGs are mRNA destabilizing elements as they almost surely
147
match with a cognate premature stop codon, which, whether in frame or not with the
148
gene, and within the UTR or in the coding region, trigger NMD.
149
150
Translation initiation predicts mRNA stability
151
Several sequence features in the 5’UTR associated significantly with mRNA half-life.
152
153
First, the folding energy of the 5’ initiation sequence, defined as the 5’UTR and the
154
first 10 codons (Zur & Tuller, 2013), positively correlated with mRNA stability in S.
155
cerevisiae (Spearman ρ = -0.22, P = 8 × 10-12 for S. cerevisiae and Spearman ρ = -
156
0.24, P < 2 × 10-16 for S. pombe, Fig EV1G, H). The mRNA destabilizing function of
157
strong secondary structure elements had been demonstrated for the gene PGK1
158
(LaGrandeur & Parker, 1999; Muhlrad et al, 1995). Our genome-wide analysis
159
indicates that this is a general phenomenon. Possibly, strong secondary structure in
160
the 5’ UTR interferes with ribosome scanning, thus slows down translation initiation,
161
and therefore decreases the level of mRNA protection by the translation machinery
162
(Roy & Jacobson, 2013; Huch & Nissan, 2014).
163
crPage 5 | 22
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
164
Second, longer 5’UTRs associated with less stable mRNAs (Spearman ρ = -0.17, P =
165
3 × 10-12 for S. cerevisiae and Spearman ρ = -0.27, P = 2 × 10-11 for S. pombe, Fig
166
EV1A, B). In mouse cells, mRNA isoforms with longer 5’UTR are translated with
167
lower efficiency (Wang et al, 2016), possibly because longer 5’UTR generally harbor
168
more translation repression elements. Hence, longer 5’UTR may confer mRNA
169
instability by decreasing translation initiation and therefore decreasing the protection
170
by the translation machinery.
171
172
Third, a significant association between the third nucleotide before the start codon
173
and mRNA half-life was observed (Fig 3A). The median half-life correlated with the
174
nucleotide frequency (Fig EV3A), leading to 1.28 fold difference between the 2,736
175
genes with an adenosine, the most frequent nucleotide at this position, and the 360
176
genes with a cytosine, the least frequent nucleotide at this position (P = 1.7x10-11).
177
The same correlation was also found to be significant for S. pombe (Fig EV3B, C).
178
Functional effect of the start codon context on mRNA stability had been established
179
as the long-lived PGK1 mRNA was significantly destabilized when substituting the
180
sequence context around its start codon with the one from the short-lived MFA2
181
mRNA (LaGrandeur & Parker, 1999). Our genome-wide analysis indicates that this
182
effect generalizes to other genes. Possibly the start codon context, which controls
183
translation initiation efficiency (Kozak, 1987; Dvir et al, 2013), affects thereby mRNA
184
stability.
185
186
Altogether, these findings indicate that 5’UTR elements, including the start codon
187
context, may affect mRNA stability by altering translation initiation. De novo search
188
for regulatory motifs identified AAACAAA to be significantly associated with longer
189
mRNA half-lives. However, further analysis did not provide strong support for this
190
motif, indicating it might be merely correlative (Fig EV5).
191
192
Codon usage regulates mRNA stability through common mRNA
193
decay pathways
194
Codon usage has been reported to affect mRNA stability through a translation-
195
dependent manner (Presnyak et al, 2015; Mishima & Tomari, 2016; Bazzini et al,
196
2016). However, it is unclear on which mRNA degradation pathway it depends.
crPage 6 | 22
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
197
198
The species-specific tRNA adaptation index (sTAI) was used to quantify the level of
199
codon usage bias (Sabi & Tuller, 2014). First, we confirmed on this independent half-
200
life dataset the strong correlation between codon optimality and mRNA stability in S.
201
cerevisiae (Fig 3C, Spearman ρ = 0.55, P < 2.2x10-16). Using the out-of-folds
202
explained variance as a summary statistics, we assessed its variation across different
203
gene knockouts (Materials and Methods). The effect of codon usage exclusively
204
depended on the genes from the common deadenylation- and decapping-dependent
205
5’ to 3’ mRNA decay pathway and the NMD pathway (Fig 3D). In particular, all
206
assayed genes of the Ccr4-Not complex, including CCR4, NOT3, CAF40 and POP2,
207
are required for wild-type level effects of codon usage on mRNA decay. Among them,
208
CCR4 has the largest effect. This confirmed recent studies showing that accelerated
209
decay of non-optimal codon genes requires deadenylation activities of Ccr4-Not
210
(Presnyak et al, 2015; Mishima & Tomari, 2016). Moreover, our results confirmed the
211
dependence on Dhh1 (Radhakrishnan et al, 2016), but also on its interacting partner
212
Pat1. In contrast to genes of the Ccr4-Not complex, PAN2/3 genes which encode
213
also deadenylation enzymes, were not found to be essential for the coupling between
214
codon usage and mRNA decay (Fig 3D). Notably, we did not observe any change of
215
effect upon knockout of DOM34 and HBS1 (Figure EV5), which are essential genes
216
for the No-Go decay pathway. This implies that the effect of codon usage is unlikely
217
due to stalled ribosomes at non-optimal codons.
218
219
Our systematic analysis revealed two additional novel dependencies: First, on the
220
common 5’ to 3’ exonuclease Xrn1, and second, on UPF genes, which are essential
221
players of NMD (Fig 3D). Previous studies have shown that NMD is more than just a
222
RNA surveillance pathway, but rather one of the general mRNA decay mechanisms
223
that target a wide range of mRNAs, including aberrant and normal ones (He et al,
224
2003; Hug et al, 2015). Altogether, our analysis strongly indicates that, the so-called
225
“codon-mediated decay” is not an mRNA decay pathway itself, but a regulatory
226
mechanism of the common mRNA decay pathways.
227
228
Stop codon context associates with mRNA stability
229
crPage 7 | 22
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
230
Linear regression against the 6 bases 5’ and 3’ of the stop codon revealed the first
231
nucleotide 3’ of the stop codon to most strongly associate with mRNA stability. This
232
association was observed for each of the three possible stop codons, and for each
233
codon a cytosine significantly associated with lower half-life (Fig. 3B). This also held
234
for S. pombe (Fig EV3D). A cytosine following the stop codon structurally interferes
235
with stop codon recognition (Brown et al, 2015), thereby leading to stop codon read-
236
through events (Bonetti et al, 1995; McCaughan et al, 1995). Of all combinations,
237
TGA-C is known to be the leakiest stop codon context (Jungreis et al, 2011) and also
238
associated with shortest mRNA half-life (Fig. 3B). These results are consistent with
239
non-stop decay, a mechanism that triggers exosome-dependent RNA degradation
240
when the ribosome reaches the poly(A) tail. Consistent with this interpretation,
241
mRNAs with additional in-frame stop codons in the 3’UTR, which are over-
242
represented in yeast (Williams et al, 2004), exhibited significantly higher half-life (Fig
243
EV3E, F). However, the association between the stop codon context and half-life was
244
not weakened in mutants of the Ski complex, which is required for the cytoplasmic
245
functions of the exosome (Fig EV5). These results strongly indicate that the fourth
246
nucleotide after the stop codon is an important determinant of mRNA stability, likely
247
because of translational read-through.
248
249
Sequence motifs in 3’UTR
250
Four motifs in the 3’UTR were found to be significantly associated with mRNA
251
stability (Fig 5A, FDR < 0.1, Materials and Methods). This analysis recovered three
252
described motifs: the Puf3 binding motif TGTAAATA (Gerber et al, 2004), the Whi3
253
binding motif TGCAT (Colomina et al, 2008; Cai & Futcher, 2013), and a poly(U)
254
motif TTTTTTA, which is likely bound by Pub1 (Duttagupta et al, 2005), or is part of
255
the long poly(U) stretch that form a looping structure with poly(A) tail (Geisberg et al,
256
2014). We also identified a novel motif, ATATTC, that associated with lower mRNA
257
half-life.
258
259
Four lines of evidence supported the potential functionality of the new motif. First, it
260
preferentially localizes in the vicinity of the poly(A) site (Fig 5B), and functionally
261
depends on Ccr4 (Fig EV5), suggesting a potential interaction with deadenylation
262
factors. Second, single nucleotide deviations from the consensus sequence of the
crPage 8 | 22
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
263
motif associated with decreased effects on half-life (Fig 5C, linear regression allowing
264
for one mismatch, Materials and Methods). Moreover, the flanking nucleotides did not
265
show further associations indicating that the whole lengths of the motifs were
266
recovered (Fig 5C). Third, when allowing for one mismatch, the motif still showed
267
strong preferences (Fig 5D). Fourth, the motif instances were more conserved than
268
their flanking bases (Fig 5E).
269
270
Consistent with the role of Puf3 in recruiting deadenylation factors, Puf3 binding motif
271
localized preferentially close to the poly(A) site (Fig 5B). The effect of the Puf3 motifs
272
was significantly lower in the knockout of PUF3 (Fig EV5). We also found a
273
significant dependence on the deadenylation (CCR4, POP2) and decapping (DHH1,
274
PAT1) pathways (Fig EV5), consistent with previous single gene experiment showing
275
that Puf3 binding promotes both deadenylation and decapping (Olivas & Parker,
276
2000; Goldstrohm et al, 2007). Strikingly, Puf3 binding motif switched to a
277
stabilization motif in the absence of Puf3 and Ccr4, suggesting that deadenylation of
278
Puf3 motif containing mRNAs is not only facilitated by Puf3 binding, but also depends
279
on it.
280
281
Whi3 plays an important role in cell cycle control (Garí et al, 2001). Binding of Whi3
282
leads to destabilization of the CLN3 mRNA (Cai & Futcher, 2013). A subset of yeast
283
genes are up-regulated in the Whi3 knockout strain (Cai & Futcher, 2013; Holmes et
284
al, 2013). However, it was so far unclear whether Whi3 generally destabilizes mRNAs
285
upon its binding. Our analysis showed that mRNAs containing the Whi3 binding motif
286
(TGCAT) have significantly shorter half-life (FDR = 6.9x10-04). Surprisingly, this
287
binding motif is extremely widespread, with 896 out of 4,388 (20%) genes that we
288
examined containing the motif on the 3’UTR region, suggesting possible broader and
289
genome-wide role of Whi3 and its binding motif. However, no significant genetic
290
dependence of the effect of the Whi3 binding motif was found (Fig EV5).
291
292
The mRNAs harboring the TTTTTTA motif tended to be more stable (Fig 5A). No
293
positional preferences was observed for this motif (Fig 5B), so as the poly(U) stretch
294
reported before (Geisberg et al, 2014). Effects of this motif depends on genes from
295
Ccr4-Not complex and Xrn1 (Fig EV5).
296
crPage 9 | 22
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
297
60% between-gene half-life variance can be explained by sequence
298
features
299
We next asked how well one could predict mRNA half-life from these mRNA
300
sequence features, and what their respective contributions were when considered
301
jointly. To this end, we performed a multivariate linear regression of the logarithm of
302
the half-life against the identified sequence features. The predictive power of the
303
model on unseen data was assessed using 10-fold cross validation (Material and
304
Methods, Table EV2). Altogether, 60% of S. cerevisiae half-life variance in the
305
logarithmic scale can be explained by simple linear combinations of the above
306
sequence features (Fig 6A). The median out-of-folds error across genes is 30%.
307
Prediction accuracy of half-life of 30% is remarkably high because it is in the order of
308
magnitude of the expression variation that is typically physiologically tolerated, and it
309
is also about the amount of variation observed between replicate experiments (Eser
310
et al, 2016). The same model applied to S. pombe explained 42% of the total half-life
311
variance. Because the measures also entail measurement noise, these numbers are
312
conservative underestimations of the total biological variance explained by our
313
model.
314
315
The uAUG, 5’UTR length, 5’UTR GC content, 61 coding codons, CDS length, all four
316
3’UTR motifs, and 3’UTR length remained significantly associated with half-life in the
317
joint model indicating that they contributed independently to half-life (the complete list
318
of features and their p-values are given in Table EV3). In contrast, start codon
319
context, stop codon context, 5’ folding energy, the 5’UTR motif AAACAAA, and
320
3’UTR GC content dropped below the significance when considered in the joint
321
model (Materials and Methods). This loss of statistical significance may be due to
322
lack of statistical power or because the marginal association of these sequence
323
features with half-life is a consequence of a correlation with other sequence features.
324
Among all sequence features, codon usage as a group is the best predictor both in a
325
univariate model (55.90%) and in the joint model (44.23%) (Figure 6B). This shows
326
that, quantitatively, codon usage is the major determinant of mRNA stability in yeast.
327
328
The variance analysis quantifies the contribution of each sequence feature to the
329
variation across genes. Features that vary a lot between genes, such as UTR length
crPage 10 | 22
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
330
and codon usage, favorably contribute to the variation. This does not reflect,
331
however, the effect on a given gene of elementary sequence variations in these
332
features. For instance, a single-nucleotide variant can lead to the creation of an
333
uAUG with a strong effect on half-life, but a single nucleotide variant in the coding
334
sequence may have little impact on overall codon usage. We used the joint model to
335
assess the sensitivity of each feature as median fold-change across genes upon
336
single-nucleotide variants, simulating single-nucleotide deletions for the length
337
features and single nucleotide substitutions for the remaining ones (Materials and
338
Methods). Single-nucleotide variations typically altered half-life by less than 10%. The
339
largest effects were observed in the 3’UTR motifs and uAUG (Fig 6D). Notably,
340
although codon usage was the major contributor to the variance, synonymous
341
variation on codons typically affected half-life by less than 2% (Fig 6D; Fig EV6). For
342
those synonymous variations that changed half-life by more than 2%, most of them
343
were variations that involved the most non-optimized codons CGA or ATA (Fig EV6,
344
Presnyak et al. 2015).
345
Altogether, our results show that most of yeast mRNA half-life variation can be
346
predicted from mRNA sequence alone, with codon usage being the major contributor.
347
However, single-nucleotide variation at 3’UTR motif or uAUG had the largest
348
expected effect on mRNA stability.
349
350
351
DISCUSSION
352
We systematically searched for mRNA sequence features associating with mRNA
353
stability and estimated their effects at single-nucleotide resolution in a joint model.
354
Overall, the joint model showed that 60% of the variance could be predicted from
355
mRNA sequence alone in S. cerevisiae. This analysis showed that translation-related
356
features, in particular codon usage, contributed most to the explained variance. This
357
findings strengthens further the importance of the coupling between translation and
358
mRNA degradation (Roy & Jacobson, 2013; Huch & Nissan, 2014). Moreover, we
359
assessed the RNA degradation pathway dependencies of each sequence feature.
360
Remarkably, we identified that codon-mediated decay is a regulatory mechanism of
361
the canonical decay pathways, including deadenylation- and decapping-dependent 5’
362
to 3’ decay and NMD (Figure 6E).
crPage 11 | 22
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
363
364
Integrative analyses of cis-regulatory elements as we used here complement
365
mechanistic single-gene studies for important aspects. It allows assessing genome-
366
wide the importance of CREs that have been reported previously with single-gene
367
experiments, as Vogel and colleagues (Vogel et al, 2010) had shown when modeling
368
protein abundance from mRNA levels and sequence. Also, single-nucleotide effect
369
prediction can more precisely supports the interpretation of genetic variants, including
370
mutations in non-coding region as well as synonymous transitions in coding region.
371
Furthermore, such integrative analyses can be combined with a search for novel
372
sequence features, as we did here with k-mers, allowing the identification of novel
373
candidate cis-regulatory elements. An alternative approach to the modeling of
374
endogenous sequence is to use large-scale perturbation screens as developed by
375
the group of Segal (Dvir et al, 2013; Shalem et al, 2015). Although very powerful to
376
dissect known cis-regulatory elements or to investigate small variations around select
377
genes, the sequence space is so large that these large-scale perturbation screens
378
cannot uncover all regulatory motifs. It would be interesting to combine both
379
approaches and design large-scale validation experiments guided by insights coming
380
from modeling of endogenous sequences as we developed here.
381
382
Causality cannot be proven through a regression analysis approach. However, we
383
provided several complementary analyses to further assess the potential functionality
384
of candidate CREs. These include conservation, positional preferences, and epistasis
385
analyses to assess the dependencies on RNA degradation pathways. One of the
386
motif we found by regression, AAACAAA in 5’UTR, was neither supported by these
387
complementary analyses, nor it remained significant in the joint model. We think this
388
motifs is most likely correlative. In contrast, the ATATTC motif in 3’UTR is strongly
389
supported by these complementary analyses and is also significant in the joint model.
390
Two of the most interesting sequence features that we identified but still need to be
391
functionally assayed are the start codon context and 5’ sequence free energy. Given
392
their established effect on translation initiation (Kozak, 1986; Dvir et al, 2013; Tuller &
393
Zur, 2014), the general coupling between translation and mRNA degradation (Huch &
394
Nissan, 2014; Roy & Jacobson, 2013), as well as several observations directly on
395
mRNA stability for single genes (Muhlrad et al, 1995; Barnes, 1998; LaGrandeur &
396
Parker, 1999; Schwartz & Parker, 1999), they are very likely to be functional on most
crPage 12 | 22
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
397
genes. Consistent with this hypothesis, large scale experiments that perturb 5’
398
sequence secondary structure and start codon context indeed showed a wide range
399
of mRNA level changes in the direction that we would predict (Dvir et al, 2013).
400
Altogether, such integrative approaches allow the identification of candidate
401
regulatory elements that could be functionally tested later on.
402
403
We are not aware of previous studies that systematically assessed the effects of cis-
404
regulatory elements in the context of knockout backgrounds, as we did here. This
405
part of our analysis turned out to be very insightful. We recovered results from recent
406
studies about the dependencies of the effect of codon usage on RNA stability on the
407
Ccr4-Not complex and Dhh1, but also identified important novel ones including NMD
408
factors, Pat1 and Xrn1. With the growing availability of knockout or mutant
409
background in model organisms and human cell lines, we anticipate this approach to
410
become a fruitful methodology to unravel regulatory mechanisms.
411
412
Materials and Methods
413
414
Data and Genomes
415
Wild-type and knockout genome-wide S. cerevisiae half-life data were obtained from
416
Sun et al. 2013. S. cerevisiae gene boundaries were taken from the boundaries of
417
the most abundant isoform quantified by Pelechano et al. 2013. Reference genome
418
fasta file and genome annotation were obtained from the Ensembl database (release
419
79). UTR regions were defined by subtracting out gene body (exon and introns from
420
the Ensembl annotation) from the gene boundaries.
421
Genome-wide half-life data of S. pombe as well as refined transcription unit
422
annotation were obtained from Eser et al. 2016. Reference genome version
423
ASM294v2.26 was used to obtain sequence information.
424
For both half-life datasets, only mRNAs with mapped 5’UTR and 3’UTR were
425
considered. mRNAs that have 5’UTR length shorter than 6nt were further filtered out,
426
so that start codon context could always be defined. Half-life outliers of S. pombe
427
(half-life less than 1 or larger than 250 mins) were removed.
crPage 13 | 22
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
428
Codon-wise species-specific tRNA adaptation index (sTAI) of yeasts were obtained
429
from Sabi & Tuller 2014. Gene-wise sTAIs were calculated as the geometric mean of
430
sTAIs of all its codons (stop codon excluded).
431
432
Analysis of knockout strains
433
The effect level of an individual sequence feature was compared against the wild-
434
type with Wilcoxon rank-sum test followed by multiple hypothesis testing p-value
435
correction (FDR < 0.1). The sequence feature effect levels were defined as follows
436
for different classes of sequence features:
437
uAUG:
438
Motifs:
439
half-life of mRNAs that has zero and one instance of the motif respectively.
440
Codon usage: For each knockout or wild-type, a linear model was fitted with all
441
coding codons as covariates. The effect size of codon usage (joint effect of all
442
codons) on half-life was defined by the explained variance of out-of-sample
443
predictions.
444
Start codon -3 position:
445
mRNAs with base adenine and cytosine at start codon -3 position respectively.
446
Stop codon +1 position:
447
half-life of mRNAs that has stop codon TGA followed by guanine and cytosine
448
respectively.
449
5’ folding energy: genome-wide Spearman rank correlation between 5’ folding energy
450
and half-life.
(𝑚𝑒𝑑𝑖𝑎𝑛(𝐻𝑤𝑖𝑡ℎ𝑢𝐴𝑈𝐺 )− 𝑚𝑒𝑑𝑖𝑎𝑛(𝐻𝑤𝑖𝑡ℎ𝑜𝑢𝑡𝑢𝐴𝑈𝐺 ))
𝑚𝑒𝑑𝑖𝑎𝑛(𝐻𝑤𝑖𝑡ℎ𝑜𝑢𝑡𝑢𝐴𝑈𝐺 )
(𝑚𝑒𝑑𝑖𝑎𝑛(𝐻𝑐𝑜𝑢𝑛𝑡=1 )− 𝑚𝑒𝑑𝑖𝑎𝑛(𝐻𝑐𝑜𝑢𝑛𝑡=0 ))
𝑚𝑒𝑑𝑖𝑎𝑛(𝐻𝑐𝑜𝑢𝑛𝑡=0 )
where 𝐻 stands for half-life.
where 𝐻𝑐𝑜𝑢𝑛𝑡=0 and 𝐻𝑐𝑜𝑢𝑛𝑡=1 stands for the
(𝑚𝑒𝑑𝑎𝑖𝑛(𝐻𝐴 )−𝑚𝑒𝑑𝑖𝑎𝑛(𝐻𝑐 ))
𝑚𝑒𝑑𝑖𝑎𝑛(𝐻𝐴 )
where 𝐻𝐴 and 𝐻𝐶 is the half-life of
(𝑚𝑒𝑑𝑖𝑎𝑛(𝐻𝑇𝐺𝐴𝐺 )−𝑚𝑒𝑑𝑖𝑎𝑛(𝐻𝑇𝐺𝐴𝐶 ))
𝑚𝑒𝑑𝑖𝑎𝑛(𝐻𝑇𝐺𝐴𝐺 )
where 𝐻𝑇𝐺𝐴𝐺 and 𝐻𝑇𝐺𝐴𝐶 is the
451
452
Motif discovery
453
Motif discovery was conducted similarly to Eser et al. 2016 for the 5’UTR, the CDS
454
and the 3’UTR regions. A linear mixed effect model was used to assess the effect of
455
each individual k-mer (fixed effect) while controlling the effects of the others (random
456
effect) and for the region length as a covariate. For CDS we also used codons as
457
further covariates. We systematically tested the effects of all possible k-mers with
458
length from 3 to 8. The linear mixed model for motif discovery was fitted with GEMMA
crPage 14 | 22
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
459
software (Zhou et al, 2013). The p-values were corrected for multiple testing using
460
Benjamini-Hochberg’s FDR. Motifs were subsequently manually assembled based on
461
overlapping significant k-mers.
462
463
Folding energy calculation
464
5’ ensemble folding energy was calculated with RNAfold from ViennaRNA version
465
2.1.9 (Lorenz et al, 2011), with default parameters. 5’ sequence was defined as
466
5’UTR sequence plus the first 10 codons within the coding region. Genes with
467
5’UTRs that are shorter than 30 bps were excluded from the analysis.
468
469
S. cerevisiae conservation analysis
470
The phastCons (Siepel et al, 2005) conservation track for S. cerevisiae was
471
downloaded from the UCSC Genome browser
472
(http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/phastCons7way/). Motif single-
473
nucleotide level conservation scores were computed as the mean conservation score
474
of each nucleotide (including 2 extended nucleotide at each side of the motif) across
475
all motif instances genome-wide (removing NA values).
476
477
Linear model for genome-wide half-life prediction (joint model)
478
Multivariate linear regression models were used to predict genome-wide mRNA half-
479
life on the logarithmic scale from sequence features. Only mRNAs that contain all
480
features were used to fit the models, resulting with 3,862 mRNAs for S. cerevisiae
481
and 3,130 mRNAs for S. pombe. Linear models were first fitted with all sequence
482
features as covariates, non-significant sequence features were then removed from
483
the final prediction models, ending up with 70 features for S. cerevisiae model and 73
484
features for S. pombe (each single coding codon was fitted as a single covariate). L1
485
or L2 regularization were not necessary, as they did not improve the out-of-fold
486
prediction accuracy (tested with glmnet R package (Friedman et al, 2010)). Out-of-
487
fold prediction was applied with 10-fold cross validation for any prediction task in this
488
study. A complete list of model features and their p-values for both yeast species are
489
provided in Table EV2.
490
491
Variance explained for linear model
crPage 15 | 22
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
492
The percentage of explained variance for a linear model was calculated by 100 ∗
493
(1 − 𝑣𝑎𝑟𝑟𝑒𝑠𝑖𝑑𝑢𝑎𝑙 / 𝑣𝑎𝑟𝑡𝑜𝑡𝑎𝑙 ), where 𝑣𝑎𝑟𝑟𝑒𝑠𝑖𝑑𝑢𝑎𝑙 and 𝑣𝑎𝑟𝑡𝑜𝑡𝑎𝑙 are the variance of
494
regression residual and the total half-life variance respectively.
495
496
Analysis of sequence feature contribution
497
The contribution of each sequence feature was analyzed individually as a univariate
498
regression and also jointly in a multivariate regression model. We quantified the
499
feature contributions in 10-fold cross-validation. The contribution of each feature
500
individually was calculated as the variance explained by a univariate model. We then
501
added the features in a descending order of their individual explained variance to a
502
joint model, and recorded the cumulative variance explained as adding more
503
features. As compensation to the issue that the additional variance explained by a
504
single feature to a joint model depends on the order of adding it, the drop of variance
505
explained upon leaving out one feature separately from the full model was also
506
assessed.
507
508
Single-nucleotide variant effect prediction
509
The effects of single-nucleotide variation (except for motifs) were predicted by
510
introducing a single nucleotide perturbation into the full prediction model for each
511
gene, and summarizing the effect with the median half-life change across all genes.
512
For example, the effect of a single CGT to CGA transition was assessed by
513
decreasing the count of CGT for each gene by one while increasing the count of CGA
514
for each gene (only perturbing genes that had at least one CGT). In the case of
515
length and GC content, we decreased the length or number of GC count by one for
516
each gene. Note that the side effects of varying the length, such as disrupting
517
reading-frame, were not considered. Only synonymous transitions were considered
518
for codons.
519
520
Motif single-nucleotide variant analysis
521
Effects of motif single-nucleotide variants were predicted with linear model modified
522
from (Eser et al, 2016). When assessing the effect of a given motif variation, we
523
controlled for the effect of all other sequence features using a linear model with the
524
other features as covariates.
525
crPage 16 | 22
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
526
Code availability
527
Analysis scripts are available at: https://github.com/s6juncheng/Cheng_et_al_2016
528
529
Acknowledgements
530
We are thankful to Fabien Bonneau (Max Planck Institute of Biochemistry) for helpful
531
discussions on motifs and RNA degradation pathways, as well as useful feedback on
532
the manuscript. We thank Björn Schwalb for communication on analyzing the
533
knockout data. We thank Vicente Yépez for useful feedback on the manuscript. JC
534
and ŽA is supported by a DFG fellowship through QBM. JG was supported by the
535
Bundesministerium für Bildung und Forschung, Juniorverbund in der Systemmedizin
536
“mitOmics” (grant FKZ 01ZX1405A).
537
538
References
539
540
Anderson JS & Parker RP (1998) The 3’ to 5' degradation of yeast mRNAs is a general
541
mechanism for mRNA turnover that requires the SKI2 DEVH box protein and 3' to 5'
542
exonucleases of the exosome complex. EMBO J. 17: 1497–506
543
544
545
546
Barbosa C, Peixeiro I & Romão L (2013) Gene Expression Regulation by Upstream Open
Reading Frames and Human Disease. PLoS Genet. 9: e1003529
Barnes CA (1998) Upf1 and Upf2 proteins mediate normal yeast mRNA degradation when
translation initiation is limited. Nucleic Acids Res. 26: 2433–2441
547
Bazzini AA, Viso F, Moreno-mateos MA, Johnstone TG & Charles E (2016) Codon identity
548
regulates mRNA stability and translation efficiency during the maternal-to-zygotic
549
transition. EMBO J.: 1–17
550
Boeck R, Tarun S, Rieger M, Deardorff JA, Mu S & Sachs AB (1996) The Yeast Pan2
551
Protein Is Required for Poly ( A ) -binding Protein-stimulated Poly ( A ) -nuclease Activity
552
J. Biol. Chem.. 271: 432–438
553
Boël G, Letso R, Neely H, Price WN, Wong K, Su M, Luff JD, Valecha M, Everett JK, Acton
554
TB, Xiao R, Montelione GT, Aalberts DP & Hunt JF (2016) Codon influence on protein
555
expression in E. coli correlates with mRNA levels. Nature 529: 358–363
556
Bonetti B, Fu L, Moon J & Bedwell DM (1995) The efficiency of translation termination is
557
determined by a synergistic interplay between upstream and downstream sequences in
558
Saccharomyces cerevisiae. J. Mol. Biol. 251: 334–45
559
560
Brown A, Shao S, Murray J, Hegde RS & Ramakrishnan V (2015) Structural basis for stop
codon recognition in eukaryotes. Nature 524: 493–6
crPage 17 | 22
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
561
Brown CE, Tarun SZ, Boeck R, Sachs AB, Sachs AB & Chem JB (1996) PAN3 Encodes a
562
Subunit of the Pab1p-Dependent Poly ( A ) Nuclease in Saccharomyces cerevisiae.
563
Mol. Cell. Biol. 16: 5744–5753
564
Cai Y & Futcher B (2013) Effects of the Yeast RNA-Binding Protein Whi3 on the Half-Life and
565
Abundance of CLN3 mRNA and Other Targets. PLoS One 8: e84630
566
Caponigro G & Parker R (1996) Mechanisms and control of mRNA turnover in
567
568
569
Saccharomyces cerevisiae. Microbiol. Rev. 60: 233–249
Collart MA (2003) Global control of gene expression in yeast by the Ccr4-Not complex. Gene
313: 1–16
570
Colomina N, Ferrezuelo F, Wang H, Aldea M & Garí E (2008) Whi3, a developmental
571
regulator of budding yeast, binds a large set of mRNAs functionally related to the
572
endoplasmic reticulum. J. Biol. Chem. 283: 28670–28679
573
Cui Y, Hagan KW, Zhang S & Peltz SW (1995) Identification and characterization of genes
574
that are required for the accelerated degradation of mRNAs containing a premature
575
translational termination codon. Genes Dev. 9: 423–36
576
Duttagupta R, Tian B, Wilusz CJ, Danny T, Soteropoulos P, Ouyang M, Joseph P, Peltz SW,
577
Khounh DT & Dougherty JP (2005) Global Analysis of Pub1p Targets Reveals a
578
Coordinate Control of Gene Expression through Modulation of Binding and Stability
579
Global Analysis of Pub1p Targets Reveals a Coordinate Control of Gene Expression
580
through Modulation of Binding and Stability Mol. Cell. Biol.. 25: 5499–5513
581
Dvir S, Velten L, Sharon E, Zeevi D, Carey LB, Weinberger A & Segal E (2013) Deciphering
582
the rules by which 5 ’UTR sequences affect protein expression in yeast. Proc. Natl.
583
Acad. Sci. U. S. A. 110: E2792–E2801
584
Eser P, Wachutka L, Maier KC, Demel C, Boroni M, Iyer S, Cramer P & Gagneur J (2016)
585
Determinants of RNA metabolism in the Schizosaccharomyces pombe genome. Mol.
586
Syst. Biol. 12: 857
587
588
589
Friedman J, Hastie T & Tibshirani R (2010) Regularization Paths for Generalized Linear
Models via Coordinate Descent. J. Stat. Softw. 33: 1–22
Gaba A, Jacobson A & Sachs MS (2005) Ribosome occupancy of the yeast CPA1 upstream
590
open reading frame termination codon modulates nonsense-mediated mRNA decay.
591
Mol. Cell 20: 449–460
592
Garí E, Volpe T, Wang H, Gallego C, Futcher B & Aldea M (2001) Whi3 binds the mRNA of
593
the G1 cyclin CLN3 to modulate cell fate in budding yeast. Genes Dev. 15: 2803–2808
594
Garneau NL, Wilusz J & Wilusz CJ (2007) The highways and byways of mRNA decay. Nat.
595
596
597
Rev. Mol. Cell Biol. 8: 113–126
Geisberg J V., Moqtaderi Z, Fan X, Ozsolak F & Struhl K (2014) Global analysis of mRNA
isoform half-lives reveals stabilizing and destabilizing elements in yeast. Cell 156: 812–
crPage 18 | 22
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
598
599
824
Gerber AP, Herschlag D & Brown PO (2004) Extensive association of functionally and
600
cytotopically related mRNAs with Puf family RNA-binding proteins in yeast. PLoS Biol.
601
2: e79
602
603
604
Goldstrohm AC, Seay DJ, Hook BA & Wickens M (2007) PUF protein-mediated
deadenylation is catalyzed by Ccr4p. J. Biol. Chem. 282: 109–114
Gupta I, Clauder-Münster S, Klaus B, Järvelin AI, Aiyar RS, Benes V, Wilkening S, Huber W,
605
Pelechano V & Steinmetz LM (2014) Alternative polyadenylation diversifies post-
606
transcriptional regulation by selective RNA-protein interactions. Mol. Syst. Biol. 10: 719
607
He F, Li X, Spatrick P, Casillo R, Dong S & Jacobson A (2003) Genome-Wide Analysis of
608
mRNAs Regulated by the Nonsense-Mediated and 5'to 3' mRNA Decay Pathways in
609
Yeast. Mol. Cell 12: 1439–1452
610
Hoekema A, Kastelein ROBA, Vasser M & Boer HADE (1987) Codon Replacement in the
611
PGKI Gene of Saccharomyces cerevisiae : Experimental Approach To Study the Role of
612
Biased Codon Usage in Gene Expression. 7: 2914–2924
613
Holmes KJ, Klass DM, Guiney EL & Cyert MS (2013) Whi3, an S. cerevisiae RNA-binding
614
protein, is a component of stress granules that regulates levels of its target mRNAS.
615
PLoS One 8: 10–12
616
Hsu CL & Stevens A (1993) Yeast cells lacking 5’-->3' exoribonuclease 1 contain mRNA
617
species that are poly(A) deficient and partially lack the 5' cap structure. Mol. Cell. Biol.
618
13: 4826–4835
619
620
621
622
623
Huch S & Nissan T (2014) Interrelations between translation and general mRNA degradation
in yeast. Wiley Interdiscip. Rev. RNA 5: 747–763
Hug N, Longman D & Cáceres JF (2015) Mechanism and regulation of the nonsensemediated decay pathway. Nucleic Acids Res. 44: 1483–1495
Jungreis I, Lin MF, Spokony R, Chan CS, Negre N, Victorsen A, White KP & Kellis M (2011)
624
Evidence of abundant stop codon readthrough in Drosophila and other metazoa.
625
Genome Res. 21: 2096–2113
626
627
628
629
630
Kozak M (1986) Point mutations define a sequence flanking the AUG initiator codon that
modulates translation by eukaryotic ribosomes. Cell 44: 283–292
Kozak M (1987) At least six nucleotides preceding the AUG initiator codon enhance
translation in mammalian cells. J. Mol. Biol. 196: 947–950
LaGrandeur T & Parker R (1999) The cis acting sequences responsible for the differential
631
decay of the unstable MFA2 and stable PGK1 transcripts in yeast include the context of
632
the translational start codon. RNA 5: 420–433
633
634
Lee RC (1993) The C . elegans Heterochronic Gene lin-4 Encodes Small RNAs with
Antisense Complementarity to & II-14. Cell 75: 843–854
crPage 19 | 22
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
635
636
Leeds P, Wood JM, Lee B & Culbertson MR (1992) Gene Products That Promote mRNA
Turnover in Saccharomyces cerevisiaet. Mol. Cell. Biol. 12: 2165–2177
637
Lorenz R, Bernhart SH, Höner zu Siederdissen C, Tafer H, Flamm C, Stadler PF, Hofacker
638
IL, Thirumalai D, Lee N, Woodson S, Klimov D, Tinoco I, Uhlenbeck O, Levine M,
639
Tinoco I, Borer P, Dengler B, Levine N, Uhlenbeck O, Crothers D, et al (2011)
640
ViennaRNA Package 2.0. Algorithms Mol. Biol. 6: 26
641
McCaughan KK, Brown CM, Dalphin ME, Berry MJ & Tate WP (1995) Translational
642
termination efficiency in mammals is influenced by the base following the stop codon.
643
Proc. Natl. Acad. Sci. U. S. A. 92: 5431–5435
644
645
646
Mishima Y & Tomari Y (2016) Codon Usage and 3′ UTR Length Determine Maternal mRNA
Stability in Zebrafish. Mol. Cell 61: 874–885
Muhlrad D, Decker CJ & Parker R (1994) Deadenylation of the unstable mRNA encoded by
647
the yeast MFA2 gene leads to decapping followed by 5’ ??? 3' digestion of the
648
transcript. Genes Dev. 8: 855–866
649
650
651
652
Muhlrad D, Decker CJ & Parker ROY (1995) Turnover Mechanisms of the Stable Yeast
PGK1 mRNA. Mol. Cell. Biol. 15: 2145–2156
Olivas W & Parker R (2000) The Puf3 protein is a transcript-specific regulator of mRNA
degradation in yeast. EMBO J. 19: 6602-6611
653
Parker R (2012) RNA degradation in Saccharomyces cerevisae. Genetics 191: 671–702
654
Pelechano V, Wei W & Steinmetz LM (2013) Extensive transcriptional heterogeneity
655
656
657
658
revealed by isoform profiling. Nature 497: 127–31
Pilkington GR & Parker R (2008) Pat1 contains distinct functional domains that promote Pbody assembly and activation of decapping. Mol. Cell. Biol. 28: 1298–1312
Presnyak V, Alhusaini N, Chen Y-H, Martin S, Morris N, Kline N, Olson S, Weinberg D, Baker
659
KE, Graveley BR & Coller J (2015) Codon Optimality Is a Major Determinant of mRNA
660
Stability. Cell 160: 1111–1124
661
Rabani M, Raychowdhury R, Jovanovic M, Rooney M, Stumpo DJ, Pauli A, Hacohen N,
662
Schier AF, Blackshear PJ, Friedman N, Amit I & Regev A (2014) High-Resolution
663
Sequencing and Modeling Identifies Distinct Dynamic RNA Regulatory Strategies. Cell
664
159: 1698–1710
665
Radhakrishnan A, Chen Y-H, Martin S, Alhusaini N, Green R & Coller J (2016) The DEAD-
666
Box Protein Dhh1p Couples mRNA Decay and Translation by Monitoring Codon
667
Optimality. Cell 167: 122–128.e9
668
669
670
671
Roy B & Jacobson A (2013) The intimate relationships of mRNA decay and translation.
Trends Genet. 29: 691–699
Sabi R & Tuller T (2014) Modelling the Efficiency of Codon-tRNA Interactions Based on
Codon Usage Bias. DNA Res.: 1–15
crPage 20 | 22
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
672
673
674
675
676
Schwalb B, Michel M, Zacher B, Demel C, Tresch A & Gagneur J (2016) TT-seq maps the
human transient transcriptome. Science 352: 1225-1228
Schwanhäusser B, Busse D & Li N (2011) Global quantification of mammalian gene
expression control. Nature 473: 337–342
Schwartz DC & Parker R (1999) Mutations in translation initiation factors lead to increased
677
rates of deadenylation and decapping of mRNAs in Saccharomyces cerevisiae. Mol.
678
Cell. Biol. 19: 5247–5256
679
Shalem O, Sharon E, Lubliner S, Regev I, Lotan-Pompan M, Yakhini Z & Segal E (2015)
680
Systematic Dissection of the Sequence Determinants of Gene 3’ End Mediated
681
Expression Control. PLOS Genet. 11: e1005147
682
She M, Decker CJ, Svergun DI, Round A, Chen N, Muhlrad D, Parker R & Song H (2008)
683
Structural Basis of Dcp2 Recognition and Activation by Dcp1. Mol. Cell 29: 337–349
684
Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth
685
J, Hillier LW, Richards S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ, Miller W &
686
Haussler D (2005) Evolutionarily conserved elements in vertebrate, insect, worm, and
687
yeast genomes. Genome Res. 15: 1034–1050
688
Sun M, Schwalb B, Pirkl N, Maier KC, Schenk A, Failmezger H, Tresch A & Cramer P (2013)
689
Global analysis of Eukaryotic mRNA degradation reveals Xrn1-dependent buffering of
690
transcript levels. Mol. Cell 52: 52–62
691
Tucker M, Valencia-sanchez MA, Staples RR, Chen J, Denis CL & Parker R (2001) The
692
Transcription Factor Associated Ccr4 and Caf1 Proteins Are Components of the Major
693
Cytoplasmic mRNA Deadenylase in Saccharomyces cerevisiae. Cell 104: 377–386
694
Tuller T & Zur H (2014) Multiple roles of the coding sequence 5’ end in gene expression
695
696
697
698
regulation. Nucleic Acids Res. 43: 13–28
Vasudevan S & Peltz SW (2001) Regulated ARE-mediated mRNA decay in Saccharomyces
cerevisiae. Mol. Cell 7: 1191–1200
Vogel C, Abreu R de S, Ko D, Le S-YY, Shapiro BA, Burns SC, Sandhu D, Boutz DR,
699
Marcotte EM & Penalva LO (2010) Sequence signatures and mRNA concentration can
700
explain two-thirds of protein abundance variation in a human cell line. Mol. Syst. Biol. 6:
701
400
702
703
704
Wang X, Hou J, Quedenau C, Chen W (2016) Pervasive isoform-specific translational
regulation via alternative transcription start sites in mammals. Mol. Syst. Biol. 12: 875
Williams I, Richardson J, Starkey A & Stansfield I (2004) Genome-wide prediction of stop
705
codon readthrough during translation in the yeast Saccharomyces cerevisiae. Nucleic
706
Acids Res. 32: 6605–6616
707
708
Yamashita A, Chang T-C, Yamashita Y, Zhu W, Zhong Z, Chen C-YA & Shyu A-B (2005)
Concerted action of poly(A) nucleases and decapping enzyme in mammalian mRNA
crPage 21 | 22
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
709
turnover. Nat. Struct. Mol. Biol. 12: 1054–1063
710
Zeisel A, Köstler WJ, Molotski N, Tsai JM, Krauthgamer R, Jacob-Hirsch J, Rechavi G, Soen
711
Y, Jung S, Yarden Y & Domany E (2011) Coupled pre-mRNA and mRNA dynamics
712
unveil operational strategies underlying transcriptional responses to stimuli. Mol. Syst.
713
Biol. 7: 529
714
715
716
717
Zhou X, Carbonetto P & Stephens M (2013) Polygenic Modeling with Bayesian Sparse
Linear Mixed Models. PLoS Genet. 9: e1003264
Zur H & Tuller T (2013) New Universal Rules of Eukaryotic Translation Initiation Fidelity. 9:
e1003136
718
crPage 22 | 22
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
Sequence features explain most of the mRNA stability variation across genes
Jun Cheng, Žiga Avsec, Julien Gagneur
November 2016
1
Main Figures
Data
Genome-wide mRNA half-lives in S. cerevisiae: wild-type + 34 strains knocked out for RNA degradation pathway genes
(Sun et al. Mol. Cell. 2013)
Analysis
Dependence on degradation pathways
Cis-regulatory elements
AUG
AAA AUG
…CGA…GGU…
UGAC
AUAUUC AAAAAA…
• Novel elements discovery
• Integrate novel and known elements
• Single-nucleotide effect estimation
Integrative model
Out-of-sample predictions
Relative contributions
CRE1
Single-nucleotide effects
...CATACGATA...
...CATAGGATA...
CRE2
CRE3
Fold-change
Percentage of variance explained by (%)
Figure 1: Study overview. The goal of this study is to discover and integrate cis-regulatory mRNA elements affecting mRNA
stability and assess their dependence on mRNA degradation pathways. Data) We obtained S. cerevisiae genome-wide half-life data
from wild-type (WT) as well as from 34 knockout strains from Sun et al. 2013. Each of the knockout strains has one gene closely
related to mRNA degradation pathways knocked out. Analysis) We systematically searched for novel sequence features associating
with half-life from 5’UTR, start codon context, CDS, stop codon context, and 3’UTR. Effects of previously reported cis-regulatory
elements were also assessed. Moreover, we assessed the dependencies of different sequence features on degradation pathways by
analyzing their effects in the knockout strains. Integrative model) We build a statistical model to predict genome-wide half-life
solely from mRNA sequence. This allowed the quantification of the relative contributions of the sequence features to the overall
variation across genes and assessing the sensitivity of mRNA stability with respect to single-nucleotide variants.
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
A
Stabilizing elements
Destabilizing elements
Elements effect sign context dependent
TSS
CDS
5' UTR
Stop codon context
Start codon context
ATG
AAACAAA
Secondary structure
GC-content
Length
B
3' UTR
Codon usage
GC-content
Length
5' UTR
TGTAAATA (Puf3)
TGCAT (Whi3)
TTTTTTA
ATATTC
GC-content
Length
3' UTR
ATG (uAUG)
AAACAAA
log10(FDR)
Enriched
ATATTC
4
2
TGCAT
0
-2
-4
TGTAAATA
Depleted
TTTTTTA
TGTAAATA
TGCAT
ATATTC
AAACAAA
ATG (uAUG)
TTTTTTA
0
200
400
600
800
1000
Number of mRNAs
Figure 2: Overview of identified or collected sequence features. (A) Sequence features that were identified or collected
from different sequence regions in this study. When applicable, stabilizing elements are shown in blue, destabilizing in red. (B)
Co-occurrence significance (FDR, Fisher test p-value corrected with Benjamini-Hochberg) between different motifs (left). Number
of occurrences among the 4,388 mRNAs (right).
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
A
WT
xrn1Δ
upf2Δ
upf3Δ
40
3631
20
3648
3647
10
739
3648
738
739
5
P < 2e-16
Median FC = 1.37
1
no uAUG
P = 0.022
Median FC = 1.07
with uAUG
no uAUG
S. cerevisiae
B
501
238
5
20
1
1
Number of uAUGs
>=2
3649
203
P = 0.75
519
5
no PTC
no uAUG
in-frame
no PTC
with PTC
out-frame
PTC in CDS
P = 1.7e-13
150
P = 0.098
17
10
with uAUG
S. pombe
D
40
1
0
no uAUG
P < 2e-16
P = 0.53
Half-life [min]
10
with uAUG
P = 8e-13
P = 0.013
3648
P = 8e-04
Median FC = 1.05
P < 2e-16
40
20
no uAUG
S. cerevisiae
150
P < 2e-16
150
P = 1e-04
Median FC = 1.09
with uAUG
C
P < 2e-16
half-life [min]
738
uORF
Half-life [min]
Half-life [min]
150
P = 0.99
40
20
2051
11
10
1377
196
5
1
no PTC
no uAUG
in-frame
no PTC
with PTC
out-frame
PTC in CDS
uORF
Figure 3: Upstream AUG codon (uAUG) destabilize mRNA. (A) Distribution of mRNA half-life for mRNAs without
uAUG (left) and with at least one uAUG (right) in, from left to right: wild type, XRN1, UPF2 and UPF3 knockout S. cerevisiae
strains. Median fold-change (Median FC) calculated by dividing the median of the group without uAUG with the group with
uAUG. (B) Distribution of mRNA half-lives for mRNAs with zero (left), one (middle), or more (right) uAUGs in S. cerevisiae.
(C) Distribution of mRNA half-lives for S. cerevisiae mRNAs with, from left to right: no uAUG, with one in-frame uAUG but no
cognate premature termination codon, with one out-of-frame uAUG and one cognate premature termination codon in the CDS,
and with one uAUG and one cognate stop codon in the 5’UTR (uORF). (D) Same as in (C) for S. pombe mRNAs. All p-values
were calculated with Wilcoxon rank-sum test. Numbers in the boxes indicate number of members in the corresponding group.
Boxes represent quartiles, whiskers extend to the highest or lowest value within 1.5 times the interquartile range and horizontal
bars in the boxes represent medians. Data points falling further than 1.5-fold the interquartile distance are considered outliers and
are shown as dots.
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
S. cerevisiae
P = 9.2e-08
40
2736
872
419
360
5
1
T
C
Half-life [min]
5 20 100
Deadenylation
Decapping
Exosome
RBP
Surveillance
WT
XRN1
Knockouts
S. cerevisiae
D
P = 1.8e-09
P = 0.0049
P = 7.5e-07
150
Half-life [min]
Spearman rho=0.55
R-squared=0.3
*
Stop_codon
40
20
486
10
798
624
151
227
378
TAA
294
141
248
491
442
108
TAG
TGA
5
TGAC
TGAT
TGAA
stop codon\+1
TGAG
TAGC
TAGT
TAGA
0.35
0.45 0.55
0.70
S. cerevisiae-specific tAI
TAGG
1
1
TAAC
B
G
start codon\-3
Group
0
TAAT
A
**
**
20
TAAA
10
*
40
***
TAAG
20
60
WT
ccr4Δ
caf40Δ
not3Δ
pop2Δ
pan2Δ
pan3Δ
dhh1Δ
pat1Δ
lsm1Δ
dcs1Δ
lsm7Δ
lsm6Δ
scd6Δ
edc3Δ
dcs2Δ
edc2Δ
rrp6Δ
rrp47Δ
ski3Δ
ski2Δ
ski7Δ
ski8Δ
pub1Δ
puf4Δ
puf6Δ
puf3Δ
puf5Δ
puf1Δ
puf2Δ
upf3Δ
upf2Δ
hbs1Δ
dom34Δ
xrn1Δ
P = 6.9e-12
150
half-life [min]
C
P = 1.7e-11
Codon-explained-variance (%)
A
Figure 4: Translation initiation, elongation and termination features associate with mRNA half-life. (A) Distribution
of half-life for mRNAs grouped by the third nucleotide before the start codon. Group sizes (numbers in boxes) show that nucleotide
frequency at this position positively associates with half-life. (B) mRNA half-life (y-axis) versus species-specific tRNA adaptation
index (sTAI) (x-axis). (C) mRNA half-life explained variance (y-axis, Materials and Methods) in wild-type (WT) and across all 34
knockout strains (grouped according to their functions). Each blue dot represents one replicate, bar heights indicate means across
replicates. Bars with a red star are significantly different from the wild type level (FDR <0.1, Wilcoxon rank-sum test, followed by
Benjamini-Hochberg correction). (D) Distribution of half-life for mRNAs grouped by the stop codon and the following nucleotide.
Colors represent three different stop codons (TAA, TAG and TGA), within each stop codon group, boxes are shown in G, A, T,
C order of their following base. Only the P-values for the most drastic pairwise comparisons (A versus C within each stop codon
group) are shown. All p-values in boxplots were calculated with Wilcoxon rank-sum test. Boxplots computed as in Fig 3.
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
A
ATATTC
FDR=1.96e−05
TGCAT
FDR=6.94e−04
TGTAAATA
FDR=3.2e−05
TTTTTTA
FDR=0.086
150
Half-life [min]
40
20
3797
10
529
62
1
>=2
3492
4179
769
100
27
1
2
>=3
94
610
3684
209
5
1
0
Motif density
B
0.4
0.3
0.2
0.1
0.0
0
ATATTC
150
100
50
TGCAT
0
150
100
50
Effects on half-life
base
ATATTC
>=1
0
TGTAAATA
0
150
100
Distance from poly(A) site
C
1.3
1.2
1.1
1.0
0.9
0
Occurence per transcript
A
C
G
TGCAT
50
1
>=2
TTTTTTA
0
150
100
50
0
T
TGTAAATA
TTTTTTA
-2 -1 1 2 3 4 5 6 +1 +2
-2 -1 1
2
3
4
5 +1 +2
-2 -1 1 2 3 4 5 6 7 8 +1 +2
-2 -1 1 2 3 4 5 6 7 +1 +2
-2 -1 1 2 3 4 5 6 +1 +2
-2 -1 1
2
3
4
5 +1 +2
-2 -1 1 2 3 4 5 6 7 8 +1+2
-2 -1 1 2 3 4 5 6 7 +1 +2
Frequency
D
1.00
0.75
0.50
0.25
0.00
conservation
E
ATATTC
0.70
0.65
0.60
0.55
-2 -1 1 2 3 4 5 6 +1 +2
0.325
0.300
0.275
0.250
0.225
Position
TGCAT
TGTAAATA
0.9
0.22
0.7
2
3
4
5 +1 +2
TTTTTTA
0.24
0.8
-2 -1 1
0.26
-2 -1 1 2 3 4 5 6 7 8 +1+2
Position
0.20
-2 -1 1 2 3 4 5 6 7 +1 +2
Figure 5: 3’UTR half-life determinant motifs in S. cerevisiae. (A) Distribution of half-lives for mRNAs grouped by the
number of occurrence(s) of the motif ATATTC, TGCAT (Whi3), TGTAAATA (Puf3) and TTTTTTA respectively in their 3’UTR
sequence. Numbers in the boxes represent the number of members in each box. FDR were reported from the linear mixed effect
model (Materials and Methods). (B) Fraction of transcripts containing the motif (y-axis) within a 20-bp window centered at a
position (x-axis) with respect to poly(A) site for different motifs (facet titles). Positional bias was not observed when aligning 3’UTR
motifs with respect to the stop codon. (C) Prediction of the relative effect on half-life (y-axis) for single-nucleotide substitution
in the motif with respect to the consensus motif (y=1, horizontal line). The motifs were extended 2 bases at each flanking site
(positions +1, +2, -1, -2). (D) Nucleotide frequency within motif instances, when allowing for one mismatch compared to the
consensus motif. (E) Mean conservation score (phastCons, Materials and Methods) of each base in the consensus motif with 2
flanking nucleotides (y-axis).
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
S. cerevisiae
S. pombe
R-squared=0.60
R-squared=0.42
50
20
1
5
20
50
Measured half-life [min]
100
B
5
Measured half-life [min]
A
50
100
1
Predicted half-life [min]
C
Individual
Cumulative
Drop
Codon usage
CREs
55.90
55.92
44.23
uAUG
4.46
56.20
0.05
3' UTR motifs
5.53
57.59
1.29
UTR5_length
4.24
58.41
0.56
CDS_length
3.70
59.82
1.58
GC_content_UTR5 1.27
60.09
0.22
0.28
60.17
0.10
50
Effect of single-nucleotide variations
1.3
1.2
1.1
1.0
U
C
D
S_
UTR3_length
20
Predicted half-life [min]
D
Contribution of each sequence features (%)
5
le
ng
TR
th
3_
le
ng
U
TR
th
5_
le
ng
th
G
C
_c
co
on
do
te
n
nt
_U
TR
5
uA
U
G
TT
TT
TT
A
TG
CA
T
AT
AT
TC
TG
TA
AA
TA
20
Expected half-life fold-change
5
E
Figure 6: Genome-wide prediction of mRNA half-lives from sequence features and analysis of the contributions.
(A-B) mRNA half-lives predicted (x-axis) versus measured (y-axis) for S. cerevisiae (A) and S. pombe (B) respectively. (C)
Contribution of each sequence feature individually (Individual), cumulatively when sequentially added into a combined model
(Cumulative) and explained variance drop when each single feature is removed from the full model separately (Drop). Values
reported are the mean of 100 times cross-validated evaluation (Materials and Methods). (D) Expected half-life fold-change of
single-nucleotide variations on sequence features. For length and GC, dot represent median half-life fold change of one nucleotide
shorter or one G/C to A/T transition respectively. For codon usage, each dot represents median half-life fold-change of one type
of synonymous mutation, all kinds of synonymous mutations are considered. For uAUG, each dot represents median half-life foldchange of mutating out one uAUG. For motifs, each dot represents median half-life fold-change of one type of nucleotide transition
at one position on the motif (Materials and Methods). Medians are calculated across all mRNAs. (E) Overview of conclusions.
2
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
ExpandednotView
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
S. cerevisiae
S. pombe
Spearman rho=-0.17
Spearman rho=-0.27
B
50
10
half-life [min]
200
50
5 10
1
1
2
2
half-life [min]
P = 2.1e-11
200
P = 3.4e-12
5
A
500
2000
10 20
S. cerevisiae
Spearman rho=-0.23
C
500
2000
P = 4.2e-15
200
half-life [min]
200
50
5 10
half-life [min]
1
1
2
2
200
S. pombe
Spearman rho=-0.32
D
P < 2e-16
50
5'UTR length
50
50 100
5'UTR length
10
20
5
10
500 1000
5000
CDS length
100
S. cerevisiae
Spearman rho=-0.06
P = 5.7e-12
200
half-life [min]
200
50
5 10
1
1
2
2
half-life [min]
S. pombe
Spearman rho=-0.23
F
P = 1.4e-08
2000 5000
50
E
500
CDS length
10
200
5
100
500
10
50
1000
S. pombe
Spearman rho=-0.24
H
P < 2e-16
-50
half-life [min]
-100
-1
-2
-10
200
3'UTR length
-200
P = 7.8e-12
5
-20
50
3'UTR length
S. cerevisiae
Spearman rho=-0.21
G
half-life [min]
5
-5
1
1e-02
1e-01
1e+00
1e+01
5' Free Energy
1e+02
2
5
10 20
50
200 500
5' Free Energy
Figure EV1: Length of 5’UTR, CDS and 3’UTR as well as 5’ folding energy correlate with mRNA half-life. (A-B)
5’UTR length (x-axis) versus half-life (y-axis) for S. cerevisiae (A) and S. pombe (B). (C-D) CDS length (x-axis) versus half-life
(y-axis) for S. cerevisiae (C) and S. pombe (D). (E-F) 3’UTR length (x-axis) versus half-life (y-axis) for S. cerevisiae (E) and S.
pombe (F). (G-H) 5’ free energy (x-axis) versus half-life (y-axis) for S. cerevisiae (G) and S. pombe (H).
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
S. cerevisiae
Spearman rho=-0.11
A
S. pombe
Spearman rho=0.16
P < 2e-16
1
1
2
2
5
5
10
10
50
50
half-life [min]
half-life [min]
200
200
P = 2.8e-14
0.1
0.2
0.3
0.4
0.6
GC content 5'UTR
0.2
0.3
0.4
0.5
0.7
GC content 5'UTR
S. cerevisiae
Spearman rho=0.27
S. pombe
Spearman rho=0.24
P < 2e-16
1
1
2
2
5
5
10
10
50
50
half-life [min]
half-life [min]
200
200
P < 2e-16
0.40 0.45
0.55
0.30
S. cerevisiae
Spearman rho=-0.17
0.35
0.40
0.45
0.50
GC content CDS
S. pombe
Spearman rho=0.05
P < 2e-16
P = 0.012
1
1
2
2
5
5
10
10
50
50
half-life [min]
half-life [min]
200
E
0.35
GC content CDS
200
0.30
0.2
0.3
0.4
GC content 3'UTR
0.5
0.15
0.20
0.25
0.30
GC content 3'UTR
0.40
Figure EV2: GC content of 5’UTR, CDS and 3’UTR correlate with mRNA half-life (A-B) 5’UTR GC content (x-axis)
versus half-life (y-axis) for S. cerevisiae (A) and S. pombe. (C-D) CDS GC content (x-axis) versus half-life (y-axis) for S. cerevisiae
(C) and S. pombe (D). (E-F) 3’UTR GC content (x-axis) versus half-life (y-axis) for S. cerevisiae (E) and S. pombe (F).
Information content
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available
under a CC-BY-NC-ND 4.0 International license.
A
S. cerevisiae
0.6
0.5
0.4
0.3
0.2
0.1
0
−6
−5
−4
−3
−2
−1
A
Information content
B
T
G
1
2
3
4
5
6
1
2
3
4
5
6
S. pombe
0.6
0.5
0.4
0.3
0.2
0.1
0
−6
−5
−4
−3
−2
−1
A
T
G
Position
C
F
S. cerevisiae
150
P = 7.5e−05
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
10
2241
S. pombe
P = 0.013
●
150
●
●
●
●
●
●
●
●
●
●
●
●
40
20
stop codon\+1
2146
5
TGAC
T
●
TGAA
G
start codon\+3
E
half−life [min]
●
46
●
●
●
●
TGAG
A
●
●
●
●
●
●
TAAG
●
●
●
●
●
●
148 284 241
102
●
1
1
P = 0.0076
TAGC
5
232
TAGT
505
316
118
166
TAGA
10
294
692
TAAC
578
321 948
TAAT
2237
40
20
10
5
TAAA
half−life [min]
40
P = 0.0091
TGAT
150
●
●
20
P = 8.2e−08
half−life [min]
●
P = 8e−06
P < 2e−16
P = 7.6e−05
half−life [min]
150
S. Pombe
D
TAGG
S. Pombe
C
●
●
●
40
20
10
1840
1774
5
●
●
●
●
●
●
1
FALSE
TRUE
3'UTR with in−frame stop codon
●
●
●
●
●
●
1
FALSE
TRUE
3'UTR with in−frame stop codon
Figure EV3: Start codon context (Kozak sequence) and S. pombe half-life versus start and stop codon context.
(A-B) Start codon context (Kozak sequence) generated from 4388 S. cerevisiae genes (A) and 3713 S. pombe genes (B). (C)
Distribution of half-life for mRNAs grouped by the third nucleotide before the start codon for S. pombe. Group sizes (numbers in
boxes) show that nucleotide frequency at this position positively associates with half-life. (D) Distribution of half-life for mRNAs
grouped by the stop codon and the following nucleotide for S. pombe. Colors represent three different stop codons (TAA, TAG
and TGA), within each stop codon group, boxes are shown in G, A, T, C order of their following base. Only the P-values for the
most drastic pairwise comparisons (A versus C within each stop codon group) are shown. (E) Distribution of half-life for mRNAs
grouped by with or without additional 3’UTR in-frame stop codon for S. cerevisiae. 30 bases window after the main stop codon
was considered. (F) Same as (E) for S. pombe. All p-values in boxplot were calculated with Wilcoxon rank-sum test. Boxplots
computed as in Fig 3.
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
AAACAAA
FDR=0.049
A
half-life (mins)
150
40
20
228
4160
10
5
1
0
effects on half-life
B
Frequency
C
AAAAACAAAAA
1.3
1.2
1.1
1.0
0.9
0.8
1.00
0.75
0.50
0.25
0.00
base
A
C
G
T
1
2
3
4
5
6
7
positions
8
9 10 11
-2 -1 1 2 3 4 5 6 7 +1 +2
D
conservation
>= 1
AAACAAA
0.16
0.14
0.12
0.10
0.08
-2
-1
1
2
3
4
5
Position
6
7
+1
+2
Figure EV4: S. cerevisiae 5’UTR mRNA half-life associated motif. (A) Distribution of half-lives for mRNAs grouped by
the number of occurrence(s) of the motif AAACAAA in their 5’UTR sequence. Numbers in the boxes represent the number of
members in each box. FDR were reported from the linear mixed effect model (Materials and Methods). (B) Prediction of the
relative effect on half-life (y-axis) for single-nucleotide substitution in the motif with respect to the consensus motif (y=1, horizontal
line). The motifs were extended 2 bases at each flanking site (positions +1, +2, -1, -2). (C) Nucleotide frequency within motif
instances, when allowing for one mismatch compared to the consensus motif. (D) Mean conservation score (phastCons, Materials
and Methods) of each base in the consensus motif with 2 flanking nucleotides (y-axis).
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
xrn1Δ
dom34Δ
hbs1Δ
upf2Δ
upf3Δ
puf2Δ
puf1Δ
puf5Δ
puf3Δ
puf6Δ
puf4Δ
pub1Δ
ski8Δ
ski7Δ
ski2Δ
ski3Δ
rrp47Δ
rrp6Δ
edc2Δ
dcs2Δ
edc3Δ
scd6Δ
lsm6Δ
lsm7Δ
dcs1Δ
lsm1Δ
pat1Δ
dhh1Δ
pan3Δ
pan2Δ
pop2Δ
not3Δ
caf40Δ
ccr4Δ
WT
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
Effects
*
*
*
0.2
0.0
-0.2
*
*
*
*
*
*
0.4
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
A
TT
TT
TT AT
C
TG TC A
AT AT
AT A AG
TA TG e
TG - g
AC usa
TG on
d A
co C-3 gy
St ner A
e A
5' CA
A
AA G
U
uA
Figure EV5: Summary of CREs effect changes across all 34 knockouts comparing with WT. Colour represent the
relative effect size (motifs, St-3 C-A, TGAG-TGAC, uAUG), correlation (5’ folding energy) or explained variance (codon usage)
upon knockout of different genes (y-axis) (Materials and Methods for detailed description). Wild-type label is shown in the bottom
(WT) P-values calculated with Wilcoxon rank-sum test by comparing each mutant to wild-type level, multiple testing p-values
corrected with Bonferroni & Hochberg (FDR). Stars indicating significance of statistical testing (FDR <0.1). 5’ energy: correlation
of 5’end (5’UTR plus first 10 codons) folding energy with mRNA half-lives; St-3 C-A: relative median half-life difference between
genes with cytosine and adenine at start codon -3 position; TGAG-TGAC: relative median half-life difference between genes with
stop codon +1 TAAG and TGAC. codon usage: codon usage explained mRNA half-lives variance. uAUG: relative median half-life
difference between genes without and with upstream AUG in the 5’UTR (Materials and Methods)
codons
bioRxiv preprint first posted online Nov. 3, 2016; doi: http://dx.doi.org/10.1101/085522. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
AAT
AAC
TAT
GGC
GGA
GAT
GGG
TAC
GAC
TTT
TGT
TTC
TGC
CTC
CTA
AGC
AGT
GCG
GCT
GAG
GGT
CAT
CTT
TCG
CAC
GAA
TCT
TCA
GCC
CTG
GCA
CAG
ACG
CAA
ACT
ACC
TTA
CCT
CCC
CGG
ACA
AGG
CGC
TCC
ATC
ATT
CCA
CCG
AAA
AAG
AGA
TTG
CGA
CGT
ATA
AAC
AAT
TAC
GGT
GGA
GGG
GGT
GGC
GGG
GAC
GGT
GGA
GGC
TAT
GAT
TTC
TGC
TTT
TGT
CTT
CTA
CTG
CTGCTC
CTT TTA
AGT
AGC
GCTGCC
GCA
GCC
GCG
GCA
GAA
GGC
GGA
GGG
CAC
CTG
CTACTC
TCC
TCT
TCA
CAT
GAG
TCC
TCG
TCA
TCC
TCT
TCG
GCT
GCG
GCA
TTG
CTA CTC CTT
GCC
GCT
GCG
CAA
ACT
ACC
ACA
CAG
ACC
ACG
ACA
ACT
ACG
ACA
TTG
CTA
CCA
CCC
CCG
CCA
CCT
CCG
CGACGT
CGC
AGG
ACC
ACT
ACG
CGG
AGA
CGA
CGT
CGG
TCT
TCG
TCA
ATA
ATT
ATC
ATA
CCT
CCC
CCG
CCA
CCT
CCC
AAG
AAA
CGA
AGG
CTG
TTA
CGG CGC
AGA
CGC CGG
ATC
ATT
1.00
1.01
1.02
Expected half-life fold-change
CGT
CGA
1.03
Figure EV6: Predicted effects of synonymous codon transitions on half-life. Expected half-life fold-change (x-axis) at
each synonymous codon transitions. Each row represent transition from one codon (y-axis) to its synonymous partners. Only
synonymous codons that differ by one base were considered.