ABSTRACT
HU, QIWEN. Genome-wide Search for Elements that may Affect Translation in Arabidopsis
thaliana. (Under the direction of Steffen Heber).
Translation is a fundamental process in all living organisms. During translation,
messenger RNA is rewritten into protein. Experimental evidence suggests that there are
multiple control mechanisms that regulate translation. However, the exact nature and the
building blocks of translational control are often unknown. In this dissertation, we describe
several novel computational approaches to identify, on a genome-wide scale, elements that
affect translation.
First, we present a novel method to identify translated upstream open reading frames
(uORFs) in the model plant Arabidopsis thaliana. It is believed that the translation of a
uORF reduces the translational efficiency of the corresponding main coding region, but only
a few uORFs have been experimentally characterized, and the translation status of most uORFs is
unknown. Our approach employs semi-supervised learning in combination with stacked
classification to identify translated uORFs using ribosome footprinting data and additional
genome and transcriptome information. In a cross-validation study, the algorithm performs
very well. Using this approach, we discovered 5360 translated uORFs in 2051 genes. Further
analysis of our results showed that translated uORFs occur with high frequency in
multi-isoform genes, and that many translated uORFs are affected by alternative transcript start
sites or alternative splicing events. Association rule mining revealed several sequence
elements that are associated with the translation status of the uORFs. Our results suggest that
the translation of uORFs is a complex process that might be affected by multiple factors.
Next, we apply LASSO and random forest algorithms to search for additional
transcript features that affect translation in Arabidopsis thaliana. We have identified ~50
features and measured their influence on translation in different transcript regions. We found
that some features have a different impact on translation depending on the transcript region
where the feature is located. Often, features involved in the elongation step of translation
showed larger effects. Interestingly, we found that some features might have a different
effect on translation in transcripts that belong to different functional groups. This suggests
that transcripts might employ different mechanisms for the regulation of translation.
Taken together, this dissertation provides a comprehensive list of elements that affect
translation in Arabidopsis thaliana. We hope that our results will provide the basis for
predictive models of translation and contribute to a better understanding of protein
production in plants.
© Copyright 2016 Qiwen Hu
All Rights Reserved
Genome-wide Search for Elements that may Affect Translation in Arabidopsis thaliana
by
Qiwen Hu
A dissertation submitted to the Graduate Faculty of
North Carolina State University
in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy
Bioinformatics
Raleigh, North Carolina
2016
APPROVED BY:
_______________________________
Steffen Heber
Committee Chair
_______________________________
Jose M. Alonso
_______________________________
Dalia M. Nielsen
_______________________________
Eric A. Stone
_______________________________
Matthias F. Stallmann
DEDICATION
To my parents
BIOGRAPHY
Qiwen Hu was born in Hangzhou, China. She received her Bachelor's degree in Computer Science
and Technology from Ningbo Institute of Technology and her Master's degree in Biochemistry
and Molecular Biology from Shanghai University. During her time at Shanghai University,
she was motivated to pursue an advanced degree in bioinformatics. In 2011, she
came to Raleigh for her PhD studies at North Carolina State University. Since 2012, she has
conducted research in the field of computational biology and data mining under the direction
of Dr. Steffen Heber. Her research focuses on developing computational methods to identify
regulatory elements that affect translation at the genomic level.
ACKNOWLEDGMENTS
First of all, I would like to express my sincere gratitude to my advisor Dr. Steffen
Heber for his help and support throughout the research. This dissertation would not have
been possible without his invaluable advice and guidance. Steffen set an excellent example
for me by being a dedicated scientist and an enthusiastic teacher. It was an enjoyable and
remarkable experience for me to be his student as well as a teaching assistant for his classes.
I would also like to thank Dr. Jose Alonso, Dr. Dalia Nielsen, Dr. Eric Stone, Dr.
Lorena Bociu and Dr. Matthias Stallmann for serving on my advisory committee
and for their feedback on my research.
I would like to express my gratitude to my collaborators in the Alonso-Stepanova lab
for providing me with this excellent dataset. I thank Dr. Anna Stepanova for sharing her
thoughtful biological insights with me. I thank Dr. Catharina Merchante for sharing the data
with me.
I am also thankful to all the faculty, staff, post-docs and graduate students in BRC
for creating a wonderful environment for research and communication. I thank my friends
Yiwen Luo, Xiang Ji, Guozhu Zhang, Tao Jiang and Dr. Kuangyu Wang for their
companionship.
Finally, I would like to offer my deepest thanks to my beloved parents for their
unconditional support and encouragement.
TABLE OF CONTENTS
List of Tables
List of Figures
CHAPTER 1 INTRODUCTION
1.1 Translation and Translational regulation in eukaryotes
1.1.1 Translation mechanism in eukaryotes
1.1.2 Upstream open reading frames and how they affect translation
1.1.3 Other transcript elements involved in translational regulation
1.2 High throughput sequencing techniques for comprehensive transcription and
translation analysis
1.2.1 High throughput RNA sequencing
1.2.2 Ribosome Footprinting
CHAPTER 2 MACHINE LEARNING ALGORITHMS
2.1 Supervised learning algorithms
2.1.1 Linear regression
2.1.2 k-Nearest Neighbors
2.1.3 Support Vector Machine
2.1.4 Naive Bayes
2.1.5 Artificial Neuron Network
2.1.6 Decision Tree
2.1.6 Random Forest
2.2 Unsupervised learning algorithms
2.2.1 Clustering
2.2.2 Association rule mining
2.3 Performance evaluation
2.3.1 Confusion Matrix
2.3.1 ROC curve
CHAPTER 3 GENOME-WIDE SEARCH FOR TRANSLATED UPSTREAM OPEN
READING FRAMES IN ARABIDOPSIS THALIANA
3.1 Introduction
3.2 Material and Methods
3.2.1 Data preparation
3.2.2 Feature extraction
3.2.3 Semi-supervised Learning: Clustering based Classification
3.2.4 Clustering and Training Set Construction
3.2.5 Stacking Classification
3.2.6 Association Rule Mining
3.3 Result
3.3.1 Cluster Analysis
3.3.2 Performance Evaluation
3.3.3 Translated uORFs in Arabidopsis thaliana
3.3.4 Features associated with translated uORFs
3.4 Conclusion
CHAPTER 4 MINING TRANSCRIPT FEATURES RELATED TO TRANSLATION IN
ARABIDOPSIS USING LASSO AND RANDOM FOREST
4.1 Introduction
4.2 Material and Methods
4.2.1 Data Sources and Preparation
4.2.2 Feature Computation
4.2.3 Model Construction and Feature Selection
4.3 Result
4.3.1 Model Performance
4.3.2 Transcript Features Identified Through 5'-UTR, CDS and 3'-UTR
4.3.3 Transcript Features Identified through the Whole Transcript
4.3.4 The mRNA contribution for translation
4.4 Conclusion
CHAPTER 5 CONCLUSIONS AND FUTURE WORK
REFERENCES
LIST OF TABLES
Table 3.1 Correctly identified data points by base-level classifiers
Table 3.2 Overall classifier performance
Table 3.3 uORF from current studies
Table 3.4 Translated uORFs identified in Arabidopsis
Table 3.5 Rules of translated/untranslated uORFs
Table 4.1 Performance of Models
Table 4.2 Proportion of variability of translation that can be explained by mRNA levels and
transcript features
Table 4.3 mRNA level and transcript feature contribution in different functional…
LIST OF FIGURES
Figure 1.1 Structure of mature mRNA and transcript regions where translation occurs
Figure 1.2 Different scenarios that ribosome may skip a uORF
Figure 1.3 Regulation of SAMDC translation in plants
Figure 1.4 The Illumina sequencing technique
Figure 1.5 Overview of ribosome footprinting experiment
Figure 1.6 The difference of reads distribution between ribosome footprints and mRNA
Figure 2.1 3-Nearest Neighbor classification example
Figure 2.2 Example of training support vector machine
Figure 2.3 Hierarchical structure of Artificial neuron network
Figure 2.4 Example of decision tree model for the credit data
Figure 2.5 Random forest approach
Figure 2.6 Confusion Matrix for two-class classification
Figure 2.7 ROC curve
Figure 3.1 Translation stages and features characteristic for the different translation stage
Figure 3.2 Average silhouette value for different numbers of clusters k
Figure 3.3 Visualization of clusters generated by k-means clustering
Figure 3.4 Characteristics of genes that contain translated uORFs
Figure 4.1 Model construction algorithm
Figure 4.2 Results of importance of features identified by LR and RF
Figure 4.3 Important features identified by LR and RF based on whole transcripts
CHAPTER 1 INTRODUCTION
Translation is a fundamental process in all living organisms. During translation, the
ribosome decodes messenger RNA (mRNA) into a polypeptide chain that folds into a protein.
The amount of produced protein depends on the amount of mRNA that is translated, the
efficiency of translation, as well as various post-translational mechanisms, see (Lu, Vogel et
al. 2007) for a detailed review. Beyond the central role of translation in protein
synthesis, errors during translation may cause serious human diseases, including
Parkinson’s disease, Charcot-Marie-Tooth disease and X-linked dyskeratosis congenita
(Scheper, van der Knaap et al. 2007, Taymans, Nkiliza et al. 2015). Therefore, learning the
components and rules of translational regulation is of great interest for many biomedical and
bioengineering applications.
This dissertation focuses on designing algorithms for investigating translation in the
model plant Arabidopsis thaliana. The research builds largely on the recently
developed ribosome footprinting technique (Ribo-seq). Ribo-seq is based on deep sequencing
of ribosome-protected mRNA fragments and allows researchers to monitor and quantify
translation on a genome-wide scale with sub-codon resolution (Ingolia, Ghaemmaghami et al.
2009). We integrate Ribo-seq data and genomic features to construct robust models for
predicting translated uORFs and other transcript features that affect translation. In addition,
we mine features that are associated with uORF translation. The computational methods
developed in this research can be applied to other model organisms as well.
This dissertation is organized as follows. In Chapter 1, Section 1.1.1 gives a brief review
of translation in eukaryotes, including plants. Sections 1.1.2 and 1.1.3 summarize our
knowledge of elements that contribute to translational regulation, in particular uORFs, and
Section 1.2 describes the sequencing techniques (RNA-seq and Ribo-seq) used in this
dissertation. Chapter 2 describes the machine learning techniques used in this dissertation.
Chapter 3 describes a novel approach to identify translated upstream open reading
frames (uORFs). uORFs are open reading frames that appear in the 5’ untranslated region
(5’-UTR) of an mRNA. Translated uORFs are important regulators of gene expression that
attenuate the translation of the main ORF. Translated uORFs occur in various organisms and
often affect metabolic genes (von Arnim, Jia et al. 2014). While it is easy to identify uORFs
via simple pattern matching, it is challenging to differentiate between translated and
untranslated uORFs. Before the invention of ribosome footprinting, computational studies that
aimed to identify translated uORFs on a genome-wide scale relied on two strategies:
cross-species conservation, and classification-based approaches that use machine learning to
recognize characteristic features of experimentally validated functional uORFs.
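The pattern-matching step mentioned above can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the pipeline used in this dissertation: the function name find_uorfs, the 0-based coordinates, the restriction to ATG start codons, and the decision to skip in-frame ATGs that run into the main ORF (N-terminal extensions rather than separate uORFs) are all illustrative choices.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_uorfs(transcript, cds_start):
    """List candidate uORFs as (start, end, overlaps_main_orf) tuples.

    transcript: mRNA sequence, 5' to 3', written with A/C/G/T (U is accepted).
    cds_start:  0-based index of the main ORF start codon (= 5'-UTR length).
    end is exclusive and includes the stop codon.
    All names and conventions here are hypothetical, chosen for illustration.
    """
    seq = transcript.upper().replace("U", "T")
    uorfs = []
    for start in range(cds_start):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk downstream codon by codon to the first in-frame stop codon.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                end = pos + 3
                overlaps = end > cds_start
                in_frame_with_cds = (cds_start - start) % 3 == 0
                # Assumption: an in-frame ATG that reads into the main ORF
                # is an N-terminal extension, not a uORF, so it is skipped.
                if not (overlaps and in_frame_with_cds):
                    uorfs.append((start, end, overlaps))
                break
    return uorfs
```

Such a scan only enumerates candidates; it says nothing about whether a candidate is actually translated, which is the question addressed in Chapter 3.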
The first strategy is based on the assumption that uORFs without cross-species
conservation are less likely to encode functional peptides (Clamp, Fry et al. 2007).
Homology-based sequence search has been used successfully in many organisms, including
higher plants, human and yeast, and hundreds of conserved uORFs have been identified
(Cvijovic, Dalevi et al. 2007, Tran, Schultz et al. 2008, Skarshewski, Stanton-Cook et al.
2014). The approach has been implemented in the Bioinformatics tool uPEPperoni
(Skarshewski, Stanton-Cook et al. 2014). Classification-based approaches generate predictive
models of functional uORFs based on known examples of protein-coding uORFs. For
example, inductive logic programming was used to infer rules from experimentally
characterized uORFs in yeast in order to predict functional uORFs genome-wide (Selpi,
Bryant et al. 2009).
Although successful, these approaches have serious limitations. First, evolutionarily
conserved uORFs account for only a fraction of the translated uORFs. Second, only a
few uORFs have been experimentally characterized so far. Learning from a limited
number of known training examples may result in biased or even wrong conclusions. In
Chapter 3, we use a semi-supervised learning approach to address this problem. Semi-supervised
learning incorporates unlabeled data into training. If the data satisfy certain smoothness,
cluster, or manifold assumptions, semi-supervised learning can capture the characteristics
of the data more precisely than a purely supervised learning approach, and as a result it can
produce accurate predictions from only a few labeled data points; for a thorough review
see (Zhu 2008).
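To make the cluster assumption concrete, the following Python sketch clusters all data points, labeled and unlabeled, and then propagates the few known labels to every member of the same cluster. It is a toy illustration only and not the method of Chapter 3: the one-dimensional feature, the fixed number of two clusters, the majority vote, and the function names are assumptions made for this example.

```python
def two_means_1d(values, iterations=20):
    """Lloyd's k-means for k = 2 on one-dimensional data; returns two centers."""
    centers = [min(values), max(values)]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Keep the previous center if a cluster happens to be empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers


def cluster_then_label(values, known_labels):
    """Assign every point the majority label of the labeled points in its cluster.

    values:       one (hypothetical) feature value per data point.
    known_labels: dict mapping point index -> label for the few labeled points.
    Points in a cluster without any labeled member receive None.
    """
    centers = two_means_1d(values)
    assignment = [
        0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1 for v in values
    ]
    votes = [{}, {}]
    for index, label in known_labels.items():
        counts = votes[assignment[index]]
        counts[label] = counts.get(label, 0) + 1
    cluster_label = [max(c, key=c.get) if c else None for c in votes]
    return [cluster_label[a] for a in assignment]
```

With only two labeled points, one per cluster, every remaining point inherits a label, which is the sense in which unlabeled data contribute to training.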
In Chapter 4, statistical approaches were applied to mine transcript features that may
contribute to translational regulation in the Arabidopsis genome. Several elements located in the
transcript region were identified previously and have been shown experimentally to have
regulatory roles in controlling the translation of the transcript (Kozak 2001, Pestova and
Kolupaeva 2002, Kim, Lee et al. 2014). However, experiments can only target specific
regulatory elements, so it remains unclear if there are additional elements located in the
transcript region that also affect its translation.
Previous studies used linear regression to detect elements that contribute to
transcriptional and translational regulation based on high-throughput mRNA sequencing or
mass spectrometry data. Using this approach, several regulatory elements were identified in
Escherichia coli, S. cerevisiae, and Desulfovibrio vulgaris (Nie, Wu et al. 2006, Dvir, Velten
et al. 2013, Guimaraes, Rocha et al. 2014). However, when the data are high-dimensional and
only a few features influence translation, linear regression may produce predictions
with large variance, which reduces prediction accuracy. In addition, because mass
spectrometry is difficult and costly, protein levels are usually not measured on a genomic
scale but only for genes of specific interest, so it is unclear whether the identified
elements are relevant for all genes in the genome. In Chapter 4, we measure translation
efficiency on a genomic scale from Ribo-seq and RNA-seq data and apply LASSO and random
forest, first to select important transcript features that may influence the translation of
transcripts in Arabidopsis, and then to explore the contribution of each individual
transcript feature.
Chapter 5 summarizes the results of these projects and outlines directions for future studies.
In total, five publications are related to this research (Merchante, Brumos et al. , Howard,
Hu et al. 2013, Hu, Merchante et al. 2015, Qiwen, Merchante et al. 2015, Hu, Merchante et al.
2016). The main results of this dissertation are published in IEEE Transactions on
NanoBioscience (Hu, Merchante et al. 2016) and in the proceedings of Computational Advances
in Bio and Medical Sciences (Qiwen, Merchante et al. 2015).
1.1 Translation and Translational regulation in eukaryotes
1.1.1 Translation mechanism in eukaryotes
During translation ribosomes decode a mature mRNA into a chain of amino acids
which then folds into a functional protein. Usually, translation is divided into three stages:
initiation, elongation and termination. In eukaryotes, a typical mature mRNA (mature
transcript) consists of five parts: the 5' cap, which is important for the proper attachment
of the mRNA to ribosomes; the 5’ untranslated region (5’-UTR); the main coding sequence
(CDS); the 3’ untranslated region (3’-UTR); and a poly(A) tail, which protects the mRNA from
degradation. Figure 1.1 shows the structure of a typical mature mRNA. From a biophysical
point of view,
the three different stages of translation occur in different transcript parts. Translation
initiation mainly takes place in the 5’-UTR and the region between the 5’-UTR and the
translated coding sequence. Translation elongation occurs in the translated coding sequence,
and termination mainly occurs in the 3’-UTR (Zur and Tuller 2013).
[Figure 1.1: diagram of a mature mRNA with cap, 5' UTR, main coding sequence, 3' UTR and
poly(A) tail, annotated with the stages initiation, initiation and elongation, elongation,
and termination.]
Figure 1.1 Structure of mature mRNA and transcript regions where translation occurs.
The transcript regions 5’-UTR, CDS, and 3’-UTR correspond to the three stages of
translation - initiation, elongation, and termination.
Translation initiation:
In eukaryotes, translation initiation usually occurs in a cap-dependent manner: the 43S
ribosomal pre-initiation complex, which contains Met-tRNAi and the eIFs 1, 1A, 2, 3 and 5,
is recruited to the 5' cap with the help of the cap-binding factor eIF4E, eIF4G and eIF4A in
the eIF4F complex. The complex then scans through the 5’-UTR of the mRNA in an
ATP-dependent manner, looking for a start codon. The Met-charged initiator tRNA is positioned
in the P site of the ribosome by initiation factor eIF2. Recognition of the start codon triggers
the hydrolysis of GTP, followed by the release of eIF2-GDP and other eIFs. The 60S
subunit of the ribosome then joins the complex to form the 80S initiation complex, which is
ready for the elongation step (Malys and McCarthy 2011). In some cases, translation initiation
can also occur in a cap-independent mode, in which the ribosome is recruited to the mRNA via
an Internal Ribosome Entry Site (IRES). In this mode, the ribosome reaches the start codon
directly from the IRES with the help of IRES trans-acting factors, without scanning through
the 5’-UTR (Lopez-Lastra, Rivas et al. 2005).
Translation elongation:
At the end of initiation, the initiator tRNA occupies the P site of the ribosome while the
A site is ready to receive a new aminoacyl-tRNA. During elongation,
the incoming aminoacyl-tRNA enters the A site, where it is matched with the codon on the
mRNA. Upon recognition of the codon by its complement in the anticodon loop of the
tRNA, a peptide bond is formed between the peptide in the P site and the amino acid in the A
site. The ribosome then moves forward by three bases, and the spent tRNA is released from the E
site. This process is repeated until the ribosome reaches the stop codon.
Translation termination:
Translation termination occurs when the stop codon, which is recognized by the
release factor eRF1, moves into the A site. The polypeptide is then released from the tRNA,
the tRNA is released from the ribosome, and finally the two ribosomal subunits
dissociate from the mRNA.
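At the sequence level, the elongation and termination steps described above amount to reading the mRNA one codon (three bases) at a time from the start codon until the first in-frame stop codon. The following Python sketch mimics this decoding with the standard genetic code. It is a simplified illustration: the function name and return convention are assumptions, the input may contain only A, C, G, T or U, and no biochemical detail (tRNAs, release factors, ribosome sites) is modeled.

```python
# Standard genetic code, codons enumerated in T, C, A, G order; "*" marks stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(mrna, start):
    """Decode mrna codon by codon, beginning at the 0-based index start.

    Returns (peptide, stop_index). stop_index is the position of the stop
    codon, or None if the sequence ends before a stop codon is reached.
    """
    seq = mrna.upper().replace("U", "T")
    peptide = []
    for pos in range(start, len(seq) - 2, 3):  # advance by one codon per step
        amino_acid = CODON_TABLE[seq[pos:pos + 3]]
        if amino_acid == "*":  # stop codon: termination, peptide is released
            return "".join(peptide), pos
        peptide.append(amino_acid)  # elongation: extend the growing chain
    return "".join(peptide), None
```

For instance, calling translate on the sequence GGAUGGCUAAAUAAGG with start index 2 reads the codons AUG, GCU and AAA and stops at UAA.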
There are various factors that affect translation and may play a regulatory role in
protein production. Since translation initiation is believed to be a rate-limiting step of
translation, earlier studies focused on identifying regulatory elements near the initiation codon.
However, in recent years, several regulatory elements have been identified that affect
translation elongation and termination. Major transcript features that affect translation
include upstream open reading frames, codon usage, and sequence context near the start
codon. However, this list is likely incomplete. The following sections provide a detailed
description of the currently known regulatory elements identified in different organisms and
possible regulatory mechanisms for these elements.
1.1.2 Upstream open reading frames and how they affect translation
An important element of translational control is the existence of upstream open
reading frames (uORFs). uORFs are open reading frames that appear in the 5’-UTR of an
mRNA. Studies have shown that uORFs are often involved in the translational regulation of
the downstream main open reading frame (CDS or main ORF) (Morris and Geballe 2000,
Calvo, Pagliarini et al. 2009, Jeon and Kim 2010). Functional uORFs can regulate the protein
levels of genes via four different mechanisms: ribosome leaky scanning, reinitiation,
dissociation and stalling (Morris and Geballe 2000). During translation initiation, a ribosome
scans through the mRNA until it encounters a start codon. The presence of a uORF can lead
either to translation initiation at the uORF or to leaky scanning past the uORF start codon, a
phenomenon in which the initiation codon is skipped and the small subunit of the ribosome
continues scanning the mRNA. If the ribosome initiates translation at the start codon of the
uORF, there are three possibilities: first, the ribosome might dissociate from the mRNA
when the stop codon of the uORF is reached; second, the small subunit of the ribosome
might continue to scan the mRNA beyond the stop codon of the uORF and reinitiate
translation downstream; or third, the ribosome might stall on the uORF after interacting
with regions encoding specific peptides (Morris and Geballe 2000, von Arnim, Jia et al. 2014).
It has been shown that the extent of leaky scanning may be affected by the sequence
context near the start codon (Lukaszewicz, Feuermann et al. 2000, Kozak 2002). Depending
on the specific sequence context, ribosomes are more or less likely to recognize the start
codon of a uORF and initiate translation (David-Assael, Saul et al. 2005, von Arnim, Jia et
al. 2014). However, it is still unclear what exactly determines whether a ribosome will pass
a uORF. Recent studies identified three scenarios in which a ribosome ignores the start codon
of a uORF (Figure 1.2) (von Arnim, Jia et al. 2014). First, a ribosome will bypass the start
codon while
it is actively translating another uORF in a different reading frame. Second, a ribosome may
bypass the uORF via ribosome recruitment through an IRES which is located downstream of
the start codon of the uORF (Somers, Poyry et al. 2013). Third, a ribosome may skip the
uORF by shunting, in which a hairpin loop structure located near the uORF region causes the
small subunit of the ribosome to ‘jump’ over the hairpin loop without unfolding it. Cases of
shunting are rare and usually found in viral transcripts (Babinger, Hallmann et al. 2006, von
Arnim, Jia et al. 2014).
Figure 1.2 Different scenarios that ribosome may skip a uORF according to (von Arnim,
Jia et al. 2014). A: Ribosome may not translate a uORF if it is actively translating
another uORF in a different frame. B: Ribosome may skip a uORF through IRES. C:
Ribosome may skip a uORF by shunting.
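The dependence of leaky scanning on the sequence context of the start codon, discussed above, is often summarized by two positions: a purine at position -3 and a G at position +4, counting the A of the start codon as +1. The Python sketch below grades a start codon by these two positions. It is a rough heuristic in the spirit of Kozak's rule, stated here as an assumption; plant consensus sequences differ in detail, and the function name and the three category labels are illustrative.

```python
def initiation_context(seq, start):
    """Grade the context of the start codon whose A is at 0-based index start.

    Returns "strong" if position -3 is a purine (A or G) and position +4 is G,
    "adequate" if exactly one of the two conditions holds, and "weak" otherwise.
    Positions that fall outside the sequence count as non-matching.
    """
    seq = seq.upper().replace("U", "T")
    minus3 = seq[start - 3] if start >= 3 else ""
    plus4 = seq[start + 3] if start + 3 < len(seq) else ""
    # Tuple membership is used so that an empty string never matches.
    matches = (minus3 in ("A", "G")) + (plus4 == "G")
    return ("weak", "adequate", "strong")[matches]
```

Under this heuristic, a uORF start codon in a weak context is a more likely candidate for leaky scanning than one in a strong context.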
Different criteria are used to classify uORFs. For example, based on the location of
the stop codon of the uORF one can distinguish between uORFs that have their stop codon
located in the 5’-UTR region, and uORFs that overlap with the main ORF. A genome-wide
study in yeast found that mutations which generate uORFs that overlap with the main ORF
have the most dramatic inhibitory effect on gene expression among all mutations in the
5’-UTR (Yun, Adesanya et al. 2012). Another criterion for classifying uORFs is the
evolutionary conservation of the peptides encoded by the uORFs. It has been shown that
conserved uORF peptides often play a key role in the repression of translation of the
downstream main CDS. It is hypothesized that the conserved peptides stall the ribosome and,
as a result, block the movement of upstream ribosomes and prevent reinitiation at the
downstream main CDS (Hanfrey, Franceschetti et al. 2002, von Arnim, Jia et al. 2014).
However, uORFs with conserved peptides (CPuORFs) account for only a small fraction of
all uORFs. For example, in Arabidopsis only about 70 uORFs are classified as CPuORFs
(von Arnim, Jia et al. 2014), which represents less than 1% of all uORFs.
Translational regulation by uORFs plays a key role in many aspects of living cells,
such as metabolic pathways, cell development and disease. Sometimes, uORF-mediated
regulation depends on additional factors such as stress or hormone level (Rahmani, Hummel
et al. 2009, Barbosa, Peixeiro et al. 2013, von Arnim, Jia et al. 2014). Figure 1.3 shows an
example of translational regulation of polyamine metabolism by uORFs and polyamine level.
Polyamines are metabolites that stimulate translation and affect start codon recognition
(Ivanov, Atkins et al. 2010). The genes in the polyamine synthesis pathway of many
organisms are regulated by uORFs. The figure shows a classical example, the regulation of
S-adenosylmethionine decarboxylase (SAMDC) in plants. The SAMDC mRNA
harbors two overlapping conserved uORFs: a relatively long uORF with 48-55 codons and
a tiny uORF with only 2-3 codons. When the polyamine level is low, the ribosome
recognizes the start codon of the first, tiny uORF and bypasses the second one. Since the
tiny uORF is short, the ribosome reinitiates translation at the main ORF of SAMDC. When the
polyamine level is high, on the other hand, the ribosome bypasses the start codon of the
tiny uORF and engages with the second, long uORF, whose start codon lies in a strong
sequence context. The ribosome then dissociates at the stop codon of the long uORF, which
prevents it from reinitiating at the main ORF, represses the synthesis of
SAMDC, and thereby maintains polyamine homeostasis (von Arnim, Jia et al. 2014). The specific
mechanism that allows uORFs to sense polyamine levels is still unknown.
Figure 1.3 Regulation of SAMDC translation in plants according to (von Arnim, Jia et
al. 2014). The gene contains two CPuORFs, uORF 1 (orange rectangle): tiny uORF with
2-3 codons and uORF 2 (green rectangle): long uORF with 48-55 codons. The ribosome
is shown as a blue ellipse.
uORFs occur in many genes and in many different organisms (Calvo, Pagliarini et al.
2009). For example, around 49% of human genes and about 39% of Arabidopsis genes contain
uORFs, but only a few uORFs have been experimentally characterized and classified as
functional or translated. The most common wet-lab test of whether a uORF is functional is
site-specific mutagenesis of its start codon. The mRNA and protein levels of the
main CDS are compared between genes with and without the site-specific mutation to determine
whether the uORF of interest significantly affects the protein or mRNA level. However, this
lab-based verification of functional uORFs is costly and time-consuming (Selpi, Bryant et al.
2009). Another possible approach for the large-scale identification of translated uORFs is
mass spectrometry (MS); at the moment, however, only very few uORFs have been identified
by this method. One possible explanation for this lack of MS validation is that
uORF peptides are typically small, whereas MS is designed to detect fragments of larger
proteins; in addition, uORF peptides may lack protease cleavage sites (von Arnim, Jia et al.
2014). Today, the translation status of most uORFs is still unknown.
1.1.3 Other transcript elements involved in translational regulation
Extensive studies have been conducted to probe the complexity of translational
regulation. Various regulatory elements that affect the level of protein being produced have
been found in different transcript parts, including start codon initiation context, secondary
structure, codon usage and length (De Angioletti, Lacerra et al. 2004, Cavener 1987, Rangan,
Vogel et al. 2008, Jackson, Hellen et al. 2010, Kim, Lee et al. 2014, Wang and Wessler
2001). A brief overview of this list of elements is provided in the following part.
Sequence context near start codon (initiation context)
The initiation context is defined by flanking sequences near the start codon, which is
recognized by the ribosome during translation initiation. Studies found that these sequences
are crucial for start site recognition and thus affect the translation initiation (De Angioletti,
Lacerra et al. 2004). The optimal consensus sequence (Kozak sequence) was defined by
Kozak based on the analysis of initiation context from highly expressed genes (Cavener
1987). The Kozak sequence shares some characteristics across organisms but differs in detail
between them (Cavener 1987, Rangan, Vogel et al. 2008). For example, in a strong
consensus, a G appears at position +4 and a purine (A/G) at position -3 relative to
the start codon (Kozak 1984). In mammals, the optimal initiation context is
GCC(A/G)CCaugG. Mutations of G+4 and A-3 residues strongly impair translation initiation
(Kozak 1999). The consensus sequence in plants, however, differs slightly from the
mammalian one: the -3 position shows a preference for A rather than A/G
(Lukaszewicz, Feuermann et al. 2000).
Studies show that modification of the initiation context can affect the protein level of the
corresponding gene (Jackson, Hellen et al. 2010, Kim, Lee et al. 2014). Differences in the
nucleotide composition of this region can result in a more than 200-fold change in
translational efficiency. A recent study in Arabidopsis indicates that A residues at positions -1
to -5 are preferred for high translation efficiency (Kim, Lee et al. 2014).
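The two positions named above (a purine at -3 and a G at +4, counting the A of the start codon as +1) can be checked with a few lines of code. The following is a minimal sketch, not a method from the cited studies; the function name `initiation_context` and the returned dictionary keys are illustrative assumptions.

```python
def initiation_context(transcript: str, start: int) -> dict:
    """Report the -3 and +4 positions around a start codon.

    `start` is the 0-based index of the A in the AUG/ATG start codon.
    Position -3 lies three nucleotides upstream of that A; position +4 is
    the nucleotide directly after the start codon (the A of AUG is +1).
    """
    seq = transcript.upper().replace("U", "T")
    if seq[start:start + 3] != "ATG":
        raise ValueError("no start codon at the given position")
    minus3 = seq[start - 3] if start >= 3 else None
    plus4 = seq[start + 3] if start + 3 < len(seq) else None
    return {
        "minus3": minus3,
        "plus4": plus4,
        "purine_at_minus3": minus3 in ("A", "G"),
        "g_at_plus4": plus4 == "G",
    }


# Mammalian optimal context from the text, GCC(A/G)CCaugG, with A at -3.
print(initiation_context("GCCACCAUGG", 6))
```

A start codon for which both flags are true matches the strong consensus described above; the plant-specific preference for A at -3 could be tested by inspecting the `minus3` entry directly.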
mRNA secondary structure
mRNA secondary structure is another feature that may affect start codon
recognition and ribosome scanning (Wang and Wessler 2001). A stable stem-loop structure
(ΔG = -16 kcal/mol) located near the 5' cap can reduce translation by blocking the loading of
the 40S ribosomal subunit. A more stable stem-loop structure (ΔG < -50 kcal/mol) located further
downstream in the 5’-UTR region can disrupt ribosome scanning. Some secondary structures,
such as iron response element (IRE), can interact with RNA-binding proteins to increase the
efficiency of loading of ribosomes. IRE elements are also found in 3' UTRs but there they
affect mRNA degradation and further have effects on protein production (Hayden 2006).
In addition to IRES and shunting described in Section 1.1.2, there are also other types of
mRNA secondary structures that affect ribosome scanning. For example, Kudla et al.
generated a library of 154 GFP genes that encode the same amino-acid sequence but vary
randomly at synonymous sites (Kudla, Murray et al. 2009). The study found that a tightly
folded structure at the 5' end represses translation, but a loosely folded structure promotes it.
This result is further supported by another genome-wide study conducted by Gu et al. (Gu,
Zhou et al. 2010).
Codon usage
Several studies have explored the relationship between codon usage and protein
abundance (Ikemura 1985, Boycheva, Chkodrov et al. 2003, Coleman, Papamichail et al.
2008). Synonymous mutations can have a significant effect on protein levels via changes in
translation efficiency. This suggests codon usage as an important determinant of translational
efficiency.
Codon usage is affected by evolutionary selection (Shields, Sharp et al. 1988,
Moriyama and Powell 1997). This demonstrates translational selection, i.e., the unequal usage of
synonymous codons in protein coding genes. Translational selection is observed in various
species, including yeast, C. elegans, D. melanogaster, and Arabidopsis thaliana (Duret and
Mouchiroud 1999, Duret 2000, Man and Pilpel 2007, Drummond and Wilke 2008), but there
are contradictory results in humans. Some studies in humans found no evidence for
translational selection while other studies indicate a weak translational selection (dos Reis,
Savva et al. 2004, Lavner and Kotlar 2005). mRNA secondary structure and codon usage are
widely accepted as key determinants of translation efficiency during initiation and elongation,
respectively, but some studies proposed that the overall protein synthesis rate may be
determined by the interaction of these two factors (Gingold and Pilpel 2011). Kudla et al.
proposed that the mRNA structure at the beginning of the transcript has the predominant role
for gene expression levels, while codon usage affects global protein synthesis rates by
affecting the ribosome density on mRNAs (Kudla, Murray et al. 2009). This result is
supported by later studies conducted by Tuller et al. (Tuller, Waldman et al. 2010).
Length
Some studies found a relationship between the length of 5’-UTR and the efficiency of
translation initiation. Kozak analyzed several 5’-UTR sequences of eukaryotic mRNAs and
found that extremely short 5’-UTR regions inhibit translation because of a reduction of
ribosome binding and start codon recognition (Kozak 1991, Kozak 1994). Recent genome-wide analyses also demonstrate a relationship between the 5’-UTR length and protein levels
(Zur and Tuller 2013).
1.2 High-throughput sequencing techniques for comprehensive transcription and
translation analysis
Nowadays, high-throughput sequencing techniques are widely used for genome-wide
gene expression studies. The two types of datasets used in this dissertation are RNA-seq and
ribosome footprinting (Ribo-seq) data generated by the Illumina platform. Thus, a brief
introduction to these two techniques is provided in this section.
1.2.1 High-throughput RNA sequencing
The transcriptome is the entire set of transcripts expressed by an organism.
Understanding the transcriptome is important for studying functional elements in a genome,
the organism’s development and diseases. Currently, high-throughput RNA sequencing
(RNA-seq) is the most popular transcriptome analysis technique, although various
alternative approaches exist (Wang, Gerstein et al. 2009). In the RNA-seq approach, a
population of RNA molecules is converted into a library of cDNA fragments with adaptors
attached at their ends. Then the cDNA fragments are sequenced using high-throughput
sequencing (Wang, Gerstein et al. 2009). Current high-throughput sequencing platforms
include Illumina, Applied Biosystems SOLiD, Roche 454 Life Sciences, and other
approaches. A brief description of the Illumina sequencing technique is shown in Figure 1.4.
The approach begins with sample preparation, where a DNA sample is randomly
fragmented and adapters are tagged to both ends of the fragments. Subsequently, the
fragments are attached to the solid surface of a flow cell, which contains two types of
oligomers that are complementary to the adaptor sequences of the fragments. This step
generates a duplicate of the fragments, and the original templates are washed away.
Next, “bridge amplification” is used to create multiple copies of each fragment. In
this step, the fragments bend towards the surface so that the adaptor regions of the fragments
attach to the second oligomer type in the flow cell, forming a "bridge". A complementary
strand is synthesized and the two strands are denatured, but both remain attached to the
surface at one end. After several rounds, this process generates millions of homogeneous
cDNA clusters, where each cluster contains about 1 million copies of an original cDNA
fragment. After bridge amplification, the reverse strand is washed away and sequencing
begins at the primer positions.
Figure 1.4 The Illumina sequencing technique in (Mardis 2008)
The reads generated by sequencing are then aligned back to the reference genome.
RNA-seq is widely used for gene annotation and quantification of transcript expression levels.
Many studies have applied RNA-seq to specific biological problems, such as the
quantification of alternative splicing (Wang, Sandberg et al. 2008), discovery of new fusion
genes in cancer (Maher, Kumar-Sinha et al. 2009), improvement of genome assembly
(Mortazavi, Schwarz et al. 2010) and transcript identification (Denoeud, Aury et al. 2008,
Yassour, Kaplan et al. 2009, Garber, Grabherr et al. 2011).
1.2.2 Ribosome Footprinting
Due to rapid advancements in high-throughput sequencing techniques, RNA-seq-based transcriptome analyses are among the most popular tools to investigate gene
expression on a genome-wide scale. However, transcript abundance is only a proxy for
protein abundance. Therefore, measurements at the transcriptome level provide only
limited information about protein synthesis, which is the final point of gene expression for all
protein-coding genes. Various low-resolution methods to monitor the translation at genomic
levels have been developed (Johannes, Carter et al. 1999, Arava, Wang et al. 2003,
Hendrickson, Hogan et al. 2009). However, only recently, due to the newly developed
ribosome footprinting technique, has a high-resolution translatome analysis become
technically feasible. Ribosome footprinting is a technique based on deep sequencing of
ribosome-protected mRNA fragments (Ingolia, Ghaemmaghami et al. 2009). Usually, a
translating ribosome encloses an ~30-nt mRNA fragment on the mRNA template and
protects it from nuclease digestion. Thus, the sequences of the ribosome footprints mark the
translated mRNA as well as the accurate positions of the ribosomes on the transcript. Deep
sequencing of ribosome footprints provides information about ribosome positions and
translational efficiency. Ribosome footprinting has been successfully applied to many model
organisms (Ingolia, Ghaemmaghami et al. 2009, Guo, Ingolia et al. 2010, Ingolia 2010,
Ingolia, Lareau et al. 2011, Brar, Yassour et al. 2012, Fritsch, Herrmann et al. 2012, Ingolia,
Brar et al. 2012, Li, Oh et al. 2012, Chew, Pauli et al. 2013, Dunn, Foo et al. 2013,
Juntawong, Girke et al. 2014, McManus, May et al. 2014). For example, it was used to study
the developmental changes in mouse embryonic stem cells, and to monitor the effects of drug
therapies in human cancer cell models (Ingolia, Lareau et al. 2011).
[Figure 1.5 panel labels: ribosome footprint generation; sequencing library construction; linker ligation and reverse transcription; circularization; PCR; deep sequencing and data analysis]
Figure 1.5 Overview of ribosome footprinting experiment described in (Ingolia, Brar et
al. 2012). The entire approach can be divided into 3 steps: ribosome footprint
generation, sequencing library construction, and deep sequencing of ribosome footprints.
Ribosome footprinting is mainly divided into three steps: 1. Generation of ribosome-protected mRNA fragments (footprints); 2. Conversion of these footprints into a library of
DNA molecules in a form that is suitable for deep sequencing; 3. Analysis of sequencing data.
Figure 1.5 shows the main experimental steps of ribosome footprinting.
Generation and Purification of Ribosome Footprints
The first step of a ribosome footprinting protocol is the preparation of cell extracts
containing mRNA-bound ribosomes. A chemical (e.g. cycloheximide in yeast) is added to
the cell to stabilize polysomes and to block translation. Ribosome footprinting can also be
performed on drug-free extracts if cells are harvested and frozen quickly (Ingolia, Brar et al.
2012). Then the polysomes bound to mRNA templates are digested by a nuclease to degrade
unprotected transcript parts. The resulting individual ribosomes remain bound to about 30-nt
protected mRNA fragments. The ribosomes are isolated by sucrose density gradient
centrifugation, and the mRNA fragments are then purified through two size-selection steps.
Sequencing Library Construction
The ribosome footprint RNA fragments need to be converted into a library suitable
for deep sequencing. To create the library, the small RNA fragments are first polyadenylated
to provide a primer-binding site for reverse transcription, then an anchored oligo-d(T) primer
and a linker sequence are attached to the RNA fragments to prime reverse transcription. After
reverse transcription, the first-strand cDNA is circularized to generate a PCR template with
linker sequences on both sides of the RNA fragment. Then PCR amplification is performed
and the library is further purified. The resulting gel-purified library is suitable for deep
sequencing.
Characteristics of ribosome footprints and their applications
After sequencing, the first step in analyzing ribosome footprint data is to map the
sequencing reads against the reference genome. In a first approximation, the aligned
positions indicate which portions of the genome are being translated into protein, and include
protein-coding genes and other DNA elements, such as those encoding short translated
peptides in translated uORFs.
Because of the different approaches to generate the mRNA fragments, the
distributions of ribosome footprints and mRNA reads in a gene are different. As Figure 1.6
shows, for mRNA, the number of reads aligning to different regions of a transcript is
proportional to the length of the 5’-UTR, coding sequence region and the 3’-UTR, but for
ribosome footprints, most of the reads align to the coding sequence region, because during
translation, ribosomes spend most of their time in this region. In addition, for ribosome
footprints, a 3-nt periodicity is obvious, because ribosomes translate transcripts three
nucleotides (i.e., one codon) at a time. As a result, there are more reads aligning to the first
position of a codon, unlike RNA-seq reads that are uniformly distributed along a transcript.
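The 3-nt periodicity described above can be made visible by counting how many footprints fall into each reading frame of a coding sequence. The sketch below is illustrative only: the function name `frame_counts` and the toy coordinates are assumptions, and it presumes that each footprint has already been reduced to a single transcript position (for example a fixed offset from the 5' end of the read).

```python
from collections import Counter


def frame_counts(positions, cds_start, cds_end):
    """Count footprint positions per reading frame (0, 1, 2) within a CDS.

    `positions` are 0-based transcript coordinates, one per footprint.
    `cds_start` is the index of the first nucleotide of the start codon;
    `cds_end` is exclusive. Footprints outside the CDS are ignored.
    """
    counts = Counter()
    for pos in positions:
        if cds_start <= pos < cds_end:
            counts[(pos - cds_start) % 3] += 1
    return [counts[frame] for frame in range(3)]


# Toy data: most footprints sit on the first position of a codon (frame 0).
footprints = [10, 13, 16, 16, 19, 20, 22, 25, 5, 40]
print(frame_counts(footprints, cds_start=10, cds_end=31))  # [7, 1, 0]
```

For ribosome footprints, one frame is expected to dominate, whereas RNA-seq reads processed the same way should give roughly equal counts in all three frames.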
Figure 1.6 The difference of reads distribution between ribosome footprints and mRNA
described in (Ingolia, Brar et al. 2012).
One application for ribosome footprinting is the identification of translated regions.
By analyzing alignments of ribosome footprints, several novel translated features can be
identified. In particular, the number of small peptides was found to be much larger than
expected (Fritsch, Herrmann et al. , Ingolia, Lareau et al. , Juntawong, Girke et al.).
In addition to discovering novel translated features, a major application of ribosome
footprinting is to quantify the amount of translation. The number of footprints produced
from a transcript corresponds to the number of ribosomes engaged in synthesizing the protein,
because each footprint corresponds to a translating ribosome. Several studies used ribosome
footprints to investigate the protein synthesis and found that for different groups of genes the
speed of protein synthesis is constant, so under a given condition, the translation time of an
ORF is simply proportional to its length (Ingolia, Lareau et al. , Juntawong, Girke et al. ,
McManus, May et al. , Ingolia, Ghaemmaghami et al. 2009). Thus, footprint density can be
used to quantify gene expression.
Finally, ribosome footprinting provides an approach to study the mechanics of
translation. The number of footprints centered on a codon indicates how often a ribosome is
found at that particular position, so if ribosomes stall at a specific point when translating a
gene, they will spend more time and thus produce a large number of footprints at this
position. Using this phenomenon of ribosome footprints, peptide-mediated translational
stalling has been discovered in mammalian cells, and RNA-mediated stalling has been
observed in bacteria (Ingolia, Lareau et al. , Li, Oh et al.). Footprint density has also been
used as a way to determine codon-specific elongation rates in bacteria, yeast, worms, and
human cell cultures.
CHAPTER 2 MACHINE LEARNING ALGORITHMS
Machine learning is a technique of data analysis that uses algorithms to detect
patterns from data automatically and allows computers to discover useful information in
large data repositories without being explicitly programmed. It is widely used in many areas,
such as retail industry, finance, and bioinformatics. Generally, the tasks can be grouped into
two categories: supervised learning (predictive task) and unsupervised learning (descriptive
task). In supervised learning, the training data consist of a set of known
examples and their class labels. Supervised learning algorithms use the training data to infer a
function that is used to determine the class labels of unknown data points. In contrast,
in unsupervised learning no labeled training data exist; unsupervised algorithms infer patterns from
unlabeled data. This chapter gives a brief introduction to the supervised and unsupervised
learning methods used in this dissertation. Section 2.1 discusses the supervised learning
algorithms. k-Nearest Neighbors, support vector machine, artificial neural network, naive
Bayes, and decision tree are used in Chapter 3 to predict translated uORFs based
on a training set, whereas LASSO and random forest are used in Chapter 4. Section 2.2
discusses the unsupervised learning algorithms used in Chapter 3: k-means clustering and
association rule mining. Section 2.3 gives an introduction to the model evaluation methods used
in Chapter 3.
2.1 Supervised learning algorithms
2.1.1 Linear regression
Linear regression is a supervised learning approach. It models the relationship
between response variable y and one or more independent variables X. Linear regression is
usually used to predict a quantitative response.
Given n observations $(x_{i1}, x_{i2}, \dots, x_{ip}, y_i)$, $i = 1, \dots, n$, where $(x_{i1}, x_{i2}, \dots, x_{ip})$ are the
independent variables and $y_i$ is the dependent response variable, the linear regression model
assumes a linear relationship between the independent variables and the response of the form

$$ y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \epsilon_i , $$

where $\beta_j$ is the coefficient of the jth variable, $\beta_0$ is the intercept, and $\epsilon_i$ is the error term.
The intercept and the coefficient vector $\beta$ are estimated by minimizing the residual sum of
squares (SSE):

$$ \hat{\beta}^{o} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j} \beta_j x_{ij} \Big)^{2} \right\} $$

This approach may yield estimates with large variance and reduced predictive
accuracy for high-dimensional data. Several alternative methods have been proposed to
address this problem (Iain M. Johnstone 2009). A popular alternative is the Least Absolute
Shrinkage and Selection Operator (LASSO) (R 1996). LASSO minimizes the residual sum of
squares subject to the sum of the absolute values of the coefficients being at most a constant t:
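
$$ \hat{\beta}^{L} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j} \beta_j x_{ij} \Big)^{2} \right\} , $$

subject to

$$ \sum_{j=1}^{p} |\hat{\beta}^{L}_{j}| \le t \quad \text{(constant)} . $$
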
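If $0 < t < \sum_{j=1}^{p} |\hat{\beta}^{o}_{j}|$, minimizing the residual sum of squares subject to this constraint is equivalent to

$$ \hat{\beta}^{L} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j} \beta_j x_{ij} \Big)^{2} + \lambda \sum_{j} |\beta_j| \right\} , $$
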
where λ>0. Due to this constraint, LASSO shrinks the coefficients of features that have a small
correlation with the response to zero. This provides an efficient method to select features that
are in a linear relationship with the response; the approach trades an increase in bias against a
reduction of variance to generate more accurate results. The LASSO method is used to select
important features related to translation in Arabidopsis in Chapter 4.
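The shrinkage behavior of the penalized form can be illustrated with a small coordinate-descent solver. This is a sketch under stated assumptions, not the implementation used in this dissertation: coordinate descent with soft-thresholding is one standard way to minimize the penalized objective, and the names `soft_threshold`, `lasso_coordinate_descent`, and `lam` as well as the toy data are hypothetical.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator: shrink z toward zero by gamma."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize sum_i (y_i - b0 - sum_j b_j x_ij)^2 + lam * sum_j |b_j|.

    X is a list of rows, y a list of responses. Returns (b0, [b_1..b_p]).
    """
    n, p = len(X), len(X[0])
    beta0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        # The intercept is not penalized: set it to the mean residual.
        beta0 = sum(y[i] - sum(beta[j] * X[i][j] for j in range(p))
                    for i in range(n)) / n
        for j in range(p):
            rho, norm = 0.0, 0.0
            for i in range(n):
                # Fit of the current model with feature j left out.
                fit_without_j = beta0 + sum(beta[k] * X[i][k]
                                            for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
                norm += X[i][j] ** 2
            beta[j] = soft_threshold(rho, lam / 2.0) / norm if norm > 0 else 0.0
    return beta0, beta


# Toy data: the response depends strongly on feature 1, barely on feature 2.
X = [[-2, 1], [-1, -1], [0, 0], [1, -1], [2, 1]]
y = [-4.1, -1.9, 0.0, 2.1, 3.9]
print(lasso_coordinate_descent(X, y, lam=0.0))  # ordinary least squares
print(lasso_coordinate_descent(X, y, lam=2.0))  # penalized fit
```

On this toy data the unpenalized fit gives coefficients of about 2.0 and -0.1, while with `lam=2.0` the first coefficient shrinks to about 1.9 and the weakly correlated second coefficient is set exactly to zero, which is the feature-selection effect described above.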
2.1.2 k-Nearest Neighbors
The k-Nearest Neighbors algorithm (k-NN) is a nonparametric method used for
classification. The algorithm assigns a label based on the class of k nearest neighbors, where
k is a user specified parameter.
The features of the training data form an n-dimensional feature space, and each
training record represents a point in this space. For any unlabeled data point with the same
feature set, k-NN searches for the k closest training records to the unlabeled data point.
Closeness is defined by a distance function, often the Euclidean distance. The Euclidean
distance between two points X = (x1, x2, …, xn) and Y = (y1, y2, …, yn) is defined as:

$$ \mathrm{dist}(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^{2}} $$

Figure 2.1 shows an example of 3-nearest neighbors. The data set contains two
classes and the class label of the unknown record is determined by the majority voting from
the class labels of its 3 nearest neighbors. Choosing a proper value of k is important for the k-NN
algorithm. Often, a large k-value will increase the error rate and decrease the variance of the
estimates, while a k-value that is too small may lead to overfitting and make the model sensitive to
noise points.
k-NN can be extremely slow when the training data set is large. Assuming the size of the
training data set is N, and F is the number of features in each training record, then
classifying one data point has a time and space complexity of O(N×F). We use k-NN as the meta-level classifier
in a stacked classifier in Chapter 3.
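The procedure can be sketched in a few lines. The function names `euclidean` and `knn_predict` and the toy records are illustrative assumptions; the sketch uses a full sort of the training records for simplicity, which adds an O(N log N) term to the O(N×F) distance computation.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training records."""
    # One distance per training record: N records times F features each.
    neighbors = sorted(zip(train_X, train_y),
                       key=lambda rec: euclidean(rec[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_y = ["green", "green", "green", "red", "red", "red"]
print(knn_predict(train_X, train_y, (2, 2), k=3))  # green
```

With two classes, an odd k avoids tied votes; with more classes a tie-breaking rule would be needed, which this sketch leaves to the order in which labels are first encountered.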
Figure 2.1 3-Nearest Neighbor classification example according to the
description in (Pang-Ning Tan). The unknown record will be classified as green class.
2.1.3 Support Vector Machine
A support vector machine (SVM) is a classification algorithm that solves two-class
pattern recognition problems (Cortes 1995). In the case of SVM, a data point is viewed as a
p-dimensional vector. During training, the SVM algorithm tries to find a hyperplane in a high
dimensional space that can best separate the different class labels in the training set. The
optimal hyperplane is obtained by maximizing the distance between the hyperplane and the
closest training points on each side of the plane (margin), because in general, the classifier
will have the lowest generalization error if the margin is maximized. The special data points that
constrain the margin are called support vectors. Figure 2.2 shows an example of a separating
hyperplane.
Figure 2.2 Example of training a support vector machine as described in (Richard O.
Duda). The support vector machine algorithm tries to find an optimal hyperplane that
maximizes the margin.
In some cases, the data set cannot be separated by
a linear hyperplane in the original feature space. In this situation, a kernel function is used to
project the data points into a high-dimensional space. In this space, it is often possible to
separate the class labels by a maximum-margin hyperplane. The performance of an SVM
depends on the selection of the kernel and the model parameters.
During classification, the unknown data point is mapped into the same space as the
training data and the class label is predicted based on the side of the separating hyperplane
the data point is mapped to. In most cases, SVMs are only used for two-class classification
problems, but some modifications of the two-class SVM exist that allow users to solve multi-class classification problems (Crammer 2001, Lee 2001, Duan 2005).
2.1.4 Naive Bayes
The naive Bayes classifier is a classification algorithm that uses Bayes' theorem under the
assumption that features are independent given a particular class. So, the probability of a
particular feature set given class C can be computed as $P(X \mid C) = \prod_{i=1}^{n} P(X_i \mid C)$, where X =
(X1, X2, …, Xn) is the feature set and C is the class label (Rish 2001).
Let X = (X1, X2, …, Xn) be a vector of observed features in the training set. The naive
Bayes algorithm assigns to an unknown data point the class label that maximizes the
posterior according to Bayes' rule

$$ P(C = c \mid X = x) = \frac{P(X = x \mid C = c)\, P(C = c)}{P(X = x)} , $$

where P(X=x) is a constant and thus can be ignored while finding the optimal c.
Based on the independence assumption of naive Bayes, the numerator of the posterior probability for a
given feature set can be rewritten as:

$$ P(X = x \mid C = c)\, P(C = c) = \prod_{j=1}^{n} P(X_j = x_j \mid C = c)\, P(C = c) $$

Therefore, naive Bayes assigns to an unlabeled data point the class label that maximizes the posterior
probability, according to the function h(x), which finds the most probable class for a given feature
set:

$$ h(x) = \arg\max_{c} P(X = x \mid C = c)\, P(C = c) $$

Naive Bayes is applied in many areas, such as text classification and medical
diagnosis, but its independence assumption may not be accurate in many real-world cases (P.
Langley 1992, N. Friedman 1997). Even so, the naive Bayes classifier is still successful in
some of those cases because its classification decision is often correct even if the probability
estimate is not accurate (Rish 2001).
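The decision rule h(x) can be written down directly for categorical features. The following is a minimal sketch rather than the classifier used later in this dissertation: the function names, the made-up toy records, and the add-one (Laplace) smoothing of the conditional probabilities are all assumptions introduced for illustration.

```python
from collections import Counter, defaultdict


def train_naive_bayes(records, labels):
    """Collect the counts needed for P(C=c) and P(X_j = x_j | C=c)."""
    class_counts = Counter(labels)
    # feature_counts[(j, c)][v] = number of class-c records with X_j = v
    feature_counts = defaultdict(Counter)
    values = defaultdict(set)  # distinct values seen for each feature j
    for x, c in zip(records, labels):
        for j, v in enumerate(x):
            feature_counts[(j, c)][v] += 1
            values[j].add(v)
    return class_counts, feature_counts, values


def predict_naive_bayes(model, x):
    """Return argmax_c P(C=c) * prod_j P(X_j = x_j | C=c)."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(C=c)
        for j, v in enumerate(x):
            # Add-one smoothing avoids zero probabilities (an assumption here).
            score *= (feature_counts[(j, c)][v] + 1) / (n_c + len(values[j]))
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Made-up toy records with two categorical features.
records = [("strong", "short"), ("strong", "long"), ("weak", "long"),
           ("weak", "long"), ("strong", "short"), ("weak", "short")]
labels = ["yes", "yes", "no", "no", "yes", "no"]
model = train_naive_bayes(records, labels)
print(predict_naive_bayes(model, ("strong", "short")))  # yes
print(predict_naive_bayes(model, ("weak", "long")))     # no
```

Note that the scores are unnormalized: P(X=x) is dropped, as in the definition of h(x), because it does not change which class attains the maximum.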
2.1.5 Artificial Neural Network
An artificial neural network (ANN) is an algorithm that is inspired by biological neural
networks and has been used to solve complex problems in many areas. ANNs are usually
organized in three types of layers: input, hidden, and output layers. Layers are made up of a number
of nodes (neurons) which use an activation function to learn the patterns from the training set.
The input layer handles all the input information and communicates with one or more hidden
layers through a system of weighted connections. The hidden layers then link to the output
layer, which reports the final result. The hierarchical structure of an ANN is shown in
Figure 2.3.
Figure 2.3 Hierarchical structure of an artificial neural network (input layer, hidden layers 1 to n, output layer) according to the
description in (S Malinov 2001)
ANNs are trained to approximate the desired output as well as possible
according to the training set. The backpropagation algorithm is a commonly used approach for
training ANNs to correctly map inputs to outputs (M. Forouzanfar 2010).
ANNs have several advantages; for example, they have the ability to capture
nonlinear relationships and multiple interactions, and they require less formal statistical
training (Tu 1996). On the other hand, the approach is a "black box" approach, and it is often
difficult to understand how a classification has been obtained, and how to troubleshoot if the
approach does not work as expected. In addition, if the training set is large, training an ANN
may require more time than alternative approaches.
2.1.6 Decision Tree
A decision tree is a tree-like graph that uses a branching method to display every
possible outcome of a decision. Decision trees consist of three basic elements: leaf nodes,
which are associated with a class label; non-leaf nodes, which correspond to data attributes;
and edges, which correspond to different attribute values. Figure 2.4 shows the basic structure
of a decision tree. In this tree, the class labels are YES and NO, and attribute names are shown
within non-leaf nodes (rectangular boxes). A decision tree learning algorithm uses a decision tree
as a model by mapping the features to a tree from the root to the leaf nodes. Each path from the
root to a leaf represents a classification rule. To classify an unknown data item, the item is
sorted down the tree by following the classification rules and is assigned to the class that is
associated with the matching classification rule.
[Figure 2.4 tree: Refund = Yes: NO; Refund = No: test MarSt. MarSt = Married: NO; MarSt = Single, Divorced: test TaxInc. TaxInc < 80K: NO; TaxInc > 80K: YES]
Figure 2.4 Example of a decision tree model for the credit data described in (Pang-Ning Tan)
A decision tree is constructed in the training phase. A simple algorithm for growing
a decision tree uses a top-down, greedy search through all possible splits that partition the
data into two or more parts according to attributes. At each node, the algorithm selects the
attribute that produces the best split. This process continues until all the attributes have been
used or no split will result in a further partition.
There are different methods to compute the best split, but the underlying ideas are
similar. Each method computes the impurity of potential splits, and splits that produce parts with a
homogeneous class distribution (low degree of impurity) are preferred. Popular impurity measures are the Gini index,
entropy, and misclassification error. The quality of a split is assessed by computing its gain:

$$ \mathrm{GAIN}_{split} = \mathrm{IMPURITY}(p) - \sum_{i=1}^{k} \frac{n_i}{n}\, \mathrm{IMPURITY}(i) , $$

where p is the parent node, k is the number of parts of the split, n is the number of data points
at the parent node, and $n_i$ is the number of data points in part i.
Decision trees are used to predict translated uORFs based on a training set in Chapter 3.
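The gain of a candidate split can be computed as follows, here with the Gini index as the impurity measure. The function names `gini` and `split_gain` and the toy label lists are illustrative assumptions; every part of a split is assumed to be non-empty.

```python
from collections import Counter


def gini(labels):
    """Gini index of a non-empty list of class labels: 1 - sum_c p_c^2."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gain(parent, parts, impurity=gini):
    """GAIN_split = IMPURITY(parent) - sum_i (n_i / n) * IMPURITY(part_i)."""
    n = len(parent)
    weighted = sum(len(part) / n * impurity(part) for part in parts)
    return impurity(parent) - weighted


parent = ["YES"] * 5 + ["NO"] * 5
pure_split = [["YES"] * 5, ["NO"] * 5]
mixed_split = [["YES", "YES", "NO", "NO", "NO"],
               ["YES", "YES", "YES", "NO", "NO"]]
print(split_gain(parent, pure_split))   # 0.5
print(split_gain(parent, mixed_split))  # about 0.02
```

The split that separates the classes completely achieves the maximal gain for this parent node (0.5), whereas the mixed split barely reduces the impurity, so a greedy tree-growing algorithm would choose the first one.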
2.1.7 Random Forest
Random forest is an ensemble algorithm developed by Leo Breiman (Breiman 2001)
that combines the results of multiple decision trees. In Breiman's approach, multiple
training sets are selected through bootstrap resampling with replacement. For each training
set, a small group of features is selected at random and then a decision tree is grown. The
results of the individual decision trees are combined, and the final prediction is determined via
majority voting (classification) or averaging (regression). Figure 2.5 gives an overview of the procedure.
[Figure 2.5 flowchart: dataset; bootstrap sampling based on small groups of features yields training sets D(1), D(2), ..., D(n-1), D(n); one decision tree is grown per training set (decision trees 1 to n); the results of the multiple decision trees are combined]
Figure 2.5 Random forest approach described in (Pang-Ning Tan)
An important feature of the random forest approach is that it can determine the
importance of individual variables for the classification task. The importance score of each
variable is computed from the change in the out-of-bag error (errOOB) after permuting the values
of that variable. The raw importance score of feature $X_j$ is calculated as:

$$ VI(X_j) = \frac{1}{ntree} \sum_{t} \left( errOOB_t^{\,j} - errOOB_t \right) , $$

where ntree is the number of trees, $errOOB_t^{\,j}$ is the out-of-bag error of tree t after permuting the jth
variable, and $errOOB_t$ is the error before permutation.
The importance score can be used to rank all features with respect to their importance
for the classification task, and it can be used as a guideline for feature selection.
The random forest algorithm has several advantages. It can handle a large number of
features without the need for reducing the number of features because the individual trees of
the forest use only a small subset of the original features, and each tree is grown to its
maximum size without pruning. Random forests often perform better than individual
classifiers (Opitz 1999). The random forest model is used to rank the contribution of transcript
features that affect translation in Chapter 4.
2.2 Unsupervised learning algorithms
2.2.1 Clustering
In unsupervised learning, there are no labeled training examples. The task is to find out whether
there is a specific pattern or pattern groups in the data. One of the most commonly used
unsupervised approaches is clustering, which is an exploratory technique to find groups of
similar objects in a dataset. In this dissertation, the k-means clustering algorithm is used, so the
following part gives a brief introduction to k-means clustering.
k-means clustering is an algorithm for cluster analysis in data mining. It tries to
partition the dataset into k clusters, where k is a user specified parameter (Pang-Ning Tan).
More formally, given a data set X= (x1, x2, …,xn), where each data point xi is a ddimensional vector, the algorithm tries to partition the data set into k (k<=n) clusters, so that
the within cluster sum of squares, which measures the sum of distances of each point in the
cluster to its cluster center, is minimized. The within cluster sum of squares is defined as:
\[ e^2 = \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left\| x_i^{j} - c_j \right\|^2, \]
where x_i^j is the ith data point belonging to the jth cluster, n_j is the number of data points in the jth cluster, and c_j is the centroid (center of the
data points in the cluster) of the jth cluster.
The k-means clustering algorithm operates as follows:
(1) Choose k points at random as initial centroids
(2) Assign each data point to the closest centroid
(3) Recompute the cluster centroids according to the current cluster members
(4) Repeat (2)-(3) until the centroids no longer change
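The steps of the algorithm can be sketched in plain Python as follows. This is an illustrative implementation, not the code used in this work; the function names, the fixed random seed, and the iteration cap max_iter are assumptions made for the example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # choose k data points as initial centroids
    assignment = None
    for _ in range(max_iter):
        # Assign each data point to the closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:  # memberships, and hence centroids, are stable
            break
        assignment = new_assignment
        # Recompute each centroid from its current members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    # Within-cluster sum of squares of the final partition.
    wcss = sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))
    return assignment, centroids, wcss

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
assignment, centroids, wcss = k_means(points, k=2)
```

Because the result depends on the random initialization, the procedure is commonly run several times and the partition with the smallest within-cluster sum of squares is kept.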
One problem for k-means clustering is to determine the number of clusters k. There
are different methods that can be used to determine a suitable value for k. One option is to
calculate the average silhouette coefficient for different k and choose k with maximum
average silhouette value (Rousseeuw 1987). The silhouette value measures the fit of a data
point within its cluster in comparison with neighboring clusters. Silhouette values are in the
range of -1 to 1. A silhouette value close to 1 indicates that a data point is in an appropriate
cluster, while a silhouette value close to -1 indicates that it might be erroneously assigned.
k-means clustering is applied in chapter 3 to detect different clusters in the uORF dataset.
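A minimal sketch of the silhouette computation is given below. It assumes Euclidean distance, at least two clusters, and the common convention that a point in a singleton cluster receives the value 0; the function name silhouette_values is hypothetical.

```python
from math import dist

def silhouette_values(points, labels):
    """Silhouette value of every point for a given cluster labeling."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items() if other_lab != lab
        )
        values.append((b - a) / max(a, b))
    return values

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good = silhouette_values(points, [0, 0, 0, 1, 1, 1])
bad = silhouette_values(points, [0, 1, 0, 1, 0, 1])
avg_good = sum(good) / len(good)
avg_bad = sum(bad) / len(bad)
```

To choose k, the average silhouette value would be computed for the clustering obtained with each candidate k, and the k with the largest average would be selected.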
2.2.2 Association rule mining
Association rule mining is an unsupervised method used to discover interesting
relationships or associations in a large set of items (Agrawal, Imieliński et al. 1993). The
relationships or associations are described via association rules. The problem of association
rule mining can be described as follows: given a database of transactions, discover associations
among pairs of item sets such that the presence of one set of items in a transaction implies
the presence of the other set of items in the same transaction. More formally, following the
definition of Agrawal et al. (Agrawal, Imieliński et al. 1993), the problem of association rule
mining is defined as follows:
Let I = {i1, i2, …, im} be a set of items, and D = {t1, t2, …, tn} be a database of
transactions. Each transaction t in D is a subset of I (t ⊆ I) and has a unique transaction ID, TID. An association
rule is an implication of the form X=>Y, where X, Y ⊂ I and X ∩ Y = ∅. X is called the
antecedent or left-hand side (LHS) and Y is called the consequent or right-hand side (RHS). A
rule X=>Y has support s if s% of the transactions in D contain X ∪ Y. A rule X=>Y has
confidence c in the transaction database D if c% of the transactions in D that contain X also
contain Y. An item set is frequent if its support exceeds a certain threshold minsup. The
association rule mining problem is then: given a set of transactions and thresholds for support and
confidence, find all rules that predict the occurrence of items based on the
occurrence of other items in the transaction and that meet both thresholds.
Given support and confidence thresholds minsup and minconf, the problem of finding
all association rules can be decomposed into two sub-problems:
1. Find all frequent item sets.
2. Using the identified frequent item sets, generate all rules with a confidence value
greater than or equal to the minimum confidence threshold minconf.
The first sub-problem is the more demanding one because it is time-consuming and
requires searching a large space of candidate item sets. Rule generation is straightforward once
all frequent item sets have been found.
One popular algorithm for generating all frequent item sets is Apriori (Agrawal and
Srikant 1994). The algorithm starts by generating the set of all frequent 1-itemsets L1
and uses this set to generate a set of candidate 2-itemsets C2 in the next pass. The algorithm
then deletes the item sets in C2 that are not frequent, resulting in the set L2, which is in turn used to generate
candidate 3-itemsets. This procedure is repeated until no new frequent itemsets are discovered.
Each pass that generates frequent itemsets consists of two phases: a join phase and a
prune phase. The join phase generates a set of candidate k-itemsets Ck by joining pairs of
frequent (k-1)-itemsets in Lk-1 that have an intersection of k-2 items. During the prune phase,
the support value of each candidate itemset in Ck is counted and all infrequent itemsets are
deleted. Finally, candidate association rules are generated from the frequent itemsets and
further pruning is performed to remove rules that do not meet the confidence threshold.
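The frequent itemset generation described above can be sketched in Python as follows. The sketch is illustrative only: the function name apriori and the toy transactions are assumptions, support is expressed as a fraction rather than a percentage, and the subset check before support counting follows the classic formulation of Apriori rather than the description in this section.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {itemset: support} for all itemsets with support >= minsup."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # L1: frequent 1-itemsets.
    items = {i for t in transactions for i in t}
    current = {}
    for i in items:
        s = support(frozenset([i]))
        if s >= minsup:
            current[frozenset([i])] = s
    frequent = dict(current)
    k = 2
    while current:
        # Join phase: unite pairs of frequent (k-1)-itemsets that share k-2 items.
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        # Classic Apriori check (assumption): every (k-1)-subset must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # Prune phase: count support and drop infrequent candidates.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= minsup:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent

transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]
frequent = apriori(transactions, minsup=0.6)
```

With a minimum support of 0.6, the toy database yields four frequent 1-itemsets and four frequent 2-itemsets; no 3-itemset reaches the threshold.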
One important question in association rule mining is how to determine which rules
are interesting. Generally, support and confidence are two criteria that help filter out
uninteresting rules. Unfortunately, in some cases, rules with high support and confidence are
still not interesting or are even misleading. For example, suppose that among 1000 students, 75%
drink milk, 60% play basketball, and 40% both play
basketball and drink milk. The rule play basketball => drink milk (support = 40%, confidence =
66.7%) suggests that drinking milk and playing basketball are associated. This is misleading
because 75% of all students drink milk, a higher percentage than the percentage of milk-drinking
students among basketball players (66.7%). To address this weakness, another measure called
lift has been proposed. Lift measures the association of itemsets X and Y. The lift value between the
occurrences of X and Y is defined as:
\[ \mathrm{lift}(X, Y) = \frac{P(X \cup Y)}{P(X)\,P(Y)} \]
A lift value smaller than 1 indicates that the occurrence of X is negatively correlated
with the occurrence of Y, while a lift value larger than 1 indicates a positive correlation
between X and Y. A lift value equal to 1 indicates statistical independence of itemsets X
and Y.
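As a worked example, the three measures can be computed for the student data discussed above (1000 students, 600 basketball players, 750 milk drinkers, 400 who do both). The helper name rule_metrics is hypothetical, and the values are returned as fractions rather than percentages.

```python
def rule_metrics(n_total, n_x, n_y, n_xy):
    """Support, confidence and lift of the rule X => Y from transaction counts."""
    support = n_xy / n_total
    confidence = n_xy / n_x
    lift = support / ((n_x / n_total) * (n_y / n_total))
    return support, confidence, lift

# X = play basketball, Y = drink milk
support, confidence, lift = rule_metrics(n_total=1000, n_x=600, n_y=750, n_xy=400)
```

The lift is 0.4 / (0.6 × 0.75) ≈ 0.89, which is smaller than 1 and therefore indicates a negative association despite the high support and confidence of the rule.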
Another independence measure is the chi-squared measure. The χ2 statistic is
calculated as:
\[ \chi^2 = \sum \frac{(\mathrm{observed} - \mathrm{expected})^2}{\mathrm{expected}}, \]
where observed is the actual frequency of each combination of presence or absence of X and Y, and expected is the frequency one
would expect if X and Y were independent. A higher chi-squared statistic provides stronger evidence that itemsets X
and Y are not independent. Depending on the application, various other measures for
identifying interesting association rules have been proposed (Tan, Kumar et al. 2002).
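The chi-squared statistic for the student example can be computed from a 2×2 contingency table, as in the following sketch. The table layout (rows: basketball yes/no, columns: milk yes/no) and the function name chi_squared are assumptions for the illustration; expected counts are derived from the row and column totals under independence.

```python
def chi_squared(table):
    """Chi-squared statistic of a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Rows: plays basketball (yes, no); columns: drinks milk (yes, no).
table = [[400, 200],
         [350, 50]]
chi2 = chi_squared(table)
```

For these counts the statistic is about 55.6, far from the value 0 that would be obtained if the observed counts matched the expected counts exactly.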
Association rules are used to detect elements associated with uORF translation in chapter 3.
2.3 Performance evaluation
2.3.1 Confusion Matrix
An important aspect of supervised learning and classification is evaluating the
performance of a model on a given dataset. Often, a confusion matrix is used as a starting
point for performance evaluation. A confusion matrix contains information about the actual
and predicted class values obtained by the model (Stehman 1997). For a two-class
classification problem, four possible outcomes are tallied: true positives (TP), true negatives
(TN), false positives (FP) and false negatives (FN). Figure 2.6 shows an example of a
confusion matrix.
Figure 2.6 Confusion Matrix for two-class classification
Various performance measurements can be derived from a confusion matrix. The
most frequently used measurements are:
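\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
\[ \text{Sensitivity (true positive rate)} = \frac{TP}{TP + FN} \]
\[ \text{Specificity (true negative rate)} = \frac{TN}{FP + TN} \]
\[ \text{False positive rate} = \frac{FP}{FP + TN} \]
\[ \text{F score} = \frac{2TP}{2TP + FP + FN} \]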
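These measurements can be derived from actual and predicted labels as in the following sketch. The function names and the toy labels are assumptions; the positive class is encoded as 1.

```python
def confusion_counts(actual, predicted, positive=1):
    """Tally TP, TN, FP and FN for a two-class problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn

def performance(tp, tn, fp, fn):
    """Performance measurements derived from the confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "false_positive_rate": fp / (fp + tn),
        "f_score": 2 * tp / (2 * tp + fp + fn),
    }

actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
counts = confusion_counts(actual, predicted)
scores = performance(*counts)
```

Note that specificity and the false positive rate always sum to 1, since both are computed over the actual negatives.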
2.3.2 ROC curve
The receiver operating characteristic (ROC) curve is another approach to measure the
performance of a classification model (Fawcett 2006). The ROC curve is a plot of the true
positive rate against the false positive rate for the different possible cutpoints of a diagnostic
test (Figure 2.7). The point (0,1) indicates that the model classifies all positive
and negative instances correctly. The point (0,0) corresponds to classifying every data point
as negative, and the point (1,1) corresponds to classifying every data point as
positive. The diagonal line corresponds to random guessing, because along it the true positive rate
equals the false positive rate at every cutpoint.
[Figure: ROC plot of True Positive Rate (y-axis) against False Positive Rate (x-axis), with curves labeled Random, Good, Better, and Best.]
Figure 2.7 ROC curve
The ROC curve provides a visual tool for assessing model performance. The area under the
curve (AUC) summarizes the performance of a model: the larger the area, the better the performance of
the model.
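The construction of the curve and of the area under it can be sketched as follows, assuming that a classifier assigns each data point a score and that higher scores indicate the positive class (label 1). The function names, the toy scores, and the use of the trapezoidal rule are assumptions made for the illustration.

```python
def roc_points(scores, labels):
    """ROC points (FPR, TPR) obtained by lowering the cutpoint over the scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]  # cutpoint above every score: all predicted negative
    tp = fp = 0
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Data points with tied scores cross the cutpoint together.
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def area_under_curve(points):
    """Area under a piecewise-linear curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
curve = roc_points(scores, labels)
auc_value = area_under_curve(curve)
```

For these toy scores the area is 0.875, which equals the fraction of positive-negative pairs in which the positive data point receives the higher score.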
CHAPTER 3 GENOME-WIDE SEARCH FOR TRANSLATED UPSTREAM OPEN
READING FRAMES IN ARABIDOPSIS THALIANA
Upstream open reading frames (uORFs) are open reading frames that occur within the
5’-UTR of an mRNA. uORFs have been found in many organisms. They play an important
role in gene regulation, cell development, and in various metabolic processes. It is believed
that translated uORFs reduce the translational efficiency of the main coding region.
However, only a few uORFs have been experimentally characterized. In this chapter, we used
ribosome footprinting together with a semi-supervised approach based on stacking
classification models to identify translated uORFs in Arabidopsis thaliana. Our approach
identified 5360 potentially translated uORFs in 2051 genes. GO terms enriched in genes with
translated uORFs include catalytic activity, binding, transferase activity, phosphotransferase
activity, kinase activity and transcription regulator activity. The reported uORFs occur with a
higher frequency in multi-isoform genes, and some uORFs are affected by alternative
transcript start sites or alternative splicing events. Association rule mining revealed sequence
features associated with the translation status of the uORFs. We hypothesize that uORF
translation is a complex process that might be regulated by multiple factors. This work has been
published in IEEE Transactions on NanoBioscience (Hu, Merchante et al. 2016).
3.1 Introduction
Upstream open reading frames (uORFs) are open reading frames that appear in the 5’
untranslated region (UTR) of an mRNA. uORFs have been found in many organisms and
play an important role in cell development and gene regulation. It is believed that translated
uORFs attenuate translation of the downstream main open reading frame (main ORF)
(Morris and Geballe 2000, Kwon, Lee et al. 2001, Calvo, Pagliarini et al. 2009, Jeon and
Kim 2010). It is estimated that there are more than 20,000 uORFs in the Arabidopsis genome,
but only a few uORFs have been experimentally characterized, and in most cases it is unknown what
biological function they have and whether they are translated (Imai, Hanzawa et al. 2006, Alatorre-Cobos, Cruz-Ramirez et al. 2012, Ebina, Takemoto-Tsutsumi et al. 2015). Experimental
identification of functional uORFs is time-consuming, and so far, only a few uORFs have been
directly investigated through forward genetic analysis at the whole-plant level (Hanfrey,
Franceschetti et al. 2002, Selpi, Bryant et al. 2009). Genome-wide identification of
translated uORFs via mass spectrometry has been challenging due to the short length of the
encoded proteins (von Arnim, Jia et al. 2014). Several studies have used evolutionary
conservation to predict functional uORFs (Cvijovic, Dalevi et al. 2007, Takahashi,
Takahashi et al. 2012). For example, Takahashi and colleagues (Takahashi, Takahashi et al. 2012)
developed a BLAST-based algorithm to identify conserved uORFs across various plant
species. Their approach identified 18 novel uORFs in Arabidopsis thaliana. Unfortunately,
conserved uORFs account for only a small part of the uORFs in the Arabidopsis genome: currently, the TAIR database lists only about 70 conserved uORFs, which account for less
than 1% of all uORFs (von Arnim, Jia et al. 2014). The biological function and translation
status of most uORFs are still unknown.
Recently, ribosome footprinting (RF) has been developed to investigate translation
via deep sequencing of ribosome protected mRNA fragments (ribosome footprints) which
provide evidence for translated regions (Ingolia, Ghaemmaghami et al. 2009). Several studies
have used RF information to identify translated open reading frames and translation initiation
sites (TISs). For example, Fritsch and colleagues analyzed the coverage of ribosome
footprints upstream of annotated TISs and trained a neural network to detect novel TISs. This
experiment identified 2994 novel uORFs in the human genome (Fritsch, Herrmann et al.
2012). A similar study in mouse used a support vector machine to learn the distribution of
ribosome footprints near TISs and identified 13,454 candidate TISs (Ingolia, Lareau et al.
2011).
In this study, we used a semi-supervised approach that combines RF data with additional
gene information to learn features of translated uORFs in Arabidopsis thaliana. Semi-supervised learning is a machine learning technique that uses unlabeled data in conjunction
with a few labeled data points for training. If the data follow certain smoothness, cluster, or
manifold assumptions, semi-supervised learning can capture the
characteristics of the data more precisely than a purely supervised learning approach, and as a result it can
produce accurate predictions using only a few labeled data points; for a thorough review,
see (Zhu 2008). Semi-supervised learning has been applied in many areas such as text mining,
disease classification and pattern recognition (Dara, Kremer et al. 2002, Hua-Jun, Xuan-Hui
et al. 2003, Longstaff, Reddy et al. 2010). The approach is most useful in scenarios where
labeled instances are expensive and hard to obtain, while unlabeled data are relatively easy to
collect, as is the case for our data. We combined semi-supervised learning with a
stacking-based classification model that utilizes RF data along with additional gene
information to identify translated uORFs. Stacking combines the predictions of several base-level classifiers by a meta-level classifier to improve predictive accuracy (Wolpert 1992,
SASO DZEROSKI 2004). Our algorithm discovered 5360 translated uORFs that occur in
2051 genes. Analysis of the predicted uORFs shows that the enriched GO-terms of the
uORF-containing genes include catalytic activity, binding, transferase activity,
phosphotransferase activity, kinase activity and transcription regulator activity and that
uORFs are prevalent in multi-isoform genes. Analysis of translation efficiency shows that
genes that contain translated uORFs tend to have lower translation efficiency in the main
ORF. Association rule mining revealed groups of sequence features that are associated with
the translation status of uORFs. Our results suggest that uORF translation is a complex
process that may be influenced by multiple factors.
3.2 Material and Methods
We used Ribo-seq and accompanying RNA-seq data from a study of gene-specific
translation regulation mediated by the hormone-signaling molecule EIN2 (Merchante,
Brumos et al.) and developed a semi-supervised learning approach combined with stacked
classification to identify translated uORFs in Arabidopsis. Our approach is outlined in the
following paragraph, whereas a detailed description of the individual steps and the semi-supervised
learning algorithm is provided in Sections 3.2.1-3.2.6.
We started by aligning ribosome footprints and the corresponding mRNA reads to the
genome sequence of Arabidopsis thaliana. The aligned reads were assigned to predicted
uORFs and annotated main coding regions. For each uORF, 12 features were extracted for
the subsequent analyses. Then, k-means clustering was used to construct a training dataset.
Subsequently, five different base-level classifiers were trained, and a stacking approach was
used to combine the results of the base classifiers. Finally, association rule mining was
applied to explore sequence features that are associated with the translation status of uORFs.
3.2.1 Data preparation
Ribosome footprints and RNA-seq data were generated from Arabidopsis seedlings.
We analyzed over 90 million reads from two biological wildtype replicates. First, we
performed quality control and removed adaptor sequences and low quality reads using the
FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/). The resulting reads were aligned
to the genome sequence of Arabidopsis thaliana using Tophat (Trapnell, Pachter et al. 2009).
Reads that mapped to multiple genomic positions, as well as reads with length smaller than
25bp, or larger than 40bp, were discarded. The remaining reads were assigned to transcript
regions using custom Perl scripts. Genome and transcript sequences, as well as gene
annotation were downloaded from The Arabidopsis Information Resource (TAIR, version 10,
http://www.arabidopsis.org/). We generated an exhaustive list of uORFs, where each uORF
consists of a start codon (ATG), followed by one or more amino-acid-coding
codons and an in-frame stop codon (TAG, TAA or TGA). All generated uORFs start in the 5’-UTR, but they might extend beyond this transcript region.
We excluded uORFs that consist of only a start and a stop codon (but no additional
codons) from our analysis, because such uORFs are not considered to be functional
(Andrews and Rothnagel 2014). This resulted in 29629 uORFs in 7831 genes.
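The uORF enumeration described above can be sketched in Python as follows. This is an illustration and not the custom Perl scripts used in this work. It assumes a transcript given as a DNA string, 0-based half-open coordinates, and a known 5’-UTR length; the function name find_uorfs and the toy sequence are hypothetical.

```python
STOP_CODONS = {"TAG", "TAA", "TGA"}

def find_uorfs(transcript, utr5_length):
    """Return (start, end) pairs of uORFs whose ATG begins in the 5'-UTR.

    Coordinates are 0-based and half-open; end includes the stop codon.
    uORFs consisting of only a start and a stop codon are skipped.
    """
    seq = transcript.upper()
    uorfs = []
    for start in range(min(utr5_length, len(seq) - 2)):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk downstream codon by codon until the first in-frame stop codon.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if pos > start + 3:  # at least one codon between ATG and stop
                    uorfs.append((start, pos + 3))
                break
    return uorfs

# Toy transcript: 21 nt 5'-UTR followed by the main ORF ATG GCC GCC TGA.
transcript = "CCATGAAATAGGGATGTAACCATGGCCGCCTGA"
uorfs = find_uorfs(transcript, utr5_length=21)
```

In the toy transcript, the ATG at position 2 yields the uORF ATG AAA TAG, whereas the ATG at position 13 is followed directly by a stop codon and is therefore excluded.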
3.2.2 Feature extraction
As described above, translation of an open reading frame can be divided into 3 stages:
initiation, elongation, and termination. Each stage corresponds to a particular pattern of ribosome
movement (Schmeing and Ramakrishnan 2009). During translation initiation, a
ribosome scans along the mRNA and starts translation when a translation start site is
recognized. The ribosome slows down, and as a consequence, an accumulation
of ribosome footprints before the translation start sites of translated open reading frames can be
observed (Ingolia, Lareau et al. 2011, Juntawong, Girke et al. 2014). In the elongation step,
the ribosome translates the codons three nucleotides at a time. For a well-translated region,
the protein is continuously synthesized by the ribosome, so ribosome
footprints appear throughout the region without a large gap. During termination, the ribosome dissociates
at the stop codon. To identify translated uORFs, we generated several features measuring
ribosome footprint distribution patterns that are characteristic of the different stages of
translation (Figure 3.1). Our rationale is that translated uORFs should display these features
accordingly.
[Figure: schematic of an open reading frame and the distribution of ribosome footprints across the initiation, elongation, and termination stages, annotated with the features Den_left, Dp, De, Ds, Den_utr, Var, Den, Den_max, Den_min, l, and Den_right.]
Figure 3.1 Translation stages and features characteristic of the different translation
stages.
In addition, several features of functional uORFs are known to impact the translation
of downstream main ORFs, for example the length of the uORF and the distance between the
uORF and its main ORF (Vilela and McCarthy 2003). We include these features in our
model. The following describes the individual features in detail.
Denote by u(1), u(2), …, u(l) the sequence positions of a uORF with length l in a
transcript t. We assume that the main ORF starts at position s in t. For i=1, ..., l we denote the
number of ribosome footprints mapping to u(i) by c(u(i)). We computed 12 features for each
possible uORF:
1) Distance from uORF start to the start of main ORF: Ds = s - u(1).
2) Distance from uORF end to main ORF: De = s - u(l).
3) Length of uORF: l.
4) Distance from uORF start to the nearest peak of the ribosome density curve: Dp.
Assume the number of ribosome footprints aligned to the positions of a gene with length n
were counted as (c(1), c(2), …, c(n)). We use a kernel smoother to estimate the ribosome
density curve:
\[ f_k(x) = \frac{1}{nh} \sum_{i=1}^{n} K_h\!\left( \frac{x - c(i)}{h} \right), \]
where Kh is the kernel function and h is the smoothing parameter (bandwidth). We
used the R function density with kernel function Gaussian and bandwidth 5. Peaks p(1), p(2),
…, p(k) of the density curve indicate positions where ribosome footprints accumulate. We
have
Dp = min( | p(i)-u(1) | ), i = 1, 2, …, k.
5) Ribosome footprint density of a uORF: Den, with
\[ \mathrm{Den} = \frac{\sum_{i=u(1)}^{u(l)} c(i)}{l}. \]
6) Maximum local ribosome footprint density of uORF: Den_max. The local
ribosome footprint density is calculated using a sliding window of size 3 along the uORF
region. Den_max is the maximum value of the resulting local ribosome densities.
7) Minimum local ribosome footprint density of uORF: Den_min. The value is
computed analogously to Den_max.
8) Ribosome footprint density for the region left of the uORF: Den_left. The ribosome
footprint density upstream of the uORF indicates ribosome loading before the start codon. We chose
15 base pairs because this is about half the length of the ribosome in Arabidopsis. We have
\[ \mathrm{Den\_left} = \frac{\sum_{i=u(1)-15}^{u(1)} c(i)}{15}. \]
9) Ribosome footprint density for the region right of uORF: Den_right, with
\[ \mathrm{Den\_right} = \frac{\sum_{i=u(l)}^{u(l)+15} c(i)}{15}. \]
10) Variance of ribosome footprints distribution along uORF region: Var, with
\[ \mathrm{Var} = \frac{1}{l} \sum_{i=u(1)}^{u(l)} (c(i) - \mu)^2, \]
where µ is the mean value of c(i) in the uORF region.
11) Ribosome density of UTR region: Den_utr. Assuming the UTR region extends
from position a to position b on a transcript, we have
\[ \mathrm{Den\_utr} = \frac{\sum_{i=a}^{b} c(i)}{b - a + 1}. \]
12) Ribosome density of CDS: Den_cds. The value is computed analogously to
Den_utr.
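Several of the density features can be computed from a vector of per-position footprint counts, as in the following sketch. It makes assumptions beyond the text: positions are 0-based, the local density is taken as the mean count in each sliding window with step 1, and flanking regions are truncated at the transcript ends. The flanking sums include the uORF boundary position, as written in the formulas for Den_left and Den_right. The function name footprint_features is hypothetical.

```python
def footprint_features(c, start, end, flank=15, window=3):
    """Density features of a uORF occupying positions start..end (inclusive).

    c is a list of footprint counts per transcript position (0-based).
    The uORF is assumed to be at least `window` positions long.
    """
    region = c[start:end + 1]
    length = len(region)
    den = sum(region) / length
    # Local densities: mean count in each sliding window along the uORF.
    local = [sum(region[i:i + window]) / window
             for i in range(length - window + 1)]
    var = sum((x - den) ** 2 for x in region) / length
    left = c[max(0, start - flank):start + 1]   # u(1)-flank .. u(1)
    right = c[end:end + flank + 1]              # u(l) .. u(l)+flank
    return {
        "Den": den,
        "Den_max": max(local),
        "Den_min": min(local),
        "Den_left": sum(left) / flank,
        "Den_right": sum(right) / flank,
        "Var": var,
    }

counts = [0, 2, 4, 1, 1, 1, 0, 0, 0, 3, 3, 3, 1, 0]
features = footprint_features(counts, start=3, end=11, flank=2)
```

The example uses a flank of 2 positions only to keep the toy vector short; the default of 15 corresponds to the value chosen in the text.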
3.2.3 Semi-supervised Learning: Clustering based Classification
Translated transcript regions show a characteristic ribosome footprint distribution
pattern. Unfortunately, there were only 8 experimentally verified translated uORFs that are
well expressed in our samples. Since these experimentally characterized uORFs showed the
expected ribosome footprint distribution pattern and occurred in a distinct cluster of selected,
high-quality uORFs we used a clustering based semi-supervised approach, similar to the text
classification algorithm described in (Hua-Jun, Xuan-Hui et al. 2003), to identify translated
uORFs in our data. Our approach consists of two steps:
1. Clustering step: we cluster selected unlabeled data points under the guidance of
labeled data to obtain an expanded training set. The details of the approach are described in
Section 3.2.4 of Material and Methods and Section 3.3.1 of the Results.
2. Classification step: we use the generated training data to train a stacked classifier.
The details of the approach are described in Section 3.2.5 of Material and Methods.
The algorithm is given below.
Algorithm: Clustering-based Classification
Input: labeled and unlabeled uORF datasets
Output: translation status of uORFs
Clustering step:
1: Combine labeled data with selected unlabeled data points
2: Perform k-means clustering on combined data
3: Select clustering (k) based on average silhouette value
4: Assign class label based on clustering results
5: Select training data D
Classification step:
6: Train base-level classifiers using D
7: Combine outputs of base-level classifiers with features of D
8: Train meta-level classifier MC using combined dataset
9: Stacked classification using base-level classifier and MC to determine the translation
status of uORFs
3.2.4 Clustering and Training Set Construction
We performed k-means clustering to identify groups of similar uORFs. The resulting
clusters were characterized with respect to their translation behavior. A detailed description
is given in Section 3.3.1. The training set is constructed based on our clustering results.
The positive class (translated uORFs) is chosen from cluster 2.1 (see Fig. 3.3). The uORFs in
this cluster show the characteristic distribution of ribosome footprints of a canonically
translated uORF. There are 76 uORFs in this cluster. The negative class is randomly selected
from clusters 1 and 2.2. The uORFs in these clusters exhibit a small ribosome footprint
density, or their ribosome footprints accumulate far from the translation start site. We selected 76
records for each class, resulting in a training set with 152 records.
3.2.5 Stacking Classification
Our classification model is developed based on the positive and negative classes
described above. We use five base-level classification algorithms: a k-nearest neighbor
classifier, a support vector machine (SVM), a decision tree, a Naïve Bayes classifier, and a
neural network. We have chosen these five algorithms because they use different
classification strategies. Each classifier was tuned and the model with the lowest error rate was
used. The performance of the different classifiers was evaluated by leave-one-out
cross-validation.
Stacking is a method that combines the predictions of several base-level classifiers by a
meta-level classifier in order to improve predictive accuracy. More precisely, following the
description in (Wolpert 1992, SASO DZEROSKI 2004): given a training data set D and
learning algorithms L1, L2, …, Lt, stacking first uses D to learn a set of base-level classifiers
BC1, BC2, …, BCt, with BCi = Li(D) for i=1, ..., t. Subsequently, using the predictions of
BC1, BC2, …, BCt, a meta-level classifier MC is learned. Correspondingly, during the
prediction step, data features are first used by the base-level classifiers. Then, the resulting
base-level predictions are used by the meta-level classifier to generate the final prediction. It
has been shown that stacking can combine the expertise of different base-level classifiers
while reducing their bias (Wolpert 1992, SASO DZEROSKI 2004).
Here, we apply stacking with five different base-level classifiers to the features from
the uORF regions in the training set. First, we use our training set to train the base-level
classifiers. We record the results of the base-level classifiers and use them together with the
extracted features of the training set to train a meta-level k-nearest neighbor classifier.
We use leave-one-out cross validation to evaluate the performance of our approach. Suppose
our training dataset D consists of N records D(1), ..., D(N). For each record D(i), i=1, ..., N,
we train the base-level and meta-level classifiers using D - D(i), and we evaluate their
performance using D(i).
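The interplay of stacking and leave-one-out cross validation can be illustrated with a small, self-contained Python sketch. It is a deliberate simplification and not the pipeline used in this work: two simple base-level learners (1-nearest neighbor and nearest centroid) replace the five classifiers named above, the meta-level classifier is a 1-nearest neighbor classifier, class labels are encoded as 0 and 1, and all names are hypothetical.

```python
from math import dist

def nearest_neighbor(train_x, train_y):
    """1-nearest-neighbor learner: returns a prediction function."""
    def predict(x):
        i = min(range(len(train_x)), key=lambda i: dist(x, train_x[i]))
        return train_y[i]
    return predict

def nearest_centroid(train_x, train_y):
    """Nearest-centroid learner: returns a prediction function."""
    centroids = {}
    for label in set(train_y):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = tuple(sum(v) / len(members) for v in zip(*members))
    def predict(x):
        return min(centroids, key=lambda label: dist(x, centroids[label]))
    return predict

BASE_LEARNERS = [nearest_neighbor, nearest_centroid]  # stand-ins (assumption)

def train_stack(train_x, train_y):
    """Train base-level classifiers, then a meta-level classifier on
    the original features combined with the base-level predictions."""
    base = [learn(train_x, train_y) for learn in BASE_LEARNERS]
    def augment(x):
        return tuple(x) + tuple(float(b(x)) for b in base)
    meta = nearest_neighbor([augment(x) for x in train_x], train_y)
    return lambda x: meta(augment(x))

def leave_one_out_accuracy(xs, ys):
    """Train on D - D(i) and evaluate on D(i) for every record i."""
    correct = 0
    for i in range(len(xs)):
        model = train_stack(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        correct += model(xs[i]) == ys[i]
    return correct / len(xs)

xs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
      (5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (6.0, 6.0)]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
accuracy = leave_one_out_accuracy(xs, ys)
```

The held-out record is never seen by either the base-level or the meta-level classifiers during training, which mirrors the evaluation scheme described above.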
3.2.6 Association Rule Mining
Association rule mining is a powerful tool to discover interesting patterns; for a
detailed description we refer the reader to (Agrawal 1993). We apply association rule mining
to find groups of sequence features associated with the translation status of uORFs. We
computed the following sequence features for each uORF. Numerical features were
discretized into 5 groups (small, medium_small, medium, medium_large, and large) using
equal frequencies:
• Length of the 5’-UTR/CDS/3’-UTR of the gene that contains the uORF (futr/cds/tutr).
• Order of the uORF, indicating the relative position of each individual uORF along the 5’-UTR (order).
• Reading frame of the uORF with respect to the main ORF (rf: 0/1/2).
• Number of uORFs that occur in the gene (no_of_uorf).
• Minimum free energy of the uORF region (mfe). The minimum free energy was computed using the ViennaRNA package (Lorenz, Bernhart et al. 2011).
• Occurrence of A or G at position -3 before the start codon (a_g_3: 0 absence, 1 presence).
• Occurrence of G at position +4 after the start codon (g_4: 0 absence, 1 presence).
• Instability index for uORF peptides (ins: stable or instable) (Guruprasad, Reddy et al. 1990). Peptides with an instability index smaller than 40 are predicted as stable.
• Potential protein interaction index for the uORF peptide (pii: bp or nbp) (Boman 2003). Proteins with a potential protein interaction index higher than 2.48 are considered to have high binding potential (bp).
• Codon adaptation index for the uORF region (cai) (Rice, Longden et al. 2000).
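The equal-frequency discretization of numerical features mentioned above can be sketched as follows. The sketch assigns group labels by rank so that each group receives roughly the same number of values; how ties are handled (here: by sort order) is an assumption, as is the function name equal_frequency_bins.

```python
GROUPS = ("small", "medium_small", "medium", "medium_large", "large")

def equal_frequency_bins(values, groups=GROUPS):
    """Assign each value to one of len(groups) bins of (nearly) equal size."""
    n = len(values)
    k = len(groups)
    order = sorted(range(n), key=lambda i: values[i])  # indices by ascending value
    result = [None] * n
    for rank, i in enumerate(order):
        result[i] = groups[rank * k // n]  # rank-based bin index in 0..k-1
    return result

lengths = [5.0, 1.0, 9.0, 3.0, 7.0, 2.0, 8.0, 4.0, 10.0, 6.0]
binned = equal_frequency_bins(lengths)
```

Each discretized value can then be recorded as an item such as mfe=small in the transaction of the corresponding uORF.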
For each uORF, we record the corresponding set of sequence features and the
translation status (translated or untranslated) of the uORF as items in a transaction t. The
resulting transactions form our transaction database T = {t1, t2, …, tm}. An association rule
r:X→Y describes an implication between the items of X and Y, where X and Y are disjoint
itemsets. In our analysis, we constrain the left-hand side (X) of r to sequence features, and
the right-hand side (Y) to the uORF translation status. Three parameters were used to select
relevant rules: support, confidence, and lift. Support of rule r:X→Y corresponds to the
proportion of transactions in T that contain X∪Y. Confidence measures the proportion of
transactions that contain X∪Y among the transactions that contain X. Lift is calculated as the
support of X∪Y divided by the product of the support of X and the support of Y.
The lift value measures the departure from a random model; a lift value larger than 1
indicates a positive association between X and Y.
We used the Apriori algorithm [27] implemented in the R package arules
(http://cran.r-project.org/web/packages/arule/index.html,[26]) to find rules that meet our
thresholds for support (0.02) and confidence (0.8). The resulting rules are sorted according to
the lift value and listed in Table 3.4.
3.3 Results
3.3.1 Cluster Analysis
To identify groups of similar uORFs, we performed a cluster analysis using k-means
clustering and Euclidean distance. We restricted our analysis to well-expressed genes that
contain one single uORF in their 5’-UTRs to avoid ambiguities that might occur if different
translated and untranslated uORFs appear in close proximity on the same transcript.
To determine a suitable number of clusters k, we used the average silhouette value
(Rousseeuw 1987). The silhouette value measures the fit of a data point within its cluster in
comparison with neighboring clusters. Silhouette values are in the range of -1 to 1. A
silhouette value close to 1 indicates that a data point is in an appropriate cluster, while a
silhouette value close to -1 indicates that it might be erroneously assigned. Fig. 3.2 shows the
average silhouette value for different numbers of clusters k. The average silhouette values for
k=2, ..., 6 clusters are similar and clearly larger than average silhouette values for k>6.
Figure 3.2 Average silhouette value for different numbers of clusters k.
Fig. 3.3A shows the evolution of clusters for k=2, ..., 6. For k=2, we have 2 clusters:
cluster 1 and cluster 2. For k=3, cluster 2 is further divided into cluster 2.1 and cluster 2.2.
When k is further increased to 6, the clusters 2.1 and 2.2 remain unchanged, while cluster 1 is
divided into multiple clusters.
To identify translated uORFs, we have focused on ribosome footprint density in the
uORF region (Den) and the distance from uORF start to the nearest peak of the density curve
(Dp). These two features are the most important features that determine the translation status
of uORFs (Ingolia, Lareau et al. 2011, Juntawong, Girke et al. 2014). Fig. 3.3B shows a
scatter plot of Den and Dp superimposed with the point density as background. The figure
shows three clusters that coincide with the clusters reported by 3-means clustering: cluster 1
(diamond-shaped points at the bottom) consists of uORFs for which only few ribosomal
footprints have been detected, indicating that uORFs from this cluster are not translated. In
contrast, the uORFs of the other 2 clusters show ample ribosomal footprints. Cluster 2.1
(circular-shaped points) shows small Dp and large Den values, while cluster 2.2 (triangular-shaped points) shows large Dp and smaller Den values. During k-means clustering with
k=4, ..., 6, only cluster 1 is subdivided. Since this subdivision does not provide additional
information about the translated uORFs, we decided to choose k=3 for our subsequent analyses.
To learn the characteristics of the uORFs in the different clusters, we analyzed their
features. There are no ribosome footprints in the uORFs from cluster 1, and we hypothesize
that the uORFs in this group are not translated. The group accounts for about 65% of the total
uORFs in our dataset. The uORFs from cluster 2.1 and 2.2 have a positive footprint density
Den, but we observed significant differences in the variable Dp. Dp is significantly larger in
cluster 2.2 indicating that fewer ribosomal footprints accumulate at the start codon of the
corresponding uORFs. This is inconsistent with translated open reading frames [10,15]. In
addition, we analyzed:
1) Experimentally verified uORFs: there are two experimentally verified uORFs
whose genes are well expressed and translated in the dataset. Both uORFs belong to cluster
2.1.
2) GO terms of uORF-containing genes: we used AgriGO (Du, Zhou et al. 2010) to
identify overrepresented GO terms at a significance level of 0.05. Genes in cluster 2.1 show
terms such as biological regulation (GO:0065007), metabolic process (GO:0008152), and
cellular process (GO:0009987). This is consistent with the GO term annotation of currently
known uORF-containing genes (Imai, Hanzawa et al. 2006, Tabuchi, Okada et al. 2006,
Alatorre-Cobos, Cruz-Ramirez et al. 2012, von Arnim, Jia et al. 2014). We did not find
significantly overrepresented GO terms for clusters 1 and 2.2.
3) Local minimum coverage of ribosome footprints in a uORF region (Den_min): A
Den_min value larger than 0 indicates the continuous translation of ribosomes in the region.
Ideally, a well translated uORF should show Den_min=0 for only a small fraction of its
length. We found a statistically significant difference of this value between cluster 2.1 and
cluster 2.2. For cluster 2.1, about 20% of uORFs have Den_min=0, whereas for cluster 2.2 the
proportion is 74%.
4) Ribosome footprint density in the first 6 bp immediately after the start codon:
we checked the footprint density immediately after the start codon. Our analysis indicates
that ~70% of uORFs in cluster 2.2 do not show any footprints in this region. In contrast, all
uORFs in cluster 2.1 show a non-zero footprint density in this region.
Based on these observations we chose the uORFs in cluster 2.1 as templates for
translated uORFs and used them as positive class to train our classifiers. After inspecting
representative examples from cluster 2.2 (see the read coverage plots in Fig. 3.3B), we hypothesize
that some of the uORFs in this cluster might use a different, non-canonical translation start
site.
[Figure 3.3 image. Panel A: tree of clusters for k = 1, 2, 3, 4 and 6. Panel B: scatter plot of Den against Dp with example read coverage plots.]
Figure 3.3 Visualization of clusters generated by k-means clustering. A. Evolution of
clusters according to different values of k. B. Scatter plot of ribosome density and
distance between the start of uORF and nearest peak of ribosome density curve.
Different shape of points indicates different clusters identified by 3-means clustering.
Cluster 1: diamond-shaped points. Cluster 2.1: circular points. Cluster 2.2: triangular
points. The background color indicates the density of points. A darker color indicates a
higher point density. The reads coverage plots above show uORFs from cluster 2.1 and
cluster 2.2.
3.3.2 Performance Evaluation
To compare our base-level classifiers, we determined which data points of the training set
each algorithm identifies correctly under leave-one-out cross validation. We found a large
overlap between algorithms; however, each classifier also correctly identifies a certain
proportion of the data that is missed by the other algorithms (Table 3.1). The results suggest
that each classifier has its own strengths in classifying uORFs correctly. We therefore apply
stacking to combine the strengths of the individual classifiers and obtain more accurate predictions.
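The pairwise comparison summarized in Table 3.1 can be sketched as follows. This is a minimal illustration, not the code used in this study: the function name `overlap_table`, the toy index sets, and the choice to express each overlap as a fraction of the row classifier's correctly classified points are assumptions made for the example.

```python
from itertools import product


def overlap_table(correct):
    """Build a Table-3.1-style summary.

    correct maps a classifier name to the set of data point indices that the
    classifier labels correctly. Diagonal entries hold the number of correctly
    classified points; off-diagonal entries hold the fraction of the row
    classifier's correct points that the column classifier also gets right
    (this normalization is an assumption of the sketch).
    """
    table = {}
    for row, col in product(correct, repeat=2):
        if row == col:
            table[(row, col)] = len(correct[row])
        else:
            shared = len(correct[row] & correct[col])
            table[(row, col)] = shared / len(correct[row]) if correct[row] else 0.0
    return table


# Hypothetical toy data: indices of correctly classified training points.
toy = {"SVM": {0, 1, 2, 3}, "DT": {1, 2, 3, 4, 5}, "NB": {0, 5}}
t = overlap_table(toy)
print(t[("SVM", "DT")], t[("DT", "SVM")])
```

Because the two fractions use different denominators, the resulting table is not symmetric, which is also the case for the percentages reported in Table 3.1.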
To further evaluate the performance of our classifiers, we performed leave-one-out
cross validation and calculated accuracy, precision, recall and F-score. The Receiver Operating
Characteristic (ROC) curve was also used to assess the overall performance (Fawcett
2006). The area under the curve (AUC) summarizes the performance of a classifier: the larger
the area, the better the performance. Accuracy, precision, recall and F-score
are defined as follows. Let TP, FN, TN and FP denote the numbers of true positives, false
negatives, true negatives and false positives, respectively. Then:
Accuracy = (TP+TN)/(TP+FN+TN+FP),
Precision = TP/(TP+FP),
Recall = TP/(TP+FN),
F-score = 2TP/(2TP+FP+FN).
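The four definitions above translate directly into code. The following sketch is for illustration only; the function name `classification_metrics` and the confusion counts in the usage example are hypothetical and are not taken from our experiments. The sketch assumes that the denominators are non-zero.

```python
def classification_metrics(tp, fn, tn, fp):
    """Accuracy, precision, recall and F-score from confusion counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f_score": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical counts chosen only to exercise the formulas.
m = classification_metrics(tp=9, fn=3, tn=80, fp=8)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

The F-score computed this way equals the harmonic mean of precision and recall, 2PR/(P+R), which provides a quick consistency check on reported values.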
Table 3.1 Correctly identified data points by base-level classifiers.
SVM: support vector machine, DT: decision tree, NB: Naïve Bayes, NN: neural
network, KNN: k-nearest neighbor. The numbers on the main diagonal in the table
indicate the total number of correctly classified data points for the individual classifier.
Off-diagonal elements denote the pairwise overlap of correctly classified data points
between pairs of classifiers.
        SVM    DT     NB     NN     KNN
SVM     115    97%    78%    97%    95%
DT      82%    136    71%    90%    82%
NB      83%    91%    105    90%    82%
NN      83%    94%    72%    130    82%
KNN     91%    97%    75%    93%    115
Table 3.2 Overall classifier performance
Algorithm            Accuracy   Precision   Recall   F-score   AUC
KNN                  0.76       0.63        0.84     0.72      0.85
SVM                  0.76       0.74        0.77     0.75      0.85
DT                   0.89       0.84        0.94     0.89      0.86
NB                   0.69       0.54        0.77     0.64      0.86
NN                   0.86       0.87        0.85     0.86      0.92
Stacking with KNN    0.90       0.95        0.87     0.91      0.94
To demonstrate the power of our stacking approach, we compared the performance of
stacking with the performance of the individual base classifiers (Table 3.2, Fig. 3.4).
Stacking outperforms the underlying base-level classifiers on every measure except
recall, which indicates that stacking achieves very good prediction accuracy.
We used our model to identify translated uORFs and compared our predictions to the
results of previous studies. We collected experimentally characterized uORFs from the
literature (Hanfrey, Franceschetti et al. 2002, Wiese, Elzinga et al. 2004, David-Assael,
Saul et al. 2005, Nishimura, Wada et al. 2005, Imai, Hanzawa et al. 2006, Puyaubert, Denis
et al. 2008, Nyiko, Sonkoly et al. 2009, Alatorre-Cobos, Cruz-Ramirez et al. 2012, Zhu,
Thalor et al. 2012), together with uORFs that were untranslated according to these
experiments, and downloaded the conserved uORFs annotated by TAIR. The results are shown
in Table 3.3. For the genes expressed and translated in our experiment, our method detects
75% of the experimentally verified uORFs (9 of 12). Similarly, most of the conserved uORFs
identified through conservation searches across multiple species are identified by our
approach (33 of 45). These results suggest that conserved uORFs are often translated.
Table 3.3 uORFs from current studies
EC: experimentally characterized; CD: conserved. True positive rate = TP/(TP+FN),
False positive rate = FP/(FP+TN).
Category   Total   Expressed/translated   uORFs identified   True positive   False positive
of uORF            in our experiment      by stacking        rate            rate
EC         19      12                     9                  75.0%           8%
CD         69      45                     33                 73.3%           9%
3.3.3 Translated uORFs in Arabidopsis thaliana
Our stacking approach identified 5360 translated uORFs. The identified uORFs occur
in 2051 genes, which account for about 6% of all annotated genes in Arabidopsis thaliana.
This number is likely an underestimate, since about 30% of uORF-containing genes were
not transcribed in our experiment.
Among the genes that contain translated uORFs, the majority (~58%) contain only
a single uORF. However, a few genes include up to 30 uORFs. To further
characterize genes that contain translated uORFs, we performed a GO-term enrichment
analysis using AgriGO (Du, Zhou et al. 2010). Enriched GO terms at a significance level of
0.05 include catalytic activity, binding, transferase activity, phosphotransferase activity,
kinase activity and transcription regulator activity (Fig. 3.5A, B).
Figure 3.5. Characteristics of genes that contain translated uORFs. A. Distribution of
number of uORFs per gene. B. Enriched GO-terms for genes containing translated
uORFs. Background represents the GO-terms for all gene models in TAIR 10. C.
Proportion of genes containing translated uORFs. D. Box plot of translation efficiency
for genes with translated uORFs and genes without translated uORFs.
Remarkably, the majority of translated uORFs occur in multi-isoform genes. When
comparing expressed single- and multi-isoform genes with respect to the occurrence of
translated uORFs, we found a significant difference (p-value < 2.2e-16, Fisher’s exact test):
only 3.4% of the single-isoform genes contain translated uORFs, whereas about 19% of the
multi-isoform genes contain translated uORFs (Fig. 3.5C). About 15% of these uORFs do not
occur in all transcripts generated by the multi-isoform genes. See Table 3.4 and Fig. 3.5C for
a detailed breakdown.
Table 3.4 Translated uORFs identified in Arabidopsis
There are 27717 single-isoform genes and 5885 multi-isoform genes in the Arabidopsis
genome.
Category                  Translated uORFs   Genes with translated uORFs   Percentage of all genes
All                       5360               2051                          6.10%
In multi-isoform genes    3783               1121                          3.34%
Affected by AS events     580                293                           0.87%
In single-isoform genes   1577               930                           2.77%
Ribosome footprinting provides us with a way to measure translation efficiency (TE)
of the main ORF, thus we were able to investigate the role of uORFs in the regulation of
gene translation. TE is calculated as the ribosome footprint density divided by its
corresponding RNA-seq read density, and it indicates how well a transcript is translated
(Ingolia, Ghaemmaghami et al. 2009). We compared the TE between genes with translated
uORFs and genes without translated uORFs, and we found a significant reduction of TE for
transcripts with translated uORFs (p-value = 1.593e-14) (Fig. 3.5D).
3.3.4 Features associated with translated uORFs
To further investigate the mechanisms involved in uORF translation, we searched for
features that are associated with the translation status of uORFs. We extracted a set of
sequence features from the vicinity of translated/untranslated uORF regions and performed
association rule mining to identify features associated with the translation status of the
assessed uORFs. The rules that meet support (0.02) and confidence (0.8) thresholds are
shown in Table 3.5. We found that translated uORFs tend to occur in genes with
long coding sequences, long 5’-UTRs and long 3’-UTRs, and that these genes have fewer uORFs in their 5’-UTR than genes with untranslated uORFs.
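The three quantities reported for each rule in Table 3.5 can be computed as in the following minimal sketch. It assumes that every uORF is represented by a set of "feature=value" items plus a class label (TR or UN); the function name `rule_stats` and the six toy records are hypothetical and serve only to illustrate the definitions of support, confidence and lift.

```python
def rule_stats(records, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent.

    records is a list of (items, label) pairs, where items is a set of
    "feature=value" strings and label is the class of the uORF.
    """
    n = len(records)
    n_ante = sum(1 for items, _ in records if antecedent <= items)
    n_cons = sum(1 for _, label in records if label == consequent)
    n_both = sum(
        1 for items, label in records
        if antecedent <= items and label == consequent
    )
    support = n_both / n            # fraction of records matching the whole rule
    confidence = n_both / n_ante    # P(consequent | antecedent)
    lift = confidence / (n_cons / n)  # confidence relative to the class frequency
    return support, confidence, lift


# Hypothetical toy records; item names follow the style of Table 3.5.
toy = [
    ({"order=2", "tutr=medium_large"}, "TR"),
    ({"order=2", "tutr=medium_large"}, "TR"),
    ({"order=2", "tutr=medium_large"}, "UN"),
    ({"order=1", "tutr=small"}, "UN"),
    ({"order=1", "tutr=small"}, "UN"),
    ({"order=2", "tutr=small"}, "TR"),
]
support, confidence, lift = rule_stats(toy, {"order=2", "tutr=medium_large"}, "TR")
print(round(support, 3), round(confidence, 3), round(lift, 3))
```

When both classes are equally frequent, lift equals twice the confidence. The Lift column of Table 3.5 is approximately twice the Confidence column, which would be consistent with roughly balanced classes.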
It has been suggested that the sequence context near the translation start site
influences the translation of a coding region (Kozak 1994). In higher plants, the most
conserved sequence patterns near the translation start site of highly expressed genes are the
presence of A/G at position -3, and the presence of G at position +4 (Joshi, Zhou et al. 1997).
However, in general, sequence conservation in the vicinity of the translation start site differs
among species. In our analysis, we found rules that associate the presence of A/G at
position -3 and of G at position +4 with respect to the start codon with translated uORFs.
On the other hand, our analysis produced rules that associate the absence of these
nucleotides at the corresponding sequence positions with untranslated uORFs. We also found
rules that include features other than the sequence context of the start codon. For example,
the peptides encoded by translated uORFs tend to be associated with high codon bias. Some
features occur in rules that describe both translated and untranslated uORFs. For example,
there are rules in which the peptides of both translated and untranslated uORFs have
non-binding potential, but a uORF with a non-binding-potential peptide and a short UTR tends
to be untranslated. These results indicate that uORF translation is a complex process that
may depend on the interaction of multiple sequence features.
Table 3.5 Rules of translated/untranslated uORFs
The sequence features (items) are described in Part F of Material and Methods. UN:
untranslated uORF, TR: translated uORF. All reported rules are statistically
significant and have adjusted p-values smaller than 1.21E-03. P-values are calculated
using a chi-square test (Alvarez 2003) and adjusted using Bonferroni correction
(Hochberg 1995).
Rule                                                          Support   Confidence   Lift
{order=2,pi=nbp,tutr=medium_large} → TR                       0.02      0.87         1.74
{order=2,tutr=medium_large} → TR                              0.03      0.86         1.73
{pi=nbp,futr=long,tutr=medium_large,mfe=medium} → TR          0.02      0.82         1.64
{order=2,pi=nbp,cai=medium_large} → TR                        0.02      0.81         1.62
{a_g_3=1,g_4=1,futr=medium_large} → TR                        0.02      0.81         1.62
{rf=0,cds=large,no_of_uorf=1} → TR                            0.02      0.81         1.62
{pi=nbp,futr=large,tutr=medium_large} → TR                    0.03      0.80         1.60
{tutr=medium_large,no_of_uorf=2,mfe=medium} → TR              0.02      0.80         1.60
{ins=stable,cds=small,tutr=small} → UN                        0.02      1.00         2.00
{g_4=0,cds=small,no_of_uorf=[5,52]} → UN                      0.03      0.97         1.95
{cds=small,no_of_uorf=[5,52]} → UN                            0.03      0.95         1.91
{cds=small,tutr=small,mfe=medium} → UN                        0.02      0.94         1.89
{g_4=0,futr=large,cds=short} → UN                             0.02      0.92         1.83
{futr=large,cds=small} → UN                                   0.03      0.89         1.78
{rf=2,order=1,pi=nbp,tutr=small,mfe=medium} → UN              0.02      0.88         1.76
{order=1,a_g_3=0,g_4=0,pi=nbp,ins=stable,tutr=small} → UN     0.03      0.85         1.71
3.4 Conclusion
In this study, we described a semi-supervised approach to identify translated uORFs
using ribosome footprinting data in combination with additional genomic features. We have
chosen this approach because, given the extremely small number of experimentally
characterized uORFs, a simple classification approach would likely result in unreliable
predictions. Since we have observed that the distribution of ribosome footprints follows
characteristic patterns during the different stages of translation, we expect that incorporating
unlabeled data points that follow these distribution patterns and cluster with the
experimentally validated uORFs will help to learn the true characteristics of translated
uORFs and likely improve our results.
Using this approach, we identified 5360 translated uORFs in 2051 genes, which
account for 6% of the annotated genes in Arabidopsis thaliana.
Previously, it has been shown that genes encoding transcription factors are overrepresented among the genes containing conserved uORFs (Hayden and Jorgensen 2007).
Our study confirms these results. In addition, our analysis highlights other regulatory
functions, such as catalytic activity, transferase activity and kinase activity enriched in genes
that contain translated uORFs.
Several studies have suggested that uORFs may control expression of downstream
main ORFs via different mechanisms [1, 2]. Often, a uORF attenuates the translation of the
main CDS due to a reduction of ribosome re-initiation at the downstream translation start site
[1, 2]. The presence of a uORF might also create a premature termination codon, and thus
trigger nonsense-mediated decay (Rehwinkel, Raes et al. 2006, Barbosa, Peixeiro et al. 2013).
In addition, various sequence features that affect translation status and translation speed,
often in combination with uORFs, have been reported (Kozak 2001, Vilela and McCarthy
2003, Shu, Dai et al. 2006). Our analysis is consistent with these observations. We also found
a statistically significant reduction of TE in genes with translated uORFs. In addition,
association rule mining revealed sequence features associated with the translation status of
uORFs. We hypothesize that the corresponding association rules describe the translation
behavior of uORFs.
A further result of our study is the observation that translated uORFs occur with a
higher frequency in multi-isoform genes than in single-isoform genes. Several uORFs occur
only in a subset of the transcripts generated from the corresponding gene locus. We
hypothesize that in some cases alternative transcription start sites, or alternative splicing
events, might regulate presence and absence of translated uORFs.
Together, these results suggest that uORF translation and, consequently, the
translation of the corresponding downstream main ORFs is a complex process that may be
regulated by multiple factors.
In summary, this study provides a novel bioinformatics approach to identify
potentially translated uORFs in the Arabidopsis genome using Ribo-seq data. Our analysis
provides a list of candidate uORFs for future functional analyses. We expect that additional
translated uORFs will be found in the future as more ribosome footprinting datasets become
available for Arabidopsis and additional uORFs get experimentally validated.
CHAPTER 4 MINING TRANSCRIPT FEATURES RELATED TO TRANSLATION
IN ARABIDOPSIS USING LASSO AND RANDOM FOREST
Translation is an important process for all living organisms. During translation,
messenger RNA is rewritten into protein. Multiple control mechanisms determine how much
protein is generated during translation. In particular, several regulatory elements located on
mRNA transcripts are known to affect translation. In this study, a genome-wide analysis was
performed to mine features related to translation in the genome of Arabidopsis thaliana. We
used ribosome footprinting data to measure translation and constructed a predictive model
using LASSO and random forest to select features that likely affect translation. We identified
multiple transcript features and measured their influence on translation in different transcript
regions. We found that features related to different translation stages may have a different
impact on translation; often, features relevant to the elongation step were playing a stronger
role. Interestingly, we found that the contribution of features may be different for transcripts
belonging to different functional groups, suggesting that transcripts might employ different
mechanisms for the regulation of translation. This work was published in the proceedings of
Computational Advances in Bio and Medical Sciences (Qiwen, Merchante et al. 2015).
4.1 Introduction
Translation is one of the most important processes for living organisms. During
translation, mRNA (messenger RNA) is decoded by the ribosome into a polypeptide chain
that folds into a functional protein, and the amount of protein product generated via
translation of a transcript often correlates with its mRNA abundance (Lu, Vogel et al. 2007).
In the past two decades, high-throughput sequencing techniques have been developed to
measure the mRNA levels at a genome-wide scale. Due to difficulties in measuring protein
abundance directly, mRNA levels have been used frequently to estimate protein abundance
(Greenbaum, Colangelo et al. 2003). However, several studies have pointed out that mRNA
abundance is only moderately correlated with protein abundance because of various
post-transcriptional regulation mechanisms (Lichtinghagen, Musholt et al. 2002, Nie, Wu et al.
2006, Stark, Pfannenschmidt et al. 2006, Dickson, Mulligan et al. 2007, Lu, Vogel et al.
2007). This paper focuses on translational regulation, a major contributor to
post-transcriptional gene regulation. Many factors, such as mRNA level and tRNA pools, can
affect translation (Tuller, Waldman et al. 2010). In addition, recent studies indicate that
specific sequence features of transcripts affect translational rates significantly (Deana and
Belasco 2005, Kudla, Murray et al. 2009, Tuller, Waldman et al. 2010). Thus, a systematic
search for transcript features that affect translation will likely provide useful information to
improve our understanding of translational regulation.
Translation has three stages: initiation, elongation and termination. Different stages
during translation may contribute to the final protein abundance or protein synthesis
efficiency of a transcript. In eukaryotes, most analyses have focused on
translation initiation (Kozak 2001, Pestova and Kolupaeva 2002, Kim, Lee et al. 2014).
These analyses indicate that several features of the 5’-UTR, such as its secondary
structure, its length, the sequence context near the start codon and upstream open reading
frames, may affect the efficiency of translation initiation. Previous studies in Arabidopsis
based on microarray data found several features in the 5’- and 3’-UTR that may be relevant to
translation regulation (Kawaguchi and Bailey-Serres 2005). One recent study in Arabidopsis
indicates that the flanking region just upstream of the start codon plays an important role in
determining the translational efficiency of genes. The study found that over-representation of
A residues at positions -1 to -5 corresponds to high translational efficiency (Kim, Lee et al.
2014). However, it remains to be determined if additional transcript features located in the
5’-UTR or in other parts of plant transcripts contribute to translational regulation.
Genome-wide identification of features related to mRNA and protein abundance was
performed in organisms such as Escherichia coli, S. cerevisiae, and Desulfovibrio vulgaris
(Nie, Wu et al. 2006, Dvir, Velten et al. 2013, Guimaraes, Rocha et al. 2014). These studies
identified a large number of features within transcripts that may influence their respective
protein abundance. But in plants only a few large-scale studies have been performed to
directly measure protein abundance (Kawaguchi and Bailey-Serres 2005). This is partly
because methods for large-scale measurements of protein accumulation took longer to
develop. Only recently has it become possible to quantify translation at a whole-genome scale
with nucleotide precision using ribosome footprinting (Ingolia, Ghaemmaghami et al. 2009,
Vogel, Abreu Rde et al. 2010), an approach that allows comprehensive
analyses to identify significant feature sets in plant systems.
Linear regression has been applied to study the relationship between large feature sets
and translation (Nie, Wu et al. 2006, Nie, Wu et al. 2006). However, for high-dimensional
data, if only a few features influence translation, linear regression may lead
to predictions with large variance, which reduces prediction accuracy (Minjung
Kyung 2010). Also, linear regression does not provide information about the relevance of
features. The Least Absolute Shrinkage and Selection Operator (LASSO) is designed to
overcome the shortcomings of linear regression for high-dimensional data (Tibshirani 1996). LASSO
penalizes the absolute size of regression coefficients in a way that coefficients of features that
show little relationship with the response will shrink to 0. Therefore, the method provides an
efficient way to select potentially important features, especially for high-dimensional datasets,
and the resulting model often shows a lower predictive error as compared to general linear
regression. There are various other algorithms that can be used to do the same task. Among
them, random forest is an ensemble method that combines the results of multiple regression
trees (Breiman 2001). One study shows that ensemble approaches achieve better predictive
performance than individual algorithms (Opitz 1999). In addition, the random forest
approach can rank features according to their importance score, which provides a way to see
the most important features related to the response of interest.
In this study, we utilized ribosome footprinting (RF) data, together with the
corresponding mRNA sequencing data, to measure translation and employed LASSO and
random forest models to investigate various transcript features that may influence translation
of transcripts in Arabidopsis. We identified a series of transcript features and measured their
contribution to translation regulation during initiation, elongation and termination stages. The
results provide a systematic quantitative analysis of transcript-related features in Arabidopsis.
In addition, we constructed a robust model that can be used to predict the translation of
transcripts. Based on this model, we found that transcript features present in different regions
of the mRNA have a different impact on translation of the entire transcript, with features in
the coding sequence (CDS) region playing a stronger role. Also, we found that the
contribution of features may be different for transcripts belonging to different functional
groups, which suggests that transcripts with divergent functions may employ different
mechanisms for translation regulation.
4.2 Material and Methods
4.2.1 Data Sources and Preparation
Ribosome footprints and the corresponding mRNA measurements were obtained
from wild-type Arabidopsis. The gene annotations, transcript sequences and genome of
Arabidopsis were downloaded from The Arabidopsis Information Resource (TAIR, version
10, http://www.arabidopsis.org/). Only protein coding genes were used for our analysis.
Also, to remove ambiguity, genes with multiple isoforms were excluded, resulting in a total
of 21531 protein-coding genes in our dataset.
The ribosome footprints and mRNA reads were aligned to the genome of Arabidopsis
using TopHat (Trapnell, Pachter et al. 2009). Only aligned reads with a length between 25 and 40 bp were used in our analysis. Uniquely aligned reads were assigned to the untranslated regions
(UTRs) and coding sequences (CDSs) of each transcript using custom Perl scripts.
The translation of transcripts was quantified by two measures: ribosome rate
(RR) and translational efficiency (TE) (Ingolia, Ghaemmaghami et al. 2009). In ribosome
footprinting, the number of reads aligned to a transcript indicates the number of ribosomes
translating it, which is correlated with the current protein abundance. RR therefore
corresponds to the ribosome abundance of a transcript; it measures the current protein
synthesis level. TE measures the rate of mRNA translation into proteins; it indicates how fast
ribosomes translate a specific mRNA molecule. Given similar transcript and ribosome levels,
transcripts with high TE will generate more protein than transcripts with low TE during the
same amount of time. In this paper, we estimate the TE of a transcript from ribosome
footprinting data. The corresponding formulas are shown below:
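Let C be the number of reads aligned to a target region, T the total number of aligned reads,
and L the length of the target region; the subscripts RF and mRNA indicate ribosome
footprint reads and mRNA reads, respectively. Then,
\[
RR = \frac{10^{9} \times C_{RF}}{T_{RF} \times L}, \qquad
TE = \frac{C_{RF} \times T_{mRNA}}{T_{RF} \times C_{mRNA}}.
\]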
To remove the influence of outliers, transcripts with RR in the top 10% and bottom 10%
were removed from the analysis.
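For illustration, the two measures and the outlier filter can be written as the following minimal sketch. The function names, the read counts in the usage example and the simple rank-based trimming rule are assumptions made for this example; the exact way the 10% cut-offs were determined in our pipeline is not reproduced here.

```python
def ribosome_rate(c_rf, t_rf, length):
    """RR = 10^9 * C_RF / (T_RF * L)."""
    return 1e9 * c_rf / (t_rf * length)


def translation_efficiency(c_rf, t_rf, c_mrna, t_mrna):
    """TE = (C_RF * T_mRNA) / (T_RF * C_mRNA)."""
    return (c_rf * t_mrna) / (t_rf * c_mrna)


def trim_extremes(values, fraction=0.10):
    """Drop the lowest and highest `fraction` of values (rank-based; an assumption)."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    return ordered[k:len(ordered) - k]


# Hypothetical transcript: 200 footprint reads and 400 mRNA reads on a 1000 nt region,
# with 10 million aligned footprint reads and 20 million aligned mRNA reads in total.
rr = ribosome_rate(c_rf=200, t_rf=10_000_000, length=1000)
te = translation_efficiency(c_rf=200, t_rf=10_000_000, c_mrna=400, t_mrna=20_000_000)
kept = trim_extremes(list(range(10)))
print(rr, te, kept)
```

In this toy case the footprint and mRNA libraries give the same normalized density, so TE is 1; a TE above 1 would indicate more footprints than expected from the mRNA level.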
4.2.2 Feature Computation
A total of 477 transcript features were computed from three different regions of the
transcripts in Arabidopsis: 5’-UTR, CDS and 3'-UTR. We selected a comprehensive list of
features that were reported to influence the protein abundance or translational efficiency from
other studies (Kozak 1994, Kudla, Murray et al. 2009, Dvir, Velten et al. 2013, Guimaraes,
Rocha et al. 2014). In 5’-UTR, we computed features such as the number of start codons,
length, the mRNA secondary structure of 5’-UTR, GC content, nucleotide frequency and
initiation score which indicates the sequence context near translation start site. Similar
feature sets were computed from the 3’-UTR region. Features in the CDS region include the CAI
(Codon Adaptation Index), codon usage frequency, nucleotide frequency, and mRNA
secondary structure before and after the translation start site (~40 bp before and after the start
codon).
The CAI, as well as codon and nucleotide frequencies, were calculated using the
EMBOSS software (Rice, Longden et al. 2000). The mRNA secondary structure of a given
sequence was characterized by the minimum free energy computed with the ViennaRNA
package (Lorenz, Bernhart et al. 2011). The initiation scores were calculated using a profile
matrix model built from the 100 most highly expressed genes. The profile matrix covers six
nucleotides before and three nucleotides after the start codon. The detailed calculation is
described by Miyasaka (1999).
4.2.3 Model Construction and Feature Selection
A linear regression model was applied to investigate transcript features that were
associated with translation in previous studies [6]. We assume the ith observation in our data
is (x_i1, x_i2, …, x_ip, y_i), i = 1, …, n, where (x_i1, x_i2, …, x_ip) is the feature set of
transcript i and y_i is its corresponding translation measurement (either RR or TE). The
linear regression model is as follows:
\[
y_i = \alpha + \sum_{j=1}^{p} \beta_j x_{ij},
\]
where β_j is the coefficient of the jth feature, α is the intercept and p is the number of
features in the data.
In ordinary least squares, the coefficients β are estimated by minimizing the residual sum of
squares, which yields the estimate β̂^o. In our dataset, however, p is large (477), so ordinary
least squares may lead to a large variance of the estimates, which reduces the accuracy of
prediction. We therefore chose LASSO regression, which trades an increase in bias for a
reduction in variance to generate more accurate results (Tibshirani 1996).
LASSO minimizes the residual sum of squares while requiring that the sum of the
absolute values of the coefficients does not exceed a constant t, that is:
\[
\sum_{j=1}^{p} |\hat{\beta}^{L}_{j}| \le t .
\]
If $0 < t < \sum_{j=1}^{p} |\hat{\beta}^{o}_{j}|$, where $\hat{\beta}^{o}$ denotes the ordinary least
squares estimate, minimizing the residual sum of squares under this constraint is equivalent to
\[
\hat{\beta}^{L} = \arg\min \Big\{ \sum_{i=1}^{n} \Big( y_i - \alpha - \sum_{j} \beta_j x_{ij} \Big)^{2} + \lambda \sum_{j} |\beta_j| \Big\},
\]
where λ > 0. Due to this constraint, LASSO shrinks coefficients with little
correlation to the response to zero. In high-dimensional datasets this provides a very efficient
way to select features that have a linear relationship with the response.
We estimated the LASSO parameter λ by ten-fold cross validation. In each
cross-validation round, the coefficients were estimated for a grid of λ values, and the
corresponding prediction error was computed on the test set. The λ value that achieved the
minimum overall prediction error was selected.
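The penalized objective and the cross-validated choice of λ can be illustrated with the following sketch. It is not the implementation used in this study: it solves the objective by plain cyclic coordinate descent with soft-thresholding, assigns folds by index modulo k, and uses a small synthetic dataset. All function names (`soft_threshold`, `lasso_fit`, `cv_error`, `choose_lambda`) and all numbers are assumptions made for the example.

```python
def soft_threshold(rho, gamma):
    """Shrink rho toward zero by gamma; values within [-gamma, gamma] become 0."""
    if rho > gamma:
        return rho - gamma
    if rho < -gamma:
        return rho + gamma
    return 0.0


def predict(alpha, beta, row):
    return alpha + sum(b * x for b, x in zip(beta, row))


def lasso_fit(X, y, lam, n_iter=200):
    """Minimize sum_i (y_i - alpha - sum_j beta_j x_ij)^2 + lam * sum_j |beta_j|."""
    n, p = len(X), len(X[0])
    alpha, beta = sum(y) / n, [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = z = 0.0
            for row, target in zip(X, y):
                # Residual with feature j left out of the prediction.
                partial = target - predict(alpha, beta, row) + beta[j] * row[j]
                rho += row[j] * partial
                z += row[j] ** 2
            # Exact minimizer for coordinate j; the threshold is lam / 2
            # because the squared error term carries no factor 1/2.
            beta[j] = soft_threshold(rho, lam / 2.0) / z if z > 0 else 0.0
        alpha = sum(t - predict(0.0, beta, r) for r, t in zip(X, y)) / n
    return alpha, beta


def cv_error(X, y, lam, k):
    """Mean squared prediction error over k folds (fold = index modulo k)."""
    n, total = len(X), 0.0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        alpha, beta = lasso_fit([X[i] for i in train], [y[i] for i in train], lam)
        for i in range(fold, n, k):
            total += (y[i] - predict(alpha, beta, X[i])) ** 2
    return total / n


def choose_lambda(X, y, grid, k=10):
    """Return the grid value with the smallest cross-validated error."""
    return min(grid, key=lambda lam: cv_error(X, y, lam, k))


# Synthetic data: the response depends strongly on feature 0 and weakly on feature 1.
design = [[-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]]
X = design * 2
y = [1.0 + 2.0 * a + 0.1 * b for a, b in X]

alpha_hat, beta_hat = lasso_fit(X, y, lam=8.0)
print(beta_hat)  # the weak feature is shrunk to exactly zero
best = choose_lambda(X, y, grid=[0.0, 8.0, 40.0], k=4)
```

In this noise-free toy example the unpenalized fit predicts perfectly, so cross validation selects the smallest λ in the grid; with noisy, high-dimensional data a positive λ typically gives the lower prediction error.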
To get robust features and determine the rank of features, a different algorithm,
random forest, was used. Random forest is an ensemble approach that combines the results of
multiple regression trees generated from the dataset using bootstrap resampling. The
importance score of each variable is determined by the change in the out-of-bag error (OOB
error) after permuting the variable values in the tree, so the raw importance score of feature
X_j is calculated as:
\[
VI(X_j) = \frac{1}{ntree} \sum_{t} \left( \mathrm{errOOB}_{t}^{\,j} - \mathrm{errOOB}_{t} \right),
\]
where ntree is the number of regression trees, errOOB_t^j is the OOB error of tree t after
permuting the jth variable, and errOOB_t is the OOB error of tree t before permutation.
The importance score provides the rank information for each feature and can be used
as a guideline for feature selection.
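The importance score defined above can be sketched as follows. To keep the example short, the "trees" are hypothetical stand-ins: each is a fixed prediction function paired with the indices of its out-of-bag samples, rather than a regression tree grown on a bootstrap sample. The function names and the toy data are likewise assumptions made for this illustration.

```python
import random


def mse(predict, rows, targets):
    """Mean squared error of a prediction function on the given rows."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(trees, X, y, j, seed=0):
    """VI(X_j) = (1 / ntree) * sum_t (errOOB_t^j - errOOB_t)."""
    rng = random.Random(seed)
    total = 0.0
    for predict, oob in trees:
        rows = [list(X[i]) for i in oob]
        targets = [y[i] for i in oob]
        err_before = mse(predict, rows, targets)
        # Permute the values of feature j among the out-of-bag samples only.
        column = [r[j] for r in rows]
        rng.shuffle(column)
        permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        err_after = mse(predict, permuted, targets)
        total += err_after - err_before
    return total / len(trees)


# Toy data: the response depends on feature 0 only.
X = [[float(i), float((7 * i) % 5)] for i in range(10)]
y = [2.0 * row[0] for row in X]

# Stand-in "trees": both predict from feature 0 and ignore feature 1.
trees = [
    (lambda r: 2.0 * r[0], [0, 2, 4, 6, 8]),
    (lambda r: 2.0 * r[0], [1, 3, 5, 7, 9]),
]
vi_used = permutation_importance(trees, X, y, j=0)
vi_unused = permutation_importance(trees, X, y, j=1)
print(vi_used, vi_unused)
```

Permuting a feature that the model does not use leaves the predictions unchanged, so its importance is zero, whereas permuting an informative feature can only increase the error in this toy setting.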
Figure 4.1 Model construction algorithm. RF: random forest
We combined the result of LASSO and random forest to select robust features that are
related to the translation of transcripts and ranked the selected features according to the
importance score. To generate a computational model, we used a 10-fold cross validation
(CV) scheme to select features with a good predictive power. The dataset is randomly
partitioned into 10 subsets, nine of which were used for training and the held-out set was
used for testing. For each training dataset, two steps were used to select related features.
First, a correlation test was used to remove features that show no correlation with
translation; Benjamini-Hochberg correction was applied to control the false discovery rate at
5% (Hochberg 1995). Then, LASSO was performed to select a subset of features with good
predictive power. We selected features that appear in all 10 CV rounds as robust features. For
random forest, in each CV round, we built different models according to a grid of parameters
and computed the model performance based on the testing set. After 10-fold CVs, we chose
the best model with lowest error rate and used this model to compute the importance score
for each feature. Finally, we compared the model performance of the two approaches and
chose features that were identified by both approaches. The rank of the selected features was
determined by the importance score.
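Two steps of this selection scheme are easy to state in code: the Benjamini-Hochberg filter applied to the p-values of the correlation test, and the intersection of the feature sets selected in the individual CV rounds. The sketch below is illustrative only; the function names, the p-values and the feature names are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return the indices of the hypotheses rejected at the given FDR level."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose p-value is below rank * fdr / m.
        if p_values[i] <= rank * fdr / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


def robust_features(selected_per_fold):
    """Features that were selected in every cross-validation round."""
    return set.intersection(*(set(s) for s in selected_per_fold))


# Hypothetical p-values of the correlation test for five features.
kept = benjamini_hochberg([0.001, 0.045, 0.025, 0.2, 0.012], fdr=0.05)
print(kept)

# Hypothetical LASSO selections from three CV rounds.
robust = robust_features([{"len", "gc", "cai"}, {"len", "cai"}, {"len", "cai", "mfe"}])
print(sorted(robust))
```

Note that the Benjamini-Hochberg rule is a step-up procedure: a p-value above its own threshold is still retained if a larger p-value further down the ranking passes its threshold.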
The performance of the model was measured by R-squared, which measures the
closeness of the data to the regression line. It is defined as:
\[
R^{2} = 1 - \frac{SSE}{SST} = \frac{SSR}{SST},
\]
\[
SST = \sum_{i} (y_i - \bar{y})^{2}, \qquad
SSR = \sum_{i} (\hat{y}_i - \bar{y})^{2}, \qquad
SSE = \sum_{i} (y_i - \hat{y}_i)^{2},
\]
where y_i and ŷ_i are the observed and predicted values and ȳ is the mean of the
observed values. SST is the total sum of squares, SSR is the regression (explained) sum of
squares and SSE is the error (residual) sum of squares. An overview of the procedure is shown in Figure 4.1.
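As a small worked example of this definition, the sketch below computes R-squared from observed and predicted values using the form 1 - SSE/SST. The function name `r_squared` and the numbers are hypothetical. The alternative form SSR/SST gives the same value for a least-squares fit with intercept evaluated on its training data, but the two forms can differ for held-out predictions, which is why the sketch uses the first form.

```python
def r_squared(observed, predicted):
    """R^2 = 1 - SSE / SST for paired observed and predicted values."""
    mean_y = sum(observed) / len(observed)
    sst = sum((y - mean_y) ** 2 for y in observed)
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    return 1.0 - sse / sst


# Hypothetical observed and predicted translation measurements.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(observed, predicted)
print(round(r2, 2))
```

Predicting the mean of the observed values for every transcript gives an R-squared of 0, and a perfect prediction gives 1.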
4.3 Results
4.3.1 Model Performance
To capture transcript features directly related to translation, we generated and
analyzed a set of features and used linear regression with LASSO (LR) and random forest
(RF) to identify important features associated with the translation of transcripts (see
Material and Methods, sections 4.2.2 and 4.2.3).
Table 4.1 Performance of Models
The performance was measured by R-squared using 10-fold cross validation. LR: linear
regression with LASSO. RF: random forest.

                      Ribosome Rate (RR)      Translation Efficiency (TE)
                      LR         RF           LR         RF
5’-UTR                0.08       0.07         0.08       0.07
CDS                   0.26       0.29         0.20       0.24
3’-UTR                0.05       0.06         0.01       0.01
Whole transcript      0.30       0.32         0.29       0.29
We measured translation based on RR and TE (Ingolia, Ghaemmaghami et al. 2009).
Table 4.1 shows the performance of the corresponding models with respect to different
parts of the transcript. The performance is measured by R-squared, which gives the
proportion of the variance in the response that is explained by the model. Generally, LR
and RF show very similar performance and select similar sets of important features. The
model based on the 3’-UTR shows the lowest performance, and the performance of the model
based on the whole transcript is similar to that of the model based on the CDS. In total,
the model explains around 30% of the variation in translation as measured by RR and TE.
4.3.2 Transcript Features Identified Through 5'-UTR, CDS and 3'-UTR
We first attempted to identify the features in the three main parts of the transcript
(5’-UTR, CDS and 3’-UTR) separately. This analysis allowed us to investigate the contribution
of these parts. Also, it allowed us to find more features in these regions because a feature that
strongly affected the translation in other regions might mask the effect of weak features
elsewhere.
The important features in the 5’-UTR region related to RR include the prevalence of
some codons, the secondary structure of the 5’-UTR, and the length of the 5’-UTR. There is
a negative correlation between UTR length and RR, which implies that transcripts with
short UTRs tend to have higher RRs. We also found that the number of alternative start
codons in the 5’-UTR has a strong impact on RR, possibly because alternative start codons
in this region may interfere with ribosome scanning and thus make it harder for the
ribosome to reach and translate the main open reading frame. There were fewer features identified
according to TE, but some features affected TE in a similar manner as they influenced
RR, such as the secondary structure of the 5’-UTR and alternative start codons in the
5’-UTR region, which indicates that these features affect both RR and TE. Some important
features show a stronger correlation than partial correlation given the mRNA level. This
suggests that those features may not be independently correlated with RR or TE; instead,
they are correlated with the mRNA level (Figure 4.2, A, B).
Figure 4.2: Importance of features identified by LR and RF in the 5’-UTR (A, B),
CDS (C, D) and 3’-UTR (E, F) regions.
The features identified for the CDS region show a much stronger correlation for both
measurements. Specifically, several codon frequency features identified in this region
(which reflect codon usage) mainly affect the elongation step of translation. The codons
GTT, ATG and CTG were identified in both the RR and TE assessments. Among them, CTG shows
a positive correlation with both RR and TE, GTT shows a negative correlation with both, and
ATG shows a positive correlation with RR but a negative correlation with TE. Many features
identified for RR, such as length, total number of start codons, and the codon adaptation
index (CAI), show a much stronger correlation than partial correlation. This indicates
that these features are themselves correlated with the mRNA level (Figure 4.2, C, D).
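The comparison between correlation and partial correlation can be illustrated with a short Python sketch. It assumes Pearson correlation and the standard first-order partial correlation formula, which may differ from the exact procedure used in this study; the variable names and the toy values are hypothetical. The data are constructed so that the feature and RR are related only through the mRNA level.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def partial_correlation(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical toy values: both the feature and RR equal the mRNA level plus
# independent noise, so their association is entirely explained by mRNA.
mrna = [1.0, 2.0, 3.0, 4.0]
feature = [1.5, 1.5, 2.5, 4.5]
rr = [1.2, 1.4, 3.6, 3.8]

plain = pearson(feature, rr)
partial = partial_correlation(feature, rr, mrna)
```

In this example the plain correlation is about 0.85, while the partial correlation given the mRNA level is numerically zero, which is the pattern described above for features whose association with RR is carried by the mRNA level.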
Features identified for the 3’-UTR show the weakest correlation. The most important
features include the length of the 3’-UTR, mRNA secondary structure, the total number of
alternative start codons, and some nucleotide frequencies. Several of these, such as
length, mRNA secondary structure and the total number of alternative start codons, are
very similar to the features identified in the 5’-UTR region (Figure 4.2, E, F).
Overall, the performance of LR and RF was very similar for the models based on
features in the UTRs and CDSs of the transcripts. The LR and RF models based on features
in the UTRs show much lower performance than those based on the CDS: 0.08 (0.07 for RF)
for both RR and TE in the 5’-UTR, and 0.05 (0.06 for RF) for RR and 0.01 (0.01 for RF) for
TE in the 3’-UTR. The 3’-UTR shows the lowest performance.
4.3.3 Transcript Features Identified through the Whole Transcript
In the next step, we tried to explore the features based on the whole transcript to see if
there are some differences in the important features when combining the features from the
three regions of a transcript described above. This analysis allowed us to find the strongest
features that influence the translation of entire transcripts.
In total, 54 features were identified through RR and 64 features through TE. The top
10 features and the performance of the model are shown in Figure 4.3 and Table 4.1 (whole
transcript). Generally, most of the top features identified using the whole transcript
overlap with the top features identified through individual parts of the transcript:
features with a strong effect in an individual part tend to be identified using the whole
transcript as well. In addition, the features identified based on the whole transcript are
all located in the 5’-UTR and CDS regions. The performance for the whole transcript is very
close to the performance for the CDS, which suggests that features based exclusively on the
CDS predict translation almost as well as features from the whole transcript.
Figure 4.3: Important features identified by LR and RF based on whole
transcripts. A, C: top features identified through RR and TE. B, D: plots of predicted vs.
true values. Blue line: regression line for LR; red line: regression line for RF. r is the
correlation between predicted and true values.
The prominent features identified according to RR and TE include the length of the
5’-UTR and CDS regions, the number of start codons in the 5’-UTR region, presence or absence
of certain codons, and occurrence of some nucleotides in these two regions.
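To make these feature types concrete, the following Python sketch computes a handful of them for a single transcript. It is only an illustration: the function name `transcript_features`, the feature names, and the example sequences are hypothetical, and the actual feature set of this study (described in Material and Methods) is considerably larger.

```python
from collections import Counter


def transcript_features(utr5, cds):
    """Compute a few illustrative sequence features for one transcript."""
    # Split the CDS into complete in-frame codons.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    codon_counts = Counter(codons)
    features = {
        "utr5_length": len(utr5),
        "cds_length": len(cds),
        # Count ATG triplets anywhere in the 5'-UTR, in any reading frame.
        "utr5_atg_count": sum(
            1 for i in range(len(utr5) - 2) if utr5[i:i + 3] == "ATG"
        ),
        "utr5_gc_fraction": (utr5.count("G") + utr5.count("C")) / max(len(utr5), 1),
    }
    # Relative frequency of selected codons in the CDS.
    for codon in ("GTT", "CTG"):
        features["cds_freq_" + codon] = codon_counts[codon] / max(len(codons), 1)
    return features


# Hypothetical toy sequences, far shorter than real transcripts.
example = transcript_features(utr5="AAATGCCATGGA", cds="ATGGTTCTGGTTTAA")
```

For the toy input, the 5’-UTR has length 12 and contains two ATG triplets, and GTT accounts for two of the five codons in the CDS.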
4.3.4 The mRNA Contribution to Translation
Previous studies show that the mRNA-protein correlation explained most of the
variation in the model, which indicates that the mRNA levels have a large contribution to
translation (Nie, Wu et al. 2006, Zur and Tuller 2013, Guimaraes, Rocha et al. 2014).
Therefore, to assess the mRNA contribution to the two measurements in our dataset, we
investigated how well the measurements of translation can be predicted when mRNA levels
are considered. Table 4.2 summarizes the performance of the model when mRNA level
information is combined with the features from entire transcripts. For both measurements,
performance improved compared to the model without mRNA levels (Table 4.2). For RR, mRNA
abundance alone accounts for a large portion of the variability (~65%), and mRNA levels
together with transcript features explain around 74% of the total variability. For TE, the
improvement was small, although mRNA abundance alone still explained ~11% of the
variability. Transcript features were the main contributors to the variability of TE in our
model, but the full model explained only 33% of the total variability. A large proportion
of the variability in TE may therefore be determined at the post-transcriptional level or
be subject to other types of regulation.
Table 4.2 Proportion of variability of translation that can be explained by mRNA levels
and transcript features.
Performance was measured by R-squared; all p-values are less than 2.2e-16.

                                   Ribosome Rate (RR)    Translation Efficiency (TE)
mRNA abundance only                0.65                  0.11
transcript features only           0.30                  0.29
mRNA with transcript features      0.74                  0.33
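One simple way to read Table 4.2 is to ask how much R-squared each predictor group adds on top of the other. The Python sketch below performs this arithmetic with the values from the table; the incremental-gain interpretation is our illustration and was not part of the original analysis, and the names `table_4_2`, `gain_from_features` and `gain_from_mrna` are hypothetical.

```python
# R-squared values copied from Table 4.2.
table_4_2 = {
    "RR": {"mrna": 0.65, "features": 0.30, "both": 0.74},
    "TE": {"mrna": 0.11, "features": 0.29, "both": 0.33},
}


def gain_from_features(scores):
    """R-squared added by transcript features on top of mRNA abundance."""
    return round(scores["both"] - scores["mrna"], 2)


def gain_from_mrna(scores):
    """R-squared added by mRNA abundance on top of transcript features."""
    return round(scores["both"] - scores["features"], 2)


gains = {
    name: (gain_from_features(scores), gain_from_mrna(scores))
    for name, scores in table_4_2.items()
}
```

For RR, transcript features add 0.09 beyond mRNA abundance while mRNA abundance adds 0.44 beyond transcript features; for TE the corresponding gains are 0.22 and 0.04. Because the separate R-squared values sum to more than the combined value in both cases, part of the explained variability is shared between mRNA abundance and transcript features.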
Next, we explored if the contribution of transcript features remained the same for
transcripts belonging to different functional groups. To do this, we fitted the model using the
same approach for transcripts in different functional groups and analyzed how model
performance changed according to different functional categories.
Table 4.3 mRNA level and transcript feature contribution in different functional
categories
Contribution of features was measured by R-squared (all p-values were less than
2.2e-16). "mRNA abundance for RR" is the model using the mRNA feature only. RR and TE are
models incorporating only transcript features.
Gene category
transcription regulator activity
cell cycle
protein amino acid phosphorylation
catalytic activity
Binding
transferase activity
transferring phosphorus-containing groups
kinase activity
post-translational protein modification
protein metabolic process
all genes
mRNA abundance for RR
0.62
0.55
0.58
0.66
0.65
RR
0.16
0.21
0.29
0.30
0.31
TE
0.16
0.10
0.2
0.17
0.27
0.62
0.34
0.26
0.57
0.6
0.68
0.65
0.34
0.35
0.42
0.26
0.36
0.21
0.30
0.29
Table 4.3 shows the contribution of transcript features and mRNA levels to
translation in genes from different functional groups. We found that the relationship between
mRNA level and translation varied between groups. In addition, the contribution of transcript
features related to RR and TE changed among different functional groups. Transcripts
involved in protein metabolic processes showed the highest R-square values for both RR and
TE. The model could explain 42% of the variability of RR and 30% of the variability of TE
based on transcript features. However, the relative contributions of mRNA levels and
transcript features differed in certain functional groups. For example, in the kinase
activity group, mRNA levels explained only 57% of the variability of translation, whereas
the contribution of transcript features was high for both RR (34%) and TE (36%), which
indicates that transcript features may play a more important role in certain functional
groups.
4.4 Conclusion
In this study, we explored the relationship between transcript features and translation
in a plant system. We analyzed a total of 477 transcript features and built models to identify
features that may affect RR and TE. We showed that transcript features from different
transcript regions may have a different impact on translation. In particular, features in the
CDS region play a stronger role. Our model used around 50 transcript features to estimate
translation. It explains ~30% of the variability in translation (RR and TE). This is
comparable with the translation models of Desulfovibrio vulgaris (Nie, Wu et al. 2006, Nie,
Wu et al. 2006). When mRNA abundance is incorporated, the model performance for RR
improves from 0.30 to 0.74, whereas the improvement for TE (from 0.29 to 0.33) is much smaller.
We found that mRNA abundance and sequence features contribute differently to
translation models of transcripts belonging to specific functional groups. This suggests that
transcripts in different functional categories may employ different mechanisms of translation
regulation.
In conclusion, in this study we reported lists of important features that may influence
protein synthesis and explored their contributions to translation in Arabidopsis. We believe
these features can provide genome-wide information for understanding the mechanisms
underlying regulation of translation in plants.
CHAPTER 5 CONCLUSIONS AND FUTURE WORK
This dissertation focuses on the genome-wide identification of elements that affect
translation in Arabidopsis thaliana. Machine learning approaches were used to address this
problem.
In chapter 3, we describe a semi-supervised approach to identify translated uORFs
that uses Ribo-seq and RNA-seq data. It is believed that translated uORFs play an important
role in the regulation of translation, but the translation status of most uORFs is still unclear.
Several computational approaches for finding conserved uORFs were proposed previously
(Cvijovic, Dalevi et al. 2007, Takahashi, Takahashi et al. 2012). However, the reliable
detection of sequence conservation in small uORFs is a challenging task and only a few
conserved uORFs have been found. More importantly, evolutionary conservation does not
necessarily imply translation and vice versa.
Our approach uses the newly developed ribosome footprinting technique in
combination with additional genomic features. Although a certain percentage of ribosomal
footprints might correspond to scanning ribosomes, RNA secondary structure, or other
mRNA-protein complexes, it is believed that the majority of the ribosomal footprints
represent translating ribosomes. This gives us an unprecedented opportunity to probe the
translation status of uORFs directly on a genome-wide scale. Due to the small number of
experimentally validated translated uORFs and the high noise rate in our data, a simple
supervised classification approach for identifying translated uORFs is difficult and may
result in unreliable predictions. Therefore, we applied semi-supervised learning to
capitalize on the footprint data that are available for uORFs that have not been
experimentally characterized and validated. Based on the observation that the distribution
of ribosomal footprints follows characteristic patterns during the different stages of
translation, we expect that incorporating data points that follow these distribution
patterns and cluster with the experimentally validated uORFs will help to learn the true
characteristics of translated uORFs and will likely improve our results. Finally, to reduce
the variability of individual classification approaches, we designed a stacking-based
model. Our approach combines the predictions of five base-level classifiers by means of a
meta-level classifier. In comparison to the individual base-level classifiers, this
approach yields more reliable results and improves predictive accuracy.
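The idea of stacking can be sketched as follows. This is a deliberately small illustration and not the model from chapter 3: it uses three fixed threshold rules as stand-ins for trained base-level classifiers (instead of five), the feature names and thresholds are hypothetical, and the meta-level classifier is assumed to be a simple lookup table that learns the majority label for each pattern of base-level predictions on held-out labeled data.

```python
from collections import Counter, defaultdict

# Hypothetical base-level classifiers: each maps a uORF feature dict to 0 or 1.
BASE_CLASSIFIERS = [
    lambda u: int(u["footprint_density"] > 0.5),
    lambda u: int(u["kozak_score"] > 0.6),
    lambda u: int(u["frame_periodicity"] > 0.4),
]


def base_predictions(uorf):
    """Predictions of all base-level classifiers for one uORF."""
    return tuple(clf(uorf) for clf in BASE_CLASSIFIERS)


def train_meta(examples, labels):
    """Meta-level classifier: majority label per pattern of base predictions."""
    votes = defaultdict(Counter)
    for uorf, label in zip(examples, labels):
        votes[base_predictions(uorf)][label] += 1
    return {pattern: c.most_common(1)[0][0] for pattern, c in votes.items()}


def predict(meta, uorf):
    """Classify a uORF with the meta-level classifier."""
    pattern = base_predictions(uorf)
    if pattern in meta:
        return meta[pattern]
    # Unseen pattern: fall back to a majority vote of the base classifiers.
    return int(sum(pattern) * 2 > len(pattern))


def make_uorf(density, kozak, periodicity):
    return {"footprint_density": density, "kozak_score": kozak,
            "frame_periodicity": periodicity}


# Hypothetical held-out labeled uORFs (1 = translated, 0 = not translated).
held_out = [make_uorf(0.9, 0.8, 0.7), make_uorf(0.8, 0.2, 0.6),
            make_uorf(0.1, 0.9, 0.1), make_uorf(0.2, 0.7, 0.3),
            make_uorf(0.7, 0.9, 0.2)]
labels = [1, 1, 0, 0, 0]
meta = train_meta(held_out, labels)

seen_pattern = predict(meta, make_uorf(0.6, 0.95, 0.1))   # pattern (1, 1, 0)
unseen_pattern = predict(meta, make_uorf(0.3, 0.9, 0.9))  # pattern (0, 1, 1)
```

In this toy example the meta-level classifier learns that the pattern (1, 1, 0) corresponds to untranslated uORFs, so it overrides what a plain majority vote of the base-level classifiers would predict. This is the kind of correction that a meta-level classifier can provide over the individual base-level classifiers.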
Using this approach, we identified 5360 translated uORFs in 2051 genes. Likely,
these results are only a lower bound of the true number of translated uORFs. For example, if
a gene is not expressed under the experimental conditions under which the ribosome
footprint data was generated, our method will not be able to predict the translation status of
uORFs located in this gene. Also, our method currently takes only uORFs with a canonical
AUG start codon into account. However, it has been shown in mouse that translation might
start at various non-canonical start codons (Ingolia, Lareau et al. 2011). It is unknown to
what extent this observation applies to Arabidopsis thaliana as well. Further analysis of the
inferred translated uORFs leads to the following results. Firstly, not all uORFs are
translated. Among the uORFs that occur in transcribed genes, only 6% are translated. uORF
sequences appear in around 50% of genes in Arabidopsis, but only 6% of genes contain
translated uORFs, which indicates that most uORFs are not actually translated. In addition,
the translated uORFs are not distributed evenly over all genes. Although the majority
(~60%) of genes that contain translated uORFs have only one translated uORF in their
5’-UTR region, ~40% contain multiple translated uORFs, and very few genes have more than
6 translated uORFs. Secondly, genes that contain translated uORFs are more likely to have
regulatory functions. GO analysis shows that translated uORFs are enriched in functional
gene groups such as catalytic activity, transferase activity, phosphotransferase activity,
and transcription regulator activity. Finally, uORFs occur with a higher frequency in
multi-isoform genes than in single-isoform genes.
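Enrichment of a functional category is commonly assessed with a hypergeometric (or equivalently a one-sided Fisher's exact) test. The following Python sketch shows the calculation under that assumption; it does not reproduce the GO analysis of this dissertation, the function name `enrichment_pvalue` is hypothetical, and the counts are made up.

```python
from math import comb


def enrichment_pvalue(population, annotated, sample, overlap):
    """One-sided hypergeometric p-value P(X >= overlap).

    population: number of genes in the background set
    annotated:  number of background genes in the functional category
    sample:     number of genes in the list of interest
    overlap:    number of genes of interest in the functional category
    """
    total = comb(population, sample)
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Made-up counts: 10 background genes, 4 in the category, 3 genes of interest,
# 2 of which fall into the category.
p_value = enrichment_pvalue(population=10, annotated=4, sample=3, overlap=2)
```

With these toy counts the p-value is 40/120, or about 0.33, so the category would not be called enriched. In a genome-wide analysis many categories are tested, and a multiple testing correction is applied to the resulting p-values.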
Our results are in concordance with previous studies that found that the translation of
uORFs plays an important role in gene regulation (Morris and Geballe 2000, Calvo,
Pagliarini et al. 2009, Jeon and Kim 2010). Several experimental studies have observed that
the sequence context near the start codon may affect the translation status of an uORF
(Lukaszewicz, Feuermann et al. 2000, Kozak 2002). However, only a few sequence features
have been identified, and no systematic search for such features has been performed. In
this dissertation, we describe a comprehensive comparison between transcript sequence
features and the translation status of uORFs. We generated a large set of sequence features
for each translated and untranslated uORF region and applied association rule mining to
search for features associated with the translation status of uORFs. We found several such
features. For example, translated uORFs tend to occur in genes with long coding sequences,
long 5’-UTRs, and long 3’-UTRs. Genes with translated uORFs harbor a smaller number of
uORFs in their 5’-UTR region than genes that only contain untranslated uORFs. Often, a
strong Kozak signal occurs in translated uORFs. Peptides encoded by translated uORFs
exhibit high codon bias. Some of these features
correlate with each other. Together, these results suggest that uORF translation is a complex
process that may be regulated by multiple factors. Hopefully, our results will contribute to a
better understanding of uORF translation and its regulation.
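The strength of an association rule such as "strong Kozak signal implies translated" is usually summarized by its support, confidence and lift. The Python sketch below computes these three measures; the item names and the eight toy uORFs are hypothetical and do not correspond to the data analyzed in this dissertation.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    n_ante = sum(1 for t in transactions if antecedent <= t)
    n_cons = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / n              # fraction of uORFs matching the whole rule
    confidence = n_both / n_ante      # fraction of antecedent matches that hold
    lift = confidence / (n_cons / n)  # confidence relative to the base rate
    return support, confidence, lift


# Each uORF is represented by the set of (hypothetical) items that apply to it.
uorfs = [
    {"strong_kozak", "long_cds", "translated"},
    {"strong_kozak", "translated"},
    {"strong_kozak", "long_cds", "translated"},
    {"strong_kozak"},
    {"long_cds"},
    {"long_cds", "translated"},
    set(),
    {"long_cds"},
]

support, confidence, lift = rule_metrics(uorfs, {"strong_kozak"}, {"translated"})
```

In this toy data set the rule has a support of 0.375 and a confidence of 0.75; since half of all uORFs are translated, the lift is 1.5, meaning that a strong Kozak signal makes translation 1.5 times more likely than the base rate. The sketch assumes that the antecedent occurs at least once; otherwise the confidence is undefined.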
A major shortcoming of our approach is that ~30% of genes are not transcribed in our
experiment, and hence we cannot infer the translation status of uORFs in these genes.
Because we found several sequence features that are associated with the translation status
of uORFs, we plan to develop a classifier that identifies translated uORFs based on
sequence features only. Based on preliminary experiments (data not shown), an associative
classification mining approach appears suitable for this problem. Associative
classification mining is a technique that uses association rules to construct
classification systems (Thabtah 2007). Several studies have shown that associative
classification mining algorithms are able to discover unknown patterns in datasets and to
build high-quality classifiers (Wenmin, Jiawei et al. 2001, Thabtah, Cowling et al. 2004,
Thabtah 2007). The detailed algorithm for associative classification can be found in
(Bing Liu 1998). Using this approach, we hope to identify more translated uORFs,
especially overlapping uORFs and uORFs located in genes that are not translated or
transcribed in our experiment.
In chapter 4, we describe a systematic search for features that affect gene (CDS)
translation levels in Arabidopsis thaliana using ribosome footprinting and RNA-seq data.
Prior genome-wide analyses have focused on animals and bacteria, and very few studies have
been performed in plants. Several features in the 5’-UTR region that correlate with general
translation levels have been reported (Kozak 2001, Pestova and Kolupaeva 2002, Kim, Lee et
al. 2014). In our analysis, we use LASSO and a random forest to select features that affect
translation. Subsequently, we evaluate the relative contribution of each individual feature. In
general, the selected features in our analysis are consistent with the features identified in
other studies. The most important features that impact translation are:
1) length of 5’-UTR, secondary structure in 5’-UTR, and alternative start codons in 5’-UTR
region
2) length of CDS region, and alternative start codons in CDS region
3) length of 3’-UTR region, and secondary structure in 3’-UTR region.
In general, features in the CDS region play a stronger role than features in other
transcript regions. A major difference between Arabidopsis and other organisms is codon
usage. Different codons are associated with different stages of translation. For example,
during elongation, the codons GTT, TAA, and CTG are strongly associated with translational
regulation in Arabidopsis, while in yeast, the most important codons are ATA and CTT. In
addition, we found that mRNA abundance and several transcript features, such as codon
usage and alternative start codons, affect the translation of genes that belong to
different functional groups differently. This suggests that genes in different functional
categories may use different mechanisms of translational regulation.
In summary, we have developed various novel machine learning approaches to find
features that affect translation in Arabidopsis thaliana. As a result, we have reported lists of
translated uORFs and other features that may influence translation and protein synthesis, and
we have explored the relative contributions of these features to translation. We believe that
these data provide valuable candidates for experimental validation, and that they will
contribute to a better understanding of the mechanisms that underlie the regulation of
translation in plants.
REFERENCES
Agrawal, R., T. Imieliński and A. Swami (1993). "Mining association rules between
sets of items in large databases." SIGMOD Rec. 22(2): 207-216.
Agrawal, R. and R. Srikant (1994). Fast Algorithms for Mining Association Rules in Large
Databases. Proceedings of the 20th International Conference on Very Large Data Bases,
Morgan Kaufmann Publishers Inc.: 487-499.
Alatorre-Cobos, F., A. Cruz-Ramirez, C. A. Hayden, C. A. Perez-Torres, A. L. Chauvin, E.
Ibarra-Laclette, E. Alva-Cortes, R. A. Jorgensen and L. Herrera-Estrella (2012).
"Translational regulation of Arabidopsis XIPOTL1 is modulated by phosphocholine levels
via the phylogenetically conserved upstream open reading frame 30." J Exp Bot 63(14):
5203-5221.
Alvarez, S. A. (2003). "Chi-Squared Computation for Association Rules: Preliminary
Results." Technical Report BC-CS-2003-01, Computer Science Department, Boston College.
Andrews, S. J. and J. A. Rothnagel (2014). "Emerging evidence for functional peptides
encoded by short open reading frames." Nat Rev Genet 15(3): 193-204.
Arava, Y., Y. Wang, J. D. Storey, C. L. Liu, P. O. Brown and D. Herschlag (2003).
"Genome-wide analysis of mRNA translation profiles in Saccharomyces cerevisiae." Proc
Natl Acad Sci U S A 100(7): 3889-3894.
Babinger, K., A. Hallmann and R. Schmitt (2006). "Translational control of regA, a key gene
controlling cell differentiation in Volvox carteri." Development 133(20): 4045-4051.
Barbosa, C., I. Peixeiro and L. Romao (2013). "Gene expression regulation by upstream open
reading frames and human disease." PLoS Genet 9(8): e1003529.
Boman, H. G. (2003). "Antibacterial peptides: basic facts and emerging concepts." J Intern
Med 254(3): 197-215.
Boycheva, S., G. Chkodrov and I. Ivanov (2003). "Codon pairs in the genome of Escherichia
coli." Bioinformatics 19(8): 987-998.
Brar, G. A., M. Yassour, N. Friedman, A. Regev, N. T. Ingolia and J. S. Weissman (2012).
"High-resolution view of the yeast meiotic program revealed by ribosome profiling." Science
335(6068): 552-557.
Breiman, L. (2001). "Random Forests." Machine Learning 45(1): 5-32.
Calvo, S. E., D. J. Pagliarini and V. K. Mootha (2009). "Upstream open reading frames cause
widespread reduction of protein expression and are polymorphic among humans." Proc Natl
Acad Sci U S A 106(18): 7507-7512.
Cavener, D. R. (1987). "Comparison of the consensus sequence flanking translational start
sites in Drosophila and vertebrates." Nucleic Acids Res 15(4): 1353-1361.
Chew, G. L., A. Pauli, J. L. Rinn, A. Regev, A. F. Schier and E. Valen (2013). "Ribosome
profiling reveals resemblance between long non-coding RNAs and 5' leaders of coding
RNAs." Development 140(13): 2828-2834.
Clamp, M., B. Fry, M. Kamal, X. Xie, J. Cuff, M. F. Lin, M. Kellis, K. Lindblad-Toh and E.
S. Lander (2007). "Distinguishing protein-coding and noncoding genes in the human
genome." Proc Natl Acad Sci U S A 104(49): 19428-19433.
Coleman, J. R., D. Papamichail, S. Skiena, B. Futcher, E. Wimmer and S. Mueller (2008).
"Virus attenuation by genome-scale changes in codon pair bias." Science 320(5884):
1784-1787.
Cortes, C. and V. Vapnik (1995). "Support-vector networks." Machine Learning 20(3): 273-297.
Crammer, K. and Y. Singer (2001). "On the Algorithmic Implementation of Multiclass
Kernel-based Vector Machines." J. of Machine Learning Research 2: 265-292.
Cvijovic, M., D. Dalevi, E. Bilsland, G. J. Kemp and P. Sunnerhagen (2007). "Identification
of putative regulatory upstream ORFs in the yeast genome using heuristics and evolutionary
conservation." BMC Bioinformatics 8: 295.
Dara, R., S. C. Kremer and D. A. Stacey (2002). Clustering unlabeled data with SOMs
improves classification of labeled real-world data. Neural Networks, 2002. IJCNN '02.
Proceedings of the 2002 International Joint Conference on.
David-Assael, O., H. Saul, V. Saul, T. Mizrachy-Dagri, I. Berezin, E. Brook and O. Shaul
(2005). "Expression of AtMHX, an Arabidopsis vacuolar metal transporter, is repressed by
the 5' untranslated region of its gene." J Exp Bot 56(413): 1039-1047.
De Angioletti, M., G. Lacerra, V. Sabato and C. Carestia (2004). "Beta+45 G --> C: a novel
silent beta-thalassaemia mutation, the first in the Kozak sequence." Br J Haematol 124(2):
224-231.
Deana, A. and J. G. Belasco (2005). "Lost in translation: the influence of ribosomes on
bacterial mRNA decay." Genes Dev 19(21): 2526-2533.
Denoeud, F., J. M. Aury, C. Da Silva, B. Noel, O. Rogier, M. Delledonne, M. Morgante, G.
Valle, P. Wincker, C. Scarpelli, O. Jaillon and F. Artiguenave (2008). "Annotating genomes
with massive-scale RNA sequencing." Genome Biol 9(12): R175.
Dickson, B. C., A. M. Mulligan, H. Zhang, G. Lockwood, F. P. O'Malley, S. E. Egan and M.
Reedijk (2007). "High-level JAG1 mRNA and protein predict poor outcome in breast
cancer." Mod Pathol 20(6): 685-693.
dos Reis, M., R. Savva and L. Wernisch (2004). "Solving the riddle of codon usage
preferences: a test for translational selection." Nucleic Acids Res 32(17): 5036-5044.
Drummond, D. A. and C. O. Wilke (2008). "Mistranslation-induced protein misfolding as a
dominant constraint on coding-sequence evolution." Cell 134(2): 341-352.
Du, Z., X. Zhou, Y. Ling, Z. Zhang and Z. Su (2010). "agriGO: a GO analysis toolkit for the
agricultural community." Nucleic Acids Res 38(Web Server issue): W64-70.
Duan, K. B. and S. S. Keerthi (2005). "Which Is the Best Multiclass SVM Method? An
Empirical Study." Multiple Classifier Systems (3541): 278-285.
Dunn, J. G., C. K. Foo, N. G. Belletier, E. R. Gavis and J. S. Weissman (2013). "Ribosome
profiling reveals pervasive and regulated stop codon readthrough in Drosophila
melanogaster." Elife 2: e01179.
Duret, L. (2000). "tRNA gene number and codon usage in the C. elegans genome are
co-adapted for optimal translation of highly expressed genes." Trends Genet 16(7): 287-289.
Duret, L. and D. Mouchiroud (1999). "Expression pattern and, surprisingly, gene length
shape codon usage in Caenorhabditis, Drosophila, and Arabidopsis." Proc Natl Acad Sci U S
A 96(8): 4482-4487.
Dvir, S., L. Velten, E. Sharon, D. Zeevi, L. B. Carey, A. Weinberger and E. Segal (2013).
"Deciphering the rules by which 5'-UTR sequences affect protein expression in yeast." Proc
Natl Acad Sci U S A 110(30): E2792-2801.
Ebina, I., M. Takemoto-Tsutsumi, S. Watanabe, H. Koyama, Y. Endo, K. Kimata, T.
Igarashi, K. Murakami, R. Kudo, A. Ohsumi, A. L. Noh, H. Takahashi, S. Naito and H.
Onouchi (2015). "Identification of novel Arabidopsis thaliana upstream open reading frames
that control expression of the main coding sequences in a peptide sequence-dependent
manner." Nucleic Acids Res 43(3): 1562-1576.
Fawcett, T. (2006). "An introduction to ROC analysis." Pattern Recogn. Lett. 27(8): 861-874.
Fritsch, C., A. Herrmann, M. Nothnagel, K. Szafranski, K. Huse, F. Schumann, S. Schreiber,
M. Platzer, M. Krawczak, J. Hampe and M. Brosch (2012). "Genome-wide search for novel
human uORFs and N-terminal protein extensions using ribosomal footprinting." Genome Res
22(11): 2208-2218.
Garber, M., M. G. Grabherr, M. Guttman and C. Trapnell (2011). "Computational methods
for transcriptome annotation and quantification using RNA-seq." Nat Methods 8(6): 469-477.
Gingold, H. and Y. Pilpel (2011). "Determinants of translation efficiency and accuracy." Mol
Syst Biol 7: 481.
Greenbaum, D., C. Colangelo, K. Williams and M. Gerstein (2003). "Comparing protein
abundance and mRNA expression levels on a genomic scale." Genome Biol 4(9): 117.
Gu, W., T. Zhou and C. O. Wilke (2010). "A universal trend of reduced mRNA stability near
the translation-initiation site in prokaryotes and eukaryotes." PLoS Comput Biol 6(2):
e1000664.
Guimaraes, J. C., M. Rocha and A. P. Arkin (2014). "Transcript level and sequence
determinants of protein abundance and noise in Escherichia coli." Nucleic Acids Res 42(8):
4791-4799.
Guo, H., N. T. Ingolia, J. S. Weissman and D. P. Bartel (2010). "Mammalian microRNAs
predominantly act to decrease target mRNA levels." Nature 466(7308): 835-840.
Guruprasad, K., B. V. Reddy and M. W. Pandit (1990). "Correlation between stability of a
protein and its dipeptide composition: a novel approach for predicting in vivo stability of a
protein from its primary sequence." Protein Eng 4(2): 155-161.
Hanfrey, C., M. Franceschetti, M. J. Mayer, C. Illingworth and A. J. Michael (2002).
"Abrogation of upstream open reading frame-mediated translational control of a plant
S-adenosylmethionine decarboxylase results in polyamine disruption and growth
perturbations." J Biol Chem 277(46): 44131-44139.
Hayden, C. A.-M. (2006). Post-transcriptional gene regulation in plants. Doctor of
Philosophy, University of Arizona.
Hayden, C. A. and R. A. Jorgensen (2007). "Identification of novel conserved peptide uORF
homology groups in Arabidopsis and rice reveals ancient eukaryotic origin of select groups
and preferential association with transcription factor-encoding genes." BMC Biol 5: 32.
Hendrickson, D. G., D. J. Hogan, H. L. McCullough, J. W. Myers, D. Herschlag, J. E. Ferrell
and P. O. Brown (2009). "Concordant regulation of translation and mRNA abundance for
hundreds of targets of a human microRNA." PLoS Biol 7(11): e1000238.
Hochberg, B. (1995). "Controlling the false discovery rate: A practical and powerful
approach to multiple testing." Journal of the Royal Statistical Society, Series B 57: 800-803.
Howard, B. E., Q. Hu, A. C. Babaoglu, M. Chandra, M. Borghi, X. Tan, L. He, H.
Winter-Sederoff, W. Gassmann, P. Veronese and S. Heber (2013). "High-Throughput RNA
Sequencing of Pseudomonas-Infected Arabidopsis Reveals Hidden Transcriptome
Complexity and Novel Splice Variants." PLoS ONE 8(10): e74183.
Hu, Q., C. Merchante, A. N. Stepanova, J. M. Alonso and S. Heber (2015). A
Stacking-Based Approach to Identify Translated Upstream Open Reading Frames in Arabidopsis
Thaliana. Bioinformatics Research and Applications: 11th International Symposium, ISBRA
2015 Norfolk, USA, June 7-10, 2015 Proceedings. R. Harrison, Y. Li and I. Măndoiu. Cham,
Springer International Publishing: 138-149.
Hu, Q., C. Merchante, A. N. Stepanova, J. M. Alonso and S. Heber (2016). "Genome-Wide
Search for Translated Upstream Open Reading Frames in Arabidopsis Thaliana." IEEE
Transactions on NanoBioscience 15(2): 148-157.
Hua-Jun, Z., W. Xuan-Hui, C. Zheng, L. Hongjun and M. Wei-Ying (2003). CBC: clustering
based text classification requiring minimal labeled data. Data Mining, 2003. ICDM 2003.
Third IEEE International Conference on.
Johnstone, I. M. and D. M. Titterington (2009). "Statistical challenges of high-dimensional
data." Phil Trans R Soc A 367: 4237-4253.
Ikemura, T. (1985). "Codon usage and tRNA content in unicellular and multicellular
organisms." Mol Biol Evol 2(1): 13-34.
Imai, A., Y. Hanzawa, M. Komura, K. T. Yamamoto, Y. Komeda and T. Takahashi (2006).
"The dwarf phenotype of the Arabidopsis acl5 mutant is suppressed by a mutation in an
upstream ORF of a bHLH gene." Development 133(18): 3575-3585.
Ingolia, N. T. (2010). "Genome-wide translational profiling by ribosome footprinting."
Methods Enzymol 470: 119-142.
Ingolia, N. T., G. A. Brar, S. Rouskin, A. M. McGeachy and J. S. Weissman (2012). "The
ribosome profiling strategy for monitoring translation in vivo by deep sequencing of
ribosome-protected mRNA fragments." Nat Protoc 7(8): 1534-1550.
Ingolia, N. T., S. Ghaemmaghami, J. R. Newman and J. S. Weissman (2009). "Genome-wide
analysis in vivo of translation with nucleotide resolution using ribosome profiling." Science
324(5924): 218-223.
Ingolia, N. T., L. F. Lareau and J. S. Weissman (2011). "Ribosome profiling of mouse
embryonic stem cells reveals the complexity and dynamics of mammalian proteomes." Cell
147(4): 789-802.
Ivanov, I. P., J. F. Atkins and A. J. Michael (2010). "A profusion of upstream open reading
frame mechanisms in polyamine-responsive translational regulation." Nucleic Acids Res
38(2): 353-359.
Jackson, R. J., C. U. Hellen and T. V. Pestova (2010). "The mechanism of eukaryotic
translation initiation and principles of its regulation." Nat Rev Mol Cell Biol 11(2): 113-127.
Jeon, S. and J. Kim (2010). "Upstream open reading frames regulate the cell cycle-dependent
expression of the RNA helicase Rok1 in Saccharomyces cerevisiae." FEBS Lett 584(22):
4593-4598.
Johannes, G., M. S. Carter, M. B. Eisen, P. O. Brown and P. Sarnow (1999). "Identification
of eukaryotic mRNAs that are translated at reduced cap binding complex eIF4F
concentrations using a cDNA microarray." Proc Natl Acad Sci U S A 96(23): 13118-13123.
Joshi, C., H. Zhou, X. Huang and V. Chiang (1997). "Context sequences of translation
initiation codon in plants." Plant Molecular Biology 35(6): 993-1001.
Juntawong, P., T. Girke, J. Bazin and J. Bailey-Serres (2014). "Translational dynamics
revealed by genome-wide profiling of ribosome footprints in Arabidopsis." Proc Natl Acad
Sci U S A 111(1): E203-212.
Kawaguchi, R. and J. Bailey-Serres (2005). "mRNA sequence features that contribute to
translational regulation in Arabidopsis." Nucleic Acids Res 33(3): 955-965.
Kim, Y., G. Lee, E. Jeon, E. J. Sohn, Y. Lee, H. Kang, D. W. Lee, D. H. Kim and I. Hwang
(2014). "The immediate upstream region of the 5'-UTR from the AUG start codon has a
pronounced effect on the translational efficiency in Arabidopsis thaliana." Nucleic Acids Res
42(1): 485-498.
Kozak, M. (1984). "Point mutations close to the AUG initiator codon affect the efficiency of
translation of rat preproinsulin in vivo." Nature 308(5956): 241-246.
Kozak, M. (1991). "A short leader sequence impairs the fidelity of initiation by eukaryotic
ribosomes." Gene Expr 1(2): 111-115.
Kozak, M. (1994). "Determinants of translational fidelity and efficiency in vertebrate
mRNAs." Biochimie 76(9): 815-821.
Kozak, M. (1999). "Initiation of translation in prokaryotes and eukaryotes." Gene 234(2):
187-208.
Kozak, M. (2001). "Constraints on reinitiation of translation in mammals." Nucleic Acids
Res 29(24): 5226-5232.
Kozak, M. (2002). "Pushing the limits of the scanning mechanism for initiation of
translation." Gene 299(1-2): 1-34.
Kudla, G., A. W. Murray, D. Tollervey and J. B. Plotkin (2009). "Coding-sequence
determinants of gene expression in Escherichia coli." Science 324(5924): 255-258.
Kwon, H. S., D. K. Lee, J. J. Lee, H. J. Edenberg, Y. H. Ahn and M. W. Hur (2001).
"Posttranscriptional regulation of human ADH5/FDH and Myf6 gene expression by upstream
AUG codons." Arch Biochem Biophys 386(2): 163-171.
Lavner, Y. and D. Kotlar (2005). "Codon bias as a factor in regulating expression via
translation rate in the human genome." Gene 345(1): 127-138.
Lee, Y., Y. Lin and G. Wahba (2001). "Multicategory Support Vector Machines."
Computing Science and Statistics 33.
Li, G. W., E. Oh and J. S. Weissman (2012). "The anti-Shine-Dalgarno sequence drives
translational pausing and codon choice in bacteria." Nature 484(7395): 538-541.
Lichtinghagen, R., P. B. Musholt, M. Lein, A. Romer, B. Rudolph, G. Kristiansen, S.
Hauptmann, D. Schnorr, S. A. Loening and K. Jung (2002). "Different mRNA and protein
expression of matrix metalloproteinases 2 and 9 and tissue inhibitor of metalloproteinases 1
in benign and malignant prostate tissue." Eur Urol 42(4): 398-406.
Longstaff, B., S. Reddy and D. Estrin (2010). Improving activity classification for health
applications on mobile devices using active and semi-supervised learning. Pervasive
Computing Technologies for Healthcare (PervasiveHealth), 2010 4th International
Conference on.
Lopez-Lastra, M., A. Rivas and M. I. Barria (2005). "Protein synthesis in eukaryotes: the
growing biological relevance of cap-independent translation initiation." Biol Res 38(2-3):
121-146.
Lorenz, R., S. H. Bernhart, C. Honer Zu Siederdissen, H. Tafer, C. Flamm, P. F. Stadler and
I. L. Hofacker (2011). "ViennaRNA Package 2.0." Algorithms Mol Biol 6: 26.
Lu, P., C. Vogel, R. Wang, X. Yao and E. M. Marcotte (2007). "Absolute protein expression
profiling estimates the relative contributions of transcriptional and translational regulation."
Nat Biotechnol 25(1): 117-124.
Lukaszewicz, M., M. Feuermann, B. Jerouville, A. Stas and M. Boutry (2000). "In vivo
evaluation of the context sequence of the translation initiation codon in plants." Plant Sci
154(1): 89-98.
Forouzanfar, M., H. R. Dajani, V. Z. Groza, M. Bolic and S. Rajan (2010). Comparison of
Feed-Forward Neural Network Training Algorithms for Oscillometric Blood Pressure Estimation.
4th Int. Workshop Soft Computing Applications. Arad, Romania, IEEE.
Maher, C. A., C. Kumar-Sinha, X. Cao, S. Kalyana-Sundaram, B. Han, X. Jing, L. Sam, T.
Barrette, N. Palanisamy and A. M. Chinnaiyan (2009). "Transcriptome sequencing to detect
gene fusions in cancer." Nature 458(7234): 97-101.
Malys, N. and J. E. McCarthy (2011). "Translation initiation: variations in the mechanism
can be anticipated." Cell Mol Life Sci 68(6): 991-1003.
Man, O. and Y. Pilpel (2007). "Differential translation efficiency of orthologous genes is
involved in phenotypic divergence of yeast species." Nat Genet 39(3): 415-421.
Mardis, E. R. (2008). "Next-generation DNA sequencing methods." Annu Rev Genomics
Hum Genet 9: 387-402.
McManus, C. J., G. E. May, P. Spealman and A. Shteyman (2014). "Ribosome profiling
reveals post-transcriptional buffering of divergent gene expression in yeast." Genome Res
24(3): 422-430.
Merchante, C., J. Brumos, J. Yun, Q. Hu, K. R. Spencer, P. Enríquez, B. M. Binder,
S. Heber, A. N. Stepanova and J. M. Alonso (2015). "Gene-Specific Translation Regulation
Mediated by the Hormone-Signaling Molecule EIN2." Cell 163(3): 684-697.
Kyung, M., J. Gill, M. Ghosh and G. Casella (2010). "Penalized Regression,
Standard Errors, and Bayesian Lassos." Bayesian Analysis: 369-412.
Miyasaka, H. (1999). "The positive relationship between codon usage bias and translation
initiation AUG context in Saccharomyces cerevisiae." Yeast 15(8): 633-637.
Moriyama, E. N. and J. R. Powell (1997). "Codon usage bias and tRNA abundance in
Drosophila." J Mol Evol 45(5): 514-523.
Morris, D. R. and A. P. Geballe (2000). "Upstream open reading frames as regulators of
mRNA translation." Mol Cell Biol 20(23): 8635-8642.
Mortazavi, A., E. M. Schwarz, B. Williams, L. Schaeffer, I. Antoshechkin, B. J. Wold and P.
W. Sternberg (2010). "Scaffolding a Caenorhabditis nematode genome with RNA-seq."
Genome Res 20(12): 1740-1747.
Friedman, N., D. Geiger and M. Goldszmidt (1997). "Bayesian network classifiers." Machine
Learning 29: 131-163.
Nie, L., G. Wu and W. Zhang (2006). "Correlation between mRNA and protein abundance in
Desulfovibrio vulgaris: a multiple regression to identify sources of variations." Biochem
Biophys Res Commun 339(2): 603-610.
Nie, L., G. Wu and W. Zhang (2006). "Correlation of mRNA expression and protein
abundance affected by multiple sequence features related to translational efficiency in
Desulfovibrio vulgaris: a quantitative analysis." Genetics 174(4): 2229-2243.
Nishimura, T., T. Wada, K. T. Yamamoto and K. Okada (2005). "The Arabidopsis STV1
protein, responsible for translation reinitiation, is required for auxin-mediated gynoecium
patterning." Plant Cell 17(11): 2940-2953.
Nyiko, T., B. Sonkoly, Z. Merai, A. H. Benkovics and D. Silhavy (2009). "Plant upstream
ORFs can trigger nonsense-mediated mRNA decay in a size-dependent manner." Plant Mol
Biol 71(4-5): 367-378.
Opitz, D. and R. Maclin (1999). "Popular ensemble methods: An empirical study." Journal of
Artificial Intelligence Research 11: 169-198.
Langley, P., W. Iba and K. Thompson (1992). An analysis of Bayesian classifiers. In
Proceedings of the Tenth National Conference on Artificial Intelligence, San Jose, CA.
Tan, P.-N., M. Steinbach and V. Kumar. Introduction to Data Mining. Addison Wesley.
Pestova, T. V. and V. G. Kolupaeva (2002). "The roles of individual eukaryotic translation
initiation factors in ribosomal scanning and initiation codon selection." Genes Dev 16(22):
2906-2922.
Puyaubert, J., L. Denis and C. Alban (2008). "Dual targeting of Arabidopsis holocarboxylase
synthetase1: a small upstream open reading frame regulates translation initiation and protein
targeting." Plant Physiol 146(2): 478-491.
Hu, Q., C. Merchante, A. N. Stepanova, J. M. Alonso and S. Heber (2015). Mining
transcript features related to translation in Arabidopsis using LASSO and random forest.
Computational Advances in Bio and Medical Sciences (ICCABS), 2015 IEEE 5th
International Conference on.
Tibshirani, R. (1996). "Regression shrinkage and selection via the lasso." J Roy Stat Soc B Met 58(1):
267-288.
Rahmani, F., M. Hummel, J. Schuurmans, A. Wiese-Klinkenberg, S. Smeekens and J.
Hanson (2009). "Sucrose control of translation mediated by an upstream open reading
frame-encoded peptide." Plant Physiol 150(3): 1356-1367.
Rangan, L., C. Vogel and A. Srivastava (2008). "Analysis of context sequence surrounding
translation initiation site from complete genome of model plants." Mol Biotechnol 39(3):
207-213.
Rehwinkel, J., J. Raes and E. Izaurralde (2006). "Nonsense-mediated mRNA decay: Target
genes and functional diversification of effectors." Trends Biochem Sci 31(11): 639-646.
Rice, P., I. Longden and A. Bleasby (2000). "EMBOSS: the European Molecular Biology
Open Software Suite." Trends Genet 16(6): 276-277.
Duda, R. O., P. E. Hart and D. G. Stork. Pattern Classification.
Rish, I. (2001). An empirical study of the naive bayes classifier.
Rousseeuw, P. J. (1987). "Silhouettes: a Graphical Aid to the Interpretation and Validation of
Cluster Analysis." Journal of Computational and Applied Mathematics 20: 53-65.
Malinov, S., W. Sha and J. J. McKeown (2001). "Modelling the correlation between processing
parameters and properties in titanium alloys using artificial neural network." Computational
Materials Science 21(3): 375-394.
Dzeroski, S. and B. Zenko (2004). "Is Combining Classifiers with Stacking Better than
Selecting the Best One?" Machine Learning 54: 255-273.
Scheper, G. C., M. S. van der Knaap and C. G. Proud (2007). "Translation matters: protein
synthesis defects in inherited disease." Nat Rev Genet 8(9): 711-723.
Schmeing, T. M. and V. Ramakrishnan (2009). "What recent ribosome structures have
revealed about the mechanism of translation." Nature 461(7268): 1234-1242.
Selpi, C. H. Bryant, G. J. Kemp, J. Sarv, E. Kristiansson and P. Sunnerhagen (2009).
"Predicting functional upstream open reading frames in Saccharomyces cerevisiae." BMC
Bioinformatics 10: 451.
Shields, D. C., P. M. Sharp, D. G. Higgins and F. Wright (1988). ""Silent" sites in
Drosophila genes are not neutral: evidence of selection among synonymous codons." Mol
Biol Evol 5(6): 704-716.
Shu, P., H. Dai, W. Gao and E. Goldman (2006). "Inhibition of translation by consecutive
rare leucine codons in E. coli: absence of effect of varying mRNA stability." Gene Expr
13(2): 97-106.
Skarshewski, A., M. Stanton-Cook, T. Huber, S. Al Mansoori, R. Smith, S. A. Beatson and J.
A. Rothnagel (2014). "uPEPperoni: an online tool for upstream open reading frame location
and analysis of transcript conservation." BMC Bioinformatics 15: 36.
Somers, J., T. Poyry and A. E. Willis (2013). "A perspective on mammalian upstream open
reading frame function." Int J Biochem Cell Biol 45(8): 1690-1700.
Stark, A. M., S. Pfannenschmidt, H. Tscheslog, N. Maass, F. Rosel, H. M. Mehdorn and J.
Held-Feindt (2006). "Reduced mRNA and protein expression of BCL-2 versus decreased
mRNA and increased protein expression of BAX in breast cancer brain metastases: a
real-time PCR and immunohistochemical evaluation." Neurol Res 28(8): 787-793.
Stehman, S. V. (1997). "Selecting and interpreting measures of thematic classification
accuracy." Remote Sensing of Environment 62: 77–89.
Tabuchi, T., T. Okada, T. Azuma, T. Nanmori and T. Yasuda (2006). "Posttranscriptional
regulation by the upstream open reading frame of the phosphoethanolamine
N-methyltransferase gene." Biosci Biotechnol Biochem 70(9): 2330-2334.
Takahashi, H., A. Takahashi, S. Naito and H. Onouchi (2012). "BAIUCAS: a novel BLAST-based
algorithm for the identification of upstream open reading frames with conserved amino
acid sequences and its application to the Arabidopsis thaliana genome." Bioinformatics
28(17): 2231-2241.
Tan, P.-N., V. Kumar and J. Srivastava (2002). Selecting the right interestingness measure
for association patterns. Proceedings of the eighth ACM SIGKDD international conference
on Knowledge discovery and data mining. Edmonton, Alberta, Canada, ACM: 32-41.
Taymans, J. M., A. Nkiliza and M. C. Chartier-Harlin (2015). "Deregulation of protein
translation control, a potential game-changing hypothesis for Parkinson's disease
pathogenesis." Trends Mol Med 21(8): 466-472.
Tran, M. K., C. J. Schultz and U. Baumann (2008). "Conserved upstream open reading
frames in higher plants." BMC Genomics 9: 361.
Trapnell, C., L. Pachter and S. L. Salzberg (2009). "TopHat: discovering splice junctions
with RNA-Seq." Bioinformatics 25(9): 1105-1111.
Tu, J. V. (1996). "Advantages and disadvantages of using artificial neural networks versus
logistic regression for predicting medical outcomes." J Clin Epidemiol 49(11): 1225-1231.
Tuller, T., Y. Y. Waldman, M. Kupiec and E. Ruppin (2010). "Translation efficiency is
determined by both codon bias and folding energy." Proc Natl Acad Sci U S A 107(8):
3645-3650.
Vilela, C. and J. E. McCarthy (2003). "Regulation of fungal gene expression via short open
reading frames in the mRNA 5'untranslated region." Mol Microbiol 49(4): 859-867.
Vogel, C., S. Abreu Rde, D. Ko, S. Y. Le, B. A. Shapiro, S. C. Burns, D. Sandhu, D. R.
Boutz, E. M. Marcotte and L. O. Penalva (2010). "Sequence signatures and mRNA
concentration can explain two-thirds of protein abundance variation in a human cell line."
Mol Syst Biol 6: 400.
von Arnim, A. G., Q. Jia and J. N. Vaughn (2014). "Regulation of plant translation by
upstream open reading frames." Plant Sci 214: 1-12.
Wang, E. T., R. Sandberg, S. Luo, I. Khrebtukova, L. Zhang, C. Mayr, S. F. Kingsmore, G.
P. Schroth and C. B. Burge (2008). "Alternative isoform regulation in human tissue
transcriptomes." Nature 456(7221): 470-476.
Wang, L. and S. R. Wessler (2001). "Role of mRNA secondary structure in translational
repression of the maize transcriptional activator Lc(1,2)." Plant Physiol 125(3): 1380-1387.
Wang, Z., M. Gerstein and M. Snyder (2009). "RNA-Seq: a revolutionary tool for
transcriptomics." Nat Rev Genet 10(1): 57-63.
Wiese, A., N. Elzinga, B. Wobbes and S. Smeekens (2004). "A conserved upstream open
reading frame mediates sucrose-induced repression of translation." Plant Cell 16(7):
1717-1729.
Wolpert, D. H. (1992). "Stacked generalization." Neural Networks 5: 241-259.
Yassour, M., T. Kaplan, H. B. Fraser, J. Z. Levin, J. Pfiffner, X. Adiconis, G. Schroth, S.
Luo, I. Khrebtukova, A. Gnirke, C. Nusbaum, D. A. Thompson, N. Friedman and A. Regev
(2009). "Ab initio construction of a eukaryotic transcriptome by massively parallel mRNA
sequencing." Proc Natl Acad Sci U S A 106(9): 3264-3269.
Yun, Y., T. M. Adesanya and R. D. Mitra (2012). "A systematic study of gene expression
variation at single-nucleotide resolution reveals widespread regulatory roles for uAUGs."
Genome Res 22(6): 1089-1097.
Zhu, X. (2008). Semi-supervised learning literature survey, Computer Sciences, University
of Wisconsin-Madison.
Zhu, X., S. K. Thalor, Y. Takahashi, T. Berberich and T. Kusano (2012). "An inhibitory
effect of the sequence-conserved upstream open-reading frame on the translation of the main
open-reading frame of HsfB1 transcripts in Arabidopsis." Plant Cell Environ 35(11):
2014-2030.
Zur, H. and T. Tuller (2013). "Transcript features alone enable accurate prediction and
understanding of gene expression in S. cerevisiae." BMC Bioinformatics 14 Suppl 15: S1.