Genome-wide Characterization of RNA Expression and Processing

Digital Comprehensive Summaries of Uppsala Dissertations
from the Faculty of Medicine 939
AMMAR ZAGHLOOL
ACTA
UNIVERSITATIS
UPSALIENSIS
UPPSALA
2013
ISSN 1651-6206
ISBN 978-91-554-8784-3
urn:nbn:se:uu:diva-209390
Dissertation presented at Uppsala University to be publicly examined in Rudbeck Salen,
Dag Hammarskjölds väg 20, Uppsala, Friday, 29 November 2013 at 09:15 for the degree of
Doctor of Philosophy (Faculty of Medicine). The examination will be conducted in English.
Faculty examiner: Rickard Sandberg (Karolinska Institutet).
Abstract
Zaghlool, A. 2013. Genome-wide Characterization of RNA Expression and Processing.
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Medicine
939. 61 pp. Uppsala: Acta Universitatis Upsaliensis. ISBN 978-91-554-8784-3.
The production of fully mature protein-coding transcripts is an intricate process that involves
numerous regulation steps. The complexity of these steps provides the means for multilayered
control of gene expression. Comprehensive understanding of gene expression regulation is
essential for interpreting the role of gene expression programs in tissue specificity, development
and disease. In this thesis, we aim to provide a better global view of the human transcriptome,
focusing on its content, synthesis, processing and regulation using next-generation sequencing
as a read-out.
In Paper I, we show that sequencing of total RNA provides unique insights into RNA
processing. Our results revealed that co-transcriptional splicing is a widespread mechanism in
human and chimpanzee brain tissues. We also found a correlation between slowly removed
introns and alternative splicing. In Paper II, we explore the benefits of exome capture approaches
in combination with RNA-sequencing to detect transcripts expressed at low levels. Based on
our results, we demonstrate that this approach increases the sensitivity for detecting low-level
transcripts and leads to the identification of novel exons and splice isoforms. In Paper III, we
highlight the advantages of performing RNA-sequencing on separate cytoplasmic and nuclear
RNA fractions. In comparison with conventional poly(A) RNA, cytoplasmic RNA contained
a significantly higher fraction of exonic sequence, providing increased sensitivity for splice
junction detection and for improved de novo assembly. Conversely, the nuclear fraction showed
an enrichment of unprocessed RNA compared to when sequencing total RNA, making it
suitable for analysis of RNA processing dynamics. In Paper IV, we used exome sequencing
to sequence the DNA of a patient with unexplained intellectual disability and identified a de
novo mutation in BAZ1A, which encodes the chromatin-remodeling factor ACF1. Functional
studies indicated that the mutation influences the expression of genes involved in extracellular
matrix organization, synaptic function and vitamin D3 metabolism. The differential expression
of CYP24A, SYNGAP1 and COL1A2 correlated with the patient’s clinical diagnosis.
The findings presented in this thesis contribute towards an improved understanding of the
human transcriptome in health and disease, and highlight the advantages of developing novel
methods to obtain global and comprehensive views of the transcriptome.
Keywords: RNA sequencing, RNA splicing, RNA processing, Gene expression
Ammar Zaghlool, Department of Immunology, Genetics and Pathology, Rudbecklaboratoriet,
Uppsala University, SE-751 85 Uppsala, Sweden.
© Ammar Zaghlool 2013
urn:nbn:se:uu:diva-209390 (http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-209390)
初心忘るべからず “We should not forget our beginner’s spirit”
Japanese proverb
To my family
List of Papers
This thesis is based on the following papers, which are referred to in the text
by their Roman numerals:
I
Ameur A, Zaghlool A, Halvardson J, Wetterbom A, Gyllensten
U, Cavelier L, Feuk L. (2011) Total RNA sequencing reveals
nascent transcription and widespread co-transcriptional splicing
in the human brain. Nature Structural and Molecular Biology,
18(12): 1435–40
II
Halvardson J*, Zaghlool A*, Feuk L. (2013) Exome RNA
sequencing reveals rare and novel alternative transcripts.
Nucleic Acids Research, 41(1): e6
III
Zaghlool A*, Ameur A*, Nyberg L, Grabherr M, Cavelier L,
Feuk L. (2013) Efficient cellular fractionation improves RNA
sequencing analysis of mature and nascent transcripts from
human tissues. (Accepted in BMC Biotechnology)
IV
Zaghlool A, Halvardson J, Zhao J, Kalushkova A, Konska K,
Jernberg-Wiklund H, Thuresson AC, Feuk L. (2013) Mutation
in the chromatin-remodeling factor BAZ1A is associated with
intellectual disability. (Manuscript)
* Equal first authors
Reprints were made with permission from the respective publishers.
Additional publication
I
Kozyrev S *, Abelson AK *, Wojcik J, Zaghlool A, Linga
Reddy MV, Sanchez E, Gunnarsson I, Svenungsson E, Sturfelt
G, Jönsen A, Truedsson L, Pons-Estel BA, Witte T, D'Alfonso
S, Barizzone N, Danieli MG, Gutierrez C, Suarez A, Junker P,
Laustrup H, González-Escribano MF, Martin J, Abderrahim H,
Alarcón-Riquelme ME. (2008) Functional variants in the B-cell
gene BANK1 are associated with systemic lupus erythematosus.
Nature Genetics, 40(2): 211–6
Contents
Abbreviations
Introduction
The transcriptome: the faithful message
    Transcription and gene expression
    Transcriptional regulation and RNA processing
        Chromatin remodeling
        RNA processing
    Transcriptome analysis: In light of next-generation sequencing
Enabling technologies and method development
    RNA sequencing
        Background
        Library preparation
        SOLiD sequencing
    Exome sequencing
        Background
        Sequencing library preparation and exome capture
    Cytoplasmic and nuclear RNA purification
The present investigation
    Aims
        Paper I
        Paper II
        Paper III
        Paper IV
Results and discussion
    Total RNA sequencing reveals nascent transcription and widespread co-transcriptional splicing in the human brain (Paper I)
    Exome RNA sequencing reveals rare and novel alternative transcripts (Paper II)
    Efficient cellular fractionation improves RNA sequencing analysis of mature and nascent transcripts from human tissues (Paper III)
    Mutation in the chromatin-remodeling factor BAZ1A is associated with intellectual disability (Paper IV)
Concluding remarks and future perspectives
Acknowledgements
References

Abbreviations
APA             Alternative polyadenylation
AS              Alternative splicing
cDNA            Complementary DNA
ChIP            Chromatin Immunoprecipitation
CNV             Copy number variation
CTD             The carboxy-terminal domain
CytoRNA-seq     Cytoplasmic RNA sequencing
DCPM            Depth of coverage per million mapped reads
DNA             Deoxyribonucleic acid
ESE             Exonic splicing enhancer
ESS             Exonic splicing silencer
EST             Expressed sequence tag
ExomeRNA-seq    Exome RNA capture sequencing
GTF             General transcription factor
GWAS            Genome-wide association studies
ID              Intellectual disability
ISE             Intronic splicing enhancer
ISS             Intronic splicing silencer
LncRNA          Long noncoding RNA
mRNA            Messenger RNA
ncRNA           Non-coding RNA
PCR             Polymerase chain reaction
PIC             Transcription pre-initiation complex
Poly(A)         Polyadenylation
Poly(A) RNA     Polyadenylated RNA
Pre-mRNA        Premature RNA
qrtPCR          Quantitative real time PCR
RNA             Ribonucleic acid
RNA pol II      RNA polymerase II
RNA-seq         RNA sequencing
RPKM            Reads per kilobase per million mapped reads
rRNA            Ribosomal RNA
SNP             Single nucleotide polymorphism
SNV             Single nucleotide variant
tRNA            Transfer RNA
TSS             Transcription start site
Introduction
The human body contains an estimated 10^14 cells, representing a large
number of different cell types with specialized function. Although the cells
are highly specialized, each cell contains the information required to create
all cell types in the body. This information is stored within the genome,
which is located in the cell nucleus. The genome is divided into twenty-three
pairs of chromosomes, with the father and mother contributing one
chromosome to each pair. The biological information in our chromosomes is
mainly encoded by deoxyribonucleic acid (DNA). The DNA structure was
revealed in 1953 by Watson and Crick. Their model suggested that the DNA
molecule forms a double helix of two long chains built from four different
nucleotides1, called adenine (A), cytosine (C), thymine (T) and guanine (G).
The genetic code is determined by the order of nucleotides, and this code
allows for traits to be reliably passed from one generation to the next.
Therefore, the discovery of the DNA structure resolved the mystery of
inheritance that had previously puzzled scientists.
Another breakthrough finding in genomics research took place in 2001
when the sequence of the entire three billion bases that make up the human
genome was published2,3. The completion of the human genome sequence
has resulted in enormous surprises that fundamentally reshaped our view of
genome architecture. The coding sequence represented only 1.5% of the
genome and the number of coding genes is estimated to be between 20,000 and 25,000 (International Human Genome Sequencing Consortium 2004),
significantly fewer than previously predicted4.
The fact that the number of genes was significantly lower than previously
expected raised several important questions. If we have the same number of
genes as many other species, where does our complexity come from, and
how can such a low number of genes give rise to the large number of
encoded proteins? There must be many mechanisms, processes and
elements, other than the genes, that help explain our complexity. Efforts by
the Encyclopedia of DNA Elements Consortium (ENCODE) are aiming to
answer this question by applying genome-wide assays to many different
human cell lines. In a recent publication, the ENCODE Consortium
claims to have assigned biochemical function to almost 80% of the
genome, particularly in regions outside known genes and regions previously
considered “junk” DNA5. Furthermore, genomic complexity in higher
eukaryotes can be achieved not only by gene number, but also through
mechanisms that lead to the production of multiple proteins from the same
gene2. These mechanisms include alternative splicing, RNA editing, trans-splicing, alternative promoter usage, and post-translational protein
modifications6,7. In addition, non-coding RNAs (ncRNAs), such as miRNAs
and siRNAs, transcribed from intronic regions or from regions outside genes
may add another layer of complexity to the genome by controlling multiple
and large regulatory networks8-10. In light of the many mechanisms of
regulation that exist at the level of RNA, and as it represents a direct output
of genetic information, deciphering and interpreting the complexity of all the
RNA in the cell (the so-called transcriptome) represents an important step
towards decoding the functional elements of the genome.
Initial estimates of the proportion of the genome that is transcribed have
largely relied on hybridization-based microarrays. Results coming from
these studies showed that transcription is pervasive and many transcripts fall
outside known genes. The nature of these transcripts was unclear, but it was
suggested to represent a combination of functional transcripts and biological
or experimental background11,12. Recently, the ENCODE project performed a
more comprehensive analysis of the transcriptional landscape in 15 human cell
lines using RNA sequencing (RNA-seq) in combination with other
technologies. The data revealed that more than three-quarters of the human
genome is capable of being transcribed, firmly establishing the reality of
pervasive transcription14.
My thesis is based on four genome-wide studies of the human
transcriptome. The findings presented in all the studies contribute towards an
improved understanding of the human transcriptome in healthy and diseased
individuals and highlight the advantages of developing novel methods to
obtain global and comprehensive views of RNA expression and processing.
• Paper I focuses on using total RNA-seq to study nascent RNA
transcripts and co-transcriptional splicing genome-wide.
• Paper II highlights the benefits of using whole exome capture and
massive parallel sequencing on cDNA to study the transcriptional
landscape of human tissues and to identify novel transcripts and
splice junctions expressed at low levels.
• Paper III highlights the advantages of sequencing separated
cytoplasmic and nuclear RNA fractions using a modified cellular
fractionation protocol to investigate mature RNA transcripts from
cytoplasmic RNA and nascent transcripts from nuclear RNA.
• Paper IV describes a de novo mutation in the chromatin-remodeling
gene BAZ1A in a patient with intellectual disability. It characterizes
how the effect of the mutation on the expression and function of the
gene may be linked to the clinical manifestations of the patient.
The transcriptome: the faithful message
Following the discovery of the genetic code, it was generally believed that
genetic information flows from DNA to RNA to protein (i.e. the central
dogma). As a consequence, it was assumed that RNA transcripts were
primarily used as templates for translation to protein. Recent data have
challenged this theory and showed that a large proportion of transcripts
represent noncoding RNAs12. In fact, it is proposed that the regulatory
potential of these noncoding RNAs correlates with the complexity of the
transcriptome9,15.
The transcriptome represents the complete collection of transcripts or
RNA molecules present in a cell at a given time point. It is estimated that
frequently expressed transcripts represent only 5% of the genetic code and
only 1-2% protein coding transcripts16. In multicellular organisms, nearly all
the cells share the same genome. However, the gene expression profiles can
significantly vary between different cell types. Moreover, the expression
patterns inside a single cell may change at different developmental stages,
during differentiation and in response to external stimuli. To achieve this
complexity, the transcriptome relies on extremely tight, yet flexible and
orchestrated, regulatory mechanisms that provide the cells with a unique
expression pattern suitable for the cell specific function at that very moment.
The regulatory mechanisms function in concerted action to control the four
coordinates of gene expression: where, when, how much and for how long a
single transcript is expressed. Therefore, completing the catalogue of
transcripts and comparing transcription and RNA processing patterns in
different cells and tissues at different developmental stages will allow us to
understand what constitutes a particular cell type and how transcriptional
dynamics may cause or contribute to disease.
In the following sections of my thesis, I will shed some light on the basic
principles and recent understanding of transcription and RNA processing
machineries. I will also touch upon the contribution of different next-generation sequencing technologies in expanding our knowledge of
transcriptomics.
Transcription and gene expression
The function and characteristics of any given cell are largely influenced by
the cell’s protein content. The cell regulates how much of a protein is present
or how active a protein is by mainly controlling gene expression. The
expression of eukaryotic protein coding genes can be dissected into several
major stages including transcription, pre-mRNA processing,
polyadenylation, transport and translation. Transcription is the first stage of
gene expression where a discrete segment of DNA, a gene, is copied into
a messenger RNA transcript in the nucleus by the enzyme RNA polymerase
(RNA pol). In eukaryotes, there are three different types of RNA
polymerases. RNA pol I produces most of the ribosomal RNAs (rRNAs).
RNA pol II transcribes precursor mRNAs (pre-mRNAs) from genes, which
serve as templates for protein synthesis. RNA pol III catalyzes the synthesis
of one small rRNA (5S rRNA) and transfer RNAs (tRNAs). A simplified view
of gene expression steps is illustrated in Figure 1.
Figure 1. A schematic for the process of gene expression in a cell. First, both coding
and noncoding regions of DNA are transcribed into RNA. Some regions are
removed (introns) during initial RNA processing. The remaining exons are then
spliced together. The spliced, polyadenylated and capped mRNA molecule (red) is
then exported out of the nucleus to the cytoplasm. Once in the cytoplasm, the
mRNA can be used as a template for protein synthesis. Figure adapted from © 2010
Nature Education.
The first step in the transcription process is initiation. The initiation of transcription requires the core enzymatic activity of the RNA pol II and multiple
general transcription factors (GTFs)17-20. The transcription process also requires cis-acting elements to ensure accurate gene expression.
Cis-acting elements include a promoter, which is composed of a core
promoter sequence that includes a TATA box or initiator element. The core
promoter and the TATA box are located upstream of the transcription start site
(TSS), providing a secure initial binding site for GTFs, RNA pol II and the
TATA box binding protein (TBP)21. The location of the core promoter and
the binding of transcription factors influence the transcription rate and determine the location of the TSS22. There are also distal regulatory cis-acting
elements called enhancers, which are found at a considerable distance from
the promoter they influence. Enhancer elements regulate gene activation
when bound by activator proteins by changing the 3-D structure of the DNA
to make it more accessible to transcription factors23.
During transcription initiation, the GTFs assemble on the core promoter
in an organized fashion forming a transcription pre-initiation complex (PIC),
which facilitates the recruitment of RNA pol II to the TSS to ultimately form
the transcription initiation complex24. Once transcription is initiated the
carboxy-terminal domain (CTD) of RNA Pol II is phosphorylated and begins
to recruit the factors needed for elongation and mRNA processing25.
Concomitantly, RNA pol II begins unwinding the double stranded DNA and
starts copying the template strand of DNA into RNA by adding nucleotides
to the 3’ end of the growing transcript until the polymerase reaches the
termination signal at the end of the coding sequence. The polymerase
continues for several hundreds or even thousands of bases until it reaches the
polyadenylation site (poly(A) site), a consensus sequence that is bound and
cleaved by the polyadenylation complex. Following cleavage, a poly(A) tail
is added to the newly synthesized mRNA transcript, a process called mRNA
polyadenylation26. The newly synthesized mRNA is then exported to the
cytoplasm and in most instances, translated into protein. It is important to
mention that several polymerases can be found on the same gene producing
multiple transcripts from a single gene copy27.
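The elongation and polyadenylation steps described above can be pictured with a small sketch. It is only a toy model of base pairing, not of RNA pol II biochemistry; the function names and the default tail length are illustrative assumptions and do not come from the thesis.

```python
# Toy model: base pairing between a DNA template strand and the growing RNA.
# All names here are hypothetical and chosen for illustration only.
COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5: str) -> str:
    """Return the RNA (written 5'->3') copied from a DNA template read 3'->5'."""
    rna = []
    for base in template_3_to_5.upper():
        # Each new nucleotide is appended to the 3' end of the growing transcript.
        rna.append(COMPLEMENT[base])
    return "".join(rna)


def add_poly_a_tail(rna: str, tail_length: int = 20) -> str:
    """Append a poly(A) tail; tail_length is an arbitrary illustrative value."""
    return rna + "A" * tail_length


if __name__ == "__main__":
    transcript = transcribe("TACGGT")
    print(add_poly_a_tail(transcript, tail_length=5))
```

The sketch deliberately ignores cleavage at the poly(A) site, capping and splicing, which are treated in the following sections.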
Transcriptional regulation and RNA processing
Chromatin remodeling
DNA is condensed in the cell in a chromatin structure. The chromatin is
composed of a repeated structural unit called a nucleosome which is
comprised of 146 base pairs (bp) of DNA wrapped around a histone protein
octamer formed from two copies of H2A, H2B, H3 and H428-30. Chromatin
does not only serve as a mechanism to condense DNA, but it also represents a
barrier for the transcriptional machinery31-35. The nucleosome complexes are
not static structures. Their dynamics of “opening” or “closing” the chromatin
structure represent one of the fundamental layers of gene expression
regulation34. Various protein complexes tightly regulate the dynamics of
chromatin structures36, including the chromatin-remodeling complexes37.
Chromatin remodelers are multi-protein complexes that utilize ATP
hydrolysis to mobilize and disrupt histone-DNA contacts, allowing the
transcriptional machinery to access DNA and, consequently, leading to
transcriptional activation38-40.
All eukaryotes have at least five families of chromatin remodelers. The
two best-studied families include SWI/SNF and ISWI. The functions of
these families include, but are not limited to, chromatin assembly,
nucleosome spacing, transcriptional repression, double-stranded DNA-damage repair and development38,41-43. Given the fundamental role of these
complexes in regulating spatial and temporal gene expression, mutations
affecting the function of the components of chromatin remodeling
complexes generally cause cancers and multi-system developmental
disorders44-48. Moreover, recent family-based studies using exome
sequencing show that de novo mutations in chromatin-remodeling proteins
are linked to intellectual disability (ID), autism and other
neurodevelopmental disorders48,49.
RNA processing
The production of fully mature protein-coding transcripts is a complex
process that involves numerous tightly regulated yet flexible steps. The
complexity of these processes provides the means for multilayered
regulation of gene expression. During and after transcription, newly
synthesized pre-mRNA transcripts undergo a series of modifications and
processing steps in the nucleus before the fully mature RNA is exported to
the cytoplasm. These processes include: a) 5´ pre-mRNA capping, the
addition of an altered guanine nucleotide to the 5´ end of a growing RNA
transcript (7-methylguanylate cap or m7G), b) pre-mRNA splicing, the
removal of the intronic sequences from the transcribed RNA, and c)
polyadenylation, the addition of a poly(A) stretch to the fully transcribed
RNA transcript by a tightly coupled cleavage and polyadenylation reaction.
Pre-mRNA splicing
In eukaryotic genes, the protein coding sequences (i.e. exons) are interrupted
by intervening sequences called introns. These introns have to be precisely
removed from the pre-mRNA to produce a mature mRNA transcript that will
be used to produce a functional protein. The process by which introns are
removed and exons are joined together is called mRNA splicing and it
occurs in the nucleus. The mechanism by which introns are removed from
the transcript is dependent on two conserved sequences located at the 5´ and
3´ ends of introns. These specific sequences are called splice sites. Most
often the dinucleotide GU is found at the 5´ splice site (donor site) and an
AG at the 3´ splice site (acceptor site). Mutations disrupting these consensus
sequences have been linked to aberrant splicing and disease (discussed
further below)50. Another conserved intronic sequence that is also important
for intron removal is called the branch site. It is often located
between 21 and 34 nucleotides upstream of the 3´ end of introns. The branch
site typically contains the consensus yUnAy, where A is the
branch point, n is any nucleotide, and the lowercase pyrimidines are not as
conserved as the uppercase A and U51.
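As a rough illustration of these sequence signals, the sketch below checks an intron for the GU/AG dinucleotides and scans for yUnAy motifs whose branch point A lies 21 to 34 nucleotides from the 3´ end. The function names are hypothetical, and the distance convention (intron length minus the 0-based index of the A) is an assumption made for this example; real branch point prediction relies on far richer models.

```python
import re

# yUnAy with y = C or U and n = any nucleotide; the lookahead allows
# overlapping candidate motifs to be reported.
BRANCH_SITE = re.compile(r"(?=([CU]U[ACGU]A[CU]))")


def _as_rna(sequence: str) -> str:
    """Normalize input to upper-case RNA letters (T is read as U)."""
    return sequence.upper().replace("T", "U")


def has_canonical_splice_sites(intron: str) -> bool:
    """True if the intron starts with GU (donor) and ends with AG (acceptor)."""
    rna = _as_rna(intron)
    return rna.startswith("GU") and rna.endswith("AG")


def candidate_branch_points(intron: str, min_dist: int = 21, max_dist: int = 34):
    """Return (index_of_A, distance_to_3prime_end) for yUnAy motifs in range.

    Assumption: distance = len(intron) - index_of_A, with a 0-based index.
    """
    rna = _as_rna(intron)
    hits = []
    for match in BRANCH_SITE.finditer(rna):
        a_index = match.start() + 3  # the A is the fourth base of the motif
        distance = len(rna) - a_index
        if min_dist <= distance <= max_dist:
            hits.append((a_index, distance))
    return hits


if __name__ == "__main__":
    toy_intron = "GU" + "G" * 20 + "CUGAC" + "G" * 22 + "AG"
    print(has_canonical_splice_sites(toy_intron))
    print(candidate_branch_points(toy_intron))
```

The toy intron is synthetic and serves only to exercise the two checks; it is not taken from any annotated gene.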
The splicing process is carried out by a series of reactions catalyzed by a
macromolecular machine called the spliceosome52. The spliceosome consists
of small nuclear ribonucleoprotein particles (snRNPs) called U1, U2, U4, U5
and U6, and a large number of non-snRNP factors53-55. The spliceosome
assembly is initiated by a sequential recruitment of the spliceosome
components to the intron (Figure 2). The U1 snRNP and the heterodimeric
U2 snRNP auxiliary factor (U2AF) recognize the 5´ and the 3´ splice sites,
respectively. Next, the U2 snRNP is recruited to the branch point site and
then the U4-U5-U6 tri-snRNP joins the spliceosome complex. This complex
becomes activated by a series of structural rearrangements that ultimately
lead to the joining of exons, spliceosome disassembly and release of the
lariat intron. This splicing mechanism is called canonical splicing and
accounts for more than 99% of the splicing events in eukaryotes56. Some
introns do not possess the canonical 5´ and 3´ splice site sequences57. These
introns are called minor introns and they are spliced using the U12
spliceosome (also referred to as the minor spliceosome)58-60. The minor
spliceosome contains the same U5 snRNP as the major spliceosome, but
instead of having U1, U2, U4 and U6, it has U11, U12, U4atac and U6atac
snRNPs. This minor splicing event is considered rare and it is still under
debate whether it occurs in the nucleus or in the cytoplasm61,62.
Figure 2. Pre-mRNA splicing. The spliceosomal factors are sequentially recruited to
the 5´ splice site, 3´ splice site and to the branch point. Once the complete
spliceosome is assembled over the intron, the exons are joined together and the
intron lariat is released.
The evolution of pre-mRNA splicing
The process of intron splicing requires an accurate definition of the intron
boundaries. This occurs through base-pairing interaction between U-rich
sequences in the spliceosomal components (snRNPs) and a particular
intronic sequence located in the splice site vicinity63. While the U-rich
sequences have remained conserved during evolution, splicing signals are
substantially more variable in higher eukaryotes compared to yeast64. In the
Saccharomyces cerevisiae (S. cerevisiae) genome, there are only ∼250
annotated introns present in only 3% of the genes. The introns described in
yeast are typically short and alternative splicing events are very rare65,66.
Conversely, most genes in higher eukaryotes contain multiple introns with
different lengths and alternative splicing is very common67. Therefore, the
recognition of introns in higher eukaryotes depends on additional sequences
(cis-elements) in exons and introns to compensate for the sequence
divergence and the resulting poor splice site definition.
Alternative splicing: The source of functional and phenotypic diversity
Before completion of the human genome sequence, it was hypothesized that
there is a link between the complexity of an organism and the number of
genes it possesses. The first draft of the human genome demonstrated that
the number of genes in the human genome was lower than anticipated and
not very different from the number of genes in several less complex
organisms, e.g. the fruit fly. One possible source of greater complexity in
higher eukaryotes is alternative pre-mRNA splicing (AS), a process where
exclusion of certain exons through splicing may give rise to a large number
of different mRNA transcripts from the same gene. Despite the unexpectedly
low number of protein coding genes identified in the human genome (around
21,000), these genes code for more than 100,000 protein coding
isoforms68,69. These isoforms can be created by several mechanisms
including AS, alternative transcription initiation, and alternative
polyadenylation.
AS was first discovered in 1977 in adenoviruses70. It also occurs in other
viruses such as HIV and cytomegalovirus71,72. In higher eukaryotes, AS has
emerged as one of the most essential mechanisms underlying the phenotypic
diversity displayed by mammalian cells. A single pre-mRNA transcript thus
has the potential to give rise to a number of different protein isoforms simply
by generating different combinations of exons via alternative splicing. In the
human genome, up to 95% of the multi-exon genes are believed to be subjected to alternative splicing, thereby greatly enriching the functional diversity of the human proteome and adding additional layers for regulation of
gene expression67,73-76. AS may create different mRNA isoforms that differ in
their coding capacity or display distinctive stability; they can even have profound effects on the protein properties77,78. Particular mRNA isoforms might
be specific to a sex, developmental stage, cell or tissue type, or environmental conditions78,79. For example, the human neurexin gene family of neuronal proteins has been reported to give rise to more than 3000 distinct
mRNA isoforms80 which differ in their structure and their interaction capability with other proteins81,82. In other cases, products of AS might not code
for a functional protein, but rather function as a negative regulator to prevent
the overexpression of the encoded protein83. AS events can be classified into
five major categories: a) exon skipping, representing the most common
mode of alternative splicing, b) intron retention, c) alternative 5´ splice site
usage, d) alternative 3´ splice site usage, and e) mutually exclusive exons.
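To make the combinatorial potential of exon skipping concrete, the following sketch enumerates the isoforms that arise when some exons of a gene are optional (cassette exons). It models only exon skipping, the first of the five categories, and the exon names and function name are hypothetical.

```python
from itertools import product


def enumerate_isoforms(exons, cassette):
    """List all exon combinations when the exons in `cassette` may be skipped.

    exons    -- ordered list of exon names along the pre-mRNA
    cassette -- set of exon names that are alternatively spliced
    Exons not in `cassette` are treated as constitutive (always included).
    """
    # Constitutive exons have one option (kept); cassette exons have two.
    choices = [(True, False) if exon in cassette else (True,) for exon in exons]
    isoforms = []
    for keep_flags in product(*choices):
        isoforms.append(tuple(e for e, keep in zip(exons, keep_flags) if keep))
    return isoforms


if __name__ == "__main__":
    for isoform in enumerate_isoforms(["E1", "E2", "E3", "E4"], {"E2", "E3"}):
        print("-".join(isoform))
```

With k independent cassette exons this simple model yields 2^k isoforms, which gives a sense of how a limited number of genes can encode a much larger number of transcripts. In real genes, exon choices are often not independent.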
The precision of the AS process is a key element in gene expression and
production of functional proteins. Therefore, aberrant AS may result in the
creation of abnormal transcripts with incorrect gene products that might have
severe effects and cause disease. In the last decade, a handful of diseases have
been reported to be associated with aberrant splicing events84-86. In fact, up to
50% of the disease causing mutations introduce defects in the splicing
mechanisms87,88. Such mutations might be located in a splice site or other
cis-acting sequence element that influences the recognition of the correct
exon borders.
Alternative splicing regulation
AS has emerged as one of the most essential mechanisms underlying the
complex phenotypic and functional diversity displayed by mammalian
cells. This enormous diversity is achieved by a tight regulation of AS
outcomes. The production of multiple mRNA transcripts from a single gene
often occurs through the inclusion or exclusion of different combinations of
exons. In a typical mRNA transcript, some exons are constitutively spliced,
i.e. they are always included in the transcript. Other exons are alternatively
spliced; that is, they are sometimes included and sometimes excluded.
The decision to include or exclude a particular alternative exon
depends not only on the 5´ and 3´ splice sites, but also on many cis-acting
sequence elements distributed in exons and introns. These sequence
elements provide flexibility in their usage and possibility of regulation by
serving as binding sites for a series of splicing regulatory proteins89.
Cis-acting elements can either promote exon inclusion (splicing
enhancers) or exon skipping (splicing silencers) and both kinds occur in both
exonic and intronic regions. These elements are classified as exonic splicing
enhancers (ESE) or exonic splicing silencers (ESS) if they function from
exonic locations, and as intronic splicing enhancers (ISE) or intronic splicing
silencers (ISS) if they function from intronic locations (Figure 3). ESEs are
the best characterized sequence elements, serving as binding sites for a
family of proteins called SR (Ser–Arg) proteins. SR proteins influence the
inclusion of exons by recruiting the splicing machinery to the adjacent
exon/intron borders90,91. ESS and ISS repress exon inclusion by recruiting
members of the heterogeneous nuclear ribonucleoprotein family (hnRNPs),
such as hnRNPA190,92. Cis-elements have also been shown to influence
splice site selection by altering splice site recognition through forming loop-like secondary structures93. Furthermore, the concentration ratios between
negative and positive splicing factors in a given cell influence splice site
selection94.
[Figure 3 image: pre-mRNA splicing regulation]
Figure 3. Splicing regulation by the cis-acting elements and trans-acting splicing
factors. Two alternative splicing pathways with the middle exon either included or
excluded. Splicing enhancing signals are shown in green, while splicing inhibitory
signals are shown in red. In this model, ESS or ISS are bound by hnRNPs and repress exon inclusion by inhibiting the recruitment of the splicing machinery (e.g.
U2AF and U1). ESE or ISE serve as binding sites for SR proteins that influence the
inclusion of exons by recruiting the splicing machinery.
AS regulation also has an important role in tissue specificity75,95. Tissue-specific AS events can be partially explained by tissue-specific expression of
splicing factors, and their corresponding influence on their RNA targets96-99.
Recently, large numbers of tissue-specific AS events have been identified90.
The brain was found to harbor the highest incidence of AS events, as might
be expected considering that the brain represents the most complex and
functionally diverse tissue in the body. Relatedly, many brain-specific
splicing factors have also been identified, such as NOVA1 and nPTB100,101.
Cell type-specific and developmental stage-specific expression of these factors was
observed in proliferating and post-mitotic mouse brain cells, neural
progenitor cells and in differentiated neurons100,102,103. These findings
highlight the importance of deciphering the ‘splicing code’ to understand the
molecular mechanisms that define the function and specificity of cells and
tissues.
AS is not only regulated through trans-acting splicing factors but also by
processes linked to the transcription machinery. Recently, several lines of
research have established that splicing is physically and functionally coupled
with transcription, and that this coupling may influence AS regulation and
other downstream RNA processing mechanisms104-106. This implies that
RNA processing factors are recruited to the emerging RNA transcript during
transcription and that splicing occurs co-transcriptionally. Two models have
been suggested to explain the coupling between the transcription unit and the
splicing machinery107. The first model is called recruitment coupling where
the CTD of RNA pol II plays a major role in co-transcriptional coupling
between RNA biogenesis and processing108. In this model, the CTD of RNA
pol II offers a flexible landing pad for various transcription and splicing
factors, and facilitates their recruitment to the emerging nascent RNA
transcript109. This recruitment may influence AS regulation by the ability of
different transcription factors to recruit distinct splicing factors110. The
second model is called kinetic coupling111. This model proposes that
the RNA pol II elongation rate influences the regulation of alternative
splicing and splicing outcomes by affecting the period during which splice
signals are exposed to the splice factors in the growing nascent RNA
transcript112. For example, if the RNA pol II elongation rate is high due to a
strong upstream promoter or an open chromatin structure, it increases the
possibility that weak 3´ splice sites around cassette exons will be
outcompeted by an already transcribed strong 3´ splice site downstream. In
comparison, low RNA pol II processivity will provide enough time for the
splicing machinery to recognize any weak 3´splice sites leading to the
inclusion of a cassette exon113. There are several lines of evidence that
support this model114-116. For example, it has been demonstrated that slow
transcription of the fibronectin gene (FN1) leads to the inclusion of the
fibronectin extra domain 1 (ED1) exon which is preceded by a weak 3´splice
site. However, when transcription elongation rate was higher, this exon was
excluded117.
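The kinetic coupling model can be illustrated with a small numerical sketch. It assumes, purely for illustration, that the time during which a weak 3´ splice site is exposed without competition equals the distance to the downstream strong 3´ splice site divided by the elongation rate. The function name, the distance and the two rates are hypothetical and are not taken from the cited studies.

```python
def exposure_window_seconds(distance_nt, elongation_rate_nt_per_s):
    """Time a weak upstream 3' splice site is exposed alone, i.e. before the
    competing downstream 3' splice site has been transcribed.

    Simplifying assumption: constant elongation rate, no pausing.
    """
    if elongation_rate_nt_per_s <= 0:
        raise ValueError("elongation rate must be positive")
    return distance_nt / elongation_rate_nt_per_s


# Hypothetical values, chosen only to make the comparison easy to read.
DISTANCE_NT = 3000.0  # weak 3' splice site to downstream strong 3' splice site
FAST_RATE = 60.0      # nucleotides per second (illustrative)
SLOW_RATE = 15.0      # nucleotides per second (illustrative)

fast_window = exposure_window_seconds(DISTANCE_NT, FAST_RATE)
slow_window = exposure_window_seconds(DISTANCE_NT, SLOW_RATE)

print(f"fast elongation: weak site exposed alone for {fast_window:.0f} s")
print(f"slow elongation: weak site exposed alone for {slow_window:.0f} s")
```

Under these assumptions, a fourfold slower polymerase gives the splicing machinery a fourfold longer window to recognize the weak site, which is the qualitative behavior the model predicts for cassette exon inclusion.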
More recent global studies have revealed a pausing of RNA pol II near
the 3´ end of intron-containing genes. It is suggested that the pausing
represents a checkpoint that allows the splicing machinery to keep pace with
transcription118. There is also increasing evidence that chromatin status and
histone modifications play a key role in AS regulation119,120. The
outcome of AS might also be affected by other post-transcriptional
mechanisms such as RNA editing, mRNA decay and microRNA binding121-123.
Co-transcriptional splicing
Transcription and pre-mRNA splicing were previously depicted as totally
independent events. However, in the last decade, there has been increasing evidence
that splicing and transcription are actually coupled and that splicing can
occur co-transcriptionally. Co-transcriptional splicing means that the
splicing machinery is tightly coupled to the transcription units, thereby
splicing out introns while transcription of the gene is still progressing.
Illustration of co-transcriptional splicing is presented in Figure 4.
[Figure 4 image: Co-transcriptional splicing]
Figure 4. Co-transcriptional splicing. As the newly synthesized RNA emerges from
RNA pol II, the splicing factors are recruited to the nascent transcript and the introns
are excised prior to completion of the transcription process.
Initial support for co-transcriptional splicing was achieved using electron
microscopy of Drosophila melanogaster embryonic transcription units,
visualizing intron lariat formation and associated ribonucleoprotein splicing
complexes on transcripts that are still attached to DNA124. Since then,
different approaches have been adopted to further investigate these
observations. Immunofluorescent microscopy experiments demonstrated
localization of splicing factors at sites of active transcription125,126. Moreover,
RNA-FISH probes against exon-exon junctions showed partially spliced pre-mRNA at their gene template, thereby suggesting that pre-mRNA splicing
occurs prior to transcript release127. Further evidence identified spliced
mRNA, spliceosomal components and splicing factors in the chromatin
environment of transcriptionally active genes128-131.
Although multiple independent observations collectively suggest that co-transcriptional splicing is the common splicing mechanism, other studies
have revealed that splicing can also occur post-transcriptionally, after the
transcript release from the template DNA130. Investigation of co-transcriptional
splicing patterns in chromatin-associated RNA and free RNA
in the nucleoplasm of human cell lines revealed that introns adjacent to
constitutive exons are mostly co-transcriptionally spliced, and that these
introns are removed in the 5´ to 3´ order. Meanwhile, 3´ introns were shown
to be primarily post-transcriptionally spliced130. Internal introns harboring
alternative exons are also excised during transcription, although with
variable efficiencies114,130. These observations were recently validated using
single molecule imaging of transcriptionally coupled and uncoupled
splicing132. These data, in addition to the fact that co-transcriptional
recruitment of SR proteins increases splicing efficiency133, suggest a
regulatory potential of co-transcriptional splicing. Further studies have
provided support for the kinetic coupling between splicing and transcription.
One study analyzed chromatin associated nascent transcripts and revealed
RNA pol II pausing at terminal exons134. Given that terminal exons are
generally short, it was suggested that RNA pol II pausing would allow
sufficient time for intron removal before transcript release. Similarly,
another study reported that RNA pol II pausing at the 3´end of introns in
reporter genes is coincident with splicing factor recruitment118. Together,
these results suggest that co-transcriptional splicing in eukaryotes represents
the rule rather than the exception.
Early studies on co-transcriptional splicing were mainly based on a single
gene or a limited number of genes. Consequently, an accurate estimate of
the global frequency of co-transcriptional splicing was lacking. Recent
advances in global approaches now allow a more comprehensive
analysis of co-transcriptional splicing events in tissues and cells. A
handful of global analyses investigating co-transcriptional splicing events in
tissues and cells in four different organisms have been published, including
the analysis presented in Paper I in this thesis. Although these studies differ
in their experimental and analytical strategies, they generally conclude that
co-transcriptional splicing is widespread but not exclusive in cells and
tissues134-139. The frequencies of co-transcriptional splicing reported in these
studies are presented in Table 1.
Table 1. Global studies on co-transcriptional splicing in four different organisms

Organism                   Co-transcriptional splicing frequency
S. cerevisiae134           0.74
Drosophila135              0.83
Human (HeLa cells)136      0.80
Human (K562 cells)139      0.75
Human (B-cells)137         0.75
Mouse liver138             0.45
Paper I in this thesis     0.84

Co-transcriptional splicing may be advantageous for gene expression as it
may facilitate recognition of splicing signals as they emerge from the
elongating transcription unit, and it may provide a platform for transcription
and splicing to regulate and facilitate reciprocal functions. For these reasons,
additional extended analysis of co-transcriptional splicing on a global level
will be important to fully understand mechanisms for gene expression
regulation in development and disease.
Polyadenylation
The vast majority of protein-coding and noncoding RNA transcripts undergo
polyadenylation in the nucleus, a process by which a stretch of adenine bases
is added to the 3´ end of RNA molecules. The proper addition of a poly(A)
tail is required for nuclear export, stability of the mRNA transcript and for
efficient translation by serving as a major docking platform for factors that
control and regulate these processes140-142. Polyadenylation is a two-step
process that involves an endonucleolytic cleavage of a newly produced pre-mRNA transcript and an addition of 200-300 adenine residues. These steps
require several cis-elements and various trans-acting proteins143. The key
cis-elements include a polyadenylation signal (PAS), a motif with the
consensus sequence AAUAAA located 15-25 nucleotides upstream of the
cleavage site, and a downstream GU-rich sequence element. There are
several essential trans-acting factors that catalyze polyadenylation and
cleavage. These include the cleavage and polyadenylation specificity factor
(CPSF), which recognizes the PAS, and the cleavage-stimulating factor
(CSTF) that binds to the GU-rich region. These factors in turn recruit the
poly(A) polymerase (PAP) to the cleavage site144,145. Polyadenylation can
occur at multiple PASs, leading to the production of different transcripts
from the same gene by alternative polyadenylation (APA). In fact, recent
discoveries using genome-wide high throughput technologies have
demonstrated that APA is a widespread phenomenon that occurs in
approximately 70–75% of human genes146, thereby adding an extra layer of
complexity to the mechanisms of gene expression regulation. Of note,
increasing evidence suggests interplay between APA and transcription
through physical and kinetic coupling. Several polyadenylation factors have
been demonstrated to directly interact with the RNA pol II CTD and with
transcription factors147,148, thereby increasing cleavage and polyadenylation
efficiency149. Similar to alternative splicing regulation, kinetic coupling
(represented by the transcription elongation rate) has been shown to
influence the choice of polyadenylation sites150. Splicing and
polyadenylation factors also interact, leading to enhanced cleavage
efficiency at the poly(A) sites143.
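The positional relationship between the PAS and the cleavage site described above can be sketched in a few lines of Python. This is a minimal illustration, not a polyadenylation site predictor: it only searches for the consensus AAUAAA hexamer and ignores the downstream GU-rich element and PAS variants. The function names are hypothetical, and the sketch assumes that the 15-25 nucleotide distance is counted from the last base of the hexamer, which the text does not specify.

```python
import re

PAS_HEXAMER = "AAUAAA"  # consensus polyadenylation signal from the text


def find_pas_sites(sequence):
    """Return 0-based start positions of every AAUAAA hexamer.

    DNA-style input (T instead of U) is accepted and converted to RNA.
    """
    rna = sequence.upper().replace("T", "U")
    # A zero-width lookahead is used so that overlapping hexamers are reported.
    return [m.start() for m in re.finditer(f"(?={PAS_HEXAMER})", rna)]


def candidate_cleavage_window(pas_start, min_gap=15, max_gap=25):
    """Return (first, last) 0-based positions where cleavage would be expected.

    Assumption: the gap is measured from the last base of the hexamer.
    """
    last_pas_base = pas_start + len(PAS_HEXAMER) - 1
    return last_pas_base + min_gap, last_pas_base + max_gap


# Toy transcript 3' end: 4 leading bases, one PAS, 30 trailing bases.
toy_transcript = "GGCC" + "AAUAAA" + "C" * 30

for start in find_pas_sites(toy_transcript):
    first, last = candidate_cleavage_window(start)
    print(f"PAS at {start}; candidate cleavage positions {first}-{last}")
```

A transcript with several hits would yield several candidate windows, which is the situation that gives rise to alternative polyadenylation.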
Transcriptome analysis: In light of next-generation
sequencing
As described in previous sections, the transcriptome represents the complete
set of transcripts in a cell at any given time. The ultimate aim of whole
transcriptome analysis is to establish and characterize a catalogue of all the
transcripts within a particular cell/tissue, to determine transcript structure,
splicing and regulation patterns, and to quantify their differential expression
in both normal physiological and pathological conditions.
Many different technologies have been utilized to study the
transcriptome. Until the introduction of next-generation sequencing,
transcriptome studies mainly relied on hybridization-based approaches such
as transcription tiling microarrays. These approaches represented high-throughput, rapid and cost-effective methods to investigate gene expression
patterns. They are typically based on hybridizing fluorescently labeled
cDNA to custom-designed or standardized high-density probes anchored to
an array slide. Due to their nature, these approaches possess several
drawbacks including high background and differential hybridization
affinities across the probes. In addition, they generally rely on existing
knowledge about transcribed sequences, which limits their ability to identify
novel transcripts and splicing events. In contrast, recent advances in next-generation
sequencing, especially RNA-seq, have offered a revolutionary
platform for hypothesis-free analysis of the transcriptome at
unprecedented single-base-pair resolution. RNA-seq also offers a higher dynamic
range than arrays for detecting transcripts at different expression
levels, combined with minimal noise. These technologies are
generally based on converting RNA into a cDNA fragment library. These
fragments are then sequenced in a parallelized, high-throughput manner.
Early RNA-seq studies have proven effective in investigating the mammalian transcriptome landscape and revealed much more complexity than
previously believed. Before the advent of RNA-seq, several studies using
tiling microarray analysis revealed pervasive transcription over most of the
genome and a large proportion of transcripts mapped to intergenic regions
with unknown function (also referred to as dark matter)11,151. These observations were challenged when early RNA-seq studies showed that most sequencing reads align to known genes or around known genes with many
reads located in intronic regions, indicating that most dark matter might represent transcriptional noise and experimental background13. Nevertheless,
some intergenic transcripts exhibit functional characteristics. More recent
RNA-seq data from the ENCODE project revealed that three-quarters of the
genome is transcribed, refueling the debate about the dark matter14.
RNA-seq has also emerged as an effective method to assay the presence and
prevalence of RNA transcripts. While microarrays are associated with significant background noise that limits their capacity for accurate quantitative
readout, RNA-seq data provides highly reproducible and quantitative measurements of transcript abundance152,153. The high resolution, sensitivity and
low background of RNA-seq have provided us with an unprecedented global
view of the composition and complexity of the transcriptome in different
species. These advantages have enabled the identification and characterization of many novel transcripts, exons and splicing events, and have provided
an accurate identification of the 5´ and 3´ ends of genes95,154-157. Other studies have also demonstrated the ability of RNA-seq to detect fusion transcripts158,159, identify long noncoding RNAs (lncRNAs)160 and identify single
nucleotide variants (SNVs) and allele ratios in RNA161. Although RNA-seq
has provided tremendous advantages, this method is still associated with
certain challenges, such as biases associated with library preparations and
the need for continuous improvements in statistical analysis and computational infrastructure. In addition, the transcriptome is dominated by highly
expressed transcripts that consume most of the sequencing reads, leading to
less power to detect transcripts expressed at low levels. To address this, target-enrichment strategies, which were originally used to capture specific sequences of DNA, have been used to perform capture on cDNA. Combined
with RNA-seq, this method has led to the identification of many novel transcripts and splice junctions which are below the level of detection in conventional RNA-seq162 (plus Paper II in this thesis). The function and the nature
of these rare transcripts still need to be investigated.
Current RNA-seq studies largely depend on poly(A) enriched RNA
populations to study transcriptomes. Most RNA-seq results therefore provide
little information about transcription dynamics and about forms of RNA that
lack poly(A) tails. Recent studies have implemented different manipulations
and improvements to RNA sample preparation steps to extend the
boundaries of RNA-seq technologies, thereby obtaining deeper insight into
the dynamics of gene expression and RNA processing. For example, two
technologies named GRO-seq (Global Run-On Sequencing)163 and NET-seq
(Native Elongating Transcript Sequencing)164 have been developed to
provide insights into the dynamics of transcript synthesis and decay by
capturing and sequencing nascent transcripts genome wide. In parallel, other
studies rely on cellular fractionation to isolate and sequence sub-populations
of RNA molecules. RNA sequencing of these fractions revealed interesting
findings related to splicing kinetics, co-transcriptional splicing, post-transcriptional splicing, polyadenylation, transcript release from the
chromatin and miRNA distribution and function139,165-168. These studies also
provide strong evidence for extensive cross-talk between RNA biogenesis
and processing mechanisms.
Enabling technologies and method
development
RNA sequencing
Background
One of the most valuable achievements in genomics has been the ability to
perform a comprehensive, dynamic, unbiased and hypothesis-free analysis of
a genome in a high-throughput manner. Approaches developed for
investigating and quantifying RNA by high-throughput sequencing (next-generation sequencing, NGS) are collectively called RNA-seq169. NGS
platforms for performing RNA-seq studies are dominated by three different
companies: Roche 454 (FLX pyrosequencing system), Illumina (Genome
Analyzer) and Life Technologies (SOLiD system). For each of these
platforms, RNA fragments are first converted to cDNA libraries and then
sequenced in parallel, producing massive numbers of relatively short reads.
However, the different platforms produce reads that vary in length,
ranging from 30-150 bp for SOLiD and the Genome Analyzer to 200-500 bp for
FLX pyrosequencing. It is important to mention that sequencing read
length and the number of reads produced are continuously increasing.
The SOLiD platform (Life Technologies) was used for sequencing in all
the papers presented in this thesis. However, the starting material was
different in each of the studies and included total RNA (Paper I), exome-captured RNA (Paper II), cytoplasmic, nuclear and poly(A) RNA (Paper III)
and exome-captured DNA (Paper IV). I will discuss SOLiD sequencing
technology in more detail below.
Library preparation
The quality of RNA samples is controlled using the Agilent Bioanalyzer
and only Agilent-defined RIN values above 7.0 are accepted. RNA samples
are first depleted of ribosomal RNA (rRNA) using the RiboMinus™
Eukaryote Kit v2 (Ambion), except for the nuclear RNA. rRNA-depleted
samples are then fragmented using RNase III and subsequently purified.
Adaptors are then ligated to the fragmented and purified RNA, and cDNA is
generated and amplified in a strand-specific manner. Thereafter, the cDNA
fragments are clonally amplified on SOLiD™ P1 DNA Beads by emulsion
PCR. After the enrichment of the beads with cDNA, the beads are deposited
and covalently attached on a glass slide.
SOLiD sequencing
Following the library preparation, the first step in the sequencing reaction is
the binding of a universal sequencing primer to the P1 adaptor which is
attached to the cDNA template. Mixtures of four differentially labeled di-base probes, where the first two bases are known, compete for binding to the
template. The bound probe, where the first two bases match those of the
template, is ligated to the free 3´end of the universal primer. Subsequently,
further probes bind and ligate to the growing strand. After each ligation, a
fluorescent marker attached to each probe is detected and cleaved. The
template sequence is interrogated by the specificity of the 1st and 2nd bases in
each ligation reaction. The sequencing reaction is outlined in Figure 5.
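Because only four fluorescent labels are available for sixteen possible di-base combinations, each label necessarily represents four dinucleotides. The sketch below assumes the standard SOLiD two-base ("color space") encoding, which is not spelled out in this text: with A=0, C=1, G=2 and T=3, the color of a dinucleotide is the bitwise XOR of its two base codes. The function names are hypothetical.

```python
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = {code: base for base, code in BASE_TO_CODE.items()}


def encode_colors(sequence):
    """Translate a base sequence into colors, one color per adjacent base pair."""
    return [BASE_TO_CODE[a] ^ BASE_TO_CODE[b] for a, b in zip(sequence, sequence[1:])]


def decode_colors(first_base, colors):
    """Rebuild the base sequence from a known first base and a list of colors.

    Each color, combined with the previous base, determines the next base.
    """
    bases = [first_base]
    for color in colors:
        bases.append(CODE_TO_BASE[BASE_TO_CODE[bases[-1]] ^ color])
    return "".join(bases)


template = "ATGGC"
colors = encode_colors(template)
print(colors)                      # one color per overlapping dinucleotide
print(decode_colors("A", colors))  # recovers the template
```

Decoding requires one known starting base, which in practice is provided by the last base of the adapter. A single miscalled color changes every downstream base when decoded naively, which is one reason reads are usually aligned in color space.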
[Figure 5 image; labels: bead, P1 adapter, universal seq primer, template sequence, ligase, ligate, cleavage]
Figure 5. Overview of the SOLiD sequencing chemistry. Figure adapted from Life
Technologies Corporation © 2013
After multiple cycles of ligation, detection and cleavage, the template is
reset. This is done by removing and discarding the initial primer and all the
ligated probes from the template. The number of the cycles determines the
eventual read length. After a reset, a new universal primer starting from n-1
is used and the cycles of ligation, detection and cleavage are repeated. The
process is repeated with five different universal primers (n, n-1, n-2, n-3, n-4), generating overlapping reads with each base interrogated twice.
Universal primers and the dual interrogation system are illustrated in Figure
6.
Figure 6. The SOLiD dual interrogation of each base using the five different universal primers. For example, universal primer 2 and 3 interrogate the base at read position 10. Figure adapted from Life Technologies Corporation © 2013
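The dual interrogation scheme can be checked with a short sketch. It assumes that each ligation advances the growing strand by five bases (an eight-base probe from which three bases are cleaved), that the first two bases of each probe are the interrogated ones, and that primer 1 corresponds to n, primer 2 to n-1, and so on. These details follow the standard SOLiD chemistry and the primer names in the text; the function names are hypothetical.

```python
def interrogated_pairs(primer_offset, n_ligations, step=5):
    """Template positions read in each ligation cycle of one primer round.

    Position 1 is the first template base after the P1 adapter; positions
    of 0 or below lie inside the adapter. primer_offset is 0 for primer n,
    1 for primer n-1, and so on.
    """
    pairs = []
    for cycle in range(n_ligations):
        first = 1 - primer_offset + cycle * step
        pairs.append((first, first + 1))
    return pairs


def primers_per_position(read_length, n_primers=5, step=5):
    """Map each read position to the list of primer rounds that interrogate it."""
    n_ligations = read_length // step + 1  # enough cycles to reach the last base
    covered = {pos: [] for pos in range(1, read_length + 1)}
    for primer in range(1, n_primers + 1):
        for pair in interrogated_pairs(primer - 1, n_ligations, step):
            for pos in pair:
                if pos in covered:
                    covered[pos].append(primer)
    return covered


coverage = primers_per_position(read_length=25)
for position in (1, 10, 25):
    print(position, coverage[position])
```

With these assumptions, every read position is covered by exactly two primer rounds, and position 10 is covered by primers 2 and 3, in agreement with the example in the Figure 6 caption.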
Exome sequencing
Background
Over recent years, great advances have been achieved in next-generation
sequencing and capture technologies, enabling the identification of almost
all protein-coding variants in the genome170. Because a large proportion of
disease-causing mutations are likely to be located in coding regions, exome
capture followed by massively parallel sequencing has promoted the
discovery of many disease-causing variants in Mendelian diseases and rare
variants in complex traits171,172. Compared to whole-genome sequencing,
which is still prohibitively expensive for most applications, and GWAS,
which is largely based on the analysis of common variants located in noncoding regions, exome sequencing has proven more efficient and cost-effective
for detecting disease-associated, protein-coding variants. Exome
sequencing has also emerged as an approach to study de novo mutations and
their role in the genetics of complex diseases173. In fact, several recent
studies focused on parent-offspring exome sequencing have provided strong
evidence that de novo mutations in many different genes represent a major
cause for common neurodevelopmental diseases such as intellectual
disability and autism48,174-176. The results also indicate that de novo mutations
are more common in patients than controls174. A de novo mutation is a
spontaneous alteration in a gene that occurs for the first time in one family
member due to a mutation either in a germ cell of one of the parents or
during early cell divisions in the fertilized embryo. De novo mutations are
considered rare genetic variants and they tend to be more harmful than
inherited mutations because they have not been subjected to the same levels
of evolutionary selection177,178. Collectively, de novo mutations are not rare:
the per-generation mutation rate is between 7.6 × 10⁻⁹ and 2.2 × 10⁻⁸,
corresponding to 50-120 new mutations per generation, with an average of
0.86 amino acid-altering mutations per newborn179,180. The fact that these
mutations are under less stringent selection and are collectively common
makes them prime candidates as causative mutations in sporadic and
commonly occurring diseases.
Sequencing library preparation and exome capture
SOLiD-based exome sequencing relies on introducing an exome
enrichment step after preparing the sequence library. The DNA sample is
first fragmented to an average size of 165 bp using the Covaris® System.
The DNA libraries are prepared as mentioned in the Library preparation
section (see above). The genomic DNA library is then ready to use for an
exome enrichment stage.
Exome capture in the projects described in this thesis was performed
using the Agilent SureSelect 50 Mb Exome Enrichment kit (Agilent
Technologies). In this assay, the amplified sequencing library is hybridized
to the SureSelect Capture Library mix containing biotinylated RNA probes
(BAITS). The hybridized DNA fragments (exome) are then captured using
streptavidin-coated magnetic beads. After several washing steps, the exome
library is amplified and then ready for sequencing. The exome capture
procedure is illustrated in Figure 7. In the case of Paper II, where exome
capture was performed on RNA, double stranded cDNA was generated prior
to the library preparation and exome capture steps.
Figure 7. The SureSelect Target Enrichment process. Figure adapted from Agilent
Technologies Copyright © 2013
Cytoplasmic and nuclear RNA purification
Isolating high-quality RNA from tissues, blood and cells is the first and
most crucial step in RNA-based research. RNA extraction
methods are used in basic molecular biology research, medical research and
in the diagnosis of many medical conditions. Currently, RNA research is
mainly performed on total and poly(A) RNA. RNA sequencing using total
RNA provides a snapshot of the complete content of RNA molecules in cells
or tissues, whereas poly(A) RNA enriches for mature transcripts. Working with
RNA-seq, we realized that total RNA represents heterogeneous pools of
RNA populations with different RNA molecules at different levels of
maturity. We also know that poly(A) selection introduces several biases to
RNA samples. These biases mainly arise from poly(A) stretches in introns,
polyadenylated pre-mRNA transcripts166, and from the fact that
polyadenylation also occurs during degradation of incomplete RNA
polymerase I transcripts181. In an attempt to increase the sensitivity of RNA
sequencing, we aimed to develop a protocol to obtain pure populations of
mRNA transcripts enriched from the cytoplasmic fraction and nascent RNA
enriched from the nuclear fraction. To achieve this goal, we modified the
original protocol of the commercially available kit from Norgen Biotech
(Cytoplasmic & Nuclear RNA Purification Kit), which suffers from
significant cross contamination between the fractions, and we were able to
obtain highly pure subcellular RNA populations. Sequencing the RNA
fractions extracted using this new protocol allowed us to obtain greater
insight into transcription dynamics from the nuclear fraction and a more
accurate assessment of RNA expression from the cytoplasmic fraction. The
modifications applied to the original protocol are illustrated in Figure 8.
Briefly, frozen tissue (20 mg) is ground in liquid nitrogen using a mortar
and pestle. Tissue powder is transferred to ice cold 1.5 ml tubes. Then, 200
µl lysis buffer (Norgen) is added to the ground tissue. The tubes are
incubated on ice for 10 minutes and then centrifuged for 3 minutes at 13,000
rpm to separate the cellular fractions. The supernatant containing the
cytoplasmic fraction and the pellet containing the nuclei are each mixed with 400
µl 1.6 M sucrose solution and carefully layered on top of a 500 µl sucrose
solution in two separate tubes. Both fractions are then centrifuged at 13,000 rpm for 15 minutes at 4 °C. The cytoplasmic fraction is collected from the
top of the sucrose cushion and the cytoplasmic RNA is then further purified
according to the Norgen kit recommendations. The nuclear pellet is collected
from the bottom of the tube and washed with 200 µl 1x PBS. The nuclei are
collected after another centrifugation step at 13,000 rpm for 3 minutes. The
nuclear RNA is then purified from the nuclear fraction according to the
Norgen kit recommendations.
Figure 8. Schematic illustration of the cytoplasmic and nuclear RNA purification
using the Cytoplasmic and Nuclear RNA Purification kit (Norgen), with the
modifications to the protocol marked in the green box.
The present investigation
Aims
The general aim of this thesis is to provide a better global view of the human
transcriptome by analyzing its content, synthesis, processing and regulation
using next generation sequencing. In the long run, a better characterization
of the transcriptome will offer a valuable resource to understand the
molecular mechanisms in health and disease conditions. It will also provide
a tool for more accurate disease diagnosis and efficient treatment. The
specific objectives of each paper in this thesis are:
Paper I
Characterize the global pattern of intronic transcription and co-transcriptional splicing in the human brain using RNA-seq.
Paper II
Utilize the increased sensitivity of targeted enrichment of coding sequences
via RNA exome capture and massively parallel sequencing to improve the
detection of lowly expressed transcripts genome wide.
Paper III
Develop a more efficient cellular fractionation method to obtain
homogeneous pools of RNA molecules for more focused and comprehensive
transcriptome analysis using RNA sequencing.
Paper IV
Functionally characterize a causative de novo mutation identified in a
patient with intellectual disability and to understand how it may be linked to
the clinical manifestations of the patient.
Results and discussion
Total RNA sequencing reveals nascent transcription and
widespread co-transcriptional splicing in the human
brain (Paper I)
The advent of next-generation sequencing (NGS) has revolutionized our
understanding of the transcriptome and revealed greater complexity than
previously believed. In the last decade, a series of tiling array studies showed
that a much larger portion of the mammalian genome is transcribed than was
previously expected and many transcripts were detected outside known
genes. More recent studies of the mammalian transcriptome using RNA-seq
have revealed that most sequencing reads map within known genes and
many are intronic, indicating that most transcripts outside known genes represent experimental background or transcriptional noise. The function and
the nature of intron transcripts have not been thoroughly investigated.
In this paper, we aimed to investigate the origin and function of intronic
transcripts. We performed RNA-seq on brain and liver samples obtained
from adult and fetal human tissues using the SOLiD4 system. Our total RNA
sequencing results indicated that around 70%-80% of the sequence reads
map to regions outside known exons. Specifically, 38% of these reads map
to introns (Figure 1 in Paper I). The results also showed that the intronic
reads were significantly more enriched in brain compared to liver (P value
< 10⁻³⁰⁰) and in fetal brain compared to adult brain, indicating a high level of
nascent transcription in brain tissues and especially in fetal brain.
Interestingly, the genes with the highest intronic expression were involved in
brain development and neural signaling pathways. The higher levels of
intronic RNAs in fetal brain were further validated using qrtPCR.
Another interesting observation was evident in our global expression
analysis. The intronic reads were unequally distributed across introns. The
expression of intronic RNAs was substantially higher in the 5’ end
compared to the 3’ end, forming a clear saw-tooth-like pattern, especially in
the long introns (Figure 9). To experimentally validate these observations,
we conducted qrtPCR on two genes with high intronic expression, namely
NRXN1 and GRID2. The qrtPCR results showed high correlation with the
RNA-seq data and indicated that the highly expressed 5’ intronic RNAs are
connected to the upstream exon, whereas low levels of the 3’ intronic
transcripts are attached to their adjacent exon.
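[Figure 9 graphic: (a) UCSC Genome Browser views of AUTS2 and C21orf34 with RefSeq gene tracks (scale bars: 500 kb and 200 kb; additional labels: MIR99A, MIRLET7C, MIR125B2); (b) schematic of transcripts with poly(A) tails (AAAAA) along RefSeq genes.]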
Figure 9. Nascent transcription and co-transcriptional splicing. (a) Pattern for
AUTS2 (top) and C21orf34, a noncoding RNA gene (bottom), viewed in the University of California, Santa Cruz (UCSC) Genome Browser. The RNA-seq signals have
been smoothed using window averaging. For both protein coding genes and long
noncoding RNA genes, there is an apparent 'saw-tooth' pattern with higher RNA-seq
read signals toward the 5′ end of each intron. (b) Model for co-transcriptional splicing. The total RNA-seq data give rise to a typical saw-tooth pattern across genes that
are actively transcribed. The gradient of RNA across the introns can be explained by
a large number of nascent transcripts in various stages of completion. The pattern is
repeated for each intron because the nascent transcript is spliced very rapidly after
the polymerase completes transcribing each intron. The sequence read coverage is
comparatively higher for exons, as the RNA-seq is measuring both the pool of nascent transcripts and the pool of mature polyadenylated RNA.
Based on these results, we proposed an explanation for this observation.
In our model, we assume that each gene is transcribed continuously forming
a pool of RNA molecules at different stages of maturation and processing.
These molecules generate a gradient of nascent transcripts along the whole
gene. The fact that this gradient is over a single intron, rather than the whole
gene, is explained by co-transcriptional splicing. In other words, the splicing
machinery shadows the transcription machinery and excises each transcribed
intron (Figure 9b). To investigate this phenomenon globally, we measured
the average read depth for all introns in brain and liver tissues. As expected,
the analysis showed that large introns (>100 kb) in the brain contained the
highest levels of intronic reads at their 5’ end compared to the small introns,
particularly in genes involved in axonal growth and synaptic transmission. These results can be explained by the fact that shorter introns are transcribed and spliced faster than larger introns, and the resolution of RNA-seq
is too limited to detect these events. The analysis also confirmed the differences
between brain and liver tissue as well as between fetal and adult brain. An
independent qrtPCR quantification of the differences showed that intronic
RNAs were higher in fetal brain, while exonic RNAs were equal between
both tissues. These observations might indicate higher rates of transcription
and turnover in the fetal brain.
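The model described above can be illustrated with a small simulation. The following is a minimal sketch, not the analysis used in Paper I: it assumes polymerases are spread uniformly along a toy gene and that each intron is excised the moment it has been fully transcribed. The function name `nascent_coverage`, the toy gene layout and the `step` parameter are illustrative assumptions.

```python
def nascent_coverage(segments, step=1):
    """Per-base coverage from a toy pool of nascent transcripts.

    segments: list of (kind, length) tuples, where kind is "exon" or "intron".
    One polymerase is placed every `step` bases (assumption: uniform density).
    """
    # For every base, remember where its intron ends (None for exonic bases).
    intron_end = []
    position = 0
    for kind, length in segments:
        end = position + length
        intron_end.extend([end if kind == "intron" else None] * length)
        position = end
    gene_length = position

    coverage = [0] * gene_length
    # A polymerase at coordinate `pol` has transcribed bases [0, pol).
    for pol in range(step, gene_length + 1, step):
        for base in range(pol):
            end = intron_end[base]
            # Assumption: an intron is spliced out as soon as it is complete.
            if end is not None and end <= pol:
                continue
            coverage[base] += 1
    return coverage


# Hypothetical toy gene: three 2-base exons separated by two 10-base introns.
toy_gene = [("exon", 2), ("intron", 10), ("exon", 2), ("intron", 10), ("exon", 2)]
coverage = nascent_coverage(toy_gene)
print("intron 1:", coverage[2:12])
print("intron 2:", coverage[14:24])
```

Under these assumptions the coverage declines from the 5’ end to the 3’ end of each intron and the same decline restarts in the next intron, which is the saw-tooth shape; exonic bases keep accumulating signal because they are never removed. Mature polyadenylated RNA, which would raise exon coverage further, is deliberately left out of the sketch.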
Early studies of splicing regulation proposed that understanding the order
of intron splicing in a single transcript is crucial to understand splicing kinetics. A recent study by de la Mata et al114 proposed the model “first come first
served (committed)” for intronic splicing. This model implies that as introns
are being transcribed, they commit to the splicing machinery directly. To test
this theory in our data, we designed a PCR experiment to investigate co-transcriptional splicing of consecutive exons in NRXN1 and ERBB4. Our
results demonstrated that indeed these introns are co-transcriptionally
spliced. However, in agreement with de la Mata et al, we noticed a time gap
between the transcription and the excision of these introns. We also observed
that co-transcriptional splicing is not restricted to the long introns or to a certain tissue. In fact, it also occurs in short introns and at different levels in
other human tissues. To show that our analysis can measure levels of co-transcriptional splicing, we re-sequenced the fetal brain to obtain higher read
coverage. We then analyzed the levels of co-transcriptional splicing globally
by quantifying the levels of intronic expression adjacent to the 3’ and 5’ end
of each exon. After correcting for intron length, the analysis revealed evidence for co-transcriptional splicing for 84% of exons surrounded by long
introns. Moreover, introns surrounding cassette exons showed high levels of
intronic RNA compared to constitutive exons. In agreement with Pandya-Jones et al and de la Mata et al113,130, our results suggest that introns surrounding alternatively spliced exons are excised at a slower rate.
The results in this study indicate that sequencing reads over introns
mainly originate from nascent RNA transcripts undergoing transcription.
Moreover, the distribution of these reads over the introns, reflected by a saw-tooth-like pattern, revealed widespread co-transcriptional splicing in the human brain. The results also demonstrate that genes with long introns and
high levels of intronic RNA are associated with neurodevelopmental disorders.
These genes may experience more complex transcriptional regulation patterns. Our experimental model provided a unique platform to view and analyze nascent transcripts undergoing transcription and measure levels of co-transcriptional splicing in human tissues.
Exome RNA sequencing reveals rare and novel
alternative transcripts (Paper II)
One of the ultimate goals of transcriptomics research is to identify the
complete catalogue of expressed transcripts. RNA sequencing technologies
have provided valuable tools to assess the quantity and composition of the
human transcriptome and have led to the identification of many unannotated
transcripts and exons. However, the detection of rare transcripts and splice
junctions still represents a challenge for these technologies.
In this study, we aimed to improve the detection of transcripts expressed at
low levels genome-wide by performing exome capture on RNA followed by
massively parallel RNA-seq (ExomeRNAseq). We performed target
enrichment of exome sequences using the Agilent SureSelect 50 Mb kit on
four cDNA samples from an adult liver, an adult cortex, a fetal liver and a
fetal cortex. The captured targets were then sequenced on the SOLiD4 and
73 million reads per sample were generated. The expression analysis of
ExomeRNAseq using reads per kilobase per million reads (RPKM) showed
weak correlation with the expression analysis from total RNA sequencing
(TotalRNAseq) of the same samples. To explore the distribution of
sequencing read coverage over the genome resulting from the different
sequencing techniques, we compared the fraction of reads mapping to exons,
introns, UTRs and intergenic regions between ExomeRNAseq,
TotalRNAseq and a poly(A) selected RNA-seq. The results demonstrated
that the largest fraction of exonic reads was detected in ExomeRNAseq
compared to the other sequencing technologies. Our analysis also revealed
that larger numbers of genes could be detected using ExomeRNAseq.
Moreover, the transcripts determined to be expressed at low levels using
TotalRNAseq showed higher expression in ExomeRNAseq. To evaluate if
ExomeRNAseq can be considered as a quantitative test, we selected seven
random genes and five genes with differential expression between
ExomeRNAseq and TotalRNAseq and tested their expression using qrtPCR.
The qrtPCR data demonstrated a better correlation with the TotalRNAseq
data indicating that there are differences in capture affinity across the
different probes and that ExomeRNAseq does not accurately reflect
quantitative differences.
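For reference, the RPKM measure used in this comparison can be written out explicitly. The sketch below uses the standard definition (reads per kilobase of feature per million mapped reads); the function name `rpkm` and the example read count and feature length are hypothetical, and only the library size of 73 million reads is taken from the text above.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("feature length and library size must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million).
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Hypothetical gene: 730 reads over 2,000 bp of exon in a 73-million-read library.
example = rpkm(read_count=730, feature_length_bp=2000, total_mapped_reads=73_000_000)
print(example)
```

Because RPKM normalizes only for feature length and library size, it cannot correct for probe-specific capture efficiency, which is consistent with the weaker agreement between ExomeRNAseq and qrtPCR reported above.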
Given that ExomeRNAseq contains a higher fraction of sequence reads
mapping to exonic regions, we hypothesized that the capture technology will
improve splice junction discovery. To evaluate this hypothesis, we used the
TopHat and SplitSeek tools to identify splice junctions and novel exons in
the different samples sequenced. The results demonstrated that
ExomeRNAseq resulted in the identification of 100,000 splice junctions
compared to 25,000 and 44,000 junctions identified using TotalRNAseq and
poly(A) selected RNA-seq, respectively. We further investigated the nature
and the novelty of the identified splice junctions. The analysis was based on
comparing the splice junctions in our samples to University of California,
Santa Cruz genome browser (UCSC) annotated genes. The splice junctions
were divided into two groups: The first group included known splice
junctions reported in the UCSC; the second group included novel splice
junctions that are not represented in the current mRNA annotation in the
UCSC. Our ExomeRNAseq showed a significantly higher efficiency in
detecting novel splice junctions (n=22,432) compared to TotalRNAseq
(n=3,036). To determine if the larger number of novel splice junctions
detected in the ExomeRNAseq data reflected false positive observations, we
calculated the number of novel splice junctions that overlap with the
junctions reported in the EST (UCSC) for both ExomeRNAseq and
TotalRNAseq. We found the fraction of novel splice junctions mapping
to the UCSC EST data to be similar between the two technologies,
suggesting that ExomeRNAseq does not increase the false positive discovery
rate of novel splice junctions. Three novel splice junctions and six novel
exons were further validated using PCR, Sanger sequencing and qrtPCR.
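The known/novel split and the EST cross-check described above amount to simple set operations. The following is a minimal sketch under the assumption that each junction is represented as a (chromosome, donor, acceptor, strand) tuple; the function names and the toy coordinates are hypothetical and do not reproduce the TopHat or SplitSeek output formats.

```python
def classify_junctions(detected, annotated):
    """Split detected junctions into known (annotated) and novel sets."""
    detected, annotated = set(detected), set(annotated)
    return detected & annotated, detected - annotated


def est_supported_fraction(novel, est_junctions):
    """Fraction of novel junctions that also appear in an EST junction set."""
    novel = set(novel)
    if not novel:
        return 0.0
    return len(novel & set(est_junctions)) / len(novel)


# Toy data (hypothetical coordinates).
annotated = {("chr1", 100, 200, "+"), ("chr1", 300, 400, "+")}
est = {("chr1", 100, 350, "+")}
detected = [
    ("chr1", 100, 200, "+"),   # present in the annotation
    ("chr1", 100, 350, "+"),   # unannotated, but seen in ESTs
    ("chr2", 50, 90, "-"),     # unannotated, no EST support
]

known, novel = classify_junctions(detected, annotated)
support = est_supported_fraction(novel, est)
print(len(known), len(novel), support)
```

Comparing `est_supported_fraction` between two sequencing approaches mirrors the check above: a similar EST-supported fraction suggests that the larger novel set is not simply an excess of false positives.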
Having established that ExomeRNAseq represents an efficient technology
to enrich for coding sequences, we sought to assess the ability of
ExomeRNAseq to identify coding variants. We performed single nucleotide
variant (SNV) calling on the data from ExomeRNAseq and TotalRNAseq.
The analysis resulted in the detection of substantially larger numbers of
SNVs in ExomeRNAseq (average of 9,106 per sample) compared to
TotalRNAseq (average of 1,919 variants per sample). However, the SNVs
detected in the TotalRNAseq had better overlap with the SNVs reported in
the dbSNP database (average 91% vs. 83% in the capture data). The multiple
cDNA amplification cycles during sample preparation may introduce
mutations in the original sequence and might represent one potential
explanation for the reduced accuracy of ExomeRNAseq for SNP discovery.
Another explanation may be that the capture probes had some level of
nonspecific binding or cross-hybridization.
In this study, we demonstrate that whole exome capture can be applied to
cDNA to study the transcriptome landscape. We also show that using the
capture step combined with parallel RNA sequencing leads to the detection
of large numbers of novel exons and splice junctions. Compared to
TotalRNAseq, ExomeRNAseq showed increased ability to detect rare
transcripts and splice variants expressed at low levels. However,
ExomeRNAseq is less sensitive in detecting expression differences. We
propose that ExomeRNAseq is an effective platform to expand our
knowledge and understanding of the transcriptome, especially in detecting
novel transcripts, splice variants and disease-causing fusion transcripts. An
illustration of the workflow used to identify novel exons and splice
junctions is presented in Figure 10.
[Figure 10 graphic, panel labels: Reference genome; Data analysis; Splice junction; Known exons; Known splice junctions; Unannotated splice junctions; Intergenic regions; Novel exons (Case 1: 5’ novel exon; Case 2: internal novel exon; Case 3: novel splice isoform; Case 4: 3’ novel exon); Novel exons and novel splice isoforms; PCR validation and Sanger sequencing.]
Figure 10. Illustration of the workflow used for the detection and validation of novel
exons and novel splice junctions.
Efficient cellular fractionation improves RNA
sequencing analysis of mature and nascent transcripts
from human tissues (Paper III)
Currently, total RNA is considered the standard starting material in RNA
sequencing experiments to obtain a global view of the transcriptome.
However, if the aim is to study mature RNA transcripts, poly(A) selected
RNA is the method of choice. RNA sequencing studies using total and
poly(A) selected RNA have successfully provided us with novel insights into
the transcriptome. However, many challenges still remain in sample
preparation and data analysis steps. Previous RNA-seq studies reported that
total and poly(A) RNA represent diverse and heterogeneous pools of
different RNA molecules at different maturation and processing levels.
Moreover, the presence of substantial amounts of intronic RNA originating
from immature transcripts in these RNA populations will consume most of
the sequencing reads, and thereby impact the sensitivity of RNA-seq to
detect transcripts and measure expression levels. Thus, the analysis of a
specific RNA type or a specific regulation mechanism is still not common.
In recent years, considerable efforts have been made to extend the
resolution of RNA-seq data, and to develop new methods and strategies to
monitor alternative regulation of mRNA biogenesis and to detect novel
transcripts expressed at low levels. Recently, two new technologies, GRO-seq and NET-seq, have been described to study nascent transcripts163,164.
Although these technologies have successfully provided snapshots of
ongoing transcription, they do not provide insights into post-transcriptional
events, they require manipulation of the physiological conditions and they
need extensive optimization.
Recent advances in RNA extraction techniques also offer the possibility
to analyze particular pools of RNA molecules by cellular fractionation.
However, these techniques are associated with a significant amount of cross-contamination
between the nucleus and the cytoplasm. To overcome this
problem, recent studies166,182 added a poly(A) selection step to the
cytoplasmic fraction to enrich for mature transcripts, and a chromatin-associated transcript selection step to enrich for nascent transcripts. These
selection steps are time-consuming and require large amounts of starting
material.
In this study, we aimed to investigate the benefits of performing RNA-seq
on fractionated cytosolic and nuclear RNA from small amounts of tissue
starting material and without any selection steps. To obtain high quality
cytosolic and nuclear RNA and minimize the amount of cross-contamination,
we modified and improved the Cytoplasmic and Nuclear
RNA Purification Kit (Norgen). The modifications to the kit’s protocol are
illustrated in Figure 8. Quality assessment of our protocol showed that the
nuclear fraction is virtually free of any ribosomal RNA and no genomic
DNA traces could be detected in the cytosolic fraction.
To analyze the efficiency of the protocol in enriching for the mature
transcripts from the cytosolic fraction and the nascent transcripts from the
nuclear fraction, we sequenced cytosolic and nuclear RNA from two human
fetal frontal cortex tissues, designated sample 1 and 2, along with total and
poly(A) RNA from the same tissues using the SOLiD 5500 platform. Upon
investigation of the sequencing read distribution in the different RNA
fractions, we observed a striking enrichment of exonic reads in the cytosolic
fraction compared to the total, poly(A) and nuclear RNA fractions.
Surprisingly, poly(A) RNA contained high levels of intronic RNA across the
entire transcript, contradicting the idea that poly(A) selection exclusively
enriches for mature RNA. In agreement with our findings, recent studies
demonstrated that transcript polyadenylation might occur before the
transcript is fully spliced166,183. These findings, along with the biases
associated with the poly(A) selection, may provide an explanation for the
high intronic levels in the poly(A) RNA fraction. The differences in exonic
and intronic levels between the different fractions were further validated by
qrtPCR. Of note, based on data from Paper I, we believe these differences
are especially prominent in the fetal brain where long neuronal genes are
highly transcriptionally active.
Next, we devised a measurement called the EI ratio (exonic-to-intronic)
to globally quantify the relative enrichment of exonic compared to intronic
reads in the different fractions. The EI ratio is equal to 1 when all intragenic
reads are exonic and 0 when all reads are intronic. Again, cytoplasmic RNA
had the highest exonic enrichment with ratios of 0.74 and 0.72 for sample 1
and 2, respectively. Poly(A), total and nuclear RNAs show substantially
lower values with EI ratios of 0.27-0.45, 0.20-0.42 and 0.12-0.31,
respectively. Having established that cytosolic RNA is significantly enriched
for exons, we propose that it is the best extraction method for studying
mature transcripts. We also propose that analyzing the nuclear fraction is a
preferable choice to study nascent transcripts as it contains the highest
amount of intronic RNA.
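The text gives the endpoints of the EI ratio (1 when all intragenic reads are exonic, 0 when all are intronic) but not its formula. A minimal sketch, assuming the ratio is the exonic share of intragenic reads, exonic / (exonic + intronic), which satisfies both endpoints; the function name `ei_ratio` and the read counts are hypothetical.

```python
def ei_ratio(exonic_reads, intronic_reads):
    """Exonic share of intragenic reads (assumed form of the EI ratio)."""
    intragenic = exonic_reads + intronic_reads
    if intragenic == 0:
        raise ValueError("no intragenic reads to compute a ratio from")
    return exonic_reads / intragenic


# Endpoints stated in the text.
all_exonic = ei_ratio(100, 0)
all_intronic = ei_ratio(0, 100)

# Hypothetical counts chosen to give a value in the reported cytoplasmic range.
cytoplasmic_like = ei_ratio(74, 26)
print(all_exonic, all_intronic, cytoplasmic_like)
```

A bounded ratio of this kind makes fractions with very different sequencing depths directly comparable, since only the proportion of exonic to intronic reads enters the value.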
We proposed that the higher amount of exonic reads in the cytosolic
fraction leads to increased efficiency in the detection and quantification of
mature RNA transcripts. To test this, we used depth of coverage per million
mapped reads (DCPM) as a measure to quantify the expression levels of all
exons in the human genome. As expected, cytosolic RNA-seq gives the
highest DCPM values for exons compared to the other fractions. We were
also able to quantify the number of expressed exons in the different RNA
fractions by normalizing the DCPM values over the exons against the noise
signals coming from the anti-sense strand. The cytosolic RNA contained 8 to
19% more expressed exons than in poly(A) RNA, and 29 to 49% more than
in total RNA.
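The DCPM measure and the anti-sense normalization are described above only in words. The sketch below is one plausible reading, not the implementation used in Paper III: it assumes DCPM is the mean per-base depth over an exon divided by the number of mapped reads in millions, and that an exon is called expressed when its sense-strand DCPM exceeds the anti-sense signal by a chosen factor. The function names and the `min_fold` threshold are hypothetical.

```python
def dcpm(per_base_depth, total_mapped_reads):
    """Mean depth of coverage over a feature per million mapped reads."""
    if not per_base_depth or total_mapped_reads <= 0:
        raise ValueError("need at least one base and a positive library size")
    mean_depth = sum(per_base_depth) / len(per_base_depth)
    return mean_depth / (total_mapped_reads / 1e6)


def is_expressed(sense_dcpm, antisense_dcpm, min_fold=2.0):
    """Call an exon expressed if sense signal exceeds anti-sense noise.

    min_fold is a hypothetical threshold, not a value from the source.
    """
    return sense_dcpm > min_fold * antisense_dcpm


# Hypothetical exon: three bases with depths 10, 20, 30 in a 2-million-read library.
sense = dcpm([10, 20, 30], 2_000_000)
antisense = dcpm([1, 1, 1], 2_000_000)
print(sense, antisense, is_expressed(sense, antisense))
```

Using the anti-sense strand as a per-exon noise estimate has the advantage that the threshold adapts to local background rather than relying on a single genome-wide cutoff.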
Presently, there is an increasing appreciation of the importance of
alternative splicing as a process for proteome diversity and tissue specific
regulation. Many efforts have been devoted to increase the efficiency of
RNA-seq to detect splice junctions. To test if cytoplasmic RNA-seq
improves the efficiency of splice junction detection compared to poly(A) and
total RNA-seq, we used TopHat software to perform splice junction analysis.
Indeed, we found that cytosolic RNA-seq improves splice junction discovery
and generated data where we detected 10% more junctions than poly(A)
RNA-seq. We next used Trinity software184 to perform de novo
transcriptome assembly for each RNA population. This test allowed us to
evaluate the data in an unbiased fashion without prior information about gene
coordinates. As expected, using cytoplasmic RNA provides longer contigs
and generates 30% more transcripts longer than 1 kb compared to poly(A)
RNA and 10 times more compared to total RNA. These results highlight the
benefits of using cytoplasmic RNA to improve gene discovery of complex
transcriptomes and to study the genetic characteristics of non-model
organisms.
In Paper I, we demonstrated that total RNA-seq could be used to study
nascent transcripts undergoing transcription. We also showed that the 5’-3’
negative gradient of RNA-seq reads across long introns represents an
indicator of ongoing transcription and co-transcriptional splicing. In this
study, we showed that the highest amounts of nascent transcripts were
detected in the RNA-seq data from the nuclear fraction (i.e. showed the
lowest EI ratio). To evaluate if nuclear RNA-seq increases the efficiency of
nascent transcript detection, we performed a global analysis of the sequence
coverage across introns for all RNA fractions. The analysis revealed that
nuclear RNA-seq contains the highest amount of nascent transcripts and the
most distinct 5’-3’ gradients among all RNA fractions. Thus, the nuclear
RNA represents the optimal fraction to study ongoing transcription and
splicing dynamics.
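The text does not state how the 5’-3’ gradients were quantified. As an illustration only, one simple way to compare gradients between fractions is the least-squares slope of binned intron coverage against bin index, where a more negative slope indicates a steeper 5’-to-3’ decline. The function name `coverage_slope` and the toy coverage values are hypothetical.

```python
def coverage_slope(binned_coverage):
    """Least-squares slope of coverage versus bin index (5' bin first)."""
    n = len(binned_coverage)
    if n < 2:
        raise ValueError("need at least two bins to fit a slope")
    mean_x = (n - 1) / 2
    mean_y = sum(binned_coverage) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(binned_coverage))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    return sxy / sxx


# Hypothetical binned coverage across one long intron in two fractions.
nuclear_like = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]      # steadily declining
cytoplasmic_like = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]  # flat background

print(coverage_slope(nuclear_like), coverage_slope(cytoplasmic_like))
```

In practice the coverage would first be normalized for library size and intron length so that slopes are comparable across fractions and introns; that step is omitted here.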
In this study, we presented the advantages of performing RNA-seq on
cytoplasmic and nuclear RNA fractions using our modified and improved
protocol. Compared to poly(A) RNA, which is considered the standard
material for studying mature transcripts, cytoplasmic RNA-seq results in the
detection of proportionally greater amounts of exonic sequences with
minimal levels of intronic reads, offering increased sensitivity for measuring
expression levels, detecting splice junctions and for de novo transcriptome
assembly. On the other hand, nuclear RNA-seq contains higher amounts of
intronic transcripts compared to total RNA-seq, providing a preferable
fraction to study ongoing transcription, splicing dynamics and intron
retention patterns.
Mutation in the chromatin-remodeling factor BAZ1A is
associated with intellectual disability (Paper IV)
Intellectual disability (ID), also known as mental retardation, is a complex
and heterogeneous group of disorders that causes a significant health burden
and affects 3-4% of the general population. The causes of intellectual
disability encompass inborn, environmental and inherited factors. Over the
past years, genetic testing of ID patients was focused on family-based
linkage analysis, homozygosity testing in consanguineous families,
cytogenetic analysis and array based screening. Causative factors have only
been identified in approximately 40% of patients leaving the majority of the
cases without any explanation for their symptoms. It is believed that
sporadic cases of ID are mainly caused by genetic factors. Understanding the
genetic causes of ID benefits the patients and their families by providing
information about the disease prognosis and avoids unnecessary additional
testing. It may also provide better-tailored therapy and medical care. Recent
studies have suggested a major role for de novo mutations in the etiology of
ID, and have led to the identification of many causative de novo mutations in
sporadic ID using family-based exome capture and massively parallel
sequencing.
In this study, we performed exome sequencing on a trio family with
healthy parents and a child suffering from ID, epilepsy, ataxia and hypermobile
joints to investigate if a de novo mutation is responsible for the
disease in the child. Exome sequencing and Sanger sequencing validated a
de novo non-synonymous mutation in the gene BAZ1A which codes for the
chromatin remodeling factor ACF1. The mutation led to a phenylalanine to
cysteine substitution of amino acid 1347 in the linker motif between the
PHD domain and the bromodomain (Brd) in the C terminus of ACF1.
Given the established role of ACF1 in the transcriptional repression of the
vitamin D receptor (VDR) regulated genes, we conducted targeted PCR on
RNA samples from the trio to evaluate the impact of the mutation on the
expression of VDR regulated genes and on the expression levels of BAZ1A.
Interestingly, CYP24A1, involved in the metabolism of vitamin D3 (VD3),
was significantly upregulated in the patient compared to the parents. Of note,
BAZ1A expression levels showed no difference between the trio members.
To determine if the upregulation of CYP24A1 influences VD3 levels, we
performed a clinical test to measure VD3 levels in the patient. Interestingly,
the results showed that the patient was suffering from VD3 insufficiency. To
independently test if the differential expression of CYP24A1 is caused by the
identified mutation in BAZ1A, we performed transient transfection of wild
type BAZ1A (wt-BAZ1A) and mutant BAZ1A (mut-BAZ1A) transcripts in
SAOS2 cell lines. qrtPCR results validated CYP24A1 upregulation in the
cells transfected with the mut-BAZ1A, indicating the role of the de novo
mutation on the levels of CYP24A1.
To investigate whether the mutation influences the expression of genes
involved in other pathways, we performed poly(A) RNA-seq on the trio
samples. RNA-seq results confirmed the previously identified CYP24A1
upregulation. In addition, we identified the expression of SYNGAP1 to be
significantly higher in the patient compared to the parents. SYNGAP1
upregulation was further confirmed using qrtPCR. Mutations in this gene
were previously found to cause epilepsy and ID. Moreover, overexpression
of SYNGAP1 has been shown to influence neuronal signaling and synaptic
function. Subsequent GO analysis of the differentially expressed genes in
the trio RNA-seq revealed significant enrichment of the category
extracellular matrix (ECM) organization. To further study the association of
BAZ1A and genes involved in ECM organization, we carried out an
interaction network analysis between the differentially expressed ECM
organization genes and BAZ1A using GeneMANIA. Interestingly, COL1A2
was among the genes that showed close interaction with BAZ1A. COL1A2,
which is implicated in Ehlers-Danlos syndrome (an inherited connective tissue disorder), was
shown to have direct genetic interaction with BAZ1A.
ACF1 is known to directly interact with the promoter region of many
VDR regulated genes leading to their suppression. In this study, we show
that CYP24A1 is upregulated in the ID child and in the cells transfected with
mut-BAZ1A plasmid. To test if the de novo mutation decreases the
association of ACF1 to its target regions and consequently relieves the
repression of CYP24A1 in the patient, we performed a chromatin immunoprecipitation (ChIP) assay using human ACF1 antibodies on fresh
blood monocytes extracted from the trio. We then conducted qrtPCR to
evaluate the occupancy of ACF1 on the CYP24A1 promoter. The results
indicated that ACF1 is significantly less enriched on the CYP24A1 promoter
in the patient compared with the parents, thereby explaining the increased
expression level of CYP24A1 in the patient.
In summary, we detected a de novo mutation in BAZ1A encoding the
ATP-utilizing chromatin assembly and remodeling factor ACF1. Consistent
with the established role of ACF1 in chromatin remodeling and
transcriptional regulation, our RNA sequencing data revealed that the de
novo mutation influenced the expression profile of several biological
pathways in the patient but most significantly pathways involved in
extracellular matrix organization, vitamin D3 metabolism and synaptic
functions. We performed functional studies and showed that the mutated
ACF1 protein is less efficient in suppressing the expression of CYP24A1.
Moreover, we demonstrate that the overexpression of CYP24A1, SYNGAP1
and COL1A2 detected in the patient is highly correlated with the reported
clinical features of the patient. Our results suggest that the BAZ1A mutation is
the causative mutation.
Concluding remarks and future perspectives
The findings presented in this thesis contribute towards an improved
understanding of the human transcriptome in healthy and diseased conditions
and highlight the advantages of developing novel methods and strategies to
obtain global and comprehensive views of RNA expression and processing.
In Paper I, we show that sequencing of total RNA provides unique insights into RNA processing in cells. Our results revealed that co-transcriptional splicing is a widespread mechanism in human and chimpanzee brain tissues. We also found that co-transcriptional splicing is more frequent in long introns than in short introns. Moreover, we found a correlation
between slowly removed introns and alternative splicing. This study is
among the first to analyze global co-transcriptional splicing events. More
recently, several global studies investigating different tissues and cell lines in
human, mouse, Drosophila and S. cerevisiae, have further supported our
finding that splicing occurs primarily co-transcriptionally134-139. Challenges
for the future include designing studies to fully explore the significance of
co-transcriptional and post-transcriptional splicing on the fate of RNA and
how co-transcriptional splicing contributes to gene expression regulation.
In Paper II, we exploited the benefits of exome capture approaches in
combination with RNA-seq to detect transcripts expressed at low levels and
to identify novel exons and splice junctions. We demonstrate that exome
capture can be successfully applied to double stranded cDNA. We also show
that this approach increases the sensitivity for identifying transcripts expressed at levels below the detection threshold in conventional RNA-seq
methods. Moreover, this approach led to the identification of large numbers
of novel exons and splice isoforms. Therefore, we highlight the advantages
of this approach for genome-wide discovery of novel transcripts, alternative
splice variants and fusion transcripts. This approach has also been used by
Mercer et al., who were able to detect a large number of transcripts expressed at very low levels162. The sequencing depth required to detect such
low level transcripts with conventional RNA-seq would be very costly, making the capture strategy an attractive alternative. However, it is reasonable to
assume that in the near future, conventional RNA-seq will be able to achieve
sufficient resolution at affordable costs. Regardless, we have yet to answer
the important question regarding the functional relevance of low-level transcripts.
In Paper III, we demonstrate the advantages of performing RNA-seq on
separated cytoplasmic and nuclear RNA fractions. Compared to poly(A) and
total RNA, cytoplasmic RNA-seq yields a substantially higher fraction of
exonic sequences with minimal levels of intronic reads, offering increased
sensitivity for measuring expression levels, detecting splice junctions and
assembling transcripts de novo. On the other hand, data from nuclear RNA-seq contains higher sequence coverage over introns compared to total RNA-seq, providing a preferable fraction to study ongoing transcription, splicing
dynamics and intron retention patterns. Recently, there has been considerable effort focused on the development of methods to perform sequencing on
different RNA subtypes. Published data from these efforts have led to interesting findings on nascent transcription and splicing dynamics163,164. Future
RNA-seq experiments on subcellular RNA fractions may provide additional
novel insights into the subcellular localization of different RNA types, nuclear-retained transcripts, translation dynamics and RNA turnover
rates.
In Paper IV, we describe the identification of a non-synonymous de novo
mutation in BAZ1A encoding the ATP-utilizing chromatin assembly and
remodeling factor 1 (ACF1), in a patient with unexplained ID. The mutation
was found to affect the expression of many genes involved in extracellular
matrix organization, synaptic function and vitamin D3 metabolism. The differential expression of CYP24A1, SYNGAP1 and COL1A2 correlates with the
clinical diagnosis of the patient. Given that ACF1 is involved in transcriptional
repression of many genes, and that the mutation likely decreases ACF1’s
binding affinity to its targets, future experiments will include a
ChIP-seq of ACF1 in the trio to further identify affected genes. BAZ1A was
not reported previously to be linked to ID; therefore, we will also scan additional ID patients for mutations in this gene.
The development of next-generation sequencing technologies was certainly one of the greatest breakthroughs since the end of the Human Genome
Project. These technologies have, for the first time, provided a platform to
obtain a global view of genomes. Applying NGS to RNA has pushed the
borders of this technology and allowed a comprehensive analysis of the transcriptome, contributing towards the goal of a complete understanding of
splicing patterns, gene structures and gene expression profiles in both
physiological and pathological conditions. Initial RNA-seq studies have
uncovered extremely complex, yet exquisitely organized, gene expression
programs. Now, rapid improvements in RNA-seq technologies, methods and
analysis are continuously contributing knowledge about the composition of
the transcriptome and the mechanisms that regulate and control gene expression. We now realize that the transcriptome is a very dynamic system, where
transcription is functionally and kinetically coupled to pre-mRNA splicing,
and influences mRNA polyadenylation, stability and export. As a result, the
decision to express a gene is controlled by the combinatorial effect of all
these mechanisms.
Despite the great amount of knowledge we have gained over the years
about gene expression programs, we are just scratching the surface in terms
of understanding transcriptome regulation. For example, we know that pre-mRNA splicing and transcription are coupled, and that splicing in most cases
occurs co-transcriptionally. However, we still lack detailed information
about the extent of co-transcriptional splicing in vivo, how it occurs and how
it is coupled with other RNA processing mechanisms. These questions also
apply to many other mechanisms involved in gene expression. Obtaining
this kind of information will require multidisciplinary work. Sequencing technologies are working in our favor and are developing at a rapid
pace towards longer sequence reads and greater data output. Longer
sequence reads will facilitate read mapping, the identification
of differentially spliced isoforms and the linkage of novel exons to
known genes. They will also allow us to obtain deeper insight into RNA processing mechanisms such as intron removal order, co-transcriptional splicing
frequencies in different tissues and coupling between different RNA processing mechanisms. Increasing data output and sequencing depth could be very
useful for comprehensive profiling of transcriptome composition.
Moreover, more attention has to be paid to experimental design, especially to
the quality and type of RNA to be used in experiments. Low quality RNA
can significantly influence the interpretation of the results. Selective RNA
extraction from different subcellular pools of RNA has become a very powerful strategy to perform more focused analysis of RNA. These more focused studies can also lead to better understanding of the biogenesis and
processing of specific RNA subtypes.
In the forthcoming years, I believe that we will witness many exciting
discoveries in the transcriptomics field, especially with the development of
third-generation amplification-free sequencing technologies such as PacBio,
which promise to improve the sensitivity of gene expression quantification
and to avoid biases associated with PCR amplification and highly expressed
transcripts. As it is becoming increasingly clear that seemingly homogeneous
cell populations can have different gene expression profiles, single-cell
transcriptome analysis will provide new opportunities to decipher expression
patterns linked to tissue heterogeneity in normal and tumor cells.
The hope for the future is to integrate RNA-seq data with complementary
sequencing and omics applications to obtain a comprehensive understanding
of the complete regulatory networks of gene expression in development and
disease.
Acknowledgements
The research in this thesis was performed at the Department of Immunology,
Genetics and Pathology, Medical Faculty, Uppsala University, Sweden. The
research was funded by the Swedish Foundation for Strategic Research
(SSF).
Completing a PhD is truly a marathon event and I would not have been able
to complete this journey without the help and support of the people around
me, only some of whom it is possible to mention here.
First and foremost I would like to express my sincere gratitude and deepest
appreciation to my supervisor Lars Feuk, whose help, knowledge, patience,
commitment and scholarship of the highest standard have inspired and
motivated me and set an example I hope to match some day. Thank you for
your trust and belief in me when I really needed it, encouraging me to follow
my ideas and for the great working environment you have provided in the
lab. Thank you so much Lars for everything! It is always a pleasure and fun
to work with you. You are such a wonderful person and an excellent
supervisor.
Thank you to my co-supervisor Ulf Landegren for all your support, encouragement and fruitful discussions and for always being there whenever I
needed help. You are a great resource to have around.
My deepest appreciation goes to Lucia Cavelier. I really don't know how to thank
you for all the things you have done for me. You have always been there for
me in all aspects. You are a real friend, a great collaborator and a true inspiration. I want to thank you also for your artistic skills, which have made the
cover of this thesis look so fashionable.
To my current and previous colleagues in Lars Feuk’s group: Eva L, for
being such a kind, helpful and supportive colleague; Jonatan H, for being an
important player in all the papers presented in this thesis (…and by the way,
I am in for sequencing the vampire squid genome); Jin Z, for being such a
nice colleague (I wish you all the luck with your PhD); Anna J, thank you for
being the sport spirit of our group; and Johan S, we are missing your stylish
outfits on Fridays.
I would like to offer my special thanks to our collaborators, Ulf Gyllensten,
Adam Ameur, Lucia Cavelier, Lotta Thuresson, Manfred Grabherr, Antonia
Kalushkova and Helena Jernberg Wiklund.
I also want to thank the Uppsala Genome Center for performing all the
sequencing experiments for the studies presented in this thesis. Special
thanks to Inger J, Inger G, Stefan, Ignas, Sebastian, Nina, Cecilia, Christian
and Olga. Olga, I always enjoyed eating your delicious homemade cookies!
I would also like to express my gratitude to the administration and IT departments for always being so nice and patient and for all your professional
support.
During my years in Uppsala, I have had the pleasure of meeting many wonderful people that have become true friends! Thank you for making me feel
at home.
I would like to show my greatest appreciation to my awesome friends and
colleagues!
Friends from Rudbeck:
Spyros, Fiona, Lucy, Nikki, Frank, Lesley, Larry, Jelena, Mi, Sara K,
Antonia, Sara B, Veronica, Tatjana, Marco M, Justyna, Demet, Vasil,
Jimmy, Mattias, Joakim K, Linnea, Ida, Sanna, Michaela and Emma for all
the fun in and outside of Rudbeck. Yo Nikk, I am so glad you are back in
Sweden. By the way, I still only understand 31.5% of what you’re saying.
Spyros! I hope you will behave as a toastmaster, despite my doubts.
Thanks for everything buddy!
Special thanks also to Lars Forsberg, Tom A, Anders H and Adam A, you
are real friends. Lars, I will never forget the days we had in San Francisco
and Vegas. Anders, you are a friend to trust. Thanks for the great days in
Hawaii. Tom, my kettlebell partner, thank you for always being around.
Adam, one day I will be able to top your skills in billiards. Störningsjouren
came to the wrong apartment, didn't they?
Big thanks to Joakim B, Ola, Tomas and Paul for all the fun we have had on
our “guys nights with high drinkability”. Paul, you are a loyal friend with
many talents, thanks for your true friendship.
Thank you to the previous SLE group; Anna-Karin, Prassad, Madhu, Serena,
Angelica, Sara L and Caroline. Angelica, you are a very special friend.
Thank you for all the years and for being a nice colleague, friend and conference companion. Caroline, first of all thank you for introducing me to Lars.
You were always a caring person and very generous in sharing your knowledge and experience. Thank you also for proofreading this thesis.
Many thanks also to the people in the B11-4 BMC corridor for creating such
a unique working environment!
Friends outside Rudbeck
To the T-shirts crew; Daniel C, Daniel O, Ola, Clair, Tanai, Sahar, Jessica,
Anna E, Tod, Tobias J, Tobias B, Angelica and Emilia. Thank you guys for
all the great adventures in Sälen and Norreda.
I am deeply grateful to Sten Dahlborg for letting me know that there is a
world outside science and for all our fruitful discussions. I truly enjoyed all
of our meetings and discussions that have shaped my view of the business
world. I already miss the sashimi at Miss Voon.
Jordanian friends
Thank you to all my friends in Jordan; Muath, Mohamad T, Musa, Khaled A
and Ihab O. Your friendship has been so precious to me all these years. You
have always been there for me even though I have been many miles away!
I am so lucky to have had the unconditional support and encouragement
from my big family in Jordan. Thank you to my uncles, aunts and cousins.
Apologies for not being able to mention all of your names; you are rather a
big group, and I would need a couple of pages just for you. A big
thank you to you all!
To my parents Mustafa and Hanan D, I would have never become the person
I am today without you. Your unconditional love, support and encouragement are invaluable. You have always been there for me no matter what.
To my brothers and sister, Hamze, Tarek, Osama and Seema, you are my
treasures. I could never ask for a better family. A thank you will never be
enough!
Finally, my greatest appreciation and love to Hanan H, who left her family,
friends and work to be with me. Thank you for your endless love and support
and for taking care of me especially over the last few months. Thank you!
Ammar Zaghlool
Uppsala 2013
References
1. Watson, J.D. & Crick, F.H. Molecular structure of nucleic acids; a structure for
deoxyribose nucleic acid. Nature 171, 737-8 (1953).
2. Lander, E.S. et al. Initial sequencing and analysis of the human genome.
Nature 409, 860-921 (2001).
3. Venter, J.C. et al. The sequence of the human genome. Science 291, 1304-51
(2001).
4. Finishing the euchromatic sequence of the human genome. Nature 431, 931-45
(2004).
5. Bernstein, B.E. et al. An integrated encyclopedia of DNA elements in the
human genome. Nature 489, 57-74 (2012).
6. Lareau, L.F., Green, R.E., Bhatnagar, R.S. & Brenner, S.E. The evolving roles
of alternative splicing. Curr Opin Struct Biol 14, 273-82 (2004).
7. Yang, X.J. Multisite protein modification and intramolecular signaling.
Oncogene 24, 1653-62 (2005).
8. Prasanth, K.V. & Spector, D.L. Eukaryotic regulatory RNAs: an answer to the
'genome complexity' conundrum. Genes Dev 21, 11-42 (2007).
9. Mattick, J.S. RNA regulation: a new genetics? Nat Rev Genet 5, 316-23
(2004).
10. Shabalina, S.A. & Spiridonov, N.A. The mammalian transcriptome and the
function of non-coding DNA sequences. Genome Biol 5, 105 (2004).
11. Johnson, J.M., Edwards, S., Shoemaker, D. & Schadt, E.E. Dark matter in the
genome: evidence of widespread transcription detected by microarray tiling
experiments. Trends Genet 21, 93-102 (2005).
12. Mattick, J.S. The functional genomics of noncoding RNA. Science 309, 1527-8
(2005).
13. van Bakel, H., Nislow, C., Blencowe, B.J. & Hughes, T.R. Most "dark matter"
transcripts are associated with known genes. PLoS Biol 8, e1000371 (2010).
14. Djebali, S. et al. Landscape of transcription in human cells. Nature 489, 101-8
(2012).
15. Szymanski, M. & Barciszewski, J. Regulatory RNAs in mammals. Handb Exp
Pharmacol, 45-72 (2006).
16. Frith, M.C., Pheasant, M. & Mattick, J.S. The amazing complexity of the
human transcriptome. Eur J Hum Genet 13, 894-7 (2005).
17. Hahn, S. Structure and mechanism of the RNA polymerase II transcription
machinery. Nat Struct Mol Biol 11, 394-403 (2004).
18. Kadonaga, J.T. Regulation of RNA polymerase II transcription by sequence-specific DNA binding factors. Cell 116, 247-57 (2004).
19. Kornberg, R.D. Eukaryotic transcriptional control. Trends Cell Biol 9, M46-9
(1999).
20. Orphanides, G., Lagrange, T. & Reinberg, D. The general transcription factors
of RNA polymerase II. Genes Dev 10, 2657-83 (1996).
21. Juven-Gershon, T. & Kadonaga, J.T. Regulation of gene expression via the
core promoter and the basal transcriptional machinery. Dev Biol 339, 225-9
(2010).
22. Kim, T.H. et al. A high-resolution map of active promoters in the human
genome. Nature 436, 876-80 (2005).
23. Szutorisz, H., Dillon, N. & Tora, L. The role of enhancers as centres for
general transcription factor recruitment. Trends Biochem Sci 30, 593-9 (2005).
24. Ptashne, M. & Gann, A. Transcriptional activation by recruitment. Nature 386,
569-77 (1997).
25. Buratowski, S. The CTD code. Nat Struct Biol 10, 679-80 (2003).
26. Edmonds, M. & Abrams, R. Polynucleotide biosynthesis: formation of a
sequence of adenylate units from adenosine triphosphate by an enzyme from
thymus nuclei. J Biol Chem 235, 1142-9 (1960).
27. Szentirmay, M.N., Musso, M., Van Dyke, M.W. & Sawadogo, M. Multiple
rounds of transcription by RNA polymerase II at covalently cross-linked
templates. Nucleic Acids Res 26, 2754-60 (1998).
28. Luger, K., Mader, A.W., Richmond, R.K., Sargent, D.F. & Richmond, T.J.
Crystal structure of the nucleosome core particle at 2.8 A resolution. Nature
389, 251-60 (1997).
29. Felsenfeld, G. & Groudine, M. Controlling the double helix. Nature 421, 448-53 (2003).
30. Woodcock, C.L. Chromatin architecture. Curr Opin Struct Biol 16, 213-20
(2006).
31. Edmondson, D.G. & Roth, S.Y. Chromatin and transcription. FASEB J 10,
1173-82 (1996).
32. Knezetic, J.A. & Luse, D.S. The presence of nucleosomes on a DNA template
prevents initiation by RNA polymerase II in vitro. Cell 45, 95-104 (1986).
33. Lorch, Y., LaPointe, J.W. & Kornberg, R.D. Nucleosomes inhibit the initiation
of transcription but allow chain elongation with the displacement of histones.
Cell 49, 203-10 (1987).
34. Li, B., Carey, M. & Workman, J.L. The role of chromatin during transcription.
Cell 128, 707-19 (2007).
35. Paranjape, S.M., Kamakaka, R.T. & Kadonaga, J.T. Role of chromatin
structure in the regulation of transcription by RNA polymerase II. Annu Rev
Biochem 63, 265-97 (1994).
36. Kouzarides, T. Chromatin modifications and their function. Cell 128, 693-705
(2007).
37. Clapier, C.R. & Cairns, B.R. The biology of chromatin remodeling complexes.
Annu Rev Biochem 78, 273-304 (2009).
38. Saha, A., Wittmeyer, J. & Cairns, B.R. Chromatin remodelling: the industrial
revolution of DNA around histones. Nat Rev Mol Cell Biol 7, 437-47 (2006).
39. Ehrenhofer-Murray, A.E. Chromatin dynamics at DNA replication,
transcription and repair. Eur J Biochem 271, 2335-49 (2004).
40. Smith, C.L. & Peterson, C.L. ATP-dependent chromatin remodeling. Curr Top
Dev Biol 65, 115-48 (2005).
41. Ewing, A.K., Attner, M. & Chakravarti, D. Novel regulatory role for human
Acf1 in transcriptional repression of vitamin D3 receptor-regulated genes. Mol
Endocrinol 21, 1791-806 (2007).
42. Lickert, H. et al. Baf60c is essential for function of BAF chromatin
remodelling complexes in heart development. Nature 432, 107-12 (2004).
43. van Attikum, H., Fritsch, O., Hohn, B. & Gasser, S.M. Recruitment of the
INO80 complex by H2A phosphorylation links ATP-dependent chromatin
remodeling with DNA double-strand break repair. Cell 119, 777-88 (2004).
44. Davis, P.K. & Brackmann, R.K. Chromatin remodeling and cancer. Cancer
Biol Ther 2, 22-9 (2003).
45. Morris, C.A. & Mervis, C.B. Williams syndrome and related disorders. Annu
Rev Genomics Hum Genet 1, 461-84 (2000).
46. Zentner, G.E. & Scacheri, P.C. The chromatin fingerprint of gene enhancer
elements. J Biol Chem 287, 30888-96 (2012).
47. Santen, G.W. et al. Mutations in SWI/SNF chromatin remodeling complex
gene ARID1B cause Coffin-Siris syndrome. Nat Genet 44, 379-80 (2012).
48. O'Roak, B.J. et al. Sporadic autism exomes reveal a highly interconnected
protein network of de novo mutations. Nature 485, 246-50 (2012).
49. Kosho, T. et al. Clinical correlations of mutations affecting six components of
the SWI/SNF complex: detailed description of 21 patients and a review of the
literature. Am J Med Genet A 161, 1221-37 (2013).
50. Ward, A.J. & Cooper, T.A. The pathobiology of splicing. Journal of Pathology
220, 152-163 (2010).
51. Gao, K.P., Masuda, A., Matsuura, T. & Ohno, K. Human branch point
consensus sequence is yUnAy. Nucleic Acids Research 36, 2257-2267 (2008).
52. Will, C.L. & Luhrmann, R. Spliceosome Structure and Function. Cold Spring
Harbor Perspectives in Biology 3 (2011).
53. Zhou, Z.L., Licklider, L.J., Gygi, S.P. & Reed, R. Comprehensive proteomic
analysis of the human spliceosome. Nature 419, 182-185 (2002).
54. Rappsilber, J., Ryder, U., Lamond, A.I. & Mann, M. Large-scale proteomic
analysis of the human spliceosome. Genome Research 12, 1231-1245 (2002).
55. Kramer, A. The structure and function of proteins involved in mammalian pre-mRNA splicing. Annual Review of Biochemistry 65, 367-409 (1996).
56. Wahl, M.C., Will, C.L. & Luhrmann, R. The Spliceosome: Design Principles
of a Dynamic RNP Machine. Cell 136, 701-718 (2009).
57. Wu, Q. & Krainer, A.R. AT-AC Pre-mRNA splicing mechanisms and
conservation of minor introns in voltage-gated ion channel genes. Molecular
and Cellular Biology 19, 3225-3236 (1999).
58. Hall, S.L. & Padgett, R.A. Requirement of U12 snRNA for in vivo splicing of
a minor class of eukaryotic nuclear pre-mRNA introns. Science 271, 1716-1718 (1996).
59. Patel, A.A. & Steitz, J.A. Splicing double: Insights from the second
spliceosome. Nature Reviews Molecular Cell Biology 4, 960-970 (2003).
60. Turunen, J.J., Niemela, E.H., Verma, B. & Frilander, M.J. The significant
other: splicing by the minor spliceosome. Wiley Interdisciplinary Reviews-Rna
4, 61-76 (2013).
61. Konig, H., Matter, N., Bader, R., Thiele, W. & Muller, F. Splicing segregation:
The minor spliceosome acts outside the nucleus and controls cell proliferation.
Cell 131, 718-729 (2007).
62. Pessa, H.K. et al. Minor spliceosome components are predominantly localized
in the nucleus. Proc Natl Acad Sci U S A 105, 8655-60 (2008).
63. Valadkhan, S. snRNAs as the catalysts of pre-mRNA splicing. Curr Opin
Chem Biol 9, 603-8 (2005).
64. Ast, G. How did alternative splicing evolve? Nat Rev Genet 5, 773-82 (2004).
65. Ares, M., Jr., Grate, L. & Pauling, M.H. A handful of intron-containing genes
produces the lion's share of yeast mRNA. RNA 5, 1138-9 (1999).
66. Lopez, P.J. & Seraphin, B. Genomic-scale quantitative analysis of yeast pre-mRNA splicing: implications for splice-site recognition. RNA 5, 1135-7
(1999).
67. Black, D.L. Mechanisms of alternative pre-messenger RNA splicing. Annual
Review of Biochemistry 72, 291-336 (2003).
68. Flicek, P. et al. Ensembl 2012. Nucleic Acids Research 40, D84-90 (2012).
69. Modrek, B. & Lee, C. A genomic view of alternative splicing. Nat Genet 30,
13-9 (2002).
70. Berget, S.M., Moore, C. & Sharp, P.A. Spliced segments at the 5' terminus of
adenovirus 2 late mRNA. Proc Natl Acad Sci U S A 74, 3171-5 (1977).
71. Coffin, J.M., Hughes, S.H. & Varmus, H.E. The Interactions of Retroviruses
and their Hosts. (1997).
72. Gatherer, D. et al. High-resolution human cytomegalovirus transcriptome. Proc
Natl Acad Sci U S A 108, 19755-60 (2011).
73. Black, D.L. Protein diversity from alternative splicing: a challenge for
bioinformatics and post-genome biology. Cell 103, 367-70 (2000).
74. Nilsen, T.W. & Graveley, B.R. Expansion of the eukaryotic proteome by
alternative splicing. Nature 463, 457-63 (2010).
75. Graveley, B.R. Alternative splicing: increasing diversity in the proteomic
world. Trends Genet 17, 100-7 (2001).
76. Irimia, M. & Blencowe, B.J. Alternative splicing: decoding an expansive
regulatory layer. Curr Opin Cell Biol 24, 323-32 (2012).
77. Stamm, S. et al. Function of alternative splicing. Gene 344, 1-20 (2005).
78. Kelemen, O. et al. Function of alternative splicing. Gene 514, 1-30 (2013).
79. Yap, K. & Makeyev, E.V. Regulation of gene expression in mammalian
nervous system through alternative pre-mRNA splicing coupled with RNA
quality control mechanisms. Mol Cell Neurosci (2013).
80. Rowen, L. et al. Analysis of the human neurexin genes: alternative splicing and
the generation of protein diversity. Genomics 79, 587-97 (2002).
81. Ichtchenko, K. et al. Neuroligin 1: a splice site-specific ligand for beta-neurexins. Cell 81, 435-43 (1995).
82. Missler, M., Hammer, R.E. & Sudhof, T.C. Neurexophilin binding to alpha-neurexins. A single LNS domain functions as an independently folding ligand-binding unit. J Biol Chem 273, 34716-23 (1998).
83. Wollerton, M.C., Gooding, C., Wagner, E.J., Garcia-Blanco, M.A. & Smith,
C.W. Autoregulation of polypyrimidine tract binding protein by alternative
splicing leading to nonsense-mediated decay. Mol Cell 13, 91-100 (2004).
84. Orengo, J.P. & Cooper, T.A. Alternative splicing in disease. Adv Exp Med Biol
623, 212-23 (2007).
85. Buratti, E., Baralle, M. & Baralle, F.E. Defective splicing, disease and therapy:
searching for master checkpoints in exon definition. Nucleic Acids Research
34, 3494-510 (2006).
86. Faustino, N.A. & Cooper, T.A. Pre-mRNA splicing and human disease. Genes
Dev 17, 419-37 (2003).
87. Lopez-Bigas, N., Audit, B., Ouzounis, C., Parra, G. & Guigo, R. Are splicing
mutations the most frequent cause of hereditary disease? FEBS Lett 579, 1900-3 (2005).
88. Tazi, J., Bakkour, N. & Stamm, S. Alternative splicing and disease. Biochim
Biophys Acta 1792, 14-26 (2009).
89. Fu, X.D. Towards a splicing code. Cell 119, 736-8 (2004).
90. Chen, M. & Manley, J.L. Mechanisms of alternative splicing regulation:
insights from molecular and genomics approaches. Nat Rev Mol Cell Biol 10,
741-54 (2009).
91. Graveley, B.R. Sorting out the complexity of SR protein functions. RNA 6,
1197-211 (2000).
92. Del Gatto-Konczak, F., Olive, M., Gesnel, M.C. & Breathnach, R. hnRNP A1
recruited to an exon in vivo can function as an exon splicing silencer.
Molecular and Cellular Biology 19, 251-60 (1999).
93. Grover, A. et al. 5' splice site mutations in tau associated with the inherited
dementia FTDP-17 affect a stem-loop structure that regulates alternative
splicing of exon 10. J Biol Chem 274, 15134-43 (1999).
94. Zhu, J., Mayeda, A. & Krainer, A.R. Exon identity established through
differential antagonism between exonic splicing silencer-bound hnRNP A1 and
enhancer-bound SR proteins. Mol Cell 8, 1351-61 (2001).
95. Wang, E.T. et al. Alternative isoform regulation in human tissue
transcriptomes. Nature 456, 470-6 (2008).
96. David, C.J. & Manley, J.L. The search for alternative splicing regulators: new
approaches offer a path to a splicing code. Genes Dev 22, 279-85 (2008).
97. Castle, J.C. et al. Expression of 24,426 human alternative splicing events and
predicted cis regulation in 48 tissues and cell lines. Nat Genet 40, 1416-25
(2008).
98. Grabowski, P.J. & Black, D.L. Alternative RNA splicing in the nervous
system. Prog Neurobiol 65, 289-308 (2001).
99. Wang, E.T. et al. Transcriptome-wide regulation of pre-mRNA splicing and
mRNA localization by muscleblind proteins. Cell 150, 710-24 (2012).
100. Boutz, P.L. et al. A post-transcriptional regulatory switch in polypyrimidine
tract-binding proteins reprograms alternative splicing in developing neurons.
Genes Dev 21, 1636-52 (2007).
101. Ule, J. et al. Nova regulates brain-specific splicing to shape the synapse. Nat
Genet 37, 844-52 (2005).
102. McKee, A.E. et al. A genome-wide in situ hybridization map of RNA-binding
proteins reveals anatomically restricted expression in the developing mouse
brain. BMC Dev Biol 5, 14 (2005).
103. Makeyev, E.V., Zhang, J., Carrasco, M.A. & Maniatis, T. The MicroRNA
miR-124 promotes neuronal differentiation by triggering brain-specific
alternative pre-mRNA splicing. Mol Cell 27, 435-48 (2007).
104. Kornblihtt, A.R. et al. Alternative splicing: a pivotal step between eukaryotic
transcription and translation. Nat Rev Mol Cell Biol 14, 153-65 (2013).
105. Moore, M.J. & Proudfoot, N.J. Pre-mRNA processing reaches back to
transcription and ahead to translation. Cell 136, 688-700 (2009).
106. Shukla, S. & Oberdoerffer, S. Co-transcriptional regulation of alternative pre-mRNA splicing. Biochim Biophys Acta 1819, 673-83 (2012).
107. Kornblihtt, A.R. Chromatin, transcript elongation and alternative splicing. Nat
Struct Mol Biol 13, 5-7 (2006).
108. Munoz, M.J., de la Mata, M. & Kornblihtt, A.R. The carboxy terminal domain
of RNA polymerase II and alternative splicing. Trends Biochem Sci 35, 497-504 (2010).
109. Phatnani, H.P. & Greenleaf, A.L. Phosphorylation and functions of the RNA
polymerase II CTD. Genes Dev 20, 2922-36 (2006).
110. Cramer, P. et al. Coupling of transcription with alternative splicing: RNA pol II
promoters modulate SF2/ASF and 9G8 effects on an exonic splicing enhancer.
Mol Cell 4, 251-8 (1999).
111. Carrillo Oesterreich, F., Bieberstein, N. & Neugebauer, K.M. Pause locally,
splice globally. Trends Cell Biol 21, 328-35 (2011).
112. Perales, R. & Bentley, D. "Cotranscriptionality": the transcription elongation
complex as a nexus for nuclear transactions. Mol Cell 36, 178-91 (2009).
113. de la Mata, M. et al. A slow RNA polymerase II affects alternative splicing in
vivo. Mol Cell 12, 525-32 (2003).
114. de la Mata, M., Lafaille, C. & Kornblihtt, A.R. First come, first served
revisited: factors affecting the same alternative splicing event have different
effects on the relative rates of intron removal. RNA 16, 904-12 (2010).
115. Dutertre, M. et al. Cotranscriptional exon skipping in the genotoxic stress
response. Nat Struct Mol Biol 17, 1358-66 (2010).
116. Schmidt, U. et al. Real-time imaging of cotranscriptional splicing reveals a
kinetic model that reduces noise: implications for alternative splicing
regulation. J Cell Biol 193, 819-29 (2011).
117. Kadener, S., Fededa, J.P., Rosbash, M. & Kornblihtt, A.R. Regulation of
alternative splicing by a transcriptional enhancer through RNA pol II
elongation. Proc Natl Acad Sci U S A 99, 8185-90 (2002).
118. Alexander, R.D., Innocente, S.A., Barrass, J.D. & Beggs, J.D. Splicing-dependent RNA polymerase pausing in yeast. Mol Cell 40, 582-93 (2010).
119. Luco, R.F., Allo, M., Schor, I.E., Kornblihtt, A.R. & Misteli, T. Epigenetics in
alternative pre-mRNA splicing. Cell 144, 16-26 (2011).
120. Allo, M. et al. Chromatin and alternative splicing. Cold Spring Harb Symp
Quant Biol 75, 103-11 (2010).
121. Hughes, T.A. Regulation of gene expression by alternative untranslated
regions. Trends Genet 22, 119-22 (2006).
122. Luco, R.F. & Misteli, T. More than a splicing code: integrating the role of
RNA, chromatin and non-coding RNA in alternative splicing regulation. Curr
Opin Genet Dev 21, 366-72 (2011).
123. Graveley, B.R. Alternative splicing: regulation without regulators. Nat Struct
Mol Biol 16, 13-5 (2009).
124. Beyer, A.L. & Osheim, Y.N. Splice site selection, rate of splicing, and
alternative splicing on nascent transcripts. Genes Dev 2, 754-65 (1988).
125. Misteli, T., Caceres, J.F. & Spector, D.L. The dynamics of a pre-mRNA
splicing factor in living cells. Nature 387, 523-7 (1997).
126. Neugebauer, K.M. & Roth, M.B. Distribution of pre-mRNA splicing factors at
sites of RNA polymerase II transcription. Genes Dev 11, 1148-59 (1997).
127. Zhang, G., Taneja, K.L., Singer, R.H. & Green, M.R. Localization of pre-mRNA splicing in mammalian nuclei. Nature 372, 809-12 (1994).
128. Listerman, I., Sapra, A.K. & Neugebauer, K.M. Cotranscriptional coupling of
splicing factor recruitment and precursor messenger RNA splicing in
mammalian cells. Nat Struct Mol Biol 13, 815-22 (2006).
129. Kotovic, K.M., Lockshon, D., Boric, L. & Neugebauer, K.M. Cotranscriptional
recruitment of the U1 snRNP to intron-containing genes in yeast. Molecular
and Cellular Biology 23, 5768-79 (2003).
130. Pandya-Jones, A. & Black, D.L. Co-transcriptional splicing of constitutive and
alternative exons. RNA 15, 1896-908 (2009).
131. Lacadie, S.A. & Rosbash, M. Cotranscriptional spliceosome assembly
dynamics and the role of U1 snRNA:5'ss base pairing in yeast. Mol Cell 19,
65-75 (2005).
132. Vargas, D.Y. et al. Single-molecule imaging of transcriptionally coupled and
uncoupled splicing. Cell 147, 1054-65 (2011).
133. Das, R. et al. SR proteins function in coupling RNAP II transcription to pre-mRNA splicing. Mol Cell 26, 867-81 (2007).
134. Carrillo Oesterreich, F., Preibisch, S. & Neugebauer, K.M. Global analysis of
nascent RNA reveals transcriptional pausing in terminal exons. Mol Cell 40,
571-81 (2010).
135. Khodor, Y.L. et al. Nascent-seq indicates widespread cotranscriptional pre-mRNA splicing in Drosophila. Genes Dev 25, 2502-12 (2011).
136. Girard, C. et al. Post-transcriptional spliceosomes are retained in nuclear
speckles until splicing completion. Nat Commun 3, 994 (2012).
137. Windhager, L. et al. Ultrashort and progressive 4sU-tagging reveals key
characteristics of RNA processing at nucleotide resolution. Genome Research
22, 2031-42 (2012).
138. Khodor, Y.L., Menet, J.S., Tolan, M. & Rosbash, M. Cotranscriptional splicing
efficiency differs dramatically between Drosophila and mouse. RNA 18, 2174-86 (2012).
139. Tilgner, H. et al. Deep sequencing of subcellular RNA fractions shows splicing
to be predominantly co-transcriptional in the human genome but inefficient for
lncRNAs. Genome Research 22, 1616-25 (2012).
140. Sachs, A. The role of poly(A) in the translation and stability of mRNA. Curr
Opin Cell Biol 2, 1092-8 (1990).
141. Fabian, M.R., Sonenberg, N. & Filipowicz, W. Regulation of mRNA
translation and stability by microRNAs. Annual Review of Biochemistry 79,
351-79 (2010).
142. Andreassi, C. & Riccio, A. To localize or not to localize: mRNA fate is in
3'UTR ends. Trends Cell Biol 19, 465-74 (2009).
143. Millevoi, S. & Vagner, S. Molecular mechanisms of eukaryotic pre-mRNA 3'
end processing regulation. Nucleic Acids Research 38, 2757-74 (2010).
144. Proudfoot, N.J. Ending the message: poly(A) signals then and now. Genes Dev
25, 1770-82 (2011).
145. Mandel, C.R., Bai, Y. & Tong, L. Protein factors in pre-mRNA 3'-end
processing. Cell Mol Life Sci 65, 1099-122 (2008).
146. Shi, Y. Alternative polyadenylation: new insights from global analyses. RNA
18, 2105-17 (2012).
147. McCracken, S. et al. The C-terminal domain of RNA polymerase II couples
mRNA processing to transcription. Nature 385, 357-61 (1997).
148. Shi, Y. et al. Molecular architecture of the human pre-mRNA 3' processing
complex. Mol Cell 33, 365-76 (2009).
149. Di Giammartino, D.C., Nishida, K. & Manley, J.L. Mechanisms and
consequences of alternative polyadenylation. Mol Cell 43, 853-66 (2011).
150. Pinto, P.A. et al. RNA polymerase II kinetics in polo polyadenylation signal
selection. Embo Journal 30, 2431-44 (2011).
151. Willingham, A.T. & Gingeras, T.R. TUF love for "junk" DNA. Cell 125, 1215-20 (2006).
152. Marioni, J.C., Mason, C.E., Mane, S.M., Stephens, M. & Gilad, Y. RNA-seq:
an assessment of technical reproducibility and comparison with gene
expression arrays. Genome Research 18, 1509-17 (2008).
153. Nagalakshmi, U. et al. The transcriptional landscape of the yeast genome
defined by RNA sequencing. Science 320, 1344-9 (2008).
154. Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. & Wold, B. Mapping
and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5, 6218 (2008).
155. Sultan, M. et al. A global view of gene activity and alternative splicing by deep
sequencing of the human transcriptome. Science 321, 956-60 (2008).
156. Pan, Q., Shai, O., Lee, L.J., Frey, B.J. & Blencowe, B.J. Deep surveying of
alternative splicing complexity in the human transcriptome by high-throughput
sequencing. Nat Genet 40, 1413-5 (2008).
157. Trapnell, C., Pachter, L. & Salzberg, S.L. TopHat: discovering splice junctions
with RNA-Seq. Bioinformatics 25, 1105-11 (2009).
158. Pflueger, D. et al. Discovery of non-ETS gene fusions in human prostate
cancer using next-generation RNA sequencing. Genome Research 21, 56-67
(2011).
159. Maher, C.A. et al. Transcriptome sequencing to detect gene fusions in cancer.
Nature 458, 97-101 (2009).
160. Guttman, M. et al. Ab initio reconstruction of cell type-specific transcriptomes
in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat
Biotechnol 28, 503-10 (2010).
161. Chepelev, I., Wei, G., Tang, Q. & Zhao, K. Detection of single nucleotide
variations in expressed exons of the human genome using RNA-Seq. Nucleic
Acids Research 37, e106 (2009).
162. Mercer, T.R. et al. Targeted RNA sequencing reveals the deep complexity of
the human transcriptome. Nat Biotechnol 30, 99-104 (2012).
163. Core, L.J., Waterfall, J.J. & Lis, J.T. Nascent RNA sequencing reveals
widespread pausing and divergent initiation at human promoters. Science 322,
1845-8 (2008).
164. Churchman, L.S. & Weissman, J.S. Nascent transcript sequencing visualizes
transcription at nucleotide resolution. Nature 469, 368-73 (2011).
165. Pandya-Jones, A. et al. Splicing kinetics and transcript release from the
chromatin compartment limit the rate of Lipid A-induced gene expression.
RNA 19, 811-27 (2013).
166. Bhatt, D.M. et al. Transcript dynamics of proinflammatory genes revealed by
sequence analysis of subcellular RNA fractions. Cell 150, 279-90 (2012).
167. Solnestam, B.W. et al. Comparison of total and cytoplasmic mRNA reveals
global regulation by nuclear retention and miRNAs. BMC Genomics 13, 574
(2012).
168. Liao, J.Y. et al. Deep sequencing of human nuclear and cytoplasmic small
RNAs reveals an unexpectedly complex subcellular distribution of miRNAs
and tRNA 3' trailers. PLoS One 5, e10563 (2010).
169. Wilhelm, B.T. & Landry, J.R. RNA-Seq-quantitative measurement of
expression through massively parallel RNA-sequencing. Methods 48, 249-57
(2009).
170. Ng, S.B. et al. Targeted capture and massively parallel sequencing of 12 human
exomes. Nature 461, 272-6 (2009).
171. Bamshad, M.J. et al. Exome sequencing as a tool for Mendelian disease gene
discovery. Nat Rev Genet 12, 745-55 (2011).
172. Kiezun, A. et al. Exome sequencing and the genetic basis of complex traits.
Nat Genet 44, 623-30 (2012).
173. Veltman, J.A. & Brunner, H.G. De novo mutations in human genetic disease.
Nat Rev Genet 13, 565-75 (2012).
174. Rauch, A. et al. Range of genetic mutations associated with severe
non-syndromic sporadic intellectual disability: an exome sequencing study.
Lancet 380, 1674-82 (2012).
175. Vissers, L.E. et al. A de novo paradigm for mental retardation. Nat Genet 42,
1109-12 (2010).
176. Iossifov, I. et al. De novo gene disruptions in children on the autistic spectrum.
Neuron 74, 285-99 (2012).
177. Crow, J.F. The origins, patterns and implications of human spontaneous
mutation. Nat Rev Genet 1, 40-7 (2000).
178. Eyre-Walker, A. & Keightley, P.D. The distribution of fitness effects of new
mutations. Nat Rev Genet 8, 610-8 (2007).
179. Lynch, M. Rate, molecular spectrum, and consequences of human mutation.
Proc Natl Acad Sci U S A 107, 961-8 (2010).
180. Roach, J.C. et al. Analysis of genetic inheritance in a family quartet by
whole-genome sequencing. Science 328, 636-9 (2010).
181. Shcherbik, N., Wang, M., Lapik, Y.R., Srivastava, L. & Pestov, D.G.
Polyadenylation and degradation of incomplete RNA polymerase I transcripts
in mammalian cells. EMBO Rep 11, 106-11 (2010).
182. Tilgner, H. et al. Deep sequencing of subcellular RNA fractions shows splicing
to be predominantly co-transcriptional in the human genome but inefficient for
lncRNAs. Genome Research 22, 1616-25 (2012).
183. Brody, Y. et al. The in vivo kinetics of RNA polymerase II elongation during
co-transcriptional splicing. PLoS Biol 9, e1000573 (2011).
184. Grabherr, M.G. et al. Full-length transcriptome assembly from RNA-Seq data
without a reference genome. Nat Biotechnol 29, 644-52 (2011).
Acta Universitatis Upsaliensis
Digital Comprehensive Summaries of Uppsala Dissertations
from the Faculty of Medicine 939
Editor: The Dean of the Faculty of Medicine
A doctoral dissertation from the Faculty of Medicine, Uppsala
University, is usually a summary of a number of papers. A few
copies of the complete dissertation are kept at major Swedish
research libraries, while the summary alone is distributed
internationally through the series Digital Comprehensive
Summaries of Uppsala Dissertations from the Faculty of
Medicine.
Distribution: publications.uu.se
urn:nbn:se:uu:diva-209390