Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Medicine 939 Genome-wide Characterization of RNA Expression and Processing AMMAR ZAGHLOOL ACTA UNIVERSITATIS UPSALIENSIS UPPSALA 2013 ISSN 1651-6206 ISBN 978-91-554-8784-3 urn:nbn:se:uu:diva-209390 Dissertation presented at Uppsala University to be publicly examined in Rudbeck Salen, Dag Hammarskjölds väg 20,, Uppsala, Friday, 29 November 2013 at 09:15 for the degree of Doctor of Philosophy (Faculty of Medicine). The examination will be conducted in English. Faculty examiner: Rickard Sandberg (Karolinska Institutet). Abstract Zaghlool, A. 2013. Genome-wide Characterization of RNA Expression and Processing. Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Medicine 939. 61 pp. Uppsala: Acta Universitatis Upsaliensis. ISBN 978-91-554-8784-3. The production of fully mature protein-coding transcripts is an intricate process that involves numerous regulation steps. The complexity of these steps provides the means for multilayered control of gene expression. Comprehensive understanding of gene expression regulation is essential for interpreting the role of gene expression programs in tissue specificity, development and disease. In this thesis, we aim to provide a better global view of the human transcriptome, focusing on its content, synthesis, processing and regulation using next-generation sequencing as a read-out. In Paper I, we show that sequencing of total RNA provides unique insights into RNA processing. Our results revealed that co-transcriptional splicing is a widespread mechanism in human and chimpanzee brain tissues. We also found a correlation between slowly removed introns and alternative splicing. In Paper II, we explore the benefits of exome capture approaches in combination with RNA-sequencing to detect transcripts expressed at low-levels. Based on our results, we demonstrate that this approach increases the sensitivity for detecting low level transcripts and leads to the identification of novel exons and splice isoforms. In Paper III, we highlight the advantages of performing RNA-sequencing on separate cytoplasmic and nuclear RNA fractions. In comparison with conventional poly(A) RNA, cytoplasmic RNA contained a significantly higher fraction of exonic sequence, providing increased sensitivity for splice junction detection and for improved de novo assembly. Conversely, the nuclear fraction showed an enrichment of unprocessed RNA compared to when sequencing total RNA, making it suitable for analysis of RNA processing dynamics. In Paper IV, we used exome sequencing to sequence the DNA of a patient with unexplained intellectual disability and identified a de novo mutation in BAZ1A, which encodes the chromatin-remodeling factor ACF1. Functional studies indicated that the mutation influences the expression of genes involved in extracellular matrix organization, synaptic function and vitamin D3 metabolism. The differential expression of CYP24A, SYNGAP1 and COL1A2 correlated with the patient’s clinical diagnosis. The findings presented in this thesis contribute towards an improved understanding of the human transcriptome in health and disease, and highlight the advantages of developing novel methods to obtain global and comprehensive views of the transcriptome. Keywords: RNA sequencing, RNA splicing, RNA processing, Gene expression Ammar Zaghlool, Department of Immunology, Genetics and Pathology, Rudbecklaboratoriet, Uppsala University, SE-751 85 Uppsala, Sweden. © Ammar Zaghlool 2013 ISSN 1651-6206 ISBN 978-91-554-8784-3 urn:nbn:se:uu:diva-209390 (http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-209390) 初⼼心忘るべからず “We should not forget our beginner’s spirit” Japanese proverb To my family r List of Papers This thesis is based on the following papers, which are referred to in the text by their Roman numerals: I Ameur A, Zaghlool A, Halvardson J, Wetterbom A, Gyllensten U, Cavelier L, Feuk L. (2011) Total RNA sequencing reveals nascent transcription and widespread co-transcriptional splicing in the human brain. Nature Structural and Molecular Biology, 18(12): 1435–40 II Halvardson J*, Zaghlool A*, Feuk L. (2013) Exome RNA sequencing reveals rare and novel alternative transcripts. Nucleic Acid Research, 41(1): e6 III Zaghlool A*, Ameur A*, Nyberg L, Grabherr M, Cavelier L, Feuk L. (2013) Efficient cellular fractionation improves RNA sequencing analysis of mature and nascent transcripts from human tissues. (Accepted in BMC Biotechnology) IV Zaghlool A, Halvardson J, Zhao J, Kalushkova A, Konska K, Jernberg-Wiklund H, Thuresson AC, Feuk L. (2013) Mutation in the chromatin-remodeling factor BAZ1A is associated with intellectual disability. (Manuscript) * Equal first authors Reprints were made with permission from the respective publishers. Additional publication I Kozyrev S *, Abelson AK *, Wojcik J, Zaghlool A, Linga Reddy MV, Sanchez E, Gunnarsson I, Svenungsson E, Sturfelt G, jönsen A, Truedsson L, Pons-Estel BA, Witte T, D'Alfonso S, Barizzone N, Danieli MG, Gutierrez C, Suarez A, Junker P, Laustrup H, González-Escribano MF, Martin J, Abderrahim H, Alarcón-Riquelme ME. (2008) Functional variants in the B-cell gene BANK1 are associated with systemic lupus erythematosus. Nature Genetics, 40(2): 211–6 Contents Abbreviations .................................................................................................ix Introduction ................................................................................................... 11 The transcriptome: the faithful message ....................................................... 13 Transcription and gene expression ........................................................... 14 Transcriptional regulation and RNA processing ...................................... 15 Chromatin remodeling ......................................................................... 15 RNA processing ................................................................................... 16 Transcriptome analysis: In light of next-generation sequencing .............. 26 Enabling technologies and method development .......................................... 28 RNA sequencing ....................................................................................... 28 Background .......................................................................................... 28 Library preparation .............................................................................. 28 SOLiD sequencing ............................................................................... 29 Exome sequencing .................................................................................... 30 Background .......................................................................................... 30 Sequencing library preparation and exome capture ............................. 31 Cytoplasmic and nuclear RNA purification ............................................. 33 The present investigation .............................................................................. 35 Aims.......................................................................................................... 35 Paper I .................................................................................................. 35 Paper II ................................................................................................. 35 Paper III ............................................................................................... 35 Paper IV ............................................................................................... 35 Results and discussion ................................................................................... 36 Total RNA sequencing reveals nascent transcription and widespread co-transcriptional splicing in the human brain (Paper I) .......................... 36 Exome RNA sequencing reveals rare and novel alternative transcripts (Paper II) ................................................................................................... 39 Efficient cellular fractionation improves RNA sequencing analysis of mature and nascent transcripts from human tissues (Paper III)................ 41 Mutation in the chromatin-remodeling factor BAZ1A is associated with intellectual disability (Paper IV)............................................................... 44 Concluding remarks and future perspectives................................................ 47 Acknowledgements....................................................................................... 50 References..................................................................................................... 53 Abbreviations APA AS cDNA ChIP CNV CTD CytoRNA-seq DCPM DNA ESE ESS EST ExomeRNA-seq GTF GWAS ID ISE ISS LncRNA mRNA ncRNA PCR PIC Poly(A) Poly(A) RNA Pre-mRNA qrtPCR RNA RNA pol II RNA-seq RPKM rRNA SNP SNV tRNA TSS Alternative polyadenylation Alternative splicing Complementary DNA Chromatin Immunoprecipitation Copy number variation The carboxy-terminal domain Cytoplasmic RNA sequencing Depth of coverage per million mapped reads Deoxyribonucleic acid Exonic splicing enhancer Exonic splicing silencer Expressed sequence tag Exome RNA capture sequencing General transcription factor Genome-wide association studies Intellectual disability Intronic splicing enhancer Intronic splicing silencer Long noncoding RNA Messenger RNA Non-coding RNA Polymerase chain reaction Transcription pre-initiation complex Polyadenylation Polyadenylated RNA Premature RNA Quantitative real time PCR Ribonucleic acid RNA polymerase II RNA sequencing Reads per kilobase per million mapped reads Ribosomal RNA Single nucleotide polymorphism Single nucleotide variant Transfer RNA Transcription start site 10 Introduction The human body contains an estimated 1014 cells, representing a large number of different cell types with specialized function. Although the cells are highly specialized, each cell contains the information required to create all cell types in the body. This information is stored within the genome, which is located in the cell nucleus. The genome is divided into twenty-three pairs of chromosomes, with the father and mother contributing one chromosome to each pair. The biological information in our chromosomes is mainly encoded by deoxyribonucleic acid (DNA). The DNA structure was revealed in 1953 by Watson and Crick. Their model suggested that the DNA molecule forms a double helix of two long chains encoded by four different nucelotides1, called adenine (A), cytosine (C), thymine (T) and guanine (G). The genetic code is determined by the order of nucleotides, and this code allows for traits to be reliably passed from one generation to the next. Therefore, the discovery of the DNA structure resolved the mystery of inheritance that had previously puzzled scientists. Another breakthrough finding in genomics research took place in 2001 when the sequence of the entire three billion bases that make up the human genome was published2,3. The completion of the human genome sequence has resulted in enormous surprises that fundamentally reshaped our view of genome architecture. The coding sequence represented only 1.5% of the genome and the number of coding genes is estimated to be between 20,00025,000 (International Human Genome Sequencing Consortium 2004), significantly fewer than previously predicted4. The fact that the number of genes was significantly lower than previously expected raised several important questions. If we have the same number of genes as many other species, where does our complexity come from, and how can such a low number of genes give rise to the large number of encoded proteins? There must be many mechanisms, processes and elements, other than the genes, that help explain our complexity. Efforts by the Encyclopedia of DNA Elements Consortium (ENCODE) are aiming to answer this question by applying genome-wide assays to many different human cell lines. In a recent publication from the ENCODE Consortium, they claim to have assigned biochemical function to almost 80% of the genome, particularly in regions outside known genes and regions previously considered “junk” DNA5. Furthermore, genomic complexity in higher eukaryotes can be achieved not only by gene number, but also through 11 mechanisms that lead to the production of multiple proteins from the same gene2. These mechanisms include alternative splicing, RNA editing, transsplicing, alternative promoter usage, and post-translational protein modifications6,7. In addition, non-coding RNAs (ncRNAs), such as miRNAs and siRNAs, transcribed from intronic regions or from regions outside genes may add another layer of complexity to the genome by controlling multiple and large regulatory networks8-10. In light of the many mechanisms of regulation that exist at the level of RNA, and as it represents a direct output of genetic information, deciphering and interpreting the complexity of all the RNA in the cell (the so-called transcriptome) represents an important step towards decoding the functional elements of the genome. Initial estimates of the proportion of the genome that is transcribed have largely relied on hybridization-based microarrays. Results coming from these studies showed that transcription is pervasive and many transcripts fall outside known genes. The nature of these transcripts was unclear, but it was suggested to represent a combination of functional transcripts and biological or experimental background11,12. Recently, the ENCODE project performed a more comprehensive analysis of transcriptional landscape in 15 human cell lines using RNA sequencing (RNA-seq) in combination with other technologies. The data revealed that more than three-quarters of the human genome is capable of being transcribed, firmly establishing the reality of pervasive transcription14. My thesis is based on four genome-wide studies of the human transcriptome. The findings presented in all the studies contribute towards an improved understanding of the human transcriptome in healthy and diseased individuals and highlight the advantages of developing novel methods to obtain global and comprehensive views of RNA expression and processing. • Paper I focuses on using total RNA-seq to study nascent RNA transcripts and co-transcriptional splicing genome-wide. • Paper II highlights the benefits of using whole exome capture and massive parallel sequencing on cDNA to study the transcriptional landscape of human tissues and to identify novel transcripts and splice junctions expressed at low levels. • Paper III highlights the advantages of sequencing separated cytoplasmic and nuclear RNA fractions using a modified cellular fractionation protocol to investigate mature RNA transcripts from cytoplasmic RNA and nascent transcripts from nuclear RNA. • Paper IV describes a de novo mutation in the chromatin-remodeling gene BAZ1A in a patient with intellectual disability. It characterizes how the effect of the mutation on the expression and function of the gene may be linked to the clinical manifestations of the patient. 12 The transcriptome: the faithful message Following the discovery of the genetic code, it was generally believed that genetic information flows from DNA to RNA to protein (i.e. the central dogma). As a consequence, it was assumed that RNA transcripts were primarily used as templates for translation to protein. Recent data have challenged this theory and showed that a large proportion of transcripts represent noncoding RNAs12. In fact, it is proposed that the regulatory potential of these noncoding RNAs correlates with the complexity of the transcriptome9,15. The transcriptome represents the complete collection of transcripts or RNA molecules present in a cell at a given time point. It is estimated that frequently expressed transcripts represent only 5% of the genetic code and only 1-2% protein coding transcripts16. In multicellular organisms, nearly all the cells share the same genome. However, the gene expression profiles can significantly vary between different cell types. Moreover, the expression patterns inside a single cell may change at different developmental stages, during differentiation and in response to external stimuli. To achieve this complexity, the transcriptome relies on extremely tight, yet flexible and orchestrated, regulatory mechanisms that provide the cells with a unique expression pattern suitable for the cell specific function at that very moment. The regulatory mechanisms function in concerted action to control the four coordinates of gene expression: where, when, how much and for how long a single transcript is expressed. Therefore, completing the catalogue of transcripts and comparing transcription and RNA processing patterns in different cells and tissues at different developmental stages will allow us to understand what constitutes a particular cell type and how transcriptional dynamics may cause or contribute to disease. In the following sections of my thesis, I will shed some light on the basic principles and recent understanding of transcription and RNA processing machineries. I will also touch upon the contribution of different nextgeneration sequencing technologies in expanding our knowledge of transcriptomics. 13 Transcription and gene expression The function and characteristics of any given cell are largely influenced by the cell’s protein content. The cell regulates how much of a protein is present or how active a protein is by mainly controlling gene expression. The expression of eukaryotic protein coding genes can be dissected into several processing, major stages including transcription, pre-mRNA polyadenylation, transport and translation. Transcription is the first stage of gene expression where a discrete segment of DNA, a gene, is copied into messenger RNA transcript in the nucleus by the enzyme RNA polymerase (RNA pol). In eukaryotes, there are three different types of RNA polymerases. RNA pol I produces most of the ribosomal RNAs (rRNAs). RNA pol II transcribes precursor RNAs (pre-mRNA) from genes which serve as templates for protein synthesis. RNA pol III catalyzes the synthesis of one small rRNA (5sRNA) and transfer RNAs (tRNAs). A simplified view of gene expression steps is illustrated in Figure 1. Figure 1. A schematic for the process of gene expression in a cell. First, both coding and noncoding regions of DNA are transcribed into RNA. Some regions are removed (introns) during initial RNA processing. The remaining exons are then spliced together. The spliced, polyadenylated and capped mRNA molecule (red) is then exported out of the nucleus to the cytoplasm. Once in the cytoplasm, the mRNA can be used as a template for protein synthesis. Figure adapted from © 2010 Nature Education. 14 The first step in the transcription process is initiation. The initiation of transcription requires the core enzymatic activity of the RNA pol II and multiple general transcription factors (GTFs)17-20. The transcription process also requires cis-acting elements to ensure accurate gene expression. Cis-acting elements include a promoter, which is composed of a core promoter sequence that includes a TATA box or initiator element. The core promoter and the TATA box are located upstream the transcription start site (TSS), providing a secure initial binding site for GTFs, RNA pol II and the TATA box binding protein (TBP)21. The location of the core promoter and the binding of transcription factors influence the transcription rate and determine the location of the TSS22. There are also distal regulatory cis-acting elements called enhancers, which are found at a considerable distance from the promoter they influence. Enhancer elements regulate gene activation when bound by activator proteins by changing the 3-D structure of the DNA to make it more accessible to transcription factors23. During transcription initiation, the GTFs assemble on the core promoter in an organized fashion forming a transcription pre-initiation complex (PIC), which facilitates the recruitment of RNA pol II to the TSS to ultimately form the transcription initiation complex24. Once transcription is initiated the carboxy-terminal domain (CTD) of RNA Pol II is phosphorylated and begins to recruit the factors needed for elongation and mRNA processing25. Concomitantly, RNA pol II begins unwinding the double stranded DNA and starts copying the template strand of DNA into RNA by adding nucleotides to the 3’ end of the growing transcript until the polymerase reaches the termination signal at the end of the coding sequence. The polymerase continues for several hundreds or even thousands of bases until it reaches the polyadenylation site (poly(A) site), a consensus sequence that is bound and cleaved by the polyadenylation complex. Following cleavage, a poly(A) tail is added to the newly synthesized mRNA transcript, a process called mRNA polyadenylation26. The newly synthesized mRNA is then exported to the cytoplasm and in most instances, translated into protein. It is important to mention that several polymerases can be found on the same gene producing multiple transcripts from a single gene copy27. Transcriptional regulation and RNA processing Chromatin remodeling DNA is condensed in the cell in a chromatin structure. The chromatin is composed of a repeated structural unite called a nucleosome which is comprised of 146 base pairs (bp) of DNA wrapped around a histone protein octamer formed from two copies of H2A, H2B, H3 and H428-30. Chromatin 15 does not only serve as mechanism to condense DNA, but it also represents a barrier for the transcriptional machinery31-35. The nucleosome complexes are not static structures. Their dynamics of “opening” or “closing” the chromatin structure represent one of the fundamental layers of gene expression regulation34. Various protein complexes tightly regulate the dynamics of chromatin structures36, including the chromatin-remodeling complexes37. Chromatin remodelers are multi-protein complexes that utilize ATP hydrolysis to mobilize and disrupt histone-DNA contacts, allowing the transcriptional machinery to access DNA and consequently, to transcriptional activation38-40. All eukaryotes have at least five families of chromatin remodelers. The two best-studied families include SWI/SNF and ISWI. The functions of these families include, but are not limited to, chromatin assembly, nucleosome spacing, transcriptional repression, double stranded DNAdamage repair and development38,41-43. Given the fundamental role of these complexes in regulating spatial and temporal gene expression, mutations affecting the function of the components of chromatin remodeling complexes generally cause cancers and multi-system developmental disorders44-48. Moreover, recent family-based studies using exome sequencing show that de novo mutations in chromatin-remodeling proteins are linked to intellectual disability (ID), autism and other neurodevelopmental disorders48,49. RNA processing The production of fully mature protein-coding transcripts is a complex process that involves numerous tightly regulated yet flexible steps. The complexity of these processes provides the means for multilayered regulation of gene expression. During and after transcription, newly synthesized pre-mRNA transcripts undergo a series of modifications and processing steps in the nucleus before the fully mature RNA is exported to the cytoplasm. These processes include: a) 5´ pre-mRNA capping, the addition of altered guanine nucleotide to the 5´end of a growing RNA transcript (7-methylguanylate cap or m7G), b) pre-mRNA splicing, the removal the intronic sequences from the transcribed RNA, and c) polyadenylation, the addition of a poly(A) stretch to the fully transcribed RNA transcript by a tightly coupled cleavage and polyadenylation reaction. Pre-mRNA splicing In eukaryotic genes, the protein coding sequences (i.e. exons) are interrupted by intervening sequences called introns. These introns have to be precisely removed form the pre-mRNA to produce a mature mRNA transcript that will be used to produce a functional protein. The process by which introns are removed and exons are joined together is called mRNA splicing and it 16 occurs in the nucleus. The mechanism by which introns are removed from the transcript is dependent on two conserved sequences located at the 5´and 3´ ends of introns. These specific sequences are called splice sites. Most often the dinucleotides GU is found at the 5´splice site (donor site) and an AG at the 3’ splice site (acceptor site). Mutations disrupting these consensus sequences have been linked to aberrant splicing and disease (discussed further below)50. Another conserved intronic sequence that is also important for intron removal is called the branch site. It is often located anywhere between 21 to 34 nucleotides upstream of the 3´end of introns. The branch site typically contains the consensus yUnAy, where the underlined A is the branch point, n is any nucleotide, and the lowercase pyrimidines are not as conserved as the uppercase A and U51. The splicing process is carried out by a series of reactions catalyzed by a macromolecular machine called the spliceosome52. The spliceosome consists of small nuclear ribonucleoprotein particles (snRNPs) called U1, U2, U4, U5 and U6, and a large number of non-snRNP factors53-55. The spliceosome assembly is initiated by a sequential recruitment of the spliceosome components to the intron (Figure 2). The U1 snRNP and the heterodimeric U2 snRNP auxiliary factor (U2AF) recognize the 5´ and the 3´ splice sites, respectively. Next, the U2 snRNP is recruited to the branch point site and then the U4-U5-U6 tri snRNPs join the spliceosome complex. This complex becomes activated by a series of structural rearrangements that ultimately leads to the joining of exons, spliceosome disassembly and release of the lariat intron. This splicing mechanism is called canonical splicing and accounts for more than 99% of the splicing events in eukaryotes56. Some introns do not possess the canonical 5´and 3´splice site sequences57. These introns are called minor introns and they are spliced using the U12 spliceosome (also referred to as the minor spliceosome)58-60. The minor spliceosome contains the same U5 snRNP as the major spliceosome, but instead of having U1, U2, U4 and U6, it has U11, U12, U4atac and U6atac snRNPs. This minor splicing event is considered rare and it is still under debate whether it occurs in the nucleus or in the cytoplasm61,62. 17 pre-mRNA splicing Exon 5´ss Exon branch point U1 U2 Exon U4 U6 3´ss U2AF Exon Exon Exon U1 U2AF U2 U5 U2AF U4 U1 U6 U2 U5 Exon Exon Figure 2. Pre-mRNA splicing. The spliceosomal factors are sequentially recruited to the 5´splice site, 3´splice site and to the branch point. Once the complete spliceosome is assembled over the intron, the exons are joined together and the intron lariat is released. The evolution of pre-mRNA splicing The process of intron splicing requires an accurate definition of the intron boundaries. This occurs through base-pairing interaction between U-rich sequences in the spliceosomal components (snRNPs) and a particular intronic sequence located in the splice site vicinity63. While the U-rich sequences have remained conserved during evolution, splicing signals are substantially more variable in higher eukaryotes compared to yeast64. In the Saccharomyces cerevisiae (S. cerevisiae) genome, there are only ∼250 annotated introns present in only 3% of the genes. The introns described in yeast are typically short and alternative splicing events are very rare65,66. Conversely, most genes in higher eukaryotes contain multiple introns with different lengths and alternative splicing is very common67. Therefore, the recognition of introns in higher eukaryotes depends on additional sequences (cis-elements) in exons and introns to compensate for the sequence divergence and the resulting poor splice site definition. 18 Alternative splicing: The source of functional and phenotypic diversity Before completion of the human genome sequence, it was hypothesized that there is a link between the complexity of an organism and the number of genes it possesses. The first draft of the human genome demonstrated that the number of genes in the human genome was lower than anticipated and not very different from the number of genes in several less complex organisms e.g. the fruit fly. One possible source of greater complexity in higher eukaryotes is alternative pre-mRNA splicing (AS), a process where exclusion of certain exons through splicing may give rise to a large number of different mRNA transcripts from the same gene. Despite the unexpectedly low number of protein coding genes identified in the human genome (around 21,000), these genes code for more than 100,000 protein coding isoforms68,69. These isoforms can be created by several mechanisms including AS, alternative transcription initiation, and alternative polyadenylation. AS was first discovered in 1977 in adenoviruses70. It also occurs in other viruses such as HIV and cytomegalovirus71,72. In higher eukaryotes, AS has emerged as one of the most essential mechanisms underlying the phenotypic diversity displayed by mammalian cells. A single pre-mRNA transcript thus has the potential to give rise to a number of different protein isoforms simply by generating different combinations of exons via alternative splicing. In the human genome, up to 95% of the multi-exon genes are believed to be subjected to alternative splicing, thereby greatly enriching the functional diversity of the human proteome and adding additional layers for regulation of gene expression67,73-76. AS may create different mRNA isoforms that differ in their coding capacity or display distinctive stability; they can even have profound effects on the protein properties77,78. Particular mRNA isoforms might be specific to a sex, developmental stage, cell or tissue type, or environmental conditions78,79. For example, the human neurexin gene family of neuronal proteins, has been reported to give rise to more than 3000 distinct mRNA isoforms80 which differ in their structure and their interaction capability with other proteins81,82. In other cases, products of AS might not code for a functional protein, but rather function as a negative regulator to prevent the overexpression of the encoded protein83. AS events can be classified into five major categories: a) exon skipping, representing the most common mode of alternative splicing, b) intron retention; c) alternative 5´splice site usage, d) alternative 3´splice site usage, and e) mutually exclusive exons. The precision of the AS process is a key element in gene expression and production of functional proteins. Therefore, aberrant AS may result in the creation of abnormal transcripts with incorrect gene products that might have severe effects and cause disease. In the last decade, a handful diseases have been reported to be associated with aberrant splicing events84-86. In fact, up to 50% of the disease causing mutations introduce defects in the splicing 19 mechanisms87,88. Such mutations might be located in a splice site or other cis-acting sequence element that influence the recognition of the correct exon borders. Alternative splicing regulation AS has emerged as one of the most essential mechanisms underlying the complex phenotypic and functional diversity displayed by the mammalian cells. This enormous diversity is achieved by a tight regulation of AS outcomes. The production of multiple mRNA transcripts from a single gene often occurs through the inclusion or exclusion of different combinations of exons. In a typical mRNA transcript, some exons are constitutively spliced, i.e. they are always included in the transcript. Other exons are alternatively spliced, therefore, they are sometimes included and sometimes excluded. The decision of including or excluding a particular alternative exon is dependent not only on 5´ and 3´ splice site, but also on many cis-acting sequence elements distributed in exons and introns. These sequence elements provide flexibility in their usage and possibility of regulation by serving as binding sites for a series of splicing regulatory proteins89. Cis-acting elements can either promote exon inclusion (splicing enhancers) or exon skipping (splicing silencers) and both kinds occur in both exonic and intronic regions. These elements are classified as exonic splicing enhancers (ESE) or exonic splicing silencers (ESS) if they function from exonic locations, and as intronic splicing enhancers (ISE) or intronic splicing silencers (ISS) if they function from intronic locations (Figure 3). ESEs are the best characterized sequence elements, serving as binding sites for a family of proteins called SR (Ser–Arg) proteins. SR proteins influence the inclusion of exons by recruiting the splicing machinery to the adjacent exon/intron borders90,91. ESS and ISS repress exon inclusion by recruiting members of the heterogeneous nuclear ribonucleoprotein family (hnRNPs), such as hnRNPA190,92. Cis-elements have also been shown to influence splice site selection by altering splice site recognition through forming looplike secondary structures93. Furthermore, the concentration ratios between negative and positive splicing factors in a given cell influences splice site selection94. 20 pre-mRNA splicing regulation U2AF Exon ISE ISS 3´ss SR hnRNP Exon ESE ESS U1 5´ss Exon ISE ISS Figure 3. Splicing regulation by the cis-acting elements and trans-acting splicing factors. Two alternative splicing pathways with the middle exon either included or excluded. Splicing enhancing signals are shown in green, while splicing inhibitory signals are shown in red. In this model, ESS or ISS are bound by hnRNPs and repress exon inclusion by inhibiting the recruitment of the splicing machinery (e.g. U2AF and U1). ESE or ISE serve as binding sites for SR proteins that influence the inclusion of exons by recruiting the splicing machinery. AS regulation also has an important role in tissue specificity75,95. Tissuespecific AS events can be partially explained by tissue-specific expression of splicing factors, and their corresponding influence on their RNA targets96-99. Recently, large numbers of tissue-specific AS events have been identified90. The brain was found to harbor the highest incidence of AS events, as might be expected considering that the brain represents the most complex and functionally diverse tissue in the body. Relatedly, many brain-specific splicing factors have also been identified, such as NOVA1 and nPTB100,101. Cell type and developmental-specific expression of these factors was observed in proliferating and post-mitotic mouse brain cells, neural progenitor cells and in differentiated neurons100,102,103. These findings highlight the importance of deciphering the ‘splicing code’ to understand the molecular mechanisms that define the function and specificity of cells and tissues. AS is not only regulated through trans-acting splicing factors but also by processes linked to the transcription machinery. Recently, several lines of research have established that splicing is physically and functionally coupled with transcription, and that this coupling may influence AS regulation and other downstream RNA processing mechanisms104-106. This implies that 21 RNA processing factors are recruited to the emerging RNA transcript during transcription and that splicing occurs co-transcriptionally. Two models have been suggested to explain the coupling between the transcription unit and the splicing machinery107. The first model is called recruitment coupling where the CTD of RNA pol II plays a major role in co-transcriptional coupling between RNA biogenesis and processing108. In this model, the CTD of RNA pol II offers a flexible landing pad for various transcription and splicing factors, and facilitates their recruitment to the emerging nascent RNA transcript109. This recruitment may influence AS regulation by the ability of different transcription factors to recruit distinct splicing factors110. The second model is called the kinetic coupling111. This model proposes that RNA pol II-mediated elongation rate influences the regulation of alternative splicing and splicing outcomes by affecting the period during which splice signals are exposed to the splice factors in the growing nascent RNA transcript112. For example, if the RNA pol II elongation rate is high due to a strong upstream promoter or an open chromatin structures, it increases the possibility that weak 3´ splice sites around cassette exons will be outcompeted by an already transcribed strong 3´ splice site downstream. In comparison, low RNA pol II processivity will provide enough time for the splicing machinery to recognize any weak 3´splice sites leading to the inclusion of a cassette exon113. There are several lines of evidence that support this model114-116. For example, it has been demonstrated that slow transcription of the fibronectin gene (FN1) leads to the inclusion of the fibronectin extra domain 1 (ED1) exon which is preceded by a weak 3´splice site. However, when transcription elongation rate was higher, this exon was excluded117. More recent global studies have revealed a pausing of RNA pol II near the 3´end of intron-containing genes. It is suggested that the pausing represents a check point to allow the splicing machinery to cope with transcription118. There is also increasing evidence that chromatin status and histone modifications also play a key role in AS regulation119,120. The outcome of AS might also be affected by other post-transcriptional mechanisms such as RNA editing, mRNA decay and microRNA binding121123 . Co-transcriptional splicing Transcription and pre-mRNA splicing were previously depicted as totally independent events. However, in the last decade, there is increasing evidence that splicing and transcription are actually coupled and that splicing can occur co-transcriptionally. Co-transcriptional splicing means that the splicing machinery is tightly coupled to the transcription units, thereby splicing out introns while transcription of the gene is still progressing. Illustration of co-transcriptional splicing is presented in Figure 4. 22 Co-transcriptional splicing POL II on Ex U4 U2AF on Ex U1 U4 U6 U5 U2 U6 U1 U2AF U2 U5 on Ex Figure 4. Co-transcriptional splicing. As the newly synthesized RNA emerges from RNA pol II, the splicing factors are recruited to the nascent transcript and the introns are excised prior to completion of the transcription process. Initial support for co-transcriptional splicing was achieved using electron microscopy of Drosophila melanogaster embryonic transcription units, visualizing intron lariat formation and associated ribonucleoproteins splicing complexes on transcripts that are still attached to DNA124. Since then, different approaches have been adopted to further investigate these observations. Immunofluorescent microscopy experiments demonstrated localization of splicing factors at sites of active transcription125,126. Moreover, RNA-FISH probes against exon-exon junctions showed partially spliced premRNA at their gene template, thereby suggesting that pre-mRNA splicing occur prior to transcript release127. Further evidence identified spliced mRNA, spliceosomal components and splicing factors in the chromatin environment of transcriptionally active genes128-131. Although multiple independent observations collectively suggest that cotranscriptional splicing is the common splicing mechanism, other studies have revealed that splicing can also occur post-transcriptionally, after the transcript release from the template DNA130. Investigation of co- 23 transcriptional splicing patterns in chromatin-associated RNA and free RNA in the nucleoplasm of human cell lines revealed that introns adjacent to constitutive exons are mostly co-transcriptionally spliced, and that these introns are removed in the 5´to 3´order. Meanwhile, 3´ introns were shown to be primarily post-transcriptionally spliced130. Internal introns harboring alternative exons are also excised during transcription, although with variable efficiencies114,130. These observations were recently validated using single molecule imaging of transcriptionally coupled and uncoupled splicing132. These data, in addition to the fact that co-transcriptional recruitment of SR proteins increases splicing efficiency133, suggests a regulatory potential of co-transcriptional splicing. Further studies have provided support for the kinetic coupling between splicing and transcription. One study analyzed chromatin associated nascent transcripts and revealed RNA pol II pausing at terminal exons134. Given that terminal exons are generally short, it was suggested that RNA pol II pausing would allow sufficient time for intron removal before transcript release. Similarly, another study reported that RNA pol II pausing at the 3´end of introns in reporter genes is coincident with splicing factor recruitment118. Together, these results suggest that co-transcriptional splicing in eukaryotes represents the rule rather than the exception. Early studies on co-transcriptional splicing were mainly based on a single gene or a limited number of genes. Consequently, an accurate estimation of the global frequency of co-transcriptional was lacking. Presently, recent advances in global approaches allow us to obtain more comprehensive and global analysis of co-transcriptional splicing events in tissues and cells. A handful of global analysis investigating co-transcriptional splicing events in tissues and cells in four different organisms have been published, including the analysis presented in Paper I in this thesis. Although these studies differ in their experimental and analytical strategies, they generally conclude that co-transcriptional splicing is widespread but not exclusive in cells and tissue134-139. The frequencies of co-transcriptional splicing reported in these studies are presented in Table 1. Table 1 Global studies on co-transcriptional splicing in four different organisms Organism S. cerevisiae134 Drosophila135 Human (HeLa cells)136 Human (K562 cells)139 Human (B-cells)137 Mouse liver138 Paper I in this thesis 24 Co-transcriptional splicing frequency 0,74 0,83 0,80 0,75 0,75 0,45 0,84 Co-transcriptional splicing may be advantageous for gene expression as it may facilitate recognition of splicing signals as they stem from the elongating transcription unit, and it may provide a platform for transcription and splicing to regulate and facilitate reciprocal functions. For these reasons, additional extended analysis of co-transcriptional splicing on a global level will be important to fully understand mechanisms for gene expression regulation in development and disease. Polyadenylation The vast majority of protein-coding and noncoding RNA transcripts undergo polyadenylation in the nucleus, a process by which a stretch of adenine bases are added to the 3´end of RNA molecules. The proper addition of a poly(A) tail is required for nuclear export, stability of the mRNA transcript and for efficient translation by serving as a major docking platform for factors that control and regulate these processes140-142. Polyadenylation is a two-step process that involves an endonucleolytic cleavage of a newly produced premRNA transcript and an addition of 200-300 adenine residues. These steps require several cis-elements and various trans-acting proteins143. The key cis-elements include a polyadenylation signal (PAS), a motif with the consensus sequence AAUAAA located 15-25 nucleotides upstream of the cleavage site, and a downstream GU-rich sequence element. There are several essential trans-acting factors that catalyze polyadenylation and cleavage. These include the cleavage and polyadenylation specificity factor (CPSF), which recognizes the PAS, and the cleavage-stimulating factor (CSTF) that binds to the GU-rich region. These factors in turn recruit the poly(A) polymerase (PAP) to the cleavage site144,145. Polyadenylation can occur at multiple PASs, leading to the production of different transcripts from the same gene by alternative polyadenylation (APA). In fact, recent discoveries using genome-wide high throughput technologies have demonstrated that APA is a widespread phenomenon that occurs in approximately 70–75% of human genes146, thereby adding an extra layer of complexity to the mechanisms of gene expression regulation. Of note, increasing evidence suggests interplay between APA and transcription through physical and kinetic coupling. Several polyadenylation factors have been demonstrated to directly interact with the RNA pol II CTD and with transcription factors147,148, thereby increasing cleavage and polyadenylation efficiency149. Similar to alternative splicing regulation, kinetic coupling (represented by the transcription elongation rate), has been shown to influence the choice of polyadenylation sites150. Splicing and polyadenylation factors also interact, leading to enhanced cleavage efficiency at the poly(A) sites143. 25 Transcriptome analysis: In light of next-generation sequencing As described in previous sections, the transcriptome represents the complete set of transcripts in a cell at any given time. The ultimate aim of whole transcriptome analysis is to establish and characterize a catalogue of all the transcripts within a particular cell/tissue, to determine transcript structure, splicing and regulation patterns, and to quantify their differential expression in both normal physiological and pathological conditions. Many different technologies have been utilized to study the transcriptome. Until the introduction of next-generation sequencing, transcriptome studies mainly relied on hybridization-based approaches such as transcription tiling microarrays. These approaches represented highthroughput, rapid and cost-effective methods to investigate gene expression patterns. They are typically based on hybridizing fluorescently labeled cDNA to custom-designed or standardized high-density probes anchored to an array slide. Due to their nature, these approaches possess several drawbacks including high background and differential hybridization affinities across the probes. In addition, they generally rely on existing knowledge about transcribed sequences that limit their ability to identify novel transcripts and splicing events. In contrast, the recent advances in next generation sequencing, especially RNA-seq, have offered a revolutionary platform to perform hypothesis-free analysis of the transcriptome and unprecedented single base pair resolution. It also offers higher dynamic range, compared to arrays for detecting transcripts at different expression levels combined with minimal amounts of noise. These technologies are generally based on converting RNA into a cDNA fragment library. These fragments are then sequenced in a parallelized, high-throughput manner. Early RNA-seq studies have proven effective in investigating the mammalian transcriptome landscape and revealed much more complexity than previously believed. Before the advent of RNA-seq, several studies using tiling microarray analysis revealed pervasive transcription over most of the genome and a large proportion of transcripts mapped to intergenic regions with unknown function (also referred to as dark matter)11,151. These observations were challenged when early RNA-seq studies showed that most sequencing reads align to known genes or around known genes with many reads located in intronic regions, indicating that most dark matter might represent transcriptional noise and experimental background13. Nevertheless, some intergenic transcripts exhibit functional characteristics. More recent RNA-seq data from the ENCODE project revealed that three-quarters of the genome is transcribed refueling the debate about the dark matter14. RNA-seq has also emerged as an affective method to assay presence and prevalence of RNA transcripts. While microarrays are associated with significant background noise that limits its ability for accurate quantitative 26 readout, RNA-seq data provides highly reproducible and quantitative measurements of transcript abundance152,153. The high resolution, sensitivity and low background of RNA-seq have provided us with an unprecedented global view of the composition and complexity of the transcriptome in different species. These advantages have enabled the identification and characterization of many novel transcripts, exons and splicing events, and have provided an accurate identification of the 5´ and 3´ ends of genes95,154-157. Other studies have also demonstrated the ability of RNA-seq to detect fusion transcripts158,159, identify long noncoding RNAs (lncRNAs)160 and identify single nucleotide variants (SNVs) and allele ratios in RNA161. Although RNA-seq has provided tremendous advantages, this method is still associated with certain challenges, such as biases associated with library preparations and the need for continuous improvements in statistical analysis and computational infrastructure. In addition, the transcriptome is dominated by highly expressed transcripts that consume most of the sequencing reads, leading to less power to detect transcripts expressed at low levels. In fact, targetenrichment strategies, which were originally used to capture specific sequences of DNA, have been used to perform capture on cDNA. Combined with RNA-seq, this method has lead to the identification of many novel transcripts and splice junctions which are below the level of detection in conventional RNA-seq162 (plus Paper II in this thesis). The function and the nature of these rare transcripts still need to be investigated. Current RNA-seq studies largely depend on poly(A) enriched RNA populations to study transcriptomes. Most RNA-seq results therefore provide little information about transcription dynamics and about forms of RNA that lack polyA tails. Recent studies have implemented different manipulations and improvements to RNA sample preparation steps to extend the boundaries of RNA-seq technologies, thereby obtaining deeper insight into the dynamics of gene expression and RNA processing. For example, two technologies named GRO-seq (Global Run-On Sequencing)163 and NET-seq (Native Elongating Transcript Sequencing)164 have been developed to provide insights into the dynamics of transcript synthesis and decay by capturing and sequencing nascent transcripts genome wide. In parallel, other studies rely on cellular fractionation to isolate and sequence sub-populations of RNA molecules. RNA sequencing of these fractions revealed interesting findings related to splicing kinetics, co-transcriptional splicing, posttranscriptional splicing, polyadenylation, transcript release from the chromatin and miRNAs distribution and function139,165-168. These studies also provide strong evidence for extensive cross-talk between RNA biogenesis and processing mechanisms. 27 Enabling technologies and method development RNA sequencing Background One of the most valuable achievements in genomics has been the ability to perform a comprehensive, dynamic, unbiased and hypothesis-free analysis of a genome in a high-throughput manner. Approaches developed for investigating and quantifying RNA by high-throughput sequencing (nextgeneration sequencing (NGS)) are collectively called RNA-seq169. NGS platforms for performing RNA-seq studies are dominated by three different companies: Roche 454 (FLX pyrosequencing system), Illumina (Genome Analyzer) and Life Technologies (SOLiD system). For each of these platforms, RNA fragments are first converted to cDNA libraries and then sequenced in parallel, producing massive numbers of relatively short reads. However, the different platforms produce reads that vary in their length ranging from 30-150 bp for SOLiD and the Genome Analyzer to 200-500 for the FLX pyrosequencing. It is important to mention, that sequencing-read length and the number of reads produced is to-date continuously increasing. The SOLiD platform (Life Technologies) was used for sequencing in all the papers presented in this thesis. However, the starting material was different in each of the studies and included total RNA (Paper I), exomecaptured RNA (Paper II), cytoplasmic, nuclear and poly(A) RNA (Paper III) and exome captured DNA (Paper IV). I will discuss SOLiD sequencing technology in more detail below. Library preparation The quality of RNA samples are controlled using the Agilent Bioanalyzer and only Agilent-defined RIN-values above 7,0 are accepted. RNA samples are first depleted from ribosomal RNA (rRNA) using the RiboMinus™ Eukaryote Kit v2 (Ambion) except for the nuclear RNA. rRNA depleted samples are then fragmented using RNase III and subsequently purified. Adaptors are then ligated to the fragmented and purified RNA, and cDNA is generated and amplified in a strand-specific manner. Thereafter, the cDNA 28 fragments are clonally amplified on SOLiD™ P1 DNA Beads by emulsion PCR. After the enrichment of the beads with cDNA, the beads are deposited and covalently attached on a glass slide. SOLiD sequencing Following the library preparation, the first step in the sequencing reaction is the binding of a universal sequencing primer to the P1 adaptor which is attached to the cDNA template. Mixtures of four differentially labeled dibase probes, where the first two bases are known, compete for binding to the template. The bound probe, where the first two bases match those of the template, is ligated to the free 3´end of the universal primer. Subsequently, further probes bind and ligate to the growing strand. After each ligation, a fluorescent marker attached to each probe is detected and cleaved. The template sequence is interrogated by the specificity of 1st and 2nd base in each ligation reaction. The sequencing reaction is outlined in Figure 5. universal seq primer 3’ 3’ p5’ Ligate Bead 5’ 5 Cleavage 5’ 5 P1 Adapter 5’ 5 3’ 3’ 5’ TA nnnzzz P1 Adapter Template Sequence Cleavage 3’ 5’ zzz TA nnn P1 Adapter 5’ AT nnnzzz Template Sequence universal seq primer Bead 5’ TC nnnzzz TA nnnzzz universal seq primer Bead 3’ 5’ universal seq primer 3’ p5’ 5’ GG nnnzzz ligase Template Sequence 3’ Figure 5. Overview of the SOLiD sequencing chemistry. Figure adapted from Life Technologies Corporation © 2013 29 After multiple cycles of ligation, detection and cleavage, the template is reset. This is done by removing and discarding the initial primer and all the ligated probes from the template. The number of the cycles determines the eventual read length. After a reset, a new universal primer starting from n-1 is used and the cycles of ligation, detection and cleavage are repeated. The process is repeated with five different universal primers (n, n-1, n-2, n-3, n4), generating overlapping reads with each base interrogated twice. Universal primers and the dual interrogation system are illustrated in Figure 6. Figure 6. The SOLiD dual interrogation of each base using the five different universal primers. For example, universal primer 2 and 3 interrogate the base at read position 10. Figure adapted from Life Technologies Corporation © 2013 Exome sequencing Background Over the last years, great advances have been achieved in next-generation sequencing and capture technologies, enabling the identification of almost all protein-coding variants in the genome170. Because a large proportion of disease-causing mutations are likely to be located in coding regions, exome capture followed by massive parallel sequencing have promoted the discovery of many disease causing variants in Mendelian diseases and rare variants in complex traits171,172. Compared to whole-genome sequencing, which is still prohibitively expensive for most applications, and GWAS, which is largely based on the analysis of common variants located in noncoding regions, exome sequencing has proven more efficient and cost effective to detect disease-associated, protein-coding variants. Exome sequencing has also emerged as an approach to study de novo mutations and their role in the genetics of complex diseases173. In fact, several recent studies focused on parent-offspring exome sequencing have provided strong evidence that de novo mutations in many different genes represent a major 30 cause for common neurodevelopmental diseases such as intellectual disability and autism48,174-176. The results also indicate that de novo mutations are more common in patients than controls174. A de novo mutation is a spontaneous alteration in a gene that occurs for the first time in one family member due to a mutation either in a germ cell of one of the parents or during early cell divisions in the fertilized embryo. De novo mutations are considered rare genetic variants and they tend to be more harmful than inherited mutations because they have not been subjected to the same levels of evolutionary selection177,178. Collectively, de novo mutations are not rare and have per generation mutation rate between 7.6 × 10−9 and 2.2 × 10−8, corresponding to 50 - 120 new mutations per generation, with an average of 0.86 amino acid altering mutations per newborn179,180. The fact that these mutations are under less stringent selection and are collectively common, make them prime candidates as causative mutations in sporadic and commonly occurring diseases. Sequencing library preparation and exome capture SOLiD based exome sequencing is mainly based on introducing an exome enrichment step after preparing the sequence library. The DNA sample is first fragmented to an average size of 165 bp using the Covaris® System. The DNA libraries are prepared as mentioned in the Library preparation section (see above). The genomic DNA library is then ready to use for an exome enrichment stage. Exome capture in the projects described in this thesis was performed using the Agilent SureSelect 50 Mb Exome Enrichment kit (Agilent Technologies). In this assay, the amplified sequencing library is hybridized to the SureSelect Capture Library mix containing biotinylated RNA probes (BAITS). The hybridized DNA fragments (exome) are then captured using streptavidin-coated magnetic beads. After several washing steps, the exome library is amplified and then ready for sequencing. The exome capture procedure is illustrated in Figure 7. In the case of Paper II, where exome capture was performed on RNA, double stranded cDNA was generated prior to the library preparation and exome capture steps. 31 Figure 7. The SureSelect Target Enrichment process. Figure adapted from Agilent Technologies Copyright © 2013 32 Cytoplasmic and nuclear RNA purification The quality of isolated RNA from tissues, blood and cells is the initial and the most crucial step for performing RNA based research. RNA extraction methods are used in basic molecular biology research, medical research and in the diagnoses of many medical conditions. Currently, RNA research is mainly performed on total and poly(A) RNA. RNA sequencing using total RNA provides a snapshot of the complete content of RNA molecules in cells or tissues whereas poly(A) RNA enrich for mature transcripts. Working with RNA-seq, we realized that total RNA represents heterogeneous pools of RNA populations with different RNA molecules at different levels of maturity. We also know that poly(A) selection introduces several biases to RNA samples. These biases mainly arise from poly(A) stretches in introns, polyadenylated pre-mRNA transcripts166, and due to the fact that polyadenylation also occurs during degradation of incomplete RNA polymerase I transcripts181. In an attempt to increase the sensitivity of RNA sequencing, we aimed to develop a protocol to obtain pure populations of mRNA transcripts enriched from the cytoplasmic fraction and nascent RNA enriched from the nuclear fraction. To achieve this goal, we modified the original protocol of the commercially available kit from Norgen Biotech (Cytoplasmic & Nuclear RNA Purification Kit), which suffers from significant cross contamination between the fractions, and we were able to obtain highly pure subcellular RNA populations. Sequencing the RNA fractions extracted using this new protocol allowed us to obtain greater insight into transcription dynamics from the nuclear fraction and a more accurate assessment of RNA expression from the cytoplasmic fraction. The modifications applied to the original protocol are illustrated in Figure 8. Briefly, frozen tissue (20 mg) is ground in liquid nitrogen using a mortar and pestle. Tissue powder is transferred to ice cold 1.5 ml tubes. Then, 200 µl lysis buffer (Norgen) is added to the ground tissue. The tubes are incubated on ice for 10 minutes and then centrifuged for 3 minutes at 13.000 rpm to separate the cellular fractions. The supernatant containing the cytoplasmic fraction and the pellet containing the nuclei are mixed with 400 µl 1.6M sucrose solution and carefully layered on the top of a 500 µl sucrose solution in two separate tubes. Both fractions are then centrifuged at 13.000rpm for 15 minutes at 4C°. The cytoplasmic fraction is collected from the top of the sucrose cushion and the cytoplasmic RNA is then further purified according to the Norgen kit recommendations. The nuclear pellet is collected from the bottom of the tube and washed with 200 µl 1x PBS. The nuclei are collected after another centrifugation step at 13,000 rpm for 3 minutes. The nuclear RNA is then purified from the nuclear fraction according to the Norgen kit recommendations. 33 Figure 8. Schematic illustration of the cytoplasmic and nuclear RNA purification using the Cytoplasmic and Nuclear RNA Purification kit (Norgen), with the modifications to the protocol marked in the green box. 34 The present investigation Aims The general aim of this thesis is to provide a better global view of the human transcriptome via analyzing its content, synthesis, processing and regulation using next generation sequencing. In the long run, a better characterization of the transcriptome will offer a valuable resource to understand the molecular mechanisms in health and disease conditions. It will also provide a tool for more accurate disease diagnosis and efficient treatment. The specific objectives of each paper in this thesis are: Paper I Characterize the global pattern of intronic transcription and cotranscriptional splicing in the human brain using RNA-seq. Paper II Utilize the increased sensitivity of targeted enrichment of coding sequences via RNA exome capture and massively parallel sequencing to improve the detection of lowly expressed transcripts genome wide. Paper III Develop a more efficient cellular fractionation method to obtain homogenous pools of RNA molecules for more focused and comprehensive transcriptome analysis using RNA sequencing. Paper IV To functionally characterize a causative de novo mutation identified in a patient with intellectual disability and to understand how it may be linked to the clinical manifestations of the patient. 35 Results and discussion Total RNA sequencing reveals nascent transcription and widespread co-transcriptional splicing in the human brain (Paper I) The advent of next-generation sequencing (NGS) has revolutionized our understanding of the transcriptome and revealed greater complexity than previously believed. In the last decade, series of tiling array studies showed that a much larger portion of the mammalian genome is transcribed than was previously expected and many transcripts were detected outside known genes. More recent studies of the mammalian transcriptome using RNA-seq have revealed that most sequencing reads map within known genes and many are intronic, indicating that most transcripts outside known genes represent experimental background or transcriptional noise. The function and the nature of intron transcripts have not been thoroughly investigated. In this paper, we aimed to investigate the origin and function of intronic transcripts. We preformed RNA-seq on brain and liver samples obtained from adult and fetal human tissues using the SOLiD4 system. Our total RNA sequencing results indicated that around 70%-80% of the sequence reads map to regions outside known exons. Specifically, 38% of these reads map to introns (Figure 1 in Paper I). The results also showed that the intronic reads were significantly more enriched in brain compared to liver (P value <10 -300) and in fetal brain compared to adult brain, indicating high level of nascent transcription in brain tissues and especially in fetal brain. Interestingly, the genes with the highest intronic expression were involved in brain development and neural signaling pathways. The higher levels of intronic RNAs in fetal brain were further validated using qrtPCR. Another interesting observation was evident in our global expression analysis. The intronic reads were unequally distributed across introns. The expression of intronic RNAs were substantially higher in the 5’ end compared to the 3’ end forming a clear saw-tooth like pattern, especially in the long introns (Figure 9). To experimentally validate these observations, we conducted qrtPCR on two genes with high intronic expression, namely NRXN1 and GRID2. The qrtPCR results showed high correlation with the RNA-seq data and indicated that the highly expressed 5’ intronic RNAs are 36 connected to the upstream exon, whereas low levels of the 3’ intronic transcripts are attached to their adjacent exon. a 500 kb AUTS2 AUTS2 AUTS2 200 kb C21orf34 C21orf34 b RefSeq genes AAAAA AAAAA AAAAA AAAAA AAAAA RefSeq genes MIR99A MIRLET7C MIR125B2 Figure 9. Nascent transcription and co-transcriptional splicing. (a) Pattern for AUTS2 (top) and C21orf34, a noncoding RNA gene (bottom), viewed in the University of California, Santa Cruz (UCSC) Genome Browser. The RNA-seq signals have been smoothed using window averaging. For both protein coding genes and long noncoding RNA genes, there is an apparent 'saw-tooth' pattern with higher RNA-seq read signals toward the 5′ end of each intron. (b) Model for co-transcriptional splicing. The total RNA-seq data give rise to a typical saw-tooth pattern across genes that are actively transcribed. The gradient of RNA across the introns can be explained by a large number of nascent transcripts in various stages of completion. The pattern is repeated for each intron because the nascent transcript is spliced very rapidly after the polymerase completes transcribing each intron. The sequence read coverage is comparatively higher for exons, as the RNA-seq is measuring both the pool of nascent transcripts and the pool of mature polyadenylated RNA. Based on these results, we proposed an explanation for this observation. In our model, we assume that each gene is transcribed continuously forming a pool of RNA molecules at different stages of maturation and processing. These molecules generate a gradient of nascent transcripts along the whole gene. The fact that this gradient is over a single intron, rather than the whole gene, is explained by co-transcriptional splicing. In other words, the splicing machinery shadows the transcription machinery and excises each transcribed intron (Figure 9b). To investigate this phenomenon globally, we measured the average read depth for all introns in brain and liver tissues. As expected, the analysis showed that large introns >100 kb in the brain contained the highest levels of intronic reads at their 5’ end compared to the small introns, and particularly in genes involved in axonal growth and synaptic transmission. These results can be explained by the fact that shorter introns are transcribed and spliced faster than larger introns and the resolution of RNA-seq 37 is limited to detect these events. The analysis also confirmed the differences between brain and liver tissue as well as between fetal and adult brain. An independent qrtPCR quantification of the differences showed that intronic RNAs were higher in fetal brain, while exonic RNAs were equal between both tissues. These observations might indicate higher rates of transcription and turn over in the fetal brain. Early studies of splicing regulation proposed that understanding the order of intron splicing in a single transcript is crucial to understand splicing kinetics. A recent study by de la Mata et al114 proposed the model “first come first served (committed)” for intronic splicing. This model implies that as introns are being transcribed, they commit to the splicing machinery directly. To test this theory in our data, we designed a PCR experiment to investigate cotranscriptional splicing of consecutive exons in NRXN1 and ERBB4. Our results demonstrated that indeed these introns are co-transcriptionally spliced. However, in agreement with de la Mata et al, we noticed a time gap between the transcription and the excision of these introns. We also observed that co-transcriptional splicing is not restricted to the long introns or to certain tissue. In fact, it also occurs in short introns and at different levels in other human tissues. To show that our analysis can measure levels of cotranscriptional splicing, we re-sequenced the fetal brain to obtain higher read coverage. We then analyzed the levels of co-transcriptional splicing globally by quantifying the levels of intronic expression adjacent to the 3’ and 5’ end of each exon. After correcting for intron length, the analysis revealed evidence for co-transcriptional splicing for 84% of exons surrounded by long introns. Moreover, introns surrounding cassette exons showed high levels of intronic RNA compared to constitutive exons. In agreement with PandyaJones et al and de la Mata et al113,130, our results suggest that introns surrounding alternatively spliced exons are excised at slower rate. The results in this study indicate that sequencing reads over introns mainly originate from nascent RNA transcripts undergoing transcription. Moreover, the distribution of these reads over the introns, reflected by a sawtooth like pattern, revealed widespread co-transcriptional splicing in the human brain. The results also demonstrate that genes with long introns and high levels of intronic RNA are associated in neurodevelopmental disorders. These genes may experience more complex transcriptional regulation patterns. Our experimental model provided a unique platform to view and analyze nascent transcripts undergoing transcription and measure levels of cotranscriptional splicing in human tissues. 38 Exome RNA sequencing reveals rare and novel alternative transcripts (Paper II) One of the ultimate goals of transcriptomics research is to identify the complete catalogue of expressed transcripts. RNA sequencing technologies have provided valuable tools to assess the quantity and composition of the human transcriptome and have led to the identification of many unannotated transcripts and exons. However, the detection of rare transcripts and splice junction still represent a challenge for these technologies. In this study we aimed to improve the detection of transcripts expressed at low-levels genome wide by performing exome capture on RNA followed by massive parallel RNA-seq (ExomeRNAseq). We performed target enrichment of exome sequences using the Agilent SureSelect 50 Mb kit on four cDNA samples from an adult liver, an adult cortex, a fetal liver and a fetal cortex. The captured targets were then sequenced on the SOLiD4 and 73 million reads per sample were generated. The expression analysis of ExomeRNAseq using reads per kilobase per million reads (RPKM) showed weak correlation with the expression analysis from total RNA sequencing (TotalRNAseq) of the same samples. To explore the distribution of sequencing read coverage over the genome resulting from the different sequencing techniques, we compared the fraction of reads mapping to exons, introns, UTRs and intergenic regions between ExomeRNAseq, TotalRNAseq and a poly(A) selected RNA-seq. The results demonstrated that the largest fraction of exonic reads was detected in ExomeRNAseq compared to the other sequencing technologies. Our analysis also revealed that larger numbers of genes could be detected using ExomeRNAseq. Moreover, the transcripts determined to be expressed at low levels using TotalRNAseq showed higher expression in ExomeRNAseq. To evaluate if ExomeRNAseq can be considered as a quantitative test, we selected seven random genes and five genes with differential expression between ExomeRNAseq and TotalRNAseq and tested their expression using qrtPCR. The qrtPCR data demonstrated a better correlation with the TotalRNAseq data indicating that there are differences in capture affinity across the different probes and that ExomeRNAseq does not accurately reflect quantitative differences. Given that ExomeRNAseq contains a higher fraction of sequence reads mapping to exonic regions, we hypothesized that the capture technology will improve splice junction discovery. To evaluate this hypothesis, we used the TopHat and SplitSeek tools to identify splice junctions and novel exons in the different samples sequenced. The results demonstrated that ExomeRNAseq resulted in the identification of 100,000 splice junctions compared to 25,000 and 44,000 junctions identified using TotalRNAseq and poly(A) selected RNA-seq, respectively. We further investigated the nature and the novelty of the identified splice junctions. The analysis was based on 39 comparing the splice junctions in our samples to University of California, Santa Cruz genome browser (UCSC) annotated genes. The splice junctions were divided into two groups: The first group included known splice junctions reported in the UCSC; the second group included novel splice junctions that are not represented in the current mRNA annotation in the UCSC. Our ExomeRNAseq showed a significantly higher efficiency in detecting novel splice junction (n=22,432) compared to TotalRNAseq (n=3,036). To determine if the larger number of novel splice junctions detected in the ExomeRNAseq data reflected false positive observations, we calculated the number of novel splice junctions that overlap with the junctions reported in the EST (UCSC) for both ExomeRNAseq and TotalRNAseq. We found that the fraction of novel splice junctions mapping to the UCSC EST data to be similar between the two technologies suggesting that ExomeRNAseq does not increase the false positive discovery rate of novel splice junctions. Three novel splice junctions and six novel exons were further validated using PCR, Sanger sequencing and qrtPCR. Having established that ExomeRNAseq represents an efficient technology to enrich for coding sequences, we sought to assess the ability of ExomeRNAseq to identify coding variants. We performed single nucleotide variant (SNV) calling on the data from ExomeRNAseq and TotalRNAseq. The analysis resulted in the detection of substantially larger numbers of SNVs in ExomeRNAseq (average of 9,106 per sample) compared to TotalRNAseq (average of 1,919 variants per sample). However, the SNVs detected in the TotalRNAseq had better overlap with the SNVs reported in the dbSNP database (average 91% vs. 83% in the capture data). The multiple cDNA amplification cycles during sample preparation may introduce mutations in the original sequence and might represent one potential explanation for the reduced accuracy of ExomeRNAseq for SNPs discovery. Another explanation may be that the capture probes had some level of unspecific binding or cross-hybridization. In this study, we demonstrate that whole exome capture can be applied to cDNA to study the transcriptome landscape. We also show that using the capture step combined with parallel RNA sequencing leads to the detection of large amounts of novel exons and splice junctions. Compared to TotalRNAseq, ExomeRNAseq showed increased ability to detect rare transcripts and splice variants expressed at low levels. However, ExomeRNAseq is less sensitive in detecting expression differences. We propose that ExomeRNAseq is an effective platform to expand our knowledge and understanding of the transcriptome, especially in detecting novel transcripts, splice variants and disease causing fusion transcripts. An illustration for the workflow used to identify novel exons and splices junction is represented in Figure 10 40 Intergenic regions Novel exons Case 1 5’ novel exon Case 2 Internal novel exon Case 3 Novel splice isoform Known exons Case 4 3’ novel exon Unannotated splice junctions Known splice juctions 5‘ 3‘ Reference genome Data analysis Splice junction N N N N A C B A B C A B N N Novel exons and novel splice isoforms PCR validation and Sanger sequencing Figure 10. Illustration of the workflow used for the detection and validation of novel exons and novel splice junctions. Efficient cellular fractionation improves RNA sequencing analysis of mature and nascent transcripts from human tissues (Paper III) Currently, total RNA is considered the standard starting material in RNA sequencing experiments to obtain a global view of the transcriptome. However, if the aim is to study mature RNA transcripts, poly(A) selected RNA is the method of choice. RNA sequencing studies using total and poly(A) selected RNA have successfully provided us with novel insights into the transcriptome. However, many challenges still remain in sample preparation and data analysis steps. Previous RNA-seq studies reported that total and poly(A) RNA represent diverse and heterogeneous pools of different RNA molecules at different maturation and processing levels. Moreover, the presence of substantial amounts of intronic RNA originating from immature transcripts in these RNA populations will consume most of the sequencing reads, and thereby impacts the sensitivity of RNA-seq to detect transcripts and measure expression levels. Thus, the analysis of a specific RNA type or a specific regulation mechanism is still not common. In recent years, considerable efforts have been made to extend the resolution of RNA-seq data, and to develop new methods and strategies to monitor alternative regulation of mRNA biogenesis and to detect novel transcripts expressed at low levels. Recently, two new technologies GROseq and NET-seq, have been described to study nascent transcripts163,164. Although these technologies have successfully provided snapshots of ongoing transcription, they do not provide insights into post-transcriptional 41 events, they require manipulation of the physiological conditions and they need extensive optimization. Recent advances in RNA extraction techniques also offer the possibility to analyze particular pools of RNA molecules by cellular fractionation. However, these techniques are associated with significant amount of cross contamination between the nucleus and the cytoplasm. To overcome this problem, recent studies166,182 added a poly(A) selection step to the cytoplasmic fraction to enrich for mature transcripts, and a chromatinassociated transcript selection step to enrich for nascent transcripts. These selection steps are time consuming and require large amounts of starting material. In this study, we aimed to investigate the benefits of performing RNA-seq on fractionated cytosolic and nuclear RNA from small amounts of tissue starting material and without any selection steps. To obtain high quality cytosolic and nuclear RNA, and minimize the amounts of cross contamination, we modified and improved the Cytoplasmic and Nuclear RNA Purification Kit (Norgen). The modifications to the kit’s protocol are illustrated in Figure 8. Quality assessment of our protocol showed that the nuclear fraction is virtually free of any ribosomal RNA and no genomic DNA traces could be detected in the cytosolic fraction. To analyze the efficiency of the protocol in enriching for the mature transcripts from the cytosolic fraction and the nascent transcripts from the nuclear fraction, we sequenced cytosolic and nuclear RNA from two human fetal frontal cortex tissues, designated sample 1 and 2, along with total and poly(A) RNA from the same tissues using the SOLiD 5500 platform. Upon investigation of the sequencing read distribution in the different RNA fractions, we observed a striking enrichment of exonic reads in the cytosolic fraction compared to the total, poly(A) and nuclear RNA fractions. Surprisingly, poly(A) RNA contained high levels of intronic RNA across the entire transcript, contradicting the idea that poly(A) selection exclusively enriches for mature RNA. In agreement with our findings, recent studies demonstrated that transcript polyadenylation might occur before the transcript is fully spliced166,183. These findings, along with the biases associated with the poly(A) selection, may provide an explanation for the high intronic levels in the poly(A) RNA fraction. The differences in exonic and intronic levels between the different fractions were further validated by qrtPCR. Of note, based on data from Paper I, we believe these differences are especially prominent in the fetal brain where long neuronal genes are highly transcriptionally active. Next, we devised a measurement called the EI ratio (exonic-to-intronic) to globally quantify the relative enrichment of exonic compared to intronic reads in the different fractions. The EI ratio is equal to1 when all intragenic reads are exonic and 0 when all reads are intronic. Again, cytoplasmic RNA had the highest exonic enrichment with ratios of 0.74 and 0.72 for sample 1 42 and 2, respectively. Poly(A), total and nuclear RNAs show substantially lower values with EI ratios of 0.27-0.45, 0.20-0.42 and 0.12-0.31, respectively. Having established that cytosolic RNA is significantly enriched for exons, we propose that it is the best extraction method for studying mature transcripts. We also propose that analyzing the nuclear fraction is a preferable choice to study nascent transcripts as it contains the highest amount of intronic RNA. We proposed that the higher amount of exonic reads in the cytosolic fraction leads to increased efficiency in detection and the quantification of mature RNA transcripts. To test this, we used depth of coverage per million mapped reads (DCPM) as a measure to quantify the expression levels of all exons in the human genome. As expected, cytosolic RNA-seq gives the highest DCPM values for exons compared to the other fractions. We were also able to quantify the number of expressed exons in the different RNA fractions by normalizing the DCPM values over the exons against the noise signals coming from the anti-sense strand. The cytosolic RNA contained 8 to 19% more expressed exons than in poly(A) RNA, and 29 to 49% more than in total RNA. Presently, there is an increasing appreciation of the importance of alternative splicing as a process for proteome diversity and tissue specific regulation. Many efforts have been devoted to increase the efficiency of RNA-seq to detect splice junctions. To test if cytoplasmic RNA-seq improves the efficiency of splice junction detection compared to poly(A) and total RNA-seq, we used TopHat software to perform splice junction analysis. Indeed, we found that cytosolic RNA-seq improves splice junction discovery and generated data where we detected 10% more junctions than poly(A) RNA-seq. We next used Trinity software184 to perform de novo transcriptome assembly for each RNA population. This test allowed us to evaluate the data in unbiased fashion without prior information about gene coordinates. As expected, using cytoplasmic RNA provides longer contigs and generates 30% more transcripts longer than 1 kb compared to poly(A) RNA and 10 times more compared to total RNA. These results highlight the benefits of using cytoplasmic RNA to improve gene discovery of complex transcriptomes and to study the genetic characteristics of non-model organisms. In Paper I, we demonstrated that total RNA-seq could be used to study nascent transcripts undergoing transcription. We also showed that the 5’-3’ negative gradient of RNA-seq reads across long introns represent an indicator of ongoing transcription and co-transcriptional splicing. In this study, we showed that the highest amounts of nascent transcripts were detected in the RNA-seq data from the nuclear fraction (i.e. showed the lowest EI ratio). To evaluate if nuclear RNA-seq increases the efficiency of nascent transcript detection, we performed a global analysis of the sequence coverage across introns for all RNA fractions. The analysis revealed that 43 nuclear RNA-seq contains the highest amount of nascent transcripts and the most distinct 5’-3’ gradients among all RNA fractions. Thus, the nuclear RNA represents the optimal fraction to study ongoing transcription and splicing dynamics. In this study, we presented the advantages of performing RNA-seq on cytoplasmic and nuclear RNA fractions using our modified and improved protocol. Compared to poly(A) RNA, which is considered the standard material for studying mature transcripts, cytoplasmic RNA-seq results in the detection of proportionally greater amounts of exonic sequences with minimal levels of intronic reads, offering increased sensitivity for measuring expression levels, detecting splice junctions and for de novo transcriptome assembly. On the other hand, nuclear RNA-seq contains higher amounts of intronic transcripts compared to total RNA-seq, providing a preferable fraction to study ongoing transcription, splicing dynamics and intron retention patterns. Mutation in the chromatin-remodeling factor BAZ1A is associated with intellectual disability (Paper IV) Intellectual disability (ID), also known as mental retardation, is a complex and heterogeneous group of disorders that causes significant health burden and affects 3-4% of the general population. The causes of intellectual disability encompass inborn, environmental and inherited factors. Over the past years, genetic testing of ID patients was focused on family-based linkage analysis, homozygosity testing in consanguineous families, cytogenetic analysis and array based screening. Causative factors have only been identified in approximately 40% of patients leaving the majority of the cases without any explanation for their symptoms. It is believed that sporadic cases of ID are mainly caused by genetic factors. Understanding the genetic causes of ID benefits the patients and their families by providing information about the disease prognosis and avoids unnecessary additional testing. It may also provide better-tailored therapy and medical care. Recent studies have suggested a major role for de novo mutations in the etiology of ID, and have led to the identification of many causative de novo mutations in sporadic ID using family-based exome capture and massive parallel sequencing. In this study, we performed exome sequencing on a trio family with healthy parents and a child suffering from ID, epilepsy, ataxia and hyper mobile joints to investigate if a de novo mutation is responsible for the disease in the child. Exome sequencing and Sanger sequencing validated a de novo non-synonymous mutation in the gene BAZ1A which codes for the chromatin remodeling factor ACF1. The mutation led to a phenylalanine to 44 cysteine substitution of amino acid 1347 in the linker motif between the PHD domain and the bromodomain (Brd) in the C terminus of ACF1. Given the established role of ACF1 in the transcription repression of the vitamin D receptor (VDR) regulated genes, we conducted targeted PCR on RNA samples from the trio to evaluate the impact of the mutation on the expression of VDR regulated genes and on the expression levels of BAZ1A. Interestingly, CYP24A1, involved in the metabolism of vitamin D3 (VD3), was significantly upregulated in the patient compared to the parents. Of note, BAZ1A expression levels showed no difference between the trio members. To determine if the upregulation of CYP24A1 influences VD3 levels, we performed a clinical test to measure VD3 levels in the patient. Interestingly, the results showed that the patient was suffering from VD3 insufficiency. To independently test if the differential expression of CYP24A1 is caused by the identified mutation in BAZ1A, we performed transient transfection of wild type BAZ1A (wt-BAZ1A) and mutant BAZ1A (mut-BAZ1A) transcripts in SAOS2 cell lines. qrtPCR results validated CYP24A1 upregulation in the cells transfected with the mut-BAZ1A, indicating the role of the de novo mutation on the levels of CYP24A1. To investigate whether the mutation influence the expression of genes involved in other pathways, we performed poly(A) RNA-seq on the trio samples. RNA-seq results confirmed the previously identified CYP24A1 upregulation. In addition, we identified the expression of SYNGAP1 to be significantly higher in the patient compared to the parents. SYNGAP1 upregulation was further confirmed using qrtPCR. Mutations in this gene were previously found to cause epilepsy and ID. Moreover, overexpression of SYNGAP1 has been shown to influence neuronal signaling and synaptic function. Consecutive GO analysis of the differentially expressed genes in the trio RNA-seq revealed significant enrichment of the category extracellular matrix (ECM) organization. To further study the association of BAZ1A and genes involved in ECM organization, we carried out an interaction network analysis between the differentially expressed ECM organization genes and BAZ1A using GeneMANIA. Interestingly, COL1A2 was among the genes that showed close interaction with BAZ1A. COL1A2, which causes Ehler-Dahnlos syndrome (inherited connective tissue), was shown to have direct genetic interaction with BAZ1A. ACF1 is known to directly interact with the promoter region of many VDR regulated genes leading to their suppression. In this study, we show that CYP24A1 is upregulated in the ID child and in the cells transfected with mut-BAZ1A plasmid. To test if the de novo mutation decreases the association of ACF1 to its target regions and consequently relieves the repression of CYP24A1 in the patient, we performed chromatinimmunoprecipitation assay (ChIP) using human ACF1 antibodies on fresh blood monocytes extracted from the trio. We then conducted qrtPCR to evaluate the occupancy of ACF1 on the CYP24A1 promoter. The results 45 indicated that ACF1 is significantly less enriched on the CYP24A1 promoter in the patient compared with the parents, thereby explaining the increased expression level of CYP24A1 in the patient. In summary, we detected a de novo mutation in BAZ1A encoding the ATP-utilizing chromatin assembly and remodeling factor ACF1. Consistent with the established role of ACF1 in chromatin remodeling and transcriptional regulation, our RNA sequencing data revealed that the de novo mutation influenced the expression profile of several biological pathways in the patient but most significantly pathways involved in extracellular matrix organization, vitamin D3 metabolism and synaptic functions. We performed functional studies and showed that the mutated ACF1 protein is less efficient in suppressing the expression of the CYP24A1. Moreover, we demonstrate that the overexpression of CYP24A1, SYNGAP1 and COL1A2, detected in the patient is highly correlated with the reported clinical features of the patient. Our results suggest that BAZ1A mutation is the causative mutation. 46 Concluding remarks and future perspectives The findings presented in this thesis contribute towards an improved understanding of the human transcriptome in healthy and diseased conditions and highlight the advantages of developing novel methods and strategies to obtain global and comprehensive views of RNA expression and processing. In Paper I, we show that sequencing of total RNA provides unique insights into RNA processing in cells. Our results revealed that cotranscriptional splicing is a widespread mechanism in human and chimpanzee brain tissues. We also found that co-transcriptional splicing is more frequent in long introns than in short introns. Moreover, we found a correlation between slowly removed introns and alternative splicing. This study is among the first to analyze global co-transcriptional splicing events. More recently, several global studies investigating different tissues and cell lines in human, mouse, Drosophila and S. cerevisiae, have further supported our finding that splicing occurs primarily co-transcriptionally134-139. Challenges for the future include designing studies to fully explore the significance of co-transcriptional and post-transcriptional splicing on the fate of RNA and how co-transcriptional splicing contributes to gene expression regulation. In Paper II, we exploited the benefits of exome capture approaches in combination with RNA-seq to detect transcripts expressed at low levels and to identify novel exons and splice junctions. We demonstrate that exome capture can be successfully applied to double stranded cDNA. We also show that this approach increases the sensitivity for identifying transcripts expressed at levels below the detection threshold in conventional RNA-seq methods. Moreover, this approach led to the identification of large numbers of novel exons and splice isoforms. Therefore, we highlight the advantages of this approach for genome-wide discovery of novel transcripts, alternative splice variants and fusion transcripts. This approach has also been used by Mercer et al. where they were able to detect large number of transcripts expressed at very low levels162. The sequencing depth required to detect such low level transcripts with conventional RNA-seq would be very costly, making the capture strategy an attractive alternative. However, it is reasonable to assume that in the near future, conventional RNA-seq will be able to achieve sufficient resolution at affordable costs. Regardless, we have yet to answer the important question regarding the functional relevance of low-level transcripts. 47 In Paper III, we demonstrate the advantages of performing RNA-seq on separated cytoplasmic and nuclear RNA fractions. Compared to poly(A) and total RNA, cytoplasmic RNA-seq yields a substantially higher fraction of exonic sequences with minimal levels of intronic reads, offering increased sensitivity for measuring expression levels, detecting splice junctions and assembling transcripts de novo. On the other hand, data from nuclear RNAseq contains higher sequence coverage over introns compared to total RNAseq, providing a preferable fraction to study ongoing transcription, splicing dynamics and intron retention patterns. Recently, there has been considerable effort focused on the development of methods to perform sequencing on different RNA subtypes. Published data from these efforts have led to interesting findings on nascent transcription and splicing dynamics163,164. Future RNA-seq experiments on subcellular RNA fractions may provide additional novel insights into the subcellular localization of different RNA types, identify nuclear retained transcripts, translation dynamics and RNA turnover rates. In Paper IV, we describe the identification of a non-synonymous de novo mutation in BAZ1A encoding the ATP-utilizing chromatin assembly and remodeling factor 1 (ACF1), in a patient with unexplained ID. The mutation was found to affect the expression of many genes involved in extra cellular matrix organization, synaptic function and vitamin D3 metabolism. The differential expression of CYP24A, SYNGAP1 and COL1A2 correlates with the clinical diagnosis of the patient. Given that ACF1 is involved in transcription repression of many genes, and that the mutation likely decreases ACF1’s binding affinity to its targets. Therefore, future experiments will include a ChIP-seq of ACF1 in the trio to further identify affected genes. BAZ1A was not reported previously to be linked to ID, therefore we will also scan additional ID patients for mutations in this gene. The development of next-generation sequencing technologies was certainly one of the greatest breakthroughs since the end of the Human Genome Project. These technologies have, for the first time, provided a platform to obtain a global view of genomes. Applying NGS to RNA has pushed the borders of this technology and allowed a comprehensive analysis of the transcriptome, contributing towards the goal of a complete understanding of splicing patterns, gene structures and gene expression profiles in both physiological and pathological conditions. Initial RNA-seq studies have uncovered extremely complex, yet exquisitely organized, gene expression programs. Now, rapid improvements in RNA-seq technologies, methods and analysis are continuously contributing knowledge about the composition of the transcriptome and the mechanisms that regulate and control gene expression. We now realize that the transcriptome is a very dynamic system, where transcription is functionally and kinetically coupled to pre-mRNA splicing, and influences mRNA polyadenylation, stability and export. As a result, the 48 decision to express a gene is controlled by the combinatorial effect of all these mechanisms. Despite the great amount of knowledge we have gained over the years about gene expression programs, we are just scratching the surface in terms of understanding transcriptome regulation. For example, we know that premRNA splicing and transcription are coupled, and that splicing in most cases occurs co-transcriptionally. However, we still lack detailed information about the extent of co-transcriptional splicing in vivo, how it occurs and how it is coupled with other RNA processing mechanisms. These questions also apply to many other mechanisms involved in gene expression. To be able to achieve this kind of information, multidisciplinary work is required. Sequencing technologies are working in our favor and are developing at a rapid pace to obtain longer sequence reads and greater data output. Increasing length of sequence reads will enable facilitated read mapping, identification of differentially spliced isoforms and facilitated linkage of novel exons to known genes. It will also allow us to obtain deeper insight into RNA processing mechanisms such as intron removal order, co-transcriptional splicing frequencies in different tissues and coupling between different RNA processing mechanisms. Increasing data output and sequencing depth could be very useful to perform comprehensive profiling of transcriptome composition. Moreover, more attention has to be paid to experimental design, especially to the quality and type of RNA to be used in experiments. Low quality RNA can significantly influence the interpretation of the results. Selective RNA extraction from different subcellular pools of RNA has become a very powerful strategy to perform more focused analysis of RNA. These more focused studies can also lead to better understanding of the biogenesis and processing of specific RNA subtypes. In the forthcoming years, I believe that we will witness many exciting discoveries in the transcriptomics field, especially with the development of 3rd generation amplification-free sequencing technologies such as PacBio, which promises to improve the sensitivity of gene expression quantification, and to avoid biases associated with PCR amplification and highly expressed transcripts. As it is becoming increasingly clear that seemingly homogenous cell population might have different gene expression profiles, single cell transcriptome analysis will provide new opportunities to decipher expression patterns linked to tissue heterogeneity in normal and tumor cells. The hope for the future is to integrate RNA-seq data with complementary sequencing and omics applications to obtain a comprehensive understanding of the complete regulatory networks of gene expression in development and disease. 49 Acknowledgements The research in this thesis was performed at the Department of Immunology, Genetics and Pathology, Medical Faculty, Uppsala University, Sweden. The research was funded by the Swedish Foundation for Strategic Research (SSF). Completing a PhD is truly a marathon event and I would not have been able to complete this journey without the help and support of the people around me; to only some of whom it is possible to give a particular mention here. First and foremost I would like to express my sincere gratitude and deepest appreciation to my supervisor Lars Feuk, whose help, knowledge, patience, commitment and scholarship to the highest standards has inspired and motivated me and set an example I hope to match some day. Thank you for your trust and belief in me when I really needed it, encouraging me to follow my ideas and for the great working environment you have provided in the lab. Thank you so much Lars for everything! It is always a pleasure and fun to work with you. You are such a wonderful person and an excellent supervisor. Thank you to my co supervisor Ulf Landegren for all your support, encouragement and fruitful discussions and for always being there whenever I needed help. You are a great resource to have around. My deepest appreciation goes to Lucia Cavelier. I really don't how to thank you for all the things you have done for me. You have always been there for me in all aspects. You are a real friend, a great collaborator and a true inspiration. I want to thank you also for your artistic skills, which has made the cover of this thesis look so fashionable. To my current and previous colleagues in Lars Feuk’s group, Eva L, for being such a kind, helpful and supportive colleague, Jonatan H, for being an important player in all the papers presented in this thesis, (…and by the way, I am in for sequencing the vampire squid genome ), Jin Z, for being such a nice colleague, I wish you all the luck with your PhD), Anna J, thank you for being the sport spirit of our group and Johan S, we are missing your stylish outfits on Fridays. 50 I would like to offer my special thanks to our collaborators, Ulf Gyllensten, Adam Ameur, Lucia Cavelier, Lotta Thuresson, Manfred Grabherr, Antonia Kalushkova and Helena Jernberg Wiklund. I also want to thank the Uppsala Genome Center for performing all the sequencing experiments for the studies presented in this thesis. Special thanks to Inger J, Inger G, Stefan, Ignas, Sebastian, Nina, Cecilia, Christian and Olga. Olga, I always enjoyed eating your home made delicious cookies! I would also like to express my gratitude to the administration and IT departments for always being so nice and patient and for all your professional support. During my years in Uppsala, I have had the pleasure of meeting many wonderful people that have become true friends! Thank you for making me feel at home. I would like to show my greatest appreciation to my awesome friends and colleagues! Friends from Rudbeck: Spyros, Fiona, Lucy, Nikki, Frank, Lesley, Larry, Jelena, Mi, Sara K, Antonia, Sara B, Veronica, Tatjana, Marco M, Justyna, Demet, Vasil, Jimmy, Mattias, Joakim K, Linnea, Ida, Sanna, Michaela and Emma for all the fun in and outside of Rudbeck. Yo Nikk, I am so glad you are back to Sweden. By the way, I still only understand 31,5% of what you’re saying. Spyros! I hope you will behave as a toastmaster, despite my doubts. Thanks for everything buddy! Special thanks also to Lars Forsberg, Tom A, Anders H and Adam A, you are real friends. Lars, I will never forget the days we had in San Francisco and Vegas . Anders, you are a friend to trust. Thanks for the great days in Hawaii. Tom, my kettlebell partner, thank you for always being around. Adam, one day I will be able to top your skills in billiard. Störningsjouren har kommit till fel lägenhet, eller hur? Big thanks to Joakim B, Ola, Tomas and Paul for all the fun we have had on our “guys nights with high drinkability”. Paul, you are a loyal friend with many talents, thanks for your true friendship. Thank you to the previous SLE group; Anna-Karin, Prassad, Madhu, Serena, Angelica, Sara L and Caroline. Angelica, you are a very special friend. Thank you for all the years and for being a nice colleague, friend and conference companion. Caroline, first of all thank you for introducing me to Lars 51 . You were always a caring person and very generous to share your knowledge and experience. Thank you also for proofreading this thesis. Many thanks also to the people in the B11-4 BMC corridor for creating such a unique working environment! Friends outside Rudbeck To the T-shirts crew; Daniel C, Daniel O, Ola, Clair, Tanai, Sahar, Jessica, Anna E, Tod, Tobias J, Tobias B, Angelica and Emilia. Thank you guys for all the great adventures in Sälen and Norreda. I am deeply grateful to Sten Dahlborg for letting me know that there is a world outside science and for all our fruitful discussions. I truly enjoyed all of our meetings and discussions that have shaped my view of the business world. I already miss the sashimi at Miss Voon. Jordanian friends Thank you to all my friends in Jordan; Muath, Mohamad T, Musa, Khaled A and Ihab O. Your friendship has been so precious to me all these years. You have always been there for me even though I have been many miles away! I am so lucky to have had the unconditional support and encouragement from my big family in Jordan. Thank you to my uncles, aunts and cousins. Apologies for not being able to mention all of your names, you are rather a big group, and I would need a couple of pages only for you all. A big thank you to you all! To my parents Mustafa and Hanan D, I would have never become the person I am today without you. Your unconditional love, support and encouragement are invaluable. You have always been there for me no matter what…. To my brothers and sister, Hamze, Tarek, Osama and Seema, you are my treasures. I could never ask for a better family. A thank you will never be enough! Finally, my greatest appreciation and love to Hanan H, who left her family, friends and work to be with me. Thank you for your endless love and support and for taking care of me especially over the last few months. Thank you! Ammar Zaghlool Uppsala 2013 52 References 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. Watson, J.D. & Crick, F.H. Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature 171, 737-8 (1953). Lander, E.S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860-921 (2001). Venter, J.C. et al. The sequence of the human genome. Science 291, 1304-51 (2001). Finishing the euchromatic sequence of the human genome. Nature 431, 931-45 (2004). Bernstein, B.E. et al. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57-74 (2012). Lareau, L.F., Green, R.E., Bhatnagar, R.S. & Brenner, S.E. The evolving roles of alternative splicing. Curr Opin Struct Biol 14, 273-82 (2004). Yang, X.J. Multisite protein modification and intramolecular signaling. Oncogene 24, 1653-62 (2005). Prasanth, K.V. & Spector, D.L. Eukaryotic regulatory RNAs: an answer to the 'genome complexity' conundrum. Genes Dev 21, 11-42 (2007). Mattick, J.S. RNA regulation: a new genetics? Nat Rev Genet 5, 316-23 (2004). Shabalina, S.A. & Spiridonov, N.A. The mammalian transcriptome and the function of non-coding DNA sequences. Genome Biol 5, 105 (2004). Johnson, J.M., Edwards, S., Shoemaker, D. & Schadt, E.E. Dark matter in the genome: evidence of widespread transcription detected by microarray tiling experiments. Trends Genet 21, 93-102 (2005). Mattick, J.S. The functional genomics of noncoding RNA. Science 309, 1527-8 (2005). van Bakel, H., Nislow, C., Blencowe, B.J. & Hughes, T.R. Most "dark matter" transcripts are associated with known genes. PLoS Biol 8, e1000371 (2010). Djebali, S. et al. Landscape of transcription in human cells. Nature 489, 101-8 (2012). Szymanski, M. & Barciszewski, J. Regulatory RNAs in mammals. Handb Exp Pharmacol, 45-72 (2006). Frith, M.C., Pheasant, M. & Mattick, J.S. The amazing complexity of the human transcriptome. Eur J Hum Genet 13, 894-7 (2005). Hahn, S. Structure and mechanism of the RNA polymerase II transcription machinery. Nat Struct Mol Biol 11, 394-403 (2004). Kadonaga, J.T. Regulation of RNA polymerase II transcription by sequencespecific DNA binding factors. Cell 116, 247-57 (2004). Kornberg, R.D. Eukaryotic transcriptional control. Trends Cell Biol 9, M46-9 (1999). Orphanides, G., Lagrange, T. & Reinberg, D. The general transcription factors of RNA polymerase II. Genes Dev 10, 2657-83 (1996). 53 21. Juven-Gershon, T. & Kadonaga, J.T. Regulation of gene expression via the core promoter and the basal transcriptional machinery. Dev Biol 339, 225-9 (2010). 22. Kim, T.H. et al. A high-resolution map of active promoters in the human genome. Nature 436, 876-80 (2005). 23. Szutorisz, H., Dillon, N. & Tora, L. The role of enhancers as centres for general transcription factor recruitment. Trends Biochem Sci 30, 593-9 (2005). 24. Ptashne, M. & Gann, A. Transcriptional activation by recruitment. Nature 386, 569-77 (1997). 25. Buratowski, S. The CTD code. Nat Struct Biol 10, 679-80 (2003). 26. Edmonds, M. & Abrams, R. Polynucleotide biosynthesis: formation of a sequence of adenylate units from adenosine triphosphate by an enzyme from thymus nuclei. J Biol Chem 235, 1142-9 (1960). 27. Szentirmay, M.N., Musso, M., Van Dyke, M.W. & Sawadogo, M. Multiple rounds of transcription by RNA polymerase II at covalently cross-linked templates. Nucleic Acids Res 26, 2754-60 (1998). 28. Luger, K., Mader, A.W., Richmond, R.K., Sargent, D.F. & Richmond, T.J. Crystal structure of the nucleosome core particle at 2.8 A resolution. Nature 389, 251-60 (1997). 29. Felsenfeld, G. & Groudine, M. Controlling the double helix. Nature 421, 44853 (2003). 30. Woodcock, C.L. Chromatin architecture. Curr Opin Struct Biol 16, 213-20 (2006). 31. Edmondson, D.G. & Roth, S.Y. Chromatin and transcription. FASEB J 10, 1173-82 (1996). 32. Knezetic, J.A. & Luse, D.S. The presence of nucleosomes on a DNA template prevents initiation by RNA polymerase II in vitro. Cell 45, 95-104 (1986). 33. Lorch, Y., LaPointe, J.W. & Kornberg, R.D. Nucleosomes inhibit the initiation of transcription but allow chain elongation with the displacement of histones. Cell 49, 203-10 (1987). 34. Li, B., Carey, M. & Workman, J.L. The role of chromatin during transcription. Cell 128, 707-19 (2007). 35. Paranjape, S.M., Kamakaka, R.T. & Kadonaga, J.T. Role of chromatin structure in the regulation of transcription by RNA polymerase II. Annu Rev Biochem 63, 265-97 (1994). 36. Kouzarides, T. Chromatin modifications and their function. Cell 128, 693-705 (2007). 37. Clapier, C.R. & Cairns, B.R. The biology of chromatin remodeling complexes. Annu Rev Biochem 78, 273-304 (2009). 38. Saha, A., Wittmeyer, J. & Cairns, B.R. Chromatin remodelling: the industrial revolution of DNA around histones. Nat Rev Mol Cell Biol 7, 437-47 (2006). 39. Ehrenhofer-Murray, A.E. Chromatin dynamics at DNA replication, transcription and repair. Eur J Biochem 271, 2335-49 (2004). 40. Smith, C.L. & Peterson, C.L. ATP-dependent chromatin remodeling. Curr Top Dev Biol 65, 115-48 (2005). 41. Ewing, A.K., Attner, M. & Chakravarti, D. Novel regulatory role for human Acf1 in transcriptional repression of vitamin D3 receptor-regulated genes. Mol Endocrinol 21, 1791-806 (2007). 42. Lickert, H. et al. Baf60c is essential for function of BAF chromatin remodelling complexes in heart development. Nature 432, 107-12 (2004). 54 43. van Attikum, H., Fritsch, O., Hohn, B. & Gasser, S.M. Recruitment of the INO80 complex by H2A phosphorylation links ATP-dependent chromatin remodeling with DNA double-strand break repair. Cell 119, 777-88 (2004). 44. Davis, P.K. & Brackmann, R.K. Chromatin remodeling and cancer. Cancer Biol Ther 2, 22-9 (2003). 45. Morris, C.A. & Mervis, C.B. Williams syndrome and related disorders. Annu Rev Genomics Hum Genet 1, 461-84 (2000). 46. Zentner, G.E. & Scacheri, P.C. The chromatin fingerprint of gene enhancer elements. J Biol Chem 287, 30888-96 (2012). 47. Santen, G.W. et al. Mutations in SWI/SNF chromatin remodeling complex gene ARID1B cause Coffin-Siris syndrome. Nat Genet 44, 379-80 (2012). 48. O'Roak, B.J. et al. Sporadic autism exomes reveal a highly interconnected protein network of de novo mutations. Nature 485, 246-50 (2012). 49. Kosho, T. et al. Clinical correlations of mutations affecting six components of the SWI/SNF complex: detailed description of 21 patients and a review of the literature. Am J Med Genet A 161, 1221-37 (2013). 50. Ward, A.J. & Cooper, T.A. The pathobiology of splicing. Journal of Pathology 220, 152-163 (2010). 51. Gao, K.P., Masuda, A., Matsuura, T. & Ohno, K. Human branch point consensus sequence is yUnAy. Nucleic Acids Research 36, 2257-2267 (2008). 52. Will, C.L. & Luhrmann, R. Spliceosome Structure and Function. Cold Spring Harbor Perspectives in Biology 3(2011). 53. Zhou, Z.L., Licklider, L.J., Gygi, S.P. & Reed, R. Comprehensive proteomic analysis of the human spliceosome. Nature 419, 182-185 (2002). 54. Rappsilber, J., Ryder, U., Lamond, A.I. & Mann, M. Large-scale proteomic analysis of the human splicesome. Genome Research 12, 1231-1245 (2002). 55. Kramer, A. The structure and function of proteins involved in mammalian premRNA splicing. Annual Review of Biochemistry 65, 367-409 (1996). 56. Wahl, M.C., Will, C.L. & Luhrmann, R. The Spliceosome: Design Principles of a Dynamic RNP Machine. Cell 136, 701-718 (2009). 57. Wu, Q. & Krainer, A.R. AT-AC Pre-mRNA splicing mechanisms and conservation of minor introns in voltage-gated ion channel genes. Molecular and Cellular Biology 19, 3225-3236 (1999). 58. Hall, S.L. & Padgett, R.A. Requirement of U12 snRNA for in vivo splicing of a minor class of eukaryotic nuclear pre-mRNA introns. Science 271, 17161718 (1996). 59. Patel, A.A. & Steitz, J.A. Splicing double: Insights from the second spliceosome. Nature Reviews Molecular Cell Biology 4, 960-970 (2003). 60. Turunen, J.J., Niemela, E.H., Verma, B. & Frilander, M.J. The significant other: splicing by the minor spliceosome. Wiley Interdisciplinary Reviews-Rna 4, 61-76 (2013). 61. Konig, H., Matter, N., Bader, R., Thiele, W. & Muller, F. Splicing segregation: The minor spliceosome acts outside the nucleus and controls cell proliferation. Cell 131, 718-729 (2007). 62. Pessa, H.K. et al. Minor spliceosome components are predominantly localized in the nucleus. Proc Natl Acad Sci U S A 105, 8655-60 (2008). 63. Valadkhan, S. snRNAs as the catalysts of pre-mRNA splicing. Curr Opin Chem Biol 9, 603-8 (2005). 64. Ast, G. How did alternative splicing evolve? Nat Rev Genet 5, 773-82 (2004). 65. Ares, M., Jr., Grate, L. & Pauling, M.H. A handful of intron-containing genes produces the lion's share of yeast mRNA. RNA 5, 1138-9 (1999). 55 66. Lopez, P.J. & Seraphin, B. Genomic-scale quantitative analysis of yeast premRNA splicing: implications for splice-site recognition. RNA 5, 1135-7 (1999). 67. Black, D.L. Mechanisms of alternative pre-messenger RNA splicing. Annual Review of Biochemistry 72, 291-336 (2003). 68. Flicek, P. et al. Ensembl 2012. Nucleic Acids Research 40, D84-90 (2012). 69. Modrek, B. & Lee, C. A genomic view of alternative splicing. Nat Genet 30, 13-9 (2002). 70. Berget, S.M., Moore, C. & Sharp, P.A. Spliced segments at the 5' terminus of adenovirus 2 late mRNA. Proc Natl Acad Sci U S A 74, 3171-5 (1977). 71. Coffin, J.M., Hughes, S.H. & Varmus, H.E. The Interactions of Retroviruses and their Hosts. (1997). 72. Gatherer, D. et al. High-resolution human cytomegalovirus transcriptome. Proc Natl Acad Sci U S A 108, 19755-60 (2011). 73. Black, D.L. Protein diversity from alternative splicing: a challenge for bioinformatics and post-genome biology. Cell 103, 367-70 (2000). 74. Nilsen, T.W. & Graveley, B.R. Expansion of the eukaryotic proteome by alternative splicing. Nature 463, 457-63 (2010). 75. Graveley, B.R. Alternative splicing: increasing diversity in the proteomic world. Trends Genet 17, 100-7 (2001). 76. Irimia, M. & Blencowe, B.J. Alternative splicing: decoding an expansive regulatory layer. Curr Opin Cell Biol 24, 323-32 (2012). 77. Stamm, S. et al. Function of alternative splicing. Gene 344, 1-20 (2005). 78. Kelemen, O. et al. Function of alternative splicing. Gene 514, 1-30 (2013). 79. Yap, K. & Makeyev, E.V. Regulation of gene expression in mammalian nervous system through alternative pre-mRNA splicing coupled with RNA quality control mechanisms. Mol Cell Neurosci (2013). 80. Rowen, L. et al. Analysis of the human neurexin genes: alternative splicing and the generation of protein diversity. Genomics 79, 587-97 (2002). 81. Ichtchenko, K. et al. Neuroligin 1: a splice site-specific ligand for betaneurexins. Cell 81, 435-43 (1995). 82. Missler, M., Hammer, R.E. & Sudhof, T.C. Neurexophilin binding to alphaneurexins. A single LNS domain functions as an independently folding ligandbinding unit. J Biol Chem 273, 34716-23 (1998). 83. Wollerton, M.C., Gooding, C., Wagner, E.J., Garcia-Blanco, M.A. & Smith, C.W. Autoregulation of polypyrimidine tract binding protein by alternative splicing leading to nonsense-mediated decay. Mol Cell 13, 91-100 (2004). 84. Orengo, J.P. & Cooper, T.A. Alternative splicing in disease. Adv Exp Med Biol 623, 212-23 (2007). 85. Buratti, E., Baralle, M. & Baralle, F.E. Defective splicing, disease and therapy: searching for master checkpoints in exon definition. Nucleic Acids Research 34, 3494-510 (2006). 86. Faustino, N.A. & Cooper, T.A. Pre-mRNA splicing and human disease. Genes Dev 17, 419-37 (2003). 87. Lopez-Bigas, N., Audit, B., Ouzounis, C., Parra, G. & Guigo, R. Are splicing mutations the most frequent cause of hereditary disease? FEBS Lett 579, 19003 (2005). 88. Tazi, J., Bakkour, N. & Stamm, S. Alternative splicing and disease. Biochim Biophys Acta 1792, 14-26 (2009). 89. Fu, X.D. Towards a splicing code. Cell 119, 736-8 (2004). 56 90. Chen, M. & Manley, J.L. Mechanisms of alternative splicing regulation: insights from molecular and genomics approaches. Nat Rev Mol Cell Biol 10, 741-54 (2009). 91. Graveley, B.R. Sorting out the complexity of SR protein functions. RNA 6, 1197-211 (2000). 92. Del Gatto-Konczak, F., Olive, M., Gesnel, M.C. & Breathnach, R. hnRNP A1 recruited to an exon in vivo can function as an exon splicing silencer. Molecular and Cellular Biology 19, 251-60 (1999). 93. Grover, A. et al. 5' splice site mutations in tau associated with the inherited dementia FTDP-17 affect a stem-loop structure that regulates alternative splicing of exon 10. J Biol Chem 274, 15134-43 (1999). 94. Zhu, J., Mayeda, A. & Krainer, A.R. Exon identity established through differential antagonism between exonic splicing silencer-bound hnRNP A1 and enhancer-bound SR proteins. Mol Cell 8, 1351-61 (2001). 95. Wang, E.T. et al. Alternative isoform regulation in human tissue transcriptomes. Nature 456, 470-6 (2008). 96. David, C.J. & Manley, J.L. The search for alternative splicing regulators: new approaches offer a path to a splicing code. Genes Dev 22, 279-85 (2008). 97. Castle, J.C. et al. Expression of 24,426 human alternative splicing events and predicted cis regulation in 48 tissues and cell lines. Nat Genet 40, 1416-25 (2008). 98. Grabowski, P.J. & Black, D.L. Alternative RNA splicing in the nervous system. Prog Neurobiol 65, 289-308 (2001). 99. Wang, E.T. et al. Transcriptome-wide regulation of pre-mRNA splicing and mRNA localization by muscleblind proteins. Cell 150, 710-24 (2012). 100. Boutz, P.L. et al. A post-transcriptional regulatory switch in polypyrimidine tract-binding proteins reprograms alternative splicing in developing neurons. Genes Dev 21, 1636-52 (2007). 101. Ule, J. et al. Nova regulates brain-specific splicing to shape the synapse. Nat Genet 37, 844-52 (2005). 102. McKee, A.E. et al. A genome-wide in situ hybridization map of RNA-binding proteins reveals anatomically restricted expression in the developing mouse brain. BMC Dev Biol 5, 14 (2005). 103. Makeyev, E.V., Zhang, J., Carrasco, M.A. & Maniatis, T. The MicroRNA miR-124 promotes neuronal differentiation by triggering brain-specific alternative pre-mRNA splicing. Mol Cell 27, 435-48 (2007). 104. Kornblihtt, A.R. et al. Alternative splicing: a pivotal step between eukaryotic transcription and translation. Nat Rev Mol Cell Biol 14, 153-65 (2013). 105. Moore, M.J. & Proudfoot, N.J. Pre-mRNA processing reaches back to transcription and ahead to translation. Cell 136, 688-700 (2009). 106. Shukla, S. & Oberdoerffer, S. Co-transcriptional regulation of alternative premRNA splicing. Biochim Biophys Acta 1819, 673-83 (2012). 107. Kornblihtt, A.R. Chromatin, transcript elongation and alternative splicing. Nat Struct Mol Biol 13, 5-7 (2006). 108. Munoz, M.J., de la Mata, M. & Kornblihtt, A.R. The carboxy terminal domain of RNA polymerase II and alternative splicing. Trends Biochem Sci 35, 497504 (2010). 109. Phatnani, H.P. & Greenleaf, A.L. Phosphorylation and functions of the RNA polymerase II CTD. Genes Dev 20, 2922-36 (2006). 110. Cramer, P. et al. Coupling of transcription with alternative splicing: RNA pol II promoters modulate SF2/ASF and 9G8 effects on an exonic splicing enhancer. Mol Cell 4, 251-8 (1999). 57 111. Carrillo Oesterreich, F., Bieberstein, N. & Neugebauer, K.M. Pause locally, splice globally. Trends Cell Biol 21, 328-35 (2011). 112. Perales, R. & Bentley, D. "Cotranscriptionality": the transcription elongation complex as a nexus for nuclear transactions. Mol Cell 36, 178-91 (2009). 113. de la Mata, M. et al. A slow RNA polymerase II affects alternative splicing in vivo. Mol Cell 12, 525-32 (2003). 114. de la Mata, M., Lafaille, C. & Kornblihtt, A.R. First come, first served revisited: factors affecting the same alternative splicing event have different effects on the relative rates of intron removal. RNA 16, 904-12 (2010). 115. Dutertre, M. et al. Cotranscriptional exon skipping in the genotoxic stress response. Nat Struct Mol Biol 17, 1358-66 (2010). 116. Schmidt, U. et al. Real-time imaging of cotranscriptional splicing reveals a kinetic model that reduces noise: implications for alternative splicing regulation. J Cell Biol 193, 819-29 (2011). 117. Kadener, S., Fededa, J.P., Rosbash, M. & Kornblihtt, A.R. Regulation of alternative splicing by a transcriptional enhancer through RNA pol II elongation. Proc Natl Acad Sci U S A 99, 8185-90 (2002). 118. Alexander, R.D., Innocente, S.A., Barrass, J.D. & Beggs, J.D. Splicingdependent RNA polymerase pausing in yeast. Mol Cell 40, 582-93 (2010). 119. Luco, R.F., Allo, M., Schor, I.E., Kornblihtt, A.R. & Misteli, T. Epigenetics in alternative pre-mRNA splicing. Cell 144, 16-26 (2011). 120. Allo, M. et al. Chromatin and alternative splicing. Cold Spring Harb Symp Quant Biol 75, 103-11 (2010). 121. Hughes, T.A. Regulation of gene expression by alternative untranslated regions. Trends Genet 22, 119-22 (2006). 122. Luco, R.F. & Misteli, T. More than a splicing code: integrating the role of RNA, chromatin and non-coding RNA in alternative splicing regulation. Curr Opin Genet Dev 21, 366-72 (2011). 123. Graveley, B.R. Alternative splicing: regulation without regulators. Nat Struct Mol Biol 16, 13-5 (2009). 124. Beyer, A.L. & Osheim, Y.N. Splice site selection, rate of splicing, and alternative splicing on nascent transcripts. Genes Dev 2, 754-65 (1988). 125. Misteli, T., Caceres, J.F. & Spector, D.L. The dynamics of a pre-mRNA splicing factor in living cells. Nature 387, 523-7 (1997). 126. Neugebauer, K.M. & Roth, M.B. Distribution of pre-mRNA splicing factors at sites of RNA polymerase II transcription. Genes Dev 11, 1148-59 (1997). 127. Zhang, G., Taneja, K.L., Singer, R.H. & Green, M.R. Localization of premRNA splicing in mammalian nuclei. Nature 372, 809-12 (1994). 128. Listerman, I., Sapra, A.K. & Neugebauer, K.M. Cotranscriptional coupling of splicing factor recruitment and precursor messenger RNA splicing in mammalian cells. Nat Struct Mol Biol 13, 815-22 (2006). 129. Kotovic, K.M., Lockshon, D., Boric, L. & Neugebauer, K.M. Cotranscriptional recruitment of the U1 snRNP to intron-containing genes in yeast. Molecular and Cellular Biology 23, 5768-79 (2003). 130. Pandya-Jones, A. & Black, D.L. Co-transcriptional splicing of constitutive and alternative exons. RNA 15, 1896-908 (2009). 131. Lacadie, S.A. & Rosbash, M. Cotranscriptional spliceosome assembly dynamics and the role of U1 snRNA:5'ss base pairing in yeast. Mol Cell 19, 65-75 (2005). 132. Vargas, D.Y. et al. Single-molecule imaging of transcriptionally coupled and uncoupled splicing. Cell 147, 1054-65 (2011). 58 133. Das, R. et al. SR proteins function in coupling RNAP II transcription to premRNA splicing. Mol Cell 26, 867-81 (2007). 134. Carrillo Oesterreich, F., Preibisch, S. & Neugebauer, K.M. Global analysis of nascent RNA reveals transcriptional pausing in terminal exons. Mol Cell 40, 571-81 (2010). 135. Khodor, Y.L. et al. Nascent-seq indicates widespread cotranscriptional premRNA splicing in Drosophila. Genes Dev 25, 2502-12 (2011). 136. Girard, C. et al. Post-transcriptional spliceosomes are retained in nuclear speckles until splicing completion. Nat Commun 3, 994 (2012). 137. Windhager, L. et al. Ultrashort and progressive 4sU-tagging reveals key characteristics of RNA processing at nucleotide resolution. Genome Research 22, 2031-42 (2012). 138. Khodor, Y.L., Menet, J.S., Tolan, M. & Rosbash, M. Cotranscriptional splicing efficiency differs dramatically between Drosophila and mouse. RNA 18, 217486 (2012). 139. Tilgner, H. et al. Deep sequencing of subcellular RNA fractions shows splicing to be predominantly co-transcriptional in the human genome but inefficient for lncRNAs. Genome Research 22, 1616-25 (2012). 140. Sachs, A. The role of poly(A) in the translation and stability of mRNA. Curr Opin Cell Biol 2, 1092-8 (1990). 141. Fabian, M.R., Sonenberg, N. & Filipowicz, W. Regulation of mRNA translation and stability by microRNAs. Annual Review of Biochemistry 79, 351-79 (2010). 142. Andreassi, C. & Riccio, A. To localize or not to localize: mRNA fate is in 3'UTR ends. Trends Cell Biol 19, 465-74 (2009). 143. Millevoi, S. & Vagner, S. Molecular mechanisms of eukaryotic pre-mRNA 3' end processing regulation. Nucleic Acids Research 38, 2757-74 (2010). 144. Proudfoot, N.J. Ending the message: poly(A) signals then and now. Genes Dev 25, 1770-82 (2011). 145. Mandel, C.R., Bai, Y. & Tong, L. Protein factors in pre-mRNA 3'-end processing. Cell Mol Life Sci 65, 1099-122 (2008). 146. Shi, Y. Alternative polyadenylation: new insights from global analyses. RNA 18, 2105-17 (2012). 147. McCracken, S. et al. The C-terminal domain of RNA polymerase II couples mRNA processing to transcription. Nature 385, 357-61 (1997). 148. Shi, Y. et al. Molecular architecture of the human pre-mRNA 3' processing complex. Mol Cell 33, 365-76 (2009). 149. Di Giammartino, D.C., Nishida, K. & Manley, J.L. Mechanisms and consequences of alternative polyadenylation. Mol Cell 43, 853-66 (2011). 150. Pinto, P.A. et al. RNA polymerase II kinetics in polo polyadenylation signal selection. Embo Journal 30, 2431-44 (2011). 151. Willingham, A.T. & Gingeras, T.R. TUF love for "junk" DNA. Cell 125, 121520 (2006). 152. Marioni, J.C., Mason, C.E., Mane, S.M., Stephens, M. & Gilad, Y. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Research 18, 1509-17 (2008). 153. Nagalakshmi, U. et al. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320, 1344-9 (2008). 154. Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5, 6218 (2008). 59 155. Sultan, M. et al. A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science 321, 956-60 (2008). 156. Pan, Q., Shai, O., Lee, L.J., Frey, B.J. & Blencowe, B.J. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat Genet 40, 1413-5 (2008). 157. Trapnell, C., Pachter, L. & Salzberg, S.L. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25, 1105-11 (2009). 158. Pflueger, D. et al. Discovery of non-ETS gene fusions in human prostate cancer using next-generation RNA sequencing. Genome Research 21, 56-67 (2011). 159. Maher, C.A. et al. Transcriptome sequencing to detect gene fusions in cancer. Nature 458, 97-101 (2009). 160. Guttman, M. et al. Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat Biotechnol 28, 503-10 (2010). 161. Chepelev, I., Wei, G., Tang, Q. & Zhao, K. Detection of single nucleotide variations in expressed exons of the human genome using RNA-Seq. Nucleic Acids Research 37, e106 (2009). 162. Mercer, T.R. et al. Targeted RNA sequencing reveals the deep complexity of the human transcriptome. Nat Biotechnol 30, 99-104 (2012). 163. Core, L.J., Waterfall, J.J. & Lis, J.T. Nascent RNA sequencing reveals widespread pausing and divergent initiation at human promoters. Science 322, 1845-8 (2008). 164. Churchman, L.S. & Weissman, J.S. Nascent transcript sequencing visualizes transcription at nucleotide resolution. Nature 469, 368-73 (2011). 165. Pandya-Jones, A. et al. Splicing kinetics and transcript release from the chromatin compartment limit the rate of Lipid A-induced gene expression. RNA 19, 811-27 (2013). 166. Bhatt, D.M. et al. Transcript dynamics of proinflammatory genes revealed by sequence analysis of subcellular RNA fractions. Cell 150, 279-90 (2012). 167. Solnestam, B.W. et al. Comparison of total and cytoplasmic mRNA reveals global regulation by nuclear retention and miRNAs. BMC Genomics 13, 574 (2012). 168. Liao, J.Y. et al. Deep sequencing of human nuclear and cytoplasmic small RNAs reveals an unexpectedly complex subcellular distribution of miRNAs and tRNA 3' trailers. PLoS One 5, e10563 (2010). 169. Wilhelm, B.T. & Landry, J.R. RNA-Seq-quantitative measurement of expression through massively parallel RNA-sequencing. Methods 48, 249-57 (2009). 170. Ng, S.B. et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature 461, 272-6 (2009). 171. Bamshad, M.J. et al. Exome sequencing as a tool for Mendelian disease gene discovery. Nat Rev Genet 12, 745-55 (2011). 172. Kiezun, A. et al. Exome sequencing and the genetic basis of complex traits. Nat Genet 44, 623-30 (2012). 173. Veltman, J.A. & Brunner, H.G. De novo mutations in human genetic disease. Nat Rev Genet 13, 565-75 (2012). 174. Rauch, A. et al. Range of genetic mutations associated with severe nonsyndromic sporadic intellectual disability: an exome sequencing study. Lancet 380, 1674-82 (2012). 175. Vissers, L.E. et al. A de novo paradigm for mental retardation. Nat Genet 42, 1109-12 (2010). 60 176. Iossifov, I. et al. De novo gene disruptions in children on the autistic spectrum. Neuron 74, 285-99 (2012). 177. Crow, J.F. The origins, patterns and implications of human spontaneous mutation. Nat Rev Genet 1, 40-7 (2000). 178. Eyre-Walker, A. & Keightley, P.D. The distribution of fitness effects of new mutations. Nat Rev Genet 8, 610-8 (2007). 179. Lynch, M. Rate, molecular spectrum, and consequences of human mutation. Proc Natl Acad Sci U S A 107, 961-8 (2010). 180. Roach, J.C. et al. Analysis of genetic inheritance in a family quartet by wholegenome sequencing. Science 328, 636-9 (2010). 181. Shcherbik, N., Wang, M., Lapik, Y.R., Srivastava, L. & Pestov, D.G. Polyadenylation and degradation of incomplete RNA polymerase I transcripts in mammalian cells. EMBO Rep 11, 106-11 (2010). 182. Tilgner, H. et al. Deep sequencing of subcellular RNA fractions shows splicing to be predominantly co-transcriptional in the human genome but inefficient for lncRNAs. Genome Res 22, 1616-25 (2012). 183. Brody, Y. et al. The in vivo kinetics of RNA polymerase II elongation during co-transcriptional splicing. PLoS Biol 9, e1000573 (2011). 184. Grabherr, M.G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 29, 644-52 (2011). 61 Acta Universitatis Upsaliensis Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Medicine 939 Editor: The Dean of the Faculty of Medicine A doctoral dissertation from the Faculty of Medicine, Uppsala University, is usually a summary of a number of papers. A few copies of the complete dissertation are kept at major Swedish research libraries, while the summary alone is distributed internationally through the series Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Medicine. Distribution: publications.uu.se urn:nbn:se:uu:diva-209390 ACTA UNIVERSITATIS UPSALIENSIS UPPSALA 2013
© Copyright 2026 Paperzz