Surveying the transcriptome using RNA-seq
Sarah Djébali Quélen
GenPhySE, INRA, Toulouse
EMBO Bioinformatics and genome analyses course
May 12th 2016 - Izmir, Turkey

Outline
● Introduction
● Gene annotation
● RNA-seq experimental protocol
● RNA-seq basic processing pipeline
  ○ read mapping: STAR
  ○ gene and transcript expression quantification: RSEM
● References
● Data analysis hands-on
Species: higher eukaryotes for the lecture, yeast for the hands-on

Hands-on data (GSE78204)
● Samples:
  ○ Three yeast (S. cerevisiae) strains:
    ■ ASK10WT
    ■ ask10Δ
    ■ ASK10M475R
  ○ 2 biological replicates per strain
● Sequencing:
  ○ Paired-end
  ○ Unstranded
● References from UCSC:
  ○ Genome: sacCer3 (April 2011)
  ○ Annotation: SGD genes on sacCer3

Molecular biology dogma
● About 1% of the human genome codes for proteins, yet much more of it (about 60%) is transcribed
● The genome is the same in all cell types, yet cell types differ in function → the transcriptome can help us understand function

RNA transcription and processing
Primary RNA transcripts are extensively processed:
● capping
● splicing
● polyadenylation
● editing
Regulated processing means that each gene can produce many distinct transcript isoforms: one gene, many transcripts.
The transcriptome is distinct from, and more complicated than, the genome.
The transcriptome cannot be predicted from the genome sequence alone: it must be measured.

Genome and transcriptome
[Figure: a gene and its isoforms]
Some genomic definitions:
● Genome: the full DNA complement of a species' cell
● Gene: the physical region of a chromosome producing some kind of RNA transcript
● Isoforms: distinct RNAs arising from the gene through differential exon inclusion
● Transcript: an RNA molecule produced from the gene; it corresponds to one of the isoforms
● Transcriptome: the full RNA complement of a species' cell

Complexity of RNA arising from differential processing
● Alternative transcription start site (TSS)
● Alternative exon inclusion
● Alternative transcription termination site (TTS)
These processing events can result in:
● Different protein products
● Differentially transcriptionally regulated mRNAs
● Differentially post-transcriptionally regulated mRNAs
● Non-protein coding isoforms

Reference gene annotation
● For a given species and associated genome assembly, the reference gene annotation is the collection of all genes known for this species
● The gene annotation (and even the genome assembly) can be at various stages of completion depending on the species, from as yet unassembled species like the lynx and the ant to species with a high-quality annotation like human, mouse, drosophila, C. elegans and yeast
● It is important to choose the reference gene annotation carefully beforehand, since it represents the known transcriptome against which the RNA-seq transcriptome will be compared

Reference gene annotation for human and mouse: Gencode
http://www.gencodegenes.org/

Several features:
● gene
● transcript
● exon
● CDS
● UTR
4 broad gene categories:
● protein-coding genes
● long non-coding genes
● pseudogenes
● small non-coding genes
3 confidence levels:
● level 1: validated
● level 2: manually annotated
● level 3: automatically annotated

The Gencode reference human gene annotation
● The systematic manual annotation of the human genome (VEGA → Havana) officially started in 2003 with 1% of the genome (the ENCODE regions, within the ENCODE project)
● From 2008 to 2015:
  ○ the total number of annotated protein-coding genes (PCG) has remained quite stable, around 20,000
  ○ the number of pseudogenes has increased from 10,000 to 14,000
  ○ the number of lncRNAs has increased dramatically, from 6,000 to 15,000
● Unlike other annotation projects, Gencode has always made a particular effort to annotate:
  ○ pseudogenes
  ○ lncRNAs

Gencode lncRNA gene annotation
● Gencode has always annotated lncRNA genes, but used to group them all under “processed_transcript”
● Since more and more of them are found and they have gained interest, Gencode now classifies them more finely, partly using their location with respect to PCGs:
  3prime_overlapping_ncrna: Transcripts where ditag and/or published experimental data strongly support the existence of long non-coding transcripts transcribed from the 3'UTR.
  sense_intronic: Long non-coding transcript in introns of a coding gene that does not overlap any exons.
  sense_overlapping: Long non-coding transcript that contains a coding gene in its intron on the same strand.
  antisense: Transcript believed to be an antisense product used in the regulation of the gene to which it belongs.
  non_coding: Transcript which is known from the literature to not be protein coding.
  processed_transcript: Doesn't contain an ORF.
  lincRNA: Long, intervening noncoding (linc)RNAs that can be found in evolutionarily conserved, intergenic regions.

GTF format features
[Figure: GTF format features]

RNA-seq experimental protocol

RNA sequencing (RNA-seq)
● Measures gene and transcript expression in different conditions, developmental stages...
● Discovers / annotates novel elements: genes (coding and non-coding), transcripts, exons, junctions
● Makes it possible to study alternative splicing and polyadenylation
● Makes it possible to discover chimeric junctions, circular RNAs, eQTLs, sQTLs, ...

Experimental design
[Figure: factors shaping the experimental design. Labels: Biological, Technical, Economical; Organism, Cell type, Treatment, Replicates, Sequencing technology, Budget, Hardware, Ethical restrictions, Expertise, Control. Goals: Differential gene expression, Novel gene annotation, Novel isoform annotation. Requirements: conditions, replicates, multiple conditions, long reads]

RNA-seq experiment
Library preparation → Sequencing → Analysis

RNA purification protocol
● PolyA+
  ○ get rid of the ribosomal RNAs and purify mature polyadenylated transcripts
● PolyA-
  ○ enrich for non-mature RNAs
● Ribo-
  ○ get rid of the ribosomal RNAs but capture both mature and non-mature RNAs

Experimental variables of RNA-seq
● Cellular localization: Whole cell, Nucleus, Cytoplasm, Chromatin, Exosome
● RNA purification: Total RNA, PolyA+, PolyA-, Ribo-
● Size selection: Long (>200nt), Short (<200nt)
● Preparation: Single end, Paired end
● Strandness: Stranded, Unstranded
● Special protocols: Single-cell RNA-seq, Nascent RNA-seq (GRO-seq/NUN-seq), miRNA-seq

Experimental variables of RNA-seq: our hands-on
[Figure: the same table with the options used in the hands-on highlighted; the hands-on data is paired-end and unstranded]

RNA-seq read mapping
Input data:
● single-end or paired-end reads
● base call qualities for each base in each read
● quality offset
Prior biological knowledge:
● reference genome sequence
● reference gene annotation

RNA-seq read mapping
Find a correspondence between the query sequences (the RNA-seq reads) and our prior knowledge (the reference genome sequence, the reference gene annotation).
⇒ correctly determine each read's corresponding location in the reference genome/transcriptome

RNA-seq read mapping
● Unlike DNA-seq and ChIP-seq reads, RNA-seq reads may span introns (~1-100kb in mammals)
  ○ They need to be mapped in a non-contiguous way
● Many RNA-seq read mappers rely heavily on a contiguous (DNA) read mapper and use one of two strategies:
  ○ make an exon pair database from the annotation and then map the reads to the genome and to this exon pair database (e.g. TopHat (Bowtie), the GEM RNA pipeline (GEM))
  ○ split the reads into sequential blocks and map them independently and contiguously (e.g. MapSplice)
● Some other RNA-seq read mappers can directly map the reads to the genome while allowing introns (e.g. STAR, CRAC)

RNA-seq read mapping
● Contiguous read mapping
  ○ Alignment: definition and basic concept
  ○ Indexing: concept and strategies
  ○ Seed-and-extend mapping strategy
  ○ Paired-end RNA-seq read mapping
● The STAR program
● RNA-seq read mapping
  ○ Variables to consider
  ○ Read mapping statistics
  ○ SAM/BAM format

Alignment
A common technique for mapping is alignment:
Reference: CATGGAACTTATCTCACAGCCTTT
Read:      CATGGAACTT-TCGCACAGCCTTT
Not always easy:
● Reads are short with respect to the genome (~100 bp)
● The human genome is ∼3 Gbp long and rather repetitive
● The reference genome differs from the sample genome (SNPs, indels, structural variants)
● Reads are prone to errors (at best, about 1 in 1000 base calls is wrong)

Alignment
● online versus indexed
● global versus local
● sequence similarity
  ○ mismatches as base substitutions (A→T)
  ○ insertions/deletions or gaps
  ○ block transpositions or rearrangements
● multimaps
● heuristic vs exhaustive
Given a distance metric (e.g. mismatches) and a threshold (e.g. 96% identity), the alignment is exhaustive if it contains all possible matches in the reference for that distance and threshold.

Indexing
Pre-compute the reference text into an index, providing fast sorted access to substrings of the reference
● based on the idea of sorting the content of the reference to provide quick access to certain regions
● most indices provide exact access to suffixes (k-mers) of the reference
● an index is computed once and used many times
Index structures:
  ○ Hash based (e.g. GSNAP, MrFast)
  ○ Suffix arrays (e.g. STAR)
  ○ BWT based (FM-index) (e.g. GEM, BWA, Bowtie)

Indexing strategies
● indexing the reference (most common choice):
  ○ each read is mapped individually
  ○ reference is usually big but is fixed
  ○ read/sample size is unknown and variable
● indexing the reads:
  ○ reference is scanned to perform the mapping
  ○ makes sense with small references (e.g. yeast)
● indexing both the reference and the reads:
  ○ high memory consumption: keeps both indices

Seed-and-extend mapping strategy
An alignment of a sequence of length m with at most k differences must contain an exact match at least s = ⌊m/(k+1)⌋ bp long.
  i. extract seeds (usually exact)
  ii. look up each of them in the index
  iii. “extend” the search to validate the alignments
E.g. BLAST, Bowtie2, ...
● sensitivity depends on seed length and overlap
● a poor choice of seed might lead to unmapped reads
● not exhaustive

Paired-end alignment
Both ends of the fragments are sequenced → paired-end reads
● Connectivity information
● Insert size and read length are known in advance (from library preparation)
● The insert size distribution can be used to resolve ambiguities (or even enhance the mapping process)

The STAR program
● STAR: Spliced Transcripts Alignment to a Reference
● Chosen as RNA-seq read mapper by many international consortia (ENCODE2/3, TCGA, Blueprint)
● Advantages:
  ○ proven to be one of the best at the RGASP2 competition (Engström et al, Nat Meth, 2013)
  ○ fast
  ○ actively maintained by its author, Alex Dobin
  ○ has the potential to accurately align long (several kilobases) reads emerging from third-generation sequencing technologies (PacBio, Oxford Nanopore)
● Drawbacks:
  ○ truncates reads and produces some false-positive junctions
  ○ needs quite a lot of RAM

The STAR workflow
[Figure: the STAR workflow: clustering, stitching, scoring. Credit: Alex Dobin, CSHL]

STAR seed search: basic idea
● Sequential search for a Maximal Mappable Prefix (MMP)
  ○ where MMP is similar to the Maximal Exact (Unique) Match concept used by Mummer and MAUVE
● Given a read sequence R, a read location i and a reference genome sequence G, MMP(R,i,G) is defined as the longest substring (R_i, R_{i+1}, …, R_{i+MML-1}) of R that exactly matches one or more substrings of G, where MML is the maximal mappable length
● The MMP search is implemented through an uncompressed suffix array (SA)

[Figure: MMP in STAR for detecting (a) splice junctions, (b) mismatches and (c) tails. Dobin et al, Bioinformatics, 2013]

Seed stitching strategy
● Most DNA aligners use the seed-and-extend strategy
  ○ Local alignment, affine gap penalty, dynamic programming
  ○ Non-zero gap extension penalty: cannot deal with very large gaps (100kb!)
● STAR uses the “seed-stitching” strategy: → build the best local alignment out of all seeds
(Alex Dobin, CSHL)

Anchor seeds and windows
● All seeds that map <10,000 times are recorded
  ○ 10-mers map about 10,000 times on average in the human genome
● “Anchors”: seeds that map <50 times
● “Alignment windows”: genome regions around anchors
  ○ Size of the window ~ maximum intron size, ~1Mb for human
● All seeds inside alignment windows are stitched together
(Alex Dobin, CSHL)

Stitching two seeds
[Figure: seed 1 and seed 2. No gap or insertion: simple stitching]

Gapped stitching
● Simplify the problem by limiting to 1 gap only but an arbitrary number of mismatches
  ○ Then the gap length is known; only the gap position needs to be found
  ○ Time ~ O(UnmappedReadBases), compared to ~ O(UnmappedReadBases × UnmappedGenomeBases) for standard dynamic programming
● Issues:
  ○ Needs one seed for each exon
  ○ Cannot detect indels near splice junctions or near read ends

Stitching all seeds
● For each window, consider all possible seed combinations
● N seeds: 2^N combinations, which only works for short reads (<200 bases)
● For longer reads: simple dynamic programming scheme:
  ○ each seed is stitched to all the preceding seeds within a window
  ○ the best total score is recorded for every seed
  ○ O(N²), better scaling for longer reads

! Always make sure that your reference genome and your annotation use the same chromosome names!
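
The warning about chromosome names can be checked before mapping. The following is a minimal Python sketch, not part of the STAR or RSEM tooling: it assumes a FASTA genome (names taken from the `>` header lines, up to the first whitespace) and a tab-separated GTF annotation (names taken from column 1). The function names and the toy sequence names are hypothetical.

```python
def fasta_names(lines):
    """Return sequence names: the text after '>' up to the first whitespace."""
    names = set()
    for line in lines:
        if line.startswith(">") and line[1:].strip():
            names.add(line[1:].split()[0])
    return names


def gtf_names(lines):
    """Return the names found in column 1 of a GTF file, skipping comments."""
    names = set()
    for line in lines:
        if line.strip() and not line.startswith("#"):
            names.add(line.split("\t", 1)[0])
    return names


def compare_names(fasta_lines, gtf_lines):
    """Return (names only in the annotation, names only in the genome)."""
    genome = fasta_names(fasta_lines)
    annotation = gtf_names(gtf_lines)
    return sorted(annotation - genome), sorted(genome - annotation)


# Toy input with hypothetical names: the annotation calls 'II' what the genome calls 'chrII'.
toy_fasta = [">chrI", "ACGT", ">chrII", "GGCC"]
toy_gtf = [
    'chrI\tSGD\texon\t1\t4\t.\t+\t.\tgene_id "g1";',
    'II\tSGD\texon\t1\t4\t.\t+\t.\tgene_id "g2";',
]
only_in_annotation, only_in_genome = compare_names(toy_fasta, toy_gtf)
print(only_in_annotation, only_in_genome)
```

With real files, open file handles can be passed in place of the lists, since both functions only iterate over lines. Any name reported as present only in the annotation points to features that cannot be placed on the genome as named.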

RNA-seq read mapping
Specific variables to consider when mapping RNA-seq data
● intron size
● overhang
  ○ number of bases on each side of the junction that should be covered by the read
● splice site consensus
  ○ donor/acceptor splice site consensus sequences
● junction “filtering”:
  ○ chromosome/strand
  ○ block order
  ○ min/max distance

RNA-seq read mapping statistics
● total reads
● mapped reads (number and %)
● uniquely mapped reads (number and %)
● mappings (including multimaps)
● genomic regions (number and %)

SAM format
Sequence Alignment/Map
[Figure: a SAM file, with its header section and its alignment section]

SAM format
Sequence Alignment/Map
Flag: https://broadinstitute.github.io/picard/explain-flags.html
CIGAR:
- N → skipped region (intron)
- M → alignment match (match or mismatch)
- I → insertion
- D → deletion
- S → soft-clip
Full specification of the SAM format: https://samtools.github.io/hts-specs/SAMv1.pdf

BAM format
Compressed binary representation of the SAM format
● specific block compression
  ○ BGZF
● supports random access through the index: fast retrieval of alignments overlapping a specified region
! A BAM file must be sorted by genomic position (chromosome name and leftmost coordinate) in order to be indexed!

Gene expression quantification

Gene expression quantification
● To quantify the expression of a gene, a simple idea is to count the RNA-seq reads that fall within the projected exons of this gene: in the pictured example, 7 + 1 + 2 + 8 = 18 reads fall into that gene
● As a consequence:
  ○ in a given experiment, long genes (cumulative exon length) will get more reads than short genes
  ○ in an experiment with a high number of mapped reads, a gene will get more reads than in an experiment with a small number of mapped reads

Gene expression quantification
● Since we need to compare the expression of:
  ○ several genes in the same experiment
  ○ the same gene in different experiments
● Mortazavi et al. (2008) introduced RPKM = Reads Per Kilobase of exon model per Million mapped reads, which normalizes the read count of a gene in an experiment by both:
  ○ the length of the gene
  ○ the number of mapped reads in the experiment (library size)

Gene and transcript expression measures
● Raw read counts
● RPKM = Reads Per Kilobase of exon per Million mapped reads (Mortazavi et al, 2008)
● FPKM = Fragments Per Kilobase of exon per Million mapped reads (Cufflinks) (Trapnell et al, 2010)
● TPM = Transcripts Per Million (RSEM) (Li and Dewey, 2011)
TPM = 10^6 × FPKM / (∑ FPKM over all genes (transcripts))
Advantages: ∑TPM = 10^6 for any sample; TPM does not depend on the mean length of the expressed transcripts; TPM is a technology-independent measure, since it is simply a fraction (scaled to one million). It is now the preferred gene and transcript expression measure.

Individual transcript expression
● Gene expression is “quite” easy to compute, but estimating the expression of the individual transcripts of each gene is a difficult problem: in the pictured example, do the 2 circled reads come from the red or from the blue transcript?
● Read deconvolution or transcript isoform quantification
● There are 2 categories of transcript isoform quantifiers:
  ○ read-centric (Cufflinks, IsoEM, RSEM, Sailfish, eXpress, Kallisto)
  ○ exon-centric (Poisson model, linear regression approaches like rQuant, IsoLasso, SLIDE, flux capacitor)

Individual transcript expression
● An increasing number of programs (mostly read-centric ones such as RSEM, Sailfish, Kallisto) only use a mapping of the reads to the transcriptome (reference annotation) as input to produce transcript expression
● This works well for well-annotated species, but it is likely to fail with an incomplete annotation, since it will tend to overestimate the expression of the annotated transcripts → one possibility is to first make a transcript assembly (directly from the reads or from their mapping to the genome) and use it as the transcriptome/annotation

The RSEM program
● RSEM: RNA-seq by Expectation Maximization
  ○ Parameters = transcript abundances
  ○ Hidden variable = alignment
● Transcript-level alignment
● No need for a reference genome; requires a set of reference transcripts (e.g. from a de novo transcriptome assembler, an EST database...)
● Computes maximum likelihood (ML) abundance estimates using the EM algorithm
● Good handling of multimaps, leading to accurate transcript and gene expression quantifications

Length dependence (this slide and the following ones on quantification: Colin Dewey)
• Assuming reads are of length 1 (usually they are > 35bp)
• Prob. of a read coming from a transcript ∝ relative abundance × length
[Figure: transcripts 1, 2 and 3 of lengths 200, 60 and 80; reads: 100 A, 60 C, 40 G]

The basics of quantification from RNA-seq data
• Basic assumption: θ_i = P(read from transcript i) = τ_i ℓ_i / Z, where τ_i is the expression level (relative abundance) of transcript i and ℓ_i its length
• The normalization factor Z = ∑_j τ_j ℓ_j is the mean length of expressed transcripts

The basics of quantification from RNA-seq data
• Estimate the probability of reads being generated from a given transcript by counting the number of reads that align to that transcript:
  θ_i ≈ (# reads mapping to transcript i) / (total # of mappable reads)
• Convert to expression levels by normalizing by transcript length: τ_i ∝ θ_i / ℓ_i

What if reads do not uniquely map to transcripts?
• The approach described assumes that every read can be uniquely aligned to a single transcript
• This is generally not the case
• Some genes have similar sequences: gene families, repetitive sequences
• Alternative splice forms of a gene share a significant fraction of sequence

Multi-mapping reads in RNA-seq
Species    Read length    % multi-mapping reads
Mouse      25             17%
Mouse      75             10%
Maize      25             52%
Axolotl    76             23%
• Throwing away multi-mapping reads leads to
  1. Loss of information
  2. Potentially biased estimates of abundance

Distributions of alignment counts
[Figure: distributions of alignment counts]

What if reads do not uniquely map to transcripts?
• “multi-read”: a read that could have been derived from multiple transcripts
[Figure: transcripts 1, 2 and 3 of lengths 200, 60 and 80; reads: 90 A, 40 C, 40 G, 30 T]
• How would you estimate the relative abundances for these transcripts?

Some options for handling multireads
- Discard all multi-reads, estimate based on uniquely mapping reads only
- Discard multi-reads, but use the “unique length” of each transcript in calculations
- “Rescue” multi-reads by allocating (fractions of) them to the transcripts → three-step algorithm
  1. Estimate abundances based on uniquely mapping reads only
  2. For each multiread, divide it between the transcripts to which it maps, proportionally to their abundances estimated in the first step
  3. Recompute abundances based on updated counts for each transcript

Rescue method example - Steps 1 to 3
[Figure, one slide per step: transcripts 1, 2 and 3 of lengths 200, 60 and 80; reads: 90 A, 40 C, 40 G, 30 T]

An observation about the rescue method
• Note that at the end of the rescue algorithm, we have an updated set of abundance estimates
• These new estimates could be used to reallocate the multi-reads
• And then we could update our abundance estimates once again
• And repeat!
• This is the intuition behind the statistical approach to this problem

EM Algorithm
• Expectation-Maximization for RNA-seq
• E-step: Compute expected read counts given current expression levels
• M-step: Compute expression values maximizing likelihood given expected read counts
• Rescue algorithm ≈ 1 iteration of EM

Improved accuracy over unique and rescue
[Figure: gene-level expression estimation; predicted expression level versus true expression level]

Improving accuracy on repetitive genomes: maize
[Figure: gene-level expression estimation; predicted expression level versus true expression level]

The RSEM program workflow
● Alignment of reads against reference transcript sequences (Bowtie)
● Calculation of the relative abundances

Wiggle output: expected read depth at each position of the genome
BAM output: probabilistically weighted read alignments. Paired reads are connected by a thin black line.
The darkness of the read indicates the posterior probability of its alignment (black meaning high probability)

STAR/RSEM gene/transcript expression quantification pipeline
● STAR/RSEM was shown to be one of the best pipelines for gene and transcript expression quantification, on the basis of the reproducibility of quantification on two bioreplicates (Teng et al, Genome Biology, 2016)
● STAR/RSEM has been chosen as the gene and transcript expression quantification pipeline by many international consortia (ENCODE3, TCGA, Blueprint)

The ENCODE3 STAR-RSEM pipeline (Alex Dobin, CSHL)
Reads are mapped to the genome with STAR, which then internally converts genome mappings to transcriptome mappings (from genome to transcriptome coordinates). RSEM takes this transcriptome mapping as input.

The Grape pipeline (Emilio Palumbo, CRG)
https://github.com/guigolab/grape-nf

References
1. Annotation
  a. Ensembl: Curwen,..., Clamp, The Ensembl automatic gene annotation system, Genome Res, 2004 / Flicek,...,Searle, Ensembl 2013, Nucleic Acids Res, 2013 / http://www.ensembl.org/index.html
  b. UCSC: Hsu,..., Haussler, The UCSC Known Genes, Bioinformatics, 2006 / http://genome.ucsc.edu/
  c. Gencode: Harrow,...,Hubbard, GENCODE: the reference human genome annotation for The ENCODE Project, Genome Res, 2012
2. Sequencing technology review: Metzker, Sequencing technologies - the next generation, Nat Rev Genet, 2010
3. Mapping program review:
  a. Fonseca,…, Marioni, Tools for mapping high-throughput sequencing data, Bioinformatics, 2012
  b. Ruffalo,..., Koyutürk, Comparative analysis of algorithms for next-generation sequencing read alignment, Bioinformatics, 2011
4. Mapping algorithms:
  a. GEMsuite: Santiago Marco-Sola et al., The GEM mapper: fast, accurate and versatile alignment by filtration, Nat Methods, 2012 / http://algorithms.cnag.cat/wiki/The_GEM_library
  b. Ricardo A. Baeza-Yates, Chris H. Perleberg, Fast and practical approximate string matching, Information Processing Letters, Volume 59, Issue 1, 8 July 1996
  c. Alex Dobin et al., STAR: ultrafast universal RNA-seq aligner, Bioinformatics, 2013 Jan 1;29(1):15-21
5. SAM/BAM format: http://samtools.sourceforge.net/SAMv1.pdf
6. samtools: http://samtools.sourceforge.net/

References (continued)
1. bamtools: Barnett,…, Marth, BamTools: a C++ API and toolkit for analyzing and managing BAM files, Bioinformatics, 2011
2. bedtools: Quinlan, …, Hall, BEDTools: a flexible suite of utilities for comparing genomic features, Bioinformatics, 2010 / http://code.google.com/p/bedtools/
3. RPKM definition: Mortazavi,..., Wold, Mapping and quantifying mammalian transcriptomes by RNA-Seq, Nat Methods, 2008
4. Multisplice: Huang, …, Liu, A robust method for transcript quantification with RNA-seq data, RECOMB 2012
5. Review on transcript quantification methods: Pachter, Models for transcript quantification from RNA-seq, arXiv, 2011
6. RSEM: Li, B., Ruotti, V., Stewart, R., Thomson, J., Dewey, C., RNA-Seq gene expression estimation with read mapping uncertainty, Bioinformatics, 2010
7. Lovén,...,Young, Revisiting global gene expression analysis, Cell, 2012
8. UCSC genome browser: http://genome.ucsc.edu/

RNA-seq basic processing hands-on
http://genoweb.toulouse.inra.fr/~sdjebali/courses/izmirmay2016/rnaseqbasic_proc/

Additional slides

Reference gene annotation
● Two projects gather reference gene annotations for many different species:
  ○ Ensembl (U.K.): http://www.ensembl.org/index.html
  ○ UCSC (U.S.A.): http://genome.ucsc.edu/
● They are both based on automated pipelines that:
  ○ rely heavily on experimental evidence present in public databases, such as ESTs, cDNAs and proteins, from the same or from other species
  ○ first align the experimental evidence to the genome, and then assemble the alignments into transcript structures using heuristic rules
  ○ for coding transcripts, determine the coordinates of the coding sequence
  ○ associate a unique id with each transcript and each gene
  ○ provide a gene name when evidence shows the gene is already known
● There is an ongoing effort to use RNA-seq data to help annotate genes, but for the moment it is only useful for genomes that are unassembled, very newly assembled, or lacking a closely related well-annotated species

Reference gene annotation
● The advantage of the Ensembl over the UCSC gene annotation is that it includes manual curation for human and mouse
● These two projects also provide easy-to-use online tools to query their databases and retrieve the underlying data:
  ○ Biomart for Ensembl: http://www.biomart.org/
  ○ Table browser for UCSC: http://genome.ucsc.edu/cgi-bin/hgTables
● They make it possible to visualize genome, annotation and data at the same time
● The UCSC project makes it possible to visualize and download Ensembl genes, while the Ensembl project only provides Ensembl genes
● The vast majority of reference long gene annotation sets are based on experimental evidence from polyadenylated samples, and thus will not contain transcripts seen only in non-polyadenylated samples

Gencode pseudogene annotation
[Figure: Gencode pseudogene annotation]

Reference gene annotation
● Well-studied species also have their own annotation databases, which Ensembl and UCSC also mirror:
  ○ D. melanogaster: flybase: http://flybase.org/
  ○ C. elegans: Wormbase: http://www.wormbase.org/
  ○ S. cerevisiae: SGD: http://www.yeastgenome.org/
Morrell et al., Nature Genetics review, 2012

Experimental variables of RNA-seq
[Same table as in the main part: cellular localization, RNA purification, size selection, single or paired end, strandness, special protocols]

Total RNA
[Figure: total RNA profile with the 18S and 28S rRNA peaks]
The amount of rRNA in the cell is very high, and when sequencing without depleting it, most of the sequences obtained would be rRNA. rRNAs are PolyA-, so purification with oligo-dT beads removes them. When selecting the PolyA+ transcripts, the proportion of intronic reads should be very low, because most RNAs will be fully processed. Depletion with Ribo- kits removes most of the rRNAs, but it is not 100% efficient; the remaining rRNAs should therefore be taken into account when mapping against the reference sequence. Ribo- purification may also yield a relatively high proportion of intronic reads, which should be taken into account as well.

Library preparation, part 1
● Purification of RNA
  ○ by oligo-dT beads or depletion of rRNAs
● Fragmentation (usually around 160 nt)
● Optional step: gel size selection
  ○ However, a lot of material is lost in this step

Library preparation, part 2, unstranded
● Reverse transcription (1st and 2nd strand)
● Adenylation of 3' ends
● Adapter ligation
● PCR amplification

Library preparation, part 2, stranded
● Reverse transcription (1st strand only)
● Use of dUTP instead of dTTP for the second strand cDNA synthesis
● Adenylation of 3' ends
● Adapter ligation
● Degradation of the second strand
● PCR amplification

Library preparation, stranded
Note: how the second strand is eliminated may differ between protocols.
Some protocols (used by ENCODE and Blueprint) digest the second strand using a UDGase (an enzyme that digests the uracil-containing strand). More recent protocols (ours, for instance) use a DNA polymerase that is not able to amplify the uracil-containing strand and thus only enriches the 1st strand.

Sequencing
● Sequencing platforms: Illumina, Ion Torrent, PacBio,...
● Spike-in inclusion?
● Randomized sequencing run?

Sequencing
Quail MA, et al. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics. 2012

Illumina sequencing
Bridge amplification
Metzker, Michael L. "Sequencing technologies - the next generation." Nature Reviews Genetics 11.1 (2010): 31-46.

Illumina sequencing
Four-color cyclic reversible termination
Sequencing by synthesis (SBS) technology

Third generation sequencing (PacBio, Oxford Nanopore)
● Sequences genomic material with neither fragmentation nor clonal amplification
● Yields longer reads (many tens of kb), but at the price of a much higher error rate than Illumina
● Has mostly been used for genome sequencing, since those reads can span complicated repeat-rich regions that are trickier to assemble using short reads

PacBio sequencing
For RNA-seq see: Sharon, Tilgner, Grubert, Snyder, Nature Biotechnology, 2013

Oxford Nanopore sequencing
From Irene González Navarrete, CRG

PacBio versus Oxford Nanopore for RNA-seq
From Alex Dobin, CSHL

How much to sequence?
● L: read length
● N: number of reads
● G: haploid genome length
COVERAGE C = LN / G
E.g. C = (100 bp) × (189×10^6) / (3×10^9 bp) = 6.3
http://www.illumina.com/documents/products/technotes/technote_coverage_calculation.pdf

Fastq files and read QC

FASTQ Format
A text-based format for storing biological sequences and their corresponding quality scores
[Figure: an example FASTQ record with its four lines annotated: the sequence id line and its 1st character, the raw sequence, the “+” line, and the quality codes]

FASTQ Format - summary
Four lines per sequence are used in a FASTQ file:
1. begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line)
2. the raw sequence
3. begins with a '+' character and is optionally followed by the same sequence identifier (and any description)
4. encodes the quality values for the sequence contained in line 2 (must contain the same number of symbols as the sequence)

FASTQ Format - quality offset
A quality value Q is an integer mapping of p (i.e., the probability that the corresponding base call is incorrect).
The most used formula is the Phred quality score: Q = -10 log10(p)

offset    max Phred score range    max ASCII range    real-world Phred score range    real-world ASCII range
33        0 - 93                   33 - 126           0 - 40                          33 - 73
64        0 - 62                   64 - 126           0 - 40                          64 - 104
https://en.wikipedia.org/wiki/FASTQ_format#Encoding

FastQC
[Figure: FastQC report]

Index structures
● Hash based
  ○ Simple idea: store k-mers/seeds/samples using some hash function H(·)
  ○ Usually requires a lot of space (several times the reference size)
● Suffix arrays
  ○ Sort the suffixes of the text, storing the sorted positions in an array
● FM-Index (BWT based)
  ○ Same logic as suffix arrays
  ○ Based on a compression scheme (BZIP)
    ■ Space efficient (index size comparable to the reference size)

STAR: Suffix arrays
[Figure: suffix arrays in STAR]

CRAM format
Improved compressed binary representation of SAM
● different compression formats
  ○ gzip, bzip2, CRAM records
● CRAM records use different encoding strategies, e.g. bases are reference compressed by encoding base differences rather than storing the bases themselves
● random access support through the format itself (slices)
! CRAM indexing is external to the file format itself and may change independently of the file format specification in the future

Gene expression quantification
● RPKM has been widely used for assessing gene expression, but it assumes that the absolute amount of total RNA in each cell is similar across different cell types or experimental perturbations, which is not always the case (Lovén et al., 2012)
● For example, Mortazavi et al. (2008) estimate that 3 RPKM corresponds to ~1 transcript per cell in mouse liver, while Klish et al. (2011) say that 1 RPKM corresponds to between 0.3 and 1 transcript per cell...
Li, Ruotti, Stewart, Thomson, Dewey, “RNA-seq gene expression estimation with read mapping uncertainty”, Bioinformatics, 26(4), 2010, 493-500.
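
The RPKM and TPM measures discussed in this deck can be made concrete with a few lines of Python. This is a minimal sketch under stated assumptions, not the RSEM or Cufflinks implementation: the gene names, counts and lengths are invented for illustration, the library size is taken to be the sum of the counts shown, and TPM is scaled so that it sums to one million per sample.

```python
def rpkm(counts, lengths):
    """Reads per kilobase of exon per million mapped reads.

    counts: reads per gene; lengths: cumulative exon length per gene, in bp.
    Assumption: the library size is the sum of the counts passed in.
    """
    total = sum(counts.values())
    return {g: counts[g] * 1e9 / (lengths[g] * total) for g in counts}


def tpm(counts, lengths):
    """Length-normalized rates rescaled to sum to one million."""
    rate = {g: counts[g] / lengths[g] for g in counts}
    norm = sum(rate.values())
    return {g: 1e6 * rate[g] / norm for g in counts}


# Hypothetical toy sample: geneB is three times longer and gets three times more reads.
toy_counts = {"geneA": 100, "geneB": 300, "geneC": 100}
toy_lengths = {"geneA": 1000, "geneB": 3000, "geneC": 500}

toy_rpkm = rpkm(toy_counts, toy_lengths)
toy_tpm = tpm(toy_counts, toy_lengths)
for gene in toy_counts:
    print(gene, round(toy_rpkm[gene]), round(toy_tpm[gene]))
```

In this toy sample geneA and geneB receive the same RPKM and the same TPM, because their read counts scale with their lengths. The TPM values always add up to one million, whereas the sum of the RPKM values depends on the lengths of the expressed genes.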
Data normalization
Normalization makes it possible to:
● Compare different datasets
● Compare different genes
● Remove unwanted variation
SEQC project: NATURE BIOTECHNOLOGY, Volume 32, Number 9, Sept. 2014

Normalization methods
● Scaling factors
  ➤ Upper quartile, mean, quantile
  ➤ Trimmed Mean of M-values (TMM)
  ➤ Z-score
● Variance stabilizing
  ➤ vst (DESeq2)
  ➤ rlog (DESeq2)
  ➤ voom
● Variance removing
  ➤ PEER
  ➤ RUV
  ➤ SVA
SEQC project: NATURE BIOTECHNOLOGY, Volume 32, Number 9, Sept. 2014

Individual transcript expression
● There are two categories of transcript isoform quantifiers:
  ○ read-centric (Cufflinks, IsoEM, RSEM, Sailfish, eXpress, Kallisto): assign each transcript fragment (paired-end read) probabilistically to a transcript by maximizing the joint likelihood of the read alignments, based on the distribution of transcript fragments
  ○ exon-centric (Poisson model, linear regression approaches like rQuant, IsoLasso, SLIDE, Flux Capacitor): consider the read abundance on an exonic segment as the cumulative abundance of all transcript isoforms. Each transcript is represented as a combination of exons, and the aim is to estimate individual transcript abundances from the observed read counts at each exon
● The RPKM of a gene can then be obtained by summing the RPKM of its constituent transcripts (assuming that reads were assigned to transcripts in a mutually exclusive way)

The basics of quantification with RNA-seq data
• For simplicity, suppose reads are of length one (typically they are > 35 bases)
[Figure: transcripts 1, 2 and 3, labelled with lengths 200, 60 and 80; reads: 100 A, 60 C, 40 G]
• What relative abundances would you estimate for these genes?
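One way to answer the question above is sketched below. It rests on an assumption about the figure, which is not reproduced here: that the 100 A, 60 C and 40 G reads come from transcripts 1, 2 and 3 respectively, whose lengths are 200, 60 and 80. Under that reading, a longer transcript offers more positions for a length-one read to start, so counts are divided by length before being normalized. The function name `relative_abundances` is hypothetical.

```python
from typing import List


def relative_abundances(read_counts: List[int], lengths: List[int]) -> List[float]:
    """Length-normalize read counts, then rescale so the values sum to 1."""
    rates = [count / length for count, length in zip(read_counts, lengths)]
    total = sum(rates)
    return [rate / total for rate in rates]


# Assumed mapping of the toy reads to transcripts 1, 2 and 3.
counts = [100, 60, 40]
lengths = [200, 60, 80]

print(relative_abundances(counts, lengths))  # [0.25, 0.5, 0.25]
```

Note that the raw read fractions (0.5, 0.3, 0.2) would rank transcript 1 highest, whereas after length normalization transcript 2 is the most abundant.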
The basics of quantification from RNA-Seq data
• Basic quantification algorithm
  • Align reads against a set of reference transcript sequences
  • Count the number of reads aligning to each transcript
  • Convert read counts into relative expression levels

Counts to expression levels
• RPKM - Reads Per Kilobase per Million mapped reads
• TPM - (estimate of) Transcripts Per Million
• Prefer TPM to RPKM/FPKM because of its normalization factor
• TPM is a technology-independent measure (simply a fraction)

Our solution - a generative probabilistic model
[Figure labels: transcript probabilities (expression levels), number of reads, transcript, fragment length, start position, read length, orientation, paired read, quality scores, read sequence]
Colin Dewey

Quantification as maximum likelihood inference
• Observed data likelihood
• Likelihood function is concave w.r.t. θ
• Has a global maximum (or global maxima)
• Expectation-Maximization for optimization
“RNA-Seq gene expression estimation with read mapping uncertainty” Li, B., Ruotti, V., Stewart, R., Thomson, J., Dewey, C. Bioinformatics, 2010

Approximate inference with read alignments
• Full likelihood computation requires O(NML^2) time
  • N (number of reads) ~ 10^7
  • M (number of transcripts) ~ 10^4
  • L (average transcript length) ~ 10^3
• Approximate by alignment: consider all local alignments of read n with at most x mismatches

HMM Interpretation
[Figure labels: start; transcript 1, transcript 2, transcript 3, ...; observed: read sequences]
[HMM figure labels, continued: transcript M; hidden: read start positions]
Learning parameters: Baum-Welch Algorithm (EM for HMMs)
Approximation: Only consider a subset of paths for each read

RSEM: the directed graphical model
● Generative probabilistic model, with associated inference by the EM algorithm

BED format
provides a flexible and compact way to represent genomic regions (with breaks)
● 3 required fields + 9 additional fields
● more compact than GFF: tradeoff between size and provided information
[Figure labels: region, required fields, block length, block position]

bedGraph and wig formats
bedGraph
● allows the display of continuous-valued data
● useful for probability scores and transcriptome data (ChIP-seq, RNA-seq)
● is a text file
Example:
  track type=bedGraph name="BedGraph Format" description="BedGraph format" visibility=full color=200,100,0 altColor=0,100,200 priority=20
  chr19 49302000 49302300 -1.0
  chr19 49302300 49302600 -0.75
wig
● allows the display of continuous-valued data
● more compact than bedGraph
● is a text file
Example:
  fixedStep chrom=chr3 start=400601 step=100
  11
  22
  33

bigBed, bigWig
Useful formats to display data on the UCSC genome browser
● BED, bedGraph, wig etc. are tab-delimited text files
● for each type of file there is a specific procedure to make a binary file
  ○ easily transferable
  ○ not so big
  ○ allows indexed access
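The relation between the wig and bedGraph text formats can be illustrated by converting the fixedStep example above into bedGraph intervals. This is a minimal sketch under stated assumptions: the function name `fixedstep_to_bedgraph` is hypothetical, only fixedStep sections are handled, `span` is taken to default to 1 when absent, and wig positions are treated as 1-based while bedGraph intervals are written as 0-based, half-open, following the UCSC conventions.

```python
from typing import Iterator, List, Tuple

Interval = Tuple[str, int, int, float]  # (chrom, start, end, value)


def fixedstep_to_bedgraph(lines: List[str]) -> Iterator[Interval]:
    """Convert fixedStep wig lines into bedGraph-style intervals."""
    chrom, position, step, span = None, 0, 1, 1
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "#")):
            continue  # skip blank lines, track lines and comments
        if line.startswith("fixedStep"):
            # Declaration line: key=value pairs after the keyword.
            fields = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = fields["chrom"]
            position = int(fields["start"])
            step = int(fields.get("step", 1))
            span = int(fields.get("span", 1))
            continue
        if chrom is None:
            raise ValueError("data line found before a fixedStep declaration")
        # Shift the 1-based wig position to a 0-based, half-open interval.
        yield (chrom, position - 1, position - 1 + span, float(line))
        position += step


wig_lines = [
    "fixedStep chrom=chr3 start=400601 step=100",
    "11",
    "22",
    "33",
]
for chrom, start, end, value in fixedstep_to_bedgraph(wig_lines):
    print(chrom, start, end, value, sep="\t")
```

The conversion shows why wig is the more compact of the two: a fixedStep section stores one value per line, while bedGraph repeats the chromosome and both coordinates on every line.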