Surveying the transcriptome using RNA-seq

Surveying the transcriptome using
RNA-seq
Sarah Djébali Quélen
GenPhySE, INRA, Toulouse
[email protected]
EMBO Bioinformatics and genome analyses course
May 12th 2016 - Izmir, Turquey
1
Outline
●
●
●
●
Introduction
Gene annotation
RNA-seq experimental protocol
RNA-seq basic processing pipeline
○ read mapping: STAR
○ gene and transcript expression
quantification: RSEM
● References
2
Outline
Species:
higher eukaryotes
●
●
●
●
Introduction
Gene annotation
RNA-seq experimental protocol
RNA-seq basic processing pipeline
○ read mapping: STAR
○ gene and transcript expression
quantification: RSEM
● References
Data analysis Hands-on
Species: yeast
3
Hands-on data (GSE78204)
● Samples:
○ Three yeast (S. cerevisiae) strains:
■ ASK10WT
■ ask10Δ
■ ASK10M475R
○ 2 bio-replicates per sample
● Sequencing:
○ Paired-end
○ Unstranded
● References from UCSC:
○ Genome: SaCer3 (April 2011)
○ Annotation: SGD genes on SaCer3
4
Molecular biology dogma
● 1% of the human genome produces proteins, however much more is
transcribed (60%)
● The genome is the same in all cell types, however not all cell types have
the same function → the transcriptome can help understanding function
5
RNA transcription and processing
Primary RNA transcripts
are extensively processed:
● capping
● splicing
● polyadenylation
● edited
Regulated processing
means that each gene
produces many distinct
transcript isoforms: one
gene, many transcripts
The transcriptome is distinct
from and more complicated
than the genome
The transcriptome cannot be predicted from the genome sequence alone: it must be
measured
6
Genome and transcriptome
Gene
Isoforms
Some genomic definitions:
● Genome: the full DNA complement of a species’ cell
● Gene: the physical region of a chromosome producing some kind or RNA
transcript
● Isoforms: distinct RNAs arising from the gene, through differential exon
inclusion
● Transcript: The RNA molecule from the gene, will correspond to one of the
isoforms
● Transcriptome: the full RNA complement of a species' cell
7
Complexity of RNA arising from
differential processing
Alternative transcription
start site (TSS)
Alternative exon inclusion
Alternative transcription
termination site (TTS)
These processing events can result in:
● Different protein products
● Differentially transcriptionally regulated mRNAs
● Differentially post-transcriptionally regulated mRNAs
● Non-protein coding isoforms
8
Reference gene annotation
● For a given species and associated genome assembly, the
reference gene annotation is the collection of all genes known
for this species
● The gene annotation (and even genome assembly) can be in
various completion stages depending on the species, from yet
unassembled species like the lynx and the ant to high-quality
annotated species like human, mouse, drosophila, C. elegans
and yeast
● It is important to choose well the reference gene annotation
beforehand since it will represent the known transcriptome the
RNA-seq transcriptome will be compared to
9
Reference gene annotation for
human and mouse: Gencode
10
http://www.gencodegenes.org/
11
Several features:
● gene
● transcript
● exon
● CDS
● UTR
4 broad gene categories:
● protein-coding genes
● long non-coding genes
● pseudogenes
● small non-coding genes
3 confidence levels:
● level 1: validated
● level 2: manually
annotated
● level 3: automatically
annotated
12
The Gencode reference human gene
annotation
● The systematic manual annotation of the human genome (VEGA →
Havana) officially started in 2003 with 1% of the genome (ENCODE
regions, within ENCODE project)
● From 2008 to 2015:
○ the total number of annotated protein-coding genes (PCG) has
remained quite stable around 20,000
○ the number of pseudogenes has increased from 10,000 to 14,000
○ the number of lncRNAs has dramatically increased from 6,000 to
15,000!!!
● Contrarily to others, Gencode has always made particular efforts to
annotate:
○ pseudogenes
○ lncRNAs
13
Gencode lncRNA gene annotation
●
Gencode has always annotated lncRNA genes, but was globally calling them
“processed_transcript”
●
Since we find more and more of them and they have gained interest,
Gencode now better classifies them, partly using their location wrt PCGs:
3prime_overlapping_ncrna
Transcripts where ditag and/or published experimental data strongly
supports the existence of long non-coding transcripts transcribed from the
3'UTR.
sense_intronic
Long non-coding transcript in introns of a coding gene that does not
overlap any exons.
sense_overlapping
Long non-coding transcript that contains a coding gene in its intron on the
same strand.
antisense
Transcript believed to be an antisense product used in the regulation of the
gene to which it belongs.
non_coding
Transcript which is known from the literature to not be protein coding.
processed_transcript
Doesn't contain an ORF.
lincRNA
Long, intervening noncoding (linc)RNAs, that can be found in evolutionarily
conserved, intergenic regions.
14
GTF format
features
15
RNA-seq experimental protocol
16
RNA sequencing (RNA-seq)
● Allows to measure gene and transcript
expression in different conditions, developmental
stages...
● Allows to discover / annotate novel elements:
genes (coding and non-coding), transcripts,
exons, junctions
● Allows to study alternative splicing and
polyadenylation
● Allows to discover chimeric junctions, circular
RNAs, eQTLs, sQTLs, ...
17
Experimental design
Biological
Organism
Cell type
Treatment
Replicates
Technical
Economical
Sequencing
technology
Budget
Hardware
Ethical
restrictions
Expertize
Control
Differential gene expression
Novel gene annotation
Novel isoform annotation
Conditions
: replicates
: multiple conditions
: long reads
18
RNA-seq experiment
Library preparation
Sequencing
Analysis
19
RNA purification protocol
● PolyA+
○ get rid of the ribosomal RNAs and purify
mature polyadenylated transcripts
● PolyA○ enrich for non-mature RNAs
● Ribo○ get rid of the ribosomal RNAs but capture both
mature and non-mature RNAs
20
Experimental variables of RNA-seq
Cellular
localization
Preparation
RNA
purification
Whole cell
Total RNA
Chromatin
PolyA+
Exosome
PolyA-
Size selection
Single end
Long (>200nt)
Paired end
Short (<200nt)
Strandness
Nucleus
RiboCytoplasm
Stranded
Unstranded
Special protocols
Single-cell RNA-seq
Nascent RNA-seq (GRO-seq/NUN-seq)
miRNA-seq
21
Experimental variables of RNA-seq
Cellular
localization
Preparation
RNA
purification
Whole cell
Total RNA
Chromatin
PolyA+
Exosome
PolyA-
Size selection
Single end
Long (>200nt)
Paired end
Short (<200nt)
Strandness
Nucleus
RiboCytoplasm
Stranded
Unstranded
Special protocols
Single-cell RNA-seq
Nascent RNA-seq (GRO-seq/NUN-seq)
miRNA-seq
OUR
HANDSON
22
RNA-seq read mapping
23
RNA-seq read mapping
Input data:
● single-end or paired-end reads
● base call qualities for each base in each read
● quality offset
Prior biological knowledge:
● reference genome sequence
● reference gene annotation
24
RNA-seq read mapping
Find a correspondence between the query sequences (the
RNA-seq reads) and our prior knowledge (the reference
genome sequence, the reference gene annotation).
⇒ correctly determine each read’s corresponding location
in the reference genome/transcriptome
25
RNA-seq read mapping
● Unlike DNA-seq and ChIP-seq reads, RNA-seq reads may
span introns (~1-100kb in mammals)
○ They need to be mapped in a non-contiguous way
● Many RNA-seq read mappers usually heavily rely on a
contiguous/DNA read mapper and use two possible strategies:
○ make an exon pair database from the annotation and then
map the reads to the genome and this exon pair database
(e.g. TopHat (Bowtie), the GEM RNA pipeline (GEM))
○ Split the reads into sequential blocks and map them
independently and contiguously (e.g. MapSplice)
● Some other RNA-seq read mappers can directly map the reads
to the genome allowing introns (e.g. STAR, CRAC)
26
RNA-seq read mapping
● Contiguous read mapping
○ Alignment: definition and basic concept
○ Indexing: concept and strategies
○ Seed-and-extend mapping strategy
○ Paired end RNA-seq read mapping
● The STAR program
● RNA-seq read mapping
○ Variables to consider
○ Read mapping statistics
○ SAM/BAM format
27
Alignment
A common technique for mapping is alignment:
Reference: CATGGAACTTATCTCACAGCCTTT
Read: CATGGAACTT–TCGCACAGCCTTT
Not always easy:
● Reads are short with respect to the genome (~100 bp)
● Human genome is ∼3G bp long and rather repetitive
● Reference genome is different from sample genome
(SNPs, indels, structural variants)
● Reads are prone to errors (if lucky 1/1000 base calls are
wrong)
28
Alignment
● online versus indexed
● global versus local
● sequence similarity
○ mismatches as base substitutions (A→T)
○ insertions/deletions or gaps
○ block transpositions or rearrangements
● multimaps
● heuristic vs exhaustive
Given a metric distance (eg. mismatches) and a threshold (eg. 96%
homology) the alignment is exhaustive if it contains all possible matches in
the reference for that distance and threshold
29
Indexing
Pre-compute the reference text into an index, providing fast
sorted access to substrings of the reference
● based on the idea of sorting the content of the reference
to provide quick access to certain regions
● most indices provide exact access to suffixes (k-mers) of
the reference
● an index is computed once and used many times
Index structures:
○ Hash based (e.g. GSNAP, MrFast)
○ Suffix arrays (e.g. STAR)
○
BWT based (FM-index) (e.g. GEM, BWA, Bowtie)
30
Indexing strategies
● indexing the reference (most common choice):
○ each read is mapped individually
○ reference is usually big but is fixed
○ read/sample size is unknown and variable
● indexing the reads:
○ reference is scanned to perform the mapping
○ makes sense with small references (e.g. yeast)
● indexing both the reference and the reads:
○ high memory consumption - keeps both indices
31
Seed-and-extend mapping strategy
An alignment of a sequence of length m with at most k differences must contain
an exact match at least s=m/(k+1) bp long 4b
E.g BLAST, Bowtie2, ...
sensitivity depends on seed length and overlap
poor choice of seed might lead to unmapped reads
not exhaustive
i.
ii.
iii.
extract seeds (usually exact)
look-up each of them into the index
“extend” the search to validate the alignments
32
Paired-end alignment
Both ends of the fragments are
sequenced→paired-end reads
● Connectivity
information
● Insert size and read
length are known in
advance (from library
preparation)
● Insert size distribution
can be used to solve
ambiguities (or even
enhance the mapping
process)
33
The STAR program
● STAR: Spliced Transcripts Alignment to a Reference
● Chosen as RNA-seq read mapper by many international
consortia (ENCODE2/3, TCGA, Blueprint)
● Advantages:
○ proven to be one of the best at the RGASP2 competition
(Engström et al, Nat Meth, 2013)
○ Fast
○ Actively maintained by his author Alex Dobin
○ Has the potential to accurately align long (several kilobases)
reads emerging from the third-generation sequencing
technologies (PacBio, Oxford Nanopore)
● Drawbacks:
○ truncate reads and produces some false positive junctions
○ Needs quite a lot of ram
34
The STAR workflow
-
Clustering
Stitching
Scoring
Alex Dobin,CSHL
35
STAR seed search: basic idea
● Sequential search for a Maximal Mappable Prefix (MMP)
○ Where MMP is similar to the Maximal Exact (Unique)
Match concept used by Mummer and MAUVE
● Given a read sequence R, a read location i and a
reference genome sequence G, MMP(R,i,G) is defined as
the longest substring (Ri, Ri+1, …, Ri+MML-1) of R that
matches exactly one or more substring of G, where MML
is the maximal mappable length
● The MMP search is implemented through uncompressed
suffix array (SA)
36
MMP in STAR for detecting (a) splice
junctions, (b) mismatches and (c) tails
Dobin et al, Bioinformatics, 2013
37
Seed stitching strategy
● Most DNA aligners use the seed-and-extend strategy
○ Local alignment, affine gap penalty, dynamic
programming
○ Non-zero gap extension penalty: cannot deal with
very large gaps (100kb!)
● STAR uses the “seed-stitching” strategy:
→ build the best local alignment out of all seeds
Alex Dobin, CSHL
38
Anchor seeds and windows
● All seeds that map <10,000 times are recorded
○ 10-mers map 10,000 on average in human genome
● “Anchors”: seeds that map <50 times
● “Alignment windows”: genome regions around anchors
○ Size of the window ~ maximum intron size, ~1Mb for human
Anchor
● All seeds inside alignment windows are stitched together
Alex Dobin, CSHL
39
Stitching two seeds
seed 1
seed 2
No gap or insertion: simple stitching
40
Gapped stitching
● Simplify the problem by limiting to 1 gap only but an arbitrary
number of mismatches
○ Then the gap length is known, only need to find the gap
position
○ Time ~ O(UnmappedReadBases), compared to ~ O
(UnmappedReadBases*UnmappedGenomeBases) for
standard dynamic programming
● Issues:
○ Needs one seed for each exon
○ Cannot detect indels near splice junctions, or near read ends
41
Stitching all seed
● For each window, consider all possible seed combinations
● N seeds: 2N combinations - only works for short reads (<200b)
● For longer reads: simple dynamic programming scheme:
○ each seed is stitched to all the preceding seeds within a
window
○ best total score is recorded for every seed
○ O(N2), better scaling for longer reads
42
!
Always make sure that your reference
genome and your annotation use the
same chromosome names!
43
RNA-seq read mapping
Specific variables to consider when mapping RNA-seq data
● intron size
● overhang
○ number of bases from each side of the junction that
should be covered by the read
● splice site consensus
○ donor/acceptor splice site consensus sequences
● junction “filtering”:
○ chromosome/strand
○ block order
○ min/max distance
44
RNA-seq read mapping statistics
●
●
●
●
●
total reads
mapped reads (number and %)
uniquely mapped reads (number and %)
mappings (including multimaps)
genomic regions (number and %)
45
SAM format
Sequence Alignment/Map
Headers
Alignment
46
SAM format
Sequence Alignment/Map
Flag:
https://broadinstitute.github.io/picard/explain-flags.html
CIGAR:
- N → intron
- M → match
- I → insertion
- D → deletion
- S → soft-clip
More specification on SAM format:
https://samtools.github.io/hts-specs/SAMv1.pdf
47
BAM format
Compressed binary representation of the SAM format
● specific block compression
○ BGZF
● support random access through the index
fast retrieval of alignments overlapping a specified
region
!
BAM file must be sorted by genomic position
(chromosome name and leftmost coordinate)
in order to be indexed!
48
Gene expression quantification
49
Gene expression quantification
● To quantify the expression of a gene, a simple idea is to count the
RNA-seq reads that fall within the projected exons of this gene:
7 + 1 + 2 + 8 = 18 reads fall into that gene
● As a consequence:
○ in a given experiment, long genes (cumulative exon length) will
get more reads than small genes
○ in an experiment with a high number of mapped reads, a gene will
get more reads than in an experiment with a small number of
mapped reads
50
Gene expression quantification
● Since we need to compare the expression of:
○ several genes in the same experiment
○ the same gene in different experiments
● Mortazavi et al. (2008) introduced RPKM = Read Per Kilobase of exon
model per Million mapped reads, which normalizes the read count of a
gene in an experiment by both:
○ the length of the gene
○ the number of mapped reads in the experiment (library size)
51
Gene and transcript expression
measures
● Raw read counts
● RPKM = Reads Per Kilobase of exon per Million mapped reads
(Mortazavi et al, 2008)
● FPKM = Fragments Per Kilobase of exon per Million mapped
reads (Cufflinks)
(Trapnell et al, 2010)
● TPM = Transcripts Per Million (RSEM)
(Li and Dewey, 2011)
TPM = FPKM / (∑ FPKM over all genes (transcripts))
Advantage: ∑TPM = 1 for any sample; TPM does not depend
on the length of the expressed transcripts; TPM is a technology
independent measure, it is simply a fraction. It is now the
preferred gene and transcript expression measure
52
Individual transcript expression
● Gene expression is “quite” easy to compute, however estimating the
expression of individual transcripts of each gene is a difficult problem:
do the 2 circled reads come from the red or from the blue transcript?
● Read deconvolution or transcript isoform quantification
● There are 2 categories of transcript isoform quantifiers:
○ read-centric (Cufflinks, IsoEM, RSEM, Sailfish, eXpress, Kallisto)
○ exon-centric (Poisson model, linear regression approaches like
rQuant, IsoLasso, SLIDE, flux capacitor)
53
Individual transcript expression
● An increasing number of programs (mostly read-centric
such as RSEM, Sailfish, Kallisto) only use a mapping of
the reads to the transcriptome (reference annotation) as
input to produce transcript expression
● Although this works better for well-annotated species, it is
likely to fail with an incomplete annotation, since it will
likely overestimate transcript expression
→ one possibility is to first make a transcript assembly
(directly from the reads or from their mapping to the
genome) and use it as the transcriptome/annotation
54
The RSEM program
● RSEM: RNA-seq by Expectation Maximization
○ Parameters = transcript abundances
○ Hidden variable = alignment
● Transcript-level alignment
● No need for a reference genome, requires a set of reference
transcripts (eg. de novo transcriptome assembler, EST
database... )
● Computes ML abundance estimates using the EM algorithm as
its statistical method
● Good handling of multimaps leading to accurate transcript and
gene expression quantifications
55
Length dependence
• Assuming reads are of length 1 (usually they are > 35bp)
• Prob. of a read coming from a transcript ∝ relative abundance × length
transcripts
1
2
3
reads
100 A
200
60
60 C
80
40 G
Colin Dewey
The basics of quantification from
RNA-seq data
• Basic assumption:
expression
level
length
• Normalization factor is the mean length of expressed
transcripts
Colin Dewey
The basics of quantification from
RNA-seq data
• Estimate the probability of reads being generated from a
given transcript by counting the number of reads that align
to that transcript
# reads mapping to transcript i
total # of mappable reads
• Convert to expression levels by normalizing by transcript
length
Colin Dewey
What if reads do not uniquely map to
transcripts?
• The approach described assumes that every read can be
uniquely aligned to a single transcript
• This is generally not the case
• Some genes have similar sequences - gene families,
repetitive sequences
• Alternative splice forms of a gene share a significant
fraction of sequence
Colin Dewey
Multi-mapping reads in RNA-seq
Species
Read length
% multi-mapping reads
Mouse
25
17%
Mouse
75
10%
Maize
25
52%
Axolotl
76
23%
• Throwing away multi-mapping reads leads to
1. Loss of information
2. Potentially biased estimates of abundance
Colin Dewey
Distributions of alignment counts
Colin Dewey
What if reads do not uniquely map to
transcripts?
• “multi-read”: a read that could have been derived from
multiple transcripts
transcripts
3
90 A
200
1
2
reads
60
40 C
80
40 G
30 T
• How would you estimate the relative abundances for
these transcripts?
Colin Dewey
Some options for handling multireads
- Discard all multi-reads, estimate based on uniquely mapping reads
only
- Discard multi-reads, but use “unique length” of each transcript in
calculations
- “Rescue” multi-reads by allocating (fractions of) them to the
transcripts
→ Three step algorithm
1. Estimate abundances based on uniquely mapping reads only
2. For each multiread, divide it between the transcripts to which
it maps, proportionally to their abundances estimated in the
first step
3. Recompute abundances based on updated counts for each
transcript
Colin Dewey
Rescue method example - Step 1
transcripts
3
90 A
200
1
2
reads
60
40 C
80
40 G
Step 1
Colin Dewey
30 T
Rescue method example - Step 2
transcripts
3
90 A
200
1
2
reads
60
40 C
80
40 G
Step 2
Colin Dewey
30 T
Rescue method example - Step 3
transcripts
3
90 A
200
1
2
reads
60
40 C
80
Step 3
40 G
30 T
Colin Dewey
An observation about the rescue
method
• Note that at the end of the rescue algorithm, we have an updated
set of abundance estimates
• These new estimates could be used to reallocate the multi-reads
• And then we could update our abundance estimates once again
• And repeat!
• This is the intuition behind the statistical approach to this problem
Colin Dewey
EM Algorithm
• Expectation-Maximization for RNA-seq
• E-step: Compute expected read counts given current
expression levels
• M-step: Compute expression values maximizing
likelihood given expected read counts
• Rescue algorithm ≈ 1 iteration of EM
Colin Dewey
predicted expression level
Improved accuracy over unique and
rescue
true expression level
Gene-level expression estimation
Colin Dewey
predicted expression level
Improving accuracy on repetitive
genomes: maize
true expression level
Gene-level expression estimation
Colin Dewey
●
●
Alignments of
reads against
reference
transcript
sequences
(Bowtie)
Calculate the
relative
abundances
The RSEM program
workflow
71
Wiggle output: expected read depth at each position of the
genome
BAM output: probabilistically-weighted read alignments.
Paired reads are connected by a thin black line. The darkness of the read indicates
the posterior probability of its alignment (black meaning high probability)
72
STAR/RSEM gene/transcript
expression quantification pipeline
● STAR/RSEM was proven to be one of the best pipeline
for gene and transcript expression quantification, on the
basis of reproducibility of quantification on two bioreplicates (Teng et al, Genome Biology, 2016)
● STAR/RSEM has been chosen as the gene and
transcript expression quantification pipeline by many
international consortia (ENCODE3, TCGA, Blueprint)
73
The ENCODE3 STAR-RSEM pipeline
Alex Dobin, CSHL
Reads are mapped to the genome with STAR which then internally converts
genome mappings to transcriptome mappings (from genome to transcriptome
coordinates). RSEM takes this transcriptome mapping as input.
74
The Grape pipeline
Emilio Palumbo, CRG
https://github.com/guigolab/grape-nf
75
References
1.
2.
Annotation
a.
Ensembl: Curwen,..., Clamp, The Ensembl automatic gene annotation system, Genome Res, 2004 / Flicek,...,Searle, Ensembl 2013.
Nucleic Acids Res, 2013 / http://www.ensembl.org/index.html
b.
UCSC: Hsu,..., Haussler, The UCSC Known Genes, Bioinformatics, 2006 / http://genome.ucsc.edu/
c.
Gencode: Harrow,...,Hubbard, GENCODE: the reference human genome annotation for The ENCODE Project, Genome Res, 2012
Sequencing technology review: Metzker, Sequencing technologies - the next generation, Nat Rev Genet, 2010.
3.
mapping program review:
4.
a.
Fonseca,…, Marioni, Tools for mapping high-throughput sequencing data, Bioinformatics, 2012.
b.
Ruffalo,..., Koyutürk, Comparative analysis of algorithms for next-generation sequencing read alignment, Bioinformatics, 2011.
mapping algorithms
a.
GEMsuite: Santiago Marco-Sola et al., The GEM mapper: fast, accurate and versatile alignment by filtration, Nat Methods, 2012 /
http://algorithms.cnag.cat/wiki/The_GEM_library
b.
Ricardo A. Baeza-Yates, Chris H. Perleberg, Fast and practical approximate string matching, Information Processing Letters,
Volume 59, Issue 1, 8 July 1996
c.
Alex Dobin et al., STAR: ultrafast universal RNA-seq aligner, Bioinformatics. 2013 Jan 1;29(1):15-21
5.
sam/bam format: http://samtools.sourceforge.net/SAMv1.pdf
6.
samtools: http://samtools.sourceforge.net/
76
References
1.
bamtools: Barnett,…, Marth, BamTools: a C++ API and toolkit for analyzing and managing BAM files, Bioinformatics, 2011.
2.
bedtools: Quinlan, …, Hall, BEDTools: a flexible suite of utilities for comparing genomic features, Bioinformatics, 2010 / http://code.google.
com/p/bedtools/
3.
RPKM definition: Mortazavi,..., Wold, Mapping and quantifying mammalian transcriptomes by RNA-Seq, Nat Methods, 2008.
4.
Multisplice: Huang, …, Liu, A robust method for transcript quantification with RNA-seq data, RECOMB 2012
5.
Review on transcript quantification methods: Pachter, Models for transcript quantification from RNA-seq, arXiv, 2011
6.
RSEM: Li, B., Ruotti, V., Stewart, R., Thomson, J., Dewey, C, RNA-Seq gene expression estimation with read mapping uncertainty,
Bioinformatics, 2010
7.
Lovén,...,Young, Revisiting global gene expression analysis, Cell, 2012
8.
UCSC genome browser: http://genome.ucsc.edu/
77
RNA-seq basic processing hands-on
http://genoweb.toulouse.inra.
fr/~sdjebali/courses/izmirmay2016/rnaseqbasic_proc/
78
Additional slides
79
Reference gene annotation
●
Two projects gather reference gene annotations for many different species:
○ Ensembl (U.K.): http://www.ensembl.org/index.html
○ UCSC (U.S.A.): http://genome.ucsc.edu/
●
They are both based on automated pipelines that:
○ heavily rely on experimental evidence present in public databases such
as ESTs, cDNAs and proteins, from the same or from other species
○ first align the experimental evidence to the genome, and then assemble
alignments into transcript structures using heuristic rules
○ for coding transcripts, determine the coordinates of the coding sequence
○ associate a unique id to each transcript and to each gene
○ provide a gene name when evidence shows the gene is already known
●
There is current effort to use RNA-seq data to help annotating genes, but for
the moment it is currently only useful for genomes that are unassembled, very
newly assembled or lacking a close well-annotated species
80
Reference gene annotation
●
The advantage of ensembl over ucsc gene annotation is that it includes
manual curation for the human and mouse species
●
These two projects also provide easy-to-use on-line tools to interrogate their
database and retrieve the underlying data:
○ Biomart for ensembl: http://www.biomart.org/
○ Table browser for UCSC: http://genome.ucsc.edu/cgi-bin/hgTables
●
They enable to visualize genome, annotation and data at the same time
●
The UCSC project enables to visualize and download ensembl genes, while
the ensembl project only provides ensembl genes
●
The vast majority of reference long gene annotation sets are based on
experimental evidence from polyadenylated samples, and thus will not contain
transcripts seen only in non-polyadenylated samples
81
Gencode pseudogene annotation
82
Reference gene annotation
●
Well studied species also have their own annotation databases, that ensembl
and UCSC also mirror:
○ D. melanogaster: flybase: http://flybase.org/
○ C. elegans: Wormbase: http://www.wormbase.org/
○ S. cerevisiae: SGD: http://www.yeastgenome.org/
Morrell et al., Nature Genetics review, 2012
83
Experimental variables of RNA-seq
Cellular
localization
Preparation
RNA
purification
Whole cell
Total RNA
Chromatin
PolyA+
Exosome
PolyA-
Size selection
Single end
Long (>200nt)
Paired end
Short (<200nt)
Strandness
Nucleus
RiboCytoplasm
Stranded
Unstranded
Special protocols
Single-cell RNA-seq
Nascent RNA-seq (GRO-seq/NUN-seq)
miRNA-seq
84
Total RNA
rRNAs
18s
28s
The amount of rRNAs in the cell is
very high, and if performing
sequencing without depleting them
most of the sequences obtained
would be rRNAs.
The rRNAs are PolyA-, thus the
purification with oligodT beads
removes them. By selecting the
PolyA+ transcripts the proportion of
intronic reads should be very low,
because most of RNAs will be fully
processed. The depletion by Ribokits removes most of the rRNAs, but
it is not 100 % efficient; thus, they
should be taken into account during
the mapping against the reference
sequence. With the Ribopurifications you may have a relative
high proportion of intronic reads, also
to be taken into account.
85
Library preparation, part 1
● Purification of RNA
○ by oligo-dT beads or depletion of rRNAs
● Fragmentation (usually around 160 nts)
● Optional step: gel size selection.
○ However, you lose a lot of material
in this step.
86
Library preparation, part 2, unstranded
● Reverse transcription
(1st and 2nd strand)
● Adenylation of 3’
ends
● Adapter ligation
● PCR amplification
87
Library preparation, part 2, stranded
● Reverse transcription
(1st strand only)
● Using dUTP instead
of dTTP for the
second strand cDNA
synthesis
● Adenylation of 3’
ends
● Adapter ligation
● Degradation of the
second strand
● PCR amplification
u
u
u
u
88
Library preparation, stranded
Note:
Elimination of the second strand may be different
between protocols. Some protocols (used by
ENCODE and Blueprint) digest the second strand by
using a UDGase (enzyme that digests the Uracil
strand). More recent protocols (ours, for instance)
use a DNA polymerase that is not able to amplify the
Uracil strand and, thus, only enriches the 1st strand
89
Sequencing
● Sequencing platforms:
Illumina, Ion Torrent, PacBio,...
● Spike-Ins inclusion?
● Randomized sequencing run?
90
Sequencing
Quail MA, et al. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific
Biosciences and Illumina MiSeq sequencers. BMC Genomics. 2012
91
Illumina sequencing
Bridge amplification
Metzker, Michael L. "Sequencing technologies—the next generation."Nature Reviews Genetics 11.1 (2010): 31-46.
92
Illumina sequencing
Four-color cyclic reversible termination
Sequencing by synthesis (SBS) technology
Metzker, Michael L. "Sequencing technologies—the next generation."Nature Reviews Genetics 11.1 (2010): 31-46.
93
Third generation sequencing
(PacBio, Oxford nanopore)
● Allows to sequence genomic material without neither
fragmentation nor clonal amplification
● Enables to get longer read sequences (many tens of kb),
but at the price of a much higher error rate than Illumina
● Has been mostly used for genome sequencing, since
those reads can span complicated repeat-rich regions
which are trickier to assemble using short reads
94
Pacbio sequencing
For RNA-seq see: Sharon, Tilgner, Grubert, Snyder, Nature Biotechnology, 2013
95
Pacbio sequencing
96
Oxford nanopore sequencing
From Irene González Navarrete, CRG
97
PacBio versus Oxford Nanopore for
RNA-seq
From Alex Dobin, CSHL
98
99
100
101
How much to sequence?
●
L read length
●
N number of reads
●
G haploid genome length
COVERAGE
C = LN / G
http://www.illumina.com/documents/products/technotes/technote_coverage_calculation.pdf
102
How much to sequence?
●
L read length
●
N number of reads
●
G haploid genome length
COVERAGE
C = LN / G
E.g. C = (100 bp)*(189×106)/(3×109bp) = 6.3
http://www.illumina.com/documents/products/technotes/technote_coverage_calculation.pdf
103
Fastq files and read QC
5.1
104
FASTQ Format
a text-based format for storing biological sequences and
their corresponding quality scores
1st character
Sequence id
1
2
3
4
Optionally: The sequence id can be followed by a description
105
FASTQ Format
a text-based format for storing biological sequences and
their corresponding quality scores
Raw sequence
1
2
3
4
106
FASTQ Format
a text-based format for storing biological sequences and
their corresponding quality scores
1st character
1
2
3
4
Optionally: “+” can can be followed by the sequence id and any description
107
FASTQ Format
a text-based format for storing biological sequences and
their corresponding quality scores
Quality code associated to each base of the sequence
1
2
3
4
108
FASTQ Format - summary
Four lines per sequence are used in a FASTQ file:
1. begins with a '@' character and is followed by a sequence
identifier and an optional description (like a FASTA title line)
2. the raw sequence
3. begins with a '+' character and is optionally followed by the
same sequence identifier (and any description)
4. encodes the quality values for the sequence contained in line
2 (must contain the same number of symbols as the
sequence)
109
FASTQ Format - quality offset
A quality value Q is an integer mapping of p (i.e., the probability
that the corresponding base call is incorrect). The most used
formula is the Phred quality score:
offset
max Phred score
range
max ASCII range
real-world Phred
score range
real-world ASCII
range
33
0 - 93
33 - 126
0 - 40
33 - 73
64
0 - 62
64 - 126
0 - 40
64 - 104
https://en.wikipedia.org/wiki/FASTQ_format#Encoding
110
FastQC
111
Index structures
●
Hash based
○ Simple Idea -> Store k-mers/seeds/samples using some hash function H
(·)
○ Usually requires a lot of space (several times the reference size)
●
SuffixArrays
○ Sort suffixes of the text, storing the sorted positions in an array
●
FM-Index (BWT Based)
○ Same logic as SuffixArrays
○ Based on a compression scheme (BZIP)
■ Space efficient (sizes the same as the reference)
112
STAR: Suffix arrays
113
CRAM format
improved compressed binary representation of SAM
● different compression formats
○ gzip, bzip2, CRAM records
● CRAM records use different encoding strategies, e.g.
bases are reference compressed by encoding base
differences rather than storing the bases themselves
● random access support through the format itself (slices)
!
CRAM indexing is external to the file format itself
and may change independently of the file format
specification in the future
114
Gene expression quantification
●
RPKM has been widely used for assessing gene expression, however it
assumes that the absolute amount of total RNA in each cell is similar across
different cell types or experimental perturbations, which is not always the case
(Loven, 2012)
●
For example, Mortazavi et al. (2008) estimates that 3 RPKM corresponds to ~
1 transcript per cell in mouse liver, while Klish et al. (2011) say that 1 RPKM
corresponds to between 0.3 and 1 transcript per cell...
Li, Ruotti, Stewart, Thomson, Dewey, “RNA-seq gene expression estimation with read mapping
uncertainty”, Bioinformatics, 26(4), 2010, 493-500.
115
Data normalization
Normalization allows to:
● Compare different datasets
● Compare different genes
● Remove unwanted variation
SEQC project: NATURE BIOTECHNOLOGY, Volume 32, Number 9, Sept. 2014
116
Normalization methods
● Scaling factors
➤
➤
➤
Upper quartile, mean, quantile
Trimmed Median of M-values (TMM)
Z-score
● Variance stabilizing
➤
➤
➤
vst (DESeq2)
rlog (DESeq2)
voom
● Variance removing
➤
➤
➤
PEER
RUV
SVA
SEQC project: NATURE BIOTECHNOLOGY, Volume 32, Number 9, Sept. 2014
117
Individual transcript expression
●
●
There are two categories of transcript isoform quantifiers:
○
read-centric (Cufflinks, IsoEM, RSEM, Sailfish,eXpress, Kallisto): assign
probability for each transcript fragment (paired-end read) to one transcript
by maximizing the joint likelihood of read alignments based on the
distribution of transcript fragment
○
exon-centric (Poisson model, linear regression approaches like rQuant,
IsoLasso, SLIDE, flux capacitor): considers the read abundance on an
exonic segment as the cumulative abundance of all transcript isoforms.
The transcript is represented as a combination of exons and aims at
estimating individual transcript abundance from the observed read counts
at each exon
The RPKM of a gene can then be obtained by summing the RPKM of its
constituent transcripts (assuming that reads were assigned to transcripts in a
mutually exclusive way)
118
The basics of quantification with
RNA-seq data
• For simplicity, suppose reads are of length one (typically they are > 35 bases)
transcripts
200
1
2
3
60
80
reads
100 A
60 C
40 G
• What relative abundances would you estimate for these genes?
The basics of quantification from RNA-Seq data
• Basic quantification algorithm
• Align reads against a set of reference transcript sequences
• Count the number of reads aligning to each transcript
• Convert read counts into relative expression levels
Counts to expression levels
• RPKM - Reads Per Kilobase per Million mapped reads
• TPM - Transcripts Per Million
(estimate
of)
• Prefer TPM to RPKM/FPKM because of normalization factor
• TPM is a technology-independent measure (simply a fraction)
Our solution - a generative probabilistic model
transcript probabilities (expression levels)
number of reads
transcript
fragment length
start position
read length
orientation
paired read
quality scores
read sequence
Colin Dewey
Quantification as maximum likelihood inference
• Observed data likelihood
• Likelihood function is concave w.r.t. θ
• Has a global maximum (or global maxima)
• Expectation-Maximization for optimization
“RNA-Seq gene expression estimation with read mapping uncertainty”
Li, B., Ruotti, V., Stewart, R., Thomson, J., Dewey, C.
Bioinformatics, 2010
Colin Dewey
Approximate inference with read alignments
• Full likelihood computation requires O(NML2) time
• N (number of reads) ~ 107
• M (number of transcripts) ~ 104
• L (average transcript length) ~ 103
Colin Dewey
• Approximate by alignment
all local alignments of read n with at most x mismatches
HMM Interpretation
transcript 1
transcript 2
start
transcript 3
...
observed: read sequences
...
hidden: read start positions
Colin Dewey
transcript M
Learning parameters: Baum-Welch Algorithm (EM for
HMMs)
Approximation: Only consider a subset of paths for each
read
RSEM: the directed graphical model
●
Generative
probabilistic model
associated
inference with the
EM algorithm
126
BED format
provides a flexible and compact way to represent genomic regions (with breaks)
● 3 required fields + additional 9 fields
● more compact than GFF tradeoff between size and provided information
block length block position required fields region
127
bedGraph and wig formats
bedGraph
● allows the display of continuous-valued data
●
useful for probability scores and transcriptome data (CHIp-seq, RNA-seq)
●
is a text file
track type=bedGraph name="BedGraph Format" description="BedGraph format" visibility=full color=200,100,0 altColor=0,100,200
priority=20
chr19 49302000 49302300 -1.0
chr19 49302300 49302600 -0.75
wig
● allows the display of continuous-valued data
●
more compressed than bedGraph
●
is a text file
fixedStep chrom=chr3 start=400601 step=100
11
22
33
128
bigBed, bigWig
Useful formats to display data on the UCSC genome browser
●
BED, bedGraph, wig etc are tab delimited text files
●
for each type of file there is a specific procedure to make a binary file
○
easily transferable
○
not so big
○
allows indexed access
129