RNAseq Applications in Genome Studies

Alexander Kanapin, PhD
Wellcome Trust Centre for Human Genetics,
University of Oxford
RNAseq Protocols
- Next generation sequencing protocol
- cDNA, not RNA sequencing
- Types of libraries available:
  - Total RNA sequencing
  - polyA+ RNA sequencing
  - Small RNA sequencing
- Special protocols:
  - DSN treatment
  - RiboMinus
  - SMARTer: Ultra Low RNA sequencing protocol
- Strand-specific sequencing
  - Sequencing only + or – strand
  - Mostly paired-end
Genome Study Applications
- Transcriptome analysis
- Identifying new transcribed regions
- Expression profiling
- Alternative splicing studies
- Resequencing to find genetic polymorphisms:
  - SNPs, micro-indels
  - CNVs
cDNA Synthesis
Arrays vs RNAseq (1)
- Correlation of fold change between arrays and RNAseq is similar to the correlation between array platforms (0.73)
- Technical replicates are almost identical, so there is no need to run them
- Extra analysis: prediction of alternative splicing, SNPs
- Low- and high-expressed genes do not match between the two technologies
Array vs RNAseq (2)
Data processing and analysis
- Alignment
  - Splice-aware
- Reads counting/preprocessing
  - Adaptor trimming
  - Counting
  - Sanity checks
- Expression studies
  - Overlapping genes
  - Strand specific sequencing protocols
  - Differential expression
  - Alternative splicing
  - GO and pathway analysis
Dataflow and Formats
- Illumina pipeline (FASTQ)
- Preprocessing (FASTQ/FASTA)
- Alignment (BAM)
- Expression profiles/RNA abundance (BED, GTF)
- Splice variants (GTF)
- SNP analysis (VCF)
Software
- Short reads aligners
  - TopHat, STAR
- Data preprocessing (reads statistics, adapter clipping, formats conversion, read counters)
  - Fastx toolkit
  - HTSeq
  - samtools
- Expression studies
  - Cufflinks, cuffdiff, cuffcompare
  - RSEQtools
  - R packages (DESeq, edgeR, baySeq, DEGseq, Genominator)
- Alternative splicing
  - Cufflinks
  - MISO
  - Augustus
- Downstream analysis
  - GOSeq
  - GOstats
  - SPIA
- Commercial software
  - Partek
  - CLCBio
RNASeq alignment
- TopHat
  - University of Maryland (http://tophat.cbcb.umd.edu/manual.shtml)
  - Python wrapper around the bowtie aligner
  - Identifies exons without a reference database
  - Assisted or de novo transcript assembly
- STAR
  - CSHL (http://gingeraslab.cshl.edu/STAR/STARintro.htm)
  - Used by the ENCODE project as RNASeq aligner
  - Unbiased detection of splice junctions
  - Arbitrarily large intron length
  - Heuristic, non-exhaustive algorithm
FASTQ: Sequence Data
FASTA with Qualities
- Phred quality score Q, which encodes the probability P that the corresponding base call is incorrect (Q = -10 log10 P), stored with a +33 or +64 offset as an ASCII character
- Example record:
@HWI-EAS225:3:1:2:854#0/1
GGGGGGAAGTCGGCAAAATAGATCCGTAACTTCGGG
+HWI-EAS225:3:1:2:854#0/1
a`abbbbabaabbababb^`[aaa`_N]b^ab^``a
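As an illustration of the quality encoding, the following Python sketch decodes the quality string of the record shown above. It assumes the +64 offset (the characters in this record are too high in the ASCII table to be plausible with +33); the function names are illustrative and not part of any toolkit.

```python
# Decode the FASTQ record shown above.
# Assumption: Phred+64 encoding; function names are illustrative only.
record = [
    "@HWI-EAS225:3:1:2:854#0/1",
    "GGGGGGAAGTCGGCAAAATAGATCCGTAACTTCGGG",
    "+HWI-EAS225:3:1:2:854#0/1",
    "a`abbbbabaabbababb^`[aaa`_N]b^ab^``a",
]


def decode_qualities(quality_string, offset):
    """Convert ASCII-encoded qualities to integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Phred definition: Q = -10 * log10(P), hence P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


header, sequence, separator, quality_string = record
phred = decode_qualities(quality_string, offset=64)

# Print base, Phred score and error probability for the first five bases
for base, q in list(zip(sequence, phred))[:5]:
    print(base, q, f"{error_probability(q):.5f}")
```

With this offset the lowest score in the record belongs to the character `N`, which is a quick way to spot the weakest base call in a read.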
SAM(BAM): Alignment Data
Field         Value
Read ID       S35_42763_4
Bitwise flag  0
Chr           X
Pos           15401991
MapQ          255
CIGAR         18M
Mate ref      *
Mate pos      0
Insert size   0
Sequence      CACACGATTCTCAAAGGT
Scores        IIIIIIIIIIIIIIIIII
Extra tags    XA:i:0
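A SAM record is a single tab-separated line, so the fields above can be read with a few lines of Python. This is a minimal sketch: the read ID `S35_42763_4` and the tab-joined line are reassembled from the slide, the helper names are hypothetical, and only a handful of the flag bits defined in the SAM specification are decoded.

```python
# The eleven mandatory SAM columns, in order
SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual"]

# A subset of the bitwise flag values from the SAM specification
FLAG_BITS = {0x1: "paired", 0x4: "unmapped",
             0x10: "reverse_strand", 0x100: "secondary"}


def parse_sam_line(line):
    """Split one alignment line into named fields plus optional tags."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    record["tags"] = columns[11:]
    return record


def describe_flag(flag):
    """List the names of the known bits that are set in the flag."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


# The example alignment from the slide, reassembled as one line
line = "\t".join(["S35_42763_4", "0", "X", "15401991", "255", "18M",
                  "*", "0", "0", "CACACGATTCTCAAAGGT",
                  "IIIIIIIIIIIIIIIIII", "XA:i:0"])
rec = parse_sam_line(line)
print(rec["rname"], rec["pos"], rec["cigar"], describe_flag(rec["flag"]))
```

A flag of 0 sets none of these bits, which corresponds to a single-end read mapped to the forward strand.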
Statistics and Algorithms
- Aim: to detect changes between experimental conditions of interest that are significantly larger than the technical and biological variability among replicates.
- Short reads distribution
  - Poisson
  - Negative binomial
  - Normal
- Expression values normalization
  - FPKM
  - Normalized reads number
  - VST (variance stabilizing transformation)
FPKM (RPKM): Expression Values
Fragments (Reads) Per Kilobase of exon model per Million mapped fragments
- Nat Methods. 2008, Mapping and quantifying mammalian transcriptomes by RNA-Seq. Mortazavi A et al.

$$\mathrm{FPKM} = 10^{9} \cdot \frac{C}{N L}$$

C = the number of reads mapped onto the gene's exons
N = total number of reads in the experiment
L = the sum of the exons in base pairs
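The formula can be checked with a small worked example. The sketch below is a direct transcription of the definition into Python; the function name and the numbers (1,000 reads on a gene with 2,000 bp of exons in a library of 20 million reads) are made up for illustration.

```python
def fpkm(exon_reads, total_reads, exon_length_bp):
    """FPKM = 1e9 * C / (N * L), with C, N and L as defined above."""
    if total_reads <= 0 or exon_length_bp <= 0:
        raise ValueError("total_reads and exon_length_bp must be positive")
    return 1e9 * exon_reads / (total_reads * exon_length_bp)


# Hypothetical gene: C = 1,000 reads, N = 20 million reads, L = 2,000 bp
example = fpkm(exon_reads=1000, total_reads=20_000_000, exon_length_bp=2000)
print(example)  # 25.0
```

Doubling either the library size or the exon length halves the value, which is the point of the normalisation.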
Read counts
- HTSeq-count
  - http://www-huber.embl.de/users/anders/HTSeq
  - Python script producing raw read counts using sorted SAM files
- BEDTools
  - http://code.google.com/p/bedtools/
  - coverageBed computes both the depth and breadth of coverage of features in file A across the features in file B.
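To make the idea of read counting concrete, here is a toy Python sketch. It is loosely modelled on how a counter has to treat overlapping genes and is not the actual algorithm of HTSeq-count or coverageBed; the gene models, read coordinates and function names are all hypothetical.

```python
from collections import Counter

# Hypothetical gene models: name -> (chromosome, start, end), 1-based inclusive
GENES = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 450, 900),  # overlaps geneA between 450 and 500
}


def overlapping_genes(chrom, start, end):
    """Return the names of all genes that overlap the read interval."""
    return [name for name, (g_chrom, g_start, g_end) in GENES.items()
            if g_chrom == chrom and start <= g_end and end >= g_start]


def count_reads(reads):
    """Assign each read to a single gene, or to a special category."""
    counts = Counter()
    for chrom, start, end in reads:
        hits = overlapping_genes(chrom, start, end)
        if not hits:
            counts["no_feature"] += 1
        elif len(hits) > 1:
            counts["ambiguous"] += 1
        else:
            counts[hits[0]] += 1
    return counts


# Hypothetical aligned reads as (chromosome, start, end)
reads = [("chr1", 120, 155), ("chr1", 460, 495), ("chr1", 600, 635),
         ("chr1", 2000, 2035), ("chr1", 300, 335)]
counts = count_reads(reads)
print(dict(counts))
```

The read at 460-495 falls in the region shared by both genes and is therefore counted as ambiguous rather than credited to either gene.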
Sanity checks
- Read counts by category
- Counts distribution
- Pairwise correlation
[Figure: Read assignment by category. Number of reads (millions) per sample (WTCHG_52442_273, _274, _275, _276, _277, _288), split into Ensembl genes, alignment_not_unique, ambiguous, no_feature, not_aligned and too_low_aQual]
[Figure: Normalised count distributions. Density against log2 normalised counts, one curve per sample]
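A pairwise correlation check can be sketched in plain Python as below. The counts are invented for illustration, and the log2(count + 1) transform is an assumption chosen so that zero counts can be handled; it is not necessarily the transform used for the plots in this deck.

```python
import math

# Hypothetical raw counts for four genes in three samples (illustrative only)
samples = {
    "sample_1": [0, 23, 48, 310],
    "sample_2": [1, 12, 26, 150],
    "sample_3": [0, 9, 10, 95],
}


def log2_counts(counts):
    """log2(count + 1), so that zero counts stay finite."""
    return [math.log2(c + 1) for c in counts]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Compare every pair of samples once
names = sorted(samples)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        r = pearson(log2_counts(samples[first]), log2_counts(samples[second]))
        print(first, second, round(r, 3))
```

Replicates of the same condition are expected to correlate strongly; a sample that correlates poorly with all others deserves a closer look before differential expression testing.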
Cufflinks package
http://cufflinks.cbcb.umd.edu/
- “Cufflinks is a program that assembles aligned RNA-Seq reads into transcripts, estimates their abundances, and tests for differential expression and regulation transcriptome-wide”
- Cuffcompare:
  - Transcripts comparison (de novo/genome annotation)
- Cuffdiff:
  - Differential expression analysis
Cufflinks (Expression analysis)
gene_id          bundle_id  chr   left    right   FPKM     FPKM_conf_lo  FPKM_conf_hi  status
ENSG00000236743  31390      chr1  459655  461954  0        0             0             OK
ENSG00000248149  31391      chr1  465693  688071  787.12   731.009       843.232       OK
ENSG00000236679  31391      chr1  470906  471368  0        0             0             OK
ENSG00000231709  31391      chr1  521368  523833  0        0             0             OK
ENSG00000235146  31391      chr1  523008  530148  0        0             0             OK
ENSG00000239664  31391      chr1  529832  532878  0        0             0             OK
ENSG00000230021  31391      chr1  536815  659930  2.53932  0             5.72637       OK
ENSG00000229376  31391      chr1  657464  660287  0        0             0             OK
ENSG00000223659  31391      chr1  562756  564390  0        0             0             OK
ENSG00000225972  31391      chr1  564441  564813  96.9279  77.2375       116.618       OK
ENSG00000243329  31391      chr1  564878  564950  0        0             0             OK
ENSG00000240155  31391      chr1  564951  565019  0        0             0             OK
Cuffdiff (differential expression)
- Pairwise or time series comparison
- Normal distribution of read counts
- Fisher's test

test_id          gene    locus                     sample_1  sample_2  status  value_1  value_2  ln(fold_change)  test_stat  p_value   significant
ENSG00000000003  TSPAN6  chrX:99883666-99894988    q1        q2        NOTEST  0        0        0                0          1         no
ENSG00000000005  TNMD    chrX:99839798-99854882    q1        q2        NOTEST  0        0        0                0          1         no
ENSG00000000419  DPM1    chr20:49551403-49575092   q1        q2        NOTEST  15.0775  23.8627  0.459116         -1.39556   0.162848  no
ENSG00000000457  SCYL3   chr1:169631244-169863408  q1        q2        OK      32.5626  16.5208  -0.678541        15.8186    0         yes
R/bioconductor Packages
- Based on raw read counts per gene/transcript/genome feature (miRNA)
- DESeq
  - http://www-huber.embl.de/users/anders/DESeq/
  - Negative binomial distribution
- baySeq
  - http://www.bioconductor.org/help/bioc-views/release/bioc/html/baySeq.html
  - Bayesian approach
  - Choice of Poisson and negative binomial distribution
- edgeR
- DEGSeq
- Genominator
DESeq: Noise and Variance estimation
[Figure: squared coefficient of variation (0.0 to 0.8) and base mean density plotted against base mean (log scale, 1e-01 to 1e+05) for the conditions B, IFN, M and NK]
- SCV: the ratio of the variance at base level to the square of the base mean
- Solid line: biological replicates noise
- Dotted line: full variance scaled by size factors
- Shot noise: dotted minus solid
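The SCV definition above translates directly into code. The following Python sketch uses invented normalised counts for one gene; the Poisson expression for shot noise (variance equal to the mean, hence SCV = 1/mean) is a standard property of the Poisson distribution and is stated here as an assumption about what "shot noise" refers to.

```python
from statistics import mean, variance


def squared_cv(values):
    """Squared coefficient of variation: sample variance / mean squared."""
    m = mean(values)
    return variance(values) / (m * m)


def poisson_scv(base_mean):
    """SCV expected from Poisson counting noise alone (variance == mean)."""
    return 1.0 / base_mean


# Hypothetical normalised counts of one gene across three replicates
replicate_counts = [90.0, 100.0, 110.0]
scv = squared_cv(replicate_counts)
print(scv, poisson_scv(mean(replicate_counts)))
```

An SCV of 0.01 corresponds to a coefficient of variation of 10%. When the observed SCV is clearly larger than the Poisson value, the excess is the part attributed to biological variability.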
DESeq: Differential Expression
id               B cells expression  IFG expression  log2FoldChange  pValue
ENSG00000000971  1.566626326         23.78874526     3.924546167     2.85599311970997e-17
ENSG00000001036  5.999081213         33.49328888     2.481058581     9.8485739442166e-13
ENSG00000001084  23.3067067          156.2725598     2.745247408     4.38856094441354e-33
ENSG00000001461  46.14566905         18.67886919     -1.304788134    2.66197080043655e-07
ENSG00000001497  68.54035056         35.87868221     -0.933826668    3.36052669642687e-05
ENSG00000001630  13.86061772         55.92825318     2.012585716     1.27410028391540e-13
ENSG00000002549  27.33856924         1096.051286     5.325233754     1.97553508993745e-133
ENSG00000002587  15.64872305         2.223202568     -2.815333625    8.43968907932538e-10
ENSG00000002834  95.68814397         272.3502328     1.509051013     8.21570437569004e-16
ENSG00000003056  63.65513823         296.6257971     2.220295194     2.92583705156055e-30
ENSG00000003400  52.02308495         117.3028844     1.173014631     4.62918844505763e-08
ENSG00000003402  154.7003657         311.1815114     1.008279739     2.59997904482726e-08
ENSG00000003756  434.3712708         180.9106662     -1.263651217    3.58591978350734e-14
ENSG00000004399  1.199584318         56.96561073     5.569484777     9.87310306834046e-40
ENSG00000004455  145.4361806         331.8994483     1.190360014     3.17246841765643e-10
ENSG00000004468  17.27590102         128.1030372     2.89047182      1.99020901042234e-33
ENSG00000004534  331.0046525         176.1290195     -0.910218864    2.28719252897662e-07
ENSG00000004799  5.425570485         18.0426855      1.733567341     1.67150844663169e-06
ENSG00000004961  15.22078545         54.5536795      1.841633697     2.76802192307592e-11
ENSG00000005020  133.1474289         248.379817      0.899523377     3.00900687072175e-06
ENSG00000005022  86.49374889         154.5210394     0.837135513     3.79777250197792e-05
ENSG00000005238  0.818439748         8.567484894     3.387923626     7.38045118427266e-07
ENSG00000005249  1.442397316         17.22208291     3.577719117     2.69990749254895e-12
ENSG00000005379  25.15059092         4.02264298      -2.644376691    2.75953193496745e-12
ENSG00000005381  0.376344415         19.36188435     5.685021995     4.99727503015434e-18
ENSG00000005436  28.46288463         11.16816604     -1.349689587    4.23389957443192e-06

[Figure: MA plot of res_m_i$log2FoldChange (-5 to 10) against res_m_i$baseMean (log scale, 1e-01 to 1e+05)]
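The log2FoldChange column can be reproduced from the two expression columns. As a quick check in Python, the sketch below takes the expression values 1.566626326 and 23.78874526 from the table above, for which a log2 fold change of 3.924546167 is reported; the function and variable names are illustrative.

```python
import math


def log2_fold_change(mean_a, mean_b):
    """log2 of the ratio of condition B to condition A."""
    return math.log2(mean_b / mean_a)


# Values copied from the results table above
expression_a = 1.566626326
expression_b = 23.78874526
reported = 3.924546167

computed = log2_fold_change(expression_a, expression_b)
print(round(computed, 4), reported)
```

A positive value means higher expression in the second condition, and a change of one unit corresponds to a two-fold difference.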
Alternative splicing analysis
- Cufflinks
- MISO (http://genes.mit.edu/burgelab/miso/)
  - probabilistic framework that quantitates the expression level of alternatively spliced genes from RNA-Seq data, and identifies differentially regulated isoforms or exons across samples
- DEXSeq (http://www.bioconductor.org/packages/release/bioc/html/DEXSeq.html)
  - differential exon usage
Cufflinks: Alternative splicing
trans_id         bundle_id  chr   left    right   FPKM     FMI  frac  FPKM_conf_lo  FPKM_conf_hi  coverage  length  effective_length  status
ENST00000503254  31391      chr1  465693  688071  787.12   1    1     731.009       843.232       124.849   1509    440.26            OK
ENST00000458203  31391      chr1  470906  471368  0        0    0     0             0             0         462     440.005           OK
ENST00000417636  31391      chr1  521368  523833  0        0    0     0             0             0         842     842               OK
ENST00000423796  31391      chr1  523008  530148  0        0    0     0             0             0         607     607               OK
ENST00000450696  31391      chr1  523047  529954  0        0    0     0             0             0         402     402               OK
ENST00000440196  31391      chr1  529832  530595  0        0    0     0             0             0         437     437               OK
ENST00000357876  31391      chr1  529838  532878  0        0    0     0             0             0         498     498               OK
ENST00000440200  31391      chr1  536815  655580  2.53932  1    1     0             5.72637       0.185236  413     413               OK
ENST00000441245  31391      chr1  637315  655530  0        0    0     0             0             0         629     629               OK
ENST00000419394  31391      chr1  639064  655574  0        0    0     0             0             0         480     480               OK
ENST00000448605  31391      chr1  639064  655580  0        0    0     0             0             0         274     274               OK
ENST00000414688  31391      chr1  646721  655580  0        0    0     0             0             0         750     750               OK
ENST00000447954  31391      chr1  655437  659930  0        0    0     0             0             0         336     336               OK
ENST00000440782  31391      chr1  657464  660287  0        0    0     0             0             0         2823    2823              OK
ENST00000452176  31391      chr1  562756  564390  0        0    0     0             0             0         802     802               OK
ENST00000416931  31391      chr1  564441  564813  96.9279  1    1     77.2375       116.618       21.1488   372     372               OK
ENST00000485393  31391      chr1  564878  564950  0        0    0     0             0             0         72      72                OK
ENST00000482877  31391      chr1  564951  565019  0        0    0     0             0             0         68      68                OK
DEXSeq
- The statistical model is based on generalised linear models of the Negative Binomial family (NB-GLMs)
- Exon-oriented read counts
Visualization: Genome Viewers
- Visualize reads alignment and analysis results
- Manual check of computational predictions: expression levels, alternative splicing, variants
- Track-based visual presentation of data
- Custom tracks upload: BAM, BED, BigWig, GTF…
- Web based:
  - GBrowse (http://gmod.org/wiki/Gbrowse)
  - UCSC Genome Browser
- Standalone:
  - Integrative Genomics Viewer, IGV (http://www.broadinstitute.org/software/igv/)
UCSC Genome Browser
- http://genome.ucsc.edu/
[Figure: UCSC Genome Browser view of the SOD1 locus on chr21 (hg19, roughly 33,033,000 to 33,041,000). Tracks shown: UCSC Genes (RefSeq, UniProt, CCDS, Rfam, tRNAs & Comparative Genomics); RefSeq Genes; Publications: Sequences in scientific articles; Human mRNAs from GenBank; Human ESTs That Have Been Spliced; H3K27Ac Mark (Often Found Near Active Regulatory Elements) on 7 cell lines from ENCODE; Digital DNaseI Hypersensitivity Clusters in 125 cell types from ENCODE; Transcription Factor ChIP-seq from ENCODE; Placental Mammal Basewise Conservation by PhyloP; Multiz Alignments of 46 Vertebrates (Rhesus, Mouse, Dog, Elephant, Opossum, Chicken, X_tropicalis, Zebrafish); Simple Nucleotide Polymorphisms (dbSNP 137) Found in >= 1% of Samples; Repeating Elements by RepeatMasker]
IGV: Differential Expression Visualization
Downstream analysis and bias corrections
- Bias correction
  - RNASeqBias (http://bioinformatics.med.yale.edu/group/software/20110309RNAseqRPackage/)
  - Gene length bias
  - GC content bias
  - Dinucleotide bias
- GO enrichment
  - GOstats (http://www.bioconductor.org/packages/release/bioc/html/GOstats.html)
    - Initially a microarray package, but can be used in RNASeq
  - GOSeq (http://www.bioconductor.org/packages/release/bioc/html/goseq.html)
    - Detects Gene Ontology and/or other user defined categories which are over/under represented in RNA-seq data
- Pathway analysis
  - SPIA (http://www.bioconductor.org/packages/release/bioc/html/SPIA.html)
    - Signaling Pathway Impact Analysis (SPIA) uses the information from a list of differentially expressed genes and their log fold changes together with signaling pathways topology, in order to identify the pathways most relevant to the condition under study
    - KEGG pathways database
    - Human and mouse only
Part II: Practical demonstration
The aim of this demo is to use the DESeq package for RNAseq data analysis. The dataset was produced by a gene expression study of two types of immune cells, namely B-cells and monocytes. We have a total of 8 samples, 4 from B-cells and 4 from monocytes.
- Prerequisites:
  - R (version > 2.15.1)
  - Bioconductor
  - DESeq
- Input data
  - Raw read counts prepared with htseq-count

gene             075_B_cell  083_B_cell  088_B_cell  085_B_cell  085_monocyte  075_monocyte  083_monocyte  088_monocyte
ENSG00000000003  0           0           0           0           0             0             1             0
ENSG00000000419  23          12          9           12          14            4             14            12
ENSG00000000457  48          26          10          17          19            5             8             12
Read and normalize data
First we load DESeq and read the count table:
    library(DESeq)
    countsTable <- read.delim("raw_counts.txt", header=TRUE, stringsAsFactors=TRUE)
    # use the gene identifiers as row names, then drop the gene column
    rownames(countsTable) <- countsTable$gene
    countsTable <- countsTable[ , -1 ]
The next step is to create a conditions vector that assigns each column to a cell type, "B" for B-cells and "M" for monocytes:
    conds <- c(rep("B",4), rep("M",4))
Then we create the main data structure for the count data set using the function newCountDataSet:
    cds <- newCountDataSet( countsTable, conds )
And normalize the read counts by estimating a size factor for each sample:
    cds <- estimateSizeFactors(cds)
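As a rough illustration of what size factor estimation does, the Python sketch below implements the median-of-ratios idea described for DESeq: each sample's factor is the median, over genes, of the ratio between its count and the gene's geometric mean across samples. It is a simplified re-implementation for teaching purposes, not the package code; it uses only the three example rows of the count table shown above, and genes with a zero count in any sample are skipped because their geometric mean is zero.

```python
import math
from statistics import median


def size_factors(counts_by_gene):
    """Median-of-ratios size factors; one list of per-sample counts per gene."""
    n_samples = len(counts_by_gene[0])
    usable_rows, log_geo_means = [], []
    for row in counts_by_gene:
        if all(c > 0 for c in row):  # skip genes with any zero count
            usable_rows.append(row)
            log_geo_means.append(sum(math.log(c) for c in row) / len(row))
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lg
                      for row, lg in zip(usable_rows, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# The three example genes from the input table (4 B-cell, 4 monocyte samples)
counts = [
    [0, 0, 0, 0, 0, 0, 1, 0],
    [23, 12, 9, 12, 14, 4, 14, 12],
    [48, 26, 10, 17, 19, 5, 8, 12],
]
factors = size_factors(counts)
print([round(f, 3) for f in factors])
```

Dividing each column of the count table by its factor puts the samples on a common scale. With only two usable genes the values here are not meaningful estimates; a real run uses all genes.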
Estimate variance and dispersion
Next, we estimate the variance functions (dispersions) for the dataset:
    cds <- estimateDispersions(cds, method="per-condition", sharingMode="maximum")
Now we find the genes that are differentially expressed between the two cell types using the negative binomial test:
    res <- nbinomTest(cds, "B", "M")
Now we draw an MA plot to inspect expression values and fold changes. We also apply a threshold of 0.0001 to the adjusted p-value (the padj field in res) to get a visual impression of the scale of differential expression:
    plot( res$baseMean, res$log2FoldChange, log="x", pch=20, cex=.1,
          col = ifelse( res$padj < .0001, "red", "black" ) )
Significant genes
Finally, we keep only the genes with padj < 0.001 and create a subset of the results for the differentially expressed ones:
    sig <- res[ res$padj < .001, ]
    # drop rows for which no p-value could be computed
    sig <- sig[ !is.na(sig$pval), ]
    head(sig[with(sig, order(padj)), ])
Select the 50 most significant genes for functional annotation analysis:
    noquote(head(sig[with(sig, order(padj)), ]$id, 50))
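The padj column used for filtering holds p-values adjusted for multiple testing with the Benjamini-Hochberg procedure. The following Python sketch shows that adjustment on a made-up list of four p-values; it is an illustrative re-implementation, not DESeq's own code.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    # indices of the p-values sorted from smallest to largest
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # walk from the largest p-value down, enforcing monotonicity
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes
raw = [0.01, 0.04, 0.03, 0.005]
padj = benjamini_hochberg(raw)
print([round(p, 4) for p in padj])
```

Filtering on padj rather than on the raw p-value controls the expected fraction of false positives among the genes reported as significant.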
Downstream analysis
The Database for Annotation, Visualization and Integrated Discovery (DAVID) is a powerful resource for functional annotation analysis.
- We are going to use it to check whether any important functional categories describe the differentially expressed genes we detected.
- http://david.abcc.ncifcrf.gov