Statistical challenges in (RNA

Statistical challenges in (RNA-Seq) data analysis
Julie Aubert
UMR 518 AgroParisTech-INRA Mathématiques et Informatique
Appliquées
Ecole de bioinformatique, Station biologique de Roscoff, 2014 Oct. 7
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
1 / 81
Statistics and biology
Statistics = study of the collection, analysis, interpretation, presentation
and organization of data (source : Wikipedia)
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
2 / 81
Biological question
Experimental design
Sequencing experiment
Low-level analysis
Higher-level analysis
Exploratory Data Analysis,
image analysis, base calling,
read mapping, metadata
integration
Exploratory Data Analysis,
normalization and expression
quantification, differential
analysis, metadata integration
Biological validation and
interprétation
*Adapted from S. Dudoit, Berkeley
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
3 / 81
Outline
1
Experimental design
2
Normalisation and Differential analysis
3
Conclusions
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
4 / 81
Experimental design
Another definition
Statistics consists of trying to understand data and to obtain more
understandable data. Savage (1977)
Key of a good data analysis : having good data to analyze.
Requisite : a clearly defined research objective.
The statistician who supposes that his main contribution to the planning
of an experiment will involve statistical theory, finds repeatedly that he
makes his most valuable contribution simply by persuading the investigator
to explain why he wishes to do the experiment. Gertrude M Cox
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
5 / 81
Experimental design
A statistical model : what for ?
Aim of an experiment : answer to a biological question.
Results of an experiment : (numerous, numerical) measurements.
Model : mathematical formula that relates the experimental
conditions and the observed measurements (response).
(Statistical) modelling : translating a biological question into a
mathematical model (6= PIPELINE !)
Statistical model : mathematical formula involving
the experimental conditions,
the biological response,
the parameters that describe the influence of the
conditions on the (mean, theoretical) response,
and a description of the (technical, biological) variability.
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
6 / 81
Experimental design
Steps of experiment designing
1
Formulate a broadly stated research problem in terms of explicit,
addressable questions.
2
Considering the population under study, identifying appropriate
sampling or experimental units, defining relevant variables, and
determining how those variables will be measured.
3
Describe the data analysis strategy
4
Anticipate eventual complications during the collection step and
propose a way to handle them
source : Northern Prairie Wildlife Research Center, Statistics for Wildlifers : How
much and what kind ?
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
7 / 81
Experimental design
Experimental Design
Definition
A good design is a list of experiments to conduct in order to answer to the
asked question which maximize collected information and minimize the
number of experiments (or the experiments cost) with respect to
constraints.
Basic principles - Fisher (1935)
(technical and biological) replications
Replication (independent obs.) 6= Repeated measurements
Randomization : randomize as much as is practical, to protect against
unanticipated biases
Blocking : dividing the observations into homogeneous groups.
Isolating variation attributable to a nuisance variable (e.g. lane)
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
8 / 81
Experimental design
How to Design a good RNA-Seq experiment in an
interdisciplinary context ?
Some basic rules
Rule 1 Share a minimal common language
Rule 2 Well define the biological question
Rule 3 Anticipate difficulties with a well designed experiment
Make good choices : Replicates vs Sequencing depth
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
9 / 81
Experimental design
Rule 2 : Well define the biological question
From Alon, 2009
Choose scientific problems on feasibility and interest
Order your objectives (primary and secondary)
Ask yourself if RNA-seq is better than microarray regarding the
biological question
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
10 / 81
Experimental design
Objectives - RNA-Seq
Identify differentially expressed (DE) genes ?
Detect and estimate isoforms ?
Construct a de Novo transcriptome ?
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
11 / 81
Experimental design
Rule 3 : Anticipate difficulties with a well designed
experiment
1
Prepare a checklist with all the needed elements to be collected,
2
Collect data and determine all factors of variation,
3
Choose bioinformatics and statistical models,
4
Draw conclusions on results.
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
12 / 81
Experimental design
Be aware of different types of bias
Keep in mind the influence of effects on results :
lane ≤ run ≤ RNA library preparation ≤ biological
(Marioni, 2008), (Bullard, 2010)
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
13 / 81
Experimental design
RNA-seq experiment analysis : from A to Z
Adapted from Mutz, 2013
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
14 / 81
Experimental design
Make good choices
How many reads ?
100M to detect 90% of the transcripts of 81% of human genes
(Toung, 2011)
20M reads of 75bp can detect transcripts of medium and low
abundance in chicken (Wand, 2011)
10M to cover by at least 10 reads 90% of all (human and zebrafish)
genes (Hart, 2013) . . .
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
15 / 81
Experimental design
Biological and technical replicates
Biological replicate : sampling of individuals from a population in order to
make inferences about that population
Technical replicate adresses the measurement error of the assay.
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
16 / 81
Experimental design
Why increasing the number of biological replicates ?
Technical variability => inconsistent detection of exons at low levels of
coverage (<5reads per nucleotide) (McIntyre et al. 2011)
Doing technical replication may be important in studies where low
abundant mRNAs are the focus.
To generalize to the population level
To estimate to a higher degree of accuracy variation in individual
transcript (Hart, 2013)
To improve detection of DE transcripts and control of false positive
rate : TRUE with at least 3 (Sonenson 2013, Robles 2012)
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
17 / 81
Experimental design
Replication : quality rather than quantity
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
18 / 81
Experimental design
More biological replicates or increasing sequencing depth ?
It depends ! (Haas, 2012), (Liu, 2014)
DE transcript detection : (+) biological replicates
Construction and annotation of transcriptome : (+) depth and (+)
sampling conditions
Transcriptomic variants search : (+) biological replicates and (+)
depth
A solution : multiplexing.
Tag or bar coded with specific sequences added during library construction
and that allow multiple samples to be included in the same sequencing
reaction (lane)
Decision tools available : Scotty (Busby, 2013),
RNAseqPower (Hart, 2013)
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
19 / 81
Experimental design
Experimental Design : Scotty
It’s a balance : cost, precision ⇐⇒ nb bio. replicates, sequencing depth.
An example output from the Scotty application.
Busby M A et al. Bioinformatics 2013;29:656-657
© The Author 2013. Published by Oxford University Press. All rights reserved. For Permissions,
please e-mail: [email protected]
This figure shows the user which of the tested experimental configurations do (white) and do not (shaded) conform to the
user-defined constraints. Scotty then indicates the optimal configuration based on cost (filled triangle) and power (filled circle).
Busby M A etJ.al.Aubert
Bioinformatics 2013
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
20 / 81
Experimental design
Experimental Design : key points
The scientific question of interest drives the experimental choices
Collect informations before planning
All skills are needed to discussions right from project construction
Sequencing and other technical biases potentially increase the required
sample size and sequencing depth
Optimum compromise between replication number and sequencing depth
depends on the question
Biological replicates are important in most RNA-seq experiments
Wherever possible apply the three Fisher’s principles of
randomization, replication and local control (blocking)
And do not forget : budget also includes cost of biological data acquisition,
sequencing data backup, bioinformatics and statistical analysis.
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
21 / 81
Normalisation and Differential analysis
RNA-sequencing
_____
Random
fragmentation
__ _ __
___
…
_____
mRNAs from a sample
Reverse
transcription
__ _ __
_ __
…
__ __ _
_ __
…
__ __ _
Fragmented mRNAs
PCR
amplification &
sequencing
cDNAs
mapping
# of reads mapped to
Gene 1 Gene 2 … Gene m
25
320 … 23
A vector of counts
counting
from Gene 19
ATTGCC...
from Gene 23
…
GCTAAC...
…
from Gene 56
AGCCTC...
mapped reads
A list of reads
Adapted from Li et al. (2011)
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
22 / 81
Normalisation and Differential analysis
Isoform detection and quantification
From E. Bernard
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
23 / 81
Normalisation and Differential analysis
Differential Analysis
Identification of differentially expressed genes (DE)
A gene is declared differentially expressed (DE) between two conditions if
the observed difference is statisticially significant, ie more than only du to
natural random variation.
Statistical tools are necessary to take this decision.
The main steps are : experimental design, normalisation and
differential analysis, multiple testing.
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
24 / 81
Normalisation and Differential analysis
Fold Change approach and ideal cut-off values
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
25 / 81
Normalisation and Differential analysis
Fold Change approach and ideal cut-off values
FCi =
1
2
3
4
Gene
Gene1
Gene2
Gene3
Gene4
CondA1
5.00
800.00
700.00
500.00
CondA2
7.00
1000.00
1100.00
1300.00
xi .
yi .
CondB1
2.00
350.00
350.00
550.00
CondB2
2.00
250.00
250.00
50.00
FC
3.00
3.00
3.00
3.00
pvalue
0.06
0.03
0.10
0.33
FC does not take the variance of the samples into account.
Problematic since variability in gene expression is partially gene-specific.
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
26 / 81
Normalisation and Differential analysis
Normalization
Normalization
Definition
Normalization is a process designed to identify and correct technical
biases removing the least possible biological signal. This step is
technology and platform-dependant.
Within-lane normalization
Normalisation enabling comparisons of fragments (genes) from a same
sample.
No need in a differential analysis context.
Between-lane normalization
Normalisation enabling comparisons of fragments (genes) from different
samples.
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
27 / 81
Normalisation and Differential analysis
Normalization
Sources of variability
Within-sample
Gene length
Sequence composition (GC content)
Between-sample
Depth (total number of sequenced and mapped reads)
Sampling bias in library construction ?
Presence of majority fragments
Sequence composition du to PCR-amplification step in library
preparation‘(Pickrell et al. 2010, Risso et al. 2011)
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
28 / 81
Normalisation and Differential analysis
Normalization
Comparison of normalization methods
At lot of different normalization methods...
Some are part of models for DE, others are ’stand-alone’
They do not rely on similar hypotheses
But all of them claim to remove technical bias associated with
RNA-seq data
Which one is the best ?
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
29 / 81
Normalisation and Differential analysis
Normalization
Normalisation methods
Global methods : normalised counts are raw counts divided by a scaling
factor calculated for each sample
Distribution adjustment
Assumption (TC, UQ, Median) : read counts are prop. to expression level
and sequencing depth
Total number of reads : TC (Marioni et al. 2008), Quantile : FQ (Robinson
and Smyth 2008), Upper Quartile : UQ (Bullard et al. 2010), Median
Method taking length into account
Reads Per KiloBase Per Million Mapped : RPKM (Mortazavi et al. 2008)
The Effective Library Size concept
Trimmed Means of M-values TMM (Robinson et Oschlack 2010, edgeR)
DESeq (Anders et Huber 2010, DESeq)
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
30 / 81
Normalisation and Differential analysis
Normalization
4 real datasets and one simulated dataset
RNA-seq or miRNA-seq, DE, at least 2 conditions, at least 2 bio. rep., no
tech. rep.
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
31 / 81
Normalisation and Differential analysis
Normalization
Comparison procedures
Distribution and properties of normalized datasets
Boxplots, variability between biological replicates
Comparison of DE genes
Differential analysis : DESeq v1.6.1 (Anders and Huber 2010), default
param.
Number of common DE genes, similarity between list of genes (dendrogram
- binary distance and Ward linkage)
Power and control of the Type-I error rate
simulated data
non equivalent library sizes
presence of majority genes
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
32 / 81
Normalisation and Differential analysis
Normalization
So the Winner is ... ?
In most cases
The methods yield similar results
However ...
Differences appear based on data characteristics
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
33 / 81
Normalisation and Differential analysis
Normalization
Interpretation
RawCount Often fewer differential expressed genes (A. fumigatus : no DE
gene)
TC, RPKM
Sensitive to the presence of majority genes
Less effective stabilization of distributions
Ineffective (similar to RawCount)
FQ
Can increase between group variance
Is based on an very (too) strong assumption (similar distributions)
Median High variability of housekeeping genes
TC, RPKM, FQ, Med, UQ Adjustment of distributions, implies a similarity
between RNA repertoires expressed
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
34 / 81
Normalisation and Differential analysis
Normalization
Conclusions
Hypothesis : the majority of genes is invariant between two samples.
Differences between methods when presence of majority sequences,
very different library depths.
TMM and DESeq : performant and robust methods in a DE analysis
context on the gene scale.
Normalisation is necessary and not trivial.
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
35 / 81
Normalisation and Differential analysis
Normalization
Normalization : key points
RNA-seq data are affected by technical biaises (total number of mapped
reads per lane, gene length, composition bias)
⇒ A normalization is needed and has a great impact on the DE genes
(Bullard et al 2010), (Dillies et al 2012)
Detection of differential expression in RNA-seq data is inherently biased
(more power to detect DE of longer genes)
Do not normalise by gene length in a context of differential analysis.
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
36 / 81
Normalisation and Differential analysis
Differential analysis
Differential analysis
Aim : To detect differentially expressed genes between two conditions
Discrete quantitative data
Few replicates
Overdispersion problem
Challenge : method which takes into account overdispersion and few
number of replicates
Proposed methods : edgeR, DESeq for the most used and known
Anders et al. 2013, Nature Protocols
An abundant littérature
Comparison of methods : Pachter et al. (2011), Kvam et Liu (2012),
Soneson et Delorenzi (2013), Rapaport el al. (2013)
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
37 / 81
Normalisation and Differential analysis
Differential analysis
Differential analysis gene-by-gene- with replicates
For each gene i
Is there a significant difference in expression between condition A and B ?
Statistical model (definition and parameter estimation) - Generalized
linear framework
Test : Equality of relative abundance of gene i in condition A and B
vs non-equality
The Poisson Model
Let be Yijk the count for replicate j in condition k from gene i
Yijk follows a Poisson distribution (µijk ).
Property : Var (Yijk ) = Mean(Yijk ) = µijk
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
38 / 81
Normalisation and Differential analysis
Differential analysis
Mean-Variance Relationship
From D. Robinson and D. McCarthy
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
39 / 81
Normalisation and Differential analysis
Differential analysis
Overdispersion in RNA-seq data
Counts from biological replicates tend to have variance exceeding the
mean (= overdispersion relative to the Poisson distribution). Poisson
describes only technical variation.
What causes this overdispersion ?
Correlated gene counts
Clustering of subjects
Within-group heterogeneity
Within-group variation in transcription levels
Different types of noise present...
In case of overdispersion, ↑ of the type I error rate (prob. to declare
incorrectly a gene DE).
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
40 / 81
Normalisation and Differential analysis
Differential analysis
Alternative : Negative Binomial Models
A supplementary dispersion parameter φ to model the variance
Poisson vs Negative Binomial Models
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
41 / 81
Normalisation and Differential analysis
Tag ID A1 A2 A3 A4 B1 B2 B3
ENSG00000124208
478
619
628
744
483
716
240
ENSG00000182463
27
20
27
26
48
55
24
ENSG00000125835
132
200
200
228
560
408
103
ENSG00000125834
42
60
72
86
131
99
30
ENSG00000197818
21
29
35
31
52
44
20
0
0
0
0
ENSG00000215443
4
4
4
0
9
7
4
ENSG00000222008
30
23
29
19
0
0
0
46
63
58
54
53
17
ENSG00000125831
ENSG00000101444
ENSG00000101333
…
0
0
2
2256 2793 3456
71
3362
Differential analysis
Model assumptions
Mj = library size
λij = relative abundance of
feature i
2702 2976 1320
……
  Poisson describes technical variation:
Yij ~ Pois( Mj * λij )
mean(Yij)= variance(Yij) = Mj * λij
  Negative binomial models biological variability using the
dispersion parameter ϕ:
Yij ~ NB( μij=Mj * λij , ϕi )
  Same mean, variance is quadratic in the mean:
variance( Yij ) = μij ( 1 + μij ϕi )
Critical parameter to estimate: dispersion
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
42 / 81
Normalisation and Differential analysis
Differential analysis
φ estimation : crucial !
Many genes, very few biological samples - difficult to estimate φ on a
gene-by-gene basis
Some proposed solutions
Method
DESeq
edgeR
NBPseq
Variance
µ(1 + φµ µ)
µ(1 + φµ)
µ(1 + φµα−1 )
Reference
Anders et Huber (2010)
Robinson et Smyth (2009)
Di et al. (2011)
edgeR : borrow information across genes for stable estimates of φ
3 ways to estimate φ (common, trend, moderated)
DESeq : data-driven relationship of variance and mean estimated
using parametric or local regression for robust fit across genes
NBPseq : φ and α estimated by LM based on all the genes.
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
43 / 81
Normalisation and Differential analysis
Differential analysis
A lot of statistical methods...still developped
Parametric Model
Normalized
Count Data
NB
TSPM
zéro inf. NB
BaySeq, DESeq,
ShrinkSeq
EBSeq, edgeR,
NBPseq, ShrinkSeq
DESeq, edgeR,
NBSeq, TSPM
Bayesian
Bayseq, EBSeq,
ShrinkSeq
Differential Analysis
Classical
Test
NOISeq, NOISeqBIO, SAMseq
Estimation
P+
Quasi-L
Non Parametric Model
semiparametric
BaySeq, DESeq, EBSeq, edgeR,
NBPSeq, ShrinkSeq, TSPM
Wilcoxon's
statistic
Gaussian
Kernel
Empirical
distribution
« noise »
SAMseq
NOISeqBIO
NOISeq
Permutation
SAMseq
Bayesian Thresholding
NOISeqBIO
NOISeq
Differential Analysis between two conditions
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
44 / 81
Normalisation and Differential analysis
Differential analysis
Comparaison of differential analysis methods
Soneson et Delorenzi (2013)
Evaluation of 11 methods on both simulated and real data.
Obs 1 : The number of replicates matters ! (Differently for different
methods)
Obs 2 : Results are more accurate and less variable between methods
if DE genes are regulated in both directions.
Obs 3 : Outlier counts affect different methods in different ways
The dispersion estimation method matters !
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
45 / 81
Normalisation and Differential analysis
Differential analysis
Comparaison of differential analysis methods
Soneson et Delorenzi (2013)
Small number of replicates (2-3) or low expression → be careful ! !
Large number of replicates (10 or so) or very high expression →
method choice does not matter much.
Removing genes with outlier counts or using non-parametric methods
reduce the sensitivity to outliers
Allow tagwise dispersion values
Normalization methods have problems when all DE genes are
regulated in one direction. Iterative approaches like TCC improve
performance
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
46 / 81
Normalisation and Differential analysis
Differential analysis
Comparaison of differential analysis methods
Rapaport et al. (2013)
Evalution on methods using SEQC benchmark dataset and ENCODE data.
Significant differences between methods.
Array-based methods adapted perform comparably to specific
methods.
Increasing the number of replicates samples significantly improves
detection power over increased sequencing depth.
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
47 / 81
Normalisation and Differential analysis
Differential analysis
DESeq2 Love and Huber (2013)
Differences with DESeq.
Dispersion shrinkage
Fold Change shrinkage (for PCA and Gene Set Enrichissment
Analysis)
Detection of outliers
Improves power
Only one command line
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
48 / 81
Soneson and Delorenzi BMC Bioinformatics 2013, 14:91
Normalisation and Differential analysis
Differential analysis
http://www.biomedcentral.com/1471-2105/14/91
Outliers
Type I error rate at p_nom < 0.05, B00
A
B
0.20
Type I error rate at p
0.20
Challenge: edgeR can be sensitive to outliers
0.15
0.15
0
10
Sample
20
30
2000
40
50
60
0.00
0
10
20
Sample
30
40
Sample
50
60
Figure 2. Counts from some miRNAs found to be very significant by edgeR do not seem to follow negative binomial
0
0
Type Ithe
error
rate from
at p_nom
< 0.05, inBthe
Type
I error
at p_nom
< 0.05,
Type
I 10th
error
rate rate
at p_nom
< 0.05,
S0 P0
distributions. Each panel shows
counts
one miRNA
Witten data.28 These miRNAs are the
7th,
0
and 11th most significant features detected by edgeR. The heights of vertical bars show the scaled counts from the
samples.
29 bars, coloured blue, are
0.20 The first 29 bars, coloured red, are samples from the one class, and the other
0.20 0.20
from the other. The black broken line is also drawn to separate the two classes. In each panel, we see that one count
has much larger values than all the other counts.
CB
False positive rate
A
no outliers
0.00
D
DESeq.5
edgeR.5
vst.5
60
DESeq.2
edgeR.2
vst.2
voom.2
TSPM.2
NBPSeq.2
50
DESeq.10
edgeR.10
vst.10
voom.10
TSPM.10
NBPSeq.10
40
DESeq.5
edgeR.5
vst.5
voom.5
TSPM.5
NBPSeq.5
30
0.05
Page 8 of 18
DESeq.2
edgeR.2
vst.2
voom.2
TSPM.2
NBPSeq.2
20
0.10
Li and Tibshirani, 2011
0
0
10
0.05
6000
Scaled counts
2000
Scaled counts
500 1000
6000
Scaled counts
0
2000
Soneson and Delorenzi BMC Bioinformatics 2013, 14:91
http://www.biomedcentral.com/1471-2105/14/91
0
miR−375, No. 11 by edgeR
0.10
10000
miR−133b, No. 10 by edgeR
10000
miR−206, No. 7 by edgeR
Type I error rate
Statistical Methods in Medical Research 0(0)
Type I error rate
4
Type I error rate at p
0.20
presence of outliers
0.20
DESeq.5
voom.5
vst.5
DESeq.10
vst.10
DESeq.10
voom.10
edgeR.10
TSPM.10
vst.10
edgeR.10
voom.10
NBPSeq.10
TSPM.10
NBPSeq.10
DESeq.2
voom.2
vst.2
TSPM.2
edgeR.2
NBPSeq.2
Type I error rate
0.20
DESeq.5
vst.5
DESeq.5
voom.5
edgeR.5
TSPM.5
vst.5
edgeR.5
voom.5
NBPSeq.5
TSPM.5
NBPSeq.5
DESeq.2
vst.2
DESeq.2
voom.2
edgeR.2
TSPM.2
vst.2
edgeR.2
voom.2
NBPSeq.2
TSPM.2
NBPSeq.2
Type I error rate
DESeq.10
edgeR.10
vst.10
voom.10
TSPM.10
NBPSeq.10
DESeq.5
edgeR.5
vst.5
voom.5
TSPM.5
NBPSeq.5
DESeq.2
edgeR.2
vst.2
voom.2
TSPM.2
NBPSeq.2
Type I error rate
Type I error rate
0.15 0.15
0.15
0.15
When
the assumed model does not hold, our statistic is able to select significant
features much more
efficiently than parametric methods. Also, in contrast to parametric methods, our method gives a
reliable estimate of the FDR. On several real data sets, our method is able to find features that are
expressed
consistently higher in one class, and these are more likely to 0.10
be biologically
meaningful.
0.10
0.10
0.10
Moreover, the use of current parametric methods is limited in the outcome types that they can
handle. Except for PoissonSeq,20 to our knowledge, existing methods can only be used for data with
two-class outcomes. PoissonSeq can also be used for data with quantitative outcomes and multiple0.05 0.05
0.05
class0.05
outcomes, but not survival outcomes. Because of the complexity of
parametric methods, it is
often difficult to extend them to other types of outcomes. In contrast, our nonparametric method
can be used for all the types of outcomes mentioned above. Further, the resampling strategy that we
developed
(Section 2.2) eliminates the difference between sequencing depths
of experiments, making
0.00 0.00
0.00
0.00
it easy to generalize our method to other possible types of outcomes.
The rest of this article is organized as follows. In Section 2, we propose a nonparametric statistic
for data with a two-class outcome and the associated resampling strategy, as well as a permutation
plug-in method to estimate the false discovery rate FDR. In Section 3, we study the performance of
our nonparametric method on simulated data sets, and compare it with three available methods,
Soneson
and
2013
Type PoissonSeq
I error rates.and
Type I error
rates, for the
six Delorenzi,
methods providing
nominal p-values, in simulation studie
edgeR, PoissonSeq and DESeq. In Section 4, we apply our method as Figure
well as 3edgeR,
I error rate at p_nom < 0.05, S0
Type I error rate at p_nom < 0.05, R00
0
S00 (panel
D). as
Letting some counts follow a Poisson distribution (panel B) reduced the type I error r
C on three real Type
D C) and
DESeq
RNA-Seq data sets, and compare
the list of
features
that Rare
called
0 (panel
overall
small(RNA-Seq)
effect. Including
counts (panels
C and
D) had7a detrimental
differentially
expressed by different methods. In SectionStat.
5, we extend
our anonparametric
statisticoutliers with abnormally high
J. Aubert
challenges
Roscoff,
2014
Oct.
49 effect
/ 81on t
Normalisation and Differential analysis
Differential analysis
Vol 464 | 1 April 2010 | doi:10.1038/nature08903
LETTERS
Why is robustness
needed?
Transcriptome genetics using second generation
sequencing in a Caucasian population
Stephen B. Montgomery1,2, Micha Sammeth3, Maria Gutierrez-Arcelus1, Radoslaw P. Lach2, Catherine Ingle2,
James Nisbett2, Roderic Guigo3 & Emmanouil T. Dermitzakis1,2
Nature, 2010
(meanexpression
6 s.d.) million reads that were then mapped to the NCBI36
and environmental
on cellular
state. Many studies
Random split of dataset: n1=5; n2=5 genetic
Very
littleeffects
true
differential
assembly of the human genome (Supplementary Fig. 1) using MAQ .
have previously identified genetic variants for gene expression pheGene expression is an important phenotype that informs about
in one lane of an Illumina GAII analyzer and yielded 16.9 6 5.9
16
Results driven by outliers
CPMs
(counts
per
million)
4004
2538
4962
7921
6115
5156
2527
1115
3175
7951
7631
3437
4004
2538
4962
7921
6115
5156
2527
1115
3175
7951
7631
3437
J. Aubert
1–5
notypes using custom and commercially available microarrays .
Second generation sequencing technologies are now providing
unprecedented access to the fine structure of the transcriptome6–14.
We have sequenced the mRNA fraction of the transcriptome in 60
extended HapMap individuals of European descent and have combined these data with genetic variants from the HapMap3 project15.
We have quantified exon abundance based on read depth and have
also developed methods to quantify whole transcript abundance.
We have found that approximately 10 million reads of sequencing
can provide access to the same dynamic range as arrays with better
quantification of alternative and highly abundant transcripts.
Correlation with SNPs (small nucleotide polymorphisms) leads to
a larger discovery of eQTLs (expression quantitative trait loci) than
with arrays. We also detect a substantial number of variants that
influence the structure of mature transcripts indicating variants
responsible for alternative splicing. Finally, measures of allelespecific expression allowed the identification of rare eQTLs and
allelic differences in transcript structure. This analysis shows that
high throughput sequencing technologies reveal new properties of
genetic effects on the transcriptome and allow the exploration of
genetic effects in cellular processes.
Genetic variation in gene expression is an important determinant of
human phenotypic variation; a number of studies have elucidated
genome-wide patterns of heritability and population differentiation
and are beginning to unravel the role of gene expression in the aetiology
of disease1–5. Interrogation of the transcriptome in these studies has
been greatly facilitated by the use of microarrays, which quantify transcript abundance by hybridization. However, microarrays possess
several limitations and recent advances in transcriptome sequencing
in second generation sequencing platforms have now provided singlenucleotide resolution of gene expression providing access to rare transcripts, more accurate quantification of abundant transcripts (above
the signal saturation point of arrays), novel gene structure, alternative
splicing and allele-specific expression6–14. Although RNA-Seq studies
have addressed issues of transcript complexity, they have not yet
addressed how genetic studies can benefit from this increased resolution to reveal novel effects of sequence variants on the transcriptome.
To understand the quantitative differences in gene expression
within a human population as determined from second generation
sequencing, we sequenced the mRNA fraction of the transcriptome of
lymphoblastoid cell lines (LCLs) from 60 CEU (HapMap individuals
of European descent) individuals (from CEPH—Centre d’Etude du
Polymorphisme Humain) using 37-base pairs (bp) paired-end
Illumina sequencing. Each individual’s transcriptome was sequenced
We subsequently filtered reads that had low mapping quality, mapped
sex chromosomes or mitochondrial DNA and were not correctly
paired, which yielded 9.4 6 3.3 million reads. On average, 86% of
the filtered reads mapped to known exons in Ensembl version
54(ref. 17) and 15% of read pairs spanned more than one exon.
Evaluation of sequence and mapping quality measures was preformed
to ensure that the data quality is acceptable for analysis (Supplementary Fig. 2, also see methods).
We quantified reads for known exons, transcripts and whole genes.
Read counts for each individual were scaled to a theoretical yield of 10
million reads and corrected for peak insert size across corresponding
libraries. Each quantification was filtered to exclude those with missing data for . 10% of the individuals. For exons, this resulted in data
for 90,064 exons for 10,777 genes. Of these, 95% had on average more
than 10 reads, 38% more than 50 reads and 20% had a mean quantification of $ 100 reads (Supplementary Fig. 3). For transcript quantification, new methods needed to be developed to map reads
into specific isoforms18,19. We developed a methodology, called the
FluxCapacitor, to quantify abundances of annotated alternatively
spliced transcripts (see Methods). Using this method, we obtain relative quantities for 15,967 transcripts from 11,674 genes. For each
individual, we compared whole-gene read counts to array intensities
generated with Illumina HG-6 version 2 microarrays. Correlations
coefficients between RNA-Seq and array quantities and among
RNA-Seq samples were high and consistent with previous studies20
(Supplementary Figs 4 and 5). Finally, we explored whether the correlation structure of abundance among exons could facilitate the
development of a framework that will allow the imputation of abundance values for exons that are not screened, given a set of reference
RNA-Seq samples. This is the same principle as using the correlation
structure (Linkage Disequilibrium) of genetic variants to impute
variants from a reference to any population sample of interest21. For
each of the 10,777 genes, we assessed the pairwise correlation of all exons
and on average, any two pairs of exons within a gene were moderately
correlated (mean Pearson’s correlation R2 5 0.378 6 0.261) (Supplementary Fig. 6). This correlation increased with increase in total
number of reads present in each exon. It is worth noting that the average
correlation coefficient between SNPs within the same recombination
hotspot interval in HapMap3 is R2 5 0.326 6 0.174, indicating that the
correlation structure within genes is stronger and probably more accessible by imputation methodologies than SNPs; however, this needs to be
assessed in a tissue-specific context.
Association of gene expression measured by RNA-Seq with genetic
variation was evaluated in cis with the use of 1.2 million HapMap3
NA19222 NA12287 NA19172 NA11881 NA18871 NA12872 NA18916 NA18856 NA19193 NA19140
0.0
1.9
178.1
0.0
0.5
0.0
0.0
0.0
0.0
0.0
2.0
0.6
235.5
6.8
60.2
1.0
0.0
0.0
2.5
1.3
3.5
0.6
429.5
1.0
35.9
0.0
0.4
0.0
0.0
4.7
1.0
5.1
78.9
2.9
0.0
0.0
0.8
0.0
0.0
0.4
0.0
1.3
0.0
1.9
0.0
0.5
46.1
0.0
100.1
1.3
13.8
1.3
30.7
0.0
7.1
0.0
0.0
1.0
0.0
1.3
23.7
111.0
228.8
77.0
129.5
10.0
45.3
27.4
26.3
19.1
2.0
15.2 1074.8
19.5
13.2
10.0
29.6
0.0
1.3
5.5
3.0
6.3
181.0
7.8
7.6
0.0
5.5
3.0
3.1
2.5
1.0
12.1
35.9
0.0
1.0
1.0
0.0
1.0
0.0
0.0
0.0
1.9
0.4
1.0
0.0
0.5
29.6
0.0
24.4
5.5
24.6
31.1
167.0
4.9
21.2
4.5
8.3
10.1
8.1
0.4
logFC
logCPM
LR
PValue
FDR
-10.413038 4.186203 30.07924 4.147469e-08 0.0002239513
-5.942865 4.963086 29.60406 5.299369e-08 0.0002239513
-6.387829 5.576979 26.06085 3.308237e-07 0.0009320406
-5.808379 3.183079 22.51927 2.080466e-06 0.0043960241
5.746084 3.921353 21.37010 3.786299e-06 0.0064003595
-4.573655 2.512035 20.13483 7.217026e-06 0.0101663841
-2.154480 6.128702 18.44343 1.750229e-05 0.0211327628
-4.575934 6.873996 18.14127 2.051076e-05 0.0211672325
Department of Genetic Medicine and Development, University of Geneva Medical School, Geneva, 1211 Switzerland. Wellcome Trust Sanger Institute, Cambridge CB10 1HH, UK.
-3.843458 4.473754 17.71318 2.568407e-05
0.0211672325
Center for Genomic Regulation, University Pompeu Fabra, Barcelona, Catalonia, 08003 Spain.
773
-4.786326 2.416892 17.66324 2.636730e-05 0.0211672325
©2010 Macmillan Publishers Limited. All rights reserved
4.311717 2.683367 17.57990 2.754846e-05 0.0211672325
-3.014484 4.821100 17.05690 3.627624e-05 0.0255505626
1
2
3
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
50 / 81
Normalisation and Differential analysis
Differential analysis
Current policies (robustness)
edgeR ? one option : moderate dispersion less towards trend
Allows dispersions to be driven more by the data
DESeq ? take the maximum of the fit (trended) or the feature-specific
dispersion
Very robust, but many genes pay a penalty, less powerful.
DESeq2 ? calculate Cook’s distance and filter genes with outliers
Can inadvertently filter interesting genes
Goal (Robinson and co.) : achieve a middle ground between protection
against outliers while maintaining high power with observation weights
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
51 / 81
Normalisation and Differential analysis
Differential analysis
Robinson’s simulations
Robust edgeR suffers a tiny bit in power with no outliers, but has
good capacity to dampen their effect if present
DESeq’s policy on outliers has a global effect, resulting in (sometimes
drastic) drop in power
DESeq2 is very powerful in the absence of outliers, but policy to filter
outliers results in loss of power
edgeR and edgeR robust are a bit liberal (5% FDR might mean 6% or
7%
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
52 / 81
Normalisation and Differential analysis
Multiple testing
Multiple Testing
False positive (FP) (type I error : α) : A not DE gene which is declared
DE.
For all ’genes’, we test H0 (gene i is not DE) vs H1 (the gene is DE) using
a statistical test (calcul of a score)
Pb :
Let assume all the G genes are not DE. Each test is realized at α level
Ex : G = 10000 genes and α = 0.05 → E (FP) = 500 genes.
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
53 / 81
Normalisation and Differential analysis
Multiple testing
The Family Wise Error Rate (FWER)
Definition
FWER = P(FP > 0)
Probability of having at least one false positive, of declaring DE at least
one non DE gene.
The Bonferroni procedure
Either each test is realized at α = α∗ /G level or use of adjusted pvalue
pBonfi = min(1, pi ∗ G)
and FWER ≤ α∗ .
For G = 2000, ≤ α∗ = 0.05, α = 2.510−5 .
Easy but conservative and not powerful.
When the number of tests increases, the FWER → 1 with constant FP.
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
54 / 81
Normalisation and Differential analysis
Multiple testing
The False Discovery Rate
Source : M. Guedj, Pharnext
The procedure of Benjamini-Hochberg (95) is one of the procedure which
controls the False Discovery Rate FDR=E (FP/P) siP > 0.
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
55 / 81
Normalisation and Differential analysis
Multiple testing
False Discovery Rate (FDR) Benjamini et Hochberg (95)
FDR =
E (FP/P)
=1
si P > 0
sinon
Idea : Do not control the error rate ut the proportion of error ⇒ less
conservative than control of the FWER.
Prop
FDR ≤ FWER
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
56 / 81
Normalisation and Differential analysis
Multiple testing
FDR
Principle : The number of declared positive elements P is given by the
greater i p(i) ≤ iα∗ /G.
Prop
In case of independant tests, FDR ≤ (G0 /G)α∗ ≤ α∗
Prop
FDR Benjamini-Hochberg : π0 =
J. Aubert
G0
G =1
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
57 / 81
Normalisation and Differential analysis
Multiple testing
Multiple testing : key points
Important to control for multiple tests
FDR or FWER depends on the cost associated to FN and FP
Controlling the FWER :
Having a great confidendence on the DE elements (strong control).
Accepting to not detect some elements (lack of power ⇔ a few DE
elements)
Controlling the FDR :
Accepting a proportion of FP among DE elements. Very interesting in
exploratory study.
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
58 / 81
Normalisation and Differential analysis
Multiple testing
Interpretation - Statistical and practical significance
Practical significance (importance) and statistical significance
(detectability) have little to do with each other.
An effect can be important, but undetectable (statistically
insignificant) because the data are few, irrelevant, or of poor quality.
An effect can be statistically significant (detectable) even if it is small
and unimportant, if the data are many and of high quality.
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
59 / 81
Normalisation and Differential analysis
Multiple testing
Differential Analysis : key points
Methods dedicated to microarrays are not applicable to RNA-seq
Small number of replicates (2-3) or low expression → be careful ! !
Large number of replicates (10 or so) or very high expression →
method choice does not matter much.
Removing genes with outlier counts or using non-parametric methods
reduce the sensitivity to outliers
Don’t forget to correct for multiple testing !
Adapt the method to your data (nb of rep.)
Specific methods developped for few replicates.
The need for ’sophisticated’ methods decreases when the number of
replicates increases.
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
60 / 81
Normalisation and Differential analysis
Multiple testing
Other questions
Gene-Set Enrichissment Analysis
These tests assume that genes have the same chance to be declared DE.
But sometimes over-detection of longer and mare expressed genes
GOSeq (Young et al. 2011)
Filter or not
Rau et al. 2013 ; Huber et al.
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
61 / 81
Conclusions
General conclusions
Pratical conclusions
Need to collaborate between biologists, bioinformaticians et
statisticians
and in a ideal world since the project construction
Adaptation of methods and tools to the asked question (no pipeline)
Check all the steps of the data analysis (quality, normalization,
differential analysis . . . )
Statistics not only useful for differential analysis with RNA-seq
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
62 / 81
Conclusions
From Kumamaru, 2012
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
63 / 81
Conclusions
Aknowledgements
StatOmique (in particular M.-A. Dillies, A. Rau, C. Hennequet-Antier)
PEPI IBIS (INRA) Pôle planification expérimentale et RNA-Seq
David Robinson, Charlotte Soneson
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
64 / 81
Conclusions
References
Anders, S and Huber, W. (2010) Differential expression analysis
for sequence count data, Genome Biology,11 :R106.
Anders, S, McCarthy, DJ, Chen, Y, Okoniewski, M, Smyth GK,
Huber, W and Robinson, MD (2013) Count-based differential
expression analysis of RNA sequencing data using R and
Bioconductor, Nature Protocols, doi :10.1038.
Bullard JH, Purdom E, Hansen KD, Dudoit S. (2010) Evaluation of
statistical methods for normalization and differential expression
in mRNA-seq experiments, BMC Bioinformatics, 11 :94
Di Y, Schaef, DW, Cumbie JS, Chang JH (2011) The NBP
Negative Binomial Model for Assessing Differential Gene
Expression from RNA-Seq, Statistical Applications in Genetics and
Molecular Biology, 10(1), Article 24.
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
65 / 81
Conclusions
References
Dillies M-A et al. on behalf of The French StatOmique Consortium
(2012) A comprehensive evaluation of normalization methods
for Illumina high-throughput RNA sequencing data analysis,
Briefing in Bioinformatics.
Kvam V, Liu P (2012) A comparison of statistical methods for
detecting differentially expressed genes from RNA-seq data
Li J, Jiang H, Wong WH, (2010) Modeling non-uniformity in short
read rates in RNA-Seq data, Genome Biology, 11 :R50
Li J, Witten DM, Johnstone IM, Tisbhirani R (2011) Normalization,
testing, and false discovery rate estimation for RNA-sequencing
data, Biostatistics, 1-16
Marioni J.C., Mason C.E. et al. (2008) RNA-seq : An assessment
of technical reproducibility and comparison with gene
expression arrays, Genome Research, 18 : 1509-1517.
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
66 / 81
Conclusions
References
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. (2008)
Mapping and quantifying mammalian transcriptomes by
RNA-seq. Nature Methods, 5(7), 621-628
Pachter L (2011) Models for transcript quantification from
RNA-seq
Rapaport et al. (2013) Comprehensive evaluation of differential
gene expression analysis methods for RNA-seq data, Genome
Biology,14 :R95
Robles et al (2012) Efficient experimental design and analysis
strategies for the detection of differential expression using
RNA-Sequencing, preprint
Robinson MD, Oshlack A. (2010) A scaling normalization method
for differential expression analysis of RNA-seq data. Genome
Biology, 11 :R25
Robinson MD and Smyth, GK. (2008) Small-sample estimation of
negative binomial dispersion, with applications to SAGE data
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
67 / 81
Conclusions
References
Robinson MD, McCarthy DJ, Smyth, GK. (2009) edgeR : a
Bioconductor package for differential expression analysis of
digital gene expression data, Bioinformatics
Soneson, C, Delorenzi, M. (2013) A comparison of methods for
differential expression analysis of RNA-seq data. BMC
Bioinformatics,14 :91
Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren
MJ, Salzberg SL, Wold BJ, Pachter L. (2011) Transcript assembly
and quantification by RNA-Seq reveals unannotated transcripts
and isoform switching during cell differentiation, Nature
Biotechnology, 28(5) : 511 ?515.
Young MD, Wakefiled MJ, Smyth GK., Oshlack A. (2011) Gene
ontology analysis for RNA-seq :accounting for selection bias ,
Genome Biology
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
68 / 81
Conclusions
Notations
xij : number of reads for gene i in sample j
Nj : number of reads in sample j (library size of sample j)
n : number of samples in the experiment
ŝj : normalization factor associated with sample j
Li : length of gene i
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
69 / 81
Conclusions
RPKM normalization
Reads Per Kilobase per Million mapped reads
Adjust for lane sequencing depth (library size) and gene length
Motivation greater lane sequencing depth and gene length =>
greater counts whatever the expression level
Assumption read counts are proportional to expression level, gene
length and sequencing depth (same RNAs in equal proportion)
Method divide gene read count by total number of reads (in million)
and gene length (in kilobase)
xij
∗ 103 ∗ 106
Nj ∗ Li
(1)
Allows to compare expression levels between genes of the same sample
Unbiased estimation of number of reads but affect the variance.
(Oshlack et al. 2009)
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
70 / 81
Conclusions
The Effective Library Size concept
Motivation
Different biological conditions express different RNA repertoires, leading to
different total amounts of RNA
Assumption
A majority of transcripts is not differentially expressed
Aim
Minimizing effect of (very) majority sequences
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
71 / 81
Conclusions
Methods based on the Effective Library Size Concept
Trimmed Mean of M-values Robinson et al. 2010 (edgeR)
Filter on transcripts with nul counts, on the resp. 30% and 5% more extreme
k
Mi = log2( YYik0/N
0 ) and A values
ik /Nk
Hyp : We may not estimate the total ARN production in one condition but we may
estimate a global expression change between two conditions from non extreme Mi
distribution.
Calculation of a scaling factor between two conditions and normalization to avoid
dependance on a reference sample
Anders and Huber 2010 - Package DESeq
sˆj = mediani (
kij
)
(πvm=1 kiv )1/m
kij : number of reads in sample j assigned to gene i
denominator : pseudo-reference sample created from geometric mean across samples
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
72 / 81
Conclusions
Length bias (Oshlack 2009, Bullard et al. 2010)
At same expression level, a long transcript will have more reads than a
shorter transcript. Number of reads 6= expression level
µ = E (X ) = cNL = Var (X )
X mesured number of reads in a library mapping a specific transcript,
Poisson r.v.
c proportionnality constant
N total number of transcripts
L gene length
Test :
X1 − X2
t=p
(cN1 L + cN2 L)
p
Power of test depends on a parameter prop. to (L).
Identical result after normalization by gene length (but out of the Poisson
framework).
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
73 / 81
Conclusions
Negative Binomial Models
µijk = λij mjk where mj k : size factor (library size)
Test : H0i : λiA = λiB vs H1i : λiA 6= λiB
edgeR
Adjust observed counts up or down depending on whether library sizes
are below or above the geometric mean => Creates approximately
identically distributed pseudodata
Estimation of φi by conditional ML conditioning on the TC for gene i
Empirical Bayes procedure to shrink dispersions toward a consensus
value
An exact test analogous to Fisher’s exact test but adapted to
overdispersed data (Robinson and Smyth 2008)
DESeq
Test similar to Fisher’s exact test (calculation has changed)
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
74 / 81
Conclusions
Negative Binomial Models - DESeq
Assumptions :
1
Yijk ∼ NB(µijk , σijk ), where µijk is the mean, and σijk is the variance
2
The mean µijk is the product of a condition-dependent per-gene value
λij and a size factor (library size) mjk :
µijk = λij mjk
3
Variance decomposition : The variance σijk is the sum of a shot noise
term and a raw variance term : σijk = µijk + αi µ2 where αi the
dispersion value.
4
Per-gene dispersion αi or pooled α is a smooth function of the mean :
αi = fj (λij )
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
75 / 81
Conclusions
DESeq Bioconductor package
Three sets of parameters need to be estimated :
1
Size factors mj k (normalization factors) (see normalization part)
2
For each experimental condition j, n expression strength parameters
λij estimated by average of counts from the replicates for each
condition, transformed to the common scale :
1 X yijk
λˆij =
rj k mˆj k
3
The smooth functions fj for each condition j to model dependence of
αi on the expected mean λij : local or gamma GLM estimation
(fit=’local’ or fit=’parametric’)
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
76 / 81
Conclusions
Practical considerations
Input Data = raw counts
normalization offsets are included in the model
Version matters : edgeR 2.6.7 et DESeq 1.6.1 (Bioconductor 2.9)
edgeR
TMM normalization Common dispersion must be estimated before tagwise
dispersions GLM functionality (for experiments with multiple factors) now
available
DESeq
Two possibilities to obtain a smooth functions fj (·)
Conservative estimates : genes are assigned the maximum of the
fitted and empirical values of αi (sharingMode = "maximum")
Local fit regression (as described in paper) is no longer the default
Each column = independent biological replicate
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
77 / 81
Conclusions
First commands
Installation des packages :
source("http ://www.bioconductor.org/biocLite.R")
biocLite(c("DESeq", "edgeR"))
Chargement des packages :
library(DESeq)
library(edgeR)
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
78 / 81
Conclusions
edgeR main commands
generate raw counts from NB, create list object
y <- matrix(rnbinom(80,size=1/0.2,mu=10),nrow=20,ncol=4)
rownames(y) <- paste("Gene",1 :nrow(y),sep=".")
group <- factor(c(1,1,2,2))
perform DA with edgeR
y <- DGEList(counts=y,group=group)
y <- calcNormFactors(y,method="TMM")
y <- estimateCommonDisp(y)
y <- estimateTagwiseDisp(y)
result <- exactTest(y,dispersion="tagwise")
Observe some results - DGE with FDR BH
topTags(result)
summary(decideTestsDGE(result),p.value=0.05)
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
79 / 81
Conclusions
DESeq main commands
cds <- newCountDataSet(y, group)
cds <- estimateSizeFactors(cds)
sizeFactors(cds)
cds <- estimateDispersions(cds)
res <- nbinomTest( cds, "1", "2" )
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
80 / 81
Conclusions
Quelques références pour débuter
http://www.r-project.org/ : manuel, FAQ, RJournal, etc...
http://www.bioconductor.org/help/publications/
cran.r-project.org/doc/contrib/Paradis-rdebuts_fr.pdf
G. Millot, (2009), Comprendre et réaliser les tests statistiques à l’aide
de R, Editions De Boeck, 704 p.
J. Aubert
Stat. challenges (RNA-Seq)
Roscoff, 2014 Oct. 7
81 / 81