baySeq

Identifying differentially expressed genes from RNA-seq data
Many recent algorithms for calling differentially expressed genes:
edgeR: Empirical analysis of digital gene expression data in R
http://www.bioconductor.org/packages/2.10/bioc/html/edgeR.html
DEseq: Differential gene expression analysis based on the
negative binomial distribution
http://www-huber.embl.de/users/anders/DESeq/
baySeq: Empirical Bayesian analysis of patterns of differential
expression in count data
http://www.bioconductor.org/packages/2.10/bioc/html/baySeq.html
1
Identifying differentially expressed genes from RNA-seq data
Why can’t we use the same software for microarray data analysis?
1. Microarray data are continuous values. Sequence data are discrete values of
read counts.
e.g. 5.342 signal intensity versus 5 counts
2. Reproducibility of RNA-seq measurements is different for low-abundance versus
high-abundance transcripts – this is called over-dispersion
2
Overdispersion: variance of replicates is higher for high-abundance reads
3
baySeq
Empirical Bayesian analysis of patterns of
differential expression in count data
http://www.bioconductor.org/packages/2.10/bioc/html/baySeq.html
Identifies differentially expressed genes between 2 or more samples
using replicated RNA-seq data
4
Bayes’ Theorem
P (M | D) = P (D | M) * P (M)
P (D)
Read in English: The Probability that your Model is correct Given ( | ) the Data is
equal to the Probability of your Data Given the Model times the
Probability of your Model divided by the Probability of the Data
What the hell does this mean?
5
Bayes’ Theorem: wikipedia’s ridiculous example
Your friend had a conversation with someone who happened to have long hair.
What is the probability that that person was a woman, given that you know
~50% of people are women and ~75% of women have long hair.
P (W) = Probability that this person was a woman = 0.5
P (L | W) = Probability that the person had long hair IF the person is a woman = 0.75
P (L | M) = Probability that the person had long hair IF that person is a man = 0.3
P (L) = Probability that any random person has long hair =
P (L | W) * P (W) = 75% of 50% of the population = 0.75*0.5 = 0.375
+ P (L | M) * P (M) = 30% of 50% of the population = 0.3 * 0.5 = 0.15
So:
P (W|L) = P(L|W)*P(W)
P(L)
=
P(L|W)*P(W)
= 0.714
P (L|W)*P(W) + P (L|M)*P(M)
6
Bayes’ Theorem
P (M | D) = P (D | M) * P (M)
P (D)
Read in English: The Probability that your Model is correct Given ( | ) the Data is
equal to the Probability of your Data Given the Model times the
Probability of your Model divided by the Probability of the Data
P is called the ‘Posterior Probability’ that this Model is the right one to
describe your data.
Each gene will have a PP for each model, where S PP = 1.
The Bigger the PP, the more likely this is the right model.
Posterior Probability is NOT a p-value!
7
Bayes’ Theorem
Imagine you have three RNA-seq replicates of two samples (WT vs mutant).
There are two models for each gene
M0 = the model that your gene is NOT differentially expressed across the 2 samples
MDE = the model that a given gene IS differentially expressed
P (M0 | DgeneX) = P (DgeneX | M0) * P (M0)
P (DgeneX)
P (MDE | DgeneX) = P (DgeneX | MDE) * P (MDE)
P (DgeneX)
We don’t know some of these factors ( e.g. P(D|M) ), but
we can describe the data with some parameter set called q
(which includes mean and stdev of count data across replicates of each gene)
8
How to model the mean and standard deviation of your replicates:
Poisson Distribution: Accounts for random fluctuations
wikepedia
E.g. Infecting a cell culture with viral particles.
l = Multiplicity of Infection (MOI) = # of viral particles/cell in your culture
9
Overdispersion: variance of replicates is higher for high-abundance reads
10
How to model the mean and standard deviation of your replicates:
Negative Binomial Distribution: Mean and variance are different
Notice the wider spread on the right side of each distribution, for higher numbers
11
baySeq
Empirical Bayesian analysis of patterns of
differential expression in count data
http://www.bioconductor.org/packages/2.10/bioc/html/baySeq.html
Imagine you have triplicate measurements of sample A and sample B. We want to
identify genes differentially expressed (DE) in A versus B.
Arep1 Arep2 Brep1 Brep2
tuple: simply S counts per transcription unit
We define TWO possible models:
NO DE (NDE): Arep1 = Arep2 = Brep1 = Brep2
we say that the data for all samples was drawn from the same distribution
(* i.e. same mean and standard deviation, if normally distributed)
DE: Arep1 = Arep2 NOT EQUAL Brep1 = Brep2
A and B replicates are drawn from two different distributions
12
baySeq
Empirical Bayesian analysis of patterns of
differential expression in count data
http://www.bioconductor.org/packages/2.10/bioc/html/baySeq.html
Next, we use Bayes’ Rule to try to estimate the probability that the DE model
is true for geneX
prior probability
P (MDE | DgeneX) = P(DgeneX|MDE) * P(MDE)
P(DgeneX)
baySeq tries to use data sharing across the dataset to estimate the components
on the right side of the equation
P (DgeneX | MDE ) = Int[ P(DgeneX| K, MDE) * P(K | MDE) ] dK
where K is the parameter set of q (mean, dispersion) for each gene in each replicate
** q for each gene is estimated by looking at all data in the replicates
13
baySeq
Empirical Bayesian analysis of patterns of
differential expression in count data
http://www.bioconductor.org/packages/2.10/bioc/html/baySeq.html
prior probability
P (MDE | DgeneX) = P(DgeneX|MDE) * P(MDE)
P(DgeneX)
The prior probability is the probability of M before any data are considered.
Here:
1. Guess at a starting prior probability for MDE
2. Iteratively test different alternatives
3. Repeat until ‘convergence’ (P(MDE) does not change with more iterations)
* For data with strong DE signal, the posterior probability is not very dependent
on the starting prior probability P(MDE)
14
15
Volcano Plot
16