Estimating the Posterior Probability of Differential

Estimating the Posterior Probability
of Differential Gene Expression from Microarray Data
Marina Sapir and Gary A. Churchill
The Jackson Laboratory, Bar Harbor, ME
Abstract: We present a robust algorithm for estimating the posterior probability of differential expression of
genes from microarray data. Our approach is based on an orthogonal linear regression of the signals obtained
from the two color channels. Residuals from the regression are modeled as a mixture of a common component
and a differentially expressed component and an EM-algorithm is used to deconvolve the mixture. The algorithm
provides estimates of the measurement error variance, the proportion of differentially expressed genes, and
the probability that each individual gene belongs to the differentially expressed class. We have applied this
procedure to both real and simulated microarray data. Our simulation results demonstrate that the algorithm can
estimate the key parameters with high precision over a wide range of models. Application of the method to
replicated experiments demonstrates that the classification of differentially expressed genes is highly reproducible.
Part I: Modeling Fluorescent Intensities
We used data from self comparisons to identify a
scale transformation of the raw fluorescent intensities
for which the relationship between the color channels
is linear with additive errors that are independent
of the absolute signal intensity.
This empirically derived relationship between the
fluorescent signals in self comparisons suggests a
plausible model for the relationship between signal
intensity yig and mRNA concentration xig for a
gene g in sample i for non-self comparisons
We found that a logarithm transform of intensities offset
yig = aixigbi + ci.
by a color dependent constant provided linearity and
homogeneous variance for a wide range of data sets. We The follow panels show
fit the relationship to fluorescent intensities y1 and y2 • Details of the method for fitting a linear function
obtained from the “red” and “green” color channels:
to a scatterplot of (transformed) intensities.
• The effects of different scaling functions (identity,
log(y1 — γ1) = α + β log(y2 — γ2)
square root, logarithm and reciprocal) on
the distribution of residuals.
where
• The effect of the offset parameter on the linearity
g1 and g are the offsets and
of the log-log scatter plot of signal intensities. We
a and b are the parameters of the regression line.
believe that the offset is directly related to the
background signals in the two color channels.
Selection of the Scaling Function: Placenta-Placenta Self Comparison on GEM Array
Log data
2.5
Square root data
Raw data
4
x 10
Logarithm data
Square root data
160
Reciprocal data
Logarithm data
10
Reciprocal data
−3
1
x 10
0.9
140
9.5
120
9
100
8.5
0.8
√y
80
1
0.7
0.6
1/y2
log(y2)
2
1.5
y2
Scaling plot
2
0.5
8
60
7.5
40
7
0.4
0.3
0.2
0.5
0.1
0
0
0.2
0.4
0.6
0.8
1
1.2
y1
1.4
1.6
1.8
20
20
2
40
60
80
100
120
6.5
6.5
140
7
7.5
8
√y1
4
x 10
log(y1)
8.5
9
9.5
0
0
10
0.2
0.4
6000
corr(|r|,f) = 0.53
corr(|r|,f) = –0.02
0.8
10
0.8
1
1.2
−3
x 10
corr(|r|,f) = 0.53
4
0.6
2000
3
0.4
2
−10
0.2
Residuals r
−2000
0
Residuals r
0
Residuals r
Residuals r
Residuals
corr(|r|,f) = 0.33
20
4000
0.6
1/y1
corr(|r|,f) = 0.53
−4
x 10
0
−0.2
1
0
−20
−4000
−0.4
−30
−6000
−1
−0.6
−2
−0.8
−40
−8000
−3
0.2
0.4
0.6
0.8
1
1.2
Fitted Value f
1.4
1.6
1.8
2
40
50
60
70
80
90
100
110
Fitted Value f
4
x 10
120
130
140
7.5
8
8.5
9
Fitted Value f
9.5
1
2
3
4
0.999
0.997
0.997
0.997
0.997
0.99
0.98
0.99
0.98
0.99
0.98
0.99
0.98
0.95
0.95
0.95
0.95
0.90
0.90
0.90
0.90
0.75
0.75
0.75
0.75
0.50
0.25
0.50
0.25
Probability
0.999
Probability
0.999
Probability
Probability
Quantiles
0.999
0.50
0.25
0.10
0.10
0.05
0.05
0.05
0.05
0.02
0.01
0.02
0.01
0.02
0.01
0.02
0.01
0.003
0.003
0.003
0.003
−6000
−4000
−2000
Residuals
0
2000
4000
0.001
6000
−40
−30
−20
−10
Residu ls
0
10
20
0.001
30
−1
−0.8
−0.6
−0.4
−0.2
0
Residuals
0.2
0.4
0.6
0.001
0.8
7
8
−4
x 10
0.25
0.10
−8000
6
tile Plot
0.50
0.10
0.001
5
Fitted Value f
Qu
−3
−2
−1
0
1
Residuals
2
3
4
Conclusion: The residuals from a logarithm scaling transformation are approximately normal and have minimal
correlation with fitted values.
The Shifted Logarithm Scaling Function
10.5
10.5
10
10
9.5
9.5
log(R)
log(R0)
9
9
8.5
8.5
8
8
7.5
7
7.5
6.5
7
7.5
8
8.5
9
log(G0)
9.5
10
10.5
6
6.5
7
7.5
8
8.5
9
9.5
10
10.5
6
6.5
7
7.5
8
8.5
9
9.5
10
10.5
log(G)
10.5
10.5
10
9.5
10
log(R−260.9)
log(R0 + 1659)
9
9.5
9
8.5
8
7.5
7
6.5
8.5
6
5.5
8
7.5
8
8.5
9
log(G0)
9.5
10
10.5
log(G)
Conclusion: Shifting the raw measurement improves linearity on the logarithm scale.
−4
x 10
Fitting a Regression Line to the Data
Orthogonal Residuals
Least Squares Fit
Robust Fit
11
10.5
10
10
10
9.5
9.5
9.5
9
8.5
log(y2 − 543.28)
10.5
log(y2 − 543.28)
10.5
9
8.5
9
8.5
8
8
8
7.5
7.5
7.5
7
7
7
6.5
6.5
7
7.5
8
8.5
9
9.5
10
10.5
11
7
7.5
8
8.5
log(y )
9
9.5
10
7
10.5
7.5
8
8.5
log(y1)
1
9
9.5
10
10.5
Orthogonal residuals provide a symmetric treatment of the two fluorescent intensities y1 and y2.
Robust regression reduces the influence of differentially expressed genes of the Fitted line.
Part II: Modeling the Residuals
Deconvolution of the Residuals
In a non-self comparison, the orthogonal residuals,
r are modeled as a two component mixture. The
first component, C, represents the class of common
genes that expressed equally in the two mRNA
populations. The second component, D, represents
differentially expressed genes.
An EM algorithm is used to obtain estimates of
the proportion of differentially expressed genes, π,
and the residual error variance, s2. The E-step of
the EM algorithm applies Bayes’ rule to obtain the
conditional probabilities
Pr(D|r) =
π Pr(r|D)
π Pr(r|D) + (1— π) Pr(r|C)
Pr(r) = (1— π) Pr(r|C) + π Pr(r|D)
where
1 — π = proportion of common genes
π = proportion of differentially expressed genes
Thus we can obtain estimates of the posterior probability
of differential expression based on information in
the regression residuals.
Residuals from common genes are modeled as a
Normal distribution with mean 0 and variance s2.
Residuals from differentially expressed genes are
modeled as a Uniform distribution.
Examples of Differentially Expressed Genes
Corning C1302
PLGEM1
Stanford data, G6:R6
PLGEM2
11
10.5
10.5
10.5
10
10.5
9.5
10
10
10
7.5
7
9
8.5
log(R+529.8)
8
log(R+116.9)
9.5
8.5
log(R−391.4)
log(R−260.9)
9
9.5
9
9.5
9
8.5
6.5
8
8.5
7.5
8
6
8
5.5
6
6.5
7
7.5
8
8.5
log(G)
9
9.5
10
10.5
7
7.5
8
8.5
log(G)
9
9.5
10
10.5
7
7.5
8
8.5
9
log(G)
9.5
10
10.5
8
8.5
9
log(G)
9.5
10
10.5
Consistency of Results
Replication across arrays:
Placenta vs Liver From GEM2
For spots replicated with an array: Corning data
Array 1
Array 2
P1<-0.5
-0.5<P1<0.5
Posterior probability
P1>0.5
P2 < -0.5
3
8
0
-0.5 < P2 < 0.5
6
720
5
P2 > 0.5
0
17
20
1 / Igfbp1
2 / Gh
46 / Mt2
Array 1
Array 2
P1<-0.95 -0.95<P1<0.95
P1>0.95
P2 < -0.95
2
3
0
-0.95 < P2 < 0.95
0
748
2
P2 > 0.95
0
11
13
58 / Ax3
94 / igf2r
97 / ppara
98 / Igf2
Chip 1300
Chip 1302
1.00
1.00
1.00
1.00
0.22
1.00
0.44
0.09
1.00
1.00
1.00
1.00
0.29
1.00
1.00
1.00
1.00
1.00
1.00
1.00
-0.02
0.01
-0.01
0.01
1.00
1.00
1.00
1.00
-0.98
-0.03
-0.02
-0.02
-1.00
-0.99
-1.00
-1.00
-1.00
-1.00
-1.00
-1.00
-1.00
-1.00
-1.00
-1.00
-0.35
-0.92
-1.00
-1.00
-1.00
-1.00
-1.00
-1.00
1.00
1.00
1.00
1.00
Conclusions
•
Raw ratios are not reliable measures of differential expression
•
The two color channels in self comparisons are related by a linear model on the logarithm
scale with an additive offset
•
Robust orthogonal regression can be used to fit a linear model to data from non-self comparisons
•
Residuals from the orthogonal regression can be separated into common and differentially expressed
components using an EM algorithm
•
Classification of differentially expressed genes is consistent on both within array and across array replications