Estimating the Posterior Probability of Differential Gene Expression from Microarray Data Marina Sapir and Gary A. Churchill The Jackson Laboratory, Bar Harbor, ME Abstract: We present a robust algorithm for estimating the posterior probability of differential expression of genes from microarray data. Our approach is based on an orthogonal linear regression of the signals obtained from the two color channels. Residuals from the regression are modeled as a mixture of a common component and a differentially expressed component and an EM-algorithm is used to deconvolve the mixture. The algorithm provides estimates of the measurement error variance, the proportion of differentially expressed genes, and the probability that each individual gene belongs to the differentially expressed class. We have applied this procedure to both real and simulated microarray data. Our simulation results demonstrate that the algorithm can estimate the key parameters with high precision over a wide range of models. Application of the method to replicated experiments demonstrates that the classification of differentially expressed genes is highly reproducible. Part I: Modeling Fluorescent Intensities We used data from self comparisons to identify a scale transformation of the raw fluorescent intensities for which the relationship between the color channels is linear with additive errors that are independent of the absolute signal intensity. This empirically derived relationship between the fluorescent signals in self comparisons suggests a plausible model for the relationship between signal intensity yig and mRNA concentration xig for a gene g in sample i for non-self comparisons We found that a logarithm transform of intensities offset yig = aixigbi + ci. by a color dependent constant provided linearity and homogeneous variance for a wide range of data sets. We The follow panels show fit the relationship to fluorescent intensities y1 and y2 • Details of the method for fitting a linear function obtained from the “red” and “green” color channels: to a scatterplot of (transformed) intensities. • The effects of different scaling functions (identity, log(y1 — γ1) = α + β log(y2 — γ2) square root, logarithm and reciprocal) on the distribution of residuals. where • The effect of the offset parameter on the linearity g1 and g are the offsets and of the log-log scatter plot of signal intensities. We a and b are the parameters of the regression line. believe that the offset is directly related to the background signals in the two color channels. Selection of the Scaling Function: Placenta-Placenta Self Comparison on GEM Array Log data 2.5 Square root data Raw data 4 x 10 Logarithm data Square root data 160 Reciprocal data Logarithm data 10 Reciprocal data −3 1 x 10 0.9 140 9.5 120 9 100 8.5 0.8 √y 80 1 0.7 0.6 1/y2 log(y2) 2 1.5 y2 Scaling plot 2 0.5 8 60 7.5 40 7 0.4 0.3 0.2 0.5 0.1 0 0 0.2 0.4 0.6 0.8 1 1.2 y1 1.4 1.6 1.8 20 20 2 40 60 80 100 120 6.5 6.5 140 7 7.5 8 √y1 4 x 10 log(y1) 8.5 9 9.5 0 0 10 0.2 0.4 6000 corr(|r|,f) = 0.53 corr(|r|,f) = –0.02 0.8 10 0.8 1 1.2 −3 x 10 corr(|r|,f) = 0.53 4 0.6 2000 3 0.4 2 −10 0.2 Residuals r −2000 0 Residuals r 0 Residuals r Residuals r Residuals corr(|r|,f) = 0.33 20 4000 0.6 1/y1 corr(|r|,f) = 0.53 −4 x 10 0 −0.2 1 0 −20 −4000 −0.4 −30 −6000 −1 −0.6 −2 −0.8 −40 −8000 −3 0.2 0.4 0.6 0.8 1 1.2 Fitted Value f 1.4 1.6 1.8 2 40 50 60 70 80 90 100 110 Fitted Value f 4 x 10 120 130 140 7.5 8 8.5 9 Fitted Value f 9.5 1 2 3 4 0.999 0.997 0.997 0.997 0.997 0.99 0.98 0.99 0.98 0.99 0.98 0.99 0.98 0.95 0.95 0.95 0.95 0.90 0.90 0.90 0.90 0.75 0.75 0.75 0.75 0.50 0.25 0.50 0.25 Probability 0.999 Probability 0.999 Probability Probability Quantiles 0.999 0.50 0.25 0.10 0.10 0.05 0.05 0.05 0.05 0.02 0.01 0.02 0.01 0.02 0.01 0.02 0.01 0.003 0.003 0.003 0.003 −6000 −4000 −2000 Residuals 0 2000 4000 0.001 6000 −40 −30 −20 −10 Residu ls 0 10 20 0.001 30 −1 −0.8 −0.6 −0.4 −0.2 0 Residuals 0.2 0.4 0.6 0.001 0.8 7 8 −4 x 10 0.25 0.10 −8000 6 tile Plot 0.50 0.10 0.001 5 Fitted Value f Qu −3 −2 −1 0 1 Residuals 2 3 4 Conclusion: The residuals from a logarithm scaling transformation are approximately normal and have minimal correlation with fitted values. The Shifted Logarithm Scaling Function 10.5 10.5 10 10 9.5 9.5 log(R) log(R0) 9 9 8.5 8.5 8 8 7.5 7 7.5 6.5 7 7.5 8 8.5 9 log(G0) 9.5 10 10.5 6 6.5 7 7.5 8 8.5 9 9.5 10 10.5 6 6.5 7 7.5 8 8.5 9 9.5 10 10.5 log(G) 10.5 10.5 10 9.5 10 log(R−260.9) log(R0 + 1659) 9 9.5 9 8.5 8 7.5 7 6.5 8.5 6 5.5 8 7.5 8 8.5 9 log(G0) 9.5 10 10.5 log(G) Conclusion: Shifting the raw measurement improves linearity on the logarithm scale. −4 x 10 Fitting a Regression Line to the Data Orthogonal Residuals Least Squares Fit Robust Fit 11 10.5 10 10 10 9.5 9.5 9.5 9 8.5 log(y2 − 543.28) 10.5 log(y2 − 543.28) 10.5 9 8.5 9 8.5 8 8 8 7.5 7.5 7.5 7 7 7 6.5 6.5 7 7.5 8 8.5 9 9.5 10 10.5 11 7 7.5 8 8.5 log(y ) 9 9.5 10 7 10.5 7.5 8 8.5 log(y1) 1 9 9.5 10 10.5 Orthogonal residuals provide a symmetric treatment of the two fluorescent intensities y1 and y2. Robust regression reduces the influence of differentially expressed genes of the Fitted line. Part II: Modeling the Residuals Deconvolution of the Residuals In a non-self comparison, the orthogonal residuals, r are modeled as a two component mixture. The first component, C, represents the class of common genes that expressed equally in the two mRNA populations. The second component, D, represents differentially expressed genes. An EM algorithm is used to obtain estimates of the proportion of differentially expressed genes, π, and the residual error variance, s2. The E-step of the EM algorithm applies Bayes’ rule to obtain the conditional probabilities Pr(D|r) = π Pr(r|D) π Pr(r|D) + (1— π) Pr(r|C) Pr(r) = (1— π) Pr(r|C) + π Pr(r|D) where 1 — π = proportion of common genes π = proportion of differentially expressed genes Thus we can obtain estimates of the posterior probability of differential expression based on information in the regression residuals. Residuals from common genes are modeled as a Normal distribution with mean 0 and variance s2. Residuals from differentially expressed genes are modeled as a Uniform distribution. Examples of Differentially Expressed Genes Corning C1302 PLGEM1 Stanford data, G6:R6 PLGEM2 11 10.5 10.5 10.5 10 10.5 9.5 10 10 10 7.5 7 9 8.5 log(R+529.8) 8 log(R+116.9) 9.5 8.5 log(R−391.4) log(R−260.9) 9 9.5 9 9.5 9 8.5 6.5 8 8.5 7.5 8 6 8 5.5 6 6.5 7 7.5 8 8.5 log(G) 9 9.5 10 10.5 7 7.5 8 8.5 log(G) 9 9.5 10 10.5 7 7.5 8 8.5 9 log(G) 9.5 10 10.5 8 8.5 9 log(G) 9.5 10 10.5 Consistency of Results Replication across arrays: Placenta vs Liver From GEM2 For spots replicated with an array: Corning data Array 1 Array 2 P1<-0.5 -0.5<P1<0.5 Posterior probability P1>0.5 P2 < -0.5 3 8 0 -0.5 < P2 < 0.5 6 720 5 P2 > 0.5 0 17 20 1 / Igfbp1 2 / Gh 46 / Mt2 Array 1 Array 2 P1<-0.95 -0.95<P1<0.95 P1>0.95 P2 < -0.95 2 3 0 -0.95 < P2 < 0.95 0 748 2 P2 > 0.95 0 11 13 58 / Ax3 94 / igf2r 97 / ppara 98 / Igf2 Chip 1300 Chip 1302 1.00 1.00 1.00 1.00 0.22 1.00 0.44 0.09 1.00 1.00 1.00 1.00 0.29 1.00 1.00 1.00 1.00 1.00 1.00 1.00 -0.02 0.01 -0.01 0.01 1.00 1.00 1.00 1.00 -0.98 -0.03 -0.02 -0.02 -1.00 -0.99 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -0.35 -0.92 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 1.00 1.00 1.00 1.00 Conclusions • Raw ratios are not reliable measures of differential expression • The two color channels in self comparisons are related by a linear model on the logarithm scale with an additive offset • Robust orthogonal regression can be used to fit a linear model to data from non-self comparisons • Residuals from the orthogonal regression can be separated into common and differentially expressed components using an EM algorithm • Classification of differentially expressed genes is consistent on both within array and across array replications
© Copyright 2026 Paperzz