Tail posterior probability for inference in pairwise and multiclass
gene expression data.
Natalia Bochkina, Sylvia Richardson
January 12, 2006
1 Introduction
Gene expression microarrays are used to study the expression of genes simultaneously across several samples.
This technique has a wide range of applications including investigating how genes interact, how gene expression changes under different conditions or which gene expression changes are associated with specific diseases.
A description of microarray experiments and the associated statistical problems can be found in Causton et al (2003), Parmigiani et al (2004), Speed (2003) and McLachlan et al (2004).
Here we focus on the problem of differential expression in pairwise and multiclass experiments which is
a key problem in many microarray studies. We consider it in a Bayesian framework which allows flexibility
necessary for modelling microarray data. This problem can be formulated as a hypothesis testing problem
which in Bayesian settings is often addressed by using a mixture prior for modelling the null and alternative
distributions of the difference δg between the expression of gene g in compared conditions on the logarithmic
scale, see for example Lonnstedt & Speed (2002), Smyth (2004) and Lonnstedt & Britton (2005). Various
decision functions for classification of genes into null and alternative hypotheses have been proposed by the
authors, in the form of posterior probabilities, p-values, log-likelihood ratios and posterior odds ratios. Lewin et al (2005) avoided the problem of modelling the alternative distribution by using a non-informative prior for the difference δg and considering an interval null hypothesis |δg| ≤ log2 2, which is based on the interpretation of δg as the logarithm of the fold change and on a priori information that the fold change of interest is above a set value, for example 2.
However, in some cases small but consistent fold changes are of interest, and it is delicate to choose a priori a set value above which expression changes are of interest. In this paper we propose to use a data-related threshold which takes into account the variability of each gene, leading to a new type of selection rule which we refer to as the tail posterior probability. It is based on the parameter representing the standardised differential expression, or Bayesian T-statistic, and does not require any modelling of the alternative distribution. We argue that this selection rule is particularly appropriate for microarray data.
When analysing multiclass experiments, in many cases the primary biological questions are simple pairwise comparisons which ignore the multiclass structure and correspond to alternative hypotheses involving single parameters. Following on from the results of this first set of simple comparisons, more complex biological queries then become of interest, corresponding to structured alternative hypotheses.
For example, a typical question would be to find genes differentially expressed in two or more comparisons.
The standard approach to testing such hypotheses is to state a null hypothesis where the set of parameters of
interest are zero, and to consider genes for which this hypothesis is rejected. But this way of proceeding often
does not directly answer the relevant biological question. For instance, suppose we are interested in finding
genes which are differentially expressed with respect to comparison 1 and to comparison 2, with difference
parameters δg1 and δg2 respectively. In this case, the question of interest is “δg1 ≠ 0 and δg2 ≠ 0”. But rejecting the null hypothesis “δg1 = δg2 = 0” corresponds to a different alternative from the one required. It includes the genes of interest, but also genes differentially expressed in only one of the single comparisons, which then need to be separated using other means (e.g. Dong et al, 2005).
Here we propose a flexible method which allows us to find genes of interest according to any specified criterion, possibly compound, by estimating the (joint) tail posterior probability for each gene to satisfy the
stated criterion. This extends the framework for simple pairwise comparisons outlined in Lewin et al (2005).
We place both the modelling and the associated computations in a fully Bayesian framework. This has the
advantage of providing us directly with the posterior probabilities required for any compound comparison
as well as giving greater flexibility in the ANOVA formulation, avoiding the need for technical constraints
(e.g. the assumption of equal variance across all conditions) that have been used in the empirical Bayesian
framework (e.g. as in Kerr et al, 2000).
The paper is organised as follows. In Section 2, we describe the microarray data set used for illustration throughout the paper. In Section 3, we introduce the tail posterior probability based on a Bayesian T-statistic and compare it to other methods for identifying differentially expressed genes in the considered
Bayesian framework. In Section 4 we propose an estimator for the false discovery rate based on the tail
posterior probabilities. A generic Bayesian hierarchical model for multiclass experiments is specified in
Section 5. In Section 6, we apply the proposed methodology to the microarray data. First, we consider the
pairwise comparisons of interest to identify compound criteria of interest, and secondly we consider testing
a compound hypothesis against a specific alternative. We also illustrate, using a simulated data set, the
advantages of the joint ANOVA model formulation for estimating both parameters and joint tail posterior
probabilities of the compound statement that are used to select genes of interest. In particular, we compare
this joint approach to that consisting of combining lists of genes obtained from pairwise comparisons on
subsets of the data. We finish with a brief discussion.
2 Microarray Experiment
The data considered in this paper derives from sets of experiments on the H2Kb muscle cell line of mouse
(Tatoud et al, 2005). The cells were treated with insulin and metformin, an insulin-replacement drug, and
observed at two time points after the treatment, 2 and 12 hours. For comparison, non-treated cells at the
time of treatment (baseline, or 0 hours) and at 2 and 12 hours were grown. This experiment was repeated
3 times giving three biological replicates.
RNA from the 21 samples, 9 non-treated and 12 treated, was hybridized on Affymetrix MOE430A chips,
making a total of 21 microarrays, with 22690 genes per chip. This set of data can be thought of as a
multifactorial experiment, with two factors, time and treatment, with 7 distinct factor levels (referred to as
conditions) and 3 replicates in each condition.
Because the reaction of cells to insulin and to the insulin-replacing drug metformin is important in the development of diabetes, the aim of the experiment was to find genes whose expression changes after treatment with insulin or metformin, which should help to elucidate the mechanism of the development of diabetes at the transcriptomic level.
The data we use below was background-corrected and the gene expression index was calculated using RMA software (Irizarry et al, 2003) implemented in R (http://www.r-project.org), which results in data being transformed on the log2 scale. Array normalization was performed using intensity-dependent LOESS
normalization (Cleveland et al, 1992). For the models described in this paper we assume that all necessary
preprocessing has been carried out, and consider the data to be representative of the log2 of the mRNA
concentration in the original sample.
To illustrate the performance of the methodology, we will also use simulated data which has the same design, i.e. 7 conditions with 3 replicates in each condition, and 2000 variables; it is described in Section 5.
3 Bayesian selection rules for pairwise comparisons
3.1 Testing the hypothesis of no differential expression
In this section we consider the problem of finding genes differentially expressed between two conditions. We
assume that the errors have normal distributions with condition-specific variances:
ygsr ∼ N(αg ± δg/2, σ²gs),   (1)
where ygsr is the observed intensity on the log2 scale for gene g = 1, . . . , N , condition s = 1, 2 and replicate
r = 1, . . . , ms . We are interested in testing the null hypothesis relating to the parameter of differential
expression δg :
H0 : δg = 0   versus   H1 : δg ≠ 0.   (2)
Now we specify the hierarchical model for the parameters of (1). Since we do not have any prior information about the distribution of the non-zero differences δg , we avoid using the mixture prior for δg used by
Lonnstedt & Speed (2002), Smyth (2004) and other authors, and consider a non-informative prior:
αg , δg ∼ 1.
Although this prior distribution is improper, it can easily be shown that under model (1), with a non-zero number of observations in each group (ms ≥ 1), the resulting posterior distribution is proper. Further, we introduce a new type of decision rule to determine the differentially expressed genes, based on the tail posterior probability, which takes into account the variability of each gene and is suited to detecting small but consistent changes in expression.
To complete the prior specification, we assume that the variances σ²gs have exchangeable distributions for each s, so that
σ²gs ∼ f(σ²gs | θs),   (3)
θs ∼ f(θs),   (4)
where θs , s = 1, 2, are the hyperparameters of the distribution of variance (with possibility θ1 = θ2 = θ),
and their prior distributions f (θs ) are fully specified.
3.2 Posterior distributions
Let ȳgs = (1/ms) Σr ygsr and ȳg = ȳg2 − ȳg1, where ms is the number of replicates in condition s. Since the errors are Gaussian, the likelihood depends on the following data summaries: the sample mean ȳg, the sample variances s²gs = (1/(ms − 1)) Σr (ygsr − ȳgs)², s = 1, 2, and the sample sizes in each group.
Therefore, under the model described in the previous section, the following conditional distributions hold:
ȳg | δg, wg ∼ N(δg, w²g),   (5)
δg | ȳg, wg ∼ N(ȳg, w²g),   (6)
where w²g = σ²g1/m1 + σ²g2/m2 is the variance of ȳg. The posterior distribution of w²g depends on the data only through s²gs, and its density function can be written in terms of the prior distributions of σ²gs. In the case of unequal variances between the groups it is a convolution of the posterior densities of the variances:
f(w²g | s²gs) = m1 m2 (fσ²g1|s²g1(m1 ·) ∗ fσ²g2|s²g2(m2 ·))(w²g),

fσ²gs|s²gs(x) = [∫ f(s²gs | x) f(x | θs) f(θs) dθs] / [∫∫ f(s²gs | σ²gs) f(σ²gs | θs) f(θs) dθs d(σ²gs)],

where the convolution of densities is defined by (f ∗ g)(x) = ∫ f(x − u) g(u) du, and under model (1) f(s²gs | σ²gs) is the density of the χ² distribution with ms − 1 degrees of freedom and scale σ²gs. If the variances are equal or proportional to each other up to a fixed constant, then the posterior distribution of w²g is the appropriately rescaled posterior distribution of the variance, fσ²g|s²gs(x).
Integrating the full conditional distribution (6) over wg2 , we obtain the marginal posterior distribution of
δg :
f(δg | ȳg, s²gs) = ∫₀∞ (1/wg) ϕ((δg − ȳg)/wg) f(w²g | s²gs) d(w²g),   (7)

where ϕ(x) is the density function of the standard normal distribution. Consider now the standardised difference tg = δg/wg. We have

tg | ȳg, wg ∼ N(ȳg/wg, 1),   (8)

and the corresponding marginal posterior distribution is given by an expression similar to (7), but where the term wg plays a different role:

f(tg | ȳg, s²gs) = ∫₀∞ ϕ(tg − ȳg/wg) f(w²g | s²gs) d(w²g).   (9)
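In practice, the integral in (7) can be approximated by a Monte Carlo average over posterior draws of w²g. A minimal sketch in Python (variable names are hypothetical; for illustration the draws of w²g are simulated from an inverse gamma rather than taken from a fitted model):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical posterior draws of w_g^2; in a real analysis these
# would come from the MCMC output of the hierarchical model.
w2_draws = 1.0 / rng.gamma(shape=3.0, scale=2.0, size=10_000)

def posterior_density_delta(delta, ybar, w2_draws):
    """Monte Carlo estimate of f(delta | ybar, s^2_gs) as in (7):
    average of (1/w) * phi((delta - ybar) / w) over draws of w^2."""
    w = np.sqrt(w2_draws)
    return float(np.mean(stats.norm.pdf((delta - ybar) / w) / w))

# The density peaks at the observed difference ybar and decays away from it.
f_near = posterior_density_delta(0.8, ybar=0.8, w2_draws=w2_draws)
f_far = posterior_density_delta(5.0, ybar=0.8, w2_draws=w2_draws)
```

The same draws of w²g can be reused for every value of δg, so the whole posterior density curve costs a single pass over the MCMC output.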
3.3 Tail posterior probabilities
To select differentially expressed genes, we propose to use selection rules of the form P{Tg > Tg^(α)} ≥ pcut, where Tg is a statistic of interest and Tg^(α) denotes an α percentile of Tg obtained under the null hypothesis.
We start with the case Tg = δg. To define the percentile δg^(α), we use the following ‘counterfactual’ heuristic. Suppose that we could have observed data under the null hypothesis and, in ideal circumstances, that we would have observed a zero value for the sample mean difference ȳg. In this case, the posterior distribution of δg, f(δg | ȳg = 0, s²g1, s²g2), could be used to obtain the percentile of order 1 − α/2 under H0, δg^(α). Once δg^(α) is known, we can define the tail posterior probability by
P{|δg| > δg^(α) | ygsr}.
To find the percentiles δg^(α) in the general case (i.e. for an arbitrary prior distribution of w²g), we need to calculate the distribution function of (7), which involves numerical integration over the posterior distribution of w²g for each gene and thus is very computationally intensive. If instead of δg we consider the standardised difference tg = δg/wg, we can obtain a straightforward expression for tg^(α). Indeed, using (8) and a similar counterfactual argument, we see that given ȳg = 0 (i.e. under H0), f(tg | ȳg = 0, wg) ∼ N(0, 1), i.e. that this distribution, and the distribution f(tg | ȳg = 0, s²gs), does not involve gene-specific parameters. Therefore the corresponding percentile tg^(α) is easy to calculate:
tg^(α) = t^(α) = Φ⁻¹(1 − α/2).
We investigated the performance of selection rules using either P{|δg| > δg^(α) | ygsr} or P{|tg| > t^(α) | ygsr} as tail probabilities and found their performance similar (results not shown). Since the threshold for the T statistic is easier to calculate, throughout the paper we consider only the tail posterior probability based on the standardised difference tg:
p(tg, t^(α)) = P{|tg| > t^(α) | ygsr}.   (10)
This posterior probability can also be interpreted as a posterior expected loss, with the loss function l(δg, wg) = I{|δg| > wg Φ⁻¹(1 − α/2)}: if the absolute value of the difference exceeds the percentile of order 1 − α/2 of its error distribution, we incur a unit loss, otherwise the loss is zero. Note that this is similar to the loss function for an interval null hypothesis, but with an interval that depends on the variability of δg.
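Given MCMC output, the tail posterior probability (10) is simply the proportion of posterior draws of tg = δg/wg that exceed t^(α) in absolute value. A hedged sketch (variable names and the simulated draws are illustrative only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def tail_posterior_prob(delta_draws, w2_draws, alpha=0.05):
    """Estimate p(t_g, t^(alpha)) = P{|t_g| > t^(alpha) | data}, where
    t_g = delta_g / w_g and t^(alpha) = Phi^{-1}(1 - alpha/2)."""
    t_draws = delta_draws / np.sqrt(w2_draws)
    t_alpha = stats.norm.ppf(1 - alpha / 2)
    return float(np.mean(np.abs(t_draws) > t_alpha))

# Two illustrative genes: per draw, delta_g | ybar, w_g ~ N(ybar, w_g^2).
n = 20_000
w2 = 1.0 / rng.gamma(3.0, 2.0, size=n)                 # draws of w_g^2
null_draws = rng.normal(0.0, np.sqrt(w2))              # ybar = 0
de_draws = rng.normal(3.0 * np.sqrt(w2), np.sqrt(w2))  # shifted by 3 w_g

p_null = tail_posterior_prob(null_draws, w2)   # near alpha for a null gene
p_de = tail_posterior_prob(de_draws, w2)       # near 1 for a shifted gene
```

For a null gene the draws of tg are standard normal, so the tail posterior probability concentrates around α, in line with the counterfactual argument above.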
3.4 A “p-value”-type selection rule
In this section, we define a Bayesian rule for selecting differentially expressed genes that is equivalent to a frequentist approach. In the considered framework, it turns out to be a posterior probability that simply compares the parameter of interest δg to its value under the null hypothesis given the observed data, i.e. compares δg to 0. For example, if we are interested in the one-sided alternative H1 : δg > 0, we can estimate its posterior probability:
p(δg, 0) = P{δg > 0 | ygsr}.
If we are interested in the complementary alternative, we can consider the posterior probability P{δg < 0 | ygsr} = 1 − p(δg, 0). Since the posterior probability of the two-sided alternative P{δg ≠ 0 | ygsr} under the considered hierarchical Bayesian model is non-informative, to test the null hypothesis against the two-sided alternative we can take the maximum of the posterior probabilities of the one-sided alternatives: max{p(δg, 0), 1 − p(δg, 0)}. Then, for any of the three alternatives, the closer the corresponding posterior probability p(δg, 0), 1 − p(δg, 0) or max{p(δg, 0), 1 − p(δg, 0)} is to 1, the more likely the alternative hypothesis is to be accepted. The theorem below gives some theoretical results on the distribution of these posterior probabilities under the null hypothesis.
Theorem 3.1. Under the Bayesian model corresponding to (1), with the noninformative priors for αg and δg and some hierarchical prior distribution for σ²gs, the following statements hold.
1. The distributions of p(δg, 0), 1 − p(δg, 0) and 2 max{p(δg, 0), 1 − p(δg, 0)} − 1 are uniform on [0, 1] under the null hypothesis.
2. Testing the null hypothesis δg = 0 against the alternatives δg > 0, δg < 0 or δg ≠ 0 using the posterior probabilities above is equivalent to testing the same hypotheses using the corresponding p-values based on the marginal distribution of ȳg under H0.
To illustrate the theorem, we consider the case where the variances are equal for both groups, σ²g1 = σ²g2 = σ²g, and have the inverse gamma prior distribution. This conjugate setting was also considered by Smyth (2004), who introduced the moderated t statistic for the frequentist testing of hypothesis (2). The moderated t statistic is defined by tgmod = ȳg/(Sg/√k), where S²g = (2b + s²g(m1 + m2 − 2))/(2a + m1 + m2 − 2) is the Bayesian posterior estimate of the variance, k = (m1⁻¹ + m2⁻¹)⁻¹ and s²g is the pooled sample estimate of the variance. Since under the specified assumptions we have the following conditional distributions for ȳg and dg:
dg | ȳg, s²g, a, b ∼ t(2a + m1 + m2 − 2, ȳg, Sg/√k),
ȳg | dg, s²g, a, b ∼ t(2a + m1 + m2 − 2, dg, Sg/√k),
under the null hypothesis the moderated t statistic has Student’s t distribution with 2a + m1 + m2 − 2 degrees of freedom. Thus, the corresponding p-value can be written as
P{t > tgmod | H0} = 1 − FT(2a+m1+m2−2)(ȳg/(Sg/√k)),
where FT(2a+m1+m2−2)(x) is the cumulative distribution function of Student’s t distribution with 2a + m1 + m2 − 2 degrees of freedom. This coincides with the posterior probability that the difference dg is less than zero given the hyperparameters a and b:
P{dg < 0 | ȳg, s²g, a, b} = P{(dg − ȳg)/(Sg/√k) < −ȳg/(Sg/√k) | ȳg, s²g, a, b} = 1 − FT(2a+m1+m2−2)(ȳg/(Sg/√k)).
By testing the null hypothesis in the Bayesian way, we have the additional advantage of including the
variability of a and b in the form of their posterior distributions rather than using plug-in values.
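This identity between the frequentist tail area of the moderated t statistic and the posterior probability P{dg < 0 | ȳg, s²g, a, b} is easy to verify numerically for fixed hyperparameters; in the sketch below all numbers (a, b, m1, m2, ȳg, s²g) are made up for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical hyperparameters and data summaries.
a, b = 2.0, 0.5          # inverse gamma hyperparameters
m1, m2 = 3, 3            # replicates per condition
ybar, s2 = 0.6, 0.2      # observed difference and pooled variance

df = 2 * a + m1 + m2 - 2                 # degrees of freedom
S2 = (2 * b + s2 * (m1 + m2 - 2)) / df   # posterior variance estimate S_g^2
k = 1.0 / (1.0 / m1 + 1.0 / m2)
scale = np.sqrt(S2 / k)

# Frequentist p-value: tail area of the moderated t statistic under H0 ...
p_value = 1.0 - stats.t.cdf(ybar / scale, df)
# ... and the posterior probability P{d_g < 0 | ybar, s2, a, b}, using
# d_g | data ~ t with df degrees of freedom, location ybar, scale S_g/sqrt(k).
p_posterior = stats.t.cdf((0.0 - ybar) / scale, df)
```

By symmetry of the t distribution the two quantities agree exactly; in the fully Bayesian treatment one would additionally average over the posterior of a and b rather than plug in fixed values.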
Now we give the proof of Theorem 3.1, in which the case of a general prior distribution of the variance w²g is considered.
Proof. Using (7), we can show that
p(δg, 0) = ∫₀∞ Φ(ȳg/wg) f(w²g | s²gs) d(w²g).
Using (5), the distribution function of ȳg given s²gs and δg = 0 is
G0(ȳg) = F(ȳg | δg = 0, s²gs) = ∫₀∞ Φ(ȳg/wg) f(w²g | s²gs) d(w²g)
under the null hypothesis. Thus p(δg, 0) = G0(ȳg), which is uniformly distributed under H0. Thus we have shown that the distributions of p(δg, 0) and 1 − p(δg, 0) are uniform under H0.
To prove the uniformity of 2 max{p(δg, 0), 1 − p(δg, 0)} − 1 under H0, we use the uniformity of p(δg, 0):
P{2 max{p(δg, 0), 1 − p(δg, 0)} − 1 < x} = P{p(δg, 0) < (x + 1)/2 and p(δg, 0) > (1 − x)/2} = (x + 1)/2 − (1 − x)/2 = x,
and observe that 2 max{p(δg, 0), 1 − p(δg, 0)} − 1 takes values between 0 and 1.
Note that 1 − G0(ȳg) = 1 − p(δg, 0), 1 − G0(−ȳg) = 1 − (1 − p(δg, 0)) and 2(1 − G0(|ȳg|)) = 1 − [2 max{G0(−ȳg), G0(ȳg)} − 1] = 1 − [2 max{p(δg, 0), 1 − p(δg, 0)} − 1] are the right-sided, left-sided and two-sided p-values, respectively, for testing H0 using the marginal distribution of the test statistic ȳg under the null hypothesis. Each of them complements the posterior probability of the corresponding alternative to 1. Therefore, testing the null hypothesis based on the marginal p-values of ȳg is equivalent to testing it using the posterior probabilities.

[Figure 1: Volcano plots of posterior probability against difference, based on different posterior probabilities, for the change between insulin and control at 2 hours in H2Kb data (α = 0.95): (a) p(tg, tg^(α)); (b) p(δg, 0); (c) p(δg, log2 2). In (b), 2 max(p, 1 − p) − 1 is plotted for the corresponding posterior probability.]
Theorem 3.1 is not limited to the specified Bayesian model. It can be generalised to any model where the distribution of δg − ȳg, conditioned on the data and the remaining parameters, depends only on the difference δg − ȳg and not on ȳg (see Appendix A.2). A simple example of such a model is a non-Gaussian error model with a symmetric error distribution.
3.5 Comparison of selection rules
In this section, we compare the gene lists obtained by using the tail posterior probability based on the standardised difference (“Bayes T”) to those obtained with other posterior probabilities, namely the posterior probability that the difference exceeds in absolute value the a priori specified threshold log2 2, and the posterior probability that the difference is greater than its null value:
p(tg, t^(α)) = P{|tg| > t^(α) | ygsr},   (11)
p(δg, log2 2) = P{|δg| > log2 2 | ygsr},   (12)
p(δg, 0) = P{δg > 0 | ygsr}.   (13)
These posterior probabilities are calculated for the comparison between control and insulin-treated cells at 2 hours of the H2Kb data. Hierarchical model (1), with the inverse gamma prior distribution for equal variances σ²gs = σ²g and non-informative prior distributions for αg and δg, was fitted using WinBUGS (see Section 6 for more details).
Variants of the criterion (12), which uses an a priori specified threshold, have been described in Lewin et al (2005). We see that it divides genes into three groups (see Figure 1c): genes whose fold change is greater than the threshold with posterior probability close to 1, genes with very small posterior probability of having a fold change greater than the threshold, and a small group of genes (compared to the other approaches) with uncertainty about their fold change relative to the threshold. Therefore, it leads to a division of genes into three biologically interpretable groups: differentially expressed, non-differentially expressed, and genes with uncertainty about their differential expression. However, this approach is applicable only if the threshold of interest on the fold change is known in advance, which is not always the case in microarray experiments.
In contrast, for the two-sided criterion p(δg, 0) (Figure 1b), the interval of differences over which the posterior probability of differential expression is close to zero is very narrow, and genes with a difference around zero can have a high posterior probability. Hence this posterior probability is difficult to interpret on its own, and in practice a joint criterion with cutoffs on both the posterior probability and the fold change is applied (e.g. Woll et al, 2005).
The measure of differential expression p(tg, t^(α)) (Figure 1a) avoids the disadvantages of the above measures. It takes small values for differences around zero and values close to one for large absolute differences, thus specifying groups of genes with low as well as high probability of differential expression. In addition, it varies smoothly as a function of the difference between the low and high probabilities, allowing one to choose genes with the desired level of uncertainty about their differential expression.
To summarize the advantages and disadvantages of the considered measures of differential expression and their associated selection rules, we make the following comments. The measures p(tg, t^(α)) and p(δg, 0) are sensitive to small but consistent fold changes, adapting to the level of variability of each gene, and a cutoff for the associated selection rules can easily be selected using their parameter-free distribution under the null hypothesis. The measure p(tg, t^(α)) has more spread around the alternative than p(δg, 0), thus allowing better sensitivity when defining gene lists. The measure p(δg, log2 2) has the advantage of being biologically interpretable if we are only interested in genes having fold change above some a priori known threshold. Overall, we conclude that in the considered framework the posterior probability p(tg, t^(α)) is the most advantageous to use.
4 False discovery rate for selection rules based on tail posterior probability
We introduce an estimator of the false discovery rate of a gene list based on tail posterior probabilities. In this section we assume that the variances are the same for both groups, σ²gs = σ²g, and have an inverse gamma prior distribution Γ(a, b), as in the model we shall apply to the microarray data in Section 6.
4.1 Distribution of the tail posterior probability under H0
In Section 3.3, we approximated the posterior distribution of the Bayesian T statistic tg under H0 by taking
the sample difference to be zero, i.e. ȳg = 0. In this section, we will use the marginal distribution of ȳg to
study the distribution of the tail posterior probability p(tg , t(α) ) under the null hypothesis as a function of
ȳg . This result will be used in the following section to estimate the false discovery rate.
As we show in Appendix A.1, under the null hypothesis stated in (2), the distribution F0(x) of p(tg, t^(α)) is quite computationally intensive to calculate, so here we use its approximation F̃0(x), which is the distribution function of the expression
Φ(εg − t^(α)) + Φ(−εg − t^(α)),   (14)
where εg follows the distribution ΦBeta(x):
ΦBeta(x) = EΦ(x√V) = (2Γ(2A)/Γ²(A)) ∫₀∞ Φ(xv) v^(2A−1)/(1 + v²)^(2A) dv.
Here V is a random variable such that (V + 1)⁻¹ follows a beta distribution B(A, A), where A = a + (m1 + m2)/2 − 1. To evaluate the distribution ΦBeta(x), we can either integrate out the parameter A using its posterior distribution, or plug in a sensible estimate.
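The two ways of evaluating ΦBeta(x) can be cross-checked directly: the integral representation against a Monte Carlo average of Φ(x√V) with (V + 1)⁻¹ ∼ B(A, A). A sketch with an arbitrary illustrative value of A:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad
from scipy.special import gamma as gamma_fn

rng = np.random.default_rng(2)
A = 4.0  # illustrative value of A = a + (m1 + m2)/2 - 1

def phi_beta_integral(x, A):
    """Phi_Beta(x) via the integral representation."""
    const = 2.0 * gamma_fn(2 * A) / gamma_fn(A) ** 2
    integrand = lambda v: stats.norm.cdf(x * v) * v ** (2 * A - 1) / (1 + v * v) ** (2 * A)
    value, _ = quad(integrand, 0.0, np.inf)
    return const * value

def phi_beta_mc(x, A, n=200_000):
    """Phi_Beta(x) = E Phi(x sqrt(V)), with (V + 1)^{-1} ~ Beta(A, A)."""
    q = rng.beta(A, A, size=n)
    v = 1.0 / q - 1.0
    return float(np.mean(stats.norm.cdf(x * np.sqrt(v))))
```

At x = 0 both evaluations return 1/2, which also serves as a check on the normalising constant; the Monte Carlo route extends naturally to integrating A over its posterior distribution instead of plugging in a single value.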
The difference between the true distribution F0(x) and its approximation F̃0(x) consists of conditioning the probability P{|tg| > t^(α)} not only on the data, as in F0(x), but also on the posterior distribution of w²g (for details and a comparison of these distributions see Appendix A.1). We also show that both expressions have a gene-independent distribution under the null hypothesis and therefore can be used to estimate the FDR. Note that we use this approximation to reduce the computational complexity of estimating the FDR in the considered framework rather than for methodological reasons.
The distribution function F̃0(x) can be evaluated in practice as the empirical distribution function of expression (14) with εi = ξi √(Qi⁻¹ − 1), i = 1, . . . , 10⁶, where ξi ∼ N(0, 1), Qi ∼ B(A, A), and, for simplicity, the plug-in value A = E(a | ygsr) + (m1 + m2)/2 − 1. We will use this in the following section to estimate the false discovery rate.
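The empirical evaluation of F̃0 described above can be sketched as follows (A and α are illustrative plug-in values; a smaller simulation size than 10⁶ is used to keep the example light):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def null_tail_prob_draws(A, t_alpha, n=100_000):
    """Draws of expression (14), Phi(eps - t_alpha) + Phi(-eps - t_alpha),
    with eps_i = xi_i * sqrt(Q_i^{-1} - 1), xi_i ~ N(0, 1), Q_i ~ Beta(A, A)."""
    xi = rng.standard_normal(n)
    q = rng.beta(A, A, size=n)
    eps = xi * np.sqrt(1.0 / q - 1.0)
    return stats.norm.cdf(eps - t_alpha) + stats.norm.cdf(-eps - t_alpha)

def F0_tilde(x, draws):
    """Empirical distribution function of (14), i.e. an estimate of F0~(x)."""
    return float(np.mean(draws <= x))

A = 4.0                                   # illustrative plug-in value of A
alpha = 0.05
t_alpha = stats.norm.ppf(1 - alpha / 2)   # t^(alpha)
draws = null_tail_prob_draws(A, t_alpha)
```

The draws of (14) are bounded below by 2Φ(−t^(α)) = α (attained at εi = 0), so F̃0(x) = 0 for x < α, and F̃0 is non-decreasing in x as any distribution function.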
Comparison      δg11    δg21    δg12    δg22
Estimated π0    0.929   0.711   0.881   0.993
True π0         0.95    0.7     0.9     0.999

Table 1: Estimates of π0 for the simulated data.

[Figure 2: Estimates of FDR for the simulated data using Storey’s method (solid line) and true FDR (dashed line): (a) comparison δg11; (b) comparison δg12; (c) comparison δg21. Each panel plots FDR against the number of genes selected.]
4.2 False discovery rate
To estimate the false discovery rate (FDR) corresponding to the selection rule p(tg, t^(α)) > pcut based on the tail posterior probability p(tg, t^(α)), we can apply the approach proposed by Storey (2002):
FDR(pcut) = π0 P{p(tg, t^(α)) > pcut | H0} / P{p(tg, t^(α)) > pcut}.   (15)
As shown in the previous section, the value of P{p(tg, t^(α)) > pcut | H0} = 1 − F0(pcut) in the numerator is gene-independent under the null hypothesis given the hyperparameters a and b, and it can be well approximated by 1 − F̃0(pcut) (Section 4.1). Note that since 1 − F̃0(x) > 1 − F0(x) for x > 0.4 (see Appendix A.1), the estimate of FDR using the approximation F̃0(pcut) with pcut > 0.4 is conservative.
A natural way to estimate the probability P{p(tg, t^(α)) > pcut} in the denominator is to take the proportion of genes whose posterior probability is above pcut. To estimate π0, we adapt the approach of Storey (2002), using the following estimate:
π̂0(λ) = Card{g : g ∈ {1, . . . , n} & p(tg, t^(α)) < λ} / (n F̃0(λ)),
with λ ∈ [0.1, 0.2], where there is the least difference between the true distribution and its approximation and which does not correspond to differential expression (3). The advantage of this estimate is that it does not depend on the uniformity of the null distribution and therefore can easily be adapted to the considered framework. Then, we can use the median of the computed π̂0(λ) over the chosen range of λ as the estimate of π0. Therefore, our estimate of FDR is
F̂DR(pcut) = n π̂0 (1 − F̃0(pcut)) / Card{g : g ∈ {1, . . . , n} & p(tg, t^(α)) > pcut}.
For the simulated data described in Section 5, the estimates of π0 are compared with their true values in Table 1. There is close agreement between the estimated values and the true values used in the simulation.
The plots of the estimated FDR for the simulated data are given in Figure 2. We can see that the false discovery rate is reasonably well estimated for values up to 30%, where the estimate is conservative.
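Putting the pieces together, the estimators π̂0(λ) and F̂DR(pcut) can be sketched as below. Here F0_tilde stands for the null distribution function of the tail posterior probabilities; purely for illustration we use a toy mixture in which the null tail probabilities are uniform, so that F̃0(x) = x (in the actual method F̃0 is the empirical distribution of expression (14)):

```python
import numpy as np

rng = np.random.default_rng(4)

def estimate_pi0(tail_probs, F0_tilde, lambdas=None):
    """Median over lambda of #{g : p_g < lambda} / (n * F0_tilde(lambda)),
    with lambda ranging over [0.1, 0.2] as in the text."""
    if lambdas is None:
        lambdas = np.arange(0.10, 0.21, 0.01)
    estimates = [np.mean(tail_probs < lam) / F0_tilde(lam) for lam in lambdas]
    return float(np.median(estimates))

def estimate_fdr(p_cut, tail_probs, pi0, F0_tilde):
    """FDR-hat(p_cut) = n * pi0 * (1 - F0_tilde(p_cut)) / #{g : p_g > p_cut}."""
    n_selected = int(np.sum(tail_probs > p_cut))
    if n_selected == 0:
        return 0.0
    return len(tail_probs) * pi0 * (1.0 - F0_tilde(p_cut)) / n_selected

# Toy example: 90% null genes with uniform tail probabilities and
# 10% differentially expressed genes with tail probabilities near 1.
tail_probs = np.concatenate([rng.uniform(0.0, 1.0, 1800),
                             rng.uniform(0.95, 1.0, 200)])
F0_tilde = lambda x: x
pi0_hat = estimate_pi0(tail_probs, F0_tilde)       # close to 0.9
fdr_hat = estimate_fdr(0.9, tail_probs, pi0_hat, F0_tilde)
```

The π̂0 step only uses the lower tail of the distribution, which is why it is insensitive to the non-uniformity of the null distribution of the tail posterior probabilities.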
5 Model for the multiclass experiment
In this section we describe a model formulation suitable for multifactorial experiments. Without loss of
generality, we conduct our discussion by specifying a model suitable for the experiment described in Section
2, that is a multifactorial experiment with two factors, time and treatment, each having 3 levels. Extension
of our discussion to the case of more factors is straightforward.
Let us assume that the data consist of the intensities xgtcr representing the rth replicated expression
measure for gene g under conditions t and c.
Following the notation of Kerr et al (2000), a flexible ANOVA model for log-transformed intensity xgtcr
for this experiment can be written as follows:
log2 (xgtcr ) = µ + Ar + Tt + Vc + Gg + (AG)gr + (V G)gc + (T G)gt + (T V G)gtc + ²gtcr ,
where µ is the overall average signal, Ar represents the effect of the rth array, Vc represents the effect of the
cth variety of treatment, Tt represents the effect of time point t, Gg represents the effect of the gth gene,
(AG)gr represents a combination of array r and gene g, (V G)gc represents the interaction between the cth
treatment and the gth gene, (T G)gt represents the interaction between the tth time point and the gth gene,
and (T V G)gtc represents the interaction between the cth treatment, the tth time point and the gth gene.
We have 3 treatments, c = 0 for controls, c = 1 for insulin and c = 2 for metformin, and 3 time points, t = 0
for baseline, t = 1 for 2 hours and t = 2 for 12 hours time point. It has been shown that the background and
between-arrays variability are best adjusted for in a non-linear fashion (Choe et al, 2005), thus we assume
that they have been corrected for in the preprocessing step, and therefore the effects µ, Ar , Tt , and Vc vanish
from the model in the background correction step whereas gene specific array effects (AG)gr are taken care
of in the normalization step. Our model is thus reduced to:
log2 (xgtcr ) = Gg + (V G)gc + (T G)gt + (T V G)gtc + ²gtcr .
(16)
The error terms ²gtcr are assumed to be normal, independent with mean 0 and with gene specific structured
variances that will be discussed later.
In some situations, some of the terms contained in the general model (16) may not be of direct biological interest, and a more parsimonious formulation is preferable. In our particular experiment, we are interested in gene-treatment interactions at each time point. The term γgt = (T G)gt accounts for the average variability between time points. Similarly, the term (V G)gc, the average effect of treatment over time points, does not have any biological interpretability, as only the time-specific treatment effects are of interest. So we merge (V G)gc and (T V G)gtc into one key contrast of interest: δgtc = (V G)gc + (T V G)gtc. Reducing the number of uninteresting contrasts increases the number of degrees of freedom available to estimate the error variances in the experiment (see the discussion in Kerr et al, 2000), a feature which is particularly valuable for experiments with a small number of replicates and structured variance models. Non-zero values of the treatment-time-gene interaction parameters δgtc for a given gene thus indicate the differential expression contrasts of interest.
With this reparametrization, we now consider the following linear model for log-transformed preprocessed
intensity ygtcr :
ygtcr = αg + γgt + δgtc + εgtcr ,
(17)
where αg is the value of gene g at baseline, g = 1, . . . , n, and we set γg0 = δgt0 = δg0c = 0 for model
identifiability.
At the next level of the model, we specify the hierarchical structure of the variance parameters as well as
the prior distributions for the main effects and interactions parameters in (17). The variances for each gene
are modelled exchangeably, so that information about their variability is shared between genes in order to
stabilize them. That is, the variances are assumed to come from a common distribution, chosen here to be
the inverse gamma distribution, a usual choice in Bayesian hierarchical models (Gilks et al, 1996). We detail
below four possible biologically interpretable exchangeability or partial exchangeability assumptions on the
Parameters                    γg1    γg2    δg11    δg12    δg21    δg22
Number of non-zero values      0     20     100     200     600      2

Table 2: Number of non-zero values for each differential parameter for simulated data.
distributions of errors that correspond to different indexing of the parameters:
$$\varepsilon_{gtcr} \sim N(0, \sigma^2_{gct}), \qquad \sigma^{-2}_{gct} \sim \Gamma(a_{ct}, b_{ct}); \qquad (18)$$
$$\varepsilon_{gtcr} \sim N(0, \sigma^2_{gc}), \qquad \sigma^{-2}_{gc} \sim \Gamma(a_c, b_c); \qquad (19)$$
$$\varepsilon_{gtcr} \sim N(0, \sigma^2_{gt}), \qquad \sigma^{-2}_{gt} \sim \Gamma(a_t, b_t); \qquad (20)$$
$$\varepsilon_{gtcr} \sim N(0, \sigma^2_{g}), \qquad \sigma^{-2}_{g} \sim \Gamma(a, b). \qquad (21)$$
The variance model best fitting the data can be selected using the DIC model selection criterion (Spiegelhalter
et al, 2002) which is particularly well suited to comparing models in a Bayesian hierarchical framework where
the posterior mean is considered to be a good estimate of the stochastic parameters. In Section 6 we show
that the variance model (21) suits our experimental data set best. Note that, in contrast to a standard
ANOVA formulation where the Gaussian errors are assumed to have the same distribution across conditions
(Kerr et al, 2000), for each gene we allow the error variances to depend, if necessary, on the factor levels.
For the main effects and interaction terms in model equation (17), we assume no prior knowledge about
the effect size and assign them independent shift-invariant non-informative prior distributions:
$$\alpha_g, \gamma_{gt}, \delta_{gtc} \sim 1, \qquad (22)$$
t = 1, 2, c = 1, 2, g = 1, . . . , n. Finally, we use non-informative scale-invariant priors (Bernardo & Smith,
1994, Chapter 5.6.2) on the hyperparameters of the Gamma distribution, with density function f (x) = 1/x.
Simulated data
To illustrate our methodology, we will use a data set simulated from the model (2) and (6), with inverse variances σg−2 ∼ Γ(2, 0.05), g = 1, . . . , 2000, and non-zero values of the linear parameters simulated from
(−1)^{Bern(0.5)} N(0.9, 0.5²). The number of non-zero values for each differential parameter is summarised
in Table 2. Out of the 200 non-zero values of δg12 , 36 are chosen to be the same as those of δg11 . The purpose
is to create some commonality of effects between the two conditions observed in the considered microarray
data. Altogether the simulated data set has 3 replicates in each condition.
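This simulation setup can be sketched in a few lines of numpy; the gene indices assigned to each non-zero parameter, the baseline level, and the reading of "the same as those of δg11" as 36 shared genes are assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes = 2000

# sigma_g^{-2} ~ Gamma(2, 0.05); numpy's gamma takes shape and scale = 1/rate
sigma2 = 1.0 / rng.gamma(shape=2.0, scale=1.0 / 0.05, size=n_genes)

def draw_effects(k):
    # non-zero effects from (-1)^Bern(0.5) * N(0.9, 0.5^2)
    return (-1.0) ** rng.binomial(1, 0.5, k) * rng.normal(0.9, 0.5, k)

# Table 2: number of non-zero values per differential parameter
counts = {"gamma_g1": 0, "gamma_g2": 20, "delta_g11": 100,
          "delta_g12": 200, "delta_g21": 600, "delta_g22": 2}
effects = {}
for name, k in counts.items():
    v = np.zeros(n_genes)
    v[rng.choice(n_genes, size=k, replace=False)] = draw_effects(k)
    effects[name] = v

# 36 of the 200 non-zero delta_g12 genes coincide with non-zero delta_g11 genes
nz11 = np.flatnonzero(effects["delta_g11"])
shared = rng.choice(nz11, size=36, replace=False)
rest = rng.choice(np.setdiff1d(np.arange(n_genes), nz11), size=164, replace=False)
v12 = np.zeros(n_genes)
v12[np.concatenate([shared, rest])] = draw_effects(200)
effects["delta_g12"] = v12

# one condition's replicates: y_g11r = alpha_g + gamma_g1 + delta_g11 + eps, r = 1..3
alpha = rng.normal(7.0, 1.0, n_genes)  # arbitrary baseline for the sketch
y = (alpha + effects["gamma_g1"] + effects["delta_g11"])[:, None] \
    + rng.normal(0.0, np.sqrt(sigma2)[:, None], (n_genes, 3))
```
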
6 Application of tail posterior probability to microarray data
Model (17) was fitted to both real and simulated data using WinBUGS (www.mrc-bsu.cam.ac.uk/bugs/
welcome.shtml), with the variance model (21) chosen as the best parametrisation using the Deviance Information Criterion proposed by Spiegelhalter et al (2002). In practice, we approximate the non-informative
prior distributions by “just proper” distributions, which allows us to fit the model in WinBUGS: f (x) = 1
is replaced by N (0, 10⁴) for the main effects and interaction parameters, and f (x) = 1/x by Γ(0.001, 0.001)
for the positive hyperparameters of the variance.
6.1 Simple criteria for selecting differentially expressed genes
As we discussed in Section 5, the following pairwise comparisons are considered to have relevant biological
interpretation: insulin-treated vs non-treated cells at each time point (IC2 at 2 hours and IC12 at 12 hours),
metformin-treated vs non-treated cells at each time point (MC2 at 2 hours and MC12 at 12 hours). The
difference between these conditions is the parameter δgtc in model (17). For each comparison, we calculate
the tail posterior probability, defined by (10), that gene g is differentially expressed.
Comparison                     Insulin vs     Insulin vs      Metformin vs   Metformin vs
                               Control 2hrs   Control 12hrs   Control 2hrs   Control 12hrs
Estimated π0                   0.575          0.974           0.529          0.753
Number of genes                1151           13              1854           72
Posterior probability cutoff   0.94           0.99            0.92           0.99

Table 3: Number of differentially expressed genes and the posterior probability cutoff corresponding to a
gene list with estimated FDR = 0.5% in H2Kb data.

Figure 3: Histograms of posterior probability for the comparisons of interest in muscle cell line: (a) Insulin
vs Control 2hrs, (b) Metformin vs Control 2hrs, (c) Insulin vs Control 12hrs, (d) Metformin vs Control
12hrs. Lines correspond to the approximation of the density of the tail posterior probability under H0 (see
Section 4.1).

Histograms of the posterior probability and the estimate of FDR defined in Section 4.2 (smoothed using
splines) for each comparison are given in Figures 3 and 4, respectively. The first conclusion is that there is a
large number of differentially expressed genes between treated and non-treated cells at 2 hours, whereas very
few genes change between treated and non-treated cells at 12 hours. Note that for the comparison Insulin
vs Control at 12 hours, the observed histogram and the line corresponding to the approximate distribution
under the null hypothesis are very close (apart from the interval (0, 0.1)), implying that in practice the
approximation works quite well. On the FDR plots, we can see that for the fixed range of values chosen on
the y-axis, the corresponding number of differentially expressed genes on the x-axis differs widely for each
comparison. The number of differentially expressed genes corresponding to an FDR value of 0.5% as well as
the estimates of π0 (as described in Section 4.2) are given in Table 3.
For some studies, including the one considered here, we need to choose the same posterior probability
cutoff for all comparisons to help biological interpretation of the study. In this case, based on both selection
criteria and the lack of considerable change in gene expression in 12 hour comparisons, we choose the posterior
probability cutoff to be 0.92. This corresponds to 1372 genes and FDR = 0.7% in the comparison of Insulin
vs Control at 2 hours.
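The cutoff-to-FDR mapping behind Table 3 can be sketched as follows. The exact estimator of Section 4.2 is not reproduced in this excerpt, so the plug-in form below (expected false positives π̂0 · n · P{p ≥ c | H0} divided by the observed list size) is an assumption, as are all function and variable names:

```python
import numpy as np

def estimated_fdr(p, cutoff, pi0, null_tail):
    """Plug-in FDR estimate for the gene list {g : p_g >= cutoff}.

    p         -- tail posterior probabilities, one per gene
    pi0       -- estimated proportion of non-differentially-expressed genes
    null_tail -- cutoff -> P{p_g >= cutoff} under H0 (1 - F0 in the paper's notation)
    """
    n_selected = int(np.sum(p >= cutoff))
    if n_selected == 0:
        return 0.0
    expected_false = pi0 * p.size * null_tail(cutoff)
    return min(1.0, expected_false / n_selected)

# toy illustration: uniform null tail (a stand-in, not the paper's approximation of F0)
rng = np.random.default_rng(0)
p = np.concatenate([rng.uniform(0, 1, 900), rng.uniform(0.9, 1, 100)])
fdr_at_095 = estimated_fdr(p, 0.95, pi0=0.9, null_tail=lambda c: 1.0 - c)
```

Sweeping the cutoff over a grid yields FDR curves of the kind shown in Figure 4; the cutoff achieving an estimated FDR of 0.5% then gives a per-comparison threshold as in Table 3.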
Figure 4: Estimated false discovery rate for the comparisons of interest in H2Kb data: (a) Insulin vs Control
2hrs, (b) Metformin vs Control 2hrs, (c) Insulin vs Control 12hrs, (d) Metformin vs Control 12hrs. Dotted
lines correspond to FDR = 0.5%.
Figure 5: Histograms of (a) the correlation between δg11 and δg12 and (b) the correlation between the
corresponding T statistics (using their posterior distributions), H2Kb data.
6.2 Compound criterion
In the previous section we discovered that there are two comparisons showing a noticeable change in gene
expression: each type of treated cell versus non-treated cells at 2 hours. The next question to answer is:
which genes are differentially expressed in both comparisons?
We suggest choosing the differentially expressed genes corresponding to this compound criterion by
estimating the following joint posterior expected loss, or joint posterior probability:
$$p^J_g = E\big(I\{|t_{g,IC2}| > t^{(\alpha)}\}\, I\{|t_{g,MC2}| > t^{(\alpha)}\} \mid y_{gtcr}\big) = P\{|t_{g,IC2}| > t^{(\alpha)} \;\&\; |t_{g,MC2}| > t^{(\alpha)} \mid y_{gtcr}\}. \qquad (23)$$
In practice, a common solution to this problem is to select a gene list for each comparison separately,
and then find the genes common to the lists. Formally, this selection procedure with the same posterior
probability cutoff can be written as follows:
$$p^{min}_g = \min\big(P\{|t_{g,IC2}| > t^{(\alpha)} \mid y_{gtcr}\},\; P\{|t_{g,MC2}| > t^{(\alpha)} \mid y_{gtcr}\}\big).$$
Note that $p^{min}_g \ge p^J_g$.
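Both quantities are straightforward to estimate from MCMC output: average the indicator of the joint exceedance for the joint probability, and the two marginal indicators for the minimum. A minimal numpy sketch, in which correlated normal draws stand in for real posterior samples of the two T statistics and all numerical values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
t_alpha = 2.0  # threshold t^(alpha); value arbitrary for the sketch

# stand-in posterior draws of (t_{g,IC2}, t_{g,MC2}); positively correlated,
# as both statistics share the variance of the non-treated cells at 2 hours
cov = np.array([[1.0, 0.5],
                [0.5, 1.0]])
t = rng.multivariate_normal([2.5, 2.2], cov, size=5000)

exceed = np.abs(t) > t_alpha                           # one column per comparison
p_joint = np.mean(exceed[:, 0] & exceed[:, 1])         # estimates p^J_g, eq. (23)
p_min = min(exceed[:, 0].mean(), exceed[:, 1].mean())  # estimates p^min_g

# the joint event is contained in each marginal one, hence p_min >= p_joint
```
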
Since both T statistics share the variance parameter of the non-treated cells at 2 hours, the joint posterior
probability is appropriate because it automatically takes into account dependence between the T statistics
(see Figure 5 for the histogram of posterior correlation between δg11 and δg12 and the correlation between
corresponding T statistics). To illustrate this, we compare these two approaches using the number of false
positives and false negatives based on the joint posterior probability for all possible cutoffs, to those for gene
lists obtained using combined individual posterior probabilities, also for all possible cutoffs. Their plots for
the simulated data, described in Section 5, are given in Figure 6. They show that the list of differentially
expressed genes identified by the joint posterior probability for the simulated data has fewer false positives,
and about the same number of false negatives.
Applying the joint posterior probability to the microarray data with posterior probability cutoff 0.92,
280 genes are declared to be differentially expressed in both comparisons, and combining gene lists (using
p^min_g) adds an extra 47 genes to the list. Given the same posterior probability cutoff, the list based on
p^min_g includes the list based on p^J_g due to the relation p^min_g ≥ p^J_g.
6.3 Joint vs pairwise parameter estimation
When information is shared between parameters belonging to different comparisons (e.g. variance pooled
between some conditions), it is advantageous to use parameters estimated in the joint model rather than
in a separate pairwise comparison. We illustrate this on the simulated data by fitting the joint model
described in Section 5, and compare it to a model fitted to two conditions only, with condition-specific
variances, as described in Section 3.
Figure 6: False discovery (FDR) and non-discovery (FNR) rates for different thresholds, simulated data:
(a) FDR, (b) FNR. Joint probability in red, minimum of pairwise probabilities in black.

Figure 7: (a) Difference, (b) variance and (c) posterior probability estimates in joint (x-axis) and pairwise
(y-axis) models for simulated data (comparison δg11).

Plots of the log fold change, variance and the posterior probability of being differentially expressed between
insulin and control at 2 hours are shown in Figure 7. We see that the fold change estimate hardly differs at
all; however, the estimate of the variance varies greatly, due to the pooling of information from all the time
points in the joint estimation. As a result, the posterior probability of differential expression also differs.
The number of false positives is lower when the parameters are estimated in the joint model (Figure 9).
The number of false negatives is similar in both models.
7 Discussion
In this paper we introduced a posterior measure of differential expression based on a Bayesian T statistic,
the tail posterior probability, and showed that it possesses good properties compared to other posterior
measures of differential expression. Moreover we derived the distribution of the tail posterior probabilities
under the null hypothesis which allows us to propose an estimate of the false discovery rate associated with
this measure. Even though our results were discussed in the context of gene expression experiments, we stress
Figure 8: (a) Difference, (b) variance and (c) posterior probability estimates in joint (x-axis) and pairwise
(y-axis) models for H2Kb data (Insulin vs Control at 2hrs).
Figure 9: Number of false positives (FP) and false negatives (FN) in joint model (red line) and in pairwise
model (black line), simulated data: (a) FP, comparison δg11; (b) FP, comparison δg21; (c) FN, comparison
δg12; (d) FN, comparison δg22.
that the hierarchical model formulated in (17) is generic to a large number of situations (e.g. toxicology
experiments, drug testing) to which tail posterior probabilities could be easily extended.
Our motivation for studying tail posterior probabilities was that besides their obvious use in simple
differential expression setups, such flexible rules are particularly adapted to testing compound hypotheses
that arise in generalised ANOVA settings. The generalised ANOVA settings that we considered were related
to multiclass experimental data where the variance can be condition-specific. In this setting, we illustrated
how estimating parameters jointly in a full linear model is advantageous to estimating parameters separately
in each pairwise comparison. We showed that the tail posterior probability selection rule can be naturally
extended to multiclass situations using its interpretation as a posterior expected loss, and that it yields a
lower number of false positives compared to the traditional approach of combining individual gene lists.
In general, the rich output produced by a Bayesian analysis of multiclass experiments allows a number of
decision rules to be investigated. As a first step, we showed an equivalence between frequentist testing based
on the marginal distribution of the sample difference and selecting gene lists using posterior probabilities
estimated in the corresponding Bayesian model. Characterising general properties and comparing decision
rules based on tail posterior probabilities in more general setups than the hierarchical model considered in
this paper are interesting research topics for future work.
8 Acknowledgements
Natalia Bochkina’s work was funded by a Wellcome Trust Cardio-Vascular grant 066780/Z/01/Z. The
authors would like to thank Ulrika Andersson and David Carling who generated the data, their colleagues
Anne Mette Hein, Alex Lewin and Maria de Iorio for stimulating discussion on the statistical aspects and
the BAIR group (BAIR project, www.bair.org.uk) for sharing their biological insights.
A Appendix
A.1 Distribution of the tail posterior probability under H0
In this section, we study the distribution of the tail posterior probability p(tg , t(α) ) as a function of ȳg under
the null hypothesis, and assume that the variances σg2 are the same for both groups and have the inverse
gamma prior distribution (21). We also assume that the hyperparameters a and b of the inverse gamma
distribution are fixed. In this case we have a conjugate model, and therefore it is easy to show that the full
conditional distribution of σg2 is also the following inverse gamma distribution (see, for instance, Smyth, 2004):
$$\sigma_g^{-2} \mid y_{gsr}, a, b \sim \Gamma\big(a - 1 + (m_1 + m_2)/2,\; (2b + (m_1 - 1)s_{g1}^2 + (m_2 - 1)s_{g2}^2)/2\big), \qquad (24)$$
where $s_{gs}^2$ is the sample variance of gene g in group s.
Figure 10: Cumulative distribution functions (CDF) of the approximation F̃0 (x) (solid line) and of the true
distribution of p(tg , t(α) ) (dotted line), conditioned on hyperparameter a (with A = 5.36, as in H2Kb data):
(a) CDF, (b) CDF detail, (c) difference between the approximated and the true CDFs.
Following (5), the marginal distribution of ȳg under the null hypothesis is
$$f(\bar y_g \mid \delta_g = 0, a, b) = \int_0^\infty \varphi(\bar y_g / w)\, w^{-1} f(w^2 \mid s_g^2, a, b)\, dw^2 = f_{T(2A)}\Big(\bar y_g \sqrt{A/\tilde B_g}\Big)\sqrt{A/\tilde B_g}, \qquad (25)$$
where $f_{T(2A)}(x)$ is the density of Student's t distribution with 2A degrees of freedom, $A = a + (m_1 + m_2)/2 - 1$
and $B_g = b + (m_1 - 1)s_{g1}^2/2 + (m_2 - 1)s_{g2}^2/2$ are the parameters of the gamma distribution (24), and
$\tilde B_g = \frac{m_1 + m_2}{m_1 m_2} B_g$. Since $t_g$ has distribution (8), we need to find the distribution of the mean of the posterior
distribution of the T statistic $\bar y_g / w_g$, which has the following cumulative distribution function:
$$F(x) = P\{\bar y_g / w_g < x\} = \int_0^\infty \int_0^\infty \Phi(xz/w)\, f(w^2 \mid s_g^2, a, b)\, f(z^2 \mid s_g^2, a, b)\, dz^2\, dw^2 = E\,\Phi\Big(x \frac{Z}{W}\Big),$$
where the random variables $W^2$ and $Z^2$ are independent and have the same distribution $f_{w_g^{-2} \mid s_{gs}^2, a, b}(x)$,
which is $\Gamma(A, \tilde B_g)$. Their ratio $V = \frac{W^{-2}}{Z^{-2}} = \frac{Z^2}{W^2}$ has the following distribution:
$$F_V(v) = P\{V < v\} = P\{W^{-2} < v Z^{-2}\} = \tilde B_g^A\, [\Gamma(A)]^{-1} \int_0^\infty P\{W^{-2} < vz\}\, z^{A-1} e^{-\tilde B_g z}\, dz,$$
$$f_V(v) = \tilde B_g^{2A}\, [\Gamma(A)]^{-2} \int_0^\infty z\, (vz)^{A-1} e^{-\tilde B_g v z}\, z^{A-1} e^{-\tilde B_g z}\, dz = \frac{\Gamma(2A)\, v^{A-1}}{\Gamma(A)^2 (v+1)^{2A}},$$
i.e. $(V+1)^{-1}$ follows the beta distribution $B(A, A)$. Note that the distribution of V does not depend on the
gene-specific scale parameter $\tilde B_g$; it depends only on the first parameter of the prior inverse gamma
distribution of $\sigma_g^2$ and on the number of samples in each group.
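The beta identity above is easy to verify by Monte Carlo. In this sketch A = 5.36 matches the H2Kb value quoted in Figure 10, while the scale parameter is arbitrary since it cancels in the ratio:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B_tilde = 5.36, 1.7  # B_tilde is arbitrary: it cancels in V
n = 200_000

# W^{-2}, Z^{-2} i.i.d. Gamma(A, rate = B_tilde); numpy's scale = 1/rate
w_inv2 = rng.gamma(A, 1.0 / B_tilde, size=n)
z_inv2 = rng.gamma(A, 1.0 / B_tilde, size=n)

u = 1.0 / (w_inv2 / z_inv2 + 1.0)  # (V + 1)^{-1} with V = W^{-2}/Z^{-2}

# Beta(A, A) has mean 1/2 and variance 1/(4(2A + 1))
mean_err = abs(u.mean() - 0.5)
var_err = abs(u.var() - 1.0 / (4.0 * (2.0 * A + 1.0)))
```
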
Therefore, the distribution F̃0 (x) of the random variable
$$\Phi(-t^{(\alpha)} + \bar y_g / w_g) + \Phi(-t^{(\alpha)} - \bar y_g / w_g) \qquad (26)$$
is gene-independent under the null hypothesis and can easily be calculated approximately via simulation of
the gene-independent random variable $\bar y_g / w_g$. To find the distribution F0 (x) of p(tg , t(α) ) under the null
hypothesis, we need to integrate (26) over the distribution of wg2 conditioned on the data. Although computing
the distribution F0 (x) is more computationally intensive than the distribution function of (26), we can show
that it is also gene-independent:
$$P\{|t_g| > t^{(\alpha)} \mid y_{gsr}, a, b\} = \int_0^\infty \big[\Phi(-t^{(\alpha)} + \bar y_g / w_g) + \Phi(-t^{(\alpha)} - \bar y_g / w_g)\big]\, f(w_g^2 \mid s_{gs}^2, a, b)\, dw_g^2$$
$$= \int_0^\infty \Big[\Phi\Big(-t^{(\alpha)} + \bar y_g \sqrt{v_g^2 / \tilde B_g}\Big) + \Phi\Big(-t^{(\alpha)} - \bar y_g \sqrt{v_g^2 / \tilde B_g}\Big)\Big]\, f_\gamma(v_g^2, A, 1)\, dv_g^2,$$
where $f_\gamma(x, A, 1)$ is the density function of the gamma distribution with shape A and unit scale. Since the
distribution of $\bar y_g \sqrt{A / \tilde B_g}$ is Student's t with 2A degrees of freedom, the distribution of p(tg , t(α) ) under the
null hypothesis is gene-independent.
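The Monte Carlo construction of F̃0 follows directly from the derivation above: since ȳg /wg has CDF EΦ(xZ/W), draws of ȳg /wg can be generated as N · W/Z with N standard normal and W⁻², Z⁻² independent Γ(A, B̃g) variables (B̃g cancels in the ratio). A sketch, with illustrative values of A and the threshold:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
A, t_alpha = 5.36, 2.0  # A as in the H2Kb data; t_alpha arbitrary for the sketch
n = 20_000

w_inv2 = rng.gamma(A, 1.0, size=n)  # the rate cancels in the ratio, so use 1
z_inv2 = rng.gamma(A, 1.0, size=n)
r = rng.standard_normal(n) * np.sqrt(z_inv2 / w_inv2)  # draws of ybar_g / w_g

Phi = np.vectorize(lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0))))  # N(0,1) CDF
p_tilde = Phi(-t_alpha + r) + Phi(-t_alpha - r)  # draws of (26), i.e. from F0-tilde

# the empirical CDF of p_tilde approximates F0-tilde(x), for any gene
```
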
Plots of these two distributions are given in Figure 10. We can see in Figure 10a that the distribution
function F̃0 provides a reasonable approximation of the true distribution function F0 , as the solid and dotted
lines are very close. Figure 10c shows that the approximation differs from the true distribution mostly around
point 0.1, and also on the interval (0.4, 0.8). For values of the tail posterior probability greater than 0.8, the
difference is small and tends to zero as the values of the posterior probability approach 1. Also, as the detail
of the plot in Figure 10b shows, for x > 0.4 the approximating function is below the true one. This implies
that 1 − F0 (x) < 1 − F̃0 (x) for x > 0.4 which we use in Section 4.2 to show that the suggested estimate of
FDR is conservative.
A.2 Uniformity of p(δg , 0) under H0 in general settings
Here we give a more general version of Theorem 3.1. To simplify the notation, we omit the gene-related
index g.
Theorem A.1. Assume that ȳ is a sufficient statistic for δ and ȳ | δ, θ ∼ G(x − δ | θ), where G(x | θ) is
a distribution function with a vector of parameters θ, and we want to test whether δ is zero. Denote by ya
the vector of auxiliary statistics with respect to the statistic ȳ. We assume Bayesian settings with the
(improper) uniform prior distribution for δ, and that the prior distribution of θ does not depend on δ.
Then P {δ > 0 | ysr } = 1 − P {η > ȳ}, where the random variable η has the same distribution as ȳ
given δ = 0 and ya , and thus P {η > ȳ} with the observed value of ȳ is a p-value. Also, the distribution of
P {δ > 0 | ysr } is uniform under the null hypothesis.
Proof. By the definition of the sufficient statistic, the likelihood f (ysr | δ, θ) can be represented as a product
of two density functions, f (ȳ | δ) and f (ya | θ), where ya = S(ysr ), S : R^{m1+m2} → S ⊂ R^{m1+m2−1}, is
an auxiliary statistic with respect to ȳ. Therefore, the marginal distribution of ysr and the joint posterior
distribution of θ and δ also factorise:
$$f(y_{sr}) = \int_{x,z} f(y_{sr} \mid \theta = z, \delta = x)\, f_\theta(z) f_\delta(x)\, dx\, dz = \int_z f(y_a \mid \theta = z) \int_x f(\bar y \mid \theta = z, \delta = x)\, f_\theta(z) f_\delta(x)\, dx\, dz$$
$$= \int_z f(y_a \mid \theta = z) f_\theta(z)\, dz \int_x g(\bar y - x \mid \theta = z)\, dx = f(y_a) \cdot 1(\bar y),$$
$$f(\delta, \theta \mid y_{sr}) = \frac{f(y_{sr} \mid \delta, \theta) f(\delta) f(\theta)}{f(y_{sr})} = \frac{f(\bar y \mid \delta, \theta) f(\delta)}{f(\bar y)} \cdot \frac{f(y_a \mid \theta) f(\theta)}{f(y_a)} = g(\bar y - \delta \mid \theta)\, f(\theta \mid y_a),$$
where g(x | θ) is the density function of the distribution G(x | θ). This implies that the posterior distributions
of the parameters δ and θ are:
$$F(\delta \mid y_{sr}) = \int_\theta G(\delta - \bar y \mid \theta)\, f(\theta \mid y_a)\, d\theta, \qquad f(\theta \mid y_{sr}) = f(\theta \mid y_a).$$
Note that the assumptions of the uniform prior for δ and of the independence of the prior for θ of δ are crucial
here, since they allow the factorisation of the marginal and posterior distributions.
Thus we see that the posterior distribution of θ does not depend on ȳ, and the distribution function of
η is $F_\eta(x \mid y_a) = P\{\bar y < x \mid \delta = 0, y_a\} = \int G(x \mid \theta)\, f(\theta \mid y_a)\, d\theta$.
Therefore, the posterior probability of interest can be written as follows:
$$P\{\delta > 0 \mid y_{sr}\} = \int G(-\bar y \mid \theta)\, f(\theta \mid y_a)\, d\theta = 1 - \int G(\bar y \mid \theta)\, f(\theta \mid y_a)\, d\theta = 1 - F_\eta(\bar y).$$
Also, the distribution of P {δ > 0 | ysr } is uniform under the null hypothesis, since P {η > ȳ} is uniform
under the null hypothesis.
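Theorem A.1 can be illustrated numerically in the normal special case G = N(0, s²) with s known (all numerical values below are illustrative assumptions): with a flat prior on δ, the posterior is δ | ȳ ∼ N(ȳ, s²), so P{δ > 0 | ȳ} = Φ(ȳ/s); under H0, ȳ ∼ N(0, s²), and the posterior probability is exactly uniform on (0, 1):

```python
import numpy as np
from math import sqrt, erf

rng = np.random.default_rng(0)
s = 1.3      # known standard deviation of ybar (illustrative)
n = 50_000

ybar = rng.normal(0.0, s, size=n)  # data generated under H0: delta = 0

# posterior delta | ybar ~ N(ybar, s^2) under the flat prior, so
# P{delta > 0 | ybar} = Phi(ybar / s)
post = np.vectorize(lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0))))(ybar / s)

# a Uniform(0, 1) variable has mean 1/2 and variance 1/12, which the
# empirical moments of `post` should match
```
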
References
Bernardo J.M. and Smith A.F.M. (1994) Bayesian Theory. Wiley, New York.
Broët P., Lewin A., Richardson S., Dalmasso C. and Magdelenat H. (2004) A mixture model based
strategy for selecting sets of genes in multiclass response microarray experiments. Bioinformatics,
20(16):2562-2571.
Causton H.C., Quackenbush J., Brazma A. (2003) Microarray Gene Expression Data Analysis: A Beginner's
Guide. Blackwell Publishing.
Choe S.E., Boutros M., Michelson A.M., Church G.M., Halfon M.S. (2005) Preferred analysis methods for
Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biology, 6(2):R16.
Cleveland W.S., Grosse E. and Shyu W.M. (1992) Local regression models. Chapter 8 of Statistical Models
in S, eds J.M. Chambers and T.J. Hastie, Wadsworth & Brooks/Cole.
Dong X., Park S., Lin X., Copps K., Yi X., White M.F. (2005) Irs1 and Irs2 signaling is essential for hepatic
glucose homeostasis and systemic growth. J Clin Invest., doi:10.1172/JCI25735.
Gilks W., Richardson S., Spiegelhalter D. (eds.) (1996) Markov Chain Monte Carlo in Practice. Chapman
and Hall.
Irizarry R., Bolstad B., Collin F., Cope L., Hobbs B. and Speed T. (2003) Summaries of Affymetrix
GeneChip probe level data. Nucleic Acids Research, 31(4):e15.
Kerr M.K., Martin M. and Churchill G.A. (2000) Analysis of variance for gene expression microarray data.
Journal of Computational Biology, 7:819-837.
Lewin A., Richardson S., Marshall C., Glazier A. and Aitman T. (2005) Bayesian Modeling of Differential
Gene Expression. Biometrics, advance access July 2005, doi:10.1111/j.1541-0420.2005.00394.x
Lonnstedt I. and Britton T. (2005) Hierarchical Bayes models for cDNA microarray gene expression.
Biostatistics, 6:279-291.
Lonnstedt I. and Speed T. (2002) Replicated microarray data. Statistica Sinica, 12:31-46.
Marshall E.C. and Spiegelhalter D.J. (2003) Approximate cross-validatory predictive checks in disease
mapping models. Statistics in Medicine, 22:1649-1660.
McLachlan G.J., Do K.-A., Ambroise C. (2004) Analyzing Microarray Gene Expression Data. Wiley.
Parmigiani G., Garret E.S., Irizarry R.A., Zeger S. (2003) The Analysis of Gene Expression Data: Methods
and Software (edited). Springer.
Smyth G.K. (2004) Linear models and empirical Bayes methods for assessing differential expression in
microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3(1), Article 3.
http://www.bepress.com/sagmb/vol3/iss1/art3.
Speed T. (2003) Statistical Analysis of Gene Expression Microarray Data (edited). Chapman and Hall.
Spiegelhalter D.J., Best N.G., Carlin B.P. and van der Linde A. (2002) Bayesian measures of model
complexity and fit (with discussion). J. Roy. Statist. Soc. B, 64:583-640.
Storey J. (2002) A direct approach to false discovery rates. J. Roy. Statist. Soc. B, 64:479-498.
Tatoud R., Bochkina N., Jamain A., Mande K., Andersson U., Blangiardo M., Craig A., Aitman T.,
Richardson S., Carling D., Scott J. (2006) An overview of the transcriptomic response to Metformin
in Mouse Skeletal Muscle Cells.
Wit E. and McClure J. (2004) Statistics for Microarrays: Design, Analysis and Inference. Wiley.
Woll K., Borsuk L.A., Stransky H., Nettleton D., Schnable P.S., Hochholdinger F. (2005) Isolation, characterization, and pericycle-specific transcriptome analyses of the novel maize lateral and seminal root
initiation mutant rum1. Plant Physiol., 139(3):1255-67.