Inference with Transposable Data: Modeling the Effects of Row and Column Correlations
Genevera I. Allen†
Department of Pediatrics-Neurology, Baylor College of Medicine, Jan and Dan Duncan Neurological Research Institute, Texas Children’s Hospital, & Department of Statistics, Rice University,
Houston, TX, 77005.
Robert Tibshirani
Departments of Health Research & Policy and Statistics, Stanford University, Stanford, CA,
94305.
Summary. We consider the problem of large-scale inference on the row or column variables
of data in the form of a matrix. Many of these data matrices are transposable, meaning that
neither the row variables nor the column variables can be considered independent instances.
An example of this scenario is detecting significant genes in microarrays when the samples
may be dependent due to latent variables or unknown batch effects. By modeling this matrix
data using the matrix-variate normal distribution, we study and quantify the effects of row and
column correlations on procedures for large-scale inference. We then propose a simple solution to the myriad of problems presented by unanticipated correlations: We simultaneously
estimate row and column covariances and use these to sphere or de-correlate the noise in the
underlying data before conducting inference. This procedure yields data with approximately
independent rows and columns so that test statistics more closely follow null distributions and
multiple testing procedures correctly control the desired error rates. Results on simulated models and real microarray data demonstrate major advantages of this approach: (1) increased
statistical power, (2) less bias in estimating the false discovery rate, and (3) reduced variance
of the false discovery rate estimators.
Keywords: multiple testing, false discovery rate, transposable regularized covariance models,
large-scale inference, covariance estimation, matrix-variate normal
1. Introduction
As statisticians, we often make assumptions when constructing a model to ease computations
or employ existing methodologies. When conducting inference on matrix data, we often
assume that the variables along one dimension (say the columns) are independent, allowing
us to pool these observations to make inferences on the variables along the other dimension
(rows). In microarrays, for example, it is common to assume that the arrays are independent
observations when computing test statistics allowing us to assess differential expression in
genes. Since we are testing many row variables (genes, for example) simultaneously, we
correct for multiple testing using procedures that are known only to control error measures
when the row variables are independent or follow limited dependence structures. Thus,
when conducting inference along the row variables of matrix data, we make the following
assumptions to employ existing methodologies: (i) independent column variables and (ii)
independent or limited dependencies among row variables. What if these assumptions are
incorrect? What if this matrix data is in fact transposable, meaning that potentially both
the rows and the columns are correlated?
†To whom correspondence should be addressed.
In this paper, we consider the problem of testing the significance of row variables in a
data matrix where there are correlations among the rows, or among the columns, or among
both. We study the behavior of standard statistical methodology, such as two-sample tstatistics and controlling for multiplicity via the false discovery rate (FDR), on transposable
data. Then, we propose a method to directly account for these two-way dependencies by
de-correlating the data matrix before conducting inference.
We motivate the presence of these two-way dependencies through an example we will
refer to often: testing for differential expression in the two-class microarray. Consider a
microarray data set investigating differential gene expression between subjects of Asian
descent or European descent in which there are 4,167 genes and 142 subjects with arrays
processed between 2003 and 2007 (Spielman et al., 2007). In Figure 1, we display the
histogram of two-sample t-statistics for this data set superimposed with the density of the
theoretical null distribution, the t(140) distribution. The test statistics are strongly overdispersed compared to the theoretical null distribution. Many have proposed that this
effect could be due to correlations among the genes (Dudoit et al., 2003; Efron, 2004; Qiu
et al., 2005; Efron, 2010). Others have noted that correlations among the arrays, perhaps
induced by their differing processing dates in this example, can produce the same effect (Qiu
et al., 2005; Owen, 2005; Efron, 2009). What if this apparent over-dispersion is caused by
both gene and array correlations? And, what effect do these correlations have on standard
statistical methods used for large-scale inference?
Answers to these questions have been well studied for correlations among the rows or
test statistics (Dudoit et al., 2003; Efron, 2004; Owen, 2005; Qiu and Yakovlev, 2006; Efron,
2010; Schwartzman and Lin, 2011). Methods to control the false discovery rate such as the
step-up method of Benjamini and Hochberg (1995) or the permutation based method of
Storey (2002) were originally shown to have theoretical control only when test statistics are
independent. Later, however, these conditions have been relaxed to show that the FDR is
controlled under types of weak dependencies (Yekutieli and Benjamini, 1999; Storey et al.,
2004; Sarkar, 2008). Benjamini and Yekutieli (2001) developed a step-up procedure that
permits FDR control under arbitrary dependencies among tests, but at a great cost in terms
of statistical power (Farcomeni, 2008). Thus, this method is not preferred in the literature.
As the conditions of weak dependence are not easily checked with real data, it is unknown
whether false discovery rates are controlled in data with strong correlations among tests
such as with microarrays. In addition, many have recently noted that when tests are correlated, one not only needs to worry about the average false discovery proportion, but also about the
variance of the number of false discoveries and the FDR (Owen, 2005; Qiu and Yakovlev,
2006; Desai et al., 2009; Efron, 2010; Schwartzman and Lin, 2011).
While correlations among test statistics have been studied especially in the context of
microarray data, correlations among the columns and their effect on large-scale inference have
received much less scrutiny. First, one might ask whence correlations among columns arise.
For microarrays, these can occur because of batch effects, latent variables or instrument
drift, for example (Yang et al., 2002; Fare et al., 2003; Li and Rabinovic, 2007; Leek and
Storey, 2008; Efron, 2009; Leek et al., 2010a). If these are known to the statistician in
advance, these can be modeled directly (Li and Rabinovic, 2007; Leek et al., 2010a). Many
of these causes of potential array correlations are unknown to the statistician or unavailable
for data in many public gene expression repositories (Edgar et al., 2002). In addition,
simply assessing the presence of column correlations in the presence of row correlations is
a significant challenge (Efron, 2009; Muralidharan, 2010). Not surprisingly then, relatively
little work has been done on modeling and correcting for the effects of both row and column
correlations in the context of large-scale inference.
In this paper, we propose to study and develop methodology to correct for row and
column correlations when conducting inference on matrix data.

Fig. 1. Histogram of two-sample T-statistics for the Spielman et al. (2007) microarray data. The theoretical null, t(140), is superimposed, and the T-statistics are over-dispersed compared to the null distribution.

As the effect of the latter
is less developed in the literature, we focus more on the behavior of common test statistics
and null distributions when there are unanticipated column correlations. We show that even
though many have noted that over-dispersion of test statistics can result from correlated
tests (Efron, 2004, 2010), unanticipated correlations among the columns can also lead to
this result. The main contribution of this paper is a novel procedure to de-correlate both the
rows and columns of the data matrix prior to conducting inference. Several have proposed
such methods in the context of only row correlations (Tibshirani and Wasserman, 2006; Lai,
2008; Zuber and Strimmer, 2009) or for latent variable models (Leek and Storey, 2008),
but none for scenarios in which both the rows and columns are correlated with arbitrary
structure. This may be surprising as the idea seems like a simple and logical first step to
tackling the problems arising from strongly correlated matrix data. It turns out, however,
that estimating both row and column covariance matrices from a single matrix of data is a
major challenge (Efron, 2009; Muralidharan, 2010).
We model separable row and column covariances via the matrix-variate normal which
assumes that the covariance between elements of the data matrix is given by the Kronecker
product of its column and row covariances (Gupta and Nagar, 1999). This matrix-variate or
Kronecker product model has been used by others in the context of microarray data (Efron,
2009; Teng and Huang, 2009), but not for the challenging problem of directly estimating row and column covariances. In prior work, however, we have developed a method of simultaneously estimating row and column covariances via a penalized likelihood approach (Allen and Tibshirani, 2010). In this paper, we introduce a novel procedure that uses these covariance estimates to de-correlate or sphere the noise in the data matrix without changing
the underlying signal. This sphered data, which has approximately independent rows and
columns, can then be used to conduct large-scale inference. Our approach has several
advantages: (i) Tests are re-ordered leading to a better ranking, a lower true false discovery
proportion, and greater statistical power. (ii) Estimates of the false discovery proportion
are more consistent. (iii) The variance of the estimated false discovery rate is reduced.
The paper is organized as follows. In Section 2 we introduce our matrix model based on
the mean-restricted matrix-variate normal distribution. We then study the behavior of test
statistics when the columns are correlated, Section 2.2, and illustrate how common problems
associated with microarray data can lead to unanticipated array (column) correlations, Section 2.3. In Section 3, we develop our main sphering algorithm. Results on both simulated
models and real microarray data are given in Section 4 and we conclude with a discussion
of our work in Section 5.
2. Framework: A Matrix-variate Model
We present a matrix decomposition model based on the matrix-variate normal distribution.
Using this model, we study the effects of unanticipated column correlations on common
two-sample test statistics, showing that non-zero column correlations lead to an over or
under dispersion of the null distribution.
2.1. Matrix-variate Model
We propose to study row and column correlations through a simple matrix decomposition
model based on the matrix-variate normal. We motivate the use of this distribution through
the example of microarray data. With this data, the genes are often assumed to follow a
multivariate normal distribution with the arrays independent and identically distributed.
Since we aim to study the effects of array correlations, we need a parametric model that has
the flexibility to model either array independence or various array correlation structures.
To this end, we turn to the mean-restricted matrix-variate normal introduced in Allen and Tibshirani (2010), a variation of the familiar matrix-variate normal (Gupta and Nagar, 1999). For data $X \in \mathbb{R}^{m \times n}$, this distribution is denoted as $X \sim N_{m,n}(\nu, \mu, \Sigma, \Delta)$ and has separate mean and covariance parameters for the rows, $\nu \in \mathbb{R}^m$ and $\Sigma \in \mathbb{R}^{m \times m}$, and columns, $\mu \in \mathbb{R}^n$ and $\Delta \in \mathbb{R}^{n \times n}$. Thus, we can model array correlations directly through the covariance matrix $\Delta$. If the data matrix is transformed into a vector of length $mn$, we have that $\mathrm{vec}(X) \sim N(\mathrm{vec}(M), \Omega)$, where $M = \nu 1_{(n)}^T + 1_{(m)} \mu^T$ and $\Omega = \Delta \otimes \Sigma$. Also, the commonly used multivariate normal is a special case of the distribution: if $\Delta = I$ and $\mu = 0$, then $X \sim N(\nu, \Sigma)$. In fact, all marginal models of the matrix-variate normal are multivariate normal, meaning that both the genes and arrays separately are multivariate normal. Further properties of this distribution are given in Allen and Tibshirani (2010).
In our matrix decomposition model, we assume that there is an additional signal beyond
the row and column means. We then decompose the data into a mean, signal, and correlated
noise matrix as follows:
$$X_{m \times n} = M_{m \times n} + S_{m \times n} + N_{m \times n}. \qquad (1)$$
Here, M = ν1T(n) + 1(m) µT is the mean matrix, S is the problem specific signal matrix,
and N ∼ Nm,n (0, 0, Σ, ∆) is the noise matrix. Thus, X − S ∼ Nm,n (ν, µ, Σ, ∆), meaning
that after removing the signal, the data follows a mean-restricted matrix-variate normal
distribution. With two-class microarray data, for example, the signal matrix captures the
class means. Let there be $n_1$ arrays in class one, indices denoted by $C_1$, and $n_2$ in class two, denoted by $C_2$. (For simplicity of notation, we assume that the first $n_1$ arrays are in class one and the last $n_2$ arrays are in class two.) Let the class signals be $\psi_1 \in \mathbb{R}^m$ and $\psi_2 \in \mathbb{R}^m$. Then, the signal matrix, S, can be written as $S = [\psi_1 1_{(n_1)}^T\ \ \psi_2 1_{(n_2)}^T]$. Notice
that our matrix decomposition model is similar in spirit to the latent variable model of Leek
and Storey (2008) also proposed in the context of large-scale inference.
There are several further remarks to make regarding this model. Prior to analyzing
data, it is common to standardize the rows. Some have proposed to doubly-standardize
this two-way data by iteratively scaling both the rows and columns (Efron, 2009; Olshen
and Rajaratnam, 2010). With our model, we center both the rows and columns through
the mean matrix M, but do not directly scale them. Instead, we allow the diagonals of the
covariance matrices of the rows, Σ, and columns, ∆, to capture the differences in variabilities.
Thus, our model keeps the mean and variances separate in the estimation process.
2.2. Null Distributions: The Two-Class Problem
We study the effect of column correlations on the theoretical null distribution of two-sample
test statistics computed for a single row of the data matrix. More specifically, we calculate
the distributions of test statistics under our matrix decomposition model instead of the
typical two-sample framework where samples are drawn independently from two populations.
In the familiar two-class inference problem, we have a vector $x = [x_1^T\ x_2^T]^T$ with $x_1$ of length $n_1$ and $x_2$ of length $n_2$, where the elements of each vector are $x_{1,i} \overset{iid}{\sim} N(\psi_1, \sigma^2)$ and $x_{2,i} \overset{iid}{\sim} N(\psi_2, \sigma^2)$. We wish to test whether there is a shift in means between the two classes, namely $H_0: \psi_1 = \psi_2$ vs. $H_1: \psi_1 \neq \psi_2$. Here, we assume that the variances, $\sigma^2$, are equal between the two classes, but note that analogous results are obtainable if the variance is assumed to be unequal, as is often the case with microarray data.

If the variance, $\sigma^2$, is known, we have the familiar two-sample Z-statistic, $Z = (\bar{x}_1 - \bar{x}_2)/(\sigma \sqrt{c_n})$, which follows the distribution $Z \sim N((\psi_1 - \psi_2)/(\sigma \sqrt{c_n}),\ 1)$, where $\bar{x}_k = \frac{1}{n_k} \sum_{i=1}^{n_k} x_i$ and $c_n = \frac{1}{n_1} + \frac{1}{n_2}$ (Lehmann and Romano, 2005). Going back to our matrix decomposition model, we wish to know the distribution of this Z-statistic for each row when there are column correlations:

Claim 1. Let $x = [x_1^T\ x_2^T] \sim N_{1,n}\left(0,\ [\psi_1 1_{(n_1)}^T\ \psi_2 1_{(n_2)}^T],\ \sigma^2,\ \Delta\right)$. Then, $Z \sim N\left(\frac{\psi_1 - \psi_2}{\sigma \sqrt{c_n}},\ \frac{\eta}{c_n}\right)$, where $\eta = \sum_{i=1}^{n} \sum_{j=1}^{n} \Delta_{ij} W_i W_j = W^T \Delta W$, with $W_i = \begin{cases} \frac{1}{n_1} & i \in C_1 \\ -\frac{1}{n_2} & i \in C_2. \end{cases}$
Thus, when the columns are correlated, the variance of the two-sample Z-statistic is
inflated or deflated by η. In terms of the matrix decomposition model, the assumptions
of Claim 1 correspond to a row vector that has previously been centered by ν and µ, has
signal [ψ1 1(n1 ) ψ2 1(n2 ) ], column covariance ∆, and row variance σ 2 , the diagonal element
of Σ. Note that separability of the covariance matrices of our model allows us to dissect the
effects of column correlations in this manner. Notice that if ∆ = I, η = cn and the variance
of Z is one, as desired. If there is only column correlation within the two classes, then the
effects of these correlations can be parsed as follows:
Corollary 1. Let $x_1 \sim N_{1,n_1}(0,\ \psi_1 1_{(n_1)}^T,\ \sigma^2,\ \Delta_1)$ independent of $x_2 \sim N_{1,n_2}(0,\ \psi_2 1_{(n_2)}^T,\ \sigma^2,\ \Delta_2)$. Then, $Z \sim N\left(\frac{\psi_1 - \psi_2}{\sigma \sqrt{c_n}},\ \frac{\eta_1 + \eta_2}{c_n}\right)$, where $\eta_k = \frac{1}{n_k^2} \sum_{i=1}^{n_k} \sum_{j=1}^{n_k} \Delta_{k,ij}$ for $k = 1, 2$.
These effects are explored numerically in a small study described below.
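As a quick numerical check of Claim 1 (our own illustration), the following Python sketch compares the Monte Carlo variance of the Z-statistic against the predicted $\eta/c_n = W^T \Delta W / c_n$ under an AR(1)-style column covariance; with $\Delta_{ij} = 0.9^{|i-j|}$ and $n_1 = n_2 = 25$, this corresponds to the $\Delta_1$ scenario used below:

```python
import numpy as np

rng = np.random.default_rng(0)
n1 = n2 = 25
n = n1 + n2
sigma = 1.0
c_n = 1 / n1 + 1 / n2

# Column covariance Delta_ij = 0.9^|i-j| (the Delta_1 scenario below)
idx = np.arange(n)
Delta = 0.9 ** np.abs(idx[:, None] - idx[None, :])

# Contrast weights from Claim 1: 1/n1 on class 1, -1/n2 on class 2
W = np.concatenate([np.full(n1, 1 / n1), np.full(n2, -1 / n2)])
eta = W @ Delta @ W

# Monte Carlo: null rows with correlated columns, then the Z-statistic
L = np.linalg.cholesky(Delta)
x = sigma * rng.standard_normal((100_000, n)) @ L.T
Z = (x[:, :n1].mean(axis=1) - x[:, n1:].mean(axis=1)) / (sigma * np.sqrt(c_n))
print(f"Var(Z) Monte Carlo: {Z.var():.3f}   Claim 1 eta/c_n: {eta / c_n:.3f}")
```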
We have assumed that the row variance, $\sigma^2$, was known; however, in most microarray experiments it is not known and must be estimated. With $\sigma^2$ unknown, the two-sample t-statistic is used: $T = (\bar{x}_1 - \bar{x}_2)/(s_{x_1,x_2} \sqrt{c_n})$, where $s_{x_1,x_2}^2$ is the pooled estimate of the sample variance. Under the null hypothesis, $T \sim t_{(n-2)}$, while under the alternative, $T \sim t(\delta)_{(n-2)}$, a non-central t distribution with non-centrality parameter $\delta = (\psi_1 - \psi_2)/(\sigma \sqrt{c_n})$ (Lehmann and Romano, 2005).
When there are column correlations as in the assumptions of Claim 1, however, the
distribution of T does not have a closed form. (The square of the pooled sample standard
deviation is no longer distributed as a Chi-squared random variable and the numerator and
denominator of T are not independent.) Hence, we explore the effects of column correlations
on the T-statistic through a small simulation study. Data is simulated according to the assumptions of Claim 1 with $n = 50$ columns and $n_1 = n_2 = 25$ in each class. Four structured covariance matrices were used to assess the variances of the Z and T-statistics: $\Delta_1$: $\Delta_{1,ij} = 0.9^{|i-j|}$; $\Delta_2$ is block diagonal with blocks of size 10, and within each block $\Delta_{2,ij} = 0.9^{|i-j|}$; $\Delta_3$: $\Delta_{3,ij} = 0.5^{|i-j|}$; $\Delta_4$ is block diagonal with blocks of size 10, and within each block $\Delta_{4,ij} = 0.5^{|i-j|}$. Positive correlation structures are used as most observed array covariances in microarray studies are positive. We note that with negative correlations among columns, $\eta < c_n$, resulting in under-dispersed null distributions.

Fig. 2. Comparison of theoretical null distributions for the two-sample Z-statistic (left) and T-statistic (right) under four column correlation scenarios given in Section 2.2. Variances of the Z-statistics were calculated by the result in Claim 1, while the densities of the T-statistics were estimated via Monte Carlo simulation.

        Var(Z-statistic)    Var(T-statistic)
∆1      9.215               19.94 (0.029)
∆2      6.069               9.492 (0.0144)
∆3      2.76                3.197 (0.00472)
∆4      2.45                2.79 (0.00411)

Fig. 3. Variances of the two-sample Z and T-statistics under four column correlation scenarios. The theoretical variance of the Z-statistic should be 1, and 1.022 for the T-statistic.
Figure 2 demonstrates the effect of column correlations on the distributions of Z and
T . We see that positive column correlations can cause dramatic over-dispersion of the test
statistics compared to their theoretical null distribution. This is a possible explanation for the over-dispersion seen in the real microarray example displayed in Figure 1. Compared to
the variance of the Z-statistic, the T -statistic is even more affected by column correlations.
This is confirmed in Table 3 where we present the variances of the Z-statistic calculated by
Claim 1 and the variances of the T -statistic estimated by Monte Carlo simulation. Indeed,
small amounts of correlation in the columns can cause a dramatic increase in the variance
of the T -statistic.
We have shown how the distribution of T and Z-statistics behave when columns or
arrays are correlated. When analyzing microarrays, however, many have advocated using
non-parametric null distributions estimated by permuting the class labels (Dudoit et al.,
2003; Storey and Tibshirani, 2003; Tusher et al., 2001). This approach is also problematic,
however, when the columns or arrays are not independent. As, under the null hypothesis,
the joint distribution of the columns is not invariant under permutations, the randomization hypothesis fails (Lehmann and Romano, 2005). Therefore, inferences drawn by using
permutation null distributions instead of theoretical nulls suffer from the same troublesome
effects of unanticipated column correlations.
Our brief study of the behavior of two-sample test statistics in the presence of unanticipated column correlations reveals several problematic behaviors. Relatively small column
correlations can have a large effect leading to over or under dispersion of the theoretical null
distribution and incorrect inferences. In the supplementary materials, we use our matrix-variate model to study the performance of large-scale inference methodology (in particular, the step-up procedure (Benjamini and Hochberg, 1995), the permutation procedure (Storey and Tibshirani, 2003), and the empirical null based local FDR procedure (Efron, 2007)) under several row and column correlation scenarios. These results reveal that when both the rows and the columns are correlated, these problems are exacerbated. Specifically, further over-dispersion of the null distribution occurs, leading to (i) biased estimates of the FDR and (ii)
greater variance of FDR estimates whose effect is even greater than when only the rows
or only the columns are correlated. Therefore, a troubling picture emerges regarding the
statistical perils of performing large-scale inference on a data matrix in which the rows and
columns may be correlated.
2.3. Microarrays & Unanticipated Array Correlations
Before continuing with our proposal to solve the problems associated with two-way dependencies and large-scale inference, we pause to understand and quantify some possible
sources of unanticipated array correlations in microarray data. We consider models for three
possible sources of array correlations in microarray data: a batch-effect model (Li and Rabinovic, 2007; Leek et al., 2010a), a latent variable model (Leek and Storey, 2008), and an
instrument drift model (Fare et al., 2003). Clearly, if these sources of array correlations are
known to the statistician, they should be modeled directly. Unfortunately, this information
is often missing and we seek to understand the effect of these correlations if they cannot be
directly accounted for. Hence, we quantify how array correlations are induced if one fits a
standard model assuming that the arrays are independent when they are in fact distributed
according to these other models. In other words, we calculate the array covariance resulting
from model bias.
Consider a standard model for microarray data assuming that the arrays are independent with Gaussian noise: $X_{ij} = S_{ij} + \epsilon_{ij}$, where $S_{ij}$ denotes the fixed effects from the signal of interest and $\epsilon_{ij}$ is a random effect, $\epsilon_{ij} \overset{iid}{\sim} N(0, 1)$. Then, the expectation of the cross-products of population residuals, $r_{ij}^{(S)} = X_{ij} - S_{ij} = \epsilon_{ij}$, is obviously $E(r_{ij}^{(S)} r_{ij'}^{(S)}) = 0$ for $j \neq j'$. These cross-products are non-zero, however, for the other models we consider.

Let us consider the following batch effects model: $X_{ij} = S_{ij} + \sum_{k=1}^{K} \beta_k I_{(j \in I(k))} + \epsilon_{ij}$, where $I(k)$ denotes the batch membership and $\beta_k \overset{iid}{\sim} N(\mu_k, \sigma_k^2)$ independent of $\epsilon_{ij}$. Thus, the batch effect is a random effect given by $\beta_k$. Defining the population residuals, $r_{ij}^{(B)}$, in the same manner as above, we see that the expected cross-products are non-zero for arrays in the same batch:
$$E(r_{ij}^{(B)} r_{ij'}^{(B)}) = \begin{cases} \mu_k^2 + \sigma_k^2 & (j, j') \in I(k) \\ 0 & \text{otherwise.} \end{cases}$$
Hence, if either the mean batch effect or the additional variance among arrays in the batch are large, then strong correlations among the arrays can result. Similarly, consider the following latent variable model: $X_{ij} = S_{ij} + \sum_{k=1}^{K} \Gamma_{ik} G_{kj} + \epsilon_{ij}$, where $G_{kj}$ is the fixed latent variable with random weights $\Gamma_{ik} \overset{iid}{\sim} N(0, 1)$ independent of $\epsilon_{ij}$. Then, the expected cross-products of the population residuals, $r_{ij}^{(L)}$, are given by: $E(r_{ij}^{(L)} r_{ij'}^{(L)}) = \sum_{k=1}^{K} G_{kj} G_{kj'}$. To measure the effect of instrument drift, we employ a random walk with drift model. Define $D_{i1} \sim N(0, \sigma^2)$ and $D_{ij} = \mu + D_{i,j-1} + \psi_{ij}$ for $j > 1$, with $\psi_{ij} \overset{iid}{\sim} N(0, \sigma^2)$ independent of $\epsilon_{ij}$, and $\mu$ the fixed instrument drift. Then, consider the following instrument drift model: $X_{ij} = S_{ij} + D_{ij} + \epsilon_{ij}$. Again, the expected cross-products of the population residuals, $r_{ij}^{(D)}$, are: $E(r_{ij}^{(D)} r_{ij'}^{(D)}) = (j-1)(j'-1)\mu^2 + \sigma^2 (j \wedge j')$. Here, $j \wedge j'$
denotes the minimum of $j$ and $j'$. Calculations for all of these covariances are in Appendix A.
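The batch-effect cross-product formula above is easy to verify numerically. Below is a short Python check (our own illustration; batch effects are drawn independently for each gene, as in the simulations of Section 4.1, so that averaging residual cross-products over genes estimates the expectation):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, K = 5000, 50, 5                         # genes, arrays, batches
batch = np.repeat(np.arange(K), n // K)
mu = np.array([-0.5, -0.25, 0.0, 0.25, 0.5])  # batch-effect means (Section 4.1 values)
sd = 0.5

# X_ij = beta_{i,k(j)} + eps_ij with S = 0, so residuals from the
# (misspecified) independent-arrays model are X itself
beta = rng.normal(mu, sd, size=(m, K))
X = beta[:, batch] + rng.standard_normal((m, n))

# Average residual cross-products over genes, for same-batch array pairs
R = (X.T @ X) / m
off_diag = ~np.eye(n, dtype=bool)
for k in range(K):
    pairs = (batch[:, None] == k) & (batch[None, :] == k) & off_diag
    print(f"batch {k}: empirical {R[pairs].mean():+.3f}   "
          f"theory mu_k^2 + sigma_k^2 = {mu[k] ** 2 + sd ** 2:.3f}")
```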
Based on these simple models for microarray data, we see that large correlations among
the arrays can be induced with relatively small batch effects, latent variable effects or instrument drifts if these effects are not explicitly modeled. Putting this together with the results
discussed in the previous section, these small non-zero correlations can lead to dramatically
wider null distributions of common test statistics. This, in turn, leads to many more genes
being rejected than are truly differentially expressed. This illustration then serves as motivation for methods of directly addressing two-way dependencies when conducting large-scale
inference.
3. De-Correlating a Matrix Prior to Conducting Inference
We propose a novel method for solving the numerous problems associated with conducting
large-scale inference on matrix data in which the rows and columns may be correlated. The
approach has three simple steps: (1) Estimate the signal in the data and subtract this to
get an estimate of the two-way correlated noise. (2) Simultaneously estimate separable row
and column covariances of the noise and use these estimates to sphere or de-correlate the
noise. (3) Add the de-correlated noise and the signal to yield the sphered data on which
one conducts large scale inference.
Before we introduce our methodology, we briefly review the results and existing literature
illustrating the challenges of estimating both row and column covariances from a single
matrix of data. First, the problem of estimating these two separable covariances for multiple instances of independent and identically distributed matrix data was addressed by
Dutilleul (1999). The number of repeated matrix instances, however, must be large relative
to the row and column dimensions. In our case, we have only one replicate of size $mn$ from which to estimate $\frac{m(m-1)}{2} + \frac{n(n-1)}{2}$ parameters. Furthermore, the empirical estimates of the row and column covariances share the same information content. Assume that the data, X, has been row and column centered, and decompose the data according to the singular value decomposition, giving $X = U D V^T$. Then, the empirical covariances, $\hat{\Sigma} = X X^T / m$ and $\hat{\Delta} = X^T X / n$, can be written as $\hat{\Sigma} = U D^2 U^T / m$ and $\hat{\Delta} = V D^2 V^T / n$. That is, the empirical covariances share the same eigenvalues. Efron (2009) goes on to show that the variances of the elements of the two empirical correlation matrices are the same. Muralidharan (2010) likens this problem to estimating the variances of two random variables having only observed their sum.
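This shared information content is simple to see numerically; the following sketch (our own illustration) double-centers a Gaussian data matrix and confirms that the nonzero eigenvalues of the two empirical covariances coincide up to the m and n normalizations:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 30, 20
X = rng.standard_normal((m, n))
X -= X.mean(axis=1, keepdims=True)   # row center
X -= X.mean(axis=0, keepdims=True)   # column center

Sigma_hat = X @ X.T / m
Delta_hat = X.T @ X / n
ev_S = np.sort(np.linalg.eigvalsh(Sigma_hat))[::-1]
ev_D = np.sort(np.linalg.eigvalsh(Delta_hat))[::-1]
# Nonzero eigenvalues of XX^T and X^TX coincide, so the two empirical
# covariances carry the same spectral information; double centering
# leaves rank at most min(m, n) - 1.
k = n - 1
print(np.allclose(ev_S[:k] * m, ev_D[:k] * n))   # True
```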
Given this, there are several important notes to discuss. First, if the underlying data is
truly matrix-variate, meaning that neither the rows nor the columns are independent, then
simply estimating the row covariance or the column covariance is insufficient. This occurs
as non-zero column correlations influence the apparent row correlations and vice versa, a
point discussed in detail in Efron (2009). If our ultimate goal is to de-correlate the data
matrix before conducting inference, then only estimating the row covariance or only the
column covariance would lead to erroneous conclusions when the data is truly transposable.
Additionally, estimating the row and column covariances separately would lead to these
same problems. Finally, the empirical estimates of at least one covariance matrix, and likely
both, are necessarily singular. As the inverse covariance matrix is needed to de-correlate
the data, this presents an additional challenge.
We propose to estimate non-singular row and column covariances, $\hat{\Sigma}$ and $\hat{\Delta}$, simultaneously via the Transposable Regularized Covariance Models framework introduced in Allen and Tibshirani (2010). Then, we use these estimates to de-correlate the noise of the data matrix, $\tilde{N} = \hat{\Sigma}^{-1/2} N \hat{\Delta}^{-1/2}$, yielding a new noise matrix, $\tilde{N}$, which has approximately independent rows and columns. (We note that if $N \sim N_{n,p}(0, 0, \Sigma, \Delta)$, for example, then $\Sigma^{-1/2} N \Delta^{-1/2} \sim N_{n,p}(0, 0, I_{(n)}, I_{(p)})$ (Gupta and Nagar, 1999).) Finally, this new noise matrix is added to the signal estimated from the model to conduct large-scale inference.
3.1. Review: Transposable Regularized Covariance Models
The Transposable Regularized Covariance Model (TRCM) allows us to estimate non-singular
row and column covariances simultaneously by maximizing a penalized log-likelihood of
the matrix-variate normal distribution (Allen and Tibshirani, 2010). The model places a
matrix-convex penalty on the inverse covariances, or concentration matrices of the rows and
columns, allowing one to estimate non-singular covariances. For the ultimate purpose of
de-correlating a data matrix when conducting inference, we choose to employ an `1 -norm
penalty on the concentration matrices. This is done for both practical and theoretical
reasons which we will discuss shortly.
First, let us review the model, some of its properties, and the algorithms used to find the penalized MLE. Following from the matrix decomposition model (1), if we let N be the noise matrix remaining after removing the means and the signal in the data, then the penalized log-likelihood is as follows:
$$\ell(\Sigma, \Delta) = \frac{n}{2} \log|\Sigma^{-1}| + \frac{m}{2} \log|\Delta^{-1}| - \frac{1}{2} \mathrm{tr}\left(\Sigma^{-1} N \Delta^{-1} N^T\right) - \lambda m \|\Sigma^{-1}\|_1 - \lambda n \|\Delta^{-1}\|_1 \qquad (2)$$
where $\|\Delta^{-1}\|_1 = \sum_{i=1}^{n} \sum_{j=1}^{n} |\Delta^{-1}_{ij}|$ and $\lambda$ is a penalty parameter controlling the amount of sparsity in the concentration matrices. Notice that the first three terms of (2) are the log-likelihood of the matrix-variate normal distribution (Gupta and Nagar, 1999). Hence, the model assumes that the row and column covariances are separable and the joint distribution of the noise is $\mathrm{vec}(N) \sim N(0, \Delta \otimes \Sigma)$; that is, the joint covariance is given by the Kronecker product.
The `1 -norm penalties placed on the concentration matrices are an extension of the
graphical lasso-type penalties to the matrix-variate framework (Friedman et al., 2007; Rothman et al., 2008). These penalties encourage zeros in the off-diagonals of the concentration
matrices corresponding to a selection of edges in a graph structure and indicating that the
variables are conditionally independent (Dempster, 1972). Allen and Tibshirani (2010) showed that successive application of the graphical lasso algorithm, which solves the sub-gradient equations of (2), converges to a maximum of the penalized log-likelihood. The resulting estimates for $\hat{\Sigma}^{-1}$ and $\hat{\Delta}^{-1}$ are necessarily non-singular, as desired.
Before discussing the rationale for our model with $\ell_1$ penalties, we pause to address the logical question: why not use $\ell_2$-norm penalties, as also presented in Allen and Tibshirani (2010)? Recall that the $\ell_2$-norm TRCM solutions for $\hat{\Sigma}$ and $\hat{\Delta}$ have the same eigenvectors as their empirical counterparts. The solutions for the eigenvalues are simply regularized versions of the empirical eigenvalues. From results in random matrix theory, we know that the empirical eigenvectors and eigenvalues of covariances can be inconsistent with high-dimensional data (Johnstone, 2001; Johnstone and Lu, 2009). Furthermore, suppose that we were to de-correlate the noise by left and right multiplying by the matrix square roots of $\hat{\Sigma}^{-1}$ and $\hat{\Delta}^{-1}$. Let $N = U D V^T$ be the SVD of the noise, and $\Lambda_{\hat{\Sigma}}$ and $\Lambda_{\hat{\Delta}}$ be the diagonal matrices of eigenvalues of $\hat{\Sigma}$ and $\hat{\Delta}$, respectively. Then, the sphered noise, $\tilde{N}$, that would result from using the $\ell_2$ TRCM estimates has the following form:
$$\tilde{N} = \hat{\Sigma}^{-1/2} N \hat{\Delta}^{-1/2} = U \Lambda_{\hat{\Sigma}}^{-1/2} U^T\, U D V^T\, V \Lambda_{\hat{\Delta}}^{-1/2} V^T = U \Lambda_{\hat{\Sigma}}^{-1/2} D \Lambda_{\hat{\Delta}}^{-1/2} V^T.$$
Hence, using the $\ell_2$-norm TRCM estimates to de-correlate the noise returns a matrix with the same singular vectors as the original noise. Also, the resulting singular values are simply a regularized version of the original singular values of the noise, a result which can be calculated using the formulas for $\Lambda_{\hat{\Sigma}}$ and $\Lambda_{\hat{\Delta}}$ in Allen and Tibshirani (2010). Therefore,
employing `2 -norm TRCM estimates changes the scale of the noise instead of projecting the
noise onto directions that yield approximately independent rows and columns.
Using the `1 -norm penalty in the TRCM framework, however, has many practical advantages in the context of large-scale inference. First, one usually assumes that the columns
are independent, so having ∆ = I should be our default position. As the penalty encourages
sparsity in the off-diagonals of ∆-1 , estimating a diagonal covariance is a special case of this
model. Furthermore, notice that the penalty parameter, $\lambda$, is modulated by the dimensions of the rows and columns. (We note that $\lambda$ can be estimated via cross-validation using the efficient alternating conditional expectations algorithm (Allen and Tibshirani, 2010).) Thus, the evidence of partial correlations among the columns must be strong relative to the partial correlations among the rows in order for non-zero column correlations to be estimated. Secondly, especially in the context of microarrays, it seems reasonable to assume that the covariance among the genes is sparse, as biologically, genes are likely only to be correlated with genes in the same or related pathways.
There is also a theoretical foundation motivating the use of the `1 -norm TRCM estimates.
These estimates, by encouraging sparsity, regularize both the eigenvectors and eigenvalues.
This turns out to be important for covariance estimation. For multivariate normal data,
covariance estimators resulting from the graphical lasso penalty are consistent in both the
Frobenius norm and operator norm for estimating a true underlying sparse matrix (Rothman
et al., 2008). Note that while convergence in the Frobenius norm gives convergence of
the eigenvalues, convergence in the operator norm implies convergence of the eigenvectors
(El Karoui, 2008). Additionally, regularizing eigenvectors using sparsity has been shown to
yield consistent directions for dimension reduction (Johnstone and Lu, 2009). While these
results are for multivariate data, we note that both the feature space and the sample space
are permitted to increase as long as they increase at a constant ratio asymptotically. Given
this and from our experience with the $\ell_1$-norm TRCM estimates, we conjecture that, under the right assumptions, one may prove that the $\hat{\Sigma}$ and $\hat{\Delta}$ that maximize (2) are consistent for estimating sparse separable inverse covariance matrices of the matrix-variate normal. We leave this open problem for future work.
3.2. Sphering Algorithm
We develop a simple method to directly address the problems associated with inference on
data exhibiting row and column correlations: We de-correlate the underlying data before
conducting inference. Among the many advantages of this approach is that (i) it can be
used with any test statistic and (ii) with any method of controlling for multiple testing.
Algorithm 1 Sphering Algorithm
(a) Estimate row and column means, $\hat{\nu}$ and $\hat{\mu}$, forming $\hat{M}$, and the signal matrix, $\hat{S}$.
(b) Define the noise, $\hat{N} = X - \hat{M} - \hat{S}$. Estimate row and column covariances of the noise, $\hat{\Sigma}$ and $\hat{\Delta}$, via TRCM.
(c) Sphere the noise: $\tilde{N} = \hat{\Sigma}^{-1/2} \hat{N} \hat{\Delta}^{-1/2}$. Form the sphered data matrix: $\tilde{X} = \hat{S} + \tilde{N}$.
Our sphering algorithm, based on the matrix decomposition model (1), is given in Algorithm 1. This algorithm simply removes the means and signal, estimates the row and column
covariances of the noise, and uses these to de-correlate or sphere the noise. The estimated
signal, $\hat{S}$, is problem specific. For the two-class model, for example, $\hat{S} = [\hat{\psi}_1 1_{(n_1)}^T\ \hat{\psi}_2 1_{(n_2)}^T]$, where $\hat{\psi}_1$ and $\hat{\psi}_2$ are the vectors of estimated class means for each row. Note that we use the symmetric square root defined by the following: let $\hat{\Sigma}^{-1} = P \Lambda P^T$ be the eigenvalue decomposition of $\hat{\Sigma}^{-1}$; then the symmetric matrix square root is given by $\hat{\Sigma}^{-1/2} = P \Lambda^{1/2} P^T$. Then, adding the signal back into this sphered noise, we obtain $\tilde{X}$, which we call the sphered data.
Thus, the sphering algorithm solves the problems associated with row and column correlations by sphering the underlying data. As this algorithm operates on the original data
matrix, it can be used with any test statistic and any multiple testing procedure. To better
understand the algorithm, however, we investigate some of its properties for the two-class
problem:
Proposition 1. Let $X \sim N_{m,n}(M + S, \Sigma, \Delta)$ where $M = \nu 1_{(n)}^T + 1_{(m)} \mu^T$ and $S = [\psi_1 1_{(n_1)}^T\ \psi_2 1_{(n_2)}^T]$, and let $\tilde{X}$ be the sphered data given by Algorithm 1. Then,
(i) $E(\tilde{X}) = S = [\psi_1 1_{(n_1)}^T\ \psi_2 1_{(n_2)}^T]$,
(ii) If, in addition, we take some $N_0 \overset{d}{=} N$ independent of $N$ and define $\tilde{N}_0 = \hat{\Sigma}^{-1/2} N_0 \hat{\Delta}^{-1/2}$, then $\tilde{N}_0 \mid \hat{\Sigma}, \hat{\Delta} \sim N_{m,n}(0, 0, \tilde{\Sigma}, \tilde{\Delta})$, where $\tilde{\Sigma} = \hat{\Sigma}^{-1/2} \Sigma\, \hat{\Sigma}^{-1/2}$ and $\tilde{\Delta} = \hat{\Delta}^{-1/2} \Delta\, \hat{\Delta}^{-1/2}$.
Thus, the signal remains the same between X and X̃, and the covariance structure is all
that changes. Each row of X̃ then becomes a linear combination of the other rows weighted
by their partial correlations. The same applies to the columns.
Now, let us study how sphering the data affects the Z and T -statistics from Section
2.2. First, the Z-statistic does not change with sphering. The numerator of both the Z- and T-statistics, $\bar{x}_{1,i} - \bar{x}_{2,i}$, is given by $\hat{\psi}_{1,i} - \hat{\psi}_{2,i}$, the components of the estimated signal matrix $\hat{S}$. The denominator of the T-statistic, namely $s_{x_1,x_2}$, the estimate of the noise,
however, changes with sphering. Thus, since the denominator of the T -statistic changes
with sphering, the ranking of the rows changes as well. This is an important point which
we will discuss in more detail subsequently. Recall also that in Section 2.2, we noted that
the T -statistic does not have a closed form distribution when there are column correlations.
After sphering the data, however, the T -statistic on the sphered data approximately follows
a scaled t distribution under certain conditions:
Claim 2. Assume the assumptions in Proposition 1 (ii) hold. In addition, let $\tilde{X}$ be the sphered data defined by $\tilde{N}_0 + \hat{S}$, and let $\tilde{T}_i$ be the statistic for the $i$th row of the data $\tilde{X}$. Then, under the null hypothesis $H_0: \psi_{1,i} = \psi_{2,i}$, if $\tilde{\Delta} = I$, $\tilde{T}_i \sim \frac{\tilde{\sigma}_i}{\sigma_i} \sqrt{\frac{\eta}{c_n}}\; t_{(n-2)}$. Here, $c_n$ and $\eta$ are defined as in Claim 1.
Using our sphering algorithm to de-correlate the noise in the data matrix, we obtain test statistics that follow approximately known distributions under certain conditions. The sphered column covariance, $\tilde{\Delta}$, is assumed to be the identity. If $\tilde{\Delta}$ is instead a diagonal matrix, then a simple scaling of the columns will give the above result. Notice that if the original data, X, has no column correlations, $\Delta = I$, then $T$ and $\tilde{T}$ both approximately follow a scaled t distribution with $n - 2$ degrees of freedom. Thus, if the data originally follows the correct theoretical null distribution, then sphering the data does not change its null distribution, an important property. Also, if the sphered rows are independent, $\tilde{\Sigma} = I$, or approximately independent, then the statistics, $\tilde{T}_i$, are independent or approximately independent. We also note that we can often assume that $\tilde{\sigma}_i = \sigma_i$, thus eliminating that coefficient ratio from the distribution. This is an especially reasonable assumption if the rows are scaled prior to applying the sphering algorithm. While $\tilde{\Delta}$ and $\tilde{\Sigma}$ are not likely to be exactly the identity, we have observed in simulations that these are often diagonal or nearly diagonal.
When calculating p-values for $\tilde{T}$ based on the distribution given in Claim 2, we must know the value of $\eta$, which depends on the original column covariance $\Delta$. While one might be inclined to estimate $\eta$ from $\hat{\Delta}$, this is problematic for several reasons. First, $\hat{\Delta}$ is the penalized MLE, meaning that the estimate is biased for finite samples and the exact formula for this bias has not yet been established. Thus, estimating $\eta$ in this manner would result in a global underestimate of the population variance, $\eta$. Secondly, $\hat{\Delta}$ and $\hat{\Sigma}$ are only identifiable up to a multiplicative constant (Allen and Tibshirani, 2010). Hence, the scale of the variances of the columns is not separable from that of the rows, meaning that one cannot determine the variance associated with the columns, $\eta$, from the TRCM estimates. Further research on the consistency of the estimates for $\hat{\Delta}$ and $\hat{\Sigma}$ is needed, as well as investigations into estimating $\eta$ directly.
For our purposes, then, we propose to estimate $\eta$, and hence re-scale the distribution of the sphered test statistics, $\tilde{T}$, in a data-driven manner. Note that if all of the test statistics were truly from the null distribution, then we could simply re-scale the test statistics to have the same variance as that of the $t_{(n-2)}$ distribution. As we often expect a portion of the tests to be non-null, however, we do not want these tests to contaminate our variance estimate. Thus, we propose to scale by only the central portion of the observed distribution of test statistics, as these are most likely to be truly null tests. More specifically, we estimate $\eta$ by comparing the variance of the central portion of the $t_{(n-2)}$ distribution to that of the central portion of the $\tilde{T}$-statistics. This procedure is outlined in Algorithm 2, where $\rho_\alpha(x)$ denotes the $\alpha$th quantile of $x$ and $I(\cdot)$ is the indicator function.
Algorithm 2 Scaling by the central portion of $\tilde{T}$.
(a) Let the expected proportion of null test statistics be $\hat{\pi}_0 = \hat{m}_0 / m$.
(b) Estimate the variance of the central portion of the sphered test statistics:
$$\hat{\sigma}^2_{\tilde{T}}(\hat{\pi}_0) = \widehat{\mathrm{Var}}\left[\tilde{T}\, I\left(\tilde{T} \geq \rho_{(1-\hat{\pi}_0)/2}(\tilde{T}),\ \tilde{T} \leq \rho_{1-(1-\hat{\pi}_0)/2}(\tilde{T})\right)\right]$$
(c) Define the central-matched $\tilde{T}$-statistics: $\tilde{T}^* = \tilde{T}\, \sigma_{t_{(n-2)}}(\hat{\pi}_0) / \hat{\sigma}_{\tilde{T}}(\hat{\pi}_0)$, where $\sigma^2_{t_{(n-2)}}(\hat{\pi}_0)$ is the variance of the central portion of the $t_{(n-2)}$ distribution.

The estimate for the scaling factor $\eta$ is then $\hat{\eta} = c_n\, \hat{\sigma}^2_{\tilde{T}}(\hat{\pi}_0) / \sigma^2_{t_{(n-2)}}(\hat{\pi}_0)$. Thus, the resulting $\tilde{T}^*$-statistics can be tested against the $t_{(n-2)}$ distribution. As we do not want statistics corresponding to non-null tests to contaminate the variance estimates, we recommend using a conservative estimate of $\pi_0$, such as 0.8 or 0.9 for microarrays.
We pause to ask a logical question. Based on the results in Section 2.2, why does one
need to sphere the data and then estimate η instead of simply estimating η via Algorithm 2
at the outset? Recall that the result in Claim 1 giving the altered variance of the Z-statistic
was for a single test associated with a single row of the data matrix. Also, recall that over or
under dispersion of the test statistic can result from correlation among the rows alone (Qiu
et al., 2005; Efron, 2010). Thus, if the row variables are left un-sphered, then estimating
η via central matching will result in a biased estimate that leads to incorrect inferences.
As a brief illustration of this effect, we apply central matching to the original (un-sphered) data in the analysis of real microarray data in the next section.
Our sphering algorithm takes a simple but direct approach to the problems associated
with correlation and large-scale inference. The noise in the data is de-correlated using the
TRCM estimates of the row and column covariances so that the resulting sphered noise is
approximately independent. While we have mainly discussed the distributions of standard
two-sample test statistics resulting from our algorithm, we note that as our approach works
with the original data matrix, it is general and can be used in conjunction with any test
statistic and any multiple testing procedure.
4. Results
We evaluate the performance of our sphering method through simulations and a real microarray example in which batch effects have been documented.
4.1. Simulations
We test our method of directly accounting for row and column correlations when conducting
inference on simulated data sets and compare the results to those of competing methods.
Four simulation models are employed for this purpose: 1) our matrix-variate model with
correlation structure inspired by that of the Spielman et al. (2007) microarray data, 2) a
latent variable model, 3) a batch-effect model, and 4) an instrument drift model.
Data of dimension $250 \times 50$ is simulated according to the following model: $X = S + B + N$ for signal matrix S, effect matrix B, and noise matrix N. The signal is constant for all four simulation models and is that of a two-class model with 25 columns in each class and 50 non-null rows: $S = [\psi_1 1_{(25)}^T\ \psi_2 1_{(25)}^T]$ where $\psi_{1,1:25} = 0.5$, $\psi_{1,26:50} = -0.5$, $\psi_{2,1:25} = -0.5$, $\psi_{2,26:50} = 0.5$, and $\psi_{1,51:250} = \psi_{2,51:250} = 0$. For the matrix-variate model, the effect matrix $B = 0$ and the noise matrix $N = \Sigma^{1/2} Z \Delta^{1/2}$ with $Z_{ij} \overset{iid}{\sim} N(0, 1)$. The row and column covariances are inspired by the correlation observed in the Spielman et al. (2007) data: $\Sigma$ and $\Delta$ are taken as the correlation matrices of 250 randomly sampled genes and 50 columns randomly sampled according to the class labels. The latent variable simulation model is taken from Leek and Storey (2008) and consists of the noise matrix $N_{ij} \overset{iid}{\sim} N(0, 1)$ and the effect matrix $B = \Gamma G$. Here, the latent variables, $G$, of dimension $2 \times 50$ are given by $G_{ij} \overset{iid}{\sim} \mathrm{Bern}(0.5)$, and the weights, $\Gamma$, of dimension $250 \times 2$ are given by $\Gamma_{ij} \overset{iid}{\sim} N(0, (.5)^2)$. For the batch effect model, $B_{ij} = \sum_{k=1}^{K} \beta_{ik} I_{(j \in I(k))}$, where $I(k)$ indicates the $k$th batch membership and $\beta_{ik} \overset{iid}{\sim} N(\mu_k, (.5)^2)$. We simulate $K = 5$ batches with ten members each and $\mu = [-0.5, -0.25, 0, 0.25, 0.5]$. The noise $N = \tilde{\Sigma}^{1/2} Z$, where $\tilde{\Sigma}_{ij} = (-.9)^{|i-j|}$ and $Z_{ij} \overset{iid}{\sim} N(0, 1)$ independent of the effect matrix. Finally, the effect matrix for the instrument drift model is given by $B_{i,j} = \mu + \gamma B_{i,j-1} + Z_{i,j}$, where the drift $\mu = 0.01$, the shrinkage $\gamma = 0.1$, and the innovations $Z_{ij} \overset{iid}{\sim} N(0, (.1)^2)$. As with the batch effect model, the noise $N = \tilde{\Sigma}^{1/2} Z$.
We compare the results of our sphering algorithm to those of competing methods, specifically to standard methodology (row variables are standardized and column variables are centered), surrogate variable analysis (Leek and Storey, 2008), the correlation sharing method
(Tibshirani and Wasserman, 2006), the correlation predicted method (Lai, 2008), and the
correlation adjusted method (Zuber and Strimmer, 2009). Note that all of these methods reorder the rank of the row variables compared to that of standard methodology. This means
that the true false discovery proportion (FDP) will change for each method depending on
whether the true non-null rows are re-ordered correctly. Thus, we compare results by fixing
the number of tests rejected and comparing the true FDP and estimated FDR, estimated
via the step-up procedure (Benjamini and Hochberg, 1995). Best performing methods will
re-order the rows such that the true FDP is lower, meaning that the statistical power is higher, while the FDR is well-estimated or slightly conservative. Methods yielding anti-conservative FDR estimates, meaning that the FDR under-estimates the true FDP, exhibit problematic tendencies, as controlling the FDR does not imply controlling the proportion of false discoveries.

Matrix-variate Model
            Standard       Sphered        SVA            Corr. Sharing  Corr. Predicted  Corr. Adjusted
            FDP    FDR̂    FDP    FDR̂    FDP    FDR̂    FDP    FDR̂    FDP    FDR̂      FDP    FDR̂
40 tests    0.001  0.000   0.020  0.002   0.003  0.000   0.110  0.000   0.790  0.000     0.000  0.000
45 tests    0.006  0.000   0.023  0.007   0.016  0.000   0.152  0.000   0.792  0.000     0.002  0.000
50 tests    0.046  0.000   0.041  0.026   0.055  0.000   0.198  0.000   0.791  0.000     0.032  0.000
55 tests    0.121  0.000   0.106  0.124   0.129  0.001   0.240  0.000   0.792  0.000     0.109  0.000
60 tests    0.189  0.000   0.178  0.249   0.195  0.002   0.285  0.000   0.792  0.000     0.179  0.000

Latent Variable Model
            Standard       Sphered        SVA            Corr. Sharing  Corr. Predicted  Corr. Adjusted
            FDP    FDR̂    FDP    FDR̂    FDP    FDR̂    FDP    FDR̂    FDP    FDR̂      FDP    FDR̂
40 tests    0.053  0.072   0.073  0.058   0.051  0.063   0.144  0.032   0.805  0.000     0.046  0.049
45 tests    0.089  0.113   0.104  0.086   0.080  0.101   0.186  0.040   0.802  0.000     0.077  0.081
50 tests    0.131  0.162   0.138  0.128   0.122  0.153   0.226  0.050   0.801  0.000     0.122  0.123
55 tests    0.178  0.221   0.186  0.171   0.174  0.211   0.265  0.059   0.804  0.000     0.175  0.171
60 tests    0.229  0.280   0.233  0.225   0.226  0.274   0.304  0.068   0.800  0.000     0.226  0.226

Batch Effect Model
            Standard       Sphered        SVA            Corr. Sharing  Corr. Predicted  Corr. Adjusted
            FDP    FDR̂    FDP    FDR̂    FDP    FDR̂    FDP    FDR̂    FDP    FDR̂      FDP    FDR̂
40 tests    0.044  0.064   0.040  0.051   0.519  0.025   0.024  0.023   0.055  0.010     0.001  0.004
45 tests    0.074  0.107   0.069  0.079   0.540  0.031   0.044  0.039   0.087  0.021     0.004  0.012
50 tests    0.117  0.166   0.112  0.122   0.553  0.036   0.081  0.063   0.126  0.044     0.025  0.059
55 tests    0.165  0.232   0.164  0.176   0.571  0.040   0.139  0.092   0.183  0.080     0.098  0.182
60 tests    0.216  0.295   0.216  0.233   0.587  0.046   0.197  0.117   0.237  0.116     0.171  0.273

Instrument Drift Model
            Standard       Sphered        SVA            Corr. Sharing  Corr. Predicted  Corr. Adjusted
            FDP    FDR̂    FDP    FDR̂    FDP    FDR̂    FDP    FDR̂    FDP    FDR̂      FDP    FDR̂
40 tests    0.036  0.056   0.033  0.045   0.099  0.101   0.018  0.020   0.044  0.013     0.001  0.002
45 tests    0.061  0.097   0.058  0.077   0.136  0.138   0.039  0.034   0.072  0.030     0.001  0.006
50 tests    0.103  0.158   0.101  0.120   0.174  0.181   0.074  0.057   0.116  0.063     0.013  0.049
55 tests    0.156  0.227   0.152  0.174   0.217  0.226   0.135  0.090   0.171  0.110     0.094  0.201
60 tests    0.211  0.298   0.208  0.236   0.259  0.271   0.196  0.112   0.227  0.153     0.168  0.311

Fig. 4. Simulation results comparing the average true false discovery proportion (FDP) to the average estimated false discovery rate (FDR̂) over 100 replicates of the four simulation models for the six methods as described in Section 4.1. Results are compared when the number of rejected tests is fixed between 40 and 60 tests out of 250 total tests, with 50 tests being truly non-null. Rejecting 55 tests corresponds to controlling the oracle FDP at 10%. Best performing methods for each model, meaning that the most true non-null tests are rejected while still controlling the true FDP, are denoted in bold.
In Table 4, we display the average true FDP and estimated FDR for the six comparison
methods on the four simulation models over 100 replicates for fixed numbers of rejected
tests. As controlling the oracle FDP at 10% corresponds to rejecting 55 tests, we present
boxplots of the true FDP and estimated FDR in Figure 5 when 55 tests are rejected. These
results reveal that the sphering algorithm performs well in comparison to the other five
methods. In the matrix-variate, batch effect and instrument drift models, the sphering
algorithm re-orders the row rankings in such a way as to lower the true FDP, yielding an increase in statistical power. In addition, the estimated FDR is a more consistent
estimate when sphering is used, allowing one to reject more truly non-null rows than other
methods. Notice also that sphering decreases the variance of the FDR estimates compared
to standard methodology. The correlation adjusted method (Zuber and Strimmer, 2009)
generally results in a favorable re-ordering of the row rankings, but leads to an FDR estimate
with larger variance. All competing methods exhibit troubling behavior in the matrix-variate
simulation. Here, all methods estimate that there are no false discoveries when in fact there
are at least five false discoveries. Also, competing methods such as SVA, correlation sharing, and correlation predicted exhibit these problematic behaviors in at least two of
the simulation models. In microarray analysis, this behavior would lead to identifying too
many genes as significant when many are likely to be false discoveries. Overall, our sphering
algorithm is the most consistent and most robust method for conducting inference on matrix
data with both row and column correlations.
4.2. Results: Real Microarray Study
We compare the performance of our sphering algorithm to that of competing methods on
a real microarray data set in which strong batch effects have been documented. The data
presented in Spielman et al. (2007) measures the gene expression of 4,167 genes for 142
subjects with 60 of European ancestry (CEU) and 82 of Asian (ASN) ancestry. Spielman
et al. (2007) find that 78% of genes are differentially expressed between the CEU and ASN
groups. Subsequent work, however, has questioned the validity of these results due to
strong batch effects (Akey et al., 2007; Leek et al., 2010a). Specifically, the microarrays
were processed from years 2003 - 2007, with the bulk (44/60) of the CEU group processed
in years 2003 - 2004 and all of the ASN group processed in years 2005 - 2007. Due to the
strong batch effects measured by the processing year and the confounding between these
batches and the two classes, it is difficult to determine the set of truly differentially expressed
genes. In fact, after removing the batch effects, no genes are found to be significant (Akey
et al., 2007).
On this data, we apply our sphering algorithm, standard methodology, and the four
competing methods described in the previous section to determine the number of genes that
are significantly differentially expressed between the two groups. Prior to analysis, the genes
and arrays are centered. Two-sample t-statistics with the variance correction as proposed in
Tusher et al. (2001) are used. Test statistics are compared to the permutation distribution
(Storey and Tibshirani, 2003) to assess significance and correct for multiple testing. The
FDR is controlled at 10%. While the batches are known for this data set, we ignore this
information when comparing methods to test the performance of each method.
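For reference, here is a Python sketch of the test statistics and permutation null used in this comparison (our paraphrase of the cited procedures; the fudge factor s0 is left as a user choice, and class labels are assumed to be coded 1 and 2):

```python
import numpy as np

def mod_t(X, labels, s0):
    """Two-sample t-statistics with the variance correction of Tusher et al. (2001):
    a constant s0 added to the pooled SD stabilizes low-variance genes."""
    g1, g2 = X[:, labels == 1], X[:, labels == 2]
    n1, n2 = g1.shape[1], g2.shape[1]
    ss = (((g1 - g1.mean(1, keepdims=True)) ** 2).sum(1)
          + ((g2 - g2.mean(1, keepdims=True)) ** 2).sum(1))
    sp = np.sqrt(ss / (n1 + n2 - 2))
    return (g1.mean(1) - g2.mean(1)) / ((sp + s0) * np.sqrt(1 / n1 + 1 / n2))

def perm_null(X, labels, s0, B=200, rng=None):
    """Permutation null of the statistics, as in Storey and Tibshirani (2003)."""
    rng = rng or np.random.default_rng(0)
    return np.concatenate([mod_t(X, rng.permutation(labels), s0) for _ in range(B)])
```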
Fig. 5. Boxplots of the true false discovery proportion (FDP) and the estimated false discovery rate (FDR) for six methods under four simulation models, each repeated 100 times. For each method, the number of tests rejected is fixed at 55 out of 250 tests, corresponding to controlling the oracle FDP at 10%, as shown with the dotted line. Methods and simulation models are described in Section 4.1. Best performing methods reorder the tests such that the true FDP is close to the oracle FDP, while the estimated FDR is a good or slightly conservative estimate of the true FDP with small variance. The best methods are then our sphering method for the matrix-variate, batch effect and instrument drift models and the SVA method for the latent variable model.

                                          Number of Significant Genes
Standard                                  2040
Standard (with Central Matching)          80
Standard after removing Batch Effects     0
SVA                                       2787
Correlation Sharing                       4167
Correlation Predicted                     248
Correlation Adjusted                      178
Sphering                                  28

Fig. 6. Number of significant genes found in the Spielman et al. (2007) data. Two-sample T-statistics with a variance adjustment were used, and the FDR was controlled at 10%. When the batch labels are known and the batch effects removed, no genes are found significant. Assuming the batch labels are unknown, competing methods find many more genes significant. Our sphering method appropriately estimates and removes the effect of the batches, finding only 28 genes significant.

In Table 6, we present the number of significant genes found by each method. We find that 49% of genes are differentially expressed using the standard methodology, but after
After removing the batch effects by standardizing the arrays with respect to processing year,
however, no genes are found to be significant. These results are consistent with previous re-analysis of
this data (Akey et al., 2007). When employing the surrogate variable analysis and correlation
sharing methods, however, even more genes are found to be significant: 67% and 100% of genes,
respectively. The correlation predicted and correlation adjusted methods find 6% and 4% of genes
significant, while our sphering algorithm finds only 0.67% of genes significant. Notice also
that while central matching adjusts for some of the effects of correlation, it does not
re-order the gene rankings, and it still finds 1.9% of genes significant. Thus, even without using
knowledge of the processing years, the sphering method correctly adjusts for these batch
effects, finding very few genes differentially expressed. The SVA and correlation sharing
methods, however, display troubling behavior, as they estimate that even more genes are
significant than the standard methodology does. These real data results are thus consistent with
what we previously observed in the simulation study.
To explore these results further, we present heatmap dendrograms of the top 250 genes
for the original data, the data resulting from the SVA method, and our sphered data in
Figure 7. Colorbars indicating the processing years as well as the group labels are shown
with these heatmaps. From this, we see that the arrays in the original data cluster by
processing years, 2003 - 2004 or 2005 - 2007, instead of by group status. When
the SVA method is applied, this effect appears to be exacerbated, as the separation between
processing years is more pronounced. In contrast, the sphered data clusters the arrays by
group status and not by processing year, indicating that our method removed these strong
batch effects. With the batch effects removed, the true problem is clearly illustrated: the confounding
between processing years and group status.
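A generic R sketch of how such a display can be produced is given below; this is not our plotting code, and the function name topGeneHeatmap is hypothetical. Note that base R's heatmap() supports only a single column-side colorbar, whereas Figure 7 shows two (processing year and class).

topGeneHeatmap <- function(X, t.stat, year.colors, n.top = 250) {
  ## cluster the arrays on the n.top genes with largest |t|;
  ## year.colors: one color per array, marking its processing year
  top <- order(abs(t.stat), decreasing = TRUE)[1:n.top]
  heatmap(X[top, ], scale = "none", ColSideColors = year.colors,
          col = colorRampPalette(c("blue", "white", "red"))(64))
}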
These results on a real microarray example provide strong evidence for the utility of our
sphering approach, which outperforms all of the competing methods considered.
5. Discussion
In this paper, we have demonstrated that using standard statistical methodology to conduct
inference on transposable data is problematic. To solve these problems, we have proposed
a sphering algorithm, applied before conducting large-scale inference, that de-correlates the
data to yield approximately independent rows and columns. We have demonstrated the
advantages and robustness of this method through simulations on many correlated data sets.
[Fig. 7 appears here: three heatmaps with dendrograms, each with two colorbars above the arrays.]

Fig. 7. Heatmap dendrograms of the top 250 genes for the original standardized Spielman et al. (2007) data (left), the data after removing latent variables via the SVA method (center), and after sphering the data (right). Samples are labeled according to the year in which the arrays were processed (upper colorbar) and the class labels (lower colorbar). In the original data and with the SVA method, samples cluster largely according to processing years, 2003 - 2004 or 2005 - 2007. Samples in the sphered data do not seem to cluster according to processing year; instead, the confounding of the class labels and processing years is clearly seen.
We comment briefly on the context in which our methods should be used. If the source
of correlations among the rows or columns is known, then these dependencies should be
modeled directly. In microarrays, for example, if the batches or processing dates are known,
then one may adjust for these (Li and Rabinovic, 2007; Leek et al., 2010b). There are
many instances, however, in which this information may be unknown to the statistician.
Several public repositories of gene expression data, for example, do not always include this
additional information (Edgar et al., 2002). It is in this instance, when two-way dependencies
are suspected but not confirmed, that we recommend using our sphering algorithm to adjust
for these possible two-way correlations.
A disadvantage of our method is its computational cost. Fitting the transposable regularized covariance model with L1 penalties is approximately O(k(m^3 + p^3)), where k is the
total number of iterations needed until convergence. Thus, directly fitting this model to
microarrays, for example, where m may be twenty or thirty thousand, is not currently feasible. A simple remedy is to first filter the genes by the absolute value of their un-sphered
T-statistics down to, say, 1,000 or 500 genes. Since the signal in each gene remains the same before and
after sphering, filtering should not affect the power to detect non-null genes, especially since
researchers are rarely interested in re-testing over 500 genes. As future work, we will examine approximations to the TRCM covariance estimates that can be used in high-dimensional
settings and would circumvent the need to filter the genes before sphering.
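A sketch of this filtering step follows. It reuses sam.tstat from the earlier sketch, and it assumes the sphering routine is available as a function tsphere(); that interface is hypothetical, standing in for the forthcoming Tsphere package.

filter.then.sphere <- function(X, labels, n.keep = 1000) {
  ## rank genes by their un-sphered statistics and keep the top n.keep
  t.raw <- sam.tstat(X, labels)  # from the earlier sketch
  keep <- order(abs(t.raw), decreasing = TRUE)[seq_len(min(n.keep, nrow(X)))]
  ## sphere only the retained submatrix; tsphere() is a hypothetical
  ## stand-in for the forthcoming Tsphere interface
  list(index = keep, sphered = tsphere(X[keep, , drop = FALSE]))
}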
There are many components of our work that deserve further investigation and testing.
Allen and Tibshirani (2010) outline some of the properties of the TRCM covariance estimates,
but several questions, such as the consistency of the estimates, remain. We conjecture that
under the right conditions, one may prove this consistency even when the covariances are
estimated from a single matrix of data. Also, alternative methods of estimating the altered
variance of the Z-statistic, η, when there are column correlations should be examined.
In conclusion, we have studied and proposed a solution to the many problems associated with conducting inference on matrix data exhibiting two-way dependencies. While
our approach is based on the simple concept of sphering the data before conducting
inference, we have shown that it is a powerful tool for reversing the ill effects of row and
column correlations, leaving open many possibilities for further research.
The R language software package Tsphere implementing our Transposable Sphering
methods will be made available on CRAN, http://cran.r-project.org/.
6. Acknowledgments
We would like to thank Jonathan Taylor for several helpful comments and conversations
regarding this work. Thanks to two anonymous referees and the Editor and Associate
Editor for constructive comments and suggestions that led to substantial improvements in
this paper. Thanks also to Bradley Efron whose observations and ideas on microarrays
partly inspired this work. Robert Tibshirani was supported by National Science Foundation
Grant DMS-9971405 and National Institutes of Health Contract N01-HV-28183.
A. Proofs
Proof (Claim 1). This result is straightforward: if $x \sim N(0, \Delta)$, then $a^T x \sim N(0, a^T \Delta a)$.
Proof (Calculations for Section 2.3). For all of these calculations, we assume that $z_k \overset{\mathrm{iid}}{\sim} N(0, 1)$.

Batch Effect Model: As the random batch effects $\beta_k$ are independent, the residuals have non-zero covariance only if $j$ and $j'$ are in the same batch, say $k$:
$$E\big(r^{(B)}_{ij} r^{(B)}_{ij'}\big) = E\big((\mu_k + \sigma_k z_k)(\mu_k + \sigma_k z_k)\big) = E\big(\mu_k^2 + 2 \mu_k \sigma_k z_k + \sigma_k^2 z_k^2\big) = \mu_k^2 + \sigma_k^2.$$
Latent Variable Model:
$$E\big(r^{(L)}_{ij} r^{(L)}_{ij'}\big) = E\left(\Big(\sum_{k=1}^{K} z_k G_{kj} + z_{K+1}\Big)\Big(\sum_{k=1}^{K} z_k G_{kj'} + z_{K+2}\Big)\right) = E\left(\Big(\sum_{k=1}^{K} z_k G_{kj}\Big)\Big(\sum_{k=1}^{K} z_k G_{kj'}\Big)\right)$$
$$= E\left(\sum_{k=1}^{K} z_k^2 G_{kj} G_{kj'}\right) = \sum_{k=1}^{K} G_{kj} G_{kj'}.$$
Instrument Drift Model:
$$E\big(r^{(D)}_{ij} r^{(D)}_{ij'}\big) = E\left(\Big(\mu (j - 1) + \sigma \sum_{k=1}^{j} z_k\Big)\Big(\mu (j' - 1) + \sigma \sum_{k=1}^{j'} z_k\Big)\right)$$
$$= (j - 1)(j' - 1) \mu^2 + \sigma^2 E\left(\Big(\sum_{k=1}^{j} z_k\Big)\Big(\sum_{k=1}^{j'} z_k\Big)\right) = (j - 1)(j' - 1) \mu^2 + \sigma^2 E\left(\sum_{k=1}^{j \wedge j'} z_k^2\right)$$
$$= (j - 1)(j' - 1) \mu^2 + \sigma^2 (j \wedge j').$$
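As a sanity check on this calculation, the following Monte Carlo sketch (ours, not from the paper) compares the empirical and theoretical values of $E\big(r^{(D)}_j r^{(D)}_{j'}\big)$ for one arbitrary choice of $(j, j', \mu, \sigma)$:

set.seed(1)
mu <- 0.5; sigma <- 2; j <- 3; jp <- 5; B <- 1e5
r <- replicate(B, {
  z <- rnorm(jp)  # shared drift innovations z_1, ..., z_j'
  c(mu * (j - 1) + sigma * sum(z[1:j]),
    mu * (jp - 1) + sigma * sum(z[1:jp]))
})
mean(r[1, ] * r[2, ])                             # empirical value
(j - 1) * (jp - 1) * mu^2 + sigma^2 * min(j, jp)  # theoretical value: 14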
Proof (Proposition 1). For part (i), the matrix decomposition model implies that $E(X_{ij}) = \nu_i + \mu_j + \psi_{k,i}$ for $k = 1, 2$ depending on the class of array $j$. Each element of the noise as defined in Step 2 can be written as $\hat{N}_{ij} = X_{ij} - \hat{\mu}_j - \hat{\nu}_i - \hat{\psi}_{k,i}$, where $\hat{\mu}_j = \frac{1}{n} \sum_{i=1}^{n} X_{ij}$, $\hat{\nu}_i = \frac{1}{p} \sum_{j=1}^{p} (X_{ij} - \hat{\mu}_j)$, and $\hat{\psi}_{k,i} = \frac{1}{n_k} \sum_{j \in C_k} (X_{ij} - \hat{\mu}_j - \hat{\nu}_i)$. We show that $E(\hat{N}_{ij}) = 0$, which in the process proves that $E(\tilde{X}_{ij}) = \psi_{k,i}$:
$$E(\hat{\mu}_j + \hat{\nu}_i + \hat{\psi}_{k,i}) = \frac{1}{n} \sum_{i=1}^{n} E(X_{ij}) + \frac{1}{n_k} \sum_{j \in C_k} E(X_{ij}) - \frac{1}{n} \frac{1}{n_k} \sum_{j \in C_k} \sum_{i=1}^{n} E(X_{ij})$$
$$= (\mu_j + \bar{\nu} + \bar{\psi}_k) + (\bar{\mu}_k + \nu_i + \psi_{k,i}) - (\bar{\mu}_k + \bar{\nu} + \bar{\psi}_k) = \mu_j + \nu_i + \psi_{k,i},$$
so that $E(\hat{N}_{ij}) = E(X_{ij}) - (\mu_j + \nu_i + \psi_{k,i}) = 0$.
For part (ii), since $\tilde{N}_0 \triangleq \hat{\Sigma}^{-1/2} N_0 \hat{\Delta}^{-1/2}$ and, following the proof of part (i), $N_0 \sim N_{m,n}(0, 0, \Sigma, \Delta)$, the result follows from a straightforward matrix-variate calculation (Gupta and Nagar, 1999). We include this calculation for completeness. The characteristic function of the centered matrix-variate normal is $\phi_X(Z) = \operatorname{etr}(-\frac{1}{2} Z^T \Sigma Z \Delta)$, where $\operatorname{etr}$ denotes the exponential of the trace function. The characteristic function of $\tilde{N}_0 \,|\, \hat{\Sigma}, \hat{\Delta}$ can then be written as
$$\phi_{\tilde{N}_0 | \hat{\Sigma}, \hat{\Delta}}(Z) = \operatorname{etr}\left[ -\frac{1}{2} \big( \hat{\Sigma}^{-1/2} Z \hat{\Delta}^{-1/2} \big)^T \Sigma \big( \hat{\Sigma}^{-1/2} Z \hat{\Delta}^{-1/2} \big) \Delta \right]$$
$$= \operatorname{etr}\left[ -\frac{1}{2} \hat{\Delta}^{-1/2} Z^T \hat{\Sigma}^{-1/2} \Sigma \hat{\Sigma}^{-1/2} Z \hat{\Delta}^{-1/2} \Delta \right]$$
$$= \operatorname{etr}\left[ -\frac{1}{2} Z^T \big( \hat{\Sigma}^{-1/2} \Sigma \hat{\Sigma}^{-1/2} \big) Z \big( \hat{\Delta}^{-1/2} \Delta \hat{\Delta}^{-1/2} \big) \right].$$
Thus, letting $\tilde{\Sigma} = \hat{\Sigma}^{-1/2} \Sigma \hat{\Sigma}^{-1/2}$ and $\tilde{\Delta} = \hat{\Delta}^{-1/2} \Delta \hat{\Delta}^{-1/2}$, we have $\tilde{N}_0 \,|\, \hat{\Sigma}, \hat{\Delta} \sim N_{m,n}(0, 0, \tilde{\Sigma}, \tilde{\Delta})$.
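To illustrate part (ii) numerically, the following R sketch (ours, not from the paper; it uses Cholesky square roots in place of symmetric ones, which does not change the claim) generates matrix-variate noise and spheres it with the true covariances, i.e., the case $\hat{\Sigma} = \Sigma$ and $\hat{\Delta} = \Delta$, in which $\tilde{\Sigma}$ and $\tilde{\Delta}$ are identity matrices:

set.seed(1)
m <- 4; n <- 5; B <- 2e4
Sigma <- 0.5^abs(outer(1:m, 1:m, "-"))  # AR(1) row covariance
Delta <- 0.7^abs(outer(1:n, 1:n, "-"))  # AR(1) column covariance
R.s <- chol(Sigma); R.d <- chol(Delta)  # Sigma = R.s' R.s, Delta = R.d' R.d
S <- matrix(0, m, m)
for (b in 1:B) {
  N <- t(R.s) %*% matrix(rnorm(m * n), m, n) %*% R.d  # N ~ N_{m,n}(0, Sigma, Delta)
  Nt <- solve(t(R.s), N) %*% solve(R.d)               # sphered noise
  S <- S + Nt %*% t(Nt) / n                           # row covariance estimate
}
round(S / B, 2)  # approximately the m x m identity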
Proof (Claim 2). Since we are considering the test statistic for one gene, we will suppress the index $i$. We can define the random variables $Z \triangleq (\bar{X}_1 - \bar{X}_2)/(\sigma \sqrt{c_n})$ and $D \triangleq s^2_{\tilde{X}_1 \tilde{X}_2} / \sigma^2$. Then, $\tilde{T}$ can be written as $\tilde{T} = Z / \sqrt{D}$. From Section 2.2, $Z \sim N\big( (\psi_1 - \psi_2)/(\sigma \sqrt{c_n}), \, \eta / c_n \big)$. Then, under the null, $Z / \sqrt{\eta / c_n} \sim N(0, 1)$. Also, under the null, $D \sigma^2 / \tilde{\sigma}^2 \sim \chi^2_{(n-2)}$ with $D$ and $Z$ independent. Then,
$$\frac{Z / \sqrt{\eta / c_n}}{\sqrt{D \sigma^2 / \tilde{\sigma}^2}} \sim t_{(n-2)} \quad \Rightarrow \quad \tilde{T} = \frac{Z}{\sqrt{D}} \sim \frac{\sigma}{\tilde{\sigma}} \sqrt{\frac{\eta}{c_n}} \; t_{(n-2)}.$$
References
Akey, J., S. Biswas, J. Leek, and J. Storey (2007). On the design and analysis of gene expression
studies in human populations. Nature Genetics 39 (7), 807–808.
Allen, G. I. and R. Tibshirani (2010). Transposable regularized covariance models with an application
to missing data imputation. Annals of Applied Statistics 4 (2), 764–790.
Benjamini, Y. and Y. Hochberg (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological) 57 (1), 289–300.
Benjamini, Y. and D. Yekutieli (2001). The control of the false discovery rate in multiple testing
under dependency. Annals of Statistics 29, 1165–1188.
Casella, G. and R. Berger (2001). Statistical inference. Duxbury Press.
Dempster, A. P. (1972). Covariance selection. Biometrics 28 (1), 157–175.
Desai, K., J. Deller, and J. McCormick (2009). The distribution of number of false discoveries for
highly correlated null hypotheses. The Annals of Applied Statistics.
Dudoit, S., J. Shaffer, and J. Boldrick (2003). Multiple hypothesis testing in microarray experiments. Statistical Science 18 (1), 71–103.
Dutilleul, P. (1999). The MLE algorithm for the matrix normal distribution. J. Statist. Comput.
Simul. 64, 105–123.
Edgar, R., M. Domrachev, and A. Lash (2002). Gene Expression Omnibus: NCBI gene expression
and hybridization array data repository. Nucleic Acids Research 30 (1), 207.
Efron, B. (2004). Large-scale simultaneous hypothesis testing: The choice of a null hypothesis.
Journal of the American Statistical Association 99, 96–104.
Efron, B. (2007). Size, power and false discovery rates. Ann. Statist. 35 (4), 1351–1377.
Efron, B. (2009). Are a set of microarrays independent of each other? Ann. Appl. Statist. 3 (3),
922–942.
Efron, B. (2010). Correlated z-values and the accuracy of large-scale statistical estimates. Journal
of the American Statistical Association 105 (491), 1042–1055.
El Karoui, N. (2008). Operator norm consistent estimation of large-dimensional sparse covariance
matrices. Ann. Statist. 36 (6), 2717–2756.
Farcomeni, A. (2008). A review of modern multiple hypothesis testing, with particular attention
to the false discovery proportion. Statistical Methods in Medical Research 17 (4), 347.
Fare, T., E. Coffey, H. Dai, Y. He, D. Kessler, K. Kilian, J. Koch, E. LeProust, M. Marton,
M. Meyer, et al. (2003). Effects of atmospheric ozone on microarray data quality. Analytical
Chemistry 75 (17), 4672–4675.
Friedman, J., T. Hastie, and R. Tibshirani (2007). Sparse inverse covariance estimation with the
lasso. Biostatistics 9 (3), 432–441.
Gupta, A. K. and D. K. Nagar (1999). Matrix variate distributions, Volume 104 of Monographs and
Surveys in Pure and Applied Mathematics. Boca Raton, FL: Chapman & Hall, CRC Press.
Johnstone, I. and A. Lu (2009). On consistency and sparsity for principal components analysis in
high dimensions. Journal of the American Statistical Association 104 (486), 682–693.
Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis.
Ann. Statist. 29, 295–327.
Lai, Y. (2008). Genome-wide co-expression based prediction of differential expressions. Bioinformatics 24 (5), 666.
Leek, J., R. Scharpf, H. Bravo, D. Simcha, B. Langmead, W. Johnson, D. Geman, K. Baggerly,
and R. Irizarry (2010a). Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics 11 (10), 733–739.
Leek, J., R. Scharpf, H. Bravo, D. Simcha, B. Langmead, W. Johnson, D. Geman, K. Baggerly,
and R. Irizarry (2010b). Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics 11 (10), 733–739.
Leek, J. T. and J. D. Storey (2008). A general framework for multiple testing dependence. Proceedings of the National Academy of Sciences 105 (48), 18718–18723.
Lehmann, E. and J. Romano (2005). Testing statistical hypotheses. Springer Verlag.
Li, C. and A. Rabinovic (2007). Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8 (1), 118.
Muralidharan, O. (2010). Detecting column dependence when rows are correlated and estimating
the strength of the row correlation. Electronic Journal of Statistics 4, 1527–1546.
Olshen, R. A. and B. Rajaratnam (2010). Successive normalization of rectangular arrays. Ann.
Statist. 38 (3), 1638–1664.
Owen, A. B. (2005). Variance of the number of false discoveries. Journal Of The Royal Statistical
Society Series B 67 (3), 411–426.
Qiu, X., A. I. Brooks, L. Klebanov, and A. Yakovlev (2005). The effects of normalization on the
correlation structure of microarray data. BMC Bioinformatics 6 (120).
Qiu, X. and A. Yakovlev (2006). Some comments on instability of false discovery rate estimation.
Journal of Bioinformatics and Computational Biology 4 (5), 1057–1068.
Rothman, A. J., P. J. Bickel, E. Levina, and J. Zhu (2008). Sparse permutation invariant covariance
estimation. Electron. J. Stat. 2, 494–515.
Sarkar, S. K. (2008). On methods controlling the false discovery rate. Sankhya: The Indian Journal
of Statistics 70, 135–168.
Schwartzman, A. and X. Lin (2011). The effect of correlation in false discovery rate estimation.
Biometrika 98 (1), 199–214.
Spielman, R., L. Bastone, J. Burdick, M. Morley, W. Ewens, and V. Cheung (2007). Common
genetic variants account for differences in gene expression among ethnic groups. Nature Genetics 39 (2), 226–231.
Storey, J. D. (2002). A direct approach to false discovery rates. Journal Of The Royal Statistical
Society Series B 64 (3), 479–498.
Storey, J. D., J. E. Taylor, and D. Siegmund (2004). Strong control, conservative point estimation
and simultaneous conservative consistency of false discovery rates: a unified approach. Journal
Of The Royal Statistical Society Series B 66 (1), 187–205.
Storey, J. D. and R. Tibshirani (2003). Statistical significance for genomewide studies. Proceedings
of the National Academy of Sciences 100 (16), 9440–9445.
Teng, S. L. and H. Huang (2009). A statistical framework to infer functional gene relationships
from biologically interrelated microarray experiments. Journal of the American Statistical Association 104 (486), 465–473.
Tibshirani, R. and L. Wasserman (2006). Correlation-sharing for detection of differential gene
expression. Technical Report 839, Carnegie Mellon University.
Tusher, V. G., R. Tibshirani, and G. Chu (2001). Significance analysis of microarrays applied to
the ionizing radiation response. Proceedings of the National Academy of Sciences of the United
States of America 98 (9), 5116–5121.
Yang, Y., S. Dudoit, P. Luu, D. Lin, V. Peng, J. Ngai, and T. Speed (2002). Normalization
for cDNA microarray data: a robust composite method addressing single and multiple slide
systematic variation. Nucleic Acids Research 30 (4), e15.
Yekutieli, D. and Y. Benjamini (1999). Resampling-based false discovery rate controlling multiple
test procedures for correlated test statistics. Journal of Statistical Planning and Inference 82 (1–2), 171–196.
Zuber, V. and K. Strimmer (2009). Gene ranking and biomarker discovery under correlation.
Bioinformatics 25 (20), 2700–2707.