Modeling The Marginal Distribution of Gene
Expression with Mixture Models
Edward Wijaya, Hajime Harada and Paul Horton
AIST, Computational Biology Research Center,
2-42 Aomi, Koutou-Ku, Tokyo
[email protected]
Abstract
We report the results of fitting mixture models to the distribution of
expression values for individual genes over a broad range of normal tissues,
which we call the marginal expression distribution of the gene. The base
distributions used were normal, lognormal and gamma. The standard
expectation-maximization algorithm was used to learn the model parameters. Experiments with artificial data were performed to ascertain the
robustness of learning. Applying the procedure to data from two publicly
available microarray datasets, we conclude that the lognormal distribution
performed best for modeling the marginal distributions of gene expression.
Our results should provide some guidance in the development of informed
priors or gene specific normalization for use with gene network inference
algorithms.
1 Introduction
Several studies have used finite mixtures to model the distributions of gene expression values. Notable works include those by Hoyle et al. [5] and Ji et al. [6].
Hoyle et al. investigated the overall distribution of expression levels of mRNA extracted from human tissues. Ji et al. examined the distribution of correlation coefficients of gene expression in cancer cells. However, less analysis has been
done on the marginal distribution of gene expression levels.
In this paper we present a preliminary analysis of modeling the marginal
distributions using mixture models with normal, lognormal or gamma distributions as the model components. In contrast to previous works, this study
attempts to answer the following questions:
1. Is there a generic form of distribution that best describes the marginal
expression of genes?
2. Can we find what is common amongst the genes that have similar mixture
components?
Published in the Proceedings of Bioscience and BioTechnology (BSBT2008), Sānyà, China,
December 2008. Post-proceedings published as: CCIS, 28:164-175, 2009.
The gamma and lognormal distributions belong to the family of skewed
distributions. We expected that these distributions could model microarray data,
which are often skewed [9]. Additionally, we use the normal distribution
as a control against which to compare the performance of the gamma and lognormal mixture
models.
We selected the gamma distribution because of its flexible shape. Furthermore, it has been successfully used in many studies of biological systems [4,7,11].
With regard to the lognormal distribution, there is strong evidence that this
distribution appears in many biological phenomena [10]. In practice it is also
convenient for analyzing microarray data because calculations are easy to perform and z-scores, a possible common unit
for data comparison, can be determined from the data [8].
Below we describe the details of our methods and experimental results.
2 Methods
2.1 Statistical Model
Let $\{x_i\}$, $i = 1, \ldots, N$, denote the expression values of a gene probe, where $N$
is the total number of observations (samples). Under a mixture model with $K$ components, the
probability density function for a data point $x$ is:
$$p(x) = \sum_{j=1}^{K} p(x|j)\,p(j) \qquad (1)$$
The density function for each component is denoted as $p(x|j)$, and $p(j)$ denotes
the prior probability of the data point having been generated from component $j$
of the mixture. These priors are chosen to satisfy the constraint $\sum_{j=1}^{K} p(j) = 1$.
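As a minimal illustration of equation (1), the sketch below evaluates a mixture density with lognormal components in Python. The function names, the (mu, sigma) parameterization and the numeric values are illustrative assumptions; they are not taken from the original implementation, which was written in R.

```python
import math

def lognormal_pdf(x, mu, sigma):
    """Lognormal density with log-scale mean mu and log-scale sd sigma."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, priors, params):
    """Equation (1): p(x) = sum_j p(x|j) p(j)."""
    return sum(p_j * lognormal_pdf(x, mu, sigma)
               for p_j, (mu, sigma) in zip(priors, params))

# Hypothetical two-component model; the values are illustrative only.
priors = [0.7, 0.3]                  # p(j), constrained to sum to 1
params = [(0.0, 0.5), (2.0, 0.4)]    # (mu, sigma) for each component
density_at_one = mixture_pdf(1.0, priors, params)
```

Swapping `lognormal_pdf` for a normal or gamma density would give the other two model families considered here without changing `mixture_pdf`.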
The log likelihood function of the data is given by:
$$LL = \log L = \sum_{i=1}^{N} \log \sum_{j=1}^{K} p(x_i|j)\,p(j)$$
2.2 Expectation Maximization
Using the R programming language, we implemented the standard expectation-maximization (EM) algorithm to learn mixture models of normal, lognormal
and gamma distributions for each probe’s expression level. The EM algorithm
iteratively maximizes the loglikelihood and updates the conditional probability
that $x$ comes from the $j$-th component. This is defined as:
$$p(x|j)^* = E[\,p(x|j) \mid x, \hat\alpha_1, \hat\beta_1, \ldots, \hat\alpha_K, \hat\beta_K\,]$$
The set of parameters $[\hat\alpha_1, \hat\beta_1, \ldots, \hat\alpha_K, \hat\beta_K]$ maximizes the loglikelihood for
given $p(x|j)$. The EM algorithm iterates between an E-step, in which the values $p(x|j)^*$
are computed from the current parameter estimates, and an M-step, in which the
loglikelihood with each $p(x|j)$ replaced by its current conditional expectation
$p(x|j)^*$ is maximized with respect to the parameters $\alpha$ and $\beta$. An outline of the
EM algorithm is as follows:
1. Initialize $\alpha$ and $\beta$. This is done by first randomly partitioning the dataset
into $K$ groups and then calculating method-of-moments estimates for each
of the groups.
2. M-step: Given $p(x|j)^*$, maximize the loglikelihood (LL) with respect to the
parameters $\alpha$ and $\beta$. We compute the maximizing values $[\hat\alpha_1, \hat\beta_1, \ldots, \hat\alpha_K, \hat\beta_K]$ numerically.
3. E-step: Given the parameter estimates from the M-step, we compute:
$$p(x|j)^* = E[\,p(x|j) \mid x, \hat\alpha_1, \hat\beta_1, \ldots, \hat\alpha_K, \hat\beta_K\,] = \frac{\hat p(x|j)}{\sum_{j'=1}^{K} \hat p(x|j')}$$
4. Repeat the M-step and E-step until the change in the value of the loglikelihood
(LL) is negligible.
In order to reduce the risk of ending in a poor local maximum, we run the above EM algorithm multiple
times with different starting points.
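The outline above can be sketched in Python for the lognormal case. This is a hedged illustration, not the original R implementation. It relies on the fact that a lognormal mixture on $x$ is a normal mixture on $\log x$, so the M-step is done here with closed-form weighted moments rather than the numerical maximization described above. It also weights the responsibilities by the mixing proportions, which is the usual EM form. All function names, defaults and the variance floor `min_sigma` are assumptions made for this example.

```python
import math
import random

def component_terms(y, weights, mus, sigmas):
    """Terms p(j) * N(y | mu_j, sigma_j) for one log-scale value y."""
    return [w * math.exp(-0.5 * ((y - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
            for w, m, s in zip(weights, mus, sigmas)]

def em_lognormal(xs, k, rng, max_iter=500, tol=1e-8, min_sigma=1e-3):
    """One EM run for a k-component lognormal mixture (xs must be positive).

    Returns (loglikelihood, weights, mus, sigmas).
    """
    ys = [math.log(x) for x in xs]
    n = len(ys)
    # Step 1: random partition into k groups, stored as hard responsibilities,
    # so the first M-step yields per-group moment estimates.
    labels = [i % k for i in range(n)]
    rng.shuffle(labels)
    resp = [[1.0 if labels[i] == j else 0.0 for j in range(k)] for i in range(n)]
    prev_ll = -math.inf
    for _ in range(max_iter):
        # M-step: weighted means and standard deviations on the log scale.
        weights, mus, sigmas = [], [], []
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            mu = sum(r[j] * y for r, y in zip(resp, ys)) / nj
            var = sum(r[j] * (y - mu) ** 2 for r, y in zip(resp, ys)) / nj
            weights.append(nj / n)
            mus.append(mu)
            sigmas.append(max(math.sqrt(var), min_sigma))
        # E-step: responsibilities and loglikelihood under the new parameters.
        ll = 0.0
        for i, y in enumerate(ys):
            terms = component_terms(y, weights, mus, sigmas)
            total = sum(terms) + 1e-300
            resp[i] = [t / total for t in terms]
            ll += math.log(total) - y   # "- y" is the Jacobian of x -> log(x)
        # Step 4: stop when the change in LL is negligible.
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return ll, weights, mus, sigmas

def fit_with_restarts(xs, k, n_starts=10, seed=0):
    """Run EM from several random partitions and keep the highest-LL fit."""
    rng = random.Random(seed)
    return max((em_lognormal(xs, k, rng) for _ in range(n_starts)),
               key=lambda result: result[0])
```

A call such as `fit_with_restarts(expression_values, 2)` returns the best of ten runs. A gamma mixture would need a numerical M-step for the shape parameter, which this sketch does not attempt.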
2.3 Model Selection
When fitting mixture models to expression data, it is necessary to choose an
appropriate number of components, which fits the data well but does not overfit. For this task we tried two information criteria: AIC (Akaike Information
Criterion [1]) and BIC (Bayesian Information Criterion [12]). Specifically:
$$AIC = -2LL + 2K$$
and
$$BIC = -2LL + K \log(n)$$
where $n$ is the number of observations. To choose models, we fit mixture models with the EM algorithm for one
to five components and chose the model with the smallest information criterion
value (the degrees of freedom $K$ in the above formulas is equal to $3c - 1$ for $c$
components).
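The selection rule can be sketched as follows. The count $3c - 1$ corresponds to two parameters per component plus $c - 1$ free mixing proportions. The function names and the loglikelihood values are hypothetical and chosen only to show how the two criteria can disagree; they are not results from this study.

```python
import math

def degrees_of_freedom(c):
    """3c - 1: two parameters per component plus c - 1 mixing proportions."""
    return 3 * c - 1

def aic(loglik, c):
    return -2.0 * loglik + 2.0 * degrees_of_freedom(c)

def bic(loglik, c, n):
    return -2.0 * loglik + degrees_of_freedom(c) * math.log(n)

def select_components(logliks, n, criterion="bic"):
    """Return the number of components with the smallest criterion value.

    logliks maps a component count c to the maximized loglikelihood LL.
    """
    def score(c):
        if criterion == "bic":
            return bic(logliks[c], c, n)
        return aic(logliks[c], c)
    return min(logliks, key=score)

# Hypothetical loglikelihoods for a single probe observed in n = 158 samples.
example_logliks = {1: -250.0, 2: -230.0, 3: -225.0}
best_bic = select_components(example_logliks, n=158, criterion="bic")
best_aic = select_components(example_logliks, n=158, criterion="aic")
```

With these made-up values BIC selects two components and AIC selects three, because the BIC penalty per parameter, $\log(158) \approx 5.06$, is larger than the AIC penalty of 2.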
3 Experimental Results
3.1 Estimating Number of Components
We generated simulated datasets from mixture models containing one, two
and three components. We performed two sets of equivalent experiments, one
using the gamma and one using the lognormal distribution for the mixture model
components. For the component parameters, each distinct combination of the
parameter pairs shown in Table 1 was tried, yielding 4 mixture models each with one
and with three components and $6 = \binom{4}{2}$ mixture models with two components. The
mixing proportions ($p(j)$ in equation 1) were set to the uniform distribution in
each case. To aid comparison with the real data described below, we simulated
datasets with sizes 122 and 158.
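A sketch of how such simulated datasets could be generated for the gamma case is given below. It assumes that $\alpha$ is the shape and $\beta$ the scale of the gamma distribution, which the text does not state, and it draws a single dataset per model and size rather than repeated trials. The function names and the seed are illustrative.

```python
import itertools
import random

# (alpha, beta) parameter pairs from Table 1.
PARAMETER_PAIRS = [(0.1, 7), (0.1, 14), (5, 3), (10, 6)]

def generating_models(n_components):
    """All distinct combinations of n_components parameter pairs."""
    return list(itertools.combinations(PARAMETER_PAIRS, n_components))

def sample_gamma_mixture(components, size, rng):
    """Draw `size` values from a gamma mixture with uniform mixing proportions.

    Assumption: alpha is the shape and beta the scale parameter.
    """
    sample = []
    for _ in range(size):
        alpha, beta = rng.choice(components)   # uniform p(j)
        sample.append(rng.gammavariate(alpha, beta))
    return sample

rng = random.Random(0)
datasets = {
    (k, size): [sample_gamma_mixture(model, size, rng)
                for model in generating_models(k)]
    for k in (1, 2, 3)
    for size in (122, 158)
}
```

The number of generating models per component count (4, 6 and 4) follows directly from `itertools.combinations` over the four parameter pairs.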
Figure 1 shows how effectively the information criteria infer the number of components in the generating model from the generated
data. BIC outperforms AIC, correctly inferring the number of components in
about 80% of the trials. Even for BIC, the error is slightly skewed toward
overpredicting the number of components. This suggests it may be possible to
further optimize the criteria for this task, but we did not pursue this possibility.
Parameters   α     β
1            0.1   7
2            0.1   14
3            5     3
4            10    6
Table 1: The parameter pairs used as components of the mixture models that generated the simulated data.
3.2 Real Dataset
As a preliminary study, we investigated gene expression datasets from human (GDS596) and mouse (GDS592) obtained from the GEO gene expression data repository [3] (http://www.ncbi.nlm.nih.gov/geo/). GDS596 contains
data from a study profiling 158 types of normal human tissue (22,283 probes)
and GDS592 contains data from 122 types of mouse tissue (31,373 probes) [13].
3.2.1 Likelihood and K-S Test Comparison
We evaluate the likelihood of mixture models built from the three types of distributions
on the real datasets. Furthermore, we evaluate the goodness of fit of each model using the
Kolmogorov-Smirnov (K-S) test, as represented by D, the maximum discrepancy
between the empirical and fitted cumulative probability distributions, and a p-value.
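The D statistic can be computed directly from a fitted cumulative distribution function, as in the sketch below for a lognormal mixture. The function names are illustrative assumptions, and the p-value computation is omitted because the text does not specify how it was obtained.

```python
import math

def lognormal_cdf(x, mu, sigma):
    """Lognormal CDF with log-scale mean mu and log-scale sd sigma."""
    if x <= 0:
        return 0.0
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

def mixture_cdf(x, priors, params):
    """CDF of a lognormal mixture: sum_j p(j) * F_j(x)."""
    return sum(p * lognormal_cdf(x, mu, sigma)
               for p, (mu, sigma) in zip(priors, params))

def ks_statistic(sample, cdf):
    """D = max |F_n(x) - F(x)|, checked on both sides of each empirical jump."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d
```

For example, `ks_statistic(values, lambda x: mixture_cdf(x, priors, params))` gives D for one probe, where `values`, `priors` and `params` stand for that probe's expression values and fitted mixture.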
        Normal                          Lognormal                      Gamma
#Comp   Loglik     D         p-value    Loglik    D         p-value    Loglik    D         p-value
1       -1063.02   1.04e-3   0.99       -212.85   2.30e-4   0.99       -968.59   3.00e-3   0.99
2       -979.13    1.74e-3   0.99       -205.58   4.55e-4   0.99       -955.11   1.60e-3   0.99
3       -952.29    2.72e-3   0.99       -204.55   7.04e-4   0.99       -963.20   1.19e-3   0.99
4       -913.66    4.63e-2   0.92       -203.05   8.27e-4   0.99       -967.20   9.88e-4   0.99
5       -881.67    1.55e-2   0.90       -201.78   1.26e-3   0.99       -968.56   5.77e-4   0.99
Table 2: Loglikelihood, D and p-value, averaged over each probe of the GDS596
dataset.
The goodness-of-fit p-values indicate that the gamma mixtures
can fit the marginal distribution of gene expression reasonably well. However,
lognormal mixtures fit better than gamma mixtures. Over all experiments they
obtain a higher likelihood than the gamma mixtures. The K-S test also confirms
this observation: the D statistic of the lognormal mixtures is smaller than that of the gamma mixtures by
an order of magnitude.
Note that the likelihood in Tables 2 and 3 does not always improve monotonically with the number of components. We believe this is due to the EM
procedure getting trapped in poor quality local optima when more components
are used.
        Normal                         Lognormal                     Gamma
#Comp   Loglik    D         p-value    Loglik   D         p-value    Loglik    D         p-value
1       -728.47   1.14e-3   0.99       -85.04   7.70e-5   0.99       -492.66   1.72e-3   0.99
2       -683.53   1.44e-3   0.99       -47.39   7.73e-5   0.99       -437.79   1.05e-3   0.99
3       -661.46   2.70e-3   0.99       -46.64   9.20e-5   0.99       -487.58   1.26e-3   0.99
4       -637.21   5.79e-3   0.97       -46.01   6.17e-5   0.99       -474.41   1.19e-3   0.99
5       -612.37   9.44e-2   0.88       -45.08   7.45e-5   0.99       -465.15   8.36e-4   0.99
Table 3: Loglikelihood, D and p-value, averaged over each probe of the GDS592
dataset.
3.2.2 Marginal Distribution of CSN3 gene
CSN3 is a component of the COP9 signalosome complex, a complex involved
in signal transduction. Figure 2 shows examples of how one- and two-component
mixtures of normal, lognormal, and gamma distributions fit the marginal distribution of this
gene. The data are taken from the GDS596 dataset.
3.2.3 Gene Ontology Comparison
From the two datasets we identified “two component genes”, the genes with
probes for which the BIC criterion suggested two components for the learned
lognormal mixture. (We found 256 such genes for GDS596 and 56 for GDS592.)
We examined the overrepresented terms for each dataset using the web tool
Babelomics (http://www.babelomics.org/) [2].
One cellular component term, intracellular, occurs in both
of the datasets.
Terms                          GO           p-value
cytoskeleton                   GO:0005856   0.000428
non-membrane-bound organelle   GO:0043228   0.00083
organelle part                 GO:0044422   0.00633
intracellular                  GO:0005622   0.0117
cytoplasm                      GO:0005737   0.175
Table 4: Statistically overrepresented terms from the “two component” probes
in GDS596
Terms                                GO           p-value
proteinaceous extracellular matrix   GO:0005578   0.0618
intracellular                        GO:0005622   0.0548
midbody                              GO:0030496   0.0985
cell soma                            GO:0043025   0.0933
extracellular space                  GO:0005615   0.116
Table 5: Statistically overrepresented terms from the “two component” probes
in GDS592
4 Conclusion
In this paper we provide a statistical framework using normal, lognormal, and
gamma mixture models for analyzing the marginal distributions of expression
levels in gene probes. To our knowledge this is the first study that provides such
a framework for analyzing expression data.
Although theoretically gamma distributions are capable of modeling skewed
distributions, our experiments showed that the lognormal distribution appears to be more suitable for modeling the marginal distribution of gene expression. We also showed
that, of the two model selection criteria we used, BIC is more accurate in
selecting the number of components for lognormal and gamma mixtures. AIC,
on the other hand, tends to overestimate the number of components.
We hypothesize that different functional categories of genes (e.g. transcription factors, kinases, structural proteins, etc.) may show similar marginal distributions. Unfortunately this expectation is not clearly supported by our study.
Only the single, vague gene ontology term intracellular was found to be overrepresented in both datasets. We believe follow-up experiments are necessary
to determine if this is due to the quantity/quality of the expression data used,
a deficiency in our methodology, or whether our hypothesis is simply wrong.
To achieve more definitive results we are now preparing to analyze a much
larger dataset including multiple GEO datasets. This will be essential to sample
the expression probes at the resolution needed to accurately model multimodal
marginal distributions. Our results should provide some guidance in the development of informed priors or gene specific normalization for use with gene
network inference algorithms.
References
[1] H. Akaike. Information theory and extension of the maximum likelihood
principle. In Second International Symposium on Information Theory,
pages 267–281, 1973.
[2] F. Al-Shahrour, P. Minguez, J.M. Vaquerizas, L. Conde, and J. Dopazo.
Babelomics: a suite of web tools for functional annotation and analysis of
groups of genes in high-throughput experiments. Nucleic Acids Research,
33:W460, 2005.
[3] T. Barrett, D. B. Troup, S. E. Wilhite, P. Ledoux, D. Rudnev, C. Evangelista, I. F. Kim, A. Soboleva, M. Tomashevsky, and R. Edgar. NCBI GEO:
mining tens of millions of expression profiles - database and tools update.
Nucleic Acids Research, 35(Database-Issue):760–765, 2007.
[4] B. Dennis and G. P. Patil. The gamma distribution and weighted multimodal gamma distributions as models of population abundance. Mathematical Biosciences, 68:187–212, 1984.
[5] D. Hoyle, M. Rattray, R. Jupp, and A. Brass. Making sense of microarray
data distributions. Bioinformatics, 18:576–584(9), April 2002.
[6] Y. Ji, C. Wu, P. Liu, J. Wang, and K. R. Coombes. Applications of beta-mixture models in bioinformatics. Bioinformatics, 21(9):2118–2122, 2005.
[7] S. Keles. Mixture modeling for genome-wide localization of transcription
factors. Biometrics, 63(1):2118–2122, 2007.
[8] T. Konishi. Parametric treatment of cDNA microarray data. Genome
Informatics, 7(13):280–281, 2002.
[9] V.A. Kuznetsov. Family of skewed distributions associated with the gene
expression and proteome evolution. Signal Process., 83(4):889–910, 2003.
[10] E. Limpert, W.A. Stahel, and M. Abbt. Log-normal distributions across
the sciences: keys and clues. Bioscience, 51(5):341–352, 2001.
[11] I. Mayrose, N. Friedman, and T. Pupko. A gamma mixture model better
accounts for among site rate heterogeneity. Bioinformatics, 21(2):151–158,
2005.
[12] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics,
6:461–464, 1978.
[13] A. I. Su, T. Wiltshire, S. Batalov, H. Lapp, K. A. Ching, D. Block, J. Zhang,
R. Soden, M. Hayakawa, G. Kreiman, M. P. Cooke, J. R. Walker, and
J. B. Hogenesch. A gene atlas of the mouse and human protein-encoding
transcriptomes. Proc Natl Acad Sci U S A, 101(16):6062–6067, April 2004.
Figure 1: Performance of gamma and lognormal mixture models in predicting
number of components on simulated datasets. The percentage of accuracy is
taken from the average performance of 4 or 6 parameter settings x 100 trials x
2 dataset sizes.
Figure 2 (panels: Normal, Lognormal, Gamma): The components (one or two) of mixture models of three distribution
types fit to the marginal distribution of the CSN3 gene are shown.