Modeling the Marginal Distribution of Gene Expression with Mixture Models

Edward Wijaya, Hajime Harada and Paul Horton
AIST, Computational Biology Research Center, 2-42 Aomi, Koutou-Ku, Tokyo

Abstract

We report the results of fitting mixture models to the distribution of expression values for individual genes over a broad range of normal tissues, which we call the marginal expression distribution of the gene. The base distributions used were normal, lognormal and gamma. The standard expectation-maximization algorithm was used to learn the model parameters. Experiments with artificial data were performed to ascertain the robustness of learning. Applying the procedure to data from two publicly available microarray datasets, we conclude that the lognormal distribution performed best for modeling the marginal distributions of gene expression. Our results should provide some guidance in the development of informed priors or gene-specific normalization for use with gene network inference algorithms.

1 Introduction

Several studies have used finite mixtures to model the distributions of gene expression values. Notable works include those by Hoyle et al. [5] and Ji et al. [6]. Hoyle et al. investigated the overall distribution of expression levels of mRNA extracted from human tissues. Ji et al. examined the distribution of correlation coefficients of gene expression in cancer cells. However, less analysis has been done on the marginal distribution of gene expression levels.

In this paper we present a preliminary analysis of modeling the marginal distributions using mixture models with normal, lognormal or gamma distributions as the model components. In contrast to previous works, this study attempts to answer the following questions:

1. Is there a generic form of distribution that best describes the marginal expression of genes?
2. Can we find what is common amongst the genes that have similar mixture components?
Published in the Proceedings of Bioscience and BioTechnology (BSBT2008), Sānyà, China, December 2008. Post-proceedings published as: CCIS, 28:164-175, 2009.

The gamma and lognormal distributions belong to the family of skewed distributions. We expected that these distributions could model microarray data, which is often skewed [9]. Additionally, we use the normal distribution as a control to gauge how well the gamma and lognormal mixture models perform.

We selected the gamma distribution because of its flexible shape. Furthermore, it has been successfully used in many studies of biological systems [4,7,11]. With regard to the lognormal distribution, there is strong evidence that this distribution appears in many biological phenomena [10]. In practice it is also convenient for analyzing microarray data, because calculations are easy to perform and because it allows z-scores to be determined for the data, a possible common unit for data comparison [8].

Below we describe the details of our methods and experimental results.

2 Methods

2.1 Statistical Model

Let $\{x_i\}$, $i = 1, \dots, N$ denote the expression values of a gene probe, where $N$ is the total number of observations (samples). Under a mixture model with $K$ components, the probability density function for an observation $x$ is:

$$p(x) = \sum_{j=1}^{K} p(x|j)\, p(j) \qquad (1)$$

The density function of each component is denoted $p(x|j)$, and $p(j)$ denotes the prior probability of the data point having been generated from component $j$ of the mixture. These priors are chosen to satisfy the constraint $\sum_{j=1}^{K} p(j) = 1$. The log likelihood function of the data is given by:

$$LL = \log L = \sum_{i=1}^{N} \log \sum_{j=1}^{K} p(x_i|j)\, p(j)$$

2.2 Expectation Maximization

Using the R programming language, we implemented the standard expectation-maximization (EM) algorithm to learn mixture models of normal, lognormal and gamma distributions for each probe's expression level.
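Before turning to the algorithm itself, the mixture density of equation (1) and the log likelihood of a sample (the sum over observations of the log mixture density) can be illustrated with a minimal sketch. It is written in Python rather than the R used for the experiments, it covers only lognormal components, and the function names, mixing proportions, component parameters and data values are all hypothetical choices made for illustration.

```python
import math

def lognormal_pdf(x, mu, sigma):
    """Density of a lognormal with log-scale mean mu and log-scale sd sigma."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, weights, params):
    """Equation (1): p(x) = sum_j p(x|j) p(j), with p(j) given by weights."""
    return sum(w * lognormal_pdf(x, mu, s) for w, (mu, s) in zip(weights, params))

def log_likelihood(xs, weights, params):
    """Sum over all observations of the log mixture density."""
    return sum(math.log(mixture_pdf(x, weights, params)) for x in xs)

# Hypothetical two-component example (values chosen only for illustration).
weights = [0.5, 0.5]                 # priors p(j); they must sum to 1
params = [(0.0, 0.5), (2.0, 0.5)]    # (mu, sigma) of each component
xs = [0.8, 1.1, 6.5, 8.0]            # made-up expression values
ll = log_likelihood(xs, weights, params)
```

Working with the densities directly, as here, is adequate for a handful of points; for real probes a log-space formulation is usually safer against underflow.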
The EM algorithm iteratively maximizes the log likelihood and updates the conditional probability that $x$ comes from the $j$-th component. This is defined as:

$$p(x|j)^{*} = E[\,p(x|j) \mid x, \hat{\alpha}_1, \hat{\beta}_1, \dots, \hat{\alpha}_K, \hat{\beta}_K\,]$$

The set of parameters $[\hat{\alpha}_1, \hat{\beta}_1, \dots, \hat{\alpha}_K, \hat{\beta}_K]$ maximizes the log likelihood for given $p(x|j)$. The EM algorithm alternates between an E-step, in which the values $p(x|j)^{*}$ are computed from the current parameter estimates, and an M-step, in which the log likelihood, with each $p(x|j)$ replaced by its current conditional expectation $p(x|j)^{*}$, is maximized with respect to the parameters $\alpha$ and $\beta$.

An outline of the EM algorithm is as follows:

1. Initialize $\alpha$ and $\beta$. This is done by first randomly partitioning the dataset into $K$ groups and then calculating method-of-moments estimates for each of the groups.
2. M-step: Given $p(x|j)^{*}$, maximize the log likelihood (LL) with respect to the parameters $\alpha$ and $\beta$. We find the maximizing values $[\hat{\alpha}_1, \hat{\beta}_1, \dots, \hat{\alpha}_K, \hat{\beta}_K]$ numerically.
3. E-step: Given the parameter estimates from the M-step, we compute:

$$p(x|j)^{*} = E[\,p(x|j) \mid x, \hat{\alpha}_1, \hat{\beta}_1, \dots, \hat{\alpha}_K, \hat{\beta}_K\,] = \frac{\hat{p}(x|j)}{\sum_{j'=1}^{K} \hat{p}(x|j')}$$

4. Repeat the M-step and E-step until the change in the value of the log likelihood (LL) is negligible.

To reduce the risk of ending in a poor local maximum, we run the above EM algorithm multiple times with different starting points.

2.3 Model Selection

When fitting mixture models to expression data, it is necessary to choose an appropriate number of components, one which fits the data well but does not overfit. For this task we tried two information criteria: AIC (Akaike Information Criterion [1]) and BIC (Bayesian Information Criterion [12]). Specifically:

$$AIC = -2LL + 2K \qquad \text{and} \qquad BIC = -2LL + K \log(N)$$

where $N$ is the number of observations. To choose models, we fit mixture models with the EM algorithm for one to five components and chose the model with the smallest information criterion value. Note that $K$ in the above formulas denotes the number of degrees of freedom, which is equal to $3c - 1$ for $c$ components: two parameters per component plus $c - 1$ independent mixing proportions.
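The EM outline and the information criteria above can be sketched as follows. This is a minimal Python illustration under several stated assumptions, not a transcription of the R implementation used for the experiments. It handles only lognormal components, for which the M-step has a closed form (weighted mean and standard deviation of the log values) in place of numerical maximization. The E-step weights each component density by its mixing proportion $p(j)$. Initialization splits the sorted data into contiguous blocks rather than a random partition, to keep the sketch deterministic, and no restarts are performed. All function names and the synthetic data are hypothetical.

```python
import math

def lognormal_logpdf(x, mu, sigma):
    """Log density of a lognormal with log-scale mean mu and log-scale sd sigma."""
    z = (math.log(x) - mu) / sigma
    return -0.5 * z * z - math.log(x * sigma * math.sqrt(2.0 * math.pi))

def e_step(xs, weights, mus, sigmas):
    """Return per-point component responsibilities and the log likelihood LL."""
    resp, ll = [], 0.0
    for x in xs:
        terms = [math.log(w) + lognormal_logpdf(x, m, s)
                 for w, m, s in zip(weights, mus, sigmas)]
        top = max(terms)  # log-sum-exp for numerical stability
        log_px = top + math.log(sum(math.exp(t - top) for t in terms))
        ll += log_px
        resp.append([math.exp(t - log_px) for t in terms])
    return resp, ll

def fit_lognormal_mixture(xs, c, max_iter=500, tol=1e-8):
    """EM for a c-component lognormal mixture; returns (weights, mus, sigmas, LL)."""
    xs = sorted(xs)
    logs = [math.log(x) for x in xs]
    n = len(xs)
    # Initialization (assumption): contiguous blocks of the sorted data,
    # with moment estimates computed on the log scale. Requires n >= c.
    groups = [logs[j * n // c:(j + 1) * n // c] for j in range(c)]
    mus = [sum(g) / len(g) for g in groups]
    sigmas = [max(1e-3, math.sqrt(sum((v - m) ** 2 for v in g) / len(g)))
              for g, m in zip(groups, mus)]
    weights = [len(g) / n for g in groups]
    resp, ll = e_step(xs, weights, mus, sigmas)
    for _ in range(max_iter):
        # M-step: closed-form weighted updates; floors guard against collapse.
        for j in range(c):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            mus[j] = sum(r[j] * v for r, v in zip(resp, logs)) / nj
            var = sum(r[j] * (v - mus[j]) ** 2 for r, v in zip(resp, logs)) / nj
            sigmas[j] = max(1e-3, math.sqrt(var))
        # E-step with the updated parameters, then the convergence check.
        resp, new_ll = e_step(xs, weights, mus, sigmas)
        converged = new_ll - ll < tol
        ll = new_ll
        if converged:
            break
    return weights, mus, sigmas, ll

def information_criteria(ll, c, n):
    """Return (AIC, BIC) with K = 3c - 1 degrees of freedom."""
    k = 3 * c - 1
    return -2.0 * ll + 2.0 * k, -2.0 * ll + k * math.log(n)

def select_components(xs, max_c=5):
    """Fit 1..max_c components; return the count with the smallest BIC and all BICs."""
    bics = {}
    for c in range(1, max_c + 1):
        ll = fit_lognormal_mixture(xs, c)[3]
        bics[c] = information_criteria(ll, c, len(xs))[1]
    return min(bics, key=bics.get), bics

# Synthetic, clearly bimodal data on the log scale (hypothetical values).
xs = ([math.exp(-0.5 + i / 59) for i in range(60)]
      + [math.exp(2.5 + i / 59) for i in range(60)])
best_c, bics = select_components(xs)
```

Replacing the BIC value by the AIC value inside `select_components` gives the corresponding AIC-based choice. A gamma mixture would need a numerical M-step, as described in the outline above.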
3 Experimental Results

3.1 Estimating Number of Components

We generated simulated datasets from mixture models containing one, two and three components. We performed two sets of equivalent experiments, one using the gamma and one using the lognormal distribution for the mixture model components. For the component parameters, each distinct combination of the parameter pairs shown in Table 1 was tried, yielding 4 mixture models each with one and with three components and $\binom{4}{2} = 6$ mixture models with two components. The mixing proportions ($p(j)$ in equation 1) were set to the uniform distribution in each case. To aid comparison with the real data described below, we simulated datasets with sizes 122 and 158.

Figure 1 shows how effectively the information criteria recover, from the generated data, the number of components in the generating model. BIC outperforms AIC, correctly inferring the number of components in about 80% of the trials. Even for BIC, the error is slightly skewed toward overpredicting the number of components. This suggests it may be possible to further optimize the criteria for this task, but we did not pursue this possibility.

Pair   α     β
1      0.1   7
2      0.1   14
3      5     3
4      10    6

Table 1: The parameter pairs used as components of the mixture models that generated the simulated data.

3.2 Real Dataset

As a preliminary study, we investigated gene expression datasets from human (GDS596) and mouse (GDS592), taken from the GEO gene expression data repository [3] (http://www.ncbi.nlm.nih.gov/geo/). GDS596 contains data from a study profiling 158 types of normal human tissue (22,283 probes), and GDS592 contains data from 122 types of mouse tissue (31,373 probes) [13].

3.2.1 Likelihood and K-S Test Comparison

We evaluate the likelihood of mixture models from three types of distributions on the real datasets.
Furthermore, we evaluate the goodness of fit of each model using the Kolmogorov-Smirnov (K-S) test, summarized by D, the maximum discrepancy in the cumulative probability distribution, and by a p-value.

        Normal                        Lognormal                     Gamma
#Comp   Loglik     D        p-value   Loglik    D        p-value   Loglik    D        p-value
1       -1063.02   1.04e-3  0.99      -212.85   2.30e-4  0.99      -968.59   3.00e-3  0.99
2       -979.13    1.74e-3  0.99      -205.58   4.55e-4  0.99      -955.11   1.60e-3  0.99
3       -952.29    2.72e-3  0.99      -204.55   7.04e-4  0.99      -963.20   1.19e-3  0.99
4       -913.66    4.63e-2  0.92      -203.05   8.27e-4  0.99      -967.20   9.88e-4  0.99
5       -881.67    1.55e-2  0.90      -201.78   1.26e-3  0.99      -968.56   5.77e-4  0.99

Table 2: Log likelihood, D and p-value, averaged over each probe of the GDS596 dataset.

The goodness-of-fit p-values indicate that the gamma mixtures can fit the marginal distribution of gene expression reasonably well. However, lognormal mixtures fit better than gamma mixtures: in every experiment they attain a higher likelihood than the gamma mixtures. The K-S test largely confirms this observation, with the D statistic of the lognormal mixtures generally smaller than that of the gamma mixtures, by up to an order of magnitude.

Note that the likelihood in Tables 2 and 3 does not always improve monotonically with the number of components. We believe this is due to the EM procedure getting trapped in poor-quality local optima when more components are used.

        Normal                        Lognormal                     Gamma
#Comp   Loglik     D        p-value   Loglik    D        p-value   Loglik    D        p-value
1       -728.47    1.14e-3  0.99      -85.04    7.70e-5  0.99      -492.66   1.72e-3  0.99
2       -683.53    1.44e-3  0.99      -47.39    7.73e-5  0.99      -437.79   1.05e-3  0.99
3       -661.46    2.70e-3  0.99      -46.64    9.20e-5  0.99      -487.58   1.26e-3  0.99
4       -637.21    5.79e-3  0.97      -46.01    6.17e-5  0.99      -474.41   1.19e-3  0.99
5       -612.37    9.44e-2  0.88      -45.08    7.45e-5  0.99      -465.15   8.36e-4  0.99

Table 3: Log likelihood, D and p-value, averaged over each probe of the GDS592 dataset.
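For reference, the D statistic described above can be sketched as follows for a single probe and a fitted lognormal mixture. This is a minimal Python illustration of the standard one-sample K-S statistic, assuming the comparison is between the empirical distribution function of the probe's values and the fitted mixture CDF; how the reported values were computed and averaged is not specified here, and the p-value calculation is omitted. The function names, mixture parameters and data values are hypothetical.

```python
import math

def lognormal_cdf(x, mu, sigma):
    """CDF of a lognormal with log-scale mean mu and log-scale sd sigma."""
    if x <= 0:
        return 0.0
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

def mixture_cdf(x, weights, params):
    """CDF of a mixture: the p(j)-weighted sum of the component CDFs."""
    return sum(w * lognormal_cdf(x, mu, s) for w, (mu, s) in zip(weights, params))

def ks_statistic(xs, cdf):
    """Maximum gap between the empirical distribution function and cdf."""
    xs = sorted(xs)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

# Hypothetical fitted mixture and made-up expression values.
weights = [0.5, 0.5]
params = [(0.0, 0.5), (2.0, 0.5)]
sample = [0.7, 0.9, 1.2, 1.6, 5.5, 7.0, 8.1, 9.4]
d_stat = ks_statistic(sample, lambda x: mixture_cdf(x, weights, params))
```

Checking both sides of each jump in the empirical distribution function matters: using only one side can underestimate D.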
3.2.2 Marginal Distribution of CSN3 gene

CSN3 is a component of the COP9 signalosome complex, a complex involved in signal transduction. Figure 2 shows examples of how one- and two-component mixtures of normal, lognormal, and gamma distributions fit the marginal distribution of this gene. The data is taken from the GDS596 dataset.

3.2.3 Gene Ontology Comparison

From the two datasets we identified "two component genes", the genes with probes for which the BIC criterion suggested two components for the learned lognormal mixture. (We found 256 such genes for GDS596 and 56 for GDS592.) We examined the overrepresented terms for each dataset using the web tool Babelomics (http://www.babelomics.org/) [2]. One cellular component term, intracellular, occurs in both of the datasets.

Term                           GO           p-value
cytoskeleton                   GO:0005856   0.000428
non-membrane-bound organelle   GO:0043228   0.00083
organelle part                 GO:0044422   0.00633
intracellular                  GO:0005622   0.0117
cytoplasm                      GO:0005737   0.175

Table 4: Statistically overrepresented terms from the "two component" probes in GDS596

Term                                 GO           p-value
proteinaceous extracellular matrix   GO:0005578   0.0618
intracellular                        GO:0005622   0.0548
midbody                              GO:0030496   0.0985
cell soma                            GO:0043025   0.0933
extracellular space                  GO:0005615   0.116

Table 5: Statistically overrepresented terms from the "two component" probes in GDS592

4 Conclusion

In this paper we provide a statistical framework using normal, lognormal, and gamma mixture models for analyzing the marginal distributions of expression levels in gene probes. To our knowledge this is the first study that provides such a framework for analyzing expression data. Although gamma distributions are theoretically capable of modeling skewed distributions, our experiments showed that the lognormal distribution appears to be more suitable for modeling the marginal distribution of gene expression.
We also showed that, of the two model selection criteria we used, BIC is more accurate in selecting the number of components for lognormal and gamma mixtures. AIC, on the other hand, tends to overestimate the number of components.

We hypothesized that genes in the same functional category (e.g. transcription factors, kinases, structural proteins, etc.) may show similar marginal distributions. Unfortunately this expectation is not clearly supported by our study. Only the single, vague gene ontology term intracellular was found to be overrepresented in both datasets. We believe follow-up experiments are necessary to determine if this is due to the quantity/quality of the expression data used, a deficiency in our methodology, or whether our hypothesis is simply wrong.

To achieve more definitive results we are now preparing to analyze a much larger dataset including multiple GEO datasets. This will be essential to sample the expression probes at the resolution needed to accurately model multimodal marginal distributions.

Our results should provide some guidance in the development of informed priors or gene-specific normalization for use with gene network inference algorithms.

References

[1] H. Akaike. Information theory and extension of the maximum likelihood principle. In Second International Symposium on Information Theory, pages 267–281, 1973.
[2] F. Al-Shahrour, P. Minguez, J.M. Vaquerizas, L. Conde, and J. Dopazo. Babelomics: a suite of web tools for functional annotation and analysis of groups of genes in high-throughput experiments. Nucleic Acids Research, 33:W460, 2005.
[3] T. Barrett, D. B. Troup, S. E. Wilhite, P. Ledoux, D. Rudnev, C. Evangelista, I. F. Kim, A. Soboleva, M. Tomashevsky, and R. Edgar. NCBI GEO: mining tens of millions of expression profiles - database and tools update. Nucleic Acids Research, 35(Database-Issue):760–765, 2007.
[4] B. Dennis and G. P. Patil.
The gamma distribution and weighted multimodal gamma distributions as models of population abundance. Mathematical Biosciences, 68:187–212, 1984.
[5] D. Hoyle, M. Rattray, R. Jupp, and A. Brass. Making sense of microarray data distributions. Bioinformatics, 18:576–584(9), April 2002.
[6] Y. Ji, C. Wu, P. Liu, J. Wang, and K. R. Coombes. Applications of beta-mixture models in bioinformatics. Bioinformatics, 21(9):2118–2122, 2005.
[7] S. Keles. Mixture modeling for genome-wide localization of transcription factors. Biometrics, 63(1):2118–2122, 2007.
[8] T. Konishi. Parametric treatment of cDNA microarray data. Genome Informatics, 7(13):280–281, 2002.
[9] V.A. Kuznetsov. Family of skewed distributions associated with the gene expression and proteome evolution. Signal Process., 83(4):889–910, 2003.
[10] E. Limpert, W.A. Stahel, and M. Abbt. Log-normal distributions across the sciences: keys and clues. Bioscience, 51(5):341–352, 2001.
[11] I. Mayrose, N. Friedman, and T. Pupko. A gamma mixture model better accounts for among site rate heterogeneity. Bioinformatics, 21(2):151–158, 2005.
[12] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6:461–464, 1978.
[13] A. I. Su, T. Wiltshire, S. Batalov, H. Lapp, K. A. Ching, D. Block, J. Zhang, R. Soden, M. Hayakawa, G. Kreiman, M. P. Cooke, J. R. Walker, and J. B. Hogenesch. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A, 101(16):6062–6067, April 2004.

Figure 1: Performance of gamma and lognormal mixture models in predicting the number of components on simulated datasets. The accuracy percentage is the average performance over 4 or 6 parameter settings x 100 trials x 2 dataset sizes.

Figure 2 (panels: NORMAL, LOGNORMAL, GAMMA): The components (one or two) of mixture models of three distribution types fit to the marginal distribution of the CSN3 gene are shown.