Bayesian Model Averaging methods and R package for gene network construction Ka Yee Yeung Chris Fraley William Chad Young Institute of Technology University of Washington, Tacoma, WA 98402. Department of Statistics University of Washington, Seattle, WA 98195 Department of Statistics University of Washington, Seattle, WA 98195 [email protected] [email protected] [email protected] Roger Bumgarner Adrian E. Raftery Department of Microbiology University of Washington, Seattle, WA 98195. Department of Statistics University of Washington, Seattle, WA 98195 [email protected] [email protected] ABSTRACT Advances in biotechnology allow the generation of big data quantifying activities or signals across the entire genome. These genome-wide data can be used to infer functional relationships between biological entities. In this paper, we will review Bayesian Model Averaging (BMA) methods developed to infer these gene networks. In particular, our work uses a regression framework that allows systematic integration of multiple types of biological data. We will also present our latest contributions to the software implementation of our methods. Many methods have been developed to infer gene networks, and different methods generally yield different networks. Hence, the assessment and evaluation of inferred networks is of great importance. We present an R package “networkBMA” for network inference and assessment implementing: (i) Bayesian Model Averaging methods for network inference using time series gene expression data; (ii) assessment functions that compute contingency tables and associated evaluation statistics for comparing inferred networks to complete or incomplete external knowledge; (iii) plotting functions that compute receiver operating characteristic (ROC) and precision-recall (PR) curves. Our package also includes in silico and real test data that demonstrate the usability of the functions. Categories and Subject Descriptors G.3 [Probability and Statistics]: Statistical computing. G.4 [Mathematical software]: Algorithm design and analysis. General Terms Algorithms, Performance, Experimentation. Keywords Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD’14 Workshop KDDBHI, Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, August 2014, New York City, USA. Copyright 2014 ACM 1-58113-000-0/00/0010 …$15.00. Computational biology, gene network inference, data integration, Bayesian statistics, software. 1. INTRODUCTION The ultimate goal of genomics research is to enable the discovery of novel diagnostics and therapeutics. Advances in biotechnology allow the measurement of the signal or activity levels across the entire genome in the order of thousands or even tens of thousands of biological entities. Various types of these genome-wide data, such as gene expression and sequence data, have been generated and made available from public databases such as the Gene Expression Omnibus (GEO) [1, 2], ArrayExpress [3] and the European Nucleotide Archive (ENA) [4]. By mining these genome-wide data, we aim to deduce the complex interactions between genes and gene products through predictive modeling. The resulting models can subsequently be used to develop novel therapies [5, 6]. In this paper, we define a gene network as a directed graph, in which each node represents a messenger RNA (mRNA) level and each directed edge (r→g) represents the relationship between the expression levels (or activity levels) of a regulator r and a gene g. We aim to infer the directed edges that describe the relationships among the nodes. This is a challenging problem, especially on a genome-wide scale, since the goal is to unravel a small number of regulators (parent nodes) out of thousands of candidate nodes in the graph. Another challenge is the effective integration of external knowledge and data sources to guide the network search. 1.1 Related Work A variety of methods have been developed to infer gene networks, for example, Bayesian networks [5, 7-10], ordinary differential equations [11] and regression-based methods [12-23]. In regression-based methods, network construction is formulated as a series of variable (feature) selection problems to infer regulators (parent nodes) for each gene. Examples of regressionbased network inference methods include regression trees [24-27], forward selection, nonparametric regression embedded within a Bayesian network [28], and L1-norm regularization methods such as the elastic net [29, 30] and weighted LASSO [31]. 1.2 Our Contributions In this manuscript, we review our Bayesian Model Averaging (BMA) methods for gene network inference using time series gene expression data [32-34], and present our software implementation (“networkBMA” R package) that has not previously been documented in the literature. Specifically, we formulate network inference as a variable selection problem in which candidate regulators are selected for each gene. Integration of multiple genome-wide data sources is accomplished using prior probabilities in the Bayesian framework. These prior probabilities are used to guide the search of regulators (parent nodes) in the model space. The greatest challenge in formulating network construction as a series of variable selection problems is that there are usually far more candidate regulators than observations for each gene. Our methods build on BMA, which is a multivariate variable (feature) selection method that accounts for uncertainty in model selection [35, 36]. In the case of high-dimensional gene expression data, it is highly likely that there are multiple models (sets of regulators) that fit the data well. BMA is an ensemble method that averages over multiple possible models. This is in contrast to other variable selection methods that select a single “best” model from models of similar quality. The resulting network is a directed graph consisting of a set of directed edges from regulators to genes, with edges calibrated by posterior probabilities. Our “networkBMA” package (with major revision in April 2014, publicly available at http://bioconductor.org/packages/release/bioc/html/networkBMA. html) extends our previous work in two main ways. First, we unify the iBMA and ScanBMA methods for network construction and provide a software implementation that allows the specification of prior probabilities to bias the search of candidate regulators for each gene. Second, we add functions for network assessment, including functions for computing contingency tables for inferred networks as compared to complete or incomplete reference regulatory relationships, functions for computing evaluation statistics from contingency tables, and functions for plotting receiver operating characteristic (ROC) curves and precision-recall (PR) curves. , where Xg,t is the expression of gene g at time t, H is the set of regulators for gene g in a candidate model, 2. BAYESIAN MODEL AVERAGING (BMA) 2.1 BMA as a variable selection method 2.2.3 Applications to static (non-time series) data BMA is a variable selection approach that takes model uncertainty into account by averaging over the posterior distribution of a quantity of interest based on multiple models, weighted by their posterior model probabilities [35, 36]. In the context of network construction, each model consists of a set of candidate regulators. To efficiently identify a small set of promising models out of all possible models, our approach applies the leaps and bounds algorithm [37] to identify the best nbest models for each number of variables (i.e., regulators), and Occam’s window to discard models with much lower posterior model probabilities than the best one [38]. The Bayesian Information Criterion (BIC) [39] is used to approximate each model's integrated likelihood, from which its posterior model probability can be determined. BIC corrects for over-fitting with a penalty term for the number of parameters in the model. 2.2 BMA as a network inference algorithm We formulate network construction from time series data as a regression problem in which the expression of each gene is predicted by a linear combination of the expression of candidate regulators at the previous time point. Specifically, are the regression coefficients, and is the error term for gene g=1, …, n and time t=2, …, T. 2.2.1 Iterative BMA algorithm (iBMA) Since BMA in its original form is not applicable to highdimensional data that contain more variables than samples, we developed the iterative BMA (iBMA) algorithm [32, 40]. In iBMA, we use a pre-processing step to rank all variables. We then iteratively apply the original BMA to the top w variables (w=30 by default), and discard predictor variables with low posterior inclusion probabilities. In the iterative step, new variables from the ranked list are added to replace the discarded variables. This procedure of repeatedly applying BMA and variable swaps is continued until the nvar top ranked variables have been processed. 2.2.2 Scan BMA algorithm (ScanBMA) ScanBMA is an alternative algorithm for exploring the space of models for BMA when there are more variables than samples [34]. It does so by taking a set of the best models found thus far and looking at models in which a single variable is added or removed, adding those models to the set of best models to explore around if their score is within a factor of the score of the best model. Additionally, ScanBMA allows the use of Zellner’s g-prior [41], which replaces BIC for scoring the models. Zellner’s g-prior is more flexible than BIC and can be either specified or estimated using an Expectation-Maximization (EM) algorithm. There are two distinguishing features of ScanBMA. First, a preprocessing step ranking all variables via a univariate measure is used in the model exploration step of iBMA, but not in ScanBMA. Second, there is no upper limit on the maximum size of the models inferred in ScanBMA. This is in contrast to iBMA, in which the number of maximum variables is limited by the BMA window size w. Our iBMA and ScanBMA algorithms can be modified to be applied to non-time series data [32]. Specifically, let Yg,e be the expression of gene g in experiment e, H is the set of regulators for gene g in a candidate model, are the regression coefficients, and is the error term for g=1, …, n and e=1, …, E. Our model is defined as . An issue with this formulation is that the resulting networks may contain directed cycles. In order to yield a well-defined probability distribution, we need a Gaussian graphical model that does not contain any cycles [42-44]. Drawing from our previous work [32, 33], we identify strongly connected components using the directed graph resulting from applying iBMA to each gene, and then greedily remove the edges with the lowest posterior probabilities until no strongly connected component exists in the graph. 2.3 Data integration in BMA A major challenge of data mining in computational biology is data integration. In the context of gene networks, we aim to integrate multiple genome-wide data sources to infer robust, accurate and compact gene-to-gene interactions. We have previously developed a supervised framework to integrate multiple data sources and illustrated our framework using yeast data as an example [32, 33]. Specifically, we compute prior probabilities of regulatory relationships using multiple data sources. Known regulatory relationships in the form of transcription factor and gene (TF,G) pairs from the literature curated in various yeast databases are used as positive training examples. Randomly generated (TF,G) relationships that are not documented in any databases are used as negative training examples. We further adjust for the sampling bias between positive and negative training examples using the expected fraction of regulatory relationships in the literature [33]. These prior probabilities are then used to guide the selection of candidate regulators (parent nodes) for each gene in the regression step using gene expression data. While our work was done primarily in yeast, the supervised framework is applicable to other organisms for which similar data sets exist. 3. IMPLEMENTATION 3.1 networkBMA R package The “networkBMA” package is written in R and C++, and is publicly available from the Bioconductor project [45]. The core function networkBMA returns a set of directed edges consisting of the inferred regulator (parent), target gene (child), and the corresponding posterior probability of the inferred relationship. The user can subsequently specify a posterior probability threshold to filter out edges with low posterior probabilities. In our latest revision (April 2014), we add the implementation of ScanBMA, written in R and C++, which provides a significant increase in computational efficiency over our previous implementation of iBMA in R. The most computationally intensive routines searching the model space using the ScanBMA algorithm are written in C++. Our latest revision allows the inference of large gene networks (i.e. running the core function networkBMA across thousands of genes) to be accomplished in the order of minutes for small nvar such as 100, while also being viable for larger nvar, in the thousands. Please refer to Young et al. [34] for detailed running time analyses. Key input parameters to the core function networkBMA include: The argument nvar specifies the number of top univariate ranked variables to be retained before the regression step in both iBMA and ScanBMA. During the regression step, this univariate ranking will only be used in iBMA, but not in ScanBMA. The argument ordering specifies the univariate measure used to pre-process and rank variables prior to the regression step. We extend our previous work [32, 33] by allowing users to choose from the univariate BIC score (ordering=”bic1”), user-specified prior probabilities (ordering=”prior”) or univariate BIC score corrected by prior probabilities (ordering=”bic1+prior”). The univariate BIC score (“bic1”) measures the fit of each predictor variable (candidate regulator) in the linear regression. The Bayesian Information Criterion (BIC), given by , where p is the number of parameters (regression coefficients), and n is the number of observations, is an approximation to the integrated likelihood for a model [46]. Ordering variables by the “bic1” score is equivalent to ordering them by likelihood, because in the univariate case is constant across variables (the parameters are the intercept and a coefficient for the variable in question). The option ”bic1+prior” corrects the BIC score by subtracting twice the log odds of the prior probabilities, hence combining the contribution of the fit and the prior. The specification of prior probabilities to bias the search of candidate regulators for each gene is accomplished using the argument prior.prob in function networkBMA. This argument can be used to specify a matrix P such that entry (i,j) corresponds to the prior probability that regulator i regulates gene j computed using external data sources. If external data sources are unavailable, the user is encouraged to specify a constant prior probability, network density , which is equal to the expected number of regulators per gene divided by the total number of genes in the genome. Guelzim et al. [47] estimated that each yeast gene is regulated by approximately 2.76 transcription factors on average, hence, we set this network density parameter to 2.76/6000 = 0.00046 when inferring yeast networks. We have shown that this size prior yields compact and accurate networks [33]. 3.2 Contingency tables for network assessment Our “networkBMA” package allows the specification of either a complete or incomplete reference network to assess an inferred network. In the case of simulated data, a complete reference network (i.e., the underlying true network spanning all regulators and genes) is often available. However, for real biological data, knowledge of reference networks would, at best, be incomplete. As an example, we compared networks inferred from yeast timeseries data [32-34] to the literature curated regulatory relationships in the YEASTRACT database [48]. The function contabs.netwBMA in the networkBMA package computes a 2 x 2 contingency table to quantify the associations between an inferred and an incomplete reference network for each distinct value in a vector of posterior probability thresholds. This function computes the numbers of true positive (TP), false positive (FP), true negative (TN) and false negative (FN) using the set of regulators and genes covered by the reference network at a given posterior probability threshold to define the inferred edges. Table 1 shows the definitions of TP, FP, TN and FN in a contingency table. Table 1. Definition of contingency tables. Inferred network Yes No Gold standard reference network Yes No TP FP FN TN The function scores summarizes the concordance in the computed contingency tables using the following statistics: True positive rate (TPR) or sensitivity or hit rate or recall, TPR = TP/(TP + FN) False positive rate (FPR) = FP / (FP + TN) True negative rate (TNR) or specificity = TN / (FP + TN) = 1 - FPR False discovery rate (FDR) = FP / (TP + FP) Positive predictive value (PPV) or precision = TP/ (TP + FP) Negative predictive value (NPV) = TN / (TN + FN) F1 score is defined as the harmonic mean of precision and recall, i.e. F1 = 2 * (precision * recall) / (precision + recall) Accuracy (ACC) = (TP + TN) / (TP+FN+FP+TN) O/E ratio = TP/ [(TP+FN)*(TP+FP)/(TP+FN+FP+TN)] 3.3 ROC and PR curves for network assessment Functions roc and prc in the networkBMA package take the evaluation statistics returned by contabs.netwBMA over the range of posterior probability thresholds for the inferred edges, and plot the TPR against FPR in the ROC curve, and the precision against recall in the PR curve, respectively. TPR measures the fraction of correct edges among all edges in the reference network, precision measures the fraction of correct edges among all predicted edges, and FPR quantifies the fraction of incorrect edge predictions among edges not in the reference network. A perfect prediction would yield TPR=1, precision=1, and FPR=0. In general, good predictions would yield a point above the diagonal line in the ROC curve. The contingency tables from “networkBMA” cover a range of values for the FPR (ROC) or recall (PR) that may not span the entire interval from 0 to 1. For ROC curves, we use linear interpolation, while for PR curves, we use the method described by Davis and Goadrich [49] to interpolate values outside the range. Davis and Goadrich [49] address practical issues of interpolation between points in the ROC and PR space. While linear interpolation between points makes sense for ROC curves, interpolation is more complicated in the PR space. As the level of Recall varies, the Precision does not necessarily change linearly due to the fact that FP replaces FN in the denominator of the Precision metric. As a result, linear interpolation gives an overly optimistic estimate of performance. The achievable PR curve can be found using the analogous ROC convex hull, and interpolation can be based on this curve. We use this method when plotting PR curves in the networkBMA package. predictor removed (added) with that of the best model itself to assess the relative importance of each predictor. We call this process differentiation, and use arguments diff100 and diff0 as indicator variables denoting whether differentiation is performed. Subsequently, we threshold the resulting directed edges at 50% and obtain a network consisting of 439 edges. > library(networkBMA) > data(vignette) > edges.ScanBMA <- networkBMA(data = timeSeries[,-(1:2)], nTimePoints = 6, prior.prob = reg.prob, nvar = 50, ordering = "bic1+prior", diff100 = TRUE, diff0 = TRUE) > sum (edges.ScanBMA[,3] >= 0.5) [1] 439 Next, we compare this inferred network to the reference network using contingency tables at posterior probability thresholds 50%, 75% and 90%. > ctables <- contabs.netwBMA (edges.ScanBMA, referencePairs, reg.known, thresh=c(.5,.75,.9)) > ctables TP FN FP TN 0.5 18 251 21 782 0.75 18 251 16 787 0.9 18 251 13 790 Figure 1 shows the ROC curve plotting the TPR against the FPR by varying posterior probability thresholds of the inferred networks. The area under the ROC curve is 0.65. See the vignette accompanying the “networkBMA” package for details of the R code used to produce this example. 4. RESULTS 4.1 Datasets 4.2 Using the networkBMA package We illustrate our package using the 100-gene yeast time series expression data (called timeSeries in our example). First, we use the core function networkBMA and apply ScanBMA. The prior probability matrix of this 100-gene subset is denoted by reg.prob. We can estimate the true posterior probabilities for a predictor with an estimated 100% (0%) posterior probability by comparing the posterior probability of the best model with the 1.0 0.8 0.6 0.4 TPR (sensitivity) 0.2 0.0 The “networkBMA” package includes the following set of test data and reference networks: 1. A subset of the yeast time-series expression data consisting of 100 genes over 6 time points [32] and literature-curated regulatory relationships from the YEASTRACT database [48]. This time-series expression data measure the response to a drug perturbation over 6 time points at 10-minute intervals in 95 yeast segregants and 2 parental strains. The full dataset is available from http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB412/. We also include the prior probabilities computed using external data sources for these 100 genes (as described in Section 2.3). 2. In-silico 10-gene and 100-gene time-series data over 21 time points and the corresponding reference networks from DREAM4 [50-54]. 3. A subset of static (non-time series) yeast gene expression data [55]. ROC 0.0 0.2 0.4 0.6 0.8 1.0 FPR (1 − specificity) Figure 1. ROC curve produced by networkBMA. The red vertical lines highlight the sector covered by varying the posterior probability thresholds in the contingency table. The black lines indicate the ROC curve. The diagonal blue line in the ROC curve indicates the line between (0,0) and (1,1). . 5. CONCLUSIONS We present the networkBMA bioconductor package for network inference and assessment. Our R package implements both the ScanBMA and iBMA methods, and allows specification of prior probabilities to bias the search of regulators towards candidates supported by additional data sources or expert knowledge. In addition, our package provides functions to evaluate inferred networks against complete or incomplete regulatory relationships using contingency tables, to compute various assessment statistics and to plot receiver operating characteristic (ROC) curves and precision-recall (PR) curves. Most importantly, our method and software implementation is highly computationally efficient compared to other network inference methods, with projected running time ~15 minutes for 20,000 genes, see [34]). Thus, networkBMA is a promising candidate to infer gene networks in mammalian systems. 6. ACKNOWLEDGMENTS We would like to thank Drs. Kenneth Dombek, Kenneth Lo and John Mittler for their participation in the project. AER, CF, KYY, REB and WCY are supported by NIH grant 5R01GM084163. KYY is also supported by the Institute of Translational Health Sciences (ITHS) eScience Seed Grant at University of Washington and the Microsoft Azure for Research Award. AER is also supported by NIH grants R01 HD054511 and R01 HD070936, and by Science Foundation Ireland ETS Walton visitor award 11/W.1/I207. 7. REFERENCES [1] Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 2002, 30(1):207-210. [2] Barrett T, Troup DB, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM et al: NCBI GEO: archive for functional genomics data sets--10 years on. Nucleic Acids Res 2011, 39(Database issue):D1005-1010. [3] Parkinson H, Sarkans U, Kolesnikov N, Abeygunawardena N, Burdett T, Dylag M, Emam I, Farne A, Hastings E, Holloway E et al: ArrayExpress update--an archive of microarray and high-throughput sequencing-based functional genomics experiments. Nucleic Acids Res 2011, 39(Database issue):D1002-1004. [4] Kodama Y, Shumway M, Leinonen R, International Nucleotide Sequence Database C: The Sequence Read Archive: explosive growth of sequencing data. Nucleic Acids Res 2012, 40(Database issue):D54-56. [5] Schadt EE: Molecular networks as sensors and drivers of common human diseases. Nature 2009, 461(7261):218-223. [6] Schadt EE, Zhang B, Zhu J: Advances in systems biology are enhancing our understanding of disease and moving us closer to novel disease treatments. Genetica 2009, 136(2):259-269. [7] Friedman N, Linial M, Nachman I, Pe'er D: Using Bayesian networks to analyze expression data. J Comput Biol 2000, 7(3-4):601-620. [8] Schadt EE, Lamb J, Yang X, Zhu J, Edwards S, Guhathakurta D, Sieberts SK, Monks S, Reitman M, Zhang C et al: An integrative genomics approach to infer causal associations between gene expression and disease. Nat Genet 2005, 37(7):710-717. [9] Zhu J, Sova P, Xu Q, Dombek KM, Xu EY, Vu H, Tu Z, Brem RB, Bumgarner RE, Schadt EE: Stitching together multiple data dimensions reveals interacting metabolomic and transcriptomic networks that modulate cell regulation. PLoS Biol 2012, 10(4):e1001301. [10] Zhu J, Zhang B, Smith EN, Drees B, Brem RB, Kruglyak L, Bumgarner RE, Schadt EE: Integrating large-scale functional genomic data to dissect the complexity of yeast regulatory networks. Nat Genet 2008, 40(7):854-861. [11] Bansal M, Belcastro V, Ambesi-Impiombato A, di Bernardo D: How to infer gene networks from expression profiles. Mol Syst Biol 2007, 3:78. [12] Charbonnier C, Chiquet J, Ambroise C: Weighted-LASSO for structured network inference from time course data. Stat Appl Genet Mol Biol 2010, 9(1):Article 15. [13] Gustafsson M, Hornquist M: Gene expression prediction by soft integration and the elastic net-best performance of the DREAM3 gene expression challenge. PLoS One 2010, 5(2):e9134. [14] Hecker M, Goertsches RH, Engelmann R, Thiesen HJ, Guthke R: Integrative modeling of transcriptional regulation in response to antirheumatic therapy. BMC Bioinformatics 2009, 10:262. [15] Hecker M, Goertsches RH, Fatum C, Koczan D, Thiesen HJ, Guthke R, Zettl UK: Network analysis of transcriptional regulation in response to intramuscular interferon-beta-1a multiple sclerosis treatment. Pharmacogenomics J 2010, in press. [16] James G, Sabatti C, Zhou N, Zhu J: Sparse regulatory networks. Annals of Applied Statistics 2010, 4:663-686. [17] Lee SI, Dudley AM, Drubin D, Silver PA, Krogan NJ, Pe'er D, Koller D: Learning a prior on regulatory potential from eQTL data. PLoS Genet 2009, 5(1):e1000358. [18] Li F, Yang Y: Recovering genetic regulatory networks from micro-array data and location analysis data. Genome Inform 2004, 15(2):131-140. [19] Pan W, Xie B, Shen X: Incorporating predictor network in penalized regression with application to microarray data. Biometrics 2010, 66(2):474-484. [20] Peng J, Zhu J, Bergamaschi A, Han W, Noh D-Y, Pollack JR, Wang P: Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer. Ann Appl Stat 2010, 4(1):53-77. [21] van Someren EP, Vaes BL, Steegenga WT, Sijbers AM, Dechering KJ, Reinders MJ: Least absolute regression network analysis of the murine osteoblast differentiation network. Bioinformatics 2006, 22(4):477-484. [22] van Someren EP, Wessels LFA, Backer E, Reinders MJT: Multi-criterion optimization for genetic network modeling. Signal Processing 2003, 83(4):763-775. [23] Bonneau R, Reiss DJ, Shannon P, Facciotti M, Hood L, Baliga NS, Thorsson V: The Inferelator: an algorithm for learning parsimonious regulatory networks from systemsbiology data sets de novo. Genome Biol 2006, 7(5):R36. [24] Huynh-Thu VA, Irrthum A, Wehenkel L, Geurts P: Inferring regulatory networks from expression data using tree-based methods. PLoS One 2010, 5(9):e12776. [25] Lee SI, Pe'er D, Dudley AM, Church GM, Koller D: Identifying regulatory mechanisms using individual variation reveals key role for chromatin modification. Proc Natl Acad Sci U S A 2006, 103(38):14062-14067. [26] Nepomuceno-Chamorro IA, Aguilar-Ruiz JS, Riquelme JC: Inferring gene regression networks with model trees. BMC Bioinformatics 2010, 11:517. [27] Segal E, Shapira M, Regev A, Pe'er D, Botstein D, Koller D, Friedman N: Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat Genet 2003, 34(2):166-176. [28] Imoto S, Kim S, Goto T, Aburatani S, Tashiro K, Kuhara S, Miyano S: Bayesian network and nonparametric heteroscedastic regression for nonlinear modeling of genetic network. Journal of Bioinformatics and Computational Biology 2003, 1(2):231-252. selection and classification tool for microarray data. Bioinformatics 2005, 21(10):2394-2402. [41] Zellner A: On assessing prior distributions and Bayesian regression analysis with g-prior distributions. Bayesian Inference Decis Tech: Essays Honor of Bruno De Finetti 1986, 6:233-243. [42] Pearl J: Causality: Models, Reasoning and Inference: Cambridge University Press; 2000. [43] Spirtes P, Glymour C, Scheines R: Causation, Prediction and Search: MIT Press; 2000. [44] Shipley B: Cause and Correlation in Biology: A User's Guide to Path Analysis, Structural Equations and Causal Inference: Cambridge University Press; 2002. [45] networkBMA package [http://bioconductor.org/packages/release/bioc/html/network BMA.html] [46] Schwarz G: Estimating the dimension of a model. Annals of Statistics 1978, 6:461-464. [29] Friedman J, Hastie T, Tibshirani R: Regularization paths for generalized linear models via coordinate descent. J Stat Softw 2010, 33(1):1-22. [47] Guelzim N, Bottani S, Bourgine P, Képès F: Topological and causal structure of the yeast transcriptional regulatory network. Nat Genet 2002, 31:60-63. [30] Zou H, Trevor H: Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B 2005, 67(2):301-320. [48] Teixeira MC, Monteiro P, Jain P, Tenreiro S, Fernandes AR, Mira NP, Alenquer M, Freitas AT, Oliveira AL, Sa-Correia I: The YEASTRACT database: a tool for the analysis of transcription regulatory associations in Saccharomyces cerevisiae. Nucleic Acids Res 2006, 34(Supp~1):D446-451. [31] Shimamura T, Imoto S, Yamaguchi R, Miyano S: Weighted lasso in graphical Gaussian modeling for large gene network estimation based on microarray data. Genome Inform 2007, 19:142-153. [32] Yeung KY, Dombek KM, Lo K, Mittler JE, Zhu J, Schadt EE, Bumgarner RE, Raftery AE: Construction of regulatory networks using expression time-series data of a genotyped population. Proc Natl Acad Sci U S A 2011, 108(48):1943619441. [33] Lo K, Raftery AE, Dombek KM, Zhu J, Schadt EE, Bumgarner RE, Yeung KY: Integrating external biological knowledge in the construction of regulatory networks from time-series expression data. BMC systems biology 2012, 6(1):101. [34] Young WC, Raftery AE, Yeung KY: Fast Bayesian inference for gene regulatory networks using ScanBMA. BMC systems biology 2014, 8(1):47. [35] Raftery AE: Bayesian model selection in social research (with discussion). Sociol Methodol 1995, 25:111-193. [36] Hoeting JA, Madigan D, Raftery AE, Volinsky CT: Bayesian model averaging: A tutorial. Statistical Science 1999, 14:382-401. [37] Furnival GM, Wilson RW: Regression by leaps and bounds. Technometrics 1974, 16:499-511. [38] Madigan D, Raftery A: Model selection and accounting for model uncertainty in graphical models using Occam's window. J Am Stat Assoc 1994, 89:1335-1346. [39] Schwarz G: Estimating the dimension of a model. Annals of Statistics 1978, 6:461–464. [40] Yeung KY, Bumgarner RE, Raftery AE: Bayesian model averaging: development of an improved multi-class, gene [49] Davis J, Goadrich M: The relationship between PrecisionRecall and ROC curves. In: In ICML ’06: Proceedings of the 23rd international conference on Machine learning. ACM Press; 2006: 233-240. [50] Stolovitzky G, Monroe D, Califano A: Dialogue on reverseengineering assessment and methods: the DREAM of highthroughput pathway inference. Ann N Y Acad Sci 2007, 1115:1-22. [51] Prill RJ, Saez-Rodriguez J, Alexopoulos LG, Sorger PK, Stolovitzky G: Crowdsourcing network inference: the DREAM predictive signaling network challenge. Sci Signal 2011, 4(189):mr7. [52] Marbach D, Prill RJ, Schaffter T, Mattiussi C, Floreano D, Stolovitzky G: Revealing strengths and weaknesses of methods for gene network inference. Proc Natl Acad Sci U S A 2010, 107(14):6286-6291. [53] Marbach D, Schaffter T, Mattiussi C, Floreano D: Generating realistic in silico gene networks for performance assessment of reverse engineering methods. J Comput Biol 2009, 16(2):229-239. [54] Prill RJ, Marbach D, Saez-Rodriguez J, Sorger PK, Alexopoulos LG, Xue X, Clarke ND, Altan-Bonnet G, Stolovitzky G: Towards a rigorous assessment of systems biology models: the DREAM3 challenges. PLoS One 2010, 5(2):e9202. [55] Brem RB, Kruglyak L: The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proc Natl Acad Sci U S A 2005, 102(5):1572-1577.
© Copyright 2026 Paperzz