Briefings in Bioinformatics, Vol. 12, No. 3, 245–252. Advance Access published on 22 December 2010. doi:10.1093/bib/bbq078

Validation of gene regulatory networks: scientific and inferential

Edward R. Dougherty

Submitted: 25th August 2010; Received (in revised form): 15th November 2010

Corresponding author: Edward R. Dougherty. E-mail: [email protected]

Edward R. Dougherty is a Professor in the Department of Electrical and Computer Engineering at Texas A&M University, College Station, TX, USA, where he holds the Robert M. Kennedy '26 Chair and is Director of the Genomic Signal Processing Laboratory. He is also Director of the Computational Biology Division of the Translational Genomics Research Institute in Phoenix, AZ, whose mission is to translate frontier genomic science into patient applications.

© The Author 2010. Published by Oxford University Press. For Permissions, please email: [email protected]

Abstract

Gene regulatory network models are a major area of study in systems and computational biology, and the construction of network models is among the most important problems in these disciplines. The critical epistemological issue concerns validation. Validity can be approached from two different perspectives: (i) given a hypothesized network model, its scientific validity relates to the ability to make predictions from the model that can be checked against experimental observations; and (ii) the validity of a network inference procedure must be evaluated relative to its ability to infer a network from sample points generated by the network. This article examines both perspectives in the framework of a distance function between two networks. It considers some of the obstacles to validation and provides examples of both validation paradigms.

Keywords: epistemology; gene network; genomics; systems biology; validation

INTRODUCTION

Gene regulatory network models are a major area of study in systems and computational biology, and the construction of network models is among the most important problems in these disciplines. Network models provide quantitative knowledge concerning gene regulation and, from a translational perspective, they provide a basis for mathematical analyses leading to system-based optimal therapeutic strategies [1]. Network models run the gamut from coarse discrete networks to detailed descriptions by stochastic differential equations. A number of reviews are available in the literature, discussing both model structure and inference [2–8].

From the standpoint of biological knowledge, the salient concern is network validation: to what extent does a network model agree with observed phenomena? Without an answer to this question, the network lacks scientific validity. Validity can be approached from two perspectives. Given a hypothesized network model, its scientific validity relates to the ability to draw predictions from the model that can be checked against experimental observations. Validation requires the formulation of relations between model characteristics and observables. On the other hand, the validity of an inference procedure must be evaluated relative to its ability to infer a hypothetical network from sample points generated by the network: given a hypothetical model, generate data from the model, apply the inference procedure to construct an inferred model, and compare the hypothetical and inferred models via some objective function. In practice, these two perspectives can overlap, because an inference procedure is used to design a model from real data and the scientific validity of the inferred model must then be examined.

DISTANCE FUNCTIONS

To quantitatively compare networks H and M, we use a distance function, μ(M, H), measuring the difference between them. We require that μ be a semi-metric, meaning that it satisfies the following four properties:

(1) μ(M, H) ≥ 0;
(2) μ(M, M) = 0;
(3) μ(M, H) = μ(H, M) [symmetry];
(4) μ(M, H) ≤ μ(M, N) + μ(N, H) [triangle inequality].
If μ also satisfies a fifth condition,

(5) μ(M, H) = 0 ⟹ M = H,

then it is a metric. It is commonplace to define a distance function via some characteristic associated with a network, such as its regulatory graph or steady-state distribution, and thus we do not require the fifth condition for a network distance function.

For the sake of explanation, consider a network composed of a finite node (gene) set, V = {X_1, X_2, ..., X_n}, with each node taking discrete values in {0, 1, ..., d − 1}. The state space has N = d^n states of the form x_j = (x_j1, x_j2, ..., x_jn), for j = 1, 2, ..., N. The corresponding dynamical system is based on discrete time, t = 0, 1, 2, ..., with the state-vector transition X(t) → X(t + 1) at each time instant. The state X = (X_1, X_2, ..., X_n) is often referred to as a gene activity profile (GAP). In [9], a number of distance functions are discussed in this framework. Here, we consider three.

Suppose the network is a homogeneous Markov chain possessing a steady-state distribution. Owing to the importance of steady-state behavior, a natural choice for a network distance is a measure of difference between steady-state distributions. If p = (p_1, p_2, ..., p_N) is a probability vector, then its r-norm is defined by

    ||p||_r = ( Σ_{i=1}^{N} |p_i|^r )^{1/r}    (1)

for r ≥ 1. If π = (π_1, π_2, ..., π_N) and ω = (ω_1, ω_2, ..., ω_N) are the steady-state distributions for networks H and M, respectively, then a network distance is defined by

    μ_r^stead(M, H) = ||π − ω||_r    (2)

Concern often focuses on the connectivity of a regulatory network, which is represented by the adjacency matrix. Given an n-gene network, for i, j = 1, 2, ..., n, the (i, j) entry of the matrix is 1 if there is a directed edge from the i-th to the j-th gene; otherwise, the (i, j) entry is 0. If A = (a_ij) and B = (b_ij) are the adjacency matrices for networks H and M, respectively, where H and M possess the same gene set, then the Hamming distance between the networks is defined by

    μ_ham(M, H) = Σ_{i,j=1}^{n} |a_{ij} − b_{ij}|    (3)

Alternatively, the Hamming distance may be normalized, for instance by the number of genes or by the number of edges in one of the networks, the latter being natural when one of the networks is considered to represent ground truth.

With a rule-based network, the transition X(t) → X(t + 1) is governed by a state function f = (f_1, f_2, ..., f_n) such that X_i(t + 1) = f_i(X(t)). A classical example of a rule-based network is a Boolean network (BN), where the values are binary, 0 or 1, and the function f_i is defined via a logic expression or a truth table consisting of 2^n rows, with each row assigning a 0 or 1 as the value for the GAP defined by that row [10]. As defined, the BN is deterministic and the entries in its transition probability matrix are either 0 or 1. The connectivity of the BN is the maximum number of predictors for a gene.

The model becomes stochastic if the BN is subject to perturbation, meaning that at any time point, instead of necessarily being governed by the state function f = (f_1, f_2, ..., f_n), there is a positive probability p < 1 that the GAP may randomly switch to another GAP. The Markov chain for a BN with perturbation possesses a steady-state distribution.

For BNs (with or without perturbation) possessing the same gene set, a distance is given by the proportion of unequal rows in the function-defining truth tables. Denoting the state functions for networks H and M by f = (f_1, f_2, ..., f_n) and g = (g_1, g_2, ..., g_n), respectively, and noting that there are n truth tables consisting of 2^n rows each, this distance is given by

    μ_fun(M, H) = (1/(n·2^n)) Σ_{i=1}^{n} Σ_{k=1}^{N} I[g_i(x_k) ≠ f_i(x_k)]    (4)

where I denotes the indicator function: I[A] = 1 if A is a true statement and I[A] = 0 otherwise [11]. If we wish to give more weight to states more likely to be observed in the steady state, then we can weight the inner sums in Equation (4) by the corresponding terms of the steady-state distribution, π = (π_1, π_2, ..., π_N).
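To make the three distances concrete, the following Python sketch implements Equations (2), (3) and (4) for small networks. It is illustrative only and not from the article; the function names, the numpy representation (truth tables as arrays of shape (n, 2^n)) and the tiny two-gene example are our own choices.

import numpy as np

def mu_stead(pi_h, pi_m, r=1):
    # r-norm distance between steady-state distributions, Equation (2)
    diff = np.abs(np.asarray(pi_h, float) - np.asarray(pi_m, float))
    return float(np.sum(diff ** r) ** (1.0 / r))

def mu_ham(A, B):
    # Hamming distance between 0/1 adjacency matrices, Equation (3)
    return int(np.sum(np.abs(A - B)))

def mu_fun(f, g, weights=None):
    # Truth-table distance, Equation (4); f and g have shape (n, 2**n),
    # row i holding the truth table of gene i indexed by state x_k.
    n, N = f.shape
    mism = (f != g)                       # indicator I[g_i(x_k) != f_i(x_k)]
    if weights is None:
        return mism.sum() / (n * N)       # uniform weighting, as in Eq. (4)
    return float((mism * np.asarray(weights)).sum() / n)  # steady-state weighted

# Tiny 2-gene example with hypothetical truth tables (N = 2**2 = 4 states)
f = np.array([[0, 0, 1, 1], [0, 1, 0, 1]])
g = np.array([[0, 0, 1, 1], [1, 1, 0, 1]])
print(mu_fun(f, g))                       # 1 of 8 rows differs -> 0.125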
SCIENTIFIC THEORY AND VALIDATION

A scientific theory is composed of two parts: (i) a mathematical model consisting of symbols (variables and relations among the variables); and (ii) a set of 'operational definitions' that relate the symbols to data (see [12] for a more in-depth discussion of the epistemology of computational biology). A mathematical model is a necessary constituent of a scientific theory because quantitative experiments are the basis of scientific knowledge and relations among the quantitative variables constitute our formal understanding of the phenomena. But a mathematical model alone does not constitute a scientific theory. The model must be predictive. Its formal structure must lead to experimental predictions in the sense that there are relations between model variables and observable phenomena such that experimental observations are in accord with the predicted values of the corresponding variables. These relations characterize model validity. The method of validation lies outside the mathematical formalization of scientific knowledge, but it is a necessary part of a scientific theory because it is by way of this methodology that relations between model variables and phenomena are verified. Absent a rigorous specification of a validation procedure, there is only a mathematical theory, not a scientific theory. Validation is an epistemological requirement emanating from the empirical basis of science.

Given the same observations, scientists can legitimately disagree over the validity of a theory, because data are only validating within the framework of a specified validation protocol. As stated by Albert Einstein, 'In order that thinking might not degenerate into "metaphysics", or into empty talk, it is only necessary that enough propositions of the conceptual system be firmly enough connected with sensory experiences.' [13]. At issue is what is meant by 'enough propositions' being 'firmly enough connected' with sensory experiences. Because a model consists of mathematical relations and system variables must be checked against quantitative experimental observations, these epistemological criteria must be addressed in mathematical (including logical) statements. A scientific theory is not complete without the specification of achievable measurements that can be compared with predictions derived from the conceptual theory.
The validity of a theory is relative to this specification, but what is not at issue is the necessity of a set of definitions operationally tying the conceptual system to measurements. Once the validation requirements are specified, the mathematical model (conceptual system) is valid to the degree that the requirements are satisfied, that is, to the degree that predictions demanded by the validation protocol and resulting from the mathematical model agree with corresponding experimental observations.

Scientific validation presupposes the existence of a model. The model may have been obtained with the help of an inference procedure, it may have been derived solely from scientific theory, or it may have been communicated to the scientist by Attila the Hun during a séance. Its origin is epistemologically irrelevant; only its ability to be used for prediction matters. Two issues arise. First, predictions must relate to mathematical consequences of the model, and therefore we are bound by our ability to deduce consequences. Second, and of greater practical import in biology, we are limited by the ability to perform experiments. There may be all kinds of consequences deducible from the model, but the scientist must conceive of a suitable experiment to check a specific consequence and must be able to construct the apparatus to conduct the experiment so as to relate it to the model consequences.

Given a hypothesized model M, validation involves a characteristic λ, either directly specified in the model formulation or deducible from it, and an experimental design that yields an independent observation, ζ, of the same kind as λ. For instance, if our concern is mainly with steady-state behavior, then λ could be the steady-state distribution of the model network and ζ a distribution inferred from the test data and presumed to come from the steady state of the biological process under study. This immediately raises the question as to what procedure will be used to infer ζ. There is ipso facto a confounding effect of the procedure, because different procedures will yield different distributions to compare with the steady-state distribution of M. For instance, ζ could be a histogram of the data, or a parametric approach could be taken and ζ could be a distribution estimated from the data by estimating the free parameters of a distribution model.

Since model validation is based on prediction accuracy, we require a test distance d(λ, ζ). The choice of distance belongs to the scientist, and this choice affects the meaning of validity and, therefore, the scientific meaning of the model M. A natural way to proceed is to apply a network distance defined via a characteristic metric. For instance, if λ is the steady-state distribution and ζ a distribution estimated from the data, then a natural way to define the distance is d(λ, ζ) = μ_r^stead(λ, ζ). Validation is thus doubly non-absolute: a validation characteristic λ must be chosen and, once λ has been chosen, validation is satisfactory or unsatisfactory depending on the criterion of closeness, d(λ, ζ).

The test distance is a function of the sample, S, of data points collected to test the model network M; thus, the test distance is of the form d(λ, ζ(S)). Viewing the sampling procedure as a random process, Σ, the test distance is a random variable, d(λ, ζ(Σ)). In essence, the issue is hypothesis testing.

We can formulate a hypothesis test with null hypothesis H_0: λ_H = λ_M and alternative hypothesis H_1: λ_H ≠ λ_M, where λ_H is the characteristic for the physical network. In this case, ζ plays the role of a test statistic. A critical value v is chosen (an epistemological choice) such that the null hypothesis is accepted if d(λ, ζ) < v and the alternative hypothesis is accepted if d(λ, ζ) ≥ v. It is not our intention here to go into questions of Type I and Type II errors and their corresponding probabilities; these are well known. We note only that, as stated, evaluation of the Type I error requires knowledge of the distribution of d(λ, ζ(Σ)) under the null hypothesis, knowledge we are not likely to have, and evaluation of the Type II error would require some assumption as to a specific competing hypothesis and knowledge of the distribution of d(λ, ζ(Σ)) under that competing hypothesis. Hence, we are often not able to formulate validation in a rigorous statistical fashion.
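When the null distribution of d(λ, ζ(Σ)) is analytically intractable, one common workaround (a general Monte Carlo device, not something prescribed by the article) is to simulate it from the model itself. The sketch below assumes user-supplied callables simulate_sample and estimate_characteristic, which are placeholders for the sampling procedure and the procedure used to infer ζ; all names are ours.

import numpy as np

rng = np.random.default_rng(0)

def null_critical_value(model_lambda, simulate_sample, estimate_characteristic,
                        distance, n_reps=1000, alpha=0.05):
    # Approximate the (1 - alpha) quantile of d(lambda, zeta(Sigma)) under H0
    # by repeatedly sampling from the model and re-estimating zeta.
    dists = [distance(model_lambda, estimate_characteristic(simulate_sample(rng)))
             for _ in range(n_reps)]
    return float(np.quantile(dists, 1.0 - alpha))

# Decision rule: with observed data S_obs, accept H0 if
#   distance(model_lambda, estimate_characteristic(S_obs)) < v,
# and accept H1 otherwise, where v = null_critical_value(...).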
To illustrate model validation, we consider a modeling approach based on piecewise-linear differential equations that employs inequality constraints on the parameters rather than the parameters themselves [14]. Predictions of the signs of concentration derivatives can be obtained from the model, thereby facilitating model validation [15]. Basically, given a piecewise-linear differential-equation model for protein concentrations, involving threshold concentrations, synthesis parameters and degradation parameters, knowledge of the relative order of certain parameters and quotients of parameters yields a partitioning of the phase space. The result is a set of domains of the phase space, a set of transitions between domains, and a labeling of each domain in terms of the signs of the derivatives of the concentration variables and a marker as to whether the domain is persistent or instantaneous. We refer to this system as the domain system. A sequence of domains in this system constitutes a path. A key property is that every solution of the original differential-equation model corresponds to a domain path (the converse, however, is not true). The basic point regarding validation is that paths correspond to predicted regulatory behavior and can therefore be compared with experimental data. From the perspective of the original model, the paths are characteristics of the model, and therefore the original model is being validated via these characteristics.

In [15], a seven-variable piecewise-linear model is constructed to characterize the nutritional stress response in Escherichia coli. The model consists of six protein concentrations corresponding to six genes and one input variable denoting the presence or absence of a carbon starvation signal. To test the model, the authors consider the absence of the carbon starvation signal; under this condition, the domain system reaches a single equilibrium state (domain) corresponding to exponential growth in E. coli cells. Starting from this equilibrium state, the authors flip the starvation signal, which results in a 66-state transition graph possessing a single equilibrium state corresponding to stationary-phase conditions. The authors then ask whether predictions obtained from this model are concordant with experimental data. For instance, they consider the Fis concentration. Experimentally, the concentration of Fis has been shown to decrease at the end of the exponential phase and become steady in the stationary phase. This behavior is in agreement with model predictions. On the other hand, preliminary data indicate that the level of DNA supercoiling decreases during and after the transition to the stationary phase, which implies that the concentration of GyrAB must decrease or the concentration of TopA must increase, neither of which is predicted by the model, since in all paths the TopA concentration remains constant and the GyrAB concentration increases. The authors conclude that current knowledge concerning the nutritional stress response is inadequate and that the model can perhaps be extended by including more interactions or more variables in order to make it consistent with observations.
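The logic of this kind of qualitative check can be reduced to a toy computation: encode each predicted domain path as a sequence of derivative signs for one concentration (−1, 0, +1) and ask whether the coarse experimental trend occurs within some predicted path. This is our own simplification for illustration, not the model-checking machinery of [14, 15].

def matches_some_path(predicted_paths, observed_trend):
    # True if the observed sign sequence occurs as a subsequence of some path.
    def is_subsequence(short, long_seq):
        it = iter(long_seq)                       # each match consumes the iterator
        return all(any(s == x for x in it) for s in short)
    return any(is_subsequence(observed_trend, path) for path in predicted_paths)

# Fis example from the text: concentration decreases, then becomes steady.
observed_fis = [-1, 0]
paths = [[-1, -1, 0], [1, 0, 0]]                  # hypothetical predicted sign paths
print(matches_some_path(paths, observed_fis))     # True: the first path matches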
INFERENCE VALIDITY

When an inference procedure is employed to obtain a model network, the procedure operates on data and constructs a network M to serve as a model for the 'real biological process' (this not being the place to delve into the epistemological issues concerning the meaning of a 'real' process). The hope is that the model is sufficiently representative of the real process that predictions made from it will agree with biological measurements. This entails the hope that the inference procedure generates good predictive models from data. Since there is no way of knowing the exact nature of the real process, or even whether there is a real process mirroring the scientist's conception, there is no way to directly evaluate the performance of the inference procedure as an operator on experimental data. In fact, an inference procedure is not an operator in the physical world at all; rather, it is a mathematical function that operates on sample points generated by a mathematical network H and constructs an inferred network M to serve as an estimate of H, or constructs a characteristic to serve as an estimate of the corresponding characteristic of H. For instance, the sample points may be used to infer a distribution that estimates the steady-state distribution of H. The sample points could be dynamical, consisting of time-course points generated as the network evolves over time, or perhaps just a sampling of the steady-state distribution. In the latter case, it makes sense to consider inference accuracy only relative to the steady-state distribution of H.

In the case of full network inference, the inference procedure is a mapping from a space of samples to a space of networks, and it must be evaluated as such. There is a generated data set S, and the inference procedure is of the form ψ(S) = M. When a characteristic is being estimated, ψ(S) is a characteristic, for instance ψ(S) = F, a probability distribution. Focusing on full network inference, the goodness of an inference procedure ψ is measured relative to a distance, specifically μ(M, H) = μ(ψ(S), H), which is a function of the sample S. In fact, S is a realization of the random sampling process, Σ, governing data generation from H. Hence, μ(ψ(Σ), H) is a random variable, and the performance of ψ is characterized by the distribution of μ(ψ(Σ), H), which depends on the distribution of Σ. The salient statistic regarding the distribution of μ(ψ(Σ), H) is its mean, E_Σ[μ(ψ(Σ), H)], where the expectation is taken with respect to Σ.

More generally (and following [9]), rather than considering a single network, we can consider a distribution ℋ of random networks, where by definition the occurrences of realizations H of ℋ are governed by a probability distribution. This is precisely the situation in the classical study of random BNs. Averaging over the class of random networks, our interest focuses on

    μ*(ℋ, Σ, ψ) = E_ℋ[ E_Σ[ μ(ψ(Σ), H) ] ]    (5)

It is natural to call inference procedure ψ_1 better than inference procedure ψ_2, relative to the distance μ, the random network ℋ and the sampling procedure Σ, if

    μ*(ℋ, Σ, ψ_1) < μ*(ℋ, Σ, ψ_2)    (6)

In practice, the expectation must be estimated by an average,

    μ̂(ℋ, Σ, ψ) = (1/m) Σ_{j=1}^{m} μ(ψ(S_j), H_j)    (7)

where S_1, S_2, ..., S_m are sample-point sets generated according to Σ from networks H_1, H_2, ..., H_m randomly chosen from ℋ. In the usual way, good estimation requires a sufficiently large sample.
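Equation (7) is a plain Monte Carlo average, as in the sketch below. The callables draw_network, generate_sample and infer stand in for the random-network class ℋ, the sampling procedure Σ and the inference procedure ψ; none of their names or signatures come from the article.

import numpy as np

def estimate_mu_star(draw_network, generate_sample, infer, distance, m=100, seed=0):
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(m):
        H_j = draw_network(rng)            # H_j drawn from the random-network class
        S_j = generate_sample(H_j, rng)    # S_j generated according to Sigma
        vals.append(distance(infer(S_j), H_j))   # mu(psi(S_j), H_j)
    return float(np.mean(vals))            # estimate of mu* in Equation (7)

Comparing two procedures then amounts to comparing their estimates: per Equation (6), ψ_1 is judged better than ψ_2 if its estimated average distance is smaller.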
The preceding analysis applies virtually unchanged when a characteristic is being estimated. One need only replace H and ℋ by λ and Λ, where λ and Λ denote a characteristic and a random characteristic, respectively, and replace the network distance μ by the characteristic distance.

The Hamming distance in Equation (3) is commonly used to assess inference validity [16, 17]. Moreover, a number of non-distance measures related to the Hamming distance have been used in the context of regulatory graphs. A true-positive edge is a directed edge present in both H and M, and a true-negative edge is a directed edge present in neither H nor M. Let TP and TN be the numbers of true-positive and true-negative edges, respectively, and let FP and FN be the numbers of false-positive and false-negative edges (edges present only in M or only in H, respectively). The positive predictive value is defined by TP/(TP + FP), the sensitivity by TP/(TP + FN), and the specificity by TN/(TN + FP). The receiver operating characteristic (ROC) curve is obtained by plotting sensitivity against 1 − specificity as some threshold criterion is varied, thereby presenting graphically the trade-off between false positives and false negatives. The area under the ROC curve provides a scalar summary of the overall quality of the classifier. These kinds of measures have been used in a number of papers concerning regulatory-graph inference [18–23]. A recent paper describes a large study in which 29 proposed inference procedures are analyzed mainly according to their ability to predict regulatory interactions [24].
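These edge-recovery measures follow directly from the adjacency matrices of a ground-truth network H and an inferred network M, as in the following sketch (ours, for illustration; 0/1 matrices assumed).

import numpy as np

def edge_recovery_stats(A_true, A_inf):
    tp = int(np.sum((A_true == 1) & (A_inf == 1)))   # edges in both H and M
    tn = int(np.sum((A_true == 0) & (A_inf == 0)))   # edges in neither
    fp = int(np.sum((A_true == 0) & (A_inf == 1)))   # edges only in M
    fn = int(np.sum((A_true == 1) & (A_inf == 0)))   # edges only in H
    def safe(num, den):
        return num / den if den else float("nan")
    return {"PPV": safe(tp, tp + fp),
            "sensitivity": safe(tp, tp + fn),
            "specificity": safe(tn, tn + fp)}

A_true = np.array([[0, 1], [1, 0]])
A_inf = np.array([[0, 1], [0, 0]])
print(edge_recovery_stats(A_true, A_inf))   # PPV 1.0, sensitivity 0.5, specificity 1.0

When the inference procedure outputs edge confidences rather than hard calls, sweeping a threshold over the confidence matrix and plotting sensitivity against 1 − specificity traces out the ROC curve described above.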
We illustrate inference validity using a proposed method for BNs with perturbation that is based on the observation of a single dynamic realization of the network (see [11] for inference details). We are interested in the distance between the inferred network and the original network generating the data, where the distance function is μ_fun(M, H) in Equation (4).

[Figure 1: Performance of Boolean-network inference for connectivity k = 2, 3, based on μ_fun.]

Figure 1 shows the average (in percentage) of the distance function using 80 data sequences generated from 16 randomly generated BNs with 7 genes, perturbation probability p = 0.01, uniform connectivity k = 2 or k = 3, and data-sequence lengths varying from 500 to 40 000. The reduction in performance from connectivity 2 to connectivity 3 is not surprising, because the number of truth-table lines jumps dramatically.

An important issue in judging inference validity is the manner in which the data are simulated. An obvious approach is to generate the data directly from the network and then apply the proposed inference procedure; however, this approach may not reflect the full inference methodology, because in practice the data may be filtered prior to inference. The data may be filtered to reduce noise, augmented by missing-value estimation, or quantized or normalized in time, space or quantification. If any of these filtering techniques is used in practice, then it may be wise to generate the synthetic data in such a way as to better reflect real-world data and to apply the appropriate filtering scheme when validating inference (for instance, see [22]).

INFERENCE VALIDATION WITH REAL DATA

It is not uncommon in the literature to see a proposed inference procedure applied to one or more real data sets. The inferred network is compared with some other model network that has been human-constructed from the literature (and implicitly assumed to approximate the data-generating network). For instance, a directed graph (adjacency matrix), A, is constructed from relations found in the literature, and the Hamming distance is used to compare the network inferred from the data with the human-constructed network. The desire is to compare the result of the inference procedure with some characteristic having to do with existing biological knowledge. The problem is that the human-constructed regulatory graph may not be a good representative of the regulatory relations in the biological process. This can happen because the literature is incomplete, because insufficiently validated connections are reported in the literature, or because the conditions under which connections have been discovered (or not discovered) in certain papers are not compatible with the conditions under which the current data have been derived. In any of these situations, the overall validation procedure can be flawed owing to confounding by the imprecision of the human-constructed model relative to the data.

The problem can be quantitatively addressed by considering the human-constructed model as an approximation of a hypothesized 'real' network (a special case of the general problem of approximate ground truth treated in [9]). Suppose we do not know the random network ℋ generating the sample points from which we want to evaluate the inference procedure ψ, but we know a network N that we believe to be a good approximation to the networks in ℋ. We might then compare the inferred network to N in an effort to validate ψ. In effect, such a comparison approximates μ*(ℋ, Σ, ψ) in Equation (5) by E_Σ[μ(ψ(Σ), N)]. Using the triangle inequality, it is straightforward to show that

    E_Σ[μ(ψ(Σ), N)] − E_ℋ[μ(N, H)] ≤ E_ℋ[E_Σ[μ(ψ(Σ), H)]] ≤ E_Σ[μ(ψ(Σ), N)] + E_ℋ[μ(N, H)]    (8)

If E_ℋ[μ(N, H)] ≈ 0, then the preceding inequality leads to the approximation

    μ*(ℋ, Σ, ψ) ≈ E_Σ[μ(ψ(Σ), N)]    (9)

and it is reasonable to judge the performance of ψ relative to ℋ by E_Σ[μ(ψ(Σ), N)]. However, if E_ℋ[μ(N, H)] is not small, then both bounds in Equation (8) are loose and nothing can be asserted regarding the performance of ψ relative to the data sets on which it is being applied. Therefore, unless E_ℋ[μ(N, H)] is small, the entire validation procedure is flawed, because the approximation of ℋ by N confounds the procedure. But the only way to know that E_ℋ[μ(N, H)] is small is to estimate this expectation, which in an experimental setting would amount to testing the scientific validity of N as a model for the data.
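Numerically, Equation (8) says that the surrogate-based estimate brackets the true expected distance within ± E_ℋ[μ(N, H)]. The toy sketch below (ours; the two scalar arguments stand in for the estimated expectations) shows how quickly the bracket becomes uninformative as the surrogate error grows.

def surrogate_bounds(dist_to_surrogate, surrogate_error):
    # Bounds of Equation (8) on mu*(H, Sigma, psi), given
    # dist_to_surrogate ~ E_Sigma[mu(psi(Sigma), N)] and
    # surrogate_error  ~ E_H[mu(N, H)]; distances are nonnegative.
    lower = max(dist_to_surrogate - surrogate_error, 0.0)
    upper = dist_to_surrogate + surrogate_error
    return lower, upper

print(surrogate_bounds(0.20, 0.02))   # (0.18, 0.22): tight, informative
print(surrogate_bounds(0.20, 0.15))   # (0.05, 0.35): too loose to conclude anything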
CONCLUSION

We have reviewed scientific and inferential validation for regulatory networks. If systems biology is to be on a sound footing, then validation issues must be addressed carefully, and significantly more study of them needs to be undertaken at a fundamental level. Rigorous methods must be developed and analyzed, in particular for validation as it applies to stochastic dynamical networks. At present there is a paucity of investigations into validation, the net result being that biologists are uncertain of the status of networks proposed in the literature: under what conditions are they valid, and which of their characteristics can be considered trustworthy? This lack of attention is consistent with a general lack of attention to epistemology in systems and computational biology [25–35] (for instance, the egregious use of cross-validation), and this inattention will have a stultifying effect on progress if not rectified.

Key Points

- Network validation is critical if networks are to constitute biological knowledge.
- Validation requires the use of distance functions to measure the closeness of two networks.
- Scientific validation involves evaluating the concordance of predictions made from the model with corresponding experimental outcomes.
- Since inference algorithms are often employed to obtain networks, the performance of such algorithms should be quantified in the context of random sampling.

References

1. Shmulevich I, Dougherty ER. Probabilistic Boolean Networks: The Modeling and Control of Gene Regulatory Networks. Philadelphia: SIAM, 2010.
2. de Jong H. Modeling and simulation of genetic regulatory systems: a literature review. J Comput Biol 2002;9:67–103.
3. van Someren EP, Wessels LFA, Backer E, Reinders MJT. Genetic network modeling. Pharmacogenomics 2002;3:507–25.
4. Barabasi A-L, Oltvai ZN. Network biology: understanding the cell's functional organization. Nat Rev Genet 2004;5:101–13.
5. Shmulevich I, Dougherty ER. Genomic Signal Processing. Princeton: Princeton University Press, 2007.
6. Cho KH, Choo SM, Jung SH, et al. Reverse engineering of gene regulatory networks. IET Syst Biol 2007;1:149–63.
7. Schlitt T, Brazma A. Current approaches to gene regulatory network modeling. BMC Bioinformatics 2007;8(Suppl. 6):S9.
8. Sima C, Hua J, Jung S. Inference of gene regulatory networks using time-series data: a survey. Curr Genomics 2009;10:416–29.
9. Dougherty ER. Validation of inference procedures for gene regulatory networks. Curr Genomics 2007;8:351–9.
10. Kauffman SA. Metabolic stability and epigenesis in randomly constructed genetic nets. J Theor Biol 1969;22:437–67.
11. Marshall S, Yu L, Xiao Y, Dougherty ER. Inference of probabilistic Boolean networks from a single observed temporal sequence. EURASIP J Bioinformatics Syst Biol 2007: Article ID 32454.
12. Dougherty ER, Braga-Neto U. Epistemology of computational biology: mathematical models and experimental prediction as the basis of their validity. J Biol Syst 2006;14:65–90.
13. Einstein A. In: Schilpp PA (ed). The Philosophy of Bertrand Russell (The Library of Living Philosophers). Evanston and Chicago: Northwestern University, 1944:289.
14. de Jong H, Geiselmann J, Hernandez C, Page M. Genetic Network Analyzer: qualitative simulation of genetic regulatory networks. Bioinformatics 2003;19:336–44.
15. Batt G, Ropers D, de Jong H, et al. Validation of qualitative models of genetic regulatory networks by model checking: analysis of the nutritional stress response in Escherichia coli. Bioinformatics 2005;21(Suppl. 1):i19–28.
16. Zhao W, Serpedin E, Dougherty ER. Inferring connectivity of genetic networks using information theoretic criteria. IEEE/ACM Trans Comput Biol Bioinform 2008;5:262–74.
17. Liang S, Fuhrman S, Somogyi R. REVEAL, a general reverse engineering algorithm for inference of genetic network architectures. Pac Symp Biocomput 1998;3:18–29.
18. Margolin A, Nemenman I, Basso K, et al. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 2006;7(Suppl. 1):S7.
19. Chen X, Anantha G, Wang X. An effective structure learning method for constructing gene networks. Bioinformatics 2006;22:1367–74.
20. Zou M, Conzen SD. A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course microarray data. Bioinformatics 2005;21:71–9.
21. Bansal M, Belcastro V, Ambesi-Impiombato A, di Bernardo D. How to infer gene networks from expression profiles. Mol Syst Biol 2007;3:78.
22. Husmeier D. Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks. Bioinformatics 2003;19:2271–82.
23. Werhli AV, Grzegorczyk M, Husmeier D. Comparative evaluation of reverse engineering gene regulatory networks with relevance networks, graphical Gaussian models and Bayesian networks. Bioinformatics 2006;22:2523–31.
24. Marbach D, Prill RJ, Schaffter T, et al. Revealing strengths and weaknesses of methods for gene network inference. Proc Natl Acad Sci USA 2010;107:6286–91.
25. Mehta T, Tanik M, Allison DB. Towards sound epistemological foundations of statistical methods for high-dimensional biology. Nat Genet 2004;36:943–7.
26. Hanczar B, Hua J, Dougherty ER. Decorrelation of the true and estimated classifier errors in high-dimensional settings. EURASIP J Bioinformatics Syst Biol 2007: Article ID 38473.
27. Braga-Neto UM. Fads and fallacies in the name of small-sample microarray classification. IEEE Signal Process Mag 2007;24:91–9.
28. Keller EF. Revisiting scale-free networks. BioEssays 2005;27:1060–8.
29. Dupuy A, Simon RM. Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting. J Natl Cancer Inst 2007;99:147–57.
30. Brun M, Sima C, Hua J, et al. Model-based evaluation of clustering validation measures. Pattern Recogn 2007;40:807–24.
31. Dougherty ER. On the epistemological crisis in genomics. Curr Genomics 2008;9:69–79.
32. Boulesteix A-L, Strobl C. Optimal classifier selection and negative bias in error rate estimation: an empirical study on high-dimensional prediction. BMC Med Res Methodol 2009;9:85.
33. Boulesteix A-L. Over-optimism in bioinformatics research. Bioinformatics 2010;26:437–9.
34. Yousefi MR, Hua J, Sima C, Dougherty ER. Reporting bias when using real data sets to analyze classification performance. Bioinformatics 2010;26:68–76.
35. Jelizarow M, Guillemot V, Tenenhaus A, et al. Over-optimism in bioinformatics: an illustration. Bioinformatics 2010;26:1990–8.