Computer-Aided Engineering for Inference of Genetic Regulatory Networks Using Data from DNA Microarrays

Shih Chi Peng1 and Chuan Yi Tang1,2,*

1 Department of Computer Science, National Tsing Hua University, Hsinchu 30013, Taiwan, ROC
[email protected], [email protected]
2 Department of Computer Science and Information Engineering, Providence University, Taichung 43301, Taiwan, ROC

Received 3 July 2010; Revised 25 August 2010; Accepted 15 September 2010

Abstract. Biological research is gradually shifting from structural genomics to functional genomics. DNA microarrays have been used to generate abundant data for exploring the functions of genes and the interactions among them. We propose a reverse-engineering strategy to predict the interactions between genes within a genetic network. Our inputs are perturbation matrices obtained experimentally from DNA microarrays. First, we make some assumptions about the interactions in the network, which is represented as a directed graph. We then enumerate all possible network models that satisfy the assumptions. Candidate models are those whose perturbation matrices, calculated by computational simulation, match the experimental one. The network involves not only the transcription level but also nucleotide/protein interactions in general. To justify this method, we test it on a well-known genetic regulatory network in the yeast Saccharomyces cerevisiae. The result shows that one of the candidate models generates a perturbation matrix identical to the one that others determined experimentally from the yeast’s DNA microarrays. In conclusion, our method is useful and feasible for determining probable interactions within biological networks.

Keywords: Reverse engineering, genetic regulatory network, DNA microarray

1 Introduction

In the post-genomics era, an important goal in biology is to elucidate the interaction principles that may provide a common ground underlying the interactive networks in cells.
However, we need powerful computational tools to analyze the more complicated data obtained from sources such as DNA microarrays and to infer the networks. Biologists exploit biochemical experiments to infer and model such networks. We usually know only the relationships between input and output revealed by experiments, without understanding the internal functions and interactions among the different components of the system. Such a system is like a black box for which we have no specifications but only some vague understanding. However, we can use reverse engineering to reconstruct the probable internal connections among the functional components of the black box. We may then have an opportunity to modify the system further for better performance if necessary.

Chikofsky and Cross [1] defined reverse engineering as follows: “Reverse engineering is the process of analyzing a subject system to identify the system's components and their inter-relationships, and to create representations of the system in another form at higher levels of abstraction.” This idea has been widely used in various fields, including protocol testing [2]-[5], relational databases [6], and computer-aided design [7]. Here we use such a method to meet the following challenge: inferring probable regulatory networks from the abundant data produced by DNA microarrays. Without reverse engineering, Chen et al. [8] have shown that the problem of inferring maximum gene regulations from multiple time-series data is NP-complete, even with highly restricted vertex degrees.

The advent of DNA microarray technology [9], [10] provides an excellent tool for large-scale gene expression analyses. This technology is one of the most important developments in experimental molecular biology in decades, and it allows biologists to monitor the expression of genes on the genomic scale.
This means that it is possible to measure the expression of tens of thousands of genes in parallel in a single experiment. Thanks to this new tool, a vast amount of data can be produced rapidly and used for reverse engineering. The objectives of this paper are to reconstruct probable models of genetic regulatory networks and to test their validity by computer simulation.

* Corresponding author

Peng and Tang: CAE for Inference of Genetic Regulatory Networks Using Data from DNA Microarrays
Journal of Computers Vol.21, No.3, October 2010

2 Biological Meanings of the Related Interactions

We divide the interactions within the gene regulatory network into two levels: the transcription level and the other nucleotide/protein level. Throughout the paper, we illustrate the interactions with the different forms of edges shown in Figure 1.

At the transcription level, the interaction occurs between a protein and a DNA sequence. Proteins called transcription factors control gene expression by binding to promoters or regulatory sequences, which are substrings of the DNA sequence that determine the efficiency of transcription, as illustrated in Figure 2.

Interactions at the other level include those that occur between proteins (the products of genes) and objects other than DNA. A protein affects one specific substrate, which could be an mRNA, a protein, or a metabolite. Mutation of the gene coding for an upstream protein usually affects the interactions of the downstream proteins in the genetic regulatory network. Here we call a protein at the head/tail of a directed edge at the protein level upstream/downstream, and so forth. Sometimes, if a protein is inactivated by other proteins, deleting that protein has no effect on the activities of the proteins downstream (see Figure 3).

Fig. 1. Illustrations of different forms of edges, as well as their elucidations.

Fig. 2. An example in which a transcription factor (protein A) binds to a specific part of a DNA sequence and then induces the transcription of specific genes (gene B and gene C).

[Figure 3: paired network panels 1 to 7 and 1’ to 7’ over nodes A, B, X, and Y]

Fig. 3. The networks on the left show the original states of the interactions; the networks on the right show how the upstream nodes eventually affect the concentration of node Y. Some transcriptional edges become inactive because of interactions at the protein level by the upstream nodes. A dotted line denotes an inactive interaction.

[Figure 4 flowchart: DNA microarray; input data (perturbation matrix); modeling (hypotheses); enumerate all possible models; simulation (activity, concentration); output data; delete models whose perturbation matrices don’t meet the real one; candidate set; unique? If no, distinguishable experiment; if yes, gene regulatory network]

Fig. 4. Flowchart of the strategy for inferring genetic regulatory networks. The bold frames are jobs that concern computational analyses; the thin frames are jobs that concern biochemical experiments.

Fig. 5. Pseudocode of the hierarchical algorithm that enumerates all possible models by appending edges at (a) the transcription level and (b) the protein level.

3 Inference Method

We propose an experimentally feasible strategy that combines biochemical experiments and computational simulation. Figure 4 illustrates the flowchart of our strategy. First, the input data is the perturbation matrix obtained from DNA microarray experiments. The gene expression is perturbed by a specific mutation, and the ratio of the perturbed gene expression to the unperturbed one is calculated.
We simply code the ratio as +1, -1, or 0, which represent an increase, a decrease, and no change in expression, respectively.

In this initial demonstration, some hypotheses are made to simplify the model. Our model is a directed graph G=(V, E) consisting of a set V={P0, P1, …, Pm} of nodes and a set E={E00, E01, …, Eij, …, Emm} of edges. Pi represents the protein affected by the ith gene, and P0 represents a transcription factor that is the leading factor in the network. Eij denotes a directed edge from Pi to Pj, which performs DNA-protein activation, protein-protein activation, protein-protein inhibition, or non-action. We omit the non-active edges in our illustrations. The hypotheses are as follows:

1. The model has as many nodes as the practical genetic regulatory network.
2. One gene affects one protein in our networks.
3. The concentrations of mRNAs and of the corresponding expressed proteins are linearly related.
4. A protein in our networks can affect only one particular object, either a promoter or another protein.

According to the hypotheses, we use brute-force algorithms to enumerate all possible models. We divide the enumeration process into two phases: the first appends edges at the transcription level, and the second appends edges at the protein level. We call the illustration of the enumeration an enumerating tree.

In the first algorithm, shown in Figure 5(a), we append edges at the transcription level between nodes. The architecture of this algorithm is hierarchical and uses a breadth-first mechanism with the aid of stacks. To begin the enumeration, we take an edgeless network as the root model (parent) at level 0 of the enumerating tree and then add possible edges, 1, 2, …, up to m edges at a time, from the node at height 0 (P0) to the isolated nodes P1 to Pm, taking all possible combinations into account.
Every new edge-appending step yields a new model (child) inherited from the root model. Next, we regard the new models as roots at level 1 and append possible edges from nodes at height 1 to isolated ones to enumerate further new models. This step is repeated until all possible models are enumerated, that is, up to level m. The resulting models form the individual set, to which we then apply the other algorithm.

In the second algorithm, shown in Figure 5(b), we append edges at the protein level between nodes of the models in the individual set generated by the first algorithm. Because an activation edge and an inhibition edge are mutually exclusive, they do not coexist between two particular nodes. Therefore, we can first treat them as identical edges to sketch the network roughly and assign the functional meaning of the edges in later steps. This algorithm is recursive. To begin the enumeration, it takes one model out of the individual set as the root model of the enumerating tree. It then counts the leaf nodes of that model and numbers them from 1 to k. It appends one edge to leaf node no. 1, pointing at another node; each different target constructs a new model (child) inherited from the root model (parent). Every newly constructed model is regarded as a new root for the next recursion, which adds an edge to leaf node no. 2, pointing at a third node. The recursion stops after the edge on the kth leaf node has been added. Ultimately, the leaf models of the enumerating tree are the completely inferred models, and they form the network set used for simulation. If there are k leaf nodes in the original root model, the algorithm enumerates at most (1+m)^k networks in the network set. When the two kinds of edges at the protein level are taken into account, it enumerates at most (1+m)^k · 2^k networks.
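The counting argument for the protein-level enumeration can be checked with a small sketch. It is not the pseudocode of Figure 5(b): the recursion is replaced by itertools.product, the function name enumerate_protein_level and the tuple encoding of edges are hypothetical, and it is assumed that each leaf node either receives no edge or points at any one other node. Repeated networks and loops are not filtered out.

```python
from itertools import product


def enumerate_protein_level(nodes, leaves, kinds=("activation", "inhibition")):
    """Yield every way to give each leaf node at most one protein-level edge.

    nodes  -- all node names of the model (P0 ... Pm), an assumption of this sketch
    leaves -- the k leaf nodes that may receive an outgoing protein-level edge
    kinds  -- edge types; pass a single kind to enumerate untyped edges
    Each yielded model is a list of (source, target, kind) tuples.
    """
    per_leaf_options = []
    for leaf in leaves:
        options = [None]  # the "no edge" choice for this leaf
        for target in nodes:
            if target == leaf:
                continue  # a leaf does not point at itself
            for kind in kinds:
                options.append((leaf, target, kind))
        per_leaf_options.append(options)

    # One choice per leaf, all combinations: the leaves of the enumerating tree.
    for combination in product(*per_leaf_options):
        yield [edge for edge in combination if edge is not None]


nodes = ["P0", "P1", "P2", "P3"]   # m = 3
leaves = ["P2", "P3"]              # k = 2
untyped = list(enumerate_protein_level(nodes, leaves, kinds=("edge",)))
typed = list(enumerate_protein_level(nodes, leaves))
print(len(untyped), len(typed))
```

With m = 3 and k = 2, the untyped enumeration gives (1+3)^2 = 16 networks. The typed enumeration gives 49, which stays below the stated bound of 16 · 2^2 = 64 because the "no edge" choice carries no type.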
For the time being, we ignore networks that are repeated or have loops, because neither kind of information is available from DNA microarrays alone; it must be measured by other kinds of biochemical experiments.

After enumerating all possible models, the issue becomes a search problem: finding the best model among the enumerated ones by comparing the simulated data from our models with the real data from the black box. Here we propose algorithms to obtain the perturbation matrices of the models. We imitate the mutations of perturbation assays: we mutate the genes one at a time and detect the variation in the expression of all the other genes in the network. Before designing the algorithms, we have to understand how interactions at the protein level may affect those at the transcription level, which can be measured by DNA microarrays.

Fig. 6. Examples showing how interactions at the protein level affect those at the transcription level, together with the resulting changes of concentration in the perturbation matrix. The sign △ means that the gene of the protein is mutated. The sign -/+ means that the concentration of the specific protein is decreased/increased after a protein is deleted. A blank space means no change.

For instance, in Figure 6, protein A originally activates (6a) or inhibits (6b) protein B, which is a transcription factor that binds to a DNA sequence to activate protein C. Precisely speaking, the presence of protein A regulates the binding of protein B to a specific promoter in a DNA sequence, and thereby the activation of protein C. Now we mutate the gene of protein A; in other words, we reduce its concentration. Consequently, this perturbation reduces/enhances the ability of protein B to bind to the DNA sequence. Finally, the concentration of protein C is decreased/increased.
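The two Figure 6 cases can be mimicked with a small Boolean sketch. It is an illustration under simplifying assumptions that are not part of the original method: concentrations and activities are on/off, a gene without a transcriptional activator is always expressed, the network is acyclic, and the names simulate and perturbation_matrix are hypothetical.

```python
def simulate(nodes, transcription, protein_edges, deleted=None):
    """Return {node: 0 or 1} concentrations of an acyclic on/off network.

    transcription -- set of (tf, gene) pairs: tf activates transcription of gene
    protein_edges -- dict {(regulator, target): "activation" | "inhibition"}
    deleted       -- name of the mutated gene, or None for the wild type
    """
    conc = {n: 0 if n == deleted else 1 for n in nodes}
    active = dict(conc)
    for _ in range(len(nodes) + 1):  # enough sweeps for an acyclic network to settle
        for n in nodes:
            tfs = [tf for tf, gene in transcription if gene == n]
            if n == deleted:
                conc[n] = 0
            elif tfs:
                # Expressed only if some transcription factor is active.
                conc[n] = int(any(active[tf] for tf in tfs))
            else:
                conc[n] = 1  # assumption: constitutive expression
            regs = [(r, k) for (r, t), k in protein_edges.items() if t == n]
            activators = [r for r, k in regs if k == "activation"]
            inhibitors = [r for r, k in regs if k == "inhibition"]
            # Active if present, all activators active, and no inhibitor active.
            active[n] = int(
                conc[n] == 1
                and all(active[r] for r in activators)
                and not any(active[r] for r in inhibitors)
            )
    return conc


def perturbation_matrix(nodes, transcription, protein_edges):
    """Map each deleted gene to the sign (+1, -1, 0) of every concentration change."""
    wild = simulate(nodes, transcription, protein_edges)
    matrix = {}
    for gene in nodes:
        mutant = simulate(nodes, transcription, protein_edges, deleted=gene)
        matrix[gene] = {n: mutant[n] - wild[n] for n in nodes}
    return matrix


nodes = ["A", "B", "C"]
fig_6a = perturbation_matrix(nodes, {("B", "C")}, {("A", "B"): "activation"})
fig_6b = perturbation_matrix(nodes, {("B", "C")}, {("A", "B"): "inhibition"})
print(fig_6a["A"], fig_6b["A"])
```

In this sketch, deleting A in the Figure 6(a) network gives -1 for A, 0 for B, and -1 for C. In the Figure 6(b) network, deleting A gives +1 for C, while deleting B leaves C unchanged.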
In this process, the concentrations of protein A and protein C change but that of protein B does not, compared with the wild-type condition. These variations can be measured by DNA microarrays. On the other hand, when we mutate the gene of protein B, the concentration of protein C decreases in the case of Figure 6(a). In the case of Figure 6(b), however, there is no change in C. This is because the interaction between protein B and protein C is originally inhibited by protein A, so the connection between B and C is hidden.

4 Validation Test

To justify this method, we take a practical genetic regulatory network (Figure 7) [9]-[11] to evaluate the reliability of our inference strategy.

Fig. 7. Model of galactose utilization. The red dotted box marks the genetic regulatory network primarily consisting of GAL4, GAL80, and GAL3 [9].

How do the proteins interact with one another in the network? The protein GAL4 is the transcription factor in this network; another protein, GAL80, inhibits its activity in the absence of galactose. In the presence of galactose, a third protein, GAL3, is activated to inhibit GAL80. Consequently, GAL4 is released and able to bind to the DNA sequence for gene transcription.

Figure 8 shows the perturbation matrix of the GAL pathway, cited from Fig. 2 in [9]. DNA microarrays were used to measure the expression profiles of various genes under each of 20 perturbations in the GAL pathway. Each spot in this matrix represents the change of a GAL gene’s expression due to a particular perturbation, made by mutating the specific gene listed at the top of each column. The gray level of a spot represents the ratio of the change in concentration under the perturbed condition to that under the wild-type condition.

Fig. 8. A perturbation matrix of the GAL pathway measured by DNA microarray. Each spot represents the change in expression of a GAL gene due to a particular perturbation. The left half of the matrix shows experiments with galactose; the right half shows experiments without galactose. The far left columns show changes compared between wild-type experiments without and with galactose [9].

We choose GAL4, GAL3, GAL80, and GAL5 to be our sample nodes, i.e., we predict the candidate models of the genetic regulatory network consisting of these five proteins. The data in the two matrices of Figure 9 are picked out from the matrix in Figure 8. We take these two perturbation matrices (a and b) as the input data of our two inference algorithms (a and b), respectively. After a series of systematic and automated processes, we get the inferred candidate models whose simulated perturbation matrices meet the matrices from experiments. Figure 10 exhibits the candidate models that meet the matrix in Figure 9(a), and Figure 11 exhibits those that meet the matrix in Figure 9(b). We then hand these two sets of candidate models to biologists for further verification.

Fig. 9. (a) Perturbation matrix from experiments with galactose. (b) Perturbation matrix from experiments without galactose. The sign +/- represents a significant increase/decrease.

According to our previous assumptions, the interactions between two nodes are directed and primary. However, our candidate models may be partial interactions within one larger network. Take the candidate models in Figure 10(b) and Figure 11(b) as an example: the model in Figure 11(b) is a subgraph of the model in Figure 10(b). The only difference is the protein-inhibition interaction between GAL3 and GAL80, which could be activated by the advent of galactose. Figure 12 illustrates the resultant network, which yields the same gene expression results as the experiments and has the same structure as shown in Figure 7. This suggests that our computational inference has biological significance.

Fig. 10. Candidate models that meet the perturbation matrix in Figure 9(a).

Fig. 11. Candidate models that meet the perturbation matrix in Figure 9(b).

Fig. 12. Candidate models predicted by referring to two different experiments. The red arrow points at the node that is activated by the galactose in the cell.

5 Conclusions

In this paper, we suggest a reverse-engineering approach to infer interactions in a genetic regulatory network. Our model consists of interactions at the transcription level and at other levels. DNA microarrays can be used to provide our input data. Candidate models whose simulated data are consistent with the input data are subjected to biological verification. We have shown that our preliminary candidate models do contain a biologically significant result.

References

[1] E.J. Chikofsky and J.H. Cross, “Reverse engineering and design recovery: A taxonomy,” IEEE Software, Vol. 7, No. 1, pp. 13-17, 1990.
[2] D. Lee, “Reverse Engineering of Communication Protocols,” Proceedings of International Conference on Network Protocols, pp. 218-216, 1993.
[3] K. Saleh, R. Probert, K. Al-Saqabi, “Recovery of CFSM-based Protocol and Service Design from Protocol Execution Traces,” Information and Software Technology, Vol. 41, No. 11-12, pp. 839-852, 1999.
[4] I. Baxter and M. Mehlich, “Reverse Engineering is Reverse Forward Engineering,” Science of Computer Programming, Vol. 36, pp. 131-147, 2000.
[5] K. Saleh and A. Boujarwah, “Communications Software Reverse Engineering: a Semi-Automatic Approach,” Information and Software Technology, Vol. 38, No. 6, pp. 379-390, 1996.
[6] R. Chiang, T. Barron, A. Storey, “A Framework for the Design and Evaluation of Reverse Engineering Methods for Relational Databases,” Data & Knowledge Engineering, Vol. 21, No. 1, pp. 57-77, 1996.
[7] T. Varady, R. Martin, J. Cox, “Reverse Engineering of Geometric Models-An Introduction,” Computer-Aided Design, Vol. 29, No. 4, pp. 255-268, 1997.
[8] T. Chen, V. Filkov, S. Skiena, “Identifying Genetic Regulatory Networks from Experimental Data,” Proceedings of RECOMB’99, 1999.
[9] T. Ideker, V. Thorsson, J.A. Ranish, R. Christmas, J. Buhler, J.K. Eng, R. Bumgarner, D.R. Goodlett, R. Aebersold, L. Hood, “Integrated Genomic and Proteomic Analyses of a Systematically Perturbed Metabolic Network,” Science, Vol. 292, No. 5518, pp. 929-933, 2001.
[10] R.W. Ye, T. Wang, L. Bedzyk, K.M. Croker, “Applications of DNA Microarrays in Microbial Systems,” Journal of Microbiological Methods, Vol. 47, No. 3, pp. 257-272, 2001.
[11] A.J. Hartemink, D.K. Gifford, T.S. Jaakkola, R.A. Young, “Using Graphical Models and Genomic Expression Data to Statistically Validate Models of Genetic Regulatory Networks,” Pacific Symposium on Biocomputing’01, Vol. 6, pp. 422-433, 2001.