Computer-Aided Engineering for Inference of Genetic Regulatory
Networks Using Data from DNA Microarrays
Shih Chi Peng1 and Chuan Yi Tang1,2,*
1 Department of Computer Science,
National Tsing Hua University,
Hsinchu 30013, Taiwan, ROC
[email protected], [email protected]
2 Department of Computer Science and Information Engineering,
Providence University,
Taichung 43301, Taiwan, ROC
Received 3 July 2010; Revised 25 August 2010; Accepted 15 September 2010
Abstract. Biological research is gradually shifting from structural genomics to functional genomics. DNA
microarrays have been used to generate abundant data for exploring the functions of genes and the
interactions among them. We propose a reverse-engineering strategy to predict the interactions between
genes within a genetic network. Our inputs are perturbation matrices obtained experimentally from DNA
microarrays. First, we make some assumptions about the interactions in the network, which is represented
as a directed graph. We then enumerate all possible network models that satisfy the assumptions. Finally,
we obtain candidate models by keeping those whose perturbation matrices, calculated by computational
simulation, match the experimental ones. The network involves not only interactions at the transcription
level but also nucleotide/protein interactions in general. To validate this method, we test it on a
well-known genetic regulatory network in the yeast Saccharomyces cerevisiae. The result shows that one of
the candidate models generates a perturbation matrix identical to the one obtained from the yeast's DNA
microarrays, which was determined experimentally by others. In conclusion, our method is useful and
feasible for determining probable interactions within biological networks.
Keywords: Reverse engineering, genetic regulatory network, DNA microarray
1 Introduction
In the post-genomic era, an important goal in biology is to elucidate the interaction principles that may
provide a common ground underlying the interactive networks in cells. However, we need powerful
computational tools to analyze the increasingly complicated data obtained from sources such as DNA
microarrays and to infer the networks. Biologists use biochemical experiments to infer and model such
networks. We usually know only the relationships between input and output revealed by experiments, without
understanding the internal functions and interactions among the different components of the system. Such a
system is like a black box for which we have no specifications, only a vague understanding. However, we can
use reverse engineering to reconstruct the probable internal connections among the functional components of
the black box. We may then have an opportunity to modify the system further for better performance if
necessary.
Chikofsky and Cross [1] defined reverse engineering as follows: “Reverse engineering is the process of
analyzing a subject system to identify the system's components and their inter-relationships, and to create
representations of the system in another form at higher levels of abstraction.” This idea has been widely
used in various fields, including protocol testing [2]-[5], relational databases [6], and computer-aided
design [7]. Here we use such a method to meet the following challenge: inferring probable regulatory
networks from the abundant data produced by DNA microarrays. Without reverse engineering, Chen et al. [8]
have shown that the problem of inferring maximum gene regulations from multiple time-series data is
NP-complete, even with highly restricted vertex degrees.
The advent of DNA microarray technology [9], [10] provides an excellent tool for large-scale gene
expression analyses. This technology is one of the most important developments in experimental molecular
biology in decades, and it allows biologists to monitor the expression of genes on the genomic scale. This
means that it is possible to measure the expression of tens of thousands of genes in parallel in a single
experiment.
* Corresponding author
Peng and Tang: CAE for Inference of Genetic Regulatory Networks Using Data from DNA Microarrays
Thanks to this new tool, a vast amount of data can be produced rapidly and used for reverse engineering.
The objectives of this paper are to reconstruct probable models of genetic regulatory networks and to test
their validity by computer simulation.
2 Biological Meanings of the Related Interactions
We divide the interactions within the gene regulatory network into two levels: the transcription level and
the other nucleotide/protein level. Throughout the paper, we illustrate the interactions with the different
forms of edges shown in Figure 1.
At the transcription level, the interaction occurs between a protein and a DNA sequence. Proteins called
transcription factors control gene expression by binding to promoters or regulatory sequences, which are
substrings of the DNA sequence responsible for the efficiency of transcription, as illustrated in Figure 2.
On the other hand, interactions at the other level include those that occur between proteins (gene
products) and objects other than DNA. A protein affects one specific substrate, which could be an mRNA, a
protein, or a metabolite. Mutation of the gene coding for an upstream protein usually affects the
interactions of the downstream proteins in the genetic regulatory network. Here we call a protein at the
head/tail of a directed edge at the protein level upstream/downstream, respectively. Sometimes, if a protein
is inactivated by other proteins, deleting that protein has no effect on the activities of the proteins
downstream (see Figure 3).
Fig. 1. Illustrations of the different forms of edges and their meanings.
Fig. 2. An example in which a transcription factor (protein A) binds to a specific part of a DNA sequence
and then induces the transcription of specific genes (gene B and gene C).
Journal of Computers
Vol.21, No.3, October 2010
[Figure 3: network diagrams in panels 1-7 and 1'-7' over nodes A, B, X, and Y; see caption.]
Fig. 3. The left networks show the original states of the interactions, while the right networks show how
the upstream nodes eventually affect the concentration of node Y. Some transcriptional edges become
inactive because of the interactions at the protein level by the upstream nodes. A dotted line denotes that
the interaction is inactive.
[Figure 4: flowchart with boxes labeled Input Data (DNA Microarray, Perturbation Matrix); Modeling
(Hypotheses); Enumerate all possible models; Simulation (activity, concentration); Output Data; Delete
models whose perturbation matrices don't meet the real one; Candidate Set; Unique? (Yes: Gene Regulatory
Network; No: Distinguishable Experiment).]
Fig. 4. Flowchart of the strategy for inferring genetic regulatory networks. The bold frames are jobs
concerning computational analyses; the thin frames are jobs concerning biochemical experiments.
Fig. 5. Pseudocode of the hierarchical algorithm that enumerates all possible models by appending edges at
(a) the transcription level and (b) the protein level.
3 Inference Method
We propose an experimentally feasible strategy that combines biochemical experiments and computational
simulation. Figure 4 illustrates the flowchart of our strategy.
First, the input data is the perturbation matrix obtained from DNA microarray experiments. Gene expression
is perturbed by a specific mutation, and the ratio of the perturbed gene expression to the unperturbed one
is calculated. We code the ratio simply as +1, -1, or 0, which represent an increase, a decrease, and no
change in expression, respectively.
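The coding step can be sketched as follows. This is a minimal illustration only: the two-fold cutoff and the names `code_ratio` and `code_matrix` are assumptions made for the example, since the text does not state how large a change must be to count as an increase or a decrease.

```python
def code_ratio(perturbed, unperturbed, threshold=2.0):
    """Code an expression ratio as +1 (increase), -1 (decrease), or 0 (no change).

    `threshold` is an assumed fold-change cutoff; expression values are
    assumed to be positive.
    """
    ratio = perturbed / unperturbed
    if ratio >= threshold:
        return 1
    if ratio <= 1.0 / threshold:
        return -1
    return 0


def code_matrix(perturbed_rows, wild_type, threshold=2.0):
    """Code one row per perturbation against the same wild-type profile."""
    return [
        [code_ratio(p, w, threshold) for p, w in zip(row, wild_type)]
        for row in perturbed_rows
    ]
```

Each row of the coded matrix then corresponds to one mutation and each column to one measured gene.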
In this initial demonstration, some hypotheses are made to simplify the model. Our model is a directed graph
G = (V, E) consisting of a set V = {P0, P1, …, Pm} of nodes and a set E = {E00, E01, …, Eij, …, Emm} of
edges. Pi represents the protein affected by the i-th gene, and P0 represents a transcription factor that is
the leading factor in the network. Eij denotes a directed edge from Pi to Pj, which stands for DNA-protein
activation, protein-protein activation, protein-protein inhibition, or non-action. We omit the non-active
edges in our illustrations. The hypotheses are as follows:
1. The model has as many nodes as the practical genetic regulatory network.
2. We assume that one gene affects one protein in our networks.
3. We assume that the concentrations of mRNAs and of the corresponding expressed proteins are linearly
related.
4. We assume that in our networks a protein can affect only one particular object, either a promoter or
another protein.
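One possible in-memory representation of such a model is sketched below. The names `EdgeType`, `out_edges`, and `respects_single_target` are hypothetical, and the check encodes only one reading of hypothesis 4 (a node with a protein-level edge has no other outgoing edge), which is an assumption rather than a rule stated in the text.

```python
from enum import Enum


class EdgeType(Enum):
    DNA_ACTIVATION = "DNA-protein activation"
    PROTEIN_ACTIVATION = "protein-protein activation"
    PROTEIN_INHIBITION = "protein-protein inhibition"


def out_edges(model, i):
    """Return {target: edge type} for the edges leaving node i.

    `model` maps (source, target) index pairs to an EdgeType; pairs that are
    absent stand for non-action edges.
    """
    return {j: kind for (src, j), kind in model.items() if src == i}


def respects_single_target(model):
    """Assumed reading of hypothesis 4: a protein-level edge is a node's only edge."""
    for i in {src for (src, _) in model}:
        kinds = list(out_edges(model, i).values())
        has_protein_edge = any(k is not EdgeType.DNA_ACTIVATION for k in kinds)
        if has_protein_edge and len(kinds) > 1:
            return False
    return True
```

Storing only the active edges mirrors the convention of omitting non-active edges from the illustrations.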
According to the hypotheses, we use brute-force algorithms to enumerate all possible models. We divide the
enumeration process into two phases: one appends edges at the transcription level, and the other appends
edges at the protein level. We call the diagram of the enumeration an enumerating tree.
In the first algorithm, shown in Figure 5(a), we append edges at the transcription level between nodes. The
algorithm is hierarchical and uses a breadth-first mechanism with the aid of stacks. To begin the
enumeration, we take an edgeless network as the root model (parent) at level 0 of the enumerating tree. We
then add edges to the root model from the node at height 0 (P0) to the isolated nodes P1~Pm, adding 1, 2, …,
up to at most m edges at a time and taking all possible combinations into account. Every edge-appending step
yields a new model (child) inherited from the root model. Next, we regard the new models as roots at level 1
and append possible edges within them from the nodes at height 1 to the isolated ones, which enumerates
further new models. This step is repeated until all possible models are enumerated, that is, up to level m.
In the following steps, we apply the other algorithm to the models in the resulting individual set.
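The level-by-level enumeration can be sketched as below. This is not the pseudocode of Figure 5(a): it assumes that each isolated node gains at most one transcriptional parent, it uses a queue (`collections.deque`) for the breadth-first order, and it keeps the edgeless root in the result. The function name is hypothetical.

```python
from collections import deque
from itertools import product


def enumerate_transcription_models(m):
    """Enumerate transcription-level edge sets over nodes 0..m, level by level."""
    # A state is (edges so far, nodes at the current height, still-isolated nodes).
    root = (frozenset(), (0,), frozenset(range(1, m + 1)))
    queue = deque([root])
    models = []
    while queue:
        edges, frontier, isolated = queue.popleft()
        models.append(edges)
        targets = sorted(isolated)
        # Each isolated node either stays isolated (None) or gains one
        # parent among the nodes at the current height.
        for parents in product((None,) + frontier, repeat=len(targets)):
            new_edges = {(p, t) for p, t in zip(parents, targets) if p is not None}
            if not new_edges:
                continue  # a child model must append at least one edge
            attached = tuple(sorted(t for _, t in new_edges))
            queue.append((edges | new_edges, attached, isolated - set(attached)))
    return models
```

For m = 2, this sketch yields six models: the edgeless root, three children that attach P1, P2, or both to P0, and two chains of length two.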
In the second algorithm, shown in Figure 5(b), we append edges at the protein level between nodes of the
models in the individual set generated by the first algorithm. Because an activation edge and an inhibition
edge are mutually exclusive, they do not coexist between two particular nodes. Therefore, we can first treat
them as identical edges to sketch the network roughly, and assign the functional meaning of the edges in
later steps. This algorithm is recursive. To begin the enumeration, it takes one model (an individual) out
of the individual set as the root model of the enumerating tree. It then counts the leaf nodes of the
individual and numbers them from 1 to k. It appends one edge to leaf node no. 1, pointing at another node.
Each different target constructs a new model (child) inherited from the root model (parent). Every newly
constructed model is regarded as a new root for the next recursion, which adds an edge to leaf node no. 2
pointing at a third node. The recursion stops after the edge on the k-th leaf node has been added.
Ultimately, the leaf models in the enumerating tree are completely inferred models and form the network set
used for simulation. If there are k leaf nodes in the original root model, the algorithm enumerates at most
(1+m)^k networks in the network set. When the two kinds of edges at the protein level are considered, it
enumerates at most (1+m)^k · 2^k networks. For the time being, we ignore networks that are repeated or have
loops, because neither kind of information is available from DNA microarrays alone; it would have to be
measured by other kinds of biochemical experiments.
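The protein-level enumeration and its count can be illustrated with the sketch below. It swaps the recursion for `itertools.product`, which visits the same combinations, and it does not filter out repeated networks or loops; the function name and the string labels for edge kinds are assumptions for the example.

```python
from itertools import product


def enumerate_protein_models(num_nodes, leaves, kinds=("activation", "inhibition")):
    """Yield dicts {(leaf, target): kind} with at most one edge per leaf node."""
    per_leaf = []
    for leaf in leaves:
        options = [None]  # the leaf keeps no protein-level edge
        for target in range(num_nodes):
            if target != leaf:  # no self-loops
                options.extend((leaf, target, kind) for kind in kinds)
        per_leaf.append(options)
    for choice in product(*per_leaf):
        picked = [c for c in choice if c is not None]
        yield {(src, dst): kind for src, dst, kind in picked}


# Three nodes (m = 2) and k = 2 leaf nodes.
unlabeled = list(enumerate_protein_models(3, [1, 2], kinds=("edge",)))
labeled = list(enumerate_protein_models(3, [1, 2]))
```

With a single edge kind, the sketch produces exactly (1+m)^k = 9 networks. With two kinds it produces (1+2m)^k = 25, which stays within the (1+m)^k · 2^k = 36 bound because a missing edge carries no kind.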
After all possible models are enumerated, the issue becomes a search problem: finding the best model among
the enumerated ones by comparing the simulated data from our models with the real data from the black box.
Here we propose algorithms to obtain the perturbation matrices of the models. We imitate the mutations of
perturbation assays, mutating one gene at a time and detecting the variation in expression of all the other
genes in the network. Before designing the algorithms, we have to understand how interactions at the protein
level may affect those at the transcription level, which can be measured by DNA microarrays.
Fig. 6. Examples showing how interactions at the protein level affect those at the transcription level,
together with the resulting changes of concentration in the perturbation matrix. The sign △ means that the
gene of the protein is mutated. The sign -/+ means that the concentration of a specific protein is
decreased/increased after a protein is deleted. A blank space means no change.
For instance, in Figure 6, protein A originally activates (6a) or inhibits (6b) protein B, which is a
transcription factor that binds to a DNA sequence to activate protein C. More precisely, the presence of
protein A regulates the binding of protein B to a specific promoter in a DNA sequence and thereby the
activation of protein C. Now we mutate the gene of protein A; in other words, we reduce its concentration.
Consequently, this perturbation reduces/enhances the ability of protein B to bind to the DNA sequence.
Finally, the concentration of protein C is decreased/increased. In this process, the concentrations of
protein A and protein C change, but that of protein B does not, compared with the wild-type condition. These
variations can be measured by DNA microarrays. On the other hand, when we mutate the gene of protein B, the
concentration of protein C in Figure 6(a) is decreased. In the other condition, in Figure 6(b), however,
there is no change in C. This is because the interaction between protein B and protein C is originally
inhibited by protein A, and therefore the connection between B and C is hidden.
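The two cases of Figure 6 can be reproduced with a small Boolean simulation. The update rule below is an assumption chosen to match the behavior described above, not the simulation used in the paper: a gene without a transcription factor is expressed constitutively, a gene with transcription factors is expressed only if one of them is active, and a protein is active only if it is present, all its activators are active, and none of its inhibitors is active.

```python
def simulate(nodes, transcription, protein, deleted=None):
    """Return {node: concentration in {0, 1}} under the assumed Boolean rule.

    transcription: set of (factor, gene) activation edges.
    protein: dict mapping (source, target) to +1 (activation) or -1 (inhibition).
    """
    conc = {n: 0 if n == deleted else 1 for n in nodes}
    active = {n: bool(conc[n]) for n in nodes}
    for _ in range(len(nodes) + 1):  # enough sweeps for a loop-free network
        for n in nodes:
            factors = [s for (s, d) in transcription if d == n]
            if n == deleted:
                conc[n] = 0
            elif factors:
                conc[n] = int(any(active[s] for s in factors))
            else:
                conc[n] = 1
            activators = [s for (s, d), sign in protein.items() if d == n and sign > 0]
            inhibitors = [s for (s, d), sign in protein.items() if d == n and sign < 0]
            active[n] = (
                conc[n] == 1
                and all(active[s] for s in activators)
                and not any(active[s] for s in inhibitors)
            )
    return conc


def perturbation_matrix(nodes, transcription, protein):
    """matrix[g][n] is the coded change (+1, -1, 0) of node n when gene g is deleted."""
    wild = simulate(nodes, transcription, protein)
    matrix = {}
    for g in nodes:
        mutant = simulate(nodes, transcription, protein, deleted=g)
        matrix[g] = {n: mutant[n] - wild[n] for n in nodes}
    return matrix


nodes = ["A", "B", "C"]
case_a = perturbation_matrix(nodes, {("B", "C")}, {("A", "B"): +1})  # A activates B
case_b = perturbation_matrix(nodes, {("B", "C")}, {("A", "B"): -1})  # A inhibits B
```

In this sketch the deleted gene itself is coded as -1 when it was expressed in the wild type, whereas Figure 6 marks it with △.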
4 Validation Test
To validate this method, we use a practical genetic regulatory network (Figure 7) [9]-[11] to evaluate the
reliability of our inference strategy.
Fig. 7. Model of galactose utilization. The red dotted box marks the genetic regulatory network, which
primarily consists of GAL4, GAL80, and GAL3 [9].
How do the proteins interact with one another in the network? The protein GAL4 is the transcription factor
in this network; another protein, GAL80, inhibits its activity in the absence of galactose. In the presence
of galactose, a third protein, GAL3, is activated to inhibit GAL80. Consequently, GAL4 is released and able
to bind to the DNA sequence for gene transcription.
Figure 8 shows the perturbation matrix of the GAL pathway, taken from Fig. 2 in [9]. DNA microarrays were
used to measure the expression profiles of various genes under each of 20 perturbations in the GAL pathway.
Each spot in this matrix represents the change in a GAL gene's expression due to a particular perturbation,
namely the mutation of the specific gene listed at the top of each column. The gray level of a spot
represents the ratio of the concentration under the perturbed condition to that under the wild-type
condition.
Fig. 8. A perturbation matrix of the GAL pathway measured by DNA microarray. Each spot represents the change in
expression of a GAL gene due to a particular perturbation. The left half of the matrix shows experiments with
galactose; the right half shows experiments without galactose. The far left columns show changes compared between
wild-type experiments without and with galactose [9].
We choose GAL4, GAL3, GAL80, and GAL5 to be our sample nodes, i.e., we predict the candidate models of the
genetic regulatory network consisting of these proteins. In Figure 9, the data in the two matrices are
picked out from the matrix in Figure 8. We take these two perturbation matrices (a and b) as the input data
of our two inference algorithms, a and b, respectively. After a series of systematic and automated
processes, we obtain the inferred candidate models whose simulated perturbation matrices meet the matrices
from experiments. Figure 10 shows the candidate models that meet the matrix in Figure 9(a), and Figure 11
shows those that meet the matrix in Figure 9(b). We then pass these two sets of candidate models to
biologists for further verification.
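The selection of candidate models amounts to a filter over the enumerated set, which can be sketched as below. The names are hypothetical, the use of `None` for an unmeasured entry is an assumption, and the toy matrices are written out by hand from the two cases of Figure 6 rather than taken from the GAL data.

```python
def matches(simulated, observed):
    """True if the simulated matrix agrees with every measured entry.

    Both matrices map (mutated gene, measured gene) to +1, -1, or 0.
    An observed value of None marks an unmeasured entry (an assumption).
    """
    return all(
        simulated.get(key) == value
        for key, value in observed.items()
        if value is not None
    )


def select_candidates(models, simulate_matrix, observed):
    """Keep the models whose simulated perturbation matrix meets the observed one."""
    return [m for m in models if matches(simulate_matrix(m), observed)]


# Toy matrices for the two networks of Figure 6, keyed by a model label.
TOY = {
    "A activates B": {("A", "C"): -1, ("B", "C"): -1},
    "A inhibits B": {("A", "C"): +1, ("B", "C"): 0},
}
observed = {("A", "C"): +1, ("B", "C"): None}
candidates = select_candidates(TOY, TOY.get, observed)
is_unique = len(candidates) == 1
```

When more than one candidate survives, the flowchart in Figure 4 calls for a further distinguishing experiment rather than an arbitrary choice.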
Fig. 9. (a) Perturbation matrix from the experiments with galactose. (b) Perturbation matrix from the
experiments without galactose. The sign +/- represents a significant increase/decrease.
According to our previous assumptions, the interactions between two nodes are directed and primary. However,
our candidate models may represent only part of the interactions in a larger network. Take the candidate
models in Figure 10(b) and Figure 11(b) for example: the model in Figure 11(b) is a subgraph of the model in
Figure 10(b). The only difference is the protein-inhibition interaction between GAL3 and GAL80, which could
be activated by the presence of galactose. Figure 12 illustrates the resultant network, which gives the same
gene expression results as the experiments and has the same structure as shown in Figure 7. This suggests
that our computational inference has biological significance.
Fig. 10. Candidate models that meet the perturbation matrix in Figure 9(a).
Fig. 11. Candidate models that meet the perturbation matrix in Figure 9(b).
Fig. 12. Candidate models predicted by referring to two different experiments. The red arrow points at the node that is
activated by the galactose in the cell.
5 Conclusions
In this paper, we suggest a reverse-engineering approach to infer interactions in a genetic regulatory
network. Our model consists of interactions at the transcription level and at other levels. DNA microarrays
can be used to provide our input data. Candidate models whose simulated data are consistent with the input
data are then subjected to biological verification. We have shown that our preliminary candidate models do
contain a biologically significant result.
References
[1] E.J. Chikofsky and J.H. Cross, “Reverse Engineering and Design Recovery: A Taxonomy,” IEEE Software, Vol. 7, No. 1, pp. 13-17, 1990.
[2] D. Lee, “Reverse Engineering of Communication Protocols,” Proceedings of International Conference on Network Protocols, pp. 218-216, 1993.
[3] K. Saleh, R. Probert, K. Al-Saqabi, “Recovery of CFSM-based Protocol and Service Design from Protocol Execution Traces,” Information and Software Technology, Vol. 41, No. 11-12, pp. 839-852, 1999.
[4] I. Baxter and M. Mehlich, “Reverse Engineering is Reverse Forward Engineering,” Science of Computer Programming, Vol. 36, pp. 131-147, 2000.
[5] K. Saleh and A. Boujarwah, “Communications Software Reverse Engineering: a Semi-Automatic Approach,” Information and Software Technology, Vol. 38, No. 6, pp. 379-390, 1996.
[6] R. Chiang, T. Barron, A. Storey, “A Framework for the Design and Evaluation of Reverse Engineering Methods for Relational Databases,” Data & Knowledge Engineering, Vol. 21, No. 1, pp. 57-77, 1996.
[7] T. Varady, R. Martin, J. Cox, “Reverse Engineering of Geometric Models-An Introduction,” Computer-Aided Design, Vol. 29, No. 4, pp. 255-268, 1997.
[8] T. Chen, V. Filkov, S. Skiena, “Identifying Genetic Regulatory Networks from Experimental Data,” Proceedings of RECOMB’99, 1999.
[9] T. Ideker, V. Thorsson, J.A. Ranish, R. Christmas, J. Buhler, J.K. Eng, R. Bumgarner, D.R. Goodlett, R. Aebersold, L. Hood, “Integrated Genomic and Proteomic Analyses of a Systematically Perturbed Metabolic Network,” Science, Vol. 292, No. 5518, pp. 929-933, 2001.
[10] R.W. Ye, T. Wang, L. Bedzyk, K.M. Croker, “Applications of DNA Microarrays in Microbial Systems,” Journal of Microbiological Methods, Vol. 47, No. 3, pp. 257-272, 2001.
[11] A.J. Hartemink, D.K. Gifford, T.S. Jaakkola, R.A. Young, “Using Graphical Models and Genomic Expression Data to Statistically Validate Models of Genetic Regulatory Networks,” Pacific Symposium on Biocomputing’01, Vol. 6, pp. 422-433, 2001.