Artificial Intelligence in Systems Biology

Hai Huang, Master Student, IBBME

Abstract: Systems biology extends the perspective from individual biological components to the system level. This shift requires advanced modelling skills and data processing techniques, and artificial intelligence (AI) could meet this demand. AI has shown its power in modelling multiple alignments of genes and in phylogenetic likelihood inference. Its active learning algorithms should accelerate the evolution of systems biology.

Index terms: Systems biology, system structure, system dynamics, artificial intelligence, knowledge and reasoning, machine learning.

I. INTRODUCTION

SYSTEMS biology is a system-level understanding of biology [1], which was first introduced about 50 years ago. Compared to traditional biology, it is still in its infancy. But as an emerging science, it has recently shown its potential to shape the direction of molecular, genomic, and pharmacological research. However, further advancement in systems biology is not free of obstacles. Technologies from other scientific fields are needed to assist breakthroughs in systems biology. Artificial Intelligence (AI), as one of these assisting tools, has demonstrated its potential in overcoming the difficulties faced by systems biology. Some concepts of AI have already been applied in systems biology, while others are beginning to be utilized.

The purpose of this paper is to provide an overview of systems biology and AI. The application of AI in systems biology is introduced, and the future relationship between AI and systems biology is discussed.

II. CHALLENGES IN SYSTEMS BIOLOGY

The understanding of system-level biology is derived from insight into four key elements: system structure, system dynamics, control method, and design method [1].
Progress has been made in each of the above areas since the emergence of systems biology, but every step forward has been hard won.

System structure is not a list of isolated components of a cell or organism; it is more about the relationships between these components [1]. However, from a biological viewpoint, clearly describing those relationships is very challenging. For example, the similarity of DNA between different species has a profound impact on evolutionary biology; however, in searching for this similarity, J. P. Huelsenbeck et al. found it very frustrating because it was hidden in a flood of data [2]. Finding clues in such voluminous data demands great experience, knowledge, and patience, and human error will inevitably have a strongly adverse influence on the outcome.

Manuscript received October 21, 2003. H. Huang is with the Institute of Bio-material and Bio-medical Engineering, University of Toronto, Canada (corresponding author to provide e-mail: [email protected]).

To understand system dynamics is to find out “how a system behaves over time and under various conditions” [1]. A biological system is far more complex than a mechanical system; sometimes the same chemical messenger can carry several signals simultaneously on different time scales [3]. This makes it hard to understand the roles of the different parallel processes and feedback mechanisms.

Based on knowledge of the structure and dynamics of a particular system, control and design methods can be used to control the state and modify the properties of the system [1]. For example, monitoring and controlling side effects are major issues in the development of new drugs, especially drugs that target genes and proteins [4]. Difficulties arise here because the target genes produce large amounts of proteins, some functions of which are unknown.
Yet to control the therapeutic effects and design the drugs, it is essential to identify the unknown functions and eliminate the unwanted ones.

III. ARTIFICIAL INTELLIGENCE

Facing all these challenges, systems biology adopts many techniques from other fields, such as systems engineering, information technology, and control theory. AI is a relatively new science incorporated in the development of systems biology.

AI emerged as a new science in the 1950s, around the time when the term “systems biology” was coined. It refers to thinking and acting like a human being, or at least thinking and acting rationally, rather than just imitating what a human being does [5]. If a system can only mimic a person’s actions, it is just a manipulator, not actual artificial intelligence.

The Turing Test, proposed in 1950, was the first landmark of AI. In the Turing Test, an interrogator is connected to a person or a machine via a terminal, which prevents the interrogator from seeing the counterpart. The interrogator’s task is to find out whether the counterpart is a machine by asking questions alone [6]. If the machine can “fool” the interrogator, it is considered an intelligent entity. The Turing Test raised the possibility that a machine could act like a human being.

Another well-known milestone of AI was Deep Blue. In May 1997, IBM's Deep Blue supercomputer played a match against the World Chess Champion, Garry Kasparov, and won [7]. It revealed that machines were able to compete with human beings to some degree.

Generally speaking, the scope of AI covers all human activities, such as observing the environment, judging successful behaviour, seeking the proper method, and adjusting knowledge while interacting with the target. It can be classified into four categories: problem solving; knowledge and reasoning; machine learning; and autonomous planning, communicating, perceiving and acting [5].

Problem solving is the basis of AI. It presents a topological view.
Usually, from the AI perspective, a problem like “How can one thing go from state A to state B?” can be solved by searching an existing database subject to constraints and conditions. This search can be target oriented, start-point oriented, or bidirectional.

Knowledge and reasoning is about understanding and identifying successful behaviour in a complex environment. It is the key component of AI and plays a crucial role in dealing with partially observable environments. Based on logic, probability, and statistical theory, two important models were developed: the Bayesian network and the Hidden Markov Model, both of which are dominant in current AI. They are discussed in detail in Section IV of this paper.

Machine learning enables a system to adjust itself to the environment. Whether supervised or unsupervised, passive or active, machine learning aims to improve the system’s ability to act in the future. It is now the most important trend in the development of AI.

Autonomous planning, communicating, perceiving and acting implement the thinking part of AI in its acting part. They are the applications of problem solving, knowledge and reasoning, and machine learning.

These four aspects make AI a very good tool for reducing human error, improving efficiency, saving time, and decreasing costs, and thus allow it to be applied to the difficulties faced by systems biology.

IV. APPLICATION OF AI IN SYSTEMS BIOLOGY

Systems biology and AI developed in parallel as two distinct disciplines before the 1980s. However, in the past twenty years, rapid technological development has created opportunities for AI to be applied in systems biology. Advances in computer science and information technology give AI more powerful computing platforms. At the same time, new theoretical concepts and approaches in computer science enhance the theoretical development of AI. On the other hand, new technologies such as gene microarrays have been brought into systems biology. These technologies make it possible to digitize experimental results and improve the repeatability of tests. They also produce an enormous amount of data, so new methods to process these data are in high demand. AI has been acting as a useful tool in these situations.

Of the four components of AI, problem solving is the basis of the other three. Thus, its application in systems biology is embedded in the application of the other three. Being the best-studied element of AI, knowledge and reasoning has a relatively wide application in systems biology today. Examples of its application at two different levels of systems biology are discussed in the following paragraphs. Some pioneering studies are also being done on the application of machine learning in systems biology. However, the application of autonomous planning, communicating, perceiving and acting has not yet been seen.

A. Bayesian Inference of Phylogeny

The idea that species are related is not new. More than a century ago, Darwin became one of the pioneers of evolutionary biology. These pioneers intended to reveal a systematic structure from a biological point of view. In line with the current trend in biology, phylogenetics today is more like a bioinformatics science: the abundance of molecular data transforms the question of the history of life into a statistical and computational problem. Many different inferential methods have been introduced into phylogenetic analysis, seeking the relationships between different biological classes. Among them, Bayesian inference, an important AI theory and application, is relatively new in this field, but it is a powerful tool for addressing a number of long-standing and complex questions in evolutionary biology. Table 1 lists some applications of Bayesian inference in phylogeny.

Problem | Bayesian approach
Inferring phylogeny | Find tree with maximum posterior probability; evaluate features in common among the sampled trees
Evaluating uncertainty in phylogenies | Evaluate clade probabilities; form credible set containing trees whose cumulative probability sums to 0.95
Detecting selection | Model substitution process on the codon and calculate probability of being in purifying or positively selected class; sample substitutions and count number of synonymous and nonsynonymous changes
Comparative analyses | Perform analysis on many trees, and weight results by the probability that each tree is correct
Divergence times | Use fossils as a calibration. Infer divergence times by using a strict or relaxed molecular clock
Testing molecular clock | Calculate Bayes factor for the clock versus no branch length restrictions

Table 1. Bayesian approach to problems in phylogeny.

Bayesian inference computes the posterior probability distribution for a set of query variables over a Bayesian network, which can represent the dependencies among variables and give a concise specification of any full joint probability distribution [5]. As part of the knowledge and reasoning category of AI, this inference aims to identify the correct relationships between different elements. The basic expression of Bayes' theorem is:

$$\Pr[\text{Tree} \mid \text{Data}] = \frac{\Pr[\text{Data} \mid \text{Tree}] \times \Pr[\text{Tree}]}{\Pr[\text{Data}]}$$

In phylogeny, this expression is used to combine the prior probability of a phylogeny (Pr[Tree]) with the likelihood (Pr[Data | Tree]) to produce a posterior probability distribution on trees (Pr[Tree | Data]). Inferences about the history of the group are based on the posterior probability of trees. The tree with the highest posterior probability might be chosen as the best estimate of phylogeny [2].

Huelsenbeck et al.
implemented this approach with Markov chain Monte Carlo (MCMC), a numerical method for Bayesian inference. Two important practical problems were associated with the application of MCMC. One was the modelling assumption: a poorly fitting assumption would lead to a wrong inference. Their analyses assumed the general time reversible (GTR) model of DNA substitution, which allows each type of nucleotide change to have its own rate and the nucleotide bases to have different frequencies. Rates were allowed to vary across sites, either by treating the rate as a random variable or by dividing the sites by codon position. The other problem was to determine how long to run a chain to obtain a good approximation of the posterior probabilities of trees. In some cases the MCMC algorithm would fail to converge. In the end they identified convergence by trial and error.

Based on a variant of MCMC called Metropolis-coupled MCMC, Huelsenbeck et al. designed a computer program [2]. They applied this program to four large phylogenetic data sets. The smallest data set included 106 wingless sequences sampled from insects, and the largest included 357 atpB sequences sampled from plants. Figure 1 shows the posterior probability of each clade, conditional on the observed DNA sequences, for two chains that started from different random trees. The posterior probabilities of the individual clades found in the different chains are highly correlated. There is no obvious correlation found across the clades. This result indicated that Bayesian inference could be a precise method in phylogenetic analysis.

Huelsenbeck and his colleagues pointed out that Bayesian inference could be an important method in the study of molecular evolution, especially for substitution patterns. Their next step was to construct a large tree or network for a better understanding of genome evolution in the context of phylogeny.
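The sampling idea behind this analysis can be sketched on a toy problem. The Python sketch below is not the program of Huelsenbeck et al.: it uses a plain Metropolis sampler (not Metropolis-coupled MCMC) over three hypothetical four-taxon topologies, with invented prior and likelihood values, and the names `PRIOR`, `LIKELIHOOD`, and `metropolis` are assumptions made for illustration. It runs two chains from different starting trees and compares their estimated posterior probabilities with the exact values, in the spirit of the two-chain comparison in Figure 1.

```python
import random

# Hypothetical inputs: three candidate topologies with a flat prior Pr[Tree]
# and invented likelihoods Pr[Data | Tree]. None of these numbers come from [2].
PRIOR = {"((A,B),(C,D))": 1 / 3, "((A,C),(B,D))": 1 / 3, "((A,D),(B,C))": 1 / 3}
LIKELIHOOD = {"((A,B),(C,D))": 0.020, "((A,C),(B,D))": 0.005, "((A,D),(B,C))": 0.001}


def unnormalised_posterior(tree):
    """Numerator of Bayes' theorem: Pr[Data | Tree] * Pr[Tree]."""
    return LIKELIHOOD[tree] * PRIOR[tree]


def exact_posterior():
    """Normalise over all trees; only feasible because the toy space is tiny."""
    total = sum(unnormalised_posterior(t) for t in PRIOR)
    return {t: unnormalised_posterior(t) / total for t in PRIOR}


def metropolis(start, steps, seed):
    """Estimate Pr[Tree | Data] as the fraction of steps spent on each tree."""
    rng = random.Random(seed)
    trees = list(PRIOR)
    current = start
    visits = {t: 0 for t in trees}
    for _ in range(steps):
        # Symmetric proposal: pick one of the other trees uniformly at random.
        proposal = rng.choice([t for t in trees if t != current])
        ratio = unnormalised_posterior(proposal) / unnormalised_posterior(current)
        if rng.random() < min(1.0, ratio):
            current = proposal
        visits[current] += 1
    return {t: n / steps for t, n in visits.items()}


if __name__ == "__main__":
    exact = exact_posterior()
    chain_a = metropolis("((A,C),(B,D))", steps=50_000, seed=1)
    chain_b = metropolis("((A,D),(B,C))", steps=50_000, seed=2)
    for tree in exact:
        print(tree, round(exact[tree], 3), round(chain_a[tree], 3), round(chain_b[tree], 3))
```

Agreement between the two chains is only a heuristic sign of convergence. In a realistic tree space the exact posterior cannot be enumerated, which is why the question of how long to run a chain is a practical problem.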
Huelsenbeck’s study applies Bayesian inference, an important AI theory, to find the relationships between different species. This approach demonstrates that AI is able to recognize and build the structure of a complex biological system, such as an evolutionary tree.

Figure 1. Convergence of independent Markov chains.

B. Hidden Markov Model (HMM) in Biopolymers

The Hidden Markov Model builds on the Markov chain, one of the most important contributions of the Russian mathematician A. A. Markov. It is a very influential modelling method in the knowledge and reasoning category of AI. An HMM is a temporal probabilistic model in which the state of the process is described by a single discrete random variable, whose values are the possible states of the world [5]. The structure of the HMM allows simple and elegant computation of all the basic inference algorithms. HMMs are used to search for patterns and to detect phenomena in uncharacterized data. They were first used in speech recognition in the 1970s and 1980s. From the late 1990s, some genomic researchers started to use HMMs as an analysis tool. In 2002, M. Amitai et al. applied HMMs to the study of gene finding and functional annotation [8].

A particular challenge of gene finding and functional annotation is how to describe multiple alignments. Multiple alignments show the dynamic properties of protein sequences. Finding multiple alignments can be done in a laboratory with real “wet” experiments, which are very expensive and time consuming.

Figure 2. An example of a multiple alignment.

To save cost and time, Amitai wanted to find solutions in the existing public data (mainly genomic DNA, messenger RNA, and their corresponding protein sequences), which were available in large amounts. In GenBank alone, there were approximately 28,507,990,166 bases in 22,318,883 sequence records as of January 2003 [10]. It was impossible for a human being to find the hidden relationships among these billions of bases. The HMM, as a model from AI, was therefore used.
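As a minimal illustration of the kind of computation an HMM supports, the Python sketch below implements the forward algorithm, which sums over all hidden state paths to obtain the probability of an observed sequence. The two states (`coding`, `noncoding`) and every probability in it are hypothetical values chosen for illustration; they are not taken from [8] or from any real gene finder.

```python
# Toy two-state HMM over DNA bases. All numbers are invented for illustration.
STATES = ("coding", "noncoding")
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.2, "noncoding": 0.8},
}
EMIT = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def forward(sequence):
    """Return Pr[sequence] under the toy HMM, summed over all hidden paths."""
    if not sequence:
        raise ValueError("sequence must contain at least one symbol")
    # alpha[s] = probability of the symbols seen so far, with the chain in state s
    alpha = {s: START[s] * EMIT[s][sequence[0]] for s in STATES}
    for symbol in sequence[1:]:
        alpha = {
            s: EMIT[s][symbol] * sum(alpha[r] * TRANS[r][s] for r in STATES)
            for s in STATES
        }
    return sum(alpha.values())


if __name__ == "__main__":
    for seq in ("GCGC", "ATAT"):
        print(seq, forward(seq))
```

For long sequences these products underflow, so practical implementations usually work with scaled or log probabilities.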
There were three reasons for using HMMs in modelling proteins and genes. First, HMMs had the advantage of precise probabilistic modelling. Second, the experience gained from the same tools in speech recognition could be reused [8]. Third, some computer programs for building and applying HMMs were already well developed. Among these programs, a few focused on the sequence analysis of proteins, such as HMMer and SAM [12, 13].

Figure 2 is a real example taken from the PDGF (platelet-derived growth factor) family. At position 17, half of the sequences have an amino acid, which is proline or arginine with equal probability. The other half have no amino acid at position 17 (a deleted position) [8]. Although the statistical population is relatively small, knowledge of protein evolution suggests that a new member of the same family will behave similarly at the same position [11]. This similarity meets the assumption of the HMM.

Figure 3 shows part of an HMM constructed from the multiple alignment. Here state M16 corresponds to position 16. From this state, there is a 50% probability of moving to D17 (the deleted position) and a 50% probability of moving to M17. M17 is a clustered state with a 50% probability of P (proline) and a 50% probability of R (arginine). A protein is then aligned to the HMM according to these probabilities. The model identifies the similarity with other proteins and predicts the multiple alignment for members of the same family. It draws a dynamic picture of the protein sequence.

Figure 3. Part of the HMM for the multiple alignment in Figure 2.

In this example the HMM, one of the most popular models in AI, describes the multiple alignments in PDGF. It shows AI’s ability to find and understand the dynamics of a biological system.

C. Preliminary Application of Machine Learning

A very important aspect of AI is learning like a human being. This technique could be of great help in the modelling procedure. Models and modelling assumptions are usually crucial in systems biology research. Several possible models may fit one topic, and finding the best choice is very difficult in most cases. Because there was no proper method to check the validity of a model, Huelsenbeck and his colleagues had to use trial and error to determine convergence. In this case, a self-learning convergence module could have been built into their program to improve efficiency.

Some pioneering studies are in progress. C. Yoo and G. F. Cooper introduced a system named GEEVE, which can automatically pick the best model to find a causal pathway in genes [9]. The system recommends a model based on previous results and adjusts the recommendation according to recent evaluations [9].

V. CONCLUSION

AI technologies have proven beneficial to the development of systems biology. Problem solving is the basis of AI, and its importance is reflected in the application of all the other aspects of AI in systems biology. Knowledge and reasoning is currently the most widely applied. It helps in the identification of system structure, as in the example of Bayesian inference of phylogeny. It also shows its value in understanding system dynamics, as exemplified by the Hidden Markov Model in biopolymers. Machine learning algorithms are starting to demonstrate their power in assisting control and design methods. In the near future, the trend of AI implementation in systems biology will be to explore areas such as building knowledge bases, choosing models, analyzing data, and evaluating results. The more complex application of autonomous planning, communicating, perceiving and acting is likely to happen after machine learning is well adopted in systems biology.

REFERENCES

[1] H. Kitano, “Systems biology: a brief overview,” Science, 2002, vol. 295, pp. 1662-1664.
[2] J. P. Huelsenbeck, F. Ronquist, and R. Nielsen, “Bayesian Inference of Phylogeny and Its Impact on Evolutionary Biology,” Science, 2002, vol. 294, pp. 2310-2318.
[3] N. C. Spitzer and T. J. Sejnowski, “Biological Information Processing: Bits of Progress,” Science, 2000, vol. 277, pp. 1060-1063.
[4] A. Renner and A. Aszodi, “High-throughput Functional Annotation of Novel Gene Products using Document Clustering,” Pacific Symposium on Biocomputing 2000, pp. 54-68.
[5] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, Pearson Education, New Jersey, USA, 2003.
[6] A. M. Turing, “A Quarterly Review of Psychology and Philosophy,” 1950. Available online: http://www.abelard.org/turpap/turpap.htm
[7] IBM, “Deep Blue,” 1997. Available online: http://www.research.ibm.com/deepblue/
[8] M. Amitai, “Hidden Models in Biopolymers,” Science, 2001, vol. 282, pp. 1436-1440.
[9] C. Yoo and G. F. Cooper, “An Evaluation of a System that Recommends Microarray Experiments to Perform to Discover Gene-Regulation Pathways,” unpublished.
[10] NCBI, “What is GenBank?,” 2003. Available online: http://www.ncbi.nlm.nih.gov/Genbank/
[11] J. Sjolander et al., Comput. Appl. Biosci., 1996, vol. 12, p. 327.
[12] S. Eddy, “HMMer: Profile HMMs for protein sequence analysis,” 2003. Available online: http://hmmer.wustl.edu/
[13] UCSC, “Sequence Alignment and Modelling System,” 2003. Available online: http://www.cse.ucsc.edu/research/compbio/sam.html