Artificial Intelligence in Systems Biology
Hai Huang, Master Student, IBBME

Abstract — Systems biology extends the perspective from
individual biological components to the system level. This
development requires advanced modelling skills and data
processing techniques. Artificial intelligence could be the answer
to this demand. It has shown its power in gene multiple
alignment modelling and phylogenetic likelihood inference. Its
active learning algorithms could accelerate the evolution of
systems biology.
Index term — Systems biology, system structure, system
dynamic, artificial intelligence, knowledge and reasoning,
machine learning.
I. INTRODUCTION
SYSTEMS biology is a system-level understanding of
biology [1], a notion first introduced about 50 years ago.
Compared to traditional biology, it is still in its infancy. But as an
emerging science, it has recently shown its potential to shape
research trends in molecular biology, genomics, and
pharmacology. However, further advancement in
systems biology is not free of obstacles. Technologies from
other scientific fields are needed to enable breakthroughs
in systems biology. Artificial Intelligence (AI), as one of these
supporting tools, has demonstrated its potential in overcoming the
difficulties faced by systems biology. Some concepts of AI have
already been applied in systems biology, while others are
beginning to be utilized.
The purpose of this paper is to provide an overview of
systems biology and AI. The application of AI in systems
biology is introduced, and the trend of the future relationships
between AI and systems biology is discussed.
II. CHALLENGES IN SYSTEMS BIOLOGY
The understanding of system-level biology is derived from
insight into four key elements: system structure, system
dynamics, control method, and design method [1]. Progress
has been made in each of these areas since the emergence of
systems biology, but each advance has come with considerable
difficulty.
System structure is not a list of isolated components of a cell
or organism; it is more about the relationships between these
components [1]. However, from a biological viewpoint,
describing those relationships clearly is very challenging. For
example, the similarity of DNA between different species has a
profound impact on evolutionary biology; however, in
searching for this similarity, J. P. Huelsenbeck et al. found it
very frustrating because it was hidden in a flood of data [2].
Finding clues in such voluminous data demands great
experience, knowledge and patience, and human error will
inevitably have a strong adverse influence on the outcome.
Manuscript received October 21, 2003.
H. Huang is with the Institute of Bio-material and Bio-medical Engineering,
University of Toronto, Canada (corresponding author to provide e-mail:
[email protected]).
To understand system dynamics is to find out “how a system
behaves over time and under various conditions” [1]. A
biological system is far more complex than a mechanical
system; sometimes the same chemical messenger can carry
several signals simultaneously on different time scales [3]. This
makes it difficult to understand the roles of the different
parallel processes and feedback mechanisms.
Based on knowledge of the structure and dynamics of a
particular system, control and design methods can be used
to control the state and modify the properties of the system [1].
For example, monitoring and controlling side effects are
major issues in the development of new drugs, especially
gene-protein target drugs [4]. Difficulties arise here because the
target genes produce large numbers of proteins, some of whose
functions are unknown. Yet to control the therapeutic effects and
design the drugs, it is essential to identify the unknown
functions and eliminate the undesired ones.
III. ARTIFICIAL INTELLIGENCE
Facing all these challenges, systems biology has adopted many
techniques from other fields, such as Systems Engineering,
Information Technology, and Control Theory. AI is a relatively
recent addition to the development of systems
biology.
AI emerged as a new field of science in the 1950s, at about the
same time that the term “systems biology” was coined. It refers
to thinking and acting like a human being, or at least thinking and
acting rationally, rather than just imitating what a human being
does [5]. A system that can only mimic a person’s actions is just
a manipulator, not an actual artificial intelligence.
The Turing Test, proposed in 1950, was the first landmark of AI. In
the Turing Test, an interrogator is connected to a person or a
machine via a terminal, which prevents the interrogator from seeing
the counterpart. The interrogator’s task is to find out whether the
counterpart is a machine by only asking questions [6]. If the
machine can “fool” the interrogator, it is
considered an intelligent entity. The Turing Test suggested
the possibility that a machine could act as a human being.
Another well-known milestone of AI was Deep Blue. In May
1997, IBM's Deep Blue supercomputer played a match against the
World Chess Champion, Garry Kasparov, and won the match [7].
It showed that machines were able to compete with human
beings, at least in some domains.
Generally speaking, the scope of AI covers a broad range of human
activities, such as observing the environment, judging
successful behaviour, seeking the proper method, and adjusting
knowledge while interacting with the target. It can be classified
into four categories: problem solving; knowledge and reasoning;
machine learning; and autonomous planning, communicating,
perceiving and acting [5].
Problem solving is the basis of AI. It takes a topological
view of a problem. From the AI perspective, a problem like “How can
one thing go from state A to state B?” can be solved by
searching an existing database of states subject to constraints and
conditions. This search can be target oriented (working backward
from B), start-point oriented (working forward from A), or
bidirectional.
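The bidirectional case can be illustrated with a minimal sketch. It assumes the "database" is a small directed graph stored as a Python dictionary; the graph `toy_graph`, the state names, and the helper functions are hypothetical and are not taken from [5]. One breadth-first frontier grows forward from the start state and another grows backward from the target, and the search stops when they meet. The sketch returns a valid path but makes no claim that it is the shortest one.

```python
from collections import deque

def reverse(graph):
    """Return the graph with every edge direction flipped."""
    rev = {node: [] for node in graph}
    for node, successors in graph.items():
        for nxt in successors:
            rev.setdefault(nxt, []).append(node)
    return rev

def _trace(parents, node):
    """Follow parent links from node back to the root of that search."""
    path = []
    while node is not None:
        path.append(node)
        node = parents[node]
    return path

def bidirectional_search(graph, start, goal):
    """Return a list of states from start to goal, or None if unreachable."""
    if start == goal:
        return [start]
    backward_graph = reverse(graph)
    parents_f, parents_b = {start: None}, {goal: None}
    frontier_f, frontier_b = deque([start]), deque([goal])
    while frontier_f and frontier_b:
        # One step of the start-point-oriented (forward) search.
        state = frontier_f.popleft()
        for nxt in graph.get(state, ()):
            if nxt not in parents_f:
                parents_f[nxt] = state
                if nxt in parents_b:  # the two searches have met
                    return _trace(parents_f, nxt)[::-1] + _trace(parents_b, nxt)[1:]
                frontier_f.append(nxt)
        # One step of the target-oriented (backward) search.
        state = frontier_b.popleft()
        for prev in backward_graph.get(state, ()):
            if prev not in parents_b:
                parents_b[prev] = state
                if prev in parents_f:  # the two searches have met
                    return _trace(parents_f, prev)[::-1] + _trace(parents_b, prev)[1:]
                frontier_b.append(prev)
    return None

# Hypothetical state graph: each key lists the states reachable in one step.
toy_graph = {"A": ["C", "D"], "C": ["E"], "D": ["E"], "E": ["B"], "B": [], "Z": []}
path = bidirectional_search(toy_graph, "A", "B")
```

Running only the forward half of the loop gives the start-point-oriented search, and running only the backward half gives the target-oriented one.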
Knowledge and reasoning is about understanding and identifying
successful behaviour in a complex environment. It is the key
component of AI and plays a crucial role
in dealing with partially observable environments. Based on
logic, probability, and statistical theory, two important
models were developed: the Bayesian network and
the Hidden Markov Model, both of which are prominent
in current AI. They are discussed in detail in Section IV of
this paper.
Machine learning enables a system to adjust itself to its
environment. Whether supervised or unsupervised, passive or
active, machine learning aims to improve the system’s ability to act
in the future. It is now the most important trend in the
development of AI.
Autonomous planning, communicating, perceiving and
acting carry the thinking part of AI over into its
acting part. They are applications of problem solving,
knowledge and reasoning, and machine learning.
These four aspects make AI a very good tool
for reducing human error, improving efficiency, saving time, and
decreasing costs, and thus allow it to be applied in overcoming the
difficulties faced by systems biology.
IV. APPLICATION OF AI IN SYSTEMS BIOLOGY
Systems biology and AI developed in parallel
before the 1980s as two distinct disciplines. However, in
the past twenty years, rapid technological development has
created opportunities for AI to be applied in systems biology.
Advances in computer science and information
technology give AI more powerful computing
platforms to work with. At the same time, new theoretical concepts
and approaches in computer science enhance the theoretical
development of AI. On the other hand, new technologies such as
gene microarrays have been brought into systems biology.
These technologies make it possible to digitize
experimental results and improve the repeatability of tests. They
also produce an enormous amount of data. Therefore, new
methods are in high demand to process these data. AI has
been a useful tool in these situations.
Of the four components of AI, problem solving is the basis of
the other three. Thus, its application in systems biology
is embedded in the application of the other three aspects of AI.
Being the best studied element of AI, knowledge and reasoning
has a relatively wide application in systems biology today.
Examples of its application at two different levels of systems
biology are discussed in the following paragraphs. Some
pioneering studies are also being done on the application of
machine learning in systems biology. However, the application
of autonomous planning, communicating, perceiving and acting
has not yet been seen.
A. Bayesian Inference of Phylogeny
The idea that species are related is not new. More than a
century ago, Darwin became one of the pioneers of
evolutionary biology. These pioneers intended to reveal a
systematic structure from a biological point of view. In line with
the general trend in biology nowadays, phylogenetics has become
more like a branch of bioinformatics. The abundance of molecular
data has transformed the question of the history of life into a
statistical and computational problem. Many different inferential
methods have been introduced into phylogenetic analysis, seeking the
relationships between different biological classes. Among them,
Bayesian inference, an important AI theory and application, is
relatively new in this field, but it is a powerful tool for
addressing a number of long-standing and complex questions in
evolutionary biology. Table 1 lists some applications of Bayesian
inference in phylogeny.
Problem | Bayesian approach
Inferring phylogeny | Find tree with maximum posterior probability; evaluate features in common among the sampled trees
Evaluating uncertainty in phylogenies | Evaluate clade probabilities; form credible set containing trees whose cumulative probability sums to 0.95
Detecting selection | Model substitution process on the codon and calculate probability of being in purifying or positively selected class; sample substitutions and count number of synonymous and nonsynonymous changes
Comparative analyses | Perform analysis on many trees, and weight results by the probability that each tree is correct
Divergence times | Use fossils as a calibration. Infer divergence times by using a strict or relaxed molecular clock
Testing molecular clock | Calculate Bayes factor for the clock versus no branch length restrictions
Table 1 Bayesian approach to problems in phylogeny
Bayesian inference computes the posterior probability
distribution for a set of query variables over a Bayesian
network, which can represent the dependencies among
variables and give a concise specification of any full joint
probability distribution [5]. As part of the knowledge and
reasoning category of AI, this inference aims to identify the
correct relationship between different elements. The basic
expression of Bayesian theory is:
\[ \Pr[\mathrm{Tree} \mid \mathrm{Data}] = \frac{\Pr[\mathrm{Data} \mid \mathrm{Tree}] \times \Pr[\mathrm{Tree}]}{\Pr[\mathrm{Data}]} \]
In phylogeny, this expression is used to combine the prior
probability of a phylogeny (Pr[Tree]) with the likelihood
(Pr[Data | Tree]) to produce a posterior probability distribution
on trees (Pr[Tree | Data]). Inferences about the history of the
group are based on the posterior probability of trees. The tree
with the highest posterior probability might be chosen as the
best estimate of phylogeny [2].
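As a rough numerical illustration of combining a prior with a likelihood, the sketch below normalises their product over a small, finite set of candidate trees, picks the tree with the highest posterior probability, and forms the 0.95 credible set mentioned in Table 1. The three tree labels and all probability values are hypothetical numbers chosen for illustration, not results from [2]; real analyses involve far too many trees to enumerate this way.

```python
def posterior(prior, likelihood):
    """Normalise prior * likelihood over a finite set of candidate trees."""
    joint = {tree: prior[tree] * likelihood[tree] for tree in prior}
    evidence = sum(joint.values())  # Pr[Data], summed over the candidates
    return {tree: value / evidence for tree, value in joint.items()}

def credible_set(post, level=0.95):
    """Top-ranked trees whose cumulative posterior probability reaches level."""
    chosen, total = [], 0.0
    for tree in sorted(post, key=post.get, reverse=True):
        chosen.append(tree)
        total += post[tree]
        if total >= level:
            break
    return chosen

# Hypothetical inputs: a flat prior and made-up likelihoods for three trees.
prior = {"((A,B),C)": 1 / 3, "((A,C),B)": 1 / 3, "((B,C),A)": 1 / 3}
likelihood = {"((A,B),C)": 0.008, "((A,C),B)": 0.001, "((B,C),A)": 0.001}

post = posterior(prior, likelihood)
best_tree = max(post, key=post.get)  # tree with the highest posterior
trees_95 = credible_set(post, level=0.95)
```

With a flat prior the ranking of trees is driven entirely by the likelihood; an informative prior would shift the posterior toward the trees it favours.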
Huelsenbeck et al. implemented this approach with a numerical
method for Bayesian inference, Markov chain Monte Carlo (MCMC).
Two important practical problems were
associated with the application of MCMC. One was the
modelling assumption: a poorly fitting model would lead to
a wrong inference. They assumed the general time
reversible (GTR) model of DNA substitution in their analyses,
which allowed each type of nucleotide change to have its own rate
and the nucleotide bases to have different frequencies. It allowed
rates to vary across sites, either by treating the rate as random
or by dividing the sites into several codon positions.
The other problem was to determine how long to run a chain to
obtain a good approximation of the posterior probabilities of
trees. In some cases the MCMC algorithm would fail to
converge. In the end they assessed convergence by
trial and error.
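The idea of sampling trees in proportion to their posterior probability, and of checking convergence by comparing independent chains, can be sketched on a toy problem. The code below is a plain Metropolis sampler over three candidate trees, not the Metropolis-coupled variant and not the program described in [2]; the tree labels, weights, chain length, and seeds are all hypothetical, and no burn-in is discarded.

```python
import random

def metropolis_chain(weights, start, steps, seed):
    """Visit states with long-run frequency proportional to weights[state].

    Proposals are drawn uniformly from the other states, so the proposal is
    symmetric and the acceptance ratio reduces to weights[new] / weights[old].
    """
    rng = random.Random(seed)
    states = list(weights)
    current = start
    visits = {state: 0 for state in states}
    for _ in range(steps):
        proposal = rng.choice([s for s in states if s != current])
        if rng.random() < min(1.0, weights[proposal] / weights[current]):
            current = proposal
        visits[current] += 1
    return {state: count / steps for state, count in visits.items()}

# Hypothetical unnormalised posterior weights (prior x likelihood).
tree_weights = {"((A,B),C)": 8.0, "((A,C),B)": 1.0, "((B,C),A)": 1.0}

# Two chains started from different trees, as a simple convergence check.
chain_1 = metropolis_chain(tree_weights, "((A,C),B)", steps=20000, seed=1)
chain_2 = metropolis_chain(tree_weights, "((B,C),A)", steps=20000, seed=2)
largest_gap = max(abs(chain_1[t] - chain_2[t]) for t in tree_weights)
```

If the two chains have run long enough, their visit frequencies should agree closely with each other and with the normalised weights (0.8, 0.1, 0.1 here); a large `largest_gap` would suggest the chains have not yet converged.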
Based on a variant of MCMC called Metropolis-coupled
MCMC, Huelsenbeck et al. designed a computer program [2].
They applied this program to four large phylogenetic data sets. The
smallest data set included 106 wingless sequences sampled from
insects, and the largest included 357 atpB sequences sampled
from plants. Figure 1 shows the posterior probability of each clade,
conditioned on the observed DNA sequences, for two chains
started from different random trees. The posterior
probabilities of individual clades found in the different chains are
highly correlated. No obvious correlation is found across
the clades. This result suggests that Bayesian inference can be a
precise method in phylogenetic analysis.
Huelsenbeck and his colleagues pointed out that Bayesian
inference could be used as an important method in the study of
molecular evolution, especially in the field of substitution
patterns. Their next step was to construct a large tree or network
for a better understanding of genome evolution in the
context of phylogeny.
Huelsenbeck’s study applies an important AI theory, Bayesian
inference, to find the relationships between different
species. This approach demonstrates that AI is able to recognize
and build the structure of a complex biological system, such as an
evolutionary tree.
B. Hidden Markov Model (HMM) in Biopolymers
The Hidden Markov Model builds on the Markov chain, one of the most
important contributions of the Russian mathematician A. A. Markov.
It is a very influential modelling method in the knowledge and
reasoning category of AI. An HMM is a temporal probabilistic model
in which the state of the process is described by a single discrete
random variable whose possible values are the possible states of
the world [5]. The structure of an HMM allows simple and elegant
computation of all the basic inference algorithms. HMMs are used to
search for patterns and to detect phenomena in uncharacterized
data. They were first used in speech recognition in the 1970s and
1980s. From the late 1990s, some genomics researchers started to
use HMMs as an analysis tool. In 2002, M. Amitai et al.
used HMMs in the study of gene-finding and functional
annotation [8].
Figure 1 Convergence of independent Markov chains
A particular challenge in gene-finding and functional
annotation is how to describe multiple alignments. Multiple
alignments show the dynamic properties of protein sequences.
Multiple alignments can be established in a laboratory with
real “wet” experiments, but these are very expensive and time
consuming.
Figure 2 An example of a multiple alignment.
To save cost and time, Amitai wanted to find the solutions
from the existing public data (mainly genomic DNA, messenger
RNA, and their corresponding protein sequences), which were
available in large amounts. In GenBank alone, there were
approximately 28,507,990,166 bases in 22,318,883 sequence
records as of January 2003 [10]. It was impossible for a human
being to uncover the hidden relationships among these billions of
bases. HMM, as an AI model, was therefore used. There were
three reasons for using HMMs to model proteins and genes.
First, HMMs had the advantage of precise probabilistic
modelling. Second, the experience gained from the same tools in
speech recognition could be reused [8]. Third, well-developed
computer programs were available to build and apply HMMs.
Among these programs, a few focused on the
analysis of protein sequences, such as HMMer and SAM [12,
13].
Figure 2 is a real example taken from the PDGF
(platelet-derived growth factor) family. At position 17, half of the
sequences have an amino acid, which is proline or arginine with
equal probability. The other half have no amino acid at position 17
(called a deleted position) [8]. Although the sample
is relatively small, knowledge of protein evolution suggests that
a new member of the same family will behave similarly at the same
position [11]. This similarity meets the assumption of the HMM.
Figure 3 shows part of an HMM constructed from the multiple
alignment. Here state M16 corresponds to position 16.
From this state, there is a 50% probability of moving to D17 (the
deleted position), and a 50% probability of moving to M17. M17 is a
clustered state with a 50% probability of P (proline) and a 50%
probability of R (arginine). A protein is then aligned to the HMM
according to these probabilities. This model identifies the
similarity with other proteins, and predicts the multiple alignments
for the same family members. It draws a dynamic picture of the
protein sequence.
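The probabilities quoted for positions 16 and 17 can be turned into a short worked calculation. The sketch below encodes only the transitions out of M16 and the emissions of M17 as stated in the text; the dictionary layout, the function name, and the use of "-" for a deleted position are assumptions made for illustration and do not reflect the file formats of HMMer or SAM.

```python
# Transition and emission probabilities as described for positions 16-17.
transitions = {"M16": {"M17": 0.5, "D17": 0.5}}
emissions = {"M17": {"P": 0.5, "R": 0.5}}  # P = proline, R = arginine

def position_17_probability(symbol):
    """Probability that a family member shows `symbol` at position 17.

    Use "-" for a deleted position (no amino acid).
    """
    if symbol == "-":
        return transitions["M16"]["D17"]
    # Reach the match state M17, then emit the amino acid from it.
    return transitions["M16"]["M17"] * emissions["M17"].get(symbol, 0.0)

distribution = {s: position_17_probability(s) for s in ("P", "R", "-")}
```

Under these numbers a new family member is expected to show proline or arginine with probability 0.25 each and a deletion with probability 0.5; any other amino acid at position 17 has probability zero in this fragment, which a practical model would usually soften with pseudocounts.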
Figure 3 Part of HMM for multiple alignment from Figure 2
In this example an HMM, one of the most popular models in AI,
describes the multiple alignments in PDGF. It shows AI’s
ability to find and understand the dynamics of a
biological system.
C. Preliminary Application of Machine Learning
A very important aspect of AI is learning like a human being.
This technique will be of great help in the modelling
procedure. Models and modelling assumptions
are usually crucial in systems biology research. There may be
several possible models that fit one topic, and finding the
best choice is very difficult in most cases. Lacking a proper
method to check the validity of a model, Huelsenbeck and his
colleagues had to use trial and error to determine
convergence. In this case, a self-learning convergence
module could have been built into their program to improve its
efficiency.
Some pioneering studies are in progress. C. Yoo and G. F.
Cooper introduced a system named GEEVE, which can
automatically pick the best model to find a causal pathway in
genes [9]. This system recommends a model based
on previous results, and adjusts the recommendation according to
recent evaluations [9].
V. CONCLUSION
AI technologies have proven beneficial to the
development of systems biology. Problem solving is the basis of
AI, and its importance is reflected in the application of all the
other aspects of AI in systems biology. Knowledge and
reasoning is currently the most widely applied. It helps in the
identification of system structure, as in the example of Bayesian
inference of phylogeny. It also shows its value in understanding
system dynamics, as exemplified by the Hidden Markov Model
in biopolymers. Machine learning is starting to
demonstrate its power in assisting control and design methods.
Applications such as building knowledge bases, choosing models,
analyzing data and evaluating results are likely to be the trend of
AI implementation in systems biology in the near future. The more
complex application of autonomous planning, communicating,
perceiving and acting is likely to follow once machine learning
is well adopted in systems biology.
REFERENCES
[1] H. Kitano, “Systems biology: a brief overview,” Science, 2002, vol. 295,
pp. 1662-1664.
[2] J. P. Huelsenbeck, F. Ronquist, and R. Nielsen, “Bayesian Inference of
Phylogeny and Its Impact on Evolutionary Biology,” Science, 2002,
vol. 294, pp. 2310-2318.
[3] N. C. Spitzer and T. J. Sejnowski, “Biological Information Processing:
Bits of Progress,” Science, 2000, vol. 277, pp. 1060-1063.
[4] A. Renner and A. Aszodi, “High-throughput Functional Annotation of
Novel Gene Products using Document Clustering,” Pacific Symposium
on Biocomputing 2000, pp. 54-68.
[5] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach,
Pearson Education, New Jersey, USA, 2003.
[6] A. M. Turing, “Computing Machinery and Intelligence,” Mind: A Quarterly
Review of Psychology and Philosophy, 1950. Available online:
http://www.abelard.org/turpap/turpap.htm
[7] IBM, “Deep Blue,” 1997. Available online:
http://www.research.ibm.com/deepblue/
[8] M. Amitai, “Hidden Models in Biopolymers,” Science, 2001, vol. 282,
pp. 1436-1440.
[9] C. Yoo and G. F. Cooper, “An Evaluation of a System that Recommends
Microarray Experiments to Perform to Discover Gene-Regulation
Pathways,” unpublished.
[10] NCBI, “What is GenBank?,” 2003. Available online:
http://www.ncbi.nlm.nih.gov/Genbank/
[11] J. Sjolander et al., Comput. Appl. Biosci., 1996, vol. 12, p. 327.
[12] S. Eddy, “HMMer: Profile HMMs for protein sequence analysis,” 2003.
Available online: http://hmmer.wustl.edu/
[13] UCSC, “Sequence Alignment and Modelling System,” 2003. Available
online: http://www.cse.ucsc.edu/research/compbio/sam.html