SPECIAL ISSUE ON SYSTEMS BIOLOGY, JANUARY 2008
The 4M (Mixed Memory Markov Model) Algorithm
for Finding Genes in Prokaryotic Genomes
Mathukumalli Vidyasagar, Fellow, IEEE, Sharmila S. Mande, Ch. V. Siva Kumar Reddy, and V. V. Raja Rao
Abstract—In this paper, we present a new algorithm called
4M (mixed memory Markov model) for finding genes from the
genomes of prokaryotes. This is achieved by modeling the known coding regions of the genome as a set of sample paths of one multistep Markov chain (the coding model) and the known non-coding regions as a set of sample paths of another multistep Markov chain (the non-coding model). The new feature of the 4M algorithm is that different states
are allowed to have different memory lengths, in contrast to a fixed
multistep Markov model used in GeneMark in its various versions.
At the same time, compared with an algorithm like Glimmer3
that uses an interpolation of Markov models of different memory
lengths, the statistical significance of the conclusions drawn from
the 4M algorithm is quite easy to quantify. Thus, when a whole
genome annotation is carried out and several new genes are
predicted, it is extremely easy to rank these predictions in terms
of the confidence one has in the predictions. The basis of the 4M
algorithm is a simple rank condition satisfied by the matrix of
frequencies associated with a Markov chain.
The 4M algorithm is validated by applying it to 75 organisms
belonging to practically all known families of bacteria and archaea.
The performance of the 4M algorithm is compared with those of
Glimmer3, GeneMark2.5d, and GeneMarkHMM2.6g. It is found
that, in a vast majority of cases, the 4M algorithm finds many more
genes than it misses, compared with any of the other three algorithms. Next, the 4M algorithm is used to carry out whole genome
annotation of 13 organisms by using 50% of the known genes as
the training input for the coding model and 20% of the known
non-genes as the training input for the non-coding model. After
this, all of the open reading frames are classified. It is found that
the 4M algorithm is highly specific in that it picks out virtually all
of the known genes, while predicting that only a small number of
the open reading frames whose status is unknown are genes.
Index Terms—Algorithm, gene prediction, K–L divergence,
Markov model, prokaryotes.
I. INTRODUCTION
A. Gene-Finding Problem
ALL living things contain DNA, which is a very complex molecule arranged in a double helix. DNA consists
of a series of nucleotides, where each nucleotide is denoted
by the base it contains, namely, A (Adenine), C (Cytosine), G
(Guanine), or T (Thymine). The genome of an organism is the
listing of one strand of DNA as an enormously long sequence of
symbols from the four-symbol alphabet {A, C, G, T}. Certain
Manuscript received January 22, 2007; revised September 1, 2007.
The authors are with the Advanced Technology Centre, Tata Consultancy Services, Software Units Layout, Madhapur, Hyderabad 500081, India.
Digital Object Identifier 10.1109/TAC.2007.911360
parts of the genome correspond to genes that get converted into
proteins, while the rest are non-coding regions. In prokaryotes,
or “lower” organisms, the genes are in one continuous stretch,
whereas in eukaryotes, or “higher” organisms, the genes consist of a series of exons, interrupted by introns. The junctions
between exons and introns are called splice sites, and the detection of splice sites is a very difficult problem. For this reason,
the focus in this paper is on finding genes in prokaryotes.
It is easy to state some necessary but not sufficient conditions
for a stretch of genome to be a gene, which are given as follows.
• The sequence must begin with the start codon ATG. In
some organisms, GTG is also a start codon.
• The sequence must end with one of the three stop codons,
namely, TAA, TAG, or TGA.
• The length of the sequence must be an exact multiple of
three.
A stretch of genome that satisfies these conditions is referred to
as an open reading frame (ORF).
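For concreteness, the conditions above can be checked mechanically. The following Python sketch (an illustration only, not part of the implementation used for the results reported here; all names are ours) enumerates the stretches on one strand that satisfy them.

```python
# Illustrative sketch: enumerate stretches satisfying the three necessary conditions.
START_CODONS = {"ATG", "GTG"}      # GTG is accepted as a start codon only in some organisms
STOP_CODONS = {"TAA", "TAG", "TGA"}

def candidate_orfs(seq, min_len=60):
    """Yield (start, end) of stretches that begin with a start codon, end with a
    stop codon, and whose length is an exact multiple of three."""
    seq = seq.upper()
    for frame in range(3):                       # three reading frames on this strand
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3                      # include the stop codon
                if end - start >= min_len:       # length is a multiple of 3 by construction
                    yield start, end
                start = None

# Toy example (min_len lowered so that the short ORF is reported).
print(list(candidate_orfs("CCATGAAATTTGGGTAACC", min_len=9)))   # [(2, 17)]
```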
B. Statistical Approaches to Gene-Finding
There are in essence two distinct approaches to gene-finding,
namely, string-matching and statistical modeling. In
string-matching algorithms, one looks for symbol-for-symbol
matching, whereas in statistical modeling one looks for similarity of the statistical behavior. If one were to examine genes
with the same function across two organisms, then it is likely
that the two DNA sequences would match quite well at a
symbol-for-symbol level. For instance, if one were to compare
the gene that generates insulin in a mouse and in a human,
the two strings would be very similar, except that occasionally
one would have to introduce a “gap” in one sequence or the
other. This particular problem, namely to determine the best
possible match between two sequences, after inserting a few
gaps here and there, is known as the optimal gapped alignment problem. If $x$ and $y$ are the two strings to be aligned and have lengths $m$ and $n$, respectively, then it is possible to give an optimal alignment based on dynamic programming, whose complexity is $O(mn)$. Parallel implementations of this alignment algorithm are also possible. For further details, see [7] and [5].
On the other hand, if one were to examine two different genes
with distinct functionalities but from within the same organism
or within the same family of organisms, then the two genes
would not be similar at a symbol-for-symbol level. However,
it is widely believed that they would match at a statistical level.
The idea behind statistical prediction of genes can be summarized as follows. Suppose we are given several known strings of genes $g_1, \ldots, g_m$ and several known strings of non-genes $h_1, \ldots, h_l$.
We think of the $g_i$'s as sample paths of one stochastic process generated by a “coding model” $M_c$ and of the $h_j$'s as sample paths of another stochastic process generated by a “non-coding model” $M_{nc}$. Now suppose $u$ is an ORF, and we wish to classify it as being a gene or a non-gene. The logical approach is to use “log-likelihood ratio” classification. Thus, we compute the probabilities (or likelihoods) $P_c(u)$ and $P_{nc}(u)$, that is, the likelihood of the string $u$ according to the coding model and the non-coding model, respectively. If $P_c(u) \gg P_{nc}(u)$, then we classify $u$ as a gene, whereas if $P_c(u) \ll P_{nc}(u)$, then we classify $u$ as a non-gene. In the unlikely event that both likelihoods are comparable, the method is inconclusive.
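The decision rule itself is straightforward to implement once the two models are available. The sketch below (illustrative only, not the implementation used here; the conditional-probability tables are stand-ins for the coding and non-coding models constructed in Section II) computes both log-likelihoods in log space to avoid numerical underflow on long ORFs.

```python
import math

def log_likelihood(seq, cond_prob, order):
    """Sum of log conditional probabilities of each symbol given its context.
    `cond_prob` maps (context, symbol) -> probability; contexts shorter than
    `order` occur near the start of the sequence and may be absent from the table."""
    total = 0.0
    for i, sym in enumerate(seq):
        ctx = seq[max(0, i - order):i]
        p = cond_prob.get((ctx, sym), 1e-12)   # small floor for unseen events
        total += math.log(p)
    return total

def classify_orf(seq, coding_probs, noncoding_probs, order=5, margin=0.0):
    """Return 'gene', 'non-gene', or 'inconclusive' by the log-likelihood ratio."""
    ratio = (log_likelihood(seq, coding_probs, order)
             - log_likelihood(seq, noncoding_probs, order))
    if ratio > margin:
        return "gene"
    if ratio < -margin:
        return "non-gene"
    return "inconclusive"
```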
The existing statistical gene prediction methods differ only
in the manner in which the coding model and the non-coding
model are constructed. In Genescan [18], the basic premise is
that, in coding regions, the four nucleic acid symbols occur
roughly with a period of three. Thus, there is no non-coding
model as such. Instead, a discrete Fourier transform is taken
of the occurrence of each nucleotide symbol, and the value is
compared against a threshold. In Genscan [2], [3], the sample
paths are viewed as outputs of a hidden Markov model that
can also emit a “blank” symbol. Because of this, the length
of the observed string does not match the length of the state
sequence, thus leading to an extremely complicated dynamic
programming problem. In GeneMark [12], the observed sequences are interpreted as the outputs of a multistep Markov
process. Probably the most widely used classification method
is Glimmer, which has several refinements to suit specific
requirements. Some references to Glimmer can be found in
[4] and [17]. We shall return to Glimmer and GeneMark again
in Section V, when we present our computational results and
compare them against those produced by Glimmer.
C. Contributions of This Paper

Glimmer uses an interpolated Markov model (IMM) whereby the sample paths are fit with multistep Markov models whose memory varies from 0 to 8. (Note that a Markov process with zero memory is an i.i.d. process.) This requires the estimation of a fairly large number of parameters. To overcome this difficulty, Glimmer uses high-order Markov models only when there is sufficient data to get a reliable estimate. Initially, GeneMark used a fifth-order Markov chain, but subsequent versions use refined versions, including a hidden Markov model (HMM). In contrast, the premise of our study is that, even in a multistep Markov process, different states have different effective memory. This leads to a "mixed memory Markov model." Hence, the algorithm is called 4M.

We begin by fitting the sample paths of both the coding regions and the non-coding regions with a fifth-order Markov model each. The reason for using a fifth-order model (thus exactly replicating hexamer frequencies) is that lower order models result in noticeably poorer performance, whereas higher order models do not seem to improve the performance very much. Using two fifth-order Markov models for the coding and non-coding regions results in two models having $4^5 = 1024$ states each. Then, by using a simple singular value condition, many of these states are combined into one common state. In some cases, the resulting reduced-size Markov model has as few as 150 states, which is an 85% reduction. In addition to the singular value test to choose the level of reduction permissible, we also use the Kullback–Leibler (K-L) divergence rate [9] to bound the error introduced by reducing the size of the state space. This upper bound on the K-L divergence rate can be used to choose a threshold parameter in the rank condition in an intelligent fashion. In addition, the K-L divergence rate is also used to demonstrate that the three-periodicity effect is very pronounced in coding regions but not in the non-coding regions. The statistical significance of the 4M algorithm is rather easy to analyze. As a result, when some ORFs are predicted to be genes using the 4M algorithm, our confidence in the prediction can be readily quantified, and the predictions can be ranked in order of decreasing confidence. In this way, the most confident predictions (if they are also interesting from a biological standpoint) can be followed up for experimental verification.

All in all, the conclusion is that the 4M algorithm performs comparably well or somewhat better than Glimmer and GeneMark; however, in the case of the 4M algorithm, it is quite easy and straightforward to compute the statistical significance of the conclusions drawn.
II. 4M ALGORITHM
A. Multistep Markov Models
Recall that the statistical approach to gene-finding depends on being able to construct two distinct models, $M_c$ for the coding regions and $M_{nc}$ for the non-coding regions. From a purely mathematical standpoint, we have only one problem at hand, namely, given a set of sample paths of a stationary stochastic process, construct a model for these paths. In other words, both $M_c$ and $M_{nc}$ are constructed using exactly the same methodology, but applied to distinct sets of sample paths. Thus, let us concentrate on this problem formulation.

Suppose $n$ is a positive integer, and define $\mathbb{A} = \{a_1, \ldots, a_n\}$. (In the case of genomics, $\mathbb{A} = \{A, C, G, T\}$ and $n = 4$.) Suppose $\{X_t\}$ is a stationary stochastic process assuming values in $\mathbb{A}$, and we have at hand several sample paths of this process. The objective is to construct a stochastic model for the process $\{X_t\}$ on the basis of these observations.

Suppose an integer $k$ is specified, and we know the statistics of the process $\{X_t\}$ up to order $k$. This means that the probabilities of occurrence of all $k$-tuples $u \in \mathbb{A}^k$ are specified for the process $\{X_t\}$. Let $\nu(u)$ denote the probability of occurrence of the string $u$. Thus, if $u$ is a string of length $l$, say $u = u_1 \cdots u_l$, then
$$\nu(u) = \Pr\{X_{t+1} = u_1, \ldots, X_{t+l} = u_l\}.$$
Since the process is stationary, the above probability is independent of $t$. Note that the frequencies $\nu(\cdot)$ must satisfy a set of "consistency conditions" as follows:
$$\sum_{a \in \mathbb{A}} \nu(au) = \sum_{a \in \mathbb{A}} \nu(ua) = \nu(u) \quad \forall u.$$

There is a well-known procedure that perfectly reproduces the specified statistics $\{\nu(u), u \in \mathbb{A}^k\}$ by modeling the given process $\{X_t\}$ as a $(k-1)$-step Markov process. For brevity, let us use the notation $X_s^t$ to mean $(X_s, X_{s+1}, \ldots, X_t)$, and so on. Assuming that the process $\{X_t\}$ is a $(k-1)$-step Markov process means that, if $u$ is a string of length $l$ larger than $k-1$, then
$$\Pr\{X_t = u_l \mid X_{t-l+1}^{t-1} = u_1^{l-1}\} = \Pr\{X_t = u_l \mid X_{t-k+1}^{t-1} = u_{l-k+1}^{l-1}\}.$$
In short, it is assumed that the probability $\Pr\{X_t \mid X_{t-1}, X_{t-2}, \ldots\}$ is not affected by the values of $X_{t-j}$ when $j \geq k$. Moreover, the transition probability of this multistep Markov process is computed as
$$\Pr\{X_t = v \mid X_{t-k+1}^{t-1} = u\} = \frac{\nu(uv)}{\nu(u)}, \quad u \in \mathbb{A}^{k-1},\ v \in \mathbb{A}.$$

The above model, though it is often called a $(k-1)$-step Markov model, is also a traditional (one-step) Markov model over the larger state space $\mathbb{A}^{k-1}$. Suppose $u, w \in \mathbb{A}^{k-1}$ are two states. Then a transition from $u$ to $w$ is possible only if the last $k-2$ symbols of $u$ (read from left to right) are the same as the first $k-2$ symbols of $w$; in other words, it must be the case that $w_1 = u_2$, $w_2 = u_3$, and so on. In this case, the probability of transition from the state $u$ to the state $w$ is given by
$$\Pr\{u \to w\} = \frac{\nu(u\, w_{k-1})}{\nu(u)}. \qquad (1)$$
For all other $w$, the transition probability equals zero. It is clear that, though the state transition matrix has dimension $n^{k-1} \times n^{k-1}$, every row contains at most $n$ nonzero entries. Such a $(k-1)$-step Markov model perfectly reproduces the $k$-tuple frequencies $\nu(u)$ for all $u \in \mathbb{A}^k$. Given a long string $u$ of length $l \geq k$, we can write
$$\nu(u) = \nu(u_1^{k-1}) \prod_{i=k}^{l} \frac{\nu(u_{i-k+1}^{i})}{\nu(u_{i-k+1}^{i-1})}.$$
As a result,
$$\log \nu(u) = \log \nu(u_1^{k-1}) + \sum_{i=k}^{l} \log \frac{\nu(u_{i-k+1}^{i})}{\nu(u_{i-k+1}^{i-1})}.$$
Now, in the above summation, all numbers are of reasonable size.

To round out the discussion, suppose that the statistics of the process $\{X_t\}$ are not known precisely, but need to be inferred on the basis of observing a set of sample paths. This is exactly the problem we have at hand. In this case, one can still apply (1), but with the actual (but unknown) probabilities $\nu(uv)$ and $\nu(u)$ replaced by their empirically observed frequencies. Each of these gives an unbiased estimate of the corresponding probability.

B. 4M Algorithm

Here, we introduce the basic idea behind the 4M algorithm and then present the algorithm itself. We begin with a multistep Markov model and then reduce the size of the state space further by using a criterion for determining whether some states are "Markovian."

The basis for the 4M algorithm is a simple property of Markov processes. Consider a Markov chain $\{X_t\}$ evolving over a finite alphabet $\mathbb{A} = \{a_1, \ldots, a_n\}$. Let $w, u, v \in \mathbb{A}$ and consider the frequency of the triplet $wuv$. Clearly, for any process (Markovian or not), we have
$$\nu(wuv) = \nu(wu) \Pr\{X_{t+1} = v \mid X_{t-1} = w, X_t = u\}.$$
Note that in the above formula we simplify notation by writing $\Pr\{v \mid wu\}$ for $\Pr\{X_{t+1} = v \mid X_{t-1} = w, X_t = u\}$, and so on. Now, if the process is Markovian, then we have
$$\Pr\{v \mid wu\} = \Pr\{v \mid u\}, \quad \text{so that} \quad \nu(wuv) = \nu(wu)\Pr\{v \mid u\}.$$
Hence, if we examine the $n \times n$ matrix
$$F_u = [\nu(wuv)]_{w \in \mathbb{A},\ v \in \mathbb{A}} \quad \text{for some } u \in \mathbb{A},$$
it will have rank one. This is because, with $u$ fixed, it is the outer product of the column vector $[\nu(wu)]_{w \in \mathbb{A}}$ and the row vector $[\Pr\{v \mid u\}]_{v \in \mathbb{A}}$.

There is nothing special about using only a single symbol $u$. Suppose $u$ is a string of finite length, denoted as usual by $u = u_1 \cdots u_l$. Then, just as above, we have that
$$\nu(wuv) = \nu(wu)\Pr\{v \mid wu\}.$$
Thus, if we fix an integer $j$ and examine the $n^j \times n$ matrix
$$F_u = [\nu(wuv)]_{w \in \mathbb{A}^j,\ v \in \mathbb{A}},$$
then $F_u$ has rank one whenever the process is Markovian. Conversely, if the semi-infinite matrix $[\nu(wuv)]$, with rows indexed by all finite strings $w$ and columns indexed by $v \in \mathbb{A}$, has rank one for every $u \in \mathbb{A}$, then the process $\{X_t\}$ is Markovian.

Now suppose we drop the assumption that the process $\{X_t\}$ is Markovian, and suppose the matrix $F_u$ has rank one for a particular fixed string $u$. An elementary exercise in linear algebra shows that, in such a case, we must have
$$\Pr\{v \mid wu\} = \Pr\{v \mid u\} \quad \forall w, v. \qquad (2)$$
This follows by reversing the above reasoning. Accordingly, let us define a state $u$ to be Markovian of order $l$ (where $l$ is the length of $u$) if $F_u$ has rank one. The distinction here is that we are now speaking about an individual state being Markovian as opposed to the entire process.
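To make the estimation step concrete, the following sketch (an illustration of the procedure just described, using the notation above; it is not the code used for the computational results) estimates the $k$-tuple frequencies from training strings and forms the empirical transition probabilities $\nu(uv)/\nu(u)$ of (1).

```python
from collections import Counter, defaultdict

def kmer_frequencies(strings, k):
    """Empirical frequencies nu(u) of all k-tuples appearing in the training strings."""
    counts = Counter()
    for s in strings:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    total = sum(counts.values()) or 1
    return {u: c / total for u, c in counts.items()}

def transition_probs(strings, k):
    """Empirical transition probabilities P(v | u) = nu(uv) / nu(u), with u of length k-1;
    nu(u) is obtained by marginalizing the k-tuple frequencies, which enforces the
    consistency conditions and makes each row sum to one."""
    nu = kmer_frequencies(strings, k)
    row_sums = defaultdict(float)
    for uv, f in nu.items():
        row_sums[uv[:-1]] += f               # nu(u) = sum over v of nu(uv)
    return {(uv[:-1], uv[-1]): f / row_sums[uv[:-1]] for uv, f in nu.items()}

# Example: hexamer statistics (k = 6) give a fifth-order (five-step) model.
model = transition_probs(["ATGAAATTTGGGTAACGTACGT", "ATGCCCGGGTTTTAAACGTACG"], k=6)
```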
The above rank one property and definition can also be extended to multistep Markov processes. Suppose $\{X_t\}$ is an $s$-step Markov process. This means that, for each fixed $u \in \mathbb{A}^s$, we have
$$\Pr\{v \mid wu\} = \Pr\{v \mid u\} \quad \forall w, v.$$
Hence, for each fixed $u \in \mathbb{A}^s$, the matrix $[\nu(wuv)]_{w,v}$ has rank one. Following earlier reasoning, we can define a state $u$ of length $l < s$ to be a Markovian state of order $l$ if the corresponding matrix $F_u$ has rank one.

Let us now consolidate all of this discussion and apply it to reducing the size of the state space of a multistep Markov process. Suppose, as before, that $\{X_t\}$ is a stationary stochastic process assuming values in a finite set $\mathbb{A} = \{a_1, \ldots, a_n\}$. We have already seen that, in order to reproduce perfectly the $k$-tuple frequencies, it suffices to construct a $(k-1)$-step Markov model. Suppose that such a multistep model has indeed been constructed. This model makes use of the $k$-tuple frequencies $\nu(u)$, $u \in \mathbb{A}^k$. Now suppose that, for some integer $l < k-1$ and some string $u \in \mathbb{A}^l$, it is the case that the matrix
$$F_u = [\nu(wuv)]_{w \in \mathbb{A}^{k-1-l},\ v \in \mathbb{A}} \qquad (3)$$
has rank one. By elementary linear algebra, this implies that
$$\Pr\{v \mid wu\} = \Pr\{v \mid u\} \quad \forall w \in \mathbb{A}^{k-1-l},\ v \in \mathbb{A}. \qquad (4)$$
In other words, the conditional probability of finding a symbol at a particular location depends only on the preceding $l$ symbols $u$ and not on the $k-1$ symbols that precede it. Hence, if $F_u$ has rank one, then we can "collapse" all states of the form $wu$ for all $w \in \mathbb{A}^{k-1-l}$ into a single state $u$. For this reason, we call $u$ a Markovian state if $F_u$ has rank one. The interpretation of $u$ being a Markovian state is that, when this string occurs, the process has a "memory" of only $l$ time steps and not $k-1$ in general.

To implement the reduction in state space, we therefore proceed as follows.

Step 1) Compute the vector of $k$-tuple frequencies.

Step 2) Set $l = 1$ and, for each $u \in \mathbb{A}^l$, compute the matrix $F_u$. If $F_u$ has rank one, then collapse all states of the form $wu$ for all $w \in \mathbb{A}^{k-1-l}$ into a single state $u$. Repeat this test for all $u \in \mathbb{A}^l$.

Step 3) Increase the value of $l$ by one and repeat until $l = k-2$.

When the search process is complete, the initial set of $n^{k-1}$ states will have been collapsed into some intermediate number, whose value depends on the $k$-tuple frequencies. Since we are modeling the process $\{X_t\}$ as a $(k-1)$-step Markov process, in general, for a string $u$ of length $l \geq k$, we can write
$$\nu(u) = \nu(u_1^{k-1}) \prod_{i=k}^{l} \frac{\nu(u_{i-k+1}^{i})}{\nu(u_{i-k+1}^{i-1})}$$
as before. Now, if a substring of the context $u_{i-k+1}^{i-1}$ is Markovian, say its suffix $w$, then we can take advantage of (3) and (4) and make the substitution
$$\frac{\nu(u_{i-k+1}^{i})}{\nu(u_{i-k+1}^{i-1})} \longrightarrow \frac{\nu(w\, u_i)}{\nu(w)}$$
in the above formula. This is the reason for calling the algorithm a "mixed memory Markov model," since different $(k-1)$-tuples have memories of different lengths.

The preceding theory is "exact" provided we use true probabilities in the various computations. However, in setting up the multistep Markov model, we are using empirically observed frequencies and not true probabilities. Hence, it is extremely unlikely that any matrix $F_u$ will exactly have rank one. At this point, we take advantage of the fact that we wish to do classification and not modeling. This means that, in constructing the coding model $M_c$ and the non-coding model $M_{nc}$, it is really not necessary to get the likelihoods exactly right; it is sufficient for them to be of the right order of magnitude. Hence, in implementing the 4M algorithm, we take a matrix $F_u$ as having "effective rank one" if it satisfies the condition
$$\sigma_2(F_u) \leq \epsilon\, \sigma_1(F_u) \qquad (5)$$
where $\sigma_1(F_u)$ and $\sigma_2(F_u)$ denote the largest two singular values of the matrix $F_u$, and $\epsilon$ is an adjustable threshold parameter. We point out in Section VI that, by using the K-L divergence rate between Markov models, it is possible to choose the threshold $\epsilon$ "intelligently." Setting a state $u$ to be a Markovian state even if $F_u$ is not exactly a rank one matrix is equivalent to making the approximation
$$\frac{\nu(wuv)}{\nu(wu)} \approx \frac{\nu(uv)}{\nu(u)} \quad \forall w, v. \qquad (6)$$
In other words, in the original $(k-1)$-step Markov model, the entries in the rows corresponding to all states of the form $wu$ are modified according to (6).
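A minimal sketch of the resulting reduction loop is given below (again an illustration rather than the production implementation; numpy is assumed, and the frequency table comes from a routine such as kmer_frequencies above). For each candidate suffix $u$, it assembles the matrix of frequencies $\nu(wuv)$ and applies the effective-rank-one test (5).

```python
import itertools
import numpy as np

ALPHABET = "ACGT"

def freq_matrix(nu, u, k):
    """Matrix [nu(w u v)] with rows indexed by prefixes w of length k-1-len(u)
    and columns indexed by the next symbol v; nu maps k-tuples to frequencies."""
    prefixes = ["".join(p) for p in itertools.product(ALPHABET, repeat=k - 1 - len(u))]
    return np.array([[nu.get(w + u + v, 0.0) for v in ALPHABET] for w in prefixes])

def markovian_suffixes(nu, k, eps):
    """Steps 2 and 3: for increasing suffix length l, return the suffixes u that pass
    the effective-rank-one test sigma_2(F_u) <= eps * sigma_1(F_u); all states of the
    form w+u may then be collapsed into the single state u."""
    markovian = []
    for l in range(1, k - 1):
        for u in map("".join, itertools.product(ALPHABET, repeat=l)):
            F = freq_matrix(nu, u, k)
            s = np.linalg.svd(F, compute_uv=False)   # singular values, descending
            if s[0] > 0 and s[1] <= eps * s[0]:
                markovian.append(u)
    return markovian
```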
III. K-L DIVERGENCE RATE
Here, we introduce the notion of the K-L divergence rate
between stochastic processes and its applications to Markov
chains. Then, we derive an expression for the K-L divergence
-step Markov model and the
rate between the original
4M-reduced model. This formula is of interest because the two
processes have a common output space, but not a common state
space. These results are applied to some problems in genomics
in Section IV.
A. K-L Divergence
Let $n$ be an integer, and let $S_n$ denote the $n$-simplex, namely,
$$S_n = \Big\{ \mu \in \mathbb{R}^n : \mu_i \geq 0\ \forall i,\ \sum_{i=1}^{n} \mu_i = 1 \Big\}.$$
Thus, $S_n$ is just the set of probability distributions for an $n$-valued random variable. Suppose $\mu, \nu \in S_n$ are two such probability distributions. Then the K-L divergence between the two vectors is defined as
$$D(\mu \| \nu) = \sum_{i=1}^{n} \mu_i \log \frac{\mu_i}{\nu_i}. \qquad (7)$$
Note that, in order for $D(\mu\|\nu)$ to be finite, $\mu$ needs to be dominated by $\nu$, that is, $\nu_i = 0 \Rightarrow \mu_i = 0$. We write $\mu \ll \nu$ or $\nu \gg \mu$ to denote that $\mu$ is dominated by $\nu$ or that $\nu$ dominates $\mu$. Here, we adopt the usual convention that $0 \log(0/0) = 0$.

The K-L divergence has several possible interpretations, of which only one is given here. Suppose we are given data generated by an i.i.d. sequence whose one-dimensional marginal distribution is $\phi$. There are two competing hypotheses, namely, that the probability distribution is $\mu$ and that the probability distribution is $\nu$, neither of which may be "the truth" $\phi$. If we observe a sequence $x_1, \ldots, x_m$, where $m$ is the length of the observation and each $x_i$ has one of the $n$ possible values, we compute the likelihood of the observation under each of the two hypotheses and choose the more likely one, that is, the hypothesis that is more compatible with the observed data. In this case, it is easy to show that the expected value of the per-symbol log-likelihood ratio is precisely equal to $D(\phi\|\nu) - D(\phi\|\mu)$. Thus, in the long run, we will choose the hypothesis $\mu$ if $D(\phi\|\mu) < D(\phi\|\nu)$ and the hypothesis $\nu$ if $D(\phi\|\nu) < D(\phi\|\mu)$. In other words, in the long run, we will choose the hypothesis that is "closer" to the "truth" $\phi$. Therefore, even though the K-L divergence is not truly a distance (it does not satisfy either the symmetry property or the triangle inequality), it does induce a partial ordering on the set $S_n$. The difference $D(\phi\|\nu) - D(\phi\|\mu)$ is the per-symbol contribution to the log-likelihood ratio. As a parenthetical aside, see [11] for a very general discussion of divergences generated by an arbitrary convex function. For an appropriate choice of the convex function, the corresponding divergence will in fact satisfy the one-sided triangle inequality. However, the popular choice that generates the K-L divergence is not one such function.

B. K-L Divergence Rate

The traditional K-L divergence measure is perfectly fine when the classification problem involves a sequence of independent observations. However, in trying to model a stochastic process via observing it, it is not always natural to assume that the observations are independent. It is therefore desirable to have a generalization of the K-L divergence to the case where the samples may be dependent. Such a generalization is given by the K-L divergence rate.

Suppose $\mathbb{A}$ is some set, and $\{X_t\}$ is a stochastic process assuming values in the set $\mathbb{A}$. Thus, the stochastic process itself assumes values in the infinite Cartesian product space $\mathbb{A}^\infty$. Suppose $P, Q$ are two probability laws, that is, probability measures on the product space $\mathbb{A}^\infty$. In principle, we could define the K-L divergence between the two laws $P$ and $Q$ by extending the standard definition, using Radon–Nikodym derivatives and so on. The trouble is that most of the time the divergence would be infinite and conveys no useful information. Thus, blindly computing the divergence between the two laws of a stochastic process gives no useful information most of the time.

To get around this difficulty, it is better to use the K-L divergence rate. It appears that the K-L divergence rate was introduced in [9]. If $P$ and $Q$ are two probability laws on $\mathbb{A}^\infty$ and if $\mathbb{A}$ is a finite set, we define
$$R(P\|Q) = \lim_{m \to \infty} \frac{1}{m} D(P_m \| Q_m) \qquad (8)$$
where $P_m$ and $Q_m$ are the marginal distributions of $P$ and $Q$, respectively, onto the $m$-dimensional product $\mathbb{A}^m$, and $D(\cdot\|\cdot)$ is just the conventional K-L divergence (without the rate). The idea is that, in many cases, the "pure" K-L divergence $D(P_m\|Q_m)$ approaches infinity as $m \to \infty$. However, dividing by $m$ moderates the rate of growth. Moreover, if the ratio has a finite limit as $m \to \infty$, then the K-L divergence rate gives a measure of the asymptotic rate at which the "pure" divergence blows up as $m \to \infty$.

The K-L divergence rate has essentially the same interpretation as the K-L divergence. Suppose we are observing a stochastic process whose law is $\Phi$. We are trying to decide between two competing hypotheses: The process has the law $P$, and the process has the law $Q$. After $m$ samples, the expected value of the log-likelihood ratio is asymptotically equal to $m[R(\Phi\|Q) - R(\Phi\|P)]$.

The paper [16] gives a good historical overview of the properties of the K-L divergence rate. Specifically, in general the K-L divergence rate may not exist between arbitrary probability measures, but it seems to exist under many reasonable conditions. For example, it is known [6] that, if $P$ is a stationary law and $Q$ is the law of a finite-state Markov process, then the K-L divergence rate is well defined. It is shown in [14] that the K-L divergence rate exists if both laws correspond to ergodic processes.
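For reference, the conventional K-L divergence (7), with the $0\log(0/0) = 0$ convention and the domination check, can be computed as follows (an illustrative sketch; numpy is assumed, and logarithms are taken to base 2, as elsewhere in this paper).

```python
import numpy as np

def kl_divergence(mu, nu):
    """Conventional K-L divergence D(mu || nu) in bits, with 0 log(0/0) = 0;
    returns +inf if mu is not dominated by nu."""
    mu = np.asarray(mu, dtype=float)
    nu = np.asarray(nu, dtype=float)
    if np.any((nu == 0) & (mu > 0)):      # domination condition violated
        return np.inf
    mask = mu > 0
    return float(np.sum(mu[mask] * np.log2(mu[mask] / nu[mask])))

# Example: two distributions on a four-letter alphabet.
print(kl_divergence([0.4, 0.3, 0.2, 0.1], [0.25, 0.25, 0.25, 0.25]))
```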
C. K-L Divergence Rate Between Markov Processes
In [16], an explicit formula is given for the K-L divergence rate between two Markov processes over a common (finite) state space. We give an alternate version of the formula derived in [16], which generalizes very cleanly to multistep Markov processes.

Suppose $P$ and $Q$ are the laws of two Markov processes over a finite set $\mathbb{A} = \{a_1, \ldots, a_n\}$. Thus, there are stochastic matrices $A$ and $B$ and corresponding stationary vectors $\pi_A$ and $\pi_B$; thus, $\pi_A A = \pi_A$ and $\pi_B B = \pi_B$. If $P$ is the law of the Markov process $(\pi_A, A)$ and $Q$ is the law of the Markov process $(\pi_B, B)$, then it is shown in [16] that
$$R(P\|Q) = \sum_{i=1}^{n} (\pi_A)_i\, D(a^i \| b^i) \qquad (9)$$
where $a^i$ and $b^i$ denote the $i$th rows of the matrices $A$ and $B$, respectively. In order for the divergence rate to be finite, the two state transition matrices $A$ and $B$ must satisfy the condition $b_{ij} = 0 \Rightarrow a_{ij} = 0$ or, in the earlier notation, we must have $a^i \ll b^i$ for all $i$. We denote this condition by $A \ll B$ or $B \gg A$.
Now we give an alternate formulation of (9) that is in some sense a little more intuitive.

Theorem 1: Suppose $A$ and $B$ are stochastic matrices, and let $\pi_A$ and $\pi_B$ denote associated stationary probability distributions. Thus, $\pi_A A = \pi_A$ and $\pi_B B = \pi_B$. Let $P$ denote the law of the Markov process $(\pi_A, A)$ and let $Q$ denote the law of the Markov process $(\pi_B, B)$. Let $\mu_A$ denote the frequency vector of doublets $(X_t, X_{t+1})$ under the Markov chain $(\pi_A, A)$. Similarly, let $\mu_B$ denote the frequency vector of doublets under the Markov chain $(\pi_B, B)$. Suppose $A \ll B$. Then, the K-L divergence rate between the Markov chains is given by
$$R(P\|Q) = D(\mu_A \| \mu_B) - D(\pi_A \| \pi_B) \qquad (10)$$
where $D(\cdot\|\cdot)$ is the conventional K-L divergence between probability vectors.

1) Remarks: Formula (10) gives a nice interpretation of the K-L divergence rate between Markov chains: it is just the difference between the divergence of the doublet frequencies and the divergence of the singlet frequencies. Moreover, it is easy to extend it to $s$-step Markov chains. The K-L divergence rate is just the difference between the divergence of the $(s+1)$-tuple frequencies and the divergence of the $s$-tuple frequencies.

The proof is omitted in the interests of brevity. It can be found in [21].

In [16], the authors do not give an explicit formula for the K-L divergence rate between multistep Markov chains. There is an analogous formula to (10) in the case of $s$-step Markov models. Define $\mu_A^{(s+1)}$ to be the frequency vector of $(s+1)$-tuples for the first Markov chain, and define the symbols $\mu_B^{(s+1)}$, $\mu_A^{(s)}$, and $\mu_B^{(s)}$ in the obvious fashion. Then
$$R(P\|Q) = D\big(\mu_A^{(s+1)} \| \mu_B^{(s+1)}\big) - D\big(\mu_A^{(s)} \| \mu_B^{(s)}\big). \qquad (11)$$
The proof, based on the fact that an $s$-step Markov model is just a one-step Markov model on $\mathbb{A}^s$, is easy and is left to the reader.

D. K-L Divergence Rate When the 4M Algorithm is Used

Theorem 1 gives the K-L divergence rate between two Markov processes over the same state space. In this paper, we begin with a $(k-1)$-step Markov process and approximate it by some other Markov process by applying the 4M algorithm. When we do so, the resulting processes no longer share a common state space. Thus, Theorem 1 no longer applies. The next theorem gives a formula for the K-L divergence rate when the 4M algorithm is used to achieve this reduction. Note that the problem of computing the K-L divergence rate between two entirely arbitrary HMMs with a common output space is still an open problem.

Theorem 2: Suppose $\{X_t\}$ is a stationary stochastic process and that the frequencies $\nu(u)$ of all $k$-tuples $u \in \mathbb{A}^k$ are specified. Let $\hat{X}$ denote the approximation of $\{X_t\}$ by a $(k-1)$-step Markov process. Suppose now that we apply the 4M algorithm and choose various tuples as "Markovian states." Let $u^{(1)}, \ldots, u^{(r)}$ denote the Markovian states and let $l_i$ denote the length of the Markovian state $u^{(i)}$. Finally, let $\tilde{X}$ denote the resulting stochastic process. Then, the K-L divergence rate between the original $(k-1)$-step Markov model and the 4M-reduced Markov model (or equivalently between the laws of the processes $\hat{X}$ and $\tilde{X}$) is given by
$$R(\hat{X}\|\tilde{X}) = \sum_{i=1}^{r} \sum_{w \in \mathbb{A}^{k-1-l_i}} \sum_{v \in \mathbb{A}} \nu(w u^{(i)} v) \log \frac{\nu(w u^{(i)} v)/\nu(w u^{(i)})}{\nu(u^{(i)} v)/\nu(u^{(i)})}. \qquad (12)$$

Proof: Note that the full $(k-1)$-step Markov model has exactly $n$ nonzero entries in each row labeled by $u \in \mathbb{A}^{k-1}$, and these entries are $\nu(uv)/\nu(u)$ as $v$ varies over $\mathbb{A}$. One can think of the reduced-order model obtained by the 4M algorithm as containing the same rows, except that, if $u^{(i)}$ is a Markovian state, then the entries in all rows of the form $w u^{(i)}$ are changed from $\nu(w u^{(i)} v)/\nu(w u^{(i)})$ to $\nu(u^{(i)} v)/\nu(u^{(i)})$. The vector of $(k-1)$-tuple frequencies is a stationary distribution of the original $(k-1)$-step Markov model. Now (12) readily follows from (11).

Note that, in applying the 4M algorithm, we approximate the ratio $\nu(w u^{(i)} v)/\nu(w u^{(i)})$ by the ratio $\nu(u^{(i)} v)/\nu(u^{(i)})$ for each string $u^{(i)}$ that is deemed to be a Markovian state. Hence, the quantity inside the logarithm in (12) should be quite close to one, and its logarithm should be close to zero.
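The following sketch illustrates Theorem 1 numerically (an illustration only; numpy is assumed): it computes the divergence rate between two small Markov chains both by the row formula (9) and by the doublet-minus-singlet formula (10), and the two values agree up to rounding.

```python
import numpy as np

def kl(mu, nu):
    """D(mu || nu) in bits with the 0 log 0 convention (domination assumed)."""
    mu, nu = np.asarray(mu, float), np.asarray(nu, float)
    m = mu > 0
    return float(np.sum(mu[m] * np.log2(mu[m] / nu[m])))

def stationary(A):
    """Stationary vector pi with pi A = pi (a unique one is assumed to exist)."""
    vals, vecs = np.linalg.eig(A.T)
    pi = np.real(vecs[:, np.argmax(np.real(vals))])
    return pi / pi.sum()

def kl_rate_rows(A, B):
    """Formula (9): sum_i pi_i D(a^i || b^i)."""
    pi = stationary(A)
    return sum(pi[i] * kl(A[i], B[i]) for i in range(len(pi)))

def kl_rate_doublets(A, B):
    """Formula (10): divergence of doublet frequencies minus divergence of singlets."""
    piA, piB = stationary(A), stationary(B)
    return kl((piA[:, None] * A).ravel(), (piB[:, None] * B).ravel()) - kl(piA, piB)

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.2, 0.8]])
print(kl_rate_rows(A, B), kl_rate_doublets(A, B))   # equal up to rounding
```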
IV. COMPUTATIONAL RESULTS—I: APPLICATIONS OF
THE K-L DIVERGENCE RATE
This section contains the first set of computational results.
Here, we study the three-periodicity of coding regions using the
K-L divergence rate. The same K-L divergence rate is also used
to show that there is virtually no three-periodicity effect in the
non-coding regions. Then, we analyze the effect of reducing the
size of the state space using the 4M algorithm in terms of the
generalization error.
A. List of Organisms Analyzed
The 4M algorithm was applied to 75 prokaryote genomes of
microbial organisms. These genomes comprised both bacteria
as well as archae. To save space, in the tables showing the computational results we give the names of the various organisms
in a highly abbreviated form. Table V gives a list of all the organisms for which the computational results are presented here,
together with the abbreviations used.
B. Three-Periodicity of Coding Regions
There is an important feature that needs to be built into any
stochastic model of genome sequences. It has been observed
that, if one were to treat the coding regions as sample paths of a
stationary Markov process, then the results are pretty poor. The
reason is that genomic sequences exhibit a pronounced three-periodicity. This means that the conditional probability $\Pr\{X_t = v \mid X_{t-k+1}^{t-1} = u\}$ is not independent of $t$, but is instead periodic in $t$ with a period of three. Thus, instead of constructing one $(k-1)$-step Markov model, we must in fact construct three such models. These are referred to as the Frame 0, Frame 1, and Frame 2 models.
TABLE I
DIVERGENCES BETWEEN MARKOV MODELS OF CODING REGIONS

We begin from the start of the genome and label the first nucleotide as Frame 0, the second nucleotide as Frame 1, the third nucleotide as Frame 2, and loop back to label the fourth nucleotide as Frame 0, and so on. Then the 4M reduction using the rank condition is applied to each frame. Since a three-periodic Markov chain over a state space $\mathbb{A}$ can also be written as a stationary Markov chain over the state space $\mathbb{A}^3$, there are no conceptual difficulties because of three-periodicity.
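The frame bookkeeping can be sketched as follows (an illustration of the idea, not the training code used for the results reported here; frames are counted from the start of each training sequence): each position is assigned to Frame 0, 1, or 2 according to its index modulo three, and one fifth-order model is estimated per frame.

```python
from collections import Counter, defaultdict

def frame_transition_probs(coding_seqs, k=6):
    """Train one (k-1)-step model per frame: the k-mer ending at a position in
    frame f (position index mod 3 == f) is credited to model f.
    Returns a list of three {(context, symbol): probability} maps."""
    counts = [Counter(), Counter(), Counter()]
    for s in coding_seqs:
        for i in range(k - 1, len(s)):
            counts[i % 3][(s[i - k + 1:i], s[i])] += 1
    models = []
    for c in counts:
        rows = defaultdict(float)
        for (ctx, _), n in c.items():
            rows[ctx] += n                      # row totals for normalization
        models.append({key: n / rows[key[0]] for key, n in c.items()})
    return models

frame_models = frame_transition_probs(["ATGAAACCCGGGTTTACGTAA"], k=6)
```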
In this subsection, we use the K-L divergence rate introduced
in Section III to assess the significance of three-periodicity in
both coding and non-coding regions in various organisms. The
study is carried out as follows. In 13 organisms, we constructed three-periodic models for the known coding regions as well as known non-coding regions, using the value $k = 6$. This means that for each organism we constructed six different fifth-order Markov models that perfectly reproduced the observed hexamer frequencies. These models are denoted by $C_0, C_1, C_2$ and $N_0, N_1, N_2$, respectively (three coding-region models and three non-coding-region models). For the three coding-region models, we computed six different divergence rates, namely $R(C_i \| C_j)$ for all $i \neq j$. Then, we did the same for the three non-coding-region models. (Remember that the K-L divergence rate is not symmetric.) Tables I and II show these six divergence rates for 13 of the organisms listed in Section IV-A, for both coding regions as well as non-coding regions. Actually, we computed 75 such divergences, but only 13 are presented here. To make the tables fit within the two-column format, we use the obvious notation $R_{ij}$ to denote $R(C_i \| C_j)$ or $R(N_i \| N_j)$, as appropriate.
Note that throughout the base of the logarithm used in (12) is
2. Thus, all logarithms are binary logarithms.
From Tables I and II, it is clear that the three-periodicity effect
in the non-coding regions is noticeably less than in the coding
regions, in the sense that the K-L divergence rates between the
three frames of the non-coding models are essentially negligible, compared with the corresponding divergence rates in the
coding regions. In fact, except for Mycoplasma genitalium, the
divergences in the non-coding regions are essentially negligible.
M. genitalium is a peculiar organism, in which the codon TGA
codes for the amino acid tryptophan, instead of being a stop
codon as it is in practically all other organisms. Thus, when we
construct algorithms for predicting genes, we would be justified in ignoring the three-periodicity effect in the non-coding
regions.
TABLE II
DIVERGENCES BETWEEN MARKOV MODELS OF NON-CODING REGIONS
TABLE III
SIZES OF 4M-REDUCED MARKOV MODELS
C. Reduction in Size of State Space
Here, we study the reduction in the size of the state space
when the 4M algorithm is used. In applying the 4M algorithm, we used the value $k = 6$. It has been verified numerically that smaller values of $k$ do not give good predictions, while larger values of $k$ do not lead to any improvement in performance. Thus, $k = 6$ seems to be the right value. Hence, for each organism, we constructed three coding-region models and one non-coding-region model, each model being fifth-order Markovian. Recall that, due to the three-periodicity of the coding regions, we need three models, one for each frame in the coding region. Since the non-coding region does not show so much of a three-periodicity effect (as demonstrated in the preceding subsection), we ignore that possibility and construct just one model for the non-coding region. Each of the fifth-order Markovian models has $4^5 = 1024$ states, consisting of pentamers of nucleotides. Then, for each of these four models, we applied the 4M reduction, with the threshold $\epsilon$ in (5) set somewhat arbitrarily. A more systematic way to choose $\epsilon$ is given in Section VI. Recall that the larger the threshold $\epsilon$, the larger the number of states that will satisfy the "near rank one" condition (5), and the greater the reduction in the size of the state space.
The CPU time for computing the hexamer frequencies of a
genome with about one million base pairs is approximately 10
s on an Intel Pentium IV processor running at 2.8 GHz, while
the state space reduction takes just 0.3 s or about 3% of the time
needed to compute the frequencies. Thus, once the hexamer frequencies are constructed, the extra effort needed to apply the 4M algorithm is negligible.

TABLE IV
RESULTS OF WHOLE GENOME ANNOTATION USING THE 4M ALGORITHM USING 50% TRAINING DATA

TABLE V
ABBREVIATIONS OF ORGANISM NAMES AND COMPARISON OF 4M VERSUS GLIMMER 3
Table III shows the size of the 4M-reduced state space for
each of the 13 organisms studied. All of the numbers in the table
should be compared with 1024, which is the number of states of
the full fifth-order Markov chain. Moreover, in Glimmer and its
variants, one uses up to eighth-order Markov models for certain
organisms, meaning that in the worst case the size of the state space could be as high as $4^8 = 65\,536$. From this table, it is clear
that in most cases the 4M algorithm leads to fairly significant
reduction in the size of the state spaces. There are some dramatic
reductions, such as in the case of B. sub, for which the reduction
in the size of the state space is of the order of 85%. Moreover,
in almost all cases, the size of the state space is reduced by at
least 50%.
V. COMPUTATIONAL RESULTS—II: GENE PREDICTION
Now we come to the main topic of this paper, namely,
finding genes. One can identify two distinct philosophies
in gene-prediction algorithms. Some algorithms, including
the one presented here, can be described as “bootstrapping.”
Thus, we begin with some known genes, construct a stochastic
model based on those, and then use that model to classify
the remaining ORFs as potential genes. The most promising
predictions are then validated either through experiment or
through comparison with known genes of other similar organisms. The validated genes are added to the training sample and
the process is repeated. This is why the process may be called
bootstrapping. In contrast, Glimmer (in its several variants),
which is among the most popular and most accurate prediction algorithms at present, can be described as an ab initio scheme.
In Glimmer, all ORFs longer than 500 base pairs are used as the
training set for the coding regions. The premise is that almost
all of these are likely to be genes anyway. In principle, we
could apply the 4M algorithm with the same initial training set
and the results would not be too different from those presented
here.
For a genome with one million base pairs, the 4M algorithm required approximately 10 s of CPU time and approximately 5 MB of storage for training the coding and non-coding models, compared with 10 s of CPU time and 50 MB of storage for Glimmer3. The prediction problem took about 60 s of CPU time and 20 MB of storage for 4M versus 13 s of CPU time and 4 MB of storage for Glimmer3. Our implementation of the 4M algorithm was done using Python, which is very efficient for the programmer but very inefficient in terms of CPU time. Thus, we believe that there is considerable scope for reduction in both the CPU time as well as the storage requirements when implementing the 4M algorithm.

TABLE VI
COMPARISON OF 4M ALGORITHM VERSUS GENEMARK 2.5D AND GENEMARKHMM2.6g
A. Classification of Annotated Genes Using 4M and Other
Methods: Comparison of Results
Here, we take the database of “annotated” genes for each of
75 organisms and classify them using 4M, Glimmer3, GeneMark2.5d, and GeneMarkHMM2.6g. The database of “annotated” genes represents a kind of consensus. Some of the genes
in the database are experimentally validated, while others are
sufficiently similar at a symbol-for-symbol level to other known
genes that these too are believed to be genes. Thus, it is essential that any algorithm should pick up most if not all of these
annotated genes.
The test was conducted as follows. To construct the coding and non-coding models for the 4M algorithm, we took some known genes to train the coding model and some known non-coding regions to train the non-coding model. The fraction of known genes used to train the coding model was 50%, that is, we used every other gene. For the non-coding model, we picked around 20% of the known non-coding regions at random. Throughout, we used a three-periodic model for the coding regions and a "uniform" (i.e., nonperiodic) model for the non-coding regions. These models were then 4M-reduced using the threshold $\epsilon$. Then the remaining (known) coding and non-coding regions were classified using the log-likelihood method.
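The training protocol just described amounts to a simple split of the annotated data; a sketch follows (variable names are illustrative, and the random sample stands in for the 20% selection of non-coding regions).

```python
import random

def split_training_data(annotated_genes, noncoding_regions, seed=0):
    """Every other gene trains the coding model; the rest are held out for testing.
    Roughly 20% of the non-coding regions, chosen at random, train the non-coding model."""
    train_genes = annotated_genes[::2]
    test_genes = annotated_genes[1::2]
    rng = random.Random(seed)
    train_noncoding = rng.sample(noncoding_regions, max(1, len(noncoding_regions) // 5))
    return train_genes, test_genes, train_noncoding
```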
In the tables, we have used the following notation.
• Total Genes denotes the total number of genes in the annotated database.
• 4M & Gl denotes the genes picked up by both the 4M algorithm and Glimmer3.
• $\overline{\text{4M}}$ & $\overline{\text{Gl}}$ denotes the genes missed by both algorithms.
• 4M & $\overline{\text{Gl}}$ denotes the genes picked up by the 4M algorithm but missed by Glimmer3.
• $\overline{\text{4M}}$ & Gl denotes the genes missed by the 4M algorithm but picked up by Glimmer3.
Similar notation is used in Table VI, with Glimmer3 replaced by GeneMark2.5d (denoted by GMK) and GeneMarkHMM (denoted by GHMM).
First, we compare the performance of the 4M algorithm
against that of Glimmer3, as detailed in Table V. It is worth
pointing out that, in the results presented here, there is no
“postprocessing” of the raw output of the 4M algorithm, as is
common with other algorithms. The key points of comparison
are the numbers in the next-to-last and last columns. From this
table, it can be seen that, except in the case of seven organisms
(B. jap, C. vio, G. vio, the three members of the Pseudomonas family, and R.
etl), 4M finds at least as many genes as it misses, compared
with Glimmer3. On the other side, in 32 organisms, the number
of annotated genes found by 4M and missed by Glimmer3
is more than double the number missed by 4M and found by
Glimmer3.
Next, a glance at Table VI reveals that 4M overwhelmingly
outperforms GeneMark2.5d, which is an older algorithm based
on modeling the genes using a fifth-order Markov model. Since
4M is also based on a fifth-order Markov model but with some
reduction in the size of the state space, the vastly superior performance of 4M is intriguing to say the least. Compared with GeneMarkHMM2.6g, the superiority of 4M is not so pronounced; nevertheless, 4M has the better performance.
To summarize, 4M somewhat outperforms Glimmer3 and GeneMarkHMM2.6g in most cases, and considerably outperforms GeneMark2.5d.

TABLE VII
COMPARISON OF 4M ALGORITHM VERSUS GLIMMER3 ON SHORT GENES
Finally, we compared the performance of the 4M algorithm
with Glimmer3 on short genes. It is widely accepted that long
genes are easy to find using just about any algorithm and that
the real test of an algorithm is its ability to find short genes.
Since 4M significantly outperforms both versions of GeneMark
in any case, we present only the comparison of 4M against
Glimmer3 on three sets of genes: those of length less than 150
base pairs, between 151 and 300 base pairs, and between 301
and 500 base pairs. In presenting the results, we omitted any organism where the number of “ultrashort genes” of length less
than 150 base pairs was less than 20. These results are found in
Table VII. From this table, it is clear that 4M vastly outperforms
Glimmer3 in predicting ultrashort genes and is somewhat superior in finding short genes.
In the case of the organism M. genitalium, which has the exceptional property that the codon TGA codes for the amino acid
tryptophan instead of being a stop codon, the 4M algorithm
performs poorly when this fact is not incorporated. However,
when this fact is incorporated, the performance of 4M improves
dramatically. This is why there are two rows corresponding to
M. genitalium: the first row assumes that TGA is a stop
codon, and the second assumes that TGA is not a stop codon. We
were at first rather startled by the extremely poor performance
of the 4M algorithm in the case of M. genitalium, considering
that the algorithm performed so well on the rest of the organisms. This caused us to investigate the organism further and led
us to discover from the literature that, in fact, M. genitalium has
a nonstandard genetic code. The “moral of the story” is that,
by purely statistical analysis, we could find out that there was
something unusual about this organism.
More interestingly, even in the case of the extremely well-studied organism E. coli, neither the 4M algorithm nor Glimmer3 performs particularly well. This kind of poor performance is
usually indicative of some nonstandard behavior on the part of
the organism, as in the case of M. genitalium. This issue needs
to be studied further.
B. Whole Genome Annotation Using the 4M Algorithm
Here, we carry out “whole genome annotation” of 13 organisms using the 4M algorithm. First, we identify all of the
ORFs in the entire genome. Then, we train the coding model using every other gene in the database of annotated genes, and we train the non-coding model using about 20% of the known non-coding regions. Both models are 4M-reduced. Then, the entire set of ORFs is classified
using the log-likelihood classification scheme. The legend for
the column headings in Table IV is as follows: organism, the
number of ORFs, the number of annotated genes, the number of
annotated genes that are “picked up” by 4M and predicted to be
genes, the number of “other” ORFs whose current status is unknown, and finally, the number of these “other” ORFs that are
predicted by 4M to be genes. To save space, we present results
for only 13 organisms.
Now let us discuss the results in Table IV. It is reasonable
to expect that, since half of the annotated genes are used as the
training set, the other half of the annotated genes are “picked
up” by 4M as being genes. However, what is surprising is how
few of the other ORFs whose status is unknown are predicted
to be genes. Actually, for each organism, the 4M algorithm predicts several hundred ORFs to be additional genes. However,
since there are many overlaps amongst these ORFs, we eliminate the overlaps to predict a single gene for each set of overlapping regions. Thus, there is a clear differentiation between the
statistical properties of the annotated genes and the ORFs whose
status is unknown, and the 4M algorithm is able to differentiate
between them. These “additional predicted genes” are good candidates for experimental verification. In order to prioritize them,
they can be ranked in terms of the normalized log-likelihood
ratio
where
denotes the length of the ORF . The reason for
normalizing the log-likelihood ratio is that, as
becomes
large, the “raw” log-likelihood ratio will also become large.
Thus, comparing the raw log-likelihood ratio of two ORFs is
not meaningful. However, comparing the normalized log-likelihood ratio allows us to identify the predictions about which we
are most confident. This is the advantage of having a stochastic
modeling methodology whose significance is easy to analyze.
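A sketch of this ranking, together with the overlap elimination mentioned above, is given below (illustrative only; the greedy removal of lower-scoring overlapping ORFs is one simple way to keep a single prediction per overlapping set, and log_likelihood refers to the scoring sketch in Section I).

```python
def rank_predictions(orfs, coding_probs, noncoding_probs, order=5):
    """orfs: list of (start, end, seq).  Returns predictions sorted by the normalized
    log-likelihood ratio, keeping one ORF per group of mutually overlapping ORFs."""
    scored = []
    for start, end, seq in orfs:
        llr = (log_likelihood(seq, coding_probs, order)
               - log_likelihood(seq, noncoding_probs, order)) / len(seq)
        scored.append((llr, start, end, seq))
    scored.sort(reverse=True)                      # most confident predictions first
    kept = []
    for llr, start, end, seq in scored:
        if all(end <= s or start >= e for _, s, e, _ in kept):   # no overlap with kept ORFs
            kept.append((llr, start, end, seq))
    return kept
```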
VI. CONCLUSION AND FUTURE WORK
We have studied the problem of finding genes from prokaryotic genomes using stochastic modeling techniques. The gene-finding problem is formulated as one of classifying a given sequence of bases using two distinct Markovian models for the coding regions and the non-coding regions, respectively. For the coding regions, we construct a three-periodic fifth-order Markovian model, whereas for the non-coding regions we construct a fifth-order Markovian model (ignoring the three-periodicity effect).
Then, we introduced a new method known as 4M that allows us
to assign variable length memories to each symbol, thus permitting a substantial reduction in the size of the state space of the
Markovian model.
The disparities between various models have been quantified
using the K-L divergence rate between Markov processes. The
K-L divergence rate has a number of useful applications, some
of which are brought out in this paper. For instance, using this
measure, it has been conclusively demonstrated that the threeperiodicity effect is much more pronounced in coding regions
than in non-coding regions. This is why we could ignore threeperiodicity in non-coding regions. An explicit formula has been
given for the K-L divergence rate between a fifth-order Markov
model and the 4M-reduced model. This formula allows us to
quantify the classification error resulting from this model order
reduction.
Using this new algorithm, we annotated 75 different microbial genomes from several classes of bacteria and archaea. The
performance of the 4M algorithm was then compared with
those of Glimmer3, GeneMark2.5d, and GeneMarkHMM2.6g.
It has been shown that the 4M algorithm somewhat outperforms
Glimmer3 and considerably outperforms both versions of GeneMark. When it comes to finding ultrashort and short genes,
the 4M algorithm significantly outperforms even Glimmer3.
We also carried out whole genome annotations of all ORFs in several organisms using the 4M algorithm. We found that, while the 4M algorithm detects an overwhelming majority of annotated genes as genes, it picks up a surprisingly small fraction of the remaining ORFs as genes. Thus, the 4M algorithm
is able to differentiate very clearly between the “known” genes
and the “unknown” ORFs. Moreover, since the 4M algorithm
uses a simple log-likelihood test, it is possible to rank all
the “predicted” genes in terms of decreasing log-likelihood
ratio. In this way, the most confident predictions can be tried
out first.
Formula (12) can be used to choose the threshold $\epsilon$ in (5) in an adaptive manner. Let $P$ denote the law of the full fifth-order Markov model (for the coding regions or the non-coding regions, as the case may be), and let $P_\epsilon$ denote the law of the corresponding 4M-reduced model, obtained using a threshold $\epsilon$. We should choose $\epsilon$ to be as large as possible while maintaining the constraint
$$R(P \| P_\epsilon) \leq \delta$$
where $\delta$ is a new adjustable parameter. If we choose a very small value of $\delta$, then the log-likelihood ratio between the coding and non-coding models will hardly be affected if $P$ is replaced by $P_\epsilon$. This will be a more intelligent and adaptive way to choose $\epsilon$.
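This adaptive rule can be sketched as follows (an illustration under the reconstructed notation, not the implementation used here; nu is assumed to hold empirical frequencies of all strings up to length k, and the collapse maps come from a reduction routine such as the one sketched in Section II).

```python
import math

ALPHABET = "ACGT"

def kl_rate_full_vs_reduced(nu, collapse):
    """Evaluate (12): collapse maps each full (k-1)-context u to the Markovian
    suffix w it collapses to; nu maps strings to empirical frequencies."""
    rate = 0.0
    for u, w in collapse.items():
        if nu.get(u, 0.0) == 0.0 or nu.get(w, 0.0) == 0.0:
            continue
        for v in ALPHABET:
            p_full = nu.get(u + v, 0.0) / nu[u]
            p_red = nu.get(w + v, 0.0) / nu[w]
            if p_full > 0.0 and p_red == 0.0:
                return math.inf                 # domination fails; the rate is infinite
            if p_full > 0.0:
                rate += nu[u + v] * math.log2(p_full / p_red)
    return rate

def largest_threshold(nu, delta, reductions):
    """reductions: candidate eps -> its collapse map.  Return the largest eps whose
    4M-reduced model stays within a divergence-rate budget of delta."""
    feasible = [eps for eps, c in reductions.items() if kl_rate_full_vs_reduced(nu, c) <= delta]
    return max(feasible) if feasible else None
```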
Mathukumalli Vidyasagar (F’83) was born in
Guntur, India, on September 29, 1947. He received
the B.S., M.S., and Ph.D. degrees from the University of Wisconsin, Madison, in 1965, 1967, and
1969, respectively, all in electrical engineering.
Between 1969 and 1989, he was a Professor of
Electrical Engineering with various universities in
the United States and Canada. His last overseas job
was with the University of Waterloo, Waterloo, ON,
Canada from 1980 to 1989. In 1989, he returned to
India as the Director of the newly-created Centre for
Artificial Intelligence and Robotics (CAIR) and built up CAIR into a leading
research laboratory of about 40 scientists working on aircraft control, robotics,
neural networks, and image processing. In 2000, he joined Tata Consultancy
Services (TCS), Hyderabad, India, India’s largest IT firm, as an Executive
Vice President in charge of Advanced Technology. In this capacity, he created
the Advanced Technology Centre (ATC), which currently consists of about 80
engineers and scientists working on e-security, advanced encryption methods,
bioinformatics, Open Source/Linux, and smart-card technologies. He is the
author or coauthor of nine books and more than 130 papers in archival journals.
Dr. Vidyasagar is a Fellow of the Indian Academy of Sciences, the Indian
National Science Academy, the Indian National Academy of Engineering, and
the Third World Academy of Sciences. He was the recipient of several honors
in recognition of his research activities, including the Distinguished Service Citation from the University of Wisconsin at Madison, the 2000 IEEE Hendrik W.
Bode Lecture Prize, and the 2008 IEEE Control Systems Award.
ACKNOWLEDGMENT
The authors would like to thank B. Nittala and M. Haque for
assisting with the interpretation of some of the computational
results.
REFERENCES
[1] P. Baldi and S. Brunak, Bioinformatics: The Machine Learning Approach. Cambridge, MA: MIT Press, 2001.
[2] C. Burge and S. Karlin, “Prediction of complete gene structures in
human genomic DNA,” J. Molec. Biol., vol. 268, pp. 78–94, 1997.
[3] C. Burge and S. Karlin, “Finding genes in genomic DNA,” Curr. Opin.
Struct. Biol., vol. 8, pp. 346–354, 1998.
[4] A. L. Delcher, D. Harmon, S. Kasif, O. White, and S. L. Salzberg, “Improved microbial gene identification with GLIMMER,” Nucleic Acids
Res., vol. 27, no. 23, pp. 4636–4641, 1999.
[5] W. J. Ewens and G. R. Grant, Statistical Methods in Bioinformatics,
2nd ed. New York: Springer-Verlag, 2006.
[6] R. M. Gray, Entropy and Information Theory. New York: SpringerVerlag, 1990.
[7] D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer
Science and Computational Biology. Cambridge, U.K.: Cambridge
Univ. Press, 1997.
[8] F. Jelinek, Statistical Methods for Speech Recognition. Cambridge,
MA: MIT Press, 1997.
[9] B.-H. Juang and L. R. Rabiner, “A probabilistic distance measure for
hidden Markov models,” AT&T Tech. J., vol. 64, no. 2, pp. 391–408,
Feb. 1985.
[10] A. Krogh, I. S. Mian, and D. Haussler, “A hidden Markov model that
finds genes in E. coli DNA,” Nucleic Acids Res., vol. 22, no. 22, pp.
4768–4778, 1994.
[11] F. Liese and I. Vajda, “On divergences and informations in statistics
and information theory,” IEEE Trans. Inf. Theory, vol. 52, no. 10, pp.
4394–4412, Oct. 2006.
[12] A. V. Lukashin and M. Borodovsky, “GeneMark.hmm: New solutions
for gene finding,” Nucleic Acids Res., vol. 26, no. 4, pp. 1107–1115, 1998.
[13] W. H. Majoros and S. L. Salzberg, “An empirical analysis of training
protocols for probabilistic gene finders,” BMC Bioinformat., vol. 5,
p. 206, 2004.
[14] K. Marton and P. C. Shields, “The positive-divergence and blowing up
properties,” Israel J. Math, vol. 86, pp. 331–348, 1994.
[15] L. R. Rabiner, “A tutorial on hidden Markov models and selected
applications in speech recognition,” Proc. IEEE, vol. 77, no. 2, pp.
257–285, Feb. 1989.
[16] Z. Rached, F. Alajaji, and L. L. Campbell, “The Kullback-Leibler divergence rate between Markov sources,” IEEE Trans. Inf. Theory, vol.
50, no. 5, pp. 917–921, May 2004.
[17] S. L. Salzberg, A. L. Delcher, S. Kasif, and O. White, “Microbial gene
identification using interpolated Markov models,” Nucleic Acids Res.,
vol. 26, no. 2, pp. 544–548, 1998.
[18] S. Tiwari, S. Ramachandran, A. Bhattacharya, S. Bhattacharya, and R.
Ramaswamy, “Prediction of probable genes by Fourier analysis of gene
sequences,” Comput. Appl. Biosci., vol. 13, no. 3, pp. 263–270, Jun.
1997.
[19] J. C. Venter et al., “The sequence of the human genome,” Science, vol.
291, pp. 1304–1351, 2001.
[20] M. Vidyasagar, “A realization theory for hidden Markov models: The
partial realization problem,” in Proc. Symp. Math. Theory Netw. Syst.,
Kyoto, Japan, Jul. 2006, pp. 2145–2150.
[21] M. Vidyasagar, “Bounds on the Kullback-Leibler divergence rate between hidden Markov models,” in Proc. IEEE Conf. Decision Control,
Dec. 12–17, 2007.
Sharmila S. Mande received the Ph.D. degree
in physics from the Indian Institute of Science,
Bangalore, India, in 1991.
Her research interests include genome informatics, protein crystallography, protein modeling,
protein–protein interaction, and comparative genomics. She performed research work with the
University of Groningen, The Netherlands, University of Washington, Seattle, the Institute of
Microbial Technology, Chandigarh, India, and the
Post Graduate Institute of Medical Education and
Research, Chandigarh, before joining Tata Consultancy Services, Hyderabad,
India, in 2001 as head of the Bio-Sciences Division, which is part of TCS’
Innovation Lab.
Ch. V. Siva Kumar Reddy received the M.Tech. degree in computer science and engineering from the
University of Hyderabad, Hyderabad, India.
He is currently with Tata Consultancy Services
(TCS)’s Bio-Sciences Division, which is part of
the TCS’ Innovation Lab, Hyderabad. His research
interests include computational methods in gene
prediction.
V. Raja Rao received the M.S. degree in computer
science and engineering from the Indian Institute of
Technology, Mumbai, India, in 2002.
He then joined the Bioinformatics Division, Tata
Consultancy Services (TCS), Hyderabad, India,
and was involved in the development of various
bioinformatics products. His research interests
include computational methods in gene prediction
and parallel computing for bioinformatics. He is
currently consulting for TCS at Sequenom Inc., San
Diego, CA.