Hidden Markov Models with Applications to DNA Sequence Analysis

Christopher Nemeth, STOR-i
May 4, 2011
Contents

1 Introduction

2 Hidden Markov Models
  2.1 Introduction
  2.2 Determining the observation sequence
      2.2.1 Brute Force Approach
      2.2.2 Forward-Backward Algorithm
  2.3 Determining the state sequence
      2.3.1 Viterbi Algorithm
  2.4 Parameter Estimation
      2.4.1 Baum-Welch Algorithm

3 HMM Applied to DNA Sequence Analysis
  3.1 Introduction to DNA
  3.2 CpG Islands
  3.3 Modelling a DNA sequence with a known number of hidden states
  3.4 DNA sequence analysis with an unknown number of hidden states

4 Evaluation
  4.1 Conclusions
Abstract
The Hidden Markov Model (HMM) is a model with a finite number of states, each associated
with a probability distribution. The transitions between states cannot be directly measured
(hidden), but in a particular state an observation can be generated. It is the observations
and not the states themselves which are visible to an outside observer. However, by applying
a series of statistical techniques it is possible to gain insight into the hidden states via the
observations they generate. In the case of DNA analysis, we observe a strand of DNA which
we believe can be segmented into homogeneous regions to identify the specific functions of the
DNA strand. Through the use of HMMs we can determine which parts of the strand belong
to which segments by matching segments to hidden states. HMMs have been applied to a
wide range of applications including speech recognition, signal processing and econometrics to
name a few. Here we will discuss the theory behind HMMs and the applications that they
have in DNA sequence analysis. We shall be specifically interested in discussing “how can we
determine from which state our observations are being generated?”, “how do we determine the
parameters of our model?” and “how do we determine the sequence of hidden states given our
observations?” While there are many applications of HMMs we shall only be concerned with
their use in terms of DNA sequence analysis and shall cover examples in this literature where
HMMs have been successfully used. In this area we shall compare two approaches to using HMMs
for DNA segmentation: firstly, the approach taken when the number of hidden states is known,
and secondly, the approach taken when the number of hidden states is unknown.
Chapter 1
Introduction
Since the discovery of the structure of DNA by Crick and Watson in 1953, scientists have endeavoured to better
understand the basic building blocks of life. By identifying patterns in the DNA structure it
is possible not only to categorise different species, but on a more detailed level, it is possible
to discover the more subtle characteristics such as gender, eye colour, predisposition to disease,
etc.
It is possible for scientists to gain a better understanding of DNA through segmenting long
DNA strands into smaller homogeneous regions which are different in composition from the rest
of the sequence. Identifying these homogeneous regions may prove useful to scientists who wish
to understand the functional importance of the DNA sequence. There are various statistical
techniques available to assist in the segmentation effort which are covered in Braun and Muller
(1998). However, here we shall only focus on the use of Hidden Markov models (HMM) as an
approach to DNA segmentation.
Hidden Markov models were first discussed by Baum and Petrie (1966) and since then, have
been applied in a variety of fields such as speech recognition, hand-writing identification, signal
processing, bioinformatics, climatology, econometrics, etc. (see Cappe (2001) for a detailed list
of applications). HMMs offer a way to model the latent structure of temporally dependent data
where we assume that the observed process evolves independently given an unobserved Markov
chain. There is a discrete, finite number of states in the Markov chain, which switch between one
another according to a set of transition probabilities. Given that these states are unobserved and random
in occurrence they form a hidden Markov chain. It is possible to model the sequence of state
changes that occur in the hidden Markov chain via observations which are dependent on the
hidden states.
Since the late 1980s and early 1990s HMMs have been applied to DNA sequence analysis, beginning
with the seminal paper by Churchill (1989), which first applied HMMs to DNA segmentation. Since
then hidden Markov models have been widely used in the field of DNA sequence analysis with
many papers evolving and updating the original idea laid out by Churchill. Aside from the
papers written in this area the practical use of these techniques has found its way into gene
finding software such as FGENESH+, GENSCAN and SLAM, which can be used to predict the
location of genes in a genetic sequence.
In this report we shall firstly outline the theory behind HMMs, covering areas such as
parameter estimation, identification of hidden states and determining the sequence of hidden
states. We shall then develop this theory through several motivating examples in the field of DNA
sequence analysis, showing in particular how the theory behind HMMs is applied in practice and
how to model the hidden states both when their number is known and when it is unknown. We shall conclude
the report by discussing extensions which can be made to the HMM.
Chapter 2
Hidden Markov Models
2.1 Introduction
A Markov chain represents a sequence of random variables q1 , q2 , . . . , qT where the future state
qt+1 is dependent only on the current state qt , (2.1).
P (qt+1 |q1 , q2 , . . . , qt ) = P (qt+1 |qt )   (2.1)
There are a finite number of states which the chain can be in at time t, which we define
as S1 , S2 , . . . , SN , where at any time t, qt = Si , 1 ≤ i ≤ N . In the case of a hidden Markov
model it is not possible to directly observe the state qt at any time t. Instead we observe an
extra stochastic process Yt which is dependent on our unobservable state qt . We refer to these
unobservable states as hidden states, where all inference about the hidden states is made
through the observations Yt , as shown in Figure 2.1.
Figure 2.1: Hidden Markov model with observations Yt and hidden states qt .
Example
Imagine we have two coins, one of which is fair and the other biased. If we choose one of
the coins at random and toss it, how can we determine whether we are tossing the fair coin or
the biased coin, based on the outcome of the coin tosses? One way of modelling this problem is
through a hidden Markov model where we treat the coins as being the hidden states qt (i.e. fair,
biased) and the tosses of the coin are the observations Yt (i.e. heads, tails). As we can see from
Figure 2.1, the observation at time t is dependent on the choice of coin which can be either fair
or biased. The coin itself which is used at time t is dependent on the coin used at time t − 1
which can either be fair or biased depending on the transition probability between the two states.
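To make the notation of the following section concrete, the coin-tossing example can be written down as a small set of arrays. This is a minimal sketch: the transition probabilities and the biased coin's emission probabilities below are illustrative assumptions rather than values given in the text.

import numpy as np

# Hidden states: 0 = fair coin, 1 = biased coin; observations: 0 = heads, 1 = tails.
# The transition probabilities and the biased coin's emissions are assumed for
# illustration; only the fair coin's 50/50 emissions are implied by the example.
A = np.array([[0.9, 0.1],     # state transition probabilities a_ij
              [0.2, 0.8]])
B = np.array([[0.5, 0.5],     # fair coin:   P(heads), P(tails)
              [0.8, 0.2]])    # biased coin: assumed to favour heads
pi = np.array([0.5, 0.5])     # initial state distribution

lam = (A, B, pi)              # the complete parameter set, lambda = (A, B, pi)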
Formal Definition
Throughout this report we shall use the notation from Rabiner (1989). Hidden Markov
models are adapted from simpler Markov models with the extension that the states Si of the
HMM cannot be observed directly. In order to determine the state of the HMM we must make
inference from the observation of some random process Yt which is dependent on the state at
time t. The HMM is characterised by a discrete set of N states, S = {S1 , S2 , . . . , SN } where the
state at time t is denoted as qt (i.e. qt = Si ). Generally speaking the states are connected in
such a way that it is possible to move from any state to any other state (e.g. in the case of an
ergodic model). The movement between states is defined through a matrix of state transition
probabilities A = {aij } where,
aij = P (qt+1 = Sj |qt = Si ),   for 1 ≤ i, j ≤ N   (2.2)
For the special case where any state can reach any other state in a single step aij > 0 for all
i, j.
The observations Yt can take M distinct values (i.e. M observation symbols per state). The observation symbols are the observed output of the system and are sometimes referred to as the
discrete alphabet. We define the probability of a given symbol being observed from state j at
time t as following a probability distribution, B = {bj (k)}, where,
bj (k) = P (Yt = k|qt = Sj ),   1 ≤ j ≤ N, 1 ≤ k ≤ M   (2.3)
which is sometimes referred to as the emission probability as this is the probability that the
state j generates the observation k .
Finally, in order to model the beginning of the process we will introduce an initial state
distribution π = {πi } where,
πi = P (q1 = Si ),   1 ≤ i ≤ N   (2.4)
Now that we have defined our observations and our states we can model the system.
In order to do this we require model parameters N (number of states) and M (number of
distinct observations), observation symbols and the three probability measures A, B and π. For
completeness we define the complete set of parameters as λ = (A, B, π).
For the remainder of this section we aim to cover three issues associated with HMMs, which
are as follows:
Issue 1: How can we calculate P (Y1:T |λ), the probability of observing the sequence
Y1:T = {Y1 , Y2 , . . . , YT } for a given model with parameters λ = (A, B, π)?
Issue 2: Given an observation sequence Y1:T with model parameters λ, how do we determine
the sequence of hidden states based on the observations P (Q1:T |Y1:T , λ)
with Q1:T = {q1 , q2 , . . . , qT }?
Issue 3: How can we determine the optimal model parameter values for λ = (A, B, π) so as
to maximise P (Y1:T |λ)?
2.2 Determining the observation sequence
2.2.1 Brute Force Approach
Suppose we wish to calculate the probability of observing a given sequence of observations
Y1:T = {Y1 , Y2 , . . . , YT } from a given model. This can be useful as it allows us to test the
validity of the model. If there are several candidate models available to choose from then our
aim will be to choose the model which best explains the observations. In other words, the model
which maximises P (Y1:T |λ). Solving this problem can be done by enumerating all of the possible
state sequences Q1:T = {q1 , q2 , . . . , qT } which generate the observations. Then the probability
of observing the sequence Y1:T is,
\[ P(Y_{1:T} \mid Q_{1:T}, \lambda) = \prod_{t=1}^{T} P(Y_t \mid q_t, \lambda) \tag{2.5} \]
where we assume independence of observations, which gives,
P (Y1:T |Q1:T , λ) = bq1 (Y1 ) bq2 (Y2 ) · · · bqT (YT )   (2.6)
The joint probability of Y1:T and Q1:T given the model parameters λ, P (Y1:T , Q1:T |λ),
can be found by multiplying (2.6) by P (Q1:T |λ):
P (Y1:T , Q1:T |λ) = P (Y1:T |Q1:T , λ) · P (Q1:T |λ)   (2.7)
where the probability of such a state sequence Q1:T occurring is
P (Q1:T |λ) = πq1 aq1 q2 aq2 q3 · · · aqT −1 qT   (2.8)
In order to calculate the probability of observing Y1:T we simply sum the joint probability
given in (2.7) over all possible state sequences Q1:T :
\[
\begin{aligned}
P(Y_{1:T} \mid \lambda) &= \sum_{\text{all } Q} P(Y_{1:T} \mid Q_{1:T}, \lambda) \cdot P(Q_{1:T} \mid \lambda) && (2.9) \\
&= \sum_{q_1, q_2, \ldots, q_T} \pi_{q_1} b_{q_1}(Y_1)\, a_{q_1 q_2} b_{q_2}(Y_2) \cdots a_{q_{T-1} q_T} b_{q_T}(Y_T) && (2.10)
\end{aligned}
\]
Calculating P (Y1:T |λ) through direct enumeration of all the state sequences may seem like the simplest
approach. However, while this approach may appear to be straightforward, the required computation is not. Altogether, if we were to calculate P (Y1:T |λ) in this fashion, we would require
of the order of 2T N^T calculations, which is computationally expensive even for small problems. Therefore,
given the computational complexity of this approach, an alternative approach is required.
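As a rough sketch of what this enumeration looks like in practice (and why it is so expensive), the following directly implements (2.9)-(2.10); it assumes the numpy-array parameterisation used in the coin-example sketch above, with observations encoded as integer symbol indices.

import itertools
import numpy as np

def brute_force_likelihood(Y, A, B, pi):
    """P(Y_{1:T} | lambda) by summing over all N^T possible state sequences."""
    N, T = A.shape[0], len(Y)
    total = 0.0
    for Q in itertools.product(range(N), repeat=T):            # every state sequence Q_{1:T}
        p_states = pi[Q[0]] * np.prod([A[Q[t - 1], Q[t]] for t in range(1, T)])  # eq. (2.8)
        p_obs = np.prod([B[Q[t], Y[t]] for t in range(T)])      # eq. (2.6)
        total += p_obs * p_states                               # summand of eq. (2.9)
    return total

# e.g. brute_force_likelihood([0, 1, 1, 0], A, B, pi) with the coin-example arrays above;
# the loop runs N^T times, which quickly becomes infeasible as T grows.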
2.2.2 Forward-Backward Algorithm
A computationally faster way of determining P (Y1:T |λ) is to use the forward-backward algorithm. The forward-backward algorithm (Baum and Eagon (1967) and Baum (1972)) consists
of two parts. Firstly, we compute forwards through the sequence of observations the joint
probability of Y1:t with the state qt = Si at time t (i.e. P (qt = Si , Y1:t )). Secondly, we compute backwards the probability of the observations Yt+1:T given the state at time t
(i.e. P (Yt+1:T |qt = Si )). We can then combine the forward and backward passes to obtain the
probability of a state Si at any given time t from the entire set of observations.
P (qt = Si , Y1:T |λ) = P (Y1 , Y2 , . . . , Yt , qt = Si |λ) · P (Yt+1 , Yt+2 , . . . , YT |qt = Si , λ)   (2.11)
The forward-backward algorithm also allows us to define the probability of being in a state
Si at time t (qt = Si ) by taking (2.11) and P (Y1:T |λ) from either the forward or backward
pass of the algorithm. In the next section we will see how (2.12) can be used for determining
the entire state sequence q1 , q2 , . . . , qT .
\[ P(q_t = S_i \mid Y_{1:T}, \lambda) = \frac{P(q_t = S_i, Y_{1:T} \mid \lambda)}{P(Y_{1:T} \mid \lambda)} \tag{2.12} \]
Forward Algorithm
We define a forward variable αt (i) to be the joint probability of the partial observation sequence Y1:t and
the state at time t, qt = Si , given model parameters λ.
αt (i) = P (Y1 , Y2 , . . . , Yt , qt = Si |λ)   (2.13)
We can now use our forward variable to enumerate through all the possible states up to time
T with the following algorithm.
Algorithm
1. Initialisation:
\[ \alpha_1(i) = \pi_i\, b_i(Y_1), \qquad 1 \le i \le N \tag{2.14} \]
2. Recursion:
\[ \alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \right] b_j(Y_{t+1}), \qquad 1 \le j \le N, \; 1 \le t \le T-1 \tag{2.15} \]
3. Termination:
\[ P(Y_{1:T} \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i) \tag{2.16} \]
The algorithm starts by initialising α1 (i) at time t = 1 with the joint probability of the
first observation Y1 with the initial state πi . To determine αt+1 (j) at time t + 1 we enumerate
through all the possible state transitions from time t to t + 1. As αt (i) is the joint probability
that Y1 , Y2 , . . . , Yt are observed with state Si at time t, then αt (i)aij is the probability of the
joint event that Y1 , Y2 , . . . , Yt are observed and that the state Sj at time t + 1 is arrived at via
state Si at time t. Summing over all possible states Si at time t then gives the probability of
being in state Sj at time t + 1 having observed Y1 , Y2 , . . . , Yt . It is then a case of determining
αt+1 (j) by accounting for the observation Yt+1 in state j, i.e. bj (Yt+1 ). We compute αt+1 (j)
for all states j, 1 ≤ j ≤ N , and then iterate through t = 1, 2, . . . , T − 1 until time T . Our
desired quantity P (Y1:T |λ) is then calculated by summing over all N possible states since, as
αT (i) = P (Y1 , Y2 , . . . , YT , qT = Si |λ)   (2.17)
we can find P (Y1:T |λ) by summing αT (i) over i.
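A minimal sketch of the forward pass follows, assuming the observations are encoded as integer symbol indices and the parameters are stored as numpy arrays (as in the earlier coin-example sketch).

import numpy as np

def forward(Y, A, B, pi):
    """Forward variables: alpha[t, i] corresponds to alpha_{t+1}(i) in (2.13) (0-based t)."""
    N, T = A.shape[0], len(Y)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, Y[0]]                           # initialisation, eq. (2.14)
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, Y[t + 1]]   # recursion, eq. (2.15)
    return alpha

# Termination, eq. (2.16): P(Y_{1:T} | lambda) = sum_i alpha_T(i)
# likelihood = forward(Y, A, B, pi)[-1].sum()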
Backward Algorithm
In a similar fashion to the forward algorithm we can now calculate the backward
variable βt (i), which is defined as
βt (i) = P (Yt+1 , Yt+2 , . . . , YT |qt = Si , λ)   (2.18)
which is the probability of the partial observation sequence from time t + 1 to T , given that the state at time t
is Si with model parameters λ. As with the forward case there is a backward algorithm which
computes βt (i) inductively.
Algorithm
1. Initialisation:
\[ \beta_T(i) = 1, \qquad 1 \le i \le N \tag{2.19} \]
2. Recursion:
\[ \beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(Y_{t+1})\, \beta_{t+1}(j), \qquad t = T-1, T-2, \ldots, 1, \; 1 \le i \le N \tag{2.20} \]
3. Termination:
\[ P(Y_{1:T} \mid \lambda) = \sum_{j=1}^{N} \pi_j\, b_j(Y_1)\, \beta_1(j) \tag{2.21} \]
We set βT (i) = 1 for all i as we require the sequence of observations to end at time T but do
not specify the final state, as it is unknown. We then induct backwards from t + 1 to t through
all possible transition states. To do this we must account for all possible transitions between
qt+1 = Sj and qt = Si at time t, as well as the observation Yt+1 and all of the observations from
time t + 2 to T (βt+1 (j)).
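A matching sketch of the backward pass, under the same assumptions as the forward sketch above:

import numpy as np

def backward(Y, A, B):
    """Backward variables: beta[t, i] corresponds to beta_{t+1}(i) in (2.18) (0-based t)."""
    N, T = A.shape[0], len(Y)
    beta = np.ones((T, N))                               # initialisation, eq. (2.19)
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, Y[t + 1]] * beta[t + 1])     # recursion, eq. (2.20)
    return beta

# Termination, eq. (2.21): P(Y_{1:T} | lambda) = sum_j pi_j b_j(Y_1) beta_1(j)
# likelihood = (pi * B[:, Y[0]] * backward(Y, A, B)[0]).sum()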
Both the forward and backward algorithms require approximately N^2 T calculations, which
means that in comparison with the brute force approach (which requires of the order of 2T N^T calculations)
the forward-backward algorithm is much faster.
2.3 Determining the state sequence
Suppose we wish to know “What is the optimal sequence of hidden states?” For example, in
the coin toss problem we may wish to know which coin (biased or fair) was used at time t
and whether the same coin was used at time t + 1. There are several ways of answering this
question; one possible approach is to choose the state at each time t which is most likely given
the observations. To solve this problem we use (2.11) from the forward-backward algorithm,
where P (qt = Si , Y1:T |λ) = αt (i)βt (i).
\[ P(q_t = S_i \mid Y_{1:T}, \lambda) = \frac{\alpha_t(i)\,\beta_t(i)}{P(Y_{1:T} \mid \lambda)} = \frac{\alpha_t(i)\,\beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i)\,\beta_t(i)} \tag{2.22} \]
where αt (i) accounts for the observations Y1:t together with state Si at time t, and βt (i) accounts for the
remaining observations Yt+1:T given state Si at time t. To ensure that P (qt = Si |Y1:T , λ) is a
proper probability measure we normalise by P (Y1:T |λ).
Once we know P (qt = Si |Y1:T , λ) for all states 1 ≤ i ≤ N we can calculate the most likely
state at time t by finding the state i which maximises P (qt = Si |Y1:T , λ).
\[ q_t = \arg\max_{1 \le i \le N} P(q_t = S_i \mid Y_{1:T}, \lambda), \qquad 1 \le t \le T \tag{2.23} \]
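Reusing the forward and backward sketches above, this posterior decoding rule (2.22)-(2.23) can be written compactly; a minimal sketch:

import numpy as np

def posterior_states(Y, A, B, pi):
    """Most likely state at each time t under (2.22)-(2.23), via posterior decoding."""
    alpha = forward(Y, A, B, pi)                 # forward sketch from Section 2.2.2
    beta = backward(Y, A, B)                     # backward sketch from Section 2.2.2
    gamma = alpha * beta                         # numerator of eq. (2.22)
    gamma /= gamma.sum(axis=1, keepdims=True)    # normalise by P(Y_{1:T} | lambda)
    return gamma.argmax(axis=1)                  # arg max_i P(q_t = S_i | Y_{1:T}, lambda)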
While (2.23) allows us to find the most likely state at each time t, it is not always a sensible
approach. The main disadvantage of this approach is that it does not take into
account the state transitions. It may be the case that the resulting state sequence includes states
qt−1 = Si and qt = Sj when in fact the transition between the two states is not possible (i.e.
aij = 0). This is because (2.23) gives the most likely state at each time t without regard to the
state transitions.
A logical solution to the problem regarding (2.23) is to change the optimality criterion and,
instead of seeking the most likely state at each time t, to find the most likely pairs of states
(qt , qt+1 ). However, a more widely used approach is to find the single optimal state sequence
Q1:T , that is, the state sequence which maximises P (Q1:T , Y1:T |λ). We find this using a dynamic
programming algorithm known as the Viterbi algorithm, which chooses the state sequence
that maximises the likelihood of the state sequence for a given set of observations.
2.3.1 Viterbi Algorithm
Let δt (i) be the maximum probability of the state sequence with length t that ends in state i
(i.e. qt = Si ) which produces the first t observations.
\[ \delta_t(i) = \max_{q_1, q_2, \ldots, q_{t-1}} P(q_1, q_2, \ldots, q_t = i, Y_1, Y_2, \ldots, Y_t \mid \lambda) \tag{2.24} \]
The Viterbi algorithm (Viterbi (1967) and Forney (1973)) is similar to the forward-backward
algorithm except that here we use maximisation instead of summation at the recursion and
termination stages. We store the maximising arguments in an N by T matrix ψ. Later this matrix
is used to retrieve the optimal state sequence path at the backtracking step.
1. Initialisation:
\[ \delta_1(i) = \pi_i\, b_i(Y_1), \qquad 1 \le i \le N \tag{2.25} \]
\[ \psi_1(i) = 0 \tag{2.26} \]
2. Recursion:
\[ \delta_t(j) = \max_{1 \le i \le N} [\delta_{t-1}(i)\, a_{ij}]\, b_j(Y_t), \qquad 2 \le t \le T, \; 1 \le j \le N \tag{2.27} \]
\[ \psi_t(j) = \arg\max_{1 \le i \le N} [\delta_{t-1}(i)\, a_{ij}], \qquad 2 \le t \le T, \; 1 \le j \le N \tag{2.28} \]
3. Termination:
\[ p^* = \max_{1 \le i \le N} [\delta_T(i)] \tag{2.29} \]
\[ q_T^* = \arg\max_{1 \le i \le N} [\delta_T(i)] \tag{2.30} \]
4. Path (state sequence) backtracking:
\[ q_t^* = \psi_{t+1}(q_{t+1}^*), \qquad t = T-1, T-2, \ldots, 1 \tag{2.31} \]
The advantage of the Viterbi algorithm is that it does not blindly accept the most likely
state at each time t, but in fact takes a decision based on the whole sequence. This is useful
if there is an unlikely event at some point in the sequence. This will not affect the rest of the
sequence if the remainder is reasonable. This is particularly useful in speech recognition where
a phoneme may be garbled or lost, but the overall spoken word is still detectable.
One of the problems with the Viterbi algorithm is that multiplying probabilities will yield
small numbers that can cause underflow errors in the computer. Therefore it is recommended
that the logarithm of the probabilities is taken so as to change the multiplication into a summation. Once the algorithm has terminated, an accurate value can be obtained by taking the
exponential of the result.
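A minimal sketch of the Viterbi recursion in log space follows, under the same array conventions as the earlier sketches and assuming all transition and emission probabilities are strictly positive (so the logarithms are finite).

import numpy as np

def viterbi(Y, A, B, pi):
    """Most likely state path, using log-probabilities to avoid numerical underflow."""
    N, T = A.shape[0], len(Y)
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)   # assumes strictly positive entries
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = logpi + logB[:, Y[0]]                       # initialisation, eq. (2.25)
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA              # delta_{t-1}(i) + log a_ij
        psi[t] = scores.argmax(axis=0)                     # backpointers, eq. (2.28)
        delta[t] = scores.max(axis=0) + logB[:, Y[t]]      # recursion, eq. (2.27)
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()                          # termination, eq. (2.30)
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]                  # backtracking, eq. (2.31)
    return path

For the coin example, viterbi([0, 1, 1, 0], A, B, pi) would return the most likely fair/biased labelling of the four tosses under the assumed parameters.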
2.4 Parameter Estimation
The third issue which we shall consider is, “How do we determine the model parameters λ =
(A, B, π)?” We wish to select the model parameters such that they maximise the probability
of the observation sequence. There is no analytical way to solve this
problem, but we can solve it iteratively using the Baum-Welch algorithm (Baum et al. (1970)
and Baum (1972)) which is an Expectation-Maximisation algorithm that finds λ = (A, B, π)
such that P (Y1:T |λ) is locally maximised.
2.4.1 Baum-Welch Algorithm
The Baum-Welch algorithm calculates the expected number of times each transition (aij ) and
emission (bj (Yt )) is used, from a training sequence. To do this it uses the same forward and
backward values as used to determine the state sequence. Firstly, we define the probability of
being in state Si at time t and state Sj at time t+1 given the model parameters and observation
sequence as,
\[ P(q_t = S_i, q_{t+1} = S_j \mid Y_{1:T}, \lambda) = \frac{\alpha_t(i)\, a_{ij}\, b_j(Y_{t+1})\, \beta_{t+1}(j)}{P(Y_{1:T} \mid \lambda)} \tag{2.32} \]
\[ = \frac{\alpha_t(i)\, a_{ij}\, b_j(Y_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(Y_{t+1})\, \beta_{t+1}(j)} \tag{2.33} \]
Equation (2.32) is illustrated in Figure 2.2.

Figure 2.2: Graphical representation of the computation required for the joint event that the system is in state Si at time t and state Sj at time t + 1 as given in (2.32).
From the forward-backward algorithm we have already defined P (qt = Si |Y1:T , λ) as the
probability of being in state Si at time t given the model parameters and sequence of observations. We notice that this equation relates to (2.32) as follows,
\[ P(q_t = S_i \mid Y_{1:T}, \lambda) = \sum_{j=1}^{N} P(q_t = S_i, q_{t+1} = S_j \mid Y_{1:T}, \lambda) \tag{2.34} \]
If we then sum P (qt = Si |Y1:T , λ) over t we get the expected number of times that state Si
is visited. Similarly, summing P (qt = Si , qt+1 = Sj |Y1:T , λ) over t gives the expected number of
transitions from state Si to state Sj .
Combining the above we are now able to determine a re-estimation λ̄ = (Ā, B̄, π̄) of the model parameters
λ = (A, B, π).
1. Initial probabilities:
\[ \bar{\pi}_i = P(q_1 = S_i \mid Y_{1:T}, \lambda) = \text{expected number of times in state } S_i \text{ at time } t = 1 \tag{2.35} \]
2. Transition probabilities:
\[ \bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} P(q_t = S_i, q_{t+1} = S_j \mid Y_{1:T}, \lambda)}{\sum_{t=1}^{T-1} P(q_t = S_i \mid Y_{1:T}, \lambda)} \tag{2.36} \]
\[ = \frac{\text{expected number of transitions from state } S_i \text{ to state } S_j}{\text{expected number of transitions from state } S_i} \tag{2.37} \]
3. Emission probabilities:
\[ \bar{b}_j(k) = \frac{\sum_{t=1}^{*} P(q_t = S_j \mid Y_{1:T}, \lambda)}{\sum_{t=1}^{T} P(q_t = S_j \mid Y_{1:T}, \lambda)} \tag{2.38} \]
\[ = \frac{\text{expected number of times in state } j \text{ and observing } k}{\text{expected number of times in state } j} \tag{2.39} \]
where $\sum_{t=1}^{*}$ denotes the sum over those t for which Yt = k.
If we start by defining the model parameters λ as (A, B, π) we can then use these to calculate
(2.35)-(2.39) to create a re-estimated model λ̄ = (Ā, B̄, π̄). If λ̄ = λ then the initial model λ
already defines a critical point of the likelihood function. If, however, λ̄ ≠ λ and λ̄ is more
likely than λ in the sense that P (Y1:T |λ̄) > P (Y1:T |λ), then the new model parameters are such
that the observation sequence is more likely to have been produced by λ̄.
Once we have found an improved model λ̄ through re-estimation we can repeat this procedure
iteratively, thus improving the probability of Y1:T being observed, until some limiting point is
reached, finally resulting in a maximum likelihood estimate of the HMM.
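A rough sketch of one such re-estimation sweep is given below, reusing the forward and backward sketches from Section 2.2.2; in practice the sweep would be repeated until P(Y1:T|λ) stops improving. The array conventions (states by symbols for B, observations as integer indices) are the same assumptions as before.

import numpy as np

def baum_welch_step(Y, A, B, pi):
    """One re-estimation sweep of (2.35)-(2.39); iterate until the likelihood stabilises."""
    Y = np.asarray(Y)
    alpha, beta = forward(Y, A, B, pi), backward(Y, A, B)     # sketches from Section 2.2.2
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)                 # P(q_t = S_i | Y_{1:T}, lambda)
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, Y[1:]].T * beta[1:])[:, None, :])           # numerator of eq. (2.32)
    xi /= xi.sum(axis=(1, 2), keepdims=True)                  # eq. (2.33)
    new_pi = gamma[0]                                         # eq. (2.35)
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]  # eqs. (2.36)-(2.37)
    new_B = np.vstack([gamma[Y == k].sum(axis=0)
                       for k in range(B.shape[1])]).T         # eqs. (2.38)-(2.39)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_A, new_B, new_pi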
Chapter 3
HMM Applied to DNA Sequence Analysis
3.1 Introduction to DNA
Deoxyribonucleic acid (DNA) is the genetic material of a cell which is a code containing instructions for the make-up of human beings and other organisms. The DNA code is made up of
four chemical bases: Adenine (A), Cytosine (C), Guanine (G) and Thymine (T). The sequence
of these bases determines information necessary to build and maintain an organism, similar to
the way in which the arrangement of letters determines a word. DNA bases are paired together
as (A-T) and (C-G) to form base pairs which are attached to a sugar-phosphate backbone (deoxyribose). The combination of a base, sugar and phosphate is called a nucleotide, and the nucleotides are
arranged in two long strands that form a twisted spiral famously known as the double helix
(Figure 3.1).
Figure 3.1: Strand of DNA in the form of a double helix where the base pairs are separated by
a phosphate backbone (p)
Since the 1960s it has been known that the pattern in which the four bases occur in a DNA
sequence is not random. Early research into the composition of DNA relied on indirect methods
such as base composition determination or the analysis of nearest neighbour frequencies. It was
only when Elton (1974) noticed that models which assumed a homogeneous DNA structure
were inappropriate for modelling the compositional heterogeneity of DNA that it was
proposed that DNA should be viewed as a sequence of segments, where each segment follows its
own distribution of bases. The seminal paper by Churchill (1989) was the first to apply HMMs
to DNA sequence analysis, where a heterogeneous strand of DNA was assumed to be composed of
homogeneous segments. Using the hidden states of the hidden Markov model it was possible to
detect the underlying process of the individual segments and categorise the entire sequence in
terms of shorter segments.
3.2 CpG Islands
To illustrate the use of hidden Markov models in DNA sequence analysis we will consider an
example given by Durbin et al. (1998).
In the human genome, the dinucleotide CG (a sequence of two adjacent bases) occurs wherever
a cytosine nucleotide is found next to a guanine nucleotide in the linear sequence of bases
along a strand (Figure 3.1). We use the notation CpG (-C-phosphate-G-) to separate the
dinucleotide CG from the base pair C-G. Typically wherever the CG dinucleotide occurs the C
nucleotide is modified by the process of methylation where the cytosine nucleotide is converted
into methyl-C before mutating into T, thus creating the dinucleotide TG. The consequence
of this is that the CpG dinucleotides are rarer in the genome than would be expected. For
biological reasons the methylation process is suppressed in short stretches of the genome, such
as around the start regions of genes. In these regions we see more CpG dinucleotides than
elsewhere in the gene sequence. These regions are referred to as CpG Islands (Bird, 1987) and
are usually anywhere from a few hundred to a few thousand bases long.
Using a hidden Markov model we can consider, given a short sequence of DNA, whether it comes from
a CpG island, and also how to find CpG islands within a longer sequence.
In terms of our hidden Markov model we can define the genomic sequence as being a sequence
of bases which are either within the CpG island or are not. This then gives us our two hidden
states {CpG island, Non-CpG island} which we wish to uncover by observing the sequence of
bases. As all four bases can occur in both the CpG island and non-CpG island regions, we first
must define a sensible notation to differentiate between C in a CpG island region and C in a
non-CpG island region. For A, C, G, T in a CpG island we have {A+ , C+ , G+ , T+ } and for
those bases that are not in a CpG island we have {A− , C− , G− , T− }.
Figure 3.2: Possible transitions between bases in CpG island and non-CpG island regions
Figure 3.2 illustrates the possible transitions between bases, where it is possible to transition
between all bases in both CpG island states and non-CpG island states. The transitions which
occur do so according to two sets of probabilities: those governing the hidden state, and those
governing the chemical base observed from that state.
Once we have established the observations Yt = {A, C, G, T } and the states Si =(CpG island,
Non-CpG island) we are then able to construct a directed acyclic graph (DAG) which we shall
use to illustrate the dependent structure of the model. The DAG given in Figure 3.3 shows
that observations Yt are dependent on the hidden states qt = Si and that both the states
and observations are dependent on probability matrices A and B, respectively. The matrix
A = {aij } represents the transition between the two hidden states P (qt = Sj |qt−1 = Si ) = aij
and B denotes the observation probabilities for the two hidden states, B = (p+ , p− ).
As we have seen from the previous section we can estimate the parameters of A and B using
the Baum-Welch algorithm. In the CpG island example the observation probabilities p+ and
p− are given in Table 3.1 and Table 3.2.
Figure 3.3: DAG of the hidden Markov model with A representing the state transition probabilities and B representing the observation probabilities for a given state

Table 3.1: Transition probabilities for the CpG island region (rows give the current base, columns the next base)

+     A     C     G     T
A    0.18  0.27  0.43  0.12
C    0.17  0.37  0.27  0.19
G    0.16  0.34  0.37  0.13
T    0.08  0.36  0.38  0.18

Table 3.2: Transition probabilities for the non-CpG island region (rows give the current base, columns the next base)

−     A     C     G     T
A    0.30  0.21  0.28  0.21
C    0.32  0.30  0.08  0.30
G    0.25  0.25  0.30  0.20
T    0.18  0.24  0.29  0.29

We notice from the observation probabilities that the transitions from G to C and C to G
in the CpG island region are higher than in the non-CpG region. The difference in observation
probabilities between the two regions justifies the use of the hidden Markov model. If the
observation probabilities were constant throughout the strand of DNA then the sequence would
be homogeneous and we would be able to model the DNA sequence with one set of probabilities
for the transitions between bases in the sequence. However, we know that the probability of
transition between certain bases is greater in specific regions and so the probability of moving
from a G to C is not constant throughout the sequence. Thus, we require an extra stochastic
layer which we model through a hidden Markov model.
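As a small illustration (not taken from the report), the two tables can be used to score a short stretch of sequence by summing the log-ratios of the CpG-island and non-CpG-island base-transition probabilities; positive scores suggest CpG-island-like behaviour. This is a minimal sketch using the values of Tables 3.1 and 3.2.

import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

# Rows: current base, columns: next base (Tables 3.1 and 3.2).
P_PLUS = np.array([[0.18, 0.27, 0.43, 0.12],
                   [0.17, 0.37, 0.27, 0.19],
                   [0.16, 0.34, 0.37, 0.13],
                   [0.08, 0.36, 0.38, 0.18]])
P_MINUS = np.array([[0.30, 0.21, 0.28, 0.21],
                    [0.32, 0.30, 0.08, 0.30],
                    [0.25, 0.25, 0.30, 0.20],
                    [0.18, 0.24, 0.29, 0.29]])

def log_odds(seq):
    """Sum of log(p+ / p-) over consecutive base pairs in the sequence."""
    idx = [BASES[b] for b in seq]
    return sum(np.log(P_PLUS[i, j] / P_MINUS[i, j]) for i, j in zip(idx, idx[1:]))

# e.g. log_odds("CGCGCGCG") is positive, while log_odds("ATATATAT") is negative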
3.3 Modelling a DNA sequence with a known number of hidden states
Once we have established the theory of hidden Markov models and how they apply to DNA
analysis we can then develop models with which to analyse the DNA sequence. Here we will
use the paper Boys et al. (2000) to illustrate through an example how we can segment a DNA
sequence where the number of hidden states is known and each hidden state corresponds to a
segment of the DNA sequence. In this paper, the authors analyse the chimpanzee α-fetoprotein
gene; this protein is secreted by embryonic liver epithelial cells and is also produced in the
yolk sac of mammals. This protein plays an important role in the embryonic development of
mammals; in particular, unusual levels of the protein found in pregnant mothers are associated
with genetic disorders such as neural tube defects, spina bifida and Down’s syndrome.
One approach which can be used to identify the parts of the DNA sequence where a transition
between states occurs is by using a multiple-changepoint model, where inferences about the
base transition matrices operating within the hidden state are made conditional on estimates
of their location. Carlin et al. (1992) give a solution to the multiple-changepoint problem for
Markov chains within a Bayesian framework. However, the drawback to the approach is that
it is difficult to specify informative prior knowledge about the location of the changepoints,
so Bayesian analysis of changepoint problems tends to assume relatively uninformative priors
for the changepoints. The authors also felt that while changepoint models are appropriate in
time series analysis, they are perhaps not appropriate for DNA sequence analysis as they fail to
capture the evolution of the DNA structure. Therefore, a more flexible approach to modelling
both the DNA sequence and the underlying prior information is to use a hidden Markov model.
The main advantage of this is that rather than specifying prior information about the precise
locations of the changepoints, we can instead specify prior beliefs about the lengths of the segments.
Previous work initially done by Churchill (1989) used a maximum likelihood approach and
an EM algorithm to determine base transitions for a given hidden state. Here the authors
adopt a Bayesian approach which incorporates prior information in identifying the hidden states.
Inferences are made by simulating from the posterior distribution using the Markov chain Monte
Carlo (MCMC) technique of Gibbs sampling. The advantage of this technique is that it allows
for prior information to be incorporated and permits the detection of segment types allowing
for posterior parameter uncertainty.
Model
As before, we take our observations Yt ∈ {A, C, G, T } to be the four chemical bases and
the states to represent the different segment types, qt ∈ {S1 , S2 , . . . , Sr }, t = 1, 2, . . . , T . In this
case we assume that the number of different segment types is known, of which there are r. We
make the assumption that the transition between the four bases follows a first order Markov
chain, where P (Yt |Y1 , Y2 , . . . , Yt−1 ) = P (Yt |Yt−1 ), however, as we shall consider later, this is not
necessarily a valid assumption.
By establishing the same dependent structure as given by the DAG in Figure 3.3 we define the
base transition matrices for each segment type as B = {P 1 , P 2 , . . . , P r }, where the observations
follow a multinomial distribution P k = Pijk . Therefore, the base transitions follow,
P (Yt = j|qt = Sk , Y1 , Y2 , . . . , Yt−1 = i, B) = P (Yt = j|qt = Sk , Yt−1 = i, B) = P^{Sk}_{ij}   (3.1)
where i, j ∈ {A, C, G, T }, k ∈ {1, 2, . . . , r}.
The hidden states are modelled using a first order Markov process with transition matrix
A = akl as shown in Figure 3.3. The hidden states at each location are unknown, therefore,
we must treat these hidden states as unknown parameters in our model. If we assume Y1 and
q1 follow independent discrete uniform distributions then we can define the likelihood function
for the model parameters A and B as follows given the observed DNA sequence Y1:T and the
unobserved segment types Q1:T .
\[
\begin{aligned}
L(A, B \mid Y_{1:T}, Q_{1:T}) &= P(Y_1, q_1 \mid A, B) \times \prod_{t=2}^{T} P(Y_t = j \mid q_t = S_k, Y_{t-1} = i, B) \cdot P(q_t = S_l \mid q_{t-1} = S_k, A) && (3.2) \\
&= (4r)^{-1} \prod_{t=2}^{T} P^{S_k}_{ij}\, a_{kl}, \qquad i, j \in \{A, C, G, T\}, \; k, l \in \{1, 2, \ldots, r\} && (3.3)
\end{aligned}
\]
where we define
P (qt = Sl |q1 , q2 , . . . , qt−1 = Sk , A) = P (qt = Sl |qt−1 = Sk , A) = akl   (3.4)
Prior distributions
Prior for base transitions
Given the multinomial form of the likelihood we can take the prior distribution to be the
Dirichlet distribution as this is the conjugate prior. Therefore, if we take the row of a base
transition matrix to be pi = (pij ), then the prior for pi will be a Dirichlet distribution.
\[ P(p_i) \propto \prod_{j=1}^{4} p_{ij}^{\alpha_{ij}}, \qquad 0 \le p_{ij} \le 1, \quad j = 1, 2, 3, 4, \quad \sum_{j=1}^{4} p_{ij} = 1 \tag{3.5} \]
where α = (αij ) are the parameters of the distribution.
Prior for the segment types
As the transition matrix A for the hidden states follows in a similar fashion to the base transition
matrices, we shall again use a Dirichlet distribution, of dimension r, for the prior
of its rows (3.7), ak = (akj ). In general the prior belief for the hidden states is well defined,
particularly with regard to segment length. In practice it is difficult to identify short segments
and so it is assumed that transitions between hidden states are rare, i.e. E(aii ) is close to 1.
\[ p^{k}_{i} = (p^{k}_{ij}) \sim D(c^{k}_{i}), \qquad i = 1, 2, 3, 4, \quad k = 1, 2, \ldots, r \tag{3.6} \]
\[ a_{k} = (a_{kj}) \sim D(d_{k}), \qquad k = 1, 2, \ldots, r \tag{3.7} \]
Posterior analysis
The posterior distribution for the parameters A and B and the hidden states at time t (i.e.
qt = Si ) are found using Gibbs sampling with data augmentation. This involves simulating the
hidden states conditional on the parameters and then simulating the parameters conditional on
the hidden states. This process is then repeated until the parameters converge.
Determining the posterior distribution for the parameters P (A, B|Y1:T , Q1:T ) follows from
the model likelihood given by (3.2), which when incorporated with the conjugate Dirichlet
distribution given in the previous section produces independent posterior Dirichlet distributions
for the rows of the transition matrices given by (3.8-3.9).
\[ p^{k}_{i} \mid Y_{1:T}, Q_{1:T} \sim D(c^{k}_{i} + n^{k}_{i}), \qquad i = 1, 2, 3, 4, \quad k = 1, 2, \ldots, r \tag{3.8} \]
\[ a_{k} \mid Y_{1:T}, Q_{1:T} \sim D(d_{k} + m_{k}), \qquad k = 1, 2, \ldots, r \tag{3.9} \]
where
\[ n^{k}_{i} = (n^{k}_{ij}), \qquad n^{k}_{ij} = \sum_{t=2}^{T} I(Y_{t-1} = i,\; Y_t = j,\; q_t = S_k) \tag{3.10} \]
\[ m_{k} = (m_{kj}), \qquad m_{kj} = \sum_{t=2}^{T} I(q_{t-1} = S_k,\; q_t = S_j) \tag{3.11} \]
where I(A) = 1 if A is true and 0 otherwise.
The second part of the Gibbs sampler involves determining the hidden states, which are
simulated from P (Q1:T |Y1:T , A, B). This can be simulated sequentially using the univariate updates
P (qt |Q−t , Y1:T , A, B), t = 1, 2, . . . , T , where Q−t denotes the hidden states at all times other than t.
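A minimal sketch of the "parameters given states" half of this Gibbs sweep is given below, drawing the rows of the transition matrices from the Dirichlet posteriors (3.8)-(3.9) using the counts (3.10)-(3.11). The array shapes assumed for the prior parameters (c of shape (r, 4, 4) and d of shape (r, r)) are illustrative, and the complementary step of simulating Q1:T given the parameters is omitted.

import numpy as np

rng = np.random.default_rng(1)

def update_transition_matrices(Y, Q, r, c, d):
    """Draw new base and state transition matrices from the Dirichlet posteriors
    (3.8)-(3.9), given the current hidden-state sequence Q (bases and states coded 0..)."""
    b = 4                                        # four bases
    n = np.zeros((r, b, b))                      # counts n^k_ij, eq. (3.10)
    m = np.zeros((r, r))                         # counts m_kj,   eq. (3.11)
    for t in range(1, len(Y)):
        n[Q[t], Y[t - 1], Y[t]] += 1
        m[Q[t - 1], Q[t]] += 1
    P = np.array([[rng.dirichlet(c[k, i] + n[k, i]) for i in range(b)]
                  for k in range(r)])            # rows of P^1, ..., P^r, eq. (3.8)
    A = np.array([rng.dirichlet(d[k] + m[k]) for k in range(r)])   # rows of A, eq. (3.9)
    return P, A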
Results
In the α-fetoprotein example the authors compared whether the DNA sequence should be
segmented into two or three hidden states. Firstly, they consider the case of two hidden states
where the base transitions must follow one of two transition matrices P 1 or P 2 which mark the
transitions between the four bases (A, C, G, T). With the number of hidden states selected a
priori, the segment lengths are also pre-specified: setting E(aii ) = 0.99 with SD(aii ) = 0.01
gives a change between segments approximately every 100 bases. The posterior results for the
parameters A and B = (P 1 , P 2 ) are given in Figure 3.4, where the mean length of segment type 1
is around 500 bases and the mean length of segment type 2 is around 70 bases. The main difference
between the two transition probability matrices can be seen in the transitions to bases A and
C: in P 1 there are more transitions to A and in P 2 there are more transitions to C. The
larger variability of the “from A” and “from G” rows in P 2 is due to segment type 2 being rich
in C and T with few A’s and G’s.
Figure 3.4: Boys et al. (2000), Posterior summary of transition matrices with two hidden states,
E(aii ) = 0.99 and SD(aii ) = 0.01
The authors then compared the posterior analysis with the results obtained when the number
of hidden states is set to three. Figure 3.5 shows the approximate probabilities of being in each
of the three states through the DNA sequence. The figure indicates that it is reasonable to
assume that the sequence consists of three segments and not two as was first assumed. It
is possible to increase the number of segments until the point where the posterior standard
deviations of the base transition matrices are sufficiently small. In practice, however, the choice of
the number of segments can be assessed using information criteria.
In conclusion, the method of segmenting the DNA sequence using a Bayesian framework
can be advantageous if sufficient prior information, such as the length and number of segments,
is available. However, in practice this is not usually the case, and so we shall expand upon this
approach and show how it is possible to segment the DNA sequence when the number of hidden
states is unknown by using a reversible jump MCMC approach.
Figure 3.5: Boys et al. (2000), Posterior probability of being in one of the three states at time t: (a) P (qt = S1 |Y1:T , A, B), (b) P (qt = S2 |Y1:T , A, B) and (c) P (qt = S3 |Y1:T , A, B)

3.4 DNA sequence analysis with an unknown number of hidden states

In the last example we considered the case where the number of hidden states was known; this
is frequently not the case and the number of hidden states must be determined. Here we will
consider how we can calculate the number of hidden states by utilising reversible jump MCMC
algorithms. The paper Boys and Henderson (2001) uses the reversible jump MCMC approach
for the case of DNA sequence segmentation. We shall discuss this paper and the techniques
used when the number of hidden states is unknown. We shall also include in this section the
paper Boys and Henderson (2004) in which the authors expand on the idea that not only is the
number of hidden states unknown but also the order of the Markov dependence which, until
now, has been assumed to be first order.
Model
We use a similar notation to the last example, where we take our observations Yt ∈ Y =
{A, C, G, T } to be the four bases (Adenine, Cytosine, Guanine and Thymine); to simplify notation we can denote the state space as Y = {1, 2, . . . , b} (for DNA, a b = 4 letter
alphabet). We denote our hidden states as qt = Sk , t ∈ {1, 2, . . . , T }, k ∈ S = {1, 2, . . . , r},
representing the different segment types and δ represents the order of the Markov chain conditional on the hidden states. When δ = 0 we have the usual independence assumption, but for
δ > 0 we can include the short range dependent structure found in DNA (Churchill, 1992). The
HMM can be considered in terms of the observation equations (3.12) and the state equations
(3.13).
\[ P(Y_t \mid Y_{1:t-1}, Q_{1:t}) = P(Y_t = j \mid Y_{t-\delta}, \ldots, Y_{t-1}, q_t = S_k) = p^{k}_{ij}, \qquad i \in \mathcal{Y}^{\delta} = \{1, 2, \ldots, b^{\delta}\}, \; j \in \mathcal{Y}, \; k \in \{1, 2, \ldots, r\} \tag{3.12} \]
where $i = I(Y_{1:T}, t, \delta, b) = 1 + \sum_{l=1}^{\delta} (Y_{t-l} - 1)\, b^{\,l-1}$
\[ P(q_t = S_l \mid q_{t-1} = S_k) = a_{kl}, \qquad k, l \in S_r = \{1, 2, \ldots, r\} \tag{3.13} \]
where A = {akl } is the matrix of hidden state transition probabilities and B = {P 1 , . . . , P r }
denotes the collection of observable base transition matrices for the r hidden states, with P k = (pkij ),
r ∈ R = {1, 2, . . . , rmax } and δ ∈ Q = {0, 1, 2, . . . , δmax }. While we treat r and δ as unknown,
it is necessary for the reversible jump algorithm to restrict the unknown number of
states and order of dependence to at most rmax and δmax . In this example we consider the case where the
number of hidden states r is unknown. The DAG in Figure 3.6 denotes the unknown quantities
with circles and the known with squares, thus in this case we label our unknown number of
states r with a circle.
Figure 3.6: DAG of the hidden Markov model with r hidden states
It is computationally convenient to model the hidden states Q1:T as missing data and work
with the complete-data likelihood P (Y1:T , Q1:T |r, δ, A, B). For a given r, the complete-data likelihood is simply the product of the observation and state equations, as in (3.14).
\[
\begin{aligned}
P(Y_{1:T}, Q_{1:T} \mid r, \delta, A, B) &= P(Y_{1:T} \mid r, \delta, Q_{1:T}, A, B) \cdot P(Q_{1:T} \mid r, \delta, A, B) && (3.14) \\
&\propto \prod_{i \in \mathcal{Y}^{\delta}} \prod_{j \in \mathcal{Y}} \prod_{k \in S_r} (p^{k}_{ij})^{n^{k}_{ij}} \prod_{i \in S_r} \prod_{j \in S_r} a_{ij}^{m_{ij}} && (3.15)
\end{aligned}
\]
where
\[
n^{k}_{ij} = \sum_{t = \delta_{\max}+1}^{T} I\big(I(Y_{1:T}, t, \delta, b) = i,\; Y_t = j,\; q_t = S_k\big), \qquad
m_{ij} = \sum_{t = \delta_{\max}+1}^{T} I(q_{t-1} = S_i,\; q_t = S_j)
\]
and I(·) denotes the indicator function which equals 1 if true and 0 otherwise.
Prior distributions
The advantage of using a Bayesian analysis is that it is possible to include a priori uncertainty about the unknown parameters. The aim of this analysis is to make inferences about the
unknown number of segments, r, the order of dependence δ, the model transition parameters
A, B, and also the sequence of hidden states, Q1:T . It is possible to quantify the uncertainty of
these parameters through the prior distribution (3.16).
P (r, δ, A, B) = P (r) · P (δ) · P (A, B|r, δ) = P (r) · P (δ) · P (A|r) · P (B|r, δ)   (3.16)
In reversible jump applications we restrict our number of hidden states r and order of
dependence δ to be at most rmax and δmax , respectively. For the distributions of r and δ we use independent truncated prior distributions, where r ∼ Po(αr ), r ∈ {1, 2, . . . , rmax }, and δ ∼ Po(αδ ), δ ∈
{1, 2, . . . , δmax }, with fixed hyperparameters αr > 0 and αδ > 0.
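A small sketch of such a truncated Poisson prior for r (the same construction applies to δ) is given below; the report does not spell out the normalisation, so renormalising the truncated pmf is an assumption.

import numpy as np
from scipy.stats import poisson

def truncated_poisson_prior(alpha, k_max):
    """pmf of a Poisson(alpha) prior truncated to {1, ..., k_max} and renormalised."""
    k = np.arange(1, k_max + 1)
    p = poisson.pmf(k, alpha)
    return p / p.sum()

# e.g. truncated_poisson_prior(alpha=3.0, k_max=6) gives P(r = 1), ..., P(r = 6)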
As in the last example we shall again take independent Dirichlet distributions for the priors
of the row elements of A and B, where ak = akj and pi = pij represent the rows of the matrices
A and B, respectively.
\[ p^{k}_{i} = (p^{k}_{ij}) \mid r, \delta \sim D(c^{k}_{i}), \qquad i \in \mathcal{Y}^{\delta}, \; j \in \mathcal{Y}, \; k \in S_r \tag{3.17} \]
\[ a_{k} = (a_{kl}) \mid r \sim D(d_{k}), \qquad k, l \in S_r \tag{3.18} \]
where the Dirichlet parameters c and d are chosen to reflect the goal of the analysis.
Posterior analysis
In Bayesian analysis we can combine information about the model parameters from the
data and the prior distribution to obtain the posterior distribution (3.19), which calibrates the
uncertainty about the unknown parameters after observing the data.
P (r, δ, Q1:T , A, B|Y1:T ) ∝ P (Y1:T , Q1:T |r, δ, A, B) · P (r, δ, A, B)   (3.19)
In the last example it was possible to determine the posterior distribution using a straightforward MCMC technique with Gibbs sampling. However, in this case the posterior is more
complicated as we have now taken our number of hidden states r and the order of Markov
dependence δ to be unknown quantities. The extra complexity which this adds means that the
MCMC algorithm must now allow the sampler to jump between parameter spaces with different
dimensions which correspond to models with different values for r and δ. This can be achieved
using reversible jump techniques (Green, 1995), which are a generalisation of the Metropolis-Hastings algorithm. The term reversible jump comes from the fact that the parameter space
is explored by a number of move types which all attain detailed balance and some allow jumps
between subspaces of different dimensions. The two most popular categories of reversible jump
moves are the split/merge and birth/death moves. The basic idea behind the split/merge move
is that a hidden state is either split in two or combined with another hidden state according
to some probability, whereas the birth/death moves, which shall be focused on here, create or
delete a hidden state at random according to some probability.
MCMC scheme
After each iteration of the MCMC algorithm the following steps are performed:
1. Update the order of dependence and transition probability matrices P (δ, A, B|r, Y1:T , Q1:T ).
2. Update the number of hidden states r and also A, B and Q1:T conditional on δ.
3. Update the sequence of hidden states Q1:T using P (Q1:T |r, δ, Y1:T , A, B).
Step 3 of the MCMC procedure is simply an implementation of the forward-backward algorithm. In step 1 we update the order of Markov dependence P (δ|r, Y1:T , Q1:T ) and the transition
probability parameters P (A, B|r, δ, Y1:T , Q1:T ) in the same step. Choosing a conjugate Dirichlet
prior distribution for B allows δ to be updated without the need for a reversible jump move
but instead to be updated from the conditional distribution of the form,
P (δ|r, Y1:T , Q1:T ) ∝ P (δ|r, Q1:T ) · P (Y1:T |r, δ, Q1:T ) = P (δ) · P (Y1:T |r, δ, Q1:T )   (3.20)
where it is possible to simplify P (δ|r, Q1:T ) to P (δ) as we defined δ to be independent of (r, Q1:T ) a priori.
In step 2 the number of hidden states r is updated using a birth/death reversible jump
move. The birth/death move is computationally simpler than the split/merge move. The
authors found that the birth/death moves produce the best mixing chains.
Birth and Death moves
The initial move begins with a random choice between creating or deleting a hidden state
with probabilities br and dr , respectively. In the birth move a new hidden state j* is proposed,
which increases the number of hidden states from r to r + 1. A set of base transition probabilities
u for the new state is generated from the prior distribution (3.17), with P̃^{j*} = u and P̃^{j} = P^{j}
for j ≠ j*. We then simulate a row vector v for the state transition matrix Ã from the prior distribution
(3.18) and set the corresponding row of the proposed transition matrix to be ãj* = v. Column j* is filled by
taking ãij* = wi for i ≠ j*, where wi ∼ Beta(d̃ij* , Σj≠j* d̃ij ). Finally, a new hidden state sequence Q̃1:T is
simulated conditional on Ã, B̃ and r + 1 using the forward-backward algorithm. The move is
then accepted with probability min(1, A_B) where,
\[
\begin{aligned}
A_B = {} & \frac{P(Y_{1:T} \mid r+1, \delta, \tilde{A}, \tilde{B})}{P(Y_{1:T} \mid r, \delta, A, B)} \times \frac{P(r+1)}{P(r)} \times (r+1) \\
& \times \frac{\prod_{k \in S_{r+1}} \prod_{i \in \mathcal{Y}^{\delta}} D(\tilde{p}^{k}_{i} \mid \tilde{c}^{k}_{i})}{\prod_{k \in S_{r}} \prod_{i \in \mathcal{Y}^{\delta}} D(p^{k}_{i} \mid c^{k}_{i})} \times \frac{\prod_{i \in S_{r+1}} D(\tilde{a}_{i} \mid \tilde{d}_{i})}{\prod_{i \in S_{r}} D(a_{i} \mid d_{i})} \\
& \times \frac{d_{r+1}}{b_{r}(r+1)} \times \left[ D(\tilde{v} \mid \tilde{d}_{j^*}) \prod_{i \in S_{r+1} \setminus j^*} B\!\left(w_{i} \,\middle|\, \tilde{d}_{ij^*}, \sum_{j \in S_{r+1} \setminus j^*} \tilde{d}_{ij}\right) \prod_{i \in \mathcal{Y}^{\delta}} D(u_{i} \mid \tilde{c}^{j^*}_{i}) \right]^{-1} \\
& \times \prod_{i \in S_{r+1} \setminus j^*} (1 - w_{i})^{r-1}
\end{aligned}
\tag{3.21}
\]
The first two lines of the above expression contain the likelihood ratio and prior ratio with the
remaining lines consisting of the proposal ratio and Jacobian resulting from the transformation
of (B, u) → B̃ and (A, v, w) → Ã. We note that the expression does not depend on Q1:T and
Q̃1:T because the expression simplifies as,
\[
\frac{P(Y_{1:T}, \tilde{Q}_{1:T} \mid r+1, \delta, \tilde{A}, \tilde{B})}{P(Y_{1:T}, Q_{1:T} \mid r, \delta, A, B)} \times \frac{P(Q_{1:T} \mid Y_{1:T}, r, \delta, A, B)}{P(\tilde{Q}_{1:T} \mid Y_{1:T}, r+1, \delta, \tilde{A}, \tilde{B})} = \frac{P(Y_{1:T} \mid r+1, \delta, \tilde{A}, \tilde{B})}{P(Y_{1:T} \mid r, \delta, A, B)}
\tag{3.22}
\]
The death move follows in a similar fashion to the approach given for the birth move: a
randomly chosen hidden state j* is proposed to be deleted, after which the remaining
parameters are adjusted. Firstly, P̃^{j*} is deleted, with the remaining base transition probabilities
P^{j} = P̃^{j} for j ≠ j*, and the row and column j* of Ã are also deleted. The death of a
hidden state is accepted with probability min(1, A_B^{−1}) and thus the birth and death moves form
a reversible pair.
Bacteriophage lambda genome
In the paper Boys and Henderson (2004) the authors call upon the example of analysing
the genome of the bacteriophage lambda, a parasite of the intestinal bacterium Escherichia coli
which is often considered a benchmark example for comparing DNA segmentation techniques.
Previous analyses of this genome structure, as conducted by Churchill (1989) and others,
have suggested that the number of hidden states is r ≤ 6 and the Markov dependence is
δ ≤ 1. However, treating the order of Markov dependence and the number of hidden states
as unknown parameters suggests that there are r = 6 hidden states (with a 95% highest density interval
of (6, 7, 8)) with Markov dependence of order δ = 2. This is supported by the fact that
the bacteriophage lambda genome is predominantly composed of codons, which are the coding
units of DNA that occur as triplets (Yt−2 , Yt−1 , Yt ). However, it has been conjectured by
Lawrence and Auger that some of the hidden states are reverse complements of each other,
which is an area that the authors are exploring further.
Chapter 4
Evaluation
4.1 Conclusions
The use of hidden Markov models for DNA sequence analysis has been well explored over the
past two decades and even longer in other fields of research. While in this report we only
considered the applications of HMMs to DNA, much work has also been done to apply these
techniques to RNA and protein sequences. Perhaps the best known example of these techniques
being used in practice is in ab initio gene finding, where the DNA sequence is scanned
for signs of protein-coding genes.
One of the major drawbacks of the approaches given in this report is that most of the
work assumes a first-order Markov dependence for the hidden states which means that the
duration time (i.e. time spent in a state) follows a geometric distribution. In practice, the
duration times for the hidden states of a DNA sequence do not follow a geometric distribution
and so the constraint imposed by the first-order Markov assumption will undoubtedly lead to
unreliable results. One solution to this problem, which has been implemented in the GENSCAN
algorithm, is the use of hidden semi-Markov models, which follow in a similar fashion to the
hidden Markov model except that the hidden states are semi-Markov rather than Markov. The
advantage of this is that the duration times are no longer geometric; instead the probability
of transitioning to a new state depends on the length of time spent in the current state. This
means that the states are no longer constrained to share the same geometric form for their duration times.
In terms of DNA sequence analysis, HMMs are not the only statistical approach available for
segmenting the sequence. Much work has been done with multiple-changepoint segmentation
models, which, instead of using a hidden layer to detect a change in the base transitions, they
instead observe the sequence of bases and identify points in the sequence where the distribution
of bases changes. Compared to the HMM, multiple-changepoint models are computationally
more efficient as the posterior sample can be obtained without the use of MCMC techniques.
The development of HMMs over the past four decades has allowed them to be used in
various fields with many successful applications. Particularly, in terms of biology, HMMs have
been a great success in combining biology and statistics with both fields reaping the benefits
of developing new areas of research. The theory behind HMMs has expanded to allow for
greater flexibility of models available, including models with higher order Markov dependency
and models which do not require that the number of hidden states be pre-specified. There
is still the potential for further work with HMMs in terms of improved parameter estimation,
unknown Markov dependency and state identifiability, to name a few. There will certainly
be further applications to which HMMs will be applied and with those new applications, new
challenges will surely develop and improve upon the theory which has already been established.
Bibliography
Baum, L. (1972). An inequality and associated maximization technique in statistical estimation
for probabilistic functions of Markov processes. Inequalities, 3:1–8.
Baum, L. and Eagon, J. A. (1967). An inequality with applications to statistical estimation for
probabilistic functions of a Markov process and to a model for ecology. Bulletin of the American
Mathematical Society, 73(3):360–363.
Baum, L. and Petrie, T. (1966). Statistical inference for probabilistic functions of finite state
Markov chains. The Annals of Mathematical Statistics, 37(6):1554–1563.
Baum, L., Petrie, T., Soules, G., and Weiss, N. (1970). A maximization technique occurring in
the statistical analysis of probabilistic functions of Markov chains. The annals of mathematical
statistics, 41(1):164–171.
Bird, A. (1987). CpG islands as gene markers in the vertebrate nucleus. Trends in Genetics,
3:342–347.
Boys, R. and Henderson, D. (2001). A comparison of reversible jump MCMC algorithms for
DNA sequence segmentation using hidden Markov models. Comp. Sci. and Statist, 33:35–49.
Boys, R. and Henderson, D. (2004). A Bayesian approach to DNA sequence segmentation.
Biometrics, 60(3):573–581.
Boys, R., Henderson, D., and Wilkinson, D. (2000). Detecting homogeneous segments in DNA
sequences by using hidden Markov models. Journal of the Royal Statistical Society: Series C
(Applied Statistics), 49(2):269–285.
Braun, J. V. and Muller, H.-G. (1998). Statistical methods for DNA sequence segmentation.
Statistical Science, 13(2):142–162.
Cappe, O. (2001). Ten years of HMMs. http://perso.telecom-paristech.fr/~cappe/docs/hmmbib.html.
Carlin, B., Gelfand, A., and Smith, A. (1992). Hierarchical Bayesian analysis of changepoint
problems. Journal of the Royal Statistical Society: Series C (Applied Statistics), 41(2):389–405.
Churchill, G. (1989). Stochastic models for heterogeneous DNA sequences. Bulletin of Mathematical Biology, 51:79–94. doi:10.1007/BF02458837.
Churchill, G. (1992). Hidden Markov chains and the analysis of genome structure. Computers
and Chemistry, 16(2):107–115.
Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. (1998). Biological Sequence Analysis: Probabilistic
Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge.
Elton, R. A. (1974). Theoretical models for heterogeneity of base composition in DNA. Journal
of Theoretical Biology, 45(2):533–553.
Forney, G. D. (1973). The Viterbi algorithm. Proceedings of the IEEE, 61(3):268–278.
Green, P. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model
determination. Biometrika, 82(4):711–732.
Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech
recognition. Proceedings of the IEEE, 77:257–286.
Viterbi, A. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269.