Markov Models and HMM in Genome Analysis - LAPTh

Colloque T.A.G – LAPTH
Annecy – 8-10 novembre 2006
Analysis of biological sequences
using Markov Chains
and Hidden Markov Models
Bernard PRUM,
La Génopole – Évry – France
[email protected]
Why Markov Models ?
A biological sequence :
X = (X1, X2, … , Xn)
where Xk ∈ A = { t , c , a , g } (DNA)
or A = {A C D E F G H I K L M N P Q R S T V W Y} (proteins)
A very common tool for analyzing these sequences is the
Markov Model (MM)
P(Xk = v | Xj , j < k) = P(Xk = v | Xk–1)
denoted by π(u , v) when Xk–1 = u, for u, v ∈ A
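To fix ideas, a minimal sketch (not from the talk; all names are illustrative) of estimating π(u, v) from a sequence by dinucleotide counts:

from collections import Counter

def transition_matrix(seq, alphabet="tcag"):
    """Estimate pi(u, v) = P(Xk = v | Xk-1 = u) by counting dinucleotides."""
    pairs = Counter(zip(seq, seq[1:]))                 # N(uv): u followed by v
    pi = {}
    for u in alphabet:
        total = sum(pairs[(u, v)] for v in alphabet)   # N(u+)
        pi[u] = {v: pairs[(u, v)] / total if total else 0.0 for v in alphabet}
    return pi

x = "gctggtggatcgatttacgctggtgg"
pi = transition_matrix(x)
print(pi["g"])   # estimated law of the letter following a 'g'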
Why MM ? – 2
Example:
A complex, called Rec BCD, protects the cell against viruses.
[Figure: Rec BCD in E. coli destroys viral DNA; chi sites along the bacterium's own genome stop it]
To avoid the destruction of the cell's own genome, a password gctggtgg
(called chi) occurs along the genome. When Rec BCD bumps into a chi, it
stops its destruction. To be efficient, the number of occurrences of chi
must be much higher than the number predicted by a Markov model.
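The comparison hinted at here can be sketched as follows, reusing the hypothetical transition_matrix() above; under an order-1 model the expected count of a word w is (n − |w| + 1) µ(w1) ∏ π(wi–1, wi):

def observed_count(seq, word):
    # overlapping occurrences of word in seq
    return sum(seq.startswith(word, i) for i in range(len(seq) - len(word) + 1))

def expected_count(seq, word, pi):
    # (n - |w| + 1) * mu(w_1) * prod_i pi(w_{i-1}, w_i); assumes every letter
    # of word occurs in seq, so that the empirical frequency mu is defined
    n = len(seq)
    mu = {u: seq.count(u) / n for u in set(seq)}
    p = mu[word[0]]
    for a, b in zip(word, word[1:]):
        p *= pi[a][b]
    return (n - len(word) + 1) * p

# chi = "gctggtgg": on the real E. coli genome, observed >> expected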
Results MM
Parsimonious Markov Models
When we model a sequence in order to find exceptional motifs
or for annotation, we have to estimate the parameters of the
model, and the more parameters we have, the worse the estimation.
In a Markov Model of order m, there are 4^m predictors
(the m-words), hence 3 × 4^m parameters:
in the M2 model, 16 predictors and 48 parameters;
in M5, 1024 predictors and 3072 parameters.
PMM – 2
A first restriction consists in taking the past into account only as far
as needed: we use a long past when the sequence shows that this is
necessary, and a short past when the sequence allows the economy. These
models are called
VLMC = Variable Length Markov Chains
In this VLMC, there are 12 predictors:
aa, ca, ga, ta, ac, cc, gc, tc, g, at, tt, [gc]t
There are 36 parameters.
Notation: [gc] denotes “g or c”; [act] denotes “a or c or t”.
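A minimal sketch (an assumed representation, not the talk's code) of looking up the predictor of a position: each predictor is read as constraints on Xk–1 and, when present, Xk–2, with '[gc]' standing for "g or c"; the past is assumed to hold at least two letters.

PREDICTORS = [
    "aa", "ca", "ga", "ta",        # full 2-letter past needed
    "ac", "cc", "gc", "tc",
    "g",                           # one letter suffices
    "at", "tt", "[gc]t",           # g and c are pooled before a t
]

def matches(predictor, past):
    """Does the predictor match the past (…, Xk-2, Xk-1)?"""
    classes, i = [], 0             # parse letters and classes like [gc]
    while i < len(predictor):
        if predictor[i] == "[":
            j = predictor.index("]", i)
            classes.append(set(predictor[i + 1:j]))
            i = j + 1
        else:
            classes.append({predictor[i]})
            i += 1
    return all(past[-len(classes) + m] in c for m, c in enumerate(classes))

def predictor_of(past):
    return next(w for w in PREDICTORS if matches(w, past))

print(predictor_of("acgt"))   # -> '[gc]t' : Xk-1 = t, Xk-2 = g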
PMM – 3
But it is not obvious that, for the prediction of Xk,
Xk–p becomes less and less informative as p increases.
As an example (*), let us consider the ‘jumper’ model
P(Xt = v | past) = P(Xt = v | Xt–2)
(the dependence ‘jumps’ over Xt–1)
It corresponds to a tree with 4 predictors and 12 parameters.
(*) this model is not as academic as it seems: for example, in a coding region
(periodic model depending on the phase), the 2nd position in a codon strongly
depends on the 2nd position in the previous codon (cf. hydrophobicity)
PMM – 4
These models are called
PMM = Parsimonious Markov Models
More general (?) example:
in this PMM there are 8 predictors (24 parameters):
a[ac], g, at, c[ac], [cg]t, g[ac], tt, t[ac]
PMM – 5
More precisely: in the tree of predictors (*), below any node
all the partitions of A = { t , c , a , g } may appear.
Hence there are 15 possibilities below each node
(15 = the number of partitions of a 4-element set).
(*) the different predictors appear in this tree
as the paths from the leaves to the root.
PMM – 6
A Parsimonious Markov Model (PMM) is defined by
• such a dependence tree t
• for each leaf (= for each predictor) a law on A:
P(Xt = u | Xt–1 ∈ [tc])
P(Xt = u | Xt–1 = a, Xt–2 = c)
P(Xt = u | Xt–1 = a, Xt–2 ∈ [tag])
P(Xt = u | Xt–1 = g)
PMM – 7
We will only work with finite-order PMM: the longest predictor
contains, say, m letters (the depth of the tree is m).
Obviously a PMM of order m is a MM of order m.
Note: the number of PMM increases very quickly with m: in
the 4-letter alphabet and for m = 5 there are some 10^85 trees.
–––––––
Notation: t denotes a tree of predictors,
W its set of predictors in t (the leaves).
For w ∈ W, θ_{w,u} = P(Xt = u | w)
Statistics on PMM
For a fixed tree t, the likelihood is obviously
L(θ) = … ∏_{w ∈ W, u ∈ A} θ_{w,u}^{N(wu)}
where N(wu) counts the occurrences of the predictor w followed by the letter u.
(The dots correspond to the first letters in the sequence;
we will not care about them today.)
This leads to the classical MLE
θ̂_{w,u} = N(wu) / N(w+)
where N(w+) = ∑_v N(wv).
The difficulty arises when we want to choose the tree: a problem of
model choice
(among, for example, 10^85 models)
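A sketch of this MLE for a fixed predictor set, using the hypothetical predictor_of() from the VLMC sketch to map each position to its predictor:

from collections import defaultdict

def mle(seq, order, predictor_of, alphabet="tcag"):
    counts = defaultdict(lambda: defaultdict(int))
    for k in range(order, len(seq)):
        w = predictor_of(seq[:k])          # predictor of position k
        counts[w][seq[k]] += 1             # N(wu)
    theta = {}
    for w, row in counts.items():
        n_w = sum(row.values())            # N(w+)
        theta[w] = {u: row[u] / n_w for u in alphabet}
    return theta, counts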
Statistics – 2
Therefore we adopt a Bayesian approach.
A priori law:
• on the tree, let us choose the uniform law (it can be changed)
• on the transition parameters, it is natural to choose a Dirichlet law,
which is conjugate:
if, for w ∈ W, a priori P(θ_{w,•}) ∝ ∏_u θ_{w,u}^{a(w,u)}
then, a posteriori, P(θ_{w,•} | X) ∝ ∏_u θ_{w,u}^{N(wu) + a(w,u)}
The MAP estimator of θ_{w,u} remains the same as before, except
that N(wu) has to be changed into
N’(wu) = N(wu) + a(w, u)
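The change is mechanical; a small variant of the MLE sketch (counts as returned by the hypothetical mle() above; a constant pseudocount a is a simplifying assumption):

def map_estimate(counts, a=1.0, alphabet="tcag"):
    theta = {}
    for w, row in counts.items():
        n_w = sum(row.values()) + a * len(alphabet)            # N'(w+)
        theta[w] = {u: (row[u] + a) / n_w for u in alphabet}   # N'(wu) / N'(w+)
    return theta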
Statistics – 3
The use of Bayes’ formula then gives, as a posterior law on the trees,
ln P(t | X) = ∑_w S(w) (up to an additive constant)
where the sum is taken over all the predictors w in the tree t, and
S(w) = ∑_u ln Γ(N’(wu)) − ln Γ(∑_u N’(wu))
(Γ is such that Γ(k+1) = k! for k ∈ ℕ; N’ are the pseudocount-corrected
counts of the previous slide)
Writing the posterior law in this way shows that P(t | X) may be
maximized in a recursive way.
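A minimal sketch of the predictor score S(w), using SciPy's gammaln (ln Γ) to stay in log scale; counts_w is assumed to hold the N’(wu), which keeps all arguments positive:

from scipy.special import gammaln

def score(counts_w):
    """S(w) = sum_u ln Gamma(N'(wu)) - ln Gamma(sum_u N'(wu))."""
    return sum(gammaln(n) for n in counts_w.values()) - gammaln(sum(counts_w.values()))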
Application to real genomes
We fitted MM and PMM of orders m = 3, 4 and 5
• on the set of the 224 complete bacterial genomes
published to date
• on their coding regions (CDS)
To compare the adequacy of these models, we computed
the BIC criterion for each model M:
BIC(M) = 2 L(M) − nb_param(M) · ln n
(L(M) the maximized log-likelihood)
“The higher the BIC, the better the model”
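A trivial helper matching the slide's convention (higher is better); the log-likelihood and parameter counts are the caller's:

import math

def bic(loglik, n_params, n):
    return 2 * loglik - n_params * math.log(n)

# e.g. an order-m MM on 4 letters has 3 * 4**m free parameters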
Application – 2
This picture plots BIC(PMM) – BIC(MM) against the size of
the bacterial genome. For all the bacteria, PMM fits better
than classical MM.
Approach using FDA
Recent results (Grégory Nuel) concern the use of “Finite
(Deterministic) Automata” in the statistics of words or patterns.
To a word, we may associate an FDA.
Example 1: on {a, b}, w = aaab
States: the prefixes of w, i.e. ε, a, aa, aaa, aaab
[Figure: transition diagram of the automaton]
This can be generalized if “one” word w is replaced by
a motif (finite family of words) or even a language.
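A minimal sketch of this automaton: states are the prefixes of w = "aaab", each transition moves to the longest suffix of the input read so far that is a prefix of w, and each entry into the accepting state is one (possibly overlapping) occurrence.

DELTA = {  # state -> {letter: next state}
    "":     {"a": "a",    "b": ""},
    "a":    {"a": "aa",   "b": ""},
    "aa":   {"a": "aaa",  "b": ""},
    "aaa":  {"a": "aaa",  "b": "aaab"},
    "aaab": {"a": "a",    "b": ""},
}

def count_occurrences(seq):
    state, hits = "", 0
    for letter in seq:
        state = DELTA[state][letter]
        hits += (state == "aaab")
    return hits

print(count_occurrences("aaaabbaaab"))   # 2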
Approach using FDA – 2
This automaton is especially dedicated to the study of the word w (the
motif, ...): if we “run” a sequence on this graph, the automaton counts
the occurrences of w (the motif, ...).
It turns out to be VERY efficient:
“wordcount”, a program in EMBOSS, needs 4352 seconds to count
the occurrences of all 12-words in E. coli; Nuel’s program achieves
this task in 9.86 seconds.
The Prosite motif
[LIVMF]GE.[GAS][LIVM].(5-11)R[STAQ]A.[LIVAM].[STACV]
(some 10^12 words) is treated by an FDA of 329 (30) states in M0,
1393 (78) states in M1.
Approach using FDA – 3
If the sequence X is a Markov chain, we then have another MC
running on this graph.
Even for “rather complicated motifs”, this allows one to get the law of
“all” statistics of words:
– exact law of the first occurrence of a motif (taking into
account the “starting point”),
– exact law of the number of occurrences of the motif,
– in particular, expectation and variance of these laws,
opening the possibility of Gaussian, Poissonian, ... approximations
(and an exhaustive study of the qualities of these approximations),
– law of a motif M conditionally on the number of
occurrences of another one, M’.
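A sketch of the idea behind the second item: if X is a Markov chain, the pair (automaton state, count so far) is again a Markov chain, and the exact law of the count comes out by dynamic programming. It reuses the DELTA automaton sketched earlier; mu and pi are an assumed M1 model on {a, b}.

from collections import defaultdict

def count_distribution(n, mu, pi):
    """Exact law of the number of occurrences of 'aaab' in X_1..X_n."""
    dist = defaultdict(float)                  # (state, last letter, count) -> proba
    for x in "ab":
        q = DELTA[""][x]
        dist[(q, x, int(q == "aaab"))] += mu[x]
    for _ in range(n - 1):
        new = defaultdict(float)
        for (q, x, c), p in dist.items():
            for y in "ab":
                q2 = DELTA[q][y]
                new[(q2, y, c + (q2 == "aaab"))] += p * pi[x][y]
        dist = new
    law = defaultdict(float)
    for (q, x, c), p in dist.items():
        law[c] += p
    return dict(law)

mu = {"a": 0.5, "b": 0.5}
pi = {"a": {"a": 0.6, "b": 0.4}, "b": {"a": 0.5, "b": 0.5}}
print(count_distribution(8, mu, pi))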
2nd Part :
Hidden Markov Models
Hidden Markov Models
An important criticism against Markov modelization is its stationarity:
a well-known theorem says that, under weak conditions,
P(Xk = u) → µ(u) when k → ∞
(and the rate of convergence is exponential).
But biological sequences are not homogeneous:
there are g+c-rich segments / g+c-poor segments (isochores),
and one may presume (and verify) that the rules of succession of letters
differ in coding parts / non-coding parts.
Is it possible to turn this problem to our advantage
and develop a tool for the analysis of heterogeneity?
⇒ annotation
HMM – 2
Suppose that d states alternate along the sequence:
[Figure: a sequence segmented into stretches where Sk = 1, then Sk = 2, then Sk = 1, ...]
In each state we have a MC:
if Sk = 1, then P(Xk = v | Xk–1 = u) = π1(u ; v)
if Sk = 2, then P(Xk = v | Xk–1 = u) = π2(u ; v)
and (more technical than biological – see HSMM)
P(Sk = y | Sk–1 = x) = π0(x ; y)
Our objectives:
• estimate the parameters π1, π2, π0
• allocate a state {1, 2} to each position
HMM – 3
Use the likelihood !!
L(θ) = ∑_{S1…Sn} µ0(S1) µ_{S1}(X1) ∏_{k=2…n} π0(Sk–1 , Sk) π_{Sk}(Xk–1 , Xk)
The product has n terms (the length of the sequence); the sum is taken
over all possibilities S1 S2 … Sn, so there are s^n terms:
2^10 000 = 10^3 000
Despair !!!
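The despair is only apparent: the sum factorizes, and the classical forward recursion (standard HMM machinery, not detailed on these slides) computes L(θ) in O(n·s²) operations instead of summing s^n terms. A minimal sketch under the emission model above; all names are illustrative, and in practice one works in log scale (or rescales) to avoid underflow.

def likelihood(x, mu0, mu, pi0, pi):
    """x: letter string; mu0[s]: P(S1 = s); mu[s][u]: P(X1 = u | S1 = s);
    pi0[r][s]: state transitions; pi[s][u][v]: letter transitions in state s."""
    states = range(len(mu0))
    f = [mu0[s] * mu[s][x[0]] for s in states]                 # f_1(s)
    for k in range(1, len(x)):
        f = [sum(f[r] * pi0[r][s] for r in states) * pi[s][x[k - 1]][x[k]]
             for s in states]                                  # f_k(s)
    return sum(f)                                              # L(theta)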
Annotation
Continuous H.M.M.
Searching nucleosome positions
In eukaryotes (only), an important part of the chromosomes
forms chromatin, a state where the double helix winds round “beads”
forming a collar:
[Figure: chromatin fiber; scale bar 10 nm]
Each bead is called a nucleosome. Its core is a complex
involving 8 proteins (an octamer) called histones (H2A, H2B, H3, H4).
DNA winds twice around this core and is locked by another histone (H1).
The total weight of the histones is roughly equal to the weight of the DNA.
Curvature within curvature
The DNA helix turns twice around the histone core.
Each turn corresponds to about 7 pitches of the helix,
each one made of about 10 nucleotides.
Total = 146 nt within each nucleosome.
[Figure: DNA wrapped around the histone core]
Depending on the position (“in” vs “out”), the curvature satisfies
different constraints.
Nuc and “no-nuc” states
Trifonov (99) as well as Rando (05) underline that
there are ‘no’ nucleosomes in the gene promoters (accessibility).
They introduce, “before” the nucleosome, a “no-nucleosome” state.
[Diagram: hidden-state chain – a no-nuc state, nucleosome-core states 1, 2, ..., 70, and a spacer state]
Ioshikhes, Trifonov, Zhang, Proc. Natl Acad. Sci. 96 (1999)
Yuan, Liu, Dion, Slack, Wu, Altschuler, Rando, Scienceexpress (2005)
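A hedged sketch of how such a state chain can be encoded as a transition matrix: the core is forced to run through its 70 states in order, while the dwell probabilities (0.99, 0.95, ...) are illustrative placeholders, not values from the talk.

import numpy as np

N_CORE = 70
states = ["no-nuc"] + [f"core{i}" for i in range(1, N_CORE + 1)] + ["spacer"]
idx = {s: i for i, s in enumerate(states)}

P = np.zeros((len(states), len(states)))
P[idx["no-nuc"], idx["no-nuc"]] = 0.99           # dwell in no-nuc (assumed value)
P[idx["no-nuc"], idx["core1"]] = 0.01            # enter a nucleosome
for i in range(1, N_CORE):                       # deterministic run through core
    P[idx[f"core{i}"], idx[f"core{i+1}"]] = 1.0
P[idx[f"core{N_CORE}"], idx["spacer"]] = 1.0
P[idx["spacer"], idx["spacer"]] = 0.95           # linker length (assumed)
P[idx["spacer"], idx["core1"]] = 0.04            # next nucleosome
P[idx["spacer"], idx["no-nuc"]] = 0.01           # back to a promoter region
assert np.allclose(P.sum(axis=1), 1.0)           # each row is a probability law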
Bendability
Following an idea of Baldi, Lavery, ..., we introduce an
index of bendability;
it depends on successions of 2, 3, 4, ... nucleotides.
[Figure: a sliding window reading successive trinucleotides along the sequence …gtgaactatca…]
PNUC table
There exist various tables which indicate the bendability of di-,
tri- or even tetra-nucleotides (PNUC, DNase, ...).
We used PNUC-3 (*). Trinucleotide scores, rows indexed by the first
two letters, columns by the third letter:

          a      t      g      c
aa       0.0    0.7    5.2    3.7
at       2.8    0.7    6.7    5.3
ag       3.3    5.8    5.4    7.5
ac       5.2    5.8    5.4    5.4
ta       7.3    2.8    2.2    6.4
tt       7.3    0.0    3.3    3.0
tg      10.0    5.2    5.4    7.5
tc      10.0    3.3    8.3    6.2
ga       3.0    5.3    5.4    5.6
gt       6.4    3.7    6.5    5.6
gg       6.2    5.4    6.0    8.2
gc       7.5    7.5    7.5    8.2
ca       3.3    6.7    4.2    6.5
ct       2.2    5.2    4.2    5.4
cg       8.3    5.4    4.7    7.5
cc       5.4    5.4    4.7    6.0

For example PNUC(cga) = 8.3 and PNUC(tcg) = 8.3: a word and its reverse
complement get the same score.
(*) Goodsell, Dickerson, NAR 22 (1994)
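A minimal sketch (assumed helper, not from the talk) of turning the PNUC-3 table into a bendability profile along a sequence; only a few of the 64 dictionary entries are shown.

PNUC = {"aaa": 0.0, "cga": 8.3, "tcg": 8.3, "tga": 10.0, "tca": 10.0}  # ... 64 entries

def bendability(seq):
    # score of the trinucleotide starting at each position
    return [PNUC.get(seq[i:i + 3], 0.0) for i in range(len(seq) - 2)]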
Scan of K3 of yeast
Sometimes it works:
[Figure: scan of chromosome K3 of yeast]
What about positions ?
We represent (*) parts of the chromosome K3 of Yeast.
[Figure: MuGeN display of K3 with the state-probability curves]
The green curve (“proba” of the no-nuc state)
increases between genes (promoters).
The red curve (“proba” of the nucleosome state)
appears periodically in genes.
(*) using the software MuGeN, by Mark Hoebeke
Acknowledgements
Labo «Statistique et Génome»
Christophe AMBROISE
Maurice BAUDRY
Etienne BIRMELE
Cécile COT
Emmanuelle DELLA-CHIESA
Mark HOEBEKE
Mickael GUEDJ
François KÉPÈS
Sophie LEBRE
Catherine MATIAS
Vincent MIELE
Florence MURI-MAJOUBE
Grégory NUEL
Franck PICARD
Hugues RICHARD
Anne-Sophie TOCQUET
Nicolas VERGNE
Secretary:
Michèle ILBERT
Labo MIG – INRA
Philippe BESSIÈRE
François RODOLPHE
Sophie SCHBATH
Élisabeth de TURCKHEIM
Labo AGRO
Jean-Noël BACRO
Jean-Jacques DAUDIN
Stéphane ROBIN
Lab’ Rouen
Dominique CELLIER
Sabine MERCIER