Colloque T.A.G – LAPTH Annecy – 8-10 novembre 2006

Analysis of biological sequences using Markov Chains and Hidden Markov Models
Bernard PRUM, La Génopole – Évry – France — [email protected]

Why Markov Models?
A biological sequence: X = (X1, X2, …, Xn), where Xk ∈ A = {t, c, a, g} or {A C D E F G H I K L M N P Q R S T V W Y}.
A very common tool for analyzing these sequences is the Markov Model (MM):
P(Xk = v | Xj, j < k) = P(Xk = v | Xk–1), denoted by π(u, v) if Xk–1 = u, for u, v ∈ A.

Why MM? – 2
Example: a complex called RecBCD protects the cell against viruses. [diagram: RecBCD, E. coli, viruses, chi, the bacterium's own genome]
To avoid the destruction of the cell's own genome, along the genome there exists a password, gctggtgg (it is called chi). When RecBCD bumps into a chi, it stops its destruction. For this protection to be efficient, the number of occurrences of chi must be much higher than the number predicted by a Markov model.

Results MM [figure]

Parsimonious Markov Models
When we model a sequence in order to find exceptional motifs, or for annotation, we have to estimate the parameters of the model; and the more parameters we have, the worse the estimation.
In a Markov model of order m there are 4^m predictors (the m-words), hence 3 × 4^m parameters:
• in the M2 model, 16 predictors and 48 parameters;
• in M5, 1024 predictors and 3072 parameters.

PMM – 2
A first restriction consists in taking the past into account only up to the relevant point: we use a long past when the sequence shows that this is necessary, and a short past when the sequence allows the economy. These models are called VLMC = Variable Length Markov Chains.
In this VLMC there are 12 predictors: aa ca ga ta ac cc gc tc g at tt [gc]t; hence 36 parameters.
Notation: [gc] denotes "g or c"; [act] denotes "a or c or t".

PMM – 3
But it is not obvious that, for the prediction of Xk, the letter Xk–p becomes less and less informative as p increases. As an example (*), let us consider the "jumper" model
P(Xt = v | past) = P(Xt = v | Xt–2)
(the dependence "jumps" over Xt–1); it corresponds to a tree with 4 predictors and 12 parameters.
(*) this model is not as academic as it seems: for example, in a coding region (periodic model depending on the phase), the 2nd position in a codon strongly depends on the 2nd position in the previous codon (cf. hydrophobicity).

PMM – 4
These models are called PMM = Parsimonious Markov Models.
A more general (?) example: in this PMM there are 8 predictors (24 parameters): a[ac] g at c[ac] [cg]t g[ac] tt t[ac].

PMM – 5
More precisely: in the tree of predictors (*), below any node, all the partitions of A = {t, c, a, g} may appear; hence there are 15 possibilities below each node.
(*) the different predictors appear in this tree as the paths from the leaves to the root.
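A quick sanity check of this count: the number of set partitions of a 4-element set is the Bell number B4 = 15. Here is a minimal Python sketch (the helper `partitions` is purely illustrative, not from the talk) that enumerates them:

```python
from itertools import combinations

def partitions(s):
    """Enumerate all set partitions of the list s (illustrative helper, not from the talk)."""
    if not s:
        yield []
        return
    first, rest = s[0], s[1:]
    # choose the block containing `first`: any subset of the remaining letters
    for k in range(len(rest) + 1):
        for others in combinations(rest, k):
            block = [first, *others]
            remaining = [x for x in rest if x not in others]
            # partition the leftover letters recursively
            for tail in partitions(remaining):
                yield [block] + tail

parts = list(partitions(list("tcag")))
print(len(parts))                                # 15 = Bell(4)
print([["".join(b) for b in p] for p in parts])
```

Each of these 15 partitions is one candidate way of grouping the four letters below a node of the predictor tree.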
PMM – 6
A Parsimonious Markov Model (PMM) is defined by:
• such a dependence tree t;
• for each leaf (= for each predictor), a law on A.
For instance, the four laws attached to the leaves of one such tree:
P(Xt = u | Xt–1 = [tc])
P(Xt = u | Xt–1 = a, Xt–2 = c)
P(Xt = u | Xt–1 = a, Xt–2 = [tag])
P(Xt = u | Xt–1 = g)

PMM – 7
We will only work with finite-order PMMs: the longest predictor contains, say, m letters (the depth of the tree is m). Obviously a PMM of order m is an MM of order m.
Note: the number of PMMs increases very quickly with m: on the 4-letter alphabet and for m = 5 there are some 10^85 trees.
Notations: t denotes a tree of predictors, and W its set of predictors in t (the leaves). For w ∈ W, θ(w, u) = P(Xt = u | w).

Statistics on PMM
For a fixed tree t, the likelihood is obviously
L(θ) = … ∏(w,u) θ(w, u)^N(wu)
(the dots correspond to the first letters of the sequence; we will not care about them today), which leads to the classical MLE
θ̂(w, u) = N(wu) / N(w+), where N(w+) = ∑v N(wv).
The difficulty arises when we want to choose the tree: a problem of model choice (among, for example, 10^85 models).

Statistics – 2
Therefore we adopt a Bayesian approach. A priori law:
• on the tree, let us choose the uniform law (it can be changed);
• on the transition parameters, it is natural to choose a Dirichlet law, which is conjugate: if, for w ∈ W, the prior is P(θ(w, ·)) ∝ ∏u θ(w, u)^a(w,u), then the posterior is P(θ(w, ·) | X) ∝ ∏u θ(w, u)^(a(w,u) + N(wu)).
The MAP estimator of θ(w, u) remains the same as before, except that N(wu) has to be changed into N'(wu) = N(wu) + a(w, u).

Statistics – 3
The use of Bayes' formula then gives, as the posterior law on the trees,
ln P(t | X) = ∑w S(w),
where the sum is taken over all the predictors w in the tree t and
S(w) = ∑u ln Γ(N'(wu)) – ln Γ(∑u N'(wu))
(Γ is such that Γ(k+1) = k!, k ∈ ℕ).
Writing the posterior law in this way shows that P(t | X) may be maximized in a recursive way.

Application to real genomes
We fitted MMs and PMMs for the orders m = 3, 4 and 5:
• on the set of the 224 complete bacterial genomes published to date;
• on their coding regions (CDS).
To compare the adequacy of these modelings, we computed the BIC criterion for each model M:
BIC(M) = 2 L(M) – nb_param(M) · ln n (L(M) the maximized log-likelihood)
("the higher the BIC, the better the model").

Application – 2
[figure: BIC(PMM) – BIC(MM) plotted against the size of the bacterial genome]
For all the bacteria, the PMM fits better than the classical MM.

Approach using FDA
Recent results (Grégory Nuel) concern the use of "Finite (Deterministic) Automata" in the statistics of words or patterns. To a word we may associate an FDA.
Example 1: on {a, b}, w = aaab. States: ε, a, aa, aaa, aaab. [diagram: transition graph of the automaton]
This can be generalized if "one" word w is replaced by a motif (a finite family of words) or even a language.

Approach using FDA – 2
This automaton is especially dedicated to the study of the word w (the motif, …): if we "run" a sequence on this graph, the automaton counts the occurrences of w (the motif, …).
It turns out to be VERY efficient: "wordcount", a program in EMBOSS, needs 4352 seconds to count the occurrences of all 12-words in E. coli; Nuel's program achieves this task in 9.86 seconds.
The PROSITE motif [LIVMF]GE.[GAS][LIVM].(5-11)R[STAQ]A.[LIVAM].[STACV] (some 10^12 words) is treated by an FDA of 329 (30) states in M0 and 1393 (78) states in M1.
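To make the counting concrete, here is a minimal Python sketch of such an automaton for w = aaab on {a, b}; this is the textbook KMP-style construction with names of our choosing, not Nuel's implementation:

```python
def build_dfa(word, alphabet):
    """String-matching automaton of `word`: state k means 'the letters read
    so far end with word[:k]' (largest such k). Textbook construction,
    not Nuel's actual implementation."""
    m = len(word)
    delta = [dict() for _ in range(m + 1)]
    for c in alphabet:
        delta[0][c] = 0
    delta[0][word[0]] = 1
    border = 0  # state reached after the longest proper border of word[:k]
    for k in range(1, m + 1):
        for c in alphabet:
            delta[k][c] = delta[border][c]   # on a mismatch, mimic the fallback state
        if k < m:
            delta[k][word[k]] = k + 1        # on word[k], extend the match
            border = delta[border][word[k]]
    return delta

def count_occurrences(word, text, alphabet="ab"):
    """Count (possibly overlapping) occurrences of `word` in `text`."""
    delta, final = build_dfa(word, alphabet), len(word)
    state = hits = 0
    for c in text:
        state = delta[state][c]
        if state == final:                   # an occurrence ends at this letter
            hits += 1
    return hits

print(count_occurrences("aaab", "aaaabaaabb"))  # -> 2
```

Once the table is built, running a sequence costs a single lookup per letter, which is what makes exhaustive word counts on a whole genome so fast.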
Approach using FDA – 3
If the sequence X is a Markov chain, we then have another Markov chain running on this graph. Even for "rather complicated" motifs, this allows one to get the law of "all" statistics of words:
• the exact law of the first occurrence of a motif (taking the "starting point" into account);
• the exact law of the number of occurrences of the motif;
• in particular the expectation and variance of these laws, opening the possibility of Gaussian, Poissonian, … approximations (and an exhaustive study of the quality of these approximations);
• the law of a motif M conditionally on the number of occurrences of another one, M'.

2nd Part: Hidden Markov Models
An important criticism against Markov modeling is its stationarity: a well-known theorem says that, under weak conditions, P(Xk = u) → µ(u) as k → ∞ (and the rate of convergence is exponential). But biological sequences are not homogeneous: there are g+c-rich segments and g+c-poor segments (isochores), and one may presume (and verify) that the rules of succession of letters differ in coding and non-coding parts.
Is it possible to take advantage of this problem and to develop a tool for the analysis of heterogeneity? ⇒ annotation.

HMM – 2
Suppose that d states alternate along the sequence,
S = 1 | S = 2 | S = 1 | S = 2 | S = 1
and that in each state we have a Markov chain:
if Sk = 1, then P(Xk = v | Xk–1 = u) = π1(u; v);
if Sk = 2, then P(Xk = v | Xk–1 = u) = π2(u; v);
and (more technical than biological: see HSMM)
P(Sk = y | Sk–1 = x) = π0(x; y).
Our objectives:
• estimate the parameters π1, π2, π0;
• allocate a state in {1, 2} to each position.

HMM – 3
Use the likelihood?!
L(θ) = ∑ µ0(S1) µS1(X1) ∏k=2..n π0(Sk–1, Sk) πSk(Xk–1, Xk),
where the product has n terms (the length of the sequence) and the sum runs over all possibilities for S1 S2 … Sn: there are d^n terms, and 2^10 000 ≈ 10^3 000. Despair!!!
(The classical way out is the forward recursion sketched below, which computes this very sum in O(n · d²) operations; this is what makes both estimation and annotation feasible.)
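A minimal sketch of that forward recursion (NumPy; the argument names and the flat law assumed for the first letter are our choices, not the talk's):

```python
import numpy as np

def forward_loglik(x, pi0, emis, mu0):
    """HMM log-likelihood by the forward recursion: O(n * d^2) instead of
    summing over the d^n hidden paths. Hypothetical sketch, not the talk's code.
      x    : sequence encoded as integers 0..3 (for t, c, a, g)
      pi0  : (d, d) transition matrix of the hidden states
      emis : (d, 4, 4) per-state letter-transition matrices pi_s(u, v)
      mu0  : (d,) initial law of the hidden state
    """
    n = len(x)
    # alpha[s] is proportional to P(X_1..X_k, S_k = s); rescaled for stability
    alpha = mu0 * 0.25            # flat law assumed for the first letter X_1
    loglik = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for k in range(1, n):
        # move the hidden state, then emit X_k given X_{k-1} in the new state
        alpha = (alpha @ pi0) * emis[:, x[k - 1], x[k]]
        c = alpha.sum()
        loglik += np.log(c)
        alpha = alpha / c         # rescaling avoids numerical underflow
    return loglik

# toy run: d = 2 hidden states with sticky transitions, random emission laws
rng = np.random.default_rng(0)
pi0 = np.array([[0.99, 0.01], [0.02, 0.98]])
emis = rng.dirichlet(np.ones(4), size=(2, 4))   # shape (2, 4, 4)
mu0 = np.array([0.5, 0.5])
x = rng.integers(0, 4, size=1000)
print(forward_loglik(x, pi0, emis, mu0))
```

The same alpha table, combined with the symmetric backward pass, also gives the posterior probability of each state at each position, which is exactly the annotation curve used below.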
Annotation: HMM, continued
Searching nucleosome positions. In eukaryotes (only), an important part of the chromosomes forms chromatin, a state where the double helix winds around "beads", forming a necklace. [figure: chromatin fiber; scale bar 10 nm]
Each bead is called a nucleosome. Its core is a complex involving 8 proteins (an octamer) called histones (H2A, H2B, H3, H4). DNA winds twice around this core and is locked by another histone (H1). The total weight of the histones is roughly equal to the weight of the DNA.

Curvature within curvature
The DNA helix turns twice around the histone core. Each turn corresponds to about 7 pitches of the helix, each one made of about 10 nucleotides: in total, 146 nt within each nucleosome. Depending on the position ("in" vs "out"), the curvature satisfies different constraints. [figure]

Nuc and "no-nuc" states
Trifonov (1999) as well as Rando (2005) underline that there are "no" nucleosomes in the gene promoters (accessibility). They introduce, "before" the nucleosome state, a "no-nucleosome" state. [diagram: positions 1, 2, …, 70 of the nucleosome core, then a no-nuc spacer]
Ioshikhes, Trifonov, Zhang, Proc. Natl Acad. Sci. 96 (1999); Yuan, Liu, Dion, Slack, Wu, Altschuler, Rando, Sciencexpress (2005).

Bendability
Following an idea of (Baldi, Lavery, …), we introduce an index of bendability; it depends on the succession of di-, tri-, tetra-nucleotides, e.g. along g t g a a c t a t c a t.

PNUC table
There exist various tables which indicate the bendability of di-, tri- or even tetranucleotides (PNUC, DNase, …). We used PNUC-3: for instance PNUC(cga) = 8.3 and PNUC(tcg) = 8.3.
[table: PNUC-3 bendability scores for all 64 trinucleotides (1st × 2nd × 3rd letter), on a 0.0–10.0 scale]
(*) Goodsell, Dickerson, NAR 22 (1994).

Scan of K3 of yeast
Sometimes it works: [figure: HMM scan of chromosome III of yeast]

What about positions?
We represent (*) parts of chromosome III of yeast. The green curve (the "probability" of the no-nuc state) increases between genes (promoters). The red curve (the "probability" of the nucleosome state) appears periodically within genes. [figure]
(*) using the software MuGeN, by Mark Hoebeke.
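For illustration only, a tiny Python sketch of how such a trinucleotide table yields a bendability profile along a sequence. The table below is a placeholder (the talk only quotes PNUC(cga) = PNUC(tcg) = 8.3; the real PNUC-3 table covers all 64 trinucleotides), and the smoothing window is an arbitrary choice:

```python
# Placeholder table: NOT the real PNUC-3 values, except the two quoted ones.
PNUC3 = {"cga": 8.3, "tcg": 8.3}
NEUTRAL = 5.0                        # assumed default score for missing entries

def bendability_profile(seq, window=7):
    """Score each trinucleotide, then average over a sliding window
    (7 positions here, roughly one pitch of the helix; illustrative only)."""
    raw = [PNUC3.get(seq[i:i + 3], NEUTRAL) for i in range(len(seq) - 2)]
    half = window // 2
    profile = []
    for i in range(len(raw)):
        chunk = raw[max(0, i - half): i + half + 1]
        profile.append(sum(chunk) / len(chunk))
    return profile

print(bendability_profile("gtgaactatcat"))   # the example sequence of the slide
```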
Acknowledgements
Labo «Statistique et Génome»: Christophe AMBROISE, Maurice BAUDRY, Étienne BIRMELÉ, Cécile COT, Emmanuelle DELLA-CHIESA, Mark HOEBEKE, Mickaël GUEDJ, François KÉPÈS, Sophie LÈBRE, Catherine MATIAS, Vincent MIELE, Florence MURI-MAJOUBE, Grégory NUEL, Franck PICARD, Hugues RICHARD, Anne-Sophie TOCQUET, Nicolas VERGNE; Sec.: Michèle ILBERT
Labo MIG – INRA: Philippe BESSIÈRE, François RODOLPHE, Sophie SCHBATH, Élisabeth de TURCKHEIM
Labo AGRO: Jean-Noël BACRO, Jean-Jacques DAUDIN, Stéphane ROBIN
Lab' Rouen: Dominique CELLIER, Sabine MERCIER