Uncovering Sequence Mysteries With Hidden Markov Models
Cédric Notredame (28/07/2017)

Our Scope
- Look once under the hood
- Understand the principle of HMMs
- Understand HOW HMMs are used in biology

Outline
- Reminder of Bayesian probabilities
- HMMs and Markov chains
- Application to gene prediction
- Application to transmembrane (TM) predictions
- Application to domain/protein family prediction
- Future applications

Conditional Probabilities and Bayes' Theorem

"I now send you an essay which I have found among the papers of our deceased friend Mr Bayes, and which, in my opinion, has great merit... In an introduction which he has writ to this Essay, he says, that his design at first in thinking on the subject of it was, to find out a method by which we might judge concerning the probability that an event has to happen, in given circumstances, upon supposition that we know nothing concerning it but that, under the same circumstances, it has happened a certain number of times, and failed a certain other number of times."

Reference: "The Durbin" (Durbin, Eddy, Krogh and Mitchison, Biological Sequence Analysis).

What is a Probabilistic Model?
A dice is a probabilistic model:
- Each possible outcome has a probability (1/6).
- Biological questions: what kind of dice would generate coding DNA? Non-coding DNA?

Which Parameters?
Parameters: the probability of each outcome.
- A priori estimation: 1/6 for each number, OR
- Through observation: measure frequencies on a large number of events.

Which Parameters?
Model: intra-/extra-cellular proteins.
1- Make a set of "inside" proteins using annotation.
2- Make a set of "outside" proteins using annotation.
3- COUNT frequencies on the two sets.
Model accuracy depends on the training set.

Maximum Likelihood Models
Model: intra-/extra-cellular proteins.
1- Make a training set.
2- Count frequencies.
Maximum likelihood model: the model probability MAXIMISES the data probability.

Maximum Likelihood Models
In a maximum likelihood model:
- P(Data | Model) is maximised (| means GIVEN!)
- P(Model | Data) is maximised

Maximum Likelihood Models
Data: 11121112221212122121112221112121112211111
P(Coin | Data) > P(Dice | Data)
(The data contains only 1s and 2s, so a two-outcome coin model explains it better than a six-outcome dice.)
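A maximum likelihood model is just frequency counting. Below is a minimal sketch, not part of the original slides, with an invented training set; it simply treats the observed frequencies as probabilities, as described in the "Which Parameters?" slides.

```python
# Maximum likelihood estimation by counting (illustrative data).
from collections import Counter

def ml_estimate(observations):
    """Estimate P(outcome) as its frequency in the training set."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {outcome: n / total for outcome, n in counts.items()}

# A sequence of 1s and 2s: the estimate looks like a coin, not a dice.
rolls = [1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 2, 1]
print(ml_estimate(rolls))  # {1: 0.55, 2: 0.45}
```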
Conditional Probabilities
The probability that something happens IF something else ALSO happens:
P(Win lottery | Participation)

Conditional Probability
Two dice:
P(6 | Dice 1) = 1/6
P(6 | Dice 2) = 1/2 -- Loaded!

Joint Probability
The probability that something happens AND something else ALSO happens (the comma means AND):
P(6 | D1) = 1/6, P(6 | D2) = 1/2
P(6, D2) = P(6 | D2) * P(D2) = 1/2 * 1/100

Joint Probability
Question: what is the probability of making a 6, given that the loaded dice is used 1% of the time?
P(6) = P(6, DF) + P(6, DL)
     = P(6 | DF) * P(DF) + P(6 | DL) * P(DL)
     = 1/6 * 0.99 + 1/2 * 0.01
     = 0.17 (0.16 for an unloaded dice)
Unsuspected heterogeneity in the training set leads to inaccurate parameter estimation.

Bayes' Theorem
X: model, data, or any event. Y: model, data, or any event.

P(Xi | Y) = P(Y | Xi) * P(Xi) / SUM(i) ( P(Y | Xi) * P(Xi) )

With only two hypotheses, X and its complement ~X (the total X_T = X + ~X):

P(X | Y) = P(Y | X) * P(X) / ( P(Y | X) * P(X) + P(Y | ~X) * P(~X) )
         = P(Y, X) / ( P(Y, X) + P(Y, ~X) )
         = P(Y, X) / P(Y)

Bayes' Theorem
P(X | Y) = P(X, Y) / P(Y) = P(Y | X) * P(X) / P(Y)
- P(X | Y): the probability of observing X IF Y is fulfilled.
- P(Y | X) * P(X) = P(X, Y): the probability of observing Y AND X simultaneously.
- Dividing by P(Y) 'removes' P(Y) to get P(X | Y).

Using Bayes' Theorem (the occasionally dishonest casino...)
Question: the dice gave three 6s in a row. IS IT LOADED!!!
We will use Bayes' theorem to test our belief: if the dice was loaded (the model), what would be the probability of this model given the data (three 6s in a row)?
P(D1) = 0.99, P(D2) = 0.01
P(6 | D1) = 1/6, P(6 | D2) = 1/2
Y: 6^3 (three 6s in a row), X: D2

P(D2 | 6^3) = P(6^3 | D2) * P(D2) / ( P(6^3 | D1) * P(D1) + P(6^3 | D2) * P(D2) ) = 0.21

(The denominator sums the two ways of producing 6^3: with D1 and with D2.)
Probably NOT loaded.

Posterior Probability
0.21 is a posterior probability: it was estimated AFTER the data was obtained.
P(6^3 | D2) is the likelihood of the hypothesis.
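The posterior above is easy to verify numerically. This sketch is not from the slides; it implements the occasionally dishonest casino with the priors and emission probabilities given above (the function name is illustrative).

```python
# Bayes' theorem for the occasionally dishonest casino (values from the slides).
def posterior_loaded(n_sixes, p_loaded=0.01, p6_fair=1/6, p6_loaded=0.5):
    """P(loaded dice | n sixes in a row)."""
    joint_fair = (p6_fair ** n_sixes) * (1 - p_loaded)    # P(6^n, D1)
    joint_loaded = (p6_loaded ** n_sixes) * p_loaded      # P(6^n, D2)
    return joint_loaded / (joint_fair + joint_loaded)

print(round(posterior_loaded(3), 2))  # 0.21: probably NOT loaded
print(round(posterior_loaded(6), 2))  # 0.88: six 6s in a row would change our belief
```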
Debunking Headlines
"50% of the crimes are committed by migrants." Question: are 50% of the migrants criminals?
P(Migrant) = 0.1, P(Criminal) = 0.0001, P(M | C) = 0.5
P(C | M) = P(M | C) * P(C) / P(M) = 0.5 * 0.0001 / 0.1 = 0.0005
NO: only 0.05% of migrants are criminals (NOT 50%!).

Debunking Headlines
"50% of gene promoters contain TATA." Question: is TATA a good gene predictor?
P(T) = 0.1, P(P) = 0.0001, P(T | P) = 0.5
P(P | T) = P(T | P) * P(P) / P(T) = 0.5 * 0.0001 / 0.1 = 0.0005
NO.

Bayes' Theorem
Bayes' theorem reveals the trade-off between:
- Sensitivity: finding ALL the genes, and
- Specificity: finding ONLY genes.
TATA = high sensitivity / low specificity.

Markov Chains

What is a Markov Chain?
Simple chain: one dice.
- Each roll is the same.
- A roll does not depend on the previous one.
Markov chain: two dice.
- You only use ONE dice at a time: the fair one OR the loaded one.
- The dice you roll only depends on the previous roll.

What is a Markov Chain?
Biological sequences tend to behave like Markov chains.
Question/example: is it possible to tell whether my sequence is a CpG island?

Old-fashioned solution:
- Slide a window of arbitrary size (the captain's height / pi).
- Measure the % of CpG.
- Plot it against the sequence.
- Decide.
[Figure: a sliding-window average plotted along the sequence]

Bayesian solution:
- Make a CpG Markov chain.
- Run the sequence through the chain.
- What is the likelihood for the chain to produce the sequence?

State Transition Probabilities
[Figure: a chain with four states A, T, C, G; every transition is possible]
Probability of a transition from G to C:
A_GC = P(x_i = C | x_i-1 = G)

P(sequence) = P(x_L, x_L-1, x_L-2, ..., x_1)
Remember: P(X, Y) = P(X | Y) * P(Y)
In the Markov chain, x_L only depends on x_L-1:
P(sequence) = P(x_L | x_L-1) * P(x_L-1 | x_L-2) * ... * P(x_1)
            = P(x_1) * PRODUCT(i=2..L) A_xi-1,xi

[Figure: the A, T, C, G chain with an added beginning state B and end state E]
Arbitrary beginning and end states can be added to the chain. By convention, only the beginning state is added.
Adding an end state with a transition probability T defines length probabilities:
P(all the sequences of length L) = T * (1 - T)^(L-1)
The transitions are probabilities: the sum of the probabilities of all possible sequences of all possible lengths is 1.

Using Markov Chains To Predict

What is a Prediction?
Given a sequence, we want to know the probability that this sequence is a CpG island.
1- We need a training set: CpG+ sequences and CpG- sequences.
2- We measure the transition frequencies and treat them like probabilities:

A+_GC = N+_GC / SUM(X) N+_GX

Transition GC: a G followed by a C (e.g. GCCGCTGCGCGA): the ratio between the number of GC transitions and all the other transitions G->X.

Transition tables (rows: x_i-1, columns: x_i; each row sums to 1):

CpG+ model:
+     A     C     G     T
A   0.18  0.27  0.42  0.12
C   0.17  0.36  0.27  0.18
G   0.16  0.33  0.37  0.12
T   0.08  0.35  0.38  0.18

CpG- model:
-     A     C     G     T
A   0.30  0.21  0.28  0.21
C   0.32  0.30  0.08  0.30
G   0.25  0.25  0.30  0.20
T   0.17  0.24  0.29  0.29
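Step 2 (counting transition frequencies and treating them like probabilities) can be sketched as follows. The code and its tiny two-sequence training set are illustrative, not part of the slides.

```python
# Estimating Markov-chain transition probabilities by counting.
from collections import defaultdict

def train_chain(sequences):
    """Return A[x][y] = P(x_i = y | x_i-1 = x), estimated from counts."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    return {x: {y: n / sum(nxt.values()) for y, n in nxt.items()}
            for x, nxt in counts.items()}

a_plus = train_chain(["GCCGCTGCGCGA", "CGGCGCGC"])  # toy CpG+ training set
print(a_plus["G"])  # the G row: fraction of G followed by each nucleotide
```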
What is a Prediction?
Is my sequence a CpG island?
3- Evaluate the probability for each of the two models to generate our sequence (using the + and - transition tables above):

P(seq | M+) = PRODUCT(i=1..L) A+_xi-1,xi
P(seq | M-) = PRODUCT(i=1..L) A-_xi-1,xi

Using The Log Odds
Is my sequence a CpG island?
4- Measure the log odds:

S(seq) = log( P(seq | M+) / P(seq | M-) )
       = (1/LEN) * SUM(i) log2( A+_xi-1,xi / A-_xi-1,xi )

- The log odds is a confrontation of the two models.
- log2 gives a value in bits (the standard).
- Normalising by the length (1/LEN) gives a less spread-out score distribution.
- Positive: the sequence is more likely than not to be CpG. Negative: more likely NOT to be CpG.

Using The Log Odds
5- Plot the score distribution.
[Figure: histogram of the number of sequences per score, in bits, centred on 0]
Things can go wrong: a bad training set, or bad parameter estimation.

Using The Log Odds
- The Markov chain is a good discriminator.
- Problem: what to do with long sequences that are partly CpG and partly NON CpG?
- How can we make a prediction nucleotide per nucleotide?
- We want to uncover the HIDDEN boundaries.
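Using the + and - tables above, the log-odds score of step 4 takes only a few lines. The scoring function below is a sketch, not from the slides, and assumes complete transition tables with no zero entries.

```python
# Length-normalised log-odds score S(seq), in bits, with the slide tables.
import math

A_PLUS = {  # CpG+ transitions (rows: from, columns: to)
    "A": {"A": 0.18, "C": 0.27, "G": 0.42, "T": 0.12},
    "C": {"A": 0.17, "C": 0.36, "G": 0.27, "T": 0.18},
    "G": {"A": 0.16, "C": 0.33, "G": 0.37, "T": 0.12},
    "T": {"A": 0.08, "C": 0.35, "G": 0.38, "T": 0.18},
}
A_MINUS = {  # CpG- transitions
    "A": {"A": 0.30, "C": 0.21, "G": 0.28, "T": 0.21},
    "C": {"A": 0.32, "C": 0.30, "G": 0.08, "T": 0.30},
    "G": {"A": 0.25, "C": 0.25, "G": 0.30, "T": 0.20},
    "T": {"A": 0.17, "C": 0.24, "G": 0.29, "T": 0.29},
}

def log_odds(seq):
    """Positive: more likely than not to be CpG; negative: more likely not."""
    s = sum(math.log2(A_PLUS[p][c] / A_MINUS[p][c]) for p, c in zip(seq, seq[1:]))
    return s / len(seq)

print(round(log_odds("GCCGCTGCGCGA"), 2))  # about 0.51 bits per nucleotide: CpG-like
```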
Hidden Markov Models

Simple chain: one dice.
- Each roll is the same.
- A roll does not depend on the previous one.
Markov chain: two dice.
- You only use ONE dice at a time: the fair one OR the loaded one.
- The dice you roll only depends on the previous roll.
Hidden Markov model: switching dice.
- If you are cheating, you want to switch dice WITHOUT TELLING!
- The MODEL switch is HIDDEN.

Using HMMs
Question: I want to find the CpG boundaries.
The chain had four symbols: A, G, C, T.
The model has eight states: A+, A-, G+, G-, C+, C-, T+, T-.
There is no one-to-one correspondence between symbols and states: the state of each symbol is hidden. An A can be either in A+ or A-.

Using HMMs
1- Define the model topology.
EVERY transition between the eight states {A+, G+, C+, T+, A-, G-, C-, T-} is possible; a C+ to G- transition costs more.

Using HMMs
2- Parameterise the model: count frequencies (the + and - tables above). We also need the transition probabilities between the + and - states.

Using HMMs
3- FORCE the model to emit your sequence: Viterbi.
One can use the model to emit any sequence. This sequence is named a PATH (p) because it is a walk through the model, e.g.:
p = G+ C+ G+ C+ T+ C+ C+ C- C- G- T- ...

The path with the occasionally dishonest casino:
- Switch dice: a transition. A_L,F = P(p_i = L | p_i-1 = F).
- Roll the dice: an emission. A state emits a symbol with a probability: E_L(6) = P(x_i = 6 | p_i = L) = 0.5.

Two states, Fair and Loaded, with their emission probabilities:
Fair:   1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6
Loaded: 1: 0.10, 2: 0.10, 3: 0.10, 4: 0.10, 5: 0.10, 6: 0.50

In the CpG model: 8 states (A+, G+, C+, T+, A-, G-, C-, T-), 1 emission per state.
The path:
- goes from state to state with a transition probability: A_G+,C+ = P(p_i = C+ | p_i-1 = G+)
- at each position it EMITS a symbol with probability 1: E_G+(G) = P(x_i = G | p_i = G+) = 1

We are interested in the joint probability of the PATH p (a chain of states G+, C-, ...) with our sequence X:

P(X, p) = A_0,p1 * PRODUCT(i=1..L) E_pi(x_i) * A_pi,pi+1
(the transition out of the last state is dropped when there is no end state)

Example:
p = C+ G- C- G+
X = C  G  C  G
P(X, p) = A_0,C+ * 1 * A_C+,G- * 1 * A_G-,C- * 1 * A_C-,G+ * 1

To make a prediction we must identify the best-scoring path:
p* = argmax_p P(X, p)
We do this recursively with the VITERBI algorithm.

[Figure: the Viterbi lattice; the eight states A+/-, G+/-, C+/-, T+/- form one column per position of the sequence G C G A ...; each cell keeps its best extension, and the traceback recovers the best path, e.g. G+ C+ G- A-]

Viterbi algorithm (k and l are two states; V_k(i) is the score of the best path for x_1..x_i that finishes in state k at position i):
Initialisation: V_0(0) = 1; V_k(0) = 0 for every other k.
Recursion (i = 1..L):
  V_l(i) = E_l(x_i) * max_k( V_k(i-1) * A_kl )
  ptr_i(l) = argmax_k( V_k(i-1) * A_kl )
Termination: P(x, p*) = max_k( V_k(L) * A_k0 )

Multiplying probabilities can cause an underflow problem. Usually, probability multiplications are replaced with log additions: log(a*b) = log(a) + log(b).
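A minimal sketch of the Viterbi recursion above, in log space as just suggested. It is not from the slides: the dice-switching probabilities and the 0.99/0.01 start distribution are assumed for illustration, while the emission tables are the ones given above.

```python
# Viterbi decoding of the occasionally dishonest casino, in log space.
import math

A = {"F": {"F": 0.95, "L": 0.05},   # assumed dice-switching probabilities
     "L": {"F": 0.10, "L": 0.90}}
E = {"F": {s: 1/6 for s in "123456"},                  # fair dice
     "L": {**{s: 0.10 for s in "12345"}, "6": 0.50}}   # loaded dice
START = {"F": 0.99, "L": 0.01}

def viterbi(x):
    states = list(A)
    V = [{k: math.log(START[k]) + math.log(E[k][x[0]]) for k in states}]
    ptr = []
    for sym in x[1:]:
        col, back = {}, {}
        for l in states:
            best = max(states, key=lambda k: V[-1][k] + math.log(A[k][l]))
            col[l] = V[-1][best] + math.log(A[best][l]) + math.log(E[l][sym])
            back[l] = best
        V.append(col)
        ptr.append(back)
    state = max(states, key=lambda k: V[-1][k])  # termination (no end state)
    path = [state]
    for back in reversed(ptr):                   # traceback of p*
        path.append(back[path[-1]])
    return "".join(reversed(path))

print(viterbi("1266666625"))  # FFLLLLLLFF: the run of 6s maps to Loaded
```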
Using HMMs
Question: I want to know the probability of my sequence given the model.
In theory, you must sum over ALL the possible paths.
In practice, p* is a good approximation.
But... the FORWARD algorithm gives the exact value of P(x).

Viterbi
Initialisation: V_0(0) = 1; V_k(0) = 0 for every other k.
Recursion (i = 1..L): V_l(i) = E_l(x_i) * max_k( V_k(i-1) * A_kl )
Termination: P(x, p*) = max_k( V_k(L) * A_k0 )

Forward
Initialisation: F_0(0) = 1; F_k(0) = 0 for every other k.
Recursion (i = 1..L): F_l(i) = E_l(x_i) * SUM(k) ( F_k(i-1) * A_kl )
Termination: P(x) = SUM(k) ( F_k(L) * A_k0 )

The only difference: Viterbi takes the MAX over the incoming states, Forward takes the SUM.
[Figure: the same lattice as for Viterbi, with the max replaced by a sum over the incoming states]

Posterior Decoding of Hidden Markov Models

Why posterior decoding?
- Viterbi is BRUTAL!!!! It does not associate individual predictions with a probability.
- Question: what is the probability that nucleotide 1300 really is a CpG boundary?
- Answer: the Backward algorithm.

Posterior decoding
P(x, p_i = l): the probability of sequence X WITH position i in state l.
It factorises into a Forward part and a Backward part:
P(x, p_i = l) = P(x_1..x_i, p_i = l) * P(x_i+1..x_L | p_i = l)

Forward
Initialisation: F_0(0) = 1; F_k(0) = 0 for every other k.
Recursion (i = 1..L): F_l(i) = E_l(x_i) * SUM(k) ( F_k(i-1) * A_kl )
Termination: P(x) = SUM(k) ( F_k(L) * A_k0 )

Backward
Initialisation: B_k(L) = A_k0 for every k.
Recursion (i = L-1..1): B_k(i) = SUM(l) ( A_kl * E_l(x_i+1) * B_l(i+1) )
Termination: P(x) = SUM(l) ( A_0l * E_l(x_1) * B_l(1) )

P(p_i = l, x) = F_l(i) * B_l(i) = P(p_i = l | x) * P(x)
=> P(p_i = l | x) = F_l(i) * B_l(i) / P(x)
P(x) can be taken from either termination (Forward at position L, Backward at position 1).

[Figure: the posterior probability P(p_i = l | x) plotted along the sequence, next to a sliding-window average]
Free from the sliding window of arbitrary size!!!!
Posterior decoding is also less sensitive to the parameterisation of the model.

Training HMMs

Training HMMs?
Case 1- A set of annotated data.
Parameters can be estimated on this data, where the PATH is known.
Case 2- NO annotated data, only a model.
- Parameterise the model so that P(Model | Data) is maximal.
- Start with random parameters.
- Iterate using Baum-Welch, Viterbi training, or EM.
Training HMMs is difficult!!!!

What Matters About Hidden Markov Models
Bayes' theorem underlies it all.
- Markov chain: when there is no hidden state.
- Hidden Markov model: when a nucleotide can be in different HIDDEN states.

Three algorithms for HMMs:
- Viterbi: makes the state assignments, i.e. the prediction.
- Forward: evaluates the sequence probability under the considered model.
- Backward and posterior decoding: evaluate the probability of the prediction; window-free.
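The Forward and Backward recursions recapped above can be sketched the same way. The code below is illustrative, not from the slides; it reuses the assumed casino parameters of the Viterbi sketch (redefined here so the block stands alone) and, since that toy model has no explicit end state, terminates with a plain sum.

```python
# Forward-Backward posterior decoding of the occasionally dishonest casino.
A = {"F": {"F": 0.95, "L": 0.05},   # assumed dice-switching probabilities
     "L": {"F": 0.10, "L": 0.90}}
E = {"F": {s: 1/6 for s in "123456"},
     "L": {**{s: 0.10 for s in "12345"}, "6": 0.50}}
START = {"F": 0.99, "L": 0.01}

def posterior(x):
    states = list(A)
    # Forward: fwd[i][l] = P(x_1..x_i, p_i = l)
    fwd = [{l: START[l] * E[l][x[0]] for l in states}]
    for sym in x[1:]:
        fwd.append({l: E[l][sym] * sum(fwd[-1][k] * A[k][l] for k in states)
                    for l in states})
    px = sum(fwd[-1].values())  # P(x): termination without an end state
    # Backward: bwd[i][l] = P(x_i+1..x_L | p_i = l)
    bwd = [{l: 1.0 for l in states}]
    for sym in reversed(x[1:]):
        bwd.insert(0, {k: sum(A[k][l] * E[l][sym] * bwd[0][l] for l in states)
                       for k in states})
    # Posterior: P(p_i = l | x) = F_l(i) * B_l(i) / P(x)
    # (fine for toy sequences; long sequences need scaling or log space)
    return [{l: fwd[i][l] * bwd[i][l] / px for l in states}
            for i in range(len(x))]

for i, p in enumerate(posterior("1266666625"), start=1):
    print(i, round(p["L"], 2))  # per-position probability of the Loaded state
```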
Applications of HMMs

What To Do with an HMM?
Transmembrane domain predictions: TMHMM
www.cbs.dtu.dk/services/TMHMM/

What To Do with an HMM?
RNA structure prediction / fold recognition:
SCFG, Stochastic Context-Free Grammars (Sean Eddy)

What To Do with an HMM?
Gene prediction: the state of the art uses HMMs.
GeneMark: prokaryotes. GenScan: eukaryotes.

GeneMark
A typical HMM for coding DNA:
[Figure: a begin state S and an end state E flanking two banks of 64 codon states; each codon state carries emission probabilities, e.g. glycine GGG 0.02, GGA 0.00, GGT 0.6, GGC 0.38; tryptophan TGG 1.00]
- Emission: codon frequency.
- Transition: dipeptide frequency.

GeneMark HMM
An HMM of order 5: the 6th nucleotide depends on the 5 previous ones.
It takes into account the codon bias AND the dipeptide composition:
P(seq GGG-TGG | Model) = P(GGG) * P(GGG -> TGG) * P(TGG)
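The GeneMark-style calculation above can be sketched as a codon-level chain. The numbers below are toy stand-ins (only GGG 0.02 and TGG 1.00 come from the slide figure; the transition value is assumed), and real GeneMark parameters are trained on genomic data.

```python
# Scoring a coding sequence codon by codon, as in the GGG-TGG example above.
CODON_P = {"GGG": 0.02, "TGG": 1.00}   # P(codon); TGG = the only Trp codon
DIPEP_P = {("GGG", "TGG"): 0.01}       # P(codon -> next codon): assumed toy value

def coding_probability(seq):
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    p = CODON_P[codons[0]]
    for prev, cur in zip(codons, codons[1:]):
        p *= DIPEP_P[(prev, cur)] * CODON_P[cur]   # transition * emission
    return p

# P(GGG) * P(GGG -> TGG) * P(TGG) = 0.02 * 0.01 * 1.00
print(coding_probability("GGGTGG"))  # 0.0002
```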
What To Do with an HMM?
Family and domain identification: Pfam, Smart, Prosite Profiles.

What To Do with an HMM?
Bayesian phylogenetic inference: MrBayes
morphbank.ebc.uu.se/mrbayes/manual.php
[Figure: a toy phylogenetic tree of chite, wheat, trybr, mouse]

What To Do with an HMM?
Metabolic networks: Bayesian networks
www.cs.huji.ac.il/~nirf/

Collections Of Domain HMMs

What is a Domain HMM?
Built and used with SAM, HMMER, PFtools.
[Figure: a profile HMM with match, insert and delete states; each match state carries its own emission probabilities]

Using Domain HMMs
Question: I want to compare my HMM with all the sequences in SwissProt.
This requires an adapted Viterbi: a pair-HMM. Very similar to dynamic programming.

Using Domain HMMs
Question: what are the available collections of pre-computed HMMs?
InterPro unites many collections: it is built on the idea of domains, as a federation of databases.

Using InterPro: Asking a Question
Which domains does the oncogene FosB contain?

Finding Domains
How can I be sure that the domain prediction for my protein is real?
Use the EMBnet pfscan.

Posterior Decoding With EMBnet pfscan
[Figure: pfscan output; posterior decoding highlights an important position that is well conserved in our sequence (prior vs posterior)]

The Inside Of Pfam
A typical Pfam domain is built and distributed with the HMMER package.

Going Further: Building and Using HMMs
HMMer2: hmmer.wustl.edu/
Used to create and distribute Pfam.
PFtools: www.isrec.isb-sib.ch/ftp-server/pftools/
Used to create and distribute Prosite.
SAM T02: www.cse.ucsc.edu/research/compbio/sam.html
EMBOSS online: www.hgmp.mrc.ac.uk/SOFTWARE/EMBOSS
Jemboss: a Java applet interacting with an EMBOSS server.
EMBASSY (HMMer): the HMMER tools wrapped inside EMBOSS.

In The End: Markov Uncovered
HMMs and Markov chains power:
- Domain collections: profile HMMs, generalized profiles, interactive tools.
- Gene prediction.
- Bayesian phylogenetic inference.