6.1 HMM Introduction

Uncovering Sequence Mysteries With Hidden Markov Models
Cédric Notredame
Cédric Notredame (28/07/2017)
Our Scope
Look once Under the Hood
Understand the principle of HMMs
Understand HOW HMMs are used in Biology
Outline
-Reminder of Bayesian Probabilities
-HMMs and Markov Chains
-Application to gene prediction
-Application to TM (transmembrane) predictions
-Application to Domain/Prot Family Prediction
-Future Applications
Conditional Probabilities and Bayes Theorem
I now send you an essay which I have
found among the papers of our
deceased friend Mr Bayes, and which,
in my opinion, has great merit... In an
introduction which he has writ to this
Essay, he says, that his design at first
in thinking on the subject of it was, to
find out a method by which we might
judge concerning the probability that
an event has to happen, in given
circumstances, upon supposition that
we know nothing concerning it but
that, under the same circumstances, it
has happened a certain number of
times, and failed a certain other
number of times.
“The Durbin…” (Durbin, Eddy, Krogh & Mitchison, Biological Sequence Analysis)
What is a Probabilistic Model ?
Dice = Probabilistic Model
-Each Possible outcome has a probability (1/6)
-Biological Questions:
-What kind of dice would generate coding DNA?
-And non-coding DNA?
Which Parameters ?
Dice = Probabilistic Model
Parameters: probability of each outcome
-A Priori estimation: 1/6 for each Number
OR
-Through Observation:
-measure frequencies on a large number
of events
Which Parameters ?
Parameters: probability of each outcome
Model: Intra/Extra Protein
1- Make a set of Inside Proteins using annotation
2- Make a set of Outside Proteins using annotation
3- COUNT Frequencies on the two sets
Model Accuracy depends on the Training Set
Maximum Likelihood Models
Model: Intra/Extra Proteins
1- Make training set
2- Count Frequencies
Model Accuracy depends on the Training Set
Maximum Likelihood Model:
Model probability MAXIMISES Data probability
Maximum Likelihood Models
Model: Intra/Extra-Cell Proteins
Maximum Likelihood Model
AND
Model Probability MAXIMISES Data Probability
Data Probability MAXIMISES Model Probability
P(Model | Data) is Maximised
(| means GIVEN!)
Maximum Likelihood Models
Model: Intra/Extra-Cell Proteins
Maximum Likelihood Model
AND
Model Probability MAXIMISES Data Probability
Data Probability MAXIMISES Model Probability
P(Model | Data) is Maximised
P(Data | Model) is Maximised
Maximum Likelihood Models
Model: Intra/Extra-Cell Proteins
Maximum Likelihood Model
Data: 11121112221212122121112221112121112211111
P(Coin | Data) < P(Dice | Data)
Conditional Probabilities
Conditional Probabilities
The Probability that something happens
IF
something else
ALSO
Happens
P(Win Lottery | Participation)
Conditional Probability
The Probability that something happens
IF
something else
ALSO
Happens
Dice 1: P(6 | Dice 1) = 1/6
Dice 2 (Loaded!): P(6 | Dice 2) = 1/2
Joint Probability
The Probability that something happens
AND
something else
ALSO
Happens
P(6 | D1) = 1/6
P(6 | D2) = 1/2

P(6, D2) = P(6 | D2) * P(D2) = 1/2 * 1/100
(the comma means AND)
Joint Probability
Question: What is the probability of Making a 6,
given that the Loaded Dice is used 1% of the time
P(6) = P(6, DF) + P(6, DL)
     = P(6 | DF) * P(DF) + P(6 | DL) * P(DL)
     = 1/6 * 0.99 + 1/2 * 0.01
     = 0.17
(vs. 1/6 ≈ 0.167 for an unloaded dice)
Joint Probability
P(6) = P(6, DF) + P(6, DL)
     = P(6 | DF) * P(DF) + P(6 | DL) * P(DL)
     = 1/6 * 0.99 + 1/2 * 0.01
     = 0.17
(vs. 1/6 ≈ 0.167 for an unloaded dice)

Unsuspected Heterogeneity In the training set
⇒ Inaccurate Parameter Estimation
Bayes Theorem
X : Model or Data or any Event
Y : Model or Data or any Event
P(Xi | Y) = P(Y | Xi) * P(Xi) / Σi [ P(Y | Xi) * P(Xi) ]
Bayes Theorem
X : Model or Data or any Event
Y : Model or Data or any Event
XT = X + X̄   (X̄: not X)

P(X | Y) = P(Y | X) * P(X) / [ P(Y | X) * P(X) + P(Y | X̄) * P(X̄) ]
         = P(Y | X) * P(X) / [ P(Y, X) + P(Y, X̄) ]
         = P(Y | X) * P(X) / P(Y)
Bayes Theorem
X : Model or Data or any Event
Y : Model or Data or any event
P(X | Y) = P(Y | X) * P(X) / P(Y)

-P(X | Y): Proba of Observing X IF Y is fulfilled
-P(Y | X) * P(X): Proba of Observing Y AND X simultaneously
-‘Remove’ (divide by) P(Y) to Get P(X | Y)
Bayes Theorem
X : Model or Data or any Event
Y : Model or Data or any event
P(X | Y) = P(X, Y) / P(Y)

-P(X | Y): Proba of Observing X IF Y is fulfilled
-P(X, Y): Proba of Observing Y and X simultaneously
-‘Remove’ (divide by) P(Y) to Get P(X | Y)
Using Bayes Theorem
Question: The dice gave three 6s in a row.
IS IT LOADED !!!
We will use Bayes Theorem to test our belief:
if the Dice was loaded (model), what would be the
probability of this Model given the data (three 6s in a row)?
Using Bayes Theorem
Question: The dice gave three 6s in a row
IS IT LOADED !!!
P(D1) = 0.99, P(D2) = 0.01
P(6 | D1) = 1/6, P(6 | D2) = 1/2

(the Occasionally Dishonest Casino…)
Using Bayes Theorem
Question: The dice gave three 6s in a row
IS IT LOADED !!!

P(D1) = 0.99, P(D2) = 0.01
P(6 | D1) = 1/6, P(6 | D2) = 1/2

P(X | Y) = P(Y | X) * P(X) / P(Y),  with Y: 6³ and X: D2

P(D2 | 6³) = P(6³ | D2) * P(D2) / [ P(6³ | D1) * P(D1) + P(6³ | D2) * P(D2) ]

(the denominator: 6³ with D1, plus 6³ with D2)
Using Bayes Theorem
Question: The dice gave three 6s in a row
IS IT LOADED !!!

P(D1) = 0.99, P(D2) = 0.01
P(6 | D1) = 1/6, P(6 | D2) = 1/2

P(X | Y) = P(X, Y) / P(Y)

P(D2 | 6³) = P(6³ | D2) * P(D2) / [ P(6³ | D1) * P(D1) + P(6³ | D2) * P(D2) ] = 0.21

Probably NOT
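The 0.21 can be checked with a few lines of Python (a sketch using the slide's priors and likelihoods):

```python
# Bayes theorem for the occasionally dishonest casino:
# P(D2 | 6^3) = P(6^3 | D2)*P(D2) / [ P(6^3 | D1)*P(D1) + P(6^3 | D2)*P(D2) ]
p_d1, p_d2 = 0.99, 0.01        # priors: fair dice D1, loaded dice D2
p6_d1, p6_d2 = 1 / 6, 1 / 2    # P(6 | D1), P(6 | D2)

lik_d1 = p6_d1 ** 3            # P(6^3 | D1): three independent rolls
lik_d2 = p6_d2 ** 3            # P(6^3 | D2)

posterior = lik_d2 * p_d2 / (lik_d1 * p_d1 + lik_d2 * p_d2)
print(round(posterior, 2))     # 0.21
```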
Posterior Probability
Question: The dice gave three 6s in a row
IS IT LOADED !!!

P(D2 | 6³) = P(6³ | D2) * P(D2) / [ P(6³ | D1) * P(D1) + P(6³ | D2) * P(D2) ] = 0.21

0.21 is a posterior probability: it was estimated
AFTER the Data was obtained
P(6³ | D2) is the likelihood of the Hypothesis
Debunking Headlines
50% of the crimes are committed by Migrants.
Question: Are 50% of the Migrants Criminals??.
P(Migrant) =0.1
P(Criminal) =0.0001
P(M¦C)=0.5
P(C | M) = P(M | C) * P(C) / P(M) = 0.5 * 0.0001 / 0.1 = 0.0005

NO: only 0.05% of Migrants are Criminals (NOT 50%!)
Debunking Headlines
50% of Gene Promoters contain TATA.
Question: Is TATA a good gene predictor?
P(T)=0.1
P(P)=0.0001
P(T¦P)=0.5
P(P | T) = P(T | P) * P(P) / P(T) = 0.5 * 0.0001 / 0.1 = 0.0005

NO
Bayes Theorem
Bayes Theorem Reveals the Trade-off
Between
Sensitivity: Finding ALL the genes
and
Specificity: Finding ONLY genes
TATA=High Sensitivity / Low Specificity
Markov Chains
What is a Markov Chain ?
Simple Chain: One Dice
-Each Roll is the same
-A Roll does not depend on the previous
Markov Chain: Two Dice
-You only use ONE dice: the fair OR the loaded
-The Dice you roll only depends on the previous roll
What is a Markov Chain ?
Biological Sequences Tend To Behave like Markov Chains
Question/Example
Is it possible to Tell Whether my sequence is
a CpG island ???
What is a Markov Chain ?
Question:
Identify CpG Island sequences
Old Fashion Solution
-Slide a Window of Arbitrary size
-Measure the % of CpG
-Plot it against the sequence
-Decide
Sliding Window Methods

[Figure: CpG content averaged over a Sliding Window, plotted along the sequence]
What is a Markov Chain ?
Question:
Identify CpG Island sequences
Bayesian Solution
-Make a CpG Markov Chain
-Run the sequence through the Chain
-Likelihood for the chain to produce the sequence?
[Figure: a four-State Markov chain over A, T, C, G; arrows are Transitions between states]

Transition Probabilities
Probability of a Transition from G to C:
A_GC = P(X_i = C | X_(i-1) = G)
P(sequence) = P(X_L, X_(L-1), X_(L-2), …, X_1)

Remember: P(X, Y) = P(X | Y) * P(Y)

In the Markov Chain, X_L only depends on X_(L-1):

P(sequence) = P(X_L | X_(L-1)) * P(X_(L-1) | X_(L-2)) * … * P(X_1)

With A_GC = P(X_i = C | X_(i-1) = G):

P(sequence) = P(x_1) * Π_(i=2..L) A_(x_(i-1) x_i)
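In code, the product formula reads as follows (a minimal sketch: the transition table A and the uniform P(x1) = 0.25 are made-up illustration values, not the real CpG numbers):

```python
# P(sequence) = P(x1) * PRODUCT over i=2..L of A[x_{i-1}][x_i]
A = {  # illustrative transition probabilities (each row sums to 1)
    "A": {"A": 0.3, "C": 0.2, "G": 0.3, "T": 0.2},
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "G": {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2},
    "T": {"A": 0.3, "C": 0.2, "G": 0.3, "T": 0.2},
}

def chain_probability(seq, transitions, p_start=0.25):
    """Probability that the Markov chain generates seq."""
    p = p_start  # P(x1), here uniform over A, C, G, T
    for prev, cur in zip(seq, seq[1:]):
        p *= transitions[prev][cur]
    return p

print(chain_probability("GCGC", A))  # 0.25 * 0.4 * 0.3 * 0.4 = 0.012
```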
[Figure: the A, T, C, G chain with an added Begin state B]

Arbitrary Beginning and End States can be added
To The Chain.
By Convention, Only the Beginning State is added
[Figure: the chain with Begin (B) and End (E) states]

Adding An End State with a Transition Proba T
Defines Length probabilities:
P(all the sequences of length L) = T * (1 - T)^(L-1)
[Figure: the chain with Begin (B) and End (E) states]

The transitions are probabilities:
the sum of the probabilities of all the
possible Sequences of all possible Lengths
is 1
Using Markov Chains To Predict
What is a Prediction
Given a sequence, we want to know the
probability that this sequence is a CpG island.
1-We need a training set:
-CpG+ sequences
-CpG- sequences
2-We will Measure the transition frequencies, and
treat them like probabilities
What is a Prediction
Is my sequence a CpG ???
2-We will Measure the transition frequencies, and
treat them like probabilities
A⁺_GC = N⁺_GC / Σ_X N⁺_GX

Transition GC: a G followed by a C
(e.g. GCCGCTGCGCGA)

Ratio between the number of GC transitions
and all the other transitions involving G->X
What is a Prediction
Is my sequence a CpG ???
2-We will Measure the transition frequencies, and
treat them like probabilities
+    A     C     G     T
A  0.18  0.27  0.42  0.12
C  0.17  0.36  0.27  0.18
G  0.16  0.33  0.37  0.12
T  0.08  0.35  0.38  0.18

-    A     C     G     T
A  0.30  0.21  0.28  0.21
C  0.32  0.30  0.08  0.30
G  0.25  0.25  0.30  0.20
T  0.17  0.24  0.29  0.29

(rows: X_(i-1); columns: X_i; each row sums to 1)
What is a Prediction
Is my sequence a CpG ???
3-Evaluate the probability for each of these models to
generate our sequence
P(seq | M+) = Π_(i=1..L) A⁺_(x_(i-1) x_i)

P(seq | M-) = Π_(i=1..L) A⁻_(x_(i-1) x_i)

(using the + and - transition tables)
Using The Log ODD
Is my sequence a CpG ???
4-Measure the Log Odd
Log Odd:

S(seq) = (1/LEN) * Log2 [ P(seq | M+) / P(seq | M-) ]
       ≈ (1/LEN) * Σ log2 ( A⁺_(Xi-1,Xi) / A⁻_(Xi-1,Xi) )

-Confrontation of the Two Models…
-Gives a value in bits (standard)
-The 1/LEN normalisation gives a less spread out score distribution
Using The Log ODD
Is my sequence a CpG ???
4-Measure the Log Odd
S(seq) = (1/LEN) * Log2 [ P(seq | M+) / P(seq | M-) ]
       ≈ (1/LEN) * Σ log2 ( A⁺_(Xi-1,Xi) / A⁻_(Xi-1,Xi) )
Positive: more likely than NOT to be CpG
Negative: more likely NOT to be CpG
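As a sketch, the log-odds score can be computed directly from the two transition tables given earlier (the rounded values from the slides; the begin-state term is ignored for simplicity):

```python
from math import log2

# Transition tables from the slides (rows: x_{i-1}, columns: x_i)
PLUS = {   # CpG+ model
    "A": {"A": 0.18, "C": 0.27, "G": 0.42, "T": 0.12},
    "C": {"A": 0.17, "C": 0.36, "G": 0.27, "T": 0.18},
    "G": {"A": 0.16, "C": 0.33, "G": 0.37, "T": 0.12},
    "T": {"A": 0.08, "C": 0.35, "G": 0.38, "T": 0.18},
}
MINUS = {  # CpG- model
    "A": {"A": 0.30, "C": 0.21, "G": 0.28, "T": 0.21},
    "C": {"A": 0.32, "C": 0.30, "G": 0.08, "T": 0.30},
    "G": {"A": 0.25, "C": 0.25, "G": 0.30, "T": 0.20},
    "T": {"A": 0.17, "C": 0.24, "G": 0.29, "T": 0.29},
}

def log_odds(seq):
    """Length-normalised log-odds score in bits: positive means CpG-like."""
    s = sum(log2(PLUS[p][c] / MINUS[p][c]) for p, c in zip(seq, seq[1:]))
    return s / len(seq)

print(log_odds("CGCGCGCG"))  # positive: looks like a CpG island
print(log_odds("ATATATAT"))  # negative: does not
```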
Using The Log ODD
Is my sequence a CpG ???
5-Plot the score distribution

[Figure: histogram of scores (x-axis: Bits, y-axis: N seq), the two sets separating around 0]
Using The Log ODD
Is my sequence a CpG ???
5-Plot the score distribution

[Figure: overlapping score distributions around 0 (x-axis: Bits, y-axis: N seq)]

Things can go Wrong:
-bad training set
-bad parameter estimation
Using The Log ODD
Is my sequence a CpG ???
-The Markov Chain is a Good discriminator
-Problem: What to do with long sequences That
are partly CpG, and partly NON CpG ???
-How Can we make a prediction Nucleotide per
Nucleotide??
-We want to uncover the HIDDEN Boundaries
Hidden Markov Models
Simple Chain: One Dice
-Each Roll is the same
-A roll does not depend on the previous
Markov Chain: Two Dice
-You only use ONE dice: the fair OR the loaded
-The Dice you roll only depends on the previous roll
Hidden Markov Model: Switching Dice
-If you are Cheating, You want to switch Dice
Without Telling!
-The MODEL Switch is HIDDEN
Using HMMS
Question: I want to find the CpG boundaries
The chain had four symbol AGCT
The Model has eight states:
A+, A-, G+, G-, C+, C-, T+, T-
There is no 1-to-1 correspondence symbol/state:
The state of each symbol is hidden.
A can either be in A+ or A-.
Using HMMs
Question: I want to find the CpG boundaries
1-Define the model topology
EVERY transition is possible

[Figure: eight states A+, G+, C+, T+, A-, G-, C-, T-, all interconnected]

Some transitions (e.g. C+ TO G-) cost more
Using HMMs
Question: I want to find the CpG boundaries
2-Parameterise the model: count frequencies…
+    A     C     G     T
A  0.18  0.27  0.42  0.12
C  0.17  0.36  0.27  0.18
G  0.16  0.33  0.37  0.12
T  0.08  0.35  0.38  0.18

-    A     C     G     T
A  0.30  0.21  0.28  0.21
C  0.32  0.30  0.08  0.30
G  0.25  0.25  0.30  0.20
T  0.17  0.24  0.29  0.29

We also Need the + to - (and - to +) transitions
Using HMMs
Question: I want to find the CpG boundaries
3-FORCE the model to emit your sequence: Viterbi
One can use the model to emit any sequence.
The sequence of states is named a PATH (π) because it is a
walk through the model:
G+ C+ G+ C+ T+ C+ C+ C- C- G- T- ….
Using HMMs
Question: I want to find the CpG boundaries
3-FORCE the model to emit your sequence: Viterbi
The path with the occasionally dishonest Casino:

-Switch Dice: Transition
 A_(L,F) = P(π_i = L | π_(i-1) = F)

-Roll The Dice: Emission
 The state L emits a symbol with a proba:
 P(emit 6 with L) = E_L(6) = P(X_i = 6 | π_i = L) = 0.5

Two States: Fair and Loaded

Fair:    1: 0.16  2: 0.16  3: 0.16  4: 0.16  5: 0.16  6: 0.16
Loaded:  1: 0.10  2: 0.10  3: 0.10  4: 0.10  5: 0.10  6: 0.50
-Switch Dice: Transition
 A_(L,F) = P(π_i = L | π_(i-1) = F)

-Roll The Dice: Emission
 (emissions of L with their proba)
 P(emit 6 with L) = E_L(6) = P(X_i = 6 | π_i = L) = 0.5

Fair:    1: 0.16  2: 0.16  3: 0.16  4: 0.16  5: 0.16  6: 0.16
Loaded:  1: 0.10  2: 0.10  3: 0.10  4: 0.10  5: 0.10  6: 0.50
[Figure: the eight CpG states A+, G+, C+, T+, A-, G-, C-, T-]

8 STATES, 1 EMISSION per State
Using HMMs
Question: I want to find the CpG boundaries
3-FORCE the model to emit your sequence: Viterbi
The path:
-goes from state to state with a proba:
 A_(G+,C+) = P(π_i = C+ | π_(i-1) = G+)
-in each state, it EMITS a symbol with proba 1:
 P(emit G) = E_(G+)(G) = P(X_i = G | π_i = G+) = 1
Using HMMs
Question: I want to find the CpG boundaries
3-FORCE the model to emit your sequence: Viterbi
We are interested in the joint probability of the
PATH π (chain of G+, C-…) with our Sequence X:

P(X, π) = A_(0,π1) * Π_(i=1..L) [ E_(πi)(X_i) * A_(πi,πi+1) ]
Using HMMs
Question: I want to find the CpG boundaries
3-FORCE the model to emit your sequence: Viterbi
P(X, π) = A_(0,π1) * Π_(i=1..L) [ E_(πi)(X_i) * A_(πi,πi+1) ]

π = C+ G- C- G+
X = C  G  C  G

P(X, π) = A_(0,C+) * 1 * A_(C+,G-) * 1 * A_(G-,C-) * 1 * A_(C-,G+) * 1
Using HMMs
Question: I want to find the CpG boundaries
3-FORCE the model to emit your sequence: Viterbi
P(X, π) = A_(0,C+) * 1 * A_(C+,G-) * 1 * A_(G-,C-) * 1 * A_(C-,G+) * 1

To Make a prediction
We must Identify the Best Scoring Path:
π* = argmax_π P(X, π)
Using HMMs
Question: I want to find the CpG boundaries
3-FORCE the model to emit your sequence: Viterbi
To Make a prediction
We must Identify the Best Scoring Path:
π* = argmax_π P(X, π)
We do this recursively with the VITERBI Algorithm
[Figure: the Viterbi trellis. Columns: the sequence positions (G, C, G, A, G, C); rows: the eight states A+ … T-. At each position, the best-scoring partial path into each state is kept (G+, then G+C+, then G+C+G-, then G+C+G-A-, …); the Trace Back through the best final cell recovers the optimal path, e.g. G+ C+ G- A- G- C-]
-k and l are two states
-V_k(i): score of the best path x_1…x_i that finishes
 in state k at position i

Initialisation:
V_0(0) = 1, V_k(0) = 0 for every k

Recursion: i = 1..L
V_l(i) = E_l(X_i) * Max_k ( V_k(i-1) * A_kl )
ptr_i(l) = argmax_k ( V_k(i-1) * A_kl )

Termination:
P(X, π*) = Max_k ( V_k(L) * A_k0 )
Initialisation: (k and l are two states)
V_0(0) = 1, V_k(0) = 0 for every k
Recursion: i = 1..L
V_l(i) = E_l(X_i) * Max_k ( V_k(i-1) * A_kl )

Multiplying Probas can cause an underflow problem.
Usually, Proba multiplications are replaced with
Log additions:
log(a*b) = log(a) + log(b)
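A minimal Viterbi in log space for the casino model can look like this (the emissions are the slides' numbers; the transition and start probabilities are ASSUMED for illustration, since the slides do not give them):

```python
from math import log

STATES = ("F", "L")                  # Fair, Loaded
A = {"F": {"F": 0.95, "L": 0.05},    # ASSUMED transition probabilities
     "L": {"F": 0.10, "L": 0.90}}
E = {"F": {s: 1 / 6 for s in "123456"},              # emissions (slides)
     "L": {**{s: 0.1 for s in "12345"}, "6": 0.5}}
START = {"F": 0.5, "L": 0.5}         # ASSUMED start probabilities

def viterbi(rolls):
    """Best state path; log additions replace probability products."""
    V = {k: log(START[k]) + log(E[k][rolls[0]]) for k in STATES}
    ptrs = []
    for x in rolls[1:]:
        new_v, ptr = {}, {}
        for l in STATES:
            best = max(STATES, key=lambda k: V[k] + log(A[k][l]))
            ptr[l] = best
            new_v[l] = V[best] + log(A[best][l]) + log(E[l][x])
        V = new_v
        ptrs.append(ptr)
    state = max(STATES, key=V.get)   # best final state, then trace back
    path = [state]
    for ptr in reversed(ptrs):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))

print(viterbi("12345366666665432"))  # the run of 6s should come out as L
```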
Using HMMs
Question: I want to know the Probability of my
sequence Given The model
In Theory, you must sum over ALL the possible
PATHs. In practice:
π* is a good approximation
Using HMMs
Question: I want to know the Proba of my sequence
Given The model
π* is a good approximation But…
The Forward Algorithm Gives the exact value of P(x)
Viterbi
Initialisation: (k and l are two states)
V_0(0) = 1, V_k(0) = 0 for every k
Recursion: i = 1..L
V_l(i) = E_l(X_i) * Max_k ( V_k(i-1) * A_kl )
Termination:
P(X, π*) = Max_k ( V_k(L) * A_k0 )

Forward
Initialisation: (k and l are two states)
F_0(0) = 1, F_k(0) = 0 for every k
Recursion: i = 1..L
F_l(i) = E_l(X_i) * Σ_k ( F_k(i-1) * A_kl )
Termination:
P(X) = Σ_k ( F_k(L) * A_k0 )

[Figure: the same trellis for both algorithms; Viterbi keeps the Max over the incoming transitions, Forward keeps the Sum (Σ)]
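The change from Viterbi to Forward is tiny in code: the max over previous states becomes a sum. A sketch for the same casino model (transition and start probabilities are assumed, as before):

```python
STATES = ("F", "L")                  # Fair, Loaded
A = {"F": {"F": 0.95, "L": 0.05},    # ASSUMED transition probabilities
     "L": {"F": 0.10, "L": 0.90}}
E = {"F": {s: 1 / 6 for s in "123456"},
     "L": {**{s: 0.1 for s in "12345"}, "6": 0.5}}
START = {"F": 0.5, "L": 0.5}         # ASSUMED start probabilities

def forward(rolls):
    """Exact P(x): F_l(i) = E_l(x_i) * sum over k of F_k(i-1) * A_kl."""
    F = {k: START[k] * E[k][rolls[0]] for k in STATES}
    for x in rolls[1:]:
        F = {l: E[l][x] * sum(F[k] * A[k][l] for k in STATES)
             for l in STATES}
    return sum(F.values())           # sum over ALL paths, not just the best

print(forward("316664"))
```

(With no End state in this sketch, the probabilities of all sequences of a given length sum to 1.)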
Posterior Decoding
of
Hidden Markov Models
Why Posterior Decoding ?
-Viterbi is BRUTAL !!!!
-It does Not Associate Individual Predictions
With a Probability
Question:
What is the probability that Nucleotide 1300 really
is a CpG Boundary ?
ANSWER: The Backward Algorithm
Posterior Decoding ?
Question:
What is the probability that Nucleotide 1300 really
is a CpG Boundary ?
P(X, π_i = l)
= Probability of Sequence X
WITH position i in state l
Posterior Decoding

[Figure: position i splits the sequence; the Forward Algorithm covers X_1…X_i, the Backward Algorithm covers X_(i+1)…X_L, both meeting at π_i = l]

P(X, π_i = l) = P(X_1…X_i, π_i = l) * P(X_(i+1)…X_L | π_i = l)
Forward
Initialisation:
F_0(0) = 1, F_k(0) = 0 for every k
Recursion: i = 1..L
F_l(i) = E_l(X_i) * Σ_k ( F_k(i-1) * A_kl )
Termination:
P(X) = Σ_k ( F_k(L) * A_k0 )

Backward
Initialisation:
B_k(L) = A_k0 for every k
Recursion: i = L-1..1
B_k(i) = Σ_l ( A_kl * E_l(X_(i+1)) * B_l(i+1) )
Termination:
P(X) = Σ_l ( A_0l * E_l(X_1) * B_l(1) )
Forward
Recursion: i = 1..L
F_l(i) = E_l(X_i) * Σ_k ( F_k(i-1) * A_kl )

Backward
Recursion: i = L-1..1
B_k(i) = Σ_l ( A_kl * E_l(X_(i+1)) * B_l(i+1) )

P(π_i = l, X) = P(π_i = l | X) * P(X) = F_l(i) * B_l(i)

P(π_i = l | X) = F_l(i) * B_l(i) / P(X)

(P(X) comes from the Termination of either the Forward or the Backward algorithm)
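A sketch of posterior decoding for the casino model (same assumed transitions and start probabilities; with no End state here, the backward values are initialised to 1):

```python
STATES = ("F", "L")
A = {"F": {"F": 0.95, "L": 0.05},    # ASSUMED transitions
     "L": {"F": 0.10, "L": 0.90}}
E = {"F": {s: 1 / 6 for s in "123456"},
     "L": {**{s: 0.1 for s in "12345"}, "6": 0.5}}
START = {"F": 0.5, "L": 0.5}         # ASSUMED start probabilities

def posterior(rolls):
    """P(pi_i = l | X) = F_l(i) * B_l(i) / P(X) for every position i."""
    # Forward pass
    F = [{k: START[k] * E[k][rolls[0]] for k in STATES}]
    for x in rolls[1:]:
        F.append({l: E[l][x] * sum(F[-1][k] * A[k][l] for k in STATES)
                  for l in STATES})
    p_x = sum(F[-1].values())
    # Backward pass (no End state: B_k(last) = 1)
    B = [{k: 1.0 for k in STATES}]
    for x in reversed(rolls[1:]):
        B.insert(0, {k: sum(A[k][l] * E[l][x] * B[0][l] for l in STATES)
                     for k in STATES})
    return [{k: F[i][k] * B[i][k] / p_x for k in STATES}
            for i in range(len(rolls))]

for i, post in enumerate(posterior("3166666613")):
    print(i, round(post["L"], 2))   # P(Loaded) should rise inside the 6s
```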
[Figure: P(π_i = l | X) plotted along the sequence, next to a Sliding Window average]

Free From The Sliding Window of
Arbitrary Size!!!!

Posterior Decoding is Less Sensitive to the
Parameterisation of the model.
Training HMMs
Training HMMs ?
Case 1-Set of annotated data
Parameters can be estimated on this data where the
PATH is known.
Case 2-NO annotated data and a Model
-Parameterise the model so that P(Model | data) is maximal
-Start with random parameters
-Iterate using Baum-Welch, Viterbi or EM
Training HMMs ?
Difficult !!!!
What Matters
About
Hidden Markov Models
HMM and Markov Chains
Bayes Theorem
-Markov Chain: When There is no Hidden
State
-Hidden Markov Models: When a Nucleotide
can be in different HIDDEN states
Three Algorithms for HMMS
Viterbi: -Make the State assignments
-Predict
Forward: Evaluate the Sequence Probability
under the considered model
Backward and Posterior Decoding:
Evaluate the probability of the prediction
(Window-Free)
Applications
of HMMs
What To Do with an HMM?
Transmembrane domain predictions
www.cbs.dtu.dk/services/TMHMM/
What To Do with an HMM?
RNA structure Prediction/Fold Recognition
SCFG: Stochastic Context-Free Grammars
(Sean Eddy)
What To Do with an HMM?
Gene Prediction
State-of-the-art predictors use HMMs
Genemark: Prokaryotes
GenScan: Eukaryotes
GeneMark
A typical HMM for Coding DNA
[Figure: two codon states between a Start (S) and an End (E) state. Each state emits one of the 64 Codons with its frequency, e.g. for Glycine (G): GGG 0.02, GGA 0.00, GGT 0.60, GGC 0.38; for Tryptophan (W): TGG 1.00]
A Typical HMM for Coding DNA
Emission (codon Frequency)
Transition (Dipeptide)
GeneMark HMM
Order-5 HMM: the 6th Nucleotide depends on the 5 previous ones.
Takes into account Codon Bias AND dipeptide Composition.

P(GGG-TGG | Model)
= P(GGG) * P(GGG -> TGG) * P(TGG)
What To Do with an HMM?
Family and Domain Identification
Pfam
Smart
Prosite Profiles
What To Do with an HMM?
Bayesian Phylogenetic Inference
[Figure: a phylogenetic tree relating chite, wheat, trybr and mouse, from the MrBayes manual]
morphbank.ebc.uu.se/mrbayes/manual.php
What To Do with an HMM?
Metabolic Networks: Bayesian Networks
www.cs.huji.ac.il/~nirf/
Collections Of Domain HMMs
What is a Domain HMM ?
SAM, HMMER, PFtools
[Figure: a profile HMM with its Emission Probabilities]
Using Domain HMMs
Question: I want to Compare my HMM with all the
sequences in SwissProt
Requires an adapted Viterbi: Pair-HMM
Very Similar to Dynamic Programming
Using Domain HMMs
Question: What are the Available Collections
Of Pre-computed HMMs
Interpro unites many collections
Interpro: The Idea of Domains
Interpro: A Federation of Databases
Using InterPro: Asking a question
Which Domains does the oncogene FosB contain?
Using InterPro: Asking a question
Using InterPro: Asking a question
Finding Domains
-How can I be sure that the domain Prediction
of my Protein is real ?
Use the EMBnet pfscan
Using EMBNet PFscan
Posterior Decoding With EMBNet PFscan
[Figure: pfscan output, Prior vs Posterior match scores; an Important Position that is Well conserved in our sequence stands out in the Posterior]
The
Inside
Of
Pfam
A Typical pfam Domain
A Typical pfam Domain
HMMER Package:
Going Further
Building and Using
HMMs
HMMer2:
hmmer.wustl.edu/
Used to create and distribute Pfam
PFtools:
www.isrec.isb-sib.ch/ftp-server/pftools/
Used to create and distribute Prosite
SAM T02
www.cse.ucsc.edu/research/compbio/sam.html
EMBOSS Online
www.hgmp.mrc.ac.uk/SOFTWARE/EMBOSS
Jemboss: a JAVA applet interacting with an EMBOSS
Server
HMMer
EMBASSY
(Hmmer)
In The End:
Markov Uncovered
HMM and Markov Chains
Domain Collections
Gene Prediction
Bayesian Phylogenetic Inference
[Figure: the phylogenetic tree of chite, wheat, trybr and mouse]
HMM and Markov Chains
Domain Collections
Profiles → HMMs → Generalized Profiles
Interactive Tools