Profile HMM - Simple Case

Similar Techniques for Molecular Sequencing and Network Security
Doug Madory
27 APR 05



Big Picture
Protein Structure
Sequencing using Profile HMM
Big Picture

PQS for Network Security (Us)


Design HMM for network event
Find event within linear stream of observed network events
Q: Did an event happen?

Sequencing using Profile HMM (Bioinformatics)


Train HMM using known information about subsequence
Find subsequence within linear protein / genome sequence
Q: If it exists, where is the sequence?
Profile HMM - Simple Case





Train HMM
Viterbi Scoring
Backtrace Viterbi
Query: A-, AA, TA
DB: ATA
HMM Training
Build HMM with 2 M states because there are 2 columns in query
[Figure: profile HMM with states Begin, I0, M1, I1, D1, M2, I2, D2, End; M1 and M2 each emit A, C, G, T]
HMM Training
Step 1 – add pseudocount to each transition and emission
[Figure: the same HMM with a count of 1 on every transition and on every M1 and M2 emission (A, C, G, T)]
HMM Training
Step 2 – train with A-
[Figure: HMM counts after training on A-. Begin->M1 = 2, M1 emits A = 2, M1->D2 = 2, D2->End = 2; all other counts remain 1]
HMM Training
Step 3 – train with AA
[Figure: HMM counts after training on AA. Begin->M1 = 3, M1 emits A = 3, M1->M2 = 2, M2 emits A = 2, M2->End = 2; M1->D2 and D2->End stay at 2; all other counts remain 1]
HMM Training
Step 4 – train with TA
[Figure: HMM counts after training on TA. Begin->M1 = 4, M1 emits A = 3 and T = 2, M1->M2 = 3, M2 emits A = 3, M2->End = 3; M1->D2 and D2->End stay at 2; all other counts remain 1]
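The four training steps above (pseudocounts, then one pass per query row) can be sketched in Python. This is a minimal sketch, not the deck's demo code: the transition topology in `TRANSITIONS` and the helper name `train` are assumptions based on the usual profile HMM layout, and only match states are given emission counts.

```python
ALPHABET = "ACGT"
NUM_COLUMNS = 2  # two match states, one per query column

# Assumed topology: outgoing transitions for each state.
TRANSITIONS = {
    "Begin": ["M1", "I0", "D1"],
    "I0": ["M1", "I0", "D1"],
    "M1": ["M2", "I1", "D2"],
    "I1": ["M2", "I1", "D2"],
    "D1": ["M2", "I1", "D2"],
    "M2": ["End", "I2"],
    "I2": ["End", "I2"],
    "D2": ["End", "I2"],
}


def train(aligned_rows):
    """Return (transition_counts, emission_counts) for aligned query rows."""
    # Step 1: pseudocount of 1 on every transition and match emission.
    trans = {(s, t): 1 for s, targets in TRANSITIONS.items() for t in targets}
    emit = {(f"M{k}", c): 1 for k in range(1, NUM_COLUMNS + 1) for c in ALPHABET}

    # Steps 2-4: walk each row through the model, one column at a time.
    for row in aligned_rows:
        prev = "Begin"
        for col, symbol in enumerate(row, start=1):
            # A gap uses the delete state; a residue uses the match state.
            state = f"D{col}" if symbol == "-" else f"M{col}"
            trans[(prev, state)] += 1
            if symbol != "-":
                emit[(state, symbol)] += 1
            prev = state
        trans[(prev, "End")] += 1
    return trans, emit


trans, emit = train(["A-", "AA", "TA"])
print(trans[("Begin", "M1")], emit[("M1", "A")], emit[("M1", "T")])  # 4 3 2
```

The M1 emission counts sum to 7, which matches the 3/7 emission probability for A used in the Viterbi slides.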
HMM Training
Fully trained HMM
[Figure: final counts. Begin->M1 = 4; M1 emits A = 3, C = 1, G = 1, T = 2; M1->M2 = 3; M1->D2 = 2; M2 emits A = 3, C = 1, G = 1, T = 1; M2->End = 3; D2->End = 2; all other counts remain 1]
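Before scoring, the counts have to be turned into probabilities and then into logs. The sketch below assumes the final counts for the Begin transitions (M1 = 4, I0 = 1, D1 = 1) and the M1 emissions (A = 3, C = 1, G = 1, T = 2), and assumes base-10 logs, which is what the scoring values in this deck appear to use. The helper name `normalize` is hypothetical.

```python
import math

# Assumed final counts, pseudocounts included.
begin_counts = {"M1": 4, "I0": 1, "D1": 1}
m1_emit_counts = {"A": 3, "C": 1, "G": 1, "T": 2}


def normalize(counts):
    """Divide each count by the total so the values sum to 1."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


a_begin = normalize(begin_counts)   # transition probabilities out of Begin
e_m1 = normalize(m1_emit_counts)    # emission probabilities in M1

# Log-space transition scores, base 10.
log_a_begin = {key: math.log10(p) for key, p in a_begin.items()}

print(round(log_a_begin["I0"], 2))  # -0.78
```

Here `a_begin["M1"]` is 4/6 and `e_m1["A"]` is 3/7, the two fractions that appear in the M1 scoring step.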
Viterbi Scoring
Legend: Delete, Insert, Match moves; Illegal Moves
Observations: X, A, T, A
States: B/I0, M1, I1, D1, M2, I2, D2, E
VB = 0
VI0(1) = log aB-I0
VM1(0) = 0
VI1(0) = 0
VD1(0) = log aB-D1
Viterbi Scoring
Observations: X, A, T, A
VB = 0
VI0(1) = -0.78
VM1(0) = 0
VI1(0) = 0
VD1(0) = -0.78
VI0(2) = VI0(1) + log aI0-I0
VI0(3) = VI0(2) + log aI0-I0
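The I0 row is a running sum of log transition scores. A minimal sketch, assuming base-10 logs and the trained probabilities 1/6 for Begin to I0 and 1/3 for the I0 self-loop. The deck's figures appear to cut each log to two decimals before adding, so the exact values below differ from the slide values by up to about 0.02.

```python
import math

# Assumed trained transition probabilities (pseudocounts only on these edges).
log_a_b_i0 = math.log10(1 / 6)   # Begin -> I0
log_a_i0_i0 = math.log10(1 / 3)  # I0 -> I0

# v_i0[i] holds V_I0(i); index 0 is unused because I0 needs one observation.
v_i0 = [None, log_a_b_i0]
for i in (2, 3):
    v_i0.append(v_i0[i - 1] + log_a_i0_i0)

for i in (1, 2, 3):
    print(f"V_I0({i}) = {v_i0[i]:.3f}")
```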
Viterbi Scoring
Observations: X, A, T, A
VB = 0
VI0(1) = -0.78, VI0(2) = -1.25, VI0(3) = -1.72
VM1(0) = 0
VI1(0) = 0
VD1(0) = -0.78
VM1(1) = log (eM1(A)/q) + VB + log aB-M1
VM1(1) = log ((3/7)/(1/4)) + 0 - 0.17
VM1(1) = 0.23 - 0.17 = 0.06
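The match-state score combines a log-odds emission term with the best incoming path. A small sketch of the VM1(1) arithmetic, assuming base-10 logs, a uniform background q = 1/4, and a Begin to M1 transition probability of 4/6 (the value whose log is about -0.17):

```python
import math

e_m1_a = 3 / 7   # emission probability of A in M1
q = 1 / 4        # background probability of any nucleotide (assumed uniform)
a_b_m1 = 4 / 6   # assumed Begin -> M1 transition probability
v_b = 0.0        # score of the Begin state

log_odds = math.log10(e_m1_a / q)
v_m1_1 = log_odds + v_b + math.log10(a_b_m1)

print(round(log_odds, 2), round(v_m1_1, 2))  # 0.23 0.06
```

A positive log-odds term means M1 explains the observed A better than the background model does.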
Viterbi Scoring
Observations: X, A, T, A
VB = 0
VI0(1) = -0.78, VI0(2) = -1.25, VI0(3) = -1.72
VM1(0) = 0, VM1(1) = 0.06
VI1(0) = 0
VD1(0) = -0.78
VI1(1) = 0 + max { VM1(0) + log aM1-I1, VI1(0) + log aI1-I1, VD1(0) + log aD1-I1 }
VI1(1) = 0 + max { 0 + -0.78, 0 + -0.47, -0.78 + -0.47 }
VI1(1) = -0.47
VD1(1) = VI0(1) + log aI0-D1
VD1(1) = -0.78 - 0.47 = -1.25
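The insert and delete updates above follow the same pattern: take the best predecessor score plus its log transition, then add the emission term (0 for an insert state scored against the background). A sketch under assumed trained probabilities (M1 to I1 = 1/6, I1 to I1 = 1/3, D1 to I1 = 1/3, I0 to D1 = 1/3) with base-10 logs. Exact logs are used here, so results can differ from the slide values by about 0.01.

```python
import math

# Scores already filled in for the previous cells.
v_m1_0 = 0.0
v_i1_0 = 0.0
v_d1_0 = math.log10(1 / 6)  # log a(Begin -> D1)
v_i0_1 = math.log10(1 / 6)  # log a(Begin -> I0)

# Assumed trained transition probabilities, as base-10 logs.
log_a_m1_i1 = math.log10(1 / 6)
log_a_i1_i1 = math.log10(1 / 3)
log_a_d1_i1 = math.log10(1 / 3)
log_a_i0_d1 = math.log10(1 / 3)

# Insert state: emission log-odds is 0, then the best of three predecessors.
v_i1_1 = 0.0 + max(
    v_m1_0 + log_a_m1_i1,
    v_i1_0 + log_a_i1_i1,
    v_d1_0 + log_a_d1_i1,
)

# Delete state: no emission, reached here from I0 on the same observation.
v_d1_1 = v_i0_1 + log_a_i0_d1

print(f"V_I1(1) = {v_i1_1:.3f}, V_D1(1) = {v_d1_1:.3f}")
```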
Viterbi Scoring
Observations: X, A, T, A
VB = 0
VI0(1) = -0.78, VI0(2) = -1.25, VI0(3) = -1.72
VM1(0) = 0, VM1(1) = 0.06
VI1(0) = 0, VI1(1) = -0.47
VD1(0) = -0.78, VD1(1) = -1.25
Viterbi Scoring
State   X (0)    A (1)    T (2)    A (3)
B/I0    VB = 0   -0.78    -1.25    -1.72
M1      0        0.06     -1.19    -1.49
I1      0        -0.47    -0.72    -1.19
D1      -0.78    -1.25    -1.72    -1.25
M2      0        -0.47    -0.41    -1.19
I2      0        -1.85    -1.07    -1.01
D2      -1.25    -1.25    -0.58    -1.36
E       VE = -1.31
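The final score VE is the best last-column score plus its transition into End. A sketch of that termination step, using the last-column values listed above and assuming trained End transitions of 3/4 from M2, 1/2 from I2, and 2/3 from D2, with base-10 logs:

```python
import math

# Last-column scores for the three states that can reach End.
last_column = {"M2": -1.19, "I2": -1.01, "D2": -1.36}

# Assumed trained transition probabilities into End.
a_to_end = {"M2": 3 / 4, "I2": 1 / 2, "D2": 2 / 3}

# Candidate score for finishing from each state.
candidates = {
    state: score + math.log10(a_to_end[state])
    for state, score in last_column.items()
}
v_end = max(candidates.values())

print(round(v_end, 2))  # -1.31
```

With these numbers the M2 and I2 candidates differ by less than 0.01, so which one a backtrace picks as the predecessor of End depends on how the intermediate values are rounded.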
Profile HMM - Simple Case

Demo in Python
Big Picture Revisited

PQS for Network Security (Us)


Design HMM for network event
Find event within linear stream of observed network events
Q: Did an event happen?

Sequencing using Profile HMM (Bioinformatics)


Train HMM using known information about subsequence
Find subsequence within linear protein / genome sequence
Q: If it exists, where is the sequence?