Problem 1

A Revealing Introduction to
Hidden Markov Models
Mark Stamp
HMM
1
Hidden Markov Models
 What is a hidden Markov model (HMM)?
   A machine learning technique
   A discrete hill climb technique
 Where are HMMs used?
   Speech recognition
   Malware detection, IDS, etc.
 Why is it useful?
   Efficient algorithms

Markov Chain
 A Markov chain is a “memoryless random process”
 Transitions depend only on
   Current state, and
   Transition probability matrix
 Example on next slide…

Markov Chain
 We are interested in average annual temperature
   Only consider Hot (H) and Cold (C)
 From recorded history, we obtain probabilities
   See diagram
[Diagram: two-state Markov chain with edges H→H 0.7, H→C 0.3, C→H 0.4, C→C 0.6]

Markov Chain
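 Transition probability matrix (rows and columns ordered H, C):
  $A = \begin{pmatrix} 0.7 & 0.3 \\ 0.4 & 0.6 \end{pmatrix}$
 Matrix is denoted as A
 Note, A is “row stochastic”
   Each row is a probability distribution, so it sums to 1
[Diagram: two-state Markov chain with edges H→H 0.7, H→C 0.3, C→H 0.4, C→C 0.6]
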
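A minimal sketch of the “row stochastic” property, using the Hot/Cold chain. The matrix entries are read off the diagram labels; the state order (H, C) and the helper names `is_row_stochastic` and `step` are assumptions made for illustration.

```python
# Transition matrix for the Hot/Cold chain; state order (H, C) is an assumption.
A = [[0.7, 0.3],
     [0.4, 0.6]]


def is_row_stochastic(matrix, tol=1e-12):
    """True if every entry lies in [0, 1] and every row sums to 1."""
    return all(
        all(0.0 <= p <= 1.0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in matrix
    )


def step(dist, matrix):
    """Advance a distribution over states by one transition (dist times matrix)."""
    n = len(matrix)
    return [sum(dist[j] * matrix[j][i] for j in range(n)) for i in range(n)]


start_hot = [1.0, 0.0]              # certainly Hot this year
next_year = step(start_hot, A)      # distribution after one transition
two_years = step(next_year, A)      # roughly [0.61, 0.39]
print(is_row_stochastic(A), next_year, two_years)
```

Because A is row stochastic, each call to `step` returns another probability distribution.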
Markov Chain
 Can also include begin, end states
 Begin state matrix is π
   In this example, π = (0.6, 0.4) for (H, C)
 Note that π is row stochastic
[Diagram: the H/C chain with begin and end states added; begin→H 0.6, begin→C 0.4]

Hidden Markov Model
 HMM includes a Markov chain
 But this Markov process is “hidden”
   Cannot observe the Markov process
 Instead, we observe something related to hidden states
   It’s as if there is a “curtain” between Markov chain and observations
 Example on next slide

HMM Example
 Consider H/C temperature example
 Suppose we want to know H or C temperature in distant past
   Before humans (or thermometers) were invented
   OK if we can just decide Hot versus Cold
 We assume transition between Hot and Cold years is same as today
   That is, the A matrix is same as today

HMM Example
 Temp in past determined by Markov process
 But, we cannot observe temperature in past
 Instead, we note that tree ring size is related to temperature
   Look at historical data to see the connection
 We consider 3 tree ring sizes
   Small, Medium, Large (S, M, L, respectively)
 Measure tree ring sizes and recorded temperatures to determine relationship

HMM Example
 We find that tree ring sizes and temperature are related by a matrix (rows H, C; columns S, M, L)
 This is known as the B matrix
 Note that B is also row stochastic

HMM Example
 Can we now find temps in distant past?
 We cannot measure (observe) temp
 But we can measure tree ring sizes…
 …and tree ring sizes are related to temp
   By the B matrix
 So, we ought to be able to say something about temperature

HMM Notation
 A lot of notation is required
   Notation may be the most difficult part

HMM Notation
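 To simplify notation, observations are taken from the set {0,1,…,M-1}
   That is, $O_t \in V = \{0, 1, \dots, M-1\}$ for $t = 0, 1, \dots, T-1$
 The matrix A = {aij} is N x N, where
  $a_{ij} = P(\text{state } q_j \text{ at } t+1 \mid \text{state } q_i \text{ at } t)$
 The matrix B = {bj(k)} is N x M, where
  $b_j(k) = P(\text{observation } k \text{ at } t \mid \text{state } q_j \text{ at } t)$
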
HMM Example
 Consider our temperature example…
 What are the observations?
   V = {0,1,2}, which corresponds to S,M,L
 What are states of Markov process?
   Q = {H,C}
 What are A,B, π, and T?
   A,B, π on previous slides
   T is number of tree rings measured
 What are N and M?
   N = 2 and M = 3

Generic HMM
 Generic view of HMM
 HMM defined by A, B, and π
 We denote HMM “model” as λ = (A,B,π)

HMM Example

 Suppose that we observe tree ring sizes
   For 4 year period of interest: S,M,S,L
   Then O = (0, 1, 0, 2)
 Most likely (hidden) state sequence?
   We want most likely X = (x0, x1, x2, x3)
 Let $\pi_{x_0}$ be prob. of starting in state x0
   Note $b_{x_0}(O_0)$ is prob. of initial observation
   And $a_{x_0,x_1}$ is prob. of transition x0 to x1
   And so on…

HMM Example
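 Bottom line?
   We can compute P(X, O) for any X
   For X = (x0, x1, x2, x3) we have
    $P(X, O) = \pi_{x_0} b_{x_0}(O_0)\, a_{x_0,x_1} b_{x_1}(O_1)\, a_{x_1,x_2} b_{x_2}(O_2)\, a_{x_2,x_3} b_{x_3}(O_3)$
 Suppose we observe (0,1,0,2), then what is probability of, say, HHCC?
   Plug into formula above to find the probability
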
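The path probability formula can be sketched as below for the path HHCC. A and π are the values from the Markov chain slides. The B values are an assumption: the B matrix did not survive in this text, and the numbers used here are chosen to be consistent with the best score .002822 quoted later in these slides. The function name `joint_probability` is hypothetical.

```python
# State order (H, C); observation symbols S=0, M=1, L=2.
STATES = {"H": 0, "C": 1}
PI = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
# ASSUMPTION: illustrative B values, not recovered from the slide itself.
B = [[0.1, 0.4, 0.5],
     [0.7, 0.2, 0.1]]


def joint_probability(path, obs):
    """pi_x0 * b_x0(O_0) * a_x0,x1 * b_x1(O_1) * ... for one state path."""
    x = [STATES[s] for s in path]
    p = PI[x[0]] * B[x[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[x[t - 1]][x[t]] * B[x[t]][obs[t]]
    return p


O = (0, 1, 0, 2)                      # S, M, S, L
p_hhcc = joint_probability("HHCC", O)
print(f"{p_hhcc:.6f}")                # prints 0.000212
```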
HMM Example
 Do same for all 4-state sequences
 We find…
 The winner is?
   CCCH
 Not so fast my friend…

HMM Example
 The path CCCH scores the highest
 In dynamic programming (DP), we find highest scoring path
 But, HMM maximizes expected number of correct states
   Sometimes called “EM algorithm”
   For “Expectation Maximization”
 How does HMM work in this example?

HMM Example
 For first position…
   Sum probabilities for all paths that have H in 1st position, compare to sum of probs for paths with C in 1st position; biggest wins
 Repeat for each position to find the most likely state at each step

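A brute-force sketch of both ideas: score all 16 state sequences, pick the highest scoring path (the DP sense of “best”), then pick the most likely state at each position by summing over paths (the HMM sense of “best”). The B values are an assumption, since the B matrix is not present in this text; the names `joint_probability`, `best_path` and `per_position` are hypothetical.

```python
from itertools import product

STATES = {"H": 0, "C": 1}
PI = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
# ASSUMPTION: illustrative B values (rows H, C; columns S, M, L).
B = [[0.1, 0.4, 0.5],
     [0.7, 0.2, 0.1]]
O = (0, 1, 0, 2)


def joint_probability(path, obs):
    """Probability of one state path together with the observations."""
    x = [STATES[s] for s in path]
    p = PI[x[0]] * B[x[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[x[t - 1]][x[t]] * B[x[t]][obs[t]]
    return p


# Score every possible state sequence of length 4.
scores = {"".join(p): joint_probability(p, O) for p in product("HC", repeat=len(O))}

# DP sense of best: the single highest scoring path.
best_path = max(scores, key=scores.get)

# HMM sense of best: at each position, the state with the larger summed probability.
per_position = "".join(
    max("HC", key=lambda s: sum(v for k, v in scores.items() if k[t] == s))
    for t in range(len(O))
)
print(best_path, per_position)        # prints CCCH CHCH
```

This enumeration costs N^T path evaluations, which is only feasible for toy cases.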
HMM Example
 So, HMM solution gives us CHCH
 While dynamic program solution is CCCH
 Which solution is better?
 Neither!!! Why is that?
   Different definitions of “best”

HMM Paradox?
 HMM maximizes expected number of correct states
   Whereas DP chooses “best” overall path
 Possible for HMM to choose “path” that is impossible
   Could be a transition probability of 0
 Cannot get impossible path with DP
 Is this a flaw with HMM?
   No, it’s a feature…

The Three Problems
 HMMs used to solve 3 problems
 Problem 1: Given a model λ = (A,B,π) and observation sequence O, find P(O|λ)
   That is, we score an observation sequence to see how well it fits the given model
 Problem 2: Given λ = (A,B,π) and O, find an optimal state sequence
   Uncover hidden part (as in previous example)
 Problem 3: Given O, N, and M, find the model λ that maximizes probability of O
   That is, train a model to fit the observations

HMMs in Practice
 Typically, HMMs used as follows
 Given an observation sequence
 Assume a hidden Markov process exists
 Train a model based on observations
   Problem 3 (determine N by trial and error)
 Then given a sequence of observations, score it vs model from previous step
   Problem 1 (high score implies it’s similar to training data)

HMMs in Practice
 Previous slide gives sense in which HMM is a “machine learning” technique
   We do not need to specify anything except the parameter N
   And “best” N found by trial and error
 That is, we don’t have to think too much
   Just train HMM and then use it
 Best of all, efficient algorithms for HMMs

The Three Solutions
 We give detailed solutions to the three problems
   Note: We must have efficient solutions
 Recall the three problems:
   Problem 1: Score an observation sequence versus a given model
   Problem 2: Given a model, “uncover” hidden part
   Problem 3: Given an observation sequence, train a model

Solution 1

 Score observations versus a given model
   Given model λ = (A,B,π) and observation sequence $O = (O_0, O_1, \dots, O_{T-1})$, find P(O|λ)
 Denote hidden states as $X = (x_0, x_1, \dots, x_{T-1})$
 Then from definition of B,
  $P(O \mid X, \lambda) = b_{x_0}(O_0)\, b_{x_1}(O_1) \cdots b_{x_{T-1}}(O_{T-1})$
 And from definition of A and π,
  $P(X \mid \lambda) = \pi_{x_0}\, a_{x_0,x_1}\, a_{x_1,x_2} \cdots a_{x_{T-2},x_{T-1}}$

Solution 1
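 Elementary conditional probability fact:
  $P(O, X \mid \lambda) = P(O \mid X, \lambda)\, P(X \mid \lambda)$
 Sum over all possible state sequences X,
  $P(O \mid \lambda) = \sum_X P(O, X \mid \lambda) = \sum_X P(O \mid X, \lambda)\, P(X \mid \lambda) = \sum_X \pi_{x_0} b_{x_0}(O_0)\, a_{x_0,x_1} b_{x_1}(O_1) \cdots a_{x_{T-2},x_{T-1}} b_{x_{T-1}}(O_{T-1})$
 This “works” but way too costly
   Requires about $2T N^T$ multiplications
   Why? There are $N^T$ state sequences, and each term is a product of $2T$ factors
 There better be a better way…
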
Forward Algorithm

 Instead of brute force: forward algorithm
   Or “alpha pass”
 For t = 0,1,…,T-1 and i = 0,1,…,N-1, let
  $\alpha_t(i) = P(O_0, O_1, \dots, O_t, x_t = q_i \mid \lambda)$
 Probability of the partial observation sequence (“partial sum”) up to t, with Markov process in state $q_i$ at step t
   What the?
 Can be computed recursively, efficiently

Forward Algorithm
 Let $\alpha_0(i) = \pi_i b_i(O_0)$ for i = 0,1,…,N-1
 For t = 1,2,…,T-1 and i = 0,1,…,N-1, let
  $\alpha_t(i) = \left( \sum_{j=0}^{N-1} \alpha_{t-1}(j)\, a_{ji} \right) b_i(O_t)$
   Where the sum is from j = 0 to N-1
 From definition of $\alpha_t(i)$ we see
  $P(O \mid \lambda) = \sum_{i=0}^{N-1} \alpha_{T-1}(i)$
   Where the sum is from i = 0 to N-1
 Note this requires only about $N^2 T$ multiplications

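The alpha pass can be sketched in a few lines. This is a minimal, unscaled version for the temperature example; the B values are an assumption (the B matrix is not present in this text) and the function name `forward` is hypothetical.

```python
# State order (H, C); observation symbols S=0, M=1, L=2.
PI = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
# ASSUMPTION: illustrative B values, not recovered from the slides.
B = [[0.1, 0.4, 0.5],
     [0.7, 0.2, 0.1]]


def forward(pi, a, b, obs):
    """Return alpha, where alpha[t][i] = P(O_0..O_t, x_t = q_i | lambda)."""
    n = len(pi)
    alpha = [[pi[i] * b[i][obs[0]] for i in range(n)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([
            sum(prev[j] * a[j][i] for j in range(n)) * b[i][obs[t]]
            for i in range(n)
        ])
    return alpha


O = [0, 1, 0, 2]                       # S, M, S, L
alpha = forward(PI, A, B, O)
prob = sum(alpha[-1])                  # P(O | lambda)
print(f"P(O|lambda) = {prob:.7f}")     # prints P(O|lambda) = 0.0096296
```

Each time step costs about N^2 multiplications, so the total work grows linearly in T rather than exponentially.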
Solution 2

 Given a model, find “most likely” hidden states: Given λ = (A,B,π) and O, find an optimal state sequence
   Recall that optimal means “maximize expected number of correct states”
   In contrast, DP finds best scoring path
 For temp/tree ring example, solved this
   But hopelessly inefficient approach
 A better way: backward algorithm
   Or “beta pass”

Backward Algorithm
 For t = 0,1,…,T-1 and i = 0,1,…,N-1, let
  $\beta_t(i) = P(O_{t+1}, O_{t+2}, \dots, O_{T-1} \mid x_t = q_i, \lambda)$
 Probability of the partial observation sequence from t+1 to end, given Markov process in state $q_i$ at step t
 Analogous to the forward algorithm
 As with forward algorithm, this can be computed recursively and efficiently

Backward Algorithm
 Let $\beta_{T-1}(i) = 1$ for i = 0,1,…,N-1
 For t = T-2,T-3,…,0 and i = 0,1,…,N-1, let
  $\beta_t(i) = \sum_{j=0}^{N-1} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)$
   Where the sum is from j = 0 to N-1

Solution 2

 For t = 0,1,…,T-1 and i = 0,1,…,N-1 define
  $\gamma_t(i) = P(x_t = q_i \mid O, \lambda)$
 Note that $\gamma_t(i) = \alpha_t(i)\beta_t(i) / P(O \mid \lambda)$
   And recall $P(O \mid \lambda) = \sum_i \alpha_{T-1}(i)$
 Most likely state at t is $q_i$ that maximizes $\gamma_t(i)$
 The bottom line?
   Forward algorithm solves Problem 1
   Forward/backward algorithms solve Problem 2

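A minimal sketch of Solution 2 on the temperature example: run the alpha and beta passes, form the gammas, and take the most likely state at each step. The B values are an assumption (the B matrix is not present in this text); the function names are hypothetical, and no scaling is applied, so this version is only suitable for short sequences.

```python
PI = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
# ASSUMPTION: illustrative B values (rows H, C; columns S, M, L).
B = [[0.1, 0.4, 0.5],
     [0.7, 0.2, 0.1]]
O = [0, 1, 0, 2]


def forward(pi, a, b, obs):
    """alpha[t][i] = P(O_0..O_t, x_t = q_i | lambda)."""
    n = len(pi)
    alpha = [[pi[i] * b[i][obs[0]] for i in range(n)]]
    for t in range(1, len(obs)):
        alpha.append([
            sum(alpha[t - 1][j] * a[j][i] for j in range(n)) * b[i][obs[t]]
            for i in range(n)
        ])
    return alpha


def backward(a, b, obs):
    """beta[t][i] = P(O_{t+1}..O_{T-1} | x_t = q_i, lambda)."""
    n, big_t = len(a), len(obs)
    beta = [[1.0] * n for _ in range(big_t)]
    for t in range(big_t - 2, -1, -1):          # t = T-2 down to 0
        for i in range(n):
            beta[t][i] = sum(
                a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j] for j in range(n)
            )
    return beta


def gammas(pi, a, b, obs):
    """gamma[t][i] = alpha_t(i) * beta_t(i) / P(O | lambda)."""
    alpha, beta = forward(pi, a, b, obs), backward(a, b, obs)
    prob = sum(alpha[-1])
    return [
        [al * be / prob for al, be in zip(alpha[t], beta[t])]
        for t in range(len(obs))
    ]


gamma = gammas(PI, A, B, O)
decoded = "".join("HC"[row.index(max(row))] for row in gamma)
print(decoded)                                   # prints CHCH
```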
Solution 3
 Train a model: Given O, N, and M, find λ that maximizes probability of O
 Here, we iteratively adjust λ = (A,B,π) to better fit the given observations O
   The sizes of the matrices are fixed (N and M)
   But elements of matrices can change
 It is amazing that this works!
   And even more amazing that it’s efficient

Solution 3
 For t = 0,1,…,T-2 and i,j in {0,1,…,N-1}, define “di-gammas” as
  $\gamma_t(i,j) = P(x_t = q_i, x_{t+1} = q_j \mid O, \lambda)$
 Note $\gamma_t(i,j)$ is prob of being in state $q_i$ at time t and transitioning to state $q_j$ at t+1
 Then $\gamma_t(i,j) = \alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j) / P(O \mid \lambda)$
 And $\gamma_t(i) = \sum_{j=0}^{N-1} \gamma_t(i,j)$
   Where sum is from j = 0 to N-1

Model Re-estimation
 Given di-gammas and gammas…
 For i = 0,1,…,N-1 let $\pi_i = \gamma_0(i)$
 For i = 0,1,…,N-1 and j = 0,1,…,N-1
  $a_{ij} = \sum_{t=0}^{T-2} \gamma_t(i,j) \Big/ \sum_{t=0}^{T-2} \gamma_t(i)$
   Where both sums are from t = 0 to T-2
 For j = 0,1,…,N-1 and k = 0,1,…,M-1
  $b_j(k) = \sum_{t \in \{0,\dots,T-2\},\ O_t = k} \gamma_t(j) \Big/ \sum_{t=0}^{T-2} \gamma_t(j)$
   Both sums are from t = 0 to T-2, but only t for which $O_t = k$ are counted in numerator
 Why does this work?

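One re-estimation step can be sketched as follows (unscaled, so only for short sequences). Two assumptions are made loudly here. First, the B values are illustrative, since the B matrix is not present in this text. Second, the b_j(k) update below sums over t = 0 to T-1, using γ_{T-1}(i) = α_{T-1}(i)/P(O|λ), rather than stopping at T-2 as on the slide, so that the final observation is also counted; this is the usual Baum-Welch form, swapped in deliberately. The function names are hypothetical.

```python
PI = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.1, 0.4, 0.5],        # ASSUMPTION: illustrative B values
     [0.7, 0.2, 0.1]]
O = [0, 1, 0, 2]


def forward(pi, a, b, obs):
    n = len(pi)
    alpha = [[pi[i] * b[i][obs[0]] for i in range(n)]]
    for t in range(1, len(obs)):
        alpha.append([sum(alpha[t - 1][j] * a[j][i] for j in range(n)) * b[i][obs[t]]
                      for i in range(n)])
    return alpha


def backward(a, b, obs):
    n, big_t = len(a), len(obs)
    beta = [[1.0] * n for _ in range(big_t)]
    for t in range(big_t - 2, -1, -1):
        beta[t] = [sum(a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
                   for i in range(n)]
    return beta


def reestimate(pi, a, b, obs):
    """One re-estimation step; returns the new (pi, A, B)."""
    n, m, big_t = len(pi), len(b[0]), len(obs)
    alpha, beta = forward(pi, a, b, obs), backward(a, b, obs)
    prob = sum(alpha[-1])
    # Di-gammas for t = 0..T-2 and gammas for t = 0..T-1.
    digamma = [[[alpha[t][i] * a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j] / prob
                 for j in range(n)] for i in range(n)] for t in range(big_t - 1)]
    gamma = [[alpha[t][i] * beta[t][i] / prob for i in range(n)] for t in range(big_t)]
    new_pi = gamma[0][:]
    new_a = [[sum(digamma[t][i][j] for t in range(big_t - 1)) /
              sum(gamma[t][i] for t in range(big_t - 1)) for j in range(n)]
             for i in range(n)]
    # Sums over t = 0..T-1 here (see the assumption in the lead-in).
    new_b = [[sum(gamma[t][j] for t in range(big_t) if obs[t] == k) /
              sum(gamma[t][j] for t in range(big_t)) for k in range(m)]
             for j in range(n)]
    return new_pi, new_a, new_b


before = sum(forward(PI, A, B, O)[-1])
new_pi, new_a, new_b = reestimate(PI, A, B, O)
after = sum(forward(new_pi, new_a, new_b, O)[-1])
print(before, after)         # P(O | lambda) before and after one step
```

The re-estimated matrices stay row stochastic, and P(O|λ) does not decrease from one step to the next.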
Solution 3
 To summarize…
  1. Initialize λ = (A,B,π)
  2. Compute αt(i), βt(i), γt(i,j), γt(i)
  3. Re-estimate the model λ = (A,B,π)
  4. If P(O|λ) increases, goto 2

Solution 3
 Some fine points…
 Model initialization
   If we have a good guess for λ = (A,B,π) then we can use it for initialization
   If not, let πi ≈ 1/N, ai,j ≈ 1/N, bj(k) ≈ 1/M
   Subject to row stochastic conditions
   Note: Do not initialize to exactly uniform values
 Stopping conditions
   Stop after some number of iterations
   Stop if increase in P(O|λ) is “small”

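A sketch of one way to get values that are approximately, but not exactly, uniform while keeping every row stochastic. The perturbation size `spread`, the fixed seed and the function names are assumptions for illustration. Exactly uniform rows are avoided because, by symmetry, all hidden states would then stay identical under re-estimation.

```python
import random


def random_stochastic_row(size, rng, spread=0.1):
    """Row of positive entries close to 1/size that sums to 1."""
    raw = [1.0 + spread * (rng.random() - 0.5) for _ in range(size)]
    total = sum(raw)
    return [x / total for x in raw]


def init_model(n, m, seed=0):
    """Near-uniform, row stochastic (pi, A, B) for N states and M symbols."""
    rng = random.Random(seed)
    pi = random_stochastic_row(n, rng)
    a = [random_stochastic_row(n, rng) for _ in range(n)]
    b = [random_stochastic_row(m, rng) for _ in range(n)]
    return pi, a, b


pi0, a0, b0 = init_model(2, 27)
print(pi0)
```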
HMM as Discrete Hill Climb
 Algorithm on previous slides shows that HMM is a “discrete hill climb”
 HMM consists of discrete parameters
   Specifically, the elements of the matrices
 And re-estimation process improves model by modifying parameters
   So, process “climbs” toward improved model
   This happens in a high-dimensional space

Dynamic Programming
 Brief detour…
 For λ = (A,B,π) as above, it’s easy to define a dynamic program (DP)
 Executive summary:
   DP is forward algorithm, with “sum” replaced by “max”
 Precise details on next slides

Dynamic Programming
 Let $\delta_0(i) = \pi_i b_i(O_0)$ for i = 0,1,…,N-1
 For t = 1,2,…,T-1 and i = 0,1,…,N-1 compute
  $\delta_t(i) = \max_j \left( \delta_{t-1}(j)\, a_{ji} \right) b_i(O_t)$
   Where the max is over j in {0,1,…,N-1}
 Note that at each t, the DP computes best path for each state, up to that point
 So, probability of best path is $\max_j \delta_{T-1}(j)$
 This max only gives best probability
   Not the best path; for that, see next slide

Dynamic Programming
 To determine optimal path
   While computing optimal path, keep track of pointers to previous state
   When finished, construct optimal path by tracing back pointers
 For example, consider temp example
 Probabilities for path of length 1:
  $\delta_0(H) = \pi_H\, b_H(0)$ and $\delta_0(C) = \pi_C\, b_C(0)$
 These are the only “paths” of length 1

Dynamic Programming

 Probabilities for each path of length 2
 Best path of length 2 ending with H is CH
 Best path of length 2 ending with C is CC

Dynamic Program
 Continuing, we compute best path ending at H and C at each step
 And save pointers. Why?

Dynamic Program
 Best final score is .002822
 And, thanks to pointers, best path is CCCH
 But what about underflow?
   A serious problem in bigger cases

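The DP with back pointers can be sketched as below. The B values are an assumption (the B matrix is not present in this text); they reproduce the score .002822 and the path CCCH quoted on the slide. The function name `best_path_dp` is hypothetical.

```python
PI = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
# ASSUMPTION: illustrative B values (rows H, C; columns S, M, L).
B = [[0.1, 0.4, 0.5],
     [0.7, 0.2, 0.1]]
O = [0, 1, 0, 2]


def best_path_dp(pi, a, b, obs):
    """Return (probability of best path, best path as a list of state indices)."""
    n = len(pi)
    delta = [pi[i] * b[i][obs[0]] for i in range(n)]
    pointers = []                       # pointers[t-1][i] = best previous state for i
    for t in range(1, len(obs)):
        back = [max(range(n), key=lambda j: delta[j] * a[j][i]) for i in range(n)]
        delta = [delta[back[i]] * a[back[i]][i] * b[i][obs[t]] for i in range(n)]
        pointers.append(back)
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for back in reversed(pointers):     # trace the pointers back to t = 0
        path.append(back[path[-1]])
    path.reverse()
    return delta[last], path


score, path = best_path_dp(PI, A, B, O)
print(f"{score:.6f}", "".join("HC"[s] for s in path))   # prints 0.002822 CCCH
```

Without the pointers, the final max would give only the best score, not the path that achieves it.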
Underflow Resistant DP
 Common trick to prevent underflow
   Instead of multiplying probabilities…
   …we add logarithms of probabilities
 Why does this work?
   Because log(xy) = log x + log y
   And adding logs does not tend to 0
 Note that we must avoid 0 probabilities

Underflow Resistant DP
 Underflow resistant DP algorithm:
 Let $\delta_0(i) = \log(\pi_i b_i(O_0))$ for i = 0,1,…,N-1
 For t = 1,2,…,T-1 and i = 0,1,…,N-1 compute
  $\delta_t(i) = \max_j \left( \delta_{t-1}(j) + \log a_{ji} + \log b_i(O_t) \right)$
   Where the max is over j in {0,1,…,N-1}
 And score of best path is $\max_j \delta_{T-1}(j)$
 As before, must also keep track of paths

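A sketch of the log version of the DP. It assumes all probabilities are nonzero, as the slides require, and uses illustrative B values (the B matrix is not present in this text). The long sequence at the end is a made-up repetition of S,M,S,L, included only to show that the log score stays finite where a plain product of probabilities would underflow to 0. The function name `log_best_path_dp` is hypothetical.

```python
import math

PI = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
# ASSUMPTION: illustrative B values (rows H, C; columns S, M, L).
B = [[0.1, 0.4, 0.5],
     [0.7, 0.2, 0.1]]
O = [0, 1, 0, 2]


def log_best_path_dp(pi, a, b, obs):
    """Return (log score of best path, best path); requires nonzero probabilities."""
    n = len(pi)
    delta = [math.log(pi[i] * b[i][obs[0]]) for i in range(n)]
    pointers = []
    for t in range(1, len(obs)):
        back = [max(range(n), key=lambda j: delta[j] + math.log(a[j][i]))
                for i in range(n)]
        delta = [delta[back[i]] + math.log(a[back[i]][i]) + math.log(b[i][obs[t]])
                 for i in range(n)]
        pointers.append(back)
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for back in reversed(pointers):
        path.append(back[path[-1]])
    path.reverse()
    return delta[last], path


log_score, path = log_best_path_dp(PI, A, B, O)
print(f"{math.exp(log_score):.6f}", "".join("HC"[s] for s in path))  # 0.002822 CCCH

# Hypothetical long sequence: the log score remains a finite negative number.
long_score, long_path = log_best_path_dp(PI, A, B, O * 500)
print(long_score)
```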
HMM Scaling
 Trickier to prevent underflow in HMM
 We consider solution 3
   Since it includes solutions 1 and 2
 Recall for t = 1,2,…,T-1, i = 0,1,…,N-1,
  $\alpha_t(i) = \left( \sum_j \alpha_{t-1}(j)\, a_{ji} \right) b_i(O_t)$
 The idea is to normalize alphas so that they sum to one
 Algorithm on next slide

HMM Scaling
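 Given $\alpha_t(i) = \left( \sum_j \alpha_{t-1}(j)\, a_{ji} \right) b_i(O_t)$
 Let $\tilde{\alpha}_0(i) = \alpha_0(i)$ for i = 0,1,…,N-1 (tilde marks unscaled values)
 Let $c_0 = 1 / \sum_j \tilde{\alpha}_0(j)$
 For i = 0,1,…,N-1, let $\hat{\alpha}_0(i) = c_0 \tilde{\alpha}_0(i)$ (hat marks scaled values)
 This takes care of t = 0 case
 Algorithm continued on next slide…
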
HMM Scaling
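 For t = 1,2,…,T-1 do the following:
   For i = 0,1,…,N-1, $\tilde{\alpha}_t(i) = \left( \sum_j \hat{\alpha}_{t-1}(j)\, a_{ji} \right) b_i(O_t)$
   Let $c_t = 1 / \sum_j \tilde{\alpha}_t(j)$
   For i = 0,1,…,N-1 let $\hat{\alpha}_t(i) = c_t \tilde{\alpha}_t(i)$, the scaled alpha
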
HMM Scaling
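 Easy to show the scaled alphas satisfy $\hat{\alpha}_t(i) = c_0 c_1 \cdots c_t\, \alpha_t(i)$  (♯)
   Simple proof by induction
 So, $c_0 c_1 \cdots c_t$ is scaling factor at step t
 Also, easy to show that $\hat{\alpha}_t(i) = \alpha_t(i) / \sum_j \alpha_t(j)$
 Which implies $\sum_i \hat{\alpha}_{T-1}(i) = 1$  (♯♯)
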
HMM Scaling
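 By combining (♯) and (♯♯), we have
  $1 = \sum_i \hat{\alpha}_{T-1}(i) = c_0 c_1 \cdots c_{T-1} \sum_i \alpha_{T-1}(i) = c_0 c_1 \cdots c_{T-1}\, P(O \mid \lambda)$
 Therefore, $P(O \mid \lambda) = 1 / (c_0 c_1 \cdots c_{T-1})$
 To avoid underflow, we compute
  $\log P(O \mid \lambda) = -\sum_{j=0}^{T-1} \log c_j$
   Where sum is from j = 0 to T-1
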
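The scaled alpha pass and the resulting log P(O|λ) can be sketched as follows. The B values are an assumption (the B matrix is not present in this text), the long sequence is a made-up repetition of S,M,S,L, and the function name `scaled_log_prob` is hypothetical.

```python
import math

PI = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
# ASSUMPTION: illustrative B values (rows H, C; columns S, M, L).
B = [[0.1, 0.4, 0.5],
     [0.7, 0.2, 0.1]]
O = [0, 1, 0, 2]


def scaled_log_prob(pi, a, b, obs):
    """Return log P(O | lambda), computed as minus the sum of log c_t."""
    n = len(pi)
    log_prob = 0.0
    scaled = None
    for t, o in enumerate(obs):
        if t == 0:
            raw = [pi[i] * b[i][o] for i in range(n)]
        else:
            raw = [sum(scaled[j] * a[j][i] for j in range(n)) * b[i][o]
                   for i in range(n)]
        c = 1.0 / sum(raw)                 # scale factor c_t
        scaled = [c * x for x in raw]      # scaled alphas sum to 1
        log_prob -= math.log(c)
    return log_prob


log_p = scaled_log_prob(PI, A, B, O)
print(f"{math.exp(log_p):.7f}")            # prints 0.0096296

# Hypothetical long sequence: the unscaled alphas would underflow, the log does not.
long_log_p = scaled_log_prob(PI, A, B, O * 500)
print(long_log_p)
```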
HMM Scaling
 Similarly, scale betas as $c_t \beta_t(i)$
 For re-estimation,
   Compute $\gamma_t(i,j)$ and $\gamma_t(i)$ using original formulas, but with scaled alphas and betas
   This gives us new values for λ = (A,B,π)
   “Easy exercise” to show re-estimate is exact when scaled alphas and betas used
 Also, P(O|λ) cancels from formula
   Use $\log P(O \mid \lambda) = -\sum_j \log c_j$ to decide if iterate improves

All Together Now
 Complete pseudo code for Solution 3
 Given: (O0,O1,…,OT-1) and N and M
 Initialize: λ = (A,B,π)
   A is NxN, B is NxM and π is 1xN
   πi ≈ 1/N, aij ≈ 1/N, bj(k) ≈ 1/M, each matrix row stochastic, but not uniform
 Initialize:
   maxIters = max number of re-estimation steps
   iters = 0
   oldLogProb = -∞

Forward Algorithm
 Forward algorithm, with scaling

Backward Algorithm
 Backward algorithm or “beta pass”, with scaling
 Note: same scaling factor as alphas

Gammas
 Here, use scaled alphas and betas
 So formulas unchanged

Re-Estimation
 Again, using scaled gammas
 So formulas unchanged

Stopping Criteria
 Check that probability increases
 In practice, want logProb > oldLogProb + ε
 And don’t exceed max iterations

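Putting the pieces together, the training loop might look like the sketch below. It is not the pseudo code from these slides, which did not survive in this text; it is a reconstruction under stated assumptions. The names `train_hmm`, `eps` and `seed` are hypothetical, the near-uniform initialization uses a small random perturbation, and the B update sums over t = 0 to T-1 with γ_{T-1}(i) taken as the scaled α_{T-1}(i).

```python
import math
import random


def train_hmm(obs, n, m, max_iters=100, eps=1e-6, seed=0):
    """Scaled re-estimation; returns ((pi, A, B), log P(O|lambda) per iteration)."""
    rng = random.Random(seed)

    def row(size):
        # Near-uniform but not exactly uniform, normalised to sum to 1.
        raw = [1.0 + 0.1 * (rng.random() - 0.5) for _ in range(size)]
        total = sum(raw)
        return [x / total for x in raw]

    pi, a, b = row(n), [row(n) for _ in range(n)], [row(m) for _ in range(n)]
    big_t = len(obs)
    old_log_prob, history = -math.inf, []
    for _ in range(max_iters):
        # Alpha pass with scaling.
        alpha, c = [], []
        for t in range(big_t):
            if t == 0:
                raw = [pi[i] * b[i][obs[0]] for i in range(n)]
            else:
                raw = [sum(alpha[t - 1][j] * a[j][i] for j in range(n)) * b[i][obs[t]]
                       for i in range(n)]
            c.append(1.0 / sum(raw))
            alpha.append([c[t] * x for x in raw])
        # Beta pass, scaled by the same factors c[t].
        beta = [[c[-1]] * n for _ in range(big_t)]
        for t in range(big_t - 2, -1, -1):
            beta[t] = [c[t] * sum(a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j]
                                  for j in range(n)) for i in range(n)]
        # Di-gammas and gammas from the scaled alphas and betas.
        digamma = [[[alpha[t][i] * a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j]
                     for j in range(n)] for i in range(n)] for t in range(big_t - 1)]
        gamma = [[sum(digamma[t][i]) for i in range(n)] for t in range(big_t - 1)]
        gamma.append(alpha[-1][:])            # special case t = T-1
        # log P(O|lambda) for the model that produced these alphas.
        log_prob = -sum(math.log(ct) for ct in c)
        history.append(log_prob)
        if log_prob <= old_log_prob + eps:    # stop when the gain is "small"
            break
        old_log_prob = log_prob
        # Re-estimate pi, A and B.
        pi = gamma[0][:]
        a = [[sum(digamma[t][i][j] for t in range(big_t - 1)) /
              sum(gamma[t][i] for t in range(big_t - 1)) for j in range(n)]
             for i in range(n)]
        b = [[sum(gamma[t][j] for t in range(big_t) if obs[t] == k) /
              sum(gamma[t][j] for t in range(big_t)) for k in range(m)]
             for j in range(n)]
    return (pi, a, b), history


# Made-up observation sequence, for illustration only.
demo_obs = [0, 1, 0, 2, 0, 0, 1, 2, 2, 1] * 20
(model_pi, model_a, model_b), log_probs = train_hmm(demo_obs, n=2, m=3, max_iters=30)
print(len(log_probs), log_probs[0], log_probs[-1])
```

The returned model is the one whose log probability was last recorded, so the final entry of `log_probs` scores the returned λ.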
English Text Example
 Suppose a Martian arrives on earth
   Sees written English text
   Wants to learn something about it
   Martians know about HMMs
 So, strip out all non-letters, make all letters lower-case
   27 symbols (letters, plus word-space)
 Train HMM on long sequence of symbols

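The preprocessing step might be sketched as below. The symbol numbering (a=0 through z=25, word-space=26), the collapsing of repeated whitespace into a single word-space, and the function name `text_to_observations` are assumptions; the slides only say that non-letters are stripped and letters are lower-cased.

```python
import string

# 26 lower-case letters plus word-space gives M = 27 symbols.
ALPHABET = string.ascii_lowercase + " "


def text_to_observations(text):
    """Map text to a list of symbol indices in 0..26, dropping everything else."""
    cleaned = []
    for ch in text.lower():
        if ch in string.ascii_lowercase:
            cleaned.append(ch)
        elif ch.isspace() and cleaned and cleaned[-1] != " ":
            cleaned.append(" ")           # keep one word-space between words
    return [ALPHABET.index(ch) for ch in cleaned]


print(text_to_observations("Hi, Mom!"))   # prints [7, 8, 26, 12, 14, 12]
```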
English Text
 For first training case, initialize:
   N = 2 and M = 27
   Elements of A and π are about ½ each
   Elements of B are each about 1/27
 We use 50,000 symbols for training
 After 1st iter: log P(O|λ) ≈ -165097
 After 100th iter: log P(O|λ) ≈ -137305

English Text
 Matrices A and π converge
 What does this tell us?
   Started in hidden state 1 (not state 0)
   And we know transition probabilities between hidden states
 Nothing too interesting here
   We don’t care about hidden states

English Text
 What about B matrix?
 This is much more interesting…
 Why???

A Security Application

 Suppose we want to detect metamorphic computer viruses
   Such viruses vary their internal structure
   But function of malware stays same
   If sufficiently variable, standard signature detection will fail
 Can we use HMM for detection?
   What to use as observation sequence?
   Is there really a “hidden” Markov process?
   What about N, M, and T?
   How many Os needed for training, scoring?

HMM for Metamorphic Detection
 Split set of “family” viruses into 2 subsets
 Extract opcodes from each virus
 Append opcodes from subset 1 to make one long sequence
 Train HMM on opcode sequence (problem 3)
   Obtain a model λ = (A,B,π)
 Set threshold: score opcodes from files in subset 2 and “normal” files (problem 1)
   Can you set a threshold that separates sets?
   If so, may have a viable detection method

HMM for Metamorphic Detection

 Virus detection results from recent paper
   Note the separation
   This is good!

HMM Generalizations

 Here, assumed Markov process of order 1
   Current state depends only on previous state and transition matrix
 Can use higher order Markov process
   Current state depends on n previous states
   Higher order vs increased N?
 Can have A and B matrices depend on t
 HMM often combined with other techniques (e.g., neural nets)

Generalizations
 In some cases, big limitation of HMM is that position information is not used
   In many applications this is OK/desirable
   In some apps, this is a serious limitation
 Bioinformatics applications
   DNA sequencing, protein alignment, etc.
   Sequence alignment is crucial
   They use “profile HMMs” instead of HMMs
   PHMM is next topic…

References
 A revealing introduction to hidden Markov models, by M. Stamp
   http://www.cs.sjsu.edu/faculty/stamp/RUA/HMM.pdf
 A tutorial on hidden Markov models and selected applications in speech recognition, by L.R. Rabiner
   http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf

References
 Hunting for metamorphic engines, W. Wong and M. Stamp
   Journal in Computer Virology, Vol. 2, No. 3, December 2006, pp. 211-229
 Hunting for undetectable metamorphic viruses, D. Lin and M. Stamp
   Journal in Computer Virology, Vol. 7, No. 3, August 2011, pp. 201-214