Introduction to Machine Learning

Lecture Slides for
INTRODUCTION TO MACHINE LEARNING, 3RD EDITION
ETHEM ALPAYDIN
© The MIT Press, 2014
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e
CHAPTER 15:
HIDDEN MARKOV MODELS
Introduction


Modeling dependencies in the input; observations are no longer iid
Sequences:
Temporal: In speech, phonemes in a word (dictionary), words in a sentence (syntax, semantics of the language); in handwriting, pen movements
Spatial: In a DNA sequence, base pairs

Discrete Markov Process




N states: S1, S2, ..., SN
State at “time” t: qt = Si
First-order Markov:
P(qt+1=Sj | qt=Si, qt-1=Sk, ...) = P(qt+1=Sj | qt=Si)
Transition probabilities:
aij ≡ P(qt+1=Sj | qt=Si), aij ≥ 0 and Σj=1N aij = 1
Initial probabilities:
πi ≡ P(q1=Si), Σi=1N πi = 1
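A minimal Python sketch of such a process, generating a state sequence from Π and A (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def sample_markov_chain(pi, A, T, seed=0):
    """Sample q1, ..., qT from a first-order Markov chain.
    pi[i] = P(q1=Si); A[i, j] = aij = P(qt+1=Sj | qt=Si)."""
    rng = np.random.default_rng(seed)
    states = [rng.choice(len(pi), p=pi)]                     # draw q1 from Pi
    for _ in range(T - 1):
        states.append(rng.choice(len(pi), p=A[states[-1]]))  # draw qt+1 from row qt of A
    return states
```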
Stochastic Automaton
P(O=Q | A, Π) = P(q1) Πt=2T P(qt | qt-1) = πq1 aq1q2 ∙∙∙ aqT-1qT
Example: Balls and Urns
Three urns, each full of balls of one color:
S1: red, S2: blue, S3: green

Π = [0.5, 0.2, 0.3]T

A = | 0.4 0.3 0.3 |
    | 0.2 0.6 0.2 |
    | 0.1 0.1 0.8 |

O = {S1, S1, S3, S3}

P(O | A, Π) = P(S1) ∙ P(S1 | S1) ∙ P(S3 | S1) ∙ P(S3 | S3)
            = π1 ∙ a11 ∙ a13 ∙ a33
            = 0.5 ∙ 0.4 ∙ 0.3 ∙ 0.8 = 0.048
Balls and Urns: Learning
Given K example sequences of length T:

π̂i = (# sequences starting with Si) / (# sequences)
    = Σk 1(q1k = Si) / K

âij = (# transitions from Si to Sj) / (# transitions from Si)
    = Σk Σt=1T-1 1(qtk = Si and qt+1k = Sj) / Σk Σt=1T-1 1(qtk = Si)
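A minimal Python sketch of these counts (names are illustrative):

```python
import numpy as np

def estimate_markov_params(sequences, n_states):
    """ML estimates of Pi and A from fully observed, 0-indexed state sequences."""
    pi = np.zeros(n_states)
    counts = np.zeros((n_states, n_states))
    for seq in sequences:
        pi[seq[0]] += 1                        # sequences starting with Si
        for prev, curr in zip(seq[:-1], seq[1:]):
            counts[prev, curr] += 1            # transitions Si -> Sj
    return pi / len(sequences), counts / counts.sum(axis=1, keepdims=True)
```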
Hidden Markov Models
States are not observable
Discrete observations {v1, v2, ..., vM} are recorded; each is a probabilistic function of the state
Emission probabilities: bj(m) ≡ P(Ot=vm | qt=Sj)
Example: In each urn, there are balls of different colors, but with different probabilities.
For each observation sequence, there are multiple possible state sequences
HMM Unfolded in Time
Elements of an HMM
N: Number of states
M: Number of observation symbols
A = [aij]: N by N state transition probability matrix
B = [bj(m)]: N by M observation probability matrix
Π = [πi]: N by 1 initial state probability vector
λ = (A, B, Π), parameter set of HMM
Three Basic Problems of HMMs
1. Evaluation: Given λ and O, calculate P(O | λ)
2. State sequence: Given λ and O, find Q* such that P(Q* | O, λ) = maxQ P(Q | O, λ)
3. Learning: Given X = {Ok}k, find λ* such that P(X | λ*) = maxλ P(X | λ)
(Rabiner, 1989)
Evaluation
Forward variable:
αt(i) ≡ P(O1∙∙∙Ot, qt=Si | λ)
Initialization:
α1(i) = πi bi(O1)
Recursion:
αt+1(j) = [Σi=1N αt(i) aij] bj(Ot+1)
P(O | λ) = Σi=1N αT(i)

Backward variable:
βt(i) ≡ P(Ot+1∙∙∙OT | qt=Si, λ)
Initialization:
βT(i) = 1
Recursion:
βt(i) = Σj=1N aij bj(Ot+1) βt+1(j)
Finding the State Sequence
γt(i) ≡ P(qt=Si | O, λ) = αt(i) βt(i) / Σj=1N αt(j) βt(j)

Choose the state that has the highest probability for each time step:
qt* = argmaxi γt(i)
No! The individually most likely states need not form a likely (or even valid) sequence; successive choices can have aij = 0.
Viterbi’s Algorithm
δt(i) ≡ maxq1q2∙∙∙qt-1 p(q1q2∙∙∙qt-1, qt=Si, O1∙∙∙Ot | λ)

Initialization:
δ1(i) = πi bi(O1), ψ1(i) = 0
Recursion:
δt(j) = maxi δt-1(i) aij bj(Ot), ψt(j) = argmaxi δt-1(i) aij
Termination:
p* = maxi δT(i), qT* = argmaxi δT(i)
Path backtracking:
qt* = ψt+1(qt+1*), t = T-1, T-2, ..., 1
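A minimal Python sketch of Viterbi decoding (in practice one works with log probabilities to avoid underflow; names are illustrative):

```python
import numpy as np

def viterbi(pi, A, B, O):
    """Most likely state sequence Q* and its probability p* under lambda = (A, B, Pi)."""
    T, N = len(O), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, O[0]]                       # delta1(i) = pi_i bi(O1)
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A           # scores[i, j] = delta(t-1)(i) aij
        psi[t] = scores.argmax(axis=0)               # psit(j)
        delta[t] = scores.max(axis=0) * B[:, O[t]]   # deltat(j)
    q = np.zeros(T, dtype=int)
    q[-1] = delta[-1].argmax()                       # qT* = argmax_i deltaT(i)
    for t in range(T - 2, -1, -1):
        q[t] = psi[t + 1, q[t + 1]]                  # backtracking
    return q, delta[-1].max()
```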
Learning
ξt(i,j) ≡ P(qt=Si, qt+1=Sj | O, λ)
ξt(i,j) = αt(i) aij bj(Ot+1) βt+1(j) / Σk Σl αt(k) akl bl(Ot+1) βt+1(l)

Baum-Welch algorithm (EM):
zit = 1 if qt=Si, 0 otherwise
zijt = 1 if qt=Si and qt+1=Sj, 0 otherwise
Baum-Welch (EM)

E-step:
E[zit] = γt(i)
E[zijt] = ξt(i,j)

M-step:
π̂i = Σk=1K γ1k(i) / K
âij = Σk=1K Σt=1Tk-1 ξtk(i,j) / Σk=1K Σt=1Tk-1 γtk(i)
b̂j(m) = Σk=1K Σt=1Tk-1 γtk(j) 1(Otk = vm) / Σk=1K Σt=1Tk-1 γtk(j)
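A minimal Python sketch of one Baum-Welch iteration for discrete observations, combining the forward and backward passes above (no scaling, so only for short sequences; names are illustrative):

```python
import numpy as np

def baum_welch_step(pi, A, B, sequences):
    """One EM iteration; pi: (N,), A: (N, N), B: (N, M), sequences: lists of observation indices."""
    N = len(pi)
    pi_num = np.zeros(N)
    a_num, a_den = np.zeros((N, N)), np.zeros(N)
    b_num, b_den = np.zeros_like(B), np.zeros(N)
    for O in sequences:
        T = len(O)
        alpha, beta = np.zeros((T, N)), np.ones((T, N))
        alpha[0] = pi * B[:, O[0]]                               # forward pass
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
        for t in range(T - 2, -1, -1):                           # backward pass
            beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)                # gammat(i)
        pi_num += gamma[0]
        for t in range(T - 1):
            xi = alpha[t][:, None] * A * B[:, O[t + 1]] * beta[t + 1]
            xi /= xi.sum()                                       # xit(i, j)
            a_num += xi
            a_den += gamma[t]
            b_num[:, O[t]] += gamma[t]
            b_den += gamma[t]
    return pi_num / len(sequences), a_num / a_den[:, None], b_num / b_den[:, None]
```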
Continuous Observations
Discrete:
P(Ot | qt=Sj, λ) = Πm=1M bj(m)^(rmt), where rmt = 1 if Ot=vm, 0 otherwise

Gaussian mixture (discretize using k-means):
P(Ot | qt=Sj, λ) = Σl=1L P(Gjl) p(Ot | qt=Sj, Gjl, λ), with components ~ N(μl, Σl)

Continuous:
P(Ot | qt=Sj, λ) ~ N(μj, σj2)
Use EM to learn parameters, e.g.,
μ̂j = Σt γt(j) Ot / Σt γt(j)
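A minimal sketch of the continuous (single Gaussian) case: a per-state Gaussian density replaces the column B[:, Ot], and the mean update uses the posteriors γt(j) (names are illustrative):

```python
import numpy as np

def gaussian_emission(o_t, mu, sigma):
    """bj(Ot) = N(Ot; mu[j], sigma[j]^2), evaluated for all states j at once."""
    return np.exp(-0.5 * ((o_t - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def update_means(gamma, O):
    """EM update: mu_hat[j] = sum_t gammat(j) Ot / sum_t gammat(j); gamma: (T, N), O: (T,)."""
    return (gamma * np.asarray(O)[:, None]).sum(axis=0) / gamma.sum(axis=0)
```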
HMM with Input
Input-dependent observations:
P(Ot | qt=Sj, xt, λ) ~ N(gj(xt | θj), σj2)

Input-dependent transitions (Meila and Jordan, 1996; Bengio and Frasconi, 1996):
P(qt+1=Sj | qt=Si, xt)

Time-delay input:
xt = f(Ot-τ, ..., Ot-1)
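For instance, gj could be a per-state linear model; a small sketch under that assumption (w, b, sigma are hypothetical parameter names, not from the slides):

```python
import numpy as np

def input_dependent_emission(o_t, x_t, w, b, sigma):
    """P(Ot | qt=Sj, xt) = N(Ot; gj(xt), sigma_j^2) with gj(x) = wj . x + bj, for all states j.
    w: (N, d), b: (N,), sigma: (N,), x_t: (d,), o_t: scalar."""
    mean = w @ x_t + b                    # gj(xt), one mean per state
    return np.exp(-0.5 * ((o_t - mean) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
```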
HMM as a Graphical Model
Model Selection in HMM
Left-to-right HMMs:

A = | a11 a12 a13  0  |
    |  0  a22 a23 a24 |
    |  0   0  a33 a34 |
    |  0   0   0  a44 |

In classification, for each class Ci, estimate P(O | λi) with a separate HMM and use Bayes’ rule:

P(λi | O) = P(O | λi) P(λi) / Σj P(O | λj) P(λj)
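A minimal sketch of that classification step, reusing the forward() sketch from the Evaluation section to get each P(O | λi) (names are illustrative):

```python
import numpy as np

def classify(O, class_hmms, priors):
    """class_hmms: list of (pi, A, B), one HMM per class; priors: P(lambda_i).
    Returns the posterior P(lambda_i | O) via Bayes' rule."""
    likelihoods = np.array([forward(pi, A, B, O)[1] for pi, A, B in class_hmms])
    posterior = likelihoods * np.asarray(priors)
    return posterior / posterior.sum()
```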