Lecture Slides for INTRODUCTION TO MACHINE LEARNING, 3RD EDITION, Ethem Alpaydın, © The MIT Press, 2014
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e

CHAPTER 15: HIDDEN MARKOV MODELS

Introduction (slide 3)
Modeling dependencies in the input; the data are no longer iid.
Sequences:
- Temporal: in speech, phonemes in a word (dictionary) and words in a sentence (syntax, semantics of the language); in handwriting, pen movements.
- Spatial: in a DNA sequence, base pairs.

Discrete Markov Process (slide 4)
N states: S1, S2, ..., SN; the state at "time" t is qt = Si.
First-order Markov property:
P(qt+1 = Sj | qt = Si, qt-1 = Sk, ...) = P(qt+1 = Sj | qt = Si)
Transition probabilities: aij ≡ P(qt+1 = Sj | qt = Si), with aij ≥ 0 and Σj=1..N aij = 1.
Initial probabilities: πi ≡ P(q1 = Si), with Σi=1..N πi = 1.

Stochastic Automaton (slide 5)
The probability of an observed state sequence O = Q = {q1 q2 ··· qT} is
P(O = Q | A, Π) = P(q1) Πt=2..T P(qt | qt-1) = πq1 · aq1q2 ··· aqT-1qT

Example: Balls and Urns (slide 6)
Three urns, each full of balls of one color: S1 red, S2 blue, S3 green.
Π = [0.5, 0.2, 0.3]ᵀ
A = [0.4 0.3 0.3]
    [0.2 0.6 0.2]
    [0.1 0.1 0.8]
For O = {S1, S1, S3, S3}:
P(O | A, Π) = P(S1) · P(S1 | S1) · P(S3 | S1) · P(S3 | S3)
            = π1 · a11 · a13 · a33 = 0.5 · 0.4 · 0.3 · 0.8 = 0.048

Balls and Urns: Learning (slide 7)
Given K example sequences of length T, the maximum likelihood estimates are
π̂i = #{sequences starting with Si} / K = Σk 1(q1ᵏ = Si) / K
âij = #{transitions from Si to Sj} / #{transitions from Si}
    = Σk Σt=1..T-1 1(qtᵏ = Si and qt+1ᵏ = Sj) / Σk Σt=1..T-1 1(qtᵏ = Si)

Hidden Markov Models (slide 8)
The states are not observable. Discrete observations {v1, v2, ..., vM} are recorded; they are a probabilistic function of the state.
Emission probabilities: bj(m) ≡ P(Ot = vm | qt = Sj)
Example: in each urn there are balls of different colors, but with different probabilities.
For each observation sequence, there are multiple possible state sequences.

HMM Unfolded in Time (slide 9)
[figure slide]

Elements of an HMM (slide 10)
N: number of states
M: number of observation symbols
A = [aij]: N×N state transition probability matrix
B = [bj(m)]: N×M observation probability matrix
Π = [πi]: N×1 initial state probability vector
λ = (A, B, Π): the parameter set of the HMM

Three Basic Problems of HMMs (slide 11)
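The balls-and-urns computation above can be checked with a short script. A minimal sketch, using the slide's Π and A and assuming NumPy is available (function and variable names are ours, not from the slides):

```python
# Balls-and-urns example: three states (S1: red, S2: blue, S3: green)
# and the observed state sequence O = {S1, S1, S3, S3}.
import numpy as np

pi = np.array([0.5, 0.2, 0.3])      # initial probabilities pi_i
A = np.array([[0.4, 0.3, 0.3],      # transition matrix a_ij
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

def sequence_probability(states, A, pi):
    """P(O=Q | A, Pi) = pi_{q1} * prod_{t=2..T} a_{q_{t-1} q_t}
    for a fully observed Markov chain."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev, cur]
    return p

O = [0, 0, 2, 2]                    # S1, S1, S3, S3 (0-indexed states)
print(sequence_probability(O, A, pi))   # 0.5 * 0.4 * 0.3 * 0.8 = 0.048
```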
1. Evaluation: given λ and O, calculate P(O | λ).
2. State sequence: given λ and O, find Q* such that P(Q* | O, λ) = maxQ P(Q | O, λ).
3. Learning: given X = {Oᵏ}k, find λ* such that P(X | λ*) = maxλ P(X | λ).
(Rabiner, 1989)

Evaluation (slide 12)
Forward variable: αt(i) ≡ P(O1···Ot, qt = Si | λ)
Initialization: α1(i) = πi bi(O1)
Recursion: αt+1(j) = [Σi=1..N αt(i) aij] bj(Ot+1)
Then P(O | λ) = Σi=1..N αT(i).
Backward variable: βt(i) ≡ P(Ot+1···OT | qt = Si, λ)
Initialization: βT(i) = 1
Recursion: βt(i) = Σj=1..N aij bj(Ot+1) βt+1(j)

Finding the State Sequence (slide 14)
γt(i) ≡ P(qt = Si | O, λ) = αt(i) βt(i) / Σj=1..N αt(j) βt(j)
Choosing the individually most likely state at each time step, qt* = argmaxi γt(i), does not work: the resulting sequence need not be a valid path (it may chain states through zero-probability transitions). The single most likely state sequence is found with Viterbi's algorithm.

Viterbi's Algorithm (slide 15)
δt(i) ≡ maxq1q2···qt-1 P(q1q2···qt-1, qt = Si, O1···Ot | λ)
Initialization: δ1(i) = πi bi(O1), ψ1(i) = 0
Recursion: δt(j) = [maxi δt-1(i) aij] bj(Ot), ψt(j) = argmaxi δt-1(i) aij
Termination: p* = maxi δT(i), qT* = argmaxi δT(i)
Path backtracking: qt* = ψt+1(qt+1*), t = T-1, T-2, ..., 1

Learning (slide 16)
ξt(i, j) ≡ P(qt = Si, qt+1 = Sj | O, λ)
         = αt(i) aij bj(Ot+1) βt+1(j) / Σk Σl αt(k) akl bl(Ot+1) βt+1(l)
Baum-Welch algorithm (EM): define the hidden indicator variables
zᵢᵗ = 1 if qt = Si (0 otherwise); zᵢⱼᵗ = 1 if qt = Si and qt+1 = Sj (0 otherwise).

Baum-Welch (EM) (slide 17)
E-step: E[zᵢᵗ] = γt(i), E[zᵢⱼᵗ] = ξt(i, j)
M-step:
π̂i = Σk=1..K γ1ᵏ(i) / K
âij = Σk Σt=1..Tk-1 ξtᵏ(i, j) / Σk Σt=1..Tk-1 γtᵏ(i)
b̂j(m) = Σk Σt=1..Tk γtᵏ(j) 1(Otᵏ = vm) / Σk Σt=1..Tk γtᵏ(j)

Continuous Observations (slide 18)
Discrete: P(Ot | qt = Sj, λ) = Πm=1..M bj(m)^(rtᵐ), where rtᵐ = 1 if Ot = vm and 0 otherwise.
Gaussian mixture (or discretize using k-means):
P(Ot | qt = Sj, λ) = Σl=1..L P(Gl | qt = Sj) p(Ot | qt = Sj, Gl, λ), with component densities N(μl, Σl).
Continuous (single Gaussian): P(Ot | qt = Sj, λ) ~ N(μj, σj²)
Use EM to learn the parameters, e.g., μ̂j = Σt γt(j) Ot / Σt γt(j).

HMM with Input (slide 19)
Input-dependent observations: P(Ot | qt = Sj, xᵗ, λ) ~ N(gj(xᵗ | θj), σj²)
Input-dependent transitions (Meila and Jordan, 1996; Bengio and Frasconi, 1996): P(qt+1 = Sj | qt = Si, xᵗ)
Time-delay input: xᵗ = f(Ot-τ, ..., Ot-1)

HMM as a Graphical Model (slides 20-21)
[figure slides]

Model Selection in HMM (slide 22)
Left-to-right HMMs:
A = [a11 a12 a13 0  ]
    [0   a22 a23 a24]
    [0   0   a33 a34]
    [0   0   0   a44]
In classification, for each class Ci, estimate P(O | λi) with a separate HMM and use Bayes' rule:
P(λi | O) = P(O | λi) P(λi) / Σj P(O | λj) P(λj)
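The forward recursion from the Evaluation slide can be sketched as follows. The two-state HMM parameters here are made up for illustration (not taken from the slides), and the result is checked against brute-force enumeration over all Nᵀ state sequences:

```python
# Forward algorithm for Problem 1 (evaluation): compute P(O | lambda).
# Illustrative two-state, two-symbol HMM; parameters are not from the slides.
import itertools
import numpy as np

pi = np.array([0.6, 0.4])           # initial state probabilities pi_i
A = np.array([[0.7, 0.3],           # transition probabilities a_ij
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],           # emission probabilities b_j(m), N x M
              [0.2, 0.8]])

def forward(O, A, B, pi):
    """alpha_t(i) = P(O_1..O_t, q_t = S_i | lambda); returns P(O | lambda)."""
    alpha = pi * B[:, O[0]]                 # initialization: pi_i * b_i(O_1)
    for o in O[1:]:
        alpha = (alpha @ A) * B[:, o]       # recursion over t
    return alpha.sum()                      # P(O|lambda) = sum_i alpha_T(i)

def brute_force(O, A, B, pi):
    """Sum P(O, Q | lambda) over all N^T paths -- feasible only when tiny."""
    N, T = len(pi), len(O)
    total = 0.0
    for q in itertools.product(range(N), repeat=T):
        p = pi[q[0]] * B[q[0], O[0]]
        for t in range(1, T):
            p *= A[q[t-1], q[t]] * B[q[t], O[t]]
        total += p
    return total

O = [0, 1, 0, 0]
print(forward(O, A, B, pi))   # agrees with brute_force(O, A, B, pi)
```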
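Viterbi's algorithm (Problem 2) can be sketched the same way. Again the parameters are illustrative, not from the slides; the function follows the initialization / recursion / termination / backtracking steps above:

```python
# Viterbi's algorithm: most likely state sequence Q* and its probability p*.
# Illustrative two-state, two-symbol HMM; parameters are not from the slides.
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])

def viterbi(O, A, B, pi):
    """Return (p*, q*): best path probability and the 0-indexed state path."""
    N, T = len(pi), len(O)
    delta = np.zeros((T, N))                  # delta_t(j): best prob ending in S_j
    psi = np.zeros((T, N), dtype=int)         # psi_t(j): best predecessor of S_j
    delta[0] = pi * B[:, O[0]]                # initialization
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A    # scores[i, j] = delta_{t-1}(i) a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, O[t]]
    path = [int(delta[T - 1].argmax())]       # termination: q_T*
    for t in range(T - 1, 0, -1):             # backtracking
        path.append(int(psi[t][path[-1]]))
    return delta[T - 1].max(), path[::-1]

p_star, q_star = viterbi([0, 1, 0, 0], A, B, pi)
print(p_star, q_star)   # p* = 0.02939328, q* = [0, 1, 0, 0]
```

With these strongly state-correlated emissions, the best path simply tracks the observed symbols, which is a useful sanity check.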