
From Hidden Markov Models to Linear Dynamical Systems
Thomas P. Minka
Abstract
Hidden Markov Models (HMMs) and Linear Dynamical Systems (LDSs) are based on
the same assumption: a hidden state variable, of which we can make noisy measurements,
evolves with Markovian dynamics. Both have the same independence diagram and consequently the learning and inference algorithms for both have the same structure. The
only difference is that the HMM uses a discrete state variable with arbitrary dynamics
and arbitrary measurements while the LDS uses a continuous state variable with linear-Gaussian dynamics and measurements. We show how the forward-backward equations for
the HMM, specialized to linear-Gaussian assumptions, lead directly to Kalman filtering and
Rauch-Tung-Striebel smoothing. We also investigate the most general possible modeling
assumptions which lead to efficient recursions in the case of continuous state variables.
1 Introduction
Both HMMs and LDSs specify a joint probability distribution over states s_1..s_T and measurements x_1..x_T. In both cases, the distribution factors in the same way:

    p(s_1..s_T, x_1..x_T) = p(s_1) p(x_1|s_1) ∏_{t=2}^T p(s_t|s_{t-1}) p(x_t|s_t)    (1)
where all terms are implicitly conditioned on the parameters of the model. This factorization is
equivalent to the following independence diagram:
[Figure: the independence diagram, a chain of hidden states s_1 → s_2 → s_3 → ... → s_T, with each state s_t emitting a measurement x_t]
which is a graphical way to say that the hidden states form a Markov chain that emits a time
series of outputs. To convert a graph into a factorization, make one term per node, conditioned
on the nodes pointing into it. In this way, mathematical manipulations can be replaced with
graphical ones.
Besides a Markov chain (horizontal arrows) with noisy measurements (vertical arrows), this
graph can also be considered a mixture model (vertical arrows) with coupling between the assignment variables (horizontal arrows). It is just a matter of whether you consider the horizontal
or vertical links to be fundamental. More complex independence diagrams can be interpreted
in even more different ways.
The main point of this paper is that the independence structure reflected in the graph determines
the overall structure of the algorithms. The choice of conditional distributions only affects
computational details, which are the only difference between the HMM and the LDS.
Given one or more measurement sequences from the model, there are three basic tasks we want
to perform:
Classification Compute the probability that a measurement sequence x_1..x_T came from this model.

Inference Compute the probability that the system was in state z at time t, i.e. p(s_t = z|x_1..x_T).

Learning Determine the parameter settings which maximize the probability of the measurement sequences.

The first two tasks are accomplished via forward-backward recursions which propagate information across the graph. The third task is performed iteratively by EM, where the E-step is the
inference task and the M-step is a linear maximum-likelihood problem. Only the first two tasks
are discussed here; the M-step for a LDS is straightforward and can be found in Ghahramani
(1996).
2 Generic forward-backward propagation
This section develops the complete solution to the classification and inference tasks. In exchange
for making the restrictive assumption of a Markov chain, we get an exact, efficient inference
algorithm. The reason is that each state variable s_t separates the graph into three independent
parts:
[Figure: the state s_t separates the graph into the past B_t, the measurement x_t, and the future F_t]

where

    B_t = {x_1..x_{t-1}, s_1..s_{t-1}}    (2)
    F_t = {x_{t+1}..x_T, s_{t+1}..s_T}    (3)
Therefore,

    p(B_t, s_t, x_t, F_t) = p(B_t, s_t) p(x_t|s_t) p(F_t|s_t)    (4)
If we want p(s_t, x_1..x_T), we can integrate out the other state variables:

    p(s_t, x_1..x_T) = ∫_{s_1..s_{t-1}} ∫_{s_{t+1}..s_T} p(B_t, s_t, x_t, F_t)    (5)
                     = (∫_{s_1..s_{t-1}} p(B_t, s_t)) p(x_t|s_t) (∫_{s_{t+1}..s_T} p(F_t|s_t))    (6)
                     = p(B_t^x, s_t) p(x_t|s_t) p(F_t^x|s_t)    (7)

where

    B_t^x = {x_1..x_{t-1}}    (8)
    F_t^x = {x_{t+1}..x_T}    (9)
The probability of the entire measurement sequence is p(x_1..x_T) = ∑_{s_t} p(s_t, x_1..x_T) and the
probability of being in state z at time t is p(s_t = z, x_1..x_T)/p(x_1..x_T), so the joint distribution (7)
is all we need to solve the classification and inference tasks.
The idea is to compute the left and right terms in (7) recursively on the left and right subgraphs.
To this end, define

    α_t(s_t) = p(B_t^x, s_t) p(x_t|s_t) = p(B_t^x, x_t, s_t)    (10)
    β_t(s_t) = p(F_t^x|s_t)    (11)

corresponding to the notation in Rabiner (1989). Therefore

    p(s_t, x_1..x_T) = α_t(s_t) β_t(s_t)    (12)
To derive the recursions, consider two consecutive time-steps:
[Figure: two consecutive time-steps — B_{t-1} feeds into s_{t-1}, which transitions to s_t, which feeds into F_t; s_{t-1} emits x_{t-1} and s_t emits x_t]

In this diagram, B_t = B_{t-1} ∪ {s_{t-1}, x_{t-1}} and F_{t-1} = {s_t, x_t} ∪ F_t.
The independence diagram tells us

    α_t(s_t) = p(x_t|s_t) p(B_t^x, s_t) = p(x_t|s_t) ∫_z p(B_{t-1}^x, s_{t-1} = z, x_{t-1}, s_t)    (13)
             = p(x_t|s_t) ∫_z p(B_{t-1}^x, s_{t-1} = z) p(x_{t-1}|s_{t-1} = z) p(s_t|s_{t-1} = z)    (14)
             = p(x_t|s_t) ∫_z p(s_t|s_{t-1} = z) α_{t-1}(z)    (15)

    β_{t-1}(s_{t-1}) = p(F_{t-1}^x|s_{t-1}) = ∫_z p(s_t = z, x_t, F_t^x|s_{t-1})    (16)
                     = ∫_z p(s_t = z|s_{t-1}) p(x_t|s_t = z) p(F_t^x|s_t = z)    (17)
                     = ∫_z p(s_t = z|s_{t-1}) p(x_t|s_t = z) β_t(z)    (18)
Intermediate state variables must be integrated out since they are not observed.
To start off the recursions:

    α_1(s_1) = p(s_1) p(x_1|s_1)    (19)
    β_T(s_T) = 1    (20)
Putting these equations together lets us compute all α_t(·) (going forward in time) and all β_t(·)
(going backward in time) and therefore all marginals for s_t. A backward step is needed because
we want the best estimate of s_t using all the data at hand, including data received after time t.
It may be helpful to consider the graph to be a parallel machine, where each node is a processor
and the links are communication paths. The forward-backward recursions make sure that every
st processor gets information about the entire measurement sequence.
If we instead want the marginal pairwise density of s_{t-1} and s_t, the diagram tells us, going left
to right:

    p(s_{t-1}, s_t|B_{t-1}^x, F_t^x, x_{t-1}, x_t)
      ∝ p(B_{t-1}^x, s_{t-1}) p(x_{t-1}|s_{t-1}) p(s_t|s_{t-1}) p(x_t|s_t) p(F_t^x|s_t)    (21)
      = α_{t-1}(s_{t-1}) p(s_t|s_{t-1}) p(x_t|s_t) β_t(s_t)    (22)

This obviously generalizes to triples of state variables and so on. Therefore any inference task
on this model can be solved by computing the α's and β's.
Since this algorithm is derived from independence relationships alone, it is clear that any transition density p(s_t|s_{t-1}) and any emission density p(x_t|s_t) can be used, in principle. These densities
may depend on time, i.e. the process need not be stationary. Furthermore, if the independence
diagram were not a chain but rather a tree, exact inference could still be performed efficiently.
This case is illustrated below:
[Figure: a tree-structured independence diagram — root s11 with children s21 and s22; s21 has children s31, s32 and s22 has children s33, s34; each state sij emits a measurement xij]
Since each hidden state node still partitions the graph into independent pieces, a similar set of
recursions can be derived, resulting in an upward-downward algorithm (Pearl, 1988). This is
left as an exercise (or a project!).
2.1 Scaling factors
In preparation for the Linear Dynamical System equations, it is helpful to reformulate the
forward-backward recursions in terms of scaled α's and β's. The rescaling is also useful for
avoiding numerical underflow (Rabiner, 1989).

Define

    c_t = p(x_t|x_1..x_{t-1})    (23)

which is the inverse of Rabiner's (1989) scaling factor, also called c_t. Now factor the c_t out of the
original definition of α_t(s_t):
    α_t(s_t) = p(B_t^x, x_t, s_t) = p(s_t, x_1..x_t)    (24)
             = p(x_1..x_t) p(s_t|x_1..x_t)    (25)
             = (∏_{τ=1}^t c_τ) α̂_t(s_t)    (26)

which defines α̂_t(s_t) = p(s_t|x_1..x_t). Equation 15 can be turned into a recursion for α̂:
    (∏_{τ=1}^t c_τ) α̂_t(s_t) = p(x_t|s_t) ∫_z p(s_t|s_{t-1} = z) (∏_{τ=1}^{t-1} c_τ) α̂_{t-1}(z)    (27)

    c_t α̂_t(s_t) = p(x_t|s_t) ∫_z p(s_t|s_{t-1} = z) α̂_{t-1}(z)    (28)

where c_t is computed as the factor that normalizes α̂_t(s_t). This way α̂_t(s_t) and all of the c_t stay
within machine precision during the forward propagation.
Similarly, define

    β_t(s_t) = (∏_{τ=t+1}^T c_τ) β̂_t(s_t)    (29)

where the c_τ are the same scaling factors used for α̂. Equation 18 turns into
    β̂_{t-1}(s_{t-1}) = (1/c_t) ∫_z p(s_t = z|s_{t-1}) p(x_t|s_t = z) β̂_t(z)    (30)

which keeps β̂ within machine precision.
The marginal distributions now become exact in terms of the scaled α's and β's (the distributions
do not require normalization):

    p(s_t|x_1..x_T) = α̂_t(s_t) β̂_t(s_t)    (31)

    p(s_{t-1}, s_t|x_1..x_T) = (1/c_t) α̂_{t-1}(s_{t-1}) p(s_t|s_{t-1}) p(x_t|s_t) β̂_t(s_t)    (32)

Note that the forward and backward recursions could be derived from these equations alone, by
integrating the latter equation over s_{t-1} or s_t, respectively.
The probability of the measurement sequence is now

    p(x_1..x_T) = ∏_{t=1}^T c_t    (33)
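For the discrete case, the scaled recursions (28) and (30), the marginals (31), and the likelihood (33) translate directly into code, with the integrals becoming sums. The following is a minimal sketch in Python/NumPy, not part of the original paper; the function name `forward_backward` and the array layout are illustrative choices.

```python
import numpy as np

def forward_backward(pi, P, B):
    """Scaled forward-backward recursions for a discrete HMM.

    pi : (K,) initial state distribution p(s_1)
    P  : (K, K) transition matrix, P[i, j] = p(s_t = j | s_{t-1} = i)
    B  : (T, K) emission likelihoods, B[t, k] = p(x_t | s_t = k)

    Returns the smoothed marginals p(s_t | x_1..x_T) and log p(x_1..x_T).
    """
    T, K = B.shape
    alpha = np.zeros((T, K))   # alpha-hat_t(s_t) = p(s_t | x_1..x_t)
    beta = np.zeros((T, K))    # beta-hat_t(s_t)
    c = np.zeros(T)            # scaling factors c_t = p(x_t | x_1..x_{t-1})

    # Forward pass, eq. (28): c_t alpha-hat_t = p(x_t|s_t) sum_z p(s_t|z) alpha-hat_{t-1}(z)
    a = pi * B[0]              # eq. (19)
    c[0] = a.sum()
    alpha[0] = a / c[0]
    for t in range(1, T):
        a = B[t] * (alpha[t - 1] @ P)
        c[t] = a.sum()         # c_t is the normalizer of alpha-hat_t
        alpha[t] = a / c[t]

    # Backward pass, eq. (30): beta-hat_{t-1} = (1/c_t) sum_z p(z|s_{t-1}) p(x_t|z) beta-hat_t(z)
    beta[T - 1] = 1.0          # eq. (20)
    for t in range(T - 1, 0, -1):
        beta[t - 1] = (P @ (B[t] * beta[t])) / c[t]

    gamma = alpha * beta       # p(s_t | x_1..x_T), eq. (31)
    return gamma, np.log(c).sum()   # log p(x_1..x_T) = sum_t log c_t, eq. (33)
```

Working in log c_t rather than the raw product (33) keeps the likelihood itself within machine precision for long sequences.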
3 Linear dynamical systems
When the state variables s_t are discrete, the integrals above become sums and the α̂ and β̂
functions can be stored and computed explicitly. This leads to the HMM forward-backward
algorithm, which is completely general in terms of what the transition and emission probabilities can be. When the state variables are continuous, the α̂ and β̂ functions must instead be
stored implicitly and the integrals solved analytically. This leads to restrictions on the kinds of
transition and emission densities we can handle efficiently, i.e. in time linear in the length of
the measurement sequence.
To get a linear-time algorithm, each α̂ and β̂ function must be represented with a constant
number of parameters (constant with respect to T). Otherwise the integral at each step would
take time proportional to T. For example, if we use a mixture of Gaussians, the number
of components must remain constant. This is a very stringent requirement, since it means
multiplication by the transition and emission densities cannot make the function more complex
but only change its parameters. The only family of probability distributions with this property,
closure under multiplication, is the exponential family. Fortunately, the exponential family is
quite large, including the Gaussian, Gamma, Poisson, and Beta distributions. Any distribution
which can be written as p(z; θ) = exp(θ^T f(z) + g(θ)) is in the family.
Let Q denote the abstract choice of density for our model, e.g. Q = Gaussian. Conditioned
on st , the variables xt and st+1 must have density Q, though with arbitrary parameters. They
must also have marginal density Q if st has density Q (because of (15)). If Q = Gaussian, then
this means all conditional densities must be linear:
    p(s_t|s_{t-1}) ~ N(A s_{t-1}, Γ)    (34)
    p(x_t|s_t) ~ N(C s_t, Σ)    (35)

    where N(x; m, V) = |2πV|^{-1/2} exp(-(1/2) (x - m)^T V^{-1} (x - m))
This model can also be written, more conventionally, as a set of linear equations driven by noise
(e.g. as in Ghahramani (1996)):
    s_t = A s_{t-1} + w_t    (36)
    w_t ~ N(0, Γ)    (37)
    x_t = C s_t + v_t    (38)
    v_t ~ N(0, Σ)    (39)

Hence choosing Q = Gaussian leads to exactly the class of Linear Dynamical Systems (LDSs).
In their fullest generality, the parameters A, C, Γ and Σ may depend on t, i.e. the model may
be nonstationary.
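As a sanity check on the generative equations (36)-(39), one can draw a trajectory by alternating the two noise-driven updates. The sketch below is an illustration, not part of the paper; it additionally assumes a Gaussian prior N(mu0, V0) on the initial state s_1, and the function name and argument layout are hypothetical.

```python
import numpy as np

def sample_lds(A, C, Gamma, Sigma, mu0, V0, T, rng=None):
    """Draw one trajectory from the LDS of equations (36)-(39):
    s_t = A s_{t-1} + w_t,  w_t ~ N(0, Gamma)
    x_t = C s_t + v_t,      v_t ~ N(0, Sigma)
    assuming (not stated in the paper) that s_1 ~ N(mu0, V0)."""
    rng = np.random.default_rng(rng)
    d, k = C.shape[0], A.shape[0]
    s = np.zeros((T, k))
    x = np.zeros((T, d))
    s[0] = rng.multivariate_normal(mu0, V0)
    for t in range(1, T):
        s[t] = A @ s[t - 1] + rng.multivariate_normal(np.zeros(k), Gamma)  # eq. (36)
    for t in range(T):
        x[t] = C @ s[t] + rng.multivariate_normal(np.zeros(d), Sigma)      # eq. (38)
    return s, x
```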
3.1 Forward recursion: Kalman filter
Once we've decided on a linear-Gaussian restriction, the next step is to perform the integrals
in the forward-backward equations. The representation for α̂_t can be any fixed-length mixture of Gaussians. For simplicity, let it be one Gaussian: α̂_t(s_t) ~ N(m_t, V_t). The forward
equation (28) becomes

    c_t α̂_t(s_t) = N(x_t; C s_t, Σ) ∫_z N(s_t; A z, Γ) N(z; m_{t-1}, V_{t-1})    (40)
The following fact comes in handy. Suppose we have a Gaussian random vector partitioned into
x and y with the following mean and covariance:

    [x; y] ~ N( [m_x; A m_x], [V_x, V_x A^T; A V_x, A V_x A^T + V_y] )    (41)

Since p(y|x) p(x) = p(y) p(x|y), we get

    N(y; A x, V_y) N(x; m_x, V_x) = N(y; A m_x, A V_x A^T + V_y) N(x; m_x + K(y - A m_x), V_x - K A V_x)    (42)
    where K = V_x A^T (A V_x A^T + V_y)^{-1}    (43)

using the rule for conditioning a Gaussian (Minka, 1997).
Applying this rule twice to (40):

    c_t α̂_t(s_t) = N(x_t; C s_t, Σ) N(s_t; A m_{t-1}, A V_{t-1} A^T + Γ)    (44)
                 = N(x_t; C A m_{t-1}, C P_{t-1} C^T + Σ) N(s_t; A m_{t-1} + K_t (x_t - C A m_{t-1}), P_{t-1} - K_t C P_{t-1})    (45)
    where K_t = P_{t-1} C^T (C P_{t-1} C^T + Σ)^{-1}    (46)
    and P_{t-1} = A V_{t-1} A^T + Γ    (47)

Therefore:

    m_t = A m_{t-1} + K_t (x_t - C A m_{t-1})    (48)
    V_t = P_{t-1} - K_t C P_{t-1}    (49)
    c_t = N(x_t; C A m_{t-1}, C P_{t-1} C^T + Σ)    (50)
which are exactly the Kalman filter equations reported in Ghahramani (1996). By design, the
computational cost is constant per time step.
Note that the transition density could be such that s_t is a constant s, in which case we have a
recursive solution for the posterior of s given a series of independent, noisy observations. This
again emphasizes that the observation density must come from the exponential family: it is the
only family for which parameter estimation can be performed recursively, via sufficient statistics
(this is the Pitman-Koopman theorem).
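Equations (46)-(50) translate into a short forward pass. Below is an illustrative Python/NumPy rendering, not part of the paper; since the paper does not spell out how the recursion is initialized, treating the prior p(s_1) = N(mu0, V0) as the prediction for t = 1 is an assumption made here.

```python
import numpy as np

def kalman_filter(x, A, C, Gamma, Sigma, mu0, V0):
    """Kalman filter, equations (46)-(50).

    x : (T, d) measurement sequence. Returns the filtered means m_t,
    covariances V_t, and log p(x_1..x_T) = sum_t log c_t (eqs. 33, 50).
    Assumption: the prediction at t = 1 is the prior N(mu0, V0)."""
    T = x.shape[0]
    k = A.shape[0]
    m = np.zeros((T, k))
    V = np.zeros((T, k, k))
    loglik = 0.0
    for t in range(T):
        if t == 0:
            m_pred, P = mu0, V0                  # predicted p(s_1)
        else:
            m_pred = A @ m[t - 1]                # predicted mean A m_{t-1}
            P = A @ V[t - 1] @ A.T + Gamma       # P_{t-1}, eq. (47)
        S = C @ P @ C.T + Sigma                  # innovation covariance
        K = P @ C.T @ np.linalg.inv(S)           # Kalman gain K_t, eq. (46)
        m[t] = m_pred + K @ (x[t] - C @ m_pred)  # eq. (48)
        V[t] = P - K @ C @ P                     # eq. (49)
        r = x[t] - C @ m_pred                    # log c_t = log N(x_t; C m_pred, S), eq. (50)
        loglik += -0.5 * (r @ np.linalg.solve(S, r)
                          + np.log(np.linalg.det(2 * np.pi * S)))
    return m, V, loglik
```

With A = I and Gamma = 0 (a constant state), this reduces to the recursive posterior update for a static parameter mentioned in the text.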
3.2 Backward recursion: Kalman smoothing
Proceeding similarly, let α̂_t(s_t) β̂_t(s_t) = N(s_t; m̂_t, V̂_t). This is the marginal distribution for s_t,
given the entire measurement sequence.

The backward equation (30) gives us

    α̂_{t-1}(s_{t-1}) β̂_{t-1}(s_{t-1}) = N(s_{t-1}; m_{t-1}, V_{t-1}) ∫_z N(z; A s_{t-1}, Γ) N(x_t; C z, Σ) N(z; m̂_t, V̂_t) / (c_t α̂_t(z))    (51)
Combining the first two terms via (42) and substituting (44) for c_t α̂_t(z) causes massive cancellation, leaving

    α̂_{t-1}(s_{t-1}) β̂_{t-1}(s_{t-1})
      = ∫_z N(s_{t-1}; m_{t-1} + J_{t-1}(z - A m_{t-1}), V_{t-1} - J_{t-1} A V_{t-1}) N(z; m̂_t, V̂_t)
      = ∫_z N(s_{t-1} - m_{t-1} + J_{t-1} A m_{t-1}; J_{t-1} z, V_{t-1} - J_{t-1} A V_{t-1}) N(z; m̂_t, V̂_t)
      = N(s_{t-1} - m_{t-1} + J_{t-1} A m_{t-1}; J_{t-1} m̂_t, J_{t-1} V̂_t J_{t-1}^T + V_{t-1} - J_{t-1} A V_{t-1})
    where J_{t-1} = V_{t-1} A^T (P_{t-1})^{-1}    (52)
Therefore:

    m̂_{t-1} = m_{t-1} + J_{t-1}(m̂_t - A m_{t-1})    (53)
    V̂_{t-1} = V_{t-1} + J_{t-1}(V̂_t - P_{t-1}) J_{t-1}^T    (54)

since A V_{t-1} = P_{t-1} J_{t-1}^T. These match the equations given in Ghahramani (1996).
Since β̂_T(s_T) = 1, the backward recursion starts with

    m̂_T = m_T    (55)
    V̂_T = V_T    (56)
Equation 32 for the marginal pairwise density becomes

    p(s_{t-1}, s_t|x_1..x_T) = N(s_{t-1}; m_{t-1}, V_{t-1}) N(s_t; A s_{t-1}, Γ) N(x_t; C s_t, Σ) N(s_t; m̂_t, V̂_t) / (c_t α̂_t(s_t))    (57)

which is virtually the same as the backward equation above. Proceeding as before:

    p(s_{t-1}, s_t|x_1..x_T) = N(s_{t-1}; m_{t-1} + J_{t-1}(s_t - A m_{t-1}), V_{t-1} - J_{t-1} A V_{t-1}) N(s_t; m̂_t, V̂_t)

Therefore s_{t-1} and s_t are jointly normal:

    [s_{t-1}; s_t] ~ N( [m̂_{t-1}; m̂_t], [V̂_{t-1}, J_{t-1} V̂_t; V̂_t J_{t-1}^T, V̂_t] )    (58)
The recursion reported in Ghahramani (1996) to compute the cross-covariance between s_{t-1} and
s_t is not required; the answer is simply J_{t-1} V̂_t.
Every Linear Dynamical System has the property that the entire set of state variables and
measurement variables {s_1..s_T, x_1..x_T} is jointly Gaussian. In other words, the variables all
form a long Gaussian random vector whose mean and covariance matrix can be computed in
terms of A, C, Γ and Σ. In principle, one could use this fact to compute any probability of
interest; the forward-backward recursions just provide a particularly efficient way to do so.
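The backward pass (52)-(56) runs over the filtered moments m_t, V_t produced by the forward pass. A minimal Python/NumPy sketch of the Rauch-Tung-Striebel smoother follows; it is an illustration rather than the paper's code, and the function and variable names are hypothetical.

```python
import numpy as np

def rts_smoother(m, V, A, Gamma):
    """Rauch-Tung-Striebel smoother, equations (52)-(56).

    m : (T, k) filtered means m_t from the Kalman filter
    V : (T, k, k) filtered covariances V_t
    Returns the smoothed moments (m-hat_t, V-hat_t)."""
    T, k = m.shape
    mhat = np.zeros_like(m)
    Vhat = np.zeros_like(V)
    mhat[-1], Vhat[-1] = m[-1], V[-1]           # eqs. (55)-(56)
    for t in range(T - 1, 0, -1):
        P = A @ V[t - 1] @ A.T + Gamma          # P_{t-1}, eq. (47)
        J = V[t - 1] @ A.T @ np.linalg.inv(P)   # smoother gain J_{t-1}, eq. (52)
        mhat[t - 1] = m[t - 1] + J @ (mhat[t] - A @ m[t - 1])  # eq. (53)
        Vhat[t - 1] = V[t - 1] + J @ (Vhat[t] - P) @ J.T       # eq. (54)
    return mhat, Vhat
```

The cross-covariance between s_{t-1} and s_t, if needed for the M-step, is simply `J @ Vhat[t]` as noted above.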
Acknowledgements
I am indebted to Rosalind Picard for improving the presentation and Kenneth Russell for helpful
discussions.
References
[1] Zoubin Ghahramani and Geoffrey E. Hinton. Parameter estimation for linear dynamical systems. Technical Report CRG-TR-96-2, University of Toronto, 1996. http://www.cs.utoronto.ca/~zoubin/.
[2] Thomas P. Minka. Old and new matrix algebra useful for statistics. http://vismod.www.media.mit.edu/~tpminka/papers.html, 1997.
[3] Judea Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Francisco, CA, 1988.
[4] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77(2):257-286, Feb. 1989.