NUSC Technical Report 7989
11 June 1987

The Moments of Matched and Mismatched Hidden Markov Models

Roy L. Streit
Submarine Sonar Department

Naval Underwater Systems Center
Newport, Rhode Island / New London, Connecticut

Approved for public release; distribution is unlimited.
Preface

This report was prepared under NUSC Project No. A75210, "Application of Hidden Markov Models to Transient Classification," Principal Investigator Dr. R. L. Streit. This project is part of the NUSC Independent Research Program sponsored by the Office of Naval Research. The technical reviewer for this report was Dr. J. P. Ianniello (Code 2112).

The author thanks Dr. J. P. Ianniello of the Naval Underwater Systems Center, New London, CT, and Jeff Woodard of Rockwell International Corporation, Anaheim, CA, for their helpful comments and discussions.

Reviewed and Approved: 11 June 1987

W. A. Von Winkle
Associate Technical Director for Technology
REPORT DOCUMENTATION PAGE

Report Security Classification: UNCLASSIFIED
Distribution/Availability: Approved for public release; distribution is unlimited.
Performing Organization Report Number: TR 7989
Performing Organization: Naval Underwater Systems Center, New London Laboratory, New London, CT 06320
Funding/Sponsoring Organization: Office of Naval Research, Arlington, VA 22217-5000 (Program Element No. 61152N)
Title: THE MOMENTS OF MATCHED AND MISMATCHED HIDDEN MARKOV MODELS
Personal Author: Roy L. Streit
Type of Report: Final
Date of Report: 1987 June 11
Page Count: 40
Subject Terms: Hidden Markov Models (HMMs); Log Likelihood Ratio; Maximum Likelihood Classification; Moments; Receiver Operating Characteristics (ROC)

Abstract: An algorithm for computing the moments of matched and mismatched hidden Markov models from their defining parameters is presented. The ability of the first two moments to adequately describe the probability density function of a maximum posterior likelihood classifier based on hidden Markov models is assessed by examples. These examples include ergodic and nonergodic simulated hidden Markov observations that are matched and mismatched with the posterior likelihood classifier. One example discusses the effect of a noisy discrete communication channel on the posterior likelihood classifier reliability. The examples indicate that the posterior likelihood function is log-normal when the Markov chains are ergodic, and thus the first two moments suffice to describe the required probability density functions. The examples are also of independent interest because they indicate how different internal structures of hidden Markov models impact the performance of a maximum posterior likelihood classifier.

Responsible Individual: Roy L. Streit, (203) 440-4906, Office Symbol 214
TABLE OF CONTENTS

                                                              Page
LIST OF ILLUSTRATIONS ........................................ ii
LIST OF TABLES ............................................... iii
I.   INTRODUCTION ............................................ 1
II.  THE MOMENT ALGORITHM .................................... 6
     A. Finite Symbol HMMs ................................... 6
     B. Continuous Symbol HMMs ............................... 16
III. COMPARISON OF THEORETICAL MOMENTS WITH SIMULATION ....... 19
     A. Two Ergodic HMMs ..................................... 19
     B. Mixed Ergodic and Left-to-Right HMMs ................. 22
     C. Left-to-Right HMM With Noise ......................... 28
IV.  CONCLUSIONS ............................................. 34
REFERENCES ................................................... 35
LIST OF ILLUSTRATIONS

Figure                                                        Page
1  Classification of Unknown Signal s(t) as One of p
   Signals for Which Trained HMMs Are Available .............. 3
2  Histogram of 10000 Values of log dF_22(x) for T = 25.
   (The Normal Curve Has the Sample Mean and Variance
   Given in Table 3.) ........................................ 23
3  Histogram of 10000 Values of log dF_21(x) for T = 25.
   (The Normal Curve Has the Sample Mean and Variance
   Given in Table 3.) ........................................ 23
4  Histogram of 10000 Values of log dF_33(x) for T = 25.
   (The Normal Curve Has the Sample Mean and Variance
   Given in Table 5.) ........................................ 26
5  Histogram of 10000 Values of log dF_23(x) for T = 25.
   (The Normal Curve Has the Sample Mean and Variance
   Given in Table 7.) ........................................ 28
6  Histogram of 9879 Samples of log dF_34(x) for T = 25.
   (The Normal Curve Has the Sample Mean = -28.156 and
   the Variance = 3.6167.) ................................... 33
LIST OF TABLES

Table                                                         Page
1  Parameters of HMM(1) ...................................... 20
2  Parameters of HMM(2), Rounded to Three
   Significant Digits ........................................ 21
3  Comparison of Two Estimates for the Mean and Standard
   Deviation of log dF_2j(x) for j = 1, 2 .................... 24
4  Parameters of HMM(3) ...................................... 25
5  Comparison of Two Estimates for the Mean and Standard
   Deviation of log dF_33(x) ................................. 27
6  Number of 0^T ∈ HMM(i) for Which f_3(0^T) = 0,
   i = 1, 2 .................................................. 27
7  Comparison of Two Estimates for the Mean and Standard
   Deviation of log dF_23(x) ................................. 27
8  Number of 0^T ∈ HMM(3) + Noise for Which f_3(0^T) = 0
   at Various Values of E_T .................................. 31
9  Parameters of HMM(4), Rounded to Three
   Significant Digits ........................................ 32
THE MOMENTS OF MATCHED AND MISMATCHED HIDDEN MARKOV MODELS

I. INTRODUCTION

Hidden Markov models (HMMs) are statistical models of nonstationary time series or signals. In speech applications, they are used to characterize the time variation of the short term spectra of spoken words. A specific example is the speaker-independent isolated word recognition (SIIWR) problem, where HMMs characterize the words in a finite-size vocabulary. Different words are characterized by different HMMs.1

Every HMM is comprised of two basic parts: a Markov chain and a set of random variables. The Markov chain has a finite number of states, and each state is uniquely associated with one of the random variables. The state sequence generated by the chain is not observable; i.e., the Markov chain is "hidden." At each time t = 0, 1, 2, ..., the Markov chain is assumed to be in some state; it transitions to another state at time t + 1 according to its transition probability matrix. At each time t, one observation is generated by the random variable associated with the state of the Markov chain at time t. The observations are referred to as symbols. If the random variables assume only a finite set of values, the HMM is referred to as a finite symbol HMM. If the random variables assume a continuum of values, the HMM is called a continuous symbol HMM. The full parameter set defining an HMM is comprised of the initial state probability density function of the Markov chain at time t = 0, the Markov chain state transition probability matrix, and the probability density functions of each of the random observation variables.

The act of computing numerical values for the various parameters of an HMM is called "training," and training is equivalent to solving a mathematical optimization problem to determine maximum likelihood estimates of the HMM parameters.2 In this paper, it is assumed that the training phase is completed and that the HMMs developed are adequate models for each of the nonstationary time series, or signals, of interest (e.g., the vocabulary words in the SIIWR problem).
During the training phase, a suitable preprocessor is developed to map (or transform) an arbitrary input signal s(t), t > 0, into a discrete observation sequence {O(t), t = 1, 2, ...}. A good description of the way this is done for the SIIWR problem is given in reference 1 (pp. 1077-1078). The observation sequence is truncated to the length T required for the classification problem, where T is a positive integer. The truncated sequence 0^T = {O(t), t = 1, 2, ..., T} is then passed to each of the HMM recognizers, as depicted in figure 1. Each HMM utilized in the classification phase is characterized by the hypothesis that 0^T comes from a time series characterized by that HMM. Denote the i-th hidden Markov model by HMM(i). The i-th recognizer thus computes and evaluates the posterior likelihood function

$$f_i(0^T) = \Pr[0^T \mid 0^T \in \mathrm{HMM}(i)] , \qquad i = 1, \ldots, p , \tag{1}$$

where 0^T ∈ HMM(i) denotes the hypothesis that the observation sequence 0^T is a realization of HMM(i). The maximum of the p computed posterior likelihoods is assumed to identify, or classify, the original signal s(t). In practice, some kind of threshold must be set to identify signals for which HMMs have not been trained, and a tie-breaking rule must be defined. The likelihood function (1) can be computed in only n^2 T multiplications (where n is the number of states in the Markov chain) by using the forward-backward algorithm.2
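To make equation (1) concrete, the following sketch (illustrative Python, not part of the original report) implements the forward recursion of the forward-backward algorithm for a finite symbol HMM. The parameter names pi, A, and B anticipate the notation defined in section II, and the uniform five-state model in the example is HMM(1) of table 1, which assigns probability 8^(-T) to every sequence.

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """Posterior likelihood f(0^T) = Pr[0^T | HMM] via the forward recursion.

    pi  : (n,)   initial state probabilities
    A   : (n, n) state transition probability matrix
    B   : (n, m) symbol probability matrix (row i = state i's symbol pdf)
    obs : length-T sequence of symbol indices in 0..m-1
    """
    alpha = pi * B[:, obs[0]]           # alpha_1(j) = pi_j b_j(O(1))
    for t in range(1, len(obs)):
        # alpha_t(j) = [sum_i alpha_{t-1}(i) a_ij] b_j(O(t))
        alpha = (alpha @ A) * B[:, obs[t]]
    return alpha.sum()                  # sum over final states

# Example: the uniform five-state, eight-symbol HMM(1) of table 1.
pi = np.full(5, 0.2)
A = np.full((5, 5), 0.2)
B = np.full((5, 8), 0.125)
print(forward_likelihood(pi, A, B, [0, 3, 7, 2]))   # = 8**-4 ~ 2.44e-4
```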
The misclassification rate (or false alarm rate) of the system depicted in figure 1 can be estimated by simulation after training is completed. Alternatively, the misclassification rate of signal i as signal j can be determined from the conditional cumulative distribution functions

$$F_{ij}(x) = \Pr[f_i(0^T) \le x \mid 0^T \in \mathrm{HMM}(j)] \tag{2}$$

by using classical detection and estimation methods3 to develop receiver-operating characteristics (ROC) curves. The validity of misclassification rates based on F_ij(x) depends on the validity of the hypothesis that the HMMs developed are adequate models for the signals of interest. Agreement between theory and simulation would support the assumption that the HMMs really do represent the signals.
[Figure 1. Classification of Unknown Signal s(t) as One of p Signals for Which Trained HMMs Are Available]
Unfortunately, algorithms for calculating F_ij(x) directly from the HMM parameters are not known. For later reference, note that F_ij(x) ≠ F_ji(x) in general.

The moments of dF_ij(x) are defined by the Riemann-Stieltjes integral

$$M_{ij}(k,T) = \int x^k \, dF_{ij}(x) , \qquad k = 0, 1, 2, \ldots . \tag{3}$$

If F_ij(x) is differentiable with derivative F'_ij(x), then the moments can be written equivalently as the Riemann integral

$$M_{ij}(k,T) = \int x^k F_{ij}'(x) \, dx .$$

The moments depend on the length T of the observation sequence because F_ij(x) depends on T, as seen from equation (2). They uniquely determine dF_ij(x) when they are all finite and the characteristic function of dF_ij(x) has a positive radius of convergence.4 From equations (1) and (2), it is clear that dF_ij(x) = 0 for x < 0 and for x > 1. Thus, from equation (3),

$$M_{ij}(k,T) = \int_0^1 x^k \, dF_{ij}(x) \le 1 ,$$

so that all the moments are finite. The series

$$\sum_{\nu=0}^{\infty} \frac{M_{ij}(\nu,T)}{\nu!} \, (i\omega)^{\nu}$$

for the characteristic function of dF_ij(x) is absolutely convergent with an infinite radius of convergence because, for fixed ω_0 ≠ 0, each summand is bounded above in magnitude by |ω_0|^ν/ν!, so that the series is bounded by exp(|ω_0|). Consequently, the moments of dF_ij(x) uniquely determine dF_ij(x) and, thereby, F_ij(x).
This paper presents an algorithm for computing explicitly the moments of dF_ij(x) up to any desired order directly from the given underlying parameters of the HMMs involved. The only essential assumption made is the usual one that the HMMs have a finite number of states. However, it is not required that all HMMs have the same number of states.

This paper also presents examples that compare the first two theoretical moments with simulation results. The examples are of independent interest because they exhibit important features of posterior likelihood classification based on ergodic and left-to-right HMMs that theoretical analysis alone would not show as easily or as quickly. These features are important because they indicate how the internal structure of HMMs impacts the performance of the system depicted in figure 1.
II. THE MOMENT ALGORITHM

The reader is assumed to be familiar with such first principles of HMMs as given in reference 2. It is not, however, necessary to read this section to understand the examples provided in section III.
A. FINITE SYMBOL HMMs

Let HMM(u) be a hidden Markov process with n(u) states, u = 1, 2, .... Subscripted indices will always be written as functions of their subscripts (for instance, n(u) is used instead of a subscripted n) to avoid subscripted subscripts. Let the state transition probability matrix of HMM(u) be denoted as A_u = [a_{i(u),j(u)}], for i(u), j(u) = 1, ..., n(u). Let the initial state probability vector of HMM(u) be denoted W_u = [π_{i(u)}], for i(u) = 1, ..., n(u). We first restrict attention to finite symbol HMMs; that is, we suppose that every observation sequence 0^T = {O(t), t = 1, ..., T} is such that

$$O(t) \in V = \{V_1, \ldots, V_m\} ,$$

where V is the set of all possible output symbols of the preprocessor. The true nature of the symbols in V is of no importance here. The HMMs assume that O(t) is a random variable whose probability density function depends on the current state of the Markov process. Let the discrete probability density function for HMM(u) when it is in state i(u) be denoted as B_{i(u)}, for i(u) = 1, ..., n(u). Thus, each B_{i(u)} is a row vector of length m. Stacking these row vectors gives the n(u)-by-m symbol probability matrix

$$B_u = \begin{bmatrix} B_1 \\ B_2 \\ \vdots \\ B_{n(u)} \end{bmatrix} = [b_{i(u),j}] .$$

Note that

$$b_{i(u),j} = b_{i(u)}(V_j) ,$$

where we define

$$b_{i(u)}(O(t)) = \Pr[O(t) \mid \mathrm{HMM}(u) \text{ and Markov state} = i(u)] .$$

The assumption that the training phase is completed means that the parameters λ_u = (W_u, A_u, B_u) are known.
For finite symbol HMMs, the cumulative distribution function F_ij(x) is an increasing step function with a finite number of jump discontinuities. Let X_ij denote the set of all values of x at which F_ij(x) is discontinuous. It follows that dF_ij(x) = 0 if x is not in X_ij and that, for x in X_ij,

$$dF_{ij}(x) = F_{ij}(x) - F_{ij}(x^-) = \Pr[f_i(0^T) = x \mid 0^T \in \mathrm{HMM}(j)] . \tag{4}$$

Substituting equation (4) into equation (3) gives

$$M_{ij}(k,T) = \sum_{x \in X_{ij}} x^k \Pr[f_i(0^T) = x \mid 0^T \in \mathrm{HMM}(j)]
= \sum_{0^T} \{f_i(0^T)\}^k \Pr[0^T \mid 0^T \in \mathrm{HMM}(j)]
= \sum_{0^T} \{\Pr[0^T \mid 0^T \in \mathrm{HMM}(i)]\}^k \Pr[0^T \mid 0^T \in \mathrm{HMM}(j)] . \tag{5}$$
It is clear from equation (5) that M_ij(k,T) ≠ M_ji(k,T) in general for all i, j, and k > 1. For k = 1, however, we have M_ij(1,T) = M_ji(1,T).

The expression in equation (5) is directly computable for small T; however, it is not practical except for small T because the computational effort increases exponentially with T. To see this, note that the forward-backward algorithm calculates Pr[0^T | 0^T ∈ HMM(u)] from the parameters of HMM(u) in n^2(u) T multiplications. Thus, the calculation of each summand in equation (5) requires approximately k{[n(i)]^2 + [n(j)]^2} T multiplications. There are m^T different possible observation sequences 0^T = {O(t), t = 1, ..., T} because each output symbol O(t) can be any one of the m symbols in V. Thus, the direct calculation of equation (5) requires m^T times that many multiplications.
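For very small T, the exponential sum in equation (5) can nevertheless be evaluated by brute force, which later provides an independent check on the linear-time recursion. A minimal sketch, assuming the forward_likelihood routine from the earlier sketch and hypothetical HMM parameter tuples (pi, A, B):

```python
from itertools import product

def moment_bruteforce(hmm_i, hmm_j, k, T, m):
    """M_ij(k,T) by direct summation over all m**T observation sequences,
    equation (5).  Each term is {f_i(0^T)}**k * f_j(0^T); the cost grows
    exponentially with T, so this is practical only for tiny T."""
    total = 0.0
    for obs in product(range(m), repeat=T):
        fi = forward_likelihood(*hmm_i, obs)   # Pr[0^T | HMM(i)]
        fj = forward_likelihood(*hmm_j, obs)   # Pr[0^T | HMM(j)]
        total += fi**k * fj
    return total
```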
We now derive a recursion for equation (5) that requires computational effort that grows only linearly with T. The recursion is derived for a more general expression that contains equation (5) as a special case. For k = 1, 2, ..., define

$$R(k,T) = \sum_{0^T} \prod_{u=1}^{k} \Pr[0^T \mid 0^T \in \mathrm{HMM}(u)] . \tag{6}$$

The application of equation (6) to compute any moment from equation (5) is straightforward; for example, R(k+1,T) equals M_21(k,T) when HMM(2) = ... = HMM(k+1). Note that R(k,T) can be interpreted as a joint moment of HMMs.

The derivation of the recursion for R(k,T) proceeds as follows. The forward portion of the forward-backward algorithm gives the expression

$$\Pr[0^T \mid 0^T \in \mathrm{HMM}(u)] = \sum_{j(u)=1}^{n(u)} \alpha_T(j(u)) , \tag{7}$$

where, for 2 ≤ t ≤ T,

$$\alpha_t(j(u)) = \left[ \sum_{i(u)=1}^{n(u)} \alpha_{t-1}(i(u)) \, a_{i(u),j(u)} \right] b_{j(u)}(O(t)) , \tag{8}$$

and

$$\alpha_1(j(u)) = \pi_{j(u)} \, b_{j(u)}(O(1)) . \tag{9}$$
Substitute equation (7) into equation (6) to obtain

$$R(k,T) = \sum_{j(1)=1}^{n(1)} \cdots \sum_{j(k)=1}^{n(k)} \mu_T(j(1), \ldots, j(k)) , \tag{10}$$

where we define, for t = 1, ..., T,

$$\mu_t(j(1), \ldots, j(k)) = \sum_{0^t} \prod_{u=1}^{k} \alpha_t(j(u)) . \tag{11}$$

One interpretation of μ_t is that it equals R(k,t) given that HMM(u) must end in state j(u), u = 1, ..., k. We seek a recursion for μ_T. First suppose that 2 ≤ t ≤ T.
Then, substituting equation (8) into equation (11) gives

$$\mu_t(j(1), \ldots, j(k)) = \sum_{0^t} \prod_{u=1}^{k} \left\{ \left[ \sum_{i(u)=1}^{n(u)} \alpha_{t-1}(i(u)) \, a_{i(u),j(u)} \right] b_{j(u)}(O(t)) \right\}
= \sum_{0^{t-1}} \sum_{O(t)} \left[ \prod_{u=1}^{k} b_{j(u)}(O(t)) \right] \prod_{u=1}^{k} \sum_{i(u)=1}^{n(u)} \alpha_{t-1}(i(u)) \, a_{i(u),j(u)} .$$

Because α_{t-1}(i(u)) does not depend on the last symbol O(t) in the observation sequence 0^t = {O(1), ..., O(t)}, we have

$$\mu_t(j(1), \ldots, j(k)) = \left[ \sum_{O(t)} \prod_{u=1}^{k} b_{j(u)}(O(t)) \right] \sum_{0^{t-1}} \prod_{u=1}^{k} \sum_{i(u)=1}^{n(u)} \alpha_{t-1}(i(u)) \, a_{i(u),j(u)} .$$

Because the sum over O(t) is independent of the observation sequence 0^{t-1} as well as the indices i(u), and because of equation (11), we have

$$\mu_t(j(1), \ldots, j(k)) = r(j(1), \ldots, j(k)) \sum_{i(1)=1}^{n(1)} \cdots \sum_{i(k)=1}^{n(k)} \mu_{t-1}(i(1), \ldots, i(k)) \prod_{u=1}^{k} a_{i(u),j(u)} , \tag{12}$$

where

$$r(j(1), \ldots, j(k)) = \sum_{O(t)} \prod_{u=1}^{k} b_{j(u)}(O(t)) = \sum_{s=1}^{m} \prod_{u=1}^{k} b_{j(u)}(V_s) . \tag{13}$$
Equation (12) is the desired recursion for 2 ≤ t ≤ T. For t = 1, substituting equation (9) into equation (11) gives

$$\mu_1(j(1), \ldots, j(k)) = \sum_{O(1)} \prod_{u=1}^{k} \pi_{j(u)} \, b_{j(u)}(O(1)) = r(j(1), \ldots, j(k)) \prod_{u=1}^{k} \pi_{j(u)} . \tag{14}$$
Let k = 1. From the definition, it is clear that R(1,T) = 1 for all T, regardless of HMM(1), because the sum in equation (6) is over all 0^T. To independently check the recursion (12)-(13), note that, from equation (13),

$$r(j(1)) = \sum_{s=1}^{m} b_{j(1)}(V_s) = 1 .$$

From equation (14), we have μ_1(j(1)) = π_{j(1)}. Hence, from equation (10), we obtain

$$R(1,1) = \sum_{j(1)=1}^{n(1)} \pi_{j(1)} = 1 .$$

The recursion is verified for T = 1. For T = 2, from equation (12), we have

$$\mu_2(j(1)) = \sum_{i(1)=1}^{n(1)} \mu_1(i(1)) \, a_{i(1),j(1)} = \sum_{i(1)=1}^{n(1)} \pi_{i(1)} \, a_{i(1),j(1)} ,$$
so that, from equation (10),

$$R(1,2) = \sum_{j(1)=1}^{n(1)} \sum_{i(1)=1}^{n(1)} \pi_{i(1)} \, a_{i(1),j(1)} = \sum_{i(1)=1}^{n(1)} \pi_{i(1)} \left[ \sum_{j(1)=1}^{n(1)} a_{i(1),j(1)} \right] = 1 ,$$

and the recursion is verified for T = 2.

The first nontrivial special case is k = 2. In this case, R(2,T) is identically the first moment M_12(1,T). From equation (12), we have, for 2 ≤ t ≤ T,

$$\mu_t(j(1),j(2)) = r(j(1),j(2)) \sum_{i(1)=1}^{n(1)} \sum_{i(2)=1}^{n(2)} \mu_{t-1}(i(1),i(2)) \, a_{i(1),j(1)} \, a_{i(2),j(2)}$$

and, from equation (14),

$$\mu_1(j(1),j(2)) = r(j(1),j(2)) \, \pi_{j(1)} \, \pi_{j(2)} ,$$

where, from equation (13),

$$r(j(1),j(2)) = \sum_{s=1}^{m} b_{j(1)}(V_s) \, b_{j(2)}(V_s) .$$

From equation (10), then, we have

$$R(2,T) = \sum_{j(1)=1}^{n(1)} \sum_{j(2)=1}^{n(2)} \mu_T(j(1),j(2)) .$$

Computation of R(2,T) = M_12(1,T) is therefore not excessively laborious.
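A minimal sketch of the k = 2 case (illustrative Python, under the same assumptions as the earlier sketches): it evaluates R(2,T) = M_12(1,T) with effort that grows only linearly in T, and it can be checked against the brute-force sum of equation (5) for small T.

```python
import numpy as np

def joint_moment_R2(hmm1, hmm2, T):
    """R(2,T) = M_12(1,T) via the recursion (12)-(14) for k = 2.

    Each argument is a tuple (pi, A, B); mu[j1, j2] plays the role of
    mu_t(j(1), j(2)) in equation (11)."""
    (pi1, A1, B1), (pi2, A2, B2) = hmm1, hmm2
    r = B1 @ B2.T                   # equation (13): sum_s b_{j1}(V_s) b_{j2}(V_s)
    mu = r * np.outer(pi1, pi2)     # equation (14): mu_1
    for _ in range(T - 1):
        mu = r * (A1.T @ mu @ A2)   # equation (12): one time step
    return mu.sum()                 # equation (10): sum over terminal indices

# Sanity check: pairing any HMM with the uniform HMM(1) of table 1 must give
# 8**(-T), because f(0^T) = 8**(-T) is then constant over all sequences.
```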
The evaluation of R(k,T) using the recursion (12) is properly broken into two parts. The first is the precalculation of r(j(1), ..., j(k)) for every possible value of the indices j(u). This requires (k-1) m N^k multiplications and N^k storage locations, where

$$N = \left[ \prod_{u=1}^{k} n(u) \right]^{1/k} \tag{15}$$

is the geometric mean of the number of different states in the various HMMs and is not necessarily an integer. If N = 8 and if there are m = 16 different observation symbols, then computing and storing r for k = 6 requires 262144 storage locations and 2.1 x 10^7 multiplications. Storage is clearly more crucial an issue than multiplications.

It is possible in some cases to utilize the underlying symmetries of r to reduce both storage and computational effort. For example, if HMM(2) = ... = HMM(k+1), then

$$r(j(1), j(2), \ldots, j(k+1)) = r(j(1), \sigma(j(2)), \ldots, \sigma(j(k+1))) \tag{16}$$

for every permutation σ of the k integers j(2), ..., j(k+1). The proof of equation (16) follows easily from equation (13) because multiplication is commutative. Thus one only need consider indices that satisfy

$$1 \le j(1) \le n(1) \quad \text{and} \quad 1 \le j(2) \le j(3) \le \cdots \le j(k+1) \le n(2) .$$

The number of such ordered index sets (j(u)) is equal to the number of combinations of n(2) letters taken k at a time, when each letter may be repeated any number of times up to k. Storage is therefore proportional to

$$n(1) \, \frac{n(2) (n(2)+1) \cdots (n(2)+k-1)}{k!} ,$$

which is significantly smaller than the [n(2)]^k n(1) storage that would otherwise be necessary. The total multiplication count is also reduced proportionately.
Once r has been computed and stored, the recursion (12) can be computed for any given value of T, the length of the observation sequence. For each of the N^k sets of indices {j(u)}, the sum

$$\sum_{i(1)=1}^{n(1)} \cdots \sum_{i(k)=1}^{n(k)} \mu_{t-1}(i(1), \ldots, i(k)) \, a_{i(1),j(1)} \cdots a_{i(k),j(k)}$$

in equation (12) must be undertaken. This sum appears to require k N^k multiplications for each index set; however, by using the nested form

$$\sum_{i(1)=1}^{n(1)} a_{i(1),j(1)} \left[ \sum_{i(2)=1}^{n(2)} a_{i(2),j(2)} \left[ \cdots \left[ \sum_{i(k)=1}^{n(k)} a_{i(k),j(k)} \, \mu_{t-1}(i(1), \ldots, i(k)) \right] \cdots \right] \right] ,$$

it is possible to use only N(N^k - 1)/(N - 1), or approximately N^k, multiplications instead. If lower order terms are neglected, computing μ_t from μ_{t-1} therefore requires about N^{2k} multiplications, and computing μ_T for an observation sequence of length T requires on the order of N^{2k} T multiplications. If N = 8 and T = 32, then 2.2 x 10^{12} multiplications are required for k = 6. Assuming a multiplication takes one microsecond, the calculation requires 611 hours and is clearly impractical.
Significant reduction in computational effort is possible in some cases by utilizing the underlying symmetries in μ_t. For example, if HMM(2) = ... = HMM(k+1), then

$$\mu_t(j(1), j(2), \ldots, j(k+1)) = \mu_t(j(1), \sigma(j(2)), \ldots, \sigma(j(k+1))) \tag{17}$$

for every permutation σ of the k integers j(2), ..., j(k+1). The proof of equation (17) follows easily by induction from the recursion (12) because r satisfies the same symmetry property (16) and because multiplication is commutative in this case. Thus, the recursion (12) need be computed for only the ordered sets of indices described above, and the total multiplication count is significantly smaller than the N^{2k} T multiplications that would otherwise be needed. For the above example requiring 611 hours, if N = n(1) = n(2) = 8 and if the symmetry (17) is utilized, the calculation would be reduced to roughly a 96-minute calculation. Utilizing symmetry is clearly significant in that it can turn an impractically long calculation into a feasible shorter one.
Underflow is potentially a problem when the recursion (12) is computed. It can easily be overcome in exactly the same manner as pointed out in reference 2 for preventing numerical underflow during the calculation of the forward-backward algorithm. Specifically, let the scaled values of μ_t(j(1), ..., j(k)) be the values in use in the recursion (12): let μ_t be computed according to equation (12) and then multiplied by a scale factor c_t defined by

$$c_t = \left[ \sum_{j(1)=1}^{n(1)} \cdots \sum_{j(k)=1}^{n(k)} \mu_t(j(1), \ldots, j(k)) \right]^{-1} . \tag{18}$$

If we continue in this fashion to compute μ_{t+1} and recall the expression in equation (10), it follows that

$$R(k,T) = \prod_{t=1}^{T} c_t^{-1} . \tag{19}$$

Because the product cannot be evaluated without underflow, we compute instead

$$\log R(k,T) = -\sum_{t=1}^{T} \log c_t . \tag{20}$$

Any convenient scale factor can be used instead of equation (18). A potentially useful one might be to take τ_t = N^k. Using τ_t would eliminate the effort of computing the sum in equation (18) before scaling.
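The scaling of equations (18)-(20) is easy to add to the k = 2 sketch given earlier; the following illustrative Python computes log R(2,T) directly, so that large T no longer causes underflow.

```python
import numpy as np

def log_joint_moment_R2(hmm1, hmm2, T):
    """log R(2,T) with the per-step scaling of equations (18)-(20)."""
    (pi1, A1, B1), (pi2, A2, B2) = hmm1, hmm2
    r = B1 @ B2.T
    mu = r * np.outer(pi1, pi2)        # unscaled mu_1
    log_R = 0.0
    for t in range(T):
        if t > 0:
            mu = r * (A1.T @ mu @ A2)  # one step of the recursion (12)
        s = mu.sum()                   # s = 1/c_t, equation (18)
        log_R += np.log(s)             # accumulates -log c_t, equation (20)
        mu /= s                        # rescale so the entries stay near one
    return log_R
```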
B. CONTINUOUS SYMBOL HMMs

The objective of this section is to show that the moment algorithm for discrete symbol HMMs can be carried over essentially unchanged to continuous symbol HMMs. In fact, it holds also for continuous vector symbol HMMs; however, only the continuous symbol HMMs are treated here for simplicity.

Throughout this section, it is assumed that each output symbol O(t) is a real random variable defined on some underlying event space, V. The probability density function of O(t) is uniquely defined for each state i(u) = 1, ..., n(u) of HMM(u), u = 1, 2, ..., and is denoted as b_{i(u)}(x). Thus, for real numbers α and β with α < β, we have
$$\int_{\alpha}^{\beta} b_{i(u)}(x) \, dx = \Pr[\alpha < O(t) < \beta \mid \mathrm{HMM}(u) \text{ and Markov state} = i(u)] . \tag{21}$$

An observation sequence 0^T = {x_t, t = 1, 2, ..., T} is a sequence of real numbers, with each x_t being a realization of the random variable O(t). The posterior likelihood function f_u(0^T) is now a probability density function for continuous symbol HMMs, as opposed to a simple probability (see equation (1)) for discrete symbol HMMs. Thus, for real vectors ᾱ and β̄ with ᾱ < β̄, we have

$$\int_{\bar{\alpha}}^{\bar{\beta}} f_u(0^T) \, d0^T = \Pr[\bar{\alpha} < 0^T < \bar{\beta} \mid 0^T \in \mathrm{HMM}(u)] , \tag{22}$$

where 0^T ∈ HMM(u) denotes the hypothesis that 0^T is a realization of HMM(u) and d0^T = dx_1 ... dx_T.

The conditional cumulative distribution functions F_ij(x) are defined by equation (2), just as for discrete symbol HMMs. From equation (3), we have the moments
$$M_{ij}(k,T) = \int x^k \, dF_{ij}(x) = \underbrace{\int \cdots \int}_{T\text{-fold}} \{f_i(0^T)\}^k \, f_j(0^T) \, d0^T , \tag{23}$$

which is the T-fold continuous analog of equation (5). It is clear from equation (23) that M_ij(k,T) = M_ji(k,T) in general only for the special case k = 1.

The analog of equation (6) for continuous HMMs is

$$R(k,T) = \underbrace{\int \cdots \int}_{T\text{-fold}} \prod_{u=1}^{k} f_u(0^T) \, d0^T . \tag{24}$$

The forward-backward algorithm for computing the posterior likelihood function for continuous HMMs is modified as follows:

$$f_u(0^T) = \sum_{j(u)=1}^{n(u)} \alpha_T(j(u)) , \tag{25}$$

where α_T(j(u)) is computed exactly as given by the recursions in equations (8) and (9), with the only difference being that b_{j(u)}(O(t)) is now interpreted as the probability density function implicit in equation (21). Consequently, equation (10) still holds exactly if we define
$$\mu_t(j(1), \ldots, j(k)) = \underbrace{\int \cdots \int}_{t\text{-fold}} \prod_{u=1}^{k} \alpha_t(j(u)) \, d0^t \tag{26}$$

as the analog of equation (11). Proceeding as before with t-fold integrals replacing t-fold summations gives exactly the recursion (12), but with the one-dimensional integral

$$r(j(1), \ldots, j(k)) = \int \prod_{u=1}^{k} b_{j(u)}(x) \, dx \tag{27}$$

in place of equation (13).

The remarks in the preceding section concerning storage, multiplication counts, and symmetry properties all apply for continuous symbol HMMs. The primary difference is that equation (27) requires an integral instead of a finite sum as in equation (13). This evaluation increases the initial computational overhead, but once equation (27) is computed, the algorithm (12) proceeds exactly as before.
III. COMPARISON OF THEORETICAL MOMENTS WITH SIMULATION

Ergodic Markov chains are those for which it is possible to transition from every state to every other state, although not necessarily in one step. Left-to-right Markov chains are those for which transitions to lower numbered states are not allowed, that is, have probability zero. These two types of chains are sufficiently different that they are considered separately in the examples.

One interpretation is that ergodic HMMs are models of quasi-stationary signals, while left-to-right HMMs are models of transient signals that ultimately become stationary (because the highest numbered state is not exited once it is entered). One might therefore expect these two types of HMMs to impact the performance of the maximum likelihood classifier depicted in figure 1 in different ways. The three examples given in this section support this expectation.
Using the above interpretation, the examples may be described as follows. The first example shows that maximum likelihood classification based on HMMs can reliably distinguish between sufficiently long quasi-stationary signals with a reasonable amount of computational effort. The second example shows that short quasi-stationary and transient signals look significantly different to the HMM transient recognizer, but not to the HMM recognizer based on the quasi-stationary signal. The third example shows that noisy observations of transient signals adversely affect classification performance by making the transient signal appear to have a stationary component, which is then misclassified by the HMM transient recognizer.
A. TWO ERGODIC HMMs

HMM(1) and HMM(2) are five-state, eight-symbol ergodic models whose parameters are given (rounded to three significant decimals) in tables 1 and 2, respectively. HMM(1) generates observation sequences in which every symbol is uniformly distributed in every state. HMM(2) is clearly more complex in structure, but every symbol can be generated in every state. The fundamental question of interest here is the following: How long must an observation sequence be to guarantee that maximum posterior probability classification (described in the Introduction) will be 98 percent reliable? We will give what may best be described as a semiempirical answer to this question.

Because of the nature of HMM(1), it is easy to see that

$$f_1(0^T) = \Pr[0^T \mid 0^T \in \mathrm{HMM}(1)] = 8^{-T} .$$

In other words, the posterior likelihood function f_1(0^T) is constant because all observation sequences are equally likely if 0^T ∈ HMM(1). In particular, f_1(0^T) cannot distinguish 0^T ∈ HMM(1) from 0^T ∈ HMM(2) and thus is useless for classification. Ten-thousand observation sequences 0^T ∈ HMM(1) were generated, and the posterior likelihood function f_2(0^T), based on HMM(2) instead of HMM(1), was computed for each using the forward-backward algorithm.

Table 1. Parameters of HMM(1)

NUMBER OF MARKOV STATES = 5
NUMBER OF SYMBOLS PER STATE = 8

INITIAL STATE PROBABILITY VECTOR:
2.00E-01  2.00E-01  2.00E-01  2.00E-01  2.00E-01

TRANSITION PROBABILITY MATRIX:
2.00E-01  2.00E-01  2.00E-01  2.00E-01  2.00E-01
2.00E-01  2.00E-01  2.00E-01  2.00E-01  2.00E-01
2.00E-01  2.00E-01  2.00E-01  2.00E-01  2.00E-01
2.00E-01  2.00E-01  2.00E-01  2.00E-01  2.00E-01
2.00E-01  2.00E-01  2.00E-01  2.00E-01  2.00E-01

SYMBOL PROBABILITY MATRIX (TRANSPOSED):
(all 40 entries equal to 1.25E-01)
Table 2. Parameters of HMM(2), Rounded to Three Significant Digits

NUMBER OF MARKOV STATES = 5
NUMBER OF SYMBOLS PER STATE = 8

INITIAL STATE PROBABILITY VECTOR:
1.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00

TRANSITION PROBABILITY MATRIX (entries as printed):
1.24E-01  2.13E-01  1.27E-01  1.15E-01  2.82E-01
1.94E-01  2.34E-01  3.38E-01  2.75E-01  2.98E-02
2.35E-01  3.08E-01  1.14E-01  2.99E-01  3.20E-01
1.72E-01  4.97E-01  1.53E-02  2.49E-02  4.27E-01
1.39E-01  8.28E-02  3.23E-02  9.13E-02  1.33E-01

SYMBOL PROBABILITY MATRIX (TRANSPOSED) (entries as printed):
1.81E-01  1.22E-01  7.89E-03  1.48E-01  7.04E-02
1.40E-01  1.40E-01  4.37E-02  9.73E-02  2.36E-01
2.67E-02  1.79E-01  1.56E-01  1.60E-01  1.66E-01
1.58E-01  5.87E-02  2.18E-01  2.15E-01  1.08E-01
1.30E-01  2.09E-01  2.34E-01  5.97E-02  2.35E-01
1.19E-01  5.75E-02  1.11E-01  1.02E-01  1.03E-01
1.76E-01  2.37E-02  1.32E-01  1.22E-01  2.40E-01
1.17E-01  6.61E-02  1.46E-01  1.76E-02  1.47E-01
Figure 2 shows a histogram of the natural logarithm of the matched posterior likelihood, log dF_22(x), for T = 25; the observation sequences are thus matched to the posterior likelihood function. Figure 3 shows a histogram of log dF_21(x) for T = 25; in figure 3, then, the observation sequences are mismatched to the likelihood function. As is clear from figures 2 and 3, the difference between the mean values of the log likelihood functions is about 1.4 standard deviations. Thus, the potential exists for using log dF_22(x) to classify observation sequences; however, T = 25 is not long enough to classify with high reliability.

A useful observation drawn from figures 2 and 3 is that the probability density function of log dF_2j(x) is nicely approximated by the normal distribution. Let ν_ij and σ_ij denote the mean and standard deviation of log dF_ij(x). Then, if dF_ij(x) is log-normal, it is easy to show that ν_ij and σ_ij are related to the moments M_ij(k,T) by the formulas
$$\nu_{ij} = 2 \log M_{ij}(1,T) - \tfrac{1}{2} \log M_{ij}(2,T) \tag{28}$$

$$\sigma_{ij}^2 = \log M_{ij}(2,T) - 2 \log M_{ij}(1,T) . \tag{29}$$

It is stressed that equations (28) and (29) hold exactly if and only if dF_ij(x) is truly log-normal. For finite symbol HMMs, dF_ij(x) is necessarily discrete, so that both equations (28) and (29) must be viewed as approximations. Sufficient conditions under which it may be proved that dF_ij(x) is, in some sense, approximately log-normal are unknown.
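Equations (28) and (29) translate directly into code; a minimal sketch (illustrative, with the moment values assumed to come from the algorithm of section II):

```python
import math

def lognormal_params(M1, M2):
    """Mean and standard deviation of log dF_ij(x) implied by the first two
    moments M_ij(1,T) and M_ij(2,T), assuming dF_ij(x) is log-normal."""
    nu = 2.0 * math.log(M1) - 0.5 * math.log(M2)    # equation (28)
    sigma2 = math.log(M2) - 2.0 * math.log(M1)      # equation (29)
    return nu, math.sqrt(sigma2)
```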
Table 3 gives a comparison between the mean and standard deviations of log dF_2j(x) estimated from 10000 observation sequences and those calculated from the approximations of equations (28) and (29). The table shows good agreement between equations (28) and (29) and the sample means and variances. It also establishes that observation sequences of length T = 400 are long enough to distinguish 0^T ∈ HMM(1) from 0^T ∈ HMM(2) with high reliability. That is, the difference between the mean value of log dF_21(x) and the mean value of log dF_22(x) is about 5.2 standard deviations. Assuming that log dF_21(x) and log dF_22(x) are normally distributed, the reliability of the maximum posterior classifier is about 98 percent.
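The separation figure is easy to verify; the snippet below (illustrative, with the T = 400 sample values hard-coded from table 3) computes it.

```python
# Sample statistics of log dF_21(x) and log dF_22(x) at T = 400 (table 3)
m21, s21 = -845.8, 5.30   # mismatched: 0^T generated by HMM(1)
m22, s22 = -818.1, 5.37   # matched:    0^T generated by HMM(2)

separation = (m22 - m21) / (0.5 * (s21 + s22))
print(f"{separation:.1f} standard deviations")   # about 5.2
```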
Computing the posterior likelihood function f_2(0^400) requires n^2 T = 10000 multiplications with the forward-backward algorithm. This is about the same computational effort as computing one 512-point FFT, which requires (2)(512)(log_2 512) = 9216 multiplications. In other words, the computational requirements for computing f_2(0^T) are small enough for practical application. Furthermore, f_2(0^T) is mathematically equivalent to a nested sequence of matrix-vector multiplications. Consequently, it is possible to reduce total computation time by the design of a "black box" to exploit this special structure in hardware.
B. MIXED ERGODIC AND LEFT-TO-RIGHT HMMs

HMM(3) is a five-state, eight-symbol left-to-right model whose parameters are given in table 4. It has a structure that might conceivably arise in the SIIWR problem.
[Figure 2. Histogram of 10000 Values of log dF_22(x) for T = 25. (The Normal Curve Has the Sample Mean and Variance Given in Table 3.)]

[Figure 3. Histogram of 10000 Values of log dF_21(x) for T = 25. (The Normal Curve Has the Sample Mean and Variance Given in Table 3.)]
Table 3. Comparison of Two Estimates for the Mean and Standard Deviation of log dF_2j(x) for j = 1, 2

            Mean Value            Standard Deviation
   T      Eq. 28    Sample        Eq. 29    Sample

j = 1
   5       -10.8     -10.6         0.95      0.71
   10      -21.3     -21.2         1.11      0.92
   15      -31.9     -31.8         1.24      1.10
   20      -42.4     -42.4         1.35      1.25
   25      -53.0     -52.9         1.47      1.38
   50     -105.8    -105.8         1.93      1.91
   100    -211.4    -211.5         2.62      2.67
   200    -422.6    -423.0         3.60      3.76
   400    -845.0    -845.8         5.09      5.30

j = 2
   5       -10.1     -10.1         0.69      0.59
   10      -20.3     -20.3         0.90      0.84
   15      -30.6     -30.5         1.08      1.03
   20      -40.8     -40.8         1.23      1.20
   25      -51.0     -51.0         1.37      1.34
   50     -102.1    -102.1         1.92      1.90
   100    -204.4    -204.4         2.66      2.69
   200    -408.9    -409.0         3.77      3.80
   400    -818.0    -818.1         5.33      5.37
Note that HMM(3) never leaves the fifth state once it is entered, and that the fifth state has only the three symbols V_6, V_7, and V_8. Consequently, all sufficiently long observation sequences must ultimately contain only these three symbols. Note also that an observation sequence 0^T containing the symbol V_5 and subsequently containing the symbol V_1 or V_2 is forbidden; i.e., these observation sequences make f_3(0^T) zero. Other forbidden sequences may also be noticed. It follows that the forbidden sequences make the posterior likelihood f_3(0^T) a powerful discriminator against long ergodic observation sequences. To summarize briefly, this example will show that short observation sequences of quasi-stationary and transient HMMs look very different to the transient HMM recognizer. On the other hand, all observation sequences look somewhat alike to the ergodic HMM recognizers.

Table 4. Parameters of HMM(3)

NUMBER OF MARKOV STATES = 5
NUMBER OF SYMBOLS PER STATE = 8

INITIAL STATE PROBABILITY VECTOR:
1.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00

TRANSITION PROBABILITY MATRIX:
6.00E-01  4.00E-01  0.00E+00  0.00E+00  0.00E+00
0.00E+00  7.00E-01  2.00E-01  1.00E-01  0.00E+00
0.00E+00  0.00E+00  6.00E-01  4.00E-01  0.00E+00
0.00E+00  0.00E+00  0.00E+00  7.00E-01  3.00E-01
0.00E+00  0.00E+00  0.00E+00  0.00E+00  1.00E+00

SYMBOL PROBABILITY MATRIX (TRANSPOSED):
9.00E-01  1.00E-01  0.00E+00  0.00E+00  0.00E+00
1.00E-01  6.00E-01  0.00E+00  0.00E+00  0.00E+00
0.00E+00  2.00E-01  3.00E-01  0.00E+00  0.00E+00
0.00E+00  1.00E-01  6.00E-01  1.00E-01  0.00E+00
0.00E+00  0.00E+00  1.00E-01  2.00E-01  0.00E+00
0.00E+00  0.00E+00  0.00E+00  4.00E-01  1.00E-01
0.00E+00  0.00E+00  0.00E+00  3.00E-01  6.00E-01
0.00E+00  0.00E+00  0.00E+00  0.00E+00  3.00E-01

When HMM(3) enters its fifth state, it becomes stationary and, consequently, the length of the transient portion of HMM(3) is the number of transitions of the Markov process before its first passage into its fifth state. Insight into the transient portion of HMM(3) is gained by estimating the first passage time of HMM(3) into its fifth state. The mean and variance of the first passage time may be computed explicitly;6 however, simulation was used here instead. In 10000 observation sequences generated for HMM(3), it was found that the mean and standard deviation of the first passage time were 10.9 and 4.8 transitions, respectively, and the largest first passage time observed was 43 transitions. Thus, observation sequences almost certainly become stationary for t > 50.
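The first passage statistics are easy to reproduce; the sketch below (illustrative Python, using the table 4 transition matrix) estimates them by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Transition probability matrix of HMM(3) (table 4); state 5 is absorbing.
A3 = np.array([[0.6, 0.4, 0.0, 0.0, 0.0],
               [0.0, 0.7, 0.2, 0.1, 0.0],
               [0.0, 0.0, 0.6, 0.4, 0.0],
               [0.0, 0.0, 0.0, 0.7, 0.3],
               [0.0, 0.0, 0.0, 0.0, 1.0]])

def first_passage(A, start=0, target=4):
    """Number of transitions before the chain first enters the target state."""
    state, steps = start, 0
    while state != target:
        state = rng.choice(len(A), p=A[state])
        steps += 1
    return steps

samples = [first_passage(A3) for _ in range(10000)]
print(np.mean(samples), np.std(samples), max(samples))  # about 10.9, 4.8, ~43
```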
Figure 4 and table 5 clearly show that dF_33(x) is a "well-behaved" distribution, even though HMM(3) is not ergodic. However, dF_33(x) is not as closely approximated by a log-normal distribution as dF_22(x) and dF_21(x) are, as evidenced by the discrepancy in table 5 between the sample statistics and the statistics that would hold if dF_33(x) were truly log-normal.

[Figure 4. Histogram of 10000 Values of log dF_33(x) for T = 25. (The Normal Curve Has the Sample Mean and Variance Given in Table 5.)]
Ten-thousand observation sequences of HMM(1) and of HMM(2) were generated, and the posterior likelihood f_3(0^T), which is thus mismatched to the observation sequences, was computed using the forward-backward algorithm. Table 6 gives the number of sequences for which f_3(0^T) = 0. Better than 99-percent rejection of the ergodic observations was attained for T = 10, that is, when the simulated observation sequences were about as long as the mean first passage time of HMM(3). Total rejection of all 10000 ergodic observations occurred for T = 20. The ability of f_3(0^T) to reject observations of 0^T ∈ HMM(2) is much more impressive than the ability of f_2(0^T) to reject observations of 0^T ∈ HMM(3). The lack of symmetry of F_ij(x) is striking in this instance.

Figure 5 is a histogram of the 10000 samples of log dF_23(x) for the case T = 25. Table 7 gives two estimates of the mean and standard deviation of log dF_23(x). The mean values of the samples and those predicted by equation (28) agree very well; however, dF_23(x) is not as well approximated by a log-normal as dF_22(x), as seen from the discrepancy between the sample and the predicted standard deviations.

Table 5. Comparison of Two Estimates for the Mean and Standard Deviation of log dF_33(x)

           Mean Value           Standard Deviation
   T     Sample    Eq. 28       Sample    Eq. 29
   5      -5.6      -4.9         1.92      1.13
   10    -12.8     -11.8         2.30      1.91
   15    -18.6     -18.2         2.72      2.33
   20    -23.5     -22.6         3.22      2.29
   25    -28.1     -26.3         3.61      2.15
   50    -50.5     -47.2         4.59      2.75

Table 6. Number of 0^T ∈ HMM(i) for Which f_3(0^T) = 0, i = 1, 2

   T     HMM(1)    HMM(2)
   5      9389      9172
   10     9937      9918
   15     9997      9988
   20    10000     10000

Table 7. Comparison of Two Estimates for the Mean and Standard Deviation of log dF_23(x)

           Mean Value           Standard Deviation
   T     Sample    Eq. 28       Sample    Eq. 29
   5     -10.8     -10.8         0.51      0.60
   10    -21.4     -21.4         0.91      0.89
   15    -32.0     -32.0         1.02      1.04
   20    -42.6     -42.7         1.04      1.15
   25    -53.2     -53.5         1.05      1.12
   50   -106.4    -106.8         1.11      1.43
[Figure 5. Histogram of 10000 Values of log dF_23(x) for T = 25. (The Normal Curve Has the Sample Mean and Variance Given in Table 7.)]

In any event, it is clear by comparing table 7 with the lower half of table 3 that f_2(0^T) cannot reliably distinguish 0^T ∈ HMM(2) from 0^T ∈ HMM(3) when T = 50. However, since the first passage time of HMM(3) is almost certainly less than T = 50, increasing the observation sequence length to improve reliability is not appropriate if the underlying intent is the classification of the transient portion of HMM(3).
C. LEFT-TO-RIGHT HMM WITH NOISE

In this example, the effect of noise on the maximum posterior likelihood classifier is assessed for the left-to-right model HMM(3). The right way to study noise in finite symbol HMMs is to add the noise to the original time series s(t) and then analyze the particular preprocessor under consideration to determine the noisy symbol sequence. However, no particular preprocessor is proposed here, and so we resort to modeling noise in much the same way that Shannon modeled noisy discrete memoryless channels.7 This approach can give an indication of the successful classification rate as a function of the probable number of incorrect symbols in an observation sequence, but it cannot provide an assessment of the effect of signal-to-noise ratio on classification because such an assessment requires knowledge of the preprocessor.
Denote by h_kj the probability that the observation symbol V_k is altered to symbol V_j by the noise mechanism, and define the m-by-m noise probability matrix H = [h_kj]. It is assumed that H is independent of the state of the Markov chain and of the time t. Consequently, the output of a given HMM corrupted by noise is equivalent to the output of another HMM that is noiseless. If λ = (π, A, B) are the parameters of a given HMM with noise matrix H, the parameters of the equivalent noiseless HMM are λ̃ = (π, A, B̃), where B̃ = BH. The proof is straightforward: the product b_ik h_kj is the probability that the state of the Markov chain is i and that symbol V_j is produced, given that symbol V_k was the output of the given HMM. The sum over k of b_ik h_kj therefore gives the probability that symbol V_j is produced while the Markov chain is in state i; but this sum is the (i,j) component of the matrix product BH, so that B̃ = BH.

The noise probability matrix H must be row stochastic; that is, every row sum must equal one. The HMM-generated symbol V_k that is altered by noise is altered to one of the available symbols, so that row k must sum to one. Because H has row sums equal to one, the matrix B̃ = BH is a valid symbol probability matrix for the equivalent noiseless HMM; that is, each row of B̃ sums to one. We have

$$\sum_{j=1}^{m} \tilde{b}_{ij} = \sum_{j=1}^{m} \sum_{k=1}^{m} b_{ik} h_{kj} = \sum_{k=1}^{m} b_{ik} \sum_{j=1}^{m} h_{kj} = \sum_{k=1}^{m} b_{ik} = 1 .$$
The worst case noise probability matrix, denoted H^0, has the constant entry h_kj = 1/m for all k and j. In this case,

$$\tilde{b}_{ij} = \sum_{k=1}^{m} b_{ik} \, \frac{1}{m} = \frac{1}{m} \qquad \text{for all } i \text{ and } j .$$

Consequently, the noise probability matrix H^0 makes all HMMs statistically indistinguishable. In fact, HMMs with noise H^0 are equivalent to the ergodic HMM(1) given in table 1.
Let Pr[V_i] be the relative frequency of occurrence of the symbol V_i in observation sequences of length T before the addition of noise. Thus, we have Σ_i Pr[V_i] = 1. After alteration by noise, the probability of correct occurrences of V_i in 0^T is Pr[V_i] h_ii. The probability that the symbol O(t) ∈ 0^T is correct is then

$$D_T = \sum_{i=1}^{m} \Pr[V_i] \, h_{ii} \tag{30}$$

and the probability that O(t) is incorrect is

$$E_T = 1 - D_T . \tag{31}$$

For the examples here, given a specific value of E_T, we choose the simple noise probability matrix H defined by

$$h_{ii} = 1 - E_T , \quad \text{all } i ; \qquad h_{ij} = \frac{E_T}{m-1} , \quad \text{all } i \ne j . \tag{32}$$

For this choice of H, E_T is independent of the actual values of Pr[V_i], as is clear from equation (30) and the fact that Σ_i Pr[V_i] = 1.
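Constructing the equivalent noiseless HMM is a single matrix product. The sketch below (illustrative Python) builds H from equation (32) and forms B̃ = BH; applied to the HMM(3) symbol matrix of table 4 with E_T = 0.001, it reproduces the HMM(4) parameters of table 9 to three significant digits.

```python
import numpy as np

def noise_matrix(m, E):
    """Equation (32): a symbol is kept with probability 1 - E and otherwise
    altered uniformly to one of the m - 1 other symbols."""
    H = np.full((m, m), E / (m - 1))
    np.fill_diagonal(H, 1.0 - E)
    return H

H = noise_matrix(8, 0.001)
assert np.allclose(H.sum(axis=1), 1.0)    # H is row stochastic

# With B3 the 5-by-8 symbol probability matrix of HMM(3) (shown transposed in
# table 4), the equivalent noiseless symbol matrix would be B4 = B3 @ H.
```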
Noise tends to make observation sequences of all HMMs look like observations of the ergodic HMM(1), and observation sequences for the left-to-right HMM(3) therefore tend to have forbidden symbol sequences. The natural first issue is to determine how likely forbidden symbol sequences are as a function of T and E_T. Table 8 gives the results of 10000 simulations of noisy observations of HMM(3) for various values of T and E_T. It shows that forbidden symbol sequences occur less often for small values of T and E_T than for large values. It also shows that noisy observations of HMM(3) do not have as high a proportion of forbidden symbol sequences as observations of HMM(1) and HMM(2), as can be seen by comparing tables 6 and 8. One may conclude from table 8 that E_T must be small and T must be short to minimize misclassification due to forbidden symbol sequences. For instance, if T = 25, the misclassification rate due to forbidden symbol sequences is apparently 76.4 percent for E_T = 10 percent, and even for E_T = 0.001 it is 1.21 percent. Shorter T, however, causes smaller shifts in the statistics of the posterior likelihood function and thus increases the misclassification rate. Consequently, a tradeoff exists between short T and long T.

The total misclassification rate can be expressed as the sum of the misclassification rate due to forbidden symbol sequences and the misclassification rate due to noise-induced shift in the statistics of the nonzero values of the posterior likelihood function. We examine the total misclassification rate for HMM(4), which is defined to be the HMM equivalent to HMM(3) with the noise matrix H given by equation (32) with E_T = 0.001.

Table 8. Number of 0^T ∈ HMM(3) + Noise for Which f_3(0^T) = 0 at Various Values of E_T

                        E_T
   T       0.1      0.01     0.001    0.0001
   5      2194       236        23         1
   10     3906       443        37         1
   15     5305       651        64        11
   20     6625       986       103        13
   25     7643      1303       121        11
   50     9643      2684       345        34
The parameters of HMM(4) are given explicitly in table 9.

Denote by F^0_ij(x) the cumulative distribution function

$$F_{ij}^{0}(x) = \begin{cases} \Pr[0 < f_i(0^T) \le x \mid 0^T \in \mathrm{HMM}(j)] , & x > 0 \\ 0 , & x \le 0 . \end{cases} \tag{33}$$

Ten-thousand observation sequences 0^T were generated from HMM(4) for T = 25. As given in table 8, 121 sequences resulted in zero values of the posterior likelihood function (that is, f_3(0^T) = 0), and the remaining 9879 sequences gave nonzero values of f_3(0^T). A histogram of the 9879 values of log dF_34(x) is shown in figure 6. By comparison with figure 4, it is clear that no significant difference between log dF_34(x) and log dF_33(x) is evident.

Table 9. Parameters of HMM(4), Rounded to Three Significant Digits

NUMBER OF MARKOV STATES = 5
NUMBER OF SYMBOLS PER STATE = 8

INITIAL STATE PROBABILITY VECTOR:
1.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00

TRANSITION PROBABILITY MATRIX:
6.00E-01  4.00E-01  0.00E+00  0.00E+00  0.00E+00
0.00E+00  7.00E-01  2.00E-01  1.00E-01  0.00E+00
0.00E+00  0.00E+00  6.00E-01  4.00E-01  0.00E+00
0.00E+00  0.00E+00  0.00E+00  7.00E-01  3.00E-01
0.00E+00  0.00E+00  0.00E+00  0.00E+00  1.00E+00

SYMBOL PROBABILITY MATRIX (TRANSPOSED):
8.99E-01  1.00E-01  1.43E-04  1.43E-04  1.43E-04
1.00E-01  5.99E-01  1.43E-04  1.43E-04  1.43E-04
1.43E-04  2.00E-01  3.00E-01  1.43E-04  1.43E-04
1.43E-04  1.00E-01  5.99E-01  1.00E-01  1.43E-04
1.43E-04  1.43E-04  1.00E-01  2.00E-01  1.43E-04
1.43E-04  1.43E-04  1.43E-04  4.00E-01  1.00E-01
1.43E-04  1.43E-04  1.43E-04  3.00E-01  5.99E-01
1.43E-04  1.43E-04  1.43E-04  1.43E-04  3.00E-01
[Figure 6. Histogram of 9879 Samples of log dF_34(x) for T = 25. (The Normal Curve Has the Sample Mean = -28.156 and the Variance = 3.6167.)]

Therefore, the misclassification rate due to noise-induced shifts in the statistics of dF^0_34(x) is effectively zero. The maximum posterior classifier is thus 98.8 percent reliable when used with noisy observations characterized by the noise probability matrix H.

Because E_T = 0.001 in this example, each observation sequence has probability 0.025 of having at least one incorrect symbol, and the expected number of the 10000 observation sequences having at least one incorrect symbol is 250. Nearly half (121) of these sequences contained at least one forbidden symbol and caused the only significant misclassification problem. The other half apparently made no contribution to misclassification.

It would seem desirable to be able to compute the moments of dF^0_ij(x) instead of dF_ij(x). Alternatively, it would be desirable to be able to compute the amplitude of the impulse (delta function) present in dF_ij(x) for the left-to-right HMMs considered here. In other words, if we write

$$dF_{ij}(x) = A \, \delta(x) + dF_{ij}^{0}(x) , \tag{34}$$

then an algorithm to compute A directly would be worthwhile. Knowing A and the moments of F_ij gives the moments of F^0_ij. However, developing such an algorithm requires further work.
IV. CONCLUSIONS

The first two moments M_ij(1,T) and M_ij(2,T) apparently give a good approximation of the probability density function dF_ij(x) in the case when both HMM(i) and HMM(j) are ergodic, because dF_ij(x) is then nearly log-normal. The evidence supporting this claim is strictly empirical, and developing a proof of the log-normal claim would be worthwhile. When either HMM(i) or HMM(j), or both, are not ergodic, dF_ij(x) is apparently not log-normal, and the first two moments alone do not give a good estimate of dF_ij(x). Consequently, higher order moments of dF_ij(x) are needed to develop reasonable continuous approximations to dF_ij(x).
REFERENCES

1. L. R. Rabiner, S. E. Levinson, and M. M. Sondhi, "On the Application of Vector Quantization and Hidden Markov Models to Speaker-Independent Isolated Word Recognition," The Bell System Technical Journal, vol. 62, no. 4, April 1983, pp. 1075-1105.

2. S. E. Levinson, L. R. Rabiner, and M. M. Sondhi, "An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition," The Bell System Technical Journal, vol. 62, no. 4, April 1983, pp. 1035-1074.

3. H. L. Van Trees, Detection, Estimation, and Modulation Theory, Part I, chapter 2, Wiley and Sons, New York, 1968.

4. A. Papoulis, Probability, Random Variables and Stochastic Processes, McGraw-Hill, New York, 1965, p. 158.

5. L. A. Liporace, "Maximum Likelihood Estimation for Multivariate Observations of Markov Sources," IEEE Transactions on Information Theory, vol. IT-28, no. 5, September 1982, pp. 729-734.

6. J. G. Kemeny and J. L. Snell, Finite Markov Chains, chapter 3, Van Nostrand, Princeton, New Jersey, 1960.

7. C. E. Shannon, "A Mathematical Theory of Communication," The Bell System Technical Journal, vol. 27, 1948.
INITIAL DISTRIBUTION LIST

Addressee (No. of Copies)
NAVSEA (Code 63-0 (CDR E. Graham, C. Walker)) (2)
NRL (Code 5130 (Dr. P. Abraham), -5844 (Dr. F. Rosenthal, Dr. R. J. Hansen), -5104 (Dr. S. Hanish), -5130 (R. Menton, J. Cole)) (6)
NRL/USRD, Orlando (A. Markowitz) (1)
NOSC, San Diego (E. Schiller, R. Johnson, R. Smith, G. Benthein) (4)
DTNSRDC, Bethesda (Code 1541 (P. Rispin)) (1)
DARPA (CMDR A. Sears) (1)
DTIC (12)
NAVPGSCOL, Monterey (1)
Naval Technology Office (L. Epley) (1)
ONR (Code 434 (Dr. N. Glassman, Dr. R. Fitzgerald), -220B (Dr. T. Warfield)) (3)
ONR, Boston (Dr. R. L. Sternberg) (1)
AMSI (W. Zimmerman) (1)
Bell Laboratories (S. Francis, G. Zipfel) (2)
S. Berlin, Inc. (S. Berlin) (1)
Campbell and Kronauer Associates (C. Campbell, R. Kronauer) (2)
Gould, Inc. (S. Lemon) (1)
G&R Associates (F. Reiss) (1)
SCRIPPS (D. Andrews, Jr.) (1)
Technology Service Corp. (L. Brooks) (1)
Atlantic Applied Research (S. Africk) (1)
Signal Technology, Inc. (Dr. S. B. Davis) (1)
ATT Bell Laboratory (Dr. S. Levinson) (1)
Digital Equipment Corp. (Dr. T. Vitale) (1)
Atlantic Aerospace Electronics Corp. (L. Lynn) (1)
URI, Dept. of Math. (Prof. P. T. Liu) (1)
URI, Dept. of Elec. Eng. (Dr. D. Tufts) (1)
Chase, Inc. (D. Chase) (1)
A&T, No. Stonington (L. Gorham) (1)