IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-32, NO. 2 , APRIL 1984
The Use of a One-Stage Dynamic Programming
Algorithm for Connected Word Recognition
HERMANN NEY
Abstract-This paper is of tutorial nature and describes a one-stage dynamic programming algorithm for the problem of connected word recognition. The algorithm to be developed is essentially identical to one presented by Vintsyuk [1] and later by Bridle and Brown [2]; but the notation and the presentation have been clarified. The derivation used for optimally time synchronizing a test pattern, consisting of a sequence of connected words, is straightforward and simple in comparison with other approaches decomposing the pattern matching problem into several levels. The approach presented relies basically on parameterizing the time warping path by a single index and on exploiting certain path constraints both in the word interior and at the word boundaries. The resulting algorithm turns out to be significantly more efficient than those proposed by Sakoe [3] as well as Myers and Rabiner [4], while providing the same accuracy in estimating the best possible matching string. Its most important feature is that the computational expenditure per word is independent of the number of words in the input string. Thus, it is well suited for recognizing comparatively long word sequences and for real-time operation. Furthermore, there is no need to specify the maximum number of words in the input string. The practical implementation of the algorithm is discussed; it requires no heuristic rules and no overhead. The algorithm can be modified to deal with syntactic constraints in terms of a finite state syntax.

Manuscript received February 15, 1982; revised February 7, 1983, July 29, 1983, and October 14, 1983.
The author is with Philips GmbH Forschungslaboratorium Hamburg, D-2000 Hamburg 54, Germany.

I. INTRODUCTION

ONE of the most promising approaches to connected word recognition is the technique of dynamic programming [5]. For isolated word recognition, several authors have shown that the problem of nonlinearly time aligning speech patterns with no fixed time scale can be efficiently solved by dynamic programming [6]-[8]. Vintsyuk [1], Bridle and Brown [2], and Sakoe [3] have formulated the connected word recognition problem as an optimization problem similar to the isolated word recognition problem. One of the attractive features of this formulation is the small amount of a priori information and training that is required: only the reference patterns for each individual word need to be known. Another advantage is that the three operations of word boundary detection, nonlinear time alignment, and recognition are performed simultaneously; thus, recognition errors due to errors in word boundary detection or to time alignment errors are not possible. The algorithm is forced to match the complete words, and as a result of this, the word boundaries are determined automatically. In comparison, methods based on prespecified segmentation rules [9] or on statistical estimation for segmentation [10] require either detailed knowledge of the vocabulary dependent segmentation rules or an extensive training of the segmentation algorithm. There have been other systems like DRAGON [11] and HARPY [12], which model the recognition problem as an optimization problem as well, but which are based on subword units for the recognition as opposed to whole word templates. Meanwhile, a number of systems for connected word recognition have been developed which are based on the algorithm described in this paper [13]-[16].

Although Vintsyuk had formulated his algorithm already in 1971, his algorithm was not commonly known. Thus, later, Sakoe independently derived a two-level algorithm to solve the optimization problem. On the first level, all reference patterns are systematically matched against all possible subsections of the input pattern. This matching operation is the same as for the nonlinear time alignment in isolated word recognition. On the second level, using the distance scores generated, the optimal estimate of the unknown sequence of words is obtained by minimizing the total distance of all possible word sequences.

In a recent paper, Myers and Rabiner derived another solution to the optimization problem of connected word recognition [4]. They made use of the property that the matching of all possible word sequences can be performed by successive concatenation of reference patterns. Thus they obtained what they called a level building algorithm.
This paper is essentially tutorial. Its purpose is to present a clear description of the connected word recognition problem and its solution by a one-stage dynamic programming algorithm, to discuss the advantages of the one-stage algorithm and to compare it with other algorithms. The one-stage algorithm is basically identical to the algorithms given by both Vintsyuk [1] and by Bridle and Brown [2]. However, a straightforward formulation and derivation of the algorithm is given. The derivation is based on parameterizing the time warping path by a single index and treating the optimization criterion directly as a function of this time warping path. This approach is much simpler than the approaches used by both Sakoe [3] and by Myers and Rabiner [4], and it results in a computationally more efficient algorithm. The constraints imposed on the path are described in terms of two types of transition rules: transition rules for the word interior and for the word boundaries. Carrying out the optimization using these path constraints and dynamic programming leads to a one-stage algorithm, for which there are no multiple optimization levels as in the other approaches. Unlike the level building of Myers and Rabiner, the one-stage algorithm needs no prespecified maximum number of words in the input string. What is most important, the one-stage algorithm requires no more computational expenditure than the corresponding case of isolated word recognition with no adjustment window. The implementational aspects of the one-stage algorithm are described, and its computational and storage requirements are compared with the two-level algorithm of Sakoe and the level building algorithm of Myers and Rabiner. Finally, the algorithm is modified to deal with a finite state syntax.

0096-3518/84/0400-0263$01.00 © 1984 IEEE

II. FORMULATION OF THE PATTERN MATCHING PROBLEM

In the following, we will present a simple approach to the pattern matching problem for connected word recognition, the reason being that the simplification due to parameterizing the time warping path by a single index is most significant and provides some insights that immediately reveal how to arrive at a practical implementation of the algorithm.

Assume an unknown input or test pattern consisting of i = 1, ..., N time frames, where a time frame is represented by a vector of features. The input pattern is known to be composed of individual words, which are chosen from a prespecified vocabulary. The words of the vocabulary correspond to a set of K reference patterns or templates obtained from single word utterances spoken in isolation. The word templates are distinguished by the index k = 1, ..., K. The time frames of the template k are denoted as j = 1, ..., J(k), where J(k) is the length of the template k.

The ultimate goal of connected word recognition is to determine that sequence q(1), ..., q(R) of templates that best matches the input pattern, where the criterion of match needs further specification. The concatenation of the templates q(1), ..., q(R) is referred to as “super” reference pattern. Since this unknown “super” reference pattern may be handled like a single utterance pattern, the matching procedure is the same as in the case of isolated word recognition. Based on this consideration, it is obvious what specification and constraints to apply to the time warping procedure. Instead of decomposing the matching procedure into a single template matching level and a word string constructing level, as was done in the other approaches mentioned [3], [4], we want to treat the matching procedure as a one-stage procedure [1], [2].
The basic idea is illustrated in Fig. 1. The time frames i of the test pattern and the time frames j of each template k define a set of grid points (i, j, k). Each grid point (i, j, k) is associated with a local distance measure d(i, j, k) defining a measure of dissimilarity between the corresponding acoustic events. The connected word recognition problem can be regarded as one of finding the path through the set of grid points (i, j, k) which provides the best match between the test pattern and the unknown sequence of templates. The path is often referred to as time warping path. The three parameters i, j, k are of different characters: the time parameters i and j tend to change more or less uniformly in ascending order, whereas the template number k is constant for comparatively long subsections of the path and can change only after the path has passed a template boundary with j = J(k). However, for deriving the algorithm, it is crucial to treat the three parameters as mathematically equivalent. Formally, the path W is given as a sequence of grid points

$$W = (w(1), w(2), \ldots, w(l), \ldots, w(L)) \qquad (1)$$

where w(l) = (i(l), j(l), k(l)) and l is the path parameter for indexing the ordered set of path elements. The criterion for the matching procedure is the global distance, i.e., the sum over the local distances along a given path. The problem of connected word recognition can now be stated as the minimization problem

$$\min_{W} \sum_{l=1}^{L} d(w(l)) \qquad (2)$$

i.e., minimize the global distance with respect to all allowed paths. From the best path, the associated sequence of templates can be uniquely recovered as is clear from Fig. 1.

Fig. 1. The connected word recognition problem. The optimal path provides the unknown sequence of words as well as the nonlinear time alignment between the corresponding sequence of templates and the input pattern.

In addition to minimizing the global distance, the time warping path is required to obey certain continuity constraints implied by the physical nature of the patterns to be matched. These constraints apply to consecutive points of the path. The constraints result from the requirement of the preservation of time order along the time axes and from the requirement of time continuity implying that no time frame, i.e., acoustic event, be omitted in the sequence i(1), ..., i(l), ..., i(L). The continuity constraints determine the possible preceding points for a given path point (i, j, k) and are therefore also referred to as transition rules. A possible disadvantage of the global distance definition as given in (2) is that the global distance depends on the path length, and thus shorter paths are favored. This problem will be studied later in connection with the details of the dynamic programming algorithm.
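The global distance of (2) can be illustrated with a small sketch. The squared Euclidean local distance, the helper names `local_distance` and `global_distance`, and the toy data are assumptions made for this example; the text does not prescribe a particular local distance measure.

```python
import numpy as np

def local_distance(test, templates, i, j, k):
    """Local distance d(i, j, k) with 1-based indices as in the text.

    Squared Euclidean distance is an assumption made for this sketch.
    """
    diff = test[i - 1] - templates[k - 1][j - 1]
    return float(np.dot(diff, diff))

def global_distance(test, templates, path):
    """Sum of the local distances along a path W = (w(1), ..., w(L))."""
    return sum(local_distance(test, templates, i, j, k) for i, j, k in path)

# Toy data: three one-dimensional test frames and K = 2 templates.
test = np.array([[0.0], [1.0], [2.0]])
templates = [np.array([[0.0], [1.0]]), np.array([[2.0]])]

path_a = [(1, 1, 1), (2, 2, 1), (3, 1, 2)]  # template 1 followed by template 2
path_b = [(1, 1, 2), (2, 1, 2), (3, 1, 2)]  # template 2 stretched over all frames

print(global_distance(test, templates, path_a))
print(global_distance(test, templates, path_b))
```

Under these assumptions the first path, which concatenates both templates, has the smaller global distance and would be preferred by the criterion.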
Due to the concatenation of single word templates to a “super” reference pattern, it is convenient to distinguish between two types of transition rules: transition rules in the template interior called within-template transition rules and transition rules at the template boundaries called between-template transition rules. These two types of transition rules are illustrated in Fig. 2. For within-template transitions the following relation holds between two consecutive points:

$$\text{if } w(l) = (i, j, k),\ j > 1, \text{ then } w(l - 1) \in \{(i - 1, j, k),\ (i - 1, j - 1, k),\ (i, j - 1, k)\} \qquad (3a)$$

i.e., the point (i, j, k) can be reached only from one of the points (i - 1, j, k), (i - 1, j - 1, k), (i, j - 1, k) as shown in Fig. 2(a). For transition at template boundaries where j = 1, the between-template transition rules are shown in Fig. 2(b):

$$\text{if } w(l) = (i, 1, k), \text{ then } w(l - 1) \in \{(i - 1, 1, k);\ (i - 1, J(k^*), k^*) : k^* = 1, \ldots, K\}. \qquad (3b)$$

Fig. 2. (a) Within-template transition rules. (b) Illustration of between-template transition rules.

Since the point (i, 1, k) corresponds to the beginning frame of the template k, it is necessary that it can be reached from the ending frame of any template k* including k itself. The between-template transition rules depend heavily on how the single words can be combined and must therefore be changed in the case of syntactic constraints as will be shown later. The coarticulation problem at the word boundaries is tackled by matching the interior parts of the words such that the word boundaries are correctly treated automatically. Finally, there are endpoint constraints requiring that the time warping path begins at a beginning frame of any template and ends at the ending frame of any template.

III. DERIVATION OF THE ALGORITHM USING DYNAMIC PROGRAMMING

In this section, an algorithm for solving the minimization problem (2) and thus finding the best path W is derived by use of dynamic programming. The concept of dynamic programming is especially useful in solving combinatorial optimization problems of the type of (2) which can be broken down into a sequence of optimization steps and in which there are only a small number of possible choices or decisions to be taken at each optimization step [17]. The philosophy of dynamic programming is based primarily on the “principle of optimality” which is due to R. Bellman [5]. Its application to the minimization problem (2) says: If the best path goes through a grid point (i, j, k), then the best path includes, as a portion of it, the best partial path to the grid point (i, j, k).

In order to utilize this “principle of optimality,” we define a minimum accumulated distance D(i, j, k) along any path to the grid point (i, j, k). Since the accumulated distance D(i, j, k) is a sum of local distances, it can be decomposed in the same way as the path into the accumulated distance along the best path to its predecessors and the local distance associated with the grid point (i, j, k) itself. To obtain the best path, we have to select the predecessor with the minimum total distance. Thus for the template interior, i.e., j > 1, we obtain the following using the within-template transition rules (3a):

$$D(i, j, k) = d(i, j, k) + \min \{D(i - 1, j, k),\ D(i - 1, j - 1, k),\ D(i, j - 1, k)\}. \qquad (4a)$$

At the template boundaries with j = 1, the between-template transition rules (3b) yield

$$D(i, 1, k) = d(i, 1, k) + \min \{D(i - 1, 1, k);\ D(i - 1, J(k^*), k^*) : k^* = 1, \ldots, K\}. \qquad (4b)$$
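The transition rules (3a) and (3b) can be sketched as a function that lists the allowed predecessors of a grid point, together with a check that a candidate path obeys the rules and the endpoint constraints. The names `predecessors` and `is_valid_path`, the dictionary `lengths` mapping k to J(k), and the requirement that the path covers frames 1 to N are assumptions of this sketch.

```python
def predecessors(i, j, k, lengths):
    """Allowed predecessors of the grid point (i, j, k), 1-based indices.

    lengths maps each template index k to its length J(k) (hypothetical helper).
    """
    if i == 1:
        # First test frame: only (1, j - 1, k) can precede (1, j, k).
        return [(1, j - 1, k)] if j > 1 else []
    if j > 1:
        # Within-template transition rules (3a).
        return [(i - 1, j, k), (i - 1, j - 1, k), (i, j - 1, k)]
    # Between-template transition rules (3b).
    return [(i - 1, 1, k)] + [(i - 1, lengths[ks], ks) for ks in sorted(lengths)]

def is_valid_path(path, lengths, n_frames):
    """Check the transition rules and the endpoint constraints for a path."""
    i0, j0, _ = path[0]
    iL, jL, kL = path[-1]
    if (i0, j0) != (1, 1) or iL != n_frames or jL != lengths[kL]:
        return False
    return all(prev in predecessors(*cur, lengths)
               for prev, cur in zip(path, path[1:]))

lengths = {1: 2, 2: 3}  # J(1) = 2, J(2) = 3
good = [(1, 1, 1), (2, 2, 1), (3, 1, 2), (4, 2, 2), (5, 3, 2)]
bad = [(1, 1, 1), (2, 1, 2)]  # leaves template 1 before its ending frame
print(is_valid_path(good, lengths, 5), is_valid_path(bad, lengths, 2))
```

The second path is rejected because a template can only be left through its ending frame J(k).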
For grid points at the beginning frame of the test pattern, the transition rules must be modified, since there is no preceding frame on the time axis of the test pattern. A grid point (1, j, k) can be reached only from a grid point (1, j - 1, k).

Equations (4a) and (4b) are the typical recurrence relations of dynamic programming. By using these recurrence relations, the accumulated distances D(i, j, k) can be recursively evaluated point by point. Evidently, the recursive evaluation is made possible by the transition rules which imply an ordered arrangement of grid points.
By way of summary, a complete algorithm for connected word recognition is given as follows.
Step 1) Initialize $D(1, j, k) = \sum_{n=1}^{j} d(1, n, k)$.
Step 2)
a) For i = 2, ..., N, do steps 2b-2e.
b) For k = 1, ..., K, do steps 2c-2e.
c) D(i, 1, k) = d(i, 1, k) + min {D(i - 1, 1, k); D(i - 1, J(k*), k*): k* = 1, ..., K}.
d) For j = 2, ..., J(k), do step 2e.
e) D(i, j, k) = d(i, j, k) + min {D(i - 1, j, k), D(i - 1, j - 1, k), D(i, j - 1, k)}.
Step 3) Trace back the best path from the grid point at a template ending frame with minimum total distance using the array D(i, j, k) of accumulated distances.
Step 3 of this algorithm recovers the unknown sequence of words in the input pattern by tracing back the decisions taken by the “minimum” operator at each grid point. For this backtracking procedure, the entire array of accumulated distances D(i, j, k) must have been stored during the recursion. Although this storage requirement of the algorithm is not necessarily intractable even on today's microprocessors, it will be shown in the following section that the storage requirement can be significantly reduced by special techniques.
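Steps 1 to 3 can be sketched in code as follows. This is a minimal illustration and not the author's implementation: it uses 0-based indices internally, assumes a squared Euclidean local distance, and records the decision of the “minimum” operator in a dictionary `pred` instead of re-deriving it from the stored array D(i, j, k); the function name `one_stage_dp` is hypothetical.

```python
import numpy as np

def one_stage_dp(test, templates):
    """Return (minimum global distance, recognized template indices)."""
    n_frames, n_templates = len(test), len(templates)
    # d[k][i, j]: local distance between test frame i and frame j of template k.
    d = [((test[:, None, :] - t[None, :, :]) ** 2).sum(axis=2) for t in templates]
    D = [np.empty((n_frames, len(t))) for t in templates]
    pred = {}  # (i, j, k) -> (crossed a word boundary?, predecessor grid point)

    # Step 1: the first test frame can only be reached along the template axis.
    for k in range(n_templates):
        D[k][0] = np.cumsum(d[k][0])
        for j in range(1, len(templates[k])):
            pred[(0, j, k)] = (False, (0, j - 1, k))

    # Step 2: recursion over test frames, templates and template frames.
    for i in range(1, n_frames):
        best_end = min((D[ks][i - 1, -1], True, (i - 1, len(templates[ks]) - 1, ks))
                       for ks in range(n_templates))
        for k in range(n_templates):
            # Step 2c: between-template transition rules.
            cost, flag, prev = min((D[k][i - 1, 0], False, (i - 1, 0, k)), best_end)
            D[k][i, 0] = d[k][i, 0] + cost
            pred[(i, 0, k)] = (flag, prev)
            # Step 2e: within-template transition rules.
            for j in range(1, len(templates[k])):
                cost, flag, prev = min(
                    (D[k][i - 1, j], False, (i - 1, j, k)),
                    (D[k][i - 1, j - 1], False, (i - 1, j - 1, k)),
                    (D[k][i, j - 1], False, (i, j - 1, k)))
                D[k][i, j] = d[k][i, j] + cost
                pred[(i, j, k)] = (flag, prev)

    # Step 3: trace back from the best template ending frame at the last test frame.
    total, point = min((D[k][-1, -1], (n_frames - 1, len(templates[k]) - 1, k))
                       for k in range(n_templates))
    words = [point[2]]
    while point in pred:
        crossed, point = pred[point]
        if crossed:
            words.append(point[2])
    words.reverse()
    return float(total), words

test = np.array([[0.0], [1.0], [5.0], [6.0]])
templates = [np.array([[0.0], [1.0]]), np.array([[5.0], [6.0]])]
print(one_stage_dp(test, templates))
```

For the toy input the test pattern is the exact concatenation of the two templates, so the sketch recovers the template sequence 0, 1 at zero distance.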
A drawback of the algorithm formulated so far results from the dependence of the global criterion (2) on the path length. An easy consideration shows that for path segments with a slope greater than 1, the number of local distances per input frame increases, whereas it remains constant for path segments with a slope smaller than 1. Ideally, the global criterion should be independent of the slope of the path in order to allow all types of time axis distortion. At the same time, however, the optimal path is likely to deviate only rarely from the diagonal direction, i.e., to have a slope 1, since speaking rate variations tend to be small. There is no general solution known to this problem. Sakoe and Chiba [8] describe two within-word transition rules for which the path slopes are always 1/2, 1, or 2, and the path length is equal to the number of test frames processed. The Itakura constraints [7] lead to the same path length normalization, i.e., no time distortion penalties for slopes between 1/2 and 2. Locally variable penalties for time distortion can be introduced as illustrated in Fig. 3. Depending on the three directions horizontal, diagonal and vertical, the local distance is multiplied by the weights (1 + a), 1, and b prior to evaluating the dynamic programming recursion:

$$D(i, j, k) = \min \{(1 + a) \cdot d(i, j, k) + D(i - 1, j, k),\ d(i, j, k) + D(i - 1, j - 1, k),\ b \cdot d(i, j - 1, k) + D(i, j - 1, k)\}.$$

The reformulation of the global criterion (2) is omitted since it is trivial. Fig. 3 indicates also what is the number of local distances per input frame for slopes 2, 1, and 1/2. Typical values of the weights a and b are in the order of 1, e.g., a = 1 and b = 1/2.

Fig. 3. Time distortion penalties as provided by slope dependent weights 1 + a, 1, b: The number of local distances per input frame is thus 1 + (a/2) for slope 1/2, 1 for slope 1 and 1 + b for slope 2. (Horizontal axis: time frames of input pattern.)

IV. PRACTICAL IMPLEMENTATION OF THE ALGORITHM

The final aim of the algorithm is to determine the unknown sequence of words. In order to accomplish this, it is sufficient to know at which time frame i of the test pattern the best path has started for a given ending point of a given template k. The details of the best path within the templates are of no primary importance to the recognition problem. Another important aspect for reducing the storage requirement is evident from the structure of loops in the algorithm presented in the previous section. To perform the dynamic programming recursions for a time frame i, only a small portion from the complete array D(i, j, k) of accumulated distances is needed, namely the elements corresponding to the preceding time frame i - 1: {D(i - 1, j, k): k = 1, ..., K; j = 1, ..., J(k)}. The grid points associated with these elements form a vertical cut through the time plane of Fig. 1. This column of storage will be simply referred to as column array of accumulated distances and denoted as D(j, k). Thus, using only one column of storage from the array D(i, j, k), the dynamic programming recursions (3a) and (3b) can be carried out by proceeding along the time axis of the test pattern and updating the storage column point by point.

This technique for storage reduction is analogous to the case of isolated word recognition, where all that is wanted is the minimum total distance as a matching score between the test pattern and the template. The essential difference, however, is that some form of backtracking must be added to enable the algorithm to recover the unknown sequence of words. Bridle et al. [13] refer to the following technique for backtracking as Vintsyuk's algorithm [1]. A similar backtracking procedure was proposed by Myers and Rabiner [4]. The backtracking information must be recorded during the evaluation of the dynamic programming recursion. For the path leading to the grid point (i, j, k), there is a unique starting point at the line j = 1 in the same template k. Hence, for each grid point, a backpointer B(i, j, k) can be defined as that value of the ending frame of the preceding word from which the path to the grid point (i, j, k) has come. Fig. 4 illustrates the basic concept of the backpointers for the three preceding grid points of the grid point (i, j, k). Originally, the array of backpointers depends on the index triple (i, j, k) in the same way as the array D(i, j, k) of accumulated distances.

Since, as stated above, we are only interested in the backpointers of the grid points (i, J(k), k) at the template boundaries, we can reduce the array of backpointers B(i, j, k) to a column array of backpointers B(j, k) (in Fig. 1) as in the case of the array D(i, j, k), and update it point by point according to where the best path has come from. A “from template” array T(i) must be used to record the index k0 of the template with minimum accumulated distance at its ending frame J(k0) for each frame i of the test pattern. Formally, this is expressed as

$$T(i) = k_0 = \operatorname{argmin} \{D(i, J(k), k) \mathrel{\hat{=}} D(J(k), k) : k = 1, \ldots, K\},$$

where the operator “argmin {f(x): x = x_1, ..., x_n}” means to find the optimum argument x which minimizes f(x). Additionally, a “from frame” array F(i) must be introduced in order to retain the backpointer B(J(k0), k0) ≙ B(i, J(k0), k0) to the ending frame of the preceding word after the test pattern frame i has been processed. Thus the “from frame” array F(i) keeps track of the frame along the test pattern time axis from which the best path to the grid point (i, J(k0), k0) has come. In other words, the “from frame” array keeps track of
1
Authorized licensed use limited to: Universitatsbibliothek Karlsruhe. Downloaded on February 18, 2009 at 07:52 from IEEE Xplore. Restrictions apply.
N t Y : ONE-STAGE DP ALGORITHM FOR CONNECTED WORD RECOGNITION
2 67
the “from template” array
T(i):= argniin {D(i,J ( k ) ,k ) : k = 1 , * . . ,K ]
the “from frame” array
F(i): = B(J(T(i)),T(i)j =^ B(i,J(T(ij),
T(i)).
The savingsachievedbyusingthesearraysinstead
of the
original array D(i, j , k ) are considerable: instead of N . J . K .
TIME FRAMES O F INPUT PATTERN
where 7 denotesthe
averagelength
of atemplate,only
Fig. 4. Backpointcrs from thc three preceding grid points
(i, j , k ) to
2 . ( J .K + N ) storage locations are required.
their corresponding starting frames.
Fig. 6 shows a schematic diagram of the complete algorithm
using the arrays introduced above and provides a short verbal
‘ F R O MF R A M E ’ A R R A Y
description of the one-stage dynamic programming algorithm
FROMTEMPLATE’ARRAY
0
N
for connected word recognition.
TIME FRAMES , OF INPdT
PATTERN
In order to allow for the possibility of pauses between the
Fig. 5. The backtracking procedure.
words in the word string, it is useful to include an additional
silence template in the set of templates. This technique can
also be used for refining the detection of the beginning and
potentialwordboundaries,andthe“fromtemplate”array
ending points of the complete word string as it was proposed
keepstrackoftherespectivedecisionabouttherecognized
by Bridle et al. [13] .
word.
These authors also suggested an extension
of the algorithr;1
The concept of the “from template” and of the “from frame”
t o deal with nonvocabulary words. They introduced a so-called
arrays is closelyrelatedtothebetween-templatetransition
pseudotemplate consisting of only one frame to which a fixed
rules, which are the characteristic of this approach presented
as opposed to the other approaches. It is crucial to realize that “distance” is assigned independent of with which input frame
for each time frame i, only the best template, i.e., the template it is matched. Thus, the algorithm is able to reject a portion of
the input utterance if the similarity is not sufficient.
with minimum accumulated distance at its ending frame, and
so far
The connected word recognition algorithm described
thecorrespondingwordboundarymustbekepttrackof
in
starts
the
traceback
from
the
end
of
the
word
string.
However,
order to be able to determine the optimal global path.
it is possible t o recognize the initial portion of the word string
Fig. 5 shows an example of the backtracking procedure for
is very desirable for a
recovering the unknown sequence of words for the example in before its end has been spoken, which
real
time
operation
of
the
recognition
system.
This
is clear
Fig. 1. For the last time frame of the test pattern, the “from
from
the
observation
that
if
there
is
a
section
shared
by
all
template”arrayrecordsthetemplatewithminimumtotal
candidate
paths,
that
section
is
necessarily
part
of
the
globally
distance at its ending frame. The “from frame” array points
to the ending frame of the preceding word on the test pattern optimal path [ 131 , [ 141 . Thus, by tracing back all candidate
paths for a given input frame and determining the point where
timeaxis.Forthisendingpoint,the“fromtemplate”array
they
all meet, the initial portion of the word string can be recagain contains the index of the best preceding word and the
ognized.
Thisrecognition
is irrevocableinthesensethatno
corresponding“fromframe”arraypointsagainbacktothe
subsequent
input
can
change
the decision about the recognized
endingframeoftheprecedingword.Obviously,the“from
words.
In
order
to
carry
out
this premature traceback, it is in
frame” array contains indices into itself and the “from temfact
sufficient
that
decisions
about
therecognizedwordsas
plate” array. This backtracking
is continued until the zeroth
such,
and
not
about
the
path
details,
are the same for all canframeofthetestpatternhasbeenreached.Thusforthe
didate
paths.
optimal word boundaries we obtain the sequence (in
reverse
order)
V. SYNTACTIC
CONSTRAINTS
N , F(W,W W ) ,
,
In this section, the algorithm is modified in order to include
and for the optimal word sequence, we obtain the sequence
(in asyntacticanalysisforafinitestatesyntax.Forconnected
reverse order)
wordrecognition,it
is suitabletoincorporatethesyntactic
analysis into the recognition process itself.
It is worth noting
T ( W , W ( N ) ) ,W v W ) 11, ’ .
that any lexicon of legal word strings can be described in terms
In summary, the array D ( i , j , k ) of accumulated distances
ofafinitestatesyntax.Clearly,syntacticconstraintsmust
B ( i , j , k ) of backpointers has affectthebetween-wordtransitionrules.Eachword
along with the associated array
asit
beensubstitutedbyfoursmallerarrays.Thesearraysareas
startscanbereachedonlyfromwordsthatmay
legally prefollows.
cede i t according to the syntactic constraints. Thus, we define
for
each word k a “from predecessor” set P ( k ) as the set of all
The column array of accumulated distances
k . Fig. 7(a) ilvocabulary words which can precede the word
D ( j , k ) : =^ D(i,j, k ) ,
lustrates the method. For convenience, START and STOP are
added as pseudowords to the vocabulary { A , B , C}. The arcs
the column array o f backpointers
of the finite state network are associated with the words of
B(j. k ) : e B(i.1, k ) ,
the vocabulary. The finite state network shown in Fig. 7(a) is
’ ‘ ’
’ ‘
Authorized licensed use limited to: Universitatsbibliothek Karlsruhe. Downloaded on February 18, 2009 at 07:52 from IEEE Xplore. Restrictions apply.

Initialize arrays of accumulated distances and backpointers.
Loop over time frames of the input pattern:
  Loop over templates:
    Evaluate dynamic programming recursion according to between-template transition rules:
      - update the column array of accumulated distances,
      - update the column array of backpointers.
    Evaluate dynamic programming recursion according to within-template transition rules:
      - update the column array of accumulated distances,
      - update the column array of backpointers.
  Loop control.
  Keep track of the template with minimum accumulated distance at its ending frame in a “from template” array.
  Keep track of backpointers at the ending frame of the corresponding template in a “from frame” array.
Recover the sequence of templates:
  - start from the template with minimum accumulated distance at its ending frame,
  - backtrack the sequence of templates using the “from frame” and “from template” arrays up to the beginning frame of the input pattern.

Fig. 6. Schematic diagram of the one-stage dynamic programming algorithm.
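The schematic of Fig. 6 can be sketched in code along the following lines. This is an illustrative sketch under stated assumptions, not the author's implementation: the local distance is taken to be squared Euclidean, the function name `one_stage_dp_columns` is hypothetical, template indices are 0-based while test frames are numbered 1 to N so that a backpointer of 0 denotes the start of the test pattern, and each column is rebuilt into a fresh array per frame for clarity rather than updated strictly in place.

```python
import numpy as np

def one_stage_dp_columns(test, templates):
    """Return (total distance, template sequence, word ending frames)."""
    n_frames, n_templates = len(test), len(templates)

    def dist(i, k):
        # Local distances between test frame i (1-based) and all frames of template k.
        return ((templates[k] - test[i - 1]) ** 2).sum(axis=1)

    # Column arrays D(j, k) and B(j, k) for the first test frame.
    D = [np.cumsum(dist(1, k)) for k in range(n_templates)]
    B = [np.zeros(len(t), dtype=int) for t in templates]
    T = [0] * (n_frames + 1)  # "from template" array, entries 1..N are used
    F = [0] * (n_frames + 1)  # "from frame" array, entries 1..N are used
    T[1] = min(range(n_templates), key=lambda k: D[k][-1])
    F[1] = int(B[T[1]][-1])

    for i in range(2, n_frames + 1):
        end_cost = D[T[i - 1]][-1]  # best template ending at frame i - 1
        new_D, new_B = [], []
        for k in range(n_templates):
            d = dist(i, k)
            Dk = np.empty(len(d))
            Bk = np.empty(len(d), dtype=int)
            # Between-template transition rules.
            if D[k][0] <= end_cost:
                Dk[0], Bk[0] = d[0] + D[k][0], B[k][0]
            else:
                Dk[0], Bk[0] = d[0] + end_cost, i - 1
            # Within-template transition rules.
            for j in range(1, len(d)):
                best, Bk[j] = min((D[k][j], B[k][j]),
                                  (D[k][j - 1], B[k][j - 1]),
                                  (Dk[j - 1], Bk[j - 1]))
                Dk[j] = d[j] + best
            new_D.append(Dk)
            new_B.append(Bk)
        D, B = new_D, new_B
        T[i] = min(range(n_templates), key=lambda k: D[k][-1])
        F[i] = int(B[T[i]][-1])

    # Backtracking: N, F(N), F(F(N)), ... and T(N), T(F(N)), ...
    words, boundaries, i = [], [], n_frames
    while i > 0:
        words.append(T[i])
        boundaries.append(i)
        i = F[i]
    return float(D[T[n_frames]][-1]), words[::-1], boundaries[::-1]

test = np.array([[0.0], [1.0], [5.0], [6.0]])
templates = [np.array([[0.0], [1.0]]), np.array([[5.0], [6.0]])]
print(one_stage_dp_columns(test, templates))
```

Only the two column arrays and the two arrays of length N survive from frame to frame, which mirrors the storage count of 2 · (J̄ · K + N) given in the text, apart from the temporary column used here for readability.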
by the “from predecessor” sets. For the example in Fig. 7(b), we have to make two copies B1 and B2 of the word B as shown in Fig. 7(c). Thus, we obtain the four-word vocabulary

$$D(i, 1, k) = d(i, 1, k) + \min \{D(i - 1, 1, k);\ D(i - 1, J(k^*), k^*) : k^* \in P(k)\}$$
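The effect of the “from predecessor” sets on the between-template recursion can be sketched as follows. The function name `best_entry`, the word labels, and the predecessor sets in the example are hypothetical and do not reproduce the networks of Fig. 7; the dictionaries `first_prev` and `end_prev` are assumed to hold D(i - 1, 1, k) and D(i - 1, J(k), k) for each word k.

```python
def best_entry(k, first_prev, end_prev, predecessor_sets):
    """Minimize over {D(i-1, 1, k); D(i-1, J(k*), k*): k* in P(k)}.

    Returns (cost, source), where source is None if the path stays in the
    first frame of word k and otherwise the preceding word k*.
    """
    candidates = [(first_prev[k], None)]
    candidates += [(end_prev[ks], ks) for ks in predecessor_sets[k]]
    return min(candidates, key=lambda c: c[0])

first_prev = {"A": 5.0, "B": 7.0}  # D(i - 1, 1, k)
end_prev = {"A": 2.0, "B": 1.0}    # D(i - 1, J(k), k)

# Hypothetical syntax: B may only follow A, and nothing may precede A.
constrained = {"A": [], "B": ["A"]}
# No syntactic constraints: every word may follow every word.
unconstrained = {"A": ["A", "B"], "B": ["A", "B"]}

print(best_entry("B", first_prev, end_prev, constrained))
print(best_entry("A", first_prev, end_prev, unconstrained))
```

With every word listed in every predecessor set, the function reduces to the unconstrained between-template rule of (4b).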
where the index k now indexes the word copies instead of the word classes. The recurrence relation for the within-word transitions is not changed. Note that the finite state networks need not be loop free, which was already indicated by Fig. 7. Since now the optimal preceding word depends on the word under consideration, it is necessary to have a "from template" array T(i) and a "from frame" array F(i) for each word copy. For the several copies of a word, the local distances need to be calculated only once for one copy and can then be "copied" for the other copies. As a result, there is no significant increase in computational expenditure due to the syntactic constraints. The major part of the computation time is consumed in the innermost loop by the calculation of the local distance d(i, j, k), which is a distance measure between feature vectors of dimension ten or more. Hence, the additional costs for the computation and updating of the backpointers, which is carried out in parallel with the computation of the accumulated distances, are negligible in comparison with the local distance calculations. There is, however, an increase in storage costs since each word copy must have its individual arrays for the accumulated distances and the backpointers.

Fig. 7. Four examples (a), (b), (c), (d) of finite state networks as syntactic constraints for building up word strings from the vocabulary of the words A, B, C.

VI. COMPARISON WITH OTHER APPROACHES

It is instructive to compare the different approaches to connected word recognition with respect to computational and storage requirements. Myers and Rabiner [4] present such a comparison between the two-level algorithm of Sakoe [3], their original level building algorithm and a reduced level building algorithm, which results from the level building algorithm by restricting the search region for the best path. In the following, we will extend their comparison to include the one-stage dynamic programming algorithm. Since, as we have seen, syntactic constraints do not significantly affect the computational requirements of the one-stage algorithm, the comparison is made for word strings with no syntactic constraints.

The computational and storage requirements of the four algorithms are summarized in Table I. The computational requirements are measured as the number of local distance calculations, which is the number of basic time warps multiplied by the (average) size of a time warp, i.e., the area covered by it. According to Myers and Rabiner [4], the storage requirements include only the storage locations for those arrays that are specific to the different algorithms. Thus the storage locations for the reference patterns as well as the arrays of accumulated distances and backpointers are not considered in general. However, as will be seen later, there will be an exception for the one-stage algorithm.

In the two-level algorithm of Sakoe, a time warp is carried out for each of the $N$ time frames of the input pattern and for each of the $K$ templates with average length $\bar{J}$. Each time warp covers a stripe around the diagonal line of bandwidth $(2R + 1)$; the warp size is then $\bar{J} \cdot (2R + 1)$. As a result, the number of local distance calculations is $N \cdot K \cdot \bar{J} \cdot (2R + 1)$. To start up the matching procedure on the second level, the so-called phrase level, the accumulated distance and the index of the best template must be stored for each of the $(2R + 1)$ pairs of beginning and ending points and for each time frame of the input pattern. Thus, the number of storage locations is $2 \cdot N \cdot (2R + 1)$.

In the level building algorithm of Myers and Rabiner, $M \cdot K$ time warps are performed, where $M$ is a prespecified number for the maximum number of words, i.e., levels, in the input string. Each time warp covers an area of $\bar{J} \cdot N / 3$ grid points. The factor $1/3$ is due to the Itakura constraints imposed on the local slope of the time warping path. These constraints result in a global restriction of the search area for the path. Hence, the total number of distance calculations is $M \cdot K \cdot \bar{J} \cdot N / 3$. At each of the $M$ levels and for each of the $N$ time frames of the input pattern, the accumulated distance, the best template and the starting position on the input pattern time axis must be kept track of. This requires $3 \cdot N \cdot M$ storage locations. The level building algorithm turns out to be equivalent to the one-stage algorithm with a prescribed maximum number of words in the string: the levels correspond to the copies made of each template as described in the preceding section. For the reduced level building algorithm, the size of a time warp is $\bar{J} \cdot (2R + 1)$, where $R$ is a range parameter and defines a reduced search region for the path. It is worth noting, however, that the reduced level building algorithm is not guaranteed to find the globally optimal sequence of words.

The one-stage algorithm performs one time warp for each template. Strictly speaking, the term time warp is not appropriate for this algorithm, since actually only one global time warp is performed. However, the area covered by this global time warp as shown in Fig. 1 can be thought of as decomposed into smaller time warps for each template, as far as local distance computations are concerned. Each of these smaller time warps is of size $\bar{J} \cdot N$. Hence, the total number of local distance computations is $K \cdot \bar{J} \cdot N$. Since the one-stage algorithm must
keep track of the accumulated distances and backpointers for all vocabulary words at a time, it is appropriate to include the accumulated distances and the backpointers in the storage costs. Thus, the storage locations required are $2 \cdot (\bar{J} \cdot K + N)$ as shown in Section IV. The above considerations show the one-stage algorithm to be the only algorithm that calculates each local distance exactly once. Furthermore, the amount of computation and storage required for the one-stage algorithm is independent of the maximum allowed number of words in the input string as opposed to the level building algorithm and its reduced version.

TABLE I. Computational and storage requirements of the four algorithms.

Algorithm                 Number of basic time warps   Size of time warps         Total computation                          Storage
Two-level [3]             $N \cdot K$                  $\bar{J} \cdot (2R + 1)$   $N \cdot K \cdot \bar{J} \cdot (2R + 1)$   $2 \cdot N \cdot (2R + 1)$
Level building [4]        $M \cdot K$                  $\bar{J} \cdot N / 3$      $M \cdot K \cdot \bar{J} \cdot N / 3$      $3 \cdot N \cdot M$
Reduced level building    $M \cdot K$                  $\bar{J} \cdot (2R + 1)$   $M \cdot K \cdot \bar{J} \cdot (2R + 1)$   $3 \cdot N \cdot M$
One-stage                 $K$                          $\bar{J} \cdot N$          $K \cdot \bar{J} \cdot N$                  $2 \cdot (\bar{J} \cdot K + N)$

Algorithm                 Number of basic time warps   Size of time warps   Total computation   Storage
Two-level [3]             3600                         875                  3 150 000           18 000
Level building [4]        120                          4200                 504 000             12 960
Reduced level building    120                          875                  105 000             12 960
One-stage                 10                           12 600               126 000             1420

where
$\bar{J}$ = 35 = average length of a template
$K$ = 10 = number of templates
$M$ = 12 = maximum number of words in the input string, e.g., for voice dialing (5 + 7) digits
$N$ = 360 = length of the input string
$R$ = 12 = range parameter for time warping
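The statement that each local distance is calculated exactly once can be illustrated with a small sketch. The Python code below is a minimal illustration under stated assumptions, not the implementation of the paper: it assumes a within-word recursion over the three predecessors $(i-1, j)$, $(i-1, j-1)$, $(i, j-1)$ and a between-word transition from the best word end at the previous input frame. The function name `one_stage_dp`, the scalar toy features and the absolute-difference distance are hypothetical choices made for the example. For readability it builds a fresh column per input frame instead of updating the arrays in place.

```python
import math


def one_stage_dp(test, templates, dist):
    """Sketch of a one-stage DP search (assumed recursion, see lead-in).

    Returns (word indices, total accumulated distance, number of local
    distance evaluations).
    """
    n_frames = len(test)
    # Accumulated distances and backpointers of the previous input frame.
    prev_d = [[math.inf] * len(t) for t in templates]
    prev_b = [[0] * len(t) for t in templates]
    best_end = 0.0                            # best word-end cost so far
    from_template = [None] * (n_frames + 1)   # T: best word ending at a boundary
    from_frame = [0] * (n_frames + 1)         # F: boundary where that word began
    n_local = 0

    for i, x in enumerate(test):
        new_d, new_b = [], []
        for k, tmpl in enumerate(templates):
            col_d, col_b = [], []
            for j, r in enumerate(tmpl):
                if j == 0:
                    # Stay in the first frame, or enter from the best word end.
                    cands = [(prev_d[k][0], prev_b[k][0]), (best_end, i)]
                else:
                    cands = [(prev_d[k][j], prev_b[k][j]),
                             (prev_d[k][j - 1], prev_b[k][j - 1]),
                             (col_d[j - 1], col_b[j - 1])]
                cost, back = min(cands)
                col_d.append(cost + dist(x, r))   # one local distance per grid point
                col_b.append(back)
                n_local += 1
            new_d.append(col_d)
            new_b.append(col_b)
        prev_d, prev_b = new_d, new_b
        k_best = min(range(len(templates)), key=lambda k: prev_d[k][-1])
        best_end = prev_d[k_best][-1]
        from_template[i + 1] = k_best
        from_frame[i + 1] = prev_b[k_best][-1]

    # Trace back through the "from template" and "from frame" arrays.
    words, b = [], n_frames
    while b > 0:
        words.append(from_template[b])
        b = from_frame[b]
    words.reverse()
    return words, best_end, n_local


templates = [[1, 2, 3], [7, 5], [9, 4, 8]]      # toy words A, B, C
test = [1, 1, 2, 3, 3, 7, 5, 5, 9, 4, 4, 8]     # A, B, C with time distortion
words, cost, n_local = one_stage_dp(test, templates, lambda x, r: abs(x - r))
print(words, cost, n_local)
```

For this toy input the recovered word sequence is A, B, C, and the counter equals the number of input frames times the total number of template frames (12 times 8), which corresponds to $K \cdot \bar{J} \cdot N$ in the notation of the text.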
Computational reductions by pruning techniques are also applicable to the one-stage algorithm [12]-[14]. They are based on the observation that candidate paths with accumulated distances significantly larger than the minimum accumulated distance for the input frame under consideration are very unlikely to result in the optimal path. Candidate paths with accumulated distances exceeding a threshold set relative to the best path candidate for the input frame considered can be removed from the subsequent operations.

Hence, the pruning operation requires the following additional computational steps to be performed for each grid point in connection with the evaluation of the dynamic programming recursion: one comparison for determining the minimum of the accumulated distances and three comparisons concerning the accumulated distances of the predecessor grid points for deciding whether to prune the grid point under consideration or not. As a result, the pruning overhead involves four comparisons per grid point, which definitely require less than a few percent of the overall computation time per grid point without pruning. For the pruned grid point or the corresponding path candidate, no local distance is calculated. The pruning technique results in an adjustment window of varying size [13]: when there is a very well defined minimum for the accumulated distances, only a few path candidates are retained; on the other hand, when there is no clear minimum at all, all or nearly all path candidates are expanded. A conservative estimate of the reduction factor achieved by pruning is three or more; apart from a few extra storage locations, there is no increase in storage requirements.
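The thresholding idea behind the pruning can be sketched as follows. This is a simplified illustration, not the per-grid-point test on predecessor distances described above: it only shows how a threshold set relative to the best candidate of the current input frame produces a window of varying size. The function name `prune` and the threshold value are hypothetical.

```python
def prune(accumulated, threshold):
    """Return the indices of the path candidates kept for one input frame.

    A candidate survives if its accumulated distance does not exceed the
    best accumulated distance of this frame by more than `threshold`.
    """
    best = min(accumulated)
    return [idx for idx, d in enumerate(accumulated) if d <= best + threshold]


# Well-defined minimum: only a few candidates are retained.
sharp = prune([1.0, 50.0, 60.0, 2.0], threshold=5.0)
# No clear minimum: all candidates are expanded.
flat = prune([10.0, 11.0, 12.0, 10.5], threshold=5.0)
print(sharp, flat)
```

With a fixed threshold, the number of surviving candidates adapts to how sharply the accumulated distances are peaked, which is the behavior the text describes as an adjustment window of varying size.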
Table I also gives a numerical comparison of the computational and storage requirements of the four algorithms for some typical parameter values: $M = 12$, $K = 10$, $N = 360$, $\bar{J} = 35$, $R = 12$. Such parameter values are appropriate for the application of voice dialing, where there are, e.g., twelve digits in the input string: five digits for the area code and seven digits for the telephone number. For such long digit strings, it may be useful to include the endpoint detection in the recognition algorithm as described in Section V. Apart from the parameters $M$ and $N$, the values are the same as chosen by Myers and Rabiner [4]. Table I shows for this example that the one-stage algorithm requires only 1/25 of the computational amount of the two-level algorithm of Sakoe and only 1/4 of the computational amount of the level building algorithm of Myers and Rabiner. As for the storage requirements, the one-stage algorithm offers a reduction factor of 9 or more as compared to the three other algorithms. Only the reduced level building algorithm has a computational amount comparable with that of the one-stage algorithm, however without ensuring the global optimality of the solution. Moreover, by using pruning, the computational expenditure of the one-stage algorithm could be reduced further.
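The numerical comparison can be reproduced from the formulas given in the text. The following Python sketch evaluates them for the parameter values of the voice dialing example; the function name `requirements` and the dictionary layout are hypothetical, and the Itakura factor 1/3 is applied with integer division, which is exact for these values.

```python
def requirements(M, K, N, J, R):
    """Evaluate the computation and storage formulas of the four algorithms."""
    band = 2 * R + 1
    rows = {
        # name: (number of basic time warps, size of a time warp, storage)
        "two-level":              (N * K, J * band,   2 * N * band),
        "level building":         (M * K, J * N // 3, 3 * N * M),
        "reduced level building": (M * K, J * band,   3 * N * M),
        "one-stage":              (K,     J * N,      2 * (J * K + N)),
    }
    return {name: {"warps": w, "size": s, "computation": w * s, "storage": st}
            for name, (w, s, st) in rows.items()}


table = requirements(M=12, K=10, N=360, J=35, R=12)
for name, row in table.items():
    print(name, row)

one = table["one-stage"]
print(table["two-level"]["computation"] / one["computation"])       # 25.0
print(table["level building"]["computation"] / one["computation"])  # 4.0
print(min(row["storage"] for name, row in table.items()
          if name != "one-stage") / one["storage"])
```

The two computation ratios correspond to the factors 1/25 and 1/4 quoted above, and the smallest storage ratio is slightly above 9.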
VII. SUMMARY

In this paper, a tutorial presentation of a one-stage dynamic programming algorithm for connected word recognition and its relation to other dynamic programming algorithms has been given. The one-stage algorithm is basically the same as the algorithms given by Vintsyuk [1] and developed by Bridle and Brown [2]. The derivation of the one-stage algorithm is based on parameterizing the time warping path by one single index and results in a one-stage dynamic programming algorithm. The dynamic programming strategy performs three functions simultaneously: the time alignment, the word boundary detection and the classification itself. Thus, the problem of concatenating reference patterns of a connected word string is solved in a highly efficient manner. The main features of the one-stage algorithm include the following.

1) For input strings with no syntactic constraints, the algorithm is independent of a prespecified maximum number of words in the input string.

2) The computational effort is proportional to the number of time frames in the reference patterns and in the input string; each local distance between a reference frame and an input frame is calculated exactly once. In particular, the computational effort per input frame turns out to be exactly the same as in the case that the input string consists of isolated words and an isolated word recognition with no adjustment window is performed for each word.

3) The algorithm lends itself to a straightforward and simple implementation.

4) Syntactic constraints, formulated in terms of finite state syntaxes, can be dealt with by simple modifications of the algorithm and increase the computational expenditure only negligibly.

ACKNOWLEDGMENT

The author thanks the reviewers for helpful and constructive comments.

REFERENCES

[1] T. K. Vintsyuk, "Element-wise recognition of continuous speech composed of words from a specified dictionary," Kibernetika, vol. 7, pp. 133-143, Mar.-Apr. 1971.
[2] J. S. Bridle and M. D. Brown, "Connected word recognition using whole word templates," in Proc. Inst. Acoust. Autumn Conf., Nov. 1979, pp. 25-28.
[3] H. Sakoe, "Two-level DP-matching - A dynamic programming-based pattern matching algorithm for connected word recognition," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-27, pp. 588-595, Dec. 1979.
[4] C. S. Myers and L. R. Rabiner, "A level building dynamic time warping algorithm for connected word recognition," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-29, pp. 284-297, Apr. 1981.
[5] R. Bellman, Dynamic Programming. Princeton, NJ: Princeton Univ. Press, 1957.
[6] T. K. Vintsyuk, "Speech discrimination by dynamic programming," Kibernetika, vol. 4, pp. 81-88, Jan.-Feb. 1968.
[7] F. Itakura, "Minimum prediction residual principle applied to speech recognition," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-23, pp. 65-72, Feb. 1975.
[8] H. Sakoe and S. Chiba, "Dynamic programming algorithm optimization for spoken word recognition," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-26, pp. 43-49, Feb. 1978.
[9] M. R. Sambur and L. R. Rabiner, "A statistical decision approach to the recognition of connected digits," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-24, pp. 550-558, Dec. 1976.
[10] R. Zelinski and F. Class, "A segmentation procedure for connected word recognition based on estimation principles," in Proc. 1981 IEEE Int. Conf. Acoust., Speech, Signal Processing, Atlanta, GA, pp. 960-963, Mar.-Apr. 1981.
[11] J. K. Baker, "The DRAGON system - An overview," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-23, pp. 24-29, Feb. 1975.
[12] B. T. Lowerre, "The HARPY speech recognition system," Ph.D. dissertation, Carnegie Mellon Univ., Dep. Comp. Sci., Pittsburgh, PA, Apr. 1976.
[13] J. S. Bridle, M. D. Brown, and R. M. Chamberlain, "An algorithm for connected word recognition," in Proc. 1982 IEEE Conf. Acoust., Speech, Signal Processing, Paris, France, May 1982, pp. 899-902.
[14] P. F. Brown, J. C. Spohrer, P. H. Hochschild, and J. K. Baker, "Partial traceback and dynamic programming," in Proc. 1982 IEEE Conf. Acoust., Speech, Signal Processing, Paris, France, May 1982, pp. 1629-1632.
[15] J. Peckham, J. Green, J. Canning, and P. Stephens, "Logos - A real-time hardware continuous speech recognition system," in Proc. 1982 IEEE Conf. Acoust., Speech, Signal Processing, Paris, France, May 1982, pp. 863-866.
[16] H. Ney, "Connected utterance recognition using dynamic programming," in Proc. 3rd Cong. FASE, DAGA, Goettingen, Germany, Sept. 1982, pp. 915-918.
[17] H. Ney, "Dynamic programming as a technique for pattern recognition," in Proc. 6th Int. Conf. Pattern Recogn., Munich, Germany, Oct. 1982, pp. 1119-1125.