
POSTER, MAY 2005
Speech Recognizer based on the FSM
Pavel Štemberk
Dept. of Circuit Theory, Czech Technical University, Technická 2, 166 27 Praha, Czech Republic
[email protected]
Abstract. The main issues in speech recognition are relatively complex. At present the recognition process is not very efficient (in terms of recognition accuracy and the complexity of the algorithm). The standard solution [5] has its disadvantages, which could be compensated by other methods of searching for the most probable word sequence for a given HMM.
Keywords
FSM, circuit theory, speech recognition, recognizer.
1. Introduction
Speech recognition based on hidden Markov models is essentially statistics of the parametrised speech signal. Thus the method of matching this parametrised speech signal (the observation sequence) with trained HMMs, which represent phonemes, could be the main way of improving the properties of present-day speech recognisers. The most widely used solution at present is the creation of a feed-forward network followed by the token passing algorithm [5], which deals with the problem mentioned above. However, this solution has its disadvantages, which could be compensated by other methods of searching for the most probable word sequence for a given HMM.
2. Finite-State Machines - FSM

2.1 Weighted Transducers (WFST)

The application of the weighted finite-state transducer (WFST) approach to speech recognition [1,2,4] was developed at AT&T over the last several years. A transducer is a finite-state device that encodes a mapping between input and output symbol sequences; a weighted transducer associates weights such as probabilities, durations, penalties, or any other quantity that accumulates linearly along paths, to each pair of input and output symbol sequences.

A WFST T = (Σ, Ω, Q, E, i, F, λ, ρ) over the semiring K [2] is given by an input alphabet Σ, an output alphabet Ω, a finite set of states Q, a finite set of transitions E, an initial state i ∈ Q, a set of final states F ⊆ Q, an initial weight λ and a final weight ρ. A transition t ∈ E can be represented by an arc (1) from the source state of t to the destination state of t, with the input label l_i[t], the output label l_o[t] and the weight w[t].

The definitions of path, path input label and path weight are those given earlier for acceptors. A path's output label is the concatenation of the output labels of its transitions. Fig. 1 above represents a language model [2], where each transition has identical input and output labels. Fig. 1 below represents a toy pronunciation lexicon as a mapping from phone sequences to words in the lexicon, in this example "čtyři" and "dva", with probabilities representing the likelihood of alternative pronunciations.

Fig. 1. WFST examples.
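As a minimal illustration of the definition above (a Python sketch with hypothetical toy arcs, not the representation used by the FSM library), a WFST can be stored as a list of arcs and a path weight accumulated in the tropical semiring, where weights are added along a path:

    # Minimal WFST sketch: arcs are tuples (src, dst, in_label, out_label, weight).
    # Weights behave as in the tropical semiring: they are added along a path.
    class WFST:
        def __init__(self, initial, finals):
            self.initial = initial
            self.finals = set(finals)
            self.arcs = []

        def add_arc(self, src, dst, ilab, olab, weight=0.0):
            self.arcs.append((src, dst, ilab, olab, weight))

    def path_weight(path):
        # Tropical semiring: the weight of a path is the sum of its arc weights.
        return sum(w for (_, _, _, _, w) in path)

    # Hypothetical toy lexicon path: phones "d", "v", "a" -> word "dva".
    lex = WFST(initial=0, finals={3})
    lex.add_arc(0, 1, "d", "dva", 0.22)   # 0.22 ~ -log(pronunciation prob), assumed
    lex.add_arc(1, 2, "v", "<eps>", 0.0)
    lex.add_arc(2, 3, "a", "<eps>", 0.0)
    print(path_weight(lex.arcs))          # 0.22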
2.2 Composition

There are various operations on WFSTs [2,4]. Descriptions of all the operations can be found in [4], or in any manual of an FSM toolkit. In this paper the most useful FSM operation, called composition, will be explained. The composition of two transducers R and S,

    T = R ∘ S,    (2)

has exactly one path mapping sequence u to sequence w for each pair of paths, the first in R mapping u to some sequence v and the second in S mapping v to w. The weight of a path in T is the sum of the weights of the corresponding paths in R and S in the tropical semiring (default) case [2]. Composition is useful for combining the different levels of the recognizer representation, called a recognition cascade (see below).
The composition (2) of transducers R and S has pairs of an R state and an S state as its states, and satisfies the following conditions:

- Its initial state is the pair of the initial states of R and S.

- Its final states are the pairs of a final state of R and a final state of S.

- There is a transition t from (r, s) to (r', s') for each pair of transitions t_R from r to r' and t_S from s to s' such that the output label of t_R matches the input label of t_S. The transition t takes its input label from t_R, its output label from t_S, and its weight is the sum of the weights of t_R and t_S in the tropical semiring case.

Transitions with ε labels in R or S must be treated specially, as discussed elsewhere [1]. An example of the composition of two transducers [2] is shown in Fig. 2.

Fig. 2. Two WFSTs composition example.
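To make the construction concrete, the following sketch (assuming the WFST class from the sketch in Section 2.1 and no ε labels) implements the three conditions above in the tropical semiring; it is only an illustration, not the FSM library's algorithm:

    def compose(R, S):
        # States of T are pairs (r, s); weights are added (tropical semiring).
        # ε labels would need the special treatment discussed in [1]; omitted here.
        T = WFST(initial=(R.initial, S.initial),
                 finals={(rf, sf) for rf in R.finals for sf in S.finals})
        for (r, r2, a, b, w1) in R.arcs:
            for (s, s2, c, d, w2) in S.arcs:
                if b == c:  # output label of t_R matches input label of t_S
                    T.add_arc((r, s), (r2, s2), a, d, w1 + w2)
        return T

    # Usage: T = compose(R, S) for two WFST objects built as in Section 2.1.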
3. Recognition Cascade

By the recognition cascade we mean the composition

    H ∘ C ∘ L ∘ G,    (3)

where

- H is the HMM transducer, which maps states of the individual HMMs into context-dependent phonemes (triphones),

- C is the context-dependency transducer, which maps context-dependent phonemes (triphones) into context-independent ones,

- L is the lexicon transducer, which maps context-independent phonemes into words,

- G is the automaton which represents the grammar (the likelihood of word sequences).
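A toy sketch of how the four levels could be combined, using the compose function sketched in Section 2.2 (the actual system uses the AT&T FSM tools and also determinizes and minimizes the result, see (4) in Section 4):

    # Hypothetical transducers H, C, L, G built elsewhere as WFST objects.
    # The cascade (3) is obtained by composing them pairwise, right to left.
    def recognition_cascade(H, C, L, G):
        return compose(H, compose(C, compose(L, G)))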
3.1 Grammar WFST

A simplified example of a grammar for a chess game is shown in Fig. 1 above. The grammar can be created either by hand (in the voice-control case, for example the chess game mentioned above), or from any long texts, from which the coefficients are computed. In the second case the automaton is called an n-gram, where n - 1 is the number of previous words taken as the history. For example, for n = 2 and two words w1 and w2 the grammar automaton looks as shown in Fig. 3. Here P(w) means the likelihood of the occurrence of the word w, and P(w1|w2) means the likelihood of the occurrence of the word sequence w2, w1.

Fig. 3. Bigram of two words w1 and w2.
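As an illustration (a sketch with hypothetical counts, not the tool actually used for building G), the bigram coefficients can be estimated from text and stored as negative log probabilities, which is how they would appear as arc weights of G in the tropical semiring:

    import math
    from collections import Counter

    def bigram_weights(words):
        # Estimate P(w) and P(next | previous) from a word list and return
        # negative log probabilities (tropical-semiring arc weights).
        unigrams = Counter(words)
        bigrams = Counter(zip(words, words[1:]))
        total = sum(unigrams.values())
        uni = {w: -math.log(c / total) for w, c in unigrams.items()}
        bi = {(w1, w2): -math.log(c / unigrams[w1])
              for (w1, w2), c in bigrams.items()}
        return uni, bi

    # Hypothetical toy text over the two words w1 = "ano", w2 = "ne".
    uni, bi = bigram_weights(["ano", "ne", "ano", "ano", "ne"])
    print(uni["ano"], bi[("ano", "ne")])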
3.2 Lexicon WFST

A lexicon transducer maps an input sequence of phonemes into words, and alternative pronunciations can be considered (see for example Fig. 1 below). If the recognition cascade (3) is to work properly, an operation called closure must be applied to the lexicon. An example of a closured lexicon is shown in Fig. 4.

Fig. 4. Example of a closured lexicon.
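A minimal sketch of the closure operation on the WFST class from Section 2.1 (one common construction, assumed here for illustration; it is not the FSM library's algorithm): the machine is allowed to accept any number of repetitions by linking final states back to the initial state with ε arcs and by accepting the empty sequence.

    EPS = "<eps>"

    def closure(M):
        # Kleene closure sketch: after reaching a final state the machine may
        # start over, so word sequences (not only single words) are accepted.
        for f in list(M.finals):
            M.add_arc(f, M.initial, EPS, EPS, 0.0)
        M.finals.add(M.initial)   # the empty sequence is accepted as well
        return M

    # Usage: closure(lex) on the toy lexicon from Section 2.1.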
3.3 Context-Dependency WFST

A context-dependency transducer maps context-dependent phonemes (triphones) into context-independent ones¹. The transducer has on the order of n² states and n³ transitions in the triphone (our) case, where n is the number of phonemes. For lucidity, Fig. 5 shows a context-dependency transducer for two phonemes only, where a context-dependent phoneme is marked as phoneme/left context_right context.

Fig. 5. Context-dependency transducer example; for simplicity only the two phonemes "x" and "y" are shown.

¹ The opposite mapping can be achieved by swapping the input and output labels (inversion).
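The quadratic and cubic growth can be seen from one common construction (a sketch, ignoring utterance-boundary contexts, and not necessarily the construction used for Fig. 5): a state remembers the last two phonemes, and each arc reads a triphone symbol and writes its centre phoneme.

    def context_dependency_arcs(phonemes):
        # States are pairs (left, centre); an arc from (a, b) to (b, c)
        # reads the triphone "b/a_c" and writes the plain phoneme "b".
        arcs = []
        for a in phonemes:
            for b in phonemes:
                for c in phonemes:
                    arcs.append(((a, b), (b, c), f"{b}/{a}_{c}", b, 0.0))
        return arcs

    arcs = context_dependency_arcs(["x", "y"])
    states = {s for (src, dst, *_) in arcs for s in (src, dst)}
    print(len(states), len(arcs))   # 4 states, 8 transitions for n = 2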
3.4 WFST representing HMM

HMMs are described in detail in [3,5]. An HMM transducer must contain all the context-dependent models used in the grammar. In the case of more extensive systems we may use about 10 000 models. The weights are given by a transition matrix A, distribution functions b_j(o_t), and the particular speech vectors (observations) o_t. Thus the weights are known only after the end of the utterance and after the expansion of the recognition cascade into the feed-forward non-deterministic network. An example of two closured HMMs is shown in Fig. 6.

Fig. 6. Two 3-state HMMs of the phonemes "x" and "y", closured.
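A sketch of how one 3-state left-to-right HMM could be laid out as a WFST (using the conventions of the WFST class from Section 2.1; the transition probabilities a_ij below are hypothetical): the transition weights are -log a_ij, while the emission terms -log b_j(o_t) can only be added later, during the expansion, once the observations are known.

    import math

    def hmm_transducer(phone, A):
        # A[i][j] = transition probability a_ij of a 3-state left-to-right HMM.
        # Input labels name the emitting states; the output label is the phone.
        # Emission scores -log b_j(o_t) are added later, during expansion.
        M = WFST(initial=0, finals={3})
        for i in range(3):
            M.add_arc(i, i + 1, f"{phone}.s{i+1}",
                      phone if i == 0 else "<eps>",
                      -math.log(A[i][i + 1]))          # forward transition
            M.add_arc(i + 1, i + 1, f"{phone}.s{i+1}", "<eps>",
                      -math.log(A[i + 1][i + 1]))      # self-loop
        return M

    # Hypothetical transition matrix (rows/columns 0..3), e.g. a_12 = A[1][2].
    A = [[0, 1.0, 0, 0],
         [0, 0.6, 0.4, 0],
         [0, 0, 0.7, 0.3],
         [0, 0, 0, 0.5]]
    hmm_x = hmm_transducer("x", A)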
4. Speech Recognition Using WFST

An example of a speech recogniser of the two words "ano" and "ne" is shown in Fig. 8. It illustrates the complexity of the resulting automaton. Its creation from the recognition cascade takes place after the end of the spoken utterance (the number of observations is needed). The step-by-step procedure leading to the speech recogniser based on FSMs is written below; a small illustrative sketch of steps 3-5 is given after Fig. 8.

1. We create the transducers G, L, C and H from a grammar (created either by hand or from any extensive text) and build the recognition cascade (3) from them [2]. We get

    min(det(H ∘ det(C ∘ det(L ∘ G)))).    (4)

An example of a piece of the recognition cascade which represents a phoneme-based recogniser of the two words "ano" and "ne" is shown in Fig. 7 on the left (the rest of this automaton can be found in [6]).

2. We obtain T observations O = o_1, o_2, ..., o_T after the utterance is parametrised.

3. We create a probability matrix P using trained HMMs for the elementary phonemes (they can be obtained for example using HTK [5]) and the known observations O. The elements of this matrix are the outputs of the distribution functions b_{M_j}(o_t), where M_j denotes the j-th row of the matrix (the HMM model M attached to state j) and t denotes the t-th column of the matrix.

4. We expand the recognition cascade into the feed-forward non-deterministic network using the probability matrix P; a part of it is shown in Fig. 7 on the right. The weights are given by the parameters of the appropriate HMM models and the observations o_t [5], as shown in the example in Fig. 7.

5. We find the n best paths using the fsmbestpath tool from the FSM toolkit [1] (Viterbi search). The output sequences represent the n most likely variants of the sequences of recognised HMM triphones.

6. We bring these obtained sequences of phonemes to the input of the incomplete recognition cascade

    det(C ∘ det(L ∘ G)).    (5)

7. We choose the most likely utterance as the output of the transducer with the lowest weight².

² Weights are assumed to be negative log probabilities here (the tropical semiring is used).

Fig. 7. The detail of weight creation in the expanded non-deterministic automaton.

Fig. 8. Recognition cascade for the words "ano" and "ne" (on the left). The recognition cascade from the left expanded into the feed-forward non-deterministic network (on the right).
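The following sketch illustrates steps 3-5 under simplifying assumptions: each HMM state j is modelled by a single diagonal-covariance Gaussian (mean, var), the expanded network is given as a generic arc list, and n-best extraction and backtracking are omitted. It is only an outline of the computation, not the rct or FSM toolkit implementation.

    import math

    def neg_log_b(o, mean, var):
        # -log b_j(o_t) for one diagonal-covariance Gaussian (a tropical weight).
        return 0.5 * sum(math.log(2 * math.pi * v) + (x - m) ** 2 / v
                         for x, m, v in zip(o, mean, var))

    def probability_matrix(observations, state_pdfs):
        # Step 3: P[j][t] = -log b_j(o_t); state_pdfs is a list of (mean, var).
        return [[neg_log_b(o, m, v) for o in observations] for (m, v) in state_pdfs]

    def best_path_cost(arcs, initial, finals, P):
        # Steps 4-5 (one-best only): arcs are (src, dst, j, trans_w) with
        # trans_w = -log a_ij; the emission cost P[j][t] is added at time t.
        T = len(P[0])
        best = {initial: 0.0}
        for t in range(T):
            new = {}
            for (src, dst, j, trans_w) in arcs:
                if src in best:
                    cand = best[src] + trans_w + P[j][t]
                    if cand < new.get(dst, math.inf):
                        new[dst] = cand
            best = new
        reachable = [s for s in best if s in finals]
        if not reachable:
            return None, math.inf
        end = min(reachable, key=lambda s: best[s])
        return end, best[end]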
5. Experimental Results

The main algorithm for creating the elementary components of the recognition cascade and for expanding the recognition cascade into the feed-forward non-deterministic network is implemented in a program called rct (recognition cascade toolkit). Fig. 9 has been included as a demonstration of this algorithm.

For the realization of the WFST operations I use the FSM library v 4.0 from AT&T (Mohri et al. 2000). This library is considered the most efficient one available, but it is distributed only in binary form under a non-commercial license.

The HTK toolkit v3.2.1 [5] is used for training the HMMs which represent Czech triphones. The training data come from the Czech speech database SPEECON (about 1000 speakers). This database is for student purposes only and is available at the Department of Circuit Theory, CTU FEE.

For the data processing (which is needed for both HTK training and any recognition) a program called hdp (HTK data preparation toolkit) was written.
Type of recogniser | Without backward transitions in sil. models | With backward transitions in sil. models
"ano"-"ne"         | 80.00%                                      | 81.50%
0-9                | 90.00%                                      | 90.50%

Tab. 1. Experimental results on simple recognizers.
Tab. 1 shows the first experimental results on simple recognizers. The backward transition in the silence model does not seem to be the most important factor. The algorithm works properly; however, these good results are achieved by adding silences so that the elementary paths of the given utterance possibilities have the same lengths.

Fig. 9. The expanded feed-forward network. It represents a recognizer of the two words "ano" and "ne" for 173 observations on the input. Created by the program rct.

6. Conclusion

The following paragraphs describe the basic properties of speech recognizers created either with the HTK toolkit or with the FSM toolkit.

Creating the recognition network is an embedded process in HTK: the lattice³ is created either by the tool HParse (when the grammar is written by hand) or by HBuild (when the grammar is taken from text) [2]. The main feed-forward recognition network is created after the observations o_1, o_2, ..., o_T are obtained. For the finite-state machine method the feed-forward network is likewise created immediately after the parametrisation of the input utterance. However, the minimized and determinized recognition cascade is used here for the creation of the feed-forward recognition network. This means the feed-forward network is created more easily (the resulting recognition network has lower space and time complexity).

³ Grammar network [2].

The network representation of context-dependent phonemes is extremely hard when the feed-forward recognition network is created by the HTK toolkit. The FSM method solves this problem before the recognition process. The context-dependency transducer is a relatively difficult automaton, but the creation of this automaton and the following composition of C with L ∘ G are done before the recognition process has begun. Thus there is also a time saving.

The distribution of weights in the HTK lattice can lead to less effective searching for the best path. For FSMs there exists an operation called weight pushing, which redistributes the weights by their value in the L ∘ G part of the recognition cascade while the resulting automaton stays equivalent to the original [1,2,4].
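For illustration only, here is a sketch of weight pushing in the tropical semiring over the arc-list representation used in the earlier sketches (not the FSM library's implementation, and assuming every state can reach a final state): each state is given a potential equal to its shortest distance to a final state, and arc weights are reweighted so that total path weights, and hence the behaviour of the automaton, are unchanged.

    import math

    def push_weights(M, final_weight=0.0):
        # Potential d[q] = shortest (tropical) distance from q to a final state,
        # computed by simple relaxation (Bellman-Ford style).
        states = {s for (src, dst, *_) in M.arcs for s in (src, dst)} | {M.initial}
        d = {q: (final_weight if q in M.finals else math.inf) for q in states}
        for _ in range(len(states)):
            for (src, dst, i, o, w) in M.arcs:
                if w + d[dst] < d[src]:
                    d[src] = w + d[dst]
        # Reweight: w'(e) = w(e) + d[dst] - d[src]; every path weight changes by
        # the same constant d[initial], which is pushed onto the start.
        M.arcs = [(src, dst, i, o, w + d[dst] - d[src])
                  for (src, dst, i, o, w) in M.arcs]
        return d[M.initial]   # weight pushed onto the initial state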
Acknowledgements

The presented work was supported by GAČR 102/05/0278 "New Trends in Research and Application of Voice Technology", GAČR 102/03/H085 "Biological and Speech Signals Modeling", and research activity MSM 6840770014 "Research in the Area of the Prospective Information and Navigation Technologies".

References
[1] Pereira, F. C. N., Riley, M. D. Speech recognition by composition of weighted finite automata. In: Finite-State Language Processing, MIT Press, Cambridge, Massachusetts, 1997.
[2] Mohri, M., Pereira, F., Riley, M. Weighted finite-state transducers in speech recognition. Computer Speech and Language, 16(1):69-88, 2002.
[3] Rabiner, L., Juang, B. H. Fundamentals of Speech Recognition. PTR Prentice Hall, Englewood Cliffs, N.J., 1993. 507 p.
[4] Roche, E., Schabes, Y. (eds.) Finite-State Language Processing. MIT Press, 1997. 464 p. ISBN 0-262-18182-7.
[5] Young, S. et al. The HTK Book (for HTK Version 3.2.1). Microsoft Corporation, Cambridge University Engineering Department, 3.2 edition, 2002.
[6] Štemberk, P. Speech recognition based on FSM and HTK toolkits. In: Proceedings of Digital Technologies 2004, EDIS - Žilina University publishers, Žilina, ISBN 80-8070-334-5.
About Author...

Pavel ŠTEMBERK was born in Mladá Boleslav. He has been a PhD student at FEE CTU since March 2003, in the Theoretical Fundamentals of Electrical Engineering programme; his thesis topic is "Implementation of speech recognisers into multimedial platforms".