POSTER, MAY 2005

Speech Recognizer Based on the FSM

Pavel Štemberk
Dept. of Circuit Theory, Czech Technical University, Technická 2, 166 27 Praha, Czech Republic
[email protected]

Abstract. The main issues in speech recognition are relatively complex. At present the recognition process is not very efficient (in terms of recognition accuracy and the complexity of the algorithm). The standard solution [5] has its disadvantages, which could be compensated by other methods of searching for the most probable word sequence for a given HMM.

Keywords
FSM, circuit theory, speech recognition, recognizer.

1. Introduction

Speech recognition based on hidden Markov models is essentially a statistical description of the parametrised speech signal. The method used to match this parametrised speech signal (the observation sequence) against the trained HMMs that represent phonemes is therefore the main place where the properties of present-day speech recognisers can be improved. The most common solution today is the creation of a feed-forward network followed by the token passing algorithm [5]. However, this solution has its disadvantages, which could be compensated by other methods of searching for the most probable word sequence for the given HMMs.

2. Finite-State Machines - FSM

The weighted finite-state transducer (WFST) approach to speech recognition [1,2,4] was developed at AT&T over the last several years. A transducer is a finite-state device that encodes a mapping between input and output symbol sequences; a weighted transducer associates weights such as probabilities, durations, penalties, or any other quantity that accumulates linearly along paths with each pair of input and output symbol sequences.

2.1 Weighted Transducers (WFST)

A WFST T = (Σ, Ω, Q, E, i, F, λ, ρ) over the semiring K [2] is given by an input alphabet Σ, an output alphabet Ω, a finite set of states Q, a finite set of transitions E, an initial state i ∈ Q, a set of final states F ⊆ Q, an initial weight λ and a final weight ρ. A transition t ∈ E can be represented by an arc

  t = (p(t), l_i(t), l_o(t), w(t), n(t))   (1)

from the source state p(t) to the destination state n(t), with the input label l_i(t), the output label l_o(t) and the weight w(t). The definitions of a path, a path input label and a path weight are those given earlier for acceptors. A path's output label is the concatenation of the output labels of its transitions.

Fig. 1 (above) represents a language model [2], where each transition has identical input and output labels. Fig. 1 (below) represents a toy pronunciation lexicon as a mapping from phone sequences to words in the lexicon, in this example "čtyři" and "dva", with probabilities representing the likelihood of alternative pronunciations.

Fig. 1. WFST examples.

2.2 Composition

There are various operations on WFSTs [2,4]. Descriptions of all the operations can be found in [4] or in any manual of the FSM toolkit. This paper explains the most useful FSM operation, called composition. The composition of two transducers R and S,

  T = R ∘ S,   (2)

has exactly one path mapping sequence u to sequence w for each pair of paths, the first in R mapping u to some sequence v and the second in S mapping v to w. The weight of a path in T is the sum of the weights of the corresponding paths in R and S in the tropical semiring (default) case [2]. Composition is useful for combining the different levels of the recognizer representation into the so-called recognition cascade (see below).

The composition (2) of transducers R and S has pairs of an R state and an S state as its states, and it satisfies the following conditions:

- Its initial state is the pair of the initial states of R and S.
- Its final states are pairs of a final state of R and a final state of S.
- There is a transition t from (r, s) to (r', s') for each pair of transitions t_R from r to r' and t_S from s to s' such that the output label of t_R matches the input label of t_S. The transition t takes its input label from t_R, its output label from t_S, and its weight is the sum of the weights of t_R and t_S in the tropical semiring case.

Transitions with ε labels in R or S must be treated specially, as discussed elsewhere [1]. An example of the composition of two transducers [2] is shown in Fig. 2.

Fig. 2. Two WFSTs composition example.
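To make the definition above concrete, the following minimal Python sketch composes two transducers given as plain transition lists in the tropical semiring. It is only an illustration of the definition, not the FSM library implementation: ε-transitions, initial and final weights, and all of the toolkit's optimisations are deliberately ignored, and the toy arcs at the end are made-up values.

```python
def compose(R, S):
    """Compose transducers R and S given as lists of transitions.

    A transition is a tuple (src, ilabel, olabel, weight, dst).  A result
    transition pairs an R state with an S state; it exists whenever the
    output label of an R transition matches the input label of an S
    transition, and its weight is the sum of both weights (tropical
    semiring: weights are negative log probabilities, so they add).
    """
    T = []
    for (r, i_r, o_r, w_r, r2) in R:
        for (s, i_s, o_s, w_s, s2) in S:
            if o_r == i_s:                      # output of R matches input of S
                T.append(((r, s), i_r, o_s, w_r + w_s, (r2, s2)))
    return T


# Toy example: one lexicon arc (phone -> word) composed with one grammar arc;
# the composed arc maps "d" to "dva" with weight 0.7 + 0.5.
R = [(0, "d", "dva", 0.7, 1)]        # pronunciation lexicon fragment
S = [(0, "dva", "dva", 0.5, 1)]      # grammar (acceptor) fragment
print(compose(R, S))
```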
3. Recognition Cascade

By the recognition cascade we mean the composition

  H ∘ C ∘ L ∘ G,   (3)

where
- H is the HMM transducer, which maps states of individual HMMs into context-dependent phonemes (triphones),
- C is the context-dependency transducer, which maps context-dependent phonemes (triphones) into context-independent ones,
- L is the lexicon transducer, which maps context-independent phonemes into words,
- G is the automaton which represents the grammar (the likelihood of word sequences).

3.1 Grammar WFST

A simplified example of a grammar for a chess game is shown in Fig. 1 (above). The grammar can be created either by hand (in the voice-control case, for example the chess game mentioned above) or by using arbitrary long texts. In the second case the automaton is called an n-gram, where n is the length of the history of previous words. For example, for n = 2 and two words w1 and w2 the grammar automaton looks as shown in Fig. 3.

Fig. 3. Bigram of two words w1 and w2. P(w) means the likelihood of the occurrence of the word w and P(w1w2) means the likelihood of the occurrence of the word pair w1, w2.

3.2 Lexicon WFST

A lexicon transducer maps an input sequence of phonemes into words, and alternative pronunciations can be taken into account (see, for example, Fig. 1 below). If the recognition cascade (3) is to work properly, an operation called closure must be applied to the lexicon. An example of a closured lexicon is shown in Fig. 4.

Fig. 4. Example of a closured lexicon.

3.3 Context-Dependency WFST

A context-dependency transducer maps context-dependent phonemes (triphones) into context-independent ones (the opposite mapping can be achieved by swapping input and output labels - inversion). In the triphone (our) case the transducer has n² + n + 1 states and roughly n³ transitions, where n is the number of phonemes. For lucidity, Fig. 5 shows a context-dependency transducer for two phonemes only, where a context-dependent phoneme is written as phoneme/left context_right context.

Fig. 5. Context-dependency transducer example; for simplicity only two phonemes "x" and "y" are shown.

3.4 WFST Representing HMM

HMMs are described in detail in [3,5]. The HMM transducer must contain all context-dependent models used in the grammar; in more extensive systems about 10000 models can be used. The weights are given by the transition matrix A, the distribution functions b_j(o_t) and the particular speech vectors (observations) o_t. The weights are therefore known only after the end of the utterance and after the expansion of the recognition cascade into the feed-forward non-deterministic network. An example of two closured HMMs is shown in Fig. 6.

Fig. 6. Two 3-state HMMs of the phonemes "x" and "y" - closured.
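As an illustration of Section 3.4, the sketch below writes one closured 3-state left-to-right phoneme HMM as a list of WFST arcs, mirroring Fig. 6. The topology and the transition probabilities (stay/advance) are assumed values chosen only for the example, and the acoustic term -log b_j(o_t) is deliberately left out because, as stated above, it can only be added once the observations are known.

```python
import math

def hmm_wfst(phoneme, stay=0.6, advance=0.4):
    """Closured 3-state left-to-right HMM as arcs (src, ilabel, olabel, weight, dst).

    The input label names the HMM state (so that -log b_j(o_t) can be added
    to the arc during the expansion), the output label emits the phoneme on
    leaving the last state, and the weights are -log of the (assumed)
    transition probabilities.  The arc from the last state back to the first
    one is what makes the model "closured", as in Fig. 6.
    """
    arcs = []
    for j in range(3):
        label = f"{phoneme}.{j}"
        out = phoneme if j == 2 else "-"                # "-" stands for epsilon
        arcs.append((j, label, "-", -math.log(stay), j))               # self-loop
        arcs.append((j, label, out, -math.log(advance), (j + 1) % 3))  # forward / closure
    return arcs

for arc in hmm_wfst("x"):
    print(arc)
```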
4. Speech Recognition Using WFST

An example of a speech recogniser of the two words "ano" and "ne" is shown in Fig. 8, which illustrates the complexity of the resulting automaton. It is created from the recognition cascade only after the end of the spoken utterance, because the number of observations must be known. The step-by-step procedure that leads to the FSM-based speech recogniser is the following:

1. We create the transducers G, L, C and H from a grammar (written either by hand or obtained from an extensive text) and build the recognition cascade (3) from them [2]:

  min(det(H ∘ det(C ∘ det(L ∘ G)))).   (4)

An example of a piece of a recognition cascade representing a phoneme-based recogniser of the two words "ano" and "ne" is shown in Fig. 7 on the left (the rest of this automaton can be found in [6]).

2. We obtain T observations O = o1, o2, ..., oT after the utterance has been parametrised.

3. We create a probability matrix P using trained HMMs of the elementary phonemes (they can be obtained, for example, using HTK [5]) and the known observations O. The elements of this matrix are the outputs of the distribution functions b_{Mj}(o_t), where M_j denotes the j-th row of the matrix (HMM model M is attached to state j) and t is the t-th column of the matrix.

4. We expand the recognition cascade into the feed-forward non-deterministic network using the probability matrix P; a part of it is shown in Fig. 7 on the right. The weights are given by the parameters of the appropriate HMM models and the observations o_t [5], as shown in the example in Fig. 7.

5. We find the n best paths using the fsmbestpath tool from the FSM toolkit [1] (Viterbi search). The output sequences then represent the n most likely variants of the recognised sequences of HMM triphones.

6. We bring these obtained phoneme sequences to the input of the incomplete recognition cascade

  det(C ∘ det(L ∘ G)).   (5)

7. We choose the most likely utterance as the output of the transducer with the lowest weight (the weights are negatives of log probabilities here, since the tropical semiring is used).

Fig. 7. The detail of the weights creation in the expanded non-deterministic automaton.

Fig. 8. Recognition cascade for the words "ano" and "ne" (on the left). The recognition cascade from the left expanded into the feed-forward non-deterministic network (on the right).

5. Experimental Results

The main algorithm for creating the elementary components of the recognition cascade and for expanding the recognition cascade into the feed-forward non-deterministic network is called rct - recognition cascade toolkit. Fig. 9 has been included as a demonstration of this algorithm. For the realization of the WFST operations the FSM library v4.0 from AT&T (Mohri et al. 2000) is used. This library is regarded as the most efficient one, but it is available only in binary form under a non-commercial license. The HTK toolkit v3.2.1 [5] is used for training the HMMs that represent Czech triphones. The training data come from the Czech speech database SPEECON (about 1000 speakers). This database is for student purposes only and it is available at the Department of Circuit Theory, CTU FEE. For the data processing (needed for both HTK training and any recognition) a program called hdp - HTK data preparation toolkit - was written.

Type of recogniser   Without backward transitions in silence models   With backward transitions in silence models
"ano"-"ne"           80.00%                                           81.50%
0-9                  90.00%                                           90.50%

Tab. 1. Experimental results on simple recognizers.

Tab. 1 shows the first experimental results on simple recognizers. The backward transition in the silence model could not be regarded as the most important factor. The algorithm seems to work properly; however, these good results are achieved by adding silence so that the elementary paths of the given utterance possibilities have the same lengths.

Fig. 9. The expanded feed-forward network. It represents a recognizer of the two words "ano" and "ne" for 173 observations on the input. Created by the program rct.
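To make steps 3 and 4 above (the part carried out by rct) more concrete, the following sketch computes a probability matrix P under the simplifying assumption of a single diagonal-covariance Gaussian per HMM state (instead of a full mixture) and then time-expands a set of arcs over the observation frames. The state set, arcs and numbers at the end are made up purely for illustration; this is not the rct implementation.

```python
import math

def neg_log_b(mean, var, o):
    """-log b_j(o_t) for a single diagonal-covariance Gaussian (one element of P)."""
    return 0.5 * sum(math.log(2.0 * math.pi * v) + (x - m) ** 2 / v
                     for m, v, x in zip(mean, var, o))

def probability_matrix(states, observations):
    """P[j][t] = -log b_{M_j}(o_t); rows are HMM states, columns are frames."""
    return [[neg_log_b(mean, var, o) for o in observations]
            for (mean, var) in states]

def expand(arcs, P):
    """Time-expand arcs (src, j, olabel, weight, dst) into a feed-forward network.

    States of the expanded network are pairs (state, frame); the weight of an
    arc leaving frame t is its -log transition probability plus the acoustic
    score P[j][t], as sketched in Fig. 7 (right).
    """
    frames = len(P[0])
    return [((src, t), j, out, w + P[j][t], (dst, t + 1))
            for t in range(frames)
            for (src, j, out, w, dst) in arcs]

# Tiny usage example with made-up numbers: two one-dimensional states, three frames.
states = [([0.0], [1.0]), ([1.5], [0.5])]
obs = [[0.1], [0.3], [1.4]]
P = probability_matrix(states, obs)
arcs = [(0, 0, "-", 0.5, 0), (0, 0, "-", 0.9, 1), (1, 1, "x", 0.9, 1)]
print(len(expand(arcs, P)), "arcs in the expanded network")
```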
6. Conclusion

The following paragraphs describe the basic properties of speech recognizers created either with the HTK toolkit or with the FSM toolkit.

In HTK, creating the recognition network is an embedded process: it consists of creating the lattice (i.e. the grammar network [2]) using either the tool HParse (when the grammar is written by hand) or HBuild (when the grammar is taken from text). The main feed-forward recognition network is created after the observations o1, o2, ..., oT have been obtained. With the finite-state machine method the feed-forward network is likewise created immediately after the parametrisation of the input utterance; however, the minimized and determinized recognition cascade is used here for its creation. This means the feed-forward network is created more easily (the resulting recognition network has lower space and time complexity).

The network representation of context-dependent phonemes is extremely hard when the feed-forward recognition network is created by the HTK toolkit. The FSM method solves this problem before the recognition process begins. The context-dependency transducer is a relatively difficult automaton, but the creation of this automaton and the following composition of C with L ∘ G are done before the recognition process has begun, so time is saved there as well.

The distribution of weights in the HTK lattice can lead to a less effective search for the best path. For FSMs there exists an operation called weight pushing, which allows the weights in the L ∘ G part of the recognition cascade to be redistributed (sorted by their value) while the resulting automaton stays equivalent to the original [1,2,4].

Acknowledgements

The presented work was supported by GAČR 102/05/0278 "New Trends in Research and Application of Voice Technology", GAČR 102/03/H085 "Biological and Speech Signals Modeling", and research activity MSM 6840770014 "Research in the Area of the Prospective Information and Navigation Technologies".

References

[1] Pereira, F. C. N., Riley, M. Speech recognition by composition of weighted finite automata. MIT Press, Cambridge, Massachusetts, 1997.
[2] Mohri, M., Pereira, F., Riley, M. Weighted finite-state transducers in speech recognition. Computer Speech and Language, 16(1):69-88, 2002.
[3] Rabiner, L., Juang, B. H. Fundamentals of Speech Recognition. Englewood Cliffs, N.J., PTR Prentice Hall, 1993. 507 p.
[4] Roche, E., Schabes, Y. Finite-State Language Processing. MIT Press, 464 p., ISBN 0-262-18182-7, 1997.
[5] Young, S. The HTK Book (for HTK Version 3.2.1). Microsoft Corporation, Cambridge University Engineering Department, 3.2 edition, 2002.
[6] Štemberk, P. Speech recognition based on FSM and HTK toolkits. Proceedings of Digital Technologies 2004, EDIS - Žilina University publishers, Žilina, ISBN 80-8070-334-5.

About Author...

Pavel ŠTEMBERK was born in Mladá Boleslav. He has been a PhD student at FEE CTU since March 2003 in the Theoretical Fundamentals of Electrical Engineering programme; the topic of his thesis is "Implementation of speech recognisers into multimedia platforms".