Statistical Models for Automatic Speech Recognition
Lukáš Burget

Feature extraction
• Preprocessing of the speech signal to satisfy the needs of the following recognition process (dimensionality reduction, preserving only the "important" information, decorrelation).
• Popular features are MFCCs: modifications based on psycho-acoustic findings applied to short-time spectra.
• For convenience, we will use one-dimensional features in most of our examples (e.g. short-time energy).

Classifying a single speech frame
(figure: distributions of feature values for unvoiced and voiced frames)
Mathematically, we ask the following question:
  P(voiced | x) > P(unvoiced | x)
But the value we read from the probability distribution is p(x | class). According to Bayes' rule, the above can be rewritten as:
  \frac{p(x | voiced) P(voiced)}{p(x)} > \frac{p(x | unvoiced) P(unvoiced)}{p(x)}

Multi-class classification
(figure: distributions of feature values for silence, unvoiced and voiced frames)
The class with the highest probability of being correct is given by:
  \arg\max_\omega P(\omega | x) = \arg\max_\omega p(x | \omega) P(\omega)
But we do not know the true distributions, …

Estimation of parameters
… we only see some training examples (figure: training samples for silence, unvoiced and voiced frames).
Let's decide on some parametric model (e.g. a Gaussian distribution) and estimate its parameters from the data.

Maximum Likelihood Estimation
• In the next part, we will use ML estimation of model parameters:
  \hat{\Theta}_{ML}^{class} = \arg\max_{\Theta} \prod_{x_i \in class} p(x_i | \Theta)
• This allows us to estimate the parameters Θ of each class individually, given the data for that class. Therefore, for convenience, we can omit the class identities in the following equations.
• The models we are going to examine are:
  – Single Gaussian
  – Gaussian Mixture Model (GMM)
  – Hidden Markov Model (HMM)
• We want to solve three fundamental problems:
  – Evaluation of the model (computing the likelihood of features given the model)
  – Training the model (finding ML estimates of its parameters)
  – Finding the most likely values of the hidden variables

Gaussian distribution (1 dimension)
Evaluation:
  N(x; \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
ML estimates of parameters (Training):
  \mu = \frac{1}{T} \sum_{t=1}^{T} x(t)
  \sigma^2 = \frac{1}{T} \sum_{t=1}^{T} (x(t) - \mu)^2
No hidden variables.

Gaussian distribution (multiple dimensions)
  N(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^P |\Sigma|}} e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}

Gaussian Mixture Model
Evaluation:
  p(x | \Theta) = \sum_c P_c \, N(x; \mu_c, \sigma_c^2),   where \sum_c P_c = 1
• We can see the sum above just as a function defining the shape of the probability density function,
• or we can see it as a more complicated generative probabilistic model, from which features are generated as follows:
  – One of the Gaussian components is first randomly selected according to the prior probabilities Pc.
  – The feature vector is then generated from the selected Gaussian distribution.
• For the evaluation, however, we do not know which component generated the input vector (the identity of the component is a hidden variable). Therefore, we marginalize – sum over all the components, weighting them by their prior probabilities (a small evaluation sketch follows below).
• Why do we want to complicate our lives with this concept?
  – It allows us to apply the EM algorithm to GMM training.
  – We will need this concept for HMMs.
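To make the GMM evaluation step concrete, here is a minimal sketch in Python/NumPy (assuming one-dimensional features; the function names are illustrative and not taken from any particular toolkit). It computes the log-likelihood of a single frame under the weighted sum of Gaussians, i.e. the marginalization over the hidden component identity:

```python
import numpy as np

def log_gaussian(x, mu, var):
    # log N(x; mu, sigma^2) for a one-dimensional feature x
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def gmm_log_likelihood(x, weights, means, variances):
    # log p(x | Theta) = log sum_c P_c N(x; mu_c, sigma_c^2),
    # i.e. the sum over the hidden component identity
    log_components = np.log(weights) + log_gaussian(x, means, variances)
    return np.logaddexp.reduce(log_components)

# Example: a 2-component GMM evaluated on a single frame (illustrative numbers)
weights = np.array([0.4, 0.6])
means = np.array([-1.0, 2.0])
variances = np.array([1.0, 0.5])
print(gmm_log_likelihood(0.3, weights, means, variances))
```

Working in the log domain (logaddexp) avoids numerical underflow when the component likelihoods are small, which is the usual practice when evaluating such models over many frames.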
Training GMM – Viterbi training
• An intuitive, approximate iterative algorithm for training GMM parameters:
• Using the current model parameters, let the Gaussians classify the data as if the Gaussians were different classes (even though both the data and all the components correspond to the one class modeled by the GMM).
• Re-estimate the parameters of each Gaussian using the data associated with it in the previous step.
• Repeat the previous two steps until the algorithm converges.

Training GMM – EM algorithm
• Expectation Maximization is a very general tool applicable in many cases where we deal with unobserved (hidden) data. Here, we only see the result of its application to the problem of re-estimating GMM parameters.
• It is guaranteed to increase the likelihood of the training data in every iteration; however, it is not guaranteed to find the global optimum.
• The algorithm is very similar to the Viterbi training presented above. Only, instead of hard decisions, it uses the "soft" posterior probabilities of the Gaussians (given the old model) as weights, and weighted averages are used to compute the new mean and variance estimates:
  \hat{\mu}_c^{(new)} = \frac{\sum_{t=1}^{T} \gamma_c(t)\, x(t)}{\sum_{t=1}^{T} \gamma_c(t)}
  \hat{\sigma}_c^{2\,(new)} = \frac{\sum_{t=1}^{T} \gamma_c(t)\, (x(t) - \hat{\mu}_c^{(new)})^2}{\sum_{t=1}^{T} \gamma_c(t)}
  \gamma_c(t) = \frac{P_c \, N(x(t); \hat{\mu}_c^{(old)}, \hat{\sigma}_c^{2\,(old)})}{\sum_c P_c \, N(x(t); \hat{\mu}_c^{(old)}, \hat{\sigma}_c^{2\,(old)})}

Classifying a stationary sequence
(figure: a sequence of frames scored against the silence, unvoiced and voiced models)
Frame-independence assumption:
  P(X | class) = \prod_{x_i \in X} p(x_i | class)

Modeling more general sequences: Hidden Markov Models
(figure: a three-state left-to-right HMM with transition probabilities a11, a12, a22, a23, a33, a34 and output distributions b1(x), b2(x), b3(x))
• Generative model: for each frame, the model moves from one state to another according to a transition probability aij and generates a feature vector from the probability distribution bj(·) associated with the state that was entered.
• To evaluate such a model, we do not see which path through the states was taken.
• Let's start with evaluating the HMM for a particular state sequence.

Evaluating HMM for a particular state sequence
The joint likelihood of the observed sequence X and the state sequence S,
  P(X, S | \Theta) = b_1(x_1)\, a_{11}\, b_1(x_2)\, a_{12}\, b_2(x_3)\, a_{23}\, b_3(x_4)\, a_{33}\, b_3(x_5)\, a_{34},
can be decomposed as follows:
  P(X, S | \Theta) = P(X | S, \Theta)\, P(S | \Theta)
where
  P(S | \Theta) = a_{11}\, a_{12}\, a_{23}\, a_{33}\, a_{34}
is the prior probability of the hidden variable – the state sequence S. For the GMM, the corresponding term was Pc.
  P(X | S, \Theta) = b_1(x_1)\, b_1(x_2)\, b_2(x_3)\, b_3(x_4)\, b_3(x_5)
is the likelihood of the observed sequence X given the state sequence S. For the GMM, the corresponding term was N(x; \mu_c, \sigma_c^2).

Evaluating HMM (for any state sequence)
Since we do not know the underlying state sequence, we must marginalize – compute and sum the likelihoods over all possible paths:
  P(X | \Theta) = \sum_S P(X, S | \Theta)

Finding the best (Viterbi) path
  P^*(X | \Theta) = \max_S P(X, S | \Theta)

Training HMMs – Viterbi training
• Similar to the approximate training we have already seen for GMMs:
  1. For each training utterance, find the Viterbi path through the HMM, which associates feature frames with states (see the sketch below).
  2. Re-estimate the state distributions using the associated feature frames.
  3. Repeat steps 1 and 2 until the algorithm converges.
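Step 1 above needs the most likely path through the states. Here is a minimal log-domain sketch of the Viterbi recursion (assuming the per-frame state log-likelihoods log b_s(x_t) have already been computed; the array names are illustrative):

```python
import numpy as np

def viterbi(log_b, log_a, log_pi):
    """Find the most likely state sequence for one utterance.
    log_b:  (T, S) per-frame state log-likelihoods log b_s(x_t)
    log_a:  (S, S) transition log-probabilities log a_ij
    log_pi: (S,)   initial-state log-probabilities
    """
    T, S = log_b.shape
    delta = np.full((T, S), -np.inf)   # best log-score of any path ending in state s at time t
    psi = np.zeros((T, S), dtype=int)  # back-pointers to the best predecessor state
    delta[0] = log_pi + log_b[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_a        # scores[i, j]: come from state i into state j
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores[psi[t], np.arange(S)] + log_b[t]
    # Backtrack from the best final state
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(np.max(delta[-1]))
```

Replacing the max/argmax over the previous states with a log-sum (e.g. np.logaddexp.reduce) turns this recursion into the forward algorithm, which computes the full marginal P(X | Θ) = Σ_S P(X, S | Θ).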
Training HMMs using EM
  \hat{\mu}_s^{(new)} = \frac{\sum_{t=1}^{T} \gamma_s(t)\, x(t)}{\sum_{t=1}^{T} \gamma_s(t)}
  \hat{\sigma}_s^{2\,(new)} = \frac{\sum_{t=1}^{T} \gamma_s(t)\, (x(t) - \hat{\mu}_s^{(new)})^2}{\sum_{t=1}^{T} \gamma_s(t)}
  \gamma_s(t) = \frac{\alpha_s(t)\, \beta_s(t)}{P(X | \Theta^{(old)})}
(figure: state–time trellis illustrating the forward probability \alpha_s(t) and the backward probability \beta_s(t))

Isolated word recognition
(figure: one HMM for the word YES and one for the word NO)
  p(X | YES)\, P(YES) > p(X | NO)\, P(NO)
(a minimal scoring sketch for this decision rule appears at the end of this section)

Connected word recognition
(figure: the YES and NO models connected in a loop, with silence (sil) models at the beginning and end)

Phoneme based models
(figure: the word YES modeled as a concatenation of the phoneme models y, eh, s)

Using Language model – unigram
(figure: a loop over the word models one (w ah n), two (t uw) and three (th r iy), entered with unigram probabilities P(one), P(two), P(three) and surrounded by sil models)

Using Language model – bigram
(figure: the word models one, two and three connected by bigram probabilities P(W2|W1), with sil models at the beginning and end)

Other basic ASR topics not covered by this presentation
• Context-dependent models
• Training phoneme-based models
• Feature extraction
  – Delta parameters
  – De-correlation of features
• Full-covariance vs. diagonal-covariance modeling
• Adaptation to the speaker or acoustic condition
• Language modeling
  – LM smoothing (back-off)
• Discriminative training (MMI or MPE)
• and so on
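As a closing example, here is a minimal sketch of the isolated-word decision rule p(X | w) P(w) shown above. The word models, priors and the score_hmm helper are illustrative placeholders; score_hmm stands for any of the evaluation methods discussed earlier, e.g. the full sum over paths or the Viterbi approximation:

```python
import numpy as np

def recognize(features, word_models, word_priors, score_hmm):
    # arg max_w  log p(X | w) + log P(w)
    best_word, best_score = None, -np.inf
    for word, model in word_models.items():
        score = score_hmm(features, model) + np.log(word_priors[word])
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# e.g. recognize(X, {"YES": yes_hmm, "NO": no_hmm},
#                {"YES": 0.5, "NO": 0.5}, score_hmm=viterbi_score)
```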