Deep Learning Based on Manhattan Update Rule

Yasser Hifny
University of Helwan, Egypt
[email protected]

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

Abstract

Acoustic models based on Deep Neural Networks (DNNs) lead to significant improvements in recognition accuracy. In these methods, Hidden Markov Model (HMM) state scores are computed using flexible discriminant DNNs. Training DNNs is computationally expensive, and efficient training of DNNs is an active area of research. Similar to HMMs, Deep Conditional Random Fields (DCRFs) use DNNs to compute state scores. In this paper, we present a method to estimate DCRFs using the Manhattan (MH) update rule. The Manhattan update rule does not involve the gradient magnitude. The method is general and can be used to estimate any model for which the gradient can be computed.

1. Introduction

DNNs with many hidden layers have been shown to outperform Gaussian mixture models in several tasks (Mohamed et al., 2012; Seide et al., 2011; Dahl et al., 2012; Hinton et al., 2012). Training DNNs is computationally expensive, and efficient training of DNNs is an active area of research (Kingsbury et al., 2012; Vinyals & Povey, 2012). Training CRFs on top of a hidden layer constructed from the scores of a large number of sigmoid functions was introduced in (Prabhavalkar & Fosler-Lussier, 2010). DNNs can be used to compute the state scores within the CRF framework, which leads to a deep version of CRFs (DCRFs) (Hifny, 2013). Other deep alternatives were developed in (Yu & Deng, 2010; Mohamed et al., 2010; Do & Artieres, 2010; Fujii et al., 2012).

In this paper, we present a method to estimate DCRFs using the Manhattan (MH) update rule. The Manhattan update rule does not involve the gradient magnitude. It was shown in (Hifny, 2006) that this algorithm outperforms other gradient-based algorithms for CRF parameter estimation.

In Section 2, a mathematical formulation of DCRFs is described. The optimization problem of DCRFs is addressed in Section 3. Section 4 gives experimental results on a phone recognition task. Finally, a summary of the presented work is given in the conclusions.

2. Deep Conditional Random Fields

Linear chain CRFs can be thought of as the undirected graphical twins of HMMs, regardless of how they are trained (generatively or discriminatively) (Lafferty et al., 2001). DCRF acoustic models are a particular implementation of linear chain CRFs in which the state scores are computed by a DNN with many hidden layers.

The feed-forward phase updates the output value of each neuron. Starting from the first hidden layer, each neuron output is computed by taking a weighted sum of its inputs and applying the sigmoid function:[1]

    o^h_{tj} = \mathrm{sigm}\Big( \sum_{i=1}^{n} \lambda_{ij}\, o^{h-1}_{ti} \Big)    (1)

where o^h_t is an output of a hidden layer, n is the number of inputs, h is an index over the hidden layers, and the sigmoid function is computed as

    \mathrm{sigm}(x) = \frac{1}{1 + e^{-x}}    (2)

The output of a hidden layer is passed to the next layer until the output layer is computed as

    o^N_{tj} = \sum_{i=1}^{n} \lambda_{ij}\, o^{N-1}_{ti}    (3)

where N is the index of the output layer. Hence, the activation of the hidden layers is nonlinear, based on the sigmoid function, while the output layer activation is linear.

[1] The bias term is not implemented in the neuron activation.
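For concreteness, the feed-forward pass of Equations (1)-(3) can be sketched as follows. This is a minimal illustration rather than the paper's implementation; the NumPy style, the random weights, and the layer sizes are assumptions made only for the example (351 inputs and two 512-unit hidden layers follow Section 4, and the 144 outputs assume 48 phones with three states each).

```python
import numpy as np

def sigm(x):
    # Sigmoid activation, Equation (2).
    return 1.0 / (1.0 + np.exp(-x))

def dnn_state_scores(o_t, weights):
    """Feed-forward pass of Equations (1)-(3).

    o_t     : feature vector of frame t
    weights : list of weight matrices [W^1, ..., W^N]; no bias terms,
              in line with footnote 1.
    Returns the linear output-layer activations o^N_t, which serve as
    the CRF state scores b_s(o_t).
    """
    a = o_t
    for W in weights[:-1]:
        a = sigm(W @ a)        # hidden layers, Equation (1)
    return weights[-1] @ a     # linear output layer, Equation (3)

# Illustrative sizes only: 351-dim input, two 512-unit hidden layers,
# 144 output states (assumed: 48 phones x 3 states).
rng = np.random.default_rng(0)
ws = [rng.normal(scale=0.01, size=(512, 351)),
      rng.normal(scale=0.01, size=(512, 512)),
      rng.normal(scale=0.01, size=(144, 512))]
scores = dnn_state_scores(rng.normal(size=351), ws)   # shape (144,)
```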
Figure 1. DCRF model for phone representation, where the state scores are computed from a DNN.

A graphical representation of the DCRF acoustic model is shown in Figure 1. The conditional distribution defining DCRFs is given by

    P_\Lambda(S|O) = \frac{1}{Z_\Lambda(O)} \prod_{t=1}^{T} \exp\big( \lambda_{s_t s_{t-1}}\, a(s_t, s_{t-1}) + b_{s_t}(o_t) \big)    (4)

where

• P_\Lambda(S|O) obeys the Markovian property: P_\Lambda(s_t|s_{t-1}, s_{t-2}, \ldots, s_1, O) = P_\Lambda(s_t|s_{t-1}, O).

• The \lambda_{s_t s_{t-1}} are associated with the characterizing function a(s_t, s_{t-1}). a(s_t, s_{t-1}) is a binary function and can be used to define the DCRF topology.

• b_{s_t}(o_t) = o^N_{t s_t} is computed from Equation (3). Hence, b_{s_t}(o_t) connects the DNN output to the CRF input.

• Z_\Lambda(O) (Zustandssumme) is a normalization coefficient referred to as the partition function.

HMMs and DCRFs (in general, linear chain CRFs) share the first-order Markov assumption, which simplifies the training and decoding algorithms. However, DCRFs do not assume observation independence and causality, as the joint event in this case is factorized as a simple product of exponential functions. Therefore, the observations and the characterizing functions can be statistically dependent or correlated and can depend on past and future acoustic context.

The partition function, Z_\Lambda(O), is given by

    Z_\Lambda(O) = \sum_{S} \prod_{t=1}^{T} \exp\big( \lambda_{s_t s_{t-1}}\, a(s_t, s_{t-1}) + b_{s_t}(o_t) \big),    (5)

and it is similar to the total probability p(O|M) in HMMs, which can be calculated using the forward algorithm (Lafferty et al., 2001).

3. DCRF Optimization

For R training observations {O_1, O_2, \ldots, O_r, \ldots, O_R} with corresponding transcriptions {W_r}, DCRFs are trained using the conditional maximum likelihood (CML) criterion to maximize the posterior probability of the correct word sequence given the acoustic observations:

    F_{\mathrm{CML}}(\Lambda) = \sum_{r=1}^{R} \log P_\Lambda(M_{W_r}|O_r)
                              = \sum_{r=1}^{R} \log \frac{P(W_r) \sum_{S|W_r} \exp \sum_{t=1}^{T} \Psi(O, S, c, \Lambda)}{\sum_{\hat{W}} P(\hat{W}) \sum_{S|\hat{W}} \exp \sum_{t=1}^{T} \Psi(O, S, c, \Lambda)}
                              \approx \sum_{r=1}^{R} \log Z_\Lambda(O_r|M^{num}) - \log Z_\Lambda(O_r|M^{den}),    (6)

where

    \Psi(O, S, c, \Lambda) = \lambda_{s_t s_{t-1}}\, a(s_t, s_{t-1}) + b_{s_t}(o_t)    (7)

The optimal parameters, \Lambda^*, are estimated by maximizing the CML criterion, which implies minimizing the cross entropy between the correct transcription model and the hypothesized recognition model. In other words, the process maximizes the partition function of the correct models[2] (the numerator term), Z_\Lambda(O_r|M^{num}), and simultaneously minimizes the partition function of the recognition model (the denominator term), Z_\Lambda(O_r|M^{den}). The optimal parameters are obtained when the gradient of the CML criterion is zero.

[2] Since a summation over potential functions is commonly called the partition function in undirected graphical modeling, we coin the notation Z_\Lambda(O_r|M^{num}) for the summation over all possible state sequences of the correct models.
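As a rough sketch of how this criterion can be evaluated, each partition function in Equation (6) can be computed with a log-domain forward recursion over the sum in Equation (5). The code below is illustrative only: the uniform initial state and the way the numerator and denominator models are materialized per utterance are assumptions, not details taken from the paper.

```python
import numpy as np
from scipy.special import logsumexp

def log_partition(b, log_trans):
    """Forward recursion for log Z_Lambda(O|M), Equation (5).

    b         : T x J array of DNN state scores b_s(o_t)
    log_trans : J x J array with entries lambda_{ij} a(i, j); use -inf
                where the topology forbids the transition i -> j.
    A uniform (zero-score) initial state is assumed for illustration.
    """
    log_alpha = b[0].copy()
    for t in range(1, len(b)):
        log_alpha = logsumexp(log_alpha[:, None] + log_trans, axis=0) + b[t]
    return logsumexp(log_alpha)

def cml_objective(utterances):
    """Approximate CML criterion of Equation (6).

    utterances is assumed to yield tuples (b_num, trans_num, b_den, trans_den)
    holding the state scores and log-transition matrices of the correct
    (numerator) and recognition (denominator) models for one utterance.
    """
    return sum(log_partition(b_n, t_n) - log_partition(b_d, t_d)
               for b_n, t_n, b_d, t_d in utterances)
```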
3.1. Numerical Optimization for DCRFs

Newton's method can be used to estimate DCRFs based on a local quadratic approximation of the CML objective function. The method expands the nonlinear CML objective function F_{\mathrm{CML}}(\Lambda + \delta) in a Taylor series around the current model point \Lambda in parameter space:

    F_{\mathrm{CML}}(\Lambda + \delta) \approx F_{\mathrm{CML}}(\Lambda) + \delta^T g(\Lambda) + \frac{1}{2}\, \delta^T H(\Lambda)\, \delta + \ldots    (8)

where g(\Lambda) is the local gradient vector, with components

    g_i(\Lambda) = \left.\frac{\partial F_{\mathrm{CML}}(\Lambda)}{\partial \lambda_i}\right|_{\Lambda}    (9)

and H(\Lambda) is the local Hessian matrix, defined by

    H_{ij}(\Lambda) \equiv \left.\frac{\partial^2 F_{\mathrm{CML}}(\Lambda)}{\partial \lambda_i\, \partial \lambda_j}\right|_{\Lambda}    (10)

Newton's method update rule is given by

    \lambda^{(\tau)} = \lambda^{(\tau-1)} - \eta^{(\tau)} H^{-1}(\Lambda)\, g(\Lambda)    (11)

Since the CML criterion is not a quadratic function, taking the full Newton step H^{-1}(\Lambda) g(\Lambda) may overshoot the maximum. Hence, \eta^{(\tau)} \neq 1 leads to a damped Newton step. A line search algorithm is used to calculate \eta^{(\tau)}: it evaluates the objective function starting from the current model along the search direction and chooses an \eta^{(\tau)} that increases the CML objective function.

Computing, inverting, and storing the Hessian matrix makes Newton's method useful only for small-scale problems. Quasi-Newton or variable metric methods can be used when it is impractical to evaluate the Hessian matrix. Instead of obtaining an estimate of the Hessian matrix at a single point, these methods gradually build up an approximate Hessian matrix using gradient information from some or all of the previous iterates visited by the algorithm. Limited-memory quasi-Newton methods such as L-BFGS are particular realizations of quasi-Newton methods that cut down the storage for large problems (Nocedal & Wright, 1999).

The truncated Newton method, known as the Hessian-free approach (Nocedal & Wright, 1999; Martens, 2010; Kingsbury et al., 2012), is a second-order method for large-scale problems. It finds the search direction using an iterative solver, typically based on conjugate gradient, although other alternatives are possible. In this method, Hessian-vector products are computed without explicitly forming the Hessian. Hessian-free methods approximately invert the Hessian, while quasi-Newton methods invert an approximate Hessian.

By ignoring the second-order derivative, a first-order approximation of the CML criterion leads to gradient ascent methods, whose update is given by

    \lambda^{(\tau)} = \lambda^{(\tau-1)} + \eta\, g(\Lambda)    (12)

The step size \eta must be small enough to ensure a stable increase of the CML objective function. It can be shown that the algorithm is convergent provided that \eta satisfies the condition 0 < \eta < 2/\lambda_{max}, where \lambda_{max} is the largest eigenvalue of the Hessian matrix H(\Lambda^*) evaluated at the global maximum of the CML objective function (Haykin, 1998). In practice, second-order statistics are not accumulated, so \lambda_{max} is not known and \eta is chosen in an ad hoc fashion by trial and error.

The training speed of gradient ascent in batch mode is usually slow. The training process can be accelerated using an online variant known as stochastic gradient descent (SGD).[3] This algorithm updates the learning system on the basis of the objective function measured for a single utterance or batch.

[3] Since the CML objective function is maximized in this work, stochastic gradient ascent is used to train the DCRF models.
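The two first-order schemes above can be written down in a few lines. The sketch below is illustrative; grad_fn is a placeholder for the CML gradient computation described in Section 3.3, not a function defined in the paper.

```python
def batch_gradient_ascent(lam, grad_fn, data, eta, n_iters):
    # Equation (12): one update per pass over the full training set.
    for _ in range(n_iters):
        lam = lam + eta * grad_fn(lam, data)
    return lam

def sgd_ascent(lam, grad_fn, batches, eta, n_epochs):
    # Stochastic gradient ascent (footnote 3): one update per batch,
    # using the gradient accumulated on that batch only.
    for _ in range(n_epochs):
        for batch in batches:
            lam = lam + eta * grad_fn(lam, batch)
    return lam
```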
3.2. Manhattan Update

Many authors have described variants of the gradient ascent algorithm in which increased rates of convergence are obtained through learning rate adaptation (Jacobs, 1988; Sutton, 1992; Murata et al., 1997). A comparison between different algorithms is given in (Schiffmann et al., 1994). The Resilient Propagation (RProp) algorithm (Riedmiller & Braun, 1993) uses a Manhattan update rule to provide faster convergence. A Manhattan update rule does not involve the gradient magnitude.

Our Manhattan (MH) update rule implementation is similar to RProp, where the updates are given by

    \eta_i^{(\tau)} = \begin{cases} \min(\eta_i^{(\tau-1)}\phi,\ \eta_{max}) & \text{if } g_i^{(\tau)}(\Lambda)\, g_i^{(\tau-1)}(\Lambda) > 0 \\ \max(\eta_i^{(\tau-1)}\kappa,\ \eta_{min}) & \text{if } g_i^{(\tau)}(\Lambda)\, g_i^{(\tau-1)}(\Lambda) < 0 \end{cases}    (13)

and

    \lambda_i^{(\tau)} = \begin{cases} \lambda_i^{(\tau-1)} - \eta_i^{(\tau)} & \text{if } g_i^{(\tau)}(\Lambda) < 0 \\ \lambda_i^{(\tau-1)} + \eta_i^{(\tau)} & \text{if } g_i^{(\tau)}(\Lambda) > 0 \end{cases}    (14)

The learning parameters were set as follows: \kappa = 0.5, \phi = 1.2, and \eta_i^{(0)} = \eta. A sketch of this update is given at the end of this section.

3.3. DCRFs Gradient Computation

For an exponential family activation function based on first-order sufficient statistics, the gradient of the CML objective function with respect to the output layer parameters is given by

    \nabla F_{\mathrm{CML}}(O) = C^{num}_{ji}(O) - C^{den}_{ji}(O)    (15)

where the accumulators of the sufficient statistics, C_{ji}(O), for the j-th state and i-th constraint are calculated as follows:

    C^{num}_{ji}(O) = \sum_{r=1}^{R} \sum_{t=1}^{T_r} \gamma^r_j(t|M^{num})\, o^N_{rti}    (16)

    C^{den}_{ji}(O) = \sum_{r=1}^{R} \sum_{t=1}^{T_r} \gamma^r_j(t|M^{den})\, o^N_{rti}    (17)

Here r is the utterance index, and the frame-state alignment probability \gamma_j, denoting the probability of being in state j at time t, can be written in terms of the forward score \alpha_j(t) and the backward score \beta_j(t), as in HMMs:

    \gamma_j(t|M) = P(s_t = j|O; M) = \frac{\alpha_j(t|M)\, \beta_j(t|M)}{Z_\Lambda(O|M)}    (18)

To avoid the necessity of building lattices, \gamma_j(t|M^{den}) is approximated with state estimates as follows (Hifny et al., 2005):

    \gamma_j(t|M^{den}) = \frac{\exp o^N_{tj}}{\sum_s \exp o^N_{ts}}    (19)

The delta of output-layer neuron j is given by

    \delta^N_{tj} = \gamma_j(t|M^{num}) - \gamma_j(t|M^{den})    (20)

and the delta of the hidden layers by

    \delta^h_{tj} = o^h_{tj}(1 - o^h_{tj}) \sum_{k \in \text{outputs}} \lambda^{h+1}_{kj}\, \delta^{h+1}_{tk}    (21)

The gradient for the hidden-layer parameters is given by

    \frac{\partial F_{\mathrm{CML}}(\Lambda)}{\partial \lambda^h_{ij}} = \sum_{r=1}^{R} \sum_{t=1}^{T_r} \delta^h_{rtj}\, o^{h-1}_{rti}    (22)

Based on Equation (22) and Equation (15), a gradient-based optimization can be used to estimate the parameters (Nocedal & Wright, 1999). The transition parameters are given by

    \lambda_{s_t s_{t-1}} = \log a_{s_t s_{t-1}},    (23)

where a_{s_t s_{t-1}} is the transition probability in HMM modeling and is estimated using the maximum likelihood (MLE) criterion.
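A compact sketch of Equations (13) and (14) is given below. It follows the paper's \phi = 1.2 and \kappa = 0.5; the \eta_{min}/\eta_{max} bounds, the vectorized NumPy form, and the choice to leave the step unchanged when the gradient product is exactly zero are assumptions made for illustration.

```python
import numpy as np

def manhattan_step(lam, grad, prev_grad, step,
                   phi=1.2, kappa=0.5, eta_min=1e-6, eta_max=1.0):
    """One Manhattan (MH) update over all parameters, Equations (13)-(14).

    lam, grad, prev_grad, step are arrays of the same shape (one entry per
    parameter). Returns the updated parameters and per-parameter step sizes.
    """
    prod = grad * prev_grad
    # Equation (13): grow the step while the gradient sign is stable,
    # shrink it when the sign flips; keep it when the product is zero.
    step = np.where(prod > 0, np.minimum(step * phi, eta_max), step)
    step = np.where(prod < 0, np.maximum(step * kappa, eta_min), step)
    # Equation (14): move by +/- step according to the gradient sign only;
    # the gradient magnitude is never used.
    return lam + np.sign(grad) * step, step
```

In a training loop, prev_grad is simply the gradient from the previous update, and step is initialized to the common value \eta_i^{(0)} = \eta.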
4. Experiments

We carried out phone recognition experiments on the TIMIT corpus (Garofolo et al., 1990), using the 462-speaker training set and testing on the 24-speaker core test set (the SA1 and SA2 utterances were not used). The speech was analyzed using a 25 ms Hamming window with a 10 ms fixed frame rate. In all the experiments, we represented the speech using 12th-order mel frequency cepstral coefficients (MFCCs) and energy, along with their first and second temporal derivatives, resulting in a 39-element feature vector. The training and test data features were pre-processed to have zero mean and unit variance. Acoustic context information is integrated using a window of 9 frames (4 left + current frame + 4 right) to construct the final frame vector with 351 dimensions (a construction sketch is given at the end of this section).

Following Lee and Hon (Lee & Hon, 1989), the original 61 phone classes in TIMIT were mapped to a set of 48 labels, which were used for training. This set of 48 phone classes was mapped down to a set of 39 classes (Lee & Hon, 1989) after decoding, and phone recognition results are reported on these classes in terms of the phone error rate (PER), which is analogous to the word error rate.

The baseline HMMs have three emitting states, and the emission probabilities were modeled with mixtures of Gaussian densities with diagonal covariance matrices. The generative HMMs were trained with the maximum likelihood criterion using the conventional EM algorithm (Young et al., 2001).

A DNN with 2 hidden layers was chosen, with 512 neurons per layer. Each phone was represented using a three-state left-to-right DCRF. All DNN parameters were initialized to random values, and the transition parameters were initialized from trained HMM models, forcing left-to-right DCRFs (the transition parameters are held fixed after initialization). The training procedure accumulated the M^{num} sufficient statistics via a Viterbi pass (forced alignment) of the reference transcription, using HMMs trained with the maximum likelihood criterion. Several iterations were used to train the DCRFs, and the language model scaling factor was set to 6.0 during decoding. All our experiments used a bigram language model over phones, estimated from the training set.

The training process is divided into two phases: online and batch. The online training phase computes an initial model that is used to start the batch training phase. The reason to split the training into two phases is to use a grid computer for batch training; the grid has 28 cores.

In Table 1, the DCRF recognition performance for the online training phase is reported in terms of PER on the TIMIT core test set. Since the MH update is sensitive to changes in the gradient sign, we train the models using the same batch ten times before moving to the next batch, which may reduce the effect of random gradients. Hence, a complete iteration of the online Manhattan update implies ten loops over the training data. For SGD, we train the models for ten iterations and update after each batch. The training data are divided into 739 batches; consequently, the models were updated 7390 times in the online phase. The results show that online learning is sensitive to the initial learning rate. The phone error rate is reduced to 42.2% for online MH training.

Table 1. DCRF decoding results on the TIMIT recognition task in terms of PER (online phase).

    Algorithm    Learning rate    #Updates    PER
    SGD          0.0001           7390        51.3%
    SGD          0.001            7390        43.4%
    SGD          0.01             7390        58.1%
    Manhattan    0.0001           7390        45.6%
    Manhattan    0.001            7390        42.2%
    Manhattan    0.01             7390        65.8%

The final model of the online training phase is used to initialize the batch training phase, in which a batch Manhattan update is used to update the models. The initial learning rate was set to 0.01. As shown in Table 2, the recognition accuracy improves as the number of iterations grows. The batch training phase was executed over a computer grid of 28 cores.

Table 2. DCRF decoding results on the TIMIT recognition task in terms of PER (batch phase).

    Training method      #Itr    PER
    Online Manhattan     1       42.2%
    Batch Manhattan      50      36.9%
    Batch Manhattan      100     35.5%
    Batch Manhattan      150     34.7%
    Batch Manhattan      200     34.4%
    Batch Manhattan      250     34.2%
    Batch Manhattan      300     33.7%
    Batch Manhattan      350     33.6%
    Batch Manhattan      400     33.4%
    Batch Manhattan      450     33.2%
    Batch Manhattan      500     33.2%
    Batch Manhattan      550     33.1%
    Batch Manhattan      600     33.0%

The decoding results of DCRFs are sensitive to the value of the language model scaling factor. The PER for different scaling factors is shown in Table 3; the DCRF model from iteration 600 was selected to run the decoder.

Table 3. DCRF decoding results based on different language model scaling factors.

    Scaling factor    PER
    1.0               30.8%
    2.0               30.2%
    3.0               30.6%
    4.0               31.4%
    5.0               32.2%
    6.0               33.0%
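As referenced above, the 9-frame context window can be constructed as in the sketch below. The edge handling (repeating the first and last frames) is an assumption; the paper does not state how utterance boundaries are padded.

```python
import numpy as np

def stack_context(frames, left=4, right=4):
    """Stack 4 left + current + 4 right frames (Section 4), turning
    T x 39 MFCC features into T x 351 context vectors."""
    T = len(frames)
    padded = np.concatenate([np.repeat(frames[:1], left, axis=0),
                             frames,
                             np.repeat(frames[-1:], right, axis=0)], axis=0)
    return np.stack([padded[t:t + left + 1 + right].reshape(-1)
                     for t in range(T)])

# Example: stack_context(np.zeros((100, 39))).shape == (100, 351)
```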
5. Conclusions

In this paper, we presented a method to estimate deep conditional random fields (DCRFs) using the Manhattan (MH) update rule. In this approach, DCRF state scores are computed by a DNN with many hidden layers. In the estimation process, a backpropagation algorithm is used to compute the gradient for each parameter in the hidden layers. Preliminary results on the TIMIT phone recognition task show that the MH update can lead to good convergence properties. Future work will focus on running the online methods over a grid computer and on improving the training speed of batch MH using different sets of initial learning rates.

References

Dahl, George, Yu, Dong, Deng, Li, and Acero, Alex. Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, Special Issue on Deep Learning for Speech and Language Processing, 2012.

Do, Trinh-Minh-Tri and Artieres, Thierry. Neural conditional random fields. In Proc. of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

Fujii, Yasuhisa, Yamamoto, Kazumasa, and Nakagawa, Seiichi. Deep-hidden conditional neural fields for continuous phoneme speech recognition. In Proc. IWSML, 2012.

Garofolo, John S., Lamel, Lori F., Fisher, William M., Fiscus, Jonathan G., Pallett, David S., Dahlgren, Nancy L., and Zue, Victor. TIMIT acoustic-phonetic continuous speech corpus, 1990. URL http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1.

Haykin, Simon. Neural Networks: A Comprehensive Foundation. Prentice Hall, 2nd edition, 1998.

Hifny, Yasser. Conditional Random Fields for Continuous Speech Recognition. PhD thesis, University of Sheffield, 2006.

Hifny, Yasser. Acoustic modeling based on deep conditional random fields. Deep Learning for Audio, Speech and Language Processing, ICML, 2013.

Hifny, Yasser, Renals, Steve, and Lawrence, Neil. A hybrid MaxEnt/HMM based ASR system. In Proc. INTERSPEECH, pp. 3017-3020, Lisbon, Portugal, 2005.

Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George, Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara, and Kingsbury, Brian. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 2012.

Jacobs, R. A. Increased rates of convergence through learning rate adaptation. Neural Networks, 1:295-307, 1988.

Kingsbury, Brian, Sainath, Tara N., and Soltau, Hagen. Scalable minimum Bayes risk training of deep neural network acoustic models using distributed Hessian-free optimization. In Proc. INTERSPEECH, 2012.

Lafferty, John, McCallum, Andrew, and Pereira, Fernando. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML, pp. 282-289, 2001.

Lee, K.-F. and Hon, H.-W. Speaker-independent phone recognition using hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(11):1641-1648, Nov 1989.

Martens, J. Deep learning via Hessian-free optimization. In Proc. ICML, 2010.
Mohamed, Abdel-rahman, Dahl, George, and Hinton, Geoffrey. Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech and Language Processing, 20:14-22, 2012.

Mohamed, Abdel-rahman, Yu, Dong, and Deng, Li. Investigation of full-sequence training of deep belief networks for speech recognition. In Proc. INTERSPEECH, 2010.

Murata, Noboru, Müller, Klaus-Robert, Ziehe, Andreas, and Amari, Shun-ichi. Adaptive on-line learning in changing environments. In Proc. NIPS, volume 9, pp. 599, 1997. URL citeseer.ist.psu.edu/murata97adaptive.html.

Nocedal, Jorge and Wright, Stephen J. Numerical Optimization. Springer, 1999.

Prabhavalkar, R. and Fosler-Lussier, E. Backpropagation training for multilayer conditional random field based phone recognition. In Proc. IEEE ICASSP, volume 1, pp. 5534-5537, France, March 2010.

Riedmiller, M. and Braun, H. A direct method for faster backpropagation learning: The RPROP algorithm. In Proc. IEEE International Conference on Neural Networks, pp. 586-591, 1993.

Schiffmann, W., Joost, M., and Werner, R. Optimization of the backpropagation algorithm for training multilayer perceptrons. Technical report, University of Koblenz, 1994.

Seide, F., Li, G., and Yu, D. Conversational speech transcription using context-dependent deep neural networks. In Proc. INTERSPEECH, 2011.

Sutton, Richard S. Adapting bias by gradient descent: An incremental version of delta-bar-delta. In Proc. AAAI, pp. 171-176, 1992. URL citeseer.ist.psu.edu/158284.html.

Vinyals, Oriol and Povey, D. Krylov subspace descent for deep learning. In Proc. AISTATS, 2012.

Young, Steve, Kershaw, Dan, Odell, Julian, Ollason, Dave, Valtchev, Valtcho, and Woodland, Phil. The HTK Book, Version 3.1. 2001.

Yu, Dong and Deng, Li. Deep-structured hidden conditional random fields for phonetic recognition. In Proc. INTERSPEECH, 2010.