Neural Networks for Speech Recognition

Steve Renals
Centre for Speech Technology Research, University of Edinburgh
[email protected]

Outline
• Some history
• The state of the art
• Looking ahead

A brief history of neural networks

The Perceptron (Rosenblatt, early 1960s)

NN Winter #1
• Perceptrons (Minsky and Papert, 1969) – single-layer perceptrons cannot represent XOR

MLPs and backprop (mid 1980s)
• Train multiple layers of hidden units – nested nonlinear functions
• Powerful feature detectors
• Posterior probability estimation
• Theorem: any continuous function can be approximated arbitrarily closely using a single hidden layer
[Figure: MLP with inputs x_i, hidden units z_j = h(b_j) connected by weights w^{(1)}_{ji}, and outputs y_1 ... y_K via weights w^{(2)}_{kj}, with backpropagated error signals δ_1 ... δ_K.]

Neural network acoustic models (1990s): "hybrid systems"
• Bourlard & Morgan, 1994; Robinson, IEEE TNN 1994
[Figure: DARPA RM 1992 – error (%) against millions of parameters for CI-HMM, CI-MLP, CI-RNN, CD-HMM and MIX systems (Renals, Morgan, Cohen & Franco, ICASSP 1992).]
[Diagram: broadcast news system – perceptual linear prediction and modulation spectrogram features feed CI/CD MLP and RNN probability estimators and Chronos decoders, whose utterance hypotheses are combined with ROVER.]
• Broadcast news 1998: 20.8% WER (best GMM-based system: 13.5%)
  (Cook, Christie, Ellis, Fosler-Lussier, Gotoh, Kingsbury, Morgan, Renals, Robinson & Williams, DARPA, 1999)

Neural network acoustic models (1990s): limitations compared with GMMs
• Computationally restricted to monophone outputs
  – CD-RNN factored over multiple networks – limited within-word context
• Training not easily parallelisable
  – slower experimental turnaround
  – less complex systems (fewer parameters): RNN <100k parameters; MLP ~1M parameters
• Rapid adaptation hard (cf MLLR)

NN Winter #2
[Figure: context-dependent phone plots for s-iy+l, f-iy-l, t-iy-n and t-iy-m.]

Discriminative long-term features – Tandem
• A neural network-based technique provided the biggest increase in speech recognition accuracy during the 2000s
• Tandem features (Hermansky, Ellis & Sharma, 2000)
  – use (transformed) outputs or (bottleneck) hidden values as input features for a GMM
  – deep networks – e.g. a 5-layer MLP to obtain bottleneck features (Grézl, Karafiát, Kontár & Černocký, 2007)
  – reduces errors by about 10% relative (Hain, Burget, Dines, Garner, Grezl, el Hannani, Huijbregts, Karafiat, Lincoln & Wan, 2012)

Neural network acoustic models (2010s)
[Diagram: tandem network with 45 CI phone outputs and hybrid network with ~12,000 CD phone outputs; both with 3–8 hidden layers of 2000 hidden units and MFCC inputs over 9 frames (39×9 = 351).]
(Dahl, Yu, Deng & Acero, IEEE TASLP 2012; Hinton, Deng, Yu, Dahl, Mohamed, Jaitly, Senior, Vanhoucke, Nguyen, Sainath & Kingsbury, IEEE SP Mag 2012)

Deep neural networks – what's new?
1. Unsupervised pretraining (Hinton, Osindero & Teh, 2006)
   • Train a stacked RBM generative model, then finetune
   • Good initialisation
   • Regularisation
2. Deep – many hidden layers
   • Deeper models are more accurate
   • GPUs gave us the computational power
3. Wide output layer (context-dependent phone classes), rather than factorising into multiple nets
   • More accurate phone models
   • GPUs gave us the computational power
(Adaptation is still hard . . .)
A minimal sketch of the resulting hybrid forward pass follows below.
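To make the hybrid setup concrete, here is a minimal numpy sketch of a DNN forward pass producing softmax posteriors over tied states, followed by the standard hybrid division by state priors to obtain scaled likelihoods for HMM decoding. This is an illustrative sketch only: the sigmoid hidden units, layer sizes, uniform priors and all names below are assumptions for demonstration, not the code behind any system in these slides.

```python
import numpy as np

def dnn_posteriors(x, weights, biases):
    """Forward pass: sigmoid hidden layers, softmax output over tied
    CD-HMM states, giving posteriors P(s | x)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))      # sigmoid hidden units
    a = h @ weights[-1] + biases[-1]                # output pre-activations
    a -= a.max(axis=-1, keepdims=True)              # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)        # softmax posteriors

def scaled_log_likelihoods(posteriors, state_priors, floor=1e-10):
    """Hybrid trick: log P(s|x) - log P(s) is, up to a constant,
    log p(x|s), which the HMM decoder consumes."""
    return np.log(np.maximum(posteriors, floor)) - np.log(state_priors)

# Toy usage: 351-dim input (9 frames of 39-dim MFCCs), two hidden
# layers of 2000 units, 12000 tied-state outputs, random weights.
rng = np.random.default_rng(0)
sizes = [351, 2000, 2000, 12000]
weights = [rng.normal(0, 0.01, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
x = rng.normal(size=(1, 351))
post = dnn_posteriors(x, weights, biases)
priors = np.full(12000, 1.0 / 12000)                # uniform priors, illustration only
loglik = scaled_log_likelihoods(post, priors)
```

In a real system the priors would be estimated from the state alignment counts of the training data rather than set uniform.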
DNNs improve the state-of-the-art in ASR
Peter Bell, Paweł Świętojański, Arnab Ghoshal

DNNs improve the state-of-the-art #1 – TED Talks
[Figure: WER (%) on dev2010 and tst2010 against number of hidden layers (1–6); WER falls steadily with depth, from about 22% to about 18%.]

MLAN: Multi-Level Adaptive Networks
• Train DNNs on out-of-domain (OOD) data
• Generate OOD posterior features for the in-domain data
• Combine with in-domain PLP features to give tandem features → OOD tandem HMMs
• Train in-domain DNNs on the tandem features → MLAN features → tandem MLAN and hybrid MLAN HMMs
(Bell, Swietojanski & Renals, 2013)

TED Talks system
Segmented speech recordings → extract OOD tandem features → adapt (CMLLR ×32) → decode (tandem MLAN, 3-gram LM) → adapt (CMLLR ×1) → decode tandem MLAN + SAT and hybrid MLAN + SAT → rescore lattices (4-gram LM) → rescore 100-best (fRNN LM) → ROVER combination
(Bell, Yamamoto, Swietojanski, Wu, McInnes, Hori & Renals, 2013)

WER results (IWSLT tst2010)
[Figure: WER (%) for PLP, tandem, hybrid, tandem MLAN, hybrid MLAN and ROVER combination systems, with +SAT, +MPE, +LM and +RNN increments; from 31.7% for the PLP baseline down to 11.7% for the ROVER combination.]

DNNs improve the state-of-the-art #2 – Sequential optimisation for Switchboard recognition

Training a neural network
[Diagram: a DNN viewed as feature-extraction hidden layers beneath a log-linear classifier with a softmax output.]
• Trained using gradient descent (backprop)
• Cross entropy between network outputs and labels – maximise the log posterior of the correct state:
  $\mathcal{F}_{\mathrm{CE}} = \sum_u \sum_t \log y_{ut}(s_{ut})$
  where $y_{ut}(s)$ is the network output for state $s$ at frame $t$ of utterance $u$, and $s_{ut}$ is the reference state.

Training a neural network sequence-discriminatively
• State-level minimum Bayes risk criterion:
  $\mathcal{F}_{\mathrm{sMBR}} = \sum_u \frac{\sum_W p(O_u \mid S_W)\, P(W)\, A(W, W_u)}{\sum_{W'} p(O_u \mid S_{W'})\, P(W')}$
  where $A(W, W_u)$ counts the correct state labels of hypothesis $W$ against the reference $W_u$
• Alphanet: MLPs trained sequence-discriminatively (Bridle & Dodd, 1991)
• Lattice-based sequence training of MLPs (Kingsbury, 2009)

Switchboard experiments
• Training set: 300 hours (Switchboard-1 Release 2)
• 30k-word pronunciation dictionary
• Smoothed trigram LM trained on 3M words of training transcripts, interpolated with 11M words of Fisher-1
• GMMs trained on 40-dim LDA+STC features from 7 frames (±3) of 13-dim MFCCs (C0–C12)
• DNNs trained on 40-dim LDA+STC+fMLLR features, 11-frame (±5) input
(Vesely, Ghoshal, Burget & Povey, Interspeech 2013)

Switchboard results (Hub5 '00 test set, 300 hour training set; http://kaldi.sf.net/)

  WER (%)      SWB    CHE    Avg
  GMM/BMMI     18.6   33.0   25.8
  DNN/CE       14.2   25.7   20.0
  DNN/sMBR     12.6   24.1   18.4

DNNs improve the state-of-the-art #3 – Low-resource cross-lingual knowledge transfer

Cross-lingual knowledge transfer in DNN acoustic models
• Low-resource speech recognition: limited acoustic training data in the target language
• Use untranscribed speech from different languages to improve target-language speech recognition?
• Transfer DNN hidden layers across languages? (see the layer-transfer sketch below)
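As an illustration of the layer-transfer idea, the following numpy sketch keeps a trained network's hidden layers and re-initialises only the output layer for a new language; the whole stack would then be finetuned on target-language data. This is a hedged sketch of the general recipe, not the experimental code from the papers cited here; the function and variable names are invented for illustration.

```python
import numpy as np

def swap_output_layer(weights, biases, n_target_states, rng=None):
    """Keep the source-language hidden layers as a shared feature
    extractor; replace only the softmax output layer ('hat swap'),
    since the target language has its own set of tied states."""
    rng = rng or np.random.default_rng(0)
    hidden_W, hidden_b = weights[:-1], biases[:-1]   # transferred unchanged
    n_last_hidden = hidden_W[-1].shape[1]
    new_W = rng.normal(0.0, 0.01, (n_last_hidden, n_target_states))
    new_b = np.zeros(n_target_states)
    return hidden_W + [new_W], hidden_b + [new_b]

# Toy usage: a source-language net with 1024-unit hidden layers gets a
# fresh output layer for a target language with 2000 tied states;
# finetuning of all layers on target-language data would follow.
rng = np.random.default_rng(1)
sizes = [351, 1024, 1024, 1024, 3000]   # illustrative dimensions
W = [rng.normal(0, 0.01, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(n) for n in sizes[1:]]
W_target, b_target = swap_output_layer(W, b, n_target_states=2000, rng=rng)
```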
• Unsupervised approach: pretraining using multilingual untranscribed audio
• Supervised approach: language-dependent output layer – transfer hidden layers across languages, then finetune the whole network for the target language

Unsupervised cross-lingual knowledge transfer
• GlobalPhone corpus; in-domain language: German
• 1/5/15 h in-domain training sets
• Unsupervised pretraining using Spanish, Portuguese, Swedish (20–25 h each)
• Baseline systems:
  – GMM/HMM with MFCC features
  – model- and feature-space discriminative training – (f)BMMI
  – transformed feature space – LDA+MLLT
  – 550–2500 tied states

DNNs
• Features: PLP-12 + energy, with first and second derivatives, ±4 frames of context
• 1024 hidden units per layer
• Tandem: 44 outputs (phone classes) reduced to 25 dimensions using PCA (see the sketch at the end of this section)
• Hybrid: 550–2500 outputs (context-dependent tied states from the HMM/GMM phonetic decision tree)

Effect of pretraining (5 h in-domain)
[Figure: WER (%) against number of hidden layers (1–9) for hybrid DNNs trained on 5 h of German, without pretraining and pretrained on German, Portuguese, Spanish, Swedish, and all four combined; horizontal lines mark the 5 h and 15 h GMM baselines (LDA+MLLT and fBMMI+BMMI); WER range roughly 14.5–18.5%.]
(Swietojanski, Ghoshal & Renals, SLT 2013)

Word error rates on GlobalPhone (German)
[Figure: WER (%) for discriminative GMM, tandem and hybrid systems trained on 1 h, 5 h and 15 h of German, with unsupervised pretraining on Spanish for the hybrid and tandem systems; WERs range from roughly 35% down to roughly 15%.]

Supervised cross-lingual acoustic models: cross-lingual DNNs ("hat swap")
[Diagram: stacked RBMs trained on Polish (PL); the resulting hidden layers are shared, with separate output layers finetuned on Czech (CZ), German (DE), Portuguese (PT) and Polish (PL).]
(Ghoshal, Swietojanski & Renals, ICASSP 2013)

Mono- & multilingual results on GlobalPhone (Polish)
[Figure: monolingual and multilingual WERs on Polish, including the PL→CZ→DE→FR→PL hat-swap sequence.]

Another important result

DNN acoustic models for robust speech recognition (Seltzer, Yu & Wang, ICASSP 2013)
• Aurora 4 – WSJ with added noise
• State of the art – GMMs with combined discriminative and noise-adaptive training, plus speaker adaptation
• That WER is equalled by a baseline DNN system with multicondition training:

  System           WER (%)
  GMM baseline     23.0
  GMM-VAT-Joint    13.4
  DNN baseline     13.4
  DNN+NAT+D/O      12.4

Practical details for DNNs
• Computing platform: a high-end PC with a "gamers' GPU" (e.g. GTX 690) (~€5000)
• Open-source software, e.g.:
  – Kaldi – http://kaldi.sourceforge.net
  – Theano – http://deeplearning.net/software/theano
  – Torch7 – http://www.torch.ch
  – Quicknet – http://www.icsi.berkeley.edu/Speech/qn.html

Outlook: two interesting things (among many) facilitated by DNNs

DNNs and acoustic features
• GMM/HMM systems are closely tied to MFCC/PLP feature representations – GMMs are not a good model for learning feature correlations
• DNNs may be viewed as powerful feature extractors that can model correlated features
• (Log) (mel) spectral features give equal or lower error rates
• DNNs are well matched to exploring a range of rich acoustic feature representations: reverberation, overlapped speech, multiple acoustic sources...

Incorporation of speech knowledge
• Graph transformer networks – trainable multi-modular systems
(LeCun, Bottou, Bengio & Haffner, Proc IEEE 1998)
[Figure 27 of that paper: a globally trainable SDNN/HMM hybrid system expressed as a GTN.]
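The tandem recipe referenced above (44 phone-class posteriors reduced to 25 dimensions) can be summarised in a few lines of numpy. This is an illustrative sketch only, assuming log posteriors as input and a plain PCA projection; it is not the toolkit code used in the experiments, and all names are invented for illustration.

```python
import numpy as np

def pca_tandem_features(log_posteriors, n_components=25):
    """Project (e.g. 44-dim) log phone posteriors onto their top
    principal components, giving decorrelated low-dimensional
    features suitable as inputs to a GMM/HMM system."""
    X = log_posteriors - log_posteriors.mean(axis=0)    # centre each dimension
    cov = np.cov(X, rowvar=False)                       # 44 x 44 covariance
    eigvals, eigvecs = np.linalg.eigh(cov)              # ascending eigenvalues
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return X @ top                                      # (frames, n_components)

# Toy usage: 1000 frames of 44-dim log posteriors -> 25-dim tandem features.
rng = np.random.default_rng(0)
fake_log_post = np.log(rng.dirichlet(np.ones(44), size=1000))
feats = pca_tandem_features(fake_log_post)
assert feats.shape == (1000, 25)
```

In practice the resulting features are appended to (or used alongside) conventional acoustic features such as PLPs before GMM training.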
Conclusions
• DNNs enable automatic extraction of rich features
• (Implicit or explicit) regularisation enables the construction of deep, complex DNNs
• Training DNNs to estimate CD-HMM state probabilities has extended the state of the art in ASR
• DNNs open new research possibilities:
  – multilingual, multidomain systems
  – new acoustic features for ASR
  – ....

Thanks. [email protected]

References
• P Bell, P Swietojanski, and S Renals (2013). "Multi-level adaptive networks in tandem and hybrid ASR systems", Proc IEEE ICASSP.
• P Bell, H Yamamoto, P Swietojanski, Y Wu, F McInnes, C Hori, and S Renals (2013). "A lecture transcription system combining neural network acoustic and language models", Proc Interspeech.
• H Bourlard and N Morgan (1994). Connectionist Speech Recognition – A Hybrid Approach, Kluwer.
• J Bridle and L Dodd (1991). "An Alphanet approach to optimising input transformations for continuous speech recognition", Proc IEEE ICASSP.
• A Bryson and Y Ho (1969). Applied Optimal Control, Taylor and Francis.
• G Cook, J Christie, D Ellis, E Fosler-Lussier, Y Gotoh, B Kingsbury, N Morgan, S Renals, T Robinson, and G Williams (1999). "An overview of the SPRACH system for the transcription of broadcast news", Proc DARPA Broadcast News Workshop.
• G Dahl, D Yu, L Deng, and A Acero (2012). "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition", IEEE Trans ASLP, 20:30–42.
• A Ghoshal, P Swietojanski, and S Renals (2013). "Multilingual training of deep neural networks", Proc IEEE ICASSP.
• F Grézl, M Karafiát, S Kontár, and J Černocký (2007). "Probabilistic and bottle-neck features for LVCSR of meetings", Proc IEEE ICASSP.
• T Hain, L Burget, J Dines, P Garner, F Grezl, A el Hannani, M Huijbregts, M Karafiat, M Lincoln, and V Wan (2012). "Transcribing meetings with the AMIDA systems", IEEE Trans ASLP, 20:486–498.
• H Hermansky, D Ellis, and S Sharma (2000). "Tandem connectionist feature extraction for conventional HMM systems", Proc IEEE ICASSP.
• G Hinton, L Deng, D Yu, G Dahl, A Mohamed, N Jaitly, A Senior, V Vanhoucke, P Nguyen, T Sainath, and B Kingsbury (2012). "Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups", IEEE Signal Processing Magazine, Nov 2012:82–97.
• G Hinton, S Osindero, and YW Teh (2006). "A fast learning algorithm for deep belief nets", Neural Computation, 18:1527–1554.
• B Kingsbury (2009). "Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling", Proc IEEE ICASSP.
• Y LeCun, L Bottou, Y Bengio, and P Haffner (1998). "Gradient-based learning applied to document recognition", Proc IEEE, 86(11):2278–2324.
• M Minsky and S Papert (1969). Perceptrons, MIT Press.
• D Parker (1985). Learning Logic, TR-47, Center for Computational Research in Economics and Management Science, MIT.
• S Renals, N Morgan, M Cohen, and H Franco (1992). "Connectionist probability estimation in the DECIPHER speech recognition system", Proc IEEE ICASSP.
• S Renals, N Morgan, H Bourlard, M Cohen, and H Franco (1994). "Connectionist probability estimators in HMM speech recognition", IEEE Trans SAP, 2(1):161–174.
• AJ Robinson (1994). "An application of recurrent nets to phone probability estimation", IEEE Trans Neural Networks, 5:298–305.
• F Rosenblatt (1962). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms, Spartan Books.
• D Rumelhart, G Hinton, and R Williams (1986). "Learning internal representations by error propagation", in Parallel Distributed Processing (vol 1), D Rumelhart and J McClelland (eds), MIT Press.
• M Seltzer, D Yu, and E Wang (2013). "An investigation of deep neural networks for noise robust speech recognition", Proc IEEE ICASSP.
• P Swietojanski, A Ghoshal, and S Renals (2013). "Unsupervised cross-lingual knowledge transfer in DNN-based LVCSR", Proc IEEE SLT Workshop.
• K Vesely, A Ghoshal, L Burget, and D Povey (2013). "Sequence-discriminative training of deep neural networks", Proc Interspeech.
• P Werbos (1990). "Backpropagation through time: what it does and how to do it", Proc IEEE, 78:1550–1560.