Phoneme Recognition using Temporal Patterns
Petr Schwarz, Pavel Matějka
Brno University of Technology, Czech Republic
OGI School of Science and Engineering at OHSU, USA
E-mail: [email protected], [email protected]
September 23-24 2003, M4 meeting, Delft

Outline
• The goal
• Experimental setup and system
• Baseline experiment with MFCC and MFCC multi-frame
• Comparison of conventional MFCC and novel TempoRAl Patterns (TRAPs) features under well-matched and mismatched conditions
• Optimization of TRAPs for our task
• New three-band TRAPs system
• Implementation and distribution of the SW
• Conclusions and future work

The goal
• For many applications, speech needs to be transcribed into discrete symbols.
• very reliable phoneme recognizer (not only) for the meeting domain
• no language constraints
• suitable as a front end to LVCSR, for keyword spotting, speaker recognition, language recognition or recognition of out-of-vocabulary words
Comparison of several techniques for automatic recognition of unconstrained context-independent phonemes

Experimental setup
• Two databases: TIMIT and NTIMIT
  - all SA records are removed
  - databases down-sampled to 8000 Hz
  - 412 speakers for training, 50 for cross-validation (CV), 168 for test
• The phoneme set contains 39 phonemes
  - very similar to the CMU/MIT phoneme set
  - closures are merged with bursts (bcl b → b)
• Experimental system is an NN/HMM hybrid
  - phoneme insertion penalty tuned to give an equal number of inserted and deleted phonemes

Experimental system
[Figure: block diagram. Feature extraction (MFCC, TRAPS), then a posterior probability estimator ("Which classifier?") producing phoneme posteriors (aa, ae, ..., z), then a Viterbi decoder producing a phoneme string (pau hh ae l ow pau).]

Which classifier, GMM or NN?
• HMM-GMM and HMM-NN with one-state models
• MFCC + Δ + ΔΔ features
• Number of parameters is increased until the decrease in phoneme error rate (PER) is negligible (< 0.5 %)

System   PER [%]   Parameters
GMM      42.0      788736
NN       41.6      31200

NN doesn’t degrade performance compared to GMM
+ 2 % absolute by merging

Single frame and multi-frame input with MFCC – FeatureNet
• Subsequent frames are joined together
• Size of context is increased to find the minimal PER
• 300, 400 and 500 neurons in the hidden layer were tested: the change is minimal, but the best is 400

Frames   PER [%]
1        41.6
5        37.5

[Figure: PER [%] versus number of frames (0 to 15); minimum PER = 37.5 %.]

TempoRAl Patterns
1. Frequency-localized posterior probabilities of phonemes are estimated from the temporal evolution of critical band energies within a single critical band.
2. These estimates are used in another class-posterior estimator, which estimates the overall phoneme probability from the probabilities in the individual critical bands.
[Figure: band classifiers 1, 2, ..., N.]

TRAP system scheme
[Figure: TRAP system scheme with normalization (Norm) blocks, phoneme posterior outputs, and the decoded phoneme string (pau hh ae l ow pau).]

MFCC and TRAP on well-matched conditions
• Training and testing data are from the same database
• Similar performance of MFCC multi-frame and 1 s long TRAPs
• Improvement can be obtained when the length of the TRAP is optimized

PER [%]            TIMIT   NTIMIT
MFCC39             41.6    55.6
MFCC39 5 frames    37.5    49.0
TRAP 1 sec         37.9    49.6

MFCC and TRAP on mismatched conditions
• Training and testing data are from different databases
• TRAP system yielded better results in both mismatched conditions
• It’s better to train the system on corrupted speech than on clean speech

PER [%]            TIMIT/NTIMIT   NTIMIT/TIMIT
MFCC39             80.9           63.4
MFCC39 5 frames    80.1           75.7
TRAP 1 sec         75.0           56.6

Effect of length of TRAP
• The original TRAP length was kept at 1 second to be sure that it covers all information about the phoneme in the critical band, but this length is not optimal
• A 300 ms long context is the best for the TIMIT database
[Figure: PER [%] versus TRAP length, time [ms] from 0 to 1000; minimum PER = 36.1 %.]

Effect of mean and variance normalization
• Experiment was performed on the original 1 second long TRAPs
• Significant degradation caused by both normalizations can be seen in well-matched conditions
• Mean normalization always helps in mismatched conditions; the benefit of variance normalization is less clear

Normalization / PER [%]   TIMIT   NTIMIT   TIMIT/NTIMIT   NTIMIT/TIMIT
None                      37.9    49.6     75.0           56.6
Mean                      40.5    51.8     73.5           54.7
Mean & variance           42.6    53.2     74.8           54.1

TRAP with more than one critical band
• Three neighboring temporal vectors were merged together and sent to one classifier
[Figure: posterior probabilities of phonemes for each triple of bands.]

System          PER [%]
TRAPS           36.1
3 band TRAPS    33.7
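
The input side of the TRAP and three-band TRAP systems described above can be illustrated with a minimal sketch. It only builds the vectors that would be fed to the band classifiers; the neural networks themselves are not shown. The function names, the edge handling (repeating the first or last frame), and the toy data are assumptions made for illustration and are not taken from phnrec. Under an assumed 10 ms frame shift, the 300 ms context would correspond to half_context=15.

```python
from statistics import fmean, pstdev


def trap_vector(energies, band, center, half_context):
    """Temporal trajectory of one critical band around frame `center`.

    energies[t][b] is the log energy of band b at frame t. Frames outside
    the utterance are filled by repeating the first or last frame
    (an assumption of this sketch).
    """
    last = len(energies) - 1
    return [energies[min(max(t, 0), last)][band]
            for t in range(center - half_context, center + half_context + 1)]


def normalize(vector, use_variance=False):
    """Subtract the mean; optionally also divide by the standard deviation."""
    mean = fmean(vector)
    centered = [x - mean for x in vector]
    if use_variance:
        std = pstdev(vector)
        if std > 0.0:
            centered = [x / std for x in centered]
    return centered


def three_band_input(energies, band, center, half_context):
    """Concatenate the trajectories of bands band-1, band and band+1."""
    merged = []
    for b in (band - 1, band, band + 1):
        merged.extend(trap_vector(energies, b, center, half_context))
    return merged


# Toy data (hypothetical): 6 frames x 4 bands, value = 10 * frame + band.
energies = [[10.0 * t + b for b in range(4)] for t in range(6)]
single = trap_vector(energies, band=1, center=2, half_context=2)
triple = three_band_input(energies, band=1, center=2, half_context=2)
```

In this sketch the three-band input is three times as long as the single-band one, which gives the classifier access to correlations between neighboring bands at the cost of a larger input layer.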

Implementation and distribution of the SW: phnrec
• Early experiments were performed with a set of scripts interconnecting executables (trapper, QuickNet, HTK, …); these are still used for training.
• Phoneme recognition is in phnrec, which contains:
  – feature extraction (MFCC (HTK-compatible), FeatureNet, TRAPS), from files or microphone
  – posterior-probability estimator (NN, compatible with QuickNet nets)
  – Viterbi decoder, which can also work on-line with a fixed delay
• Very good as a black box for people who want to use speech-to-phoneme transcription as a front end

phnrec (2)
• Source code for Linux and an EXE for Windows are available free for research.
• Available with nets trained on US-English (TIMIT) and Czech (SpeechDat-E).
• More languages to come (also some Language ID experiments running in Brno)
• Works on-line
http://www.fit.vutbr.cz/speech/sw/phnrec.html

Conclusion
• A TRAP-based phoneme recognizer was built and compared to MFCC.
• Properties of TRAPs were studied and TRAPs were optimized for phoneme recognition
• A new multi-band TRAPs approach was tested and its benefit was demonstrated
• The recognizer was successfully evaluated in a language identification task
• Easy-to-use software was written and is available to the research community.

But …
• Adaptation to meeting data is necessary (clean TIMIT training is not good at all), and updating the distribution on the www.
• Tests on ICSI, IDIAP and Brno data (which phonemes are going to work the best for us, CzEnglish?)
• Applications: LID already tested; keyword spotting and LVCSR (some papers at Eurospeech make use of phoneme strings).
• Phoneme lattices
• Real-time issues (the 1-band version runs OK on a reasonable machine, the 3-band version does not): NN weight pruning?

THE END
• A demo during the break.
• Please download phnrec, test it and comment!
• Questions?