Lombard Speech Recognition
Hynek Bořil ([email protected])
Speech and Speaker Recognition – slides by John H.L. Hansen, 2007

Outline
- Model of speech production
- Automatic speech recognition (ASR): feature extraction, acoustic models
- Lombard effect (LE): definition and motivation
- Acquisition of a corpus capturing the Lombard effect
- Analysis of speech under LE
- Methods for increasing ASR robustness to LE

Speech Production
A model of speech production aids understanding of the structure of the speech signal and the design of speech processing algorithms.

Speech Production – Linear Model
- Voiced excitation: an impulse train generator I(z) (period 1/F0, gain AV) drives the glottal pulse model G(z); the excitation spectrum |I(F)G(F)| consists of harmonics at F0, 2F0, ... with a roll-off of about -12 dB/oct.
- Unvoiced excitation: a random noise generator N(z) (gain AN) with a flat spectrum |N(F)|.
- A voiced/unvoiced switch selects the excitation uG(n), which is shaped by the vocal tract model V(z) and the radiation model R(z) (about +6 dB/oct) to produce the speech signal pL(n).
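To make the block diagram concrete, here is a minimal numerical sketch of the source-filter model in Python (not from the slides); the all-pole coefficients `a`, the pitch `f0`, and the gain are arbitrary illustrative values:

```python
import numpy as np

def synthesize(voiced, n=8000, fs=8000, f0=120.0, a=(1.3, -0.7), gain=1.0):
    """Toy source-filter synthesis following the linear model:
    excitation (impulse train or noise) -> all-pole vocal tract V(z)."""
    if voiced:
        e = np.zeros(n)
        e[:: int(fs / f0)] = 1.0          # impulse train at pitch period fs/F0
    else:
        e = np.random.randn(n)            # white-noise excitation
    # V(z) = G / (1 - sum_k a_k z^-k), realized as a time-domain recursion
    y = np.zeros(n)
    for i in range(n):
        acc = gain * e[i]
        for k, ak in enumerate(a, start=1):
            if i - k >= 0:
                acc += ak * y[i - k]
        y[i] = acc
    return y
```

The chosen coefficients place a stable resonance (pole modulus about 0.84), so the voiced output is a train of decaying oscillations, a crude single-formant vowel.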
Speech Production – Linear Model (cont.)
- The vocal tract is modeled as an all-pole filter

  V(z) = G / (1 - Σ_{k=1..N} a_k z^(-k)),

  whose magnitude response |V(F)| exhibits resonance peaks (formants).
[Figure: block diagram of the linear model with the magnitude responses |I(F)G(F)| (-12 dB/oct harmonics), |N(F)| (flat), |V(F)|, and |R(F)| (+6 dB/oct) attached to the corresponding blocks]

Speech Production – Linguistic/Speaker Information in the Speech Signal
How is linguistic information coded in the speech signal?
- Phonetic content:
  - Energy: voiced phones (v) carry higher energy than unvoiced phones (uv).
  - Low formants: locations and bandwidths (reflecting changes in the configuration of the vocal tract during speech production).
  - Spectral tilt: differs across phones; generally flatter for uv (due to changes in excitation and formant locations).
- Other cues:
  - Pitch contour: important for distinguishing words in tonal languages (e.g., Chinese dialects).
How is speaker identity coded in the speech signal?
- Glottal waveform
- Vocal tract parameters
- Prosody (intonation, rhythm, stress, ...)

Speech Production – Phonetic Content in Features
Example 1 – First two formants (F1, F2) of US English vowels (Bond et al., 1989). [Figure: F1–F2 plot of /i/, /ae/, /a/, /u/; F1 roughly 200–900 Hz, F2 roughly 800–2200 Hz]
Example 2 – Spectral slopes of Czech vowels (neutral speech; #N samples, total duration T):

  Vowel   #N     T (s)    Slope (dB/oct)       σ (dB/oct)
  /a/     454    69.03    -6.8 (-6.9; -6.7)    1.13
  /e/     1064   69.33    -5.6 (-5.7; -5.6)    1.06
  /i/     509    58.92    -5.0 (-5.1; -4.9)    1.15
  /o/     120    9.14     -8.0 (-8.1; -7.8)    0.91
  /u/     102    5.73     -6.1 (-6.3; -6.0)    0.77

Automatic Speech Recognition (ASR) – Architecture of an HMM Recognizer
SPEECH SIGNAL -> FEATURE EXTRACTION (MFCC/PLP) -> SUB-WORD LIKELIHOODS (GMM/MLP) -> DECODER (VITERBI) -> ESTIMATED WORD SEQUENCE, with the decoder drawing on the ACOUSTIC MODEL (HMM), the LEXICON, and the LANGUAGE MODEL (BIGRAMS).
- Feature extraction – transformation of the time-domain acoustic signal into a representation more effective for the ASR engine: dimensionality reduction and suppression of irrelevant (disturbing) signal components (speaker-, environment-, and recording-chain-dependent characteristics) while preserving phonetic content.
- Sub-word models – Gaussian mixture models (GMMs): mixtures of Gaussians modeling the distribution of feature-vector parameters; multi-layer perceptrons (MLPs): neural networks, much less common than GMMs.

ASR – HMM-Based Recognition Stages
Speech signal -> feature extraction (windowing, ..., cepstrum) -> observations o1, o2, o3, ... -> acoustic models (HMMs) + language model (word sequences) -> speech transcription (HTK Book, 2006).
ASR Feature Extraction – MFCC
Mel Frequency Cepstral Coefficients (MFCC); Davis & Mermelstein, IEEE Trans. Acoustics, Speech, and Signal Processing, 1980. MFCC is the first choice in current commercial ASR.

  s(n) -> PREEMPHASIS -> WINDOW (HAMMING) -> |FFT|^2 -> FILTER BANK (MEL) -> Log(.) -> IDCT -> c(n)

- Preemphasis: compensates for spectral tilt (speech production/microphone channel).
- Windowing: suppresses transient effects at the edges of short-term signal segments.
- |FFT|^2: energy spectrum (phase is discarded).
- Mel filter bank: the mel scale models the logarithmic perception of frequency in humans; triangular filters provide dimensionality reduction.
- Log + IDCT: extraction of the cepstrum – deconvolution of the glottal waveform, vocal tract function, and channel characteristics.

ASR Feature Extraction – MFCC & PLP
Perceptual Linear Predictive (PLP) coefficients; Hermansky, Journal of the Acoustical Society of America, 1990. An alternative to MFCC, used less frequently; many stages are similar to MFCC. Linear prediction smooths the spectral envelope (which may improve robustness).

  s(n) -> WINDOW (HAMMING) -> |FFT|^2 -> FILTER BANK (BARK) -> EQUAL-LOUDNESS PREEMPHASIS -> INTENSITY-TO-LOUDNESS (cube root) -> LINEAR PREDICTION -> RECURSION -> CEPSTRUM -> c(n)
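The MFCC chain above can be sketched compactly in Python. This is a toy illustration, not a production front-end; the FFT size, preemphasis constant (0.97), and filter count are conventional but assumed choices:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, fs=8000, n_filters=20, n_ceps=13, nfft=512):
    """MFCC of one frame: preemphasis -> Hamming -> |FFT|^2 -> mel FB -> log -> DCT."""
    x = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])   # preemphasis
    x = x * np.hamming(len(x))                               # windowing
    power = np.abs(np.fft.rfft(x, nfft)) ** 2                # energy spectrum
    # triangular mel filter bank
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mels) / fs).astype(int)
    fb = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    logenergy = np.log(fb @ power + 1e-10)                   # log mel energies
    # DCT-II (the "IDCT" block of the slides) decorrelates -> cepstral coeffs
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return dct @ logenergy
```

In a full front-end this would run over overlapping frames, followed by delta and delta-delta computation as used in the experiments later in the deck.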
ASR Acoustic Models – GMM-HMM
Gaussian Mixture Models (GMMs)
- Motivation: distributions of cepstral coefficients can be modeled well by a mixture (weighted sum) of Gaussian functions.
- Example: the histogram of c0 within a certain phone and the corresponding Gaussian probability density function, defined uniquely by its mean, variance, and weight. [Figure: histogram of c0 vs. fitted Gaussian pdf Pr(c0)]
- Multidimensional observations (c0, ..., c12) require multidimensional Gaussians, defined uniquely by means, covariance matrices, and weights.
- GMMs are typically used to model parts of phones.
Hidden Markov Models (HMMs)
- States (GMMs) plus transition probabilities between states.
- Models of whole phones; lexicon word models are built from phone models.

Lombard Effect – Definition & Motivation
What is the Lombard effect? When exposed to a noisy adverse environment, speakers modify the way they speak in an effort to maintain intelligible communication (Lombard effect, LE).
Why is the Lombard effect interesting?
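Returning to the Gaussian mixture modeling above: the mixture density is just a weighted sum of Gaussian pdfs. A minimal scalar sketch (illustrative only; real acoustic models use multivariate Gaussians over whole feature vectors):

```python
import math

def gmm_pdf(x, weights, means, variances):
    """Likelihood of a scalar observation x under a 1-D Gaussian mixture.
    Each component is defined by its weight, mean, and variance."""
    total = 0.0
    for w, m, v in zip(weights, means, variances):
        total += w * math.exp(-(x - m) ** 2 / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
    return total
```

With weights summing to one, the result is itself a valid density; an HMM state score is the (log-)value of such a mixture for the current observation.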
- Better understanding of the mechanisms of human speech communication. (Can we intentionally change particular parameters of speech production to improve intelligibility, or is LE an automatic process learned through a feedback loop? How do the type of noise and the communication scenario affect LE?)
- Mathematical modeling of LE: classification of LE level, speech synthesis in noisy environments, and increasing the robustness of automatic speech recognition and speaker identification systems.

Lombard Effect – Motivation & Goals
Ambiguity in past LE investigations:
- LE has been studied since 1911; however, many investigations disagree on the observed impacts of LE on speech production.
- Analyses were typically conducted on very limited data – a couple of utterances from a few subjects (1–10).
- Lack of a communication factor – most studies ignore the importance of communication for evoking LE (an effort to convey a message over noise); as a result, the occurrence and level of LE in the recordings is 'random', yielding contradictory analysis results.
- LE has been studied only for several world languages (English, Spanish, French, Japanese, Korean, Mandarin Chinese); there is no comprehensive study for any Slavic language.
1st goal:
- Design of a Czech Lombard speech database addressing the need for a communication factor and well-defined simulated noisy conditions.
- Systematic analysis of LE in spoken Czech.
Lombard Effect – Motivation & Goals (cont.)
ASR under LE:
- Mismatch between LE speech corrupted by noise and acoustic models trained on clean neutral speech.
- The strong impact of noise on ASR is well known, and a vast number of noise suppression/speech enhancement algorithms have been proposed in recent decades (yet no ultimate solution has been reached).
- The negative impact of LE on ASR often exceeds that of the noise itself; recent state-of-the-art ASR systems mostly ignore this issue.
LE-equalization methods:
- LE-equalization algorithms typically operate in the following domains: robust features, LE-to-neutral transformation, model adjustments, and improved training of acoustic models.
- The algorithms display various degrees of efficiency and are often bound by strong assumptions preventing real-world application (fixed transformations applied to phonetic groups, known level of LE, etc.).
2nd goal: proposal of novel LE-equalization techniques focused both on the level of LE suppression and on the extent of the bounding assumptions.

LE Corpora – Available Czech Corpora
- Czech SPEECON – speech recordings from various environments, including office and car.
- CZKCC – car recordings, including parked-car (engine off) and moving-car scenarios.
- Both databases contain speech produced in quiet and in noise, making them candidates for a study of LE – however, not good ones, as shown later.
- Design/acquisition of an LE-oriented database – Czech Lombard Speech Database '05 (CLSD'05). Goals: communication in a simulated noisy background at high SNR; phonetically rich data and extensive small-vocabulary material; parallel utterances in neutral and LE conditions.
Data Acquisition – Recording Setup
Simulated noisy conditions:
- Noise samples are mixed with the speech feedback and played to the speaker and the operator over headphones.
- The operator judges the intelligibility of speech in noise – if an utterance is not intelligible (BAD – again), the operator asks the subject to repeat it; otherwise the session continues (OK – next). Speakers are thus required to convey the message over the noise – communication-evoked, real LE.
- Noises: mostly car noises from the Car2E database, normalized to 90 dB SPL.
Speaker sessions:
- 14 male / 12 female speakers.
- Each subject was recorded in both neutral and simulated noisy conditions.
- Setup: close-talk and middle-talk microphones, H&T recorder (ME-104 microphones, NB2 units); the operator monitors the noise + speech mix.

Data Acquisition – Impact of Headphones
Environmental sound attenuation by headphones:
- Attenuation characteristics measured on a dummy head.
- A source of wide-band noise; measurement of the sound transfer to the dummy head's auditory canals without and with headphones.
- The attenuation characteristic is the difference (subtraction) of the two transfers.
Data Acquisition – Impact of Headphones (cont.)
- Directional attenuation measured in a reflectionless sound booth; real attenuation measured in the recording room. [Figure: polar plots of attenuation (dB) vs. angle at 1, 2, 4, and 8 kHz; headphone attenuation vs. frequency (100 Hz – 10 kHz) for 0°, 90°, 180°, and the recording room]

Speech Production under Lombard Effect – Speech Features Affected by LE
- Vocal tract excitation: the glottal pulse shape changes; the fundamental frequency rises.
- Vocal tract transfer function: center frequencies of the low formants increase; formant bandwidths are reduced.
- Vocal effort (intensity) increases.
- Other: voiced phonemes are prolonged; the energy ratio of voiced to unvoiced segments increases; ...
Analysis of Speech Features under LE – Fundamental Frequency
Distributions of fundamental frequency (70–570 Hz), female (F) and male (M) speakers:
- Czech SPEECON: office vs. car sessions.
- CZKCC: engine off vs. engine on.
- CLSD'05: neutral vs. LE.
[Figure: F0 histograms for the three corpora]
Analysis of Speech Features under LE – Formant Locations
[Figure: F1–F2 plots of Czech vowels /a/, /e/, /i/, /o/, /u/ (primed symbols: long vowels) in digits, neutral vs. LE, for CZKCC female/male and CLSD'05 female/male speakers]

Analysis of Speech Features under LE – Formant Bandwidths
- SPEECON, CZKCC: no consistent bandwidth changes.
- CLSD'05: significant bandwidth reduction in many voiced phonemes.

First-formant bandwidths B1 and standard deviations σ1 (Hz), neutral (N) vs. LE (asterisks as in the original table):

  CZKCC     Male N       Male LE      Female N     Female LE
  Vowel     B1    σ1     B1    σ1     B1    σ1     B1    σ1
  /a/       207*  74     210*  84     275   97     299   78
  /e/       125*  70     130*  78     156   68     186   79
  /i/       124*  49     127*  44     105   44     136   53
  /o/       275   87     222   67     263*  85     269*  73
  /u/       187   100    170   89     174*  96     187*  101

  CLSD'05   Male N       Male LE      Female N     Female LE
  Vowel     B1    σ1     B1    σ1     B1    σ1     B1    σ1
  /a/       269   88     152   59     232   85     171   68
  /e/       168   94     99    44     169   73     130   49
  /i/       125   53     108   52     132*  52     133*  58
  /o/       239   88     157   81     246   91     158   62
  /u/       134*  67     142*  81     209   95     148   66
Analysis of Speech Features under LE – Phoneme Durations
- Significant increases in duration for some phonemes, especially voiced ones.
- Some unvoiced consonants show duration reduction.
- The duration changes in CLSD'05 considerably exceed those in SPEECON and CZKCC.

Mean phoneme durations T and standard deviations σT (s):

  CZKCC    Word    Phoneme  #OFF  TOFF   σTOFF  #ON   TON    σTON   Δ(%)
           Nula    /a/      349   0.147  0.079  326   0.259  0.289  48.50
           Jedna   /a/      269   0.173  0.076  251   0.241  0.238  39.36
           Dva     /a/      245   0.228  0.075  255   0.314  0.311  38.04
           Štiri   /r/      16    0.045  0.027  68    0.080  0.014  78.72
           Sedm    /e/      78    0.099  0.038  66    0.172  0.142  72.58

  CLSD'05  Word    Phoneme  #N    TN     σTN    #LE   TLE    σTLE   Δ(%)
           Jedna   /e/      583   0.031  0.014  939   0.082  0.086  161.35
           Dvje    /e/      586   0.087  0.055  976   0.196  0.120  126.98
           Čtiri   /r/      35    0.041  0.020  241   0.089  0.079  115.92
           Pjet    /e/      555   0.056  0.033  909   0.154  0.089  173.71
           Sedm    /e/      358   0.080  0.038  583   0.179  0.136  122.46
           Osm     /o/      310   0.086  0.027  305   0.203  0.159  135.25
           Devjet  /e/      609   0.043  0.022  932   0.120  0.088  177.20

Lombard Effect – Initial ASR Experiments
Digit recognizer: monophone HMM models; 13 MFCC + Δ + ΔΔ; 32 Gaussian mixtures per model state.
ASR evaluation – word error rate (WER), with S word substitutions, D word deletions, I word insertions, and N reference words:

  WER = (D + S + I) / N × 100 %

Results (WER in %, with confidence intervals); the SPEECON/CZKCC car recordings are noisy (10.7/12.6 dB SNR), while the CLSD'05 recordings are clean (LE at 40.9 dB SNR):

  Corpus         Set       #Spkrs  #Digits  WER (%)
  Czech SPEECON  Office F  22      880      5.5   (4.0–7.0)
  Czech SPEECON  Office M  31      1219     4.3   (3.1–5.4)
  Czech SPEECON  Car F     28      1101     4.6   (3.4–5.9)
  Czech SPEECON  Car M     42      1657     10.5  (9.0–12.0)
  CZKCC          OFF F     30      1480     3.0   (2.1–3.8)
  CZKCC          OFF M     30      1323     2.3   (1.5–3.1)
  CZKCC          ON F      18      1439     13.5  (11.7–15.2)
  CZKCC          ON M      21      1450     10.4  (8.8–12.0)
  CLSD'05        N F       12      4930     7.3   (6.6–8.0)
  CLSD'05        N M       14      1423     3.8   (2.8–4.8)
  CLSD'05        LE F      12      5360     42.8  (41.5–44.1)
  CLSD'05        LE M      14      6303     16.3  (15.4–17.2)
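The WER formula needs the counts D, S, and I, which come from aligning hypothesis to reference. The sketch below obtains D + S + I as the Levenshtein distance between the word lists; this is the usual way WER is computed, though not necessarily the exact scoring tool behind the tables:

```python
def wer(ref, hyp):
    """Word error rate (%) via Levenshtein alignment of reference and
    hypothesis word lists; the distance counts S + D + I jointly."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                       # all-deletion prefix cost
    for j in range(len(hyp) + 1):
        d[0][j] = j                       # all-insertion prefix cost
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub,            # substitution (or match)
                          d[i - 1][j] + 1,  # deletion
                          d[i][j - 1] + 1)  # insertion
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("one two three".split(), "one three".split())` counts one deletion out of three reference words.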
LE Suppression in ASR – Model Adaptation
Model adaptation is often effective when only limited data from the given conditions are available.
- Maximum likelihood linear regression (MLLR) – with limited data per class, acoustically close classes are grouped and transformed together:

  μ'_MLLR = A μ + b

- Maximum a posteriori (MAP) approach – the initial models serve as informative priors for the adaptation; with N adaptation frames, adaptation-data mean μ̄, prior mean μ0, and prior weight τ:

  μ'_MAP = (N / (N + τ)) μ̄ + (τ / (N + τ)) μ0

Adaptation procedure:
- First, the neutral speaker-independent (SI) models are transformed by MLLR, employing clustering (a binary regression tree).
- Second, MAP adaptation is applied, but only to nodes with a sufficient amount of adaptation data.
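The MAP mean update above is a count-weighted interpolation between the prior mean and the sample mean; a minimal sketch (the prior weight `tau` is an assumed, typical hyperparameter):

```python
def map_adapt_mean(prior_mean, data, tau=10.0):
    """MAP update of a Gaussian mean: interpolate between the prior mean and
    the sample mean of the adaptation data, weighted by the data count N and
    the prior weight tau. Large N -> trust the data; small N -> keep the prior."""
    n = len(data)
    sample_mean = sum(data) / n
    return (n / (n + tau)) * sample_mean + (tau / (n + tau)) * prior_mean
```

With ten adaptation samples and `tau=10`, prior and data are weighted equally; as the adaptation set grows, the estimate converges to the sample mean, which is why MAP is applied only to regression-tree nodes with sufficient data.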
LE Suppression in ASR – Model Adaptation Schemes
- Speaker-independent (SI) adaptation – group-dependent/independent.
- Speaker-dependent (SD) adaptation – to neutral/LE.
[Figure: WER (%) for baseline vs. adapted LE digits and LE sentences, comparing SI adaptation to LE (same speakers), SI adaptation to LE (disjunct speakers), SD adaptation to neutral, and SD adaptation to LE]

LE Suppression in ASR – Data-Driven Design of Robust Features
Filter bank approach:
- Analysis of the importance of frequency components for ASR.
- Repartitioning the filter bank (FB) to emphasize components carrying phonetic information and to suppress disturbing components.
- The initial FB is distributed uniformly on a linear scale – equal attention to all components.
- Consecutively, a single FB band is omitted – what is the impact on WER? Omitting bands carrying more information should result in a considerable WER increase.
Implementation: MFCC front-end with the mel scale replaced by a linear scale and the triangular filters replaced by rectangular filters without overlap.

Data-Driven Design of Robust Features – Importance of Frequency Components
[Figure: WER (%) vs. omitted band (1–20) for neutral and LE speech; features c0–c12, Δc0–Δc12, ΔΔc0–ΔΔc12; 20 rectangular bands with cut-off frequencies up to 4 kHz]
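The linear-scale rectangular filter bank from the implementation note might be built as follows. This is a sketch: the 20 bands and the 4 kHz band edge follow the slides, while `nfft` and `fs` are assumed values; `omit` reproduces the leave-one-band-out experiment:

```python
import numpy as np

def rect_filterbank(n_bands=20, low=0.0, high=4000.0, nfft=512, fs=8000, omit=None):
    """Non-overlapping rectangular filters spaced uniformly on a linear scale;
    `omit` (1-based) drops one band, as in the band-omission analysis."""
    edges = np.linspace(low, high, n_bands + 1)              # uniform band edges
    bins = np.floor((nfft + 1) * edges / fs).astype(int)     # edges -> FFT bins
    fb = np.zeros((n_bands, nfft // 2 + 1))
    for i in range(n_bands):
        fb[i, bins[i]:bins[i + 1]] = 1.0                     # rectangular filter
    if omit is not None:
        fb = np.delete(fb, omit - 1, axis=0)
    return fb
```

Raising `low` (e.g., to the 625 Hz cut-off found later in the deck) yields the repartitioned low-cut filter banks, and the matrix multiplies the frame's power spectrum exactly like the mel filter bank in the MFCC front-end.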
Data-Driven Design of Robust Features – Importance of Frequency Components (Findings)
- The region of 1st and 2nd formant occurrence carries the highest portion of phonetic information – F1 is more important for neutral speech, F1–F2 for LE speech recognition.
- Omitting the 1st band considerably improves LE ASR while reducing performance on neutral speech – a tradeoff.
- Next step – how much of the low-frequency content should be omitted for LE ASR?
Data-Driven Design of Robust Features – Omitting Low Frequencies
[Figure: WER (%) vs. bandwidth (0–1200 Hz) omitted from the low end of the filter bank, for neutral and LE speech; 19-band filter bank, cut-off frequencies up to 4 kHz; features c0–c12, Δc0–Δc12, ΔΔc0–ΔΔc12]
Effect of omitting low spectral components:
- Increasing the FB low cut-off results in an almost linear increase of WER on neutral speech while considerably enhancing ASR performance on LE speech.
- The optimal low cut-off was found at 625 Hz.

  Devel set          Neutral           LE
  LFCC, full band    4.8 (4.1–5.5)     29.0 (27.5–30.5)
  LFCC, 625 Hz       6.6 (5.8–7.4)     15.6 (14.4–16.8)

Data-Driven Design of Robust Features – Increasing Filter Bank Resolution
- Idea: emphasize the high-information portion of the spectrum by increasing the FB resolution there.
- Experiment: FB decimation from 19 to 12 bands (decreasing computational costs).
- Increasing the number of filters at the peak of the information distribution curve led to a deterioration of LE ASR (17.2 % -> 26.9 %).
- Slight F1–F2 shifts due to LE affect the cepstral features; there is no simple recipe for deriving an efficient FB from the information distribution curves.
[Figure: WER (%) on LE speech vs. omitted band (0–13), filter bank starting at 625 Hz]

Consecutive filter bank repartitioning:
- Consecutively, from the lowest to the highest band, each FB high cut-off is varied while the rest of the FB above it is redistributed uniformly across the remaining frequency band.
- The cut-off yielding a local WER minimum is fixed, and the procedure is repeated for the adjacent higher cut-off.
- WER reduced by 2.3 % on LE speech and by 1 % on neutral speech (example with a 6-band FB).
[Figure: WER (%) on LE speech vs. critical frequency (500–4000 Hz) for bands 1–6]
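The consecutive repartitioning can be viewed as a greedy band-edge search; a sketch under that reading follows. `evaluate_wer` is a hypothetical stand-in for the expensive step the real experiments perform (building the FB, retraining/decoding, and measuring WER):

```python
import numpy as np

def repartition(n_bands, low, high, evaluate_wer, grid=50):
    """Greedy band-edge search: fix each interior cut-off in turn at the value
    minimizing WER, redistributing the edges above it uniformly over the
    remaining band, as in the consecutive repartitioning procedure."""
    edges = np.linspace(low, high, n_bands + 1)
    for i in range(1, n_bands):                       # interior cut-offs only
        best = (float("inf"), edges[i])
        for cand in np.linspace(edges[i - 1] + 1.0, high - (n_bands - i), grid):
            trial = edges.copy()
            trial[i] = cand
            trial[i + 1:] = np.linspace(cand, high, n_bands - i + 1)[1:]
            w = evaluate_wer(trial)                   # stand-in for train+decode
            if w < best[0]:
                best = (w, cand)
        edges[i] = best[1]                            # freeze the local minimum
        edges[i + 1:] = np.linspace(best[1], high, n_bands - i + 1)[1:]
    return edges
```

Because each cut-off is frozen before the next is searched, the procedure is a local, not global, optimization – consistent with the modest WER gains reported.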
Data-Driven Design of Robust Features – Standard vs. Novel Features
State-of-the-art LE front-end – Expolog (Bou-Ghazale & Hansen, 2000):
- FB redistributed to improve stressed-speech recognition (including loud and Lombard speech); increased resolution in the area of F2 occurrence.

  f_Expolog = 700 (10^(f/3988) - 1)        for 0 ≤ f ≤ 2000 Hz
  f_Expolog = 2595 log10(1 + f/700)        for 2000 < f ≤ 4000 Hz

Evaluation in an ASR task:
- Standard MFCC and PLP; variations MFCC-LPC and PLP-DCT – altered cepstrum extraction schemes.
- Expolog – Expolog FB replacing the trapezoid FB in PLP.
- 20Bands-LPC – uniform rectangular FB employed in the PLP front-end.
- Big1-LPC – derived from 20Bands-LPC with the first three bands merged – decreased resolution at frequencies disturbing for LE ASR.
- RFCC-DCT – repartitioned FB (19 bands, starting at 625 Hz) employed in MFCC.
- RFCC-LPC – the same repartitioned FB employed in PLP.
[Figure: WER (%) of each feature set on female digits under neutral, LE, CLE, and CLEF0 conditions]
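The Expolog mapping as reconstructed above can be written directly; reassuringly, the two branches meet almost exactly at 2 kHz (both give about 1521 Hz), which supports the reconstruction:

```python
import math

def expolog(f):
    """Expolog frequency warping (as reconstructed from the slides):
    exponential below 2 kHz, mel-style logarithmic above, defined to 4 kHz."""
    if 0.0 <= f <= 2000.0:
        return 700.0 * (10.0 ** (f / 3988.0) - 1.0)
    if f <= 4000.0:
        return 2595.0 * math.log10(1.0 + f / 700.0)
    raise ValueError("Expolog is defined for 0-4000 Hz")
```

Compared with the mel scale, the exponential branch compresses low frequencies less steeply, leaving relatively more filters in the F2 region.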
Slide 55 — LE Suppression in ASR: Frequency Warping

Maximum likelihood (ML) approach:
- Vocal tract length normalization (VTLN): mean formant locations are inversely proportional to vocal tract length (VTL); inter-speaker VTL variations are compensated by a frequency transformation (warping) F → F^α.
- The warping factor is searched to maximize the likelihood of the observations given the acoustic models: α̂ = argmax_α Pr(O^α | W, Θ).
- The factor is typically searched in the interval 0.8–1.2 (corresponding to the ratio of VTL differences between males and females).

Formant-driven (FD) approach:
- The warping factor is determined from estimated mean formant locations.

Slides 56–58 — Frequency Warping: VTLN Principle

[Figures: formant frequencies F1–F4 (0–4500 Hz) of Speaker1 (VTL1) plotted against the normalized speaker NORM (VTLNORM), for the cases VTL1 = VTLNORM and VTL1 > VTLNORM; a single linear warp aligns the formant pattern to the normalized speaker.]
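In practice the ML warp estimate reduces to a grid search over α. The sketch below assumes a placeholder `score_fn` standing in for the forced-alignment log-likelihood Pr(O^α | W, Θ) of warped features, together with a common piecewise-linear warp that keeps the band edge fixed; the 0.875 knee ratio is an assumption, not from the slides.

```python
import numpy as np

def warp_freq(f, alpha, f_max=4000.0, knee_ratio=0.875):
    """Piecewise-linear VTLN warp: scale by alpha below the knee, then
    interpolate linearly so that f_max maps onto itself."""
    knee = knee_ratio * f_max
    if f <= knee:
        return alpha * f
    return alpha * knee + (f - knee) * (f_max - alpha * knee) / (f_max - knee)

def ml_warp_factor(score_fn, alphas=np.arange(0.80, 1.21, 0.02)):
    """alpha_hat = argmax_alpha score_fn(alpha), searched on a grid
    over the typical 0.8-1.2 interval."""
    scores = [score_fn(a) for a in alphas]
    return float(alphas[int(np.argmax(scores))])
```

The upper linear segment is what allows α > 1 without pushing warped frequencies past the Nyquist-limited band edge.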
Slide 59 — Frequency Warping: VTLN Principle

[Figure: VTLN principle, case VTL1 > VTLNORM (continued).]

Slides 60–61 — Frequency Warping: VTLN vs. Lombard Effect

[Figures: under LE, the formant shifts of Speaker1 are no longer consistent with a single VTL ratio (case VTL1 ? VTLNORM). What warp to choose — one giving a good approximation of the low formants?]

Slide 62 — Frequency Warping: Generalized Transform

[Figure: ... or one giving a good approximation of the higher formants? A single linear warp cannot fit both, motivating a generalized transform.]
[Figure: generalized transform, case VTL1 ? VTLNORM — formant frequencies F1–F4 of Speaker1 vs. the normalized speaker.]

Slide 64 — Frequency Warping: Evaluation — VTLN vs. Generalized Transform

- The generalized transform better addresses LE-induced formant shifts.
- The formant-driven approach is less computationally demanding (no multiple alignment passes as in VTLN) but requires reliable formant tracking, which is a problem at low SNRs; the ML approach is more stable.

WER (%), utterance- and speaker-dependent VTLN (confidence intervals in parentheses):

                               Females                      Males
  Set                     Neutral      LE             Neutral     LE
  # Digits                2560         2560           1423        6303
  Baseline                4.3          33.6           2.2         22.9
                          (3.5–5.0)    (31.8–35.5)    (1.4–2.9)   (21.8–23.9)
  Utterance-dep. VTLN     3.6          28.2           1.8         16.6
                          (2.9–4.3)    (26.4–29.9)    (1.1–2.4)   (15.7–17.6)
  Speaker-dep. VTLN       4.0          27.7           1.8         17.4
                          (3.2–4.7)    (26.0–29.5)    (1.1–2.4)   (16.5–18.3)

WER (%), warped filter bank:

                               Females                      Males
  Set                     Neutral      LE             Neutral     LE
  # Digits                2560         2560           1423        6303
  Baseline bank           4.2          35.1           2.2         23.2
                          (3.4–5.0)    (33.3–37.0)    (1.4–2.9)   (22.1–24.2)
  Warped bank             4.4          23.4           1.8         15.7
                          (3.6–5.2)    (21.8–25.0)    (1.1–2.4)   (14.8–16.6)

Slide 65 — LE Suppression in ASR: Two-Stage Recognizer (TSR)

Tandem neutral/LE classifier with neutral/LE-dedicated recognizers:
- Improving ASR features for LE often results in a performance tradeoff on neutral speech.
- Idea: combine separate systems 'tuned' for neutral and LE speech, directed by a neutral/LE classifier.

  Speech signal → neutral/LE classifier → {neutral recognizer | LE recognizer} → estimated word sequence
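The routing logic of the two-stage recognizer is simple to express. In this sketch the classifier and the two recognizers are placeholder callables; the slides' actual components are the GMM/MLP neutral-LE classifier and the condition-tuned HMM recognizers.

```python
def two_stage_recognize(signal, classify, neutral_asr, lombard_asr):
    """Two-stage recognition: route the utterance to the recognizer
    dedicated to the condition picked by the neutral/LE classifier."""
    condition = classify(signal)              # 'neutral' or 'lombard'
    recognizer = neutral_asr if condition == 'neutral' else lombard_asr
    return recognizer(signal)
```

Because each utterance is decoded by only one recognizer, the scheme avoids the neutral-speech penalty of a single LE-hardened front-end, at the cost of classifier errors propagating into recognition.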
Slide 66 — Two-Stage Recognizer (TSR): Neutral/LE Classification

Proposal of a neutral/LE classifier:
- Search for a set of features providing good discriminability between neutral and LE speech.
- Requirements: classification independent of speaker, gender, and phonetic content.
- The set of analyzed features is extended with the slope of the short-term spectrum.

[Figure: spectral slope of the female vowel /a/ — magnitude (dB) vs. log frequency (Hz).]

Slide 67 — Two-Stage Recognizer (TSR): Neutral/LE Classification

Mean spectral slopes in voiced male/female speech, 0–8000 Hz (confidence intervals in parentheses):

          Neutral                                          LE
  Set  #N    T (s)  Slope (dB/oct)        σ (dB/oct)    #LE   T (s)  Slope (dB/oct)        σ (dB/oct)
  M    2587  618    −7.42 (−7.48;−7.36)   1.53          3532  1114   −5.32 (−5.37;−5.27)   1.55
  F    5558  1544   −6.15 (−6.18;−6.12)   1.30          5030  1926   −3.91 (−3.96;−3.86)   1.77

Overlap of neutral/LE spectral slope distributions (%):

  Set   0–8000 Hz  60–8000 Hz  60–5000 Hz  1k–5k Hz  0–1000 Hz  60–1000 Hz
  M     26.00      28.13       29.47       100.00    27.81      27.96
  F     26.20      28.95       16.76       100.00    25.75      22.18
  M+F   28.06      30.48       29.49       100.00    27.54      26.00

Classification feature set:
- A feature set providing superior classification performance on the development data set was found: SNR, spectral slope (60–1000 Hz), F0, and σF0.
- GMM and multi-layer perceptron (MLP) classifiers were trained on these features.

Slide 68 — Two-Stage Recognizer (TSR): Neutral/LE Classification

Binary classification task — decide between hypotheses H1 and H0 from the likelihoods P(o_i | H1) and P(o_i | H0):

- GMM classifier: GMM_N and GMM_LE score the acoustic observation (classification feature vector) o_i; each Gaussian component has the density
    P(o_i) = (2π)^(−n/2) |Σ|^(−1/2) exp( −½ (o_i − μ)^T Σ^(−1) (o_i − μ) ),
  and the posteriors Pr(N), Pr(LE) follow from the two model likelihoods.
- MLP classifier: the output activations give the posteriors Pr(N), Pr(LE) via
    softmax: f(q_j) = e^(q_j) / Σ_i e^(q_i),    sigmoid: f(q_j) = 1 / (1 + e^(−q_j)).
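The spectral-slope feature can be estimated by a least-squares fit of log-magnitude against log2(frequency), giving dB per octave. The band limits below match the 60–1000 Hz range selected above; the windowing choice and the regression over all band bins are implementation assumptions of this per-frame sketch, not necessarily the slides' exact estimator.

```python
import numpy as np

def spectral_slope_db_per_oct(x, fs, f_lo=60.0, f_hi=1000.0, window=True):
    """Estimate the short-term spectral slope (dB/octave): fit the
    magnitude spectrum in dB against log2(frequency) over [f_lo, f_hi]."""
    w = np.hanning(len(x)) if window else np.ones(len(x))
    spec = np.abs(np.fft.rfft(x * w))
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    mag_db = 20.0 * np.log10(spec[band] + 1e-12)     # epsilon avoids log(0)
    slope, _ = np.polyfit(np.log2(freqs[band]), mag_db, 1)
    return float(slope)
```

A spectrum whose magnitude falls as 1/f yields −20·log10(2) ≈ −6.02 dB/oct; Lombard speech, with its flatter tilt, produces slope values closer to zero (e.g. −3.9 vs. −6.2 dB/oct for female speech in the table above).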
Slide 69 — Two-Stage Recognizer (TSR): Neutral/LE Classification — Feature Set

[Figures: histograms of neutral vs. LE development samples (M+F) with fitted GMM PDFs and ANN posteriors Pr(N), Pr(LE) for SNR (dB) and spectral slope (dB/oct).]

Slide 70 — Two-Stage Recognizer (TSR): Neutral/LE Classification — Feature Set

[Figures: histograms of neutral vs. LE development samples (M+F) with fitted GMM PDFs and ANN posteriors Pr(N), Pr(LE) for F0 (0–500 Hz) and σF0 (0–120 Hz).]
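The GMM decision on slide 68 amounts to comparing the two model likelihoods of the feature vector. Below, a single diagonal-covariance Gaussian stands in for each GMM, and the means/variances in the usage example are illustrative placeholders, not trained values.

```python
import numpy as np

def diag_gauss_loglik(o, mu, var):
    """Log density of a diagonal-covariance Gaussian at observation o."""
    o, mu, var = (np.asarray(v, float) for v in (o, mu, var))
    return -0.5 * float(np.sum(np.log(2.0 * np.pi * var) + (o - mu) ** 2 / var))

def classify_neutral_le(o, model_n, model_le):
    """Pick the hypothesis whose model assigns the higher likelihood to
    the classification feature vector o; each model is (mu, var)."""
    ll_n = diag_gauss_loglik(o, *model_n)
    ll_le = diag_gauss_loglik(o, *model_le)
    return 'neutral' if ll_n >= ll_le else 'lombard'
```

With equal priors this maximum-likelihood rule equals the maximum-a-posteriori decision between Pr(N) and Pr(LE).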
Slide 71 — Two-Stage Recognizer (TSR): Neutral/LE Classification — Performance

Classification data sets:

  Set    #Utterances   Mean utterance duration T (s)   σT (s)
  Devel  2472          4.10                            1.60
  Open   1371          4.01                            1.50

Classification performance — UER (utterance error rate): the ratio of incorrectly classified utterances to all utterances (confidence intervals in parentheses).

GMM classifier:

  Set       #Utterances   UER (%)
  Devel FM  2472          6.6 (5.6–7.6)
  Open FM   1371          2.5 (1.7–3.3)
  Devel DM  2472          8.1 (7.0–9.2)
  Open DM   1371          2.8 (1.9–3.6)

MLP classifier:

  Set    #Utterances   UER (%)
  Train  2202          9.9 (8.7–11.1)
  CV     270           5.6 (2.8–8.3)
  Open   1371          1.6 (0.9–2.3)

Slide 72 — Two-Stage Recognizer (TSR): Overall Performance

Discrete recognizers are either good on neutral or on LE speech; the TSR routes each utterance (speech signal → neutral/LE classifier → neutral or LE recognizer → estimated word sequence).

WER (%) on female digits — 1439 real-neutral and 1837 real-LE utterances (confidence intervals in parentheses):

  System        Real – neutral    Real – LE
  PLP           4.3 (3.3–5.4)     48.1 (45.8–50.4)
  RFCC-LPC      6.5 (5.2–7.7)     28.3 (26.2–30.4)
  MLP TSR       4.2 (3.2–5.3)     28.4 (26.4–30.5)
  FM-GMLC TSR   4.4 (3.3–5.4)     28.4 (26.4–30.5)
  DM-GMLC TSR   4.4 (3.3–5.4)     28.4 (26.3–30.4)

Slide 73 — LE Suppression in ASR: Comparison of Proposed Methods

[Figure: comparison of the proposed techniques for LE-robust ASR — WER (%) of the neutral baseline, LE baseline, and after LE suppression for: model adaptation to LE (SI), model adaptation to LE (SD), voice conversion (CLE), modified FB (RFCC-LPC), VTLN recognition (utterance-dependent warp), formant warping, and MLP TSR.]

Slide 74 — Thank You

Thank you for your attention!