Speech Signal Feature Extraction

Lombard Speech Recognition
Hynek Bořil
Email: [email protected]
Slide 1
Speech and Speaker Recognition
SLIDES by John H.L. Hansen, 2007
Overview
• Model of Speech Production
• Automatic Speech Recognition
  – Feature Extraction
  – Acoustic Models
• Lombard Effect (LE)
  – Definition & Motivation
  – Acquisition of Corpus Capturing Lombard Effect
  – Analysis of Speech under LE
  – Methods Increasing ASR Robustness to LE
Speech Production

Model of speech production → understanding of speech signal structure → design of speech processing algorithms
Speech Production
Linear Model
[Figure: Linear model of speech production — voiced excitation: impulse train generator I(z) (pitch period) and glottal pulse model G(z), gain AV; unvoiced excitation: random noise generator N(z), gain AN; a voiced/unvoiced switch feeds the excitation uG(n), shaped by the vocal tract parameters, into the vocal tract model V(z) and radiation model R(z), producing the speech signal pL(n).]
Speech Production
Linear Model
[Figure: Linear model with the voiced-excitation spectrum highlighted — |I(F)G(F)| consists of harmonics at F0, 2F0, … (pitch period 1/F0 in time) with a spectral tilt of about −12 dB/oct.]
Speech Production
Linear Model
[Figure: Linear model with the unvoiced-excitation spectrum highlighted — the random noise generator output has a flat spectrum |N(F)| across frequency.]
Speech Production
Linear Model
[Figure: Linear model with the vocal tract and radiation spectra highlighted — |V(F)| shows formant resonances over 0–4000 Hz; the radiation model contributes |R(F)| ≈ +6 dB/oct. The all-pole vocal tract model is

    V(z) = G / (1 − Σ_{k=1..N} a_k·z^(−k))]
Speech Production
Linear Model
[Figure: Complete linear model of speech production — voiced excitation |I(F)G(F)| (harmonics at F0, 2F0, …, with −12 dB/oct tilt) or flat-spectrum noise |N(F)| is selected by the voiced/unvoiced switch and shaped by the all-pole vocal tract model V(z) = G / (1 − Σ_{k=1..N} a_k·z^(−k)) and the radiation model |R(F)| ≈ +6 dB/oct, producing the speech signal pL(n).]
Speech Production
Linguistic/Speaker Information in Speech Signal
• How is Linguistic Info Coded in Speech Signal?
  – Phonetic Contents
    · Energy: voiced phones (v) carry higher energy than unvoiced phones (uv)
    · Low formants: locations and bandwidths (reflect changes in the configuration of the vocal tract during speech production)
    · Spectral tilt: differs across phones, generally flatter for uv (due to changes in excitation and formant locations)
  – Other Cues
    · Pitch contour: important for distinguishing words in tonal languages (e.g., Chinese dialects)
• How is Speaker Identity Coded in Speech Signal?
  – Glottal Waveform
  – Vocal Tract Parameters
  – Prosody (intonation, rhythm, stress, …)
Speech Production
Phonetic Contents in Features
• Example 1 – First 2 Formants in US Vowels (Bond et al., 1989)
  [Figure: F1–F2 vowel chart — F1 200–800 Hz, F2 800–2200 Hz — showing /i/, /ae/, /a/, /u/ for the neutral condition.]

• Example 2 – Spectral Slopes in Czech Vowels

  Neutral vowel   #N      T (s)    Slope (dB/oct)       (dB/oct)
  /a/             454     69.03    −6.8 (−6.9; −6.7)    1.13
  /e/             1064    69.33    −5.6 (−5.7; −5.6)    1.06
  /i/             509     58.92    −5.0 (−5.1; −4.9)    1.15
  /o/             120     9.14     −8.0 (−8.1; −7.8)    0.91
  /u/             102     5.73     −6.1 (−6.3; −6.0)    0.77
Automatic Speech Recognition (ASR)
Architecture of HMM Recognizer
SPEECH SIGNAL → FEATURE EXTRACTION (MFCC/PLP) → SUB-WORD LIKELIHOODS (GMM/MLP) → DECODER (VITERBI) → ESTIMATED WORD SEQUENCE
(the decoder draws on the ACOUSTIC MODEL (HMM), LEXICON, and LANGUAGE MODEL (BIGRAMS))
Feature extraction – transformation of the time-domain acoustic signal into a representation more effective for the ASR engine: data dimensionality reduction, suppression of irrelevant (disturbing) signal components (speaker/environment/recording chain-dependent characteristics), while preserving phonetic content
Sub-word models – Gaussian Mixture Models (GMMs): mixtures of Gaussians used to model the distribution of feature vector parameters; Multi-Layer Perceptrons (MLPs): neural networks (much less common than GMMs)
Automatic Speech Recognition (ASR)
HMM-Based Recognition – Stages
[Figure (HTK Book, 2006): speech signal → feature extraction (windowing, …, cepstrum) → observation vectors o1 o2 o3 … → acoustic models (HMMs) + language model → word sequences → speech transcription.]
Automatic Speech Recognition (ASR)
Feature Extraction – MFCC
Mel Frequency Cepstral Coefficients (MFCC)
Davis & Mermelstein, IEEE Trans. Acoustics, Speech, and Signal Processing, 1980
MFCC is the first choice in current commercial ASR
s(n) → PREEMPHASIS → WINDOW (HAMMING) → |FFT|² → FILTER BANK (MEL) → Log(.) → IDCT → MFCC c(n)
Preemphasis: compensates for spectral tilt (speech production/microphone channel)
Windowing: suppression of transient effects in short-term segments of the signal
|FFT|²: energy spectrum (phase is discarded)
MEL filter bank: MEL scale models the logarithmic perception of frequency in humans; triangular filters provide dimensionality reduction
Log + IDCT: extraction of the cepstrum – deconvolution of glottal waveform, vocal tract function, and channel characteristics
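The MFCC chain above can be sketched in a few lines of NumPy. This is a minimal single-frame sketch; the sample rate, frame length, and filter count are illustrative assumptions, not values specified in the slides.

```python
import numpy as np

def mfcc_frame(signal, fs=8000, frame_len=200, n_filters=20, n_ceps=13):
    """Minimal MFCC sketch: preemphasis -> Hamming window -> |FFT|^2
    -> mel filter bank -> log -> DCT (the slide's 'IDCT' box)."""
    # Preemphasis: first-order high-pass compensating spectral tilt
    s = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Hamming window suppresses transient (edge) effects of the short segment
    frame = s[:frame_len] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frame)) ** 2  # energy spectrum, phase discarded
    # Mel filter bank: triangular filters spaced uniformly on the mel scale
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = inv_mel(np.linspace(0.0, mel(fs / 2.0), n_filters + 2))
    bins = np.floor(edges / fs * frame_len).astype(int)
    fbank = np.zeros((n_filters, frame_len // 2 + 1))
    for i in range(n_filters):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[i, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    log_e = np.log(np.maximum(fbank @ power, 1e-10))
    # DCT of log filter-bank energies -> cepstral coefficients
    k = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), k + 0.5) / n_filters)
    return basis @ log_e

# One frame of a 440 Hz tone -> 13 cepstral coefficients
c = mfcc_frame(np.sin(2 * np.pi * 440 * np.arange(400) / 8000.0))
```

In practice the frame loop, liftering, and delta coefficients are added on top of this per-frame computation.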
Automatic Speech Recognition (ASR)
Feature Extraction – MFCC & PLP
Perceptual Linear Predictive Coefficients (PLP)
Hermansky, Journal of Acoustical Society of America, 1990
An alternative to MFCC, used less frequently
Many stages similar to MFCC
Linear prediction – smoothing of the spectral envelope (may improve robustness)
PLP:  s(n) → WINDOW (HAMMING) → |FFT|² → FILTER BANK (BARK) → EQUAL LOUDNESS PREEMPHASIS → INTENSITY→LOUDNESS (·)^(1/3) → LINEAR PREDICTION → RECURSION → CEPSTRUM → PLP c(n)
MFCC: s(n) → PREEMPHASIS → WINDOW (HAMMING) → |FFT|² → FILTER BANK (MEL) → Log(.) → IDCT → MFCC c(n)
Automatic Speech Recognition (ASR)
Acoustic Models – GMM-HMM
Gaussian Mixture Models (GMMs)
Motivation: distributions of cepstral coefficients can be well modeled by a mixture (sum) of Gaussian functions
Example – distribution of c0 in a certain phone and the corresponding Gaussian (defined uniquely by its mean, variance, and weight)
[Figure: histogram of c0 (# samples, left) and the corresponding probability density function Pr(c0) (right), both plotted over m−σ, m, m+σ; the component weight scales the pdf.]
Multidimensional observations (c0, …, c12) → multidimensional Gaussians – defined uniquely by means, covariance matrices, and weights
GMMs – typically used to model parts of phones
Hidden Markov Models (HMMs)
States (GMMs) + transition probabilities between states
Models of whole phones; lexicon → word models built of phone models
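As a sketch of the modeling idea, a 1-D Gaussian mixture density can be evaluated directly from its means, variances, and weights. The numbers below are toy values, not parameters fitted to real cepstral data.

```python
import numpy as np

def gmm_pdf(x, means, variances, weights):
    """Probability density of a 1-D Gaussian mixture at points x.
    Each component is defined uniquely by its mean, variance, and weight."""
    x = np.asarray(x, dtype=float)[:, None]
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    weights = np.asarray(weights, dtype=float)
    comp = np.exp(-0.5 * (x - means) ** 2 / variances) / np.sqrt(2 * np.pi * variances)
    return comp @ weights  # weighted sum of component densities

# Toy two-component model of a cepstral coefficient for one phone state
grid = np.linspace(-10.0, 10.0, 2001)
pdf = gmm_pdf(grid, means=[-2.0, 1.5], variances=[1.0, 2.0], weights=[0.4, 0.6])
# Trapezoidal check: a valid density integrates to ~1
area = np.sum(0.5 * (pdf[1:] + pdf[:-1]) * np.diff(grid))
```

In an HMM state the same construction is applied per dimension (or with full covariance matrices) to vectors (c0, …, c12).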
Lombard Effect
Definition & Motivation
What is Lombard Effect?
When exposed to a noisy adverse environment, speakers modify the way they speak in an effort to maintain intelligible communication (Lombard Effect – LE)
Why is Lombard Effect Interesting?
Better understanding of the mechanisms of human speech communication (Can we intentionally change particular parameters of speech production to improve intelligibility, or is LE an automatic process learned through a feedback loop? How do the type of noise and the communication scenario affect LE?)
Mathematical modeling of LE → classification of LE level, speech synthesis in noisy environments, increasing robustness of automatic speech recognition and speaker identification systems
Lombard Effect
Motivation & Goals
Ambiguity in Past LE Investigations
LE has been studied since 1911; however, many investigations disagree on the observed impacts of LE on speech production
Analyses conducted typically on very limited data – a couple of utterances from a few subjects (1–10)
Lack of communication factor – a majority of studies ignore the importance of communication for evoking LE (an effort to convey a message over noise) → occurrence and level of LE in speech recordings is 'random' → contradicting analysis results
LE has been studied only for several world languages (English, Spanish, French, Japanese, Korean, Mandarin Chinese); no comprehensive study exists for any Slavic language
1st Goal
Design of a Czech Lombard Speech Database addressing the need for a communication factor and well-defined simulated noisy conditions
Systematic analysis of LE in spoken Czech
Lombard Effect
Motivation & Goals
ASR under LE
Mismatch between LE speech accompanied by noise and acoustic models trained on clean neutral speech
The strong impact of noise on ASR is well known, and a vast number of noise suppression/speech emphasis algorithms have been proposed in recent decades (yet no ultimate solution has been reached)
The negative impact of LE on ASR often exceeds that of noise; recent state-of-the-art ASR systems mostly ignore this issue
LE-Equalization Methods
LE-equalization algorithms typically operate in the following domains: robust features, LE transformation towards neutral, model adjustments, improved training of acoustic models
The algorithms display various degrees of efficiency and are often bound by strong assumptions preventing them from real-world application (applying fixed transformations to phonetic groups, requiring a known level of LE, etc.)
2nd Goal
Proposal of novel LE-equalization techniques with a focus on both the level of LE suppression and the extent of bounding assumptions
LE Corpora
Available Czech Corpora
Czech SPEECON – speech recordings from various environments including office and car
CZKCC – car recordings – include parked-car-with-engine-off and moving-car scenarios
Both databases contain speech produced in quiet and in noise → candidates for a study of LE; however, not good ones, as shown later
• Design/acquisition of an LE-oriented database – Czech Lombard Speech Database '05 (CLSD'05)
• Goals
  – Communication in simulated noisy background → high SNR
  – Phonetically rich data/extensive small-vocabulary material
  – Parallel utterances in neutral and LE conditions
Data Acquisition
Recording Setup
Simulated Noisy Conditions
Noise samples mixed with speech feedback and played to the speaker and operator over headphones
Operator qualifies the intelligibility of speech in noise – if an utterance is not intelligible, the operator asks the subject to repeat it → speakers are required to convey the message over noise → communication LE
Noises: mostly car noises from the Car2E database, normalized to 90 dB SPL
Speaker Sessions
14 male/12 female speakers
Each subject recorded both in neutral and simulated noisy conditions
[Figure: recording setup — speaker with close-talk and middle-talk microphones, hearing noise + speech feedback over headphones; H&T recorder; operator monitoring noise + speech and judging each utterance (OK – next / BAD – again).]
Data Acquisition
Impact of Headphones
Environmental Sound Attenuation by Headphones
Attenuation characteristics measured on a dummy head
Source of wide-band noise; measurement of the sound transfer to the dummy head's auditory canals with and without headphones
Attenuation characteristics – subtraction of the two transfers
Data Acquisition
Impact of Headphones
Environmental Sound Attenuation by Headphones
Directional attenuation – measured in a reflectionless sound booth
[Figure: polar plots of attenuation (dB) vs. angle (0–345°) at 1, 2, 4, and 8 kHz; attenuation-by-headphones curves for 0°, 90°, and 180° and the real attenuation in the recording room, plotted over frequency (100 Hz – 10 kHz).]
Speech Production under Lombard Effect
Speech Features affected by LE
Vocal tract excitation: glottal pulse shape changes, fundamental frequency rises
Vocal tract transfer function: center frequencies of low formants increase, formant
bandwidths reduce
Vocal effort (intensity) increases
Other: voiced phonemes are prolonged, the energy ratio of voiced to unvoiced segments increases, …
[Figure: linear model of speech production with the affected blocks highlighted — excitation (I(z), G(z)), vocal tract model V(z) = G / (1 − Σ_{k=1..N} a_k·z^(−k)), and the output level of pL(n).]
Analysis of Speech Features under LE
Fundamental Frequency
[Figure: distributions of fundamental frequency (number of samples vs. F0, 70–570 Hz) for female (F) and male (M) speakers — Czech SPEECON (office vs. car), CZKCC (engine off vs. engine on), and CLSD'05 (neutral vs. LE).]
Analysis of Speech Features under LE
Formant Locations
[Figure: F1–F2 vowel spaces of Czech digits — CZKCC (female, male) and CLSD'05 (female, male), comparing neutral (N) and LE conditions for short vowels /a/, /e/, /i/, /o/, /u/ and their long counterparts /a'/–/u'/.]
Analysis of Speech Features under LE
Formant Bandwidths
SPEECON, CZKCC: no consistent BW changes
CLSD‘05: significant BW reduction in many voiced phonemes
CZKCC (columns: male engine off, male engine on, female engine off, female engine on)
Vowel   B1M (Hz)  σ1M (Hz)  B1M (Hz)  σ1M (Hz)  B1F (Hz)  σ1F (Hz)  B1F (Hz)  σ1F (Hz)
/a/     207*      74        210*      84        275       97        299       78
/e/     125*      70        130*      78        156       68        186       79
/i/     124*      49        127*      44        105       44        136       53
/o/     275       87        222       67        263*      85        269*      73
/u/     187       100       170       89        174*      96        187*      101

CLSD'05 (columns: male neutral, male LE, female neutral, female LE)
Vowel   B1M (Hz)  σ1M (Hz)  B1M (Hz)  σ1M (Hz)  B1F (Hz)  σ1F (Hz)  B1F (Hz)  σ1F (Hz)
/a/     269       88        152       59        232       85        171       68
/e/     168       94        99        44        169       73        130       49
/i/     125       53        108       52        132*      52        133*      58
/o/     239       88        157       81        246       91        158       62
/u/     134*      67        142*      81        209       95        148       66
Analysis of Speech Features under LE
Phoneme Durations
Significant increase in duration in some phonemes, especially voiced phonemes
Some unvoiced consonants – duration reduction
Duration changes in CLSD’05 considerably exceed the ones in SPEECON and CZKCC
CZKCC (engine off vs. engine on)
Word    Phoneme  # OFF  TOFF (s)  σTOFF (s)  # ON  TON (s)  σTON (s)  Δ (%)
Nula    /a/      349    0.147     0.079      326   0.259    0.289     48.50
Jedna   /a/      269    0.173     0.076      251   0.241    0.238     39.36
Dva     /a/      245    0.228     0.075      255   0.314    0.311     38.04
Štiri   /r/      16     0.045     0.027      68    0.080    0.014     78.72
Sedm    /e/      78     0.099     0.038      66    0.172    0.142     72.58

CLSD'05 (neutral vs. LE)
Word    Phoneme  # N    TN (s)    σTN (s)    # LE  TLE (s)  σTLE (s)  Δ (%)
Jedna   /e/      583    0.031     0.014      939   0.082    0.086     161.35
Dvje    /e/      586    0.087     0.055      976   0.196    0.120     126.98
Čtiri   /r/      35     0.041     0.020      241   0.089    0.079     115.92
Pjet    /e/      555    0.056     0.033      909   0.154    0.089     173.71
Sedm    /e/      358    0.080     0.038      583   0.179    0.136     122.46
Osm     /o/      310    0.086     0.027      305   0.203    0.159     135.25
Devjet  /e/      609    0.043     0.022      932   0.120    0.088     177.20
Lombard Effect
Initial ASR Experiments
Digit Recognizer
Monophone HMM models
13 MFCC + ∆ + ∆∆
32 Gaussian mixtures per model state
ASR Evaluation – WER (Word Error Rate)

    WER = (D + S + I) / N × 100 %

where D – word deletions, S – word substitutions, I – word insertions, N – number of words in the reference
Noisy car recordings (SPEECON/CZKCC – 10.7/12.6 dB SNR); clean recordings (CLSD'05 LE – 40.9 dB SNR)

Set                  # Spkrs   # Digits   WER (%)
SPEECON Office F     22        880        5.5  (4.0–7.0)
SPEECON Office M     31        1219       4.3  (3.1–5.4)
SPEECON Car F        28        1101       4.6  (3.4–5.9)
SPEECON Car M        42        1657       10.5 (9.0–12.0)
CZKCC OFF F          30        1480       3.0  (2.1–3.8)
CZKCC OFF M          30        1323       2.3  (1.5–3.1)
CZKCC ON F           18        1439       13.5 (11.7–15.2)
CZKCC ON M           21        1450       10.4 (8.8–12.0)
CLSD'05 N F          12        4930       7.3  (6.6–8.0)
CLSD'05 N M          14        1423       3.8  (2.8–4.8)
CLSD'05 LE F         12        5360       42.8 (41.5–44.1)
CLSD'05 LE M         14        6303       16.3 (15.4–17.2)
LE Suppression in ASR
Model Adaptation
Model Adaptation
Often effective when only limited data from the given conditions are available
Maximum Likelihood Linear Regression (MLLR) – if there is only a limited amount of data per class, acoustically close classes are grouped and transformed together:

    μ'_MLLR = A·μ + b

Maximum a posteriori approach (MAP) – initial models are used as informative priors for the adaptation:

    μ'_MAP = N/(N + τ) · μ̄ + τ/(N + τ) · μ

where μ̄ is the mean of the adaptation data, μ the prior (initial-model) mean, N the amount of adaptation data, and τ the prior weight
Adaptation Procedure
First, neutral speaker-independent (SI) models are transformed by MLLR, employing clustering (binary regression tree)
Second, MAP adaptation is applied – only for nodes with a sufficient amount of adaptation data
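The two mean updates can be illustrated with a minimal numeric sketch; the matrices, counts, and weights below are toy values, not parameters from the adaptation experiments.

```python
import numpy as np

def mllr_mean(mu, A, b):
    """MLLR: affine transform of a Gaussian mean, mu' = A @ mu + b."""
    return A @ mu + b

def map_mean(mu_prior, mu_data, n, tau):
    """MAP: interpolation between the adaptation-data mean (weight N)
    and the prior model mean (weight tau)."""
    return (n * mu_data + tau * mu_prior) / (n + tau)

mu = np.array([1.0, -0.5])                       # a neutral-model mean
A = np.array([[1.1, 0.0], [0.0, 0.9]])           # shared transform of a class
b = np.array([0.2, -0.1])
adapted = mllr_mean(mu, A, b)
# With ample adaptation data (N >> tau) the MAP estimate leans toward the data mean
smoothed = map_mean(mu, np.array([2.0, 0.0]), n=8, tau=2)
```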
LE Suppression in ASR
Model Adaptation
Adaptation Schemes
Speaker-independent adaptation (SI) – group dependent/independent
Speaker-dependent adaptation (SD) – to neutral/LE
[Figure: WER (%) for model adaptation to conditions and speakers — SI adaptation to LE (same speakers / disjunct speakers) and SD adaptation to neutral / to LE, evaluated on LE digits and LE sentences against the unadapted baselines.]
LE Suppression in ASR
Data-Driven Design of Robust Features
Filter Bank Approach
Analysis of the importance of frequency components for ASR
Repartitioning the filter bank (FB) to emphasize components carrying phonetic information and suppress disturbing components
Initial FB uniformly distributed on a linear scale – equal attention to all components
Consecutively, a single FB band is omitted → what is the impact on WER?
Omitting bands carrying more information will result in a considerable WER increase
Implementation
MFCC front-end; MEL scale replaced by a linear scale, triangular filters replaced by rectangular filters without overlap
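The uniform rectangular filter bank and the band-omission probe can be sketched as follows; the band and FFT-bin counts are illustrative assumptions.

```python
import numpy as np

def rect_filterbank(n_bands, n_fft_bins, omit=None):
    """Uniform, non-overlapping rectangular filter bank on a linear scale.
    `omit` drops one band, probing that band's importance for ASR."""
    edges = np.linspace(0, n_fft_bins, n_bands + 1).astype(int)
    fb = np.zeros((n_bands, n_fft_bins))
    for i in range(n_bands):
        fb[i, edges[i]:edges[i + 1]] = 1.0   # rectangular filter over its bins
    if omit is not None:
        fb = np.delete(fb, omit, axis=0)
    return fb

fb = rect_filterbank(n_bands=20, n_fft_bins=256)
fb_wo_first = rect_filterbank(20, 256, omit=0)   # first band removed
```

Each variant is then plugged into the MFCC front-end in place of the mel filter bank, and the resulting WER is compared against the full-bank baseline.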
Data-Driven Design of Robust Features
Importance of Frequency Components
[Figure: WER (%) vs. omitted band (0–20) for neutral speech (top) and LE speech (bottom); filter-bank cut-off frequencies span up to 4 kHz; features c0, c1, …, c12; Δc0, …, Δc12; ΔΔc0, …, ΔΔc12.]

Area of 1st and 2nd formant occurrence – highest portion of phonetic information; F1 is more important for neutral speech, F1–F2 for LE speech recognition
Omitting the 1st band considerably improves LE ASR while reducing performance on neutral speech → tradeoff
Next step – how much of the low-frequency content should be omitted for LE ASR?
Lombard Effect
Optimizing Filter Banks – Omitting Low Frequencies
[Figure: WER (%) vs. omitted low-frequency bandwidth (0–1200 Hz) for neutral speech (top) and LE speech (bottom); features c0, c1, …, c12; Δc0, …, Δc12; ΔΔc0, …, ΔΔc12.]
Data-Driven Design of Robust Features
Omitting Low Frequencies
Effect of Omitting Low Spectral Components
• Increasing the FB low cut-off results in an almost linear increase of WER on neutral speech while considerably enhancing ASR performance on LE speech
• Optimal low cut-off found at 625 Hz

Devel set WER (%):
Set               Neutral           LE
LFCC, full band   4.8 (4.1–5.5)     29.0 (27.5–30.5)
LFCC, ≥ 625 Hz    6.6 (5.8–7.4)     15.6 (14.4–16.8)
Data-Driven Design of Robust Features
Increasing Filter Bank Resolution
Increasing Frequency Resolution
• Idea – emphasize the high-information portion of the spectrum by increasing FB resolution
• Experiment – FB decimation from 19 → 12 bands (decreasing computational costs)
• Increasing the number of filters at the peak of the information distribution curve → deterioration of LE ASR (17.2 % → 26.9 %)
• Slight F1–F2 shifts due to LE affect cepstral features
• No simple recipe for deriving an efficient FB from the information distribution curves

[Figure: WER (%) on LE speech vs. omitted band (0–13) for the decimated filter bank; the 625 Hz cut-off is marked.]
Data-Driven Design of Robust Features
Increasing Filter Bank Resolution
Consecutive Filter Bank Repartitioning
Consecutively, from lowest to highest, each FB high cut-off is varied, while the rest of the FB above it is redistributed uniformly across the remaining frequency band
The cut-off yielding a local WER minimum is fixed and the procedure is repeated for the adjacent higher cut-off → WER reduced by 2.3 % on LE and by 1 % on neutral speech (example – 6-band FB)
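The greedy repartitioning loop can be sketched as follows. `wer_of` is a placeholder for the full recognition evaluation on LE speech with the given band edges (in the experiments, a complete recognition pass per candidate); the candidate count is an assumption.

```python
import numpy as np

def repartition_filterbank(n_bands, f_max, wer_of, candidates_per_band=20):
    """Greedy consecutive repartitioning of filter-bank band edges.

    Starting from a uniform partition, vary each inner edge from lowest to
    highest; for every candidate, redistribute the edges above it uniformly,
    score with wer_of(edges), and fix the edge at the local WER minimum.
    """
    edges = np.linspace(0.0, f_max, n_bands + 1)  # uniform initial partition
    for b in range(1, n_bands):                   # vary inner edges low -> high
        best_edge, best_wer = edges[b], np.inf
        # candidate high cut-offs for band b, strictly between its neighbours
        for cand in np.linspace(edges[b - 1], f_max, candidates_per_band + 2)[1:-1]:
            trial = edges.copy()
            trial[b] = cand
            # redistribute the remaining edges uniformly above the candidate
            trial[b + 1:] = np.linspace(cand, f_max, n_bands - b + 1)[1:]
            w = wer_of(trial)
            if w < best_wer:
                best_edge, best_wer = trial[b], w
        edges[b] = best_edge
        edges[b + 1:] = np.linspace(best_edge, f_max, n_bands - b + 1)[1:]
    return edges
```

With a toy cost that prefers a first edge near 625 Hz, `repartition_filterbank(6, 4000.0, ...)` returns a monotone 7-edge partition whose first inner edge lands near 625 Hz.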
[Figure: WER (%) on LE speech (13–27 %) versus the critical (cut-off) frequency of each of Bands 1–6, 500–4000 Hz.]
Data-Driven Design of Robust Features
Standard vs. Novel Features
State-of-the-Art LE Front-End – Expolog (Bou-Ghazale & Hansen, 2000)
 FB redistributed to improve stressed speech recognition (including loud and Lombard speech)
Increased resolution in the area of F2 occurrence:

Expolog(f) = 700 · (10^(f/3988) − 1),          0 ≤ f ≤ 2000 Hz
Expolog(f) = 2595 · log10(1 + f/700),          2000 < f ≤ 4000 Hz

[Figure: Expolog frequency (Hz) plotted against linear frequency (Hz).]
Data-Driven Design of Robust Features
Standard vs. Novel Features
Evaluation in ASR Task
Standard MFCC, PLP; variations MFCC-LPC, PLP-DCT – altered cepstrum extraction schemes
Expolog – Expolog FB replacing the trapezoid FB in PLP
20Bands-LPC – uniform rectangular FB employed in the PLP front-end
Big1-LPC – derived from 20Bands-LPC, first 3 bands merged – decreased resolution at frequencies disturbing for LE ASR
RFCC-DCT – repartitioned FB, 19 bands, starting at 625 Hz, employed in MFCC
RFCC-LPC – repartitioned FB employed in PLP
Data-Driven Design of Robust Features
Standard vs. Novel Features
[Figure: feature performance on female digits – WER (%) (0–80) for Neutral, LE, CLE, and CLEF0 conditions across front-ends: MFCC, MFCC-LPC, PLP, PLP-DCT, Expolog, 20Bands-LPC, Big1-LPC, RFCC-DCT, RFCC-LPC.]
LE Suppression in ASR
Frequency Warping
Maximum Likelihood (ML) Approach
Vocal tract length normalization (VTLN): mean formant locations are inversely proportional to vocal tract length (VTL) → compensation for inter-speaker VTL variations by a frequency transformation (warping): F → F_α

The warping factor α is searched so as to maximize the likelihood of the observations given the acoustic models:

α̂ = argmax_α Pr(O_α | W, Θ)

The factor α is typically searched in the interval 0.8–1.2 (corresponding to the ratio of VTL differences between males and females)
Formant-Driven (FD) Approach
 Warping factor determined from estimated mean formant locations
Frequency Warping
VTLN – Principle
[Figure: formant frequencies F1–F4 (0–4500 Hz) of Speaker1 (VTL1) plotted against those of the normalized speaker NORM (VTL_NORM); case VTL1 = VTL_NORM – the formants lie on the diagonal, no warping needed.]
Frequency Warping
VTLN – Principle
[Figure: the same plot for the case VTL1 > VTL_NORM – Speaker1's formants deviate from the diagonal, and a linear frequency warp maps them toward the normalized speaker's formants.]
Frequency Warping
VTLN vs. Lombard Effect
[Figure: the same formant plot for the case VTL1 ? VTL_NORM – under LE, Speaker1's formants are shifted inconsistently, so no single VTL ratio fits. What to choose: a warp that gives a good approximation of the low formants, or of the higher formants?]
Frequency Warping
Generalized Transform
[Figure: instead of a single linear warp, a generalized transform is fitted that approximates both the lower and the higher formants of Speaker1 with respect to the normalized speaker.]
Frequency Warping
Evaluation – VTLN vs. Generalized Transform
The generalized transform better addresses LE-induced formant shifts
The formant-driven approach is less computationally demanding (no multiple alignment passes as in VTLN), but requires reliable formant tracking → problematic at low SNRs → the ML approach is more stable
Utterance- and speaker-dependent VTLN (ML approach), WER (%):

Set                        Females                   Males
                           Neutral      LE           Neutral     LE
# Digits                   2560         2560         1423        6303
Baseline                   4.3          33.6         2.2         22.9
                           (3.5–5.0)    (31.8–35.5)  (1.4–2.9)   (21.8–23.9)
Utterance-dependent VTLN   3.6          28.2         1.8         16.6
                           (2.9–4.3)    (26.4–29.9)  (1.1–2.4)   (15.7–17.6)
Speaker-dependent VTLN     4.0          27.7         1.8         17.4
                           (3.2–4.7)    (26.0–29.5)  (1.1–2.4)   (16.5–18.3)

Formant-driven warping, WER (%):

Set                        Females                   Males
                           Neutral      LE           Neutral     LE
# Digits                   2560         2560         1423        6303
Baseline bank              4.2          35.1         2.2         23.2
                           (3.4–5.0)    (33.3–37.0)  (1.4–2.9)   (22.1–24.2)
Warped bank                4.4          23.4         1.8         15.7
                           (3.6–5.2)    (21.8–25.0)  (1.1–2.4)   (14.8–16.6)
LE Suppression in ASR
Two-Stage Recognizer (TSR)
Tandem Neutral/LE Classifier – Neutral/LE Dedicated Recognizers
Improving ASR features for LE often results in a performance tradeoff on neutral speech
Idea – combine separate systems 'tuned' for neutral and LE speech, directed by a neutral/LE classifier
[Diagram: Speech Signal → Neutral/LE Classifier → Neutral Recognizer / LE Recognizer → Estimated Word Sequence.]
Two-Stage Recognizer (TSR)
Neutral/LE Classification
Proposal of Neutral/LE Classifier
Search for a set of features providing good discriminability between neutral and LE speech
Requirements – speaker-, gender-, and phonetic-content-independent classification
The set of analyzed features is extended by the slope of the short-term spectrum
[Figure: spectral slope of a female vowel /a/ – magnitude (dB, -80 to 60) versus log frequency (10–10^4 Hz), with the fitted slope line.]
Two-Stage Recognizer (TSR)
Neutral/LE Classification
Mean Spectral Slopes in Voiced Male/Female Speech
Neutral (0–8000 Hz):
Set   # N     T (s)    Slope (dB/oct)          σ (dB/oct)
M     2587    618      -7.42 (-7.48; -7.36)    1.53
F     5558    1544     -6.15 (-6.18; -6.12)    1.30

LE (0–8000 Hz):
Set   # LE    T (s)    Slope (dB/oct)          σ (dB/oct)
M     3532    1114     -5.32 (-5.37; -5.27)    1.55
F     5030    1926     -3.91 (-3.96; -3.86)    1.77
Overlap of Neutral/LE Spectral Slope Distributions
Neutral–LE distribution overlap (%):
Set    0–8000 Hz   60–8000 Hz   60–5000 Hz   1k–5k Hz   0–1000 Hz   60–1000 Hz
M      26.00       28.13        29.47        100.00     27.81       27.96
F      26.20       28.95        16.76        100.00     25.75       22.18
M+F    28.06       30.48        29.49        100.00     27.54       26.00
Classification Feature Set
A feature set providing superior classification performance on the development data set was found:
SNR, spectral slope (60–1000 Hz), F0, ∆F0
GMM and multi-layer perceptron (MLP) classifiers were trained on these features
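The slope feature can be computed, for example, by linear regression of the dB spectrum against log2 frequency, so the fitted slope is directly in dB per octave. This is a sketch; the slides do not specify the exact estimator.

```python
import numpy as np

def spectral_slope_db_per_oct(spectrum_db, freqs_hz, f_lo=60.0, f_hi=1000.0):
    """Slope of the magnitude spectrum in dB/octave over [f_lo, f_hi].

    Fits a line to magnitude (dB) versus log2(frequency); one unit on the
    log2 axis is one octave, so the regression slope is in dB/oct. The
    60-1000 Hz band is the one reported as most discriminative above.
    """
    sel = (freqs_hz >= f_lo) & (freqs_hz <= f_hi)   # restrict to the band
    octaves = np.log2(freqs_hz[sel])
    slope, _intercept = np.polyfit(octaves, spectrum_db[sel], 1)
    return float(slope)
```

On a synthetic spectrum falling exactly 6 dB per octave, the estimator recovers -6 dB/oct.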
Two-Stage Recognizer (TSR)
Neutral/LE Classification
Binary Classification Task
Decision statistic – likelihood ratio of the two hypotheses:  P(o_i | H1) / P(o_i | H0)

GMM Classifier – two parallel models, GMM_N and GMM_LE, score the acoustic observation o_i (the classification feature vector) and yield Pr(N) and Pr(LE); each Gaussian component has the density

P(o_i) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp( -(1/2) (o_i - μ)^T Σ^(-1) (o_i - μ) )

MLP Classifier – feed-forward network on the classification feature vector; hidden units use the sigmoid

f1(q_j) = 1 / (1 + e^(-q_j))        (Sigmoid)

and the output layer the softmax

f2(q_j) = e^(q_j) / Σ_{i=1..M} e^(q_i)        (Softmax)

producing the posteriors Pr(N) and Pr(LE).
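The building blocks above translate directly into code. This is a minimal sketch: the real systems use Gaussian mixtures and trained MLP weights, whereas here single Gaussians and hand-set parameters stand in for them.

```python
import numpy as np

def gaussian_logpdf(o, mu, cov):
    """Log of the multivariate Gaussian density P(o_i) from the slide."""
    d = o - mu
    n = len(mu)
    return -0.5 * (n * np.log(2 * np.pi) + np.log(np.linalg.det(cov))
                   + d @ np.linalg.inv(cov) @ d)

def sigmoid(q):
    """MLP hidden layer: f1(q_j) = 1 / (1 + exp(-q_j))."""
    return 1.0 / (1.0 + np.exp(-q))

def softmax(q):
    """MLP output layer: f2(q_j) = exp(q_j) / sum_i exp(q_i)."""
    e = np.exp(q - q.max())          # shift for numerical stability
    return e / e.sum()

def classify(o, mu_n, cov_n, mu_le, cov_le):
    """Decide LE if the log-likelihood ratio log P(o|LE) - log P(o|N) > 0."""
    return gaussian_logpdf(o, mu_le, cov_le) - gaussian_logpdf(o, mu_n, cov_n) > 0
```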
Two-Stage Recognizer (TSR)
Neutral/LE Classification – Feature Set
[Figures: development-set histograms (normalized sample counts) of SNR (dB) and spectral slope (dB/oct) for neutral (Dev_N_M+F) and LE (Dev_LE_M+F) speech, overlaid with the GMM PDFs (PDF_N, PDF_LE) and the ANN posteriors (Pr(N), Pr(LE)).]
Two-Stage Recognizer (TSR)
Neutral/LE Classification – Feature Set
[Figures: the corresponding histograms, GMM PDFs, and ANN posteriors for F0 (0–500 Hz) and ∆F0 (0–120 Hz).]
Two-Stage Recognizer (TSR)
Neutral/LE Classification – Performance
Classification Data Sets

Set     # Utterances   T_utter (s)   σ_T (s)
Devel   2472           4.10          1.50
Open    1371           4.01          1.60
Classification Performance
UER – Utterance Error Rate – the ratio of incorrectly classified utterances to all utterances
GMM classifier:
Set        # Utterances   UER (%)
Devel FM   2472           6.6  (5.6–7.6)
Open FM    1371           2.5  (1.7–3.3)
Devel DM   2472           8.1  (7.0–9.2)
Open DM    1371           2.8  (1.9–3.6)

MLP classifier:
Set     # Utterances   UER (%)
Train   2202           9.9  (8.7–11.1)
CV      270            5.6  (2.8–8.3)
Open    1371           1.6  (0.9–2.3)
Two-Stage Recognizer (TSR)
Overall Performance
Discrete Recognizers
Each single front-end performs well either on neutral or on LE speech, not both
[Diagram: Speech Signal → Neutral/LE Classifier → Neutral Recognizer / LE Recognizer → Estimated Word Sequence.]

WER (%):

Set               Real – neutral     Real – LE
# Female digits   1439               1837
PLP               4.3  (3.3–5.4)     48.1  (45.8–50.4)
RFCC-LPC          6.5  (5.2–7.7)     28.3  (26.2–30.4)
MLP TSR           4.2  (3.2–5.3)     28.4  (26.4–30.5)
FM–GMLC TSR       4.4  (3.3–5.4)     28.4  (26.4–30.5)
DM–GMLC TSR       4.4  (3.3–5.4)     28.4  (26.3–30.4)
LE Suppression in ASR
Comparison of Proposed Methods
[Figure: comparison of proposed techniques for LE-robust ASR – WER (%) (0–60) for Baseline Neutral, Baseline LE, and with LE suppression, across: Model Adaptation to LE (SI), Model Adaptation to LE (SD), Voice Conversion (CLE), Modified FB (RFCC-LPC), VTLN (utterance-dependent warp), Formant Warping, and MLP TSR.]
Thank You
Thank You for Your Attention!