Neural Networks for Speech Recognition
Steve Renals
Centre for Speech Technology Research
University of Edinburgh
[email protected]
Outline
Some history
The state of the art
Looking ahead
A brief history of Neural Networks
The Perceptron (Rosenblatt, early 1960s)
NN Winter #1
Perceptrons (Minsky and Papert, 1969): the XOR problem
MLPs and backprop (mid 1980s)
MLPs and backprop
• Train multiple layers of hidden units – nested nonlinear functions
• Powerful feature detectors
• Posterior probability estimation
• Theorem: any function can be approximated with a single hidden layer
[Figure: MLP with inputs x_i, hidden units z_j = h(b_j) connected by weights w_ji^(1), and outputs y_1 … y_K connected by weights w_kj^(2); errors δ_1 … δ_K backpropagated from the output layer.]
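As a minimal sketch of the forward pass described above (layer sizes borrowed from the acoustic-model slides later in the talk, random weights purely for illustration):

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Single hidden layer: z_j = h(sum_i w_ji^(1) x_i + b_j), softmax outputs y_k."""
    z = np.tanh(W1 @ x + b1)      # hidden activations (nonlinearity h)
    a = W2 @ z + b2               # output pre-activations
    e = np.exp(a - a.max())       # numerically stable softmax
    return e / e.sum()            # class posteriors, sum to 1

rng = np.random.default_rng(0)
x = rng.normal(size=351)          # e.g. 39-dim features spliced over 9 frames
W1, b1 = 0.01 * rng.normal(size=(2000, 351)), np.zeros(2000)
W2, b2 = 0.01 * rng.normal(size=(45, 2000)), np.zeros(45)
y = mlp_forward(x, W1, b1, W2, b2)
```

The softmax output is what licenses reading the network's outputs as posterior probabilities over phone classes.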
Neural network acoustic models (1990s)
“Hybrid systems”
Bourlard & Morgan, 1994; Robinson, IEEE TNN 1994; Renals, Morgan, Cohen & Franco, ICASSP 1992

[Figure: DARPA RM 1992, error (%) vs millions of parameters for CI-HMM, CI-MLP, CI-RNN, CD-HMM, and MIX systems; error axis spans 0–11%.]

[Figure: broadcast news system – speech analysed by Perceptual Linear Prediction and Modulation Spectrogram front ends, feeding CI-MLP, CI-RNN, and CD-RNN acoustic models and Chronos decoders, combined with ROVER to give the utterance hypothesis.]

Broadcast news 1998: 20.8% WER (best GMM-based system: 13.5%)
Cook, Christie, Ellis, Fosler-Lussier, Gotoh, Kingsbury, Morgan, Renals, Robinson & Williams, DARPA, 1999
Neural network acoustic models (1990s)
Limitations compared with GMMs
• Computationally restricted to monophone outputs
  • CD-RNN factored over multiple networks – limited within-word context
• Training not easily parallelisable
  • experimental turnaround slower
  • systems less complex (fewer parameters)
    • RNN – <100k parameters
    • MLP – ~1M parameters
• Rapid adaptation hard (cf. MLLR)
NN Winter #2
[Figure: hidden-unit activation plots for the phone contexts s-iy+l, f-iy-l, t-iy-n, t-iy-m.]
Discriminative long-term features – Tandem
• A neural network-based technique provided the biggest increase in accuracy in speech recognition during the 2000s
• Tandem features (Hermansky, Ellis & Sharma, 2000)
  • use (transformed) outputs or (bottleneck) hidden values as input features for a GMM
  • deep networks – e.g. 5-layer MLP to obtain bottleneck features (Grézl, Karafiát, Kontár & Černocký, 2007)
  • reduces errors by about 10% relative (Hain, Burget, Dines, Garner, Grezl, el Hannani, Huijbregts, Karafiat, Lincoln & Wan, 2012)
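A minimal sketch of bottleneck feature extraction, with hypothetical layer sizes (the narrow 25-dim middle layer plays the role of the bottleneck whose activations feed the GMM):

```python
import numpy as np

def bottleneck_features(x, layers, bottleneck):
    """Run the MLP forward and return the activations of the (narrow)
    bottleneck hidden layer, to be used as GMM input features."""
    h = x
    for i, (W, b) in enumerate(layers):
        h = np.tanh(W @ h + b)
        if i == bottleneck:
            return h            # stop here: these are the tandem features
    return h

rng = np.random.default_rng(0)
sizes = [351, 1500, 1500, 25, 1500, 44]   # 5-layer MLP, 25-dim bottleneck
layers = [(0.01 * rng.normal(size=(m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
feat = bottleneck_features(rng.normal(size=351), layers, bottleneck=2)
```

In a real tandem system the network is first trained to classify phones; the bottleneck activations are then extracted for every frame and appended to (or replace) the spectral features.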
Neural network acoustic models (2010s)

[Figure: two DNN acoustic models, each with 3–8 hidden layers of 2000 hidden units, taking MFCC inputs (39 × 9 = 351 dimensions); the Tandem network has 45 CI phone outputs, the Hybrid network 12000 CD phone outputs.]

Dahl, Yu, Deng & Acero, IEEE TASLP 2012
Hinton, Deng, Yu, Dahl, Mohamed, Jaitly, Senior, Vanhoucke, Nguyen, Sainath & Kingsbury, IEEE SP Mag 2012
Deep neural networks: What’s new?
1. Unsupervised pretraining (Hinton, Osindero & Teh, 2006)
  • Train a stacked RBM generative model, then finetune
  • Good initialisation
  • Regularisation
2. Deep – many hidden layers
  • Deeper models more accurate
  • GPUs gave us the computational power
3. Wide output layer (context-dependent phone classes) rather than factorised into multiple nets
  • More accurate phone models
  • GPUs gave us the computational power
(Adaptation is still hard . . .)
DNNs improve the state-of-the-art in ASR
Peter Bell, Paweł Świętojański, Arnab Ghoshal
DNNs improve the state-of-the-art
#1 – TED Talks
Number of hidden layers

[Figure: WER (%) on dev2010 and tst2010 vs number of hidden layers (1–6); WER axis spans 17–23%.]
MLAN: Multi-Level Adaptive Networks

[Figure: DNNs trained on out-of-domain (OOD) data generate OOD posterior features for the in-domain data; combined with in-domain PLP features these give tandem features (OOD Tandem HMMs); in-domain DNNs trained on the tandem features give MLAN features, used in Tandem MLAN HMMs and Hybrid MLAN HMMs.]

Bell, Swietojanski & Renals, 2013
TED Talks System

[Figure: segmented speech recordings → extract OOD tandem features → CMLLR adaptation (×32) → Tandem MLAN decode (3-gram LM) → CMLLR adaptation (×1) → Tandem MLAN + SAT and Hybrid MLAN + SAT decodes (3-gram LM) → lattice rescoring (4-gram LM) → 100-best rescoring (fRNN LM) → ROVER combination.]

Bell, Yamamoto, Swietojanski, Wu, McInnes, Hori & Renals, 2013
WER results (IWSLT tst2010)

[Figure: bar chart of WER (%): PLP 31.7, falling to 20.3 with +MPE and +SAT; Tandem 17.9; Hybrid 17.6; Tandem MLAN 16.4, falling to 12.8 with +SAT, +LM and +RNN; Hybrid MLAN 16.4, falling to 12.7; ROVER combination 11.7.]
DNNs improve the state-of-the-art
#2 – Sequential optimisation for Switchboard recognition
Training a neural network

[Figure: DNN with feature extraction input, hidden layers, and a softmax output layer acting as a log-linear classifier.]

F_CE = Σ_u Σ_t log y_ut(s_ut)

• Trained using gradient descent (backprop)
• Cross entropy between network outputs and labels
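The frame-level criterion above can be sketched directly (toy posteriors, not a real system):

```python
import numpy as np

def frame_cross_entropy(posteriors, labels):
    """F_CE = sum over utterances u and frames t of log y_ut(s_ut),
    the log network posterior of the correct HMM state at each frame."""
    total = 0.0
    for y_u, s_u in zip(posteriors, labels):   # utterances u
        for t, s in enumerate(s_u):            # frames t, state labels s_ut
            total += np.log(y_u[t, s])
    return total

# one utterance, 2 frames, 2 states
y = [np.array([[0.5, 0.5],
               [0.25, 0.75]])]
s = [[0, 1]]
f_ce = frame_cross_entropy(y, s)   # log 0.5 + log 0.75
```

Training maximises F_CE (equivalently, gradient descent on its negative), so each frame pushes up the posterior of its labelled state.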
Training a neural network sequence discriminatively

[Figure: the same DNN with a softmax output layer / log-linear classifier.]

F_sMBR = Σ_u [ Σ_W p(O_u|S) P(W) A(W, W_u) ] / [ Σ_{W′} p(O_u|S′) P(W′) ]

• Alphanet trained sequence discriminatively (Bridle & Dodd, 1991)
• Lattice-based training of MLP (Kingsbury, 2009)
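A toy sketch of this criterion for a single utterance, enumerating the competing hypotheses explicitly rather than storing them in a lattice (the scores and accuracies are made up for illustration):

```python
import numpy as np

def smbr_utterance(acoustic, lm, accuracy):
    """Expected state accuracy for one utterance:
    F = sum_W p(O_u|S) P(W) A(W, W_u) / sum_W' p(O_u|S') P(W'),
    i.e. accuracy A averaged under the posterior over hypotheses W."""
    w = np.asarray(acoustic) * np.asarray(lm)   # path scores p(O|S) P(W)
    return float((w * np.asarray(accuracy)).sum() / w.sum())

# two competing hypotheses: the reference (accuracy 1.0) and one error (0.0)
f = smbr_utterance(acoustic=[2.0, 1.0], lm=[0.5, 0.5], accuracy=[1.0, 0.0])
```

The gradient of this expected accuracy with respect to the network outputs is what distinguishes sequence training from the per-frame cross-entropy criterion: raising the score of accurate paths lowers the posterior of inaccurate competitors.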
Switchboard experiments
• Training set: 300 hours (Switchboard-1 Release 2)
• 30k word pronunciation dictionary
• Smoothed trigram LM trained on 3M words of training transcripts, interpolated with 11M words of Fisher-1
• GMMs trained on 40-dim LDA+STC features from 7 frames (±3) of 13-dim MFCCs (C0–C12)
• DNNs trained on 40-dim LDA+STC+FMLLR features, 11 frames (±5) input
Vesely, Ghoshal, Burget & Povey, Interspeech 2013
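The 11-frame (±5) input splicing can be sketched as follows (feature values are placeholders; real systems would use the LDA+STC+FMLLR features above):

```python
import numpy as np

def splice(frames, context=5):
    """Stack each frame with its +/-context neighbours (edge frames repeated
    at utterance boundaries) to form one DNN input vector per frame."""
    T, d = frames.shape
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[t:t + 2 * context + 1].reshape(-1)
                     for t in range(T)])

feats = np.arange(4000.0).reshape(100, 40)   # 100 frames of 40-dim features
spliced = splice(feats, context=5)           # (100, 440): 11 frames x 40 dims
```

Each row of `spliced` is the concatenation of 11 consecutive frames, with the original frame sitting in the centre slot.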
Switchboard Results
Hub5 ’00 test set, 300 hour training set (http://kaldi.sf.net)

WER (%)        SWB    CHE    AVE
GMM/BMMI       18.6   33.0   25.8
DNN/CE         14.2   25.7   20.0
DNN/sMBR       12.6   24.1   18.4
DNNs improve the state-of-the-art
#3 – Low resource cross-lingual knowledge transfer
Cross-lingual knowledge transfer in DNN acoustic models
• Low resource speech recognition: limited acoustic training data in the target language
  • Use untranscribed speech from different languages to improve target language speech recognition?
  • Transfer DNN hidden layers across languages?
• Unsupervised: pre-training using multilingual untranscribed audio
• Supervised:
  • Language-dependent output layer
  • Transfer hidden layers across languages
  • Finetune whole network for target language
Unsupervised cross-lingual knowledge transfer
• Globalphone corpus
• In-domain language: German
  • 1/5/15 h in-domain training sets
  • Unsupervised pretraining using Spanish, Portuguese, Swedish (20–25 h each)
• Baseline systems:
  • GMM/HMM + MFCC features
  • model/feature space discriminative training – (f)BMMI
  • transformed feature space – LDA+MLLT
  • 550–2500 tied states
DNNs
• Features: PLP-12 + energy
  • 1st and 2nd derivatives
  • ±4 frames context
• 1024 hidden units per layer
• Tandem: 44 outputs (phone classes) reduced to 25 dimensions using PCA
• Hybrid: 550–2500 outputs (context-dependent tied states from the HMM/GMM phonetic decision tree)
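The PCA reduction step (44 phone posteriors down to 25 dimensions) can be sketched with an SVD; the data here is a random stand-in, whereas a real system would estimate the projection on training-set features:

```python
import numpy as np

def pca_reduce(X, n_components=25):
    """Project mean-centred features onto the top principal components,
    e.g. 44 phone-class outputs down to 25 dims for the tandem GMM."""
    Xc = X - X.mean(axis=0)                          # centre each dimension
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt: components
    return Xc @ Vt[:n_components].T                  # scores on top components

X = np.random.default_rng(0).normal(size=(1000, 44))   # 44 phone-class outputs
Y = pca_reduce(X, n_components=25)
```

The components come out ordered by explained variance, so truncating to the first 25 keeps the directions along which the posterior features vary most.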
Effect of pretraining (5 h in-domain)

[Figure: WER (%) vs number of hidden layers (1–9) for German hybrid DNNs trained on 5 h, without pretraining and with pretraining on German, Portuguese, Spanish, Swedish, and all four combined; fBMMI+BMMI and LDA+MLLT GMM baselines (5 h and 15 h) shown for comparison; WER axis spans 14.5–18.5%.]

Swietojanski, Ghoshal & Renals, SLT-2013
Word error rates on Globalphone (German): unsupervised pretraining on Spanish for hybrid and tandem systems

[Figure: WER (%) for discriminative GMM, tandem, and hybrid systems trained on 1 h, 5 h, and 15 h of in-domain data; WER axis spans 15–35%.]
Supervised cross-lingual acoustic models
Cross-lingual DNNs (“Hat Swap”)

[Figure: stacked RBMs trained on PL; DNN finetuned on PL; hidden layers transferred and finetuned with new output layers on CZ, DE, and PT.]

Ghoshal, Swietojanski & Renals, ICASSP-2013
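A structural sketch of the hat-swap idea: one hidden stack shared across languages, with a separate softmax “hat” per language. All layer and output sizes here are hypothetical; a real system sizes each output layer to that language's tied-state inventory:

```python
import numpy as np

rng = np.random.default_rng(0)

# hidden stack shared across languages (in practice: pretrained/finetuned on PL)
shared = [(0.01 * rng.normal(size=(512, 351)), np.zeros(512)),
          (0.01 * rng.normal(size=(512, 512)), np.zeros(512))]

# one softmax output layer ("hat") per language, with illustrative sizes
hats = {lang: (0.01 * rng.normal(size=(n, 512)), np.zeros(n))
        for lang, n in [("PL", 1000), ("CZ", 1200), ("DE", 1100), ("PT", 900)]}

def posteriors(x, lang):
    """Forward pass: shared hidden layers, then the language-dependent hat."""
    h = x
    for W, b in shared:
        h = np.tanh(W @ h + b)
    W, b = hats[lang]              # swap in the hat for the target language
    a = W @ h + b
    e = np.exp(a - a.max())
    return e / e.sum()

y = posteriors(rng.normal(size=351), "CZ")
```

Finetuning for a new language then updates the shared stack and that language's hat only, which is why hidden-layer knowledge transfers while the output inventory stays language specific.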
Mono & multilingual results on Globalphone (Polish)
PL→CZ→DE→FR→PL
Another important result
DNN acoustic model for robust speech recognition
Seltzer, Yu & Wang, ICASSP 2013
• Aurora 4 – WSJ with added noise
• State-of-the-art: GMMs with combined discriminative and noise adaptive training, with speaker adaptation
• WER equalled by baseline DNN system with multicondition training

[Figure: WER (%): GMM Baseline 23.0, GMM-VAT-Joint 13.4, DNN Baseline 13.4, DNN+NAT+D/O 12.4.]
Practical details for DNNs
• Computing platform
  • High-end PC with a “gamer’s GPU” (e.g. GTX690) (~€5000)
• Open source software, e.g.:
  • Kaldi – http://kaldi.sourceforge.net
  • Theano – http://deeplearning.net/software/theano
  • Torch7 – http://www.torch.ch
  • Quicknet – http://www.icsi.berkeley.edu/Speech/qn.html
Outlook: 2 interesting things (among many) facilitated by DNNs
DNNs and acoustic features
• GMM/HMM systems closely tied to MFCC/PLP feature representations
  • GMMs are not a good model to learn feature correlations
• DNNs may be viewed as powerful feature extractors and correlators
  • (log) (mel) spectral features result in equal or lower error rates
• DNNs well matched to exploring a range of rich acoustic feature representations
  • Reverb, overlapped speech, multiple acoustic sources...
Incorporation of speech knowledge
Graph transformer networks: trainable multi-modular systems
Le Cun, Bottou, Bengio & Haffner, Proc IEEE 1998

[Fig. 27 of that paper: a globally trainable SDNN/HMM hybrid system expressed as a GTN.]
Conclusions
• DNNs enable automatic extraction of rich features
  • (Implicit or explicit) regularisation enables the construction of deep, complex DNNs
• Training DNNs to estimate CD HMM state probabilities has extended the state-of-the-art in ASR
• DNNs open new research possibilities
  • Multilingual, multidomain systems
  • New acoustic features for ASR
  • ...
Thanks.
[email protected]
References
•
A Bryson and Y Ho (1969). Applied Optimal Control, Taylor and Francis.
•
P Bell, H Yamamoto, P Swietojanski, Y Wu, F McInnes, C Hori, and S Renals (2013).
“A lecture transcription system combining neural network acoustic and language
models”, Proc Interspeech.
•
H Bourlard and N Morgan (1994). Connectionist Speech Recognition – A Hybrid Approach,
Kluwer.
•
J Bridle and L Dodd (1991). “An Alphanet approach to optimising input transformations
for continuous speech recognition”, Proc IEEE ICASSP.
•
G Cook, J Christie, D Ellis, E Fosler-Lussier, Y Gotoh, B Kingsbury, N Morgan,
S Renals, T Robinson, and G Williams (1999). “An overview of the SPRACH system for
the transcription of broadcast news”, Proc DARPA Broadcast News Workshop.
•
G Dahl, D Yu, L Deng, and A Acero (2012). “Context-Dependent Pre-Trained Deep
Neural Networks for Large-Vocabulary Speech Recognition” IEEE Trans ASLP, 20:30-42.
•
P Bell, P Swietojanski, and S Renals (2013). “Multi-level adaptive networks in tandem and hybrid ASR systems”, Proc IEEE ICASSP.
References
•
A Ghoshal, P Swietojanski, and S Renals (2013). “Multilingual Training of Deep Neural
Networks”, Proc IEEE ICASSP.
•
F Grézl, M Karafiát, S Kontár, and J Černocký (2007). “Probabilistic and Bottle-Neck
Features for LVCSR of Meetings”, Proc IEEE ICASSP.
•
T Hain, L Burget, J Dines, P Garner, F Grezl, A el Hannani, M Huijbregts, M Karafiat,
M Lincoln, and V Wan (2012). “Transcribing meetings with the AMIDA systems”, IEEE
Trans ASLP, 20:486-498.
•
H Hermansky, D Ellis, and S Sharma (2000). “Tandem connectionist feature extraction
for conventional HMM systems”, Proc IEEE ICASSP.
•
G Hinton, L Deng, D Yu, G Dahl, A Mohamed, N Jaitly, A Senior, V Vanhoucke,
P Nguyen, T Sainath, and B Kingsbury (2012). “Deep Neural Networks for Acoustic
Modeling in Speech Recognition: The Shared Views of Four Research Groups”, IEEE
Signal Processing Magazine, Nov 2012:82-97.
•
G Hinton, S Osindero, and YW Teh (2006). “A Fast Learning Algorithm for Deep Belief
Nets”, Neural Computation, 18:1527-1554.
References
•
B Kingsbury (2009). “Lattice-based optimization of sequence classification criteria for
neural-network acoustic modeling”, Proc IEEE ICASSP.
•
Y LeCun, L Bottou, Y Bengio, and P Haffner (1998). “Gradient-based learning applied to document recognition”, Proc IEEE, 86(11):2278-2324.
•
M Minsky and S Papert (1969). Perceptrons, MIT Press.
•
S Renals, N Morgan, M Cohen and H Franco (1992). “Connectionist probability
estimation in the DECIPHER speech recognition system”, Proc IEEE ICASSP.
•
S Renals, N Morgan, H Bourlard, M Cohen and H Franco (1994). “Connectionist
probability estimators in HMM speech recognition”, IEEE Trans SAP, 2(1):161–174.
•
AJ Robinson (1994). “An application of recurrent nets to phone probability
estimation”, IEEE Trans Neural Networks, 5:298-305.
•
F Rosenblatt (1962). Principles of neurodynamics: perceptrons and the theory of brain mechanisms, Spartan Books.
•
D Parker (1985). Learning Logic, TR-47, Center for Computational Research in Economics and Management Science, MIT.
References
•
D Rumelhart, G Hinton, R Williams (1986). “Learning internal representations by error
propagation”, in Parallel Distributed Processing (vol 1), D Rumelhart and J McClelland
(eds), MIT Press.
•
M Seltzer, D Yu, and E Wang (2013). “An Investigation of Deep Neural Networks for
Noise Robust Speech Recognition”, Proc IEEE ICASSP.
•
P Swietojanski, A Ghoshal, and S Renals (2013). “Unsupervised cross-lingual knowledge
transfer in DNN-based LVCSR”, Proc IEEE SLT Workshop.
•
K Vesely, A Ghoshal, L Burget, and D Povey (2013). “Sequence-discriminative training of
deep neural networks”, Proc Interspeech.
•
P Werbos (1990). “Backpropagation through time: what it does and how to do it”, Proc
IEEE, 78:1550-1560.