
Survey on state-of-the-art approaches:
Neural Network Trends in Speech Recognition
Presented by Ming-Han Yang (楊明翰)
Outline
• Speech Processing
  ◦ Neural Network Trends in Speech Recognition
    ▪ EXPLORING MULTIDIMENSIONAL LSTM FOR LARGE VOCABULARY ASR
      (Microsoft Corporation)
    ▪ END-TO-END ATTENTION-BASED LARGE VOCABULARY SPEECH RECOGNITION
      (Yoshua Bengio, Université de Montréal, Canada)
    ▪ DEEP CONVOLUTIONAL ACOUSTIC WORD EMBEDDINGS USING WORD-PAIR SIDE INFORMATION
      (Toyota Technological Institute at Chicago, United States)
    ▪ VERY DEEP MULTILINGUAL CONVOLUTIONAL NEURAL NETWORKS FOR LVCSR
      (IBM, United States; Yann LeCun, New York University, United States)
    ▪ LISTEN, ATTEND AND SPELL: A NEURAL NETWORK FOR LARGE VOCABULARY CONVERSATIONAL SPEECH RECOGNITION
      (Google Inc., United States)
    ▪ A DEEP SCATTERING SPECTRUM - DEEP SIAMESE NETWORK PIPELINE FOR UNSUPERVISED ACOUSTIC MODELING
      (Facebook A.I. Research, France)
Introduction
• Long short-term memory (LSTM) recurrent neural networks (RNNs) have recently shown significant performance improvements over deep feed-forward neural networks.
• A key aspect of these models is the use of time recurrence, combined with a gating architecture that allows them to track the long-term dynamics of speech.
• Inspired by human spectrogram reading, we recently proposed the frequency LSTM (F-LSTM), which performs 1-D recurrence over the frequency axis and then 1-D recurrence over the time axis.
• In this study, we further improve the acoustic model by proposing a 2-D, time-frequency (TF) LSTM.
• The TF-LSTM jointly scans the input over the time and frequency axes to model spectro-temporal warping, and then uses the output activations as the input to a time LSTM (T-LSTM).
THE LSTM-RNN
• Figure 1 depicts the architecture of an LSTM-RNN with one recurrent layer. In LSTM-RNNs, in addition to the past hidden-layer output h_{t-1}, the past memory activation c_{t-1} is also an input to the LSTM cell (see the sketch below).
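As a reference point, here is a minimal NumPy sketch of one LSTM-cell step consistent with that description; weight names and sizes are illustrative, and peephole connections and output projection are omitted for brevity:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step: the gates are computed from the current input x_t and
    the previous hidden output h_prev, and the previous memory activation
    c_prev is carried through the cell.  W, U, b stack the parameters of the
    [input, forget, output, candidate] blocks."""
    n = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b           # shape (4 * n,)
    i = sigmoid(z[0 * n:1 * n])            # input gate
    f = sigmoid(z[1 * n:2 * n])            # forget gate
    o = sigmoid(z[2 * n:3 * n])            # output gate
    g = np.tanh(z[3 * n:4 * n])            # candidate memory
    c_t = f * c_prev + i * g               # new memory activation
    h_t = o * np.tanh(c_t)                 # new hidden-layer output
    return h_t, c_t

# Toy usage with illustrative sizes:
rng = np.random.default_rng(0)
d_in, n = 87, 64
W = rng.normal(scale=0.1, size=(4 * n, d_in))
U = rng.normal(scale=0.1, size=(4 * n, n))
b = np.zeros(4 * n)
h, c = lstm_step(rng.normal(size=d_in), np.zeros(n), np.zeros(n), W, U, b)
```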
TF-LSTM processing
• The detailed TF-LSTM processing is as follows (a code sketch is given after the list):
1) At each time step, divide the N log-filter-banks at the current time into M overlapped chunks, shifting by C log-filter-banks between adjacent chunks. They are denoted as x^1_{k,t}, k = 1…M.
2) Using the hidden activations of each frequency chunk from the previous time step, h^1_{k,t-1}, the hidden activations at each time step from the previous frequency chunk, h^1_{k-1,t}, and the input at the current frequency chunk and time step, x^1_{k,t}, go through Eqs. (6)-(10) to generate the output h^1_{k,t}, k = 1…M.
3) Merge h^1_{k,t}, k = 1…M into a super-vector h^1_t, which can be considered a trajectory of time-frequency patterns. Then use h^1_t as the input to the upper-layer T-LSTM.
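A minimal NumPy sketch of the three steps above. The chunk size, shift, cell width, the separate recurrent matrices U_t / U_f for the time and frequency directions, and the carrying of only the time-previous memory are illustrative simplifications, not the paper's exact Eqs. (6)-(10):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tf_cell(x, h_time, h_freq, c_time, W, U_t, U_f, b):
    """Illustrative 2-D LSTM cell: the gates see the current chunk x, the state
    from the previous time step (same chunk) and the state from the previous
    frequency chunk (same time step); only the time-previous memory is kept."""
    n = c_time.shape[0]
    z = W @ x + U_t @ h_time + U_f @ h_freq + b
    i = sigmoid(z[0 * n:1 * n])
    f = sigmoid(z[1 * n:2 * n])
    o = sigmoid(z[2 * n:3 * n])
    g = np.tanh(z[3 * n:4 * n])
    c = f * c_time + i * g
    return o * np.tanh(c), c

def tf_lstm_layer(X, chunk, shift, n_cell, params):
    """Steps 1-3: chunk each frame, scan jointly over time and frequency,
    and merge the chunk outputs into a per-frame super-vector h^1_t that an
    upper T-LSTM layer would consume."""
    W, U_t, U_f, b = params
    T, N = X.shape
    starts = list(range(0, N - chunk + 1, shift))     # step 1: M overlapped chunks
    M = len(starts)
    H = np.zeros((T + 1, M + 1, n_cell))              # zero states at t = 0 and k = 0
    C = np.zeros((T + 1, M + 1, n_cell))
    out = np.zeros((T, M * n_cell))
    for t in range(T):
        for k, s in enumerate(starts):
            x = X[t, s:s + chunk]                     # x^1_{k,t}
            h, c = tf_cell(x, H[t, k + 1], H[t + 1, k], C[t, k + 1],
                           W, U_t, U_f, b)            # step 2
            H[t + 1, k + 1], C[t + 1, k + 1] = h, c
        out[t] = H[t + 1, 1:].reshape(-1)             # step 3: super-vector h^1_t
    return out

# Toy usage: 29 filter banks, chunks of 8 shifted by 4, 16 cells (illustrative sizes).
rng = np.random.default_rng(0)
n_cell, chunk = 16, 8
params = (rng.normal(scale=0.1, size=(4 * n_cell, chunk)),
          rng.normal(scale=0.1, size=(4 * n_cell, n_cell)),
          rng.normal(scale=0.1, size=(4 * n_cell, n_cell)),
          np.zeros(4 * n_cell))
print(tf_lstm_layer(rng.normal(size=(10, 29)), chunk, 4, n_cell, params).shape)  # (10, 96)
```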
Corpora Description & Experiments
• Microsoft Windows Phone short message dictation task
  ◦ Training data: 375 hours
  ◦ Test set: 125k words
• Features
  ◦ 87-dimensional log-filter-bank features (29 dimensions × 3)
• 5976 tied-triphone states (senones)
• DNN settings:
  ◦ 5 layers × 2048 units; splice = 5
• LSTM settings:
  ◦ T-LSTM: 1024 memory cells per layer, with each layer followed by a linear projection layer down to 512 units
  ◦ BPTT step = 20; output delay = 5 frames, etc.
Introduction
• We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional speech recognizers.
• LAS consists of two sub-modules: the listener and the speller.
  ◦ The listener is an acoustic model encoder that performs an operation called Listen.
  ◦ The speller is an attention-based character decoder that performs an operation we call AttendAndSpell.
Introduction (cont.)
• Let x = (x_1, …, x_T) be the input sequence of filter-bank spectra features.
• Let y = (<sos>, y_1, …, y_S, <eos>), with y_i ∈ {a, …, z, 0, …, 9, <space>, <comma>, <period>, <apostrophe>, <unk>}, be the output sequence of characters.
• LAS models each character output y_i as a conditional distribution over the previous characters y_{<i} (see the scoring sketch below):
    P(y | x) = ∏_i P(y_i | x, y_{<i})
• The Listen operation transforms the original signal x into a high-level representation h = (h_1, …, h_U) with U ≤ T.
• The AttendAndSpell operation consumes h and produces a probability distribution over character sequences:
    h = Listen(x)
    P(y_i | x, y_{<i}) = AttendAndSpell(y_{<i}, h)
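A minimal Python sketch of how this factorization scores a transcript. The `listen` and `attend_and_spell` arguments are placeholder callables standing in for the trained listener and speller; this helper and its names are illustrative, not the paper's code:

```python
import numpy as np

# Character inventory from the slide: lower-case letters, digits and special tokens.
VOCAB = list("abcdefghijklmnopqrstuvwxyz0123456789") + \
        ["<space>", "<comma>", "<period>", "<apostrophe>", "<unk>", "<sos>", "<eos>"]

def log_prob_transcript(x, chars, listen, attend_and_spell):
    """Chain-rule score: log P(y | x) = sum_i log P(y_i | x, y_<i).

    listen(x)                   -> listener features h = (h_1, ..., h_U)
    attend_and_spell(prefix, h) -> probability vector over VOCAB for the next
                                   character given the prefix y_<i
    """
    h = listen(x)                              # h = Listen(x)
    prefix, total = ["<sos>"], 0.0
    for ch in chars + ["<eos>"]:               # include the end-of-sentence token
        p = attend_and_spell(prefix, h)        # P(y_i | x, y_<i)
        total += np.log(p[VOCAB.index(ch)])
        prefix.append(ch)
    return total
```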
Listen
• The Listen operation uses a Bidirectional Long Short-Term Memory RNN (BLSTM) with a pyramidal structure (pBLSTM).
  ◦ A direct application of a BLSTM to the Listen operation converged slowly and produced results inferior to those reported here, even after a month of training time.
• In the pBLSTM model, we concatenate the outputs at consecutive steps of each layer before feeding them to the next layer (see the sketch below).
• The pyramidal structure also reduces the computational complexity: the attention mechanism in the speller has a computational complexity of O(U·S), so reducing U speeds up learning and inference significantly.
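A minimal sketch of the pyramidal reduction between listener layers; `blstm_layers` are placeholder callables for the BLSTM layers, and the reduction factor of 2 is an assumption for illustration:

```python
import numpy as np

def pyramid_reduce(h, factor=2):
    """Concatenate each group of `factor` consecutive output frames of the
    lower layer, so the layer above sees a sequence `factor` times shorter
    with `factor` times wider feature vectors."""
    T, d = h.shape
    T = (T // factor) * factor                # drop a trailing odd frame, if any
    return h[:T].reshape(T // factor, factor * d)

def listen(x, blstm_layers):
    """Pyramidal listener: reduce the frame rate before every layer above the
    bottom one.  Each element of `blstm_layers` is a placeholder callable
    mapping a (T, d_in) sequence to a (T, d_out) sequence."""
    h = blstm_layers[0](x)
    for layer in blstm_layers[1:]:
        h = layer(pyramid_reduce(h))          # U shrinks by `factor` per level
    return h
```

With three such reducing layers, U shrinks by a factor of 2^3 = 8, which directly cuts the O(U·S) attention cost mentioned above.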
Attend and Spell
• The distribution for y_i is a function of the decoder state s_i and the context c_i.
• The decoder state s_i is a function of the previous state s_{i-1}, the previously emitted character y_{i-1} and the context c_{i-1}. The context vector c_i is produced by an attention mechanism.
• Specifically, CharacterDistribution() is an MLP with softmax outputs over characters, and the RNN is a 2-layer LSTM (a sketch of one decoding step follows).
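A sketch of one speller step under these definitions; `rnn`, `attention_context` and `char_dist` are placeholders for the 2-layer LSTM, the attention mechanism and the CharacterDistribution MLP (the decomposition follows the slide, the function names are illustrative):

```python
def speller_step(s_prev, y_prev, c_prev, h, rnn, attention_context, char_dist):
    """One AttendAndSpell decoding step."""
    s_i = rnn(s_prev, y_prev, c_prev)       # s_i = RNN(s_{i-1}, y_{i-1}, c_{i-1})
    c_i = attention_context(s_i, h)         # context from the attention mechanism
    p_i = char_dist(s_i, c_i)               # softmax over characters: P(y_i | x, y_<i)
    return p_i, s_i, c_i
```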
Attend and Spell (cont.)
• Specifically, at each decoder time step i, the AttentionContext function computes the scalar energy e_{i,u} for each encoder time step u, using the vector h_u ∈ h and s_i.
• The scalar energy e_{i,u} is converted into a probability distribution over time steps (or attention) α_i using a softmax function.
• The softmax probabilities are used as mixing weights for blending the listener features h_u into the context vector c_i for output time step i:
    e_{i,u} = <φ(s_i), ψ(h_u)>
    α_{i,u} = exp(e_{i,u}) / Σ_u exp(e_{i,u})
    c_i = Σ_u α_{i,u} h_u
  where φ and ψ are MLP networks.
• After training, the α_i distribution is typically very sharp and focuses on only a few frames of h.
• c_i can be seen as a continuous bag of weighted features of h (see the sketch below).
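A NumPy sketch of the AttentionContext computation just described; `phi` and `psi` are stand-ins for the two MLPs, and the listener features `h` are assumed to be a (U, d) array:

```python
import numpy as np

def attention_context(s_i, h, phi, psi):
    """Content-based attention: scalar energies from phi(s_i) and psi(h_u),
    a softmax over listener time steps u, then a weighted sum of the
    listener features as the context vector c_i."""
    q = phi(s_i)                                  # query from the decoder state
    keys = np.stack([psi(h_u) for h_u in h])      # one key per listener frame
    e = keys @ q                                  # e_{i,u} = <phi(s_i), psi(h_u)>
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                          # attention weights alpha_{i,u}
    c_i = alpha @ h                               # c_i = sum_u alpha_{i,u} h_u
    return c_i, alpha
```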
Corpora Description & Experiments
• Google voice search task
  ◦ Training set: 2000 hours (3 million utterances)
  ◦ Test set: 16 hours
• Features
  ◦ 40-dimensional log-mel filter bank features
• All utterances were padded with the start-of-sentence <sos> and end-of-sentence <eos> tokens.
Introduction
• Many state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) systems are hybrids of neural networks and Hidden Markov Models (HMMs).
• Recently, more direct end-to-end methods have been investigated, in which neural architectures were trained to model sequences of characters.
• To our knowledge, all these approaches relied on Connectionist Temporal Classification [3] modules.
• We start from the system proposed in [11] for phoneme recognition and make the following contributions:
  ◦ reduce the total training complexity from quadratic to linear
  ◦ introduce a recurrent architecture that successively reduces the source sequence length by pooling frames neighboring in time
  ◦ character-level ARSG + n-gram word-level language model + WFST
Introduction (cont.)
• Encoder-Decoder
  ◦ We use a deep Bidirectional RNN (BiRNN) as the encoder.
  ◦ The representation is a sequence of BiRNN state vectors (h_1, …, h_N).
  ◦ For a standard deep BiRNN, the sequence is as long as the input of the bottommost layer → for our decoder such a representation is overly precise and contains much redundant information → we add pooling between BiRNN layers (see the sketch below).
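A minimal sketch of the pooled encoder; unlike the pBLSTM above, the lower layer's states are subsampled rather than concatenated here (the exact pooling operator and factor are assumptions, and `birnn_layers` are placeholder callables):

```python
def encode(x, birnn_layers, factor=2):
    """Deep BiRNN encoder with temporal pooling between layers: keeping one
    state out of every `factor` shortens the sequence the decoder attends
    over.  Each element of `birnn_layers` is a placeholder callable mapping a
    (T, d_in) sequence to a (T, d_out) sequence."""
    h = birnn_layers[0](x)
    for layer in birnn_layers[1:]:
        h = layer(h[::factor])                # pool frames neighboring in time
    return h
```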

Introduction (cont.)
• Attention-equipped RNNs
  ◦ The decoder network in our system is an Attention-based Recurrent Sequence Generator (ARSG).
  ◦ An ARSG produces an output sequence (y_1, …, y_T) one element at a time, simultaneously aligning each generated element to the encoded input sequence (h_1, …, h_N).
• Attention mechanism
  ◦ The attention mechanism selects the temporal locations over the input sequence that should be used to update the hidden state of the RNN and to predict the next output value.
  ◦ The selected input sequence elements are combined in a weighted sum c_t = Σ_n α_{t,n} h_n, where the α_{t,n} are called the attention weights and we require that α_{t,n} ≥ 0 and Σ_n α_{t,n} = 1.
Introduction (cont.)
• Simply put, the attention mechanism combines information from three sources to decide where to focus at step t (a sketch follows the list):
  1) the decoding history contained in s_{t-1}
  2) the content at the candidate location h_n
  3) the focus from the previous step, described by the attention weights α_{t-1}
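A NumPy sketch of an attention scorer that combines these three sources. The additive scoring form and the single convolution filter over α_{t-1} follow the general location-aware ARSG recipe; all weight names, shapes and the single-filter simplification are illustrative assumptions:

```python
import numpy as np

def attend(s_prev, h, alpha_prev, W_s, W_h, w_f, conv_filter, w):
    """Hybrid (content + location) attention for one output step t.

    s_prev     : previous decoder state s_{t-1}, shape (d_s,)
    h          : encoded input sequence, shape (N, d_h)
    alpha_prev : previous attention weights alpha_{t-1}, shape (N,)
    """
    # Location features: smooth the previous weights with a 1-D convolution.
    loc = np.convolve(alpha_prev, conv_filter, mode="same")
    # Additive score e_{t,n} = w^T tanh(W_s s_{t-1} + W_h h_n + w_f * loc_n).
    e = np.tanh(W_s @ s_prev + h @ W_h.T + np.outer(loc, w_f)) @ w
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                      # alpha_{t,n} >= 0, sum_n alpha_{t,n} = 1
    c = alpha @ h                             # c_t = sum_n alpha_{t,n} h_n
    return c, alpha
```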
Corpora Description & Experiments
• Wall Street Journal (WSJ) corpus
  ◦ Training set: 81 hours (37k sentences)
  ◦ Test set: eval92
• Features
  ◦ 40 mel-scale filter banks (+ energy, Δ, ΔΔ)
• Our model used 4 layers of 250 forward + 250 backward GRU [15] units in the encoder.