
DIGITAL SIGNAL PROCESSING ARCHITECTURE
FOR LARGE VOCABULARY
SPEECH RECOGNITION
WONYONG SUNG
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING
SEOUL NATIONAL UNIVERSITY
July 6, 2015, Cluj-Napoca, Romania
Speech recognition

- The most natural human-machine interface.
- Long history of research (even from the analog technology age).
- Extensive use of DSP technology.
- Still imperfect.
- Nowadays understood as a kind of machine learning problem.
- Diverse applications: from keyword spotting (~100 words) to multi-language
  understanding (>100K words).

Hidden Markov model for speech recognition

- A hidden Markov model contains states (tri-phone states) and state
  transitions that follow the speech input and the network connections. It
  combines three knowledge sources:
  - Acoustic model: phoneme representation (Gaussian mixture model for
    emission probability computation)
  - Pronunciation model: vocabulary (lexicon)
  - Sentence or language model

HMM-based speech recognition implementation

1. Feature extraction: speech to acoustic parameters
   - MFCC (Mel-Frequency Cepstrum Coefficient)
   - Independent of the vocabulary size
2. Emission probability computation
   - Generates the log-likelihood of each hypothesized state
   - Higher dimension needed for a large vocabulary (4 ~ 128)
3. Viterbi beam search
   - Dynamic programming through the network
   - Highly complex 'compare and select' operations

[Block diagram: Input Speech -> Feature Extraction -> Emission Probability
Computation -> Search -> Result, repeated frame by frame over time.]

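As a rough illustration of this three-stage pipeline, here is a minimal NumPy sketch that strings the stages together. The function names, shapes, the random stand-in "feature extractor", and the toy chain-shaped HMM are placeholders invented for the example, not the actual system.

```python
import numpy as np

def extract_features(speech, frame_len=400, hop=160, n_feat=39):
    """Stub front end: one (hypothetical) 39-dim vector per 10 ms frame.
    A real MFCC chain would apply an FFT, mel filterbank, log, and DCT."""
    n_frames = 1 + (len(speech) - frame_len) // hop
    frames = np.stack([speech[i * hop:i * hop + frame_len] for i in range(n_frames)])
    proj = np.random.default_rng(0).standard_normal((frame_len, n_feat))
    return frames @ proj                        # (n_frames, n_feat), stand-in for MFCCs

def emission_log_probs(feat, n_states):
    """Stub emission scores: a log-likelihood for every HMM state of one frame."""
    w = np.random.default_rng(1).standard_normal((feat.shape[-1], n_states))
    return feat @ w                             # (n_states,)

def viterbi_beam_step(costs, trans, emission, beam):
    """One dynamic-programming step: extend all paths, keep the best cost per
    state ('compare and select'), then prune paths worse than best + beam."""
    new_costs = np.min(costs[:, None] + trans, axis=0) - emission
    return np.where(new_costs <= new_costs.min() + beam, new_costs, np.inf)

# Toy run over random "speech" with a simple chain-shaped HMM topology.
rng = np.random.default_rng(2)
speech = rng.standard_normal(16000)             # 1 s at 16 kHz
n_states = 1000
trans = np.full((n_states, n_states), np.inf)   # -log transition costs
idx = np.arange(n_states)
trans[idx, idx] = 1.0                           # self loops
trans[idx, (idx + 1) % n_states] = 1.0          # forward arcs
costs = np.full(n_states, np.inf)
costs[0] = 0.0
for feat in extract_features(speech):
    costs = viterbi_beam_step(costs, trans, emission_log_probs(feat, n_states), beam=50.0)
print("best final path cost:", costs.min())
```
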
Algorithm and implementation trend

- ~ 2000
  - Single CPU or programmable DSP based
  - Hidden Markov model based
- ~ up to date
  - Parallel computer architecture (GPU, multi-core)
  - FPGA, VLSI
- In the future
  - Neural network (deep neural network, recurrent neural network)
  - GPU or neuromorphic system based

In this talk:
- Part 1: multi-core CPU and GPU (parallel computer) based implementation of
  large vocabulary speech recognition
- Part 2: deep neural network based implementations

Speech recognition for large vocabulary (>60K)

Not only lexicon size increases, more precise acoustic
modeling:



The network complexity for HMM grows very rapidly.



High dimension (32~128) Gaussian mixture model
More tri-phone states
We need to prune many states or arcs during the search
Resulting in very irregular computation
High complexity language model needed: 3-gram or higher is
desired.

Large memory size
You, Kisun, et al. "Parallel scalability in
speech recognition." Signal Processing
Magazine, IEEE 26.6 (2009): 124-135.
Multimedia Systems Lab. @ SoEE, SNU
Recognition Network Example

[Figure: how the recognition network is built.
- Gaussian mixture model for one phone state: 128 mixture components;
  computing the distance to each mixture component, then the weighted sum of
  all components.
- HMM acoustic phone model: 17,550 triphones.
- Pronunciation model: 58k-word vocabulary (e.g., HOP = hh aa p, ON = aa n,
  POP = p aa p).
- Bigram language model: 168k bigram transitions.
- Compiled and optimized WFST recognition network: 4 million states and
  10 million transition arcs, searched with the features from one frame at a
  time.]

Multi-core CPU and GPU

- Multicore: two to tens of CPU cores on each chip. Good single-thread
  performance.
- GPU (manycore): hundreds of processing cores, maximizing computation
  throughput at the expense of single-thread performance.

Examples: Intel Core i7 (45nm), 4 cores; NVIDIA GTX285 (55nm), 30 cores;
Intel Xeon Phi, 96 cores.

Multicore/GPU Architecture Trends

- Increasing number of cores per die
  - Intel Nehalem: 4~8 cores
  - Intel Xeon Phi: about 60 cores
  - NVIDIA GTX285: 30 cores (each core contains 8 units)
- Increasing vector unit width (SIMD)
  - Intel Core i7 (Nehalem): 8-way ~ 16-way
  - Intel Xeon Phi: 16-way
  - NVIDIA GTX285: 8-way (physical), 32-way (logical)

[Figure: many cores on one die, each with a 4x32b vector (SIMD) unit that adds
(a0, a1, a2, a3) and (b0, b1, b2, b3) element-wise in one operation.]

Parallel scalability

- Can we achieve a speed-up of 'SIMD_width x #_of_cores'?
  - E.g., with an 8-way SIMD, 8-core CPU: 64 times speed-up.
- Some parts of the speech recognition algorithm are quite parallel scalable,
  but other parts are not.
  - Emission probability computation: the computation flow is quite regular.
    Good scalability.
  - Hidden Markov network search: quite irregular network search.
    - Packing overhead for SIMD
    - Synchronization overhead

Emission probability computation

- Very regular; a good candidate workload for parallelization.
- Tied HMM states (senones)
  - Usually up to several thousand (depends on the training condition)
  - 39-D MFCC feature vector
  - Gaussian mixtures: 16, 32, or 128 components

  \log b_m(O_t; s) = C_m - \frac{1}{2} \sum_{k=1}^{K} \frac{(x_k - \mu_{mk})^2}{\sigma_{mk}^2}

- Parallelization
  - Triphones (senones) at the core level
  - Gaussians at the SIMD level
- Simple dynamic workload distribution
  - Count the number of active triphones
  - Distribute them evenly to each thread

[Figure: active senone flags and addresses are packed into a work list and
distributed to Thread 0 ~ Thread 3.]

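A minimal NumPy sketch of this per-frame computation, assuming diagonal-covariance Gaussians with precomputed constants C_m and log mixture weights; the array names, shapes, and the log-sum-exp mixture combination are illustrative assumptions, not the lab's actual code.

```python
import numpy as np

def senone_log_likelihood(x, mu, inv_var, log_c, log_w):
    """Log-likelihood of one feature frame for every senone.

    x       : (K,)      feature vector (e.g., K = 39 MFCCs)
    mu      : (S, M, K) mixture means for S senones with M mixtures each
    inv_var : (S, M, K) 1 / sigma^2 (diagonal covariances)
    log_c   : (S, M)    per-mixture constants C_m
    log_w   : (S, M)    log mixture weights
    """
    diff = x - mu                                             # (S, M, K)
    log_b = log_c - 0.5 * np.sum(diff * diff * inv_var, -1)   # per-mixture log prob
    # Combine the mixtures (weighted sum in the probability domain, done as a
    # numerically stable log-sum-exp).
    m = np.max(log_w + log_b, axis=-1)
    return m + np.log(np.sum(np.exp(log_w + log_b - m[:, None]), axis=-1))

# Toy sizes: 2000 senones, 32 mixture components, 39-dim features.
rng = np.random.default_rng(0)
S, M, K = 2000, 32, 39
mu = rng.standard_normal((S, M, K))
inv_var = 1.0 / (np.abs(rng.standard_normal((S, M, K))) + 0.5)
log_c = rng.standard_normal((S, M))
log_w = np.log(np.full((S, M), 1.0 / M))
ll = senone_log_likelihood(rng.standard_normal(K), mu, inv_var, log_c, log_w)
print(ll.shape)   # (2000,): one score per senone for this frame
```
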
Irregular network search vs. parallel scalability

- Parallel graph traversal through an irregular network with millions of arcs
  and states.
- Vector (SIMD) unit efficiency
  - SIMD operation demands packed data (packing overhead may be needed)
  - Continuously changing working set, guided by the input
- Synchronization
  - Arc traversal induces write conflicts when updating destination states
  - Increased chance of conflicts for a large number of cores

[Figure: several cores with private caches traverse the shared HMM network;
the two issues are synchronization and vector unit utilization.]

Applying SIMD in network search

[Figure: SIMD arc traversal on a small example network (states 1-12, arcs
a-l). For a packed group of arcs, the source state costs are coalesced
(gathered), the arc observation probabilities and arc weights are added, the
sums are compared (<) against the current destination state costs, and the
updated destination state costs are scattered back.]

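The gather/compute/scatter pattern above can be mimicked with vectorized indexing in NumPy. The sketch below is a toy illustration (the tiny network and array names are invented); np.minimum.at stands in for the 'compare and select' so that several arcs sharing a destination resolve to the minimum cost, as Viterbi requires.

```python
import numpy as np

# Toy network: arc i goes from state src[i] to state dst[i].
src = np.array([0, 0, 1, 2, 2, 3])
dst = np.array([1, 2, 3, 3, 4, 4])
arc_weight = np.array([1.0, 2.0, 0.5, 1.5, 2.5, 1.0])
obs_prob = np.array([0.2, 0.1, 0.3, 0.3, 0.4, 0.2])    # per-arc -log emission scores

src_cost = np.array([0.0, np.inf, np.inf, np.inf, np.inf])
dest_cost = np.full(5, np.inf)

# Gather the source state costs and add the arc terms (both SIMD-friendly).
cand = src_cost[src] + arc_weight + obs_prob
# Scatter back with 'compare and select': keep the minimum per destination.
np.minimum.at(dest_cost, dst, cand)
print(dest_cost)   # [inf 1.2 2.1 inf inf] after one traversal step
```
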
SIMD Efficiency

[Figure: active states are packed and mapped onto SIMD lanes over time; lanes
that carry no active state become extra work, lowering SIMD utilization.]

[Plot: speedup over the sequential case and SIMD utilization for state-based
traversal, as the SIMD width is varied over 1, 2, 4, 8, 16, and 32.]

Parallelization choice for network search

- Coarse-grained vs. fine-grained job partitioning problem.
- Active-state or active-arc based traversal:
  - In the example graph below, states (2), (3), (4), (8) are the work units
    for active-state based parallelization.
  - Arcs (b), (c), (d), (e) ... (j) are the work units for active-arc based
    parallelization.
  - Active-state based traversal is simpler and coarser, but the number of
    arcs per state varies.
  - Active-arc based traversal is fine-grained.

[Figure: example network with states 1-12 and arcs a-l.]

Thread-scheduling issues

- Should a destination node update be assigned to a single thread or to
  multiple threads?
- The minimum of the incoming costs must be adopted (Viterbi).
- Traversal by propagation
  - (2), (3), (4) are parallelized; the update of (5) can be done by multiple
    threads.
  - Write-conflict problem; possible remedies:
    - Atomic operation for the update
    - Privatization (build private buffers)
    - Lock-based implementation
- Traversal by aggregation
  - (5) is updated by a single thread.
    - No write-conflict problem
    - Needs a preparation step
- A sketch contrasting the two strategies follows below.

[Figure: example graph with states 1-6; Thread 0 and Thread 1 both feed
updates into state (5).]

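A minimal NumPy sketch of the two strategies (the tiny graph and array names are made up for illustration): propagation lets every arc write to its destination and resolves the write conflicts with a min-reduction (np.minimum.at plays the role of an atomic min), while aggregation first groups the arcs by destination so each destination is updated exactly once.

```python
import numpy as np

# Arcs of a toy graph: arc i goes from state src[i] to state dst[i].
src = np.array([1, 2, 3, 3])
dst = np.array([5, 5, 5, 6])
arc_w = np.array([0.5, 1.0, 0.2, 0.4])                       # arc + observation costs
cost = np.array([np.inf, 3.0, 2.0, 1.5, np.inf, np.inf, np.inf])  # states 0..6

# Traversal by propagation: every arc scatters into its destination state;
# np.minimum.at resolves concurrent writes like an atomic min would.
prop = cost.copy()
np.minimum.at(prop, dst, cost[src] + arc_w)

# Traversal by aggregation: group arcs by destination first, then each
# destination is updated exactly once (no write conflict, but it needs the
# preparation step of building per-destination arc lists).
aggr = cost.copy()
for d in np.unique(dst):
    incoming = dst == d
    aggr[d] = min(aggr[d], np.min(cost[src[incoming]] + arc_w[incoming]))

print(prop[5], aggr[5])   # both give the Viterbi minimum for state 5: 1.7
```
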
Design space for search network parallelization

- Traversal by propagation (synchronization cost needed)
  - Coarse grain (active states): maintain the active source states and
    propagate the out-arc computation results to the destination states.
  - Fine grain (active arcs): maintain the active arcs and propagate the
    active-arc computation results to the destination states.
- Traversal by aggregation (preparation overhead needed)
  - Coarse grain (active states): maintain the active destination states;
    determine all potential destination states and aggregate the incoming
    arcs.
  - Fine grain (active arcs): maintain the active arcs; group arcs with the
    same destination state and aggregate them locally to resolve write
    conflicts.

Speedup: multicore (relatively small number of cores)

[Chart: real-time factor (RTF) breakdown and speedup vs. the sequential case.
Legend: Obs. Prob. Comp. / Non-eps Traversal / Eps Traversal / Seq. Overhead.
- Sequential: RTF 3.17 (1x), with 2.623 / 0.474 / 0.073.
- State-based propagation: RTF 0.925 (3.4x), with 0.732 / 0.157 / 0.035 / 0.001.
- Arc-based propagation: RTF 1.006 (3.2x).
- State-based aggregation: RTF 2.593 (1.2x).]

- Only a 3.4x speed-up with 4 cores x 4-way SIMD = 16x parallel hardware
  support.
- No SIMD in the search: the coalescing overhead exceeds the benefit of 4-way
  SIMD.

Speedup: GPU (large number of cores)

[Chart: real-time factor (RTF) breakdown and speedup vs. the sequential case.
Legend: Obs. Prob. Comp. / Non-eps Traversal / Eps Traversal / Seq. Overhead.
- Sequential: RTF 3.17 (1x), with 2.623 / 0.474 / 0.073.
- Arc-based propagation: RTF 0.302 (10.5x), with 0.148 / 0.103 / 0.043 / 0.008.
- State-based propagation: RTF 0.776 (4.1x).
- Arc-based aggregation: RTF 0.912 (3.5x).
- State-based aggregation: RTF 1.203 (2.6x).]

- A 10.5x speed-up with 240 (= 8 units x 30 cores) parallel hardware support.

Conclusion for parallel speech recognition

- Traversal by propagation (synchronization cost needed)
  - Coarse grain (active states) [multi-core]: maintain the active source
    states and propagate the out-arc computation results to the destination
    states.
  - Fine grain (active arcs) [many-core or GPU]: maintain the active arcs and
    propagate the active-arc computation results to the destination states.
- Traversal by aggregation (preparation overhead needed)
  - Coarse grain (active states): maintain the active destination states;
    determine all potential destination states and aggregate the incoming
    arcs.
  - Fine grain (active arcs): maintain the active arcs; group arcs with the
    same destination state and aggregate them locally to resolve write
    conflicts.

Part 2 – speech recognition with deep neural networks

- Neural networks were first used for speech recognition a few decades ago,
  but the performance was worse than GMMs.
- Single hidden layer
  - Lack of large data
  - Lack of processing power
  - Overfitting problem
- Resurrection of neural networks
  - Multiple hidden layers (deep neural networks)
  - Large training sets are now available.
  - RBM pretraining reduces the overfitting problem.
  - NNs return under the name of deep neural networks and show significantly
    better performance than GMMs in phoneme recognition.

Deep neural network for acoustic modeling

- Multiple layers of neural network.
- Each layer is (usually) an all-to-all connection followed by an activation
  function (e.g., sigmoid).
- If one layer contains 1,000 units, each layer demands 1 million weights and
  1 million multiply-add operations for one output computation.
- A five-layer DNN implies 5 million weights and operations (or 20 million
  with 2,000 units) -> a lot of memory and arithmetic operations.
- The output is the likelihood of the phonemes.

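A minimal NumPy sketch of such an acoustic-model forward pass; the layer sizes, the sigmoid hidden layers, and the softmax output over senone classes are illustrative assumptions, not the exact configuration used in the talk.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dnn_forward(x, weights, biases):
    """Forward pass of a fully connected DNN acoustic model for one frame:
    sigmoid hidden layers followed by a softmax over the output classes."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(h @ W + b)                 # all-to-all layer + activation
    logits = h @ weights[-1] + biases[-1]
    logits -= logits.max()                     # numerical stability
    p = np.exp(logits)
    return p / p.sum()                         # (pseudo) senone/phoneme likelihoods

rng = np.random.default_rng(0)
# Hypothetical sizes: 11 stacked 40-dim frames in, five 1,000-unit hidden
# layers, 2,000 output classes.
sizes = [440, 1000, 1000, 1000, 1000, 1000, 2000]
weights = [0.01 * rng.standard_normal((m, n)) for m, n in zip(sizes, sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
print(sum(W.size for W in weights), "weights in this toy configuration")
out = dnn_forward(rng.standard_normal(sizes[0]), weights, biases)
print(out.shape, round(float(out.sum()), 3))   # (2000,), probabilities summing to 1.0
```
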
Strategies for low-power speech recognition systems

- GPU-based systems are not power efficient.
- Approach for low power:
  - Do not use DRAM
  - All on-chip operations
  - Apply a low supply voltage to lower the switching power (P = k C V^2)
  - Parallel processing instead of time-multiplexing
  - No global data transfer
- Our solution:
  - Use low-precision weights
  - All on-chip operations
  - Thousands of distributed processing units (no global communication)

DNN implementation with fixed-point hardware

- Reducing the precision of the weights:
  - To reduce the memory size
  - To remove the multipliers
- Retrain-based fixed-point optimization of the weights (a sketch follows
  below):
  - Floating-point training
  - Step-by-step quantization from the first level to the last level
  - Retraining with the quantized weights

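A minimal sketch of one way to quantize trained weights to three levels {-delta, 0, +delta}; the threshold rule, the step-size heuristic, and the retraining outline in the comments are illustrative assumptions, not the exact procedure of the referenced work (Hwang & Sung, 2014).

```python
import numpy as np

def quantize_ternary(W, delta=None):
    """Quantize a float weight matrix to the three levels {-delta, 0, +delta}.
    The default step size is a simple magnitude-based heuristic (an assumption)."""
    if delta is None:
        delta = 0.7 * np.mean(np.abs(W))
    Wq = np.zeros_like(W)
    Wq[W > delta / 2] = delta
    Wq[W < -delta / 2] = -delta
    return Wq

# Illustrative outline of retrain-based optimization (pseudo-steps):
#   1. train the network in floating point
#   2. quantize the weights level by level with quantize_ternary()
#   3. retrain the remaining full-precision parameters so the network adapts
#      to the quantization error, then repeat for the next level
W = np.random.default_rng(0).standard_normal((1000, 1000))
Wq = quantize_ternary(W)
print(np.unique(Wq).size, "distinct weight values")   # 3
```
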
Good fixed-point DNN performance with 3-level weights

[Figure: recognition performance of fixed-point DNNs with 3-level weights.]

Multiplier-free, fully parallel DNN circuit

- The resulting hardware processing unit (PU) employs no multipliers, only
  adders and muxes.
- Each layer contains 1K PUs.
- With 1 W, about 1,000 times real-time processing speed.

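To see why three-level weights remove the multipliers, the sketch below evaluates a layer output with only selections, additions, and subtractions (a software analogy of the adder-and-mux datapath; the sizes and names are made up).

```python
import numpy as np

def ternary_matvec(Wq, x, delta):
    """y = Wq @ x where Wq holds only values in {-delta, 0, +delta}.
    Each output is a sum/difference of selected inputs (mux + adder tree),
    with a single scaling by delta at the end, i.e. no per-weight multiply."""
    plus = (Wq > 0).astype(x.dtype)       # selects inputs with weight +delta
    minus = (Wq < 0).astype(x.dtype)      # selects inputs with weight -delta
    return delta * (plus @ x - minus @ x)

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
delta = 0.05
Wq = delta * rng.integers(-1, 2, size=(1000, 1000))        # random 3-level weights
print(np.allclose(Wq @ x, ternary_matvec(Wq, x, delta)))   # True
```
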
Recurrent neural network (RNN)

- Speech recognition needs sequence processing (memorizing the past).
- Delayed recurrent paths allow a neural network to access the previous
  inputs.
- Long short-term memory (LSTM): a special RNN structure that can learn very
  long time dependencies.

[Figure: feedforward vs. recurrent network connections.]

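For reference, a minimal NumPy sketch of one LSTM cell step in its standard form (input, forget, and output gates plus a cell state); the weight layout and sizes are generic, not tied to any particular toolkit.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step. W: (4H, X), U: (4H, H), b: (4H,), gates stacked as
    [input, forget, output, candidate]."""
    H = h.size
    z = W @ x + U @ h + b
    i = sigmoid(z[0*H:1*H])         # input gate
    f = sigmoid(z[1*H:2*H])         # forget gate
    o = sigmoid(z[2*H:3*H])         # output gate
    g = np.tanh(z[3*H:4*H])         # candidate cell update
    c = f * c + i * g               # cell state carries long-term memory
    h = o * np.tanh(c)              # hidden state (cell output)
    return h, c

rng = np.random.default_rng(0)
X, H = 39, 256                      # e.g., 39-dim features, 256 memory cells
W = 0.1 * rng.standard_normal((4 * H, X))
U = 0.1 * rng.standard_normal((4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.standard_normal((10, X)):   # run over 10 input frames
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)                      # (256,)
```
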
RNN for language model

- Predict the next word/character given the history of words/characters.
- Input: a one-hot encoded word/character, x_t.
- Output: probabilities or confidences of the next words/characters,
  e.g., P(x_{t+1} | x_{1:t}) for all x_{t+1}.
- Delayed recurrent paths allow RNNs to access the previous inputs.

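A toy character-level RNN language model step in NumPy (a plain recurrent layer plus a softmax output); the vocabulary, sizes, and random initialization are made up, and the model is untrained, so it only illustrates the data flow.

```python
import numpy as np

vocab = list("abcdefghijklmnopqrstuvwxyz_.'")        # 29 characters
V, H = len(vocab), 128
rng = np.random.default_rng(0)
Wxh = 0.01 * rng.standard_normal((H, V))             # input-to-hidden
Whh = 0.01 * rng.standard_normal((H, H))             # recurrent (delayed) path
Why = 0.01 * rng.standard_normal((V, H))             # hidden-to-output

def lm_step(ch, h):
    """Consume one character, return P(next char | history) and the new state."""
    x = np.zeros(V); x[vocab.index(ch)] = 1.0        # one-hot input x_t
    h = np.tanh(Wxh @ x + Whh @ h)                   # hidden state summarizes x_{1:t}
    logits = Why @ h
    p = np.exp(logits - logits.max())
    return p / p.sum(), h                            # P(x_{t+1} | x_{1:t})

h = np.zeros(H)
for ch in "hello_":
    p_next, h = lm_step(ch, h)
print(vocab[int(np.argmax(p_next))])                 # most likely next character (untrained)
```
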
RNN-only speech recognition

- The CTC (connectionist temporal classification) objective function allows
  RNNs to learn sequences rather than frame-wise targets.
  - CTC only concerns the sequential order of the output labels, not their
    exact timing.
- With CTC-based training, RNNs can directly learn to generate text from
  speech data without any prior knowledge of linguistic structures or
  dictionaries.
- A long short-term memory (LSTM) RNN is used.
- A bidirectional architecture is employed to access the future input as well
  as the past input.
- A sketch of greedy CTC decoding appears below.

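To make the CTC output convention concrete (only the order of the labels matters, not their timing), the sketch below performs greedy best-path decoding on per-frame label probabilities: take the argmax per frame, collapse repeated labels, and drop the blank symbol. The toy probabilities are invented.

```python
import numpy as np

def ctc_greedy_decode(frame_probs, labels, blank=0):
    """Best-path CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    path = np.argmax(frame_probs, axis=1)            # one label index per frame
    decoded, prev = [], None
    for idx in path:
        if idx != prev and idx != blank:             # collapse repeats, skip blank
            decoded.append(labels[idx])
        prev = idx
    return "".join(decoded)

labels = ["-", "c", "a", "t"]                        # index 0 is the CTC blank
# 8 frames of made-up per-frame label probabilities
frame_probs = np.array([
    [0.9, 0.1, 0.0, 0.0],   # -
    [0.2, 0.7, 0.1, 0.0],   # c
    [0.1, 0.8, 0.1, 0.0],   # c (repeat, collapsed)
    [0.8, 0.1, 0.1, 0.0],   # -
    [0.1, 0.0, 0.8, 0.1],   # a
    [0.7, 0.1, 0.1, 0.1],   # -
    [0.1, 0.0, 0.1, 0.8],   # t
    [0.8, 0.0, 0.1, 0.1],   # -
])
print(ctc_greedy_decode(frame_probs, labels))        # "cat"
```
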
RNN-only speech recognition: results

- Input: 39-dim MFCC features (D_A_E)
- Output: 29-dim characters (a-z _ . ')
- Network topology: 3 BLSTM layers, 256 memory blocks per LSTM layer, 3.8 M
  weights
- Training data: WSJ0 + WSJ1 + TIMIT training set (66,319,215 frames); about
  one month of training with a single-thread CPU
- Test data: TIMIT complete test set
- Character error rate: 17.94%

Examples (first line: reference, second line: recognition output):

well_junior_didn't_he_eat_only_one_and_mr_henry_didn't_even_do_that_well_
wel_junor_didn't_he_eat_only_one_and_mr._henry_didn't_even_do_that_well_
(word-level learning)

eternity_is_no_time_for_recriminations_
auttornity_is_no_time_for_recriminations_
(the word "recrimination" is not in the training set)

laugh_dance_and_sing_if_fortune_smiles_upon_you_
lafh_dance_and_sing_o_fortune_smile_supponyou_

Characteristics of DNN (and RNN) based speech recognition

- Disadvantages: a large number of weights and a large number of arithmetic
  operations.
- Many advantages leading to low power consumption:
  - Highly parallel and regular computation (most operations are inner
    products over 512 ~ 2K inputs).
  - Non-volatile memory can be used for a recognition-only (off-line training)
    architecture.
    - High density (10 times that of SRAM), low power, no standby power.
  - Low-precision arithmetic: the DNN architecture is very robust to
    quantization when quantization is included in the training procedure.
    - Weight memory savings and arithmetic unit size reduction.
  - Thousands of distributed arithmetic units.
    - Mostly local connections.

Conclusion

Architectural trends for speech recognition:

- ~ 1990
  - Algorithm (vocabulary size): HMM (hidden Markov model) (~1K)
  - Architecture: programmable DSP or embedded CPU
  - No. of PEs (processing elements): 1 or a few
  - Memory: SRAM
- 1990 ~ 2010
  - Algorithm (vocabulary size): Gaussian mixture acoustic model + HMM +
    language model (64K)
  - Architecture: GPU or multi-core CPU
  - No. of PEs: up to a few hundred (floating-point arithmetic)
  - Memory: DRAM
- 2010 ~
  - Algorithm (vocabulary size): deep neural network (64K)
  - Architecture: distributed memory + arithmetic logic
  - No. of PEs: thousands (low-precision ALUs)
  - Memory: non-volatile memory
- Power for real-time processing (one person): from about 10 W down to less
  than 100 mW (on-chip memory, local connection, low precision).

References

- You, Kisun, et al. "Parallel scalability in speech recognition." IEEE Signal
  Processing Magazine 26.6 (2009): 124-135.
- Choi, Y. K., You, K., Choi, J., and Sung, W. "A real-time FPGA-based
  20,000-word speech recognizer with optimized DRAM access." IEEE Transactions
  on Circuits and Systems I: Regular Papers 57.8 (2010): 2119-2131.
- Graves, Alex, and Navdeep Jaitly. "Towards end-to-end speech recognition
  with recurrent neural networks." Proceedings of the 31st International
  Conference on Machine Learning (ICML-14), 2014.
- Hwang, Kyuyeon, and Wonyong Sung. "Fixed-point feedforward deep neural
  network design using weights +1, 0, and -1." 2014 IEEE Workshop on Signal
  Processing Systems (SiPS), 2014.