A 40-nm 144-mW VLSI Processor for Real-time

1D-1
A 40-nm 144-mW VLSI Processor for Real-time
60-kWord Continuous Speech Recognition
Guangji He,Takanobu Sugahara,Tsuyoshi Fujinaga,Yuki Miyamoto,
Hiroki Noguchi, Shintaro Izumi, Hiroshi Kawaguchi, and Masahiko Yoshimoto
Kobe University, Kobe, 657-8501 Japan
[email protected]
Abstract - We have developed a low-power VLSI chip for 60kWord real-time continuous speech recognition based on a
context-dependent Hidden Markov Model (HMM). Our
implementation includes a cache architecture using locality of
speech recognition, beam pruning using a dynamic threshold,
two-stage language model searching, highly parallel Gaussian
Mixture Model (GMM) computation based on the mixture level,
a variable-frame look-ahead scheme, and elastic pipeline
operation between the Viterbi transition and GMM processing.
Results show that our implementation achieves 95% bandwidth
reduction (70.86 MB/s) and 78% required frequency reduction
(126.5 MHz). The test chip, fabricated using 40 nm CMOS
technology, contains 1.9 M transistors for logic and 7.8 Mbit
on-chip memory. It dissipates 144 mW at 126.5 MHz and 1.1 V
for 60 kWord real-time continuous speech recognition.
extracted on a frame-by-frame basis. Step 2: GMM
calculation: a phonemic-model GMM is read and GMM
probability, log [bj (xt)], is calculated for all active state
nodes. Step 3: Viterbi transition: Gt (j) is calculated for all
active state nodes using GMM probabilities. Step 4: Beam
pruning: according to the beam width, active state nodes
having a higher score (accumulated probability) are selected;
the others are dumped. Step 5: Output sentence: The
word-end state and having the maximum score is output as a
speech recognition result after final-frame calculation and
determination of the transition sequence.
III. Architecture
The overall chip architecture is depicted in Fig.3. The
proposed architecture is comprised of a global Sequencer,
GMM-core, Viterbi-core, double GMM result Buffer to
support pipeline operation. The GMM core have 16 mixture
processing blocks each of which has 1.6Kb register to
preserve the mixture parameter. All the blocks are processed
simultaneously for the look-ahead frames which is saved in
MFCC buffer. The parameters will be reused until all the
look-ahead frames are processed. The mixture results will
soon be caculated by the add-log processor based on a
look-up table. The mixture computation, Add-log calculation,
and parameter reading are processed in pipeline. Figure.4
shows the viterbi architecture, we find that the Active Node,
Active Node Map and the Bi-gram accounts for 96% of the
memory bandwidth for viterbi transition as shown in Fig.5.
So we introduce all the active node data and part of the
bi-gram and active node map data into the cache memory by
using the locality of speech recognition that some of the data
which has been used for this frame may have a high
probability to be reused in the following frames. We
maximize the cache memory size of bi-gram and active node
map to 0.73Mbit, which can give a hit rate of 75%.
I Introduction
A hardware approach for large vocabulary continutous
speech recognition (LVCSR), with implementation by
VLSI or an FPGA is requested recently, especially for
mobile equipment and the intelligent robot, because of its
advantageous processing speed and power consumption. Lin
et al. investigated FPGA implementations for 5k-word
speech recognition [1], but it consumes too much power and
is not cost-effective as it needs two FPGAs .Yoshizawa et al.
proposed a scalable architecture for speech recognition[2],
but their chip costs 136mW even with a limited vocabulary
of
800-word. Choi et al. investigated FPGA
implementations for 5k-word and 20k-word[3, 4], they
implemented special memory interface for several part of the
recognition engine to apply optimized DRAM access, which
reduces the delay for loading data, but they did not reduce
the large amount of external DRAM access which will
cause a lot of power consumption for IO. Comparison of
power consumption among the recent published
hardware-based speech recognizers is shown in Fig.1. To
date, the hardware approach has never achieved real-time
operation with a 60-k word language model because of the
numerous computation work-load and external memory
bandwidth. For low power and real-time 60-k processing,
we have to reduce both the memory bandwidth and
operating clock frequency.
V. Implementation
The chip has been fabricated in 40nm CMOS technology as
shown in Fig.6. It occupies 2.2™2.5 mm2 containing 1.9M
transistors for Logic and 7.75Mbit on-chip SRAM (TableI).
The clock gating is implemented in GMM result RAM and
GMM Core. Figure.7 shows a measured data of power
consumption versus operating frequencies versus
beam_width. The bigger beam_width can give higher
accuracy, but it will also cause larger computation, This chip
can work at 126.5MHz operation for a beam_width of 3000
while the power consumption is 144mw with a accuracy of
II. Algorithm
Figure. 2 presents the speech recognition flow with the
HMM algorithm. The following items describe concrete
stages. Step 1: Feature vector extraction: a feature vector is
978-1-4673-3030-5/13/$31.00 ©2013 IEEE
71
1D-1
91.39% and 140MHz for a beam_width of 3800 while the
power consumption is 204.8mW with a accuracy of 91.92%.
HMM Tree Dictionary, N-gram DB, Transition DB
Acknowledgements
Active
Node Map
Cache
Divider
Output
Buffer
MUX
Beam
threshold
ADD
The VLSI chip in this study was fabricated in the chip
fabrication program of VLSI Design and Education Center
(VDEC), the University of Tokyo. This development was
performed by the author for STARC as part of the Japanese
Ministry of Economy, TrDGHDQG,QGXVWU\VSRQVRUHG³6LOLFRQ
Implementation Support Program for Next Generation
Semiconductor Circuit $UFKLWHFWXUHV´
CMP
ADD
ADD
MUX
Active node
Workspace RAM0
Trellis
Buffer
CMP
N-gram
Shared
Tree DB
N-gram
Cache
Active node
Workspace RAM1
Viterbi Core
GMM RESULT RAM
Fig.4 Viterbi Cache Architecture
80
[1]6<RVKL]DZDHWDO³6FDODEOHDUFKLWHFWXUHIRUZRUG+00-based
speech recognition and implementation in complete syVWHP´IEEE
Trans. Circuits Syst. I, Vol. 53, pp. 70-77, 2006.
[2]E. C. Lin et al., ³A multi-FPGA 10x-real-time high-speed search
engine for a 5000-word vocabulary speech recognizer,´Proc. 17th
ACM/SIGDA Int. Symp. FPGA, 2009.
[3]< &KRL HW DO ³A real-time FPGA-based 20,000-word speech
recognizer with optimized DRAM access,´ IEEE Trans. Circuits
Syst. I, Vol. 57, pp. 95-105, 2006.
[4]< &KRL HW DO ³FPGA-based implementation of a real-time
5000-word continuous speech recognizer,´Proc. 16th Eur. Signal
Process. Conf., 2008.
>@70D HW DO ³1RYHO FL-backoff scheme for real-time embedded
VSHHFKUHFRJQLWLRQ´3URF,(((,&$663
[6]G. He et al.³$QPP:9/6,SURFHVVRUIRUUHDOWLPH
NZRUGFRQWLQXRXVVSHHFKUHFRJQLWLRQ´3URF,(((CICC, 2011.
Bi-gram
37%
Others
10,000
0
20,000
65
60
55
This chip : 0.73Mbit (75%)
50
45
0.23
Total: 2581MB/s
0.33
0.43
0.53
0.63
0.73
Cache size [Mb]
Fig.5 External memory bandwidth of Viterbi and Cache hit rate
2.2mm
Input buffer
N-gram
Cache
N-gram
Cache
Active
Node
Work place
AddLog
Table
AddLog
Table
2.5mm
Viterbi Core
Controllers
GMM Result RAM
30,000
70
GMM Result RAM
[5] Intel(2010)
20,000-words
Atom Z530 2.2[W]
75
Map Cache
[1] Hokkaido Univ
800-words
CMOS 0.18um
0.136[W](2006)
GMM Core
PLL
This work [6]
60,000-words
0.144[W]
0.5
0
Active Node Map
Map Cache
Power [W]
1
Active Node
22% 37%
[4] CMU(2009)
5.000-words
FPGA 10[W]
10
Hit rate of cache [%]
References
40,000
50,000
[words]
60,000
Fig. 6. Chip microphotograph.
Fig. 1. Comparison of power consumption.
TABLE I
Summary of chip implementation
Recursion while not final frame
>
@
max G t -1 (i ) log aij log P( wt ) log b j ( xt )
Step 3:
Feature
GMM
Viterbi
vector x computation log b j ( xt )
Search
extraction t
log aij
bj
THIS CHIP
Step 4:
Step 5:
Sort
Output
Final frame sentence
Transister Count (Logic )
wࠉ
log P( wt )
I want to
be «..
N-gram model
Phoneme HMM
Process Technology
Core area
Chip area
k
On-Chip Memory (SRAM)
Supply voltage
I/O voltage
Evaluation environment
Operating Frequency
Measured Power (core)
Measured Power (IO)
Leakage current
Fig. 2 Speech recognition flow with HMM algorithm
computation
GMM
Parameter
RAM
Tree Dictionary
N-gram DB
Transition DB
MFCC
32bit
32bit
32bit
Node_to_token_list
Trellis_information
32bit
33bit
32bit
GMM_BUS
Viterbi_BUS
32bit
40Kb
GMM
buffer
0.4Mbit
MFCC
RAM
N-gram
Cache
32bit
8bit
REG
REG
Gaussian
processor
mix0
Gaussian
processor
mix1
Gaussian
processor
mix15
32bit
32bit
Add Log
Processor
Add Log
Processor
Global
Sequencer
1.6Kb
REG
14Kb
Viterbi
Cache
160
2Kb
0.4Mbit
N-gram
shared
tree_DB
Output
Buffer
Frequency (MHz)
26Kb
8bit
MUX
Internal Word Transition
& Cross Word Transition
32bit
120
Active node
Active node
Information 0 Information 1
83.8 MHz
16bit
0.5Mb
GMM_CORE
0
Internal WS
126.5 MHz
91.5
300
204.8 mW(Vdd = 1.32V)
200
98 mW
1000
124.8 mW
100
91.39%
91
90.5
90
89.5
89
88.5
88
2000 3000
4000
Beam_width
5000
0
87.5
0
500
1000
1500
2000
2500
3000
3500
4000
4500
Beam width
VITERBI_CORE
25bit
GMM_SCORE_RAM
0.5Mb
112 MHz
91.94%
Beam_width: 3800
Accuracy: 91.92%
92
140 MHz
144 mW(Vdd = 1.16V)
40
92.5
400
Frequency
Power
80
1.5Mb
Add Log
Table
GMM
Viterbi
Other
Total
40-nm CMOS
2.2 mm 㽢㻌㻌㻞㻚㻡㻌㼙㼙
㻡㻌㼙㼙㻌㽢㻌㻌㻡㻌㻌㼙㼙
1.1 M
0.3 M
0.5 M
1.9 M
7.8 Mbit
1.1V
3.3V
ADVAN SOC Tester System
126.5 MHz for 60 kWord real-time processing
144 mW
58.2 mW
2.28 mA
Accuracy [%]
i j 1, j
Step 2:
Power (mW)
G t ( j)
Step 1:
GMM Score RAM0
25bit
2.5Mb
GMM Score RAM1
2.5Mb
Fig. 7. Measurement results of frequency vs. power consumption
vs. beam-width vs recognition accuracy.
LSI
Fig.3 Proposed speech recognition adrchitecture
72