1D-1 A 40-nm 144-mW VLSI Processor for Real-time 60-kWord Continuous Speech Recognition Guangji He,Takanobu Sugahara,Tsuyoshi Fujinaga,Yuki Miyamoto, Hiroki Noguchi, Shintaro Izumi, Hiroshi Kawaguchi, and Masahiko Yoshimoto Kobe University, Kobe, 657-8501 Japan [email protected] Abstract - We have developed a low-power VLSI chip for 60kWord real-time continuous speech recognition based on a context-dependent Hidden Markov Model (HMM). Our implementation includes a cache architecture using locality of speech recognition, beam pruning using a dynamic threshold, two-stage language model searching, highly parallel Gaussian Mixture Model (GMM) computation based on the mixture level, a variable-frame look-ahead scheme, and elastic pipeline operation between the Viterbi transition and GMM processing. Results show that our implementation achieves 95% bandwidth reduction (70.86 MB/s) and 78% required frequency reduction (126.5 MHz). The test chip, fabricated using 40 nm CMOS technology, contains 1.9 M transistors for logic and 7.8 Mbit on-chip memory. It dissipates 144 mW at 126.5 MHz and 1.1 V for 60 kWord real-time continuous speech recognition. extracted on a frame-by-frame basis. Step 2: GMM calculation: a phonemic-model GMM is read and GMM probability, log [bj (xt)], is calculated for all active state nodes. Step 3: Viterbi transition: Gt (j) is calculated for all active state nodes using GMM probabilities. Step 4: Beam pruning: according to the beam width, active state nodes having a higher score (accumulated probability) are selected; the others are dumped. Step 5: Output sentence: The word-end state and having the maximum score is output as a speech recognition result after final-frame calculation and determination of the transition sequence. III. Architecture The overall chip architecture is depicted in Fig.3. The proposed architecture is comprised of a global Sequencer, GMM-core, Viterbi-core, double GMM result Buffer to support pipeline operation. The GMM core have 16 mixture processing blocks each of which has 1.6Kb register to preserve the mixture parameter. All the blocks are processed simultaneously for the look-ahead frames which is saved in MFCC buffer. The parameters will be reused until all the look-ahead frames are processed. The mixture results will soon be caculated by the add-log processor based on a look-up table. The mixture computation, Add-log calculation, and parameter reading are processed in pipeline. Figure.4 shows the viterbi architecture, we find that the Active Node, Active Node Map and the Bi-gram accounts for 96% of the memory bandwidth for viterbi transition as shown in Fig.5. So we introduce all the active node data and part of the bi-gram and active node map data into the cache memory by using the locality of speech recognition that some of the data which has been used for this frame may have a high probability to be reused in the following frames. We maximize the cache memory size of bi-gram and active node map to 0.73Mbit, which can give a hit rate of 75%. I Introduction A hardware approach for large vocabulary continutous speech recognition (LVCSR), with implementation by VLSI or an FPGA is requested recently, especially for mobile equipment and the intelligent robot, because of its advantageous processing speed and power consumption. Lin et al. investigated FPGA implementations for 5k-word speech recognition [1], but it consumes too much power and is not cost-effective as it needs two FPGAs .Yoshizawa et al. proposed a scalable architecture for speech recognition[2], but their chip costs 136mW even with a limited vocabulary of 800-word. Choi et al. investigated FPGA implementations for 5k-word and 20k-word[3, 4], they implemented special memory interface for several part of the recognition engine to apply optimized DRAM access, which reduces the delay for loading data, but they did not reduce the large amount of external DRAM access which will cause a lot of power consumption for IO. Comparison of power consumption among the recent published hardware-based speech recognizers is shown in Fig.1. To date, the hardware approach has never achieved real-time operation with a 60-k word language model because of the numerous computation work-load and external memory bandwidth. For low power and real-time 60-k processing, we have to reduce both the memory bandwidth and operating clock frequency. V. Implementation The chip has been fabricated in 40nm CMOS technology as shown in Fig.6. It occupies 2.22.5 mm2 containing 1.9M transistors for Logic and 7.75Mbit on-chip SRAM (TableI). The clock gating is implemented in GMM result RAM and GMM Core. Figure.7 shows a measured data of power consumption versus operating frequencies versus beam_width. The bigger beam_width can give higher accuracy, but it will also cause larger computation, This chip can work at 126.5MHz operation for a beam_width of 3000 while the power consumption is 144mw with a accuracy of II. Algorithm Figure. 2 presents the speech recognition flow with the HMM algorithm. The following items describe concrete stages. Step 1: Feature vector extraction: a feature vector is 978-1-4673-3030-5/13/$31.00 ©2013 IEEE 71 1D-1 91.39% and 140MHz for a beam_width of 3800 while the power consumption is 204.8mW with a accuracy of 91.92%. HMM Tree Dictionary, N-gram DB, Transition DB Acknowledgements Active Node Map Cache Divider Output Buffer MUX Beam threshold ADD The VLSI chip in this study was fabricated in the chip fabrication program of VLSI Design and Education Center (VDEC), the University of Tokyo. This development was performed by the author for STARC as part of the Japanese Ministry of Economy, TrDGHDQG,QGXVWU\VSRQVRUHG³6LOLFRQ Implementation Support Program for Next Generation Semiconductor Circuit $UFKLWHFWXUHV´ CMP ADD ADD MUX Active node Workspace RAM0 Trellis Buffer CMP N-gram Shared Tree DB N-gram Cache Active node Workspace RAM1 Viterbi Core GMM RESULT RAM Fig.4 Viterbi Cache Architecture 80 [1]6<RVKL]DZDHWDO³6FDODEOHDUFKLWHFWXUHIRUZRUG+00-based speech recognition and implementation in complete syVWHP´IEEE Trans. Circuits Syst. I, Vol. 53, pp. 70-77, 2006. [2]E. C. Lin et al., ³A multi-FPGA 10x-real-time high-speed search engine for a 5000-word vocabulary speech recognizer,´Proc. 17th ACM/SIGDA Int. Symp. FPGA, 2009. [3]< &KRL HW DO ³A real-time FPGA-based 20,000-word speech recognizer with optimized DRAM access,´ IEEE Trans. Circuits Syst. I, Vol. 57, pp. 95-105, 2006. [4]< &KRL HW DO ³FPGA-based implementation of a real-time 5000-word continuous speech recognizer,´Proc. 16th Eur. Signal Process. Conf., 2008. >@70D HW DO ³1RYHO FL-backoff scheme for real-time embedded VSHHFKUHFRJQLWLRQ´3URF,(((,&$663 [6]G. He et al.³$QPP:9/6,SURFHVVRUIRUUHDOWLPH NZRUGFRQWLQXRXVVSHHFKUHFRJQLWLRQ´3URF,(((CICC, 2011. Bi-gram 37% Others 10,000 0 20,000 65 60 55 This chip : 0.73Mbit (75%) 50 45 0.23 Total: 2581MB/s 0.33 0.43 0.53 0.63 0.73 Cache size [Mb] Fig.5 External memory bandwidth of Viterbi and Cache hit rate 2.2mm Input buffer N-gram Cache N-gram Cache Active Node Work place AddLog Table AddLog Table 2.5mm Viterbi Core Controllers GMM Result RAM 30,000 70 GMM Result RAM [5] Intel(2010) 20,000-words Atom Z530 2.2[W] 75 Map Cache [1] Hokkaido Univ 800-words CMOS 0.18um 0.136[W](2006) GMM Core PLL This work [6] 60,000-words 0.144[W] 0.5 0 Active Node Map Map Cache Power [W] 1 Active Node 22% 37% [4] CMU(2009) 5.000-words FPGA 10[W] 10 Hit rate of cache [%] References 40,000 50,000 [words] 60,000 Fig. 6. Chip microphotograph. Fig. 1. Comparison of power consumption. TABLE I Summary of chip implementation Recursion while not final frame > @ max G t -1 (i ) log aij log P( wt ) log b j ( xt ) Step 3: Feature GMM Viterbi vector x computation log b j ( xt ) Search extraction t log aij bj THIS CHIP Step 4: Step 5: Sort Output Final frame sentence Transister Count (Logic ) wࠉ log P( wt ) I want to be «.. N-gram model Phoneme HMM Process Technology Core area Chip area k On-Chip Memory (SRAM) Supply voltage I/O voltage Evaluation environment Operating Frequency Measured Power (core) Measured Power (IO) Leakage current Fig. 2 Speech recognition flow with HMM algorithm computation GMM Parameter RAM Tree Dictionary N-gram DB Transition DB MFCC 32bit 32bit 32bit Node_to_token_list Trellis_information 32bit 33bit 32bit GMM_BUS Viterbi_BUS 32bit 40Kb GMM buffer 0.4Mbit MFCC RAM N-gram Cache 32bit 8bit REG REG Gaussian processor mix0 Gaussian processor mix1 Gaussian processor mix15 32bit 32bit Add Log Processor Add Log Processor Global Sequencer 1.6Kb REG 14Kb Viterbi Cache 160 2Kb 0.4Mbit N-gram shared tree_DB Output Buffer Frequency (MHz) 26Kb 8bit MUX Internal Word Transition & Cross Word Transition 32bit 120 Active node Active node Information 0 Information 1 83.8 MHz 16bit 0.5Mb GMM_CORE 0 Internal WS 126.5 MHz 91.5 300 204.8 mW(Vdd = 1.32V) 200 98 mW 1000 124.8 mW 100 91.39% 91 90.5 90 89.5 89 88.5 88 2000 3000 4000 Beam_width 5000 0 87.5 0 500 1000 1500 2000 2500 3000 3500 4000 4500 Beam width VITERBI_CORE 25bit GMM_SCORE_RAM 0.5Mb 112 MHz 91.94% Beam_width: 3800 Accuracy: 91.92% 92 140 MHz 144 mW(Vdd = 1.16V) 40 92.5 400 Frequency Power 80 1.5Mb Add Log Table GMM Viterbi Other Total 40-nm CMOS 2.2 mm 㽢㻌㻌㻞㻚㻡㻌㼙㼙 㻡㻌㼙㼙㻌㽢㻌㻌㻡㻌㻌㼙㼙 1.1 M 0.3 M 0.5 M 1.9 M 7.8 Mbit 1.1V 3.3V ADVAN SOC Tester System 126.5 MHz for 60 kWord real-time processing 144 mW 58.2 mW 2.28 mA Accuracy [%] i j 1, j Step 2: Power (mW) G t ( j) Step 1: GMM Score RAM0 25bit 2.5Mb GMM Score RAM1 2.5Mb Fig. 7. Measurement results of frequency vs. power consumption vs. beam-width vs recognition accuracy. LSI Fig.3 Proposed speech recognition adrchitecture 72
© Copyright 2026 Paperzz