DIGITAL SIGNAL PROCESSING ARCHITECTURE FOR LARGE VOCABULARY SPEECH RECOGNITION
Wonyong Sung, School of Electrical and Computer Engineering, Seoul National University
July 6, 2015, Cluj-Napoca, Romania

Speech recognition
- The most natural human-machine interface.
- A long history of research (reaching back to the analog technology age) with extensive use of DSP technology; still imperfect.
- Nowadays understood as a kind of machine learning problem.
- Diverse applications: from keyword spotting (~100 words, e.g., "one two three four ...") to multi-language understanding (>100K words).

Hidden Markov model for speech recognition
- A hidden Markov model contains states (tri-phone states) and state transitions that are followed according to the speech input and the network connections.
- It combines three knowledge sources:
  - Acoustic model: phoneme representation (Gaussian mixture models for emission probability computation)
  - Pronunciation model: vocabulary (lexicon)
  - Sentence or language model

HMM-based speech recognition implementation
- 1. Feature extraction: speech to acoustic parameters
  - MFCC (Mel-Frequency Cepstrum Coefficients)
  - Independent of the vocabulary size
- 2. Emission probability computation
  - Generates the log-likelihood of each hypothesis state
  - Larger models are needed for a large vocabulary (4 to 128 Gaussian mixture components per state)
- 3. Viterbi beam search (a small sketch is given after the recognition-network example below)
  - Dynamic programming through the network
  - Highly complex 'compare and select' operations
- Per-frame pipeline: input speech -> feature extraction -> emission probability computation -> search -> result

Algorithm and implementation trend
- Up to ~2000: hidden Markov model based; single-CPU or programmable-DSP implementations.
- ~2000 to date: parallel computer architectures (GPU, multi-core), FPGA, VLSI.
- In the future: neural networks (deep neural networks, recurrent neural networks); GPU or neuromorphic-system based implementations.
- In this talk:
  - Part 1: multi-core CPU and GPU (parallel computer) based implementation of large vocabulary speech recognition
  - Part 2: deep neural network based implementations

Speech recognition for a large vocabulary (>60K words)
- Not only does the lexicon size increase, but more precise acoustic modeling is also required:
  - The network complexity of the HMM grows very rapidly: high-order (32-128 component) Gaussian mixture models and more tri-phone states.
  - Many states or arcs must be pruned during the search, resulting in very irregular computation.
- A high-complexity language model is needed: 3-gram or higher is desired, which requires a large memory.
- Reference: You, Kisun, et al. "Parallel scalability in speech recognition." IEEE Signal Processing Magazine 26.6 (2009): 124-135.

Recognition network example
[Figure: a compiled and optimized WFST recognition network built from (1) an HMM acoustic phone model with 17,550 triphones, where each phone state is a 128-component Gaussian mixture (the distance to each mixture component is computed, then a weighted sum over all components); (2) a pronunciation model over a 58K-word vocabulary (e.g., HOP = hh aa p, ON = aa n, POP = p aa p); and (3) a bigram language model with 168K bigram transitions. The resulting network has about 4 million states and 10 million transition arcs.]
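To make the 'compare and select' dynamic programming of the Viterbi beam search step concrete, here is a minimal sketch over a toy network. The graph, emission scores, and beam width are hypothetical illustrations, not the decoder or WFST used in this work.

```python
# Minimal Viterbi beam search sketch over a toy HMM/WFST-like network.
# The graph, emission scores, and beam width are hypothetical illustrations.

import math

# arcs[src] = list of (dst, log_transition_prob, emitting_state_id)
arcs = {
    0: [(1, math.log(0.6), "s1"), (2, math.log(0.4), "s2")],
    1: [(1, math.log(0.5), "s1"), (3, math.log(0.5), "s3")],
    2: [(3, math.log(1.0), "s3")],
    3: [],
}

def decode(frames, beam=10.0):
    """frames: list of dicts mapping emitting_state_id -> log emission prob."""
    active = {0: 0.0}                      # state -> best log score so far
    for obs in frames:
        nxt = {}
        for src, score in active.items():
            for dst, log_a, sid in arcs[src]:
                new = score + log_a + obs.get(sid, -1e9)       # 'compare and select'
                if new > nxt.get(dst, -1e9):
                    nxt[dst] = new
        best = max(nxt.values())
        active = {s: v for s, v in nxt.items() if v > best - beam}  # beam pruning
    return active

frames = [{"s1": -1.0, "s2": -2.0, "s3": -3.0},
          {"s1": -2.5, "s2": -1.5, "s3": -0.5}]
print(decode(frames))
```

The beam pruning is what makes the surviving working set change from frame to frame, which is the source of the irregularity discussed in the parallelization slides that follow.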
Multi-core CPU and GPU
- Multicore: two to tens of CPU cores on each chip, with good single-thread performance (e.g., Intel Core i7, 45 nm, 4 cores).
- GPU (manycore): hundreds of processing cores, maximizing computation throughput at the expense of single-thread performance (e.g., NVIDIA GTX285, 55 nm, 30 cores; Intel Xeon Phi, 96 cores).

Multicore/GPU architecture trends
- Increasing numbers of cores per die:
  - Intel Nehalem: 4-8 cores
  - Intel Xeon Phi: about 60 cores
  - NVIDIA GTX285: 30 cores (each core contains 8 units)
- Increasing vector (SIMD) unit width:
  - Intel Core i7 (Nehalem): 8-way to 16-way
  - Intel Xeon Phi: 16-way
  - NVIDIA GTX285: 8-way (physical), 32-way (logical)
[Figure: scalar cores alongside a 4x32-bit vector (SIMD) unit computing (a0,a1,a2,a3) + (b0,b1,b2,b3) = (a0+b0, a1+b1, a2+b2, a3+b3) in one operation.]

Parallel scalability
- Can we achieve a speed-up of (SIMD width x number of cores)? E.g., with an 8-way SIMD, 8-core CPU: 64x speed-up.
- Some parts of the speech recognition algorithm are quite parallel-scalable, but other parts are not:
  - Emission probability computation: the computation flow is quite regular, so scalability is good.
  - Hidden Markov network search: a quite irregular network search, with packing overhead for SIMD and synchronization overhead.

Emission probability computation
- Very regular; a good candidate workload for parallelization (a small sketch is given at the end of this part).
- Tied HMM states (senones): usually up to several thousand (depends on the training condition).
- 39-dimensional MFCC feature vectors; Gaussian mixtures of 16, 32, or 128 components.
- For a diagonal-covariance mixture component m of senone s:
  log b_m(O_t; s) = C_m - (1/2) * Σ_{k=1..K} (x_k - μ_{m,k})² / σ_{m,k}²
- Parallelization: triphones (senones) at the core level, Gaussian components at the SIMD level.
- Simple dynamic workload distribution: count the number of active triphones (active senone flags and addresses) and distribute them evenly to each thread.

Irregular network search vs. parallel scalability
- Parallel graph traversal through an irregular network with millions of arcs and states.
- Vector (SIMD) unit efficiency: SIMD operations demand packed data, so packing (coalescing) overhead may be needed.
- Synchronization: arc traversal induces write conflicts when updating destination states, and the chance of conflicts increases with the number of cores.
- The working set changes continuously, guided by the input.
[Figure: cores with private caches traversing the HMM network; the bottlenecks are synchronization and vector-unit utilization.]

Applying SIMD in network search
[Figure: SIMD arc update on an example network. Source-state costs, observation probabilities, and arc weights are coalesced (gathered) into packed vectors; packed adds and compares compute candidate destination-state costs; the updated destination-state costs are then scattered back.]

SIMD efficiency
[Figure: active states mapped onto SIMD lanes over time, showing the extra work spent on inactive lanes. Chart: speedup over the sequential case and SIMD utilization in state-based traversal as the SIMD width grows from 1 to 32.]

Parallelization choice for network search
- A coarse-grained vs. fine-grained job partitioning problem: active-state based or active-arc based traversal.
- In the example graph, states (2), (3), (4), (8) are the work units of active-state based parallelization, while arcs (b), (c), (d), (e), ..., (j) are the work units of active-arc based parallelization.
- Active-state based traversal is simpler and coarse-grained, but the number of arcs per state varies; active-arc based traversal is fine-grained.
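A minimal NumPy sketch of the emission probability computation above: the diagonal-covariance GMM log-likelihood, followed by the simple even distribution of active senones to workers. The sizes, random parameters, and the four-way split are hypothetical illustrations, not the talk's implementation.

```python
# Diagonal-covariance GMM emission probability sketch:
#   log b_m(O_t; s) = C_m - 0.5 * sum_k (x_k - mu_{m,k})^2 / sigma_{m,k}^2,
# then a log-sum-exp over the mixture components of each senone.
# Sizes and the worker split are hypothetical illustrations.

import numpy as np

K, M, S = 39, 16, 2000                         # feature dim, mixtures/senone, senones
rng = np.random.default_rng(0)
mu   = rng.normal(size=(S, M, K))              # component means
var  = rng.uniform(0.5, 2.0, size=(S, M, K))   # diagonal variances
logc = rng.normal(size=(S, M))                 # per-component constant C_m (weight + normalizer)
x = rng.normal(size=K)                         # one frame of MFCC features

def senone_loglik(senones):
    d = x - mu[senones]                                        # (n, M, K)
    log_b = logc[senones] - 0.5 * np.sum(d * d / var[senones], axis=-1)
    return np.logaddexp.reduce(log_b, axis=-1)                 # mix over components

# Count the active senones and distribute them evenly to each worker (thread).
active = np.flatnonzero(rng.random(S) < 0.3)                   # active senone flags
chunks = np.array_split(active, 4)                             # 4 workers
scores = np.concatenate([senone_loglik(c) for c in chunks])
print(scores.shape)
```

Because every active senone runs exactly the same arithmetic, an even split of the active list keeps the workers balanced, which is why this stage scales so much better than the network search.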
Thread-scheduling issues
- Should the update of a destination node be assigned to a single thread or to multiple threads? The minimum of the incoming costs must be adopted (Viterbi).
- Traversal by propagation: when states (2), (3), and (4) are processed in parallel, the update of state (5) can be performed by multiple threads, creating a write-conflict problem. Remedies: atomic operations for the update, privatization (building private buffers), or a lock-based implementation.
- Traversal by aggregation: state (5) is updated by a single thread, so there is no write-conflict problem, but preparation (grouping the incoming arcs) is needed.
- A small sketch contrasting the two strategies is given after the Part 1 conclusion below.

Design space for search-network parallelization
- Traversal by propagation (synchronization cost needed):
  - Coarse grain, active states: maintain active source states and propagate the out-arc computation results to the destination states.
  - Fine grain, active arcs: maintain active arcs and propagate the active-arc computation results to the destination states.
- Traversal by aggregation (preparation overhead needed):
  - Coarse grain, active states: maintain active destination states, determine all potential destination states, and aggregate their incoming arcs.
  - Fine grain, active arcs: maintain active arcs, group arcs with the same destination state, and aggregate them locally to resolve write conflicts.

Speedup: multicore CPU (relatively small number of cores)
- Sequential: RTF 3.17 (1x). RTF = real-time factor; speedups are versus the sequential case.
- State-based propagation: RTF 0.925, 3.4x speedup.
- Arc-based propagation: RTF 1.006, 3.2x speedup.
- State-based aggregation: RTF 2.593, 1.2x speedup.
- Only 3.4x speedup despite 4 cores x 4-way SIMD = 16x parallel hardware: there is no SIMD gain in the search, because the coalescing overhead exceeds the 4-way SIMD benefit.
[Chart: per-configuration run time broken down into observation probability computation, non-epsilon traversal, epsilon traversal, and sequential overhead.]

Speedup: GPU (large number of cores)
- Sequential: RTF 3.17 (1x).
- Arc-based propagation: RTF 0.302, 10.5x speedup with 240 (= 8 x 30) parallel execution units.
- State-based propagation: RTF 0.776, 4.1x speedup.
- Arc-based aggregation: RTF 0.912, 3.5x speedup.
- State-based aggregation: RTF 1.203, 2.6x speedup.
[Chart: the same run-time breakdown as in the multicore case.]

Conclusion for parallel speech recognition
- The design space is the same propagation/aggregation x coarse/fine grid shown above.
- On a multi-core CPU (few cores, costly coalescing), coarse-grained state-based propagation performed best.
- On a many-core GPU (hundreds of execution units), fine-grained arc-based propagation performed best.
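Before moving to Part 2, here is a small sketch contrasting the two traversal strategies: propagation needs a protected (atomic or locked) minimum update on each destination state, while aggregation groups the incoming arcs per destination so that a single worker owns each update. The toy graph, costs, and thread pool are hypothetical illustrations.

```python
# Propagation vs. aggregation traversal sketch (costs, minimum update = Viterbi).
# The graph, costs, and thread-pool size are hypothetical illustrations.

from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import threading

# arcs: (src, dst, arc_cost)
arcs = [(1, 5, 0.3), (2, 5, 0.1), (3, 5, 0.7), (3, 6, 0.2), (4, 6, 0.4)]
src_cost = {1: 1.0, 2: 2.0, 3: 0.5, 4: 1.5}

# --- Traversal by propagation: every arc may write the same destination,
# --- so the min-update must be protected (here: one lock per destination).
dst_cost = defaultdict(lambda: float("inf"))
locks = {d: threading.Lock() for _, d, _ in arcs}

def propagate(arc):
    s, d, w = arc
    new = src_cost[s] + w
    with locks[d]:                        # stands in for an atomic min update
        if new < dst_cost[d]:
            dst_cost[d] = new

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(propagate, arcs))

# --- Traversal by aggregation: group arcs by destination first, then one
# --- worker reduces all incoming arcs of a destination (no write conflicts).
by_dst = defaultdict(list)
for s, d, w in arcs:
    by_dst[d].append(src_cost[s] + w)

agg_cost = {d: min(c) for d, c in by_dst.items()}
print(dict(dst_cost), agg_cost)           # both give the same Viterbi minima
```

Both strategies produce the same minima; the trade-off is the synchronization cost of propagation versus the preparation overhead of grouping arcs for aggregation, which is exactly the design space summarized above.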
Part 2 - Speech recognition with deep neural networks
- Neural networks were first used for speech recognition a few decades ago, but their performance was worse than GMM-based systems:
  - single hidden layer
  - lack of large data
  - lack of processing power
  - overfitting problems
- Resurrection of neural networks:
  - multiple hidden layers (deep neural networks)
  - large training sets are now available
  - RBM pretraining reduces the overfitting problem
  - The NN returns under the name of Deep Neural Network and shows significantly better phoneme recognition performance than GMM.

Deep neural network for acoustic modeling
- Multiple layers of neural network; each layer is (usually) an all-to-all connection followed by an activation function (e.g., sigmoid), and the output gives the likelihood of the phonemes.
- With 1,000 units per layer, each layer demands 1 million weights and 1 million multiply-add operations per output computation.
- A five-layer DNN therefore implies about 5 million weights and operations (or 20 million with 2,000 units per layer): a lot of memory and arithmetic.

Strategies for low-power speech recognition systems
- GPU-based systems are not power efficient.
- Approaches for low power:
  - Do not use DRAM; keep all operations on-chip.
  - Apply a low supply voltage to lower the switching power (P = k C V²).
  - Use parallel processing instead of time-multiplexing; avoid global data transfer.
- Our solution: low-precision weights, all on-chip operation, and thousands of distributed processing units with no global communication.

DNN implementation with fixed-point hardware
- Reduce the precision of the weights to shrink the memory and to remove the multipliers.
- Retrain-based fixed-point optimization of the weights:
  1. Floating-point training.
  2. Step-by-step quantization from the first level to the last.
  3. Retraining with the quantized weights.

Good fixed-point DNN performance with 3-level weights
[Results figure.]

Multiplier-free, fully parallel DNN circuit
- With 3-level weights, the resulting processing unit (PU) employs no multipliers, only adders and multiplexers.
- Each layer contains 1K PUs.
- At about 1 W, roughly 1,000 times real-time processing speed.
- A small sketch of the ternary quantization and the multiplier-free inner product follows below.
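Here is a small sketch of the 3-level (ternary) weight quantization step and the resulting multiplier-free inner product, in which each product becomes an addition, a subtraction, or a skip. The threshold rule, scale factor, and sizes are hypothetical simplifications, and the retraining loop itself is omitted.

```python
# 3-level (ternary) weight quantization and a multiplier-free dot product.
# The threshold rule, scale factor, and layer size are hypothetical simplifications.

import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(scale=0.1, size=1024)      # floating-point weights of one neuron
x = rng.normal(size=1024)                 # layer input

def ternarize(w, thr_ratio=0.7):
    """Quantize weights to {-1, 0, +1} times a common per-layer scale factor."""
    thr = thr_ratio * np.mean(np.abs(w))
    q = np.where(np.abs(w) > thr, np.sign(w), 0.0)
    scale = np.abs(w[q != 0]).mean() if np.any(q) else 0.0
    return q, scale

q, scale = ternarize(w)

# Multiplier-free evaluation: only additions, subtractions, and skips,
# followed by one multiplication with the per-layer scale factor.
acc = x[q > 0].sum() - x[q < 0].sum()
y_quant = scale * acc
y_float = np.dot(w, x)
print(y_float, y_quant)                   # the quantized result approximates the float one
```

In hardware, the add/subtract/skip selection maps onto the adders and multiplexers of the PU, which is why no multiplier array is needed; the accuracy loss is recovered by retraining with the quantized weights, as described above.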
Recurrent neural network (RNN)
- Speech recognition needs sequence processing (memorizing the past).
- Delayed recurrent paths allow a neural network to access previous inputs (feedforward vs. recurrent structures).
- Long short-term memory (LSTM): a special RNN structure that can learn very long time dependencies.

RNN for language modeling
- Predict the next word/character given the history of words/characters.
- Input: a one-hot encoded word/character x_t. Output: probabilities (confidences) of the next words/characters, e.g., P(x_{t+1} | x_{1:t}) for all x_{t+1}.
- Delayed recurrent paths allow the RNN to access the previous inputs.

RNN-only speech recognition
- The CTC (connectionist temporal classification) objective function allows RNNs to learn sequences rather than frame-wise targets; CTC only concerns the sequential order of the output labels, not their exact timing.
- With CTC-based training, RNNs can learn to generate text directly from speech data without any prior knowledge of linguistic structures or dictionaries.
- A long short-term memory (LSTM) RNN is used, with a bidirectional architecture so that the future input is accessible as well as the past input.

RNN-only speech recognition: results
- Input: 39-dimensional MFCC features (D_A_E). Output: 29 characters (a-z, space, period, apostrophe).
- Network topology: 3 BLSTM layers with 256 memory blocks per LSTM layer, 3.8 M weights.
- Training data: WSJ0 + WSJ1 + TIMIT training sets (66,319,215 frames); about one month of training on a single CPU thread.
- Test data: the complete TIMIT test set. Character error rate: 17.94%.
- Example outputs (reference / recognized):
  - well_junior_didn't_he_eat_only_one_and_mr_henry_didn't_even_do_that_well_
    wel_junor_didn't_he_eat_only_one_and_mr._henry_didn't_even_do_that_well_ (word-level learning)
  - eternity_is_no_time_for_recriminations_
    auttornity_is_no_time_for_recriminations_ (the word "recrimination" is not in the training set)
  - laugh_dance_and_sing_if_fortune_smiles_upon_you_
    lafh_dance_and_sing_o_fortune_smile_supponyou_

Characteristics of DNN (and RNN) based speech recognition
- Disadvantages: a large number of weights and a large number of arithmetic operations.
- Many advantages lead to low power consumption:
  - Highly parallel and regular computation: most operations are inner products over 512 to 2K inputs.
  - Low-precision arithmetic: the DNN architecture is very robust to quantization when the quantization is included in the training procedure, saving weight memory and reducing the arithmetic-unit size.
  - Non-volatile memory can be used for a recognition-only (off-line training) architecture: high density (about 10 times that of SRAM), low power, and no standby power.
  - Thousands of distributed arithmetic units with mostly local connections.

Conclusion: architectural trends for speech recognition
- Up to ~1990: HMM (vocabulary up to ~1K words); programmable DSP or embedded CPU; one or a few processing elements; SRAM.
- 1990-2010: Gaussian-mixture acoustic model + HMM + language model (64K words); GPU or multi-core CPU; up to a few hundred floating-point processing elements; DRAM; about 10 W for real-time processing of one speaker.
- 2010 onward: deep neural networks (64K words); distributed memory + arithmetic logic; thousands of low-precision ALUs; non-volatile memory; less than 100 mW (on-chip memory, local connections, low precision).

References
- You, Kisun, et al. "Parallel scalability in speech recognition." IEEE Signal Processing Magazine 26.6 (2009): 124-135.
- Choi, Y. K., You, K., Choi, J., and Sung, W. "A real-time FPGA-based 20,000-word speech recognizer with optimized DRAM access." IEEE Transactions on Circuits and Systems I: Regular Papers 57.8 (2010): 2119-2131.
- Graves, Alex, and Navdeep Jaitly. "Towards end-to-end speech recognition with recurrent neural networks." Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014.
- Hwang, Kyuyeon, and Wonyong Sung. "Fixed-point feedforward deep neural network design using weights +1, 0, and -1." 2014 IEEE Workshop on Signal Processing Systems (SiPS), 2014.