A 12-WEEK PROJECT IN
Speech Coding and Recognition
by Fu-Tien Hsiao and Vedrana Andersen

Overview
- An Introduction to Speech Signals (Vedrana)
- Linear Prediction Analysis (Fu)
- Speech Coding and Synthesis (Fu)
- Speech Recognition (Vedrana)

AN INTRODUCTION TO SPEECH SIGNALS

Speech Production
- Flow of air from lungs
- Vibrating vocal cords
- Speech production cavities
- Lips
- Sound wave
- Vowels (a, e, i), fricatives (f, s, z) and plosives (p, t, k)

Speech Signals
- Sampling frequency 8 to 16 kHz
- Short-time stationary assumption (frames of 20 to 40 ms)

Model for Speech Production
- Excitation (periodic, noisy)
- Vocal tract filter (nasal cavity, oral cavity, pharynx)

Voiced and Unvoiced Sounds
- Voiced sounds: periodic excitation, pitch period
- Unvoiced sounds: noise-like excitation
- Short-time measures: power and zero-crossing

Frequency Domain
- Pitch, harmonics (excitation)
- Formants, envelope (vocal tract filter)
- Harmonic product spectrum

Speech Spectrograms
- Time-varying formant structure
- Narrowband / wideband

LINEAR PREDICTION ANALYSIS

Categories
- Vocal Tract Filter
- Linear Prediction Analysis
- Error Minimization
- Levinson-Durbin Recursion
- Residual sequence $u(n)$

Vocal Tract Filter (1)
- Input: periodic impulse train $U_g(z)$
- Vocal tract filter: $H(z)$
- Output: speech $S(z)$

$$S(z) = H(z)\,U_g(z), \qquad H(z) = \frac{S(z)}{U_g(z)}$$

- What if we assume an all-pole filter?

Vocal Tract Filter (2)
- Autoregressive model (all-pole filter):

$$H(z) = \frac{S(z)}{U_g(z)} = \frac{A}{1 - \sum_{k=1}^{p} a_k z^{-k}}$$

$$S(z) = a_1 z^{-1} S(z) + a_2 z^{-2} S(z) + \dots + a_p z^{-p} S(z) + A\,U_g(z)$$

$$s(n) = a_1 s(n-1) + a_2 s(n-2) + \dots + a_p s(n-p) + A\,u_g(n)$$

- $p$ is called the model order
- Speech is a linear combination of past samples plus an extra part, $A\,u_g(n)$

Linear Prediction Analysis (1)
- Goal: how do we find the coefficients $a_k$ in this all-pole model?
- Physical model vs. analysis system
- [Diagram: impulse $A\,u_g(n)$ into the all-pole model gives speech $s(n)$; $s(n)$ into the "?" box gives the error $e(n)$]
- $a_k$ is fixed, but unknown
- We try to find $\alpha_k$ to estimate $a_k$

Linear Prediction Analysis (2)
- What is really inside the "?" box?
- A predictor $P(z)$ (an FIR filter): the original $s(n)$ goes through $P(z)$ to give the prediction $\hat{s}(n)$
- Prediction error: $e(n) = s(n) - \hat{s}(n)$, so $A(z) = 1 - P(z)$
- $\hat{s}(n) = \alpha_1 s(n-1) + \alpha_2 s(n-2) + \dots + \alpha_p s(n-p)$
- If $\alpha_k \approx a_k$, then $e(n) \approx A\,u_g(n)$

Linear Prediction Analysis (3)
- If we can find a predictor that generates the smallest error $e(n)$, which is close to $A\,u_g(n)$, then we can use $A(z)$ to estimate the filter coefficients
- [Diagram: $e(n) \approx A\,u_g(n)$ into $1/A(z)$ gives $\hat{s}(n)$]
- Very similar to the vocal tract model

Error Minimization (1)
- Problem: how do we find the minimum error?
- Energy of the error, a function of the $\alpha_i$:

$$E = \sum_n e^2(n), \qquad e(n) = s(n) - \hat{s}(n)$$

- $E$ is quadratic in the $\alpha_i$, so the minimum is found by setting

$$\frac{\partial E}{\partial \alpha_i} = 0 \quad \text{for each } i$$

Error Minimization (2)
- By differentiation we get a set of linear equations:

$$\sum_n s(n)\,s(n-i) = \sum_{k=1}^{p} \alpha_k \sum_n s(n-k)\,s(n-i), \qquad 1 \le i \le p$$

- We define

$$\phi(i,k) = \sum_n s(n-k)\,s(n-i), \qquad 1 \le k \le p,\ 1 \le i \le p$$

- This is actually an autocorrelation of $s(n)$: it depends only on the lag, $\phi(i,k) = r(|i-k|)$

Error Minimization (3)
- The linear equations in matrix form:

$$\begin{pmatrix} r(0) & r(1) & \cdots & r(p-1) \\ r(1) & r(0) & \cdots & r(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ r(p-1) & r(p-2) & \cdots & r(0) \end{pmatrix} \begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_p \end{pmatrix} = \begin{pmatrix} r(1) \\ r(2) \\ \vdots \\ r(p) \end{pmatrix}$$

- The linear prediction coefficients are our goal
- How do we solve this efficiently?

Levinson-Durbin Recursion (1)
- The L-D recursion exploits two properties of the matrix:
  - Symmetric
  - Toeplitz
- Hence we can solve the system in $O(p^2)$ instead of $O(p^3)$
- Do not forget our objective, which is to find $\alpha_k$ to simulate the vocal tract filter

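The recursion can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the project: the function names `autocorrelation` and `levinson_durbin`, the test filter with coefficients 1.3 and -0.4, and the frame length of 200 samples are all assumptions chosen for the example. It uses the same sign convention as the predictor above, $\hat{s}(n) = \sum_k \alpha_k s(n-k)$.

```python
def autocorrelation(s, p):
    """Return [r(0), ..., r(p)] with r(i) = sum_n s(n) * s(n - i) over one frame."""
    n = len(s)
    return [sum(s[t] * s[t - i] for t in range(i, n)) for i in range(p + 1)]


def levinson_durbin(r, p):
    """Solve the symmetric Toeplitz normal equations for alpha_1..alpha_p.

    Returns (alpha, err): alpha[k - 1] holds alpha_k and err is the
    prediction error energy at order p. Assumes r[0] > 0 (non-silent frame).
    """
    alpha = []      # predictor coefficients of the current order
    err = r[0]      # order-0 error energy equals the signal energy
    for i in range(1, p + 1):
        # Reflection coefficient for order i.
        acc = r[i] - sum(alpha[k] * r[i - 1 - k] for k in range(i - 1))
        k_i = acc / err
        # Update the lower-order coefficients, then append the new one.
        alpha = [alpha[k] - k_i * alpha[i - 2 - k] for k in range(i - 1)] + [k_i]
        # Error energy shrinks by (1 - k_i^2) at each order.
        err *= 1.0 - k_i * k_i
    return alpha, err


# Hypothetical test signal: impulse response of a known all-pole filter,
# s(n) = 1.3 s(n-1) - 0.4 s(n-2) + delta(n), with poles at 0.8 and 0.5.
s = []
for n in range(200):
    prev1 = s[n - 1] if n >= 1 else 0.0
    prev2 = s[n - 2] if n >= 2 else 0.0
    s.append(1.3 * prev1 - 0.4 * prev2 + (1.0 if n == 0 else 0.0))

r = autocorrelation(s, 2)
alpha, err = levinson_durbin(r, 2)
print(alpha, err)  # alpha should be close to the filter coefficients 1.3 and -0.4
```

Because the test signal is driven by a single unit impulse, the estimated coefficients should recover the filter and the remaining error energy should be close to 1, the energy of that impulse. Each order reuses the previous solution, which is where the $O(p^2)$ cost comes from.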
Levinson-Durbin Recursion (2)
- In the exercise, we solved the system both by 'brute force' and by L-D recursion
- The resulting parameters show no difference
- [Figure: error energy vs. predictor order]

Residual sequence u(n)
- Once the filter coefficients are known, we can find the residual sequence $u(n)$ by inverse filtering: $s(n)$ through $A(z)$ gives $u(n)$
- Compare the original $s(n)$ with the residual $u(n)$

SPEECH CODING AND SYNTHESIS

Categories
- Analysis-by-Synthesis
- Perceptual Weighting Filter
- Linear Predictive Coding
- Multi-Pulse Linear Prediction
- Code-Excited Linear Prediction (CELP)
- CELP Experiment
- Quantization

Analysis-by-Synthesis (1)
- Analyze the speech by estimating an LP synthesis filter
- Compute a residual sequence to use as the excitation signal for reconstructing the signal
- Encoder/decoder: parameters such as the LP synthesis filter, gain, and pitch are coded, transmitted, and decoded

Analysis-by-Synthesis (2)
- Frame by frame
- [Block diagram: LP analysis, excitation generator, LP synthesis filter, error minimization and encoder; signals $s(n)$, $\hat{s}(n)$ and $e(n)$; LP parameters and excitation parameters go to the channel]
- [Comparison: without error minimization / with error minimization]

Perceptual Weighting Filter (1)
- Perceptual masking effect: within the formant regions, one is less sensitive to noise
- Idea: design a filter that de-emphasizes the error in the formant regions
- Result: synthetic speech with more error near the formant peaks but less error elsewhere

Perceptual Weighting Filter (2)

$$Q(z) = \frac{1 - P(z)}{1 - P(z/\alpha)} = \frac{A(z)}{A(z/\alpha)}$$

- [Figure: frequency domain, LP synthesis filter vs. PW filter]
- Perceptual weighting coefficient $\alpha$:
  - $\alpha = 1$: no filtering
  - As $\alpha$ decreases, the filtering becomes stronger
  - The optimal $\alpha$ depends on perception

Perceptual Weighting Filter (3)
- [Figure: z-domain, LP filter vs. PW filter]
- Numerator: generates zeros at the original poles of the LP synthesis filter
- Denominator: places the poles closer to the origin; $\alpha$ determines the distance

Linear Predictive Coding (1)
- Based on the methods above: the PW filter and analysis-by-synthesis
- If the excitation signal is close to an impulse train, as during voicing, the reconstructed signal can be very close to the original
- More often, however, the residual is far from an impulse train

Linear Predictive Coding (2)
- Hence there are many coding schemes that try to improve on this
- They differ primarily in the type of excitation signal
- Two kinds:
  - Multi-Pulse Linear Prediction
  - Code-Excited Linear Prediction (CELP)

Multi-Pulse Linear Prediction (1)
- Concept: represent the residual sequence by placing impulses so that $\hat{s}(n)$ gets closer to $s(n)$
- [Block diagram: LP analysis, excitation generator producing the multi-pulse $u(n)$, LP synthesis filter producing $\hat{s}(n)$, PW filter, error minimization]

Multi-Pulse Linear Prediction (2)
- s1: Estimate the LPC filter without excitation
- s2: Place one impulse (position and amplitude)
- s3: Determine the new error
- s4: Repeat s2 and s3 until the desired minimum error is reached
- [Figure: original, multi-pulse, synthetic and error signals at steps s1, s2-s3 and s4]

Code-Excited Linear Prediction (1)
- The difference:
  - Represent the residual $v(n)$ by codewords, found by exhaustive search, from a codebook of zero-mean Gaussian sequences
  - Consider the primary pitch pulses, which are predictable over consecutive periods

Code-Excited Linear Prediction (2)
- [Block diagram: pitch estimate and LP analysis of $s(n)$ give the LP parameters; the Gaussian codebook / multi-pulse excitation generator produces $v(n)$; the pitch synthesis filter produces $u(n)$; the LP synthesis filter produces $\hat{s}(n)$; the difference from $s(n)$ goes through the PW filter to error minimization]

CELP Experiment (1)
- An experiment with CELP
- [Figure: original (blue), reconstructed (green), excitation signal (below)]

CELP Experiment (2)
- Test the quality for different settings:
1. LPC model order: initial M = 10, test M = 2
2. PW coefficient

CELP Experiment (3)
3. Codebook (L, K): initial (40, 1024), test (40, 16)
  - K: codebook size. K strongly influences the computation time: going from K = 1024 to K = 256 reduced the time from 13 s to 6 s
  - L: length of the random signal. L determines the number of subblocks in the frame

Quantization
- With quantization:
  - 16000 bps CELP
  - 9600 bps CELP
- Trade-off: bandwidth efficiency vs. speech quality

SPEECH RECOGNITION

Dimensions of Difficulty
- Speaker dependent / independent
- Vocabulary size (small, medium, large)
- Discrete words / continuous utterance
- Quiet / noisy environment

Feature Extraction
- Overlapping frames
- Feature vector for each frame
- Mel-cepstrum, difference cepstrum, energy, difference energy

Vector Quantization
- Vector quantization
- K-means algorithm
- Observation sequence for the whole word

Hidden Markov Model (1)
- Changing states, emitting symbols
- Model parameters: $\pi(1)$, $A$, $B$
- [Diagram: states 1 to 5]

Hidden Markov Model (2)
- Probability of transition
- State transition matrix
- State probability vector
- State equation

Hidden Markov Model (3)
- Probability of observing
- Observation probability matrix
- Observation probability vector
- Observation equation

Hidden Markov Model (4)
- Discrete observation hidden Markov model
- Two HMM problems:
  - Training problem
  - Recognition problem

Recognition using HMM (1)
- Determining the probability that a given HMM produced the observation sequence
- [Figure: trellis of states 1 to 3 against time]
- Straightforward computation runs over all possible paths: $S^T$ of them

Recognition using HMM (2)
- Forward-backward algorithm, only the forward part
- Forward partial observation
- Forward probability

Recognition using HMM (3)
- Initialization
- Recursion
- Termination

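The initialization, recursion and termination steps can be sketched as follows for a discrete-observation HMM. This is a minimal illustration under stated assumptions, not the project's implementation: the function names, the row convention `A[i][j]` = probability of moving from state i to state j, `B[i][k]` = probability of emitting symbol k in state i, the termination that sums over all final states, and the small 3-state left-to-right model are all chosen for the example. A brute-force sum over all $S^T$ paths is included only to check the result.

```python
from itertools import product


def forward_probability(pi, A, B, obs):
    """Probability that the model produced obs, via the forward recursion."""
    S = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(S)]
    # Recursion: alpha_{t+1}(j) = (sum_i alpha_t(i) * a_ij) * b_j(o_{t+1})
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(S)) * B[j][o]
            for j in range(S)
        ]
    # Termination: sum the forward probabilities over all final states
    # (assumption: the model may end in any state).
    return sum(alpha)


def brute_force_probability(pi, A, B, obs):
    """Same probability by summing over all S**T state paths (check only)."""
    S, T = len(pi), len(obs)
    total = 0.0
    for path in product(range(S), repeat=T):
        p = pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, T):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total


# Hypothetical 3-state left-to-right model with 2 observation symbols.
pi = [1.0, 0.0, 0.0]
A = [[0.6, 0.4, 0.0],
     [0.0, 0.7, 0.3],
     [0.0, 0.0, 1.0]]
B = [[0.8, 0.2],
     [0.3, 0.7],
     [0.5, 0.5]]
obs = [0, 1, 1, 0]

p_forward = forward_probability(pi, A, B, obs)
p_brute = brute_force_probability(pi, A, B, obs)
print(p_forward, p_brute)  # the two values should agree up to rounding
```

The forward recursion needs on the order of $S^2 T$ multiplications, while the brute-force sum grows as $S^T$, which is why only the forward form is practical for real word models. For recognition, the same probability would be computed for each word model and the largest one chosen.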
Training HMM
- No known analytical solution
- Forward-backward (Baum-Welch) reestimation, a hill-climbing algorithm
- Reestimates the HMM parameters in such a way that the probability of the observation sequence does not decrease
- Method:
  - Uses the model and the observation sequence to compute forward and backward probabilities
  - Calculates state transition probabilities and observation probabilities
  - Reestimates the model to improve the probability
- Need for scaling

Experiments
- Matrices A and B
- Observation sequences for the words 'one' and 'two'

Thank you!