DTU Compute

A 12-WEEK PROJECT IN
Speech Coding and
Recognition
by Fu-Tien Hsiao
and Vedrana Andersen
Overview




An Introduction to Speech Signals (Vedrana)
Linear Prediction Analysis (Fu)
Speech Coding and Synthesis (Fu)
Speech Recognition (Vedrana)
Speech Coding and
Recognition
AN INTRODUCTION TO SPEECH SIGNALS
Speech Production






Flow of air from the lungs
Vibrating vocal cords
Speech production cavities
Lips
Sound wave
Vowels (a, e, i), fricatives (f, s, z) and plosives (p, t, k)
Speech Signals

Sampling frequency 8–16 kHz
Short-time stationary assumption (frames 20–40 ms)
Model for Speech Production

Excitation (periodic or noisy)
Vocal tract filter (nasal cavity, oral cavity, pharynx)
Voiced and Unvoiced Sounds

Voiced sounds: periodic excitation, pitch period
Unvoiced sounds: noise-like excitation
Short-time measures: power and zero-crossing rate
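As an illustrative sketch (my own toy example, not the project's code), these two short-time measures can be computed per frame with NumPy; the 120 Hz tone and the noise frame below are hypothetical stand-ins for voiced and unvoiced speech:

```python
import numpy as np

def short_time_measures(frame):
    """Short-time power and zero-crossing count for one frame.
    High power with few zero crossings suggests voiced speech;
    low power with many zero crossings suggests unvoiced speech."""
    power = np.mean(frame ** 2)
    zero_crossings = int(np.sum(np.abs(np.diff(np.sign(frame))) > 0))
    return power, zero_crossings

# Synthetic 30 ms frames at 8 kHz: a 120 Hz tone (voiced-like)
# versus low-level noise (unvoiced-like).
fs = 8000
t = np.arange(int(0.03 * fs)) / fs
voiced = np.sin(2 * np.pi * 120 * t)
unvoiced = 0.1 * np.random.default_rng(0).standard_normal(t.size)

p_v, z_v = short_time_measures(voiced)
p_u, z_u = short_time_measures(unvoiced)
```

On these frames the voiced-like tone gives the higher power and the unvoiced-like noise the far higher zero-crossing count, which is exactly the decision cue the slide describes.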
Frequency Domain

Pitch and harmonics (excitation)
Formants and spectral envelope (vocal tract filter)
Harmonic product spectrum
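A minimal sketch of the harmonic product spectrum idea (a hypothetical illustration with an assumed 200 Hz fundamental, not the project's implementation):

```python
import numpy as np

def hps_pitch(frame, fs, n_harmonics=4):
    """Pitch estimate via the harmonic product spectrum: the magnitude
    spectrum is multiplied by downsampled copies of itself, so energy
    at the fundamental and its harmonics reinforces at the pitch bin."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    hps = spectrum.copy()
    for h in range(2, n_harmonics + 1):
        m = len(spectrum) // h
        hps[:m] *= spectrum[::h][:m]
    lo = int(50 * len(frame) / fs)        # skip bins below ~50 Hz
    hi = len(spectrum) // n_harmonics
    peak = lo + int(np.argmax(hps[lo:hi]))
    return peak * fs / len(frame)

# Harmonic-rich test signal with a 200 Hz fundamental at fs = 8 kHz.
fs = 8000
t = np.arange(2048) / fs
x = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in range(1, 5))
f0 = hps_pitch(x, fs)
```

The estimate lands within one FFT bin (here about 4 Hz) of the true 200 Hz fundamental.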
Speech Spectrograms

Time-varying formant structure
Narrowband / wideband
Speech Coding and
Recognition
LINEAR PREDICTION ANALYSIS
Categories


Vocal Tract Filter
Linear Prediction Analysis
Error Minimization
Levinson-Durbin Recursion
Residual sequence u(n)
Vocal Tract Filter (1)

Vocal tract filter: the input is a periodic impulse train Ug(z), the output is the speech S(z), so the filter is

H(z) = S(z) / Ug(z)

What if we assume an all-pole filter?
Vocal Tract Filter (2)

Autoregressive model (all-pole filter):

H(z) = S(z) / Ug(z) = A / (1 - Σ_{k=1}^{p} a_k z^(-k))

S(z) = a_1 z^(-1) S(z) + a_2 z^(-2) S(z) + ... + a_p z^(-p) S(z) + A Ug(z)
s(n) = a_1 s(n-1) + a_2 s(n-2) + ... + a_p s(n-p) + A ug(n)

where p is called the model order. Speech is a linear combination of past samples plus an extra part, A ug(n).
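The AR recursion above can be run directly. Below is a toy sketch (my own example; the pole frequency, radius, and 100 Hz pitch are assumed, not from the project): a p = 2 all-pole filter with one resonance near 500 Hz, driven by an impulse train:

```python
import numpy as np

# Hypothetical AR(2) example: s(n) = a1*s(n-1) + a2*s(n-2) + A*ug(n),
# with a single resonance ("formant") near 500 Hz at fs = 8 kHz.
fs = 8000
f_res, r = 500.0, 0.97                 # assumed pole frequency and radius
a1 = 2 * r * np.cos(2 * np.pi * f_res / fs)
a2 = -r ** 2
A = 1.0

ug = np.zeros(800)
ug[:: fs // 100] = 1.0                 # 100 Hz impulse train as excitation

s = np.zeros_like(ug)
for n in range(len(s)):
    s[n] = A * ug[n]                   # the "extra part" A*ug(n)
    if n >= 1:
        s[n] += a1 * s[n - 1]          # linear combination of past samples
    if n >= 2:
        s[n] += a2 * s[n - 2]
```

Because the pole radius is below 1 the filter is stable, so the output stays bounded between the excitation impulses.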
Linear Prediction Analysis (1)

Goal: how do we find the coefficients a_k of this all-pole model?
Physical model vs. analysis system:
In the physical model, the impulse train A ug(n) excites the all-pole model to produce the speech s(n); the a_k here are fixed, but unknown!
In the analysis system, we try to find coefficients α_k that estimate a_k, leaving an error e(n).
Linear Prediction Analysis (2)

What is really inside the '?' box? A predictor (P(z), an FIR filter): the prediction ŝ(n) is subtracted from the original s(n), giving the predictive error

e(n) = s(n) - ŝ(n),   A(z) = 1 - P(z)

where ŝ(n) = α_1 s(n-1) + α_2 s(n-2) + ... + α_p s(n-p).
If α_k ≈ a_k, then e(n) ≈ A ug(n).
Linear Prediction Analysis (3)

If we can find a predictor generating the smallest error e(n), close to A ug(n), then we can use A(z) to estimate the filter coefficients: feeding e(n) ≈ A ug(n) through 1/A(z) reproduces ŝ(n), very similar to the vocal tract model.
Error Minimization (1)

Problem: how do we find the minimum error?
Energy of the error:

E = Σ_{n=-∞}^{∞} e²(n),   where e(n) = s(n) - ŝ(n)

E is a function of the α_i. Since E is a quadratic function of the α_i, we can find its smallest value by setting ∂E/∂α_i = 0 for each i.
Error Minimization (2)

Differentiation gives a set of linear equations:

Σ_{n=-∞}^{∞} s(n) s(n-i) = Σ_{k=1}^{p} α_k Σ_{n=-∞}^{∞} s(n-k) s(n-i)

We define

φ(i,k) = Σ_{n=-∞}^{∞} s(n-k) s(n-i),   where 1 ≤ k ≤ p, 1 ≤ i ≤ p

This is actually an autocorrelation of s(n).
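A small sketch of forming the autocorrelations and solving the normal equations by plain linear algebra (my own illustration; the AR(2) coefficients are an assumed test case, not project data):

```python
import numpy as np

def lp_normal_equations(s, p):
    """Form autocorrelations r(0..p) and solve the normal equations
    R alpha = r for the LP coefficients with a generic solver."""
    r = np.array([np.dot(s[: len(s) - k], s[k:]) for k in range(p + 1)])
    R = np.array([[r[abs(i - k)] for k in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1 : p + 1])

# Sanity check on a synthetic AR(2) signal with known coefficients.
rng = np.random.default_rng(1)
a_true = np.array([1.3, -0.64])
s = np.zeros(5000)
u = rng.standard_normal(5000)
for n in range(2, len(s)):
    s[n] = a_true[0] * s[n - 1] + a_true[1] * s[n - 2] + u[n]

alpha = lp_normal_equations(s, 2)
```

On a long enough synthetic signal, the estimated α_k land close to the true a_k, which is the whole point of the analysis.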
Error Minimization (3)

In matrix form, the linear equations read

[ r(0)    r(1)    ...  r(p-1) ] [ α_1 ]   [ r(1) ]
[ r(1)    r(0)    ...  r(p-2) ] [ α_2 ] = [ r(2) ]
[  ...     ...    ...   ...   ] [ ... ]   [ ...  ]
[ r(p-1)  r(p-2)  ...  r(0)   ] [ α_p ]   [ r(p) ]

The linear prediction coefficients are our goal. How do we solve this efficiently?
Levinson-Durbin Recursion (1)

The Levinson-Durbin (L-D) recursion is based on two characteristics of the matrix:

Symmetric
Toeplitz

Hence we can solve the system in O(p²) instead of O(p³).
Don't forget our objective, which is to find the α_k that simulate the vocal tract filter.
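A compact sketch of the recursion (a standard textbook formulation, written by me as an illustration; the r(k) = 0.5^k test case is assumed):

```python
import numpy as np

def levinson_durbin(r, p):
    """Solve the symmetric Toeplitz normal equations in O(p^2).
    r: autocorrelation values r[0..p]. Returns (alpha, error energy)."""
    a = np.zeros(p + 1)
    E = r[0]
    for i in range(1, p + 1):
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / E   # reflection coefficient
        a_prev = a.copy()
        a[i] = k
        for j in range(1, i):
            a[j] = a_prev[j] - k * a_prev[i - j]
        E *= 1.0 - k * k                                  # error energy shrinks
    return a[1:], E

# r(k) = 0.5**k is the autocorrelation of an AR(1) process with a1 = 0.5,
# so the order-3 predictor should come out as [0.5, 0, 0].
r = np.array([1.0, 0.5, 0.25, 0.125])
alpha, E = levinson_durbin(r, 3)
```

Each pass extends the order-(i-1) solution to order i, which is why the cost is quadratic rather than cubic in p.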
Levinson-Durbin Recursion (2)

In the exercise, we solve the system both by 'brute force' and by L-D recursion; the resulting parameters show no difference.
[Plot: error energy vs. predictor order]
Residual sequence u(n)

Knowing the filter coefficients, we can find the residual sequence u(n) by inverse filtering: passing the original s(n) through A(z) yields u(n). Try comparing the original s(n) with the residual u(n).
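The inverse-filtering step can be sketched as follows (my own round-trip example with assumed coefficients, not the project's code):

```python
import numpy as np

def inverse_filter(s, alpha):
    """Recover the residual u(n) by inverse filtering s(n) with
    A(z) = 1 - sum_k alpha_k z^(-k), which is an FIR filter."""
    u = np.zeros_like(s)
    for n in range(len(s)):
        u[n] = s[n] - sum(alpha[k] * s[n - 1 - k]
                          for k in range(min(len(alpha), n)))
    return u

# Round trip: synthesize s(n) from a known all-pole filter and an
# impulse-train excitation, then inverse filter to get the residual back.
alpha = np.array([1.3, -0.64])          # assumed LP coefficients
exc = np.zeros(200)
exc[::40] = 1.0                         # impulse-train excitation
s = np.zeros_like(exc)
for n in range(len(s)):
    s[n] = exc[n] + sum(alpha[k] * s[n - 1 - k]
                        for k in range(min(len(alpha), n)))

u = inverse_filter(s, alpha)
```

With the exact coefficients, the residual equals the original excitation sample for sample; with estimated coefficients it is only approximately impulse-like.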
Speech Coding and
Recognition
SPEECH CODING AND SYNTHESIS
Categories



Analysis-by-Synthesis
Perceptual Weighting Filter
Linear Predictive Coding
  Multi-Pulse Linear Prediction
  Code-Excited Linear Prediction (CELP)
CELP Experiment
Quantization
Analysis-by-Synthesis (1)

Analyze the speech by estimating an LP synthesis filter
Compute a residual sequence as an excitation signal to reconstruct the signal
Encoder/decoder: the parameters, such as the LP synthesis filter, gain, and pitch, are coded, transmitted, and decoded
Analysis-by-Synthesis (2)

Processing is frame by frame: LP analysis of s(n) yields the LP parameters; an excitation generator drives the LP synthesis filter to produce ŝ(n); the error e(n) = s(n) - ŝ(n) feeds an error-minimization block that tunes the excitation parameters. The LP and excitation parameters are encoded and sent to the channel.
[Audio examples: without and with error minimization]
Perceptual Weighting Filter (1)

Perceptual masking effect: within the formant regions, one is less sensitive to the noise
Idea: design a filter that de-emphasizes the error in the formant regions
Result: synthetic speech with more error near the formant peaks but less error elsewhere
Perceptual Weighting Filter (2)

Q(z) = (1 - P(z)) / (1 - P(z/α)) = A(z) / A(z/α)

In the frequency domain: LP synthesis filter vs. PW filter.
Perceptual weighting coefficient α:

α = 1: no filtering
As α decreases, the filtering gets stronger
The optimal α depends on perception
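The coefficient scaling behind Q(z) = A(z)/A(z/α) is mechanical, as this sketch shows (my own illustration with assumed LP coefficients):

```python
import numpy as np

def perceptual_weighting(a, alpha):
    """Coefficients of Q(z) = A(z) / A(z/alpha) for A(z) = 1 - sum_k a_k z^(-k).
    Replacing z by z/alpha multiplies the k-th coefficient by alpha**k,
    which pulls every root of the polynomial toward the origin by alpha."""
    num = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    den = num * alpha ** np.arange(len(num))
    return num, den

# Assumed LP coefficients, alpha = 0.8.
num, den = perceptual_weighting([1.3, -0.64], 0.8)
```

With α = 1 the numerator and denominator coincide and Q(z) does nothing; as α shrinks, the denominator poles retreat toward the origin, flattening the weighting inside the formant regions.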
Perceptual Weighting Filter (3)

In the z-domain, LP filter vs. PW filter:
Numerator: generates zeros at the original poles of the LP synthesis filter
Denominator: places the poles closer to the origin; α determines the distance
Linear Predictive Coding (1)

Based on the above methods: the PW filter and analysis-by-synthesis
If the excitation signal ≈ impulse train during voicing, we can get a reconstructed signal very close to the original
More often, however, the residual is far from an impulse train
Linear Predictive Coding (2)

Hence there are many kinds of coding that try to improve on this, differing primarily in the type of excitation signal. Two kinds:
  Multi-Pulse Linear Prediction
  Code-Excited Linear Prediction (CELP)
Multi-Pulse Linear Prediction (1)

Concept: represent the residual sequence by placing impulses so as to make ŝ(n) closer to s(n). As in analysis-by-synthesis, the multi-pulse excitation u(n) drives the LP synthesis filter; the error between s(n) and ŝ(n) passes through the PW filter before error minimization.
Multi-Pulse Linear Prediction (2)

Step 1: Estimate the LPC filter without excitation
Step 2: Place one impulse (choose its placement and amplitude)
Step 3: Determine the new error
Step 4: Repeat steps 2-3 until a desired minimum error is reached
[Plot: original, multi-pulse excitation, synthetic speech, and error]
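The greedy loop of steps 2-4 can be sketched as below (my own simplified illustration: no perceptual weighting, one pulse per iteration, assumed filter coefficients):

```python
import numpy as np

def impulse_response(alpha, n):
    """Impulse response of the all-pole synthesis filter 1/A(z)."""
    h = np.zeros(n)
    for i in range(n):
        h[i] = 1.0 if i == 0 else 0.0
        for k in range(min(len(alpha), i)):
            h[i] += alpha[k] * h[i - 1 - k]
    return h

def multipulse_excitation(target, alpha, n_pulses):
    """Greedy multi-pulse search: repeatedly add the single impulse
    (position and amplitude) that most reduces the synthesis error,
    by correlating the residual with shifted impulse responses."""
    n = len(target)
    h = impulse_response(alpha, n)
    residual = target.copy()
    excitation = np.zeros(n)
    for _ in range(n_pulses):
        corr = np.array([np.dot(residual[m:], h[: n - m]) for m in range(n)])
        energy = np.array([np.dot(h[: n - m], h[: n - m]) for m in range(n)])
        m = int(np.argmax(corr ** 2 / energy))   # best position
        g = corr[m] / energy[m]                  # best amplitude there
        excitation[m] += g
        residual[m:] -= g * h[: n - m]           # subtract its contribution
    return excitation

# If the target really is one scaled, delayed impulse response,
# a single greedy pulse should recover its position and amplitude.
alpha = np.array([1.3, -0.64])
h = impulse_response(alpha, 40)
target = np.zeros(40)
target[5:] = 0.7 * h[:35]
exc = multipulse_excitation(target, alpha, 1)
```

Real multi-pulse coders minimize a perceptually weighted error instead of the raw one, but the search structure is the same.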
Code-Excited Linear Prediction (1)

The differences:
Represent the residual v(n) by codewords (found by exhaustive search) from a codebook of zero-mean Gaussian sequences
Consider primary pitch pulses, which are predictable over consecutive periods
Code-Excited Linear Prediction (2)

In the encoder, LP analysis of s(n) yields the LP parameters and a pitch estimate. The excitation generator draws v(n) from a Gaussian codebook; v(n) passes through the pitch synthesis filter (giving u(n)) and then the LP synthesis filter to produce ŝ(n). The error between s(n) and ŝ(n) is perceptually weighted (PW filter) and minimized over the codebook.
CELP Experiment (1)

An experiment with CELP:
[Plots: original (blue), excitation signal (below), reconstructed (green)]
CELP Experiment (2)

Test the quality for different settings:
1. LPC model order M
   Initial M = 10; test M = 2
2. PW coefficient
CELP Experiment (3)

3. Codebook (L, K)
   L: length of the random signal; L determines the number of subblocks in the frame
   K: codebook size; K strongly influences the computation time
   (reducing K from 1024 to 256 cuts the time from 13 to 6 seconds)
   Initial (40, 1024); test (40, 16)
Quantization

With quantization:
16000 bps CELP
9600 bps CELP
Trade-off: bandwidth efficiency vs. speech quality
Speech Coding and
Recognition
SPEECH RECOGNITION
Dimensions of Difficulty




Speaker dependent / independent
Vocabulary size (small, medium, large)
Discrete words / continuous utterance
Quiet / noisy environment
Feature Extraction

Overlapping frames
Feature vector for each frame
Mel-cepstrum, difference cepstrum, energy, difference energy
Vector Quantization

Vector quantization
K-means algorithm
Observation sequence for the whole word
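A toy sketch of the K-means/VQ step (my own illustration on synthetic 2-D "features"; real mel-cepstral vectors have more dimensions):

```python
import numpy as np

def train_codebook(features, K, iters=10):
    """Toy K-means codebook training for vector quantization.
    Returns the codebook and each frame's codeword index; the index
    sequence is what the discrete HMM sees as its observation sequence."""
    step = max(1, len(features) // K)
    codebook = features[::step][:K].copy()     # simple deterministic init
    labels = np.zeros(len(features), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        labels = dists.argmin(axis=1)          # nearest codeword per frame
        for k in range(K):
            if np.any(labels == k):
                codebook[k] = features[labels == k].mean(axis=0)
    return codebook, labels

# Two well-separated synthetic "feature" clusters should map to two codewords.
rng = np.random.default_rng(2)
a = rng.standard_normal((50, 2)) * 0.5
b = rng.standard_normal((50, 2)) * 0.5 + 10.0
features = np.concatenate([a, b])
codebook, labels = train_codebook(features, 2)
```

Each frame's label is the discrete symbol that feeds the hidden Markov model on the following slides.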
Hidden Markov Model (1)

Changing states, emitting symbols
Model parameters: the initial state probabilities, the transition matrix A, and the observation matrix B
[Diagram: left-to-right HMM with states 1-5]
Hidden Markov Model (2)

Probability of transition
State transition matrix
State probability vector
State equation
Hidden Markov Model (3)

Probability of observing
Observation probability matrix
Observation probability vector
Observation equation
Hidden Markov Model (4)

Discrete observation hidden Markov model
Two HMM problems:
  Training problem
  Recognition problem
Recognition using HMM (1)

Determining the probability that a given HMM produced the observation sequence.
[Trellis diagram: states over time]
Using straightforward computation over all possible state paths costs on the order of S^T operations for S states and T observations.
Recognition using HMM (2)

Forward-backward algorithm; only the forward part is needed for recognition
Probability of the forward partial observation sequence
Forward probability α_t(i)
Recognition using HMM (3)

Initialization
Recursion
Termination
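The three steps map directly onto a few lines of code. This is my own sketch with a hypothetical 2-state, 2-symbol model (and no scaling, so it suits short sequences only):

```python
import numpy as np

def forward_probability(A, B, pi, obs):
    """Forward part of the forward-backward algorithm:
    P(observations | model) in O(S^2 T) instead of summing
    over all S^T state paths."""
    alpha = pi * B[:, obs[0]]            # initialization
    for o in obs[1:]:                    # recursion
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()                   # termination

# Hypothetical 2-state, 2-symbol model for illustration.
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])              # state transition matrix
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])              # observation probability matrix
pi = np.array([0.6, 0.4])               # initial state probabilities
p = forward_probability(A, B, pi, [0, 1, 0])
```

For recognition, this probability is evaluated once per word model, and the word whose HMM scores highest wins.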
Training HMM

No known analytical way
Forward-backward (Baum-Welch) reestimation, a hill-climbing algorithm
Reestimates the HMM parameters so that the probability of the training observations increases
Method:
Compute the forward and backward probabilities
Calculate the state transition and observation probabilities from them
Reestimate the model to improve the probability
Scaling is needed to avoid numerical underflow
Experiments

Matrices A and B
Observation sequences for the words 'one' and 'two'

Thank you!