CALIFORNIA STATE UNIVERSITY, NORTHRIDGE
CHARACTERIZATION OF SPEECH WAVEFORMS
A graduate project submitted in partial satisfaction of
the requirements for the degree of Master of Science in
Engineering
by
Sang Huy Ta
August, 1979
The Graduate Project of Sang Huy Ta is approved:

Fred Gruenberger

J.C. Prabhakar

C. Dale Manquen

California State University, Northridge
ACKNOWLEDGMENTS
I would like to thank Dr. C. Dale Manquen for his precious
time, corrections and invaluable advice, and Dr. D. Schwartz for his
invaluable guidance.

I would also like to thank Dr. J.C. Prabhakar, Dr. F.
Gruenberger and Dr. R. Davidson for their helpful suggestions, and
Dr. R. Pettit for having helped me to obtain references from the
Georgia Institute of Technology.

Recommendations of Dr. A.M. Bush of the School of Engineering,
Georgia Tech, Atlanta, are gratefully acknowledged.

Finally, I must thank my wife for her typing and
encouragement.
TABLE OF CONTENTS

                                                                 Page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . .  iii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . .    v
ABSTRACT  . . . . . . . . . . . . . . . . . . . . . . . . . . .   vi
Chapter
I.    INTRODUCTION TO SPEECH  . . . . . . . . . . . . . . . . .    1
      1.1  Mechanism of forming speech wave . . . . . . . . . .    1
      1.2  Voiced and unvoiced sounds . . . . . . . . . . . . .    2
      1.3  Phonemes, elements of speech sound . . . . . . . . .    2
II.   METHODS OF CHARACTERIZATION OF SPEECH WAVEFORMS . . . . .    7
      2.1  Mathematical characterization  . . . . . . . . . . .    7
      2.2  Physical characterization  . . . . . . . . . . . . .    9
      2.3  Characterization by way of analysis-by-synthesis . .   12
      2.4  Characterization by linear predictive coefficients .   12
      2.5  Statistical characterization . . . . . . . . . . . .   13
III.  CHARACTERIZATION BY LINEAR PREDICTIVE COEFFICIENTS  . . .   15
      3.1  Introduction . . . . . . . . . . . . . . . . . . . .   15
      3.2  Linear predictor and its formulation . . . . . . . .   16
      3.3  Estimation of linear predictive coefficients . . . .   21
      3.4  Estimation of autocorrelation functions  . . . . . .   23
      3.5  Computation algorithm  . . . . . . . . . . . . . . .   25
      3.6  Stability of the filter  . . . . . . . . . . . . . .   29
      3.7  Extraction of other parameters from linear prediction  30
      3.8  Determination of filter order and speech frame size    32
      3.9  Choice of window function  . . . . . . . . . . . . .   33
IV.   MICROPROCESSOR-BASED SYSTEM DESIGN AND CONSIDERATION  . .   37
      4.1  Description  . . . . . . . . . . . . . . . . . . . .   37
      4.2  End-point detector . . . . . . . . . . . . . . . . .   39
      4.3  Determination of the number of quantization bits . .   41
      4.4  Speech band-limitation and low-pass filter . . . . .   46
      4.5  Interfacing microprocessor . . . . . . . . . . . . .   48
      4.6  Programming  . . . . . . . . . . . . . . . . . . . .   49
REFERENCES  . . . . . . . . . . . . . . . . . . . . . . . . . .   57
LIST OF FIGURES

Figure                                                           Page
2.1   Ideal low-pass filter for short-time spectrum analysis  .   10
2.2   Simple technique to characterize speech by
      short-time spectrum analysis  . . . . . . . . . . . . . .   11
3.1   The use of the linear predictor to represent the discrete
      speech production model: (a) in the frequency and (b) in
      the time domains  . . . . . . . . . . . . . . . . . . . .   20
3.2   Elements of a stationary autocorrelation matrix of
      six data samples  . . . . . . . . . . . . . . . . . . . .   25
3.3   Convolution of signal with the rectangular window in the
      frequency domain at (a), and the resulting spectrum at (b)  35
4.1   Block diagram of a speech characterization system . . . .   38
4.2   Rectified speech signal and noise after the capacitor . .   41
4.3   Interfacing of the microprocessor with the ADC and
      the end-point detector circuit  . . . . . . . . . . . . .   48
4.4   Flowchart for reading data from the ADC . . . . . . . . .   51
4.5   Flowchart for windowing data  . . . . . . . . . . . . . .   52
4.6a  Flowchart for computing autocorrelation functions, 1st part  54
4.6b  Flowchart for computing autocorrelation functions, 2nd part  55
4.6c  Flowchart for computing autocorrelation functions, 3rd part  56
4.7   Flowchart for computing linear predictive coefficients  .   57
ABSTRACT
CHARACTERIZATION OF SPEECH WAVEFORMS
by
Sang Huy Ta
Master of Science in Engineering
Speech signals are a random process which can be voiced or
unvoiced, oral and/or nasal.  The smallest element of speech is called
a phoneme.  Phonemes are combined to form meaningful speech sounds.
It is well known that speech can be recorded and reproduced by means
of a recording disc or a magnetic tape; but for efficient speech
storage, compression and transmission, for automatic speaker
verification and identification, for voice response systems, for speech
synthesis and other digital processing, the waveform of speech needs to
be characterized.

Among the various methods of characterization, such as the
mathematical, statistical, physical, and analysis-by-synthesis methods,
the characterization of speech waveforms by linear predictive
coefficients is seen to be the best method.  This method is considered
in detail, beginning with the concept of linear prediction and its
formulation.
The linear
predictor is simply a form of recursive digital filter whose
coefficients are estimated by the least-mean-squares method in
order to simulate speech.
The number of coefficients or poles of the
filter is determined by the desired degree of simulation and is shown
to be related to the vocal tract length.
In the estimation of the
linear predictive coefficients, the autocorrelation functions of each
characterized speech segment must be computed first using the principle
of sample autocorrelation function.
An iterative computation algorithm
is developed to find the predictive coefficients.
The choice of
appropriate window function and speech frame size, and the derivation
of other characterizing parameters such as voice pitch and formant
frequencies from the linear predictive coefficients are discussed.
An
experiment using a microprocessor to compute the linear predictive
coefficients is considered.
Each windowed speech frame is represented
by a set of filter coefficients which, in the strict sense, should
make the simulated speech at the output of the filter indistinguishable
from the natural speech in the resynthesis process, even though the
signal at the input of the filter is only a train of impulses or white
noise.
CHAPTER I
INTRODUCTION TO SPEECH
1.1
Mechanism of forming speech wave.
When we connect a loudspeaker across a DC voltage source, we
do not hear anything from it in the steady state.  Similarly, if we
blow straight through a regular pipe, we also do not hear anything.  If
instead an AC voltage source with frequencies in the audio range,
i.e., from 20 Hz to 20 kHz, is connected to the loudspeaker, or if
the pipe is made with constrictions at appropriate places, or is
obstructed at its blown end by a vibrating device, sound will be heard.
In both cases, the phenomenon can only be explained by the fact that
only when the air surrounding the ear is perturbed will the
perturbation be "heard".  Through the mechanism of hearing, we can tell
whether this perturbation is weak or strong, pulsative or repetitive,
periodic or nonperiodic, etc.
In engineering, we identify this audible air perturbation as
a sound wave.  For speech to be heard, the air stream coming from
the mouth and nose must be caused to alternate somehow to generate a
sound wave, which is referred to as a speech wave.
The construction of the vocal tract is complicated enough
to make the outgoing air stream oscillate or pulsate.  When the
air is pushed out from the lungs, it must flow through the glottis,
a passage between the vocal cords, through the pharynx cavity and then
through the oral and nasal cavities before reaching the outside air.
It is at the glottis and at the oral cavity that the air stream is
mostly modulated, at a temporal rate high enough to make it audible.
The vocal cords make the glottis vibrate.  The jaw, tongue and lips
change the size and shape of the oral cavity for each type of sound
generated.  Also, the constrictions created by the tongue, the teeth,
the lips, the soft palate and the hard palate are added to shape the
perturbation of the speech waves.
1.2
Voiced and unvoiced sounds.
The division of speech sounds into voiced or unvoiced refers
to the circumstance that in the former group the vibration of the vocal
cords is the first step in generating the AC components of the air
stream, while in the latter group the vocal cords do not vibrate and
the speech results from excitation of the vocal tract by a turbulent
air stream flowing through a point of constriction or other sharp edges
in the tract, or by an abrupt release of pressure through the mouth
opening or through some point of closure in the mouth.
The voiced sound is one of the elements by which a speaker
can be identified or a word can be recognized.
It is also the subject
for various methods of analysis and synthesis.
1.3
Phonemes, elements of speech sound.
Based on the manner of formation of voiced or unvoiced speech
sounds and their physical characteristics, and particularly the
auditory ability by which an average listener can discriminate among
them, forty-four speech sounds of General American English (GA) were
chosen for experimental studies.  They can be classified in six
categories as follows:
1. vowels
2. fricative consonants
3. stop consonants
4. nasal consonants
5. semi-vowels and glides
6. diphthongs and affricates
These sounds may be represented in terms of phonemes which are the smallest distinguishable units of speech.
The different manifestations that
are identified as a particular phoneme are called allophones of that
phoneme.
Vowels

Vowel sounds are produced by the vibration of the vocal cords
when the air passes through the glottis.  There is little
constriction of the oral cavity, and the vocal tract is relatively
stable.  Radiation of sound is from the mouth.  Since vowel production
does not require motion of the articulators, vowels may be uttered as
sustained sounds and are called continuants.
/i/   eve                     /u/   boot
/I/   it                      /U/   foot
/e/   hate*                   /ʌ/   up
/ɛ/   met                     /ə/   ado (unstressed)
/æ/   at                      /ɔ/   all
/a/   father                  /o/   obey*
/ɝ/   bird                    /ɚ/   over (unstressed)

(*These sounds usually appear as diphthongs in GA English.)
Fricative consonants

Fricative consonants result from unvoiced excitation or from a
combination of voiced and unvoiced excitation of the vocal tract.
Unvoiced excitation is produced by air flowing through some point of
constriction.  The constriction is formed by the tongue, teeth, lips,
hard and soft palates.  If the vocal cords vibrate in conjunction with
the unvoiced excitation, the fricative is a voiced one.  Radiation of a
fricative is normally from the mouth.  Fricatives are also continuants.

      Voiced                        Unvoiced
/v/   vote                    /f/   for
/ð/   then                    /θ/   thin
/z/   zoo                     /s/   see
/ʒ/   azure                   /ʃ/   she
                              /h/   he
Stop consonants

Stop consonants are characterized by an abrupt release of
pressure at a point of closure in the mouth.  The closure can be formed
by lips, tongue and teeth or by tongue and soft palate.  Stop consonants
require rapid articulation and thus are not continuant sounds.  A stop
may also be produced with simultaneous voicing.

      Voiced                        Unvoiced
/b/   be                      /p/   pay
/d/   day                     /t/   to
/g/   go                      /k/   key
Nasal consonants

The nasal consonants are voiced sounds produced with the
velum widely open.  There is complete closure at some point in the oral
cavity so that the air is forced to radiate through the nostrils.
Nasals are continuant sounds.

/m/   me
/n/   no
/ŋ/   sing
Semi-vowels and Glides

Semi-vowels and glides are sounds which closely resemble
vowels.  They are characterized by greater constriction of the oral
cavity than in the case of most vowels.  Glides are dynamic sounds
which precede vowels and which show movement toward the vowel.

      Glides                        Semi-vowels
/w/   we                      /r/   real
/j/   you                     /l/   let
Diphthongs and Affricates

Diphthongs and affricates are special sounds produced by the
appropriate combination of a pair of vowels in the case of
diphthongs, or of a pair of stop-fricative consonants in the case of
affricates.  Diphthongs are vowel-like sounds uttered by a transition
from one vowel phoneme to another, while affricates are characterized
by a transition between certain stop and fricative phonemes.

      Diphthongs                    Affricates
/eI/  say                     /tʃ/  chew
/aI/  I                       /dʒ/  jar
/ɔI/  boy
/aU/  out
/oU/  go
/Iu/  new
The phonemic description of speech is an attempt to classify
speech sounds in terms of fundamental elements or units which cannot
be further subdivided.  These elements, or phonemes, are combined to
produce morphemes, the simplest meaningful sounds.  Morphemes are then
arranged to make up words, phrases, etc.

The phonemes may be thought of as a code relating specific
vocal tract configurations and excitations to specific speech sounds.
Such a viewpoint suggests that speech can be analyzed for
characterization.
CHAPTER II
METHODS OF CHARACTERIZATION OF SPEECH WAVEFORMS
There are various methods for characterizing a speech
waveform.  In the strict sense, a waveform is said to be characterized
when its characteristic features can be resynthesized to obtain the
original form.  For a speech waveform, some latitude in the degree of
exactness of synthesis is allowed, owing to the physioacoustic
properties of the source of speech sound.  This voice
intelligibility-based characterization is reflected in the methods
discussed below.
2.1
Mathematical characterization.
Undoubtedly, the best way to characterize a speech waveform
is by simulating it with a mathematical function, because if this
function is exact, the speech can be regenerated at any time, and so
can its spectrum.

However, the speech wave varies so randomly that a set of
functions must be used to attain an exact representation of speech.
In an effort to reach this exactness, researchers have tried
various functions and synthesized them in different ways.
A well-known indirect method is estimation of the speech
spectrum envelope by employing the Fourier transform of the impulse
response h(t) of the inverse filter obtained by solving Wiener's
inverse filter problem.  If t is made sufficiently large, the impulse
response of the inverse filter converges toward zero.
Then h(t) can be expanded in
terms of damped oscillating orthogonal functions.
This modulated
Fourier system is effective in representing speech features with fewer
terms and provides a means to reconstruct the original waveform by way
of an inverse Fourier transform.
Motivated by Fourier analysis and by the damping property
of speech sound, the voiced waveforms of connected speech can be
analyzed and resynthesized in terms of fixed, orthogonalized,
exponentially damped sinusoids.  During the qth interval, the speech
waveform is represented by an approximation of the form

    S_q(t) \approx \sum_{k=1}^{n/2} A_k^q \, e^{-\alpha_k (t - t_q)} \sin[\beta_k (t - t_q) + \phi_k^q]

where A_k^q and \phi_k^q are, respectively, the amplitude and starting
phase of the kth exponentially decaying sinusoid in the qth interval;
\alpha_k and \beta_k are, respectively, the kth (positive and real)
damping and radian frequency parameters and are not functions of the
analysis interval q; and n is the number of complex frequencies
included in the functions of the equation.  If the n constants are
known for each pitch period or analysis interval, then the original
speech signal can be regenerated by the summation

    S(t) = \sum_q S_q(t)
Characterization of speech can also be done by time-ordered
wave-function sets or by application of Laguerre functions for
parametric coding.

While the simulation of speech by mathematical functions
can reduce the number of characteristic features to be represented,
the resynthesized speech suffers quality degradation caused
by discontinuities at the boundaries between pitch periods.
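To make the damped-sinusoid approximation concrete, the following is a
minimal numerical sketch of one interval S_q(t), assuming Python with
NumPy; the amplitudes, damping factors, frequencies and phases below
are illustrative values, not taken from the thesis.

```python
import numpy as np

def damped_sinusoid_interval(t, tq, A, alpha, beta, phi):
    """One analysis interval S_q(t): a sum of exponentially damped
    sinusoids A_k exp(-alpha_k (t-tq)) sin(beta_k (t-tq) + phi_k)."""
    dt = t - tq
    return sum(a * np.exp(-al * dt) * np.sin(b * dt + ph)
               for a, al, b, ph in zip(A, alpha, beta, phi))

# Example: one 10-msec pitch period with two damped components at
# formant-like frequencies of 500 Hz and 1500 Hz (all illustrative).
t = np.linspace(0.0, 0.01, 100)
Sq = damped_sinusoid_interval(t, 0.0,
                              A=[1.0, 0.5],
                              alpha=[200.0, 400.0],
                              beta=[2 * np.pi * 500, 2 * np.pi * 1500],
                              phi=[0.0, 0.3])
```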
2.2
Physical characterization.
The most popular method used before the present decade for
characterizing a speech waveform had been the physical method.

This method relates the characteristic features to the
physical properties of the wave source and uses physical instruments
such as the oscillograph, spectrograph and spectrum analyzer to pick
out the features.

Since the vocal cord formation, the vocal tract, the oral
and nasal cavities and constrictions are not the same from one
individual to another, one expects to have voice pitch and harmonic
resonant frequencies as characteristic features of speech.
In speech processing,
these resonant frequencies are called formant frequencies.
This expectation is confirmed by observing the speech oscillogram and
spectrum, wherein the waveform of the voiced speech is quite periodic
and the spectrum shows relatively accentuated regions with distinctive
peaks along the bandwidth.  Changing the speaker will make these
features vary, although they can also vary to some degree with the same
speaker for the same word.  The unvoiced sound does not have these
features and is often characterized by its transient characteristic as
well as its energy.
The problem is thus to find the voice fundamental frequency
and formant frequency information contained in each speech segment,
in the time domain or in the frequency domain or both.  For the same
speaker, these features are subject to change during the utterance and
during the transition from one phoneme to another.

A good way to find the formant frequencies is by way of
spectrum analysis, and since the speech waveform changes quite fast, a
short-time spectrum analysis is applied.
The spectrum of a discrete-time speech signal is defined as

    X(\omega) = \sum_{n=-\infty}^{\infty} x(n) e^{-j\omega n}

For simplicity, the sampling period is taken equal to unity hereafter.
[Fig. 2.1  Ideal low-pass filter H(\omega) for short-time spectrum
analysis.]
The short-time spectrum analysis is defined by

    X(\omega, n) = \sum_{k=-\infty}^{n} x(k) h(n-k) e^{-j\omega k}        (2.1)

Eq. (2.1) can be viewed as measuring the infinite-time Fourier
transform of the speech signal at time n, seen through a window with
response h(n).  Eq. (2.1) can also be written as

    X(\omega, n) = a(\omega, n) - j b(\omega, n)

where

    a(\omega, n) = \sum_{k=-\infty}^{n} x(k) h(n-k) \cos(\omega k)        (2.2)

    b(\omega, n) = \sum_{k=-\infty}^{n} x(k) h(n-k) \sin(\omega k)        (2.3)
Eqs. (2.2) and (2.3) can be used as a simple technique for measuring
the short-time transform, and therefrom the speech features, as
illustrated in Fig. 2.2.

[Fig. 2.2  Simple technique to characterize speech by short-time
spectrum analysis: x(n) is multiplied by cos(k\omega) and sin(k\omega)
and filtered to give a(\omega, n) and b(\omega, n).]
H(\omega) in Fig. 2.1 corresponds to the frequency response of an ideal
low-pass filter with an impulse as its input.  The inverse transform
of H(\omega) shows an infinite time response.  In practice, this
infinite time is limited and error is introduced in the measurement.
A set of N frequencies is measured by iterating the above
technique.  From the obtained spectrum, information such as formant
frequencies and bandwidths is picked out.

Instead of extracting the above speech characteristics from
the short-time spectrum, the spectral envelope can be estimated for
matching and for synthesis purposes, or the autocorrelation
coefficients can be computed by an inverse transform for speech
recognition purposes.
The concept of short-time spectrum analysis offers the
possibility of a moderately efficient and flexible characterization of
speech wherein pitch tracking is not required.  Its fundamental
principle lies in the fact that a speech signal is passed through a
bank of contiguous bandpass filters whose passbands can be recombined
to recover the original speech sound.
2.3
Characterization by way of analysis-by-synthesis.
Based on the auditory ability by which an average listener
can distinguish speech sounds and different words, a speech waveform
is first analyzed to find its feature content; these features are
then resynthesized while being adjusted to the degree that an average
listener can recognize the original words in a speaker-dependent
or speaker-independent experiment.  The adjusted features are then
considered the resulting product of the analysis, and they characterize
the speech waveform.

The just-discriminable changes in the fundamental frequency
are of the order of 0.3 to 0.5 cps and are, in general, slightly less
than the frequency changes discriminable in a pure tone of the same
frequency and sound pressure level.
While this method can lead to an exactness in the characterization, it is costly and time consuming.
2.4
Characterization by linear predictive coefficients.

Efficient characterization of a speech waveform in terms of a
small number of slowly varying parameters is certainly the most
important problem in speech research.

In recent years, speech researchers have succeeded in
synthesizing speech from a linear all-pole filter by determining the
coefficients of this filter so that a present speech sample can be
predicted from the past samples.  This method will be considered in
detail later.
2.5
Statistical characterization.
Speech characterizing parameters are subject to variations.
For the same phoneme, morpheme or word, they can vary with the speaker,
from speaker to speaker, and with time.  They can be stable for a
period of about 10 days and then exhibit gradual or serious variations
over periods of three months or more.
Since speech is considered a random signal, a statistical
method can be applied for long-time averaging of the parameters,
estimating their variances, and estimating the correlation coefficients
between them.  The reliability of a specific system thus depends on the
statistical measurement of these parameters.
In signal processing, a statistical method can also be used
while extracting features from a random speech signal.  It helps to
reduce the variance of the parameters obtained and enhances their
discriminability.
One such method is the method of periodograms which is used
to derive the smoothed autocorrelation functions.
These functions are
then considered as characterizing a speech segment, since the spectral
density obtained from them can be converted back to the original speech
segment by an inverse transform process.
This method makes use of the
probability distribution function of a purely random stationary signal
and is formulated as follows.
The n samples of speech are first divided into segments of
N samples, each segment corresponding to a locally stationary random
signal.  The spectral density of each speech segment is given by

    S_N(\omega) = \frac{1}{N} \left| \sum_{n=0}^{N-1} x(n) e^{-jn\omega} \right|^2
                = \frac{1}{N} \left| X_N(\omega) \right|^2

Unfortunately, var S_N(\omega) is not small even for large N.

If each speech segment of N samples is further subdivided
into K subsegments of M samples, and the K subsegments or periodograms
are statistically independent, the variance of the averaged estimate
will be reduced by the factor 1/K.
Thus, the mth periodogram is calculated by

    S_{M,m}(\omega) = \frac{1}{M} \left| \sum_{k=M(m-1)}^{Mm-1} x(k) e^{-j\omega k} \right|^2 ,
    m = 1, 2, \ldots, K

and the averaged periodogram \hat{S}_M^K can then be calculated from
the K individual periodograms of length M,

    \hat{S}_M^K(\omega) = \frac{1}{K} \sum_{m=1}^{K} S_{M,m}(\omega)
The estimated autocorrelation functions over a speech segment can be
computed from the inverse transform of \hat{S}_M^K.

If \hat{S}_M^K is derived from the finite Fourier transform, which when
inverse-transformed gives a periodic repetition of one speech segment,
the autocorrelation functions resulting from correlating this periodic
sequence with the same but lagged periodic sequence will be erroneous.
In order to correct this error, each subsegment of M samples must be
padded with M zeros before transformation.
The autocorrelation functions obtained thus represent a set
of slowly time-varying parameters characterizing each speech segment.
The order of K is much less than N.
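As a concrete illustration of the averaged-periodogram procedure above,
here is a minimal sketch assuming Python with NumPy; the segment sizes
are illustrative, not taken from the thesis.

```python
import numpy as np

def avg_periodogram_autocorr(x, M, K):
    """Average K periodograms of length M (each zero-padded to 2M, as
    the text requires) and return the autocorrelation estimate obtained
    by inverse transform of the averaged spectral density."""
    S = np.zeros(2 * M)
    for m in range(K):
        seg = x[m * M:(m + 1) * M]
        padded = np.concatenate([seg, np.zeros(M)])   # pad with M zeros
        X = np.fft.fft(padded)
        S += (np.abs(X) ** 2) / M                     # S_{M,m}(w)
    S /= K                                            # averaged periodogram
    R = np.fft.ifft(S).real                           # inverse transform
    return R[:M]                                      # lags 0..M-1

# Example with illustrative sizes: a 960-sample segment split into
# K = 4 subsegments of M = 240 samples.
x = np.random.randn(960)
R = avg_periodogram_autocorr(x, M=240, K=4)
```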
CHAPTER III
CHARACTERIZATION BY LINEAR PREDICTIVE COEFFICIENTS
3.1
Introduction
The characterization of a speech waveform by mathematical
functions can reduce the number of characterizing parameters, but the
resynthesized speech suffers quality degradation caused by
discontinuities at the boundaries between the pitch periods.
Furthermore, the method can be cumbersome and is not easily controlled
in a physical system.
The physical characterization has some advantages over
the mathematical method, since the characterizing features such as
pitch periods and formant frequencies relate the system to the
physioacoustic properties of the sound source; but the well-known
technique of spectral analysis suffers from a number of serious
limitations arising from the non-stationary as well as the
quasi-periodic properties of the speech wave.  A short-time spectrum
analysis can partially correct this situation, but it is limited by
system resolution and requires a bit rate of at least 10,000 bits/sec
for good quality resynthesized speech.
Although the analysis-by-synthesis techniques can provide a
partial solution to the above difficulties, such techniques are
cumbersome and time consuming even for a modern digital computer and
are therefore unsuitable for the processing of large amounts of speech
data.

The main problem for speech researchers is how to find a
method of characterizing a speech waveform with as small a number of
parameters as possible while still obtaining indistinguishable
resynthesized speech, and efficient storage and transmission of speech.
By studying the characteristics of the filters and the transfer
function of the vocal tract, a powerful method which can satisfy the
above requirements has been derived.  This method uses the principle of
linear prediction of signals and is formulated as follows.
3.2
Linear predictor and its formulation
It is well known that when a signal passes through a filter,
its shape can be modified by the filter, either by filtering out a
portion of the signal frequency band or by modifying the relative
amplitude or phase of the signal frequency components.

Such a filter characteristic suggests the derivation of a
pole-zero filter transfer function to shape white noise or impulses
into a speech signal.
Consider an analog system function

    H_a(s) = \frac{Y_a(s)}{X_a(s)} = \frac{\sum_{k=0}^{q} d_k s^k}{\sum_{k=0}^{p} c_k s^k}

where x_a(t) is the input and y_a(t) is the output, and X_a(s) and
Y_a(s) are their respective Laplace transforms.

The output is thus related to the input by a convolution in
the time domain,

    y_a(t) = \int_{-\infty}^{\infty} x_a(\tau) h_a(t - \tau) \, d\tau

where h_a(t), the impulse response, is the inverse Laplace transform of
H_a(s).

Alternatively, an analog system having a system function H_a(s)
can be described by the differential equation

    \sum_{k=0}^{p} c_k \frac{d^k y_a(t)}{dt^k} = \sum_{k=0}^{q} d_k \frac{d^k x_a(t)}{dt^k}
In digital signal processing, the corresponding rational
system function for digital filters has the form

    H(z) = \frac{Y(z)}{X(z)} = \frac{\sum_{k=0}^{q} b_k z^{-k}}{\sum_{k=0}^{p} a_k z^{-k}}        (3.1)

Such a multiplication in the frequency domain corresponds to a
convolution-sum in the time domain:

    y(n) = \sum_{k=-\infty}^{\infty} h(k) x(n-k)        (3.2)

or, equivalently, to the difference equation corresponding to the
inverse z-transform of H(z),

    \sum_{k=0}^{p} a_k y(n-k) = \sum_{k=0}^{q} b_k x(n-k)        (3.3)

Since Eq. (3.1) is a ratio of polynomials in z^{-1}, it can also be
expressed in factored form as

    H(z) = C \, \frac{\prod_{k=1}^{q} (1 - A_k z^{-1})}{\prod_{k=1}^{p} (1 - B_k z^{-1})}        (3.4)

Each of the factors (1 - A_k z^{-1}) in the numerator of Eq. (3.4)
contributes a zero at z = A_k and a pole at z = 0.  Similarly, each of
the factors of the form (1 - B_k z^{-1}) in the denominator contributes
a pole at z = B_k and a zero at the origin.  It is characteristic of
the systems describable by linear constant-coefficient difference
equations that their system functions are a ratio of polynomials, in
z^{-1} here.  Consequently, to within a scale factor C in Eq. (3.1),
the transfer function of the digital filter can be specified by a
pole-zero pattern in the z-plane.  Thus any filter having this
pole-zero pattern must satisfy the pth-order linear constant-coefficient
difference equation of the form given by (3.3).  Eq. (3.3) can be
rewritten in the form
    y(n) = -\sum_{k=1}^{p} \frac{a_k}{a_0} y(n-k) + \sum_{k=0}^{q} \frac{b_k}{a_0} x(n-k)

Assuming a_0 = 1, we then have

    y(n) = -\sum_{k=1}^{p} a_k y(n-k) + \sum_{k=0}^{q} b_k x(n-k)        (3.5)
The question now arises "Do we really need the zero-pattern in the filter transfer function?"
It is known that for non-nasal voiced speech sounds the
transfer function of the vocal tract has no zeros.  For these sounds,
an all-pole recursive filter is good enough to adequately represent
the transfer function of the vocal tract.  However, for unvoiced and
nasal sounds, it is found that the vocal tract transfer function is
represented by both poles and zeros.  Since these zeros lie within the
unit circle in the z-plane, each factor in the numerator of the
transfer function can be approximated by multiple poles in the
denominator of the transfer function.  In addition, if we consider the
degree of perceptibility, especially by looking at the positions of the
formant frequencies in the speech spectrum, we will see that the
location of a pole contributes much more to the speech characterization
than the location of a zero.  Thus, by adding more poles to the
transfer function, an all-pole filter model can approximate the effect
of antiresonances on the speech wave in the frequency range of interest
to any desired accuracy.  Such an all-pole filter model is often called
an autoregressive model.  By this, Eq. (3.1) is reduced to
    H(z) = \frac{1}{\sum_{k=0}^{p} a_k z^{-k}}        (3.6)

and Eq. (3.5) is reduced to

    y(n) = -\sum_{k=1}^{p} a_k y(n-k) + x(n)        (3.7)
Eq. (3.7) reveals that if an impulse or discrete white noise x(n) is
sent to the input of an all-pole linear recursive filter, the nth
output can be computed from the present input sample and the p past
output samples.  If y(n) simulates a speech sample, then the order p
and the values of the coefficients a_k can be estimated to represent it
to an acceptable degree of accuracy.

With this characteristic of linear prediction, a linear
recursive filter, particularly an autoregressive filter in this case,
is often called a linear predictor and is represented by the models in
Fig. 3.1.
[Fig. 3.1  The use of the linear predictor to represent the discrete
speech production model: (a) in the frequency domain, x(n) driving
H(z) = 1 / \sum_{k=0}^{p} a_k z^{-k} to give y(n); (b) in the time
domain, a linear predictor of order p.]
Now, from Eq. (3.7), the first speech sample is given by
y(0) = x(0).

If we assume that the next input samples are unknown and try
to simulate the next output samples recursively just from this first
output sample y(0), we can then estimate the output for a particular
speech segment by the approximate equation

    \hat{y}(n) = -\sum_{k=1}^{p} a_k \hat{y}(n-k)        (3.8)

How good this approximation is depends on the estimation of the
coefficients a_k, called the linear predictive coefficients of the
filter.
3.3
Estimation of linear predictive coefficients
If y(n) represents the true speech sample, then by applying
Eq. (3.8) we have introduced an error of

    e(n) = y(n) - \hat{y}(n) = y(n) + \sum_{k=1}^{p} a_k y(n-k)        (3.9)

The best way to minimize this kind of noise-free error is to
square it, differentiate it with respect to the coefficients a_k,
equate the result to zero and solve for a_k.  The coefficients a_k thus
obtained, when put back into the equation, will make \hat{y}(n) closest
to y(n).  Such a method is called least-squares estimation.
is called the least squares estimation.
Since speech is a random signal, the error e(n) is also a
sample of a random process.
In the least-square method, we minimize
the expected value of the square of the error.
2
2
Thus,
2-
p
E[ e ] = E[ ( y - y ) ] = E[ ( y + ~ aky -k) ] , and
n
n
n
n
k=l
n
2
E['~e J = 2E [ e ·
·ba.l
be
_E. ]
n c)a.
= 2E[e y
l
.]
n n-1.
=0
The set of equations that must be solved to find the optimum
choice of ai' i = 1,2, •••• ,p is thus given by
(3.10)
22
This is called the orthogonality principle in methods of
estimation.  It means that the product of the error e_n = y_n - \hat{y}_n
with each of the past samples y_{n-i} is set equal to zero in an
expected-value sense.

If Eq. (3.9) is substituted into (3.10), we get

    -\sum_{k=1}^{p} a_k E[y_{n-k} y_{n-i}] = E[y_n y_{n-i}]        (3.11)
A random speech signal can be considered non-stationary or
locally stationary.  Hereafter, we choose the stationary case for the
processing.  Thus

    E[y_{n-k} y_{n-i}] = R(i-k)
    E[y_n y_{n-i}] = R(i)

and Eq. (3.11) becomes

    \sum_{k=1}^{p} a_k R(i-k) = R(i) ,    i = 1, 2, \ldots, p        (3.12)
The expanded form of this set of equations corresponds to

    R(0)a_1 + R(1)a_2 + \cdots + R(p-1)a_p = R(1)
    R(1)a_1 + R(0)a_2 + \cdots + R(p-2)a_p = R(2)        (3.13)
    .....................................
    R(p-1)a_1 + R(p-2)a_2 + \cdots + R(0)a_p = R(p)

The expansion of (3.13) has taken into account the symmetry property
of the autocorrelation function, by which R(k) = R(-k).
More compactly, if we define the p x p autocorrelation matrix
R with elements R(i-k), a column vector A_p with elements a_k, and a
column vector R* with elements R(i), where i = 1, 2, \ldots, p, we have
the set of equations in matrix-vector form

    \begin{pmatrix}
    R(0)   & R(1)   & R(2)   & \cdots & R(p-1) \\
    R(1)   & R(0)   & R(1)   & \cdots & R(p-2) \\
    R(2)   & R(1)   & R(0)   & \cdots & R(p-3) \\
    \vdots & \vdots & \vdots &        & \vdots \\
    R(p-1) & R(p-2) & R(p-3) & \cdots & R(0)
    \end{pmatrix}
    \begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_p \end{pmatrix}
    =
    \begin{pmatrix} R(1) \\ R(2) \\ R(3) \\ \vdots \\ R(p) \end{pmatrix}        (3.14)

This p x p matrix is known as a Toeplitz matrix by its symmetry
property; the elements along any diagonal are identical.
The formal solution to Eq. (3.14) is thus

    A_p = R^{-1} R*

where R^{-1} is the inverse of the matrix R.

This is the formal solution to the problem previously stated.
The set of filter coefficients so derived are to be multiplied by the
past output samples from the linear predictor and then summed to obtain
the best linear estimate of a particular present speech sample in the
least mean-squared error sense.
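As a concrete illustration of setting up and solving Eq. (3.14), the
following is a minimal sketch assuming Python with NumPy and SciPy;
SciPy's solve_toeplitz exploits the Toeplitz symmetry noted above, and
the autocorrelation values shown are illustrative.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_from_autocorr(R):
    """Solve Eq. (3.14): the p x p Toeplitz system built from
    R(0)..R(p) for the predictor coefficients a_1..a_p."""
    p = len(R) - 1
    # First column of the Toeplitz matrix is R(0)..R(p-1);
    # the right-hand side is R(1)..R(p).
    return solve_toeplitz(R[:p], R[1:])

# Example with an illustrative autocorrelation sequence (p = 3).
R = np.array([1.0, 0.8, 0.5, 0.2])
a = lpc_from_autocorr(R)
```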
3.4
Estimation of the autocorrelation functions
The first step in solving Eq. (3.14) is to derive the
autocorrelation functions from the speech segment which is to be
characterized.
If we consider a speech signal as zero-mean, random and locally
stationary, and the number N of speech samples available for
characterization is not large compared to the order p of the filter,
the true values of the autocorrelation functions cannot be obtained,
and we must rely on a method of estimation called the sample
autocorrelation function.

If the autocorrelation function of order zero, or R(0), can
be estimated by means of the sample average of the N samples,

    R_N(0) = \frac{1}{N} \sum_{i=0}^{N-1} y_i^2
           = \frac{y_0^2 + y_1^2 + \cdots + y_{N-1}^2}{N}
then the sample autocorrelation functions at other sampling delay
times, or lags, can be estimated by

    R_N(k) = \frac{1}{N} \sum_{i=0}^{N-1-|k|} y_i \, y_{i+|k|} ,
    k = \pm 1, \pm 2, \ldots, \pm p        (3.15)

R_N(k) is also an even function of k, just as the true R(k) is.
The mean of R_N(k) is related to the true R(k) as follows:

    E[R_N(k)] = \left(1 - \frac{|k|}{N}\right) R(k) ,    k = 1, 2, \ldots, p

The estimate R_N(k) departs further from its true value R(k) as the
order p of the linear recursive filter increases or as N decreases.
On the basis of Eq. (3.15), Fig. 3.2 shows the calculation of the
elements of the stationary autocorrelation matrix for a data sequence
of six samples.

[Fig. 3.2  Elements of a stationary autocorrelation matrix of six data
samples: the upper-triangular array of products y_i y_j for
i, j = 0, \ldots, 5.]
The autocorrelation functions can also be derived from the frequency
domain by the method of smoothed spectral density, but this takes more
computation time than the method of sample autocorrelation functions
just discussed, although the results obtained have smaller variance.
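As a concrete illustration of Eq. (3.15), the following is a minimal
sketch assuming Python with NumPy; the frame length and filter order
are illustrative values in the range discussed in section 3.8.

```python
import numpy as np

def sample_autocorr(y, p):
    """Eq. (3.15): biased sample autocorrelations R_N(k), k = 0..p."""
    N = len(y)
    return np.array([np.dot(y[:N - k], y[k:]) / N for k in range(p + 1)])

# Example: a 180-sample frame (18 msec at 10 kHz) and order p = 12.
y = np.random.randn(180)                 # stand-in for a speech frame
R = sample_autocorr(y, 12)
```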
3.5
Computation algorithm
With the autocorrelation functions estimated in part 3.4, we
can now solve Eq. (3.14) for the linear predictive coefficients a_k.

First of all, the solution to Eq. (3.14) is unaffected
if all the autocorrelation functions are scaled by a constant.  In
particular, if all R(i) are normalized by dividing by R(0), we have
what is known as the normalized autocorrelation matrix, in which

    r(i) = \frac{R(i)}{R(0)}

Eq. (3.14) is thus rewritten in the normalized form as
    \begin{pmatrix}
    r_0     & r_1     & r_2     & \cdots & r_{p-1} \\
    r_1     & r_0     & r_1     & \cdots & r_{p-2} \\
    r_2     & r_1     & r_0     & \cdots & r_{p-3} \\
    \vdots  & \vdots  & \vdots  &        & \vdots  \\
    r_{p-1} & r_{p-2} & r_{p-3} & \cdots & r_0
    \end{pmatrix}
    \begin{pmatrix} a_1^{(p)} \\ a_2^{(p)} \\ a_3^{(p)} \\ \vdots \\ a_p^{(p)} \end{pmatrix}
    =
    \begin{pmatrix} r_1 \\ r_2 \\ r_3 \\ \vdots \\ r_p \end{pmatrix}        (3.15a)
or in the equivalent form

    R_{pp} A^{(p)} = R_p^*        (3.16)

where

    R_{pp}  = the p x p matrix on the left side of (3.15a),
    A^{(p)} = a p-dimensional column vector whose kth element is a_k^{(p)},
    R_p^*   = a p-dimensional column vector whose kth element is r_k.

Also, define the reversed column vectors

    \tilde{A}^{(p)} = (a_p^{(p)}, a_{p-1}^{(p)}, \ldots, a_1^{(p)})^T
    \tilde{R}_{p-1} = (r_{p-1}, r_{p-2}, \ldots, r_1)^T

and let A_{p-1}^{(p)} denote the vector of the first p-1 elements of
A^{(p)}.  By partitioning the last row and column of R_{pp},
Eq. (3.15a) or Eq. (3.16) can be rewritten as

    R_{(p-1)(p-1)} A_{p-1}^{(p)} + a_p^{(p)} \tilde{R}_{p-1} = R_{p-1}^*        (3.17)

and

    \tilde{R}_{p-1}^T A_{p-1}^{(p)} + r_0 \, a_p^{(p)} = r_p        (3.18)

where \tilde{R}_{p-1}^T is the transpose of the column vector
\tilde{R}_{p-1}.  If Eq. (3.15a) is only a (p-1) x (p-1) matrix with
corresponding column vectors of p-1 elements, we can write

    A^{(p-1)} = R_{(p-1)(p-1)}^{-1} R_{p-1}^*        (3.19)

and, by the symmetry of the Toeplitz matrix,

    \tilde{A}^{(p-1)} = R_{(p-1)(p-1)}^{-1} \tilde{R}_{p-1}        (3.20)

Multiplying Eq. (3.17) by R_{(p-1)(p-1)}^{-1} gives

    A_{p-1}^{(p)} = R_{(p-1)(p-1)}^{-1} R_{p-1}^*
                  - a_p^{(p)} R_{(p-1)(p-1)}^{-1} \tilde{R}_{p-1}        (3.21)

Substituting Eq. (3.19) and Eq. (3.20) into Eq. (3.21) gives

    A_{p-1}^{(p)} = A^{(p-1)} - a_p^{(p)} \tilde{A}^{(p-1)}        (3.22)

Multiplying Eq. (3.22) by \tilde{R}_{p-1}^T and substituting Eq. (3.18)
into it gives

    -r_0 \, a_p^{(p)} + r_p + a_p^{(p)} \tilde{R}_{p-1}^T \tilde{A}^{(p-1)}
        = \tilde{R}_{p-1}^T A^{(p-1)}        (3.23)

Developing Eq. (3.23) and collecting terms gives

    a_p^{(p)} \left[ r_0 - \sum_{k=1}^{p-1} r_k a_k^{(p-1)} \right]
        = r_p - \sum_{k=1}^{p-1} r_{p-k} a_k^{(p-1)}        (3.24)

Eqs. (3.22) and (3.24) form the complete algorithm for computing the
linear predictive coefficients.
We start with Eq. (3.24).

For p = 1:

    a_1^{(1)} = r_1 / r_0

For p > 1, for each value of p we first use Eq. (3.24) to compute
a_p^{(p)}, then use Eq. (3.22) to compute A_{p-1}^{(p)}; p is then
incremented by 1 and the procedure is repeated until the column vector
A^{(p)} is filled.  Thus

For p = 2:

    a_2^{(2)} = (r_2 - r_1 a_1^{(1)}) / (r_0 - r_1 a_1^{(1)})
and
    a_1^{(2)} = a_1^{(1)} - a_2^{(2)} a_1^{(1)}

For p = 3:

    a_3^{(3)} = (r_3 - r_2 a_1^{(2)} - r_1 a_2^{(2)})
              / (r_0 - r_1 a_1^{(2)} - r_2 a_2^{(2)})
and
    (a_1^{(3)}, a_2^{(3)})^T = (a_1^{(2)}, a_2^{(2)})^T
                             - a_3^{(3)} (a_2^{(2)}, a_1^{(2)})^T

For p = 4:

    a_4^{(4)} = (r_4 - r_3 a_1^{(3)} - r_2 a_2^{(3)} - r_1 a_3^{(3)})
              / (r_0 - r_1 a_1^{(3)} - r_2 a_2^{(3)} - r_3 a_3^{(3)})
and
    (a_1^{(4)}, a_2^{(4)}, a_3^{(4)})^T = (a_1^{(3)}, a_2^{(3)}, a_3^{(3)})^T
                                        - a_4^{(4)} (a_3^{(3)}, a_2^{(3)}, a_1^{(3)})^T

and so on, until the A^{(p)} corresponding to the desired value of p is
obtained.
For each value of p, the partial results derived from Eqs.
(3.24) and (3.22) are stored.  These partial results are used to
compute the new partial results corresponding to p+1.

Since the matrix R_{pp} is positive definite, the expression
inside the brackets on the left side of Eq. (3.24) is always positive;
therefore, a_p^{(p)} is always finite.
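As a concrete illustration of the recursion of Eqs. (3.22) and (3.24),
the following is a minimal sketch assuming Python with NumPy; it
reproduces the p = 1, 2, ... steps worked out above, and the
autocorrelation values are illustrative.

```python
import numpy as np

def levinson_recursion(r, p):
    """Iterate Eqs. (3.24) and (3.22) on normalized autocorrelations
    r[0..p] and return the order-p coefficient vector A^(p)."""
    a = np.array([r[1] / r[0]])                  # a_1^(1), Eq. (3.24), p=1
    for m in range(2, p + 1):
        num = r[m] - np.dot(r[m - 1:0:-1], a)    # r_m - sum r_{m-k} a_k^(m-1)
        den = r[0] - np.dot(r[1:m], a)           # r_0 - sum r_k a_k^(m-1)
        a_m = num / den                          # Eq. (3.24)
        a = np.append(a - a_m * a[::-1], a_m)    # Eq. (3.22) plus new element
    return a

# Example: normalized autocorrelations r(0)..r(4) (illustrative values).
r = np.array([1.0, 0.8, 0.5, 0.2, 0.05])
A = levinson_recursion(r, 4)
```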
3.6
Stability of the filter
If the characterization of speech by linear predictor
coefficients is for the purpose of speech reproduction, then after the
predictor coefficients are computed, the question of the stability of
the resulting filter arises.

The most direct method of checking stability is to substitute
the values of the coefficients just computed into the transfer function
of the filter,

    H(z) = \frac{1}{\sum_{k=0}^{p} a_k z^{-k}}

and check the locations of its poles, which are the roots of the
denominator of H(z).  If any pole falls outside the unit circle in the
z-plane, then this all-pole causal linear filter is no longer stable.
Since the filter order p is normally greater than 8, such a check is
time consuming and costly.
If the autocorrelation functions are computed from a non-zero
signal, then according to Eq. (3.15) they are positive definite, and
the solution of Eq. (3.14), or of the normalized matrix Eq. (3.15a),
gives predictor coefficients which guarantee that all poles of H(z)
lie inside the unit circle, i.e., a stable filter.

Thus the simplest stability check is to assure that
the autocorrelation functions are computed from a non-zero signal.
This can be done by inserting an end-point detector circuit into the
system to detect the start and the end of a word.  This end-point
detector circuit will be considered in more detail in Chapter IV.

The filter can also be unstable if the number of quantization
bits for the predictor coefficients and the round-off of the
autocorrelation functions are not adequate.
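For completeness, here is a minimal sketch of the direct pole check
described above, assuming Python with NumPy; the coefficients are the
illustrative 2nd-order values used in the earlier sketch.

```python
import numpy as np

def is_stable(a):
    """Check that all poles of H(z) = 1 / (1 + a_1 z^-1 + ... + a_p z^-p)
    lie inside the unit circle (a_0 = 1 assumed)."""
    poles = np.roots(np.concatenate(([1.0], a)))   # roots of the denominator
    return bool(np.all(np.abs(poles) < 1.0))

print(is_stable(np.array([-1.5, 0.81])))   # True: poles at radius 0.9
```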
3.7
Extraction of other parameters from linear prediction
One of the advantages of the method of characterization by
linear predictive coefficients is that other parameters, such as
pitch periods, formant frequencies and voiced-unvoiced switching, can
also be computed from the coefficients derived.  The formant
frequencies are just the pole frequencies of H(z).

Unlike the pitch-asynchronous characterization technique, in
which a speech segment does not start at the beginning of each period,
the pitch-synchronous technique requires knowledge of the pitch period,
the voiced-unvoiced switching, and the root-mean-square value of each
characterized speech segment.  On resynthesis, the pitch period
information is needed to control the period of the input "impulse" for
simulating voiced speech; the voiced-unvoiced switching binary
information is needed to switch from the impulse source to the white
noise source for simulating the unvoiced speech; and finally, the
root-mean-square or intensity value is needed to give the resynthesized
speech a more human-like sound.  The linear predictive coefficients
must be computed or recomputed at the start of each pitch period, and
the computation ended at or before the start of the following pitch
period.  If these parameters are to be recomputed from the asynchronous
method, an interpolation of the asynchronous predictor coefficients can
be made to determine the synchronous predictive coefficients of a
speech segment synchronized at the beginning of each pitch period.
However, to assure stability, the interpolation is made on the
autocorrelation functions, and the predictor coefficients are computed
therefrom.
The position of individual pitch pulses can be determined by
computing the prediction error e(n) given by Eq. (3.9) and then
locating the samples for which the prediction error is relatively
large.  In practice, the location at which the prediction error is
large is also the location of the pitch pulse.  The error can also be
squared to increase the accuracy of the peak-picking circuit.
The voiced-unvoiced decision is based on the ratio of the
mean-squared value of the speech samples to the mean-squared value of
the prediction error samples.  This ratio is considerably smaller for
unvoiced speech sounds than for voiced speech sounds.  By a level
comparison, the voiced-unvoiced switching can be extracted.

Since the primary concern of this comprehensive report is to
characterize speech waveforms by linear predictive coefficients and
by the principle of stationary pitch-asynchronous analysis, the other
characterizing parameters, such as pitch periods and voiced-unvoiced
switching, are not discussed in detail here.  Furthermore, these
parameters can be derived from the predictor parameters already
obtained, although other methods may be more accurate and efficient.
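As a concrete illustration of the error-based pitch and voicing
measures just described, here is a minimal sketch assuming Python with
NumPy; the decision threshold is an illustrative value, not one given
in the thesis.

```python
import numpy as np

def prediction_error(y, a):
    """Eq. (3.9): e(n) = y(n) + sum_{k=1}^{p} a_k y(n-k)."""
    p = len(a)
    e = y.astype(float).copy()
    for k in range(1, p + 1):
        e[k:] += a[k - 1] * y[:-k]
    return e

def analyze_frame(y, a, threshold=0.1):
    """Voicing decision from the error-to-speech energy ratio (small for
    voiced speech) and pitch-pulse location from the squared error."""
    e = prediction_error(y, a)
    ratio = np.mean(e ** 2) / np.mean(y ** 2)
    voiced = ratio < threshold            # level comparison (illustrative)
    pitch_pulse = int(np.argmax(e ** 2))  # largest squared-error sample
    return voiced, pitch_pulse
```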
3.8
Determination of filter order and speech frame size
In theory, if the order p of the linear predictor is
increased, the degree of accuracy in prediction is also increased.
However, the statistical literature based on experiments has shown
that as p is increased beyond 12, there are no significant differences
between natural speech and synthetic speech.  For p as low as 2, the
synthetic speech is still intelligible although poor in quality.  The
influence of decreasing p to values less than 10 is most noticeable on
nasal consonants.  For unvoiced speech, a value of p as low as 6 is
considered adequate.
Considering the fact that for the same p and the same word
the intelligibility of the resynthesized word varies from speaker to
speaker, especially between female and male speakers, a relationship
between the length of the vocal tract and the number of predictor
coefficients has been defined [2].  The memory of the linear predictor
is by definition equal to the duration of the inverse Fourier transform
of the reciprocal of the transfer function between the lip and the
glottal volume velocities.  Since this inverse Fourier transform has
finite duration equal to 2l/c, the memory of the linear predictor is
also equal to 2l/c, where l is the length of the vocal tract and c is
the velocity of sound.  For instance, if l = 17 cm, the memory of the
predictor is about (2 x 17 cm)/(34,000 cm/sec) = 1 msec.  This memory
represents the poles of the vocal tract transfer function.  For a
sampling period of 0.1 msec, p is equal to 10.  It was also found that
the influence of the glottal flow and of radiation on the speech wave
is adequately represented by two other poles.  Thus a total of p = 12,
corresponding to a sampling frequency of 10 kHz, is required for good
quality synthetic speech.  The order p is thus a function of the
sampling frequency f_s and is roughly proportional to f_s.
The estimation of the autocorrelation functions at lag 12 or
smaller will be more accurate if the number of samples in each
characterized speech segment is much greater than 12.  Since voiced
speech can no longer be considered statistically stationary over
intervals exceeding two pitch periods, a speech segment not longer than
two pitch periods should be taken for characterization, unless the
linear predictive coefficients are used for purposes other than
resynthesis, such as speech recognition.  The pitch period in turn can
vary from 6 msec for a high-pitched speaker to 30 msec for a
low-pitched speaker.  A speech frame of the order of 15 to 20 msec can
be chosen.  A lower bound on the frame size is set by the windowing
error, as discussed below.
3.9
Choice of window function
An appropriate window function is used to pick out a speech
segment, or N consecutive speech samples, from the n samples available
for analysis.

Windowing a speech waveform is equivalent to multiplying each
speech sample in the window by a weight.

At first glance we might say that the rectangular window is
the simplest and most convenient opening for the speech wave to be seen
through.  Let us now analyze whether the rectangular window is really
advantageous.
The spectrum of an N-point rectangular window is given by

    W(\omega) = \sum_{n=0}^{N-1} e^{-j\omega n}
             = \frac{1 - e^{-j\omega N}}{1 - e^{-j\omega}}
             = \frac{\sin(\omega N / 2)}{\sin(\omega / 2)} \, e^{-j\omega (N-1)/2}

and the discrete Fourier transform of an infinite speech sequence is

    S(\omega) = \sum_{n} s(n) e^{-j\omega n}

If S_N(\omega) is the spectrum of N speech samples, then S_N(\omega) is
obtained by convolution of S(\omega) with W(\omega) in the frequency
domain,

    S_N(\omega) = \frac{1}{2\pi} \int_{-\pi}^{\pi} S(\theta) W(\omega - \theta) \, d\theta        (3.25)
The process of convolution on the right-hand side of Eq. (3.25) is
shown in Fig. 3.3(a), where for simplicity the speech signal is assumed
to have a rectangular frequency spectrum.  The magnitude of S_N(\omega)
at each frequency \omega is equal to the magnitude of S(\omega)
multiplied by the average value of W(\omega - \theta) (area divided by
period) for the interval from -\omega_c to \omega_c.

[Fig. 3.3  Convolution of signal with the rectangular window (main lobe
width 2\pi/N) in the frequency domain at (a), and the resulting
spectrum at (b).]

As the transformed window shifts to the left, its area
from -\omega_c to \omega_c changes with the side lobes, so that a
signal spectrum of the form of Fig. 3.3(b) results.
Now, if the side lobes were absent, the convolving area of
W(\omega - \theta) would be constant except at the boundaries of
discontinuities, where the sharp corners of S(\omega) are rounded off
over an interval equal to the main lobe width, and the resulting
spectrum would be unchanged except for some rounding at the sharp
edges.  If the main lobe is very narrow and the side lobes are absent,
the effect of rounding will be negligible, S_N(\omega) will be the same
as S(\omega) after convolution, and the truncation will not affect the
signal at all.
In practice, the question arises how to obtain a window
function with relatively low side lobes and a very narrow main lobe.
If N is increased, the main lobe width of the rectangular window
decreases, but at the same time the side lobes swing higher due to
their constant areas.  Thus there is a trade-off between narrowing the
main lobe and decreasing the side lobes.  Moreover, the first side lobe
of the rectangular window is only 13 db below the main peak, resulting
in oscillation of S_N(\omega) of considerable size at a discontinuity
of S(\omega).  Since speech has high transients, its spectrum can show
sharp details, and such oscillations are obviously undesirable.  The
rectangular window is therefore unsuitable for truncating speech
waveforms.

Since the main lobe of a window will smooth out the sharp
details of the speech spectrum by the process of convolution, the first
condition on a window function is that the main lobe width, which is
inversely proportional to N, must be smaller than the formant frequency
bandwidth, where N is the desired size of the speech segment.  The
second condition is that the side peaks must be much lower than the
main peak.
Two simple window functions that can satisfy the above two
conditions are the Hanning and the Hamming windows.  The side peak of
the Hanning window is 25 db below the main peak, while the side peak of
the Hamming window is 45 db below the main peak [2].  Thus the Hamming
window, which is given by

    W(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right) ,    0 \le n \le N-1        (3.26)

is seen to be the best for speech application.
CHAPTER IV
MICROPROCESSOR-BASED SYSTEM DESIGN AND CONSIDERATION
4.1
Description
As shown in Fig. 4.1, the system consists of a low-noise
microphone, a low-noise preamplifier, a voltage amplifier, an analog
filter, an analog-to-digital converter (ADC), and the minicomputer.
The start and end of a word are detected by the end-point detector,
which controls the start of conversion of the ADC and tells the
minicomputer to stop accepting data and start the arithmetic
operations.  The heart of the system is naturally the microprocessor.
Speech signals are transduced by a low-noise microphone, such
as a condenser microphone, and then preamplified in the first stage.
The purpose of this preamplifier is to boost the signal voltage from
the microphone to a level acceptable to the following stage, while
keeping the input signal-to-noise ratio unchanged or adding very little
noise to it.  An integrated circuit such as an LM381A or equivalent
can be used as the preamplifier.  Power supplied to the preamplifier is
passed through a noise filter to remove undesirable noise.
The voltage amplifier is needed to raise the maximum signal
to the full-scale input voltage of the ADC.  The design of this
amplifier poses no problems, since power is not required and the signal
bandwidth is not large.
[Fig. 4.1  Block diagram of a speech characterization system:
microphone, low-noise preamplifier, voltage amplifier, analog filter
(3.15 kHz), sampler and ADC driven by a 10.8 kHz sampling clock, and
PIA into the minicomputer (microprocessor and memory matrix); a
full-wave rectifier and comparator form the end-point detector, and a
clipping indicator monitors the ADC.]
In order to avoid perturbations caused by aliasing, an analog
low-pass filter is used after the voltage amplifier to band-limit the
speech signal to less than half the sampling frequency.  Requirements
for this filter are considered in paragraph 4.4.
The output of the filter is passed to the sampler, which
consists of a FET switch and a very low-leakage capacitor such as mica
or polystyrene.  During the track mode, the FET switch is closed and
the capacitor voltage simply follows the input analog voltage.  During
the hold mode, the FET switch is opened and the capacitor is left with
the last analog voltage, which is held for the duration of the sampling
period.  The transition from the track mode to the hold mode is
activated by a sampling clock.
The sampler is followed by the ADC, which converts the voltage
held at the sampler to digital form.  The start of conversion (SC)
of the ADC is enabled by the sampling clock, which is synchronized with
the input signal.  Synchronization is made by ANDing the sampling clock
with the logic signal from the end-point detector.  The parallel output
bits, together with the sign bit from the ADC, are sent to the
microprocessor via the peripheral interface adapter (PIA) for
arithmetic operations.  The linear predictive coefficients computed are
stored back into the memory.
A clipping indicator, which makes use of the overrange signal
from the ADC, is used to indicate that the speaker has talked too
loudly or too close to the microphone.  A light-emitting diode is good
enough for this purpose.

The system is intended for single-word characterization.
4.2
End-point detector
The end-point detector is needed for three functions: to
detect the start of a word and enable the operation of the system, to
maintain the operation of the system during the presence of a word,
and at the end of a word to order the ADC to stop conversion and the
microprocessor to start computations.

A zero-threshold full-wave rectifier and a comparator are
used for this purpose.
Referring to Fig. 4.1, the rectifier is connected after the
preamplifier.  When a word starts, the output from the rectifier
becomes greater than zero and the comparator quickly switches to logic
1.  The logic voltage from the end-point detector and the sampling
clock are ANDed.  In this way, the ADC starts to convert only when an
analog voltage is present at its input.
During the presence of a word, a rectified wave swings down
to zero and is then followed immediately by another rectified wave.  At
these zero-crossing points, the comparator does not switch to logic
0, due to the presence of noise at its input, and a zero sample, if it
coincides with the sampling clock, will not be lost.

However, when a word ends, the comparator cannot switch to
logic 0 to stop the conversion.  If the comparator threshold is made
equal to the noise level, it will enable the stopping of conversion,
but a zero sample will be lost.
The above problem can be solved by inserting a capacitor
between the end-point detector's rectifier and the comparator.
Referring to Fig. 4.2, when the rectified wave begins to swing down,
the capacitor will start to discharge.

[Fig. 4.2  Rectified speech signal and noise after the capacitor,
showing the threshold level.]
For the same time constant, the capacitor will reach the
threshold level faster when the wave is lower.  Therefore, the
calculation should be based on a wave as low as possible above the
noise, which fortunately in the case of speech occurs at the ending
portion of a word, where the waves gradually die out.  At the other
places, where the rectified waves are higher, the capacitor will be
charged up again before it can reach the threshold; therefore a zero
sample is not lost.

If the zero samples are to be ignored after the end of a word,
the total discharging time t_1 of the capacitor must satisfy

    \frac{1}{4f} < t_1 < \frac{T_s}{2} + \frac{1}{4f}

where

    f   = lowest frequency component of speech = 100 Hz
    T_s = sampling period
The value of the capacitor is then calculated from

    t_1 = -RC \ln\left(\frac{V_T}{V_p}\right)        (4.1)

where

    V_T = threshold voltage = noise voltage
    V_p = peak voltage of the lowest rectified wave
    R   = input impedance of the comparator

Let t_1 = \frac{T_s}{2} + \frac{1}{4f}; then, from Eq. (4.1),

    C = \frac{-t_1}{R \ln(V_T / V_p)}
The comparator threshold voltage is thus made equal to the
noise voltage at its input, which originates from the rectifier, the
preamplifier, the microphone, the power supply, and also from ambient
noise.  The calculation or measurement of these noise sources is
complicated and often not very precise.  Furthermore, these noise
sources are subject to change with time, temperature and room
conditions.

A practical method consists of manually adjusting the
threshold voltage while observing the output of the comparator until it
is at logic 0 when a word is not present, provided that the output
offset voltage from the rectifier has been adjusted to zero.
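As a concrete worked example of Eq. (4.1), the following is a minimal
sketch assuming Python; the threshold voltage, peak voltage and
comparator input impedance are illustrative values, not taken from the
thesis.

```python
import math

def discharge_capacitor(t1, R, VT, Vp):
    """Eq. (4.1) solved for C: the RC discharge V(t) = Vp exp(-t/RC)
    reaches the threshold VT at t1 = -RC ln(VT/Vp)."""
    return -t1 / (R * math.log(VT / Vp))

# Illustrative values: f = 100 Hz, Ts = 1/10800 sec, a 10 mV noise
# threshold, a 1 V peak, and a 1 Mohm comparator input impedance.
f, Ts = 100.0, 1.0 / 10800.0
t1 = Ts / 2 + 1.0 / (4 * f)              # upper bound from the text
C = discharge_capacitor(t1, R=1e6, VT=0.01, Vp=1.0)
print(f"C = {C * 1e9:.2f} nF")           # about 0.55 nF for these values
```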
4.3
Determination of the number of quantization bits
The rounding of speech samples during quantization may
cause instability in the linear predictor, or increase the error of
estimation of its coefficients, if the quantization level is too large.
The number of quantization bits can be determined by relating it to the
speech signal and the quantization level as follows.

The quantized samples can be expressed as

    Q[S(n)] = S(n) + q(n)

where S(n) is the exact sample and q(n) is called the quantization
error.  Since rounding is assumed,

    -\frac{\Delta}{2} < q(n) \le \frac{\Delta}{2}

where \Delta is the quantization width.  If b is the number of
quantization bits and the full-scale range is normalized to unity,

    \Delta = 2^{-b}
The probability distribution of the quantization error is
statistically considered uniform between -\Delta/2 and \Delta/2.  Thus
the mean and variance of the quantization error are

    \eta_q = 0
    \sigma_q^2 = \frac{2^{-2b}}{12}

In digital processing of sampled analog signals, the quantization
error is commonly viewed as an additive noise signal.  Therefore, the
signal-to-noise ratio resulting from the quantization is

    \frac{\sigma_s^2}{\sigma_q^2} = \frac{\sigma_s^2}{2^{-2b}/12}

or

    SNR_q (db) = 6.02b + 10.8 + 10 \log(\sigma_s^2)        (4.2)
When the signal level exceeds the maximum range of the ADC, the
resulting clipping distortion decreases the variance of the signal and
the quantization noise error.  The sample autocorrelation functions
computed from these speech samples depart farther from their true
values, and the estimated linear predictive coefficients do not
represent the true speech.  This can also be explained in the frequency
domain as follows: due to the clipping, the clipped wave approximates
a rectangular pulse, adding higher frequency components to the
frequency content of the characterized speech signal.  Since the
clipping happens after the low-pass filter, the effect of band-limiting
is nullified and aliasing occurs.  As the clipping increases, more of
the speech spectrum is modified by the effect of aliasing, and the
resynthesized speech suffers quality degradation.

In order to avoid the above effects caused by clipping
distortion, the input samples should be multiplied by a constant K,
where 0 < K < 1.  That is, the sample KS(n) is quantized instead of
S(n).  Since var KS(n) is equal to K^2 \sigma_s^2, Eq. (4.2) becomes

    SNR_q (db) = 6.02b + 10.8 + 10 \log(\sigma_s^2) + 10 \log(K^2)        (4.3)
For a speech signal, the probability that the magnitude of a given
sample will exceed three to four times the root-mean-square value of
the signal is very low. If K is chosen to be equal to 1/(3σ_s), then
with high probability no distortion will occur. In this case

    SNR_q (db) = 6.02b + 10.8 + 10 log(σ_s²) + 10 log(1/(9σ_s²))

               = 6.02b + 1.25                                 (4.4)
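As a check on Eq. (4.4), the sketch below scales a test signal by
K = 1/(3σ_s) before rounding; an ideal (unclamped) rounding quantizer
is assumed, so no clipping is modeled, and σ_s is arbitrary since the
scaling normalizes it out.

    import numpy as np

    b = 12
    delta = 2.0 ** (-b)
    rng = np.random.default_rng(1)

    for sigma_s in (0.05, 0.5, 5.0):          # arbitrary signal levels
        s = sigma_s * rng.standard_normal(100_000)
        K = 1.0 / (3.0 * s.std())             # the scaling K = 1/(3σ_s)
        q = np.round(K * s / delta) * delta   # quantize KS(n) instead of S(n)
        snr = 10 * np.log10(np.var(K * s) / np.var(K * s - q))
        print(f"σ_s = {sigma_s}: SNR_q = {snr:.2f} db"
              f" (Eq. (4.4): {6.02 * b + 1.25:.2f} db)")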
Thus, if a speech signal exceeds a certain threshold level, its
variance increases, K decreases, the whole signal is reduced in
amplitude, and clipping distortion will not occur.

Eq. (4.2) or Eq. (4.4) can be used to determine the number of
quantization bits required for a desired SNR_q and speech
root-mean-square value.
The SNR_q of the ADC can be shown to be related to the input SNR_in
and to the attenuation d on the SNR_in at its output as follows:

    SNR_q (db) = 10 log(σ_s²/σ_q²)                            (4.5)

    d (db) = 10 log(σ_s²/σ_i²) - 10 log(σ_s²/σ_o²)
           = 10 log(σ_o²/σ_i²)
           = 10 log[(σ_i² + σ_q²)/σ_i²]                       (4.6)

where σ_s², σ_q², σ_i², σ_o² are, respectively, the variances of the
signal, the ADC noise, the input noise, and the output noise, and
d is the attenuation on the SNR_in caused by the ADC noise.

From Eq. (4.6):

    σ_q² = σ_i²(10^(d/10) - 1)                                (4.7)

Replacing (4.7) into (4.5) gives:

    SNR_q (db) = 10 log(σ_s²/σ_i²) - 10 log(10^(d/10) - 1)

or

    SNR_q (db) = SNR_in - 10 log(10^(d/10) - 1)               (4.8)
If K in Eq. (4.3) is to be corrected by manual means, then replacing
Eq. (4.8) into Eq. (4.2) and rearranging gives:

    b = (1/6.02)[SNR_in - 10 log(10^(d/10) - 1) - 10.8 - 10 log(σ_s²)]   (4.9)

For the system performance, the attenuation d on the SNR_in should be
kept as low as possible and should not exceed 3 db. Let
SNR_in = 70 db, then:

- For d = 3 db, σ_s = 1:

    b ≥ (1/6.02)[70 - 10 log(10^0.3 - 1) - 10.8] = 10 bits

- For d = 1 db, σ_s = 1:

    b ≥ (1/6.02)[70 - 10 log(10^0.1 - 1) - 10.8] = 11 bits
      (12 bits standard)
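A small helper, sketched below, evaluates Eq. (4.9) for the two cases
above; the function name and arguments are mine, and the result should
still be rounded up to the next whole bit.

    import math

    def bits_required(snr_in_db, d_db, sigma_s=1.0):
        # Eq. (4.9): bits needed for a given input SNR and an allowed
        # attenuation d of that SNR caused by the ADC noise.
        return (snr_in_db
                - 10 * math.log10(10 ** (d_db / 10.0) - 1.0)
                - 10.8
                - 10 * math.log10(sigma_s ** 2)) / 6.02

    print(f"d = 3 db: b = {bits_required(70, 3):.1f}")   # about 10 bits
    print(f"d = 1 db: b = {bits_required(70, 1):.1f}")   # about 11 bits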
Since the quantized samples from the ADC are used in the estimation
processes, an ADC of 12 bits or more is desired. This corresponds to
an ADC SNR_q of

    SNR_q = 6.02(12) + 10.8 = 83 db

4.4  Speech band-limitation and low-pass filter
Before speech can be digitized, its bandwidth must be limited so that,
ideally, all frequency components at or above half the sampling
frequency are removed in order to avoid aliasing. This is the task of
the analog filter.

In practice, no filter produces infinite attenuation in the stop-band.
It is therefore necessary to impose a specification on the filter
performance. The specification is based on the speech signal
characteristics and on the number of quantization bits desired.
If the signals at frequencies at or above half the sampling frequency
are not to be converted by the ADC, the output of the filter at these
frequencies must be less than half the quantization level, or

    Attenuation (db) = -20 log(2·2^b) = -20 log 2^(b+1)

From paragraph 4.3, a 12 bit ADC corresponds to an attenuation at or
above half the sampling frequency of

    -20 log 2^13 = -78 db
One of the features of speech signals is the formant frequencies. A
voiced speech signal can have five formant frequencies or more, but
only the first three formants contribute significantly to the quality
of speech. In addition, the third formant is not higher than
3 kHz [5] [8] [9].
Therefore, a low-pass filter having a cut-off frequency of
3 kHz and an attenuation of 78db at half the sampling frequency is
acceptable for the experiment.
The Ithaco filter set model 4302 can be used. This filter set is a
4 pole Butterworth, and two sections can be used in cascade to obtain
a 48 db/octave low-pass filter. For two filters in cascade with the
cut-off frequency set at 3.15 kHz, the set shows an attenuation of
78 db at 10 kHz. This corresponds to a sampling frequency of at least
20 kHz. Such a high sampling frequency will increase cost and
processing time.
By considering the fact that above about 5 kHz the magnitude of the
speech frequency components characteristically drops 40 db or more
compared to the highest formant frequency [5] [8] [9], the filter only
needs to attenuate 38 db more at some frequency around 5 kHz. On the
normalized response curve of the filter, an attenuation of 38 db
corresponds to a frequency ratio of

    f/f_c = 1.71

or

    f = 1.71 x 3.15 = 5.39 kHz

Thus the sampling frequency can now be reduced to 10.8 kHz.
Since the amplitude response of the filter is maximally flat,
a -6db bandwidth of 3.15 kHz can be defined here instead of -3db bandwidth.
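As a plausibility check (not the Ithaco published response curve), the
magnitude of an ideal 8 pole Butterworth low-pass, i.e. two cascaded
4 pole sections, can be evaluated at the frequency ratios used above:

    import math

    def butterworth_atten_db(f_ratio, poles=8):
        # Attenuation of an ideal Butterworth low-pass of the given
        # order at the normalized frequency f/fc = f_ratio.
        return 10 * math.log10(1.0 + f_ratio ** (2 * poles))

    print(f"f/fc = 1.71    : {butterworth_atten_db(1.71):.1f} db")
    print(f"f/fc = 10/3.15 : {butterworth_atten_db(10 / 3.15):.1f} db")

The ideal curve gives about 37 db at f/f_c = 1.71 and about 80 db at
10 kHz, in reasonable agreement with the 38 db and 78 db figures read
from the set's response curve.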
4.5  Interfacing microprocessor

Since only arithmetic operations are involved in the processing, any
microprocessor can be used. The 6800 is used here for its simplicity
in construction and programming, although the length of the processing
time can be questionable.
The microprocessor communicates with external devices through
the PIA 6820.
Both ports A and B are configured as inputs.
Only port
A needs to handshake with the ADC.
[Block diagram: the ADC's end-of-conversion (EOC) line connects to CA1
and its output-enable (OE) line to CA2 of the PIA; the 12 bit data
outputs go to PA0-PA7 and PB0-PB3, the sign output to PB7; the
end-point detector output goes to CB1 of the PIA, which connects to
the microprocessor.]

Fig. 4.3  Interfacing of the microprocessor with the ADC and the
end-point detector circuit.
The 12 bit parallel outputs from the ADC are connected to PA0-PA7 and
to PB0-PB3. The sign output is connected to PB7.

After each conversion, the ADC sends an end-of-conversion (EOC) signal
to CA1 to interrupt-request (IRQ) the microprocessor; the
microprocessor then pulls up the output-enable (OE) line of the ADC
through CA2 to read the data, and pulls it down again to enable
another data output from the ADC. The end-point detector circuit
signal is sent to CB1 to interrupt-request the microprocessor on the
falling edge. However, this IRQ does not announce new data; it only
tells the microprocessor to stop reading data and start computation.

When receiving an interrupt request, the microprocessor first checks
port B to see if the IRQ comes from there. If that is the case, it
stops reading data and starts computation. Otherwise the IRQ must come
from port A, and it automatically loads and stores PB0-PB3 and PB7
into memory as the high-order data byte with sign bit, and then loads
and stores PA0-PA7 at the next location in memory as the low-order
data byte.
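Purely as a behavioral sketch of the dispatch logic just described
(not 6800 code), with hypothetical port-reading helpers standing in
for the PIA registers:

    def service_irq(pia, memory):
        # `pia` is a hypothetical object exposing the 6820 status and
        # data registers; `memory` is a list standing in for RAM.
        if pia.irq_from_port_b():
            # End-point detector on CB1: stop reading, start computation.
            return "compute"
        # Otherwise the ADC raised the IRQ: store the high-order byte
        # (PB0-PB3 plus the sign on PB7), then the low byte (PA0-PA7).
        b_val = pia.read_port_b()
        memory.append((b_val & 0x0F) | (b_val & 0x80))
        memory.append(pia.read_port_a())
        return "read"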
4.6  Programming

The microprocessor is needed to execute the following programs:

1/ Load the quantized weighting function
2/ Read data
3/ Multiply data with weighting function
4/ Compute autocorrelation functions
5/ Compute linear predictive coefficients
The weighting function is first calculated for each value of N
sampling points. The calculated values are then converted to binary
numbers and loaded into the microprocessor. Since the weighting values
are fixed for each value of N, it is assumed that they have been
prestored at specific locations in the memory. The program starts by
reading in the data.
Input data

Fig. 4.4 shows a detailed flowchart for reading data from the ADC.
The procedure was already described in paragraph 4.5.
Window data

When the microprocessor acknowledges an IRQ signal from the end-point
detector, it immediately stops reading, stores the total length n of
the input speech read, then starts to truncate the n data samples into
segments of N samples. The truncation is done by multiplying the kth
point of the Hamming window by the corresponding (mN + k)th point of
the segment, where m,k = 0,1,2,...

The data samples start from location 0 and the window weights start
from location K. The weighted data are stored in consecutive locations
starting from location 0. Since the speech sequence is longer than the
window function, the technique of a variable offset is used to shift
the window to the next speech segment. The detailed flowchart in
Fig. 4.5 shows the above computation program.
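In a high-level language the same windowing step, weight k multiplying
sample mN + k, might look like the sketch below; the frame size and
test data are illustrative assumptions.

    import numpy as np

    def window_frames(speech, N):
        # Split the speech into consecutive frames of N samples and
        # multiply each by the precomputed (prestored) Hamming weights.
        w = np.hamming(N)                   # window weights, fixed per N
        n = len(speech) - len(speech) % N   # drop an incomplete last frame
        frames = speech[:n].reshape(-1, N)  # frame m holds samples mN..mN+N-1
        return frames * w                   # weight k times sample mN+k

    # Example: 0.5 s of assumed 10.8 kHz input split into 256-sample frames.
    rng = np.random.default_rng(2)
    y = window_frames(rng.standard_normal(5400), 256)
    print(y.shape)   # (21, 256)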
[Flowchart: initialize the PIA with ports A and B as inputs, CA2 in
handshake mode, CA1 and CB1 as IRQ inputs; then read and store samples,
indexed with offset 0, until the end-point IRQ arrives; n = total
length of input speech.]

Fig. 4.4  Flowchart for reading data from the A/D converter.

[Flowchart: the window weights are indexed with offset K and the data
with a variable offset VO; each weight is multiplied with the data by
a fast multibyte multiplication program, after which control passes to
the program for computing the autocorrelation functions; N = length of
the Hamming window, or size of each speech frame.]

Fig. 4.5  Flowchart for windowing data.
Compute normalized autocorrelation functions

From the method described in paragraph 3.4:

    R(1) = y_0 y_1 + y_1 y_2 + ... + y_{N-3} y_{N-2} + y_{N-2} y_{N-1}

    R(2) = y_0 y_2 + y_1 y_3 + ... + y_{N-3} y_{N-1}

First, the partial products (P.PRODUCT) y_{N-k} y_{N-k-i} at each
column are iteratively computed and stored in consecutive locations.
If p is the order of the linear predictor, the number of partial
products is reduced by 1 at the (N-p)th column, by 2 at the (N-p-1)th
column, and so on.

Next, the partial products belonging to each R(i) are picked out and
added; the partial results are the partial sums (P.SUM), and the final
sum is R(i). R(i) need not be divided by N as indicated by Eq. (3.15),
since both sides of Eq. (3.12) are divided by N.

Finally, the R(i) just computed are divided by R(0) for normalization.
The detailed flowcharts in Fig. 4.6 show the above computation program.
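For comparison with the flowcharts, a direct high-level rendering of
the same computation is sketched below; the frame contents are
illustrative, and the division by N is omitted for the reason just
given.

    import numpy as np

    def normalized_autocorr(y, p):
        # r(i) = R(i)/R(0) for i = 0..p, with R(i) = sum_k y_k y_{k+i};
        # y is one windowed frame of N samples, p the predictor order.
        N = len(y)
        R = np.array([np.dot(y[:N - i], y[i:]) for i in range(p + 1)])
        return R / R[0]

    rng = np.random.default_rng(3)
    r = normalized_autocorr(rng.standard_normal(256), 10)
    print(r[:4])   # r(0) = 1 by construction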
[Flowchart; p = order of the linear filter; index loops with offset 0
run over the data samples.]

Fig. 4.6a  Flowchart for computing autocorrelation functions, 1st
part: multiply data sample i with data sample j to form the partial
products, where i,j = 0,1,2,...

[Flowchart; accumulator B is loaded with (N-p-1)p + Σ(i, i=1..p), the
address of the last partial product of each R(i); M1 = address of each
partial product of each R(i); M2 = address of the 1st partial product
of each R(i); the R(i) loop accumulates the partial sums P.SUM,
indexed with offset k.]

Fig. 4.6b  Flowchart for computing autocorrelation functions, 2nd
part: sum up the partial products, then divide by R(0) to obtain the
normalized r(i).

[Flowchart: subtract (N-p-1)p + Σ(i, i=1..p) to step between the R(i);
divide each R(i) by R(0) and store r(i), i = 0,1,2,..., then continue
with the program for computing the linear predictive coefficients.]

Fig. 4.6c  Flowchart for computing autocorrelation functions, 3rd part.
Compute linear predictive coefficients a_k

Following the computation algorithm developed in paragraph 3.5, a
flowchart for computing the linear predictive coefficients a_k can now
be constructed as shown in Fig. 4.7.
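The steps of Fig. 4.7 (the sums S_1 and S_2, the differences D_1 and
D_2, their quotient, and the update of A(p)) follow the standard
Levinson-Durbin recursion, a compact high-level rendering of which is
sketched below; the sample autocorrelation values are illustrative.

    import numpy as np

    def lpc_coefficients(r, p):
        # Levinson-Durbin recursion: solve for a_1..a_p from the
        # normalized autocorrelations r(0)..r(p).
        a = np.zeros(p + 1)
        E = r[0]                  # prediction error, the role of D_1
        for i in range(1, p + 1):
            # Numerator r_i - sum_k a_k r_{i-k} plays the role of D_2.
            k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / E
            a_prev = a.copy()
            a[i] = k
            a[1:i] = a_prev[1:i] - k * a_prev[i - 1:0:-1]   # A(p) from A(p-1)
            E *= 1.0 - k * k
        return a[1:]

    # Example with an assumed normalized autocorrelation sequence.
    r = np.array([1.0, 0.8, 0.5, 0.2])
    print(lpc_coefficients(r, 3))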
[Flowchart: set p = 1; a division program computes and stores a_1;
then p = p + 1; multiplication and addition programs compute and store
S_1 = Σ r_k a_k^(p-1) and S_2 = Σ r_{p-k} a_k^(p-1); a subtraction
program forms D_1 = r_0 - S_1 and D_2 = r_p - S_2; a division program
forms a_p^(p) from D_1 and D_2; a matrix computation program computes
A(p); the loop repeats until p reaches the desired order.]

Fig. 4.7  Flowchart for computing linear predictive coefficients.
REFERENCES

BOOKS

1. C. Ray Wylie, Advanced Engineering Mathematics, McGraw-Hill,
   New York, 4th edition.

2. A.V. Oppenheim; R.W. Schafer, Digital Signal Processing, Prentice
   Hall, New Jersey, 1975.

3. M. Schwartz; L. Shaw, Signal Processing, McGraw-Hill, New York,
   1975.

4. J.L. Flanagan, Speech Analysis, Synthesis and Perception, Springer
   Verlag, New York.

5. Harvey Fletcher, Speech and Hearing in Communication, D. Van
   Nostrand, 1954.

6. Elma Poe, Using the 6800 Microprocessor, Howard W. Sams & Co.,
   1st edition.

7. L.R. Rabiner; B. Gold, Theory and Application of Digital Signal
   Processing, Prentice Hall, New Jersey, 1974.

PERIODICALS

8. J. Makhoul, "Linear Prediction: A Tutorial Review," Proc. IEEE,
   vol 63 n 4, April 75, p561-580.

9. B.S. Atal; S.L. Hanauer, "Speech analysis and synthesis by linear
   prediction of the speech wave," J. Acoust. Soc. Am., vol 50, 1971,
   p655-687.

10. S. Chandra; W.C. Lin, "Experimental comparison between stationary
    and non-stationary formulation of linear prediction applied to
    voiced speech analysis," IEEE Trans. Acoust. Speech Signal
    Process., vol ASSP-22 n 6, Dec 74, p403-415.

11. M.R. Sambur, "Speaker recognition using orthogonal linear
    prediction," IEEE Trans. Acoust. Speech Signal Process.,
    vol ASSP-24, Aug 76, p283-289.

12. H.J. Manley, "Analysis-synthesis of speech in terms of
    orthogonalized exponentially damped sinusoids," J. Acoust. Soc.
    Am., vol 35 n 4, April 63, p464-474.

13. Miki Nobuhiro, "Speech feature extraction by a modulated Fourier
    function," Electron. Commun. Jap., vol 57 n 1, Jan 74, p56-63.

14. J.L. Flanagan, "Pitch discrimination for synthetic vowels,"
    J. Acoust. Soc. Am., vol 30, 1958, p435-442.

15. S. Furui, "Talker recognition by statistical features of speech
    sounds," Electron. Commun. Jap., vol 56 n 11, Nov 73, p62-71.

16. S. Furui; F. Itakura, "Talker recognition by long time averaged
    speech spectrum," Electron. Commun. Jap., vol 55A, 1972, p54-61.

17. G.M. White; R.B. Neely, "Speech recognition experiments with
    linear prediction, bandpass filtering, and dynamic programming,"
    IEEE Trans. Acoust. Speech Signal Process., vol ASSP-24 n 2,
    April 76, p183-188.

18. C.R. Partisaul, "Adaptive time-frequency resolution in vocal
    tract parameter coding for speech analysis and synthesis,"
    Ph.D. Thesis, Georgia Tech., June 74.