CSCI 5582 Artificial Intelligence

Phonetics
Speech Tech, week 2
Signal as a sequence of 'sounds'
• It's convenient to assume that signals can be broken up into smaller units of
speech, a finite set of discrete consonant and vowel sounds.
• But this is an idealisation. For example, when you say the word /spun/ ('spoon'), at
least the following things are happening:
 tongue tip: starts in alveolar fricative position, then moves to a neutral position,
then raises again to alveolar closure.
 tongue body: starts in a high position and drops to a relatively neutral position
 lips: move into closure, then rounded, then less rounded.
 velum: starts raised (oral), but then lowers (nasal).
 glottis: starts open, then narrows until voicing starts.
What these movements really look like:
• Movements are relatively smooth, and overlap with movements of other articulators.
• They are not abrupt movements, and the positions of articulators in a given sound are not independent of preceding or following sounds.
But we pretend
• That speech consists of discrete units,
assembled like 'beads on a string'.
• This idealisation is the basis of the IPA
(and all alphabetic writing systems)
 As a practical matter, it's impossible to have a unique symbol
for every slightly different physical realization of a given
consonant or vowel (e.g. a 70 msec [t] vs. a 71 msec [t]).
• Allows for a compact record of speech, sufficiently
accurate for most practical purposes.
• But the inaccuracy of this simplifying assumption will
come back to haunt us repeatedly.
Figure 7.1
Figure 7.1 continued
Figure 7.2
Organs of the vocal tract
Consonant place of articulation
Consonant manner of articulation
• stops (plosives)
 closure+release
• fricatives
 sibilant and non-sibilant
 affricate = stop+fricative
• taps
• approximants
Phonation
• voiced
• voiceless
• creaky
• breathy
Vowels
• tongue body height
• tongue body advancement
• lip rounding
• diphthongization
Figure 7.5
Figure 7.6
Syllables
Stress
• Marked in Arpabet by placing a
number after each vowel
 0 = no stress
 1 = primary stress
 2 = secondary stress
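The stress digits can be read straight off an Arpabet transcription. The following is a minimal sketch: the function names are hypothetical, and the transcription of "tomato" is an illustrative example that does not come from these slides.

```python
def stress_pattern(pronunciation):
    """Return the stress digits (0, 1, 2) of the vowels in an Arpabet string."""
    pattern = []
    for phone in pronunciation.split():
        # Vowels carry a trailing stress digit; consonants carry none.
        if phone[-1] in "012":
            pattern.append(int(phone[-1]))
    return pattern


def primary_stressed_vowel(pronunciation):
    """Return the vowel marked with primary stress (digit 1), or None."""
    for phone in pronunciation.split():
        if phone.endswith("1"):
            return phone[:-1]
    return None


# Illustrative transcription of "tomato" (assumed example, not from the slides).
print(stress_pattern("T AH0 M EY1 T OW2"))          # [0, 1, 2]
print(primary_stressed_vowel("T AH0 M EY1 T OW2"))  # EY
```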
Variation: phonemes and allophones
Allophones of English /t/
Figure 7.10
Distinctive features
Acoustic phonetics
• Sound propagates as
rarefaction/compression waves
through a medium (typically air).
Simple sine wave
1 cycle
amplitude
Frequency = number of cycles per unit time
(this wave has 5 cycles in 0.5 s = 10 cycles/s = 10 Hz)
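The arithmetic for the sine wave on this slide can be checked in a few lines. This is a small sketch with hypothetical helper names; it only restates frequency = cycles / time and evaluates a unit-amplitude sine wave at a chosen time.

```python
import math


def frequency_hz(cycles, duration_s):
    """Frequency is the number of cycles per unit time."""
    return cycles / duration_s


def period_s(freq_hz):
    """Duration of one cycle, the reciprocal of the frequency."""
    return 1.0 / freq_hz


def sine_value(freq_hz, t_s, amplitude=1.0):
    """Value of a sine wave of the given frequency at time t_s (seconds)."""
    return amplitude * math.sin(2 * math.pi * freq_hz * t_s)


f = frequency_hz(5, 0.5)
print(f)                      # 10.0 (Hz)
print(period_s(f))            # 0.1 (seconds per cycle)
# A quarter of the way through the first cycle the wave is at its peak.
print(sine_value(f, period_s(f) / 4))
```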
Complex waveform
The vowel [iy]
Can be analyzed as the sum of a set of component sine waves,
each with its own frequency and amplitude
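To make the decomposition concrete, the sketch below builds a complex waveform from three sine waves and then recovers their frequencies and amplitudes with a plain discrete Fourier transform. The component values are made up for illustration and are not a measured [iy] spectrum; the function names are hypothetical.

```python
import cmath
import math


def synthesize(components, sample_rate, n_samples):
    """Sum sine waves given as (frequency_hz, amplitude) pairs."""
    return [
        sum(a * math.sin(2 * math.pi * f * i / sample_rate) for f, a in components)
        for i in range(n_samples)
    ]


def strong_components(samples, sample_rate, threshold=0.01):
    """Return (frequency_hz, amplitude) for every DFT bin above the threshold.

    Bin 0 (the constant offset) is skipped; the naive O(n^2) DFT is fine for
    a short illustrative signal.
    """
    n = len(samples)
    found = []
    for k in range(1, n // 2):
        s = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples))
        amplitude = 2 * abs(s) / n
        if amplitude > threshold:
            found.append((k * sample_rate / n, round(amplitude, 3)))
    return found


# Made-up components: a 100 Hz fundamental plus two weaker harmonics.
components = [(100.0, 1.0), (200.0, 0.5), (300.0, 0.25)]
wave = synthesize(components, sample_rate=1000, n_samples=100)
print(strong_components(wave, sample_rate=1000))
```

The signal length is chosen so that each component completes a whole number of cycles, which keeps each one in a single DFT bin.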
Digitizing an acoustic signal
• Sampling (e.g. 44,100 samples per second)
• Quantization (as integers within
some range)
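The two steps can be sketched as follows. The 16-bit integer range and the 441 Hz test tone are assumptions made for the example; the slide only says "integers within some range".

```python
import math


def sample_signal(signal, duration_s, sample_rate):
    """Sampling: evaluate a continuous signal at evenly spaced instants."""
    n = int(round(duration_s * sample_rate))
    return [signal(i / sample_rate) for i in range(n)]


def quantize(samples, bits=16):
    """Quantization: map values in [-1.0, 1.0] to signed integers.

    The 16-bit default is an assumption; values outside the range are clipped.
    """
    max_int = 2 ** (bits - 1) - 1
    quantized = []
    for x in samples:
        x = max(-1.0, min(1.0, x))
        quantized.append(int(round(x * max_int)))
    return quantized


def tone(t):
    """A 441 Hz sine wave (hypothetical test signal)."""
    return math.sin(2 * math.pi * 441 * t)


samples = sample_signal(tone, duration_s=0.01, sample_rate=44100)
ints = quantize(samples)
print(len(ints), min(ints), max(ints))
```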
Pitch track
F0 = fundamental frequency
Loudness
• Amplitude
• Power (amplitude squared)
• Intensity (in deciBels)
$$\text{Intensity} = 10 \log_{10} \left( \frac{1}{N P_0} \sum_{i=1}^{N} x_i^2 \right)$$
where x is amplitude, N is the number of samples, and
P0 is the auditory threshold of human hearing
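The formula translates directly into code. In this sketch the reference value P0 is left as a required argument, because the slide does not give a number for it; the value 1.0 used in the usage lines is a placeholder chosen only to make the arithmetic easy to follow.

```python
import math


def power(samples):
    """Mean of the squared amplitudes: (1/N) * sum of x_i^2."""
    return sum(x * x for x in samples) / len(samples)


def intensity_db(samples, p0):
    """Intensity in dB: 10 * log10(power / p0), with p0 the reference level."""
    return 10 * math.log10(power(samples) / p0)


# Placeholder reference level (an assumption, not the real auditory threshold).
P0 = 1.0
quiet = [1.0, -1.0, 1.0, -1.0]
loud = [10.0, -10.0, 10.0, -10.0]
print(intensity_db(quiet, P0))  # power equals P0, so 0 dB
print(intensity_db(loud, P0))   # 10x the amplitude is 100x the power: +20 dB
```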
Intensity
calculated as RMS dB
Waveform of a sentence
Periodic vs. aperiodic waveform
From frequency to wavelength
wavelength = speed of sound / frequency
(for speech purposes, typically measured in
centimeters)
speed of sound = 340 m/s = 34000 cm/s
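A short worked example of the conversion, using the speed of sound given on the slide. The function names and the example frequencies are chosen for illustration.

```python
SPEED_OF_SOUND_CM_PER_S = 34000.0  # 340 m/s, as given above


def wavelength_cm(freq_hz, speed=SPEED_OF_SOUND_CM_PER_S):
    """wavelength = speed of sound / frequency, in centimeters."""
    return speed / freq_hz


def frequency_from_wavelength_hz(wavelength, speed=SPEED_OF_SOUND_CM_PER_S):
    """The same relation solved for frequency (wavelength in centimeters)."""
    return speed / wavelength


print(wavelength_cm(100.0))    # 340.0 cm
print(wavelength_cm(1000.0))   # 34.0 cm
print(frequency_from_wavelength_hz(68.0))  # 500.0 Hz
```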
[Figure: one cycle of a sine wave, with the wavelength marked off at 1/4, 2/4, 3/4 and 4/4 (end of cycle); the point of maximum amplitude within the cycle is labelled.]
So the peak (point of maximum amplitude) occurs at ¼ of the wavelength.
If the sound is emitted from a uniform tube whose length is exactly
¼ of the wavelength, the sound will emerge from the tube at
its loudest.
[Figure: uniform tube with the sound source at one end and the mouth of the tube at the other.]
If the sound has a shorter or longer wavelength relative to the tube, the sound coming
out of the tube is somewhat weaker (when the tube is precisely 1/2 the wavelength, the sound is
completely silenced).
Resonance
• In speech, the soundwave from the glottal source
is not a simple sine wave, but a complex
waveform, with multiple component frequencies
(i.e. the fundamental and all the harmonics)
• Imagine the supralaryngeal vocal tract as a
uniform tube of length L. The component
frequencies that emerge from that tube at
maximum strength are those for which the tube is
¼, ¾, 5/4, etc. of the wavelength, i.e. those with
wavelengths 4L, 4L/3, 4L/5, etc. These are the
resonant wavelengths.
Resonant frequencies
• If frequency F corresponds to wavelength 4L, the
other resonant frequencies are 3F, 5F, 7F, etc.
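Under the standard quarter-wavelength model of a uniform tube that is closed at the source and open at the mouth, the resonances fall at odd multiples of the lowest one: F_n = (2n - 1) × speed of sound / (4 × L). The sketch below computes the first few. The 17 cm tube length is an assumed round figure for illustration and is not taken from the slides.

```python
SPEED_OF_SOUND_CM_PER_S = 34000.0  # 340 m/s


def tube_resonances_hz(length_cm, count=3, speed=SPEED_OF_SOUND_CM_PER_S):
    """Resonant frequencies of a uniform tube closed at one end.

    The tube is 1/4, 3/4, 5/4, ... of the resonant wavelengths, so the
    resonances are F, 3F, 5F, ... with F = speed / (4 * length).
    """
    return [(2 * n - 1) * speed / (4 * length_cm) for n in range(1, count + 1)]


# Assumed tube length of 17 cm (illustrative only).
print(tube_resonances_hz(17.0))  # [500.0, 1500.0, 2500.0]
```

A shorter tube raises every resonance by the same factor, since each one is proportional to 1/L.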
• Non-resonant frequency components of the source
sound will be damped to a greater or lesser degree
depending on how far they are from the resonant
frequencies.
Where formants come from
• Formants are the regions at and
around the resonant frequencies.
• In these regions, the harmonics
emerge at maximum strength, in all
other regions they are damped.
3 tube model
• The vocal tract is not a uniform tube. But the
effects of tongue body position on the formants
can be modelled (to a close approximation) as a
sequence of three tubes
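[Figure: three-tube model of the vocal tract, running from the laryngeal source through tube 1, tube 2 and tube 3 to the lips, with the tongue body marked.]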
• Depending on the size of each tube (determined by
placement of the tongue body), particular formant
frequencies emerge.
Formants
• Only the first 3 or 4 formants are relevant
for linguistic phonetics.
 (Higher formants are useful for identifying speaker
voices, but not for identifying what's being said.)
• F1 (first formant) frequency is inversely
correlated with tongue body height: high F1
= low vowel.
• F2 (second) freq. is directly correlated with
tongue body advancement: high F2 = front
vowel.
Source-filter theory
• The sound emerging from the vocal tract (in voiced
sounds anyway) can be thought of as the product of
2 things
 The sound source, i.e. glottal pulses (determines the F0
and the frequency of the harmonics)
 The sound filter, i.e. the supra-laryngeal vocal tract
(depending on how it's shaped at any given moment,
boosts amplitude of harmonics near resonant
frequencies, and damps harmonics elsewhere, resulting
in some pattern of formants)
Glottal source
• If there were no filter on top of it, a
spectrum of the glottal source would
show near-linear decrease in
amplitude of the harmonics as
frequency increases.
Filter spectrum
Typical F1, F2, F3 values for an adult male speaker saying [æ]
Product of source and filter
• Results in the actual spectrum that
we can observe in Praat
Source-filter independence
Same source spectrum as previous slide, but different filter.
Likewise, we could change the source (F0 and harmonic frequencies) without changing the filter. The precise harmonics would then be in different places, but the location of the formants would be the same.
Aperiodic spectra
• Though an aperiodic sound (e.g. a
voiceless fricative) has no F0, it still may
be stronger in certain frequency regions.
 Compare sounds of [θ, s, ʃ]
 Fricative spectrum is primarily determined by the
size of the cavity in front of the point of
constriction:
• the larger the cavity, the lower the centre of energy
Figure 7.21
Figure 7.22
Figure 7.23
Figure 7.24
Figure 7.25
Figure 7.26
Figure 7.27
Figure 7.28
Figure 7.29
Figure 7.30
Figure 7.31