Speech Production (v2)

Speech communication, speech production and
phonetics
Speech processing
Tom Bäckström
Aalto University
Fall 2016
Speech Communication
Thinking, language and communication
- Among the most important differences between humans and other animals are
  - the ability for abstract thought and awareness of thought, and
  - the ability to communicate with other people and to act in co-operation for joint goals.
- Both abilities essentially require organization of thought and communication using a language.
  - Language gives us the ability to think and communicate on an abstract level.
  - The sentence “Give me food” does not specify “what” food to give or “how” it should be served, but the meaning is obvious to us. Language thus gives us the opportunity to process actions and things on an abstract level.
  - The communication of other animals is generally an expression of emotional state without self-awareness; a dog can bark when it is afraid, angry or excited, but the dog is (probably) not aware of these emotions.
Speech Communication
Thinking, language and communication
- The primary mode of communication for humans is speech. It is thus no coincidence that
  - speech is in a central position in the development of a child,
  - this lecture is presented in a spoken form instead of slides-only,
  - television programming gives most of the information at least in a spoken form,
  - the first form of instantaneous communication was the phone, etc.
- Writing and reading (on paper, screen, etc.) are also important forms of communication:
  - Writing facilitates storage of information.
  - It was arguably the first form of telecommunication.
  - When comparing a chat and speaking on the phone, however, it becomes obvious which form of communication causes fewer problems.
Speech Communication
The Voice
- A speech signal, the voice, is an acoustic signal for the transmission of words, language and messages.
  - The acoustic pressure waveform is the “carrier”, which relays the speech signal from the speaker to the ear of the listener.
  - The speech signal and language, in turn, are the “carrier” for the message and meaning the speaker wants to relay.
- Speech communication can thus be seen as a layered model of a communication path.
- Such communication protocols can be formally defined using models such as OSI or TCP/IP.
Speech Communication
Applied in the OSI-model
Group assignment: What parts of the speech communication process would you assign to the different parts of the OSI-model?

OSI Model

Media layers
1. Physical (data unit: Bit)
   Function: Transmission and reception of raw bit streams over a physical medium
   Examples: DSL, USB
2. Data link (data unit: Frame)
   Function: Reliable transmission of data frames between two nodes connected by a physical layer
   Examples: PPP, IEEE 802.2, L2TP
3. Network (data unit: Packet)
   Function: Structuring and managing a multi-node network, including addressing, routing and traffic control
   Examples: IPv4, IPv6, IPsec, AppleTalk, ICMP

Host layers
4. Transport (data unit: Segments)
   Function: Reliable transmission of data segments between points on a network, including segmentation, acknowledgement and multiplexing
   Examples: TCP, UDP
5. Session (data unit: Data)
   Function: Managing communication sessions, i.e. continuous exchange of information in the form of multiple back-and-forth transmissions between two nodes
   Examples: RPC, PAP
6. Presentation (data unit: Data)
   Function: Translation of data between a networking service and an application, including character encoding, data compression and encryption/decryption
   Examples: ASCII, EBCDIC, JPEG
7. Application (data unit: Data)
   Function: High-level APIs, including resource sharing, remote file access, directory services and virtual terminals
   Examples: HTTP, FTP, SMTP, Secure Shell, Mail, Internet Explorer, Firefox
Speech production
Terminology
Speech production refers to those physiological, physical and
neurological processes required to produce speech.
Phonation refers (in my interpretation) to the physiological,
physical and neurological processes in the production
of a single speech sound, whereby it is a physiological
base unit.
Phone is a specific (perceptually identifiable) sound
irrespective of its grammatical position or meaning. It
is thus a perceptual and acoustic base unit.
Phoneme is the base unit of language and refers to the smallest
unit which distinguishes between meanings.
(Auditory) perception refers to the auditory reception and
detection of acoustic signals such as speech.
Speech production
Physiology
English        Finnish
nasal cavity   nenäväylä
palate         kitalaki
oral cavity    suuväylä
lips           huulet
tongue         kieli
jaw            leuka
pharynx        nielu
epiglottis     kurkunkansi
glottis        äänirako
vocal folds    äänihuulet
larynx         kurkunpää
esophagus      ruokatorvi
Speech production
The phonation process
- A phonation begins on a neurological level with the decision or intent to produce speech, whereby the brain sends a message to the physiological organs to produce speech.
- The physiological process begins in the lungs, which contract, increasing the air pressure such that air flows out.
- The acoustical signal is then produced with two mostly independent processes:
  - Voiced phones are produced by tightening the vocal folds to an appropriate tension, such that they begin to oscillate in the air flow. The varying airflow causes a pressure waveform, that is, a sound.
  - Unvoiced phones are produced by constricting some part of the vocal tract such that airflow is either prevented or constricted, causing a turbulent mode of airflow and pressure waveform.
Speech production
Voiced phones
[Figure: Vocal folds depicted from above.]
English      Finnish
vocal fold   äänihuuli
trachea      henkitorvi
epiglottis   kurkunkansi
Speech production
Vocal folds in action
https://www.youtube.com/watch?v=mJedwz_r2Pc
https://www.youtube.com/watch?v=W-nS9fgs7Ro
Phonetics and phonology
Phonetics is the study of the acoustics and physiology of speech production, speech perception and speech sounds.
Phonology is the study of the sound patterns of language. It often involves attempting to formalize, using a grammar, what these sound patterns are, as well as to account for and understand how grammars can differ across languages.
The distinction is not very clear and some claim that phonology is a part of phonetics. Either way, telling this to phoneticians or phonologists is a good way to pick a fight.
Phonetics and phonology
Production of voiced sounds
- When the vocal folds are tightened appropriately, an airflow passing through the glottis (the opening between the vocal folds) can cause an oscillation.
  - When the folds are closed, pressure under them will increase due to force from the lungs, until the vocal folds open.
  - Air flows through the opening folds and sub-glottal pressure decreases.
  - When the tension of the vocal folds exceeds the force exerted by the momentum of the vocal folds and air pressure, the folds begin to close again.
  - When the folds close, it causes a clapping sound, which is the primary sound event of the vocal folds and is known as the glottal excitation.
Phonetics and phonology
Production of voiced sounds
- This cycle repeats more or less periodically, thus producing the fundamental frequency of speech, F0.
  - Speech production is not a rigid mechanical system, so the period length and intensity vary slightly over time.
  - Variations in period length and amplitude are known as jitter and shimmer, respectively.
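As a rough illustration of jitter and shimmer, the sketch below uses one common convention (an assumption, not given on the slides): the mean absolute difference between consecutive cycles, divided by the mean value. The measurement values are hypothetical.

```python
def relative_perturbation(values):
    """Mean absolute difference between consecutive values, divided by the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return mean_diff / mean_value

# Hypothetical measurements of five consecutive glottal cycles.
periods_ms = [10.0, 10.2, 9.9, 10.1, 9.8]    # cycle lengths in milliseconds
amplitudes = [1.00, 0.96, 1.02, 0.98, 1.04]  # peak amplitudes (arbitrary units)

jitter = relative_perturbation(periods_ms)   # variation in period length
shimmer = relative_perturbation(amplitudes)  # variation in amplitude
print(f"jitter  = {jitter:.1%}")
print(f"shimmer = {shimmer:.1%}")
```

Other definitions of jitter and shimmer exist; this one is chosen only because it is easy to compute by hand.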
Phonetics and phonology
Production of voiced sounds
The airflow passing through the glottis is thus half-wave rectified, whereby it is a harmonic signal and its spectrum has a comb structure.
[Figure: (a) displacement of the left and right vocal folds over time; (b) magnitude spectrum (dB) over frequency.]
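The comb structure can be checked numerically. The sketch below is a simplification, assuming an idealized half-wave rectified sinusoid as the glottal flow and arbitrary values for the sampling rate and F0. It evaluates the spectrum at multiples of F0 and half-way between them.

```python
import cmath
import math

def dft_magnitude(signal, freq_hz, fs):
    """Normalized magnitude of the DFT of `signal` at a single frequency."""
    w = -2j * math.pi * freq_hz / fs
    return abs(sum(x * cmath.exp(w * n) for n, x in enumerate(signal))) / len(signal)

fs = 8000        # sampling rate in Hz (assumed)
f0 = 100         # fundamental frequency in Hz (assumed)
n_samples = 800  # exactly 10 periods of 80 samples

# Half-wave rectified sinusoid: negative half-cycles are clipped to zero.
flow = [max(math.sin(2 * math.pi * f0 * n / fs), 0.0) for n in range(n_samples)]

# Probe the spectrum at some multiples of F0 and half-way between multiples.
on_harmonics = {f: dft_magnitude(flow, f, fs) for f in (100, 200, 400)}
between = {f: dft_magnitude(flow, f, fs) for f in (150, 250, 350)}

for f, m in sorted({**on_harmonics, **between}.items()):
    print(f"{f:4d} Hz: {m:.6f}")
```

Energy appears only at integer multiples of F0, which is the comb. For this idealized pulse shape some multiples (such as 3·F0) happen to vanish; that is a property of the perfectly symmetric test signal.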
Phonetics and phonology
Production of voiced sounds
- If the fundamental frequency is F0, then the harmonic frequencies are kF0, k = 1, 2, 3, ....
- Since F0 varies slightly, F0 + ε, the harmonic frequencies will also vary, by kε.
- It follows that the variation in the location of the upper harmonics is large, and the harmonic structure is often visible only at low frequencies (when k is small).
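A small worked example of this scaling, with assumed values for F0 and for the deviation (called `eps` here): the k-th harmonic moves k times as far as the fundamental.

```python
def harmonic_deviation(f0, eps, k):
    """Return (nominal frequency, deviation) of the k-th harmonic when F0 shifts by eps."""
    return k * f0, k * eps

f0, eps = 100.0, 2.0  # assumed example values in Hz
table = [harmonic_deviation(f0, eps, k) for k in (1, 5, 25)]
for freq, dev in table:
    print(f"harmonic at {freq:6.0f} Hz moves by {dev:4.0f} Hz")
```

With these numbers, the 25th harmonic moves by half of the spacing F0 between neighbouring harmonics, which suggests why the comb tends to be hard to see at high frequencies.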
Phonetics and phonology
Production of voiced sounds
- The fundamental frequency of a speaker depends on the tension, length and mass of the vocal folds.
  - Shorter people (like children and some women) have shorter and lighter vocal folds, whereby their fundamental frequency is higher, F0 ≈ 150 Hz to 400 Hz, whereas taller and heavier people (usually males) have a lower F0 ≈ 90 Hz to 200 Hz.
  - The tension of the vocal folds can be modified consciously to modify the fundamental frequency.
  - Joint exercise: The fundamental frequency has a great expressive function. Pronounce the following expressions and observe the changes in pitch.
    - Party! (excitement)
    - Party? (question)
    - Party. (disappointment)
    (You probably also noticed a difference in intensity/volume.)
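To observe pitch in a recording rather than by ear, F0 can be estimated from the signal. The following is a minimal sketch using autocorrelation, one of several possible methods and not one named on the slides. The search range, sampling rate and the synthetic test signal are assumptions.

```python
import math

def estimate_f0(signal, fs, f_min=80.0, f_max=400.0):
    """Estimate F0 as fs divided by the lag with the largest autocorrelation."""
    lag_min = int(fs / f_max)  # shortest period considered
    lag_max = int(fs / f_min)  # longest period considered

    def autocorr(lag):
        return sum(signal[n] * signal[n + lag] for n in range(len(signal) - lag))

    best_lag = max(range(lag_min, lag_max + 1), key=autocorr)
    return fs / best_lag

fs = 8000        # sampling rate in Hz (assumed)
true_f0 = 125.0  # chosen so that one period is exactly 64 samples

# Synthetic voiced-like signal: five harmonics with decaying amplitudes.
signal = [
    sum(math.sin(2 * math.pi * k * true_f0 * n / fs) / k for k in range(1, 6))
    for n in range(1024)
]

estimate = estimate_f0(signal, fs)
print(f"estimated F0: {estimate:.1f} Hz")
```

Real speech is noisier and only approximately periodic, so practical estimators add windowing, normalization and voicing decisions on top of this idea.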
Phonetics and phonology
Production of unvoiced sounds
- Unvoiced phones are produced by preventing airflow in the vocal tract partially or completely.
- Different manners of articulation are
Stops (fin. klusiilit) are phones where airflow is completely stopped for a moment and then abruptly released. For example, pop, hat, cat.
Nasals (fin. nasaalit) are phones where airflow through the mouth is prevented (partly or wholly) but air does flow through the nose. For example, nose, moomins.
Fricatives (fin. frikatiivit) are formed by constricting airflow through a small opening such that airflow goes into a turbulent mode (nonlinear effect), such that a noisy sound is generated. For example, half, fluffy.
Phonetics and phonology
Production of unvoiced sounds
Affricates (fin. affrikaatat) are phones which begin as a stop and open up to a fricative, such as chin, gentle.
Tremulants (fin. tremulantit) are single or repeated constrictions in the vocal tract, such as in robbery, horrid. Most English accents use only single taps, and proper repeated constrictions appear only in accents such as Scottish. (Half-way) taps are also known as approximants since they approximate stops or tremulants.
Semivowels (fin. puolivokaalit) are similar to fricatives, but with only a small amount of noisiness. For example, water, joint.
Laterals (fin. lateraalit) are similar to semivowels, but such that air flows on both sides of the tongue. For example, hello, lateral.
- This is only a high-level list and many more detailed
Phonetics and phonology
Production of unvoiced sounds
The place of articulation describes where the most significant
constriction of the vocal tract is located.
http://qz.com/680488/watch-mri-footage-of-a-world-class-opera-singer-performing/
Phonetics and phonology
The effect of the vocal tract
- The acoustic properties of the vocal tract are in many respects similar to those of a trumpet or other wind instruments.
  - The shape of the pipe/tube has a significant effect on the timbre (colour) of the sound.
- The position and shape of the tongue, lips and jaw have a crucial role in speech production.
  - (Unvoiced sounds are formed in constrictions of the vocal tract.)
  - For all voicings, the vocal tract has resonances which depend on the shape of the tube.
  - Vowels are distinguished from each other by the location and intensity of the resonances of the vocal tract.
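To get a feeling for where such resonances lie, a common textbook simplification (an assumption here, not taken from the slides) treats the vocal tract as a uniform tube closed at the glottis and open at the lips. Its resonances are F_n = (2n - 1) c / (4 L). The tube length of 17 cm and the speed of sound are assumed typical values.

```python
def tube_resonances(length_m, n_resonances=3, speed_of_sound=343.0):
    """Resonances of a uniform tube closed at one end: F_n = (2n - 1) * c / (4 * L)."""
    return [
        (2 * n - 1) * speed_of_sound / (4.0 * length_m)
        for n in range(1, n_resonances + 1)
    ]

# Assumed vocal tract length of 0.17 m (a typical adult value).
resonances = tube_resonances(0.17)
for n, f in enumerate(resonances, start=1):
    print(f"resonance {n}: {f:.0f} Hz")
```

The result is roughly 500, 1500 and 2500 Hz, close to a neutral vowel. Changing the shape of the tube with the tongue, lips and jaw moves these resonances, which the uniform-tube model cannot capture.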
Phonetics and phonology
The effect of the vocal tract
[Figure: © Jarmo Malinen, with permission.]
Phonetics and phonology
The effect of the vocal tract – Formants
- The resonances of the vocal tract are known as formants.
  - The frequency locations (and amplitudes) of the formants identify vowels.
  - Formants are visible as peaks in the spectral envelope (the rough shape of the spectrum).
  - Formant frequencies are usually denoted by F1, F2, F3, ....
  - NOTE1! F0 is not a formant but the fundamental frequency.
  - NOTE2! The fundamental frequency and the formants are independent of each other.
- Formants are among the most important features of speech.
  - By reproducing only the formant structure of a speech signal we obtain a signal which is quite intelligible.
  - By vocoders we refer to all methods which produce sounds with a distinct formant structure.
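The idea of reproducing only the formant structure can be sketched with a simple source-filter model: an impulse train at F0 (the excitation) is passed through two-pole resonators placed at the formant frequencies. This is one possible illustration, not the method of any particular vocoder. The formant frequencies, bandwidths and sampling rate are assumed values, loosely resembling an "ah"-like vowel.

```python
import cmath
import math

def resonator(signal, freq_hz, bandwidth_hz, fs):
    """Two-pole resonator: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / fs)  # pole radius from bandwidth
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / fs)
    a2 = -r * r
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def magnitude_at(signal, freq_hz, fs):
    """Magnitude of the DFT of `signal` at a single frequency."""
    w = -2j * math.pi * freq_hz / fs
    return abs(sum(x * cmath.exp(w * n) for n, x in enumerate(signal)))

fs = 16000        # sampling rate in Hz (assumed)
f0 = 100          # fundamental frequency in Hz (assumed)
n_samples = 3200  # 0.2 seconds

# Excitation: one impulse per fundamental period.
excitation = [1.0 if n % (fs // f0) == 0 else 0.0 for n in range(n_samples)]

# Vocal tract: cascade of two resonators at assumed F1 = 700 Hz and F2 = 1200 Hz.
vowel = resonator(resonator(excitation, 700, 80, fs), 1200, 90, fs)

# Find the strongest harmonic of F0 below 2 kHz.
strongest = max(range(f0, 2001, f0), key=lambda f: magnitude_at(vowel, f, fs))
print(f"strongest harmonic: {strongest} Hz")
```

The harmonics remain at multiples of F0, but their amplitudes follow the resonances: changing `f0` moves the comb, changing the resonator frequencies moves the envelope, which illustrates why F0 and the formants are independent.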
Phonetics and phonology
The effect of the vocal tract – Formants
[Figure: ten spectrum panels, one per vowel: ee, i, e, ae, ah, aw, U, oo, u, er.]
The two first formants F1 and F2 in the spectrum.
Phonetics and phonology
The effect of the vocal tract – Formants
[Figure: the vowels ee, i, e, ae, er, u, U, oo, ah and aw plotted with F1 frequency (Hz) on the horizontal axis and F2 frequency (Hz) on the vertical axis.]
The locations of the two first formants F1 and F2 in a formant triangle.
Phonetics and phonology
The effect of the vocal tract – Formants
https://en.wikipedia.org/wiki/IPA_vowel_chart_with_audio
Vowel chart (source: Wikipedia).
Phonetics and phonology
The effect of the vocal tract – Formants
- Every language has its own formant structure.
  - The location of vowels varies.
  - The number of vowels can be different.
  - The vowels have to be sufficiently separated in the formant triangle; otherwise listeners cannot tell them apart.
- When a language develops over time, the position and number of vowels can change.
  - If the formants of different vowels of a language drift too close to each other over time, the vowels can become indistinguishable and become one phone. Sometimes the phonemes also merge, and sometimes they retain their context-dependent differences in meaning.
  - Similarly, a single vowel can over time evolve such that it is split into two.
  - Evolution of language is a slow process and beyond the scope of this course.
Phonetics and phonology
International Phonetic Alphabet (IPA)
- Since every language has its own set of phones (which can change over time), we need a description or notation which uniquely identifies every possible phone.
- The International Phonetic Alphabet provides means for describing pronunciation such that the alphabet is
  - language-independent,
  - unambiguous.
- Quick reference: https://en.wikipedia.org/wiki/File:IPA_chart_(C)2005.pdf
- Examples in English http://www.m-w.com/ and Finnish http://www.sanakirja.org/
- Alphabets better suited for representation on a computer are, for example, SAMPA and X-SAMPA.
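As an illustration of how an ASCII alphabet such as X-SAMPA relates to IPA, the sketch below converts a handful of single-character X-SAMPA symbols. The table is deliberately partial, and the character-by-character approach is a simplification: full X-SAMPA also has multi-character symbols and diacritics that this does not handle.

```python
# Partial X-SAMPA -> IPA table (single-character symbols only).
XSAMPA_TO_IPA = {
    "S": "ʃ", "Z": "ʒ", "T": "θ", "D": "ð", "N": "ŋ",
    "@": "ə", "{": "æ", "I": "ɪ", "U": "ʊ", "E": "ɛ",
    "O": "ɔ", "A": "ɑ", "V": "ʌ", ":": "ː",
}

def xsampa_to_ipa(text):
    """Convert an X-SAMPA string to IPA, one character at a time."""
    return "".join(XSAMPA_TO_IPA.get(ch, ch) for ch in text)

cheap = xsampa_to_ipa("tSi:p")  # "cheap"
king = xsampa_to_ipa("kIN")     # "king"
print(cheap, king)
```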
[Figure: The International Phonetic Alphabet (revised to 2005), © 2005 IPA. The chart covers pulmonic consonants, non-pulmonic consonants, vowels, other symbols, diacritics, suprasegmentals, and tones and word accents. Where consonant symbols appear in pairs, the one to the right represents a voiced consonant; where vowel symbols appear in pairs, the one to the right represents a rounded vowel.]
"IPA-chart". Licensed under CC BY-SA 3.0 via Wikimedia Commons - https://en.wikipedia.org/wiki/File:IPA_chart_(C)2005.pdf
Phonetics and phonology
Phonemes in the English Language – Consonants
Voiceless          Voiced
/p/   pit          /b/   bit
/t/   tin          /d/   din
/k/   cut          /g/   gut
/tʃ/  cheap        /dʒ/  jeep
/f/   fat          /v/   vat
/θ/   thigh        /ð/   thy
/s/   sap          /z/   zap
/ʃ/   dilution     /ʒ/   delusion
/x/   loch
/h/   ham
                   /m/   map
                   /n/   keen
                   /ŋ/   king
                   /j/   yes
                   /w/   we
                   /r/   run
                   /l/   left
Phonetics and phonology
Phonemes in the English Language – Vowels
æ               trap
æ / ɑː          bath
ɑː              palm
ɒ / ɑː          lot
ɒ / ɔː          cloth
ɔː              thought

ɪ               kit
iː              fleece
e / ɛ           dress
ʌ               strut
ʊ               foot
uː              goose

eɪ              face
aɪ              price
ɔɪ              choice
əʊ / oʊ         goat
aʊ              mouth
ɜː(r) / ɜːr     nurse

ɑː(r)           start
ɔː(r)           north
ɔː(r) / oʊr     force
ɪə(r) / ɪr      near
eə / ɛr         square
ʊə(r) / ʊr      cure

ə               comma
ə(r)            letter
i               happy

Sound examples:
https://www.youtube.com/watch?v=xiqUVnXExTQ
Further discussion about English phonology:
https://en.wikipedia.org/wiki/English_phonology
Phonetics and phonology
Coarticulation, diphones and triphones
- The voice production system does not move between discrete physiological states.
  - In a sequence of phonemes, organs move continuously from one state to the next.
  - Most of the time the state is in-between two consecutive states.
  - Clean articulations of single phonemes are rather rare occurrences.
- Definition: Coarticulation in its general sense refers to a situation in which a conceptually isolated speech sound is influenced by, and becomes more like, a preceding or following speech sound.
- Diphones and triphones are, respectively, combinations of two or three adjacent phonemes.
  - By considering/modeling adjacent phonemes we can take coarticulation into account.
  - Spanish has about 800 diphones and German has about 2500.
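Extracting diphones and triphones from a phoneme sequence amounts to sliding a window of two or three phonemes over it. A minimal sketch, where the function name and the example transcription are hypothetical:

```python
def phoneme_windows(phonemes, n):
    """Return all runs of n adjacent phonemes (n=2: diphones, n=3: triphones)."""
    return [tuple(phonemes[i:i + n]) for i in range(len(phonemes) - n + 1)]

# Hypothetical transcription of the word "cat".
word = ["k", "æ", "t"]

diphones = phoneme_windows(word, 2)
triphones = phoneme_windows(word, 3)
print(diphones)
print(triphones)
```

Since a language with N phonemes can have at most N² diphones, the inventory stays manageable, which is one reason diphones are a practical unit for modeling coarticulation.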
Phonetics and phonology
More terminology
- A speech onset is the event when a phonation begins.
  - Onsets can vary in their speed; a phonation can start gradually with, for example, a fricative (e.g. “hat”) or abruptly with a stop (e.g. “top”).
- A speech offset is the ending event of a phonation.
  - Offsets are usually slower than onsets.
  - A common difficult case is a very slow offset where speech is trailing off (can be denoted by trailing dots, “It was like... umm...”). Is that still speech? At what point does the utterance actually end?
Phonetics and phonology
Prosody
- In linguistics, prosody is concerned with those elements of speech that are not individual vowels and consonants but are properties of syllables and larger units of speech.
- These contribute to such linguistic functions as intonation, tone, stress and rhythm.
- Prosody may reflect various features of the speaker or the utterance:
  - the emotional state of the speaker;
  - the form of the utterance (statement, question, or command);
  - the presence of irony or sarcasm;
  - emphasis, contrast, and focus;
  - or other elements of language that may not be encoded by grammar or by choice of vocabulary.
Phonetics and phonology
Prosody
Group exercise (15 min)
For each of the following attributes, generate one sentence where you can change the attribute using only intonation, tone, stress or rhythm:
- the emotional state of the speaker;
- the form of the utterance (statement, question, or command);
- the presence of irony or sarcasm;
- emphasis, contrast, and focus;
- or other elements of language that may not be encoded by grammar or by choice of vocabulary.