Computer Speech

Manfred R. Schroeder
Computer Speech
Recognition, Compression,
Synthesis
With Introductions to Hearing and Signal Analysis
and a Glossary of Speech and Computer Terms
With 78 Figures
Springer
Contents
1.
Introduction
1.1 Speech: Natural and Artificial
1.2 Voice Coders
1.3 Voiceprints for Combat and for Fighting Crime .
1.4 The Electronic Secretary
1.5 The Human Voice as a Key
1.6 Clipped Speech
1.7 Frequency Division
1.8 The First Circle of Hell: Speech in the Soviet Union
1.9 Linking Fast Trains to the Telephone Network
1.10 Digital Decapitation
1.11 Man into Woman and Back
1.12 Reading Aids for the Blind
1.13 High-Speed Recorded Books
1.14 Spectral Compression for the Hard-of-Hearing
1.15 Restoration of Helium Speech
1.16 Noise Suppression
1.17 Slow Speed for Better Comprehension
1.18 Multiband Hearing Aids and Binaural Speech Processors . . . .
1.19 Improving Public Address Systems
1.20 Raising Intelligibility in Reverberant Spaces
1.21 Conclusion
1
1
2
4
7
8
9
11
12
13
14
16
16
16
17
17
17
19
19
20
20
21
2.
A Brief History of Speech
2.1 Animal Talk
2.2 Wolfgang Ritter von Kempelen
2.3 From Kratzenstein to Helmholtz
2.4 Helmholtz and Rayleigh
2.5 The Bells:
Alexander Melville and Alexander Graham Bell
2.6 Modern Times
2.7 The Vocal Tract
2.8 Articulatory Dynamics
2.9 The Vocoder and Some of Its Progeny
23
23
24
26
27
."T
28
29
30
31
34
XXVIII Contents
2.10
2.11
2.12
2.13
2.14
2.15
2.16
2.17
2.18
3.
4.
Formant Vocoders
Correlation Vocoders
The Voice-Excited Vocoder
Center Clipping for Spectrum Flattening
Linear Prediction
Subjective Error Criteria
Neural Networks
Wavelets
Conclusion
Speech Recognition
and Speaker Identification
3.1 Speech Recognition
3.2 Dialogue Systems
3.3 Speaker Identification
3.4 Word Spotting
3.5 Pinpointing Disasters by Speaker Identification
3.6 Speaker Identification for Forensic Purposes
3.7 Dynamic Programming
3.8 Markov Models
3.9 Shannon's Outguessing Machine
- A Hidden Markov Model Analyzer
3.10 Hidden Markov Models in Speech Recognition
3.11 Neural Networks
3.11.1 The Perceptron
3.11.2 Multilayer Networks
3.11.3 Backward Error Propagation
3.11.4 Kohonen Self-Organizing Maps
3.11.5 Hopfield Nets and Associative Memory
3.12 Whole Word Recognition
3.13 Robust Speech Recognition
3.14 The Modulation Transfer Function
Speech Compression
4.1 Vocoders
4.2 Digital Simulation
4.3 Linear Prediction
4.3.1 Linear Prediction and Resonances
4.3.2 The Innovation Sequence
4.3.3 Single Pulse Excitation
4.3.4 Multipulse Excitation
^
4.3.5 Adaptive Predictive Coding
4.3.6 Masking of Quantizing Noise
4.3.7 Instantaneous Quantizing Versus Block Coding
4.3.8 Delays
35
36
36
37
38
38
39
39
40
41
42
44
45
46
47
48
49
49
50
51
52
53
54
54
55
55
56
57
57
63
64
65
66
67
71
72
74
74
75
76
78
Contents XXIX
4.3.9 Code Excited Linear Prediction (CELP)
4.3.10 Algebraic Codes
4.3.11 Efficient Coding of Parameters
4.4 Waveform Coding
4.5 Transform Coding
4.6 Audio Compression
79
79
79
80
81
82
5.
Speech Synthesis
5.1 Model-Based Speech Synthesis
5.2 Synthesis by Concatenation
5.3 Prosody
85
87
88
89
6.
Speech Production
6.1 Sources and Filters
6.2 The Vocal Source
6.3 The Vocal Tract
6.3.1 Radiation from the Lips
6.4 The Acoustic Tube Model of the Vocal Tract
6.5 Discrete Time Description
91
92
92
95
96
98
102
7.
The
7.1
7.2
7.3
7.4
105
106
106
106
107
8.
Hearing
8.1 Historical Antecedents
8.2 Thomas Seebeck and Georg Simon Ohm
8.3 More on Monaural Phase Sensitivity
8.4 Hermann von Helmholtz and Georg von Bekesy
8.4.1 Thresholds of Hearing
8.4.2 Pulsation Threshold and Continuity Effect
8.5 Anatomy and Basic Capabilities of the Ear
8.6 The Pinnae and the Outer Ear Canal
8.7 The Middle Ear
8.8 The Inner Ear
8.9 Mechanical to Neural Transduction
8.10 Some Astounding Monaural Phase Effects
8.11 Masking
8.12 Loudness
8.13 Scaling in Psychology
8.14 Pitch Perception and Uncertainty . . . .\
Speech Signal
Spectral Envelope and Fine Structure
Unvoiced Sounds
The Voiced-Unvoiced Classification
The Formant Frequencies
109
Ill
113
113
114
114
115
116
116
116
118
125
127
130
130
131
133
XXX
9.
Contents
Binaural Hearing — Listening with Both Ears
9.1 Directional Hearing
9.2 Precedence and Haas Effects
9.3 Vertical Localization
9.4 Virtual Sound Sources and Quasi-Stereophony
9.5 Binaural Release from Masking
9.6 Binaural Beats and Pitch
9.7 Direction and Pitch Confused
9.8 Pseudo-Stereophony
9.9 Virtual Sound Images
9.10 Philharmonic Hall, New York
9.11 The Proper Reproduction of Spatial Sound Fields
9.12 The Importance of Lateral Sound
9.13 How to Increase Lateral Sounds in Real Halls
9.14 Summary
135
135
137
139
141
144
145
146
150
152
153
154
156
158
161
10. Basic Signal Concepts
10.1 The Sampling Theorem and Some Notational Conventions . . .
10.2 Fourier Transforms
10.3 The Autocorrelation Function
10.4 The Convolution Integral and the Delta Function
10.5 The Cross-Correlation Function and the Cross-Spectrum . . . .
10.5.1 A Bit of Number Theory
10.6 The Hilbert Transform and the Analytic Signal
10.7 Hilbert Envelope and Instantaneous Frequency
10.8 Causality and the Kramers-Kronig Relations
10.8.1 Anticausal Functions
10.8.2 Minimum-Phase Systems and Complex Frequencies . . .
10.8.3 Allpass Systems
10.8.4 Dereverberation
10.9 Matched Filtering
10.10 Phase and Group Delay
10.11 Heisenberg Uncertainty and The Fourier Transform
10.11.1 Prolate Spheroidal Wave Functions and Uncertainty. .
10.12 Time and Frequency Windows
10.13 The Wigner-Ville Distribution
10.14 The Cepstrum: Measurement of Fundamental Frequency . . . .
10.15 Line Spectral Frequencies
163
163
164
167
169
171
173
174
176
180
181
182
183
184
185
186
188
190
194
195
197
200
A. Acoustic Theory and Modeling of the Vocal Tract
A.I Introduction
A.2 Acoustics of a Hard-Walled, Lossless Tube
A.2.1 Field Equations
A.2.2 Time-Invariant Case
A.2.3 Forrnants as Eigenvalues
203
203
204
204
208
209
Contents XXXI
A.2.4 Losses and Nonrigid Walls
A.3 Discrete Modeling of a Tube
A.3.1 Time-Domain Modeling
A.3.2 Frequency-Domain Modeling, Two-Port Theory
A.3.3 Tube Models and Linear Prediction
A.4 Notes on the Inverse Problem
A.4.1 Analytic and Numerical Methods
A.4.2 Empirical Methods
211
212
213
216
219
220
221
223
B. Direct Relations
Between Cepstrum and Predictor Coefficients
225
B.I Derivation of the Main Result
225
B.2 Direct Computation of Predictor Coefficients
from the Cepstrum
227
B.3 A Simple Check
228
B.4 Connection with Algebraic Roots and Symmetric Functions . . 228
B.5 Connection with Statistical Moments and Cumulants
230
B.6 Computational Complexity
230
B.7 An Application of Root-Power Sums to Pitch Detection
231
References
234
General Reading
249
Selected Journals
259
A Sampling of Societies and Major Meetings
260
Glossary of Speech and Computer Terms
261
Name Index
285
Subject Index
293
The Author
315