Beware of the ‘telephone effect’: the
influence of telephone transmission on the
measurement of formant frequencies
Hermann J. Künzel
Department of Phonetics, University of Marburg
ABSTRACT Speech scientists often have to work with speech signals that have been
transmitted over the telephone. Although the acoustic properties of telephone transmission such as the band-pass filter characteristics are well known, little attention has
been paid to their effect on the measurement of speech parameters.1 This study deals
with artefacts introduced by the lower cut-off slope of the transmission channel on
vowel formants. For theoretical reasons, frequency components may be assumed to be
attenuated more strongly the lower they are. Therefore F1 of most vowels can be expected to be
affected most. Attenuation of the lower components of a formant will necessarily
increase the relative weight of the higher components for the determination of a
formant and thus cause an artificial upward shift of its centre frequency. An empirical
investigation with directly and telephone-transmitted samples from ten male and ten
female subjects shows that the predicted effect on F1 does in fact occur for all tested
vowels except /a/, whose F1 is too high to be affected by the slope of the band-pass. The
consequences of measurement errors arising from such artefacts are discussed with
special reference to speaker identification and empirical dialectology.
KEYWORDS telephone transmission, spectrographic analysis, spectrographic shifting,
forensic speaker recognition, dialectology
INTRODUCTION
The various effects of telephone transmission on speech have been
described comprehensively in the classical study by Moye (1979).2
Although some of the typical disturbances such as high levels of background noise or higher-order distortions due to carbon microphones have
become less of a problem with today’s digital landline telephony,
other factors have persisted, particularly the band-pass characteristics of
the transmission channel (350–3400 Hz). With the rise of Global
System for Mobile Communications (GSM) telephony, new types of distortion have emerged that may seriously degrade speech transmission and
also negatively affect both forensic and commercial speaker identification.3
But speaker identification is by no means the only phonetic application
that is affected by telephone transmission. As a matter of fact, the very
stimulus for the present study came from empirical dialectology. One of
the authors of the Mittelrheinischer Sprachatlas (Bellmann, Herrgen and
© University of Birmingham Press 2001
Forensic Linguistics 8(1) 2001
1350-1771
Schmidt 1994–99) is planning to re-interview a set of sample speakers that
were used for the original recordings (1981–88) with the aim of gaining a
diachronic view of dialectal change within different generations of the
same population. In an attempt to economize on both financial resources
and time for data collection and analysis, J. Schmidt originally intended to
adopt the procedure used previously by Labov et al. (forthcoming) for
their Atlas of North American English, i.e. (a) to conduct telephonic interviews rather than face-to-face interviews to record the acoustic data, and
(b) to use an automatic algorithm for the determination of formant centre
frequencies. Therefore, he put the question to the present author whether
there were, from a phonetic point of view, any general reservations about
a (landline-) telephone-based acquisition of speech data that were to be
analysed both acoustically and auditorily, and, in particular, that were to
be compared with the previous data, which had been recorded directly.
If the question is restricted to the band-pass filter effect of the transmission channel and focused on vocalic sounds it may be formulated in
phonetic terms as follows: will the measurement of formant frequencies be
influenced by telephone transmission? Considering further that the first
two or three formants of any vowel are below the upper cut-off frequency
of a standard telephone channel, the investigation may be limited to the
lower cut-off region. Therefore, the first step was to determine
empirically the typical transmission characteristics of standard landline
telephony, including typical standard hardware at both ends. Since today
analogue telecommunication networks have been replaced by digital ones
in Germany, as in most other western industrialized countries, it was considered useful to concentrate on the main digital technique (ISDN).
In order to address this question, data from a hitherto unpublished
study on the reliability and variability of digital telephone equipment and
ISDN transmission characteristics were used. A total of ten card and/or
coin telephones (public telephone booths) inside the same local network
(city of Wiesbaden) were investigated. As part of the test, white noise was
fed for 60 seconds into the microphone of the handset at a moderate
amplitude via an essentially linear loudspeaker (high-quality headphone).
At the other end of the line the signal was recorded onto a DAT recorder
that was wired to a standard digital telephone set. With the exception of
the loudspeaker and the microphone capsule of the telephone unit at the
front end, no analogue components were used in the set-up. Since the
spectral shapes of the transmitted test signal were almost identical in all
trials any one of them may be taken as representative of all.
Figure 1 shows the frequency response of one particular public card
telephone to the white noise. For the purpose of the present question it is
important to note that the amplitude of the signal rises approximately 30
dB from the frequency of the 50 Hz mains hum to the left cursor (400
Hz). For an interval of approximately 3000 Hz, i.e. up to the right cursor
(3400 Hz), the frequency response is rather flat. The slight extra increase
(3 dB) between ca. 400 and ca. 1200 Hz may be due to an artefact caused
by the loudspeaker.
Figure 1 Spectrum of white noise after ISDN transmission using standard
digital telephone hardware. Left and right cursor are fixed at
400 and 3400 Hz, respectively. One vertical division corresponds to 2.5 dB. The spike close to the left margin is caused by
mains hum (50 Hz).
With regard to the initial question it may be expected that the measurement of vowel formants below 400, perhaps even 500 Hz, could be
affected by the slope of the transmission channel in the following way. The
calculations of the bandwidth as well as of the centre frequency of a
formant are based on an averaging process over a number of neighbouring
harmonics. The amplitude, i.e. the weight of these harmonics, will
decrease if they come within the slope of the spectrum of the transmission
channel, i.e. below 400 or 500 Hz. The lower the frequency, the stronger
the attenuation will be. Thus, the relative weight of the higher harmonics
of a formant, particularly those above the slope, will be increased. At the
same time the bandwidth of the formant is reduced from the bottom end,
which will artificially shift its centre frequency upwards. It goes without
saying that in broad-band spectrograms the effect will also depend on the
filter bandwidth that is selected. The principle is illustrated in Figure 2. In
the right-hand part of the figure, the low frequency area (black) is most
severely attenuated since it falls into the steep part of the slope of the telephone filter function (S). A smaller frequency band (white) above the first
one is less strongly affected since the slope of S is already less steep here.
Harmonics in the grey-shaded area, i.e. in the horizontal part of the
spectrum, will not be affected at all. It should also be noted that the slope
falls within the ranges of both male and female speakers’ fundamental frequency (F0) values. The general problem with formant determination for
female voices is addressed in Footnote 4.
Figure 2 Shift of formant bandwidth and centre frequency as an artefact
of telephone transmission
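The averaging argument can be illustrated with a small numerical sketch. The following Python fragment is not part of the original study. It assumes a fundamental of 100 Hz, a bell-shaped (Gaussian) formant envelope with a spread of 100 Hz, and a telephone slope that falls linearly in dB from 0 dB at 400 Hz to -30 dB at 50 Hz, loosely modelled on Figure 1. All function names and parameter values are hypothetical.

```python
import math

F0 = 100.0                                   # assumed fundamental frequency (Hz)
HARMONICS = [F0 * k for k in range(1, 35)]   # harmonics from 100 to 3400 Hz


def formant_weight(f, centre, spread=100.0):
    """Assumed Gaussian amplitude of a harmonic at f near a formant centre."""
    return math.exp(-0.5 * ((f - centre) / spread) ** 2)


def telephone_gain(f, corner=400.0, floor=50.0, drop_db=30.0):
    """Assumed lower slope: unity gain at and above `corner`,
    attenuation growing linearly in dB to `drop_db` at `floor`."""
    if f >= corner:
        return 1.0
    attenuation_db = drop_db * (corner - f) / (corner - floor)
    return 10.0 ** (-attenuation_db / 20.0)


def centre_of_gravity(centre, through_telephone):
    """Amplitude-weighted mean frequency of the harmonics of one formant."""
    weighted_sum = 0.0
    total_weight = 0.0
    for f in HARMONICS:
        w = formant_weight(f, centre)
        if through_telephone:
            w *= telephone_gain(f)
        weighted_sum += f * w
        total_weight += w
    return weighted_sum / total_weight


# Upward shift (telephone minus direct) for a low and a high F1.
shift_close = centre_of_gravity(300.0, True) - centre_of_gravity(300.0, False)
shift_open = centre_of_gravity(700.0, True) - centre_of_gravity(700.0, False)

print(f"F1 at 300 Hz: shift of {shift_close:.1f} Hz")
print(f"F1 at 700 Hz: shift of {shift_open:.2f} Hz")
```

Under these assumptions the low formant is shifted upwards by several tens of Hz, whereas the high formant is left practically untouched. The absolute size of the shift depends entirely on the assumed envelope and slope and should not be read as a prediction for real speech.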
As an implication of the general acoustic theory of vowel production (e.g.
Stevens and House 1955, 1961) and the data on formants that have been
provided since the early days of sound spectrography and pattern playback
(Potter et al. 1947, Peterson and Barney 1952, Cooper et al. 1952,
Delattre 1965, Delattre et al. 1952; see also the survey in Baken 1987:
356–9 and more recent data for German in Kohler 1977: 54, Neppert and
Petursson 1986:124–5, Simpson 1998: 214–17) it can be expected that
the first formant of high and mid vowels in particular will be affected by
what might be called the ‘telephone effect’. If this hypothesis is confirmed,
an important question will arise: What are the potential perceptual consequences of a rise of F1, particularly in the light of early results by Flanagan
(1955) who found that changes in F1 or F2 centre frequency of as little as
3 to 5 per cent are detected consistently by listeners? In other words: Will
a vowel such as close [i] be perceived as more open [ɪ] when it is transmitted over the telephone? Generally, F2 and higher formants should
remain unaffected by the telephone effect as long as they remain within
the frequency band transmitted by the telephone channel.
If the rationale of the experiment had been to test the ‘telephone effect’
per se, measurements could have been facilitated if isolated vowels,
preferably produced by male speakers,4 or even synthetic vowels, had been
used as stimulus material. Regarding the forensic and linguistic applications mentioned earlier, however, it was considered useful deliberately to
include other factors, such as phonetic and phonological context, vowel
duration, different speed of talking, speaker sex and the like. These factors
might individually and/or in combination produce so many artefacts, i.e.
increase so much the variability of formant measurements, that they might
override the telephone effect in praxi, in which case the phenomenon
might be ignored for most investigations. Therefore it was decided to
work with natural fluent speech.
EXPERIMENT
Materials
Ten male and ten female subjects were used for the production of the
speech material. The respective age ranges were twenty to fifty (average of
twenty-eight years) and twenty to fifty-nine (average of thirty-one years).
The subjects were asked to read the standard text ‘The North Wind and
the Sun’ in German at normal speed and loudness level. The readings
lasted between 35 and 40 seconds.5 The speech signal was recorded simultaneously onto two DAT recorders (Sony TCD D7) (a) over a high-quality
condenser microphone (Sony ECM 959) and (b) over a standard digital
telephone handset (Alcatel 4400) which was connected to another digital
extension via a standard ISDN transmission line. For each recording a new
local call was dialled.
The sound system of standard German (see, for instance, Kohler 1977:
175) contains fifteen monophthongs plus /ə/, the reduced variant of any of
these. A general phonological rule determines that long vowels are close,
short vowels are more open. However, long and short /a/ do not differ
phonetically, and there is a long variant of /ɛ/. The central unstressed
vowel /ə/ is always short. Since /ø:, œ/ and /y:/ are not contained in the read
text a total of thirteen different vowels were selected, most of them
occurring more than once in different phonetic environments. The total
number of test tokens was twenty-nine (4 x /i:/, 3 x /ɪ/, 1 x /e:/, 2 x /ɛ/, 1 x
/ɛ:/, 7 x /a/, 1 x /a:/, 1 x /o:/, 2 x /ɔ/, 1 x /u:/, 1 x /ʊ/, 2 x /ʏ/, 3 x /ə/; these
vowels are underscored in the text which is contained in Footnote 5).
The ‘telephone effect’
85
Method
Each subject’s speech was recorded onto a computer (22,050 samples/s, 16
bit) for further processing. Segmentation and acoustic measurements of
the vowels were made with the KAY Multispeech Software. Centre frequencies for F1 and F2 were measured from broad-band spectrograms in
two ways: (a) by visual determination of the ‘centre of gravity’ by an experienced interpreter of spectrograms (author), (b) automatically, using the
specific autocorrelation function (‘formant history’) provided by the Multispeech software. Results of the automatic determination can be
monitored by means of small dots that mark all frequency–time coordinates that are located by the algorithm (one per cycle if the amplitude
surpasses a certain threshold). Numerical values gained by both techniques
were usually derived at the centre of each steady-state portion using the
cross-cursor function. A total of 2320 measurements were made (29
vowels, 20 speakers, 2 conditions, 2 formants).
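The counts given in the Materials and Method sections can be cross-checked with a few lines of arithmetic. The sketch below merely restates the figures from the text; the variable names are chosen here for illustration.

```python
# Token counts per vowel phoneme, in the order in which they are listed in
# the Materials section (positions 5 and 6 are short and long /a/).
TOKENS_PER_VOWEL = [4, 3, 1, 2, 1, 7, 1, 1, 2, 1, 1, 2, 3]

SPEAKERS = 20     # ten male and ten female subjects
CONDITIONS = 2    # direct and telephone recording
FORMANTS = 2      # F1 and F2

vowel_types = len(TOKENS_PER_VOWEL)
tokens = sum(TOKENS_PER_VOWEL)
measurements = tokens * SPEAKERS * CONDITIONS * FORMANTS
open_vowel_tokens = TOKENS_PER_VOWEL[5] + TOKENS_PER_VOWEL[6]

print(vowel_types, tokens, measurements, open_vowel_tokens)
```

The totals agree with the thirteen vowel types, twenty-nine tokens and 2320 measurements stated in the text, and with the eight tokens of the open vowel referred to in the Results.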
In the course of the analysis a number of problems arose. First, the
automatic function for the determination of formant centre frequencies
failed to provide any, or any reasonable, results in as many as 50 per cent
of the cases, depending on the individual speaker. Generally, female
speakers were affected more than males. The majority of measurements
that were considered ‘not reasonable’ consisted of cases where the algorithm ‘disregarded’ the formant in question and picked a higher formant
or single harmonic instead. The problem was particularly frequent with
the telephone data. For this reason the decision was taken to dispense with
the automatic determination of formant frequencies altogether and rely
exclusively on the interactive visual method. A typical example of the
problem is provided in Figure 3 which contains spectra of the word
‘daherkam’ produced in both conditions. In the direct recording (bottom)
the first formant of [a:] in the syllable [kʰa:m] is clearly visible but was not
identified by the algorithm (no marker dots). F2 is marked but the calculated frequency values (dots) are far too low. The reason is that obviously
the algorithm has taken into account some of the frequency components
that were disregarded for F1. Interestingly enough, in the spectrogram of
the telephone recording of the same vowel (top) both F1 and F2 are
marked. The frequency values seem far too low, however, and the F1 marker
dots are scattered too widely in the vertical dimension to provide a reasonable average.
It is well-known to anyone acquainted with the analysis of spectrograms
that the visual determination of formants may also involve problems. In
the current data some formants in certain vowels could not be identified
unequivocally because (a) amplitudes were generally too low, a phenomenon that also has a speaker-specific aspect, or (b) the high F0 of
female speakers made it sometimes impossible to integrate enough harmonics into a formant to determine its ‘centre’, even after optimizing the
settings (see Footnote 4). As a result of these problems there was a total of
Figure 3 Spectra of the word ‘daherkam’ produced by the same male
speaker and recorded in direct (bottom) and telephone condition (top)
nine missing values for three of the female subjects, all involving F2 of [i:],
and one for a male subject involving [ʊ].
A different problem, i.e. a supplementary source of variability, was a
consequence of the general decision to use natural fluent speech rather
than isolated vowels as speech material. Many vowels are so short that
there are none of the above-mentioned central or steady-state portions
that are conventionally used for the measurement of formants. Phonologically long vowels in unstressed, and sometimes even in (sentence-)stressed
positions, may be quite short phonetically, for instance the /o:/ in ‘wohl’ or
the /i:/ in ‘blies’, not to mention the phonologically short vowels. In such
cases the phonetic context is of special importance to the stage at which a
measurement is taken: For /o/ in ‘wohl’ formants were measured right at
the start of the vowel since F2 moves sharply upwards from the beginning
in anticipation of the lateral consonant. On the other hand, the formants
of /i/ in ‘sich’ were measured near the end of the vowel in order to exclude
the influence of the initial /s/. Here, the final /ç/ can be considered
‘neutral’ because of its homorganicity with /i/. Such context-based phonetic
criteria were applied consistently to all relevant cases.
In order to determine the size of variability introduced by the author,
who made all the measurements from the spectrograms, the data from
female subject no. 6, which was among the more ‘difficult’ ones, were
gathered a second time after the completion of all measurements. Results
showed that there were no significant differences for F1 or F2 centre
frequencies as a function of the first or second analysis (for F1: t = -0.65,
p = 0.51; for F2: t = 1.2, p = 0.24). The average deviation was -3.5 Hz for
F1 (95 per cent confidence interval -14.5 to +7.5 Hz) and +7.0 Hz for F2
(95 per cent confidence interval -4.9 to +18.9 Hz).
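The reported confidence intervals and t values can be checked against each other. The sketch below is an illustration only and rests on two assumptions that the text does not state explicitly: that the comparison was a paired t-test over the twenty-nine tokens (28 degrees of freedom), and that the intervals are symmetric about the mean. The critical value 2.048 is taken from standard tables of Student's t; the function name is hypothetical.

```python
# Two-tailed 5 per cent critical value of Student's t for 28 degrees of
# freedom (assumption: 29 paired tokens), taken from standard tables.
T_CRIT_DF28 = 2.048


def implied_statistics(ci_low, ci_high, t_crit=T_CRIT_DF28):
    """Recover mean, standard error and t from a symmetric confidence interval."""
    mean = (ci_low + ci_high) / 2.0
    half_width = (ci_high - ci_low) / 2.0
    std_err = half_width / t_crit
    return mean, std_err, mean / std_err


f1_mean, f1_se, f1_t = implied_statistics(-14.5, 7.5)
f2_mean, f2_se, f2_t = implied_statistics(-4.9, 18.9)

print(f"F1: mean {f1_mean:+.1f} Hz, t = {f1_t:.2f}")
print(f"F2: mean {f2_mean:+.1f} Hz, t = {f2_t:.2f}")
```

Under these assumptions the interval midpoints and the implied t values are consistent with the figures reported above, including the negative sign of the t value for F1.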
RESULTS
First, the question was asked for each subject whether there were differences in terms of the frequencies of the first two formants as a function of
recording condition in the set of twenty-nine test tokens. In other words,
at this stage of the investigation no sub-sets of same (phonological) vowels
were formed. A series of Wilcoxon matched-pairs signed ranks tests were
used. The two-tailed results are summarized in Table 1. The data show: (1)
For every single subject of either sex there were significant differences in
F1-values, which, as a matter of fact, always proved to be higher in the
telephone recording condition. This fact is all the more remarkable since
Table 1
Results of Wilcoxon-Wilcox matched-pairs signed ranks tests
for differences of F1 and F2 of twenty-nine test items as a
function of recording condition (direct/telephone)
Significance levels: * < 0.05, ** < 0.01 and *** < 0.001; ns
= not significant.

Subjects   F1     F2
males:
3          ***    ns
4          *      ns
5          ***    ns
11         *      ns
12         ***    ns
15         ***    ns
16         *      ns
18         ***    ns
20         ***    ns
21         ***    **
females:
2          **     ns
6          ***    ns
7          ***    ns
8          ***    ns
9          ***    ns
10         ***    ns
13         ***    ns
14         ***    ns
17         **     ns
19         **     ns
eight instances of the open vowel /a:/ or /a/ were contained in the set of
twenty-nine. As a result of their high F1 values they could be expected to
be less affected or not affected at all by the telephone effect. (2) Except for
one male subject whose values for the telephone condition are 19 Hz
higher on average for no obvious reason (see discussion), no significant
differences of F2 centre frequencies were found.
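For readers unfamiliar with the procedure, the matched-pairs signed ranks test can be sketched in a few lines of Python. The example is not a reproduction of the analysis above: the per-subject token data are not given in this paper, so the sketch is applied to the thirteen male group means reported in Table 2, and it uses the plain normal approximation without continuity or tie correction. The function name and the returned keys are hypothetical.

```python
import math

# Male group means from Table 2 (Hz), one value per vowel type.
F1_DIR = [435, 416, 349, 628, 670, 334, 323, 272, 519, 403, 491, 366, 322]
F1_TEL = [463, 453, 379, 641, 673, 351, 360, 303, 536, 413, 508, 413, 353]
F2_DIR = [1668, 1861, 1490, 1285, 1293, 1935, 1698, 2056, 1110, 867, 1391, 926, 850]
F2_TEL = [1674, 1852, 1496, 1288, 1306, 1933, 1692, 2051, 1131, 848, 1391, 926, 854]


def signed_rank_test(direct, telephone):
    """Wilcoxon matched-pairs signed ranks test, two-tailed, normal approximation."""
    diffs = [t - d for d, t in zip(direct, telephone) if t != d]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # extend j over a run of tied absolute differences
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (min(w_plus, w_minus) - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-tailed probability
    return {"n": n, "w_plus": w_plus, "w_minus": w_minus, "z": z, "p": p}


f1_result = signed_rank_test(F1_DIR, F1_TEL)
f2_result = signed_rank_test(F2_DIR, F2_TEL)
print("F1:", f1_result)
print("F2:", f2_result)
```

With only thirteen pairs the normal approximation is rough, so the probabilities are indicative at best. The pattern nevertheless mirrors Table 1: every F1 difference has the same sign, whereas the F2 differences are balanced between positive and negative ranks.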
Regarding the initial hypothesis the central question was how the
influence of the telephone band-pass would affect the different vowels.
For this purpose the twenty-nine tokens were regrouped according to the
thirteen vowel phonemes. Table 2 contains the data for both sex groups.
At this juncture it should be reiterated that some of the values represent a
Table 2
Average centre frequencies of F1 and F2 of thirteen German
vowels in direct and telephone recording conditions for ten
male and ten female subjects (frequencies in Hz, quotients in per cent)

Male subjects
vowel   F1dir   F1tel   F2dir   F2tel   F1%tel/dir   F2%tel/dir
ɛ       435     463     1668    1674    106.4        100.4
ɛ:      416     453     1861    1852    108.9         99.5
ʏ       349     379     1490    1496    108.6        100.4
a       628     641     1285    1288    102.1        100.2
a:      670     673     1293    1306    100.4        101.0
e:      334     351     1935    1933    105.1         99.9
ɪ       323     360     1698    1692    111.5         99.6
i:      272     303     2056    2051    111.4         99.8
ɔ       519     536     1110    1131    103.3        101.9
o:      403     413      867     848    102.5         97.8
ə       491     508     1391    1391    103.5        100.0
ʊ       366     413      926     926    112.8        100.0
u:      322     353      850     854    109.6        100.5

Female subjects
vowel   F1dir   F1tel   F2dir   F2tel   F1%tel/dir   F2%tel/dir
ɛ       492     523     2021    2051    106.4        101.5
ɛ:      416     441     2260    2261    106.0        100.1
ʏ       393     414     1684    1662    105.4         98.7
a       710     726     1505    1499    102.3         99.6
a:      781     765     1462    1446     97.9         98.9
e:      357     376     2241    2272    105.5        101.4
ɪ       391     430     2044    2052    110.0        100.4
i:      302     343     2369    2385    113.6        100.7
ɔ       567     584     1302    1280    103.0         98.3
o:      433     457      967     977    105.6        101.1
ə       489     525     1676    1650    107.3         98.4
ʊ       453     478     1039    1041    105.6        100.2
u:      358     396      994     983    110.6         98.8
single measurement per subject whereas others are based on as many as
seven per subject, depending on the number of occurrences of the particular sound throughout the text. The two right-hand columns contain
the results of a division of the telephone data by the corresponding direct
data of the respective formants. The quotients for F2 (‘F2%tel/dir’) are
illustrated in Figure 4. It is obvious that they vacillate around the 100 per
cent line, i.e. there is no tendency towards either direction (maximum
deviation: 2.2 per cent). In other words: As could be expected, the measurement of F2 centre frequencies has not been influenced by the
recording condition.
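The quotient columns can be recomputed from the averaged F2 values in Table 2. The following sketch does this for both sex groups and determines the largest deviation from 100 per cent; the list and function names are chosen here for illustration, and small discrepancies in the last digit are to be expected because the table entries are themselves rounded averages.

```python
# Averaged F2 centre frequencies from Table 2 (Hz), in table order.
MALE_F2_DIR = [1668, 1861, 1490, 1285, 1293, 1935, 1698, 2056, 1110, 867, 1391, 926, 850]
MALE_F2_TEL = [1674, 1852, 1496, 1288, 1306, 1933, 1692, 2051, 1131, 848, 1391, 926, 854]
FEMALE_F2_DIR = [2021, 2260, 1684, 1505, 1462, 2241, 2044, 2369, 1302, 967, 1676, 1039, 994]
FEMALE_F2_TEL = [2051, 2261, 1662, 1499, 1446, 2272, 2052, 2385, 1280, 977, 1650, 1041, 983]


def quotients(telephone, direct):
    """Telephone value as a percentage of the corresponding direct value."""
    return [100.0 * t / d for t, d in zip(telephone, direct)]


male_q = quotients(MALE_F2_TEL, MALE_F2_DIR)
female_q = quotients(FEMALE_F2_TEL, FEMALE_F2_DIR)

# Largest departure from the 100 per cent line in either group.
max_deviation = max(abs(q - 100.0) for q in male_q + female_q)

print([round(q, 1) for q in male_q])
print([round(q, 1) for q in female_q])
print(f"maximum deviation: {max_deviation:.1f} per cent")
```

The recomputed maximum deviation agrees with the value of 2.2 per cent given in the text.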
Results are dramatically different for F1. With the exception of the
female subjects’ long [a:], i.e. the sound that has the highest F1 anyway,
the F1 centres of each vowel measured higher in the telephone-transmitted
data as compared to the data recorded directly. It is of particular interest to
note that the size of the difference varies between less than one and over
thirteen per cent between the two recording conditions and is grosso
modo inversely proportional to the absolute level of F1, i.e. the difference
is largest for close vowels such as [i] and [u], medium for vowels such as
[e] and [o] and smallest or zero for open vowels like [ɔ, a]. The finding
applies to both sex groups in essentially the same way, since correlation
coefficients for the magnitudes of the F1 differences and the absolute
levels of F1 (in the direct-recording condition) are -0.80 for male and -0.86 for female subjects. Figure 5 illustrates the data using the same
vertical scale as Figure 4 in order to provide a direct comparison of the
vastly different magnitudes of the F1 vs. F2 differences – except for /a/. On
the basis of the data from the first four columns of Table 2, conventional
(reversed-scales) F1-F2 vowel charts for the thirteen German vowels were
drawn for both sex groups. Since in Figures 6 and 7 F1 is noted on the vertical axis the visual impression is essentially that of a downward shift of
the whole pattern of the telephone-recorded data (marked ‘T’) in relation
to their direct-recorded counterparts (marked ‘h’). Only long [a:] (produced by the female subjects) is an exception as a result of its very high F1.
The fact that there is no concomitant sideways shift of the pattern shows
that F2 remains unaffected by the telephone effect.
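The inverse relation between the size of the F1 shift and the absolute level of F1 can be verified approximately from the group means in Table 2. The sketch below computes Pearson's product-moment correlation between F1 in the direct condition and the percentage shift. It assumes that the shift is expressed in per cent rather than in Hz, which the text does not specify, and it works from rounded averages, so the coefficients need not match the reported ones to the last digit. All names are hypothetical.

```python
import math

# Averaged F1 centre frequencies from Table 2 (Hz), in table order.
MALE_F1_DIR = [435, 416, 349, 628, 670, 334, 323, 272, 519, 403, 491, 366, 322]
MALE_F1_TEL = [463, 453, 379, 641, 673, 351, 360, 303, 536, 413, 508, 413, 353]
FEMALE_F1_DIR = [492, 416, 393, 710, 781, 357, 391, 302, 567, 433, 489, 453, 358]
FEMALE_F1_TEL = [523, 441, 414, 726, 765, 376, 430, 343, 584, 457, 525, 478, 396]


def percent_shift(direct, telephone):
    """Shift of the telephone value relative to the direct value, in per cent."""
    return [100.0 * (t - d) / d for d, t in zip(direct, telephone)]


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r_male = pearson(MALE_F1_DIR, percent_shift(MALE_F1_DIR, MALE_F1_TEL))
r_female = pearson(FEMALE_F1_DIR, percent_shift(FEMALE_F1_DIR, FEMALE_F1_TEL))

print(f"male subjects:   r = {r_male:.2f}")
print(f"female subjects: r = {r_female:.2f}")
```

Both coefficients come out strongly negative, i.e. the lower the F1 of a vowel in the direct recording, the larger its relative upward shift in the telephone recording.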
The picture presented here of the influence of the telephone transmission seems rather clear and thus the question may be asked whether
there is a simple method for its compensation. It should be kept in mind,
however, that the results presented so far are based on averages, either
over (groups of ) subjects or vowel types. An examination of the individual
subjects’ data, however, shows how differently the reported phenomena
may appear in the sense that some speakers are affected more than others
by the telephone transmission. This fact is illustrated in Figure 8, which
shows for the male speakers the dot-density distributions of the deviations
of F1 (telephone–direct) for five phonological vowel types. In order to
Figure 4 Quotients of direct and telephone recorded F2 centre
frequencies for thirteen vowels (vertical axis: F2 tel/direct; separate series for female and male subjects)
Figure 5 Quotients of direct and telephone recorded F1 centre
frequencies for thirteen vowels (separate series for female and male subjects)
Figure 6 Formant chart for male subjects (vertical axis: F1 [Hz])
h = direct recording, T = telephone recording
Figure 7 Formant chart for female subjects (vertical axis: F1 [Hz])
h = direct recording, T = telephone recording
Figure 8 Dot-density functions of F1-differences in direct and telephone
recordings. Data for five vowel types from ten male subjects
(ʊ, u:; ɔ, o:; a, a:; ɛ, e:, ɛ:; ɪ, i:)
make all vowels comparable, the same frequency scale was used for the
abscissae.6 It is evident that subjects exhibit a great degree of variance in
terms of the size of the deviations. For /u/ they range from -147 Hz to +36
Hz, and even for /a/ the range is from -120 Hz to +90 Hz. The positive
figures show that there are even cases (though very few) where some
formant centre frequencies are lower in the telephone condition. It can
also be observed that within one and the same subject the F1 of /i/ may be
affected more strongly by the telephone effect than the F1 of /u/, even
though their centre frequencies (according to the direct recording) may be
the same. Such a result can be explained by the different relative intensity
levels and bandwidths of the first formants of both vowels (Stevens and
House 1961: 314). The detrimental consequence of these two findings for
the present question is, however, that it will not be possible to develop the
simple kind of compensatory algorithm that might be suggested on theoretical grounds, i.e. a filtering that is inverse to the slope of the telephone
channel.
DISCUSSION
The results of the present study are clear. In hindsight they may even seem
trivial, considering the simple theoretical assumptions presented at the
outset. Yet it would seem that hitherto little attention has been paid to
what may be called the telephone effect. The fact that low formants of
vowels of male and female speakers are shifted upwards and thus cause
faulty measurements is relevant in several respects. Let us first consider
modern forensic speaker identification as it is carried out today, namely
using acoustic–phonetic or semi-automatic methods. As was mentioned
earlier the standard setting for voice comparisons implies that telephone-transmitted material is involved. As long as both questioned and reference
material in a case were recorded via telephone there would not be much of
a problem – provided that the different telephone channels do not differ
too much in terms of their acoustic properties. There is hope but no certainty (see last paragraph) that ISDN transmission has taken the edge off
this issue. If only part of the speech material was recorded over the telephone,7 for instance an anonymous call, and the reference sample was
recorded directly from a suspect, for instance during a police interview,
then the problem of shifted formants is inherent. Errors due to unnoticed
formant shifts may ultimately lead to false identifications or false rejections, depending upon whether correspondences or discrepancies of
measurement values are created that way. Fortunately, many forensic
experts, including the present author, have always declined to use formant
frequencies as parameters for voice comparisons, but the proponents of
the so-called voiceprint method are by no means the only ones who have
been using this kind of spectral information for a long time. There are also
at least two well-known semi-automatic speaker identification systems,
Italian IDEM (Falcone and De Sario 1994, Falcone et al. 1995) and
Russian DIALECT (Popov et al. 1999) that use formant-related parameters
for comparison. The current findings strongly suggest that in cases containing both types of speech recordings centre frequency and bandwidth of
F1 should not be used for the analysis.
The same conclusion must be drawn with respect to the use of telephone-recorded speech samples in dialectology, particularly if used in lieu
of, not simply as a complement to, directly recorded samples. In analogy
to the case of forensic speaker identification the problem is equally relevant if samples recorded in the ‘traditional’ way, i.e. directly, in
face-to-face interviews, are to be compared with telephone-recorded data.
Here, an additional argument comes into play that extends more generally
into the field of speech perception: If telephone transmission causes F1 of
a vowel such as [i] to be shifted upwards by as much as 13 per cent, with
F2 remaining more or less unaltered around 1900–2000 Hz, the resulting
F1–F2 pattern comes close to that of an [ɪ].8 Such a phonetic difference
may at first seem minute but it is of great interest to a dialectologist
because it may be the nucleus of a sound change in progress or just a difference between two dialects. It would be more than disappointing to the
dialectologist if what she/he has perceived as an open [ɪ] – and, ironically,
perhaps ‘verified’ by hard acoustic evidence such as formant frequencies –
is but an artefact of a telephone-transmitted speech recording. This
problem is not just a theoretical possibility, but has been confirmed informally using the vowel in ‘Sie’ as produced by male subject no. 3. When
both direct and telephone-recorded /i:/ are trimmed to identical durations
and spliced together with a silent interval of one second between them, a
trained listener clearly hears a sequence of [i:] and [ɪ]. This procedure is
similar to that used by Flanagan (1955) who found that the difference
limens for vowel discrimination are as small as 3 per cent for F1 and F2.
The fluent speech material of the present study precludes a more formalized testing of this question since the open German vowels are much
too short in relation to their close counterparts. Sustained isolated vowels,
or, following Flanagan’s approach, synthetic vowels, will serve the
purpose better. Another finding of the present study that should be of
interest to empirical dialectology is that the automatic algorithm for
formant extraction used in this study does not work reliably enough – particularly if the speech signal is telephone-transmitted – to use it in lieu of
visual extraction by the trained observer. The problem is anything but
novel to forensic phonetics and applies in principle to all formant
extraction algorithms. It is for this reason that semi-automatic speaker
identification systems like IDEM contain sophisticated procedures for the
manual (or rather, ‘visual’) post-hoc control and correction of automatically derived formant frequencies, mostly on the basis of long-term
average spectra (LTAS) that are presented to the observer and can be
manipulated at their discretion. The current state of the art in automatic
formant extraction reminds one of the situation that existed for many
years regarding the automatic extraction of the fundamental frequency:
there were numerous approaches (see Hess 1983 for an extensive survey),
but no single algorithm would produce ‘correct’ results for all speech
sounds, speakers and transmission conditions.
Figure 9 Spectrogram of the utterance ‘Sie wurden einig’ produced by
male subject no. 3
Since vowels were in the centre of this investigation the effects of telephone transmission have been limited to the lower slope of the band-pass.
The study of consonants requires also a look at the upper end of the transmission band. However, such a focus may also provide more evidence on
artefacts affecting the higher frequencies of vowels. Figure 9 gives an
example. The left side of the spectrogram contains the utterance ‘Sie
wurden einig’ produced by a male speaker in the direct-recorded condition, the right side contains the identical utterance as recorded via the
ISDN transmission line. The [i] of the direct recording exhibits three resonances in an area between ca. 2500 and 3600 Hz. In the telephone-recorded signal, however, there is an additional resonance between 2850
and 3100 Hz with a centre at about 2960 Hz (marked ‘?’). Identical
hardware was used for the recording of all subjects, yet this resonance occurs
throughout the whole recording of that speaker alone; it is therefore quite
obviously an artefact of the particular transmission channel. It is clearly visible,
for instance, in the words ‘wurden einig’ (see horizontal line). The additional resonance, which moreover has a rather strong amplitude, must
necessarily lead to an error in the measurement of frequency components
of any speech sound, not only vowels, in an area near the upper end of the
telephone band-pass. This finding certainly emphasizes the old argument
(proposed by engineers; see footnote 1), formerly applied to analogue
transmission systems, against the use of any formant frequencies in
forensic speaker identification. For the same reasons this argument may be
extended to other phonetic applications, including empirical dialectology
of the kind that provided the stimulus for the present investigation. At the
same time, the assumption that the characteristics of ISDN channels are in
principle identical must be called into question, even though the data
on the ten public telephone boxes presented above would support it. In
order to assess the probability of such artefacts occurring in a
telecommunications network, a large-scale empirical study of transmission
characteristics will have to be undertaken.
NOTES
1 This argument does not apply to many acoustics engineers, though. For instance,
for those working on telephone-based automatic speaker recognition systems
this problem has been one of the reasons for never even considering using
formant information for feature vectors.
2 Similar to Moye (1979), this study is confined to the more technical aspects of
telephone transmission. It goes without saying that other peculiarities of telephone speech exist that may seriously affect an individual’s speech behaviour.
For instance, Hirson et al. (1995) have shown that in telephone interviews the
fundamental frequency of speakers increases significantly in relation to face-to-face interviews.
3 Here, the upper cut-off frequency is down to 3200 Hz again, making distinctions between high-frequency speech sounds such as /s/ and /f/ even more
difficult than with ordinary landlines. The most serious problem caused by GSM
transmission is certainly the distortion induced by the data-reduction algorithm
that is used for breaking down and reassembling the speech signal at either end
of the transmission path. Although there does not seem to be a systematic study
of such effects yet, extensive experience with casework has shown that speaker
identification is generally more difficult on the basis of GSM-transmitted speech
samples. What makes the problem even more severe, however, is that false-identification type errors seem to increase more than false eliminations. Members of
the judiciary have therefore been warned of this problem, which is typically pertinent when speech recordings are played to witnesses in court with the aim of
having them identify voices (Künzel 1997: 100).
4 It is well known that formants of speech produced by male voices are easier to
measure in broad-band spectrograms than those produced by female voices. The
simple reason is that the average fundamental frequency, F0, of the former is
almost an octave lower than that of the latter. Since formants can be conceived
of as ‘bundles’ of harmonics, one can say that the lower F0 is, the more of its
harmonics are contained in a bundle of a certain size (bandwidth). Using the
typical filter bandwidth of 300 Hz, the first formant of an [i] (which has a centre
frequency of, say, 270 Hz and a bandwidth of ca. 300 Hz) produced by a male
speaker with an F0 of, say, 100 Hz will mainly comprise 3 harmonics (200, 300, 400 Hz) that are blurred
into one band whose peak (centre frequency) is easy to determine. If F0 is 200
Hz, however, the first formant of the same vowel will comprise only F0 itself
and one harmonic (200, 400 Hz). Since these frequencies are so widely spaced
they will not merge into one band, which makes it difficult to determine a
formant (bandwidth and peak). Selecting a larger filter bandwidth can only partially compensate for the effect. See also the example in Baken (1987: 352, figure
9-26).
5 Der Nordwind und die Sonne. Einst stritten sich Nordwind und Sonne, wer von
ihnen beiden wohl der Stärkere wäre, als ein Wanderer, der in einen warmen
Mantel gehüllt war, des Weges daherkam. Sie wurden einig, dass derjenige für
den Stärkeren gelten sollte, der den Wanderer zwingen würde, seinen Mantel
auszuziehen. Der Nordwind blies mit aller Macht, aber je mehr er blies, desto
fester hüllte sich der Wanderer in seinen Mantel ein. Endlich gab der Nordwind
den Kampf auf. Da erwärmte die Sonne die Luft mit ihren freundlichen Strahlen,
und schon nach wenigen Augenblicken zog der Wanderer seinen Mantel aus. Da
musste der Nordwind zugeben, dass die Sonne von ihnen beiden der Stärkere
war.
6 Each dot corresponds to one measurement value from one male subject. The
varying total number of dots per diagram is the result of different numbers of
occurrences of the vowels (see Introduction). In the case of [a, a:] there are
(7 + 1) × 10 subjects = 80 dots, whereas for [U, u:] there are only (1 + 1) × 10 = 20.
7 According to Hirson, French and Howard (1995: 230) this applies to ‘well over
90 per cent of cases’.
8 Simpson (1998: 217) has found the following average formant values in the
spontaneous speech of male German subjects (all in Hz): F1: [i] 309, [I] 353,
a difference of 14 per cent; F2: [i] 2039, [I] 1801. Individual measurements
show a considerable overlap between both formants of either vowel. The corresponding values for these vowels of female speakers, however, differ more: F1:
[i] 330, [I] 418, a difference of 26 per cent; F2: [i] 2371, [I] 2093. The spontaneous data from Simpson are not quite comparable with the data provided by
Delattre (1956: 49), who measured German vowels [i] and [I] in isolation: F1:
275 vs. 325, a difference of 18 per cent; F2: 2250 vs. 2100.
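The harmonic-counting argument of note 4 can be checked with a few lines of arithmetic. The sketch below lists the multiples of F0 that fall inside a band of the given width around the formant centre. The function name `harmonics_in_band` is hypothetical, and the male F0 of 100 Hz is an assumption inferred from the harmonics quoted in the note (200, 300, 400 Hz); the centre frequency of 270 Hz, the bandwidth of 300 Hz and the F0 of 200 Hz are taken from the note itself.

```python
import math

def harmonics_in_band(f0_hz, centre_hz, bandwidth_hz):
    """Return the multiples of f0_hz lying within centre_hz ± bandwidth_hz / 2."""
    lo = centre_hz - bandwidth_hz / 2
    hi = centre_hz + bandwidth_hz / 2
    first = max(1, math.ceil(lo / f0_hz))   # lowest multiple at or above the band edge
    last = math.floor(hi / f0_hz)           # highest multiple at or below the band edge
    return [k * f0_hz for k in range(first, last + 1)]

# F1 of [i]: centre 270 Hz, bandwidth 300 Hz, i.e. a band from 120 to 420 Hz
low_f0 = harmonics_in_band(100, 270, 300)    # assumed male F0 of 100 Hz
high_f0 = harmonics_in_band(200, 270, 300)   # F0 of 200 Hz as in the note
print(low_f0)    # [200, 300, 400]
print(high_f0)   # [200, 400]
```

With three components spaced 100 Hz apart, the band has a clear maximum; with only two components 200 Hz apart, the peak position is poorly defined, which is the difficulty the note describes.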
REFERENCES
Baken, R. J. (1987) Clinical Measurement of Speech and Voice, London:
Taylor & Francis.
Bellmann, G., Herrgen, J. and Schmidt, J. E. (1994–99) Mittelrheinischer
Sprachatlas (MRhSA), Band 1–3: Vokalismus. Tübingen.
Cooper, F. S., Delattre, P., Liberman, A. M., Borst, J. and Gerstman, L. J.
(1952) ‘Some experiments on the perception of synthetic speech
sounds’, Journal of the Acoustical Society of America, 24: 597–606.
Delattre, P. C., Liberman, A. M., Cooper, F. S. and Gerstman, L. J. (1952)
‘An experimental study of the acoustic determinants of vowel color’,
Word, 8: 195–210.
Delattre, P. C. (1965) Comparing the Phonetic Features of English, French,
German and Spanish, Heidelberg: Groos.
Falcone, M. and De Sario, N. (1994) ‘A PC based speaker identification
system for forensic use’ in Proceedings of the ESCA Workshop on Automatic Speaker Recognition, Identification and Verification, Martigny,
169–72.
Falcone, M., Paoloni, A. and De Sario, N. (1995) ‘IDEM: a software tool
to study vowel formants in speaker recognition’ in Proceedings of
ICPhS ’95 Stockholm, vol. 3, 294–7.
Flanagan, J. (1955) ‘A difference limen for vowel formant frequency’,
Journal of the Acoustical Society of America, 27: 613–17.
Hess, W. (1983) Pitch Determination of the Speech Signal, Berlin:
Springer-Verlag.
Hirson, A., French, P. and Howard, D. (1995) ‘Speech fundamental frequency over the telephone and face-to-face: some implications for
forensic phonetics’ in J. W. Lewis (ed.) Studies in General and English
Phonetics. Essays in Honour of J. D. O’Connor, London/New York:
Routledge, 230–40.
Kohler, K. (1977) Einführung in die Phonetik des Deutschen, Berlin:
Schmidt.
Künzel, H. J. (1997) ‘Methoden der forensischen Sprecher-Erkennung’,
Strafverteidiger Forum, 5: 100–105.
Labov, W., Ash, S. and Boberg, C. (forthcoming) Atlas of North American
English, Berlin/New York: de Gruyter.
Moye, L. S. (1979) Study of the Effects on Speech Analysis of the Types of
Degradation Occurring in Telephony, Harlow, Essex: Standard Telecommunication Laboratories.
Neppert, J. and Petursson, M. (1986) Elemente einer akustischen Phonetik,
Hamburg: Buske.
Peterson, G. E. and Barney, H. L. (1952) ‘Control methods used in a study
of the vowels’, Journal of the Acoustical Society of America, 24: 175–84.
Popov, N. F., Linkov, A. N., Fesenko, A. V., Kurachenkova, N. B.,
Baicharov, N. V., Karlin, I. P., Timofeev, I. N. and Potapova, R. K.
(1999) ‘The interactive expert system for forensic speaker identification
used in Russia’ in Proceedings of the International Workshop Speech and
Computer, Moscow, 37–53.
Potter, R., Kopp, G. and Green, H. (1947) Visible Speech, New York: Van
Nostrand.
Simpson, A. (1998) Phonetische Datenbanken des Deutschen in der
empirischen Sprachforschung und der phonologischen Theoriebildung,
Arbeitsberichte des Instituts für Phonetik Kiel, (AIPUK) vol. 33.
Stevens, K. N. and House, A. S. (1955) ‘Development of a quantitative
description of vowel articulation’, Journal of the Acoustical Society of
America, 27: 484–93.
Stevens, K. N. and House, A. S. (1961) ‘An acoustical theory of vowel production and some of its implications’, Journal of Speech and Hearing
Research, 4: 303–20.