Beware of the ‘telephone effect’: the influence of telephone transmission on the measurement of formant frequencies

Hermann J. Künzel
Department of Phonetics, University of Marburg

ABSTRACT

Speech scientists often have to work with speech signals that have been transmitted over the telephone. Although the acoustic properties of telephone transmission such as the band-pass filter characteristics are well known, little attention has been paid to their effect on the measurement of speech parameters.¹ This study deals with artefacts introduced by the lower cut-off slope of the transmission channel on vowel formants. For theoretical reasons, frequency components may be assumed to be attenuated more strongly the lower they are. Therefore F1 of most vowels can be expected to be affected most. Attenuation of the lower components of a formant will necessarily increase the relative weight of the higher components for the determination of a formant and thus cause an artificial upward shift of its centre frequency. An empirical investigation with directly recorded and telephone-transmitted samples from ten male and ten female subjects shows that the predicted effect on F1 does in fact occur for all tested vowels except /a/, whose F1 is too high to be affected by the slope of the band-pass. The consequences of measurement errors arising from such artefacts are discussed with special reference to speaker identification and empirical dialectology.

KEYWORDS telephone transmission, spectrographic analysis, spectrographic shifting, forensic speaker recognition, dialectology

© University of Birmingham Press 2001 Forensic Linguistics 8(1) 2001 1350-1771

INTRODUCTION

The various effects of telephone transmission on speech have been described comprehensively in the classical study by Moye (1979).² Although some of the typical disturbances such as high levels of background noise or higher-order distortions due to carbon microphones have become less of a problem with today’s digital landline telephony, other factors have persisted, particularly the band-pass characteristics of the transmission channel (350–3400 Hz). With the rise of Global System for Mobile Communications (GSM) telephony, new types of distortion have arisen that may seriously degrade speech transmission and also negatively affect both forensic and commercial speaker identification.³

But speaker identification is by no means the only phonetic application that is affected by telephone transmission. As a matter of fact, the very stimulus for the present study came from empirical dialectology. One of the authors of the Mittelrheinischer Sprachatlas (Bellmann, Herrgen and Schmidt 1994–99) is planning to re-interview a set of sample speakers who were used for the original recordings (1981–88) with the aim of gaining a diachronic view of dialectal change within different generations of the same population. In an attempt to economize on both financial resources and time for data collection and analysis, J. Schmidt originally intended to adopt the procedure used previously by Labov et al. (forthcoming) for their Atlas of North American English, i.e. (a) to conduct telephone interviews rather than face-to-face interviews to record the acoustic data, and (b) to use an automatic algorithm for the determination of formant centre frequencies.
Therefore, he put the question to the present author whether there were, from a phonetic point of view, any general reservations about a (landline-) telephone-based acquisition of speech data that were to be analysed both acoustically and auditorily, and, in particular, that were to be compared with the previous data, which had been recorded directly. If the question is restricted to the band-pass filter effect of the transmission channel and focused on vocalic sounds it may be formulated in phonetic terms as follows: will the measurement of formant frequencies be influenced by telephone transmission? Considering further that the first two or three formants of any vowel are below the upper cut-off frequency of a standard telephone channel, the investigation may be limited to the lower cut-off region.

Therefore, the first step was to determine empirically the typical transmission characteristics of standard landline telephony, including typical standard hardware at both ends. Since analogue telecommunication networks have now been replaced by digital ones in Germany, as in most other western industrialized countries, it was considered useful to concentrate on the main digital technique (ISDN). In order to address this question, data from a hitherto unpublished study on the reliability and variability of digital telephone equipment and ISDN transmission characteristics were used. A total of ten card and/or coin telephones (public telephone booths) inside the same local network (city of Wiesbaden) were investigated. As part of the test, white noise was fed for 60 seconds into the microphone of the handset at a moderate amplitude via an essentially linear loudspeaker (high-quality headphone). At the other end of the line the signal was recorded onto a DAT recorder that was wired to a standard digital telephone set.
With the exception of the loudspeaker and the microphone capsule of the telephone unit at the front end, no analogue components were used in the set-up. Since the spectral shapes of the transmitted test signal were almost identical in all trials any one of them may be taken as representative of all. Figure 1 shows the frequency response of one particular public card telephone to the white noise. For the purpose of the present question it is important to note that the amplitude of the signal rises approximately 30 dB from the frequency of the 50 Hz mains hum to the left cursor (400 Hz). For an interval of approximately 3000 Hz, i.e. up to the right cursor (3400 Hz), the frequency response is rather flat. The slight extra increase (3 dB) between ca. 400 and ca. 1200 Hz may be due to an artefact caused by the loudspeaker.

Figure 1 Spectrum of white noise after ISDN transmission using standard digital telephone hardware. Left and right cursor are fixed at 400 and 3400 Hz, respectively. One vertical division corresponds to 2.5 dB. The spike close to the left margin is caused by mains hum (50 Hz).

With regard to the initial question it may be expected that the measurement of vowel formants below 400, perhaps even 500 Hz, could be affected by the slope of the transmission channel in the following way. The calculations of the bandwidth as well as of the centre frequency of a formant are based on an averaging process over a number of neighbouring harmonics. The amplitude, i.e. the weight of these harmonics, will decrease if they come within the slope of the spectrum of the transmission channel, i.e. below 400 or 500 Hz. The lower the frequency, the stronger the attenuation will be. Thus, the relative weight of the higher harmonics of a formant, particularly those above the slope, will be increased.
At the same time the bandwidth of the formant is reduced from the bottom end, which will artificially shift its centre frequency upwards. It goes without saying that in broad-band spectrograms the effect will also depend on the filter bandwidth that is selected. The principle is illustrated in Figure 2. In the right-hand part of the figure, the low frequency area (black) is most severely attenuated since it falls into the steep part of the slope of the telephone filter function (S). A smaller frequency band (white) above the first one is less strongly affected since the slope of S is already less steep here. Harmonics in the grey-shaded area, i.e. in the horizontal part of the spectrum, will not be affected at all. It should also be noted that the slope falls within the ranges of both male and female speakers’ fundamental frequency (F0) values. The general problem with formant determination for female voices is addressed in Footnote 4.

Figure 2 Shift of formant bandwidth and centre frequency as an artefact of telephone transmission

As an implication of the general acoustic theory of vowel production (e.g. Stevens and House 1955, 1961) and the data on formants that have been provided since the early days of sound spectrography and pattern playback (Potter et al. 1947, Peterson and Barney 1952, Cooper et al. 1952, Delattre 1965, Delattre et al. 1952; see also the survey in Baken 1987: 356–9 and more recent data for German in Kohler 1977: 54, Neppert and Petursson 1986: 124–5, Simpson 1998: 214–17) it can be expected that the first formant of high and mid vowels in particular will be affected by what might be called the ‘telephone effect’.
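
The predicted upward shift can be illustrated numerically. The following is only a minimal sketch under stated assumptions and not the measurement procedure of this study: it assumes a hypothetical F0 of 100 Hz, a Gaussian-shaped formant envelope, and a channel whose gain rises linearly in dB by 30 dB between 50 and 400 Hz and is flat above (an idealisation of Figure 1). The centre frequency is taken as the amplitude-weighted mean of the harmonics. All names and constants are illustrative.

```python
import math

F0_HZ = 100.0      # assumed fundamental frequency (hypothetical male voice)
SIGMA_HZ = 100.0   # assumed width of the formant envelope
CORNER_HZ = 400.0  # frequency above which the channel is taken to be flat
FLOOR_HZ = 50.0    # lower end of the idealised slope
RISE_DB = 30.0     # total rise of the slope, as read off Figure 1


def envelope(f_hz, centre_hz):
    """Gaussian-shaped formant envelope (an assumed shape, chosen for simplicity)."""
    return math.exp(-0.5 * ((f_hz - centre_hz) / SIGMA_HZ) ** 2)


def channel_gain(f_hz):
    """Linear-in-dB slope between FLOOR_HZ and CORNER_HZ, unity gain above."""
    if f_hz >= CORNER_HZ:
        return 1.0
    f = max(f_hz, FLOOR_HZ)
    attenuation_db = RISE_DB * (CORNER_HZ - f) / (CORNER_HZ - FLOOR_HZ)
    return 10.0 ** (-attenuation_db / 20.0)


def centre_of_gravity(centre_hz, through_channel, max_hz=1500.0):
    """Amplitude-weighted mean frequency of all harmonics up to max_hz."""
    weighted_sum = 0.0
    total = 0.0
    k = 1
    while k * F0_HZ <= max_hz:
        f = k * F0_HZ
        amp = envelope(f, centre_hz)
        if through_channel:
            amp *= channel_gain(f)
        weighted_sum += amp * f
        total += amp
        k += 1
    return weighted_sum / total


def shift_ratio(centre_hz):
    """Telephone centre of gravity divided by the direct one."""
    return centre_of_gravity(centre_hz, True) / centre_of_gravity(centre_hz, False)


for label, f1 in (("close vowel, F1 near 300 Hz", 300.0),
                  ("open vowel, F1 near 700 Hz", 700.0)):
    print(f"{label}: tel/dir = {shift_ratio(f1):.3f}")
```

Under these assumptions the ratio is well above 1 for the low formant and practically 1 for the high one. The size of the shift depends entirely on the assumed envelope and slope, so only the direction of the effect should be read from the sketch.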
If this hypothesis is confirmed, an important question will arise: What are the potential perceptual consequences of a rise of F1, particularly in the light of early results by Flanagan (1955), who found that changes in F1 or F2 centre frequency of as little as 3 to 5 per cent are detected consistently by listeners? In other words: Will a vowel such as close [i] be perceived as more open [ɪ] when it is transmitted over the telephone? Generally, F2 and higher formants should remain unaffected by the telephone effect as long as they remain within the frequency band transmitted by the telephone channel.

If the rationale of the experiment had been to test the ‘telephone effect’ per se, measurements could have been facilitated if isolated vowels, preferably produced by male speakers,⁴ or even synthetic vowels, had been used as stimulus material. Regarding the forensic and linguistic applications mentioned earlier, however, it was considered useful deliberately to include other factors, such as phonetic and phonological context, vowel duration, different speed of talking, speaker sex and the like. These factors might individually and/or in combination produce so many artefacts, i.e. increase the variability of formant measurements so much, that they might override the telephone effect in praxi, in which case the phenomenon might be ignored for most investigations. Therefore it was decided to work with natural fluent speech.

EXPERIMENT

Materials

Ten male and ten female subjects were used for the production of the speech material. The respective age ranges were twenty to fifty (average of twenty-eight years) and twenty to fifty-nine (average of thirty-one years). The subjects were asked to read the standard text ‘The North Wind and the Sun’ in German at normal speed and loudness level.
The readings lasted between 35 and 40 seconds.⁵ The speech signal was recorded simultaneously onto two DAT recorders (Sony TCD D7) (a) over a high-quality condenser microphone (Sony ECM 959) and (b) over a standard digital telephone handset (Alcatel 4400) which was connected to another digital extension via a standard ISDN transmission line. For each recording a new local call was dialled.

The sound system of standard German (see, for instance, Kohler 1977: 175) contains fifteen monophthongs plus /ə/, the reduced variant of any of these. A general phonological rule determines that long vowels are close, short vowels are more open. However, long and short /a/ do not differ phonetically, and there is a long variant of /ɛ/. The central unstressed vowel /ə/ is always short. Since /ø:, œ/ and /y:/ are not contained in the read text, a total of thirteen different vowels were selected, most of them occurring more than once in different phonetic environments. The total number of test tokens was twenty-nine (4 × /i:/, 3 × /ɪ/, 1 × /e:/, 2 × /ɛ/, 1 × /ɛ:/, 7 × /a/, 1 × /a:/, 1 × /o:/, 2 × /ɔ/, 1 × /u:/, 1 × /ʊ/, 2 × /ʏ/, 3 × /ə/; these vowels are underscored in the text which is contained in Footnote 5).

Method

Each subject’s speech was recorded onto a computer (22,050 samples/s, 16 bit) for further processing. Segmentation and acoustic measurements of the vowels were made with the KAY Multispeech Software. Centre frequencies for F1 and F2 were measured from broad-band spectrograms in two ways: (a) by visual determination of the ‘centre of gravity’ by an experienced interpreter of spectrograms (author), (b) automatically, using the specific autocorrelation function (‘formant history’) provided by the Multispeech software. Results of the automatic determination can be monitored by means of small dots that mark all frequency–time coordinates that are located by the algorithm (one per cycle if the amplitude surpasses a certain threshold).
Numerical values gained by both techniques were usually derived at the centre of each steady-state portion using the cross-cursor function. A total of 2320 measurements were made (29 vowels, 20 speakers, 2 conditions, 2 formants).

In the course of the analysis a number of problems arose. First, the automatic function for the determination of formant centre frequencies failed to provide any, or any reasonable, results in up to 50 per cent of the cases, depending on the individual speaker. Generally, female speakers were affected more than males. The majority of measurements that were considered ‘not reasonable’ consisted of cases where the algorithm ‘disregarded’ the formant in question and picked a higher formant or single harmonic instead. The problem was particularly frequent with the telephone data. For this reason the decision was taken to dispense with the automatic determination of formant frequencies altogether and rely exclusively on the interactive visual method.

A typical example of the problem is provided in Figure 3, which contains spectra of the word ‘daherkam’ produced in both conditions. In the direct recording (bottom) the first formant of [a:] in the syllable [kʰa:m] is clearly visible but was not identified by the algorithm (no marker dots). F2 is marked but the calculated frequency values (dots) are far too low. The reason is obviously that the algorithm has taken into account some of the frequency components that were disregarded for F1. Interestingly enough, in the spectrogram of the telephone recording of the same vowel (top) both F1 and F2 are marked. The frequency values seem far too low, however, and the F1 marker dots are scattered too widely in the vertical dimension to provide a reasonable average.

It is well known to anyone acquainted with the analysis of spectrograms that the visual determination of formants may also involve problems.
In the current data some formants in certain vowels could not be identified unequivocally because (a) amplitudes were generally too low, a phenomenon that also has a speaker-specific aspect, or (b) the high F0 of female speakers sometimes made it impossible to integrate enough harmonics into a formant to determine its ‘centre’, even after optimizing the settings (see Footnote 4). As a result of these problems there was a total of nine missing values for three of the female subjects, all involving F2 of [i:], and one for a male subject involving [ʊ].

Figure 3 Spectra of the word ‘daherkam’ produced by the same male speaker and recorded in direct (bottom) and telephone condition (top)

A different problem, i.e. a supplementary source of variability, was a consequence of the general decision to use natural fluent speech rather than isolated vowels as speech material. Many vowels are so short that there are none of the above-mentioned central or steady-state portions that are conventionally used for the measurement of formants. Phonologically long vowels in unstressed, and sometimes even in (sentence-)stressed positions, may be quite short phonetically, for instance the /o:/ in ‘wohl’ or the /i:/ in ‘blies’, not to mention the phonologically short vowels. In such cases the phonetic context is of special importance for the point at which a measurement is taken: for /o/ in ‘wohl’ formants were measured right at the start of the vowel since F2 moves sharply upwards from the beginning in anticipation of the lateral consonant. On the other hand, the formants of /i/ in ‘sich’ were measured near the end of the vowel in order to exclude the influence of the initial /s/. Here, the final /ç/ can be considered ‘neutral’ because it is homorganic with /i/. Such context-based phonetic criteria were applied consistently to all relevant cases.
In order to determine the size of variability introduced by the author, who made all the measurements from the spectrograms, the data from female subject no. 6, who was among the more ‘difficult’ ones, were gathered a second time after the completion of all measurements. Results showed that there were no significant differences for F1 or F2 centre frequencies as a function of the first or second analysis (for F1: t = -0.65, p = 0.51; for F2: t = 1.2, p = 0.24). The average deviation was -3.5 Hz for F1 (95 per cent confidence interval -14.5 to +7.5 Hz) and 7.0 Hz for F2 (95 per cent confidence interval -4.9 to +18.9 Hz).

RESULTS

First, the question was asked for each subject whether there were differences in terms of the frequencies of the first two formants as a function of recording condition in the set of twenty-nine test tokens. In other words, at this stage of the investigation no sub-sets of same (phonological) vowels were formed. A series of Wilcoxon matched-pairs signed ranks tests were used. The two-tailed results are summarized in Table 1.

Table 1 Results of Wilcoxon matched-pairs signed ranks tests for differences of F1 and F2 of twenty-nine test items as a function of recording condition (direct/telephone). Significance levels: * < 0.05, ** < 0.01 and *** < 0.001; ns = not significant.

Males
Subject   3    4    5    11   12   15   16   18   20   21
F1        ***  *    ***  *    ***  ***  *    ***  ***  ***
F2        ns   ns   ns   ns   ns   ns   ns   ns   ns   **

Females
Subject   2    6    7    8    9    10   13   14   17   19
F1        **   ***  ***  ***  ***  ***  ***  ***  **   **
F2        ns   ns   ns   ns   ns   ns   ns   ns   ns   ns

The data show: (1) For every single subject of either sex there were significant differences in F1-values, which, as a matter of fact, always proved to be higher in the telephone recording condition. This fact is all the more remarkable since eight instances of the open vowel /a:/ or /a/ were contained in the set of twenty-nine.
As a result of their high F1 values they could be expected to be less affected or not affected at all by the telephone effect. (2) Except for one male subject whose values for the telephone condition are 19 Hz higher on average for no obvious reason (see discussion), no significant differences of F2 centre frequencies were found.

Regarding the initial hypothesis the central question was how the influence of the telephone band-pass would affect the different vowels. For this purpose the twenty-nine tokens were regrouped according to the thirteen vowel phonemes. Table 2 contains the data for both sex groups.

Table 2 Average centre frequencies of F1 and F2 of thirteen German vowels in direct and telephone recording conditions for ten male and ten female subjects

Male subjects
vowel  F1dir  F1tel  F2dir  F2tel  F1%tel/dir  F2%tel/dir
ɛ      435    463    1668   1674   106.4       100.4
ɛ:     416    453    1861   1852   108.9       99.5
ʏ      349    379    1490   1496   108.6       100.4
a      628    641    1285   1288   102.1       100.2
a:     670    673    1293   1306   100.4       101.0
e:     334    351    1935   1933   105.1       99.9
ɪ      323    360    1698   1692   111.5       99.6
i:     272    303    2056   2051   111.4       99.8
ɔ      519    536    1110   1131   103.3       101.9
o:     403    413    867    848    102.5       97.8
ə      491    508    1391   1391   103.5       100.0
ʊ      366    413    926    926    112.8       100.0
u:     322    353    850    854    109.6       100.5

Female subjects
vowel  F1dir  F1tel  F2dir  F2tel  F1%tel/dir  F2%tel/dir
ɛ      492    523    2021   2051   106.4       101.5
ɛ:     416    441    2260   2261   106.0       100.1
ʏ      393    414    1684   1662   105.4       98.7
a      710    726    1505   1499   102.3       99.6
a:     781    765    1462   1446   97.9        98.9
e:     357    376    2241   2272   105.5       101.4
ɪ      391    430    2044   2052   110.0       100.4
i:     302    343    2369   2385   113.6       100.7
ɔ      567    584    1302   1280   103.0       98.3
o:     433    457    967    977    105.6       101.1
ə      489    525    1676   1650   107.3       98.4
ʊ      453    478    1039   1041   105.6       100.2
u:     358    396    994    983    110.6       98.8

At this juncture it should be reiterated that some of the values represent a single measurement per subject whereas others are based on as many as seven per subject, depending on the number of occurrences of the particular sound throughout the
text. The two right-hand columns contain the results of a division of the telephone data by the corresponding direct data of the respective formants. The quotients for F2 (‘F2%tel/dir’) are illustrated in Figure 4. It is obvious that they fluctuate around the 100 per cent line, i.e. there is no tendency towards either direction (maximum deviation: 2.2 per cent). In other words, as could be expected, the measurement of F2 centre frequencies has not been influenced by the recording condition.

Results are dramatically different for F1. With the exception of the female subjects’ long [a:], i.e. the sound that has the highest F1 anyway, the F1 centres of each vowel measured higher in the telephone-transmitted data than in the data recorded directly. It is of particular interest to note that the size of the difference varies between less than one and over thirteen per cent between the two recording conditions and is grosso modo inversely proportional to the absolute level of F1, i.e. the difference is largest for close vowels such as [i] and [u], medium for vowels such as [e] and [o] and smallest or zero for open vowels like [ɔ, a]. The finding applies to both sex groups in essentially the same way, since correlation coefficients for the magnitudes of the F1 differences and the absolute levels of F1 (in the direct-recording condition) are -0.80 for male and -0.86 for female subjects. Figure 5 illustrates the data using the same vertical scale as Figure 4 in order to provide a direct comparison of the vastly different magnitudes of the F1 vs. F2 differences, except for /a/.

On the basis of the data from the first four columns of Table 2, conventional (reversed-scales) F1-F2 vowel charts for the thirteen German vowels were drawn for both sex groups.
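
The F1 quotients and their relation to the absolute level of F1 can be recomputed from the rounded means in Table 2. The sketch below is offered only as a check on the arithmetic. The vowel labels are ASCII stand-ins, the function names are illustrative, and the correlation is computed here between F1 in the direct condition and the percentage difference, which is an assumption about how the reported coefficients were obtained. Because the table means are rounded, the last digit may differ from the printed quotients.

```python
import math

# F1 means in Hz transcribed from Table 2 (order as in the table);
# vowel labels are ASCII stand-ins for the phonetic symbols.
VOWELS = ["E", "E:", "Y", "a", "a:", "e:", "I", "i:", "O", "o:", "@", "U", "u:"]
F1 = {
    "male": {
        "dir": [435, 416, 349, 628, 670, 334, 323, 272, 519, 403, 491, 366, 322],
        "tel": [463, 453, 379, 641, 673, 351, 360, 303, 536, 413, 508, 413, 353],
    },
    "female": {
        "dir": [492, 416, 393, 710, 781, 357, 391, 302, 567, 433, 489, 453, 358],
        "tel": [523, 441, 414, 726, 765, 376, 430, 343, 584, 457, 525, 478, 396],
    },
}


def quotients(tel, direct):
    """Telephone value as a percentage of the direct value, per vowel."""
    return [100.0 * t / d for t, d in zip(tel, direct)]


def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def analyse(group):
    """Return the F1 quotients and their correlation with F1 (direct)."""
    direct = F1[group]["dir"]
    tel = F1[group]["tel"]
    q = quotients(tel, direct)
    r = pearson(direct, [value - 100.0 for value in q])
    return q, r


for group in ("male", "female"):
    q, r = analyse(group)
    print(f"{group}: r = {r:.2f}")
    for vowel, value in zip(VOWELS, q):
        print(f"  {vowel:<3} {value:6.1f}")
```

For both groups the coefficient comes out strongly negative, in line with the inverse relation described above.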
Since in Figures 6 and 7 F1 is noted on the vertical axis the visual impression is essentially that of a downward shift of the whole pattern of the telephone-recorded data (marked ‘T’) in relation to their direct-recorded counterparts (marked ‘h’). Only long [a:] (produced by the female subjects) is an exception as a result of its very high F1. The fact that there is no concomitant sideways shift of the pattern shows that F2 remains unaffected by the telephone effect.

Figure 4 Quotients of direct and telephone recorded F2 centre frequencies for thirteen vowels (vertical axis: F2 tel/direct; female and male subjects plotted separately)

Figure 5 Quotients of direct and telephone recorded F1 centre frequencies for thirteen vowels (female and male subjects plotted separately)

Figure 6 Formant chart for male subjects (vertical axis: F1 [Hz]; h = direct recording, T = telephone recording)

Figure 7 Formant chart for female subjects (vertical axis: F1 [Hz]; h = direct recording, T = telephone recording)

The picture presented here of the influence of the telephone transmission seems rather clear and thus the question may be asked whether there is a simple method for its compensation. It should be kept in mind, however, that the results presented so far are based on averages, either over (groups of) subjects or vowel types. An examination of the individual subjects’ data shows how differently the reported phenomena may appear, in the sense that some speakers are affected more than others by the telephone transmission. This fact is illustrated in Figure 8, which shows for the male speakers the dot-density distributions of the deviations of F1 (telephone–direct) for five phonological vowel types. In order to make all vowels comparable, the same frequency scale was used for the abscissae.⁶

Figure 8 Dot-density functions of F1-differences in direct and telephone recordings. Data for five vowel types from ten male subjects (ʊ, u: - ɔ, o: - a, a: - ɛ, e, ɛ: - ɪ, i:)

It is evident that subjects exhibit a great degree of variance in terms of the size of the deviations. For /u/ they range from -147 Hz to +36 Hz, and even for /a/ the range is from -120 Hz to +90 Hz. The positive figures show that there are even cases (though very few) where some formant centre frequencies are lower in the telephone condition; the deviations are thus expressed as direct minus telephone values. It can also be observed that within one and the same subject the F1 of /i/ may be affected more strongly by the telephone effect than the F1 of /u/, even though their centre frequencies (according to the direct recording) may be the same. Such a result can be explained by the different relative intensity levels and bandwidths of the first formants of both vowels (Stevens and House 1961: 314). The detrimental consequence of these two findings for the present question is, however, that it will not be possible to develop the simple kind of compensatory algorithm that might be suggested on theoretical grounds, i.e. a filtering that is inverse to the slope of the telephone channel.

DISCUSSION

The results of the present study are clear. With hindsight they may even seem trivial, considering the simple theoretical assumptions presented at the outset. Yet it would seem that hitherto little attention has been paid to what may be called the telephone effect. The fact that low formants of vowels of male and female speakers are shifted upwards and thus cause faulty measurements is relevant in several respects.

Let us first consider forensic speaker identification as it is carried out today, namely using acoustic–phonetic or semi-automatic methods. As was mentioned earlier, the standard setting for voice comparisons implies that telephone-transmitted material is involved.
As long as both questioned and reference material in a case were recorded via telephone there would not be much of a problem, provided that the different telephone channels do not differ too much in terms of their acoustic properties. There is hope but no certainty (see last paragraph) that ISDN transmission has taken the edge off this issue. If only part of the speech material was recorded over the telephone,⁷ for instance an anonymous call, and the reference sample was recorded directly from a suspect, for instance during a police interview, then the problem of shifted formants is inherent. Errors due to unnoticed formant shifts may ultimately lead to false identifications or false rejections, depending upon whether correspondences or discrepancies of measurement values are created that way. Fortunately, many forensic experts, including the present author, have always declined to use formant frequencies as parameters for voice comparisons, but the proponents of the so-called voiceprint method are by no means the only ones who have been using this kind of spectral information for a long time. There are also at least two well-known semi-automatic speaker identification systems, the Italian IDEM (Falcone and De Sario 1994, Falcone et al. 1995) and the Russian DIALECT (Popov et al. 1999), that use formant-related parameters for comparison. The current findings strongly suggest that in cases containing both types of speech recordings centre frequency and bandwidth of F1 should not be used for the analysis.

The same conclusion must be drawn with respect to the use of telephone-recorded speech samples in dialectology, particularly if used in lieu of, not simply as a complement to, directly recorded samples. In analogy to the case of forensic speaker identification the problem is equally relevant if samples recorded in the ‘traditional’ way, i.e. directly, in face-to-face interviews, are to be compared with telephone-recorded data.
Here, an additional argument comes into play that extends more generally into the field of speech perception: If telephone transmission causes F1 of a vowel such as [i] to be shifted upwards by as much as 13 per cent, with F2 remaining more or less unaltered around 1900–2000 Hz, the resulting F1–F2 pattern comes close to that of an [ɪ].⁸ Such a phonetic difference may at first seem minute but it is of great interest to a dialectologist because it may be the nucleus of a sound change in progress or just a difference between two dialects. It would be more than disappointing to the dialectologist if what she/he has perceived as an open [ɪ] – and, ironically, perhaps ‘verified’ by hard acoustic evidence such as formant frequencies – is but an artefact of a telephone-transmitted speech recording. This problem is not just a theoretical possibility, but has been verified informally using the vowel in ‘Sie’ as produced by male subject no. 3. When both direct and telephone-recorded /i:/ are trimmed to identical durations and spliced together with a silent interval of one second between them, a trained listener clearly hears a sequence of [i:] and [ɪ]. This procedure is similar to that used by Flanagan (1955), who found that the difference limens for vowel discrimination are as small as 3 per cent for F1 and F2. The fluent speech material of the present study precludes a more formalized testing of this question since the open German vowels are much too short in relation to their close counterparts. Sustained isolated vowels, or, following Flanagan’s approach, synthetic vowels, will serve the purpose better.

Another finding of the present study that should be of interest to empirical dialectology is that the automatic algorithm for formant extraction used in this study does not work reliably enough – particularly if the speech signal is telephone-transmitted – to use it in lieu of visual extraction by the trained observer.
The problem is anything but novel to forensic phonetics and applies in principle to all formant extraction algorithms. It is for this reason that semi-automatic speaker identification systems like IDEM contain sophisticated procedures for the manual (or rather, ‘visual’) post-hoc control and correction of automatically derived formant frequencies, mostly on the basis of long-term average spectra (LTAS) that are presented to the observer and can be manipulated at their discretion. The current state of the art in automatic formant extraction is reminiscent of the situation that existed for many years regarding the automatic extraction of the fundamental frequency: there were numerous approaches (see Hess 1983 for an extensive survey), but no single algorithm would produce ‘correct’ results for all speech sounds, speakers and transmission conditions.

Figure 9 Spectrogram of the utterance ‘Sie wurden einig’ produced by male subject no. 3

Since vowels were at the centre of this investigation, the effects of telephone transmission considered here have been limited to the lower slope of the band-pass. The study of consonants also requires a look at the upper end of the transmission band. However, such a focus may also provide more evidence on artefacts affecting the higher frequencies of vowels. Figure 9 gives an example. The left side of the spectrogram contains the utterance ‘Sie wurden einig’ produced by a male speaker in the direct-recorded condition, the right side contains the identical utterance as recorded via the ISDN transmission line. The [i] of the direct recording exhibits three resonances in an area between ca. 2500 and 3600 Hz. In the telephone-recorded signal, however, there is an additional resonance between 2850 and 3100 Hz with a centre at about 2960 Hz (marked ‘?’).
Since identical hardware was used for the recording of all subjects and the resonance occurs throughout the whole recording of that speaker alone, it is quite obviously an artefact of the particular transmission channel. It is clearly visible, for instance, in the words ‘wurden einig’ (see horizontal line). The additional resonance, which moreover has a rather strong amplitude, must necessarily lead to an error in the measurement of frequency components of any speech sound, not only vowels, in an area near the upper end of the telephone band-pass. This finding certainly emphasizes the old argument (proposed by engineers; see footnote 1), formerly applied to analogue transmission systems, against the use of any formant frequencies in forensic speaker identification. For the same reasons this argument may be extended to other phonetic applications, including empirical dialectology of the kind that provided the stimulus for the present investigation. At the same time, a question mark has to be put behind the assumption that the characteristics of ISDN channels are in principle identical, although the data on the ten public telephone boxes presented above would support it. In order to assess the probability of the occurrence of such artefacts in a telecommunications network a large-scale empirical study of transmission characteristics will have to be undertaken.

NOTES

1 This argument does not apply to many acoustics engineers, though. For instance, for those working on telephone-based automatic speaker recognition systems this problem has been one of the reasons for never even considering using formant information for feature vectors.

2 Similar to Moye (1979), this study is confined to the more technical aspects of telephone transmission. It goes without saying that other peculiarities of telephone speech exist that may seriously affect an individual’s speech behaviour. For instance, Hirson et al.
(1995) have shown that in telephone interviews the fundamental frequency of speakers increases significantly in relation to face-to-face interviews.
3. Here, the upper cut-off frequency is down to 3200 Hz again, making distinctions between high-frequency speech sounds such as /s/ and /f/ even more difficult than with ordinary landlines. The most serious problem caused by GSM transmission is certainly the distortion induced by the data-reduction algorithm that is used for breaking down and reassembling the speech signal at either end of the transmission path. Although there does not seem to be a systematic study of such effects yet, extensive experience with casework has shown that speaker identification is generally more difficult on the basis of GSM-transmitted speech samples. What makes the problem even more severe, however, is that false-identification type errors seem to increase more than false eliminations. Members of the judiciary have therefore been warned of this problem, which is typically pertinent when speech recordings are played to witnesses in court with the aim of having them identify voices (Künzel 1997: 100).
4. It is well known that formants of speech produced by male voices are easier to measure in broad-band spectrograms than those produced by female voices. The simple reason is that the average fundamental frequency, F0, of the former is almost an octave lower than that of the latter. Since formants can be conceived of as ‘bundles’ of harmonics, one can say that the lower F0 is, the more of its harmonics are contained in a bundle of a certain size (bandwidth). Using the typical filter bandwidth of 300 Hz, the first formant of an [i] (which has a centre frequency of, say, 270 Hz and a bandwidth of ca. 300 Hz) produced by a male speaker with an F0 of 100 Hz will mainly comprise three harmonics (200, 300, 400 Hz) that are blurred into one band whose peak (centre frequency) is easy to determine.
If F0 is 200 Hz, however, the first formant of the same vowel will comprise only F0 itself and one harmonic (200, 400 Hz). Since these frequencies are so widely spaced they will not merge into one band, which makes it difficult to determine a formant (bandwidth and peak). Selecting a larger filter bandwidth may only partially compensate for the effect. See also the example in Baken 1987: 352, figure 9-26.
5. Der Nordwind und die Sonne. Einst stritten sich Nordwind und Sonne, wer von ihnen beiden wohl der Stärkere wäre, als ein Wanderer, der in einen warmen Mantel gehüllt war, des Weges daherkam. Sie wurden einig, dass derjenige für den Stärkeren gelten sollte, der den Wanderer zwingen würde, seinen Mantel auszuziehen. Der Nordwind blies mit aller Macht, aber je mehr er blies, desto fester hüllte sich der Wanderer in seinen Mantel ein. Endlich gab der Nordwind den Kampf auf. Da erwärmte die Sonne die Luft mit ihren freundlichen Strahlen, und schon nach wenigen Augenblicken zog der Wanderer seinen Mantel aus. Da musste der Nordwind zugeben, dass die Sonne von ihnen beiden der Stärkere war.
6. Each dot corresponds to one measurement value from one male subject. The varying total number of dots per diagram is the result of different numbers of occurrences of the vowels (see Introduction). In the case of [a, a:] there are (7 + 1) × 10 subjects = 80 dots, whereas for [U, u:] there are only (1 + 1) × 10 = 20.
7. According to Hirson, French and Howard (1995: 230) this applies to ‘well over 90 per cent of cases’.
8. Simpson (1998: 217) has found the following average formant values in the spontaneous speech of male German subjects (all in Hz): F1: [i] 309, [I] 353, a difference of 14 per cent; F2: [i] 2039, [I] 1801. Individual measurements show a considerable overlap between both formants of either vowel. The corresponding values for these vowels of female speakers, however, differ more: F1: [i] 330, [I] 418, a difference of 26 per cent; F2: [i] 2371, [I] 2093.
The spontaneous data from Simpson are not quite comparable with the data provided by Delattre (1965: 49), who measured the German vowels [i] and [I] in isolation: F1: [i] 275, [I] 325, a difference of 18 per cent; F2: [i] 2250, [I] 2100.

REFERENCES

Baken, R. J. (1987) Clinical Measurement of Speech and Voice, London: Taylor & Francis.
Bellmann, G., Herrgen, J. and Schmidt, J. E. (1994–99) Mittelrheinischer Sprachatlas (MRhSA), Band 1–3: Vokalismus. Tübingen.
Cooper, F. S., Delattre, P., Liberman, A. M., Borst, J. and Gerstman, L. J. (1952) ‘Some experiments on the perception of synthetic speech sounds’, Journal of the Acoustical Society of America, 24: 597–606.
Delattre, P. C., Liberman, A. M., Cooper, F. S. and Gerstman, L. J. (1952) ‘An experimental study of the acoustic determinants of vowel color’, Word, 8: 195–210.
Delattre, P. C. (1965) Comparing the Phonetic Features of English, French, German and Spanish, Heidelberg: Groos.
Falcone, M. and De Sario, N. (1994) ‘A PC based speaker identification system for forensic use’ in Proceedings of the ESCA Workshop on Automatic Speaker Recognition, Identification and Verification, Martigny, 169–72.
Falcone, M., Paoloni, A. and De Sario, N. (1995) ‘IDEM: a software tool to study vowel formants in speaker recognition’ in Proceedings of ICPhS ’95 Stockholm, vol. 3, 294–7.
Flanagan, J. (1955) ‘A difference limen for vowel formant frequency’, Journal of the Acoustical Society of America, 27: 613–17.
Hess, W. (1983) Pitch Determination of the Speech Signal, Berlin: Springer-Verlag.
Hirson, A., French, P. and Howard, D. (1995) ‘Speech fundamental frequency over the telephone and face-to-face: some implications for forensic phonetics’ in J. W. Lewis (ed.) Studies in General and English Phonetics. Essays in Honour of J. D. O’Connor, London/New York: Routledge, 230–40.
Kohler, K. (1977) Einführung in die Phonetik des Deutschen, Berlin: Schmidt.
Künzel, H. J.
(1997) ‘Methoden der forensischen Sprecher-Erkennung’, Strafverteidiger Forum, 5: 100–5.
Labov, W., Ash, S. and Boberg, C. (forthcoming) Atlas of North American English, Berlin/New York: de Gruyter.
Moye, L. S. (1979) Study of the Effects on Speech Analysis of the Types of Degradation Occurring in Telephony, Harlow, Essex: Standard Telecommunication Laboratories.
Neppert, J. and Petursson, M. (1986) Elemente einer akustischen Phonetik, Hamburg: Buske.
Peterson, G. E. and Barney, H. L. (1952) ‘Control methods used in a study of the vowels’, Journal of the Acoustical Society of America, 24: 175–84.
Popov, N. F., Linkov, A. N., Fesenko, A. V., Kurachenkova, N. B., Baicharov, N. V., Karlin, I. P., Timofeev, I. N. and Potapova, R. K. (1999) ‘The interactive expert system for forensic speaker identification used in Russia’ in Proceedings of the International Workshop Speech and Computer, Moscow, 37–53.
Potter, R., Kopp, G. and Green, H. (1947) Visible Speech, New York: Van Nostrand.
Simpson, A. (1998) Phonetische Datenbanken des Deutschen in der empirischen Sprachforschung und der phonologischen Theoriebildung, Arbeitsberichte des Instituts für Phonetik Kiel (AIPUK), vol. 33.
Stevens, K. N. and House, A. S. (1955) ‘Development of a quantitative description of vowel articulation’, Journal of the Acoustical Society of America, 27: 484–93.
Stevens, K. N. and House, A. S. (1961) ‘An acoustical theory of vowel production and some of its implications’, Journal of Speech and Hearing Research, 4: 303–20.
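
Supplementary illustration. The harmonic-counting argument in the note on male and female voices (formants as ‘bundles’ of harmonics) can be sketched in a few lines of Python. The sketch treats the first formant of [i] as a rectangular band extending half a bandwidth to either side of the centre frequency, which is a simplifying assumption made here and not part of the original analysis. The function name harmonics_in_band is hypothetical, and the male F0 of 100 Hz is inferred from the harmonics listed in the note.

```python
def harmonics_in_band(f0_hz, centre_hz, bandwidth_hz):
    """Return the harmonics of f0_hz (F0 itself included) that fall inside
    a band of width bandwidth_hz centred on centre_hz.

    Assumption: the formant is modelled as a rectangular band, which is a
    simplification of a real resonance curve.
    """
    lower = centre_hz - bandwidth_hz / 2
    upper = centre_hz + bandwidth_hz / 2
    harmonics = []
    k = 1
    while k * f0_hz <= upper:
        if k * f0_hz >= lower:
            harmonics.append(k * f0_hz)
        k += 1
    return harmonics


# Values taken from the note: F1 of [i] centred at 270 Hz, bandwidth ca. 300 Hz.
male = harmonics_in_band(100, 270, 300)    # F0 = 100 Hz (inferred male value)
female = harmonics_in_band(200, 270, 300)  # F0 = 200 Hz (value given in the note)

print(male)    # [200, 300, 400]
print(female)  # [200, 400]
```

Under these assumptions the lower F0 places three harmonics inside the band and the higher F0 only two, in line with the figures given in the note.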