The role of formant-frequency contours in the
perceptual grouping of speech formants
161st Meeting
Acoustical Society of America
Seattle, WA
23-27 May 2011
Brian Roberts and Robert J. Summers (Psychology, School of Life & Health Sciences, Aston University, UK)
Peter J. Bailey (Department of Psychology, University of York, UK)
1. Introduction
3. Experiment 1 – Stimuli and Conditions
6. Experiment 2 – Orientation and Methods
• Speech comprises diverse and rapidly changing acoustic elements yet it is heard as a single
perceptual stream, even when accompanied by other sounds. The relative contributions of grouping
“primitives” and of speech-specific factors to the perceptual coherence of speech remain unclear
(Bregman, 1990; Remez et al., 1994; Darwin, 2008).
• Three-formant analogues of 42 sentences were generated after scaling the frequency contours to
50% depth.
• Remez et al. (1994) interpreted their finding that an F2C created by time-reversing F2 was an
effective competitor for sine-wave speech, but that a constant pure tone was not, in terms of the
plausibility of speech-like variation.
• The factors governing perceptual organization are generally revealed only where competition
operates. Therefore, the second-formant competitor (F2C) paradigm was used, in which the listener
must resist competition from an extraneous formant to optimize recognition (Remez et al., 1994;
Roberts et al., 2010; Summers et al., 2010).
• The parametric manipulations possible with simplified speech signals make them attractive stimuli to
explore these issues. Here, we have used three-formant synthetic analogues of natural speech
(Summers et al., 2010).
• Remez et al. (1994) showed that an F2C created by time-reversing F2 was an effective competitor
for sine-wave speech, but that a pure tone of constant frequency and amplitude was not.
• Roberts et al. (2010) used separate manipulations of the frequency and amplitude contours of
competitor formants to tease apart their impact on the intelligibility of sine-wave speech. All F2Cs with
time-varying frequency contours (time reversed or spectrally inverted) were highly effective
competitors, regardless of their amplitude characteristics, but F2Cs with constant frequency contours
were entirely ineffective. These results suggest that the modulation patterns of formant frequency
contours are critical for across-formant grouping.
• Consistent with this interpretation, Remez (1996) reported that reducing the frequency and amplitude
variation in a competitor generated by time-reversing F2 reduced its impact on the intelligibility of sine-wave speech.
• Two experiments are reported here. These explore the effect on intelligibility of manipulating the
depth of variation in the formant-frequency contour of F2C, relative to that for the speech formants,
while preserving the formant amplitude contours. The first used F2Cs whose frequency contours were
created by spectral inversion of F2; the second used F2Cs with regular and arbitrary frequency
contours that were not plausibly speech-like.
2. Method – Speech analogues
• Stimuli were derived from a set of BKB-like sentences spoken by a British male talker with Received
Pronunciation English. Each sentence comprised ≤25% phonemes involving closures or unvoiced
frication.
• For each sentence, the frequency contours of the first three formants were estimated automatically
using Praat (Boersma & Weenink, 2010). Gross errors in formant frequency estimates were hand-corrected.
Amplitude contours corresponding to the corrected formant frequencies were extracted
from the spectrograms for each sentence.
• These contours were used to generate three-formant analogues of the sentences by means of
parallel-formant synthesis and an excitation pulse modelled on the glottal waveform (Rosenberg,
1971). The pitch was monotonous (F0 = 140 Hz), and the 3 dB bandwidths of F1, F2, and F3 were
50, 70, and 90 Hz, respectively.
• In a preliminary study, the depth of variation in each formant-frequency contour was scaled to one of
a range of values about its geometric mean (100% to 0% [i.e., constant], in steps of 10%). This is
illustrated in Fig. 1 for scale factors of 100% (dashed) and 50% (solid).
• The psychometric function obtained using diotic presentation of all three formants at the different
scaling values indicated that scaling the frequency contours to 50% depth had relatively little effect on
intelligibility (see Fig. 2).
3. Experiment 1 – Stimuli and Conditions (continued)
• For each sentence, a set of F2 competitors (F2C) was created using the original amplitude contour
of F2 and a 3 dB bandwidth of 70 Hz. The frequency contour of F2C was derived from that of F2 by
inversion about its geometric mean and scaling to one of five values (depth = 100% to 0%, in 25% steps).
• Stimuli were presented using a dichotic
configuration (left ear = F1+F2C; right ear = F2+F3;
cf. Rand, 1974). See Fig. 3.
• Stimuli were selected such that F2C frequency
was always ≥80 Hz from F1.
• There were 7 conditions in total; 5 were
experimental (depth of F2C was varied), 1 was a
control (no F2), and 1 was the dichotic reference
(no F2C).
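The contour manipulations described above (scaling the depth of variation about the geometric mean, and spectral inversion to create F2C) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and it assumes that excursions are scaled on a log-frequency axis, which the poster does not state explicitly for this step.

```python
import math


def geometric_mean(freqs_hz):
    """Geometric mean of a formant-frequency contour (values in Hz)."""
    return math.exp(sum(math.log(f) for f in freqs_hz) / len(freqs_hz))


def rescale_contour(freqs_hz, depth, invert=False):
    """Scale the depth of variation of a contour about its geometric mean.

    depth is a proportion (1.0 = 100% depth, 0.0 = constant frequency).
    invert=True mirrors the contour about the geometric mean.
    Assumption: scaling and inversion are applied to log-frequency excursions.
    """
    gm = geometric_mean(freqs_hz)
    sign = -1.0 if invert else 1.0
    # (f / gm) ** k multiplies the log-frequency excursion from the mean by k.
    return [gm * (f / gm) ** (sign * depth) for f in freqs_hz]


# Hypothetical F2 track (Hz), one value per analysis frame.
f2 = [1200.0, 1500.0, 1900.0, 1700.0, 1300.0]
speech_f2 = rescale_contour(f2, 0.5)                # speech formant at 50% depth
competitor = rescale_contour(f2, 1.0, invert=True)  # 100%-depth inverted F2C
```

Under this assumption, a depth of 0 yields a constant contour at the geometric mean, matching the "0% [i.e., constant]" case described in the Method.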
Figure 3
4. Experiment 1 – Procedure
• For each listener, the sentences were divided equally across conditions (i.e., 6 per condition) using
an allocation that was counterbalanced by rotation across each set of 7 listeners tested.
• 21 listeners (6 male) took part in the experiment (mean age = 21.6 years). All listeners were native
speakers of English (typically British English).
• Listeners sat in a sound-attenuating booth in front of a computer screen and a keyboard. Stimuli
were presented in random order at a reference level of 75 dB SPL over Sennheiser HD480-13II
headphones.
• Listeners were able to listen to each stimulus up to a maximum of 6 times before entering their
transcription of the sentence. No feedback was given.
• Listeners first completed a training session with feedback (cf. Davis et al., 2005) intended to
improve recognition performance for three-formant speech analogues. The training stimuli were
derived from commercially available recordings of 40 IEEE sentences. A range of scaling factors was
used; no F2Cs were present.
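The counterbalanced allocation of sentences to conditions can be illustrated with a simple rotation scheme. This is a sketch under stated assumptions: the poster does not say how sentences were grouped, so the use of contiguous blocks of sentences and the function name are hypothetical.

```python
def allocate_conditions(n_sentences, n_conditions, listener_index):
    """Return the condition index assigned to each sentence for one listener.

    Sentences are split into equal blocks (assumed contiguous), and the
    block-to-condition mapping is rotated by one step per listener, so that
    each sentence visits every condition once across a set of n_conditions
    listeners.
    """
    if n_sentences % n_conditions != 0:
        raise ValueError("sentences must divide equally across conditions")
    per_condition = n_sentences // n_conditions
    shift = listener_index % n_conditions
    return [((s // per_condition) + shift) % n_conditions
            for s in range(n_sentences)]


# Experiment 1: 42 sentences, 7 conditions, hence 6 sentences per condition.
plan_for_first_listener = allocate_conditions(42, 7, 0)
```

With 21 listeners, this rotation would be completed three times.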
5. Experiment 1 – Results and Discussion
• Tight scoring was used to calculate % keywords correctly identified for each of the conditions.
• The control condition indicates that intelligibility was near floor when F2C was added full scale in
the absence of the true F2. Hence, F2C did not in itself support intelligibility (see Fig. 4).
• When the true F2 was present, adding F2C typically
reduced intelligibility. This reduction was greatest for
100%-depth, intermediate for 50%-depth, and least for
0%-depth (constant) F2Cs (see Fig. 4).
• The smooth and progressive decline in intelligibility as the
scaling factor for the inverted F2C is increased indicates
that competitor efficacy depends on the overall depth of its
frequency variation, not its depth relative to that of the
other formants (all set to 50% depth).
• This result also indicates that modulation of the frequency
contour influences across-formant grouping not only in
sine-wave analogues but also in the more speech-like
simulations used here (cf. Roberts et al., 2010).
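The keyword scoring used for these results can be sketched as below. This is an illustration only: it assumes that "tight" scoring credits a keyword only when the whole word is reproduced exactly (no credit for morphological variants), and the function names and example sentence are hypothetical rather than taken from the study materials.

```python
import re


def tokenize(text):
    """Lower-case a transcription and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def percent_keywords_correct(keywords, transcription):
    """Percentage of keywords reproduced exactly in the transcription.

    Each response word can be credited to at most one keyword.
    """
    remaining = tokenize(transcription)
    hits = 0
    for keyword in keywords:
        word = keyword.lower()
        if word in remaining:
            remaining.remove(word)
            hits += 1
    return 100.0 * hits / len(keywords)


# "faces" does not match the keyword "face" under this exact-match rule.
score = percent_keywords_correct(["clown", "funny", "face"],
                                 "The clown had a funny faces")
```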
Figure 4
Figure 1
Figure 2
References
Boersma, P., Weenink, D. (2010). Praat, a system for doing phonetics by computer. Institute of Phonetic Sciences, University of Amsterdam.
Bregman, A.S. (1990). Auditory Scene Analysis: The Perceptual Organization of Sound (MIT Press, Cambridge, MA).
Davis, M.H., Johnsrude, I.S., Hervais-Adelman, A., Taylor, K., McGettigan, C. (2005). Lexical information drives perceptual learning of distorted speech:
evidence from the comprehension of noise-vocoded sentences. J. Exp. Psychol. Gen. 134, 222-241.
Darwin, C.J. (2008). Listening to speech in the presence of other sounds. In The Perception of Speech: From Sound to Meaning, edited by B.C.J. Moore
et al. (Special Issue, Phil. Trans. R. Soc. B. 363, 1011-1021).
Rand, T.C. (1974). Dichotic release from masking for speech. J. Acoust. Soc. Am. 55, 678-680.
Remez, R.E. (1996). Perceptual organization of speech in one and several modalities: Common functions, common resources. In: ICSLP-1996,
Philadelphia, PA, pp. 1660-1663.
Remez, R.E., Rubin, P.E. (1990). On the perception of speech from time-varying acoustic information: Contributions of amplitude variation.
Percept. Psychophys. 48, 313-325.
Remez, R.E., Rubin, P.E., Berns, S.M., Pardo, J.S., Lang, J.M. (1994). On the perceptual organization of speech. Psychol. Rev. 101, 129-156.
Roberts, B., Summers, R.J., Bailey, P.J. (2010). The perceptual organization of sine-wave speech under competitive conditions. J. Acoust. Soc. Am. 128,
804-817.
Rosenberg, A.E. (1971). Effect of glottal pulse shape on the quality of natural vowels. J. Acoust. Soc. Am. 49, 583-590.
Summers, R.J., Bailey, P.J., Roberts, B. (2010). Effects of differences in fundamental frequency on across-formant grouping in speech perception.
J. Acoust. Soc. Am. 128, 3667-3677.
Acknowledgements: Supported by EPSRC. Grant Reference EP/F016484/1 (Roberts & Bailey). We thank Rob Morse and
Meghna Patel for providing the BKB-like sentences, and Quentin Summerfield for enunciating them.
Email: {B.Roberts, R.J.Summers}@aston.ac.uk; [email protected]
6. Experiment 2 – Orientation and Methods (continued)
• Here, the importance of speech-like variation for across-formant grouping was explored using an
F2C with a regular and arbitrary formant-frequency contour. A triangle wave was used (see Fig. 5),
which does not constitute a plausibly speech-like frequency contour. All F2Cs were synthesized
using the original F2 amplitude contour.
• In each case, the triangle-wave frequency contour was
matched to the average rate and depth of modulation for its
inverted F2C counterpart, derived from F2. Modulation rate was
set in relation to zero crossings at the geometric mean frequency
(see Fig. 5, top panel). Peak-to-trough depth was matched to
that of F2 on a log-frequency scale and centered around the
geometric mean frequency.
• The same dichotic stimulus arrangement and procedure were
used as for Experiment 1. There were 8 conditions; 5 were
experimental (depth of triangle-wave contour for F2C was varied
from 100% to 0% in 25% steps), 1 was the dichotic reference
(no F2C), and 2 were control conditions.
• Here, one control comprised F2+F3 alone to provide a
benchmark measure of intelligibility when F1 does not contribute
to the sentence. The other control was the 100%-depth inverted
F2C condition from Experiment 1, as a comparator for the
100%-depth triangle-wave case.
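The construction of the triangle-wave contour described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the frame-rate parameter, and the starting phase of the triangle wave are hypothetical. It takes the modulation rate to be half the number of crossings of the geometric mean per second, because a full triangle cycle crosses its centre twice.

```python
import math


def modulation_rate_hz(freqs_hz, frame_rate_hz):
    """Average modulation rate from crossings of the geometric mean frequency."""
    logs = [math.log(f) for f in freqs_hz]
    centre = sum(logs) / len(logs)  # log of the geometric mean
    signs = [1 if x > centre else -1 for x in logs if x != centre]
    crossings = sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    duration_s = len(freqs_hz) / frame_rate_hz
    return crossings / (2.0 * duration_s)  # two crossings per cycle


def triangle_contour(freqs_hz, frame_rate_hz, depth=1.0):
    """Triangle-wave contour matched in rate and log-frequency depth to F2.

    depth scales the peak-to-trough range (1.0 = 100%, 0.0 = constant).
    Assumption: the wave starts at the geometric mean and rises first.
    """
    logs = [math.log(f) for f in freqs_hz]
    centre = sum(logs) / len(logs)
    half_range = depth * (max(logs) - min(logs)) / 2.0
    rate = modulation_rate_hz(freqs_hz, frame_rate_hz)
    contour = []
    for i in range(len(freqs_hz)):
        phase = (rate * i / frame_rate_hz) % 1.0
        # Unit triangle wave in [-1, 1], equal to 0 at phase 0.
        tri = 1.0 - 4.0 * abs(((phase + 0.25) % 1.0) - 0.5)
        contour.append(math.exp(centre + half_range * tri))
    return contour
```

Because the wave is built on a log-frequency axis and centred on the geometric mean, a depth of 0 reduces it to a constant-frequency competitor.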
Figure 5
7. Experiment 2 – Results and Discussion
• Preliminary results are presented in Fig. 6 (n = 11 of 24).
As before, tight scoring was used to calculate % keywords
correct.
• The first control condition (white bar) shows that
intelligibility was more than halved relative to the dichotic
reference (black bar) when F2+F3 were presented alone.
This result provides a benchmark against which to assess
the impact of different competitor formants.
• The experimental conditions (light gray bars) show that
increasing the depth of frequency modulation for the
triangle-wave F2C caused a smooth, progressive decline
in intelligibility. F2Cs with constant-frequency contours
(0% depth) were largely ineffective. As for Experiment 1,
the results indicate that competitor efficacy depends on
the overall depth of frequency variation, not depth relative
to that for the other formants (all set to 50% depth).
• The second control condition (dark gray bar) indicates that the reduction in recognition arising from
adding a 100%-depth inverted F2C is almost identical to that arising from adding a 100%-depth
triangle-wave F2C. Note also that performance in both 100%-depth cases suggests that F1 may in
effect have been excluded from the percept of the target sentences (cf. F2+F3 alone case, white bar).
• Contrary to the argument that across-formant grouping depends on speech-specific constraints
(Remez et al., 1994; Remez, 1996), the triangle-wave competitors were as effective as their more
speech-like counterparts.
Figure 6
8. Conclusions
• The results confirm and extend those of earlier studies using the dichotic F2C paradigm (Remez et
al., 1994; Remez, 1996; Roberts et al., 2010; Summers et al., 2010).
• Adding F2 competitors typically reduces intelligibility; this effect is one of informational masking rather
than energetic masking (see Roberts et al., 2010; Summers et al., 2010). A more specific interpretation
is that F2C acts by influencing the perceptual organization of the sentences.
• Two ways in which F2C might plausibly influence sentence perception are: (a) F2C contributes to the
perceptual estimation of F2; (b) F2C may capture F1 perceptually, given that they are presented in the same
ear. Though not conclusive, the second possibility gains credence given that the most effective F2Cs
(100%-depth cases) cause sentence intelligibility to fall to the level observed when F1 is physically absent.
• The main new findings are that:
1. It is the overall extent of the modulation of the formant-frequency contour, not the extent
relative to that of the other formants, that governs competitor efficacy.
2. Competitor efficacy does not depend on the plausibility of the articulatory motions
implied by F2C. The results indicate that there are at least some circumstances in which
across-formant grouping does not depend on speech-specific constraints.