Frequency shifts and vowel identification

Frequency shifts and vowel
identification
Peter F. Assmann (School of Behavioral and Brain
Sciences, Univ. of Texas at Dallas, Box 830688,
Richardson TX 75083)
Terrance M. Nearey (Dept. of Linguistics, University
of Alberta, Edmonton, Alberta, Canada T6E 2G2).
Introduction
•
•
•
Listeners can understand frequency-shifted speech
across a wide frequency range (Fu & Shannon, 1999).
We hypothesize that this ability can be explained in
terms of listeners’ sensitivity to statistical variation
across talkers in natural speech.
The aims of the present study were:
1. To study the effects of frequency shifts on the identification
of vowels spoken by 2 men, 2 women and 2 children (age 7).
2. To test the predictions of a model of vowel perception that
incorporates measures of fundamental frequency (F0) and
formant frequencies (FF) associated with size differences in
larynx and vocal tract across talkers
Co-variation of formant frequencies and F0 in natural speech
Mean log FF: Geometric mean of formant frequencies: F1,F2,F3
>3000 vowels in hVd words (Assmann & Katz, 2000)
Pattern recognition model
• Hillenbrand & Nearey (1999) dual-target model
• Parameters: duration, mean F0, and F1, F2, F3
sampled at 20% and 80% points
• Training data: 3000+ vowels spoken by 10 men, 10
women and 30 children from the N. Texas region
(Assmann & Katz, 2000)
• A posteriori probabilities derived from linear
discriminant analysis for each stimulus vowel
Frequency shifts and vowel identification
• In a previous study (Assmann, Nearey & Scott, 2002) we
confirmed that upward shifts in F0 or formant frequencies
(FF) resulted in lower vowel identification accuracy.
• However, combining upward shifts in F0 with upward
shifts in FF led to improved identification accuracy.
• The finding that vowel identification accuracy is higher
with coordinated shifts in F0 and FF is well predicted by
the model of vowel identification outlined below, and
supports the idea that listeners are sensitive to the pattern
of co-variation of F0 and FF in natural speech.
Vowel Identification Accuracy
Identification accuracy (%)
100
Means and standard errors of 11 listeners
Predicted means
80
60
40
20
0
F0*1
F0*2
F0*4
F0*1
F0*2
F0*4
1.00
1.25
1.50
1.75
2.00
Spectrum envelope scale factor
1.00 1.25
1.50
1.75
2.00
Spectrum envelope scale factor
(Assmann, Nearey, and Scott, ICSLP 2002).
Vowel Identification Experiment
• The present study examined effects of upward and
downward frequency shifts on vowel identification.
• 11 vowels (/i/, //, /e/, //, /æ/, //, / /, //, /o/, //, /u/)
in hVd context spoken by 3 men, 3 women, and 3
children from the N. Texas region.
• Upward and downward frequency shifts were
introduced by means of the STRAIGHT vocoder
(Kawahara, 1997).
STRAIGHT vocoder
• High-resolution analysis of time-varying
spectrum envelope
• Wavelet-based instantaneous frequency F0
extraction
– Spectrum envelope (FF) scaling
– Fundamental frequency (F0) scaling
Scale Factors
0.6
FF scale factors
0.8
1.0
1.5
2.0
F0 scale factors
0.5
1.0
4.0
• For females and children, downward shifts tend to produce
male-like voices; for adults, upward shifts heard as child-like
voices.
Method
• Listeners were 14 Psychology undergraduates
participating for partial course credit. Since the
majority had no phonetics training, they first
completed 3 practice sets:
Set 1: passive listening with feedback (24 resynthesized but
not frequency-shifted vowels; no response required).
Set 2: practice identification (a different set of 24 vowels
presented for identification; repeated until a score of
21/24 or better was obtained).
Set 3: passive listening with feedback (24 frequency-shifted
vowels; shift factors randomly chosen from the 15
conditions of the experiment; no response required)
Method
• Main experiment: 990 syllables (11 vowels x 2
talkers per group x 3 talker groups x 3 F0 scale
factors x 5 FF scale factors). All conditions
randomly interspersed.
• Vowels were presented diotically over headphones
in a double-walled sound booth. Listeners
identified the vowels using an 11-button response
box drawn on computer screen labeled with
keywords for the vowel category.
Effects of FF shifts
Identification accuracy (%)
100
80
60
40
20
0
Men
Women
Children
0.6
0.8
1
1.5
2
Spectrum envelope scale factor
Interaction of FF and F0 shifts
Identification accuracy (%)
Men
Women
100
100
50
50
0
0.6 0.8 1.0 1.5 2.0
Children
100
50
0
0.6 0.8 1.0 1.5 2.0
0
F0 x 0.5
F0 x 1.0
F0 x 4.0
0.6 0.8 1.0 1.5 2.0
• For men’s vowels, accuracy is
higher when upward shift in FF is
accompanied by upward shift in F0
• For women and children, there is
a recovery from downward shifts
in FF when F0 is also shifted down
Spectrum envelope (FF) scale factor
Conclusions
• Identification accuracy drops significantly when vowels are
shifted upward in formant frequency by a factor of 1.5 or more,
or downward by a factor of 0.6 or less. Adult males are less
susceptible to upward shifts than females and children, while
children are less affected by downward shifts.
• In several conditions, the drop in intelligibility was reduced by
combining formant shifts with corresponding changes in F0.
• Pattern recognition models predicted the effects of frequency
shifts on vowel identification, including the synergistic link
between F0 and formant frequency. A plausible account is that
learned relationships between F0 and spectral envelope cues are
responsible for this interaction.
References
1.
Assmann PF, Katz WF. (2000) Time-varying spectral change in the
vowels of children and adults. J Acoust Soc Am. 108(4): 1856-1866.
2.
Assmann, P.F., Nearey, T.M., and Scott, J.M. (2002) Modeling the
perception of frequency-shifted vowels. Proceedings of the 7th
International Conference on Spoken Language Processing, pp. 425-428.
3.
Fu, Q-J. & Shannon, R.V. (1999). Recognition of spectrally degraded
and frequency-shifted vowels in acoustic and electric hearing. J Acoust
Soc Am. 105: 1889-1900.
4.
Hillenbrand JM, Nearey TM. (1999) Identification of resynthesized
/hVd/ utterances: effects of formant contour. J Acoust Soc Am. 105(6):
3509-3523.
5.
Kawahara, H. (1997) Speech representation and transformation using
adaptive interpolation of weighted spectrum: vocoder revisited. Proc.
IEEE Int. Conf. on Acoustics, Speech & Signal Processing (ICASSP
'97), vol.2, pp.1303-1306.
Lowered
Male
Base
Male
Lowered
F&C
Base
F&C
Raised
Male
Raised
F&C
Average correct ID per synthetic voice
Basketballs w legend
Disk and spoke plot
Disks = Observed ID
• The colored disks represent listeners correct
identification rate
– Blue: male speakers’ synthesized voices (scaled and
unscaled) ; Red: female speakers; Green: child
speakers;
– The position of the center of the disk indicates the
average F0 and formant frequencies of the voice
– The area of each disk is proportional to the average %
corrected identification by listeners of the voice
– The circles in the legend box indicate the correct
identification rate
Disk and spoke plot
Spokes = Predicted ID
• Length of the spokes indicate predicted ID rate by
LDFA
– Trained on natural measurements of Assmann and
Katz , predictions on scaled values of this experiment
– Patterns:
• Accurate predictions: ‘Basketball’ spoke length matches disk
radius
• Under predictions: ‘Asterisks’ in disks, listeners do better
• Over predictions: ‘Spiked disks’, model does better than
listeners
‘bad’ voice
‘good’ voice
% ID well predicted by smooth function of mean F0 and mean FF
Accounts for 88% of variance