Acoustic Modeling of the Perception of Place Information in Incomplete Stops
Presented at the 169th Meeting of the
Acoustical Society of America
19 May 2015
Session 2pSC
Pittsburgh, PA
Megan Willi & Brad Story
Speech, Language, and Hearing Sciences, The University of Arizona
Method
Figure 3: Relative Formant Deflection Patterns. The formant frequencies of the VV contexts are represented with black lines and the relative formant deflections are represented with red (upward deflection) and blue (downward deflection) lines.
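
As a rough illustration of the red/blue coding in Figure 3, the sketch below labels each time sample of a formant track as an upward or downward deflection relative to the underlying VV track. The function name, the tolerance parameter, and the sample values are hypothetical. The poster does not state how the deflections were computed, so this is only one plausible reading.

```python
def deflection_directions(vv_track_hz, vcv_track_hz, tolerance_hz=0.0):
    """Label each sample of a VCV formant track relative to the VV-only track.

    Assumption: deflection is taken as the signed difference (VCV minus VV)
    at matching time samples. 'up' would be drawn red and 'down' blue.
    """
    directions = []
    for vv, vcv in zip(vv_track_hz, vcv_track_hz):
        diff = vcv - vv
        if diff > tolerance_hz:
            directions.append("up")
        elif diff < -tolerance_hz:
            directions.append("down")
        else:
            directions.append("none")
    return directions


# Made-up F2 values (Hz) for illustration only; not data from the study.
vv_f2 = [1500.0, 1600.0, 1700.0]
vcv_f2 = [1500.0, 1450.0, 1800.0]
directions = deflection_directions(vv_f2, vcv_f2)
```

A non-zero `tolerance_hz` could be used to ignore very small differences, which may be useful if the tracks come from a formant estimator rather than directly from a model.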
Figure 4: Characterization of the Tube Talker. The black and blue lines represent the vocal tract shapes for the first and second vowels (i.e. [ə] and [i] respectively), and the red line represents the cross-sectional area achieved by the constrictive gesture. All stimuli for Experiment 1 and Experiment 2 had an incomplete closure of 0.1 cm².
Figure 9: Average participant ID curves and contour plots (F2 lower panel and F3 upper panel) for three vowel contexts (i.e. [əi], [əɑ], [əu] from top to bottom respectively) for Conditions 1, 2, and 3 (left to right respectively). See Figure 8 for a detailed description.
Discussion
Listeners’ phonetic boundaries in Experiment 1 and Experiment 2 indicate that place-of-articulation information is present in incomplete, voiced stop consonant VCV stimuli lacking canonical hold duration and burst cues.
Listeners’ phonetic boundaries coincide with the proposed relative formant deflection patterns for all three places of articulation (i.e. /b-d-g/) across vowel contexts (i.e. [əi], [əɑ], [əu]) and timing function manipulations (i.e. Conditions 1, 2, 3), except for the velar position in Condition 3.
The results suggest that listeners may be sensitive to changes along this relative acoustic dimension and that relative formant deflection patterns could potentially explain the perception of place-of-articulation information in natural, reduced speech contexts. However, further investigation of the perceptual limits of this cue with respect to place is necessary.
 
 
Experiment 1:
  Original timing: 500 ms
Experiment 2:
  Condition 1: 300 ms (60%)
  Conditions 2 & 3: 200 ms (40%)
References
Crystal, T. H., & House, A. S. (1988). Segmental durations in connected-speech signals: Current results. The Journal of the Acoustical Society of America, 83(4), 1553-1573.
Story, B. H. (2009). Vowel and consonant contributions to vocal tract shape. The Journal of the Acoustical Society of America, 126, 825-836.
Story, B. H., & Bunton, K. (2010). Relation of vocal tract shape, formant transitions, and stop consonant identification. Journal of Speech, Language, and Hearing Research, 53(6), 1514-1528.
Warner, N., & Tucker, B. V. (2011). Phonetic variability of stops and flaps in spontaneous and careful speech. The Journal of the Acoustical Society of America, 130(3), 1606-1617.
Figure 6: Illustration of the proportionally reduced time-dependent activation functions for the constrictive gestures. The percent represents the proportion of the original signal’s timing function maintained.
Experiment 1:
  Original timing: 500 ms
Experiment 2:
  Condition 1: 300 ms (60%)
  Condition 2: 200 ms (40%)
  Condition 3: 100 ms (20%)
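
The durations in this legend follow directly from scaling the original 500 ms timing by the stated proportion. A minimal sketch of that arithmetic is given below. The dictionary and function names are hypothetical, and the code only reproduces the numbers listed here, not the procedure used to generate the stimuli.

```python
ORIGINAL_MS = 500  # Experiment 1 timing, as listed in the legend

# Proportion of the original timing function maintained in Experiment 2.
PROPORTION_MAINTAINED = {
    "Condition 1": 0.60,
    "Condition 2": 0.40,
    "Condition 3": 0.20,
}


def scaled_duration_ms(proportion, original_ms=ORIGINAL_MS):
    """Return the reduced duration in whole milliseconds."""
    return round(original_ms * proportion)


reduced_ms = {
    condition: scaled_duration_ms(proportion)
    for condition, proportion in PROPORTION_MAINTAINED.items()
}
```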
Figure 5: Illustration of the proportionally reduced VV contexts. The percent represents the proportion of the original signal’s timing function maintained.
Figure 7: (Experiment 1) Example stimuli from one VV context (i.e. [əi]) at three different vocal tract locations (i.e. 17.5 cm, 13.9 cm, and 11.9 cm for /b-d-g/ respectively). (Experiment 2) Example stimuli at vocal tract location 17.5 cm (i.e. /b/) for Conditions 1, 2, and 3 for each vowel context (i.e. [əi], [əɑ], [əu] respectively).
Results: Experiment 1
Participants: 10 native English speakers (Exp. 1) and 5 native English speakers (Exp. 2).
Task: Forced Choice Test (i.e. /b-d-g/).
Materials: All stimuli were 500 ms vowel-consonant-vowel (VCV) utterances simulated using a voice source model based on a kinematic representation of the medial surfaces of the vocal folds and an airway modulation model of the vocal tract (a.k.a. ‘Tube Talker’). VCV continua were created for 3 underlying vowel-to-vowel transition (VV) contexts (i.e. [əi], [əɑ], and [əu]) by incrementally moving the constriction location from the lips toward the velar part of the vocal tract in 20 (Exp. 1) and 15 (Exp. 2) discrete 0.4-cm steps. Experiment-specific manipulations are described below.
Design: Stimuli were randomly presented 5 times (Exp. 1) and 3 times (Exp. 2) in a block design where only one vowel context was presented per block.
Analysis: Participants’ ID curve boundaries were compared to the perceptual boundaries predicted by the contour plots of the relative formant deflection patterns.
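
The constriction-location continuum described under Materials can be sketched as a simple list of positions. The starting value of 17.5 cm is an assumption taken from the /b/ location in the Figure 7 caption, and the code assumes locations decrease by 0.4 cm per step toward the velar region. The function name is hypothetical.

```python
def constriction_continuum(n_steps, start_cm=17.5, step_cm=0.4):
    """Return n_steps constriction locations (cm), from the lips toward the velum.

    Assumptions: the continuum starts at start_cm (the /b/ location in the
    Figure 7 caption) and each step moves 0.4 cm toward the velar region.
    """
    return [round(start_cm - i * step_cm, 1) for i in range(n_steps)]


exp1_locations = constriction_continuum(20)  # Experiment 1: 20 steps
exp2_locations = constriction_continuum(15)  # Experiment 2: 15 steps
```

Under these assumptions, the 13.9 cm and 11.9 cm locations cited in the Figure 7 caption fall on the 10th and 15th steps of the continuum.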
Aims
Evaluate participants’ perceptions of place-of-articulation information in stimuli that simulate:
1) incomplete closure in reduced voiced stop consonants.
2) proportionally reduced consonant and vowel timing functions in stimuli with incomplete stop consonant closure.
Figure 2: Reduced speech examples of 100 ms VCV segments excised from the read words “sabotage”, “steady”, and “spigot.”
Figure 1: Reduced speech example of the read word “sabotage.”
Results: Experiment 2
Previous research on stop consonant production found that less than 60% of the stops sampled from a connected speech corpus contained a clearly defined hold duration followed by a plosive release [Crystal & House, JASA, 1988]. How listeners perceive the remaining, incomplete stop consonants is not well understood.
Prior pilot research demonstrated that participants could identify place information (i.e. /b-d-g/) in reduced, 100 ms vowel-consonant-vowel (VCV) segments excised from read word lists from the Arizona English Recording Corpus.
The purpose of the current study is to investigate whether relative formant deflection patterns, a potential model of acoustic invariance proposed by Story and Bunton (2010), are capable of predicting listeners’ perceptions of place information in acoustically continuous, voiced stop consonants.
Listeners identified speech stimuli simulated using a computational model of speech production and model parameters based on x-ray microbeam articulatory data from VCV utterances [Story, JASA, 2009].
Example Stimuli
Introduction
Figure 8: (Top) Identification (ID) curves averaged across all participants for the Forced Choice Test: /b/ (blue), /d/ (red), and /g/ (green). (Bottom) Contour plots depicting the relative formant deflection directions: upward (red) and downward (blue). The three panels correspond to F1 (lower panel), F2 (middle panel), and F3 (upper panel). The black lines indicate participants’ phonetic boundaries, defined as the 50% crossover point on the ID curve.
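
The 50% crossover criterion in the Figure 8 caption can be illustrated with linear interpolation between adjacent continuum steps. This is a minimal sketch: the function name and the response proportions are hypothetical, and the poster does not state whether the boundaries were interpolated or obtained by another method such as curve fitting.

```python
def crossover_points(locations_cm, proportions, threshold=0.5):
    """Find locations where an ID curve crosses the threshold.

    Linearly interpolates between neighbouring continuum steps whose
    response proportions lie on opposite sides of the threshold.
    Samples exactly equal to the threshold are not reported.
    """
    crossings = []
    pairs = list(zip(locations_cm, proportions))
    for (x0, p0), (x1, p1) in zip(pairs, pairs[1:]):
        if (p0 - threshold) * (p1 - threshold) < 0:
            fraction = (threshold - p0) / (p1 - p0)
            crossings.append(x0 + fraction * (x1 - x0))
    return crossings


# Made-up proportions of /b/ responses, for illustration only.
locations = [17.5, 17.1, 16.7, 16.3]
b_responses = [1.0, 0.8, 0.2, 0.0]
boundaries = crossover_points(locations, b_responses)
```

With these illustrative values, the /b/ curve falls from 0.8 to 0.2 between 17.1 cm and 16.7 cm, so the interpolated boundary lies midway between those two steps.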
Acknowledgements
This research was supported by the Grunewald Foundation Fellowship and NIH R01-DC011275.