Static, dynamic, and relational properties in vowel perception
Terrance M. Nearey
Department of Linguistics, University of Alberta, Edmonton T6G 2E7, Canada
(Received 3 November 1987; accepted for publication 19 January 1989)
The present work reviews theories and empirical findings, including results from two new experiments, that bear on the perception of English vowels, with an emphasis on the comparison of data analytic "machine recognition" approaches with results from speech perception experiments. Two major sources of variability (viz., speaker differences and consonantal context effects) are addressed from the classical perspective of overlap between vowel categories in F1×F2 space. Various approaches to the reduction of this overlap are evaluated. Two types of speaker normalization are considered. "Intrinsic" methods based on relationships among the steady-state properties (F0, F1, F2, and F3) within individual vowel tokens are contrasted with "extrinsic" methods, involving the relationships among the formant frequencies of the entire vowel system of a single speaker. Evidence from a new experiment supports Ainsworth's (1975) conclusion [W. Ainsworth, Auditory Analysis and Perception of Speech (Academic, London, 1975)] that both types of information have a role to play in perception. The effects of consonantal context on formant overlap are also considered. A new experiment is presented that extends Lindblom and Studdert-Kennedy's finding [B. Lindblom and M. Studdert-Kennedy, J. Acoust. Soc. Am. 43, 840-843 (1967)] of perceptual effects of consonantal context on vowel perception to /dVd/ and /bVb/ contexts. Finally, the role of vowel-inherent dynamic properties, including duration and diphthongization, is briefly reviewed. All of the above factors are shown to have reliable influences on vowel perception, although the relative weight of such effects and the circumstances that alter these weights remain far from clear. It is suggested that the design of more complex perceptual experiments, together with the development of quantitative pattern recognition models of human vowel perception, will be necessary to resolve these issues.
PACS numbers: 43.71.Es, 43.71.An
INTRODUCTION
This work is concerned with the problem of perceptual invariance in vowel perception. Although a number of side issues will be discussed, focus will be centered on the classical problem of overlap in formant frequencies of different vowels due to speaker- and context-dependent variability. Some additional attention will be given to intrinsic dynamic properties in English vowels, including relative vowel duration and diphthongization (or "vowel-inherent spectral change"). The problems to be addressed will be first outlined from the point of view of pattern classification or "machine recognition" of speech. Later, the question of the correspondence of such procedures to categorization by human listeners will be explicitly addressed.
Speaker-dependent overlap among the F1×F2 patterns of different vowels is viewed as arising primarily from differences in vocal tract size. Two types of procedures have been proposed to deal with this problem. Following the terminology of Ainsworth (1975), these will be called intrinsic versus extrinsic normalization. Intrinsic normalization procedures may be regarded as reducing F1×F2 overlap by means of exploiting relationships of these formants with F0 and/or higher formants within a single syllable. Extrinsic normalization reduces overlap by using information that is spread across a speaker's entire vowel system, e.g., by reformulating absolute formant frequencies as proportions of a speaker's formant frequency ranges (Gerstman, 1968).
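As a minimal sketch of the extrinsic idea just described, the Python fragment below re-expresses each formant value as a proportion of one speaker's own formant range, in the spirit of the Gerstman (1968) proposal. The function name `range_normalize` and the formant values are hypothetical and chosen only for illustration; they are not data from any of the studies cited, and the fragment is not a reconstruction of Gerstman's actual procedure.

```python
def range_normalize(formants_hz):
    """Rescale one speaker's formant values to proportions of that speaker's range.

    The lowest value maps to 0.0 and the highest to 1.0.
    """
    lo, hi = min(formants_hz), max(formants_hz)
    if hi == lo:
        raise ValueError("need at least two distinct values to define a range")
    return [(f - lo) / (hi - lo) for f in formants_hz]


# Hypothetical F1 values (Hz) for four vowels of one speaker.
speaker_a_f1 = [270.0, 400.0, 530.0, 730.0]
# A second hypothetical speaker whose formants are uniformly 30% higher.
speaker_b_f1 = [1.3 * f for f in speaker_a_f1]

norm_a = range_normalize(speaker_a_f1)
norm_b = range_normalize(speaker_b_f1)
```

Because the second hypothetical speaker is a uniformly scaled copy of the first, the two normalized lists coincide. Real speakers would not be expected to match this exactly.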
Context-dependent overlap in F1×F2 patterns is traditionally viewed as the acoustic manifestation of coarticulatory effects in speech production. Roughly speaking, articulators are viewed as being pulled away from their context-free vowel targets by mechanical and neuromuscular overlap. At least two explicit proposals (Broad, 1976; Kuwabara, 1985) have been made in the literature as to how such coarticulatory effects might be "undone" in pattern recognition procedures.

J. Acoust. Soc. Am. 85 (5), May 1989
The procedures sketched above are "data analytic" rather than "perceptual" in the sense that they deal with reliable separation of categories based on data from production measurements only. Although this is a worthy practical end in itself, a detailed correspondence between the output of such methods and listeners' performance must be demanded before even tentative perceptual validity can be claimed. In the following discussion, I will attempt to assess these methods from the point of view of their qualitative and quantitative compatibility with listeners' identification behavior. Before examining these arguments in detail, it is useful to review some perceptual results based on experiments with natural speech.
I. SPEAKER AND CONTEXT EFFECTS FOR NATURAL SPEECH
Table I summarizes error rates in four experiments that involve speaker-dependent effects in the perception of naturally produced vowels. The absolute magnitude of errors varies considerably depending on a number of additional factors. However, in all cases there are significantly fewer errors for stimuli that are presented in a "blocked" speaker condition (speaker identity held constant for a full set of vowels) compared to a "mixed" speaker condition (speaker identity varying randomly from trial to trial).

0001-4966/89/052088-26$00.80 © 1989 Acoustical Society of America

TABLE I. Error rates for vowel identification by listeners for blocked (speakers segregated) and mixed (speakers randomized) conditions. In all four cases, there were significantly lower error rates in the blocked condition.

                   Speaker condition
Stimulus type      Mixed (%)    Blocked (%)    Source
/V/                43           31             Strange et al. (1976)
/pVp/              17           10             Strange et al. (1976)
/V/                 5            4             Assmann et al. (1982)
Gated /V/          14           10             Assmann et al. (1982)

Table II summarizes error rates for vowels presented in isolation versus consonantal context. Some experiments show significant advantages for CVCs while others do not. The fluctuation in error rates for isolated vowels is striking. While some of the experiments presented by Strange and her colleagues have shown alarmingly high error rates for isolated vowels, and vastly improved performance in consonantal context, experiments from a number of labs (Kahn, 1977; Macchi, 1980; Assmann et al., 1982) have shown considerably higher performance on isolated vowels and little or no advantage for consonantal contexts. A case has been made by some of these researchers, including Assmann et al. (1982; see also Diehl et al., 1981) that these advantages might stem in part from extraneous, task-related effects such as orthographic interference. The column labeled "response" indicates the nature of the task required by subjects. "PVP" indicates that listeners were required to mark off words and pseudowords, spelled in English orthography, such as "peep, pip, pep ...." "HVD" and "KVK" indicate answer sheets analogously spelled with "h ... d" and "k ... k(e)" words. "Rhyming" indicates that the answer sheet contained words that were not identical with the stimuli, but merely rhyming words (or in the case of isolated vowels, near rhymes ending in /t/). "Spoken" indicates responses that were produced verbally by subjects and later transcribed by trained listeners. [The blocked HVD and blocked spoken responses reported for Assmann et al. (1982) were recorded simultaneously, i.e., listeners gave spoken responses and marked down their judgments on HVD answer sheets.] "Monitoring" involves having listeners monitor for a single specific vowel category in a single session, e.g., to respond "yes" or "no" depending on whether the presented stimulus contained the vowel /i/.

In spite of the evidence for orthographic compatibility effects, Strange and Gottfried (1980) provide good evidence that more than task variables are involved. They report a significant advantage for /kVk/ syllables (7% errors) over isolated vowels (28% errors). Note here that this advantage is considerably less than some of the earlier reports. Rakerd et al. (1984) also find a very small but reliable consonantal context advantage in their vowel monitoring task (4% vs 5%) although they acknowledge that task variables can greatly affect apparent error rates.

The following conclusions will be drawn from this and related data: First, concerning context effects, isolated vowels are not by their nature impoverished stimuli; rather, in many conditions they are well identified. Therefore, extreme theories of cospecification of vowels by consonantal context must be rejected. On the other hand, as Strange et al. (1983) point out, there are never any large disadvantages for vowels in consonantal context as might have been expected from some "target" theories. Furthermore, stimuli in consonantal context can have a reliable advantage, even when task variables are carefully controlled. Second, concerning speaker effects, even speaker-randomized isolated vowels are often well identified. Thus extreme theories of relational vowel space normalization must be also rejected. Nonetheless, there appear to be reliable advantages that accrue from listening to syllables of a single speaker. Recent work by Mullennix et al. (1989) shows that similar advantages are robust and persist in a variety of conditions for the identification of real words.
Studies of error rates in natural speech are important in keeping us in touch with the real world. However, although strong circumstantial cases can sometimes be made, error rate studies rarely provide unequivocal evidence as to precisely which stimulus properties are responsible for the observed differences in perception. Thus, although a blocked speaker condition leads to lower error rates, it is not clear just which features of a voice a listener "tunes in to." Similarly, we cannot tell which aspects of the signal lead to advantages for consonantal context. For detailed illumination of
TABLE II. Error rates for vowels in isolation and in consonantal context. Blocked = speakers segregated; mixed = speakers randomized; single = single speaker. See text for a description of the response tasks.
Speaker
condition
Mixed
Blocked
Mixed
Mixed
Blocked
Blocked
Mixed
Mixed
Single
Response
task
PVP
PVP
KVK
XV(T)/XVK
Context(%)
t#V4•/
tCVC/
43
31
28
17•
10•
7
Source
StrangeetaL, 1976
Strangeet aL, 1976
Strangeand Gottfried, 1980
Strangeand Gottfried, 1980
19
5
17
15
Assmann et al., 1982
Simultaneous
5
5
Assmannet aL, 1982
responses
PVP
9
4•
Assmann et aL. 1982
HVD
11
8
5
4'
Assmann etaL, 1982
Rakerd et aL, 1984
HVD
Spoken
Monitoring
• Error rates for CVCs significantly lower than for isolated Vs.

these problems, we must rely on studies that include a detailed specification of signal parameters.
II. FORMANT FREQUENCY VARIATION AND VOWEL QUALITY
The traditional position will be adopted that F1 and F2 are the primary determinants of vowel quality.¹ From this perspective (essentially that of Chiba and Kajiyama, 1941; Joos, 1948; Peterson, 1961; Ladefoged, 1967; Assmann et al., 1982; Nearey and Assmann, 1986; Miller, 1989), the classic puzzle has been to try to deal with formant frequency overlap between categories. One of the most common ways to attempt to reduce overlap is through the use of normalization procedures. The term "normalization procedure" is used here simply as a label for explicit methods that attempt to factor out systematic, but phonetically nondistinctive, covariation in signal properties, and thus to reveal more nearly invariant patterns separating phonetic categories. However, many of the theories discussed below have been couched in blatantly psychological terms. In my view, the experiments of the type discussed below provide strong evidence only about what types of information are important, and not precisely how that information is used by listeners.²

It is useful to consider the approximate range of variation in formant frequencies induced by several different sources. In the following discussion, ranges of variation will be generally calculated as a percentage change from some baseline value, or more precisely as

$$\%\ \text{change} = 100\left[\left(x / V_{\text{ref}}\right) - 1\right], \tag{1}$$

where x is the modified value and V_ref is the baseline or reference value.

Perhaps not surprisingly, the largest single source of variation is vowel identity itself. Here, F1 and F2 show a range on the order of 170%-200% among the vowels of a single speaker. In the male averages in Peterson and Barney (1952), for example, the F1 range is from 250-750 Hz, and F2 from 840-2290 Hz. As another benchmark, consider that the spacing of F1 ranges from about 38%-44% for adjacent vowels in the front vowel series /i ɪ ɛ æ/ in the Peterson and Barney female³ averages, and the F2 spacing for /æ/ vs /ʌ/ is about 46%.

The next largest source, on the average, is speaker-dependent variation. The range of within-category variation is on the order of 30% when comparing the formants of children with those of adult males. The Peterson and Barney (1952) children's averages show ranges of 370-1030 Hz in F1 and 1060-3200 Hz in F2. Variation can be as large as 100%, compared to male averages, if infants' vocalizations are considered. Figure 9.3 of Lieberman and Blumstein (1988) shows data from a single infant with an F1 range of about 450-1400 Hz and an F2 range of 1800-4000 Hz (see also Buhr, 1980). Although tendencies toward nonuniformities in scale factors have been noted by Fant (1973), speaker-dependent effects apply to all vowels almost uniformly (Nordström and Lindblom, 1975; Nearey, 1978).

Consonantal context and reduction effects vary considerably from vowel to vowel, and possibly from speaker to speaker, but on the average they are smaller than child versus male speaker effects, at least for obstruent contexts.⁴ An inspection of Figure 6 of Stevens and House (1963) for vowels in varying consonantal frames reveals that context-induced variation in F1 is less than about 25 Hz for all vowels except /n/, where a range of about 90 Hz or 13% of the null context F1 value (720 Hz) is found. Variation in F2 is as large as 400 Hz for /u/ and 200 Hz for /ʊ/, corresponding to about 40% and 20%, respectively, of their null context F2 values (820 and 1040 Hz). For the other vowels studied, F2 variation is on the order of 100 to 150 Hz.
The Stevens and House data represent measurements of full duration stressed vowels of three male speakers of English. More extreme formant frequency variation has been reported by Lindblom for a single speaker with changes in stress and prosodic factors as well as consonantal context. Lindblom's formulas 3 to 5 and 7 to 9 (together with data from his Tables I and II) have allowed the compilation of estimated magnitudes of change in F1 and F2 values from isolated vowel targets to midpoints of short (100 ms) CVC syllables. These calculations are summarized in Table III. To the best of my knowledge, the most extreme context-dependent effect ever reported for stop-consonant bounded vowels is about 600 Hz in F2, corresponding to about 71% of the isolated vowel target, for the /dad/ syllable shown in Table III. This resulted from the change of a stressed isolated vowel to one in unstressed /dVd/ context.
The foregoing discussion has given some indication of the magnitude of speaker and consonantal context effects on vowel formant frequencies. In the following sections, an attempt will be made to relate these effects from measurements of production data to changes in listeners' behavior in speech perception experiments.
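The percentage figures quoted in this section all follow from Eq. (1). The short Python sketch below applies it to a few of the values given in the text and in Table III; the function and variable names are illustrative choices, not part of the original analysis.

```python
def percent_change(x, v_ref):
    """Eq. (1): percentage change of a modified value x from a reference v_ref."""
    return 100.0 * (x / v_ref - 1.0)


# Peterson and Barney (1952) male averages: F1 spans 250-750 Hz, F2 spans 840-2290 Hz.
f1_vowel_range = percent_change(750.0, 250.0)
f2_vowel_range = percent_change(2290.0, 840.0)

# Stevens and House (1963): a 90-Hz F1 range relative to a 720-Hz null-context value.
f1_context_range = percent_change(720.0 + 90.0, 720.0)

# Table III, last /dVd/ row: F2 target 690 Hz, estimated midsyllable value 1178 Hz.
f2_extreme_context = percent_change(1178.0, 690.0)
```

Under these inputs the two vowel-identity ranges come out at roughly 200% and 173%, the Stevens and House F1 figure at about 13%, and the extreme /dVd/ case at about 71%, in line with the values stated above.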
III. SPEAKER-DEPENDENT VARIATION AND VOWEL SPACE NORMALIZATION
There have been two main approaches to the problem of speaker-dependent overlap in the F1×F2 space. Both approaches have long histories (see Chiba and Kajiyama, 1941; Joos, 1948; Ladefoged, 1967; Peterson, 1961; Miller, 1989). Following the terminology of Ainsworth (1975), these will be labeled theories of intrinsic versus extrinsic specification.
Pure intrinsic specification assumes that all information necessary to identify a vowel is contained within the vowel itself. Approaches of such researchers as Miller (1953), Peterson (1961), Miller (1984, 1989), and Syrdal (1984) fall into this general category. Although it is possible to formulate such approaches as normalization procedures, the term is rarely used by this group. Instead, the invariance problem is deemed not to exist when the correct parametric representation of spectral properties of vowels is considered. Overlap in the F1×F2 plane is viewed as the result of looking at the wrong two-dimensional projection in the wrong space. When certain transformations of F0 and the F pattern are employed, the overlap is believed to be largely eliminated. Typically, these transformations involve a nonlinear frequency warping transformation (log, mel, Bark, or modified Bark) followed by some simple linear transforms (see Appendix A).
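To make the "warp, then apply a simple linear transform" recipe concrete, the sketch below warps frequencies to a Bark scale and then takes differences between adjacent components within a single token. It assumes the widely used analytic approximation to the Bark scale from Zwicker and Terhardt (1980); the choice of adjacent-component differences as the linear step, the function names, and the example token are assumptions made for illustration and do not reproduce any particular author's model.

```python
import math


def hz_to_bark(f_hz):
    """Analytic Bark approximation (assumed form, after Zwicker and Terhardt, 1980)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


def intrinsic_features(f0, f1, f2, f3):
    """Bark-scale differences between adjacent components of one vowel token.

    Returns (F1 - F0, F2 - F1, F3 - F2) in Bark. Only information inside the
    token is used, which is what makes the representation "intrinsic".
    """
    z0, z1, z2, z3 = (hz_to_bark(f) for f in (f0, f1, f2, f3))
    return (z1 - z0, z2 - z1, z3 - z2)


# Hypothetical token (Hz); these are not measurements from any cited study.
example = intrinsic_features(130.0, 500.0, 1500.0, 2500.0)
```

Whether such a representation actually removes overlap between categories is an empirical question that the sketch does not address.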
Pure extrinsic specification, on the other hand, assumes
TABLE III. Deviations from steady-state targets for F1 and F2 of 100-ms CVC syllables as predicted from formulas and tables in Lindblom (1963). F1t and F2t are target values from isolated vowels; F1c and F2c are estimated midsyllable values.

/CVC/   F1t (Hz)  F1c (Hz)  Dif. (Hz)  % change   F2t (Hz)  F2c (Hz)  Dif. (Hz)  % change
bib       325       325         0         0.0       2200      2004      -196       -8.9
beb       515       463       -52       -10.0       1925      1784      -141       -7.3
bvb       350       350         0         0.0       1925      1796      -129       -6.7
b•b       500       454       -46        -9.2       1625      1561       -64       -4.0
b4b       425       407       -18        -4.3       1125      1229       104        9.3
bab       760       618      -142       -18.6       1275      1266        -9       -0.7
bab       515       463       -52       -10.0        800       913       113       14.2
bob       370       370         0         0.0        690       831       141       20.4

did       325       325         0         0.0       2200      1983      -217       -9.9
ded       515       422       -93       -18.1       1925      1771      -154       -8.0
dYd       350       350         0         0.0       1925      1802      -123       -6.4
ded       500       417       -83       -16.6       1625      1634         9        0.6
d4d       425       392       -33        -7.8       1125      1462       337       30.0
dad       760       504      -256       -33.7       1275      1480       205       16.1
dad       515       422       -93       -18.1        800      1186       386       48.2
dad       370       370         0         0.0        690      1178       488       70.7

gig       325       325         0         0.0       2200      2214        14        0.6
geg       515       421       -94       -18.3       1925      2110       185        9.6
gYg       350       350         0         0.0       1925      2052       127        6.6
g•g       500       416       -84       -16.8       1625      1887       262       16.1
ggg       425       391       -34        -7.9       1125      1326       201       17.9
gag       760       501      -259       -34.1       1275      1780       505       39.6
gag       515       421       -94       -18.3        800       977       177       22.1
gag       370       370         0         0.0        690       869       179       26.0
that a frame of reference is established from information that is distributed across the vowels of a single speaker, e.g., that there is a transsyllabic specification of vocal tract size or formant ranges. This is the approach of Joos (1948), Ladefoged and Broadbent (1957), Ladefoged (1967), Gerstman (1968), Nordström and Lindblom (1975), and Nearey (1978).

Varieties of both approaches have been incorporated in statistical pattern recognition models, with generally successful results (better than 90% correct identification) on the individual data from the study of Peterson and Barney, 1952. [See Nearey (1978), Hindle (1978), and Disner (1980) for comparisons of some extrinsic approaches; Nearey et al. (1979), Assmann et al. (1982), Syrdal (1984), and Hillenbrand and Gayvert (1987) for at least some comparison of intrinsic and extrinsic approaches.] The range of issues involved is complex and a full discussion is not possible here. However, an outline of some of the difficulties is provided in Appendix A.

One of the problems in assessing the differences between intrinsic and extrinsic factors in pattern recognition studies is that the two may be quite strongly correlated in production data. This can be illustrated by considering relationships in the data of the Peterson and Barney (1952) study. When all frequency values are first transformed to a natural log scale [i.e., ln(Hz)], the pairwise correlations between subject means among F0, F1, F2, and F3 all range between 0.82 and 0.87 (using the 76 individual subjects' data and averaging each formant over all vowels). For grouped data, the increases between males' and children's means are nearly equal for F1, F2, and F3 on the natural log scale, ranging between 0.2964 and 0.3206. This corresponds to a nearly uniform increase of 35% to 38% on a linear (Hz) scale. All other things being equal, such a uniform scaling of formant frequencies is to be expected with a change in vocal tract length (Nordström, 1975). On the other hand, F0 rises considerably faster between the two groups, namely, 0.7129 on the log scale or 104% in hertz. These values correspond roughly to Ainsworth's (1975) observation that formant frequencies of children are about 30% higher than those of adult males, while fundamentals differ by about 100% (1 oct). Relationships of this kind are also the basis of J. D. Miller's inclusion of the exponent 1/3 on F0 in his formula for the "sensory reference" (personal communication), since on a log scale, formant frequencies rise at about one-third the rate of the fundamental.⁵ The values of 1 oct in F0 and 30% in formant frequencies will be used below as benchmarks against which to compare shifts observed in perceptual experiments.

Relationships of this type have led researchers, such as Fujisaki and Kawashima (1968), Holmes (1986), and Ainsworth (1975), to suggest approaches to normalization that might be viewed as mixtures of the two extremes described above. For example, vowel internal information could serve to specify a speaker frame of reference, e.g., vocal tract size, perhaps in conjunction with external factors. Ryalls and Lieberman explicitly suggest that "... average fundamental plays a secondary role in establishing the normalization factor ..." (1982, p. 1633). On the basis of a priori perceptual considerations and experimental evidence, there are a number of arguments that can be brought to bear for and against both pure approaches.
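The conversion between the natural-log differences and the linear percentages quoted above can be checked directly. The following Python sketch uses only the three log-scale values given in the text; the helper name `log_diff_to_percent` is an illustrative choice.

```python
import math


def log_diff_to_percent(d):
    """Convert a difference on the ln(Hz) scale to a percentage increase in Hz."""
    return 100.0 * (math.exp(d) - 1.0)


# Male-to-child increases on the natural log scale, as quoted in the text.
formant_low = log_diff_to_percent(0.2964)    # smallest formant increase
formant_high = log_diff_to_percent(0.3206)   # largest formant increase
f0_increase = log_diff_to_percent(0.7129)    # fundamental frequency increase

# Rate at which formants rise relative to F0 on the log scale.
ratio_low = 0.2964 / 0.7129
ratio_high = 0.3206 / 0.7129
# The same ratio for the rounded benchmarks of 30% in formants and 1 oct in F0.
ratio_benchmark = math.log(1.30) / math.log(2.0)
```

The first three values reproduce the 35% to 38% and 104% figures after rounding. The ratios can be compared with the one-third exponent mentioned in connection with Miller's "sensory reference".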
A. Extrinsic specification

The strongest perceptual argument against pure extrinsic specification is that high identification rates are found for vowels (including gated vowels where spectral change is minimal) even when different speakers' voices are randomly mixed (Assmann et al., 1982). This is the limiting case for what might be termed "the bootstrap problem" for extrinsic specification: If every vowel is relative to every other, how can we ever get into the system? (See Peterson, 1961.)

A partial escape from the bootstrap problem might be found in universal constraints on F1×F2 patterns. As noted by Nearey (1978, pp. 95-100), given some general constraints on the shape of a possible single speaker vowel space (the "vowel triangle") and on the nature of speaker differences (e.g., uniform scaling of formant frequencies), certain vowels could not overlap in the F1×F2 space. Such vowels, or other conventional "conversation starters" of known phonetic quality, might then serve to "calibrate" the rest of the system (Lieberman, 1984; Joos, 1948).

Although experiments by Strange et al. (1976) with precursor vowels in natural speech have failed to provide any evidence for extrinsic normalization, precursor vowels of known quality have been quite successful in studies with synthetic speech. Ladefoged and Broadbent (1957) demonstrated that the categorization of a fixed set of test vowels could be altered systematically by manipulating the formant ranges of a synthetic carrier sentence: "Please say what this word is __." They also showed that the nature of the induced perceptual shifts was consistent with the changes of the relative position in F1×F2 space of the test vowels with respect to vowels in the carrier. Ainsworth (1975) demonstrated a similar effect on an entire continuum of synthetic vowels, using changes in formant ranges for synthetic /i a u/ precursors. He found that a 30% increase in formant frequencies of the precursors resulted in a 3% to 6% rise in "center of gravity" measurements of the F1×F2 response areas for the test vowels. Ainsworth suggests, however, that the method used might underestimate actual response shifts by a factor of 2. Nearey (1978; see Lieberman, 1984 for a summary) demonstrated that the change of a single "context vowel" from formant frequencies near those of Peterson and Barney (1952) male average /i/ to average children's /i/ values was sufficient to significantly alter the response regions of all vowel responses in a large F1×F2 continuum. Nearey estimated the change induced by a 35% increase in F1 and a 45% increase in F2 of the context /i/ vowel to be on the order of 15% in F1 and 20% in F2 for the test vowels in the continuum.

In addition to this evidence from two formant stimuli, there are several other positive indications for the extrinsic specification of vowels. Assmann et al. (1982) provide evidence for the relevance of extrinsic normalization procedures in modified natural speech (100-ms sections gated from isolated vowels), though not for full duration isolated vowels. Remez et al. (1987) report results quite similar to those of Ladefoged and Broadbent for "sinusoidal voices," i.e., frequency-modulated sinusoids replacing the first three formants of natural speech patterns. Finally, Dechovitz (1977a, cf. 1977b) finds striking evidence for extrinsic normalization in a natural speech setting very reminiscent of the Ladefoged and Broadbent (1957) experiment. The Dechovitz experiment involves syllables from an adult male's voice, which are imbedded in a carrier phrase of a 9-year-old child.

B. Intrinsic specification

1. Fundamental frequency

Some difficulties arise regarding vowel quality specification based on F0. Some of these objections are either a priori or anecdotal, rather than empirical, but even they cannot be dismissed out of hand. Theories like those of J. D. Miller, Syrdal, or Traunmüller that posit a strong link between F0 and F1 run into theoretically murky waters from the point of view of source-filter independence on the production side. On the perceptual side, such a link appears to run afoul of the traditional phonetic distinction between features of phonation and features of articulation that are generally viewed as resulting in essentially independent perceptual properties.⁶ Dudley's (1939) early demonstrations with source-filter vocoders, whereby spectrum envelope information from human speech is preserved, while arbitrary source spectra are substituted, were taken by Dudley as confirming the perceptual independence of source and filter characteristics and led him to develop the concept of "the carrier nature of speech" (Dudley, 1940).
There are other indications of relative source-filter independence in the perceptual domain. Vocoded speech with altered (usually raised) formant ranges is now sometimes used on news broadcasts with anonymous informants where it appears to result in highly intelligible signals in which the normal F0 and formant relations are altered substantially. Helium speech at normal atmospheric pressure remains highly intelligible when formant frequencies are doubled, although fundamental frequencies are largely unaffected (Morrow, 1971; Beil, 1962; see also Barany, 1937). In the early observations of Chiba and Kajiyama (1941, Chap. XIII) on vowel identification from phonograph records, playback at several speeds perturbs the normal relations between F0 and the formant frequencies. They report that intelligibility of vowels remains high over a range of about 1.5-0.8 times normal record speeds for adult male voices.

In natural speech, problems associated with rather large fluctuations of fundamental frequency within the speech of a single speaker (up to an octave; Lieberman, 1967) should also give pause. Do vowel F1s really span a 1/3-oct range over a single speaker's intonation contour to maintain a constant distance from F0? (Evidently not completely, although there may be a slight correlation of F0 and F1 over a single speaker's intonational range; see Syrdal and Steele, 1985.) And what about tone languages? If F1 fails to keep pace with contour tones, do diphthongs result?⁷ Finally, whispered speech, with no fundamental, is relatively intelligible, although less so than phonated speech (see, e.g., Kallail and Emanuel, 1984).
In natural speech,partial F0 versusformant independenceis shown by unusualvoiceslike thoseof Julia Child
(the French Chef), with a rather high fundamental,but low
formantfrequencies
on theonehand;andPopeye(the voice
TerrariceM. Nearey:Vowelperception
2092
of JackMercer), on the other, wherethe oppositesituation
error rates. However, becauselowering fundamentalfre-
occurs.
a Thereis othersupportfor relatively"loosecoupling"between
F0 andF 1 fromtheliterature.Thus,forex-
quencygenerallyhadlessof a detrimentaleffectthanraising
it, the authorsarguethat the densersamplingof the spectrum envelopeassociated
with low fundamentals
may lead
to moreaccurateformantfrequencyextractionby listeners.
Syntheticexperimentsinvolvingformantcontinuaalso
showclearevidencefor effectsof F0 on vowelcategoriza-
ample,data from a studyof sungvowelsby Gottfried and
Chew (1986) indicatethat vowelsproducedwith a full octavechangein fundamental
frequency(from 130to 260Hz)
showonlyabouta 10%increase
in F 1,whilefalsettovowels
at 260 Hz show lessthan 5% increase over low fundamental
tion. Miller (1953) estimatesa shift of 80 Hz ( 16% ) in the
chestregistervowels.Eventhelargerof thesetwochanges
is
only abouthalf the sizethat might havebeenexpectedby
interpolationfrom the rule of thumb notedabove(i.e., a
30% changein F 1 for a 100% changein F0}. In a studyof
thespeechof preadolescents,
BennettandWeinberg(1979)
presentdata indicatingthat formantfrequencydifferences
betweenboysand girls may actuallybe larger than fundamentalfrequency
values.ThusF 1 for thevowel/a•/is 10%
higherfor girls than boys,while the corresponding
F0 increaseis only about3% (seetheir Table II).
In spiteof theexistence
of"slippage"in therelationship
betweenF0 andformantrangesin naturaldata,thereisother
evid.
encefor cleareffectsofF0 in vowelperceptionfor syntheticspeech.
Therehavebeentwostudiesbasedon theaverageformantfrequencies
of tenvowelcategories
of Peterson
and Barney (1952) in which fundamentalfrequencywas
systematicallyvariedin combinationwith formantpatterns
for differentspeakergroups.Lehisteand Meltzer (1973)
crossedfundamentalfrequenciesfor males,females,and
childrenwith formantpatternsfor the samegroups.Results
/a-,x/F 1boundaryfor a 1-octshift ( 144to 288) in F0 for an
F I XF2 continuumspanninga rangeof back and central
from their Table 7 indicate that listeners' vowel identifica-
tion rates for the female vowel set were better (82%) for
vowels.He findsa smaller shift of 30 Hz (6%) for the/i-e/
F 1 boundaryin a front vowelseriesfor the samefundamental frequencies.
Fujisakiand Kawashima(1968) explorea
largerangeofF0 values,from 130to 350Hz fortwoseriesof
continuaconsisting
of correlatedF 1• F 2 changes.
Interpolationfromtheirgraphsto a 1-octshiftin F0 from 130to 160
Hz indicates shifts of 14% in F 1 for their/u-e/series
and
21% for their/o-a/series. Traunm/iller ( 1981) presentsa
seriesof rathercomplexexperiments
exploringthe relation
between F0 and F1 in perception. He concludes that F1 boundaries are strongly affected by F0, and that boundaries between different phonetic vowel height classes correspond to nearly equal "tonality" differences between F0 and F1 when a Bark scale is used. Using the Bark scale of Zwicker and Terhardt (1980) or the modified Bark scale of Syrdal and Gopal (1986, which was based on Traunmüller's work), this implies a 130- to 150-Hz (26% to 30%) increase for a boundary near 500 Hz for an octave increase in F0 from 130 to 260 Hz. Holmes (1986) includes a pair of experimental conditions (conditions 1 and 2) where an upward shift of 1 Bark in F0 for an F1 × F2 vowel continuum leads to increases of 0.2 to 0.5 Bark in center of gravity measurements in F1 response areas, as estimated from his Figure 16.1.

As for naturally produced speech, evidence for the perceptual role of F0 to formant frequency relationships is sparse. I am aware of no studies of the intelligibility of voices like those of Julia Child or Popeye. However, Gottfried and Chew (1986) show an increase in error rate as fundamental frequency is increased by their countertenor voice, although the spectrum envelope sampling effect suggested by Lieberman and Ryalls (1982) above cannot be ruled out here. There is some evidence in Assmann et al. (1982) that the inclusion of fundamental frequency measurements in a statistical pattern recognition model increases correlation of the predictions of a pattern recognition model with listeners' judgments (at least in speaker-randomized conditions; see their Table V). Furthermore, Assmann (1979) reports that inclusion of fundamental frequency (in addition to formant measures) in a regression model significantly improves correlations with two phoneticians' judgments of vowel height and advancement of the same data.

stimuli produced with the matching female fundamental than for those using either the male or children's F0 values (54% and 43%, respectively). For the male formant set, however, the male and female F0s produced about the same identification rate (76% and 77%, respectively), while identification rates were lower (43%) using the children's fundamental frequencies. For the children's formant patterns, the highest correct identification rate (77%) actually occurred with the female F0 followed by the children's and male F0 (68% and 44%, respectively).

In a similar experiment using synthetic vowels based on Peterson and Barney averages, Ryalls and Lieberman (1982) also show that listeners' error rates are affected by changes in fundamental frequency. However, the pattern of change in errors is again complex. Error rates for stimuli with formant frequencies based on male averages show no significant increase when lowered from the average male F0 of 135 to 100 Hz, but errors are significantly higher when the fundamental is raised to 250 Hz. For vowels based on female formant averages, error rates are also increased when the fundamental is raised from the average female 185 to 250 Hz, although the latter value is noted by Ryalls and Lieberman as being "... within the normal range of female speakers" (1982, p. 1632). When the fundamental was lowered to 100 Hz, error rates also increased for the female average formant set, although the error rate was still significantly lower than that of the 250-Hz F0 stimuli. The authors note that large mismatches between fundamental and formant frequencies compared to normal speech result in increased

2. F3 and the higher formants

As Fujisaki and Kawashima (1968) note, F3 varies relatively little from vowel to vowel (rhotacized vowels excepted) but considerably from subject to subject, and hence might serve well as a reference for vocal tract length. In this way, it might behave as a kind of intrinsic source of extrinsic information (vocal tract length) that could pervade the
J. Acoust. Soc. Am., Vol. 85, No. 5, May 1989. Terrance M. Nearey: Vowel perception. 2093
whole vowel system and complement F1 × F2 range information. On the negative side of this issue, Ladefoged (1967) presents some graphic evidence for measurements of cardinal vowels that he interprets as unfavorable to F3-based normalization. Furthermore, for nonlow back vowels, the amplitude of F3 may be so low that it is below threshold, though this fact in itself might be useful for vowel identification.

On the positive side, Fujisaki and Kawashima (1968) find clear F3-based effects, although their magnitude depends on other factors. Interpolating from their Figure 6, for the /u-e/ boundary, they find an increase of about 18% in the F1 value of the boundary for our benchmark 30% increase in F3. For the /a-o/ boundary (their Fig. 7), however, only about a 3% increase occurs for the same change.

Holmes (1986) also shows evidence for F3-related effects. Comparing his conditions 2 and 6, a 1-Bark increase in F3 of itself leads to no more than about a 0.25-Bark increase (for two vowels, there are actually small decreases in F2) in the center of gravity of F1 or F2 responses for any of the vowels displayed in his Figure 16.1. In percentage terms, the F3 increase in the stimuli is about 18%, while the largest observed response shifts are on the order of 4%.

There are also some apparent interactions of formant amplitude and spectral tilt on F3 effects. Fujisaki and Kawashima find that for noise excited vowels with a +6-dB/oct noise source, a 30% increase in F3 (interpolating from their graphs) leads to about a 26% increase in F1 for the /e-u/ boundary, and about a 16% increase for /a-o/. However, when the spectral roll-off of the noise is changed to −12 dB/oct (similar to that for their voiced stimuli), smaller changes comparable to those in the corresponding voiced stimuli occur. This would seem to suggest that an increase in the relative amplitude of F3 might increase the size of the F3 effect. However, Fujisaki and Kawashima also report that increasing the effective frequency of their higher pole correction circuit by as much as 60% has no measurable effect on the /u-e/ boundary (for buzz excited speech), although such a change should attenuate F3 by about 9 dB and F4 by about 20 dB (see footnote 9).

Finally, there is evidence of a reversal of the spectral tilt effect in Holmes (1986). Holmes noted that attenuation of the F3 region by 15 dB generally leads to less harsh sounding voices that were more likely to be accepted as synthetic female speakers. He actually found a slight positive shift in F1 × F2 for vowel response areas with the decreased F3 amplitude. He notes: "... [the] variation of amplitude of the F3 region between condition 2 and 3 seemed to make only a very small difference to vowel labeling, in spite of the differences in naturalness and the tendency for one to sound female and the other male" (1986, p. 357).

3. Combinations of F0 and F3 increases

Data from both Fujisaki and Kawashima (1968) and Holmes (1986) show evidence that concomitant changes in F3 and F0 lead to larger increases than either alone. In the case of Fujisaki and Kawashima (again interpolating to the benchmark interval), an octave increase in F0 in combination with a 30% increase in F3 yields about a 30% increase in both the /a-o/ and /u-e/ boundaries. For Holmes' data, the largest change observed from the baseline male condition occurs with a concomitant increase of F0 and F3 by 1.5 Bark, coupled with a 15-dB attenuation of the F3 region. This corresponds to a 137% increase in F0 (or about 1.25 oct) and a 30% increase in F3. Comparing conditions 1 and 7 on his Figure 16.1, the largest observed increase in F1 is on the order of 1 Bark (21%). The largest F2 increase is about 0.8 Bark (15%).

IV. EXPERIMENT I: EVALUATION OF INTRINSIC AND EXTRINSIC FACTORS IN VOWEL PERCEPTION

As the preceding review indicates, there are a number of unresolved issues in the literature. In view of the complex pattern of evidence, it seems reasonable to contemplate compromise positions, and to consider models that would allow for differential weighting of a number of factors that might vary from situation to situation. Assessing the relative sizes of effects in different experiments is difficult, since different languages (or dialects), different synthesis techniques and different methods of measuring response shifts are involved.

There has been one attempt in the literature, that of Ainsworth (1975), to try to compare extrinsic versus intrinsic effects in a homogeneous experimental environment. In his experiment, Ainsworth estimated that the extrinsic factor, the formant ranges of precursor /i a u/ syllables, had roughly twice the effect of the intrinsic factor F0. Ainsworth used only two formant /hVd/ syllables, so the effect of F3 cannot be assessed. Furthermore, he did not report a factorial breakdown of effects of simultaneous versus separate changes in F0 and precursor vowels. In order to shed more light on the issues, an experiment was conducted in our laboratories with four and five formant isolated steady-state vowels. The present experiment was designed primarily to extend the approach of Ainsworth (1975) by playing off intrinsic and extrinsic factors in a fully crossed experimental design.

A. Stimulus materials

1. Overview of the stimuli

The stimuli were synthesized on an implementation of the Klatt (1980) software synthesizer on a DEC PDP-12 minicomputer at a sampling rate of 12 kHz. The stimuli were steady-state vowels with a duration of 150 ms, consisting of thirty 5-ms frames. Either four or five cascaded formants were used (depending on the value of the higher formants factor as described below). The stimuli were low-pass filtered at 5000 Hz before recording.

This experiment used a fully crossed factorial design with two intrinsic factors: (a) pitch, i.e., fundamental frequency, and (b) higher formants, viz., F3, F4, and, for some conditions, F5; together with an extrinsic factor, F1 × F2 ensemble, involving the formant ranges of a set of context vowels. There were a total of eight experimental conditions, each of which was intended to simulate a single speaker's vowels. The two levels of the pitch factor were 120 Hz for the low value and 270 Hz for the high value. This is slightly larger than the 1-oct difference used by Ainsworth. A simulated falling intonation contour was provided on each of the vowels. The higher formants factor consisted of two sets of F3 and F4 (and for the low value of this factor, F5) frequencies that corresponded approximately to values appropriate for an adult male and of a child, respectively. The two levels of the ensemble factor consisted of two sets of F1 × F2 patterns. One set corresponded to the F1 × F2 space of an adult male speaker and the other to that of a child speaker. Detailed descriptions of each of the factors are given below.
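The fully crossed design can be made concrete with a short enumeration. The sketch below is illustrative only: the factor letters (E for ensemble, P for pitch, H for higher formants) and the "N" label for the all-low condition follow the abbreviations used in this paper, while the function names and the enumeration order are assumptions made for the example.

```python
from itertools import product

# Factor letters follow the paper's abbreviations:
# E = F1 x F2 ensemble, P = pitch (F0), H = higher formants.
FACTORS = ("E", "P", "H")

def condition_label(levels):
    """Label a tuple of booleans (True = factor set high) by its high factors."""
    label = "".join(name for name, high in zip(FACTORS, levels) if high)
    return label or "N"  # "N" stands for "none high" (all factors low)

def all_conditions():
    """Enumerate every low/high combination of the three two-level factors."""
    return [condition_label(levels) for levels in product((False, True), repeat=3)]

conditions = all_conditions()
print(len(conditions), conditions)
```

Crossing three two-level factors gives 2 × 2 × 2 = 8 cells, which is why separate and joint effects of the intrinsic and extrinsic factors can be compared within a single experiment.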
2. The baseline condition

A summary of the eight conditions is provided in Table IV, which also gives abbreviations used to refer to the conditions in the remainder of the text. Individual conditions will be referred to by capital letters indicating the factors that are set high (+) for that condition. The condition where all factors were low (N) corresponds approximately to the formant frequency ranges and fundamental of an average adult male speaker and will be referred to as the baseline or "all factors low" condition. Except as noted, stimulus specifications for all other conditions were derived by a simple multiplication of the frequency values of one or more of the parameters of this condition.

TABLE IV. The eight stimulus conditions for experiment I. Individual conditions are labeled by capital letters indicating the factors that are set high (+) for that condition, except for condition 1, indicated by N (for "none high").

Condition   Abbreviation   Ensemble (F1 × F2)   Pitch   Higher formants
1           N              -                    -       -
2           P              -                    +       -
3           H              -                    -       +
4           PH             -                    +       +
5           E              +                    -       -
6           EP             +                    +       -
7           EH             +                    -       +
8           EPH            +                    +       +

(a) Fundamental frequency. A fundamental frequency contour was provided that was fixed at 120 Hz for the first 8 frames (40 ms) and then exponentially declined to a value of 84% of the initial F0 over the last 22 frames (110 ms). More precisely, the F0 of each frame was

$$F0(i) = 120, \qquad i \le 8, \tag{2}$$

$$F0(i) = 0.99264\,F0(i-1), \qquad i > 8, \tag{3}$$

where i is the frame number.

(b) F3 and the higher formants. In order to provide reasonably natural relationships among F1, F2, and F3, the frequency of F3 was specified as a function of the F1 and F2 values using piecewise linear relations of the type discussed by Broad and Wakita (1977). [See also Sato et al. (1982).] However, rather than using the coefficients reported by Broad and Wakita for their single female speaker, their estimation procedure was applied to the average male data reported by Peterson and Barney (1952) for American English (omitting the vowel /ɝ/) and by Fant (1973) for Swedish. This method is described briefly below.

Two separate multiple regressions were run: one for front vowels (F2 > 1500 Hz) and one for back vowels (F2 ≤ 1500 Hz). The following coefficients were estimated:

$$F3_{\text{front}} = 0.522\,F1 + 1.197\,F2 + 57, \tag{4}$$

$$F3_{\text{back}} = 0.7866\,F1 - 0.365\,F2 + 2341. \tag{5}$$

This procedure leads to a good fit with observed F3's, showing an rms error of 106 Hz, or about 7% of the average F3. Only three of the vowels showed residual errors of greater than 100 Hz.

For synthesis purposes, the boundary between front and back vowels was redefined as the intersection of the two planes specified by Eqs. (4) and (5) (see Broad and Wakita, 1977, p. 1468). This intersection corresponds to the line F2 = 0.17 F1 + 1463. If the F2 of a stimulus was less than this value, it was classed as back and Eq. (5) was used to determine F3; otherwise Eq. (4) was used. Thus redefined, the revised F2 boundary ranges from 1505 Hz for low F1 vowels to 1591 Hz for high F1 vowels. This procedure avoids discontinuities for vowels with adjacent F2 values that happen to straddle the original 1500-Hz front/back boundary. Vowels straddling the revised boundary will have more homogeneous F3 values, because as the boundary line is approached in the F1 × F2 plane, F3's calculated from either (4) or (5) approach a common value.

In the baseline condition, F4 and F5 were fixed at 3500 and 4500 Hz, corresponding roughly to the neutral position (uniform tube) values for a 17.5-cm vocal tract length. Ordinarily, for cascade synthesis at 6 kHz, a sixth formant at 5500 Hz would also be included, and indeed this was done in initial synthesis attempts. However, preliminary listening tests indicated that maintaining the full complement of cascade formants resulted in extreme harshness for the high higher formant stimuli (see below). After some experimentation, it was decided to omit formants above 5450 Hz (i.e., closer than 550 Hz to the folding frequency) in all conditions. The omission of F6 in the baseline stimuli leads to a spectral tilt with a steeper high-frequency roll-off than would occur with a full complement of six formants. However, informal listening tests with five and six formant stimuli indicated that only minor changes in apparent voice quality and no obvious changes in vowel identity resulted from this strategy for any of the low higher formants stimuli. (See Appendix B for an outline of the effects on formant amplitudes.)

(c) The F1 × F2 ensemble. The low level of the ensemble factor consisted of a set of F1 × F2 patterns whose ranges are typical of an adult male speaker (see Appendix B). A deliberate attempt was made to confine the F1 × F2 space to an area consistent with that of a single speaker because of Nearey's (1978) finding of a relationship between the categorization of certain vowel pairs and a decision as to whether both vowels came from the same "apparent speaker" (for a summary, see Lieberman, 1984). The F1 × F2 pattern, or ensemble, was confined to a quadrilateral based on Peterson and Barney (1952) and Fant (1973) male average values. Thirteen equally spaced steps in log F1 and log F2 were used, ranging from 250-750 Hz in F1 and 750-2250 Hz in F2. The resulting pattern is shown in Fig. 1.
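The baseline F0 contour of Eqs. (2) and (3) and the piecewise F3 rule of Eqs. (4) and (5) can be sketched in a few lines. This is an illustrative reading of the equations as printed, not the original synthesis code; the function and parameter names are assumptions. With the decay coefficient as printed, the last frame comes out at roughly 85% of the starting F0, close to the figure quoted in the text.

```python
def f0_contour(f0_start=120.0, n_frames=30, n_flat=8, decay=0.99264):
    """Per-frame F0 in Hz: flat for the first n_flat frames, then a geometric decline."""
    contour = []
    f0 = f0_start
    for i in range(1, n_frames + 1):
        if i > n_flat:
            f0 *= decay        # Eq. (3): F0(i) = 0.99264 * F0(i - 1) for i > 8
        contour.append(f0)     # Eq. (2): F0(i) = 120 while i <= 8
    return contour

def f3_from_f1_f2(f1, f2):
    """F3 in Hz from the piecewise linear rule, split at F2 = 0.17 * F1 + 1463."""
    boundary = 0.17 * f1 + 1463.0
    if f2 < boundary:
        return 0.7866 * f1 - 0.365 * f2 + 2341.0  # Eq. (5), back vowels
    return 0.522 * f1 + 1.197 * f2 + 57.0         # Eq. (4), front vowels

contour = f0_contour()
print(round(contour[-1] / contour[0], 3))
print(f3_from_f1_f2(500.0, 1540.0), f3_from_f1_f2(500.0, 1560.0))
```

Because the split follows the intersection of the two regression planes, stimuli just on either side of the boundary receive nearly the same F3, which is the property the redefined boundary was meant to secure.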
3. The modified conditions

(a) The high ensemble condition (E). The high level of the ensemble factor consisted of the same grid but shifted up by three log steps. This is equivalent to a 32% increase in the formant frequencies, about the same as that used by Ainsworth (1975). The relation of the two ensembles is shown in Fig. 2. The low ensemble stimuli are shown with O's and the high ensemble stimuli with X's. Notice that there is a substantial region of overlap between the two conditions. These vowels will be referred to as the overlapping test vowels. The other vowels in each ensemble will be referred to as nonoverlapping vowels, which include the /i ɒ/ context vowels described below.

In the analyses presented below, responses to the overlapping test vowels will sometimes be considered separately. It should be emphasized that for these overlapping vowels, any comparisons involving a difference of ensemble factor alone (with all other factors equal) involve response patterns to physically identical stimuli in different extrinsic contexts.

The F0 in condition E was exactly as in the baseline condition; F3 was calculated using the F3 formula specified above, applied to the shifted F1 × F2 values. F4 was left at the baseline value for all stimuli except for a small number of cases for high F2 vowels, where it was found that the calculated F3 would approach the F4 baseline. In such cases, F4 was moved to a value 300 Hz above F3. Note that this small additional shift of F4 did not affect any of the stimuli in the overlapping region of the F1 × F2 space that is common to both ensemble conditions (see Fig. 2), but did affect only a few vowel stimuli that were unique to the high ensemble condition.

FIG. 2. Configuration of low and high ensemble stimuli in F1 × F2 plane. Here, O's represent low ensemble context stimuli, X's represent high ensemble context stimuli, and filled symbols represent overlapping test stimuli. (Horizontal axis: F1 in Hz; vertical axis: F2 in Hz.)

(b) The high pitch condition (P). Here, F0 was shifted upward by a multiplicative factor of 2.25, corresponding to nine steps in the logarithmic frequency scale of F1 and F2. All other factors were the same as in the baseline condition.

(c) The high higher formants condition (H). Here, F3 and F4 were shifted upward by three log steps (a multiplicative factor of 1.32) from the corresponding stimuli in the baseline condition, and F5 was omitted from the cascade circuit. Scaling of F5 from the baseline condition resulted in a formant at 5940 Hz (= 1.32 × 4500 Hz) just below the folding (Nyquist) frequency, 6000 Hz. Preliminary investigation indicated that this configuration led to very harsh vocal quality, with unduly high amplitudes in the higher formants (see Holmes, 1986). Indeed, it was found to produce a perceptible high-frequency buzz for vowels in the expected high back region. After some further experimentation, it was decided to limit synthesis of formants to 5450 Hz (i.e., 550 Hz below the folding frequency) for all the stimulus conditions. Informal comparisons of selected stimuli indicated that this strategy greatly improved the naturalness of the high higher formant stimuli, with only minor effects on the voice quality of the low higher formant stimuli (see Appendix B).

(d) Combination conditions (HP, EH, EP, EHP). Combination conditions were produced by combining the single-condition shifts described above. It should be noted that the "all factors high" condition (EHP) had fundamental and formant frequency ranges roughly comparable to those of the Peterson and Barney averages for children. See Appendix B for details.
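The arithmetic behind the shifted grid and the formant cutoff can be checked with a small sketch. It assumes the thirteen log-spaced steps run from 250 to 750 Hz on the F1 axis (a 3:1 range, as stated for the low ensemble) and ignores the quadrilateral that further restricts the stimuli; all names are hypothetical. It illustrates why a three-step shift lands the high ensemble on existing grid values, so that the overlapping stimuli are physically identical, and why the scaled F5 falls above the 5450-Hz limit.

```python
import math

N_STEPS = 13                          # grid points per axis, as stated in the text
STEP = math.log(3.0) / (N_STEPS - 1)  # each axis spans a 3:1 range (e.g., 250-750 Hz)

def log_axis(low_hz, shift_steps=0):
    """Return N_STEPS frequencies equally spaced in log Hz, optionally shifted up."""
    return [low_hz * math.exp((k + shift_steps) * STEP) for k in range(N_STEPS)]

f1_low = log_axis(250.0)                  # low ensemble F1 axis
f1_high = log_axis(250.0, shift_steps=3)  # high ensemble F1 axis
shift_factor = math.exp(3 * STEP)         # three log steps, roughly 1.32

# Axis values that occur in both ensembles (the basis of the overlap region).
shared_f1 = [a for a in f1_low
             if any(math.isclose(a, b, rel_tol=1e-9) for b in f1_high)]

def retained_formants(formants_hz, limit_hz=5450.0):
    """Keep only formants at or below the synthesis limit."""
    return [f for f in formants_hz if f <= limit_hz]

baseline_upper = [3500.0, 4500.0]                  # baseline F4 and F5
scaled_upper = [1.32 * f for f in baseline_upper]  # rounded factor quoted in the text

print(round(shift_factor, 3), len(shared_f1))
print(retained_formants(baseline_upper), retained_formants(scaled_upper))
```

Under these assumptions, ten of the thirteen axis values are shared between the two ensembles, and only the scaled F4 survives the cutoff in the high higher formants condition, consistent with the four-formant stimuli described for that condition.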
FIG. 1. Configuration of F1 × F2 values for low ensemble stimuli of experiment I. (Horizontal axis: F1 in Hz; vertical axis: F2 in Hz.)

B. Subjects and procedures

Fifteen native speakers of Canadian English were recruited from graduate and undergraduate students in linguistics at the University of Alberta. All had at least minimal training in the use of phonetic symbols.

In pilot experiments for some of the conditions, several listeners reported hearing more than one synthetic voice, with some of the vowels appearing to have been produced by a larger "apparent speaker." After some additional informal listening, it was decided to present each of the test vowels in a given condition with a pair of fixed context vowels, drawn from the same condition and corresponding to precursor vowels in Ainsworth's (1975) experiment. The vowels on each trial were presented in the following format: /i ɒ X/ where X is one of the 128 test vowels from the set and /i/ and /ɒ/ represent context vowels drawn from the same set. The /i/ for a given condition was the stimulus in that set with the highest F2 and lowest F1, while the /ɒ/ was the stimulus with lowest permissible F2 at the highest F1 for the set. Thus, in Fig. 2, the /i/ context vowel for the low ensemble set corresponds to the O symbol nearest the upper left corner, while the /ɒ/ corresponds to the O symbol with the highest F1 roughly diagonally opposite the /i/. Preliminary listening tests revealed that the context vowels from a single condition were well identified as the intended vowel categories and they produced consistent speaker size judgments.

The stimuli of each condition were synthesized and randomized, placed in the appropriate /i ɒ __/ context and recorded on cassette tapes using a Sony TC-K61 tape deck. In the recordings, each of the /i ɒ X/ triads was repeated three times on a single trial. This ensured adequate time for judgment and for consideration of both the primary and secondary tasks described below. Each of the subjects heard each of the eight single condition tapes in a different randomized order, over a period of 2 or 3 days, listening on a Sony TCM-737 cassette player and Sony DR-S3 headphones. The data from the entire experiment comprised a total of 15 360 categorization judgments (128 stimuli × 8 conditions × 15 listeners).

C. Listeners' instructions and tasks

Listeners were told that they would hear a series of synthetic vowels consisting of two context vowels and a test vowel and that each set of three vowels would be repeated three times. They were told that the context vowels were intended to be tokens of the categories /i/ and /ɒ/. Their primary task was to decide on the phonetic category of the third vowel in the series. Before their first session, they were allowed to listen to a few items from the beginning of their first tape in order to familiarize themselves with the presentation format. They marked their responses on specially prepared answer sheets that included both key words and phonetic symbols.

In addition, however, it was suggested that listeners, when they felt it appropriate, might mark two additional types of information on their sheets next to each test item. The first was whether the target vowel sounded like a particularly bad (indicated by an X next to the item) or particularly good exemplar (indicated by a check mark) of the chosen category. The second involved listeners' judgments of whether the target vowels appeared to come from the same synthetic speaker as context /i ɒ __/. It was explained that we had intended that the vowels on a single tape should all sound like they were produced by the same artificial voice. They were told, however, that listeners in pilot tests occasionally reported hearing more than one speaker on some of the tapes and that we were interested in finding out more about this. The new listeners were asked to mark an O for "other voice" next to an item if they thought that a particular test vowel sounded like it was produced by a different synthetic speaker than the context /i ɒ __/ vowels.

Subjects were instructed to concentrate on the primary task (identifying the phonemic category of the target vowels) and to mark secondary responses as they saw fit. They were given an index card to help them keep track of their position on their answer sheets. A brief summary of the secondary response categories was written on that card; it read as follows: X: bad example of chosen category; ✓: good example of chosen category; O: other voice.

In addition, after each experimental session, subjects were asked four questions about the voices on the tape they had just heard. (1) Did all the vowels appear to come from the same synthetic speaker? (2) Did the voice you heard speaking the context vowels seem most like a male or female voice? (3) Given the answer to question (2), would you expect the voice of the context vowels to belong to a relatively large speaker, an average sized one, or a small one? (4) If there appeared to be more than one speaker for this tape, how would you describe the sex/size differences in the voices?

D. Results and discussion for secondary responses

The secondary responses were collected as an exploratory measure. Although the all conditions low voice was based on adult male frequency values, and the all conditions high voice was based on values near the Peterson and Barney children's averages, no detailed hypotheses about speaker class judgments were formulated in advance of the experiment. Nonetheless, a descriptive summary of the results indicates fairly consistent patterns that may shed some additional light on the formal analysis of the primary responses detailed below.

1. "Goodness" ratings

Table V presents the percentage of target vowel secondary responses rated as good or bad or left unmarked on the answer sheets. Overall, there does not appear to have been a great deal of difference among the voices on these ratings. Between 38% and 46% of the responses were left unmarked (presumably neither particularly good nor particularly bad representatives of the categories chosen), while from 29% to 37% were marked as "bad," and 25% to 28% were called "good."

TABLE V. Percent responses to target as good, bad, and left unmarked for quality.

Condition   Bad   Unmarked   Good
N           34    43         26
P           30    46         24
H           34    40         26
PH          29    43         28
E           33    41         26
EP          37    37         25
EH          29    44         27
EPH         32    43         25
2. Voice consistency

Analysis of the first post-session question about voice consistency shows that for all the low ensemble conditions (i.e., N, P, H, PH; see also Table V), most of the 15 subjects (from 13 to 15 of 15) thought that they heard only one voice throughout the experiment. For the high ensemble conditions, however, only the all factors high (EPH) condition showed a majority (11/15) of "yes" (consistent voice) responses. The other conditions E, EP, and EH showed only four, seven, and six "consistent voice" responses, respectively. Thus it appears that, while low ensemble voices are heard quite generally as originating from a single speaker, only when all factors are shifted simultaneously will a majority of listeners hear a single voice from the high F1 × F2 ensemble sets. Additional information is available from the percentage of individual target vowels that were marked explicitly as emanating from another voice, hereafter referred to as an intruder voice, shown in Table VI.

TABLE VI. Percentage of individual target vowels judged as coming from an intruder voice, i.e., one judged as a different voice from the /i ɒ __/ context.

Condition    N    P    H    PH   E    EP   EH   EHP
Percentage   2    1    4    2    18   8    3    4

In all but two conditions, intruders constitute less than 5% of the total. Far and away the highest number of intruders occurs in condition E.

An inspection was made of the plots of the total number of intruders for each stimulus in condition E. Those plots revealed that all stimuli judged to have shifted by 5 or more of the 15 subjects occur in the lowest 5 F1 steps (of the 13 in the condition). However, vowels with high F1's and vowels with low F1's and very high F2's (i.e., those near the context /i/) are largely exempt from "other voice" judgments.

3. Apparent speaker characteristics of the vowels

Table VII summarizes listeners' responses to post-session questions (2) and (3), concerning the apparent sex and size characteristics of the speaker of the context (/i ɒ __/) vowels. Responses to the extreme conditions show overall effects in the expected direction. The all factors low (N) condition shows a majority (10/15) of listeners labeled the voice as a large male, while the rest heard it as a medium sized male. For the all factors high (EPH) condition, the majority (9/15) called the context voice medium female, with the rest evenly divided between small female and small male.

TABLE VII. Judgments of apparent sex and size of the speaker of the context vowels in the eight conditions (L = large, M = medium, S = small). See text for a description of the size score and the size ranking procedure.

             Male              Female            Total    Size
Condition    L     M     S     L     M     S     female   ranking
N            10    5     0     0     0     0     0        5.7
P            0     6     4     3     2     0     5        3.9
H            11    3     1     0     0     0     0        5.6
PH           1     4     5     4     1     0     5        4.0
E            0     0     14    0     1     0     1        3.4
EP           0     0     3     0     5     7     12       1.8
EH           1     7     6     1     0     0     1        4.3
EPH          0     0     3     0     9     3     12       2.1
Size score   6     5     3.5   3.5   2     1

Condition H shows a generally similar profile to N. The two low pitch, low ensemble conditions (N and H) are the only ones for which there are no female voice judgments. The two high pitch, low ensemble conditions, P and PH, show profiles quite different from N and H, with the majority of responses spread fairly evenly among medium and small male and large female.

As for the high ensemble stimuli, condition E shows almost unanimous judgments as a small male voice. This peculiar property may be related to the large number of other voice responses for this condition mentioned above. The EP and EPH conditions are fairly similar to each other, showing only small male and medium and small female responses. The EH condition shows a pattern somewhat similar to the PH condition, except that there is only one female voice judgment.

The seventh column in Table VII shows the total number of female voice judgments by all listeners. It is noteworthy that none of the voices is unanimously judged as female. This is perhaps not surprising in view of recent findings (Klatt, 1987; Fant et al., 1987) on the importance of certain glottal waveform characteristics (see Appendix B) in synthesizing convincing female voices. However, substantial numbers of female voice (5/15 or more) judgments occur in all conditions with high pitch (P, HP, EP, EHP). Majority (12/15) female voice responses occur when both the fundamental and F1 × F2 ensemble factors are simultaneously high (EP, EHP).

The last column in Table VII represents an attempt to summarize information characterizing apparent speaker size in a way related to supralaryngeal vocal tract size. An a priori scoring scheme (shown in the last row of Table VII) was used, ranking the voice "sizes" from left to right in decreasing order. It was (arbitrarily) decided to score "small male" and "large female" as a tie at 3.5. The "size ranking" in the last column of Table VII was calculated as a weighted average of the size scores as follows:

$$\text{size ranking}_c = \frac{1}{15}\sum_{i=1}^{6} S_i\,N_{c,i}, \tag{6}$$

where S_i is the size score for the ith column (given in the bottom row of Table VII) and N_{c,i} is the number of judgments in the ith column for condition c in Table VII.

Using this index, we find that, all other things being equal, either raising pitch or raising the ensemble always leads to "smaller" voice judgments. However, raising the higher formants does not show any clear trend in that direction. For the low ensemble voices, the size judgments for N and H are nearly equal, as are those for P and PH. For the two high ensemble pairs differing only in the higher formants factor, the trend is in the opposite direction: EH and EPH show somewhat larger size indices than E and EP, respectively. This point will be returned to below.
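The size ranking of Eq. (6) is a count-weighted average of the a priori size scores, which the following sketch reproduces for four of the conditions. The counts are those listed in Table VII for conditions N, P, E, and EP (male L, M, S followed by female L, M, S); the function and variable names are illustrative assumptions.

```python
# A priori size scores: male L, M, S, then female L, M, S (last row of Table VII).
SIZE_SCORES = (6.0, 5.0, 3.5, 3.5, 2.0, 1.0)

def size_ranking(counts, scores=SIZE_SCORES):
    """Weighted average of size scores, weighted by the number of judgments."""
    total = sum(counts)  # 15 listeners per condition in this experiment
    return sum(s * n for s, n in zip(scores, counts)) / total

# Judgment counts per sex/size column for four of the eight conditions.
judgments = {
    "N":  (10, 5, 0, 0, 0, 0),
    "P":  (0, 6, 4, 3, 2, 0),
    "E":  (0, 0, 14, 0, 1, 0),
    "EP": (0, 0, 3, 0, 5, 7),
}

for label, counts in judgments.items():
    print(label, round(size_ranking(counts), 1))
```

Higher values correspond to larger apparent speakers, so the drop from N to P and from N to E reflects the "smaller voice" effect of raising pitch or raising the ensemble.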
It was the intention of question (4) to elicit whether target vowels that appeared to be from a different voice than the context vowels seemed to be produced by a larger or smaller speaker. Unfortunately, the wording of the question was not sufficiently precise. While all subjects indicated that they heard more than one voice in some of the conditions, for only 8 of the 15 subjects was it clear whether the context or intruder voice was larger. By and large, even among this group, interpretable responses of this type were made only for high ensemble conditions (E, EP, EH, and EHP). The most common situation was for subjects to hear the intruder as a "larger voice," i.e., as either a change from a smaller to larger speaker of the same sex, or from a female to a male of the same or larger size category. This occurred in 18 out of 21 judgments of change given by these eight subjects. The remaining three responses were essentially neutral: one involved an intruder voice that sounded different but was still of the same sex/size class; a second report was of speakers that changed size and sex classes in opposite directions, i.e., a change from a large female to a medium male; and the third was a report of hearing several intruder voices, some larger and some smaller than the context voice.

The most consistent change involved condition E (high ensemble only) where all eight of the subjects who gave interpretable responses agreed that the intruder sounded like a larger male speaker than the context voice. This fact, combined with the large number (18%) of intruder targets found for this condition, might be related to the anomaly in the apparent size shift caused by the higher formants. It was noted that EH actually showed a larger size rating than E. However, it is conceivable that the overwhelming number of small male judgments to the context voice for E condition (Table VII) was due to a contrast effect with the fairly frequent "larger male" intruder voice. (Condition EH showed only two of six judgments of a larger male intruder voice and an intruder rate of only 3%.)

FIG. 3. Majority responses for the vowel /ʌ/ in experiment I: O's represent responses in the all conditions low context; and X's represent those in the all conditions high context. Filled symbols show majority responses in all conditions. Hatched areas show unique low ensemble (\\\) and high ensemble (///) stimulus areas. Blank background in center of figure represents overlapping stimuli. (Horizontal axis: F1 in Hz.)
E. Results and analysis for primary responses
While the analysis of voice characteristics is interesting, the main question of interest here involves phonetic categorization. How do changes in the factors affect the response areas of the vowels in the F1 × F2 space? Following the methodology of Ainsworth (1975), the center of gravity of the response area for a particular vowel was chosen as a summary statistic for its location. Unlike Ainsworth's analysis, which used the centers of gravity of all vowel categories, discussion here will be limited to a single response category. The vowel /ʊ/ was chosen since it is an "interior" category, surrounded on all sides by response areas for other vowels. Furthermore, a substantial number of the stimuli in the overlap region received /ʊ/ responses, regardless of condition.

Majority responses (i.e., agreement by at least 8 of the 15 subjects) for the vowel /ʊ/ for the two most extreme conditions are shown in Fig. 3. The all factors low responses (condition N) are represented by O's and the all factors high (condition EPH) responses by X's. Filled symbols indicate majority /ʊ/ responses in both extreme conditions. The background pattern of Fig. 3 indicates three regions: The hatched regions represent unique high ensemble (///) and unique low ensemble ( • • • ) stimulus areas, while the blank background in the center of the figure represents the overlapping stimulus area.

Two sets of analyses will be presented here. The first analysis involves the location of the response area for the vowel /ʊ/ in the overlapping test vowel region only. It should be reemphasized that the analysis of the overlapping stimulus set involves measures based on physically identical stimuli for corresponding intrinsic conditions in the high and low ensemble conditions. By limiting consideration to these stimuli, we can be assured that any effects involving extrinsic factors are in fact due to the context /i ɒ/ and to the contextual influence of other members of the appropriate ensemble outside the overlap area.

A second analysis will also be reported for center of gravity of /ʊ/ responses for the entire stimulus set in a given ensemble. Theoretically, this could bias the analysis of the ensemble factor because a different range of stimuli would be potentially available for averaging in the two ensemble conditions. However, failure to consider vowels outside the overlap region results in an increase in severity of the "windowing" problem. As can be seen from Fig. 3, the majority response area for /ʊ/ in the all factors high condition (X's) extends beyond the overlapping stimulus area (the clear center region) into the unique high ensemble stimuli (///). The problem is less severe for the low ensemble stimuli, but even here one majority response stimulus spills out of the overlapping stimulus region into the unique low ensemble stimuli ( • • • ). Extending the "viewing area" into the entire stimulus ensemble region may yield more accurate estimates of the size of the shifts involved.
1. Analysis of the overlapping stimuli
A graphic summary of the overall mean F1 and F2 for /ʊ/ for the overlapping stimuli is presented in Fig. 4. (Note the combination of high intrinsic factors PH is represented by the single symbol B.) Repeated measures analysis of variance was performed for the mean of the /ʊ/ response areas for each listener. A significant main effect was found for ensemble [F(1,14) = 177.5, p < 0.0001]. The mean F1 was 12.1% higher (54 Hz) in high ensemble stimuli. There is also a main effect for pitch [F(1,14) = 62.1, p < 0.0001], corresponding to an upward shift of about 7.4% (33 Hz). There was no significant main effect for higher formants; though the higher formants by ensemble interaction was significant [F(1,14) = 4.74, p < 0.048]. Simple main effects testing of higher formants within ensemble revealed significant differences only within the high ensemble stimuli, where high higher formants stimuli were about 13 Hz or 2.7% greater.

A similar analysis was performed for F2 response means. A significant main effect for ensemble was again found [F(1,14) = 103.8, p < 0.001], with an average upward shift of 136 Hz or 10.0%. There was no main effect for pitch and the main effect for higher formants just failed to reach significance [F(1,14) = 4.44, p < 0.055], corresponding to a shift of 17 Hz or 1.2%. There was, however, a significant interaction of pitch by higher formants [F(1,14) = 5.25, p < 0.038]. Simple main effects testing on higher formants within pitch reveals a significant difference only within the low pitch condition, where the mean for the low higher formants condition was significantly lower (by about 30 Hz or 2%).

[Figure 4 plot not reproduced. Horizontal axis: F1 (Hz). Panel label: HIGH ENSEMBLE.]
FIG. 4. Means of the response areas for /ʊ/ in F1 × F2 in the eight conditions of experiment I for the overlapping stimuli only. Intrinsic factors are indicated as follows: P = high pitch; H = high higher formants; B = both P and H; N = neither high.

2. Analysis of responses to the complete ensembles

A graphic summary of the overall mean F1 and F2 for /ʊ/ for the complete ensembles (both overlapping and unique ensemble stimuli) is presented in Fig. 5. As in the case of the overlapping stimuli, repeated measures analysis of variance was performed on individual listeners' means. The results for F1 are very similar to those for the overlapping stimuli. A significant main effect was again found for ensemble [F(1,14) = 248.4, p < 0.0001]. The mean F1 was 16.8% higher (72 Hz) in high ensemble stimuli. There is also a main effect for pitch [F(1,14) = 73.0, p < 0.0001], corresponding to an upward shift of about 8.8% (39.5 Hz). As in the case of the overlapping stimuli, there was no significant main effect for higher formants. The higher formants by ensemble interaction was again significant [F(1,14) = 7.95, p < 0.014]. Simple main effects testing of higher formants within ensemble revealed a somewhat different pattern of significant differences from those found in the overlapping stimuli. In the case of the overlapping vowels, within high ensemble stimuli, increasing the higher formants factor resulted in a significant increase in F1 values as expected. While in the present case, the trend for the same comparison was in the same direction (by about 9 Hz or 1.9%), it just failed to reach significance. Instead, a significant difference for higher formants was found within low ensemble stimuli, where the higher level of higher formants actually led to significantly lower means by about 2.4% (11 Hz), representing a small shift in the wrong direction, from the point of view of an intrinsic normalization approach.

For F2 responses, a significant main effect for ensemble was again found [F(1,14) = 195.0, p < 0.001], with an average upward shift of 243 Hz or 17.7%. Unlike the case of the overlapping stimuli, there was a main effect both for pitch [F(1,14) = 4.78, p < 0.047] and for higher formants [F(1,14) = 8.70, p < 0.011]. The pitch effect corresponded to an average shift of 22 Hz or 1.5%, while that due to the higher formants factor was 27 Hz or 1.8%. Unlike the overlapping stimuli, there was no significant interaction of pitch by higher formants, but one additional interaction was significant, ensemble by higher formants [F(1,14) = 6.08, p < 0.028]. Simple main effects testing on higher formants within ensemble reveals significant difference only within the high ensemble condition, where the mean for the high higher formants condition was higher by about 58 Hz or 3.6%.

[Figure 5 plot not reproduced. Horizontal axis: F1 (Hz). Panel label: HIGH ENSEMBLE.]
FIG. 5. Means of the response areas for /ʊ/ in F1 × F2 in the eight conditions of experiment I, including all stimuli in each ensemble. Intrinsic factors are indicated as follows: P = high pitch; H = high higher formants; B = both P and H; N = neither high.
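The main effects reported above all involve factors with two levels (low versus high) and 15 listeners, which is why each test has (1, 14) degrees of freedom. For such a factor, the repeated-measures main-effect F is the square of a paired t statistic computed on each listener's marginal means at the two levels. The sketch below shows that relationship only; it is not the analysis code used in the study, and the function name and data are hypothetical.

```python
import math
from statistics import mean, variance


def two_level_main_effect(low, high):
    """Main effect of a two-level within-subject factor.

    low, high: per-listener means (e.g., mean F1 of a response area in Hz)
    at the two levels of the factor, averaged over all other factors.
    Returns the mean shift, the paired t, F = t**2, and the df pair.
    """
    if len(low) != len(high) or len(low) < 2:
        raise ValueError("need matched lists with at least two listeners")
    diffs = [h - l for l, h in zip(low, high)]
    n = len(diffs)
    var = variance(diffs)  # sample variance of the paired differences
    if var == 0:
        raise ValueError("differences are constant; t is undefined")
    t = mean(diffs) / math.sqrt(var / n)
    return {"mean_shift": mean(diffs), "t": t, "F": t * t, "df": (1, n - 1)}


# Hypothetical listener means (Hz), for illustration only.
demo = two_level_main_effect([400.0, 410.0, 420.0], [450.0, 470.0, 460.0])
```

With 15 listeners the same function returns `df == (1, 14)`, matching the form of the statistics quoted in the text.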
F. Discussion
1. Overall shift in extreme conditions
Considering the data summarized in Fig. 4, we find that total response shift between the most extreme (i.e., between all factors low and all factors high) conditions for the overlapping stimuli is about 20% in F1 and 12% in F2. If the data in Fig. 5 are considered (i.e., both the overlapping test vowels and the nonoverlapping vowels in the ensemble), the average shift of /ʊ/ area means is about 25% in F1 and 21% in F2. This is roughly two-thirds the 32% shift of the formant ensemble itself. Although windowing effects and other artifacts (see Ainsworth, 1975 for a discussion) of measurement of change may still be present, in light of the descriptive analysis of the voice judgments, it is possible that the failure to find shifts as large as those observed in natural data is a result of failing to produce a fully convincing "small voice" condition. It is possible that other factors, such as spectral tilt and breathiness, may be required to provide a full effect (Klatt, 1987). Nonetheless, the observed shifts are considerably larger than the empirical estimates of Ainsworth (1975), although his extrapolated estimate of 16%, involving liberal allowances for potential artifacts such as windowing and range effects, is in the neighborhood of the empirical findings for the overlapping stimuli only.
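The comparison between observed response shifts and the 32% shift of the ensemble can be made explicit with a few lines of arithmetic. The sketch below simply divides each percentage quoted above by 32; treating the comparison as a plain ratio of percentages is an assumption made for illustration, and the function and dictionary names are hypothetical.

```python
ENSEMBLE_SHIFT_PCT = 32.0  # upward shift of the high ensemble, from the text


def fraction_of_ensemble_shift(observed_pct, ensemble_pct=ENSEMBLE_SHIFT_PCT):
    """Observed response shift as a fraction of the ensemble shift."""
    return observed_pct / ensemble_pct


# Percent shifts between the all-low and all-high conditions, as quoted above.
observed = {
    "overlap_F1": 20.0,
    "overlap_F2": 12.0,
    "complete_F1": 25.0,
    "complete_F2": 21.0,
}
fractions = {k: fraction_of_ensemble_shift(v) for k, v in observed.items()}
# complete_F1 is about 0.78 and complete_F2 about 0.66 of the ensemble shift,
# which brackets the "roughly two-thirds" figure given in the text.
```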
2. Ensemble and pitch effects
Both the ensemble and pitch factors showed large main effects in the statistical analyses detailed above. From either Fig. 4 or 5, it is clear that the extrinsic ensemble effect dominates the changes. From the analysis above, we see that the average effects of ensemble range from about 12%-17% in F1 and 10%-18% in F2. On the average, these shifts are on the order of 31% to 56% of the expected changes under a pure extrinsic hypothesis given the 32% upward shift in the ensemble factor. However, apart from this shortcoming in magnitude, pure extrinsic specification predicts no effect of pitch or higher formants. Clearly, this model is wrong, since pitch has a considerable effect on F1 in both ensemble sets. The average effect for pitch is on the order of 7% to 9% in F1 and about 1.5% in F2. These results are consistent with those of Ainsworth (1975), who also found that extrinsic vowel context effect had a larger influence than pitch; and further, that F0 effects are much greater for F1 than F2.

On the other hand, models of pure intrinsic specification predict upward shifts of response areas for a pitch increase and/or higher formant increase, but no change for the ensemble effect. The present results are not consistent with either a pure intrinsic or a pure extrinsic approach. Contrary to pure intrinsic specification, the extrinsic ensemble factor has the largest overall effect on both F1 and F2 response patterns. On the other hand, contrary to pure extrinsic specification, the effects of pitch on F1 categorization are clearly evident. These are in qualitative accord with aspects of the theories of Miller (1984; 1989), Syrdal (1984), and Traunmüller (1981).

But what about quantitative comparison? The average pitch induced shift in F1 is on the order of 7% to 9%. How large a shift might we expect? Using Ainsworth's rule of thumb based on natural data (or considering the Peterson and Barney's male and child averages for the vowels /ɪ/ or /ʊ/, which are in the range of the F1 average observed for the all conditions low stimuli here), we would expect about a 30% increase in F1 for the roughly octave increase in F0 in the present experiment. The observed shifts are thus on the order of 23% to 30% of the predicted shifts.

It should be noted in passing that there are other difficulties for several theories, including Miller's and Syrdal's intrinsic and Nearey's extrinsic.¹³ According to these theories, an increase in F1 for a vowel category should be accompanied by an increase in F2, since for all these theories the ratio (or difference, in log or Bark scales) of F2 to F1 is supposed to be invariant for a given vowel (see Appendix A). This implies that when the response area to a given vowel is shifted upward along the F1 axis, it should be shifted upward by roughly an equal amount on the F2 axis. In contrast, the present data indicate F0 induced shifts in F1 categorization that are, to a large degree, independent of F2 shifts.

This lack of strict invariance is confirmed by analysis of variance on formant ratios (F2/F1) in the eight conditions, analogous to those performed on the mean F1 and F2 values. For the overlapping stimuli, there is a significant main effect for pitch [F(1,14) = 30.3, p < 0.0002], with high pitch condition showing 6.8% lower F2/F1 ratios than low pitch. There was also a significant ensemble by pitch interaction [F(1,14) = 5.51, p < 0.024]. Simple main effects testing of ensemble within levels of pitch revealed that ensemble differences were not significant within the low pitch conditions, but that within the high pitch conditions, high ensemble vowels showed a significantly higher F2/F1 ratio (by about 3.7%). A similar analysis on the responses to the complete ensemble (including both overlapping and nonoverlapping test vowels) revealed only one significant effect, namely the main effect for pitch [F(1,14) = 29.21, p < 0.0002], with high pitch stimuli showing F2/F1 ratios about 6.2% lower than low pitch. While these differences are not large in absolute terms, they are nonetheless reliable.

It should be noted here that Traunmüller (1981), while insisting on the importance of F1-F0 distance in the perception of vowel height, seems to imply some freedom of F2-F1 differences in vowel perception, at least as it affects rounding judgments. Specifically, he claims: "...the distance between F2 and F1 is not the major cue for the distinction of front rounded versus front unrounded vowels" (1981, p. 1471). Furthermore, some recent unpublished work by Traunmüller (cited by Lindblom, 1987) suggests that an increase in F0 and F1 unaccompanied by increases in higher formants may be associated with a perceived increase in vocal effort, e.g., in "shouted" versus normal speech. However, it is also apparent that the present experiment does not confirm a constant difference in modified Barks between F0 and F1 that is part of Traunmüller's theory. The shifts induced by F0 on F1 are simply too small.

3. The effects of higher formants

The higher formants factor appears on the average to have had very little effect in this experiment, although, for
the most part, the small differences observed are in the expected direction. A significant main effect for higher formants was found only for F2 for the full set of stimuli, where raising the higher formants resulted in about a 1.8% increase. (The main effect just failed to reach significance for the overlapping stimuli, showing a 1.2% increase.) There are also several interactions in the case of the higher formants. For the most part, these showed shifts in the expected direction in some subset of the stimuli, although ensemble × higher formants interaction for F1 of the complete ensemble stimuli was exceptional in this regard.

The failure to find a substantial global shift due to F3 appears to be in conflict with Fujisaki and Kawashima (1968), who found perceptual boundary shifts that were roughly equivalent to observed shifts in acoustic data when fundamental frequency and higher formants factors were combined. The fact that the present experiment was meant to correspond to a speaker-segregated experiment (one "synthetic speaker" per experimental block, and with F1 and F2 confined to reasonably natural ranges for the intended speaker), while Fujisaki and Kawashima's experiment involved a synthetic speaker-randomized situation, may be important here. This may indicate that different perceptual weight should be attached to F3 for conditions in which a listener is tuned in to the voice of a single speaker. However, Holmes (1986) also reports a somewhat larger influence of F0 and F3 than found for the present data, without any extrinsic factor. Holmes' stimuli were apparently presented in a blocked condition with respect to F0 and higher formant factors. However, the same F1 × F2 ensemble was used in all cases and the formant ranges were apparently not confined to correspond to the vowel space of any single speaker as in the present experiment, but rather they "...explored the whole F1, F2 plane within the limits of the synthesizer" (1986, p. 353). Additional experiments involving simulated speaker-randomized (mixed speaker) as well as speaker-segregated (blocked speaker) conditions are required to see if F3 and F0 related effects are larger when the apparent speaker varies randomly from trial to trial.

As noted in the review presented above, the size of F3 related effects appears to depend on a number of factors, including F1 and F2 values (Fujisaki and Kawashima found smaller effects for /a-o/ than /u-e/), and spectral roll-off for noise excited stimuli. A possible source of this discrepancy may lie in the relatively lower amplitude of F3 in the high higher formants conditions for the present experiment. In the overlapping region of the test vowels, F3 occurs at a relatively large separation from F2 in the high higher formants condition and, consequently, its amplitude is lower (by about 8 dB on the average) compared to vowels in the low F3 condition. However, one result from Fujisaki and Kawashima's experiment indicates that changes in higher formant amplitude spectral slope may not matter greatly in voiced speech. Specifically, raising the effective frequency of the higher pole correction network in their analog synthesizer (and thus indirectly lowering the amplitude of F3) had no effect on listeners' categorization (see Appendix B). As noted earlier, Holmes (1986) also experimented with changes in the amplitude of F3 in voiced stimuli. His results appear to be in direct conflict with those of Fujisaki and Kawashima for their noise excited stimuli, since inspection of Holmes' Figure 16.1 shows that vowels with attenuated F3's, and, hence, with more falling spectral slopes, actually show a larger shift from his baseline condition than the corresponding unattenuated conditions (cf. condition 2 vs 3 and condition 6 vs 7). Clearly, further experiments with smaller step sizes in the higher formants factor and with parallel synthesis, independently varying the amplitude of F3 are called for.

In spite of the generally small average effect of the higher formants, there is one clear suggestion that this factor did have a substantial effect on some aspects of listeners' perception in the present experiment. Specifically, in the descriptive analysis of intruder voice responses given above, condition E (in which only the ensemble factor was raised) produced a high proportion of intruder voice responses for low F1 stimuli. These responses were also associated with substantially lower values of F1 and F2 averages (Figs. 4 and 5). When F3 is raised to a value compatible with the F1 × F2 range of the high ensemble stimuli (condition EH), the number of intruder voice responses drops off dramatically, and both the F1 and F2 averages increase substantially. Thus it seems likely that a "lower than expected F3" can serve to greatly increase the probability of hearing a "larger voice" and thus partly counteract the effect of the extrinsic factor.

4. Summary

Although more experiments are needed to clarify some of the results found here, it appears that no existing model is capable of adequately accounting for all the results of this experiment. Pure extrinsic models, such as Nearey's (1978) or Nordström and Lindblom's (1975), fail because of the clear effects of F0 on F1. Pure intrinsic models, such as those of Miller, Syrdal, or Traunmüller, fail to account for the extrinsic factor, the largest effect observed. All the factors ever considered in the identification of isolated vowels appear to be playing some role here. We clearly need models that are sensitive to both intrinsic and extrinsic effects of speaker variation. Furthermore, it seems likely that other factors related to voice quality may be necessary to attain the full shift in categorization between synthetic adult male and children's speech parallel to that in natural data. In view of the variety of effects involving speaker identity, vowel quality, and vocal effort, it appears that a considerable number of experiments involving simultaneous judgments of vowel category, naturalness, and apparent speaker qualities will be necessary before these matters can be fully sorted out. The information gathered in the course of these experiments is likely to be useful not only for an account of vowel perception, but also in the attainment of higher quality synthesis of a variety of voices.

V. EXPERIMENT II: CONTEXT-DEPENDENT VARIATION

Although they are somewhat smaller in magnitude than speaker-dependent effects, systematic consonantal context effects in vowel formant frequencies have been firmly established in a number of studies (e.g., Stevens and House, 1963; Lindblom, 1963). The study of Lindblom (1963) is of particular importance because of its influence on subsequent theories of coarticulation. Variations of up to 70% in the formant frequencies of a single vowel category were found to be caused by coarticulation effects with surrounding stop consonants. Lindblom and Studdert-Kennedy (1967) present the results of perceptual experiments with glide + vowel + glide syllables that showed a kind of "perceptual overshoot" effect that might serve to offset the undershoot effects of production. [Williams (1987) has replicated this experiment with variations and provided strong evidence that it is a "speech mode," rather than general auditory effect.] While it is clear that the perceptual shifts observed by Lindblom and Studdert-Kennedy are in the direction predicted by undershoot theory, the question of detailed complementary match between production and perception still remains.

Note that Lindblom and Studdert-Kennedy's stimuli consisted of glide + vowel + glide stimuli (/wVw/ and /jVj/). On the other hand, Lindblom's (1963) production study involved stop + vowel + stop syllables. There is a spectrogram of running speech in the Lindblom and Studdert-Kennedy study that shows transitions similar to their stimuli; however, no measurements of /wVw/ or /jVj/ syllables are provided. Indeed, this would be difficult, since (as the authors note) these are phonologically ill-formed syllables in English.

An experiment conducted in our laboratories based on the Lindblom/Studdert-Kennedy paradigm shows that "perceptual overshoot effects" can occur in synthetic stop + vowel + stop syllables; however, the magnitudes of the perceptual compensations are somewhat smaller than those observed in production data by Lindblom (1963), and, in some cases, are very small indeed.

A. Methods and procedures

1. Baseline isolated vowel stimuli

The baseline stimuli consisted of a continuum of steady-state, four-formant synthetic vowels, using a cascade model of synthesis based on that described by Fisher and Engebretson (1975), implemented on a PDP-12 minicomputer with a sampling rate of 16 kHz. Here, F1, F3, and F4 were fixed at 700, 2400, and 4000 Hz, respectively. Also, F2 was varied in 20 steps from 900 to 1800 Hz. The vowels were 100 ms in duration and had a fundamental frequency of 120 Hz. The resulting continuum spanned three phonetic categories in Western Canadian English /ɒ/, /ʌ/, and /ɛ/.

2. CV stimuli

Two additional continua were created to produce stimuli corresponding to /bVb/ and /dVd/ syllables. The /bVb/ stimuli were produced with rising initial F2 and F3 transitions, and the /dVd/ with falling initial transitions. The general nature of the transitions is indicated in Fig. 6. The /dVd/ stimuli had initial "loci" of F1 = 150, F2 = 2000, and F3 = 3000. The /bVb/ stimuli had initial loci F1 = 150 Hz, F2 = 700, and F3 = 2100.

[Figure 6 plot not reproduced. Vertical axis: frequency, 0 to 4000 Hz. Horizontal axis: TIME (MS), 0 to 100.]
FIG. 6. Schematic spectrogram of transitions of one of the /dVd/ stimuli of experiment II.

The F1 transitions were specified as follows for the first half of the stimulus duration:

$$F1(t) = F1_0 + (F1_i - F1_0)\,\frac{(t - t_0)^p}{t_0^{\,p}}, \tag{7}$$

where $F1(t)$ is the frequency of F1 at time $t$, $F1_i$ is the initial target frequency, $F1_0$ is the frequency of the steady-state target at $t_0$, the time of the midpoint of the stimulus (50 ms), and $p$ is the order of the transition, discussed in more detail below. The second half of the stimulus was the mirror image of the initial half, so that the transitions were symmetrical functions about the temporal midpoints of the stimuli. Analogous functions were used to define F2 and F3 transitions. Due to an error in the formant track generation algorithm that was not detected until after the experiment was run, the mirror image of the second half was not quite complete, and the last frame synthesis did not achieve the target value of the initial frame, but rather only that of the second.

The transitions in the study of Lindblom and Studdert-Kennedy (1967) were quadratic, corresponding to a value of p = 2.0 in the above transition formula. In pilot listening tests, such stimuli were found to produce glidelike percepts (/wVw/ for the /b/ transition onsets or /jVj/ for the /d/ transition onsets). As the power p is increased, the rate of transition becomes greater near the end points of the syllable and slower near the center. After some preliminary experimentation, it was found that a value of 6.0 for p provided reasonably convincing stoplike effects for all the /bVb/ and /dVd/ stimuli.

Possibly because of the very low frequency of the F1 onset, the nominal onsets of the F2 and F3 transitions were not visible on spectrograms of the stimuli. It was decided to redigitize the stimuli and empirically determine the effective formant onsets. Using LPC based formant analysis (17 coefficients at 16 kHz, 16-ms window advanced in 4-ms frames with a negative second derivative peak-picking algorithm), F2 onsets were measured at the first frame at which the formant amplitude was less than 20 dB below the maximum F2 amplitude for the syllable. For the final transitions, there was a drop of more than 10 dB over two frames in the LPC measurements. Values measured just before this drop-off were quite close in value to the initial measurements described above. For /bVb/ syllables, onset and offset frequencies measured in this way ranged from about 800-1170 Hz, while /dVd/ onsets ranged from about 1510-1920 Hz. It turns out that these measures correspond within a few hertz to the theoretical values for the second and the last frame of Fig. 6. Fortuitously, the empirically measured onsets are correlated with the F2 of the vowel target in a manner reasonably similar to transitions observed by Nearey and Shammass (1987) for natural Canadian English stop + vowel syllables.
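The stimulus construction described in this subsection can be sketched numerically. The code below builds the 20-step F2 continuum (900 to 1800 Hz, assumed here to be equally spaced in Hz) and evaluates a power-law transition of the kind given in Eq. (7), with the midpoint at 50 ms and p = 6.0. Two points are assumptions rather than statements from the text: the absolute value inside the power (so the same function also covers the mirror-image second half), and the 5-ms time used for the "second synthesis frame" in the usage lines. The function names are hypothetical.

```python
def f2_continuum(lo_hz=900.0, hi_hz=1800.0, steps=20):
    """Equally spaced F2 targets (Hz); equal spacing in Hz is an assumption."""
    step = (hi_hz - lo_hz) / (steps - 1)
    return [lo_hz + k * step for k in range(steps)]


def transition(t_ms, locus_hz, target_hz, t0_ms=50.0, p=6.0):
    """Formant frequency at time t_ms for a power-law transition.

    The formant starts at locus_hz at t = 0, reaches target_hz at the
    midpoint t0_ms, and is symmetric about the midpoint. abs() is used so
    that the expression is valid on both sides of t0_ms for any p.
    """
    ratio = abs(t_ms - t0_ms) / t0_ms
    return target_hz + (locus_hz - target_hz) * ratio ** p


f2_targets = f2_continuum()
stim7 = f2_targets[6]  # stimulus 7 in 1-based numbering

# F2 at an assumed second frame (t = 5 ms) for the /dVd/ locus (2000 Hz)
# and the /bVb/ locus (700 Hz).
d_onset = transition(5.0, 2000.0, stim7)
b_onset = transition(5.0, 700.0, stim7)
```

Under the 5-ms assumption, `d_onset` and `b_onset` round to the onset values that Table IX lists for stimulus 7, which suggests (but does not prove) that the synthesis frames were 5 ms apart.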
3. Subjects and procedure

Subjects were 14 native speakers of Canadian English who were enrolled in a course in phonetics. Responses were recorded on computer answer sheets. Keywords for the vowels in both phonetic transcription and English orthography were visible on a blackboard. The stimuli were played over a high-quality audio system through a loudspeaker in a quiet phonetics laboratory. For the main experimental condition, denoted by the blocked context condition, listeners heard only one of the syllable types (isolated vowel, /dVd/ or /bVb/) in a single session, consisting of a randomized list of ten replications of each of the 20 stimuli. Each of the listeners heard each of the stimulus tapes twice, yielding a total number of responses of 280 per stimulus (14 subjects × 20 replications), or a total of 16 800 responses for the 60 stimuli.

In a smaller, follow-up experimental condition, the mixed context condition, 11 of the same listeners heard stimuli from all three syllable types randomized together. These tapes contained six replications of each of the 60 stimuli. Thus the entire experiment yielded 66 (11 subjects × 6 replications) responses per stimulus and 3960 responses for the entire stimulus set.

B. Results and discussion

The results of the categorization of all three stimulus continua in the blocked context condition are shown in Fig. 7. Results from the mixed context condition are generally similar. The observed boundary shifts occur in the direction expected by "undershoot theory." Boundary estimates were made for individual subjects using logistic regression.

[Figure 7 plot not reproduced. Panels: /dVd/, ISOLATED VOWELS, /bVb/. Horizontal axis: F2.]
FIG. 7. Categorization responses of 14 Canadian English listeners to the blocked context stimuli of experiment II. Circles represent /ɒ/ responses; triangles represent /ʌ/ responses; and squares represent /ɛ/ responses.

Table VIII shows the boundary estimates for both the blocked and mixed contexts averaged over subjects. The Bonferroni approach to multiple comparison (Meyers, 1979), with a family error rate of 0.05, was adopted in the analysis of the significance of /ɛ-ʌ/ and /ʌ-ɒ/ boundary shifts induced by consonantal context. Since four comparisons were involved, a per test alpha level of 0.0125 was chosen (one-tailed, because the direction of the shifts is predicted a priori). Using this criterion, t-tests of boundary shifts for individual subjects indicate that boundary shifts in the blocked context for the /d/ syllables are both highly significant [for /ɛ-ʌ/, t(13) = 5.884, p < 0.001; for /ʌ-ɒ/, t(13) = 4.426, p < 0.001]. For the /b/ stimuli, only the /ʌ-ɒ/ boundary shift is significant [t(13) = -3.385, p < 0.0025] and is less than half the size of either of the /dVd/ shifts.

TABLE VIII. Average of individual boundary estimates (Hz) for isolated vowels and vowels in consonantal contexts. Blocked and mixed refer to the two context presentation conditions. Numbers in parentheses indicate the size of boundary shifts compared to isolated vowels.

Boundary   Condition   /#V#/   /bVb/          /dVd/
/ɒ-ʌ/      Blocked     1191    1155 (-36)     1279 (88)
/ɒ-ʌ/      Mixed       1195    1087 (-108)    1305 (110)
/ʌ-ɛ/      Blocked     1541    1525 (-15)     1626 (86)
/ʌ-ɛ/      Mixed       1552    1539 (-11)     1678 (125)

Analysis for the mixed context condition yields similar results. For the /d/ stimuli, both boundary shifts are significant [for /ɛ-ʌ/, t(10) = 5.826, p < 0.001; for /ʌ-ɒ/, t(13) = 3.373, p < 0.0025]. For the /b/ stimuli, again, only the /ʌ-ɒ/ boundary shift is significant [t(13) = -3.385, p < 0.0025], but here it is in the same general range as the /dVd/ shifts.

In spite of the high significance levels for three of the four shifts, their absolute magnitude is not very large, in the range of 86-125 Hz. How do these shifts compare to what might be expected from production data, if "perfect compensation" for undershoot took place according to Lindblom's
(1963) model? Consider first, the /dVd/ stimuli, which show the largest shifts. The formula given by Lindblom [his Eq. (5)] for the center frequency of a /dVd/ is

$$F2_c = 2.0\,(F2_i - F2_t)\,\exp(-0.021\,d) + F2_t, \tag{8}$$

where $d$ is the duration of the syllable (in ms); $F2_c$ is the estimated frequency of F2 at the temporal midpoint of the syllable (i.e., at $d/2$); $F2_i$ is the empirically measured value of F2 at the transition onset; and $F2_t$ is the "target value" of F2 for a long duration isolated vowel. To calculate a range for the estimated size of shift for the present experiment, the stimulus values bracketing the isolated vowel boundaries and the midpoint of the /ʌ/ range are substituted for $F2_c$. For the values of $F2_i$, the nominal formant values at the second (and the last) synthesis frames are substituted. The results of these calculations are shown in Table IX.

In general, the observed shifts from Table VIII, ranging from 88 to 125 Hz for /d/ and from -11 to -108 Hz for /b/, are substantially smaller than those predicted in Table IX. Predicted shift sizes for /d/ from Broad and Clermont's formulas 38 and 39 and their Table VI are larger still, ranging from 311 to 605 Hz for /d/. For /b/, their predictions range from +12 for stimulus 7 to -209 Hz for stimulus 15.

These results are important, since they show effects reminiscent of undershoot compensation. But they also indicate that caution is in order concerning the magnitude of such effects. The observed shifts are fairly small, less than 10% of the vowel center frequency, and on the order of twice Flanagan's (1955) estimates of the jnd for formant frequencies of steady-state vowels. In fact, they are actually less than the 176 Hz average DL estimate of Mermelstein (1978) for F2 in synthetic CVCs.

The fact that the size of the "perceptual compensation effects" in the above experiments is smaller than expected from Lindblom's (1963) model may be due to a number of factors, including transition shapes and onset frequencies.¹⁴ Clearly, perceptual experiments varying these factors should be undertaken. It is possible that the actual shape of transitions plays a role, although neither Lindblom's (1963) nor Broad and Clermont's (1987) models allow this as a free parameter; they treat it instead as a universal constraint.

However, another possibility that must be considered is that the single subject studied by Lindblom showed a relatively extreme degree of undershoot in this production. Lindblom's original hypothesis was that physiological undershoot effects were bound to increase as a matter of physical necessity as the duration of the vowel decreased. [This position was revised considerably in Lindblom (1984, 1988), in light of some of the evidence mentioned below.] Physiological studies, such as those of Gay (1974), Gay and Ushijima (1974), or Kuehn and Moll (1976), indicate considerable variation in strategies employed by speakers when producing short duration syllables. Changes in gestures adopted by some speakers, notably the increased velocity of articulator movement, result in the effective avoidance of large amounts of positional undershoot at short syllable durations. Koopmans-van Beinum (1979, 1980) noted large variation in the degree of acoustic contrast reduction in vowels of a number of Dutch speakers and has shown that intelligibility of vowellike intervals excised from speech is inversely related to the degree of reduction in their acoustic measures. One possibility that must be considered (as suggested by Koopmans-van Beinum) is that some of the reduction that occurs in fast speech of some speakers is likely not compensated for by listeners, but rather is simply a loss of perceptual phonetic contrast. This position seems compatible with some recent findings of Lindblom and Moon (1988) on the pronunciation of "hyperarticulated speech," where subjects are asked to speak very clearly. Lindblom and Moon found that formant frequency values that showed substantial undershoot in normally articulated speech had values much closer to slow speech "targets" in the hyperarticulated condition, even at short durations.

On the other hand, the perceptual compensation effects noted in this experiment are reliable, even if of limited magnitude. Viewed from another perspective, the boundary shifts are sizable compared to the width of the /ʌ/ category. Furthermore, syllable-length or shorter stretches extracted from running speech are known to be better identified when more of their surrounding context is supplied (Verbrugge et al., 1976; Kuwabara and Sakai, 1972). The work of Kuwabara (1985) and Broad and his colleagues (e.g., Broad, 1976; Broad and Clermont, 1987) shows considerable promise for the development of "correction formulas" for consonantal contexts that can be included in explicit pattern recognition models for vowels. Further experiments comparing such models to listeners' behavior are being planned in our laboratories. We must bear in mind the possibility of partial perceptual compensation combined with partial loss of contrast in studying this problem.
TABLE IX. Shifts in F2 predicted by Lindblom's (1963) undershoot model for stimuli in the range of isolated vowel /^/ responses, given F2 at center (F2c) of syllable and at onset (F2i).

                                   F2i (Hz)       Predicted shift (Hz)
Category   Stimulus   F2c (Hz)    /d/    /b/       /dVd/    /bVb/
/^-D/          7        1184     1618    927        261     -158
mid /^/       11        1374     1707   1016        201     -219
/^-e/         15        1563     1795   1104        140     -281

J. Acoust. Soc. Am., Vol. 85, No. 5, May 1989

VI. VOWEL-INHERENT DYNAMIC FACTORS

Thus far, the question of overlap in vowel nucleus measurements has dominated the discussion. But in the case of English vowels, there are at least two other factors that are independent of spectral properties of the nucleus itself; namely, intrinsic duration contrast and what Nearey and Assmann (1986) have termed vowel-inherent spectral change (footnote 15). It is likely that a considerable amount of instantaneous spectral overlap could be tolerated perceptually, both in isolation and in consonantal context, provided the affected categories remained distinct on these additional variables.

A. Intrinsic duration
There is clear evidence from studies of production data (e.g., Joos, 1948; Peterson and Lehiste, 1960) and from perceptual experiments (Bennett, 1968; Ainsworth, 1972; Strange et al., 1983) that vowel duration has a role to play in distinguishing English vowels. Because of variations of duration caused by such factors as speaking rate and prosodic and consonantal context, issues very similar to those in vowel formant frequency normalization are raised in connection with vowel duration as a cue (see, e.g., Miller, 1981, 1987; Port, 1981; Klatt, 1976). It is beyond the scope of this study to review this evidence in detail. It is, however, clear that a full account of the perception of English vowels will ultimately have to integrate information about vowel duration.
B. Vowel-inherent spectral change

Another dynamic property that must be considered is that of intrinsic spectral change. Nearey and Assmann (1986) present evidence for the importance of such information in the perception of isolated vowels. Strange et al. (1983) presented results of an experiment that showed an error rate of 13% for "silent center syllables," wherein relatively steady-state centers of /bVb/ syllables were replaced with silence, so only initial and final transitions remained. Identification of the complementary syllables, the vowel centers without the initial and final transitions, showed nearly the same error rate (14%).

In considering these results, the hypothesis arose that both kinds of modified syllables might be specifying the same information about vowels and that this information might be preserved in the endpoints of the formant trajectories at the boundaries of the vowel centers and the rapid consonantal transitions. If this were the case, then results analogous to those of Strange et al. (1983) might be obtained for isolated vowels.

In Nearey and Assmann (1986), evidence was provided that this was indeed the case. In that experiment, naturally produced isolated vowel stimuli were manipulated in the following way: From each original stimulus, two 30-ms sections, one each from the onset (A) and offglide (B) portions, were extracted and multiplied by a 30-ms Hamming window (to avoid gating transients). The original vowels and three sets of test vowels were played back to listeners. The test vowels consisted of two windowed sections separated by 10 ms of silence. In the natural order (A-B) condition, the two sections from the same original vowel were played in their natural order; in the repeated nucleus (A-A) condition, the first section was presented twice; and in the reverse order (B-A) condition, the sections were played in the reverse of their natural order. Analysis of listeners' errors showed that listeners made significantly more errors on the repeated nucleus (A-A) condition and the reverse order (B-A) condition than on the original unmodified vowels, while the natural order A-B sections were identified as well as the full duration vowels. These results indicate that sufficient information is retained for reliable vowel identification in two brief sections taken from near the beginning and end of a vowel, provided they are presented in their original order.

Statistical analysis of formant frequencies of production data showed that half of the ten vowels studied showed reliable formant movement, including the vowels /i/, /e/, and /æ/, which are usually described as monophthongs. Nearey and Assmann (1986) also constructed a pattern recognition model, trained on the production data only (i.e., without access to the results of the perception experiment), that included formant change information. They showed a clear correlation between predictions based on this model and listeners' categorization of both the unmodified and windowed vowels. They also presented some preliminary evidence that inherent spectral change appeared to persist in /bVb/ contexts and thus might, in part at least, account for the high identification rate of Strange et al.'s silent center syllables. Such a suggestion is, of course, pure speculation. How could formal tests be made? There is no reason to believe that the formal modeling techniques used by Nearey and Assmann for isolated vowels could not be extended to test effects of consonantal context in a completely explicit way. Such tests are currently underway in our laboratories (Nearey and Andruski, 1988).

VII. GENERAL DISCUSSION

The present work has reviewed three types of studies relating to vowel perception in English: (1) perception studies that focus on correct identification of naturally produced speech; (2) data analytic or pattern recognition studies that examine the acoustic factors that separate vowel categories; and (3) perceptual experiments where listeners' categorization is related to specific measurable properties of stimuli, e.g., phonetic continuum experiments. Of key importance to the discussion is what Lindblom and Studdert-Kennedy (1967) have termed "complementarity," that is, the correspondence between details of variation in production data on the one hand and details of perception on the other.

Four types of information have been implicated in the perception of English vowels. These are: (1) static properties, such as steady-state formant frequencies and the fundamental; (2) dynamic properties, including inherent spectral change and consonantal context effects; (3) intrinsic (intrasegmental) relational properties, especially relations among the fundamental and formant frequencies within vowels; (4) extrinsic (transsegmental) relational properties, such as relative vowel duration and the relative formant frequencies of a vowel compared to those of other vowels of the same speaker. The main conclusion of this work is that, although the relative importance of some of these effects is situation dependent, none of these factors can safely be ignored in a full account of English vowel perception. It seems fruitless for us to concentrate on only one set of effects and assume that the others are laboratory curiosities.

While a complete understanding of the relative weights of these factors in various laboratory conditions, much less in normal conversation and fluent discourse, must await further research and theory building, there is little reason to doubt the adequacy of available experimental and modeling techniques in advancing toward that goal. As a matter of research policy, it is probably best to assume that there are no real mysteries in speech perception, but rather only some difficult puzzles. Assessing the varying weights of different factors in different circumstances is a complex task, but one where it should be possible to make incremental progress. For example, aspects of experiments I and II above could be combined in an attempt to assess the relative size of consonantal context effects and speaker related effects. As we attempt to handle problems that begin to approach the complexity of natural speech, we must expect that our experiments will become more complex.
At this level of complexity, qualitative comparisons are unlikely to resolve all issues. We should strive to develop formal, quantitative models so that they can guide us in the design of critical experiments. Experiments with modified natural speech (or synthetic speech carefully modeled on natural tokens) that attempt to correlate perceptual changes with measurable properties of the stimuli are of particular interest. In our laboratories, we hope to continue to develop modeling techniques for the explicit comparison of explicit recognition models to categorization by human listeners. It is such detailed comparison, I believe, that will lead to the most rapid (and, perhaps, the only convincing) progress.
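As one concrete instance of the modified-natural-speech experiments advocated here, the windowed-section manipulation of Nearey and Assmann (1986) summarized in Sec. VI B can be sketched in a few lines. The sketch is illustrative only: the sampling rate, the section start times, the gliding test tone, and all function names are assumptions introduced here, not details of the original study.

```python
import math

def hamming(n):
    """Return an n-point Hamming window (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def windowed_section(signal, start_s, fs, dur_s=0.030):
    """Cut a dur_s-long section starting at start_s seconds and taper it."""
    n = int(round(dur_s * fs))
    first = int(round(start_s * fs))
    section = signal[first:first + n]
    if len(section) != n:
        raise ValueError("section runs past the end of the signal")
    return [x * w for x, w in zip(section, hamming(n))]

def make_test_vowels(signal, fs, onset_s, offglide_s, gap_s=0.010):
    """Build the A-B, A-A, and B-A stimuli from one vowel waveform."""
    a = windowed_section(signal, onset_s, fs)      # onset section (A)
    b = windowed_section(signal, offglide_s, fs)   # offglide section (B)
    gap = [0.0] * int(round(gap_s * fs))           # 10 ms of silence
    return {"A-B": a + gap + b, "A-A": a + gap + a, "B-A": b + gap + a}

# Hypothetical 200-ms "vowel": a tone whose frequency glides upward.
FS = 10000          # assumed sampling rate in Hz
N_SAMPLES = 2000    # 200 ms at 10 kHz
vowel = [math.sin(2.0 * math.pi * (500.0 + 1000.0 * t / FS) * t / FS)
         for t in range(N_SAMPLES)]
stimuli = make_test_vowels(vowel, FS, onset_s=0.020, offglide_s=0.150)
```

Each test stimulus is two tapered 30-ms sections around a 10-ms silent gap, so the three conditions differ only in which sections are used and in what order.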
ACKNOWLEDGMENTS

Thanks are due to Grace Wiebe, Ming Ming Pu, Satomi Komai, James Ko, and D. P. Lee for their help in running experiment I and in listening to pilots, as well as their helpful comments on the preliminary analysis of the data. I would also like to thank Björn Lindblom and an anonymous reviewer for numerous helpful suggestions and criticisms. Portions of this work were supported by SSHRC.
APPENDIX A: SOME ISSUES IN DATA ANALYTIC VOWEL NORMALIZATION

This discussion provides a somewhat different perspective on some issues in data analytic vowel normalization from those presented by Miller (1989), although there is substantive agreement on a number of points. The issues that differentiate the majority of current approaches fall into four major classes: (1) the "correct" or optimal transformation of the frequency axis (e.g., log versus Bark); (2) the geometric nature of subsequent transformations (e.g., translations, rotations, rescalings); (3) the number and nature of speaker-dependent variables and how they are to be estimated; and (4) the nature of the classification algorithm applied to the transformed space (e.g., linear or quadratic discriminant analysis, simple distance functions, empirically chosen boundary regions). Some of these issues and their interactions are discussed below.

1. Nonlinear transformations of the frequency axis

The choice of frequency axis transformation is a problem that cuts across both intrinsic and extrinsic approaches. The scales adopted have generally been borrowed from auditory psychophysics. A logarithmic scale (musical semitones) was advocated by Joos (1948). In the 1950s and '60s, the mel scale (Stevens and Volkmann, 1940) and two approximations to it, the Koenig scale and Fant's (1973) technical mel scale, came into widespread use. Stevens' scale was based on such tasks as fractionation of pitch intervals and direct subjective judgments of pitch differences for sinusoids. However, recent work by Elmasian and Birnbaum (1984) indicates that the logarithmic (musical) scale, coupled with a subtractive model of comparison, actually provides a more accurate account of pitch scaling.

A number of contemporary researchers represent the frequency axis for vowels in the Bark scale. This measure is based on auditory masking and other studies that estimate the bandwidth of auditory filters (Zwicker and Terhardt, 1980). Note that these scales were originally developed to account for properties relating to frequency resolution, rather than "pitch magnitude" or "tonality scale." The transfer from one to another, when explicit, is often made without discussion. Thus, e.g., Traunmüller speaks of "...a scale representing critical bands with unit width (1 Bark), which may also be considered as a tonality scale" (1981, p. 1465). These critical-band-rate scales (which are similar in appearance to pitch magnitude scales, like the mel scale) are constructed by integrating the inverse of the estimated critical-band function (Moore and Glasberg, 1983). Above 1000 Hz or so, the relative shapes of the Bark, mel, and hertz scales are quite similar. The Bark scale and mel scale both differ from the log scale in that the lower frequency region (below about 500 Hz) is more nearly linearly related to hertz than is the log scale (which is concave downward when plotted against hertz throughout its range). There is some dispute about the bandwidth of the auditory filters in the low-frequency range. Moore and Glasberg (1981) have proposed an equivalent rectangular bandwidth (ERB) scale that is slightly more "log-shaped" (ultimately, because the estimated filter bandwidths have more nearly constant Q properties than the Zwicker and Terhardt data) in the low-frequency range.

A possibility that has long intrigued some workers in this area is that certain relational properties of basilar membrane excitation patterns might be invariant for equivalent vowels (Chiba and Kajiyama, 1941; Potter and Steinberg, 1950; Syrdal and Gopal, 1986). Such models seem most appropriate for a pure "place theory" of frequency coding, rather than, for example, modern synchrony theories. Such models also run into the difficulty that individual harmonics are clearly resolved (by any of the critical-band functions, as well as by empirical results) in the F1 region. This poses a problem for both formant-based and "whole spectrum" auditory theories. See Klatt (1986b), Chistovich (1985), Assmann (1985), and Assmann and Nearey (1987) for discussion of related issues.

There have been several attempts to provide ad hoc scales for formant frequencies specifically designed to optimize some property of vowel representation. In the context of intrinsic normalization procedures, a modified Bark scale has been suggested by Traunmüller (1981; see also Syrdal, 1984; Syrdal and Gopal, 1986). This scale was specifically developed to account for relationships of F1 and F0. [The exact procedure used is far from clear. Traunmüller states: "Several careful identifications, made by the author [Traunmüller], of one-formant vowels...led to the suggestion that the dependence of perceived openness on the distance between F1 and F0 could be described uniformly if the Bark scale of tonality was modified at its low frequency end. This modification is shown in Fig. 7" (1981, p. 1469).] It might be noted that the modifications suggested do not appear to
have any independent psychophysical motivation whatsoever and that, in fact, the changes to the classical Bark scale suggested by Traunmüller for the region below 250 Hz are in direct conflict with the revisions suggested by Moore and Glasberg in their ERB-rate scale.

In the context of extrinsic normalization, there have been at least two efforts involving ad hoc scales. Nearey (1978) attempted to construct an optimal scale for additive decomposition of speaker and vowel effects (in an extrinsic normalization context), using methods described by Anscombe and Tukey (1963) and Box and Cox (1964). Lennig (1978) attempted to construct a scale that would lead to more homogeneous variances for vowel formant measures [also in an additive extrinsic normalization framework; see also Kent and Forner (1979) and Nearey (1978, p. 146) for related remarks on variance stabilizing properties of the log transform]. Although some minimal improvement in extrinsic normalization was found in both cases, both Nearey and Lennig concluded that the scales they constructed were not sufficiently different from an ordinary log scale to merit the additional complexity, in the absence of more compelling results and more data. Similarly, Peterson (1961) and Miller (1989) have suggested that there is little evidence to favor auditory scales (mels, Barks) over more traditional scales in intrinsic normalization schemes. Hillenbrand and Gayvert (1987) provide evidence that even a representation in linear hertz can produce high identification rates of the Peterson and Barney data, provided quadratic discriminant analysis, which does not assume the homogeneity of dispersion matrices, is used for classification.
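The scale comparisons made in this subsection can be checked numerically. The sketch below uses the analytic Bark approximation usually attributed to Zwicker and Terhardt (1980) and the commonly quoted form of Fant's technical mel scale; both formulas, and the "bowing" index used to quantify departure from linearity in hertz, are assumptions introduced here for illustration rather than computations reported in this paper.

```python
import math

def bark(f_hz):
    """Critical-band rate in Bark (Zwicker-Terhardt style approximation)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

def technical_mel(f_hz):
    """Fant's technical mel scale in its commonly quoted form."""
    return (1000.0 / math.log(2.0)) * math.log(1.0 + f_hz / 1000.0)

def log_scale(f_hz):
    """Natural log of frequency (the log base does not change the index below)."""
    return math.log(f_hz)

def bowing(scale, lo_hz, hi_hz):
    """Departure from linearity in hertz over [lo_hz, hi_hz].

    Returns how far the scale value at the midpoint lies above the chord
    joining the end points, as a fraction of the end-to-end span.
    Zero means the scale is exactly linear in hertz over the interval.
    """
    mid_hz = 0.5 * (lo_hz + hi_hz)
    chord_mid = 0.5 * (scale(lo_hz) + scale(hi_hz))
    return (scale(mid_hz) - chord_mid) / (scale(hi_hz) - scale(lo_hz))

for name, scale in [("log", log_scale), ("mel", technical_mel), ("Bark", bark)]:
    print(f"{name:5s} bowing, 100-500 Hz: {bowing(scale, 100.0, 500.0):.3f}")
```

Under these assumed formulas the Bark and mel scales bow much less than the log scale between 100 and 500 Hz, which is the sense in which they are "more nearly linearly related to hertz" in the low-frequency region.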
2. Normalizing transformations of parameters

Given an initial transformation of the frequency axis, all the approaches discussed here agree that the next step is to apply some set of linear normalizing transformations to the derived measures. They disagree in the number of parameters in the transforms and how the parameters are to be estimated. The major difference between intrinsic and extrinsic schemes, as noted earlier, is in the locus (intrasyllabic versus transsyllabic) of speaker-dependent information that is allowed to enter into subsequent transformations. The transformations themselves are often remarkably similar (see discussion below).

There has been extensive discussion of the appropriate number of parameters and appropriate family of transformations to use in connection with extrinsic normalization schemes. The simplest of these are based on the uniform scaling hypothesis (Nordström and Lindblom, 1975), also known as the constant ratio hypothesis (CRH) or constant log interval hypothesis (CLIH, Nearey, 1978, pp. 89-90). These methods allow for a single speaker-dependent parameter to be used for normalization. This parameter is a multiplicative scale factor in a linear hertz space; in a log frequency space, it is an additive location parameter, corresponding roughly to the center of gravity of a speaker's vowel triangle (Nearey, 1978). Other methods allow for a much broader class of speaker-dependent affine transformations. Disner (1980), Lennig (1978), and Hindle (1978) attempt to bring evidence from cross language and dialect comparison to bear on normalization issues. Disner argued that Nearey's CLIH procedure was not well suited to certain cross-language comparisons where there were large skewings of vowel inventories (e.g., comparing languages with front rounded vowels to those without). Hindle and Lennig, on the other hand, found Nearey's CLIH scheme (aptly renamed log-mean normalization by Hindle) to be the method of choice in dialect comparisons (see also Holden and Nearey, 1986). Interestingly, Disner (1986) adopts an ANOVA model for cross language comparison, which, except for the choice of scale (mel versus log), bears much closer ties to Nearey's (1978) additive normalization models than to the PARAFAC models she preferred in her 1980 paper. Nearey (1978, 1983) has also investigated the adequacy of CLIH in perceptual experiments for "single synthetic speaker" contexts and found it to provide a good first approximation of perceptual results (see also Lieberman, 1984).

3. Classification procedures

The initial frequency transformation followed by the normalizing procedures together result in what will be called a transformed normalized formant space (TNFS). There is again considerable diversity of opinion on the matter of appropriate methods of classification. Some researchers use standard statistical pattern recognition procedures, such as discriminant analysis, assuming vowel measures display multinormal distributions in the TNFS (Nearey, 1978; Nearey et al., 1979; Assmann et al., 1982; Hillenbrand and Gayvert, 1987). Others use F ratios and Euclidean distance metrics or graphic techniques as clustering indices, without explicit classification (Hindle, 1978; Lennig, 1978; Disner, 1980). Syrdal (1984; Syrdal and Gopal, 1986) prefers a coarse categorization of the axes of the TNFS corresponding to binary feature values, though she also uses discriminant analysis in some comparisons. [Interestingly, Nearey's (1978) CLIH scheme compares favorably in the latter.] Finally, in Miller's scheme, complex categorization regions, called perceptual target zones, are constructed by an iterative (and not fully disclosed) process. It has recently been pointed out by Hillenbrand and Gayvert (1987) that the choice of classification algorithm can interact profoundly with other factors, such as choice of scale. Furthermore, in the course of linear discriminant analysis, intermediate dimension-reducing linear transformations of the input axes are often imposed prior to classification and thus interact with the a priori secondary transformations discussed above.

4. Covert similarities among representational systems

Differences between several TNFS systems appear smaller on reflection than first meets the eye. Some of the key properties of Miller's intrinsic scheme are identical to those of Nearey's (1978) CLIH scheme. According to both, differences in log(F2)-log(F1) should be invariant across speakers for a given vowel (Nearey, 1978, p. 91). This may be less than apparent to the casual observer of normalization schemes, since Miller chooses to use the notation log(F2/F1) to express the same quantity. If CLIH is extended to three formants, precisely the same within-vowel cross-speaker
relationships among the formants are posited by CLIH and Miller's approach. CLIH also explicitly posits fixed relationships among the formant frequencies of different vowels of individual speakers. Given an extension of CLIH to three formants and the degree of correlation between a speaker's mean log F0 and mean log formant frequencies, it would be surprising if strikingly different classification results would be obtained, say, by discriminant function analysis of vowel data "preprocessed" by the front ends of the two theories. It is also interesting to note that Miller's sensory reference formula makes use of geometric means of a speaker's fundamental frequency, which corresponds, in effect, to an extrinsic (transsyllabic) factor. Since the log means used in CLIH are simply the logs of the geometric means over F1 and F2, a slight redefinition of the sensory reference would lead to a theory virtually indistinguishable from a three-formant version of log-mean normalization, apart from the method used to determine the final classification boundaries.
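The equivalences described in this paragraph can be illustrated with a minimal sketch of log-mean (CLIH-style) normalization. The function name, the three-vowel inventory, the formant values, and the 20% scale factor are all hypothetical choices made here for illustration; the sketch also assumes that both speakers contribute the same set of vowels.

```python
import math

def log_mean_normalize(vowels):
    """Log-mean (CLIH-style) normalization for one speaker.

    vowels maps a vowel label to (F1, F2) in Hz. The speaker's single
    parameter is the mean of log F1 and log F2 over all of that speaker's
    vowels; it is subtracted from every log formant.
    """
    logs = [math.log(f) for pair in vowels.values() for f in pair]
    speaker_mean = sum(logs) / len(logs)
    return {v: (math.log(f1) - speaker_mean, math.log(f2) - speaker_mean)
            for v, (f1, f2) in vowels.items()}

# Hypothetical speaker A, and a speaker B whose formants are all 20% higher.
speaker_a = {"i": (270.0, 2290.0), "a": (730.0, 1090.0), "u": (300.0, 870.0)}
speaker_b = {v: (1.2 * f1, 1.2 * f2) for v, (f1, f2) in speaker_a.items()}

norm_a = log_mean_normalize(speaker_a)
norm_b = log_mean_normalize(speaker_b)

for v in speaker_a:
    f1a, f2a = speaker_a[v]
    f1b, f2b = speaker_b[v]
    # log(F2) - log(F1) equals log(F2/F1) and is unchanged by uniform scaling.
    print(v, round(math.log(f2a / f1a), 4), round(math.log(f2b) - math.log(f1b), 4))
```

Under uniform scaling the two speakers receive identical normalized coordinates, and the within-vowel quantity log(F2) - log(F1) is the same for both whether it is written as a difference of logs or as log(F2/F1).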
Miller's and Syrdal's methods differ by considerably less than might be apparent from a casual reading. As noted above, and by Miller (1989), differences between a logarithmic versus Bark scale transformation (apart from an arbitrary, global linear transform that will not affect most classification schemes) are difficult to detect except in the low-frequency range. In addition to this, Miller's use of the sensory reference is, except for a global constant, equivalent to subtracting 1/3 of the log frequency of the fundamental. This has a grossly similar effect on fundamental frequencies (compared to formant frequencies above 250 Hz) as Traunmüller's (1981) ad hoc modification of the very low range of the Bark scale. They both decrease the apparent magnitude of changes in the F0 range, compared to the uniform use of unmodified scales (log or Bark) throughout the F0 and formant ranges. Though there are some potentially testable differences in detail, both these methods may be viewed as primarily incorporating the empirical relationship summarized by Ainsworth (1975), viz., a 30% rise in formant frequencies corresponds roughly to a 100% rise in fundamental. It should be realized that the adjustments in scales are essentially ad hoc modifications to account for the empirical relationships found in speech data and do not appear to follow from any independently motivated "auditory" properties.
In passing, it might be noted that there is at least one more essentially ad hoc hypothesis that is quite prevalent in auditory approaches to vowel perception, namely, the existence of a 3- to 3.5-Bark critical separation between adjacent
formants. Chistovich, who introduced this rule of thumb into the literature, is always careful to point out that such an effect must be "post-auditory," since the critical auditory distance, the critical band, is 1, not 3 or 3.5, Bark wide. According to classical critical-band theory, auditory filters at spacings larger than this interval provide independent information to the higher processing levels. To the best of my knowledge, the 3-Bark interval, like Traunmüller's modification of the Bark scale, is motivated purely by phenomena related to speech and has no counterpart in general auditory psychophysics. Arguably, both of these effects could be classed as "speech mode," rather than purely auditory phenomena.

APPENDIX B: CONSIDERATIONS OF THE "NATURALNESS" OF THE STIMULI IN EXPERIMENT I

The synthesis conditions of experiment I involve combinations of parameters that are not encountered in normal synthesis situations. Quite rightly, both reviewers raised some questions about the naturalness of the stimuli. This appendix addresses some of these concerns.

1. Fundamental frequency and formant frequency ranges

It should be pointed out that all the individual factors involved in this experiment are within ranges already explored in the literature. Both Fujisaki and Kawashima (1968) and Holmes (1986) studied increases (184% and 137%, respectively) in F0 that were actually larger than those in the present study (about 104%). Fujisaki and Kawashima's F3 increase (60%) was considerably larger than that of the present study (32%), while Holmes' ranges were roughly equivalent. The increase of the extrinsic factor (32%) was similar to that of Ainsworth (1975). While all the above mentioned studies included smaller step sizes in their experiments, their most extreme conditions invariably produced the most extreme and, therefore, most readily detectable results.

From the point of view of F0 and the ranges of the lowest three formant frequencies, the all factors low, or baseline, condition was roughly equivalent to the average male data of Peterson and Barney (1952, excluding the vowel /ɝ/); the all factors high condition corresponded roughly to average children's measurements on these parameters. Table BI compares the formant frequency ranges for F1 to F3 and fundamental frequency ranges with those of Peterson and Barney (1952).

TABLE BI. Comparison of F0 and formant ranges from Peterson and Barney (1952) averages and experiment I.

             Experiment I         Peterson and Barney (1952)
             All factors low      Adult males
F0 (Hz)      120                  124-141
F1 (Hz)      250-750              270-730
F2 (Hz)      750-2250             840-2290
F3 (Hz)      2003-2774            2240-3010

             All factors high     Children
F0 (Hz)      270                  251-276
F1 (Hz)      329-987              370-1030
F2 (Hz)      987-2961             1060-3200
F3 (Hz)      2657-3806            3170-3730

As far as the mixed conditions (i.e., where some factors are high and some low) are concerned, some of the normal relationships that are present, on the average, in natural speech are altered. Two of the mixed conditions involve cases where there is an unusual relationship between the fundamental and vocal tract resonances. Thus condition 2 (P) corresponds approximately to a male countertenor voice
(Gottfried and Chew, 1986) or perhaps to a female voice like that of Julia Child. Condition 6 (EH) corresponds roughly to a Popeye-like male voice or to helium speech at normal atmospheric pressure (Morrow, 1971; Beil, 1962).

The two remaining conditions, however, namely, condition 3 (H) and condition 5 (E), correspond to cases that are somewhat contradictory with respect to intrinsic and extrinsic specification of supralaryngeal vocal tract size. I am aware of no nonsynthetic conditions corresponding to these situations. Nonetheless, these conditions were included in an attempt to explore parts of logically possible phonetic space relatively remote from the main cluster of natural variation. There is some risk that such improbable stimuli may give rise to artifacts that have little to do with the perception of natural speech. However, to the extent that listeners respond in systematic ways to such stimuli, it may still be useful to attempt to compare listeners' performance with expectations from alternative models. All other things being equal, a robust model that accounts for listeners' behavior on relatively unnatural (but still perceptually interpretable) stimuli without sacrificing performance on more natural stimuli is a better account of perception than one that is highly sensitive to the covariance patterns of natural speech in ways in which listeners are not. Furthermore, stimuli with even larger mismatches of intrinsic properties were included in the Fujisaki and Kawashima experiment, and yet their results show smooth changes of boundary position throughout this extended range.
2. Spectral tilt, formant amplitude, and voice quality

Although no body of normative data is yet available in the literature, there has recently been considerable interest in glottal source parameters, including factors such as spectral tilt and breathy excitation. Indications are quite strong (Fant et al., 1987; Klatt, 1987) that such parameters must be controlled for high-quality synthesis of female and children's voices. Since neither sufficient information nor appropriate software was available at the time of the construction of these stimuli, no explicit attempt was made to deal with such factors. However, the synthesis strategy described in experiment I had some indirect effects on relative formant amplitudes that might have affected apparent voice quality. (A simpler and better approach would have been to use parallel mode synthesis with a fixed number of formants. Unfortunately, the continuum generation software used to drive
the implementation of the Klatt synthesizer did not include this option.)

The fact that formants were omitted from the cascade synthesis when closer than 550 Hz to the Nyquist frequency resulted in a somewhat more rapid high-frequency roll-off than would be the case for a full set of formants. Impressionistically, a slight difference in voice quality was noticeable in the low higher formant stimulus sets, corresponding to additional "bassiness" of the "short formant" set (five cascaded formants) compared to the "complete formant set" (six cascaded formants). However, in informal back-to-back comparison for a number of vowels, this difference did not appear to affect the phonetic quality, nor did one set sound particularly more "natural" than another.

In the case of the high F3 stimuli, the proximity of F5 in the "complete formant" set (five cascaded formants) to the folding frequency led to high amplitudes of F5, and a highly unnatural quality, particularly for back vowels. In this case, the short formant set (four cascaded formants) was considerably more speechlike and the voices had more or less the expected impressionistic qualities of speaker size. The effects of leaving out a formant are global, since the pole density throughout the entire periodic digital spectrum is affected. However, the magnitude of the effect of omitting a single pole near the folding frequency decreases with decreasing frequency. The result is comparable to raising the effective frequency of a higher pole correction network in analog synthesis.

To investigate the consequences of this effect on the present experiment, stimuli at the corners of the vowel space corresponding roughly to /i, æ, ɑ/, and /u/ of both ensemble sets as well as test stimuli in the overlapping regions (corresponding to the stimuli near the center of F1 and F2 ranges for both ensemble sets) were resynthesized. The amplitudes of the lowest four or five formants were measured using LPC techniques described in Nearey and Assmann (1986). For the low higher formants stimuli, vowels synthesized with five formants were compared with vowels synthesized with six formants. For the high higher formants stimuli, vowels were synthesized with four and five formants. The results showed the expected effect of spectral tilt, with short formant stimuli showing relatively more rapid spectral roll-off in the higher formants.

The largest attenuation of F3 (relative to F1) observed was 11 dB for the low ensemble, high higher formants /i/ stimulus. For the other "corner vowels" and in the overlapping region, attenuation ranged from 3 to 7 dB. The range of variation in the relative amplitudes of F3 to F2 and F2 to F1 caused by the presence or absence of the "top formant" fell well within the range of natural speaker-to-speaker variation for comparable vowels (i.e., vowels in analogous positions in the F1 x F2 plane) reported by Peterson (1961). The total range of relative amplitudes of adjacent formants for vowels of nominally comparable quality was greater for the present stimuli than in the Peterson data. However, this was largely due to the differences in formant spacing caused by the different formant range combinations deliberately manipulated in the experiment rather than the presence or absence of the top formant.

It is certainly possible that formant amplitude differences have affected aspects of the perception of vowel quality in experiment I. Indeed, the results of Fujisaki and Kawashima (1968, see their experiment 7 and Figure 8) indicate that a change in spectral tilt of 12 dB/oct can have a considerable effect on the efficacy of F3 as a source of intrinsic normalization information for noise excited stimuli. However, they also report that raising the effective frequency of the higher pole correction (which corrects for the absence of formants above F4 in their analog synthesizer) had no effect on vowel identification boundaries (see their experiment 5 and Figure 9). The effect of this modification would be quite similar to the deletion of the highest pole below the Nyquist frequency in digital formant synthesis. Finally, as reported
in thetext,a decrease
in F 3 amplitudeby 1$ dB (largerthan
anycausedhereby omission
of thetopformant)by Holmes
(1986) actuallyled to slightlylargerobservedF 1 and F2
shiftsaswell asto morefemalevoicejudgments.
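The roll-off that results from dropping the highest resonator of a cascade synthesizer can be illustrated with a minimal sketch. It uses the digital resonator coefficients given by Klatt (1980). Everything else is an assumption made only for illustration and is not taken from the experiment: the 10-kHz sampling rate, the neutral-tube formant frequencies, the bandwidths, and all function and variable names. Source and radiation characteristics are ignored because they cancel in the comparison.

```python
import cmath
import math


def resonator_gain(freq_hz, center_hz, bandwidth_hz, sample_rate_hz):
    """Magnitude response at freq_hz of one digital resonator,
    H(z) = A / (1 - B z^-1 - C z^-2), coefficients as in Klatt (1980)."""
    t = 1.0 / sample_rate_hz
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * center_hz * t)
    a = 1.0 - b - c  # normalizes the gain to 1 at 0 Hz
    z_inv = cmath.exp(-2j * math.pi * freq_hz * t)
    return abs(a / (1.0 - b * z_inv - c * z_inv ** 2))


def cascade_level_db(freq_hz, formants, sample_rate_hz):
    """Level in dB of the cascaded (multiplied) resonator responses at freq_hz."""
    gain = 1.0
    for center_hz, bandwidth_hz in formants:
        gain *= resonator_gain(freq_hz, center_hz, bandwidth_hz, sample_rate_hz)
    return 20.0 * math.log10(gain)


def f3_re_f1_db(formants, sample_rate_hz):
    """Level at the F3 frequency relative to the level at the F1 frequency."""
    f1_hz = formants[0][0]
    f3_hz = formants[2][0]
    return (cascade_level_db(f3_hz, formants, sample_rate_hz)
            - cascade_level_db(f1_hz, formants, sample_rate_hz))


# Hypothetical values: (frequency, bandwidth) pairs in Hz for a neutral tube.
SAMPLE_RATE_HZ = 10000.0
FULL_SET = [(500.0, 60.0), (1500.0, 90.0), (2500.0, 150.0),
            (3500.0, 200.0), (4500.0, 250.0)]
SHORT_SET = FULL_SET[:-1]  # same vowel with the top formant omitted

# Positive value: F3 is weaker relative to F1 when the top formant is missing.
attenuation_db = (f3_re_f1_db(FULL_SET, SAMPLE_RATE_HZ)
                  - f3_re_f1_db(SHORT_SET, SAMPLE_RATE_HZ))
print(f"F3 attenuation re F1 from omitting the top formant: {attenuation_db:.1f} dB")
```

Under these assumed values the omission lowers F3 relative to F1 by a few dB. The size of the effect depends on how close the omitted resonator lies to F3 and to the Nyquist frequency.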
In general, the effects of changes of formant amplitude and/or spectral tilt on vowel perception are not well understood. The work of Ainsworth and Millar (1971) suggests that for two formant stimuli, over a range of about 28 dB, vowel identification is relatively insensitive to formant amplitude change. Lindquist and Pauli (1968) found that a large change (25 dB) of the amplitude of an F2-F3 pattern relative to F1 produced only negligible change in categorization along an F2-F3 continuum when the stimuli were presented such that a given spectral tilt condition was in effect for an entire experimental session (although measurable changes are observed when stimuli with differing spectral tilts are mixed). In experiments involving identification of synthetic talkers by voice quality, Carrell (1984) found that glottal waveshapes, including changes of spectral slope, had relatively little influence on apparent speaker identity, while fundamental frequency and especially formant range effects had a much larger effect. Chistovich (1985) reviews the effects of formant amplitude changes on phonetic quality when formants are relatively closely spaced. In the case of the present experiment, when F1 and F2 were closely spaced, the effect of the missing top formant was quite small, showing no more than a 1-dB change in relative amplitude of F1 and F2 for the back corner vowels /o/ and /u/.
Since the stimuli of experiment I represent, to a large extent, explorations into synthetic terra incognita, it is conceivable that some of the effects noted are not representative of phenomena encountered in natural speech. However, there is no compelling evidence, whether from the literature or from listeners' comments, to suggest that there are any gross violations of constraints of known perceptual relevance.
Nonetheless, further experiments are called for, using parallel synthesis with independent manipulation of formant amplitudes and with more sophisticated control of glottal waveshape. Simultaneous judgments should be gathered including: (1) vowel identity; (2) speaker size, sex, and age characteristics; and (3) vocal effort [in view of "the Traunmüller effect" reported by Lindblom (1987)]. It would be impractical to attempt to produce "fully crossed" experimental designs involving all these factors. Judicious interpolation between relatively well-understood synthetic voices based on natural speech and some of the extreme conditions represented in experiment I might be a promising research strategy.
¹This assumption is the subject of criticism by advocates of what might be called the "whole (auditory) spectrum" school, e.g., Suomi (1984), Bladon et al. (1984). As Klatt has pointed out (in the discussion that followed the oral version of this paper), the work of Pols et al. (1969) indicates that deciding between "spectral shape" and "formant-oriented" approaches may be very difficult due to the similarity of the putative perceptual relationships that may be derived from these alternate representations (see also Nearey et al., 1979). For at least some of these approaches, problems such as speaker differences (e.g., Pols 1977, pp. 14-15) lead to similar difficulties as for formant based theories (but see Suomi, 1984; Bladon et al., 1984). Further problems associated with auditory theory and vowel perception such as F', F2' and the role of 3- to 3.5-Bark separation of formants are important topics, but cannot be dealt with here. See Assmann (1985), Chistovich (1985) and Klatt (1986b) for a general discussion of issues related to peripheral auditory modeling in speech research.
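As an aside on the 3- to 3.5-Bark separation mentioned in this note, the sketch below converts formant frequencies to critical-band rate with the analytical expression of Zwicker and Terhardt (1980) and checks whether two formants fall within a chosen Bark distance. The function names, the default 3.5-Bark limit, and the example frequencies are illustrative assumptions and are not part of the original text.

```python
import math


def hz_to_bark(freq_hz):
    """Critical-band rate in Bark, after Zwicker and Terhardt (1980)."""
    return (13.0 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500.0) ** 2))


def bark_separation(lower_hz, upper_hz):
    """Distance in Bark between two frequencies (upper minus lower)."""
    return hz_to_bark(upper_hz) - hz_to_bark(lower_hz)


def within_bark_limit(lower_hz, upper_hz, limit_bark=3.5):
    """True when the two frequencies are closer than limit_bark."""
    return bark_separation(lower_hz, upper_hz) < limit_bark


# Hypothetical formant pairs (Hz), chosen only to exercise the functions.
print(within_bark_limit(2200.0, 2900.0))  # closely spaced F2 and F3
print(within_bark_limit(300.0, 2300.0))   # widely spaced F1 and F2
```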
²This is not to say that such issues are not amenable to experiment. See, for example, the reaction time study of Summerfield and Haggard (1975). Experiments to decide between competing models of real time processing of speaker differences are likely to be quite subtle and will undoubtedly benefit from some prior agreement about what stimulus properties are relevant in the first place.
³Female averages were used here because the chosen intervals are somewhat more uniform than the male averages. In particular, the male /ɛ/-/æ/ F1 difference is on the order of only 25%. How much of this is due to nonuniform scaling (Fant, 1973) or dialect heterogeneity (Nordström and Lindblom, 1975; Nearey, 1978) is not clear.
⁴Effects may be larger for stops bounded by some resonant consonants. Pre-/r/ and pre-/l/ effects are quite large in many dialects of English. However, in these cases, there is some question as to whether specifically selected variants, or "extrinsic allophones" are involved, rather than ordinary coarticulation effects. To the best of my knowledge, no such proposal has been made for vowels in stop contexts in English, at least for bilabial and alveolar stops. However, because of the common neutralization of phonological contrast in dialects preceding voiced velar stops (especially the merger of /eg/-/ɛg/ and to a lesser extent /ig/-/ɪg/), some caution should be applied in this case as well.
⁵It might be noted that, strictly speaking, Miller's theory is not purely intrinsic, since his sensory reference formula uses a geometric mean of a speaker's F0, which unless the time constant of the averaging process is very short, likely includes information extrinsic to the vowel being normalized.
⁶As an anonymous reviewer has pointed out, some implied assumptions of orthogonality of traditional phonetics do not strictly hold perceptually. Thus, for example, F0 and F1 transitions have been shown to have an effect on VOT boundaries. However, such disorthogonalities, although reliable, are generally secondary, shifting boundaries by only a small percentage of the natural speech range of the primary variable (VOT). Such small interactions, while theoretically very important, are nothing like the implied linkages in Traunmüller's or Syrdal's theories, where, rather than suggesting a small "retuning" of F1, a new perceptual feature, the tonality difference between F1 and F0, is proposed to replace F1.
⁷Lehiste (1970) reports that there is no necessary connection between pitch changes and formant frequencies. Although some languages, like Vietnamese, may have distinct allophones of certain vowels in association with different tones, these are language specific "extrinsic allophones" and not necessary phonetic covariation. For other languages, such as Serbo-Croatian, Lehiste asserts "...tone appears to have no effect on phonetic quality" (1970, p. 78).
⁸Formant measurements of a few vowel tokens from recordings of these two speakers have been made in our laboratories. Popeye shows formant frequency values approximately in the range of Peterson and Barney (1952) child averages, while his fundamental is below the average male range. Julia Child shows formant ranges only slightly higher than male average data, with a fundamental frequency in the female range (200-300 Hz).
⁹These figures are based on the higher pole correction calculation of Flanagan (1972, pp. 217-218) applied to neutral tube F3 and F4 frequencies of 2500 and 3500 Hz.
¹⁰This is consistent with the lack of effect on vowel identity for an increase in the effective frequency of the higher pole correction factor observed by Fujisaki and Kawashima (1968).
¹¹Such "surrounded" categories tend to have symmetrical "moundlike" response surfaces, for which various measures of location such as the mean, median, or mode are likely to be very similar. For exterior categories, categorization profiles along one or more of the dimensions are likely to have a nonsymmetrical, ogival shape for which there is no mode, but only asymptotes of 0% and 100% identification within reasonable stimulus ranges. This characteristic is a major factor in the windowing problem noted by Ainsworth (1975).
¹²Such negative shifts are not unprecedented. Holmes' (1986) Figure 16.1 shows two small negative shifts in center of gravity of F2 for an increase in F3 between conditions 2 and 6.
¹³This applies to Nearey's (1978) constant log interval hypothesis (CLIH). Nearey also discusses a second version, CLIH2, in which separate scale factors are estimated for F1 and F2. Although the method of estimation used there involves only F1 and F2 measurements, CLIH2 could be modified for an F0 correction factor.
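The difference between a single speaker scale factor and separate factors for F1 and F2 can be sketched as follows. This is only an illustration of log-mean normalization in the spirit of CLIH and CLIH2. The estimation procedure actually used in Nearey (1978) may differ, and the function names and formant values are hypothetical.

```python
import math


def clih_normalize(tokens):
    """Subtract one speaker-wide mean log formant from log F1 and log F2.

    tokens: list of (f1_hz, f2_hz) pairs from a single speaker.
    """
    logs = [math.log(f) for pair in tokens for f in pair]
    speaker_mean = sum(logs) / len(logs)
    return [(math.log(f1) - speaker_mean, math.log(f2) - speaker_mean)
            for f1, f2 in tokens]


def clih2_normalize(tokens):
    """Subtract separate speaker means for log F1 and for log F2."""
    mean_f1 = sum(math.log(f1) for f1, _ in tokens) / len(tokens)
    mean_f2 = sum(math.log(f2) for _, f2 in tokens) / len(tokens)
    return [(math.log(f1) - mean_f1, math.log(f2) - mean_f2)
            for f1, f2 in tokens]


# Hypothetical speakers: speaker B is speaker A with all formants scaled by 1.2.
speaker_a = [(270.0, 2290.0), (730.0, 1090.0), (300.0, 870.0)]
speaker_b = [(1.2 * f1, 1.2 * f2) for f1, f2 in speaker_a]

# A uniform multiplicative scale factor becomes an additive constant in log
# units, so both speakers map onto the same normalized values.
for (a1, a2), (b1, b2) in zip(clih_normalize(speaker_a), clih_normalize(speaker_b)):
    print(round(a1 - b1, 9), round(a2 - b2, 9))
```

An F0 correction of the kind mentioned in this note would add a further term to the subtraction. It is left out here because the source does not specify its form.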
¹⁴This is particularly true in light of the rapid transitions in the present stimuli. However, it might be pointed out that Broad and Clermont's model, which predicts even larger shifts for /dVd/ than does Lindblom's, does not depend on empirical transition onsets, but only on theoretically derived "loci" for the phonological category in question.
¹⁵This admittedly awkward term was deliberately used in preference to the more familiar, but phonetically loaded, term "diphthongization," since the latter implies a perceptually salient (to the phonetician at least) change of vowel quality over time.
Fant, G. (1973). Speech Sounds and Features (MIT, Cambridge, MA).
Fant, G., Gobl, C., Karlsson, I., and Lin, Q. (1987). "The female voice - Experiments and overview," J. Acoust. Soc. Am. Suppl. 1 82, S90.
Fisher, W., and Engebretson, A. (1975). "Simple digital speech synthesis," Am. J. Computation. Ling. 16.
Flanagan, J. (1972). Speech Analysis, Synthesis and Perception. Second Edition (Springer-Verlag, Berlin).
Flanagan, J. (1955). "Difference limen for vowel formant frequency," J. Acoust. Soc. Am. 27, 613-617.
Ainsworth, W. (1972). "Duration as a cue in the recognition of synthetic vowels," J. Acoust. Soc. Am. 51, 648-651.
Ainsworth, W. (1975). "Intrinsic and extrinsic factors in vowel judgments," in Auditory Analysis and Perception of Speech, edited by G. Fant and M. Tatham (Academic, London), pp. 103-113.
Ainsworth, W., and Millar, J. (1971). "The effect of relative formant amplitude on the perceived identity of synthetic vowels," Lang. Speech 15, 328-341.
Anscombe, F., and Tukey, J. (1963). "The examination and analysis of residuals," Technometrics 5, 141-160.
Assmann, P. (1985). "The role of harmonics and formants in the perception of vowel quality," unpublished Ph.D. thesis, University of Alberta, Edmonton, Canada.
Assmann, P., Nearey, T., and Hogan, J. (1982). "Vowel identification: Orthographic, perceptual and acoustic aspects," J. Acoust. Soc. Am. 71, 975-989.
Assmann, P. (1979). "The role of context in vowel perception," unpublished M.S. thesis, University of Alberta, Edmonton, Canada.
Assmann, P., and Nearey, T. (1987). "Perception of front vowels: The role of harmonics in the first formant region," J. Acoust. Soc. Am. 81, 520-534.
Barany, E. (1937). "Transposition of speech sounds," J. Acoust. Soc. Am. 8, 217-219.
Fujisaki, H., and Kawashima, T. (1968). "The roles of pitch and the higher formants in the perception of vowels," IEEE Trans. Audio Electroacoust. AU-16, 73-77.
Gay, T. (1974). "A cinefluorographic study of vowel production," J. Phon. 2, 255-266.
Gay, T., and Ushijima, T. (1974). "Effect of speaking rate on stop consonant-vowel coarticulation," in Speech Communication Seminars (KTH, Stockholm), Vol. I, pp. 205-208.
Gerstman, L. (1968). "Classification of self-normalized vowels," IEEE Trans. Audio Electroacoust. AU-16, 78-80.
Gottfried, T., and Chew, S. (1986). "Intelligibility of vowels sung by a countertenor," J. Acoust. Soc. Am. 79, 124-130.
Hillenbrand, J., and Gayvert, R. (1987). "Speaker-independent vowel classification based on fundamental frequency and formant frequencies," J. Acoust. Soc. Am. Suppl. 1 81, S93.
Hindle, D. (1978). "Approaches to vowel normalization in the study of natural speech," in Language Variation: Models and Methods, edited by D. Sankoff (Academic, New York), pp. 161-171.
Holden, K., and Nearey, T. (1986). "A preliminary report on three Russian dialects: Vowel perception and production," Russ. Lang. J. 40, 3-21.
Holmes, J. (1986). "Normalization in vowel perception," in Invariance and Variability in Speech Processes, edited by J. Perkell and D. Klatt (Erlbaum, Hillsdale, NJ), pp. 346-357.
Bell, R. (1962). "Frequency analysis of vowels produced in a helium rich atmosphere," J. Acoust. Soc. Am. 34, 347-349.
Bennett, D. (1968). "Spectral form and duration cues in the recognition of English and German vowels," Lang. Speech 11, 65-85.
Bennett, S., and Weinberg, B. (1979). "Sexual characteristics of preadolescent children's voices," J. Acoust. Soc. Am. 65, 179-189.
Bladon, A., Henton, C., and Pickering, J. B. (1984). "Towards an auditory theory of speaker normalization," Lang. Comm. 4, 59-69.
Box, G., and Cox, D. (1964). "An analysis of transformations," J. R. Stat. Soc. B 26, 211-252.
Broad, D. (1976). "Toward defining acoustic phonetic equivalence for vowels," Phonetica 33, 401-424.
Broad, D., and Wakita, H. (1977). "Piecewise planar representation of vowel formant frequencies," J. Acoust. Soc. Am. 62, 1467-1473.
Broad, D., and Clermont, F. (1987). "A methodology for modeling vowel formant contours in CVC context," J. Acoust. Soc. Am. 81, 155-165.
Buhr, R. D. (1980). "The emergence of vowels in an infant," J. Speech Hear. Res. 23, 73-94.
Carrell, T. (1984). "Contributions of fundamental frequency, formant spacing and glottal waveform to talker identification," unpublished Ph.D. dissertation, Indiana University, Bloomington, IN.
Chiba, T., and Kajiyama, M. (1941). The Vowel: Its Nature and Structure (Tokyo Publishing Co., Tokyo).
Chistovich, L. (1985). "Central auditory processing of peripheral vowel spectra," J. Acoust. Soc. Am. 77, 789-805.
Dechovitz, D. (1977a). "Information conveyed by vowels: a negative finding," J. Acoust. Soc. Am. Suppl. 1 61, S39.
Dechovitz, D. (1977b). "Information conveyed by vowels: a confirmation," Haskins Lab. Stat. Rep. Speech Res. SR-51/52, 213-219.
Diehl, R. L., McCusker, S. B., and Chapman, L. A. (1981). "On the identifiability of synthesized steady-state isolated vowels in isolation and in consonantal context," J. Acoust. Soc. Am. 68, 1626-1635.
Disner, S. F. (1980). "Evaluation of vowel normalization procedures," J. Acoust. Soc. Am. 67, 253-261.
Disner, S. F. (1986). "On describing vowel quality," in Experimental Phonology, edited by J. Ohala and J. Jaeger (Academic, New York), pp. 69-79.
Dudley, H. (1939). "The vocoder," Bell Lab. Rec. 17, 122-126.
Dudley, H. (1940). "The carrier nature of speech," Bell System Tech. J. 19, 495-515.
Elmasian, R., and Birnbaum, M. (1984). "A harmonious note on pitch: Scales of pitch derived from subtractive model of comparison agree with the musical scale," Percept. Psychophys. 36, 531-537.
Joos, M. (1948). "Acoustic phonetics," Lang. Suppl. 24, 1-136.
Kahn, D. (1977). "Near-perfect identification of speaker-randomized vowels without formant transitions," J. Acoust. Soc. Am. Suppl. 1 62, S101(A).
Kallail, K., and Emanuel, F. (1984). "An acoustic comparison of isolated whispered and phonated vowel samples produced by adult male subjects," J. Phon. 12, 175-186.
Kent, R. D., and Forner, L. (1979). "Developmental study of vowel formant frequencies in an imitation task," J. Acoust. Soc. Am. 65, 208-217.
Klatt, D. (1976). "Linguistic uses of segmental duration in English: Acoustic and perceptual evidence," J. Acoust. Soc. Am. 59, 1208-1221.
Klatt, D. (1980). "Software for a cascade/parallel formant synthesizer," J. Acoust. Soc. Am. 67, 971-995.
Klatt, D. (1986a). "Problems of variability in speech recognition and in models of speech perception," in Invariance and Variability in Speech Processes, edited by J. Perkell and D. Klatt (Erlbaum, Hillsdale, NJ), pp. 300-319.
Klatt, D. (1986b). "Representation of the first formant in speech recognition and in models of the auditory periphery," in Proceedings of the Montreal Symposium on Speech Recognition (Canadian Acoustical Association, Montreal), pp. 5-7.
Klatt, D. (1987). "Acoustic correlates of breathiness: First harmonic amplitude, turbulence noise, and tracheal coupling," J. Acoust. Soc. Am. Suppl. 1 82, S91.
Koopmans-van Beinum, F. J. (1979). "Perception of naturally produced vowels: Isolated, from words, and from normal conversation," Proc. 9th Int. Congr. Phon. Sci. 1, 233.
Koopmans-van Beinum, F. J. (1980). "Vowel Contrast Reduction: An Acoustic and Perceptual Study of Dutch Vowels in Various Speech Conditions" (Academische Pers B. V., Amsterdam, The Netherlands).
Kuehn, D., and Moll, K. (1976). "A cinefluorographic study of VC and CV articulatory velocities," J. Phon. 4, 303-320.
Kuwabara, H. (1985). "An approach to the normalization of coarticulation effects for vowels in connected speech," J. Acoust. Soc. Am. 77, 686-694.
Kuwabara, H., and Sakai, H. (1972). "Perception of vowels and CV-syllables segmented from connected speech," J. Acoust. Soc. Jpn. 29, 91-99.
Ladefoged, P. (1967). Three Areas of Experimental Phonetics (Oxford U. P., London).
Ladefoged, P., and Broadbent, D. (1957). "Information conveyed by vowels," J. Acoust. Soc. Am. 29, 98-104.
Lehiste, I. (1970). Suprasegmentals (MIT, Cambridge, MA).
Lehiste, I., and Meltzer, D. (1973). "Vowel and speaker identification in natural and synthetic speech," Lang. Speech 16, 356-364.
Lennig, M. (1978). Acoustic Measurement of Linguistic Change: the Modern Paris Vowel System, Pennsylvania Dissertation Series: Number one (U.S. Regional Survey, Philadelphia, PA).
Lieberman, P. (1967). Intonation, Perception and Language (MIT, Cambridge, MA).
Lieberman, P. (1984). The Biology and Evolution of Language (Harvard U. P., Cambridge, MA).
Lieberman, P., and Blumstein, S. (1988). Speech Physiology, Speech Perception and Acoustic Phonetics (Cambridge U. P., New York).
Lindblom, B. (1963). "Spectrographic study of vowel reduction," J. Acoust. Soc. Am. 35, 1773-1781.
Lindblom, B. (1984). "Economy of speech gestures," in The Production of Speech, edited by P. MacNeilage (Springer, New York), pp. 217-245.
Lindblom, B. (1987). "Phonetic invariance and the adaptive nature of speech," lecture presented at symposium on "Working Models of Human Perception," Eindhoven, August, 1987.
Lindblom, B. (1988). "The status of phonetic gestures," paper presented at symposium in honor of A. M. Liberman and to appear in Modularity and the Motor Theory of Speech Perception, edited by I. Mattingly and M. Studdert-Kennedy (Erlbaum, Hillsdale, NJ).
Lindblom, B., and Moon, S-J. (1988). "Formant undershoot in clear and citation form speech," paper presented at Stockholm University symposium in May 1988 on "Distinctive Features", and to appear in Perilus VIII, Department of Linguistics, Stockholm University, Sweden.
Lindblom, B., and Studdert-Kennedy, M. (1967). "On the role of formant transitions in vowel recognition," J. Acoust. Soc. Am. 42, 830-843.
Lindquist, J., and Pauli, S. (1968). "The role of relative spectrum levels in vowel perception," Proceedings of the 6th International Congress on Acoustics (Elsevier, Amsterdam, The Netherlands); also in Speech Transmis. Lab. Q. Prog. Stat. Rep. KTH, Stockholm 2-3, 12-15.
Macchi, M. J. (1980). "Identification of vowels spoken in isolation versus vowels spoken in consonantal context," J. Acoust. Soc. Am. 68, 1636-1642.
Nearey, T., Hogan, J., and Rozsypal, A. (1979). "Speech signals, cues and features," in Perspectives in Experimental Linguistics, edited by G. Prideaux (Benjamin, Amsterdam, The Netherlands).
Nordström, P.-E., and Lindblom, B. (1975). "A normalization procedure for vowel formant data," Proceedings of the 8th International Congress of Phonetic Sciences, Leeds, England.
Nordström, P.-E. (1975). "Attempts to simulate female and infant vocal tracts from male area functions," Speech Transmis. Lab. Q. Prog. Stat. Rep. (KTH, Stockholm) 2-3, 20-33.
Peterson, G., and Barney, H. (1952). "Control methods used in a study of vowels," J. Acoust. Soc. Am. 24, 175-184.
Peterson, G., and Lehiste, I. (1960). "Duration of syllable nuclei in English," J. Acoust. Soc. Am. 32, 693-703.
Peterson, G. (1961). "Parameters of vowel quality," J. Speech Hear. Res. 4, 10-29.
Pols, L. (1977). Spectral Analysis and Identification of Dutch Vowels in Monosyllabic Words (Academische Pers B.V., Amsterdam, The Netherlands).
Pols, L., van der Kamp, L., and Plomp, R. (1969). "Perceptual and physical space of vowel sounds," J. Acoust. Soc. Am. 46, 458-467.
Port, R. (1981). "Linguistic timing factors in combination," J. Acoust. Soc. Am. 69, 262-274.
Rakerd, B., Verbrugge, R. R., and Shankweiler, D. P. (1984). "Monitoring for vowels in isolation and in consonantal context," J. Acoust. Soc. Am. 76, 27-31.
Remez, R., Rubin, P., Nygaard, L., and Howell, W. (1987). "Perceptual normalization of vowels produced by sinusoidal voices," J. Exp. Psychol. Hum. Percept. Perf. 13, 40-61.
Ryalls, J., and Lieberman, P. (1982). "Fundamental frequency and vowel perception," J. Acoust. Soc. Am. 72, 1631-1634.
Sato, S., Yokuta, M., and Kasuga, H. (1982). "Statistical relationships among the first three formant frequencies in continuous speech," Phonetica 39, 36-46.
Mermelstein, P. (1978). "Difference limens for formant frequencies of steady-state and consonant-bound vowels," J. Acoust. Soc. Am. 63, 572-580.
Stevens, K. N., and House, A. S. (1963). "Perturbation of vowel articulations by consonantal context: An acoustical study," J. Speech Hear. Res. 6, 111-128.
Miller, J. D. (1984). "Auditory perceptual correlates of the vowel," J. Acoust. Soc. Am. Suppl. 1 76, S79(A).
Miller, J. D. (1989). "Auditory-perceptual interpretation of the vowel," J. Acoust. Soc. Am. 85, 2114-2134.
Miller, J. L. (1981). "Effects of speaking rate on segmental distinctions," in Perspectives on the Study of Speech, edited by P. Eimas and J. L. Miller (Erlbaum, Hillsdale, NJ), pp. 39-74.
Miller, J. L. (1987). "Rate-dependent processing in speech perception," in Progress in the Psychology of Language, edited by A. Ellis (Erlbaum, Amsterdam, The Netherlands), Vol. 3, pp. 119-157.
Miller, R. L. (1953). "Auditory tests with synthetic vowels," J. Acoust. Soc. Am. 25, 114-121.
Moore, B., and Glasberg, B. (1983). "Suggested formulae for calculating auditory-filter bandwidths and excitation patterns," J. Acoust. Soc. Am. 74, 750-753.
Morrow, C. (1971). "Speech in deep-submergence atmospheres," J. Acoust. Soc. Am. 50, 715-728.
Mullennix, J., Pisoni, D., and Martin, C. S. (1989). "Some effects of talker variability on spoken word recognition," J. Acoust. Soc. Am. 85, 365-378.
Myers, J. (1979). Fundamentals of Experimental Design (Allyn and Bacon, Boston, MA).
Nearey, T., and Andruski, J. (1988). "Modeling listeners' perception of 'silent center' syllables," J. Acoust. Soc. Am. Suppl. 1 83, S83.
Nearey, T., and Assmann, P. (1986). "Modeling the role of inherent spectral change in vowel identification," J. Acoust. Soc. Am. 80, 1297-1308.
Nearey, T., and Shammass, S. (1987). "Formant transitions as partly distinctive invariants in the identification of voiced stops," Can. Acoust. 18, 17-24.
Nearey, T. (1978). Phonetic Feature Systems for Vowels (Indiana University Linguistics Club, Bloomington, IN).
Nearey, T. (1983). "Vowel-space normalization procedures and phone-preserving transformations of synthetic vowels," J. Acoust. Soc. Am. Suppl. 1 74, S17.
Stevens, S., and Volkmann, J. (1940). "The relation of pitch to frequency: A revised scale," Am. J. Psychol. 53, 329-353.
Strange, W., Verbrugge, R., Shankweiler, D., and Edman, T. (1976). "Consonant environment specifies vowel identity," J. Acoust. Soc. Am. 60, 213-224.
Strange, W., and Gottfried, T. (1980). "Task variables in the study of vowel perception," J. Acoust. Soc. Am. 68, 1622-1625.
Strange, W., Jenkins, J. J., and Johnson, T. L. (1983). "Dynamic specification of coarticulated vowels," J. Acoust. Soc. Am. 74, 695-705.
Summerfield, Q., and Haggard, M. (1975). "Vocal tract normalization as demonstrated by reaction times," in Auditory Analysis and Perception of Speech, edited by G. Fant and M. Tatham (Academic, London), pp. 115-141.
Suomi, K. (1984). "On talker and phoneme information conveyed by vowels: a whole spectrum approach to the normalization problem," Speech Commun. 3, 199-209.
Syrdal, A. K. (1984). "Aspects of a model of the auditory representation of American English vowels," Speech Comm. 4, 121-135.
Syrdal, A., and Steele, S. (1985). "Vowel F1 as a function of speaker fundamental frequency," J. Acoust. Soc. Am. Suppl. 1 78, S56.
Syrdal, A., and Gopal, H. (1986). "A perceptual model of vowel recognition based on the auditory representation of American English vowels," J. Acoust. Soc. Am. 79, 1086-1100.
Traunmüller, H. (1981). "Perceptual dimension of openness in vowels," J. Acoust. Soc. Am. 69, 1465-1475.
Verbrugge, R., Strange, W., Shankweiler, D. P., and Edman, T. R. (1976). "What information allows a speaker to map a talker's vowel space?," J. Acoust. Soc. Am. 60, 198-212.
Williams, D. (1987). "Judgments of coarticulated vowels are based on dynamic information," J. Acoust. Soc. Am. Suppl. 1 81, S17.
Zwicker, E., and Terhardt, E. (1980). "Analytical expressions for critical band rate and critical bandwidth as a function of frequency," J. Acoust. Soc. Am. 68, 1523-1525.
J. Acoust. Soc. Am., Vol. 85, No. 5, May 1989
Terrance M. Nearey: Vowel perception