Vocal Quality Factors: Analysis, Synthesis, and Perception

Vocal quality factors: Analysis, synthesis, and perception
D.G.
Childers
DepartmentofElectricalEngineering,University
ofFlorida;Gainesville,
Florida32611-2024
C.K.
Lee
Department
ofElectrieal
Engineering,
TatungInstituteof Technology,
Taipei,Taiwan,Republic
ofChina
(Received28 September
1990;accepted
forpublication
25July1991)
The purposeof thisstudywasto examineseveralfactorsof vocalqualitythat mightbeaffected
by changesin vocalfold vibratorypatterns.Four voicetypeswereexamined:modal,vocalfry,
falsetto,andbreathy.Threecategories
of analysistechniques
weredeveloped
to extractsourcerelatedfeaturesfrom speechandelectroglottographic
(EGG) signals.Four factorswerefound
to beimportantfor characterizing
theglottalexcitations
for thefourvoicetypes:theglottal
pulsewidth,the glottalpulseskewhess,
the abruptness
of glottalclosure,andthe turbulent
noisecomponent.
The significance
of thesefactorsfor voicesynthesis
wasstudiedanda new
voicesourcemodelthataccounted
for certainphysiological
aspects
of vocalfoldmotionwas
developedand testedusingspeechsynthesis.
Perceptuallisteningtestswereconductedto
evaluatethe auditoryeffectsof the sourcemodelparameters
uponsynthesized
speech.The
effectsof thespectralslopeof the sourceexcitation,the shapeof the glottalexcitationpulse,
and the characteristics
of the turbulentnoisesourcewereconsidered.
Applicationsfor these
research
resultsincludesynthesis
of naturalsounding
speech,
synthesis
andmodelingof vocal
disorders,
andthedevelopment
of speakerindependent
(or adaptive)speech
recognition
systems.
PACS numbers:43.70.Aj, 43.70.Gr, 43.71.Bp
INTRODUCTION
The qualityof voicemaybereferredto asthe totalauditory impression
the listenerexperiences
upon hearingthe
speechof anothertalker.Thereis no generallyaccepteddefinition of vocalquality and the term hasbeenusedin differentcontexts,
e.g.,a phonetician
mightusequalityin thecontext of articulatorydifferences;
a singermightreferto the
quality of vocal registers,which are relatedto vocal fold
vibration;andqualitymightbe usedto describevoicetypes
suchasbreathy,hoarse,or harsh.In this study,we investigatedaspects
of vocalqualityrelatedto vocalfold vibratory
patterns,i.e., laryngealvocalquality.We did not address
two othercommoncharacteristics
of vocalquality,namely,
harshness,and whisper. We eliminated harshnessand
whisperfrom our studybecauseharshness
involvedlarge
variationsfromcycle-to-cycle
in thevocalfoldvibratorypatterns (Coleman, 1960; Moore, 1975; Wendahl, 1963; Wen-
dahl, 1966) andwhisperis characterizedby little or no vocal
fold vibratorymotion.To help us achieveour goal,we reviewedvocalquality with respectto its physiological,
perceptual,and acousticalaspectsfor the four voicetypesof
modal,vocalfry, falsetto,and breathy.
A. Physiological characteristics
Vocal fold length and thicknesshave beenshownto affectthethreevoiceregisters
of modal,vocalfry, andfalsetto.
Our purposewas to improveexistingor developnew
For modal voicevocalfold lengthmay be consideredto be
speech
andelectroglottographic
(EGG) analysis
techniques mediumwith the lengthincreasingasfundamentalfrequento assistthe assessment
of vocalqualityand to designnew cy (F0) increases(Damste et al., 1968;Hollien, 1974). For
voice sourcemodelsfor synthesizingnatural sounding vocalfry andfalsettothevocalfoldlengthisshortandlong,
speechwith a selectablevocalquality. We investigatedthe
respectively
(Boone,1971;Hollien, 1974;Ladefoged,1975).
nature and extent of the glottal excitation variations of four
Vocal fold thicknessvarieswith F0, beingmediumfor modal
voice types,namely breathy voice and the three registers, voice,while the vocalfoldsbecomethick for vocalfry and
modal, vocal fry, and falsetto. Our analysisof resultsled to
thin for falsetto (Hollien et al., 1968; Hollien and Colton,
thedevelopment
of a newsourcemodelfor speechsynthesis 1969;Boone,1971;Allen andHollien, 1973;Hollien, 1974).
that containssomefactorsthat appearperceptuallyimporSomecharacteristics
of the vocalfold vibratorypattern
tant for characterizing
vocalquality.The knowledgegained arealsoknown.For modalvoicethevocalfoldshavea speed
from this researchmay improvethe developmentof natural
quotientgreaterthanonewith theglottishavinga longopensoundingspeechsynthesizers
and assistthe advancementof
ing phaseand a rapid closingphase(Hollien, 1974;Monsen
speaker-independent
speechrecognitionsystems.
and Engebretson,1977). Vocal fry ischaracterizedby a glotLaver and Hanson ( 1981) havedefinedsix major types tal area functionthat hassharp,shortpulsesfollowedby a
ofphonation, modal voice,vocalfry, falsetto,breathyvoice,
long closedglottal interval.The glottal openingphasemay
loudness, and resonance.
2394
J. Acoust.Soc. Am.90 (5), November1991
0001-4966/91/112394-17500.80
¸ 1991 AcousticalSocietyof America
2394
haveone,two,or threeopening/closing
pulses(Moore and
yon Leden, 1958;Timeke etal., 1959;Hollien, 1974;Monsen
and Engebretson,1977;Hollien etaL, 1977;Whiteheadet
al., 1984). Falsettovoicehasgradualglottalopeningand
closingphaseswith a shortor no closedphase(Hollien,
1974; Monsen and Engebretson,1977; Kitzing, 1982).
Breathyvoicemay havea slightvibratoryexcursionof the
vocal folds with incompleteglottal closure(Boone, 1971;
Hollien, 1974;Ladefoged,1975).
B. Perceptual characteristics
Generally, the pitch for modal, vocal fry, falsetto,and
breathy voice is characterizedas medium, low, high, and
wide ranging,respectively(Hollien and Michel, 1968;Colton, 1969;Boone, 1971;Colton and Hollien, 1973; Hollien,
1974;Laver, 1980). Loudnessfor modalvoicecan vary over
a wide rangewhile vocalfry, falsetto,and breathyvoiceare
typicallysoft (Boone, 1971;Hollien, 1974;Laver, 1980).
The quality of thesefour voicetypeshasbeendescribedas
normal (modal), a low pitch, rough-sounding
voice (vocal
fry), a flutelike tone that is sometimesbreathy (falsetto),
and a friction noiselikesound (breathy) (Wendahl etal.,
1963;Boone,1971;Coltonand Hollien, 1973;Hollien, 1974;
Laver, 1980).
C. Acoustical
characteristics
TheF0 ofmodalvoicesrangesover94-287 Hz for males
and 144 to 538 Hz for females,while the F0 rangefor vocal
fry is 24-52 Hz for malesand 18-46 Hz for females;the
falsettoF0 rangefor malesis 275-634 Hz and495-1131 Hz
for females,whilebreathyisdescribedaswideranging(Hollien and Michel, 1968; Colton, 1969; Boone, 1971; Hollien,
1974). Vocal intensity characteristicsare lessspecificand
have beendescribedas wide ranging,low, low to medium,
and low for modal, vocal fry, falsettoand breathy,respectively (Boone, 1971; Colton, 1973a;Hollien, 1974; Laver,
1980). Likewise,the sourcespectralslope(tilt) for these
samevoicetypeshasbeendescribedas mediumbut F0 dependent,relativelyflat, steep ( -20 dB/oct), and steep
(Colton, 1973b; Hollien, 1974; Monsen and Engebretson,
1977; Hurme and Sonninen, 1985). Turbulent noise has
beenassociatedwith only modal (low) and breathy (high)
voices(Yanagihara,1967;Isshikietal., 1978;Hiraokaetal.,
1984). Pitch perturbationis generallylow for modal voice,
highfor vocalfry andbreathyvoiceswhileamplitudeperturbationis low for modalandhighfor breathyvoices(Monsen
and Engebrctson,1977; Deal and Emanual, 1978; Davis,
1979;Horii, 1980;Heibergerand Horii, 1982;Hiller etal.,
1983; Hirano etal., 1985; Askenfelt and Hammarburg,
1986;Kasuyaetal., 1986;Wolfe and Steinfatt,1987).
Many moredetailsare availableon the four voicetypes
we studied. We selected and summarized those features we
thoughtwouldbeanalyzablefrom thespeechandEGG signalsandthat mightbeusefulin designing
a sourceexcitation
modelfor synthesizing
natural soundingspeechand vocal
disorders.Recently,severalissuesin synthesizingnatural
soundingvoiceshavebeenconsidered,
includingthe design
of sourceexcitationmodels (Fant, 1979; Fant et al., 1985;
Klatt, 1987;Lee and Childers, 1989;Klatt and Klatt, 1990;
Childers and Wu, 1990; Childers and Wu, 1991; Wu and
Childers, 1991). We will discusstheseissueslater when we
presentour sourceexcitationmodel.This studyconsidered
two typesof experiments:(1) voicequality factorsthat
mightbe influenced
by changes
in vocalfold vibratorypatternsand (2) the perceptualvalidationof the effectsof
acoustic
parametervariationsonthequalityof thesynthesis
of a particularvoicetype.
I. EXPERIMENTAL
A summaryof our researchschemeappearsin Fig. 1.
The speech
andelectroglottographic
dataweredigitizedsimultaneously
usingDigital SoundCorp. DSC-240preamplifiersanda DSC-200digitizer.We sampled
eachsignalat
10kHz with 16-bitsprecision.The microphone
wasan Electro-Voice RE-10 held six inchesfrom the lips. The EGG
devicewas a model from Synchrovoice,Inc. All data were
collectedin a professional
IAC singlewall soundbooth.A
Digital EquipmentCorp. VAX 11/750 computersystem
managedthe datacollection.
The subjectsfor this studyconsisted
of 23 (8 male, 15
female)patientswith a vocaldisorderor pathologyand 52
(27 male,25 female) subjectswith a normallarynx.We denotethepatientsasPn,wheren goesfrom I to 23,whilethe
normalsubjects
aredenotedasNn wheren goesfrom 1to 52.
Thesubject's
agesrangedfrom20 to 80yearsold.Thecompletespeechprotocolconsisted
of 27 tasks,includingten
sustainedvowels/i, l, •, •e, o, a, u, u, ^, e/, two sustained
diphthongs/ou,
el/, fivesustained
unvoiced
fricatives/h,f,
O,s,J'/, andfoursustained
voicedfricatives/v,8, z, 3/. The
subjects
wereinstructed
to pronounce
andsustaineachvowel asit wouldbe pronounced
in thefollowingwords,respec-
tively:beet,bit,bet,bat,Bob,bought,
book,boot,but,Bert.
Similar instructionswere given for the diphthongs,for
which the cue words were boat and bait, while for the frica-
tiveswe usedthe followingcuewords:h_at,fix, thick,sat,
ß
VOICE
TYPE
'•
ß Modal
ß
Speech
WAVEFORM
[
VOICE
ESTIMATION
•
SYNTHESIS
SSOURCE
r
ß Vocal Fry
ß Breethy
EXTI•,CTION
J
vatlou• voice qualities
EGG features for different types
of phonatlon
ß
Ac½:•.sstlc
effects of source features
F•ATURB
J•ii•00
WAVEFORM
ANALYSIS
2395
Source features that characterize
ß
PERCEPTUAL
EVALUATION
Falsetto
PROCEDURES
d. A½oust.Soo. Am., Vol. 90, No. 5, November 1991
FIG. 1. Black diagram of basic research scheme.
Source model for natural sounding
voice synthesis
Rulesfor syntheslzalngspecificvoice
qualifies
D.G. Childorsand C. K. Lee: Analysis of vocal quality
2395
LP Residue[
ship,van,this,zoo,andazure.Thedurationof eachvowel,
diphthong,and fricative approximated2 s. The additional
tasksincludedcountingfrom oneto ten with a comfortable
pitchandloudness,countingfrom oneto fivewith a progressive increasein loudness,singingthe musicalscaleusing
"la," and speakingthree sentences.(We were away a year
ago.Early onemorninga man and a woman ambledalonga
onemile lane. Shouldwe chasethosecowboys.9)
This completeprotocolhasprovideda databasefor severalstudieson
voicequalityand speechsynthesis,
includingthis one.For
this study,we analyzedonly data for the two vowels/i/and
Error q• (n)
s(n)_
Fix•-Frame
V•(z)
LPAnalysis
Pul•
and
Determine
Aflaly•is
Frame
for
2fid-PassLP Analysis
I
FIG. 3. Blockdiagramof thetwo-pass
methodforglottalinverse
filtering
analysis.
/o/, the three sentences,and selecteddata records from the
counting task. The two vowelswere selectedbecausetwo
recentstudieshadfoundacousticcorrelates
of vocalquality
and vocal disordersusingthesetwo vowels(Prosek et al.,
1987;Eskenaziet el., 1990). We analyzedthe sentences
and
the countingtaskto give us someindicationof the glottal
factors in a word and sentence context.
II. ANALYSIS
FOR EXTRACTING
SOURCE
FEATURES
A. Inverse filtering
Figure 2 illustratesthat the peaksin the linear prediction (LP) error functionoccurnearly simultaneously
with
the negativepeaksof the differentiatedEGG (DEGG) signal, which correspondto the instantsof glottal closure
(Childerset el., 1983;Childersand Krishnamurthy,1985;
Childers et el., 1990). From these observations,we devel-
opedthe "two-passmethod"for accurate,automaticglottal
inversefiltering, which works as well as the two-channel
(speechandEGG) method(KrishnamurthyandChilders,
volume-velocity
waveform.As with theEGG signal,theestimatedLP pulsesmay not providean exactindicationfor
theinstantof glottalclosure.The keyfeatureof thetwo-pass
methodis that it ensuresthe exclusionof the main pulseof
theLP errorsignalfromtheanalysisinterval.Thistailoring
of theanalysis
intervalincreases
theaccuracy
of theLP analysis.For a definitionof the signalprocessing
termsusedin
this paper, the readeris referredto Rabiner and Schafer
(1978).
A block diagramof the two-passmethodis shownin
Fig. 3. During the first pass,a pitch-asynchronous
(fixed
frame) LP analysis
is performed
on theinputspeech
signal
s(n ). The estimatedLP filter, V, (z), is usedto derivethe LP
error signal,q• (n), by inversefiltering.For a voicedspeech
signal,the LP errorfunctionischaracterized
by a pulsetrain
with the appropriatepitch period.The locationsof these
pulses
aredetected
bya peak-picking
methodandareusedas
indicators
of
glottal
closure.
In
the
second
passof theproce1986).
dure,
a
pitch-synchronous
covariance
LP
analysis
is usedto
The two-passmethodfirstidentifiesthe locationsof the
estimate
an
improved
LP
filter,
V
2
(z).
For
each
pitch
perimain pulsesof the LP error signalderivedduringthe first
od,
the
criterion
for
determining
the
analysis
interval
is to
passof theinversefilteringprocedure.
Thesemainpulsesare
pick
the
samples
starting
one
point
after
the
instant
of
the
then usedas indicatorsof glottalclosure,and a "pseudomain
pulse.
The
formant
resonances
of
the
vocal
tract
are
closedphase"is selected
astheanalysisintervalfor a pitchsynchronouscovarianceLP analysisto estimatethe vocal
tractfilter,whichin turn isusedto obtainthedesiredglottal
estimatedby solvingthe rootsof the LP polynomial.The
formantstructureis thenshapedby empiricalrules,which
include:( 1) discarding
therootswithcenterfrequencies
below 250 Hz, (2) discardingthe roots with bandwidths
greaterthan 500 Hz, and (3) mergingtwo adjacentroots.
The refined ferment resonances are then used to construct
the vocal tract transfer function, which is usedin the final
(second-pass)
glottalinversefilteringprocedure.The direct
outputof the glottalinversefilteringoperationis a differen-
tial glottalvolume-velocity
u• (n) (i.e.,theequivalent
drivDEGG-BASED
CLOSED PHASE AND
ANALYSIS INTERVAL FOR THE TWO-
DEGG
ing functionto the vocaltract filter), which represents
the
combinedeffectof the lip radiationand the glottalvolumevelocity.A glottalvolume-velocity
waveform,u• (n), isderived by integration.The validity of the two-passmethod
wasverifiedby testingit with syntheticspeechsignals(Lee,
1988).The syntheticspeechsignalswereproducedby a cascadeformantsynthesizer(Klatt, 1980) excitedby stylized
MAIN
EXCITATION
glottalpulsesgeneratedby the LF (Liljencrantsand rant)
model (rant etal., 1985). A typicalresultappearsin Fig. 4.
PASS METHOD (APPROXIMATELY
35% OF THE PITCH PERIOD)
•
LPRESIDUAL
ERROR
•
B. Source
FIG. 2. SynchronizedEGG, DEGG, and LP residualerror.
2396
J. Acoust.Sec. Am., Vol.90, No. 5, November1991
features
Usingthe inversefilteredwaveforms(glottalpulses)as
the excitationfor voicetypes,modal,vocalfry, falsetto,and
D.G. Childorsand C. K. Leo:Analysisof vocalquality
2396
SPEECH WAVEFORM
gan(1957) gavean averagevalueof -- 12dB/octavefor the
slopeof the glottalspectra.This leadto the commonlyacceptedtwo-polemodelfor approximating
thegeneralspectral trendof a normalglottalvolumeflow:
Ug(z)=K/(1 - zoz- •)(1 - zbz- •),
(1)
whereK is a constantrelatedto theamplitudeof theglottal
flow andzo,zb are real polesinsidethe unit circle.A pointzo
LP RESIDUAL
ERROR
INTEGRAL OF ORIGINAL EXCITATION
is saidto bea poleof a functionUg(z), if Ug (z•) equals
WAVEFORM
infinity.The numberof polesa functioncontainsdetermines
the slopeof the spectrumof the functionas the frequency
increases.
A singlepole producesa spectralslopeof - 6
dB/octave,two poles -- 12dB/octave,andsoon. This discussionappliesprovidedthereare no zerosof the function,
for whichthereare nonein the all polemodelswe consider
here.
ORIGINAL EXCITATION WAVEFORM
INTEGRALOF ESTIMATEDEXCITATIONWAVEFORM
ESTIMATED EXCITATION WAVEFORM
Figure5 showstheFourierspectraandthecorrespondingtwo-poleapproximations
for theglottalpulses
of various
voicetypes.The resultsshowthat the two-polemodelis appropriatefor modalphonations
and is reasonably
goodfor
vocalfry exceptfor a smalllow-frequency
interval(below
the third harmonic). However, the two-polemodel is not
adequatefor falsettoand breathyphonations,
whichhave
spectralroll-offrateshigherthan 12 dB/octave.For these
latter two typesof phonation,we found that a three-pole
modelprovidesa betterfit to the data, e.g.,
Ug(z)=K/(1-z•z
FIG. 4. An exampleof two-pass
glottalinverse
filtering.
breathy,we measuredthe followingfeatures:( 1) instantof
maximumclosingslopeofthe glottalpulse,(2) glottalpulse
width, and (3) glottal pulseskewness(ratio of durationof
glottal openingphaseto durationof glottal closingphase).
For modalandvocalfry phonations,theinstantof the maximum closingslopeoccursnearthe instantof glottalclosure,
resultingin an abrupt terminationof the glottal airflow. For
falsettoandbreathyphonations,the instantof themaximum
closingslopeoccursnear the middleof the glottal closing
phase,followedby a residualphaseof progressive
closure.
The glottalpulsewidthsweremoderate(65%-70% of
the pitch period) for modal phonationsand small (25%45% of the pitchperiod)for vocalfry. Falsettoandbreathy
voiceshadlargepulsewidths(90%-100% of the pitchperiod), oftenmakingit appeartherewasnoclosedphase.Glottal pulseskewness
alsovariedwith voicetype. The ranking
of voicetypeaccordingto decreasing
skewness
wasvocalfry,
modal, falsetto,and breathy.
C. Glottal spectral characteristics
The literature (e.g., Monsen and Engebretson,1977;
Pinto et al., 1989) suggestedthat the glottal waveformsof
differentvoicetypescould be distinguished
by two factors:
(1) the generalspectralslope (tilt or.trend), and (2) the
relationshipbetweenthe intensity of the fundamentalfrequencyand its harmonics.For normal phonations,Flana2397
J. Acoust.Soc. Am., Vol. 90, No. 5, November 1991
•)(l-zl, z-•)(l-zcz
•),
(2)
wherezcisthethird realpoleinsidetheunitcircle.For both
the two- and three-polemodels,we foundthat we couldset
z• equalto unity and then calculatezb and z• from the
preemphasized
glottal waveformusinglinear prediction
analysisof thewaveformestimated
by inversefiltering.
Table I liststypicalvaluesof zb andzc for the inverse
filteredglottal wavesfor the four voicetypesfor selected
patientsandnormalsubjects.
The resultsconfirmedthatthe
generalspectralslope(tilt) of a modalor vocalfry phonationcanbemodeledby two realpoles.On theotherhand,to
simulatethespectralslopeof a falsettoor breathyphonation,
one extra real pole was requiredto accountfor its steeper
roll-offrate.Figure6 showsthe improvedspectralmatching
usingthethree-pole
modelforthefalsetto
andbreathyphonations.
Our two- andthree-polemodelsapproximate
the highfrequencyspectralslopeof a glottalpulsebetterthan the
low-frequency
characteristics.
An explanationfor thisisthat
the glottal pulsemustbe of finite duration.Therefore,an
exactmodelwould be a finite impulseresponsefilter, and
hence,would containonly zeros.Our resultsshowedthat the
useof an all-polemodelwill causemismatchof the model
spectrumwith the dataat low frequencies
(seeFigs.5 and
6). We reasonedthat the low-frequencymismatchwould
notbeasimportantperceptually
asthehigh-frequency
characteristics
providedthe harmonicsin the speechsignalwere
reasonablywell matchedby the model. This point was to be
examinedusinga perceptualevaluationof synthesizedtokens.
The spectraltilt or slopeof the glottalpulsecanbe measuredwithout estimatingthe waveshapeof the glottal pulse.
D.G. Childersand C. K. Lee: Analysisof vocal quality
2397
The generalspectralslopefor a voicephonationis determinedby the combinedcontributionof the spectraof the
glottalpulseandthelip radiation.
Thisspectral
slopemay
oftenbeapproximated
asa two-polespectrum,
i.e.,tworeal
TABLE I. Theestimated
realpolesfor thelow-pass
filtermodeloftheglottal waveforms
fora fewtypicalmalesubjects
(N = normalsubject;
P = disorderedsubject}.
Subject
Voicetype
F0 (Hz)
N 11
N24
vocalfry/a/
vocalfry/i/
N20
slightvocalf•/i/
N3
NI4
N27
modal/o/
modal/i/
modal/i/
N7
NI9
falsetto/a/
falsetto/a/
P8
PI3
breathy/i/
breathy/i/
zb
zc
46
45
0.83
0.81
'"
---
90
0.83
---
155
106
126
0.92
0.92
0.96
'"
-----
210
312
1.00
1.00
0.60
0.73
137
200
1.00
1.00
0.25
0.87
polesinsidetheunitcircle.Thesepolesareestimated
using
LP analysis
of preemphasized
speech.
The resultsobtained
are consistentwith thoseobtainedusing inversefiltered
6O
Nil, vocalfry
speech.For falsettoand breathyvoices,one pole is on the
unit circle(or nearlyso) andthe othervariesroughlyfrom
0.2 to 0.9. The polevaluesare, however,affectedby the formantsof the phonation.For example,the locationof the
poleswithin the unit circle for the two pole model varied
dependingon the type of vowel (/i/or/o/)
for a given
speakeranda givenvoicetype.The poleswereusuallycloser
to the originfor/i/than for/o/.
O. Fundamental frequency and harmonics
The harmonicsin the speechsignalbelowthe first formant are often consideredimportantfor the perceptionof
(D
'
'
NI9,
falsetto
5O
P8, breathy
].3O
P8, breathy
133
9O
60
spectra
andsuperimposed
three-pole
modelapproxiFIG. 5. Glottalsource
spectra
andsuperimposed
two-pole
modelapproxi- FIG. 6.Glottalsource
andbreathyvoicetypesforthesamecorresponding
submationformodal,vocalfry,falsetto,
andbreathy
voicetypes.Thesubjects mationforfalsetto
are indicated in Table I.
2398
J. Acoust.
Sec.Am.,Vol.90, No.5, November
1991
jeersasin Fig. 5.
D.G. Childors
andC. K.Leo:Analysis
ofvocalquality
2398
vocalquality(Holmes,1973). Thisispresumably
dueto the
highenergyin theseharmonics.
We foundthattheglottal
spectraofdifferentvoicetypesshowed
distinctive
amplitude
relationships
betweenthefundamental
andhigherharmonics.Figure7 shows
theglottalspectra
forvarious
voicetypes,
whereHi is the ainplitudeof the ith harmonicandH• is the
amplitudeof the fundamentalfrequency.We foundthat vocal fry wasapproximatelycharacterizedby a high HRF (2.1
dB) followedby modal ( - 9.9 dB), breathy ( - 16.7dB),
and falsetto( - 19.1dB). Falsettoandbreathyvoiceshad a
displaying
onlythefirsttenharmonics.
Therearedistinctive highintensityfundamentalaswell.Thesevariationsin haramplituderelationships
betweenthe fundamentaland the monicrelationsappearto havelittle significance
for speech
higherharmonics
asidefrom thedifferences
in thespectral intelligibility (phonetic identifiability) but affect the perslopeasdiscussed
above.We defineda parametercalledthe ceivedvocalquality.
"harmonicrichnessfactor" (HRF) to ineasurethisrelationship
HRF- --,
•2H"
(3)
Hi
80
N14, modal
E. Interharmonic
noise
Turbulenceat the level of the glottis has beennoted to
contributeto the perceptualquality ofbreathy voice(Isshiki
et al., 1978;Yumoto etal., 1982; Hilhnan etal., 1983; Hiraokaet al., 1984). We modifiedthe frequency-domain
method ofHiraoka et al. (1984) to measurethe noise-to-harmon-
(dB)
o
8o
N11, vocalfry
ic ratio (NHR). Our methodusesan adaptiveprocedureto
estimateF0, which is crucialfor identifyingthe higherharmonics.Thiswasaccomplished
bya two-stepprocedure.
We
estimatethe pitch periodsof the speechsignalby usingthe
synchronousEGG signal.The accuracyof this initial estimationis restrictedby the signalsamplingrate (in our case,
10 kHz). For example,if the accuracyof F0 is within one
samplepoint, an F0 estimateof 100samplepointsmeansan
F0 of 100 q- 1 Hz and an F0 estimateof 50 samplepoints
mean an F0 of 200 q- 4 Hz. Such estimation errors can cause
ßa severeproblemin identifyinghigh-orderharmonics.For
example,a 4-Hz error of F0 couldleadto a 40-Hz shift in
locatingthe tenth harmonic.
An adaptiveF0 correctionprocedurewasusedto reduce
the
o
80
N19, falsetto
error
in
the
initial
F0
estimate.
We
defined
FO,, = FO _+nAF, wheren = 1,2..... 10,and AFis onetenth
of the maximum possibleerror of F0, as describedabove.
Basedon eachFO,,, the energyof the harmoniccomponents
over the frequencyrangeof 0 to 2 kHz are computed.The
valueFO,, givingthe inaximumharmonicenergyis selected
as the final estimate of F0.
This criterion
is based on our
experimentalresults,which showedthat over the low-fre-
quency
rangeof0 to 2 kHz, theharmonic
energy
isconsiderably higher than the interharmonicnoiseenergy.
OnceF0 isdetermined,theith harmonicamplitude,Hi,
o
80
P8, breathy
and the interharmonicnoise,N/, are computedin the frequencyregioniFO +__
FO/2, whereH/representsthe energy
in the subregioncenteredat iFO with a bandwidthof the
Hamming analysiswindow. The symbolN• representsthe
energyin the remainingfrequencyregion (Fig. 8). And the
noise-to-harmonicratio at the ith harmonicregion(NHR i )
is definedas:
NHRi = Ni/H i.
The distribution
(4)
of the interharxnonic
noise can be ob-
servedby plottingNHRi alongthe frequencyaxisasin Fig.
1
2
3
4
5
6
7
8
9
10
FIG. 7. The spectrum(dB) for the first ten harmonicsof the glottal waveformsfor modal,vocalfry, falsetto,and breathyvoicetypes.All dataarefor
vowel/i/except falsettowhich was/o/.
2399
d. Acoust.Sec. Am., Vol. 90, No. 5, November1991
9, where we seethat a voicejudged to be breathy has higher
interharmonic
noise above 2 kHz
than modal
or falsetto.
Based on this observation, we defined a noise-to-harmonic
ratio over a high-frequencyrangeto be an indicatorfor the
vocalqualityof breathiness,
namely,
D.G. Ohildersand C. K. Lee: Analysisof vocalquality
2399
Gandv•d•h
o•
high (4.0), medium (2.8), and low (1.8), respectively.We
wereunableto measurereliablythe WPF for breathyvoices.
Our resultsimply that vocalfry registeris characterizedby a
pulselikeexcitationwaveformwith a long glottal closure,
while falsettoregisteris characterizedby a shortglottalclo-
Hamming V•ndov
I
I
I
[
sure.
G. EGG waveform
features
The electroglottographic
signal(EGG) is generallybelievedto be representative
of the amountof the lateralcontact betweenthe vocal folds (Childers and Krishnamurthy,
1985; Childers et al., 1986; Childers et al., 1990; Titze,
1990). The EGG is usefulfor registeringglottaleventsduring vocal fold vibration.When the EGG waveformis arrangedsuchthat an upward deflectionreflectsthe opening
]
i'F0
I
(i,,1)F0
of the glottisand a downwarddeflectiondepictsthe closing
of the glottis,the sharpnegativepeaksin the differentiated
EGG (DEGG) waveformhavebeenfoundto be very close
FIG. 8. Frequencyintervalsfor harmonicandinterharmonicnoisecompoto the instantsof glottalclosure(if closureexists),and the
nents.
maximaof the DEGG are indicationsof the instantof glottal
openingasin Fig. 10 (Childersetal., 1990). In thisstudy,we
investigated
the EGG waveformfeaturesfor varioustypesof
phonation.
NHR/, =
,
(5)
Figure 11showstypicalEGG andDEGG waveformsof
modal,vocalfry, falsetto,andbreathyphonations.All of the
where (Ni) and (Hi) are the noiseand harmoniccompo- EGG waveformsexhibit a steeperslope (implying a rapid
nentsabove2 kHz, respectively.The overall NHR, which
changein vocalfold contactarea) duringthe glottalclosing
wasdefinedoverthe wholefrequencyrange(0-5 kHz), was
phasethan during the openingphase.This characteristic
alsocomputedfor comparison.
Table II liststhe analysisof
phenomenonof vocalfold vibrationresultsin a sharpnegaresultsfor threevoicetypes.The NHR h (noise-to-harmonic tive pulsein the DEGG waveformat or near the instantof
ratio above2 kHz) isa goodindicatorfor the existence
of the
maximumglottalclosure.Asidefrom thiscommonfeature,
vocalqualityofbreathiness.We wereunableto measurereliablyNHRh for vocalfry.
F. Temporal energy distribution
The temporalenergydistributionof a speechwaveform
is relatedto the glottalexcitationandhaslongbeenthought
to affectvocalquality.Wendahlet al. (1963) andColeman
(1963) established
that the perceptionof vocalfry is related
to the dampingof the speechsignalbetweenglottal excitations.To measurethe decaycharacteristicsof a speechwaveform during a singlepitch period, a parametercalled the
waveformpeakfactor (WPF) wasdefinedas
TABLE II. Noise-to-harmonic
ratiosfor severaltypicalsubjects
and three
voicetypes(N = normalsubject;P = disorderedsubject).
Hi-Freq.
NHR
Overall
NHR
Subject
Sex
Voice type
N12
N13
N28
M
M
M
modal/i/
modal/o/
modal/i/
- 5.1 dB
- 4.2
-
-- 4.2
-- 14.8
NI
M
modal/i/
-- 5.4
-- 14.8
N2
N30
N31
N32
N33
M
M
M
M
M
modal/o/
modal/i/
modal/u/
modal/i/
modal/a/
-- 5.6
-4.8
-- 6.2
-- 7.1
-- 5.1
-----
wherexi istheamplitudein theith samplepointandNis the
total numberof samplepointsin onepitchperiod.Theoreti-
N16
N17
M
M
falsetto/i/
falsetto/o/
-- 8.5
-- 8.2
-- 20.4
- 20.3
cally, the WPF has a minimum value of 1 when the wave-
N18
M
falsetto/i/
-- 9.9
-- 23.2
form is fiat, anda maximumvalueN •/2whenthe waveform
is an impulse.The WPF valueof a speechwaveformis related to the underlying glottal waveshape.For glottal waves
with narrow pulsesseparatedby a long glottalclosure,the
WPF valueis largeand for pulsesof longdurationthe WPF
is near unity.
Althoughthe averageWPF valuesfor sustainedvowels
vary somewhatwith the type of vowelfor the samesubject
and voicetype, a generalrule is that vocalfry, modal and
falsettoregistersare characterizedby WPF valuesthat are
N19
M
falsetto/a/
-
8.4
- 22.1
N4
N5
N6
N7
M
M
M
M
falsetto/i/
falsetto/o/
falsetto/i/
falsetto/o/
-----
4.3
3.9
6.1
3.3
----
21.7
20.1
22. I
18.5
N25
N29
P7
P12
P20
P21
M
M
M
M
F
F
breathy/i/
breathy/i/
breathy/i/
breathy/i/
breathy/i/
breathy/i/
-- 0.7
5.5
-- 0.4
0.2
5.9
6.2
-----
12.6
12.7
19.6
19.8
14.8
8.9
WPF
= peak
amplitude
= max(Ixil) (6)
rmsvalue
2400
[ (l/N) Z•=ix•]•/2'
d. Acoust.Soc. Am.,Vol.90, No. 5, November1991
13.8 dB
16.5
16.6
16.9
15.1
14.9
16.2
D.G. Childersand C. K. Lee:Analysisof vocalquality
2400
N12, modal
[2O
N16, falsetto
P12, breathy
•o
8O
7O
(dB)
(dS)
Noisa-to-aar•onic
0
I0•0
2000
•atio
(NRlq
i)
(dB)
4O
Noise-to-HarmonicR•tfo
40
Notse-to-ff&cmonteRatio (8HRX)
3000
FIG. 9. The spectraand noise-to-harmonic
ratio (NHR•) for threemalesubjectsindicatedin Table II for the vowel/i/for modal,falsetto,and breathyvoicetypes.
FIG. 10.Thespeech,
EGG, DEGG, andglottalareawaveforms
forthevowel/i/. Theglottalareawaveform
wasmeasured
fromultrahigh-speed
laryngeal
filmsand wasusedasthe basisfor validatingthe estimationof the opening,closingand closedglottalphasesfrom the EGG and DEGG waveforms.The
speech
signal
wasmonitored
simultaneously
withtheEGG.All waveforms
weresynchronized.
Theclosed
phase
wasmeasured
astheintervalontheDEGG
betweenthe negativeand followingpositivepeak.
distinctiveEGG and DEGG waveformpatterns(implying
differentvocalfoldcontactphenomena)
werefoundto characterizethe differenttypesof phonation.For modaland vocal fry phonationsthe instantof the maximumclosingslope
in the EGG waveform is closeto its minimum extension, and
thus resultsin a very narrow negativepulsein the DEGG
waveform.This impliesthat the instantof the DEGG negative peak is near the occurrenceof glottal closure,as ob-
Figure 11 indicatesthat the vocalfold contactvaries
duringglottalclosurefor the four voicetypes.We defined
parameters
of the EGG waveformto predictthesedifferent
glottalclosurephenomena
and foundwe couldusethese
featuresalongwith additionalfeaturessuchaspitch,double
opening/closing
patternof the volumevelocitywaveform,
andsinusoidal-like
patternfor falsettoto distinguish
thefour
voicetypes(Lee, 1988).In addition,weestimated
thepitch
servedby Childers et al. (1983) and Childers et al. (1990).
The DEGG waveforms for falsetto, on the other hand, show
period (PP) and the glottal openquotient (OQ) for various
muchwidernegativepulses,andthusdonotprovidereliable
indicationsfor glottal closure.The EGG waveformsfor
breathyvoicehavea rapid closingsegmentcloselyfollowed
by the nextopeningsegment,implyinga momentaryglottal
closureor eventhe absence
of glottalclosure.The vocalfry
EGG waveformalsoshowsthe doubleopening/closingpatternduringan individualglottalcycle,asobserved
by other
researchers(Moore and von Leden, 1958; Timcke et al.,
estimatedas the time durationbetweentwo successive
nega-
1959;Whiteheadet al., 1984;Klatt and Klatt, 1990).
in the DEGG
2402
J. Acoust.Sec.Am.,Vol.90, No.5, November1991
voicetypes (Childers et al., 1983). The pitch period was
tive peaksin the DEGG waveform.The openquotientwas
defined as
OQ = duration of the open phase/pitch period,
(7)
wherethe openphasewas estimatedby the time duration
betweena positivepeakandthe nextadjacentnegativepeak
waveform.
D.G. ChildersandC. K. Lee:Analysisof vocalquality
2402
N12, modal
•
formant synthesizers
have beenfound to producenatural
soundingspeechwhen the sourceexcitationwaveformattemptsto replicateboth temporaland spectralcharacteristicsof the human source.(SeeChildersand Wu, 1990,for a
•
__._A._
.•
DEGG review.)Holmes(1973) recommended
an excitationsignal
basedon the secondtime derivativeof a typicalglottalvolume-velocitywaveformfor theparallelformantsynthesizer.
N24, vocalf•,
DEGG
N18, falsetto
For a glottalpulsewitha - 12-dB/octspectral
slopesucha
differentiated
signalwouldgivean approximately
flat spectrum. But aswe haveshownin the previoussection,human
glottalexcitationcharacteristics
vary for differenttypesof
voiceproduction.To synthesize
or modelnaturalsounding
speech(for normalspeechor for a vocaldisorder)a more
sophisticated
approachis calledfor. Linearpredictivecoders are usingmore sophisticated
excitationwaveformsas
well (Schroeder and Atal, 1985; Trancoso and Atal, 1990).
But thesemodelsstill requireextensivecomputation,e.g.,
Schroeder
andAtal (1985) reportedthattheircodingprocedure took 125s ofCray-1 CPU time to process1 sof speech.
P13, breath},
A. Source
.•A.a_. . •
•
. t.t•. '
, ,•,.,,•.
, •
FIG. 11. The EGG and DEGG waveformsfor the vowel/i/for
vocalfry, falsetto,and breathyvoicetypes.
DEG.
G
modal,
The averagePP and OQ valuesestimatedfrom the
DEGG waveformsof variousvoicetypeswere generally
lowerthanthoseestimated
fromthe inversefilteredglottal
waves.This was particularlyso for falsettoand breathy
voices,
whichhaveprogressive
glottalclosures,
consequently, theEGG-based
technique
underestimated
theOQ values.
Nevertheless,
theresultsstillshowedthattherankingin orderof increasing
OQ valueswerevocalfry, modal,falsetto,
and breathyvoices,respectively.
Oneof the importantadvantages
of usingtheEGG isthatit canregisterthedynamic
characteristics of vocal fold movements of continuous
speech.
III. A NEW EXCITATION
MODEL
Theoverallspectral
envelope
forvoiced
speech
isdeterminedby threefactors(Fant, 1960;Flanagan,1972): (1)
the spectrumof the glottalexcitationpulse,(2) the vocal
tracttransferfunction,and (3) theradiationof thelipsand
nostrils.For thispaper,we arefocusing
on thespectrum
of
thesource.
In recentyearsresearchers
havecometo recognizetheimportance
of accurately
modelingthesourceexcitationifa formantsynthesizer
isto faithfullyreproduce
natural vocal characteristics(Rosenberg, 1971; Hedelin, 1984;
Fantetal., 1985;FujisakiandLjungquist,1986;Klatt, 1987;
Lee and Childers,1989,Pinto et al., 1989;Klatt and Klatt,
1990;Childers
andWu, 1990).Boththecascade
andparallel
2403
J.Acoust.
Soc.Am.,Vol.90,No.5,November
1991
excitation
classifications
Basically,researchers
have usedthree typesof excitationfor speechsynthesis,
especially
for formantsynthesizers
(Childersand Wu, 1990). Theseare ( 1) impulseexcitation
with a glottalshapingfilter (Flanagan,1957;Rabiner,1968;
Klatt, 1980; Klatt, 1987); (2) glottal waveformsobtained
by inversefiltering (Rosenberg,1971; Holmes, 1973) or
glottalareawaveforms(Yea et al., 1983); and (3) excitation
waveformmodels(Rosenberg,1971;Fant, 1979;Ananthapadmanabha,1984;Hedelin, 1984;Fant et al., 1985;Fujisakiand Ljungqvist,1986;Klatt, 1987;Childerset al., 1989;
Lee and Childers, 1989; Pinto e! al., 1989; Klatt and Klatt,
1990). For various reasons,excitation waveform modelsare
flexible,easyto use,and producenatural soundingspeech
(Childers and Wu, 1990).
The sourcemodel (1) aboveis simpleand can easily
yield a sourcespectrumwith - 12-dB/octslope(or any
other slopein incrementsof - 6 dB/octave). However, the
phaseand the spectralnotchesthat are oftencharacteristics
of natural voicingare not reproducedwell and the vocal
quality is poor (Childers and Wu, 1990). The fault with
model(2) isthat theglottalwaveformisdifficultto estimate
and generallyrequiresa microphonewith a high-quality,
low-frequencyresponse(Childersand Wu, 1990). Furthermore,suchan excitationsourceis not suitablefor applicationin text-to-speech
systems
that requireprestoredsource
parameters.
A well-designed
excitationwaveformmodelalongthe
linesof model(3) above,is easyto useandcapableof producingnaturalsoundingspeechasshownby theresearchers
citedabove.Klatt andKlatt (1990) haverecentlysuggested
onemodel.The LF modelhasalsofoundacceptance(Fant
et al., 1985).An advantage
of theLF modelisthatparameters of the model can be measuredusingthe speechand
EGG signals.For example,theLF model (Fant et al., 1985)
requiresfour modelparametersfor a differentialglottal
wave [Fig. 12(b) ]:
D.G.Childers
andC.K.Lee:Analysis
ofvocal
quality
2403
eta = 1-e -•'ø-"•
(11)
To relatethewaveshape
parameters
of theLF modelto measureddata,we expressed
the openquotient(OQ) andspeed
quotient(SQ) for thismodelas
OQ = openphase/pitchperiod= OQLv= (te + kt• )/to,
for LF model,
(12)
and
[al
SQ= opening
phase
u
closingphase
= SQ•v= tv/(te q-kta- tv), for LF model.
(13)
For the LF model,the instantof "glottal closure"wasdefinedastheinstantat whichtheglottalflowamplitudedrops
to 1% of itspeakvalue.Basedonthisdefinition,
thevalueof
k isa functionof theparameter
t•. Our datashowthatk has
valuesin therangeof 2.0 to 3.0 when0% < t• < 10% (k = 0
whenta = 0), whereta isrepresented
bya percentage
of the
pitch periodto.
B. Glottal factors for source modeling
As shownin the previoussection,four major factors
werefoundto beimportantfor characterizingdifferenttypes
[b]
FIG. 12.Theglottalwaveform,
u,•(t), isshownin (a), whiletheLiljencrantsandFant (LF) modelfor thedifferential
glonalwaveform,u'n(t),
appearsin (b).
of voiceproduction:namely,the glottal pulsewidth, the
glottalpulseskewness,
theabruptness
of glottalclosure,and
the turbulentnoisecomponent.Thesefactorsare important
in modelingthe sourceexcitationasshownbelow.Furthermore, thesefactors are related to characteristicsor features
of theglottalvolume-velocity
waveform,
u• (t), itsderivative,u•(t), theEGG,itsderivative,
DEGG,andthespeech
waveform.One characteristicis depictedin Fig. 13. The
main excitation of the vocal tract occurs at the instant of the
tp = glottalflow peakposition,
maximum
negative
peakof u•(t), whichcorresponds
to the
t,. = instant of maximum closing rate,
ta = time constantof an exponentialrecovery,i.e.,
return phase,from the point of maximum
closingdiscontinuitytowards the maximum
closure.
tersthe vocaltractdampingtrend.Althoughnot assignificantasthe mainexcitationpulse,whichoccurseverypitch
(8)
(9)
pulses.
ity in the model are
and
u•(t)= [u's(t•)/et
• ] (e-•t'-'")for to<t<t0,
glottalopening
phase,
a secondary
excitation
occurs
thatalperiod,the secondary
excitations
alsocontribute
to vocal
quality,e.g.,whensuchexcitations
are absentas in LPC
synthetic
speech,
the voicesounds
buzzy(Samburet al.,
1978).The durationof glottalclosureis alsorelatedto the
voicetype,e.g.,falsetto
andbreathyvoices
havea briefglottal closureor perhapsnoneat all. Vocalfry, on the other
hand,has a long closedphasebetweenglottalexcitation
The equations
governing
thederivativeof thevolumeveloc-
u•(t) = Ae•'sin(oagt),for O<t<te
instant of glottal closure.Followingglottal closure,the
speech
signaldecays
dueto vocaltractdamping.Duringthe
where
•%= rr/tpisthefrequency
ofthesinewave
inEq.(8)
andto isthepitchperiod.Parameters
a ande aredefined
for
Another featurethat characterizesthe four voicetypes,
whichwe examinedin the previoussection,is the "sharp-
ofthemainexcitation
pulsein u•(t), which,inturn,is
computational
use.The valueof a canbederivedby using ness"
caused
by
the
steepness
of
the
closingphaseof the volume
the four basicparameters
listedaboveand by solvingthe
velocity
and
by
the
abruptness
of glottalclosurein the volequation
ume-velocity
waveform.Thesetime domaincharacteristics
of thewaveformaffecttheslopeof thespectrum(Fantetal.,
•I'"
u•
(t)dt
=0.
(10)
Similarly,thevalueof • canbederivedby solvingtheequation
2404
J. Aooust.Sec.Am.,Vol.90, No.5, November1991
1985).
The extentof the glottalwaveshape
variationsis illustratedin Fig. 14, whichshowsinverse-filtered
differential
D.G. ChildersandC. K.Lee:Analysis
of vocalquality
2404
%(0
i
½losedl
opening
•
Iclosing
u'g(t)
NIl
secondary exc i tat Ions
/
'V
vocalfry
Speech vive fore
I
I
foreant
oscillation
FIG. 13.Threesynchronized
waveforms:
the glottalwaveform,its deriva-
tive,andthespeech
waveform.
Indicated
onthefigurearetheinstants
of
excitation
ofthevocaltract,theopening,
closed,
andclosing
glottalphases,
FIG. 14.The inversefiltereddifferentialglottalwaveformsand the superimposed (least-mean-square
error) waveformsproducedby the Liljencrantsand Fant (LF) modelfor selectedsubjects,voicetypes,and phonation for the subjectsindicatedin Table III.
andtheexponential
decayof thespeech
waveform.
glottalwaveformsfor thefour voicetypesweexamined,each
waveform
matched
with
an LF
model
waveform.
The
matchingLF model was derivedby measuringthe initial
estimates
oftheamplitude
parameter
u• (te) andthewaveshapeparameters
tp,t•, andt• in thecorresponding
pitch
periodof the differentialglottal waveform.Theseparameterswere then adjustedusinga least-mean-squared
error
criteriathat minimizedthe errorbetweenthe signalandthe
matching
waveform
• (Lee, 1988).Sometypicalmeasured
waveformparameters(representedin percentages
of the
pitchperiod)and the corresponding
factorsfor the glottal
pulsewidth (OQL•) and the glottalpulseskewness(SQLv)
aregivenin Table III. For the four voicetypeswe foundthat
theseglottalfactorsvariedwidely,e.g.,OQ• rangedfrom
0.26 to 1.0, SQL• from 1.3 to 3.6, and t• from 0.5% to
13.3%.
C. Turbulent noise component
When the glottishasan imperfectclosureand the airflow rate is high, turbulentairflow is produced.This phenomenonhasbeenexaminedby others(Flanaganand Ishizaka, 1976;Isshiki etal., 1978;Pinto etal., 1989;Childers
andDing, 1991). The soundpressureof the turbulentnoise
is approximatelyproportional to the squareof the volume
velocity of the airflow and inverselyproportionalto the
cross-sectionalarea of the constriction. Furthermore, the
of frequencies(2-8 kHz) with someaccentuationat about4
kHz. Our data for breathy voicesagreedwith theseresults,
showinga high interharmonicnoiselevel above2 kHz.
Although the existenceof high-frequencyturbulent
noisehasbeenshownto bean importantfeaturefor breathihess(Yanagihara, 1967;Yumoto etal., 1982;Hiraoka etal.,
1984; Hurme and Sonninen, 1985), the role of a turbulent
noisesourcein syntheticspeechhasonly recentlyreceived
TABLE III. Typicalwaveshape
parameters(basedon the LF model) for
glottalwaveforms
forthefourvoicetypesfora fewmalesubjects
(N = normalsubject;
P -- disordered
subject),to (pitchperiod) in numberofsample
points(sampling
period= 0.1ms).t•, t,,,andt• inpercentage
ofthepitch
period.
Subject Voicetype
N14
N27
N3
modal/i/
modal/i/
modal/o/
N23
Nll
slightvocalfry/o/
vocal fry/o/
NI9
N7
P8
P13
to
tp
t,,
t•
OQu- SQLr
94 49%
79 53%
65 51%
64%
71%
68%
2.1%
2.5%
1.5%
0.67
0.75
0.69
2.7
2.4
2.8
119 49%
220 20%
63%
25%
0.8%
0.5%
0.64
0.26
3.3
3.6
falsetto/o/
falsetto/o/
29 57%
47 62%
77%
89%
13.3%
4.3%
1.00
0.98
1.3
1.7
breathy/i/
breathy/i/
73 48%
50 58%
68%
84%
6.8% 0.81
10.0% 1.00
1.5
1,4
energyof theturbulentnoiseisdistributedovera widerange
2405
J. Acoust.Soc.Am.,Vol. 90, No. 5, November1991
D.G. Childorsand C. K. Lee:Analysisof vocalquality
2405
seriousconsideration(Pinto et al., 1989;Lee and Childers,
1989; Klatt and Klatt, 1990).
Waveform
M•I
D. Existing source models
I Tn I Dn
and noise component (lower trace).
To synthesizenatural-sounding
speechwith various
qualitycharacteristics,
a voicesourcemodelmusthavecontrollableparameters
that are importantfor perception.As
discussed
above,we consideredglottalpulsewidth, glottal
pulseskewhess,
the abruptness
of glottalclosure,anda turbulentnoisecomponentas importantfeaturesfor a source
excitationmodel.Severaltypicalglottal waveformmodels
and voice sourcemodelsalready exist (Rosenberg,1971;
Fant, 1979; Hedelin, 1984; Ananthapadmanabha,1984;
Fant et al., 1985;Fujisakiand Ljungqvist,1986;Klatt, 1987;
FIG. 15.Blockdiagramof thenewsource
modelandan example
of the
generatedexcitation wave.
Klatt and Klatt, 1990). All of these models allow variable
glottalpulsewidth and skewness.
Only onemodelusesa
turbulent noisecomponent(Klatt, 1987;Klatt and Klatt,
1990). But few detailsof the turbulentnoisegeneratorare
given.Three glottalwaveformmodels(Fant et al., 1985;
Ananthapadmanabha,1984;FujisaMandLjungqvist,1986)
canvary theabruptness
of the glottalclosure,whilethe others (Rosenberg,1971; Fant, 1979;Hedelin, 1984; Klatt,
1987) alwaysgeneratean abruptglottal closureafter the
instantof maximumclosingslope.No sourcemodel incorporatesall of the four factorswe considered
importantfor
characterizingvocalquality. Exceptfor the turbulentnoise
component,the LF modelhasthe capabilityof varyingthe
threeglottalwaveshape
factorswith onlyfour modelparameters,andaswehaveshowncanapproximatea widerangeof
glottalexcitationwaveforms.
E. A new experimental source model
Basedon thedatafrom thisstudy,we designed
theexcitationmodelshownin Fig. 15,whichconsists
of two components: (1) a glottal pulse generator,and (2) a turbulent
noisegenerator.
I. GIottal pulse generator
We adaptedthe LF model (Fant et al., 1985) for the
glottalpulsegeneratorbecausethe LF modelcan approximate a wide range of pulse widths, pulse skewness,and
abruptnessof glotta1closure.Furthermore,the parameter
valuesfor the LF model can be derivedby usingthe inverse
filtered differentialglottal waveformand/or the EGG signal.
trum-shapingfilter was designedto simulatethe spectral
characteristics
of the glottalturbulentnoise.The presentdesign,basedonourdataandthatofIsshikietal. (1978) usesa
high-pass
filter with a 3-dBfrequencyof 2 kHz. This differs
from that suggested
by Klatt and Klatt (1990). The amplitude modulatorsimulatesthe amplitudefluctuationsof the
glottalturbulentnoisedue to the variationsin airflowand
glottalareaduringvocalfold vibration.A pitch-modulated
squarewavewith an adjustableduty cycleis presentlyused,
but is easilymodified.Two parametersare usedto control
the startingposition( T, ) of the turbulentnoiseanditsduty
cycle(D,). The parameterT,, designates
thelocationof the
onsetof the turbulentnoisesourcerelativeto the beginning
of the pitchperiod,e.g., T, = 75% denotesa noiseonsetat
3/4 of the pitch periodof the glottalexcitationwaveform.
The parameter
D• controlsthedurationof thenoisesource,
e.g.,D, = 50% denotes
thatthenoisesourceisonthe 1/2 of
thepitchperiodof theglottalexcitationwaveform.At present,theseparametersarespecified
by thedesigner,but in the
future they might be measuredfrom the speechsignal.Besidessimulatingbreathiness,
theturbulentnoisegeneratoris
alsosuitablefor producingexcitationsignalsfor voicedfricatives,wherethe turbulentnoiseis generatedat a constriction in the vocal tract. Our observationssuggestthat the
amplitude modulation factor is important for voicedfricarives.
IV. PERCEPTUAL
EVALUATION
A. Method
2. Turbulent noise generator
The turbulent noisegeneratorconsistsof a random
numbergenerator,a spectrum-shaping
filter, and an amplitude modulator. The random number generatorproduces
The tokensfor the listeningtestsweresynthesized
using
a cascadeformant synthesizerof our own design(Pinto et
al., 1989) but basedon Klatt's synthesizer(Klatt, 1980).
For the listeningtestsreportedhere,all tokenswerethe sus-
random noise with a normal distribution and a flat spec-
tained vowels/i/and/o/.
trum. The amplitudelevel of the randomnoisewas controlledby a parameter,A,,, that specifiedthe ratio of the
energyof the noiseto the energyof the glottalpulsewaveform. This ratio is small, typicallyabout0.25%. The spec-
wereadjustedto havethe sameenergylevel.While we were
ableto synthesize
sentences,
thisinvolvedadditionalfactors
suchasphonemetransitionsthat werenot part of thisstudy.
The tokensfor the listeningtestsweregeneratedby comput-
2406
J. Acoust.Soc. Am.,Vol. 90, No. 5, November1991
All tokens were 2 s in duration; all
D.G. Childersand C. K. Lee:Analysisof vocalquality
2406
er via a digital-to-analog
converterandpresented
via headphonesin an IAC single-wallsoundbooth.The tokenswere
arrangedinto groups,eachgroupconsisting
of four vowel
tokens.For eachglottalsourcefactorunderinvestigation,
a
groupof syntheticvowelswasproducedby progressively
changing
theselected
sourceparameter
whilemaintaining
the formantstructurefixed.A groupof voweltokenswas
createdby varyingoneglottalexcitationsourceparameter
overa rangeof four valueswhileall otherparameters
were
heldfixed.The vowels/i/and/o/were synthesized
using
knownformantstructures
whiletheglottalexcitationsource
characteristics
werevaried.Forexample,
wevariedto (the
glottalopeningtime), te (the instantof maximumclosing
rate), andto (the time factorfor controllingthe abruptness
of glottalclosure).Eachof theseparameters
wasprogressivelyvariedin four stepswhiletheotherparameters
were
heldfixed.Thuswecreateda groupof foursynthesized
vowelswith onlyoneparameterbeingvariedin four steps.The
$Q and OQ were varied in a similar mannerand the same
vowelswerealsosynthesized.
The judgesfor the listeningtestswerethreeprofessors
fromtheDepartmentof Speechat the Universityof Florida.
Each judge was familiar with syntheticspeech.Furthermore,eachjudgewasan expertin voiceevaluationin a clinicalsetting.Onlythreejudgeswereusedfor thelisteningtests
sincewe soughtpreliminaryguidancewith respectto the
effectiveness
of our glottal sourcemodel for synthesizing
threevoicequalities.Eachjudgewasaskedto perceptually
rate the various tokens in the manner described below. The
judgeswere not informedas to the mannerby which the
tokensweresynthesized;
i.e., theydid not know the excitationsourceparameters
that werebeingvariednor theorder
in whichthe synthesized
tokenswerepresented,
although
this was not a factor sincethey could listento the vowel
tokenswithin a groupas many timesas they desiredbefore
indicatinga ratingof the voweltokens.Thejudgesweregiven three termsto describeand, thereby,rate the threevoice
qualitieswe synthesized,namely,naturalness,breathiness,
and hypo-/hyperfunction.Naturalnesswasdefinedas "human sounding."The judgeswere familiar with breathiness
andgenerallyagreedthat a breathyvoiceis usuallyassociated with an incompleteglottalclosureduringvocalfold vibration,suggesting
thattheimportantaudiblecomponent
of
the soundis the noisethat is producedby the escapage
of
turbulentairflowat the glottis.The judgesalsoagreedthat
the voicequalityof hypo-or hyperfunctionis relatedto the
perceptualsensationof vocaleffort.Hyperfunctionwasused
to describea strainedor tensevoicequality,as if the vocal
foldswerecompressed
duringphonation.Hypofunction
was
describedas too little tensionin the vocal folds, therebyresultingin an thin or lax voicequality.For eachgroupof four
voweltokensthejudgesratedthe stimulion a 7-pointscale
hypo-/hyperfunction
scalerepresented
the extremehyperfunctionwhilezerorepresented
the extremehypofunction
and three represented
a normalvoicequality.The judges
wereallowedto trainonbothsynthesized
andspokentokens
for thethreevoicequalitiesof naturalness,
breathiness,
and
hypo-/hyperfunction.
Thisallowedeachjudgeto establish
a
correspondence
betweentheir own perceptualevaluation
and the rating scalethey were to use.
B. Results
For the rating of the degreeof hypo-/hyperfunction,
i.e., the lax/tense vocal quality of the vowel tokens,the re-
sultswereasfollows.Withto andt,,fixed,asto variedfrom
58% to 36% the synthesized
voicewasjudgedto go from
hyperfunctional
to hypofunctional.
The perceptualratings
by the threejudgeswerein generalagreementon the seven
pointscale.
Theratingforto = 58%was(5, 4, 6) forjudges
1,2, and3, respectively.
Whento= 36%,theratingwas(0,
1,0). Withto andtofixedandL,varyingfrom55%to 85%,
the synthesizedvoicewasjudgedto go from slightlyhyperfunctional to hypofunctional. The judges' ratings for
te = 55% was (4, 4, 6), while that for t,. = 85% was (0, 0,
0). A resultsimilarto thelatterwasobtained
withtpand
fixedbut with t,, varyingfrom 0% to 10%. The ratingsfor
to = 0% was (5, 4, 5) and was (0, 1, 0) for t• = 10%.
The perceptualsensationof vocaleffortwascloselyrelated to the speedquotient,SQ. A highSQ (7.3) produceda
broad peak in the spectrumat high frequencies,
which was
perceivedby thejudgesas a tenseor hyperfunctionalvocal
quality. The judges' ratings were (5, 4, 6). On the other
hand,a smallSQ ( 1.2) produceda steeplydecliningspectral
slopefor the source,which resultedin a lax or hypofunctionalvocalquality.Thejudges'ratingwere(0, 1,0). An SQ
of 3.0 producedtokensthat werejudgedasnormalin voice
quality.The rating was (3, 2, 3). The openquotient,OQ,
wasnot very usefulfor predictingthe vocalqualityof hypo/hyperfunctionalvoice.
A numberof experimentswereconductedsynthesizing
a breathy voice by varying aspectsof the turbulent noise
component.In oneexperimentfour typesof turbulentnoise
were tested:(1) continuousnoisewith a flat spectrum,(2)
thenoisetimewaveformsetto 50% of thepitchperiodwith a
flat spectrum,(3) continuoushigh-passfilterednoise,and
(4) high-pass
filterednoisewith the time durationbeingset
to 50% of the pitchperiod.Other experimentsevaluatedthe
perceptualeffectof varying the duration (in percentageof
thepitchperiod)of thenoisecomponent
aswellasvarying
thelocationof thenoisesourcewithinthepitchperiod.We
alsoexaminedthe correlationbetweenthe degreeof perceptual breathiness
and the noise-to-harmonic
ratio The find-
ingsfrom theseexperimentswereas follows.
( 1) Amplitude modulationof the turbulentnoisesource
Only onetypeof ratingwasmadeat a time,e.g.,naturalness is important for achievinga natural soundingsynthetic
orbreathiness
or hypo-/hyperfunction.
Ifa particulargroup breathyvoice.Our resultssuggestthat a duty cycle,D,,, of
from zero to six (Prosek et al., 1987; Eskenazi et al., 1990).
of tokens was to be rated for two voice qualities, then the
about 50% (or slightly greater) is preferred.
judgeslistenedto the groupof tokenson two separateoccasions.A ratingof sixrepresented
the highestdegreeof quality for naturalnessand breathiness.
A rating of six on the
(2) Although the locationwithin a pitch periodof the
noiseproductionwasnot critical,the perceptionof naturalnesswas improvedwhen the noisesourcewaslocatednear
2407
J. Acoust.Soc.Am.,Vol.90, No.5, November1991
D.G. ChildersandC. K. Lee:Analysisof vocalquality
2407
the time at which maximum glottal closureoccurredor
slightlyafter,i.e.,T, occurred
atabout75% oftheexcitation
pitchperiod(Lee and Childers,1990;Childersand Ding,
featuresin the frequencydomainincludedthe slopeof the
1991).
monic ratio. The results showed that the sensation of vocal
spectrumandtherelationships
betweenthefundamentalfrequencyand higher harmonicsas well as the noise-to-har-
effortiscloselyrelatedto twoparameters
of theglottalwaveform:thespeedquotient(a timedomainparameter)andthe
spectralslope(a frequencydomainparameter).A largeSQ
of the fundamentalfrequency.
(7.3) producesenergyat higherfrequencies,
therebycaus(4) The degreeof perceptual
breathiness
wasprimarily ing the perceptualsensationof tenseor hyperfunctional
determinedby the noise-to-harmonic
ratio at frequencies voicequality.On the otherhand,a smallSQ (1.2) causesa
steep-falling
spectralsloperesultingin a lax or hypofuncabove2 kHz, The noisepulseenergy,A,, wasabout0.25%.
The NHR h wascontrolled
by varyingtheharmonics
of the tional voice quality. The resultsof the listeningtestsalso
glottalexcitation
pulse.
Thiswasdonebyvaryingtp,t•, and revealedthat the degree of perceptualbreathinesswas
stronglycorrelatedwith thenoise-to-harmonic
ratio above2
t, beingthe major parameter.
of the turbulent noise
Other experiments
wereconductedto synthesize
vocal kHz. The temporalcharacteristics
sourcewerealsofoundto beimportantfor producinga natufry andfalsetto.Theseexperiments,
however,werefoundto
requireparameters
thatwerenottheprimaryinterestof this ral-soundingvoice or a breathy voice.The three major pastudy,e.g.,F0, F0 perturbations,
and multipleexcitation rameters of the glottal excitation model are: (1) T,--the
pulsesfor vocalfry. Consequently,
we do not reportthese locationof the onsetof the turbulentnoisesourceas a perresults here.
centageof the pitch period (typically 75% ); (2) D,the
duty cycle of the turbulent noisesource(typically 50%);
and (3) A,--the ratio of the noiseenergyof the turbulent
V. DISCUSSION AND CONCLUSIONS
noisesourceto the energyof the glottal excitationpulse
The four voicetypes (modal, vocalfry, falsetto,and
(typically0.25%). The NHR hfor themodeliscontrolledby
breathy)werefoundto becharacterized
by four majorfac- the shapeof the glottalpulsegeneratedby the model.The
tors:pulsewidth,pulseskewness,
the abruptness
of glottal key parameteris
closure,and turbulent noise.The effectivenessof the four
Sincethejudges'ratingsfor the variouslisteningtasks
factorsfor synthesizing
a particularvocalqualitywasevalu- werefoundto generallyagreewith oneanother,we feelthat
ated usinga new sourceexcitationmodelwith a formant othersimilarlyskilledjudgeswouldhaveratedthe tasksin a
synthesizer.
Otherfactorsincludedtheglottalspectral
slope, similarmanner.The listeningtestresultsalsoimply that on
theharmonicrichness
factor,andthewaveformpeakfactor. the wholethe glottalexcitationmodelandits associated
paTypicalmeasured
valuesforeachofthesefactorsareindicat- rameterswereeffectivein synthesizing
the threevoicequalied in Table IV.
ties of naturalness,breathiness,
and hypo-/hyperfunction.
Fromtheperceptual
listeningevaluations
of thesynthe- However,the listeningtestresultsshouldnot be interpreted
sizedvoweltokens,we found that the major distinguishing as determiningthe bestchoiceof glottal excitationparam-
(3) High-pass
filteringof theturbulentnoiseisnotcriticalfor theperception
of breathiness
sincetheeffectof noise
in the low-frequencyregionis maskedby strongharmonics
TABLE IV. Summaryof the source-related
featuresand their typicalvalues.
Voicetypes
Features
Modal voice
Vocalfry
Falsetto
Breathyvoice
Glottal
Pulsewidth
medium
short
long
long
waveform
(OQ)
Pulseskewing
(0.70)
medium
(0.45)
high
(0.99)
low
(0.91)
low
(SQ)
Abruptness
of closure
(t.)
(2.6)
abrupt
closure
(3.5)
veryabrupt
closure
(1.5)
progressive
closure
(1.4)
progressive
closure
(2.0%)
(0.7%)
(8.8%)
(8.4%)
Glottal
Spectralslope
medium
slight
steep
steep
spectrum
(riB/oct)
Harmonic richnessfactor
(dB)
( -- 12)
medium
(--9.9)
(--6)
high
(2.1)
( -- 18)
low
(- 19.1)
( - 18)
low
( - 16.7)
Speech
Turbulent noise
low
low
low
high
features
(NHR,, in dB)
( -- 5.3)
(unableto
( -- 6.6)
(2.8)
low
measure)
Waveform
peak
medium
high
low
factor
(2.8)
(4.0)
(1.8)
(unableto
measure)
2408
d. Acoust.Sec.Am.,Vol.90, No.5, November1991
D.G. ChildorsandC. K. Leo:Analysisof vocalquality
2408
etersoreventhebestglottalmodelforthethreevoicequalitiesweconsidered.
Our results
areonlyvalidfor theexcitation modeland associated
parameters
we used.Another
modelmayyielddifferentlistening
testresults.
Thecurrentlyavailable
methods
for estimating
glottal
wavesarerestrictive
in onewayor another.Thisstudyhas
shown
thatit ispossible
to measure
parameters
thataresensitivetosource
features
fromboththespeech
andEGGsignals.However,theseprocedures
needfurtherdevelopment
especially
forconnected
speech,
wherevarious
prosodic
patternsare usedto expressdifferenttypesof statements.
Thus
an important
extension
of thisstudywouldbeto develop
methods
formeasuring
voicesource
dynamics
forconnected
speech.
Forexample,
various
intonation
andstress
patterns
may be correlatedto sourceparametersother than fundamentalfrequency
andtiming.The ultimatepracticalgoalis
to develop
a source
modelandsynthesis
rulesthatcanproducenatural-sounding
syntheticspeechwith desiredvocal
and tonal characteristics.
[Only preliminaryexperiments
havebeenconducted
onsynthesizing
connected
speech
usingthemethodsreportedhere(Pintoet al., 1989;Childers
andWu,1990).] Wealsonotethatdifferences
inphysiologicalstructure,
socialcustoms,
gender,
andagemaygiverise
to distinctivephonationpatternsthatresultin variousvoice
characteristics.
As a finalnote,we mentionthat the knowledge
gained
fromthisstudymightbenefit
applications
ofspeech
recognitionandspeaker
identification.
For example,
oneapplica-
Childers,D. G., and Krishnamurthy,A. K. (1985). "A critical reviewof
electroglottography,"
CRC Crit. Rev. Bioeng.12, 131-164.
Childers,D. G., Naik, J. M., Larar, J. N., Krishnamurthy,A. K., and
Moore,G. P. (1983). "Electroglottography,
speech
andultra-highspeed
cinematography,"in VocalFoldPhysiology:
Biomechanics,
Acoustics
and
PhonotoryControl, edited by I. R. Titze and R. C. Scherer (Denver
Centerfor the Performing
Arts, Denver,CO), pp.202-220.
Childers,D. G., andWu, K. (1990). "Qualityofspeech
produced
byanalysis-synthesis,"
SpeechCommun.9, 97-117.
Childers,D. G., andDing, C. (1991). "Articulatorysynthesis:
nasalsounds
and male and female voices," J. Phon. 19, 453-464.
Childers,D. G. andWu, K. ( 1991). "Genderrecognitionfrom speech,Part
II: Fine analysis,"J. Acoust. Soc.Am. 90, 1828-1840.
Coleman, R. F. (1960). "Some acousticcorrelatesof hoarsehess,"Master's
thesis,VanderbiltUniversity,Nashville,TN.
Colton,R. F. (1969). "Someacousticand perceptualcorrelatesof the modal and falsettoregisters,"Ph.D. dissertation,University of Florida,
Gainesville, Fl.
Colton, R. H. (1973a). "Vocal intensityin the modaland falsettoregisters," Folia Phoniatr. 25, 62-70.
Colton,R. H. (1973b). "Someacousticparametersrelatedto the perception of modal-falsettovoicequality," Folia Phoniatr.25, 302-31 I.
Colton, R. H., and Hollien, H. (1972). "Phonationalrangein the modal
and falsettoregisters,"J. SpeechHear. Res. 15, 708-713.
Colton, R. H., and Hollien, H. (1973). "Perceptualdifferentiationof the
modaland falsettoregisters,"Folia Phoniatr.25, 270-280.
Damste,H., Hollien, H., Moore, G. P., and Murry, T. (1968). "An x-ray
studyof vocalfold length,"Folia Phoniatr.20, 349-359.
Davis, S. B. (1979). "Acousticcharacteristics
of normal and pathological
voices,"in Speechand Language:Advancesin BasicResearchand Practice,editedby N. Lass (Academic,New York), Vol. 1, pp. 271-335.
Deal, R. E., and Emanuel,F. W. (1978). "Somewaveformand spectral
featuresof vowelroughness,"J. SpeechHear. Res.21, 250-264.
Eskenazi, L., Childers, D. G., and Hicks, D. M. (1990). "Acoustic corre-
latesof vocalquality," J. SpeechHear. Res. 33, 298-306.
Fant, G. (1960). AcousticTheoryof SpeechProduction(Mouton, Paris).
tionwouldbeto examinewaysto studytheextractionand
Fant, G. (1979). "Glottal sourceand excitationanalysis,"QuarterlyProuseofsource
parameters
forspeaker
adaptive
signalprocess- gressand StatusReport(SpeechTransmissionLaboratory,Royal Instiing to improvethe reliabilityof a speaker-independent tute of Technology,Stockholm,Sweden),Vol. 1, pp. 85-125.
Fant, G., Liljencrants,J., andLin, Q.-G. (1985). "A fourparametermodel
speechrecognitionsystem.
ACKNOWLEDGMENTS
This work was supportedin part by NIH Grant No.
NIDCD DC 00577 with additionalsupportfrom the Universityof FloridaCenterfor ExcellenceProgramin Information Transfer and Processingand also from the MindMachine Interaction
Research Center.
I Thisissimilar
toa procedure
adopted
byTitze(1984).Note,however,
thata least-mean-square
errorcriterion
is notguaranteed
to produce
a
glottalwaveform
excitation
modelthatmaygenerate
perceptually
the
mostnaturalsounding
syntheticspeech.
Allen,E. L.,andHollien,
H. (1973)."A laminagraphic
study
ofpulse(vocalfry) registerphonation,"Folia Phoniatr.25, 241-250.
Ananthapadmanabha,
T. V. (1984)."Acoustic
analysis
ofvoicesource
dynamics,"
in Quarterly
Progress
andStatusReport(Speech
Transmission
Laboratory,
RoyalInstituteofTechnology,
Stockholm,
Sweden),
Parts2
and 3, pp. 1-24.
Askenfelt,
A. G.,andHammarberg,
D. (1986)."Speech
waveform
perturbationanalysis:
a perceptual-acoustical
comparison
of sevenmeasures,"
J. SpeechHear. Res.29, 50-64.
Boone,
D. R. (1971).TheVoice
andVoice
Therapy
(Prentice-Hall,
Englewood Cliffs, NJ).
Childers,
D. G., Hicks,D. M., Moore,G. P.,andAlsaka,Y. A. (1986)."A
modelforvocalfoldvibratory
motion,
contact
area,andtheelectroglottogram," J. Acoust. Soc.Am. 80, 1309-1320.
Childers,D. G., Hicks,D. M., Moore,G. P., Eskenazi,
L., andLalwani,A.
L. (1990)."Electroglottography
andvocalfoldphysiology,"
J. Speech
Hear. Res. 33, 245-254.
2409
J. Acoust.
Soc.Am.,Vol.90,No.5, November
1991
of glottalflow," STL-OPSR4, 1-13.
Flanagan,J. L. (1957). "Note on the designof terminal-analogspeech
synthesizers,"J. Acoust. Soc.Am. 29, 306-310.
Flanagan, J. L. (1972). SpeechAnalysis',Synthesisand Perception
(Springer-Verlag,New York), 2nd ed.
Flanagan,J. L., andIshizaka,K. (1976). "Acousticgeneration
of voiceless
excitationin a vocalcord-vocaltract speechsynthesizer,"IEEE Trans.
Acoust. SpeechSignal Process.24, 163-170.
Fujisaki,H., and Ljungqvist,M. (1986). "Proposalandevaluationof modelsfor the glottal sourcewaveform,"Proc. IEEE Int. Conf. Acoust.
SpeechSignalProcess.1605-1608.
Hedelin, P. (1984). "A glottal LPC-vocoder," Proc. IEEE Int. Conf.
Acoust.SpeechSignalProcess.1.6.1-1.6.4.
Heiberger,V. L., and Horii, Y. (1982). "Jitter and shimmerin sustained
phonation,"in Speechand Language:Advancesin BasicResearchand
Practice,editedby N. Lass(Academic,New York), Vol. 7, pp. 299-322.
Hiller, S. M., Laver,J., and Mackenzie,J. (1983). "Automaticanalysisof
waveformperturbations
in connected
speech,"Work in Progress,Dept.
of Linguistics,EdinburghUniversity,16, 40-68.
Hillman, R. E., Osterle, E., and Feth, L. L. (1983). "Characteristics of the
turbulent noise source," J. Acoust. Soc. Am. 74, 690-694.
Hirano,M., Hibi, S., Terasawa,R., and Fujiu, M. (1985). "Relationship
betweenaerodynamic,
vibratory,acousticand psychoacoustic
correlates
in dysphonia,"Symposium
on VoiceAcousticandDysphonia,Gotland,
Sweden.
Hiraoka, N., Kitazoe, Y., Ueta, H., Tanaka, S., and Tanabe, M. (1984).
"Harmonic-intensity analysisof normal and hoarse voices," J. Acoust.
Soc. Am. 76, 1648-1651.
Hollien, H. (1974). "On vocalregister,"J. Phon. 2, 125-144.
Hollien,H., Coleman,R. F., andMoore,P. (1968). "Stroboscopic
laminagraphyof the larynx duringphonation,"Acta Oto-laryngol.LXV, 209215.
Hollien, H., and Colton, R. H. (1969). "Four laminagraphicstudiesof vocal fold thickness," Folia Phoniatr. 21, 179-198.
D.G. Childers
andC. K.Lee:Analysis
ofvocalquality
2409
Hollien,H., Girard,G. T., andColeman,R, F. (1977). "Vocalfoldvibratorypatterns
ofpulseregister
phonation,"
FoliaPhoniatr.29,200-205.
Hollien,H., andMichel,J. F. (1968). "Vocalfry asa phonationalregister,"
J. SpeechHear.Res.11, 600-604.
Holmes,J. N. (1973). "Theinfluence
of giottalwaveformon thenaturalhess
ofspeech
froma parallelformantsynthesizer,"
IEEE Trans.Audio
Electroacoust.AU-21, 298-305.
Horii,Y. (1980). "Vocalshimmerin sustained
phonation,"
J.Speech
Hear.
Res. 23, 202-209.
patternin thenormallarynx,"FoliaPhoniatr.10, 205-238.
Pinto, N. B., Childers,D. G., and Lalwani, A. L. (1989). "Formant speech
synthesis:
improvingproductionquality,"IEEE Trans.Acoust.Speech
SignalProcess.37(12), 1870-1887.
ProseLA. R., Montgomery,B. E., andHawkins,D. B. (1987). "An evaluation of residuefeaturesas correlatesof voice disorders," 1. Commun.
Disord. 20, 105-117.
Rabiner,L. R. (1968). "Digital-formantsynthesizer
for speechsynthesis
studies," J. Acoust. SOC.Am. 43, 822-828.
ty: listeningtestsand long-termspectrumanalyses,"
Symposium
on
Rabiner, L. R., and Schafer,R. W. (1978). Digital Processing
of Speech
Signals(Prentice-Hall, EnglewoodCliffs, N J).
VoiceAcousticsand Dysphonia,Gotland,Sweden.
Rosenberg,
A. E. (1971). "Effectof glottalpulseshapeon the qualityof
Hurme, P., and Sonninen,A. (1985). "Normal anddisorderedvoicequali-
Isshiki,N., Kitajima,K., Kojima,H., andHarita,Y. (1978). "Turbulent
noisein dysphonia,"Folia Phoniatr.30, 214-224.
Kasuya,H., Ogawa,S.,andKikuchi,Y. (1986). "An acoustic
analysis
of
pathologic
voiceanditsapplication
totheevaluation
oflaryngeal
pathology,"SpeechCommun.5, 171-181.
Kitzing,P. (1982). "Photo-andelectrophysiological
recording
of the laryngeal
vibratory
patternduringdifferent
registers,"
FoliaPhoniatr.
34,
234-241,
Klatt, D. H. (1980). "Softwarefor a cascade/parallel
formantsynthesizer," J. Acoust. Soc.Am. 67, 971-995.
Klatt, D. H. (1987). "Reviewoftext-to-speech
conversion
for English,"$.
Acoust. Soc.Am. 82, 737-793.
Klatt, D. H., andKlatt, L. C. (1990). "Analysis,synthesis,
andperception
of voicequalityvariations
amongfemaleandmaletalkers,"J. Acoust.
Soc. Am. 87, 820-857.
Krishnamurthy,
A. K., andChilders,D. G. (1986). "Two-channel
speech
analysis,"
IEEE Trans.Acoust.Speech
SignalProcess.
ASSP-34,730743.
Ladefoged,
P. ( 1975)..4Course
in Phonetics
( HarcourtBraceJovanovich,
New York), pp. 121-124.
Laver,J. (1980). ThePhonetic
Description
of VoiceQuality(CambridgeU.
P., New York), pp. 115-117.
Laver,J.,andHanson,R. (1981). "Describing
thenormalvoice,"in Evaluationof Speech
in Psychiatry,
editedby J. Darby (GruneandStratton,
New York), pp. 51-78.
Lee,C. K. (1988). "Voicequality:Analysisandsynthesis,"
Ph.D. dissertation, Universityof Florida,Gainesville,Fl.
Lee,C. K., andChilders,D. G. (1989). "Someacoustical
perceptual
and
physiological
aspects
of vocalquality,"in VocalFoldPhysiology,
edited
by B. HammarbergandJ. Gaulfin(Raven,New York) (in press).
Monsen,R., andEngebretson,
M. (1977). "Studyof variations
in themale
andfemaleglottalwave,"J. Acoust.Soc.Am. 62, 981-993.
Moore,G. P. (1975). "Observation
onthephysiology
of hoarseness,"
Proceedings
ofthe4thInternational
Congress
ofPhonetic
Science,
Helsinki,
Finland,pp. 92-95.
Moore,P., andyonLeden,H. (1958). "Dynamicvariationsofthevibratory
2410
J. Acoust.
Soc.Am.,Vol.90,No.5, November
1991
natural vowels," J. Acoust. Soc. Am. 49, 583-590.
Sambur,M. R., Rosenberg,
A. E., Rabiner,L. R., and McGonegal,C. A.
(1978). "On reducingthebuzzin LPC synthesis,"
J. Acoust.Soc.Am.
63, 918-924.
Schroeder,M. R., and Atal, B. S. (1985). "Code-excitedlinear prediction
(CELP): high quality speechat very low bit rates,"Proc. IEEE Int.
Conf.Acoust.SpeechSignalProcess.
937-940.
Timcke,R., yonLeden,H., andMoore,P. (1959). "Laryngealvibrations:
measurements
of theglotticwave,"Arch. Otolaryngol.69, 438 44•..
Titze, I. R. (1984). "Parameterization
of the glottalarea,glottalflow,and
vocal fold contactarea," J. Acoust. Soc.Am. 75(2), 570-580.
Titze, I. R. (1990). "Interpretationof theelectroglottographic
signal,"J.
Voice 4, 1-9.
Trancoso,I. M., and Atal, B. S. (1990). "Efficientsearchprocedures
for
selectingthe optimuminnovationin stochastic
coders,"IEEE Trans.
Acoust.SpeechSignalProcess.
38, 385-396.
Wendahl,R. W. (1963). "Laryngealanalogsynthesis
of harshvoicequality," Folia Phoniatr. 15, 241-250.
Wendahl,R. W. (1966). "Laryngealanalogsynthesis
ofjitter andshimmer
auditoryparametersof harshness,"
Folia Phoniatr.18, 98-108.
Wendahl,R. W., Moore, P., and Hollien, H. (1963). "Commentson vocal
fry," Folia Phoniatr. 15, 251-255.
Whitehead,R. L., Mets, D. E., and Whitehead,B. H. (1984). "Vibratory
patternsof the vocalfoldsduringpulseregisterphonation,"J. Acoust.
Soc. Am. 75, 1293-1297.
Wolfe, V. I., and Steinfatt, T. M. (1987). "Prediction of vocal severity
within and acrossvoicetypes,"J. SpeechHearingRes.30, 230-240.
Wu, K., and Childers, D. G. (1991). "Gender recognitionfrom speech,
Part I: Coarseanalysis,"J. Acoust.Soc.Am. (in press).
Yanagihara,H. (1967). "Significance
of harmonicchanges
andnoisecomponentsin hoarseness,"
J. SpeechHear. Res.10, 531-541.
Yea, J. J., Krishnamurthy,A. K., Naik, J. M., andChilders,D. G. (1983).
"Glottal sensing
for speechanalysisandsynthesis,"
Int. Conf. Acoust.
SpeechSignalProcess.1332-1335.
Yumoto, E., Gould, W., and Baer,T. (1982). "Harmonic-to-noiseratio as
anindexofthedegreeof hoarsehess,"
J.Acoust.Soc.Am. 71, 1544-1550.
D.G.'Childers
andC. K.Lee:Analysis
ofvocalquality
2410