Vocal quality factors: Analysis, synthesis, and perception D.G. Childers DepartmentofElectricalEngineering,University ofFlorida;Gainesville, Florida32611-2024 C.K. Lee Department ofElectrieal Engineering, TatungInstituteof Technology, Taipei,Taiwan,Republic ofChina (Received28 September 1990;accepted forpublication 25July1991) The purposeof thisstudywasto examineseveralfactorsof vocalqualitythat mightbeaffected by changesin vocalfold vibratorypatterns.Four voicetypeswereexamined:modal,vocalfry, falsetto,andbreathy.Threecategories of analysistechniques weredeveloped to extractsourcerelatedfeaturesfrom speechandelectroglottographic (EGG) signals.Four factorswerefound to beimportantfor characterizing theglottalexcitations for thefourvoicetypes:theglottal pulsewidth,the glottalpulseskewhess, the abruptness of glottalclosure,andthe turbulent noisecomponent. The significance of thesefactorsfor voicesynthesis wasstudiedanda new voicesourcemodelthataccounted for certainphysiological aspects of vocalfoldmotionwas developedand testedusingspeechsynthesis. Perceptuallisteningtestswereconductedto evaluatethe auditoryeffectsof the sourcemodelparameters uponsynthesized speech.The effectsof thespectralslopeof the sourceexcitation,the shapeof the glottalexcitationpulse, and the characteristics of the turbulentnoisesourcewereconsidered. Applicationsfor these research resultsincludesynthesis of naturalsounding speech, synthesis andmodelingof vocal disorders, andthedevelopment of speakerindependent (or adaptive)speech recognition systems. PACS numbers:43.70.Aj, 43.70.Gr, 43.71.Bp INTRODUCTION The qualityof voicemaybereferredto asthe totalauditory impression the listenerexperiences upon hearingthe speechof anothertalker.Thereis no generallyaccepteddefinition of vocalquality and the term hasbeenusedin differentcontexts, e.g.,a phonetician mightusequalityin thecontext of articulatorydifferences; a singermightreferto the quality of vocal registers,which are relatedto vocal fold vibration;andqualitymightbe usedto describevoicetypes suchasbreathy,hoarse,or harsh.In this study,we investigatedaspects of vocalqualityrelatedto vocalfold vibratory patterns,i.e., laryngealvocalquality.We did not address two othercommoncharacteristics of vocalquality,namely, harshness,and whisper. We eliminated harshnessand whisperfrom our studybecauseharshness involvedlarge variationsfromcycle-to-cycle in thevocalfoldvibratorypatterns (Coleman, 1960; Moore, 1975; Wendahl, 1963; Wen- dahl, 1966) andwhisperis characterizedby little or no vocal fold vibratorymotion.To help us achieveour goal,we reviewedvocalquality with respectto its physiological, perceptual,and acousticalaspectsfor the four voicetypesof modal,vocalfry, falsetto,and breathy. A. Physiological characteristics Vocal fold length and thicknesshave beenshownto affectthethreevoiceregisters of modal,vocalfry, andfalsetto. Our purposewas to improveexistingor developnew For modal voicevocalfold lengthmay be consideredto be speech andelectroglottographic (EGG) analysis techniques mediumwith the lengthincreasingasfundamentalfrequento assistthe assessment of vocalqualityand to designnew cy (F0) increases(Damste et al., 1968;Hollien, 1974). For voice sourcemodelsfor synthesizingnatural sounding vocalfry andfalsettothevocalfoldlengthisshortandlong, speechwith a selectablevocalquality. We investigatedthe respectively (Boone,1971;Hollien, 1974;Ladefoged,1975). nature and extent of the glottal excitation variations of four Vocal fold thicknessvarieswith F0, beingmediumfor modal voice types,namely breathy voice and the three registers, voice,while the vocalfoldsbecomethick for vocalfry and modal, vocal fry, and falsetto. Our analysisof resultsled to thin for falsetto (Hollien et al., 1968; Hollien and Colton, thedevelopment of a newsourcemodelfor speechsynthesis 1969;Boone,1971;Allen andHollien, 1973;Hollien, 1974). that containssomefactorsthat appearperceptuallyimporSomecharacteristics of the vocalfold vibratorypattern tant for characterizing vocalquality.The knowledgegained arealsoknown.For modalvoicethevocalfoldshavea speed from this researchmay improvethe developmentof natural quotientgreaterthanonewith theglottishavinga longopensoundingspeechsynthesizers and assistthe advancementof ing phaseand a rapid closingphase(Hollien, 1974;Monsen speaker-independent speechrecognitionsystems. and Engebretson,1977). Vocal fry ischaracterizedby a glotLaver and Hanson ( 1981) havedefinedsix major types tal area functionthat hassharp,shortpulsesfollowedby a ofphonation, modal voice,vocalfry, falsetto,breathyvoice, long closedglottal interval.The glottal openingphasemay loudness, and resonance. 2394 J. Acoust.Soc. Am.90 (5), November1991 0001-4966/91/112394-17500.80 ¸ 1991 AcousticalSocietyof America 2394 haveone,two,or threeopening/closing pulses(Moore and yon Leden, 1958;Timeke etal., 1959;Hollien, 1974;Monsen and Engebretson,1977;Hollien etaL, 1977;Whiteheadet al., 1984). Falsettovoicehasgradualglottalopeningand closingphaseswith a shortor no closedphase(Hollien, 1974; Monsen and Engebretson,1977; Kitzing, 1982). Breathyvoicemay havea slightvibratoryexcursionof the vocal folds with incompleteglottal closure(Boone, 1971; Hollien, 1974;Ladefoged,1975). B. Perceptual characteristics Generally, the pitch for modal, vocal fry, falsetto,and breathy voice is characterizedas medium, low, high, and wide ranging,respectively(Hollien and Michel, 1968;Colton, 1969;Boone, 1971;Colton and Hollien, 1973; Hollien, 1974;Laver, 1980). Loudnessfor modalvoicecan vary over a wide rangewhile vocalfry, falsetto,and breathyvoiceare typicallysoft (Boone, 1971;Hollien, 1974;Laver, 1980). The quality of thesefour voicetypeshasbeendescribedas normal (modal), a low pitch, rough-sounding voice (vocal fry), a flutelike tone that is sometimesbreathy (falsetto), and a friction noiselikesound (breathy) (Wendahl etal., 1963;Boone,1971;Coltonand Hollien, 1973;Hollien, 1974; Laver, 1980). C. Acoustical characteristics TheF0 ofmodalvoicesrangesover94-287 Hz for males and 144 to 538 Hz for females,while the F0 rangefor vocal fry is 24-52 Hz for malesand 18-46 Hz for females;the falsettoF0 rangefor malesis 275-634 Hz and495-1131 Hz for females,whilebreathyisdescribedaswideranging(Hollien and Michel, 1968; Colton, 1969; Boone, 1971; Hollien, 1974). Vocal intensity characteristicsare lessspecificand have beendescribedas wide ranging,low, low to medium, and low for modal, vocal fry, falsettoand breathy,respectively (Boone, 1971; Colton, 1973a;Hollien, 1974; Laver, 1980). Likewise,the sourcespectralslope(tilt) for these samevoicetypeshasbeendescribedas mediumbut F0 dependent,relativelyflat, steep ( -20 dB/oct), and steep (Colton, 1973b; Hollien, 1974; Monsen and Engebretson, 1977; Hurme and Sonninen, 1985). Turbulent noise has beenassociatedwith only modal (low) and breathy (high) voices(Yanagihara,1967;Isshikietal., 1978;Hiraokaetal., 1984). Pitch perturbationis generallylow for modal voice, highfor vocalfry andbreathyvoiceswhileamplitudeperturbationis low for modalandhighfor breathyvoices(Monsen and Engebrctson,1977; Deal and Emanual, 1978; Davis, 1979;Horii, 1980;Heibergerand Horii, 1982;Hiller etal., 1983; Hirano etal., 1985; Askenfelt and Hammarburg, 1986;Kasuyaetal., 1986;Wolfe and Steinfatt,1987). Many moredetailsare availableon the four voicetypes we studied. We selected and summarized those features we thoughtwouldbeanalyzablefrom thespeechandEGG signalsandthat mightbeusefulin designing a sourceexcitation modelfor synthesizing natural soundingspeechand vocal disorders.Recently,severalissuesin synthesizingnatural soundingvoiceshavebeenconsidered, includingthe design of sourceexcitationmodels (Fant, 1979; Fant et al., 1985; Klatt, 1987;Lee and Childers, 1989;Klatt and Klatt, 1990; Childers and Wu, 1990; Childers and Wu, 1991; Wu and Childers, 1991). We will discusstheseissueslater when we presentour sourceexcitationmodel.This studyconsidered two typesof experiments:(1) voicequality factorsthat mightbe influenced by changes in vocalfold vibratorypatternsand (2) the perceptualvalidationof the effectsof acoustic parametervariationsonthequalityof thesynthesis of a particularvoicetype. I. EXPERIMENTAL A summaryof our researchschemeappearsin Fig. 1. The speech andelectroglottographic dataweredigitizedsimultaneously usingDigital SoundCorp. DSC-240preamplifiersanda DSC-200digitizer.We sampled eachsignalat 10kHz with 16-bitsprecision.The microphone wasan Electro-Voice RE-10 held six inchesfrom the lips. The EGG devicewas a model from Synchrovoice,Inc. All data were collectedin a professional IAC singlewall soundbooth.A Digital EquipmentCorp. VAX 11/750 computersystem managedthe datacollection. The subjectsfor this studyconsisted of 23 (8 male, 15 female)patientswith a vocaldisorderor pathologyand 52 (27 male,25 female) subjectswith a normallarynx.We denotethepatientsasPn,wheren goesfrom I to 23,whilethe normalsubjects aredenotedasNn wheren goesfrom 1to 52. Thesubject's agesrangedfrom20 to 80yearsold.Thecompletespeechprotocolconsisted of 27 tasks,includingten sustainedvowels/i, l, •, •e, o, a, u, u, ^, e/, two sustained diphthongs/ou, el/, fivesustained unvoiced fricatives/h,f, O,s,J'/, andfoursustained voicedfricatives/v,8, z, 3/. The subjects wereinstructed to pronounce andsustaineachvowel asit wouldbe pronounced in thefollowingwords,respec- tively:beet,bit,bet,bat,Bob,bought, book,boot,but,Bert. Similar instructionswere given for the diphthongs,for which the cue words were boat and bait, while for the frica- tiveswe usedthe followingcuewords:h_at,fix, thick,sat, ß VOICE TYPE '• ß Modal ß Speech WAVEFORM [ VOICE ESTIMATION • SYNTHESIS SSOURCE r ß Vocal Fry ß Breethy EXTI•,CTION J vatlou• voice qualities EGG features for different types of phonatlon ß Ac½:•.sstlc effects of source features F•ATURB J•ii•00 WAVEFORM ANALYSIS 2395 Source features that characterize ß PERCEPTUAL EVALUATION Falsetto PROCEDURES d. A½oust.Soo. Am., Vol. 90, No. 5, November 1991 FIG. 1. Black diagram of basic research scheme. Source model for natural sounding voice synthesis Rulesfor syntheslzalngspecificvoice qualifies D.G. Childorsand C. K. Lee: Analysis of vocal quality 2395 LP Residue[ ship,van,this,zoo,andazure.Thedurationof eachvowel, diphthong,and fricative approximated2 s. The additional tasksincludedcountingfrom oneto ten with a comfortable pitchandloudness,countingfrom oneto fivewith a progressive increasein loudness,singingthe musicalscaleusing "la," and speakingthree sentences.(We were away a year ago.Early onemorninga man and a woman ambledalonga onemile lane. Shouldwe chasethosecowboys.9) This completeprotocolhasprovideda databasefor severalstudieson voicequalityand speechsynthesis, includingthis one.For this study,we analyzedonly data for the two vowels/i/and Error q• (n) s(n)_ Fix•-Frame V•(z) LPAnalysis Pul• and Determine Aflaly•is Frame for 2fid-PassLP Analysis I FIG. 3. Blockdiagramof thetwo-pass methodforglottalinverse filtering analysis. /o/, the three sentences,and selecteddata records from the counting task. The two vowelswere selectedbecausetwo recentstudieshadfoundacousticcorrelates of vocalquality and vocal disordersusingthesetwo vowels(Prosek et al., 1987;Eskenaziet el., 1990). We analyzedthe sentences and the countingtaskto give us someindicationof the glottal factors in a word and sentence context. II. ANALYSIS FOR EXTRACTING SOURCE FEATURES A. Inverse filtering Figure 2 illustratesthat the peaksin the linear prediction (LP) error functionoccurnearly simultaneously with the negativepeaksof the differentiatedEGG (DEGG) signal, which correspondto the instantsof glottal closure (Childerset el., 1983;Childersand Krishnamurthy,1985; Childers et el., 1990). From these observations,we devel- opedthe "two-passmethod"for accurate,automaticglottal inversefiltering, which works as well as the two-channel (speechandEGG) method(KrishnamurthyandChilders, volume-velocity waveform.As with theEGG signal,theestimatedLP pulsesmay not providean exactindicationfor theinstantof glottalclosure.The keyfeatureof thetwo-pass methodis that it ensuresthe exclusionof the main pulseof theLP errorsignalfromtheanalysisinterval.Thistailoring of theanalysis intervalincreases theaccuracy of theLP analysis.For a definitionof the signalprocessing termsusedin this paper, the readeris referredto Rabiner and Schafer (1978). A block diagramof the two-passmethodis shownin Fig. 3. During the first pass,a pitch-asynchronous (fixed frame) LP analysis is performed on theinputspeech signal s(n ). The estimatedLP filter, V, (z), is usedto derivethe LP error signal,q• (n), by inversefiltering.For a voicedspeech signal,the LP errorfunctionischaracterized by a pulsetrain with the appropriatepitch period.The locationsof these pulses aredetected bya peak-picking methodandareusedas indicators of glottal closure. In the second passof theproce1986). dure, a pitch-synchronous covariance LP analysis is usedto The two-passmethodfirstidentifiesthe locationsof the estimate an improved LP filter, V 2 (z). For each pitch perimain pulsesof the LP error signalderivedduringthe first od, the criterion for determining the analysis interval is to passof theinversefilteringprocedure. Thesemainpulsesare pick the samples starting one point after the instant of the then usedas indicatorsof glottalclosure,and a "pseudomain pulse. The formant resonances of the vocal tract are closedphase"is selected astheanalysisintervalfor a pitchsynchronouscovarianceLP analysisto estimatethe vocal tractfilter,whichin turn isusedto obtainthedesiredglottal estimatedby solvingthe rootsof the LP polynomial.The formantstructureis thenshapedby empiricalrules,which include:( 1) discarding therootswithcenterfrequencies below 250 Hz, (2) discardingthe roots with bandwidths greaterthan 500 Hz, and (3) mergingtwo adjacentroots. The refined ferment resonances are then used to construct the vocal tract transfer function, which is usedin the final (second-pass) glottalinversefilteringprocedure.The direct outputof the glottalinversefilteringoperationis a differen- tial glottalvolume-velocity u• (n) (i.e.,theequivalent drivDEGG-BASED CLOSED PHASE AND ANALYSIS INTERVAL FOR THE TWO- DEGG ing functionto the vocaltract filter), which represents the combinedeffectof the lip radiationand the glottalvolumevelocity.A glottalvolume-velocity waveform,u• (n), isderived by integration.The validity of the two-passmethod wasverifiedby testingit with syntheticspeechsignals(Lee, 1988).The syntheticspeechsignalswereproducedby a cascadeformantsynthesizer(Klatt, 1980) excitedby stylized MAIN EXCITATION glottalpulsesgeneratedby the LF (Liljencrantsand rant) model (rant etal., 1985). A typicalresultappearsin Fig. 4. PASS METHOD (APPROXIMATELY 35% OF THE PITCH PERIOD) • LPRESIDUAL ERROR • B. Source FIG. 2. SynchronizedEGG, DEGG, and LP residualerror. 2396 J. Acoust.Sec. Am., Vol.90, No. 5, November1991 features Usingthe inversefilteredwaveforms(glottalpulses)as the excitationfor voicetypes,modal,vocalfry, falsetto,and D.G. Childorsand C. K. Leo:Analysisof vocalquality 2396 SPEECH WAVEFORM gan(1957) gavean averagevalueof -- 12dB/octavefor the slopeof the glottalspectra.This leadto the commonlyacceptedtwo-polemodelfor approximating thegeneralspectral trendof a normalglottalvolumeflow: Ug(z)=K/(1 - zoz- •)(1 - zbz- •), (1) whereK is a constantrelatedto theamplitudeof theglottal flow andzo,zb are real polesinsidethe unit circle.A pointzo LP RESIDUAL ERROR INTEGRAL OF ORIGINAL EXCITATION is saidto bea poleof a functionUg(z), if Ug (z•) equals WAVEFORM infinity.The numberof polesa functioncontainsdetermines the slopeof the spectrumof the functionas the frequency increases. A singlepole producesa spectralslopeof - 6 dB/octave,two poles -- 12dB/octave,andsoon. This discussionappliesprovidedthereare no zerosof the function, for whichthereare nonein the all polemodelswe consider here. ORIGINAL EXCITATION WAVEFORM INTEGRALOF ESTIMATEDEXCITATIONWAVEFORM ESTIMATED EXCITATION WAVEFORM Figure5 showstheFourierspectraandthecorrespondingtwo-poleapproximations for theglottalpulses of various voicetypes.The resultsshowthat the two-polemodelis appropriatefor modalphonations and is reasonably goodfor vocalfry exceptfor a smalllow-frequency interval(below the third harmonic). However, the two-polemodel is not adequatefor falsettoand breathyphonations, whichhave spectralroll-offrateshigherthan 12 dB/octave.For these latter two typesof phonation,we found that a three-pole modelprovidesa betterfit to the data, e.g., Ug(z)=K/(1-z•z FIG. 4. An exampleof two-pass glottalinverse filtering. breathy,we measuredthe followingfeatures:( 1) instantof maximumclosingslopeofthe glottalpulse,(2) glottalpulse width, and (3) glottal pulseskewness(ratio of durationof glottal openingphaseto durationof glottal closingphase). For modalandvocalfry phonations,theinstantof the maximum closingslopeoccursnearthe instantof glottalclosure, resultingin an abrupt terminationof the glottal airflow. For falsettoandbreathyphonations,the instantof themaximum closingslopeoccursnear the middleof the glottal closing phase,followedby a residualphaseof progressive closure. The glottalpulsewidthsweremoderate(65%-70% of the pitch period) for modal phonationsand small (25%45% of the pitchperiod)for vocalfry. Falsettoandbreathy voiceshadlargepulsewidths(90%-100% of the pitchperiod), oftenmakingit appeartherewasnoclosedphase.Glottal pulseskewness alsovariedwith voicetype. The ranking of voicetypeaccordingto decreasing skewness wasvocalfry, modal, falsetto,and breathy. C. Glottal spectral characteristics The literature (e.g., Monsen and Engebretson,1977; Pinto et al., 1989) suggestedthat the glottal waveformsof differentvoicetypescould be distinguished by two factors: (1) the generalspectralslope (tilt or.trend), and (2) the relationshipbetweenthe intensity of the fundamentalfrequencyand its harmonics.For normal phonations,Flana2397 J. Acoust.Soc. Am., Vol. 90, No. 5, November 1991 •)(l-zl, z-•)(l-zcz •), (2) wherezcisthethird realpoleinsidetheunitcircle.For both the two- and three-polemodels,we foundthat we couldset z• equalto unity and then calculatezb and z• from the preemphasized glottal waveformusinglinear prediction analysisof thewaveformestimated by inversefiltering. Table I liststypicalvaluesof zb andzc for the inverse filteredglottal wavesfor the four voicetypesfor selected patientsandnormalsubjects. The resultsconfirmedthatthe generalspectralslope(tilt) of a modalor vocalfry phonationcanbemodeledby two realpoles.On theotherhand,to simulatethespectralslopeof a falsettoor breathyphonation, one extra real pole was requiredto accountfor its steeper roll-offrate.Figure6 showsthe improvedspectralmatching usingthethree-pole modelforthefalsetto andbreathyphonations. Our two- andthree-polemodelsapproximate the highfrequencyspectralslopeof a glottalpulsebetterthan the low-frequency characteristics. An explanationfor thisisthat the glottal pulsemustbe of finite duration.Therefore,an exactmodelwould be a finite impulseresponsefilter, and hence,would containonly zeros.Our resultsshowedthat the useof an all-polemodelwill causemismatchof the model spectrumwith the dataat low frequencies (seeFigs.5 and 6). We reasonedthat the low-frequencymismatchwould notbeasimportantperceptually asthehigh-frequency characteristics providedthe harmonicsin the speechsignalwere reasonablywell matchedby the model. This point was to be examinedusinga perceptualevaluationof synthesizedtokens. The spectraltilt or slopeof the glottalpulsecanbe measuredwithout estimatingthe waveshapeof the glottal pulse. D.G. Childersand C. K. Lee: Analysisof vocal quality 2397 The generalspectralslopefor a voicephonationis determinedby the combinedcontributionof the spectraof the glottalpulseandthelip radiation. Thisspectral slopemay oftenbeapproximated asa two-polespectrum, i.e.,tworeal TABLE I. Theestimated realpolesfor thelow-pass filtermodeloftheglottal waveforms fora fewtypicalmalesubjects (N = normalsubject; P = disorderedsubject}. Subject Voicetype F0 (Hz) N 11 N24 vocalfry/a/ vocalfry/i/ N20 slightvocalf•/i/ N3 NI4 N27 modal/o/ modal/i/ modal/i/ N7 NI9 falsetto/a/ falsetto/a/ P8 PI3 breathy/i/ breathy/i/ zb zc 46 45 0.83 0.81 '" --- 90 0.83 --- 155 106 126 0.92 0.92 0.96 '" ----- 210 312 1.00 1.00 0.60 0.73 137 200 1.00 1.00 0.25 0.87 polesinsidetheunitcircle.Thesepolesareestimated using LP analysis of preemphasized speech. The resultsobtained are consistentwith thoseobtainedusing inversefiltered 6O Nil, vocalfry speech.For falsettoand breathyvoices,one pole is on the unit circle(or nearlyso) andthe othervariesroughlyfrom 0.2 to 0.9. The polevaluesare, however,affectedby the formantsof the phonation.For example,the locationof the poleswithin the unit circle for the two pole model varied dependingon the type of vowel (/i/or/o/) for a given speakeranda givenvoicetype.The poleswereusuallycloser to the originfor/i/than for/o/. O. Fundamental frequency and harmonics The harmonicsin the speechsignalbelowthe first formant are often consideredimportantfor the perceptionof (D ' ' NI9, falsetto 5O P8, breathy ].3O P8, breathy 133 9O 60 spectra andsuperimposed three-pole modelapproxiFIG. 5. Glottalsource spectra andsuperimposed two-pole modelapproxi- FIG. 6.Glottalsource andbreathyvoicetypesforthesamecorresponding submationformodal,vocalfry,falsetto, andbreathy voicetypes.Thesubjects mationforfalsetto are indicated in Table I. 2398 J. Acoust. Sec.Am.,Vol.90, No.5, November 1991 jeersasin Fig. 5. D.G. Childors andC. K.Leo:Analysis ofvocalquality 2398 vocalquality(Holmes,1973). Thisispresumably dueto the highenergyin theseharmonics. We foundthattheglottal spectraofdifferentvoicetypesshowed distinctive amplitude relationships betweenthefundamental andhigherharmonics.Figure7 shows theglottalspectra forvarious voicetypes, whereHi is the ainplitudeof the ith harmonicandH• is the amplitudeof the fundamentalfrequency.We foundthat vocal fry wasapproximatelycharacterizedby a high HRF (2.1 dB) followedby modal ( - 9.9 dB), breathy ( - 16.7dB), and falsetto( - 19.1dB). Falsettoandbreathyvoiceshad a displaying onlythefirsttenharmonics. Therearedistinctive highintensityfundamentalaswell.Thesevariationsin haramplituderelationships betweenthe fundamentaland the monicrelationsappearto havelittle significance for speech higherharmonics asidefrom thedifferences in thespectral intelligibility (phonetic identifiability) but affect the perslopeasdiscussed above.We defineda parametercalledthe ceivedvocalquality. "harmonicrichnessfactor" (HRF) to ineasurethisrelationship HRF- --, •2H" (3) Hi 80 N14, modal E. Interharmonic noise Turbulenceat the level of the glottis has beennoted to contributeto the perceptualquality ofbreathy voice(Isshiki et al., 1978;Yumoto etal., 1982; Hilhnan etal., 1983; Hiraokaet al., 1984). We modifiedthe frequency-domain method ofHiraoka et al. (1984) to measurethe noise-to-harmon- (dB) o 8o N11, vocalfry ic ratio (NHR). Our methodusesan adaptiveprocedureto estimateF0, which is crucialfor identifyingthe higherharmonics.Thiswasaccomplished bya two-stepprocedure. We estimatethe pitch periodsof the speechsignalby usingthe synchronousEGG signal.The accuracyof this initial estimationis restrictedby the signalsamplingrate (in our case, 10 kHz). For example,if the accuracyof F0 is within one samplepoint, an F0 estimateof 100samplepointsmeansan F0 of 100 q- 1 Hz and an F0 estimateof 50 samplepoints mean an F0 of 200 q- 4 Hz. Such estimation errors can cause ßa severeproblemin identifyinghigh-orderharmonics.For example,a 4-Hz error of F0 couldleadto a 40-Hz shift in locatingthe tenth harmonic. An adaptiveF0 correctionprocedurewasusedto reduce the o 80 N19, falsetto error in the initial F0 estimate. We defined FO,, = FO _+nAF, wheren = 1,2..... 10,and AFis onetenth of the maximum possibleerror of F0, as describedabove. Basedon eachFO,,, the energyof the harmoniccomponents over the frequencyrangeof 0 to 2 kHz are computed.The valueFO,, givingthe inaximumharmonicenergyis selected as the final estimate of F0. This criterion is based on our experimentalresults,which showedthat over the low-fre- quency rangeof0 to 2 kHz, theharmonic energy isconsiderably higher than the interharmonicnoiseenergy. OnceF0 isdetermined,theith harmonicamplitude,Hi, o 80 P8, breathy and the interharmonicnoise,N/, are computedin the frequencyregioniFO +__ FO/2, whereH/representsthe energy in the subregioncenteredat iFO with a bandwidthof the Hamming analysiswindow. The symbolN• representsthe energyin the remainingfrequencyregion (Fig. 8). And the noise-to-harmonicratio at the ith harmonicregion(NHR i ) is definedas: NHRi = Ni/H i. The distribution (4) of the interharxnonic noise can be ob- servedby plottingNHRi alongthe frequencyaxisasin Fig. 1 2 3 4 5 6 7 8 9 10 FIG. 7. The spectrum(dB) for the first ten harmonicsof the glottal waveformsfor modal,vocalfry, falsetto,and breathyvoicetypes.All dataarefor vowel/i/except falsettowhich was/o/. 2399 d. Acoust.Sec. Am., Vol. 90, No. 5, November1991 9, where we seethat a voicejudged to be breathy has higher interharmonic noise above 2 kHz than modal or falsetto. Based on this observation, we defined a noise-to-harmonic ratio over a high-frequencyrangeto be an indicatorfor the vocalqualityof breathiness, namely, D.G. Ohildersand C. K. Lee: Analysisof vocalquality 2399 Gandv•d•h o• high (4.0), medium (2.8), and low (1.8), respectively.We wereunableto measurereliablythe WPF for breathyvoices. Our resultsimply that vocalfry registeris characterizedby a pulselikeexcitationwaveformwith a long glottal closure, while falsettoregisteris characterizedby a shortglottalclo- Hamming V•ndov I I I [ sure. G. EGG waveform features The electroglottographic signal(EGG) is generallybelievedto be representative of the amountof the lateralcontact betweenthe vocal folds (Childers and Krishnamurthy, 1985; Childers et al., 1986; Childers et al., 1990; Titze, 1990). The EGG is usefulfor registeringglottaleventsduring vocal fold vibration.When the EGG waveformis arrangedsuchthat an upward deflectionreflectsthe opening ] i'F0 I (i,,1)F0 of the glottisand a downwarddeflectiondepictsthe closing of the glottis,the sharpnegativepeaksin the differentiated EGG (DEGG) waveformhavebeenfoundto be very close FIG. 8. Frequencyintervalsfor harmonicandinterharmonicnoisecompoto the instantsof glottalclosure(if closureexists),and the nents. maximaof the DEGG are indicationsof the instantof glottal openingasin Fig. 10 (Childersetal., 1990). In thisstudy,we investigated the EGG waveformfeaturesfor varioustypesof phonation. NHR/, = , (5) Figure 11showstypicalEGG andDEGG waveformsof modal,vocalfry, falsetto,andbreathyphonations.All of the where (Ni) and (Hi) are the noiseand harmoniccompo- EGG waveformsexhibit a steeperslope (implying a rapid nentsabove2 kHz, respectively.The overall NHR, which changein vocalfold contactarea) duringthe glottalclosing wasdefinedoverthe wholefrequencyrange(0-5 kHz), was phasethan during the openingphase.This characteristic alsocomputedfor comparison. Table II liststhe analysisof phenomenonof vocalfold vibrationresultsin a sharpnegaresultsfor threevoicetypes.The NHR h (noise-to-harmonic tive pulsein the DEGG waveformat or near the instantof ratio above2 kHz) isa goodindicatorfor the existence of the maximumglottalclosure.Asidefrom thiscommonfeature, vocalqualityofbreathiness.We wereunableto measurereliablyNHRh for vocalfry. F. Temporal energy distribution The temporalenergydistributionof a speechwaveform is relatedto the glottalexcitationandhaslongbeenthought to affectvocalquality.Wendahlet al. (1963) andColeman (1963) established that the perceptionof vocalfry is related to the dampingof the speechsignalbetweenglottal excitations.To measurethe decaycharacteristicsof a speechwaveform during a singlepitch period, a parametercalled the waveformpeakfactor (WPF) wasdefinedas TABLE II. Noise-to-harmonic ratiosfor severaltypicalsubjects and three voicetypes(N = normalsubject;P = disorderedsubject). Hi-Freq. NHR Overall NHR Subject Sex Voice type N12 N13 N28 M M M modal/i/ modal/o/ modal/i/ - 5.1 dB - 4.2 - -- 4.2 -- 14.8 NI M modal/i/ -- 5.4 -- 14.8 N2 N30 N31 N32 N33 M M M M M modal/o/ modal/i/ modal/u/ modal/i/ modal/a/ -- 5.6 -4.8 -- 6.2 -- 7.1 -- 5.1 ----- wherexi istheamplitudein theith samplepointandNis the total numberof samplepointsin onepitchperiod.Theoreti- N16 N17 M M falsetto/i/ falsetto/o/ -- 8.5 -- 8.2 -- 20.4 - 20.3 cally, the WPF has a minimum value of 1 when the wave- N18 M falsetto/i/ -- 9.9 -- 23.2 form is fiat, anda maximumvalueN •/2whenthe waveform is an impulse.The WPF valueof a speechwaveformis related to the underlying glottal waveshape.For glottal waves with narrow pulsesseparatedby a long glottalclosure,the WPF valueis largeand for pulsesof longdurationthe WPF is near unity. Althoughthe averageWPF valuesfor sustainedvowels vary somewhatwith the type of vowelfor the samesubject and voicetype, a generalrule is that vocalfry, modal and falsettoregistersare characterizedby WPF valuesthat are N19 M falsetto/a/ - 8.4 - 22.1 N4 N5 N6 N7 M M M M falsetto/i/ falsetto/o/ falsetto/i/ falsetto/o/ ----- 4.3 3.9 6.1 3.3 ---- 21.7 20.1 22. I 18.5 N25 N29 P7 P12 P20 P21 M M M M F F breathy/i/ breathy/i/ breathy/i/ breathy/i/ breathy/i/ breathy/i/ -- 0.7 5.5 -- 0.4 0.2 5.9 6.2 ----- 12.6 12.7 19.6 19.8 14.8 8.9 WPF = peak amplitude = max(Ixil) (6) rmsvalue 2400 [ (l/N) Z•=ix•]•/2' d. Acoust.Soc. Am.,Vol.90, No. 5, November1991 13.8 dB 16.5 16.6 16.9 15.1 14.9 16.2 D.G. Childersand C. K. Lee:Analysisof vocalquality 2400 N12, modal [2O N16, falsetto P12, breathy •o 8O 7O (dB) (dS) Noisa-to-aar•onic 0 I0•0 2000 •atio (NRlq i) (dB) 4O Noise-to-HarmonicR•tfo 40 Notse-to-ff&cmonteRatio (8HRX) 3000 FIG. 9. The spectraand noise-to-harmonic ratio (NHR•) for threemalesubjectsindicatedin Table II for the vowel/i/for modal,falsetto,and breathyvoicetypes. FIG. 10.Thespeech, EGG, DEGG, andglottalareawaveforms forthevowel/i/. Theglottalareawaveform wasmeasured fromultrahigh-speed laryngeal filmsand wasusedasthe basisfor validatingthe estimationof the opening,closingand closedglottalphasesfrom the EGG and DEGG waveforms.The speech signal wasmonitored simultaneously withtheEGG.All waveforms weresynchronized. Theclosed phase wasmeasured astheintervalontheDEGG betweenthe negativeand followingpositivepeak. distinctiveEGG and DEGG waveformpatterns(implying differentvocalfoldcontactphenomena) werefoundto characterizethe differenttypesof phonation.For modaland vocal fry phonationsthe instantof the maximumclosingslope in the EGG waveform is closeto its minimum extension, and thus resultsin a very narrow negativepulsein the DEGG waveform.This impliesthat the instantof the DEGG negative peak is near the occurrenceof glottal closure,as ob- Figure 11 indicatesthat the vocalfold contactvaries duringglottalclosurefor the four voicetypes.We defined parameters of the EGG waveformto predictthesedifferent glottalclosurephenomena and foundwe couldusethese featuresalongwith additionalfeaturessuchaspitch,double opening/closing patternof the volumevelocitywaveform, andsinusoidal-like patternfor falsettoto distinguish thefour voicetypes(Lee, 1988).In addition,weestimated thepitch servedby Childers et al. (1983) and Childers et al. (1990). The DEGG waveforms for falsetto, on the other hand, show period (PP) and the glottal openquotient (OQ) for various muchwidernegativepulses,andthusdonotprovidereliable indicationsfor glottal closure.The EGG waveformsfor breathyvoicehavea rapid closingsegmentcloselyfollowed by the nextopeningsegment,implyinga momentaryglottal closureor eventhe absence of glottalclosure.The vocalfry EGG waveformalsoshowsthe doubleopening/closingpatternduringan individualglottalcycle,asobserved by other researchers(Moore and von Leden, 1958; Timcke et al., estimatedas the time durationbetweentwo successive nega- 1959;Whiteheadet al., 1984;Klatt and Klatt, 1990). in the DEGG 2402 J. Acoust.Sec.Am.,Vol.90, No.5, November1991 voicetypes (Childers et al., 1983). The pitch period was tive peaksin the DEGG waveform.The openquotientwas defined as OQ = duration of the open phase/pitch period, (7) wherethe openphasewas estimatedby the time duration betweena positivepeakandthe nextadjacentnegativepeak waveform. D.G. ChildersandC. K. Lee:Analysisof vocalquality 2402 N12, modal • formant synthesizers have beenfound to producenatural soundingspeechwhen the sourceexcitationwaveformattemptsto replicateboth temporaland spectralcharacteristicsof the human source.(SeeChildersand Wu, 1990,for a • __._A._ .• DEGG review.)Holmes(1973) recommended an excitationsignal basedon the secondtime derivativeof a typicalglottalvolume-velocitywaveformfor theparallelformantsynthesizer. N24, vocalf•, DEGG N18, falsetto For a glottalpulsewitha - 12-dB/octspectral slopesucha differentiated signalwouldgivean approximately flat spectrum. But aswe haveshownin the previoussection,human glottalexcitationcharacteristics vary for differenttypesof voiceproduction.To synthesize or modelnaturalsounding speech(for normalspeechor for a vocaldisorder)a more sophisticated approachis calledfor. Linearpredictivecoders are usingmore sophisticated excitationwaveformsas well (Schroeder and Atal, 1985; Trancoso and Atal, 1990). But thesemodelsstill requireextensivecomputation,e.g., Schroeder andAtal (1985) reportedthattheircodingprocedure took 125s ofCray-1 CPU time to process1 sof speech. P13, breath}, A. Source .•A.a_. . • • . t.t•. ' , ,•,.,,•. , • FIG. 11. The EGG and DEGG waveformsfor the vowel/i/for vocalfry, falsetto,and breathyvoicetypes. DEG. G modal, The averagePP and OQ valuesestimatedfrom the DEGG waveformsof variousvoicetypeswere generally lowerthanthoseestimated fromthe inversefilteredglottal waves.This was particularlyso for falsettoand breathy voices, whichhaveprogressive glottalclosures, consequently, theEGG-based technique underestimated theOQ values. Nevertheless, theresultsstillshowedthattherankingin orderof increasing OQ valueswerevocalfry, modal,falsetto, and breathyvoices,respectively. Oneof the importantadvantages of usingtheEGG isthatit canregisterthedynamic characteristics of vocal fold movements of continuous speech. III. A NEW EXCITATION MODEL Theoverallspectral envelope forvoiced speech isdeterminedby threefactors(Fant, 1960;Flanagan,1972): (1) the spectrumof the glottalexcitationpulse,(2) the vocal tracttransferfunction,and (3) theradiationof thelipsand nostrils.For thispaper,we arefocusing on thespectrum of thesource. In recentyearsresearchers havecometo recognizetheimportance of accurately modelingthesourceexcitationifa formantsynthesizer isto faithfullyreproduce natural vocal characteristics(Rosenberg, 1971; Hedelin, 1984; Fantetal., 1985;FujisakiandLjungquist,1986;Klatt, 1987; Lee and Childers,1989,Pinto et al., 1989;Klatt and Klatt, 1990;Childers andWu, 1990).Boththecascade andparallel 2403 J.Acoust. Soc.Am.,Vol.90,No.5,November 1991 excitation classifications Basically,researchers have usedthree typesof excitationfor speechsynthesis, especially for formantsynthesizers (Childersand Wu, 1990). Theseare ( 1) impulseexcitation with a glottalshapingfilter (Flanagan,1957;Rabiner,1968; Klatt, 1980; Klatt, 1987); (2) glottal waveformsobtained by inversefiltering (Rosenberg,1971; Holmes, 1973) or glottalareawaveforms(Yea et al., 1983); and (3) excitation waveformmodels(Rosenberg,1971;Fant, 1979;Ananthapadmanabha,1984;Hedelin, 1984;Fant et al., 1985;Fujisakiand Ljungqvist,1986;Klatt, 1987;Childerset al., 1989; Lee and Childers, 1989; Pinto e! al., 1989; Klatt and Klatt, 1990). For various reasons,excitation waveform modelsare flexible,easyto use,and producenatural soundingspeech (Childers and Wu, 1990). The sourcemodel (1) aboveis simpleand can easily yield a sourcespectrumwith - 12-dB/octslope(or any other slopein incrementsof - 6 dB/octave). However, the phaseand the spectralnotchesthat are oftencharacteristics of natural voicingare not reproducedwell and the vocal quality is poor (Childers and Wu, 1990). The fault with model(2) isthat theglottalwaveformisdifficultto estimate and generallyrequiresa microphonewith a high-quality, low-frequencyresponse(Childersand Wu, 1990). Furthermore,suchan excitationsourceis not suitablefor applicationin text-to-speech systems that requireprestoredsource parameters. A well-designed excitationwaveformmodelalongthe linesof model(3) above,is easyto useandcapableof producingnaturalsoundingspeechasshownby theresearchers citedabove.Klatt andKlatt (1990) haverecentlysuggested onemodel.The LF modelhasalsofoundacceptance(Fant et al., 1985).An advantage of theLF modelisthatparameters of the model can be measuredusingthe speechand EGG signals.For example,theLF model (Fant et al., 1985) requiresfour modelparametersfor a differentialglottal wave [Fig. 12(b) ]: D.G.Childers andC.K.Lee:Analysis ofvocal quality 2403 eta = 1-e -•'ø-"• (11) To relatethewaveshape parameters of theLF modelto measureddata,we expressed the openquotient(OQ) andspeed quotient(SQ) for thismodelas OQ = openphase/pitchperiod= OQLv= (te + kt• )/to, for LF model, (12) and [al SQ= opening phase u closingphase = SQ•v= tv/(te q-kta- tv), for LF model. (13) For the LF model,the instantof "glottal closure"wasdefinedastheinstantat whichtheglottalflowamplitudedrops to 1% of itspeakvalue.Basedonthisdefinition, thevalueof k isa functionof theparameter t•. Our datashowthatk has valuesin therangeof 2.0 to 3.0 when0% < t• < 10% (k = 0 whenta = 0), whereta isrepresented bya percentage of the pitch periodto. B. Glottal factors for source modeling As shownin the previoussection,four major factors werefoundto beimportantfor characterizingdifferenttypes [b] FIG. 12.Theglottalwaveform, u,•(t), isshownin (a), whiletheLiljencrantsandFant (LF) modelfor thedifferential glonalwaveform,u'n(t), appearsin (b). of voiceproduction:namely,the glottal pulsewidth, the glottalpulseskewness, theabruptness of glottalclosure,and the turbulentnoisecomponent.Thesefactorsare important in modelingthe sourceexcitationasshownbelow.Furthermore, thesefactors are related to characteristicsor features of theglottalvolume-velocity waveform, u• (t), itsderivative,u•(t), theEGG,itsderivative, DEGG,andthespeech waveform.One characteristicis depictedin Fig. 13. The main excitation of the vocal tract occurs at the instant of the tp = glottalflow peakposition, maximum negative peakof u•(t), whichcorresponds to the t,. = instant of maximum closing rate, ta = time constantof an exponentialrecovery,i.e., return phase,from the point of maximum closingdiscontinuitytowards the maximum closure. tersthe vocaltractdampingtrend.Althoughnot assignificantasthe mainexcitationpulse,whichoccurseverypitch (8) (9) pulses. ity in the model are and u•(t)= [u's(t•)/et • ] (e-•t'-'")for to<t<t0, glottalopening phase, a secondary excitation occurs thatalperiod,the secondary excitations alsocontribute to vocal quality,e.g.,whensuchexcitations are absentas in LPC synthetic speech, the voicesounds buzzy(Samburet al., 1978).The durationof glottalclosureis alsorelatedto the voicetype,e.g.,falsetto andbreathyvoices havea briefglottal closureor perhapsnoneat all. Vocalfry, on the other hand,has a long closedphasebetweenglottalexcitation The equations governing thederivativeof thevolumeveloc- u•(t) = Ae•'sin(oagt),for O<t<te instant of glottal closure.Followingglottal closure,the speech signaldecays dueto vocaltractdamping.Duringthe where •%= rr/tpisthefrequency ofthesinewave inEq.(8) andto isthepitchperiod.Parameters a ande aredefined for Another featurethat characterizesthe four voicetypes, whichwe examinedin the previoussection,is the "sharp- ofthemainexcitation pulsein u•(t), which,inturn,is computational use.The valueof a canbederivedby using ness" caused by the steepness of the closingphaseof the volume the four basicparameters listedaboveand by solvingthe velocity and by the abruptness of glottalclosurein the volequation ume-velocity waveform.Thesetime domaincharacteristics of thewaveformaffecttheslopeof thespectrum(Fantetal., •I'" u• (t)dt =0. (10) Similarly,thevalueof • canbederivedby solvingtheequation 2404 J. Aooust.Sec.Am.,Vol.90, No.5, November1991 1985). The extentof the glottalwaveshape variationsis illustratedin Fig. 14, whichshowsinverse-filtered differential D.G. ChildersandC. K.Lee:Analysis of vocalquality 2404 %(0 i ½losedl opening • Iclosing u'g(t) NIl secondary exc i tat Ions / 'V vocalfry Speech vive fore I I foreant oscillation FIG. 13.Threesynchronized waveforms: the glottalwaveform,its deriva- tive,andthespeech waveform. Indicated onthefigurearetheinstants of excitation ofthevocaltract,theopening, closed, andclosing glottalphases, FIG. 14.The inversefiltereddifferentialglottalwaveformsand the superimposed (least-mean-square error) waveformsproducedby the Liljencrantsand Fant (LF) modelfor selectedsubjects,voicetypes,and phonation for the subjectsindicatedin Table III. andtheexponential decayof thespeech waveform. glottalwaveformsfor thefour voicetypesweexamined,each waveform matched with an LF model waveform. The matchingLF model was derivedby measuringthe initial estimates oftheamplitude parameter u• (te) andthewaveshapeparameters tp,t•, andt• in thecorresponding pitch periodof the differentialglottal waveform.Theseparameterswere then adjustedusinga least-mean-squared error criteriathat minimizedthe errorbetweenthe signalandthe matching waveform • (Lee, 1988).Sometypicalmeasured waveformparameters(representedin percentages of the pitchperiod)and the corresponding factorsfor the glottal pulsewidth (OQL•) and the glottalpulseskewness(SQLv) aregivenin Table III. For the four voicetypeswe foundthat theseglottalfactorsvariedwidely,e.g.,OQ• rangedfrom 0.26 to 1.0, SQL• from 1.3 to 3.6, and t• from 0.5% to 13.3%. C. Turbulent noise component When the glottishasan imperfectclosureand the airflow rate is high, turbulentairflow is produced.This phenomenonhasbeenexaminedby others(Flanaganand Ishizaka, 1976;Isshiki etal., 1978;Pinto etal., 1989;Childers andDing, 1991). The soundpressureof the turbulentnoise is approximatelyproportional to the squareof the volume velocity of the airflow and inverselyproportionalto the cross-sectionalarea of the constriction. Furthermore, the of frequencies(2-8 kHz) with someaccentuationat about4 kHz. Our data for breathy voicesagreedwith theseresults, showinga high interharmonicnoiselevel above2 kHz. Although the existenceof high-frequencyturbulent noisehasbeenshownto bean importantfeaturefor breathihess(Yanagihara, 1967;Yumoto etal., 1982;Hiraoka etal., 1984; Hurme and Sonninen, 1985), the role of a turbulent noisesourcein syntheticspeechhasonly recentlyreceived TABLE III. Typicalwaveshape parameters(basedon the LF model) for glottalwaveforms forthefourvoicetypesfora fewmalesubjects (N = normalsubject; P -- disordered subject),to (pitchperiod) in numberofsample points(sampling period= 0.1ms).t•, t,,,andt• inpercentage ofthepitch period. Subject Voicetype N14 N27 N3 modal/i/ modal/i/ modal/o/ N23 Nll slightvocalfry/o/ vocal fry/o/ NI9 N7 P8 P13 to tp t,, t• OQu- SQLr 94 49% 79 53% 65 51% 64% 71% 68% 2.1% 2.5% 1.5% 0.67 0.75 0.69 2.7 2.4 2.8 119 49% 220 20% 63% 25% 0.8% 0.5% 0.64 0.26 3.3 3.6 falsetto/o/ falsetto/o/ 29 57% 47 62% 77% 89% 13.3% 4.3% 1.00 0.98 1.3 1.7 breathy/i/ breathy/i/ 73 48% 50 58% 68% 84% 6.8% 0.81 10.0% 1.00 1.5 1,4 energyof theturbulentnoiseisdistributedovera widerange 2405 J. Acoust.Soc.Am.,Vol. 90, No. 5, November1991 D.G. Childorsand C. K. Lee:Analysisof vocalquality 2405 seriousconsideration(Pinto et al., 1989;Lee and Childers, 1989; Klatt and Klatt, 1990). Waveform M•I D. Existing source models I Tn I Dn and noise component (lower trace). To synthesizenatural-sounding speechwith various qualitycharacteristics, a voicesourcemodelmusthavecontrollableparameters that are importantfor perception.As discussed above,we consideredglottalpulsewidth, glottal pulseskewhess, the abruptness of glottalclosure,anda turbulentnoisecomponentas importantfeaturesfor a source excitationmodel.Severaltypicalglottal waveformmodels and voice sourcemodelsalready exist (Rosenberg,1971; Fant, 1979; Hedelin, 1984; Ananthapadmanabha,1984; Fant et al., 1985;Fujisakiand Ljungqvist,1986;Klatt, 1987; FIG. 15.Blockdiagramof thenewsource modelandan example of the generatedexcitation wave. Klatt and Klatt, 1990). All of these models allow variable glottalpulsewidth and skewness. Only onemodelusesa turbulent noisecomponent(Klatt, 1987;Klatt and Klatt, 1990). But few detailsof the turbulentnoisegeneratorare given.Three glottalwaveformmodels(Fant et al., 1985; Ananthapadmanabha,1984;FujisaMandLjungqvist,1986) canvary theabruptness of the glottalclosure,whilethe others (Rosenberg,1971; Fant, 1979;Hedelin, 1984; Klatt, 1987) alwaysgeneratean abruptglottal closureafter the instantof maximumclosingslope.No sourcemodel incorporatesall of the four factorswe considered importantfor characterizingvocalquality. Exceptfor the turbulentnoise component,the LF modelhasthe capabilityof varyingthe threeglottalwaveshape factorswith onlyfour modelparameters,andaswehaveshowncanapproximatea widerangeof glottalexcitationwaveforms. E. A new experimental source model Basedon thedatafrom thisstudy,we designed theexcitationmodelshownin Fig. 15,whichconsists of two components: (1) a glottal pulse generator,and (2) a turbulent noisegenerator. I. GIottal pulse generator We adaptedthe LF model (Fant et al., 1985) for the glottalpulsegeneratorbecausethe LF modelcan approximate a wide range of pulse widths, pulse skewness,and abruptnessof glotta1closure.Furthermore,the parameter valuesfor the LF model can be derivedby usingthe inverse filtered differentialglottal waveformand/or the EGG signal. trum-shapingfilter was designedto simulatethe spectral characteristics of the glottalturbulentnoise.The presentdesign,basedonourdataandthatofIsshikietal. (1978) usesa high-pass filter with a 3-dBfrequencyof 2 kHz. This differs from that suggested by Klatt and Klatt (1990). The amplitude modulatorsimulatesthe amplitudefluctuationsof the glottalturbulentnoisedue to the variationsin airflowand glottalareaduringvocalfold vibration.A pitch-modulated squarewavewith an adjustableduty cycleis presentlyused, but is easilymodified.Two parametersare usedto control the startingposition( T, ) of the turbulentnoiseanditsduty cycle(D,). The parameterT,, designates thelocationof the onsetof the turbulentnoisesourcerelativeto the beginning of the pitchperiod,e.g., T, = 75% denotesa noiseonsetat 3/4 of the pitch periodof the glottalexcitationwaveform. The parameter D• controlsthedurationof thenoisesource, e.g.,D, = 50% denotes thatthenoisesourceisonthe 1/2 of thepitchperiodof theglottalexcitationwaveform.At present,theseparametersarespecified by thedesigner,but in the future they might be measuredfrom the speechsignal.Besidessimulatingbreathiness, theturbulentnoisegeneratoris alsosuitablefor producingexcitationsignalsfor voicedfricatives,wherethe turbulentnoiseis generatedat a constriction in the vocal tract. Our observationssuggestthat the amplitude modulation factor is important for voicedfricarives. IV. PERCEPTUAL EVALUATION A. Method 2. Turbulent noise generator The turbulent noisegeneratorconsistsof a random numbergenerator,a spectrum-shaping filter, and an amplitude modulator. The random number generatorproduces The tokensfor the listeningtestsweresynthesized using a cascadeformant synthesizerof our own design(Pinto et al., 1989) but basedon Klatt's synthesizer(Klatt, 1980). For the listeningtestsreportedhere,all tokenswerethe sus- random noise with a normal distribution and a flat spec- tained vowels/i/and/o/. trum. The amplitudelevel of the randomnoisewas controlledby a parameter,A,,, that specifiedthe ratio of the energyof the noiseto the energyof the glottalpulsewaveform. This ratio is small, typicallyabout0.25%. The spec- wereadjustedto havethe sameenergylevel.While we were ableto synthesize sentences, thisinvolvedadditionalfactors suchasphonemetransitionsthat werenot part of thisstudy. The tokensfor the listeningtestsweregeneratedby comput- 2406 J. Acoust.Soc. Am.,Vol. 90, No. 5, November1991 All tokens were 2 s in duration; all D.G. Childersand C. K. Lee:Analysisof vocalquality 2406 er via a digital-to-analog converterandpresented via headphonesin an IAC single-wallsoundbooth.The tokenswere arrangedinto groups,eachgroupconsisting of four vowel tokens.For eachglottalsourcefactorunderinvestigation, a groupof syntheticvowelswasproducedby progressively changing theselected sourceparameter whilemaintaining the formantstructurefixed.A groupof voweltokenswas createdby varyingoneglottalexcitationsourceparameter overa rangeof four valueswhileall otherparameters were heldfixed.The vowels/i/and/o/were synthesized using knownformantstructures whiletheglottalexcitationsource characteristics werevaried.Forexample, wevariedto (the glottalopeningtime), te (the instantof maximumclosing rate), andto (the time factorfor controllingthe abruptness of glottalclosure).Eachof theseparameters wasprogressivelyvariedin four stepswhiletheotherparameters were heldfixed.Thuswecreateda groupof foursynthesized vowelswith onlyoneparameterbeingvariedin four steps.The $Q and OQ were varied in a similar mannerand the same vowelswerealsosynthesized. The judgesfor the listeningtestswerethreeprofessors fromtheDepartmentof Speechat the Universityof Florida. Each judge was familiar with syntheticspeech.Furthermore,eachjudgewasan expertin voiceevaluationin a clinicalsetting.Onlythreejudgeswereusedfor thelisteningtests sincewe soughtpreliminaryguidancewith respectto the effectiveness of our glottal sourcemodel for synthesizing threevoicequalities.Eachjudgewasaskedto perceptually rate the various tokens in the manner described below. The judgeswere not informedas to the mannerby which the tokensweresynthesized; i.e., theydid not know the excitationsourceparameters that werebeingvariednor theorder in whichthe synthesized tokenswerepresented, although this was not a factor sincethey could listento the vowel tokenswithin a groupas many timesas they desiredbefore indicatinga ratingof the voweltokens.Thejudgesweregiven three termsto describeand, thereby,rate the threevoice qualitieswe synthesized,namely,naturalness,breathiness, and hypo-/hyperfunction.Naturalnesswasdefinedas "human sounding."The judgeswere familiar with breathiness andgenerallyagreedthat a breathyvoiceis usuallyassociated with an incompleteglottalclosureduringvocalfold vibration,suggesting thattheimportantaudiblecomponent of the soundis the noisethat is producedby the escapage of turbulentairflowat the glottis.The judgesalsoagreedthat the voicequalityof hypo-or hyperfunctionis relatedto the perceptualsensationof vocaleffort.Hyperfunctionwasused to describea strainedor tensevoicequality,as if the vocal foldswerecompressed duringphonation.Hypofunction was describedas too little tensionin the vocal folds, therebyresultingin an thin or lax voicequality.For eachgroupof four voweltokensthejudgesratedthe stimulion a 7-pointscale hypo-/hyperfunction scalerepresented the extremehyperfunctionwhilezerorepresented the extremehypofunction and three represented a normalvoicequality.The judges wereallowedto trainonbothsynthesized andspokentokens for thethreevoicequalitiesof naturalness, breathiness, and hypo-/hyperfunction. Thisallowedeachjudgeto establish a correspondence betweentheir own perceptualevaluation and the rating scalethey were to use. B. Results For the rating of the degreeof hypo-/hyperfunction, i.e., the lax/tense vocal quality of the vowel tokens,the re- sultswereasfollows.Withto andt,,fixed,asto variedfrom 58% to 36% the synthesized voicewasjudgedto go from hyperfunctional to hypofunctional. The perceptualratings by the threejudgeswerein generalagreementon the seven pointscale. Theratingforto = 58%was(5, 4, 6) forjudges 1,2, and3, respectively. Whento= 36%,theratingwas(0, 1,0). Withto andtofixedandL,varyingfrom55%to 85%, the synthesizedvoicewasjudgedto go from slightlyhyperfunctional to hypofunctional. The judges' ratings for te = 55% was (4, 4, 6), while that for t,. = 85% was (0, 0, 0). A resultsimilarto thelatterwasobtained withtpand fixedbut with t,, varyingfrom 0% to 10%. The ratingsfor to = 0% was (5, 4, 5) and was (0, 1, 0) for t• = 10%. The perceptualsensationof vocaleffortwascloselyrelated to the speedquotient,SQ. A highSQ (7.3) produceda broad peak in the spectrumat high frequencies, which was perceivedby thejudgesas a tenseor hyperfunctionalvocal quality. The judges' ratings were (5, 4, 6). On the other hand,a smallSQ ( 1.2) produceda steeplydecliningspectral slopefor the source,which resultedin a lax or hypofunctionalvocalquality.Thejudges'ratingwere(0, 1,0). An SQ of 3.0 producedtokensthat werejudgedasnormalin voice quality.The rating was (3, 2, 3). The openquotient,OQ, wasnot very usefulfor predictingthe vocalqualityof hypo/hyperfunctionalvoice. A numberof experimentswereconductedsynthesizing a breathy voice by varying aspectsof the turbulent noise component.In oneexperimentfour typesof turbulentnoise were tested:(1) continuousnoisewith a flat spectrum,(2) thenoisetimewaveformsetto 50% of thepitchperiodwith a flat spectrum,(3) continuoushigh-passfilterednoise,and (4) high-pass filterednoisewith the time durationbeingset to 50% of the pitchperiod.Other experimentsevaluatedthe perceptualeffectof varying the duration (in percentageof thepitchperiod)of thenoisecomponent aswellasvarying thelocationof thenoisesourcewithinthepitchperiod.We alsoexaminedthe correlationbetweenthe degreeof perceptual breathiness and the noise-to-harmonic ratio The find- ingsfrom theseexperimentswereas follows. ( 1) Amplitude modulationof the turbulentnoisesource Only onetypeof ratingwasmadeat a time,e.g.,naturalness is important for achievinga natural soundingsynthetic orbreathiness or hypo-/hyperfunction. Ifa particulargroup breathyvoice.Our resultssuggestthat a duty cycle,D,,, of from zero to six (Prosek et al., 1987; Eskenazi et al., 1990). of tokens was to be rated for two voice qualities, then the about 50% (or slightly greater) is preferred. judgeslistenedto the groupof tokenson two separateoccasions.A ratingof sixrepresented the highestdegreeof quality for naturalnessand breathiness. A rating of six on the (2) Although the locationwithin a pitch periodof the noiseproductionwasnot critical,the perceptionof naturalnesswas improvedwhen the noisesourcewaslocatednear 2407 J. Acoust.Soc.Am.,Vol.90, No.5, November1991 D.G. ChildersandC. K. Lee:Analysisof vocalquality 2407 the time at which maximum glottal closureoccurredor slightlyafter,i.e.,T, occurred atabout75% oftheexcitation pitchperiod(Lee and Childers,1990;Childersand Ding, featuresin the frequencydomainincludedthe slopeof the 1991). monic ratio. The results showed that the sensation of vocal spectrumandtherelationships betweenthefundamentalfrequencyand higher harmonicsas well as the noise-to-har- effortiscloselyrelatedto twoparameters of theglottalwaveform:thespeedquotient(a timedomainparameter)andthe spectralslope(a frequencydomainparameter).A largeSQ of the fundamentalfrequency. (7.3) producesenergyat higherfrequencies, therebycaus(4) The degreeof perceptual breathiness wasprimarily ing the perceptualsensationof tenseor hyperfunctional determinedby the noise-to-harmonic ratio at frequencies voicequality.On the otherhand,a smallSQ (1.2) causesa steep-falling spectralsloperesultingin a lax or hypofuncabove2 kHz, The noisepulseenergy,A,, wasabout0.25%. The NHR h wascontrolled by varyingtheharmonics of the tional voice quality. The resultsof the listeningtestsalso glottalexcitation pulse. Thiswasdonebyvaryingtp,t•, and revealedthat the degree of perceptualbreathinesswas stronglycorrelatedwith thenoise-to-harmonic ratio above2 t, beingthe major parameter. of the turbulent noise Other experiments wereconductedto synthesize vocal kHz. The temporalcharacteristics sourcewerealsofoundto beimportantfor producinga natufry andfalsetto.Theseexperiments, however,werefoundto requireparameters thatwerenottheprimaryinterestof this ral-soundingvoice or a breathy voice.The three major pastudy,e.g.,F0, F0 perturbations, and multipleexcitation rameters of the glottal excitation model are: (1) T,--the pulsesfor vocalfry. Consequently, we do not reportthese locationof the onsetof the turbulentnoisesourceas a perresults here. centageof the pitch period (typically 75% ); (2) D,the duty cycle of the turbulent noisesource(typically 50%); and (3) A,--the ratio of the noiseenergyof the turbulent V. DISCUSSION AND CONCLUSIONS noisesourceto the energyof the glottal excitationpulse The four voicetypes (modal, vocalfry, falsetto,and (typically0.25%). The NHR hfor themodeliscontrolledby breathy)werefoundto becharacterized by four majorfac- the shapeof the glottalpulsegeneratedby the model.The tors:pulsewidth,pulseskewness, the abruptness of glottal key parameteris closure,and turbulent noise.The effectivenessof the four Sincethejudges'ratingsfor the variouslisteningtasks factorsfor synthesizing a particularvocalqualitywasevalu- werefoundto generallyagreewith oneanother,we feelthat ated usinga new sourceexcitationmodelwith a formant othersimilarlyskilledjudgeswouldhaveratedthe tasksin a synthesizer. Otherfactorsincludedtheglottalspectral slope, similarmanner.The listeningtestresultsalsoimply that on theharmonicrichness factor,andthewaveformpeakfactor. the wholethe glottalexcitationmodelandits associated paTypicalmeasured valuesforeachofthesefactorsareindicat- rameterswereeffectivein synthesizing the threevoicequalied in Table IV. ties of naturalness,breathiness, and hypo-/hyperfunction. Fromtheperceptual listeningevaluations of thesynthe- However,the listeningtestresultsshouldnot be interpreted sizedvoweltokens,we found that the major distinguishing as determiningthe bestchoiceof glottal excitationparam- (3) High-pass filteringof theturbulentnoiseisnotcriticalfor theperception of breathiness sincetheeffectof noise in the low-frequencyregionis maskedby strongharmonics TABLE IV. Summaryof the source-related featuresand their typicalvalues. Voicetypes Features Modal voice Vocalfry Falsetto Breathyvoice Glottal Pulsewidth medium short long long waveform (OQ) Pulseskewing (0.70) medium (0.45) high (0.99) low (0.91) low (SQ) Abruptness of closure (t.) (2.6) abrupt closure (3.5) veryabrupt closure (1.5) progressive closure (1.4) progressive closure (2.0%) (0.7%) (8.8%) (8.4%) Glottal Spectralslope medium slight steep steep spectrum (riB/oct) Harmonic richnessfactor (dB) ( -- 12) medium (--9.9) (--6) high (2.1) ( -- 18) low (- 19.1) ( - 18) low ( - 16.7) Speech Turbulent noise low low low high features (NHR,, in dB) ( -- 5.3) (unableto ( -- 6.6) (2.8) low measure) Waveform peak medium high low factor (2.8) (4.0) (1.8) (unableto measure) 2408 d. Acoust.Sec.Am.,Vol.90, No.5, November1991 D.G. ChildorsandC. K. Leo:Analysisof vocalquality 2408 etersoreventhebestglottalmodelforthethreevoicequalitiesweconsidered. Our results areonlyvalidfor theexcitation modeland associated parameters we used.Another modelmayyielddifferentlistening testresults. Thecurrentlyavailable methods for estimating glottal wavesarerestrictive in onewayor another.Thisstudyhas shown thatit ispossible to measure parameters thataresensitivetosource features fromboththespeech andEGGsignals.However,theseprocedures needfurtherdevelopment especially forconnected speech, wherevarious prosodic patternsare usedto expressdifferenttypesof statements. Thus an important extension of thisstudywouldbeto develop methods formeasuring voicesource dynamics forconnected speech. Forexample, various intonation andstress patterns may be correlatedto sourceparametersother than fundamentalfrequency andtiming.The ultimatepracticalgoalis to develop a source modelandsynthesis rulesthatcanproducenatural-sounding syntheticspeechwith desiredvocal and tonal characteristics. [Only preliminaryexperiments havebeenconducted onsynthesizing connected speech usingthemethodsreportedhere(Pintoet al., 1989;Childers andWu,1990).] Wealsonotethatdifferences inphysiologicalstructure, socialcustoms, gender, andagemaygiverise to distinctivephonationpatternsthatresultin variousvoice characteristics. As a finalnote,we mentionthat the knowledge gained fromthisstudymightbenefit applications ofspeech recognitionandspeaker identification. For example, oneapplica- Childers,D. G., and Krishnamurthy,A. K. (1985). "A critical reviewof electroglottography," CRC Crit. Rev. Bioeng.12, 131-164. Childers,D. G., Naik, J. M., Larar, J. N., Krishnamurthy,A. K., and Moore,G. P. (1983). "Electroglottography, speech andultra-highspeed cinematography,"in VocalFoldPhysiology: Biomechanics, Acoustics and PhonotoryControl, edited by I. R. Titze and R. C. Scherer (Denver Centerfor the Performing Arts, Denver,CO), pp.202-220. Childers,D. G., andWu, K. (1990). "Qualityofspeech produced byanalysis-synthesis," SpeechCommun.9, 97-117. Childers,D. G., andDing, C. (1991). "Articulatorysynthesis: nasalsounds and male and female voices," J. Phon. 19, 453-464. Childers,D. G. andWu, K. ( 1991). "Genderrecognitionfrom speech,Part II: Fine analysis,"J. Acoust. Soc.Am. 90, 1828-1840. Coleman, R. F. (1960). "Some acousticcorrelatesof hoarsehess,"Master's thesis,VanderbiltUniversity,Nashville,TN. Colton,R. F. (1969). "Someacousticand perceptualcorrelatesof the modal and falsettoregisters,"Ph.D. dissertation,University of Florida, Gainesville, Fl. Colton, R. H. (1973a). "Vocal intensityin the modaland falsettoregisters," Folia Phoniatr. 25, 62-70. Colton,R. H. (1973b). "Someacousticparametersrelatedto the perception of modal-falsettovoicequality," Folia Phoniatr.25, 302-31 I. Colton, R. H., and Hollien, H. (1972). "Phonationalrangein the modal and falsettoregisters,"J. SpeechHear. Res. 15, 708-713. Colton, R. H., and Hollien, H. (1973). "Perceptualdifferentiationof the modaland falsettoregisters,"Folia Phoniatr.25, 270-280. Damste,H., Hollien, H., Moore, G. P., and Murry, T. (1968). "An x-ray studyof vocalfold length,"Folia Phoniatr.20, 349-359. Davis, S. B. (1979). "Acousticcharacteristics of normal and pathological voices,"in Speechand Language:Advancesin BasicResearchand Practice,editedby N. Lass (Academic,New York), Vol. 1, pp. 271-335. Deal, R. E., and Emanuel,F. W. (1978). "Somewaveformand spectral featuresof vowelroughness,"J. SpeechHear. Res.21, 250-264. Eskenazi, L., Childers, D. G., and Hicks, D. M. (1990). "Acoustic corre- latesof vocalquality," J. SpeechHear. Res. 33, 298-306. Fant, G. (1960). AcousticTheoryof SpeechProduction(Mouton, Paris). tionwouldbeto examinewaysto studytheextractionand Fant, G. (1979). "Glottal sourceand excitationanalysis,"QuarterlyProuseofsource parameters forspeaker adaptive signalprocess- gressand StatusReport(SpeechTransmissionLaboratory,Royal Instiing to improvethe reliabilityof a speaker-independent tute of Technology,Stockholm,Sweden),Vol. 1, pp. 85-125. Fant, G., Liljencrants,J., andLin, Q.-G. (1985). "A fourparametermodel speechrecognitionsystem. ACKNOWLEDGMENTS This work was supportedin part by NIH Grant No. NIDCD DC 00577 with additionalsupportfrom the Universityof FloridaCenterfor ExcellenceProgramin Information Transfer and Processingand also from the MindMachine Interaction Research Center. I Thisissimilar toa procedure adopted byTitze(1984).Note,however, thata least-mean-square errorcriterion is notguaranteed to produce a glottalwaveform excitation modelthatmaygenerate perceptually the mostnaturalsounding syntheticspeech. Allen,E. L.,andHollien, H. (1973)."A laminagraphic study ofpulse(vocalfry) registerphonation,"Folia Phoniatr.25, 241-250. Ananthapadmanabha, T. V. (1984)."Acoustic analysis ofvoicesource dynamics," in Quarterly Progress andStatusReport(Speech Transmission Laboratory, RoyalInstituteofTechnology, Stockholm, Sweden), Parts2 and 3, pp. 1-24. Askenfelt, A. G.,andHammarberg, D. (1986)."Speech waveform perturbationanalysis: a perceptual-acoustical comparison of sevenmeasures," J. SpeechHear. Res.29, 50-64. Boone, D. R. (1971).TheVoice andVoice Therapy (Prentice-Hall, Englewood Cliffs, NJ). Childers, D. G., Hicks,D. M., Moore,G. P.,andAlsaka,Y. A. (1986)."A modelforvocalfoldvibratory motion, contact area,andtheelectroglottogram," J. Acoust. Soc.Am. 80, 1309-1320. Childers,D. G., Hicks,D. M., Moore,G. P., Eskenazi, L., andLalwani,A. L. (1990)."Electroglottography andvocalfoldphysiology," J. Speech Hear. Res. 33, 245-254. 2409 J. Acoust. Soc.Am.,Vol.90,No.5, November 1991 of glottalflow," STL-OPSR4, 1-13. Flanagan,J. L. (1957). "Note on the designof terminal-analogspeech synthesizers,"J. Acoust. Soc.Am. 29, 306-310. Flanagan, J. L. (1972). SpeechAnalysis',Synthesisand Perception (Springer-Verlag,New York), 2nd ed. Flanagan,J. L., andIshizaka,K. (1976). "Acousticgeneration of voiceless excitationin a vocalcord-vocaltract speechsynthesizer,"IEEE Trans. Acoust. SpeechSignal Process.24, 163-170. Fujisaki,H., and Ljungqvist,M. (1986). "Proposalandevaluationof modelsfor the glottal sourcewaveform,"Proc. IEEE Int. Conf. Acoust. SpeechSignalProcess.1605-1608. Hedelin, P. (1984). "A glottal LPC-vocoder," Proc. IEEE Int. Conf. Acoust.SpeechSignalProcess.1.6.1-1.6.4. Heiberger,V. L., and Horii, Y. (1982). "Jitter and shimmerin sustained phonation,"in Speechand Language:Advancesin BasicResearchand Practice,editedby N. Lass(Academic,New York), Vol. 7, pp. 299-322. Hiller, S. M., Laver,J., and Mackenzie,J. (1983). "Automaticanalysisof waveformperturbations in connected speech,"Work in Progress,Dept. of Linguistics,EdinburghUniversity,16, 40-68. Hillman, R. E., Osterle, E., and Feth, L. L. (1983). "Characteristics of the turbulent noise source," J. Acoust. Soc. Am. 74, 690-694. Hirano,M., Hibi, S., Terasawa,R., and Fujiu, M. (1985). "Relationship betweenaerodynamic, vibratory,acousticand psychoacoustic correlates in dysphonia,"Symposium on VoiceAcousticandDysphonia,Gotland, Sweden. Hiraoka, N., Kitazoe, Y., Ueta, H., Tanaka, S., and Tanabe, M. (1984). "Harmonic-intensity analysisof normal and hoarse voices," J. Acoust. Soc. Am. 76, 1648-1651. Hollien, H. (1974). "On vocalregister,"J. Phon. 2, 125-144. Hollien,H., Coleman,R. F., andMoore,P. (1968). "Stroboscopic laminagraphyof the larynx duringphonation,"Acta Oto-laryngol.LXV, 209215. Hollien, H., and Colton, R. H. (1969). "Four laminagraphicstudiesof vocal fold thickness," Folia Phoniatr. 21, 179-198. D.G. Childers andC. K.Lee:Analysis ofvocalquality 2409 Hollien,H., Girard,G. T., andColeman,R, F. (1977). "Vocalfoldvibratorypatterns ofpulseregister phonation," FoliaPhoniatr.29,200-205. Hollien,H., andMichel,J. F. (1968). "Vocalfry asa phonationalregister," J. SpeechHear.Res.11, 600-604. Holmes,J. N. (1973). "Theinfluence of giottalwaveformon thenaturalhess ofspeech froma parallelformantsynthesizer," IEEE Trans.Audio Electroacoust.AU-21, 298-305. Horii,Y. (1980). "Vocalshimmerin sustained phonation," J.Speech Hear. Res. 23, 202-209. patternin thenormallarynx,"FoliaPhoniatr.10, 205-238. Pinto, N. B., Childers,D. G., and Lalwani, A. L. (1989). "Formant speech synthesis: improvingproductionquality,"IEEE Trans.Acoust.Speech SignalProcess.37(12), 1870-1887. ProseLA. R., Montgomery,B. E., andHawkins,D. B. (1987). "An evaluation of residuefeaturesas correlatesof voice disorders," 1. Commun. Disord. 20, 105-117. Rabiner,L. R. (1968). "Digital-formantsynthesizer for speechsynthesis studies," J. Acoust. SOC.Am. 43, 822-828. ty: listeningtestsand long-termspectrumanalyses," Symposium on Rabiner, L. R., and Schafer,R. W. (1978). Digital Processing of Speech Signals(Prentice-Hall, EnglewoodCliffs, N J). VoiceAcousticsand Dysphonia,Gotland,Sweden. Rosenberg, A. E. (1971). "Effectof glottalpulseshapeon the qualityof Hurme, P., and Sonninen,A. (1985). "Normal anddisorderedvoicequali- Isshiki,N., Kitajima,K., Kojima,H., andHarita,Y. (1978). "Turbulent noisein dysphonia,"Folia Phoniatr.30, 214-224. Kasuya,H., Ogawa,S.,andKikuchi,Y. (1986). "An acoustic analysis of pathologic voiceanditsapplication totheevaluation oflaryngeal pathology,"SpeechCommun.5, 171-181. Kitzing,P. (1982). "Photo-andelectrophysiological recording of the laryngeal vibratory patternduringdifferent registers," FoliaPhoniatr. 34, 234-241, Klatt, D. H. (1980). "Softwarefor a cascade/parallel formantsynthesizer," J. Acoust. Soc.Am. 67, 971-995. Klatt, D. H. (1987). "Reviewoftext-to-speech conversion for English,"$. Acoust. Soc.Am. 82, 737-793. Klatt, D. H., andKlatt, L. C. (1990). "Analysis,synthesis, andperception of voicequalityvariations amongfemaleandmaletalkers,"J. Acoust. Soc. Am. 87, 820-857. Krishnamurthy, A. K., andChilders,D. G. (1986). "Two-channel speech analysis," IEEE Trans.Acoust.Speech SignalProcess. ASSP-34,730743. Ladefoged, P. ( 1975)..4Course in Phonetics ( HarcourtBraceJovanovich, New York), pp. 121-124. Laver,J. (1980). ThePhonetic Description of VoiceQuality(CambridgeU. P., New York), pp. 115-117. Laver,J.,andHanson,R. (1981). "Describing thenormalvoice,"in Evaluationof Speech in Psychiatry, editedby J. Darby (GruneandStratton, New York), pp. 51-78. Lee,C. K. (1988). "Voicequality:Analysisandsynthesis," Ph.D. dissertation, Universityof Florida,Gainesville,Fl. Lee,C. K., andChilders,D. G. (1989). "Someacoustical perceptual and physiological aspects of vocalquality,"in VocalFoldPhysiology, edited by B. HammarbergandJ. Gaulfin(Raven,New York) (in press). Monsen,R., andEngebretson, M. (1977). "Studyof variations in themale andfemaleglottalwave,"J. Acoust.Soc.Am. 62, 981-993. Moore,G. P. (1975). "Observation onthephysiology of hoarseness," Proceedings ofthe4thInternational Congress ofPhonetic Science, Helsinki, Finland,pp. 92-95. Moore,P., andyonLeden,H. (1958). "Dynamicvariationsofthevibratory 2410 J. Acoust. Soc.Am.,Vol.90,No.5, November 1991 natural vowels," J. Acoust. Soc. Am. 49, 583-590. Sambur,M. R., Rosenberg, A. E., Rabiner,L. R., and McGonegal,C. A. (1978). "On reducingthebuzzin LPC synthesis," J. Acoust.Soc.Am. 63, 918-924. Schroeder,M. R., and Atal, B. S. (1985). "Code-excitedlinear prediction (CELP): high quality speechat very low bit rates,"Proc. IEEE Int. Conf.Acoust.SpeechSignalProcess. 937-940. Timcke,R., yonLeden,H., andMoore,P. (1959). "Laryngealvibrations: measurements of theglotticwave,"Arch. Otolaryngol.69, 438 44•.. Titze, I. R. (1984). "Parameterization of the glottalarea,glottalflow,and vocal fold contactarea," J. Acoust. Soc.Am. 75(2), 570-580. Titze, I. R. (1990). "Interpretationof theelectroglottographic signal,"J. Voice 4, 1-9. Trancoso,I. M., and Atal, B. S. (1990). "Efficientsearchprocedures for selectingthe optimuminnovationin stochastic coders,"IEEE Trans. Acoust.SpeechSignalProcess. 38, 385-396. Wendahl,R. W. (1963). "Laryngealanalogsynthesis of harshvoicequality," Folia Phoniatr. 15, 241-250. Wendahl,R. W. (1966). "Laryngealanalogsynthesis ofjitter andshimmer auditoryparametersof harshness," Folia Phoniatr.18, 98-108. Wendahl,R. W., Moore, P., and Hollien, H. (1963). "Commentson vocal fry," Folia Phoniatr. 15, 251-255. Whitehead,R. L., Mets, D. E., and Whitehead,B. H. (1984). "Vibratory patternsof the vocalfoldsduringpulseregisterphonation,"J. Acoust. Soc. Am. 75, 1293-1297. Wolfe, V. I., and Steinfatt, T. M. (1987). "Prediction of vocal severity within and acrossvoicetypes,"J. SpeechHearingRes.30, 230-240. Wu, K., and Childers, D. G. (1991). "Gender recognitionfrom speech, Part I: Coarseanalysis,"J. Acoust.Soc.Am. (in press). Yanagihara,H. (1967). "Significance of harmonicchanges andnoisecomponentsin hoarseness," J. SpeechHear. Res.10, 531-541. Yea, J. J., Krishnamurthy,A. K., Naik, J. M., andChilders,D. G. (1983). "Glottal sensing for speechanalysisandsynthesis," Int. Conf. Acoust. SpeechSignalProcess.1332-1335. Yumoto, E., Gould, W., and Baer,T. (1982). "Harmonic-to-noiseratio as anindexofthedegreeof hoarsehess," J.Acoust.Soc.Am. 71, 1544-1550. D.G.'Childers andC. K.Lee:Analysis ofvocalquality 2410
© Copyright 2026 Paperzz