CS 224S / LINGUIST 285
Spoken Language Processing
Andrew Maas
Stanford University
Spring 2017
Lecture 14: Text-to-Speech I: Text Normalization, Letter to Sound, Prosody
Original slides by Dan Jurafsky, Alan Black, & Richard Sproat
Outline
History, Demos
Architectural Overview
Stage 1: Text Analysis
Text Normalization
Tokenization
End of sentence detection
Homograph disambiguation
Letter-to-sound (grapheme-to-phoneme)
Prosody
Dave Barry on TTS
“And computers are getting smarter all the time; scientists tell us that soon they will be able to talk with us.
(By “they”, I mean computers; I doubt scientists will ever be able to talk to us.)”
History of TTS
Pictures and some text from Hartmut Traunmüller’s website:
http://www.ling.su.se/staff/hartmut/kemplne.htm
Von Kempelen 1780; b. Bratislava 1734, d. Vienna 1804
Leather resonator manipulated by the operator to try and copy vocal tract configuration during sonorants (vowels, glides, nasals)
Bellows provided airstream, counterweight provided inhalation
Vibrating reed produced periodic pressure wave
Von Kempelen:
Small whistles controlled consonants
Rubber mouth and nose; nose had to be covered with two fingers for non-nasals
Unvoiced sounds: mouth covered, auxiliary bellows driven by string provides puff of air
From Traunmüller’s website
Closer to a natural vocal tract: Riesz 1937
Homer Dudley 1939 VODER
Synthesizing speech by electrical means
1939 World’s Fair
Homer Dudley’s VODER
• Manually controlled through complex keyboard
• Operator training was a problem
An aside on demos
That last slide exhibited:
Rule 1 of playing a speech synthesis demo:
Always have a human say what the words are right before you have the system say them
The 1936 UK Speaking Clock
From http://web.ukonline.co.uk/freshwater/clocks/spkgclock.htm
The UK Speaking Clock
July 24, 1936
Photographic storage on 4 glass disks: 2 disks for minutes, 1 for hour, 1 for seconds.
Other words in sentence distributed across 4 disks, so all 4 used at once.
Voice of “Miss J. Cain”
A technician adjusts the amplifiers of the first speaking clock
From http://web.ukonline.co.uk/freshwater/clocks/spkgclock.htm
Gunnar Fant’s OVE synthesizer
Of the Royal Institute of Technology, Stockholm
Formant Synthesizer for vowels
F1 and F2 could be controlled
From Traunmüller’s website
Cooper’s Pattern Playback
Haskins Labs for investigating speech perception
Works like an inverse of a spectrograph
Light from a lamp goes through a rotating disk then through spectrogram into photovoltaic cells
Thus amount of light that gets transmitted at each frequency band corresponds to amount of acoustic energy at that band
Cooper’s Pattern Playback
Pre-modern TTS systems
1960’s: first full TTS: Umeda et al (1968)
1970’s:
Joe Olive 1977: concatenation of linear-prediction diphones
Texas Instruments Speak and Spell, June 1978
Paul Breedlove
Types of Synthesis
Articulatory Synthesis:
Model movements of articulators and acoustics of vocal tract
Formant Synthesis:
Start with acoustics, create rules/filters to create each formant
Concatenative Synthesis:
Use databases of stored speech to assemble new utterances.
Diphone
Unit Selection
Parametric (HMM) Synthesis:
Trains parameters on databases of speech
Text modified from Richard Sproat slides
1980s: Formant Synthesis
Were the most common commercial systems when computers were slow and had little memory.
1979 MIT MITalk (Allen, Hunnicut, Klatt)
1983 DECtalk system based on Klattalk
“Perfect Paul” (the voice of Stephen Hawking)
“Beautiful Betty”
1990s: 2nd Generation Synthesis
Diphone Synthesis
Units are diphones; middle of one phone to middle of next.
Why? Middle of phone is steady state.
Record 1 speaker saying each diphone: ~1400 recordings
Paste them together and modify prosody.
3rd Generation Synthesis
Unit Selection Systems
Most current commercial systems.
Larger units of variable length
Record one speaker speaking 10 hours or more
Have multiple copies of each unit
Use search to find best sequence of units
Parametric Synthesis
Very active area of research
Often associated with HMM Synthesis
Train a statistical model on large amounts of data.
Google showed that HMM-LSTM works well, fits better on device
WaveNet etc. are the latest generation of parametric synthesis
In-class listening test
Enter your selections here:
https://goo.gl/forms/KLLgwhnY04Y8fi9f2
Sample 1
Sample 2
Sample 3
Sample 4
TTS Demos: Diphone, Unit-Selection and Parametric
Festival
http://www.cstr.ed.ac.uk/projects/festival/morevoices.html
Google:
chrome-extension://chhkejkkcghanjclmhhpncachhgejoel/ttsdemo.html
Cereproc
https://www.cereproc.com/en/products/voices
The two stages of TTS
PG&E will file schedules on April 20.
1. Text Analysis: Text into intermediate representation
2. Waveform Synthesis: From the intermediate representation into waveform
Outline
History of Speech Synthesis
State of the Art Demos
Brief Architectural Overview
Stage 1: Text Analysis
Text Normalization
Phonetic Analysis
Letter-to-sound (grapheme-to-phoneme)
Prosodic Analysis
Text Normalization
Analysis of raw text into pronounceable words:
Sentence Tokenization
Text Normalization
Identify tokens in text
Chunk tokens into reasonably sized sections
Map tokens to words
Tag the words
Text Processing
He stole $100 million from the bank
It’s 13 St. Andrews St.
The home page is http://www.stanford.edu
Yes, see you the following tues, that’s 11/12/01
IV: four, fourth, I.V.
IRA: I.R.A. or Ira
1750: seventeen fifty (date, address) or one thousand seven … (dollars)
Text Normalization Steps
1. Identify tokens in text
2. Chunk tokens
3. Identify types of tokens
4. Convert tokens to words
Step 1: identify tokens and chunk
Whitespace can be viewed as separators
Punctuation can be separated from the raw tokens
For example, Festival converts text into an ordered list of tokens, each with features:
its own preceding whitespace
its own succeeding punctuation
Important issue in tokenization: end-of-utterance detection
Relatively simple if utterance ends in ?!
But what about ambiguity of “.”?
Ambiguous between end-of-utterance, end-of-abbreviation, and both
My place on Main St. is around the corner.
I live at 123 Main St.
(Not “I live at 151 Main St..”)
Rules/features for end-of-utterance detection
A dot with one or two letters is an abbrev
A dot with 3 cap letters is an abbrev.
An abbrev followed by 2 spaces and a capital letter is an end-of-utterance
Non-abbrevs followed by capitalized word are breaks
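A minimal Python sketch of these heuristics (the function names and exact thresholds here are illustrative, not Festival's actual code):

```python
import re

def is_abbreviation(token: str) -> bool:
    """Heuristic: a dot-final token of 1-2 letters, or 3 capital letters,
    is treated as an abbreviation (e.g. 'St.', 'Mr.', 'USA.')."""
    body = token.rstrip(".")
    return len(body) <= 2 or (len(body) == 3 and body.isupper())

def is_sentence_end(token: str, whitespace: str, next_token: str) -> bool:
    """Decide whether the '.' ending `token` closes the utterance."""
    if not token.endswith("."):
        return False
    starts_cap = bool(re.match(r"[A-Z]", next_token or ""))
    if is_abbreviation(token):
        # An abbreviation followed by 2 spaces and a capitalized word
        # is taken as an end-of-utterance.
        return len(whitespace) >= 2 and starts_cap
    # A non-abbreviation followed by a capitalized word is a break.
    return starts_cap
```

So `is_sentence_end("St.", " ", "is")` is False (abbreviation, single space), while `is_sentence_end("corner.", " ", "I")` is True.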
Simple Decision Tree
The Festival hand-built decision tree for end-of-utterance
((n.whitespace matches ".*\n.*\n[\n]*") ;; A significant break in text
 ((1))
 ((punc in ("?" ":" "!"))
  ((1))
  ((punc is ".")
   ;; This is to distinguish abbreviations vs periods
   ;; These are heuristics
   ((name matches "\\(.*\\..*\\|[A-Z][A-Za-z]?[A-Za-z]?\\|etc\\)")
    ((n.whitespace is " ")
     ((0)) ;; if abbrev, single space enough for break
     ((n.name matches "[A-Z].*")
      ((1))
      ((0))))
    ((n.whitespace is " ") ;; if it doesn't look like an abbreviation
     ((n.name matches "[A-Z].*") ;; single sp. + non-cap is no break
      ((1))
      ((0)))
     ((1))))
   ((0)))))
Problems with the previous decision tree
Fails for:
Cog. Sci. Newsletter
Lots of cases at end of line.
Badly spaced/capitalized sentences
More sophisticated decision tree features
Prob(word with “.” occurs at end-of-s)
Prob(word after “.” occurs at begin-of-s)
Length of word with “.”
Length of word after “.”
Case of word with “.”: Upper, Lower, Cap, Number
Case of word after “.”: Upper, Lower, Cap, Number
Punctuation after “.” (if any)
Abbreviation class of word with “.” (month name, unit-of-measure, title, address name, etc.)
From Richard Sproat slides
Sproat EOS tree
From Richard Sproat slides
Some good references on end-of-sentence detection
David Palmer and Marti Hearst. 1997. Adaptive Multilingual Sentence Boundary Disambiguation. Computational Linguistics 23(2): 241-267.
David Palmer. 2000. Tokenisation and Sentence Segmentation. In “Handbook of Natural Language Processing”, edited by Dale, Moisl, Somers.
Steps 3+4: Identify Types of Tokens, and Convert Tokens to Words
Pronunciation of numbers often depends on type. Three ways to pronounce 1776:
Date: seventeen seventy six
Phone number: one seven seven six
Quantifier: one thousand seven hundred (and) seventy six
Also:
25 Day: twenty-fifth
Festival rule for dealing with “$1.2 million”
(define (token_to_words utt token name)
 (cond
  ((and (string-matches name "\\$[0-9,]+\\(\\.[0-9]+\\)?")
        (string-matches (utt.streamitem.feat utt token "n.name") ".*illion.?"))
   (append
    (builtin_english_token_to_words utt token (string-after name "$"))
    (list (utt.streamitem.feat utt token "n.name"))))
  ((and (string-matches (utt.streamitem.feat utt token "p.name") "\\$[0-9,]+\\(\\.[0-9]+\\)?")
        (string-matches name ".*illion.?"))
   (list "dollars"))
  (t
   (builtin_english_token_to_words utt token name))))
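The same logic can be sketched in Python (a loose re-implementation, not Festival's actual behavior; full number-to-words expansion is elided):

```python
import re

MONEY_RE = re.compile(r"\$[0-9,]+(\.[0-9]+)?$")

def money_tokens_to_words(token: str, next_token: str, prev_token: str):
    """'$1.2 million'-style spans: speak the number when '$N' is followed
    by an '-illion' word, and replace the magnitude word with the number
    of dollars phrase tail ('dollars')."""
    if MONEY_RE.match(token) and re.search(r"illion.?$", next_token or ""):
        # "$1.2" before "million" -> ["1.2", "million"]
        # (expanding "1.2" into words is elided here)
        return [token.lstrip("$"), next_token]
    if MONEY_RE.match(prev_token or "") and re.search(r"illion.?$", token):
        # the "million" token itself becomes "dollars"
        return ["dollars"]
    return [token]
```

So the pair "$1.2", "million" comes out as "1.2 million … dollars", matching the Festival rule's intent.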
Rules versus machine learning
Rules/patterns
Simple
Quick
Can be more robust
Easy to debug, inspect, and pass on
Vastly preferred in commercial systems
Machine Learning
Works for complex problems where rules hard to write
Higher accuracy in general
Worse generalization to very different test sets
Vastly preferred in academia
Machine learning for Text Normalization
Sproat, R., Black, A., Chen, S., Kumar, S., Ostendorf, M., and Richards, C. 2001. Normalization of Non-standard Words, Computer Speech and Language, 15(3):287-333.
NSW examples:
Numbers:
123, 12 March 1994
Abbreviations, contractions, acronyms:
approx., mph, ctrl-C, US, pp, lb
Punctuation conventions:
3-4, +/-, and/or
Dates, times, URLs, etc.
How common are NSWs?
Word not in lexicon, or with non-alphabetic characters (Sproat et al 2001, before SMS/Twitter)

Text Type    % NSW
novels       1.5%
presswire    4.9%
e-mail       10.7%
recipes      13.7%
classified   17.9%
How common are NSWs?
Word not in gnu aspell dictionary (Han, Cook, Baldwin 2013), not counting @mentions, #hashtags, URLs
Twitter: 15% of tweets have 50% or more OOV
Twitter variants
Han, Cook, Baldwin 2013

Category                   Example         %
Letter changed (deleted)   shuld           72%
Slang                      lol             12%
Other                      sucha           10%
Number Substitution        4 (“for”)       3%
Letter and Number          b4 (“before”)   2%
State of the Art for Twitter normalization
Simple one-to-one normalization (map OOV to a single IV word)
Han, Cook, and Baldwin 2013
Create a type-based dictionary of mappings
tmrw -> tomorrow
Learned over a large corpus by combining distributional similarity and string similarity
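A toy version of the string-similarity half of this idea (the vocabulary below is made up; real systems such as Han, Cook & Baldwin's also weigh in distributional similarity from a large corpus, which this sketch omits):

```python
def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def normalize_oov(oov: str, vocab: list[str]) -> str:
    """Map an out-of-vocabulary token to the closest in-vocabulary word
    by string similarity alone."""
    return min(vocab, key=lambda w: edit_distance(oov, w))
```

For example, `normalize_oov("shuld", ["should", "shield", "holder"])` returns "should" (one insertion away). String similarity alone is noisy, which is exactly why the paper combines it with distributional evidence.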
4 steps to Sproat et al. algorithm
Splitter (on whitespace or also within word (“AltaVista”))
Type identifier: for each split token identify type
Token expander: for each typed token, expand to words
Deterministic for number, date, money, letter sequence
Only hard (nondeterministic) for abbreviations
Language Model: to select between alternative pronunciations
From Alan Black slides
Step 1: Splitter
Letter/number conjunctions (WinNT, SunOS, PC110)
Hand-written rules in two parts:
Part I: group things not to be split (numbers, etc.; including commas in numbers, slashes in dates)
Part II: apply rules:
At transitions from lower to upper case
After penultimate upper-case char in transitions from upper to lower
At transitions from digits to alpha
At punctuation
From Alan Black Slides
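The Part II transition rules above can be sketched as regular-expression split points (an illustrative reconstruction, not Sproat et al.'s actual code; punctuation splitting and the Part I protections are omitted):

```python
import re

# Zero-width boundaries at the case/digit transitions listed above.
BOUNDARIES = (
    r"(?<=[a-z])(?=[A-Z])"        # lower -> upper:  WinNT -> Win | NT
    r"|(?<=[A-Z])(?=[A-Z][a-z])"  # after penultimate upper: DECtalk -> DE | Ctalk
    r"|(?<=[0-9])(?=[A-Za-z])"    # digit -> alpha
    r"|(?<=[A-Za-z])(?=[0-9])"    # alpha -> digit:  PC110 -> PC | 110
)

def split_token(token: str) -> list[str]:
    """Split one raw token at case and digit/alpha transitions."""
    return [p for p in re.split(BOUNDARIES, token) if p]
```

`split_token("WinNT")` gives `["Win", "NT"]` and `split_token("PC110")` gives `["PC", "110"]`.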
Step 2: Classify token into 1 of 20 types
EXPN: abbrev, contractions (adv, N.Y., mph, gov’t)
LSEQ: letter sequence (CIA, D.C., CDs)
ASWD: read as word, e.g. CAT, proper names
MSPL: misspelling
NUM: number (cardinal) (12, 45, 1/2, 0.6)
NORD: number (ordinal) e.g. May 7, 3rd, Bill Gates II
NTEL: telephone (or part) e.g. 212-555-4523
NDIG: number as digits e.g. Room 101
NIDE: identifier, e.g. 747, 386, I5, PC110
NADDR: number as street address, e.g. 5000 Pennsylvania
NZIP, NTIME, NDATE, NYER, MONEY, BMONY, PRCT, URL, etc.
SLNT: not spoken (KENT*REALTY)
More about the types
4 categories for alphabetic sequences:
EXPN: expand to full word(s) (fplc = fireplace, NY = New York)
LSEQ: say as letter sequence (IBM)
ASWD: say as standard word (either OOV or acronyms)
5 main ways to read numbers:
Cardinal (quantities)
Ordinal (dates)
String of digits (phone numbers)
Pair of digits (years)
Trailing unit: serial until last non-zero digit: 8765000 is “eight seven six five thousand” (phone #s, long addresses)
But still exceptions: (947-3030, 830-7056)
Type identification classifier
Hand label a large training set, build classifier
Example features for alphabetic tokens:
P(o|t) for t in ASWD, LSEQ, EXPN (from trigram letter model):

p(o | t) = ∏_{i=1}^{N} p(l_i | l_{i-1}, l_{i-2})

P(t) from counts of each tag in text
P(o) normalization factor
P(t|o) = p(o|t) p(t) / p(o)
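The resulting Bayes decision rule, argmax over tags of p(o|t)p(t), can be sketched as follows (the toy trigram models and the smoothing floor for unseen trigrams are assumptions for illustration):

```python
import math

def letter_trigram_logprob(token: str, model: dict) -> float:
    """log p(o|t): sum of log trigram letter probabilities, with '#'
    padding and a small floor for unseen trigrams."""
    letters = ["#", "#"] + list(token.lower())
    lp = 0.0
    for i in range(2, len(letters)):
        tri = tuple(letters[i - 2:i + 1])
        lp += math.log(model.get(tri, 1e-6))
    return lp

def classify_token(token: str, models: dict, priors: dict) -> str:
    """Pick argmax_t p(o|t) p(t) over tags t (e.g. ASWD, LSEQ, EXPN);
    p(o) is a constant across tags, so it can be dropped."""
    return max(priors,
               key=lambda t: letter_trigram_logprob(token, models[t])
                             + math.log(priors[t]))
```

A tag whose letter model assigns high probability to the token's letter trigrams wins, weighted by how common the tag is overall.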
Type identification algorithm
Hand-written context-dependent rules:
List of lexical items (Act, Advantage, amendment) after which Roman numbers are read as cardinals not ordinals
Classifier accuracy:
98.1% in news data
91.8% in email
Step 3: expand NSW Tokens by type-specific rules
ASWD expands to itself
LSEQ expands to list of words, one for each letter
NUM expands to string of words representing cardinal
NYER expands to 2 pairs of NUM digits…
NTEL: string of digits with silence for punctuation
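The deterministic NUM case can be sketched as a standard cardinal expander (an illustrative implementation, not any particular system's):

```python
ONES = ["", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen",
        "fourteen", "fifteen", "sixteen", "seventeen", "eighteen",
        "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty",
        "seventy", "eighty", "ninety"]

def num_to_words(n: int) -> str:
    """Expand a cardinal (up to the millions) into a word string."""
    if n == 0:
        return "zero"
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rest = divmod(n, 10)
        return TENS[tens] + (" " + ONES[rest] if rest else "")
    if n < 1000:
        hund, rest = divmod(n, 100)
        return ONES[hund] + " hundred" + (" " + num_to_words(rest) if rest else "")
    for value, name in ((10**6, "million"), (10**3, "thousand")):
        if n >= value:
            head, rest = divmod(n, value)
            return (num_to_words(head) + " " + name
                    + (" " + num_to_words(rest) if rest else ""))
```

This is the quantifier reading: `num_to_words(1776)` gives "one thousand seven hundred seventy six"; the date and phone-number readings of 1776 need the NYER and NTEL rules instead.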
In practice
NSWs are still mainly done by rule in TTS systems
Moving towards ML approaches
Labeled data not readily available for the task
Homographs
Homographs are words with the same spelling but different pronunciations. This is particularly difficult for speech synthesis. Examples of the English homographs use, live, and bass:
It’s no use (/y uw s/) to ask to use (/y uw z/) the telephone.
Do you live (/l ih v/) near a zoo with live (/l ay v/) animals?
I prefer bass (/b ae s/) fishing to playing the bass (/b ey s/) guitar.
French homographs include fils (which has two pronunciations, [fis] ‘son’ versus [fil] ‘thread’), the multiple pronunciations for fier (‘proud’ or ‘to trust’), and est (‘is’ or ‘East’) (Divay and Vitale, 1997).
Luckily for the task of homograph disambiguation, the two forms of homographs in English (as well as in similar languages like French and German) tend to have different parts of speech. For example, the two forms of use above are (respectively) a noun and a verb, while the two forms of live are (respectively) a verb and a noun. Indeed, Liberman and Church (1992) showed that many of the most frequent homographs in 44 million words of AP newswire are disambiguatable just by using part of speech (the most frequent 15 homographs in order are use, increase, close, record, house, contract, lead, live, lives, protest, survey, project, separate, present, read).
Figure 8.5 shows some interesting systematic relationships between noun-verb and adj-verb homographs: final consonant voicing (noun /s/ versus verb /z/), stress shift (noun initial stress versus verb final stress), and final vowel weakening in -ate words (noun/adj final /ax/ versus verb final /ey/):

Final voicing: N (/s/) vs. V (/z/)
use     y uw s      y uw z
close   k l ow s    k l ow z
house   h aw s      h aw z

Stress shift: N (init. stress) vs. V (fin. stress)
record  r eh1 k axr0 d    r ix0 k ao1 r d
insult  ih1 n s ax0 l t   ix0 n s ah1 l t
object  aa1 b j eh0 k t   ax0 b j eh1 k t

-ate final vowel: N/A (final /ax/) vs. V (final /ey/)
estimate  eh s t ih m ax t   eh s t ih m ey t
separate  s eh p ax r ax t   s eh p ax r ey t
moderate  m aa d ax r ax t   m aa d ax r ey t
Homograph disambiguation
• 19 most frequent homographs, from Liberman and Church 1992
• Counts are per million, from an AP news corpus of 44 million words.
• Not a huge problem, but still important

use       319    survey     91
increase  230    project    90
close     215    separate   87
record    195    present    80
house     150    read       72
contract  143    subject    68
lead      131    rebel      48
live      130    finance    46
lives     105    estimate   46
protest    94
POS Tagging for homograph disambiguation
Many homographs can be distinguished by POS:

use       y uw s     y uw z
close     k l ow s   k l ow z
house     h aw s     h aw z
live      l ay v     l ih v
REcord    reCORD
INsult    inSULT
OBject    obJECT
OVERflow  overFLOW
DIScount  disCOUNT
CONtent   conTENT

POS tagging also useful for CONTENT/FUNCTION distinction, which is useful for phrasing
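A minimal sketch of this lookup, keyed on (word, POS). The entries below are from the table above; the POS tag is assumed to come from an upstream tagger:

```python
# (word, coarse POS) -> ARPAbet pronunciation.
HOMOGRAPHS = {
    ("use", "N"): "y uw s",     ("use", "V"): "y uw z",
    ("close", "N"): "k l ow s", ("close", "V"): "k l ow z",
    ("house", "N"): "h aw s",   ("house", "V"): "h aw z",
    ("live", "ADJ"): "l ay v",  ("live", "V"): "l ih v",
}

def pronounce(word: str, pos: str, lexicon=HOMOGRAPHS) -> str:
    """Return the pronunciation for this word under this POS tag."""
    return lexicon[(word.lower(), pos)]
```

So "live animals" (ADJ) and "you live" (V) get different phone strings from the same spelling.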
Festival
Open source speech synthesis system
Designed for development and runtime use
Used in many commercial and academic systems
Hundreds of thousands of users
Multilingual
No built-in language
Designed to allow addition of new languages
Additional tools for rapid voice development
Statistical learning tools
Scripts for building models
Text from Richard Sproat
Festival as software
http://festvox.org/festival/
General system for multi-lingual TTS
C/C++ code with Scheme scripting language
General replaceable modules:
Lexicons, LTS, duration, intonation, phrasing, POS tagging, tokenizing, diphone/unit selection, signal processing
General tools
Intonation analysis (F0, Tilt), signal processing, CART building, N-gram, SCFG, WFST
Text from Richard Sproat
CMU FestVox project
Festival is an engine; how do you make voices?
Festvox: building synthetic voices:
Tools, scripts, documentation
Discussion and examples for building voices
Example voice databases
Step by step walkthroughs of processes
Support for English and other languages
Support for different waveform synthesis methods
Diphone
Unit selection
1/5/07
Text from Richard Sproat
Synthesis tools
I want my computer to talk
Festival Speech Synthesis
I want my computer to talk in my voice
FestVox Project
I want it to be fast and efficient
Flite
Text from Richard Sproat
Dictionaries
CMU dictionary: 127K words
http://www.speech.cs.cmu.edu/cgi-bin/cmudict
Unisyn dictionary
Significantly more accurate, includes multiple dialects
http://www.cstr.ed.ac.uk/projects/unisyn/
Dictionaries aren’t always sufficient: Unknown words
Go up with square root of # of words in text
Mostly person, company, product names
From a Black et al analysis:
5% of tokens in a sample of WSJ not in the OALD dictionary:
77%: Names
20%: Other Unknown Words
4%: Typos
So commercial systems have a 3-part system:
Big dictionary
Special code for handling names
Machine learned LTS system for other unknown words
Names
Big problem area is names
Names are common
20% of tokens in typical newswire text
Spiegel (2003) estimate of US names:
2 million surnames
100,000 first names
Personal names: McArthur, D’Angelo, Jiminez, Rajan, Raghavan, Sondhi, Xu, Hsu, Zhang, Chang, Nguyen
Company/Brand names: Infinit, Kmart, Cytyc, Medamicus, Inforte, Aaon, Idexx Labs, Bebe
Methods for Names
Can do morphology (Walters -> Walter, Lucasville)
Can write stress-shifting rules (Jordan -> Jordanian)
Rhyme analogy: Plotsky by analogy with Trotsky (replace tr with pl)
Liberman and Church: for 250K most common names, got 212K (85%) from these modified dictionary methods, used LTS for rest.
Can do automatic country detection (from letter trigrams) and then do country-specific rules
Letter to Sound Rules
AKA Grapheme to Phoneme (G2P)
Generally machine learning, induced from a dictionary
Pick your favorite machine learning tool and go for it
Earlier work: (Black et al. 1998)
Two steps: alignment and (CART-based) rule-induction
Alignment
Letters: c h e c k e d
Phones: ch _ eh _ k _ t
Black et al Method 1 (EM)
First scatter epsilons in all possible ways to cause letters and phones to align
Then collect stats for P(phone|letter) and select best to generate new stats
Iterate 5-6 times until settles
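A toy, hard-EM version of this idea (it keeps only the single best alignment per word each round rather than fractional counts, so it is a simplification of the method described; the epsilon symbol '_' and the smoothing values are assumptions):

```python
from itertools import combinations
from collections import defaultdict

def alignments(letters, phones):
    """All ways to pad the phone string with epsilons ('_') so it has the
    same length as the letter string (assumes len(phones) <= len(letters))."""
    n_eps = len(letters) - len(phones)
    for eps_pos in combinations(range(len(letters)), n_eps):
        padded, it = [], iter(phones)
        for i in range(len(letters)):
            padded.append("_" if i in eps_pos else next(it))
        yield list(zip(letters, padded))

def em_align(words, iterations=5):
    """words: list of (letters, phones) pairs. Returns the best
    letter-phone alignment per word after a few rounds of re-estimating
    P(phone | letter) from the currently-best alignments."""
    prob = defaultdict(lambda: 1.0)          # start with uniform scores
    best = {}
    for _ in range(iterations):
        counts = defaultdict(float)
        for letters, phones in words:
            scored = [(sum(prob[pair] for pair in a), a)
                      for a in alignments(letters, phones)]
            _, best_a = max(scored)
            best[tuple(letters)] = best_a
            for pair in best_a:
                counts[pair] += 1
        total = sum(counts.values())
        prob = defaultdict(lambda: 1e-3,
                           {p: c / total for p, c in counts.items()})
    return best
```

With only one training word the alignment it settles on is arbitrary; the statistics become useful when many words share letter-phone pairs.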
Alignment
Black et al method 2
Hand specify which letters can be rendered as which phones
C goes to k/ch/s/sh
W goes to w/v/f, etc.
An actual list:
Once mapping table is created, find all valid alignments, find p(letter|phone), score all alignments, take best
Alignment
Some alignments will turn out to be really bad. These are just the cases where pronunciation doesn’t match letters:
Dept: d ih p aa r t m ah n t
CMU: s iy eh m y uw
Lieutenant: l eh f t eh n ax n t (British)
Or foreign words
These can just be removed from alignment training
Also remove and deal separately with names and acronyms
Black et al. Classifier
A CART tree for each letter in alphabet (26 plus accented) using context of ±3 letters
# # # c h e c -> ch
c h e c k e d -> _
This produces 92-96% correct LETTER accuracy (58-75 word acc) for English
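Extracting the ±3-letter context windows used as features can be sketched as follows (an illustrative helper, with '#' padding as in the example above):

```python
def letter_windows(word: str, width: int = 3):
    """For each letter of `word`, yield its +-`width` letter context,
    padded with '#' at the word edges; each window is the feature vector
    for one letter's CART tree."""
    padded = ["#"] * width + list(word) + ["#"] * width
    for i in range(width, width + len(word)):
        yield tuple(padded[i - width:i + width + 1])
```

For "checked" the first window is `('#', '#', '#', 'c', 'h', 'e', 'c')`, the context from which the tree predicts the phone "ch".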
Modern systems: more powerful classifiers, more features
Predicting Intonation in TTS
Prominence/Accent: Decide which words are accented, which syllable has accent, what sort of accent
Boundaries: Decide where intonational boundaries are
Duration: Specify length of each segment
F0: Generate F0 contour from these
Stress vs. accent
Stress: structural property of a word
Fixed in the lexicon: marks a potential (arbitrary) location for an accent to occur, if there is one
Accent: property of a word in context
Context-dependent. Marks important words in the discourse
[Metrical grids for “vi ta mins” and “Ca li for nia”: rows mark, top to bottom, the (accented syll), stressed syll, full vowels, and syllables]
Stress vs. accent (2)
The speaker decides to make the word vitamin more prominent by accenting it.
Lexical stress tells us that this prominence will appear on the first syllable, hence VItamin.
So we will have to look at both the lexicon and the context to predict the details of prominence
I’m a little surprised to hear it characterized as upbeat
Levels of prominence
Most phrases have more than one accent
Nuclear Accent: Last accent in a phrase, perceived as more prominent
Plays semantic role like indicating a word is contrastive or focus.
Modeled via ***s in IM, or capitalized letters
‘I know SOMETHING interesting is sure to happen,’ she said
Can also have reduced words that are less prominent than usual (especially function words)
Sometimes use 4 classes of prominence:
Emphatic accent, pitch accent, unaccented, reduced
Pitch accent prediction
Which words in an utterance should bear accent?
→ i believe at ibm they make you wear a blue suit.
→ i BELIEVE at IBM they MAKE you WEAR a BLUE SUIT.
→ 2001 was a good movie, if you had read the book.
→ 2001 was a good MOVIE, if you had read the BOOK.
Broadcast News Example
Hirschberg (1993)
SUN MICROSYSTEMS INC, the UPSTART COMPANY that HELPED LAUNCH the DESKTOP COMPUTER industry TREND TOWARD HIGH powered WORKSTATIONS, was UNVEILING an ENTIRE OVERHAUL of its PRODUCT LINE TODAY. SOME of the new MACHINES, PRICED from FIVE THOUSAND NINE hundred NINETY five DOLLARS to seventy THREE thousand nine HUNDRED dollars, BOAST SOPHISTICATED new graphics and DIGITAL SOUND TECHNOLOGIES, HIGHER SPEEDS AND a CIRCUIT board that allows FULL motion VIDEO on a COMPUTER SCREEN.
Predicting Pitch Accent: Part of speech
Content words are usually accented
Function words are rarely accented
of, for, in, on, that, the, a, an, no, to, and, but, or, will, may, would, can, her, is, their, its, our, there, am, are, was, were, etc.
Baseline algorithm: Accent all content words.
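The baseline can be written in a few lines (the function-word list is the slide's, lightly cleaned; real systems use a POS tagger rather than a fixed list):

```python
FUNCTION_WORDS = {"of", "for", "in", "on", "that", "the", "a", "an",
                  "no", "to", "and", "but", "or", "will", "may",
                  "would", "can", "her", "is", "their", "its", "our",
                  "there", "am", "are", "was", "were"}

def accent_baseline(words: list[str]) -> list[bool]:
    """Baseline: accent every word that is not a function word."""
    return [w.lower() not in FUNCTION_WORDS for w in words]
```

So "wear a blue suit" comes out accented on "wear", "blue", and "suit" but not on "a".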
Factors in accent prediction: Contrast
Legumes are a poor source of VITAMINS
No, legumes are a GOOD source of vitamins
I think JOHN or MARY should go
No, I think JOHN AND MARY should go
Factors in Accent Prediction: List intonation
I went and saw ANNA, LENNY, MARY, and NORA.
Factors: Word order
Preposed items are accented more frequently
TODAY we will BEGIN to LOOK at FROG anatomy.
We will BEGIN to LOOK at FROG anatomy today.
Factor: Information
New information in an answer is often accented
Q1: What types of foods are a good source of vitamins?
A1: LEGUMES are a good source of vitamins.
Q2: Are legumes a source of vitamins?
A2: Legumes are a GOOD source of vitamins.
Q3: I’ve heard that legumes are healthy, but what are they a good source of?
A3: Legumes are a good source of VITAMINS.
Slide from Jennifer Venditti
Factors: Information Status (2)
New versus old information.
Old information is deaccented
There are LAWYERS, and there are GOOD lawyers
EACH NATION DEFINES its OWN national INTEREST.
I LIKE GOLDEN RETRIEVERS, but MOST dogs LEAVE me COLD.
Complex Noun Phrase Structure
Sproat, R. 1994. English noun-phrase accent prediction for text-to-speech. Computer Speech and Language 8:79-94.
Proper Names, stress on right-most word
New York CITY; Paris, FRANCE
Adjective-Noun combinations, stress on noun
Large HOUSE, red PEN, new NOTEBOOK
Noun-Noun compounds: stress left noun
HOT dog (food) versus adj-N hot DOG (overheated animal)
WHITE house (place) versus adj-N white HOUSE (painted)
Other:
MEDICAL Building, APPLE cake, cherry PIE, kitchen TABLE
Madison avenue, Park street?
Rule: Furniture + Room -> RIGHT
Other features
POS
POS of previous word
POS of next word
Stress of current, previous, next syllable
Unigram probability of word
Bigram probability of word
Position of word in sentence
Advanced features
Accent is often deflected away from a word due to focus on a neighboring word.
Could use syntactic parallelism to detect this kind of contrastive focus:
… driving [FIFTY miles] an hour in a [THIRTY mile] zone
[WELD] [APPLAUDS] mandatory recycling. [SILBER] [DISMISSES] recycling goals as meaningless.
… but while Weld may be [LONG] on people skills, he may be [SHORT] on money
Accent prediction
Useful features include: (starting with Hirschberg 1993)
Lexical class (function words, clitics not accented)
Word frequency
Word identity (promising but problematic)
Given/New, Theme/Rheme
Focus
Word bigram predictability
Position in phrase
Complex nominal rules (Sproat)
Combined in a classifier:
Decision trees (Hirschberg 1993), Bagging/boosting (Sun 2002)
Hidden Markov models (Hasegawa-Johnson et al 2005)
Conditional random fields (Gregory and Altun 2004)
But best single feature: Accent ratio dictionary
Nenkova, J. Brenier, A. Kothari, S. Calhoun, L. Whitton, D. Beaver, and D. Jurafsky. 2007. To memorize or to predict: Prominence labeling in conversational speech. In Proceedings of NAACL-HLT.
Of all the times this word occurred, what percentage were accented?
Memorized from a labeled corpus
60 Switchboard conversations (Ostendorf et al 2001)
Given:
k: number of times a word is prominent
n: all occurrences of the word

AccentRatio(w) = k/n, if B(k, n, 0.5) < 0.05
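A sketch of the accent-ratio computation with an exact two-sided binomial test (the 0.5 fallback for words whose accent behavior is not significantly different from chance follows the paper; the implementation details here are illustrative):

```python
from math import comb

def binom_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial test p-value: sum of all point
    probabilities no larger than that of the observed k."""
    pk = comb(n, k) * p**k * (1 - p)**(n - k)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(n + 1)
               if comb(n, i) * p**i * (1 - p)**(n - i) <= pk + 1e-12)

def accent_ratio(k: int, n: int) -> float:
    """k/n when the word's accent behavior differs significantly from
    chance; else fall back to the uninformative 0.5."""
    return k / n if binom_two_sided_p(k, n) < 0.05 else 0.5
```

A word accented 95 times out of 100 gets ratio 0.95; one accented 5 times out of 10 is indistinguishable from chance and falls back to 0.5.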
Accent Ratio
Conversational speech dictionaries:
http://www.cis.upenn.edu/~nenkova/AR.sign
Read News dictionaries:
http://www.cis.upenn.edu/~nenkova/buAR.sign
Accent ratio classifier:
a word is not-prominent if AR < 0.38
words not in the AR dictionary are labeled “prominent”
Performance:
Best single predictor of accent; adding all other features only helps 1% on prediction task
Improves TTS:
Volker Strom, Ani Nenkova, Robert Clark, Yolanda Vazquez-Alvarez, Jason Brenier, Simon King, and Dan Jurafsky. 2007. Modelling Prominence and Emphasis Improves Unit-Selection Synthesis. Interspeech.
Predicting Intonation in TTS
Prominence/Accent: Decide which words are accented, which syllable has accent, what sort of accent
Boundaries: Decide where intonational boundaries are
Duration: Specify length of each segment
F0: Generate F0 contour from these
Predicting Boundaries: Full || versus intermediate |
Ostendorf and Veilleux. 1994. “Hierarchical Stochastic Model for Automatic Prediction of Prosodic Boundary Location”, Computational Linguistics 20:1.
Computer phone calls, || which do everything | from selling magazine subscriptions || to reminding people about meetings || have become the telephone equivalent | of junk mail. ||
Doctor Norman Rosenblatt, || dean of the college | of criminal justice at Northeastern University, || agrees. ||
For WBUR, || I’m Margo Melnicove.
Simplest CART
(set! simple_phrase_cart_tree
'
((lisp_token_end_punc in ("?" "." ":"))
  ((BB))
  ((lisp_token_end_punc in ("'" "\"" "," ";"))
   ((B))
   ((n.name is 0) ;; end of utterance
    ((BB))
    ((NB))))))
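The same tree rendered as ordinary Python (a direct transliteration of the CART above; the end-of-utterance convention, no next token, is modeled as a None or "0" next name):

```python
def boundary_type(punc: str, next_name) -> str:
    """'BB' = full (||) boundary, 'B' = intermediate (|), 'NB' = none."""
    if punc in ("?", ".", ":"):
        return "BB"
    if punc in ("'", '"', ",", ";"):
        return "B"
    if next_name in (None, "0"):   # no next token: end of utterance
        return "BB"
    return "NB"
```

A token ending in "." gets a full boundary, one ending in "," an intermediate one, and most word-internal positions get none.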
More complex features
Ostendorf and Veilleux
English: boundaries are more likely between content words and function words
Syntactic structure (parse trees)
Largest syntactic category dominating preceding word but not succeeding word
How many syntactic units begin/end between words
Type of function word to right
Capitalized names
# of content words since previous function word
Ostendorf and Veilleux CART
Predicting Intonation in TTS
Prominence/Accent: Decide which words are accented, which syllable has accent, what sort of accent
Boundaries: Decide where intonational boundaries are
Duration: Specify length of each segment
F0: Generate F0 contour from these
Duration
Simplest: fixed size for all phones (100 ms)
Next simplest: average duration for that phone (from training data). Samples from SWBD in ms:

aa 118    b  68
ax  59    d  68
ay 138    dh 44
eh  87    f  90
ih  77    g  66

Next next simplest: add in phrase-final and initial lengthening plus stress
Klatt duration rules
Models how the context-neutral duration of a phone is lengthened/shortened by context, while staying above a minimum duration dmin
Prepausal lengthening: vowel before pause lengthened by 1.4
Non-phrase-final shortening: Segments not phrase-final are shortened by 0.6. Phrase-final postvocalic liquids and nasals lengthened by 1.4
Unstressed shortening: unstressed segments’ minimum duration dmin is halved, and they are shortened by 0.7 for most phone types.
Lengthening for accent: A vowel with accent is lengthened by 1.4
Shortening in clusters: Consonant followed by consonant shortened by 0.5
Pre-voiceless shortening: Vowels are shortened before a voiceless plosive by 0.7
Klatt duration rules
Klatt formula for phone durations:
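The formula itself did not survive extraction; the standard Klatt rule computes duration as the minimum duration plus the product of the applicable rule factors times the stretchable part (average minus minimum). A sketch under that assumption:

```python
def klatt_duration(avg_ms: float, min_ms: float, factors: list[float]) -> float:
    """Klatt-style duration: scale the stretchable part (average minus
    minimum duration) by the product of the applicable rule factors,
    so the result never drops below the minimum duration."""
    prcnt = 1.0
    for f in factors:
        prcnt *= f
    return min_ms + prcnt * (avg_ms - min_ms)
```

For example, a vowel with average 120 ms and minimum 60 ms, lengthened prepausally (factor 1.4), comes out at 60 + 1.4 × 60 = 144 ms; with no applicable rules it stays at its 120 ms average.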
Festival: 2 options
Klatt rules
Use labeled training set with Klatt features to train CART:
Identity of the left and right context phone
Lexical stress and accent values of current phone
Position in syllable, word, phrase
Following pause
Duration: state of the art
Lots of fancy models of duration prediction:
Using z-scores and other clever normalizations
Sum-of-products model
New features like word predictability
Words with higher bigram probability are shorter
Outline
History, Demos
Architectural Overview
Stage 1: Text Analysis
Text Normalization
Tokenization
End of sentence detection
Homograph disambiguation
Letter-to-sound (grapheme-to-phoneme)
Prosody
Predicting Accents
Predicting Boundaries
Predicting Duration
Generating F0