Text Normalization, Letter to Sound, Prosody

CS 224S / LINGUIST 285
Spoken Language Processing
Andrew Maas
Stanford University
Spring 2017
Lecture 14: Text-to-Speech I: Text Normalization, Letter to Sound, Prosody
Original slides by Dan Jurafsky, Alan Black, & Richard Sproat
Outline
— History, Demos
— Architectural Overview
— Stage 1: Text Analysis
— Text Normalization
— Tokenization
— End of sentence detection
— Homograph disambiguation
— Letter-to-sound (grapheme-to-phoneme)
— Prosody
Dave Barry on TTS
“And computers are getting smarter all the time; scientists tell us that soon they will be able to talk with us.
(By “they”, I mean computers; I doubt scientists will ever be able to talk to us.)”
History of TTS
— Pictures and some text from Hartmut Traunmüller’s website:
http://www.ling.su.se/staff/hartmut/kemplne.htm
— Von Kempelen 1780: b. Bratislava 1734, d. Vienna 1804
— Leather resonator manipulated by the operator to try and copy vocal tract configuration during sonorants (vowels, glides, nasals)
— Bellows provided airstream, counterweight provided inhalation
— Vibrating reed produced periodic pressure wave
Von Kempelen:
— Small whistles controlled consonants
— Rubber mouth and nose; nose had to be covered with two fingers for non-nasals
— Unvoiced sounds: mouth covered, auxiliary bellows driven by string provides puff of air
From Traunmüller’s website
Closer to a natural vocal tract:
Riesz 1937
Homer Dudley 1939 VODER
— Synthesizing speech by electrical means
— 1939 World’s Fair
Homer Dudley’s VODER
• Manually controlled through a complex keyboard
• Operator training was a problem
An aside on demos
That last slide exhibited Rule 1 of playing a speech synthesis demo:
Always have a human say what the words are right before you have the system say them.
The 1936 UK Speaking Clock
From http://web.ukonline.co.uk/freshwater/clocks/spkgclock.htm
The UK Speaking Clock
— July 24, 1936
— Photographic storage on 4 glass disks
— 2 disks for minutes, 1 for hours, 1 for seconds
— Other words in the sentence distributed across the 4 disks, so all 4 are used at once
— Voice of “Miss J. Cain”
A technician adjusts the amplifiers
of the first speaking clock
From http://web.ukonline.co.uk/freshwater/clocks/spkgclock.htm
Gunnar Fant’s OVE synthesizer
— Of the Royal Institute of Technology, Stockholm
— Formant synthesizer for vowels
— F1 and F2 could be controlled
From Traunmüller’s website
Cooper’s Pattern Playback
— Haskins Labs device for investigating speech perception
— Works like an inverse of a spectrograph
— Light from a lamp goes through a rotating disk, then through the spectrogram, into photovoltaic cells
— Thus the amount of light transmitted at each frequency band corresponds to the amount of acoustic energy at that band
Cooper’s Pattern Playback
Pre-modern TTS systems
— 1960’s: first full TTS: Umeda et al. (1968)
— 1970’s:
— Joe Olive 1977: concatenation of linear-prediction diphones
— Texas Instruments Speak and Spell, June 1978 (Paul Breedlove)
Types of Synthesis
— Articulatory Synthesis:
— Model movements of the articulators and the acoustics of the vocal tract
— Formant Synthesis:
— Start with acoustics; create rules/filters to create each formant
— Concatenative Synthesis:
— Use databases of stored speech to assemble new utterances
— Diphone
— Unit Selection
— Parametric (HMM) Synthesis:
— Train parameters on databases of speech
Text modified from Richard Sproat slides
1980s: Formant Synthesis
— The most common commercial systems when computers were slow and had little memory
— 1979: MIT MITalk (Allen, Hunnicutt, Klatt)
— 1983: DECtalk system, based on Klatttalk
— “Perfect Paul” (the voice of Stephen Hawking)
— “Beautiful Betty”
1990s: 2nd Generation Synthesis
Diphone Synthesis
— Units are diphones: middle of one phone to middle of the next
— Why? The middle of a phone is a steady state
— Record 1 speaker saying each diphone (~1400 recordings)
— Paste them together and modify prosody
3rd Generation Synthesis
Unit Selection Systems
— Most current commercial systems
— Larger units of variable length
— Record one speaker speaking 10 hours or more
— Have multiple copies of each unit
— Use search to find the best sequence of units
Parametric Synthesis
— Very active area of research
— Often associated with HMM synthesis
— Train a statistical model on a large amount of data
— Google showed that HMM-LSTM works well and fits better on device
— WaveNet etc. are the latest generation of parametric synthesis
In-class listening test
— Enter your selections here:
https://goo.gl/forms/KLLgwhnY04Y8fi9f2
— Sample 1
— Sample 2
— Sample 3
— Sample 4
TTS Demos:
Diphone, Unit-Selection and Parametric
Festival
http://www.cstr.ed.ac.uk/projects/festival/morevoices.html
Google:
chrome-extension://chhkejkkcghanjclmhhpncachhgejoel/ttsdemo.html
Cereproc
https://www.cereproc.com/en/products/voices
The two stages of TTS
PG&E will file schedules on April 20.
1. Text Analysis: text into an intermediate representation
2. Waveform Synthesis: from the intermediate representation into a waveform
Outline
— History of Speech Synthesis
— State of the Art Demos
— Brief Architectural Overview
— Stage 1: Text Analysis
— Text Normalization
— Phonetic Analysis
— Letter-to-sound (grapheme-to-phoneme)
— Prosodic Analysis
Text Normalization
Analysis of raw text into pronounceable words:
— Sentence Tokenization
— Text Normalization
— Identify tokens in text
— Chunk tokens into reasonably sized sections
— Map tokens to words
— Tag the words
Text Processing
— He stole $100 million from the bank
— It’s 13 St. Andrews St.
— The home page is http://www.stanford.edu
— Yes, see you the following tues, that’s 11/12/01
— IV: four, fourth, I.V.
— IRA: I.R.A. or Ira
— 1750: seventeen fifty (date, address) or one thousand seven… (dollars)
Text Normalization Steps
1. Identify tokens in text
2. Chunk tokens
3. Identify types of tokens
4. Convert tokens to words
Step 1: identify tokens and chunk
— Whitespace can be viewed as separators
— Punctuation can be separated from the raw tokens
— For example, Festival converts text into an ordered list of tokens, each with features:
— its own preceding whitespace
— its own succeeding punctuation
Important issue in tokenization:
end-of-utterance detection
— Relatively simple if the utterance ends in “?” or “!”
— But what about the ambiguity of “.”?
— Ambiguous between end-of-utterance, end-of-abbreviation, and both
— My place on Main St. is around the corner.
— I live at 123 Main St.
— (Not “I live at 151 Main St..”)
Rules/features for end-of-utterance
detection
— A dot with one or two letters is an abbrev
— A dot with 3 cap letters is an abbrev
— An abbrev followed by 2 spaces and a capital letter is an end-of-utterance
— Non-abbrevs followed by a capitalized word are breaks
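The four heuristics above can be sketched in a few lines. This is an illustrative Python sketch, not Festival's actual code; `is_abbreviation` and `is_sentence_break` are made-up names:

```python
import re

def is_abbreviation(token):
    # A dot with one or two letters ("St.", "a.") or 3 capital letters ("USA.")
    word = token.rstrip(".")
    return (token.endswith(".") and
            (len(word) <= 2 or (len(word) == 3 and word.isupper())))

def is_sentence_break(token, following_text):
    """token ends the candidate sentence; following_text is the raw text after it."""
    if not token.endswith("."):
        return False
    if is_abbreviation(token):
        # abbrev + two spaces + capital letter => end-of-utterance
        return bool(re.match(r"  [A-Z]", following_text))
    # non-abbrev followed by a capitalized word => break
    return bool(re.match(r"\s*[A-Z]", following_text))
```

As the next slide shows, real systems encode heuristics like these as a decision tree rather than straight-line code.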
Simple Decision Tree
The Festival hand-built decision tree for
end-of-utterance
((n.whitespace matches ".*\n.*\n[\n]*") ;; A significant break in text
 ((1))
 ((punc in ("?" ":" "!"))
  ((1))
  ((punc is ".")
   ;; This is to distinguish abbreviations vs periods
   ;; These are heuristics
   ((name matches "\\(.*\\..*\\|[A-Z][A-Za-z]?[A-Za-z]?\\|etc\\)")
    ((n.whitespace is " ")
     ((0)) ;; if abbrev, single space enough for break
     ((n.name matches "[A-Z].*")
      ((1))
      ((0))))
    ((n.whitespace is " ") ;; if it doesn't look like an abbreviation
     ((n.name matches "[A-Z].*") ;; single sp. + non-cap is no break
      ((1))
      ((0)))
     ((1))))
   ((0)))))
Problems with the previous
decision tree
— Fails for
— Cog. Sci. Newsletter
— Lots of cases at end of line
— Badly spaced/capitalized sentences
More sophisticated decision tree
features
— Prob(word with “.” occurs at end-of-s)
— Prob(word after “.” occurs at begin-of-s)
— Length of word with “.”
— Length of word after “.”
— Case of word with “.”: Upper, Lower, Cap, Number
— Case of word after “.”: Upper, Lower, Cap, Number
— Punctuation after “.” (if any)
— Abbreviation class of word with “.” (month name, unit-of-measure, title, address name, etc.)
From Richard Sproat slides
Sproat EOS tree
From Richard Sproat slides
Some good references on end-of-sentence detection
David Palmer and Marti Hearst. 1997. Adaptive Multilingual Sentence Boundary Disambiguation. Computational Linguistics 23(2):241-267.
David Palmer. 2000. Tokenisation and Sentence Segmentation. In “Handbook of Natural Language Processing”, edited by Dale, Moisl, Somers.
Steps 3+4: Identify Types of Tokens,
and Convert Tokens to Words
— Pronunciation of numbers often depends on type. Three ways to pronounce 1776:
— Date: seventeen seventy six
— Phone number: one seven seven six
— Quantifier: one thousand seven hundred (and) seventy six
— Also:
— 25 Day: twenty-fifth
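The three readings of 1776 can be sketched with a small expander. This is a minimal sketch (not a Festival module); the type tags NYER, NTEL, NUM are borrowed from the Sproat et al. taxonomy covered later in this lecture:

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
TENS = ["", "ten", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
         "seventeen", "eighteen", "nineteen"]

def two_digits(n):
    # Read 0-99 as words
    if n < 10: return ONES[n]
    if n < 20: return TEENS[n - 10]
    return TENS[n // 10] + (" " + ONES[n % 10] if n % 10 else "")

def expand_number(digits, kind):
    n = int(digits)
    if kind in ("NTEL", "NDIG"):            # string of digits (phone numbers)
        return " ".join(ONES[int(d)] for d in digits)
    if kind == "NYER":                      # pairs of digits (years)
        return two_digits(n // 100) + " " + two_digits(n % 100)
    if kind == "NUM":                       # cardinal quantifier
        words = []
        if n >= 1000:
            words.append(ONES[n // 1000] + " thousand"); n %= 1000
        if n >= 100:
            words.append(ONES[n // 100] + " hundred"); n %= 100
        if n:
            words.append(two_digits(n))
        return " ".join(words)
```

The hard part, of course, is choosing `kind` in the first place, which is the classification problem discussed below.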
Festival rule for dealing with “$1.2 million”
(define (token_to_words utt token name)
  (cond
   ((and (string-matches name "\\$[0-9,]+\\(\\.[0-9]+\\)?")
         (string-matches (utt.streamitem.feat utt token "n.name")
                         ".*illion.?"))
    (append
     (builtin_english_token_to_words utt token (string-after name "$"))
     (list
      (utt.streamitem.feat utt token "n.name"))))
   ((and (string-matches (utt.streamitem.feat utt token "p.name")
                         "\\$[0-9,]+\\(\\.[0-9]+\\)?")
         (string-matches name ".*illion.?"))
    (list "dollars"))
   (t
    (builtin_english_token_to_words utt token name))))
Rules versus machine learning
— Rules/patterns
— Simple
— Quick
— Can be more robust
— Easy to debug, inspect, and pass on
— Vastly preferred in commercial systems
— Machine Learning
— Works for complex problems where rules are hard to write
— Higher accuracy in general
— Worse generalization to very different test sets
— Vastly preferred in academia
Machine learning for Text Normalization
Sproat, R., Black, A., Chen, S., Kumar, S., Ostendorf, M., and Richards, C. 2001. Normalization of Non-standard Words. Computer Speech and Language 15(3):287-333.
— NSW examples:
— Numbers: 123, 12 March 1994
— Abbreviations, contractions, acronyms: approx., mph, ctrl-C, US, pp., lb
— Punctuation conventions: 3-4, +/-, and/or
— Dates, times, URLs, etc.
How common are NSWs?
— Word not in the lexicon, or with non-alphabetic characters (Sproat et al. 2001, before SMS/Twitter):

Text Type    % NSW
novels        1.5%
presswire     4.9%
e-mail       10.7%
recipes      13.7%
classified   17.9%
How common are NSWs?
— Word not in the gnu aspell dictionary (Han, Cook, Baldwin 2013), not counting @mentions, #hashtags, urls
— Twitter: 15% of tweets have 50% or more OOV
Twitter variants
Han, Cook, Baldwin 2013

Category                   Example         %
Letter changed (deleted)   shuld           72%
Slang                      lol             12%
Other                      sucha           10%
Number Substitution        4 (“for”)        3%
Letter and Number          b4 (“before”)    2%
State of the Art for Twitter
normalization
— Simple one-to-one normalization (map OOV to a single IV word)
— Han, Cook, and Baldwin 2013
— Create a type-based dictionary of mappings:
— tmrw -> tomorrow
— Learned over a large corpus by combining distributional similarity and string similarity
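At run time, one-to-one normalization with a learned dictionary reduces to a lookup over OOV tokens. A minimal sketch (the dictionary entries and vocabulary here are toy examples, not Han et al.'s actual resources):

```python
def normalize_tweet(tokens, norm_dict, vocab):
    # One-to-one normalization: replace each OOV token that has a
    # dictionary entry; leave in-vocabulary tokens (and unknown OOVs) alone.
    out = []
    for tok in tokens:
        if tok not in vocab and tok in norm_dict:
            out.append(norm_dict[tok])
        else:
            out.append(tok)
    return out

# Toy dictionary and vocabulary for illustration
norm_dict = {"tmrw": "tomorrow", "b4": "before", "shuld": "should"}
vocab = {"see", "you", "the", "movie"}
```

The learning problem is building `norm_dict`; applying it is trivial.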
4 steps to Sproat et al. algorithm
— Splitter (on whitespace, or also within words (“AltaVista”))
— Type identifier: for each split token, identify its type
— Token expander: for each typed token, expand to words
— Deterministic for numbers, dates, money, letter sequences
— Only hard (nondeterministic) for abbreviations
— Language Model: to select between alternative pronunciations
From Alan Black slides
Step 1: Splitter
— Letter/number conjunctions (WinNT, SunOS, PC110)
— Hand-written rules in two parts:
— Part I: group things not to be split (numbers, etc.; including commas in numbers, slashes in dates)
— Part II: apply rules:
— At transitions from lower to upper case
— After the penultimate upper-case char in transitions from upper to lower case
— At transitions from digits to alpha
— At punctuation
From Alan Black Slides
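The case and digit transition rules above map naturally onto lookaround regexes. This is an illustrative sketch, not Sproat et al.'s actual rules; it handles Part I only implicitly (commas inside numbers have no case/digit transition, so they survive), and it omits the general punctuation rule:

```python
import re

def split_token(tok):
    # lower -> upper transition: WinNT -> Win NT, AltaVista -> Alta Vista
    tok = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", tok)
    # after penultimate upper-case char in upper -> lower transitions:
    # NTServer -> NT Server
    tok = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", " ", tok)
    # transitions between digits and alpha (both directions): PC110 -> PC 110
    tok = re.sub(r"(?<=[0-9])(?=[A-Za-z])", " ", tok)
    tok = re.sub(r"(?<=[A-Za-z])(?=[0-9])", " ", tok)
    return tok.split()
```

Note that "3,000" passes through unsplit because none of the transition patterns match inside it.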
Step 2: Classify token into 1 of 20 types
EXPN: abbrev, contractions (adv, N.Y., mph, gov’t)
LSEQ: letter sequence (CIA, D.C., CDs)
ASWD: read as word, e.g. CAT, proper names
MSPL: misspelling
NUM: number (cardinal) (12, 45, 1/2, 0.6)
NORD: number (ordinal), e.g. May 7, 3rd, Bill Gates II
NTEL: telephone (or part), e.g. 212-555-4523
NDIG: number as digits, e.g. Room 101
NIDE: identifier, e.g. 747, 386, I5, PC110
NADDR: number as street address, e.g. 5000 Pennsylvania
NZIP, NTIME, NDATE, NYER, MONEY, BMONY, PRCT, URL, etc.
SLNT: not spoken (KENT*REALTY)
More about the types
4 categories for alphabetic sequences:
EXPN: expand to full word(s) (fplc = fireplace, NY = New York)
LSEQ: say as a letter sequence (IBM)
ASWD: say as a standard word (either OOV or acronyms)
5 main ways to read numbers:
Cardinal (quantities)
Ordinal (dates)
String of digits (phone numbers)
Pairs of digits (years)
Trailing unit: serial until the last non-zero digit: 8765000 is “eight seven six five thousand” (phone #s, long addresses)
But still exceptions: (947-3030, 830-7056)
Type identification classifier
Hand-label a large training set, then build a classifier.
Example features for alphabetic tokens:
P(o|t) for t in ASWD, LSEQ, EXPN, from a trigram letter model:

p(o | t) = ∏_{i=1}^{N} p(l_i | l_{i−1}, l_{i−2})

P(t) from counts of each tag in text
P(o) is a normalization factor
P(t|o) = p(o|t) p(t) / p(o)
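The Bayes computation above can be sketched directly: train one letter trigram model per type, then pick the type maximizing log P(o|t) + log P(t). A minimal sketch with made-up function names and add-one smoothing (the actual system's smoothing is not specified on this slide):

```python
import math
from collections import defaultdict

def train_letter_trigram(words):
    """Train an add-one-smoothed letter trigram model; returns a function
    computing the log probability of a whole token."""
    counts, context = defaultdict(int), defaultdict(int)
    alphabet = set()
    for w in words:
        padded = "##" + w.lower() + "$"     # pad with start/end symbols
        alphabet.update(padded)
        for i in range(2, len(padded)):
            counts[padded[i - 2:i + 1]] += 1
            context[padded[i - 2:i]] += 1
    V = len(alphabet)
    def logprob(token):
        padded = "##" + token.lower() + "$"
        return sum(math.log((counts[padded[i - 2:i + 1]] + 1) /
                            (context[padded[i - 2:i]] + V))
                   for i in range(2, len(padded)))
    return logprob

def classify(token, models, priors):
    # Bayes rule: argmax_t log P(o|t) + log P(t); P(o) is constant over t
    return max(priors, key=lambda t: models[t](token) + math.log(priors[t]))
```

Ordinary words and letter sequences have very different letter trigram statistics, which is what the per-type models capture.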
Type identification algorithm
— Hand-written context-dependent rules:
— List of lexical items (Act, Advantage, amendment) after which Roman numerals are read as cardinals, not ordinals
— Classifier accuracy:
— 98.1% on news data
— 91.8% on email
Step 3: expand NSW tokens by type-specific rules
— ASWD expands to itself
— LSEQ expands to a list of words, one for each letter
— NUM expands to a string of words representing the cardinal
— NYER expands to 2 pairs of NUM digits…
— NTEL: string of digits with silence for punctuation
In practice
— NSWs are still mainly done by rule in TTS systems
— Moving towards ML approaches
— Labeled data is not readily available for the task
Homographs
— Homographs are words with the same spelling but different pronunciations. This makes even standard words difficult to pronounce. For example, the two forms of use, live, and bass:
— It’s no use (/y uw s/) to ask to use (/y uw z/) the telephone.
— Do you live (/l ih v/) near a zoo with live (/l ay v/) animals?
— I prefer bass (/b ae s/) fishing to playing the bass (/b ey s/) guitar.
— French homographs include fils (which has two pronunciations, [fis] ‘son’ versus [fil] ‘thread’), and the multiple pronunciations for fier (‘proud’ or ‘to trust’) and est (‘is’ or ‘East’) (Divay and Vitale, 1997).
— Luckily for the task of homograph disambiguation, homographs in English (as well as in similar languages like French and German) tend to have different parts of speech: the two forms of use above are (respectively) a noun and a verb, while the two forms of live are (respectively) a verb and an adjective.
— Indeed, Liberman and Church (1992) showed that many of the most frequent homographs in 44 million words of AP newswire are disambiguatable just by using part of speech (the most frequent 15 in order are use, increase, close, record, house, contract, lead, live, lives, protest, survey, project, separate, present, read).
— Because knowledge of part of speech is sufficient for these homographs, in practice we perform homograph disambiguation via part-of-speech labeling.

Some interesting systematic relationships between homographs: final consonant voicing (noun /s/ versus verb /z/), stress shift (noun initial stress versus verb final stress), and final vowel weakening (-ate words with final /ax/ as nouns/adjectives versus final /ey/ as verbs):

Final voicing: N (/s/), V (/z/)
  use    y uw s      y uw z
  close  k l ow s    k l ow z
  house  h aw s      h aw z
Stress shift: N (init. stress), V (fin. stress)
  record  r eh1 k axr0 d   r ix0 k ao1 r d
  insult  ih1 n s ax0 l t  ix0 n s ah1 l t
  object  aa1 b j eh0 k t  ax0 b j eh1 k t
-ate final vowel: N/A (final /ax/), V (final /ey/)
  estimate  eh s t ih m ax t  eh s t ih m ey t
  separate  s eh p ax r ax t  s eh p ax r ey t
  moderate  m aa d ax r ax t  m aa d ax r ey t
Homograph disambiguation
• 19 most frequent homographs, from Liberman and Church 1992
• Counts are per million, from an AP news corpus of 44 million words
• Not a huge problem, but still important

use       319    survey     91
increase  230    project    90
close     215    separate   87
record    195    present    80
house     150    read       72
contract  143    subject    68
lead      131    rebel      48
live      130    finance    46
lives     105    estimate   46
protest    94
POS Tagging for homograph
disambiguation
— Many homographs can be distinguished by POS:

use    y uw s    y uw z
close  k l ow s  k l ow z
house  h aw s    h aw z
live   l ay v    l ih v
REcord    reCORD
INsult    inSULT
OBject    obJECT
OVERflow  overFLOW
DIScount  disCOUNT
CONtent   conTENT

— POS tagging is also useful for the CONTENT/FUNCTION word distinction, which is useful for phrasing
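Once a POS tagger has run, homograph resolution is a lookup keyed on (word, POS). A minimal sketch with a toy lexicon drawn from the slide (the coarse POS tags and the `pronounce` helper are illustrative, not from any real TTS system):

```python
# Pronunciations (ARPAbet-style strings from the slide) keyed by (word, coarse POS)
HOMOGRAPHS = {
    ("use", "N"): "y uw s",      ("use", "V"): "y uw z",
    ("close", "N"): "k l ow s",  ("close", "V"): "k l ow z",
    ("house", "N"): "h aw s",    ("house", "V"): "h aw z",
    ("live", "ADJ"): "l ay v",   ("live", "V"): "l ih v",
}

def pronounce(word, pos, lexicon=HOMOGRAPHS):
    """Return the pronunciation for a (word, POS) pair, or None if unknown."""
    return lexicon.get((word.lower(), pos))
```

Words not in the homograph table would fall through to the regular dictionary or letter-to-sound rules discussed next.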
Festival
— Open source speech synthesis system
— Designed for development and runtime use
— Used in many commercial and academic systems
— Hundreds of thousands of users
— Multilingual
— No built-in language
— Designed to allow addition of new languages
— Additional tools for rapid voice development
— Statistical learning tools
— Scripts for building models
Text from Richard Sproat
Festival as software
— http://festvox.org/festival/
— General system for multi-lingual TTS
— C/C++ code with a Scheme scripting language
— General replaceable modules:
— Lexicons, LTS, duration, intonation, phrasing, POS tagging, tokenizing, diphone/unit selection, signal processing
— General tools:
— Intonation analysis (F0, Tilt), signal processing, CART building, N-gram, SCFG, WFST
Text from Richard Sproat
CMU FestVox project
— Festival is an engine; how do you make voices?
— Festvox: building synthetic voices:
— Tools, scripts, documentation
— Discussion and examples for building voices
— Example voice databases
— Step-by-step walkthroughs of processes
— Support for English and other languages
— Support for different waveform synthesis methods
— Diphone
— Unit selection
1/5/07
Text from Richard Sproat
Synthesis tools
— I want my computer to talk
— Festival Speech Synthesis
— I want my computer to talk in my voice
— FestVox Project
— I want it to be fast and efficient
— Flite
Text from Richard Sproat
Dictionaries
— CMU dictionary: 127K words
— http://www.speech.cs.cmu.edu/cgi-bin/cmudict
— Unisyn dictionary
— Significantly more accurate, includes multiple dialects
— http://www.cstr.ed.ac.uk/projects/unisyn/
Dictionaries aren’t always sufficient: Unknown words
— Go up with the square root of the # of words in the text
— Mostly person, company, product names
— From a Black et al. analysis:
— 5% of tokens in a sample of WSJ were not in the OALD dictionary:
— 77%: Names
— 20%: Other unknown words
— 4%: Typos
— So commercial systems have a 3-part system:
— Big dictionary
— Special code for handling names
— Machine-learned LTS system for other unknown words
Names
— A big problem area is names
— Names are common
— 20% of tokens in typical newswire text
— Spiegel (2003) estimate of US names:
— 2 million surnames
— 100,000 first names
— Personal names: McArthur, D’Angelo, Jiminez, Rajan, Raghavan, Sondhi, Xu, Hsu, Zhang, Chang, Nguyen
— Company/brand names: Infinit, Kmart, Cytyc, Medamicus, Inforte, Aaon, Idexx Labs, Bebe
Methods for Names
— Can do morphology (Walters -> Walter, Lucasville)
— Can write stress-shifting rules (Jordan -> Jordanian)
— Rhyme analogy: Plotsky by analogy with Trotsky (replace tr with pl)
— Liberman and Church: for the 250K most common names, got 212K (85%) from these modified dictionary methods, used LTS for the rest
— Can do automatic country detection (from letter trigrams) and then apply country-specific rules
Letter to Sound Rules
— AKA Grapheme-to-Phoneme (G2P)
— Generally machine learning, induced from a dictionary
— Pick your favorite machine learning tool and go for it
— Earlier work: (Black et al. 1998)
— Two steps: alignment and (CART-based) rule induction
Alignment
Letters: c  h  e  c  k  e  d
Phones:  ch _  eh _  k  _  t
— Black et al. Method 1 (EM):
— First scatter epsilons in all possible ways to cause letters and phones to align
— Then collect stats for P(phone|letter) and select the best to generate new stats
— Iterate 5-6 times until it settles
Alignment
— Black et al. Method 2:
— Hand-specify which letters can be rendered as which phones
— C goes to k/ch/s/sh
— W goes to w/v/f, etc.
— An actual list:
— Once the mapping table is created, find all valid alignments, find p(letter|phone), score all alignments, take the best
Alignment
— Some alignments will turn out to be really bad.
— These are just the cases where the pronunciation doesn’t match the letters:
Dept        d ih p aa r t m ah n t
CMU         s iy eh m y uw
Lieutenant  l eh f t eh n ax n t (British)
— Or foreign words
— These can just be removed from alignment training
— Also remove, and deal separately with, names and acronyms
Black et al. Classifier
A CART tree for each letter in the alphabet (26 plus accented), using a context of ±3 letters:
# # # c h e c -> ch
c h e c k e d -> _
— This produces 92-96% correct LETTER accuracy (58-75% word accuracy) for English
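The ±3-letter windows shown above are easy to extract. A minimal sketch (the `letter_context` name is illustrative; Black et al.'s actual feature extraction lives inside Festival):

```python
def letter_context(word, i, window=3):
    """Feature vector for predicting the phone of letter i of word:
    the letter plus +-window letters of context, padded with '#'."""
    padded = "#" * window + word + "#" * window
    j = i + window                     # index of letter i in the padded string
    return list(padded[j - window: j + window + 1])
```

Each such 7-letter window, paired with its aligned phone (or epsilon), is one training example for that letter's CART.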
Modern systems: more powerful
classifiers, more features
Predicting Intonation in TTS
Prominence/Accent: decide which words are accented, which syllable has the accent, what sort of accent
Boundaries: decide where the intonational boundaries are
Duration: specify the length of each segment
F0: generate the F0 contour from these
Stress vs. accent
— Stress: structural property of a word
— Fixed in the lexicon: marks a potential (arbitrary) location for an accent to occur, if there is one
— Accent: property of a word in context
— Context-dependent. Marks important words in the discourse

[Metrical grids for “vitamins” and “California”, with rows marking, bottom to top: syllables, full vowels, stressed syllable, (accented syllable)]
Stress vs. accent (2)
— The speaker decides to make the word vitamin more prominent by accenting it.
— Lexical stress tells us that this prominence will appear on the first syllable, hence VItamin.
— So we will have to look at both the lexicon and the context to predict the details of prominence
I’m a little surprised to hear it characterized as upbeat
Levels of prominence
— Most phrases have more than one accent
— Nuclear accent: the last accent in a phrase, perceived as more prominent
— Plays a semantic role, like indicating a word is contrastive or focused
— Modeled via ***s in IM, or capitalized letters
— ‘I know SOMETHING interesting is sure to happen,’ she said
— Can also have reduced words that are less prominent than usual (especially function words)
— Sometimes 4 classes of prominence are used:
— Emphatic accent, pitch accent, unaccented, reduced
Pitch accent prediction
— Which words in an utterance should bear accent?
→ i believe at ibm they make you wear a blue suit.
→ i BELIEVE at IBM they MAKE you WEAR a BLUE SUIT.
→ 2001 was a good movie, if you had read the book.
→ 2001 was a good MOVIE, if you had read the BOOK.
Broadcast News Example
Hirschberg (1993)
SUN MICROSYSTEMS INC, the UPSTART COMPANY that HELPED LAUNCH the DESKTOP COMPUTER industry TREND TOWARD HIGH powered WORKSTATIONS, was UNVEILING an ENTIRE OVERHAUL of its PRODUCT LINE TODAY. SOME of the new MACHINES, PRICED from FIVE THOUSAND NINE hundred NINETY five DOLLARS to seventy THREE thousand nine HUNDRED dollars, BOAST SOPHISTICATED new graphics and DIGITAL SOUND TECHNOLOGIES, HIGHER SPEEDS AND a CIRCUIT board that allows FULL motion VIDEO on a COMPUTER SCREEN.
Predicting Pitch Accent: Part of
speech
— Content words are usually accented
— Function words are rarely accented
— Of, for, in, on, that, the, a, an, no, to, and, but, or, will, may, would, can, her, is, their, its, our, there, is, am, are, was, were, etc.
— Baseline algorithm: accent all content words.
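The baseline fits in a few lines: treat everything outside a function-word list as a content word and accent it. A minimal sketch using the (abbreviated) stop list from the slide:

```python
# Function words, abbreviated from the slide; a real list would be longer
FUNCTION_WORDS = {"of", "for", "in", "on", "that", "the", "a", "an", "no",
                  "to", "and", "but", "or", "will", "may", "would", "can",
                  "her", "is", "their", "its", "our", "there", "am", "are",
                  "was", "were"}

def accent_baseline(words):
    # Accent every word that is not a function word
    return [(w, w.lower() not in FUNCTION_WORDS) for w in words]
```

The rest of this section is about the many factors (contrast, information status, word order) that make this baseline wrong.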
Factors in accent prediction:
Contrast
Legumes are a poor source of VITAMINS
No, legumes are a GOOD source of vitamins
I think JOHN or MARY should go
No, I think JOHN AND MARY should go
Factors in Accent Prediction:
List intonation
I went and saw ANNA, LENNY, MARY, and NORA.
Factors: Word order
Preposed items are accented more frequently:
TODAY we will BEGIN to LOOK at FROG anatomy.
We will BEGIN to LOOK at FROG anatomy today.
Factor: Information
— New information in an answer is often accented
Q1: What types of foods are a good source of vitamins?
A1: LEGUMES are a good source of vitamins.
Q2: Are legumes a source of vitamins?
A2: Legumes are a GOOD source of vitamins.
Q3: I’ve heard that legumes are healthy, but what are they a good source of?
A3: Legumes are a good source of VITAMINS.
Slide from Jennifer Venditti
Factors: Information Status (2)
New versus old information: old information is deaccented.
There are LAWYERS, and there are GOOD lawyers.
EACH NATION DEFINES its OWN national INTEREST.
I LIKE GOLDEN RETRIEVERS, but MOST dogs LEAVE me COLD.
Complex Noun Phrase Structure
Sproat, R. 1994. English noun-phrase accent prediction for text-to-speech. Computer Speech and Language 8:79-94.
— Proper names: stress on the right-most word
New York CITY; Paris, FRANCE
— Adjective-noun combinations: stress on the noun
Large HOUSE, red PEN, new NOTEBOOK
— Noun-noun compounds: stress the left noun
HOT dog (food) versus adj-N HOT DOG (overheated animal)
WHITE house (place) versus adj-N WHITE HOUSE (painted)
— Other:
MEDICAL Building, APPLE cake, cherry PIE, kitchen TABLE
Madison avenue, Park street?
— Rule: Furniture + Room -> RIGHT
Other features
— POS
— POS of previous word
— POS of next word
— Stress of current, previous, next syllable
— Unigram probability of word
— Bigram probability of word
— Position of word in sentence
Advanced features
— Accent is often deflected away from a word due to focus on a neighboring word.
— Could use syntactic parallelism to detect this kind of contrastive focus:
— … driving [FIFTY miles] an hour in a [THIRTY mile] zone
— [WELD] [APPLAUDS] mandatory recycling. [SILBER] [DISMISSES] recycling goals as meaningless.
— … but while Weld may be [LONG] on people skills, he may be [SHORT] on money
Accent prediction
— Useful features include (starting with Hirschberg 1993):
— Lexical class (function words, clitics not accented)
— Word frequency
— Word identity (promising but problematic)
— Given/New, Theme/Rheme
— Focus
— Word bigram predictability
— Position in phrase
— Complex nominal rules (Sproat)
— Combined in a classifier:
— Decision trees (Hirschberg 1993), bagging/boosting (Sun 2002)
— Hidden Markov models (Hasegawa-Johnson et al. 2005)
— Conditional random fields (Gregory and Altun 2004)
But best single feature: Accent ratio dictionary
A. Nenkova, J. Brenier, A. Kothari, S. Calhoun, L. Whitton, D. Beaver, and D. Jurafsky. 2007. To memorize or to predict: Prominence labeling in conversational speech. In Proceedings of NAACL-HLT.
— Of all the times this word occurred, what percentage were accented?
— Memorized from a labeled corpus
— 60 Switchboard conversations (Ostendorf et al. 2001)
— Given:
— k: number of times a word is prominent
— n: all occurrences of the word

AccentRatio(w) = k/n, if B(k, n, 0.5) < 0.05

where B(k, n, 0.5) is the p-value of a binomial test with success probability 0.5.
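A sketch of the accent-ratio computation, with an exact two-sided binomial test implemented from scratch; the fallback of 0.5 for words without a significant prominence bias follows my reading of the Nenkova et al. paper, and the function names are illustrative:

```python
import math

def binomial_test(k, n, p=0.5):
    """Exact two-sided binomial test: sum the probabilities of all outcomes
    no more likely than the observed count k."""
    pmf = [math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    return sum(q for q in pmf if q <= pmf[k] + 1e-12)

def accent_ratio(k, n):
    # k/n when the word's prominence behavior differs significantly from
    # chance; otherwise a neutral 0.5
    return k / n if binomial_test(k, n) < 0.05 else 0.5
```

A word accented 90 times out of 100 gets an accent ratio of 0.9; a word accented 5 times out of 10 is indistinguishable from chance and stays at 0.5.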
Accent Ratio
— Conversational speech dictionaries:
— http://www.cis.upenn.edu/~nenkova/AR.sign
— Read news dictionaries:
— http://www.cis.upenn.edu/~nenkova/buAR.sign
— Accent ratio classifier:
— a word is not-prominent if AR < 0.38
— words not in the AR dictionary are labeled “prominent”
— Performance:
— Best single predictor of accent; adding all other features only helps 1% on the prediction task
— Improves TTS:
— Volker Strom, Ani Nenkova, Robert Clark, Yolanda Vazquez-Alvarez, Jason Brenier, Simon King, and Dan Jurafsky. 2007. Modelling Prominence and Emphasis Improves Unit-Selection Synthesis. Interspeech.
Predicting Intonation in TTS
Prominence/Accent: decide which words are accented, which syllable has the accent, what sort of accent
Boundaries: decide where the intonational boundaries are
Duration: specify the length of each segment
F0: generate the F0 contour from these
Predicting Boundaries:
Full || versus intermediate |
Ostendorf and Veilleux. 1994. “Hierarchical Stochastic Model for Automatic Prediction of Prosodic Boundary Location”, Computational Linguistics 20(1).
Computer phone calls, || which do everything | from selling magazine subscriptions || to reminding people about meetings || have become the telephone equivalent | of junk mail. ||
Doctor Norman Rosenblatt, || dean of the college | of criminal justice at Northeastern University, || agrees. ||
For WBUR, || I’m Margo Melnicove.
Simplest CART
(set! simple_phrase_cart_tree
 '((lisp_token_end_punc in ("?" "." ":"))
   ((BB))
   ((lisp_token_end_punc in ("'" "\"" "," ";"))
    ((B))
    ((n.name is 0) ;; end of utterance
     ((BB))
     ((NB))))))
More complex features
Ostendorf and Veilleux
— English: boundaries are more likely between content words and function words
— Syntactic structure (parse trees)
— Largest syntactic category dominating the preceding word but not the succeeding word
— How many syntactic units begin/end between the words
— Type of function word to the right
— Capitalized names
— # of content words since the previous function word
Ostendorf and Veilleux CART
Predicting Intonation in TTS
Prominence/Accent: decide which words are accented, which syllable has the accent, what sort of accent
Boundaries: decide where the intonational boundaries are
Duration: specify the length of each segment
F0: generate the F0 contour from these
Duration
— Simplest: a fixed size for all phones (100 ms)
— Next simplest: the average duration for that phone (from training data). Samples from Switchboard, in ms:

aa  118    b   68
ax   59    d   68
ay  138    dh  44
eh   87    f   90
ih   77    g   66

— Next next simplest: add in phrase-final and initial lengthening, plus stress
Klatt duration rules
Models how the context-neutral duration of a phone is lengthened or shortened by context, while staying above a minimum duration dmin:
— Prepausal lengthening: a vowel before a pause is lengthened by 1.4
— Non-phrase-final shortening: segments that are not phrase-final are shortened by 0.6. Phrase-final postvocalic liquids and nasals are lengthened by 1.4
— Unstressed shortening: an unstressed segment’s minimum duration dmin is halved, and it is shortened by 0.7 for most phone types
— Lengthening for accent: a vowel with accent is lengthened by 1.4
— Shortening in clusters: a consonant followed by a consonant is shortened by 0.5
— Pre-voiceless shortening: vowels are shortened before a voiceless plosive by 0.7
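A sketch of how these multiplicative factors combine. Following Klatt's formula (next slide), the scaling is applied to the portion of the inherent duration above the minimum, d = dmin + (inherent − dmin) × prcnt; the rule subset and flag names here are illustrative, not the full Klatt rule set:

```python
def klatt_duration(inherent, dmin, *, prepausal=False, phrase_final=True,
                   stressed=True, accented=False, pre_voiceless=False):
    """Duration (ms) of a phone with inherent duration and minimum dmin,
    under a subset of the Klatt-style rules from the slide."""
    prcnt = 1.0
    if prepausal:
        prcnt *= 1.4          # prepausal lengthening
    if not phrase_final:
        prcnt *= 0.6          # non-phrase-final shortening
    if not stressed:
        dmin = dmin / 2       # unstressed halves the minimum duration
        prcnt *= 0.7          # ... and shortens by 0.7
    if accented:
        prcnt *= 1.4          # lengthening for accent
    if pre_voiceless:
        prcnt *= 0.7          # pre-voiceless shortening
    return dmin + (inherent - dmin) * prcnt
```

Note the factors compose multiplicatively, so an accented prepausal vowel gets 1.4 × 1.4 applied to its compressible portion.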
Klatt duration rules
— Klatt formula for phone durations
— Festival: 2 options
— Klatt rules
— Use a labeled training set with Klatt features to train a CART:
— Identity of the left and right context phones
— Lexical stress and accent values of the current phone
— Position in syllable, word, phrase
— Following pause
Duration: state of the art
— Lots of fancy models of duration prediction:
— Using z-scores and other clever normalizations
— Sum-of-products model
— New features like word predictability:
— Words with higher bigram probability are shorter
Outline
— History, Demos
— Architectural Overview
— Stage 1: Text Analysis
— Text Normalization
— Tokenization
— End of sentence detection
— Homograph disambiguation
— Letter-to-sound (grapheme-to-phoneme)
— Prosody
— Predicting Accents
— Predicting Boundaries
— Predicting Duration
— Generating F0