
This lecture
• Automatic speech recognition (ASR),
• Applying HMMs to ASR,
• Practical aspects of ASR, and
• Levenshtein distance.
Consider what we want speech to do
• Telephony: "Buy ticket… AC490… yes"
• Dictation: "My hands are in the air."
• Multimodal interaction: "Put this there."
Can we just use GMMs?
Speech is dynamic
• Speech changes over time.
• GMMs are good for high-level clustering, but they encode no notion of order, sequence, or time.
• Speech is an expression of language.
• We want to incorporate knowledge of how phonemes and words are ordered with language models.
Speech is sequences of phonemes (*)
/ow p ah n dh ah p aa d b ey d ao r z/
open(podBay.doors);
"open the pod bay doors"
We want to convert a series of MFCC vectors into a sequence of phonemes.
(*) not really
Phoneme dictionaries
• There are many phonemic dictionaries that map words to pronunciations (i.e., lists of phoneme sequences).
• The CMU dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict) is popular.
  • 127K words transcribed with the ARPAbet.
  • Includes some rudimentary prosody markers.
…
EVOLUTION      EH2 V AH0 L UW1 SH AH0 N
EVOLUTION(2)   IY2 V AH0 L UW1 SH AH0 N
EVOLUTION(3)   EH2 V OW0 L UW1 SH AH0 N
EVOLUTION(4)   IY2 V OW0 L UW1 SH AH0 N
EVOLUTIONARY   EH2 V AH0 L UW1 SH AH0 N EH2 R IY0
…
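The dictionary's plain-text format is easy to process. Below is a minimal sketch (not part of the lecture) of parsing entries in the layout shown above; the function name and sample string are illustrative.

```python
def parse_cmudict(lines):
    """Map each word to its list of ARPAbet pronunciations.

    Variant entries like 'EVOLUTION(2)' are folded into the base word.
    """
    prons = {}
    for line in lines:
        word, *phones = line.split()
        base = word.split('(')[0]            # strip variant markers
        prons.setdefault(base, []).append(phones)
    return prons

sample = """EVOLUTION EH2 V AH0 L UW1 SH AH0 N
EVOLUTION(2) IY2 V AH0 L UW1 SH AH0 N""".splitlines()
print(parse_cmudict(sample)['EVOLUTION'])
# [['EH2', 'V', 'AH0', 'L', 'UW1', 'SH', 'AH0', 'N'],
#  ['IY2', 'V', 'AH0', 'L', 'UW1', 'SH', 'AH0', 'N']]
```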
Annotation/transcription
• Speech data must be segmented and annotated in order to be useful to an ASR learning component.
• Programs like Wavesurfer or Praat allow you to demarcate where a phoneme begins and ends in time.
Putting it together?
"open the pod bay doors"
[Figure: a language model and an acoustic model are combined to recognize the sentence]
The noisy channel model for ASR
• Source (language model, P(W)): generates a word sequence W.
• Channel (acoustic model, P(O|W)): renders W as the observed acoustic sequence O.
• Decoder: recovers the most likely word sequence W* from the observation O:

    W* = argmax_W P(O|W) P(W)

How do we encode P(O|W)?
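As a toy illustration (not from the lecture), the decoder's argmax can be written directly over a handful of candidate word sequences; the probabilities below are made-up numbers standing in for real model scores.

```python
import math

# Hypothetical scores: P(O|W) from an acoustic model, P(W) from a language model.
acoustic = {"open the pod bay doors": 1e-8,
            "open the pot bay doors": 3e-8,
            "oh pen the pod bay doors": 2e-9}
language = {"open the pod bay doors": 1e-4,
            "open the pot bay doors": 1e-7,
            "oh pen the pod bay doors": 1e-9}

# W* = argmax_W P(O|W) P(W), computed in log space for numerical stability.
best = max(acoustic, key=lambda w: math.log(acoustic[w]) + math.log(language[w]))
print(best)  # 'open the pod bay doors'
```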
Reminder – discrete HMMs
• Previously we saw discrete HMMs: at each state we observed a discrete symbol from a finite set of discrete symbols.
Example observation distributions, one per state:

word     P(word)   P(word)   P(word)
ship     0.1       0.25      0.3
pass     0.05      0.25      0
camp     0.05      0.05      0
frock    0.6       0.3       0.2
soccer   0.05      0.05      0.05
mother   0.1       0.09      0.05
tops     0.05      0.01      0.4
Continuous HMMs (CHMM)
• A continuous HMM has observations that are distributed over continuous variables.
• Observation probabilities, b_j, are also continuous.
• E.g., here b_0(x⃗) tells us the probability of seeing the (multivariate) continuous observation x⃗ while in state 0.

[Figure: tristate HMM with output densities b_0, b_1, b_2 and an example observation
 x⃗ = (4.32957, 2.48562, 1.08139, …, 0.45628)]
Defining CHMMs
• Continuous HMMs are very similar to discrete HMMs.
• S = {s_1, …, s_N} : set of states (e.g., subphones)
• The observation space is continuous: ℝ^d
• Π = {π_1, …, π_N} : initial state probabilities
• A = {a_ij}, i, j ∈ S : state transition probabilities
• B = {b_j(x⃗)}, j ∈ S, x⃗ ∈ ℝ^d : state output probabilities (i.e., Gaussian mixtures)
yielding
• Q = q_1, …, q_T, q_t ∈ S : state sequence
• O = o_1, …, o_T, o_t ∈ ℝ^d : observation sequence
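Each b_j(x⃗) is a Gaussian mixture, i.e., a weighted sum of multivariate normal densities. A minimal sketch of evaluating such a density, with made-up parameters for a toy 2-dimensional observation space:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, weights, means, covs):
    """b_j(x): a weighted sum of multivariate Gaussian densities."""
    return sum(w * multivariate_normal.pdf(x, mean=mu, cov=cov)
               for w, mu, cov in zip(weights, means, covs))

# A toy 2-component mixture over 2-dimensional observations.
weights = [0.7, 0.3]
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]
print(gmm_density(np.array([0.1, -0.2]), weights, means, covs))
```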
Word-level HMMs?
• Imagine that we want to learn an HMM for each word in our lexicon (e.g., 60K words → 60K HMMs).
• No, thank you! Zipf's law tells us that many words occur very infrequently.
  • 1 (or a few) training examples of a word is not enough to train a model as highly parameterized as a CHMM.
• In a word-level HMM, each state might be a phoneme.

[Figure: word-level HMM with output densities b_0, b_1, b_2]
Phoneme HMMs
• Phonemes change over time – we model these dynamics by building one HMM for each phoneme.
• Tristate phoneme models are popular.
  • The centre state is often the 'steady' part.

[Figure: tristate phoneme model (e.g., /oi/), with output densities b_0, b_1, b_2]
Phoneme HMMs
• We train each phoneme HMM using all sequences of that phoneme.
  • Even from different words.

[Figure: a time-aligned annotation (start frame, end frame, phoneme; e.g., "64 85 ae", "85 96 sh", "96 102 epi", "102 106 m") pairs spans of MFCC observation vectors over time with phonemes such as /eh/, /s/, /iy/, /ih/, and /sh/]
Combining models
• We can learn an N-gram language model from word-level transcriptions of speech data.
  • These models are discrete and are trained using MLE.
• Our phoneme HMMs together constitute our acoustic model.
  • Each phoneme HMM tells us how a phoneme 'sounds'.
• We can combine these models by concatenating phoneme HMMs together according to a known lexicon.
  • We use a word-to-phoneme dictionary.
Combining models
• If we know how phonemes combine to make words, we can simply concatenate our phoneme models together by inserting and adjusting transition weights.
• e.g., Zipf is pronounced /z ih f/, so its word model chains the HMMs for /z/, /ih/, and /f/ (see the sketch below).
(It's a tiny bit more complicated than this – normally phoneme HMMs have special 'handle' states at either end that connect to other HMMs)
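A minimal sketch of this concatenation, assuming each phoneme model is a tristate HMM whose last state can either loop within the phoneme or exit to the next one. The function, matrices, and exit probabilities here are all illustrative, not the lecture's construction.

```python
import numpy as np

def concatenate_phoneme_hmms(transition_mats, exit_probs):
    """Chain tristate phoneme HMMs into one word-level HMM.

    transition_mats: list of (3, 3) within-phoneme transition matrices.
    exit_probs: probability of leaving each phoneme's last state.
    """
    n = 3 * len(transition_mats)
    A = np.zeros((n, n))
    for k, (T, p_exit) in enumerate(zip(transition_mats, exit_probs)):
        s = 3 * k
        A[s:s+3, s:s+3] = T
        # Reserve p_exit of the last state's mass for the next phoneme
        # (for the final phoneme, this mass goes to an implicit end state).
        A[s+2, s:s+3] *= (1.0 - p_exit)
        if k + 1 < len(transition_mats):
            A[s+2, s+3] = p_exit   # last state -> next phoneme's first state
    return A

# /z ih f/: three hypothetical tristate phoneme models with self-loops.
T = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
A_word = concatenate_phoneme_hmms([T, T, T], exit_probs=[0.5, 0.5, 0.5])
print(A_word.shape)  # (9, 9): one chained HMM for the word 'Zipf'
```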
Co-articulation and triphones
• Co-articulation: n. When a phoneme is influenced by adjacent phonemes.
• A triphone HMM captures co-articulation.
• Triphone model /a-b+c/ is phoneme b when preceded by a and followed by c.

Two (of many) triphone HMMs for /t/: /s-t+iy/ and /iy-t+eh/
Combining triphone HMMs
• Triphone models can only connect to other triphone models that 'match'.
• e.g., /z+ih/ → /z-ih+f/ → /ih-f/ (a matching check is sketched below).
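A small sketch of the matching rule, assuming the /a-b+c/ naming convention above (word-initial/final models such as /z+ih/ or /ih-f/ omit one context). The parsing helpers are hypothetical, not a standard toolkit API.

```python
def parse_triphone(name):
    """Split a label like '/z-ih+f/' into (left, base, right) contexts."""
    core = name.strip('/')
    left, _, rest = core.rpartition('-') if '-' in core else ('', '', core)
    base, _, right = rest.partition('+')
    return left or None, base, right or None

def can_connect(a, b):
    """Triphone a may precede b iff a's base matches b's left context
    and a's right context matches b's base."""
    _, base_a, right_a = parse_triphone(a)
    left_b, base_b, _ = parse_triphone(b)
    return right_a == base_b and (left_b is None or left_b == base_a)

print(can_connect('/z+ih/', '/z-ih+f/'))   # True
print(can_connect('/z-ih+f/', '/ih-f/'))   # True
print(can_connect('/z+ih/', '/ih-f/'))     # False: /z+ih/ expects an /ih/ model next
```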
Concatenating phoneme models
• We can easily incorporate unigram probabilities through transitions, too.
(From the Jurafsky & Martin text)
Bigram models
(From the Jurafsky & Martin text)
Using CHMMs
• As before, these HMMs are generative models that encode statistical knowledge of how output is generated.
• We train CHMMs with Baum-Welch (a type of expectation-maximization), as we did before with discrete HMMs.
  • Here, the observation parameters, b_j(x⃗), are adjusted using the GMM training 'recipe' from last lecture.
• We find the best state sequences using Viterbi, as before.
  • Here, the best state sequence gives us a sequence of phonemes and words.
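For concreteness, here is a sketch of training and decoding a CHMM with the third-party hmmlearn package (not used in the lecture; shown only as an illustration, with random data standing in for MFCC frames).

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 13))            # stand-in for 500 13-dim MFCC frames
lengths = [250, 250]                       # two training utterances

# Tristate phoneme model with a 4-component GMM per state.
model = GMMHMM(n_components=3, n_mix=4, covariance_type='diag', n_iter=20)
model.fit(X, lengths)                      # Baum-Welch (EM)

log_prob, states = model.decode(X[:250])   # Viterbi: best state sequence
print(states[:10])
```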
Speech recognition architecture

[Figure: pipeline] speech O → cepstral feature extraction → MFCC features → Gaussian mixture models → phoneme likelihoods, P(O|W) → Viterbi decoder (with an HMM lexicon and an N-gram language model, P(W)) → word sequence W, e.g., "…a real poncho"
Speech databases
• Large-vocabulary continuous ASR is meant to encode full conversational speech, with a vocabulary of >64K words.
  • This requires lots of data to train our models.
• The Switchboard corpus contains 2430 conversations spread out over about 240 hours of data (~14 GB).
• The TIMIT database contains 6300 sentences from 630 speakers.
  • Relatively small (~750 MB), but very popular.
• Speech data from conferences (e.g., TED) or from broadcast news tends to be between 3 GB and 30 GB.
Aspects of ASR systems in the world
• Speaking mode: isolated word (e.g., "yes") vs. continuous (e.g., "Siri, ask Cortana for the weather").
• Speaking style: read speech vs. spontaneous speech; the latter contains many dysfluencies (e.g., stuttering, uh, like, …).
• Enrolment: speaker-dependent (all training data from one speaker) vs. speaker-independent (training data from many speakers).
• Vocabulary: small (<20 words) or large (>50,000 words).
• Transducer: cell phone? noise-cancelling microphone? teleconference microphone?
Signal-to-noise ratio
• We are often concerned with the signal-to-noise ratio (SNR), which measures the ratio between the power of a desired signal within a recording (P_signal, e.g., the human speech) and additive noise (P_noise).
• Noise typically includes:
  • Background noise (e.g., people talking, wind),
  • Signal degradation. This is normally 'white' noise produced by the medium of transmission.

    SNR_dB = 10 log_10 (P_signal / P_noise)    (You don't have to memorize this formula.)

• High SNR_dB is >30 dB. Low SNR_dB is <10 dB.
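A quick sketch of the formula in code, assuming we have separate samples of the clean signal and the noise (the sinusoid below is only a stand-in for speech):

```python
import numpy as np

def snr_db(signal, noise):
    """SNR in decibels from samples of the desired signal and the noise."""
    p_signal = np.mean(np.square(signal))   # average power
    p_noise = np.mean(np.square(noise))
    return 10.0 * np.log10(p_signal / p_noise)

t = np.linspace(0, 1, 16000)                  # one second at 16 kHz
speech = np.sin(2 * np.pi * 220 * t)          # stand-in for a speech signal
noise = 0.01 * np.random.randn(t.size)        # additive white noise
print(f"{snr_db(speech, noise):.1f} dB")      # well above 30 dB: 'high' SNR
```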
Audio-visual speech methods
• Observing the vocal tract directly, rather than through inference, can be very helpful in automatic speech recognition.
• The shape and aperture of the mouth gives some clues as to the phoneme being uttered.
• Depending on the level of invasiveness, we can even measure the glottis and tongue directly.
Example of articulatory data
• TORGO was built to train augmented ASR systems.
• 9 subjects with cerebral palsy (1 with ALS), 9 matched controls.
• Each reads 500–1000 prompts over 3 hours that cover phonemes and articulatory contrasts (e.g., meat vs. beat).
• Electromagnetic articulography (and video) track points to <1 mm.
[Figure: acoustic spectrograms]

Example – Lip aperture and nasals

[Figure: lip apertures over time for /n/, /m/, and /ng/]
Evaluating ASR accuracy
• How can you tell how good an ASR system is at recognizing speech?
• E.g., if somebody said
  Reference: how to recognize speech
  but an ASR system heard
  Hypothesis: how to wreck a nice beach
  how do we quantify the error?
• One measure is word accuracy: #CorrectWords / #ReferenceWords.
  • E.g., 2/4, above.
• This runs into problems similar to those we saw with SMT (see the sketch below).
  • E.g., the hypothesis 'how to recognize speech boing boing boing boing boing' has 100% accuracy by this measure.
  • Normalizing by #HypothesisWords also has problems…
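A toy sketch of this naive measure, showing exactly the failure mode above (the function name is illustrative):

```python
def word_accuracy(reference, hypothesis):
    """Naive word accuracy: words correct in position / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    correct = sum(r == h for r, h in zip(ref, hyp))
    return correct / len(ref)

ref = "how to recognize speech"
print(word_accuracy(ref, "how to wreck a nice beach"))                  # 0.5
# Padding the correct answer with junk still scores 100%:
print(word_accuracy(ref, "how to recognize speech boing boing boing"))  # 1.0
```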
Word-error rates (WER)
• ASR enthusiasts are often concerned with word-error rate (WER), which counts different kinds of errors that can be made by ASR at the word level.
• Substitution error: one word being mistaken for another; e.g., 'shift' given 'ship'.
• Deletion error: an input word that is 'skipped'; e.g., 'I Torgo' given 'I am Torgo'.
• Insertion error: a 'hallucinated' word that was not in the input; e.g., 'This Norwegian parrot is no more' given 'This parrot is no more'.
Evaluating ASR accuracy
• But how do we decide which errors are of each type?
• E.g.,
  Reference: how to recognize speech
  Hypothesis: how to wreck a nice beach
• It's not so simple: 'speech' seems to be mistaken for 'beach', except the /s/ phoneme is incorporated into the preceding hypothesis word, 'nice' (/n ay s/).
• Here, 'recognize' seems to be mistaken for 'wreck a nice'.
  • Are each of 'wreck', 'a', and 'nice' substitutions of 'recognize'?
  • Is 'wreck' a substitution for 'recognize'?
    • If so, the words 'a' and 'nice' must be insertions.
  • Is 'nice' a substitution for 'recognize'?
    • If so, the words 'wreck' and 'a' must be insertions.
Levenshtein distance
• In practice, ASR people are often more concerned with overall WER, and don't care about how those errors are partitioned.
  • E.g., 3 substitution errors are 'equivalent' to 1 substitution plus 2 insertions.
• The Levenshtein distance is a straightforward algorithm based on dynamic programming that allows us to compute overall WER.
Levenshtein distance

Allocate matrix R[n+1, m+1]   // where n is the number of reference words
                              // and m is the number of hypothesis words
Initialize R[0,0] := 0, and R[i,j] := ∞ for all other i = 0 or j = 0
for i := 1..n                 // reference words
  for j := 1..m               // hypothesis words
    R[i,j] := min( R[i-1,j] + 1,      // deletion
                   R[i-1,j-1],        // if the i-th reference word and
                                      //   the j-th hypothesis word match
                   R[i-1,j-1] + 1,    // if they differ, i.e., substitution
                   R[i,j-1] + 1 )     // insertion
Return 100 × R[n,m] / n
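A minimal runnable version of the pseudocode above. It uses the conventional border initialization (i deletions / j insertions), which yields the same interior values as the ∞ border on the slide, since any path is still forced through R[0,0].

```python
import numpy as np

def wer(reference, hypothesis):
    """Word-error rate (%) via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    R = np.zeros((n + 1, m + 1))
    R[0, 1:] = np.arange(1, m + 1)   # all insertions
    R[1:, 0] = np.arange(1, n + 1)   # all deletions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            R[i, j] = min(R[i - 1, j] + 1,        # deletion
                          R[i - 1, j - 1] + sub,  # match / substitution
                          R[i, j - 1] + 1)        # insertion
    return 100.0 * R[n, m] / n

print(wer("how to recognize speech", "how to wreck a nice beach"))  # 100.0
```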
Levenshtein distance – initialization

reference ↓ / hypothesis →
            -   how  to  wreck  a   nice  beach
-           0   ∞    ∞   ∞      ∞   ∞     ∞
how         ∞
to          ∞
recognize   ∞
speech      ∞

The value at cell (i, j) is the minimum number of errors necessary to align i with j.
Levenshtein distance

            -   how  to  wreck  a   nice  beach
-           0   ∞    ∞   ∞      ∞   ∞     ∞
how         ∞   0
to          ∞
recognize   ∞
speech      ∞

• R[1,1] = min(∞ + 1, 0, ∞ + 1) = 0 (match)
• We put a little arrow in place to indicate the choice.
  • 'Arrows' are normally stored in a backtrace matrix.
Levenshtein distance

            -   how  to  wreck  a   nice  beach
-           0   ∞    ∞   ∞      ∞   ∞     ∞
how         ∞   0    1   2      3   4     5
to          ∞
recognize   ∞
speech      ∞

• We continue along for the first reference word…
• These are all insertion errors.
Levenshtein distance

            -   how  to  wreck  a   nice  beach
-           0   ∞    ∞   ∞      ∞   ∞     ∞
how         ∞   0    1   2      3   4     5
to          ∞   1    0   1      2   3     4
recognize   ∞
speech      ∞

• And on to the second reference word.
Levenshtein distance

            -   how  to  wreck  a   nice  beach
-           0   ∞    ∞   ∞      ∞   ∞     ∞
how         ∞   0    1   2      3   4     5
to          ∞   1    0   1      2   3     4
recognize   ∞   2    1   1      2   3     4
speech      ∞

• Since recognize ≠ wreck, we have a substitution error.
• At some points, you have >1 possible path, as indicated.
  • We can prioritize types of errors arbitrarily.
Levenshtein distance

            -   how  to  wreck  a   nice  beach
-           0   ∞    ∞   ∞      ∞   ∞     ∞
how         ∞   0    1   2      3   4     5
to          ∞   1    0   1      2   3     4
recognize   ∞   2    1   1      2   3     4
speech      ∞   3    2   2      2   3     4

• And we finish the grid.
• There are R[n, m] = 4 word errors and a WER of 4/4 = 100%.
  • WER can be greater than 100% (relative to the reference).
Levenshtein distance

[Same completed grid as above, now with backtrace arrows marked]

• If we want, we can backtrack using our arrows to find the proportion of substitution, deletion, and insertion errors.
Levenshtein distance

[Same completed grid as above, with the chosen backtrace path highlighted]

• Here, we estimate 2 substitution errors and 2 insertion errors.
• Arrows can be encoded within a special backtrace matrix (see the sketch below).
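A sketch of that bookkeeping: alongside R, record which move produced each cell, then walk back from (n, m) counting error types. The tie-breaking order here (deletion, then substitution, then insertion) is one arbitrary prioritization, per the slide above.

```python
import numpy as np

def wer_with_counts(reference, hypothesis):
    """Return counts of substitution/insertion/deletion errors via backtrace."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    R = np.zeros((n + 1, m + 1))
    R[0, 1:] = np.arange(1, m + 1)
    R[1:, 0] = np.arange(1, n + 1)
    back = {}                                # (i, j) -> move that won the min
    back.update({(0, j): 'ins' for j in range(1, m + 1)})
    back.update({(i, 0): 'del' for i in range(1, n + 1)})
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            moves = [(R[i - 1, j] + 1, 'del'),
                     (R[i - 1, j - 1] + sub, 'match' if sub == 0 else 'sub'),
                     (R[i, j - 1] + 1, 'ins')]
            R[i, j], back[i, j] = min(moves, key=lambda t: t[0])
    counts = {'sub': 0, 'ins': 0, 'del': 0}
    i, j = n, m
    while (i, j) != (0, 0):                  # follow the arrows back to (0, 0)
        move = back[i, j]
        if move in counts:
            counts[move] += 1
        if move == 'del':
            i -= 1
        elif move == 'ins':
            j -= 1
        else:                                # 'match' or 'sub' moves diagonally
            i, j = i - 1, j - 1
    return counts

print(wer_with_counts("how to recognize speech",
                      "how to wreck a nice beach"))
# {'sub': 2, 'ins': 2, 'del': 0}
```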
Recent performance

Corpus                                Speech type   Lexicon size   ASR WER (%)   Human WER (%)
Digits                                Spontaneous   10             0.3           0.009
Phone directory                       Read          1,000          3.6           0.1
Wall Street Journal                   Read          64,000         6.6           1
Radio news                            Mixed         64,000         13.5          -
Switchboard conversation (telephone)  Spontaneous   10,000         19.3          4