Language Modeling
Many slides from Dan Jurafsky
Probabilistic Language Models
• Today's goal: assign a probability to a sentence
• Machine Translation:
  • P(high winds tonite) > P(large winds tonite)
  Why?
• Spell Correction
  • The office is about fifteen minuets from my house
  • P(about fifteen minutes from) > P(about fifteen minuets from)
• Speech Recognition
  • P(I saw a van) >> P(eyes awe of an)
• + Summarization, question-answering, etc., etc.!!
Probabilistic Language Modeling
• Goal: compute the probability of a sentence or sequence of words:
    P(W) = P(w1, w2, w3, w4, w5 … wn)
• Related task: probability of an upcoming word:
    P(w5 | w1, w2, w3, w4)
• A model that computes either of these,
    P(W) or P(wn | w1, w2 … wn−1),
  is called a language model.
• Better: the grammar! But "language model" or LM is standard.
How to compute P(W)
• How to compute this joint probability:
  • P(its, water, is, so, transparent, that)
• Intuition: let's rely on the Chain Rule of Probability
Reminder: The Chain Rule
• Recall the definition of conditional probabilities:
    P(B | A) = P(A, B) / P(A)
  Rewriting: P(A, B) = P(A) P(B | A)
• More variables:
    P(A, B, C, D) = P(A) P(B | A) P(C | A, B) P(D | A, B, C)
• The Chain Rule in general:
    P(x1, x2, x3, …, xn) = P(x1) P(x2 | x1) P(x3 | x1, x2) … P(xn | x1, …, xn−1)
The Chain Rule applied to compute the joint probability of words in a sentence
    P(w1 w2 … wn) = ∏i P(wi | w1 w2 … wi−1)
P("its water is so transparent") =
    P(its) × P(water | its) × P(is | its water)
    × P(so | its water is) × P(transparent | its water is so)
How to estimate these probabilities
• Could we just count and divide?
    P(the | its water is so transparent that) =
        Count(its water is so transparent that the)
        / Count(its water is so transparent that)
• No! Too many possible sentences!
• We'll never see enough data for estimating these
Markov Assumption (Andrei Markov)
• Simplifying assumption:
    P(the | its water is so transparent that) ≈ P(the | that)
• Or maybe:
    P(the | its water is so transparent that) ≈ P(the | transparent that)
Markov Assumption
    P(w1 w2 … wn) ≈ ∏i P(wi | wi−k … wi−1)
• In other words, we approximate each component in the product:
    P(wi | w1 w2 … wi−1) ≈ P(wi | wi−k … wi−1)
Simplest case: Unigram model
    P(w1 w2 … wn) ≈ ∏i P(wi)
Some automatically generated sentences from a unigram model:
    fifth, an, of, futures, the, an, incorporated, a, a,
    the, inflation, most, dollars, quarter, in, is, mass
    thrift, did, eighty, said, hard, 'm, july, bullish
    that, or, limited, the
Bigram model
Condition on the previous word:
    P(wi | w1 w2 … wi−1) ≈ P(wi | wi−1)
Some automatically generated sentences from a bigram model:
    texaco, rose, one, in, this, issue, is, pursuing, growth, in,
    a, boiler, house, said, mr., gurria, mexico, 's, motion,
    control, proposal, without, permission, from, five, hundred,
    fifty, five, yen
    outside, new, car, parking, lot, of, the, agreement, reached
    this, would, be, a, record, november
N-gram models
• We can extend to trigrams, 4-grams, 5-grams
• In general this is an insufficient model of language
  • because language has long-distance dependencies:
    "The computer which I had just put into the machine room on the fifth floor crashed."
• But we can often get away with N-gram models
Language Modeling
Introduction to N-grams
Language Modeling
Estimating N-gram Probabilities
Estimating bigram probabilities
• The Maximum Likelihood Estimate:
    P(wi | wi−1) = count(wi−1, wi) / count(wi−1)
    P(wi | wi−1) = c(wi−1, wi) / c(wi−1)
An example
    P(wi | wi−1) = c(wi−1, wi) / c(wi−1)
    <s> I am Sam </s>
    <s> Sam I am </s>
    <s> I do not like green eggs and ham </s>
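As a minimal sketch (not part of the original slides), assuming whitespace-tokenized sentences with explicit <s> and </s> markers, the MLE bigram estimate can be computed from this tiny corpus like so:

    from collections import Counter

    sentences = [
        "<s> I am Sam </s>",
        "<s> Sam I am </s>",
        "<s> I do not like green eggs and ham </s>",
    ]

    bigram_counts = Counter()
    unigram_counts = Counter()
    for sent in sentences:
        tokens = sent.split()
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))

    def p_mle(word, prev):
        # P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})
        return bigram_counts[(prev, word)] / unigram_counts[prev]

    print(p_mle("I", "<s>"))   # 2/3
    print(p_mle("Sam", "am"))  # 1/2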
More examples: Berkeley Restaurant Project sentences
• can you tell me about any good cantonese restaurants close by
• mid priced thai food is what i'm looking for
• tell me about chez panisse
• can you give me a listing of the kinds of food that are available
• i'm looking for a good place to eat breakfast
• when is caffe venezia open during the day
Raw bigram counts
• Out of 9222 sentences
Raw bigram probabilities
• Normalize by unigrams:
• Result:
Bigram estimates of sentence probabilities
P(<s> I want english food </s>) =
    P(I | <s>)
    × P(want | I)
    × P(english | want)
    × P(food | english)
    × P(</s> | food)
    = .000031
What kinds of knowledge?
• P(english | want) = .0011
• P(chinese | want) = .0065
• P(to | want) = .66
• P(eat | to) = .28
• P(food | to) = 0
• P(want | spend) = 0
• P(i | <s>) = .25
Practical Issues
• We do everything in log space
  • Avoid underflow
  • (also adding is faster than multiplying)
    log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
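A minimal sketch of the log-space trick; the probability values below are illustrative assumptions roughly in the range of the Berkeley Restaurant example, not quoted from a table:

    import math

    # Hypothetical bigram probabilities for "<s> I want english food </s>"
    probs = [0.25, 0.33, 0.0011, 0.5, 0.68]

    # Multiplying many small probabilities underflows; summing logs does not.
    log_prob = sum(math.log(p) for p in probs)
    print(log_prob)             # log P(sentence)
    print(math.exp(log_prob))   # back to a probability if needed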
Language Modeling Toolkits
• SRILM
  • http://www.speech.sri.com/projects/srilm/
• KenLM
  • https://kheafield.com/code/kenlm/
Google N-Gram Release, August 2006
…
Google N-Gram Release
• serve as the incoming 92
• serve as the incubator 99
• serve as the independent 794
• serve as the index 223
• serve as the indication 72
• serve as the indicator 120
• serve as the indicators 45
• serve as the indispensable 111
• serve as the indispensible 40
• serve as the individual 234
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
Google Book N-grams
• http://ngrams.googlelabs.com/
Language Modeling
Estimating N-gram Probabilities
Language Modeling
Evaluation and Perplexity
Evaluation: How good is our model?
• Does our language model prefer good sentences to bad ones?
  • Assign higher probability to "real" or "frequently observed" sentences
  • than to "ungrammatical" or "rarely observed" sentences?
• We train parameters of our model on a training set.
• We test the model's performance on data we haven't seen.
  • A test set is an unseen dataset that is different from our training set, totally unused.
  • An evaluation metric tells us how well our model does on the test set.
Training on the test set
• We can't allow test sentences into the training set
  • We would assign them an artificially high probability when we see them in the test set
• "Training on the test set"
  • Bad science!
  • And violates the honor code
Extrinsic evaluation of N-gram models
• Best evaluation for comparing models A and B
  • Put each model in a task
    • spelling corrector, speech recognizer, MT system
  • Run the task, get an accuracy for A and for B
    • How many misspelled words corrected properly
    • How many words translated correctly
  • Compare accuracy for A and B
Difficulty of extrinsic (in-vivo) evaluation of N-gram models
• Extrinsic evaluation
  • Time-consuming; can take days or weeks
• So
  • Sometimes use intrinsic evaluation: perplexity
  • Bad approximation
    • unless the test data looks just like the training data
    • So generally only useful in pilot experiments
  • But is helpful to think about.
Intuition of Perplexity
• The Shannon Game: How well can we predict the next word?
    I always order pizza with cheese and ____
        mushrooms   0.1
        pepperoni   0.1
        anchovies   0.01
        ….
        fried rice  0.0001
        ….
        and         1e-100
    The 33rd President of the US was ____
    I saw a ____
• Unigrams are terrible at this game. (Why?)
• A better model of a text
  • is one which assigns a higher probability to the word that actually occurs
Perplexity
The best language model is one that best predicts an unseen test set
• Gives the highest P(sentence)
Perplexity is the inverse probability of the test set, normalized by the number of words:
    PP(W) = P(w1 w2 … wN)^(−1/N)
          = (1 / P(w1 w2 … wN))^(1/N)
Chain rule:
    PP(W) = ( ∏i 1 / P(wi | w1 … wi−1) )^(1/N)
For bigrams:
    PP(W) = ( ∏i 1 / P(wi | wi−1) )^(1/N)
Minimizing perplexity is the same as maximizing probability
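A minimal sketch (not from the slides) of computing bigram perplexity in log space, assuming a bigram_prob(word, prev) function that returns a nonzero probability for every test bigram (e.g. a smoothed model):

    import math

    def perplexity(test_sentences, bigram_prob):
        log_sum = 0.0
        n_words = 0
        for sent in test_sentences:
            tokens = sent.split()
            for prev, word in zip(tokens, tokens[1:]):
                log_sum += math.log(bigram_prob(word, prev))
                n_words += 1
        # PP(W) = exp(-(1/N) * sum_i log P(w_i | w_{i-1}))
        return math.exp(-log_sum / n_words)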
Perplexity as branching factor
• Let's suppose a sentence consisting of random digits
• What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit?
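Working it out from the definition above, for a digit string of length N:
    PP(W) = P(w1 w2 … wN)^(−1/N) = ((1/10)^N)^(−1/N) = (1/10)^(−1) = 10
So the perplexity equals the branching factor: 10 equally likely choices at each position.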
Lower perplexity = better model
• Training 38 million words, test 1.5 million words, WSJ
    N-gram Order:  Unigram   Bigram   Trigram
    Perplexity:    962       170      109
Language Modeling
Evaluation and Perplexity
Language Modeling
Generalization and zeros
The Shannon Visualization Method
• Choose a random bigram (<s>, w) according to its probability
• Now choose a random bigram (w, x) according to its probability
• And so on until we choose </s>
• Then string the words together
    <s> I
        I want
          want to
               to eat
                  eat Chinese
                      Chinese food
                              food </s>
    I want to eat Chinese food
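A minimal sketch of this sampling procedure (assuming bigram_prob is a nested dict mapping each word to a dict of next-word probabilities, e.g. estimated from counts as in the earlier sketch):

    import random

    def generate_sentence(bigram_prob):
        sentence = ["<s>"]
        while sentence[-1] != "</s>":
            prev = sentence[-1]
            words = list(bigram_prob[prev].keys())
            weights = list(bigram_prob[prev].values())
            # Choose the next word according to its bigram probability
            sentence.append(random.choices(words, weights=weights, k=1)[0])
        return " ".join(sentence[1:-1])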
We can visualize how a bigram model generates sentences by first generating a random bigram that starts with <s> (according to its bigram probability), then choosing a random bigram to follow (again, according to its bigram probability), and so on.
To give an intuition for the increasing power of higher-order N-grams, Fig. 4.3 shows random sentences generated from unigram, bigram, trigram, and 4-gram models trained on Shakespeare's works.
Approximating Shakespeare
1-gram
  – To him swallowed confess hear both. Which. Of save on trail for are ay device and rote life have
  – Hill he late speaks; or! a more to leg less first you enter
2-gram
  – Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live king. Follow.
  – What means, sir. I confess she? then all sorts, he is trim, captain.
3-gram
  – Fly, and will rid me these news of price. Therefore the sadness of parting, as they say, 'tis done.
  – This shall forbid it should be branded, if renown made it empty.
4-gram
  – King Henry. What! I will go seek the traitor Gloucester. Exeunt some of the watch. A great banquet serv'd in;
  – It cannot be but so.
Figure 4.3  Eight sentences randomly generated from four N-grams computed from Shakespeare's works. All characters were mapped to lower-case and punctuation marks were treated as words. Output is hand-corrected for capitalization to improve readability.
Shakespeare as corpus
• N = 884,647 tokens, V = 29,066
• Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams.
  • So 99.96% of the possible bigrams were never seen (have zero entries in the table)
• Quadrigrams are worse: what's coming out looks like Shakespeare because it is Shakespeare
The Wall Street Journal is not Shakespeare (no offense)
1-gram: Months the my and issue of year foreign new exchange's september were recession exchange new endorsed a acquire to six executives
2-gram: Last December through the way to preserve the Hudson corporation N. B. E. C. Taylor would seem to complete the major central planners one point five percent of U. S. E. has already old M. X. corporation of living on information such as more frequently fishing to keep her
3-gram: They also point to ninety nine point six billion dollars from two hundred four oh six three percent of the rates of interest stores as Mexico and Brazil on market conditions
Figure 4.4  Three sentences randomly generated from three N-gram models computed from 40 million words of the Wall Street Journal, lower-casing all characters and treating punctuation as words.
Zeros
• Training set:
    … denied the allegations
    … denied the reports
    … denied the claims
    … denied the request
• Test set:
    … denied the offer
    … denied the loan
P("offer" | denied the) = 0
Zero probability bigrams
• Bigrams with zero probability
  • mean that we will assign 0 probability to the test set!
• And hence we cannot compute perplexity (can't divide by 0)!
Language Modeling
Generalization and zeros
Language Modeling
Smoothing: Add-one (Laplace) smoothing
The intuition of smoothing (from Dan Klein)
• When we have sparse statistics:
    P(w | denied the):
        3 allegations
        2 reports
        1 claims
        1 request
        7 total
• Steal probability mass to generalize better:
    P(w | denied the):
        2.5 allegations
        1.5 reports
        0.5 claims
        0.5 request
        2   other
        7 total
Add-one estimation
• Also called Laplace smoothing
• Pretend we saw each word one more time than we did
• Just add one to all the counts!
• MLE estimate:
    PMLE(wi | wi−1) = c(wi−1, wi) / c(wi−1)
• Add-1 estimate:
    PAdd-1(wi | wi−1) = (c(wi−1, wi) + 1) / (c(wi−1) + V)
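A minimal sketch of the add-1 estimate, assuming bigram_counts and unigram_counts are collections.Counter objects (as in the earlier MLE sketch), so unseen n-grams simply count as 0:

    def add_one_bigram_prob(word, prev, bigram_counts, unigram_counts, vocab_size):
        # P_Add-1(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + 1) / (c(w_{i-1}) + V)
        return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)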
Berkeley Restaurant Corpus: Laplace-smoothed bigram counts
Laplace-smoothed bigrams
Reconstituted counts
Compare with raw bigram counts
Add-1 estimation is a blunt instrument
• So add-1 isn't used for N-grams:
  • We'll see better methods
• But add-1 is used to smooth other NLP models
  • For text classification
  • In domains where the number of zeros isn't so huge.
Language Modeling
Smoothing: Add-one (Laplace) smoothing
Language Modeling
Interpolation, Backoff, and Web-Scale LMs
Backoff and Interpolation
• Sometimes it helps to use less context
  • Condition on less context for contexts you haven't learned much about
• Backoff:
  • use trigram if you have good evidence,
  • otherwise bigram, otherwise unigram
• Interpolation:
  • mix unigram, bigram, trigram
• Interpolation works better
Linear Interpolation
• Simple interpolation: we mix the probability estimates from all the N-gram estimators, weighing and combining the trigram, bigram, and unigram counts. In simple linear interpolation, we estimate the trigram probability P(wn | wn−2 wn−1) by mixing together the unigram, bigram, and trigram probabilities, each weighted by a λ:
    P̂(wn | wn−2 wn−1) = λ1 P(wn | wn−2 wn−1)
                        + λ2 P(wn | wn−1)
                        + λ3 P(wn)
  such that the λs sum to 1:
    Σi λi = 1
• Lambdas conditional on context:
    P̂(wn | wn−2 wn−1) = λ1(wn−2, wn−1) P(wn | wn−2 wn−1)
                        + λ2(wn−2, wn−1) P(wn | wn−1)
                        + λ3(wn−2, wn−1) P(wn)
  In this slightly more sophisticated version of linear interpolation, each λ weight is computed by conditioning on the context: if we have particularly accurate counts for a particular bigram, we assume that the counts of trigrams based on this bigram will be more trustworthy, so we can make the λs for those trigrams higher and thus give that trigram more weight.
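A minimal sketch of simple linear interpolation, assuming p_tri, p_bi, and p_uni are (possibly zero) MLE probability functions estimated on the training data:

    def interpolated_prob(w, prev2, prev1, p_uni, p_bi, p_tri,
                          lambdas=(0.6, 0.3, 0.1)):
        # lambdas = (l_trigram, l_bigram, l_unigram); they must sum to 1.
        l_tri, l_bi, l_uni = lambdas
        return (l_tri * p_tri(w, prev2, prev1)
                + l_bi * p_bi(w, prev1)
                + l_uni * p_uni(w))

The λ values here are placeholders; the next slide describes how the λs are actually set on held-out data.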
How to set the lambdas?
• Use a held-out corpus:
    Training Data | Held-Out Data | Test Data
• Choose λs to maximize the probability of held-out data:
  • Fix the N-gram probabilities (on the training data)
  • Then search for λs that give the largest probability to the held-out set:
    log P(w1 … wn | M(λ1 … λk)) = Σi log P_{M(λ1 … λk)}(wi | wi−1)
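One simple way to do this search (an assumption for illustration, not a method prescribed by the slides) is a coarse grid search over λ values, scoring each setting by held-out log likelihood with the interpolated_prob sketch above:

    import itertools, math

    def choose_lambdas(heldout_trigrams, p_uni, p_bi, p_tri, step=0.1):
        # heldout_trigrams: a list of (prev2, prev1, w) tuples from held-out data.
        grid = [i * step for i in range(int(round(1 / step)) + 1)]
        best, best_ll = None, float("-inf")
        for l_tri, l_bi in itertools.product(grid, grid):
            l_uni = 1.0 - l_tri - l_bi
            if l_uni < -1e-9:
                continue
            ll = 0.0
            for prev2, prev1, w in heldout_trigrams:
                p = interpolated_prob(w, prev2, prev1, p_uni, p_bi, p_tri,
                                      (l_tri, l_bi, max(l_uni, 0.0)))
                if p <= 0:
                    ll = float("-inf")
                    break
                ll += math.log(p)
            if ll > best_ll:
                best, best_ll = (l_tri, l_bi, max(l_uni, 0.0)), ll
        return best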
Unknown words: Open versus closed vocabulary tasks
• If we know all the words in advance
  • Vocabulary V is fixed
  • Closed vocabulary task
• Often we don't know this
  • Out Of Vocabulary = OOV words
  • Open vocabulary task
• Instead: create an unknown word token <UNK>
  • Training of <UNK> probabilities
    • Create a fixed lexicon L of size V
    • At text normalization phase, any training word not in L is changed to <UNK>
    • Now we train its probabilities like a normal word
  • At decoding time
    • If text input: use <UNK> probabilities for any word not seen in training
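A minimal sketch of the <UNK> convention, assuming (as an illustration) that the lexicon L is taken to be all training words seen at least min_count times:

    from collections import Counter

    def replace_rare_with_unk(sentences, min_count=2):
        counts = Counter(w for sent in sentences for w in sent.split())
        lexicon = {w for w, c in counts.items() if c >= min_count}
        mapped = [" ".join(w if w in lexicon else "<UNK>" for w in sent.split())
                  for sent in sentences]
        return mapped, lexicon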
Huge web-scale n-grams
• How to deal with, e.g., the Google N-gram corpus
• Pruning
  • Only store N-grams with count > threshold.
    • Remove singletons of higher-order n-grams
  • Entropy-based pruning
• Efficiency
  • Efficient data structures like tries
  • Bloom filters: approximate language models
  • Store words as indexes, not strings
    • Use Huffman coding to fit large numbers of words into two bytes
  • Quantize probabilities (4-8 bits instead of 8-byte float)
Smoothing for Web-scale N-grams
• "Stupid backoff" (Brants et al. 2007)
• No discounting, just use relative frequencies
    S(wi | wi−k+1 … wi−1) =
        count(wi−k+1 … wi) / count(wi−k+1 … wi−1)   if count(wi−k+1 … wi) > 0
        0.4 · S(wi | wi−k+2 … wi−1)                  otherwise
    S(wi) = count(wi) / N
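A minimal sketch of stupid backoff scoring, assuming counts maps n-gram tuples of any order to their training counts and counts[()] holds the total token count N:

    def stupid_backoff(words, i, counts, alpha=0.4, max_order=3):
        # Score S(w_i | context), backing off one order (and multiplying by
        # alpha) whenever the longer n-gram was never seen.
        score = 1.0
        k = min(max_order - 1, i)        # length of the longest usable context
        while k >= 0:
            ngram = tuple(words[i - k:i + 1])
            context = tuple(words[i - k:i])
            if counts.get(ngram, 0) > 0 and counts.get(context, 0) > 0:
                return score * counts[ngram] / counts[context]
            score *= alpha               # back off one order
            k -= 1
        return 0.0                       # word never seen at all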
Big language models help machine translation a lot
N-gram Smoothing Summary
• Add-1 smoothing:
  • OK for text categorization, not for language modeling
• The most commonly used method:
  • Extended Interpolated Kneser-Ney
• For very large N-grams like the Web:
  • Stupid backoff
Advanced Language Modeling
• Discriminative models:
  • choose n-gram weights to improve a task, not to fit the training set
• Parsing-based models
• Caching models
  • Recently used words are more likely to appear
    PCACHE(w | history) = λ P(wi | wi−2 wi−1) + (1 − λ) c(w ∈ history) / |history|
  • These perform very poorly for speech recognition (why?)
Language Modeling
Interpolation, Backoff, and Web-Scale LMs
Language Modeling
Advanced: Kneser-Ney Smoothing
Absolute discounting: just subtract a little from each count
• Suppose we wanted to subtract a little from a count of 4 to save probability mass for the zeros
• How much to subtract?
• Church and Gale (1991)'s clever idea:
  • Divide up 22 million words of AP Newswire into a training and a held-out set
  • For each bigram in the training set, see the actual count in the held-out set!
• It sure looks like c* = (c − .75)
Bigram count in training    Bigram count in held-out set
0                           .0000270
1                           0.448
2                           1.25
3                           2.24
4                           3.23
5                           4.21
6                           5.23
7                           6.21
8                           7.21
9                           8.26
Absolute Discounting Interpolation
• Save ourselves some time and just subtract 0.75 (or some d)!
    PAbsoluteDiscounting(wi | wi−1) = (c(wi−1, wi) − d) / c(wi−1)    ← discounted bigram
                                      + λ(wi−1) P(w)                 ← interpolation weight × unigram
• (Maybe keeping a couple extra values of d for counts 1 and 2)
• But should we really just use the regular unigram P(w)?
Kneser-Ney Smoothing I
• Better estimate for probabilities of lower-order unigrams!
• Shannon game: I can't see without my reading ___________?  ("Francisco" or "glasses"?)
  • "Francisco" is more common than "glasses"
  • … but "Francisco" always follows "San"
  • The unigram is useful exactly when we haven't seen this bigram!
• Instead of P(w): "How likely is w?"
  • Pcontinuation(w): "How likely is w to appear as a novel continuation?"
    • For each word, count the number of bigram types it completes
    • Every bigram type was a novel continuation the first time it was seen
    PCONTINUATION(w) ∝ |{wi−1 : c(wi−1, w) > 0}|
Kneser-Ney Smoothing II
• How many times does w appear as a novel continuation:
    PCONTINUATION(w) ∝ |{wi−1 : c(wi−1, w) > 0}|
• Normalized by the total number of word bigram types:
    PCONTINUATION(w) = |{wi−1 : c(wi−1, w) > 0}| / |{(wj−1, wj) : c(wj−1, wj) > 0}|
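A minimal sketch of the continuation probability, assuming bigram_counts is a Counter of (prev, word) pairs from training:

    from collections import defaultdict

    def continuation_prob(bigram_counts):
        # P_CONTINUATION(w) = |{w' : c(w', w) > 0}| / |{bigram types}|
        contexts_of = defaultdict(set)
        for (prev, word), c in bigram_counts.items():
            if c > 0:
                contexts_of[word].add(prev)
        total_bigram_types = sum(1 for c in bigram_counts.values() if c > 0)
        return {w: len(ctxs) / total_bigram_types
                for w, ctxs in contexts_of.items()}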
Kneser-Ney Smoothing IV
    PKN(wi | wi−1) = max(c(wi−1, wi) − d, 0) / c(wi−1) + λ(wi−1) PCONTINUATION(wi)
λ is a normalizing constant: the probability mass we've discounted
    λ(wi−1) = (d / c(wi−1)) · |{w : c(wi−1, w) > 0}|
  • d / c(wi−1) is the normalized discount
  • |{w : c(wi−1, w) > 0}| is the number of word types that can follow wi−1
    = # of word types we discounted
    = # of times we applied the normalized discount
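A minimal sketch of the interpolated Kneser-Ney bigram estimate above, assuming bigram_counts and unigram_counts are training Counters and p_continuation is the dict returned by continuation_prob():

    def kneser_ney_bigram_prob(word, prev, bigram_counts, unigram_counts,
                               p_continuation, d=0.75):
        c_prev = unigram_counts[prev]
        discounted = max(bigram_counts[(prev, word)] - d, 0) / c_prev
        # lambda(prev) = (d / c(prev)) * |{w : c(prev, w) > 0}|
        follower_types = sum(1 for (p, _), c in bigram_counts.items()
                             if p == prev and c > 0)
        lam = (d / c_prev) * follower_types
        return discounted + lam * p_continuation.get(word, 0.0)

(For anything beyond a toy corpus you would precompute follower_types per context rather than scanning all bigrams on every call.)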
Kneser-Ney Smoothing: Recursive formulation
    PKN(wi | wi−n+1 … wi−1) = max(cKN(wi−n+1 … wi) − d, 0) / cKN(wi−n+1 … wi−1)
                              + λ(wi−n+1 … wi−1) PKN(wi | wi−n+2 … wi−1)
where
    cKN(·) = count(·)                for the highest order
           = continuation count(·)   for lower orders
Continuation count = the number of unique single-word contexts for ·
Language Modeling
Advanced: Kneser-Ney Smoothing