
Natural Language
Processing
Lecture 7—9/15/2015
Jim Martin
Today
More language modeling
§  Probabilistic model
§  Independence assumptions
§  Smoothing
§  Laplace smoothing (add-1)
§  Kneser-Ney
Markov Assumption
So for each component in the product replace with the
approximation (assuming a prefix of N - 1)
P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})

Bigram version

P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1})
Estimating Bigram
Probabilities
§  The Maximum Likelihood Estimate (MLE)
P(w_i \mid w_{i-1}) = \frac{count(w_{i-1}, w_i)}{count(w_{i-1})}
An Example
§  <s> I am Sam </s>
§  <s> Sam I am </s>
§  <s> I do not like green eggs and ham </s>
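As a concrete illustration of the MLE formula above applied to this toy corpus, here is a minimal sketch (the helper names are mine, not from the lecture):

```python
from collections import Counter

# Collect unigram and bigram counts from the three-sentence toy corpus above,
# then apply the MLE estimate P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}).
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_mle(word, prev):
    """P(word | prev) = count(prev, word) / count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_mle("I", "<s>"))   # 2/3
print(p_mle("Sam", "am"))  # 1/2
print(p_mle("do", "I"))    # 1/3
```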
Bigram Counts
§  Vocabulary size |V| is 1446
§  Counts taken from 9222 sentences
§  E.g., “I want” occurred 827 times
Bigram Probabilities
§  Divide bigram counts by prefix unigram
counts to get bigram probabilities.
Bigram Estimates of Sentence
Probabilities
§  P(<s> I want english food </s>) =
   P(I|<s>) × P(want|I) × P(english|want) × P(food|english) × P(</s>|food)
   = .000031
Perplexity
§  Perplexity is the inverse probability of the test set (assigned by the language model), normalized by the number of words:
   PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N}
§  Chain rule:
   PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}
§  For bigrams:
   PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}
§  Minimizing perplexity is the same as maximizing probability
§  The best language model is one that best predicts an unseen test set
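A minimal sketch of the bigram-perplexity computation described above, assuming a hypothetical model function p_bigram(word, prev) that returns a (smoothed) probability:

```python
import math

def perplexity(test_tokens, p_bigram):
    """Bigram perplexity PP(W) = P(w_1..w_N)^(-1/N), computed in log space."""
    log_prob = 0.0
    n = 0
    for prev, word in zip(test_tokens, test_tokens[1:]):
        log_prob += math.log(p_bigram(word, prev))
        n += 1
    return math.exp(-log_prob / n)
```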
Lower perplexity means a
better model
§  Training 38 million words, test 1.5 million
words, WSJ
Practical Issues
§  Once we start looking at test data, we’ll
run into words that we haven’t seen
before.
§  Standard solution
  §  Given a corpus
  §  Create an unknown word token <UNK>
  §  Create a fixed lexicon L, of size V
    §  From a dictionary, or
    §  A subset of terms from the training set
  §  At the text normalization phase, any training word not in L is changed to <UNK>
  §  Collect counts for that as for any normal word
§  At test time
  §  Use <UNK> counts for any word not seen in training
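A minimal sketch of this <UNK> recipe; the lexicon-building rule here (keep words seen at least twice) is an arbitrary illustrative choice, not something the slide prescribes:

```python
from collections import Counter

def build_lexicon(train_tokens, min_count=2):
    """Fixed lexicon L: words seen at least min_count times in training."""
    counts = Counter(train_tokens)
    return {w for w, c in counts.items() if c >= min_count}

def normalize(tokens, lexicon):
    """Rewrite any word outside L as <UNK> (training and test alike)."""
    return [w if w in lexicon else "<UNK>" for w in tokens]

# Training: normalize, then collect unigram/bigram counts as for any word.
# Test time: words unseen in training fall outside L and map to <UNK>.
```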
Practical Issues
§  Multiplying a bunch of small numbers
between 0 and 1 is a really bad idea
§  Multiplication is slow
§  And underflow is likely
§  So do everything in log space
§  Avoid underflow
§  Adding is faster than multiplying
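A minimal sketch of working in log space, again assuming a hypothetical p_bigram model function:

```python
import math

def sentence_logprob(tokens, p_bigram):
    """Sum of log bigram probabilities; exp() recovers the product if needed."""
    return sum(math.log(p_bigram(w, prev)) for prev, w in zip(tokens, tokens[1:]))

# Model comparisons can be done directly on log scores, avoiding underflow
# and replacing multiplications with additions.
```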
Smoothing
§  Back to Shakespeare
§  Recall that Shakespeare produced 300,000 bigram
types out of V² = 844 million possible bigrams...
§  So, 99.96% of the possible bigrams were never seen
(have zero entries in the table)
§  Does that mean that any sentence that contains one
of those bigrams should have a probability of 0?
§  For generation (Shannon game) it means we’ll never
emit those bigrams
§  But for analysis it’s problematic because if we run
across a new bigram in the future then we have no
choice but to assign it a probability of zero.
Zero Counts
§  Some of those zeros are really zeros...
§  Things that really aren’t ever going to happen
§  On the other hand, some of them are just rare events.
§  If the training corpus had been a little bigger they would have had a
count
§  What would that count be in all likelihood?
§  Zipf’s Law (long tail phenomenon):
  §  A small number of events occur with high frequency
  §  A large number of events occur with low frequency
  §  You can quickly collect statistics on the high frequency events
  §  You might have to wait an arbitrarily long time to get valid statistics on low frequency events
§  Result:
§  Our estimates are sparse! We have no counts at all for the vast
number of things we want to estimate!
§  Answer:
§  Estimate the likelihood of unseen (zero count) N-grams!
Laplace Smoothing
§  Also called Add-One smoothing
§  Just add one to all the counts!
§  Very simple
§  MLE estimate:
   P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}
§  Laplace estimate:
   P_{Laplace}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}
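A minimal sketch of the Laplace estimate, reusing the count dictionaries from the earlier sketch (V is the vocabulary size):

```python
def p_laplace(word, prev, bigram_counts, unigram_counts, V):
    """P_Laplace(word | prev) = (c(prev, word) + 1) / (c(prev) + V)."""
    return (bigram_counts.get((prev, word), 0) + 1) / (unigram_counts.get(prev, 0) + V)
```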
Bigram Counts
Laplace-Smoothed Bigram
Counts
Laplace-Smoothed Bigram
Probabilities
Reconstituted Counts
Reconstituted Counts (2)
Big Change to the Counts!
§  C(want to) went from 608 to 238!
§  P(to|want) from .66 to .26!
§  Discount d= c*/c
§  d for “chinese food” =.10!!! A 10x reduction
§  So in general, Laplace is a blunt instrument
§  Could use more fine-grained method (add-k)
§  Because of this, Laplace smoothing is not often used for language models, as we have much better methods
§  Despite its flaws Laplace (add-1) is still used to smooth other
probabilistic models in NLP and IR, especially
§  For pilot studies
§  In document classification
§  In domains where the number of zeros isn’t so huge.
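A minimal sketch of where these numbers come from: the add-one “reconstituted” count is c* = (c + 1) · c(w_{i−1}) / (c(w_{i−1}) + V), and the discount is d = c*/c. The counts used below (c(want) = 927, c(chinese) = 158, c(chinese, food) = 82) are taken from the textbook’s restaurant-corpus tables and are assumptions here, since this slide only quotes the results:

```python
V = 1446  # vocabulary size quoted earlier

def adjusted_count(bigram_count, prefix_count, V):
    """Add-one reconstituted count: c* = (c + 1) * c(prefix) / (c(prefix) + V)."""
    return (bigram_count + 1) * prefix_count / (prefix_count + V)

c_star_want_to = adjusted_count(608, 927, V)      # ~238, down from 608
c_star_chinese_food = adjusted_count(82, 158, V)  # ~8.2, down from 82

print(round(c_star_want_to))              # 238
print(round(c_star_chinese_food / 82, 2)) # 0.1 -- the discount d for "chinese food"
```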
4.4.2 Add-k smoothing

One alternative to add-one smoothing is to move a bit less of the probability mass from the seen to the unseen events. Instead of adding 1 to each count, we add a fractional count k (.5? .05? .01?). This algorithm is therefore called add-k smoothing.

P^*_{Add\text{-}k}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + k}{C(w_{n-1}) + kV}    (4.23)

Add-k smoothing requires that we have a method for choosing k; this can be done, for example, by optimizing on a devset. Although add-k is useful for some tasks (including text classification), it turns out that it still doesn’t work well for language modeling, generating counts with poor variances and often inappropriate discounts (Gale and Church, 1994).

4.4.3 Backoff and Interpolation

The discounting we have been discussing so far can help solve the problem of zero frequency N-grams. But there is an additional source of knowledge we can draw on. If we are trying to compute P(w_n \mid w_{n-2} w_{n-1}) but we have no examples of a particular trigram w_{n-2} w_{n-1} w_n, we can instead estimate its probability by using the bigram probability P(w_n \mid w_{n-1}). Similarly, if we don’t have counts to compute P(w_n \mid w_{n-1}), we can look to the unigram P(w_n).

In other words, sometimes using less context is a good thing, helping to generalize more for contexts that the model hasn’t learned much about. There are two ways to use this N-gram “hierarchy”. In backoff, we use the trigram if the evidence is sufficient, otherwise we use the bigram, otherwise the unigram. In other words, we only “back off” to a lower-order N-gram if we have zero evidence for a higher-order N-gram. By contrast, in interpolation, we always mix the probability estimates from all the N-gram estimators, weighing and combining the trigram, bigram, and unigram counts.

In simple linear interpolation, we combine different order N-grams by linearly interpolating all the models. Thus, we estimate the trigram probability P(w_n \mid w_{n-2} w_{n-1}) by mixing together the unigram, bigram, and trigram probabilities, each weighted by a λ:

\hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1 P(w_n \mid w_{n-2} w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n)    (4.24)

such that the λs sum to 1:

\sum_i \lambda_i = 1    (4.25)

In a slightly more sophisticated version of linear interpolation, each λ weight is computed in a more sophisticated way, by conditioning on the context. This way, if we have particularly accurate counts for a particular bigram, we assume that the counts of the trigrams based on this bigram will be more trustworthy, so we can make the λs for those trigrams higher and thus give that trigram more weight in the interpolation.

Better Smoothing
§  Two key ideas
  §  Use less context if the counts are missing for longer contexts
  §  Use the count of things we’ve seen once to help estimate the count of things we’ve never seen

Backoff and Interpolation
§  Sometimes it helps to use less context
  §  Condition on less context for contexts you haven’t learned much about
§  Backoff
  §  use the trigram if you have good evidence, otherwise the bigram, otherwise the unigram
§  Interpolation
  §  mix unigram, bigram, and trigram
§  Interpolation works better

Linear Interpolation
§  Simple interpolation, such that the λs sum to 1:
   \hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1 P(w_n \mid w_{n-2} w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n), \quad \sum_i \lambda_i = 1
§  Lambdas conditional on context: each λ can itself be conditioned on the preceding words, \lambda_i(w_{n-2}^{n-1})
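A minimal sketch of simple linear interpolation (Eq. 4.24); the λ values below are placeholders that would in practice be tuned on a held-out set, as the next slide describes, and p_uni/p_bi/p_tri are assumed model functions:

```python
def p_interp(w, prev1, prev2, p_uni, p_bi, p_tri, lambdas=(0.5, 0.3, 0.2)):
    """Interpolated trigram estimate: lambda1*P(w|prev2,prev1) + lambda2*P(w|prev1) + lambda3*P(w)."""
    l_tri, l_bi, l_uni = lambdas  # must sum to 1; tuned on held-out data in practice
    return (l_tri * p_tri(w, prev2, prev1)
            + l_bi * p_bi(w, prev1)
            + l_uni * p_uni(w))
```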
How to set the lambdas?
§  Use a held-out corpus (split the data: Training Data | Held-Out Data | Test Data)
§  Choose λs to maximize the probability of the held-out data:
  §  Fix the N-gram probabilities (using the training data)
  §  Then search for λs that give the largest probability to the held-out set.

Types, Tokens and Fish
§  Much of what’s coming up was first
studied by biologists who are often faced
with 2 related problems
§  Determining how many species occupy a
particular area (types)
§  And determining how many individuals of a
given species are living in a given area
(tokens)
One Fish Two Fish
§  Imagine you are fishing
§  There are 8 species: carp, perch, whitefish, trout, salmon, eel,
catfish, bass
§  Not sure where this fishing hole is...
§  You have caught up to now
§  10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish
§  How likely is it that the next fish to be caught is an eel?
§  How likely is it that the next fish caught will be a
member of newly seen species?
§  Now how likely is it that the next fish caught will be an
eel?
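One way to make these questions concrete, sketched here using the standard seen-once intuition (this reasoning is assumed, not stated on the slide):

```latex
\begin{align*}
P_{\mathrm{MLE}}(\text{eel}) &= \frac{1}{18} \\
P(\text{new species}) &\approx \frac{N_1}{N} = \frac{3}{18}
  \quad\text{(trout, salmon, eel each seen once, so } N_1 = 3\text{)} \\
P(\text{eel}) &< \frac{1}{18}
  \quad\text{(some probability mass must be reserved for unseen species)}
\end{align*}
```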
Slide adapted from Josh Goodman
Fish Lesson
§  We need to steal part of the observed
probability mass to give it to the as yet
unseen N-Grams. Questions are:
§  How much to steal
§  How to redistribute it
Absolute Discounting
§  Just subtract a fixed amount from all the
observed counts (call that d).
§  Redistribute it proportionally based on
observed data
9/15/15
Speech and Language Processing - Jurafsky and Martin
29
Absolute Discounting w/
Interpolation
P_{AbsoluteDiscounting}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) - d}{c(w_{i-1})} + \lambda(w_{i-1}) P(w_i)

where the first term is the discounted bigram estimate, \lambda(w_{i-1}) is the interpolation weight, and P(w_i) is the unigram probability.
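A minimal sketch of absolute discounting with interpolation, following the formula above; the discount d = 0.75 and the count dictionaries are assumptions for illustration, and p_unigram is an assumed unigram model:

```python
def p_abs_discount(word, prev, bigram_counts, unigram_counts, p_unigram, d=0.75):
    """Subtract a fixed discount d from each observed bigram count and give the
    freed-up probability mass to the unigram distribution."""
    c_prev = unigram_counts[prev]
    discounted = max(bigram_counts.get((prev, word), 0) - d, 0) / c_prev
    # lambda(prev): the discounted mass, spread over the unigram distribution
    num_types_after_prev = sum(1 for (w1, _) in bigram_counts if w1 == prev)
    lam = (d / c_prev) * num_types_after_prev
    return discounted + lam * p_unigram(word)
```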
Kneser-Ney Smoothing
§  Better estimate for probabilities of lower-order unigrams!
§  Shannon game: “I can’t see without my reading ___________?”  (glasses or Francisco?)
§  “Francisco” is more common than “glasses”
§  … but “Francisco” frequently follows “San”
§  So P(w) isn’t what we want

Kneser-Ney Smoothing
§  P_continuation(w): “How likely is w to appear as a novel continuation?”
§  For each word, count the number of bigram types it completes:
   P_{CONTINUATION}(w) \propto |\{w_{i-1} : c(w_{i-1}, w) > 0\}|
Kneser-Ney Smoothing
§  Normalize by the total number of word bigram
types to get a true probability
P_{CONTINUATION}(w) = \frac{|\{w_{i-1} : c(w_{i-1}, w) > 0\}|}{|\{(w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0\}|}
Kneser-Ney Smoothing
P_{KN}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1}, w_i) - d, 0)}{c(w_{i-1})} + \lambda(w_{i-1}) P_{CONTINUATION}(w_i)

λ is a normalizing constant: the probability mass we’ve discounted

\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})} \cdot |\{w : c(w_{i-1}, w) > 0\}|

where d / c(w_{i-1}) is the normalized discount, and |\{w : c(w_{i-1}, w) > 0\}| is the number of word types that can follow w_{i-1}
= # of word types we discounted
= # of times we applied the normalized discount
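A minimal sketch of interpolated Kneser-Ney for bigrams, following the two formulas above; the count dictionaries and d = 0.75 are assumptions for illustration:

```python
def p_kn(word, prev, bigram_counts, unigram_counts, d=0.75):
    """Interpolated Kneser-Ney bigram estimate."""
    c_prev = unigram_counts[prev]
    first_term = max(bigram_counts.get((prev, word), 0) - d, 0) / c_prev

    # lambda(prev) = (d / c(prev)) * |{w : c(prev, w) > 0}|
    types_after_prev = sum(1 for (w1, _), c in bigram_counts.items() if w1 == prev and c > 0)
    lam = (d / c_prev) * types_after_prev

    # P_continuation(word) = |{w' : c(w', word) > 0}| / |{(x, y) : c(x, y) > 0}|
    total_bigram_types = sum(1 for c in bigram_counts.values() if c > 0)
    continuation_types = sum(1 for (_, w2), c in bigram_counts.items() if w2 == word and c > 0)
    p_cont = continuation_types / total_bigram_types

    return first_term + lam * p_cont
```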