Smoothing
Bonnie Dorr
Christof Monz
CMSC 723: Introduction to Computational Linguistics
Lecture 5
October 6, 2004
The Sparse Data Problem

- Maximum likelihood estimation works fine for data that occur frequently in the training corpus
- Problem 1: Low-frequency n-grams
  - If n-gram x occurs twice and n-gram y occurs once, is x really twice as likely as y?
- Problem 2: Zero counts
  - If n-gram y does not occur in the training data, does that mean it should have probability zero?
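
A minimal sketch in Python of the zero-count problem with maximum likelihood bigram estimates; the toy corpus is an illustrative assumption, not part of the lecture.

from collections import Counter

# Maximum likelihood bigram estimates from raw counts (toy corpus, for illustration).
tokens = "the cat sat on the mat".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens)

def mle_bigram(w_prev, w):
    # P(w | w_prev) = c(w_prev, w) / c(w_prev); zero if the bigram was never seen
    if unigram_counts[w_prev] == 0:
        return 0.0
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

print(mle_bigram("the", "cat"))   # 0.5: bigram seen once, "the" occurs twice
print(mle_bigram("the", "sat"))   # 0.0: unseen bigram gets zero probability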
The Sparse Data Problem

- Data sparseness is a serious and frequently occurring problem
- The probability of a sequence is zero if it contains unseen n-grams

Smoothing = Redistributing Probability Mass
Add-One Smoothing

- Simplest smoothing technique
- For all n-grams, including unseen n-grams, add one to their counts

- Un-smoothed probability:
  P(w) = c(w) / N

- Add-one probability (N = number of tokens, V = vocabulary size):
  P+1(w) = [c(w) + 1] / [N + V]
Add-One Smoothing
P(wn|wn-1) = C(wn-1wn)/C(wn-1)
P+1(wn|wn-1) = [C(wn-1wn)+1]/[C(wn-1)+V]
Add-One Smoothing

Adjusted counts implied by add-one smoothing, with ci = c(wi-1, wi):

  ci' = (ci + 1) · N / (N + V)
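
A minimal sketch in Python of add-one smoothing for bigrams, following the formulas above; the toy corpus and function names are illustrative assumptions, and the adjusted-count helper takes N to be the context count c(wi-1) for the bigram case.

from collections import Counter

# Add-one (Laplace) smoothing for bigrams, following the formulas above.
tokens = "the cat sat on the mat".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens)
V = len(set(tokens))                      # vocabulary size

def add_one_bigram(w_prev, w):
    # P+1(w | w_prev) = [c(w_prev, w) + 1] / [c(w_prev) + V]
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + V)

def adjusted_count(w_prev, w):
    # Adjusted count (c + 1) * c(w_prev) / (c(w_prev) + V): the count that gives
    # the same probability under the unsmoothed estimator.
    c_prev = unigram_counts[w_prev]
    return (bigram_counts[(w_prev, w)] + 1) * c_prev / (c_prev + V)

print(add_one_bigram("the", "cat"))       # seen bigram: discounted
print(add_one_bigram("the", "sat"))       # unseen bigram: now gets non-zero mass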
Add-One Smoothing

- Pro: very simple technique
- Cons:
  - Too much probability mass is shifted towards unseen n-grams
  - The probability of frequent n-grams is underestimated
  - The probability of rare (or unseen) n-grams is overestimated
  - All unseen n-grams are smoothed in the same way
  - Using a smaller added count does not solve this problem in principle
Witten-Bell Discounting

- Probability mass is shifted around depending on the context of words
- If P(wi | wi-1,…,wi-m) = 0, then the smoothed probability PWB(wi | wi-1,…,wi-m) is higher if the sequence wi-1,…,wi-m occurs with many different words wi
Witten-Bell Smoothing

Let's consider bigrams:
- T(wi-1) is the number of different words (types) that occur to the right of wi-1
- N(wi-1) is the number of all word occurrences (tokens) to the right of wi-1
- Z(wi-1) is the number of bigrams in the current data set starting with wi-1 that do not occur in the training data
Witten-Bell Smoothing

- If c(wi-1, wi) = 0:
  PWB(wi | wi-1) = T(wi-1) / [Z(wi-1) · (N(wi-1) + T(wi-1))]

- If c(wi-1, wi) > 0:
  PWB(wi | wi-1) = c(wi-1, wi) / [N(wi-1) + T(wi-1)]
Witten-Bell Smoothing

Adjusted counts, with T, Z, N as defined above (compare add-one: ci' = (ci + 1) · N / (N + V)):

  ci' = (T/Z) · N / (N + T)   if ci = 0
  ci' = ci · N / (N + T)      otherwise
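
A minimal sketch in Python of Witten-Bell bigram smoothing following the formulas above; the toy corpus is an illustrative assumption, as is computing Z(wi-1) as the vocabulary size minus T(wi-1).

from collections import Counter, defaultdict

# Witten-Bell smoothing for bigrams, following the formulas above.
tokens = "the cat sat on the mat".split()
vocab = set(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

followers = defaultdict(Counter)          # words observed after each history
for (w_prev, w), c in bigram_counts.items():
    followers[w_prev][w] = c

def witten_bell(w_prev, w):
    T = len(followers[w_prev])            # types seen to the right of w_prev
    N = sum(followers[w_prev].values())   # tokens seen to the right of w_prev
    Z = len(vocab) - T                    # assumed: unseen continuations of w_prev
    if T == 0:
        return 0.0                        # unseen history: still zero (the limitation noted below)
    c = bigram_counts[(w_prev, w)]
    if c > 0:
        return c / (N + T)                # discounted estimate for seen bigrams
    return T / (Z * (N + T))              # mass shared among unseen continuations

print(witten_bell("the", "cat"))          # seen bigram
print(witten_bell("the", "sat"))          # unseen bigram with a seen history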
Witten-Bell Smoothing

- Witten-Bell smoothing is more conservative when subtracting probability mass
- Gives rather good estimates
- Problem: if wi-1 and wi did not occur in the training data, the smoothed probability is still zero
Backoff Smoothing

- Deleted interpolation
  - If the n-gram wi-n,…,wi is not in the training data, use wi-(n-1),…,wi
  - More generally, combine evidence from different n-grams:

    PDI(wi | wi-n,…,wi-1) = λ · P(wi | wi-n,…,wi-1) + (1 − λ) · PDI(wi | wi-(n-1),…,wi-1)

  - Lambda is the 'confidence' weight for the longer n-gram
  - The lambda parameters are computed from held-out data
  - Lambdas can be n-gram specific
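
A minimal sketch in Python of deleted interpolation, written out for the common trigram/bigram/unigram case rather than the general recursion above; the toy corpus and the fixed lambda values are illustrative assumptions (in practice the lambdas are estimated on held-out data).

from collections import Counter

# Interpolate trigram, bigram, and unigram maximum likelihood estimates.
tokens = "the cat sat on the mat".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
N = len(tokens)

def p_unigram(w):
    return unigrams[w] / N

def p_bigram(w_prev, w):
    return bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0

def p_trigram(w1, w2, w):
    return trigrams[(w1, w2, w)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0

def p_interpolated(w1, w2, w, lambdas=(0.6, 0.3, 0.1)):
    # Weighted mix of the three estimates; the lambdas sum to one.
    l3, l2, l1 = lambdas
    return l3 * p_trigram(w1, w2, w) + l2 * p_bigram(w2, w) + l1 * p_unigram(w)

print(p_interpolated("the", "cat", "sat"))  # seen trigram
print(p_interpolated("the", "cat", "mat"))  # unseen trigram still gets mass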
Other Smoothing Approaches

- Good-Turing Discounting: re-estimates the amount of probability mass for zero-count (or low-count) n-grams by looking at n-grams with higher counts
- Kneser-Ney Smoothing: similar to Witten-Bell smoothing, but considers the number of word types preceding a word
- Katz Backoff Smoothing: reverts to shorter n-gram contexts if the count for the current n-gram is lower than some threshold
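
As a rough illustration of the Good-Turing idea mentioned above, a small Python sketch of the re-estimated counts c* = (c + 1) · N(c+1) / N(c), where N(c) is the number of n-gram types seen exactly c times; the toy counts are made up, and a practical implementation would also smooth the N(c) values.

from collections import Counter

# Good-Turing re-estimated counts from a frequency-of-frequencies table.
bigram_counts = Counter({("the", "cat"): 3, ("the", "mat"): 2,
                         ("cat", "sat"): 1, ("sat", "on"): 1, ("on", "the"): 1})
Nc = Counter(bigram_counts.values())     # N(c): how many types occur exactly c times
N = sum(bigram_counts.values())          # total observed bigram tokens

def good_turing_count(c):
    # c* = (c + 1) * N(c+1) / N(c); fall back to the raw count if N(c) or N(c+1) is zero
    if Nc[c] == 0 or Nc[c + 1] == 0:
        return float(c)
    return (c + 1) * Nc[c + 1] / Nc[c]

p_unseen_mass = Nc[1] / N                # total mass reserved for unseen bigrams
print(good_turing_count(1))              # discounted count for singletons
print(p_unseen_mass)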