Smoothing
Bonnie Dorr, Christof Monz
CMSC 723: Introduction to Computational Linguistics
Lecture 5, October 6, 2004

The Sparse Data Problem
- Maximum likelihood estimation works fine for data that occur frequently in the training corpus
- Problem 1: Low-frequency n-grams. If n-gram x occurs twice and n-gram y occurs once, is x really twice as likely as y?
- Problem 2: Zero counts. If n-gram y does not occur in the training data, does that mean it should have probability zero?

The Sparse Data Problem
- Data sparseness is a serious and frequently occurring problem
- The probability of a sequence is zero if it contains unseen n-grams
- Smoothing = redistributing probability mass

Add-One Smoothing
- The simplest smoothing technique
- For all n-grams, including unseen n-grams, add one to their counts
- Un-smoothed probability: P(w) = c(w) / N
- Add-one probability: P+1(w) = [c(w) + 1] / [N + V]

Add-One Smoothing
- P(wn|wn-1) = C(wn-1 wn) / C(wn-1)
- P+1(wn|wn-1) = [C(wn-1 wn) + 1] / [C(wn-1) + V]
- (a Python sketch of add-one smoothing follows the slides)

Add-One Smoothing
- Let ci = c(wi-1, wi)
- Adjusted count: ci' = (ci + 1) · N / (N + V)

Add-One Smoothing
- Pro: very simple technique
- Cons:
  - Too much probability mass is shifted towards unseen n-grams
  - The probability of frequent n-grams is underestimated
  - The probability of rare (or unseen) n-grams is overestimated
  - All unseen n-grams are smoothed in the same way
- Using a smaller added count does not solve this problem in principle

Witten-Bell Discounting
- Probability mass is shifted around, depending on the context of words
- If P(wi | wi-1,…,wi-m) = 0, then the smoothed probability PWB(wi | wi-1,…,wi-m) is higher if the sequence wi-1,…,wi-m occurs with many different words wi

Witten-Bell Smoothing
- Let's consider bigrams
- T(wi-1) is the number of different words (types) that occur to the right of wi-1
- N(wi-1) is the number of all word occurrences (tokens) to the right of wi-1
- Z(wi-1) is the number of bigrams in the current data set starting with wi-1 that do not occur in the training data

Witten-Bell Smoothing
- If c(wi-1, wi) = 0: PWB(wi|wi-1) = T(wi-1) / [Z(wi-1) · (N(wi-1) + T(wi-1))]
- If c(wi-1, wi) > 0: PWB(wi|wi-1) = c(wi-1, wi) / [N(wi-1) + T(wi-1)]
- (a Python sketch of Witten-Bell smoothing follows the slides)

Witten-Bell Smoothing
- Adjusted counts, compared with add-one smoothing:
  - Add-one: ci' = (ci + 1) · N / (N + V)
  - Witten-Bell: ci' = (T/Z) · N / (N + T) if ci = 0; ci' = ci · N / (N + T) otherwise

Witten-Bell Smoothing
- Witten-Bell smoothing is more conservative when subtracting probability mass
- Gives rather good estimates
- Problem: if wi-1 and wi did not occur in the training data, the smoothed probability is still zero

Backoff Smoothing
- Deleted interpolation
- If the n-gram wi-n,…,wi is not in the training data, use wi-(n-1),…,wi
- More generally, combine evidence from different n-grams:
  PDI(wi | wi-n,…,wi-1) = λ · P(wi | wi-n,…,wi-1) + (1 - λ) · PDI(wi | wi-(n-1),…,wi-1)
- λ is the 'confidence' weight for the longer n-gram
- Compute the λ parameters from held-out data
- The λs can be n-gram specific
- (a Python sketch of deleted interpolation follows the slides)

Other Smoothing Approaches
- Good-Turing discounting: re-estimates the amount of probability mass for zero-count (or low-count) n-grams by looking at n-grams with higher counts
- Kneser-Ney smoothing: similar to Witten-Bell smoothing, but considers the number of word types preceding a word
- Katz backoff smoothing: reverts to shorter n-gram contexts if the count for the current n-gram is lower than some threshold
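Below is a minimal Python sketch of the add-one (Laplace) bigram estimate from the slides, P+1(wn|wn-1) = [C(wn-1 wn) + 1] / [C(wn-1) + V]. The toy corpus and the function and variable names are illustrative assumptions, not part of the lecture.

```python
# A minimal sketch of add-one (Laplace) smoothing for bigrams, assuming a
# whitespace-tokenized toy corpus. Names here are illustrative, not from the lecture.
from collections import Counter

def train_counts(tokens):
    """Collect unigram and bigram counts from a list of tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def add_one_bigram_prob(w_prev, w, unigrams, bigrams):
    """P+1(w | w_prev) = [C(w_prev w) + 1] / [C(w_prev) + V]."""
    vocab_size = len(unigrams)  # V: number of word types seen in training
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

tokens = "the cat sat on the mat".split()
unigrams, bigrams = train_counts(tokens)
print(add_one_bigram_prob("the", "cat", unigrams, bigrams))  # seen bigram
print(add_one_bigram_prob("the", "dog", unigrams, bigrams))  # unseen bigram gets non-zero mass
```

Note how the unseen bigram "the dog" receives exactly the same probability as any other unseen continuation of "the"; this is the uniform treatment of unseen n-grams that the slides list as a drawback of add-one smoothing.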
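The next sketch follows the two-case Witten-Bell formula from the slides. Z(w_prev) is computed here as the number of vocabulary words never seen after w_prev, which is one reasonable reading of "bigrams in the current data set starting with wi-1 that do not occur in the training data"; all names are again illustrative.

```python
# A minimal sketch of Witten-Bell smoothing for bigrams, following the
# two-case formula on the slides; names are illustrative assumptions.
from collections import Counter, defaultdict

def train_witten_bell(tokens):
    """Collect bigram counts, continuation types T, and context tokens N."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    followers = defaultdict(set)   # distinct word types seen after each history
    context_tokens = Counter()     # N(w_prev): tokens observed after w_prev
    for (w_prev, w), count in bigrams.items():
        followers[w_prev].add(w)
        context_tokens[w_prev] += count
    vocab = set(tokens)
    return bigrams, followers, context_tokens, vocab

def witten_bell_prob(w_prev, w, bigrams, followers, context_tokens, vocab):
    """PWB(w | w_prev) using T(w_prev), N(w_prev), and Z(w_prev)."""
    T = len(followers[w_prev])     # types seen after w_prev
    N = context_tokens[w_prev]     # tokens seen after w_prev
    Z = len(vocab) - T             # types never seen after w_prev
    if N + T == 0:
        return 0.0                 # w_prev itself unseen: the limitation noted on the slides
    count = bigrams[(w_prev, w)]
    if count > 0:
        return count / (N + T)     # discounted ML estimate for seen bigrams
    return T / (Z * (N + T))       # unseen mass T/(N+T), split evenly over Z types

tokens = "the cat sat on the mat".split()
stats = train_witten_bell(tokens)
print(witten_bell_prob("the", "cat", *stats))  # seen: c / (N + T)
print(witten_bell_prob("the", "sat", *stats))  # unseen: T / (Z * (N + T))
```

The total mass reserved for unseen continuations of a history is T/(N + T), so histories followed by many different word types give away more probability, which is the context-dependence the slides emphasize.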
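Finally, a minimal sketch of deleted interpolation for a trigram model. The slides estimate the λ weights from held-out data and allow them to be n-gram specific; here a single fixed λ is hard-coded for brevity, and all names are illustrative assumptions.

```python
# A minimal sketch of deleted interpolation for a trigram model with a single
# fixed lambda (the held-out estimation step from the slides is omitted).
from collections import Counter

def ngram_counts(tokens, max_n=3):
    """Counts for every n-gram order from 1 up to max_n, keyed by order."""
    return {n: Counter(zip(*(tokens[i:] for i in range(n))))
            for n in range(1, max_n + 1)}

def ml_prob(counts, ngram):
    """Maximum-likelihood P(w | history) for an n-gram given as a tuple."""
    n = len(ngram)
    if n == 1:
        total = sum(counts[1].values())
        return counts[1][ngram] / total if total else 0.0
    history_count = counts[n - 1][ngram[:-1]]
    return counts[n][ngram] / history_count if history_count else 0.0

def interpolated_prob(counts, ngram, lam=0.7):
    """PDI = lam * P(longest context) + (1 - lam) * PDI(shorter context)."""
    if len(ngram) == 1:
        return ml_prob(counts, ngram)
    return (lam * ml_prob(counts, ngram)
            + (1 - lam) * interpolated_prob(counts, ngram[1:], lam))

tokens = "the cat sat on the mat".split()
counts = ngram_counts(tokens, max_n=3)
print(interpolated_prob(counts, ("on", "the", "mat")))   # seen trigram
print(interpolated_prob(counts, ("cat", "the", "mat")))  # unseen trigram, still non-zero
```

The unseen trigram still receives probability through its bigram and unigram components; estimating the λs on held-out data and letting them depend on the history would follow the slides more closely.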