
Exercise 8
Statistical Language Models
Program Analysis and Synthesis 2017
ETH Zürich
April 11, 2017
The goal of statistical language models is to learn a probability distribution over sequences of words. That is, given a sentence s = w1 · w2 · · · wm, we are interested in estimating the probability p(s). By the chain rule of probability, the joint probability of the sentence s can be rewritten as:
p(s) = p(w1) · p(w2 | w1) · · · p(wm | w1 · · · wm−1)
where each of the conditional probabilities describes the probability that the word wi
follows the preceding part of the sentence w1 · · · wi−1.
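For instance, for a hypothetical three-word sentence s = the · cat · sat, this decomposition reads p(s) = p(the) · p(cat | the) · p(sat | the · cat).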
The n-gram Statistical Language Model The n-gram model is a popular and widely
used language model due to its simplicity and efficient training algorithms. It is a
generative model that computes the probability of the sentence word by word, using the
conditional probabilities p(wi | w1 · · · wi−1 ) shown above.
However, computing the exact probability of each word wi given all of the preceding
words w1 · · · wi−1 is expensive or even intractable due to unbounded sentence length.
Instead, the n-gram model is based on the Markov assumption of order n − 1, where wi
depends only on the previous n − 1 words wi−n+1 · · · wi−1. Intuitively, the larger n is,
the more accurate the approximation becomes. The probability of a whole sentence
w1 · · · wm is therefore estimated as:
p(w1 · · · wm) = ∏_{i=1}^{m} p(wi | w1 · · · wi−1) ≈ ∏_{i=1}^{m} p(wi | wi−n+1 · · · wi−1)
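To make the approximation concrete, the following is a minimal Python sketch, assuming a bigram model (n = 2) whose conditional probabilities are given in a hypothetical dictionary bigram_prob keyed by (previous word, current word) pairs; the start-of-sentence padding token "<s>" is also an assumption of this sketch:

# Minimal sketch: probability of a sentence under a bigram (n = 2) model.
# `bigram_prob` is a hypothetical dict mapping (w_{i-1}, w_i) -> p(w_i | w_{i-1}).

def sentence_probability(sentence, bigram_prob):
    words = sentence.split()
    prob = 1.0
    prev = "<s>"  # assumed start-of-sentence context for the first word
    for w in words:
        # Markov assumption of order 1: condition only on the previous word.
        prob *= bigram_prob.get((prev, w), 0.0)
        prev = w
    return prob

# Example usage with made-up probabilities:
bigram_prob = {("<s>", "the"): 0.5, ("the", "cat"): 0.2, ("cat", "sat"): 0.3}
print(sentence_probability("the cat sat", bigram_prob))  # 0.5 * 0.2 * 0.3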
Training In terms of training an n-gram model, the conditional probability of a word
wi given its context wi−n+1 · · · wi−1 is typically estimated using a maximum likelihood
estimation technique as follows:
p(wi | wi−n+1 · · · wi−1) = c(wi−n+1 · · · wi−1 · wi) / Σ_w c(wi−n+1 · · · wi−1 · w)
where c(w) denotes the number of times the sequence of words w has been seen in the
training data.
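As a concrete illustration of this estimate, here is a minimal Python sketch, assuming a small tokenised corpus represented as a list of word lists; the helper names count_ngrams and mle_prob are made up for this example:

from collections import Counter

def count_ngrams(corpus, n):
    """Count all n-grams (as tuples of words) in a list of tokenised sentences."""
    counts = Counter()
    for sentence in corpus:
        for i in range(len(sentence) - n + 1):
            counts[tuple(sentence[i:i + n])] += 1
    return counts

def mle_prob(context, word, ngram_counts):
    """Maximum likelihood estimate: c(context · word) divided by the sum of
    c(context · w) over all words w that follow the same context."""
    denom = sum(c for ngram, c in ngram_counts.items() if ngram[:-1] == context)
    return ngram_counts[context + (word,)] / denom if denom > 0 else 0.0

# Example on a tiny made-up corpus with a bigram model (n = 2):
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
bigrams = count_ngrams(corpus, 2)
print(mle_prob(("the",), "cat", bigrams))  # c(the cat) / (c(the cat) + c(the dog)) = 0.5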
Although such simple maximum likelihood estimation often works well, in practice this
technique suffers from data sparsity and assigns zero probability to n-grams never seen
in training data. As a result, the probability of the whole sequence s will be zero, which
is clearly undesirable.
Problem 1. To address the issue of data sparsity, define a smoothed version of the n-gram model based on the idea that each n-gram occurs once more than it actually does in the training data. Write down the calculation of the conditional probability in such a model, denoted p_{plus one}, and show that it is a valid probability distribution.
Problem 2. Extend the definition from Problem 1 by allowing an arbitrary value δ > 0 (typically 0 < δ ≤ 1) to be added to the count of each n-gram.
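As an illustration of the idea behind Problems 1 and 2 (the written formula and the proof that it defines a valid distribution are the subject of the exercises), here is a minimal sketch of add-δ smoothing, assuming the count representation from the sketch above and a known finite vocabulary vocab; the name additive_prob is made up:

def additive_prob(context, word, ngram_counts, vocab, delta=1.0):
    """Add-delta smoothed estimate: add delta to the count of every n-gram, so the
    denominator grows by delta * |V|; delta = 1.0 corresponds to Problem 1."""
    denom = sum(c for ngram, c in ngram_counts.items() if ngram[:-1] == context)
    return (ngram_counts.get(context + (word,), 0) + delta) / (denom + delta * len(vocab))

With this estimate, every word in the vocabulary receives a nonzero probability even if the corresponding n-gram never appears in the training data.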
Problem 3. The additive smoothing from Problems 1 and 2 generally performs quite poorly, as its main purpose is only to ensure that the probability of the whole sentence is not zero. Instead, we would like to use a smoothing technique that helps us better estimate the true probability distribution of the data. To address this issue, we would like to fall back to n-grams of lower order when estimating the probability of rare n-grams. Write down the calculation of the conditional probability in such a model, assuming that we want to interpolate a bigram model with a unigram model.
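As a code-level illustration of this interpolation (the formula itself is what Problem 3 asks for), here is a minimal sketch assuming hypothetical functions bigram_prob(prev, word) and unigram_prob(word) and a fixed interpolation weight lam:

def interpolated_bigram_prob(prev, word, bigram_prob, unigram_prob, lam=0.7):
    """Linear interpolation of a bigram estimate with a unigram estimate, where
    0 <= lam <= 1 is a fixed weight (0.7 is an arbitrary choice for this sketch)."""
    return lam * bigram_prob(prev, word) + (1 - lam) * unigram_prob(word)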
Problem 4. Extend the definition from Problem 3 to work with an arbitrary n-gram order by using a recursive formulation.
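In the same spirit, the recursion of Problem 4 can be sketched as follows, assuming a hypothetical function mle_prob(context, word) that returns the maximum likelihood estimate for the given context length, a single weight lam shared across all orders, and a uniform distribution over the vocabulary as the base case:

def interpolated_prob(context, word, mle_prob, vocab, lam=0.7):
    """Recursively interpolate the estimate for the full context with the same
    estimate computed on the context shortened by its oldest word (one order lower)."""
    if not context:
        # Base case: mix the unigram estimate with a uniform distribution over the vocabulary.
        return lam * mle_prob((), word) + (1 - lam) / len(vocab)
    shorter = context[1:]  # drop the oldest context word, i.e. fall back to order n - 1
    return (lam * mle_prob(context, word)
            + (1 - lam) * interpolated_prob(shorter, word, mle_prob, vocab, lam))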