Exercise 8
Statistical Language Models
Program Analysis and Synthesis 2017, ETH Zürich
April 11, 2017

The goal of statistical language models is to learn a probability distribution over sequences of words. That is, given a sentence s = w_1 w_2 ... w_m, we are interested in estimating the probability p(s). Without loss of generality, this can be expressed by rewriting the joint probability of the sentence s as:

    p(s) = p(w_1) \cdot p(w_2 \mid w_1) \cdots p(w_m \mid w_1 \cdots w_{m-1})

where each of the conditional probabilities describes the probability that the word w_i follows the preceding part of the sentence w_1 ... w_{i-1}.

The n-gram Statistical Language Model

The n-gram model is a popular and widely used language model due to its simplicity and efficient training algorithms. It is a generative model that computes the probability of a sentence word by word, using the conditional probabilities p(w_i | w_1 ... w_{i-1}) shown above. However, computing the exact probability of each word w_i given all of the preceding words w_1 ... w_{i-1} is expensive or even intractable, because sentence length is unbounded. Instead, the n-gram model is based on a Markov assumption of order n - 1, under which w_i depends only on the previous n - 1 words w_{i-n+1} ... w_{i-1}. Intuitively, the larger n is, the more accurate the approximation becomes. The probability of a whole sentence is therefore estimated as:

    p(w_1 \cdots w_m) = \prod_{i=1}^{m} p(w_i \mid w_1 \cdots w_{i-1}) \approx \prod_{i=1}^{m} p(w_i \mid w_{i-n+1} \cdots w_{i-1})

Training

To train an n-gram model, the conditional probability of a word w_i given its context w_{i-n+1} ... w_{i-1} is typically estimated using maximum likelihood estimation as follows:

    p(w_i \mid w_{i-n+1} \cdots w_{i-1}) = \frac{c(w_{i-n+1} \cdots w_{i-1} \, w_i)}{\sum_{w} c(w_{i-n+1} \cdots w_{i-1} \, w)}

where c(w) denotes the number of times the sequence of words w has been seen in the training data. Although such simple maximum likelihood estimation often works well, in practice this technique suffers from data sparsity and assigns zero probability to n-grams never seen in the training data. As a result, the probability of the whole sequence s will be zero, which is clearly undesirable.

Problem 1. To address the issue of data sparsity, define a smoothed version of the n-gram model based on the idea that each n-gram occurs once more than it actually does in the training data. Write down the calculation of the conditional probability in such a model, denoted p_{plus one}, and show that it is a valid probability distribution.

Problem 2. Extend the definition from Problem 1 to allow an arbitrary value δ > 0 (typically 0 < δ ≤ 1) to be added to the count of each n-gram.

Problem 3. The additive smoothing from Problems 1 and 2 generally performs quite poorly, as its main purpose is merely to ensure that the probability of the whole sentence is not zero. Instead, we would like to use a smoothing technique that helps us better estimate the true probability distribution of the data. To address this issue, we would like to fall back to n-grams of lower order when estimating the probability of rare n-grams. Write down the calculation of the conditional probability in such a model, assuming that we want to interpolate a bigram model with a unigram model.

Problem 4. Extend the definition from Problem 3 to work with an arbitrary n-gram order by using a recursive formulation.
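As an illustration of the count-based training described above, the following is a minimal Python sketch (not part of the original exercise) of an n-gram model estimated from maximum likelihood counts, with an optional additive-δ term in the spirit of Problems 1 and 2. The class name NGramModel, the padding symbols <s> and </s>, and the toy corpus are illustrative assumptions, not something specified by the exercise.

from collections import defaultdict

class NGramModel:
    """Count-based n-gram model with optional additive-delta smoothing (a sketch)."""

    def __init__(self, n, delta=0.0):
        self.n = n                               # n-gram order (n = 2 -> bigram model)
        self.delta = delta                       # additive smoothing constant (0 = plain MLE)
        self.ngram_counts = defaultdict(int)     # c(w_{i-n+1} ... w_{i-1} w_i)
        self.context_counts = defaultdict(int)   # sum_w c(w_{i-n+1} ... w_{i-1} w)
        self.vocab = set()

    def train(self, sentences):
        for sentence in sentences:
            # Pad with start/end markers so every word has a full-length context.
            words = ["<s>"] * (self.n - 1) + sentence + ["</s>"]
            self.vocab.update(words)
            for i in range(self.n - 1, len(words)):
                context = tuple(words[i - self.n + 1:i])
                self.ngram_counts[context + (words[i],)] += 1
                self.context_counts[context] += 1

    def cond_prob(self, word, context):
        """p(word | context) with additive-delta smoothing; plain MLE when delta = 0."""
        context = tuple(context)
        num = self.ngram_counts[context + (word,)] + self.delta
        den = self.context_counts[context] + self.delta * len(self.vocab)
        return num / den if den > 0 else 0.0

    def sentence_prob(self, sentence):
        """p(w_1 ... w_m) under the Markov assumption of order n - 1."""
        words = ["<s>"] * (self.n - 1) + sentence + ["</s>"]
        prob = 1.0
        for i in range(self.n - 1, len(words)):
            prob *= self.cond_prob(words[i], words[i - self.n + 1:i])
        return prob

# Tiny usage example on a hypothetical corpus.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
model = NGramModel(n=2, delta=1.0)   # bigram model with add-one smoothing
model.train(corpus)
print(model.sentence_prob(["the", "cat", "sat"]))

With delta = 0 the class reduces to the plain maximum likelihood estimator from the Training section, and any unseen n-gram drives the sentence probability to zero; with delta > 0 every n-gram receives a small positive probability mass.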
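For the interpolation in Problems 3 and 4, one common realization is Jelinek-Mercer-style linear interpolation with the next-lower order. The sketch below is only one possible formulation, not the exercise's reference solution: it uses a single fixed weight lam, maintains counts of all orders up to max_n, and bottoms out at the unigram (empty context) level; all helper names and the padding symbols are illustrative assumptions.

from collections import defaultdict

def train_counts(sentences, max_n):
    """Collect c(.) and context totals for every n-gram order up to max_n (a sketch)."""
    counts = defaultdict(int)          # maps a tuple of words to its count
    context_totals = defaultdict(int)  # maps a context tuple to sum_w c(context w)
    for sentence in sentences:
        words = ["<s>"] * (max_n - 1) + sentence + ["</s>"]
        for i in range(len(words)):
            for n in range(1, max_n + 1):
                if i - n + 1 >= 0:
                    counts[tuple(words[i - n + 1:i + 1])] += 1
                    context_totals[tuple(words[i - n + 1:i])] += 1
    return counts, context_totals

def mle_prob(counts, context_totals, word, context):
    """Plain MLE estimate p(word | context); the unigram case uses the empty context."""
    context = tuple(context)
    den = context_totals[context]
    return counts[context + (word,)] / den if den > 0 else 0.0

def interpolated_prob(counts, context_totals, word, context, lam=0.7):
    """Recursively mix the current-order MLE estimate with the next-lower order."""
    p = mle_prob(counts, context_totals, word, tuple(context))
    if len(context) == 0:
        return p   # unigram base case
    shorter = tuple(context)[1:]   # drop the oldest context word
    return lam * p + (1.0 - lam) * interpolated_prob(counts, context_totals, word, shorter, lam)

# Usage: bigram interpolated with unigram, as in Problem 3 (hypothetical corpus).
counts, totals = train_counts([["the", "cat", "sat"], ["the", "dog", "sat"]], max_n=2)
print(interpolated_prob(counts, totals, "sat", ("cat",)))

The single constant lam is a simplification; in practice the interpolation weights are often made context-dependent and tuned on held-out data, which the exercise's formulation may or may not require.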