Assignment 5 - Computer Science

Generating Random N-Grams
1. Overview
This project is a variation on what you will find in J&M, p. 93. It will give you an
opportunity to explore the properties of N-Grams.
2. The Basic Idea (Using the Bogensberger-Johnson Cumulative Probability
Technique)
Imagine that you have before you a very long sheet of paper, along with a list of
all words in English and their relative frequencies, in no particular order. The
paper represents the interval [0..1]. Go through the words in the list one by one.
Assign each word an interval on the paper in proportion to its relative frequency.
The interval is indicated by a vertical line and the cumulative probability of that
word, given all words that have preceded it, written below the word. As an
example, suppose your impoverished corpus has only these six Italian words with
their accompanying relative frequencies:
Word         Relative Frequency   Cumulative Probability
al           .3                   .3
tavola       .1                   .4
non          .05                  .45
ci           .05                  .5
invecchia    .4                   .9
mai          .1                   1
At the end of the process you would have a sheet of paper with these properties:
- the ‘al’ interval would be the first 30% of the sheet of paper
- the ‘tavola’ interval would be the next 10% of the sheet of paper
- the ‘non’ interval would be the next 5% of the sheet of paper
- the ‘ci’ interval would be the next 5% of the sheet of paper
- the ‘invecchia’ interval would be the next 40% of the sheet of paper
- the ‘mai’ interval would be the final 10% of the sheet of paper.
The sum of the relative frequencies of the words adds up to 1.0 and is represented
by the full length of the sheet of paper. The paper has been divided into as many
intervals as there are words, with each word having been assigned an interval
whose width is in proportion to its relative frequency. You now generate a
random number in the interval [0..1] and print the first word whose cumulative
probability mass marker exceeds the random number generated. You continue
this process until you have generated a pre-determined number of words.
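The sampling loop described above can be sketched as follows (a minimal sketch using the six-word table; the function and variable names are my own, not part of the assignment):

```python
import random

# Toy model from the table above: six words with their relative frequencies.
freqs = [('al', .3), ('tavola', .1), ('non', .05),
         ('ci', .05), ('invecchia', .4), ('mai', .1)]

def pick_word(freqs):
    """Draw a random number in [0, 1) and return the first word whose
    cumulative probability exceeds it."""
    r = random.random()
    cumulative = 0.0
    for word, f in freqs:
        cumulative += f
        if r < cumulative:
            return word
    return freqs[-1][0]  # guard against floating-point rounding

# Generate a pre-determined number of words (here, ten).
sentence = [pick_word(freqs) for _ in range(10)]
```

Because each word's slice of the cumulative interval has width equal to its relative frequency, 'invecchia' will appear roughly 40% of the time over many draws.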
3. Extending the Idea to N-Grams
The basic idea can be extended to bigrams, trigrams, and quadgrams by using
the N-Gram relative frequency equation on p. 89 (4.15). You can further extend
the idea to corpora that have beginning and end of sentence markers (<s>, </s>)
by altering the beginning and end of the process. At the beginning of a
sentence, keep generating random numbers until one falls in the interval of an
N-Gram that begins with <s>. At the end, keep generating grams until you draw
an N-Gram that ends with </s>.
4. Tokenize a corpus
Bring up python 2.7 and do this:
>>> import nltk
>>> from nltk.corpus import brown
>>> news = brown.sents(categories='editorial')
This will produce a list of all sentences in a subsection of the Brown corpus, the
first large corpus of English text, created in 1961 at Brown University. The little
‘u’ you see before each word indicates that it has been encoded using Unicode.
You can process in Unicode and print. That is, no encoding modifications are
necessary. Nevertheless, if you find the ‘u’s annoying, you can remove them with
this:
>>> new_news = [[item.encode('ascii') for item in lst] for lst in news]
This list comprehension is, in essence, a nested loop: for each sentence in
news, it encodes each word in ASCII.
The first two lists in the editorial section, in ASCII, look like this:
['Assembly', 'session', 'brought', 'much', 'good']
['The', 'General', 'Assembly', ',', 'which', 'adjourns', 'today', ',', 'has', 'performed',
'in', 'an', 'atmosphere', 'of', 'crisis', 'and', 'struggle', 'from', 'the', 'day', 'it',
'convened', '.']
Almost all of the tokenizing has already been done. The only remaining task is
to insert <s> at the front of each list and </s> at the end (in place of the period
if there is one).
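One way to sketch the marker insertion (the helper name add_markers is my own, and the sketch assumes each sentence is already a list of token strings):

```python
def add_markers(sent):
    """Insert <s> at the front of a tokenized sentence and </s> at the
    end, replacing a final period if there is one."""
    sent = ['<s>'] + sent          # builds a new list; the input is untouched
    if sent[-1] == '.':
        sent[-1] = '</s>'
    else:
        sent.append('</s>')
    return sent

marked = [add_markers(s) for s in
          [['Assembly', 'session', 'brought', 'much', 'good'],
           ['It', 'convened', '.']]]
# marked[1] is ['<s>', 'It', 'convened', '</s>']
```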
5. From Brown editorials, create four N-Gram relative frequency data structures
using equation 4.15 on p. 89 in J&M for unigrams, bigrams, trigrams,
quadgrams. To do this, generate all N-Grams for each gram size and compute
their relative frequencies, storing the grams and frequencies in a data structure
of your choice. A dictionary to count the frequencies is a good start.
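A dictionary-based sketch of the relative-frequency computation might look like this (a sketch only, under the assumption that counting each (N-1)-gram prefix as it starts an N-Gram is close enough to equation 4.15 for marker-wrapped sentences; the function name is my own):

```python
from collections import Counter

def ngram_relfreqs(sents, n):
    """Map each n-gram to count(n-gram) / count(its (n-1)-gram prefix).
    For unigrams the prefix is the empty tuple, whose count is the total
    number of tokens, so the same formula covers n == 1."""
    ngram_counts = Counter()
    prefix_counts = Counter()
    for sent in sents:
        for i in range(len(sent) - n + 1):
            gram = tuple(sent[i:i + n])
            ngram_counts[gram] += 1
            prefix_counts[gram[:-1]] += 1
    return {g: c / prefix_counts[g[:-1]]
            for g, c in ngram_counts.items()}
```

Storing grams as tuples keyed in a dictionary makes the bigram, trigram, and quadgram cases fall out of the same function, just by varying n.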
6. Generate five random sentences from each N-Gram relative frequency data
structure, using the technique described above. Each sentence begins with <s>
and ends with </s>. You will probably have to generate multiple random
numbers at the beginning and end of each sentence so that the grams you
choose begin with <s> and end with </s>.
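For the bigram case, the generation step might be sketched as below. This is one possible design, not the required one: after the <s> rejection loop, it restricts each later draw to bigrams that continue the last word, since those conditional relative frequencies already sum to 1. All names are my own.

```python
import random

def sample_gram(relfreqs):
    """Cumulative-probability draw over a non-empty {gram: freq} dict."""
    r = random.random() * sum(relfreqs.values())
    cumulative = 0.0
    for gram, f in relfreqs.items():
        cumulative += f
        if r < cumulative:
            return gram
    return gram  # guard against floating-point rounding

def generate_sentence(bigram_relfreqs):
    """Reject draws until one begins with <s>, then extend the sentence
    one word at a time until a draw ends with </s>."""
    gram = sample_gram(bigram_relfreqs)
    while gram[0] != '<s>':
        gram = sample_gram(bigram_relfreqs)
    sentence = list(gram)
    while sentence[-1] != '</s>':
        continuations = {g: f for g, f in bigram_relfreqs.items()
                         if g[0] == sentence[-1]}
        gram = sample_gram(continuations)
        sentence.append(gram[1])
    return sentence
```

On a toy model with one possible sentence, such as {('<s>', 'a'): 1.0, ('a', 'b'): 1.0, ('b', '</s>'): 1.0}, this always yields ['<s>', 'a', 'b', '</s>'].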
7. The output should look like that on p. 93.
8. Extra Credit: Download a text version of the complete works of Shakespeare.
Tokenize it and do steps 5 and 6.
9. The program must be submitted according to specs on the class web site (link
#5) and formatted according to programming specs (link #7). With respect to
program specs, I am looking for clear, clean, nicely documented, nicely
decomposed code which includes, at the very top of the program, clear
instructions on how to run it. Though you don’t need a main() function in
Python, I hope you can see, by looking at line #7, how it provides the user with
an outline of the structure of the program.