Assignment 5
Generating Random N-Grams

1. Overview

This project is a variation on what you will find in J&M, p. 93. It will give you an opportunity to explore the properties of N-Grams.

2. The Basic Idea (Using the Bogensberger-Johnson Cumulative Probability Technique)

Imagine that you have before you a very long sheet of paper, along with a list of all words in English and their relative frequencies, in no particular order. The paper represents the interval [0..1]. Go through the words in the list one by one. Assign each word an interval on the paper in proportion to its relative frequency. The interval is indicated by a vertical line, with the cumulative probability of that word, given all words that have preceded it, written below the word.

As an example, suppose your impoverished corpus has only these six Italian words with their accompanying relative frequencies:

                            al    tavola   non    ci    invecchia   mai
    Relative Frequency      .3    .1       .05    .05   .4          .1
    Cumulative Probability  .3    .4       .45    .5    .9          1

At the end of the process you would have a sheet of paper with these properties:

    - the 'al' interval would be the first 30% of the sheet of paper
    - the 'tavola' interval would be the next 10% of the sheet of paper
    - the 'non' interval would be the next 5% of the sheet of paper
    - the 'ci' interval would be the next 5% of the sheet of paper
    - the 'invecchia' interval would be the next 40% of the sheet of paper
    - the 'mai' interval would be the final 10% of the sheet of paper.

The sum of the relative frequencies of the words adds up to 1.0 and is represented by the full length of the sheet of paper. The paper has been divided into as many intervals as there are words, with each word having been assigned an interval whose width is in proportion to its relative frequency.

You now generate a random number in the interval (0..1) and print the first word whose cumulative probability marker exceeds the random number generated. You continue this process until you have generated a predetermined number of words.
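The technique above can be sketched in a few lines of Python. This is a minimal illustration using the six-word Italian corpus from the example, not a complete solution to the assignment:

```python
import bisect
import random

# The toy corpus from the example: six words and their relative frequencies.
words = ['al', 'tavola', 'non', 'ci', 'invecchia', 'mai']
rel_freqs = [0.3, 0.1, 0.05, 0.05, 0.4, 0.1]

# Lay out the "sheet of paper": one cumulative probability marker per word.
cumulative, total = [], 0.0
for f in rel_freqs:
    total += f
    cumulative.append(total)

def pick_word():
    """Return the first word whose cumulative marker exceeds a random draw."""
    r = random.random()
    # min() guards against float rounding leaving the last marker below r.
    return words[min(bisect.bisect(cumulative, r), len(words) - 1)]

# Generate a predetermined number of words.
print(' '.join(pick_word() for _ in range(10)))
```

Over many draws, each word should appear in rough proportion to its relative frequency (about 40% 'invecchia', 30% 'al', and so on).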
3. Extending the Idea to N-Grams

The basic idea can be extended to bigrams, trigrams, and quadgrams by using the N-Gram relative frequency equation on p. 89 (4.15). You can further extend the idea to corpora that have beginning- and end-of-sentence markers (<s>, </s>) by altering the beginning and end of the process. At the beginning, generate random numbers until an N-Gram beginning with <s> has an interval of the appropriate size. At the end, continue generating grams until an N-Gram ending with </s> has an interval of the appropriate size.

4. Tokenize a Corpus

Bring up python 2.7 and do this:

    >>> import nltk
    >>> from nltk.corpus import brown
    >>> news = brown.sents(categories='editorial')

This will produce a list of all sentences in a subsection of the Brown corpus, the first large corpus of English text, created in 1961 at Brown University. The little 'u' you see before each word indicates that it has been encoded using Unicode. You can process and print in Unicode; that is, no encoding modifications are necessary. Nevertheless, if you find the 'u's annoying, you can remove them with this:

    >>> new_news = [[item.encode('ascii') for item in lst] for lst in news]

This list comprehension is, in essence, a nested loop: for each sentence in news, it encodes each word in ASCII. The first two lists in the editorial section, in ASCII, look like this:

    ['Assembly', 'session', 'brought', 'much', 'good']
    ['The', 'General', 'Assembly', ',', 'which', 'adjourns', 'today', ',', 'has', 'performed', 'in', 'an', 'atmosphere', 'of', 'crisis', 'and', 'struggle', 'from', 'the', 'day', 'it', 'convened', '.']

Almost all of the tokenizing has already been done. The only remaining task is to insert <s> at the front of each list and </s> at the end (in place of the period if there is one).

5. From the Brown editorials, create four N-Gram relative frequency data structures using equation 4.15 on p. 89 in J&M: one each for unigrams, bigrams, trigrams, and quadgrams.
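The marker insertion that step 4 calls for might look like the following sketch. Two hard-coded sentences stand in for the output of brown.sents(); the real input would be the full list of editorial sentences:

```python
# Stand-ins for the NLTK sentence lists from step 4.
news = [
    ['Assembly', 'session', 'brought', 'much', 'good'],
    ['The', 'General', 'Assembly', ',', 'which', 'adjourns', 'today', '.'],
]

def add_markers(sent):
    """Prepend <s>; replace a trailing period with </s>, else append </s>."""
    body = sent[:-1] if sent and sent[-1] == '.' else list(sent)
    return ['<s>'] + body + ['</s>']

marked = [add_markers(s) for s in news]
print(marked[0])  # ['<s>', 'Assembly', 'session', 'brought', 'much', 'good', '</s>']
```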
To do this, generate all N-Grams for each gram size and compute their relative frequencies, storing the grams and frequencies in a data structure of your choice. A dictionary to count the frequencies is a good start.

6. Generate five random sentences from each N-Gram relative frequency data structure, using the technique described above. Each sentence begins with <s> and ends with </s>. You will probably have to generate multiple random numbers at the beginning and end of each sentence so that the grams you choose start and end with the sentence markers.

7. The output should look like that on p. 93.

8. Extra Credit: Download a text version of the complete works of Shakespeare. Tokenize it and do steps 5 and 6.

9. The program must be submitted according to the specs on the class web site (link #5) and formatted according to the programming specs (link #7). With respect to the program specs, I am looking for clear, clean, nicely documented, nicely decomposed code which includes, at the very top of the program, clear instructions on how to run it. Though you don't need a main() function in Python, I hope you can see, by looking at link #7, how it provides the user with an outline of the structure of the program.
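Steps 5 and 6 can be sketched end-to-end for the bigram case. This runs on a tiny hypothetical corpus rather than the Brown editorials, and it is one possible shape for the data structure, not the required one. Per equation 4.15, the relative frequency of a bigram is C(w1 w2) / C(w1):

```python
import random
from collections import Counter

# Tiny hypothetical corpus, already marked as in step 4.
sents = [
    ['<s>', 'a', 'b', 'a', '</s>'],
    ['<s>', 'a', 'b', '</s>'],
    ['<s>', 'b', 'a', '</s>'],
]

# Step 5: bigram relative frequencies per eq. 4.15, C(w1 w2) / C(w1).
bigram_counts = Counter(tuple(s[i:i + 2]) for s in sents for i in range(len(s) - 1))
unigram_counts = Counter(w for s in sents for w in s)
rel_freq = {bg: c / float(unigram_counts[bg[0]]) for bg, c in bigram_counts.items()}

def next_word(prev):
    """Cumulative-probability draw over all bigrams that continue prev."""
    r, total = random.random(), 0.0
    for (w1, w2), p in rel_freq.items():
        if w1 == prev:
            total += p
            if total > r:
                return w2
    return '</s>'  # guard against float rounding leaving total just below r

def generate_sentence(max_len=20):
    """Step 6: start at <s> and draw words until </s> appears."""
    sent = ['<s>']
    while sent[-1] != '</s>' and len(sent) < max_len:
        sent.append(next_word(sent[-1]))
    return sent

print(' '.join(generate_sentence()))
```

In this corpus, '<s>' is followed by 'a' twice out of three sentences, so rel_freq[('<s>', 'a')] is 2/3; the generator then walks bigram-by-bigram until it draws '</s>'. Extending the sketch to trigrams and quadgrams means using longer tuples as dictionary keys and conditioning on the last N-1 words instead of one.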