COMP5338: Advanced Data Models
Week 8: Bigram Conditional Probabilities using Hadoop
Waiho Wong and Will Radford
September 19, 2011

1 Conditional Probabilities

Conditional probabilities [1] form the basis for many interesting techniques in Data Mining and Natural Language Processing, including Naïve Bayes classifiers [2] and Hidden Markov Models [3]. The following is the first paragraph of the Wikipedia article on Conditional Probability:

    In general, the probability of an event depends on the circumstances in
    which it occurs. A conditional probability is the probability of an event,
    assuming a particular set of circumstances. More formally, a conditional
    probability is the probability that event A occurs when the sample space
    is limited to event B. This is denoted by P(A|B), and is read "the
    probability of A, given B".

We're going to extend last week's tutorial to extract pairs of words – bigrams – from "War and Peace", then calculate the statistics that will allow us to answer questions like: what is the probability of seeing the word Denisov after the word Major? This is the conditional probability P(Denisov|Major).

[1] http://en.wikipedia.org/wiki/Conditional_probability
[2] http://en.wikipedia.org/wiki/Naive_Bayes_classifier
[3] http://en.wikipedia.org/wiki/Hidden_Markov_model

2 The Plan

As per last week, we're given the text of "War and Peace", and we need to calculate the following:

1. Bigram counts.
2. Counts of all bigrams that start with Major.
3. Total of all bigrams that start with Major.
4. Conditional probabilities.

This will let us calculate the conditional probability for each bigram starting with Major, that is: the number of times that bigram occurs divided by the total number of times that bigrams starting with Major occur.

2.1 Bigram counts

Problem

Pairs of adjacent words in text are called bigrams, and we need to count how many times each pair occurs in the text.
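The pair-extraction and counting logic at the heart of this step can be sketched in plain Java, outside Hadoop, to make the normalisation rules concrete. The class and method names below are illustrative only, not part of any required API; in the real job, the counting would be split between the mapper (emitting each pair with a count of 1) and the reducer (summing).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of the per-text logic: lowercase, strip
// punctuation, then count each adjacent word pair. Names here
// (BigramSketch, countBigrams) are hypothetical.
public class BigramSketch {
    public static Map<String, Integer> countBigrams(String text) {
        String[] words = text.toLowerCase()
                             .replaceAll("[^a-z\\s]", " ")  // punctuation -> space
                             .trim()
                             .split("\\s+");
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + 1 < words.length; i++) {
            // key format: <word1><pipe><word2>
            String key = words[i] + "|" + words[i + 1];
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {the|cat=2, cat|sat=1, sat|on=1, on|the=1}
        System.out.println(countBigrams("the cat sat on the cat"));
    }
}
```

In the Hadoop version, `countBigrams` would not exist as one method: the mapper emits each key with the value 1, and the reducer performs the `merge` step across all mappers.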
For example, the text "the cat sat on the cat" generates the following bigrams:

the|cat	2
cat|sat	1
sat|on	1
on|the	1

Notice that we've represented the bigrams using the following format:

<word1><pipe><word2><tab><count>

and that the|cat occurs twice.

Map/Reduce

You will need to write a Hadoop Tool class, BigramBuilder, whose mapper iterates through the text one pair of words at a time. You will need to transform the text to lower case, remove punctuation, and emit bigram records as described above. The reducer simply aggregates these bigram counts. This should only require changing last week's solution to iterate over word pairs rather than single words.

You should be able to run your code using something like:

export WAP=/user/whwong/warnpeace.txt
export HHOME=/user/$USER
export JAR=~/bigrams.jar
hadoop jar $JAR bigram.BigramBuilder job $WAP $HHOME/bis

This assumes that you have built a jar file named bigrams.jar and copied it to your home directory. We're using Eclipse [4] for this tutorial.

[4] http://eclipse.org/

2.2 Counts of all bigrams that start with Major

Once we've created our bigrams, we need to filter them so we keep only those that begin with a particular term. For example, we need to use the term Major and keep only the following bigrams:

major|and	4
major|angrily	1
major|began	1
major|ceased	1
major|change	1
major|denisov	5
major|ekonomov	1
major|grumbled	1
major|had	1
major|how	1
major|ivan	1
major|of	1
major|quietly	1
major|raised	1
major|skirted	1
major|tell	1
major|they	1
major|was	1
major|were	1
major|who	3
major|with	1
major|zakharchenko	1

Map/Reduce

You will need to write a Hadoop Tool class, BigramFilter, whose mapper scans through each bigram and checks the first word against a term. If it matches case-insensitively, then that bigram should be written to the context. The reducer does not do anything.
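The case-insensitive first-word check the mapper performs can be sketched as follows. This is plain Java for illustration; the class and method names are hypothetical, and in the real job the term would be fetched from the Configuration inside the mapper rather than passed as an argument.

```java
// Illustrative sketch of the BigramFilter mapper's check: does the
// bigram key's first word match the term, ignoring case? The names
// FilterSketch and startsWithTerm are hypothetical.
public class FilterSketch {
    public static boolean startsWithTerm(String bigramKey, String term) {
        int pipe = bigramKey.indexOf('|');
        if (pipe < 0) {
            return false;  // malformed record: no <pipe> separator
        }
        String firstWord = bigramKey.substring(0, pipe);
        return firstWord.equalsIgnoreCase(term);  // "major|denisov" matches "Major"
    }
}
```

A mapper would emit the record unchanged when this check passes and drop it otherwise.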
Some details:

- You will need to read the term from the command line and store it in the JobContext's Configuration so the mapper can easily access it.
- Set the combiner and reducer class to Reducer.class to provide the default (identity) behaviour.

You should be able to run the job like this:

hadoop jar $JAR bigram.BigramFilter job Major $HHOME/bis $HHOME/starts

2.3 Total of all bigrams that start with Major

Now we need to count all those bigrams for the denominator.

Map/Reduce

You will need to write a Hadoop Tool class, TotalsBuilder, whose mapper scans through each bigram, creates a fake bigram key <word1>|*, and emits it. The reducer aggregates all of these together, counting all bigrams starting with word1.

You should be able to run this:

hadoop jar $JAR bigram.TotalsBuilder job $HHOME/bis $HHOME/totals

You should see in the output file:

major|*	31

2.4 Conditional Probabilities

Now you need to put these together.

Map/Reduce

You will need to write a Hadoop Tool class, ConditionalProbBuilder, whose mapper reads the starts bigrams and emits them. The reducer reads the totals file, and can then calculate the conditional probability:

hadoop jar $JAR bigram.ConditionalProbBuilder job major \
    $HHOME/totals/part-r-00000 $HHOME/starts $HHOME/cprob

You should find that P(Denisov|Major) (major|denisov in the output file) is about 0.16 (i.e., 5/31).
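As a sanity check on that final figure, the reducer's arithmetic can be sketched in plain Java; the names here are hypothetical and the real reducer would read the total from the part-r-00000 totals file rather than take it as a parameter.

```java
// Illustrative sketch of the ConditionalProbBuilder reducer's
// calculation: P(word2 | word1) = count(word1|word2) / count(word1|*).
// The names CondProbSketch and conditionalProb are hypothetical.
public class CondProbSketch {
    public static double conditionalProb(int bigramCount, int totalCount) {
        return (double) bigramCount / totalCount;  // cast avoids integer division
    }

    public static void main(String[] args) {
        // major|denisov occurs 5 times; major|* totals 31
        System.out.printf("%.2f%n", conditionalProb(5, 31));  // prints 0.16
    }
}
```

Note the cast to double: dividing the two int counts directly would truncate 5/31 to 0.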