COMP5338: Advanced Data Models
Week 8: Bigram Conditional Probabilities using Hadoop
Waiho Wong and Will Radford
September 19, 2011

1 Conditional Probabilities
Conditional probabilities [1] form the basis for many interesting techniques in Data Mining and Natural Language Processing, including Naïve Bayes classifiers [2] and Hidden Markov Models [3]. The following is the first paragraph of the Wikipedia article on Conditional Probability:

    In general, the probability of an event depends on the circumstances in which it occurs. A conditional probability is the probability of an event, assuming a particular set of circumstances. More formally, a conditional probability is the probability that event A occurs when the sample space is limited to event B. This is denoted by P(A|B), and is read "the probability of A, given B".
We’re going to extend last week’s tutorial to extract pairs of words – bigrams –
from “War and Peace”, then calculate the statistics that will allow us to answer
questions like:
What is the probability of seeing the word Denisov after the word Major? This is the conditional probability: P(Denisov|Major).
[1] http://en.wikipedia.org/wiki/Conditional_probability
[2] http://en.wikipedia.org/wiki/Naive_Bayes_classifier
[3] http://en.wikipedia.org/wiki/Hidden_Markov_model
2 The Plan
As per last week, we’re given the text of “War and Peace”, and we need to
calculate the following:
1. Bigram counts.
2. Counts of all bigrams that start with Major.
3. Total count of all bigrams that start with Major.
4. Conditional Probabilities.
This will let us calculate the conditional probabilities for each bigram starting
with Major, that is: the number of times that bigram occurs divided by the total
number of times that bigrams starting with Major occur.
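The division described above can be written compactly. For a bigram (w1, w2):

```latex
P(w_2 \mid w_1) = \frac{\mathrm{count}(w_1, w_2)}{\sum_{w} \mathrm{count}(w_1, w)}
```

So if major|denisov occurs 5 times and bigrams starting with major occur 31 times in total, P(denisov|major) = 5/31 ≈ 0.16.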
2.1 Bigram counts
Problem Pairs of adjacent words in text are called bigrams, and we need to count how many times each pair occurs in the text.
For example, the text “the cat sat on the cat” generates the following bigrams:
the|cat 2
cat|sat 1
sat|on  1
on|the  1
Notice that we’ve represented the bigrams using the following format:
<word1><pipe><word2><tab><count>
and that the|cat occurs twice.
Map/Reduce You will need to write a Hadoop Tool class, BigramBuilder, whose mapper iterates through the text one pair of words at a time. You will need to convert words to lower case, remove punctuation, and emit bigram records in the format described above. The reducer simply aggregates these bigram counts. This should only require changing last week's solution to iterate over word pairs rather than single words.
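The mapper and reducer logic can be sketched locally without Hadoop. The following is a minimal, illustrative sketch: the tokenisation (lower-casing, stripping punctuation) and the emit/aggregate steps mirror what the mapper and reducer would do, but the class and method names are our own, not part of the required solution.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Local (non-Hadoop) sketch of the BigramBuilder logic.
public class BigramSketch {
    // "Map" step: normalise the text and emit one record per adjacent pair;
    // "Reduce" step: aggregate counts per bigram key.
    public static Map<String, Integer> countBigrams(String text) {
        String[] words = text.toLowerCase()
                             .replaceAll("[^a-z\\s]", " ")  // strip punctuation
                             .trim()
                             .split("\\s+");
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + 1 < words.length; i++) {
            String key = words[i] + "|" + words[i + 1];    // <word1><pipe><word2>
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The worked example from above.
        System.out.println(countBigrams("the cat sat on the cat"));
        // {the|cat=2, cat|sat=1, sat|on=1, on|the=1}
    }
}
```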
You should be able to run your code using something like:
export WAP=/user/whwong/warnpeace.txt
export HHOME=/user/$USER
export JAR=~/bigrams.jar
hadoop jar $JAR bigram.BigramBuilder job $WAP $HHOME/bis
This assumes that you have built a jar file named bigrams.jar and copied it
to your home directory. We're using Eclipse [4] for this tutorial.
2.2 Counts of all bigrams that start with Major
Once we’ve created our bigrams, we need to filter them so we keep only those
that begin with a particular term.
For example, with the term Major we keep only the following bigrams:
major|and 4
major|angrily 1
major|began 1
major|ceased 1
major|change 1
major|denisov 5
major|ekonomov 1
major|grumbled 1
major|had 1
major|how 1
major|ivan 1
major|of 1
major|quietly 1
major|raised 1
major|skirted 1
major|tell 1
major|they 1
major|was 1
major|were 1
major|who 3
major|with 1
major|zakharchenko 1
[4] http://eclipse.org/
Map/Reduce You will need to write a Hadoop Tool class, BigramFilter, whose mapper scans through each bigram and checks the first word against a term. If it matches case-insensitively, the bigram should be written to the context. The reducer does not need to do anything.
Some details:
• You will need to read the term from the command line and store it in the JobContext's Configuration so the mapper can easily access it.
• Set the combiner and reducer class to Reducer.class to provide the default (identity) behaviour.
You should be able to run the job like this:
hadoop jar $JAR bigram.BigramFilter job Major $HHOME/bis $HHOME/starts
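The filtering step itself is simple, and can be sketched locally without Hadoop. This is an illustrative sketch only (the names are ours): keep each bigram record whose first word matches the term, ignoring case.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Local (non-Hadoop) sketch of the BigramFilter mapper logic.
public class FilterSketch {
    public static Map<String, Integer> filterByFirstWord(
            Map<String, Integer> bigrams, String term) {
        Map<String, Integer> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : bigrams.entrySet()) {
            String firstWord = e.getKey().split("\\|")[0];
            if (firstWord.equalsIgnoreCase(term)) {   // case-insensitive match
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> bigrams = new LinkedHashMap<>();
        bigrams.put("major|denisov", 5);
        bigrams.put("major|who", 3);
        bigrams.put("minor|key", 1);
        // Only the records starting with "major" survive.
        System.out.println(filterByFirstWord(bigrams, "Major"));
    }
}
```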
2.3 Total of all bigrams that start with Major
Now we need to count all those bigrams for the denominator.
Map/Reduce You will need to write a Hadoop Tool class, TotalsBuilder, whose mapper scans through each bigram, creates a placeholder bigram key <word1>|*, and emits it with the bigram's count. The reducer sums these counts, giving the total number of bigram occurrences starting with word1.
You should be able to run this:
hadoop jar $JAR bigram.TotalsBuilder job $HHOME/bis $HHOME/totals
You should see in the output file:
major|* 31
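Again, the logic can be sketched locally without Hadoop. In this illustrative sketch (names are ours), the second word of each key is replaced by "*" and the counts are summed, reducer-style.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Local (non-Hadoop) sketch of the TotalsBuilder logic.
public class TotalsSketch {
    public static Map<String, Integer> totals(Map<String, Integer> bigrams) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : bigrams.entrySet()) {
            String key = e.getKey().split("\\|")[0] + "|*"; // placeholder key
            sums.merge(key, e.getValue(), Integer::sum);    // reducer-style sum
        }
        return sums;
    }

    public static void main(String[] args) {
        Map<String, Integer> bigrams = new LinkedHashMap<>();
        bigrams.put("major|denisov", 5);
        bigrams.put("major|and", 4);
        bigrams.put("major|who", 3);
        // Toy input: totals to major|* -> 12.
        System.out.println(totals(bigrams));
    }
}
```

Run over the full filtered output, the major|* total is 31, as shown in the output file above.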
2.4 Conditional Probabilities
Now you need to put these together.
Map/Reduce You will need to write a Hadoop Tool class, ConditionalProbBuilder, whose mapper reads the starts bigrams and emits them. The reducer reads the totals file and can then calculate each conditional probability. Run it like this (the two lines below are one long command):
hadoop jar $JAR bigram.ConditionalProbBuilder job major \
$HHOME/totals/part-r-00000 $HHOME/starts $HHOME/cprob
You should find that P (Denisov|Major) (major|denisov in the output file) is
about 0.16 (i.e., 5/31).
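The reducer's core calculation is just the division from Section 2. This is an illustrative, non-Hadoop sketch of that step (names are ours): each filtered bigram count is divided by the total for its first word.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Local (non-Hadoop) sketch of the ConditionalProbBuilder reducer logic.
public class CondProbSketch {
    public static Map<String, Double> conditionalProbs(
            Map<String, Integer> startsCounts, int total) {
        Map<String, Double> probs = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : startsCounts.entrySet()) {
            // P(word2 | word1) = count(word1, word2) / total for word1
            probs.put(e.getKey(), e.getValue() / (double) total);
        }
        return probs;
    }

    public static void main(String[] args) {
        Map<String, Integer> starts = new LinkedHashMap<>();
        starts.put("major|denisov", 5);
        starts.put("major|who", 3);
        Map<String, Double> p = conditionalProbs(starts, 31);
        // 5/31, about 0.16, matching the expected result above.
        System.out.println(p.get("major|denisov"));
    }
}
```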