CE314/887 Lab 3
Part of Speech Tagging
Annie Louis
17 November, 2016
1 What this lab is about
This lab focuses on understanding part of speech (POS) tags and exploring NLTK’s methods for automatically assigning POS tags. To get more practice on this topic, go through the problems and exercises
in http://www.nltk.org/book/ch05.html.
2 NLTK environment
The first step is to import NLTK and some corpora, as in the previous labs. To carry out this step, type:
>>> import nltk
>>> nltk.download()
In the ensuing dialog box, choose "book", which downloads all the corpora used in the examples in the NLTK book (http://www.nltk.org/book/). Click Download and, once the download is finished, you can close the dialog box.
You can import the corpora by typing:
>>> from nltk.book import *
3 Part of speech tagsets
POS taggers assign tags to words. The tags are drawn from a tagset. We introduced two tagsets in the
lectures. Do you recall their names?
>>> nltk.help.upenn_tagset(".*")
This will display the Penn Treebank tagset, which contains 45 tags. You will see each tag followed by a list of representative words in that category.
$:
dollar
$ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
’’: closing quotation mark
’ ’’
...
MD: modal auxiliary
can cannot could couldn’t dare may might must need ought shall should...
NN: noun, common, singular or mass
common-carrier cabbage knuckle-duster casino afghan shed thermostat investment ...
You can use regular expressions to display subsets of the tags. For example, replacing .* in the above command with NN.* gives only the noun tags, and VB.* gives all the verb categories. How many noun and verb categories are there? Recall what the distinctions mean. Write down the list of noun categories, with a very short phrase describing each one.
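To see how such a regex filter works in principle, here is a stdlib-only sketch over a toy excerpt of the tagset. The tag descriptions follow the Penn Treebank conventions, but the dictionary and the show function are made up for illustration; the real listing comes from nltk.help.upenn_tagset.

```python
import re

# A toy excerpt of the Penn Treebank tagset (tag -> short description);
# purely illustrative, the full tagset has 45 tags.
TAGSET = {
    "NN": "noun, common, singular or mass",
    "NNS": "noun, common, plural",
    "NNP": "noun, proper, singular",
    "NNPS": "noun, proper, plural",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "MD": "modal auxiliary",
}

def show(pattern):
    """Return, sorted, the tags whose names match the regular expression."""
    return sorted(t for t in TAGSET if re.fullmatch(pattern, t))

print(show("NN.*"))  # the noun tags
print(show("VB.*"))  # the verb tags
```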
Repeat this exercise with the Brown tagset, which has 87 tags.
>>> nltk.help.brown_tagset("NN.*")
You do not need to write down the tags, but take some time to browse through the distinctions that this
tagset makes.
4 Exploring tagged corpora
We spoke in class about how the POS tagging algorithm’s probabilities are estimated from corpora.
These corpora have been hand-annotated by linguists. To get a sense of what these corpora look like,
do this exercise.
>>> from nltk.corpus import treebank
>>> treebank.fileids()
This will display a list of files with extension .mrg, which are the Penn Treebank annotated files. (NLTK contains only about 10% of the Penn Treebank corpus; the full corpus has about 1 million words.)
Now to display the content of a file together with its POS tags, you can use:
>>> treebank.tagged_words("wsj_0001.mrg")[0:]
(u'Pierre', u'NNP'), (u'Vinken', u'NNP'), (u',', u','), (u'61', u'CD'), (u'years', u'NNS'),...
The Penn Treebank is composed of financial-news documents from the Wall Street Journal, published around 1989. Since this file is the first file in the annotations, everyone doing research in NLP is familiar with its content. You can ask any NLP researcher about the age of Pierre Vinken and they will get the joke :)
Look at all the words tagged as nouns. Based on the distinctions you observed in the previous section,
do you see how different types of noun tags have been assigned to the data?
In this case, [0:] queries the entire file wsj_0001.mrg. For longer files, use a shorter range. Try looking at the first 100 words of wsj_0003.mrg.
>>> x100 = treebank.tagged_words("wsj_0003.mrg")[0:100]
>>> for p in x100:
...     print(p)
This will print the first 100 words with their tags in a readable format.
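Once you have the (word, tag) pairs, you can also tally them with standard-library tools. A sketch on a hypothetical handful of pairs in the style of the Treebank output (the list below is made up; the real pairs come from tagged_words):

```python
from collections import Counter

# A made-up handful of (word, tag) pairs in the style of
# treebank.tagged_words(); the real file is much longer.
tagged = [
    ("Pierre", "NNP"), ("Vinken", "NNP"), (",", ","),
    ("61", "CD"), ("years", "NNS"), ("old", "JJ"),
    (",", ","), ("will", "MD"), ("join", "VB"),
    ("the", "DT"), ("board", "NN"),
]

# How often does each tag occur?
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts.most_common(3))

# Words assigned any noun tag (NN, NNS, NNP, ...):
nouns = [w for w, t in tagged if t.startswith("NN")]
print(nouns)
```

The same two lines run unchanged on the full output of treebank.tagged_words(...).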
5 Collecting counts from corpora
In class last week, we discussed the example sentence "The Secretariat is expected to race next week." and hypothesized whether, in general English use (not just in this particular sentence), the word 'race' is more frequently a noun or a verb. Let's get counts from a corpus to check this empirically.
In NLTK, a word-tag pair is represented as a tuple (word, tag). Let’s first create two tuples corresponding
to our two hypotheses.
>>> race1 = nltk.tag.str2tuple('race/NN')
>>> race2 = nltk.tag.str2tuple('race/VB')
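Roughly speaking, str2tuple just splits the string at the last separator. A simplified stdlib sketch (NLTK's own version handles a few more edge cases, such as a missing tag):

```python
def str2tuple(s, sep="/"):
    """Split 'word/TAG' into ('word', 'TAG').

    rsplit on the LAST separator, so words that themselves
    contain '/' (e.g. '3/4') still split correctly."""
    word, tag = s.rsplit(sep, 1)
    return word, tag

race1 = str2tuple("race/NN")
race2 = str2tuple("race/VB")
print(race1, race2)  # ('race', 'NN') ('race', 'VB')
```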
NLTK contains only a sample of the Penn Treebank corpus, so we will use the Brown corpus for our
estimates.
>>> from nltk.corpus import brown
>>> len(brown.tagged_words())
What is the size of the Brown corpus?
Now estimate how often the tuples occur in the Brown corpus.
>>> brown.tagged_words().count(race1)
>>> brown.tagged_words().count(race2)
Which is the more frequent usage in the Brown corpus? Was this what you expected?
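If you want to see what .count() is doing here, this is a stdlib-only sketch on a made-up mini-corpus; the actual counts must of course come from Brown, and the numbers below are illustrative only.

```python
# A toy tagged corpus standing in for brown.tagged_words()
# (hypothetical data, not real Brown counts).
corpus = [("the", "AT"), ("race", "NN"), ("is", "BEZ"),
          ("on", "IN"), ("they", "PPSS"), ("race", "VB"),
          ("a", "AT"), ("race", "NN")]

# list.count compares whole (word, tag) tuples, so it counts
# only the occurrences with the matching tag.
race_nn = corpus.count(("race", "NN"))
race_vb = corpus.count(("race", "VB"))
print(race_nn, race_vb)  # 2 1
```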
The HMM-based POS tagger introduced in the lectures used two sources of information. One was the probability of a word given a tag, p(w_i | t_i). What was the other source of information that helps you tag this sentence?
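As a pointer, both kinds of HMM estimate can be collected in one pass over a tagged corpus. The sketch below does this on a made-up toy corpus (all sentences and resulting numbers are illustrative; real estimates come from the full Brown corpus):

```python
from collections import Counter

# A made-up toy tagged corpus, purely for illustration.
sents = [[("the", "AT"), ("race", "NN"), ("is", "BEZ"), ("on", "IN")],
         [("the", "AT"), ("dog", "NN"), ("ran", "VBD")],
         [("they", "PPSS"), ("race", "VB"), ("home", "NR")]]

emission = Counter()    # (tag, word) counts, for p(word | tag)
transition = Counter()  # (prev_tag, tag) counts, for p(tag | prev_tag)
tag_totals = Counter()

for sent in sents:
    prev = "<s>"  # sentence-start marker
    for word, tag in sent:
        emission[(tag, word)] += 1
        transition[(prev, tag)] += 1
        tag_totals[tag] += 1
        prev = tag

# Relative-frequency estimates of p(race | NN) and p(NN | AT):
p_word = emission[("NN", "race")] / tag_totals["NN"]
at_total = sum(v for (p, _), v in transition.items() if p == "AT")
p_trans = transition[("AT", "NN")] / at_total
print(p_word, p_trans)  # 0.5 1.0
```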
6 POS tagging a new sentence
We have studied the HMM Viterbi algorithm for POS tagging. NLTK includes an implementation of a
HMM tagger as well as a couple of others.
The first tagger we will use is a UnigramTagger. From the name, you may guess that it does something really simple and probably does not use any context. That is correct. The UnigramTagger records, from the training data, the most frequent tag for every word. When it sees that word in a test sentence, it assigns this most frequent tag to it. So, no context.
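The recipe just described can be sketched in a few lines of plain Python. The training data below is made up for illustration; NLTK's UnigramTagger does the same bookkeeping over the real corpus (plus backoff options this sketch omits):

```python
from collections import Counter, defaultdict

# Toy training data in Brown style (made up, not the Brown corpus).
train = [[("the", "AT"), ("race", "NN"), ("is", "BEZ"), ("on", "IN")],
         [("they", "PPSS"), ("race", "VB"), ("home", "NR")],
         [("a", "AT"), ("race", "NN"), ("ended", "VBD")]]

# Count every tag seen for every word.
counts = defaultdict(Counter)
for sent in train:
    for word, tag in sent:
        counts[word][tag] += 1

# Keep only each word's single most frequent tag.
most_frequent = {w: c.most_common(1)[0][0] for w, c in counts.items()}

def unigram_tag(tokens):
    """Tag each token with its most frequent training tag (None if unseen)."""
    return [(t, most_frequent.get(t)) for t in tokens]

print(unigram_tag(["the", "race", "is", "new"]))
```

Note that "race" gets NN here regardless of context, because NN outnumbered VB in the toy training data, and that an unseen word gets None.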
First let us train a unigram tagger using 5000 sentences of the Brown corpus as training data. The
categories=news argument selects only the news documents from the diverse collection within Brown.
>>> from nltk.corpus import brown
>>> unigram_tagger = nltk.tag.UnigramTagger(brown.tagged_sents(categories='news')[:5000])
Now unigram_tagger is an object holding the trained model. Apply this model to your test sentence, "The Secretariat is expected to race tomorrow." Before we even do that, what tag do you think will be assigned to the word "race" under this model?
>>> from nltk import word_tokenize
>>> S = "The Secretariat is expected to race tomorrow."
>>> S_tok = word_tokenize(S)
>>> unigram_tagger.tag(S_tok)
Did you guess correctly?
Now we will use the HMM tagger which takes context into account.
>>> hmm_tagger = nltk.hmm.HiddenMarkovModelTrainer().train_supervised(brown.tagged_sents(categories="news")[:5000])
>>> hmm_tagger.tag(S_tok)
What tag is assigned by this model to the word “race”?
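To see concretely how context enters, here is a compact, self-contained Viterbi sketch over hand-picked toy probabilities. The tags, transition values, and emission values are all made up for illustration (NLTK's trainer estimates them from the corpus), but the mechanics are the standard log-space Viterbi recursion from the lectures:

```python
import math

# Hand-picked toy probabilities (illustrative, NOT corpus estimates).
TAGS = ["DT", "NN", "TO", "VB"]
trans = {  # p(tag | prev_tag); "<s>" is the sentence-start state
    ("<s>", "DT"): 0.8, ("<s>", "TO"): 0.1, ("<s>", "NN"): 0.05, ("<s>", "VB"): 0.05,
    ("DT", "NN"): 0.9, ("DT", "VB"): 0.05, ("DT", "DT"): 0.025, ("DT", "TO"): 0.025,
    ("NN", "TO"): 0.4, ("NN", "VB"): 0.3, ("NN", "NN"): 0.2, ("NN", "DT"): 0.1,
    ("TO", "VB"): 0.8, ("TO", "NN"): 0.1, ("TO", "DT"): 0.05, ("TO", "TO"): 0.05,
    ("VB", "DT"): 0.5, ("VB", "NN"): 0.2, ("VB", "TO"): 0.2, ("VB", "VB"): 0.1,
}
emit = {  # p(word | tag); note "race" is slightly more often NN than VB
    ("the", "DT"): 0.7, ("to", "TO"): 0.9,
    ("race", "NN"): 0.006, ("race", "VB"): 0.004,
}

def viterbi(words):
    """Most probable tag sequence under the toy HMM (log-space Viterbi)."""
    # Initialise with the start transitions and first emission.
    v = {t: math.log(trans[("<s>", t)]) + math.log(emit.get((words[0], t), 1e-10))
         for t in TAGS}
    back = []
    for w in words[1:]:
        nv, bp = {}, {}
        for t in TAGS:
            # Best previous tag for reaching t at this position.
            best = max(TAGS, key=lambda p: v[p] + math.log(trans[(p, t)]))
            nv[t] = v[best] + math.log(trans[(best, t)]) + math.log(emit.get((w, t), 1e-10))
            bp[t] = best
        v, back = nv, back + [bp]
    # Follow the backpointers from the best final tag.
    path = [max(TAGS, key=lambda t: v[t])]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    return list(reversed(path))

print(viterbi(["the", "race"]))  # -> ['DT', 'NN']
print(viterbi(["to", "race"]))   # -> ['TO', 'VB']
```

Even though the emission probabilities alone favour NN for "race", the strong TO-to-VB transition flips the tag after "to"; that interaction between the two probability sources is exactly what the unigram tagger lacks.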
Try both the unigram and HMM models to tag another ambiguous sentence of your own.
Now try this exercise from the NLTK book. It is possible to find ambiguous news headlines such as the
following:
“British Left Waffles on Falkland Islands”
“Juvenile Court to Try Shooting Defendant”
Do you notice the ambiguity? Write down the two interpretations.
Manually (by hand) tag these headlines to see whether knowledge of the part-of-speech tags removes the ambiguity. Approximate tags are fine; you need not use the exact tags from the tagset.
Run the HMM tagger on these sentences. Did it give you the right tags?