Word Classes: Tag Sets

Word classes and part of
speech tagging
Outline
Why part of speech tagging?
Word classes
Tag sets and problem definition
Automatic approaches 1: rule-based tagging
Automatic approaches 2: stochastic tagging
Automatic approaches 3: transformation-based
tagging
Other issues: tagging unknown words, evaluation
Slide 1
Definition
“The process of assigning a part-of-speech or
other lexical class marker to each word in a
corpus”
WORDS
the
girl
kissed
the
baby
on
the
cheek
TAGS
N
V
P
DET
Slide 2
An Example
WORD
the
girl
kissed
the
baby
on
the
cheek
LEMMA
the
girl
kiss
the
baby
on
the
cheek
TAG
+DET
+NOUN
+VPAST
+DET
+NOUN
+PREP
+DET
+NOUN
Slide 3
Motivation
Speech synthesis — pronunciation
Speech recognition — class-based N-grams
Information retrieval — stemming, selection highcontent words
Word-sense disambiguation
Corpus analysis of language & lexicography
Slide 4
Outline
Why part of speech tagging?
Word classes
Tag sets and problem definition
Automatic approaches 1: rule-based tagging
Automatic approaches 2: stochastic tagging
Automatic approaches 3: transformation-based
tagging
Other issues: tagging unknown words, evaluation
Slide 5
Word Classes
Basic word classes: Noun, Verb, Adjective, Adverb,
Preposition, …
Open vs. Closed classes
Open:
Nouns, Verbs, Adjectives, Adverbs.
Why “open”? (that readily accept new members)
Closed:
Definition: The category of function words--that is, parts of speech
(or word classes) that do not readily accept new members.
determiners: a, an, the
pronouns: she, he, I
prepositions: on, under, over, near, by, …
Slide 6
Open Class Words
Every known human language has nouns and verbs
Nouns: people, places, things
Classes of nouns
proper vs. common
count vs. mass
Verbs: actions and processes
Adjectives: properties, qualities
Adverbs: hodgepodge!
Unfortunately, John walked home extremely slowly yesterday
Numerals: one, two, three, third, …
Slide 7
Closed Class Words
Differ more from language to language than open class
words
Examples:
prepositions: on, under, over, …
particles: up, down, on, off, …
determiners: a, an, the, …
pronouns: she, who, I, ..
conjunctions: and, but, or, …
auxiliary verbs: can, may should, …
Slide 8
Prepositions from CELEX
Slide 9
English Single-Word Particles
Slide 10
Pronouns in CELEX
Slide 11
Conjunctions
Slide 12
Auxiliaries
Slide 13
Outline
Why part of speech tagging?
Word classes
Tag sets and problem definition
Automatic approaches 1: rule-based tagging
Automatic approaches 2: stochastic tagging
Automatic approaches 3: transformation-based tagging
Other issues: tagging unknown words, evaluation
Slide 14
Word Classes: Tag Sets
• Vary in number of tags: a dozen to over 200
• Size of tag sets depends on language, objectives and
purpose
– Some tagging approaches (e.g., constraint grammar based) make
fewer distinctions e.g., conflating prepositions, conjunctions,
particles
– Simple morphology = more ambiguity = fewer tags
Slide 15
Word Classes: Tag set example
PRP
PRP$
Slide 16
Example of Penn Treebank Tagging of
Brown Corpus Sentence
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT
number/NN of/IN other/JJ topics/NNS ./.
VB DT NN .
Book that flight .
VBZ DT NN VB NN ?
Does that flight serve dinner ?
See http://www.infogistics.com/posdemo.htm
Slide 17
The Problem
Words often have more than one word class: this
This is a nice day = PRP
This day is nice = DT
You can go this far = RB
Slide 18
Word Class Ambiguity
(in the Brown Corpus)
Unambiguous (1 tag): 35,340
Ambiguous (2-7 tags): 4,100
2 tags
3,760
3 tags
264
4 tags
61
5 tags
12
6 tags
2
7 tags
1
Slide 19
Part-of-Speech Tagging
• Rule-Based Tagger: ENGTWOL (ENGlish TWO Level
analysis)
• Stochastic Tagger: HMM-based
• Transformation-Based Tagger (Brill)
Slide 20
Outline
Why part of speech tagging?
Word classes
Tag sets and problem definition
Automatic approaches 1: rule-based tagging
Automatic approaches 2: stochastic tagging
Automatic approaches 3: transformation-based tagging
Other issues: tagging unknown words, evaluation
Slide 21
Rule-Based Tagging
• Basic Idea:
– Assign all possible tags to words
– Remove tags according to set of rules of type: if word+1 is an
adj, adv, or quantifier and the following is a sentence boundary
and word-1 is not a verb like “consider” then eliminate non-adv
else eliminate adv.
– Typically more than 1000 hand-written rules, but may be
machine-learned.
Slide 22
Sample ENGTWOL Lexicon
Demo: http://www2.lingsoft.fi/cgi-bin/engtwol
Slide 23
Outline
Why part of speech tagging?
Word classes
Tag sets and problem definition
Automatic approaches 1: rule-based tagging
Automatic approaches 2: stochastic tagging
Automatic approaches 3: transformation-based tagging
Other issues: tagging unknown words, evaluation
Slide 24
Stochastic Tagging
• Based on probability of certain tag occurring given
various possibilities
• Requires a training corpus
• No probabilities for words not in corpus.
• Training corpus may be different from test corpus.
Slide 25
Stochastic Tagging (cont.)
•Simple Method: Choose most frequent tag in training text
for each word!
– Result: 90% accuracy
– Baseline
– Others will do better
Slide 26
Outline
Why part of speech tagging?
Word classes
Tag sets and problem definition
Automatic approaches 1: rule-based tagging
Automatic approaches 2: stochastic tagging
Automatic approaches 3: transformation-based tagging
Other issues: tagging unknown words, evaluation
Slide 27
Transformation-Based Tagging
(Brill Tagging)
• Combination of Rule-based and stochastic tagging methodologies
– Like rule-based because rules are used to specify tags in a certain
environment
– Like stochastic approach because machine learning is used—with
tagged corpus as input
• Input:
– tagged corpus
– dictionary (with most frequent tags)
• Usually constructed from the tagged corpus
Slide 28
Transformation-Based Tagging (cont.)
• Basic Idea:
– Set the most probable tag for each word as a start value
– Change tags according to rules of type “if word-1 is a determiner
and word is a verb then change the tag to noun” in a specific order
• Training is done on tagged corpus:
–
–
–
–
Write a set of rule templates
Among the set of rules, find one with highest score
Continue from 2 until lowest score threshold is passed
Keep the ordered set of rules
• Rules make errors that are corrected by later rules
Slide 29
Outline
Why part of speech tagging?
Word classes
Tag sets and problem definition
Automatic approaches 1: rule-based tagging
Automatic approaches 2: stochastic tagging
Automatic approaches 3: transformation-based tagging
Other issues: tagging unknown words, evaluation
Slide 30
Tagging Unknown Words
• New words added to (newspaper) language 20+ per
month
• Plus many proper names …
• Increases error rates by 1-2%
• Method 1: assume they are nouns
• Method 2: assume the unknown words have a
probability distribution similar to words only
occurring once in the training set.
• Method 3: Use morphological information, e.g.,
words ending with –ed tend to be tagged VBN.
Slide 31
Evaluation
• The result is compared with a manually coded “Gold
Standard”
– Typically accuracy reaches 96-97%
– This may be compared with result for a baseline tagger (one that
uses no context).
• Important: 100% is impossible even for human
annotators.
• Factors that affects the performance
– The amount of training data available
– The tag set
– The difference between training corpus and test corpus
– Dictionary
– Unknown words
Slide 32