Word classes and part of speech tagging Outline Why part of speech tagging? Word classes Tag sets and problem definition Automatic approaches 1: rule-based tagging Automatic approaches 2: stochastic tagging Automatic approaches 3: transformation-based tagging Other issues: tagging unknown words, evaluation Slide 1 Definition “The process of assigning a part-of-speech or other lexical class marker to each word in a corpus” WORDS the girl kissed the baby on the cheek TAGS N V P DET Slide 2 An Example WORD the girl kissed the baby on the cheek LEMMA the girl kiss the baby on the cheek TAG +DET +NOUN +VPAST +DET +NOUN +PREP +DET +NOUN Slide 3 Motivation Speech synthesis — pronunciation Speech recognition — class-based N-grams Information retrieval — stemming, selection highcontent words Word-sense disambiguation Corpus analysis of language & lexicography Slide 4 Outline Why part of speech tagging? Word classes Tag sets and problem definition Automatic approaches 1: rule-based tagging Automatic approaches 2: stochastic tagging Automatic approaches 3: transformation-based tagging Other issues: tagging unknown words, evaluation Slide 5 Word Classes Basic word classes: Noun, Verb, Adjective, Adverb, Preposition, … Open vs. Closed classes Open: Nouns, Verbs, Adjectives, Adverbs. Why “open”? (that readily accept new members) Closed: Definition: The category of function words--that is, parts of speech (or word classes) that do not readily accept new members. determiners: a, an, the pronouns: she, he, I prepositions: on, under, over, near, by, … Slide 6 Open Class Words Every known human language has nouns and verbs Nouns: people, places, things Classes of nouns proper vs. common count vs. mass Verbs: actions and processes Adjectives: properties, qualities Adverbs: hodgepodge! Unfortunately, John walked home extremely slowly yesterday Numerals: one, two, three, third, … Slide 7 Closed Class Words Differ more from language to language than open class words Examples: prepositions: on, under, over, … particles: up, down, on, off, … determiners: a, an, the, … pronouns: she, who, I, .. conjunctions: and, but, or, … auxiliary verbs: can, may should, … Slide 8 Prepositions from CELEX Slide 9 English Single-Word Particles Slide 10 Pronouns in CELEX Slide 11 Conjunctions Slide 12 Auxiliaries Slide 13 Outline Why part of speech tagging? Word classes Tag sets and problem definition Automatic approaches 1: rule-based tagging Automatic approaches 2: stochastic tagging Automatic approaches 3: transformation-based tagging Other issues: tagging unknown words, evaluation Slide 14 Word Classes: Tag Sets • Vary in number of tags: a dozen to over 200 • Size of tag sets depends on language, objectives and purpose – Some tagging approaches (e.g., constraint grammar based) make fewer distinctions e.g., conflating prepositions, conjunctions, particles – Simple morphology = more ambiguity = fewer tags Slide 15 Word Classes: Tag set example PRP PRP$ Slide 16 Example of Penn Treebank Tagging of Brown Corpus Sentence The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./. VB DT NN . Book that flight . VBZ DT NN VB NN ? Does that flight serve dinner ? See http://www.infogistics.com/posdemo.htm Slide 17 The Problem Words often have more than one word class: this This is a nice day = PRP This day is nice = DT You can go this far = RB Slide 18 Word Class Ambiguity (in the Brown Corpus) Unambiguous (1 tag): 35,340 Ambiguous (2-7 tags): 4,100 2 tags 3,760 3 tags 264 4 tags 61 5 tags 12 6 tags 2 7 tags 1 Slide 19 Part-of-Speech Tagging • Rule-Based Tagger: ENGTWOL (ENGlish TWO Level analysis) • Stochastic Tagger: HMM-based • Transformation-Based Tagger (Brill) Slide 20 Outline Why part of speech tagging? Word classes Tag sets and problem definition Automatic approaches 1: rule-based tagging Automatic approaches 2: stochastic tagging Automatic approaches 3: transformation-based tagging Other issues: tagging unknown words, evaluation Slide 21 Rule-Based Tagging • Basic Idea: – Assign all possible tags to words – Remove tags according to set of rules of type: if word+1 is an adj, adv, or quantifier and the following is a sentence boundary and word-1 is not a verb like “consider” then eliminate non-adv else eliminate adv. – Typically more than 1000 hand-written rules, but may be machine-learned. Slide 22 Sample ENGTWOL Lexicon Demo: http://www2.lingsoft.fi/cgi-bin/engtwol Slide 23 Outline Why part of speech tagging? Word classes Tag sets and problem definition Automatic approaches 1: rule-based tagging Automatic approaches 2: stochastic tagging Automatic approaches 3: transformation-based tagging Other issues: tagging unknown words, evaluation Slide 24 Stochastic Tagging • Based on probability of certain tag occurring given various possibilities • Requires a training corpus • No probabilities for words not in corpus. • Training corpus may be different from test corpus. Slide 25 Stochastic Tagging (cont.) •Simple Method: Choose most frequent tag in training text for each word! – Result: 90% accuracy – Baseline – Others will do better Slide 26 Outline Why part of speech tagging? Word classes Tag sets and problem definition Automatic approaches 1: rule-based tagging Automatic approaches 2: stochastic tagging Automatic approaches 3: transformation-based tagging Other issues: tagging unknown words, evaluation Slide 27 Transformation-Based Tagging (Brill Tagging) • Combination of Rule-based and stochastic tagging methodologies – Like rule-based because rules are used to specify tags in a certain environment – Like stochastic approach because machine learning is used—with tagged corpus as input • Input: – tagged corpus – dictionary (with most frequent tags) • Usually constructed from the tagged corpus Slide 28 Transformation-Based Tagging (cont.) • Basic Idea: – Set the most probable tag for each word as a start value – Change tags according to rules of type “if word-1 is a determiner and word is a verb then change the tag to noun” in a specific order • Training is done on tagged corpus: – – – – Write a set of rule templates Among the set of rules, find one with highest score Continue from 2 until lowest score threshold is passed Keep the ordered set of rules • Rules make errors that are corrected by later rules Slide 29 Outline Why part of speech tagging? Word classes Tag sets and problem definition Automatic approaches 1: rule-based tagging Automatic approaches 2: stochastic tagging Automatic approaches 3: transformation-based tagging Other issues: tagging unknown words, evaluation Slide 30 Tagging Unknown Words • New words added to (newspaper) language 20+ per month • Plus many proper names … • Increases error rates by 1-2% • Method 1: assume they are nouns • Method 2: assume the unknown words have a probability distribution similar to words only occurring once in the training set. • Method 3: Use morphological information, e.g., words ending with –ed tend to be tagged VBN. Slide 31 Evaluation • The result is compared with a manually coded “Gold Standard” – Typically accuracy reaches 96-97% – This may be compared with result for a baseline tagger (one that uses no context). • Important: 100% is impossible even for human annotators. • Factors that affects the performance – The amount of training data available – The tag set – The difference between training corpus and test corpus – Dictionary – Unknown words Slide 32
© Copyright 2026 Paperzz