
6.863J Natural Language Processing
Lecture 5: Finite state machines &
part-of-speech tagging
Instructor: Robert C. Berwick
The Menu Bar
• Administrivia:
• Schedule alert: Lab 1 due next Weds (Feb 24)
• Lab 2, handed out Feb 24 (look for it on the
web as laboratory2.html); due the Weds
after this – March 5
• Agenda:
• Part of speech ‘tagging’ (with sneaky
intro to probability theory that we need)
• Ch. 6 & 8 in Jurafsky; see ch. 5 on
Hidden Markov models
Two finite-state approaches to
tagging
1. Noisy Channel Model (statistical)
2. Deterministic baseline tagger composed
with a cascade of fixup transducers
• PS: how do we evaluate taggers? (and
such statistical models generally?)
• 1, 2, & evaluation = Laboratory 2
The real plan…
p(X) * p(Y | X) * p(y | Y) = p(X, y)
Find x that maximizes this quantity
Cartoon version
[Diagram: three machines composed]
p(X): tag-bigram automaton (Start, Det, Verb, Prep, Adj, Noun, Stop; arcs
such as Det 0.8, Adj 0.3, Adj 0.4, Noun 0.5, Noun 0.7, ε 0.1, ε 0.2)
* p(Y | X): tag-to-word transducer (arcs such as Noun:Bill/0.002,
Noun:autos/0.001, Noun:cortege/0.000001, Det:the/0.4, Det:a/0.6,
Adj:cool/0.003, Adj:directed/0.0005, Adj:cortege/0.000001, …)
* p(y | Y): the observed words “the cool directed autos”
= p(X, y)
transducer: scores candidate tag seqs
on their joint probability with obs words;
we should pick best path
What’s the big picture? Why NLP?
Computers would be a lot more useful if
they could handle our email, do our
library research, talk to us …
But they are fazed by natural human
language.
How can we tell computers about language?
(Or help them learn it as kids do?)
What is NLP for, anyway?
• If we could do it perfectly, we could pass
the Turing test (more on this below)
• Two basic ‘engineering’ tasks – and a third,
scientific one
• Text-understanding
• Information extraction
• ?What about how people ‘process’
language??? [psycholinguistics]
Some applications…
• Spelling correction, grammar checking …
• Better search engines
• Information extraction
• Language identification (English vs. Polish)
• Psychotherapy; Harlequin romances; etc.
• And: plagiarism detection - www.turnitin.com
• For code: www.cs.berkeley.edu/~aiken/moss.html
• New interfaces:
• Speech recognition (and text-to-speech)
• Dialogue systems (USS Enterprise onboard computer)
• Machine translation (the Babel fish)
Text understanding is very hard
John stopped at the donut store on his way
home from work. He thought a coffee
was good every few hours. But it turned
out to be too expensive there.
• NL relies on ambiguity! (Why?)
• “We haven’t had a sale in 40 years”
What’s hard about the story?
John stopped at the donut store on his way
home from work. He thought a coffee
was good every few hours. But it turned
out to be too expensive there.
To get a donut (spare tire) for his car?
What’s hard?
John stopped at the donut store on his way home
from work. He thought a coffee was good every
few hours. But it turned out to be too
expensive there.
store where donuts shop? or is run by donuts?
or looks like a big donut? or made of donut?
or has an emptiness at its core?
(Think of five other issues…there are lots)
What’s hard about this story?
John stopped at the donut store on his way
home from work. He thought a coffee
was good every few hours. But it turned
out to be too expensive there.
Describes where the store is? Or when he
stopped?
What’s hard about this story?
John stopped at the donut store on his way
home from work. He thought a coffee
was good every few hours. But it turned
out to be too expensive there.
Well, actually, he stopped there from
hunger and exhaustion, not just from
work.
What’s hard about this story?
John stopped at the donut store on his way
home from work. He thought a coffee
was good every few hours. But it turned
out to be too expensive there.
At that moment, or habitually?
(Similarly: Mozart composed music.)
What’s hard about this story?
John stopped at the donut store on his way
home from work. He thought a coffee
was good every few hours. But it turned
out to be too expensive there.
That’s how often he thought it?
What’s hard about this story?
John stopped at the donut store on his way
home from work. He thought a coffee
was good every few hours. But it turned
out to be too expensive there.
But actually, a coffee only stays good for
about 10 minutes before it gets cold.
What’s hard about this story?
John stopped at the donut store on his
way home from work. He thought a
coffee was good every few hours. But
it turned out to be too expensive there.
Similarly: In America a woman has a baby
every 15 minutes. Our job is to find
that woman and stop her.
What’s hard about this story?
John stopped at the donut store on his way
home from work. He thought a coffee
was good every few hours. But it turned
out to be too expensive there.
the particular coffee that was good every
few hours? the donut store? the
situation?
What’s hard about this story?
John stopped at the donut store on his
way home from work. He thought a
coffee was good every few hours. But
it turned out to be too expensive there.
too expensive for what? what are we
supposed to conclude about what John
did?
how do we connect “it” to “expensive”?
Example tagsets
• 87 tags - Brown corpus
• Three most commonly used:
1. Small: 45 tags - Penn Treebank
2. Medium: 61 tags - British National Corpus
3. Large: 146 tags
Big question: have we thrown out the right
info? Impoverished? How?
Current performance
Input: the lead paint is unsafe
Output: the/Det lead/N paint/N is/V unsafe/Adj
• How many tags are correct?
• About 97% currently
• But the baseline is already 90%
• The baseline is the ‘Homer Simpson’ algorithm:
• Tag every word with its most frequent tag
• Tag unknown words as nouns
• How well do people do?
Knowledge-based (rule-based)
vs. Statistically-based systems
A picture: the statistical, noisy
channel view
[Diagram: observed speech x is decoded into text y; an acoustic model
P(x|y) and a language model P(y) connect them]
x (speech): “Wreck a nice beach? Reckon eyes peach? Recognize speech?”
y (text)
Language models, probability & info
• Given a string w, a language model gives us
the probability of the string P(w), e.g.,
• P(the big dog) > P(dog big the) > P(dgo gib eth)
• Easy for humans; difficult for machines
• Let P(w) be called a language model
Language models – statistical
view
• Application to speech recognition (and parsing,
generally)
• x= Input (speech)
• y= output (text)
• We want to find the y that maximizes P(y|x). Problem: we don’t
know this directly!
• Solution: We have an estimate of P(y) [the language
model] and P(x|y) [the prob. of some sound given
text = an acoustic model]
• From Bayes’ law:
argmax_y P(y|x) = argmax_y P(x|y) • P(y) / P(x) = argmax_y P(x|y) • P(y)
(acoustic model × language model; P(x) is held fixed, so the max is the
same with or without it)
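To make the Bayes-law step concrete, here is a minimal sketch (mine, not from the lecture) of noisy-channel decoding over a handful of candidate transcriptions; the candidate texts and probability tables are made up purely for illustration.

# Minimal sketch: noisy-channel decoding with toy, made-up probability tables.
# We pick the text y maximizing P(x|y) * P(y); dividing by P(x) is unnecessary
# because P(x) is the same for every candidate y.

language_model = {                    # hypothetical language model P(y)
    "recognize speech": 0.60,
    "wreck a nice beach": 0.39,
    "reckon eyes peach": 0.01,
}
acoustic_model = {                    # hypothetical acoustic model P(x | y)
    "recognize speech": 0.30,
    "wreck a nice beach": 0.35,
    "reckon eyes peach": 0.40,
}

def decode(candidates):
    """Return argmax_y P(x|y) * P(y) over the candidate texts."""
    return max(candidates, key=lambda y: acoustic_model[y] * language_model[y])

print(decode(language_model))         # -> 'recognize speech'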
Can be generalized to Turing
test…!
If we could solve this (the?) NLP problem, we
could solve the Turing test…. (but not the Twain
test)! Language modeling solves the Turing
test!
[Diagram: an Interrogator asks question q; the human answers a, the
computer answers b]
P(a|q) = pr human answers q with a = P(q,a)/P(q)
P(b|q) = pr computer answers q with b = P(q,b)/P(q)
If P(q,a) = P(q,b) then the computer successfully imitates the human
Some applications…
• Spelling correction, grammar checking …
• Better search engines
• Information extraction
• Language identification (English vs. Polish)
• Psychotherapy; Harlequin romances; etc.
• And: plagiarism detection - www.turnitin.com
• For code: www.cs.berkeley.edu/~aiken/moss.html
• New interfaces:
• Speech recognition (and text-to-speech)
• Dialogue systems (USS Enterprise onboard computer)
• Machine translation (the Babel fish)
What’s this stuff for anyway?
Information extraction
• Information extraction involves processing
text to identify selected information:
• particular types of names
• specified classes of events.
• For names, it is sufficient to find the name in
the text and identify its type
• for events, we must extract the critical
information about each event (the agent,
objects, date, location, etc.) and place this
information in a set of templates (data base)
Example – “Message
understanding” (MUC)
ST1-MUC3-0011
SANTIAGO, 18 MAY 90 (RADIO COOPERATIVA NETWORK) -- [REPORT] [JUAN
ARAYA]
[TEXT]
EDMUNDO VARGAS CARRENO, CHILEAN FOREIGN MINISTRY UNDER
SECRETARY, HAS STATED THAT THE BRYANT TREATY WITH THE UNITED STATES WILL
BE APPLIED IN THE LETELIER CASE ONLY TO COMPENSATE THE RELATIVES OF THE
FORMER CHILEAN FOREIGN MINISTER MURDERED IN WASHINGTON AND THE
RELATIVES OF HIS U.S. SECRETARY, RONNIE MOFFIT. THE CHILEAN FOREIGN UNDER
SECRETARY MADE THIS STATEMENT IN REPLY TO U.S. NEWSPAPER REPORTS STATING
THAT THE TREATY WOULD BE PARTIALLY RESPECTED.
FOLLOWING ARE VARGAS CARRENO'S STATEMENTS AT A NEWS CONFERENCE HE
HELD IN BUENOS AIRES BEFORE CONCLUDING HIS OFFICIAL VISIT TO
ARGENTINA:
Extracted info – names, events
Template slots:
0. MESSAGE: ID    1. MESSAGE: TEMPLATE    2. INCIDENT: DATE
3. INCIDENT: LOCATION    4. INCIDENT: TYPE    5. INCIDENT: STAGE OF EXECUTION
6. INCIDENT: INSTRUMENT ID    7. INCIDENT: INSTRUMENT TYPE
8. PERP: INCIDENT CATEGORY    9. PERP: INDIVIDUAL ID
10. PERP: ORGANIZATION ID    11. PERP: ORGANIZATION CONFIDENCE
12. PHYS TGT: ID    13. PHYS TGT: TYPE    14. PHYS TGT: NUMBER
15. PHYS TGT: FOREIGN NATION    16. PHYS TGT: EFFECT OF INCIDENT
17. PHYS TGT: TOTAL NUMBER    18. HUM TGT: NAME    19. HUM TGT: DESCRIPTION
20. HUM TGT: TYPE    21.
Filled values (excerpt):
TST1-MUC3-0011; 1; 18 MAY 90; UNITED STATES: WASHINGTON D.C. (CITY);
ATTACK; ACCOMPLISHED; STATE-SPONSORED VIOLENCE; REPORTED AS FACT;
"CHILEAN GOVERNMENT"; "CHILEAN GOVERNMENT";
"ORLANDO LETELIER"; "RONNIE MOFFIT";
"FORMER CHILEAN FOREIGN MINISTER": "ORLANDO LETELIER";
"U.S. SECRETARY" / "ASSISTANT" / "SECRETARY": "RONNIE MOFFIT";
GOVERNMENT OFFICIAL: "ORLANDO LETELIER"; CIVILIAN: "RONNIE MOFFIT"
Text understanding vs. Info
extraction
• For information extraction:
• generally only a fraction of the text is relevant; for
example, in the case of the MUC-4 terrorist reports,
probably only about 10% of the text was relevant;
• information is mapped into a predefined, relatively
simple, rigid target representation; this condition
holds whenever entry of information into a database
is the task;
• the subtle nuances of meaning and the writer's goals
in writing the text are of at best secondary interest.
Properties of message understanding
task
• Simple, fixed definition of the information
to be sought.
• Much, or even most, of the text is
irrelevant to the information extraction
goal.
• Large volumes of text need to be
searched.
Text understanding
• the aim is to make sense of the entire
text;
• the target representation must
accommodate the full complexities of
language;
• one wants to recognize the nuances of
meaning and the writer's goals
MUC 5 - business
Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese
trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Taiwan Co.,
capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000
iron and "metal wood" clubs a month.
TIE-UP-1:
  Relationship: TIE-UP
  Entities: "Bridgestone Sports Co.", "a local concern", "a Japanese trading house"
  Joint Venture Company: "Bridgestone Sports Taiwan Co."
  Activity: ACTIVITY-1
  Amount: NT$20000000
ACTIVITY-1:
  Activity: PRODUCTION
  Company: "Bridgestone Sports Taiwan Co."
  Product: "iron and `metal wood' clubs"
  Start Date: DURING: January 1990
Applications
• IR tasks:
• Routing queries to prespecified topics
• Text classification/routing
• Summarization
• Highlighting, clipping
• NL generation from formal output
representation
• Automatic construction of knowledge
bases from text
Message understanding tasks
Business News: Joint Ventures (English and Japanese),
Labor Negotiations, Management Succession
Geopolitical News: Terrorist Incidents
Military Messages: DARPA Message Handler
Legal English: Document Analysis Tool,
Integration with OCR
Name recognition: an fsa task
Shallow parsing or ‘chunking’ –
Lab 3
Recognizing domain patterns
What about part of speech tagging
here?
• Advantages
• Ambiguity can be potentially reduced (but we shall
see in our laboratory if this is true)
• Avoid errors due to incorrect categorization of rare
senses e.g., “has been” as noun
• Disadvantages
• Errors taggers make are often those you’d most want
to eliminate
• High performance requires training on similar genre
• Training takes time
Proper names…
• Proper names are particularly important
for extraction systems
• Because typically one wants to extract
events, properties, and relations about
some particular object, and that object is
usually identified by its name
…A challenge…
• Problems though…
• proper names are huge classes and it is
difficult, if not impossible to enumerate their
members
• Hundreds of thousands of names of locations
around the world
• Many of these names are in languages other
than the one in which the extraction system
is designed
How are names extracted?
• (Hidden) Markov Model
• Hypothesized that there is an underlying finite
state machine (not directly observable, hence
hidden) that changes state with each input
element
• probability of a recognized constituent is
conditioned not only on the words seen, but the
state that the machine is in at that moment
• “John” followed by “Smith” is likely to be a
person, while “John” followed by “Deere” is
likely to be a company (a manufacturer of heavy
farming and construction equipment).
HMM statistical name tagger
[Diagram: states Start, Person name, Company name, Not-a-name, End]
HMM
• Whether a word is part of a name or not is a random
event with an estimable probability
• The probability of name versus non-name readings can
be estimated from a training corpus in which the names
have been annotated
• In a Hidden Markov model, it is hypothesized that there
is an underlying finite state machine (not directly
observable, hence hidden) that changes state with each
input element
• The probability of a recognized constituent is
conditioned not only on the words seen, but the state
that the machine is in at that moment
• “John” followed by “Smith” is likely to be a person,
while “John” followed by “Deere” is likely to be a
company (a manufacturer of heavy farming and
construction equipment).
HMM construction
• Hidden state transition model governs word
sequences
• Transitions probabilistic
• Estimate transition probabilities from an
annotated corpus
• P(s_j | s_j−1, w_j)
• Based just on the prior state and the current word seen
(hence the Markovian assumption)
• At runtime, find the maximum likelihood path
through the network, using a dynamic programming
algorithm (Viterbi)
How much data is needed?
• System performance bears a roughly loglinear relationship to the training data
quantity, at least up to about 1.2 million
words
• Obtaining 1.2 million words of training
data requires transcribing and annotating
approximately 200 hours of broadcast
news programming, or if annotating text,
this would amount to approximately 1,777
average-length Wall Street Journal articles
If you think name recog is not
relevant, then…
• Microsoft announced plans to include
“Smart Tags” in its browser and other
products. This is a feature that
automatically inserts hyperlinks from
concepts in text to related web pages
chosen by Microsoft.
• The best way to make automatic
hyperlinking unbiased is to base it on an
unbiased source of web pages, such as
Google.
How to do this?
• The main technical problem is to find pieces of
text that are good concept anchors… like
names!
• So: Given a text, find the starting and ending
points of all the names. Depending on our
specific goals, we can include the names of
people, places, organizations, artifacts (such as
product names), etc.
• Once we have the anchor text, we can send it
to a search engine, retrieve a relevant URL (or
set of URLs, once browsers can handle multiway
hyperlinks), and insert them into the original
text on the fly.
OK, back to the tagging task
• This will illustrate all the issues with name
recognition too
Brown/UPenn corpus tags
[Figure: tagset table from the Jurafsky & Martin text, p. 297, Fig. 8.6 –
1M words, 60K tag counts]
Ok, what should we look at?
correct tags:
PN Verb Det Noun Prep Noun Prep Det Noun
Bill directed a cortege of autos through the dunes
some possible tags for each word (maybe more):
PN Adj Det Noun Prep Noun Prep Det Noun
Verb Verb Noun Verb
Adj
Prep
…?
Each unknown tag is constrained by its word
and by the tags to its immediate left and right.
But those tags are unknown too …
Finite-state approaches
• Noishy Chunnel Muddle (statistical)
real language X (part-of-speech tags)
↓ noisy channel X → Y (insert words)
yucky language Y (text)
want to recover X from Y
Noisy channel – and prob intro
real language X: p(X)
* noisy channel X → Y: p(Y | X)
= yucky language Y: p(X,Y)
choose sequence of tags X that maximizes p(X | Y)
[oops… this isn’t quite correct… need 1 more step]
Two approaches
• Fix up Homer Simpson idea with more
than unigrams – look at tag sequence
• Go to more powerful Hidden Markov
model
What are unigrams and bigrams?
• Letter or word frequencies: 1-grams
• useful in solving cryptograms: ETAOINSHRDLU…
• If you know the previous letter: 2-grams
• “h” is rare in English (4%; 4 points in Scrabble)
• but “h” is common after “t” (20%)
• If you know the previous 2 letters: 3-grams
• “h” is really common after “ t” (space, then t)
etc. …
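As a small illustration of these letter statistics, here is a sketch (toy text, not a real corpus) that counts letter unigrams and bigrams; it shows the kind of counts behind claims like “h is rare overall but common after t.”

from collections import Counter

text = "the cat sat on the mat with the other cats"
letters = [c for c in text if c.isalpha()]           # drop spaces for simplicity

unigrams = Counter(letters)                          # 1-gram counts
bigrams = Counter(zip(letters, letters[1:]))         # 2-gram counts

p_h = unigrams["h"] / len(letters)                   # overall frequency of 'h'
p_h_after_t = bigrams[("t", "h")] / unigrams["t"]    # frequency of 'h' given previous 't'
print(round(p_h, 3), round(p_h_after_t, 3))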
In our case
• Most likely word? Most likely tag t given a word
w? = P(tag|word)
• Task of predicting the next word
• Woody Allen:
“I have a gub”
In general: predict the Nth word (tag) from the
preceding N−1 words (tags), aka an N-gram
Homer Simpson: just use the current word (don’t
look at context) = unigram (1-gram)
Where do these probability info
estimates come from?
• Use a tagged corpus, e.g., the “Brown corpus”: 1M
words (fewer token instances); many others –
Celex, 16M words
• Use counts (relative frequencies) as estimates
for probabilities (various issues w/ this, these
so-called Maximum-Likelihood estimates – don’t
work well for low numbers)
• Train on texts to get estimates – use on new
texts
General probabilistic decision
problem
• E.g.: data = bunch of text
• label = language
• label = topic
• label = author
• E.g.2: (sequential prediction)
• label = translation or summary of entire text
• label = part of speech of current word
• label = identity of current word (ASR) or
character (OCR)
Formulation, in general
Label* = argmax_Label Pr(Label | Data)
How far should we go?
• “long distance___”  Next word? Call?
• p(w_n | w_1 … w_{n−1})
• Consider the special case above
• Approximation says that
|long distance call| / |long distance| ≈ |distance call| / |distance|
• If the context is 1 word back = bigram
But an even better approx. if 2 words back: long distance___
Not always right: long distance runner / long distance call
The further you go: collect long distance_____
Bigrams, fsa’s, and Markov
models – take two
• We approximate p(tag| all previous tags)
Instead of
p(rabbit|Just then the white…) we use:
P(rabbit|white)
• This is a Markov assumption where past
memory is limited to immediately previous
state – just 1 state corresponding to the
previous word or tag
Forming classes
• “n-gram” = sequence of n “words”
• unigram
• bigram
• trigram
• four-gram
• In language modeling, the conditioning
variables are sometimes called the “history” or
the “context.”
• The Markov assumption says that the prediction
is conditionally independent of ancient history,
given recent history.
• I.e., we divide all possible histories into equiv.
classes based on the recent history.
3-gram
[Genmetheyesse orils of Ted you doorder [6], the Grily
Capiduatent pildred and For thy werarme: nomiterst halt i,
what production the Covers, in calt cations on wile ars,
was name conch rom the exce of the man, Winetwentagaint up,
and and Al1. And of Ther so i hundal panite days th the
res of th rand ung into the forD six es, wheralf the hie
soulsee, frelatche rigat. And the LOperact camen
unismelight fammedied: and nople,
4-gram
[1] By the returall benefit han every familitant of all thou
go? And At the eld to parises of the nursed by thy way of
all histantly be the ~aciedfag
. to the narre gread abrasa
of thing, and vas these conwuning clann com to one language;
all Lah, which for the greath othey die. -
5-gram
[Gen 3:1] In the called up history of its opposition of
bourgeOIS AND Adam to rest, that the existing of heaven; and
land the bourgeoiS ANger anything but concealed, the land
whethere had doth know ther: bury thy didst of Terature their
faces which went masses the old society [2] is the breaks out
of oppressor of all which, the prolETARiat goest, unto German
pleast twelves applied in manner with these, first of this
polities have all
3-word-gram
[Gen 4:25] And Adam gave naines to ail feudal,
patriarchal, idyllic relations. It bas but –established
new classes, new conditions of oppression, new forme of
struggle in place of the West? The bourgeoisie keeps
more and more splitting up into two great lights;
the greater light to rule the day of my house is this
Eliezer of Damascus.
How far can we go??
Shakespeare in lub…
The unkindest cut of all
• Shakespeare: 884,647 words or tokens
(Kucera, 1992)
• 29,066 types (incl. proper nouns)
• So, # bigrams is 29,066² > 844 million. A 1
million word training set doesn’t cut it –
only 300,000 different bigrams appear
• Use backoff and smoothing
• So we can’t go very far…
Reliability vs. discrimination
“large green ___________”
tree? mountain? frog? car?
“swallowed the large green ________”
pill? broccoli?
Reliability vs. discrimination
• larger n: more information about the
context of the specific instance (greater
discrimination)
• smaller n: more instances in training
data, better statistical estimates (more
reliability)
Choosing n
Suppose we have a vocabulary V = 20,000 words
n            Number of bins
2 (bigrams)  400,000,000
3 (trigrams) 8,000,000,000,000
4 (4-grams)  1.6 x 10^17
Statistical estimators
Example:
Corpus: five Jane Austen novels
N = 617,091 words
V = 14,585 unique words
Task: predict the next word of the trigram “inferior to ________”
from test data, Persuasion:
“[In person, she was] inferior to both [sisters.]”
Instances in the Training Corpus: “inferior to ________”
[Slides: the maximum likelihood estimate, the actual probability
distribution, and a comparison of the two]
Smoothing
• Develop a model which decreases
probability of seen events and allows the
occurrence of previously unseen n-grams
• a.k.a. “Discounting methods”
• “Cross-Validation” – Smoothing methods
which utilize a second batch of data.
How to use Pr( Label | Data) ?
• Assume we have a training corpus, i.e. a bunch
of Data with their Labels.
• Then we can compute Maximum Likelihood
Estimates (MLEs) of the probabilities as relative
frequencies:
count ( Label , Data )
Pr( Label | Data ) =
count ( Data)
• To predict a Label for a new bunch of Data, we
just look up Pr( Label | Data) for all possible
Labels and output the argmax.
Practical limitations
• Equivalently,
Pr( Label , Data)
Pr( Label | Data) =
Pr( Data )
• Let Data = (w_1, w_2, …, w_d).
• Let v = size of vocabulary.
• Then we need v^d probability estimates!
We can use n-grams for tagging
• Replace ‘words’ with ‘tags’
• Find best maximum likelihood estimates
• Estimates calculated this way:
• P(noun|det) = p(det, noun)/p(det), replacing probabilities by counts:
• ≈ count(det at position i−1 & noun at i) / count(det at position i−1)
• Correction: include the frequency of the context word:
≈ count(det at position i−1 & noun at i) / (count(det at position i−1) * count(noun at i))
Find the optimal path – highest p – using a dynamic
programming algorithm, approx. linear in the length
of the tag sequence
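A minimal sketch of these relative-frequency estimates, using a tiny hand-made tagged corpus (not the Brown corpus); the counts below only stand in for real training data.

from collections import Counter

tagged_corpus = [
    [("the", "Det"), ("dog", "Noun"), ("ran", "Verb")],
    [("a", "Det"), ("big", "Adj"), ("dog", "Noun"), ("barked", "Verb")],
    [("the", "Det"), ("cat", "Noun"), ("slept", "Verb")],
]

tag_counts = Counter()
tag_bigrams = Counter()
emissions = Counter()
for sentence in tagged_corpus:
    tags = [tag for _, tag in sentence]
    tag_counts.update(tags)
    tag_bigrams.update(zip(tags, tags[1:]))      # (tag at i-1, tag at i)
    emissions.update(sentence)                   # (word, tag) pairs

def p_transition(prev_tag, tag):
    # P(tag | prev_tag) ~= count(prev_tag, tag) / count(prev_tag)
    return tag_bigrams[(prev_tag, tag)] / tag_counts[prev_tag]

def p_emission(word, tag):
    # P(word | tag) ~= count(word, tag) / count(tag)
    return emissions[(word, tag)] / tag_counts[tag]

print(p_transition("Det", "Noun"))   # 2/3: Det is followed by Noun twice, Adj once
print(p_emission("dog", "Noun"))     # 2/3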
Example
The  guy  still  saw  her
Det  NN   NN     NN   PPO
     VB   VB     VBD  PP$
          RB
Table 2 from DeRose (1988)
Det=determiner, NN=noun, VB=verb,
RB=adverb, VBD=past-tense-verb, PPO=object
pronoun and PP$=possessive pronoun
Find the maximum likelihood estimate (MLE) path
through this ‘trellis’
Transitional probability estimates
from counts
      DT   NN   PPO  PP$  RB   VB   VBD
DT    0    186  0    8    1    8    9
NN    40   1    3    40   9    66   186
PPO   7    3    16   164  109  16   313
PP$   176  0    0    5    1    1    2
RB    5    3    16   164  109  16   313
VB    22   694  146  98   9    1    59
VBD   11   584  143  160  2    1    91
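As a sketch of how such counts become probabilities (assuming, as in the trellis steps below, that rows give the preceding tag), one row can be normalized by its total:

tags = ["DT", "NN", "PPO", "PP$", "RB", "VB", "VBD"]
dt_row = [0, 186, 0, 8, 1, 8, 9]                 # the DT row of the table above

total = sum(dt_row)                              # 212
p_from_dt = {tag: count / total for tag, count in zip(tags, dt_row)}
print(round(p_from_dt["NN"], 3))                 # 186/212, about 0.877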
Tagging search tree (trellis)
The  guy  still  saw  her
Det  NN   NN     NN   PPO
     VB   VB     VBD  PP$
          RB
Step 1. c(DT-NN) = 186
        c(DT-VB) = 1
Keep both paths. (Why?)
Step 2. Pick max to each of the tags NN, VB, RB –
need keep only the max. Why?
Trellis search
The guy still saw her
[Diagram: two views of the trellis over tags Det, NN, VB, RB, PPO, PP$,
with transition counts (186, 1, 66, 9, 694, …) labeling the arcs along
candidate paths]
Smoothing
• We don’t see many of the words in English
(unigram)
• We don’t see the huge majority of bigrams in
English
• We see only a tiny sliver of the possible trigrams
• So: most of the time, the bigram model assigns p(0) to
a bigram:
p(food|want) = |want food| / |want| = 0/whatever
But p = 0 means the event can’t happen – we aren’t warranted
to conclude this… therefore, we must adjust… how?
Simplest idea: add-1 smoothing
• Add 1 to every cell of the bigram count table
• P(food | want) = (|want food| + 1) ÷ (|want| + V) = 1 ÷
2931 = .0003
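A minimal sketch of the add-1 estimate; the counts and vocabulary size below are only illustrative, in the spirit of the Berkeley restaurant example rather than taken from it.

from collections import Counter

bigram_counts = Counter({("want", "to"): 608, ("want", "food"): 0})   # illustrative counts
unigram_counts = Counter({"want": 927})
V = 1616                                                              # assumed vocabulary size

def p_add1(prev, word):
    # P(word | prev) = (count(prev, word) + 1) / (count(prev) + V)
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(p_add1("want", "food"))   # nonzero even though the bigram was never seen
print(p_add1("want", "to"))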
Laplace’s Law (add 1)
[Slides: the Laplace smoothing formula; initial bigram counts from the
Berkeley restaurant project; the old vs. new (smoothed) table]
Changes
• All non-zero probs went down
• Sometimes probs don’t change much
• Some predictable events become less
predictable (P(to|want) [0.65 to 0.22])
• Other probs change by large factors
(P(lunch|Chinese) [0.0047 to 0.001])
• Conclusion: generally a good idea, but the effect on
nonzeros is not always good – it blurs the original model –
too much prob goes to the zeros; we want less
‘weight’ assigned to them (zero-sum game,
’cause probs always sum to 1)
So far, then…
• n-gram models are a.k.a. Markov
models/chains/processes.
• They are a model of how a sequence of
observations comes into existence.
• The model is a probabilistic walk on a
FSA.
• Pr(a|b) = probability of entering state a,
given that we’re currently in state b.
How well does this work for
tagging?
• 90% accuracy (for unigram) pushed up to
96%
• So what?
• How good is this? Evaluation!
Evaluation of systems
• The principal measures for information
extraction tasks are recall and precision.
• Recall is the number of answers the system got
right divided by the number of possible right
answers
• It measures how complete or comprehensive the
system is in its extraction of relevant information
• Precision is the number of answers the system
got right divided by the number of answers the
system gave
• It measures the system's correctness or accuracy
• Example: there are 100 possible answers and the
system gives 80 answers and gets 60 of them right,
its recall is 60% and its precision is 75%.
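The worked example above, as a few lines of arithmetic:

possible_right = 100        # possible right answers
answers_given = 80          # answers the system gave
answers_right = 60          # answers it got right

recall = answers_right / possible_right      # 0.60
precision = answers_right / answers_given    # 0.75
print(recall, precision)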
A better measure - Kappa
• Takes baseline & complexity of task into
account – if 99% of tags are Nouns, getting
99% correct no great shakes
• Suppose no “Gold Standard” to compare
against?
• P(A) = proportion of times hypothesis agrees
with standard (% correct)
• P(E) = proportion of times hypothesis and
standard would be expected to agree by chance
(computed from some other knowledge, or
actual data)
Kappa [p. 315 J&M text]
κ = (P(A) − P(E)) / (1 − P(E))
• Note κ ranges from 0 (no agreement
beyond chance) to 1 (complete
agreement)
• Can be used even if no ‘Gold standard’
that everyone agrees on
• K> 0.8 is good
Kappa
κ = (P(A) − P(E)) / (1 − P(E))
• A = actual agreement; E = expected
agreement
• consistency is more important than “truth”
• methods for raising consistency
• style guides (often have useful insights into
data)
• group by task, not chronologically, etc.
• annotator acclimatization
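A small sketch of the kappa computation; the 0.97/0.90 pairing below is only an illustration (a tagger at 97% accuracy against 90% expected chance agreement), not a result from the lecture.

def kappa(p_a, p_e):
    # kappa = (P(A) - P(E)) / (1 - P(E))
    return (p_a - p_e) / (1 - p_e)

print(round(kappa(0.97, 0.90), 2))   # 0.7 -- still below the 0.8 'good' threshold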
Statistical Tagging Methods
• Simple bigram – ok, done
• Combine bigram and unigram
Markov chain…
[Diagram: a Markov chain over states a, h, p, e, t, i with transition
probabilities (.6, .4, .3, 1, …) on the arcs; note: Pr(h|h) = 0]
From each state the outgoing probabilities sum to 1:
∀y Σ_x Pr(x | y) = 1
First-order Markov (bigram)
model as fsa
[Diagram: states Start, Det, Verb, Prep, Adj, Noun, Stop]
Markov inferences
1. Given a Markov model M, we can compute the
(MLE) probability of a given observation
sequence.
2. Given the states of a Markov model, and an
observation sequence, we can estimate the
state transition probabilities.
3. Given a pair of Markov models and an
observation sequence, we can choose the
more likely model.
From Markov models to Hidden
Markov models (HMMs)
• The HMM of how an observation sequence
comes into existence adds one step to a
(simple/visible) Markov model.
• Instead of being observations, the states now
probabilistically emit observations.
• The relationship between states and
observations is, in general, many-to-many.
• We can’t be sure what sequence of states
emitted a given sequence of observations
HMM for ice-cream consumption
• Predict Boston weather given my
consumption (observable) of ice-creams,
1, 2, 3.
• Underlying state is either H(ot) or C(old)
• Prob of moving between states, and also
prob of eating 1, 2, 3, given in state H or
C. Formally:
HMM
[Diagram: two states S1=Cold and S2=Hot, with transition probabilities
between them and emission probabilities over the observations 1, 2, 3
(ice creams eaten) in each state]
• Start in state Si with probability pi.
• While (t++) do:
• Move from Si to Sj with probability aij (i.e.
Xt+1=j)
• Emit observation ot=k with probability bjk.
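A minimal sketch of this generative story in Python; the start, transition, and emission numbers below are made up, not the ones on the slide.

import random

start_p = {"Hot": 0.5, "Cold": 0.5}
trans_p = {"Hot": {"Hot": 0.7, "Cold": 0.3}, "Cold": {"Hot": 0.2, "Cold": 0.8}}
emit_p = {"Hot": {1: 0.1, 2: 0.3, 3: 0.6}, "Cold": {1: 0.6, 2: 0.3, 3: 0.1}}

def pick(dist):
    """Sample a key from a {outcome: probability} dict."""
    outcomes = list(dist)
    return random.choices(outcomes, weights=[dist[o] for o in outcomes])[0]

def generate(length):
    state = pick(start_p)                          # start in some state S_i
    observations = []
    for _ in range(length):
        observations.append(pick(emit_p[state]))   # emit o_t = k with prob b_jk
        state = pick(trans_p[state])               # move S_i -> S_j with prob a_ij
    return observations

print(generate(10))    # e.g. [3, 3, 1, 2, ...] -- the weather states stay hidden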
Noisy channel maps well to our
fsa/fst notions
• What’s p(X)?
• Ans: p(tag sequence) – i.e., some finite state
automaton
• What’s p(Y|X)?
• Ans: transducer that takes tags→words
• What’s P(X,Y)?
• The joint probability of the tag sequence, given
the words (well, gulp, almost… we will need one
more twist – why? What is Y?)
The plan modeled as composition (x-product) of finite-state machines
p(X): automaton over {a, b} with arcs a:a/0.7, b:b/0.3
* p(Y | X): transducer with arcs a:C/0.1, a:D/0.9, b:C/0.8, b:D/0.2
= p(X,Y): transducer with arcs a:C/0.07, a:D/0.63, b:C/0.24, b:D/0.06
Note p(x,y) sums to 1.
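A sketch of the same composition in code: a one-state automaton over {a, b} composed with a one-state transducer, with arc weights multiplying (the numbers are those shown above).

p_X = {"a": 0.7, "b": 0.3}                 # automaton p(X)
p_Y_given_X = {                             # transducer p(Y | X)
    "a": {"C": 0.1, "D": 0.9},
    "b": {"C": 0.8, "D": 0.2},
}

# composition: weights multiply along matching symbols
p_XY = {(x, y): p_X[x] * p_Y_given_X[x][y]
        for x in p_X for y in p_Y_given_X[x]}

print(p_XY)                # ('a','C'): 0.07, ('a','D'): 0.63, ('b','C'): 0.24, ('b','D'): 0.06
print(sum(p_XY.values()))  # 1.0 (up to floating point)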
Cross-product construction for
fsa’s (or fst’s)
[Diagram: an automaton with states 0–4 crossed with a second automaton
(which has ε-arcs) yields an automaton over state pairs (0,0), (1,1),
(1,2), …, (4,4)]
Pulled a bit of a fast one here…
• So far, we have a plan to compute P(X,Y) – but
is this correct?
• Y= all the words in the world
• X= all the tags in the world (well, for English)
• What we get to see as input is y∈Y not Y!
• What we want to compute is REALLY this:
want to recover x∈X from y∈Y
choose x that maximizes p(X | y) so…
The real plan…
p(X) * p(Y | X) * p(y | Y) = p(X, y)
Find x that maximizes this quantity
Cartoon version
[Diagram: three machines composed]
p(X): tag-bigram automaton (Start, Det, Verb, Prep, Adj, Noun, Stop; arcs
such as Det 0.8, Adj 0.3, Adj 0.4, Noun 0.5, Noun 0.7, ε 0.1, ε 0.2)
* p(Y | X): tag-to-word transducer (arcs such as Noun:Bill/0.002,
Noun:autos/0.001, Noun:cortege/0.000001, Det:the/0.4, Det:a/0.6,
Adj:cool/0.003, Adj:directed/0.0005, Adj:cortege/0.000001, …)
* p(y | Y): the observed words “the cool directed autos”
= p(X, y)
transducer: scores candidate tag seqs
on their joint probability with obs words;
we should pick best path
This is an HMM for tagging
• Each hidden tag state produces a word in
the sentence
• Each word is
• Uncorrelated with all the other words and
their tags
• Probabilistic depending on the N previous
tags only
The plan modeled as composition
(product) of finite-state machines
p(X): automaton over {a, b} with arcs a:a/0.7, b:b/0.3
* p(Y | X): transducer with arcs a:C/0.1, a:D/0.9, b:C/0.8, b:D/0.2
= p(X,Y): transducer with arcs a:C/0.07, a:D/0.63, b:C/0.24, b:D/0.06
Note p(x,y) sums to 1.
Suppose y=“C”; what is best “x”?
We need to factor in one more machine
that models the actual word sequence, y:
restrict just to paths compatible with output “C”
p(X) * p(Y | X) * p(y | Y): the straight-line automaton C:C/1
= p(X, y): only the arcs a:C/0.07 and b:C/0.24 survive;
find x to maximize p(X, y) – pick the best path
The statistical view, in short:
• We are modeling p(word seq, tag seq)
• The tags are hidden, but we see the
words
• What is the most likely tag sequence?
• Use a finite-state automaton, that can
emit the observed words
• FSA has limited memory
• AKA this Noisy channel model is a “Hidden
Markov Model” -
Put the punchline before the joke
Bill directed a
cortege of autos through the dunes
Recover tags
Punchline – recovering (words,
tags)
tags X →   Start PN Verb Det Noun Prep Noun Prep Det Noun Stop
words Y →        Bill directed a cortege of autos through the dunes
Find tag sequence X that maximizes probability product
Punchline – ok, where do the pr
numbers come from?
tags X →   Start PN Verb Det Noun Prep Noun Prep Det Noun Stop
words Y →        Bill directed a cortege of autos through the dunes
(e.g., transition prs 0.4, 0.6 between tag states; emission pr 0.001 for a word)
the tags are not observable & they are states of some fsa
We estimate transition probabilities between states
We also have ‘emission’ pr’s from states
En tout: a Hidden Markov Model (HMM)
Our model uses both bigrams &
unigrams:
tags X →   Start PN Verb Det Noun Prep Noun Prep Det Noun Stop
words Y →        Bill directed a cortege of autos through the dunes
probs from the tag bigram model (e.g., 0.4, 0.6);
probs from unigram replacement (e.g., 0.001)
This only shows the best path… how do we find it?
Submenu for probability theory –
redo n-grams a bit more formally
• Define all this p(X), p(Y|X), P(X,Y)
notation
• p, event space, conditional probability &
chain rule;
• Bayes’ Law
• (Eventually) how do we estimate all these
probabilities from (limited) text? (Backoff
& Smoothing)
Rush intro to probability
p(Paul Revere wins | weather’s clear) = 0.9
What’s this mean?
p(Paul Revere wins | weather’s clear) = 0.9
• Past performance?
• Revere’s won 90% of races with clear weather
• Hypothetical performance?
• If he ran the race in many parallel universes …
• Subjective strength of belief?
• Would pay up to 90 cents for chance to win $1
• Output of some computable formula?
• But then which formulas should we trust?
p(X | Y) versus q(X | Y)
p is a function on event sets
p(win | clear) ≡ p(win, clear) / p(clear)
[Venn diagram: within All Events (races), the set where the weather’s
clear and the set where Paul Revere wins overlap]
syntactic sugar: the bar is a logical conjunction of predicates,
selecting the races where the weather’s clear
p measures the total probability of a set of events.
Commas in p(x,y) mean conjunction –
on the left…
p(Paul Revere wins, Valentine places, Epitaph
shows | weather’s clear)
what happens as we add conjuncts to left of bar ?
• probability can only decrease
• numerator of historical estimate likely to go to zero:
# times Revere wins AND Val places… AND weather’s clear
# times weather’s clear
Commas in p(x,y)…on the right
p(Paul Revere wins | weather’s clear,
ground is dry, jockey getting over sprain, Epitaph
also in race, Epitaph was recently bought by Gonzalez,
race is on May 17, …
)
what happens as we add conjuncts to right of bar ?
• probability could increase or decrease
• probability gets more relevant to our case (less bias)
• probability estimate gets less reliable (more variance)
# times Revere wins AND weather clear AND … it’s May 17
# times weather clear AND … it’s May 17
Backing off: simplifying the righthand side…
p(Paul Revere wins | weather’s clear,
ground is dry, jockey getting over sprain, Epitaph
also in race, Epitaph was recently bought by Gonzalez,
race is on May 17, …
)
not exactly what we want but at least we can get a
reasonable estimate of it!
try to keep the conditions that we suspect will have
the most influence on whether Paul Revere wins
Recall ‘backing off’ in using just p(rabbit|white)
instead of p(rabbit|Just then a white) – so this is a
general method
What about simplifying the lefthand side?
p(Paul Revere wins, Valentine places,
Epitaph shows | weather’s clear)
NOT ALLOWED!
but we can do something similar to help …
We can FACTOR this information – the so-called
“Chain Rule”
Chain rule: factoring lhs
p(Revere, Valentine, Epitaph | weather’s clear)        RVEW/W
= p(Revere | Valentine, Epitaph, weather’s clear)      RVEW/VEW
* p(Valentine | Epitaph, weather’s clear)              VEW/EW
* p(Epitaph | weather’s clear)                         EW/W
True because numerators cancel against denominators
Makes perfect sense when read from bottom to top
Moves material to right of bar so it can be ignored
If this prob is unchanged by backoff, we say Revere was
CONDITIONALLY INDEPENDENT of Valentine and Epitaph
(conditioned on the weather’s being clear). Often we just
ASSUME conditional independence to get the nice product above.
The plan: summary so far
p(X): automaton: p(tag sequence) – “Markov Model”
(e.g. arcs a:a/0.7, b:b/0.3)
* p(Y | X): transducer: tags → words – “Unigram Replacement”
(e.g. a:C/0.1, a:D/0.9, b:C/0.8, b:D/0.2)
* p(y | Y): automaton: the observed words – “straight line” (C:C/1)
= p(X, y): transducer that scores candidate tag seqs on their joint
probability with obs words (e.g. a:C/0.07, b:C/0.24); pick best path
First-order Markov (bigram)
model as fsa
[Diagram: states Start, Det, Verb, Prep, Adj, Noun, Stop]
Add in transition probs - sum to 1
[Diagram: the same fsa with transition probabilities on its arcs,
e.g. 0.3, 0.7, 0.4, 0.5, 0.1]
Same as bigram: P(Noun|Det) = 0.7 ≡ an arc Det → Noun labeled 0.7
Add in start & etc.
[Diagram: the same fsa with Start → Det 0.8 and Noun → Stop 0.2 added]
Markov Model
p(tag seq)
[Diagram: the tag fsa with arc probabilities Det 0.8, Adj 0.3, Adj 0.4,
Noun 0.5, Noun 0.7, Stop 0.2, 0.1]
Start Det Adj Adj Noun Stop = 0.8 * 0.3 * 0.4 * 0.5 * 0.2
Markov model as fsa
p(tag seq)
[Same diagram as on the previous slide]
Start Det Adj Adj Noun Stop = 0.8 * 0.3 * 0.4 * 0.5 * 0.2
Add ‘output tags’ (transducer)
p(tag seq)
[Diagram: the same machine, with each arc now also emitting its tag as
output (Det 0.8, Adj 0.3, Adj 0.4, Noun 0.5, Noun 0.7); the arcs into
Stop emit ε (ε 0.2, ε 0.1)]
Start Det Adj Adj Noun Stop = 0.8 * 0.3 * 0.4 * 0.5 * 0.2
Tag bigram picture
p(tag seq)
[Diagram: the tag-bigram machine with arcs Det 0.8, Adj 0.3, Adj 0.4,
Noun 0.5, ε 0.2]
Start Det Adj Adj Noun Stop = 0.8 * 0.3 * 0.4 * 0.5 * 0.2
Our plan
p(X): automaton: p(tag sequence) – “Markov Model”
* p(Y | X): transducer: tags → words – “Unigram Replacement”
* p(y | Y): automaton: the observed words – “straight line”
= p(X, y): transducer that scores candidate tag seqs
on their joint probability with obs words;
pick best path
Cartoon form again
[Diagram: three machines composed]
p(X): tag-bigram automaton (Start, Det, Verb, Prep, Adj, Noun, Stop; arcs
such as Det 0.8, Adj 0.3, Adj 0.4, Noun 0.5, Noun 0.7, ε 0.1, ε 0.2)
* p(Y | X): tag-to-word transducer (arcs such as Noun:Bill/0.002,
Noun:autos/0.001, Noun:cortege/0.000001, Det:the/0.4, Det:a/0.6,
Adj:cool/0.003, Adj:directed/0.0005, Adj:cortege/0.000001, …)
* p(y | Y): the observed words “the cool directed autos”
= p(X, y)
transducer: scores candidate tag seqs
on their joint probability with obs words;
we should pick best path
Next up: unigram replacement
model
p(word seq | tag seq)
…
Noun:cortege/0.000001
Noun:autos/0.001
Noun:Bill/0.002
…
Det:a/0.6
Det:the/0.4
Adj:cool/0.003
Adj:directed/0.0005
Adj:cortege/0.000001
…
(each tag’s word distribution sums to 1)
Compose
p(tag seq): the tag-bigram machine
(Start, Det 0.8, Adj 0.3, Adj 0.4, Noun 0.5, Noun 0.7, Prep, ε 0.1, ε 0.2, Stop)
with the unigram replacement transducer:
…
Noun:cortege/0.000001
Noun:autos/0.001
Noun:Bill/0.002
Det:a/0.6
Det:the/0.4
Adj:cool/0.003
Adj:directed/0.0005
Adj:cortege/0.000001
…
p(word seq, tag seq) = p(tag seq) * p(word seq | tag seq)
[Diagram: the composed transducer; arc weights multiply, e.g.
Det:a 0.48, Det:the 0.32,
Adj:cool 0.0009, Adj:directed 0.00015, Adj:cortege 0.000003,
Adj:cool 0.0012, Adj:directed 0.00020, Adj:cortege 0.000004,
N:cortege, N:autos, and an ε arc into Stop]
Observed words as straight-line fsa
word seq: the → cool → directed → autos
Compose with the straight-line automaton for “the cool directed autos”
p(word seq, tag seq) = p(tag seq) * p(word seq | tag seq)
[Diagram: the same composed transducer, about to be restricted to the
observed words, with arcs Det:a 0.48, Det:the 0.32, Adj:cool 0.0009,
Adj:directed 0.00015, Adj:cortege 0.000003, Adj:cool 0.0012,
Adj:directed 0.00020, Adj:cortege 0.000004, N:cortege, N:autos, ε]
p(word seq, tag seq) = p(tag seq) * p(word seq | tag seq)
[Diagram: after composing with the observed words, only a few arcs
survive: Det:the 0.32, Adj:cool 0.0009, Adj:directed 0.00020, N:autos,
ε into Stop – why did the Adj loop go away?]
The best path:
Start Det Adj Adj Noun Stop = 0.32 * 0.0009 * 0.00020 …
the cool directed autos
p(word seq, tag seq) = p(tag seq) * p(word seq | tag seq)
[Same restricted machine: Det:the 0.32, Adj:cool 0.0009,
Adj:directed 0.00020, N:autos, ε]
But… how do we find this ‘best’
path???
All paths together form ‘trellis’
p(word seq, tag seq)
[Diagram: the trellis – one column of tag states (Det, Adj, Noun, …) per
observed word, with weighted arcs such as Det:the 0.32, Adj:cool 0.0009,
Adj:directed…, Noun:autos…, ε 0.2 into Stop]
WHY?
The best path:
Start Det Adj Adj Noun Stop = 0.32 * 0.0009 …
the cool directed autos
Cross-product construction forms
trellis
[Diagram: the cross-product of the tagging fst (states 0–4, with ε-arcs)
and the straight-line word automaton gives pair states (0,0), (1,1), …, (4,4)]
All paths in the word automaton are 5 words,
so all paths in the product must have 5 words on the output side
Trellis isn’t complete
p(word seq, tag seq)
Lattice has no Det → Det or Det → Stop arcs; why?
[Same trellis diagram as before]
The best path:
Start Det Adj Adj Noun Stop = 0.32 * 0.0009 …
the cool directed autos
Trellis incomplete
p(word seq, tag seq)
Lattice is missing some other arcs; why?
[Same trellis diagram]
The best path:
Start Det Adj Adj Noun Stop = 0.32 * 0.0009 …
the cool directed autos
And missing some states…
p(word seq, tag seq)
Lattice is missing some states; why?
[Same trellis diagram]
The best path:
Start Det Adj Adj Noun Stop = 0.32 * 0.0009 …
the cool directed autos
Finding the best path from start to
stop
[Trellis diagram as before]
• Use dynamic programming
• What is best path from Start to each node?
• Work from left to right
• Each node stores its best path from Start (as
probability plus one backpointer)
• Special acyclic case of Dijkstra’s shortest-path
algorithm
• Faster if some arcs/states are absent
Method: Viterbi algorithm
• For each path reaching state s at step (word)
t, we compute a path probability. We call the
max of these viterbi(s,t)
• [Base step]
Compute viterbi(0,0)=1
• [Induction step] Compute viterbi(s',t+1),
assuming we know viterbi(s,t) for all s
Viterbi recursion
path-prob(s′|s,t) = viterbi(s,t) * a[s,s′]
(probability of the path to s′ through s = max path score for state s at
time t, times the transition probability s → s′)
viterbi(s′,t+1) = max over s in STATES of path-prob(s′ | s,t)
Method…
• This is almost correct… but again, we need
to factor in the unigram prob of the observed
surface word w given the state s′
• So the correct formula for the path prob is:
path-prob(s′|s,t) = viterbi(s,t) * a[s,s′] * b_s′(o_t)
(a[s,s′]: bigram transition prob; b_s′(o_t): unigram emission prob)
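Putting the recursion together, here is a minimal Viterbi sketch; the model parameters are made-up ice-cream-style numbers, not anything from the text.

def viterbi(observations, states, start_p, trans_p, emit_p):
    # best[t][s] = probability of the best path ending in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s2 in states:
            # viterbi(s', t+1) = max_s viterbi(s, t) * a[s, s'] * b_s'(o_t)
            prob, prev = max(
                (best[t - 1][s1] * trans_p[s1][s2] * emit_p[s2][observations[t]], s1)
                for s1 in states)
            best[t][s2] = prob
            back[t][s2] = prev
    # follow backpointers from the best final state
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

states = ["Hot", "Cold"]
start_p = {"Hot": 0.5, "Cold": 0.5}
trans_p = {"Hot": {"Hot": 0.7, "Cold": 0.3}, "Cold": {"Hot": 0.2, "Cold": 0.8}}
emit_p = {"Hot": {1: 0.1, 2: 0.3, 3: 0.6}, "Cold": {1: 0.6, 2: 0.3, 3: 0.1}}
print(viterbi([3, 1, 1], states, start_p, trans_p, emit_p))   # ['Hot', 'Cold', 'Cold']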
Or as in your text…p. 179
Summary
• We are modeling p(word seq, tag seq)
• The tags are hidden, but we see the words
• Is tag sequence X likely with these words?
• Noisy channel model is a “Hidden Markov Model”:
[Diagram: tag states Start PN Verb Det Noun Prep Noun … over the words
Bill directed a cortege of autos thro…, with probs from the tag bigram
model (e.g. 0.4, 0.6) and probs from unigram replacement (e.g. 0.001)]
• Find X that maximizes probability product
Evaluation of systems…redux
• The principal measures for information
extraction tasks are recall and precision.
• Recall is the number of answers the system got
right divided by the number of possible right
answers
• It measures how complete or comprehensive the
system is in its extraction of relevant information
• Precision is the number of answers the system
got right divided by the number of answers the
system gave
• It measures the system's correctness or accuracy
• Example: there are 100 possible answers and the
system gives 80 answers and gets 60 of them right,
its recall is 60% and its precision is 75%.
A better measure - Kappa
• Takes baseline & complexity of task into
account – if 99% of tags are Nouns, getting
99% correct no great shakes
• Suppose no “Gold Standard” to compare
against?
• P(A) = proportion of times hypothesis agrees
with standard (% correct)
• P(E) = proportion of times hypothesis and
standard would be expected to agree by chance
(computed from some other knowledge, or
actual data)
Kappa [p. 315 J&M text]
κ = (P(A) − P(E)) / (1 − P(E))
• Note κ ranges from 0 (no agreement
beyond chance) to 1 (complete
agreement)
• Can be used even if no ‘Gold standard’
that everyone agrees on
• K> 0.8 is good
Two approaches
1. Noisy Channel Model (statistical) –
what’s that?? (we will have to learn
some statistics)
2. Deterministic baseline tagger composed
with a cascade of fixup transducers
These two approaches will be the guts of Lab 2
(lots of others: decision trees, …)
Fixup approach: Brill tagging
Another FST Paradigm:
Successive Fixups
Like successive markups but alter:
• Morphology
• Phonology
• Part-of-speech tagging
• …
[Figure from Brill’s thesis: the input receives an initial annotation, then
passes through a cascade of fixup transducers (Fixup 1, Fixup 2, Fixup 3, …)
to produce the output]
Transformation-Based Tagging
(Brill 1995)
figure from Brill’s thesis
Transformations Learned
BaselineTag*
NN @→ VB // TO _
VBP @→ VB // ... _
etc.
Compose this cascade of FSTs.
Get a big FST that does the initial tagging and the
sequence of fixups “all at once.”
figure from Brill’s thesis
Initial Tagging of OOV Words
Laboratory 2
• Goals:
1. Use both HMM and Brill taggers
2. Find errors that both make
3. Compare performance – use of kappa &
‘confusion matrix’
4. All the slings & arrows of corpora – use
Wall Street Journal excerpts