Natural Language Processing
Lecture 8—2/5/2015
Susan W. Brown
Today
Part of speech tagging
HMMs
Basic HMM model
Decoding
Viterbi
Review chapters 1-4
POS Tagging as Sequence Classification
We are given a sentence (an “observation” or “sequence of observations”):
Secretariat is expected to race tomorrow
What is the best sequence of tags that corresponds to this sequence of observations?
Probabilistic view
Consider all possible sequences of tags
Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1…wn
Getting to HMMs
We want, out of all sequences of n tags t1…tn, the single tag sequence such that P(t1…tn|w1…wn) is highest.
The hat ^ means “our estimate of the best one”
argmax_x f(x) means “the x such that f(x) is maximized”
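A reconstruction of the equation the slide refers to, in the book’s notation:

\hat{t}_1^n = \arg\max_{t_1^n} P(t_1^n \mid w_1^n)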
Getting to HMMs
This equation should give us the best tag sequence.
But how to make it operational? How to compute this value?
Intuition of Bayesian inference: use Bayes’ rule to transform this equation into a set of probabilities that are easier to compute (and give the right answer)
Bayesian inference
Update the probability of a hypothesis as you get evidence
Rationale: two components
How well does the evidence match the hypothesis?
How probable is the hypothesis a priori?
Using Bayes Rule
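Reconstructing this slide’s equation from the description above: applying Bayes’ rule to the tagging objective gives

\hat{t}_1^n = \arg\max_{t_1^n} \frac{P(w_1^n \mid t_1^n)\, P(t_1^n)}{P(w_1^n)} = \arg\max_{t_1^n} P(w_1^n \mid t_1^n)\, P(t_1^n)

since the denominator P(w_1^n) is the same for every candidate tag sequence and can be dropped.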
Likelihood and Prior
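Reconstructing the standard next step: the likelihood and the prior are each simplified with an independence assumption (each word depends only on its own tag; each tag depends only on the previous tag), giving

P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i), \qquad P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})

so the tagger chooses \hat{t}_1^n \approx \arg\max_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1}).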
Two Kinds of Probabilities
Tag transition probabilities P(ti|ti-1)
Determiners likely to precede adjectives and nouns
That/DT flight/NN
The/DT yellow/JJ hat/NN
So we expect P(NN|DT) and P(JJ|DT) to be high
Compute P(NN|DT) by counting in a labeled corpus:
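The counting formula itself (a reconstruction of the standard maximum-likelihood estimate):

P(NN \mid DT) = \frac{C(DT, NN)}{C(DT)}

i.e., of all the times DT occurs, the fraction of times it is followed by NN.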
Two Kinds of Probabilities
Word likelihood probabilities P(wi|ti)
VBZ (3sg Pres Verb) likely to be “is”
Compute P(is|VBZ) by counting in a labeled corpus:
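Again the standard count-based estimate (reconstructed):

P(is \mid VBZ) = \frac{C(VBZ, is)}{C(VBZ)}

i.e., of all VBZ tokens in the corpus, the fraction realized as the word “is”.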
Example: The Verb “race”
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
How do we pick the right tag?
Disambiguating “race”
Disambiguating “race”
Example
P(NN|TO) = .00047
P(VB|TO) = .83
P(race|NN) = .00057
P(race|VB) = .00012
P(NR|VB) = .0027
P(NR|NN) = .0012
P(VB|TO) × P(NR|VB) × P(race|VB) = .00000027
P(NN|TO) × P(NR|NN) × P(race|NN) = .00000000032
So we (correctly) choose the verb tag for “race”
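These two products can be checked with a couple of lines of Python (the numbers are exactly the ones quoted above):

```python
# Probabilities quoted on the slide
p_verb = 0.83 * 0.0027 * 0.00012     # P(VB|TO) * P(NR|VB) * P(race|VB)
p_noun = 0.00047 * 0.0012 * 0.00057  # P(NN|TO) * P(NR|NN) * P(race|NN)

print(f"verb reading: {p_verb:.2e}")  # ~2.69e-07
print(f"noun reading: {p_noun:.2e}")  # ~3.21e-10
print("chosen tag:", "VB" if p_verb > p_noun else "NN")  # VB
```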
Question
If there are 30 or so tags in the Penn set, and the average sentence is around 20 words...
How many tag sequences do we have to enumerate to argmax over in the worst case scenario?
30^20 (roughly 3.5 × 10^29)
Hidden Markov Models
Remember FSAs?
HMMs are a special kind that uses probabilities on the transitions
Minimum edit distance?
Viterbi and Forward algorithms
Dynamic programming?
Efficient means of finding most likely path
Hidden Markov Models
We can represent our race tagging example as an HMM.
This is a kind of generative model.
There is a hidden underlying generator of observable events
The hidden generator can be modeled as a network of states and transitions
We want to infer the underlying state sequence given the observed event sequence
Hidden Markov Models
States: Q = q1, q2, …, qN
Observations: O = o1, o2, …, oN; each observation is a symbol from a vocabulary V = {v1, v2, …, vV}
Transition probabilities: a transition probability matrix A = {aij}
Observation likelihoods: vectors of probabilities associated with the states
A special initial probability vector π
HMMs for Ice Cream
You are a climatologist in the year 2799 studying global warming.
You can’t find any records of the weather in Baltimore for the summer of 2007.
But you find Jason Eisner’s diary, which lists how many ice creams Jason ate every day that summer.
Your job: figure out how hot it was each day.
Eisner Task
Given
Ice Cream Observation Sequence: 1,2,3,2,2,2,3…
Produce:
Hidden Weather Sequence: H,C,H,H,H,C,C…
HMM for Ice Cream
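A sketch of the ice-cream HMM’s parameters as Python dictionaries: the values that reappear on the next two slides (.2, .5, .4, .4, .3) are used directly, and the remaining values are assumptions based on the usual Eisner ice-cream example, so treat them as illustrative rather than the slide’s exact figure.

```python
# Hidden states and the observation vocabulary (number of ice creams eaten per day)
states = ["Hot", "Cold"]
vocab = [1, 2, 3]

# Initial probabilities: P(Cold|Start) = .2 appears on the next slides;
# P(Hot|Start) = .8 is the assumed complement.
start_p = {"Hot": 0.8, "Cold": 0.2}

# Transition probabilities: P(Hot|Cold) = .4 and P(Cold|Hot) = .3 appear on the
# next slides; the self-loop values are assumed so that each row sums to 1.
trans_p = {
    "Hot":  {"Hot": 0.7, "Cold": 0.3},
    "Cold": {"Hot": 0.4, "Cold": 0.6},
}

# Observation likelihoods: P(1|Cold) = .5 and P(3|Hot) = .4 appear on the next
# slides; the remaining values are the usual Eisner-example numbers.
emit_p = {
    "Hot":  {1: 0.2, 2: 0.4, 3: 0.4},
    "Cold": {1: 0.5, 2: 0.4, 3: 0.1},
}
```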
Ice Cream HMM
Let’s just do 1 3 1 as the observation sequence
How many underlying state (hot/cold) sequences are there?
HHH
HHC
HCH
HCC
CCC
CCH
CHC
CHH
How do you pick the right one?
Argmax P(sequence | 1 3 1)
Ice Cream HMM
Let’s just do 1 sequence: CHC
Cold as the initial state: P(Cold|Start) = .2
Observing a 1 on a cold day: P(1|Cold) = .5
Hot as the next state: P(Hot|Cold) = .4
Observing a 3 on a hot day: P(3|Hot) = .4
Cold as the next state: P(Cold|Hot) = .3
Observing a 1 on a cold day: P(1|Cold) = .5
Multiplying them together: .2 × .5 × .4 × .4 × .3 × .5 = .0024
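To make the argmax concrete, here is a brute-force sketch in Python that scores every hidden sequence for the observations 1 3 1 and picks the best one. The probabilities quoted above (.2, .5, .4, .4, .3) are used as-is; the remaining values (P(Hot|Start), the self-loops, the other emissions) are assumptions from the usual Eisner example.

```python
from itertools import product

start_p = {"Hot": 0.8, "Cold": 0.2}                      # P(Cold|Start) = .2 (slide)
trans_p = {"Hot":  {"Hot": 0.7, "Cold": 0.3},            # P(Cold|Hot) = .3 (slide)
           "Cold": {"Hot": 0.4, "Cold": 0.6}}            # P(Hot|Cold) = .4 (slide)
emit_p  = {"Hot":  {1: 0.2, 2: 0.4, 3: 0.4},             # P(3|Hot) = .4 (slide)
           "Cold": {1: 0.5, 2: 0.4, 3: 0.1}}             # P(1|Cold) = .5 (slide)

def joint(state_seq, obs):
    """Joint probability P(state sequence, observation sequence) under the HMM."""
    p = start_p[state_seq[0]] * emit_p[state_seq[0]][obs[0]]
    for prev, cur, o in zip(state_seq, state_seq[1:], obs[1:]):
        p *= trans_p[prev][cur] * emit_p[cur][o]
    return p

obs = [1, 3, 1]
print(round(joint(("Cold", "Hot", "Cold"), obs), 4))     # 0.0024, as worked out above

# All 2**3 = 8 hidden sequences, scored, then argmax'ed
scores = {seq: joint(seq, obs) for seq in product(["Hot", "Cold"], repeat=len(obs))}
best = max(scores, key=scores.get)
print(best, scores[best])   # most probable sequence under these (partly assumed) parameters
```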
POS Transition Probabilities
Observation Likelihoods
Decoding
OK, now we have a complete model that can give us what we need. Recall that we need to get the tag sequence that maximizes the product of the word likelihoods and tag transition probabilities.
We could just enumerate all paths given the input and use the model to assign probabilities to each. Not a good idea.
Luckily, dynamic programming (last seen in Ch. 3 with minimum edit distance) helps us here.
Intuition
Consider a state sequence (tag sequence) that ends at state j with a particular tag T.
The probability of that tag sequence can be broken into two parts:
The probability of the BEST tag sequence up through j-1
Multiplied by the transition probability from the tag at the end of the j-1 sequence to T, and the observation probability of the word given tag T.
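Written as a recurrence (the standard Viterbi formulation in Jurafsky and Martin), this intuition reads:

v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)

where v_{t-1}(i) is the probability of the best path ending in state i at the previous step, a_{ij} is the tag transition probability from i to j, and b_j(o_t) is the probability of the observed word given tag j.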
The Viterbi Algorithm
Viterbi Summary
Create an array with columns corresponding to inputs and rows corresponding to possible states.
Sweep through the array in one pass, filling the columns left to right using our transition probs and observation probs.
The dynamic programming key is that we need only store the MAX prob path to each cell (not all paths).
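A minimal sketch of this procedure in Python, run on the ice-cream HMM; as before, the parameter values not quoted on the slides are assumptions from the usual Eisner example. Each cell stores only the max-probability path reaching it plus a backpointer, and the best path is recovered by backtracing.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    # One column per observation; each cell holds (best probability, backpointer).
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        col = {}
        for s in states:
            # Best previous state for reaching s at time t
            prev = max(states, key=lambda p: V[t - 1][p][0] * trans_p[p][s])
            col[s] = (V[t - 1][prev][0] * trans_p[prev][s] * emit_p[s][obs[t]], prev)
        V.append(col)
    # Backtrace from the best final state
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path)), V[-1][last][0]

states = ["Hot", "Cold"]
start_p = {"Hot": 0.8, "Cold": 0.2}                       # P(Cold|Start) from the slides
trans_p = {"Hot": {"Hot": 0.7, "Cold": 0.3},              # self-loops assumed
           "Cold": {"Hot": 0.4, "Cold": 0.6}}
emit_p = {"Hot": {1: 0.2, 2: 0.4, 3: 0.4},
          "Cold": {1: 0.5, 2: 0.4, 3: 0.1}}

print(viterbi([1, 3, 1], states, start_p, trans_p, emit_p))
```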
Evaluation
So once you have your POS tagger running, how do you evaluate it?
Overall error rate with respect to a gold-standard test set
With respect to a baseline
Error rates on particular tags
Error rates on particular words
Tag confusions...
Error Analysis
Look at a confusion matrix
See what errors are causing problems
Noun (NN) vs Proper Noun (NNP) vs Adj (JJ)
Preterite (VBD) vs Participle (VBN) vs Adjective (JJ)
Evaluation
The result is compared with a manually coded “Gold Standard”.
Typically accuracy reaches 96-97%.
This may be compared with the result for a baseline tagger (one that uses no context).
Important: 100% is impossible, even for human annotators.
Issues with manually coded gold standards
Summary
Parts of speech
Tagsets
Part of speech tagging
HMM Tagging
Markov Chains
Hidden Markov Models
Viterbi decoding
Review
Exam readings
Chapters 1 to 6
Chapter 2
Chapter 3
Skip 3.4.1, 3.10, 3.12
Chapter 4
Skip 4.7, 4.8-4.11
Chapter 5
Skip 5.5.4, 5.6, 5.8-5.10
3 Formalisms
Regular expressions describe languages (sets of strings)
Turns out that there are 3 formalisms for capturing such languages, each with its own motivation and history
Regular expressions
Compact textual strings
Perfect for specifying patterns in programs or command-lines
Finite state automata
Graphs
Regular grammars
Rules
Regular expressions
Anchor expressions
^, $, \b
Counters
*, +, ?
Single character expressions
., [ ], [ - ]
Grouping for precedence
()
[dog]* vs. (dog)*
No need to memorize shortcuts like \d, \s
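A quick illustration of the [dog]* vs. (dog)* distinction, using Python’s re module (the test strings are just made-up examples):

```python
import re

# [dog]* matches any run of the characters d, o, g (in any order, any count)
print(re.fullmatch(r"[dog]*", "good"))    # matches: every character is d, o, or g
print(re.fullmatch(r"[dog]*", "dogdog"))  # matches

# (dog)* matches zero or more repetitions of the literal string "dog"
print(re.fullmatch(r"(dog)*", "good"))    # None: "good" is not a repetition of "dog"
print(re.fullmatch(r"(dog)*", "dogdog"))  # matches
```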
FSAs
Components of an FSA
Know how to read one and draw one
Deterministic vs. non-deterministic
How is success/failure different?
Relative power
Recognition vs. generation
How do we implement FSAs for
recognition?
More Formally
You can specify an FSA by enumerating the following things:
The set of states: Q
A finite alphabet: Σ
A start state: q0
A set of accept states: F
A transition function δ that maps Q × Σ to Q
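A minimal sketch of recognition with these five components, in Python; the two-state machine below (accepting strings over {a, b} that end in b) is a made-up example, not one from the book, but the table-driven loop is the same idea as the book’s deterministic recognizer.

```python
# Q: states, Sigma: alphabet, delta: transition function Q x Sigma -> Q,
# q0: start state, F: accept states.
Q = {"q0", "q1"}
Sigma = {"a", "b"}
delta = {("q0", "a"): "q0", ("q0", "b"): "q1",
         ("q1", "a"): "q0", ("q1", "b"): "q1"}
q0 = "q0"
F = {"q1"}

def recognize(tape):
    """Return True iff the deterministic FSA accepts the input string."""
    state = q0
    for symbol in tape:
        if (state, symbol) not in delta:   # no transition defined: reject
            return False
        state = delta[(state, symbol)]
    return state in F                      # accept only if we end in an accept state

print(recognize("aab"))   # True: ends in b
print(recognize("aba"))   # False
```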
FSTs
Components of an FST
Inputs and outputs
Relations
Morphology
What is a morpheme?
Stems and affixes
Inflectional vs. derivational
Fuzzy -> fuzziness
Fuzzy -> fuzzier
Application of derivation rules
N -> V with -ize
System, chair
Regular vs. irregular
Derivational Rules
Lexicons
So the big picture is to store a lexicon (a list of words you care about) as an FSA. The base lexicon is embedded in larger automata that capture the inflectional and derivational morphology of the language.
So what? Well, the simplest thing you can do with such an FSA is spell checking:
If the machine rejects, the word isn’t in the language
Without listing every form of every word
Next Time
Three tasks for HMMs
Decoding
Viterbi algorithm
Assigning probabilities to inputs
Forward algorithm
Finding parameters for a model
EM