CSCI 5582 Artificial Intelligence

Natural Language
Processing
Lecture 8—2/5/2015
Susan W. Brown
Today
 Part of speech tagging
 HMMs
 Basic HMM model
 Decoding
 Viterbi
 Review chapters 1-4
POS Tagging as Sequence
Classification
 We are given a sentence (an “observation”
or “sequence of observations”)
 Secretariat is expected to race tomorrow
 What is the best sequence of tags that
corresponds to this sequence of
observations?
 Probabilistic view
 Consider all possible sequences of tags
 Out of this universe of sequences, choose the
tag sequence which is most probable given the
observation sequence of n words w1…wn.
Getting to HMMs
 We want, out of all sequences of n tags t1…tn, the single tag sequence such that P(t1…tn|w1…wn) is highest.
 Hat ^ means “our estimate of the best one”
 Argmax_x f(x) means “the x such that f(x) is maximized”
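Written out in that notation, the equation the slide is describing is the standard formulation from the textbook:

```latex
\hat{t}_{1}\ldots\hat{t}_{n} = \operatorname*{argmax}_{t_{1}\ldots t_{n}} P(t_{1}\ldots t_{n} \mid w_{1}\ldots w_{n})
```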
Getting to HMMs
 This equation should give us the best tag
sequence
 But how to make it operational? How to
compute this value?
 Intuition of Bayesian inference:
 Use Bayes rule to transform this equation into
a set of probabilities that are easier to
compute (and give the right answer)
Bayesian inference
 Update the probability of a hypothesis as
you get evidence
 Rationale: two components
 How well does the evidence match the
hypothesis?
 How probable is the hypothesis a priori?
Using Bayes Rule
 P(t1…tn|w1…wn) = P(w1…wn|t1…tn) P(t1…tn) / P(w1…wn)
 The denominator P(w1…wn) is the same for every candidate tag sequence, so we can drop it and maximize P(w1…wn|t1…tn) P(t1…tn)
Likelihood and Prior
 Likelihood: P(w1…wn|t1…tn) ≈ ∏ P(wi|ti) (each word depends only on its own tag)
 Prior: P(t1…tn) ≈ ∏ P(ti|ti-1) (tag bigram assumption)
 So the best tag sequence is the one that maximizes ∏ P(wi|ti) P(ti|ti-1)
Two Kinds of Probabilities
 Tag transition probabilities p(ti|ti-1)
 Determiners likely to precede adjs and nouns
 That/DT flight/NN
 The/DT yellow/JJ hat/NN
 So we expect P(NN|DT) and P(JJ|DT) to be high
 Compute P(NN|DT) by counting in a labeled corpus:
P(NN|DT) = C(DT, NN) / C(DT)
Two Kinds of Probabilities
 Word likelihood probabilities p(wi|ti)
 VBZ (3sg Pres Verb) likely to be “is”
 Compute P(is|VBZ) by counting in a labeled corpus:
P(is|VBZ) = C(VBZ, is) / C(VBZ)
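A minimal counting sketch for both kinds of probabilities (it assumes a hypothetical `tagged_corpus`, a list of sentences given as (word, tag) pairs; this is an illustration, not part of the slides):

```python
from collections import Counter

def estimate_probs(tagged_corpus):
    """MLE estimates from a list of sentences, each a list of (word, tag) pairs."""
    bigrams = Counter()    # C(prev_tag, tag)
    tags = Counter()       # C(tag), also used as the denominator C(prev_tag)
    emits = Counter()      # C(tag, word)
    for sent in tagged_corpus:
        prev = "<s>"
        tags[prev] += 1
        for word, tag in sent:
            bigrams[(prev, tag)] += 1
            tags[tag] += 1
            emits[(tag, word)] += 1
            prev = tag

    def p_trans(tag, prev):   # P(tag | prev) = C(prev, tag) / C(prev)
        return bigrams[(prev, tag)] / tags[prev]

    def p_emit(word, tag):    # P(word | tag) = C(tag, word) / C(tag)
        return emits[(tag, word)] / tags[tag]

    return p_trans, p_emit

# Toy usage: p_trans("NN", "DT") estimates P(NN|DT), p_emit("is", "VBZ") estimates P(is|VBZ)
corpus = [[("That", "DT"), ("flight", "NN")],
          [("The", "DT"), ("yellow", "JJ"), ("hat", "NN")]]
p_trans, p_emit = estimate_probs(corpus)
print(p_trans("NN", "DT"), p_trans("JJ", "DT"))   # 0.5 0.5 on this tiny corpus
```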
Example: The Verb “race”
 Secretariat/NNP is/VBZ expected/VBN to/TO
race/VB tomorrow/NR
 People/NNS continue/VB to/TO inquire/VB
the/DT reason/NN for/IN the/DT race/NN
for/IN outer/JJ space/NN
 How do we pick the right tag?
Disambiguating “race”
[Figures: the two candidate taggings of the example sentence, one with race/VB and one with race/NN]
Example
 P(NN|TO) = .00047
 P(VB|TO) = .83
 P(race|NN) = .00057
 P(race|VB) = .00012
 P(NR|VB) = .0027
 P(NR|NN) = .0012
 P(VB|TO) P(NR|VB) P(race|VB) = .00000027
 P(NN|TO) P(NR|NN) P(race|NN) = .00000000032
 So we (correctly) choose the verb tag for “race”
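A quick arithmetic check of the two products, using the probabilities listed above:

```python
# Values copied from the slide
p_vb_given_to, p_nn_given_to = .83, .00047
p_race_given_vb, p_race_given_nn = .00012, .00057
p_nr_given_vb, p_nr_given_nn = .0027, .0012

verb_path = p_vb_given_to * p_nr_given_vb * p_race_given_vb   # ~ .00000027
noun_path = p_nn_given_to * p_nr_given_nn * p_race_given_nn   # ~ .00000000032
print("VB" if verb_path > noun_path else "NN")                # VB wins, as on the slide
```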
Question
 If there are 30 or so tags in the Penn set
 And the average sentence is around 20
words...
 How many tag sequences do we have to
enumerate to argmax over in the worst
case scenario?
30^20
Hidden Markov Models
 Remember FSAs?
 HMMs are a special kind that use probabilities
with the transitions
 Minimum edit distance?
 Viterbi and Forward algorithms
 Dynamic programming?
 Efficient means of finding most likely path
Hidden Markov Models
 We can represent our race tagging
example as an HMM.
 This is a kind of generative model.
 There is a hidden underlying generator of
observable events
 The hidden generator can be modeled as a
network of states and transitions
 We want to infer the underlying state
sequence given the observed event sequence
Hidden Markov Models
 States Q = q1, q2…qN;
 Observations O= o1, o2…oN;
 Each observation is a symbol from a vocabulary V = {v1,v2,…vV}
 Transition probabilities
 Transition probability matrix A = {aij}
 Observation likelihoods
 Vectors of probabilities associated with the states
 Special initial probability vector π
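As a concrete sketch, here is one way to hold those pieces in Python, using the hot/cold ice-cream HMM from the next slides. The .2, .3, .4, and .5 values below are the ones the later slides use; the remaining numbers follow the usual textbook figure and should be treated as assumptions here:

```python
# Q: hidden states, V: observation vocabulary
Q = ["HOT", "COLD"]
V = [1, 2, 3]                     # numbers of ice creams eaten

# pi: special initial probability vector
pi = {"HOT": 0.8, "COLD": 0.2}

# A: transition probability matrix, A[i][j] = P(j | i)
A = {"HOT":  {"HOT": 0.7, "COLD": 0.3},
     "COLD": {"HOT": 0.4, "COLD": 0.6}}

# B: observation likelihoods, one vector per state, B[state][obs] = P(obs | state)
B = {"HOT":  {1: 0.2, 2: 0.4, 3: 0.4},
     "COLD": {1: 0.5, 2: 0.4, 3: 0.1}}
```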
HMMs for Ice Cream
 You are a climatologist in the year 2799
studying global warming
 You can’t find any records of the weather
in Baltimore for summer of 2007
 But you find Jason Eisner’s diary which
lists how many ice-creams Jason ate every
day that summer
 Your job: figure out how hot it was each
day
Eisner Task
 Given
 Ice Cream Observation Sequence:
1,2,3,2,2,2,3…
 Produce:
 Hidden Weather Sequence:
H,C,H,H,H,C, C…
HMM for Ice Cream
Ice Cream HMM
 Let’s just do 1 3 1 as the sequence
 How many underlying state (hot/cold) sequences are there?
HHH
HHC
HCH
HCC
CCC
CCH
CHC
CHH
 How do you pick the right one?
Argmax P(sequence | 1 3 1)
Ice Cream HMM
Let’s just do 1 sequence: CHC
Cold as the initial state: P(Cold|Start) = .2
Observing a 1 on a cold day: P(1|Cold) = .5
Hot as the next state: P(Hot|Cold) = .4
Observing a 3 on a hot day: P(3|Hot) = .4
Cold as the next state: P(Cold|Hot) = .3
Observing a 1 on a cold day: P(1|Cold) = .5
P(CHC, 1 3 1) = .2 × .5 × .4 × .4 × .3 × .5 = .0024
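For a sequence this short we can simply enumerate all eight hidden sequences and take the argmax, which is the brute-force version of the question above. A sketch (same parameters as before; the values not shown on the slides are assumed from the textbook figure):

```python
from itertools import product

pi = {"H": 0.8, "C": 0.2}
A = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

obs = [1, 3, 1]
scores = {}
for seq in product("HC", repeat=len(obs)):
    # Joint probability P(states, observations): start prob times alternating
    # emission and transition probabilities along the sequence
    p = pi[seq[0]] * B[seq[0]][obs[0]]
    for prev, state, o in zip(seq, seq[1:], obs[1:]):
        p *= A[prev][state] * B[state][o]
    scores["".join(seq)] = p

print(scores["CHC"])                  # ~0.0024, matching the slide
print(max(scores, key=scores.get))    # the argmax over all eight sequences
```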
POS Transition Probabilities
Observation Likelihoods
Decoding
 OK, now we have a complete model that can give us what we need. Recall that we need the tag sequence that maximizes ∏ P(wi|ti) P(ti|ti-1)
 We could just enumerate all paths given the input and
use the model to assign probabilities to each.
 Not a good idea.
 Luckily dynamic programming (last seen in Ch. 3 with minimum
edit distance) helps us here
Intuition
 Consider a state sequence (tag sequence)
that ends at state j with a particular tag T.
 The probability of that tag sequence can
be broken into two parts
 The probability of the BEST tag sequence up
through j-1
 Multiplied by the transition probability from
the tag at the end of the j-1 sequence to T.
 And the observation probability of the word
given tag T.
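That intuition is the Viterbi recurrence. In the usual notation (v for the trellis of best-path probabilities, a for transitions, b for observation likelihoods):

```latex
v_t(j) = \max_{i=1}^{N} \; v_{t-1}(i) \, a_{ij} \, b_j(o_t)
```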
The Viterbi Algorithm
Viterbi Summary
 Create an array
 With columns corresponding to inputs
 Rows corresponding to possible states
 Sweep through the array in one pass, filling the columns left to right using our transition probs and observation probs
 The dynamic programming key is that we need only store the MAX prob path to each cell (not all paths)
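A compact sketch of that sweep (a minimal implementation, not the textbook's pseudocode verbatim; the ice-cream numbers reuse the slide's values plus the assumed ones from the textbook figure):

```python
def viterbi(obs, states, pi, A, B):
    """Most likely hidden state sequence for obs.
    pi[s]: initial prob, A[r][s]: transition prob r->s, B[s][o]: observation prob."""
    # Each cell holds (max prob of any path ending in this state, backpointer)
    trellis = [{s: (pi[s] * B[s][obs[0]], None) for s in states}]
    for o in obs[1:]:
        prev, col = trellis[-1], {}
        for s in states:
            # Keep only the best path into s (the dynamic programming step)
            r_best, p_best = max(((r, prev[r][0] * A[r][s]) for r in states),
                                 key=lambda x: x[1])
            col[s] = (p_best * B[s][o], r_best)
        trellis.append(col)
    # Trace the backpointers from the best final state
    best = max(trellis[-1], key=lambda s: trellis[-1][s][0])
    path = [best]
    for col in reversed(trellis[1:]):
        path.append(col[path[-1]][1])
    return path[::-1]

states = ["H", "C"]
pi = {"H": 0.8, "C": 0.2}
A = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}
print(viterbi([1, 3, 1], states, pi, A, B))   # ['H', 'H', 'C'] with these numbers
```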
Evaluation
 So once you have your POS tagger running, how do you evaluate it?
 Overall error rate with respect to a gold-standard test set
 With respect to a baseline
 Error rates on particular tags
 Error rates on particular words
 Tag confusions...
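A minimal evaluation sketch along those lines (hypothetical `gold` and `predicted` tag lists, aligned token by token; not from the slides):

```python
from collections import Counter

def evaluate(gold, predicted):
    """Token accuracy plus counts of (gold tag, predicted tag) confusions."""
    confusions = Counter((g, p) for g, p in zip(gold, predicted) if g != p)
    accuracy = 1 - sum(confusions.values()) / len(gold)
    return accuracy, confusions

gold      = ["DT", "NN", "VBZ", "VBN", "TO", "VB", "NR"]
predicted = ["DT", "NN", "VBZ", "JJ",  "TO", "VB", "NR"]
acc, conf = evaluate(gold, predicted)
print(acc)                    # 0.857... (one error in seven tokens)
print(conf.most_common(3))    # the most frequent tag confusions, e.g. (('VBN', 'JJ'), 1)
```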
Error Analysis
 Look at a confusion matrix
 See what errors are causing problems
 Noun (NN) vs ProperNoun (NNP) vs Adj (JJ)
 Preterite (VBD) vs Participle (VBN) vs Adjective (JJ)
Evaluation
 The result is compared with a manually
coded “Gold Standard”
 Typically accuracy reaches 96-97%
 This may be compared with result for a
baseline tagger (one that uses no context).
 Important: 100% is impossible even for
human annotators.
 Issues with manually coded gold
standards
Summary
 Parts of speech
 Tagsets
 Part of speech tagging
 HMM Tagging
 Markov Chains
 Hidden Markov Models
 Viterbi decoding
Review
 Exam readings
 Chapters 1 to 6
 Chapter 2
 Chapter 3
 Skip 3.4.1, 3.10, 3.12
 Chapter 4
 Skip 4.7, 4.8-4.11
 Chapter 5
 Skip 5.5.4, 5.6, 5.8-5.10
3 Formalisms
 Regular expressions describe languages (sets of
strings)
 Turns out that there are 3 formalisms for
capturing such languages, each with their own
motivation and history
 Regular expressions
 Compact textual strings
 Perfect for specifying patterns in programs or command-lines
 Finite state automata
 Graphs
 Regular grammars
 Rules
Regular expressions
 Anchor expressions
 ^, $, \b
 Counters
 *, +, ?
 Single character expressions
 ., [ ], [ - ]
 Grouping for precedence
 ()
 [dog]* vs. (dog)*
 No need to memorize shortcuts
 \d, \s
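A few of these pieces illustrated with Python's `re` module, including the `[dog]*` vs. `(dog)*` contrast from the slide:

```python
import re

# A character class repeats single characters; a group repeats the whole string
print(bool(re.fullmatch(r"[dog]*", "oddgod")))   # True: any mix of d, o, g
print(bool(re.fullmatch(r"(dog)*", "oddgod")))   # False: must be whole "dog" units
print(bool(re.fullmatch(r"(dog)*", "dogdog")))   # True

# Anchors and single-character expressions
print(re.findall(r"\bc[aeiou]t\b", "cat cot coat ct"))   # ['cat', 'cot']

# Shortcuts with counters
print(re.findall(r"\d+", "room 101, floor 3"))           # ['101', '3']
```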
FSAs
 Components of an FSA
 Know how to read one and draw one
 Deterministic vs. non-deterministic
 How is success/failure different?
 Relative power
 Recognition vs. generation
 How do we implement FSAs for
recognition?
More Formally
 You can specify an FSA by enumerating
the following things.
 The set of states: Q
 A finite alphabet: Σ
 A start state
 A set of accept states
 A transition function that maps Q × Σ to Q
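A minimal recognizer built from those pieces (a sketch; the machine shown accepts the sheeptalk language baa+! used as the chapter's running example, and the set of states Q is left implicit in the transition table):

```python
def accepts(word, alphabet, start, accept, delta):
    """Run a deterministic FSA; delta maps (state, symbol) -> state."""
    q = start
    for ch in word:
        if ch not in alphabet or (q, ch) not in delta:
            return False                 # no legal move: the machine rejects
        q = delta[(q, ch)]
    return q in accept                   # succeed only if we end in an accept state

# Sheeptalk: b a a+ !
alphabet = {"b", "a", "!"}
start, accept = 0, {4}
delta = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}

print(accepts("baa!", alphabet, start, accept, delta))     # True
print(accepts("baaaa!", alphabet, start, accept, delta))   # True
print(accepts("ba!", alphabet, start, accept, delta))      # False
```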
FSTs
 Components of an FST
 Inputs and outputs
 Relations
Morphology
 What is a morpheme?
 Stems and affixes
 Inflectional vs. derivational
 Fuzzy -> fuzziness
 Fuzzy -> fuzzier
 Application of derivation rules
 N -> V with –ize
 System, chair
 Regular vs. irregular
Derivational Rules
Lexicons
 So the big picture is to store a lexicon (a list of words you care about) as an FSA. The base lexicon is embedded in a larger automaton that captures the inflectional and derivational morphology of the language.
 So what? Well, the simplest thing you can
do with such an FSA is spell checking
 If the machine rejects, the word isn’t in the language
 Without listing every form of every word
Next Time
 Three tasks for HMMs
 Decoding
 Viterbi algorithm
 Assigning probabilities to inputs
 Forward algorithm
 Finding parameters for a model
 EM