Review: Hidden Markov Models
[Figure: example four-state HMM (S1–S4) with transition probabilities and emission probabilities over symbols A and C]
• Efficient dynamic programming algorithms exist for
– Finding Pr(S)
– The highest-probability path P that maximizes Pr(S, P) (Viterbi)
• Training the model
– State sequence known: MLE + smoothing
– Otherwise: Baum-Welch algorithm
HMM for Segmentation
• Simplest Model: One state per entity type
HMM Learning
• Manually pick the HMM's graph (e.g., simple model, fully connected)
• Learn transition probabilities: Pr(si|sj)
• Learn emission probabilities: Pr(w|si)
Learning model parameters
• When training data defines a unique path through the HMM
– Transition probabilities
• Pr(transition from state i to state j) = (number of transitions from i to j) / (total number of transitions from state i)
– Emission probabilities
• Pr(emitting symbol k from state i) = (number of times k is generated from i) / (total number of symbols emitted from state i)
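A minimal Python sketch of these counting estimates, assuming labeled (word, state) sequences; the function name and the add-one style smoothing constant are illustrative, not from the slides:

from collections import defaultdict

def train_hmm(tagged_seqs, states, vocab, smooth=1.0):
    """MLE estimates of HMM parameters from sequences of (word, state) pairs,
    with a simple add-one style smoothing (just one possible smoothing scheme)."""
    trans = defaultdict(float)        # count of transitions state i -> state j
    emit = defaultdict(float)         # count of state i emitting word w
    from_state = defaultdict(float)   # total transitions out of state i
    for seq in tagged_seqs:
        for (w, s) in seq:
            emit[(s, w)] += 1.0
        for (_, s_i), (_, s_j) in zip(seq, seq[1:]):
            trans[(s_i, s_j)] += 1.0
            from_state[s_i] += 1.0
    # Pr(j | i) = (# transitions i -> j + smooth) / (total transitions from i + smooth * |states|)
    pr_trans = {(i, j): (trans[(i, j)] + smooth) / (from_state[i] + smooth * len(states))
                for i in states for j in states}
    # Pr(w | i) = (# times w generated from i + smooth) / (total emissions from i + smooth * |vocab|)
    emitted = defaultdict(float)
    for (s, w), c in emit.items():
        emitted[s] += c
    pr_emit = {(s, w): (emit[(s, w)] + smooth) / (emitted[s] + smooth * len(vocab))
               for s in states for w in vocab}
    return pr_trans, pr_emit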
What is a “symbol” ???
Cohen => “Cohen”, “cohen”, “Xxxxx”, “Xx”, … ?
4601 => “4601”, “9999”, “9+”, “number”, … ?
[Figure: taxonomy of symbol abstraction levels – All → Numbers (3-digit 000..999, 5-digit 00000..99999, 0..99, 0000..9999, …), Words (multi-letter aa.., single chars A..z), Delimiters (. , / - + ? #), Others]
Datamold: choose best abstraction level using holdout set
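A sketch of one way a token's abstraction levels could be generated; the specific pattern names and levels below are illustrative, and DataMold's actual abstraction lattice and holdout-set selection are not reproduced here:

import re

def abstractions(token):
    """Return the token at several abstraction levels, from most to least specific."""
    levels = [token, token.lower()]
    if token.isdigit():
        levels += [f"{len(token)}-digit-number", "number"]
    else:
        # shape pattern: capitals -> X, lowercase -> x, digits -> 9
        shape = re.sub(r"[A-Z]", "X", re.sub(r"[a-z]", "x", re.sub(r"[0-9]", "9", token)))
        levels.append(shape)                              # "Cohen" -> "Xxxxx"
        levels.append(re.sub(r"(.)\1+", r"\1+", shape))   # collapse runs: "Xx+"
    return levels

# abstractions("Cohen") -> ["Cohen", "cohen", "Xxxxx", "Xx+"]
# abstractions("4601")  -> ["4601", "4601", "4-digit-number", "number"]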
What is a symbol?
We can extend the HMM model so that each state generates
multiple “features” – but they should be independent.
identity of word
ends in “-ski”
is capitalized
is part of a noun phrase
is in a list of city names
is under node X in WordNet
is in bold font
is indented
is in hyperlink anchor
…
[Figure: HMM lattice with states S(t-1), S(t), S(t+1) and observations O(t-1), O(t), O(t+1); each observation is annotated with features such as is "Wisniewski", ends in "-ski", part of noun phrase]
Ideally we would like to use many, arbitrary, overlapping
features of words.
Borthwick et al. solution
Instead of an HMM, classify each token. Don't learn transition probabilities; instead, constrain them at test time.
[Figure: the same feature list and HMM lattice as above]
We could use YFCL (your favorite classifier learner): an SVM, logistic regression, a decision tree, …. We'll be talking about logistic regression.
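A minimal sketch of the per-token setup, with features drawn from the list above; the feature names and the city-name gazetteer are illustrative, and each token simply becomes one training example for whatever classifier you pick:

def token_features(tokens, i, city_names=frozenset()):
    """Binary features for classifying token i on its own (no transition
    probabilities are learned in this setup)."""
    w = tokens[i]
    return {
        f"word={w.lower()}": 1,
        "ends-in-ski": int(w.endswith("ski")),
        "is-capitalized": int(w[:1].isupper()),
        "in-city-list": int(w in city_names),   # gazetteer lookup (assumed available)
        f"prev-word={tokens[i - 1].lower() if i > 0 else '<s>'}": 1,
    }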
Stupid HMM tricks
[Figure: degenerate HMM – a start state goes to state "red" with Pr(red) and to state "green" with Pr(green); each state has a self-loop with Pr(red|red) = 1 and Pr(green|green) = 1]
Pr(y|x) = Pr(x|y) · Pr(y) / Pr(x)
argmax_y Pr(y|x) = argmax_y Pr(x|y) · Pr(y)
                 = argmax_y Pr(y) · Pr(x1|y) · Pr(x2|y) · … · Pr(xm|y)
Pr("I voted for Ralph Nader" | ggggg) =
  Pr(g) · Pr(I|g) · Pr(voted|g) · Pr(for|g) · Pr(Ralph|g) · Pr(Nader|g)
From NB to Maxent
Pr(y|x) = (1/Z) Pr(y) ∏_j Pr(w_k|y)          where w_k is word j in x
        = exp( λ_0 + Σ_i λ_i f_i(x, y) )
f_{j,k}(doc, y) = [word k appears at position j of a doc of class y ? 1 : 0]
f_i(doc, y) = the i-th (j, k) combination
λ_i = log Pr(w_k|y)
λ_0 = log( Pr(y) / Z )
From NB to Maxent
Pr(y|x) = (1/Z) Pr(y) ∏_j Pr(w_k|y) = exp( λ_0 + Σ_i λ_i f_i(x, y) ),   where w_k is word j in x
Or:  log Pr(y|x) = λ_0 + Σ_i λ_i f_i(x, y)
Idea: keep the same functional form as naïve Bayes, but pick the parameters to optimize performance on training data.
One possible definition of performance is the conditional log likelihood of the data:
  Σ_t log Pr(y^t | x^t)
MaxEnt Comments
– Implementation:
• All methods are iterative
• For NLP-like problems with many features, modern gradient-like or Newton-like methods work well
• Thursday I'll derive the gradient for CRFs
– Smoothing:
• Typically maxent will overfit the data if there are many infrequent features.
• Old-school solutions: discard low-count features; early stopping with a holdout set; …
• Modern solutions: penalize large parameter values with a prior centered on zero to limit the size of the alphas (i.e., optimize log likelihood minus a penalty on the alphas); other regularization techniques
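For concreteness, a small numeric sketch of the penalized objective for the binary case (NumPy); the squared penalty and its weight are an assumption about which zero-centered prior is used:

import numpy as np

def penalized_cll(lam, X, y, l2=1.0):
    """Conditional log likelihood sum_t log P(y_t | x_t) minus an L2 penalty on the
    parameters (a zero-centered prior), for binary labels y in {0, 1}."""
    z = X @ lam                          # lambda . f(x) for each example
    log_p1 = -np.logaddexp(0.0, -z)      # log sigmoid(z)     = log P(y=1 | x)
    log_p0 = -np.logaddexp(0.0, z)       # log (1 - sigmoid)  = log P(y=0 | x)
    cll = np.sum(y * log_p1 + (1 - y) * log_p0)
    return cll - l2 * np.sum(lam ** 2)

# Training maximizes this objective with an iterative gradient- or Newton-like method.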
What is a symbol?
Ideally we would like to use many, arbitrary, overlapping
features of words.
[Figure: the same feature list and HMM lattice as above]
Borthwick et al idea
[Figure: the same feature list and lattice as above]
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations:
Pr(s_t | x_t) = …
Another idea….
[Figure: the same feature list and lattice as above]
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state:
Pr(s_t | x_t, s_{t-1}) = …
MaxEnt taggers and MEMMs
Learning does not change – you've just added a few additional features that are the previous labels.
Classification is trickier – we don't know the previous-label features at test time – so we will need to search for the best sequence of labels (like for an HMM).
[Figure: the same feature list and lattice as above, now including previous-label features]
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state history:
Pr(s_t | x_t, s_{t-1}, s_{t-2}, …) = …
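A sketch of how the feature function changes once previous labels are added (feature names are illustrative); note that prev_label can be read off the training data, but at test time it has to come from the search over label sequences:

def memm_features(tokens, i, prev_label):
    """Per-token features as before, plus the previous label as an extra feature."""
    w = tokens[i]
    return {
        f"word={w.lower()}": 1,
        "ends-in-ski": int(w.endswith("ski")),
        "is-capitalized": int(w[:1].isupper()),
        f"prev-label={prev_label}": 1,   # the only new kind of feature
    }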
Partial history of the idea
• Sliding-window classifiers
– Sejnowski's NETtalk, mid-1980s
• Recurrent neural networks and other "recurrent" sliding-window classifiers
– Late 1980s and 1990s
• Ratnaparkhi's thesis
– Mid-to-late 1990s
• Freitag, McCallum & Pereira, ICML 2000
– Formalized the notion of the MEMM
• OpenNLP
– Based largely on MaxEnt taggers, Apache open source
Ratnaparkhi’s MXPOST
• Sequential learning problem:
predict POS tags of words.
• Uses MaxEnt model
described above.
• Rich feature set.
• To smooth, discard features
occurring < 10 times.
MXPOST
MXPOST: learning & inference
GIS
Feature selection
Using the HMM to segment
• Find the highest probability path through the HMM.
• Viterbi: quadratic dynamic programming algorithm
Example input: 15213 Butler Highway Greenville 21578
[Figure: Viterbi lattice over this input, with candidate states House, Road, City, Pin at each position]
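A minimal Viterbi sketch over the HMM parameters estimated earlier; it assumes smoothed (nonzero) probability tables keyed as in that earlier sketch:

import math

def viterbi(words, states, pr_start, pr_trans, pr_emit):
    """Highest-probability state path through the HMM, in O(len(words) * |states|^2).
    pr_start[s], pr_trans[(i, j)] and pr_emit[(s, w)] must all be nonzero."""
    # best[s] = (log probability of the best path ending in state s, that path)
    best = {s: (math.log(pr_start[s] * pr_emit[(s, words[0])]), [s]) for s in states}
    for w in words[1:]:
        best = {
            s: max(
                (lp + math.log(pr_trans[(prev, s)] * pr_emit[(s, w)]), path + [s])
                for prev, (lp, path) in best.items()
            )
            for s in states
        }
    return max(best.values())[1]

# e.g. viterbi("15213 Butler Highway Greenville 21578".split(),
#              ["House", "Road", "City", "Pin"], pr_start, pr_trans, pr_emit)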
Inference for MENE
(Borthwick et al system)
[Figure: B/I/O tag lattice over the words "When will prof Cohen post the notes …"]
Goal: the best legal path through the lattice (i.e., the path that runs through the most black ink). Like Viterbi, but the costs of the possible transitions are ignored.
Inference for MXPOST
[Figure: B/I/O tag lattice over the words "When will prof Cohen post the notes …"]
Pr(y|x) = ∏_i Pr(y_i | x, y_1, …, y_{i-1})
        = ∏_i Pr(y_i | x, y_{i-k}, …, y_{i-1})      (window of k tags)
        = ∏_i Pr(y_i | x, y_{i-1})                   (here k = 1)
(Approximate view: find the best path; the weights are now on arcs from state to state.)
Inference for MXPOST
[Figure: B/I/O tag lattice over the words "When will prof Cohen post the notes …"]
More accurately: find the total flow to each node; the weights are now on arcs from state to state:
  α_{t+1}(y) = Σ_{y'} α_t(y') · Pr(Y_{t+1} = y | x, Y_t = y')
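A sketch of that flow recursion, where local_prob(word, prev_state, state) is a stand-in for the trained maxent model (this is not the actual MXPOST code):

def total_flow(tokens, states, local_prob, start="<s>"):
    """alpha_{t+1}(y) = sum over y' of alpha_t(y') * Pr(Y_{t+1} = y | x, Y_t = y')."""
    alpha = {y: local_prob(tokens[0], start, y) for y in states}   # flow into each state at t = 1
    for w in tokens[1:]:
        alpha = {y: sum(alpha[yp] * local_prob(w, yp, y) for yp in states)
                 for y in states}
    return alpha   # alpha[y] = total flow reaching state y at the final position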
Inference for MXPOST
[Figure: B/I/O tag lattice over the words "When will prof Cohen post the notes …"]
Pr(y|x) = ∏_i Pr(y_i | x, y_1, …, y_{i-1})
        = ∏_i Pr(y_i | x, y_{i-k}, …, y_{i-1})
        = ∏_i Pr(y_i | x, y_{i-2}, y_{i-1})          (here k = 2)
Find the best path? The best tree? The weights are now on hyperedges.
Inference for MxPOST
[Figure: lattice over pairs of adjacent tags (iI, oI, iO, oO, …) for the same sentence]
Beam search is an alternative to Viterbi: at each stage, find all children, score them, and discard all but the top n states.
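A sketch of that beam search, again with local_prob(word, prev_tag, tag) standing in for the trained per-token model; the beam width n and the start symbol are illustrative:

import math

def beam_search(tokens, states, local_prob, n=5, start="<s>"):
    """At each stage, expand every kept hypothesis with every tag, score the
    children, and discard all but the top n."""
    beam = [(0.0, [start])]                   # (log score, partial tag sequence)
    for w in tokens:
        children = [
            (score + math.log(local_prob(w, path[-1], y)), path + [y])
            for score, path in beam
            for y in states
        ]
        beam = sorted(children, reverse=True)[:n]   # keep only the top n
    return beam[0][1][1:]                     # best tag sequence, without the start symbol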
MXPOST results
• State-of-the-art accuracy (for 1996)
• The same approach was used successfully for several other sequential classification steps of a stochastic parser (also state of the art).
• Same (or similar) approaches used for NER by
Borthwick, Malouf, Manning, and others.
MEMMs
• Basic difference from ME tagging:
– ME tagging: the previous state is a feature of the MaxEnt classifier
– MEMM: build a separate MaxEnt classifier for each state.
• Can build any HMM architecture you want; e.g., parallel nested HMMs, etc.
• Data is fragmented: examples where the previous tag is "proper noun" give no information about learning tags when the previous tag is "noun"
– Mostly a difference in viewpoint
– MEMM does allow the possibility of "hidden" states and Baum-Welch-like training
MEMM task: FAQ parsing
MEMM features
MEMMs
Looking forward
• HMMs
– Easy to train generative model
– Features for a state must be independent (-)
• MaxEnt tagger/MEMM
– Multiple cascaded classifiers
– Features can be arbitrary (+)
– Have we given anything up?
HMM inference
• Total probability of transitions out of a state must sum to 1
• But …they can all lead to “unlikely” states
• So…. a state can be a (probable) “dead end” in the lattice
[Figure: the same lattice over "15213 Butler Highway Greenville 21578", with candidate states House, Road, City, Pin at each position]
Inference for MXPOST
[Figure: B/I/O tag lattice over the words "When will prof Cohen post the notes …"]
  α_{t+1}(y) = Σ_{y'} α_t(y') · Pr(Y_{t+1} = y | x, Y_t = y')
The flow out of each node is always fixed:
  for all y':  Σ_y Pr(Y_{t+1} = y | x, Y_t = y') = 1
More accurately: find the total flow to each node; the weights are now on arcs from state to state.
Label Bias Problem
(Lafferty, McCallum, Pereira ICML 2001)
• Consider this MEMM, and enough training data to perfectly model it:
[Figure: MEMM with start state 0; the path 0→1→2→3 reads "rib" and the path 0→4→5→3 reads "rob"]
Pr(0123|rib) = 1,   Pr(0453|rob) = 1
Pr(0123|rob) = Pr(1|0,r)/Z1 · Pr(2|1,o)/Z2 · Pr(3|2,b)/Z3 = 0.5 · 1 · 1
Pr(0453|rib) = Pr(4|0,r)/Z1' · Pr(5|4,i)/Z2' · Pr(3|5,b)/Z3' = 0.5 · 1 · 1
Because each state's outgoing probabilities must sum to one, states 1 and 4 pass all of their mass to their single successors regardless of the observation, so the path probabilities do not depend on whether the word is "rib" or "rob".
Another max-flow scheme
[Figure: B/I/O tag lattice over the words "When will prof Cohen post the notes …"]
  α_{t+1}(y) = Σ_{y'} α_t(y') · Pr(Y_{t+1} = y | x, Y_t = y')
The flow out of a node is always fixed:
  for all y':  Σ_y Pr(Y_{t+1} = y | x, Y_t = y') = 1
More accurately: find the total flow to each node; the weights are now on arcs from state to state.
Another max-flow scheme: MRFs
[Figure: B/I/O tag lattice over the words "When will prof Cohen post the notes …"]
Goal is to learn how to weight edges in the graph:
• weight(y_i, y_{i+1}) = 2·[(y_i = B or I) and isCap(x_i)] + 1·[y_i = B and isFirstName(x_i)] − 5·[y_{i+1} ≠ B and isLower(x_i) and isUpper(x_{i+1})]
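A sketch of that hand-written weighting as a function; the first_names gazetteer is assumed, and isUpper/isLower are read as simple capitalization tests (the learning problem is to set these coefficients automatically from features of the examples):

def edge_weight(y_i, y_next, x_i, x_next, first_names=frozenset()):
    """Score the edge between tag y_i on word x_i and tag y_next on word x_next."""
    w = 0.0
    if y_i in ("B", "I") and x_i[:1].isupper():          # 2 * [(y_i = B or I) and isCap(x_i)]
        w += 2.0
    if y_i == "B" and x_i in first_names:                # 1 * [y_i = B and isFirstName(x_i)]
        w += 1.0
    if y_next != "B" and x_i.islower() and x_next[:1].isupper():
        w -= 5.0                                         # -5 * [y_{i+1} != B, isLower(x_i), isUpper(x_{i+1})]
    return w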
Another max-flow scheme: MRFs
[Figure: B/I/O tag lattice over the words "When will prof Cohen post the notes …"]
Find the total flow to each node; the weights are now on edges from state to state. The goal is to learn how to weight the edges in the graph, given features from the examples.
Another view of label bias
[Sha & Pereira]
So what’s the alternative?