
Simple syntax: Hidden Markov Models (HMMs)
Computational Cognitive Science 2011, Lecture 13

Last time: Hidden Markov Models
[Diagram: a chain of hidden states …, Xt-1, Xt, Xt+1, … connected by transition arcs, each state emitting an observation yt-1, yt, yt+1. Each x is a state s; each y is an observation o.]
Note: It is possible (and equivalent) to define an HMM in which the output distribution B is dependent on an arc (defined by the states at either end) rather than a single state. This variant is not as often used in NLP, mainly because it is so natural to interpret states as parts of speech. I also find it far more counterintuitive. However, the Manning & Schütze book (one of the optional readings) defines it this way.

A non-linguistic example
States S = {asleep, calm, angry, hungry}
Outputs O = {roar, zzz, snort, grumble}

Output symbol matrix B:
          roar   zzz    snort  grumble
  asleep  0.0    0.9    0.1    0.0
  calm    0.0    0.0    0.8    0.2
  angry   1.0    0.0    0.0    0.0
  hungry  0.2    0.0    0.0    0.8

State transition matrix A:
          asleep  calm   angry  hungry
  asleep  0.5     0.2    0.1    0.2
  calm    0.4     0.3    0.1    0.2
  angry   0.1     0.2    0.6    0.1
  hungry  0.1     0.1    0.5    0.3

Initial state probabilities Π:
  asleep  calm   angry  hungry
  0.3     0.3    0.2    0.2
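The tables above translate directly into code. Here is a minimal sketch in Python (the dictionary layout and variable names are my own, not from the lecture):

```python
# Mitee the warrior: HMM parameters transcribed from the tables above.
states = ["asleep", "calm", "angry", "hungry"]
outputs = ["roar", "zzz", "snort", "grumble"]

# State transition matrix A: A[i][j] = P(next state is j | current state is i)
A = {"asleep": {"asleep": 0.5, "calm": 0.2, "angry": 0.1, "hungry": 0.2},
     "calm":   {"asleep": 0.4, "calm": 0.3, "angry": 0.1, "hungry": 0.2},
     "angry":  {"asleep": 0.1, "calm": 0.2, "angry": 0.6, "hungry": 0.1},
     "hungry": {"asleep": 0.1, "calm": 0.1, "angry": 0.5, "hungry": 0.3}}

# Output symbol matrix B: B[i][o] = P(observation o | state i)
B = {"asleep": {"roar": 0.0, "zzz": 0.9, "snort": 0.1, "grumble": 0.0},
     "calm":   {"roar": 0.0, "zzz": 0.0, "snort": 0.8, "grumble": 0.2},
     "angry":  {"roar": 1.0, "zzz": 0.0, "snort": 0.0, "grumble": 0.0},
     "hungry": {"roar": 0.2, "zzz": 0.0, "snort": 0.0, "grumble": 0.8}}

# Initial state probabilities Π
Pi = {"asleep": 0.3, "calm": 0.3, "angry": 0.2, "hungry": 0.2}

# Sanity check: Π and every row of A and B should each sum to 1.
for row in list(A.values()) + list(B.values()) + [Pi]:
    assert abs(sum(row.values()) - 1.0) < 1e-9
```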
A linguistic example
States S = {noun, verb, det, pro, adj, *}   (* is the end state symbol)
Outputs O = {boy, dog, tiger, eats, runs, the, a, it, she, he, happy}

Output symbol matrix B (first five columns shown):
        boy   dog   tiger  eats  runs  …
  noun  0.4   0.4   0.2    0.0   0.0   …
  verb  0.0   0.0   0.0    0.5   0.5   …
  det   0.0   0.0   0.0    0.0   0.0   …
  pro   0.0   0.0   0.0    0.0   0.0   …
  adj   0.0   0.0   0.0    0.0   0.0   …

State transition matrix A:
        noun  verb  det   pro   adj   *
  noun  0.0   1.0   0.0   0.0   0.0   0.0
  verb  0.0   0.0   0.0   0.0   0.0   1.0
  det   0.8   0.0   0.0   0.0   0.2   0.0
  pro   0.0   1.0   0.0   0.0   0.0   0.0
  adj   0.9   0.0   0.0   0.0   0.1   0.0

Initial state probabilities Π:
  noun  verb  det   pro   adj   *
  0.0   0.0   0.3   0.7   0.0   0.0
Three fundamental questions
1) Given a model M = (A, B, Π), how do we efficiently compute how likely a certain observation is? Forward* algorithm
2) Given a sequence of observations Y and a model M, how do we infer the state sequence that best explains the observations? Viterbi* algorithm
3) Given an observation sequence Y and a space of possible models found by varying the model parameters M = (A, B, Π), how do we find the model that best explains the observed data? Baum-Welch** algorithm
* You should be able to implement this; ** You don't need to be able to implement this

This lecture covers the forward and Viterbi algorithms in detail and the Baum-Welch algorithm in brief, and concludes by motivating the need for more complex grammars if the goal is to fully describe human language.

Computing likelihood of observations
For any output sequence Y = (y1, …, yT) we can calculate the probability of observing it by summing over all possible sequences of hidden states X = (x1, …, xT) that could have generated it:

P(Y | A, B, Π) = Σ_X P(X) P(Y | X) = Σ_{x1…xT} π(x1) b(x1, y1) ∏_{t=2..T} a(x_{t-1}, x_t) b(x_t, y_t)

Example: simple language
How likely are you to see "he eats"?
P( he eats | A,B,Π)
= P(pro|S) P(he|pro) P(verb | pro) P(eats| verb) P(E|verb)
= 0.7 * 0.3 * 1.0 * 0.5 * 1.0
= 0.105
But that was easy, because there was just one way to generate that observation.

Example: Mitee the warrior
How likely are you to see "zzz snort"?
P(zzz snort | A,B,Π)
= P(asleep|S) P(zzz|asleep) P(calm|asleep) P(snort|calm) +
  P(asleep|S) P(zzz|asleep) P(asleep|asleep) P(snort|asleep)
= (0.3)(0.9)(0.2)(0.8) + (0.3)(0.9)(0.5)(0.1)
= 0.0432 + 0.0135
= 0.0567
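This sum over paths can be carried out mechanically by enumerating every possible hidden state sequence. A brute-force sketch (parameter tables transcribed from the example above; the function name is my own):

```python
from itertools import product

states = ["asleep", "calm", "angry", "hungry"]
A = {"asleep": {"asleep": 0.5, "calm": 0.2, "angry": 0.1, "hungry": 0.2},
     "calm":   {"asleep": 0.4, "calm": 0.3, "angry": 0.1, "hungry": 0.2},
     "angry":  {"asleep": 0.1, "calm": 0.2, "angry": 0.6, "hungry": 0.1},
     "hungry": {"asleep": 0.1, "calm": 0.1, "angry": 0.5, "hungry": 0.3}}
B = {"asleep": {"roar": 0.0, "zzz": 0.9, "snort": 0.1, "grumble": 0.0},
     "calm":   {"roar": 0.0, "zzz": 0.0, "snort": 0.8, "grumble": 0.2},
     "angry":  {"roar": 1.0, "zzz": 0.0, "snort": 0.0, "grumble": 0.0},
     "hungry": {"roar": 0.2, "zzz": 0.0, "snort": 0.0, "grumble": 0.8}}
Pi = {"asleep": 0.3, "calm": 0.3, "angry": 0.2, "hungry": 0.2}

def likelihood_brute_force(obs):
    """Sum P(obs, path) over every possible hidden state sequence: N^T paths."""
    total = 0.0
    for path in product(states, repeat=len(obs)):
        p = Pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total

print(likelihood_brute_force(["zzz", "snort"]))  # ≈ 0.0567
```

Only two of the 16 length-2 paths contribute a nonzero term here (asleep→asleep and asleep→calm), matching the hand calculation.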
Computing likelihood of observations
You can see that this will grow increasingly difficult as the HMM grows larger (or there are fewer zeros in the transition matrix). Having to sum over every possible set of hidden states, in general, requires on the order of N^(T+1) multiplications, where T = the number of time steps and N = the number of states. The complexity is thus exponential: O(N^T).

Simplifying the computation
Luckily, we don't have to sum over all possible state sequences explicitly. Because of the limited horizon property, the probability of being in a state at any one point only depends on the probabilities at the previous point.

Forward algorithm
An algorithm for efficiently calculating the probability of a sequence of observations. Incremental: at each observation step, you compute the total probability of the observations up to that point, for each possible current state. Complexity is O(N²T), assuming a fully connected model: a big improvement over O(N^T).

Example: Mitee the warrior
How likely are you to see "zzz snort grumble roar"?
[Trellis diagram: one column per observation (zzz, snort, grumble, roar), one node per state; each arc is labelled with the transition probability followed by the output probability.]

Step 1 (zzz): there is only one state that outputs zzz.
  α(asleep) = P(asleep|S) P(zzz|asleep) = (0.3)(0.9) = 0.27

Step 2 (snort):
  α(asleep) = (0.27) P(asleep|asleep) P(snort|asleep) = (0.27)(0.5)(0.1) = 0.0135
  α(calm)   = (0.27) P(calm|asleep) P(snort|calm)     = (0.27)(0.2)(0.8) = 0.0432

Step 3 (grumble):
  α(asleep) = 0, since P(grumble|asleep) = 0
  α(calm)   = (0.0432)(0.3)(0.2) + (0.0135)(0.2)(0.2) = 0.00313
  α(hungry) = (0.0432)(0.2)(0.8) + (0.0135)(0.2)(0.8) = 0.00907

Step 4 (roar):
  α(hungry) = (0.00313)(0.2)(0.2) + (0.00907)(0.3)(0.2) = 0.00067
  α(angry)  = (0.00907)(0.5)(1.0) + (0.00313)(0.1)(1.0) = 0.00485

Total probability of observing this sequence: 0.00067 + 0.00485 = 0.00552

This is called the forward algorithm, because we calculated incrementally moving forward in time.

Next question: what is the likeliest path underlying "zzz snort grumble roar"?
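The incremental computation above can be sketched in a few lines of Python (parameter tables transcribed from the example; names are my own). Each trellis column alpha[t] holds, for every state s, the total probability of the observations so far ending in state s:

```python
states = ["asleep", "calm", "angry", "hungry"]
A = {"asleep": {"asleep": 0.5, "calm": 0.2, "angry": 0.1, "hungry": 0.2},
     "calm":   {"asleep": 0.4, "calm": 0.3, "angry": 0.1, "hungry": 0.2},
     "angry":  {"asleep": 0.1, "calm": 0.2, "angry": 0.6, "hungry": 0.1},
     "hungry": {"asleep": 0.1, "calm": 0.1, "angry": 0.5, "hungry": 0.3}}
B = {"asleep": {"roar": 0.0, "zzz": 0.9, "snort": 0.1, "grumble": 0.0},
     "calm":   {"roar": 0.0, "zzz": 0.0, "snort": 0.8, "grumble": 0.2},
     "angry":  {"roar": 1.0, "zzz": 0.0, "snort": 0.0, "grumble": 0.0},
     "hungry": {"roar": 0.2, "zzz": 0.0, "snort": 0.0, "grumble": 0.8}}
Pi = {"asleep": 0.3, "calm": 0.3, "angry": 0.2, "hungry": 0.2}

def forward(obs):
    """Forward algorithm: O(N^2 T) instead of the O(N^T) brute-force sum."""
    # alpha[t][s] = P(obs[0..t], state at time t is s)
    alpha = [{s: Pi[s] * B[s][obs[0]] for s in states}]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append({s: sum(prev[r] * A[r][s] for r in states) * B[s][obs[t]]
                      for s in states})
    return sum(alpha[-1].values())  # total likelihood of the observation sequence

print(forward(["zzz", "snort", "grumble", "roar"]))  # ≈ 0.00552
```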
Three fundamental questions
1) Given a model M = (A, B, Π), how do we efficiently compute how likely a certain observation is? Forward algorithm
2) Given a sequence of observations Y and a model M, how do we infer the state sequence that best explains the observations? Viterbi algorithm
Idea: what if we maximise as we go through the trellis, rather than sum over all of the states?

Viterbi algorithm
An algorithm for efficiently calculating the most likely path through an HMM, given a sequence of observations. Incremental: at each observation step, you find the most likely path until that point. Complexity is O(N²T), assuming a fully connected model: a big improvement over O(N^T).

Example: Mitee the warrior
What is the likeliest path underlying "zzz snort grumble roar"?
[Same trellis as before, but now we take the maximum over incoming arcs rather than the sum, and remember which predecessor gave the maximum.]

Step 1 (zzz): δ(asleep) = (0.3)(0.9) = 0.27

Step 2 (snort):
  δ(asleep) = (0.27)(0.5)(0.1) = 0.0135
  δ(calm)   = (0.27)(0.2)(0.8) = 0.0432

Step 3 (grumble):
  δ(calm):   (0.0432)(0.3)(0.2) = 0.00259 > (0.0135)(0.2)(0.2) = 0.00054, so δ(calm) = 0.00259, reached from calm
  δ(hungry): (0.0432)(0.2)(0.8) = 0.00691 > (0.0135)(0.2)(0.8) = 0.00216, so δ(hungry) = 0.00691, reached from calm

Step 4 (roar):
  δ(hungry): (0.00691)(0.3)(0.2) = 0.000415 > (0.00259)(0.2)(0.2) = 0.000104, so δ(hungry) = 0.000415, reached from hungry
  δ(angry):  (0.00691)(0.5)(1.0) = 0.00345 > (0.00259)(0.1)(1.0) = 0.000259, so δ(angry) = 0.00345, reached from hungry
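The same trellis with max in place of sum, plus backpointers for a final backtracking pass, gives a compact Viterbi sketch (parameter tables transcribed from the example; names are my own):

```python
states = ["asleep", "calm", "angry", "hungry"]
A = {"asleep": {"asleep": 0.5, "calm": 0.2, "angry": 0.1, "hungry": 0.2},
     "calm":   {"asleep": 0.4, "calm": 0.3, "angry": 0.1, "hungry": 0.2},
     "angry":  {"asleep": 0.1, "calm": 0.2, "angry": 0.6, "hungry": 0.1},
     "hungry": {"asleep": 0.1, "calm": 0.1, "angry": 0.5, "hungry": 0.3}}
B = {"asleep": {"roar": 0.0, "zzz": 0.9, "snort": 0.1, "grumble": 0.0},
     "calm":   {"roar": 0.0, "zzz": 0.0, "snort": 0.8, "grumble": 0.2},
     "angry":  {"roar": 1.0, "zzz": 0.0, "snort": 0.0, "grumble": 0.0},
     "hungry": {"roar": 0.2, "zzz": 0.0, "snort": 0.0, "grumble": 0.8}}
Pi = {"asleep": 0.3, "calm": 0.3, "angry": 0.2, "hungry": 0.2}

def viterbi(obs):
    """Most likely hidden state sequence: max instead of sum, then backtrack."""
    delta = [{s: Pi[s] * B[s][obs[0]] for s in states}]
    back = []
    for t in range(1, len(obs)):
        col, ptr = {}, {}
        for s in states:
            # Best predecessor for state s at time t.
            best_prev = max(states, key=lambda r: delta[-1][r] * A[r][s])
            col[s] = delta[-1][best_prev] * A[best_prev][s] * B[s][obs[t]]
            ptr[s] = best_prev
        delta.append(col)
        back.append(ptr)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[-1][last]

print(viterbi(["zzz", "snort", "grumble", "roar"]))
# best path: asleep, calm, hungry, angry (probability ≈ 0.00345)
```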
Given this, is the most likely state sequence just the one whose states are most probable at every point in time?

Here it worked out: the most likely final state is angry (0.00345), and backtracking from it gives the path asleep → calm → hungry → angry, which happens to pass through the individually most probable state at every step.

But imagine the transition and output probabilities on the final step were slightly different (the arc into hungry labelled 0.1; 0.9 instead of 0.3; 0.2). Then δ(hungry) = 0.00163 and δ(angry) = 3.45e-5. Now the most likely final state is different (hungry rather than angry), and so is the most likely path to that state. Note that this path does not include the state that was most likely previously.
In order to calculate the most likely state sequence, you need to find the maximum transition at each point, get to the end, and then backtrack through. This algorithm, finding all of the forward probabilities (maxima, not sums) and then backtracking, is called the Viterbi algorithm.

Three fundamental questions
3) Given an observation sequence Y and a space of possible models found by varying the model parameters M = (A, B, Π), how do we find the model that best explains the observed data? Baum-Welch algorithm
I'll give the main idea of how it works, but not all of the nitty-gritty detail. You don't need to be able to implement this; I just want to get you started in case you ever want to. This is also called the forward-backward algorithm.

Baum-Welch algorithm
Basic idea: this is just an EM algorithm! But instead of:
  Assignment step (E-step): calculate the likelihood of each data point in each cluster, assuming the cluster is a Gaussian with the current mean, standard deviation, and weight
  Update step (M-step): recalculate the means, the standard deviations, and the weights
we have:
  Assignment step (E-step): calculate the probability of the observation sequence given the current model (A, B, Π), using the forward algorithm
  Update step (M-step):
    Recalculate A: aij = (expected number of transitions from state i to j) / (expected number of transitions from state i)
    Recalculate B: bijk = (expected number of transitions from state i to j with k observed) / (expected number of transitions from i to j)
    Recalculate Π: πi = expected frequency in state i at time t = 1

Because it is an EM algorithm, it has the same properties:
1. Guaranteed (fairly rapid) convergence, but only to local maxima, not global maxima
2. Dependence on initial values. In practice, it is especially important to have good starting points for the output parameters B; estimates of A are fairly robust to the initial starting point

Stepping back a bit…
We have defined what a Hidden Markov Model (HMM) is, and proposed it as a better model for language than an n-gram model (i.e., a standard Markov Model).
We have seen in detail how it is possible to calculate the most probable path of hidden states in an HMM, and the probability of an observation.
We have seen in brief how it is possible to figure out (imperfectly) what the most probable set of parameters (A, B, Π) is, given a set of observations.

HMMs: A good model for language?
Not really. We still have a parameter explosion problem.

[Diagram: S → det (0.3) or pro (0.7); det → noun (0.8) or adj (0.2); adj → noun (0.9) or adj (0.1); pro → verb (1.0); noun → verb (1.0); verb → E (1.0)]

(0.5) verb → eats
(0.5) verb → runs
(0.3) pro → he
(0.3) pro → she
(0.4) pro → it
(0.7) det → the
(0.3) det → a
(0.4) noun → boy
(0.4) noun → dog
(0.2) noun → tiger
(1.0) adj → happy
Still have a parameter explosion problem
Suppose you want to make it able to produce: "The students write"
Now it also produces: "The dog write", "A students runs", "The students eats", "He write", …

(0.5) verb → eats
(0.3) verb → runs
(0.2) verb → write
(0.3) pro → he
(0.3) pro → she
(0.4) pro → it
(0.7) det → the
(0.3) det → a
(0.4) noun → boy
(0.2) noun → dog
(0.2) noun → tiger
(0.2) noun → students
(1.0) adj → happy
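One way to see the overgeneration concretely is to score sentences under this model, summing over all tag sequences. A brute-force sketch (transition structure read off the diagram earlier, word probabilities from the rules above; the function name and explicit end-state handling are my own):

```python
from itertools import product

states = ["det", "pro", "adj", "noun", "verb"]
Pi = {"det": 0.3, "pro": 0.7, "adj": 0.0, "noun": 0.0, "verb": 0.0}
A = {"det":  {"det": 0.0, "pro": 0.0, "adj": 0.2, "noun": 0.8, "verb": 0.0},
     "pro":  {"det": 0.0, "pro": 0.0, "adj": 0.0, "noun": 0.0, "verb": 1.0},
     "adj":  {"det": 0.0, "pro": 0.0, "adj": 0.1, "noun": 0.9, "verb": 0.0},
     "noun": {"det": 0.0, "pro": 0.0, "adj": 0.0, "noun": 0.0, "verb": 1.0},
     "verb": {"det": 0.0, "pro": 0.0, "adj": 0.0, "noun": 0.0, "verb": 0.0}}
A_end = {"det": 0.0, "pro": 0.0, "adj": 0.0, "noun": 0.0, "verb": 1.0}  # only verb -> E
B = {"verb": {"eats": 0.5, "runs": 0.3, "write": 0.2},
     "pro":  {"he": 0.3, "she": 0.3, "it": 0.4},
     "det":  {"the": 0.7, "a": 0.3},
     "noun": {"boy": 0.4, "dog": 0.2, "tiger": 0.2, "students": 0.2},
     "adj":  {"happy": 1.0}}

def p_sentence(words):
    """P(sentence): sum over all tag sequences, requiring the end state at the end."""
    total = 0.0
    for tags in product(states, repeat=len(words)):
        p = Pi[tags[0]] * B[tags[0]].get(words[0], 0.0)
        for t in range(1, len(words)):
            p *= A[tags[t - 1]][tags[t]] * B[tags[t]].get(words[t], 0.0)
        total += p * A_end[tags[-1]]
    return total

print(p_sentence(["the", "students", "write"]) > 0)  # the intended sentence
print(p_sentence(["the", "dog", "write"]) > 0)       # the agreement violation is produced too
```

With a single noun state, "the students write" and "the dog write" receive exactly the same probability: the model has no way to enforce number agreement.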
Still have a parameter explosion problem
As before, you have to add new states to the model to solve this problem.

[Diagram: as before, but with separate singular and plural states: nounS/nounP, proS/proP, verbS/verbP. S branches to det, proS, and proP; singular subjects lead to verbS and plural subjects to verbP; both verbs lead to E.]

(0.5) verbS → eats
(0.5) verbS → runs
(1.0) verbP → write
(0.3) proS → he
(0.3) proS → she
(0.4) proS → it
(0.7) det → the
(0.3) det → a
(0.4) nounS → boy
(0.4) nounS → dog
(0.2) nounS → tiger
(1.0) nounP → students
(1.0) adj → happy
The HMM has less of an explosion problem than a Markov model would, as we can see if we just increase the vocabulary: the model doesn't change (whereas it would if it were a Markov model).

[Diagram: same singular/plural state structure as before, unchanged.]

(0.5) verbS → eats
(0.5) verbS → runs
(0.5) verbP → write
(0.5) verbP → smile
(0.2) proS → he
(0.2) proS → she
(0.3) proS → it
(0.3) proS → you
(1.0) proP → they
(0.7) det → the
(0.2) det → a
(0.4) nounS → boy
(0.4) nounS → dog
(0.2) nounS → tiger
(0.3) nounP → students
(0.4) nounP → women
(0.3) nounP → children
(1.0) adj → happy
Still, fundamentally, the problem of accounting for long-distance dependencies without making the model super-complicated remains.

Nevertheless, people use HMMs for all sorts of things (not just in language).
Language: approximations to grammars, speech recognition, handwriting recognition, part-of-speech tagging.
Other: music, mutation rates in biology, protein structure/folding, financial system analysis.

Next time
We'll look at more complex and appropriate models of human grammar, which start to chisel away at this problem. Actually modelling human grammar tractably is one of the most difficult, state-of-the-art problems in computational linguistics, so we won't be expecting you to do it here! (Just get a sense of what the solutions kinda-sorta look like.)

Take-home points
Word-chain grammars (i.e., n-grams or Markov Models) are not good models of language: parameter explosion.
Hidden Markov Models: Markov models with hidden states (which, in language, often correspond to parts of speech like nouns or verbs).
Forward algorithm: lets you calculate the probability of an observation.
Viterbi algorithm: lets you calculate the most likely path through an HMM.
Baum-Welch algorithm: lets you figure out the most likely model (A, B, Π) given a set of observations.
Still some problems with HMMs as a proper theory of grammar (but they're getting closer).

References
Wikipedia entry on Hidden Markov Models.
Manning, C., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. Chapter 9: Markov Models.
Russell, S., & Norvig, P. (1995). Artificial Intelligence: A Modern Approach (1st edition).