Syntax 1: Hidden Markov Models (HMMs)
Computational Cognitive Science 2011, Lecture 12

The plan: Language lectures
Week 3
Lecture 8: Intro to language; phonetic category learning; k-means clustering*
Lecture 9: Phonetic category learning; extension to k-means*; mixture of Gaussians*; more complex models**
Week 4
Lecture 10-11: Word segmentation; n-gram models*; more complex model**
Lecture 12: Regular grammars; Markov chains*; HMM fundamentals*
Week 5
Lecture 13: HMMs continued; Viterbi algorithm*
Lecture 14: More complex grammars; CFGs, etc.**
Lecture 15: Other / state of the art; word learning**; language evolution**
* = I'll present this at a level of detail that you should be able to implement it
** = You should understand the basic idea and the approximate behaviour of the model, but not implement it (implementing might be a good project idea!)

Plan for today
The problem of syntax learning: word-chain grammars
- Link to Markov Models, n-grams
- Are these good models of language?
Needing to learn parts of speech
- What are parts of speech?
- How are parts of speech learned?
A (slightly) more complicated type of grammar
- Hidden Markov Models (HMMs)
- Link to learning parts of speech

The cognitive question
The problem of syntax learning

Learning syntax
Grammars describe how to arrange words in a sentence. This includes many smaller problems, and many differences across languages. There are entire courses' worth of study here; this is scratching the very tip of the surface.

Learning syntax
Fundamentally, this is about how sentences are created. The basic unit is not the word but the morpheme. A morpheme is the smallest unit in a language that conveys meaning:
teach + er,  dog + s,  baby,  un + comfort + able,  dis + lik + ing

Morphology
Languages vary in their morpheme-per-word ratio:
baby (1:1),  dog + s (2:1),  un + comfort + able (3:1)
Isolating languages have low ratios (close to 1:1); that is, each word tends to convey one unit of meaning. They tend to have very fixed word order, and to use lots of particles.
Synthetic (or, at the extreme, polysynthetic) languages have high ratios; one word can convey up to an entire sentence of meaning.

A continuum of languages
Highly isolating … Chinese, English, Japanese, Finnish, Mohawk … highly synthetic
Washakotya'tawitsherahetkvhta'se (Mohawk): "He made the thing that one puts on one's body ugly for her", i.e. "He ruined her dress"

Learning syntax
Syntax is about how morphemes are combined to make a sentence. In English, which is more isolating, this is approximated by the question of how to combine words.
✔ Pink pajamas are awesome
✗ Awesome are pink pajamas
✗ Pink are awesome pajamas
✗ Pajamas are awesome pink
✗ Awesome pink are pajamas
Learning syntax
1) What are the representations used to generate correct sentences?
2) How are they learned?

A very simple possibility: Word-chain grammars
Basic idea: words in sentences are produced (probabilistically) based on what the previous word was.
[Diagram: a word-chain grammar. From the start state S, "the" is produced with probability 1.0; "the" leads to "happy" (0.2), "dog" (0.4), or "boy" (0.4); "happy" leads to "happy" (0.1), "dog" (0.5), or "boy" (0.4); "dog" and "boy" lead to "eats" (1.0); "eats" leads to "hot dogs" (0.4), "potatoes" (0.1), or "ice cream" (0.5); each of these ends the sentence (E) with probability 1.0.]

A very simple possibility: Word-chain grammars
This is just a bigram model!
p(the|S) = 1.0
p(happy|the) = 0.2
p(dog|the) = 0.4
p(boy|the) = 0.4
p(happy|happy) = 0.1
p(dog|happy) = 0.5
p(boy|happy) = 0.4
p(eats|dog) = 1.0
p(eats|boy) = 1.0
…
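The word-chain transition table above can be sampled directly to generate sentences. Here is a minimal sketch in Python (an assumed re-implementation; the course demos themselves are MATLAB scripts), with the transition probabilities read off the diagram:

```python
import random

# Transition probabilities read off the word-chain diagram: each state
# maps to a list of (next word, probability) pairs.
CHAIN = {
    "S":         [("the", 1.0)],
    "the":       [("happy", 0.2), ("dog", 0.4), ("boy", 0.4)],
    "happy":     [("happy", 0.1), ("dog", 0.5), ("boy", 0.4)],
    "dog":       [("eats", 1.0)],
    "boy":       [("eats", 1.0)],
    "eats":      [("hot dogs", 0.4), ("potatoes", 0.1), ("ice cream", 0.5)],
    "hot dogs":  [("E", 1.0)],
    "potatoes":  [("E", 1.0)],
    "ice cream": [("E", 1.0)],
}

def generate(chain, start="S", end="E"):
    """Sample one sentence by walking the chain from start to end."""
    words, state = [], start
    while state != end:
        nxt, probs = zip(*chain[state])
        state = random.choices(nxt, weights=probs)[0]
        if state != end:
            words.append(state)
    return " ".join(words)
```

By construction, every sampled sentence starts with "the" and passes through "eats", mirroring the diagram's structure.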
A very simple possibility: Word-chain grammars
It is also a Markov Model or, equivalently, a Markov chain.
Formally: let X = (X1,…,XT) be a sequence of random variables taking values in the state space, some countable set S = {s1,…,sN}.
Xt = word at time t
S = the set of possible words {boy, happy, dog,…}
By definition, all Markov Models have the Markov Property. (Note that time doesn't have to be discrete, but for every model we will be using, it is.)

Markov Property
Limited horizon: the probability of moving into a new state depends only on the current one, not on anything earlier. (For this reason, Markov models are often called memoryless learners.) The probability of a word at time t+1 given the previous t words is the same as the probability of that word at t+1 given the single previous word.

Equivalency to n-grams?
Does limited horizon mean that Markov Models are equivalent to bigram models only, or are they more generically equivalent to any type of n-gram model?
Answer: they are (or can be) equivalent to any n-gram model, not just bigram models. Any higher-order n-gram with redescribed states is a Markov model; limited horizon requires only that the states depend on some finite number of previous states, not necessarily one.
First-order Markov chain: depends only on the previous state
Second-order Markov chain: depends on the previous two states
… etc …

Time invariance
Most Markov models (and all of the ones we will be using) are time invariant, or stationary: the transition probabilities between any two specific states will always be the same. The probability that the word at t+1 is some word k, given that the word at time t is some word m, is the same as the probability that the word at time t is k, given that the word at t-1 is m.
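The "redescribed states" trick mentioned above can be made concrete: a second-order chain over words becomes a first-order chain whose states are word pairs. A small illustrative sketch (the pair-state encoding and the toy transitions are my own example, not from the slides):

```python
# A second-order dependency: the next word depends on the previous TWO
# words. Redescribe each state as a pair of words, and the chain becomes
# first-order over pair-states.
second_order = {
    ("the", "dog"):  "eats",
    ("the", "boy"):  "eats",
    ("dog", "eats"): "ice cream",
    ("boy", "eats"): "hot dogs",
}

def redescribe(model):
    """Turn pair-conditioned transitions into a first-order chain over pair-states."""
    first_order = {}
    for (w1, w2), nxt in model.items():
        # The pair-state (w1, w2) moves to the pair-state (w2, nxt): only
        # the CURRENT state is needed, so the Markov property holds.
        first_order[(w1, w2)] = (w2, nxt)
    return first_order
```

The same construction works for any fixed history length, which is why limited horizon does not restrict us to bigrams.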
When transition probabilities change with time, this is called a time inhomogeneous Markov chain.

The cognitive question
Are Markov Models a good descriptor of human language?

The problem of long-distance dependencies
Long-distance dependencies are dependencies between different parts of a sentence, separated by many words.
[Diagram (Chomsky, 1957): a word-chain grammar over "the dog/boys at school/home like to swim/run", with a probability on each arc (e.g. to → swim (0.5) or run (0.5)). A single chain of this kind cannot enforce agreement between the subject ("dog" vs "boys") and the verb form that occurs many words later.]
The problem of long-distance dependencies
Long-distance dependencies are dependencies between different parts of a sentence, separated by many words.
[Diagram (Chomsky, 1957): the same grammar expanded so that "the dog" and "boys" lead into separate copies of the "at school/home" subchain, ending in "likes" and "like" respectively, before rejoining at "to swim/run". To capture one agreement dependency, the model must duplicate everything between the two dependent words.]
The problem of long-distance dependencies
Long-distance dependencies are dependencies between different parts of a sentence, separated by many words.
[Diagram (Chomsky, 1957): "if … then" and "either … or" constructions, e.g. "if/either the/a/one (happy) boy/girl/dog eats ice cream/hot dogs/candy then/or …". Whether the sentence must continue with "then" or with "or" depends on whether it began with "if" or "either", arbitrarily many words earlier.]
The problem of long-distance dependencies
Long-distance dependencies are dependencies between different parts of a sentence, separated by many words.
[Diagram (Chomsky, 1957): the paired dependencies shown schematically: "either" must be completed by "or", and "if" by "then".]
The problem of long-distance dependencies
This problem is not isolated to English: if anything, it is worse in less isolating languages.
[Diagram (Spanish): agreement at a distance, e.g. "las muchachas / los muchachos me dieron algunos regalos / algunas pelotas" vs. "la muchacha / el muchacho me diό un regalo / una pelota". The articles, nouns, verb, and object determiners must all agree in number (and gender), however far apart they are.]
The problem of long-distance dependencies
With any reasonably sized vocabulary, Markov Models would have to be enormously complex to account for the dependencies between words in human language. (There are other reasons to think Markov Models aren't appropriate, but this is one of the main ones.) (Chomsky, 1957)
Idea for reducing the size of a grammar
Instead of describing the order of particular words, describe the order of particular parts of speech. These are things like nouns, verbs, etc. Different languages vary highly in what parts of speech they have (indeed, there is no agreed-upon classification scheme for what makes different items different parts of speech).

Parts of speech (name: definition; examples)
Adjective: modifies a noun by describing it; old, big, scary, hungry
Adverb: modifies anything other than a noun; greatly, happily, very
Noun: person, place, thing, idea, quantity; Bob, chair, lecture, freedom
Verb: expresses action or state of being; want, run, think, put, make
Pronoun: substitutes for a noun where context gives it meaning; him, her, it, them, we
Auxiliary verb: helps other verbs, giving additional information; be, have, shall, will, may, can
Conjunction: connects parts of a sentence together; and, but, if, or, so
Preposition: introduces a certain kind of phrase, often a location; in, on, around, with, for
Determiner: modifies a noun by expressing the reference; a, an, the, that, this, those

Open class: easy to add new members; carry a lot of the content.
Closed class: hard to add new members; carry a lot of the grammar.

Psychological difference between open and closed class
They are disrupted differently in different kinds of aphasia (brain damage).
Broca's aphasia (speech sample): Lower Falls… Maine… Paper. Four hundred
tons a day! And ah… sulphur machines, and
ah… wood… Two weeks and eight hours. Eight
hours … no! Twelve hours, fifteen hours…
workin … workin … workin! Yes, and … ah…
sulphur.

Wernicke's aphasia (speech sample):
Boy, I’m sweating, I’m awful nervous, you
know, once in a while I get caught up, I
can’t mention the tarripoi, a month ago,
quite a little, I’ve done a lot well, I
impose a lot, while, on the other hand, you
know what I mean, I have to run around, look
it over, trebbin and all that sort of stuff.
Psychological difference between open and closed class
They are disrupted differently in different kinds of aphasia (brain damage).
Children's words are almost always open class:
Mummy
Want
Doggie
Cookie
Up
Juice
Psychological difference between open and closed class
They are disrupted differently in different kinds of aphasia (brain damage).
Children's words are almost always open class.
Closed-class words are the ones that second-language learners have the most difficulty with.

Children appear to learn at least something about parts of speech fairly early
For instance, 14-month-olds generalise differently depending on whether something is a noun or an adjective ("This is a blicket" / "This is not a blicket" → Find a blicket; "This is blickish" / "This is not blickish" → Find the blickish one). (Booth & Waxman, 2003)

Children appear to learn at least something about parts of speech fairly early
We're not at all sure how they do this, but one possibility is that they do so similarly to how our automated models do (which is what we'll go over in the next few days).

A grammar over parts of speech
Instead of this (a chain over particular words)… you have this:
[Diagram: a state chain over parts of speech. From S: det (0.3) or pro (0.7); det → noun (0.8) or adj (0.2); adj → noun (0.9) or adj (0.1); pro → verb (1.0); noun → verb (1.0); verb → E (1.0). Each part of speech emits words with the probabilities listed below:]
(0.5) verb eats
(0.5) verb runs
(0.3) pro he
(0.3) pro she
(0.4) pro it
(0.7) det the
(0.3) det a
(0.4) noun boy
(0.4) noun dog
(0.2) noun tiger
(1.0) adj happy
This is a Hidden Markov Model (HMM): hidden states (and associated probabilities), plus observations (and associated probabilities).

Hidden Markov Models: The basics
Hidden states → the actual series of states generated
Observations → the actual observations
The entries of the transition matrix A give the probability of transitioning from state si to state sj; the entries of the output symbol matrix B give the probability of emitting symbol ok from state si.
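Together with the initial state probabilities Π introduced below, these two tables fully specify an HMM. A minimal container class in Python (an assumed sketch; the course materials themselves use MATLAB .mat files):

```python
# An HMM is (A, B, Pi): transition probabilities between hidden states,
# emission probabilities from states to observed symbols, and initial
# state probabilities. Every row must be a probability distribution.
class HMM:
    def __init__(self, states, symbols, A, B, Pi):
        self.states, self.symbols = states, symbols
        self.A, self.B, self.Pi = A, B, Pi   # A: NxN, B: NxK, Pi: length N
        assert abs(sum(Pi) - 1.0) < 1e-9
        for row in A:
            assert abs(sum(row) - 1.0) < 1e-9   # each state's transitions sum to 1
        for row in B:
            assert abs(sum(row) - 1.0) < 1e-9   # each state's emissions sum to 1

    def a(self, i, j):
        """P(next state = s_j | current state = s_i)."""
        return self.A[i][j]

    def b(self, i, k):
        """P(emit symbol o_k | current state = s_i)."""
        return self.B[i][k]
```

The row-sum checks catch the most common specification error when transcribing a model by hand.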
HMMs: The basics
[Diagram: the graphical model … Xt-1 → Xt → Xt+1 … over hidden states (each x is a state s), where each Xt emits an observation yt (each y is an observation o).]
HMMs: The basics
Technically, not all HMMs have to have this "linear" structure (which is called a Bakis network). It is also possible to have HMMs that have loops. Most of the ones we'll consider don't, except fairly trivially.

A non-linguistic example
You are a mighty warrior named Mitee. You have been sent on a quest to kill a dragon in its cave. You want to catch it while it is asleep or not paying attention, but since it is in a cave you can't observe that directly. Instead, you can only hear the sounds it makes ("Snort, grumble… grumble"). How do you decide when to enter the cave?
A non-linguistic example
States S = {asleep, calm, angry, hungry}
Outputs O = {roar, zzz, snort, grumble}

Output symbol matrix B (rows: states; columns: roar, zzz, snort, grumble):
Asleep: 0.0, 0.9, 0.1, 0.0
Calm: 0.0, 0.0, 0.8, 0.2
Angry: 1.0, 0.0, 0.0, 0.0
Hungry: 0.2, 0.0, 0.0, 0.8

State transition matrix A (rows: current state; columns: asleep, calm, angry, hungry):
Asleep: 0.5, 0.2, 0.1, 0.2
[the remaining rows of A were garbled in extraction; only scattered entries survive]

Initial state probabilities Π:
Asleep 0.3, Calm 0.3, Angry 0.2, Hungry 0.2
Generating output from the model
Pick an initial state proportional to the initial state probabilities Π. Then, at each time t: generate an observation given output symbol matrix B from the current state; generate the state for the next time based on transition matrix A between states.
(The slides step through this, generating the observation sequence: zzz, zzz, … zzz grumble, …)
Generating output from the model
generateobservations('miteethewarrior.mat',10)

A linguistic example
States S = {noun, verb, det, pro, adj, {*}}   ({*} is the end state symbol)
Outputs O = {boy, dog, tiger, eats, runs, the, a, it, she, he, happy}
State transition matrix A (rows: current state; columns: noun, verb, det, pro, adj, *):
Noun: 0.0, 1.0, 0.0, 0.0, 0.0, 0.0
Verb: 0.0, 0.0, 0.0, 0.0, 0.0, 1.0
Det: 0.8, 0.0, 0.0, 0.0, 0.2, 0.0
Pro: 0.0, 1.0, 0.0, 0.0, 0.0, 0.0
Adj: 0.9, 0.0, 0.0, 0.0, 0.1, 0.0
[some entries of A were garbled in extraction; the values shown for verb → *, det → adj, and adj → adj are filled in from the state diagram above]

Output symbol matrix B (entries not listed are 0.0):
Noun: boy 0.4, dog 0.4, tiger 0.2
Verb: eats 0.5, runs 0.5
Det: the 0.7, a 0.3
Pro: he 0.3, she 0.3, it 0.4
Adj: happy 1.0

Initial state probabilities Π:
Noun 0.0, Verb 0.0, Det 0.3, Pro 0.7, Adj 0.0, * 0.0
Generating output from the model
Pick an initial state proportional to the initial state probabilities Π. While the current state is not the end (*): generate an observation given output symbol matrix B from the current state; generate the state for the next time based on transition matrix A between states.
(The slides step through this, generating: he … he runs)
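The loop above is short to write down. A Python sketch of generation from the linguistic HMM (an assumed re-implementation of the course's MATLAB demo; the A entries that were unreadable on the slide are filled in from the state diagram: verb → * with probability 1.0, det → adj 0.2, adj → adj 0.1):

```python
import random

# Transition probabilities between hidden states ('*' is the end state).
A = {
    "det":  {"noun": 0.8, "adj": 0.2},
    "adj":  {"noun": 0.9, "adj": 0.1},
    "pro":  {"verb": 1.0},
    "noun": {"verb": 1.0},
    "verb": {"*": 1.0},
}
# Emission probabilities for each part of speech (the slide's lexicon).
B = {
    "det":  {"the": 0.7, "a": 0.3},
    "pro":  {"he": 0.3, "she": 0.3, "it": 0.4},
    "adj":  {"happy": 1.0},
    "noun": {"boy": 0.4, "dog": 0.4, "tiger": 0.2},
    "verb": {"eats": 0.5, "runs": 0.5},
}
Pi = {"det": 0.3, "pro": 0.7}

def sample(dist):
    """Draw one key from a {value: probability} distribution."""
    return random.choices(list(dist), weights=list(dist.values()))[0]

def generate_observations_with_end(A, B, Pi):
    """Pick a start state from Pi; emit and transition until the end state."""
    state, words = sample(Pi), []
    while state != "*":
        words.append(sample(B[state]))   # generate observation from B
        state = sample(A[state])         # generate next state from A
    return " ".join(words)
```

Sentences like "he runs" or "the happy dog eats" come out; the hidden part-of-speech sequence is discarded, which is exactly why inferring it back is interesting.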
model
generateobservationswithend(‘simplelanguage.mat’,10) Doing more with HMMs Genera;ng data is easy: the real interest comes from assuming that some data was generated by an HMM, and inferring the probabili;es and state sequences Sepng up for next ;me… Three fundamental quesJons for HMMs Three fundamental ques;ons 1) Given a model M = (A,B,Π), how do we efficiently compute how likely a certain observa;on is? Example: simple language How likely are you to see he eats? Example: Mitee the warrior How likely are you to see zzz snort? Three fundamental ques;ons 1) Given a model M = (A,B,Π), how do we efficiently compute how likely a certain observa;on is? 2) Given a sequence of observa;ons Y and a model M, how do we infer the state sequence that best explains the observa;ons? Example: Mitee the warrior Make the observa;ons on the right What were the moods of the dragon at each point? zzz snort zzz
zzz zzz snort
snort grumble
roar grumble
Three fundamental ques;ons 1) Given a model M = (A,B,Π), how do we efficiently compute how likely a certain observa;on is? 2) Given a sequence of observa;ons Y and a model M, how do we infer the state sequence that best explains the observa;ons? Example: Language Hear the following sentence. What were the parts of speech you heard? They run to
the park
Three fundamental ques;ons 1) Given a model M = (A,B,Π), how do we efficiently compute how likely a certain observa;on is? 2) Given a sequence of observa;ons Y and a model M, how do we infer the state sequence that best explains the observa;ons? 3) Given an observa;on sequence Y and a space of possible models found by varying the model parameters M = (A,B,Π), how do we find the model that best explains the observed data? Three fundamental ques;ons 1) Given a model M = (A,B,Π), how do we efficiently Example: language compute how likely as imple certain observa;on is? This ;me, you don’t know which words correspond to which arts of speech: ou ah ave 2) Given a sequence of opbserva;ons Y aynd model M, to idnfer he transi;on probabili;es A, tB
, and Π
how o wte infer the state sequence hat best explains the observa;ons? 3) Given an observa;on sequence Y and a space of possible models found by varying the model parameters M = (A,B,Π), how do we find the model that best explains the observed data? Three fundamental ques;ons 1) Given a model M = (A,B,Π), how do we efficiently compute how likely a certain observa;on is? 2) Given a sequence of observa;ons Y and a model M, how do we infer the state sequence that best explains the observa;ons? 3) Given an observa;on sequence Y and a space of possible models found by varying the model parameters M = (A,B,Π), how do we find the model that best explains the observed data? Next lecture, in detail: Viterbi* algorithm Next lecture, in brief: Baum-‐
Welch** algorithm * You should be able to implement this; ** You don’t need to be able to implement this Take-‐home points Iden;fying parts of speech (and figuring out how to put the morphemes of a sentence together) is a hard problem. Markov models may help us to learn to do both of these things. Markov models have the Markov property of memorylessness Hidden Markov Models are Markov models with hidden states (which, in language, oken correspond to parts of speech); they can help to deal with the parameter explosion that occurs in natural language We saw how to define an HMM and generate observa;ons given an HMM References HMMS Wikipedia entry on Hidden Markov Models. Manning, C., & Schutze, H. (1999). Founda>ons of Sta>s>cal Natural Language Processing. Chapter 9: Markov Models. Russell, S., & Norvig, P. (1995). Ar;ficial Intelligence: A modern approach (1st edi;on). Children learning parts of speech Booth, A., & Waxman, S. (2003). Mapping words to the world in infancy: Infants’ expecta;ons for count nouns and adjec;ves. Journal of Cogni>on & Development 4: 357-‐381.