
Lecture 4: Walk-ing the walk; talk-ing the
talk
Professor Robert C. Berwick
[email protected]
6.863J/9.611J SP11
Smoothing: the Robin Hood Answer
(and the Dark Arts)
Steal (probability mass) from the rich
And give to the poor (unseen words)
The Menu Bar
•  Administrivia: Lab 2 out
– Note that you are expected to figure out the ‘guts’ of the
discounting methods & how to implement them
•  Today: finish smoothing & word parsing (chapter 3, of JM
textbook)
•  Two-level morpho-phonology (why two levels? –
Gold star idea: match representation to the task)
•  How does it work? Salivation nation; plumbing the Wells
•  Two sets of finite-state transducers
•  Hunting through an example so that we're not out-foxed
•  Spy vs. spies; multiple languages
6.863J/9.611J SP11
Smoothing has 2 parts:
(1) discounting & (2) backoff
•  Deals with events that have been observed zero times
•  Generally improves accuracy of model
•  Implements Robin Hood by answering 2 questions:
1. Who do we take the probability mass away from, and
how much? (= discounting)
2. How do we redistribute the probability mass?
(= backoff)
Let's cover two of the basic methods for discounting and
backoff: Add-1; Witten-Bell
Sum of probability mass must still add up to 1!
6.863J/9.611J SP11
Smoothing: the Dark Side of the force…
This dark art is why NLP is
taught in the engineering
school
Simplest fix
•  So our estimate for w is now an interpolation (a
weighted backoff sum) between the unigram and
the uniform distributions:

    N/(V+N) · C(w)/N  +  V/(V+N) · 1/V  =  (C(w)+1)/(N+V)
6.863J/9.611J SP11
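(A quick sanity check of this identity, as a minimal Python sketch; the toy corpus and the vocabulary size V = 20 are made up for illustration.)

```python
from collections import Counter

# Toy corpus; the counts and the vocabulary size V = 20 are illustrative only.
tokens = "the cat sat on the mat the end".split()
counts = Counter(tokens)
N = len(tokens)   # number of tokens
V = 20            # assumed vocabulary size

def add1_direct(w):
    """Add-1 estimate written directly: (C(w) + 1) / (N + V)."""
    return (counts[w] + 1) / (N + V)

def add1_as_interpolation(w):
    """The same estimate as a weighted sum of the unigram MLE and the uniform 1/V."""
    return (N / (N + V)) * (counts[w] / N) + (V / (N + V)) * (1 / V)

for w in ["the", "cat", "zygote"]:          # 'zygote' is unseen
    assert abs(add1_direct(w) - add1_as_interpolation(w)) < 1e-12
    print(w, add1_direct(w))
```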
There are actually less hacky
smoothing methods, but they require
more math and more compute cycles
and don't usually work any better.
So we'll just motivate and present
a few basic methods that are often
used in practice.
Add-1 smoothing: the Robin Hood Answer
Steal (probability mass) from the rich
Add new 1's to everybody (as if there
were a new list V long)
•  This method is called Add-1 because we add 1 to
all the counts, and add V to the denominator to normalize
•  It's really an interpolation of the unigram (or
bigram, …) MLE plus the uniform distribution
Example from restaurants: add 1
    p(wi | wi−1) = (c(wi−1wi) + 1) / (c(wi−1) + V)
Problem: Add-1 adds way too much probability mass to unseen events!
(It therefore robs too much from the actually observed events.)
V is too big! So how to fix this?
6.863J/9.611J SP11
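(A minimal sketch of the add-1 bigram estimate using the counts quoted on these slides – c(Chinese food) = 82, c(Chinese) = 158 – and assuming a vocabulary size V = 1446 for the restaurant corpus, which reproduces the 0.52 → 0.052 drop.)

```python
# Counts quoted on the slides: c(Chinese food) = 82, c(Chinese) = 158.
# V = 1446 is an assumed vocabulary size for the restaurant corpus.
c_bigram = {("chinese", "food"): 82}
c_unigram = {"chinese": 158}
V = 1446

def p_add1(w, prev):
    """Add-1 bigram estimate: (c(prev, w) + 1) / (c(prev) + V)."""
    return (c_bigram.get((prev, w), 0) + 1) / (c_unigram[prev] + V)

print(round(c_bigram[("chinese", "food")] / c_unigram["chinese"], 3))  # MLE: 0.519
print(round(p_add1("food", "chinese"), 3))        # ~0.052: the drop noted above
print(round(p_add1("creampuffs", "chinese"), 4))  # every unseen bigram gets 1/(158 + V)
```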
Add-One Smoothing (Laplace, 1720)

            count   MLE    add-1 count   add-1 prob
xya           1     1/3        2            2/29
xyb           0     0/3        1            1/29
xyc           0     0/3        1            1/29
xyd           2     2/3        3            3/29
xye           0     0/3        1            1/29
…
xyz           0     0/3        1            1/29
Total xy      3     3/3       29           29/29

p(food | Chinese) = 0.52 dropped to 0.052 !!!
6.863J/9.611J SP11

Add-One Smoothing
Suppose we're considering 20000 word types, not 26 letters

                  count   MLE    add-1 count   add-1 prob
see the abacus      1     1/3        2          2/20003
see the abbot       0     0/3        1          1/20003
see the abduct      0     0/3        1          1/20003
see the above       2     2/3        3          3/20003
see the Abram       0     0/3        1          1/20003
…
see the zygote      0     0/3        1          1/20003
Total               3     3/3    20003      20003/20003

0-count event = novel event (never happened in training data)
Here: 19,998 novel events, with total estimated probability 19,998/20,003.
So add-one smoothing thinks we are extremely likely to see novel events,
rather than words we've seen in training data.
It thinks this only because we have a big dictionary: 20,000 possible events.
Is this a good reason?
6.863J/9.611J SP11

Serious problems with the Add-1 method
1.  It assigns far too much probability to rare n-grams as V
gets bigger (the size of the vocabulary, the # of distinct
words grows) – the amount of probability mass we steal is
too large!
2.  All unseen n-grams are smoothed in the same way – how
we redistribute this probability mass is also wrong
•  For bigrams, about ½ the probability mass goes to unseen
events
(e.g., for V = 50,000, corpus size N = 5 × 10⁹, V² = 2.5 × 10⁹)
•  So, we must reduce the count for unseen events to
something waay less than 1….
6.863J/9.611J SP11
Diversity Smoothing
One fix: just how likely are novel events?
Counts from Brown Corpus (N approx. 1 million tokens)
[bar chart of token counts nr × r by frequency class r; individual high-frequency
types shown separately: the 69,836 tokens, EOS 52,108 tokens]
n6 × 6   abdomen, bachelor, Caesar …
n5 × 5   aberrant, backlog, cabinets …
n4 × 4   abdominal, Bach, cabana …
n3 × 3   Abbas, babel, Cabot …
n2 × 2   doubletons (occur twice): aback, Babbitt, cabanas …
n1 × 1   singletons (occur once): abaringe, Babatinde, cabaret …
n0 × 0   novel words (in dictionary V but never occur)
n2 = # doubleton types;  n2 × 2 = # doubleton tokens
Σr nr = total # types = T (purple bars)
Σr (nr × r) = total # tokens = N (all bars)
6.863J/9.611J SP11
Witten-Bell Smoothing/Discounting Idea
If T/N is large, we've seen lots of novel
types in the past, so we expect lots more.
•  Imagine scanning the corpus in order.
•  Each type's first token was novel.
•  So we saw T novel types (purple).
[same bar chart as before, annotated: unsmoothed → smoothed]
doubletons:    2/N → 2/(N+T)
singletons:    1/N → 1/(N+T)
novel words:   0/N → (T/(N+T)) / n0
Intuition: When we see a new type w in training, ++count(w); ++count(novel)
So p(novel) is estimated as T/(N+T), divided among n0=Z specific novel types
Witten-Bell discounting: in other words
•  A zero-frequency n-gram is simply an event that hasn't happened yet
•  When we do see it, then it will be for the first time
•  We will get an MLE estimate of this first time frequency by
counting the # of n-gram types in the training data (the # of types is
just the number of first-time observed n-grams)
•  So the total probability mass of all the zero n-grams is estimated as
this count of types T (= first time n-grams) normalized by the number
of tokens N plus number of types T:
    Σ_{i: ci = 0} p*i = T / (N + T)
•  Intuition: a training corpus consists of a series of events; one event for
each token, and one event for each new type; this formula gives a kind
of MLE for the probability of a new type event occurring
Witten-Bell discounting
•  So, answer to the Robin Hood question #1 is:
– We take away T rather than V from the
rich (and note that the # of types of words will
be very much smaller than the # of all possible
words)
•  Now we must also answer Robin Hood question 2:
How do we redistribute this wealth ?
2nd question: re-allocate this probability
mass we just took away from the total…
•  Let Z be the total # of word types with count 0
•  Divide the probability mass equally: each gets 1/Zth
of this mass, i.e.,

    p*i = T / (Z · (N + T))    if (ci = 0)

•  Compare this correction to what Add-1 does: Add-1 hands each
unseen event a 1/V share of a reserved mass of V/(V+N);
Witten-Bell hands each unseen event a 1/Z share of a reserved
mass of T/(T+N)
•  And as usual we must correspondingly reduce
(discount) the probability of all the seen items:

    p*i = ci / (N + T)    if (ci > 0)

Note that the total probability mass still
adds up to 1
Witten- Bell smoothing: the Robin Hood
Answer
Steal (probability mass) from the rich
    Σi ci = N

    For (ci > 0):   Σi p*i = Σi ci / (N + T) = N / (N + T)

    For (ci = 0):   Σi p*i = T / (N + T)

Summed together:   N/(N+T) + T/(N+T) = 1
6.863J/9.611J SP11
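(A minimal sketch of Witten-Bell unigram smoothing as just laid out; the toy corpus and the assumed dictionary size of 20 are illustrative only.)

```python
from collections import Counter

tokens = "the cat sat on the mat the end".split()   # toy corpus
counts = Counter(tokens)
N = sum(counts.values())      # number of tokens
T = len(counts)               # number of observed types (first-time events)
vocab_size = 20               # assumed dictionary size
Z = vocab_size - T            # number of word types with zero count

def p_wb(w):
    c = counts[w]
    if c > 0:
        return c / (N + T)            # discounted estimate for seen words
    return T / (Z * (N + T))          # equal share of the reserved T/(N+T)

# Total mass check: seen mass N/(N+T) plus unseen mass T/(N+T) sums to 1.
seen = sum(p_wb(w) for w in counts)
unseen = Z * p_wb("zygote")           # 'zygote' stands in for any unseen type
print(seen, unseen, seen + unseen)    # -> N/(N+T), T/(N+T), 1.0
```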
We apply this idea to bigrams
(you will do it for trigrams)
•  Condition the type counts on word history, e.g., bigrams
•  To compute the probability of a bigram wi−1wi we
haven't seen before (a 'first-time' bigram):
– Compute the probability of a new bigram starting
with a particular wi−1
•  Words that occur in a larger # of bigrams will supply
a larger 'unseen bigram' estimate than words that do
not appear in a lot of different bigrams
•  Then condition T and N, bigram types and tokens, on
the previous word, wi–1, as follows:
6.863J/9.611J SP11
Condition p(wi) on previous word, wi–1
"
p* (wi | wi !1 ) =
i:c(wi!1 wi )= 0
T (wi !1 )
N(wi !1 ) + T (wi !1 )
Let Z be the total # of bigrams with a given first word that has count
0 bigram types (e.g., chinese creampuffs )
Each such bigram will get 1/Zth of the redistributed probability:
p* (wi | wi !1 ) =
T (wi !1 )
if c(wi !1wi ) = 0
Z(wi !1 )(N + T (wi !1 ))
For non-zero bigrams, we discount as before, e.g., chinese food :
p* (wi | wi !1 ) =
c(wi !1wi )
if c(wi !1wi ) > 0
c(wi !1 ) + T (wi !1 )
Reduce bigram
6.863J/9.611J SP11
by this amount
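(A sketch of these conditioned formulas in Python; the bigram counts and the assumed vocabulary size are illustrative, not the real restaurant-corpus values.)

```python
from collections import Counter, defaultdict

# Toy bigram counts; in Lab 2 these would come from real training data.
bigrams = Counter({("chinese", "food"): 82, ("chinese", "lunch"): 20,
                   ("chinese", "restaurant"): 56})
vocab_size = 1446   # assumed vocabulary size

# T(w) = # distinct bigram types starting with w; N(w) = # bigram tokens starting with w
T, N = defaultdict(int), defaultdict(int)
for (w1, w2), c in bigrams.items():
    T[w1] += 1
    N[w1] += c

def p_wb(w, prev):
    Z = vocab_size - T[prev]             # bigram types with this history never observed
    c = bigrams[(prev, w)]
    if c > 0:
        return c / (N[prev] + T[prev])   # discounted estimate for seen bigrams
    return T[prev] / (Z * (N[prev] + T[prev]))   # equal share of the reserved mass

print(p_wb("food", "chinese"))         # discounted below the MLE c(chinese food)/c(chinese)
print(p_wb("creampuffs", "chinese"))   # a never-seen bigram still gets some mass
```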
Example: using T(w) & Z values
# of bigram types T(w) following each word in the restaurant example:
95 different bigrams follow I; 76 different bigrams follow want; etc.
I: 35        want: 14      to: 42       eat: 9
Chinese: 20  food: 15      lunch: 8
• Recall that the bigram Chinese food has a nonzero count of 82 & c(Chinese)
is 158, so the MLE bigram estimate is p = 82/158 = 0.52
•  The T-adjusted bigram estimate is: p*(food | Chinese) =
c(Chinese food)/(c(Chinese)+T(Chinese)) = 82/(158+20) ≈ 0.46
Compare this to the MLE estimate of 0.52 – as compared to Add-1, this
revised estimate is much closer to the MLE – it is lowered because
Chinese might be followed by something else not yet seen, and not
food (or one of the 19 other word types already observed after Chinese)
Working with T and Z values
•  The more promiscuous a word, the more types
observed that follow it, the more probability mass
we reserve for unseen words that might follow this
word, the more we lower its other MLE estimates
But there's one more twist…backoff!
1.  Why are we treating all novel events as the same?
2.  What happens if the bigrams themselves are 0?
p(zygote | see the) = p(baby | see the) ???
Suppose both trigrams have 0 count
But:
doesn't 'baby' beat out 'zygote' as a unigram?
doesn't 'the baby' beat out 'the zygote' as a bigram?
So shouldn't 'see the baby' beat 'see the zygote'?
How can we factor this knowledge in?
Use the Backoff Luke!
•  Hold out probability mass for novel events
•  But divide up this mass unevenly in proportion to a
backoff probability (like the interpolation scheme, but
now the weights depend on the backoff probabilities)
•  What's a backoff probability?
•  When defining p(z|xy), for novel z, it is p(z|y) (i.e., the
bigram probability estimate), which is itself the
weighted average of the MLE estimate for the bigram
and the unigram estimate for z;
•  But what if the unigram estimate p(z) is itself 0?
6.863J/9.611J SP11
Back-off Smoothing
If p(z|y) is not observed, we back off to the more frequent p(z)
•  Suppose p(z) (i.e., the unigram probability) is not observed –
do we need a back-off probability for it?
•  Yes; it is just the 0-gram uniform estimate, 1/V
•  So use the uniform distribution, 1/V , and interpolate with
the unigram MLE estimate as with the Add-1 scheme,
selecting weights λ, (1–λ) that sum to 1. For Witten-Bell,
this means using T/(T+N) and N/(T+N) as the weights
•  Let’s put all this together…compare this weighting scheme
to our previous one for Add-1 smoothing…
6.863J/9.611J SP11
Witten-Bell Backoff for bigrams – Weighted
between the MLE estimate for bigrams and 0-gram
uniform distribution
    p_backoff(wi) = p_WB(wi) = N/(N+T) · p_MLE(wi) + T/(N+T) · 1/V
If the MLE estimate is 0, just don't use it and use the
uniform estimate!
Now we must combine this with our bigram
estimate, recursively, interpolating between
unigram and bigram, backing off if an estimate is
not available just as we did with the Add-1 idea…
6.863J/9.611J SP11
Modified Witten-Bell: from bigrams to
unigrams to uniform estimates
    p_WB(wi | wi−1) = c(wi−1)/(c(wi−1) + T(wi−1)) · p_MLE(wi | wi−1)
                      + T(wi−1)/(c(wi−1) + T(wi−1)) · p_backoff(wi)

    p_backoff(wi) = N/(N+T) · p_MLE(wi) + T/(N+T) · 1/V
•  This is competitive with best methods known
•  As to what works most smoothly in practice?
•  ‘It depends’ – you’ll have to try it out (see Lab 2)
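(One way to code the recursion above – a sketch only: the argument names (bigrams, unigrams, T_of for the per-history type counts) are mine, and falling straight back to the smoothed unigram for an unseen history is just one reasonable choice.)

```python
def p_backoff_unigram(w, unigrams, N, T, V):
    """Unigram level: interpolate the unigram MLE with the uniform 1/V
    using the weights N/(N+T) and T/(N+T)."""
    p_mle = unigrams.get(w, 0) / N
    return (N / (N + T)) * p_mle + (T / (N + T)) * (1 / V)

def p_wb_bigram(w, prev, bigrams, unigrams, T_of, N, T, V):
    """Bigram level: weight the bigram MLE by c(prev)/(c(prev)+T(prev)) and
    back off to the smoothed unigram with the remaining T(prev)/(c(prev)+T(prev))."""
    c_prev, t_prev = unigrams.get(prev, 0), T_of.get(prev, 0)
    if c_prev == 0:                      # unseen history: nothing but the backoff
        return p_backoff_unigram(w, unigrams, N, T, V)
    lam = c_prev / (c_prev + t_prev)
    p_mle = bigrams.get((prev, w), 0) / c_prev
    return lam * p_mle + (1 - lam) * p_backoff_unigram(w, unigrams, N, T, V)
```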
Now let’s look at word parsing…
6.863J/9.611J SP11
The central problem: parsing
•  The problem of mapping from one
representational level to another is called parsing
•  If there is > 1 possible outcome (the mapping is
not a function) then the input expression is
ambiguous
dogs → dog-Noun-plural or
dog-Verb-present-Tense
•  We'll begin with word parsing. Why?
6.863J/9.611J SP10
Why words? (Engineer’s view)
•  Prerequisite for higher level processing
•  Cross-linguistic generality
•  Applications:
•  electronic dictionaries (Franklin)
•  spelling checkers (ispell)
•  machine translation (Systran)
•  Filthy lucre
6.863J/9.611J SP10
Start with words: they illustrate all the
problems (and solutions) in NLP
In particular, the problems of parsing, ambiguity, and
computational efficiency (as well as the problems of how
people do it)
Why words? (linguistics)
• 
• 
• 
• 
Test the representation
Test the resource requirements
Find out about unexpected predictions
Filthy lucre
•  Linguistic analysis of words: morphology
(morpho-logos)
6.863J/9.611J SP10
Parsing words
Cats = CAT + N(oun) + PL(ural)
Used in:
Traditional NLP applications
Finding word boundaries (e.g., Latin, Chinese)
Text to speech (boathouse)
Document retrieval
6.863J/9.611J SP10

Computational morphology
Analysis:
leaves → leaf N Pl
         leave N Pl
         leave V Sg3
Generation:
hang V Past → hanged, hung

What knowledge do we need to do this?
Two kinds needed:
1. Knowledge of a stem and possible endings:
Dogs → dog + s
Cat → cat + s
Doer → do + er
But not: beer → be + er
2. Knowledge of spelling changes in context
Bigger → big + er (double the g)
Flies → flie + s ?
6.863J/9.611J SP10
Two kinds of knowledge = two kinds of
representations/constraints
What knowledge is needed?
Morphotactics
Words are composed of smaller elements that must be
combined in a certain order but not in others:
piti-less-ness is English
piti-ness-less is not English
Phonological alternations
The shape (sometimes, as written) of an element may
vary depending on its context
pity is realized as piti in pitilessness
die becomes dy in dying
6.863J/9.611J SP10
What Knowledge is needed?
Linear arrangement of morphemes – beads on a string:
Lebensversicherungsgesellschaftsangestellter
Leben+s+versicherung+s+gesellschaft+s+angestellter
life+CmpAug+insurance+CmpAug+company+CmpAug+employee
What would be a good computational model for this?
Stemming and Search
•  Google is more successful than other search engines in part because it returns
'better', i.e. more relevant, information
•  its algorithm (a trade secret) is called PageRank
•  SEO (Search Engine Optimization) is a topic of considerable commercial
interest
•  Goal: How to get your webpage listed higher by PageRank
•  Techniques:
•  e.g. by writing keyword-rich text in your page
•  e.g. by listing morphological variants of keywords
•  Google does not use stemming everywhere
•  and it does not reveal its algorithm, to prevent people optimizing their
pages
6.863J/9.611J SP10
What does Google do?
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>BBC - Health - Healthy living - Dietary requirements</title><meta
content="The foods
that can help to prevent and treat certain conditions" name="description"/>
6.863J/9.611J SP09 Lecture 2.5
We can’t just store it all…or can we?
Why?
1.  A Sagan number
Uygarlaştıramadıklarımızdanmışsınızcasına
(behaving) as if you are among those whom we could not
civilize
behave+BEC+CAUS+NAB+PART+PL+P1PL+AB+PA+2PL+AsIF
2.  What about exotic languages – like English?
Anti-missile
Anti-anti-missile
Anti-anti-antimissile…
6.863J/9.611J SP09 Lecture 2.5
Why do this? Because current theories argue that language
is mostly (even all) about lexical (word) features…
One cat is/are on the table
Zero cats…
6.863J/9.611J SP10
Solution: lexical finite-state machine
Bidirectional: generation or analysis
Compact and fast
Comprehensive systems have been built for
over 40 languages:
English, German, Dutch, French, Italian,
Spanish, Portuguese, Finnish, Russian,
Turkish, Japanese, Korean, Basque, Greek,
Arabic, Hebrew, Bulgarian, …
Even for translation
[figure: a finite-state mapper relating the two levels]
citation form + inflection codes:   v o u l o i r +IndP +SG +P3
inflected form:                     v e u t
i.e., vouloir +IndP +SG +P3  ↔  veut
6.863J/9.611J SP10
Our solution: 2 sets of finite-state
automata (aka, kimmo )
6.863J/9.611J SP10
We can get more sophisticated…
Morphology
Inflectional Morphology:
basically: no change in category; sometimes meaningful
Agreement features (person, number, gender,…)
Examples: movies, blonde, actress, one dog, two dogs, zero …
Irregular examples:
appendices (from appendix), geese (from goose)
Morphology
Derivational Morphology
basically: part of speech category changing
nominalization
Examples: formalization, informant, informer, refusal, lossage
deadjectivals
Examples: weaken, happiness, simplify, formalize, slowly, calm
case
Examples: she/her, who/whom
comparatives and superlatives
Examples: happier/happiest
tense
Examples: drive/drives/drove (-ed)/driven
Morphology and Semantics
Morphemes: units of meaning
suffixation
Examples:
x employ y
employee: picks out y
employer: picks out x
x read y
readable: picks out y
prefixation
Examples:
undo, redo, un-redo, encode, defrost, asymmetric,
malformed, ill-formed, pro-Chomsky
deverbals
Examples: see nominalization, readable, employee
denominals
Examples: formal, bridge, ski, cowardly, useful
What about other processes?
•  Stem/root: core meaning unit (morpheme) of a word
•  Affixes: bits and pieces that combine with the stem to
modify its meaning and grammatical functions
•  Prefix: un- , anti-, etc.
•  Suffix: -ity, -ation, etc.
•  Infix:
•  Tagalog: um + hingi → humingi ('borrow')
•  Any infixes in a nonexotic language like English?
Here's one: un-f******-believable
OK, what knowledge is needed?
How do we represent this knowledge?
Two parts to the what
Which units can glue to which others (roots and affixes) (or
stems and affixes)= morphotactics
What spelling changes (orthographic changes) occur – like
dropping the e in chase + ed (morpheme shape
depends on its context – like plural)
IDEA: MODEL EACH AS A FINITE-STATE MACHINE,
then combine.
WHY is this a good model? HOW should we ‘combine’
these two processes?
How: we will modify the idea of a finite-state automaton (fsa)
A finite-state automaton (FSA) is a quintuple
(Q, Σ, δ, I, F) where:
Q is a finite set of states
Σ is a finite set of terminal symbols, the alphabet
δ ⊆ Q × Σ × Q is the transition mapping (or next-state relation)
I ⊆ Q, the initial states
F ⊆ Q, the final states
If δ is a function from Q × Σ to Q, then the FSA is deterministic; o.w. it is
nondeterministic
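(A minimal sketch of such a machine in Python, in the (Q, Σ, δ, I, F) style above; the toy adjective word list and state names are illustrative only.)

```python
# A minimal deterministic FSA: δ is a partial function from (state, symbol) to state.
# This toy machine accepts an adjective root followed by an optional "er"/"est"
# ending and then "#"; the word list is illustrative only.
DELTA = {
    ("q0", "cool"): "q1", ("q0", "clear"): "q1", ("q0", "big"): "q1",
    ("q1", "er"): "q2", ("q1", "est"): "q2",
    ("q1", "#"): "qf", ("q2", "#"): "qf",
}
START, FINAL = "q0", {"qf"}

def accepts(symbols):
    """Run the machine; reject as soon as no transition is defined."""
    state = START
    for sym in symbols:
        if (state, sym) not in DELTA:
            return False
        state = DELTA[(state, sym)]
    return state in FINAL

print(accepts(["cool", "er", "#"]))   # True
print(accepts(["er", "cool", "#"]))   # False: morphotactics violated
```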
Why finite-state machines?
•  Minimal model
•  Fast
•  Note what defines a finite-state machine
•  Linear concatenation of equivalence classes of
words (beads on a string) – the minimal machinery
to represent precedence
More ancient linguistic corpus
analysis
Rulon Wells (1947, pp.81-82):
A word A belongs to the class determined by the
environment: ___ X if AX is either an
utterance or occurs as a part of some utterance
AKA: distributional analysis
This turns out to be algebraically correct, but Wells
did not know the math to take the next step…
Under construction
Palin will lose the election.
Palin could lose the election.
____ z
z = tail following word x
will ≡ could;
Why are 2 words in the same ‘class’?
Because they have the same tails;
they are in the same equivalence class;
The class = a state in an fsa
More precisely
Splitting equivalence classes as data
comes in
ε-transition
Any fsa must sometimes fail to classify an
arbitrarily long string of this form properly…
Why???
aaabbb
aabb
aaaabb
a^500 b^500
Remembrance of languages past…
•  Finite-state automaton defined by having a finite
number of states…. (duh)
•  Which define a finite # of equivalence classes or
bins, i.e., an equivalence relation R s.t. for all
strings x, y over the defined alphabet Σ, either
xRy or else not xRy.
•  (This equivalence relation is of finite rank)
•  No fsa can properly bin a^n b^n, for all n ≥ 1
Modeling English examples – how?
Finite-state machine for stems + endings
[fsa sketch: start —Adjective→ —er→ —#→ accept]
•  As an example, consider adjectives
Big, bigger, biggest
Cool, cooler, coolest, coolly
Red, redder, reddest
Clear, clearer, clearest, clearly, unclear, unclearly
Happy, happier, happiest, happily
Unhappy, unhappier, unhappiest, unhappily
Real, unreal, silly
Pure model of linear concatenation
Additions:
1. Obviously, we need to add more states: real → *realer
2. We also need to add output (e.g., Adj, comparative),
so we must make this machine a transducer
3. Finally, we must add in a second set of machines to
check for valid roots and endings character by character,
as well as possible spelling changes (big-bigger)
Adding output to an fsa to create a
transducer
[fst sketch: the same adjective machine, but each arc now pairs an input with an
output – cool : Adjective, er : Comparative, # : #]
Q: So how do we add output to these sorts
of devices?
A: Finite-state automata → finite-state transducer
6.863J/9.611J SP10
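(A minimal sketch of the same idea as a transducer: each arc now carries an input:output pair, so a successful run also emits an analysis. The states, arcs, and output tags are illustrative only.)

```python
# The same adjective machine, but each arc now carries an input : output pair.
# States, arcs, and output tags here are illustrative only.
ARCS = {
    ("q0", "cool"):  ("q1", "cool+Adj"),
    ("q0", "clear"): ("q1", "clear+Adj"),
    ("q1", "er"):    ("q2", "+Comp"),
    ("q1", "est"):   ("q2", "+Sup"),
    ("q1", "#"):     ("qf", ""),
    ("q2", "#"):     ("qf", ""),
}

def transduce(symbols, start="q0", final=("qf",)):
    """Run the transducer, collecting outputs; return None if the input is rejected."""
    state, out = start, []
    for sym in symbols:
        if (state, sym) not in ARCS:
            return None
        state, piece = ARCS[(state, sym)]
        out.append(piece)
    return "".join(out) if state in final else None

print(transduce(["cool", "er", "#"]))   # cool+Adj+Comp
print(transduce(["er", "cool", "#"]))   # None: ill-formed input gets no output
```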
Morpho-phonology is finite-state: regular relations
An example regular relation – a pairing of sets
of strings – each of which is finite-state –
described by an fst
{((ab)^n, (ba)^n)}, s.t. n > 0, i.e., {(ab, ba), (abab, baba), (ababab,
bababa), … } ('Interchanges' a's and b's)
[fst sketch: two states S and B, arcs a:b and b:a]

Danger, danger, Will Robinson:
Finite State Transducers (FSTs) are a bit
different from ordinary finite-state automata
Note that the relation specifies no directionality between X and Y
(there is no input and no output, so neutral between parsing &
generation )
What are the properties of regular (rational) relations?
How do we implement these in our
word parsing application?
Properties of regular (rational) relations/
transductions
•  Key differences: (important for implementation)
1.  Not closed under intersection (unlike fsa's, unless they
obey the same-length constraint)
2.  Are closed under composition
3.  Cannot always be determinized (unlike nondeterministic
fsa's)
Definition & example of composition, and then
why not closed under intersection & why this
matters
Unlike FSAs, FSTs (regular relations)
are not closed under intersection (gulp!)
Regular relation 1: a pair of finite-state languages:
{a^n, b^n c*}  Claim: this is a regular relation
[fst sketch: arcs a:b (loop), then ε:c (loop)]
Regular relation 2: a similar pair of finite-state languages:
{a^n, b* c^n}  Claim: this too is a regular relation
But what is the intersection of (1) and (2)?
Ans: {a^n, b^n c^n}. And this is not a regular relation, by
definition, because b^n c^n is not a finite-state (regular)
language (Why not? Salivate appropriately as the bell rings…)
So, we use this kind of machine to model
the root + ending constraints…as a set of
underlying/surface pairs
[figure sketch, fsa (3.5): a Root drawn from the Dictionary (e.g., the Noun fox)
followed by an Affix drawn from the Affix dict (e.g., plural s), as in Fox + s]
6.863J/9.611J SP11

Modeling English Adjectives
from section 3.3 of textbook
examples:
big, bigger, biggest, *unbig
cool, cooler, coolest, coolly
red, redder, reddest, *redly
clear, clearer, clearest, clearly, unclear, unclearly
happy, happier, happiest, happily
unhappy, unhappier, unhappiest, unhappily
real, *realer, *realest, unreal, really
The valid root plus affix combinations are
given by a finite-state transducer
Initial machine is overly simple –
need more classes to make finer grain
distinctions, e.g. *unbig
Modeling English Adjectives
divide adjectives into classes
examples
adj-root2: big, bigger, biggest, *unbig
adj-root2: cool, cooler, coolest, coolly
adj-root2: red, redder, reddest, *redly
adj-root1: clear, clearer, clearest, clearly, unclear, unclearly
adj-root1: happy, happier, happiest, happily
adj-root1: unhappy, unhappier, unhappiest, unhappily
adj-root1: real, *realer, *realest, unreal, really
However...
Examples
uncooler
• Smoking uncool and getting uncooler.
• google: 22,800 (2006), 10,900 (2005)
*realer
• google: 3,500,000 (2006) 494,000 (2005)
*realest
• google: 795,000 (2006) 415,000 (2005)
Of course, the full automaton for roots+endings is,
uhm, big+er! We’ll call this the dictionary or
Lexicon
And this illustrates why we need a second set of
FST constraints… what are they for?
6.863J/9.611J SP11
A 2nd set of FST’s is required to specify
correct “spelling changes”
Describing (surface, underlying) forms as pairs of languages (a
language being just a set of strings):
String 1:  f a t +Adj +Comp
String 2:  f a t t e r
Relation(String 1, String 2)
Forget about the terms input and output
What are the ‘spelling change’ rules for
English?
•  Cat+s, ….
•  Big+er, …
•  Fly+s, …
(What about that last one?)
Any others?
•  Fox+s
How do we implement these as FSTs?
Let’s try fox+s/foxes…
6.863J/9.611J SP11
Outfoxed
And acceptance (cook until done)
lexical:  f  o  x  +  0  s  #
surface:  f  o  x  0  e  s  #
Let’s convert this to a machine that checks
whether this pair is OK…
[fst sketch, states 1–6: arcs Csib:Csib, +:0, 0:e, s:s, #:#, plus @:@ ('anything
else') arcs; pairs that don't match any arc lead to reject]
Standard notation: +:0 <=> x:x __ 0:e
6.863J/9.611J SP09 Lecture 3
Insert ‘e’ before non-initial z, s, x
(“epenthesis”)
important!
The fst is designed to pass input
not matching any arcs of the rule
through unmodified
(rather than fail)
Outfoxed!
[fst sketch for the e-insertion rule, arcs f:f, o:o, x:x, +:0, 0:e, s:s, #:#]
implements context-sensitive rule
q0 to q2 : left context
q3 to q0 : right context
lexical:  f  o  x  +  0  s  #
surface:  f  o  x  0  e  s  #
Here’s the way we’ll write this ‘epenthesis’
rule in terms of left/right contexts and
underlying/surface pairings, i.e., ‘two levels’:
Underlying or lexical form (level 1):  f o x 0 + s
Surface form (level 2):                f o x e 0 s
0:e <=> x:x __ +:0
0:e <=> x __ +:0
Q: why didn't we just write this rule like this:
+:e <=> x:x __ s:s
6.863J/9.611J SP11
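(A rough sketch of checking the epenthesis constraint over a lexical/surface pair, using the alignment on this slide; only this one rule is checked, and only its context-restriction half – a full two-level system would run all the rule machines plus the lexicon in parallel.)

```python
# Sketch of two-level checking for the epenthesis rule 0:e <=> x:x __ +:0,
# stated over equal-length lexical/surface strings (the '0' padding character is
# a real, one-character symbol, exactly as on the slide). Only the "e may appear
# only in this context" half of the <=> rule is enforced here.
def check_epenthesis(lexical, surface):
    assert len(lexical) == len(surface), "two-level pairs must be the same length"
    pairs = list(zip(lexical, surface))
    for i, (l, s) in enumerate(pairs):
        if (l, s) == ("0", "e"):
            left_ok = i > 0 and pairs[i - 1] == ("x", "x")
            right_ok = i + 1 < len(pairs) and pairs[i + 1] == ("+", "0")
            if not (left_ok and right_ok):
                return False           # e inserted outside the licensed context
        elif l != s and (l, s) != ("+", "0"):
            return False               # only the pairs +:0 and 0:e may differ here
    return True

print(check_epenthesis("fox0+s", "foxe0s"))   # True: the pair on the slide
print(check_epenthesis("cat0+s", "cate0s"))   # False: no e-insertion after t
```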
Why the 'padding'? That is, what's with all those 'zero' characters?
•  The ‘0’ character is exactly 1 (one!) character
long; it is not an empty string! (don’t make this
misinterpretation)
•  This ensures that the lexical/surface string pairs
are exactly the same length
•  This keeps the machine running
‘synchronously’
•  Q: Why do we need to do that…?
•  A: Because we also have the dictionary
(lexicon) finite-state constraints to keep track
of, and here’s our proposed implementation:
6.863J/9.611J SP11