Lecture 4: Walk-ing the walk; talk-ing the talk
Professor Robert C. Berwick
[email protected]
6.863J/9.611J SP11

Smoothing: the Robin Hood Answer (and the Dark Arts)
Steal (probability mass) from the rich
And give to the poor (unseen words)

The Menu Bar
• Administrivia: Lab 2 out
  – Note that you are expected to figure out the 'guts' of the discounting methods & how to implement them
• Today: finish smoothing & word parsing (chapter 3 of the J&M textbook)
• Two-level morpho-phonology (why two levels? – Gold-star idea: match the representation to the task)
• How does it work? Salivation nation; plumbing the Wells
• Two sets of finite-state transducers
• Hunting through an example so that we're not out-foxed
• Spy vs. spies; multiple languages

Smoothing has 2 parts: (1) discounting & (2) backoff
• Deals with events that have been observed zero times
• Generally improves the accuracy of the model
• Implements Robin Hood by answering 2 questions:
  1. Who do we take the probability mass away from, and how much? (= discounting)
  2. How do we redistribute the probability mass? (= backoff)
• Let's cover two of the basic methods for discounting and backoff: Add-1 and Witten-Bell
• The sum of the probability mass must still add up to 1!

Smoothing: the Dark Side of the force…
This dark art is why NLP is taught in the engineering school.
There are actually less hacky smoothing methods, but they require more math and more compute cycles and don't usually work any better. So we'll just motivate and present a few basic methods that are often used in practice.

Add-1 smoothing: the Robin Hood Answer
Steal (probability mass) from the rich; add new 1's to everybody (as if there were a new list V long)
• This method is called Add-1 because we add 1 to all the counts, and divide by V to normalize
• It's really an interpolation of the unigram (or bigram, …) MLE plus the uniform distribution

Simplest fix
• So our estimate for w is now an interpolation (a weighted backoff sum) between the unigram and the uniform distributions:
  (N/(V+N)) · (C(w)/N) + (V/(V+N)) · (1/V) = (C(w)+1)/(N+V)

Example from restaurants: add 1
  p(w_i | w_{i-1}) = (1 + c(w_{i-1}w_i)) / (V + c(w_{i-1}))
Problem: Add-1 adds way too much probability mass to unseen events! (It therefore robs too much from the actually observed events.) V is too big! So how to fix this?

Add-One Smoothing (Laplace)
Observed counts and Add-1 estimates for events xy_ over 26 letters:
  xya:      1  1/3  →  2   2/29
  xyb:      0  0/3  →  1   1/29
  xyc:      0  0/3  →  1   1/29
  xyd:      2  2/3  →  3   3/29
  xye:      0  0/3  →  1   1/29
  …
  xyz:      0  0/3  →  1   1/29
  Total xy: 3  3/3  →  29  29/29

Add-One Smoothing with a real vocabulary
Now suppose we're considering 20,000 word types, not 26 letters:
  see the abacus:  1  1/3  →  2      2/20003
  see the abbot:   0  0/3  →  1      1/20003
  see the abduct:  0  0/3  →  1      1/20003
  see the above:   2  2/3  →  3      3/20003
  see the Abram:   0  0/3  →  1      1/20003
  …
  see the zygote:  0  0/3  →  1      1/20003
  Total (see the): 3  3/3  →  20003  20003/20003
(A 0-count event is a novel event, one that never happened in the training data.)
Here: 19,998 novel events, with total estimated probability 19,998/20,003. So add-one smoothing thinks we are extremely likely to see novel events, rather than words we've seen in the training data. It thinks this only because we have a big dictionary: 20,000 possible events. Is this a good reason?
And p(food | Chinese) = 0.52 dropped to 0.052!!!
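To make the add-1 computation concrete, here is a minimal sketch of the bigram formula above in Python. The counts and vocabulary size are made up for illustration, not the actual restaurant corpus, and the function name is mine.

```python
from collections import Counter

def add_one_bigram(prev, word, bigram_counts, unigram_counts, V):
    """Add-1 (Laplace) estimate: p(word | prev) = (1 + c(prev, word)) / (V + c(prev))."""
    return (1 + bigram_counts[(prev, word)]) / (V + unigram_counts[prev])

# Toy counts, invented for illustration.
unigram_counts = Counter({"Chinese": 158})
bigram_counts = Counter({("Chinese", "food"): 82})
V = 1500  # hypothetical vocabulary size

print(add_one_bigram("Chinese", "food", bigram_counts, unigram_counts, V))    # seen bigram, heavily discounted
print(add_one_bigram("Chinese", "zygote", bigram_counts, unigram_counts, V))  # unseen bigram still gets 1/(V + c(prev))
```

Because V appears in the denominator, the seen bigram's estimate collapses toward the unseen ones as the vocabulary grows, which is exactly the complaint on the next slide.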
Serious problems with the Add-1 method
1. It assigns far too much probability to rare n-grams as V (the size of the vocabulary, i.e., the # of distinct words) grows – the amount of probability mass we steal is too large!
2. All unseen n-grams are smoothed in the same way – how we redistribute this probability mass is also wrong
• For bigrams, about half the probability mass goes to unseen events (e.g., for V = 50,000 and corpus size N = 5 x 10^9, V^2 = 2.5 x 10^9)
• So, we must reduce the count assigned to unseen events to something waaay less than 1…

Diversity Smoothing
One fix: just how likely are novel events?
[Bar chart: counts from the Brown Corpus (N approx. 1 million tokens), plotting n_r x r, the number of tokens contributed by types seen r times; the most frequent types are "the" (69,836) and EOS (52,108)]
  n_1 x 1: singletons (occur once), e.g., abaringe, Babatinde, cabaret …
  n_2 x 2: doubletons (occur twice), e.g., aback, Babbitt, cabanas …
  n_3 x 3: e.g., Abbas, babel, Cabot …
  n_4 x 4: e.g., abdominal, Bach, cabana …
  n_5 x 5: e.g., aberrant, backlog, cabinets …
  n_6 x 6: e.g., abdomen, bachelor, Caesar …
  n_0 x 0: novel words (in the dictionary V but never occurring in the corpus)
  Σ_r n_r = total # of types = T; Σ_r (n_r x r) = total # of tokens = N
  (so n_2 = # of doubleton types, n_2 x 2 = # of doubleton tokens)

Witten-Bell Smoothing/Discounting Idea
If T/N is large, we've seen lots of novel types in the past, so we expect lots more.
• Imagine scanning the corpus in order.
• Each type's first token was novel.
• So we saw T novel types.
  unsmoothed → smoothed
  doubletons:  2/N → 2/(N+T)
  singletons:  1/N → 1/(N+T)
  novel words: 0/N → (T/(N+T)) / n_0
Intuition: when we see a new type w in training, ++count(w) and ++count(novel). So p(novel) is estimated as T/(N+T), divided among the n_0 = Z specific novel types.

Witten-Bell discounting: in other words
• A zero-frequency n-gram is simply an event that hasn't happened yet
• When we do see it, it will be for the first time
• We get an MLE estimate of this first-time frequency by counting the # of n-gram types in the training data (the # of types is just the number of first-time observed n-grams)
• So the total probability mass of all the zero-count n-grams is estimated as this count of types T (= first-time n-grams) normalized by the number of tokens N plus the number of types T:
  Σ_{i: c_i = 0} p*_i = T / (N + T)
• Intuition: a training corpus consists of a series of events, one event for each token and one event for each new type; this formula gives a kind of MLE for the probability of a new-type event occurring

Witten-Bell discounting
• So, the answer to Robin Hood question #1 is:
  – We take away T rather than V from the rich (and note that the # of word types actually seen will be very much smaller than the # of all possible words)
• Now we must also answer Robin Hood question #2: how do we redistribute this wealth?

2nd question: re-allocate the probability mass we just took away from the total
• Let Z be the total # of types with count 0
• Divide the probability mass equally: each gets 1/Zth of this mass, i.e.,
  p*_i = T / (Z(N + T))   if c_i = 0
• Compare this correction to what Add-1 does: Add-1 gives each unseen type (1/V) x (V/(V+N)) = 1/(V+N), whereas Witten-Bell gives each unseen type (1/Z) x (T/(T+N))
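A minimal sketch of the unigram version of this bookkeeping (my own illustration, with invented names and data, not the Lab 2 starter code): compute N, T, and Z from a token list, then return the Witten-Bell estimates for seen and unseen types.

```python
from collections import Counter

def witten_bell_unigrams(tokens, vocabulary):
    """Witten-Bell unigram estimates over a fixed vocabulary.
    Seen type w:   p*(w) = c(w) / (N + T)
    Unseen type w: p*(w) = T / (Z * (N + T))
    where N = # tokens, T = # observed types, Z = # vocabulary types never observed."""
    counts = Counter(tokens)
    N = sum(counts.values())   # total tokens
    T = len(counts)            # observed types
    Z = len(vocabulary) - T    # unseen types
    def p(w):
        c = counts.get(w, 0)
        return c / (N + T) if c > 0 else T / (Z * (N + T))
    return p

# Tiny made-up example: N = 4, T = 3, Z = 2.
vocab = {"the", "cat", "sat", "zygote", "abacus"}
p = witten_bell_unigrams(["the", "cat", "sat", "the"], vocab)
print(p("the"), p("zygote"))
# Seen mass N/(N+T) plus unseen mass T/(N+T) still sums to 1 (up to float rounding):
print(sum(p(w) for w in vocab))
```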
• And as usual we must correspondingly reduce (discount) the probability of all the seen items:
  p*_i = c_i / (N + T)   if c_i > 0
• Note that the total probability mass still adds up to 1

Witten-Bell smoothing: the Robin Hood Answer
Steal (probability mass) from the rich:
  Σ_i c_i = N
  Σ_{i: c_i > 0} p*_i = (1/(N+T)) Σ_i c_i = N/(N+T)
  Σ_{i: c_i = 0} p*_i = T/(N+T)
Summed together: N/(N+T) + T/(N+T) = 1

We apply this idea to bigrams (you will do it for trigrams)
• Condition the type counts on the word history, e.g., bigrams
• To compute the probability of a bigram w_{i-1}w_i we haven't seen before (a first-time bigram):
  – Compute the probability of a new bigram starting with that particular w_{i-1}
• Words that occur in a larger # of bigrams will supply a larger 'unseen bigram' estimate than words that do not appear in a lot of different bigrams
• Then condition T and N, the bigram types and tokens, on the previous word w_{i-1}, as follows:

Condition p(w_i) on the previous word, w_{i-1}
  Σ_{i: c(w_{i-1}w_i) = 0} p*(w_i | w_{i-1}) = T(w_{i-1}) / (N(w_{i-1}) + T(w_{i-1}))
Let Z(w_{i-1}) be the total # of bigrams with that first word that have count 0 (e.g., "chinese creampuffs"). Each such bigram gets 1/Zth of the redistributed probability:
  p*(w_i | w_{i-1}) = T(w_{i-1}) / (Z(w_{i-1}) (N(w_{i-1}) + T(w_{i-1})))   if c(w_{i-1}w_i) = 0
For non-zero bigrams, we discount as before, e.g., "chinese food":
  p*(w_i | w_{i-1}) = c(w_{i-1}w_i) / (c(w_{i-1}) + T(w_{i-1}))   if c(w_{i-1}w_i) > 0
(the bigram estimate is reduced by exactly this extra T(w_{i-1}) in the denominator)

Example: using T(w) & Z values
# of bigram types T(w) following each word in the restaurant example (95 different bigrams follow I; 76 different bigrams follow want; etc.):
  I: 35   Chinese: 20   want: 14   food: 15   to: 42   lunch: 8   eat: 9
• Recall that the bigram "Chinese food" has a nonzero count of 82 and c(Chinese) is 158, so the MLE bigram estimate is p = 82/158 = 0.52
• The T-adjusted bigram estimate is:
  p*(food | Chinese) = c(Chinese food) / (c(Chinese) + T(Chinese)) = 82/(158+20) ≈ 0.46
• Compared to Add-1 (which gave 0.052), this revised estimate stays much closer to the MLE of 0.52 – it is lowered only because "Chinese" might be followed by something not yet seen, rather than by "food" (or the other words already observed)

Working with T and Z values
• The more promiscuous a word – the more types observed to follow it – the more probability mass we reserve for unseen words that might follow it, and the more we lower its other MLE estimates
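A sketch of the conditioned version (my own; the function name, vocabulary size, and toy counts are invented, so T(Chinese) comes out as 2 here rather than the slide's 20):

```python
from collections import Counter

def witten_bell_bigram(prev, word, bigram_counts, vocab_size):
    """Witten-Bell bigram estimate conditioned on the previous word.
    Seen:   p*(w | prev) = c(prev, w) / (N(prev) + T(prev))
    Unseen: p*(w | prev) = T(prev) / (Z(prev) * (N(prev) + T(prev)))"""
    followers = {w: c for (p, w), c in bigram_counts.items() if p == prev}
    N_prev = sum(followers.values())   # tokens of bigrams starting with prev (= c(prev))
    T_prev = len(followers)            # distinct words seen after prev
    Z_prev = vocab_size - T_prev       # words never seen after prev
    c = followers.get(word, 0)
    if c > 0:
        return c / (N_prev + T_prev)
    return T_prev / (Z_prev * (N_prev + T_prev))

# Invented counts: only two followers of "Chinese", totalling c(Chinese) = 158.
bigram_counts = Counter({("Chinese", "food"): 82, ("Chinese", "restaurant"): 76})
print(witten_bell_bigram("Chinese", "food", bigram_counts, vocab_size=1500))        # seen, discounted
print(witten_bell_bigram("Chinese", "creampuffs", bigram_counts, vocab_size=1500))  # unseen, tiny but nonzero
```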
But there's one more twist…backoff!
1. Why are we treating all novel events as the same?
2. What happens if the bigrams themselves have count 0?
   p(zygote | see the) = p(baby | see the)???
Suppose both trigrams have 0 count. But: doesn't "baby" beat out "zygote" as a unigram? Doesn't "the baby" beat out "the zygote" as a bigram? So shouldn't "see the baby" beat "see the zygote"? How can we factor this knowledge in?

Use the Backoff, Luke!
• Hold out probability mass for novel events
• But divide up this mass unevenly, in proportion to a backoff probability (like the interpolation scheme, but now the weights depend on the backoff probabilities)
• What's a backoff probability?
• When defining p(z|xy), for novel z, it is p(z|y) (i.e., the bigram probability estimate), which is itself the weighted average of the MLE estimate for the bigram and the unigram estimate for z
• But what if the unigram estimate p(z) is itself 0?

Back-off Smoothing
If p(z|y) is not observed, we back off to the more frequent p(z)
• Suppose p(z) (i.e., the unigram probability) is not observed – do we need a back-off probability for it?
• Yes; it is just the 0-gram uniform estimate, 1/V
• So use the uniform distribution, 1/V, and interpolate with the unigram MLE estimate as in the Add-1 scheme, selecting weights λ, (1–λ) that sum to 1. For Witten-Bell, this means using T/(T+N) and N/(T+N) as the weights
• Let's put all this together…compare this weighting scheme to our previous one for Add-1 smoothing…

Witten-Bell backoff for bigrams
– The backoff distribution for a bigram is itself weighted between the MLE unigram estimate and the 0-gram uniform distribution:
  p_backoff(w_i) = p_WB(w_i) = (N/(N+T)) · p_MLE(w_i) + (T/(N+T)) · (1/V)
If the MLE estimate is 0, we simply don't use it and fall back on the uniform estimate!
Now we must combine this with our bigram estimate, recursively, interpolating between unigram and bigram and backing off if an estimate is not available, just as we did with the Add-1 idea…

Modified Witten-Bell: from bigrams to unigrams to uniform estimates
  p_WB(w_i | w_{i-1}) = (c(w_{i-1}) / (c(w_{i-1}) + T(w_{i-1}))) · p_MLE(w_i | w_{i-1})
                      + (T(w_{i-1}) / (c(w_{i-1}) + T(w_{i-1}))) · p_backoff(w_i)
  p_backoff(w_i) = (N/(N+T)) · p_MLE(w_i) + (T/(N+T)) · (1/V)
• This is competitive with the best methods known
• As to what works most smoothly in practice? 'It depends' – you'll have to try it out (see Lab 2)
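To tie the recursive scheme together, here is a compact sketch of the interpolated estimator on the slide above: bigram MLE, backed off to a Witten-Bell-smoothed unigram, backed off to the uniform distribution. It is my own illustration, not the Lab 2 starter code; the function names, vocabulary size, and toy sentence are invented.

```python
from collections import Counter

def make_wb_estimator(tokens, vocab_size):
    """Returns p(word, prev): modified Witten-Bell, bigram -> unigram -> uniform."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    N = sum(unigrams.values())
    T = len(unigrams)
    followers = {}                      # followers[prev] = set of words seen after prev
    for (p, w) in bigrams:
        followers.setdefault(p, set()).add(w)

    def p_backoff(word):
        # unigram MLE interpolated with the uniform distribution 1/V
        p_mle = unigrams[word] / N
        return (N / (N + T)) * p_mle + (T / (N + T)) * (1.0 / vocab_size)

    def p(word, prev):
        c_prev = unigrams[prev]
        T_prev = len(followers.get(prev, ()))
        if c_prev == 0 or T_prev == 0:
            return p_backoff(word)      # no usable history: back off all the way
        lam = c_prev / (c_prev + T_prev)
        p_mle_bigram = bigrams[(prev, word)] / c_prev
        return lam * p_mle_bigram + (1 - lam) * p_backoff(word)

    return p

p = make_wb_estimator("see the baby see the zygote see the baby".split(), vocab_size=1000)
print(p("baby", "the"), p("zygote", "the"), p("abacus", "the"))  # baby > zygote > abacus, all nonzero
```

Note how the answer to the "baby vs. zygote" puzzle falls out: both get some bigram-level mass, but the backoff unigram term keeps the better-attested word ahead, and a never-seen word still gets a small uniform share.

Now let's look at word parsing…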
The central problem: parsing
• The problem of mapping from one representational level to another is called parsing
• If there is more than 1 possible outcome (the mapping is not a function), then the input expression is ambiguous:
  dogs → dog-Noun-plural or dog-Verb-present-tense
• We'll begin with word parsing. Why?

Why words? (Engineer's view)
• Prerequisite for higher-level processing
• Cross-linguistic generality
• Applications:
  • electronic dictionaries (Franklin)
  • spelling checkers (ispell)
  • machine translation (Systran)
• Filthy lucre

Why words? (linguistics)
• Test the representation
• Test the resource requirements
• Find out about unexpected predictions
• Filthy lucre
• Linguistic analysis of words: morphology (morpho-logos)
Start with words: they illustrate all the problems (and solutions) in NLP – in particular, the problems of parsing, ambiguity, and computational efficiency (as well as the problem of how people do it)

Computational morphology
Parsing words (analysis): cats = cat + N(oun) + PL(ural)
Generation runs the other way. Both directions are ambiguous:
  leaves → leaf + N + Pl, leave + N + Pl, or leave + V + Sg3
  hang + V + Past → hanged or hung
Used in: traditional NLP applications, finding word boundaries (e.g., Latin, Chinese), text-to-speech (boathouse), document retrieval

What knowledge do we need to do this? Two kinds are needed:
1. Knowledge of a stem and its possible endings:
   dogs → dog + s
   cats → cat + s
   doer → do + er
   but not: beer → be + er
2. Knowledge of spelling changes in context:
   bigger → big + er (double the g)
   flies → flie + s?

Two kinds of knowledge = two kinds of representations/constraints
What knowledge is needed?

Morphotactics
Words are composed of smaller elements that must be combined in a certain order but not in others:
  piti-less-ness is English
  piti-ness-less is not English

Phonological alternations
The shape (sometimes, as written) of an element may vary depending on its context:
  pity is realized as piti in pitilessness
  die becomes dy in dying

What knowledge is needed?
Linear arrangement of morphemes – beads on a string:
  Lebensversicherungsgesellschaftsangestellter
  Leben+s+versicherung+s+gesellschaft+s+angestellter
  life+CmpAug+insurance+CmpAug+company+CmpAug+employee
What would be a good computational model for this?

Stemming and Search
• Google is more successful than other search engines in part because it returns better, i.e. more relevant, information
  • its ranking algorithm (a trade secret) is called PageRank
• SEO (Search Engine Optimization) is a topic of considerable commercial interest
  • Goal: how to get your webpage listed higher by PageRank
  • Techniques:
    • e.g. writing keyword-rich text in your page
    • e.g. listing morphological variants of keywords
• Google does not use stemming everywhere
  • and it does not reveal its algorithm, to prevent people from optimizing their pages

What does Google do?
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>BBC - Health - Healthy living - Dietary requirements</title><meta content="The foods that can help to prevent and treat certain conditions" name="description"/>

We can't just store it all…or can we? Why?
1. A Sagan number of word forms – Turkish:
   Uygarlaştıramadıklarımızdanmışsınızcasına
   '(behaving) as if you are among those whom we could not civilize'
   behave+BEC+CAUS+NAB+PART+PL+P1PL+AB+PA+2PL+AsIF
2. What about 'exotic' languages – like English?
   anti-missile
   anti-anti-missile
   anti-anti-anti-missile…
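The anti-missile family makes the point concrete: we cannot list every form, but a finite description covers them all, since the set is a regular language (zero or more "anti-" prefixes on "missile"). A throwaway illustration of my own, using a Python regular expression for that pattern:

```python
import re

# {missile, anti-missile, anti-anti-missile, ...} is infinite but regular.
antimissile = re.compile(r"^(anti-)*missile$")

for w in ["missile", "anti-missile", "anti-anti-anti-missile", "anti-missile-anti"]:
    print(w, bool(antimissile.match(w)))   # last one is rejected
```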
Why do this? Because current theories argue that language is mostly (even entirely) about lexical (word) features…
  One cat is/are on the table
  Zero cats…

Solution: lexical finite-state machine
• Bidirectional: generation or analysis
• Compact and fast
• Comprehensive systems have been built for over 40 languages: English, German, Dutch, French, Italian, Spanish, Portuguese, Finnish, Russian, Turkish, Japanese, Korean, Basque, Greek, Arabic, Hebrew, Bulgarian, …
• Even usable for translation
Example: a finite-state mapper relates the citation form plus inflection codes to the inflected form:
  lexical tape:  v o u l o i r +IndP +SG +P3
  surface tape:  v e u t
  vouloir +IndP +SG +P3  ↔  veut

Our solution: 2 sets of finite-state automata (aka "kimmo")

We can get more sophisticated…

Morphology
Inflectional morphology – basically no change in part-of-speech category; sometimes meaningful:
  agreement features (person, number, gender, …): movies, blonde, actress, one dog, two dogs, zero …
  irregular examples: appendices (from appendix), geese (from goose)
  case: she/her, who/whom
  comparatives and superlatives: happier/happiest
  tense: drive/drives/drove (-ed)/driven
Derivational morphology – basically part-of-speech category changing:
  nominalization: formalization, informant, informer, refusal, lossage
  deadjectivals: weaken, happiness, simplify, formalize, slowly, calm
  deverbals: see nominalization, readable, employee
  denominals: formal, bridge, ski, cowardly, useful

Morphology and Semantics
Morphemes: units of meaning
  suffixation: x employ y → employee picks out y, employer picks out x; x read y → readable picks out y
  prefixation: undo, redo, un-redo, encode, defrost, asymmetric, malformed, ill-formed, pro-Chomsky

What about other processes?
• Stem/root: the core meaning unit (morpheme) of a word
• Affixes: bits and pieces that combine with the stem to modify its meaning and grammatical functions
  • Prefix: un-, anti-, etc.
  • Suffix: -ity, -ation, etc.
  • Infix:
    • Tagalog: um + hingi → humingi ('borrow')
    • Any infixes in a non-exotic language like English? Here's one: un-f******-believable

OK, what knowledge is needed? How do we represent it?
Two parts to the 'what':
  1. Which units can glue to which others (roots and affixes, or stems and affixes) = morphotactics
  2. What spelling (orthographic) changes occur – like dropping the e in chase + ed (a morpheme's shape depends on its context – like the plural)
IDEA: MODEL EACH AS A FINITE-STATE MACHINE, then combine.
WHY is this a good model? HOW should we 'combine' these two processes?

How: we will modify the idea of a finite-state automaton (fsa)
A finite-state automaton (FSA) is a quintuple (Q, Σ, δ, I, F) where:
  Q is a finite set of states
  Σ is a finite set of terminal symbols, the alphabet
  δ ⊆ Q x Σ x Q is the transition mapping (or next-state relation)
  I ⊆ Q is the set of initial states
  F ⊆ Q is the set of final states
If δ is a function from Q x Σ to Q, the FSA is deterministic; otherwise it is nondeterministic.
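To ground the definition, here is a tiny sketch of my own (the state names, alphabet, and word list are invented) that encodes the quintuple directly and checks acceptance, for a toy machine recognizing an adjective root optionally followed by -er or -est. For brevity the symbols here are whole morphemes rather than single characters.

```python
# FSA as a quintuple (Q, Sigma, delta, I, F).
Q = {"start", "root", "done"}
SIGMA = {"clear", "cool", "big", "er", "est"}
DELTA = {("start", "clear"): "root", ("start", "cool"): "root", ("start", "big"): "root",
         ("root", "er"): "done", ("root", "est"): "done"}
I = "start"
F = {"root", "done"}   # a bare root is also a well-formed word

def accepts(symbols):
    """Deterministic case: delta is a (partial) function from Q x Sigma to Q."""
    state = I
    for sym in symbols:
        if (state, sym) not in DELTA:
            return False   # no transition: reject
        state = DELTA[(state, sym)]
    return state in F

print(accepts(["cool", "er"]))   # True
print(accepts(["er", "cool"]))   # False: the affix cannot precede the root
print(accepts(["big"]))          # True
```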
Why finite-state machines?
• Minimal model
• Fast
• Note what defines a finite-state machine
• Linear concatenation of equivalence classes of words (beads on a string) – the minimal machinery needed to represent precedence

More ancient linguistic corpus analysis
Rulon Wells (1947, pp. 81-82): a word A belongs to the class determined by the environment ___ X if AX is either an utterance or occurs as a part of some utterance.
AKA: distributional analysis.
This turns out to be algebraically correct, but Wells did not know the math to take the next step…

Under construction
  Palin will lose the election.
  Palin could lose the election.
  ____ z   (z = the tail following word x)
will ≡ could. Why are the 2 words in the same 'class'? Because they have the same tails: they are in the same equivalence class. The class = a state in an fsa.
More precisely: we split equivalence classes as the data comes in (cf. ε-transitions).

Any fsa must sometimes fail to classify an arbitrarily long string of this form properly… Why???
  aabb, aaabbb, aaaabbbb, …, a^500 b^500

Remembrance of languages past…
• A finite-state automaton is defined by having a finite number of states…. (duh)
• These define a finite # of equivalence classes or bins, i.e., an equivalence relation R such that for all strings x, y over the defined alphabet Σ, either xRy or else not xRy
• (This equivalence relation is of finite rank)
• No fsa can properly bin a^n b^n, for all n ≥ 1

Modeling English examples – how?
Finite-state machine for stems + endings
• As an example, consider adjectives:
  big, bigger, biggest
  cool, cooler, coolest, coolly
  red, redder, reddest
  clear, clearer, clearest, clearly, unclear, unclearly
  happy, happier, happiest, happily
  unhappy, unhappier, unhappiest, unhappily
  real, unreal, silly
A first machine: Adjective → er → #, a pure model of linear concatenation.
Additions we need:
1. Obviously, we need to add more states: real → ?? realer
2. We also need to add output (e.g., Adj, comparative), so we must make this machine a transducer
3. Finally, we must add a second set of machines to check for valid roots and endings character by character, as well as possible spelling changes (big-bigger)

Adding output to an fsa to create a transducer
  cool / Adjective   er / Comparative   #
Q: So how do we add output to these sorts of devices?
A: Finite-state automaton → finite-state transducer

Morpho-phonology is finite-state: regular relations
An example regular relation – a pairing of sets of strings, each of which is finite-state – described by an fst:
  {((ab)^n, (ba)^n) : n > 0}, i.e., {(ab, ba), (abab, baba), (ababab, bababa), …}
("Interchanges" a's and b's: a small machine with arcs a:b and b:a.)
Danger, danger, Will Robinson: finite-state transducers (FSTs) are a bit different from ordinary finite-state automata.
Note that the relation specifies no directionality between X and Y (there is no input and no output, so it is neutral between parsing & generation).
What are the properties of regular (rational) relations? How do we implement these in our word-parsing application?

Properties of regular (rational) relations/transductions
• Key differences (important for implementation):
  1. Not closed under intersection (unlike fsa's), unless they obey the same-length constraint
  2. Closed under composition
  3. Cannot always be determinized (unlike nondeterministic fsa's)
Next: a definition & example of composition, then why they are not closed under intersection & why this matters.
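Before turning to why intersection fails, here is the "interchange" relation above written as a same-length pair-acceptor, a sketch of my own (the state names and transition table are invented). The machine steps over lexical/surface character pairs, so the very same table answers questions in either direction, which is the "no input and no output" point.

```python
# Two-state pair-acceptor for the relation {((ab)^n, (ba)^n) : n > 0}.
# Transitions are labelled with (lexical_char, surface_char) pairs.
DELTA = {("S", ("a", "b")): "B",   # lexical a paired with surface b
         ("B", ("b", "a")): "S"}   # lexical b paired with surface a
START, FINALS = "S", {"S"}

def in_relation(lexical, surface):
    if len(lexical) != len(surface) or not lexical:   # same-length relation, n > 0
        return False
    state = START
    for pair in zip(lexical, surface):
        if (state, pair) not in DELTA:
            return False
        state = DELTA[(state, pair)]
    return state in FINALS

print(in_relation("abab", "baba"))   # True: (abab, baba) is in the relation
print(in_relation("abab", "abab"))   # False
```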
Unlike FSAs, FSTs (regular relations) are not closed under intersection (gulp!)
Regular relation 1: a pair of finite-state languages {a^n, b^n c*}
  Claim: this is a regular relation (arcs a:b, then ε:c loops)
Regular relation 2: a similar pair of finite-state languages {a^n, b* c^n}
  Claim: this too is a regular relation
But what is the intersection of (1) and (2)? Ans: {a^n, b^n c^n}. And this is not a regular relation, by definition, because b^n c^n is not a finite-state (regular) language. (Why not? Salivate appropriately as the bell rings…)
So, we use this kind of machine to model the root + ending constraints…as a set of underlying/surface pairs.

Modeling English Adjectives (from section 3.3 of the textbook)
[fsa (3.5): Root → Affix, with a Root dictionary (Noun) and an Affix dictionary, e.g., fox + s (plural)]
Examples:
  big, bigger, biggest, *unbig
  cool, cooler, coolest, coolly
  red, redder, reddest, *redly
  clear, clearer, clearest, clearly, unclear, unclearly
  happy, happier, happiest, happily
  unhappy, unhappier, unhappiest, unhappily
  real, *realer, *realest, unreal, really
The valid root-plus-affix combinations are given by a finite-state transducer. The initial machine is overly simple: we need more classes to make finer-grained distinctions, e.g. to rule out *unbig.

Modeling English Adjectives: divide adjectives into classes
  adj-root2: big, bigger, biggest, *unbig
  adj-root2: cool, cooler, coolest, coolly
  adj-root2: red, redder, reddest, *redly
  adj-root1: clear, clearer, clearest, clearly, unclear, unclearly
  adj-root1: happy, happier, happiest, happily
  adj-root1: unhappy, unhappier, unhappiest, unhappily
  adj-root1: real, *realer, *realest, unreal, really

However… the starred forms do occur:
  uncooler: "Smoking uncool and getting uncooler." – google: 22,800 hits (2006), 10,900 (2005)
  *realer – google: 3,500,000 (2006), 494,000 (2005)
  *realest – google: 795,000 (2006), 415,000 (2005)
Of course, the full automaton for roots + endings is, uhm, big+er! We'll call this the dictionary or Lexicon.
And this illustrates why we need a second set of FST constraints… what are they for?

A 2nd set of FSTs is required to specify correct "spelling changes"
Describing (surface, underlying) forms as pairs of languages (a language being just a set of strings):
  String 1 (lexical): f a t +Adj +Comp
  String 2 (surface): f a t t e r
  Relation(String 1, String 2) – forget about the terms input and output

What are the 'spelling change' rules for English?
• cat + s, …
• big + er, …
• fly + s, … (what about that last one?)
• fox + s
Any others? How do we implement these as FSTs? Let's try fox+s / foxes…

Outfoxed
Align the lexical and surface strings character by character, padding with the special symbol 0:
  lexical: f o x + 0 s #
  surface: f o x 0 e s #
And acceptance (cook until done): let's convert this to a machine that checks whether this pair is OK…
[FST sketch: states 1-6, with arcs @:@ (any other pair), Csib:Csib (a sibilant pair), +:0, 0:e, s:s, #:#; pairs that start the rule's context but do not finish it lead to reject]
Standard notation for the rule: +:0 <=> x:x __ 0:e
Important! The fst is designed to pass input not matching any arcs of the rule through unmodified (rather than fail).

Outfoxed!
Insert 'e' before non-initial z, s, x ("epenthesis")
[FST sketch with arcs f:f, o:o, x:x, +:0, 0:e, s:s, #:#; this implements a context-sensitive rule – the path q0 to q2 checks the left context, and q3 back to q0 checks the right context]
Here's the way we'll write this 'epenthesis' rule in terms of left/right contexts and underlying/surface pairings, i.e., 'two levels':
  Underlying or lexical form (level 1): f o x 0 + s
  Surface form (level 2):               f o x e 0 s
  0:e <=> x:x __ +:0   (abbreviated: 0:e <=> x __ +:0)
Q: why didn't we just write this rule like this: +:e <=> x:x __ s:s ?
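Here is a rough Python sketch (mine, not the Lab 2 kimmo code) of an epenthesis checker in the same pair-acceptor spirit, using the second alignment above and generalizing x to the sibilants s, x, z as in the "insert e before non-initial z, s, x" description. A real two-level rule would be compiled into an FST over pairs; this linear scan is only meant to show what the rule licenses and requires.

```python
SIBILANTS = {"s", "x", "z"}

def epenthesis_ok(lexical, surface):
    """Two-level check of the rule  0:e <=> Sib:Sib __ +:0
    over equal-length, 0-padded lexical/surface strings: an inserted e (the pair 0:e)
    is allowed exactly when it sits between a sibilant pair on the left
    and a morpheme-boundary pair +:0 on the right."""
    if len(lexical) != len(surface):
        return False
    pairs = list(zip(lexical, surface))
    for i, (lex, srf) in enumerate(pairs):
        left = pairs[i - 1] if i > 0 else ("#", "#")
        right = pairs[i + 1] if i + 1 < len(pairs) else ("#", "#")
        context = left[0] in SIBILANTS and left[0] == left[1] and right == ("+", "0")
        if (lex, srf) == ("0", "e") and not context:
            return False   # e inserted where the rule does not license it
        if context and (lex, srf) != ("0", "e"):
            return False   # context present but no e inserted (the <=> direction)
    return True

print(epenthesis_ok("fox0+s#", "foxe0s#"))   # True:  foxes, e correctly inserted
print(epenthesis_ok("fox0+s#", "fox00s#"))   # False: *foxs, e missing in the right context
print(epenthesis_ok("cat+s#",  "cat0s#"))    # True:  cats, no epenthesis needed
```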
Why the 'padding'? That is, what's with all those 'zero' characters?
• The '0' character is exactly 1 (one!) character long; it is not an empty string! (Don't make this misinterpretation.)
• This ensures that the lexical/surface string pairs are exactly the same length
• This keeps the machines running 'synchronously'
• Q: Why do we need to do that…?
• A: Because we also have the dictionary (lexicon) finite-state constraints to keep track of, and here's our proposed implementation:
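The slide's architecture diagram for that implementation does not survive in this transcript. As a stand-in, here is a rough sketch of my own of the idea: run the lexicon automaton over the lexical tape and every spelling-rule pair-acceptor over the (lexical, surface) pairs, advancing all of them one position at a time. All names and the (start, finals, delta) encoding are invented for illustration.

```python
def lockstep_accept(lexical, surface, machines):
    """machines: list of (start_state, final_states, delta, reads_pairs), where
    delta maps (state, symbol) -> next_state.  The lexicon FSA reads only the
    lexical character; each spelling-rule machine reads the (lexical, surface)
    pair.  The 0-padding keeps both tapes the same length, so every machine
    takes exactly one step per position -- the 'synchronous' part."""
    if len(lexical) != len(surface):
        return False
    states = [m[0] for m in machines]
    for lex, srf in zip(lexical, surface):
        for i, (start, finals, delta, reads_pairs) in enumerate(machines):
            symbol = (lex, srf) if reads_pairs else lex
            if (states[i], symbol) not in delta:
                return False   # any machine blocking rejects this lexical/surface pair
            states[i] = delta[(states[i], symbol)]
    return all(states[i] in machines[i][1] for i in range(len(machines)))
```

Running the rule machines jointly like this is, in effect, intersecting them, and that is safe here precisely because the padding keeps everything same-length (point 1 on the closure-properties slide).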