Machine Translation - 4
Autumn 2008, Lecture 19, 10 Sep 2008

IBM Model 1 Recap
- IBM Model 1 allows for an efficient computation of translation probabilities.
- No notion of fertility: it is possible that the same English word is the best translation for all foreign words.
- No positional information: depending on the language pair, there might be a tendency that words occurring at the beginning of the English sentence are more likely to align to words at the beginning of the foreign sentence.

IBM Model 2
Model parameters:
- t(f_j | e_{a_j}) = translation probability of foreign word f_j given the English word e_{a_j} that generated it
- d(i | j, l, m) = distortion probability, or probability that f_j is aligned to e_i, given l and m

IBM Model 3
Model parameters:
- t(f_j | e_{a_j}) = translation probability of foreign word f_j given the English word e_{a_j} that generated it
- r(j | i, l, m) = reverse distortion probability, or probability of the position of f_j, given its alignment to e_i, l, and m
- n(e_i) = fertility of word e_i, or the number of foreign words aligned to e_i
- p1 = probability of generating a foreign word by alignment with the NULL English word

IBM Model 3
IBM Model 3 offers two additional features compared to IBM Model 1:
- Fertility: how likely is an English word e to align to k foreign words?
- Positional information (distortion): how likely is a word in position i to align to a word in position j?

IBM Model 3: Fertility
- The best Model 1 alignment could be that a single English word aligns to all foreign words.
- This is clearly not desirable, and we want to constrain the number of words an English word can align to.
- Fertility models a probability distribution that word e aligns to k words: n(k, e).
- Consequence: translation probabilities can no longer be computed independently of each other.
- IBM Model 3 has to work with full alignments; note that there are up to (l+1)^m different alignments.

IBM Model 3
Generative story:
1) Choose fertilities for each English word.
2) Insert spurious words according to the probability of being aligned to the NULL English word.
3) Translate English words into foreign words.
4) Reorder words according to the reverse distortion probabilities.

IBM Model 3
- For Models 1 and 2, we can compute exact EM updates.
- For Models 3 and 4, exact EM updates cannot be efficiently computed.
- Use the best alignments from previous iterations to initialize each successive model.
- Explore only the subspace of potential alignments that lies within the same neighborhood as the initial alignments.

IBM Model 4
Model parameters: same as Model 3, except that it uses a more complicated model of reordering (for details, see Brown et al. 1993).
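The recap above notes that Model 1 allows an efficient computation of translation probabilities even though there are up to (l+1)^m alignments: under Model 1 the sum over all alignments factorizes into a product, over foreign positions, of sums over English positions. The following is a minimal Python sketch of that computation; the toy translation-table values, the sentence pair, and the normalization constant epsilon are made-up illustrations, not values from the lecture.

# Sketch of the IBM Model 1 likelihood P(f | e). The sum over all
# (l+1)^m alignments factorizes as a product over foreign positions
# of sums over English positions (including NULL).
def model1_likelihood(f_words, e_words, t, epsilon=1.0):
    """P(f | e) = epsilon / (l+1)^m * prod_j sum_i t(f_j | e_i)."""
    e_with_null = ["NULL"] + e_words
    l, m = len(e_words), len(f_words)
    prob = epsilon / (l + 1) ** m
    for f in f_words:
        prob *= sum(t.get((f, e), 0.0) for e in e_with_null)
    return prob

# Toy translation table (hypothetical values, for illustration only).
t = {("la", "the"): 0.7, ("maison", "house"): 0.8,
     ("la", "house"): 0.1, ("maison", "the"): 0.1}
print(model1_likelihood(["la", "maison"], ["the", "house"], t))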
IBM Model 1 + Model 3
- Iterating over all possible alignments is computationally infeasible.
- Solution: compute the best alignment with Model 1 and change some of the alignments to generate a set of likely alignments (pegging).
- Model 3 takes this restricted set of alignments as input.

Pegging
Given an alignment a, we can derive additional alignments from it by making small changes:
- Changing a link (j, i) to (j, i')
- Swapping a pair of links (j, i) and (j', i') to (j, i') and (j', i)
The resulting set of alignments is called the neighborhood of a.

IBM Model 3: Distortion
- The distortion factor determines how likely it is that an English word in position i aligns to a foreign word in position j, given the lengths of both sentences: d(j | i, l, m).
- Note that positions are absolute positions.

Deficiency
Problem with IBM Model 3: it assigns probability mass to impossible strings.
- Well-formed string: "This is possible"
- Ill-formed but possible string: "This possible is"
- Impossible strings arise from distortion values that generate different words at the same position.
- Impossible strings can still be filtered out in later stages of the translation process.

Limitations of IBM Models
- Only 1-to-N word mapping
- Handling fertility-zero words (difficult for decoding)
- Almost no syntactic information: word classes, relative distortion
- Long-distance word movement
- Fluency of the output depends entirely on the English language model

Decoding
- How do we translate new sentences?
- A decoder uses the parameters learned on a parallel corpus: translation probabilities, fertilities, distortions.
- In combination with a language model, the decoder generates the most likely translation.
- Standard algorithms can be used to explore the search space (A*, greedy search, ...).
- Similar to the traveling salesman problem.

Three Problems for Statistical MT
- Language model: given an English string e, assigns P(e) by formula. A good English string gets a high P(e); a random word sequence gets a low P(e).
- Translation model: given a pair of strings <f, e>, assigns P(f | e) by formula. If <f, e> look like translations, P(f | e) is high; if they don't, P(f | e) is low.
- Decoding algorithm: given a language model, a translation model, and a new sentence f, find the translation e maximizing P(e) * P(f | e).
(Slide from Kevin Knight)

The Classic Language Model: Word N-Grams
Goal of the language model: choose among
- "He is on the soccer field" vs. "He is in the soccer field"
- "Is table the on cup the" vs. "The cup is on the table"
- "Rice shrine" vs. "American shrine"
- "Rice company" vs. "American company"
(Slide from Kevin Knight)

Intuition of Phrase-Based Translation (Koehn et al. 2003)
The generative story has three steps:
1) Group words into phrases.
2) Translate each phrase.
3) Move the phrases around.

Generative Story Again
1) Group English source words into phrases e_1, e_2, ..., e_n.
2) Translate each English phrase e_i into a Spanish phrase f_j. The probability of doing this is φ(f_j | e_i).
3) Then (optionally) reorder each Spanish phrase. We do this with a distortion probability: a measure of the distance between the positions of a corresponding phrase in the two languages. "What is the probability that a phrase in position X in the English sentence moves to position Y in the Spanish sentence?"
(Slide from Koehn 2008)

Distortion Probability
- The distortion probability is parameterized by a_i - b_{i-1}, where a_i is the start position of the foreign (Spanish) phrase generated by the i-th English phrase e_i, and b_{i-1} is the end position of the foreign (Spanish) phrase generated by the (i-1)-th English phrase e_{i-1}.
- We'll call the distortion probability d(a_i - b_{i-1}), and we'll have a really simple ("stupid") model: d(a_i - b_{i-1}) = α^|a_i - b_{i-1}|, where α is some small constant.

Final Translation Model for Phrase-Based MT
P(F | E) = ∏_{i=1}^{l} φ(f_i | e_i) · d(a_i - b_{i-1})
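Below is a small Python sketch of scoring one phrase segmentation and alignment with this model, using the simple distortion model above. The value of α, the toy phrase table, the Spanish-side positions, and the convention that b_0 = 0 are all illustrative assumptions, not values given in the lecture.

import math

ALPHA = 0.5  # assumed small constant for the distortion model d(x) = ALPHA**|x|

def distortion(a_i, b_prev):
    # d(a_i - b_{i-1}) = ALPHA ** |a_i - b_{i-1}|
    return ALPHA ** abs(a_i - b_prev)

def phrase_model_score(phrase_pairs, phi):
    """phrase_pairs: list of (english_phrase, spanish_phrase, start, end),
    where start/end are the Spanish-side word positions covered by the phrase.
    Returns P(F | E) = prod_i phi(f_i | e_i) * d(a_i - b_{i-1})."""
    log_prob = 0.0
    b_prev = 0  # assumed end position "b_0" before the first phrase
    for e, f, start, end in phrase_pairs:
        log_prob += math.log(phi[(f, e)]) + math.log(distortion(start, b_prev))
        b_prev = end
    return math.exp(log_prob)

# Toy phrase table and a monotone (no reordering) derivation.
phi = {("Maria", "Mary"): 0.9, ("no dió una bofetada", "did not slap"): 0.4}
pairs = [("Mary", "Maria", 1, 1), ("did not slap", "no dió una bofetada", 2, 5)]
print(phrase_model_score(pairs, phi))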
Let's look at a simple example with no distortion.

Phrase-Based MT
- Language model: P(E)
- Translation model: P(F | E)
- How to train the model
- Decoder: finding the sentence E that is most probable

Training P(F | E)
- What we mainly need to train is φ(f_j | e_i).
- Suppose we had a large bilingual training corpus (a bitext), in which each English sentence is paired with a Spanish sentence.
- And suppose we knew exactly which phrase in Spanish was the translation of which phrase in English. We call this a phrase alignment.
- If we had this, we could just count and divide: φ(f | e) = count(f, e) / Σ_f count(f, e).
- But we don't have phrase alignments; what we have instead are word alignments.

Getting Phrase Alignments
To get phrase alignments:
1) We first get word alignments.
2) Then we "symmetrize" the word alignments into phrase alignments.

Model 1 Continued
- Probability of choosing a length and then one of the possible alignments: P(A | E) = ε / (l+1)^m.
- Combining with step 3 (translating each foreign word): P(F, A | E) = ε / (l+1)^m · ∏_j t(f_j | e_{a_j}).
- The total probability of a given foreign sentence F: P(F | E) = Σ_A P(F, A | E).

Decoding
How do we find the best A?

Training Alignment Probabilities
- Step 1: get a parallel corpus.
  - Hansards: Canadian parliamentary proceedings, in French and English.
  - Hong Kong Hansards: English and Chinese.
- Step 2: sentence alignment.
- Step 3: use EM (Expectation-Maximization) to train word alignments.

Step 1: Parallel Corpora
Example from DE-News (8/1/1996):
- English: "Diverging opinions about planned tax reform" / German: "Unterschiedliche Meinungen zur geplanten Steuerreform"
- English: "The discussion around the envisaged major tax reform continues." / German: "Die Diskussion um die vorgesehene grosse Steuerreform dauert an."
- English: "The FDP economics expert, Graf Lambsdorff, today came out in favor of advancing the enactment of significant parts of the overhaul, currently planned for 1999." / German: "Der FDP-Wirtschaftsexperte Graf Lambsdorff sprach sich heute dafuer aus, wesentliche Teile der fuer 1999 geplanten Reform vorzuziehen."
(Slide from Christof Monz)

Step 2: Sentence Alignment
English:
1. The old man is happy.
2. He has fished many times.
3. His wife talks to him.
4. The fish are jumping.
5. The sharks await.
Spanish:
1. El viejo está feliz porque ha pescado muchos veces.
2. Su mujer habla con él.
3. Los tiburones esperan.
Intuition: use length in words or characters, together with dynamic programming, or use a simpler MT model.
Resulting alignment:
- "The old man is happy. He has fished many times." <-> "El viejo está feliz porque ha pescado muchos veces."
- "His wife talks to him." <-> "Su mujer habla con él."
- "The sharks await." <-> "Los tiburones esperan."
Note that unaligned sentences ("The fish are jumping.") are thrown out, and sentences are merged in n-to-m alignments (n, m > 0).
(Slide from Kevin Knight)
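The length-plus-dynamic-programming intuition can be sketched in a few lines of Python. Everything below is a toy: the cost function, the penalty values, and the set of allowed alignment patterns (1-1, 1-0, 0-1, 2-1, 1-2) are simplifying assumptions; a real aligner (for example Gale and Church's length-based method) uses a statistically motivated cost, and this crude version is not guaranteed to reproduce the alignment shown above. The point is only the shape of the dynamic program.

# Toy length-based sentence alignment by dynamic programming.
def length_cost(src_sents, tgt_sents):
    # Crude cost: difference in total word length, plus a flat penalty
    # for anything other than a 1-1 alignment (assumed values).
    src_len = sum(len(s.split()) for s in src_sents)
    tgt_len = sum(len(s.split()) for s in tgt_sents)
    penalty = 0 if (len(src_sents), len(tgt_sents)) == (1, 1) else 3
    return abs(src_len - tgt_len) + penalty

def sentence_align(src, tgt):
    patterns = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]  # allowed alignment types
    INF = float("inf")
    best = {(0, 0): (0, None)}  # (cost so far, backpointer)
    for i in range(len(src) + 1):
        for j in range(len(tgt) + 1):
            if (i, j) not in best:
                continue
            cost, _ = best[(i, j)]
            for di, dj in patterns:
                ni, nj = i + di, j + dj
                if ni > len(src) or nj > len(tgt):
                    continue
                step = cost + length_cost(src[i:ni], tgt[j:nj])
                if step < best.get((ni, nj), (INF, None))[0]:
                    best[(ni, nj)] = (step, (i, j))
    # Trace back the lowest-cost path from the end of both texts.
    alignment, cur = [], (len(src), len(tgt))
    while best[cur][1] is not None:
        prev = best[cur][1]
        alignment.append((src[prev[0]:cur[0]], tgt[prev[1]:cur[1]]))
        cur = prev
    return list(reversed(alignment))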
Step 3: Word Alignments
- It turns out we can bootstrap word alignments from a sentence-aligned bilingual corpus.
- We use the Expectation-Maximization (EM) algorithm.

EM for Training Alignment Probabilities
Toy corpus:
... la maison ... la maison bleue ... la fleur ...
... the house ... the blue house ... the flower ...
- Initially, all word alignments are equally likely, and all P(french-word | english-word) are equally likely.
- "la" and "the" are observed to co-occur frequently, so P(la | the) is increased.
- "house" co-occurs with both "la" and "maison", but P(maison | house) can be raised without limit, to 1.0, while P(la | house) is limited because of "the" (pigeonhole principle).
- The probabilities settle down after another iteration.
- Inherent hidden structure is revealed by EM training!
For details, see:
- Section 24.6.1 in the chapter
- "A Statistical MT Tutorial Workbook" (Knight, 1999)
- "The Mathematics of Statistical Machine Translation" (Brown et al., 1993)
- Software: GIZA++
(Slides from Kevin Knight)

Statistical Machine Translation
With trained parameters, e.g.
- P(juste | fair) = 0.411
- P(juste | correct) = 0.027
- P(juste | right) = 0.020
- ...
a new French sentence can be decoded into possible English translations, to be rescored by the language model.
(Slide from Kevin Knight)

A More Complex Model: IBM Model 3 (Brown et al., 1993)
Generative approach:
- Mary did not slap the green witch
- Mary not slap slap slap the green witch          [fertility, e.g. n(3 | slap)]
- Mary not slap slap slap NULL the green witch     [NULL insertion: P-Null]
- Maria no dió una bofetada a la verde bruja       [translation, e.g. t(la | the)]
- Maria no dió una bofetada a la bruja verde       [distortion, d(j | i)]
Probabilities can be learned from raw bilingual text.
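The EM intuition above can be made concrete with a few lines of Python. This is a minimal sketch of the expected-count (E) and re-normalization (M) steps for Model 1-style translation probabilities on the toy la/maison corpus; a real system would use GIZA++, and this sketch omits NULL alignment and everything beyond the t(f | e) table.

from collections import defaultdict

corpus = [(["la", "maison"], ["the", "house"]),
          (["la", "maison", "bleue"], ["the", "blue", "house"]),
          (["la", "fleur"], ["the", "flower"])]

f_vocab = {f for fs, _ in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))  # start with uniform t(f | e)

for iteration in range(10):
    count = defaultdict(float)   # expected counts c(f, e)
    total = defaultdict(float)   # expected counts c(e)
    for fs, es in corpus:
        for f in fs:
            # E-step: distribute this occurrence of f over the English words
            # in the same sentence, in proportion to the current t(f | e).
            norm = sum(t[(f, e)] for e in es)
            for e in es:
                p = t[(f, e)] / norm
                count[(f, e)] += p
                total[e] += p
    # M-step: re-estimate t(f | e) from the expected counts.
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]

print(round(t[("la", "the")], 3), round(t[("maison", "house")], 3))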
How Do We Evaluate MT? Human Tests for Fluency
- Rating tests: give the raters a scale (1 to 5) and ask them to rate.
  - Or use distinct scales for clarity, naturalness, and style.
  - Or check for specific problems:
    - Cohesion (lexical chains, anaphora, ellipsis): hand-checking for cohesion.
    - Well-formedness: 5-point scale of syntactic correctness.
- Comprehensibility tests: noise test, multiple-choice questionnaire.
- Readability tests: cloze.

How Do We Evaluate MT? Human Tests for Fidelity
- Adequacy: does it convey the information in the original? Ask raters to rate on a scale.
  - Bilingual raters: give them the source and target sentence, and ask how much information is preserved.
  - Monolingual raters: give them the target plus a good human translation.
- Informativeness: task-based: is there enough information to do some task? Give raters multiple-choice questions about content.

Evaluating MT: Problems
- Asking humans to judge sentences on a 5-point scale for 10 factors takes time and money (weeks or months!).
- We can't build language engineering systems if we can only evaluate them once every quarter.
- We need a metric that we can run every time we change our algorithm.
- It would be OK if it weren't perfect, as long as it tended to correlate with the expensive human metrics, which we could still run quarterly.
(Slide from Bonnie Dorr)

Automatic Evaluation
- Miller and Beebe-Center (1958).
- Assume we have one or more human translations of the source passage.
- Compare the automatic translation to these human translations.
- Metrics: BLEU, NIST, METEOR, Precision/Recall.

BiLingual Evaluation Understudy (BLEU — Papineni, 2001)
http://www.research.ibm.com/people/k/kishore/RC22176.pdf
- An automatic technique, but it requires the pre-existence of human (reference) translations.
- Approach:
  - Produce a corpus of high-quality human translations.
  - Judge "closeness" numerically (word-error rate).
  - Compare n-gram matches between the candidate translation and one or more reference translations.
(Slide from Bonnie Dorr)

BLEU Evaluation Metric (Papineni et al., ACL-2002)
Reference (human) translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.
Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail, which sends out; The threat will be able after public place and so on the airport to start the biochemistry attack, [?] highly alerts after the maintenance.
- N-gram precision (score is between 0 and 1): what percentage of machine n-grams can be found in the reference translation? An n-gram is a sequence of n words. You are not allowed to use the same portion of the reference translation twice (so you can't cheat by typing out "the the the the the").
- Brevity penalty: you can't just type out the single word "the" (precision 1.0!).
- It is amazingly hard to "game" the system (i.e., to find a way to change the machine output so that BLEU goes up, but quality doesn't).
- The BLEU4 formula (counting n-grams up to length 4) is
  exp(1.0 * log p1 + 0.5 * log p2 + 0.25 * log p3 + 0.125 * log p4 - max(words-in-reference / words-in-machine - 1, 0)),
  where p1, p2, p3, and p4 are the 1-gram, 2-gram, 3-gram, and 4-gram precisions.
(Slides from Bonnie Dorr)
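A small Python sketch of the BLEU4 combination above; the n-gram precisions and sentence lengths fed in at the bottom are made-up numbers, purely for illustration, and computing the modified n-gram precisions themselves is sketched separately after the worked examples below.

import math

def bleu4(p1, p2, p3, p4, ref_len, cand_len):
    # exp(1.0*log p1 + 0.5*log p2 + 0.25*log p3 + 0.125*log p4
    #     - max(ref_len / cand_len - 1, 0)), as on the slide above.
    weights = [1.0, 0.5, 0.25, 0.125]
    precisions = [p1, p2, p3, p4]
    if min(precisions) == 0.0:
        return 0.0  # a zero precision sends the log-combination to -infinity
    brevity = max(ref_len / cand_len - 1, 0)
    return math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)) - brevity)

# Hypothetical precisions for a 17-word candidate against an 18-word reference.
print(bleu4(0.94, 0.59, 0.40, 0.25, ref_len=18, cand_len=17))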
Multiple Reference Translations
- Reference translation 1: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.
- Reference translation 2: Guam International Airport and its offices are maintaining a high state of alert after receiving an e-mail that was from a person claiming to be the wealthy Saudi Arabian businessman Bin Laden and that threatened to launch a biological and chemical attack on the airport and other public places.
- Reference translation 3: The US International Airport of Guam and its office has received an email from a self-claimed Arabian millionaire named Laden, which threatens to launch a biochemical attack on such public places as airport. Guam authority has been on alert.
- Reference translation 4: US Guam International Airport and its office received an email from Mr. Bin Laden and other rich businessman from Saudi Arabia. They said there would be biochemistry air raid to Guam Airport and other public places. Guam needs to be in high precaution about this matter.
- Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail, which sends out; The threat will be able after public place and so on the airport to start the biochemistry attack, [?] highly alerts after the maintenance.
(Slide from Bonnie Dorr)

BLEU Comparison
Chinese-English translation example:
- Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party.
- Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.
- Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
- Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
- Reference 3: It is the practical guide for the army always to heed the directions of the party.
(Slide from Bonnie Dorr)

How Do We Compute BLEU Scores?
- Intuition: "What percentage of words in the candidate occurred in some human translation?"
- Proposal: count the number of candidate translation words (unigrams) that occur in any reference translation, and divide by the total number of words in the candidate translation.
- But we can't just count the total number of overlapping n-grams!
  - Candidate: the the the the the the
  - Reference 1: The cat is on the mat
- Solution: a reference word should be considered exhausted after a matching candidate word is identified.
(Slide from Bonnie Dorr)

"Modified n-gram precision"
For each word, compute:
(1) the maximum number of times it occurs in any single reference translation, and
(2) the number of times it occurs in the candidate translation.
Instead of using count (2) directly, use the minimum of (1) and (2), i.e., clip the counts at the maximum for any single reference translation. Now use that modified count, and divide by the number of candidate words.
(Slide from Bonnie Dorr)

Modified Unigram Precision: Candidate #1
It(1) is(1) a(1) guide(1) to(1) action(1) which(1) ensures(1) that(2) the(4) military(1) always(1) obeys(0) the commands(1) of(1) the party(1)
(against References 1-3 above)
What's the answer? 17/18
(Slide from Bonnie Dorr)

Modified Unigram Precision: Candidate #2
It(1) is(1) to(1) insure(0) the(4) troops(0) forever(1) hearing(0) the activity(0) guidebook(0) that(2) party(1) direct(0)
(against References 1-3 above)
What's the answer? 8/14
(Slide from Bonnie Dorr)
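A short Python sketch of the clipped ("modified") n-gram precision just described; run on the two candidates and three references above, it reproduces the 17/18 and 8/14 unigram figures.

from collections import Counter

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def modified_precision(candidate, references, n=1):
    # Clip each candidate n-gram's count at the maximum number of times
    # it occurs in any single reference translation.
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    max_ref_counts = Counter()
    for ref in references:
        for gram, c in Counter(ngrams(ref.lower().split(), n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], c)
    clipped = sum(min(c, max_ref_counts[gram]) for gram, c in cand_counts.items())
    return clipped, sum(cand_counts.values())

refs = ["It is a guide to action that ensures that the military will forever heed Party commands",
        "It is the guiding principle which guarantees the military forces always being under the command of the Party",
        "It is the practical guide for the army always to heed the directions of the party"]
cand1 = "It is a guide to action which ensures that the military always obeys the commands of the party"
cand2 = "It is to insure the troops forever hearing the activity guidebook that party direct"
print(modified_precision(cand1, refs))  # (17, 18)
print(modified_precision(cand2, refs))  # (8, 14)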
Modified Bigram Precision: Candidate #1
It is(1) is a(1) a guide(1) guide to(1) to action(1) action which(0) which ensures(0) ensures that(1) that the(1) the military(1) military always(0) always obeys(0) obeys the(0) the commands(0) commands of(0) of the(1) the party(1)
(against References 1-3 above)
What's the answer? 10/17
(Slide from Bonnie Dorr)

Modified Bigram Precision: Candidate #2
It is(1) is to(0) to insure(0) insure the(0) the troops(0) troops forever(0) forever hearing(0) hearing the(0) the activity(0) activity guidebook(0) guidebook that(0) that party(0) party direct(0)
(against References 1-3 above)
What's the answer? 1/13
(Slide from Bonnie Dorr)

Catching Cheaters
Candidate: the(2) the the the(0) the(0) the(0) the(0)
- Reference 1: The cat is on the mat
- Reference 2: There is a cat on the mat
What's the unigram answer? 2/7
What's the bigram answer? 0/7
(Slide from Bonnie Dorr)

BLEU distinguishes human from machine translations.
(Slide from Bonnie Dorr)

BLEU Problems with Sentence Length
- Candidate: of the
- (against References 1-3 above)
- Problem: the modified unigram precision is 2/2, and the bigram precision is 1/1!
- Solution: a brevity penalty, which prefers candidate translations that are the same length as one of the references.
(Slide from Bonnie Dorr)

BLEU Tends to Predict Human Judgments
[Scatter plot of NIST score (a variant of BLEU) against human judgments, with linear fits: R² = 88.0% for adequacy and R² = 90.2% for fluency.]
(Slide from G. Doddington, NIST)

Summary
- Intro and a little history
- Language similarities and divergences
- Four main MT approaches: Transfer, Interlingua, Direct, Statistical
- Evaluation