Toward Statistical Machine Translation without Parallel Corpora
Alex Klementiev, Ann Irvine, Chris Callison-Burch, and David Yarowsky
Johns Hopkins University
EACL 2012

Motivation
- State-of-the-art statistical machine translation models are estimated from parallel corpora (manually translated text).
- Large volumes of parallel text are typically needed to induce good models.
BUT
- Manual translation is laborious and expensive.
- Sufficient quantities are available for only a few language pairs, e.g. Canadian and European parliamentary proceedings.
- WMT 2011 translation task: 6 languages; Google Translate: 59.

Goal
- Instead of relying on expensive parallel data, use cheap and plentiful monolingual data to reduce (eliminate?) the need for parallel data to induce good translation models.

Summary of the Approach
At a very high level, translation involves:
1. Choosing the correct translation of words and phrases.
   - Phrase-based SMT: extract and score phrasal dictionaries from sentence-aligned parallel data.
   - This work: extend bilingual lexicon induction to score phrasal dictionaries from monolingual data.
2. Putting the translated phrases or words in the right order.
   - Phrase-based SMT: extract ordering information from parallel data.
   - This work: a novel algorithm to estimate ordering from monolingual data.

Outline
- Motivation and summary of the approach
- Phrase-based statistical machine translation
- Weaning phrase-based SMT off parallel data
  - Scoring phrases: extending bilingual lexicon induction
  - Scoring ordering: a novel reordering algorithm
- Experiments

Phrase-based SMT
[Pipeline diagram: parallel data (big and expensive) is word-aligned to extract phrases; the phrases are scored for translation and ordering and, together with a small tuning set, yield a scored phrase table used for training feature weights and for decoding. In the proposed setup, phrases are instead extracted and scored for translation and ordering from monolingual data (huge and cheap).]
- We introduce monolingual equivalents to the bilingually estimated features.

Scoring Phrases
- In phrase-based SMT, the relatedness of a given phrase pair (e, f) is captured by:
  - phrase translation probabilities φ(e|f) and φ(f|e);
  - average word translation probabilities w(ei|fj) computed over phrase-pair-internal word alignments (e.g. f = "índice de inflación", e = "inflation rate").
- Both are estimated from a parallel corpus.
- How can we induce phrasal similarity from monolingual data?

Scoring Phrases: Context
- An old first idea [Rapp, 99]: measure contextual similarity. Words appearing in similar contexts are probably related.
- First, collect context vectors.
  [Example: contexts of Spanish "crecer" include rápidamente, economías, empleo, extranjero, planeta, drawn from sentences such as "... este número podría crecer muy rápidamente si no se modifica ...", "... nuestras economías a crecer y desarrollarse de forma saludable ...", "... que nos permitirá crecer rápidamente cuando el contexto ...".]
- Then, project the source-language vector through a seed dictionary and compare vectors (a small sketch follows below).
  [Diagram: the context vector for "crecer", projected into English through the seed dictionary, is compared against the context vectors of English candidate words such as expand, policy, and activity.]
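To make the contextual-similarity signal concrete, here is a minimal Python sketch, shown for single words rather than phrases. It assumes tokenized monolingual corpora (lists of token lists), a seed dictionary mapping source words to target words, and a fixed context window; the function names and window size are illustrative and not the paper's implementation.

    from collections import Counter
    from math import sqrt

    def context_vector(corpus, target, window=2):
        """Count words co-occurring with `target` within +/- `window` tokens.
        `corpus` is a list of tokenized sentences (lists of strings)."""
        counts = Counter()
        for sent in corpus:
            for i, tok in enumerate(sent):
                if tok != target:
                    continue
                for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                    if j != i:
                        counts[sent[j]] += 1
        return counts

    def project(src_vector, seed_dict):
        """Map source-language context counts onto target-language dimensions
        via the seed dictionary; words without a known translation are dropped."""
        projected = Counter()
        for word, count in src_vector.items():
            if word in seed_dict:
                projected[seed_dict[word]] += count
        return projected

    def cosine(u, v):
        dot = sum(u[k] * v[k] for k in u if k in v)
        norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
        return dot / norm if norm else 0.0

    def contextual_similarity(src_corpus, tgt_corpus, src_word, tgt_word, seed_dict):
        """Contextual similarity of a candidate translation pair."""
        src_vec = project(context_vector(src_corpus, src_word), seed_dict)
        tgt_vec = context_vector(tgt_corpus, tgt_word)
        return cosine(src_vec, tgt_vec)

Under this sketch, contextual_similarity(es_corpus, en_corpus, "crecer", "grow", seed_dict) should score higher than the same call with an unrelated English candidate.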
Scoring Phrases: Time
- Second idea: measure temporal similarity.
- Assume temporal information is associated with the text (e.g. news publication dates). Events are discussed in different languages at around the same time.
- Collect temporal signatures: per-date occurrence counts of each phrase.
  [Plot: the temporal signatures of "terrorist" (en) and "terrorista" (es) are similar; those of "terrorist" (en) and "riqueza" (es) are dissimilar.]
- Measure the similarity between signatures, e.g. with cosine or a DFT-based metric (a small sketch appears after the reordering section below).

Scoring Phrases: Topics
- Third idea: measure topic similarity.
- Phrases and their translations are likely to appear in the same topics: the more similar the sets of topics in which a pair of phrases appears, the more likely the phrases themselves are similar.
- We treat Wikipedia article pairs connected by interlingual links as topics.

Scoring Phrases: Orthography
- Fourth idea: measure orthographic similarity.
- Etymologically related words often retain similar spelling across languages with the same writing system (e.g. Spanish "democracia", English "democracy").
- Transliterate for language pairs with different writing systems (e.g. Russian "демократия" transliterates to "demokratiya", English "democracy").
- Measure similarity with edit distance or a discriminative transliteration model.
- Other ideas: burstiness, etc.

Scoring Phrases: Scaling Up Lexicon Induction
- Challenge in scoring phrase pairs monolingually: sparsity.
  [Plot: top-10 bilingual lexicon induction accuracy (%) against corpus frequency; accuracy is poor for low-frequency items.]
- Instead, score the similarity of the individual words within a phrase pair, similar to lexical weights in phrase-based SMT.
- Use phrase-pair-internal word alignments and penalize unaligned words.

Scoring Ordering
- Phrase-based SMT models the probability that a phrase's orientation changes with respect to the preceding phrase when translated:
  - m: monotone (keep order)
  - s: swap order
  - d: become discontinuous
- Start with aligned phrases; the reordering features are probability estimates of m, s, and d.
  [Example: German "Wieviel sollte man aufgrund seines Profils in Facebook verdienen" aligned with English "How much should you charge for your Facebook profile", with each phrase pair labelled m, s, or d.]

Scoring Ordering from Monolingual Data
- Estimate the same probabilities, but from pairs of (unaligned) sentences drawn from monolingual data.
- We don't have word alignments, but we do have a phrase table: its entries link phrases in a source-language sentence to phrases in a target-language sentence, and their relative positions yield orientation counts.
  [Example: English "What does your Facebook profile reveal?" paired with German "Das Anlegen eines Profils in Facebook ist einfach." The phrase-table entry linking "Facebook profile" and "Profils in Facebook" is counted as a swap.]
- Repeat over many sentence pairs (a rough sketch follows).
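The following is a simplified sketch of how such orientation counts might be accumulated from unaligned sentence pairs: phrase-table entries found in both sentences are located, and pairs of matches that are adjacent on the target side are classified as monotone, swap, or discontinuous by their order on the source side. The data structures, pairing of sentences, and matching details are assumptions for illustration and do not reproduce the paper's exact procedure.

    from collections import defaultdict

    def find_spans(tokens, phrase):
        """Return the start indices at which `phrase` (a tuple of tokens) occurs in `tokens`."""
        n = len(phrase)
        return [i for i in range(len(tokens) - n + 1) if tuple(tokens[i:i + n]) == phrase]

    def collect_orientation_counts(src_sents, tgt_sents, phrase_table):
        """Accumulate monotone / swap / discontinuous counts per phrase pair from
        unaligned, arbitrarily paired sentences. `phrase_table` maps a source phrase
        (tuple of tokens) to a set of candidate target phrases (tuples of tokens)."""
        counts = defaultdict(lambda: {'m': 0, 's': 0, 'd': 0})
        for src, tgt in zip(src_sents, tgt_sents):
            # all (source span, target span) pairs licensed by the phrase table
            matches = []
            for f, translations in phrase_table.items():
                src_starts = find_spans(src, f)
                if not src_starts:
                    continue
                for e in translations:
                    for ts in find_spans(tgt, e):
                        for ss in src_starts:
                            matches.append((ss, ss + len(f) - 1, ts, ts + len(e) - 1, f, e))
            # classify pairs of matches that are adjacent on the target side
            for prev in matches:
                for cur in matches:
                    if cur[2] != prev[3] + 1:
                        continue              # current phrase must start right after the previous one
                    if cur[0] == prev[1] + 1:
                        orient = 'm'          # same order on the source side: monotone
                    elif cur[1] + 1 == prev[0]:
                        orient = 's'          # reversed order on the source side: swap
                    else:
                        orient = 'd'          # anything else: discontinuous
                    counts[(cur[4], cur[5])][orient] += 1
        return counts

Normalizing each phrase pair's m/s/d counts then gives the monolingually estimated reordering features.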
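For completeness, the temporal-similarity signal from the phrase-scoring section can be sketched along the same lines: bin each phrase's occurrences by publication date in its own language and compare the resulting signatures with cosine similarity. The daily binning and the simple substring counting below are assumptions; the slides also mention a DFT-based alternative.

    from collections import Counter
    from math import sqrt

    def temporal_signature(dated_docs, phrase, bin_days=1):
        """Count occurrences of `phrase` per time bin.
        `dated_docs` is an iterable of (date, text) pairs with datetime.date dates."""
        sig = Counter()
        for date, text in dated_docs:
            sig[date.toordinal() // bin_days] += text.count(phrase)
        return sig

    def cosine(u, v):
        dot = sum(u[k] * v[k] for k in u if k in v)
        norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
        return dot / norm if norm else 0.0

    def temporal_similarity(src_docs, tgt_docs, src_phrase, tgt_phrase):
        """Cosine similarity of the two phrases' time-binned occurrence counts."""
        return cosine(temporal_signature(src_docs, src_phrase),
                      temporal_signature(tgt_docs, tgt_phrase))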
Scoring Phrases and Ordering: Summary

  Phrase-based SMT feature             Monolingual equivalent
  Phrase translation probabilities     Temporal, contextual, and topic phrase similarities
  Lexical weights                      Temporal, contextual, topic, and orthographic word similarities
  Reordering features                  Reordering features estimated from monolingual data

- These are alternatives to features estimated from parallel corpora: we use cheap monolingual data (along with some metadata) and a seed dictionary.

Experiments: Spanish-English
- Idealized setup: treat the English and Spanish sections of Europarl as two independent monolingual corpora. Europarl is annotated with temporal information.
- Drop the idealization: estimate features from truly monolingual corpora (Gigaword and Wikipedia for the contextual and temporal features, Wikipedia for the topical features).
- Run a series of lesion experiments:
  - begin with standard features estimated from parallel data;
  - drop the phrasal, lexical, and reordering features;
  - replace them with their monolingually estimated counterparts;
  - see how much performance can be recovered.

Experiment: Europarl Spanish-English
[Bar chart: BLEU for eight configurations. Standard phrase-based SMT scores 21.9. Successively removing the bilingual reordering model, the bilingual translation probabilities, and finally all bilingual features drops BLEU to 4.0. Adding back the monolingually estimated reordering model and phrase features recovers performance: with all (and only) monolingual features, roughly 75% of baseline performance is recovered, and adding the monolingual phrase features to the standard phrase-based system gives 22.9 BLEU, a +1 BLEU gain.]

Experiment: Monolingual Spanish-English
[Bar chart: the same eight configurations with features estimated from truly monolingual corpora. The standard phrase-based baseline is again 21.9 BLEU and removing all bilingual features again drops it to 4.0; all (and only) monolingual features reach 17.9 BLEU, recovering roughly 82% of baseline performance, and adding the monolingual phrase features to the standard system gives 23.3 BLEU, a gain of about +1.5 BLEU.]

Conclusions
- Monolingual data takes us a long way toward inducing good translation models:
  - a substantial portion of the lost performance is recovered in the lesion experiments;
  - the monolingual features consistently improve performance when added to the standard phrase-based setup.
- This is an important direction for languages lacking sufficient (or any) parallel data, which is true for most languages; monolingual data is plentiful and growing.