Deciphering Foreign Language
Sujith Ravi and Kevin Knight
[email protected], [email protected]
Information Sciences Institute, University of Southern California

Statistical Machine Translation (MT)
• Current MT systems are trained on bilingual (parallel) text, for example:
  (Spanish) Garcia y asociados  →  (English) Garcia and associates
  (Spanish) sus grupos están en Europa  →  (English) its groups are in Europe
• Training produces translation tables with entries such as:
  associates / asociados : 0.8
  groups / grupos : 0.9
  ...
• Parallel text exists for pairs such as Spanish/English and Japanese/German, but for many pairs (Malayalam/English, Swahili/German, ...) there is little or none: parallel data is the BOTTLENECK.
• Can we get rid of parallel data?
• Monolingual corpora (Spanish text, English text) are available in PLENTY. Can we train translation tables from monolingual text alone?

Getting Rid of Parallel Data
• An MT system trained on non-parallel data would be useful for rare language pairs with limited or no parallel data.
• Goal: not to beat existing MT systems, but to ask whether we can build a reasonably good MT system from scratch without any parallel data.
  ➡ monolingual resources are available in plenty

Related Work
• Extracting bilingual lexical connections from comparable corpora:
  ➡ exploiting word context frequencies (Fung, 1995; Rapp, 1995; Koehn & Knight, 2001)
  ➡ the Canonical Correlation Analysis (CCA) method (Haghighi & Klein, 2008)
• Mining parallel sentence pairs for MT training from comparable corpora (Munteanu et al., 2004)
  ➡ needs a dictionary and some initial parallel data

Our Contributions
✓ An MT system built from scratch without parallel data:
  ➡ a novel decipherment approach to translation
  ➡ novel methods for training translation models from non-parallel text
  ➡ Bayesian training for the IBM Model 3 translation model
✓ Novel methods to deal with the large-scale vocabularies inherent in MT problems
✓ Empirical studies of MT decipherment

Rest of this Talk
• Introduction
• Related Work
• New Idea for Language Translation
  ➡ Step 1: Word Substitution
  ➡ Step 2: Foreign Language as a Cipher
• Conclusion

Cracking the MT Code
"When I look at an article in Spanish, I say to myself, this is really English, but it has been encoded in some strange symbols. Now I will proceed to decode..." (Warren Weaver, 1947)
• Ciphertext (Spanish): este es un sistema de cifrado complejo
• Plaintext (English): this is a complex cipher

MT Decipherment without Parallel Data
• Ciphertext f: a Spanish corpus, e.g. "El portal web permite la búsqueda por todo tipo de métodos. Por un lado, Wikileaks ha ordenado la documentación por diferentes categorías atendiendo a los hechos más notables. Desde el tipo de suceso (evento criminal, fuego amigo, ..."
• An English language model P(e) is trained on a separate, non-parallel English corpus, e.g. CNN: "WikiLeaks website publishes classified military documents from Iraq. The whistle-blower website WikiLeaks published nearly 400,000 classified military documents from the Iraq war on Friday, calling it the largest classified military leak in history, ..."
• The unknown "key" is an English-to-Spanish translation model P(f | e). A minimal sketch of the language-model training step follows.
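The language-model component is standard n-gram estimation from monolingual English text. Below is a minimal sketch of an add-alpha-smoothed bigram LM; the two-sentence corpus, the function name, and the smoothing choice are illustrative assumptions, not the authors' actual setup.

```python
from collections import Counter

# Toy stand-in for the monolingual English corpus (illustrative only).
english_corpus = [
    "wikileaks website publishes classified military documents",
    "the website published classified military documents from the iraq war",
]

def train_bigram_lm(sentences, alpha=1.0):
    """Estimate an add-alpha smoothed bigram model P(w2 | w1)."""
    unigrams, bigrams = Counter(), Counter()
    vocab = {"<s>", "</s>"}
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        vocab.update(words)
        for w1, w2 in zip(words, words[1:]):
            unigrams[w1] += 1
            bigrams[(w1, w2)] += 1
    V = len(vocab)
    def prob(w1, w2):
        return (bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * V)
    return prob, vocab

lm_prob, vocab = train_bigram_lm(english_corpus)
print(lm_prob("classified", "military"))   # seen bigram: relatively high
print(lm_prob("military", "wikileaks"))    # unseen bigram: smoothed, low
```

In the full setup this LM is the P(e) component of the decipherment objective below; the channel P(f | e) is what training must discover.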
• For each Spanish string f, both its English translation e and the word alignments are hidden.
• Training: choose parameters θ that maximize the probability of the observed foreign text f:

  argmax_θ P_θ(f) = argmax_θ Σ_e P_θ(e, f) = argmax_θ Σ_e P(e) · P_θ(f | e)

Characteristics of the Decipherment Key
• For MT, the "key" is a hard decipherment problem:
  ➡ determinism: the key is many-to-many
  ➡ linguistic unit of substitution: word / phrase
  ➡ transposition (re-ordering), insertion, and deletion all occur
  ➡ large scale (100–1M word types) in vocabulary and data sizes
• Tackle a simpler problem first: word substitution ciphers
  ➡ 1-to-1 key, word-level substitution, no re-ordering, insertion, or deletion
  ➡ but still large scale (100–1M word types)

Rest of this Talk
• Introduction
• Related Work
• New Idea for Language Translation
• Word Substitution
• Foreign Language as a Cipher
• Conclusion

Word Substitution (on the road to MT)
What does this say in English?
  95 90 76 31 95 20 95 65 31 50 42 16 65 31 50 58 42 16 19 95 16 92 73 16
  65 67 57 31 26 65 47 24 56 21 16 43 11 80 60 16 38 ... 70 52 57 30 33 26 ...
• English words have been masked by cipher symbols.
• Like unsupervised POS tagging, except there are hundreds of thousands of possible "tags" per cipher token (tag = English word).
• Goal: learn the substitution mappings between English words and cipher symbols. A toy brute-force version of this search is sketched below.
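To make the word substitution task and the training objective concrete, here is a tiny brute-force sketch that scores every possible 1-to-1 key against a bigram English LM and keeps the best one. The cipher digits, vocabulary, and LM scores are hypothetical toy values; real vocabularies of 100K+ types make this enumeration impossible, which is exactly why the EM and Bayesian methods described next are needed.

```python
import itertools
import math

# Toy ciphertext: each symbol hides one English word via an unknown 1-to-1 key.
cipher = ["87", "21", "64", "55"]
english_vocab = ["a", "cipher", "is", "this"]

# Hypothetical bigram log-probabilities log P(w2 | w1); unseen bigrams get -5.
bigram_logp = {("<s>", "this"): -0.5, ("this", "is"): -0.5, ("is", "a"): -0.5,
               ("a", "cipher"): -0.5, ("cipher", "</s>"): -0.5}

def lm_logprob(words):
    seq = ["<s>"] + words + ["</s>"]
    return sum(bigram_logp.get(pair, -5.0) for pair in zip(seq, seq[1:]))

symbols = sorted(set(cipher))
best_key, best_score = None, -math.inf
for perm in itertools.permutations(english_vocab, len(symbols)):
    key = dict(zip(symbols, perm))        # candidate key: cipher symbol -> English word
    plaintext = [key[c] for c in cipher]
    # With a deterministic 1-to-1 key, P(f | e) is 1 for the single consistent e,
    # so the objective sum_e P(e) * P(f | e) collapses to the LM score of that e.
    score = lm_logprob(plaintext)
    if score > best_score:
        best_key, best_score = key, score

print(best_key)   # {'21': 'is', '55': 'cipher', '64': 'a', '87': 'this'}
```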
Word Substitution Decipherment
• Setup: the ciphertext is a sequence of cipher symbols (e.g. 95 43 95 65 19 92 65 38 26 47 ...); the hidden plaintext e is English, generated by an English Language Model trained on an English corpus; an unknown Substitution Table maps English words to cipher symbols.
• The key (substitution table) contains millions of parameters!
• Two new training methods:
  1. EM Decipherment, with a new iterative training procedure
  2. Bayesian Decipherment, with efficient, parallelized Bayesian sampling

Word Substitution: (1) EM Decipherment
• EM isn't easy here:
  ➡ Space: the channel table is huge to store and update
  ➡ Time: like unsupervised POS tagging, but with a vast number of possible "tags" per cipher token
  ➡ Solution: Iterative EM
• Iterative EM:
  ➡ build a K x K channel using only the top K frequent words
  ➡ run EM; it assigns top plaintext candidates to some cipher tokens
  ➡ eliminate unused parameters
  ➡ expand the channel to 2K x 2K, and so on
• Decoding: Viterbi decoding using the trained channel (details in paper). A toy forward-backward EM sketch for this channel model follows.
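Here is a minimal sketch of the underlying EM computation for the word substitution channel: the hidden English sequence is treated as a Markov chain under a fixed bigram LM, and forward-backward expectations re-estimate P(cipher symbol | English word). The LM values, cipher data, and vocabulary are toy assumptions, and the iterative top-K restriction described above is omitted to keep the sketch short.

```python
from collections import defaultdict

# Toy setup (illustrative): a fixed, smoothed bigram English LM and a cipher corpus.
vocab = ["<s>", "</s>", "this", "is", "a", "test", "cipher"]
def lm(w1, w2):                      # P(w2 | w1), hypothetical values
    table = {("<s>", "this"): 0.6, ("this", "is"): 0.6, ("is", "a"): 0.6,
             ("a", "test"): 0.3, ("a", "cipher"): 0.3,
             ("test", "</s>"): 0.6, ("cipher", "</s>"): 0.6}
    return table.get((w1, w2), 0.01)

cipher_corpus = [["91", "07", "33", "54"], ["91", "07", "33", "78"]]
states = [w for w in vocab if w not in ("<s>", "</s>")]
symbols = sorted({c for sent in cipher_corpus for c in sent})

# Channel P(cipher symbol | English word), initialised uniformly.
channel = {(e, c): 1.0 / len(symbols) for e in states for c in symbols}

for iteration in range(20):
    counts = defaultdict(float)
    for sent in cipher_corpus:
        n = len(sent)
        # Forward pass over the hidden English sequence (HMM-style lattice).
        alpha = [dict() for _ in range(n)]
        for e in states:
            alpha[0][e] = lm("<s>", e) * channel[(e, sent[0])]
        for i in range(1, n):
            for e in states:
                alpha[i][e] = channel[(e, sent[i])] * sum(
                    alpha[i - 1][p] * lm(p, e) for p in states)
        # Backward pass.
        beta = [dict() for _ in range(n)]
        for e in states:
            beta[n - 1][e] = lm(e, "</s>")
        for i in range(n - 2, -1, -1):
            for e in states:
                beta[i][e] = sum(lm(e, nxt) * channel[(nxt, sent[i + 1])] * beta[i + 1][nxt]
                                 for nxt in states)
        Z = sum(alpha[n - 1][e] * beta[n - 1][e] for e in states)
        # E-step: expected counts of (English word, cipher symbol) pairs.
        for i in range(n):
            for e in states:
                counts[(e, sent[i])] += alpha[i][e] * beta[i][e] / Z
    # M-step: re-normalise the channel row for each English word.
    for e in states:
        total = sum(counts[(e, c)] for c in symbols)
        if total > 0:
            for c in symbols:
                channel[(e, c)] = counts[(e, c)] / total

print(max(symbols, key=lambda c: channel[("this", c)]))  # likely "91" on this toy data
```

In the real system the plaintext vocabulary is far too large to sum over directly, which is what motivates the iterative scheme above and the Bayesian sampler described next.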
Word Substitution: (2) Bayesian Decipherment
• Several advantages over EM:
  ✓ efficient inference, even with higher-order LMs
    ➡ incremental scoring of derivations during sampling
  ✓ a novel sample-selection strategy permits fast training
  ✓ no memory bottleneck
  ✓ sparse priors help learn skewed distributions
• Same generative story as EM, but the models are replaced with Chinese Restaurant Process (CRP) formulations:
  ➡ base distributions (P0): source = English LM probabilities, channel = uniform
  ➡ priors: source (α) = 10^4, channel (β) = 0.01
• Inference via point-wise Gibbs sampling, with smart sample-choice selection.

Smart Sample-Choice Selection
• Example: cipher = "22 94 43 04 98", current sample = "three eight living here ?". Resampling the second token means scoring candidate English words (the, a, and, boys, gun, brick, ran, ...) in the context "three ___ living here ?".
• Re-sampling must be done for thousands or millions of cipher tokens per iteration.
• Current solution: sample a new choice for e from all possible plaintext choices.
• Online computation: given the neighbors e1 and e2, what are the top-100 words e ranked by P(e1 e e2)? If e1 and e2 never co-occurred, pick 100 random words instead.

Word Substitution: (2) Bayesian Decipherment (continued)
• Parallelized sampling using MapReduce (3- to 5-fold faster).
• Decoding: extract the trained channel from the final sample and Viterbi-decode (details in paper). A toy Gibbs sampling sketch follows.
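Below is a minimal sketch of point-wise Gibbs sampling for the word-substitution channel, with CRP-style cache counts and a small candidate-restriction step standing in for the smart sample-choice selection above. The LM values, cipher data, candidate-list size, and all names are illustrative assumptions (the talk's setting uses the top-100 LM choices and much larger data).

```python
import random
from collections import defaultdict

random.seed(0)

# Illustrative fixed bigram LM and cipher data (hypothetical values).
def lm(w1, w2):
    table = {("<s>", "three"): 0.4, ("three", "men"): 0.4, ("men", "living"): 0.3,
             ("living", "here"): 0.5, ("here", "</s>"): 0.5}
    return table.get((w1, w2), 1e-4)

english_vocab = ["three", "men", "living", "here", "eight", "boys", "gun"]
cipher = ["22", "94", "43", "04"]
beta = 0.01                                  # channel CRP concentration
p0 = 1.0 / len(set(cipher))                  # uniform channel base distribution

# Current sample: one English word per cipher token; plus channel cache counts.
sample = [random.choice(english_vocab) for _ in cipher]
counts, totals = defaultdict(int), defaultdict(int)
for e, c in zip(sample, cipher):
    counts[(e, c)] += 1
    totals[e] += 1

def channel_prob(e, c):
    """CRP-style channel probability P(c | e) from the current cache counts."""
    return (counts[(e, c)] + beta * p0) / (totals[e] + beta)

def candidates(left, right, k=3):
    """Smart sample-choice selection: words that fit the local LM context,
    plus a few random words (k of each here; the talk uses top-100)."""
    scored = sorted(english_vocab, key=lambda w: lm(left, w) * lm(w, right), reverse=True)
    return set(scored[:k]) | set(random.sample(english_vocab, k))

for sweep in range(50):
    for i, c in enumerate(cipher):
        # Remove the current assignment from the cache before resampling (exchangeability).
        e_old = sample[i]
        counts[(e_old, c)] -= 1
        totals[e_old] -= 1
        left = sample[i - 1] if i > 0 else "<s>"
        right = sample[i + 1] if i + 1 < len(cipher) else "</s>"
        cands = list(candidates(left, right))
        weights = [lm(left, e) * lm(e, right) * channel_prob(e, c) for e in cands]
        e_new = random.choices(cands, weights=weights)[0]
        sample[i] = e_new
        counts[(e_new, c)] += 1
        totals[e_new] += 1

print(sample)   # current decipherment of the four cipher tokens
```

Extracting the channel counts from the final sample and Viterbi-decoding, as described in the talk, would turn this sampler into a complete decipherment pipeline.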
Word Substitution Results
[Results chart from the talk.]

Sample Decipherments
[Table: Cipher | Original English | Deciphered]

Rest of this Talk
• Introduction
• Related Work
• New Idea for Language Translation
• Word Substitution
• Foreign Language as a Cipher
• Conclusion

Foreign Language as a Cipher
• Treat the Spanish corpus as ciphertext: hidden English e is generated by an English Language Model (trained on an English corpus) and transformed by an unknown English-to-Foreign Translation Model into the observed Spanish.
• Machine translation without parallel data = word substitution + transposition + insertion + deletion + ...

Deciphering Foreign Language
• P(f | e) can be any translation model used in MT (e.g., IBM Model 3)
  ➡ but without parallel data, training it is intractable
• Two new approaches:
  1. EM Decipherment: a simple generative story; the model can be trained efficiently with EM
  2. Bayesian Decipherment: the IBM Model 3 generative story; Bayesian inference with efficient, parallelized sampling

MT Decipherment: (1) EM
• A simple generative story (avoiding large fertility and distortion tables):
  ✓ word substitutions
  ✓ local re-ordering
  ✓ insertions
  ✓ deletions
• The model can be trained efficiently using EM.
• Use prior linguistic knowledge for decipherment:
  ➡ morphology, linguistic constraints (e.g., "8" in English maps to "8" in Spanish)
• Use whole-segment language models instead of word n-gram LMs:
  e.g., "ARE YOU TALKING ABOUT ME ?" is a valid segment, whereas "THANK YOU TALKING ABOUT ?" is not.

MT Decipherment: (2) Bayesian
• Generative story: IBM Model 3 (popular), with translation (word substitution), distortion (transposition), and fertility components.
• This is a complex model, which makes inference very hard.
• New translation model: replace the IBM Model 3 components with CRP cache models.
• Bayesian inference for estimating the translation model:
  ➡ efficient, scalable inference using the strategies described earlier
• Sampling IBM Model 3:
  ➡ point-wise Gibbs sampling: for each foreign string f, jointly sample the alignments and the English translation e
  ➡ sampling operators: translate a word, swap alignments, ... (similar to Germann et al., 2001)
  ➡ blocked sampling: sample a single derivation for repeated sentences
• Choose the final sample as the MT decipherment output. A toy sketch of the Model 3 generative story follows.
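To make the IBM Model 3 generative story concrete, here is a tiny forward sampler that produces a foreign string from an English one through fertility, translation, and distortion steps. All parameter tables are hypothetical toy values; NULL-generated words and other Model 3 details, as well as the CRP replacements used in the talk, are omitted.

```python
import random
random.seed(1)

# Toy Model 3 style parameter tables (hypothetical values, for illustration only).
fertility = {"the": [0.2, 0.8], "house": [0.1, 0.9]}          # P(phi | e), phi in {0, 1}
translation = {"the": {"la": 0.7, "el": 0.3}, "house": {"casa": 1.0}}  # t(f | e)

def distortion(j, i, m, l):
    """Toy P(target position j | source position i, lengths m, l): favors j == i."""
    weights = [2.0 if pos == i else 1.0 for pos in range(m)]
    return weights[j] / sum(weights)

def sample_model3(english):
    l = len(english)
    # 1. Fertility: how many foreign words each English word produces.
    phis = [random.choices(range(len(fertility[e])), fertility[e])[0] for e in english]
    # 2. Translation: pick a foreign word for each produced slot.
    produced = []
    for i, (e, phi) in enumerate(zip(english, phis)):
        for _ in range(phi):
            fs, ps = zip(*translation[e].items())
            produced.append((i, random.choices(fs, ps)[0]))
    # 3. Distortion: place the produced words into target positions.
    m = len(produced)
    open_positions = list(range(m))
    target = [None] * m
    for i, f in produced:
        weights = [distortion(j, i, m, l) for j in open_positions]
        j = random.choices(open_positions, weights)[0]
        target[j] = f
        open_positions.remove(j)
    return target

print(sample_model3(["the", "house"]))   # e.g. ['la', 'casa']
```

In the talk's Bayesian setting these fixed tables are replaced by CRP cache models, and Gibbs sampling inverts the story to recover e and the alignments from the observed foreign strings alone.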
Experiments: Time Expressions
• Non-parallel corpora of temporal expressions.
  English examples: "10 MONTHS LATER", "10 MORE YEARS", "24 MINUTES", "28 CONSECUTIVE QUARTERS", "A WEEK EARLIER", "ABOUT A DECADE AGO", "ABOUT A MONTH AFTER", "AUGUST 6 , 1789", "CENTURIES AGO", "DEC . 11 , 1989", "TWO DAYS LATER", "TWO DECADES LATER", "TWO FULL DAYS", ...
  Spanish examples: "10 días consecutivos de cotización", "10 semanas consecutivas", "100 años después", "17 de abril 1986", "mil años antes", "otros 12 meses más o menos", "una de tres horas", "uno de tres años", "un jueves por la noche", ...
• Results (BLEU, higher is better): MOSES trained with parallel data scores 88; decipherment without parallel data reaches 48.7, compared with 18.2 for a baseline without parallel data.

Experiments: Movie Subtitles
• OPUS Spanish/English subtitle corpus (Tiedemann, 2009), used non-parallel.
  English examples: "ALL RIGHT , LET'S GO .", "ARE YOU ALL RIGHT ?", "ARE YOU CRAZY ?", "HEY , DO YOU WANT TO COME OUT AND PLAY THE GAME ?", "WHAT ARE YOU DOING HERE ?", "YEAH !", "YOU KNOW WHAT I MEAN ?", ...
  Spanish examples: "abran la puerta .", "bien hecho .", "¡ por aquí !", "¿ a qué te refieres ?", "¿ cómo podré verlos a través de mis lágrimas ?", "oye , ¿ quieres salir y jugar el juego ?", "un segundo .", "vamonos .", ...
• Results (BLEU, higher is better): MOSES trained with parallel data scores 63.6; decipherment without parallel data scores 19.3.

MT Accuracy vs. Data Size
[Plot: MT quality on the test set as a function of training data size, comparing phrase-based MT trained on parallel data, IBM Model 3 without distortion trained on parallel data, and EM decipherment with no parallel data.]

Conclusion
• Language translation without parallel data:
  ➡ a very challenging task, but shown to be possible using a decipherment approach
  ➡ initial results are promising
  ➡ the approach can easily be extended to new language pairs and domains
• Future work:
  ➡ scalable decipherment methods for full-scale MT
  ➡ better unsupervised algorithms for decipherment
  ➡ leveraging existing bilingual resources (e.g., dictionaries) during decipherment
  ➡ applications to domain adaptation

What Else Can Decipherment Do?
• This talk: language translation, i.e., training translation tables from monolingual corpora.
• Cryptanalysis: CATCH A SERIAL KILLER (afternoon talk, 2pm, Machine Learning Session 2-B).

THANK YOU!