Deciphering Foreign Language

Sujith Ravi and Kevin Knight
[email protected], [email protected]
Information Sciences Institute
University of Southern California
Statistical Machine Translation (MT)

Current MT systems TRAIN translation tables from bilingual text:

  (Spanish) Garcia y asociados             (English) Garcia and associates
  (Spanish) sus grupos están en Europa     (English) its groups are in Europe

  Translation table: associates/asociados : 0.8, groups/grupos : 0.9, ...

Language pairs of interest: Spanish/English, Japanese/German,
Malayalam/English, Swahili/German, ... but bilingual text is the BOTTLENECK.

Can we get rid of parallel data? Monolingual corpora exist in PLENTY:
TRAIN translation tables directly from Spanish text and English text.
Getting Rid of Parallel Data

• MT system trained on non-parallel data
  ➡ useful for rare language pairs (limited/no parallel data)
  ➡ monolingual resources available in plenty
• Goal: not to beat existing MT systems, but to ask: can we build a
  reasonably good MT system from scratch without any parallel data?
Related Work: Machine Translation without Parallel Data

• Extracting bilingual lexical connections from comparable corpora
  ➡ exploit word context frequencies (Fung, 1995; Rapp, 1995;
    Koehn & Knight, 2001)
  ➡ Canonical Correlation Analysis (CCA) method (Haghighi & Klein, 2008)
• Mining parallel sentence pairs for MT training using comparable corpora
  (Munteanu et al., 2004)
  ➡ needs a dictionary and some initial parallel data
Our Contributions

✓ MT system built from scratch without parallel data
  ➡ novel decipherment approach for translation
  ➡ novel methods for training translation models from non-parallel text
  ➡ Bayesian training for the IBM Model 3 translation model
✓ Novel methods to deal with the large-scale vocabularies inherent in MT
  problems
✓ Empirical studies for MT decipherment
Rest of this Talk

• Introduction
• Related Work
• New Idea for Language Translation
  ➡ Step 1: Word Substitution
  ➡ Step 2: Foreign Language as a Cipher
• Conclusion
Cracking the MT Code

"When I look at an article in Spanish, I say to myself, this is really
English, but it has been encoded in some strange symbols. Now I will
proceed to decode..."
                                            Warren Weaver (1947)

• Ciphertext (Spanish): este es un sistema de cifrado complejo
• Plaintext (English):  this is a complex cipher
MT Decipherment without Parallel Data

English corpus (for Language Model training):
  (CNN) WikiLeaks website publishes classified military documents from
  Iraq. The whistle-blower website WikiLeaks published nearly 400,000
  classified military documents from the Iraq war on Friday, calling it
  the largest classified military leak in history, ...

Spanish corpus (the observed ciphertext f):
  El portal web permite la búsqueda por todo tipo de métodos. Por un lado,
  Wikileaks ha ordenado la documentación por diferentes categorías
  atendiendo a los hechos más notables. Desde el tipo de suceso (evento
  criminal, fuego amigo, ...

Model: English e is generated by an English Language Model P(e); the key,
an English-to-Spanish Translation Model P(f | e), produces Spanish f.
For each f, both the alignments and the English translation e are hidden.

TRAINING: train parameters θ to maximize the probability of the observed
foreign text f:

  argmax_θ P_θ(f) ≈ argmax_θ Σ_e P_θ(e, f)
                  ≈ argmax_θ Σ_e P(e) · P_θ(f | e)
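For the word-substitution case this objective can be implemented directly: treat the hidden English sequence as HMM states (transitions given by a bigram LM, emissions by the channel P(f | e)) and run EM with forward-backward. The following is a minimal toy sketch under those assumptions, not the authors' implementation, which relies on the scaling tricks described later in the talk:

```python
from collections import defaultdict

def em_decipher(cipher_corpus, lm_bigrams, en_vocab, iters=20):
    """EM for word-substitution decipherment: learn a channel P(f | e)
    maximizing P(f) = sum_e P(e) . P(f | e), with P(e) a fixed bigram LM.
    Hidden English words play the role of HMM states."""
    cipher_vocab = sorted({f for sent in cipher_corpus for f in sent})
    # start from a uniform channel
    t = {e: {f: 1.0 / len(cipher_vocab) for f in cipher_vocab} for e in en_vocab}
    for _ in range(iters):
        counts = defaultdict(lambda: defaultdict(float))
        for sent in cipher_corpus:
            n = len(sent)
            # forward pass
            alpha = [defaultdict(float) for _ in range(n)]
            for e in en_vocab:
                alpha[0][e] = lm_bigrams.get(("<s>", e), 0.0) * t[e][sent[0]]
            for i in range(1, n):
                for e in en_vocab:
                    total = sum(alpha[i - 1][ep] * lm_bigrams.get((ep, e), 0.0)
                                for ep in en_vocab)
                    alpha[i][e] = total * t[e][sent[i]]
            # backward pass
            beta = [defaultdict(float) for _ in range(n)]
            for e in en_vocab:
                beta[n - 1][e] = 1.0
            for i in range(n - 2, -1, -1):
                for e in en_vocab:
                    beta[i][e] = sum(lm_bigrams.get((e, en), 0.0)
                                     * t[en][sent[i + 1]] * beta[i + 1][en]
                                     for en in en_vocab)
            z = sum(alpha[n - 1][e] for e in en_vocab)
            if z == 0.0:
                continue
            # E-step: expected counts of substitutions e -> f_i
            for i in range(n):
                for e in en_vocab:
                    counts[e][sent[i]] += alpha[i][e] * beta[i][e] / z
        # M-step: renormalize the channel
        for e in en_vocab:
            total = sum(counts[e].values())
            if total > 0.0:
                for f in cipher_vocab:
                    t[e][f] = counts[e][f] / total
    return t
```

On a toy cipher where symbol "1" encodes "a" and "2" encodes "b", a few EM iterations already concentrate the channel on the right mappings.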
Characteristics of the Decipherment Key

                                    Word Substitution        MT (Hard problem!)
  Determinism in key?               1-to-1                   many-to-many
  Linguistic unit of substitution   Word                     Word / Phrase
  Transposition (re-ordering)       no                       yes
  Insertion / Deletion              no                       yes
  Scale (vocabulary & data sizes)   Large (100 - 1M types)   Large (100 - 1M types)

Tackle a simpler problem first: Word Substitution.
Rest of this Talk

• Introduction
• Related Work
• New Idea for Language Translation
• Word Substitution
• Foreign Language as a Cipher
• Conclusion
Word Substitution (on the road to MT)

  "What does this say in English?"

  95 90 76 31 95 20 95 65 31 50 42 16 65 31 50 58 42 16 19 95 16 ...

English words are masked by cipher symbols; the task is to learn
substitution mappings between English words and cipher symbols.

Like unsupervised POS tagging, except there are hundreds of thousands of
"tags" per cipher token (tag = English word).
Word Substitution Decipherment

  Cipher text:  95 43 19 92 38 26 47 ... 15 90 11 65 31 95 73 67 70 16 ...

  Model: an English Language Model (trained on an English corpus)
  generates English word sequences; an unknown key, a substitution table,
  maps each English word to a cipher symbol.

  The key contains millions of parameters!

New:
  1. EM Decipherment       - EM using a new iterative training procedure
  2. Bayesian Decipherment - Bayesian inference (efficient, parallelized
                             sampling)
Word Substitution: (1) EM

• EM isn't easy here
  ➡ Space: need to store & update a huge table of parameters
  ➡ Time: like POS tagging, but with hundreds of thousands of possible
    "tags" per cipher token
  ➡ Solution: Iterative EM
• Iterative EM
  ➡ build a small K x K channel (using the K most frequent words)
  ➡ EM assigns English words to some cipher tokens
  ➡ eliminate (prune) unused parameters
  ➡ expand the channel (2K x 2K, etc.)
• Decoding: Viterbi decoding using the trained channel (details in paper)
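The iterative loop above can be sketched as a schedule around any EM fitter: start with the most frequent types on each side, fit, prune low-probability channel entries, then grow the vocabularies and repeat. Here `run_em` is a hypothetical callback standing in for one EM fit, and the pruning threshold is an illustrative assumption:

```python
from collections import Counter

def iterative_em(cipher_tokens, english_tokens, k, rounds, run_em, prune_at=0.001):
    """Iterative EM: fit a small K x K channel first, prune, then expand
    to 2K x 2K, 3K x 3K, etc., keeping only surviving parameters."""
    cipher_by_freq = [w for w, _ in Counter(cipher_tokens).most_common()]
    english_by_freq = [w for w, _ in Counter(english_tokens).most_common()]
    kept = {}  # surviving channel entries: (e, f) -> probability
    for r in range(1, rounds + 1):
        e_vocab = english_by_freq[: k * r]        # grow both vocabularies
        f_vocab = cipher_by_freq[: k * r]
        channel = run_em(e_vocab, f_vocab, kept)  # seed EM with survivors
        kept = {pair: p for pair, p in channel.items() if p >= prune_at}
    return kept
```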
Word Substitution: (2) Bayesian

• Several advantages over EM
  ✓ efficient inference, even with higher-order LMs
    ➡ incremental scoring of derivations during sampling
  ✓ novel sample-selection strategy permits fast training
  ✓ no memory bottleneck
  ✓ sparse priors help learn skewed distributions
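The "incremental scoring" point can be made concrete with a bigram LM: when a sampling move changes only position i, exactly two bigram terms change, so the derivation score updates in O(1) rather than O(n). A small sketch with illustrative function names, not the authors' code:

```python
def bigram_terms(log_lm, seq):
    """All bigram log-scores of a derivation, with boundary markers."""
    padded = ["<s>"] + list(seq) + ["</s>"]
    return [log_lm(padded[j], padded[j + 1]) for j in range(len(padded) - 1)]

def incremental_update(log_lm, seq, terms, i, new_word):
    """Replace seq[i] and update only the two affected bigram terms,
    instead of rescoring the whole sequence."""
    padded = ["<s>"] + list(seq) + ["</s>"]
    new_seq = list(seq)
    new_seq[i] = new_word
    new_terms = list(terms)
    new_terms[i] = log_lm(padded[i], new_word)          # left-context term
    new_terms[i + 1] = log_lm(new_word, padded[i + 2])  # right-context term
    return new_seq, new_terms
```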
Word Substitution: (2) Bayesian (continued)

• Same generative story as EM; models replaced with Chinese Restaurant
  Process (CRP) formulations
  ➡ base distributions (P0): source = English LM probabilities,
    channel = uniform
  ➡ priors: source (α) = 10^4, channel (β) = 0.01
• Inference via point-wise Gibbs sampling
  ➡ smart sample-choice selection
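The CRP formulation boils down to a cached distribution: the predictive probability of an event mixes its observed count with the base distribution P0, weighted by the concentration parameter. A minimal sketch of that predictive probability (the real sampler also removes and re-adds counts around each Gibbs move):

```python
class CRP:
    """Chinese Restaurant Process over a base distribution p0 with
    concentration alpha:
        P(x | cache) = (count(x) + alpha * p0(x)) / (n + alpha)"""
    def __init__(self, p0, alpha):
        self.p0, self.alpha = p0, alpha
        self.counts, self.n = {}, 0
    def prob(self, x):
        return (self.counts.get(x, 0) + self.alpha * self.p0(x)) / (self.n + self.alpha)
    def observe(self, x):   # add an event to the cache
        self.counts[x] = self.counts.get(x, 0) + 1
        self.n += 1
    def forget(self, x):    # remove an event (done before re-sampling it)
        self.counts[x] -= 1
        self.n -= 1
```

With a uniform base distribution and a small concentration (as for the channel here, β = 0.01) the cache quickly dominates, which is how sparse priors learn the skewed substitution tables mentioned on the previous slide.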
Smart Sample-Choice Selection for Bayesian Decipherment

  cipher:    22      94      43       04     98
  current:   three   eight   living   here   ?

  re-sampling position 2 (cipher symbol 94), keeping the rest fixed:
             three   the     living   here   ?
             three   a       living   here   ?
             three   and     living   here   ?
             three   boys    living   here   ?
             three   gun     living   here   ?
             three   brick   living   here   ?
             three   ran     living   here   ?
             ...

• Naive approach: re-sample each cipher token over all English words
  (thousands or millions of cipher tokens per epoch)
• Current solution: sample only among a small set of plaintext choices
• Online computation: given context words e1 and e2, take the top-100
  candidates e ranked by P(e1 e e2) under the LM; if e1 and e2 never
  co-occurred, then pick 100 random words.
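The choice lists above can be sketched as follows: rank candidate words by how well they fit between the two current neighbors, and fall back to random words when the context gives no evidence. Here bigram counts stand in for the LM score P(e1 e e2) used in the talk, and the function names are illustrative:

```python
import random

def candidate_choices(e_left, e_right, bigram_count, vocab, top=100, seed=0):
    """Plaintext candidates for re-sampling one cipher token, given its
    current left/right neighbors: rank words by bigram evidence; if the
    context words never co-occurred with anything, pick `top` random words."""
    scored = [(bigram_count.get((e_left, w), 0) + bigram_count.get((w, e_right), 0), w)
              for w in vocab]
    scored = [(c, w) for c, w in scored if c > 0]
    if not scored:
        return random.Random(seed).sample(vocab, min(top, len(vocab)))
    scored.sort(reverse=True)
    return [w for _, w in scored[:top]]
```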
Word Substitution: (2) Bayesian (continued)

• Same generative story as EM, with CRP formulations replacing the models
• Inference via point-wise Gibbs sampling
  ➡ smart sample-choice selection
  ➡ parallelized sampling using MapReduce (3- to 5-fold faster)
• Decoding: extract the trained channel from the final sample, then
  Viterbi-decode (details in paper)
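Decoding with a trained channel is standard Viterbi over hidden English words: transitions from the bigram LM, emissions from the channel. A small sketch under those assumptions, with the channel given as a dictionary of (english, cipher) probabilities:

```python
import math

def viterbi_decode(cipher_sent, channel, lm_bigrams, en_vocab):
    """Find the English sequence maximizing P(e) * P(f | e) under a
    bigram LM and a learned substitution channel."""
    def lp(p):
        return math.log(p) if p > 0.0 else float("-inf")
    # scores for the first cipher token
    best = {e: lp(lm_bigrams.get(("<s>", e), 0.0))
               + lp(channel.get((e, cipher_sent[0]), 0.0)) for e in en_vocab}
    backptrs = []
    for f in cipher_sent[1:]:
        new, bp = {}, {}
        for e in en_vocab:
            prev = max(en_vocab,
                       key=lambda ep: best[ep] + lp(lm_bigrams.get((ep, e), 0.0)))
            new[e] = (best[prev] + lp(lm_bigrams.get((prev, e), 0.0))
                      + lp(channel.get((e, f), 0.0)))
            bp[e] = prev
        best, backptrs = new, backptrs + [bp]
    # trace back the best path
    e = max(best, key=best.get)
    out = [e]
    for bp in reversed(backptrs):
        e = bp[e]
        out.append(e)
    return list(reversed(out))
```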
Word Substitution Results

[Figure: decipherment accuracy results for the word substitution
experiments]
Sample Decipherments

[Table: example decipherments with columns Cipher / Original English /
Deciphered]
Rest of this Talk

• Introduction
• Related Work
• New Idea for Language Translation
• Word Substitution
• Foreign Language as a Cipher
• Conclusion
Foreign Language as a Cipher

[Figure: the Spanish corpus is treated as ciphertext; an English corpus
is used for LM training (English Language Model), and the unknown key is
an English-to-Foreign Translation Model]

Machine Translation without parallel data
  = Word Substitution + transposition + insertion + deletion + ...
Deciphering Foreign Language

• P(f | e) can be any translation model used in MT (e.g., IBM Model 3)
  ➡ but without parallel data, training is intractable

New:
  1. EM Decipherment       - simple generative story
                           - model can be trained efficiently using EM
  2. Bayesian Decipherment - IBM Model 3 generative story
                           - Bayesian inference (efficient, parallelized
                             sampling)
MT Decipherment: (1) EM

• Simple generative story (avoids large fertility and distortion tables)
  ✓ word substitutions  ✓ local re-ordering  ✓ insertions  ✓ deletions
• Model can be trained efficiently using EM
• Use prior linguistic knowledge for decipherment
  ➡ morphology, linguistic constraints (e.g., "8" in English maps to "8"
    in Spanish)
• Whole-segment language models instead of word n-gram LMs
  e.g., ✓ ARE YOU TALKING ABOUT ME ?
        ✗ THANK YOU TALKING ABOUT ?
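The whole-segment LM idea is simple to state in code: a candidate English segment gets probability mass only if it occurred verbatim in the monolingual corpus, whereas an n-gram LM would also accept locally fluent but globally bad strings like the second example above. A toy sketch:

```python
from collections import Counter

def make_whole_segment_lm(english_corpus):
    """Whole-segment LM: P(segment) is its relative frequency as a
    complete segment in the monolingual English corpus, zero otherwise."""
    seen = Counter(tuple(s.split()) for s in english_corpus)
    total = sum(seen.values())
    def prob(segment):
        return seen[tuple(segment.split())] / total
    return prob
```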
MT Decipherment: (2) Bayesian

• Generative story: IBM Model 3 (popular)
  ➡ translation (word substitution), distortion (transposition), fertility
• Complex model, makes inference very hard
• New translation model
  ➡ replace IBM Model 3 components with CRP formulations (cache model)
MT Decipherment: (2) Bayesian (continued)

• Bayesian inference for estimating the translation model
  ➡ efficient, scalable inference using the strategies described earlier
• Sampling IBM Model 3
  ➡ point-wise Gibbs sampling: for each foreign string f, jointly sample
    alignments and English translations e
  ➡ sampling operators = translate a word, swap alignments, ...
    (similar to Germann et al., 2001)
  ➡ blocked sampling: sample a single derivation for repeated sentences
• Choose the final sample as the MT decipherment output
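The sampling operators can be sketched as local moves on a (translation, alignment) derivation. The sketch below is a simplified Metropolis-style version of the point-wise sampler: it either re-translates one English word from a candidate list or swaps two alignment links, and keeps the proposal according to an unnormalized model score. All names and the scoring interface are illustrative assumptions, not the authors' code:

```python
import random

def sample_step(e_sent, alignment, candidates, score, rng):
    """One local move on a derivation for an observed foreign sentence
    (the foreign words are fixed inside `score`). Operators: re-translate
    one English word, or swap the alignments of two positions."""
    prop_e, prop_a = list(e_sent), list(alignment)
    if rng.random() < 0.5 and e_sent:   # operator 1: translate a word
        i = rng.randrange(len(e_sent))
        prop_e[i] = rng.choice(candidates.get(e_sent[i], [e_sent[i]]))
    elif len(alignment) >= 2:           # operator 2: swap two links
        i, j = rng.sample(range(len(alignment)), 2)
        prop_a[i], prop_a[j] = prop_a[j], prop_a[i]
    old, new = score(e_sent, alignment), score(prop_e, prop_a)
    if new >= old or rng.random() < new / max(old, 1e-300):
        return prop_e, prop_a           # accept the proposal
    return e_sent, alignment            # keep the current derivation
```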
Time Expressions

Spanish corpus:                          English corpus:
  ...                                      ...
  10 días consecutivos de cotización       10 MONTHS LATER
  10 semanas consecutivas                  10 MORE YEARS
  100 años después                         24 MINUTES
  ...                                      28 CONSECUTIVE QUARTERS
  17 de abril 1986                         ...
  ...                                      A WEEK EARLIER
  años                                     ABOUT A DECADE AGO
  cuarto puesto                            ABOUT A MONTH AFTER
  enero alrededor. 28                      ...
  ...                                      AUGUST 6 , 1789
  mil años antes                           ...
  mil años                                 CENTURIES AGO
  ...                                      DEC . 11 , 1989
  otros 12 meses más o menos               ...
  ...                                      TWO DAYS LATER
  una de tres horas                        TWO DECADES LATER
  uno de tres años                         TWO FULL DAYS
  un jueves por la noche                   ...
  Un día hace poco                         YEARS
  ...                                      ...

Results (BLEU score, higher is better):
  MOSES with parallel data:            88
  Decipherment without parallel data:  48.7
  Baseline without parallel data:      18.2
Movie Subtitles

English corpus:                              Spanish corpus:
  ...                                          ...
  ALL RIGHT , LET' S GO .                      abran la puerta .
  ARE YOU ALL RIGHT ?                          bien hecho .
  ARE YOU CRAZY ?                              ...
  ...                                          ¡ por aquí !
  HEY , DO YOU WANT TO COME OUT                ¿ a qué te refieres ?
  AND PLAY THE GAME ?                          ¿ cómo podré verlos a través
  ...                                          de mis lágrimas ?
  WHAT ARE YOU DOING HERE ?                    oye , ¿ quieres salir y jugar
  ...                                          el juego ?
  YEAH !                                       ...
  YOU KNOW WHAT I MEAN ?                       un segundo .
  ...                                          vamonos .
                                               ...

OPUS Spanish/English corpus [Tiedemann, 2009]

Results (BLEU score, higher is better):
  MOSES with parallel data:            63.6
  Decipherment without parallel data:  19.3
MT Accuracy vs. Data Size

[Figure: MT quality on the test set vs. training data size, comparing
phrase-based MT (parallel data), IBM Model 3 minus distortion (parallel
data), and EM Decipherment (no parallel data)]
Conclusion

• Language translation without parallel data
  ➡ very challenging task, but shown to be possible (using a
    decipherment approach)
  ➡ initial results promising
  ➡ can easily extend to new language pairs and domains
• Future Work
  ➡ scalable decipherment methods for full-scale MT
  ➡ better unsupervised algorithms for decipherment
  ➡ leverage existing bilingual resources (e.g., dictionaries) during
    decipherment
  ➡ applications for domain adaptation
What else can Decipherment Do?

• This talk: Language Translation
  ➡ TRAIN translation tables from monolingual corpora (Spanish text,
    English text)
• Cryptanalysis: CATCH A SERIAL KILLER
  ➡ afternoon talk (2pm, Machine Learning Session 2-B)

THANK YOU!