Toward Statistical Machine Translation without Parallel Corpora

Toward Statistical Machine Translation
without Parallel Corpora
Alex Klementiev, Ann Irvine, Chris Callison-Burch, and David Yarowsky
Johns Hopkins University
EACL 2012
Motivation
} 
State-of-the-art statistical machine translation models are estimated from
parallel corpora (manually translated text)
} 
Large volumes of parallel text are typically needed to induce good models
BUT
} 
Manual translation is laborious and expensive
} 
Sufficient quantities available for only a few language pairs
} 
E.g. Canadian and European parliamentary proceedings
} 
WMT 2011 translation task: 6 languages, Google Translate - 59
2
Goal
Instead of:
3
Goal
Cheap monolingual data
+
Use cheap and plentiful monolingual data to reduce (eliminate?) the
need for parallel data to induce good translation models
4
Summary of the Approach
At a very high level, translation involves:
1. 
Choosing correct translation of words and phrases
Phrase based SMT: extract / score phrasal dictionaries from sentence
aligned parallel data
This work: extend bilingual lexicon induction to score phrasal
dictionaries from monolingual data
2. 
Putting the translated phrases or words in the right order
Phrase based SMT: extract ordering information from parallel data
This work: novel algorithm to estimate ordering from monolingual data
5
Outline
} 
Motivation and summary of the approach
} 
Phrase-based statistical machine translation
} 
Weaning phrase-based SMT off parallel data
} 
Scoring phrases
} 
} 
Scoring ordering
} 
} 
Extending bilingual lexicon induction
Novel reordering algorithm
Experiments
6
Phrase-based SMT
Parallel Data
(big and expensive)
en
ur
en
en urfr
Phrase Table
Word Align
and Extract
Phrases
en
fr
Score
Phrases
Tuning Data
(small)
Score
Ordering
en
en urfr
fr
Scored Table
en
fr
fts
Training
1
ft weights
Decoding
2
en
en
fr
Extract
Phrases
Score
Phrases
Score
Ordering
Phrase Table
en
en
en
urur
fr
Monolingual Data
(huge and cheap)
We introduce monolingual equivalents to
bilingually estimated features
7
1
} 
Scoring Phrases
Phrase-based SMT: relatedness of a given phrase pair (e,f) is captured by:
} 
Phrase translation probabilities φ(e|f), φ(f|e)
} 
Average word translation w(ei|fj) probabilities using phrase-pair-internal word
alignments
fj
índice
f:
e:
de
inflation
inflación
rate
ei
} 
} 
Both estimated from a parallel corpus
How can we induce phrasal similarity from monolingual data?
8
Scoring Phrases: Context
1
} 
An old first idea [Rapp, 99]: measure contextual similarity
} 
} 
Words appearing in similar context are probably related
First, collect context
rápidamente
planeta
economías
21
... este número podría crecer muy rápidamente si no se modifica ...
1
... nuestras economías a crecer y desarrollarse de forma saludable ...
extranjero
empleo
... que nos permitirá crecer rápidamente cuando el contexto ...
crecer
9
1
} 
Scoring Phrases: Context
An old first idea [Rapp, 99]: measure contextual similarity
} 
Words appearing in similar context are probably related
First, collect context
} 
Then, project through a seed dictionary, and compare vectors
} 
rápidamente
planeta
economías
7
4
4
5
economic
growth
1
employment
2
quickly
policy
1
7
dict.
extranjero
empleo
7
3
crecer
9
policy
crecer
(projected)
activity
expand
10
1
Second idea: measure temporal similarity
} 
Assume, we have temporal information associated with text (e.g. news
publication dates)
} 
Events are discussed in different languages at the same time
Collect temporal signature
terrorist (en)
terrorista (es)
Occurrences
} 
terrorist (en)
riqueza (es)
Occurrences
} 
Scoring Phrases: Time
Similar
Dissimilar
Time
} 
Measure similarity between signatures (e.g. cosine or DFT-based metric)
11
1
} 
Scoring Phrases: Topics
Third idea: measure topic similarity
} 
} 
Phrases and their translations are likely to appear in the same topic
The more similar the set of topics in which a pair of phrases appears the more
likely the phrases are similar
Topic 1
Topic 2
Topic 3
Topic N
L1
L2
} 
We treat Wikipedia article pairs with interlingual links as topics
12
1
} 
Fourth idea: measure orthographic similarity
} 
} 
} 
} 
Scoring Phrases: Orthography
Etymologically related words often retain similar spelling across languages with
the same writing system
Spanish
English
democracia
democracy
Transliterate for language pairs with different writing system
Russian
Transliterated
English
демократия
demokratiya
democracy
Measure similarity with edit distance or a discriminative translit model
Other ideas: burstiness, etc.
13
1
} 
Scoring Phrases: Scaling Up Lex Induction
Challenge in scoring phrase pairs monolingually: Sparsity
50
45
Accuracy, %
40
35
30
25
20
15
10
5
Top 10
0
0
} 
100
200
300
400
Corpus frequency
500
600
Score the similarity of individual words within a phrase pair
} 
Similar to lexical weights in phrase-based SMT
} 
Use phrase pair internal word alignments, penalize for unaligned words
14
2
} 
m: monotone (keep order)
} 
s: swap order
} 
d: become discontinuous
} 
Reordering features are
probability estimates of s, d, and
m
verdienen
Facebook
in
Profils
seines
aufrgund
Start with aligned phrases
man
} 
sollte
Phrase-based SMT: model the probabilities of orientation change of a phrase
with respect to a preceding phrase when translated
Wieviel
} 
Scoring Ordering
How
much
should
you
charge
m
m
d
for
your
d
m
Facebook
d
profile
s
15
2
} 
Scoring Ordering from Monolingual Data
Estimate same probabilities, but from pairs of (unaligned) sentences taken
from monolingual data
} 
We don’t have alignments, but we do have a phrase table
German
German
Mono
English
Phrase Table
German
! das
Profils
…
Facebook
…
und nicht
zustand
} 
English
, and
profile
…
in Facebook
…
and a lack
situation as
What does your Facebook profile reveal?
s
Das Anlegen eines Profils in Facebook ist einfach.
German
German
Mono
German
Repeat over many sentences
16
2
Estimate same probabilities, but from pairs of (unaligned) sentences taken
from monolingual data
einfach
ist
Facebook
in
Profils
eines
We don’t have alignments, but we do have a phrase table
Anlegen
} 
Das
} 
Scoring Ordering from Monolingual Data
What
does
your
Facebook
profile
s
reveal
} 
Repeat over many sentences
17
1,2
Scoring Phrases and Ordering: Summary
Phrase-based SMT Features
Monolingual Equivalents
Phrase translation probabilities
Temporal, contextual, and topic
phrase similarity
Lexical weights
Temporal, contextual, topic, and
orthographic word similarities
Reordering features
Reordering features estimated
from monolingual data
Alternatives to features estimated from parallel corpora: we use cheap
monolingual data (along with some metadata), and a seed dictionary
18
Experiments: Spanish-English
} 
Idealized setup: treat English and Spanish sections of Europarl as two
independent monolingual corpora
} 
} 
} 
Europarl is annotated with temporal information
Drop the idealization: estimate features from truly monolingual corpora
} 
Gigaword, Wikipedia for contextual and temporal
} 
Wikipedia for topical
Run a series of lesion experiments:
} 
Begin with standard features estimated from parallel data
} 
Drop phrasal, lexical, and reordering features
} 
Replace them with the monolingually estimated counterparts
} 
See how much performance can be recovered
19
Experiment: Europarl Spanish-English
25
21.9
22.9
75% performance
recovered
21.5
20
16.9
15
17.5
+ 1 BLEU if
mono features
are added to
phrase based
MT
BLEU
12.9
10.5
10
5
4.0
0
Standard
phrasebased SMT
Removed
the bilingual
reordering
model
Removed
Removed
bilingual
all bilingual
translation
features
probabilities
Added
monoling.
reordering
model
Added
monoling.
phrase
features
All
(and only)
monoling.
features
Standard phrasebased + mono
phrase features
20
Experiment: Monolingual Spanish-English
25
21.9
21.5
20
17.9
15
23.3
82% performance
recovered
18.8
+ 1.5 BLEU if
mono features
are added to
phrase based
MT
BLEU
12.9
10.2
10
5
4.0
0
Standard
phrasebased SMT
Removed
the bilingual
reordering
model
Removed
Removed
bilingual
all bilingual
translation
features
probabilities
Added
monoling.
reordering
model
Added
monoling.
phrase
features
All
(and only)
monoling.
features
Standard phrasebased + mono
phrase features
21
Conclusions
Monolingual data takes us a long way toward inducing good translation models
} 
Can recover substantial portion of the lost performance in lesion experiments
} 
Consistently improve performance when added to the phrase-based setup
Important direction for languages lacking sufficient (or any) parallel data
} 
True for most languages
} 
Monolingual data is plentiful and growing
22