The 7th International Conference on Language Resources and Evaluation, Malta, May 2010.
Using Comparable Corpora to Adapt
a Translation Model to Domains
Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada
Department of Computer Science, Shizuoka University
Overview
1. Motivation and goal
2. Proposed method
   a. Estimating noun translation pseudo-probabilities
   b. Estimating noun-sequence translation pseudo-probabilities
   c. Phrase-based SMT using translation pseudo-probabilities
3. Experiments
4. Discussion
5. Related work
6. Summary
Motivation and goal

• Statistical machine translation
  – can learn a translation model from a parallel corpus
  – suffers from the limited availability of large parallel corpora
• Goal: use comparable corpora for SMT
  – Estimate translation pseudo-probabilities from a bilingual dictionary and comparable corpora
  – Use the pseudo-probabilities estimated from in-domain comparable corpora to
    • adapt a translation model learned from an out-of-domain parallel corpus, or
    • augment a translation model learned from a small in-domain parallel corpus
Basic idea for estimating word translation pseudo-probabilities from comparable corpora

• Word associations suggest particular senses or translations of a polysemous word (Yarowsky 1993):
  – (tank, soldier) → the "military vehicle" sense, i.e., the translation 戦車[SENSHA] of "tank"
  – (tank, gasoline) → the "container for liquid or gas" sense, i.e., the translation タンク[TANKU] of "tank"
• Comparable corpora allow us to determine which word associations suggest which translations of a polysemous word (Kaji & Morimoto 2002).
• Assumption: the more word associations that suggest a translation, the higher the probability of that translation.
Naive method for estimating word translation pseudo-probabilities

1. Extract word associations from each monolingual corpus:
   – English corpus: (tank, fuel), (tank, gasoline), (tank, missile), (tank, soldier)
   – Japanese corpus: (タンク[TANKU], 燃料[NENRYOU]), (タンク[TANKU], ガソリン[GASORIN]), (戦車[SENSHA], ミサイル[MISAIRU]), (戦車[SENSHA], 兵士[HEISHI])
2. Align the word associations via an English-Japanese dictionary:
   – "fuel", "gasoline", and others suggest タンク[TANKU]
   – "missile", "soldier", and others suggest 戦車[SENSHA]
3. Calculate the percentage of associated words suggesting each translation:
   Pps(タンク[TANKU] | tank) = |{fuel, gasoline, …}| / (|{fuel, gasoline, …}| + |{missile, soldier, …}|)
   Pps(戦車[SENSHA] | tank) = |{missile, soldier, …}| / (|{fuel, gasoline, …}| + |{missile, soldier, …}|)
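The counting step above can be sketched in a few lines of Python; the data structure and function name here are illustrative, not taken from the paper:

```python
# Hypothetical sketch of the naive estimate: the pseudo-probability of a
# translation is the fraction of the source word's associated words that
# were aligned to that translation via the bilingual dictionary.
def naive_pseudo_probabilities(aligned):
    """aligned: dict mapping each translation to the set of associated
    words that suggest it, e.g. {"TANKU": {"fuel", "gasoline"}, ...}."""
    total = sum(len(words) for words in aligned.values())
    return {t: len(words) / total for t, words in aligned.items()}

probs = naive_pseudo_probabilities({
    "TANKU": {"fuel", "gasoline"},     # container sense
    "SENSHA": {"missile", "soldier"},  # military-vehicle sense
})
# Each sense is suggested by 2 of the 4 associated words, i.e. 0.5 each.
```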
Difficulties the naive method suffers from

• Failure in word-association alignment
  – (tank, Chechen) → ? : the counterpart association is missing, due to the disparity in topical coverage between the two language corpora
  – (tank, Chechen) cannot be aligned with (戦車[SENSHA], チェチェン[CHECHEN]), due to the incomplete coverage of the intermediary bilingual dictionary
• Incorrect word-association alignment
  – (tank, troop) → (水槽[SUISOU], 群れ[MURE]), due to an incidental word-for-word correspondence between word associations that do not really correspond to each other
How to overcome the difficulties

• Heuristic: two words associated with a third word are likely to suggest the same sense or translation of the third word when they are also associated with each other.
  "Soldier" and "troop", both of which are associated with "tank", are associated with each other
  ⇒ "soldier" and "troop" suggest the same translation 戦車[SENSHA]
• Define the correlation between an associated word and a translation using the correlations between the other associated words and that translation:
  – C(troop, 戦車[SENSHA]) ∝ MI(troop, tank) × {MI(troop, soldier) × C(soldier, 戦車[SENSHA]) + MI(troop, missile) × C(missile, 戦車[SENSHA]) + …}
  – C(troop, タンク[TANKU]) ∝ MI(troop, tank) × {MI(troop, soldier) × C(soldier, タンク[TANKU]) + MI(troop, missile) × C(missile, タンク[TANKU]) + …}
• Calculate the correlations iteratively, starting with initial values determined from the results of word-association alignment via a bilingual dictionary.

Initial correlations C0(associated_word, translation) for "tank" after alignment (unaligned words such as "Chechen" get 0.0; ambiguously aligned words such as "troop" are split evenly):

  associated word | 戦車[SENSHA] | タンク[TANKU]
  Chechen         |     0.0     |     0.0
  fuel            |     0.0     |     1.0
  gasoline        |     0.0     |     1.0
  missile         |     1.0     |     0.0
  soldier         |     1.0     |     0.0
  troop           |     0.5     |     0.5
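The iterative computation described above can be sketched as a simple fixed-point loop. This is a minimal illustration, not the authors' exact recurrence: the normalization scheme and stopping criterion are assumptions, and the MI values and initial matrix C0 are taken as given.

```python
# Sketch: an associated word inherits support for a translation from the
# other associated words it co-occurs with, weighted by pointwise mutual
# information, starting from the alignment-based initial values C0.
def iterate_correlations(assoc_words, translations, mi_head, mi_pair, c0, rounds=10):
    """mi_head[a]: MI(a, headword); mi_pair[(a, b)]: MI(a, b);
    c0[(a, t)]: initial correlation of associated word a with translation t."""
    c = dict(c0)
    for _ in range(rounds):
        new_c = {}
        for a in assoc_words:
            for t in translations:
                support = sum(mi_pair.get((a, b), 0.0) * c[(b, t)]
                              for b in assoc_words if b != a)
                new_c[(a, t)] = mi_head[a] * support
            # Normalize over translations so each word's row stays comparable.
            z = sum(new_c[(a, t)] for t in translations) or 1.0
            for t in translations:
                new_c[(a, t)] /= z
        c = new_c
    return c
```

With the "tank" example, "troop" (initially split 0.5/0.5) drifts toward 戦車[SENSHA] because it is associated with "soldier" and "missile", which already suggest that translation.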
Overview of our method for estimating noun translation pseudo-probabilities

1. From each of the English and Japanese corpora, extract pairs of words co-occurring in a window (window size = 10 content words) and calculate pointwise mutual information, yielding English and Japanese word associations.
2. Align the word associations via an English-Japanese dictionary, giving the initial value of the correlation matrix of English associated words vs. Japanese translations for an English noun.
3. Calculate the pairwise correlations between associated words and translations iteratively, yielding the final correlation matrix.
4. Assign each associated word to the translation with which it has the highest correlation, and take the percentage of associated words assigned to each translation as its noun translation pseudo-probability.
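Step 1 above can be sketched as follows. Tokenization and content-word filtering are assumed to have been done already, and the frequency thresholds are illustrative placeholders:

```python
import math
from collections import Counter

# Sketch of the association-extraction step: collect pairs of content
# words co-occurring within a window and score them with pointwise
# mutual information (PMI).
def extract_associations(sentences, window=10, min_pair_count=2):
    word_count, pair_count, total = Counter(), Counter(), 0
    for words in sentences:
        total += len(words)
        word_count.update(words)
        for i, w in enumerate(words):
            for v in words[i + 1:i + window]:  # co-occurrence window
                if v != w:
                    pair_count[tuple(sorted((w, v)))] += 1
    scores = {}
    for (w, v), c in pair_count.items():
        if c >= min_pair_count:  # threshold on co-occurrence frequency
            p_wv = c / total
            p_w, p_v = word_count[w] / total, word_count[v] / total
            scores[(w, v)] = math.log2(p_wv / (p_w * p_v))
    return scores
```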
Example correlation matrix and estimated noun translation pseudo-probabilities for "plant" (rows: English associated words; columns: Japanese translations):

                   装置      設備       植物      工場    プラント    苗      植木
                 [SOUCHI] [SETSUBI] [SHOKUBUTSU] [KOUJOU] [PURANTO] [NAE]   [UEKI]
  activity         0.02     0.03      2.10      0.20     0.03     0.01     0.02
  bacteria         0.02     0.03      1.98      0.01     0.02     0.27     0.02
  boiler           0.05     2.70      0.05      0.03     2.73     0.03     0.04
  coal             0.87     2.35      1.70      0.68     2.06     0.65     0.99
  computer         0.55     0.71      0.02      0.49     0.73     0.01     0.01
  control          0.47     0.51      0.17      0.15     0.62     0.06     0.01
  culture          0.03     0.05      3.26      0.23     0.12     0.77     0.88
  environment      0.76     1.25      1.32      0.03     0.05     0.23     0.03
  failure          0.93     1.22      0.03      0.53     1.43     0.01     0.01
  flower           0.04     0.06      4.02      0.04     0.04     1.23     1.70
  :                 :        :         :         :        :        :        :

  Translation
  pseudo-
  probabilities    .047     .241      .423      .022     .223     .022     .022
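The final step, turning a correlation matrix like the one above into pseudo-probabilities, can be sketched directly; the dictionary layout is an illustrative assumption:

```python
from collections import Counter

# Sketch: assign each associated word to the translation with which it
# correlates most strongly, then take the share of associated words
# assigned to each translation as its pseudo-probability.
# `matrix` maps associated word -> {translation: correlation}.
def matrix_to_pseudo_probabilities(matrix):
    assigned = Counter(max(row, key=row.get) for row in matrix.values())
    total = sum(assigned.values())
    return {t: assigned[t] / total for t in assigned}
```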
Our method for estimating noun-sequence translation pseudo-probabilities

1. Extract a noun sequence F = f1 f2 … fm, with its frequency, from the English corpus.
2. Generate all compositional translations E(1) = e1(1) e2(1) … em(1), E(2) = e1(2) e2(2) … em(2), …, E(n) = e1(n) e2(n) … em(n) using the English-Japanese dictionary.
3. Retrieve the compositional translations from the Japanese corpus and count their frequencies g(E(k)).
4. Estimate according to occurrence frequencies:
   P1(E(j) | F) = g(E(j)) / Σ_{k=1..n} g(E(k))
5. Estimate according to constituent-word translation pseudo-probabilities:
   P2(E(j) | F) = Π_{i=1..m} Pps(ei(j) | fi) / Σ_{k=1..n} Π_{i=1..m} Pps(ei(k) | fi)
6. Combine the two estimates:
   Pps(E(j) | F) ∝ P1(E(j) | F) · P2(E(j) | F)
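The two estimates and their combination can be sketched as follows. The renormalized product in the last step is one reading of the proportionality on the slide; the paper's exact combination rule may differ, and all names here are illustrative:

```python
# Sketch: combine a frequency-based estimate P1 with a constituent-word
# estimate P2 for each compositional translation candidate, then
# renormalize the product over all candidates.
def sequence_pseudo_probabilities(freqs, word_probs, candidates):
    """freqs: dict candidate-tuple -> corpus frequency g(E);
    word_probs[i]: dict mapping translations of the i-th source noun
    to their pseudo-probabilities Pps(e | fi)."""
    def p2_raw(cand):
        prod = 1.0
        for i, word in enumerate(cand):
            prod *= word_probs[i].get(word, 0.0)
        return prod
    g_total = sum(freqs.get(c, 0) for c in candidates) or 1.0
    p2_total = sum(p2_raw(c) for c in candidates) or 1.0
    scores = {c: (freqs.get(c, 0) / g_total) * (p2_raw(c) / p2_total)
              for c in candidates}
    z = sum(scores.values()) or 1.0
    return {c: s / z for c, s in scores.items()}
```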
Phrase-based SMT using translation pseudo-probabilities

• A basic phrase table is learned from the out-of-domain (or in-domain) parallel corpus with Giza++ and alignment heuristics.
• An in-domain phrase table is built from translation pseudo-probabilities, which are estimated from the in-domain source-language corpus, the in-domain target-language corpus, and the bilingual dictionary.
• The two tables are merged into an adapted (or augmented) phrase table.
• An in-domain language model is learned from the in-domain target-language corpus with SRILM.
• The Moses decoder translates source-language text into target-language text using the merged phrase table and the in-domain language model.
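The slide does not spell out the merge rule for the two phrase tables. One simple hypothetical scheme, shown only to make the pipeline concrete, keeps every entry of the basic (parallel-corpus) table and adds in-domain pseudo-probability entries for source phrases the basic table misses:

```python
# Hypothetical merge: prefer entries learned from the parallel corpus;
# fall back to pseudo-probability entries for uncovered source phrases.
# Each table maps (src_phrase, tgt_phrase) -> feature scores.
def merge_phrase_tables(basic, in_domain):
    merged = dict(basic)
    basic_sources = {src for src, _ in basic}
    for (src, tgt), scores in in_domain.items():
        if src not in basic_sources:
            merged[(src, tgt)] = scores
    return merged
```

Other schemes (e.g., keeping both entries as separate features and letting tuning weigh them) are equally plausible; the paper should be consulted for the actual rule.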
Experimental setting

• Experiment A: adapt a phrase table learned from an out-of-domain parallel corpus by using in-domain comparable corpora
• Experiment B: augment a phrase table learned from a small in-domain parallel corpus by using larger in-domain comparable corpora

Training parallel corpus
– Experiment A: 20,000 pairs of Japanese and English patent abstracts in the physics domain
– Experiment B: 20,000 pairs of Japanese and English sentences having high similarity, extracted from scientific-paper abstracts in the chemistry domain

Training comparable corpora (common to both experiments): scientific-paper abstracts in the chemistry domain
– Japanese: 151,958 abstracts (90.8 Mbytes)
– English: 102,730 abstracts (64.9 Mbytes)

Test corpus: 1,000 Japanese sentences, each having one reference English translation, from scientific-paper abstracts in the chemistry domain

Bilingual dictionary: 333,656 pairs of translation equivalents between 163,247 Japanese nouns and 93,727 English nouns, compiled from the EDR, EIJIRO, and EDICT dictionaries
• Our method, in four cases using different volumes of comparable corpora:
  1. Japanese: all, English: all
  2. Japanese: half, English: all
  3. Japanese: all, English: half
  4. Japanese: half, English: half
• Two baseline methods using the phrase table learned from the parallel corpus:
  1. Baseline without dictionary
  2. Baseline with dictionary: the phrase table was augmented with the bilingual dictionary

[Note] The target-language language model, learned from the whole target-language monolingual corpus, was used in all cases, for both our method and the baselines.

• Evaluation metric: BLEU-4
Experimental results

• BLEU-4 scores:

  Method                         | Experiment A | Experiment B
  Our method (J: all,  E: all)   |    13.30     |    16.82
  Our method (J: half, E: all)   |    13.19     |    16.70
  Our method (J: all,  E: half)  |    13.21     |    16.78
  Our method (J: half, E: half)  |    13.27     |    16.71
  Baseline w/o dictionary        |    11.42     |    16.37
  Baseline w/ dictionary         |    12.94     |    16.32

• Our method improved the BLEU score, though only slightly
• The effect of the volume of comparable corpora remains unclear
• Simply adding a bilingual dictionary improved the out-of-domain phrase table, but did not improve the in-domain phrase table
Discussion

1. Optimization of the parameters
   – Parameters, including the window size and the thresholds for word occurrence frequency, co-occurrence frequency, and pointwise mutual information, affect the correlation matrix of associated words vs. translations
   – How to optimize the values of these parameters remains an open problem

2. Alternatives for the word-association measure
   – Pointwise mutual information, which tends to overestimate low-frequency words, is not the most suitable measure for acquiring word associations
   – It needs to be compared with alternatives such as the log-likelihood ratio and the Dice coefficient
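The contrast between PMI and one of the suggested alternatives is easy to see from the formulas. Both scores below are computed from the same counts (n_xy co-occurrences, n_x and n_y marginal counts, n total tokens); the function names are illustrative:

```python
import math

# PMI compares the observed co-occurrence rate with the rate expected
# under independence; Dice measures overlap relative to the marginals.
def pmi(n_xy, n_x, n_y, n):
    return math.log2((n_xy * n) / (n_x * n_y))

def dice(n_xy, n_x, n_y):
    return 2 * n_xy / (n_x + n_y)

# PMI rewards rarity: a pair of hapaxes that always co-occur gets the
# maximum possible score, whereas Dice gives the same value to a
# well-attested pair with the same overlap ratio.
```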
3. Refinement of the definition of translation pseudo-probability
   – The frequencies of associated words, as well as the dependence among associated words, need to be taken into account
   – The strategy of assigning an associated word to only one translation needs to be reconsidered

4. Estimation of verb translation pseudo-probabilities
   – Syntactic co-occurrence, instead of co-occurrence in a window, needs to be used to extract verb-noun associations from corpora
   – Pairwise correlations between associated nouns and translations need to be defined recursively, based on the heuristic that two nouns associated with a verb are likely to suggest the same sense of the verb when they belong to the same semantic class
Related work

• Many studies on bilingual lexicon acquisition from comparable corpora have been reported since the mid-1990s, but few address the estimation of word translation probabilities from comparable corpora.
• Estimating word translation probabilities from comparable corpora with an EM algorithm (Koehn & Knight 2000) can be greatly affected by the occurrence frequencies of translation candidates in the target-language corpus. In contrast, our method produces translation pseudo-probabilities that reflect the distribution of the senses of the source-language word in the source-language corpus.
• Methods for extracting parallel sentence pairs from bilingual comparable corpora (Zhao & Vogel 2002; Utiyama & Isahara 2003; Fung & Cheung 2004; Munteanu & Marcu 2005): the extracted parallel sentences can be used to learn a translation model with a conventional method based on word-for-word alignment. This approach is applicable only to closely comparable corpora, whereas our method is applicable even to a pair of unrelated monolingual corpora.
Summary

• We presented a method for estimating translation pseudo-probabilities from a bilingual dictionary and bilingual comparable corpora.
  – Assumption: the more associated words a translation is correlated with, the higher its translation probability
  – Essence of the method: calculate pairwise correlations between the associated words of a source-language word and its target-language translations
• We proposed a phrase-based SMT framework using an out-of-domain parallel corpus and in-domain comparable corpora.
  – An experiment showed promising results: the BLEU score was improved by using the translation pseudo-probabilities estimated from in-domain comparable corpora
• Future work includes optimizing the parameters and extending the method to estimate translation pseudo-probabilities for verbs.