Character-Based Pivot Translation for Under-Resourced Languages and
Domains – Supplementary Material
Jörg Tiedemann
Department of Linguistics and Philology
Uppsala University, Uppsala/Sweden
[email protected]
1 Basic Setup – Data and Models
The subtitle corpora provided by OPUS differ between language pairs. Therefore, there is no common test set for all language pairs, and results are not comparable between languages. Some care had to be taken when preparing the data sets for our experiments. We selected the data in the following way:
• Every sentence pair in a test set and in a development set is unique.
• Test sets, development sets and training sets
are completely disjoint.
• Monolingual data for language modeling do not contain any of the sentences in test and development sets.
• Training and development data for the pivot models do not contain sentences that appear in the test set of the source-to-target model.
• The subtitle corpus is quite noisy. Therefore, we added some extra filters to select proper test and development data (see the sketch after this list):
  – Maximum sentence length: 100
  – Maximum length of a single token: 100 characters
  – Ratio between the number of words on one side and on the other side ≤ 4
  – Ratio between the length of the source language string and the target language string ≤ 2
• Training data is cleaned with the standard tools provided by Moses; maximum sentence length: 100
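
The following Python sketch illustrates the noise filters above. It is not the original preparation script; the thresholds come from the list, but the function name and the token-based reading of the sentence-length limit are our own assumptions.

# Minimal sketch of the noise filters listed above (not the original
# preparation scripts). Thresholds follow the list; keep_pair() and the
# token-based interpretation of "sentence length" are assumptions.

def keep_pair(src, trg, max_len=100, max_tok=100,
              max_word_ratio=4.0, max_char_ratio=2.0):
    """Return True if a sentence pair passes all noise filters."""
    src_tok, trg_tok = src.split(), trg.split()
    # maximum sentence length
    if len(src_tok) > max_len or len(trg_tok) > max_len:
        return False
    # maximum length of a single token (in characters)
    if any(len(t) > max_tok for t in src_tok + trg_tok):
        return False
    # ratio between the number of words on the two sides
    shorter = max(1, min(len(src_tok), len(trg_tok)))
    if max(len(src_tok), len(trg_tok)) > max_word_ratio * shorter:
        return False
    # ratio between the lengths of the two strings
    shorter_str = max(1, min(len(src), len(trg)))
    if max(len(src), len(trg)) > max_char_ratio * shorter_str:
        return False
    return True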
1.1 Word-Based SMT Models
All translation models used in the paper use the
standard setup of phrase-based SMT with slightly
different parameters. Word-based models use the
following general setup:
• Translation models and language models use
lowercased data
• Language models are trained on the entire monolingual material available from the subtitle corpus in OPUS, except for the sentences included in development and test sets; no other data sets are used as a background language model
• Language models are interpolated 5-gram models with Witten-Bell discounting (Kneser-Ney smoothing caused problems with small data sets), trained with the SRILM toolkit (Stolcke, 2002); see the sketch after this list
• Maximum phrase length: 7
• Lexicalized reordering, distortion limit = 6
• BLEU scores are computed on lowercased data (even though we trained recasing models for all data sets, the tendencies are the same)
• For Norwegian to English we used a compound splitter for the input texts (and also for
Danish, Swedish and German when used as a
pivot language). The compound splitters are
trained with the tools provided by Moses.
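
As a concrete illustration, a language model with these settings can be trained with SRILM's ngram-count roughly as follows. This is a hedged sketch rather than the exact command used for the experiments; the file names are placeholders.

# Hedged sketch of how such a language model can be built with SRILM's
# ngram-count; not the exact command used for the paper.
import subprocess

def train_lm(corpus_lowercased, lm_out, order=5):
    """Train an interpolated n-gram LM with Witten-Bell discounting."""
    subprocess.run([
        "ngram-count",
        "-order", str(order),
        "-interpolate",               # interpolated model
        "-wbdiscount",                # Witten-Bell discounting
        "-text", corpus_lowercased,   # lowercased monolingual data
        "-lm", lm_out,                # output LM in ARPA format
    ], check=True)

# e.g. train_lm("subtitles.en.lowercased", "en.5gram.lm")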
Table 1 lists the monolingual data used for language modeling.
language       # sent's     # words
Galician             9k         46k
Catalan            138k          1M
Macedonian         772k          4M
Bosnian              5M         26M
Bulgarian           37M        199M
Spanish             79M        462M
English            175M      1,056M

Table 1: Size of monolingual training data in number of sentences and words.

1.2 Character-Based SMT Models

Character-based models differ from the word-based setup in the following ways (a character-segmentation sketch follows the list):
• Language models are 10-gram models (but
with the same smoothing parameters)
• Maximum phrase length (lengths of character N-grams) = 10
• No reordering model (distortion limit = 0)
• Maximum sentence length: 100 characters
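
For clarity, the kind of character-level segmentation assumed here can be sketched as follows. The underscore as an explicit space marker is our assumption for illustration, not necessarily the symbol used in the actual setup.

# Sketch of character-level preprocessing: each sentence is split into
# single characters separated by spaces, with the original spaces
# replaced by an explicit boundary marker ("_" is an assumption here).

def to_characters(sentence, boundary="_"):
    """Turn a word-level sentence into a space-separated character sequence."""
    return " ".join(boundary if ch == " " else ch for ch in sentence)

def to_words(char_sequence, boundary="_"):
    """Invert the segmentation after decoding."""
    return "".join(" " if ch == boundary else ch
                   for ch in char_sequence.split())

# to_characters("greit da")  ->  "g r e i t _ d a"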
The training data for character-level models is smaller than the one for word-based models due to the restriction to 100 characters, but the differences are small:

language pair            word-based    char-based
Catalan-Spanish              64,129        61,605
Galician-Spanish              2,027         1,947
Macedonian-Bulgarian        155,505       151,965
Macedonian-Bosnian           12,625        12,365

Table 2: Size of training data in number of sentence pairs after cleaning.
2 Pivot Translations
For N-best rescoring we used the top N unique translation hypotheses from the intermediate translation step, selected from the top-100 translations returned by the decoder. In some cases, there were fewer unique translation alternatives. This is especially the case for character-based models, in which there are many ways of segmenting a sentence that lead to the same translation.
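
A minimal sketch of this selection step, assuming the n-best entries have already been parsed from the decoder output into (translation, score) pairs sorted by model score:

# Select up to n unique intermediate hypotheses from a parsed n-best list
# (top-100) before pivot rescoring; parsing the decoder output is omitted.

def unique_nbest(hypotheses, n=10):
    """Return up to n unique translations, keeping the best-scoring first."""
    seen, unique = set(), []
    for sentence, score in hypotheses:   # assumed sorted by model score
        if sentence not in seen:
            seen.add(sentence)
            unique.append((sentence, score))
        if len(unique) == n:
            break
    return unique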
Tables 3 and 4 list some additional results from
pivot translations using other character-alignment
approaches. We report only one-best pivot translations in this case. The differences between the
alignment models are rather small. Character
alignment using word alignment models seems to
be quite robust and leads to better performance in
the pivot translation task.
Model                                       BLEU
English – Catalan (baseline)               26.70
English – Spanish -word- Catalan           38.91
English – Spanish -wfst1:1- Catalan        44.32
English – Spanish -wfst2:2- Catalan        41.43
English – Spanish -ibmchar- Catalan        44.46
English – Spanish -ibmbigram- Catalan      44.46
Catalan – English (baseline)               27.86
Catalan -word- Spanish – English           38.41
Catalan -wfst1:1- Spanish – English        39.97
Catalan -wfst2:2- Spanish – English        39.30
Catalan -ibmchar- Spanish – English        40.42
Catalan -ibmbigram- Spanish – English      40.43
English – Spanish -word- Galician          20.55
English – Spanish -wfst1:1- Galician       19.75
English – Spanish -wfst2:2- Galician       19.99
English – Spanish -ibmchar- Galician       20.86
English – Spanish -ibmbigram- Galician     21.12
Galician -word- Spanish – English          13.16
Galician -wfst1:1- Spanish – English       15.78
Galician -wfst2:2- Spanish – English       16.16
Galician -ibmchar- Spanish – English       15.59
Galician -ibmbigram- Spanish – English     16.04

Table 3: Pivot translations with various character alignment approaches for Catalan / Galician – English via Spanish.
Model                                       BLEU
English – Maced. (baseline)                11.04
English – Bosnian -word- Maced.             7.33
English – Bosnian -wfst1:1- Maced.          8.44
English – Bosnian -wfst2:2- Maced.          7.59
English – Bosnian -ibmchar- Maced.          8.89
English – Bosnian -ibmbigram- Maced.        9.99
English – Bulgarian -word- Maced.          12.49
English – Bulgarian -wfst1:1- Maced.       10.67
English – Bulgarian -wfst2:2- Maced.       10.54
English – Bulgarian -ibmchar- Maced.       10.72
English – Bulgarian -ibmbigram- Maced.     11.57
Maced. – English (baseline)                20.24
Maced. -word- Bosnian – English            12.36
Maced. -wfst1:1- Bosnian – English         17.48
Maced. -wfst2:2- Bosnian – English         16.05
Maced. -ibmchar- Bosnian – English         18.84
Maced. -ibmbigram- Bosnian – English       18.73
Maced. -word- Bulgarian – English          19.62
Maced. -wfst1:1- Bulgarian – English       20.58
Maced. -wfst2:2- Bulgarian – English       20.14
Maced. -ibmchar- Bulgarian – English       21.16
Maced. -ibmbigram- Bulgarian – English     21.05

Table 4: Pivot translations with various character alignment approaches for Macedonian – English via Bosnian / Bulgarian.
Example: Macedonian – English (via Bulgarian/Bosnian)

Reference:       "How many women can look like a goddess in a bakery uniform?
Baseline:        How many women look like божици пекарска in an uniform?
Pivotword (bg):  How many women in пекарска божици uniform?
Pivotword (bs):  How many women look like божици пекарска униформа in?
Pivotchar (bg):  How many women look like goddesses, sit down here in the uniform?
Pivotchar (bs):  How would look like in a uniform božici pekarsku?

3 More Examples

Here are some more example translations comparing baseline translations and pivot-based translations. These examples are mainly selected to show the impact of unknown words and the ability of character-based pivot models to recover an acceptable translation. This is, of course, not always the case, as we can see in the example above with Bosnian as a pivot.
Example: Maced. – English (via Bulg./Bosn.)

Reference:       It's a simple matter of self-preservation.
Baseline:        It's simply a question of себесочувување.
Pivotword (bs):  It's a simple question, себесочувување.
Pivotword (bg):  That's a matter of себесочувување.
Pivotchar (bs):  It's just a question of stranojavanje.
Pivotchar (bg):  It's just a question of yourself.

Reference:       Wow, your friend's so cynical.
Baseline:        Your friend is very цинична.
Pivotword (bs):  Приjателката цинична you very much.
Pivotword (bg):  Your girlfriend's very cynical.
Pivotchar (bs):  Your friend's so cynical.
Pivotchar (bg):  Your girlfriend's very cynical.

Example: Galician – English (via Spanish)

Reference:  The harvest is our challenge!
Pivotword:  The colleita's our challenge.
Pivotchar:  The harvest is our challenge.

Reference:  When did you leave the barracks?
Pivotword:  When deixou you the headquarter?
Pivotchar:  When you leave the barracks?

Reference:  I promise I will return it to you.
Pivotword:  Cha devolverei I promise you that.
Pivotchar:  I promise I'il bring it back.

Reference:  The pope is chained to the capital's moneybag.
Pivotword:  The Pope is encadeado moedieiro the capital.
Pivotchar:  The Pope is chainedto muediera the capital.
Example: Catalan – English (via Spanish)

Reference:  She just went through a breakup.
Baseline:   Just come in for a ruptura
Pivotword:  He just went through a breakup.
Pivotchar:  He just went through a breakup.

Reference:  And when that happens, you have to try to say no.
Baseline:   And when that succeeixi you mean, you don't.
Pivotword:  And when I say no, you must try succeeixi.
Pivotchar:  And when that happens, you try to say no.
4 Character-Level Phrases
Translating characters seems like a very primitive operation. However, the paradigm of phrase-based SMT, with its flexible definition of phrase pairs of variable sizes, makes it possible for such a character-level model to cover common word and even phrase correspondences. A few examples are shown in Table 5.
Norwegian    Swedish
akkurat      precis
akkurat      s ä k e r t
greit !      okej .
greit .      k ö r t i l l
greit        o k e j d å
sjon .       tion .

Table 5: Examples from a character-level phrase table (without scores).
The largest irregular differences between closely related languages can often be found among function words and frequent, usually short expressions. Those expressions can easily be covered with pairs of character N-grams. Table 5 shows, for example, how character-level phrase tables can cover different translations (including multi-word expressions) of the Norwegian "greit" (literally translated: "great"), which is often used as an affirmative expression. At the same time, a character-level translation model can include regular differences between languages, such as the suffix "-sjon" in Norwegian, which in most cases corresponds to the Swedish suffix "-tion" (the " " indicates that these strings appear at the end of a token).
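
As a toy illustration (not the actual decoder), the sketch below shows how applying such a character-level phrase pair monotonically captures the regular -sjon/-tion correspondence. The underscore boundary marker and the two phrase entries are assumptions made for illustration only.

# Toy, greedy, monotone application of character-level phrase pairs;
# the real system uses a full phrase-based decoder with scores.

CHAR_PHRASES = {
    "s j o n _": "t i o n _",    # Norwegian -sjon  ->  Swedish -tion
    "g r e i t _": "o k e j _",  # one possible translation of "greit"
}

def apply_char_phrases(char_sentence):
    """Rewrite known character n-gram phrases in a segmented sentence."""
    out = char_sentence + " "    # pad so suffix phrases can match at the end
    for src, trg in CHAR_PHRASES.items():
        out = out.replace(src + " ", trg + " ")
    return out.strip()

# apply_char_phrases("s t a s j o n _")  ->  "s t a t i o n _"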
A clear drawback of a character-level model is the increased complexity of decoding. The search graph grows because the length of a sentence increases with its segmentation into characters. Furthermore, the phrase table is typically much larger because we extract longer character N-grams. Finally, there will also be many alternative translations for most character sequences, which leads to a further explosion of the search space. However, decoding is still manageable, as we observed in our experiments, and further optimizations are certainly possible.
References

Andreas Stolcke. 2002. SRILM - An Extensible Language Modeling Toolkit. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP 2002), pages 901–904, Denver, CO, USA.