Character-Based Pivot Translation for Under-Resourced Languages and Domains – Supplementary Material

Jörg Tiedemann
Department of Linguistics and Philology
Uppsala University, Uppsala/Sweden
[email protected]

1 Basic Setup – Data and Models

The subtitle corpora provided by OPUS differ for each language pair. There is therefore no common test set for all language pairs, and results are not comparable across languages. Some care had to be taken when preparing the data sets for our experiments. We selected the data in the following way:

• Every sentence pair in a test set and in a development set is unique.
• Test sets, development sets and training sets are completely disjoint.
• Monolingual data for language modeling do not contain any of the sentences in the test and development sets.
• Training and development data for the pivot models do not contain sentences that appear in the test set of the source-to-target model.
• The subtitle corpus is quite noisy. We therefore added some extra filters to select proper test and development data (see the sketch at the end of Section 1.2):
  – Maximum sentence length: 100
  – Maximum length of a single token: 100 characters
  – Ratio between the number of words on one side and on the other side ≤ 4
  – Ratio between the length of the source language string and the target language string ≤ 2
• Training data is cleaned with the standard tools provided by Moses; maximum sentence length: 100.

1.1 Word-Based SMT Models

All translation models used in the paper follow the standard setup of phrase-based SMT with slightly different parameters. Word-based models use the following general setup:

• Translation models and language models use lowercased data.
• Language models are trained on the entire monolingual material available from the subtitle corpus in OPUS, except for the sentences included in the development and test sets; no other data sets are used as background language models.
• Language models are interpolated 5-gram models with Witten-Bell discounting (Kneser-Ney smoothing caused problems with small data sets), trained with the SRILM toolkit (Stolcke, 2002).
• Maximum phrase length: 7
• Lexicalized reordering, distortion limit = 6
• BLEU scores are computed on lowercased data (even though we have trained recasing models for all data sets; the tendencies are the same).
• For Norwegian to English we used a compound splitter for the input texts (and also for Danish, Swedish and German when used as pivot languages). The compound splitters are trained with the tools provided by Moses.

Table 1 lists the monolingual data used for language modeling.

language     # sent's   # words
Galician     9k         46k
Catalan      138k       1M
Macedonian   772k       4M
Bosnian      5M         26M
Bulgarian    37M        199M
Spanish      79M        462M
English      175M       1,056M

Table 1: Size of monolingual training data in number of sentences and words.

1.2 Character-Based SMT Models

Character-based models differ in the following ways:

• Language models are 10-gram models (with the same smoothing parameters).
• Maximum phrase length (length of character N-grams) = 10
• No reordering model (distortion limit = 0)
• Maximum sentence length: 100 characters

The training data for the character-level models is smaller than that of the word-based models due to the restriction to 100 characters, but the differences are small (Table 2).

language pair           word-based   char-based
Catalan-Spanish         64,129       61,605
Galician-Spanish        2,027        1,947
Macedonian-Bulgarian    155,505      151,965
Macedonian-Bosnian      12,625       12,365

Table 2: Size of training data in number of sentence pairs after cleaning.
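To make the preprocessing concrete, the following Python fragment sketches (a) the heuristic filters listed in Section 1 and (b) the conversion between word-level and character-level representations. It is only an illustration, not the scripts used for the experiments; in particular, the word-boundary symbol "_" and the interpretation of the length limits are illustrative assumptions.

BOUNDARY = "_"   # illustrative word-boundary symbol, not necessarily the one used in the experiments

def keep_pair(src: str, trg: str) -> bool:
    """Apply the selection heuristics from Section 1 to one sentence pair."""
    src_tokens, trg_tokens = src.split(), trg.split()
    if not src_tokens or not trg_tokens:
        return False
    if len(src_tokens) > 100 or len(trg_tokens) > 100:       # maximum sentence length (here: tokens)
        return False
    if max(len(t) for t in src_tokens + trg_tokens) > 100:   # maximum token length in characters
        return False
    word_ratio = max(len(src_tokens), len(trg_tokens)) / min(len(src_tokens), len(trg_tokens))
    if word_ratio > 4:                                        # word-count ratio <= 4
        return False
    char_ratio = max(len(src), len(trg)) / max(1, min(len(src), len(trg)))
    if char_ratio > 2:                                        # string-length ratio <= 2
        return False
    return True

def to_char_level(sentence: str) -> str:
    """Mark word boundaries and separate every character by a space."""
    return " ".join(sentence.strip().replace(" ", BOUNDARY))

def to_word_level(char_sentence: str) -> str:
    """Invert to_char_level(): remove separating spaces and restore word boundaries."""
    return char_sentence.replace(" ", "").replace(BOUNDARY, " ")

if __name__ == "__main__":
    print(keep_pair("greit !", "okej då ."))   # True
    print(to_char_level("greit !"))            # g r e i t _ !
    print(to_word_level("g r e i t _ !"))      # greit !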
2 Pivot Translations

For N-best rescoring we used the top N unique translation hypotheses from the intermediate translation step, selected from the top-100 translations returned by the decoder. In some cases there were fewer unique translation alternatives. This is especially the case for character-based models, for which many different segmentations of a sentence lead to the same translation. A minimal sketch of this selection step is given at the end of this section.

Tables 3 and 4 list some additional results from pivot translations using other character-alignment approaches. We report only one-best pivot translations in this case. The differences between the alignment models are rather small. Character alignment using word alignment models seems to be quite robust and leads to better performance in the pivot translation task.

Model                                      BLEU
English – Catalan (baseline)               26.70
English – Spanish -word- Catalan           38.91
English – Spanish -wfst1:1- Catalan        44.32
English – Spanish -wfst2:2- Catalan        41.43
English – Spanish -ibmchar- Catalan        44.46
English – Spanish -ibmbigram- Catalan      44.46
Catalan – English (baseline)               27.86
Catalan -word- Spanish – English           38.41
Catalan -wfst1:1- Spanish – English        39.97
Catalan -wfst2:2- Spanish – English        39.30
Catalan -ibmchar- Spanish – English        40.42
Catalan -ibmbigram- Spanish – English      40.43
English – Spanish -word- Galician          20.55
English – Spanish -wfst1:1- Galician       19.75
English – Spanish -wfst2:2- Galician       19.99
English – Spanish -ibmchar- Galician       20.86
English – Spanish -ibmbigram- Galician     21.12
Galician -word- Spanish – English          13.16
Galician -wfst1:1- Spanish – English       15.78
Galician -wfst2:2- Spanish – English       16.16
Galician -ibmchar- Spanish – English       15.59
Galician -ibmbigram- Spanish – English     16.04

Table 3: Pivot translations with various character alignment approaches for Catalan / Galician – English via Spanish.

Model                                      BLEU
English – Maced. (baseline)                11.04
English – Bosnian -word- Maced.             7.33
English – Bosnian -wfst1:1- Maced.          8.44
English – Bosnian -wfst2:2- Maced.          7.59
English – Bosnian -ibmchar- Maced.          8.89
English – Bosnian -ibmbigram- Maced.        9.99
English – Bulgarian -word- Maced.          12.49
English – Bulgarian -wfst1:1- Maced.       10.67
English – Bulgarian -wfst2:2- Maced.       10.54
English – Bulgarian -ibmchar- Maced.       10.72
English – Bulgarian -ibmbigram- Maced.     11.57
Maced. – English (baseline)                20.24
Maced. -word- Bosnian – English            12.36
Maced. -wfst1:1- Bosnian – English         17.48
Maced. -wfst2:2- Bosnian – English         16.05
Maced. -ibmchar- Bosnian – English         18.84
Maced. -ibmbigram- Bosnian – English       18.73
Maced. -word- Bulgarian – English          19.62
Maced. -wfst1:1- Bulgarian – English       20.58
Maced. -wfst2:2- Bulgarian – English       20.14
Maced. -ibmchar- Bulgarian – English       21.16
Maced. -ibmbigram- Bulgarian – English     21.05

Table 4: Pivot translations with various character alignment approaches for Macedonian – English via Bosnian / Bulgarian.

Example: Macedonian – English (via Bulgarian/Bosnian)

Reference:      "How many women can look like a goddess in a bakery uniform?
Baseline:       How many women look like божици пекарска in an uniform?
Pivotword (bg): How many women in пекарска божици uniform?
Pivotword (bs): How many women look like божици пекарска униформа in?
Pivotchar (bg): How many women look like goddesses, sit down here in the uniform?
Pivotchar (bs): How would look like in a uniform božici pekarsku?
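The selection of unique hypotheses for rescoring can be sketched as follows. The sketch assumes n-best lists in the common Moses format "id ||| hypothesis ||| feature scores ||| total score"; it is an illustration rather than the exact implementation used in the experiments, and for character-level output the hypotheses would additionally be converted back to word level before comparison.

from collections import OrderedDict, defaultdict

def unique_nbest(nbest_lines, n=10):
    """Return {sentence id: first n unique hypothesis strings} from Moses-style n-best lines."""
    unique = defaultdict(OrderedDict)   # sentence id -> ordered set of hypothesis strings
    for line in nbest_lines:
        fields = [f.strip() for f in line.split("|||")]
        if len(fields) < 2:
            continue
        sent_id, hyp = fields[0], fields[1]
        if hyp not in unique[sent_id] and len(unique[sent_id]) < n:
            unique[sent_id][hyp] = True
    return {sid: list(hyps) for sid, hyps in unique.items()}

if __name__ == "__main__":
    example = [
        "0 ||| how many women ||| ... ||| -2.3",
        "0 ||| how many women ||| ... ||| -2.4",   # same string from a different segmentation
        "0 ||| how many woman ||| ... ||| -2.9",
    ]
    print(unique_nbest(example, n=10))   # {'0': ['how many women', 'how many woman']}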
3 More Examples

Here are some more example translations comparing baseline translations and pivot-based translations. These examples are mainly selected to show the impact of unknown words and the ability of character-based pivot models to recover an acceptable translation. This is, of course, not always the case, as we can see in the example above with Bosnian as a pivot.

Example: Macedonian – English (via Bulgarian/Bosnian)

Reference:      It's a simple matter of self-preservation.
Baseline:       It's a simple question, себесочувување.
Pivotword (bs): That's a matter of себесочувување.
Pivotword (bg): It's simply a question of себесочувување.
Pivotchar (bs): It's just a question of stranojavanje.
Pivotchar (bg): It's just a question of yourself.

Reference:      Wow, your friend's so cynical.
Baseline:       Your friend is very цинична.
Pivotword (bs): Приjателката цинична you very much.
Pivotword (bg): Your girlfriend's very cynical.
Pivotchar (bs): Your friend's so cynical.
Pivotchar (bg): Your girlfriend's very cynical.

Example: Galician – English (via Spanish)

Reference:  The harvest is our challenge!
Pivotword:  The colleita's our challenge.
Pivotchar:  The harvest is our challenge.

Reference:  When did you leave the barracks?
Pivotword:  When deixou you the headquarter?
Pivotchar:  When you leave the barracks?

Reference:  I promise I will return it to you.
Pivotword:  Cha devolverei I promise you that.
Pivotchar:  I promise I'll bring it back.

Reference:  The pope is chained to the capital's moneybag.
Pivotword:  The Pope is encadeado moedieiro the capital.
Pivotchar:  The Pope is chainedto muediera the capital.

Example: Catalan – English (via Spanish)

Reference:  She just went through a breakup.
Baseline:   Just come in for a ruptura
Pivotword:  He just went through a breakup.
Pivotchar:  He just went through a breakup.

Reference:  And when that happens, you have to try to say no.
Baseline:   And when that succeeixi you mean, you don't.
Pivotword:  And when I say no, you must try succeeixi.
Pivotchar:  And when that happens, you try to say no.

4 Character-Level Phrases

Translating characters seems like a very primitive operation. However, the paradigm of phrase-based SMT, with its flexible definition of phrase pairs of variable size, makes it possible for such a character-level model to cover common word and even phrase correspondences. A few examples are shown in Table 5.

Norwegian    Swedish
akkurat      precis
akkurat      s ä k e r t
greit !      okej .
greit .      k ö r t i l l
greit        o k e j d å
sjon .       tion .

Table 5: Examples from a character-level phrase table (without scores).

The largest irregular differences between closely related languages can often be found among function words and frequent, usually short expressions. Those expressions can easily be covered by pairs of character N-grams. Table 5 shows, for example, how a character-level phrase table can cover different translations (including multi-word expressions) of the Norwegian "greit" (literally translated: "great"), which is often used as an affirmative expression. At the same time, a character-level translation model can include regular differences between languages such as the suffix "-sjon" in Norwegian, which in most cases corresponds to the Swedish suffix "-tion" (a word-boundary symbol in the phrase table indicates that these strings appear at the end of a token).

A clear drawback of a character-level model is the increased complexity of decoding. The search graph grows because a sentence becomes much longer when segmented into characters. Furthermore, the phrase table is typically much larger, as we extract longer character N-grams. Finally, there will also be many alternative translations for most character sequences, which leads to a further explosion of the search space. However, decoding is still manageable, as we experienced in our experiments, and further optimizations are certainly possible.
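To illustrate how character-level phrase pairs like those in Table 5 generalize, the following toy example applies a tiny hand-made table with greedy longest-match, monotone substitution. This is not the Moses decoder and the table entries are invented for illustration; a real phrase table contains millions of scored character N-gram pairs, and the decoder searches over many competing segmentations instead of taking the longest match.

# Toy character-level "translation" by greedy longest-match over character n-grams.
TOY_PHRASE_TABLE = {
    "stasjon": "station",   # memorised full word
    "sjon": "tion",         # regular suffix correspondence
    "na": "na",
    "informa": "informa",
}

def greedy_translate(word: str, table=TOY_PHRASE_TABLE, max_len=10) -> str:
    """Translate one word by monotone, greedy longest-match over character n-grams."""
    out, i = [], 0
    while i < len(word):
        for length in range(min(max_len, len(word) - i), 0, -1):
            segment = word[i:i + length]
            if segment in table:
                out.append(table[segment])
                i += length
                break
        else:
            out.append(word[i])   # copy unknown characters unchanged
            i += 1
    return "".join(out)

if __name__ == "__main__":
    # "nasjon" and "informasjon" are not in the table as whole words, but the
    # suffix pair sjon -> tion still produces the expected Swedish-looking forms.
    for w in ["stasjon", "nasjon", "informasjon"]:
        print(w, "->", greedy_translate(w))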
References

Andreas Stolcke. 2002. SRILM – An Extensible Language Modeling Toolkit. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP 2002), pages 901–904, Denver, CO, USA.