Integrating Statistical Machine Translation into the Translation Curriculum
Dr Dorothy Kenny, Dublin City University, Ireland

Technologies Translators Use: Translation Memory

Translation Memory as a Source of Data
'[Microsoft's] largest available natural resource is the nearly two decades of product documentation that has been localized into an increasing range of languages and preserved as translation memories.' (Joscelyne 2009)

SMT systems
- Language Weaver
- Microsoft Translator

Quick Interlude: avoiding cyberhype
Keeping things in perspective?
- CSA 2012: post-editing MT accounts for 2.47% of revenues (or US$828.02 million) for language service providers
- CSA 2012: 38.63% of the 1,119 respondents provide post-editing services
- Optimale 2011: MT skills among new recruits are not very important for employers of translators across Europe
- Lafeber (in progress): governmental organisations do not require new recruits to have any great expertise in TM/MT

Statistical MT
'Existing translations contain more solutions to more translation problems than any other existing resource.' (Isabelle 1992)
- training: the system 'learns' a translation model from existing source texts and their human translations, and 'learns' a target language model from a corpus of target-language text
- tuning: the system needs to be 'tuned' to get the balance between the translation model and the language model right
- decoding: the system selects the most probable translation on the basis of what is suggested by the above models (among other things)

The TRAINING phase in SMT
- i.e. the phase during which the translation model and the language model are 'learned' or 'induced' from parallel and target-language corpora

So what's learned?
- single-word alignments, e.g. liste -> list
- the probability of each pairing of source and target words, e.g.:
  das -> the 0.7
  das -> that 0.15
  das -> which 0.075
- possibly also 'phrase' alignments and their probabilities, e.g.:
  das Haus ist -> the house is 0.7
  das Haus ist -> the house was 0.2

Translation modelled generatively
- the translation of sentences is broken down into the translation of words (lexical models, or unigrams) or 'phrases' (n-grams)
- the probability of a candidate translation of a sentence is broken down into the product of the probabilities of the translations of each word (unigram), each string of two words (bigram), each string of three words (trigram), etc.

N-grams
Example sentence: Quand le chat n'est pas là, les souris dansent (tokenised as: quand le chat ne est pas là les souris dansent)
- Unigrams: quand | le | chat | ne | est | pas | là | les | souris | dansent
- Bigrams: quand le | le chat | chat ne | ne est | est pas | pas là | là les | les souris | souris dansent
- Trigrams: quand le chat | le chat ne | chat ne est | ne est pas | est pas là | pas là les | là les souris | les souris dansent
N-grams are also used in the target language model. (A short extraction sketch follows the Log-linear model slide below.)

Noisy Channel Model
- integrates a language model p(t) with the translation model p(s|t)
- Bayes' rule gives: argmax_t p(t|s) = argmax_t p(s|t) p(t)
- The most probable translation t of a source sentence s is the one for which the probability of that source sentence given that target sentence, multiplied by the probability of that target sentence, is highest.

Log-linear model
- integrates the translation model, the language model and any other 'features' considered relevant
- allows features to be weighted (e.g. so that the translation model is more important than the language model)
- allows features to be added easily (e.g. so that translation can be based on stems rather than full-form tokens, etc.) (a toy scoring sketch follows below)
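To make the n-gram idea concrete, here is a minimal Python sketch that extracts the unigrams, bigrams and trigrams listed above from the tokenised example sentence. The whitespace tokenisation (with n'est already split into ne est) is an assumption for illustration; real SMT pipelines use dedicated tokenisers.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Naive whitespace tokenisation of the example sentence
tokens = "quand le chat ne est pas là les souris dansent".split()

print(ngrams(tokens, 1))  # unigrams: ('quand',), ('le',), ...
print(ngrams(tokens, 2))  # bigrams:  ('quand', 'le'), ('le', 'chat'), ...
print(ngrams(tokens, 3))  # trigrams: ('quand', 'le', 'chat'), ...
```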
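The noisy-channel and log-linear scoring described above can likewise be illustrated with a toy sketch. The probabilities, smoothing floor and feature weights below are all invented for illustration; a real decoder such as Moses searches over phrase segmentations and reorderings rather than scoring single words.

```python
import math

# Toy noisy-channel scoring for a single word, in log space:
#   score(t) = log p(s|t) + log p(t)
# The log-linear variant simply adds tunable weights to each feature.
# All probabilities below are invented for illustration only.

translation_model = {            # p(s|t), based on the 'das' slide above
    ("das", "the"): 0.7,
    ("das", "that"): 0.15,
    ("das", "which"): 0.075,
}
language_model = {"the": 0.05, "that": 0.02, "which": 0.01}  # p(t), invented

LAMBDA_TM, LAMBDA_LM = 1.0, 1.0   # log-linear feature weights (set by tuning)

def score(source, target):
    tm = translation_model.get((source, target), 1e-9)  # smoothing floor
    lm = language_model.get(target, 1e-9)
    return LAMBDA_TM * math.log(tm) + LAMBDA_LM * math.log(lm)

candidates = ["the", "that", "which"]
best = max(candidates, key=lambda t: score("das", t))
print(best)  # -> 'the': highest weighted p(s|t) * p(t)
```

Working in log space turns the product of probabilities into a sum, which avoids floating-point underflow on long sentences and makes the log-linear weighting a simple weighted sum.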
SMT in a nutshell
- Training: source texts + human translations -> Translation Model; texts in the target language -> Language Model
- Tuning: gets the balance between the models right
- Run-time: new SL texts to translate -> Decoder -> 'best' translation

What is your reaction to the following statements?
1. Machine Translation is for computer scientists. Translators have their own tool. It's called Translation Memory.
2. Translators don't need to know how SMT works. Being able to use Google Translate is enough. Translators just need to keep using it, so that it improves.
3. Statistical Machine Translation will turn most translators into post-editors one day.
(statements based on Koehn 2010, Robinson 2012, Pym 2012)

Approach at Dublin City University
- Translators are involved before, during and after run-time in SMT workflows
- Empowerment comes with technical know-how and lots of data
- We needed to design a syllabus where students can really 'do' SMT:
  - judge whether the SMT results are adequate/useful
  - intervene in the SMT workflow to improve results
  - advise others on whether SMT will be useful in their workflows

Translation Technology at Dublin City University
- 12-week module for taught postgraduates (MA in Translation Studies; MSc in Translation Technology)
- Lectures on:
  - Translation Memory
  - Machine Translation: history, general approaches, RBMT & SMT
  - SMT: basic architecture, statistical models (noisy channel, log-linear), training, tuning, decoding, post-editing...
  - Evaluation of MT: human evaluation, automatic evaluation
  - Preparing data for SMT
  - Post-editing SMT output

SMT Practical Work
- Source data to train an SMT system, e.g. DGT Translation Memory (22 languages, TMX format)
- Profile the data: languages, domain, character encoding, any irrelevant data, number of one-to-many translations, etc. (a TMX profiling sketch appears after the final slide)
- Upload the data to the SMT system
- Train the SMT system
- Test the newly created system with new input
- Evaluate the output, using human and automatic metrics (a toy metric sketch also appears after the final slide)
- Devise a strategy to improve system performance: pre-edit input files? create and integrate a glossary? post-edit?
- Apply input-based strategies and test the system again

Challenges for translators and translator trainers
- conceptual challenges
- technical challenges
- ethical considerations

Conceptual Challenges
'The idea of generating target sentences by translating words and phrases from the source sentence in a random order using a model containing many nonsensical translations may not seem plausible. In fact, the methods used are not intended (in our opinion, at least) to be either linguistically or cognitively plausible.' (Hearne and Way 2011)

Technical Challenges
- From corporations with massive computing power to do-it-yourself SMT
- Moses: open-source software for building SMT engines
- Even software developers find it difficult to implement Moses (cf. Do Moses Yourself™)

Translators using MT
- DIY SMT for freelance translators: 'Do Moses Yourself' (free, open-source SMT software)
- SMT 'in the cloud':
  - SmartMATE (ALS, now Capita, UK)
  - KantanMT.com (XCELERATOR MT Ltd, Ireland)
  - Microsoft Translator Hub ('Translation by Everyone for Everyone')

SMT in the cloud: SmartMATE
SMT in the cloud: KantanMT

Ethical Challenges
- Using your own SMT in the cloud largely overcomes issues of:
  - confidentiality (despite 'anonymisation')
  - permission to use others' work
  - attribution (see Drugan and Babych 2010)
- and helps assuage concerns about:
  - quality assurance (see Karamanis et al. 2011)
  - deskilling (see Kenny 2011; Doherty, Kenny and Way 2012)

Homework: go to www.kantanmt.com and ...

Questions?
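For the data-profiling step in the practical work above, a minimal Python sketch like the following can give a first overview of a TMX file such as an extract from the DGT Translation Memory. The filename dgt_sample.tmx is hypothetical, and the sketch only counts translation units and segments per language; profiling domain, character encoding or one-to-many translations would need further checks.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# TMX 1.4 marks language variants with xml:lang; ElementTree exposes
# that attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def profile_tmx(path):
    """Count translation units and segments per language in a TMX file."""
    tree = ET.parse(path)
    langs = Counter()
    tu_count = 0
    for tu in tree.iter("tu"):          # each <tu> is one translation unit
        tu_count += 1
        for tuv in tu.iter("tuv"):      # each <tuv> is one language variant
            # older TMX files may use a plain 'lang' attribute instead
            lang = tuv.get(XML_LANG) or tuv.get("lang") or "unknown"
            langs[lang] += 1
    print(f"{tu_count} translation units")
    for lang, n in langs.most_common():
        print(f"  {lang}: {n} segments")

profile_tmx("dgt_sample.tmx")  # hypothetical filename
```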
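For the evaluation step, the following toy sketch computes a BLEU-style score (the geometric mean of modified n-gram precisions with a brevity penalty) for a single sentence pair. This is for building intuition about automatic metrics only; real evaluations use standard implementations over whole test sets, and the smoothing constant below is an invented convenience.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Counter of n-grams (as tuples) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Toy single-sentence BLEU-style score in [0, 1]."""
    hyp, ref = hypothesis.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngram_counts(hyp, n), ngram_counts(ref, n)
        # clipped (modified) n-gram precision
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        log_prec += math.log(max(overlap, 1e-9) / total) / max_n
    # brevity penalty: punish hypotheses shorter than the reference
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return bp * math.exp(log_prec)

print(bleu("the house is small", "the house is small"))   # 1.0
print(bleu("the house was small", "the house is small"))  # < 1.0
```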
Special thanks to Dr Stephen Doherty (DCU), Prof Andy Way (formerly DCU, now Lingo24), the KantanMT.com team