Integrating Statistical Machine Translation
into the Translation Curriculum
Dr Dorothy Kenny
Dublin City University
Ireland
Technologies Translators Use: Translation Memory
Translation Memory as a Source of Data
‘[Microsoft’s] largest available natural resource is the nearly two decades of product documentation that has been localized into an increasing range of languages and preserved as translation memories.’ (Joscelyne 2009)
SMT systems
• Language Weaver
• Microsoft Translator
Quick Interlude: avoiding cyberhype
Keeping things in perspective?
• CSA 2012: post-editing MT accounts for 2.47% of revenues (or US$828.02 million) for language service providers
• CSA 2012: 38.63% of the 1,119 respondents provide post-editing services
• Optimale 2011: MT skills among new recruits not very important for employers of translators across Europe
• Lafeber (in progress): new recruits not required to have any great expertise in TM/MT by governmental organisations
Statistical MT
Existing translations contain more solutions to more translation problems than any other existing resource. (Isabelle 1992)
Training
• ‘learns’ a translation model from existing source texts and their human translations
• ‘learns’ a target language model from a corpus of target language text

Tuning
• needs to be ‘tuned’ to get the balance between the translation model and the language model right

Decoding
• selects the most probable translation on the basis of what is suggested by the above models (among other things)
The TRAINING phase in SMT
i.e. the phase during which the translation model and the language model
are ‘learned’ or ‘induced’ from parallel and TL corpora
So what’s learned?
• single word alignments
  • liste -> list
• probability of each pairing of source and target words
  • das -> the (0.7)
  • das -> that (0.15)
  • das -> which (0.075)
• possibly also “phrase” alignments [and their probabilities]
  • das Haus ist -> the house is (0.7)
  • das Haus ist -> the house was (0.2)
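These learned pairings amount to a probabilistic dictionary. A minimal sketch in Python, using the toy probabilities from the list above (the data structure is purely illustrative, not Moses’s actual phrase-table format):

```python
# Toy translation model: source words/phrases map to candidate
# translations with probabilities (values taken from the slide above).
translation_model = {
    "das": {"the": 0.7, "that": 0.15, "which": 0.075},
    "das Haus ist": {"the house is": 0.7, "the house was": 0.2},
}

def best_translation(source):
    """Return the most probable translation of a source word/phrase."""
    candidates = translation_model.get(source, {})
    return max(candidates, key=candidates.get) if candidates else None

print(best_translation("das"))           # -> the
print(best_translation("das Haus ist"))  # -> the house is
```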
Translation modelled generatively
The translation of sentences is broken down into the translation of words (lexical models, or unigrams) or “phrases” (n-grams). The probability of a candidate translation of a sentence is broken down into the product of the probabilities of the translations of:
• each word (unigram), or
• each string of two words (bigram), or
• each string of three words (trigram)... etc.
N-grams
Example sentence: Quand le chat n’est pas là les souris dansent (‘When the cat’s away, the mice will play’)
• Unigrams: quand, le, chat, ne, est, pas, là, les, souris, dansent
• Bigrams: quand le, le chat, chat ne, ne est, est pas, pas là, là les, les souris, souris dansent
• Trigrams: quand le chat, le chat ne, chat ne est, ne est pas, est pas là, pas là les, là les souris, les souris dansent
(Here n’est is tokenised as ne + est.)
N-grams are also used in the target language model.
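Extracting these n-grams is mechanical. A short sketch, assuming simple whitespace tokenisation with n’est split into ne + est as in the lists above:

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "quand le chat ne est pas là les souris dansent".split()

print(ngrams(tokens, 1))  # unigrams: ['quand', 'le', 'chat', ...]
print(ngrams(tokens, 2))  # bigrams: ['quand le', 'le chat', ...]
print(ngrams(tokens, 3))  # trigrams: ['quand le chat', ...]
```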
Noisy Channel Model
• integrates a language model p(t) with the translation model p(s|t)
• Bayes’ rule gives:

argmax_t p(t|s) = argmax_t p(s|t) p(t)

The most probable translation t of a source sentence s is the one for which the probability of that source sentence given that target sentence, multiplied by the probability of that target sentence, is highest.
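A toy decoder makes the argmax concrete. All probabilities below are invented for illustration, and a real decoder searches a vast space of candidates rather than a hand-picked list:

```python
# Noisy-channel decoding over a toy candidate list: choose the target
# sentence t that maximises p(s|t) * p(t). Probabilities are invented.
def decode(source, candidates, tm, lm):
    """tm[(source, t)] = p(source|t); lm[t] = p(t)."""
    return max(candidates,
               key=lambda t: tm.get((source, t), 0.0) * lm.get(t, 0.0))

tm = {("das Haus ist klein", "the house is small"): 0.6,
      ("das Haus ist klein", "the house is little"): 0.7}
lm = {"the house is small": 0.2,
      "the house is little": 0.1}

print(decode("das Haus ist klein",
             ["the house is small", "the house is little"], tm, lm))
# 'the house is small' wins (0.6 * 0.2 = 0.12 beats 0.7 * 0.1 = 0.07):
# the language model outvotes the translation model's slight preference.
# This balance between the two models is exactly what tuning adjusts.
```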
Log-linear model
• integrates the translation model and the language model and any other “features” considered relevant
• allows features to be weighted (e.g. so that the translation model is more important than the language model)
• allows features to be added easily (e.g. so that translation can be based on stems rather than full-form tokens, etc.)
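A sketch of log-linear scoring with two features and invented weights; with both weights at 1.0 it reduces (in log space) to the noisy-channel product above:

```python
import math

# Log-linear score: sum_i lambda_i * h_i(s, t), where each feature h_i
# is typically a log probability. Values and weights are invented.
def score(features, weights):
    return sum(weights[name] * value for name, value in features.items())

features = {"translation_model": math.log(0.6),  # log p(s|t)
            "language_model": math.log(0.2)}     # log p(t)
weights = {"translation_model": 1.0,
           "language_model": 0.8}  # tuning sets these weights

print(score(features, weights))  # higher is better; compare across candidates
```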
SMT in a nutshell
[Diagram: source texts + human translations train the Translation Model; texts in the target language train the Language Model; Tuning balances the two; the Decoder uses them to turn new SL texts into the ‘best’ translation.]
What is your reaction to the following statements?
1. Machine Translation is for computer scientists. Translators have their own tool. It’s called Translation Memory.
2. Translators don’t need to know how SMT works. Being able to use Google Translate is enough. Translators just need to keep using it, so that it improves.
3. Statistical Machine Translation will turn most translators into post-editors one day.
(statements based on Koehn 2010, Robinson 2012, Pym 2012)
Approach at Dublin City University
• Translators are involved before, during and after run-time in SMT workflows
• Empowerment comes with technical know-how and lots of data
• We needed to design a syllabus where students can
  • really ‘do’ SMT
  • judge whether the SMT results are adequate/useful
  • intervene in the SMT workflow to improve results
  • advise others on whether SMT will be useful in their workflows
Translation Technology at Dublin City University
12-week module for taught postgraduates (MA in Translation Studies; MSc in Translation Technology)

Lectures on:
• Translation Memory
• Machine Translation
  • history, general approaches, RBMT & SMT
• SMT
  • basic architecture, statistical models (noisy channel, log-linear), training, tuning, decoding, post-editing...
• Evaluation of MT
  • human evaluation, automatic evaluation (a BLEU-style sketch follows this list)
• Preparing data for SMT
• Post-editing SMT output
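As a taste of automatic evaluation, here is a naively smoothed, single-reference BLEU-style score; real evaluations use established implementations, and the smoothing here exists only to avoid log(0) on short toy sentences:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU against a single reference."""
    c, r = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = Counter(ngrams(c, n)), Counter(ngrams(r, n))
        overlap = sum(min(count, ref[g]) for g, count in cand.items())
        precisions.append(max(overlap, 1e-9) / max(sum(cand.values()), 1))
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(c) >= len(r) else math.exp(1 - len(r) / max(len(c), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(bleu("the house is small", "the house is small"))  # 1.0
print(bleu("the house is big", "the house is small"))    # < 1.0
```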
SMT Practical Work
• Source data to train an SMT system
  • e.g. DGT Translation Memory (22 languages, TMX format)
• Profile the data (a profiling sketch follows this list)
  • languages
  • domain
  • character encoding
  • any irrelevant data
  • number of one-to-many translations, etc.
• Upload data to the SMT system
• Train the SMT system
• Test the newly created system with new input
• Evaluate the output (using human and automatic metrics)
• Devise a strategy to improve system performance
  • Pre-edit input files? Create and integrate a glossary? Post-edit?
  • Apply input-based strategies and test the system again
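A minimal sketch of the profiling step over a TMX file. The element names (tu, tuv, seg) and the xml:lang attribute follow the standard TMX format; the file name and the assumption of lowercase ‘en’ language codes are hypothetical:

```python
import xml.etree.ElementTree as ET
from collections import Counter

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def profile_tmx(path, src_lang="en"):
    """Count languages, translation units and one-to-many source segments."""
    languages = Counter()
    targets = {}   # source segment -> set of distinct target segments
    tu_count = 0
    for tu in ET.parse(path).iter("tu"):
        tu_count += 1
        segs = {}
        for tuv in tu.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if lang and seg is not None:
                languages[lang] += 1
                segs[lang] = "".join(seg.itertext()).strip()
        if src_lang in segs:
            targets.setdefault(segs[src_lang], set()).update(
                s for l, s in segs.items() if l != src_lang)
    one_to_many = sum(1 for t in targets.values() if len(t) > 1)
    print(f"{tu_count} translation units; languages seen: {dict(languages)}")
    print(f"{one_to_many} source segments with more than one target")

profile_tmx("dgt_sample.tmx")  # hypothetical file name
```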
Challenges for translators and translator trainers
• conceptual challenges
• technical challenges
• ethical considerations
Conceptual Challenges
“The idea of generating target sentences by translating words and phrases from the source sentence in a random order using a model containing many nonsensical translations may not seem plausible. In fact, the methods used are not intended (in our opinion, at least) to be either linguistically or cognitively plausible.” (Hearne and Way 2011)
Technical Challenges
• From corporations with massive computing power to Do-It-Yourself SMT
• Moses
  • open-source software for building SMT engines
  • even software developers find it difficult to implement Moses
  • cf. Do Moses Yourself™
Translators using MT
DIY SMT: freelance translators “Do Moses Yourself” (free, open-source SMT software)

SMT ‘in the cloud’
• SmartMATE (ALS, now Capita, UK)
• KantanMT.com (XCELERATOR MT Ltd, Ireland)
• Microsoft Translator Hub: ‘Translation by Everyone for Everyone’
SMT in the cloud: SmartMATE
SMT in the cloud: KantanMT
Ethical Challenges
Using your own SMT in the cloud largely overcomes issues of:
• confidentiality (despite ‘anonymisation’)
• permission to use others’ work
• attribution
(see Drugan and Babych 2010)

It also helps assuage concerns about:
• quality assurance (see Karamanis et al. 2011)
• deskilling (see Kenny 2011; Doherty, Kenny and Way 2012)
Homework: go to www.kantanmt.com and …
Questions?
Special thanks to Dr Stephen Doherty (DCU), Prof Andy Way (formerly DCU, now Lingo24), and the KantanMT.com team.