FP6-IST-003768
METIS-II
Statistical Machine Translation using Monolingual Corpora:
From Concept to Implementation
Specific Targeted Research Project or Innovation Projects
Future and Emergent Technologies
D4.4 Post-processing & Post-editing
Due date of deliverable: 31.7.2006
Actual submission date: 13.9.2006
Start date of project: 1.10.2004
Duration: 3 years
KULeuven
Final version
Project co-funded by the European Commission within the Sixth Framework Programme
(2002-2006)
Dissemination Level
PU  Public  X
PP  Restricted to other programme participants (including the Commission Services)
RE  Restricted to a group specified by the consortium (including the Commission Services)
CO  Confidential, only for members of the consortium (including the Commission Services)
1. Introduction
2. Post-processing
   2.1 Token Generation
      2.1.1 Issues and methods
         2.1.1.1 ILSP
         2.1.1.2 KUL
         2.1.1.3 GFAI
         2.1.1.4 FUPF
      2.1.2 Token generator
         2.1.2.1 Extension of the BNC tag set
         2.1.2.2 Retagging the BNC
         2.1.2.3 The lemmatiser
         2.1.2.4 Token Generation
3. Post-editing
   3.1 Introduction
   3.2 Language-specific issues
      3.2.1 Modern Greek to English
      3.2.2 Dutch to English
      3.2.3 German to English
      3.2.4 Spanish to English
   3.3 Interface
      3.3.1 Implementation concepts of the post-editing module
      3.3.2 The web interface
4. Conclusion
ANNEX
BIBLIOGRAPHY
1. Introduction
This deliverable describes the rewriting process that takes place after the translation
procedure returns one (or more) preliminary translation(s), that is, target-language
strings in lemmatised form. At this stage, the translation is not yet complete:
1. lemmas have to be replaced by the correct token forms, taking into account any agreement phenomena;
2. some reordering may need to be done.
In this rewriting process, we distinguish two main parts: post-processing and post-editing.
Post-processing is the part that can be done without human interference. The set of
modules designed to do this is collectively called the post-processor. In the METIS
context, post-processing is mainly about token generation.
Post-editing is the correction done by a human translator (usually the end user of the
system) and consists of ad hoc corrections that cannot be made in an automated, generic way. The results of the changes made by the post-editor can be used as extra
input to the translation system. One possibility is to use this parallel corpus of corrections as input for an automated post-processing module that tries to transfer some
of the post-editing issues to an earlier stage.
According to the user requirements defined in D2.1, users find a post-editing facility useful. Quoting the relevant section from D2.1: “The most frequent actions in post-editing were moving words or phrases from one place of the sentence to another and
replacing wrong words or phrases. The least frequent action is adding phrases. It is
often the case that whole sentences have to be retranslated. Most of the respondents
find the post-editing tasks interesting. There are no specific tools used for post-editing.”
2. Post-processing
Post-processing is the automatic rewriting of the lemmatised target-language sentences generated by the translation engine of the MT system. In METIS-II, post-processing is a modular system, called the post-processor.
Most of the work carried out in this phase concerns token generation, given that the
output of the translation procedure thus far consists of lemmata only. This process includes generating the correct number, person, tense, case, and degree of comparison of the token, as well as the correct capitalisation. In the future, small modules
may try to correct a few generic translation errors in an automated way.
2.1 Token Generation
2.1.1 Issues and methods
2.1.1.1 ILSP
For the token generation procedure to work, a certain amount of morphosyntactic information is necessary. Below we describe the process by which this information is collected and used.
Mapping of morphological features
The grammar of the target language dictates which grammatical information of
the source language is retained for generation purposes. For the Modern Greek → English pair, the following information is retained (below, a Prolog-like
notation is used to indicate identical value assignment to parameters; Modern Greek is
given on the left side of the '→').
Vb(Nu,Pers,Tense,Voice,...) → Vb(Nu,Pers,Tense,Voice)
At(Nu,Gen,Case,...) → Ø
Noun(Main/Common,Nu,Gen,Case,...) → Noun(Main/Common,Nu)
Pronoun(Nu,Gen,Case,...) → Noun(Nu,Gen,Case)
Adj(Nu,Gen,Case,Degree,...) → Adj(Degree)
Adv(Degree,...) → Adv(Degree)
In the default case, what comes from the source-language tagger can be used as such,
unless there is overriding information from the lexicon. For instance, the
Greek plural noun «μαλλιά» maps onto 'hair', which is always singular, and the Greek
verb «έρχομαι» (come), which is a passive-voice verb (actually a deponent one), certainly does not yield a passive-voice verb in English.
This mapping takes place at lexicon reading time. Otherwise, one would need to carry
this information up to the generation point and perform a second check against the lexicon for mismatches, which is more time-consuming. Furthermore, some of this information (e.g. information about pro-drop) is used by the core translation procedure.
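The feature mapping at lexicon reading time can be sketched as follows. This is a minimal illustration, not the actual METIS-II code: the function and table names (map_features, RETAINED, LEXICON_OVERRIDES) are invented, and only a few rows are shown.

```python
# A minimal sketch of the feature-mapping step described above. All names
# are illustrative, not part of the actual METIS-II implementation.

# Features retained per POS when mapping Greek analyses to English
RETAINED = {
    "Vb":   ["Nu", "Pers", "Tense", "Voice"],
    "Noun": ["Type", "Nu"],          # Type = Main/Common
    "Adj":  ["Degree"],
    "Adv":  ["Degree"],
    "At":   [],                      # articles carry nothing over
}

# Lemma-specific overrides applied at lexicon reading time
LEXICON_OVERRIDES = {
    "hair": {"Nu": "Sg"},            # Greek plural «μαλλιά» -> always singular
    "come": {"Voice": "Av"},         # deponent «έρχομαι» -> active in English
}

def map_features(pos, features, en_lemma):
    """Keep only the features relevant for English generation, then apply
    any lemma-specific override from the lexicon."""
    kept = {f: v for f, v in features.items() if f in RETAINED.get(pos, [])}
    kept.update(LEXICON_OVERRIDES.get(en_lemma, {}))
    return kept

print(map_features("Noun", {"Type": "Common", "Nu": "Pl", "Gen": "Nt", "Case": "Nom"}, "hair"))
# {'Type': 'Common', 'Nu': 'Sg'}
```

Performing the override at reading time, as the text argues, avoids a second lexicon check at the generation point.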
GENERATION
Major phenomena that have to be treated at generation time are:
A. Subject–verb agreement
B. Verbal tense
C. Capitalisation
A. Subject–verb agreement is based on the head of the subject NP. This noun has to
be spotted, but this information is made available by the core translation engine. The
number and person features of this word control agreement in the case of a simple (not complex) NP. This approach is problematic in the following cases:
(i) Pro-drop
(ii) Coordination: number and person
(iii) Disjunction: number and person
Pro-drop: Modern Greek is a pro-drop language, which implies that often no overt
phrase functioning as the subject is present in the source-language sentence. Of
course, the verb carries information about number and person. Given that English is
the target language, an overt pronoun must be generated to fill the subject slot in the
sentence. This pronoun must bear the agreement features defined by the source-language verb.
Since the general strategy is to read off agreement features from the subject at token
generation time, the following method is applied: before entering the core translation
engine (which uses corpus evidence to compile the translations), the tagged and
lemmatised source language string is augmented with a noun chunk marked as NP1
(which yields subjects) containing only a POS tag indicator (PN) and a TOKEN GENERATION LABEL (TGL) which copies the number and person features from the verb. At
token generation time, the respective personal pronoun is generated. Modern Greek
does not mark verbal endings for grammatical gender; therefore the subjects of third-person
singular verbs remain underspecified in this respect. The generator returns a
disjunction of pronouns that match the information available (he/she/it). Resolution of
this contextually resolvable ambiguity is postponed until post-editing time, when the
user simply deletes the unwanted words.
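The pronoun-generation step for the PN placeholder can be sketched as follows. The table and function names are hypothetical; the point is only that the verb's number and person select the pronoun, and that the gender-underspecified third-person singular comes back as a disjunction for the post-editor.

```python
# An illustrative sketch of the pro-drop handling described above: the PN
# placeholder's TGL carries the verb's number and person, and the token
# generator turns it into a pronoun, returning a disjunction when the Greek
# verbal ending leaves gender underspecified. The table is hypothetical.

PRONOUNS = {
    ("Sg", "01"): ["I"],
    ("Sg", "02"): ["you"],
    ("Sg", "03"): ["he", "she", "it"],   # gender not recoverable from the verb
    ("Pl", "01"): ["we"],
    ("Pl", "02"): ["you"],
    ("Pl", "03"): ["they"],
}

def generate_subject_pronoun(number, person):
    """Return the pronoun(s) matching the verb's agreement features;
    ambiguous cases come back as a disjunction for the post-editor."""
    return "/".join(PRONOUNS[(number, person)])

print(generate_subject_pronoun("Sg", "03"))   # he/she/it
print(generate_subject_pronoun("Pl", "01"))   # we
```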
Subject–verb agreement with complex NPs as subjects (NPs with coordinate and disjunctive structures) is controlled by a number of relatively complex rules in Modern
Greek, where person is encoded in the verbal morphology. English does not encode person in the verbal morphology, and this simplifies things a lot. The correspondences
for translation purposes (Modern Greek → English) are given in the two following tables:
COORDINATION
Greek coordinated NP (Greek verbal tag) → English coordinate NP (English verbal tag):
Άντρες, γυναίκες, παιδιά έτρεχαν (VbMnIdPa03PlXxIpAvXx) → Men, women, children ran (VVD)
Εγώ κι ο Πέτρος φεύγουμε (VbMnIdPr01PlXxPeAvXx) → Peter and I leave (VVB)
Εγώ και συ θα βρούμε μια λύση. (PtFu+VbMnIdXx01PlXxPeAvXx) → You and I will find a solution (VM0 + VVI)
Ο Πέτρος με τον Παύλο παίζουν. (VbMnIdPr03PlXxIpAvXx) → Peter plays with Paul (VVZ)
DISJUNCTION
Greek disjoint NPs (Greek verbal tag) → English disjoint NPs (English verbal tag):
Θα πάω εγώ ή ο Γιάννης. (PtFu+VbMnIdXx01SgXxPeAvXx) → Either John or I will go (VM0 + VVI)
B. Tense correspondence is one-to-many in the general case and can mostly be judged on a contextual basis, unless some adverbial in the sentence forces a particular
tense selection. We can ask the post-processing module to generate all possible tenses (unless some clue selects only one tense) and then opt for the right one at post-editing time.
A table with some tense correspondences follows.
ACTIVE VOICE
Greek form (Greek tag) → English form (English tag):

Present
Πιάνω (VbMnIdPr01SgXxIpAvXx) → I catch (VVB/VVZ)
Πιάνω (VbMnIdPr01SgXxIpAvXx) → I am catching (VBB/VBZ + VVG)
Πιάνω (VbMnIdPr01SgXxIpAvXx) → I have been catching (VHB/VHZ + VBN + VVG)
να πιάνω (PtSj--VbMnIdXx01SgXxIpAvXx) → to catch (TO0 + VVI)
να πιάνω (PtSj--VbMnIdXx01SgXxIpAvXx) → to be catching (TO0 + VBI + VVG)
να πιάσω (PtSj--VbMnIdXx01SgXxPeAvXx) → to catch (TO0 + VVI)
ας πιάνω (PtOt--VbMnIdPr01SgXxIpAvXx) → let smbd catch (let + ??? + VVI)
ας πιάσω (PtOt--VbMnIdXx01SgXxPeAvXx) → let smbd catch (let + ??? + VVI)
Πιάνε (VbMnMpPr03PlXxIpAvXx) → catch (VVB)
πιάνοντας (VbMnPpXxXxXxXxIpAvXx) → catching (NN1-VVG)

Future
θα πιάνω (PtFu--VbMnIdPr01SgXxIpAvXx) → will be catching ([will\VM0] + VBI + VVG)
θα πιάνω (PtFu--VbMnIdPr01SgXxIpAvXx) → will have been catching ([will\VM0] + VHI + VBN + VVG)
θα πιάνω (PtFu--VbMnIdPr01SgXxIpAvXx) → will catch ([will\VM0] + VVI)
θα πιάσω (PtFu--VbMnIdXx01SgXxPeAvXx) → will catch ([will\VM0] + VVI)
θα πιάσω (PtFu--VbMnIdXx01SgXxPeAvXx) → I am catching (present tense used as future) (VBB/VBZ + VVG)
θα πιάσω (PtFu--VbMnIdXx01SgXxPeAvXx) → I am going to catch (VBB/VBZ/VBD + going + TO0 + VVI)
να πιάσω (PtSj--VbMnIdXx01SgXxPeAvXx) → to catch (TO0 + VVI)
να πιάνω (PtSj--VbMnIdXx01SgXxIpAvXx) → to be catching (TO0 + VBI + VVG)
ας πιάσω (PtOt--VbMnIdXx01SgXxPeAvXx) → let smbd catch (let + ??? + VVI)
ας πιάνω (PtOt--VbMnIdXx01SgXxIpAvXx) → let smbd be catching (let + ??? + VBI + VVG)
Past
Έπιανα (VbMnIdPa01SgXxIpAvXx) → I was catching (VBD + VVG)
Έπιανα (VbMnIdPa01SgXxIpAvXx) → I used to catch ([used\VVD] + TO0 + VVI)
Έπιανα (VbMnIdPa01SgXxIpAvXx) → I caught (VVD)
Έπιανα (VbMnIdPa01SgXxIpAvXx) → I would catch ([would\VM0] + VVI)
να έπιανα (PtSj--VbMnIdPa01SgXxIpAvXx) → ? if only I caught (CJS + AV0 + VVD)
ας έπιανα (PtOt--VbMnIdPa01SgXxIpAvXx) → ? wish I caught
θα έπιανα (PtFu--VbMnIdPa01SgXxIpAvXx) → I would catch ([would\VM0] + VVI)
Έπιασα (VbMnIdPa01SgXxPeAvXx) → I caught (VVD)
Πιάσε (VbMnMpXx02SgXxPeAvXx) → catch (VVB)
έχω πιάσει (VbMnIdPr01SgXxIpAvXx VbMnNfXxXxXxXxPeAvXx) → I have caught (VHB/VHZ + VVN)
να έχω πιάσει (PtSj--VbMnIdPr01SgXxIpAvXx VbMnNfXxXxXxXxPeAvXx) → to have caught (TO0 + VHI + VVN)
θα έχω πιάσει (PtFu--VbMnIdPr01SgXxIpAvXx VbMnNfXxXxXxXxPeAvXx) → I will have caught ([will\VM0] + VHI + VVN)
ας έχω πιάσει (PtOt--VbMnIdPr01SgXxIpAvXx VbMnNfXxXxXxXxPeAvXx) → wish I had caught
είχα πιάσει (VbMnIdPa01SgXxIpAvXx VbMnNfXxXxXxXxPeAvXx) → I had caught (VHD + VVN)
να είχα πιάσει (PtSj--VbMnIdPa01SgXxIpAvXx VbMnNfXxXxXxXxPeAvXx) → if only I had caught
ας είχα πιάσει (PtOt--VbMnIdPa01SgXxIpAvXx VbMnNfXxXxXxXxPeAvXx) → wish I had caught
PASSIVE VOICE
Greek form (Greek tag) → English form (English tag):

Present
Κρύβομαι (VbMnIdPr01SgXxIpPvXx) → I hide (VVB/VVZ)
Πιάνομαι (VbMnIdPr01SgXxIpPvXx) → I am caught (VBB/VBZ + VVN)
Πιάνομαι (VbMnIdPr01SgXxIpPvXx) → I am being caught (VBB/VBZ + VBG + VVN)
Πιάνομαι (VbMnIdPr01SgXxIpPvXx) → I have been caught (VHB/VHZ + VBN + VVN)
Να πιάνομαι (PtSj-VbMnIdPr01SgXxIpPvXx) → to be caught (TO0 + VBI + VVN)
Ας πιάνομαι (PtOt-VbMnIdPr01SgXxIpPvXx) → let smbdy be caught
Ας πιασθώ (PtOt-VbMnIdPr01SgXxIpPvXx) → let smbdy get caught
πιάσου (VbMnMpPr02PlXxPePvXx) → get caught ([get\VVB/VVZ] + VVN)
πιάσου (VbMnMpPr02PlXxPePvXx) → must be caught ([must\VM0] + VBI + VVN)
πιάσου (VbMnMpPr02PlXxPePvXx) → have to be caught (VHB/VHZ + TO0 + VBI + VVN)
πιάσου (VbMnMpPr02PlXxPePvXx) → ought to be caught ([ought\VM0] + TO0 + VBI + VVN)

Future
θα πιάνομαι (PtFu-VbMnIdPr01SgXxIpPvXx) → will be getting caught ([will\VM0] + VBI + [getting\VVG] + VVN)
θα κρύβομαι (PtFu-VbMnIdPr01SgXxIpPvXx) → will be hiding ([will\VM0] + VBI + VVG)
θα κρύβομαι (PtFu-VbMnIdPr01SgXxIpPvXx) → I am going to be hiding (VBB/VBZ + [going\VVG] + TO0 + VBI + VVG)
θα πιαστώ (PtFu-VbMnIdXx01SgXxPePvXx) → will get caught ([will\VM0] + [get\VVI] + VVN)
θα πιαστώ (PtFu-VbMnIdXx01SgXxPePvXx) → will be caught ([will\VM0] + VBI + VVN)
Να πιαστώ (PtSj-VbMnIdXx01SgXxPePvXx) → to get caught (TO0 + [get\VVI] + VVN)
Να πιαστώ (PtSj-VbMnIdXx01SgXxPePvXx) → to be caught (TO0 + VBI + VVN)
Ας πιαστώ (PtOt-VbMnIdXx01SgXxPePvXx) → let smbdy get caught
Past
Πιανόμουν (VbMnIdPa01SgXxIpPvXx) → I was being caught (VBD + VBG + VVN)
Πιανόμουν (VbMnIdPa01SgXxIpPvXx) → I used to be caught ([used\VVD] + TO0 + VBI + VVN)
Πιανόμουν (VbMnIdPa01SgXxIpPvXx) → I would be caught ([would\VM0] + VBI + VVN)
κρυβόμουν (VbMnIdPa01SgXxIpPvXx) → I was hiding (VBD + VVG)
Να πιανόμουν (PtSj-VbMnIdPr01SgXxIpPvXx) → if only I were caught
Ας πιανόμουν (PtOt-VbMnIdPr01SgXxIpPvXx) → wish I were caught
Θα πιανόμουν (PtFu-VbMnIdPr01SgXxIpPvXx) → I would be caught ([would\VM0] + VBI + VVN)
πιάστηκα (VbMnIdPa01SgXxPePvXx) → I was caught (VBD + VVN)
Κρύφτηκα (VbMnIdPa01SgXxPePvXx) → I hid (VVD)
Έχω πιαστεί (VbMnIdPr01SgXxIpAvXx (VbMnNfXxXxXxXxPePvXx) VbMnIdXx03SgXxPePvXx) → I have been caught (VHB/VHZ + VBN + VVN)
Να έχω πιαστεί (PtSj-VbMnIdPr01SgXxIpAvXx (VbMnNfXxXxXxXxPePvXx) VbMnIdXx03SgXxPePvXx) → to have been caught (TO0 + VHI + VBN + VVN)
Θα έχω πιαστεί (PtFu-VbMnIdPr01SgXxIpAvXx (VbMnNfXxXxXxXxPePvXx) VbMnIdXx03SgXxPePvXx) → I will have been caught ([will\VM0] + VHI + VBN + VVN)
Ας έχω πιαστεί (PtOt-VbMnIdPr01SgXxIpAvXx (VbMnNfXxXxXxXxPePvXx) VbMnIdXx03SgXxPePvXx) → wish I had been caught
Είχα πιαστεί (VbMnIdPa01SgXxIpAvXx (VbMnNfXxXxXxXxPePvXx) VbMnIdXx03SgXxPePvXx) → I had been caught (VHD + VBN + VVN)
Να είχα πιαστεί (PtSj-VbMnIdPa01SgXxIpAvXx (VbMnNfXxXxXxXxPePvXx) VbMnIdXx03SgXxPePvXx) → if only I had been caught
Ας είχα πιαστεί (PtOt-VbMnIdPa01SgXxIpAvXx (VbMnNfXxXxXxXxPePvXx) VbMnIdXx03SgXxPePvXx) → wish I had been caught
C. Capitalisation is performed by the post-processing module on those nouns which
are indicated as such by the source-language tagging. The respective tag-mapping rule is given below:
NoPr → NP0
2.1.1.2 KUL
Morphological analysis in Dutch is done by the TnT tagger1 trained on the CGN tagset2. This tagset is based on the morphosyntactic forms of Dutch. We use the option that retains not only the best (most probable) tag, but also the alternative tags with a
lower probability. These are combined into several source-language
analysis alternatives, weighted by the product of the tag probabilities of their
elements. These alternatives go through the rest of the translation process, as they
can result in different lemmatisation and chunking and, of course, in different translations. The Dutch lemmatiser is based on the PoS tags assigned by the tagger. It uses
the CGN lexicon to find the correct lemma for a token. For certain tokens, the lemmatisation process generates more than one lemma. Using the tags as extra information
reduces the ambiguity substantially.
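The weighting of analysis alternatives can be sketched as follows; the function name and the toy probabilities are illustrative, not taken from the TnT setup itself.

```python
# A small sketch of how the weighted source-language analysis alternatives
# described above could be enumerated: each alternative's weight is the
# product of the per-token tag probabilities. Names are illustrative.
from itertools import product

def weighted_analyses(token_tag_probs):
    """token_tag_probs: one {tag: probability} dict per token.
    Returns all tag-sequence alternatives with their weights, best first."""
    analyses = []
    for combo in product(*(d.items() for d in token_tag_probs)):
        tags = [tag for tag, _ in combo]
        weight = 1.0
        for _, p in combo:
            weight *= p
        analyses.append((tags, weight))
    return sorted(analyses, key=lambda a: -a[1])

# Toy example with two tokens, each carrying one alternative tag
alts = weighted_analyses([{"N": 0.8, "WW": 0.2}, {"WW": 0.6, "ADJ": 0.4}])
print(alts[0])   # best analysis: tags ['N', 'WW'] with weight ~0.48
```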
At the target-language side, we use the BNC processed in a similar way. Lemmatisation is done by the lemmatiser described in deliverable 3.3 and Carl et al. (2005).
Since the lemmatiser algorithm is reversible, it can be used as a token generator.
The token generator needs a lemma and a tag in order to recreate the correct token.
Of course, the lemmas are provided by the bilingual dictionary. However, a lot of tag
information is also contained in this dictionary: it contains the correct part of speech
of the English words and, for fixed expressions, the full tag. For example, the Dutch word
aanhang is translated as the English supporters, which is always plural. In the dictionary, we have the entry:

aanhang  N  supporter  NN2

Hence, the translated lemma is 'supporter', but the NN2 tag indicates that it always has to
be plural.
The other tag information is drawn from the set of tag-mapping rules. This set contains the rules that map Dutch indications of number, person, tense, case, and degree of comparison to their English equivalents.
Tag mapping is done according to the following table.
1 See Brants (2000).
2 Tag set developed for the Spoken Dutch Corpus (CGN) (Van Eynde 2004).
N(soort,ev,stan|dat) → NN0|NN1
N(soort,ev,gen) → NN0+POS|NN1+POS
N(soort,mv) → NN2
N(eigen,ev,stan|dat) → NP0
N(eigen,gen) → NP0+POS
N(eigen,mv) → NP0
ADJ(prenom|nom|postnom,basis|dim) → AJ0
ADJ(prenom|nom|postnom,comp) → AJC
ADJ(prenom|nom|postnom,sup) → AJS
ADJ(vrij,basis|dim) → AJ0|AV0
ADJ(vrij,comp,zonder) → AJC|AV0
ADJ(vrij,sup,zonder) → AJS|AV0
WW(pv,tgw,ev) → VBB|VBZ|VDB|VHB|VM0|VVB|VDB+VVI
WW(pv,tgw,mv) → VBB|VDB|VHB|VM0|VVB|VDB+VVI|VBB|VVG
WW(pv,tgw,met-t) → VBB|VDB|VDZ|VHB|VHZ|VM0|VVB|VVZ|VDB+VVI|VDZ+VVI|VVB|VVG
WW(pv,verl,ev) → VBD|VDD|VHD|VM0|VVD|VDD+VVI|VBD+VVG|VBZ+VVG
WW(pv,verl,mv) → VBD|VDD|VHD|VM0|VVD|VDD+VVI|VBD+VVG
WW(pv,verl,met-t) → VBD|VDD|VHD|VM0|VVD|VDD+VVI|VBD+VVG
WW(pv,conj,ev) → VBB|VBZ|VDB|VDZ|VHB|VHZ|VM0|VV0|VVZ|VD0+VVI|VDZ+VVI|VBB+VVG|VBZ+VVG
WW(inf) → TO0+VBI|TO0+VDI|TO0+VHI|TO0+VVI
WW(od) → VBG|VDG|VHG|VVG
WW(vd) → VBN|VDN|VHN|VVN
TW(hoofd) → CRD
TW(rang) → ORD
VNW(pers|pr|refl|recip) → PNP|PNX
VNW(bez) → DPS
VNW(vrag|betr|vb|excl) → PNQ|DTQ
VNW(aanw) → DT0|PNI
VNW(onbep) → PNI|AT0|EX0
LID() → AT0
VZ() → PRF|PRP|TO0
BW() → AV0|AVP|AVQ|XX0
VG(neven) → CJC
VG(onder) → CJS|CJT
TSW() → ITJ
LET() → PUL|PUN|PUQ|PUR
SPEC(vreemd) → UNC
SPEC(deeleigen) → NP0
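The interplay between the dictionary and the tag-mapping rules can be sketched as follows. This is a hedged illustration: the helper names are invented, only two mapping rows are reproduced, and the entry for 'woord' is a hypothetical example.

```python
# Hedged sketch: a dictionary entry with a full English tag (as for
# 'aanhang' in the text) overrides the tag-mapping rules; otherwise the
# Dutch tag is mapped. Only two illustrative mapping rows are included,
# and the 'woord' entry is invented for the example.

TAG_MAP = {
    "N(soort,ev,stan|dat)": "NN0|NN1",
    "N(soort,mv)": "NN2",
}

DICTIONARY = {
    "aanhang": ("supporter", "NN2"),   # full tag fixed: always plural
    "woord":   ("word", None),         # no override: use the tag mapping
}

def target_lemma_and_tag(dutch_lemma, dutch_tag):
    """Return the English lemma and tag (alternatives) for token generation."""
    en_lemma, fixed_tag = DICTIONARY[dutch_lemma]
    if fixed_tag is not None:
        return en_lemma, fixed_tag
    return en_lemma, TAG_MAP[dutch_tag]

print(target_lemma_and_tag("aanhang", "N(soort,ev,stan|dat)"))  # ('supporter', 'NN2')
print(target_lemma_and_tag("woord", "N(soort,mv)"))             # ('word', 'NN2')
```

The resulting lemma/tag pair is exactly what the reversible lemmatiser needs to generate the surface token.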
Furthermore, there are some additional rules that constitute the expander. The expander is a piece of software that can reorder the words or chunks in the proposed
translations, and insert or delete words. It can generate extra translations or substitute the ones that came out of the combination of the dictionary and the tag mapping.
Currently, the CCL expander deals with the following phenomena:
1. The different parts of verb clusters are put together in one bag. In Dutch, the different parts of compound tenses can be separated by direct and indirect objects,
prepositional phrases and even whole sub-clauses. The past participles and their
auxiliaries are put into one bag in order to retrieve the corresponding BNC bags
from the target-language corpus.
2. The literal translation of om in the om te + infinitive construction is deleted,
since it remains untranslated. Again, the word om can be separated from the
remainder of the infinitival phrase by several constituents.
3. In Dutch, the usual form of the active compound tenses is formed with the appropriate tense of the verb hebben and the past participle. However, some intransitive verbs (esp. verbs of motion) use zijn as auxiliary in these
tenses. For transitive verbs, zijn is used to form the passive voice of these compound tenses. Since in English the combination of to be and a past participle
is used to translate the Dutch worden + past participle, we rewrite the
literal translations 'to be + past participle' to 'to have + past participle' and 'to
have been + past participle'. In order not to confuse these with the passive of
the non-compound tenses, we only introduce get and become as translations of
worden. After the former rule has fired, we substitute these verbs, if they are followed by a past participle, with the appropriate form of to be.
4. The expander assigns the correct tags in order to translate the combination
of a verb followed by the adverb graag properly, namely into to like to followed by a verb. We do this using the dictionary information3 and the fact that the tense of the
original Dutch verb has to be mapped onto the tense of to like, while the translation of the original verb gets an infinitive tag. The word order is also switched to
get correct English.
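The substitution in point 3 can be sketched as follows. This is a hypothetical illustration: the data format is invented, and the check is simplified to an immediately following past participle, whereas the real rule has to cope with intervening material.

```python
# A hypothetical sketch of expander rule 3 above: 'get'/'become' (introduced
# as translations of 'worden') followed by a past participle are replaced by
# the lemma 'be', keeping the tense-bearing tag; the token generator later
# realises e.g. be/VVZ as 'is'. The (lemma, tag) format is illustrative.

def rewrite_worden(units):
    """units: list of (lemma, BNC tag) pairs."""
    out = []
    for i, (lemma, tag) in enumerate(units):
        next_tag = units[i + 1][1] if i + 1 < len(units) else None
        if lemma in ("get", "become") and next_tag == "VVN":
            out.append(("be", tag))      # substitute the appropriate 'to be'
        else:
            out.append((lemma, tag))
    return out

print(rewrite_worden([("the", "AT0"), ("letter", "NN1"),
                      ("get", "VVZ"), ("send", "VVN")]))
# [('the', 'AT0'), ('letter', 'NN1'), ('be', 'VVZ'), ('send', 'VVN')]
```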
Capitalisation is based on the lexicon. When the target-language word in the dictionary has a capital, it always remains capitalised. In addition, tokens at the beginning of a sentence are also capitalised. The process is done by a simple rule-based module. Another simple rule-based module tries to correct the indefinite article a/an when the search engine makes a wrong choice.
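Both rule-based modules are simple enough to sketch; all names here are illustrative. The article fixer approximates the a/an choice by the first letter of the next word, whereas the real choice depends on pronunciation.

```python
# Simple rule-based sketches of the two modules just described; all names
# are illustrative, not the actual KUL code.

def capitalise(tokens, cap_lexicon):
    """cap_lexicon maps lowercase forms to their dictionary capitalisation;
    the sentence-initial token is capitalised as well."""
    out = [cap_lexicon.get(t, t) for t in tokens]
    if out:
        out[0] = out[0][0].upper() + out[0][1:]
    return out

def fix_articles(tokens):
    """Rewrite a/an from the first letter of the following word (a rough
    approximation of the actual pronunciation-based rule)."""
    out = list(tokens)
    for i in range(len(out) - 1):
        if out[i].lower() in ("a", "an"):
            out[i] = "an" if out[i + 1][0].lower() in "aeiou" else "a"
    return out

print(capitalise(["peter", "visits", "london"], {"peter": "Peter", "london": "London"}))
# ['Peter', 'visits', 'London']
print(fix_articles(["a", "apple", "and", "an", "pear"]))
# ['an', 'apple', 'and', 'a', 'pear']
```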
3 The dictionary entry is:

graag  BW  like~to#<VVI> VV?!InP[TO0#VVI]

where # indicates consecutive separate tokens, ~ indicates tokens which are not necessarily consecutive, ? indicates underspecification of the word's features (here person, number and tense), ! indicates the head of the TL chunk, and <VVI> indicates that an expander rule needs to be triggered that places the original main verb in the <VVI> slot and transfers the feature information from that main verb to the feature information of like.
2.1.1.3 GFAI
As outlined in section 2.1.2, the token generator needs a lemma and a tag to produce
surface words. This information is gathered from different resources. Most important
is the transfer dictionary which contains English lemmas together with their BNC tags.
Another source of information is the source language analysis and a third resource is
the so called expander. We will briefly describe these different resources with the view
of how they are used to generate and integrate information for token generation.
As we have already outlined in deliverable 3.3, the English sides of the GFAI bilingual
dictionary is pre-processed in such a way that every entry resembles a flat tree. The
leaves of the tree are BNC-compatible lemmas and tags, while the mother node contains information concerning the entry as a whole.
Thus, there are several entries in the German-English dictionary for "schnell":
{de=schnell,mde={c=adj},en=fast,men={c=adj}}.
{de=schnell,mde={c=adj},en=rapid,men={c=adj}}.
{de=schnell,mde={c=adj},en=quick,men={c=adj}}.
These entries will be tagged and lemmatised as described in deliverable 3.3 so that
the English sides yield translation options in the following shape:
{c=adj} @ {c=AJ0,lu=fast}.
{c=adj} @ {c=AJ0,lu=quick}.
{c=adj} @ {c=AJ0,lu=rapid}.
The feature bundle preceding the @ contains mother information from the lexicon,
while the feature bundle(s) following the @ are the leaf nodes with BNC-compatible information. In terms of METIS UDF, the information of the mother node corresponds to
special information of translation options, while the leaf nodes correspond to token
translations. When translating a sentence such as

(1) Hans hat die schnellsten Wagen (Hans owns the quickest cars),
the information from the SL parser and the dictionary is merged into a representation
similar to METIS' UDF, as shown below. A special part of the translation units carries
over information from the source-language analysis, while translation options contain
information from the dictionary. Thus, 'Hans' is tagged in the SL as a noun; it is possibly
the subject of the sentence (marked with the feature mark=subj) and it is singular.
The second word in the sentence, 'haben', is the finite verb. The remaining three
words 'die schnellsten Wagen' are part of an accusative (or dative) non-subject chunk.
Note that the three translation options for "schnell" are also reproduced below.
D4.4 Post-processing & Post-editing
15
{snr=1}
@ {ori=Hans,c=noun,wnrr=1,mother={chunk=1,mark=subj,per=3,nb=sg}}
@ {c=noun} @ {c=NP0,lu=hans}.
,
{ori=haben,c=verb,wnrr=2,mother={chunk=1,mark=vg_fiv,tns=pres,per=3,nb=sg}}
@ {c=verb} @ {c=VHB;VHD;VHI;VHN;VHZ,lu=have}.
, {c=verb} @ {c=VVB;VVD;VVI;VVN;VVZ,lu=bear}.
, {c=verb} @ {c=VHB;VHD;VHI;VHN;VHZ,lu=have},{c=VVN;VVD;VVI,lu=get}
,
{ori=d_art,c=w,wnrr=3,sc=art,mother={chunk=3;4;5,mark=nosubj,case=acc;dat,nb=plu}}
@ {c=art} @ {c=AT0,lu=the}.
.
,
{ori=schnell,c=adj,wnrr=4,deg=sup,mother={chunk=3;4;5,mark=nosubj,case=acc;dat,nb=plu}}
@ {c=adj} @ {c=AJ0,lu=fast}.
, {c=adj} @ {c=AJ0,lu=quick}.
, {c=adj} @ {c=AJ0,lu=rapid}.
.
,
{ori=wagen,c=noun,wnrr=5,mother={chunk=3;4;5,mark=nosubj,case=acc;dat,nb=plu}}
@ {c=noun} @ {c=NN1,lu=car}.
, {c=noun} @ {c=NN1,lu=carriage}.
, {c=noun} @ {c=VVG;AJ0;NN1,lu=charging},{c=NN1,lu=vehicle}.
, {c=noun} @ {c=NN1,lu=coach}.
, {c=noun} @ {c=NN1,lu=wain}.
, {c=noun} @ {c=NN1,lu=trolley}.
, {c=noun} @ {c=NN1,lu=vehicle}
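Reading the flat dictionary entries shown earlier into lemma/category records can be sketched with a small parser. This is illustrative only, handles just the simple flat shape of the examples, and is not the actual GFAI pre-processing code.

```python
# A rough, illustrative parser for the flat dictionary entries shown
# earlier ({de=...,mde={c=...},en=...,men={c=...}}); it handles only this
# simple shape and is not the actual GFAI pre-processing code.
import re

def parse_entry(entry):
    """Extract the German lemma, the English lemma and the English category."""
    fields = dict(re.findall(r"(\w+)=(\{[^}]*\}|[^,{}]+)", entry))
    return {
        "de": fields["de"],
        "en": fields["en"],
        "cat": re.search(r"c=(\w+)", fields["men"]).group(1),
    }

print(parse_entry("{de=schnell,mde={c=adj},en=fast,men={c=adj}}"))
# {'de': 'schnell', 'en': 'fast', 'cat': 'adj'}
```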
The task of the expander is now to:
* add further readings, insert or delete words where appropriate;
* readjust the order of the translation units according to English syntax;
* disambiguate tags in translation options.
D4.4 Post-processing & Post-editing
16
We give two examples of how these modifications take place in the above representation and show how features are modified.
There are three translation options, 'have', 'bear' and 'have got', for the German verb
'haben'. During dictionary pre-processing, the English verbs were lemmatised and
tagged in line with the representation of the BNC. For verbs, our strategy is currently
to expand the tags with all possibilities which we anticipate this verb to take in a
sentence. The expander looks into the information stored in the translation units
as well as into its TL context and seeks to disambiguate these tags. Knowing that 3rd
person singular is to be generated, the expander can unify (or replace) the tags in the
leaves with the appropriate BNC tag:
,
{ori=haben,c=verb,wnrr=2,mother={chunk=1,mark=vg_fiv,tns=pres,per=3,nb=sg}}
@ {c=verb} @ {c=VHZ,lu=have}.
, {c=verb} @ {c=VVZ,lu=bear}.
, {c=verb} @ {c=VHZ,lu=have},{c=VVN,lu=get}
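The narrowing of the expanded tag alternatives can be sketched as follows; the feature table is illustrative, not the actual GFAI resource.

```python
# A sketch of the tag disambiguation above: knowing from the translation
# unit that 3rd person singular present is required, the expanded BNC tag
# alternatives are narrowed to the compatible one. The feature table is
# illustrative, not the actual GFAI resource.

VERB_TAG = {
    ("VH", "pres", 3, "sg"): "VHZ",   # 'have' class
    ("VV", "pres", 3, "sg"): "VVZ",   # main-verb class
}

def disambiguate(tag_alternatives, lemma_class, tns, per, nb):
    """tag_alternatives: ';'-separated tags as in the representation above."""
    wanted = VERB_TAG[(lemma_class, tns, per, nb)]
    return wanted if wanted in tag_alternatives.split(";") else None

print(disambiguate("VHB;VHD;VHI;VHN;VHZ", "VH", "pres", 3, "sg"))   # VHZ
print(disambiguate("VVB;VVD;VVI;VVN;VVZ", "VV", "pres", 3, "sg"))   # VVZ
```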
In a similar way, the plural for the translations of "Wagen" can be percolated from the
translation unit into the POS tags of the translation options and token translations.
Another example concerns the translation of degree adjectives. Within the translation
unit of the adjective 'schnell' we keep a trace of the degree from the SL analysis. The
translation into English of the superlative 'schnellsten' can be realised, for instance,
as 'most rapid', where an adverb modifies the adjective, or as 'quickest', an inflection
which is possible in English only for short adjectives. Based on the degree information,
the expander may thus insert an optional translation unit with the adverb 'most' before the adjective and/or substitute the AJ0 reading of the adjective with an AJS one. Technically, this is realised by inserting embedded levels of translation options and translation units.
, {translation_unit}
@ {translation_option_1}
@ {c=adv}
@ {c=adj} @ {c=AV0,lu=most}.
.
, {ori=schnell,c=adj,wnrr=4,deg=sup,...}
@ {c=adj} @ {c=AJ0,lu=fast}.
, {c=adj} @ {c=AJ0,lu=quick}.
, {c=adj} @ {c=AJ0,lu=rapid}.
, {translation_option_2}
@ {ori=schnell,c=adj,wnrr=4,deg=sup,...}
@ {c=adj} @ {c=AJS,lu=fast}.
, {c=adj} @ {c=AJS,lu=quick}.
, {c=adj} @ {c=AJS,lu=rapid}.
Thus, a newly generated translation unit contains two translation options which, again,
consist of one or more sequences of translation units. Translation option 1 multiplies
out into the three two-word units 'most fast', 'most quick' and 'most rapid', while the AJS
feature in translation option 2 would generate 'fastest', 'quickest' and 'rapid' (see the description of the token generator in section 2.1.2.4). We thus have six alternative
translations for "schnellsten" and leave it to the search engine to decide
which of these translations is most likely in the context of the sentence, given the
n-gram language models generated from the BNC.
The search engine consists essentially of a beam-search algorithm which uses the
BNC n-gram language models as a heuristic function to estimate the promise of each
node it examines. The beam is initialised with the first translation tokens of all the
translation options in the first translation unit. It then unfolds the successor nodes until the number of unfolded paths reaches the beam width (currently set to
1000). At this point, the worst paths are discarded and only the most promising sentence prefixes are followed.
In the example above, the beam is initialised with the token translation hans/NP0.
This state has three successor nodes, namely have/VHZ, bear/VVZ and have/VHZ.
While the first occurrence of have/VHZ and bear/VVZ are followed by the article
the/AT0, the second occurrence of have/VHZ unfolds into get/VVN. This process of incremental unfolding of parallel paths through the structure is continued until the last
nodes are reached. At this point, the m highest-weighted sentences are sent to
the post-processing module, which for the above example in our current setting is:

1. Hans has got the most rapid cars.
The search engine inspects recursively embedded translation units and translation options. The content of successive translation units unfolds into successive nodes of the
paths, while two or more translation options open alternative continuations for the
sentences. Each node has one or more successor nodes.
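The procedure can be condensed into a short beam-search sketch. The scorer below is a toy stand-in for the BNC n-gram language model, the data layout is invented, and the recursive embedding of translation units is omitted for brevity.

```python
# A condensed beam-search sketch of the procedure described above: the beam
# holds sentence prefixes; at each translation unit all token alternatives
# are unfolded and only the best-scoring prefixes are kept.
import heapq

def beam_search(units, score, beam_width=1000):
    """units: list of translation units, each a list of alternatives;
    an alternative is a tuple of (lemma, tag) tokens."""
    beam = [((), 0.0)]                       # (prefix, score)
    for alternatives in units:
        candidates = []
        for prefix, _ in beam:
            for tokens in alternatives:
                path = prefix + tokens
                candidates.append((path, score(path)))
        # discard the worst paths, keep the most promising prefixes
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[1])
    return beam

best = beam_search(
    [[(("hans", "NP0"),)],
     [(("have", "VHZ"),), (("bear", "VVZ"),), (("have", "VHZ"), ("get", "VVN"))]],
    score=lambda path: -len(path),           # toy scorer: prefer short paths
    beam_width=2)
print(best[0][0])
```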
The n-gram language models are computed with the CMU language-modelling toolkit and are used to estimate the probability of the sentence prefix examined so far. We are currently experimenting with various settings for the search engine, using 3- and 4-gram models of the lemmatised BNC, up to 6-gram
models of BNC tag sequences, and language models using the surface word
forms.
2.1.1.4 FUPF
The morphological generator takes as input the lemma retrieved from the BNC, the
POS tag and part of the morphosyntactic information coming from the Source Language.
Spanish, being morphologically richer than English, provides in some cases more inflectional information than is needed for correct token generation. Thus, information such as noun gender or verbal mood is not used at generation time. Other information which can be useful to generate the correct utterance, such as number or verbal tense, is computed by the Spanish tagger and lemmatiser, carried over through all the processing steps, and finally retrieved for actual token generation.
Spanish Morphological Analyzer: The Spanish input is processed by a shallow
morphosyntactic analyzer, or tagger, called CastCG, which is based on the Constraint
Grammar formalism.
The output of the tagger is a string of Spanish lemmas or base forms, with disambiguated POS tags and inflectional information. Morphological disambiguation is performed by selecting the most plausible reading for each word given the context, expressed in linear terms. Below, we give an example of the output of the Spanish tagger for the simple sentence “Estamos contentas” (We are happy).

Estamos    estar     V IND PRES PL P1
contentas  contento  A FEM PL
In this example, the verb (“estar”) carries the features: indicative, present, plural and
first person; and the adjective (“contentas”) is tagged feminine plural.
Tagger-Dictionary mapping: In a subsequent step, POS and morphosyntactic information are mapped onto the Parole/EAGLES tagset used by the dictionary. In this mapping step, information about POS, which will be used during dictionary look-up, is separated from inflectional information, which will be used only later, in token generation. See below the output of this mapping for the simple example above (“Estamos contentas”):
estamos    estar     VA  i$p$pl$1
contentas  contento  AQ  f$pl
The first tag (VA, AQ) contains POS information according to the PAROLE conventions.
The second tag contains a string of characters that conventionally encodes the inflectional features coming from SL analysis. This string will be read off and interpreted at
generation time. Here is a table of conversions between the information provided by
the tagger and the encoded string which is passed onto generation:
TAGGER  MORPH TAG  INFO
MSC     m          masculine
FEM     f          feminine
SG      sg         singular
PL      pl         plural
P1      1          1st pers
P2      2          2nd pers
P3      3          3rd pers
IND     i          indicative
IMP     imp        imperative
IMPF    ipf        imperfect
PRES    p          present
FUT     f          future
PRET    pp         past
GER     g          gerund
PCP     prt        participle
PERF    ppf        perfect
Refl    rf         reflexive
SUB     sb         subjunctive
INF     inf        infinitive
CND     cnd        conditional
NOM     nom        nominative
DAT     dat        dative
ACC     acc        accusative
Poss    ps         possessive
Rel     rel        relative
Interr  int        interrogative
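The conversion in the table above amounts to a simple lookup plus concatenation with the '$' separator; a minimal sketch (the dictionary and function names are illustrative, not the project's actual code):

```python
# Mapping from CastCG tagger features to the encoded morph-string
# characters, as listed in the conversion table above.
FEATURE_CODES = {
    "MSC": "m", "FEM": "f", "SG": "sg", "PL": "pl",
    "P1": "1", "P2": "2", "P3": "3",
    "IND": "i", "IMP": "imp", "IMPF": "ipf", "PRES": "p",
    "FUT": "f", "PRET": "pp", "GER": "g", "PCP": "prt",
    "PERF": "ppf", "Refl": "rf", "SUB": "sb", "INF": "inf",
    "CND": "cnd", "NOM": "nom", "DAT": "dat", "ACC": "acc",
    "Poss": "ps", "Rel": "rel", "Interr": "int",
}

def encode_features(features):
    """Encode a list of tagger features as a '$'-separated morph string."""
    return "$".join(FEATURE_CODES[f] for f in features)
```

For the running example, the features IND PRES PL P1 yield the string i$p$pl$1, and FEM PL yield f$pl.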
In some cases, the information is not passed on as a feature but is used to calculate the proper POS tag. This is the case, for example, for proper nouns, for which the following mapping applies at this stage:

N Prop => NP0
Also, in this step of the process, certain occurrences of pro-drop in the Spanish sentences are taken care of. If a verb in first or second person occurs which has no overt subject, a pronominal token is inserted. This pronoun carries all the morphological features provided by the verb in personal form.
In our example, this results in the following output:
nosotras   nosotros  PP  f$pl$1$nom
estamos    estar     VA  i$p$pl$1
contentas  contento  AQ  f$pl
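The pro-drop handling can be sketched roughly as follows; this is an illustrative approximation only (the token representation and the pronoun choice are our own simplifications, and the propagation of gender from context, visible in the f$pl$1$nom string above, is omitted):

```python
def insert_prodrop_subject(tokens):
    """Insert a pronominal subject before a 1st/2nd-person finite verb
    that has no overt subject. Tokens are (lemma, pos, morph) triples."""
    out = []
    for i, (lemma, pos, morph) in enumerate(tokens):
        feats = morph.split("$") if morph else []
        finite_12 = pos.startswith("V") and ("1" in feats or "2" in feats)
        has_subject = i > 0 and tokens[i - 1][1] in ("PP", "NC", "NP")
        if finite_12 and not has_subject:
            number = "pl" if "pl" in feats else "sg"
            person = "1" if "1" in feats else "2"
            # inserted pronoun carries the verb's person/number features
            out.append(("nosotros" if person == "1" else "vosotros",
                        "PP", f"{number}${person}$nom"))
        out.append((lemma, pos, morph))
    return out
```

Applied to the tagged "Estamos contentas" example, this inserts a first-person plural pronoun token before the verb.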
Morphological information in the UDF: The UDF, or XML structure that is input to
the search engine, provides a container for the morphological information gathered
during SL analysis.
This information is stored under the “extra_information” tag and is marked as
SLmorph.
<trans_unit id="1">
  <extra_information>
    <SLmorph>f$pl</SLmorph>
  </extra_information>
  <option id="1">
    <token_trans>
      <lemma>happy</lemma>
      <pos>AJ0</pos>
    </token_trans>
  </option>
</trans_unit>
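Assuming a well-formed UDF fragment like the one above, the SLmorph string and the translation option can be retrieved with standard XML tools; a minimal sketch using Python's ElementTree (not the project's actual implementation):

```python
import xml.etree.ElementTree as ET

# Hypothetical UDF fragment, mirroring the example above.
UDF = """
<trans_unit id="1">
  <extra_information>
    <SLmorph>f$pl</SLmorph>
  </extra_information>
  <option id="1">
    <token_trans>
      <lemma>happy</lemma>
      <pos>AJ0</pos>
    </token_trans>
  </option>
</trans_unit>
"""

unit = ET.fromstring(UDF)
slmorph = unit.findtext("extra_information/SLmorph")
lemma = unit.findtext("option/token_trans/lemma")
pos = unit.findtext("option/token_trans/pos")
print(slmorph, lemma, pos)  # f$pl happy AJ0
```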
Computation of the extended BNC tags for token generation: Although all the
information present in the SL word is carried over, only a part of this information will
actually be used for English token generation.
In noun generation, gender information is discarded and only number is retrieved. In the general case, the following mapping applies:
Reduced tag  SL Morph  Extended tag  Tag value
NN           sg        NN1           singular noun
NN           pl        NN2           plural noun
Nouns that are neutral for number, such as aircraft or data, should already be tagged as NN0, and therefore SL number does not affect them.
The mapping of verbal inflections will in general be many-to-one, since Spanish verbal morphology is richer than English morphology. As an example of the generation of verbal forms, the following tag mapping combinations apply:
Reduced tag  SL Morph                                                    Extended tag  Tag value
VV           p$sg$1, p$pl$1, p$sg$2, p$pl$2, p$pl$3                      VVB           base form of lexical verb
VV           p$sg$3                                                      VVZ           -s form of lex. verb
VV           pp$sg$1, pp$pl$1, pp$sg$2, pp$pl$2, pp$sg$3, pp$pl$3        VVD           past form of lexical verb
VV           ipf$sg$1, ipf$pl$1, ipf$sg$2, ipf$pl$2, ipf$sg$3, ipf$pl$3  VVD           past form of lexical verb
VV           g                                                           VVG           -ing form of lexical verb
VV           prt                                                         VVN           past participle of lex. verb
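The noun and verb mappings can be sketched as a single function; the following is an illustrative approximation (the function name and the fallback behaviour for unlisted combinations are our own assumptions):

```python
def extend_tag(reduced_tag, sl_morph):
    """Compute the extended BNC tag from a reduced tag plus the SL
    morph string, following the mappings tabled above."""
    feats = set(sl_morph.split("$")) if sl_morph else set()
    if reduced_tag == "NN":
        return "NN2" if "pl" in feats else "NN1"
    if reduced_tag == "VV":
        if "g" in feats:
            return "VVG"   # -ing form
        if "prt" in feats:
            return "VVN"   # past participle
        if "ipf" in feats or "pp" in feats:
            return "VVD"   # past form (imperfect and preterite)
        if "p" in feats:
            # present: 3rd person singular takes VVZ, the rest VVB
            return "VVZ" if ("3" in feats and "sg" in feats) else "VVB"
    return reduced_tag  # assumed fallback: leave the tag unchanged
```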
2.1.2 Token generator
The token generator generates the correct token from a lemma and a tag. Essentially these are the lemmas and tags as provided with the BNC. However, for some words an extension of the tag set was required for correct token generation. Together with this work, a re-conceptualisation of the lemmatiser was implemented. This novelty concerns the way lemmatisation exceptions are handled and how this information is collected and used during token generation. We first describe the work done to extend the tag set and then describe the lemmatiser and the token generator.
All consortium partners use the same token generator. Each group needs to do some
pre-processing to get the information from the source language to the tag format
needed for the target language generation, as is shown in section 2.1.1.
2.1.2.1 Extension of the BNC tag set
Extension of the reversible lemmatiser involved 5 different BNC tags:
1. PNQ   wh pronouns (who/whom/whosoever/whomsoever...)
The BNC PNQ tag subsumes nominative, objective and genitive forms of these pronouns. Two new tags were introduced, distinguishing nominative (PNQ), objective (PNO) and genitive (PNG) forms.
2. VM0   modal verbs (may/might/shall/should...)
This BNC tag includes present tense and past tense forms of the modal verbs. A new
tag (VMD) was introduced to distinguish present tense (VM0) and past tense (VMD).
3. VBB   present tense of 'be' (am/are/be)
A new tag was required to distinguish the 1st person (VBA), the form 'be' (VBE) and the other forms (VBB).
4. VBD   past tense of 'be' (was/were)
As with the VBB tag, this BNC tag does not allow the 1st and 3rd person of the past tense to be distinguished from the other word forms. A new tag (VBW) was introduced to represent the form 'were'.
5. AV0   adverbs
This tag includes various forms of adverbs. It was expanded into AV0, AVC, AVS and
AVY.
In order to be able to regenerate the different forms subsumed under these tags, each tag was split into the finer-grained subclasses given above.
This work consists of two parts:
∗ retagging the BNC
∗ implementation of reversible lemmatisation rules
2.1.2.2 Retagging the BNC
For retagging the BNC we have classified some of the available BNC tags into finer-grained classes based on the tokens. Since there are only a small number of words for the tags PNQ, VBB, VBD and VM0, we can use a simple table lookup. The mapping from the 'old' BNC token/tag (Otag) to a new token/tag (Ntag) is achieved through the following tables:
Otag  token        Ntag
PNQ   who          PNQ
PNQ   whom         PNO
PNQ   whose        PNG
PNQ   whoever      PNQ
PNQ   whomever     PNO
PNQ   whoseever    PNG
PNQ   whomsoever   PNO
PNQ   whosesoever  PNG

Otag  token  Ntag
VBB   am     VBA
VBB   'm     VBA
VBB   be     VBE
VBD   were   VBW

Otag  token   Ntag
VM0   could   VMD
VM0   might   VMD
VM0   should  VMD
VM0   wo      VMD
VM0   would   VMD
VM0   'd      VMD
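The table lookup can be sketched as follows (an illustrative subset of the retagging tables; the dictionary and function names are our own):

```python
# Table-lookup retagging for the small closed classes PNQ, VBB, VBD, VM0.
RETAG = {
    ("PNQ", "who"): "PNQ", ("PNQ", "whom"): "PNO",
    ("PNQ", "whose"): "PNG", ("PNQ", "whoever"): "PNQ",
    ("PNQ", "whomever"): "PNO", ("PNQ", "whoseever"): "PNG",
    ("PNQ", "whomsoever"): "PNO", ("PNQ", "whosesoever"): "PNG",
    ("VBB", "am"): "VBA", ("VBB", "'m"): "VBA", ("VBB", "be"): "VBE",
    ("VBD", "were"): "VBW",
    ("VM0", "could"): "VMD", ("VM0", "might"): "VMD",
    ("VM0", "should"): "VMD", ("VM0", "wo"): "VMD",
    ("VM0", "would"): "VMD", ("VM0", "'d"): "VMD",
}

def retag(old_tag, token):
    """Return the new tag for a token/Otag pair, defaulting to the Otag."""
    return RETAG.get((old_tag, token), old_tag)
```

Tokens not listed keep their original tag, so e.g. are/VBB remains VBB.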
The following distinctions for adverbs are desirable:
∗ AV0   uninflected adverbs (hard/monthly/then)
∗ AVC   comparative adverbs (smaller/better/easier)
∗ AVS   superlative adverbs (best/earliest/quickest)
∗ AVY   adverbs derived from adjectives (probably/typically/richly)
The procedure is to look at the suffix of the adverb and then decide the new tag. Simplified: if the word ends in -er, an AVC tag is given; if it ends in -est, the new tag is AVS; and if it ends in -ly, the tag is AVY. However, there are a number of exceptions, such as "monthly", which ends in -ly but is identical to the adjective from which it is derived. Hence, we want "monthly" to be tagged as AV0 rather than AVY. Similarly, "after" and "rather" end in -er but are not comparative adverbs. For the uninflected adverbs ending in -ly, a list of 131 adverbs was generated containing those words which occur as both adjectives and adverbs in the BNC. In addition, a list of 15 adverbs ending in -er was added to the exception list.
AV0  acromegaly  AV0
AV0  adeptly     AV0
AV0  after       AV0
AV0  bedridly    AV0
AV0  bluntly     AV0
AV0  bodily      AV0
AV0  brindly     AV0
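The suffix-based procedure together with its exception list might be sketched like this (the exception set below is a tiny illustrative subset of the 131 -ly and 15 -er entries, and the function name is our own):

```python
# Suffix-based retagging of AV0 adverbs, with an exception list for
# uninflected adverbs that merely look inflected.
AV0_EXCEPTIONS = {"monthly", "bodily", "after", "rather"}

def retag_adverb(token):
    if token in AV0_EXCEPTIONS:
        return "AV0"          # uninflected despite its suffix
    if token.endswith("er"):
        return "AVC"          # comparative adverb
    if token.endswith("est"):
        return "AVS"          # superlative adverb
    if token.endswith("ly"):
        return "AVY"          # adverb derived from an adjective
    return "AV0"              # uninflected adverb
```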
2.1.2.3 The lemmatiser
With a new tag set, new lemmatisation rules may apply. The lemmatiser and the token generator (described in section 2.1.2.4 below) make use of a so-called 'TokenLemLexicon' and sets of 'TokenSuffixRules'. While the TokenSuffixRules lemmatise
regular forms, the TokenLemLexicon lists the lemmatisation exceptions.
The 'new' pronouns (PNO, PNG and PNQ), the various forms of 'be' and modal verbs
are converted into lemmas by the exception lexicon. A lemmatisation exception consists of four pieces of information: the BNC tag, an identifier (ID), the word token and
its lemma.
BNC tag  ID  word token  lemma
VBA      F   am          be
VBA      N   'm          be
VBB      F   are         be
VBB      N   're         be
VBB      A   'rt         be
VBD      F   was         be
VBE      F   be          be
VBG      F   being       be
VBI      F   be          be
VBN      F   been        be
VBW      F   were        be
VBZ      F   is          be
VBZ      N   's          be
During lemmatisation, the token/tag is looked up in the table and the lemma together
with its identifier is returned. For instance, the lemma of "am/VBA" is "be" and its ID
is "F", while for the short form "'m/VBA", the lemma/ID is "be/N". For generation of
word forms, a tag, an ID and a lemma are provided to return a token. Thus,
"VBA/F/be" generates "am" while "VBA/N/be" generates "'m". As a default behaviour,
when no ID is given, the variant indicated as 'F' will be generated. The following modal verbs and pronouns follow a similar schema:
BNC tag  ID  word token  lemma
PNG      F   whosever    whoever
PNG      F   whose       who
PNO      F   whomever    whoever
PNO      F   whomsoever  whosoever
PNO      F   whom        who
PNQ      F   whoever     whoever
PNQ      F   who         who
VMD      F   would       will
VMD      N1  wo          will
VMD      N   'd          will
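The exception-lexicon mechanism can be illustrated with a small sketch (only a subset of the entries above; the real TokenLemLexicon is larger, and the data structure and function names are our own assumptions):

```python
# TokenLemLexicon entries: (tag, ID, token) -> lemma.
LEM_LEX = {
    ("VBA", "F", "am"): "be", ("VBA", "N", "'m"): "be",
    ("VBB", "F", "are"): "be", ("VBZ", "F", "is"): "be",
    ("VBW", "F", "were"): "be",
    ("VMD", "F", "would"): "will", ("VMD", "N", "'d"): "will",
    ("PNO", "F", "whom"): "who",
}
# Inverse index for token generation: (tag, ID, lemma) -> token.
GEN_LEX = {(tag, id_, lemma): token
           for (tag, id_, token), lemma in LEM_LEX.items()}

def lemmatise(tag, token):
    """Return (lemma, ID) for an exception entry, or None."""
    for (t, id_, tok), lemma in LEM_LEX.items():
        if t == tag and tok == token:
            return lemma, id_
    return None

def generate(tag, lemma, id_="F"):
    """Default to the full-form variant 'F' when no ID is given."""
    return GEN_LEX.get((tag, id_, lemma))
```

As in the text, VBA/F/be generates "am" while VBA/N/be generates "'m".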
Since there is a large number of more or less regular adverbs, the lemmatisation of adverbs makes use of TokenSuffixRules. Lemmatisation rules are only necessary for the inflected adverbs AVC, AVS and AVY. Since the word forms of comparative and superlative adverbs are homomorphic to comparative and superlative adjectives, we can simply apply the adjective lemmatisation rules to those adverbs.

The -ly inflection of AVY-tagged adverbs covers many more words, is more complicated, and requires lemmatisation rules in addition to lexicalised forms. We have developed the following list of eight AVY lemmatisation rules. These rules work similarly to those already available for other word classes. The first column contains the new AVY tag, the second column is a rule number, followed by a regular expression that describes the token ending; the fourth column is a regular expression which describes the suffix of the lemma. As with the other lemmatisation rules described elsewhere (Carl et al., 2005), the matching suffix of a word token is replaced by the lemma suffix of that rule.
AVY 1 (al)y $1e
AVY 2 (.)ily $1y
AVY 3 ^(.{1,2}ll)y $1
AVY 4 ([eiaormu]bl)y $1e
AVY 5 ([ai]mpl)y $1e
AVY 6 (ic)ally $1
AVY 7 ([rd]u)ly $1e
AVY 8 (..)ly $1
The rules are applied in ascending order of their identifier numbers; a rule substitutes a matching token suffix by the lemma suffix, and at most one rule can apply.
For instance, rule 4 will match the 'ably' ending of adverbs like "probably" and replace
it by 'able'. This will correctly produce the lemma "probable". Rule 3 will transform
"fully" and "dally" into "full" and "dall" and rule 6 will map "automatically", "tropically"
and "publically" into "automatic", "tropic" and "public".
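The eight rules can be applied directly as regular-expression substitutions; a sketch in Python (assuming, as the text states, that exceptions have already been caught by the TokenLemLexicon; the function name is our own):

```python
import re

# The eight AVY lemmatisation rules from the table above: each maps a
# token-suffix pattern to a lemma suffix. Applied in ascending order;
# the first (and only) matching rule fires.
AVY_RULES = [
    (1, r"(al)y$",          r"\1e"),
    (2, r"(.)ily$",         r"\1y"),
    (3, r"^(.{1,2}ll)y$",   r"\1"),
    (4, r"([eiaormu]bl)y$", r"\1e"),
    (5, r"([ai]mpl)y$",     r"\1e"),
    (6, r"(ic)ally$",       r"\1"),
    (7, r"([rd]u)ly$",      r"\1e"),
    (8, r"(..)ly$",         r"\1"),
]

def lemmatise_avy(token):
    """Return (lemma, rule_id) for an AVY-tagged adverb."""
    for rule_id, pattern, repl in AVY_RULES:
        lemma, n = re.subn(pattern, repl, token)
        if n:
            return lemma, rule_id
    return token, None
```

This reproduces the worked examples: "probably" falls to rule 4, "fully" to rule 3 and "automatically" to rule 6.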
Note that there are numerous exceptions for rule 6. Thus, "grammatically" should be
lemmatised into "grammatical" and not into "grammatic", and the lemma of "archaeologically" is "archaeological" and not "archaeologic". These exceptions would have to
be listed in the TokenLemLexicon. As described above, a match in this exception lexicon would provide the correct lemma and no more lemmatisation rules need apply.
A set of 140 exceptions was extracted semi-automatically from the BNC, based on the following assumptions. The underlying assumption was that -ically adverbs are derived from adjectives, so the problem is to know whether the adjective from which the adverb was derived ends in -ical or in -ic. Since there are many more adjectives ending in -ic than in -ical, it seemed reasonable to assume that the -ical ending is the exception and the -ic ending the rule.

First, a list of more than 1400 adjectives (AJ0) ending in -ical was extracted from the BNC. After deleting compound adjectives such as "academic-theoretical" and "aerospace-to-medical" from that list, it was verified for each of the remaining 780 adjectives whether a derived adverb (ending in -ically) also occurred in the BNC. A list of 140 adjectives and their derived adverbs was thus extracted and included in the exception list.
2.1.2.4 Token Generation
Token generation transforms a lemma/tag/ID into a word token. It makes use of the lemmatiser resources, the TokenLemLexicon and the TokenSuffixRules, as well as a set of its own resources, which are described below. Two applications are distinguished: 1) a lemma, a tag and a rule or lexicon-exception ID are provided; 2) only the lemma and a tag are known.
Known lemma, tag and ID: In this scenario, we assume that we not only know the
lemma and the tag of the word to be generated, but also the ID of its inflection rule.
The word token is retrieved from the exceptions lexicon if it contains the lemma/tag
and ID. Otherwise the reversed lemmatisation rules are applied for token generation.
Thus, within the inflection paradigm AVY, rule 6 (see above) will transform the lemma
"automatic" into "automatically" by substituting the lemma suffix 'ic' with 'ically'. To
do so, the lemma suffix and the token suffix are swapped, as exemplified in the following inverted rule 6:

AVY 6 (ic) $1ally
Known lemma, tag: The procedure becomes more complicated if we do not know the ID or rule number which correctly transforms the lemma into a token. Since the rules and their order were designed for lemmatisation, we cannot simply apply the first matching inverted lemmatisation rule, as this would in many cases generate wrong words. Thus, the lemma "hungry" could be mapped into "hungrily" by applying rule 2 or into "hungryly" using rule 8. To know which rule generates the correct form with the highest probability, we have counted the lemma endings and rule identifiers for each inflection class. The idea is that if we look at a long enough ending of a lemma, given its tag, we can determine with high certainty which token generation rule must apply to generate the correct word.
Thus, to generate a word token from the lemma "hungry" in the AVY paradigm, we see in the table below that there are 43899 lemmas in the BNC ending in -y which are generated by rule 2, and 746 lemmas generated by rule 8. Even for the ending -ry there is still an ambiguity between rules 2 and 8. Looking at a still longer ending, we see that there are 1160 lemmas in the BNC ending in -gry. All 1160 lemmas ending in -gry in the AVY class were generated through rule 2, so that we can deterministically generate "hungrily" by applying rule 2.
y AVY 2 43899
y AVY 8 746
ry AVY 2 16870
ry AVY 8 409
gry AVY 2 1160
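The longest-suffix disambiguation can be sketched as follows (the counts are those quoted above for the AVY class; the lookup structure and function name are our own assumptions):

```python
# Counts of (lemma suffix, tag) -> {rule ID: frequency in the BNC}.
SUFFIX_COUNTS = {
    ("y", "AVY"):   {2: 43899, 8: 746},
    ("ry", "AVY"):  {2: 16870, 8: 409},
    ("gry", "AVY"): {2: 1160},
}

def choose_rule(lemma, tag, max_suffix=3):
    """Pick the rule ID for the longest known lemma suffix; among the
    rules seen with that suffix, take the most frequent one."""
    for n in range(max_suffix, 0, -1):
        counts = SUFFIX_COUNTS.get((lemma[-n:], tag))
        if counts:
            return max(counts, key=counts.get)
    return None
```

For "hungry"/AVY the suffix -gry is unambiguous and rule 2 is selected deterministically.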
3. Post-editing
Post-editing is what a human user of an MT system can do after the whole automated translation process comes to an end. The users' survey (D2.1) has shown that such a module is welcomed by the users. Thus, METIS-II offers the option of presenting the output of the post-processor to the post-editor.
3.1 Introduction
The following list of phenomena was compiled in D2.1 on the basis of the users' responses to the following question: What sort of 'similarity' with the 'ideal translation' makes you consider post-editing at all? For example, what kind of mistakes would you consider trivial when post-editing?
Quoting from D2.1: “To the question regarding the sort of ‘similarity’ with the ‘ideal
translation’ that makes the translator consider post editing at all the following answers
were given:”
∗ spelling and punctuation errors
∗ a word missing here and there, a dot or a comma. “Post editing surely gives me the chance to have a global overview of the translation”.
∗ subject-verb agreement
∗ spelling, grammatical errors
∗ verb tenses, singular/plural number distinction, adjectival vs adverbial usage (however, always depending on the situation)
∗ the most trivial errors are the synonyms presented by the tools and the words with similar morphology. When the structure is incorrect, the text needs re-translation and therefore the whole procedure becomes very time-consuming.’
Seen from a more technical point of view, problems that can be solved at post-editing
time are usually about:
∗ errors that are due to mistakes at pre-processing time: the language-specific pre-processing tools, especially the taggers, generate errors which influence the performance of the tools that rely on the information they produce. Thus, wrong tagging may result in wrong chunking and wrong grammatical function prediction in the source language. These errors are carried over to the translation algorithm and negatively affect the results.
∗ errors resulting from the wrong application of mapping rules (expander): for instance, a wrongly tagged sentence in MG may appear as having no subject (being a pro-drop one). This will wrongly trigger the mechanism dealing with pro-drop phenomena.
∗ contextually dependent issues: anaphora resolution and tense selection.
∗ errors due to data sparseness: for instance, it is quite often the case that lexical selection is not appropriate because a similar structure does not exist in the corpus and the machinery available is not able to perform the necessary (semantic) inferences.
∗ errors due to the statistical nature of lexical selection: a more frequent use will be preferred over hapax legomena (or less frequent uses).
On the other hand, the most frequent actions undertaken at post-editing time were shown to be (again, quoting from D2.1): “The actions in post-editing are ordered as follows (starting with the most frequent ones):”
∗ Replace wrong phrases
∗ Add words
∗ Moving words/phrases from one place in the sentence to another
∗ Fill in missing words
∗ Replace wrong phrases
∗ Add phrases
∗ Fill in missing phrases’
Put in a more compact way, the most common processes in post-editing are:
∗ replacing lexical items at word and phrase level
∗ reordering, inserting and deleting words and chunks in the sentence
In the future, we are going to allow the post-editor to see a window of sentences around the sentence being translated at a specific point, since post-editing can be used for anaphora resolution and tense assignment.
3.2 Language-specific issues
3.2.1 Modern Greek to English
Common errors with which the translation engine is faced are those resulting from the
pre-processing of the source language sentences. More specifically, the ILSP PAROLE
tagger may generate the following errors:
1. Accusative nouns may be marked with nominative case and vice versa, i.e. nominative nouns may be marked with accusative case, thus making it difficult to discriminate between the subject and other NPs with various syntactic functions (direct objects, adverbials, etc.).
2. Moreover, prepositions may be tagged as adverbs, leading to wrong chunking with respect to the formation of prepositional phrases.
3. Another error case at pre-processing time is the wrong tagging of adjectives as adverbs due to their partial morphological resemblance.
4. In a similar vein, the partial morphological resemblance of some article types may lead to their tagging as pronouns and vice versa.
Another post-editing task concerns the handling of a set of phenomena which can only
be contextually resolved:
5. Anaphora resolution: the assignment of the correct person and number features of pronouns is almost invariably context-dependent.
6. Tense resolution: the tense and mood features are not only context-dependent but also subject to inherent mismatches between Modern Greek and English.
Some phenomena have to do with the present stage of development of the system
and the resources. More sophisticated resources and more detailed treatment of a
particular pair will eliminate the need for post-editing in cases like the following:
7. Lexical insertion rules (e.g. do-support).
8. English expletive subjects (it, there) do not have any equivalent in Greek and cannot, therefore, be automatically generated unless they are introduced by some mapping rule at pre-processing time.
9. Saxon genitive: Modern Greek genitive structures have a one-to-one correspondence only with the English of-genitive, but the system is directed to consider the Saxon genitive pattern as well. However, given that the Saxon genitive pattern has a restricted distribution compared to the of-genitive, the system may not yield the appropriate structure every time, and the end user should therefore be given the option to transform the generated of-genitive into a Saxon genitive.
Furthermore, it is quite often the case that the criterion of frequency of occurrence does not yield correct results, that is, the proper translation. This is most evident when the required translation is among the hapax legomena or occurs less frequently in the corpus. An example of the second case is given in the sentence below, where the MG noun ‘αποζηµίωση’ is wrongly (in this sentence) translated as ‘compensation’ due to the higher frequency of occurrence of ‘compensation’ over ‘indemnity’.

1. Το Ιράν απαίτησε από τις ΗΠΑ αποζηµίωση
Iran demanded compensation / indemnity from the USA
3.2.2 Dutch to English
Some issues that probably will not be solved after automatic post-processing include the following:
∗ tense system
∗ support verbs
∗ use of prepositions
∗ use of articles
∗ wrong lexical choice
∗ word order
Tense system: There is no one-to-one mapping between the Dutch and the English tense systems. While the Dutch imperfect and perfect tenses usually express the same meaning (albeit with some differences), the English simple past and present perfect tenses are not interchangeable. Dutch has an imperfective aspect representation with aan het + infinitive, but it is certainly not as common as the use of the English continuous tenses. There are also minor differences in the use of the passive voice.
Support verbs: Support verbs or light verbs are verbs that have no real fixed meaning, no general semantic content. Examples in Dutch are zijn, komen, brengen, hebben, etc. These verbs are very hard to translate: since they have no fixed meaning, there is also no fixed translation. In principle, the target-language corpus should come up with the correct verb-complement combination, but we anticipate that in many cases this will go wrong. It is also very difficult to correct this without human intervention. So, this will be a task for the post-editor.
Use of articles and prepositions: There are always at least minor differences between languages in the use of articles. In general, the Dutch and English systems are alike, but there are still some divergences, usually only distinguishable at the lexical level. The Dutch sentence ‘de verkoop daalde met 70%’ translates as ‘sales dropped 70%’, so the article is not retained.

There is almost never a one-to-one translation of prepositions between languages. This means prepositions often get mistranslated. Since the corpus might actually offer more possibilities that are semantically correct but are no translation of the original, human post-editing is absolutely necessary in this case.
Wrong lexical choice: Since we are working with a statistical system, the system might rate the correct solution lower than an incorrect one, if it finds one at all. Wrong lexical choices can only be corrected by a post-editor.
Word order: Although word order in English is rather straightforward, there are still some issues when translating from Dutch. Dutch has a complex, not entirely context-free word order, which might lead to wrong chunking or analysis and cause problems in the construction of a correct target-language sentence. Hence, the post-editing environment should provide the human translator with the ability to move words and chunks around.
3.2.3 German to English
Specific requirements for post editing on the basis of the current German system are
the following:
1. Prepositions are tricky in machine translation anyway, also for METIS. There are different sorts of them: the strongly bound ones, which are hardly predictable, and the ‘contentful’ prepositions, which cannot be predicted well either. For the latter, there is the fact that in German the pictures are hanging AT the wall, while in English the pictures are hanging ON the wall; in German the ‘kids are playing ON the playground’, in English ‘the kids are playing IN the playground’. The recipe for the treatment of prepositions is that they are translated ambiguously and it is left to the search engine to find the right one. So, it is hoped that if the search engine has the choice between ‘on the playground’, ‘at the playground’ and ‘in the playground’, it is able to choose the right one.

The task for post-editing here is that sometimes the wrong preposition has to be replaced by the right one.
2. A similar problem occurs with the articles. As articles are used differently across languages, it has to be ensured that the search engine can choose the right structure.

This is achieved by making existing articles optional and by inserting articles (definite and indefinite) into NPs that have no article. It is up to the search engine to decide whether an article should be there or not and, if so, whether it is a definite or an indefinite one.
3. Wrong lexical choice: The German-English transfer dictionary is very rich and contains many readings of words and multi-word units. In many cases the ranker is not able to choose the intended reading. Post-editing thus has to replace single lexical choices.
4. A large area for post-editing results from differences in word order. German has a relatively free word order, which means that the expander has to rearrange chunks. To do this properly, it would actually need a full valency analysis. As this is not available, it tries to detect the subject, the verb group and the prepositional phrases. It arranges the subject (or subject candidates, sometimes more than one) in front of the finite verb, places the rest of the verb group right after the finite verb, and leaves it to the search engine to verify the structure and evaluate all the transformations together with article insertion and the like.

Sometimes the procedure goes wrong or is not applied due to unforeseen structural properties. This results in structures in which chunks have to be completely replaced or moved.
5. Combination of phenomena: In the worst case, all the phenomena mentioned may combine into a very bad result which has to be corrected. A sentence such as

‘Um das Leben des Premierministers kämpfen die Ärzte seit Tagen.’
(For the life of the prime minister the doctors have been fighting for days)

currently results in the following structure:

Round a life a prime minister campaign the doctor from the moment of day
This can hardly be recognised as a translation of the original sentence. What went wrong?
∗ The strongly bound preposition ‘um’ has been literally translated by ‘round’.
∗ The indefinite article that replaced the definite one survived the search engine.
∗ The ‘of’ insertion was not kept by the search engine.
∗ The word order was not rearranged; the German word order was kept for the English sentence, which means that the finite verb occurs not after the subject but before it.
∗ In this case the verb has a wrong translation: ‘kämpfen’ might be ‘fight, campaign, battle, combat, brawl, struggle’.
∗ ‘the doctor’ has the wrong number.
∗ Then a wrong translation of ‘seit’ follows (which may in some rare cases be ‘from the moment of’),
∗ followed by a wrong number of ‘day’.
∗ In the end, tense and aspect have to be corrected:

Round a life a prime minister campaign the doctor from the moment of day
For the life of the prime minister the doctors have been fighting for days
3.2.4 Spanish to English
Sentences that present contrastive phenomena in the translation between Spanish
and English are the most likely to need correcting in the post-editing stage. Here follow some instances of this type of translation problems.
Tense & Mood: Lack of correspondence in the use of tense and mood: e.g. the non-progressive present tense in Spanish is often translated by a progressive form in English.
SP. Mañana voy a comprar un paraguas.
ENG. Tomorrow I am going to buy an umbrella.
Voice: Different constructions in Spanish are translated into Passive Voice in English.
Such constructions include: impersonal 3rd person plural, impersonal se and reflexive
passive.
SP. Me han robado el coche.
ENG. My car has been stolen.
Use of prepositions: Prepositions present many translation problems. Either they appear in the source and not in the target, or vice versa, or they are not translationally equivalent.

One of the most frequent cases of differing complementation patterns in the two languages is the translation of a direct object in Spanish as a prepositional object in English.
SP. Busca ese libro.
ENG. Look for that book.
Another systematic divergence is the use of preposition “a” (to) in Spanish to mark
human DOs.
SP. Pregunta a tu amigo.
ENG. Ask your friend.
In noun complementation, one of the main structural issues is the translation of [N + de N] in Spanish into [N + N] in English, although there are other structure changes, such as [N + Adj] => [N + N] or [N + de Vinf] => [Vger + N]:
SP. libro de física
ENG. physics book
Use of the article: The use of the article differs between the two languages in many cases. Spanish tends to use the definite article ‘el’ for generic plural NP subjects or objects, appositive NPs, temporal NPs or locative NPs, while English does not. On the other hand, English often uses the indefinite article for generic NP objects and Spanish does not.
SP. Me gustan las patatas.
ENG. I like potatoes.
Pro-drop: The subject is obligatorily present in an English sentence, but not in Spanish. Consequently, we find dummy pronouns in English and frequent instances of pro-drop in Spanish. An additional problem is caused by the difficulty of generating the correct form of the third person singular pronoun (he, she, it).
SP. Está cansado.
ENG. He feels tired.
Sentences with “se” pronoun: The Spanish pronoun ‘se’ has several semantic interpretations and consequently several possible translations into English (reflexive pronoun, possessive, inchoative, ...).
SP. Olga se compró una blusa.
ENG. Olga bought herself a blouse.
SP. Los vasos se rompen.
ENG. Glasses break.
Word Order: Spanish is a relatively free word order language, while English is not.
Often, for topicalisation reasons, the object is fronted before the verb.
SP. El coche se lo ha comprado José.
ENG. Joseph bought the car.
Another typical word order divergence is the position of the adjective within the NP.
SP. los perfumes franceses
ENG. the French perfumes
Preposition stranding occurs in English but not in Spanish.
SP. ¿A quien le hablaste?
ENG. Whom did you talk to?
3.3 Interface
3.3.1 Implementation concepts of the post-editing module
After the METIS server has produced a translation, the end user will be forwarded to the web interface of the post-editing module, as shown in the Annex, where she will be able to perform all necessary corrections. Our approach is based on Perl, a very popular solution for dynamic web interfaces, which has already been used in METIS to develop most of the post-editing tools in the project.
To implement the module, we are going to use the Apache server [4] as a web server together with mod_perl [5]. The mod_perl module embeds a Perl interpreter into the Apache server, which ensures minimal response time for the Perl scripts. mod_perl is widely used for executing Perl scripts in web pages and has nowadays largely replaced traditional CGI scripting. For the database connections we will use the Perl DBI module [6], which comes with mod_perl. The DBI module allows our scripts to connect to all popular relational databases. A list of all supported database engines can be found at http://dbi.perl.org/.
The main Perl script of the post-editing module will run on the Apache web server and will accept as input the METIS main engine output in XML format. The script will then generate the editing web page (see Annex) on which the end user works on the output of the METIS main engine. Based on the editing choices of the end user, the appropriate Perl scripts will be invoked. Finally, when the user has accepted the output as the final translation, the post-editing module will store the translation in the METIS database for future reference.
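As an illustration of this flow, the sketch below parses a small engine-output fragment and extracts, per chunk, the weighted candidate translations. The XML format shown is hypothetical, since the actual METIS output schema is not reproduced here, and the sketch is in Python rather than Perl purely for compactness:

```python
import xml.etree.ElementTree as ET

# Hypothetical engine output; the real METIS XML schema may differ.
XML = """<translation>
  <sentence id="1">
    <chunk src="de hond">
      <cand weight="0.8">The dog</cand>
      <cand weight="0.2">dog</cand>
    </chunk>
    <chunk src="blaft">
      <cand weight="0.9">barks</cand>
    </chunk>
  </sentence>
</translation>"""

def load_chunks(xml_text):
    """Return (source chunk, candidates) pairs, with candidates
    sorted by descending weight."""
    root = ET.fromstring(xml_text)
    chunks = []
    for chunk in root.iter("chunk"):
        cands = sorted(((float(c.get("weight")), c.text)
                        for c in chunk.iter("cand")), reverse=True)
        chunks.append((chunk.get("src"), cands))
    return chunks

for src, cands in load_chunks(XML):
    print(src, "->", cands[0][1])  # best-weighted candidate per chunk
```

The editing page would then render one selection widget per (source chunk, candidate list) pair.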
3.3.2 The web interface
The web interface is the way METIS communicates with its user. Apart from entering
or referring to a text, the main task of the user is post-editing. We suggest a web interface as shown in the Annex.
The idea is to give the post-editor a tool for viewing the n best solutions proposed by the search engine together with their assigned weights (with n parametrisable), but also to give him/her the possibility of permuting words and chunks easily. By clicking on partial selections, the post-editor can compose a whole sentence. How this is done is explained in the Annex.
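The n-best view can be sketched as follows: each chunk comes with weighted candidate translations, and full-sentence hypotheses are scored by the product of the chosen chunk weights. The chunk data and weights below are invented for illustration; n corresponds to the parametrisable number of solutions shown to the post-editor:

```python
from itertools import product

# Invented candidate chunks with weights (not actual engine output).
CHUNKS = [
    [("The dog", 0.8), ("dog", 0.2)],
    [("barks", 1.0)],
    [("towards the postman", 0.7), ("to the postman", 0.3)],
]

def n_best(chunks, n=3):
    """Enumerate full-sentence hypotheses as combinations of one
    candidate per chunk, scored by the product of their weights."""
    hyps = []
    for combo in product(*chunks):
        score = 1.0
        for _, weight in combo:
            score *= weight
        hyps.append((score, " ".join(text for text, _ in combo)))
    hyps.sort(reverse=True)
    return hyps[:n]

for score, sentence in n_best(CHUNKS):
    print(round(score, 2), sentence)
```

Full enumeration is only viable for short sentences; a real search engine would use beam search, but the ranking principle shown to the post-editor is the same.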
4 http://www.apache.org/
5 http://perl.apache.org/
6 http://dbi.perl.org/
4. Conclusion
The METIS-II approach consists of a source-language and a target-language model.
The link between the two is given by the translation model (TM). In this deliverable,
we discussed post-processing and post-editing, which are part of the target-language
model.
To begin with, post-processing consists of token generation. In order to deal with data
sparseness, we decided to use lemmas instead of tokens for use in the search engine.
The implication of this approach is that the correct morphological forms have to be
generated after the search is completed.
Furthermore, the METIS-II system leaves room for post-editing. Machine translation
nowadays is not able to generate perfect translations. Post-editors can change the
proposed machine-generated translation by using a simple web interface. The most
common actions will be lexical selection and permutations of tokens and chunks.
In the future, we may be able to use the results from post-editing to improve our system. A corpus of corrected sentences and chunks might form the basis of automated grammar correction for METIS.
ANNEX
[Mock-up of the post-editing web interface ("Select Target Language Text"). The sentence box shows the translation "The dog barks towards the postman." In the Partial selection area, each chunk of the Dutch source (de hond, blaft, naar de postbode, and the final full stop) is listed together with its candidate target chunks (The dog / The / dog; barks; towards the postman / towards / the postman / the / postman; .), each accompanied by a "go" button. The chunks selected by the user accumulate in the Selected text box (here: "The dog barks"), and an "accept" button stores the completed translation.]
The idea is that the translator clicks "go" next to the correctly translated bits of text, so that these appear in the bottom line, where manual adaptation is possible. The translator can choose between fully translated sentences, lower-level chunks, or even single words. When the full translation has been composed in the Selected text box and the translator hits "accept", the sentence is saved as a correct translation of the source sentence. The consecutive clicks on the buttons allow an aligned bitext of chunks and sentences, as selected by the translator, to be built, which can help to enhance the translation engine.
created on 20.06.06 by Vincent Vandeghinste