
Improving Pronoun Translation for
Statistical Machine Translation (SMT)
Liane Guillou
Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
2011
Abstract
Machine Translation is a well established field, yet the majority of current systems perform
the translation of sentences in complete isolation, losing valuable contextual information from
previously translated sentences in the discourse. One such class of contextual information
concerns who or what it is that a reduced referring expression such as a pronoun is meant to
refer to. The use of inappropriate referring expressions in a target language text can seriously
impair the reader’s ability to understand it.
This project follows on from two recent research papers that focussed on improving the translation of pronouns in Statistical Machine Translation (SMT). The approach taken is to annotate
the pronouns in the source language with the morphological properties of the antecedent translation in the target language prior to translation using a phrase-based English-Czech SMT system. The project makes use of a number of manually annotated corpora in order to factor out
the effects arising from poor coreference resolution, wherein selecting the wrong antecedent
for a pronoun in the source language text will wrongly bias its translation. The aim of this work
is to discover whether “perfect” coreference resolution in the source language text can reduce
the incidence of inappropriate referring expressions in the target language text.
The annotated translation system developed as part of this project makes only a marginal improvement over the baseline system, as measured using a bespoke automated evaluation metric.
These results are supported by a manual evaluation conducted by a native Czech speaker. The
lack of substantial improvement over the baseline may be attributed to many factors, not least of which is the highly inflective nature of the Czech language.
Acknowledgements
I would like to thank my supervisor, Professor Bonnie Webber, for her continued guidance and
support from the conception of this project through to its realisation. I am deeply grateful for
the patience that she has shown in explaining to me those concepts that were difficult to grasp,
for setting me on the correct path when I became lost and most of all, for infecting me with
her enthusiasm for this work. I have thoroughly enjoyed my time spent working on this project
and I couldn’t have asked for anything more in terms of the supervision I have received in my
first foray into the field of Machine Translation.
Special thanks are owed to Dr. Markéta Lopatková and Dr. Ondřej Bojar at Charles University.
I am indebted to Markéta for her suggestions, enthusiasm and assistance with the analysis of
results at every stage of this project. Her expertise in Czech Natural Language Processing has
proved invaluable and I can honestly say as a monolingual speaker that without her help, this
project would not have been possible. I am also extremely grateful to Ondřej for his recommendations with respect to the stemming of the English and Czech data to obtain shared word
alignments for the translation models and his suggestions regarding the automated evaluation
of the translation output.
Thanks also to Christian Hardmeier for his patience in answering my many questions in relation
to his previous work on pronoun translation and evaluation.
Credit is also owed to David Mareček at Charles University, who created the PCEDT 2.0 alignment file used in this project.
Finally, I would like to thank my colleagues for their company during the long days spent in
the computer labs and their assistance in peer reviewing this document.
The PCEDT 2.0 corpus, which is not yet publicly available, has been used with permission
from the Institute of Formal and Applied Linguistics, Charles University, Prague.
Declaration
I declare that this thesis was composed by myself, that the work contained herein is my own
except where explicitly stated otherwise in the text, and that this work has not been submitted
for any other degree or professional qualification except as specified.
(Liane Guillou)
I dedicate this thesis to my mother, Anna Guillou, who instilled in me from an early age the
importance of education and made sacrifices to ensure that I received the very best. Her love,
encouragement and unwavering support have been instrumental throughout my life, and have
given me the confidence that I needed to embark upon this course of further study. Words
alone cannot convey my gratitude.
Table of Contents

1 Introduction
   1.1 Definition of the Problem
   1.2 Background
   1.3 Previous Work
      1.3.1 Focus on Pronoun Translation in Machine Translation
      1.3.2 English-Czech Machine Translation
   1.4 Example of Poor Pronoun Translation
   1.5 Hypothesis and Contributions
   1.6 Chapter Summary

2 Concepts
   2.1 Anaphora and Coreference
   2.2 Coreference Resolution
   2.3 Czech Language
   2.4 Phrase-based Statistical Machine Translation
   2.5 Moses
   2.6 Evaluation in Machine Translation
      2.6.1 Automated Evaluation
      2.6.2 Manual Evaluation
   2.7 Chapter Summary

3 Data
   3.1 BBN Pronoun Coreference and Entity Type Corpus
   3.2 Penn Treebank 3.0 Corpus
   3.3 PCEDT 2.0 Corpus
   3.4 Chapter Summary

4 Methodology
   4.1 Overview
   4.2 Assumptions
   4.3 Datasets
   4.4 Constructing the Language Model
   4.5 Combining the Corpora
      4.5.1 Identification of Coreferential Pronouns and their Antecedents
      4.5.2 Extraction of the Antecedent Head Noun
      4.5.3 Extraction of Morphological Properties from the PCEDT 2.0 Corpus
   4.6 Training the Translation Models
      4.6.1 Computing the Word Alignments
      4.6.2 Tuning the Translation System Weights: Minimum Error Rate Training (MERT)
      4.6.3 Annotation of the Training Set Data
   4.7 The Annotated Translation Process
   4.8 Annotation and Translation System Architecture
   4.9 Evaluation
      4.9.1 Automated Evaluation: Assessing the Accuracy of Pronoun Translations
      4.9.2 Manual Evaluation: Error Analysis and Human Judgements
   4.10 Chapter Summary

5 Results and Discussion
   5.1 Automated Evaluation
   5.2 Manual Evaluation
   5.3 Critical Evaluation of the Approach and Potential Sources of Error
   5.4 Chapter Summary

6 Conclusion and Future Work
   6.1 Conclusion
   6.2 Future Work

A Czech Pronouns Used in the Automated Evaluation

Bibliography
Chapter 1
Introduction
The primary aim of this project is to produce more accurate coreferring expressions in the target
language within English to Czech Statistical Machine Translation (SMT). To date there have
been few attempts to integrate coreference resolution methods into Machine Translation. Notable exceptions include two recently published articles, focussing on English to French/German translation of third person personal pronouns. This project considers the translation of
pronouns in English-Czech SMT, which is a more complex issue due to certain properties of
the Czech language. Czech is a highly inflective language (like German) that exhibits subject pro-drop and has a “free word order”, i.e. the word order reflects the information structure
of the discourse.
Whilst considerable progress has been made in Machine Translation research, little attention
has been paid to cross-sentence coreference (Le Nagard and Koehn, 2010). The recent work
of both Le Nagard and Koehn (2010) and Hardmeier and Federico (2010), focussing on third-person personal pronoun translation for SMT, represents a realisation of the need to address
this gap. In particular, it represents an acknowledgement that the appropriate translation of
discourse-level phenomena, including pronominal reference, is essential to ensure that the
translated text makes sense to its intended audience. As Le Nagard and Koehn (2010) state,
current Machine Translation methods treat sentences as mutually independent and therefore do
not handle the cross-sentence dependencies that can arise due to the use of anaphoric reference.
The recent work of Le Nagard and Koehn (2010) and Hardmeier and Federico (2010) demonstrates an interest within the research community in improving overall translation quality via
the accurate translation of pronouns. Whilst the method proposed by Le Nagard and Koehn
(2010) showed little improvement, the method presented by Hardmeier and Federico (2010)
showed a small but significant improvement as measured by their bespoke automated scoring
metric that incorporates precision and recall.
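The kind of precision/recall scoring mentioned above can be illustrated with a small sketch. The matching criterion used here (multiset overlap of pronoun tokens against a single reference) is a simplification for illustration only; the function name, pronoun inventory and example sentences are hypothetical and are not taken from either paper.

```python
from collections import Counter

def pronoun_precision_recall(candidate, reference, pronouns):
    """Score pronoun choices in a candidate translation against a single
    reference translation, by multiset overlap of pronoun tokens."""
    cand = Counter(w for w in candidate if w in pronouns)
    ref = Counter(w for w in reference if w in pronouns)
    matched = sum((cand & ref).values())  # pronouns agreeing with the reference
    precision = matched / max(sum(cand.values()), 1)
    recall = matched / max(sum(ref.values()), 1)
    return precision, recall

# Toy Czech pronoun inventory and sentences (illustrative only)
czech_pronouns = {"ji", "ho", "jej", "je", "jeho", "její"}
candidate = "viděl jsem ji včera".split()   # system chose the feminine form
reference = "viděl jsem ho včera".split()   # reference uses the masculine form
print(pronoun_precision_recall(candidate, reference, czech_pronouns))  # (0.0, 0.0)
```

A real metric would also need to align candidate pronouns to reference pronouns positionally rather than as a bag, but the precision/recall trade-off is the same.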
This project investigates whether the approach used by Le Nagard and Koehn (2010) can improve pronoun translation in English-Czech SMT. This method was selected in preference to
that used by Hardmeier and Federico (2010) due to its simplicity. A major difference between
this project and previous work is the use of manually annotated corpora in place of coreference resolution algorithms to extract pronoun antecedents and automated methods to identify
antecedent head nouns. These corpora provide coreference annotation and noun phrases from
which the head noun can be extracted with little effort. This marks the first attempt to assess the
potential for source language coreference to improve pronoun translation in SMT by exploiting “perfect” manual source language coreference annotation. Furthermore, it is the first
attempt to apply the technique of source language pronoun annotation to the English-Czech
language pair.
The motivation for using the English-Czech language pair is threefold. Firstly, the availability of the PCEDT 2.0 parallel English-Czech corpus, as provided by the Institute of Formal
and Applied Linguistics at Charles University, Prague, coincided with the start of this project.
Secondly, as the author is a monolingual speaker, the choice of the second language in the pair is fairly arbitrary, but dependent on the availability of a native speaker to assist in the evaluation of the
translation system output and to provide language specific assistance during the development
of such a system. This project benefited enormously from the expert advice of Dr. Markéta
Lopatková at Charles University, Prague. The third, and perhaps most salient reason for choosing Czech as the second language in the translation pair is that Czech is a subject pro-drop
language. That is, in Czech, an explicit subject pronoun may be omitted if its antecedent can
be predicted on the grounds of saliency and/or verb morphology. It was initially envisaged that
the system developed as part of this project would be designed to explicitly handle this phenomenon. However, due to the complexity of designing a pronoun-focussed translation system
and devising a strategy for evaluating the system output, this has been left as a future extension
to this project.
This document describes in detail the approach taken in the investigation of whether source
language annotation may improve pronoun translation in English-Czech SMT. The remainder
of this chapter defines the problem, introduces the concept of anaphora resolution and its application in Machine Translation and presents the hypothesis upon which this project is based.
Chapter 2 introduces the key concepts and chapter 3, the corpora used in the project. Chapter
4 describes the approach taken in the development of the annotation and translation system
and the evaluation of its output. The results of the evaluation are presented and discussed in
chapter 5 and the project is concluded in chapter 6. Possible options for future continuation
of this work are also included in chapter 6, with suggestions reflecting some of the key issues
highlighted in the preceding chapters.
1.1 Definition of the Problem
Pronouns can be used as anaphoric expressions. When a pronoun is used anaphorically, it
is called a coreferential pronoun. In Czech, as with many other languages, the number and
gender of a personal pronoun must agree with the number and gender of its antecedent. This
is the phenomenon known as anaphora. When observing this phenomenon in discourse it is
common for the pronoun’s antecedent to appear in an earlier sentence than the pronoun itself,
presenting a problem for current state of the art Machine Translation systems which translate
sentences in isolation. When sentences are translated in isolation, the contextual information
present in the preceding sentences becomes lost. In the case of a coreferential pronoun, if its
antecedent appears in a previous sentence, information about that antecedent will be lost by the
time the sentence in which the pronoun occurs is considered for translation. The translation of
the pronoun is then carried out with no knowledge of the number and gender of the pronoun’s
antecedent.
Consider the translation of the English pronoun “it” into Czech for the following simple examples¹:
1. The dog has a ball. I can see it playing outside.
2. The cow is in the field. I can see it grazing.
3. The car is in the garage. I will drive it to school later.
In each of the examples, the English pronoun “it” refers to an entity that has a different gender
in Czech. In order to translate the pronoun correctly in Czech it is necessary to identify the
gender (and number) of the entity to which the pronoun refers and ensure that the gender (and
number) of the pronoun agrees. In example 1 “it” refers to the dog (“pes”, masculine) and
should be translated as “jeho/ho/jej”. In example 2, “it” refers to the cow (“kráva”, feminine)
and should be translated as “ji”. In the final example, 3, “it” refers to the car (“auto”, neuter)
and should be translated as “je/jej/ho”.
In Czech, within the masculine gender, a distinction is made between animate objects (e.g.
people and animals) and inanimate objects (e.g. buildings). In many cases the same pronoun
may be used for both animate and inanimate masculine genders, but there are a number of cases
in which different pronouns must be used. For example, in the case of possessive reflexive
pronouns in the accusative case, “svého” is used to refer to a dog (masculine animate, singular)
that belongs to someone, e.g. “I admired my (own) dog”: “Obdivoval jsem svého psa”. This
is in contrast with “svůj”, which is used to refer to a castle (masculine inanimate, singular) that
¹ Examples adapted from information from “Local Lingo”, an online Czech language resource: http://www.locallingo.com/
belongs to someone, e.g. “I admired my (own) castle”: “Obdivoval jsem svůj hrad”.
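The agreement requirement illustrated by examples 1-3 can be sketched as a simple lookup: the gender of the Czech translation of the antecedent head noun determines the admissible accusative forms of “it”. The lexicon and candidate lists below are toy stand-ins for the morphological information that, in this project, is extracted from annotated corpora; they are not the project’s actual data structures.

```python
# Toy lexicon: Czech translation of the antecedent head noun -> grammatical gender
GENDER = {
    "pes": "masc_anim",    # dog (masculine animate)
    "hrad": "masc_inan",   # castle (masculine inanimate)
    "kráva": "fem",        # cow (feminine)
    "auto": "neut",        # car (neuter)
}

# Candidate accusative forms of anaphoric "it" by gender (simplified)
IT_ACCUSATIVE = {
    "masc_anim": ["jeho", "ho", "jej"],
    "masc_inan": ["jej", "ho"],
    "fem": ["ji"],
    "neut": ["je", "jej", "ho"],
}

def translate_it(antecedent_cz):
    """Return candidate Czech translations of anaphoric 'it', given the
    Czech translation of its antecedent head noun."""
    gender = GENDER[antecedent_cz]
    return IT_ACCUSATIVE[gender]

print(translate_it("kráva"))  # ['ji'] -- example 2, "it" refers to the cow
```

The point of the sketch is that the choice of pronoun form cannot be made from the English sentence alone; it requires the gender of the antecedent’s translation.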
The problem of identifying the entity to which a pronoun refers is termed anaphora resolution.
Section 1.2 outlines a brief history of anaphora resolution with particular reference to its incorporation in the field of Machine Translation. The concept of Anaphora and the closely related
concept of Coreference are described in greater detail in chapter 2.
1.2 Background
Anaphora resolution involves the identification of the antecedent of a referring expression, typically a
pronominal or noun phrase expression that is used to refer to something that has been previously mentioned in the discourse (the antecedent). In the case where multiple referring expressions refer
to the same antecedent, these expressions are said to be coreferential; these relationships can be
represented using coreference chains. Mitkov et al. (1995) assert that the identification of an
anaphor’s antecedent is often crucial to ensure a correct translation, especially in cases in which
the target language of the translation marks the gender of pronouns.
The problems of anaphora resolution and the related task of coreference resolution have sparked
considerable research within the field of Natural Language Processing (NLP). Strube (2007)
charts the changes from early techniques that modelled linguistic knowledge algorithmically
such as Hobbs’s Algorithm (Hobbs, 1978), the Centering model (Grosz et al., 1995) and Lappin and Leass’s algorithm (1994), through to the Supervised and Semi-Supervised Machine
Learning methods commonly used today. Even within the sphere of Machine Learning, there
is still much debate as to which method provides the best results. Early methods include one
that Strube (2007) credits to Soon et al. (2001): the recasting of coreference resolution as
a binary classification task to which Machine Learning techniques can be applied. In contrast,
Linh et al. (2009) argue that ranking based models are more suited to the task of anaphora
resolution. Ng (2010) also argues in favour of ranking models that allow for the identification of the most probable candidate antecedents, claiming that they outperform other classes of
supervised Machine Learning methods.
In order to improve methods for anaphora resolution based on supervised Machine Learning, as
well as to serve as “Gold standards” for evaluation, parallel efforts have been pursued to manually annotate large corpora with coreference chains. The OntoNotes 3.0 corpus (Weischedel
et al., 2009) and the BBN Pronoun Coreference and Entity Type corpus (Weischedel and Brunstein, 2005) (used in this project) are examples of such corpora.
Despite continued efforts into providing methods for anaphora resolution, there has been little
work focusing on the integration of anaphora resolution and SMT systems. Le Nagard and
Koehn (2010) argue that work on SMT has not moved beyond sentence-level translation. Furthermore they assert that the translation ambiguity arising from the use of pronouns cannot be
resolved within the context of a single sentence if a pronoun refers to an antecedent from a previous sentence. Hardmeier and Federico (2010) present a case study of the performance of one
of their SMT systems on personal pronouns to illustrate that improved handling of pronominal
anaphora may lead to improvements in translation quality. They report that the SMT system
is unable to find a suitable translation for anaphoric pronouns in 39% of cases and that while
choosing the wrong pronoun does not generally affect important content words, it can make
the output translations difficult to understand.
1.3 Previous Work

1.3.1 Focus on Pronoun Translation in Machine Translation
Early work on the integration of anaphora resolution with Machine Translation includes that
of Mitkov et al. (1995), Lappin and Leass (1994) and Saggion and Carvalho (1994). Mitkov
et al. (1995) focussed on intersentential anaphora resolution, conjoining sentences to simulate the intersententiality that could be handled by the rule-based CAT2 Machine Translation
system. They provided example output from their system showing instances where pronouns
are translated correctly from English to German. However, they provided only the details
of their approach and several examples, offering no information relating to the evaluation of
their method. Lappin and Leass (1994) integrated their RAP algorithm into a logic-based Machine Translation system, but the core focus of their work was on anaphora resolution and
not on Machine Translation. Saggion and Carvalho (1994) used a transfer approach combined
with Artificial Intelligence techniques and focussed on both intersentential and intrasentential
anaphora resolution for the translation of pronouns from Portuguese to English. This
interest in the 1990s culminated in the publication of a special issue on anaphora resolution in
Machine Translation, with an introduction provided by Mitkov (1999).
No further evidence of work on the integration of anaphora resolution and Machine Translation
systems is available until 2010, when papers on the subject were published by Le Nagard
and Koehn (2010) and Hardmeier and Federico (2010). This resurgence of interest in
anaphora resolution for Machine Translation systems follows advances in the field since the
1990s, which have made the application of these new approaches possible.
The approach taken by Le Nagard and Koehn (2010) involves the identification of the antecedent of each coreferential occurrence of ‘it’ and ‘they’ in the source language (English)
together with the identification of the antecedent’s translation into the target language (French)
and its grammatical gender. Based on the gender of the noun in the target language, the occurrence of ‘it’ in the source language text is replaced by it-masculine, it-feminine or it-neutral.
The same is applied for occurrences of ‘they’. Using the Moses toolkit (Hoang et al., 2007),
they trained an SMT system on annotated training data composed using the annotation method
previously described, before applying the same process to the test data as part of the translation
process. In the training of the annotation system the French translation of the English antecedent is extracted from the parallel corpus using the word alignment obtained as part of the
process of training their baseline system. When running test translations, they first translate the
test text using the baseline system to extract the French translations of the English antecedents.
They then use the gender of the French word to annotate the English pronoun before translating
the annotated test text using the system trained on annotated training data. This approach treats
the annotation of pronouns as a separate task which is performed outside of the translation process. The authors report little change in the BLEU score of their system over the baseline and
instead resort to manually counting the number of correctly translated pronouns. Whilst they
attribute the lack of improvement of their system to the poor quality of their coreference resolution system, they claim that the process works well when the coreference resolution system
provides accurate results.
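As a concrete illustration of the annotation scheme just described, the following sketch rewrites coreferential occurrences of ‘it’ and ‘they’ so that they carry the gender of their antecedent’s target-language translation. The function name and the input formats (a coreference map from pronoun positions to antecedent words, and a gender map for the antecedents’ translations) are hypothetical simplifications, not Le Nagard and Koehn’s actual implementation.

```python
def annotate_pronouns(tokens, coref, target_gender):
    """tokens: source sentence tokens.
    coref: maps the token index of a coreferential pronoun to its antecedent word.
    target_gender: maps an antecedent word to the grammatical gender of its
    target-language translation ('masculine' | 'feminine' | 'neutral')."""
    out = []
    for i, tok in enumerate(tokens):
        if tok.lower() in ("it", "they") and i in coref:
            # Replace the pronoun with its gender-annotated variant,
            # e.g. 'it' -> 'it-feminine', as in Le Nagard and Koehn (2010).
            out.append(tok.lower() + "-" + target_gender[coref[i]])
        else:
            out.append(tok)
    return out

tokens = "The cow is in the field . I can see it grazing .".split()
annotated = annotate_pronouns(tokens, {10: "cow"}, {"cow": "feminine"})
print(annotated)  # 'it' at position 10 becomes 'it-feminine'
```

Training data annotated in this way lets the translation model learn distinct phrase translations for each gender-marked pronoun variant.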
The approach taken by Hardmeier and Federico (2010) differs in that it provides a single-step process whereby the identification of a pronoun’s antecedent in the source language and
the extraction of its target language translation’s morphological properties are integrated into the
translation process as an additional model in their SMT system. This additional model maintains a mapping between each source language pronoun and the number and gender of its antecedent.
Translation is achieved by first processing the source language test text using a coreference
resolution system to identify coreferential pronouns and their antecedents. The output of the
coreference resolution system is used as input to a decoder driver module which runs a number
of Moses decoder processes in parallel. The decoder driver then feeds individual sentences to
the decoder processes using a priority queue to order sentences according to how many pronoun antecedents they contain. Thus sentences that contain a greater number of antecedents
are translated first, ensuring a high throughput of the system. The authors report no significant
improvement in BLEU score between their system and the baseline, but they do report a small
but significant improvement in pronoun translation recall against a single reference translation.
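The sentence-scheduling idea described above can be sketched with a priority queue that orders sentences by how many pronoun antecedents they contain, so that antecedent-rich sentences are decoded first. This is an illustration of the scheduling policy only, with hypothetical names; it is not Hardmeier and Federico’s actual decoder driver.

```python
import heapq

def schedule(sentences, antecedent_counts):
    """Yield sentences in decreasing order of antecedent count.
    heapq is a min-heap, so counts are negated; the original index
    breaks ties deterministically."""
    heap = [(-antecedent_counts[i], i, s) for i, s in enumerate(sentences)]
    heapq.heapify(heap)
    while heap:
        _, _, sentence = heapq.heappop(heap)
        yield sentence

sents = ["s0", "s1", "s2"]
counts = {0: 1, 1: 3, 2: 0}   # number of pronoun antecedents per sentence
print(list(schedule(sents, counts)))  # ['s1', 's0', 's2']
```

In the real system the decoder driver would feed the popped sentences to parallel Moses decoder processes; here the generator simply yields them in priority order.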
The approach used in this project is similar to that taken by Le Nagard and Koehn (2010).
Whilst their project required the use of a coreference resolution system to build coreference
chains, the provision of a source language corpus with manually annotated coreference information allowed this project to focus on the translation problem. This project also accommodates a wider range of English pronouns than the study by Le Nagard and Koehn (2010), which
only considered the translation of ‘it’ and ‘they’.
1.3.2
English-Czech Machine Translation
Much of the recent work in English-Czech SMT has been conducted at the Institute of Formal
and Applied Linguistics at Charles University, Prague. Research has been conducted in many
areas including the development of parallel corpora suitable for the development of Machine
Translation systems such as the PCEDT 2.0 corpus used in this project and its predecessor,
the PCEDT 1.0 corpus (Čmejrek et al., 2004). Another area of research has concentrated on
the development of both phrase-based and dependency-based SMT systems. In a comparative
study of phrase-based and dependency-based SMT systems Bojar and Hajič (2008) concluded
that their best phrase-based system outperformed the experimental dependency-based system,
but work continues in both directions.
The decision to focus on phrase-based SMT in this project is due to its simplicity, which given
the relatively short time-scale, is an important factor. That phrase-based systems currently
outperform dependency-based systems in English-Czech SMT is an added bonus.
1.4 Example of Poor Pronoun Translation
As an example of poor pronoun translation, consider the following English sentence from the
Wall Street Journal corpus and its translation (by a Machine Translation system) in Czech:
he said mexico could be one of the next countries to be removed from the priority list because
of its efforts to craft a new patent law .
řekl , že mexiko by mohl být jeden z dalších zemí , aby byl odvolán z prioritou seznam , protože
její snahy podpořit nové patentový zákon .
In this example, the English pronoun “its”, which refers to “mexico”, is translated in Czech as
“její” (feminine, singular) and “mexico” is translated as “mexiko” (neuter, singular). Here, the
Czech translation of the pronoun and its antecedent disagree in gender. A more correct translation of the pronoun would be “jeho” (neuter, singular possessive pronoun) or “své” (possessive
pronoun) depending on the overall structure of the translated sentence.
1.5 Hypothesis and Contributions
The work of Hardmeier and Federico (2010) focussed on English to German translation whilst
Le Nagard and Koehn (2010) focussed on English to French translation. This project considers
the translation of pronouns in English to Czech SMT and builds on the work of Le Nagard and
Koehn (2010) and Hardmeier and Federico (2010). By factoring out the problems of automated
coreference resolution, parsing and part of speech (POS) tagging and morphological tagging,
this project attempts to assess how well an approach to explicitly annotating pronouns in the
source language could work when applied to English-Czech SMT if conditions were assumed
to be “perfect”. Whereas French (a Romance language) and German (a Germanic language)
are relatively close to English, the differences between English and Czech are even greater.
Therefore, not only does this project assess the suitability of a pronoun annotation approach in
improving the translation of pronouns into another language, but into a language that is very
different from English. It is believed that this project is the first attempt made to explicitly
handle the problem of pronoun translation in Czech SMT.
This project makes three major contributions:
1. A prototype system for the annotation and translation of pronouns in English-Czech
SMT.
2. Automated and manual evaluations of the output of the system as compared against a
baseline.
3. An annotated aligned parallel corpus which could be used in future investigations into
pronoun translation in English-Czech SMT.
1.6 Chapter Summary
This chapter introduced the specific problem of pronoun translation in SMT, discussed previous
work in relation to anaphora resolution, pronoun-focussed Machine Translation and English-Czech SMT, and outlined the hypothesis on which this work is based. The next chapter will
describe in detail many of the concepts that are essential to the understanding of the problem
as well as the approach taken in the development of the annotation and translation system and
its evaluation.
Chapter 2
Concepts
2.1 Anaphora and Coreference
Anaphora is a discourse-level phenomenon in which the interpretation of one expression is
dependent on another, previously mentioned expression, known as the antecedent. For
example, in the passage below, the word “He” at the start of the second sentence refers to “J.P.
Bolduc” at the start of the first sentence. In order to understand the meaning of the second
sentence, the reader must first identify the referent of the pronoun “He” (which in this example
is “J.P. Bolduc”).
J.P. Bolduc, vice chairman of W.R. Grace & Co., which holds a 83.4% interest in this energy-services company, was elected a director. He succeeds Terrence D. Daniels, formerly a W.R.
Grace vice chairman, who resigned.¹
Where anaphora is concerned with referring to a previously mentioned expression in the discourse, coreference is the act of referring to the same referent (Mitkov et al., 2000), such that
multiple expressions that refer to the same referent are said to be coreferential. Coreference
chains may be established in order to link multiple referring expressions to the same antecedent
expression.
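A coreference chain of the kind just described can be represented with a minimal data structure linking every referring expression back to one antecedent. This is an illustrative sketch with hypothetical names, not the annotation format of the corpora used in this project.

```python
class CoreferenceChain:
    """A chain of referring expressions that share one antecedent."""

    def __init__(self, antecedent):
        self.antecedent = antecedent    # first (non-pronominal) mention
        self.mentions = [antecedent]    # all coreferential expressions, in order

    def add_mention(self, expression):
        # Link a further referring expression (e.g. a pronoun) to the chain.
        self.mentions.append(expression)

# Chain for the J.P. Bolduc example above
chain = CoreferenceChain("J.P. Bolduc")
chain.add_mention("vice chairman of W.R. Grace & Co.")
chain.add_mention("He")
print(chain.mentions)
# ['J.P. Bolduc', 'vice chairman of W.R. Grace & Co.', 'He']
```

Resolving the pronoun “He” then amounts to walking back along the chain to its non-pronominal antecedent.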
This project focuses on the translation of already resolved instances of nominal anaphora, in
which a referring expression (a pronoun, definite Noun Phrase (NP) or proper name) has a
non-pronominal NP as its antecedent (Mitkov et al., 2000). The project makes use of manually
annotated corpora from which instances of coreferential (and anaphoric) pronouns and their
antecedents are identified, in order to annotate training data with which to train an SMT system.
¹ Example taken from the Wall Street Journal corpus.
2.2 Coreference Resolution
Coreference Resolution is the process of identifying the referent to which a referring expression
refers. In this project, the pronouns are the referring expressions and the antecedents are the
referents. As discussed in chapter 1, there has been much research into the development of
automated methods to provide coreference and anaphora resolution. Such automated methods
were used by both Le Nagard and Koehn (2010) and Hardmeier and Federico (2010), but it is
well documented that these methods do not achieve perfect accuracy. Indeed, Le Nagard and
Koehn (2010) cite the poor performance of their coreference resolution as a possible reason for
their lack of improvement in pronoun translation.
In this project, a manually annotated coreference corpus (the BBN Coreference and Entity Type
corpus) is used to identify coreferential pronouns and their antecedents. As the corpus has been
manually annotated, the coreference annotation is assumed to be highly accurate.
2.3 Czech Language
Czech is a member of the western group of Slavic languages. Like other Slavic languages it
is highly inflective, with seven cases and four grammatical genders: masculine animate (for
people and animals), masculine inanimate (for inanimate objects), feminine and neuter. In the
case of the feminine and neuter genders, animacy is not grammatically marked. Czech is a free
word-order language, in which word order reflects the information structure of the sentence
within the current discourse. In addition, Czech is a pro-drop language; an explicit subject
pronoun may be omitted if it may be inferred based on some other grammatical feature, for
example verb morphology.2
In contrast with Czech, English is neither a highly inflectional nor a pro-drop language. Furthermore, English follows a Subject-Verb-Object (SVO) word order and lacks grammatical gender.
2.4 Phrase-based Statistical Machine Translation
Phrase-based models are currently the best performing SMT models (Koehn, 2009). The concept behind these models is the decomposition of the translation problem into a number of
smaller word sequences, called phrases, which are translated one at a time in order to build
the complete translation. It is important to note that a phrase may be any sequence of words
2 Information provided by “The Czech Language”, an online guide: http://www.czech-language.cz
of arbitrary length and that there is no deep linguistic motivation behind the choice of segmentation. Phrase-based models have several advantages over word-based models in which
words are translated in isolation. Firstly, phrase-based models provide a simple solution to
the problem where a single word in the source language translates into multiple words in the
target language or vice versa. Secondly, translating phrases rather than single words can help
to resolve translation ambiguities. Finally, with phrase-based models, the notions of insertion
and deletion that are present in word-based models are no longer necessary, leading to a model
that is conceptually simpler.
The three components that make up a phrase-based model are the translation model, language
model and reordering model. The translation model takes the form of a phrase translation table
which provides a mapping between the source and target language phrases and the probabilities
associated with each mapping. The phrase translation table is learned by creating word alignments between the aligned sentence pairs of a parallel training corpus. The word alignments
are collected for both translation directions, the alignment points are merged and then those
phrases that are consistent with the word alignment are extracted. The probabilities that are assigned to each phrase mapping in the table are calculated by counting the number of (parallel)
sentence pairs a particular phrase pair appears in, and then computing the relative frequency of
this count compared with the count of the source phrase translating as any other phrase in the
target language.
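The relative-frequency estimation just described can be sketched in a few lines of Python. This is a toy illustration: the phrase pairs and the function name are invented, and a real phrase table also stores scores for the reverse translation direction and lexical weights.

```python
from collections import Counter

def estimate_phrase_table(extracted_pairs):
    """Estimate phrase translation probabilities by relative frequency.

    extracted_pairs: list of (english_phrase, czech_phrase) tuples, one per
    occurrence extracted from the word-aligned parallel corpus.
    Returns a dict mapping (e, c) -> p(c | e).
    """
    pair_counts = Counter(extracted_pairs)
    source_counts = Counter(e for e, _ in extracted_pairs)
    return {(e, c): n / source_counts[e] for (e, c), n in pair_counts.items()}

# Toy extracted phrase pairs (hypothetical):
pairs = [("the castle", "hrad"), ("the castle", "hrad"),
         ("the castle", "ten hrad"), ("he succeeds", "nastupuje po")]
table = estimate_phrase_table(pairs)
# p("hrad" | "the castle") = 2/3
```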
The language model ensures the fluency of the translations output by the model - providing a
means to score and hence identify the best output translation from a list of candidate translations. The language models used in SMT are typically n-gram language models which consist
of n-grams in the target language together with probabilities based on maximum likelihood
estimation. A language model is usually constructed from the target side of the parallel corpus
used in the training of the translation model, and may be augmented by additional in-domain
target data, or weighted with a separate out-of-domain language model. Smoothing is often applied to improve the reliability of the probability estimates, with modified Kneser-Ney
smoothing commonly used in SMT (Kneser and Ney, 1995).
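A minimal unsmoothed bigram model illustrates the maximum likelihood estimation described above, and also why smoothing is needed: any unseen bigram in a candidate drives its probability to zero. All names and example sentences here are invented.

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Maximum-likelihood bigram model: p(w2 | w1) = c(w1 w2) / c(w1)."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_prob(sentence, unigrams, bigrams):
    """Log-probability of a sentence; -inf if an unseen bigram occurs
    (which is why smoothing such as Kneser-Ney is applied in practice)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    lp = 0.0
    for w1, w2 in zip(tokens, tokens[1:]):
        if bigrams[(w1, w2)] == 0:
            return float("-inf")
        lp += math.log(bigrams[(w1, w2)] / unigrams[w1])
    return lp

uni, bi = train_bigram_lm(["hrad je starý", "hrad je nový"])
# p(je | hrad) = 1.0, p(starý | je) = 0.5
```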
The reordering model allows phrases in the source language to be taken out of sequence when
building the translation in the target language, thereby allowing phrase-level reordering. Allowing unlimited reordering can have a detrimental effect on translation quality, and so it is usual
for a penalty to be associated with any reordering that takes place. Penalties are assigned such
that a larger cost is associated with the movement of a phrase that skips more word positions,
than one that skips fewer word positions.
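The distance-based penalty can be sketched as follows. This is a common formulation with an illustrative decay parameter alpha; the exact penalty used by a given system may differ.

```python
def distortion_cost(prev_end, start, alpha=0.9):
    """Distance-based reordering penalty: the hypothesis score is multiplied
    by alpha raised to the number of source positions skipped, so phrases
    taken further out of sequence are penalised more heavily.

    prev_end: index of the last source word covered by the previous phrase;
    start: index of the first source word of the current phrase."""
    return alpha ** abs(start - prev_end - 1)

# Monotone translation (no skip) incurs no penalty:
distortion_cost(prev_end=2, start=3)   # 1.0
# Jumping ahead over two source words is penalised:
distortion_cost(prev_end=2, start=5)   # 0.9 ** 2
```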
In phrase-based SMT, these three models are combined as a linear model. The best translation
arg max_c p(c|e) is computed using Bayes’ Rule, which combines the three components of the
phrase-based model as in the equation below: the translation model φ(e|c), the language model
p_LM(c) and the reordering model Ω(e|c).

arg max_c p(c|e) = arg max_c φ(e|c) ∗ p_LM(c) ∗ Ω(e|c)

where ‘e’ is an English sentence and ‘c’ is the Czech translation of that sentence.
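The effect of the combination can be illustrated with hypothetical component probabilities. With equal weights, the log-linear score below reduces to the plain product in the equation; tuning (discussed next) would adjust the weights.

```python
import math

def model_score(tm, lm, rm, w_tm=1.0, w_lm=1.0, w_rm=1.0):
    """Combine the three component probabilities in log space.
    Equal weights reproduce the plain product of the linear model;
    MERT would tune w_tm, w_lm and w_rm on held-out data."""
    return w_tm * math.log(tm) + w_lm * math.log(lm) + w_rm * math.log(rm)

# Hypothetical candidate translations with (translation, language,
# reordering) model probabilities:
candidates = {
    "hrad je starý": (0.4, 0.3, 1.0),
    "starý je hrad": (0.4, 0.1, 0.8),
}
chosen = max(candidates, key=lambda c: model_score(*candidates[c]))
# chosen == "hrad je starý"
```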
Once the components of the phrase-based model have been constructed, their weights are tuned
to optimise the overall model performance. Tuning is carried out using a dataset that is kept
separate from the main training dataset for this specific purpose. Minimum Error Rate Training
(MERT) (Och, 2003) is a commonly used tuning technique in SMT. MERT tunes the model
weights to optimise performance as measured using BLEU scores calculated against one or
more reference translations. BLEU will be described in more detail in section 2.6.
In Machine Translation, the process of finding the best scoring translation according to the
model is referred to as decoding (Koehn, 2009). Using a phrase-based translation model, decoding is carried out by starting with a source sentence and building the translation from left to
right, extracting source phrases in any order. The phrases are translated into the target language
and then ‘stitched’ together to make a complete translation. The source words covered by each
phrase are then marked as translated and the process continues until all of the source words
have been covered. As there are many possible valid translations of a single source language
sentence, these variations must be captured. This is achieved using a search graph from which
the single best translation (or an N-best list) may be derived using a scoring method that uses a
language model and the phrase table probabilities.
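A heavily simplified sketch of this process, restricted to monotone (in-order) phrase coverage and omitting the language and reordering models, can be written as a dynamic programme over source positions. All phrase-table entries below are invented.

```python
import math

def decode_monotone(source, phrase_table, max_phrase_len=3):
    """Toy monotone phrase-based decoder (a simplification: real decoders
    also permit reordering and add language-model and reordering scores).

    phrase_table: dict mapping a source phrase (tuple of words) to a list
    of (target_phrase, log_prob) options.
    Returns (best_translation, best_log_prob)."""
    n = len(source)
    # best[i] holds (log_prob, partial_translation) covering source[0:i]
    best = [None] * (n + 1)
    best[0] = (0.0, "")
    for i in range(n):
        if best[i] is None:
            continue
        for j in range(i + 1, min(i + max_phrase_len, n) + 1):
            phrase = tuple(source[i:j])
            for target, lp in phrase_table.get(phrase, []):
                cand = (best[i][0] + lp, (best[i][1] + " " + target).strip())
                if best[j] is None or cand[0] > best[j][0]:
                    best[j] = cand
    if best[n] is None:
        return None, float("-inf")
    return best[n][1], best[n][0]

toy_table = {("the", "castle"): [("hrad", math.log(0.7))],
             ("the",): [("ten", math.log(0.4))],
             ("castle",): [("hrad", math.log(0.9))],
             ("is", "old"): [("je starý", math.log(0.6))]}
best_t, best_lp = decode_monotone("the castle is old".split(), toy_table)
# best_t == "hrad je starý"
```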
2.5 Moses
Moses (Hoang et al., 2007) is an open source SMT toolkit that provides automated training of
translation models and may be used with any language pair, given a parallel training corpus.
Moses may be used to construct both tree-based and phrase-based translation models but for
the purpose of this project only the phrase-based training was required.
The automated training process produces a phrase translation table and a lexicalised reordering
model. The language model is created separately using the target side of the parallel corpus
together with additional in-domain corpus data as required. The training process consists of a
number of steps which include data preparation, the creation of word alignments using Giza++
(Och and Ney, 2003), extraction and scoring of phrases and building the generation and lexicalised reordering models.3 The generation model contains probabilities for both directions of
translation.
During testing, in which a sentence or collection of sentences from the test corpus (which are
not also included in the training corpus) are translated, the Moses decoder constructs a search
graph and uses a beam search algorithm to select the translation with the highest probability
from that graph. The search graph is constructed using the process of hypothesis expansion.
Hypothesis combination and pruning are then employed to reduce the search space. In the
Moses implementation of beam search, hypotheses that cover the same number of foreign
words are compared and those with high cost (low probability) are pruned. The cost of each
hypothesis is calculated using a combination of the cost of translation and the estimated future
cost of translating the remaining source text for the current sentence. Whilst the decoder may
be used to output an N-Best list of translations for an input sentence, in this project only the
best translation is required and therefore only a single translation is requested from the decoder.
2.6 Evaluation in Machine Translation
Evaluation in Machine Translation typically falls into one of two categories: manual or automated. Whilst automated methods are used to ascertain improvements during the development
of a Machine Translation system, manual methods using either monolingual or bilingual human
judges are typically used to provide the final evaluation.
Currently there are no standard automated metrics available for the evaluation of pronoun translation in SMT. Hardmeier and Federico (2010) developed their own
bespoke automated metric incorporating precision and recall measured against a single reference translation. In contrast, Le Nagard and Koehn (2010) relied on manually counting the
number of correctly translated pronouns in their system output. Manual evaluation of the results is slow and therefore not a practical solution for large volumes of text. Furthermore, for
a monolingual SMT system developer, manual evaluation must be outsourced to a third party,
adding an additional hindrance to the development process.
In this project, the Czech translations output by the phrase-based SMT system were evaluated
using a combination of manual and automated methods. The manual methods used focussed on
human judgements as to whether pronouns in the Machine Translation output were correctly
used or dropped and if they were incorrectly used, whether a native Czech speaker would be
able to understand the meaning of the sentence as a whole. BLEU, an automated metric widely
used in the evaluation of SMT systems was used during system development as a preliminary
3 A full description of the Moses translation system training process can be found at: http://www.statmt.org/moses/
check to confirm that the system output was valid Czech, before a more detailed automated
analysis of the results was conducted. The evaluation methods used in this project are discussed
in more detail in chapter 4.
2.6.1 Automated Evaluation
BLEU (Papineni et al., 2002) is an automated evaluation metric widely used in SMT to assess
the overall quality of the output translations. It provides an efficient and low cost alternative to
human judgements during iterations of development cycles to measure system improvement. It
computes a document-level score of the translated output against a single reference translation
or a set of reference translations (Koehn, 2009). The BLEU score is based on a combination of
n-gram precision and a brevity penalty.
BLEU = BP ∗ exp( ∑_{n=1}^{N} w_n log p_n )
The n-gram precision p_n is the ratio of the number of n-grams of order n in the output translation that are present in the reference translation to the total number of n-grams of order n in
the output translation, and the w_n are positive weights that sum to one. The brevity penalty (BP)
ensures that the length of the output translation is not too short, as compared with the length of
the reference translation. The effect of the brevity penalty is that the BLEU score is reduced if
the output translation is shorter than the reference translation, i.e. where words are dropped in
the output translation. The BLEU score is applied at the document level in order to allow some
freedom in translation output length at the sentence level, for example where a single source
sentence may be translated into two sentences in the target language, or vice versa.
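A simplified single-reference BLEU can be sketched as follows (uniform weights, no smoothing; as noted above, real implementations aggregate n-gram counts over the whole document rather than scoring one sentence at a time).

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Simplified BLEU against a single reference: clipped n-gram
    precisions with uniform weights 1/max_n, times the brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n])
                              for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n])
                             for i in range(len(ref) - n + 1))
        # Clip each n-gram count by its count in the reference.
        clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        if clipped == 0:
            return 0.0  # an unsmoothed zero precision zeroes the score
        log_prec_sum += math.log(clipped / total) / max_n
    # Brevity penalty: punish candidates shorter than the reference.
    if len(cand) > len(ref):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec_sum)
```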
BLEU has been widely criticised (Koehn, 2009), yet remains one of the most popular automated evaluation metrics in use with SMT systems due to its high correlation with human
judgements of quality (Papineni et al., 2002).
With respect to the specific problem of pronoun translation evaluation in Czech, two further
criticisms apply. Firstly, as the sole focus of this project is pronoun translation, only a small
number of words are expected to change between the translations produced by the baseline
and annotated translation systems. Therefore, the variation in BLEU score is expected to be
very small. Observations regarding the shortcomings of BLEU in relation to the evaluation
of pronoun translation have been made previously by both Le Nagard and Koehn (2010) and
Hardmeier and Federico (2010). Secondly, Czech is a highly inflective language with four
genders and seven cases, so with only a single reference translation provided in the PCEDT
2.0 corpus it is not reasonable to evaluate the output of the translation systems using a recall-based method. Bojar and Kos (2010) are critical of the use of BLEU scores in the evaluation of
English-Czech SMT, claiming that BLEU scores correlate poorly with human judgements. It
is for these reasons that BLEU was not used in the evaluation of the systems developed as part
of this project.
2.6.2 Manual Evaluation
The manual evaluation of Machine Translation output can be rather complex. Human judges
are typically required to rate a single target language text on a five-point scale, or to rank several target language texts, based on fluency (whether the output reads as natural, grammatical text) and adequacy (whether
the meaning of the source language text has been captured) (Koehn, 2009). Evaluation based
on fluency and adequacy judgements suffers from a number of problems. Firstly, it can be slow
and unreliable (Callison-Burch et al., 2008). Secondly, the scores assigned by human judges in
the measurement of fluency and adequacy are often very close, suggesting that the judges may
find it difficult to make a clear distinction between the two criteria. Thirdly, there are concerns
that without explicit instructions, many human judges develop their own rules or misinterpret
the intended use of an absolute scale and instead score the output of multiple systems relative to
one another (Callison-Burch et al., 2007). Finally, manual evaluation using such criteria tends
to be subjective, which can lead to poor agreement between a group of human judges.
Again, these manual methods tend to focus on sentences as a whole and are therefore not
wholly applicable to the more specific problem of evaluating pronoun translation.
2.7 Chapter Summary
This chapter introduced the concepts of anaphora and coreference resolution and provided an
introduction to phrase-based SMT, the Moses toolkit and the methods currently used in the
evaluation of Machine Translation output. In particular, the various issues associated with
automated and manual evaluation methods were highlighted with respect to their application to
the more specific problem of evaluating pronoun translation. The next chapter will introduce
the manually annotated corpora used in this project.
Chapter 3
Data
In the development of the annotation and translation process a number of manually annotated
corpora in both English and Czech are used: the BBN Pronoun Coreference and Entity Type
corpus for the English (source) side of the parallel corpus and the identification of coreferential
pronouns and their antecedents, and the PCEDT 2.0 corpus for the Czech (target) side of the
parallel corpus. Each corpus contains text or a translation of the original text taken from a
subset of the Wall Street Journal (WSJ). It is the provision of these manually annotated corpora that allowed the project to focus solely on the translation problem without the need for
automated methods for coreference or anaphora resolution.
In addition, the annotation of the WSJ files within the Penn Treebank 3.0 corpus is used to
identify a single antecedent head word in the case where the antecedent extracted from the
BBN Pronoun Coreference and Entity Type corpus spans multiple words. This is particularly
important as in order to extract the number and gender of a Czech word it is necessary to first
identify the head of the English antecedent.
The corpora are described in detail in the following sections.
3.1 BBN Pronoun Coreference and Entity Type Corpus
The BBN Pronoun Coreference and Entity Type corpus (Weischedel and Brunstein, 2005) provides annotations of the WSJ file texts with pronoun coreference and entity types together with
the raw English text. For the purpose of this project, two files from the corpus are used: the
WSJ.sent file that contains the raw English sentences and the WSJ.pron pronoun coreference
file that contains a list of coreferential pronouns together with their antecedents. In the pronoun
coreference file, coreferential pronouns and their antecedents are indexed using sentence and
word token numbers.
The WSJ.sent file has the format:
(WSJ0005
S1: J.P. Bolduc , vice chairman of W.R. Grace & Co. , which ...
S2: He succeeds Terrence D. Daniels , formerly a W.R. Grace ...
S3: W.R. Grace holds three of Grace Energy ’s seven board seats .
)
For each file in the corpus collection, the sentences are numbered and listed in the order in
which they appear in the text.
The WSJ.pron file has the format:
(WSJ0005
(
Antecedent -> S1:1-2 -> J.P. Bolduc
Pronoun -> S2:1-1 -> He
)
For each WSJ file in the collection, each antecedent and the pronouns that refer to it are listed,
together with the number of the sentence in which they appear and the start and end positions
of the word(s) within the sentence.
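A sketch of how such a block might be parsed into antecedent-pronoun pairs follows. This is a hypothetical reimplementation, not the project's actual extraction code; the regular expression assumes exactly the layout shown above, and each pronoun is paired with the most recent antecedent line above it (so one antecedent may yield several pairs).

```python
import re

LINE_RE = re.compile(
    r"(Antecedent|Pronoun)\s*->\s*S(\d+):(\d+)-(\d+)\s*->\s*(.+)")

def parse_pron_block(lines):
    """Return [(antecedent, pronoun), ...] where each mention is a tuple
    (sentence_number, start_token, end_token, text)."""
    pairs, antecedent = [], None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip file markers and parentheses
        kind, sent, start, end, text = m.groups()
        mention = (int(sent), int(start), int(end), text.strip())
        if kind == "Antecedent":
            antecedent = mention
        elif antecedent is not None:
            pairs.append((antecedent, mention))
    return pairs

block = ["(WSJ0005", "(",
         "Antecedent -> S1:1-2 -> J.P. Bolduc",
         "Pronoun -> S2:1-1 -> He", ")"]
# parse_pron_block(block) pairs "He" with "J.P. Bolduc"
```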
It was initially envisaged that the OntoNotes 3.0 corpus (Weischedel et al., 2009) would be used
to identify coreferential pronouns and their antecedents. However, the annotation in the BBN
Coreference and Entity Type corpus allows for a simpler method of identification and extraction
than the OntoNotes 3.0 corpus. The OntoNotes 3.0 corpus was therefore left as an alternative source
of coreference information. Due to differences in the choice of which types of coreference are
annotated in these corpora, the use of the OntoNotes 3.0 corpus as an alternative or additional
source of coreference information would allow for an investigation into the translation of ‘it’,
‘this’ and ‘that’ marked as event coreference.
3.2 Penn Treebank 3.0 Corpus
The Penn Treebank 3.0 corpus contains manually annotated parse trees of the sentences within
the WSJ corpus. The merged files within the corpus contain both parse and part of speech
annotation and as such may be used to identify Noun Phrases (NPs) and through the use of
simple rules, the head of an NP.
The corpus contains separate merged files for each WSJ file. Within each file, a parse is provided for each sentence, with part of speech tags provided for each word or token.
These sentence level parses have the format:
( (S
(NP-SBJ-1 (DT The) (NNP U.S.) )
(, ,)
(S-ADV
(NP-SBJ (-NONE- *-1) )
(VP (VBG claiming)
(NP
(NP (DT some) (NN success) )
(PP-LOC (IN in)
(NP (PRP its) (NN trade) (NN diplomacy) )))))
“The U.S. claiming some success in its trade diplomacy...”
In the case that the BBN Coreference and Entity Type corpus identified “The U.S.” as the
antecedent of the pronoun “its”, the NP “(NP-SBJ-1 (DT The) (NNP U.S.) )” is extracted from
the sentence level parse. The rightmost noun of the NP (“U.S.”) is then extracted as the head
of the NP.
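This "rightmost noun" rule can be sketched as follows. The function is a hypothetical reimplementation: a regex pulls out the (tag word) leaves of the flat NP string and the last noun-tagged word (NN, NNS, NNP, NNPS) is returned as the head.

```python
import re

def np_head(np_parse):
    """Return the rightmost noun token of a flat NP parse as its head,
    or None if the NP contains no noun leaf."""
    # Match leaves of the form (TAG word); tags are upper-case, possibly
    # with a trailing $ (e.g. PRP$).
    leaves = re.findall(r"\((NN\w*|[A-Z$]+)\s+([^()\s]+)\)", np_parse)
    nouns = [word for tag, word in leaves if tag.startswith("NN")]
    return nouns[-1] if nouns else None

np_head("(NP-SBJ-1 (DT The) (NNP U.S.) )")  # "U.S."
```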
3.3 PCEDT 2.0 Corpus
The Prague Czech-English Dependency Treebank (PCEDT 2.0) corpus1 is a collection of
English-Czech parallel resources suitable for use in SMT experiments. It contains a subset of
the Wall Street Journal corpus in English with a close Czech translation (created manually)
that has been manually annotated with deep syntactical (tectogrammatical) and morphological
information. These Czech translations form the Czech side of the parallel corpus included in
both the training and testing sets.
The PCEDT 2.0 corpus data is split into a number of XML format files corresponding to the
three layers of annotation that exist for each WSJ file in the corpus collection. These layers
are the morphological layer (m-layer), the analytical layer (a-layer) and the tectogrammatical layer (t-layer). The corpus also contains the word layer (w-layer), an un-annotated, tokenised copy of
the text which is segmented into WSJ files and paragraphs. The organisation and interconnection of these layers is shown in figure 3.1.2 The annotation standard of these layers follows that
of the Prague Dependency Treebank 2.0 (Hajič et al., 2006).
1 Version 2.0 of the PCEDT corpus is not yet publicly available, but is an extension of the PCEDT 1.0 corpus:
http://ufal.mff.cuni.cz/pcedt/
2 Image taken from the documentation of the Prague Dependency Treebank 2.0 corpus:
http://ufal.mff.cuni.cz/pdt2.0/
Figure 3.1: Diagram showing the annotation layers of the PCEDT 2.0 corpus
The m-layer forms the lowest level of annotation. In this layer, the tokens in the w-layer are
divided into sentences and annotated with morphological lemma, tag and ID attributes. The
tag attribute is a 15 character string, representing the token’s part of speech and a number of
morphological properties, including number and gender. The ID attribute provides a unique
identifier which is used to link back to the w-layer.
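For illustration, the relevant positions of such a tag can be read off directly. The positions used here (part of speech in position 1, gender in position 3, number in position 4, case in position 5) follow the PDT 2.0 positional tagset as recalled from its documentation, and the example tag is illustrative.

```python
def parse_pdt_tag(tag):
    """Read part of speech, gender, number and case from a 15-character
    PDT positional tag. Gender values include M (masculine animate),
    I (masculine inanimate), F (feminine) and N (neuter); number is
    S (singular) or P (plural)."""
    assert len(tag) == 15, "PDT positional tags are 15 characters long"
    return {"pos": tag[0], "gender": tag[2], "number": tag[3], "case": tag[4]}

# "hrad" (castle) tagged as a masculine inanimate singular nominative noun:
parse_pdt_tag("NNIS1-----A----")
# {'pos': 'N', 'gender': 'I', 'number': 'S', 'case': '1'}
```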
The a-layer forms the middle level of annotation with sentences from the m-layer represented as
trees with labelled nodes and edges. In this layer, there is a one-to-one mapping between each
token and its corresponding token in the m-layer, with an edge between the nodes that represent
the tokens. Each node in the a-layer has six attributes including an ID attribute and those
attributes representing surface syntactic information including coordination and apposition.
The “m.rf” attribute links an a-layer node to the corresponding node in the m-layer.
The t-layer forms the highest level of annotation with sentences represented as trees which
reflect the deep linguistic structure of the sentence. Unlike the a-layer in which each node
has a one-to-one mapping with a corresponding morphological token in the m-layer, at the
t-layer, not all of the morphological tokens are represented (for example nodes representing
prepositions are dropped). Also, additional nodes may be added at this level, for example to
represent an omitted subject where subject pro-drop has occurred. The t-layer contains 39
attributes for every node, including attributes representing deep structure properties and those
used for the purpose of linking back to the a-layer.
In addition, a list of the nodes in the PCEDT 2.0 corpus, together with the corresponding Czech
word and aligned English word, was used. This “PCEDT 2.0 alignment file” was composed
using a method that combines Giza++ alignments extracted from the PCEDT 2.0 corpus with the extracted t-layer nodes for each of the aligned words. This list of nodes forms the word alignment
between the Czech side of the PCEDT 2.0 corpus and the English BBN Pronoun Coreference
and Entity Type corpus. It should be noted that this alignment information is separate from that
produced by Giza++ as part of the training of the phrase-based SMT systems.
3.4 Chapter Summary
This chapter introduced the three manually annotated corpora used in this project, described the
structure of the data and highlighted the specific information that is provided by each corpus.
The next chapter describes in detail the approach taken in the development of the baseline
and annotation and translation systems and the automated and manual methods used in the
evaluation of these systems.
Chapter 4
Methodology
4.1 Overview
This project follows a similar method to that used by Le Nagard and Koehn (2010) whereby
the annotation of pronouns in the source language text is applied prior to translation, leaving
the translation process unaffected.
The annotation of the (English) source language text and its subsequent translation (into Czech)
is achieved via a two-step process (see figure 4.1) that makes use of two phrase-based translation systems. The first, hereafter referred to as the Baseline system, is trained using un-annotated English and Czech sentence-aligned parallel training data taken from the PCEDT 2.0
and BBN Coreference and Entity Type corpora. The second system, hereafter referred to as
the Annotated system, is trained using the same parallel training data, in which the pronouns in
the English text are annotated with the number and gender of a valid Czech translation of the
original English antecedent head noun. This
alignment of English and Czech words is obtained from the PCEDT 2.0 alignment file that
was provided in addition to the corpus. The Baseline system serves a dual purpose; as well as
its incorporation within the two-step translation process, it also serves as the baseline against
which the translations output by the Annotated system are compared.
In addition to the translation systems, an annotation process is required. This process is used
to take an English text file, identify those pronouns that are coreferential and their antecedents
and annotate the pronouns with the number and gender of the Czech word that the English
antecedent translates to. The coreferential pronouns and their antecedents are extracted from
the BBN Coreference and Entity Type corpus and the Czech translation of the English antecedent is obtained from the translation output of the Baseline system. In using the Czech
translation of the English antecedent from the Baseline system translation output, a simplification is introduced.

Figure 4.1: Diagram showing the two-step annotation and translation process. The original English text is translated by the Baseline system; coreferential English pronouns and their antecedents are identified (BBN corpus), the antecedent head noun is extracted (Penn Treebank), the Czech translation of the head noun is identified and its number and gender extracted (PCEDT), and the English pronouns are annotated with the Czech number and gender before the annotated English text is translated by the Annotated system.

Whilst the pronoun and its antecedent may occur in the same sentence, in
many cases the antecedent will appear in a previous sentence. Therefore, in order to identify
the translation of many of the antecedents it is necessary to translate the previous sentence(s)
before translating the current sentence. Rather than translating the text sentence by sentence,
the complete source language text is translated using the Baseline system (as a block) and the
Czech translations of the English antecedents are extracted from this output. This mirrors the
solution used by Le Nagard and Koehn (2010) and provides a simplification of the problem
of obtaining the Czech translation prior to annotation. Another option would be to translate
sentence by sentence but this would make no difference to the final outcome as the output of
the Baseline system remains the same irrespective of the method employed (at least within a
two-step process).
The original English text is annotated such that all coreferential pronouns for which a Czech
translation of the antecedent is found are marked with the number and gender of that Czech
word. The output of the annotation process is thus the same English text that was input to
the Baseline system, with the addition of the annotation of the coreferential pronouns. This
annotated English text is then translated using the Annotated translation system, the output of
which is the final translation of the complete annotation and translation process.
The two main differences between the implementation of this project and that by Le Nagard
and Koehn (2010) lie in the translation language pair and the methods used in the extraction
of coreference information and morphological properties of the target translations of the antecedents. Le Nagard and Koehn (2010) use the English-French language pair in their work
and use only the gender of the antecedents in the annotation of the English pronouns. They
omit number from the annotation on the basis that singular English pronouns rarely translate
in French as plural pronouns and that incorporating both number and gender in the annotation
would introduce further segmentation of the training data. In Czech, both number and gender
are important in determining the syntactic form of many pronouns. For example, the pronoun
“je” is ambiguous in Czech and may be used as both neuter singular and as plural with any
gender. Moreover, the syntactic form of possessive reflexive pronouns is dependent not only
on the gender of the object(s) in possession, but also on the number of objects. Whilst the
issue of increased segmentation of the training data (as a result of including both number and
gender in the annotation of the English pronouns) is acknowledged, if the aim is to improve
the translation of pronouns, both number and gender are necessary in Czech. Hardmeier and
Federico (2010) also annotate their pronouns using both number and gender in the translation
of the English-German language pair.
The second main difference is that in this project, the identification of coreferential pronouns
and their antecedents and the morphological properties of words in the output of the Baseline
system are achieved using manually annotated corpora, which are deemed to be highly accurate. In contrast, Le Nagard and Koehn used automated methods to extract this information and
as such introduced additional sources of potential error into their process.
Another possible approach would be to implement a system using a similar method to that used
by Hardmeier and Federico (2010), whereby the source language text is translated sentence by
sentence using a single-step process. The advantage of this approach is that if a pronoun’s
antecedent appears in an earlier sentence, which will often be the case, then the translation
of the antecedent will already be known by the time that the sentence in which the pronoun
appears is considered for translation. The same does not hold, however, when the pronoun and
its antecedent appear in the same sentence as the translation of the antecedent is not yet known.
The two-step process used in this project and by Le Nagard and Koehn (2010) provides a simple
solution to the issue of obtaining the Czech translation of the English antecedent head. It is,
however, acknowledged that the single-step translation system implemented by Hardmeier and
Federico (2010) represents a more elegant solution to the problem. That is not to say that the
solution presented by Hardmeier and Federico is perfect, but it does have a major advantage
over the two-step method in that it is, rather obviously, more efficient to translate the text only
once.
Given the relatively short time-scale of this project, the simpler two-step translation process,
incorporating the translation of texts as a complete block, was selected in preference to a single-step translation process. As it is only the pronouns that are expected to change between the translation output of the Baseline and Annotated translation systems, this method is deemed to be
a satisfactory alternative to the single-step method, the issue of efficiency notwithstanding.
Problems arising from the use of a two-step process with respect to building the Baseline and
Annotated systems are discussed in section 4.6.2.
Figure 4.2: Overview of the Annotation Process
The annotation process is shown in figure 4.2. In this simple two sentence translation example,
the second sentence contains a coreferential instance of the personal pronoun “it”, which refers
to “castle” in the first sentence. In the first step of the process, the coreferential pronoun (“it”)
is identified, before its antecedent head noun (“castle”) is identified in the second step. The
Czech translation of the antecedent head noun (“Hrad”; Czech for “castle”) is then obtained
from the translation of the previous sentence in step 3 and the number and gender of the Czech
word are extracted in step 4. In the final step, the pronoun is annotated in the English sentence,
Table 4.1: Pronouns

                                  Singular                    Plural
3rd Person Personal               she, her, he, him, it       they, them
Reflexive                         himself, herself, itself    themselves
Possessive (preceding a noun)     his, her, its               their
Possessive (used alone)           his, hers                   theirs
before being submitted to the Annotated translation system. The 3rd person personal pronouns
for which annotation is applied are shown in table 4.1.
The demonstrative pronouns “this”, “these”, “that” and “those” are not marked as coreferential in the BBN Coreference and Entity Type corpus and are therefore excluded. Additionally,
non-referential (pleonastic) pronouns have been excluded from the annotation process and the
accuracy of their translations is not assessed as it falls outside the scope of this project. The translation of these pronouns is therefore expected to be the same in both the Baseline and
Annotated systems. Whilst the 3rd Person Personal Pronouns “he”, “she”, “him” and “her” are
unambiguous, they were included in the annotation in order to highlight instances of subject
pro-drop. As discussed in chapter 1, one of the main reasons for selecting Czech as the second language in the translation language pair was because it is a subject pro-drop language.
Despite the lack of explicit handling of subject pro-drop scenarios in this project, the translation system’s ability to handle this phenomenon was of interest. These pronouns are therefore
annotated in order to assess the extent to which the translation systems are able to ‘learn’ scenarios in which the subject pronoun may be dropped without the use of additional contextual
information. It is assumed that as these pronouns are unambiguous, their annotation will not
serve to further fragment the training data. Provided that the correct antecedent head noun is
identified, these pronouns should always be labelled as singular and with the correct gender.
The performance of the systems was evaluated in terms of an automated evaluation of the
pronoun translation and a manual evaluation by a native Czech speaker.
4.2 Assumptions
A number of simplifying assumptions are asserted with respect to the manually annotated corpus resources:
1. That the coreference resolution in the manually annotated BBN Coreference and Entity
Type corpus is “perfect”
2. That the annotation of morphological properties of Czech words in the PCEDT 2.0 corpus is “perfect”
3. That the PCEDT 2.0 alignment file contains a “perfect” alignment of English words and
their Czech translation
4. That the annotation of NPs in the Penn Treebank 3.0 corpus is “perfect”
In this case “perfect” is deemed to be the best possible annotation of the corpora, or alignment
in the case of the PCEDT 2.0 alignment file. This assumption is made as the corpora have
been manually annotated, ensuring a high degree of accuracy. It is acknowledged that these
assumptions are unrealistic, but they are made in order to define the boundaries of what is
achievable given the resources available. This is in contrast to the lower level of accuracy that
is expected from the use of automated tools to achieve coreference resolution in the source
language, and the extraction of morphological properties of words in the target language.
4.3 Datasets
The data set used to train the translation systems and the testing data sets used to test the
systems were compiled from the English and Czech translations contained in the PCEDT 2.0
and BBN Coreference and Entity Type corpora. The data sets were constructed so as to allocate
as much data to the training set as possible, whilst leaving a small portion of at least 1,500
sentences for testing. As contextual information is necessary in the annotation of pronouns
and the analysis of the output in testing, it was necessary to ensure that for each WSJ file, the
complete set of sentences was allocated to either the training or testing set.
The allocation of files to the testing set was achieved via random selection, with the exception
of the hand selection of five files that formed the Development test set. These files were selected
due to greater familiarity with their text and the annotation in the PCEDT 2.0 corpus, making
analysis and manual evaluation of the translation system output easier. This set was intended to
be used in the manual analysis of progress at each stage of the development of the annotation
and translation processes.
The training set was constructed using the remainder of the parallel English-Czech WSJ
files available in the PCEDT 2.0 corpus. It excludes duplicate sentences1 and those already
present in the test set as well as sentences longer than 100 words (in either English or Czech)
as recommended for the Moses training process.
1 Duplicate sentences occur in several places in the Wall Street Journal corpus. For example, in weekly summaries of interest and exchange rates, where the same text regularly appears at the start and/or end of the column.
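The file-level allocation and sentence filtering described above can be sketched as follows. This is an illustrative reconstruction rather than the project's actual script: the function name `split_by_file`, its parameters, and the representation of the corpus as a dict of file id to sentence pairs are all invented for the example.

```python
import random

def split_by_file(wsj_files, dev_files, test_target=1500, seed=0):
    """Allocate complete WSJ files to train/test so no document is split.

    wsj_files: dict mapping file id -> list of (english, czech) sentence pairs.
    dev_files: file ids hand-picked for the Development test set.
    """
    rng = random.Random(seed)
    remaining = [f for f in wsj_files if f not in dev_files]
    rng.shuffle(remaining)

    # Randomly move whole files into the test set until it is large enough.
    test, test_size = [], sum(len(wsj_files[f]) for f in dev_files)
    while remaining and test_size < test_target:
        f = remaining.pop()
        test.append(f)
        test_size += len(wsj_files[f])

    # Remaining files form the training set; drop duplicate pairs and
    # sentences longer than 100 words, per the Moses recommendation.
    train, seen = [], set()
    for f in remaining:
        for en, cs in wsj_files[f]:
            if (en, cs) in seen:
                continue
            if len(en.split()) > 100 or len(cs.split()) > 100:
                continue
            seen.add((en, cs))
            train.append((en, cs))
    return train, list(dev_files) + test
```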
Table 4.2: Datasets

                         Parallel Sentences    Czech Words    English Words
Training Set             47,549                955,018        1,024,438
Weight Tuning Set        500                   9,342          10,265
Final Test File          540                   10,110         11,907
Development Test File    280                   5,467          6,114
Table 4.3: Language Model

                         Sentences    Czech Words
Total Combined Corpus    2,295,172    34,474,301
An additional data set, the “Weight Tuning Set” was set aside for the sole purpose of tuning
the weights of the translation systems. This process will be described in more detail in section
4.6.2. Details of all three data sets are provided in table 4.2.
The Language Model corpus was constructed using a combination of the target side of the
parallel training corpus (including those sentences that were removed to comply with Moses
training requirements) and the Czech monolingual 2010 and 2011 News Crawl corpora2. Following the removal of all duplicate sentences, the three corpora were combined to form a single
language model corpus, from which the language model was constructed. This was possible
as all three corpora are taken from the same ‘Newswire’ domain. Another solution would have
been to construct separate language models from the different corpora, had they originated
from different domains. Details of the language model corpus are given in table 4.3.
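A minimal sketch of this combine-and-deduplicate step, assuming each corpus is already available as a list of Czech sentences (`build_lm_corpus` is a hypothetical name, not the project's actual script):

```python
def build_lm_corpus(corpora):
    """Combine monolingual corpora into one LM corpus, dropping duplicate
    sentences; combining is reasonable here because all three corpora come
    from the same newswire domain."""
    seen = set()
    combined = []
    for corpus in corpora:
        for sentence in corpus:
            if sentence not in seen:
                seen.add(sentence)
                combined.append(sentence)
    return combined
```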
4.4 Constructing the Language Model
The language model used for the purpose of scoring translations during the decoding process
in both the Baseline and Annotated systems was a 3-gram model, constructed from the Czech
monolingual language model corpus described in section 4.3. The language model was constructed using the SRILM toolkit (Stolcke, 2002) with interpolated Kneser-Ney discounting
(Kneser and Ney, 1995) applied.
2 Provided for the Sixth EMNLP Workshop on Statistical Machine Translation: http://www.statmt.org/wmt11/
4.5 Combining the Corpora
The first step of the project was to develop a method for identifying coreferential pronouns in
the English text, their antecedent (in English) and the antecedent’s translation in Czech. The
method for the identification of coreferential pronouns and their antecedents in the English text
is common to the training and testing tasks. However, the method used for the identification of
the Czech translation of the English antecedent differs between these tasks. In the annotation
of the training data used to build the Annotated translation system, the Czech translation of
the antecedent is simply obtained from the alignment provided in the PCEDT 2.0 alignment
file. This file has the added advantage of containing the t-layer nodes of the Czech words, via
which the number and gender may be extracted from the corresponding m-layer node. During
testing it is necessary to obtain the Czech translation of the English antecedent as output by
the translation system and use the number and gender of that word to annotate the English
pronoun.
The implementation focussed initially on combining information from the source language
BBN Coreference and Entity Type and Penn Treebank 3.0 corpora. The BBN Coreference
and Entity Type corpus was used to identify coreferential pronouns and their antecedents. The
Penn Treebank 3.0 corpus was then used to extract the head noun of the antecedent from those
antecedents which spanned several words. It is necessary to extract the antecedent head noun
as in the annotation of English pronouns with the number and gender of their antecedent, the
morphological properties must be derived from a single Czech word (per antecedent).
4.5.1 Identification of Coreferential Pronouns and their Antecedents
The identification of coreferential pronouns is achieved by reading the WSJ.pron file provided
as part of the BBN Coreference and Entity Type corpus and described in section 3.3. As this
file provides the WSJ file name, sentence number and sentence internal word positions of the
pronouns and their antecedent(s) the extraction of this information is relatively simple. The
word position information is later used in the mapping of the English antecedent head noun to
its Czech translation via the PCEDT 2.0 alignment file in order to extract the morphological
properties with which to annotate the English pronoun.
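As an illustration only, a reader for such a record might look like the following. The tab-separated field layout shown here is an assumption made for the example; the real WSJ.pron format differs in its details.

```python
def parse_pron_line(line):
    """Parse one (hypothetical) WSJ.pron record.

    Assumed layout, tab-separated:
        file-id  pron-sent  pron-pos  ante-sent  ante-start  ante-end
    The genuine BBN file encodes the same information differently.
    """
    fid, ps, pp, asent, a0, a1 = line.rstrip("\n").split("\t")
    return {
        "file": fid,
        # (sentence number, word position) of the pronoun
        "pronoun": (int(ps), int(pp)),
        # (sentence number, start word, end word) of the antecedent span
        "antecedent": (int(asent), int(a0), int(a1)),
    }
```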
It should be noted that through the use of the BBN Coreference and Entity Type corpus to
identify coreferential pronouns, the misidentification of (non-referential) pleonastic pronouns
as coreferential does not arise. For example, consider the case of the pronoun “it” in the sentence “It is raining.”. Here, “it” does not refer to an entity or event and would therefore not be
marked as coreferential in the BBN Coreference and Entity Type corpus. The misidentification
4.5. Combining the Corpora
31
of such pronouns can, however, cause problems for coreference resolution systems.
4.5.2 Extraction of the Antecedent Head Noun
The identification of coreferential pronouns and the extraction of their antecedent(s) from the
BBN Coreference and Entity Type corpus is straightforward due to the simple structure of the
WSJ.pron file. However, the extraction of the head noun from antecedents that consist of more
than a single word is more complex. Whilst it is possible to use part of speech taggers to tag
the words in the antecedent string and derive linguistically motivated rules to identify the head
noun, the provision of annotated parse trees for the WSJ sentences in the Penn Treebank 3.0 corpus
provided a more robust means of extracting this information.
The extraction of the head noun from the antecedent NP is achieved by overlaying the antecedent obtained from the BBN Coreference and Entity Type corpus with the NPs annotated
in the merged files of the Penn Treebank 3.0 corpus to obtain a match. Due to differences in
annotation between the two corpora, it is often the case that the antecedent does not exactly
match with a complete NP in the Penn Treebank 3.0 corpus. Where this is the case, the closest
partial match is obtained, ensuring that the word identified as the head noun in the NP annotation in the Penn Treebank 3.0 corpus is also present in the antecedent. Where an antecedent
matches a nested NP in the Penn Treebank 3.0 corpus, the rightmost noun of the leftmost NP
(in the nested construction) is extracted. It is this that provides the robustness over the previously mentioned alternative method and is particularly effective in the extraction of the head
noun in appositive constructions.
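The nested-NP heuristic (rightmost noun of the leftmost NP) can be sketched over a simple tuple-based tree. The representation below is invented for illustration and does not match the Penn Treebank file format; subtrees are `(label, children)` pairs and leaves are `(pos_tag, word)` pairs.

```python
def head_noun(np):
    """Extract the head noun of an NP subtree, following the heuristic in
    the text: in a nested NP construction, recurse into the leftmost NP;
    within a flat NP, take the rightmost noun (NN, NNS, NNP, ...)."""
    label, children = np
    # Recurse into the leftmost nested NP, if there is one.
    for child in children:
        if child[0] == "NP":
            return head_noun(child)
    # Otherwise take the rightmost noun among the leaves.
    for tag, word in reversed(children):
        if tag.startswith("NN"):
            return word
    return None
```

On an appositive such as "(NP (NP the president) , (NP John Smith))", the recursion into the leftmost NP yields "president", which is the behaviour described above.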
4.5.3 Extraction of Morphological Properties from the PCEDT 2.0 Corpus
Whilst different strategies are used to obtain the morphological properties of a Czech word
corresponding to the English antecedent head noun in the annotation of the English pronouns
in the training data (section 4.6.3) and as part of the annotation and training process (section
4.7), the objective is the same. That is, the number and gender of the Czech word must be
obtained from the m-layer of the PCEDT 2.0 corpus.
As described in chapter 3, the m-layer contains a tag attribute which consists of a string of
15 characters that represent various morphological properties of the Czech word, including its
number and gender. An investigation of the annotation of the nouns identified as the Czech
translations of the English antecedent head nouns in the training data revealed:
Five genders: masculine animate, masculine inanimate, feminine, neuter and “any”
Three numbers: singular, plural and “any”
The use of “any” in the annotation of gender denotes a Czech word that may take any gender.
Similarly, the use of “any” in the annotation of number denotes a Czech word that may be either
singular or plural. This introduction of an additional category for both number and gender
brings about a further segmentation of the annotated training data. Identifying a solution to this
problem has been left as future work.
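The positional tag can be read with a few character lookups. In the PDT tag set, the part of speech occupies position 1, gender position 3 and number position 4. The value tables below are abbreviated to the categories listed above, and the short names are assumptions modelled on the annotation format used in this project (only "mascin" and "pl" are attested in the text).

```python
# 0-indexed positions within the 15-character PDT positional tag.
POS, GENDER, NUMBER = 0, 2, 3

# Short names assumed for this sketch; only 'mascin' and 'pl' appear
# verbatim in the annotation examples above.
GENDER_NAMES = {"M": "mascan", "I": "mascin", "F": "fem", "N": "neut", "X": "any"}
NUMBER_NAMES = {"S": "sg", "P": "pl", "X": "any"}

def number_and_gender(tag):
    """Read number and gender from an m-layer positional tag, returning
    None for anything not annotated as a noun (POS character 'N')."""
    if len(tag) != 15 or tag[POS] != "N":
        return None
    return NUMBER_NAMES.get(tag[NUMBER]), GENDER_NAMES.get(tag[GENDER])
```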
Once extracted, the number and gender of the Czech word is used to annotate the English
pronoun in the format Pronoun.gender.number. For example, in the following text:
the u.s. , claiming some success in its.mascin.pl trade diplomacy , removed south korea ,
taiwan and saudi arabia from a list of countries it.mascin.pl is closely watching for allegedly
failing to honor u.s. patents , copyrights and other intellectual-property rights
The English pronouns “its” and “it” both refer to “u.s.”, which in the case of this example
is found to translate to “usa” in Czech. In the PCEDT 2.0 corpus “usa” is annotated in the
m-layer as masculine inanimate and plural. The English pronouns are therefore annotated as
its.mascin.pl and it.mascin.pl respectively (as shown in the example).
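The final annotation step amounts to rewriting the pronoun token in place. A minimal sketch, assuming the gender and number have already been resolved for the given pronoun positions (the function name and token-list interface are invented for the example):

```python
# The 3rd person pronouns subject to annotation, per table 4.1.
COREFERENTIAL = {"he", "him", "she", "her", "it", "they", "them",
                 "himself", "herself", "itself", "themselves",
                 "his", "hers", "its", "their", "theirs"}

def annotate(tokens, pronoun_positions, gender, number):
    """Rewrite pronoun tokens as Pronoun.gender.number, e.g. it -> it.mascin.pl."""
    out = list(tokens)
    for i in pronoun_positions:
        if out[i].lower() in COREFERENTIAL:
            out[i] = "%s.%s.%s" % (out[i], gender, number)
    return out
```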
4.6 Training the Translation Models
Both the Baseline and Annotated systems are phrase-based SMT models, trained using the
Moses toolkit (Hoang et al., 2007). They share the same 3-gram language model and are
forced to use the same word alignments. Following the computation of the word alignments,
training of both models commenced at the construction of the phrase translation table.
In the construction of both the Baseline and Annotated translation systems, the lexical reordering model:
1. Uses the msd (monotone, swap, discontinuous) model configuration which considers the
three orientation types monotone, swap and discontinuous in the reordering.
2. Is conditioned on both the foreign phrase and the English phrase, and is bidirectional: for each phrase C, its ordering with respect to the previous phrase and the ordering of the
next phrase with respect to C are considered.
The Baseline system was trained using the full texts of the parallel training corpus, with the
un-annotated English text forming the source side. The Annotated system was trained in the
same way as the Baseline system but using the annotated English text as the source side of the
parallel training corpus. The annotation of the English training set data is described in detail in
section 4.6.3.
4.6.1 Computing the Word Alignments
When using two translation systems in a two-step translation process, it is necessary to ensure
that the Czech translation of the antecedent in the output of the Annotated system is the same
as that in the output of the Baseline system. Otherwise the annotation of the English pronouns
serves no useful purpose. In order to ensure consistency of the antecedent translations between
the systems it is necessary to force both systems to use the same word alignments. The word
alignments were produced using GIZA++ run over a ‘stemmed’ version of the un-annotated parallel training corpus in both translation directions and symmetrised using the grow-diag-final
heuristic. The stemming of the un-annotated training corpus is not stemming in the traditional sense. Rather, each word in the corpus is trimmed such that it is only four characters in
length. This was implemented upon the recommendation made by Dr. Ondřej Bojar in order
to improve the robustness of the word alignments used in the phrase extraction step of training
the translation models. This is necessary due to the inflective nature of Czech words which if
left untrimmed would lead to weaker word alignments used in the construction of the phrase
translation tables.
It is important to note that whilst the word alignments were computed using the ‘stemmed’
parallel corpus texts, the translation models were trained using the full corpus texts.
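The trimming itself is trivial; a sketch of the per-line operation (the function name is invented for the example):

```python
def trim_corpus_line(line, length=4):
    """Crude 'stemming' used only when computing word alignments: every
    token is cut to its first four characters, collapsing many Czech
    inflectional endings onto a shared form."""
    return " ".join(tok[:length] for tok in line.split())
```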
4.6.2 Tuning the Translation System Weights: Minimum Error Rate Training (MERT)
When a model is first trained using Moses, the model weights generated are a default set of
weights. According to the Moses documentation, the quality of these default weights is questionable3. It is therefore necessary to tune these weights to ensure that they are suitable for the
translation language pair and given the available translation system models.
The weights were tuned using the MERT tuning script provided as part of the Moses toolkit,
using the 500 sentence “Weight Tuning Set” file described in section 4.3. The output of the
tuning process is a new Moses configuration file which is used to replace the default configuration file produced by the Moses training process. Different weights were computed for the
Baseline and Annotated systems as they were trained using different training data and therefore
comprise different models. Whilst the same 500 sentences of the “Weight Tuning Set” file is
used in the tuning of both weight sets, in tuning the weights of the Annotated system the English pronouns in these sentences were first annotated using the same method used to annotate
the training data.
3 http://www.statmt.org/moses/?n=FactoredTraining.Tuning

The tuning of the weights, whilst obviously highly recommended, led to problems in the experiments conducted as part of the project. With two systems involved in the two-step translation
process, it was necessary to tune the weights of both systems. The result of this tuning is, in
theory a better set of weights for each system. Having tuned both systems independently it
was then discovered that there was some considerable variation in the Czech translation of the
English antecedent head noun between the two systems. As the two-step translation process
is dependent on this translation remaining constant, there was a concern that this variation in
the translations between the two systems would lead to the introduction of further errors. It is
not clear how Le Nagard and Koehn (2010) addressed this issue as there is no mention of the
tuning of the translation system weights in their paper, but it seems likely that they encountered
similar issues.
As the impact of the tuning process upon the translation of the antecedent head nouns is not
fully understood and in light of the variation observed when both systems were tuned independently, the decision was taken to use the sub-optimal default weights. The use of these weights,
in conjunction with the shared word alignments from which the phrases were extracted, ensured
a high degree of consistency in the Czech translation of the English antecedent head noun between both systems. This consistency across the translations is of particular relevance to the
automated evaluation as defined later in this chapter. The tuning of the weights of both systems in such a way as to ensure that both systems perform well and that the translation of the
antecedent head noun is consistent between the systems is left as a possible option for future
work. It should be noted that a single-step process such as that used by Hardmeier and Federico
(2010) would not suffer from this problem of inconsistency (as it uses only a single translation
system), perhaps adding greater weight to the argument in favour of using a single-step process
in further research.
4.6.3 Annotation of the Training Set Data
The process of annotation used to generate the training data with which the Annotated system
was trained works as follows:
1. Identify coreferential English pronouns and their antecedents using the BBN Coreference
and Entity Type corpus.
2. Extract the head noun of the antecedent. Where the antecedent spans more than a single word, the antecedent and the NPs annotated in the Penn Treebank 3.0 corpus are
overlaid and the head noun is extracted using the process described in section 4.5.2.
3. Obtain the Czech translation (and its t-layer node) of the English antecedent head noun
from the PCEDT 2.0 alignment file.
4. Obtain the number and gender of the Czech word by traversing the PCEDT annotation
layers from the t-layer node to the corresponding m-layer node. The part of speech tag,
number and gender in the positional tag, and the term4 from the lemma field are extracted
from the m-layer node.
5. If the m-layer node is annotated as a noun, then the number and gender of the corresponding Czech word is used to annotate the English pronoun in the original English
text.
In the training data set, there are 23,233 pronouns marked as coreferential by the BBN Coreference and Entity Type corpus. Of those, it was possible to extract the antecedent head noun
for 23,126 from the noun phrases marked in the merged files of the Penn Treebank 3.0 corpus.
This leaves 107 coreferential pronouns without an antecedent head noun.
Of the coreferential pronouns in the training set sentences, 20,721 out of a possible 23,233
are annotated by the training data annotation process. There are several reasons why not all
coreferential pronouns have been annotated:
1. No head noun may be found for a multi-word antecedent NP, either because the antecedent does not contain a noun or because the noun identified as the head in the Penn
Treebank 3.0 corpus annotation is not part of the antecedent. This is due to possible
discrepancies between the annotation of the two corpora, such that no match for the
antecedent can be obtained from the NPs.
2. There is no mapping for the English antecedent head noun in the PCEDT 2.0 alignment
file. Therefore it is not possible to extract a number and gender for the aligned Czech
word.
3. The word identified as the Czech translation of the antecedent head noun is not annotated
as a noun at its m-layer node.
A further four annotated pronouns have been removed due to the exclusion of a number of
sentences from the training data set by the Moses ‘clean data’ script. This leaves a total of
20,717 pronouns annotated in the English side of the parallel training corpus. See table 4.4 for
a breakdown of this number by pronoun.
4 The term of a lemma is used in the identification of surnames, which are used as the ‘head’ noun in an antecedent string that contains a person’s full name.
Table 4.4: Breakdown of Annotated Coreferential Pronouns in the Training Data Set

English Pronoun    Number of Occurrences
He                 4,157
She                527
Him                290
Her                426
His                1,714
Hers               1
It                 4,478
Its                3,941
They               2,427
Them               657
Their              1,729
Theirs             6
Himself            83
Herself            11
Itself             156
Themselves         114
Total              20,717

4.7 The Annotated Translation Process
The input to the annotated translation process is an un-annotated English test file that consists
of a set of sentences not present in the training set. This file is first translated using the Baseline
system with a trace added to the Moses decoder. The coreferential English pronouns are then
identified using the BBN Coreference and Entity Type corpus and their antecedent head noun(s)
are extracted from the annotated NPs in the Penn Treebank 3.0 corpus, as previously described.
The sentence number and word position of the English pronoun and its antecedent head noun(s)
are extracted from the input English text and retained.
Using the sentence number and word position of the English antecedent head noun, the Czech
translation is identified in the output of the Baseline system using the phrase alignments output
by the Moses decoder (in the trace file) and the phrase internal word alignments in the phrase
translation table. The number and gender of the Czech word identified as the translation of
the antecedent head noun are extracted from the m-layer of the PCEDT 2.0 corpus, using a
pre-built dictionary of Czech words and their morphological properties. A copy of the original
English test file is then constructed, with all coreferential pronouns annotated with the number
and gender of the relevant Czech word. This annotated English test file is then translated by
the Annotated system in the second step of the translation process.
For evaluation purposes, calls to the Moses decoder when performing the translations with
the Baseline and Annotated systems include an option to return the word alignments for each
sentence in the input English test file. This word alignment information is output to a separate
file and consists of a single line per sentence with word-level alignments of the format E-C,
where E is the position of the English word in the input sentence and C is the position of the
Czech word in the translated sentence.
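Such a line can be parsed into a lookup from English positions to Czech positions; a minimal sketch (the function name is invented for the example):

```python
def parse_alignments(line):
    """Parse one line of word alignments, e.g. '0-0 1-2 2-1', into a dict
    mapping each English word position to its Czech word positions (a word
    may be aligned to more than one target word)."""
    aligned = {}
    for pair in line.split():
        e, c = pair.split("-")
        aligned.setdefault(int(e), []).append(int(c))
    return aligned
```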
In the design of the annotated translation process, a number of assumptions have been introduced. Firstly, that the Czech translation of the English antecedent head noun is the same in the
output of both the Baseline and the Annotated system. As the Baseline and Annotated systems
were trained using the same word alignments, it is reasonable to make the assumption that the
translation of the English antecedent head noun will be the same in the output of both systems.
Secondly, it is assumed that the annotation of the Czech words in the m-layer is both accurate
and consistent. The same assumption was also made in the annotation of the training data.
4.8 Annotation and Translation System Architecture
The prototype annotation and translation system (described in figure 4.1) takes the form of
a Python application that includes a bespoke module that contains functions for accessing,
processing and combining information from the corpora and the PCEDT 2.0 alignment file.
This module also contains functions that are used in the generation of the annotated English
training data used to train the Annotated translation system.
The Python application works as follows:
1. Tokenise the un-annotated English test file.
2. Call the Moses decoder to translate the un-annotated English test file using the Baseline
system and generate two files: the Czech translation output with trace information (for
the identification of Czech and English phrases used in the translation) and the word
alignments used by the decoder (used in the automated evaluation).
3. Perform the annotation of the English test file using the annotation process described
previously.
4. Tokenise the annotated English test file.
5. Call the Moses decoder to translate the annotated English test file using the Annotated
system and generate trace output and the word alignments used by the decoder.
6. If the additional Czech and English annotation switches are set to ‘on’ the application
may also be used to read in the Czech translation output and the annotated English test
file and add additional information to these files to aid manual evaluation.
In addition to this application, a number of Python scripts were developed as part of the project.
These scripts perform a number of functions including:
1. Generation of the corpus from which the language model is constructed.
2. Generation of the parallel training and test data sets from the PCEDT 2.0 corpus (Czech
side) and the BBN Coreference and Entity Type corpus (English side).
3. Creation of the ‘stemmed’ parallel training data set data from which the word alignments
used in the training of the Baseline and Annotated translation systems are generated.
4. Generation of the annotated English training data used in the training of the Annotated
translation system.
5. The execution of an automated evaluation. This will be described in more detail in
section 4.9.1.
4.9 Evaluation
With no standard method available for the evaluation of pronoun translation in SMT and BLEU
rejected on the basis that it is not well suited to the specific problem of evaluating pronoun
translation, it was necessary to devise methods in order to evaluate the performance of the
systems. As already discussed in section 2.6, the problem of evaluating the translation of pronouns was addressed differently by Le Nagard and Koehn (2010) and Hardmeier and Federico
(2010). Where Le Nagard and Koehn (2010) manually counted the number of correctly translated pronouns in the output of their translation systems, Hardmeier and Federico (2010) relied
on precision and recall scored against a single reference translation.
Again, as previously discussed in section 2.6, in the case of English-Czech translation a recall
and precision based metric seems unsuitable given both the highly inflective nature of Czech
and the provision of only a single reference translation. Given the number of possible syntactic
forms that a pronoun of the correct number and gender may take in a language that has seven
cases, and given that case is not considered in the annotation, the translation of pronouns with the accurate
syntactic form cannot be guaranteed. A method involving the manual counting of correctly
translated pronouns, as used by Le Nagard and Koehn, is prohibitively slow and laborious, not
to mention an impossible task for a monolingual speaker. Whilst a one-off manual evaluation
of pronoun translation may provide an acceptable method for the final evaluation of a system
it is clearly impractical to rely on such a method during system development.
It is clear that in the case of the development of an English-Czech translation system by a
monolingual speaker, neither of the methods discussed so far are suitable for evaluation during
the development process. Given that in Czech, a pronoun must agree in number and gender
with its antecedent, it is perhaps more meaningful to count the number of pronouns in the
translation system output for which this agreement holds, rather than simply score the output
against a single reference translation.
The following sections describe an automated method used to provide these counts and the
approach taken in a more detailed manual evaluation of pronoun translation carried out by a
Czech native speaker who is also an expert in NLP.
4.9.1 Automated Evaluation: Assessing the Accuracy of Pronoun Translations
The development of automated evaluation methods is necessary both for the final evaluation
and for the development and tuning of systems that focus on pronoun translation. Without
the availability of such an evaluation metric during the development of the annotation and
translation process as part of this project, the analysis of progress was measured using manual
checks. These checks focussed on the accuracy of pronoun annotations in the annotated English
test file and the manual evaluation of a small number of pronouns in the Czech translations
output by the Annotated system. In order to evaluate the final output of the translation systems,
an automated method was deemed to be a necessity.
The automated evaluation counts those pronouns, in the input English test file and in the translation produced by the relevant translation system, that meet certain specified criteria. These criteria are encoded in a single Python script that is designed to
simultaneously output results for both the Baseline and Annotated systems such that a direct
comparison of the two systems is possible. Using this evaluation script, the following statistics
were collected:
1. Total number of pronouns in the input English test file - irrespective of whether they are
identified as coreferential.
2. Total number of English pronouns identified as coreferential, as per the annotation of the
BBN Coreference and Entity Type corpus.
3. Total number of coreferential English pronouns that are annotated by the annotation
process.
4. Total number of coreferential English pronouns that are aligned with any Czech translation.
5. Total number of coreferential English pronouns translated as valid Czech pronouns irrespective of whether the Czech translation is a valid match for the original English
pronoun.
6. Total number of coreferential English pronouns translated as a valid Czech pronoun that is also a valid translation of the original English pronoun.
7. Total number of coreferential English pronouns translated as a valid Czech pronoun (corresponding to the original English pronoun), with number and gender agreement between the Czech pronoun and the Czech translation of the original English antecedent head noun.
The evaluation handles the following pronouns:
Personal Pronouns: he, him, she, her, it, they, them
Reflexive Personal Pronouns: himself, herself, itself and themselves
Possessive Pronouns: his, its, their, theirs
Possessive Reflexive Pronouns: him, her, it, them
The evaluation script works as follows:
1. Read in tokenised English input, Czech translation system output and Czech reference
translation files.
2. Identify coreferential English pronouns and their antecedent head nouns in the input
English text.
3. Identify the word positions of these English words in the input English text.
4. Identify aligned Czech words (for the English pronoun and antecedent head noun) in the
translation system output using the word alignments output by the Moses decoder.
5. Collect counts of pronouns that meet those criteria listed above.
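The counting stages above can be sketched in Python, the language in which the evaluation script is written. The sketch below is an illustrative reconstruction rather than the actual script: the function names, data structures and Moses-style "src-tgt" alignment format are assumptions, and statistics 3 and 7 (which require the annotation output and the morphological agreement check respectively) are omitted for brevity.

```python
# Illustrative reconstruction of the counting loop (hypothetical names and
# data structures; the real script and file formats may differ).

ENGLISH_PRONOUNS = {"he", "him", "she", "her", "it", "they", "them",
                    "himself", "herself", "itself", "themselves",
                    "his", "its", "their", "theirs"}

def parse_alignment(line):
    """Parse a Moses alignment line such as '0-0 1-2 2-1' into a mapping
    from each source word index to its aligned target word indices."""
    align = {}
    for pair in line.split():
        src, tgt = (int(x) for x in pair.split("-"))
        align.setdefault(src, []).append(tgt)
    return align

def collect_counts(english_sents, czech_sents, alignments, coreferential,
                   czech_pronouns, valid_translations):
    """english_sents/czech_sents: lists of tokenised sentences;
    alignments: one source->target index mapping per sentence pair;
    coreferential: set of (sentence_idx, word_idx) positions marked as
    coreferential pronouns (as per the BBN annotation);
    czech_pronouns: set of valid Czech pronoun forms;
    valid_translations: {english_pronoun: set of valid Czech forms}."""
    counts = dict.fromkeys(
        ["pronouns", "coreferential", "aligned",
         "czech_pronoun", "valid_translation"], 0)
    for i, (eng, cz, al) in enumerate(zip(english_sents, czech_sents,
                                          alignments)):
        for j, word in enumerate(eng):
            if word.lower() not in ENGLISH_PRONOUNS:
                continue
            counts["pronouns"] += 1                       # statistic 1
            if (i, j) not in coreferential:
                continue
            counts["coreferential"] += 1                  # statistic 2
            targets = al.get(j, [])
            if not targets:
                continue
            counts["aligned"] += 1                        # statistic 4
            czech_words = [cz[t] for t in targets]
            if any(w in czech_pronouns for w in czech_words):
                counts["czech_pronoun"] += 1              # statistic 5
            if any(w in valid_translations.get(word.lower(), set())
                   for w in czech_words):
                counts["valid_translation"] += 1          # statistic 6
    return counts
```

Because each count is conditioned on the previous one, the statistics form the same cascade as the numbered criteria: a pronoun can only contribute to statistic 6 if it has already contributed to statistics 1, 2, 4 and 5.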
Whilst the English pronoun and its antecedent head noun are single words, they may translate
as a single word (one-to-one mapping) or multiple words (one-to-many mapping) in the Czech
output. A one-to-one mapping is ideal, but the more complex case of a one-to-many mapping
presents a problem as it is necessary to collect counts based on the identification of a single
Czech pronoun and the Czech antecedent translation(s). In the scenario that an English pronoun
translates as more than one word in Czech, the dictionary of Czech pronouns (see Appendix) is
used to identify those words that are valid Czech pronouns. The scenario in which an English
antecedent head noun translates as more than one word in Czech is a little more complex. When
this is the case, the agreement of the pronoun must be checked against each Czech antecedent
word and if agreement is found with any of the Czech antecedent words this is deemed to be a
‘match’.
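The "match if any aligned antecedent word agrees" rule described above can be sketched as follows; the representation of morphological properties as (number, gender) pairs and the lookup table are hypothetical simplifications.

```python
def agreement_match(pronoun_props, antecedent_words, morphology):
    """pronoun_props: the (number, gender) of the translated Czech pronoun.
    antecedent_words: the Czech word(s) aligned to the English antecedent
    head noun (one or more, for one-to-many mappings).
    morphology: mapping from a Czech word to its (number, gender).
    Agreement with ANY of the aligned antecedent words counts as a match."""
    return any(morphology.get(w) == pronoun_props for w in antecedent_words)

# 'mexiko' is neuter singular; 'zákon' (law) is masculine singular.
morphology = {"mexiko": ("sg", "neut"), "zákon": ("sg", "masc")}
agreement_match(("sg", "neut"), ["mexiko"], morphology)           # True
agreement_match(("sg", "fem"), ["mexiko", "zákon"], morphology)   # False
```

The second call returns False because no aligned word carries feminine singular morphology, so no match is declared even though the antecedent translated as two words.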
In gathering those statistics listed as items 5, 6 and 7, it was necessary to reference a list of all of
the valid Czech translations of the English pronouns included in the annotation and translation
process. A complete list of the Czech pronoun syntactic forms, together with their number and gender, may be found in the Appendix.
Whilst all of the statistics are useful in evaluating the performance of the systems and providing a basis for comparison, perhaps the most informative are those described as items 6 and
7. Statistic 6 provides a means of measuring the accuracy with which a translation system
translates an English pronoun as a valid Czech translation of that English pronoun. Statistic
7 provides a means of further interrogating the translated Czech pronoun: in addition to being a valid translation of the original English pronoun, does the Czech pronoun agree in number
and gender with the (Czech translation of the) antecedent head noun? As it is a requirement
in Czech that the number and gender of a pronoun agrees with that of the antecedent, it is this
statistic that arguably provides the most meaningful information in relation to the system performance. The validity of this statistic is, however, reliant upon several factors including the
correct identification of the English antecedent head noun, its accurate translation into Czech
by the Baseline system and finally, the correct identification of the Czech word in the Baseline
system output. The uncertainty surrounding these factors, as well as concerns surrounding the
robustness of the word alignments output by the decoder (used in the automated evaluation)
provides additional motivation for soliciting human judgements via a manual evaluation.
It is worth noting that a simplification is made in the evaluation of non-reflexive possessive
pronouns. In Czech, the choice of syntactic form for a singular non-reflexive possessive pronoun is dependent on the gender of both the possessor and the object in possession. The cases
in which the possessor is masculine animate, masculine inanimate or neuter are simple, as the same syntactic form is used irrespective of the gender of the object in possession. The case in
which the possessor is feminine is more complex as the syntactic form differs depending on
the number and gender of the object in possession. As the Wall Street Journal corpus contains few possessive pronouns with a feminine possessor (fewer than 2% of all coreferential pronouns in the corpus, and fewer than 6% of the possessive pronouns), this case is unlikely to appear frequently and has therefore been omitted from the evaluation for the sake of simplicity. This is
attributed to the genre of the WSJ texts and would not necessarily hold true for other domains.
It should also be noted that the automated evaluation does not include statistics on dropped
pronouns. As the decision as to whether or not to drop a subject pronoun is one that may
be made by the speaker (or writer) this is too subjective to be measured using an automated
method. The only option would be to use the reference translation(s) to identify those sentences in which a pronoun may be dropped, which again raises issues surrounding the
provision of only a single reference translation. The evaluation of pronoun dropping (which is not explicitly handled by the Annotated system but may be ‘learned’ during the training of the translation systems) was therefore left to the manual assessor to comment upon.
4.9.2 Manual Evaluation: Error Analysis and Human Judgements
Whilst the automated evaluation provides an indication of relative performance, there are a
number of problems associated with this method (as discussed in section 4.9.1). Furthermore,
the true test of whether pronoun translation improves over the Baseline system requires human judgements, solicited from a manual assessor as part of a manual evaluation.
As with the manual evaluation of Machine Translation in general, the manual evaluation of
pronoun translation is not a straightforward task. Care must be taken to ensure that the manual assessors are given clear instructions as to how to conduct the evaluation, and even then the instructions may be misinterpreted. Furthermore, the identification of intended pronoun translations in the system output is potentially difficult, even for a native Czech speaker. This is due
to phrase-level reordering between the input English text and the Czech output, the insertion
of spurious pronouns during the translation process and the ambiguity of words such as “je”
which may be used as either a pronoun or a verb. In the evaluation of the translated pronouns, it was considered important to direct the manual assessor to the Czech translations aligned to the English pronouns. For this purpose, referential pronouns in both
the Czech and English texts provided for manual assessment were marked with the head noun
of their antecedent. In addition, referential pronouns in the English source texts were marked
with the corresponding Czech translation of the antecedent head noun, and those in the Czech
target texts were marked with the original English pronoun that they align to. Examples of the
additional annotation provided for the purposes of the manual evaluation are presented below.
English text input to the Baseline system:
the u.s. , claiming some success in its trade diplomacy , removed south korea , taiwan and
saudi arabia from a list of countries it is closely watching for allegedly failing to honor u.s.
patents , copyrights and other intellectual-property rights .
Czech translation output by the Baseline system:
usa , tvrdí někteří její(its) obchodní úspěch v diplomacii , odvolán jižní korea , taiwanu a saúdská arábie ze seznamu zemí je(it) pozorně sledovali za údajné schopná dodržet amerických patentů , copyrights a další intellectual-property práva .
English text input to the Annotated system:
the u.s.* , claiming some success in its(u.s.,usa).mascin.pl trade diplomacy , removed south korea , taiwan and saudi arabia from a list of countries it(u.s.,usa).mascin.pl is closely watching
for allegedly failing to honor u.s. patents , copyrights and other intellectual-property rights .
Czech translation output by the Annotated system:
usa ,* tvrdí někteří úspěchu ve své(its.mascin.pl) obchodní diplomacii , odvolán jižní korea , taiwanu a saúdská arábie ze seznamu zemí je(it.mascin.pl) pozorně sledovali za údajné schopná dodržet amerických patentů , copyrights a další intellectual-property práva .
Because a pronoun must agree in number and gender with its antecedent, when that antecedent
comes from an earlier sentence, the assessor carrying out manual evaluation must also be provided with that sentence in order to understand the context of the pronoun. The additional
mark-up of the Czech target text is therefore of even greater importance.
The sample English and Czech translation texts were composed from five WSJ files selected at
random from the Development and Final test sets.
The manual assessor was asked to make the following judgements:
1. Whether the pronoun had been translated correctly, or in the case of a dropped pronoun,
whether it had been dropped correctly;
2. If the pronoun translation was incorrect, whether a native Czech speaker would still be
able to derive the meaning;
3. In the case of the input to the Annotated system, whether the pronoun had been correctly
annotated, at least with respect to the Czech translation of the identified antecedent;
4. In the case where an English pronoun had a different translation in the Baseline and Annotated Czech target text, which system produced the better translation. If both systems
translated an English pronoun to a valid Czech translation (of that pronoun), both results
are to be marked equally as correct translations.
It should be noted that the evaluation focussed solely on the translation of pronouns, and not on the translation system output as a whole, as is the case in general-purpose manual evaluation of Machine Translation.
4.10 Chapter Summary
This chapter described the approach taken in the training of the Baseline and Annotated phrase-based translation systems, the development of the annotation and translation process and the
methods developed in order to address the more specific problem of evaluating pronoun translation in English-Czech SMT. The next chapter presents the results of the automated and manual
evaluations of the output of the Annotated translation system and provides a comparison with
the output of the Baseline system. The chapter also provides a discussion of the results.
Chapter 5
Results and Discussion
5.1 Automated Evaluation
The results of the automated evaluation (described in section 4.9.1) are presented for the Development test set in table 5.1 and for the Final test set in table 5.2. As these tables show, there
is only a small improvement of the Annotated system over the Baseline system on each test set.
The statistics in the last two rows of each table require further explanation. By way of an example, consider a sentence in which the English pronoun ‘it’ is identified as having an antecedent
for which the head noun translates to a Czech word that is singular and feminine. If ‘it’ was
translated as the Czech pronoun ‘on’, this would be a valid Czech translation of the English
pronoun ‘it’, satisfying the criterion “Czech Pronouns that are a valid translation of the original English Pronoun”. This translation would not, however, satisfy the additional requirement of agreement with the antecedent, as ‘on’ (singular, nominative case) is masculine and the antecedent is feminine. In order to satisfy the more stringent criterion of also matching the number and gender of the antecedent, ‘it’ would need to be translated as ‘ona’ in the nominative case.
If the accuracy of pronoun translation is taken to be the proportion of coreferential English pronouns that have a valid Czech translation and agree in number and gender with their antecedent, then the accuracy of the systems is as follows:
1. Development test set: Baseline system 44/141 (31.21%), Annotated system 46/141 (32.62%)
2. Final test set: Baseline system 142/331 (42.90%), Annotated system 146/331 (44.10%)
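These percentages are simply the agreement count (statistic 7, the final row of tables 5.1 and 5.2) expressed as a proportion of all coreferential pronouns. As a quick check against the Development-set figures quoted above:

```python
# Accuracy = pronouns with a valid, agreeing Czech translation, as a
# percentage of all coreferential pronouns in the test set.
def pronoun_accuracy(agreeing, coreferential):
    return round(100.0 * agreeing / coreferential, 2)

pronoun_accuracy(44, 141)  # 31.21 (Baseline, Development test set)
pronoun_accuracy(46, 141)  # 32.62 (Annotated, Development test set)
```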
Table 5.1: Automated Evaluation Results for the Development Test Set

                                                                    Baseline  Annotated
                                                                      System     System
Pronouns                                                                 156        156
Coreferential Pronouns                                                   141        141
Annotated Coreferential Pronouns                                         N/A        117
Coreferential English Pronouns with a Czech translation                  141        141
Coreferential English Pronouns translated as valid Czech Pronouns         71         75
Czech Pronouns that are a valid translation of the original
  English Pronoun                                                         63         71
Czech Pronouns that are a valid translation of the original
  English Pronoun and the Czech Pronoun and Antecedent match
  in number and gender                                                    44         46
However, there are a number of reasons for not taking this evaluation as definitive:
1. The automated evaluation hinges on the accuracy of the word alignments output by the
decoder (alongside the Czech translations) in order to identify the Czech translations of
the English pronoun and its antecedent. The robustness of these alignments is questionable, so caution should be taken when interpreting the results.
2. The automated evaluation requires accurate identification of the true Czech translation of the head noun of the English antecedent, which in turn requires that the head noun itself be identified accurately. If either is incorrect, the English pronoun in the input to the Annotated
translation system is likely to be annotated incorrectly, thereby blocking any potential
gains from the annotation and translation process.
3. English pronouns are only annotated with the number and gender of their Czech counterparts, and so the correct inflectional form of the Czech pronouns in the target text cannot be guaranteed. As a result, inflectional form cannot be used as a criterion in the automated evaluation.
All these points mean that manual evaluation is critical for understanding the potential capabilities of source text annotation as a technique for improving pronoun translation.
Despite efforts to ensure that the English antecedent head noun is translated as the same Czech
word in the Baseline and Annotated systems, a small number of differences between the two
systems were identified. One instance of a different antecedent translation was identified for the Development test set and two were identified for the Final test set. In all three cases, the
English pronouns were not translated as Czech pronouns, so the presence of these anomalies
does not affect the accuracy scores reported previously.
Table 5.2: Automated Evaluation Results for the Final Test Set

                                                                    Baseline  Annotated
                                                                      System     System
Pronouns                                                                 350        350
Coreferential Pronouns                                                   331        331
Annotated Coreferential Pronouns                                         N/A        278
Coreferential English Pronouns with a Czech translation                  317        317
Coreferential English Pronouns translated as valid Czech Pronouns        198        198
Czech Pronouns that are a valid translation of the original
  English Pronoun                                                        182        182
Czech Pronouns that are a valid translation of the original
  English Pronoun and the Czech Pronoun and Antecedent match
  in number and gender                                                   142        146

Automated evaluation also fails to capture actual variations between the Baseline and the Annotated target texts. Upon closer inspection of the system output, it is clear that there is a
fairly high degree of overlap between the two systems in terms of English pronouns that are
translated using exactly the same Czech form. There are also a substantial number of English
pronouns for which the Czech translation is different. Where the two systems produce the same
translation of the same English pronoun (i.e. the same word position within the same sentence,
within the same WSJ file), it is possible that both systems have produced a valid translation of
the pronoun, or that they have both produced an invalid translation. Where the translations are
invalid, interpretation by a human expert is required in order to ascertain the cause of the error.
Where the two systems produce a different translation of the same English pronoun, there are yet more possibilities: both systems could produce a different Czech pronoun, whether valid or invalid; neither system may produce a Czech pronoun (but the Czech translations may differ); or one system may produce a valid Czech pronoun where the other does not. As
both systems share the same underlying word alignment for the construction of their phrase
translation models, these differences can only follow from the data used to train their translation models. The extent of this variation differs between the two test sets. For the Development
test set, approximately 1/3 of the pronoun translations are different between the two systems,
whereas for the Final test set this is much lower at approximately 1/6.
The evaluation of the instances where the pronoun translation is the same in both systems
and where it differs between the systems is left to the manual assessor. Whilst a monolingual speaker with a dictionary of Czech pronouns and their English translations may manually
examine these instances using the information in the files output as part of the automated evaluation process, there are cases that can only be analysed by a native Czech speaker. This
motivates the solicitation of human judgements.
Table 5.3: Manual Evaluation Results: Pronouns with the same translation in both systems
(“Matches”)

Criterion                                                       Result for both systems
Total number of pronouns                                                             72
Pronoun translation correct in terms of number and gender,
  or correctly dropped                                                            52/72
Pronoun translation incorrect in terms of number and gender,
  or incorrectly dropped                                                          20/72
English pronoun annotated correctly with the number and gender
  of the Czech translation                                                        67/72
Total number of incorrectly translated pronouns                                      20
Pronoun translation incorrect and cannot be understood, or is
  “misleading”                                                                     8/20
Pronoun translation incorrect but the meaning could still be
  understood                                                                      12/20
5.2 Manual Evaluation
The results of the manual evaluation suggest that the performance of the Annotated system
is comparable with, or even marginally better than, that of the Baseline system. In the sample
files provided for the evaluation there were 31 pronouns for which the translations provided
by the two systems differed (differences) and 72 for which the translation provided by the
systems was the same (matches). These two sets reveal different things: evaluation of the “matches” (see table 5.3) indicates how well both systems perform in general terms, whilst evaluation of the “differences” (see tables 5.4 and 5.5) allows the two systems to be compared directly. Tables 5.6 and 5.7 describe the performance of both systems with
respect to the appropriate use of pro-drop. The results contained in these tables correspond to
judgements based on the criteria specified in section 4.9.2.
Upon inspection of the “matches” set (see table 5.3) of 72 pronouns, it is clear that a reasonable number of pronouns are correctly translated or dropped by both systems (52/72) and that
of those 20 pronouns that are incorrectly translated, the meaning of 12 could still be understood. This leaves 8 pronouns for which the translation was so poor that the meaning cannot
be understood. Focussing specifically on those pronouns that are dropped (see table 5.6), 28
out of 32 are correctly (or at least satisfactorily) dropped, with only 6 pronouns that should
have been dropped but were not. This suggests that the translation systems were able to ‘learn’
scenarios in which pro-drop is appropriate. The success of both systems with respect to the
appropriate dropping of pronouns was somewhat unexpected but could be due to instances in
which there are short distances between the pronoun and verb in English. For example, many
of the occurrences of ‘he’ and ‘she’ in the English text appear in the context of “he said...” or “she said...”, and are translated as “...řekl...” and “...řekla...” (respectively) in the Czech machine translation output. These instances represent scenarios in which the pronoun was appropriately
Table 5.4: Manual Evaluation Results: Pronouns with different translations in each system
(“Differences”)

Criterion                                                      Baseline     Annotated
                                                                 System        System
Total number of pronouns                                             31            31
Pronoun translation correct in terms of number and gender,
  or correctly dropped                                            19/31         17/31
Pronoun translation incorrect in terms of number and gender,
  or incorrectly dropped                                          12/31         14/31
English pronoun annotated correctly with the number and
  gender of the Czech translation                                   N/A         18/31
Total number of incorrectly translated pronouns                      12            14
Pronoun translation incorrect and cannot be understood, or
  is “misleading”                                                  5/12          6/14
Pronoun translation incorrect but the meaning could still
  be understood                                                    7/12          8/14

*  The remaining 11 pronoun translations were found to be “similar” between the two
   systems. In this case, the translations provided by one system were no better or worse
   than the other: either both translations were deemed to be equally good or equally bad.
** The remaining 6 pronoun translations were found to be “similar” between the two
   systems.
Table 5.5: Manual Evaluation Results: A direct comparison of pronoun translations that
differ between systems (“Differences”)

Criterion                            Baseline System   Annotated System   Systems Equal
                                              Better             Better
Overall quality                                 9/31              11/31           11/31
Quality when annotation is correct              3/18               9/18            6/18
Table 5.6: Manual Evaluation Results: Dropped Pronouns in the “Matches” set

Criterion                                              Result for both systems
Total dropped pronouns                                                      32
Correctly / satisfactorily dropped                                          28
Incorrectly / inappropriately dropped                                        4
Pronouns that should have been dropped (but were not)                        6
Table 5.7: Manual Evaluation Results: Dropped Pronouns in the “Differences” set

Criterion                                              Baseline System   Annotated System
Total dropped pronouns                                              12                  3
Correctly / satisfactorily dropped                                  12                  3
Pronouns that should have been dropped (but were not)                1                  1
dropped. However, it is not the case that the problem of pro-drop has been solved, merely that
a few scenarios in which pro-drop is appropriate have been ‘learned’.
An inspection of the results from the “differences” set (see tables 5.4 and 5.5) of 31 pronouns
presents further points of interest. Whilst the performance of the Annotated system appears to
be a little better than the Baseline system overall (see table 5.5), the manual assessor actually
identified fewer correct translations for the Annotated system (17/31) than the Baseline system
(19/31). This may seem strange but it appears to be due to a small number of cases in which
the translations produced by both systems were incorrect but those produced by the Annotated
system were deemed to be marginally better. Unfortunately, the sample size for this set is rather
small and therefore it is somewhat difficult to form a complete picture of where one system may
be consistently better than the other. As an example of where the Annotated system produces
a better translation than the Baseline system, consider the following English sentence and its
translations by both systems:
English text: he said mexico could be one of the next countries to be removed from the priority
list because of its.neut.sg efforts to craft a new patent law .
Baseline system translation: řekl , že mexiko by mohl být jeden z dalších zemí , aby byl odvolán z prioritou seznam , protože její snahy podpořit nové patentový zákon .
Annotated system translation: řekl , že mexiko by mohl být jeden z dalších zemí , aby byl odvolán z prioritou seznam , protože jeho snahy podpořit nové patentový zákon .
In this example, the English pronoun “its”, which refers to “mexico” is annotated as neuter
and singular (as extracted from the Baseline translation of “mexico”). Both systems translate
the pronoun’s antecedent “mexico” as “mexiko” (neuter, singular) but differ in their translation
of the pronoun. The Baseline system translates “its” incorrectly as “jejı́” (feminine, singular),
whereas the Annotated system produces the more correct translation: “jeho” (neuter, singular),
which agrees with the antecedent in both number and gender. It is also interesting to note that
“jeho” is not the only correct pronoun translation in this case. If “because of its efforts to craft
a new patent law” is translated as a separate clause, the use of the possessive pronoun “jeho”
is correct. Alternatively, if the same fragment were to be translated as a phrase belonging to
the same clause as the antecedent “mexico” (also the subject), the reflexive possessive pronoun
“své” should be used instead, as it is in the reference translation.
There are two further points of interest with regards to the results from the “differences” set:
1. It would appear that the Baseline system is more likely to drop pronouns than the Annotated system (in those scenarios when a pronoun should be dropped).
2. If the annotation of the English pronoun is correct, the translation provided by the Annotated system is judged to be better than the translation provided by the Baseline system.
Unfortunately, the sample size of this set of pronouns is rather too small to make any definite claims, but it would appear that, in general, the explicit annotation of pronouns results in worse performance in terms of pro-drop (see table 5.7). What is encouraging is that the correct annotation of an English pronoun appears to lead to a good translation in Czech. Where
the annotation is correct with respect to the extraction of the number and gender from the Czech
translation of the antecedent, pronoun translation is deemed to be better for the Annotated system than the Baseline system (see table 5.5). In the “differences” set, where the annotation of
18 out of 31 English pronouns is correct, 9 pronouns are translated better by the Annotated system, 3 are translated better by the Baseline system and 6 are too ‘similar’ to make a judgement.
This is supported to some extent by the results of the “matches” set (see table 5.3) in which the
accuracy of the English pronoun annotation is deemed to be high (67/72) and the correctness
of the pronoun translation or dropping of a pronoun is also reasonably high.
Another interesting example that was identified in the manual evaluation showed that despite
the incorrect annotation of an English pronoun, the translation produced by the Annotated
system was deemed to be (accidentally) better than that by the Baseline system:
English text: the others here today live elsewhere they.fem.pl belong to a group of 15 ringers
– including two octogenarians and four youngsters in training – who drive every sunday from
church to church in a sometimes-exhausting effort to keep the bells sounding in the many belfries of east anglia .
Baseline system translation: ostatní zde dnes žije jinde to patří ke skupině 15 ringers - včetně dvou octogenarians a čtyři , který v období - , kteří jezdí každou neděli od kostela , aby církev v sometimes-exhausting snahu udržet zvony sounding v mnoha belfries of east anglia .
Annotated system translation: ostatní zde dnes žije jinde ty patří ke skupině 15 ringers - včetně dvou octogenarians a čtyři , který v období - , kteří jezdí každou neděli od kostela , aby církev v sometimes-exhausting snahu udržet zvony sounding v mnoha belfries of east anglia .
In this example, the English pronoun “they” refers to “others” in the previous sentence and is
annotated as feminine, plural. It should, however, be annotated as masculine animate, plural, according to the number and gender of “ostatní”. This incorrect annotation affects the translation
of the pronoun “they” by the Annotated system, but the translation “ty” (Annotated system) is
perfectly understandable to a native Czech speaker and deemed to be better than “to” (Baseline
system). Moreover, “ty” represents a form that is common in colloquial Czech.
Unfortunately, no clearer picture of the effects of the annotation and translation process with
respect to individual pronouns may be obtained. Whilst it was expected that the translation of
English pronouns which appeared in the training data with a high frequency would be translated more accurately than those that appeared with a low frequency, it is not possible to draw
any conclusions from such a small sample size. A more extensive manual evaluation would
therefore be required.
In addition to the judgements, the manual assessor also provided feedback on the manual evaluation task. One of the major difficulties encountered during the evaluation concerned evaluating the translation of pronouns in sentences which exhibit poor syntactic structure. This is a criticism of Machine Translation as a whole, but it highlights a specific problem in the manual evaluation of pronoun translation. The effects of poor syntactic structure are also likely to introduce an additional element of subjectivity if the assessor must first interpret the syntactic structure of the translation system output.
5.3 Critical Evaluation of the Approach and Potential Sources of Error
Errors in different parts of the process may explain why the Annotated system does not perform much better than the Baseline system:
1) Identification of the English antecedent head word. The incorrect identification of the English antecedent head word will in turn affect the identification of the Czech translation from
which the number and gender are extracted. This will affect not only the annotation of the training data used to train the Annotated translation system but also the annotation of the test file as
part of the annotation and translation process.
2) Identification of the Czech translation of the English antecedent head word. For the training data, the Czech translation is obtained from the PCEDT 2.0 alignment file. Errors in the
alignments used in the generation of this file would therefore eventually lead to the extraction
of the incorrect morphological properties of the Czech word used to label the coreferential English pronouns in the training data. During the translation of a test file the Czech translation
of an English antecedent head noun is extracted using the phrase internal word alignments in
the phrase table, corresponding to the phrase used in the translation. The potential for errors in
these word alignments cannot be ruled out.
3) Incorrect annotation in the manually annotated corpora. As the morphological properties of
the Czech words in the PCEDT 2.0 corpus, coreferential pronouns and their antecedents in the
BBN Coreference and Entity Type corpus and the parsed sentences in the Penn Treebank 3.0
corpus are manually annotated, this information is deemed to be highly accurate. The risk of errors in the manual annotation of these corpora is therefore believed to be minimal.
The potential sources of error in 1 and 2 could be contributing factors in the introduction of
variation between the pronoun translations in the Baseline and Annotated systems. The other
obvious source of this variation is the difference in the training data used between the two
systems. This difference in the training data (introduced as a result of the annotation of English pronouns) raises another concern: the annotation segments the data, splitting the occurrences of each English pronoun across multiple annotated forms, which potentially weakens the statistics in the phrase table of the Annotated system. Whilst this cannot
be avoided if the objective is to try to improve the translation of pronouns when translating
into a language where the number and gender of pronouns is important, decisions taken by
the decoder based on weak statistics may give poor results. This is perhaps more of an issue
given the relatively small size of the parallel training corpus when compared to the resources
used in the development of other SMT systems. One possible area for improvement would
be to reject translated sentences produced by the Annotated translation system in which there
are pronoun translations that are based on low counts of phrase-level occurrence within the
training data. The other obvious solution would be to add more parallel data to the training
corpus. However, as the aim of this project was to use manually annotated corpora in which
the coreference annotation is assumed to be “perfect”, this assumption would need to be relaxed
if the generation of more parallel training data necessitated the use of a coreference resolution
system.
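One possible realisation of the rejection strategy described above can be sketched as a simple post-processing filter. The following Python fragment is purely illustrative: the threshold, the per-pronoun count table and the fallback to the Baseline output are hypothetical choices, not part of the system built for this project.

```python
# Illustrative sketch: fall back to the Baseline translation whenever a
# pronoun in the Annotated output was produced from a phrase pair with a
# low occurrence count in the training data. The count table and the
# threshold are hypothetical, not part of the system described here.

MIN_COUNT = 5  # hypothetical reliability threshold

def choose_output(annotated_sentence, baseline_sentence,
                  pronoun_phrase_counts):
    """Return the Annotated translation unless one of its pronoun
    translations is supported by fewer than MIN_COUNT occurrences."""
    for pronoun, count in pronoun_phrase_counts.items():
        if count < MIN_COUNT:
            return baseline_sentence  # weak statistics: prefer Baseline
    return annotated_sentence

# Example: the phrase pair that produced "ji" was seen only twice.
counts = {"ji": 2, "je": 120}
print(choose_output("annotated output", "baseline output", counts))
# -> baseline output
```

In practice the counts would have to be read from the phrase table of the Annotated system; the sketch only shows the decision logic.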
Potential sources of error are not limited to the annotation and translation process. As mentioned briefly in section 5.1, there are a number of potential sources of error in the automated
evaluation method which should not be overlooked. These sources of error are related to those
already described in relation to the annotation and translation process. The evaluation hinges
not only on the correct identification of the head noun from the English antecedent, but also
on the identification of the Czech translation in the output of the translation system which is
reliant upon the word alignments output by the decoder. If any of these is incorrect, the results
of the evaluation will be affected as it relies upon counts of Czech pronouns that agree in number and gender with the Czech translation of the English antecedent head noun. This, again,
highlights the great need for standard automated evaluation methods for the specific problem
of pronoun translation and, in the case of this project, for methods that are suitable for the evaluation of highly inflective languages such as Czech.
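The agreement check at the heart of this evaluation can be illustrated with a small sketch. The miniature pronoun table below mirrors the format of the tables in Appendix A (a 1 marks a compatible gender); the function name and the table representation are illustrative assumptions, not the actual implementation used in this project.

```python
# Minimal sketch of an agreement-based pronoun evaluation check.
# (pronoun, number) -> gender compatibility, in the style of Appendix A.
PRONOUN_TABLE = {
    ("on",  "singular"): {"masc_anim": 1, "masc_inan": 1, "fem": 0, "neut": 0},
    ("ona", "singular"): {"masc_anim": 0, "masc_inan": 0, "fem": 1, "neut": 0},
    ("je",  "plural"):   {"masc_anim": 1, "masc_inan": 1, "fem": 1, "neut": 1},
}

def agrees(pronoun, number, antecedent_gender):
    """True if the Czech pronoun is compatible with the number and gender
    extracted from the Czech translation of the antecedent head noun."""
    row = PRONOUN_TABLE.get((pronoun, number))
    return bool(row and row.get(antecedent_gender))

# "ona" (she) agrees with a feminine singular antecedent ...
assert agrees("ona", "singular", "fem")
# ... whereas "on" (he) does not.
assert not agrees("on", "singular", "fem")
```

The evaluation then amounts to counting, over the test set, how many pronoun translations pass this check; errors in head-noun identification or word alignment feed directly into the counts, as discussed above.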
5.4 Chapter Summary
This chapter presented the results of the manual and automated evaluations of the output of the Annotated and Baseline systems, together with a discussion of those results and of potential sources of error in the annotation and translation processes as well as in the evaluation itself. The next chapter provides
a conclusion to this work and makes suggestions as to possible areas of further investigation
for future work.
Chapter 6
Conclusion and Future Work
The work carried out as part of this project raises perhaps more questions than it answers. This
chapter outlines the contributions of this project and summarises the outstanding issues which
may impede not only the further progress of this work, but also that of other studies focussing
on the translation of pronouns in SMT.
6.1 Conclusion
Building on the work of Le Nagard and Koehn (2010) and Hardmeier and Federico (2010) and
using a similar method to that developed by Le Nagard and Koehn (2010), this project focussed
on the translation of pronouns in phrase-based English-Czech SMT. The three contributions of
this work are:
1. A prototype annotation and translation system for English-Czech SMT trained on the
Wall Street Journal corpus and a close Czech translation as provided by the PCEDT 2.0
corpus.
2. Automated and manual evaluations of the output of the annotation and translation process
against a baseline system.
3. An aligned parallel corpus (in which the pronouns in the English source side text are
annotated) which may be used in future investigations into methods for improving the
handling of pronoun coreference.
The annotation and translation system uses a two-step process based on the approach taken
by Le Nagard and Koehn (2010). Whilst it is acknowledged that this approach is slow (due
to the incorporation of two translation steps) and cumbersome when compared to the more
elegant solution presented by Hardmeier and Federico (2010), the two-step process provided
a simple framework for the investigation into pronoun translation in English-Czech SMT. Furthermore, it is recognised that a two-step process is impractical and not suitable for real-world
deployment. It does, however, provide a simplification to the problem of obtaining the Czech
translation of the antecedent head noun and is therefore valid in the design of a prototype system that is used as a proof of concept.
Unlike the previous projects by Le Nagard and Koehn (2010) and Hardmeier and Federico
(2010), this project made use of a number of manually annotated corpora to factor out the
effects of both imperfect coreference resolution and alignment in the training data. The use
of these corpora allowed for an assessment of the extent to which the approach of annotating
English pronouns with the number and gender of the Czech translation of their antecedent can improve their translation into Czech. In short, the answer to this question is simple: the performance of
the Annotated system shows little improvement over the Baseline system as measured using
automated and manual evaluations. There are a number of possible reasons for this as discussed
in detail in chapter 5. The two major areas of concern are, firstly, the accuracy of the translation of the English antecedent head noun by the Baseline system, together with its accurate identification in the translation output, and secondly, the potential weakening of the statistics in the phrase translation table of the Annotated system. The amount of data in the parallel training corpus used in this project
is perhaps not enough to provide sufficiently accurate Baseline translations and robust statistics
for the Annotated system’s phrase translation model. The best way to assess the validity of this
claim would be to rebuild the translation models using an extended parallel training corpus.
It is acknowledged that this would likely compromise the assumption of “perfect” coreference, as a second English corpus with manually annotated coreference information and a close Czech translation in a domain similar to that of the WSJ corpus is unlikely to be available. Should it be possible to obtain a suitable parallel English-Czech corpus, a state-of-the-art coreference resolution system such as that developed by Charniak and Elsner
(2009) could be used to provide the missing coreference information.
The problem of pronoun translation in SMT is complex, especially when translating into a
highly inflective language such as Czech where it is important to ensure that pronouns have the
correct number, gender and case and that there is agreement between the pronoun and the head
of its antecedent. It is therefore important to realise that whilst the results for the Annotated
system on two small test sets show a marginal improvement over the Baseline system, this is
based purely on the number and gender of the pronouns and their antecedents. The correct case
of the pronouns, and hence the correct syntactic form, is not considered.
The possibility of further experimentation using the prototype annotation and translation process is limited by a number of factors. Firstly, it is believed that the Wall Street Journal corpus
may be too small for the purposes of this work given the suspected problems associated with
data sparsity arising from the number of genders in Czech and the annotation of “any” in the
absence of a defined number or gender in the PCEDT 2.0 corpus. Secondly, the provision of
only a single reference translation combined with the high degree of inflection in Czech and the
lack of a standard automated evaluation metric presents a problem in deciding how best to evaluate the system output. Thirdly, the question of how best to apply tuning to the two systems used in a two-step translation process, in which consistency in the translation of the antecedent head nouns between the systems is required, is a complex one. This perhaps highlights another argument against
the use of a two-step translation process. It should be noted that this project is not unique in
suffering from these problems, with the first two affecting not only pronoun-focussed translation, but Machine Translation in general. With the topic of evaluation taking a prominent place
in the 2011 Workshop on Machine Translation1 it is clear that there are still many questions
surrounding automated evaluation techniques. Whilst manual evaluation is always an available option, it is not well suited for use during the development of SMT systems in which
experiments are to be run with any degree of frequency. It is clear that the lack of a suitable
automated evaluation method presents a major stumbling block in the path of future progress.
Designing a translation system is only one half of the problem; evaluation of such a system is
the other.
Finally, the problem of ensuring consistency of the Czech translation of the English antecedent
head noun between the Baseline and Annotated systems resulted in the adoption of the default
model weights provided as part of the Moses training. This would not pose a problem for a
single-step translation process in which only one translation model is required and therefore
the issue of consistent translation of the antecedent head noun between two systems would not
be relevant.
This document outlines the work undertaken as part of a three month long MSc project. It is
clear that whilst some progress has been made, three months is not nearly sufficient to tackle all
of the problems related to the development and evaluation of what now appears to have been
a rather ambitious project from the outset. In truth this work has only just begun to scratch
the surface, but it is hoped that work focussing on pronoun translation and the wider issue
of handling discourse level phenomena in Machine Translation will continue. The following
section (6.2) makes a number of suggestions as to the directions in which future work could be
taken. These suggestions are made in light of a number of difficulties encountered during this
project.
1 http://www.statmt.org/wmt11
6.2 Future Work
Improving the accuracy of pronoun translation in Machine Translation remains an open problem and as such there is great scope for future work in this area. Indeed, there may be other
methods for handling pronoun translation that work better than those already investigated. It
may be the case that it is not sufficient to focus solely on the source side and that operations on
the target side must also be considered. There are also many possible directions for future work
in relation to problems identified during the course of this project. These include, but are not
limited to, the handling of pronoun dropping in pro-drop languages such as Czech, Romanian,
Spanish and Italian, the development of pronoun specific evaluation metrics and addressing the
problem of the availability of only a single reference translation.
The explicit handling of pronoun dropping when translating from a non-pro-drop language such as English into a pro-drop language such as Czech is lacking in current Machine Translation systems, and research in this area has been somewhat limited to date.
Exceptions include work in English-Italian translation (Gojun, 2010) with a focus on trying
to improve the translation of subject pronouns by improving the alignment of verb phrases
(in a phrase-based SMT system) that contain pronominal subjects and a method for resolving
intrasentiential zero pronouns in English-Japanese translation (Nakaiwa and Ikehara, 1995).
Kim et al. (2010) developed a method for identifying non-referential zero pronouns in Korean-English translation, but this has yet to be applied to a practical Machine Translation problem.
Work could focus on the identification of pro-drop scenarios in English-Czech translation and
the development of an explicit annotation method with which to mark those English pronouns
that should be dropped in the Czech translation. Another option may be to consider the removal
of pronouns from the English source text that should be dropped in the Czech translation output.
Both options would require a method to predict whether an English pronoun should be dropped
in the Czech translation. This could be achieved either through defining handwritten rules or
by making use of a Machine Learning classifier trained using a parallel English-Czech corpus
with a sufficient coverage of the relevant pro-drop scenarios for the English-Czech language
pair. The PCEDT 2.0 corpus t-layer contains the annotation of pro-dropped pronouns, which
are not realised in the Czech at the w-layer (surface level text) and therefore may prove to be
useful in the pursuit of the explicit handling of pro-drop in English-Czech SMT.
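As an illustration of the first option, the fragment below sketches what such source-side marking might look like. The -DROP marker, the candidate list and the placeholder prediction rule are all hypothetical; a real predictor would rely on handwritten rules or a classifier trained on parallel data, as suggested above.

```python
# Illustrative sketch of source-side annotation for pro-drop: mark
# English subject pronouns that a (hypothetical) predictor decides
# would be dropped in the Czech translation.

DROP_CANDIDATES = {"i", "you", "he", "she", "it", "we", "they"}

def predict_drop(token, is_subject):
    # Placeholder rule: a real system would use handwritten rules or a
    # classifier trained on parallel data (e.g. the PCEDT 2.0 t-layer).
    return is_subject and token.lower() in DROP_CANDIDATES

def annotate(tokens, subject_positions):
    """Append a -DROP marker to pronouns predicted to be pro-dropped."""
    return [t + "-DROP" if predict_drop(t, i in subject_positions) else t
            for i, t in enumerate(tokens)]

print(annotate(["He", "said", "it", "works"], {0}))
# -> ['He-DROP', 'said', 'it', 'works']
```

Under the second option described above, the marked tokens would simply be removed from the source text rather than annotated.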
As Le Nagard and Koehn (2010) and Hardmeier and Federico (2010) have already identified,
the lack of evaluation metrics suited to the specific problem of pronoun translation makes evaluation very difficult. The provision of a robust metric is essential for the evaluation of future
work and in the comparison of different systems in order to establish if progress is being made
and also to identify where sources of error exist. It is also necessary to consider the requirement
for an evaluation metric which addresses not only the problem of evaluating translated pronouns which should be present in the translation output, but also those which should be dropped, or for which pro-drop is a suitable alternative to an overt pronoun. Ideally the development of such a metric would take place prior to future work on the translation problem.
In connection with the issue of evaluation, the provision of only a single reference translation
is a particular problem in the evaluation of pronouns in English-Czech translation due to the
highly inflective nature of Czech and hence the number of possible syntactic forms that a Czech
pronoun may take. In order for evaluation metrics incorporating the notions of precision and
recall to become useful when translating into a highly inflected language, it is necessary to
provide multiple reference translations that capture the range of valid alternatives. Whilst it
is possible to employ the services of a number of translators to provide additional reference
translations based on the same original text, this can be both slow and costly.
As an alternative, the use of paraphrase to automate the generation of synthetic reference translations may be considered. Work by Kauchak and Barzilay (2006) focussed on the use of paraphrase generation to provide sentence-level synthetic reference translations, which could assist
in refining the accuracy of automated evaluation methods in Machine Translation, thereby addressing the gap between automated evaluation and human judgements. Their technique aims
to take a reference sentence and generated Machine Translation system output and find a paraphrase of the reference sentence with wording closer to the Machine Translation system output
than the reference itself. This moves away from prior research in which the aim was to produce
any paraphrase of the reference. However, their technique applies only to content words and
therefore would need to be adapted to the more specific issue of pronouns before it could be
used in practice. More recent work by Chen and Dolan (2011) focusses on the use of crowd-sourcing techniques to obtain sentence-level paraphrase data by asking human participants to
describe what they see in a video and to participate in a separate direct paraphrase task. Using
video as a medium for gathering alternative translations leads to the generation of short texts
and is not suitable for many domains. Crowd-sourcing techniques used to obtain paraphrases
in direct paraphrase tasks and in the solicitation of multilingual translations may also prove
useful in obtaining multiple reference translations. One example of the use of crowd-sourcing
to obtain multiple multilingual translations is Microsoft’s WikiBABEL (Kumaran et al., 2008)
project.
In short, there are a great number of possibilities for further research in this area. The accurate translation of pronouns incorporating the use of coreference resolution techniques is an
extremely interesting and highly important problem for which there remains great scope for
future work.
Appendix A
Czech Pronouns Used in the Automated Evaluation
The following tables of Czech pronouns were used in the automated evaluation. Where the pronoun is a possessive, possessive reflexive or demonstrative pronoun, the gender refers to the object(s) in possession. Where the pronoun is a personal pronoun, the gender refers to the person or group of persons.
In the tables, “Masc. An.” and “Masc. Inan.” are used to denote the “Masculine Animate” and “Masculine Inanimate” genders respectively.
Table A.1: Czech Pronouns: Personal

Czech Pronoun | English Translation | Number | Masc. An. | Masc. Inan. | Feminine | Neuter
je | it | singular | 1 | 1 | 0 | 1
jeho | him, his, it, its | singular | 1 | 1 | 0 | 1
jej | him, it | singular | 1 | 1 | 0 | 1
jemu | him, it | singular | 1 | 1 | 0 | 1
ji | her, it | singular | 0 | 0 | 1 | 0
jí | her, it | singular | 0 | 0 | 1 | 0
jím | him, it | singular | 1 | 1 | 0 | 1
ho | him, it | singular | 1 | 1 | 0 | 1
mu | him, it | singular | 1 | 1 | 0 | 1
ně | it | singular | 0 | 0 | 0 | 1
něho | him, it | singular | 1 | 1 | 0 | 1
něj | him, it | singular | 1 | 1 | 0 | 1
němu | him, it | singular | 1 | 1 | 0 | 1
ni | her, it | singular | 0 | 0 | 1 | 0
ní | her, it | singular | 0 | 0 | 1 | 0
ním | him, it | singular | 1 | 1 | 0 | 1
on | he, it | singular | 1 | 1 | 0 | 0
ona | she, it | singular | 0 | 0 | 1 | 0
ono | it | singular | 0 | 0 | 0 | 1
se | himself, herself, itself, themselves | singular | 1 | 1 | 1 | 1
sebe | himself, herself, itself, themselves | singular | 1 | 1 | 1 | 1
sebou | himself, herself, itself, themselves | singular | 1 | 1 | 1 | 1
si | himself, herself, itself, themselves | singular | 1 | 1 | 1 | 1
sobě | himself, herself, itself, themselves | singular | 1 | 1 | 1 | 1
je | them | plural | 1 | 1 | 1 | 1
jich | them | plural | 1 | 1 | 1 | 1
jim | them | plural | 1 | 1 | 1 | 1
jimi | them | plural | 1 | 1 | 1 | 1
ně | them | plural | 1 | 1 | 1 | 1
nich | them | plural | 1 | 1 | 1 | 1
nim | them | plural | 1 | 1 | 1 | 1
nimi | them | plural | 1 | 1 | 1 | 1
ona | they | plural | 0 | 0 | 0 | 1
oni | they, these, those | plural | 1 | 1 | 0 | 0
ony | they, these, those | plural | 0 | 0 | 1 | 0
se | himself, herself, itself, themselves | plural | 1 | 1 | 1 | 1
sebe | himself, herself, itself, themselves | plural | 1 | 1 | 1 | 1
sebou | himself, herself, itself, themselves | plural | 1 | 1 | 1 | 1
si | himself, herself, itself, themselves | plural | 1 | 1 | 1 | 1
sobě | himself, herself, itself, themselves | plural | 1 | 1 | 1 | 1
Table A.2: Czech Pronouns: Possessive

Czech Pronoun | English Translation | Number | Masc. An. | Masc. Inan. | Feminine | Neuter
jeho | him, it, its | singular | 1 | 1 | 0 | 1
její | hers, its, her | singular | 1 | 1 | 1 | 1
jejích | hers, its, her | singular | 1 | 1 | 1 | 1
jejího | hers, its, her | singular | 1 | 1 | 0 | 1
jejím | hers, its, her | singular | 1 | 1 | 0 | 1
jejími | hers, its, her | singular | 1 | 1 | 1 | 1
jejímu | hers, its, her | singular | 1 | 1 | 0 | 1
jejich | their, theirs | plural | 1 | 1 | 1 | 1
její | hers, its, her | plural | 1 | 1 | 1 | 1
Table A.3: Czech Pronouns: Possessive Reflexive

Czech Pronoun | English Translation | Number | Masc. An. | Masc. Inan. | Feminine | Neuter
svá | his, her, its, their | singular | 0 | 0 | 1 | 0
své | his, her, its, their | singular | 0 | 0 | 1 | 1
svého | his, her, its | singular | 1 | 0 | 1 | 1
svém | his, its | singular | 1 | 1 | 0 | 1
svému | his, her, its | singular | 1 | 1 | 0 | 1
svoje | his, her, its, their | singular | 0 | 0 | 1 | 1
svoji | his, her, its, their | singular | 0 | 0 | 1 | 0
svojí | his, her, its | singular | 0 | 0 | 1 | 0
svou | her, its | singular | 0 | 0 | 1 | 0
svůj | his, her, its | singular | 1 | 1 | 0 | 0
svým | his, her, its, their | singular | 1 | 1 | 0 | 1
svá | his, her, its, their | plural | 0 | 0 | 0 | 1
své | his, her, its, their | plural | 1 | 1 | 1 | 0
sví | their, theirs | plural | 1 | 0 | 0 | 0
svoje | his, her, its, their | plural | 1 | 1 | 1 | 1
svoji | his, her, its, their | plural | 1 | 0 | 0 | 0
svých | their, theirs | plural | 1 | 1 | 1 | 1
svým | his, her, its, their | plural | 1 | 1 | 1 | 1
svými | their, theirs | plural | 1 | 1 | 1 | 1
Table A.4: Czech Pronouns: Demonstrative

Czech Pronoun | English Translation | Number | Masc. An. | Masc. Inan. | Feminine | Neuter
ten | this, he, it | singular | 1 | 1 | 0 | 0
ta | this, she, it | singular | 0 | 0 | 1 | 0
ta | these, they, them | plural | 0 | 0 | 0 | 1
to | this, it | singular | 0 | 0 | 0 | 1
toho | this, him, it | singular | 1 | 1 | 0 | 1
té | this, her | singular | 0 | 0 | 1 | 0
tomu | this, him, it | singular | 1 | 1 | 0 | 1
tu | this, her | singular | 0 | 0 | 1 | 0
tom | this, him, it | singular | 1 | 1 | 0 | 1
tím | this, him, it | singular | 1 | 1 | 0 | 1
tou | this, her | singular | 0 | 0 | 1 | 0
ti | these, they | plural | 1 | 0 | 0 | 0
ty | these, they, them | plural | 1 | 1 | 1 | 0
těch | these, them | plural | 1 | 1 | 1 | 1
těm | these, them | plural | 1 | 1 | 1 | 1
těmi | these, them | plural | 1 | 1 | 1 | 1
Bibliography
Bojar, O. and Hajič, J. (2008). Phrase-based and Deep Syntactic English-to-Czech Statistical
Machine Translation. In Proceedings of the Third Workshop on Statistical Machine Translation, StatMT ’08, pages 143–146, Stroudsburg, PA, USA. Association for Computational
Linguistics.
Bojar, O. and Kos, K. (2010). 2010 Failures in English-Czech Phrase-based MT. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR,
WMT ’10, pages 60–66, Stroudsburg, PA, USA. Association for Computational Linguistics.
Callison-Burch, C., Fordyce, C., Koehn, P., Monz, C., and Schroeder, J. (2007). (Meta-)
Evaluation of Machine Translation. ACL Workshop on Statistical Machine Translation.
Callison-Burch, C., Fordyce, C., Koehn, P., Monz, C., and Schroeder, J. (2008). Further meta-evaluation of machine translation. In Proceedings of the Third Workshop on Statistical
Machine Translation, StatMT ’08, pages 70–106, Stroudsburg, PA, USA. Association for
Computational Linguistics.
Charniak, E. and Elsner, M. (2009). EM Works for Pronoun Anaphora Resolution. In Conference of the European Chapter of the Association for Computational Linguistics, pages
148–156.
Chen, D. L. and Dolan, W. B. (2011). Collecting Highly Parallel Data for Paraphrase Evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational
Linguistics: Human Language Technologies - Volume 1, HLT ’11, pages 190–200, Stroudsburg, PA, USA. Association for Computational Linguistics.
Gojun, A. (2010). Null Subjects in Statistical Machine Translation: A Case Study on Aligning
English and Italian Verb Phrases with Pronominal Subjects. Master’s thesis, Universität
Stuttgart.
Grosz, B. J., Weinstein, S., and Joshi, A. K. (1995). Centering: A Framework for Modeling
the Local Coherence Of Discourse. Computational Linguistics, 21:203–225.
Hajič, J., Panevová, J., Hajičová, E., Panevová, J., Sgall, P., Pajas, P., Štěpánek, J., Havelka,
J., and Mikulová, M. (2006). Prague Dependency Treebank (PDT) 2.0 LDC Catalog No.: LDC2006T01. Technical report, Linguistic Data Consortium.
Hardmeier, C. and Federico, M. (2010). Modelling Pronominal Anaphora in Statistical Machine Translation. In Proceedings of the 7th International Workshop on Spoken Language
Translation.
Hoang, H., Birch, A., Callison-Burch, C., Zens, R., Constantin, A., Federico, M., Bertoldi, N., Dyer, C., Cowan, B., Shen, W., Moran, C., and Bojar, O. (2007). Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180.
Hobbs, J. (1978). Resolving Pronominal References. Lingua 44, pages 311–338.
Kauchak, D. and Barzilay, R. (2006). Paraphrasing For Automatic Evaluation. In Proceedings
of the Main Conference on Human Language Technology Conference of the North American
Chapter of the Association of Computational Linguistics, HLT-NAACL ’06, pages 455–462,
Stroudsburg, PA, USA. Association for Computational Linguistics.
Kim, K.-S., Park, S.-B., Song, H.-J., Park, S., and Lee, S.-J. (2010). Identification of Nonreferential Zero Pronouns for Korean-English Machine Translation. In Zhang, B.-T. and
Orgun, M., editors, PRICAI 2010: Trends in Artificial Intelligence, pages 112–122. Springer
Berlin / Heidelberg.
Kneser, R. and Ney, H. (1995). Improved Backing-Off for M-gram Language Modeling. IEEE
International Conference on Acoustics, Speech, and Signal Processing, 1:181–184.
Koehn, P. (2009). Statistical Machine Translation. Cambridge University Press, 1 edition.
Kumaran, A., Saravanan, K., and Maurice, S. (2008). wikiBABEL: Community Creation of
Multilingual Data. In Proceedings of the 4th International Symposium on Wikis, WikiSym
’08, pages 14:1–14:11, New York, NY, USA. ACM.
Lappin, S. and Leass, H. J. (1994). An Algorithm for Pronominal Anaphora Resolution. Computational Linguistics, 20:535–561.
Le Nagard, R. and Koehn, P. (2010). Aiding Pronoun Translation with Co-reference Resolution. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and
MetricsMATR, WMT ’10, pages 252–261, Stroudsburg, PA, USA. Association for Computational Linguistics.
Linh, N. G., Novák, V., and Žabokrtský, Z. (2009). Comparison of Classification and Ranking
Approaches to Pronominal Anaphora Resolution in Czech. In Proceedings of the SIGDIAL
2009 Conference: The 10th Annual Meeting of the Special Interest Group on Discourse and
Dialogue, SIGDIAL ’09, pages 276–285, Stroudsburg, PA, USA. Association for Computational Linguistics.
Mitkov, R. (1999). Introduction: Special Issue on Anaphora Resolution in Machine Translation
and Multilingual NLP. Machine Translation, 14:159–161.
Mitkov, R., Choi, R. S.-K., and Sharp, R. (1995). Anaphora Resolution in Machine Translation.
In Proceedings of the Sixth International Conference on Theoretical and Methodological
Issues in Machine Translation, pages 5–7.
Mitkov, R., Evans, R., Orasan, C., Barbu, C., Jones, L., and Sotirova, V. (2000). Coreference
and Anaphora: Developing Annotating Tools, Annotated Resources and Annotation Strategies. In Proceedings of the Discourse, Anaphora and Reference Resolution Conference
(DAARC2000), pages 49–58, Lancaster, UK.
Nakaiwa, H. and Ikehara, S. (1995). Intrasentential Resolution of Japanese Zero Pronouns
in a Machine Translation System Using Semantic and Pragmatic Constraints. In Semantic
Constraints Viewed from Ellipsis and Inter-Event Relations (in Japanese), IEICE-WGNLC,
pages 96–105.
Ng, V. (2010). Supervised Noun Phrase Coreference Research: The First Fifteen Years. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL
’10, pages 1396–1411, Stroudsburg, PA, USA. Association for Computational Linguistics.
Och, F. J. (2003). Minimum Error Rate Training in Statistical Machine Translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume
1, ACL ’03, pages 160–167, Stroudsburg, PA, USA. Association for Computational Linguistics.
Och, F. J. and Ney, H. (2003). A Systematic Comparison of Various Statistical Alignment
Models. Computational Linguistics, 29:19–51.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic
evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 311–318, Stroudsburg, PA, USA.
Association for Computational Linguistics.
Saggion, H. and Carvalho, A. (1994). Anaphora Resolution in a Machine Translation System.
In Proceedings of the International Conference: Machine Translation, 10 Years On.
Soon, W. M., Ng, H. T., and Lim, D. C. Y. (2001). A Machine Learning Approach to Coreference Resolution of Noun Phrases. Computational Linguistics, 27:521–544.
Stolcke, A. (2002). SRILM - An Extensible Language Modeling Toolkit. In Proceedings of
ICSLP, volume 2, pages 901–904, Denver, USA.
Strube, M. (2007). Corpus-based and Machine Learning Approaches to Anaphora Resolution.
Anaphors in Text: Cognitive, Formal and Applied Approaches to Anaphoric Reference. John
Benjamins Pub Co.
Čmejrek, M., Hajič, J., and Kuboň, V. (2004). Prague Czech-English Dependency Treebank:
Syntactically Annotated Resources for Machine Translation. In Proceedings of EAMT 10th Annual Conference, page 04.
Weischedel, R. and Brunstein, A. (2005). BBN Coreference and Entity Type Corpus LDC
Catalog No.: LDC2005T33. Technical report, Linguistic Data Consortium.
Weischedel, R., Pradhan, S., Ramshaw, L., Kaufman, J., Franchini, M., El-Bachouti, M., Xue,
N., Palmer, M., Marcus, M., Taylor, A., Greenberg, C., Hovy, E., Belvin, R., and Houston,
A. (2009). OntoNotes Release 3.0 LDC Catalog No.: LDC2009T24. Technical report,
Linguistic Data Consortium.