Improving Pronoun Translation for Statistical Machine Translation (SMT)

Liane Guillou

Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
2011

Abstract

Machine Translation is a well-established field, yet the majority of current systems perform the translation of sentences in complete isolation, losing valuable contextual information from previously translated sentences in the discourse. One such class of contextual information concerns who or what it is that a reduced referring expression such as a pronoun is meant to refer to. The use of inappropriate referring expressions in a target language text can seriously impair the reader's ability to understand it. This project follows on from two recent research papers that focussed on improving the translation of pronouns in Statistical Machine Translation (SMT). The approach taken is to annotate the pronouns in the source language with the morphological properties of the antecedent translation in the target language prior to translation using a phrase-based English-Czech SMT system. The project makes use of a number of manually annotated corpora in order to factor out the effects arising from poor coreference resolution, wherein selecting the wrong antecedent for a pronoun in the source language text will wrongly bias its translation. The aim of this work is to discover whether “perfect” coreference resolution in the source language text can reduce the incidence of inappropriate referring expressions in the target language text. The annotated translation system developed as part of this project makes only a marginal improvement over the baseline system, as measured using a bespoke automated evaluation metric. These results are supported by a manual evaluation conducted by a native Czech speaker.
The lack of substantial improvement over the baseline may be attributed to many factors, not least of which is the highly inflective nature of the Czech language.

Acknowledgements

I would like to thank my supervisor, Professor Bonnie Webber, for her continued guidance and support from the conception of this project through to its realisation. I am deeply grateful for the patience that she has shown in explaining to me those concepts that were difficult to grasp, for setting me on the correct path when I became lost and, most of all, for infecting me with her enthusiasm for this work. I have thoroughly enjoyed my time spent working on this project and I couldn't have asked for anything more in terms of the supervision I have received in my first foray into the field of Machine Translation. Special thanks are owed to Dr. Markéta Lopatková and Dr. Ondřej Bojar at Charles University. I am indebted to Markéta for her suggestions, enthusiasm and assistance with the analysis of results at every stage of this project. Her expertise in Czech Natural Language Processing has proved invaluable and I can honestly say, as a monolingual speaker, that without her help this project would not have been possible. I am also extremely grateful to Ondřej for his recommendations with respect to the stemming of the English and Czech data to obtain shared word alignments for the translation models, and for his suggestions regarding the automated evaluation of the translation output. Thanks also to Christian Hardmeier for his patience in answering my many questions in relation to his previous work on pronoun translation and evaluation. Credit is also owed to David Mareček at Charles University, who created the PCEDT 2.0 alignment file used in this project. Finally, I would like to thank my colleagues for their company during the long days spent in the computer labs and their assistance in peer reviewing this document.
The PCEDT 2.0 corpus, which is not yet publicly available, has been used with permission from the Institute of Formal and Applied Linguistics, Charles University, Prague.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Liane Guillou)

I dedicate this thesis to my mother, Anna Guillou, who instilled in me from an early age the importance of education and made sacrifices to ensure that I received the very best. Her love, encouragement and unwavering support have been instrumental throughout my life, and have given me the confidence that I needed to embark upon this course of further study. Words alone cannot convey my gratitude.

Table of Contents

1 Introduction
  1.1 Definition of the Problem
  1.2 Background
  1.3 Previous Work
    1.3.1 Focus on Pronoun Translation in Machine Translation
    1.3.2 English-Czech Machine Translation
  1.4 Example of Poor Pronoun Translation
  1.5 Hypothesis and Contributions
  1.6 Chapter Summary
2 Concepts
  2.1 Anaphora and Coreference
  2.2 Coreference Resolution
  2.3 Czech Language
  2.4 Phrase-based Statistical Machine Translation
  2.5 Moses
  2.6 Evaluation in Machine Translation
    2.6.1 Automated Evaluation
    2.6.2 Manual Evaluation
  2.7 Chapter Summary
3 Data
  3.1 BBN Pronoun Coreference and Entity Type Corpus
  3.2 Penn Treebank 3.0 Corpus
  3.3 PCEDT 2.0 Corpus
  3.4 Chapter Summary
4 Methodology
  4.1 Overview
  4.2 Assumptions
  4.3 Datasets
  4.4 Constructing the Language Model
  4.5 Combining the Corpora
    4.5.1 Identification of Coreferential Pronouns and their Antecedents
    4.5.2 Extraction of the Antecedent Head Noun
    4.5.3 Extraction of Morphological Properties from the PCEDT 2.0 Corpus
  4.6 Training the Translation Models
    4.6.1 Computing the Word Alignments
    4.6.2 Tuning the Translation System Weights: Minimum Error Rate Training (MERT)
    4.6.3 Annotation of the Training Set Data
  4.7 The Annotated Translation Process
  4.8 Annotation and Translation System Architecture
  4.9 Evaluation
    4.9.1 Automated Evaluation: Assessing the Accuracy of Pronoun Translations
    4.9.2 Manual Evaluation: Error Analysis and Human Judgements
  4.10 Chapter Summary
5 Results and Discussion
  5.1 Automated Evaluation
  5.2 Manual Evaluation
  5.3 Critical Evaluation of the Approach and Potential Sources of Error
  5.4 Chapter Summary
6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future Work
A Czech Pronouns Used in the Automated Evaluation
Bibliography

Chapter 1

Introduction

The primary aim of this project is to produce more accurate coreferring expressions in the target language within English to Czech Statistical Machine Translation (SMT). To date there have been few attempts to integrate coreference resolution methods into Machine Translation. Notable exceptions include two recently published articles focussing on English to French/German translation of third person personal pronouns. This project considers the translation of pronouns in English-Czech SMT, which is a more complex issue due to certain properties of the Czech language. Czech is a highly inflective language (like German) that exhibits subject pro-drop and has a “free word-order”, i.e. the word order reflects the information structure of the discourse.
Whilst considerable progress has been made in Machine Translation research, little attention has been paid to cross-sentence coreference (Le Nagard and Koehn, 2010). The recent work of both Le Nagard and Koehn (2010) and Hardmeier and Federico (2010), focussing on third-person personal pronoun translation for SMT, represents a realisation of the need to address this gap. In particular, it represents an acknowledgement that the appropriate translation of discourse-level phenomena, including pronominal reference, is essential to ensure that the translated text makes sense to its intended audience. As Le Nagard and Koehn (2010) state, current Machine Translation methods treat sentences as mutually independent and therefore do not handle the cross-sentence dependencies that can arise due to the use of anaphoric reference. The recent work of Le Nagard and Koehn (2010) and Hardmeier and Federico (2010) demonstrates an interest within the research community in improving overall translation quality via the accurate translation of pronouns. Whilst the method proposed by Le Nagard and Koehn (2010) showed little improvement, the method presented by Hardmeier and Federico (2010) showed a small but significant improvement as measured by their bespoke automated scoring metric, which incorporates precision and recall.

This project investigates whether the approach used by Le Nagard and Koehn (2010) can improve pronoun translation in English-Czech SMT. This method was selected in preference to that used by Hardmeier and Federico (2010) due to its simplicity. A major difference between this project and previous work is the use of manually annotated corpora in place of coreference resolution algorithms to extract pronoun antecedents, and in place of automated methods to identify antecedent head nouns. These corpora provide coreference annotation and noun phrases from which the head noun can be extracted with little effort.
This marks the first attempt to assess the potential for source language coreference to improve pronoun translation in SMT by exploiting “perfect” manual source language coreference annotation. Furthermore, it is the first attempt to apply the technique of source language pronoun annotation to the English-Czech language pair.

The motivation for using the English-Czech language pair is threefold. Firstly, the availability of the PCEDT 2.0 parallel English-Czech corpus, provided by the Institute of Formal and Applied Linguistics at Charles University, Prague, coincided with the start of this project. Secondly, for a monolingual speaker such as the author, the choice of the second language in the pair is fairly arbitrary, but depends on the availability of a native speaker to assist in the evaluation of the translation system output and to provide language-specific assistance during the development of such a system. This project benefited enormously from the expert advice of Dr. Markéta Lopatková at Charles University, Prague. The third, and perhaps most salient, reason for choosing Czech as the second language in the translation pair is that Czech is a subject pro-drop language. That is, in Czech, an explicit subject pronoun may be omitted if its antecedent can be predicted on the grounds of saliency and/or verb morphology. It was initially envisaged that the system developed as part of this project would be designed to explicitly handle this phenomenon. However, due to the complexity of designing a pronoun-focussed translation system and devising a strategy for evaluating the system output, this has been left as a future extension to this project.

This document describes in detail the approach taken in the investigation of whether source language annotation may improve pronoun translation in English-Czech SMT.
The remainder of this chapter defines the problem, introduces the concept of anaphora resolution and its application in Machine Translation, and presents the hypothesis upon which this project is based. Chapter 2 introduces the key concepts and chapter 3 the corpora used in the project. Chapter 4 describes the approach taken in the development of the annotation and translation system and the evaluation of its output. The results of the evaluation are presented and discussed in chapter 5 and the project is concluded in chapter 6. Possible options for future continuation of this work are also included in chapter 6, with suggestions reflecting some of the key issues highlighted in the preceding chapters.

1.1 Definition of the Problem

Pronouns can be used as anaphoric expressions; when a pronoun is used anaphorically, it is called a coreferential pronoun. This is the phenomenon known as anaphora. In Czech, as in many other languages, the number and gender of a personal pronoun must agree with the number and gender of its antecedent. When observing this phenomenon in discourse it is common for the pronoun's antecedent to appear in an earlier sentence than the pronoun itself, presenting a problem for current state-of-the-art Machine Translation systems, which translate sentences in isolation. When sentences are translated in isolation, the contextual information present in the preceding sentences is lost. In the case of a coreferential pronoun whose antecedent appears in a previous sentence, information about that antecedent will have been lost by the time the sentence in which the pronoun occurs is considered for translation. The translation of the pronoun is then carried out with no knowledge of the number and gender of the pronoun's antecedent. Consider the translation of the English pronoun “it” into Czech for the following simple examples:

1. The dog has a ball. I can see it playing outside.
2. The cow is in the field.
I can see it grazing.
3. The car is in the garage. I will drive it to school later.

(Examples adapted from “Local Lingo”, an online Czech language resource: http://www.locallingo.com/)

In each of the examples, the English pronoun “it” refers to an entity that has a different gender in Czech. In order to translate the pronoun correctly into Czech it is necessary to identify the gender (and number) of the entity to which the pronoun refers and to ensure that the gender (and number) of the pronoun agrees. In example 1, “it” refers to the dog (“pes”, masculine) and should be translated as “jeho/ho/jej”. In example 2, “it” refers to the cow (“kráva”, feminine) and should be translated as “ji”. In the final example, 3, “it” refers to the car (“auto”, neuter) and should be translated as “je/jej/ho”.

In Czech, within the masculine gender, a distinction is made between animate objects (e.g. people and animals) and inanimate objects (e.g. buildings). In many cases the same pronoun may be used for both the animate and inanimate masculine genders, but there are a number of cases in which different pronouns must be used. For example, in the case of possessive reflexive pronouns in the accusative case, “svého” is used to refer to a dog (masculine animate, singular) that belongs to someone, e.g. “I admired my (own) dog”: “Obdivoval jsem svého psa”. This is in contrast with “svůj”, which is used to refer to a castle (masculine inanimate, singular) that belongs to someone, e.g. “I admired my (own) castle”: “Obdivoval jsem svůj hrad”.

The problem of identifying the entity to which a pronoun refers is termed anaphora resolution. Section 1.2 outlines a brief history of anaphora resolution with particular reference to its incorporation in the field of Machine Translation. The concept of anaphora and the closely related concept of coreference are described in greater detail in chapter 2.
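The agreement requirement in the examples above can be sketched as a simple lookup. The gender lexicon and pronoun table below are hand-written toy data covering only the three example nouns; they are not part of the thesis and are far from a full Czech paradigm:

```python
# Toy illustration: selecting candidate Czech translations of English "it"
# requires the gender of the antecedent's Czech translation.
ANTECEDENT_GENDER = {
    "pes": "masculine",   # dog (animate)
    "kráva": "feminine",  # cow
    "auto": "neuter",     # car
}

# Candidate translations of "it" per gender, as listed in the examples.
IT_FORMS = {
    "masculine": ["jeho", "ho", "jej"],
    "feminine": ["ji"],
    "neuter": ["je", "jej", "ho"],
}

def candidate_translations(antecedent_cz):
    """Candidate Czech forms of 'it', given its antecedent's Czech translation."""
    return IT_FORMS[ANTECEDENT_GENDER[antecedent_cz]]

print(candidate_translations("kráva"))  # ['ji']
```

Without knowledge of the antecedent, a sentence-by-sentence translation system has no principled way to choose among these forms.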
1.2 Background

Anaphora resolution involves identifying the antecedent of a referring expression, typically a pronominal or noun phrase expression that is used to refer to something previously mentioned in the discourse (the antecedent). In the case where multiple referring expressions refer to the same antecedent, these expressions are said to be coreferential; such relationships can be represented using coreference chains. Mitkov et al. (1995) assert that the identification of an anaphor's antecedent is often crucial to ensure a correct translation, especially in cases in which the target language of the translation marks the gender of pronouns.

The problems of anaphora resolution and the related task of coreference resolution have sparked considerable research within the field of Natural Language Processing (NLP). Strube (2007) charts the changes from early techniques that modelled linguistic knowledge algorithmically, such as Hobbs's algorithm (Hobbs, 1978), the Centering model (Grosz et al., 1995) and Lappin and Leass's algorithm (1994), through to the supervised and semi-supervised Machine Learning methods commonly used today. Even within the sphere of Machine Learning, there is still much debate as to which method provides the best results. Early methods include the recasting of coreference resolution as a binary classification task to which Machine Learning techniques can be applied, which Strube (2007) credits to Soon et al. (2001). In contrast, Linh et al. (2009) argue that ranking-based models are more suited to the task of anaphora resolution. Ng (2010) also argues in favour of ranking models, which allow for the identification of the most probable candidate antecedents, claiming that they outperform other classes of supervised Machine Learning methods.
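The binary-classification recasting credited to Soon et al. (2001) can be illustrated in miniature: each (candidate antecedent, anaphor) pair is scored and accepted or rejected. The features and hand-set weights below are invented for illustration; real systems learn a classifier from annotated corpora:

```python
# Toy mention-pair sketch: classify a (candidate antecedent, anaphor) pair
# as coreferent or not, using simple agreement features.
def pair_features(candidate, anaphor):
    """Boolean agreement features for one mention pair."""
    return {
        "gender_match": candidate["gender"] == anaphor["gender"],
        "number_match": candidate["number"] == anaphor["number"],
        "adjacent": anaphor["sent"] - candidate["sent"] <= 1,
    }

def is_coreferent(candidate, anaphor, threshold=2.0):
    # Hand-set weights stand in for a learned model.
    weights = {"gender_match": 1.0, "number_match": 1.0, "adjacent": 0.5}
    feats = pair_features(candidate, anaphor)
    score = sum(w for name, w in weights.items() if feats[name])
    return score >= threshold

cow = {"gender": "fem", "number": "sg", "sent": 0}
it = {"gender": "fem", "number": "sg", "sent": 1}
print(is_coreferent(cow, it))  # True
```

A ranking model, by contrast, would compare all candidate antecedents for one anaphor and select the highest scoring, rather than judging each pair independently.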
In order to improve methods for anaphora resolution based on supervised Machine Learning, as well as to serve as “gold standards” for evaluation, parallel efforts have been pursued to manually annotate large corpora with coreference chains. The OntoNotes 3.0 corpus (Weischedel et al., 2009) and the BBN Pronoun Coreference and Entity Type corpus (Weischedel and Brunstein, 2005), used in this project, are examples of such corpora.

Despite continued efforts to develop methods for anaphora resolution, there has been little work focusing on the integration of anaphora resolution and SMT systems. Le Nagard and Koehn (2010) argue that work on SMT has not moved beyond sentence-level translation. Furthermore, they assert that the translation ambiguity arising from the use of pronouns cannot be resolved within the context of a single sentence if a pronoun refers to an antecedent from a previous sentence. Hardmeier and Federico (2010) present a case study of the performance of one of their SMT systems on personal pronouns to illustrate that improved handling of pronominal anaphora may lead to improvements in translation quality. They report that the SMT system is unable to find a suitable translation for anaphoric pronouns in 39% of cases and that, while choosing the wrong pronoun does not generally affect important content words, it can make the output translations difficult to understand.

1.3 Previous Work

1.3.1 Focus on Pronoun Translation in Machine Translation

Early work on the integration of anaphora resolution with Machine Translation includes that of Mitkov et al. (1995), Lappin and Leass (1994) and Saggion and Carvalho (1994). Mitkov et al. (1995) focussed on intersentential anaphora resolution, conjoining sentences to simulate the intersententiality that could be handled by the rule-based CAT2 Machine Translation system.
They provided example output from their system showing instances where pronouns are translated correctly from English to German. However, they provided only the details of their approach and several examples, offering no information relating to the evaluation of their method. Lappin and Leass (1994) integrated their RAP algorithm into a logic-based Machine Translation system, but the core focus of their work was on anaphora resolution rather than on Machine Translation. Saggion and Carvalho (1994) used a transfer approach combined with Artificial Intelligence techniques and focussed on both intersentential and intrasentential anaphora resolution for the translation of pronouns in Portuguese to English translation. This interest in the 1990s culminated in the publication of a special issue on anaphora resolution in Machine Translation, with an introduction provided by Mitkov (1999).

No further evidence of work on the integration of anaphora resolution and Machine Translation systems is available until 2010, when papers on the subject were published by Le Nagard and Koehn (2010) and Hardmeier and Federico (2010). This resurgence of interest in anaphora resolution for Machine Translation systems follows advances in the field since the 1990s which have made the application of these new approaches possible.

The approach taken by Le Nagard and Koehn (2010) involves the identification of the antecedent of each coreferential occurrence of ‘it’ and ‘they’ in the source language (English), together with the identification of the antecedent's translation into the target language (French) and its grammatical gender. Based on the gender of the noun in the target language, the occurrence of ‘it’ in the source language text is replaced by it-masculine, it-feminine or it-neutral. The same is applied for occurrences of ‘they’.
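The source-side annotation step just described can be sketched as follows. The coreference links and the gender lookup are supplied by hand in this toy example; Le Nagard and Koehn obtained them from a coreference resolver and from the word alignment of a baseline translation:

```python
# Sketch of Le Nagard and Koehn-style pronoun annotation: rewrite each
# coreferential "it"/"they" with the gender of its antecedent's
# target-language translation, e.g. "it" -> "it-feminine".
def annotate_pronouns(tokens, links, antecedent_gender):
    """links maps the index of an 'it'/'they' token to its antecedent's index."""
    out = list(tokens)
    for pron_idx, ante_idx in links.items():
        if tokens[pron_idx].lower() in ("it", "they"):
            out[pron_idx] = tokens[pron_idx] + "-" + antecedent_gender[ante_idx]
    return out

tokens = ["The", "window", "is", "open", ".", "Close", "it", "."]
links = {6: 1}             # "it" corefers with "window"
genders = {1: "feminine"}  # "fenêtre" is feminine in French
print(annotate_pronouns(tokens, links, genders))
# ['The', 'window', 'is', 'open', '.', 'Close', 'it-feminine', '.']
```

The annotated text is then used in place of the plain source text, both when training the translation model and when translating new input.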
Using the Moses toolkit (Hoang et al., 2007), they trained an SMT system on annotated training data composed using the annotation method previously described, before applying the same process to the test data as part of the translation process. During training, the French translation of the English antecedent is extracted from the parallel corpus using the word alignment obtained as part of the process of training their baseline system. When running test translations, they first translate the test text using the baseline system to extract the French translations of the English antecedents. They then use the gender of the French word to annotate the English pronoun before translating the annotated test text using the system trained on annotated training data. This approach treats the annotation of pronouns as a separate task which is performed outside of the translation process. The authors report little change in the BLEU score of their system over the baseline and instead resort to manually counting the number of correctly translated pronouns. Whilst they attribute the lack of improvement of their system to the poor quality of their coreference resolution system, they claim that the process works well when the coreference resolution system provides accurate results.

The approach taken by Hardmeier and Federico (2010) differs in that it provides a single-step process whereby the identification of a pronoun's antecedent in the source language and the extraction of its target language translation's morphological properties are integrated into the translation process as an additional model in their SMT system. This additional model maintains a mapping between each source language pronoun and the number and gender of its antecedent. Translation is achieved by first processing the source language test text using a coreference resolution system to identify coreferential pronouns and their antecedents.
The output of the coreference resolution system is used as input to a decoder driver module, which runs a number of Moses decoder processes in parallel. The decoder driver then feeds individual sentences to the decoder processes using a priority queue to order sentences according to how many pronoun antecedents they contain. Thus sentences that contain a greater number of antecedents are translated first, ensuring a high throughput of the system. The authors report no significant improvement in BLEU score between their system and the baseline, but they do report a small but significant improvement in pronoun translation recall against a single reference translation.

The approach used in this project is similar to that taken by Le Nagard and Koehn (2010). Whilst their project required the use of a coreference resolution system to build coreference chains, the provision of a source language corpus with manually annotated coreference information allowed this project to focus on the translation problem. This project also accommodates a wider range of English pronouns than the study by Le Nagard and Koehn (2010), which only considered the translation of ‘it’ and ‘they’.

1.3.2 English-Czech Machine Translation

Much of the recent work in English-Czech SMT has been conducted at the Institute of Formal and Applied Linguistics at Charles University, Prague. Research has been conducted in many areas, including the development of parallel corpora suitable for the development of Machine Translation systems, such as the PCEDT 2.0 corpus used in this project and its predecessor, the PCEDT 1.0 corpus (Čmejrek et al., 2004). Another area of research has concentrated on the development of both phrase-based and dependency-based SMT systems.
In a comparative study of phrase-based and dependency-based SMT systems, Bojar and Hajič (2008) concluded that their best phrase-based system outperformed the experimental dependency-based system, but work continues in both directions. The decision to focus on phrase-based SMT in this project is due to its simplicity, which, given the relatively short time-scale, is an important factor. That phrase-based systems currently outperform dependency-based systems in English-Czech SMT is an added bonus.

1.4 Example of Poor Pronoun Translation

As an example of poor pronoun translation, consider the following English sentence from the Wall Street Journal corpus and its translation (by a Machine Translation system) into Czech:

he said mexico could be one of the next countries to be removed from the priority list because of its efforts to craft a new patent law .

řekl , že mexiko by mohl být jeden z dalších zemí , aby byl odvolán z prioritou seznam , protože její snahy podpořit nové patentový zákon .

In this example, the English pronoun “its”, which refers to “mexico”, is translated into Czech as “její” (feminine, singular), while “mexico” is translated as “mexiko” (neuter, singular). Here, the Czech translation of the pronoun and its antecedent disagree in gender. A more correct translation of the pronoun would be “jeho” (neuter, singular possessive pronoun) or “své” (reflexive possessive pronoun), depending on the overall structure of the translated sentence.

1.5 Hypothesis and Contributions

The work of Hardmeier and Federico (2010) focussed on English to German translation, whilst Le Nagard and Koehn (2010) focussed on English to French translation. This project considers the translation of pronouns in English to Czech SMT and builds on the work of Le Nagard and Koehn (2010) and Hardmeier and Federico (2010).
By factoring out the problems of automated coreference resolution, parsing, part-of-speech (POS) tagging and morphological tagging, this project attempts to assess how well an approach that explicitly annotates pronouns in the source language could work when applied to English-Czech SMT under “perfect” conditions. Whereas French (a Romance language) and German (a Germanic language) share a similar root with English, the differences between English and Czech are even greater. Therefore, this project assesses the suitability of a pronoun annotation approach in improving the translation of pronouns not merely into another language, but into a language that is very different from English. It is believed that this project is the first attempt made to explicitly handle the problem of pronoun translation in Czech SMT.

This project makes three major contributions:

1. A prototype system for the annotation and translation of pronouns in English-Czech SMT.
2. Automated and manual evaluations of the output of the system as compared against a baseline.
3. An annotated, aligned parallel corpus which could be used in future investigations into pronoun translation in English-Czech SMT.

1.6 Chapter Summary

This chapter introduced the specific problem of pronoun translation in SMT, discussed previous work in relation to anaphora resolution, pronoun-focussed Machine Translation and English-Czech SMT, and outlined the hypothesis on which this work is based. The next chapter will describe in detail many of the concepts that are essential to the understanding of the problem, as well as the approach taken in the development of the annotation and translation system and its evaluation.

Chapter 2

Concepts

2.1 Anaphora and Coreference

Anaphora is a discourse-level phenomenon in which the interpretation of one expression is dependent on another, previously mentioned expression, known as the antecedent.
For example, in the text below, the word “He” at the start of the second sentence refers to “J.P. Bolduc” at the start of the first sentence. In order to understand the meaning of the second sentence, the reader must first identify the referent of the pronoun “He” (which in this example is “J.P. Bolduc”).

J.P. Bolduc, vice chairman of W.R. Grace & Co., which holds an 83.4% interest in this energy-services company, was elected a director. He succeeds Terrence D. Daniels, formerly a W.R. Grace vice chairman, who resigned. (Example taken from the Wall Street Journal corpus)

Where anaphora is concerned with referring to a previously mentioned expression in the discourse, coreference is the act of referring to the same referent (Mitkov et al., 2000), such that multiple expressions that refer to the same referent are said to be coreferential. Coreference chains may be established in order to link multiple referring expressions to the same antecedent expression.

This project focuses on the translation of already resolved instances of nominal anaphora, in which a referring expression (a pronoun, definite Noun Phrase (NP) or proper name) has a non-pronominal NP as its antecedent (Mitkov et al., 2000). The project makes use of manually annotated corpora from which instances of coreferential (and anaphoric) pronouns and their antecedents are identified, in order to annotate training data with which to train an SMT system.

2.2 Coreference Resolution

Coreference resolution is the process of identifying the referent to which a referring expression refers. In this project, the pronouns are the referring expressions and the antecedents are the referents. As discussed in chapter 1, there has been much research into the development of automated methods to provide coreference and anaphora resolution.
Such automated methods were used by both Le Nagard and Koehn (2010) and Hardmeier and Federico (2010), but it is well documented that these methods do not achieve perfect accuracy. Indeed, Le Nagard and Koehn (2010) cite the poor performance of their coreference resolution as a possible reason for their lack of improvement in pronoun translation. In this project, a manually annotated coreference corpus (the BBN Coreference and Entity Type corpus) is used to identify coreferential pronouns and their antecedents. As the corpus has been manually annotated, the coreference annotation is assumed to be highly accurate.

2.3 Czech Language

Czech is a member of the western group of Slavic languages. Like other Slavic languages it is highly inflective, with seven cases and four grammatical genders: masculine animate (for people and animals), masculine inanimate (for inanimate objects), feminine and neuter. In the case of the feminine and neuter genders, animacy is not grammatically marked. Czech is a free word-order language, in which word order reflects the information structure of the sentence within the current discourse. In addition, Czech is a pro-drop language; an explicit subject pronoun may be omitted if it may be inferred from some other grammatical feature, for example verb morphology (information on Czech from “The Czech Language”, an online guide: http://www.czech-language.cz). In contrast with Czech, English is neither a highly inflectional nor a pro-drop language. Furthermore, English follows a Subject-Verb-Object (SVO) pattern for word order and lacks grammatical gender.

2.4 Phrase-based Statistical Machine Translation

Phrase-based models are currently the best performing SMT models (Koehn, 2009). The concept behind these models is the decomposition of the translation problem into a number of smaller word sequences, called phrases, which are translated one at a time in order to build the complete translation.
It is important to note that a phrase may be any sequence of words of arbitrary length and that there is no deep linguistic motivation behind the choice of segmentation. Phrase-based models have several advantages over word-based models in which words are translated in isolation. Firstly, phrase-based models provide a simple solution to the problem where a single word in the source language translates into multiple words in the target language or vice versa. Secondly, translating phrases rather than single words can help to resolve translation ambiguities. Finally, with phrase-based models, the notions of insertion and deletion that are present in word-based models are no longer necessary, leading to a model that is conceptually simpler. The three components that make up a phrase-based model are the translation model, language model and reordering model. The translation model takes the form of a phrase translation table which provides a mapping between the source and target language phrases and the probabilities associated with each mapping. The phrase translation table is learned by creating word alignments between the aligned sentence pairs of a parallel training corpus. The word alignments are collected for both translation directions, the alignment points are merged and then those phrases that are consistent with the word alignment are extracted. The probabilities that are assigned to each phrase mapping in the table are calculated by counting the number of (parallel) sentence pairs a particular phrase pair appears in, and then computing the relative frequency of this count compared with the count of the source phrase translating as any other phrase in the target language.
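The extraction of phrase pairs consistent with a word alignment can be illustrated with a small sketch. This is a simplified version of the standard consistency-based extraction (it does not, for instance, extend phrase pairs over unaligned boundary words as full implementations do), and the toy sentence pair is invented for illustration:

```python
def extract_phrases(src, tgt, alignment, max_len=7):
    """Extract phrase pairs consistent with a word alignment.

    alignment: set of (src_index, tgt_index) links.
    A phrase pair is consistent if no alignment link connects a word
    inside the pair to a word outside it.
    """
    phrases = []
    for s_start in range(len(src)):
        for s_end in range(s_start, min(len(src), s_start + max_len)):
            # target positions linked to any source word in the span
            t_links = [t for (s, t) in alignment if s_start <= s <= s_end]
            if not t_links:
                continue
            t_start, t_end = min(t_links), max(t_links)
            # consistency check: no target word in the span may be
            # aligned to a source word outside the source span
            if any(t_start <= t <= t_end and not (s_start <= s <= s_end)
                   for (s, t) in alignment):
                continue
            phrases.append((" ".join(src[s_start:s_end + 1]),
                            " ".join(tgt[t_start:t_end + 1])))
    return phrases

# invented toy example: "the castle is old" / "hrad je stary"
src = ["the", "castle", "is", "old"]
tgt = ["hrad", "je", "stary"]
align = {(0, 0), (1, 0), (2, 1), (3, 2)}
pairs = extract_phrases(src, tgt, align)
```

Note how “the” alone is not extracted with “hrad”, because “hrad” is also linked to “castle” outside that span; only the pair (“the castle”, “hrad”) is consistent.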
The language model ensures the fluency of the translations output by the model, providing a means to score and hence identify the best output translation from a list of candidate translations. The language models used in SMT are typically n-gram language models which consist of n-grams in the target language together with probabilities based on maximum likelihood estimation. A language model is usually constructed from the target side of the parallel corpus used in the training of the translation model, and may be augmented by additional in-domain target data, or weighted with a separate out-of-domain language model. Smoothing is often applied to improve the reliability of the probability estimates, with modified Kneser-Ney smoothing commonly used in SMT (Kneser and Ney, 1995). The reordering model allows phrases in the source language to be taken out of sequence when building the translation in the target language, thereby allowing phrase-level reordering. Allowing unlimited reordering can have a detrimental effect on translation quality, and so it is usual for a penalty to be associated with any reordering that takes place. Penalties are assigned such that a larger cost is associated with the movement of a phrase that skips more word positions than with one that skips fewer word positions. In phrase-based SMT, these three models are combined as a linear model. The best translation arg max_c p(c|e) is computed using Bayes’ Rule, which combines the three components of the phrase-based model as in the equation below: the translation model φ(e|c), the language model P_LM(c) and the reordering model Ω(e|c):

arg max_c p(c|e) = arg max_c φ(e|c) ∗ P_LM(c) ∗ Ω(e|c)

where ‘e’ is an English sentence and ‘c’ is the Czech translation of that sentence. Once the components of the phrase-based model have been constructed, their weights are tuned to optimise the overall model performance.
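The combination of the three components can be sketched numerically. The candidate translations, component probabilities and weights below are invented purely for illustration:

```python
import math

def log_linear_score(features, weights):
    """Weighted sum of log probabilities; the argmax over candidate
    translations gives the best translation under the model."""
    return sum(weights[k] * math.log(features[k]) for k in features)

# hypothetical component scores for two candidate Czech translations:
# tm = translation model, lm = language model, rm = reordering model
candidates = {
    "hrad je stary": {"tm": 0.4, "lm": 0.05, "rm": 0.9},
    "je hrad stary": {"tm": 0.4, "lm": 0.01, "rm": 0.3},
}
weights = {"tm": 1.0, "lm": 1.0, "rm": 1.0}
best = max(candidates, key=lambda c: log_linear_score(candidates[c], weights))
```

With all weights equal to 1.0, maximising this score is equivalent to maximising the product in the equation above; tuning adjusts the weights to reflect how much each component should be trusted.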
Tuning is carried out using a dataset that is kept separate from the main training dataset for this specific purpose. Minimum Error Rate Training (MERT) (Och, 2003) is a commonly used tuning technique in SMT. MERT tunes the model weights to optimise performance as measured using BLEU scores calculated against one or more reference translations. BLEU will be described in more detail in section 2.6. In Machine Translation, the process of finding the best scoring translation according to the model is referred to as decoding (Koehn, 2009). Using a phrase-based translation model, decoding is carried out by starting with a source sentence and building the translation from left to right, extracting source phrases in any order. The phrases are translated into the target language and then ‘stitched’ together to make a complete translation. The source words covered by each phrase are then marked as translated and the process continues until all of the source words have been covered. As there are many possible valid translations of a single source language sentence, these variations must be captured. This is achieved using a search graph from which the single best translation (or an N-best list) may be derived using a scoring method that uses a language model and the phrase table probabilities.

2.5 Moses

Moses (Hoang et al., 2007) is an open source SMT toolkit that provides automated training of translation models and may be used with any language pair, given a parallel training corpus. Moses may be used to construct both tree-based and phrase-based translation models but for the purpose of this project only the phrase-based training was required. The automated training process produces a phrase translation table and a lexicalised reordering model. The language model is created separately using the target side of the parallel corpus together with additional in-domain corpus data as required.
The training process consists of a number of steps which include data preparation, the creation of word alignments using Giza++ (Och and Ney, 2003), extraction and scoring of phrases and building the generation and lexicalised reordering models (a full description of the Moses translation system training process can be found at http://www.statmt.org/moses/). The generation model contains probabilities for both directions of translation. During testing, in which a sentence or collection of sentences from the test corpus (which are not also included in the training corpus) are translated, the Moses decoder constructs a search graph and uses a beam search algorithm to select the translation with the highest probability from that graph. The search graph is constructed using the process of hypothesis expansion. Hypothesis combination and pruning are then employed to reduce the search space. In the Moses implementation of beam search, hypotheses that cover the same number of foreign words are compared and those with high cost (low probability) are pruned. The cost of each hypothesis is calculated using a combination of the cost of translation and the estimated future cost of translating the remaining source text for the current sentence. Whilst the decoder may be used to output an N-best list of translations for an input sentence, in this project only the best translation is required and therefore only a single translation is requested from the decoder.

2.6 Evaluation in Machine Translation

Evaluation in Machine Translation typically falls into one of two categories: manual or automated. Whilst automated methods are used to ascertain improvements during the development of a Machine Translation system, manual methods using either monolingual or bilingual human judges are typically used to provide the final evaluation. Currently there are no standard automated metrics available for the evaluation of pronoun translation in SMT.
Hardmeier and Federico (2010) developed their own bespoke automated metric incorporating precision and recall measured against a single reference translation. In contrast, Le Nagard and Koehn (2010) relied on manually counting the number of correctly translated pronouns in their system output. Manual evaluation of the results is slow and therefore not a practical solution for large volumes of text. Furthermore, for a monolingual SMT system developer, manual evaluation must be outsourced to a third party, adding an additional hindrance to the development process. In this project, the Czech translations output by the phrase-based SMT system were evaluated using a combination of manual and automated methods. The manual methods used focussed on human judgements as to whether pronouns in the Machine Translation output were correctly used or dropped and, if they were incorrectly used, whether a native Czech speaker would be able to understand the meaning of the sentence as a whole. BLEU, an automated metric widely used in the evaluation of SMT systems, was used during system development as a preliminary check to confirm that the system output was valid Czech, before a more detailed automated analysis of the results was conducted. The evaluation methods used in this project are discussed in more detail in chapter 4.

2.6.1 Automated Evaluation

BLEU (Papineni et al., 2002) is an automated evaluation metric widely used in SMT to assess the overall quality of the output translations. It provides an efficient and low cost alternative to human judgements during iterations of development cycles to measure system improvement. It computes a document-level score of the translated output against a single reference translation or a set of reference translations (Koehn, 2009).
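As context for the formula presented in the next paragraph, the BLEU computation (clipped n-gram precision combined with a brevity penalty) can be sketched as follows. This is a minimal, unsmoothed version intended only to make the definition concrete, not a replacement for a standard implementation:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """BLEU against a single reference (sketch). Inputs are token lists."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(candidate[i:i + n])
                              for i in range(len(candidate) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n])
                             for i in range(len(reference) - n + 1))
        # clipped n-gram precision p_n: matches are capped by the
        # reference count of each n-gram
        matched = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        log_precisions.append(math.log(max(matched, 1e-9) / total))
    # uniform weights w_n = 1/N
    score = math.exp(sum(log_precisions) / max_n)
    # brevity penalty: penalise candidates shorter than the reference
    bp = 1.0 if len(candidate) >= len(reference) else \
        math.exp(1 - len(reference) / len(candidate))
    return bp * score
```

An identical candidate and reference score 1.0; a candidate that drops words is penalised both by the lower n-gram precision and by the brevity penalty.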
The BLEU score is based on a combination of n-gram precision and a brevity penalty:

BLEU = BP ∗ exp( ∑_{n=1}^{N} w_n log p_n )

The n-gram precision (p_n) is the ratio of n-grams of order n in the output translation that are present in the reference translation to the total number of n-grams of order n in the output translation, and the w_n are positive weights that sum to one. The brevity penalty (BP) ensures that the length of the output translation is not too short, as compared with the length of the reference translation. The effect of the brevity penalty is that the BLEU score is reduced if the output translation is shorter than the reference translation, i.e. where words are dropped in the output translation. The BLEU score is applied at the document level in order to allow some freedom in translation output length at the sentence level, for example where a single source sentence may be translated into two sentences in the target language, or vice versa. BLEU has been widely criticised (Koehn, 2009), yet remains one of the most popular automated evaluation metrics in use with SMT systems due to its high correlation with human judgements of quality (Papineni et al., 2002). With respect to the specific problem of pronoun translation evaluation in Czech, two further criticisms apply. Firstly, as the sole focus of this project is pronoun translation, only a small number of words are expected to change between the translations produced by the baseline and annotated translation systems. Therefore, the variation in BLEU score is expected to be very small. Observations regarding the shortcomings of BLEU in relation to the evaluation of pronoun translation have been made previously by both Le Nagard and Koehn (2010) and Hardmeier and Federico (2010).
Secondly, Czech is a highly inflective language with four genders and seven cases, so with only a single reference translation provided in the PCEDT 2.0 corpus it is not reasonable to evaluate the output of the translation systems using a recall-based method. Bojar and Kos (2010) are critical of the use of BLEU scores in the evaluation of English-Czech SMT, claiming that BLEU scores correlate poorly with human judgements. It is for these reasons that BLEU was not used in the evaluation of the systems developed as part of this project.

2.6.2 Manual Evaluation

The manual evaluation of Machine Translation output can be rather complex. Human judges are typically required to rate a single target language text using a five point scale or to rank several target language texts based on fluency (whether the text is fluent) and adequacy (whether the meaning of the source language text has been captured) (Koehn, 2009). Evaluation based on fluency and adequacy judgements suffers from a number of problems. Firstly, it can be slow and unreliable (Callison-Burch et al., 2008). Secondly, the scores assigned by human judges in the measurement of fluency and adequacy are often very close, suggesting that the judges may find it difficult to make a clear distinction between the two criteria. Thirdly, there are concerns that without explicit instructions, many human judges develop their own rules or misinterpret the intended use of an absolute scale and instead score the output of multiple systems relative to one another (Callison-Burch et al., 2007). Finally, manual evaluation using such criteria tends to be subjective, which can lead to poor agreement between a group of human judges. Again, these manual methods tend to focus on sentences as a whole and are therefore not wholly applicable to the more specific problem of evaluating pronoun translation.
2.7 Chapter Summary

This chapter introduced the concepts of anaphora and coreference resolution and provided an introduction to phrase-based SMT, the Moses toolkit and the methods currently used in the evaluation of Machine Translation output. In particular, the various issues associated with automated and manual evaluation methods were highlighted with respect to their application to the more specific problem of evaluating pronoun translation. The next chapter will introduce the manually annotated corpora used in this project.

Chapter 3 Data

In the development of the annotation and translation process, a number of manually annotated corpora in both English and Czech are used: the BBN Pronoun Coreference and Entity Type corpus for the English (source) side of the parallel corpus and the identification of coreferential pronouns and their antecedents, and the PCEDT 2.0 corpus for the Czech (target) side of the parallel corpus. Each corpus contains text or a translation of the original text taken from a subset of the Wall Street Journal (WSJ). It is the provision of these manually annotated corpora that allowed the project to focus solely on the translation problem, without the need for automated methods for coreference or anaphora resolution. In addition, the annotation of the WSJ files within the Penn Treebank 3.0 corpus is used to identify a single antecedent head word in the case where the antecedent extracted from the BBN Pronoun Coreference and Entity Type corpus spans multiple words. This is particularly important as, in order to extract the number and gender of a Czech word, it is necessary to first identify the head of the English antecedent. The corpora are described in detail in the following sections.

3.1 BBN Pronoun Coreference and Entity Type Corpus

The BBN Pronoun Coreference and Entity Type corpus (Weischedel and Brunstein, 2005) provides annotations of the WSJ file texts with pronoun coreference and entity types, together with the raw English text.
For the purpose of this project, two files from the corpus are used: the WSJ.sent file that contains the raw English sentences and the WSJ.pron pronoun coreference file that contains a list of coreferential pronouns together with their antecedents. In the pronoun coreference file, coreferential pronouns and their antecedents are indexed using sentence and word token numbers. The WSJ.sent file has the format:

(WSJ0005
S1: J.P. Bolduc , vice chairman of W.R. Grace & Co. , which ...
S2: He succeeds Terrence D. Daniels , formerly a W.R. Grace ...
S3: W.R. Grace holds three of Grace Energy ’s seven board seats .
)

For each file in the corpus collection, the sentences are numbered and listed in the order in which they appear in the text. The WSJ.pron file has the format:

(WSJ0005
(
Antecedent -> S1:1-2 -> J.P. Bolduc
Pronoun -> S2:1-1 -> He
)

For each WSJ file in the collection, each antecedent and the pronouns that refer to it are listed, together with the number of the sentence in which they appear and the start and end positions of the word(s) within the sentence. It was initially envisaged that the OntoNotes 3.0 corpus (Weischedel et al., 2009) would be used to identify coreferential pronouns and their antecedents. However, the annotation in the BBN Coreference and Entity Type corpus allows for a simpler method of identification and extraction than the OntoNotes 3.0 corpus. The OntoNotes 3.0 corpus is then left as an alternative source of coreference information. Due to differences in the choice of which types of coreference are annotated in these corpora, the use of the OntoNotes 3.0 corpus as an alternative or additional source of coreference information would allow for an investigation into the translation of ‘it’, ‘this’ and ‘that’ marked as event coreference.

3.2 Penn Treebank 3.0 Corpus

The Penn Treebank 3.0 corpus contains manually annotated parse trees of the sentences within the WSJ corpus.
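Returning briefly to the WSJ.pron format shown above: its regular record structure makes extraction straightforward. The sketch below infers the record grammar from the excerpt alone, so it may not cover every variation present in the corpus:

```python
import re

def parse_pron_file(text):
    """Parse WSJ.pron-style coreference records (sketch based on the
    excerpt above). Returns {doc_id: [(antecedent, pronoun), ...]},
    where each mention is (sentence, start, end, surface)."""
    docs = {}
    doc_id = None
    antecedent = None
    for line in text.splitlines():
        line = line.strip()
        m = re.match(r"\(WSJ(\d+)", line)
        if m:
            doc_id = "WSJ" + m.group(1)
            docs[doc_id] = []
            continue
        m = re.match(
            r"(Antecedent|Pronoun)\s*->\s*S(\d+):(\d+)-(\d+)\s*->\s*(.+)",
            line)
        if m:
            kind, sent, start, end, surface = m.groups()
            mention = (int(sent), int(start), int(end), surface.strip())
            if kind == "Antecedent":
                antecedent = mention
            else:
                docs[doc_id].append((antecedent, mention))
    return docs

sample = """(WSJ0005
(
Antecedent -> S1:1-2 -> J.P. Bolduc
Pronoun -> S2:1-1 -> He
)
)"""
links = parse_pron_file(sample)
```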
The merged files within the corpus contain both parse and part of speech annotation, and as such may be used to identify Noun Phrases (NPs) and, through the use of simple rules, the head of an NP. The corpus contains separate merged files for each WSJ file. Within each file, a parse is provided for each sentence, with part of speech tags provided for each word or token. These sentence level parses have the format:

( (S (NP-SBJ-1 (DT The) (NNP U.S.) )
    (, ,)
    (S-ADV (NP-SBJ (-NONE- *-1) )
      (VP (VBG claiming)
        (NP (NP (DT some) (NN success) )
          (PP-LOC (IN in)
            (NP (PRP its) (NN trade) (NN diplomacy) )))))

‘‘The U.S. claiming some success in its trade diplomacy...’’

In the case that the BBN Coreference and Entity Type corpus identified “The U.S.” as the antecedent of the pronoun “its”, the NP “(NP-SBJ-1 (DT The) (NNP U.S.) )” is extracted from the sentence level parse. The rightmost noun of the NP (“U.S.”) is then extracted as the head of the NP.

3.3 PCEDT 2.0 Corpus

The Prague Czech-English Dependency Treebank (PCEDT 2.0) corpus is a collection of English-Czech parallel resources suitable for use in SMT experiments. It contains a subset of the Wall Street Journal corpus in English with a close Czech translation (created manually) that has been manually annotated with deep syntactic (tectogrammatical) and morphological information. These Czech translations form the Czech side of the parallel corpus included in both the training and testing sets. The PCEDT 2.0 corpus data is split into a number of XML format files corresponding to the three layers of annotation that exist for each WSJ file in the corpus collection. These layers are the morphological layer (m-layer), analytical layer (a-layer) and tectogrammatical layer (t-layer). The corpus also contains the word layer (w-layer), an un-annotated, tokenised copy of the text which is segmented into WSJ files and paragraphs. The organisation and interconnection of these layers is shown in figure 3.1.
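The rightmost-noun head rule from section 3.2 can be sketched as follows. This is a deliberate simplification; production head-finding rules are considerably richer:

```python
import re

def np_head(np_parse):
    """Extract the head of an NP as its rightmost noun (the simple rule
    described in section 3.2). np_parse is a bracketed Penn Treebank
    NP string."""
    # collect (tag, word) leaves in left-to-right order
    leaves = re.findall(r"\((\S+)\s+([^()\s]+)\)", np_parse)
    nouns = [word for tag, word in leaves if tag.startswith("NN")]
    return nouns[-1] if nouns else None

head = np_head("(NP-SBJ-1 (DT The) (NNP U.S.) )")
```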
The annotation standard of these layers follows that of the Prague Dependency Treebank 2.0 (Hajič et al., 2006). (Version 2.0 of the PCEDT corpus is not yet publicly available, but is an extension of the PCEDT 1.0 corpus: http://ufal.mff.cuni.cz/pcedt/.)

Figure 3.1: Diagram showing the annotation layers of the PCEDT 2.0 corpus (image taken from the documentation of the Prague Dependency Treebank 2.0 corpus: http://ufal.mff.cuni.cz/pdt2.0/)

The m-layer forms the lowest level of annotation. In this layer, the tokens in the w-layer are divided into sentences and annotated with morphological lemma, tag and ID attributes. The tag attribute is a 15-character string, representing the token’s part of speech and a number of morphological properties, including number and gender. The ID attribute provides a unique identifier which is used to link back to the w-layer. The a-layer forms the middle level of annotation, with sentences from the m-layer represented as trees with labelled nodes and edges. In this layer, there is a one-to-one mapping between each token and its corresponding token in the m-layer, with an edge between the nodes that represent the tokens. Each node in the a-layer has six attributes, including an ID attribute and attributes representing surface syntactic information such as coordination and apposition. The “m.rf” attribute links an a-layer node to the corresponding node in the m-layer. The t-layer forms the highest level of annotation, with sentences represented as trees which reflect the deep linguistic structure of the sentence. Unlike the a-layer, in which each node has a one-to-one mapping with a corresponding morphological token in the m-layer, at the t-layer not all of the morphological tokens are represented (for example, nodes representing prepositions are dropped). Also, additional nodes may be added at this level, for example to represent an omitted subject where subject pro-drop has occurred.
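The 15-character positional tags of the m-layer can be decoded by position. The sketch below assumes the standard Prague Dependency Treebank convention (position 1 is part of speech, position 3 is gender, position 4 is number) and maps only the values discussed in this chapter:

```python
def decode_tag(tag):
    """Decode part of speech, gender and number from a 15-character
    PDT-style positional morphological tag (sketch; only the values
    relevant here are mapped)."""
    genders = {"M": "masculine animate", "I": "masculine inanimate",
               "F": "feminine", "N": "neuter"}
    numbers = {"S": "singular", "P": "plural"}
    return {
        "pos": tag[0],
        "gender": genders.get(tag[2], "unspecified"),
        "number": numbers.get(tag[3], "unspecified"),
    }

# "hrad" (castle): noun, masculine inanimate, singular
info = decode_tag("NNIS1-----A----")
```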
The t-layer contains 39 attributes for every node, including attributes representing deep structure properties and those used for the purpose of linking back to the a-layer. In addition, a list of nodes in the PCEDT 2.0 corpus, together with the corresponding Czech word and aligned English word, was used. This “PCEDT 2.0 alignment file” was composed using a method that combines Giza++ alignments extracted from the PCEDT 2.0 corpus with the extracted t-layer nodes for each of the aligned words. This list of nodes forms the word alignment between the Czech side of the PCEDT 2.0 corpus and the English BBN Pronoun Coreference and Entity Type corpus. It should be noted that this alignment information is separate from that produced by Giza++ as part of the training of the phrase-based SMT systems.

3.4 Chapter Summary

This chapter introduced the three manually annotated corpora used in this project, described the structure of the data and highlighted the specific information that is provided by each corpus. The next chapter describes in detail the approach taken in the development of the baseline and annotation and translation systems and the automated and manual methods used in the evaluation of these systems.

Chapter 4 Methodology

4.1 Overview

This project follows a similar method to that used by Le Nagard and Koehn (2010), whereby the annotation of pronouns in the source language text is applied prior to translation, leaving the translation process itself unaffected. The annotation of the (English) source language text and its subsequent translation (into Czech) is achieved via a two-step process (see figure 4.1) that makes use of two phrase-based translation systems. The first, hereafter referred to as the Baseline system, is trained using unannotated English and Czech sentence-aligned parallel training data taken from the PCEDT 2.0 and BBN Coreference and Entity Type corpora.
The second system, hereafter referred to as the Annotated system, is trained using the same parallel training data, in which the pronouns in the English text are annotated with the number and gender of the Czech translation of the original English antecedent head noun, so that agreement between the Czech pronoun and the translated antecedent can be learned. This alignment of English and Czech words is obtained from the PCEDT 2.0 alignment file that was provided in addition to the corpus. The Baseline system serves a dual purpose; as well as its incorporation within the two-step translation process, it also serves as the baseline against which the translations output by the Annotated system are compared. In addition to the translation systems, an annotation process is required. This process is used to take an English text file, identify those pronouns that are coreferential and their antecedents, and annotate the pronouns with the number and gender of the Czech word that the English antecedent translates to. The coreferential pronouns and their antecedents are extracted from the BBN Coreference and Entity Type corpus and the Czech translation of the English antecedent is obtained from the translation output of the Baseline system. In using the Czech translation of the English antecedent from the Baseline system translation output, a simplification is introduced.

Figure 4.1: Diagram showing the two-step annotation and translation process (the input English text is translated by the Baseline system; coreferential pronouns and antecedent head nouns are identified using the BBN and Penn Treebank corpora; the number and gender of the Czech translation of each antecedent head are extracted via the PCEDT corpus; the annotated English text is then translated by the Annotated system to produce the output)

Whilst the pronoun and its antecedent may occur in the same sentence, in many cases the antecedent will appear in a previous sentence.
Therefore, in order to identify the translation of many of the antecedents it is necessary to translate the previous sentence(s) before translating the current sentence. Rather than translating the text sentence by sentence, the complete source language text is translated using the Baseline system (as a block) and the Czech translations of the English antecedents are extracted from this output. This mirrors the solution used by Le Nagard and Koehn (2010) and provides a simplification of the problem of obtaining the Czech translation prior to annotation. Another option would be to translate sentence by sentence, but this would make no difference to the final outcome as the output of the Baseline system remains the same irrespective of the method employed (at least within a two-step process). The original English text is annotated such that all coreferential pronouns for which a Czech translation of the antecedent is found are marked with the number and gender of that Czech word. The output of the annotation process is thus the same English text that was input to the Baseline system, with the addition of the annotation of the coreferential pronouns. This annotated English text is then translated using the Annotated translation system, the output of which is the final translation of the complete annotation and translation process. The two main differences between the implementation of this project and that by Le Nagard and Koehn (2010) lie in the translation language pair and the methods used in the extraction of coreference information and morphological properties of the target translations of the antecedents. Le Nagard and Koehn (2010) use the English-French language pair in their work and use only the gender of the antecedents in the annotation of the English pronouns.
They omit number from the annotation on the basis that singular English pronouns rarely translate in French as plural pronouns and that incorporating both number and gender in the annotation would introduce further segmentation of the training data. In Czech, both number and gender are important in determining the syntactic form of many pronouns. For example, the pronoun “je” is ambiguous in Czech and may be used as both neuter singular and as plural with any gender. Moreover, the syntactic form of possessive reflexive pronouns is dependent not only on the gender of the object(s) in possession, but also on the number of objects. Whilst the issue of increased segmentation of the training data (as a result of including both number and gender in the annotation of the English pronouns) is acknowledged, if the aim is to improve the translation of pronouns, both number and gender are necessary in Czech. Hardmeier and Federico (2010) also annotate their pronouns using both number and gender in the translation of the English-German language pair. The second main difference is that in this project, the identification of coreferential pronouns and their antecedents and the morphological properties of words in the output of the Baseline system are achieved using manually translated corpora, which are deemed to be highly accurate. In contrast, Le Nagard and Koehn used automated methods to extract this information and as such introduced additional sources of potential error into their process. Another possible approach would be to implement a system using a similar method to that used by Hardmeier and Federico (2010), whereby the source language text is translated sentence by sentence using a single-step process. 
The advantage of this approach is that if a pronoun’s antecedent appears in an earlier sentence, which will often be the case, then the translation of the antecedent will already be known by the time that the sentence in which the pronoun appears is considered for translation. The same does not hold, however, when the pronoun and its antecedent appear in the same sentence, as the translation of the antecedent is not yet known. The two-step process used in this project and by Le Nagard and Koehn (2010) provides a simple solution to the issue of obtaining the Czech translation of the English antecedent head. It is, however, acknowledged that the single-step translation system implemented by Hardmeier and Federico (2010) represents a more elegant solution to the problem. That is not to say that the solution presented by Hardmeier and Federico is perfect, but it does have a major advantage over the two-step method in that it is, rather obviously, more efficient to translate the text only once. Given the relatively short time-scale of this project, the simpler two-step translation process, incorporating the translation of texts as a complete block, was selected in preference to a single-step translation process. As it is only the pronouns that are expected to change between the translation output of the Baseline and Annotated translation systems, this method is deemed to be a satisfactory alternative to the single-step method, the issue of efficiency notwithstanding. Problems arising from the use of a two-step process with respect to building the Baseline and Annotated systems are discussed in section 4.6.2.

Figure 4.2: Overview of the Annotation Process

The annotation process is shown in figure 4.2. In this simple two sentence translation example, the second sentence contains a coreferential instance of the personal pronoun “it”, which refers to “castle” in the first sentence.
In the first step of the process, the coreferential pronoun ("it") is identified, before its antecedent head noun ("castle") is identified in the second step. The Czech translation of the antecedent head noun ("Hrad"; Czech for "castle") is then obtained from the translation of the previous sentence in step 3, and the number and gender of the Czech word are extracted in step 4. In the final step, the pronoun is annotated in the English sentence before being submitted to the Annotated translation system.

Table 4.1: 3rd Person Pronouns

                                  Singular                  Plural
  Personal                        she, her, he, him, it     they, them
  Reflexive                       himself, herself, itself  themselves
  Possessive (preceding a noun)   his, her, its             their
  Possessive (used alone)         his, hers                 theirs

The 3rd person personal pronouns for which annotation is applied are shown in table 4.1. The demonstrative pronouns "this", "these", "that" and "those" are not marked as coreferential in the BBN Coreference and Entity Type corpus and are therefore excluded. Additionally, non-referential (pleonastic) pronouns have been excluded from the annotation process and the accuracy of their translations is not assessed as it falls outside the scope of this project. Performance of these pronouns is therefore expected to be the same in both the Baseline and Annotated systems. Whilst the 3rd person personal pronouns "he", "she", "him" and "her" are unambiguous, they were included in the annotation in order to highlight instances of subject pro-drop. As discussed in chapter 1, one of the main reasons for selecting Czech as the second language in the translation language pair was that it is a subject pro-drop language. Despite the lack of explicit handling of subject pro-drop scenarios in this project, the translation system's ability to handle this phenomenon was of interest.
These pronouns are therefore annotated in order to assess the extent to which the translation systems are able to 'learn' scenarios in which the subject pronoun may be dropped, without the use of additional contextual information. It is assumed that, as these pronouns are unambiguous, their annotation will not serve to further fragment the training data. Provided that the correct antecedent head noun is identified, these pronouns should always be labelled as singular and with the correct gender. The performance of the systems was evaluated in terms of an automated evaluation of the pronoun translation and a manual evaluation by a native Czech speaker.

4.2 Assumptions

A number of simplifying assumptions are asserted with respect to the manually annotated corpus resources:

1. That the coreference resolution in the manually annotated BBN Coreference and Entity Type corpus is "perfect"
2. That the annotation of morphological properties of Czech words in the PCEDT 2.0 corpus is "perfect"
3. That the PCEDT 2.0 alignment file contains a "perfect" alignment of English words and their Czech translations
4. That the annotation of NPs in the Penn Treebank 3.0 corpus is "perfect"

In this case "perfect" is deemed to be the best possible annotation of the corpora, or alignment in the case of the PCEDT 2.0 alignment file. This assumption is made as the corpora have been manually annotated, ensuring a high degree of accuracy. It is acknowledged that these assumptions are unrealistic, but they are made in order to define the boundaries of what is achievable given the resources available. This is in contrast to the lower level of accuracy that is expected from the use of automated tools to achieve coreference resolution in the source language, and the extraction of morphological properties of words in the target language.
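The five-step annotation process described above can be sketched in miniature. The function names and data shapes below are hypothetical illustrations; only the output format (Pronoun.gender.number, e.g. it.mascin.sg) follows the convention used in this project:

```python
# Minimal sketch of the pronoun annotation step (hypothetical function and
# data shapes; the real system combines BBN, Penn Treebank and PCEDT data).

def annotate_pronoun(pronoun, gender, number):
    """Annotate an English pronoun as pronoun.gender.number, e.g. it.mascin.sg."""
    return f"{pronoun}.{gender}.{number}"

def annotate_sentence(tokens, annotations):
    """annotations maps token positions to the (gender, number) of the Czech
    translation of the pronoun's antecedent head noun."""
    out = []
    for i, tok in enumerate(tokens):
        if i in annotations:
            gender, number = annotations[i]
            out.append(annotate_pronoun(tok, gender, number))
        else:
            out.append(tok)
    return " ".join(out)

# "it" refers to "castle", translated as "hrad" (masculine inanimate, singular)
sentence = annotate_sentence(
    ["it", "was", "built", "long", "ago"], {0: ("mascin", "sg")})
```

Only pronouns whose antecedent head noun yielded a Czech noun with usable morphology would appear in the annotations mapping; all other tokens pass through unchanged.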
4.3 Datasets

The data set used to train the translation systems and the testing data sets used to test the systems were compiled from the English and Czech translations contained in the PCEDT 2.0 and BBN Coreference and Entity Type corpora. The data sets were constructed so as to allocate as much data to the training set as possible, whilst leaving a small portion of at least 1,500 sentences for testing. As contextual information is necessary in the annotation of pronouns and the analysis of the output in testing, it was necessary to ensure that for each WSJ file, the complete set of sentences was allocated to either the training or the testing set. The allocation of files to the testing set was achieved via random selection, with the exception of the hand selection of five files that formed the Development test set. These files were selected due to greater familiarity with their text and the annotation in the PCEDT 2.0 corpus, making analysis and manual evaluation of the translation system output easier. This set was intended to be used in the manual analysis of progress at each stage of the development of the annotation and translation processes. The training set was constructed using the remainder of the parallel English-Czech WSJ files available in the PCEDT 2.0 corpus. It excludes duplicate sentences¹ and those already present in the test set, as well as sentences longer than 100 words (in either English or Czech), as recommended for the Moses training process.

¹ Duplicate sentences occur in several places in the Wall Street Journal corpus. For example, in weekly summaries of interest and exchange rates, where the same text regularly appears at the start and/or end of the column.
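The file-level allocation described above, in which whole WSJ files are assigned to one set so that no document is split across training and testing, can be sketched as follows (file names, counts and the seeding are illustrative assumptions, not the project's actual script):

```python
# Sketch of a file-level train/test split: whole WSJ files are allocated to
# a set at random, so no document is split across sets. Hand-picked
# development files are held out of both the pool and the training set.
import random

def split_files(wsj_files, n_test_files, dev_files, seed=0):
    """Randomly pick test files; hand-picked dev_files never enter training."""
    rng = random.Random(seed)
    pool = [f for f in wsj_files if f not in dev_files]
    test = set(rng.sample(pool, n_test_files))
    train = [f for f in pool if f not in test]
    return train, sorted(test), list(dev_files)

files = [f"wsj_{i:04d}" for i in range(10)]
train, test, dev = split_files(files, n_test_files=2, dev_files=["wsj_0003"])
```

Because the split operates on file names rather than sentences, the contextual information needed for pronoun annotation stays intact within each set.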
Table 4.2: Datasets

                          Parallel Sentences   Czech Words   English Words
  Training Set                        47,549       955,018       1,024,438
  Weight Tuning Set                      500         9,342          10,265
  Final Test File                        540        10,110          11,907
  Development Test File                  280         5,467           6,114

Table 4.3: Language Model

                           Sentences    Czech Words
  Total Combined Corpus    2,295,172     34,474,301

An additional data set, the "Weight Tuning Set", was set aside for the sole purpose of tuning the weights of the translation systems. This process will be described in more detail in section 4.6.2. Details of all three data sets are provided in table 4.2. The Language Model corpus was constructed using a combination of the target side of the parallel training corpus (including those sentences that were removed to comply with Moses training requirements) and the Czech monolingual 2010 and 2011 News Crawl corpora². Following the removal of all duplicate sentences, the three corpora were combined to form a single language model corpus, from which the language model was constructed. This was possible as all three corpora are taken from the same 'Newswire' domain. Had they originated from different domains, another solution would have been to construct separate language models from the different corpora. Details of the language model corpus are given in table 4.3.

4.4 Constructing the Language Model

The language model used for the purpose of scoring translations during the decoding process in both the Baseline and Annotated systems was a 3-gram model, constructed from the Czech monolingual language model corpus described in section 4.3. The language model was constructed using the SRILM toolkit (Stolcke, 2002) with interpolated Kneser-Ney discounting (Kneser and Ney, 1995) applied.

² Provided for the Sixth EMNLP Workshop on Statistical Machine Translation: http://www.statmt.org/wmt11/
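For reference, one standard formulation of the interpolated Kneser-Ney estimate for a trigram model is shown below. This is the textbook form of the scheme cited above; SRILM's implementation differs in detail (e.g. in how the discount d is estimated from training counts):

```latex
P_{\mathrm{KN}}(w \mid u,v) \;=\;
  \frac{\max\bigl(c(u,v,w) - d,\; 0\bigr)}{c(u,v)}
  \;+\; \lambda(u,v)\, P_{\mathrm{KN}}(w \mid v),
\qquad
\lambda(u,v) \;=\; \frac{d \cdot N_{1+}(u,v,\cdot)}{c(u,v)}
```

Here c(·) are n-gram counts, N₁₊(u,v,·) is the number of distinct words observed after the bigram (u,v), and the recursion bottoms out in a continuation-count unigram distribution. The interpolation weight λ(u,v) redistributes the discounted mass to the lower-order estimate.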
4.5 Combining the Corpora

The first step of the project was to develop a method for identifying coreferential pronouns in the English text, their antecedent (in English) and the antecedent's translation in Czech. The method for the identification of coreferential pronouns and their antecedents in the English text is common to the training and testing tasks. However, the method used for the identification of the Czech translation of the English antecedent differs between these tasks. In the annotation of the training data used to build the Annotated translation system, the Czech translation of the antecedent is simply obtained from the alignment provided in the PCEDT 2.0 alignment file. This file has the added advantage of containing the t-layer nodes of the Czech words, via which the number and gender may be extracted from the corresponding m-layer node. During testing it is necessary to obtain the Czech translation of the English antecedent as output by the translation system and use the number and gender of that word to annotate the English pronoun. The implementation focussed initially on combining information from the source language BBN Coreference and Entity Type and Penn Treebank 3.0 corpora. The BBN Coreference and Entity Type corpus was used to identify coreferential pronouns and their antecedents. The Penn Treebank 3.0 corpus was then used to extract the head noun of the antecedent from those antecedents which spanned several words. It is necessary to extract the antecedent head noun because, in the annotation of English pronouns with the number and gender of their antecedent, the morphological properties must be derived from a single Czech word (per antecedent).

4.5.1 Identification of Coreferential Pronouns and their Antecedents

The identification of coreferential pronouns is achieved by reading the WSJ.pron file provided as part of the BBN Coreference and Entity Type corpus and described in section 3.3.
As this file provides the WSJ file name, sentence number and sentence-internal word positions of the pronouns and their antecedent(s), the extraction of this information is relatively simple. The word position information is later used in the mapping of the English antecedent head noun to its Czech translation via the PCEDT 2.0 alignment file, in order to extract the morphological properties with which to annotate the English pronoun. It should be noted that through the use of the BBN Coreference and Entity Type corpus to identify coreferential pronouns, the misidentification of (non-referential) pleonastic pronouns as coreferential does not arise. For example, consider the case of the pronoun "it" in the sentence "It is raining.". Here, "it" does not refer to an entity or event and would therefore not be marked as coreferential in the BBN Coreference and Entity Type corpus. The misidentification of such pronouns can, however, cause problems for coreference resolution systems.

4.5.2 Extraction of the Antecedent Head Noun

The identification of coreferential pronouns and the extraction of their antecedent(s) from the BBN Coreference and Entity Type corpus is straightforward due to the simple structure of the WSJ.pron file. However, the extraction of the head noun from antecedents that consist of more than a single word is more complex. Whilst it is possible to use part-of-speech taggers to tag the words in the antecedent string and derive linguistically motivated rules to identify the head noun, the provision of annotated parse trees for the WSJ sentences in the Penn Treebank 3.0 corpus provided a more robust means of extracting this information. The extraction of the head noun from the antecedent NP is achieved by overlaying the antecedent obtained from the BBN Coreference and Entity Type corpus with the NPs annotated in the merged files of the Penn Treebank 3.0 corpus to obtain a match.
Due to differences in annotation between the two corpora, it is often the case that the antecedent does not exactly match a complete NP in the Penn Treebank 3.0 corpus. Where this is the case, the closest partial match is obtained, ensuring that the word identified as the head noun in the NP annotation in the Penn Treebank 3.0 corpus is also present in the antecedent. Where an antecedent matches a nested NP in the Penn Treebank 3.0 corpus, the rightmost noun of the leftmost NP (in the nested construction) is extracted. It is this that provides the robustness over the previously mentioned alternative method, and it is particularly effective in the extraction of the head noun in appositive constructions.

4.5.3 Extraction of Morphological Properties from the PCEDT 2.0 Corpus

Whilst different strategies are used to obtain the morphological properties of a Czech word corresponding to the English antecedent head noun in the annotation of the English pronouns in the training data (section 4.6.3) and as part of the annotated translation process (section 4.7), the objective is the same. That is, the number and gender of the Czech word must be obtained from the m-layer of the PCEDT 2.0 corpus. As described in chapter 3, the m-layer contains a tag attribute which consists of a string of 15 characters that represent various morphological properties of the Czech word, including its number and gender. An investigation of the annotation of the nouns identified as the Czech translations of the English antecedent head nouns in the training data revealed:

- Five genders: masculine animate, masculine inanimate, feminine, neuter and "any"
- Three numbers: singular, plural and "any"

The use of "any" in the annotation of gender denotes a Czech word that may take any gender. Similarly, the use of "any" in the annotation of number denotes a Czech word that may be either singular or plural.
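Reading the number and gender off the 15-character positional tag can be sketched as below. The sketch assumes the standard Prague-style tag layout (position 1 = part of speech, position 3 = gender, position 4 = number); the mapping tables are abridged and the short codes such as "mascan" are illustrative, chosen to mirror the "mascin" code seen in the annotation examples:

```python
# Sketch of extracting number and gender from a PDT-style 15-character
# positional tag (assumed layout: tag[0] = part of speech, tag[2] = gender,
# tag[3] = number; mapping tables abridged for illustration).
GENDER = {"M": "mascan", "I": "mascin", "F": "fem", "N": "neut", "X": "any"}
NUMBER = {"S": "sg", "P": "pl", "X": "any"}

def extract_properties(tag):
    """Return (pos, gender, number) codes for a 15-character positional tag."""
    assert len(tag) == 15, "positional tags are 15 characters long"
    pos = tag[0]                        # e.g. 'N' for noun
    gender = GENDER.get(tag[2], "any")  # unseen values fall back to 'any'
    number = NUMBER.get(tag[3], "any")
    return pos, gender, number

# e.g. a masculine inanimate singular noun tag such as "NNIS1-----A----"
pos, gender, number = extract_properties("NNIS1-----A----")
```

Only words whose part-of-speech position marks a noun would go on to supply an annotation, matching step 5 of the training-data annotation process.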
This introduction of an additional category for both number and gender brings about a further segmentation of the annotated training data. Identifying a solution to this problem has been left as future work. Once extracted, the number and gender of the Czech word are used to annotate the English pronoun in the format Pronoun.gender.number. For example, in the following text:

the u.s. , claiming some success in its.mascin.pl trade diplomacy , removed south korea , taiwan and saudi arabia from a list of countries it.mascin.pl is closely watching for allegedly failing to honor u.s. patents , copyrights and other intellectual-property rights

The English pronouns "its" and "it" both refer to "u.s.", which in the case of this example is found to translate to "usa" in Czech. In the PCEDT 2.0 corpus "usa" is annotated in the m-layer as masculine inanimate and plural. The English pronouns are therefore annotated as its.mascin.pl and it.mascin.pl respectively (as shown in the example).

4.6 Training the Translation Models

Both the Baseline and Annotated systems are phrase-based SMT models, trained using the Moses toolkit (Hoang et al., 2007). They share the same 3-gram language model and are forced to use the same word alignments. Following the computation of the word alignments, training of both models commenced at the construction of the phrase translation table. In the construction of both the Baseline and Annotated translation systems, the lexical reordering model:

1. Uses the msd (monotone, swap, discontinuous) model configuration, which considers the three orientation types monotone, swap and discontinuous in the reordering.
2. Is conditioned on both the foreign phrase and the English phrase, and is bidirectional: for each phrase C, its ordering with respect to the previous phrase and the ordering of the next phrase with respect to C are considered.
The Baseline system was trained using the full texts of the parallel training corpus, with the un-annotated English text forming the source side. The Annotated system was trained in the same way as the Baseline system, but using the annotated English text as the source side of the parallel training corpus. The annotation of the English training set data is described in detail in section 4.6.3.

4.6.1 Computing the Word Alignments

When using two translation systems in a two-step translation process, it is necessary to ensure that the Czech translation of the antecedent in the output of the Annotated system is the same as that in the output of the Baseline system. Otherwise the annotation of the English pronouns serves no useful purpose. In order to ensure consistency of the antecedent translations between the systems, it is necessary to force both systems to use the same word alignments. The word alignments were produced using GIZA++ run over a 'stemmed' version of the un-annotated parallel training corpus in both translation directions, and symmetrised using the grow-diag-final heuristic. The stemming of the un-annotated training corpus is not stemming in the traditional sense. Rather, each word in the corpus is trimmed such that it is at most four characters in length. This was implemented upon the recommendation of Dr. Ondřej Bojar in order to improve the robustness of the word alignments used in the phrase extraction step of training the translation models. This is necessary due to the inflective nature of Czech words, which if left untrimmed would lead to weaker word alignments being used in the construction of the phrase translation tables. It is important to note that whilst the word alignments were computed using the 'stemmed' parallel corpus texts, the translation models were trained using the full corpus texts.
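The 4-character 'stemming' described above amounts to a simple per-token truncation, which conflates inflected forms of the same Czech lemma and so densifies the statistics available to the aligner. A minimal sketch (the example words are illustrative):

```python
# Sketch of the 4-character 'stemming' used before word alignment: every
# token is truncated to at most four characters, so inflected forms of the
# same Czech lemma collapse onto a shared prefix.
def stem_corpus_line(line, prefix_len=4):
    """Truncate each whitespace-separated token to prefix_len characters."""
    return " ".join(tok[:prefix_len] for tok in line.split())

# "hrad", "hradu" and "hradem" are inflected forms of Czech "hrad" (castle)
stemmed = stem_corpus_line("hrad hradu hradem starého města")
```

Note that tokens shorter than four characters pass through unchanged, so function words are unaffected by the trimming.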
4.6.2 Tuning the Translation System Weights: Minimum Error Rate Training (MERT)

When a model is first trained using Moses, the model weights generated are a default set of weights. According to the Moses documentation, the quality of these default weights is questionable³. It is therefore necessary to tune these weights to ensure that they are suitable for the translation language pair and the available translation system models. The weights were tuned using the MERT tuning script provided as part of the Moses toolkit, using the 500-sentence "Weight Tuning Set" file described in section 4.3. The output of the tuning process is a new Moses configuration file, which is used to replace the default configuration file produced by the Moses training process. Different weights were computed for the Baseline and Annotated systems, as they were trained using different training data and therefore comprise different models. Whilst the same 500 sentences of the "Weight Tuning Set" file are used in the tuning of both weight sets, in tuning the weights of the Annotated system the English pronouns in these sentences were first annotated using the same method used to annotate the training data. The tuning of the weights, whilst obviously highly recommended, led to problems in the experiments conducted as part of the project. With two systems involved in the two-step translation process, it was necessary to tune the weights of both systems. The result of this tuning is, in theory, a better set of weights for each system. Having tuned both systems independently, it was then discovered that there was considerable variation in the Czech translation of the English antecedent head noun between the two systems.

³ http://www.statmt.org/moses/?n=FactoredTraining.Tuning
As the two-step translation process is dependent on this translation remaining constant, there was a concern that this variation in the translations between the two systems would lead to the introduction of further errors. It is not clear how Le Nagard and Koehn (2010) addressed this issue, as there is no mention of the tuning of the translation system weights in their paper, but it seems likely that they encountered similar issues. As the impact of the tuning process upon the translation of the antecedent head nouns is not fully understood, and in light of the variation observed when both systems were tuned independently, the decision was taken to use the sub-optimal default weights. The use of these weights, in conjunction with the shared word alignments from which the phrases were extracted, ensured a high degree of consistency in the Czech translation of the English antecedent head noun between both systems. This consistency across the translations is of particular relevance to the automated evaluation as defined later in this chapter. The tuning of the weights of both systems in such a way as to ensure that both systems perform well and that the translation of the antecedent head noun is consistent between the systems is left as a possible option for future work. It should be noted that a single-step process such as that used by Hardmeier and Federico (2010) would not suffer from this problem of inconsistency (as it uses only a single translation system), perhaps adding greater weight to the argument in favour of using a single-step process in further research.

4.6.3 Annotation of the Training Set Data

The process of annotation used to generate the training data with which the Annotated system was trained works as follows:

1. Identify coreferential English pronouns and their antecedents using the BBN Coreference and Entity Type corpus.
2. Extract the head noun of the antecedent.
Where the antecedent spans more than a single word, the antecedent and the NPs annotated in the Penn Treebank 3.0 corpus are overlaid and the head noun is extracted using the process described in section 4.5.2.
3. Obtain the Czech translation (and its t-layer node) of the English antecedent head noun from the PCEDT 2.0 alignment file.
4. Obtain the number and gender of the Czech word by traversing the PCEDT annotation layers from the t-layer node to the corresponding m-layer node. The part-of-speech tag, the number and gender in the positional tag, and the term⁴ from the lemma field are extracted from the m-layer node.
5. If the m-layer node is annotated as a noun, then the number and gender of the corresponding Czech word are used to annotate the English pronoun in the original English text.

In the training data set, there are 23,233 pronouns marked as coreferential by the BBN Coreference and Entity Type corpus. Of those, it was possible to extract the antecedent head noun for 23,126 from the noun phrases marked in the merged files of the Penn Treebank 3.0 corpus. This leaves 107 coreferential pronouns without an antecedent head noun. Of the coreferential pronouns in the training set sentences, 20,721 out of a possible 23,233 are annotated by the training data annotation process. There are several reasons why not all coreferential pronouns have been annotated:

1. No head noun may be found for a multi-word antecedent NP, either because the antecedent does not contain a noun or because the noun identified as the head in the Penn Treebank 3.0 corpus annotation is not part of the antecedent. This is due to possible discrepancies between the annotation of the two corpora, such that no match for the antecedent can be obtained from the NPs.
2. There is no mapping for the English antecedent head noun in the PCEDT 2.0 alignment file. Therefore it is not possible to extract a number and gender for the aligned Czech word.
3. The word identified as the Czech translation of the antecedent head noun is not annotated as a noun at its m-layer node.

A further four annotated pronouns have been removed due to the exclusion of a number of sentences from the training data set by the Moses 'clean data' script. This leaves a total of 20,717 pronouns annotated in the English side of the parallel training corpus. See table 4.4 for a breakdown of this number by pronoun.

⁴ The term of a lemma is used in the identification of surnames, which are used as the 'head' noun in an antecedent string that contains a person's full name.

Table 4.4: Breakdown of Annotated Coreferential Pronouns in the Training Data Set

  English Pronoun    Number of Occurrences
  He                       4,157
  She                        527
  Him                        290
  Her                        426
  His                      1,714
  Hers                         1
  It                       4,478
  Its                      3,941
  They                     2,427
  Them                       657
  Their                    1,729
  Theirs                       6
  Himself                     83
  Herself                     11
  Itself                     156
  Themselves                 114
  Total                   20,717

4.7 The Annotated Translation Process

The input to the annotated translation process is an un-annotated English test file that consists of a set of sentences not present in the training set. This file is first translated using the Baseline system with a trace added to the Moses decoder. The coreferential English pronouns are then identified using the BBN Coreference and Entity Type corpus and their antecedent head noun(s) are extracted from the annotated NPs in the Penn Treebank 3.0 corpus, as previously described. The sentence number and word position of the English pronoun and its antecedent head noun(s) are extracted from the input English text and retained. Using the sentence number and word position of the English antecedent head noun, the Czech translation is identified in the output of the Baseline system, using the phrase alignments output by the Moses decoder (in the trace file) and the phrase-internal word alignments in the phrase translation table.
The number and gender of the Czech word identified as the translation of the antecedent head noun are extracted from the m-layer of the PCEDT 2.0 corpus, using a pre-built dictionary of Czech words and their morphological properties. A copy of the original English test file is then constructed, with all coreferential pronouns annotated with the number and gender of the relevant Czech word. This annotated English test file is then translated by the Annotated system in the second step of the translation process. For evaluation purposes, calls to the Moses decoder when performing the translations with the Baseline and Annotated systems include an option to return the word alignments for each sentence in the input English test file. This word alignment information is output to a separate file and consists of a single line per sentence with word-level alignments of the format E-C, where E is the position of the English word in the input sentence and C is the position of the Czech word in the translated sentence. In the design of the annotated translation process, a number of assumptions have been introduced. Firstly, that the Czech translation of the English antecedent head noun is the same in the output of both the Baseline and the Annotated system. As the Baseline and Annotated systems were trained using the same word alignments, it is reasonable to make the assumption that the translation of the English antecedent head noun will be the same in the output of both systems. Secondly, it is assumed that the annotation of the Czech words in the m-layer is both accurate and consistent. The same assumption was also made in the annotation of the training data.
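Parsing the decoder's one-line-per-sentence E-C alignment output can be sketched as below (the helper names are illustrative; the E-C token format is the one described above):

```python
# Sketch of parsing the decoder's word-alignment output: one line per
# sentence, whitespace-separated tokens of the form E-C, where E and C are
# the English and Czech word positions respectively.
def parse_alignment_line(line):
    """Return a list of (english_pos, czech_pos) pairs for one sentence."""
    pairs = []
    for token in line.split():
        e, c = token.split("-")
        pairs.append((int(e), int(c)))
    return pairs

def czech_positions_for(english_pos, pairs):
    """All Czech positions aligned to a given English word position."""
    return [c for e, c in pairs if e == english_pos]

pairs = parse_alignment_line("0-0 1-2 2-1 3-3 3-4")
targets = czech_positions_for(3, pairs)
```

Note that a single English position may map to several Czech positions (a one-to-many alignment), which is exactly the case the automated evaluation must handle when an antecedent head noun translates as more than one Czech word.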
4.8 Annotation and Translation System Architecture

The prototype annotation and translation system (described in figure 4.1) takes the form of a Python application that includes a bespoke module containing functions for accessing, processing and combining information from the corpora and the PCEDT 2.0 alignment file. This module also contains functions that are used in the generation of the annotated English training data used to train the Annotated translation system. The Python application works as follows:

1. Tokenise the un-annotated English test file.
2. Call the Moses decoder to translate the un-annotated English test file using the Baseline system and generate two files: the Czech translation output with trace information (for the identification of Czech and English phrases used in the translation) and the word alignments used by the decoder (used in the automated evaluation).
3. Perform the annotation of the English test file using the annotation process described previously.
4. Tokenise the annotated English test file.
5. Call the Moses decoder to translate the annotated English test file using the Annotated system and generate trace output and the word alignments used by the decoder.
6. If the additional Czech and English annotation switches are set to 'on', the application may also be used to read in the Czech translation output and the annotated English test file and add additional information to these files to aid manual evaluation.

In addition to this application, a number of Python scripts were developed as part of the project. These scripts perform a number of functions, including:

1. Generation of the corpus from which the language model is constructed.
2. Generation of the parallel training and test data sets from the PCEDT 2.0 corpus (Czech side) and the BBN Coreference and Entity Type corpus (English side).
3. Creation of the 'stemmed' parallel training data set from which the word alignments used in the training of the Baseline and Annotated translation systems are generated.
4. Generation of the annotated English training data used in the training of the Annotated translation system.
5. The execution of an automated evaluation. This will be described in more detail in section 4.9.1.

4.9 Evaluation

With no standard method available for the evaluation of pronoun translation in SMT, and BLEU rejected on the basis that it is not well suited to the specific problem of evaluating pronoun translation, it was necessary to devise methods in order to evaluate the performance of the systems. As already discussed in section 2.6, the problem of evaluating the translation of pronouns was addressed differently by Le Nagard and Koehn (2010) and Hardmeier and Federico (2010). Where Le Nagard and Koehn (2010) manually counted the number of correctly translated pronouns in the output of their translation systems, Hardmeier and Federico (2010) relied on precision and recall scored against a single reference translation. Again, as previously discussed in section 2.6, in the case of English-Czech translation a recall- and precision-based metric seems unsuitable given both the highly inflective nature of Czech and the provision of only a single reference translation. Given the number of possible syntactic forms that a pronoun of the correct number and gender may take in a language that has seven cases, and given that case is not considered in the annotation, the translation of pronouns with accurate syntactic form cannot be guaranteed. A method involving the manual counting of correctly translated pronouns, as used by Le Nagard and Koehn, is prohibitively slow and laborious, not to mention an impossible task for a monolingual speaker.
Whilst a one-off manual evaluation of pronoun translation may provide an acceptable method for the final evaluation of a system, it is clearly impractical to rely on such a method during system development. It is clear that in the case of the development of an English-Czech translation system by a monolingual speaker, neither of the methods discussed so far is suitable for evaluation during the development process. Given that in Czech a pronoun must agree in number and gender with its antecedent, it is perhaps more meaningful to count the number of pronouns in the translation system output for which this agreement holds, rather than simply score the output against a single reference translation. The following sections describe an automated method used to provide these counts, and the approach taken in a more detailed manual evaluation of pronoun translation carried out by a Czech native speaker who is also an expert in NLP.

4.9.1 Automated Evaluation: Assessing the Accuracy of Pronoun Translations

The development of automated evaluation methods is necessary both for the final evaluation and for the development and tuning of systems that focus on pronoun translation. Without the availability of such an evaluation metric during the development of the annotation and translation process as part of this project, the analysis of progress was measured using manual checks. These checks focussed on the accuracy of pronoun annotations in the annotated English test file and the manual evaluation of a small number of pronouns in the Czech translations output by the Annotated system. In order to evaluate the final output of the translation systems, an automated method was deemed to be a necessity. The automated evaluation consisted of a method based on the counting of pronouns, in the input English test file and in its translation produced by the relevant translation system, that met certain specified criteria.
These criteria are specified within a single Python script that is designed to simultaneously output results for both the Baseline and Annotated systems, such that a direct comparison of the two systems is possible. Using this evaluation script, the following statistics were collected:

1. Total number of pronouns in the input English test file, irrespective of whether they are identified as coreferential.
2. Total number of English pronouns identified as coreferential, as per the annotation of the BBN Coreference and Entity Type corpus.
3. Total number of coreferential English pronouns that are annotated by the annotation process.
4. Total number of coreferential English pronouns that are aligned with any Czech translation.
5. Total number of coreferential English pronouns translated as valid Czech pronouns, irrespective of whether the Czech translation is a valid match for the original English pronoun.
6. Total number of coreferential English pronouns translated as a valid Czech pronoun corresponding to a valid translation of the original English pronoun.
7. Total number of coreferential English pronouns translated as a valid Czech pronoun (corresponding to the original English pronoun) and with number and gender agreement between the Czech pronoun and the Czech translation of the original English antecedent head noun.

The evaluation handles the following pronouns:

- Personal Pronouns: he, him, she, her, it, they, them
- Reflexive Personal Pronouns: himself, herself, itself and themselves
- Possessive Pronouns: his, its, their, theirs
- Possessive Reflexive Pronouns: him, her, it, them

The evaluation script works as follows:

1. Read in the tokenised English input, the Czech translation system output and the Czech reference translation files.
2. Identify coreferential English pronouns and their antecedent head nouns in the input English text.
3. Identify the word positions of these English words in the input English text.
Identify aligned Czech words (for the English pronoun and antecedent head noun) in the translation system output using the word alignments output by the Moses decoder.
5. Collect counts of pronouns that meet the criteria listed above.

Whilst the English pronoun and its antecedent head noun are single words, they may translate as a single word (one-to-one mapping) or multiple words (one-to-many mapping) in the Czech output. A one-to-one mapping is ideal, but the more complex case of a one-to-many mapping presents a problem, as it is necessary to collect counts based on the identification of a single Czech pronoun and the Czech antecedent translation(s). In the scenario that an English pronoun translates as more than one word in Czech, the dictionary of Czech pronouns (see Appendix) is used to identify those words that are valid Czech pronouns. The scenario in which an English antecedent head noun translates as more than one word in Czech is a little more complex. When this is the case, the agreement of the pronoun must be checked against each Czech antecedent word; if agreement is found with any of the Czech antecedent words, this is deemed to be a ‘match’. In gathering the statistics listed as items 5, 6 and 7, it was necessary to reference a list of all of the valid Czech translations of the English pronouns included in the annotation and translation process. A complete list of the Czech pronoun syntactic forms and their number and gender may be found in the Appendix. Whilst all of the statistics are useful in evaluating the performance of the systems and providing a basis for comparison, perhaps the most informative are those described as items 6 and 7. Statistic 6 provides a means of measuring the accuracy with which a translation system translates an English pronoun as a valid Czech translation of that English pronoun.
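The nested counting behind criteria 1-7 can be sketched as follows. This is an illustration only: the record fields below are hypothetical stand-ins for what the actual script derives from the corpora and the decoder's word alignments, not the project's code.

```python
# Illustrative sketch only: field names are assumed, not taken from the
# project's actual evaluation script.
def collect_counts(records):
    """Tally the cascading criteria for a list of English pronoun records."""
    keys = ["total", "coreferential", "annotated", "aligned",
            "czech_pronoun", "valid_translation", "agreement"]
    counts = dict.fromkeys(keys, 0)
    for r in records:
        counts["total"] += 1                      # criterion 1
        if not r.get("coreferential"):
            continue
        counts["coreferential"] += 1              # criterion 2
        if r.get("annotated"):
            counts["annotated"] += 1              # criterion 3
        if not r.get("aligned_czech"):
            continue
        counts["aligned"] += 1                    # criterion 4
        if r.get("is_czech_pronoun"):
            counts["czech_pronoun"] += 1          # criterion 5
            if r.get("valid_translation"):
                counts["valid_translation"] += 1  # criterion 6
                if r.get("agrees_with_antecedent"):
                    counts["agreement"] += 1      # criterion 7
    return counts
```

Criteria 5-7 nest inside one another, mirroring the way each statistic restricts the previous one.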
Statistic 7 provides a means of further questioning the translated Czech pronoun: in addition to being a valid translation of the original English pronoun, does the Czech pronoun agree in number and gender with the (Czech translation of the) antecedent head noun? As it is a requirement in Czech that the number and gender of a pronoun agree with those of the antecedent, it is this statistic that arguably provides the most meaningful information in relation to system performance. The validity of this statistic is, however, reliant upon several factors, including the correct identification of the English antecedent head noun, its accurate translation into Czech by the Baseline system and finally, the correct identification of the Czech word in the Baseline system output. The uncertainty surrounding these factors, as well as concerns surrounding the robustness of the word alignments output by the decoder (used in the automated evaluation), provides additional motivation for soliciting human judgements via a manual evaluation. It is worth noting that a simplification is made in the evaluation of non-reflexive possessive pronouns. In Czech, the choice of syntactic form for a singular non-reflexive possessive pronoun is dependent on the gender of both the possessor and the object in possession. The cases in which the possessor is masculine animate, masculine inanimate or neuter are simple, as the same syntactic form is used irrespective of the gender of the object in possession. The case in which the possessor is feminine is more complex, as the syntactic form differs depending on the number and gender of the object in possession. As the Wall Street Journal corpus contains few possessive pronouns in which the possessor is feminine (less than 2% of all coreferential pronouns in the corpus, and less than 6% of the possessive pronouns), this case is unlikely to appear frequently and has therefore been omitted from the evaluation for the sake of simplicity.
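A minimal sketch of this simplification, with invented gender labels; the feminine-possessor paradigm is deliberately left unimplemented, mirroring its omission from the evaluation:

```python
# Sketch under assumed labels; not the project's actual code.
def singular_possessive_form(possessor_gender):
    """Choose the base 3rd-person singular possessive pronoun form.

    For masculine animate, masculine inanimate and neuter possessors the
    same form serves irrespective of the object in possession. A feminine
    possessor would additionally require the number and gender of the
    possessed noun, so that case is omitted here, as in the evaluation."""
    if possessor_gender in ("masc_anim", "masc_inan", "neut"):
        return "jeho"
    raise NotImplementedError(
        "feminine possessor: form depends on the object in possession")

print(singular_possessive_form("neut"))  # jeho
```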
This is attributed to the genre of the WSJ texts and would not necessarily hold true for other domains. It should also be noted that the automated evaluation does not include statistics on dropped pronouns. As the decision whether or not to drop a subject pronoun is one that may be made by the speaker (or writer), it is too subjective to be measured using an automated method. The only option would be to use the reference translation(s) in order to identify those sentences in which a pronoun may be dropped, which again raises issues surrounding the provision of only a single reference translation. The evaluation of the pro-drop behaviour ‘learned’ as a result of training the translation systems (dropped pronouns are not explicitly handled by the Annotated system) was therefore left to the manual assessor to comment upon.

4.9.2 Manual Evaluation: Error Analysis and Human Judgements

Whilst the automated evaluation provides an indication of relative performance, there are a number of problems associated with this method (as discussed in section 4.9.1). Furthermore, the true test of whether or not there is an improvement in pronoun translation over the Baseline system requires the solicitation of human judgements from a manual assessor. As with the manual evaluation of Machine Translation in general, the manual evaluation of pronoun translation is not a straightforward task. Care must be taken to ensure that the manual assessors are given clear instructions as to how to conduct the evaluation, and even then instructions may be misinterpreted. Furthermore, the identification of intended pronoun translations in the system output is potentially difficult, even for a native Czech speaker.
This is due to phrase-level reordering between the input English text and the Czech output, the insertion of spurious pronouns during the translation process, and the ambiguity of words such as “je”, which may be used as either a pronoun or a verb. In the evaluation of the translated pronouns, it was believed to be important to ensure that the manual assessor was directed to the Czech translations aligned to the English pronouns. For this purpose, referential pronouns in both the Czech and English texts provided for manual assessment were marked with the head noun of their antecedent. In addition, referential pronouns in the English source texts were marked with the corresponding Czech translation of the antecedent head noun, and those in the Czech target texts were marked with the original English pronoun that they align to. Examples of the additional annotation provided for the purposes of the manual evaluation are presented below.

English text input to the Baseline system:
the u.s. , claiming some success in its trade diplomacy , removed south korea , taiwan and saudi arabia from a list of countries it is closely watching for allegedly failing to honor u.s. patents , copyrights and other intellectual-property rights .

Czech translation output by the Baseline system:
usa , tvrdí někteří její(its) obchodní úspěch v diplomacii , odvolán jižní korea , taiwanu a saúdská arábie ze seznamu zemí je(it) pozorně sledovali za údajné schopná dodržet amerických patentů , copyrights a další intellectual-property práva .

English text input to the Annotated system:
the u.s.* , claiming some success in its(u.s.,usa).mascin.pl trade diplomacy , removed south korea , taiwan and saudi arabia from a list of countries it(u.s.,usa).mascin.pl is closely watching for allegedly failing to honor u.s. patents , copyrights and other intellectual-property rights .
Czech translation output by the Annotated system:
usa ,* tvrdí někteří úspěchu ve své(its.mascin.pl) obchodní diplomacii , odvolán jižní korea , taiwanu a saúdská arábie ze seznamu zemí je(it.mascin.pl) pozorně sledovali za údajné schopná dodržet amerických patentů , copyrights a další intellectual-property práva .

Because a pronoun must agree in number and gender with its antecedent, when that antecedent comes from an earlier sentence, the assessor carrying out the manual evaluation must also be provided with that sentence in order to understand the context of the pronoun. The additional mark-up of the Czech target text is therefore of even greater importance. The sample English and Czech translation texts were composed from five WSJ files selected at random from the Development and Final test sets. The manual assessor was asked to make the following judgements:

1. Whether the pronoun had been translated correctly, or in the case of a dropped pronoun, whether it had been dropped correctly;
2. If the pronoun translation was incorrect, whether a native Czech speaker would still be able to derive the meaning;
3. In the case of the input to the Annotated system, whether the pronoun had been correctly annotated, at least with respect to the Czech translation of the identified antecedent;
4. In the case where an English pronoun had a different translation in the Baseline and Annotated Czech target text, which system produced the better translation. If both systems translated an English pronoun to a valid Czech translation (of that pronoun), both results were to be marked equally as correct translations.

It should be noted that the evaluation focussed solely on the translation of pronouns, and not on the translation system output as a whole, as is the case with general-purpose manual evaluation in Machine Translation.
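For illustration, mark-up of the kind shown above (e.g. “its(u.s.,usa).mascin.pl”) can be unpacked with a small parser. The regular expression and field names here are assumptions inferred from the examples, not the project's actual tooling:

```python
import re

# Hypothetical parser for the mark-up format seen in the examples above:
# pronoun(english_antecedent,czech_antecedent).gender.number
TOKEN = re.compile(r"^(\w+)\(([^,]+),([^)]+)\)\.(\w+)\.(sg|pl)$")

def parse_annotated_token(token):
    """Split an annotated pronoun token into its component fields."""
    m = TOKEN.match(token)
    if m is None:
        return None  # an ordinary, unannotated token
    pronoun, antecedent, czech, gender, number = m.groups()
    return {"pronoun": pronoun, "antecedent": antecedent,
            "czech_antecedent": czech, "gender": gender, "number": number}

print(parse_annotated_token("its(u.s.,usa).mascin.pl"))
```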
4.10 Chapter Summary

This chapter described the approach taken in the training of the Baseline and Annotated phrase-based translation systems, the development of the annotation and translation process and the methods developed in order to address the more specific problem of evaluating pronoun translation in English-Czech SMT. The next chapter presents the results of the automated and manual evaluations of the output of the Annotated translation system and provides a comparison with the output of the Baseline system. The chapter also provides a discussion of the results.

Chapter 5 Results and Discussion

5.1 Automated Evaluation

The results of the automated evaluation (described in section 4.9.1) are presented for the Development test set in table 5.1 and for the Final test set in table 5.2. As these tables show, there is only a small improvement of the Annotated system over the Baseline system for each test set. The statistics in the last two rows of each table require further explanation. By way of an example, consider a sentence in which the English pronoun ‘it’ is identified as having an antecedent for which the head noun translates to a Czech word that is singular and feminine. If ‘it’ were translated as the Czech pronoun ‘on’, this would be a valid Czech translation of the English pronoun ‘it’, satisfying the criterion “Czech Pronouns that are a valid translation of the original English Pronoun”. This translation would not, however, satisfy the additional requirement of agreement with the antecedent, as ‘on’ (singular, nominative case) has masculine gender and the antecedent is feminine. In order to satisfy the more stringent criterion of also matching the number and gender of the antecedent, ‘it’ would need to be translated as ‘ona’ in the nominative case.
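The ‘on’ versus ‘ona’ check above amounts to a lookup in a pronoun table; a toy sketch, assuming a dictionary restricted to nominative singular forms (an illustrative subset of the full table in the Appendix):

```python
# Toy subset of Czech personal pronoun properties (nominative singular only).
CZECH_PRONOUN_PROPS = {
    "on": ("masc", "sg"),   # he
    "ona": ("fem", "sg"),   # she
    "ono": ("neut", "sg"),  # it
}

def agrees(czech_pronoun, antecedent_gender, antecedent_number):
    """True if the Czech pronoun's number and gender match the antecedent's."""
    return CZECH_PRONOUN_PROPS.get(czech_pronoun) == (
        antecedent_gender, antecedent_number)

print(agrees("on", "fem", "sg"))   # False: 'on' is masculine
print(agrees("ona", "fem", "sg"))  # True: agrees with a fem. sg. antecedent
```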
If accuracy of the pronoun translations is taken to be the proportion of coreferential English pronouns that have a valid Czech translation and agree in number and gender with their antecedent, then the accuracy of the systems is as follows:

1. Development test set: Baseline system 44/141 (31.21%), Annotated system 46/141 (32.62%)
2. Final test set: Baseline system 142/331 (42.90%), Annotated system 146/331 (44.11%)

Table 5.1: Automated Evaluation Results for the Development Test Set

                                                                    Baseline System   Annotated System
Pronouns                                                                        156                156
Coreferential Pronouns                                                          141                141
Annotated Coreferential Pronouns                                                N/A                117
Coreferential English Pronouns with a Czech translation                         141                141
Coreferential English Pronouns translated as valid Czech Pronouns                71                 75
Czech Pronouns that are a valid translation of the original
  English Pronoun                                                                63                 71
Czech Pronouns that are a valid translation of the original
  English Pronoun and the Czech Pronoun and Antecedent match
  in number and gender                                                           44                 46

However, there are a number of reasons for not taking this evaluation as definitive:

1. The automated evaluation hinges on the accuracy of the word alignments output by the decoder (alongside the Czech translations) in order to identify the Czech translations of the English pronoun and its antecedent. The robustness of these alignments is questionable, so caution should be taken when interpreting the results.
2. The automated evaluation requires accurate identification of the true Czech translation of the head noun of the antecedent in English. This in turn requires that the latter be identified accurately. If either is incorrect, the English pronoun in the input to the Annotated translation system is likely to be annotated incorrectly, thereby blocking any potential gains from the annotation and translation process.
3.
English pronouns are only annotated with the number and gender of their Czech counterparts, and so the correct inflectional form of the Czech pronouns in the target text cannot be guaranteed. As a result, inflectional form cannot be used as a criterion in the automated evaluation.

All these points mean that manual evaluation is critical for understanding the potential capabilities of source text annotation as a technique for improving pronoun translation. Despite efforts to ensure that the English antecedent head noun is translated as the same Czech word in the Baseline and Annotated systems, a small number of differences between the two systems were identified. One instance of a different antecedent translation was identified for the Development test set and two were identified for the Final test set. In all three cases, the English pronouns were not translated as Czech pronouns, so the presence of these anomalies does not affect the accuracy scores reported previously.

Table 5.2: Automated Evaluation Results for the Final Test Set

                                                                    Baseline System   Annotated System
Pronouns                                                                        350                350
Coreferential Pronouns                                                          331                331
Annotated Coreferential Pronouns                                                N/A                278
Coreferential English Pronouns with a Czech translation                         317                317
Coreferential English Pronouns translated as valid Czech Pronouns               198                198
Czech Pronouns that are a valid translation of the original
  English Pronoun                                                               182                182
Czech Pronouns that are a valid translation of the original
  English Pronoun and the Czech Pronoun and Antecedent match
  in number and gender                                                          142                146

Automated evaluation also fails to capture actual variations between the Baseline and the Annotated target texts. Upon closer inspection of the system output it is clear that there is a fairly high degree of overlap between the two systems in terms of English pronouns that are translated using exactly the same Czech form. There are also a substantial number of English pronouns for which the Czech translation is different.
Where the two systems produce the same translation of the same English pronoun (i.e. the same word position within the same sentence, within the same WSJ file), it is possible that both systems have produced a valid translation of the pronoun, or that they have both produced an invalid translation. Where the translations are invalid, interpretation by a human expert is required in order to ascertain the cause of the error. Where the two systems produce a different translation of the same English pronoun, there are yet more possibilities. Both systems could produce a different Czech pronoun, whether it be valid or invalid, neither system may produce a Czech pronoun (but the Czech translation may be different), or one system may produce a valid Czech pronoun where the other does not. As both systems share the same underlying word alignment for the construction of their phrase translation models, these differences can only follow from the data used to train their translation models. The extent of this variation differs between the two test sets. For the Development test set, approximately 1/3 of the pronoun translations are different between the two systems, whereas for the Final test set this is much lower at approximately 1/6. The evaluation of the instances where the pronoun translation is the same in both systems and where it differs between the systems is left to the manual assessor. Whilst a monolingual speaker with a dictionary of Czech pronouns and their English translations may manually examine these instances using the information in the files output as part of the automated evaluation process, there are cases that can only be analysed by a native Czech speaker. This motivates the solicitation of human judgements.
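These possibilities can be summarised as a small classifier that buckets each (Baseline, Annotated) pair of aligned Czech words. The category labels and the toy pronoun test below are this sketch's own, used only to make the case analysis concrete:

```python
# Illustrative bucketing of aligned translations from the two systems.
def classify(baseline_word, annotated_word, is_pronoun):
    """Bucket one English pronoun's pair of aligned Czech translations."""
    if baseline_word == annotated_word:
        return "match"
    b, a = is_pronoun(baseline_word), is_pronoun(annotated_word)
    if b and a:
        return "difference: both systems produced a Czech pronoun"
    if b or a:
        return "difference: only one system produced a Czech pronoun"
    return "difference: neither system produced a Czech pronoun"

# Toy membership test; the real script consulted the dictionary of Czech
# pronouns in the Appendix.
is_pron = {"on", "ona", "ono", "oni", "ony", "je", "jeho", "její"}.__contains__

print(classify("jeho", "její", is_pron))
```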
5.2 Manual Evaluation

The results of the manual evaluation suggest that the performance of the Annotated system is comparable with, and even marginally better than, that of the Baseline system. In the sample files provided for the evaluation there were 31 pronouns for which the translations provided by the two systems differed (“differences”) and 72 for which the translation provided by the systems was the same (“matches”). These sets of pronouns show different things. Evaluation of the “matches” (see table 5.3) provides an indication of how well both systems do in general terms, and evaluation of the “differences” (see tables 5.4 and 5.5) allows the two systems to be compared directly. Tables 5.6 and 5.7 describe the performance of both systems with respect to the appropriate use of pro-drop. The results contained in these tables correspond to judgements based on the criteria specified in section 4.9.2.

Table 5.3: Manual Evaluation Results: Pronouns with the same translation in both systems (“Matches”)

Criterion                                                       Result for both systems
Total number of pronouns                                                             72
Pronoun translation correct in terms of number and gender,
  or correctly dropped                                                            52/72
Pronoun translation incorrect in terms of number and gender,
  or incorrectly dropped                                                          20/72
English pronoun annotated correctly with the number and gender
  of the Czech translation                                                        67/72
Total number of incorrectly translated pronouns                                      20
Pronoun translation incorrect and cannot be understood or is
  “misleading”                                                                     8/20
Pronoun translation incorrect but the meaning could still be
  understood                                                                      12/20

Upon inspection of the “matches” set (see table 5.3) of 72 pronouns, it is clear that a reasonable number of pronouns are correctly translated or dropped by both systems (52/72) and that of the 20 pronouns that are incorrectly translated, the meaning of 12 could still be understood. This leaves 8 pronouns for which the translation was so poor that the meaning cannot be understood.
Focussing specifically on those pronouns that are dropped (see table 5.6), 28 out of 32 are correctly (or at least satisfactorily) dropped, and only 6 pronouns that should have been dropped were not. This suggests that the translation systems were able to ‘learn’ scenarios in which pro-drop is appropriate. The success of both systems with respect to the appropriate dropping of pronouns was somewhat unexpected, but could be due to instances in which there are short distances between the pronoun and verb in English. For example, many of the occurrences of ‘he’ and ‘she’ in the English text appear in the context of “he said...” or “she said...”, and are translated as “...řekl...” and “...řekla...” (respectively) in the Czech machine translation output. These instances represent scenarios in which the pronoun was appropriately dropped.

Table 5.4: Manual Evaluation Results: Pronouns with different translations in each system (“Differences”)

Criterion                                                       Baseline System   Annotated System
Total number of pronouns                                                     31                 31
Pronoun translation correct in terms of number and gender,
  or correctly dropped                                                    19/31              17/31
Pronoun translation incorrect in terms of number and gender,
  or incorrectly dropped                                                  12/31              14/31
English pronoun annotated correctly with the number and
  gender of the Czech translation                                           N/A              18/31
Total number of incorrectly translated pronouns                              12                 14
Pronoun translation incorrect and cannot be understood or is
  “misleading”                                                             5/12               6/14
Pronoun translation incorrect but the meaning could still be
  understood                                                               7/12               8/14

* The remaining 11 pronoun translations were found to be “similar” between the two systems.
In this case, the translation provided by one system was no better or worse than the other: either both translations were deemed to be equally good or equally bad.
** The remaining 6 pronoun translations were found to be “similar” between the two systems.

Table 5.5: Manual Evaluation Results: A direct comparison of pronoun translations that differ between systems (“Differences”)

Criterion                               Baseline System Better   Annotated System Better   Systems Equal
Overall quality*                                          9/31                     11/31           11/31
Quality when annotation is correct**                      3/18                      9/18            6/18

Table 5.6: Manual Evaluation Results: Dropped Pronouns in the “Matches” set

Criterion                                               Result for both systems
Total dropped pronouns                                                       32
Correctly / satisfactorily dropped                                           28
Incorrectly / inappropriately dropped                                         4
Pronouns that should have been dropped (but were not)                         6

Table 5.7: Manual Evaluation Results: Dropped Pronouns in the “Differences” set

Criterion                                               Baseline System   Annotated System
Total dropped pronouns                                               12                  3
Correctly / satisfactorily dropped                                   12                  3
Pronouns that should have been dropped (but were not)                 1                  1

However, it is not the case that the problem of pro-drop has been solved, merely that a few scenarios in which pro-drop is appropriate have been ‘learned’. An inspection of the results from the “differences” set (see tables 5.4 and 5.5) of 31 pronouns presents further points of interest. Whilst the performance of the Annotated system appears to be a little better than that of the Baseline system overall (see table 5.5), the manual assessor actually identified fewer correct translations for the Annotated system (17/31) than for the Baseline system (19/31). This may seem strange, but it appears to be due to a small number of cases in which the translations produced by both systems were incorrect but those produced by the Annotated system were deemed to be marginally better.
Unfortunately, the sample size for this set is rather small and it is therefore somewhat difficult to form a complete picture of where one system may be consistently better than the other. As an example of where the Annotated system produces a better translation than the Baseline system, consider the following English sentence and its translations by both systems:

English text:
he said mexico could be one of the next countries to be removed from the priority list because of its.neut.sg efforts to craft a new patent law .

Baseline system translation:
řekl , že mexiko by mohl být jeden z dalších zemí , aby byl odvolán z prioritou seznam , protože její snahy podpořit nové patentový zákon .

Annotated system translation:
řekl , že mexiko by mohl být jeden z dalších zemí , aby byl odvolán z prioritou seznam , protože jeho snahy podpořit nové patentový zákon .

In this example, the English pronoun “its”, which refers to “mexico”, is annotated as neuter and singular (as extracted from the Baseline translation of “mexico”). Both systems translate the pronoun's antecedent “mexico” as “mexiko” (neuter, singular) but differ in their translation of the pronoun. The Baseline system translates “its” incorrectly as “její” (feminine, singular), whereas the Annotated system produces the correct translation “jeho” (neuter, singular), which agrees with the antecedent in both number and gender. It is also interesting to note that “jeho” is not the only correct pronoun translation in this case. If “because of its efforts to craft a new patent law” is translated as a separate clause, the use of the possessive pronoun “jeho” is correct. Alternatively, if the same fragment were to be translated as a phrase belonging to the same clause as the antecedent “mexico” (also the subject), the reflexive possessive pronoun “své” should be used instead, as it is in the reference translation. There are two further points of interest with regard to the results from the “differences” set:

1.
It would appear that the Baseline system is more likely to drop pronouns than the Annotated system (in those scenarios when a pronoun should be dropped).
2. If the annotation of the English pronoun is correct, the translation provided by the Annotated system is judged to be better than the translation provided by the Baseline system.

Unfortunately the sample size of this set of pronouns is rather too small to make any definite claims, but it would appear that in general the explicit annotation of pronouns results in worse performance in terms of pro-drop (see table 5.7). What is encouraging is that the correct annotation of an English pronoun appears to lead to a good translation in Czech. Where the annotation is correct with respect to the extraction of the number and gender from the Czech translation of the antecedent, pronoun translation is deemed to be better for the Annotated system than for the Baseline system (see table 5.5). In the “differences” set, where the annotation of 18 out of 31 English pronouns is correct, 9 pronouns are translated better by the Annotated system, 3 are translated better by the Baseline system and 6 are too ‘similar’ to make a judgement. This is supported to some extent by the results of the “matches” set (see table 5.3), in which the accuracy of the English pronoun annotation is deemed to be high (67/72) and the correctness of the pronoun translation or dropping of a pronoun is also reasonably high (52/72).
Another interesting example identified in the manual evaluation showed that despite the incorrect annotation of an English pronoun, the translation produced by the Annotated system was deemed to be (accidentally) better than that produced by the Baseline system:

English text:
the others here today live elsewhere they.fem.pl belong to a group of 15 ringers – including two octogenarians and four youngsters in training – who drive every sunday from church to church in a sometimes-exhausting effort to keep the bells sounding in the many belfries of east anglia .

Baseline system translation:
ostatní zde dnes žije jinde to patří ke skupině 15 ringers - včetně dvou octogenarians a čtyři , který v období - , kteří jezdí každou neděli od kostela , aby církev v sometimes-exhausting snahu udržet zvony sounding v mnoha belfries of east anglia .

Annotated system translation:
ostatní zde dnes žije jinde ty patří ke skupině 15 ringers - včetně dvou octogenarians a čtyři , který v období - , kteří jezdí každou neděli od kostela , aby církev v sometimes-exhausting snahu udržet zvony sounding v mnoha belfries of east anglia .

In this example, the English pronoun “they” refers to “others” in the previous sentence and is annotated as feminine, plural. It should, however, be annotated as masculine animate, plural, according to the number and gender of “ostatní”. This incorrect annotation affects the translation of the pronoun “they” by the Annotated system, but the translation “ty” (Annotated system) is perfectly understandable to a native Czech speaker and was deemed to be better than “to” (Baseline system). Moreover, “ty” represents a form that is common in colloquial Czech. Unfortunately, no clearer picture of the effects of the annotation and translation process with respect to individual pronouns may be obtained. Whilst it was expected that English pronouns which appeared in the training data with a high frequency would be translated more accurately than those that appeared with a low frequency, it is not possible to draw any conclusions from such a small sample size. A more extensive manual evaluation would therefore be required. In addition to the judgements, the manual assessor also provided feedback on the manual evaluation task. One of the major difficulties that they encountered during the evaluation was in connection with evaluating the translation of pronouns in sentences which exhibit poor syntactic structure. This is a criticism of Machine Translation as a whole, but it highlights a specific problem in the manual evaluation of pronoun translation. Moreover, poor syntactic structure is likely to introduce an additional element of subjectivity, as the assessor must first interpret the syntactic structure of the translation system output.

5.3 Critical Evaluation of the Approach and Potential Sources of Error

Errors in different parts of the process may contribute to the Annotated system not performing much better than the Baseline system:

1) Identification of the English antecedent head word. The incorrect identification of the English antecedent head word will in turn affect the identification of the Czech translation from which the number and gender are extracted. This will affect not only the annotation of the training data used to train the Annotated translation system but also the annotation of the test file as part of the annotation and translation process.

2) Identification of the Czech translation of the English antecedent head word. For the training data, the Czech translation is obtained from the PCEDT 2.0 alignment file. Errors in the alignments used in the generation of this file would therefore eventually lead to the extraction of the incorrect morphological properties of the Czech word used to label the coreferential English pronouns in the training data.
During the translation of a test file, the Czech translation of an English antecedent head noun is extracted using the phrase-internal word alignments in the phrase table, corresponding to the phrase used in the translation. The potential for errors in these word alignments cannot be ruled out.

3) Incorrect annotation in the manually annotated corpora. As the morphological properties of the Czech words in the PCEDT 2.0 corpus, the coreferential pronouns and their antecedents in the BBN Coreference and Entity Type corpus, and the parsed sentences in the Penn Treebank 3.0 corpus are all manually annotated, the accuracy of this information is deemed to be very high. The risk of errors in the manual annotation of these corpora is therefore believed to be minimal.

The potential sources of error in points 1 and 2 could be contributing factors in the introduction of variation between the pronoun translations in the Baseline and Annotated systems. The other obvious source of this variation is the difference between the training data used by the two systems. This difference in the training data (introduced as a result of the annotation of English pronouns) raises another concern with respect to the potential weakening of statistics in the phrase table of the Annotated system due to the segmentation of the data. Whilst this cannot be avoided if the objective is to improve the translation of pronouns when translating into a language where the number and gender of pronouns are important, decisions taken by the decoder based on weak statistics may give poor results. This is perhaps more of an issue given the relatively small size of the parallel training corpus when compared to the resources used in the development of other SMT systems.
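The phrase-internal lookup in point 2 can be illustrated with the ‘source-target’ word alignment notation used by Moses (e.g. “0-0 1-0”); the helper function itself is a hypothetical sketch rather than the project's code:

```python
# Hypothetical helper; the 's-t' pair notation is the standard Moses format
# for word alignment points within a phrase pair.
def aligned_target_indices(alignment_field, source_index):
    """Return target word positions aligned to a given source word position."""
    targets = []
    for pair in alignment_field.split():
        s, t = pair.split("-")
        if int(s) == source_index:
            targets.append(int(t))
    return targets

# e.g. for a phrase pair like 'the u.s. ||| usa', source word 1 ('u.s.')
# might carry the alignment point 1-0, mapping it to target word 0 ('usa').
print(aligned_target_indices("0-0 1-0", 1))  # [0]
```

A one-to-many mapping simply yields more than one target index, which is why the evaluation must then check agreement against each aligned Czech word.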
One possible area for improvement would be to reject translated sentences produced by the Annotated translation system in which there are pronoun translations that are based on low counts of phrase-level occurrence within the training data. The other obvious solution would be to add more parallel data to the training corpus. However, as the aim of this project was to use manually annotated corpora in which the coreference annotation is assumed to be “perfect”, this assumption would need to be relaxed if the generation of more parallel training data necessitated the use of a coreference resolution system. Potential sources of error are not limited to the annotation and translation process. As mentioned briefly in section 5.1, there are a number of potential sources of error in the automated evaluation method which should not be overlooked. These sources of error are related to those already described in relation to the annotation and translation process. The evaluation hinges not only on the correct identification of the head noun from the English antecedent, but also on the identification of the Czech translation in the output of the translation system which is reliant upon the word alignments output by the decoder. If any of these is incorrect, the results of the evaluation will be affected as it relies upon counts of Czech pronouns that agree in number and gender with the Czech translation of the English antecedent head noun. This, again, highlights the great need for standard automated evaluation methods for the specific problem of pronoun translation, and in the case of this project, a method (or methods) that are suitable for the evaluation of highly inflective languages such as Czech. 
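The rejection idea above could be sketched as a simple filter over translation hypotheses; the record structure, the source of the counts and the threshold are all hypothetical, chosen only to make the proposal concrete:

```python
# Hypothetical filter: discard hypotheses whose pronoun translations rest on
# rarely observed phrase pairs. Field names and the threshold are invented.
def filter_hypotheses(hypotheses, threshold=5):
    """Keep hypotheses whose pronoun-bearing phrase pairs are well attested."""
    return [h for h in hypotheses
            if all(count >= threshold
                   for count in h["pronoun_phrase_counts"])]

hyps = [{"sentence": "...", "pronoun_phrase_counts": [12, 8]},
        {"sentence": "...", "pronoun_phrase_counts": [1]}]
print(len(filter_hypotheses(hyps)))  # 1
```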
5.4 Chapter Summary

This chapter presented the results of manual and automated evaluations of the output of the Annotated and Baseline systems, together with a discussion of the results and of the potential sources of error in the annotation and translation processes, as well as in the evaluation itself. The next chapter concludes this work and suggests possible areas of further investigation for future work.

Chapter 6
Conclusion and Future Work

The work carried out as part of this project raises perhaps more questions than it answers. This chapter outlines the contributions of the project and summarises the outstanding issues which may impede not only the further progress of this work, but also that of other studies focussing on the translation of pronouns in SMT.

6.1 Conclusion

Building on the work of Le Nagard and Koehn (2010) and Hardmeier and Federico (2010), and using a method similar to that developed by Le Nagard and Koehn (2010), this project focussed on the translation of pronouns in phrase-based English-Czech SMT. The three contributions of this work are:

1. A prototype annotation and translation system for English-Czech SMT, trained on the Wall Street Journal corpus and the close Czech translation provided by the PCEDT 2.0 corpus.

2. Automated and manual evaluations of the output of the annotation and translation process against a baseline system.

3. An aligned parallel corpus (in which the pronouns in the English source-side text are annotated) which may be used in future investigations into methods for improving the handling of pronoun coreference.

The annotation and translation system uses a two-step process based on the approach taken by Le Nagard and Koehn (2010).
Whilst it is acknowledged that this approach is slow (owing to the incorporation of two translation steps) and cumbersome compared to the more elegant solution presented by Hardmeier and Federico (2010), the two-step process provided a simple framework for investigating pronoun translation in English-Czech SMT. It is recognised that a two-step process is impractical and unsuitable for real-world deployment. It does, however, simplify the problem of obtaining the Czech translation of the antecedent head noun, and is therefore a valid design for a prototype system used as a proof of concept.

Unlike the previous projects of Le Nagard and Koehn (2010) and Hardmeier and Federico (2010), this project made use of a number of manually annotated corpora to factor out the effects of both imperfect coreference resolution and imperfect alignment in the training data. The use of these corpora allowed an assessment of the extent to which annotating English pronouns with the number and gender of their Czech antecedent can improve their translation into Czech. In short, the answer is simple: the performance of the Annotated system shows little improvement over the Baseline system, as measured by both automated and manual evaluations. There are a number of possible reasons for this, discussed in detail in chapter 5. The two major areas of concern are the accuracy of the Baseline system's translation of the English antecedent head noun (and its accurate identification in the translation output), and the potential weakening of the statistics in the Annotated system's phrase translation table. The amount of data in the parallel training corpus used in this project is perhaps not enough to provide sufficiently accurate Baseline translations or robust statistics for the Annotated system's phrase translation model.
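The annotation step that sits between the two translation passes can be sketched as below. The function names and data structures are hypothetical simplifications; the real system wraps two Moses models and the manual annotations from the corpora described above.

```python
def annotate_pronouns(tokens, coref, antecedent_morphology):
    """Mark each coreferential pronoun with the number and gender of its
    antecedent's Czech translation, ready for the second translation pass.

    tokens: English source tokens.
    coref: {pronoun index: antecedent index}, here taken from the manual
        coreference annotation assumed to be "perfect".
    antecedent_morphology: {antecedent index: (number, gender)}, obtained
        by looking up the Baseline translation of the antecedent head noun.
        ('any', 'any') stands in where no number or gender is defined.
    """
    out = list(tokens)
    for p_idx, a_idx in coref.items():
        number, gender = antecedent_morphology.get(a_idx, ("any", "any"))
        out[p_idx] = f"{tokens[p_idx]}.{gender}.{number}"
    return out

tokens = ["the", "committee", "said", "it", "would", "meet"]
print(annotate_pronouns(tokens, {3: 1}, {1: ("sg", "fem")}))
# ['the', 'committee', 'said', 'it.fem.sg', 'would', 'meet']
```

Because the annotated token `it.fem.sg` is a new word form, every annotated pronoun fragments the phrase-table counts relative to the Baseline system, which is precisely the statistics-weakening concern raised above.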
The best way to assess the claim that the training corpus is too small would be to rebuild the translation models using an extended parallel training corpus. It is acknowledged that this would likely compromise the assumption of "perfect" coreference, as a second English corpus with manually annotated coreference information and a close Czech translation in a domain similar to that of the WSJ corpus is unlikely to be available. Should a suitable parallel English-Czech corpus be obtained, a state-of-the-art coreference resolution system such as that developed by Charniak and Elsner (2009) could be used to provide the missing coreference information.

The problem of pronoun translation in SMT is complex, especially when translating into a highly inflective language such as Czech, where it is important to ensure that pronouns have the correct number, gender and case, and that there is agreement between the pronoun and the head of its antecedent. It is therefore important to realise that whilst the results for the Annotated system on two small test sets show a marginal improvement over the Baseline system, this is based purely on the number and gender of the pronouns and their antecedents; the correct case of the pronouns, and hence the correct syntactic form, is not considered.

The possibility of further experimentation using the prototype annotation and translation process is limited by a number of factors. Firstly, the Wall Street Journal corpus may be too small for the purposes of this work, given the suspected data-sparsity problems arising from the number of genders in Czech and the annotation of "any" in the absence of a defined number or gender in the PCEDT 2.0 corpus.
Secondly, the provision of only a single reference translation, combined with the high degree of inflection in Czech and the lack of a standard automated evaluation metric, makes it difficult to decide how best to evaluate the system output. Thirdly, the question of how best to tune the two systems used in a two-step translation process, in which the translation of the antecedent head nouns must be kept consistent between the systems, is a complex one. This perhaps highlights another argument against the use of a two-step translation process. It should be noted that this project is not unique in suffering from these problems; the first two affect not only pronoun-focussed translation, but Machine Translation in general.

With the topic of evaluation taking a prominent place in the 2011 Workshop on Machine Translation1, it is clear that there are still many open questions surrounding automated evaluation techniques. Whilst manual evaluation is always an available option, it is not well suited for use during the development of SMT systems in which experiments are run with any degree of frequency. It is clear that the lack of a suitable automated evaluation method presents a major stumbling block in the path of future progress: designing a translation system is only one half of the problem; evaluating it is the other.

Finally, the need to ensure consistency of the Czech translation of the English antecedent head noun between the Baseline and Annotated systems resulted in the adoption of the default model weights provided as part of Moses training. This would not pose a problem for a single-step translation process, in which only one translation model is required and the issue of consistent translation of the antecedent head noun between two systems does not arise.

This document outlines the work undertaken as part of a three-month MSc project.
It is clear that whilst some progress has been made, three months is not nearly sufficient to tackle all of the problems related to the development and evaluation of what now appears to have been a rather ambitious project from the outset. In truth, this work has only begun to scratch the surface, but it is hoped that work focussing on pronoun translation, and on the wider issue of handling discourse-level phenomena in Machine Translation, will continue. The following section (6.2) makes a number of suggestions as to the directions in which future work could be taken, in light of the difficulties encountered during this project.

1 http://www.statmt.org/wmt11

6.2 Future Work

Improving the accuracy of pronoun translation in Machine Translation remains an open problem, and as such there is great scope for future work in this area. Indeed, there may be other methods for handling pronoun translation that work better than those already investigated; it may be that it is not sufficient to focus solely on the source side, and that operations on the target side must also be considered. There are also many possible directions for future work in relation to problems identified during the course of this project. These include, but are not limited to, the handling of pronoun dropping in pro-drop languages such as Czech, Romanian, Spanish and Italian, the development of pronoun-specific evaluation metrics, and addressing the problem of the availability of only a single reference translation.

The explicit handling of pronoun dropping when translating from a non-pro-drop language such as English into a pro-drop language such as Czech is lacking in current Machine Translation systems, and research in this area has been somewhat limited to date.
Exceptions include work on English-Italian translation (Gojun, 2010), which focussed on improving the translation of subject pronouns by improving the alignment of verb phrases containing pronominal subjects in a phrase-based SMT system, and a method for resolving intrasentential zero pronouns in English-Japanese translation (Nakaiwa and Ikehara, 1995). Kim et al. (2010) developed a method for identifying non-referential zero pronouns in Korean-English translation, but this has yet to be applied to a practical Machine Translation problem.

Future work could focus on the identification of pro-drop scenarios in English-Czech translation and the development of an explicit annotation method with which to mark those English pronouns that should be dropped in the Czech translation. Another option would be to remove from the English source text those pronouns that should be dropped in the Czech translation output. Both options require a method to predict whether an English pronoun should be dropped in the Czech translation. This could be achieved either by defining hand-written rules or by training a Machine Learning classifier on a parallel English-Czech corpus with sufficient coverage of the relevant pro-drop scenarios. The t-layer of the PCEDT 2.0 corpus contains annotation of pro-dropped pronouns, which are not realised in the Czech at the w-layer (surface-level text), and may therefore prove useful in the pursuit of explicit pro-drop handling in English-Czech SMT.

As Le Nagard and Koehn (2010) and Hardmeier and Federico (2010) have already identified, the lack of evaluation metrics suited to the specific problem of pronoun translation makes evaluation very difficult. A robust metric is essential for evaluating future work and for comparing different systems, both to establish whether progress is being made and to identify where sources of error exist.
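Returning to the pro-drop proposal above, the hand-written-rule option could be sketched as below. This is a deliberately naive illustration under an assumed heuristic (a pronoun directly preceding a finite verb is likely a subject that Czech would leave unexpressed); a real predictor would use a parse or a trained classifier, and the `<DROP>` marker is an invented annotation convention.

```python
# English subject pronouns that Czech typically drops when they are
# recoverable from verb morphology.
SUBJECT_PRONOUNS = {"i", "you", "he", "she", "it", "we", "they"}

# Penn Treebank tags for finite verbs and modals.
FINITE_VERB_TAGS = {"VBD", "VBZ", "VBP", "MD"}

def mark_droppable(tagged):
    """tagged: list of (token, POS) pairs with Penn Treebank tags.
    Returns the tokens, prefixing '<DROP>' to pronouns predicted to be
    dropped in the Czech translation."""
    out = []
    for i, (tok, pos) in enumerate(tagged):
        nxt = tagged[i + 1][1] if i + 1 < len(tagged) else None
        is_subj = (tok.lower() in SUBJECT_PRONOUNS and pos == "PRP"
                   and nxt in FINITE_VERB_TAGS)
        out.append("<DROP>" + tok if is_subj else tok)
    return out

print(mark_droppable([("She", "PRP"), ("sleeps", "VBZ")]))
# ['<DROP>She', 'sleeps']
```

The t-layer/w-layer contrast in PCEDT 2.0 could supply training labels for a learned version of this predictor: a pronoun present at the t-layer but absent at the w-layer is a positive pro-drop example.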
It is also necessary to consider the requirement for an evaluation metric which handles not only translated pronouns that should be present in the translation output, but also those which should be dropped, or for which pro-drop is a suitable alternative to an overt pronoun. Ideally, such a metric would be developed before future work on the translation problem itself.

In connection with the issue of evaluation, the provision of only a single reference translation is a particular problem in the evaluation of pronouns in English-Czech translation, owing to the highly inflective nature of Czech and hence the number of possible syntactic forms that a Czech pronoun may take. For evaluation metrics incorporating the notions of precision and recall to become useful when translating into a highly inflected language, it is necessary to provide multiple reference translations that capture the range of valid alternatives. Whilst it is possible to employ a number of translators to provide additional reference translations of the same original text, this can be both slow and costly. As an alternative, the use of paraphrasing to automatically generate synthetic reference translations may be considered. Work by Kauchak and Barzilay (2006) focussed on the use of paraphrase generation to provide sentence-level synthetic reference translations, which could help to refine the accuracy of automated evaluation methods in Machine Translation, thereby addressing the gap between automated evaluation and human judgements. Their technique takes a reference sentence and the Machine Translation system output, and finds a paraphrase of the reference sentence whose wording is closer to the system output than the reference itself. This moves away from prior research, in which the aim was to produce any paraphrase of the reference.
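The effect of multiple references on a precision-style pronoun metric can be illustrated with a small sketch. The function and the Czech forms used in the example are illustrative only: the system pronoun is counted correct if it matches the pronoun found in any available reference, so each added reference can only widen the set of accepted surface forms.

```python
def pronoun_match_rate(hyp_pronouns, reference_sets):
    """hyp_pronouns: the pronoun the system produced at each evaluation
    point. reference_sets: for each point, the set of pronoun forms
    observed across all available reference translations."""
    hits = sum(h in refs for h, refs in zip(hyp_pronouns, reference_sets))
    return hits / len(hyp_pronouns)

single = [{"jej"}, {"ji"}]          # one reference translation only
multi = [{"jej", "ho"}, {"ji"}]     # a second reference adds the form "ho"

print(pronoun_match_rate(["ho", "ji"], single))  # 0.5
print(pronoun_match_rate(["ho", "ji"], multi))   # 1.0
```

With a single reference, the system is penalised for producing "ho" where the reference happens to use "jej", even though both can be acceptable; a second reference recovers the match, which is why multiple references matter so much for an inflected language.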
However, the technique of Kauchak and Barzilay (2006) applies only to content words, and would therefore need to be adapted to the more specific issue of pronouns before it could be used in practice. More recent work by Chen and Dolan (2011) focusses on the use of crowd-sourcing techniques to obtain sentence-level paraphrase data, by asking human participants to describe what they see in a video and to take part in a separate direct paraphrase task. Using video as a medium for gathering alternative translations leads to the generation of short texts, and is not suitable for many domains. Crowd-sourcing techniques used to obtain paraphrases in direct paraphrase tasks, and in the solicitation of multilingual translations, may also prove useful in obtaining multiple reference translations. One example of the use of crowd-sourcing to obtain multiple multilingual translations is Microsoft's WikiBABEL project (Kumaran et al., 2008).

In short, there are a great number of possibilities for further research in this area. The accurate translation of pronouns, incorporating the use of coreference resolution techniques, is an extremely interesting and highly important problem for which there remains great scope for future work.

Appendix A
Czech Pronouns Used in the Automated Evaluation

The following tables of Czech pronouns were used in the automated evaluation. Where the pronoun is a possessive, possessive reflexive or demonstrative pronoun, the gender refers to the object(s) in possession. Where the pronoun is a personal pronoun, the gender refers to the person or group of persons. In the tables, "Masc. An." and "Masc. Inan." denote the "Masculine Animate" and "Masculine Inanimate" genders respectively.

Table A.1: Czech Pronouns: Personal

Czech Pronoun | English Translation | Number | Masc. An. | Masc. Inan. | Feminine | Neuter
je | it | singular | 1 | 1 | 0 | 1
jeho | him, his, it, its | singular | 1 | 1 | 0 | 1
jej | him, it | singular | 1 | 1 | 0 | 1
jemu | him, it | singular | 1 | 1 | 0 | 1
ji | her, it | singular | 0 | 0 | 1 | 0
jí | her, it | singular | 0 | 0 | 1 | 0
jím | him, it | singular | 1 | 1 | 0 | 1
ho | him, it | singular | 1 | 1 | 0 | 1
mu | him, it | singular | 1 | 1 | 0 | 1
ně | it | singular | 0 | 0 | 0 | 1
něho | him, it | singular | 1 | 1 | 0 | 1
něj | him, it | singular | 1 | 1 | 0 | 1
němu | him, it | singular | 1 | 1 | 0 | 1
ni | her, it | singular | 0 | 0 | 1 | 0
ní | her, it | singular | 0 | 0 | 1 | 0
ním | him, it | singular | 1 | 1 | 0 | 1
on | he, it | singular | 1 | 1 | 0 | 0
ona | she, it | singular | 0 | 0 | 1 | 0
ono | it | singular | 0 | 0 | 0 | 1
se | himself, herself, itself, themselves | singular | 1 | 1 | 1 | 1
sebe | himself, herself, itself, themselves | singular | 1 | 1 | 1 | 1
sebou | himself, herself, itself, themselves | singular | 1 | 1 | 1 | 1
si | himself, herself, itself, themselves | singular | 1 | 1 | 1 | 1
sobě | himself, herself, itself, themselves | singular | 1 | 1 | 1 | 1
je | them | plural | 1 | 1 | 1 | 1
jich | them | plural | 1 | 1 | 1 | 1
jim | them | plural | 1 | 1 | 1 | 1
jimi | them | plural | 1 | 1 | 1 | 1
ně | them | plural | 1 | 1 | 1 | 1
nich | them | plural | 1 | 1 | 1 | 1
nim | them | plural | 1 | 1 | 1 | 1
nimi | them | plural | 1 | 1 | 1 | 1
ona | they | plural | 0 | 0 | 0 | 1
oni | they, these, those | plural | 1 | 1 | 0 | 0
ony | they, these, those | plural | 0 | 0 | 1 | 0
se | himself, herself, itself, themselves | plural | 1 | 1 | 1 | 1
sebe | himself, herself, itself, themselves | plural | 1 | 1 | 1 | 1
sebou | himself, herself, itself, themselves | plural | 1 | 1 | 1 | 1
si | himself, herself, itself, themselves | plural | 1 | 1 | 1 | 1
sobě | himself, herself, itself, themselves | plural | 1 | 1 | 1 | 1

Table A.2: Czech Pronouns: Possessive

Czech Pronoun | English Translation | Number | Masc. An. | Masc. Inan. | Feminine | Neuter
jeho | him, it, its | singular | 1 | 1 | 0 | 1
její | hers, its, her | singular | 1 | 1 | 1 | 1
jejích | hers, its, her | singular | 1 | 1 | 1 | 1
jejího | hers, its, her | singular | 1 | 1 | 0 | 1
jejím | hers, its, her | singular | 1 | 1 | 0 | 1
jejími | hers, its, her | singular | 1 | 1 | 1 | 1
jejímu | hers, its, her | singular | 1 | 1 | 0 | 1
jejich | their, theirs | plural | 1 | 1 | 1 | 1
její | hers, its, her | plural | 1 | 1 | 1 | 1

Table A.3: Czech Pronouns: Possessive Reflexive

Czech Pronoun | English Translation | Number | Masc. An. | Masc. Inan. | Feminine | Neuter
svá | his, her, its, their | singular | 0 | 0 | 1 | 0
své | his, her, its, their | singular | 0 | 0 | 1 | 1
svého | his, her, its | singular | 1 | 0 | 1 | 1
svém | his, its | singular | 1 | 1 | 0 | 1
svému | his, her, its | singular | 1 | 1 | 0 | 1
svoje | his, her, its, their | singular | 0 | 0 | 1 | 1
svoji | his, her, its, their | singular | 0 | 0 | 1 | 0
svojí | his, her, its | singular | 0 | 0 | 1 | 0
svou | her, its | singular | 0 | 0 | 1 | 0
svůj | his, her, its | singular | 1 | 1 | 0 | 0
svým | his, her, its, their | singular | 1 | 1 | 0 | 1
svá | his, her, its, their | plural | 0 | 0 | 0 | 1
své | his, her, its, their | plural | 1 | 1 | 1 | 0
sví | their, theirs | plural | 1 | 0 | 0 | 0
svoje | his, her, its, their | plural | 1 | 1 | 1 | 1
svoji | his, her, its, their | plural | 1 | 0 | 0 | 0
svých | their, theirs | plural | 1 | 1 | 1 | 1
svým | his, her, its, their | plural | 1 | 1 | 1 | 1
svými | their, theirs | plural | 1 | 1 | 1 | 1

Table A.4: Czech Pronouns: Demonstrative

Czech Pronoun | English Translation | Number | Masc. An. | Masc. Inan. | Feminine | Neuter
ten | this, he, it | singular | 1 | 1 | 0 | 0
ta | this, she, it | singular | 0 | 0 | 1 | 0
ta | these, they, them | plural | 0 | 0 | 0 | 1
to | this, it | singular | 0 | 0 | 0 | 1
toho | this, him, it | singular | 1 | 1 | 0 | 1
té | this, her | singular | 0 | 0 | 1 | 0
tomu | this, him, it | singular | 1 | 1 | 0 | 1
tu | this, her | singular | 0 | 0 | 1 | 0
tom | this, him, it | singular | 1 | 1 | 0 | 1
tím | this, him, it | singular | 1 | 1 | 0 | 1
tou | this, her | singular | 0 | 0 | 1 | 0
ti | these, they | plural | 1 | 0 | 0 | 0
ty | these, they, them | plural | 1 | 1 | 1 | 0
těch | these, them | plural | 1 | 1 | 1 | 1
těm | these, them | plural | 1 | 1 | 1 | 1
těmi | these, them | plural | 1 | 1 | 1 | 1

Bibliography

Bojar, O. and Hajič, J. (2008). Phrase-based and Deep Syntactic English-to-Czech Statistical Machine Translation. In Proceedings of the Third Workshop on Statistical Machine Translation, StatMT '08, pages 143-146, Stroudsburg, PA, USA. Association for Computational Linguistics.

Bojar, O. and Kos, K. (2010). 2010 Failures in English-Czech Phrase-based MT. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, WMT '10, pages 60-66, Stroudsburg, PA, USA.
Association for Computational Linguistics.

Callison-Burch, C., Fordyce, C., Koehn, P., Monz, C., and Schroeder, J. (2007). (Meta-) Evaluation of Machine Translation. In Proceedings of the ACL Workshop on Statistical Machine Translation.

Callison-Burch, C., Fordyce, C., Koehn, P., Monz, C., and Schroeder, J. (2008). Further Meta-Evaluation of Machine Translation. In Proceedings of the Third Workshop on Statistical Machine Translation, StatMT '08, pages 70-106, Stroudsburg, PA, USA. Association for Computational Linguistics.

Charniak, E. and Elsner, M. (2009). EM Works for Pronoun Anaphora Resolution. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics, pages 148-156.

Chen, D. L. and Dolan, W. B. (2011). Collecting Highly Parallel Data for Paraphrase Evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 190-200, Stroudsburg, PA, USA. Association for Computational Linguistics.

Gojun, A. (2010). Null Subjects in Statistical Machine Translation: A Case Study on Aligning English and Italian Verb Phrases with Pronominal Subjects. Master's thesis, Universität Stuttgart.

Grosz, B. J., Joshi, A. K., and Weinstein, S. (1995). Centering: A Framework for Modeling the Local Coherence of Discourse. Computational Linguistics, 21:203-225.

Hajič, J., Panevová, J., Hajičová, E., Sgall, P., Pajas, P., Štěpánek, J., Havelka, J., and Mikulová, M. (2006). Prague Dependency Treebank (PDT) 2.0, LDC Catalog No. LDC2006T01. Technical report, Linguistic Data Consortium.

Hardmeier, C. and Federico, M. (2010). Modelling Pronominal Anaphora in Statistical Machine Translation. In Proceedings of the 7th International Workshop on Spoken Language Translation.

Hoang, H., Birch, A., Callison-Burch, C., Zens, R., Constantin, A., Federico, M., Bertoldi, N., Dyer, C., Cowan, B., Shen, W., Moran, C., and Bojar, O. (2007).
Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Demonstration Session, pages 177-180.

Hobbs, J. (1978). Resolving Pronominal References. Lingua, 44:311-338.

Kauchak, D. and Barzilay, R. (2006). Paraphrasing for Automatic Evaluation. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, HLT-NAACL '06, pages 455-462, Stroudsburg, PA, USA. Association for Computational Linguistics.

Kim, K.-S., Park, S.-B., Song, H.-J., Park, S., and Lee, S.-J. (2010). Identification of Non-referential Zero Pronouns for Korean-English Machine Translation. In Zhang, B.-T. and Orgun, M., editors, PRICAI 2010: Trends in Artificial Intelligence, pages 112-122. Springer, Berlin/Heidelberg.

Kneser, R. and Ney, H. (1995). Improved Backing-Off for M-gram Language Modeling. In IEEE International Conference on Acoustics, Speech, and Signal Processing, 1:181-184.

Koehn, P. (2009). Statistical Machine Translation. Cambridge University Press, 1st edition.

Kumaran, A., Saravanan, K., and Maurice, S. (2008). WikiBABEL: Community Creation of Multilingual Data. In Proceedings of the 4th International Symposium on Wikis, WikiSym '08, pages 14:1-14:11, New York, NY, USA. ACM.

Lappin, S. and Leass, H. J. (1994). An Algorithm for Pronominal Anaphora Resolution. Computational Linguistics, 20:535-561.

Le Nagard, R. and Koehn, P. (2010). Aiding Pronoun Translation with Co-reference Resolution. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, WMT '10, pages 252-261, Stroudsburg, PA, USA. Association for Computational Linguistics.

Linh, N. G., Novák, V., and Žabokrtský, Z. (2009). Comparison of Classification and Ranking Approaches to Pronominal Anaphora Resolution in Czech. In Proceedings of the SIGDIAL 2009 Conference: The 10th Annual Meeting of the Special Interest Group on Discourse and Dialogue, SIGDIAL '09, pages 276-285, Stroudsburg, PA, USA.
Association for Computational Linguistics.

Mitkov, R. (1999). Introduction: Special Issue on Anaphora Resolution in Machine Translation and Multilingual NLP. Machine Translation, 14:159-161.

Mitkov, R., Choi, R. S.-K., and Sharp, R. (1995). Anaphora Resolution in Machine Translation. In Proceedings of the Sixth International Conference on Theoretical and Methodological Issues in Machine Translation, pages 5-7.

Mitkov, R., Evans, R., Orasan, C., Barbu, C., Jones, L., and Sotirova, V. (2000). Coreference and Anaphora: Developing Annotating Tools, Annotated Resources and Annotation Strategies. In Proceedings of the Discourse, Anaphora and Reference Resolution Conference (DAARC2000), pages 49-58, Lancaster, UK.

Nakaiwa, H. and Ikehara, S. (1995). Intrasentential Resolution of Japanese Zero Pronouns in a Machine Translation System Using Semantic and Pragmatic Constraints. In Semantic Constraints Viewed from Ellipsis and Inter-Event Relations (in Japanese), IEICE-WGNLC, pages 96-105.

Ng, V. (2010). Supervised Noun Phrase Coreference Research: The First Fifteen Years. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, pages 1396-1411, Stroudsburg, PA, USA. Association for Computational Linguistics.

Och, F. J. (2003). Minimum Error Rate Training in Statistical Machine Translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, ACL '03, pages 160-167, Stroudsburg, PA, USA. Association for Computational Linguistics.

Och, F. J. and Ney, H. (2003). A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29:19-51.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 311-318, Stroudsburg, PA, USA. Association for Computational Linguistics.

Saggion, H. and Carvalho, A.
(1994). Anaphora Resolution in a Machine Translation System. In Proceedings of the International Conference: Machine Translation, 10 Years On.

Soon, W. M., Ng, H. T., and Lim, D. C. Y. (2001). A Machine Learning Approach to Coreference Resolution of Noun Phrases. Computational Linguistics, 27:521-544.

Stolcke, A. (2002). SRILM - An Extensible Language Modeling Toolkit. In Proceedings of ICSLP, volume 2, pages 901-904, Denver, USA.

Strube, M. (2007). Corpus-based and Machine Learning Approaches to Anaphora Resolution. In Anaphors in Text: Cognitive, Formal and Applied Approaches to Anaphoric Reference. John Benjamins Publishing Company.

Čmejrek, M., Hajič, J., and Kuboň, V. (2004). Prague Czech-English Dependency Treebank: Syntactically Annotated Resources for Machine Translation. In Proceedings of the EAMT 10th Annual Conference.

Weischedel, R. and Brunstein, A. (2005). BBN Coreference and Entity Type Corpus, LDC Catalog No. LDC2005T33. Technical report, Linguistic Data Consortium.

Weischedel, R., Pradhan, S., Ramshaw, L., Kaufman, J., Franchini, M., El-Bachouti, M., Xue, N., Palmer, M., Marcus, M., Taylor, A., Greenberg, C., Hovy, E., Belvin, R., and Houston, A. (2009). OntoNotes Release 3.0, LDC Catalog No. LDC2009T24. Technical report, Linguistic Data Consortium.