Elliphant: A Machine Learning Method for Identifying Subject Ellipsis and Impersonal Constructions in Spanish

Luz Rello

Main advisor: Ruslan Mitkov
Co-advisor: Xavier Blanco

A thesis submitted for the degree of Erasmus Mundus International Master in Natural Language Processing and Human Language Technology

Research Group in Computational Linguistics
University of Wolverhampton

June 2010

Laboratori fLexSem
Universitat Autònoma de Barcelona

In memory of Juan Rello

“And then again,” Grandpa Joe went on, speaking very slowly now so that Charlie wouldn’t miss a word, “Mr Willy Wonka can make marshmallows that taste of violets, and rich caramels that change colour every ten seconds as you suck them, and little feathery sweets that melt away deliciously the moment you put them between your lips. He can make chewing-gum that never loses its taste, and sugar balloons that you can blow up to enormous sizes before you pop them with a pin and gobble them up. And, by a most secret method, he can make lovely blue birds’ eggs with black spots on them, and when you put one of these in your mouth, it gradually gets smaller and smaller until suddenly there is nothing left except a tiny little pink sugary baby bird sitting on the tip of your tongue.”

Charlie and the Chocolate Factory, Roald Dahl

Abstract

This thesis presents Elliphant, a machine learning system for classifying Spanish subject ellipsis as either referential or non-referential. Linguistically motivated features are incorporated in a system which performs a ternary classification: verbs with explicit subjects, verbs with omitted but referential subjects (zero pronouns), and verbs with no subject (impersonal constructions). To the best of our knowledge, this is the first attempt to automatically identify non-referential ellipsis in Spanish. In order to enable a memory-based strategy, the eszic Corpus was created and manually annotated. The corpus is composed of Spanish legal and health texts and contains more than 6,800 annotated instances. A set of 14 features was defined and a separate training file was created, containing the instances represented as vectors of feature values. The training data was used with the Weka package, and a set of optimization experiments was carried out to determine the best machine learning algorithm, its optimal parameters, the most effective combinations of features, the optimal number of instances needed to train the classifier, and the optimal settings for classifying instances occurring in different genres. A comparative evaluation of Elliphant with Connexor’s Machinese Syntax parser shows the superiority of our system. The overall accuracy of the system is 86.9%. Because subjects are elided fairly frequently in Spanish, classifying elliptic subjects as referential or non-referential can improve the accuracy of Natural Language Processing applications in which zero anaphora resolution is necessary, inter alia, information extraction, machine translation, automatic summarization and text categorization.

Acknowledgements

First, my sincere acknowledgements to Prof. Ruslan Mitkov for providing everything that can be asked of a supervisor: constant trust, support and encouragement from the very beginning until the end of this thesis.
There are three other persons without whom this work would not have been possible (alphabetically): Thank you, Ricardo Baeza-Yates, for your brilliant ideas; thank you, Richard Evans, for your guidance; and thank you, Pablo Suárez, for helping the project to become a reality. I would like to acknowledge the Computational Linguistics Group at the University of Wolverhampton, where my collaboration during the first year brought its first results, especially Iustina Ilisei and Naveed Afzal. Thank you for the assistance received at the Universitat Autònoma de Barcelona from my co-advisor Xavier Blanco and from José María Brucart and Joaquim Llisterri. I am indebted to the Grupo de Investigación en Tratamiento Automático del Lenguaje Natural of Universitat Pompeu Fabra for their support and feedback during this last semester, particularly to Gabriela Ferraro and Leo Wanner. Finally, thank you to Igor Mel’čuk and Ignacio Bosque for resolving doubts and to Sang Yoon Kim and Ana Suárez Fernández for their help throughout the annotation process. These master studies were supported by a “La Caixa” grant (Becas de “La Caixa” para estudios de máster en España, Convocatoria 2008).

Contents

1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Results
1.4 Thesis Outline

2 Related Work
2.1 NLP Approaches
2.1.1 NLP Approaches to Zero Pronouns
2.1.2 NLP Approaches to Identifying Non-referential Constructions
2.2 Linguistic Approaches
2.2.1 Linguistic Approaches to Subject Ellipsis
2.2.2 Linguistic Approaches to Non-referential Ellipsis

3 Detecting Ellipsis in Spanish
3.1 Classification
3.1.1 Explicit Subjects: Non-elliptic and Referential
3.1.2 Zero Pronouns: Elliptic and Referential
3.1.3 Impersonal Constructions: Elliptic and Non-referential
3.2 Machine Learning Approach
3.2.1 Building the Training Data
3.2.2 Annotation Software and Annotation Guidelines
3.2.3 Features
3.2.4 Purpose Built Tools
3.2.5 The WEKA Package

4 Evaluation
4.1 Experiments
4.1.1 Method Selected: K* Algorithm
4.1.2 Learning Curve
4.1.3 Most Effective Features
4.1.4 Genre Analysis
4.2 Comparative Evaluation

5 Conclusions and Future Work
5.1 Main Observations
5.2 Future Research
References

List of Figures
2.1 Types of subject ellipsis (Brucart, 1999) and types of verbs (Real Academia Española, 2009).
3.1 An example of the output of Connexor’s Machinese Syntax parser for Spanish.
3.2 Screenshot of the annotation program interface.
3.3 An example of the Weka Explorer interface.
4.1 eszic training data learning curve for accuracy.
4.2 eszic training data learning curve for precision, recall and f-measure.
4.3 Learning curve for accuracy, recall and f-measure of the classes.
4.4 Learning curve for accuracy, recall and f-measure in relation to the number of instances of each class.

List of Tables
3.1 eszic Corpus: tokens, sentences and clauses.
3.2 eszic Corpus: number of instances per class.
3.3 eszic Corpus annotation tags.
3.4 Features: definitions and values.
4.1 Weka classifiers accuracy (20% of the eszic training set).
4.2 eszic training data evaluation with K* -B 40 -M a.
4.3 Leave-one-out and ten-fold cross-validation comparison.
4.4 Selected features by Weka Attribute Selection methods.
4.5 Classification using the selected feature groups: accuracy.
4.6 Extrinsic parser features classification results.
4.7 Intrinsic parser features classification results.
4.8 Single feature omission classifications: accuracy.
4.9 Legal and health genres comparative evaluation.
4.10 Cross-genre training and testing evaluation.
4.11 Elliphant eszic training data results.
4.12 Machinese eszic training data results.
4.13 Elliphant Legal eszic training results.
4.14 Machinese Legal eszic training results.
4.15 Elliphant Health eszic training data results.
4.16 Machinese Health eszic training data results.

Chapter 1 Introduction

This introduction explains the three primary motivations for this research (Section 1.1) and its objectives (Section 1.2), and briefly describes its outcomes. These outcomes include the results of an evaluation of the implemented system and publications produced over the course of the study (see Section 1.3). The overall structure of the thesis is also presented in Section 1.4.

1.1 Motivation

There are three reasons motivating the decision to choose this research topic and develop a tool, Elliphant, to perform the identification of zero pronouns (referential elliptic subjects) and impersonal constructions (non-referential elliptic non-existing subjects) in Spanish.
The three justifications for this work are: (1) the highly frequent occurrence of zero pronouns in Spanish; (2) the fact that identification of zero pronouns is a prerequisite for anaphora resolution in Spanish and also for other Natural Language Processing (nlp) applications; and (3) the fact that this challenge had not yet been fully addressed in the field. The system presented in this dissertation represents the first attempt to automatically identify non-referential ellipsis in Spanish. Since Spanish is a pro-drop language (Chomsky, 1981), subject ellipsis is a recurring phenomenon. It was noted that 26% of the 6,878 cases annotated in the corpus exploited in this work have an elliptic subject, while only 3% of them occur in impersonal constructions. The topic of subject ellipsis has been addressed in previous work on other pro-drop languages such as Japanese (Okumura & Tamura, 1996), Chinese (Zhao & Ng, 2007), Korean (Lee & Byron, 2004) and Russian (Kibrik, 2004). The related topic of the identification of non-referential pronouns has been addressed in non-pro-drop languages such as English (Evans, 2001) and French (Danlos, 2005).

The identification of zero pronouns and non-referential impersonal constructions is necessary for anaphora resolution, since the resolution of zero pronouns (zero anaphora) implies that they need to be identified. The identification of zero pronouns first requires that they can be distinguished from non-referential constructions (Mitkov, 2010). Coreference and anaphora resolution, and in particular zero anaphora resolution, have been found to be crucial in a number of nlp applications. These include, but are not limited to, information extraction (Chinchor & Hirschman, 1997), machine translation (Peral & Ferrández, 2000), automatic summarisation (Steinberger et al., 2007), text categorisation (Yeh & Chen, 2003a), topic recognition (Yeh & Chen, 2007), salience identification (Iida et al., 2009) and word sense disambiguation (Kawahara & Kurohashi, 2004). Moreover, there is additional research showing that zero pronoun identification is useful in order to make further developments in centering theory (Matsui, 1999), for named entity recognition (Hirano et al., 2007), for the investigation of convergence universals in translation (Corpas Pastor et al., 2008) and to discriminate predicate-argument structure (Imamura et al., 2009).

Finally, the difficulty of detecting non-referential pronouns has been acknowledged since computational resolution of anaphora was first attempted (Bergsma et al., 2008), and this task is currently needed in nlp for Spanish. The need for automatic tools able to detect ellipticals has been stated by Recasens & Hovy (2009), who note that their application would improve existing methods for zero anaphora resolution in Spanish (Ferrández & Peral, 2000). One particular contribution of the current research is the recognition of Spanish impersonal constructions which, following from the literature review presented in Chapter 2, appears not to have been addressed before in the literature.

1.2 Objectives

The goal of the fully automatic method presented in this dissertation (Elliphant) is to identify zero pronouns (referential elliptic subjects) and impersonal constructions (non-referential elliptic subjects) in Spanish. In order to accomplish this objective, it is also necessary to identify the cases that occur in the subject position in complementary distribution.
For this reason, the identification of explicit subjects was carried out using a learning-based method, which led to a ternary classification covering all the elements (elliptic and explicit, referential and non-referential) of the subject position in the clause. These three classes are explicit subjects, zero pronouns and impersonal constructions.

1.3 Results

The results obtained by the Elliphant system and the level of performance that it reaches are encouraging, since this tool not only identifies zero pronouns and impersonal constructions but also outperforms a dependency parser (Connexor’s Machinese Syntax) in identifying explicit subjects as well as elliptic subjects. A series of experiments undertaken with the algorithm has enabled discovery of the most effective features for use in the classification tasks. The performance results obtained for the identification of impersonal constructions are, according to the survey of previous work carried out in Chapter 2, the first presented for this task in the literature.

The classification results obtained by the algorithm were presented in Rello et al. (2010b). However, that paper did not investigate further the efficacy of the features used, which were presented in Rello et al. (2010a). With regard to the attempt to achieve improved performance from the Elliphant system, two previous studies have contributed to its design: one concerning the distribution of zero pronouns (Rello & Illisei, 2009a) and the other presenting a rule-based method for their identification (Rello & Illisei, 2009b). It should be noted, however, that despite their contribution, Elliphant differs considerably from these initial studies in terms of methodology (corpus used, linguistic criteria exploited, and the overall approach) and the classification task itself (classes to be identified). Overall, the Elliphant system represents a considerable advancement on those works.

1.4 Thesis Outline

The remainder of this thesis is structured in four chapters. Chapter 2 provides a literature review of nlp approaches (see Section 2.1) to zero pronouns (Section 2.1.1) and to the identification of non-referential expressions (Section 2.1.2). The review also covers work in the field of Linguistics, including approaches to referential and non-referential subject ellipsis (Sections 2.2.1 and 2.2.2). Chapter 3 describes the methodology embodied by the Elliphant system. Firstly, the classification task (see Section 3.1) is presented, with an explanation of each of the classes: explicit subjects (Section 3.1.1), zero pronouns (Section 3.1.2) and impersonal constructions (Section 3.1.3). Secondly, the machine learning method (see Section 3.2) is described, beginning with the compilation of the corpus (Section 3.2.1), the guidelines established and the software developed to facilitate annotation of the corpus by human annotators (Section 3.2.2), a description of the features (see Section 3.2.3) derived from the corpus, and the purpose built tools (Section 3.2.4) implemented to generate the training data exploited by the machine learning package, Weka (Section 3.2.5). Elliphant is evaluated in Chapter 4. A set of experiments (Section 4.1) was carried out to determine the method and parameter values which work best for these classification tasks (Section 4.1.1), its learning curves (Section 4.1.2) and the most effective groups of features (Section 4.1.3). A comparative evaluation of the Elliphant system with an existing parser is presented in Section 4.2.
Finally, in Chapter 5, conclusions are drawn and plans for future work are considered.

Chapter 2 Related Work

Both the nlp and linguistics literature address referential and non-referential subject ellipsis. Although the nlp literature is directly related to this dissertation in terms of objectives and methodology, more general literature in linguistics contributes various means by which classes of subject ellipsis and annotation criteria can be established. Related work in nlp (see Section 2.1) on this topic can be classified as (a) literature related to zero pronouns (Section 2.1.1), which is mainly concerned with their identification, resolution and generation, and (b) literature related to the identification of non-referential constructions (Section 2.1.2). The literature in linguistics (Section 2.2) concerning different types of ellipsis, in which both zero pronouns (see Section 2.2.1) and non-referential constructions (see Section 2.2.2) are included, is focused on the definition, delimitation and description of their use in language.

2.1 NLP Approaches

The nlp literature on this topic broadly concerns two topics, namely zero pronouns (Section 2.1.1) and non-referential constructions (Section 2.1.2). The number and variety of studies of the first group is considerably larger than that of the second. Both topics are mainly related to coreference and anaphora resolution systems, as the resolution of zero pronouns (zero anaphora) implies their prior identification. That identification requires first the identification of zero pronouns and secondly the identification of non-referential constructions (Mitkov, 2010).

While undertaking this literature review, no specific studies on the identification of non-referential constructions were found in Spanish, although it has been indicated to be a necessary task (Ferrández & Peral, 2000; Recasens & Hovy, 2009) in anaphora and coreference resolution. For this reason it is expected that the method presented in this dissertation will complement current Spanish pronoun resolution systems.

2.1.1 NLP Approaches to Zero Pronouns

A zero pronoun is the resultant “gap” (zero anaphor) where zero anaphora or ellipsis occurs, when an anaphoric pronoun is omitted but is nevertheless understood (Mitkov, 2002). In linguistics, zero pronouns are also referred to as null subjects, empty subjects, elliptic subjects, elided subjects, tacit subjects, understood subjects and non-explicit subjects, among others. In the nlp literature such omitted subjects are broadly denoted as zero pronouns. Some linguistic studies also make use of the term “zero pronoun”, which is not equivalent to the computational concept. The Meaning-Text Theory (mtt) considers a zero pronoun in subject position to be a non-argumental impersonal subject (Mel’čuk, 2006):

Llueve. (It) is raining.

while in Generative Grammar, following the Zero Hypothesis (Kratzer, 1998), a zero pronoun can have phonetic content (full pronoun) or not (null pronoun). In this theory, the concept of zero pronoun has to do only with its lack of lexical content in contrast to lexical pronouns (Alonso-Ovalle & D’Introno, 2000). In this work a zero pronoun (Mitkov, 2002) corresponds to an omitted subject (Real Academia Española, 2009) in Spanish.

Zero pronouns become crucial when processing any pro-drop language (Chomsky, 1981) –also known as null subject languages– since zero anaphora is fairly frequent in such languages.
By way of example, of the 6,827 annotated cases in our corpus, 26% have an omitted subject. The current literature review indicates that related work on zero pronoun processing has been carried out on the following pro-drop languages:

– Japanese (Hirano et al., 2007; Iida et al., 2006, 2009; Imamura et al., 2009; Isozaki & Hirao, 2003; Kawahara & Kurohashi, 2004; Matsui, 1999; Mori & Nakagawa, 1996; Murata et al., 1999; Nakagawa, 1992; Nakaiwa, 1997; Nakaiwa & Ikehara, 1992; Nakaiwa & Shirai, 1996; Nomoto & Yoshihiko, 1993; Okumura & Tamura, 1996; Sasano et al., 2008; Seki et al., 2002; Takada & Doi, 1994; Yoshimoto, 1988);
– Chinese (Hu, 2008; Peng & Araki, 2007a,b; Yeh & Chen, 2003a,b, 2007; Yeh & Mellish, 1997; Zhao & Ng, 2007);
– Korean (Han, 2004; Lee & Byron, 2004; Lee et al., 2005);
– Spanish (Barreras, 1993; Corpas Pastor, 2008; Corpas Pastor et al., 2008; Ferrández & Peral, 2000; Peral, 2002; Peral & Ferrández, 2000; Rello & Illisei, 2009a,b); and
– Russian (Kibrik, 2004).

These studies of zero pronouns address a variety of topics. Depending on their goal, the literature on zero pronouns can be divided into the following classes:

– Zero pronoun classification or annotation (Han, 2004; Kibrik, 2004; Lee & Byron, 2004; Lee et al., 2005; Rello & Illisei, 2009a);
– Zero pronoun identification (Corpas Pastor, 2008; Corpas Pastor et al., 2008; Nakaiwa, 1997; Rello & Illisei, 2009b; Yoshimoto, 1988);
– Resolution of zero pronouns, including their prior identification (Barreras, 1993; Ferrández & Peral, 2000; Hu, 2008; Isozaki & Hirao, 2003; Kawahara & Kurohashi, 2004; Murata et al., 1999; Nakaiwa & Shirai, 1996; Nomoto & Yoshihiko, 1993; Okumura & Tamura, 1996; Peng & Araki, 2007b; Sasano et al., 2008; Seki et al., 2002; Yeh & Chen, 2003b; Zhao & Ng, 2007); and
– Zero pronoun generation (Peral, 2002; Peral & Ferrández, 2000; Theune et al., 2006; Yeh & Mellish, 1997).

Other nlp applications where zero pronouns are taken into consideration are: machine translation (Nakaiwa & Ikehara, 1992; Nakaiwa & Shirai, 1996; Peng & Araki, 2007a; Peral, 2002; Peral & Ferrández, 2000); named entity recognition (Hirano et al., 2007); summarisation (Steinberger et al., 2007); text categorisation (Yeh & Chen, 2003a); topic identification (Yeh & Chen, 2007) and identifying salience in text (Iida et al., 2009); and word sense disambiguation (Kawahara & Kurohashi, 2004). Further research topics where zero pronoun identification is useful are predicate-argument structure discrimination (Imamura et al., 2009), further developments in centering theory (Matsui, 1999) such as improved interpretation of zero pronouns (Takada & Doi, 1994), and the investigation of convergence universals in translation (Corpas Pastor, 2008; Corpas Pastor et al., 2008). Studies of specific cases of zero pronouns have also been carried out, addressing, among others, zero pronouns whose referents take the semantic role of experiencer (Nakagawa, 1992), zero pronouns in relation to conditional constructions (Mori & Nakagawa, 1996) and descriptions of the syntactic patterns in which zero pronouns are used (Iida et al., 2006).
In terms of methodology, rule-based, machine learning, and a variety of other approaches have been taken toward zero pronoun identification and resolution:

– Rule-based approaches (Barreras, 1993; Corpas Pastor et al., 2008; Ferrández & Peral, 2000; Hu, 2008; Kawahara & Kurohashi, 2004; Kibrik, 2004; Matsui, 1999; Mori & Nakagawa, 1996; Murata et al., 1999; Nakagawa, 1992; Nakaiwa & Ikehara, 1992; Nakaiwa & Shirai, 1996; Nomoto & Yoshihiko, 1993; Peral, 2002; Peral & Ferrández, 2000; Rello & Illisei, 2009b; Yeh & Chen, 2003a,b, 2007; Yeh & Mellish, 1997; Yoshimoto, 1988);
– Machine learning approaches (Hirano et al., 2007; Iida et al., 2006, 2009; Kawahara & Kurohashi, 2004; Peng & Araki, 2007b; Zhao & Ng, 2007);
– Hybrid methods combining rules and learning algorithms (Isozaki & Hirao, 2003);
– Probabilistic models (Sasano et al., 2008; Seki et al., 2002); and
– Other techniques, such as the exploitation of parallel corpora (Nakaiwa, 1997).

Although it is clear that machine learning methods perform better than other approaches when identifying non-referential expressions (Boyd et al., 2005), there is some debate about which approach brings optimal performance when applied in anaphora resolution systems (Mitkov, 2002).

In Spanish, the most influential work on this topic is the Ferrández and Peral algorithm for zero pronoun resolution (Ferrández & Peral, 2000), together with their previous related work (Ferrández et al., 1998, 1999). Their implementation of a zero pronoun identification and resolution module forms part of a system known as the Slot Unification Parser for Anaphora resolution (supar) (Ferrández et al., 1999). Although substantially related, the work described in this dissertation differs both in form and in aim from this previous research for Spanish (Ferrández & Peral, 2000). Firstly, their definition of zero pronouns is broader since it is suited to a different purpose: the zero class includes not only those zero signs whose referent lies in previous clauses (anaphoric, according to their classification) and those that lie outside the text (exophoric), but also those that occur after the verb (cataphoric). Here, it is considered that those subjects that are within the clause, irrespective of whether they appear before or after the verb, belong to the explicit subject class. Secondly, Ferrández & Peral (2000) take a rule-based approach while the system described in this dissertation performs the classification using an instance-based learner. Additionally, their rules are based on partial parsing, while some of the features exploited by the Elliphant system make use of information obtained from an analysis of our corpus by a deep dependency parser. Ferrández & Peral (2000) tested their approach to zero pronoun identification and resolution using 1,599 cases, while the machine learning approach presented in this dissertation was tested on a corpus containing 6,827 classified verbal instances. Finally, they do not provide a method for the identification of non-referential zero pronouns. They also make no overt mention of automatic classification of zero pronouns of the anaphoric or cataphoric kind (Ferrández & Peral, 2000). Despite the similarities of Ferrández & Peral’s (2000) work to the approach described in this dissertation, the fact that they adopt a different definition of zero pronouns means that a comparison with the method described in the current work is not feasible (Section 4.2).
This study aims to improve on previous work by the current author (Rello & Illisei, 2009b) and differs from it in the design of the classification and in the methodology. In Rello & Illisei (2009b) a binary classification as either elliptic subject or non-elliptic subject was made as a result of the implementation of a rule-based method which applies only to zero pronouns, whilst in the present study a ternary classification is presented which covers all the possible instances of subject position in Spanish. Moreover, while zero pronouns were annotated in Rello & Illisei (2009b), in the present study the zero pronouns themselves were left unmarked. Instead, the main verb of each clause is annotated and classified into one of three types. The baseline rule-based algorithm described in Rello & Illisei (2009b) was based on the zero pronoun identification methodology developed in Corpas Pastor et al. (2008), which treats every clause which does not have an explicit subject as containing a zero pronoun.

2.1.2 NLP Approaches to Identifying Non-referential Constructions

The identification of non-referential pronouns1 is a crucial step in coreference (Boyd et al., 2005; Mitkov, 2010) and anaphora resolution systems (Mitkov, 2001, 2002). In comparison to the work addressing zero pronouns, previous research on this topic is fairly limited, and, as implied by this survey of related work, the approach described in this dissertation is the first attempt to automatically identify impersonal constructions in Spanish.

1 In previous work these pronouns have also been referred to as pleonastic, expletive, non-anaphoric, and non-referential pronouns.

The literature describing approaches to the identification of non-referential expressions is focused on:

– Identification of pleonastic it in English (Denber, 1998; Lappin & Leass, 1994; Paice & Husk, 1987). Work by Evans (2000, 2001) is exploited by an anaphora resolution system in Mitkov et al. (2002). See also (Bergsma et al., 2008; Boyd et al., 2005; Clemente et al., 2004; Gundel et al., 2005; Lambrecht, 2001; Li et al., 2009; Müller, 2006; Ng & Cardie, 2002); and
– Identification of expletive pronouns in French (Danlos, 2005).

Nevertheless, in those languages where approaches to the identification of non-referential expressions have been implemented, there is actually an explicit word with some grammatical information (a third person pronoun) in the text, which is non-referential (Mitkov, 2010). By contrast, in Spanish, non-referential expressions are not realised by expletive or pleonastic pronouns but by a certain kind of ellipsis. For this reason, it is easy to wrongly identify them as zero pronouns, which are referential. For example, pleonastic pronouns such as:

(a.1) (It)1 must be stated that Oskar behaved impeccably.
(b.1) (It) rains, (Il) pleut, (Es) regnet.
(c.1) (It)’s three o’clock.

are all elided in Spanish, resulting in the following non-referential impersonal constructions:

(a.2) Se dice que Oscar se comportó impecablemente.
(b.2) Llueve.
(c.2) Son las tres en punto.

A sizable proportion of the false positives obtained in previous work on identifying zero pronouns were caused by such non-referential impersonal constructions (Rello & Illisei, 2009b). Ferrández & Peral (2000) noted that an inability to identify verbs used in impersonal constructions has a negative effect on the performance of their anaphora resolution algorithm2, while in Recasens & Hovy (2009, p.
41) the need for a tool to identify ellipsis is observed: “In contrast with previous work, many of the features relied on gold standard annotations, pointing out the need for automatic tools for ellipticals detection and deep parsing.”

Four kinds of approach to the identification of non-referential expressions have been implemented and described in the literature:

– Rule-based approaches (Danlos, 2005; Denber, 1998; Lappin & Leass, 1994; Paice & Husk, 1987);
– Machine learning approaches (Bergsma et al., 2008; Boyd et al., 2005; Clemente et al., 2004; Evans, 2000, 2001; Mitkov et al., 2002; Müller, 2006; Ng & Cardie, 2002);
– A web-based approach (Li et al., 2009); and
– Descriptive studies from contextual (Lambrecht, 2001) and intonational points of view (Gundel et al., 2005).

1 In this work explicit subjects in the examples are presented in italics, zero pronouns in the examples are presented by the symbol Ø, while in the English translations the subjects which are elided in Spanish are marked with parentheses. Impersonal constructions in the examples are not explicitly indicated using a symbol (see Section 3.1).
2 The other two reasons given for the low success rate in the identification of verbs with no subject are the lack of semantic information and the inaccuracy of the grammar used (Ferrández & Peral, 2000).

Paice & Husk (1987) introduce a rule-based method for identifying non-referential it, while Lappin & Leass (1994) and Denber (1998) describe rule-based components of their pronoun resolution systems which detect non-referential uses of it. Mitkov’s first anaphora resolution algorithm did not incorporate an approach for detecting pleonastic it (Mitkov, 1998), while, in more recent versions, mars (Mitkov’s Anaphora Resolution System) uses the system of Evans (2001) to detect pleonastic it, and machine learning (Mitkov et al., 2002). Instance-based learning approaches are used for identifying pleonastic it in English, while the only approach for the identification of expletive pronouns in French employs a rule-based methodology (Danlos, 2005). Evans (2001)1 describes the first attempt using a machine learning method to classify pleonastic it into seven types, while Boyd et al. (2005) present a linguistically motivated classification of non-referential it into four types. A comparison replicating the approaches developed by Paice & Husk (1987) and Evans (2001) with the system implemented by Boyd et al. (2005) corroborates the finding that machine learning outperforms rule-based approaches (Boyd et al., 2005). Further, it is pointed out that rule-based methods are limited due to their reliance on lists of verbs and adjectives commonly used in the patterns that they exploit, which can make them less portable and more difficult to adapt to new texts. Nevertheless, the basic grammatical patterns are still reasonably consistent indicators of non-referential occurrences of it (Boyd et al., 2005). Certain aspects of the work described in this dissertation were inspired by the methodology of the machine learning approaches for the identification of pleonastic it, specifically those of Evans (2001) and Boyd et al. (2005).

1 This method is currently incorporated as a component of mars (Mitkov et al., 2002).

Due to the fact that the occurrence of non-referential zero pronouns is not very common1, the size of our corpus was increased in order to achieve a sufficient number of instances for each class.
The training data exploited by the Elliphant system contains 6,827 instances, of which 179 are non-referential examples. In Evans (2001) 3,171 instances of it were classified into seven classes, while in Boyd et al. (2005) 2,337 examples were classified into four classes. Our corpus was analyzed, as in the approach described by Evans (2001), using a functional dependency parser, Connexor’s Machinese Syntax2 (Connexor Oy, 2006b; Tapanainen & Järvinen, 1997). Moreover, some of the features used in the Elliphant system, such as the consideration of the lemmas and the parts of speech (POS) of the preceding and following material, were also implemented in the approach of Evans (2001). In contrast to previous work, the K* algorithm (Cleary & Trigg, 1995) was found to provide the most accurate classification in the current study. Other approaches have employed various classification algorithms, including K-nearest neighbors in TiMBL (Boyd et al., 2005; Evans, 2001) and JRip in Weka (Müller, 2006).

1 Only 3% of the verbs found in our corpus (see Section 3.2.1) have non-referential elliptic subjects.
2 http://www.connexor.eu/technology/machinese/demo/syntax/.

2.2 Linguistic Approaches

Literature related to ellipsis in linguistic theory has served as one basis for establishing the linguistically motivated classes and the annotation criteria in the current work. This linguistic work is focused on the definition and description of the use of ellipsis in natural language and the limits of that use.

In Spanish, the use of ellipsis is very widespread. It is a phenomenon that occurs in a wide range of contexts and is therefore much discussed in the field of linguistics. To illustrate, some controversial topics in linguistics that pertain to instances of ellipsis found in our corpus include: the establishment of different types of ellipsis, the identification of impersonal sentences (non-referential expressions), the definition of particular syntactic categories which can function as subjects, and the intricate differentiation of the reflex passive with an elliptic subject from impersonal sentences in different varieties of Spanish.

The concepts used in both types of literature (nlp and linguistic) to distinguish different types of ellipsis and zero signs are extremely broad and are well debated in the linguistic literature. Elements of the elliptic typology used in this work which were derived from the literature are stated next, while the linguistic and formal criteria used to identify the chosen classes, which served as the basis for the corpus annotation, including a typology of the examples found, are explained in Sections 3.1.1, 3.1.2, 3.1.3 and 3.2.2.

2.2.1 Linguistic Approaches to Subject Ellipsis

The study of the omission of some element from the sentence or the discourse in natural language has been a challenge not only in computing but also in Spanish linguistics itself –from the Renaissance period through to the present day. The first Western grammarian who treated ellipsis as a grammatical phenomenon (Hernández Terrés, 1984) was Francisco Sánchez de las Brozas, El Brocense (1523–1600) (Sánchez de las Brozas, [1562] 1976, p. 317), who took the concept of ellipsis from Apolonio Díscolo (Díscolo, [2nd century] 1987) and defined it as: “La elipsis es la falta de una palabra o de varias en una construcción correcta [...].”
“Ellipsis is the omission of one or more items from a correct construction [...].”

This conception, in which grammar serves as a basis for a rational explanation of the surface form of the language,

“No hay, pues, ninguna duda de que se debe buscar la explicación racional de la cosas, también de las palabras.” Sánchez de las Brozas ([1562] 1976) cited in García Jurado (2007, p. 12)
“There is no doubt, then, that a rational explanation of things, and also of words, must be sought.”

later inspired the rational grammar of Port-Royal (Lancelot & Arnauld, [1660] 1980), which was a precursor of Chomsky’s work (Chomsky, [1968] 2006, p. 5): “One, particularly crucial in the present context, is the very great interest in the potentialities and capacities of automata, a problem that intrigued the seventeenth-century mind as fully as it does our own. [...] A similar realisation lies at the base of Cartesian philosophy.”

In order to elide something, a meaning which is not expressed needs to be assumed. It thus follows that ellipsis itself was one of the basic mechanisms used to explain the transition from D-Structure to S-Structure, becoming a central issue (Brucart, 1987) in generative grammar from its original model, the Standard Theory (Chomsky, 1965), to its latest revisions (Chomsky, 1995).

Different branches of linguistics have considered ellipsis from different points of view:

– Semantic: traditionally, the criteria used to define ellipsis were semantic or logical (Bello, [1847] 1981) and prescriptive (Real Academia Española, 2001);
– Descriptive and explicative (Brucart, 1999);
– Distributional: although structuralism rejected the study of units which were not codified in the signifier or phonetic realization, some classifications of ellipsis were presented (Francis, 1958; Fries, 1940);
– Pragmatic: in diverse pragmatic paradigms the role of ellipsis is crucial as it influences the interpretation of text. As a result it has given rise to several lines of investigation, such as implications through ellipsis (Grice, 1975), ellipsis studied as a factor to activate textual coherence (Halliday & Hasan, 1976), or indefinite ellipsis in which a word can stand for one or more sentences in a restrictive code (Shopen, 1973); and
– Cognitive: in terms of ellipsis processing by the brain (Streb et al., 2004, p. 175), “Ellipses and pronouns/proper names are processed by distinct mechanisms being implemented in distinct cortical cell assemblies.”, or as part of the explanation of the language faculty (Chomsky, 1965).

The terminology and linguistic explanations relevant for this work consider both zero pronouns and non-referential expressions to be different types of ellipsis (Brucart, 1999). Four kinds of Spanish subject ellipsis are distinguished (Brucart, 1999, p. 2851). This classification is presented in correlation with a verb classification (Real Academia Española, 2009), which is related to the omitted subject classification presented in Bosque (1989).

The classification of Spanish omitted subjects presented in Bosque (1989) is: omitted subjects of finite verbs, which can be referential and non-referential, and omitted subjects of non-finite verbs, which can be argumental and non-argumental. The argumental omitted subjects can in turn be referential and non-referential.
In that study non-argumental omitted subjects are claimed not to exist (Bosque, 1989), although in Brucart (1999) non-argumental omitted subjects are considered a type of ellipsis (Type 4 in Figure 2.1).

Figure 2.1: Types of subject ellipsis (Brucart, 1999) and types of verbs (Real Academia Española, 2009).
(1) Omitted subject in a clause containing a finite verb – verb with an argumental omitted subject with a specific interpretation: Ø No vendrán “(They) won’t come”; or with an unspecific interpretation: Ø Dicen que vendrá “(They) say he will come / It is said he will come”.
(2) Argumental impersonal subject – verb with an argumental omitted subject which is represented by the pronoun se: En este estudio Ø se trabaja bien “In this room (one) can work properly”.
(3) Non-argumental impersonal subject – verb with no argumental subject: Ø Nieva “(It) is snowing”.
(4) Omitted subject in a non-finite verb clause: Juan intentaba (Ø decírselo a María) “John tried (John to tell Mary)”.

The first type of ellipsis (see (1) in Figure 2.1) represents omitted subjects and corresponds to zero pronouns in the nlp literature. An omitted subject is the result of nominal ellipsis where a non-phonetically/orthographically realized lexical element –the omitted subject– which is needed for the interpretation of the meaning and the structure of the sentence, is omitted since it can be retrieved from its context (Brucart, 1999). Despite their lack of phonetic realization, omitted subjects are part of the clause (Real Academia Española, 2009).
While recent approaches in linguistics agree that the omitted subject has a pronominal nature (an elided pronoun), others contend that the subject is expressed in the morphology of the verb inflection. In Generative Grammar subject ellipsis has been understood as (1) a pro-form (Beavers & Sag, 2004; Chung et al., 1995; Fiengo & May, 1994; Wilder, 1997) or as (2) a syntactic realization without a phonetic constituent (Merchant, 2001; Ross, 1967). The Meaning-Text Theory (mtt) contends that ellipsis occurs in the SSyntS (surface syntax) when the elliptic element is deleted during the transition from SSyntS to DMorphS (deep morphology) (or vice versa), and an empty node stands in for the representation of the elliptic element. This procedure for treating ellipses is also proposed in the mtt for the description of all coordinate structures (Mel’čuk, 2003).

The identification of omitted subjects is not problematic when the zero pronoun belongs to the first or second person, but when it is a third person omitted subject the reference can be anaphoric or cataphoric (Type 1 ellipsis in Figure 2.1) or non-specific1. A generic or non-specific interpretation can follow in some clauses with singular second person and plural third person zero pronouns (Real Academia Española, 2009). However, depending on discourse knowledge, there can be alternation between specific and non-specific interpretations in clauses which are formally equal, as the next example shows:

(e) Ø Me han regalado un reloj. (In this example both interpretations, specific and non-specific, are possible.)
(1) (They) gave me a watch. (When the agent referred to by “they” has been mentioned previously in the discourse.)
(2) (I) was given a watch. (When no agent has been mentioned previously in the discourse.)

where the non-specific interpretation does not exclude a possible specific one (Real Academia Española, 2009). Therefore, both groups of argumental subjects with specific and non-specific interpretations are included in the same class.

1 In journalistic headlines with an omitted subject, a non-specific interpretation can occur (Bosque, 1989) even in non-pro-drop languages such as English, French or German (Real Academia Española, 2009). Such non-specific interpretations can occur when the antecedent or referent was not previously mentioned in the discourse.

2.2.2 Linguistic Approaches to Non-referential Ellipsis

On the other hand, Type 2 and Type 3 ellipsis listed in Figure 2.1 correspond to non-referential expressions or impersonal sentences. Type 2 ellipsis is composed of impersonal sentences containing the Spanish particle se, whose argumental omitted subject always has an unspecific interpretation and is referred to using the pronoun se (Mendikoetxea, 1994). Type 3 ellipsis corresponds to the set of sentences called impersonal sentences. Although the types of impersonal constructions in Spanish are heterogeneous, all of them share a lack of some properties of the subject (Fernández Soriano & Táboas Baylín, 1999). Some studies consider different kinds of Spanish impersonality, e.g. semantic and syntactic impersonality (Gómez Torrego, 1992), while others distinguish several semantic degrees of impersonality (Mendikoetxea, 1999).

Traditionally –from a semantic point of view– impersonal sentences have been considered to be those which cannot contain a subject, the agent of the action described (Real Academia Española, 1977).
This impersonality can be due either to the nature of the verb,

(f) Llueve. (It) rains.

or to the speaker’s ignorance of the subject (Seco, 1988):

(g) Llaman a la puerta. (Someone) is knocking at the door.

where the subject is unidentified and it is therefore impossible to assign a reference to it (Bello, [1847] 1981).

The controversy of treating non-referential expressions as a type of ellipsis, given that they cannot be lexically retrieved, has already been discussed (Gómez Torrego, 1992). While Brucart (1999) considers them a case of ellipsis, as do some Generative Grammar approaches1, others (Bosque, 1989; Mel’čuk, 2006)2 consider that such elliptic and non-referential subjects do not exist in language. A descriptive point of view (Fernández Soriano & Táboas Baylín, 1999) would regard impersonal sentences as belonging to either of two main groups: (1) impersonal sentences without a subject and (2) cases of impersonal verbs with the inherent feature of not having a subject. In the current dissertation, a prescriptive and descriptive approach (Real Academia Española, 2009) to the consideration of impersonal sentences is taken (see Section 3.1.3).

Type 4 ellipsis (Brucart, 1999) in Figure 2.1 is ignored in our work. However, this fourth type is much debated in the literature; for example, Head-Driven Phrase Structure Grammar does not consider the infinitive subject as a null category (slash), nor do Pollard and Sag in their work (Pollard & Sag, 1994).

1 Generative Grammar explains these impersonal sentences by labeling the absence of the subject with a pro-form which presents the same syntactic features as the subject although it has no phonological realization. Following the Extended Projection Principle this pro-form embodies all the syntactic requirements of a subject except for its phonological realization (Chomsky, 1981).
2 mtt uses the concept of the zero sign to characterize elements whose signifier is empty and is by no means realized as a perceptible phonetic pause (Mel’čuk, 2006).

Chapter 3 Detecting Ellipsis in Spanish

This chapter describes the methodology used in this study. The first step is to create a linguistically motivated classification system (Section 3.1) for all instances of elliptic and non-elliptic as well as referential and non-referential subjects. Since the machine learning method requires training data, a corpus (the eszic Corpus) was compiled (see Section 3.2.1) and a purpose built tool for its annotation was developed, as were guidelines (see Section 3.2.2). The third task consisted of implementing a method to extract the features (Section 3.2.3) of instances from the corpus and create training data (the eszic training data; see Section 3.2.4). Finally, once the features of instances are derived from a document, they are exploited for classification by machine learning using the Weka package (Section 3.2.5).

3.1 Classification

The first step is to create a classification system for all instances of subjects and impersonal constructions. The groups into which the subjects were divided were labeled elliptic and non-elliptic subjects as well as referential and non-referential subjects. These two labels result in a ternary classification:

(1) Explicit subjects: non-elliptic and referential1;
(2) Zero pronouns: elliptic and referential2; and

1 Explicit subjects in the examples are presented in italics.
2 Zero pronouns in the examples are presented by the symbol Ø.
In the English translations the subjects which are elided in Spanish are marked with parentheses.

(3) Impersonal constructions: elliptic and non-referential1.

1 Impersonal constructions in the examples are not explicitly indicated using a symbol.

A subject can be non-elliptic (explicit) or elliptic (omitted subject or zero pronoun). A sign can be referential or non-referential. The distinction lies in the fact that, while the former can be lexically retrieved, the latter cannot (impersonal construction). This treatment of the classification as ternary differs from previous work whose division of subjects was binary: elliptic (zero pronoun) and non-elliptic, both referential (Ferrández & Peral, 2000; Rello & Illisei, 2009b) (see Section 2.1.1). In Evans (2001) the seven-fold classification of pleonastic it is based on the type of referent, while in Boyd et al. (2005) classification follows syntactic and semantic criteria (see Section 2.1.2). In the following sections, each class is described. With regard to cases in which classification can be controversial, different annotation criteria were applied (see Section 3.2.2).

3.1.1 Explicit Subjects: Non-elliptic and Referential

This class is the one to which explicit subjects belong. They are phonetically realised, usually by a nominal group: a noun, pronoun, noun phrase (a), free relative, semi-free relative or substantival adjective (Real Academia Española, 2009).

(a) Las fuentes del ordenamiento jurídico español son la ley, la costumbre y los principios generales del derecho.
The sources of the Spanish legal system are the law, the judicial custom and the general principles of law2.

2 Unless otherwise specified, all the examples provided are taken from our corpus (Section 3.2.1).

The syntactic position of subjects can be pre-verbal or post-verbal. The occurrence of post-verbal subjects is restricted by some conditions (Real Academia Española, 2009).

(b) Carecerán de validez las disposiciones que contradigan otra de rango superior.
The dispositions which contradict those of a higher rank will not be valid.

Post-verbal subjects, as well as pre-verbal ones, are also found in passive constructions and passive reflex constructions. As in active clauses, pre-verbal subjects without a definite article are rare while post-verbal subjects without a definite article are more frequent (Real Academia Española, 2009). Projections of non-nominal categories, such as clauses containing an infinitive or a conjugated verb, interrogative indirect clauses, or indirect exclamative clauses, can function as subjects (Real Academia Española, 2009).

(c) Corresponde a los poderes públicos promover las condiciones para que la libertad y la igualdad del individuo y de los grupos en que se integra sean reales y efectivas.
It corresponds to the public powers to promote the conditions so that individual and group liberty and equality are real and effective.

3.1.2 Zero Pronouns: Elliptic and Referential

Class 2 is formed by elliptic but referential subjects called zero pronouns. An elliptic subject is the result of a nominal ellipsis, where a non-phonetically realised lexical element –the elliptic subject– which is needed for the interpretation of the meaning and the structure of the sentence, is omitted since it can be retrieved from its context (Brucart, 1999). Despite their lack of phonetic realisation, elliptic subjects are considered part of the clause (Real Academia Española, 2009).
(d) La Constitución Españolai (title in text) Øi Fue refrendada por el pueblo español el 6 de diciembre de 1978.
The Spanish Constitutioni (title in text) (It)i was ratified by the Spanish people on the 6th of December 1978.

Elliptic subjects are considered to be a personal pronoun variant which is not phonetically realised (Real Academia Española, 2009). Where referential, they can be lexically retrieved (Gómez Torrego, 1992). That is to say that they can be substituted by explicit pronouns without changing or losing any of the meaning of the clauses in which they occur.

The elision of the subject can affect not only the noun head, but also the entire noun phrase (Brucart, 1999). The noun head can be omitted in Spanish when the subject of which it is a part fulfills some structural requirements (Brucart, 1999). This includes cases in which the subject is referential (Brucart, 1999). The processing of these subjects has been addressed by the development of specific algorithms in previous work (Ferrández et al., 1997). Ellipsis of the head of the noun phrase is only possible when a definite article occurs.

(e) El Ø que está obsesionado con que todo el mundo piensa mal es Javier.
The (one) who is obsessed with everyone thinking badly is Javier.

The article possesses a referential value which could be either anaphoric or cataphoric (Real Academia Española, 2009). Such examples of subjects with an elided head are instances of semi-free relatives (Real Academia Española, 2009) and, as expected, they are not as frequent in our corpus as elisions of the entire subject noun phrase.

3.1.3 Impersonal Constructions: Elliptic and Non-referential

Impersonal constructions have no subject: a subject that would be both non-referential and elliptic does not exist (Bosque, 1989)1. The appearance of clauses containing zero pronouns and of impersonal constructions is similar. Class 3 is composed of impersonal constructions, which are formed by (1) impersonal and (2) reflex impersonal clauses (impersonal clauses with se).

1 The existence of a non-phonetically realised element in subject position is postulated (see Section 2.2). While Generative Grammar defends their existence (pro-form), mtt does not (zero sign).

Impersonal clauses have no argumental subject. Since the subject does not exist, it cannot be lexically retrieved by any means and no phonetic realisation of it can be expected (Bosque, 1989). The following cases are considered to be impersonal sentences (Real Academia Española, 2009):

– Non-reflex impersonal clauses denoting natural phenomena describing meteorological situations:
(f) Nieva. (It) snows.
– Non-reflex impersonal clauses with the verbs haber (to be), hacer (to do), ser (to be), estar (to be)2, ir (to go) and dar (to give):
(g) En un kilogramo de gas hay tanta materia como en un kilogramo de sólido.
In a kilogram of gas (there) is the same amount of mass as in a kilogram of solid. (Existential use of the verb haber.)

2 Depending on the verbal aspect, there are different Spanish verbs which correspond to the English verb to be.

– Non-reflex impersonal clauses with other verbs such as sobrar con (to be too much), bastar con (to be enough) or faltar con (to have lack of), or the pronominal unipersonal verb1 with subject zero such as tratarse de (to be about):
(h) Deberán adoptar las precauciones necesarias para su seguridad, especialmente cuando se trate de niños.
Necessary measures should be taken, specially when (it) is about children. (i) Basta con tres sesiones. (It) is enough with three sessions. Verbs in such impersonal sentences (Gómez Torrego, 1992), are called lexical impersonal verbs (Real Academia Española, 2009). Due to their lack of subject they are not easily distinguished from verbs with omitted –but existing– subjects. Secondly, reflex impersonal clauses have an omitted subject whose reference is nonspecific and cannot be lexically retrieved. (j) Se estará a lo que establece el apartado siguiente. (It) will be what is established in the next section These clauses are formed with the particle se. This particle also serves other syntactic functions (reflexive pronoun, pronominal pronoun, reciprocal pronoun, etc.) in clauses with an elided subject. 3.2 Machine Learning Approach Our corpus was compiled and parsed in order to create training data (referred to as the eszic training data) for use by a machine learning classification method as explained in the next section. A tool was developed for annotation of the corpus (see Section 3.2.2). Fourteen features were proposed for the purpose of classifying instances of subjects (see Section 1 A verb which is only conjugated in the third person. 25 3.2 Machine Learning Approach 3. Detecting Ellipsis in Spanish 3.2.3). The feature vectors, together with their manual classifications, were written to a training file. A method for obtaining the values of those features for each instance was implemented. The classification algorithm employed was the K* instance-based learner available in the Weka package (Witten & Frank, 2005) (see Section 3.2.5). 3.2.1 Building the Training Data The eszic training data used by the Elliphant system is obtained from the eszic corpus created ad hoc. The corpus is named after its annotated content “Explicit Subjects, Zero-pronouns and Impersonal Constructions”. The corpus contains a total of 79,615 words (titles and sentences that do not contain at least one finite verb are ignored), including 6,825 finite verbs. Of these verbs, 71% have an explicit subject, 26% have a zero pronoun and 3% belong to an impersonal construction. There is an average of 2.3 clauses per sentence with 11.7 words per clause and 26.9 words per sentence. The corpus compiled to extract the training data is composed of seventeen documents, originally written in Spanish, and belonging to two genres: legal and health. The legal texts1 are composed of laws taken from the: (1) Spanish Constitution (whole text) (Constitución Española, 1978), (2) Laws on Unfair Competition (whole text) (Ley 3/1991, 1991), (3) Penal Code (first book) (Ley Orgánica 10/1995, 1995), (4) Law for Administrative-contentious Jurisdiction (title 1, articles 1 to 17) (Ley 29/1998, 1998), (5) Civil Code (first book, until title V) (Código Civil, 1889), (6) Law for Universities (introduction) (Ley Orgánica 6/2001, 2001), (7) Law for Associations (chapter 1) (Ley Orgánica 1/2002, 2002) and (8) Law for Advertisements (whole text) (Ley 29/2005, 2005). The nine health texts are taken from psychiatric papers compiled from a Spanish digital journal of psychiatry Psiquiatrı́a.com 2 : (1) Cinema as a tool for teaching personality disorders (López Ortega, 2009), (2) Efficacy, functionality, and empowerment for phobic pathology treatment, in the context of specialised public Mental Health Services (Garcı́a Losa, 2008), (3) Emotions in Psychiatry (Sevillano Arroyo & Ducret Rossier, 2008), (4) And what about siblings? 
How to help TLP3 siblings 1 All the legal texts are available online at: http://noticias.juridicas.com/base_datos/ The full-text articles from Psiquiatrı́a.com Journal are available online at: http://www. psiquiatria.com/. 3 Trastorno lı́mite de la personalidad (Borderline Personality Disorder). 2 26 3.2 Machine Learning Approach 3. Detecting Ellipsis in Spanish eszic Corpus Legal text 1 Legal text 2 Legal text 3 Legal text 4 Legal text 5 Legal text 6 Legal text 7 Legal text 8 Health text 1 Health text 2 Health text 3 Health text 4 Health text 5 Health text 6 Health text 7 Health text 8 Health text 9 Total Number of Tokens Number of Sentences Number of Clauses 9,972 1,147 17,960 3,578 12,456 3,962 2,159 5,219 2,753 11,339 1,854 1,937 2,183 1,568 1,296 1,687 12,441 93,511 941 47 1,035 189 746 130 131 291 110 658 47 84 93 63 69 53 525 5,212 600 56 1,181 191 891 219 136 282 270 1,028 140 124 148 210 89 127 1,394 7,086 Table 3.1: eszic Corpus: tokens, sentences and clauses. 27 3. Detecting Ellipsis in Spanish 3.2 Machine Learning Approach (Molina López, 2008), (5) Factorial analysis of personal attitudes in secondary education (Pintor Garcı́a, 2007), (6) The influence of the concept of self and social competence in children’s depression (Aldea Muñoz, 2006), (7) Depression as a mental health problem in Mexican teenagers (Balcázar Nava et al., 2005), (8) Relationship difficulties in couples (Dı́az Morfa, 2004), and (9) A case of psychological intervention for children’s depression (Aldea Muñoz, 2003). Table 3.2 presents the number of instances found in the eszic corpus by class. Two columns illustrate the number of instances by genre (legal and health) within the corpus. Number of instances per class Explicit subjects Zero pronouns Impersonal constructions Total Legal eszic Corpus Health eszic Corpus eszic Corpus 2,739 619 71 3,429 2,116 1,174 108 3,398 4,855 1,793 179 6,827 Table 3.2: eszic Corpus: number of instances per class. The text containing instances to be classified was analysed using Connexor’s Machinese Syntax (Järvinen & Tapanainen, 1998; Järvinen et al., 2004; Tapanainen & Järvinen, 1997)1 . This dependency parser returns information on the pos and morphological lemma of words in a text, as well as returning the dependency relations between those words. The parsing system employed uses Functional Dependency Grammar (FDG) (Järvinen & Tapanainen, 1998; Tapanainen & Järvinen, 1997) and combines (Järvinen et al., 2004) a lexicon and a morphological disambiguator based on constraint grammar (Tapanainen, 1996). When performing fully automatic parsing it is necessary to address word-order phenomena. The formalism used in the parser is capable of referring simultaneously both to the order in which syntactic dependencies apply and to linear order. This feature is an extension of Tesnière’s theory (Tesnière, 1959), which does not formalise linearisation. In the parsed output the linear order is preserved while the structural order requires that functional information is not coded in the canonical order 1 A demo of Connexor’s Machinese Syntax is available at: http://www.connexor.eu/technology/ machinese/. 28 3. Detecting Ellipsis in Spanish 3.2 Machine Learning Approach of the dependents. The functional information is represented explicitly using arcs with labels of syntactic functions as shown in Figure 3.1 (Järvinen et al., 2004). Figure 3.1: An example of the output of the Connexor’s Machinese Syntax parser for Spanish. 
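To make the parser output concrete, the following short Python sketch (a hypothetical helper, not one of the purpose-built tools of Section 3.2.4) reads a single token element laid out as in example (k) of Section 3.2.2 and collects the lexical, morphological and dependency fields that are exploited later as feature sources.

import xml.etree.ElementTree as ET

# One token element in the layout illustrated in example (k), Section 3.2.2.
SAMPLE = ('<token id="w53"><text>entró</text><lemma>entrar</lemma>'
          '<depend head="w51">mod</depend><tags><syntax>@MAIN</syntax>'
          '<morpho>V IND PRET SG P3</morpho></tags></token>')

def read_token(xml_string):
    """Return the pieces of parser output used later as feature sources:
    surface form, lemma, dependency head and relation, syntactic tag,
    and the list of morphological tags."""
    tok = ET.fromstring(xml_string)
    depend = tok.find('depend')
    return {
        'id': tok.get('id'),
        'text': tok.findtext('text', '').strip(),
        'lemma': tok.findtext('lemma', '').strip(),
        'head': depend.get('head') if depend is not None else None,
        'deprel': depend.text.strip() if depend is not None else None,
        'syntax': (tok.findtext('tags/syntax') or '').strip(),
        'morpho': (tok.findtext('tags/morpho') or '').split(),
    }

print(read_token(SAMPLE))
# {'id': 'w53', 'text': 'entró', 'lemma': 'entrar', 'head': 'w51', 'deprel': 'mod',
#  'syntax': '@MAIN', 'morpho': ['V', 'IND', 'PRET', 'SG', 'P3']}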
The dependency information allows the identification of complex constituents in a text. For example, complex noun phrases can be identified by transitively grouping together all the words dependent on a noun head (Evans, 2001). Additional software was implemented to perform this and allow identification of clauses and noun phrases which are required for implementation of some of the features used in our classification (see Section 3.2.4). The eszic training data makes use of the three types of information returned by Connexor’s Machinese Syntax parser (Connexor Oy, 2006a,b): 1. morphological tags generated for verbs –singular (SG), third person (3P), indicative (IND), among many others– including the pos tags –verb (V), noun (N), preposition (PREP), etc.–; 2. syntactic tags –main element (@MAIN), nominal head (@NH), auxiliary verb (@AUX), etc.–; and 29 3. Detecting Ellipsis in Spanish 3.2 Machine Learning Approach 3. syntactic relations –subject (subj), verb chained (v-ch), determiner (det)–. The lexical information (LEMMA) given by the parser was also taken into consideration in the set of features. 3.2.2 Annotation Software and Annotation Guidelines A program was written in Python (see Figure 3.2) to extract all occurrences of finite verbs from the eszic Corpus and to assign to each the vector of feature values described in Section 3.1. Two annotators were presented with the clause in which each verb appears and prompted to classify the verb into one of thirteen classes. Figure 3.2: Screenshot of the annotation program interface. Although the goal is to develop training data for a classifier making a ternary classification of the subject position elements, an annotation scheme which gives more 30 3. Detecting Ellipsis in Spanish 3.2 Machine Learning Approach detail about each instance was used. This annotation scheme was used with a dual purpose: to get the most from the annotation task since the instances occur in a broad number of constructions and because a more detailed annotation could be useful in future work. The thirteen classes are grouped into the three types: (1) explicit subjects, (2) zero pronouns or (3) impersonal constructions. In Table 3.3, the linguistic motivation for each of the annotated classes is shown in correlation with the types to which they belong. From each annotation class, in addition to the two criteria that are crucial for this study –elliptic vs. non-elliptic and referential vs. non-referential– a combination of syntactic, semantic and discourse knowledge can also be encoded during the annotation. This knowledge includes information about whether the subject is nominal or non-nominal, whether it is an active or a passive subject or whether the subject refers to an active participant in the action, state or process denoted by the verb. The annotation program extracts from the parsed eszic Corpus the clause in which each finite verb occurs. As Connexor’s Machinese Syntax parser does not explicitly perform clause splitting but only sentence splitting, a method was developed to accomplish the clause identification task. The method identifies the finite verbs in the corpus and transitively groups together the words directly and indirectly dependent upon them1 . The identified clauses are then presented to the annotators who are asked to label the verb. For each verb classified by an annotator, an xml tag (i.e. <subject>ZERO</subject>) with its class is added in the token line of the parsed eszic Corpus where the verb occurs. 
An example (k) of an annotated verb whose subject is a zero pronoun follows: (k) <token id="w53"><text>entró </text><lemma>entrar </lemma> <depend head="w51">mod </depend><tags><syntax>@MAIN </syntax><morpho>V IND PRET SG P3 </morpho><subject>ZERO </subject> </tags></token> This manual classification, together with the features (see Section 3.2.3) are written to the eszic training file. 1 A clause splitter module was implemented to extract the features from the eszic Corpus (see Section 3.2.4). 31 3.2 Machine Learning Approach 3. Detecting Ellipsis in Spanish eszic Corpus Annotation Tags Linguistic Phonetic Realization Syntactic Verbal cateDiathegory sis Semantic interpretation Disclosure Elliptic noun phrase Elliptic noun phrase head Nominal subject Active participant Referential subject sub- – – + + + + Reflex passive – – + + – + – – + – – + + – + + + + – + + + + + – – – + + + + – + + – + – + + + – + – – – + – + + – + – – + – – – – – + – – n/a – n/a – – – n/a + n/a – information Elliphant Classes Linguistic characteristics Class 1 Explicit Active ject Explicit subject subject Passive subject Omitted subject Omitted subject head Non-nominal subject Class 2 Reflex passive omitted sub- ject Zero pronoun Reflex passive omitted sub- ject head Reflex passive non-nominal subject Passive omitted subject Passive non-nominal subject Class 3 Reflex impersonal clause Impersonal (with se) construction Impersonal construction (without se) Table 3.3: eszic Corpus annotation tags. 32 3. Detecting Ellipsis in Spanish 3.2 Machine Learning Approach Annotating explicit and elliptic subjects as well as impersonal constructions in Spanish is not a trivial task. Guidelines were established for the annotation of borderline instances whose classification is a frequent source of disagreement between annotators. The following text presents some of these borderline cases that belong to the three types of finite verb classes, together with the criteria adopted for their annotation. When distinguishing explicit subjects, in addition to nouns, there are other syntactic categories which may arguably function as heads of subjects. In the case of adverbial and prepositional categories, it was decided that they should be considered subjects if they can be focalised (Real Academia Española, 2009). (`) De acuerdo con la Organización Mundial de la Salud, la depresión ocupa el cuarto lugar entre las enfermedades más incapacitantes y aproximadamente de 100 a 200 millones de personas la padecen. According to the International Health Organization, depression is ranked as the fourth illness which causes more invalidity and approximately from 100 to 200 million people suffer from it. While conditional clauses could be considered subjects, in this work an alternative analysis is followed. Under this approach, a sentence with a conditional clause functioning as subject is considered to contain a zero pronoun, as its elliptic subject can be retrieved from the preceding discourse (Real Academia Española, 2009). Nevertheless, no examples were found of conditional clauses functioning as subjects in the eszic corpus used in this dissertation. The correct classification of zero pronouns is also a source of disagreement between annotators as it may be argued that some instances with postponed non-nominal subjects (see example (m) below) should be interpreted as cataphoric zero pronouns. In contrast to anaphora, in cataphora the cataphoric expression is situated before the nominal group to which it points (Real Academia Española, 2009). 
Tanaka (2000) and Mitkov (2002) point out that there is some scepticism about the concept of cataphora in the NLP literature. For example, Kuno (1972) asserts that there is no genuine cataphora in its literal sense, as the referent of a seemingly cataphoric pronoun must already be mentioned in the preceding discourse and, therefore, is predictable when a reader encounters the pronoun. This viewpoint was refuted by Carden (1982) and Tanaka (2000) who describe empirical data which shows cases of genuine cataphora 33 3. Detecting Ellipsis in Spanish 3.2 Machine Learning Approach where the pronoun is the first mention of its referent in the discourse (Carden, 1982; Tanaka, 2000). Although some examples of genuine cataphora were found in their corpus (Tanaka, 2000), none were found in the eszic Corpus except for occurrences of the elision of noun heads where the antecedent is postponed, as in example (e). The annotation guidelines developed for the current work considered these cases which involve postponed clauses as non-nominal subjects. (m) Artı́culo 46. No pueden contraer matrimonio: Los menores de edad no emancipados. Los que estén ligados con vı́nculo matrimonial. Article 46. (They) cannot get married: The non-emancipated minors. The ones which are already married. Finally, the borderline cases in impersonal constructions are debated in Spanish. The decision of how to classify reflex impersonal clauses containing se is frequently a difficult one to make due to the ambiguity of these instances. For example, in the sentence Se secaron (see example (n) below), the particle se has four possible semantic interpretations in Spanish (Real Academia Española, 2009). In these cases, the decision taken by the annotator depends on the meaning given by the context. (n) Se secaron (Particle se = reflexive pronoun) (They) dried (themselves). Se secaron (Particle se = reciprocal pronoun) (They) dried (each other). Se secaron (Particle se = pronominal pronoun and there is an elliptic subject which does not have control over the action, for instance, the trees.) The trees got dried. Se secaron (Particle se = reflex passive in which the referent of the subject would have to perform the described action under their own free will, for instance, some people over an object, for instance, the clothes) (They) dried (the clothes). 34 3. Detecting Ellipsis in Spanish 3.2 Machine Learning Approach There can be ambiguity between reflex passives containing a zero pronoun and impersonal constructions in which the object is not human (o). (o) Se firmará el acuerdo. Ø will sign the agreement. In such instances, the annotation criterion followed is to annotate them as reflex passive clauses containing a zero pronoun. 3.2.3 Features Fourteen features were proposed in order to classify instances according to the types presented in Section 3.1. The values (see Table 3.4) for the features were derived from information provided both by Connexor’s Machinese Syntax (Connexor Oy, 2006b) parser, which processed the eszic Corpus, and a set of lists. An additional program was implemented in order to extract the values of features for every instance in the corpus (see Section 3.2.4). These values were used to produce a training vector for each instance. For a detailed explanation of the feature values see Section 3.2.4. For the purpose of description, it is convenient to describe each of the features as broadly belonging to one of ten classes, detailed below. 
1 PARSER: the presence or absence of a subject in the clause, as identified by the parser. It was observed (Rello & Illisei, 2009b) that the analysis returned by Connexor’s Machinese Syntax is particularly inaccurate when identifying coordinated subjects, subjects containing prepositional modifiers, and appositions occurring between commas (see example (p) below). Other common cases of parsing error involve subjects which are distant from the finite verb in the clause. Features 7 and 8 were proposed in an effort to take into consideration potential candidates for the subject. (p) La publicidad, por su propia ı́ndole, es una actividad que atraviesa las fronteras. Advertising, due to its own nature, is an activity which goes beyond boundaries. 2 CLAUSE: the clause types considered are: main clauses, relative clauses, clauses starting with a complex conjunction, clauses starting with a simple conjunction, and clauses introduced using punctuation marks (commas, semicolons, etc). A 35 3.2 Machine Learning Approach 3. Detecting Ellipsis in Spanish Feature Definition Value 1 2 3 4 5 6 Parsed subject Clause type Verb lemma Verb morphological number Verb morphological person Agreement in person, number, tense and mood True, False Main, Rel, Imp, Prop, Punct Parser’s lemma tag SG, PL P1, P2, P3 PARSER CLAUSE LEMMA NUMBER PERSON AGREE FTFF, TTTT, FFFF, TFTF, TTFF, FTFT, FTTF, TFTT, FFFT, TTTF, FFTF, TFFT, FFTT, FTTT, TFFF TTFT 7 NHPREV Previous noun phrases 8 NHTOT Total noun phrases 9 INF Infinitive 10 SE 11 A 12 POSpre Particle se Preposition a Four parts of the speech previous to the verb 13 POSpos Four parts of the speech speech following the verb 14 VERBtype Type of verb: copulative, impersonal, pronominal, transitive and intransitive Number of noun phrases previous to the verb Number of noun phrases in the clause Number of infinitives in the clause se, no True, False 292 different values combining the parser’s pos tags,i.e.: @HN, @CC, @MAIN, etc. 280 different values combining the parser’s pos tags,i.e.: @HN, @CC, @MAIN, etc. CIPX, XIXX, XXXT, XXPX, XXXI, CIXX, XXPT, XIPX, XIPT, XXXX, XIXI, CXPI, XXPI, XIPI, XXEX Table 3.4: Features: definitions and values. method was implemented to identify these different types of clause as the parser does not explicitly mark the boundaries of clauses within sentences (see Section 3.2.4) 3 LEMMA: lexical information extracted from the parser: the lemma of the finite verb. 4-5 NUMBER, PERSON: morphological information features of the verb: its grammatical number (singular or plural) and its person (first, second, or third 36 3. Detecting Ellipsis in Spanish 3.2 Machine Learning Approach person). 6 AGREE: feature which encodes the tense, mood, person, and number of the verb in the clause, and its agreement in person, number, tense, and mood with the preceding verb in the sentence and also with the main verb of the sentence. When a finite verb appears in a subordinate clause, its tense and mood can assist in recognition of these features in the verb of the main clause and help to enforce some restrictions required by this verb, especially when both verbs share the same referent as subject. 7-9 NHPREV, NHTOT, INF: the candidates for the subject of the clause are represented by the number of noun phrases in the clause that precede the verb, the total number of noun phrases in the clause, and the number of infinitive verbs in the clause. 10 SE: this is a binary feature encoding the presence or absence of the particle se in close proximity to the verb. 
When se occurs immediately before or after the verb or with a maximum of one token (see example (q) below) lying between the verb and itself, this is considered “close proximity.” (q) No podrá sacarse una ventaja indebida de la reputación de una marca. (It) is not allowed to take unfair advantage of a brand reputation. 11 A: this is a binary feature encoding the presence or absence of the preposition a in the clause. Since, the distinction between passive reflex clauses with zero pronouns and impersonal constructions sometimes relies on the appearance of preposition a (to, for, etc.). For instance, example (r) is a passive reflex clause containing a zero pronoun while example (s) is an impersonal construction. (r) Se admiten los alumnos que reúnan los requisitos. (They) accept the students who fulfill the requirements. (s) Se admite a los alumnos que reúnan los requisitos. (It) is accepted for the students who fulfill the requirements. 37 3. Detecting Ellipsis in Spanish 3.2 Machine Learning Approach 12-13 POSpre , POSpos : the pos of eight tokens, that is, the four words preceding and the four words following the instance1 14 VERBtype : the verb is classified as copulative (yes/no), as a verb with an im- personal use (yes/no), as a pronominal verb (yes/no), and as a transitive verb (yes/no/both). 3.2.4 Purpose Built Tools As training data is required in order to exploit the methods distributed in the Weka package (Witten & Frank, 2005), a method was implemented to extract the values of the previously described features for instances occurring in the eszic Corpus. For each instance (each annotated finite verb) a new line is written in the training data file with values for the fourteen features separated by commas, together with the manual classification of the vector using the standard CVS (comma separated values) format. The values of features 7-9 are numerical while the values of the remaining features are nominal (i.e. symbolic). To extract the features, ad hoc software was implemented in Python. The program exploits morphological and syntactic information, dependency relations reported by the parser, and lists of verbs grouped by their syntactic and morphological properties (e.g. transitivity, pronominal use, etc.). The method implemented includes the following purpose built tools which are described below. The description includes information on the particular features whose values are computed using the tools. 1 Clause splitter module (CLAUSE): since Connexor’s Machinese Syntax (Connexor Oy, 2006a) does not provide any information about the clause boundaries within sentences, this clause splitter module is required. Each clause is built by identifying finite verbs in a sentence and then searching for signals that indicate the boundaries of the clause (relative pronouns, conjunctions, punctuation marks, etc.). In theory, each clause could be built using dependency information given by the parser by grouping together all the words dependent on the finite verb. However, this strategy was not used in order to avoid parsing errors in the dependency information reported by the parser. Errors of this type are especially common 1 This set of features can be regarded as useful for identifying non-nominal it (Evans, 2001). 38 3. Detecting Ellipsis in Spanish 3.2 Machine Learning Approach when long sentences are parsed using Connexor’s Machinese Syntax. The Clause splitter module also identifies the type of clause in which the finite verb occurs. 
The feature attributes corresponding to the type of clause are: 1.1 Main (Main): when the finite verb belongs to the main clause. 1.2 Relative (Rel): when the finite verb belongs to a relative clause. A list of relative pronouns was used to identify this type of clause (i.e.: que (that), cuyo (whose), quien (who), etc.). 1.3 Improper conjunction (Imp): when the finite verb belongs to a clause starting with an improper conjunction. A list of improper conjunctions was used to identify the value of this attribute (i.e.: porque (because), luego (so), aunque (although), etc.). 1.4 Proper conjunction (Prop): when the finite verb belongs to a clause starting with a proper conjunction. A list of proper conjunctions was used (i.e.: y, e (and), o, u (or), ni (neither), pero (but) and sino (otherwise). 1.5 Punctuation marks (Punct): when the clause in which the finite verb occurs is preceded by a punctuation mark (‘.’, ‘,’, ‘:’, ‘;’, ‘?’, ‘!’, “”, ‘-’, ‘(’, and ‘)’ ). 2 Noun phrase module (NHPREV, NHTOT): in order to obtain the subject candidates, this module identifies and counts the noun phrases that precede and follow the finite verb in the clause. As is the case for the clause splitter, this module exploits dependency information returned by the parser (Connexor Oy, 2006a). 3 Counter (NHPREV, NHTOT, INF): this module is used to determine the total number, in the clause, of noun phrases (nhprev, nhtot) and infinitival forms (inf). 4 Tag taker (PARSER, LEMMA, NUMBER, PERSON, A, POSpre , POSpos ): these Python functions process the attributes of the XML tags output by the parser (eszic Corpus) to generate a set of features for the eszic training data. A function generates a binary value that indicates whether or not the finite verb has a dependent subject (parser). A function consults the lemma of the verb and takes it as the value for feature (lemma). Other functions exploit morphological information obtained by the parser such as the number of the 39 3. Detecting Ellipsis in Spanish 3.2 Machine Learning Approach finite verb (number), which can be either singular (SG) or plural (PL), or the morphological person of the finite verb (person) which can be first, second or third person (P1, P2, P3); Another function identifies whether the preposition a occurs in the clause (a). This information is used as the values for the features; and, finally, there is another method which obtains the pos of the four words that precede the instance in the clause ((pos)pre ) and the four words that follow it ((pos)pos ). 5 Agreement module (AGREE): this module checks whether the verb used in the clause agrees (true, T) or disagrees (false, F) in tense and mood, and in person and number with the main verb that occurs in the sentence1 and the previous verb occurring within the sentence. This agreement information is combined into one symbolic feature, such as TTTT (with respect to the verb used in the clause, the first T denotes agreement in number and person with the main verb of the sentence, the second T denotes agreement in tense and mood with the main verb of the sentence, the third T denotes agreement in number and person with the previous verb in the sentence and the fourth T denotes agreement in tense and mood with the previous verb in the sentence) or TTFF (when there is agreement in between the verb in the clause and the main sentence verb but no agreement with the previous clause verb). There are sixteen possible combinations of true (T) and false (F) values. 
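To make the encoding of this feature concrete, the following short Python sketch derives the four-letter agreement code from person/number and tense/mood values; the dictionary representation of a verb's tags is a hypothetical simplification, not the actual implementation.

def agree_code(verb, main_verb, prev_verb):
    """Combine four binary agreement checks into one symbolic value such as
    'TTTT' or 'TTFF' (cf. the AGREE feature). Each verb is represented here
    as a dict with 'person', 'number', 'tense' and 'mood' values taken from
    the parser's morphological tags."""
    def pn(a, b):   # agreement in person and number
        return a['person'] == b['person'] and a['number'] == b['number']
    def tm(a, b):   # agreement in tense and mood
        return a['tense'] == b['tense'] and a['mood'] == b['mood']

    # Order of flags: main verb (person/number, tense/mood),
    # then previous verb (person/number, tense/mood).
    flags = [pn(verb, main_verb), tm(verb, main_verb),
             pn(verb, prev_verb), tm(verb, prev_verb)]
    return ''.join('T' if f else 'F' for f in flags)

v    = {'person': 'P3', 'number': 'SG', 'tense': 'PRES', 'mood': 'IND'}
main = {'person': 'P3', 'number': 'SG', 'tense': 'PRES', 'mood': 'IND'}
prev = {'person': 'P3', 'number': 'PL', 'tense': 'PRET', 'mood': 'IND'}
print(agree_code(v, main, prev))   # -> 'TTFF'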
6 Se identifier (SE): this function identifies whether the particle se occurs in close proximity to the finite verb. Again, in this context, a distance of at most one token between the finite verb and se is considered "close proximity". The value of this feature is yes when se appears and no when it does not.

7 Verb classifier (VERBtype): this module specifies the value of four features of the finite verb that occurs in the clause. The features encode information about whether or not the verb appears in four different lists of verbs (the same instance can occur in more than one list). These four lists contain 11,060 different verb lemmas which are present in the Royal Spanish Academy Dictionary (Real Academia Española, 2001); the lists of infinitive verb forms in items 7.2-7.4 were provided by Molino de Ideas s.a. These lists (items 7.2-7.4) were built on the basis of the information contained in the dictionary definitions of the verbs (Real Academia Española, 2001):

7.1 Copulative verbs (C): a list containing the copulative verbs, e.g. ser (to be), parecer (to seem like), etc.;
7.2 Impersonal verbs (I): a list containing all the verbs whose use is impersonal. Such use is specified in their definition, e.g. llover (to rain), nevar (to snow), etc.;
7.3 Pronominal verbs (P): a list which includes all the pronominal verbs (verbs whose lemma in the dictionary appears with se) and all the potential pronominal verbs whose definitions specify a potential pronominal use; and
7.4 Transitive and intransitive verbs (T): a list containing the transitive and intransitive verbs that meet the criteria detailed previously in item 7.

1 In this study, it is considered that sentences may contain several verbs whereas clauses contain only one finite verb.

3.2.5 The WEKA Package

The Weka workbench (available at: http://www.cs.waikato.ac.nz/ml/weka/) is a collection of state-of-the-art machine learning algorithms and data preprocessing tools (Hall et al., 2009; Witten & Frank, 2005). Both Weka interfaces, the Explorer and the Experimenter, were used to discover the methods and parameter settings that work best for the current classification task.

Figure 3.3: An example of the Weka Explorer interface.

Standard evaluation measures –precision, recall, f-measure and accuracy (Manning & Schütze, 1999)– provided by Weka are used. In these measures, true positives (tp) and true negatives (tn) are the numbers of cases that the system got right. The wrongly selected cases are the false positives (fp), while the cases that the system failed to select are the false negatives (fn). In the current context, true positives and true negatives are the numbers of correctly classified instances, while the false positives and false negatives are the numbers of falsely classified instances (Manning & Schütze, 1999).

Precision is defined as the ratio of selected items that the system got right, that is, the ratio of true positives to the sum of true positives and false positives: p = tp / (tp + fp).

Recall is defined as the proportion of target items that the system selected, that is, the ratio of true positives to the sum of true positives and false negatives: r = tp / (tp + fn).

F-measure is a single measure of overall performance which combines precision and recall as their harmonic mean: F = 2 / (1/r + 1/p).

Accuracy is the proportion of correctly classified objects: A = (tp + tn) / (tp + tn + fp + fn).
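As a small worked example of these measures, the following sketch computes them from illustrative counts; the numbers are invented for the example and are not taken from the eszic evaluation.

# Illustrative counts only; not taken from the eszic evaluation.
tp, tn, fp, fn = 80, 10, 6, 4

precision = tp / (tp + fp)                     # 80 / 86  = 0.930
recall    = tp / (tp + fn)                     # 80 / 84  = 0.952
f_measure = 2 / (1 / recall + 1 / precision)   # harmonic mean = 0.941
accuracy  = (tp + tn) / (tp + tn + fp + fn)    # 90 / 100 = 0.900

print(f"P={precision:.3f} R={recall:.3f} F={f_measure:.3f} A={accuracy:.3f}")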
Chapter 4

Evaluation

"Then you should say what you mean" [...] "I do," Alice hastily replied; "at least I mean what I say; that's the same thing, you know." "Not the same thing a bit!" said the Hatter. "Why, you might just as well say that 'I see what I eat' is the same thing as 'I eat what I see'!"
Alice in Wonderland, Lewis Carroll

This chapter presents the evaluation of the Elliphant system and some optimisation experiments carried out with the machine learning method (see Section 4.1). A comparative evaluation of Elliphant's performance with that of Connexor's Machinese Syntax parser is also described (see Section 4.2). Standard evaluation measures (precision, recall, f-measure and accuracy) are used to evaluate Elliphant with regard to the identification of the three classes: explicit subjects, zero pronouns and impersonal constructions.

4.1 Experiments

A set of experiments was executed using the Weka package with the purpose of answering the following questions:

(1) Which method and parameter values work best for our problem? (see Section 4.1.1)
(2) How many instances are needed to train the algorithm? (see Section 4.1.2)
(3) Which are the most significant features and what are the most effective combinations of features? (see Section 4.1.3)
(4) Does the genre matter? (see Section 4.1.4)

4.1.1 Method Selected: K* Algorithm

A comparison of the learning algorithms implemented in Weka (Witten & Frank, 2005) was carried out to determine the most accurate method for this classification task. The accuracy levels are compared in Table 4.1 below, which presents all of the Weka classifiers that can exploit the features utilised in the Elliphant system, run with default parameter settings. The experiment was executed using 20% of the instances in the training data, selected randomly, and ten-fold cross-validation was used in the evaluation. All methods with an accuracy within 1% of K*'s are marked in italics.

The seven highest-performing classifiers were then compared using 100% of the training data and ten-fold cross-validation (due to hardware limitations, it was not possible to obtain results for the NBTree classifier and the JRip rule classifier when using the entire set of training data). The Bayes classifiers (BayesNet, NaiveBayes and NaiveBayesUpdateable) obtained an accuracy score of 0.846, the function classifier (RBFNetwork) an accuracy of 0.850 and the tree classifier (LADTree) an accuracy of 0.830. With an accuracy of 0.860, the lazy learning classifier K* is the best performing one, and hence our chosen technique. Although lazy learning requires a relatively large amount of memory to store the entire training set, the eszic training data is small enough that it can be classified within a few minutes.

Instance-based learners classify new instances by comparing them to the manually classified instances in the training data. The fundamental assumption is that similar instances will have similar classifications. Nearest neighbor algorithms are the simplest of the instance-based learners. They use a domain-specific distance measure to retrieve the single most similar instance from the training set. In a nearest-neighbor method, each instance in the training set is represented by a vector of feature values that has been explicitly classified. When a new vector of feature values is presented, a distance measure is computed between the new vector and the set of vectors held in the training
44 4.1 Experiments 4. Evaluation Weka classifiers Accuracy Weka classifiers Accuracy Bayes: BayesNet 0.848 Meta: RacedIncrementalLogitBoost 0.717 Bayes: NaiveBayes 0.848 Meta: RandomSubSpace 0.731 Bayes: NaiveBayesSimple 0.842 Meta: Stacking 0.717 Bayes: NaiveBayesUpdateable 0.848 Meta: StackingC 0.717 Functions: RBFNetwork 0.848 Meta: Vote 0.717 Lazy: IB1 0.804 Misc: HyperPipes 0.715 Lazy: IBk 0.810 Misc: VFI 0.704 Lazy: K* 0.850 Rules: ConjunctiveRule 0.809 Lazy: LWL 0.809 Rules: DecisionTable 0.834 Meta: AdaBoostM1 0.81 Rules: DTNB 0.834 Meta: AttributeSelectedClassifier 0.836 Rules: JRip 0.845 Meta: ClassificationViaClustering 0.66 Rules: NNge 0.740 Meta: CVParameterSelection 0.717 Rules: OneR 0.762 Meta: Decorate 0.795 Rules: PART 0.795 Meta: END 0.809 Rules: Ridor 0.821 Meta: EnsembleSelection 0.762 Rules: ZeroR 0.717 Meta: FilteredClassifier 0.810 Trees: BFTree 0.760 Meta: Grading 0.717 Trees: DecisionStump 0.810 Meta: LogitBoost 0.841 Trees: J48 0.810 Meta: MultiBoostAB 0.810 Trees: J48graft 0.813 Meta: MultiClassClassifier 0.661 Trees: LADTree 0.846 Meta: MultiScheme 0.717 Trees: NBTree 0.850 NestedDichotomies: ClassBalancedND 0.809 Trees: RandomForest 0.793 NestedDichotomies: DataNearBalancedND 0.809 Trees: RandomTree 0.749 NestedDichotomies: ND 0.809 Trees: REPTree 0.723 Meta: OrdinalClassClassifier 0.810 Trees: SimpleCart 0.763 Table 4.1: Weka classifiers accuracy (20% of the eszic training set). set (Cleary & Trigg, 1995). The k nearest ones are identified and the new vector is assigned the class shared by the majority of the nearest neighbors1 . K* is an instance-based classifier. The class of a test instance is based upon the classes of those training instances that are similar to it, as determined by some similarity function. It differs from other instance-based learners in that this algorithm computes the distance between two instances using a method motivated by informa1 Evans (2001) and Boyd et al. (2005) executed their experiments with the k nearest neighbor classifier which is also a lazy learning algorithm. 45 4.1 Experiments 4. Evaluation tion theory in which an entropy-based distance function is used (Cleary & Trigg, 1995; Witten & Frank, 2005). The distance between instances is defined as the complexity of transforming one instance into another. The calculation of the complexity between instances is detailed in Cleary & Trigg (1995). When using K*, the most effective classification is made when using a blending parameter1 of 40%2 and the rest of the parameters remain with their default values: the missing Mode parameter3 set to the average column entropy curves and the entropic Auto Blend parameter set to false. Table 4.2 presents the evaluation of Elliphant when exploiting the K* classifier with the parameters set as explained before, using ten-fold cross-validation. Class Explicit subjects Zero pronouns Impersonal constructions Precision Recall F-measure 0.900 0.772 0.889 0.923 0.740 0.626 0.911 0.756 0.734 eszic training data Accuracy: 0.867 (ten-fold cross-validation) Table 4.2: eszic training data evaluation with K* -B 40 -M a. There is a marginal reduction in accuracy when the system is evaluated using tenfold cross-validation (0.867) instead of leave-one-out cross-validation (0.869), though its statistical significance is minimal. When decreasing the proportion of training data used, the difference in performance levels between both evaluation methods remains stable except when using 50% of the training data and is just 0.005. 
Although leave-one-out cross-validation obtains more accurate results, as it is easier to classify test instances using almost 100% of the training data than using only 90% of it, in practice a classifier is trained and tested on instances derived from different data sets. Ten-fold cross-validation is thus a more accurate simulation of real-world classification scenarios. Moreover, it can be computed far more quickly than leave-one-out cross-validation.

1 The parameter for global blending.
2 Blending percentages up to 50% were tested.
3 The missingMode parameter determines how missing attribute values are treated.

Percentage of training data    Ten-fold cross-validation    Leave-one-out cross-validation
10%      0.836      0.834
20%      0.859      0.862
30%      0.854      0.851
40%      0.855      0.858
50%      0.858      0.863
60%      0.860      0.862
70%      0.860      0.862
80%      0.865      0.863
90%      0.866      0.869
100%     0.867      0.868

Table 4.3: Leave-one-out and ten-fold cross-validation comparison (eszic training data).

4.1.2 Learning Curve

A learning curve shows how accuracy changes with varying sample sizes, plotting the number of correctly classified instances against the number of instances in the training data. To calculate the learning curve of the Elliphant system, the eszic training data was used to generate ten training samples, representing 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% and 100% of the data set. The instances contained in the eszic training file were randomly ordered so that the genre variable could not influence the results presented below. In these experiments, the K* algorithm was used with the parameter settings described in Section 4.1.1 and the evaluation was carried out using ten-fold cross-validation.

The learning curve shown in Figure 4.1 presents the increase in accuracy obtained by the Elliphant system using the eszic training data. Performance reaches a plateau at its maximum level when using 90% of the training instances. (One thing to be noted is that the ordering of the instances makes a slight difference to the accuracy of classification. While the system obtains an accuracy of 0.867 when the instances are placed in their original order of occurrence in the eszic training data, 0.866 is obtained when the same instances are presented in random order to the classifier using ten-fold cross-validation. This difference also occurs when leave-one-out cross-validation is used: in that case, the method obtains an accuracy of 0.869 when the instances are placed in their original order of occurrence and 0.868 when presented in random order.)

Figure 4.1: eszic training data learning curve for accuracy.

Figure 4.2 displays the precision, recall and f-measure of classification for all classes in the eszic training data. The values of the three measures are maximal when utilising 90% of the training set. While recall plateaus at this sample size, precision and f-measure decrease slightly when the amount of training data is further increased, although this decline is not sufficiently marked to be attributed to overtraining.

Figure 4.2: eszic training data learning curve for precision, recall and f-measure.
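The learning-curve procedure can be reproduced in outline with any instance-based learner. The sketch below is only an illustration of the experimental setup, not the code used in this study: it uses scikit-learn's k-nearest-neighbour classifier as a stand-in for K* (which is distributed with Weka only) and assumes that the symbolic eszic features have been one-hot encoded into a numeric matrix X with class labels y.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def learning_curve_accuracy(X, y, fractions=np.arange(0.1, 1.01, 0.1), seed=0):
    """Mean ten-fold cross-validation accuracy on growing random samples of
    the training data (X: one-hot encoded feature matrix, y: label array).
    KNeighborsClassifier is only a stand-in for Weka's K* algorithm."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))   # random order, so genre cannot bias the curve
    curve = []
    for frac in fractions:
        idx = order[: int(frac * len(y))]
        clf = KNeighborsClassifier(n_neighbors=5)
        acc = cross_val_score(clf, X[idx], y[idx], cv=10, scoring="accuracy").mean()
        curve.append((round(float(frac), 1), acc))
    return curve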
The learning curve in Figure 4.3 shows the classification accuracy for each of the classes, while Figure 4.4 presents this accuracy in relation to the number of training instances for each section of the eszic training data. Under all conditions, subjects are classified with a high accuracy, since the information given by the parser (collected in the features) facilitates an f-measure of 0.801 for the identification of explicit subjects. By contrast, the parser does not recognise zero pronouns or impersonal constructions as such; it can only recognise clauses with no subject, so classification of these two types begins at a lower level of accuracy (0.662 and 0.621 respectively). Classification of both zero pronouns and impersonal constructions reaches its maximum when 90% of the training data is exploited. There is also some evidence of overtraining in the classification of impersonal constructions when using 100% of the training data.

Figure 4.3: Learning curve for accuracy, recall and f-measure of the classes.

The zero pronoun class has the steepest learning curve. Utilising only 735 instances (50% of the training set), the Elliphant system obtains an accuracy (0.741) close to that obtained when using 100% of the training data. The learning curve for the subject class is more gradual due to the great variety of subjects occurring in the training data. In addition, increasing accuracy from a greater starting point (0.907 using just 20% of the training data) is far more expensive in terms of the addition of training instances. The impersonal sentence class is also learned rapidly by Elliphant: utilising a training set of only 179 instances, it reaches a classification accuracy of 0.721 (see Figure 4.4).
Table 4.4 shows the relevant ordered features evaluated using different algorithms implemented in Weka’s attribute selection module which can handle the features type (symbolic, numerical, etc.) from the eszic training data. The filters used for each Attribute Selection method are the ones provided by default in Weka1 . Considering the group of features selected using each Weka Attribute Selection algorithm, 11 classifications using the K* classifier were made over the complete eszic 1 BestFirst filter for the CfsSubsetEval method; Attribute ranking filter for the ChiSquaredAttributeEval, FilteredAttributeEval, GainRatioAttributeEval, InfoGainAttributeEval, OneRAttributeEval, ReliefFAttributeEval and SymmetricalUncertAttributeEval; and Greedy Stepwise filter for the ConsistencySubsetEval and FilteredAttributeEval methods. 50 4.1 Experiments 4. Evaluation Weka Attribute Selection Selected features CfsSubsetEval PARSER, NUMBER, NHPREV, NHTOT, VERBtype , PERSON LEMMA, POSpos , NHTOT, NHPREV, POSpre , PARSER PARSER, LEMMA, NUMBER, AGREE, NHTOT, POSpos , POSpre POSpos , LEMMA, NHPREV, NHTOT, PARSER, POSpre PARSER, NHPREV, NHTOT ChiSquaredAttributeEval ConsistencySubsetEval FilteredAttributeEval FilteredSubsetEval GainRatioAttributeEval InfoGainAttributeEval OneRAttributeEval ReliefFAttributeEval SymmetricalUncertAttributeEval NHPREV, PARSER, PERSON, NHTOT, POSpos , CLAUSE POSpos , LEMMA, NHPREV, NHTOT, PARSER, POSpre NHTOT, POSpos , CLAUSE, PERSON, NHPREV, PARSER POSpos , VERBtype , LEMMA, PARSER, CLAUSE, POSpre NHPREV, PARSER, NHTOT, POSpos , PERSON, LEMMA Table 4.4: Selected features by Weka Attribute Selection methods. training data using only the features selected by each method. Table 4.5 presents the accuracy of each classification using ten-fold cross-validation. The most effective group of six features in combination is the one selected by Weka’s SymmetricalUncertAttributeEval Attribute Selection algorithm, since the classification using those six features together already offers an accuracy of 0.851. Likewise, a group consisting of only three features (parser, nhprev, nhtot) was selected by the FilteredSubsetEval algorithm. These three features are the most frequently selected ones among those chosen by all the Attribute Selection methods. A classification which exploits only the three features obtains an accuracy of 0.819. A set of experiments were conducted in which features were selected on the basis of the degree of computational effort needed to generate them. Two sets of features were proposed. One group corresponds to features intrinsic to the parser, whose values can be obtained by trivial exploitation of the tags produced in its output (parser, 51 4.1 Experiments 4. Evaluation Weka Attribute Selection CfsSubsetEval ChiSquaredAttributeEval ConsistencySubsetEval FilteredAttributeEval FilteredSubsetEval GainRatioAttributeEval InfoGainAttributeEval OneRAttributeEval ReliefFAttributeEval SymmetricalUncertAttributeEval Accuracy 0.824 0.848 0.843 0.848 0.819 0.833 0.848 0.833 0.825 0.851 Table 4.5: Classification using the selected features groups: accuracy. lemma, person, pospos , pospre ). The second group of features (clause, agree, nhprev, nhtot, verbtype ) has values derived by methods extrinsic to the parser and rules for the recognition of elements that are independent of it. Derivation of this second group of features necessitated the implementation of more sophisticated modules to identify the boundaries of syntactic constituents such as clauses and noun phrases. 
These modules are rule-based and operate over the often erroneous output of the parser (see Section 3.2.4). The results obtained when the classifier exclusively exploits each of these intrinsic and extrinsic groups of features are shown in Tables 4.6 and 4.7 A recurrent issue in anaphora resolutions studies is determining the quantity and type of knowledge needed for identification of candidates and selection of a candidate as antecedent. In Mitkov (2002) it is stated that, given the natural linguistic ambiguity of various cases, the resolution of any kind of anaphor requires not only morphological, lexical, and syntactic knowledge but also semantic knowledge, discourse knowledge, and real world knowledge. Nevertheless, current anaphora resolution methods rely mainly on restrictions and preference heuristics, which employ information originating from morpho-syntactic or shallow semantic analysis (Ferrández & Peral, 2000; Mitkov, 1998), while some previous approaches have exploited full parsing (Hobbs, 1977; Lappin & Leass, 1994). As described in this dissertation, Elliphant makes use of deep dependency parsing plus the morphological knowledge contained in the verb lists used. 52 4.1 Experiments 4. Evaluation There are two findings of note in Table 4.6. The first is that no impersonal constructions are identified when only features extrinsic to the parser are used. The second is that there is a reduction in recall when using only intrinsic features. It is therefore better to classify instances using a feature group that combines both types of features. eszic training data Precision Recall F-measure 0.654 0.865 0 0.664 0.891 0 0.659 0.878 0 Explicit subjects Zero pronouns Impersonal constructions Extrinsic parser features eszic training data accuracy: 0.808 Table 4.6: Extrinsic parser features classification results. eszic training data Precision Recall F-measure 0.866 0.779 0.944 0.312 0.983 0.285 0.459 0.869 0.438 Explicit subjects Zero pronouns Impersonal constructions Intrinsic parser features eszic training data accuracy: 0.789 Table 4.7: Intrinsic parser features classification results. To estimate the weight of each feature, classifications were made in which each feature was omitted from the training instances that were presented to the classifier and ten-fold cross-validation was applied. Table 4.8 presents the accuracy of these classifications. Omission of all but one of the features a led to a reduction in accuracy, justifying their inclusion in the training instances. Feature omitted PARSER NHTOT LEMMA POSpos NHPREV PERSON CLAUSE Accuracy Feature omitted 0.854 0.860 0.861 0.861 0.862 0.863 0.863 VERBtype NUMBER INF AGREE POSpre SE A Accuracy 0.863 0.864 0.864 0.865 0.866 0.866 0.867 Table 4.8: Single feature omission classifications: accuracy. 53 4.1 Experiments 4. Evaluation 4.1.4 Genre Analysis As the eszic training data is composed of instances belonging to two different genres (legal and health), two subgroups of the eszic training data were generated: the Legal eszic training data and the Health eszic training data containing all the instances derived from legal and health texts, respectively. A comparative evaluation using tenfold cross-validation over the two subgroups shows that Elliphant is more successful when classifying instances of explicit subjects in legal texts (see Table 4.9). This may be explained by the uniformity of the sentences in the legal texts which present less variation than the ones from the health genre. 
Texts from the health genre present the additional complication of specialised named entities and acronyms which are used quite frequently in the health texts from the eszic Corpus (e.g. CCDSD1, DSM-IV2 or TLP3). Further, there is a larger number of explicit subjects in the legal training data (2,739, compared with 2,116 explicit subjects occurring in the health texts). Similarly, the better performance in the detection of zero pronouns and impersonal sentences in the health texts may be due to their higher occurrence in the health genre: 108 impersonal constructions and 1,174 zero pronouns, compared with 71 impersonal constructions and 619 zero pronouns in the legal texts (see Table 3.2 for details about the number of class instances in each subgroup of the training data).

Class                                      Precision   Recall   F-measure
Legal genre Explicit subjects                0.920      0.955     0.937
Health genre Explicit subjects               0.881      0.888     0.884
Legal genre Zero pronouns                    0.761      0.649     0.701
Health genre Zero pronouns                   0.784      0.796     0.790
Legal genre Impersonal constructions         0.786      0.620     0.693
Health genre Impersonal constructions        0.905      0.620     0.736
Legal genre accuracy: 0.893 (ten-fold cross-validation)
Health genre accuracy: 0.848 (ten-fold cross-validation)

Table 4.9: Legal and health genres comparative evaluation.

1 Cuestionario Clínico para el Diagnóstico del Síndrome Depresivo (Clinic Questionnaire for Depressive Syndrome Diagnosis).
2 Manual Diagnóstico y Estadístico de los Trastornos Mentales IV (Diagnostic and Statistical Manual of Mental Disorders IV).
3 Trastorno límite de la personalidad (Borderline Personality Disorder).

We have also studied the effect of training the classifier on data derived from one genre and testing it on instances derived from a different genre. Table 4.10 shows that instances from legal texts are not only more homogeneous, as the classifier obtains higher accuracy when testing and training only on legal instances (0.895), but also more informative: when both the legal and health genres are combined as training data, testing the algorithm only on instances from the health genre shows significantly increased accuracy (0.933). These results imply that the instances from the health genre are the most heterogeneous ones. Subsets of legal documents where our method achieves an accuracy of 0.942 were also found.

Testing set (rows) \ Training set (columns)     Legal    Health   eszic Corpus (all)
Legal                                           0.895    0.858    0.920
Health                                          0.859    0.841    0.933
eszic Corpus                                    0.885    0.887    0.869
Accuracy: cross-genre training and testing (ten-fold cross-validation)

Table 4.10: Cross-genre training and testing evaluation.

4.2 Comparative Evaluation

Due to the lack of previous work on this topic, a comparison with other methods is not feasible. Although the approach of Ferrández & Peral (2000) is similar to ours, they use a different definition of zero pronouns, and therefore a comparison is not appropriate. As a guideline, the results obtained by Connexor's Machinese Syntax are presented with regard to the existence (or not) of a subject inside the clause. Since this parser does not distinguish between referential and non-referential elliptic subjects, both categories have been merged into one. Needless to say, a comparison of the results obtained by these two methods should be made with caution. They are presented here only as a point of reference.
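The way the two systems are aligned for this comparison can be made explicit with a small sketch: the ternary gold annotation is collapsed so that zero pronouns and impersonal constructions form a single class with no explicit subject, which is then matched against the parser's binary subject decision. The label names below are hypothetical; only the merging scheme follows the description above.

from collections import Counter

# Ternary gold labels from the eszic annotation and the parser's binary
# decision ('SUBJ' if Machinese attached a subject to the verb, else 'NOSUBJ').
MERGE = {'EXPLICIT': 'SUBJ', 'ZERO': 'NOSUBJ', 'IMPERSONAL': 'NOSUBJ'}

def parser_agreement(gold_ternary, parser_binary):
    """Collapse the ternary annotation to two classes and return the
    proportion of verbs on which the parser's subject decision agrees."""
    counts = Counter()
    for gold, parsed in zip(gold_ternary, parser_binary):
        counts[(MERGE[gold], parsed)] += 1
    correct = counts[('SUBJ', 'SUBJ')] + counts[('NOSUBJ', 'NOSUBJ')]
    return correct / sum(counts.values())

gold   = ['EXPLICIT', 'ZERO', 'IMPERSONAL', 'EXPLICIT']
parsed = ['SUBJ', 'NOSUBJ', 'SUBJ', 'SUBJ']
print(parser_agreement(gold, parsed))   # 3 of 4 merged decisions agree -> 0.75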
It is clear from the figures that the Elliphant system obtains superior f-measure not only in the classification of both elliptic subject classes but also when classifying the non-omitted subject class. The evaluation was carried out using both the entire set of eszic training data and the genre-specific subsets of the training data (the Legal and Health eszic training data). The evaluation of the Elliphant system was made using leave-one-out cross-validation.

eszic training data                                     Precision   Recall   F-measure
Elliphant Explicit subjects                               0.901      0.924     0.913
Elliphant Zero pronouns                                   0.774      0.743     0.758
Elliphant Impersonal constructions                        0.889      0.626     0.734
Elliphant eszic training data accuracy: 0.869 (leave-one-out cross-validation)

Table 4.11: Elliphant eszic training data results.

eszic training data                                     Precision   Recall   F-measure
Machinese Explicit subjects                               0.911      0.716     0.802
Machinese Zero pronouns + Impersonal constructions        0.543      0.829     0.656
Machinese eszic training data accuracy: 0.749

Table 4.12: Machinese eszic training data results.

When evaluating over the entire eszic training set, Elliphant outperforms the parser on every measure. When detecting explicit subjects, Elliphant obtains a considerably higher recall score (0.924, compared to the parser's 0.716). The averages of the evaluation measures obtained by Elliphant for the identification of zero pronouns and impersonal constructions (precision: 0.831; recall: 0.684; f-measure: 0.746) were also compared with the parser's results for its single merged elision class. This comparison demonstrated Elliphant's superiority over Connexor's Machinese Syntax parser, in this task, for all measures except recall.

Legal genre eszic training data                         Precision   Recall   F-measure
Elliphant Explicit subjects                               0.922      0.955     0.938
Elliphant Zero pronouns                                   0.760      0.654     0.934
Elliphant Impersonal constructions                        0.797      0.662     0.723
Elliphant Legal eszic training accuracy: 0.895

Table 4.13: Elliphant Legal eszic training results.

When processing only the Legal eszic training data, the accuracy of the parser is reduced (0.726), while the performance of the Elliphant system is improved (0.895).

Legal genre eszic training data                         Precision   Recall   F-measure
Machinese Explicit subjects                               0.940      0.702     0.803
Machinese Zero pronouns + Impersonal constructions        0.410      0.823     0.547
Machinese Legal eszic training accuracy: 0.726

Table 4.14: Machinese Legal eszic training results.

The two systems were used to classify instances of elision (zero pronouns and impersonal constructions) in texts from the legal genre. The averaged evaluation measures obtained by the Elliphant system (precision: 0.778; recall: 0.658; f-measure: 0.828) were found to be superior to those obtained by the parser (precision: 0.675; recall: 0.763; f-measure: 0.675) for all measures except recall.

Health genre eszic training data                        Precision   Recall   F-measure
Elliphant Explicit subjects                               0.879      0.879     0.879
Elliphant Zero pronouns                                   0.773      0.795     0.784
Elliphant Impersonal constructions                        0.882      0.620     0.728
Elliphant Health eszic training data accuracy: 0.841

Table 4.15: Elliphant Health eszic training data results.
Chapter 5

Conclusions and Future Work

In this dissertation, a machine learning approach to the identification of zero pronouns, impersonal constructions and explicit subjects was presented. In treating this range of classes, complete coverage is provided for all possible constituents which may occur in subject position in Spanish clauses.

In order to enable a machine learning approach to classification, a parsed corpus of Spanish texts from the health and legal genres was compiled. The corpus was manually annotated to encode information about the element in subject position for every finite verb in the corpus (the eszic Corpus). A set of 14 features was formulated and training data consisting of 6,827 instances, each represented by a vector of feature values, was created (the eszic training data). The training data was used with the classification algorithms distributed with the Weka package, and empirical observation revealed that the K* algorithm was optimal for the purpose of this classification.

The performance of this machine learning approach was compared with that of Connexor's Machinese Syntax parser. Elliphant offers a classification with superior accuracy in the recognition of both elliptic classes (zero pronouns and impersonal constructions), and also in the classification of the non-elliptic subject class (explicit subjects). The method presented in this dissertation is also able to identify impersonal constructions in Spanish, a task which appears not to have been dealt with before in the literature.

In addition to presenting results with regard to algorithm selection, further experiments carried out with the underlying method covered parameter optimisation, the learning of the most effective combinations of features, the optimal number of instances to include in the training data, and the relationship between the results and the different genres on which the Elliphant system was tested. This chapter presents the findings of all of these experiments (see section 5.1). In future research, it is intended that optimisation of the approach and its adaptability to other genres will be investigated in more depth (see section 5.2).

5.1 Main Observations
Algorithm selection: the instance-based learning algorithm K* was selected for the classification of elliptic vs. explicit subject instances and referential vs. non-referential subject instances. This decision was taken on the basis of comparing the accuracy of this classifier with that of the other classifiers available in the Weka package. In terms of accuracy, the K* algorithm is closely followed by the Bayes-based algorithms in Weka.

Parameter optimisation: the impact of the parameter settings on the performance of the K* classifier was investigated. Although Weka provides sensible default settings, it is by no means certain that they are optimal for this particular task. The default settings were changed so that a blending parameter of 40% was used with the K* algorithm.

Feature selection: the set of experiments conducted to determine an optimal group of features to be used by the classification algorithm revealed that, of the entire set of 14 features, the most effective group comprises six features: nhprev (number of noun phrases preceding the verb), parser (parsed subject), nhtot (number of noun phrases in the clause), pospos (the four parts of speech following the verb), person (verb morphological person) and lemma (verbal lemma). This study also showed that the feature a (preposition a) does not make any meaningful contribution to the classification.

Training data required: learning curve experiments showed the correlation between the accuracy of the classifier and the size of the training set; performance reaches a plateau at its maximum level when 90% of the available data is used.

Genre interference: the performance of the Elliphant system was evaluated separately in two different genres, legal and health, showing that there is some genre interference in the classification tasks. Elliphant classifies zero pronouns and explicit subjects in legal texts with higher accuracy than in health texts. By contrast, impersonal constructions are more accurately classified in health texts. Cross-genre training and testing demonstrated that legal instances are more informative and homogeneous than health genre instances.

5.2 Future Research

Future research goals are related to improvements in: (1) the optimisation of the Elliphant system, (2) the adaptation of the system to other genres, (3) the inter-annotator agreement of the eszic Corpus, (4) the comparison of Elliphant with a rule-based approach, and (5) the design of an algorithm to resolve zero anaphora in Spanish.

Firstly, with regard to further improvement of the Elliphant system, the interaction between (a) feature selection and parameter optimisation, and (b) class distribution will be addressed. In related work, it was found that optimal settings for feature selection and parameter optimisation should not be sought independently of one another, since there is an interaction between the two: their joint optimisation can cause variations in the accuracy levels obtained by classifiers (Hoste, 2005). Additionally, an investigation will be made into how the class distribution of the data affects learning. This will facilitate the compilation of an optimal set of training instances, as it has been found that training data containing a lower proportion of negative instances can be beneficial to classification (Hoste, 2005).
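As a concrete illustration of the kind of settings involved in such joint optimisation, the sketch below configures Weka's K* implementation with the 40% global blending parameter adopted in this work and removes one attribute from the training data before cross-validation. The ARFF file name and the attribute index chosen for removal are assumptions made for the example rather than values taken from the thesis.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.lazy.KStar;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

/** Sketch: K* with a 40% global blend, trained on data with one feature removed. */
public class KStarSettings {

    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF export of the eszic training data (14 features + class).
        Instances data = new DataSource("eszic.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Drop one feature before training (the 1-based attribute index "3" is an
        // assumption; the thesis identifies the feature a as making no contribution).
        Remove remove = new Remove();
        remove.setAttributeIndices("3");
        remove.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, remove);

        // K* with the global blending parameter raised from the default to 40%.
        KStar kstar = new KStar();
        kstar.setGlobalBlend(40);

        // Ten-fold cross-validation with the adjusted settings.
        Evaluation eval = new Evaluation(reduced);
        eval.crossValidateModel(kstar, reduced, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toClassDetailsString());
    }
}
```

Looping over blend values and attribute subsets on top of this skeleton would give the style of parameter and feature experiments reported in Chapter 4.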
In future work, evaluation and learning curve experiments in which training instances derived from texts in one genre are used to classify instances derived from texts in a different genre will provide insight into the optimal type and combination of training data, enabling better classification with fewer instances across various types and genres of text, as well as adding robustness to our system. Inter-annotator agreement will also be measured, and it is planned to design a rule-based algorithm to identify and resolve zero anaphora in Spanish, as there is some debate about which approach, machine learning or rule-based, brings optimal performance when applied in anaphora resolution systems (Mitkov, 2002).

References

Aldea Muñoz, S. (2003). Un caso de intervención psicológica de la depresión infantil. psiquiatria.com, 7. 28 Aldea Muñoz, S. (2006). Influencia del autoconcepto y de la competencia social en la depresión infantil. psiquiatria.com, 10. 28 Alonso-Ovalle, L. & D'Introno, F. (2000). Full and null pronouns in Spanish: the zero pronoun hypothesis. In H. Campos, E. Herburger, A. Morales-Front & T.J. Walsh, eds., Hispanic linguistics at the turn of the millennium. Papers from the 3rd Hispanic Linguistics Symposium, 189–210, Cascadilla Press, Sommerville, MA. 6 Balcázar Nava, P., Bonilla Muñoz, M.P., Gurrola Peña, G.M., Oudhof van Barneveld, H. & Aguilar Mercado, M.R. (2005). La depresión como problema de salud mental en los adolescentes mexicanos. psiquiatria.com, 9. 28 Barreras, J. (1993). Resolución de elipsis y técnicas de parsing en una interficie de lenguaje natural. Procesamiento del lenguaje natural, 13, 247–258. 7, 8 Beavers, J. & Sag, I. (2004). Coordinate ellipsis and apparent non-constituent coordination. In S. Müller, ed., Proceedings of the 11th International Conference on Head-Driven Phrase Structure Grammar (HPSG-04), 48–69, CSLI Publications, Stanford, CA. 17 Bello, A. ([1847] 1981). Gramática de la lengua castellana destinada al uso de los americanos. Instituto Universitario de Lingüística Andrés Bello, Cabildo Insular de Tenerife, Santa Cruz de Tenerife. 15, 19 Bergsma, S., Lin, D. & Goebel, R. (2008). Distributional identification of non-referential pronouns. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL/HLT-08), 10–18. 2, 10, 12 Bosque, I. (1989). Clases de sujetos tácitos. In J. Borrego Nieto, ed., Philologica: homenaje a Antonio Llorente, vol. 2, 91–112, Servicio de Publicaciones, Universidad Pontificia de Salamanca, Salamanca. 15, 16, 18, 19, 24 Boyd, A., Gegg-Harrison, W. & Byron, D. (2005). Identifying non-referential it: a machine learning approach incorporating linguistically motivated patterns. In Proceedings of the ACL Workshop on Feature Engineering for Machine Learning in Natural Language Processing. 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05), 40–47. 8, 10, 12, 13, 22, 45 Brucart, J.M. (1987). La elisión sintáctica en español. Universitat Autònoma de Barcelona, Bellaterra. 15 Brucart, J.M. (1999). La elipsis. In I. Bosque & V. Demonte, eds., Gramática descriptiva de la lengua española, vol. 2, 2787–2863, Espasa-Calpe, Madrid. ix, 15, 16, 17, 19, 23, 24 Carden, G. (1982). Backwards anaphora in discourse context. Journal of Linguistics, 18, 361–87. 33, 34 Chinchor, N. & Hirschman, L. (1997). MUC-7 Coreference task definition (version 3.0).
In Proceedings of the 1997 Message Understanding Conference (MUC-97). 2 Chomsky, N. (1965). Aspects of the theory of syntax . The MIT Press, Cambridge, MA. 15 Chomsky, N. ([1968] 2006). Language and mind . Cambridge University Press, Cambridge, 3rd edn. 14 Chomsky, N. (1981). Lectures on government and binding. Mouton de Gruyter, Berlin, New York. 1, 6, 19 Chomsky, N. (1995). The minimalist program. The MIT Press, Cambridge, MA. 15 Chung, S., Ladusaw, W. & McCloskey, J. (1995). Sluicing and logical form. Natural Language Semantics, 3, 239–282. 17 Cleary, J. & Trigg, L. (1995). K*: an instance-based learner using an entropic distance measure. In Proceedings of the 12th International Conference on Machine Learning (ICML95), 108–114. 13, 45, 46 Clemente, J., Torisawa, K. & Satou, K. (2004). Improving the identification of nonanaphoric it using Support Vector Machines. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA/BioNLP04), 58–61. 10, 12 Código Civil (1889). Texto de la edición del Código Civil mandada publicar por el Real Decreto de 24 del corriente en cumplimiento de la ley de 26 de mayo último. Gaceta de Madrid , 206, 249–312. 26 Connexor Oy (2006a). Conexor functional dependency grammar 3.7. User’s manual . 29, 38, 39 Connexor Oy (2006b). Machinese language model . 13, 29, 35 Constitución Española (1978). Constitución Española de 27 de diciembre de 1978. Boletı́n Oficial del Estado, 311, 29313–29424. 26 Corpas Pastor, G. (2008). Investigar con corpus en traducción: los retos de un nuevo paradigma. Peter Lang, Frankfurt am Main. 7, 8 Corpas Pastor, G., Mitkov, R., Afzal, N. & Pekar, V. (2008). Translation universals: do they exist? A corpus-based NLP study of convergence and simplification. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas (AMTA-08), 75–81. 2, 7, 8, 10 64 References Danlos, L. (2005). Automatic recognition of French expletive pronoun occurrences. In R. Dale, K.F. Wong, J. Su & O.Y. Kwong, eds., Natural language processing. Proceedings of the 2nd International Joint Conference on Natural Language Processing (IJCNLP-05), 73–78, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 3651. 2, 10, 11, 12 Denber, M. (1998). Automatic resolution of anaphora in English. Tech. rep., Eastman Kodak Co. 10, 11, 12 Dı́az Morfa, J. (2004). La crisis de las aventuras en las relaciones de pareja. psiquiatria.com, 8. 28 Dı́scolo, A. ([2nd century] 1987). Sintaxis. Gredos, Madrid. 14 Evans, R. (2000). A comparison of rule-based and machine learning methods for identifying non-nominal it. In D.N. Christodoulakis, ed., Natural Language Processing - NLP 2000. Proceedings of the 2nd International Conference on Natural Language Processing (NLP-2000), 233–241, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 1835. 10, 12 Evans, R. (2001). Applying machine learning: toward an automatic classification of it. Literary and Linguistic Computing, 16, 45–57. 2, 10, 12, 13, 22, 29, 38, 45 Fernández Soriano, O. & Táboas Baylı́n, S. (1999). Construcciones impersonales no reflejas. In I. Bosque & V. Demonte, eds., Gramática descriptiva de la lengua española, vol. 2, 1631–1722, Espasa-Calpe, Madrid. 18, 19 Ferrández, A. & Peral, J. (2000). A computational approach to zero-pronouns in Spanish. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL-2000), 166–172. 
2, 6, 7, 8, 9, 11, 17, 22, 52, 55 Ferrández, A., Palomar, A. & Moreno, L. (1997). El problema del núcleo del sintagma nominal: ¿elipsis o anáfora? Procesamiento del lenguaje natural , 20, 13–26. 24 Ferrández, A., Palomar, A. & Moreno, L. (1998). Anaphor resolution in unrestricted texts with partial parsing. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (ACL/COLING-98), 385–391. 9 Ferrández, A., Palomar, A. & Moreno, L. (1999). An empirical approach to Spanish anaphora resolution. Machine Translation, 14, 191–216. 9 Fiengo, R. & May, R. (1994). Indices and identity. The MIT Press, Cambridge MA. 17 Francis, W. (1958). The structure of American English. Ronald Press, New York. 15 Fries, C. (1940). American English grammar . Appleton-Century-Crofts, New York. 15 Garcı́a Jurado, F. (2007). La etimologı́a como historia de las palabras. E-excellence, Área de Cultura Clásica, Filologı́a Clásica, 39, 1–27. 14 Garcı́a Losa, E. (2008). Efectividad, operatividad y potenciación del tratamiento en patologı́a fóbica, en el contexto de los servicios especializados de salud mental públicos: la utilización en la sala de consulta de los recursos de Internet. psiquiatria.com, 12. 26 65 References Gómez Torrego, L. (1992). La impersonalidad gramatical: descripción y norma. Arco Libros, Madrid. 17, 18, 19, 23, 25 Grice, H. (1975). Logic and conversation. In P. Cole & J.L. Morgan, eds., Syntax and semantics, vol. 3: Speech Acts, 41–58, Academic Press, New York. 15 Gundel, J., Hedberg, N. & Zacharski, R. (2005). Pronouns without NP antecedents: how do we know when a pronoun is referential? In A. Branco, T. McEnery & R. Mitkov, eds., Anaphora processing: linguistic, cognitive and computational modelling, 351–364, John Benjamins, Amsterdam. 10, 12 Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. & Witten, I.H. (2009). The WEKA data mining software: an update. SIGKDD Explorations, 11, 10–18. 41 Halliday, M.A.K. & Hasan, R. (1976). Cohesion in English. Longman, London. 15 Han, N. (2004). Korean null pronouns: classification and annotation. In Proceedings of the Workshop on Discourse Annotation. 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), 33–40. 7 Hernández Terrés, J.M. (1984). La elipsis en la teorı́a gramatical . Universidad de Murcia, Murcia. 14 Hirano, T., Matsuo, Y. & Kikui, G. (2007). Detecting semantic relations between named entities in text using contextual features. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics. Companion volume proceedings of the demo and poster sessions (ACL-05), 157–160. 2, 7, 8 Hobbs, J. (1977). Resolving pronoun references. Lingua, 44, 311–338. 52 Hoste, V. (2005). Optimization issues in machine learning of coreference resolution. Ph.D. thesis, University of Antwerp. 61 Hu, Q. (2008). A corpus-based study on zero anaphora resolution in Chinese discourse. Ph.D. thesis, City University of Hong Kong. 7, 8 Iida, R., Inui, K. & Matsumoto, Y. (2006). Exploiting syntactic patterns as clues in zeroanaphora resolution. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics and the 21st International Conference on Computational Linguistics (ACL/COLING-06), 625–632. 7, 8 Iida, R., Kentaro, I. & Matsumoto, Y. (2009). Capturing salience with a trainable cache model for zero-anaphora resolution. 
In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL/AFNLP-09), 647–655. 2, 7, 8 Imamura, K., Saito, K. & Izumi, T. (2009). Discriminative approach to predicate-argument structure analysis with zero-anaphora resolution. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL/AFNLP-09), 85–88. 2, 7, 8 66 References Isozaki, H. & Hirao, T. (2003). Japanese zero pronoun resolution based on ranking rules and machine learning. In Theoretical Issues in Natural Language Processing. Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing (EMNLP-03), 184–191. 7, 8 Järvinen, T. & Tapanainen, P. (1998). Towards an implementable dependency grammar. In A. Polguère & S. Kahane, eds., Proceedings of the Workshop on Processing of DependencyBased Grammars. 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (ACL/COLING-98), 1–10. 28 Järvinen, T., Laari, M., Lahtinen, T., Paajanen, S., Paljakka, P., Soininen, M. & Tapanainen, P. (2004). Robust language analysis components for practical applications. In Proceedings of the 20th International Conference on Computational Linguistics (COLING04), 53–56. 28, 29 Kawahara, D. & Kurohashi, S. (2004). Improving Japanese zero pronoun resolution by global word sense disambiguation. In Proceedings of the 20th International Conference on Computational Linguistics (COLING-04), 343–349. 2, 7, 8 Kibrik, A.A. (2004). Zero anaphora vs. zero person marking in Slavic: a chicken/egg dilemma? In Proceedings of the 5th Discourse Anaphora and Anaphor Resolution Colloquium (DAARC04), 87–90. 2, 7, 8 Kratzer, A. (1998). More structural analogies between pronouns and tenses. In Proceedings of Semantics and Linguistic Theory VIII (SALT-88), Cornell University, Ithaca, NY. 6 Kuno, S. (1972). Functional sentence perspective: a case study from Japanese and English. Linguistic Inquiry, 3, 269–320. 33 Lambrecht, K. (2001). A framework for the analysis of cleft constructions. Linguistics, 39, 463–516. 10, 12 Lancelot, C. & Arnauld, A. ([1660] 1980). Gramática general y razonada. Sociedad General Española de Librerı́a, Madrid. 14 Lappin, S. & Leass, H. (1994). An algorithm for pronominal anaphora resolution. Computational Linguistics, 20, 535–561. 10, 11, 12, 52 Lee, S. & Byron, D. (2004). Semantic resolution of zero and pronoun anaphors in Korean. In Proceedings of the 5th Discourse Anaphora and Anaphor Resolution Colloquium (DAARC04), 103–108. 2, 7 Lee, S., Byron, D. & Jang, S. (2005). Why is zero marking important in Korean? In R. Dale, K.F. Wong, J. Su & O.Y. Kwong, eds., Natural language processing. Proceedings of the 2nd International Joint Conference on Natural Language Processing (IJCNLP-05), 588–599, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 3651. 7 Ley 29/1998 (1998). Ley 29/1998, de 13 de julio, reguladora de la Jurisdicción Contenciosoadministrativa. Boletı́n Oficial del Estado, 167, 23516–23551. 26 Ley 29/2005 (2005). Ley 29/2005, de 29 de diciembre, de Publicidad y Comunicación Institucional. Boletı́n Oficial del Estado, 312, 42902–42905. 
26 67 References Ley 3/1991 (1991). Ley 3/1991, de 10 de enero, de Competencia Desleal. Boletı́n Oficial del Estado, 10, 959–962. 26 Ley Orgánica 10/1995 (1995). Ley Orgánica 10/1995, de 23 de noviembre, del Código Penal. Boletı́n Oficial del Estado, 281, 33987–34058. 26 Ley Orgánica 1/2002 (2002). Ley Orgánica 1/2002, de 22 de marzo, reguladora del Derecho de Asociación. Boletı́n Oficial del Estado, 73, 11981–11991. 26 Ley Orgánica 6/2001 (2001). Ley Orgánica 6/2001, de 21 de diciembre, de Universidades. Boletı́n Oficial del Estado, 307, 49400–49425. 26 Li, Y., Musilek, P. & Wyard-Scott, L. (2009). Identification of pleonastic it using the web. Computer Engineering, 34, 339–389. 10, 12 López Ortega, M.A. (2009). El cine como herramienta ilustrativa en la enseñanza de los trastornos de la personalidad. psiquiatria.com, 13. 26 Manning, C. & Schütze, H. (1999). Foundations of statistical natural language processing. The MIT Press, Cambridge, MA. 41 Matsui, T. (1999). Approaches to Japanese zero pronouns: centering and relevance. In D. Cristea, N. Ide & D. Marcu, eds., Proceedings of the Workshop on the Relation of Discourse/Dialogue Structure and Reference. 37th Annual Meeting of the Association Computational Linguistics (ACL-99), 11–20. 2, 7, 8 Mel’čuk, I. (2003). Levels of dependency in linguistic description: concepts and problems. In Dependency and valency. An International handbook of contemporary research, 188–229, Mouton de Gruyter, Berlin, New York. 17 Mel’čuk, I. (2006). Zero sign in morphology. In Aspects of the theory of morphology, 447–495, Mouton de Gruyer, Berlin, New York. 6, 19 Mendikoetxea, A. (1994). La semántica de la impersonalidad. In C. Sánchez, ed., Las construcciones con se, 239–267, Visor, Madrid. 18 Mendikoetxea, A. (1999). Construcciones con se: medias, pasivas e impersonales. In I. Bosque & V. Demonte, eds., Gramática descriptiva de la lengua española, vol. 2, 1575–1630, Espasa-Calpe, Madrid. 18 Merchant, J. (2001). The syntax of silence. Sluicing, islands and the theory of ellipsis. Oxford University Press, Oxford. 17 Mitkov, R. (1998). Robust pronoun resolution with limited knowledge. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (ACL/COLING-98), 869–875. 12, 52 Mitkov, R. (2001). Outstanding issues in anaphora resolution. In A. Gelbukh, ed., Proceedings of the 2nd International Conference on Computational Linguistics and Intelligent Text Processing (CICLing-01), 110–125, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 2004. 10 Mitkov, R. (2002). Anaphora resolution. Longman, London. 6, 8, 10, 33, 52, 61 68 References Mitkov, R. (2010). Discourse processing. In A. Clark, C. Fox & S. Lappin, eds., The handbook of computational linguistics and natural language processing, 599–629, Wiley Blackwell, Oxford. 2, 5, 10 Mitkov, R., Evans, R. & Orasan, C. (2002). A new, fully automatic version of Mitkov’s knowledge-poor pronoun resolution method. In Proceedings of the 3rd International Conference on Computational Linguistics and Intelligent Text Processing (CICLing-02), 69–83, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 2276. 10, 12 Molina López, D. (2008). Y de los hermanos ¿qué? Cómo ayudar a los hermanos de un TLP. psiquiatria.com, 12. 28 Mori, T. & Nakagawa, H. (1996). Zero pronouns and conditionals in Japanese instruction manuals. 
In Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), 782–787. 7, 8 Müller, C. (2006). Automatic detection of nonreferential it in spoken multi-party dialog. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-06), 49–56. 10, 12, 13 Murata, M., Isahara, H. & Nagao, M. (1999). Pronoun resolution in Japanese sentences using surface expressions and examples. In A. Bagga, B. Baldwin & S. Shelton, eds., Proceedings of the Workshop on Coreference and Its Applications. 37th Annual Meeting of the Association for Computational Linguistics (ACL-99), 39–46. 7, 8 Nakagawa, H. (1992). Zero pronouns as experiencer in Japanese discourse. In Proceedings of the 15th International Conference on Computational Linguistics (COLING-92), 324–330. 7, 8 Nakaiwa, H. (1997). Automatic identification of zero pronouns and their antecedents within aligned sentence pairs. In Proceedings of the 3rd Annual Meeting of the Association for Natural Language Processing in Japan (ANLP-97), 127–141. 7, 8 Nakaiwa, H. & Ikehara, S. (1992). Zero pronoun resolution in a Japanese to English machine translation system by using verbal semantic attributes. In Proceedings of the 3rd Conference on Applied Natural Language Processing (ANLP-92), 201–208. 7, 8 Nakaiwa, H. & Shirai, S. (1996). Anaphora resolution of Japanese zero pronouns with deictic reference. In Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), 812–817. 7, 8 Ng, V. & Cardie, C. (2002). Identifying anaphoric and non-anaphoric noun phrases to improve coreference resolution. In Proceedings of the 19th International Conference on Computational Linguistics (COLING-02), 1–7. 10, 12 Nomoto, T. & Yoshihiko, N. (1993). Resolving zero anaphora in Japanese. In Proceedings of the 6th Conference of the European Chapter of the Association for Computational Linguistics (EACL-93), 315–321. 7, 8 Okumura, M. & Tamura, K. (1996). Zero pronoun resolution in Japanese discourse based on centering theory. In Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), 871–876. 1, 7 69 References Paice, C.D. & Husk, G.D. (1987). Towards an automatic recognition of anaphoric features in English text: the impersonal pronoun it. Computer Speech and Language, 2, 109–132. 10, 11, 12 Peng, J. & Araki, K. (2007a). Zero anaphora resolution in Chinese and its application in Chinese-English machine translation. In Z. Kedad, N. Lammari, E. Métais, F. Meziane & Y. Rezgui, eds., Natural language processing and information systems. Proceedings of the 12th International Conference on Applications of Natural Language to Information Systems (NLDB-07), 364–375, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 4592. 7 Peng, J. & Araki, K. (2007b). Zero-anaphora resolution in Chinese using maximum entropy. IEICE - Transactions on Information and Systems, E90-D, 1092–1102. 7, 8 Peral, J. (2002). Resolución y generación de la anáfora nominal en español e inglés en un sistema de traducción automática. Procesamiento del lenguaje natural , 28, 127–128. 7, 8 Peral, J. & Ferrández, A. (2000). Generation of Spanish zero-pronouns into English. In D.N. Christodoulakis, ed., Natural Language Processing - NLP 2000. Proceedings of the 2nd International Conference on Natural Language Processing (NLP-2000), 252–260, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 1835. 2, 7, 8 Pintor Garcı́a, M. 
(2007). Análisis factorial de las actitudes personales en educación secundaria. Un estudio empı́rico en la Comunidad de Madrid. psiquiatria.com, 11. 28 Pollard, C. & Sag, I. (1994). Head Driven Phrase Structure Grammar . CSLI Publications, Stanford, CA. 19 Real Academia Española (1977). Esbozo de una nueva gramática de la lengua española. Espasa-Calpe, Madrid. 19 Real Academia Española (2001). Diccionario de la lengua española. Espasa-Calpe, Madrid, 22nd edn. 15, 40, 41 Real Academia Española (2009). Nueva gramática de la lengua española. Espasa-Calpe, Madrid. ix, 6, 15, 16, 17, 18, 19, 22, 23, 24, 25, 33, 34 Recasens, M. & Hovy, E. (2009). A deeper look into features for coreference resolution. In L.D. Sobha, A. Branco & R. Mitkov, eds., Anaphora Processing and Applications. Proceedings of the 7th Discourse Anaphora and Anaphor Resolution Colloquium (DAARC-09), 29–42, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 5847. 2, 6, 11 Rello, L. & Illisei, I. (2009a). A comparative study of Spanish zero pronoun distribution. In Proceedings of the International Symposium on Data and Sense Mining, Machine Translation and Controlled Languages, and their application to emergencies and safety critical domains (ISMTCL-09), 209–214, Presses Universitaires de Franche-Comté, Besançon. 3, 7 Rello, L. & Illisei, I. (2009b). A rule-based approach to the identification of Spanish zero pronouns. In Student Research Workshop. International Conference on Recent Advances in Natural Language Processing (RANLP-09), 209–214. 3, 7, 8, 9, 10, 11, 22, 35 70 References Rello, L., Baeza-Yates, R. & Mitkov, R. (2010a). Improved subject ellipsis detection in Spanish. submitted . 3 Rello, L., Suárez, P. & Mitkov, R. (2010b). A machine learning method for identifying non-referential impersonal sentences and zero pronouns in Spanish. Procesamiento del Lenguaje Natural , 45, 281–287. 3 Ross, J. (1967). Constrains on variables in syntax . Ph.D. thesis, Massachusetts Institute of Technology. 17 Sánchez de las Brozas, F. ([1562] 1976). Minerva. De la propiedad de la lengua latina. Cátedra, Madrid. 14 Sasano, R., Kawahara, D. & Kurohashi, S. (2008). A fully-lexicalized probabilistic model for Japanese zero anaphora resolution. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING-08), 769–776. 7, 8 Seco, M. (1988). Manual de gramática española. Aguilar, Madrid. 19 Seki, K., Fujii, A. & Ishikawa, T. (2002). A probabilistic method for analyzing Japanese anaphora integrating zero pronoun detection and resolution. In Proceedings of the 19th International Conference on Computational Linguistics (COLING-02), 911–917. 7, 8 Sevillano Arroyo, M.A. & Ducret Rossier, F.E. (2008). Las emociones en la psiquiatrı́a. psiquiatria.com, 12. 26 Shopen, T. (1973). Ellipsis as grammatical indeterminacy. Foundations of Language, 10, 65– 77. 15 Steinberger, J., Poesio, M., Kabadjov, M.A. & Jeek, K. (2007). Two uses of anaphora resolution in summarization. Information Processing and Management, 43, 1663–1680. 2, 7 Streb, J., Hennighausen, E. & Rösler, F. (2004). Different anaphoric expressions are investigated by event-related brain potentials. Journal of Psycholinguistic Research, 33, 175– 201. 15 Takada, S. & Doi, N. (1994). Centering in Japanese: a step towards better interpretation of pronouns and zero-pronouns. In Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), 1151–1156. 7, 8 Tanaka, I. (2000). 
Cataphoric personal pronouns in English news reportage. In Proceedings of the 3rd Discourse Anaphora and Anaphor Resolution Colloquium (DAARC-2000), 108–117. 33, 34 Tapanainen, P. (1996). The constraint grammar parser CG-2 . Department of General Linguistics, University of Helsinki, Publications, Vol. 27. 28 Tapanainen, P. & Järvinen, T. (1997). A non-projective dependency parser. In Proceedings of the 5th Conference on Applied Natural Language Processing (ANLP-97), 64–71. 13, 28 Tesnière, L. (1959). Éléments de syntaxe. Klincksieck, Paris. 28 71 References Theune, M., Hielkema, F. & Hendriks, P. (2006). Performing aggregation and ellipsis using discourse structures. In Research on Language & Computation, vol. 4, 353–375, Springer, Berlin, Heidelberg, New York. 7 Wilder, C. (1997). Some properties of ellipsis in coordination. In Studies in universal grammar and typological variation, 59–107, John Benjamins, Amsterdam. 17 Witten, I.H. & Frank, E. (2005). Data mining: practical machine learning tools and techniques. Morgan Kaufmann, London, 2nd edn. 26, 38, 41, 44, 46 Yeh, C. & Chen, Y. (2003a). Using zero anaphora resolution to improve text categorization. In Proceedings of the 17th Pacific Asia Conference on Language, Information and Computation (PACLIC-03), 423–430. 2, 7, 8 Yeh, C. & Chen, Y. (2003b). Zero anaphora resolution in Chinese with partial parsing based on centering theory. In Proceedings of the International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE-03), 683–688. 7, 8 Yeh, C. & Chen, Y. (2007). Topic identification in Chinese based on centering model. Journal of Chinese Language and Computing, 17, 83–96. 2, 7, 8 Yeh, C. & Mellish, C. (1997). An empirical study on the generation of zero anaphors in Chinese. Computational Linguistics, 23, 171–190. 7, 8 Yoshimoto, K. (1988). Identifying zero pronouns in Japanese dialogue. In Proceedings of the 12th International Conference on Computational Linguistics (COLING-88), 779–784. 7, 8 Zhao, S. & Ng, H. (2007). Identification and resolution of Chinese zero pronouns: a machine learning approach. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP/CNLL07), 541–550. 2, 7, 8 72