
Elliphant:
A Machine Learning Method
for Identifying Subject Ellipsis
and Impersonal Constructions
in Spanish
Luz Rello
Main advisor: Ruslan Mitkov
Co-advisor: Xavier Blanco
A thesis submitted for the degree of Erasmus Mundus International Master in
Natural Language Processing and Human Language Technology
Research Group in Computational Linguistics
University of Wolverhampton
June 2010
Laboratori fLexSem
Universitat Autònoma de Barcelona
In memory of Juan Rello
“And then again,” Grandpa Joe went on speaking very slowly
now so that Charlie wouldn’t miss a word, “Mr Willy Wonka
can make marshmallows that taste of violets, and rich caramels
that change colour every ten seconds as you suck them, and little
feathery sweets that melt away deliciously the moment you put
them between your lips. He can make chewing-gum that never
loses its taste, and sugar balloons that you can blow up to enormous sizes before you pop them with a pin and gobble them up.
And, by a most secret method, he can make lovely blue birds’
eggs with black spots on them, and when you put one of these in
your mouth, it gradually gets smaller and smaller until suddenly
there is nothing left except a tiny little pink sugary baby bird
sitting on the tip of your tongue.”
Charlie and the Chocolate Factory, Roald Dahl
Abstract
This thesis presents Elliphant, a machine learning system for classifying
Spanish subject ellipsis as either referential or non-referential. Linguistically motivated features are incorporated in a system which performs a
ternary classification: verbs with explicit subjects, verbs with omitted but
referential subjects (zero pronouns), and verbs with no subject (impersonal
constructions). To the best of our knowledge, this is the first attempt to
automatically identify non-referential ellipsis in Spanish. In order to enable a memory-based strategy, the eszic Corpus was created and manually
annotated. The corpus is composed of Spanish legal and health texts and
contains more than 6,800 annotated instances. A set of 14 features was
defined and a separate training file was created, containing the instances
represented as vectors of feature values. The training data was used with
the Weka package and a set of optimization experiments was carried out
to determine the best machine learning algorithm to use, the optimal parameter settings, the most effective combinations of features, the optimal number
of instances needed to train the classifier, and the optimal settings for classifying instances occurring in different genres. A comparative evaluation
of Elliphant with Connexor’s Machinese Syntax parser shows the superiority of our system. The overall accuracy of the system is 86.9%. Due to
the fairly frequent elision of subjects in Spanish, this system is useful, as the classification of elliptic subjects as referential or non-referential can improve the accuracy of Natural Language Processing applications in which zero anaphora resolution is necessary, inter alia, for information extraction, machine translation,
automatic summarization and text categorization.
Acknowledgements
First, my sincere acknowledgements to Prof. Ruslan Mitkov for providing
everything that can be asked of a supervisor: constant trust, support and
encouragement from the very beginning until the end of this thesis.
There are three other persons without whom this work would not have
been possible (alphabetically): Thank you, Ricardo Baeza-Yates, for your
brilliant ideas; thank you, Richard Evans, for your guidance; and thank
you, Pablo Suárez, for helping the project to become a reality.
I would like to acknowledge the Computational Linguistics Group at the
University of Wolverhampton, where my collaboration during the first year brought its first results; special thanks to Iustina Ilisei and Naveed Afzal.
Thank you for the assistance received at the Universitat Autònoma de Barcelona from my co-advisor Xavier Blanco and from José María Brucart and Joaquim
Llisterri.
I am indebted to the Grupo de Investigación en Tratamiento Automático
del Lenguaje Natural of Universitat Pompeu Fabra for their support and
feedback during this last semester, particularly to Gabriela Ferraro and Leo
Wanner.
Finally, thank you to Igor Mel’čuk and Ignacio Bosque for resolving my doubts
and to Sang Yoon Kim and Ana Suárez Fernández for their help throughout
the annotation process.
These master’s studies were supported by a “La Caixa” grant (Becas de “La Caixa” para estudios de máster en España. Convocatoria 2008).
Contents
1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Results
1.4 Thesis Outline
2 Related Work
2.1 NLP Approaches
2.1.1 NLP Approaches to Zero Pronouns
2.1.2 NLP Approaches to Identifying Non-referential Constructions
2.2 Linguistic Approaches
2.2.1 Linguistic Approaches to Subject Ellipsis
2.2.2 Linguistic Approaches to Non-referential Ellipsis
3 Detecting Ellipsis in Spanish
3.1 Classification
3.1.1 Explicit Subjects: Non-elliptic and Referential
3.1.2 Zero Pronouns: Elliptic and Referential
3.1.3 Impersonal Constructions: Elliptic and Non-referential
3.2 Machine Learning Approach
3.2.1 Building the Training Data
3.2.2 Annotation Software and Annotation Guidelines
3.2.3 Features
3.2.4 Purpose Built Tools
3.2.5 The WEKA Package
4 Evaluation
4.1 Experiments
4.1.1 Method Selected: K* Algorithm
4.1.2 Learning Curve
4.1.3 Most Effective Features
4.1.4 Genre Analysis
4.2 Comparative Evaluation
5 Conclusions and Future Work
5.1 Main Observations
5.2 Future Research
References
List of Figures
2.1 Types of subject ellipsis (Brucart, 1999) and types of verbs (Real Academia Española, 2009).
3.1 An example of the output of Connexor’s Machinese Syntax parser for Spanish.
3.2 Screenshot of the annotation program interface.
3.3 An example of the Weka Explorer interface.
4.1 eszic training data learning curve for accuracy.
4.2 eszic training data learning curve for precision, recall and f-measure.
4.3 Learning curve for accuracy, recall and f-measure of the classes.
4.4 Learning curve for accuracy, recall and f-measure in relation to the number of instances of each class.
List of Tables
3.1 eszic Corpus: tokens, sentences and clauses.
3.2 eszic Corpus: number of instances per class.
3.3 eszic Corpus annotation tags.
3.4 Features: definitions and values.
4.1 Weka classifiers accuracy (20% of the eszic training set).
4.2 eszic training data evaluation with K* -B 40 -M a.
4.3 Leave-one-out and ten-fold cross-validation comparison.
4.4 Selected features by Weka Attribute Selection methods.
4.5 Classification using the selected feature groups: accuracy.
4.6 Extrinsic parser features classification results.
4.7 Intrinsic parser features classification results.
4.8 Single feature omission classifications: accuracy.
4.9 Legal and health genres comparative evaluation.
4.10 Cross-genre training and testing evaluation.
4.11 Elliphant eszic training data results.
4.12 Machinese eszic training data results.
4.13 Elliphant Legal eszic training results.
4.14 Machinese Legal eszic training results.
4.15 Elliphant Health eszic training data results.
4.16 Machinese Health eszic training data results.
Chapter 1
Introduction
This introduction explains the three primary motivations for this research (Section 1.1) and its objectives (Section 1.2), and briefly describes its outcomes. These
outcomes include the results of an evaluation of the implemented system and publications produced over the course of the study (see Section 1.3). The overall structure of
the thesis is also presented in Section 1.4.
1.1 Motivation
There are three reasons motivating the decision to choose this research topic and develop
a tool, Elliphant, to perform the identification of zero pronouns (referential elliptic
subjects) and impersonal constructions (non-referential, with a non-existing elliptic subject)
in Spanish.
The three justifications for this work are: (1) the highly frequent occurrence of zero
pronouns in Spanish; (2) identification of zero pronouns is a prerequisite for anaphora
resolution in Spanish and also for other Natural Language Processing (nlp) applications; and (3) this challenge had not yet been fully addressed in the field. The system
presented in this dissertation represents the first attempt to automatically identify
non-referential ellipsis in Spanish.
Since Spanish is a pro-drop language (Chomsky, 1981), subject ellipsis is a recurring phenomenon. It was noted that 26% of the 6,878 cases annotated in the corpus
exploited in this work have an elliptic subject, while only 3% of them occur in impersonal constructions. The topic of subject ellipsis has been addressed in previous work
on other pro-drop languages such as Japanese (Okumura & Tamura, 1996), Chinese
(Zhao & Ng, 2007), Korean (Lee & Byron, 2004) and Russian (Kibrik, 2004). The
related topic of the identification of non-referential pronouns has been addressed in
non-pro-drop languages such as English (Evans, 2001) and French (Danlos, 2005).
The identification of zero pronouns and non-referential impersonal constructions is
necessary for anaphora resolution, since the resolution of zero pronouns (zero anaphora)
implies that they need to be identified. The identification of zero pronouns first requires
that they can be distinguished from non-referential constructions (Mitkov, 2010).
Coreference and anaphora resolution, and in particular zero anaphora resolution,
has been found to be crucial in a number of nlp applications. These include, but
are not limited to, information extraction (Chinchor & Hirschman, 1997), machine
translation (Peral & Ferrández, 2000), automatic summarisation (Steinberger et al.,
2007), text categorisation (Yeh & Chen, 2003a), topic recognition (Yeh & Chen, 2007),
salience identification (Iida et al., 2009) and word sense disambiguation (Kawahara &
Kurohashi, 2004). Moreover, there is additional research showing that zero pronoun
identification is useful for further developments in centering theory (Matsui, 1999), for named entity recognition (Hirano et al., 2007), for the investigation of convergence universals in translation (Corpas Pastor et al., 2008) and for discriminating
predicate-argument structure (Imamura et al., 2009).
Finally, the difficulty in detecting non-referential pronouns has been acknowledged
since computational resolution of anaphora was first attempted (Bergsma et al., 2008)
and this task is currently needed in nlp for Spanish. The need for automatic tools
able to detect ellipticals has been stated by Recasens & Hovy (2009) who note that
their application would improve existing methods for zero anaphora resolution in Spanish (Ferrández & Peral, 2000). One particular contribution of the current research is
the recognition of Spanish impersonal constructions which, following from the literature review presented in Chapter 2, appears not to have been addressed before in the
literature.
1.2 Objectives
The goal of the fully automatic method presented in this dissertation (Elliphant) is
to identify zero pronouns (referential elliptic subjects) and impersonal constructions
(non-referential elliptic subjects) in Spanish. In order to accomplish this objective, it is
also necessary to identify the elements that occur in complementary distribution with them in the subject position. For this reason, the identification of explicit subjects was also carried out, using a learning-based method which led to a ternary classification covering all the elements (elliptic and explicit, referential and non-referential) of the subject position
in the clause. These three classes are explicit subjects, zero pronouns and impersonal
constructions.
1.3 Results
The results obtained by the Elliphant system and the level of performance that it
reaches are encouraging since this tool not only identifies zero pronouns and impersonal constructions but also outperforms a dependency parser (Connexor’s Machinese
Syntax) in identifying explicit subjects as well as elliptic subjects. A series of experiments undertaken with the algorithm has enabled discovery of the most effective
features for use in the classification tasks. The performance results obtained for the
identification of impersonal constructions are, according to the survey of previous work
carried out in Chapter 2, the first presented for this task in the literature.
The classification results obtained by the algorithm were presented in Rello et al.
(2010b). That paper, however, did not include the further investigation into the efficacy of the features used which is presented in Rello et al. (2010a).
With regard to the attempt to achieve improved performance from the Elliphant
system, two previous studies have contributed to its design: one concerning the distribution of zero pronouns (Rello & Illisei, 2009a) and the other presenting a rule-based
method for their identification (Rello & Illisei, 2009b). It should be noted however
that despite their contribution, Elliphant differs considerably from these initial studies
in terms of methodology (corpus used, linguistic criteria exploited, and the overall approach) and the classification task itself (classes to be identified). Overall, the Elliphant
system represents a considerable advancement on those works.
1.4 Thesis Outline
The remainder of this thesis is structured in four Chapters. Chapter 2 provides a literature review of nlp approaches (see Section 2.1) to zero pronouns (Section 2.1.1)
and identification of non-referential expressions (Section 2.1.2). The review also covers
work in the field of Linguistics, including approaches to referential and non-referential
subject ellipsis (Section 2.2.1 and 2.2.2). Chapter 3 describes the methodology embodied by the Elliphant system. Firstly, the classification task (see Section 3.1) and an
explanation of each of the classes is presented: explicit subjects (Section 3.1.1), zero
pronouns (Section 3.1.2) and impersonal constructions (Section 3.1.3). Secondly, the
machine learning method (see Section 3.2) is described, beginning with the compilation
of the corpus (Section 3.2.1), the guidelines established and the software developed to
facilitate annotation of the corpus by human annotators (Section 3.2.2), a description
of the features (see Section 3.2.3) derived from the corpus and the purpose built tools
(Section 3.2.4) implemented to generate the training data exploited by the machine
learning package, Weka (Section 3.2.5). Elliphant is evaluated in Chapter 4. A set of experiments (Section 4.1) was carried out to determine the method and parameter values which work best for these classification tasks (Section 4.1.1), the learning curves of the classifier (Section 4.1.2) and the most effective groups of features (Section 4.1.3). A comparative
evaluation of the Elliphant system with an existing parser is presented in Section 4.2.
Finally, in Chapter 5, conclusions are drawn and plans for future work are considered.
Chapter 2
Related Work
Both the nlp and linguistics literature address referential and non-referential subject
ellipsis. Although the nlp literature is directly related to this dissertation in terms of
objectives and methodology, more general literature in linguistics contributes various
means by which classes of subject ellipsis and annotation criteria can be established.
Related work in nlp (see Section 2.1) on this topic can be classified as (a) literature related to zero pronouns (Section 2.1.1), which is mainly concerned with their
identification, resolution and generation, and (b) literature related to the identification
of non-referential constructions (Section 2.1.2).
The literature in linguistics (Section 2.2) concerning different types of ellipsis, in
which both zero pronouns (See Section 2.2.1) and non-referential constructions (See
Section 2.2.2) are included, is focused on the definition, delimitation and description of
their use in language.
2.1 NLP Approaches
The nlp literature on this topic broadly concerns two topics, namely zero pronouns
(Section 2.1.1) and non-referential constructions (Section 2.1.2). The number and variety of studies of the first group is considerably larger than that of the second.
Both topics are mainly related to coreference and anaphora resolution systems, as the resolution of zero pronouns (zero anaphora) implies their prior identification, which in turn requires that zero pronouns be distinguished from non-referential constructions (Mitkov, 2010).
During this literature review, no specific studies on the identification of non-referential constructions in Spanish were found, although this has been indicated
to be a necessary task (Ferrández & Peral, 2000; Recasens & Hovy, 2009) in anaphora
and coreference resolution. For this reason it is expected that the method presented in
this dissertation will complement current Spanish pronoun resolution systems.
2.1.1 NLP Approaches to Zero Pronouns
A zero pronoun is the resultant “gap” (zero anaphor) where zero anaphora or ellipsis
occurs, when an anaphoric pronoun is omitted but is nevertheless understood (Mitkov,
2002). In linguistics, zero pronouns are also referred to as null subjects, empty subjects,
elliptic subjects, elided subjects, tacit subjects, understood subjects and non-explicit subjects, among others. In the nlp literature such omitted subjects are broadly denoted as
zero pronouns. Some linguistic studies also make use of the term “zero pronoun” which
is not equivalent to the computational concept. The Meaning-Text Theory (mtt) considers a zero pronoun in subject position to be a non-argumental impersonal subject
(Mel’čuk, 2006):
Llueve.
(It) is raining.
while in Generative Grammar, following the Zero Hypothesis (Kratzer, 1998), a zero
pronoun can have phonetic content (full pronoun) or not (null pronoun). In this theory,
the concept of zero pronoun has to do only with its lack of lexical content in contrast
to lexical pronouns (Alonso-Ovalle & D’Introno, 2000). In this work a zero pronoun (Mitkov, 2002) corresponds to an omitted subject (Real Academia Española, 2009)
in Spanish.
Zero pronouns become crucial when processing any pro-drop language (Chomsky,
1981) –also known as null subject languages– since zero anaphora is fairly frequent in
such languages. By way of example, of the 6,827 annotated cases in our corpus, 26%
of them have an omitted subject.
The current literature review indicates that the following pro-drop languages are
the ones on which related work on zero pronoun processing has been carried out:
– Japanese (Hirano et al., 2007; Iida et al., 2006, 2009; Imamura et al., 2009; Isozaki
& Hirao, 2003; Kawahara & Kurohashi, 2004; Matsui, 1999; Mori & Nakagawa,
1996; Murata et al., 1999; Nakagawa, 1992; Nakaiwa, 1997; Nakaiwa & Ikehara,
1992; Nakaiwa & Shirai, 1996; Nomoto & Yoshihiko, 1993; Okumura & Tamura,
1996; Sasano et al., 2008; Seki et al., 2002; Takada & Doi, 1994; Yoshimoto, 1988);
– Chinese (Hu, 2008; Peng & Araki, 2007a,b; Yeh & Chen, 2003a,b, 2007; Yeh &
Mellish, 1997; Zhao & Ng, 2007);
– Korean (Han, 2004; Lee & Byron, 2004; Lee et al., 2005);
– Spanish (Barreras, 1993; Corpas Pastor, 2008; Corpas Pastor et al., 2008; Ferrández
& Peral, 2000; Peral, 2002; Peral & Ferrández, 2000; Rello & Illisei, 2009a,b); and
– Russian (Kibrik, 2004).
These studies of zero pronouns address a variety of topics. Depending on their goal,
the literature on zero pronouns can be divided into the following classes:
– Zero pronoun classification or annotation: (Han, 2004; Kibrik, 2004; Lee &
Byron, 2004; Lee et al., 2005; Rello & Illisei, 2009a);
– Zero pronoun identification (Corpas Pastor, 2008; Corpas Pastor et al., 2008;
Nakaiwa, 1997; Rello & Illisei, 2009b; Yoshimoto, 1988);
– Resolution of zero pronouns, including their prior identification (Barreras,
1993; Ferrández & Peral, 2000; Hu, 2008; Isozaki & Hirao, 2003; Kawahara &
Kurohashi, 2004; Murata et al., 1999; Nakaiwa & Shirai, 1996; Nomoto & Yoshihiko, 1993; Okumura & Tamura, 1996; Peng & Araki, 2007b; Sasano et al., 2008;
Seki et al., 2002; Yeh & Chen, 2003b; Zhao & Ng, 2007); and
– Zero pronoun generation (Peral, 2002; Peral & Ferrández, 2000; Theune et al.,
2006; Yeh & Mellish, 1997);
Other nlp applications where zero pronouns are taken into consideration are: machine translation (Nakaiwa & Ikehara, 1992; Nakaiwa & Shirai, 1996; Peng & Araki,
2007a; Peral, 2002; Peral & Ferrández, 2000); named entity recognition (Hirano et al.,
2007); summarisation (Steinberger et al., 2007); text categorisation (Yeh & Chen,
2003a); topic identification (Yeh & Chen, 2007) and identifying salience in text (Iida
et al., 2009); and word sense disambiguation (Kawahara & Kurohashi, 2004).
Further research topics where zero pronoun identification is useful are: predicate-argument structure discrimination (Imamura et al., 2009); further developments in centering theory (Matsui, 1999), such as improved interpretation of zero pronouns (Takada & Doi, 1994); and the investigation of convergence universals in translation
(Corpas Pastor, 2008; Corpas Pastor et al., 2008).
Other studies address specific cases of zero pronouns, such as those whose referents take the semantic role of experiencer (Nakagawa, 1992), zero pronouns in relation to conditional constructions (Mori & Nakagawa, 1996), or descriptions of the syntactic patterns in which zero pronouns are used (Iida et al., 2006), among others.
In terms of methodology, rule-based, machine learning, and a variety of other approaches have been taken toward zero pronoun identification and resolution:
– Rule-based approaches (Barreras, 1993; Corpas Pastor et al., 2008; Ferrández
& Peral, 2000; Hu, 2008; Kawahara & Kurohashi, 2004; Kibrik, 2004; Matsui,
1999; Mori & Nakagawa, 1996; Murata et al., 1999; Nakagawa, 1992; Nakaiwa &
Ikehara, 1992; Nakaiwa & Shirai, 1996; Nomoto & Yoshihiko, 1993; Peral, 2002;
Peral & Ferrández, 2000; Rello & Illisei, 2009b; Yeh & Chen, 2003a,b, 2007; Yeh
& Mellish, 1997; Yoshimoto, 1988);
– Machine learning approaches (Hirano et al., 2007; Iida et al., 2006, 2009; Kawahara & Kurohashi, 2004; Peng & Araki, 2007b; Zhao & Ng, 2007);
– Hybrid methods combining rules and learning algorithms (Isozaki & Hirao, 2003);
– Probabilistic models (Sasano et al., 2008; Seki et al., 2002); and
– other techniques such as the exploitation of parallel corpora (Nakaiwa, 1997).
Although it is clear that machine learning methods perform better than other approaches when identifying non-referential expressions (Boyd et al., 2005), there is some
debate about which approach brings optimal performance when applied in anaphora
resolution systems (Mitkov, 2002).
In Spanish, the most influential work on this topic is the Ferrández and Peral
algorithm for zero pronoun resolution (Ferrández & Peral, 2000) together with their
previous related work (Ferrández et al., 1998, 1999). Their implementation of a zero
pronoun identification and resolution module forms part of a system known as the Slot
Unification Parser for Anaphora resolution (supar) (Ferrández et al., 1999).
Although substantially related, the work described in this dissertation differs both
in form and in aim from this previous research for Spanish (Ferrández & Peral, 2000).
Firstly, their definition of zero pronouns is broader since it is suited to a different
purpose: the zero class includes not only those zero signs whose referent lies in previous
clauses (anaphoric, according to their classification) and those that lie outside the text
(exophoric), but also those that occur after the verb (cataphoric). Here, it is considered
that those subjects that are within the clause, irrespective of whether they appear before
or after the verb, belong to the explicit subject class.
Secondly, Ferrández & Peral (2000) take a rule-based approach while the system described in this dissertation performs the classification using an instance-based learner.
Additionally, their rules are based on partial parsing, while some of the features exploited by the Elliphant system make use of information obtained from an analysis
of our corpus by a deep dependency parser. Ferrández & Peral (2000) tested their
approach to zero pronoun identification and resolution using 1,599 cases, while the machine learning approach presented in this dissertation was tested on a corpus containing
6,827 classified verbal instances.
Finally, they do not provide a method for the identification of non-referential zero
pronouns. They also make no overt mention of automatic classification of zero pronouns
of the anaphoric or cataphoric kind (Ferrández & Peral, 2000).
Despite the similarities of Ferrández & Peral’s (2000) work to the approach described in this dissertation, the fact that they adopt a different definition of zero pronouns
means that a comparison with the method described in the current work is not feasible
(Section 4.2).
In order to improve on previous work by the current author (Rello & Illisei, 2009b),
this study differs from it in the design of the classification and the methodology. In
Rello & Illisei (2009b) a binary classification as either elliptic-subject or non-elliptic
subject was made as a result of the implementation of a rule-based method which
applies only to zero pronouns, whilst in the present study a ternary classification is
presented which covers all the possible instances of subject position in Spanish. Moreover, while zero pronouns were annotated in Rello & Illisei (2009b), in the present
study the zero pronouns themselves were left unmarked. Instead, the main verb of
each clause is annotated and classified into one of three types. The baseline rule-based
algorithm described in Rello & Illisei (2009b) was based on the zero pronoun identification methodology developed in Corpas Pastor et al. (2008) which treats every clause
which does not have an explicit subject as containing a zero pronoun.
2.1.2 NLP Approaches to Identifying Non-referential Constructions
The identification of non-referential pronouns1 is a crucial step in coreference (Boyd
et al., 2005; Mitkov, 2010) and anaphora resolution systems (Mitkov, 2001, 2002). In
comparison to the work addressing zero pronouns, previous research on this topic is
fairly limited, and, as implied by this survey of related work, the approach described in
this dissertation is the first attempt to automatically identify impersonal constructions
in Spanish.
1 In previous work these pronouns have also been referred to as pleonastic, expletive, non-anaphoric, and non-referential pronouns.
The literature describing approaches to the identification of non-referential expressions is focused on:
– Identification of pleonastic it in English (Denber, 1998; Lappin & Leass, 1994;
Paice & Husk, 1987). Work by Evans (2000, 2001) is exploited by an anaphora
resolution system in Mitkov et al. (2002). Also (Bergsma et al., 2008; Boyd et al.,
2005; Clemente et al., 2004; Gundel et al., 2005; Lambrecht, 2001; Li et al., 2009;
Müller, 2006; Ng & Cardie, 2002); and
– Identification of expletive pronouns in French (Danlos, 2005).
Nevertheless, in those languages where approaches to the identification of non-referential expressions have been implemented, there is actually an explicit word with some grammatical information (a third person pronoun) in the text, which is non-referential (Mitkov, 2010). By contrast, in Spanish, non-referential expressions are not
realised by expletive or pleonastic pronouns but by a certain kind of ellipsis. For this
reason, it is easy to wrongly identify them as zero pronouns, which are referential. For
example, pleonastic pronouns such as:
(a.1) (It)1 must be stated that Oskar behaved impeccably.
(b.1) (It) rains, (Il) pleut, (Es) regnet.
(c.1) (It)’s three o’clock.
are all elided in Spanish, resulting in the following non-referential impersonal constructions:
(a.2) Se dice que Oscar se comportó impecablemente.
(b.2) Llueve.
(c.2) Son las tres en punto.
1 In this work explicit subjects in the examples are presented in italics, zero pronouns in the examples are represented by the symbol Ø, while in the English translations the subjects which are elided in Spanish are marked with parentheses. Impersonal constructions in the examples are not explicitly indicated using a symbol (see Section 3.1).
A sizable proportion of the false positives obtained in previous work on identifying
zero pronouns were caused by such non-referential impersonal constructions (Rello &
Illisei, 2009b). Ferrández & Peral (2000) noted that an inability to identify verbs used
in impersonal constructions has a negative effect on the performance of their anaphora
resolution algorithm2, while in Recasens & Hovy (2009, p. 41) the need for a tool to
identify ellipsis is observed:
“In contrast with previous work, many of the features relied on gold standard
annotations, pointing out the need for automatic tools for ellipticals detection and
deep parsing.”
2 The other two reasons given for the low success rate in the identification of verbs with no subject are the lack of semantic information and the inaccuracy of the grammar used (Ferrández & Peral, 2000).
Four approaches have been implemented to identify non-referential expressions and
described in the literature:
– Rule-based approaches (Danlos, 2005; Denber, 1998; Lappin & Leass, 1994; Paice
& Husk, 1987);
– Machine learning approaches (Bergsma et al., 2008; Boyd et al., 2005; Clemente
et al., 2004; Evans, 2000, 2001; Mitkov et al., 2002; Müller, 2006; Ng & Cardie,
2002);
– Web based approach (Li et al., 2009); and
– Descriptive studies from contextual (Lambrecht, 2001) and intonational points of
view (Gundel et al., 2005).
Paice & Husk (1987) introduce a rule-based method for identifying non-referential
it while Lappin & Leass (1994) and Denber (1998) describe rule-based components of
their pronoun resolution systems which detect non-referential uses of it. Mitkov’s first
anaphora resolution algorithm did not incorporate an approach for detecting pleonastic
it (Mitkov, 1998), while more recent versions of mars (Mitkov’s Anaphora Resolution System) use the machine learning system of Evans (2001) to detect pleonastic it (Mitkov et al., 2002).
Instance-based learning approaches are used for identifying pleonastic it in English,
while the only approach for the identification of expletive pronouns in French employs
a rule-based methodology (Danlos, 2005).
Evans (2001)1 describes the first attempt using a machine learning method to classify pleonastic it into seven types while Boyd et al. (2005) present a linguistically
motivated classification of non-referential it into four types.
A comparison replicating the approaches developed by Paice & Husk (1987) and
Evans (2001) with the system implemented by Boyd et al. (2005) corroborates the
finding that machine learning outperforms rule-based approaches (Boyd et al., 2005).
Further, it is pointed out that rule-based methods are limited due to their reliance on
lists of verbs and adjectives commonly used in the patterns that they exploit, which
can make them less portable and more difficult to adapt to new texts. Nevertheless, the
basic grammatical patterns are still reasonably consistent indicators of non-referential
occurrences of it (Boyd et al., 2005).
Certain aspects of the work described in this dissertation were inspired by the
methodology of the machine learning approaches for the identification of pleonastic it
specifically by Evans (2001) and Boyd et al. (2005).
1 This method is currently incorporated as a component of mars (Mitkov et al., 2002).
Due to the fact that the occurrence of non-referential zero pronouns is not very
common1, the size of our corpus was increased in order to achieve a sufficient number of instances for each class. The training data exploited by the Elliphant system
contains 6,827 instances of which 179 are non-referential examples. In Evans (2001)
3,171 instances of it were classified into seven classes while in Boyd et al. (2005) 2,337
examples were classified into four classes.
Our corpus was analyzed, as in the approach described by Evans (2001), using a
functional dependency parser, Connexor’s Machinese Syntax2 (Connexor Oy, 2006b;
Tapanainen & Järvinen, 1997). Moreover, some of the features used in the Elliphant
system, such as the consideration of the lemmas and the parts of speech (POS) of the
preceding and following material, were also implemented in the approach of Evans (2001).
1 Only 3% of the verbs found in our corpus (see Section 3.2.1) have non-referential elliptic subjects.
2 http://www.connexor.eu/technology/machinese/demo/syntax/.
In contrast to previous work, the K* algorithm (Cleary & Trigg, 1995) was found
to provide the most accurate classification in the current study. Other approaches have
employed various classification algorithms, including K-nearest neighbors in TiMBL
(Boyd et al., 2005; Evans, 2001) and JRip in Weka (Müller, 2006).
2.2 Linguistic Approaches
Literature related to ellipsis in linguistic theory has served as one basis for establishing
the linguistically motivated classes and the annotation criteria in the current work. The
linguistically related work on this topic is focused on the definition and description of
the use of ellipsis in natural language and the limits of that use.
In Spanish, the use of ellipsis is very widespread. It is a phenomenon that occurs
in a wide range of contexts and is therefore much discussed in the field of linguistics.
To illustrate, some controversial topics in linguistics that pertain to instances of ellipsis
found in our corpus include: the establishment of different types of ellipsis, the identification of impersonal sentences (non-referential expressions), the definition of particular
syntactic categories which can function as subjects, and the intricate differentiation of
reflex passive with elliptic subject from impersonal sentences in different varieties of
Spanish.
The concepts used in both types of literature (nlp and linguistic) to distinguish
different types of ellipsis and zero signs are extremely broad and are well debated in
the linguistic literature. Elements of the elliptic typology used in this work which were derived from the literature are stated next, while the linguistic and formal criteria used to identify the chosen classes, which served as the basis for the corpus annotation (including a typology of the examples found), are explained in Sections 3.1.1, 3.1.2, 3.1.3
and 3.2.2.
2.2.1 Linguistic Approaches to Subject Ellipsis
The study of the omission of some element from the sentence or the discourse in natural
language has been a challenge not only in computing but also in Spanish linguistics
itself –from the Renaissance period through to the present day.
The first occidental grammarian who treated ellipsis as a grammatical phenomenon
(Hernández Terrés, 1984) was Francisco Sánchez de las Brozas, El Brocense (1523-1600) (Sánchez de las Brozas, [1562] 1976, p. 317), who took the concept of ellipsis from Apolonio Díscolo (Díscolo, [2nd century] 1987) and defined it as:
“La elipsis es la falta de una palabra o de varias en una construcción correcta [...].”
“Ellipsis is the omission of one or more items from a correct construction [...].”
This conception, in which grammar serves as a basis for a rational explanation of the
surface form of the language:
“No hay, pues, ninguna duda de que se debe buscar la explicación racional de
las cosas, también de las palabras.” Sánchez de las Brozas ([1562] 1976) cited in García Jurado (2007, p. 12)
“There is no doubt, then, that a rational explanation of things, and also of words, must be sought.”
later inspired the rational grammar of Port-Royal (Lancelot & Arnauld, [1660] 1980)
which was a precursor of Chomsky’s work (Chomsky, [1968] 2006, p. 5):
“One, particularly crucial in the present context, is the very great interest in the
potentialities and capacities of automata, a problem that intrigued the seventeenth-century mind as fully as it does our own. [...] A similar realisation lies at the base
of Cartesian philosophy.”
In order to elide something, a meaning which is not expressed needs to be assumed. It thus follows that ellipsis itself was one of the basic mechanisms used to explain the transition from D-structure to S-structure, becoming a central issue (Brucart, 1987) in generative grammar from its original model, the Standard Theory (Chomsky, 1965), to its latest
revisions (Chomsky, 1995).
Different branches of linguistics have considered ellipsis from different points of
view:
– Semantic: traditionally, the criteria used to define ellipsis were semantic or logical
(Bello, [1847] 1981) and prescriptive (Real Academia Española, 2001);
– Descriptive and explicative: (Brucart, 1999);
– Distributional: although structuralism rejected the study of units which were not
codified in the signifier or phonetic realization, some classifications of ellipsis were
presented (Francis, 1958; Fries, 1940);
– Pragmatic: in diverse pragmatic paradigms the role of ellipsis is crucial as it
influences the interpretation of text. As a result it has given rise to several lines
of investigation such as implications through ellipsis (Grice, 1975), ellipsis studied
as a factor to activate textual coherence (Halliday & Hasan, 1976), or indefinite
ellipsis in which a word can stand for one or more sentences in a restrictive code
(Shopen, 1973); and
– Cognitive: in terms of ellipsis processing by the brain (Streb et al., 2004, p. 175):
“Ellipses and pronouns/proper names are processed by distinct mechanisms
being implemented in distinct cortical cell assemblies.”
or as part of the explanation of the language faculty (Chomsky, 1965).
The terminology and linguistic explanations relevant for this work consider both zero
pronouns and non-referential expressions to be different types of ellipsis (Brucart, 1999).
Four kinds of Spanish subject ellipsis are distinguished (Brucart, 1999, p. 2851).
This classification is presented in correlation with a verb classification (Real Academia
Española, 2009), which is related to the omitted subject classification presented in
Bosque (1989).
The classification of Spanish omitted subjects presented in Bosque (1989) is: omitted subjects of finite verbs, which can be referential or non-referential, and omitted subjects of non-finite verbs, which can be argumental or non-argumental. The argumental omitted subjects can in turn be referential or non-referential. In that study non-argumental omitted subjects are claimed not to exist (Bosque, 1989), although in
Brucart (1999), non-argumental omitted subjects are considered a type of ellipsis (Type
4 in Figure 2.1).
(1) Omitted subject in a clause containing a finite verb:
– Verb with argumental omitted subject with a specific interpretation:
Ø No vendrán. / They won’t come.
– Verb with argumental omitted subject with an unspecific interpretation:
Ø Dicen que vendrá. / They say he will come. / It is said he will come.
(2) Argumental impersonal subject: verb with argumental omitted subject which is represented by the pronoun se:
En este estudio Ø se trabaja bien. / In this room one can work properly.
(3) Non-argumental impersonal subject: verb with no argumental subject:
Ø Nieva. / It is snowing.
(4) Omitted subject in a non-finite verb clause:
Juan intentaba (Ø decírselo a María). / John tried (John to tell Mary).
Figure 2.1: Types of subject ellipsis (Brucart, 1999) and types of verbs (Real Academia Española, 2009).
The first type of ellipsis (see (1) in Figure 2.1) represents omitted subjects and corresponds to zero pronouns in the nlp literature. An omitted subject is the result of nominal ellipsis where a non-phonetically/orthographically realized lexical element –the omitted subject– which is needed for the interpretation of the meaning and the structure of the sentence, is omitted since it can be retrieved from its context (Brucart, 1999). Despite their lack of phonetic realization, omitted subjects are part of the clause (Real Academia Española, 2009).
Two types of syntactic ellipsis or lexical-syntactic ellipsis can be distinguished:
verbal ellipsis and nominal ellipsis. These types of subject ellipsis can affect the whole
argument of the verb or be partial and just affect the head of the argument (Brucart,
1999). As detailed in Section 3.2.2, the annotation of our corpus includes both complete
noun phrase ellipsis and noun phrase head ellipsis. Note that nominal ellipsis not
only affects the subjects but also the other arguments of the verb –datives, direct
objects or infinitive objects– although their ellipsis is held to more restricted conditions
(Brucart, 1999). However, this fact is not acknowledged in some prior approaches in
nlp (Ferrández & Peral, 2000, p. 166):
“While in other languages, zero-pronouns may appear in either the subject’s or
the object’s grammatical position, (e.g. Japanese), in Spanish texts, zero-pronouns
only appear in the position of the subject.”
The interpretation of Type 1 ellipsis can be definite and specific (Brucart, 1999)
or indefinite (Real Academia Española, 2009). Since omitted subjects are referential,
they can be lexically retrieved (Gómez Torrego, 1992). An example of an omitted subject
could be:
(d) Las leyes no tendrán efecto retroactivo si Ø no dispusieren lo contrario.
The laws will not have a retroactive effect unless (they) specify otherwise.
The nature of the omitted subject [Ø] itself has been discussed in the linguistic literature
(Real Academia Española, 2009). While recent approaches in linguistics agree that the
omitted subject has a pronominal nature (elided pronoun), others contend that the
subject is expressed in the morphology of the verb inflection.
In Generative Grammar subject ellipsis has been understood as a (1) pro-form
(Beavers & Sag, 2004; Chung et al., 1995; Fiengo & May, 1994; Wilder, 1997) or as (2)
a syntactic realization without a phonetic constituent (Merchant, 2001; Ross, 1967).
The Meaning-Text Theory (mtt) contends that ellipsis occurs in the SSyntS (surface syntax) when the elliptic element is deleted during the transition from SSyntS to
DMorphS (deep morphology) (or vice versa) and an empty node stands in for the representation of the elliptic element. This procedure for treating ellipses is also proposed
in the MTT for the description of all coordinate structures (Mel’čuk, 2003).
The identification of omitted subjects is not problematic when the zero pronoun
belongs to the first or second person but when it is a third person omitted subject, the
reference can be anaphoric or cataphoric (Type 1 ellipsis in Figure 2.1) or non-specific1.
A generic or non-specific interpretation can follow in some clauses with singular second person and plural third person zero pronouns (Real Academia Española, 2009).
However, depending on discourse knowledge, there can be alternation between specific and non-specific interpretations in clauses which are formally identical, as the next example
shows:
(e) Ø Me han regalado un reloj. (In this example both interpretations, specific and nonspecific, are possible.)
(1) (They) gave me a watch. (When the agent referred to by “they” has been mentioned
previously in the discourse.)
(2) (I) was given a watch. (When no agent has been mentioned previously in the discourse.)
where the non-specific interpretation does not exclude a possible specific one (Real
Academia Española, 2009). Therefore, both groups of argumental subjects with specific
and non-specific interpretations are included in the same class.
1 In journalistic headlines with an omitted subject, a non-specific interpretation can occur (Bosque, 1989) even in non-pro-drop languages such as English, French or German (Real Academia Española, 2009). Such non-specific interpretations can occur when the antecedent or referent was not previously mentioned in the discourse.
2.2.2 Linguistic Approaches to Non-referential Ellipsis
On the other hand, Type 2 and type 3 ellipsis listed in Figure 2.1 correspond to
non-referential expressions or impersonal sentences. Type 2 ellipsis is composed of
impersonal sentences containing the Spanish particle se, whose argumental omitted
subject always has an unspecific interpretation and is referred to using the pronoun se
(Mendikoetxea, 1994). Type 3 ellipsis corresponds to the set of sentences called impersonal sentences. Although the types of impersonal constructions in Spanish are heterogeneous, all of them share a lack of some properties of the subject (Fernández Soriano
& Táboas Baylín, 1999). Some studies consider different kinds of Spanish impersonality, e.g. semantic and syntactic impersonality (Gómez Torrego, 1992), while others
distinguish several semantic degrees of impersonality (Mendikoetxea, 1999).
Traditionally –from a semantic point of view– impersonal sentences have been considered to be those which cannot contain a subject, the agent of the action described
(Real Academia Española, 1977). This impersonality can the due either to the nature
of the verb,
(f) Llueve.
(It) rains.
or due to the speaker’s ignorance of the subject (Seco, 1988):
(g) Llaman a la puerta.
(Someone) is knocking at the door.
where the subject is unidentified and it is therefore impossible to assign a reference to
it (Bello, [1847] 1981).
The controversy of treating non-referential expressions as a type of ellipsis, given
that they cannot be lexically retrieved, has already been discussed (Gómez Torrego,
1992). While Brucart (1999) considers them a case of ellipsis, as do some Generative
Grammar approaches1 , others (Bosque, 1989; Mel’čuk, 2006)2 consider that such elliptic
and non-referential subjects do not exist in language.
A descriptive point of view (Fernández Soriano & Táboas Baylín, 1999) would regard impersonal sentences as belonging to either of two main groups: (1) impersonal
sentences without a subject and (2) cases of impersonal verbs with the inherent feature
of not having a subject.
In the current dissertation, a prescriptive and descriptive approach (Real Academia
Española, 2009) to the consideration of impersonal sentences is taken (See Section
3.1.3).
Type 4 ellipsis (Brucart, 1999) in Figure 2.1 is ignored in our work. However, this
fourth type is much debated in the literature; for example, Head-Driven Phrase Structure
Grammar does not consider the infinitive subject as a null category (slash), nor do
Pollard and Sag in their work (Pollard & Sag, 1994).
1 Generative Grammar explains these impersonal sentences by labeling the absence of the subject with a pro-form which presents the same syntactic features as the subject although it has no phonological realization. Following the Extended Projection Principle this pro-form embodies all the syntactic requirements of a subject except for its phonological realization (Chomsky, 1981).
2 MTT uses the concept of the zero sign to characterize elements whose signifier is empty and is by no means realized as a perceptible phonetic pause (Mel’čuk, 2006).
Chapter 3
Detecting Ellipsis in Spanish
This chapter describes the methodology used in this study. The first step is to create
a linguistically motivated classification system (Section 3.1) for all instances of elliptic
and non-elliptic as well as referential and non-referential subjects. Since the machine
learning method requires training data, a corpus (the eszic Corpus) was compiled
(see Section 3.2.1) and a purpose built tool for its annotation was developed, as were
guidelines (see Section 3.2.2). The third task consisted of implementing a method to
extract the features (Section 3.2.3) of instances from the corpus and create training
data (eszic training data; see Section 3.2.4). Finally, once the features of instances
are derived from a document, they are exploited for classification by machine learning
using the Weka package (Section 3.2.5).
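To make this final step concrete, the following minimal sketch (which is not part of the Elliphant implementation itself) shows how a training file of feature vectors could be loaded and classified with Weka’s implementation of the K* algorithm, using the options reported for it in Table 4.2 (-B 40 -M a). The file name eszic-training.arff and the use of ten-fold cross-validation are illustrative assumptions rather than a description of the exact experimental setup, which is detailed in Chapter 4.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.lazy.KStar;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ElliphantTrainingSketch {
    public static void main(String[] args) throws Exception {
        // Load the training data; the file name is a placeholder for the eszic training file.
        Instances data = DataSource.read("eszic-training.arff");
        // The ternary class (explicit subject / zero pronoun / impersonal construction)
        // is assumed to be the last attribute of each feature vector.
        data.setClassIndex(data.numAttributes() - 1);

        // K* instance-based classifier with the options reported in Table 4.2:
        // -B 40 sets the blend parameter and -M a averages over missing values.
        KStar kstar = new KStar();
        kstar.setOptions(new String[] {"-B", "40", "-M", "a"});

        // Ten-fold cross-validation (cf. Table 4.3) with a per-class report.
        Evaluation evaluation = new Evaluation(data);
        evaluation.crossValidateModel(kstar, data, 10, new Random(1));
        System.out.println(evaluation.toSummaryString());
        System.out.println(evaluation.toClassDetailsString());
    }
}

The same classifier and options can also be selected interactively in the Weka Explorer interface (Figure 3.3).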
3.1 Classification
The first step is to create a classification system for all instances of subjects and impersonal constructions. The subjects were grouped according to two distinctions: elliptic versus non-elliptic, and referential versus non-referential. These two distinctions result in a ternary classification:
(1) Explicit subjects: non-elliptic and referential;
(2) Zero pronouns: elliptic and referential; and
(3) Impersonal constructions: elliptic and non-referential.
In the examples, explicit subjects are presented in italics, zero pronouns are represented by the symbol Ø, and impersonal constructions are not explicitly indicated using a symbol. In the English translations, the subjects which are elided in Spanish are marked with parentheses.
A subject can be non-elliptic (explicit) or elliptic (omitted subject or zero pronoun).
A sign can be referential or non-referential. The distinction lies in the fact that, while
the former can be lexically retrieved, the latter cannot (impersonal construction).
This treatment of the classification as ternary differs from previous work whose
division of subjects was binary: elliptic (zero pronoun) and non-elliptic, both referential
(Ferrández & Peral, 2000; Rello & Illisei, 2009b) (see Section 2.1.1). In Evans (2001)
the seven-fold classification of pleonastic it is based on the type of referent while in
Boyd et al. (2005), classification follows syntactic and semantic criteria (see Section
2.1.2).
In the following sections, each class is described. With regard to cases in which
classification can be controversial, different annotation criteria were applied (see Section
3.2.2).
3.1.1 Explicit Subjects: Non-elliptic and Referential
This class is the one to which explicit subjects belong. They are phonetically realised,
usually by a nominal group: noun, pronoun, noun phrase (a), free relatives, semi-free
relatives, substantival adjectives (Real Academia Española, 2009).
(a) Las fuentes del ordenamiento jurídico español son la ley, la costumbre y los principios
generales del derecho.
The sources of the Spanish legal system are the law, the judicial custom and the general
principles of law. (Unless otherwise specified, all the examples provided are taken from our corpus; see Section 3.2.1.)
The syntactic positions of subjects can be pre-verbal or post-verbal. The occurrence of post-verbal subjects is restricted by some conditions (Real Academia Española,
2009).
(b) Carecerán de validez las disposiciones que contradigan otra de rango superior.
The provisions which contradict one of a higher rank will not be valid.
Post-verbal subjects, as well as preverbal ones, are also found in passive constructions and passive reflex constructions. As in active clauses, preverbal subjects without
a definite article are rare while post-verbal subjects without a definite article are more
frequent (Real Academia Española, 2009).
Projections of non-nominal categories such as clauses containing an infinitive or
a conjugated verb, interrogative indirect clauses, or indirect exclamative clauses, can
function as subjects (Real Academia Española, 2009).
(c) Corresponde a los poderes públicos promover las condiciones para que la libertad y la
igualdad del individuo y de los grupos en que se integra sean reales y efectivas.
It corresponds to the public power to promote individual and group liberties to be real
and effective.
3.1.2 Zero Pronouns: Elliptic and Referential
Class 2 is formed by elliptic but referential subjects called zero pronouns. An elliptic
subject is the result of a nominal ellipsis, where a non-phonetically realised lexical
element –elliptic subject– which is needed for the interpretation of the meaning and
the structure of the sentence, is omitted since it can be retrieved from its context (Brucart,
1999). Despite their lack of phonetic realisation, elliptic subjects are considered part
of the clause (Real Academia Española, 2009).
(d) La Constitución Españolai (title in text)
Øi Fue refrendada por el pueblo español el 6 de diciembre de 1978.
The Spanish Constitutioni (title in text)
(It)i was countersigned by the Spanish population on the 6th of December of 1978.
Elliptic subjects are considered to be a personal pronoun variant which is not phonetically realised (Real Academia Española, 2009). Where referential, they can be
lexically retrieved (Gómez Torrego, 1992). That is to say that they can be substituted
by explicit pronouns without changing or losing any of the meaning of the clauses in
which they occur.
The elision of the subject can affect not only the noun head, but also the entire
noun phrase (Brucart, 1999). The noun head can be omitted in Spanish when the
subject of which it is a part fulfills some structural requirements (Brucart, 1999). This
23
3.1 Classification
3. Detecting Ellipsis in Spanish
includes cases in which the subject is referential (Brucart, 1999). The processing of
these subjects has been addressed by the development of specific algorithms in previous
work (Ferrández et al., 1997).
Ellipsis of the head of the noun phrase is only possible when a definite article occurs.
(e) El Ø que está obsesionado con que todo el mundo piensa mal es Javier.
The (one) who is obsessed with everyone thinking wrong is Javier.
The article possesses a referential value which could be either anaphoric or cataphoric
(Real Academia Española, 2009). Such examples of subjects with an elided head are
instances of semi-free relatives (Real Academia Española, 2009) and, as expected, they
are not as frequent in our corpus as elisions of the entire subject noun phrase.
3.1.3 Impersonal Constructions: Elliptic and Non-referential
Impersonal constructions have no subject: subjects that are both non-referential and elliptic do not exist (Bosque, 1989)1.
Nevertheless, clauses containing zero pronouns and impersonal constructions look alike on the surface. Class 3 is composed of impersonal constructions, which comprise (1) non-reflex impersonal clauses and (2) reflex impersonal clauses (impersonal clauses with se).
Impersonal clauses have no argumental subject. Since the subject does not exist,
it cannot be lexically retrieved by any means and no phonetic realisation of it can be
expected (Bosque, 1989). The following cases are considered to be impersonal sentences
(Real Academia Española, 2009):
– Non-reflex impersonal clauses denoting natural phenomena describing meteorological situations:
(f) Nieva.
(It) snows.
– Non-reflex impersonal clauses with verbs haber (to be), hacer (to do), ser (to
be), estar (to be)2 , ir (to go) and dar (to give):
(g) En un kilogramo de gas hay tanta materia como en un kilogramo de sólido.
In a kilogram of gas (there) is the same amount of mass as in a kilogram of solid. (Existential use of the verb haber.)

1 The existence of a non-phonetically realised element in subject position is postulated (see Section 2.2). While Generative Grammar defends its existence (pro-form), MTT does not (zero sign).
2 Depending on the verbal aspect, there are different Spanish verbs which correspond with the English verb to be.
– Non-reflex impersonal clauses with other verbs such as sobrar con (to be too
much), bastar con (to be enough) or faltar con (to have lack of) or the pronominal
unipersonal verb1 with subject zero such as tratarse de (to be about):
(h) Deberán adoptar las precauciones necesarias para su seguridad, especialmente cuando
se trate de niños.
Necessary measures should be taken, especially when (it) is about children.
(i) Basta con tres sesiones.
(It) is enough with three sessions.
Verbs in such impersonal sentences (Gómez Torrego, 1992), are called lexical impersonal
verbs (Real Academia Española, 2009). Due to their lack of subject they are not easily
distinguished from verbs with omitted –but existing– subjects.
Secondly, reflex impersonal clauses have an omitted subject whose reference is nonspecific and cannot be lexically retrieved.
(j) Se estará a lo que establece el apartado siguiente.
(It) will be what is established in the next section
These clauses are formed with the particle se. This particle also serves other syntactic
functions (reflexive pronoun, pronominal pronoun, reciprocal pronoun, etc.) in clauses
with an elided subject.
3.2 Machine Learning Approach
Our corpus was compiled and parsed in order to create training data (referred to as the
eszic training data) for use by a machine learning classification method as explained
in the next section.
A tool was developed for annotation of the corpus (see Section 3.2.2). Fourteen features were proposed for the purpose of classifying instances of subjects (see Section 3.2.3). The feature vectors, together with their manual classifications, were written to a training file. A method for obtaining the values of those features for each instance was implemented. The classification algorithm employed was the K* instance-based learner available in the Weka package (Witten & Frank, 2005) (see Section 3.2.5).

1 A verb which is only conjugated in the third person.
3.2.1 Building the Training Data
The eszic training data used by the Elliphant system is obtained from the eszic corpus
created ad hoc. The corpus is named after its annotated content “Explicit Subjects,
Zero-pronouns and Impersonal Constructions”.
The corpus contains a total of 79,615 words (titles and sentences that do not contain
at least one finite verb are ignored), including 6,825 finite verbs. Of these verbs, 71%
have an explicit subject, 26% have a zero pronoun and 3% belong to an impersonal
construction. There is an average of 2.3 clauses per sentence with 11.7 words per clause
and 26.9 words per sentence.
The corpus compiled to extract the training data is composed of seventeen documents, originally written in Spanish, and belonging to two genres: legal and health.
The legal texts1 are composed of laws taken from the: (1) Spanish Constitution
(whole text) (Constitución Española, 1978), (2) Laws on Unfair Competition (whole
text) (Ley 3/1991, 1991), (3) Penal Code (first book) (Ley Orgánica 10/1995, 1995), (4)
Law for Administrative-contentious Jurisdiction (title 1, articles 1 to 17) (Ley 29/1998,
1998), (5) Civil Code (first book, until title V) (Código Civil, 1889), (6) Law for Universities (introduction) (Ley Orgánica 6/2001, 2001), (7) Law for Associations (chapter
1) (Ley Orgánica 1/2002, 2002) and (8) Law for Advertisements (whole text) (Ley
29/2005, 2005).
The nine health texts are taken from psychiatric papers compiled from a Spanish digital journal of psychiatry, Psiquiatría.com2: (1) Cinema as a tool for teaching personality disorders (López Ortega, 2009), (2) Efficacy, functionality, and empowerment for phobic pathology treatment, in the context of specialised public Mental Health Services (García Losa, 2008), (3) Emotions in Psychiatry (Sevillano Arroyo & Ducret Rossier, 2008), (4) And what about siblings? How to help TLP3 siblings
1 All the legal texts are available online at: http://noticias.juridicas.com/base_datos/
2 The full-text articles from the Psiquiatría.com journal are available online at: http://www.psiquiatria.com/.
3 Trastorno límite de la personalidad (Borderline Personality Disorder).
eszic Corpus     Number of Tokens   Number of Sentences   Number of Clauses
Legal text 1     9,972              941                   600
Legal text 2     1,147              47                    56
Legal text 3     17,960             1,035                 1,181
Legal text 4     3,578              189                   191
Legal text 5     12,456             746                   891
Legal text 6     3,962              130                   219
Legal text 7     2,159              131                   136
Legal text 8     5,219              291                   282
Health text 1    2,753              110                   270
Health text 2    11,339             658                   1,028
Health text 3    1,854              47                    140
Health text 4    1,937              84                    124
Health text 5    2,183              93                    148
Health text 6    1,568              63                    210
Health text 7    1,296              69                    89
Health text 8    1,687              53                    127
Health text 9    12,441             525                   1,394
Total            93,511             5,212                 7,086

Table 3.1: eszic Corpus: tokens, sentences and clauses.
(Molina López, 2008), (5) Factorial analysis of personal attitudes in secondary education (Pintor García, 2007), (6) The influence of the concept of self and social competence in children's depression (Aldea Muñoz, 2006), (7) Depression as a mental health problem in Mexican teenagers (Balcázar Nava et al., 2005), (8) Relationship difficulties in couples (Díaz Morfa, 2004), and (9) A case of psychological intervention for children's depression (Aldea Muñoz, 2003).
Table 3.2 presents the number of instances found in the eszic corpus by class.
Two columns illustrate the number of instances by genre (legal and health) within the
corpus.
Number of instances per class    Legal eszic Corpus    Health eszic Corpus    eszic Corpus
Explicit subjects                2,739                 2,116                  4,855
Zero pronouns                    619                   1,174                  1,793
Impersonal constructions         71                    108                    179
Total                            3,429                 3,398                  6,827

Table 3.2: eszic Corpus: number of instances per class.
The text containing instances to be classified was analysed using Connexor’s Machinese Syntax (Järvinen & Tapanainen, 1998; Järvinen et al., 2004; Tapanainen & Järvinen, 1997)1 . This dependency parser returns information on the pos and morphological
lemma of words in a text, as well as returning the dependency relations between those
words. The parsing system employed uses Functional Dependency Grammar (FDG)
(Järvinen & Tapanainen, 1998; Tapanainen & Järvinen, 1997) and combines (Järvinen
et al., 2004) a lexicon and a morphological disambiguator based on constraint grammar
(Tapanainen, 1996). When performing fully automatic parsing it is necessary to address word-order phenomena. The formalism used in the parser is capable of referring
simultaneously both to the order in which syntactic dependencies apply and to linear
order. This feature is an extension of Tesnière’s theory (Tesnière, 1959), which does
not formalise linearisation. In the parsed output the linear order is preserved while the
structural order requires that functional information is not coded in the canonical order of the dependents. The functional information is represented explicitly using arcs with labels of syntactic functions, as shown in Figure 3.1 (Järvinen et al., 2004).

1 A demo of Connexor's Machinese Syntax is available at: http://www.connexor.eu/technology/machinese/.
Figure 3.1: An example of the output of the Connexor’s Machinese Syntax parser for
Spanish.
The dependency information allows the identification of complex constituents in a
text. For example, complex noun phrases can be identified by transitively grouping
together all the words dependent on a noun head (Evans, 2001). Additional software
was implemented to perform this and allow identification of clauses and noun phrases
which are required for implementation of some of the features used in our classification
(see Section 3.2.4).
The eszic training data makes use of the three types of information returned by
Connexor’s Machinese Syntax parser (Connexor Oy, 2006a,b):
1. morphological tags generated for verbs –singular (SG), third person (3P), indicative (IND), among many others– including the pos tags –verb (V), noun (N),
preposition (PREP), etc.–;
2. syntactic tags –main element (@MAIN), nominal head (@NH), auxiliary verb (@AUX),
etc.–; and
3. syntactic relations –subject (subj), verb chained (v-ch), determiner (det)–. The
lexical information (LEMMA) given by the parser was also taken into consideration
in the set of features.
3.2.2 Annotation Software and Annotation Guidelines
A program was written in Python (see Figure 3.2) to extract all occurrences of finite
verbs from the eszic Corpus and to assign to each the vector of feature values described
in Section 3.2.3. Two annotators were presented with the clause in which each verb
appears and prompted to classify the verb into one of thirteen classes.
Figure 3.2: Screenshot of the annotation program interface.
Although the goal is to develop training data for a classifier making a ternary
classification of the subject position elements, an annotation scheme which gives more
detail about each instance was used. This annotation scheme served a dual purpose: to get the most out of the annotation task, since the instances occur in a broad range of constructions, and to produce a more detailed annotation that could be useful in future work. The thirteen classes are grouped into the three types: (1) explicit
subjects, (2) zero pronouns or (3) impersonal constructions. In Table 3.3, the linguistic
motivation for each of the annotated classes is shown in correlation with the types to
which they belong. For each annotation class, in addition to the two criteria that
are crucial for this study –elliptic vs. non-elliptic and referential vs. non-referential– a
combination of syntactic, semantic and discourse knowledge can also be encoded during
the annotation. This knowledge includes information about whether the subject is
nominal or non-nominal, whether it is an active or a passive subject or whether the
subject refers to an active participant in the action, state or process denoted by the
verb.
The annotation program extracts from the parsed eszic Corpus the clause in which
each finite verb occurs. As Connexor’s Machinese Syntax parser does not explicitly
perform clause splitting but only sentence splitting, a method was developed to accomplish the clause identification task. The method identifies the finite verbs in the
corpus and transitively groups together the words directly and indirectly dependent
upon them1 . The identified clauses are then presented to the annotators who are asked
to label the verb.
For each verb classified by an annotator, an xml tag (i.e. <subject>ZERO</subject>)
with its class is added in the token line of the parsed eszic Corpus where the verb occurs. An example (k) of an annotated verb whose subject is a zero pronoun follows:
(k)
<token id="w53"><text>entró </text><lemma>entrar </lemma>
<depend head="w51">mod </depend><tags><syntax>@MAIN
</syntax><morpho>V IND PRET SG P3 </morpho><subject>ZERO
</subject> </tags></token>
This manual classification, together with the features (see Section 3.2.3), is written to the eszic training file.
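A minimal sketch of how such a tag can be added to a parsed token is shown below; it follows the token layout of example (k), while the handling of the complete corpus file is omitted and is an assumption of the sketch.

# Minimal sketch: add the annotator's class to one parsed token element.
import xml.etree.ElementTree as ET

token_xml = ('<token id="w53"><text>entró</text><lemma>entrar</lemma>'
             '<depend head="w51">mod</depend><tags><syntax>@MAIN</syntax>'
             '<morpho>V IND PRET SG P3</morpho></tags></token>')

token = ET.fromstring(token_xml)
subject = ET.SubElement(token.find("tags"), "subject")   # new <subject> element
subject.text = "ZERO"                                     # class assigned by the annotator
print(ET.tostring(token, encoding="unicode"))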
1 A clause splitter module was implemented to extract the features from the eszic Corpus (see Section 3.2.4).
eszic Corpus Annotation Tags

The linguistic information encoded for each annotation class covers phonetic realisation (elliptic noun phrase; elliptic noun phrase head), syntactic category (nominal subject), verbal diathesis, semantic interpretation (active participant) and discourse information (referential subject).

Elliphant class / annotation class        Elliptic NP   Elliptic NP head   Nominal subject   Diathesis   Active participant   Referential subject

Class 1: Explicit subject
  Active subject                          –             –                  +                 +           +                    +
  Reflex passive subject                  –             –                  +                 +           –                    +
  Passive subject                         –             –                  +                 –           –                    +

Class 2: Zero pronoun
  Omitted subject                         +             –                  +                 +           +                    +
  Omitted subject head                    –             +                  +                 +           +                    +
  Non-nominal subject                     –             –                  –                 +           +                    +
  Reflex passive omitted subject          +             –                  +                 +           –                    +
  Reflex passive omitted subject head     –             +                  +                 +           –                    +
  Reflex passive non-nominal subject      –             –                  –                 +           –                    +
  Passive omitted subject                 +             –                  +                 –           –                    +
  Passive non-nominal subject             –             –                  –                 –           –                    +

Class 3: Impersonal construction
  Reflex impersonal clause (with se)      –             –                  n/a               –           n/a                  –
  Impersonal construction (without se)    –             –                  n/a               +           n/a                  –

Table 3.3: eszic Corpus annotation tags.
Annotating explicit and elliptic subjects as well as impersonal constructions in Spanish is not a trivial task. Guidelines were established for the annotation of borderline
instances whose classification is a frequent source of disagreement between annotators.
The following text presents some of these borderline cases that belong to the three
types of finite verb classes, together with the criteria adopted for their annotation.
When distinguishing explicit subjects, in addition to nouns, there are other syntactic
categories which may arguably function as heads of subjects. In the case of adverbial
and prepositional categories, it was decided that they should be considered subjects if
they can be focalised (Real Academia Española, 2009).
(l) De acuerdo con la Organización Mundial de la Salud, la depresión ocupa el cuarto lugar entre las enfermedades más incapacitantes y aproximadamente de 100 a 200 millones de personas la padecen.
According to the World Health Organization, depression ranks fourth among the most disabling illnesses, and approximately 100 to 200 million people suffer from it.
While conditional clauses could be considered subjects, in this work an alternative
analysis is followed. Under this approach, a sentence with a conditional clause functioning as subject is considered to contain a zero pronoun, as its elliptic subject can
be retrieved from the preceding discourse (Real Academia Española, 2009). Nevertheless, no examples were found of conditional clauses functioning as subjects in the eszic
corpus used in this dissertation.
The correct classification of zero pronouns is also a source of disagreement between
annotators as it may be argued that some instances with postponed non-nominal subjects (see example (m) below) should be interpreted as cataphoric zero pronouns.
In contrast to anaphora, in cataphora the cataphoric expression is situated before
the nominal group to which it points (Real Academia Española, 2009). Tanaka (2000)
and Mitkov (2002) point out that there is some scepticism about the concept of cataphora in the NLP literature. For example, Kuno (1972) asserts that there is no genuine
cataphora in its literal sense, as the referent of a seemingly cataphoric pronoun must
already be mentioned in the preceding discourse and, therefore, is predictable when
a reader encounters the pronoun. This viewpoint was refuted by Carden (1982) and
Tanaka (2000) who describe empirical data which shows cases of genuine cataphora
where the pronoun is the first mention of its referent in the discourse (Carden, 1982;
Tanaka, 2000). Although some examples of genuine cataphora were found in their corpus (Tanaka, 2000), none were found in the eszic Corpus except for occurrences of the
elision of noun heads where the antecedent is postponed, as in example (e).
The annotation guidelines developed for the current work considered these cases
which involve postponed clauses as non-nominal subjects.
(m) Artı́culo 46.
No pueden contraer matrimonio:
Los menores de edad no emancipados.
Los que estén ligados con vı́nculo matrimonial.
Article 46.
(They) cannot get married:
The non-emancipated minors.
The ones which are already married.
Finally, the borderline cases in impersonal constructions are debated in Spanish. The
decision of how to classify reflex impersonal clauses containing se is frequently a difficult one to make due to the ambiguity of these instances. For example, in the sentence
Se secaron (see example (n) below), the particle se has four possible semantic interpretations in Spanish (Real Academia Española, 2009). In these cases, the decision taken
by the annotator depends on the meaning given by the context.
(n) Se secaron (Particle se = reflexive pronoun)
(They) dried (themselves).
Se secaron (Particle se = reciprocal pronoun)
(They) dried (each other).
Se secaron (Particle se = pronominal pronoun and there is an elliptic subject which does
not have control over the action, for instance, the trees.)
The trees got dried.
Se secaron (Particle se = reflex passive, in which the described action would have to be performed of their own free will by agents, for instance some people, over an object, for instance the clothes)
(They) dried (the clothes).
There can be ambiguity between reflex passives containing a zero pronoun and impersonal constructions in which the object is not human (o).
(o) Se firmará el acuerdo.
Ø will sign the agreement.
In such instances, the annotation criterion followed is to annotate them as reflex passive
clauses containing a zero pronoun.
3.2.3 Features
Fourteen features were proposed in order to classify instances according to the types
presented in Section 3.1. The values (see Table 3.4) for the features were derived from
information provided both by Connexor’s Machinese Syntax (Connexor Oy, 2006b)
parser, which processed the eszic Corpus, and a set of lists. An additional program
was implemented in order to extract the values of features for every instance in the
corpus (see Section 3.2.4). These values were used to produce a training vector for each
instance. For a detailed explanation of the feature values see Section 3.2.4.
For the purpose of description, it is convenient to describe each of the features as
broadly belonging to one of ten classes, detailed below.
1 PARSER: the presence or absence of a subject in the clause, as identified by the
parser. It was observed (Rello & Illisei, 2009b) that the analysis returned by Connexor’s Machinese Syntax is particularly inaccurate when identifying coordinated
subjects, subjects containing prepositional modifiers, and appositions occurring
between commas (see example (p) below). Other common cases of parsing error
involve subjects which are distant from the finite verb in the clause. Features 7
and 8 were proposed in an effort to take into consideration potential candidates
for the subject.
(p) La publicidad, por su propia índole, es una actividad que atraviesa las fronteras.
Advertising, due to its own nature, is an activity which goes beyond boundaries.
2 CLAUSE: the clause types considered are: main clauses, relative clauses, clauses
starting with a complex conjunction, clauses starting with a simple conjunction,
and clauses introduced using punctuation marks (commas, semicolons, etc.).
Feature         Definition                                                                       Value
1  PARSER       Parsed subject                                                                   True, False
2  CLAUSE       Clause type                                                                      Main, Rel, Imp, Prop, Punct
3  LEMMA        Verb lemma                                                                       Parser's lemma tag
4  NUMBER       Verb morphological number                                                        SG, PL
5  PERSON       Verb morphological person                                                        P1, P2, P3
6  AGREE        Agreement in person, number, tense and mood                                      FTFF, TTTT, FFFF, TFTF, TTFF, FTFT, FTTF, TFTT, FFFT, TTTF, FFTF, TFFT, FFTT, FTTT, TFFF, TTFT
7  NHPREV       Previous noun phrases                                                            Number of noun phrases previous to the verb
8  NHTOT        Total noun phrases                                                               Number of noun phrases in the clause
9  INF          Infinitive                                                                       Number of infinitives in the clause
10 SE           Particle se                                                                      se, no
11 A            Preposition a                                                                    True, False
12 POSpre       Four parts of speech previous to the verb                                        292 different values combining the parser's pos tags, i.e.: @HN, @CC, @MAIN, etc.
13 POSpos       Four parts of speech following the verb                                          280 different values combining the parser's pos tags, i.e.: @HN, @CC, @MAIN, etc.
14 VERBtype     Type of verb: copulative, impersonal, pronominal, transitive and intransitive    CIPX, XIXX, XXXT, XXPX, XXXI, CIXX, XXPT, XIPX, XIPT, XXXX, XIXI, CXPI, XXPI, XIPI, XXEX

Table 3.4: Features: definitions and values.
A method was implemented to identify these different types of clause, as the parser does not explicitly mark the boundaries of clauses within sentences (see Section 3.2.4).
3 LEMMA: lexical information extracted from the parser: the lemma of the finite
verb.
4-5 NUMBER, PERSON: morphological information features of the verb: its
grammatical number (singular or plural) and its person (first, second, or third
person).
6 AGREE: feature which encodes the tense, mood, person, and number of the
verb in the clause, and its agreement in person, number, tense, and mood with
the preceding verb in the sentence and also with the main verb of the sentence.
When a finite verb appears in a subordinate clause, its tense and mood can assist
in recognition of these features in the verb of the main clause and help to enforce
some restrictions required by this verb, especially when both verbs share the same
referent as subject.
7-9 NHPREV, NHTOT, INF: the candidates for the subject of the clause are
represented by the number of noun phrases in the clause that precede the verb,
the total number of noun phrases in the clause, and the number of infinitive verbs
in the clause.
10 SE: this is a binary feature encoding the presence or absence of the particle se
in close proximity to the verb. When se occurs immediately before or after the
verb or with a maximum of one token (see example (q) below) lying between the verb and itself, this is considered “close proximity” (a minimal sketch of this check is given after this feature list).
(q) No podrá sacarse una ventaja indebida de la reputación de una marca.
(It) is not allowed to take unfair advantage of a brand reputation.
11 A: this is a binary feature encoding the presence or absence of the preposition a in the clause, since the distinction between reflex passive clauses with zero pronouns and impersonal constructions sometimes relies on the appearance of the preposition a (to, for, etc.). For instance, example (r) is a reflex passive clause containing a zero pronoun while example (s) is an impersonal construction.
(r) Se admiten los alumnos que reúnan los requisitos.
(They) accept the students who fulfill the requirements.
(s) Se admite a los alumnos que reúnan los requisitos.
(It) is accepted for the students who fulfill the requirements.
12-13 POSpre, POSpos: the pos of eight tokens, that is, the four words preceding and the four words following the instance1.
14 VERBtype : the verb is classified as copulative (yes/no), as a verb with an im-
personal use (yes/no), as a pronominal verb (yes/no), and as a transitive verb
(yes/no/both).
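A minimal sketch of the proximity test described for feature 10 is given below; the tokenised clause representation is an assumption of the sketch, since the actual feature extractor works over the parser's XML output.

# Minimal sketch: does the particle "se" occur in close proximity to the finite verb?
def se_in_close_proximity(tokens, verb_index, max_gap=1):
    """True if "se" appears with at most `max_gap` tokens between it and the verb."""
    for position, token in enumerate(tokens):
        if token == "se" and position != verb_index and abs(position - verb_index) <= max_gap + 1:
            return True
    return False

# Example (q): "No podrá sacarse una ventaja indebida ...". If tokenisation splits the
# clitic off, one token ("sacar") lies between the finite verb "podrá" and "se".
tokens = ["no", "podrá", "sacar", "se", "una", "ventaja", "indebida"]
print(se_in_close_proximity(tokens, verb_index=1))   # True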
3.2.4 Purpose-Built Tools
As training data is required in order to exploit the methods distributed in the Weka
package (Witten & Frank, 2005), a method was implemented to extract the values of
the previously described features for instances occurring in the eszic Corpus. For each
instance (each annotated finite verb) a new line is written in the training data file with values for the fourteen features separated by commas, together with the manual classification of the vector, using the standard CSV (comma-separated values) format.
The values of features 7-9 are numerical while the values of the remaining features are
nominal (i.e. symbolic).
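A minimal sketch of this step is shown below; the feature values are illustrative placeholders rather than actual entries of the eszic training file, and the output file name is an assumption.

# Minimal sketch: write one training instance per line in CSV format.
# The feature values below are illustrative placeholders.
import csv

FIELDS = ["PARSER", "CLAUSE", "LEMMA", "NUMBER", "PERSON", "AGREE",
          "NHPREV", "NHTOT", "INF", "SE", "A", "POSpre", "POSpos",
          "VERBtype", "CLASS"]

instance = {"PARSER": "True", "CLAUSE": "Main", "LEMMA": "entrar",
            "NUMBER": "SG", "PERSON": "P3", "AGREE": "TTTT",
            "NHPREV": 1, "NHTOT": 2, "INF": 0, "SE": "no", "A": "False",
            "POSpre": "N+PREP+DET+N", "POSpos": "PREP+DET+N+N",
            "VERBtype": "XIXX", "CLASS": "ZERO"}

with open("eszic_training.csv", "w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writerow(instance)   # one comma-separated line per annotated finite verb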
To extract the features, ad hoc software was implemented in Python. The program
exploits morphological and syntactic information, dependency relations reported by the
parser, and lists of verbs grouped by their syntactic and morphological properties (e.g.
transitivity, pronominal use, etc.).
The method implemented includes the following purpose built tools which are described below. The description includes information on the particular features whose
values are computed using the tools.
1 Clause splitter module (CLAUSE): since Connexor’s Machinese Syntax (Connexor Oy, 2006a) does not provide any information about the clause boundaries
within sentences, this clause splitter module is required. Each clause is built by
identifying finite verbs in a sentence and then searching for signals that indicate
the boundaries of the clause (relative pronouns, conjunctions, punctuation marks,
etc.). In theory, each clause could be built using dependency information given by
the parser by grouping together all the words dependent on the finite verb. However, this strategy was not used in order to avoid parsing errors in the dependency
information reported by the parser. Errors of this type are especially common
when long sentences are parsed using Connexor's Machinese Syntax. The Clause splitter module also identifies the type of clause in which the finite verb occurs.

1 This set of features can be regarded as useful for identifying non-nominal it (Evans, 2001).
The feature attributes corresponding to the type of clause are:
1.1 Main (Main): when the finite verb belongs to the main clause.
1.2 Relative (Rel): when the finite verb belongs to a relative clause. A list of relative
pronouns was used to identify this type of clause (i.e.: que (that), cuyo (whose),
quien (who), etc.).
1.3 Improper conjunction (Imp): when the finite verb belongs to a clause starting with
an improper conjunction. A list of improper conjunctions was used to identify the
value of this attribute (i.e.: porque (because), luego (so), aunque (although), etc.).
1.4 Proper conjunction (Prop): when the finite verb belongs to a clause starting with a proper conjunction. A list of proper conjunctions was used (i.e.: y, e (and), o, u (or), ni (neither), pero (but) and sino (but rather)).
1.5 Punctuation marks (Punct): when the clause in which the finite verb occurs is
preceded by a punctuation mark (‘.’, ‘,’, ‘:’, ‘;’, ‘?’, ‘!’, “”, ‘-’, ‘(’, and ‘)’ ).
2 Noun phrase module (NHPREV, NHTOT): in order to obtain the subject
candidates, this module identifies and counts the noun phrases that precede and
follow the finite verb in the clause. As is the case for the clause splitter, this
module exploits dependency information returned by the parser (Connexor Oy,
2006a).
3 Counter (NHPREV, NHTOT, INF): this module is used to determine the
total number, in the clause, of noun phrases (nhprev, nhtot) and infinitival
forms (inf).
4 Tag taker (PARSER, LEMMA, NUMBER, PERSON, A, POSpre ,
POSpos ): these Python functions process the attributes of the XML tags output
by the parser (eszic Corpus) to generate a set of features for the eszic training data. A function generates a binary value that indicates whether or not the
finite verb has a dependent subject (parser). A function consults the lemma
of the verb and takes it as the value for feature (lemma). Other functions exploit morphological information obtained by the parser such as the number of the
finite verb (number), which can be either singular (SG) or plural (PL), or the morphological person of the finite verb (person), which can be first, second or third person (P1, P2, P3); another function identifies whether the preposition a occurs in the clause (a), and this information is used as the value of that feature; and, finally, there is another function which obtains the pos of the four words that precede the instance in the clause (pospre) and the four words that follow it (pospos).
5 Agreement module (AGREE): this module checks whether the verb used in
the clause agrees (true, T) or disagrees (false, F) in tense and mood, and in person
and number with the main verb that occurs in the sentence1 and the previous
verb occurring within the sentence. This agreement information is combined into
one symbolic feature, such as TTTT (with respect to the verb used in the clause,
the first T denotes agreement in number and person with the main verb of the
sentence, the second T denotes agreement in tense and mood with the main verb
of the sentence, the third T denotes agreement in number and person with the
previous verb in the sentence and the fourth T denotes agreement in tense and
mood with the previous verb in the sentence) or TTFF (when there is agreement between the verb in the clause and the main sentence verb but no agreement with the previous verb in the sentence). There are sixteen possible combinations of true (T) and false (F) values; a minimal sketch of how this string can be composed is given after this list.
6 Se identifier (SE): this function identifies whether the particle se occurs in
close proximity to the finite verb. Again, in this context, a distance of at most
one token between the finite verb and se is considered “close proximity.” The
value for this feature can be (yes), when se appears, or (no), when it does not.
7 Verb classifier (VERBtype ): this module specifies the value of four features
of the finite verb that occurs in the clause. The features encode information
about whether or not the verb appears in four different lists of verbs (the same
instance can occur in more than one list). These four lists2 contain 11,060 different verb lemmas which are present in the Royal Spanish Academy Dictionary (Real Academia Española, 2001).

1 In this study, it is considered that sentences may contain several verbs whereas clauses contain only one finite verb.
2 The lists 7.2-7.4 of infinitive verb forms were provided by Molino de Ideas s.a.
These lists (items 7.2-7.4) were built on the basis of the information contained in the dictionary definitions of the verbs (Real Academia Española, 2001):
7.1 Copulative verbs (C): a list containing the copulative verbs, i.e. ser (to be), parecer
(to seem like), etc.;
7.2 Impersonal verbs (I): a list containing all the verbs whose use is impersonal. Such
use is specified in their definition, i.e. llover (to rain), nevar (to snow), etc.;
7.3 Pronominal verbs (P): a list which includes all the pronominal verbs (verbs whose
lemma in the dictionary appears with se) and all the potential pronominal verbs
whose definitions specify a potential pronominal use; and
7.4 Transitive and intransitive verbs (T): a list containing transitive verbs and intransitive verbs that meet the criteria detailed previously in item 7.
3.2.5 The WEKA Package
The Weka workbench1 is a collection of state-of-the-art machine learning algorithms
and data preprocessing tools (Hall et al., 2009; Witten & Frank, 2005). Both Weka
interfaces, the Explorer and the Experimenter were used to discover the methods and
parameter settings that work best for the current classification task.
Standard evaluation measures –precision, recall, f-measure and accuracy (Manning
& Schütze, 1999)– provided by Weka are used. In these measures, true positives (tp)
and true negatives (tn) are the number of cases that the system got right. The wrongly selected cases are the false positives (fp) while the cases that the system failed to select are the false negatives (fn). In the current context, true positives and true negatives
would be the numbers of correctly classified instances while the false positives and false
negatives are the numbers of falsely classified instances (Manning & Schütze, 1999).
Precision is defined as the ratio of selected items that the system got right, that is, the ratio of true positives to the sum of true positives and false positives: p = tp / (tp + fp).
Recall is defined as the proportion of target items that the system selected, that is, the ratio of the number of true positives to the sum of true positives and false negatives: r = tp / (tp + fn).

1 Weka is available at: http://www.cs.waikato.ac.nz/ml/weka/.
Figure 3.3: An example of Weka Explorer interface.
F-measure is a single measure of overall performance which combines precision and recall: F = 2 / (1/p + 1/r).
Accuracy is the proportion of correctly classified objects: A = (tp + tn) / (tp + tn + fp + fn).
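For concreteness, the short sketch below computes these measures from raw counts; it simply restates the formulas above, and the counts used in the usage lines are invented for the example.

# Minimal sketch: precision, recall, f-measure and accuracy from raw counts.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(p, r):
    # Harmonic mean of precision and recall.
    return 2 / (1 / p + 1 / r)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# Illustrative counts only.
p, r = precision(tp=90, fp=10), recall(tp=90, fn=15)
print(round(p, 3), round(r, 3), round(f_measure(p, r), 3))
print(round(accuracy(tp=90, tn=50, fp=10, fn=15), 3))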
Chapter 4
Evaluation
“Then you should say what you mean” [...]
“I do,” Alice hastily replied; “at least I mean what I say that’s the same thing,
you know.”
“Not the same thing a bit!” said the Hatter. “Why, you might just as well say
that ‘I see what I eat’ is the same thing as ‘I eat what I see’ !”
Alice in Wonderland, Lewis Carroll
This chapter presents the evaluation of the Elliphant system and some optimisation experiments carried out with the machine learning method (see Section 4.1). A
comparative evaluation of Elliphant’s performance with that of Connexor’s Machinese
Syntax parser is also described (see Section 4.2).
Standard evaluation measures (precision, recall, f-measure and accuracy) are used
to evaluate Elliphant with regard to the identification of the three classes: explicit
subjects, zero pronouns and impersonal constructions.
4.1 Experiments
A set of experiments was executed using the Weka package with the purpose of
answering the following questions:
(1) Which method and parameter values work best for our problem? (see Section 4.1.1)
(2) How many instances are needed to train the algorithm? (see Section 4.1.2)
(3) Does the genre matter? (see Section 4.1.4)
(4) Which are the most significant features and what are the most effective combinations of features? (see Section 4.1.3)
4.1.1 Method Selected: K* Algorithm
A comparison of the learning algorithms implemented in Weka (Witten & Frank,
2005) was carried out to determine the most accurate method for each classification
task. Table 4.1 presents the accuracy levels obtained by all of the Weka classifiers able to exploit the features utilised in the Elliphant system, run with default parameter settings. The experiment was executed using 20% of the instances in the training data, which were selected randomly, and ten-fold cross-validation was used in the evaluation. Several methods obtain an accuracy within 1% of that of K*.
The seven1 highest performance classifiers were compared using 100% of the training
data and 10-fold cross-validation. The Bayes classifiers (BayesNet, NaiveBayes and
NaiveBayesUpdateable) obtained an accuracy score of 0.846, the function classifier
(RBFNetwork) offers an accuracy of 0.850 and the tree classifier (LADTree) an accuracy
of 0.830. With an accuracy of 0.860, the lazy learning classifier K* is the best performing
one, and hence our chosen technique.
Although lazy learning requires a relatively large amount of memory to store the
entire training set, the eszic training data is small enough that it can be classified
within a few minutes.
Instance-based learners classify new instances by comparing them to the manually
classified instances in the training data. The fundamental assumption is that similar
instances will have similar classifications. Nearest neighbor algorithms are the simplest
of the instance-based learners. They use a domain-specific distance measure to retrieve
the single most similar instance from the training set. In a nearest-neighbor method
each instance in the training set is represented by a vector of feature values that has
been explicitly classified. When a new vector of feature values is presented, a distance
measure is computed between the new vector and the set of vectors held in the training
1 Unfortunately, due to hardware limitations, it was not possible to obtain results from the NBTree classifier and the JRip rule classifier when using the entire set of training data.
Weka classifiers                          Accuracy    Weka classifiers                            Accuracy
Bayes: BayesNet                           0.848       Meta: RacedIncrementalLogitBoost            0.717
Bayes: NaiveBayes                         0.848       Meta: RandomSubSpace                        0.731
Bayes: NaiveBayesSimple                   0.842       Meta: Stacking                              0.717
Bayes: NaiveBayesUpdateable               0.848       Meta: StackingC                             0.717
Functions: RBFNetwork                     0.848       Meta: Vote                                  0.717
Lazy: IB1                                 0.804       Misc: HyperPipes                            0.715
Lazy: IBk                                 0.810       Misc: VFI                                   0.704
Lazy: K*                                  0.850       Rules: ConjunctiveRule                      0.809
Lazy: LWL                                 0.809       Rules: DecisionTable                        0.834
Meta: AdaBoostM1                          0.810       Rules: DTNB                                 0.834
Meta: AttributeSelectedClassifier         0.836       Rules: JRip                                 0.845
Meta: ClassificationViaClustering         0.660       Rules: NNge                                 0.740
Meta: CVParameterSelection                0.717       Rules: OneR                                 0.762
Meta: Decorate                            0.795       Rules: PART                                 0.795
Meta: END                                 0.809       Rules: Ridor                                0.821
Meta: EnsembleSelection                   0.762       Rules: ZeroR                                0.717
Meta: FilteredClassifier                  0.810       Trees: BFTree                               0.760
Meta: Grading                             0.717       Trees: DecisionStump                        0.810
Meta: LogitBoost                          0.841       Trees: J48                                  0.810
Meta: MultiBoostAB                        0.810       Trees: J48graft                             0.813
Meta: MultiClassClassifier                0.661       Trees: LADTree                              0.846
Meta: MultiScheme                         0.717       Trees: NBTree                               0.850
NestedDichotomies: ClassBalancedND        0.809       Trees: RandomForest                         0.793
NestedDichotomies: DataNearBalancedND     0.809       Trees: RandomTree                           0.749
NestedDichotomies: ND                     0.809       Trees: REPTree                              0.723
Meta: OrdinalClassClassifier              0.810       Trees: SimpleCart                           0.763

Table 4.1: Weka classifiers accuracy (20% of the eszic training set).
set (Cleary & Trigg, 1995). The k nearest ones are identified and the new vector is assigned the class shared by the majority of the nearest neighbors1.
K* is an instance-based classifier. The class of a test instance is based upon the classes of those training instances that are similar to it, as determined by some similarity function. It differs from other instance-based learners in that it computes the distance between two instances using a method motivated by information theory, in which an entropy-based distance function is used (Cleary & Trigg, 1995; Witten & Frank, 2005). The distance between instances is defined as the complexity of transforming one instance into another. The calculation of this complexity is detailed in Cleary & Trigg (1995).

1 Evans (2001) and Boyd et al. (2005) executed their experiments with the k nearest neighbor classifier, which is also a lazy learning algorithm.
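As a toy illustration of the instance-based idea, the sketch below classifies a new symbolic feature vector by its nearest training instance under a simple feature-overlap distance; this is only meant to show the general mechanism, since K* itself uses an entropy-based distance, and the feature vectors and class labels are invented for the example.

# Toy illustration of instance-based classification with a feature-overlap distance.
def overlap_distance(a, b):
    """Number of feature positions on which two symbolic vectors disagree."""
    return sum(1 for x, y in zip(a, b) if x != y)

def nearest_neighbour_class(train, new_vector):
    """train: list of (feature_vector, class_label) pairs."""
    _, label = min(train, key=lambda item: overlap_distance(item[0], new_vector))
    return label

training = [(("True", "Main", "SG"), "EXPLICIT"),
            (("False", "Rel", "SG"), "ZERO"),
            (("False", "Main", "PL"), "IMPERSONAL")]
print(nearest_neighbour_class(training, ("False", "Rel", "SG")))   # "ZERO"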
When using K*, the most effective classification is obtained with a blending parameter1 of 40%2, with the rest of the parameters left at their default values: the missingMode parameter3 set to average column entropy curves and the entropicAutoBlend parameter set to false. Table 4.2 presents the evaluation of Elliphant when exploiting the K* classifier with these parameter settings, using ten-fold cross-validation.
Class                        Precision   Recall   F-measure
Explicit subjects            0.900       0.923    0.911
Zero pronouns                0.772       0.740    0.756
Impersonal constructions     0.889       0.626    0.734

eszic training data accuracy: 0.867 (ten-fold cross-validation)

Table 4.2: eszic training data evaluation with K* -B 40 -M a.
There is a marginal reduction in accuracy when the system is evaluated using ten-fold cross-validation (0.867) instead of leave-one-out cross-validation (0.869), and its statistical significance is minimal. When the proportion of training data used is decreased, the difference in performance between the two evaluation methods remains small, reaching at most 0.005 (when 50% of the training data is used). Although leave-one-out cross-validation obtains more accurate results, as it is easier to classify test instances using almost 100% of the training data than using only 90% of it, in practice a classifier is trained and tested on instances derived from different data sets. Ten-fold cross-validation is thus a more accurate simulation of real-world classification scenarios. Moreover, it can be computed far more quickly than leave-one-out cross-validation.
1 The parameter for global blending.
2 Blending percentages up to 50% were tested.
3 The missingMode parameter determines how missing attribute values are treated.
Percentage of     Ten-fold             Leave-one-out
training data     cross-validation     cross-validation
10%               0.836                0.834
20%               0.859                0.862
30%               0.854                0.851
40%               0.855                0.858
50%               0.858                0.863
60%               0.860                0.862
70%               0.860                0.862
80%               0.865                0.863
90%               0.866                0.869
100%              0.867                0.868

Table 4.3: Leave-one-out and ten-fold cross-validation comparison (eszic training data).
4.1.2 Learning Curve
A learning curve shows how accuracy changes with varying sample sizes, plotting the
number of correctly classified instances against the number of instances in the training
data. To calculate the learning curve of the Elliphant system, the eszic training data
was used to generate ten training samples, representing 10%, 20%, 40%, 50%, 60%,
70%, 80%, 90% and 100% of the data set. The instances contained in the eszic training
file were randomly ordered so that the genre variable could not influence the results
presented below. In these experiments, the K* algorithm was used with the parameter
settings described in Section 4.1.1 and the evaluation was carried out using ten-fold
cross-validation.
The learning curve shown in Figure 4.1 presents the increase in accuracy obtained
by the Elliphant system using the eszic training data. Performance reaches a plateau
at its maximum level when using 90% of the training instances.1
Figure 4.2 displays the precision, recall and f-measure of classification for all classes
1 One thing to be noted is that the ordering of the instances makes a slight difference to the accuracy of classification. While the system obtains an accuracy of 0.867 when the instances are placed in their original order of occurrence in the eszic training data, 0.866 is obtained when the same instances are presented in random order to the classifier using ten-fold cross-validation. This difference also occurs when leave-one-out cross-validation is used. In this case, the method obtains an accuracy of 0.869 when the instances are placed in their original order of occurrence and 0.868 when presented in random order.
Figure 4.1: eszic training data learning curve for accuracy.
in the eszic training data. The values of the three measures are maximal when utilizing 90% of the training set. While recall plateaus at this sample size, precision
and f-measure decrease slightly when the amount of training data is further increased,
although this decline is not sufficiently marked to be attributed to overtraining.
Figure 4.2: eszic training data learning curve for precision, recall and f-measure.
The learning curve in Figure 4.3 shows the classification accuracy for each of the
classes while Figure 4.4 presents this accuracy in relation to the number of training
instances for each section of the eszic training data.
Under all conditions, subjects are classified with a high accuracy since the information given by the parser (collected in the features) facilitates an f-measure of 0.801
for the identification of explicit subjects. In contrast to explicit subjects, the parser does not distinguish zero pronouns from impersonal constructions; it can only recognise that a clause lacks an explicit subject. The accuracy with which these two types are classified therefore begins at a lower level (0.662 and 0.621 respectively). Classification of both zero pronouns and impersonal constructions reaches its maximum when 90% of the training data is exploited. There is also some evidence of overtraining in the classification of impersonal constructions when using 100% of the training data.
Figure 4.3: Learning curve for accuracy, recall and f-measure of the classes.
The zero pronoun class has the steepest learning curve. Utilising only 735 instances
(50% of the training set), the Elliphant system obtains an accuracy (0.741) close to that
obtained when using 100% of the training data. The learning curve for the subject class
is more gradual due to the great variety of subjects occurring in the training data. In
addition, increasing accuracy from a greater starting point (0.907 using just 20% of the
training data) is far more expensive in terms of the addition of training instances. The
impersonal sentence class is also learned rapidly by Elliphant. Utilising a training set
of only 179 instances, it reaches a classification accuracy of 0.721 (See Figure 4.4).
Figure 4.4: Learning curve for accuracy, recall and f-measure in relation to the number of instances of each class.
This demonstrates that Elliphant is not heavily reliant on very large sets of expensive training data and is able to reach adequate levels of performance when exploiting far fewer training instances. Overall, we see that only a small set of annotated instances (around 1,500) is needed to achieve reasonable results.
4.1.3 Most Effective Features
With Weka’s Attribute Selection option, it is possible to evaluate the features by
considering the individual predictive ability of each of the features along with the degree
of redundancy between them. Table 4.4 shows the relevant ordered features evaluated
using different algorithms implemented in Weka’s attribute selection module which
can handle the features type (symbolic, numerical, etc.) from the eszic training data.
The filters used for each Attribute Selection method are the ones provided by default
in Weka1 .
Considering the group of features selected using each Weka Attribute Selection
algorithm, 11 classifications using the K* classifier were made over the complete eszic
1 BestFirst filter for the CfsSubsetEval method; Attribute ranking filter for the ChiSquaredAttributeEval, FilteredAttributeEval, GainRatioAttributeEval, InfoGainAttributeEval, OneRAttributeEval, ReliefFAttributeEval and SymmetricalUncertAttributeEval methods; and Greedy Stepwise filter for the ConsistencySubsetEval and FilteredSubsetEval methods.
Weka Attribute Selection method       Selected features
CfsSubsetEval                         PARSER, NUMBER, NHPREV, NHTOT, VERBtype, PERSON
ChiSquaredAttributeEval               LEMMA, POSpos, NHTOT, NHPREV, POSpre, PARSER
ConsistencySubsetEval                 PARSER, LEMMA, NUMBER, AGREE, NHTOT, POSpos, POSpre
FilteredAttributeEval                 POSpos, LEMMA, NHPREV, NHTOT, PARSER, POSpre
FilteredSubsetEval                    PARSER, NHPREV, NHTOT
GainRatioAttributeEval                NHPREV, PARSER, PERSON, NHTOT, POSpos, CLAUSE
InfoGainAttributeEval                 POSpos, LEMMA, NHPREV, NHTOT, PARSER, POSpre
OneRAttributeEval                     NHTOT, POSpos, CLAUSE, PERSON, NHPREV, PARSER
ReliefFAttributeEval                  POSpos, VERBtype, LEMMA, PARSER, CLAUSE, POSpre
SymmetricalUncertAttributeEval        NHPREV, PARSER, NHTOT, POSpos, PERSON, LEMMA

Table 4.4: Selected features by Weka Attribute Selection methods.
training data using only the features selected by each method. Table 4.5 presents the
accuracy of each classification using ten-fold cross-validation.
The most effective group of six features in combination is the one selected by
Weka’s SymmetricalUncertAttributeEval Attribute Selection algorithm, since the classification using those six features together already offers an accuracy of 0.851. Likewise,
a group consisting of only three features (parser, nhprev, nhtot) was selected by
the FilteredSubsetEval algorithm. These three features are the most frequently selected
ones among those chosen by all the Attribute Selection methods. A classification which
exploits only the three features obtains an accuracy of 0.819.
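As an illustration of scoring a single feature's individual predictive ability, the kind of criterion behind rankers such as InfoGainAttributeEval, information gain over symbolic values can be computed as in the sketch below; the toy data is invented for the example.

# Minimal sketch: information gain of one symbolic feature with respect to the class.
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    base = entropy(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for val, lab in zip(feature_values, labels) if val == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return base - remainder

# Toy data: the PARSER feature against the three classes.
parser_values = ["True", "True", "False", "False", "False"]
classes = ["EXPLICIT", "EXPLICIT", "ZERO", "ZERO", "IMPERSONAL"]
print(round(information_gain(parser_values, classes), 3))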
A set of experiments was conducted in which features were selected on the basis
of the degree of computational effort needed to generate them. Two sets of features
were proposed. One group corresponds to features intrinsic to the parser, whose values
can be obtained by trivial exploitation of the tags produced in its output (parser,
Weka Attribute Selection method       Accuracy
CfsSubsetEval                         0.824
ChiSquaredAttributeEval               0.848
ConsistencySubsetEval                 0.843
FilteredAttributeEval                 0.848
FilteredSubsetEval                    0.819
GainRatioAttributeEval                0.833
InfoGainAttributeEval                 0.848
OneRAttributeEval                     0.833
ReliefFAttributeEval                  0.825
SymmetricalUncertAttributeEval        0.851

Table 4.5: Classification using the selected feature groups: accuracy.
lemma, person, pospos , pospre ). The second group of features (clause, agree,
nhprev, nhtot, verbtype ) has values derived by methods extrinsic to the parser
and rules for the recognition of elements that are independent of it. Derivation of
this second group of features necessitated the implementation of more sophisticated
modules to identify the boundaries of syntactic constituents such as clauses and noun
phrases. These modules are rule-based and operate over the often erroneous output
of the parser (see Section 3.2.4). The results obtained when the classifier exclusively
exploits each of these intrinsic and extrinsic groups of features are shown in Tables 4.6
and 4.7.
A recurrent issue in anaphora resolution studies is determining the quantity and
type of knowledge needed for identification of candidates and selection of a candidate
as antecedent. In Mitkov (2002) it is stated that, given the natural linguistic ambiguity
of various cases, the resolution of any kind of anaphor requires not only morphological,
lexical, and syntactic knowledge but also semantic knowledge, discourse knowledge, and
real world knowledge. Nevertheless, current anaphora resolution methods rely mainly
on restrictions and preference heuristics, which employ information originating from
morpho-syntactic or shallow semantic analysis (Ferrández & Peral, 2000; Mitkov, 1998),
while some previous approaches have exploited full parsing (Hobbs, 1977; Lappin &
Leass, 1994). As described in this dissertation, Elliphant makes use of deep dependency
parsing plus the morphological knowledge contained in the verb lists used.
There are two findings of note in Tables 4.6 and 4.7. The first is that no impersonal constructions are identified when only features extrinsic to the parser are used. The second
is that there is a reduction in recall when using only intrinsic features. It is therefore
better to classify instances using a feature group that combines both types of features.
eszic training data          Precision   Recall   F-measure
Explicit subjects            0.865       0.891    0.878
Zero pronouns                0.654       0.664    0.659
Impersonal constructions     0           0        0

Extrinsic parser features, eszic training data accuracy: 0.808

Table 4.6: Extrinsic parser features classification results.
eszic training data          Precision   Recall   F-measure
Explicit subjects            0.779       0.983    0.869
Zero pronouns                0.866       0.312    0.459
Impersonal constructions     0.944       0.285    0.438

Intrinsic parser features, eszic training data accuracy: 0.789

Table 4.7: Intrinsic parser features classification results.
To estimate the weight of each feature, classifications were made in which each
feature was omitted from the training instances that were presented to the classifier
and ten-fold cross-validation was applied. Table 4.8 presents the accuracy of these
classifications. Omission of every feature except A led to a reduction in accuracy, justifying their inclusion in the training instances.
Feature omitted   Accuracy      Feature omitted   Accuracy
PARSER            0.854         VERBtype          0.863
NHTOT             0.860         NUMBER            0.864
LEMMA             0.861         INF               0.864
POSpos            0.861         AGREE             0.865
NHPREV            0.862         POSpre            0.866
PERSON            0.863         SE                0.866
CLAUSE            0.863         A                 0.867

Table 4.8: Single feature omission classifications: accuracy.
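The single-feature omission experiment just described can be sketched as a simple loop over the feature set; the `evaluate` callback below is a placeholder standing in for a full train-and-cross-validate run of the classifier (a dummy is supplied so the sketch runs), not the Weka procedure itself.

# Minimal sketch: leave-one-feature-out ablation over the training instances.
FEATURES = ["PARSER", "CLAUSE", "LEMMA", "NUMBER", "PERSON", "AGREE",
            "NHPREV", "NHTOT", "INF", "SE", "A", "POSpre", "POSpos", "VERBtype"]

def ablation_accuracies(instances, evaluate):
    """Return {omitted_feature: accuracy} for each single-feature omission."""
    results = {}
    for omitted in FEATURES:
        reduced = [{key: value for key, value in instance.items() if key != omitted}
                   for instance in instances]
        results[omitted] = evaluate(reduced)
    return results

# Runnable with a dummy evaluator; in the experiments reported above the
# evaluation was the K* classifier under ten-fold cross-validation.
dummy_instances = [{f: "value" for f in FEATURES + ["CLASS"]}]
print(ablation_accuracies(dummy_instances, evaluate=lambda data: 0.0))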
4.1.4 Genre Analysis
As the eszic training data is composed of instances belonging to two different genres
(legal and health), two subgroups of the eszic training data were generated: the Legal
eszic training data and the Health eszic training data containing all the instances
derived from legal and health texts, respectively. A comparative evaluation using ten-fold cross-validation over the two subgroups shows that Elliphant is more successful
when classifying instances of explicit subjects in legal texts (see Table 4.9). This may
be explained by the uniformity of the sentences in the legal texts which present less
variation than the ones from the health genre. Texts from the health genre present
the additional complication of specialised named entities and acronyms which are used
quite frequently in the health texts from the eszic Corpus (i.e.: CCDSD1 , DSM-IV2 or
TLP3 ). Further, there is a larger number of explicit subjects in the legal training data
(2,739, compared with 2,116 explicit subjects occurring in the health texts). Similarly,
better performance in the detection of zero pronouns and impersonal sentences in the
health texts may be due to their higher occurrence in the health genre: 108 impersonal
constructions and 1,174 zero pronouns compared with 71 impersonal constructions and
619 zero pronouns in the legal texts (see Table 3.2 for details about the number of class
instances in each subgroup of the training data).
Class                                      Precision   Recall   F-measure
Legal genre Explicit subjects              0.920       0.955    0.937
Health genre Explicit subjects             0.881       0.888    0.884
Legal genre Zero pronouns                  0.761       0.649    0.701
Health genre Zero pronouns                 0.784       0.796    0.790
Legal genre Impersonal constructions       0.786       0.620    0.693
Health genre Impersonal constructions      0.905       0.620    0.736

Legal genre accuracy: 0.893 (ten-fold cross-validation)
Health genre accuracy: 0.848 (ten-fold cross-validation)

Table 4.9: Legal and health genres comparative evaluation.
1 Cuestionario Clínico para el Diagnóstico del Síndrome Depresivo (Clinic Questionnaire for Depressive Syndrome Diagnosis).
2 Manual Diagnóstico y Estadístico de los Trastornos Mentales IV (Diagnostic and Statistical Manual of Mental Disorders IV).
3 Trastorno límite de la personalidad (Borderline Personality Disorder).
We have also studied the effect of training the classifier on data derived from one genre and testing it on instances derived from a different genre. Table 4.10 shows that instances from legal texts are not only more homogeneous, as the classifier obtains higher accuracy when training and testing only on legal instances (0.895), but also more informative: when both the legal and health genres are combined as training data, testing the algorithm only on instances from the health genre yields a markedly higher accuracy (0.933). These results imply that the instances from the health genre are the most heterogeneous ones. Subsets of legal documents where our method achieves an accuracy of 0.942 were also found.
                    Testing set
Training set        Legal     Health    eszic Corpus (all)
Legal               0.895     0.859     0.885
Health              0.858     0.841     0.887
eszic Corpus        0.920     0.933     0.869

Accuracy: cross-genre training and testing (ten-fold cross-validation)

Table 4.10: Cross-genre training and testing evaluation.
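A cross-genre cell of Table 4.10 corresponds to building the classifier on the instances of one genre and evaluating it on the instances of another. A minimal sketch, again with hypothetical file names, follows.

import weka.classifiers.Evaluation;
import weka.classifiers.lazy.KStar;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossGenreEvaluation {
    public static void main(String[] args) throws Exception {
        // Hypothetical genre-specific ARFF files sharing the same attribute declarations.
        Instances train = new DataSource("eszic-legal.arff").getDataSet();
        Instances test  = new DataSource("eszic-health.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        KStar classifier = new KStar();
        classifier.buildClassifier(train);        // train on one genre only

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(classifier, test);     // test on the other genre
        System.out.printf("Cross-genre accuracy: %.3f%n", eval.pctCorrect() / 100.0);
    }
}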
4.2 Comparative Evaluation
Due to the lack of previous work on this topic, a comparison with other methods is
not feasible. Although the approach of Ferrández & Peral (2000) bears some similarities
to ours, they use a different definition of zero pronouns, and therefore a direct comparison
is not appropriate. As a guideline, the results obtained by Connexor's Machinese Syntax
parser are presented regarding the existence (or not) of a subject inside the clause. Since
this parser does not distinguish between referential and non-referential elliptic subjects,
both categories have been merged into one. Needless to say, a comparison of the results
obtained by these two methods should be made with caution; they are presented here only
as a point of reference. It is clear from the figures that the Elliphant system obtains a
higher f-measure not only in the classification of both elliptic subject classes but also in
the classification of the non-omitted subject class.
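For reference, the merging of the two elliptic categories can be illustrated by mapping both the gold-standard and the predicted ternary labels onto a binary explicit/elided distinction before scoring. The sketch below uses illustrative label strings rather than the exact values of the eszic class attribute, and is intended as a helper used with an already trained classifier.

import weka.classifiers.Classifier;
import weka.core.Instance;
import weka.core.Instances;

public class MergedElisionComparison {

    // Illustrative label: the actual value string used for explicit subjects may differ.
    private static final String EXPLICIT = "explicit";

    // Collapse the ternary label into a binary one: explicit vs. elided subject.
    static String toBinary(String label) {
        return label.equals(EXPLICIT) ? EXPLICIT : "elided";
    }

    // Accuracy of a trained classifier once zero pronouns and impersonal
    // constructions are merged into a single elided-subject category.
    static double binaryAccuracy(Classifier classifier, Instances test) throws Exception {
        int correct = 0;
        for (int i = 0; i < test.numInstances(); i++) {
            Instance inst = test.instance(i);
            String gold = toBinary(test.classAttribute().value((int) inst.classValue()));
            String predicted = toBinary(test.classAttribute().value((int) classifier.classifyInstance(inst)));
            if (gold.equals(predicted)) {
                correct++;
            }
        }
        return (double) correct / test.numInstances();
    }
}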
The evaluation was carried out using both the entire set of eszic training data and
also the genre-specific subsets of the training data (Legal and Health eszic training
data). The evaluation of the Elliphant system was carried out using leave-one-out cross-validation.
eszic training data                       Precision    Recall    F-measure
Elliphant Explicit subjects               0.901        0.924     0.913
Elliphant Zero pronouns                   0.774        0.743     0.758
Elliphant Impersonal constructions        0.889        0.626     0.734

Elliphant eszic training data accuracy: 0.869 (leave-one-out cross-validation)

Table 4.11: Elliphant eszic training data results.
eszic training data                                      Precision    Recall    F-measure
Machinese Explicit subjects                              0.911        0.716     0.802
Machinese Zero pronouns + Impersonal constructions       0.543        0.829     0.656

Machinese eszic training data accuracy: 0.749

Table 4.12: Machinese eszic training data results.
When evaluating over the entire eszic training set, Elliphant outperforms the parser in
terms of overall accuracy (0.869 compared with 0.749). In the detection of explicit subjects,
Elliphant obtains a considerably higher recall (0.924 compared with the parser's 0.716). The
averaged evaluation measures obtained by Elliphant for the identification of zero pronouns
and impersonal constructions (precision: 0.831; recall: 0.684; f-measure: 0.746) were also
compared with those of the parser. This comparison demonstrated Elliphant's superiority
over Connexor's Machinese Syntax parser in this task for all measures except recall.
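These averaged figures appear to be simple macro-averages of the per-class values in Table 4.11, for example:

P = (0.774 + 0.889) / 2 ≈ 0.831
R = (0.743 + 0.626) / 2 ≈ 0.684
F = (0.758 + 0.734) / 2 = 0.746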
Legal genre eszic training data                          Precision    Recall    F-measure
Legal genre Elliphant Explicit subjects                  0.922        0.955     0.938
Legal genre Elliphant Zero pronouns                      0.760        0.654     0.934
Legal genre Elliphant Impersonal constructions           0.797        0.662     0.723

Elliphant Legal eszic training accuracy: 0.895

Table 4.13: Elliphant Legal eszic training results.
When processing only the Legal eszic training data, the accuracy of the parser is
reduced (0.726), while the performance of the Elliphant system is improved (0.895).
Legal genre eszic training data                                         Precision    Recall    F-measure
Legal genre Machinese Explicit subjects                                 0.940        0.702     0.803
Legal genre Machinese Zero pronouns + Impersonal constructions          0.410        0.823     0.547

Machinese Legal eszic training accuracy: 0.726

Table 4.14: Machinese Legal eszic training results.
The two systems were also used to classify instances of elision (zero pronouns and impersonal
constructions) in texts from the legal genre. The averaged evaluation measures obtained by
the Elliphant system (precision: 0.778; recall: 0.658; f-measure: 0.828) were found to be
superior to those obtained by the parser (precision: 0.675; recall: 0.763; f-measure: 0.675)
for all measures except recall.
Health genre eszic training data                         Precision    Recall    F-measure
Health genre Elliphant Explicit subjects                 0.879        0.879     0.879
Health genre Elliphant Zero pronouns                     0.773        0.795     0.784
Health genre Elliphant Impersonal constructions          0.882        0.620     0.728

Elliphant Health eszic training data accuracy: 0.841

Table 4.15: Elliphant Health eszic training data results.
Health genre eszic training data                                        Precision    Recall    F-measure
Health genre Machinese Explicit subjects                                0.879        0.735     0.801
Health genre Machinese Zero pronouns + Impersonal constructions         0.656        0.833     0.734

Machinese Health eszic training data accuracy: 0.772

Table 4.16: Machinese Health eszic training data results.
When classifying instances derived from texts in the health genre (using Health
eszic training data), the accuracy of both the Elliphant system and the parser was
reduced. However, Elliphant still outperforms the parser in this context.
When considering the classification of instances of elision in the health genre, Connexor's Machinese Syntax parser obtains a higher recall than Elliphant, whose averaged evaluation measures for these two classes are precision: 0.827; recall: 0.707; f-measure: 0.756.
Nevertheless, unlike the parser, the Elliphant system distinguishes referential (zero
pronouns) and non-referential (impersonal constructions) elided subjects. This can be
considered one of its main contributions as this task is necessary in order to improve
practical anaphora resolution systems.
Chapter 5
Conclusions and Future Work
In this dissertation, a machine learning approach to the identification of zero pronouns,
impersonal constructions, and explicit subjects was presented. In treating this range
of classes, complete coverage is provided for all possible constituents which may occur
in subject position in Spanish clauses.
In order to enable a machine learning approach to classification, a parsed corpus of
Spanish texts from the health and legal genres was compiled. The corpus was manually
annotated to encode information about the element in subject position for every finite
verb in the corpus (the eszic Corpus). A set of 14 features was formulated and training
data consisting of 6,827 instances represented by vectors of the feature values was created (eszic training data). The training data was utilised by classification algorithms
distributed with the Weka package. Empirical observation revealed that use of the K*
algorithm was optimal for the purpose of this classification. The performance of this
machine learning approach was compared with that of Connexor’s Machinese Syntax
parser. Elliphant offers a classification with superior accuracy in the recognition of
both of the elliptic classes (zero pronouns and impersonal constructions), and also in
the classification of the non-elliptic subject class (explicit subjects). The method presented in this dissertation is also able to identify impersonal constructions in Spanish.
This is a task which appears not to have been dealt with before in the literature.
In addition to algorithm selection, further experiments carried out with the underlying
method addressed parameter optimisation, the most effective combinations of features, the
optimal number of instances to include in the training data, and the relationship between
the results and the different genres on which the Elliphant system was tested. This chapter presents the findings
of all of these experiments (see section 5.1). In future research, it is intended that optimisation of the approach and its adaptability to other genres will be investigated in
more depth (see section 5.2).
5.1 Main Observations
Algorithm selection: the instance-based learning algorithm K* was selected for the classification of elliptic vs. explicit subject instances and referential vs. non-referential subject
instances. This decision was taken after comparing the accuracy of this classifier with that
of the other classifiers available in the Weka package. In terms of accuracy, the K* algorithm
is closely followed by the Bayes-based algorithms in Weka.
Parameter optimisation: the impact of the parameter settings on the performance of the
K* classifier was investigated. Although Weka provides sensible default settings, it is by no
means certain that they are optimal for this particular task. The default settings were
changed so that a blending parameter of 40% was used for the K* algorithm.
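With the Weka Java API, this setting can be reproduced roughly as follows; the sketch is illustrative and the ARFF file name is hypothetical.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.KStar;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BlendSetting {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("eszic-training.arff").getDataSet(); // hypothetical name
        data.setClassIndex(data.numAttributes() - 1);

        KStar kstar = new KStar();
        kstar.setGlobalBlend(40);   // blending parameter of 40% instead of Weka's default value

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(kstar, data, 10, new Random(1));
        System.out.printf("Accuracy with a 40%% blend: %.3f%n", eval.pctCorrect() / 100.0);
    }
}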
Feature selection: the set of experiments conducted to determine an optimal
group of features to be utilised by the classification algorithm revealed that of the entire set of 14 features, the most effective group comprises six of the features: nhprev
(number of noun phrases previous to the verb), parser (parsed subject), nhtot (number of noun phrases in the clause), pospos (four pos following the verb), person (verb
morphological person), and lemma (verbal lemma). This study showed that feature a
(preposition a) does not make any meaningful contribution to the classification.
Training data required: learning curve experiments showed the correlation between the
accuracy of the classifier and the size of the training set; the classifier's performance reaches
a plateau at its maximum level when 90% of the available data is used.
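A learning curve of this kind can be approximated by cross-validating the classifier on progressively larger random subsamples of the training data. The following sketch uses Weka's unsupervised Resample filter and a hypothetical file name.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.KStar;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.instance.Resample;

public class LearningCurve {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("eszic-training.arff").getDataSet(); // hypothetical name
        data.setClassIndex(data.numAttributes() - 1);

        for (int percent = 10; percent <= 100; percent += 10) {
            Resample resample = new Resample();
            resample.setSampleSizePercent(percent);
            resample.setNoReplacement(true);          // draw a subsample rather than a bootstrap
            resample.setInputFormat(data);
            Instances sample = Filter.useFilter(data, resample);
            sample.setClassIndex(sample.numAttributes() - 1);

            Evaluation eval = new Evaluation(sample);
            eval.crossValidateModel(new KStar(), sample, 10, new Random(1));
            System.out.printf("%3d%% of the data: accuracy %.3f%n", percent, eval.pctCorrect() / 100.0);
        }
    }
}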
Genre interference: the performance of the Elliphant system was evaluated separately on
two different genres, legal and health, showing that there is some genre interference in the
classification tasks. Elliphant classifies explicit subjects in legal texts with a higher accuracy
than in health texts. By contrast, zero pronouns and impersonal constructions are more
accurately classified in health texts. Cross-genre training and testing demonstrated that
legal instances are more informative and
homogeneous than health genre cases.
5.2 Future Research
Future research goals are related to improvements in: (1) optimisation of the Elliphant
system, (2) adaptation of the system to other genres, (3) inter-annotator agreement for
the eszic Corpus, (4) the comparison of Elliphant with a rule-based approach and (5) the
design of an algorithm to resolve zero anaphora in Spanish.
Firstly, with regard to further improvement of the Elliphant system, the interaction
between (a) feature selection and parameter optimisation, and (b) class distribution will
be addressed. In related work, it was found that optimal settings for feature selection
and parameter optimisation should not be sought independently of one another since
there is an interaction between the two. The joint optimisation of feature selection
and parameter optimisation can cause variations in the accuracy levels obtained by
classifiers (Hoste, 2005). Additionally, an investigation will be made into how the class
distribution of the data affects learning. This will facilitate the compilation of an
optimal set of training instances, as it has been found that training data containing a
lower proportion of negative instances can be beneficial to classification (Hoste, 2005).
In future work, evaluation and learning curve experiments in which training instances
derived from texts in one genre are used to classify instances derived from texts in a different
genre will provide insight into the optimal type and combination of training data, enabling
better classification with fewer instances across various types and genres of text, as well as
adding robustness to our system.
Inter-annotator agreement will be measured, and it is planned to design a rule-based
algorithm to identify and resolve zero anaphora in Spanish, as there is some debate about
which approach, machine learning or rule-based, brings optimal performance when applied
in anaphora resolution systems (Mitkov, 2002).
References
Aldea Muñoz, S. (2003). Un caso de intervención psicológica de la depresión infantil. psiquiatria.com, 7. 28
Aldea Muñoz, S. (2006). Influencia del autoconcepto y de la competencia social en la depresión infantil. psiquiatria.com, 10. 28
Alonso-Ovalle, L. & D’Introno, F. (2000). Full and null pronouns in Spanish: the zero
pronoun hypothesis. In H. Campos, E. Herburger, A. Morales-Front & T.J. Walsh, eds.,
Hispanic linguistics at the turn of the millennium. Papers from the 3rd Hispanic Linguistics
Symposium, 189–210, Cascadilla Press, Sommerville, MA. 6
Balcázar Nava, P., Bonilla Muñoz, M.P., Gurrola Peña, G.M., Oudhof van Barneveld, H. & Aguilar Mercado, M.R. (2005). La depresión como problema de salud
mental en los adolescentes mexicanos. psiquiatria.com, 9. 28
Barreras, J. (1993). Resolución de elipsis y técnicas de parsing en una interficie de lenguaje
natural. Procesamiento del lenguaje natural , 13, 247–258. 7, 8
Beavers, J. & Sag, I. (2004). Coordinate ellipsis and apparent non-constituent coordination.
In S. Müller, ed., Proceedings of the 11th International Conference on Head-Driven Phrase
Structure Grammar (HPSG-04), 48–69, CSLI Publications, Stanford, CA. 17
Bello, A. ([1847] 1981). Gramática de la lengua castellana destinada al uso de los americanos.
Instituto Universitario de Lingüística Andrés Bello, Cabildo Insular de Tenerife, Santa Cruz
de Tenerife. 15, 19
Bergsma, S., Lin, D. & Goebel, R. (2008). Distributional identification of non-referential
pronouns. In Proceedings of the 46th Annual Meeting of the Association for Computational
Linguistics: Human Language Technologies (ACL/HLT-08), 10–18. 2, 10, 12
Bosque, I. (1989). Clases de sujetos tácitos. In J. Borrego Nieto, ed., Philologica: homenaje
a Antonio Llorente, vol. 2, 91–112, Servicio de Publicaciones, Universidad Pontificia de
Salamanca, Salamanca. 15, 16, 18, 19, 24
Boyd, A., Gegg-Harrison, W. & Byron, D. (2005). Identifying non-referential it: a
machine learning approach incorporating linguistically motivated patterns. In Proceedings
of the ACL Workshop on Feature Engineering for Machine Learning in Natural Language
Processing. 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05),
40–47. 8, 10, 12, 13, 22, 45
Brucart, J.M. (1987). La elisión sintáctica en español . Universitat Autònoma de Barcelona,
Bellaterra. 15
Brucart, J.M. (1999). La elipsis. In I. Bosque & V. Demonte, eds., Gramática descriptiva de
la lengua española, vol. 2, 2787–2863, Espasa-Calpe, Madrid. ix, 15, 16, 17, 19, 23, 24
Carden, G. (1982). Backwards anaphora in discourse context. Journal of Linguistics, 18,
361–87. 33, 34
Chinchor, N. & Hirschman, L. (1997). MUC-7 Coreference task definition (version 3.0). In
Proceedings of the 1997 Message Understanding Conference (MUC-97). 2
Chomsky, N. (1965). Aspects of the theory of syntax . The MIT Press, Cambridge, MA. 15
Chomsky, N. ([1968] 2006). Language and mind . Cambridge University Press, Cambridge, 3rd
edn. 14
Chomsky, N. (1981). Lectures on government and binding. Mouton de Gruyter, Berlin, New
York. 1, 6, 19
Chomsky, N. (1995). The minimalist program. The MIT Press, Cambridge, MA. 15
Chung, S., Ladusaw, W. & McCloskey, J. (1995). Sluicing and logical form. Natural
Language Semantics, 3, 239–282. 17
Cleary, J. & Trigg, L. (1995). K*: an instance-based learner using an entropic distance
measure. In Proceedings of the 12th International Conference on Machine Learning (ICML95), 108–114. 13, 45, 46
Clemente, J., Torisawa, K. & Satou, K. (2004). Improving the identification of nonanaphoric it using Support Vector Machines. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA/BioNLP04), 58–61. 10, 12
Código Civil (1889). Texto de la edición del Código Civil mandada publicar por el Real
Decreto de 24 del corriente en cumplimiento de la ley de 26 de mayo último. Gaceta de
Madrid , 206, 249–312. 26
Connexor Oy (2006a). Conexor functional dependency grammar 3.7. User’s manual . 29, 38,
39
Connexor Oy (2006b). Machinese language model . 13, 29, 35
Constitución Española (1978). Constitución Española de 27 de diciembre de 1978. Boletín
Oficial del Estado, 311, 29313–29424. 26
Corpas Pastor, G. (2008). Investigar con corpus en traducción: los retos de un nuevo
paradigma. Peter Lang, Frankfurt am Main. 7, 8
Corpas Pastor, G., Mitkov, R., Afzal, N. & Pekar, V. (2008). Translation universals:
do they exist? A corpus-based NLP study of convergence and simplification. In Proceedings of
the 8th Conference of the Association for Machine Translation in the Americas (AMTA-08),
75–81. 2, 7, 8, 10
Danlos, L. (2005). Automatic recognition of French expletive pronoun occurrences. In R. Dale,
K.F. Wong, J. Su & O.Y. Kwong, eds., Natural language processing. Proceedings of the
2nd International Joint Conference on Natural Language Processing (IJCNLP-05), 73–78,
Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 3651. 2,
10, 11, 12
Denber, M. (1998). Automatic resolution of anaphora in English. Tech. rep., Eastman Kodak
Co. 10, 11, 12
Díaz Morfa, J. (2004). La crisis de las aventuras en las relaciones de pareja. psiquiatria.com,
8. 28
Díscolo, A. ([2nd century] 1987). Sintaxis. Gredos, Madrid. 14
Evans, R. (2000). A comparison of rule-based and machine learning methods for identifying
non-nominal it. In D.N. Christodoulakis, ed., Natural Language Processing - NLP 2000. Proceedings of the 2nd International Conference on Natural Language Processing (NLP-2000),
233–241, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol.
1835. 10, 12
Evans, R. (2001). Applying machine learning: toward an automatic classification of it. Literary
and Linguistic Computing, 16, 45–57. 2, 10, 12, 13, 22, 29, 38, 45
Fernández Soriano, O. & Táboas Baylín, S. (1999). Construcciones impersonales no
reflejas. In I. Bosque & V. Demonte, eds., Gramática descriptiva de la lengua española,
vol. 2, 1631–1722, Espasa-Calpe, Madrid. 18, 19
Ferrández, A. & Peral, J. (2000). A computational approach to zero-pronouns in Spanish.
In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics
(ACL-2000), 166–172. 2, 6, 7, 8, 9, 11, 17, 22, 52, 55
Ferrández, A., Palomar, A. & Moreno, L. (1997). El problema del núcleo del sintagma
nominal: ¿elipsis o anáfora? Procesamiento del lenguaje natural , 20, 13–26. 24
Ferrández, A., Palomar, A. & Moreno, L. (1998). Anaphor resolution in unrestricted
texts with partial parsing. In Proceedings of the 36th Annual Meeting of the Association for
Computational Linguistics and 17th International Conference on Computational Linguistics
(ACL/COLING-98), 385–391. 9
Ferrández, A., Palomar, A. & Moreno, L. (1999). An empirical approach to Spanish
anaphora resolution. Machine Translation, 14, 191–216. 9
Fiengo, R. & May, R. (1994). Indices and identity. The MIT Press, Cambridge MA. 17
Francis, W. (1958). The structure of American English. Ronald Press, New York. 15
Fries, C. (1940). American English grammar . Appleton-Century-Crofts, New York. 15
García Jurado, F. (2007). La etimología como historia de las palabras. E-excellence, Área
de Cultura Clásica, Filología Clásica, 39, 1–27. 14
García Losa, E. (2008). Efectividad, operatividad y potenciación del tratamiento en patología
fóbica, en el contexto de los servicios especializados de salud mental públicos: la utilización
en la sala de consulta de los recursos de Internet. psiquiatria.com, 12. 26
Gómez Torrego, L. (1992). La impersonalidad gramatical: descripción y norma. Arco Libros,
Madrid. 17, 18, 19, 23, 25
Grice, H. (1975). Logic and conversation. In P. Cole & J.L. Morgan, eds., Syntax and semantics, vol. 3: Speech Acts, 41–58, Academic Press, New York. 15
Gundel, J., Hedberg, N. & Zacharski, R. (2005). Pronouns without NP antecedents:
how do we know when a pronoun is referential? In A. Branco, T. McEnery & R. Mitkov,
eds., Anaphora processing: linguistic, cognitive and computational modelling, 351–364, John
Benjamins, Amsterdam. 10, 12
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. & Witten, I.H.
(2009). The WEKA data mining software: an update. SIGKDD Explorations, 11, 10–18. 41
Halliday, M.A.K. & Hasan, R. (1976). Cohesion in English. Longman, London. 15
Han, N. (2004). Korean null pronouns: classification and annotation. In Proceedings of the
Workshop on Discourse Annotation. 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), 33–40. 7
Hernández Terrés, J.M. (1984). La elipsis en la teoría gramatical. Universidad de Murcia,
Murcia. 14
Hirano, T., Matsuo, Y. & Kikui, G. (2007). Detecting semantic relations between named
entities in text using contextual features. In Proceedings of the 45th Annual Meeting of the
Association for Computational Linguistics. Companion volume proceedings of the demo and
poster sessions (ACL-05), 157–160. 2, 7, 8
Hobbs, J. (1977). Resolving pronoun references. Lingua, 44, 311–338. 52
Hoste, V. (2005). Optimization issues in machine learning of coreference resolution. Ph.D.
thesis, University of Antwerp. 61
Hu, Q. (2008). A corpus-based study on zero anaphora resolution in Chinese discourse. Ph.D.
thesis, City University of Hong Kong. 7, 8
Iida, R., Inui, K. & Matsumoto, Y. (2006). Exploiting syntactic patterns as clues in zeroanaphora resolution. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics and the 21st International Conference on Computational Linguistics
(ACL/COLING-06), 625–632. 7, 8
Iida, R., Kentaro, I. & Matsumoto, Y. (2009). Capturing salience with a trainable cache
model for zero-anaphora resolution. In Proceedings of the Joint Conference of the 47th Annual
Meeting of the Association for Computational Linguistics and the 4th International Conference on Natural Language Processing of the Asian Federation of Natural Language Processing
(ACL/AFNLP-09), 647–655. 2, 7, 8
Imamura, K., Saito, K. & Izumi, T. (2009). Discriminative approach to predicate-argument
structure analysis with zero-anaphora resolution. In Proceedings of the Joint Conference
of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th
International Conference on Natural Language Processing of the Asian Federation of Natural
Language Processing (ACL/AFNLP-09), 85–88. 2, 7, 8
Isozaki, H. & Hirao, T. (2003). Japanese zero pronoun resolution based on ranking rules
and machine learning. In Theoretical Issues in Natural Language Processing. Proceedings of
the 2003 Conference on Empirical Methods in Natural Language Processing (EMNLP-03),
184–191. 7, 8
Järvinen, T. & Tapanainen, P. (1998). Towards an implementable dependency grammar. In
A. Polguère & S. Kahane, eds., Proceedings of the Workshop on Processing of DependencyBased Grammars. 36th Annual Meeting of the Association for Computational Linguistics and
17th International Conference on Computational Linguistics (ACL/COLING-98), 1–10. 28
Järvinen, T., Laari, M., Lahtinen, T., Paajanen, S., Paljakka, P., Soininen, M. &
Tapanainen, P. (2004). Robust language analysis components for practical applications. In
Proceedings of the 20th International Conference on Computational Linguistics (COLING04), 53–56. 28, 29
Kawahara, D. & Kurohashi, S. (2004). Improving Japanese zero pronoun resolution by
global word sense disambiguation. In Proceedings of the 20th International Conference on
Computational Linguistics (COLING-04), 343–349. 2, 7, 8
Kibrik, A.A. (2004). Zero anaphora vs. zero person marking in Slavic: a chicken/egg dilemma?
In Proceedings of the 5th Discourse Anaphora and Anaphor Resolution Colloquium (DAARC04), 87–90. 2, 7, 8
Kratzer, A. (1998). More structural analogies between pronouns and tenses. In Proceedings
of Semantics and Linguistic Theory VIII (SALT-88), Cornell University, Ithaca, NY. 6
Kuno, S. (1972). Functional sentence perspective: a case study from Japanese and English.
Linguistic Inquiry, 3, 269–320. 33
Lambrecht, K. (2001). A framework for the analysis of cleft constructions. Linguistics, 39,
463–516. 10, 12
Lancelot, C. & Arnauld, A. ([1660] 1980). Gramática general y razonada. Sociedad General
Española de Librerı́a, Madrid. 14
Lappin, S. & Leass, H. (1994). An algorithm for pronominal anaphora resolution. Computational Linguistics, 20, 535–561. 10, 11, 12, 52
Lee, S. & Byron, D. (2004). Semantic resolution of zero and pronoun anaphors in Korean. In
Proceedings of the 5th Discourse Anaphora and Anaphor Resolution Colloquium (DAARC04), 103–108. 2, 7
Lee, S., Byron, D. & Jang, S. (2005). Why is zero marking important in Korean? In
R. Dale, K.F. Wong, J. Su & O.Y. Kwong, eds., Natural language processing. Proceedings
of the 2nd International Joint Conference on Natural Language Processing (IJCNLP-05),
588–599, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol.
3651. 7
Ley 29/1998 (1998). Ley 29/1998, de 13 de julio, reguladora de la Jurisdicción Contencioso-administrativa. Boletín Oficial del Estado, 167, 23516–23551. 26
Ley 29/2005 (2005). Ley 29/2005, de 29 de diciembre, de Publicidad y Comunicación Institucional. Boletín Oficial del Estado, 312, 42902–42905. 26
Ley 3/1991 (1991). Ley 3/1991, de 10 de enero, de Competencia Desleal. Boletín Oficial del
Estado, 10, 959–962. 26
Ley Orgánica 10/1995 (1995). Ley Orgánica 10/1995, de 23 de noviembre, del Código Penal.
Boletín Oficial del Estado, 281, 33987–34058. 26
Ley Orgánica 1/2002 (2002). Ley Orgánica 1/2002, de 22 de marzo, reguladora del Derecho
de Asociación. Boletín Oficial del Estado, 73, 11981–11991. 26
Ley Orgánica 6/2001 (2001). Ley Orgánica 6/2001, de 21 de diciembre, de Universidades.
Boletín Oficial del Estado, 307, 49400–49425. 26
Li, Y., Musilek, P. & Wyard-Scott, L. (2009). Identification of pleonastic it using the
web. Computer Engineering, 34, 339–389. 10, 12
López Ortega, M.A. (2009). El cine como herramienta ilustrativa en la enseñanza de los
trastornos de la personalidad. psiquiatria.com, 13. 26
Manning, C. & Schütze, H. (1999). Foundations of statistical natural language processing.
The MIT Press, Cambridge, MA. 41
Matsui, T. (1999). Approaches to Japanese zero pronouns: centering and relevance. In
D. Cristea, N. Ide & D. Marcu, eds., Proceedings of the Workshop on the Relation of Discourse/Dialogue Structure and Reference. 37th Annual Meeting of the Association Computational Linguistics (ACL-99), 11–20. 2, 7, 8
Mel’čuk, I. (2003). Levels of dependency in linguistic description: concepts and problems.
In Dependency and valency. An International handbook of contemporary research, 188–229,
Mouton de Gruyter, Berlin, New York. 17
Mel’čuk, I. (2006). Zero sign in morphology. In Aspects of the theory of morphology, 447–495,
Mouton de Gruyter, Berlin, New York. 6, 19
Mendikoetxea, A. (1994). La semántica de la impersonalidad. In C. Sánchez, ed., Las construcciones con se, 239–267, Visor, Madrid. 18
Mendikoetxea, A. (1999). Construcciones con se: medias, pasivas e impersonales. In
I. Bosque & V. Demonte, eds., Gramática descriptiva de la lengua española, vol. 2, 1575–1630,
Espasa-Calpe, Madrid. 18
Merchant, J. (2001). The syntax of silence. Sluicing, islands and the theory of ellipsis. Oxford
University Press, Oxford. 17
Mitkov, R. (1998). Robust pronoun resolution with limited knowledge. In Proceedings of the
36th Annual Meeting of the Association for Computational Linguistics and 17th International
Conference on Computational Linguistics (ACL/COLING-98), 869–875. 12, 52
Mitkov, R. (2001). Outstanding issues in anaphora resolution. In A. Gelbukh, ed., Proceedings of the 2nd International Conference on Computational Linguistics and Intelligent Text
Processing (CICLing-01), 110–125, Springer, Berlin, Heidelberg, New York, Lecture Notes
in Computer Science, Vol. 2004. 10
Mitkov, R. (2002). Anaphora resolution. Longman, London. 6, 8, 10, 33, 52, 61
Mitkov, R. (2010). Discourse processing. In A. Clark, C. Fox & S. Lappin, eds., The handbook of computational linguistics and natural language processing, 599–629, Wiley Blackwell,
Oxford. 2, 5, 10
Mitkov, R., Evans, R. & Orasan, C. (2002). A new, fully automatic version of Mitkov’s
knowledge-poor pronoun resolution method. In Proceedings of the 3rd International Conference on Computational Linguistics and Intelligent Text Processing (CICLing-02), 69–83,
Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 2276. 10,
12
Molina López, D. (2008). Y de los hermanos ¿qué? Cómo ayudar a los hermanos de un TLP.
psiquiatria.com, 12. 28
Mori, T. & Nakagawa, H. (1996). Zero pronouns and conditionals in Japanese instruction
manuals. In Proceedings of the 16th International Conference on Computational Linguistics
(COLING-96), 782–787. 7, 8
Müller, C. (2006). Automatic detection of nonreferential it in spoken multi-party dialog. In
Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-06), 49–56. 10, 12, 13
Murata, M., Isahara, H. & Nagao, M. (1999). Pronoun resolution in Japanese sentences
using surface expressions and examples. In A. Bagga, B. Baldwin & S. Shelton, eds., Proceedings of the Workshop on Coreference and Its Applications. 37th Annual Meeting of the
Association for Computational Linguistics (ACL-99), 39–46. 7, 8
Nakagawa, H. (1992). Zero pronouns as experiencer in Japanese discourse. In Proceedings of
the 15th International Conference on Computational Linguistics (COLING-92), 324–330. 7,
8
Nakaiwa, H. (1997). Automatic identification of zero pronouns and their antecedents within
aligned sentence pairs. In Proceedings of the 3rd Annual Meeting of the Association for Natural Language Processing in Japan (ANLP-97), 127–141. 7, 8
Nakaiwa, H. & Ikehara, S. (1992). Zero pronoun resolution in a Japanese to English machine
translation system by using verbal semantic attributes. In Proceedings of the 3rd Conference
on Applied Natural Language Processing (ANLP-92), 201–208. 7, 8
Nakaiwa, H. & Shirai, S. (1996). Anaphora resolution of Japanese zero pronouns with deictic
reference. In Proceedings of the 16th International Conference on Computational Linguistics
(COLING-96), 812–817. 7, 8
Ng, V. & Cardie, C. (2002). Identifying anaphoric and non-anaphoric noun phrases to improve coreference resolution. In Proceedings of the 19th International Conference on Computational Linguistics (COLING-02), 1–7. 10, 12
Nomoto, T. & Yoshihiko, N. (1993). Resolving zero anaphora in Japanese. In Proceedings of
the 6th Conference of the European Chapter of the Association for Computational Linguistics
(EACL-93), 315–321. 7, 8
Okumura, M. & Tamura, K. (1996). Zero pronoun resolution in Japanese discourse based
on centering theory. In Proceedings of the 16th International Conference on Computational
Linguistics (COLING-96), 871–876. 1, 7
Paice, C.D. & Husk, G.D. (1987). Towards an automatic recognition of anaphoric features
in English text: the impersonal pronoun it. Computer Speech and Language, 2, 109–132. 10,
11, 12
Peng, J. & Araki, K. (2007a). Zero anaphora resolution in Chinese and its application in
Chinese-English machine translation. In Z. Kedad, N. Lammari, E. Métais, F. Meziane &
Y. Rezgui, eds., Natural language processing and information systems. Proceedings of the
12th International Conference on Applications of Natural Language to Information Systems
(NLDB-07), 364–375, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer
Science, Vol. 4592. 7
Peng, J. & Araki, K. (2007b). Zero-anaphora resolution in Chinese using maximum entropy.
IEICE - Transactions on Information and Systems, E90-D, 1092–1102. 7, 8
Peral, J. (2002). Resolución y generación de la anáfora nominal en español e inglés en un
sistema de traducción automática. Procesamiento del lenguaje natural , 28, 127–128. 7, 8
Peral, J. & Ferrández, A. (2000). Generation of Spanish zero-pronouns into English. In
D.N. Christodoulakis, ed., Natural Language Processing - NLP 2000. Proceedings of the 2nd
International Conference on Natural Language Processing (NLP-2000), 252–260, Springer,
Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 1835. 2, 7, 8
Pintor García, M. (2007). Análisis factorial de las actitudes personales en educación secundaria. Un estudio empírico en la Comunidad de Madrid. psiquiatria.com, 11. 28
Pollard, C. & Sag, I. (1994). Head Driven Phrase Structure Grammar . CSLI Publications,
Stanford, CA. 19
Real Academia Española (1977). Esbozo de una nueva gramática de la lengua española.
Espasa-Calpe, Madrid. 19
Real Academia Española (2001). Diccionario de la lengua española. Espasa-Calpe, Madrid,
22nd edn. 15, 40, 41
Real Academia Española (2009). Nueva gramática de la lengua española. Espasa-Calpe,
Madrid. ix, 6, 15, 16, 17, 18, 19, 22, 23, 24, 25, 33, 34
Recasens, M. & Hovy, E. (2009). A deeper look into features for coreference resolution. In
L.D. Sobha, A. Branco & R. Mitkov, eds., Anaphora Processing and Applications. Proceedings
of the 7th Discourse Anaphora and Anaphor Resolution Colloquium (DAARC-09), 29–42,
Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 5847. 2, 6,
11
Rello, L. & Illisei, I. (2009a). A comparative study of Spanish zero pronoun distribution. In
Proceedings of the International Symposium on Data and Sense Mining, Machine Translation
and Controlled Languages, and their application to emergencies and safety critical domains
(ISMTCL-09), 209–214, Presses Universitaires de Franche-Comté, Besançon. 3, 7
Rello, L. & Illisei, I. (2009b). A rule-based approach to the identification of Spanish zero
pronouns. In Student Research Workshop. International Conference on Recent Advances in
Natural Language Processing (RANLP-09), 209–214. 3, 7, 8, 9, 10, 11, 22, 35
Rello, L., Baeza-Yates, R. & Mitkov, R. (2010a). Improved subject ellipsis detection in
Spanish. submitted . 3
Rello, L., Suárez, P. & Mitkov, R. (2010b). A machine learning method for identifying non-referential impersonal sentences and zero pronouns in Spanish. Procesamiento del
Lenguaje Natural , 45, 281–287. 3
Ross, J. (1967). Constraints on variables in syntax. Ph.D. thesis, Massachusetts Institute of
Technology. 17
Sánchez de las Brozas, F. ([1562] 1976). Minerva. De la propiedad de la lengua latina.
Cátedra, Madrid. 14
Sasano, R., Kawahara, D. & Kurohashi, S. (2008). A fully-lexicalized probabilistic model
for Japanese zero anaphora resolution. In Proceedings of the 22nd International Conference
on Computational Linguistics (COLING-08), 769–776. 7, 8
Seco, M. (1988). Manual de gramática española. Aguilar, Madrid. 19
Seki, K., Fujii, A. & Ishikawa, T. (2002). A probabilistic method for analyzing Japanese
anaphora integrating zero pronoun detection and resolution. In Proceedings of the 19th International Conference on Computational Linguistics (COLING-02), 911–917. 7, 8
Sevillano Arroyo, M.A. & Ducret Rossier, F.E. (2008). Las emociones en la psiquiatría.
psiquiatria.com, 12. 26
Shopen, T. (1973). Ellipsis as grammatical indeterminacy. Foundations of Language, 10, 65–
77. 15
Steinberger, J., Poesio, M., Kabadjov, M.A. & Ježek, K. (2007). Two uses of anaphora
resolution in summarization. Information Processing and Management, 43, 1663–1680. 2, 7
Streb, J., Hennighausen, E. & Rösler, F. (2004). Different anaphoric expressions are
investigated by event-related brain potentials. Journal of Psycholinguistic Research, 33, 175–
201. 15
Takada, S. & Doi, N. (1994). Centering in Japanese: a step towards better interpretation of
pronouns and zero-pronouns. In Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), 1151–1156. 7, 8
Tanaka, I. (2000). Cataphoric personal pronouns in English news reportage. In Proceedings of
the 3rd Discourse Anaphora and Anaphor Resolution Colloquium (DAARC-2000), 108–117.
33, 34
Tapanainen, P. (1996). The constraint grammar parser CG-2 . Department of General Linguistics, University of Helsinki, Publications, Vol. 27. 28
Tapanainen, P. & Järvinen, T. (1997). A non-projective dependency parser. In Proceedings
of the 5th Conference on Applied Natural Language Processing (ANLP-97), 64–71. 13, 28
Tesnière, L. (1959). Éléments de syntaxe. Klincksieck, Paris. 28
Theune, M., Hielkema, F. & Hendriks, P. (2006). Performing aggregation and ellipsis using discourse structures. In Research on Language & Computation, vol. 4, 353–375, Springer,
Berlin, Heidelberg, New York. 7
Wilder, C. (1997). Some properties of ellipsis in coordination. In Studies in universal grammar
and typological variation, 59–107, John Benjamins, Amsterdam. 17
Witten, I.H. & Frank, E. (2005). Data mining: practical machine learning tools and techniques. Morgan Kaufmann, London, 2nd edn. 26, 38, 41, 44, 46
Yeh, C. & Chen, Y. (2003a). Using zero anaphora resolution to improve text categorization. In
Proceedings of the 17th Pacific Asia Conference on Language, Information and Computation
(PACLIC-03), 423–430. 2, 7, 8
Yeh, C. & Chen, Y. (2003b). Zero anaphora resolution in Chinese with partial parsing based
on centering theory. In Proceedings of the International Conference on Natural Language
Processing and Knowledge Engineering (NLP-KE-03), 683–688. 7, 8
Yeh, C. & Chen, Y. (2007). Topic identification in Chinese based on centering model. Journal
of Chinese Language and Computing, 17, 83–96. 2, 7, 8
Yeh, C. & Mellish, C. (1997). An empirical study on the generation of zero anaphors in
Chinese. Computational Linguistics, 23, 171–190. 7, 8
Yoshimoto, K. (1988). Identifying zero pronouns in Japanese dialogue. In Proceedings of the
12th International Conference on Computational Linguistics (COLING-88), 779–784. 7, 8
Zhao, S. & Ng, H. (2007). Identification and resolution of Chinese zero pronouns: a machine
learning approach. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP/CoNLL-07), 541–550. 2, 7, 8