
Bar-Ilan University
Department of Computer Science
Learning Verb Inference Rules from Linguistically-Motivated Evidence
by
Hila Weisman
Advisors: Prof. Ido Dagan and Dr. Idan Szpektor
Submitted in partial fulfillment of the requirements for the Master’s degree
in the department of Computer Science
Ramat-Gan, Israel
March 2013 (Adar)
Copyright 2013
This work was carried out under the supervision of
Prof. Ido Dagan and Dr. Idan Szpektor
Department of Computer Science, Bar-Ilan University.
Acknowledgments
I would like to take this opportunity to thank the people who made this work
possible. First, my advisor Ido Dagan. Ido gave me a chance to delve deeper
into the fascinating field of NLP. He is a brilliant researcher, whose guidance has
taught me how to observe, analyze and scientifically formalize ideas. I would
also like to deeply thank Idan Szpektor, my second advisor. His creative ideas
and broad knowledge in the field have been of tremendous help to me and my
research.
I wish to thank Jonathan Berant, my de-facto third supervisor. Working with
Jonathan on a daily basis was an illuminating and educating experience. I have
learned a lot about truly creative scientific research and diligence through working
with him in ’Team Verb’.
Thanks to my colleagues at the Bar-Ilan NLP lab: Ofer Bronstien, Naomi Zeichner, Asher Stern, Eyal Shnerch, Meni Adler, Erel Segal, Amnon Lotan, Lili
Kotlerman and Chaya Lebeskind for illuminating coffee breaks and for their helpful and kind nature. Thanks to Mindel, Udi, Batels and Yotam for their support,
patience and love during the whole time I was working on my thesis. And to Papa,
in memoriam, for always pushing me to work hard and pursue my dreams.
This work was partially supported by the Israel Science Foundation grant 1112/08,
the PASCAL-2 Network of Excellence of the European Community FP7-ICT-2007-1-216886, and the European Community Seventh Framework Programme
(FP7/2007-2013) under grant agreement no. 287923 (EXCITEMENT).
Contents

1  Introduction

2  Background
   2.1  Distributional Similarity
   2.2  Co-Occurrence Methods
   2.3  Integrating Distributional Similarity with Co-occurrence
   2.4  Linguistic Background
        2.4.1  Fine-Grained Definition of Entailment
        2.4.2  Discourse Relations and Markers
        2.4.3  Verb Classes

3  Linguistically-Motivated Cues for Entailment
   3.1  Verb Co-occurrence
   3.2  Verb Semantics Heuristics
   3.3  Typed Distributional Similarity

4  An Integrative Framework for Verb Entailment Learning
   4.1  Representation and Model
        4.1.1  Sentence-level co-occurrence
        4.1.2  Features Adapted from Prior Work
        4.1.3  Document-level co-occurrence
        4.1.4  Corpus-level statistics
        4.1.5  Feature Selection and Analysis
   4.2  Learning Framework

5  Evaluation
   5.1  Evaluation on a Manually Annotated Verb Pair Dataset
        5.1.1  Experimental Setting
        5.1.2  Feature selection and analysis
        5.1.3  Results and Analysis
   5.2  Evaluation within Automatic Content Extraction task
        5.2.1  Experimental Setting
        5.2.2  Building a Verb Entailment Resource for ACE
        5.2.3  Evaluation Metrics and Results
        5.2.4  Error Analysis

6  Conclusions and Future Work
   6.1  Conclusions
   6.2  Future Work

Appendix A  VerbOcean Web Patterns Experiment
   A.1  Experimental Setting
   A.2  Results

Appendix B  Fine-grained Verb Entailment Annotation Guidelines

Appendix C  Verb Syntax and Semantics
   C.1  Auxiliary and Light Verbs
   C.2  Verb Syntax
   C.3  Levin Classes
List of Figures

2.1  Related clauses, via the anchor ’Mr. Brown’, and their extracted syntactic templates in Pekar (2008)
2.2  An automatically learned ’Prosecution’ narrative chain in Chambers and Jurafsky (2008)
2.3  Classification of verb entailment to sub-types according to Fellbaum (1998)
2.4  A Sample Rhetorical Structure Tree, connecting four discourse units via rhetorical relations in Mann and Thompson (1988)
2.5  A mapping of discourse relations to relations proposed by other researchers: M&T - Mann and Thompson (1988); Ho - Hobbs (1990); A&L - Asher and Lascarides (2003); K&S - Knott and Sanders (1998)
3.1  A dependency parse of the sentence “Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas”
3.2  Typed Distributional Similarity Feature vectors for the verb “see”
3.3  Adverb-typed Distributional Similarity vectors for the verbs “whisper” and “talk”
4.1  The backward elimination procedure, adapted from Rakotomamonjy (2003)
5.1  Evaluated rank-based methods and their Micro-average F1 as a function of top K rules
5.2  Co-occurrence Level classifiers and their Micro-average F1 as a function of top K rules
List of Tables

2.1   VerbOcean’s Semantic relations with an example of their associated pattern and an instantiating verb pair
4.1   Discourse relations and their mapped markers or connectives.
5.1   Top 10 positively correlated features according to the Pearson correlation score and their co-occurrence level.
5.2   Top 10 negatively correlated features according to the Pearson correlation score and their co-occurrence level.
5.3   Top 10 features according to their single feature classifier F1 score and their co-occurrence level.
5.4   Feature ranking according to minimal feature weights, with the feature’s verb co-occurrence level
5.5   Average precision, recall and F1 for the evaluated models.
5.6   Average precision, recall, F1, and the statistical significance level of the difference in F1 for each combination of co-occurrence levels.
5.7   10 randomly selected correct entailment rules learned by our model but missed by TDS.
5.8   Average number of positive verb pairs per seed with their frequency compared to the average total of 875 candidate rules per seed
5.9   MAP Results of our model and the decision-rule based resources
5.10  Error types and their frequency from a random sample of 100 false matches
5.11  Example errors due to the naive matching mechanism
5.12  Example errors due to seed ambiguity
Abstract
Many natural language processing (NLP) applications need to perform semantic inference in order to complete their task. Such inference relies on knowledge
that can be represented by inference rules between atomic units of language such
as words or phrases. In this work, we focus on automatically learning one of the
most important rule types, namely, entailment rules between verbs. Most prior
work on learning such rules focused on a rather narrow set of information sources:
mainly distributional similarity, and to a lesser extent manually constructed verb
co-occurrence patterns.
In this work, we claim that it is imperative to utilize information from various textual scopes: verb co-occurrence within a sentence, within a document, as
well as overall corpus statistics. To this end, we propose an integrative model that
improves rule learning by exploiting entailment cues: linguistically motivated indicators that are specific to verbs. These cues operate at different levels of verb
co-occurrence and signal the existence (or non-existence) of the entailment relation between any verb pair. These novel cues are encoded as features together
with other useful features adapted from prior work.
Two experiments performed in different settings show that our proposed model
outperforms, with a statistically significant improvement, state-of-the-art algorithms. Further feature analysis indicates that our novel linguistic features substantially improve performance and that features at all co-occurrence levels, and
especially at the sentence level, contain valuable information for verb entailment
learning.
Chapter 1
Introduction
Semantic inference is the task of conducting inferences over natural language utterances. Many Natural Language Processing (NLP) applications need to perform
semantic inferences in order to complete their task. For example, given a collection of documents, a Question Answering (QA) system needs to retrieve valid
answers to questions posed in natural language. Such a system should infer that
the sentence “Churros are coated with sugar” contains a valid answer for the question “What are Churros covered with?”. In Information Extraction (IE), systems
need to find instances of target relations in text. For example, an IE system aiming
to extract the relation “is enrolled at” would have to recognize that the sentence
“Michelle studies History at NYU” implies that ’Michelle’ is the enrolee and that
she is enrolled at ’NYU’.
In order to make such inferences, these systems depend on knowledge which
can be represented by inference or entailment rules. An entailment rule specifies
a directional inference relation between two text fragments T and H. Entailment
rules can relate lexical elements or words, e.g., the rule ‘Honda → car’ specifies
that the meaning of the word ’Honda’ implies the meaning of the word ’car’. They
may include slots for variables, e.g., ‘X bought Y → X purchased Y’ and also
represent syntactic transformations, e.g., change a sentence from passive form to
active form.
In this work we will focus on automatically learning verb entailment rules,
which describe a lexical entailment relation between two verbs, e.g., ‘whisper →
talk’, ‘win → play’ and ‘buy → own’.
Verbs are the primary vehicle for describing events and relations between entities and as such, are arguably the most important lexical and syntactic category
of a language (Fellbaum, 1990). Their semantic importance has led to active research in automatic learning of entailment rules between verbs or verb-like structures (Chklovski and Pantel, 2004; Zanzotto et al., 2006; Abe et al., 2008).
Large knowledge bases of entailment rules can be constructed either manually or automatically. Manually-created rule bases provide extremely accurate semantic knowledge, but their construction is time-consuming and their coverage is
limited. On the other hand, automated corpus-based rule learning methods are
able to utilize the large amounts of text available today to provide broad coverage
at minimum cost.
Most prior efforts to automatically learn verb entailment rules from large corpora employed distributional similarity methods, assuming that verbs are semantically similar if they occur in similar contexts (Lin, 1998; Berant et al., 2012). This
led to the automatic acquisition of large scale knowledge bases, but with limited
precision. Fewer works, such as VerbOcean (Chklovski and Pantel, 2004), focused on identifying verb entailment through verb instantiation of manually constructed patterns. For example, the sentence “he scared and even startled me”
implies that ‘startle → scare’. This led to more precise rule extraction, but with
poor coverage since, in contrast to nouns, for which such patterns are common (Hearst,
1992), verbs do not co-occur often within rigid patterns. However, verbs do tend
to co-occur in the same document, and also in different clauses of the same sentence.
In this work, we claim that on top of standard pattern-based and distributional
similarity methods, corpus-based learning of verb entailment can greatly benefit
from exploiting additional linguistically-motivated cues that are specific to verbs.
For instance, when verbs co-occur in different clauses of the same sentence, the
syntactic relation between the clauses can be viewed as a proxy for the semantic relation between the verbs. Moreover, we empirically show that in order to
improve performance, it is crucial to combine information sources from different
textual scopes such as verb co-occurrence within a sentence or within a document
and distributional similarity over the entire corpus.
Our goal in this work is thus two-fold. First, to create a novel set of entailment
cues that help detect the likelihood of lexical verb entailment. Our novel cues
are specific to verbs and linguistically-motivated. Second, to encode these cues
as features within a supervised classification framework and integrate them with
other standard features adapted from prior work. This results in a supervised
corpus-based learning method that combines verb entailment information at the
sentence, document and corpus levels.
We demonstrate the added power of these cues in two distinct evaluation settings, where our integrated model achieves a substantial and statistically significant improvement over state-of-the-art algorithms. First, we show a 17% improvement in F1 measure compared to the best performing prior work on a manually
annotated dataset of over 800 verb pairs. Then we utilize our verb entailment
model in an applied Information Extraction setting and show marked improvement over several automated learning algorithms as well as manually created lexicons. In addition, we ascertain the importance of using features at different levels
of co-occurrence by performing ablation tests and applying feature selection techniques.
In Chapter 2 we provide background on prior work in the field of learning
semantic relations between verbs. In Chapter 3, we introduce linguistically motivated cues that are specific to verbs and signal the entailment relation between
verb pairs. In Chapter 4 we describe how these cues are encoded as features within
a supervised classification framework. In Chapter 5, we evaluate our proposed
model in two different settings: a manually annotated dataset and an applicationoriented setting. Then, we conclude with a short discussion about contributions
and possible extensions to our work.
Chapter 2
Background
One can view the task of learning verb entailment rules as part of a more general
task of learning semantic relations between lexical units. In the first part of this
chapter we will survey the main approaches for learning verb entailment and other
semantic relations between verbs. In the second part, we will describe works from
both theoretical and applied linguistics which are required for the full understanding of the motivations underlying our proposed model.
2.1 Distributional Similarity
The main approach for learning lexical entailment rules between verbs has employed the distributional hypothesis (Harris, 1954) which states that two words
are likely to have similar meanings if they co-occur, i.e., appear together, with
similar sets of words in a corpus (a large, unstructured volume of text). For instance, we expect the verbs ‘buy’ and ‘purchase’ to appear with human beings as
their subjects and inanimate commodities or companies as their objects, e.g., “He
bought a new car” and “Shari Arison purchased a penthouse”.
Relying on this hypothesis, distributional similarity methods measure the semantic relatedness of words by comparing the contexts in which the words appear
in. The compared words are termed elements and each element is represented
according to its context words. The contexts are in turn represented by feature
vectors, which are lists of numerical values expressing a weighted frequency measure between the element and the feature. The feature vectors vary from purely
lexical, based on words with no syntactic information, to purely syntactic which
are based on dependency relations between the element and the context. For
example, suppose the element is the verb ’ask’ and it appears in the following
sentence: “He asked his mother for some help”, then the purely lexical representation of the verb ’ask’ will have the following features: ’he’,’mother’,’help’,
the lexical-syntactic representation will have the following features: ’subject-he’,
’object-mother’,’preposition complement-help’ and the purely syntactic will have
the following features: ’subject’, ’object’,’preposition complement’. Once these
feature vectors are constructed, a feature weighting function is computed between
each feature and its corresponding element. Last, a similarity measure is computed between the weighted vectors in order to establish the degree of semantic
relatedness between the two elements.
In his seminal paper, Lin (1998) proposed an information-theoretic distributional similarity measure to compute the semantic relatedness of two words. The
presented framework starts by extracting dependency triples from a text corpus. A
dependency triple consists of an element, a context word (feature), and the grammatical relationship between them in a given sentence. For example, given the
sentence “He likes to read mystery novels” and the element ’novel’, the following
triples will be extracted: (mystery, adjective, novel), (novel, object, read). Then,
a Mutual Information measure is applied to weigh the frequency counts of each
feature f with element e:
\[ MI(e, f) = \log_2 \frac{P(e, f)}{P(e) \cdot P(f)} \]
Intuitively, this measures the degree to which the common appearance of e and
f is not random. Last, the author presents a novel lexical distributional similarity measure, referred to as the Lin similarity measure, which is based on the
author’s information-theoretic similarity theorem: “The similarity between A and
B is measured by the ratio between the amount of information needed to state the
commonality of A and B and the information needed to fully describe what A and
B are”. Thus, given two element words e1 and e2 , the Lin measure is computed
as the ratio between the information shared by the features of both words and the
sum over the features of each word:
\[ Lin(e_1, e_2) = \frac{\sum_{f \in F(e_1) \cap F(e_2)} \big( MI(e_1, f) + MI(e_2, f) \big)}{\sum_{f \in F(e_1)} MI(e_1, f) + \sum_{f \in F(e_2)} MI(e_2, f)} \]
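To make this computation concrete, the following minimal Python sketch (ours, not part of the thesis or of Lin's system) derives MI weights from toy element-feature counts and computes the Lin measure between two elements; all counts and feature names are invented for illustration.

```python
import math
from collections import defaultdict

# Toy element-feature co-occurrence counts (invented for illustration).
counts = {
    ("buy", "subj:he"): 30, ("buy", "obj:car"): 20, ("buy", "obj:house"): 10,
    ("purchase", "subj:he"): 5, ("purchase", "obj:car"): 8, ("purchase", "obj:stock"): 7,
}

total = sum(counts.values())
elem_totals, feat_totals = defaultdict(int), defaultdict(int)
for (e, f), c in counts.items():
    elem_totals[e] += c
    feat_totals[f] += c

def mi(e, f):
    """MI(e, f) = log2( P(e, f) / (P(e) * P(f)) ); 0 if the pair was never seen."""
    c = counts.get((e, f), 0)
    if c == 0:
        return 0.0
    return math.log2((c / total) / ((elem_totals[e] / total) * (feat_totals[f] / total)))

def lin(e1, e2):
    """Lin similarity: MI mass of shared features over total MI mass of both elements."""
    f1 = {f for (e, f) in counts if e == e1}
    f2 = {f for (e, f) in counts if e == e2}
    shared = sum(mi(e1, f) + mi(e2, f) for f in f1 & f2)
    norm = sum(mi(e1, f) for f in f1) + sum(mi(e2, f) for f in f2)
    return shared / norm if norm else 0.0

print(lin("buy", "purchase"))
```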
The above mentioned method deals with a symmetric notion of similarity, for
instance the word pairs (“dog”, “animal”) and (“animal”,“dog”) will receive high
distributional similarity scores, regardless of their order. Yet, in order to deal with
non-symmetric semantic relations, such as entailment, there is a need to devise
directional distributional similarity measures. Using such measures, there will be
a difference in the similarity score between word pairs, such that (“dog”, “animal”) will be scored higher than (“animal”,“dog”). These measures generally
follow the notion of distributional inclusion: if the contexts of a word w1 are
properly included in the contexts of a word w2 , then this might imply that w1 →
w2 . For instance, if the context vector representing the word ’startle’ is subsumed
by the context vector of the word ’surprise’ then this might imply that ‘startle →
surprise’.
Weeds and Weir (2003) have shown that the inclusion of one distributional
vector into another correlates well with human intuitions about the relations between general and specific words, i.e., specific is subsumed by general (’startle’
is subsumed by ’surprise’). Geffet and Dagan (2005) further refined the distributional similarity measure, relating distributional behaviour with lexical entailment.
Later, Szpektor et al. (2007) proposed a directional distributional similarity measure termed “Balanced Inclusion”, which balances the notion of directionality in
entailment with the common notion of symmetric semantic similarity.
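As an illustration of the distributional inclusion idea, here is a small sketch of one possible directional score: a Weeds-style coverage measure, i.e., the fraction of the candidate entailing word's feature weight that is also covered by the entailed word. This particular formulation is our simplification, not the exact measures proposed by Weeds and Weir (2003), Geffet and Dagan (2005) or Szpektor et al. (2007); the toy weights are invented.

```python
def coverage(weights_narrow, weights_broad):
    """Directional score: fraction of the narrow word's feature weight covered by
    features the broad word also has. weights_* map feature -> positive weight (e.g., MI)."""
    total = sum(weights_narrow.values())
    if total == 0:
        return 0.0
    covered = sum(w for f, w in weights_narrow.items() if f in weights_broad)
    return covered / total

# Toy context weights (invented): 'startle' contexts are included in 'surprise' contexts.
startle = {"subj:noise": 2.1, "obj:me": 1.7}
surprise = {"subj:noise": 1.9, "obj:me": 1.5, "obj:everyone": 1.2, "adv:pleasantly": 0.8}

print(coverage(startle, surprise))   # high  -> supports 'startle -> surprise'
print(coverage(surprise, startle))   # lower -> 'surprise -> startle' less supported
```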
In general, using these methods results in broad-coverage resources for semantically related lexical items, but the relatedness is somewhat more loose than
expected. In symmetric distributional similarity, it has been shown (Lin et al.,
2003; Geffet and Dagan, 2005) that distributionally similar words tend to include
not only synonyms but also antonyms, e.g., ’buy’ and ’sell’, and co-hyponyms,
words that share a common hypernym, e.g., ’cruise’ and ’raft’. In directional distributional similarity, high scores are often given to rules with obscure entailing
verbs (Kotlerman et al., 2010). An orthogonal approach that mitigates these drawbacks is that of word co-occurrence.
2.2 Co-Occurrence Methods
In contrast to distributional methods, co-occurrence methods capture more exact
and fine-tuned semantic relations by looking at the candidate lexical items as they
appear together in varying scopes. However, their accuracy comes at a cost: as we
limit the co-occurrence scope, we obtain higher precision and lower recall with
regards to the target semantic relation.
In her seminal paper in the field, Hearst (1992) describes a pattern-based
method for the automatic acquisition of the Hyponymy (is-a) relation between nouns (the converse relation is termed Hypernymy), e.g., Honda is a hyponym of car. She manually created lexical-syntactic
patterns, which include both syntactic information consisting of Part of Speech
(POS) slots, e.g., Noun Phrase (NP), and lexical information in the form
of specific words. The underlying idea is that certain patterns can be thought of
as surface cues to the semantic relation of hyponymy. For example, the patterns
“NPy such as NPx” and “NPx and other NPy” often imply that NPx is a hyponym of NPy, as exemplified in “Bananas and other fruit” and “Sean Penn and other actors”.
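The following sketch (ours, not part of Hearst's or Snow et al.'s systems) shows how such surface patterns can be matched with simple regular expressions; real implementations operate over POS-tagged or parsed text rather than raw strings, so the NP approximation below is deliberately crude.

```python
import re

# Two simplified Hearst-style surface patterns; an NP is approximated here as
# one or two adjacent words, far cruder than real NP chunking.
PATTERNS = [
    (re.compile(r"(\w+(?: \w+)?) such as (\w+(?: \w+)?)"), "hyper-hypo"),    # "NPy such as NPx"
    (re.compile(r"(\w+(?: \w+)?) and other (\w+(?: \w+)?)"), "hypo-hyper"),  # "NPx and other NPy"
]

def extract_hyponym_pairs(sentence):
    """Return (hyponym, hypernym) pairs suggested by the surface patterns."""
    pairs = []
    for pattern, order in PATTERNS:
        for m in pattern.finditer(sentence):
            a, b = m.group(1), m.group(2)
            pairs.append((b, a) if order == "hyper-hypo" else (a, b))
    return pairs

print(extract_hyponym_pairs("fruit such as bananas"))       # [('bananas', 'fruit')]
print(extract_hyponym_pairs("Sean Penn and other actors"))  # [('Sean Penn', 'actors')]
```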
Snow et al. (2005) generalize Hearst’s work by automatically identifying,
instead of manually creating, lexico-syntactic patterns indicative of hyponymy.
They start with a labeled set of example noun pairs. For instance, the pair (Honda,
Car) will have a positive label since ’Honda’ is a hyponym of ’Car’ and the pair
(Sun, Football) will have a negative label since ’Sun’ is not a hyponym of ’Football’. Next, they collect all the dependency paths connecting these labelled pairs
in a large newswire corpus. They then use machine learning methods that learn the
effectiveness of each such path (or pattern) in recognizing hyponymy and apply
these patterns to recognize novel hyponymy noun pairs.
Intuitively, working with noun patterns seems more promising than working
with verb patterns, since nouns are naturally ordered in a hierarchy (Fellbaum,
1998) and co-occur more often within the same text fragment. Indeed, less work
has been done using verb pattern-based methods. However, one prominent work
taking this approach is Chklovski and Pantel’s VerbOcean (2004). In VerbOcean,
the authors use lexical-syntactic patterns to discover semantic relations between
verbs. Similarly to Hearst, they manually construct 33 patterns which are divided
into five pattern groups, with each group corresponding to a different semantic
relation: similarity, strength, antonymy, enablement and happens-before. Table
2.1 presents the relations alongside a sample indicative lexical-syntactic pattern
and an ordered verb pair instantiating the pattern.
Semantic Relation   Example Pattern        Example Verb Pair        Symmetric
Similarity          X and Y                (transform, integrate)   Yes
Strength            X and even Y           (wound, kill)            No
Antonymy            either X or Y          (open, close)            Yes
Enablement          Yed by Xing the        (fight, win)             No
Happens-before      X and eventually Y     (buy, sell)              No

Table 2.1: VerbOcean’s Semantic relations with an example of their associated pattern and an instantiating verb pair
In order to construct a large resource of fine-grained semantically-related verbs,
they use highly associated verb pairs (30K verb pairs from Lin and Pantel
(2001)) and for each verb pair, test its association strength with the aforementioned semantic relations using the following information-theoretic measure:
\[ S_p(v_1, v_2) = \frac{P(v_1, p, v_2)}{P(p) \cdot P(v_1) \cdot P(v_2)} \]
where (v1, v2) is an ordered pair of verbs, p is a specific pattern out of the 33 available patterns (e.g., “X and eventually Y”), and the probabilities in the above measure are estimated from Google hit counts. It is worth noting that while working with web counts largely increases the statistics, the counts are volatile and noisy (Lapata and Keller, 2005), and our preliminary empirical results have corroborated this (see Chapter 4.1.2).
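A small sketch of how the S_p score could be computed once the relevant counts are available; the normalizations below are one simple choice of ours, whereas VerbOcean itself estimated the probabilities from Google hit counts, and all numbers are invented.

```python
def verbocean_score(c_joint, c_pattern, c_v1, c_v2, n_joint, n_verb):
    """S_p(v1, v2) = P(v1, p, v2) / (P(p) * P(v1) * P(v2)).
    Probabilities are estimated from raw counts with simple normalizations
    (n_joint for pattern-slot events, n_verb for verb occurrences)."""
    p_joint = c_joint / n_joint
    p_p = c_pattern / n_joint
    p_v1 = c_v1 / n_verb
    p_v2 = c_v2 / n_verb
    return p_joint / (p_p * p_v1 * p_v2)

# Toy counts (invented): "X and eventually Y" instantiated by (buy, sell).
print(verbocean_score(c_joint=12, c_pattern=5000, c_v1=80000, c_v2=60000,
                      n_joint=10**7, n_verb=10**8))
```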
The use of lexical-syntactic patterns helps achieve high precision, since it correlates well with specific semantic relations. However, this comes at a price of low recall, since patterns impose considerable restrictions on the form of verb co-occurrence in the corpus. Thus, a major challenge for pattern-based methods is to
mitigate the sparseness problem. One possible solution is to adopt a more relaxed
notion of co-occurrence, i.e., utilize less rigid and more "loose" surface cues that
exist when two verbs co-occur within a broader, more varied scope.
Tremper and Frank (2011) suggest a fine-grained supervised method for learning the presupposition semantic relation. The authors utilize surface cues from
the co-occurrence of verbs in the same sentence, without the need of specific rigid
patterns, and discern between five fine-grained semantic relations: presupposition,
entailment, temporal inclusion, antonymy and synonymy (see 2.4.1 for definitions)
using a weakly supervised learning approach. They utilize several shallow syntactic features for verb pairs that co-occur in the same sentence: the verb’s form (e.g.,
the tense of the verb) and part-of-speech contexts (the part-of-speech of the two
words preceding and following each verb), alongside deeper syntactic features
such as co-reference binding between the verbs’ arguments, i.e., two verbs sharing
the same subject or object.
Pekar (2008) further broadens the scope of verb co-occurrence for the task of
event entailment. Event entailment is a specific form of verb entailment, where
one event is the antecedent, denoted as vq and another is the consequent, denoted
as vp. If the antecedent occurred then it implies that the consequent also occurred, for example ‘buy → belong’. Pekar focuses on verb co-occurrence in the
same discourse unit. A discourse unit is defined as a paragraph or a few adjacent
clauses, where a clause is defined as a verb with its dependent constituents. For
example, in the sentence “Although John bought a new laptop, he was unhappy”
there are two clauses “John bought a new laptop” and “he was unhappy”. Intuitively, Pekar postulates that if two verbs co-occur within a locally coherent text,
they are more likely to stand in an event entailment relation. This assumption is
based on Centering Theory (Grosz et al., 1995), a linguistic theory of discourse,
which states that a coherent text segment tends to focus on a specific entity, here
termed as anchor. Building an unsupervised classifier for the task of recognizing
event entailment, the proposed algorithm first identifies pairs of clauses that are
related in the local discourse. Then, it uses features such as textual proximity,
constituents referring to the anchor and pronouns as anchors. Next, it constructs
templates from these clauses and chooses verb pairs that instantiate the templates
(see Figure 2.1).
Figure 2.1: Related clauses, via the anchor ’Mr. Brown’, and their extracted
syntactic templates in Pekar (2008)
Last, it calculates an information-theoretic measure between the verbs to test
their association and discover the direction of entailment in each pair. Given an
ordered verb pair (vq , vp ) the following information-theoretic score is computed:
\[ Score_{Pekar}(v_q, v_p) = \frac{P(v_q \mid v_p) \cdot \log \frac{P(v_q \mid v_p)}{P(v_q)}}{\sum_i P(v_{q_i} \mid v_p) \cdot \log \frac{P(v_{q_i} \mid v_p)}{P(v_{q_i})}} \]
A verb pair (vq , vp ) is given a high score using this formula if the prior probability of vq is relatively far from its posterior probability given vp , in terms of
KL-divergence (Kullback and Leibler, 1951).
Chambers and Jurafsky (2008) further broaden the verb co-occurrence scope
and formulate the notion of narrative (event) chains. Narrative chains are partially ordered sets of events centered around a common entity (here termed as
the protagonist). To induce such chains from unordered text, the authors propose
an unsupervised learning approach that uses co-referring arguments as evidence
of a narrative relation. Here, the verbs need not appear in the same sentence or
paragraph, but in the same document. The algorithm learns an ordering of events
within the same document and builds a narrative by looking at the arguments
shared by the verbs in the document (see Figure 2.2). The authors rely on the
intuition that in a coherent text, any two events that revolve around the same participants are likely to be part of the same story or narrative.
Figure 2.2: An automatically learned ’Prosecution’ narrative chain in Chambers
and Jurafsky (2008)
2.3 Integrating Distributional Similarity with Co-occurrence
An important line of work that naturally follows from the two above mentioned
approaches, namely Distributional Similarity and Co-occurrence, is to integrate
them in order to exploit the benefits provided by these orthogonal approaches (i.e.,
high precision vs. high recall). Observing that the two approaches are largely
complementary, Mirkin et al. (2006) introduce a system for extracting lists of
entailment rules between nouns, e.g., ‘police → organization’. The authors rely
on two knowledge extractors: a pattern-based extractor, using Hearst patterns, and
a distributional extractor applied to a set of entailment seeds. They create a novel
feature set based on both pattern-based and distributional data, e.g., a feature that
computes the probability of a noun pair to appear in a specific Hearst pattern, or
an intersection feature whose value is ’1’ if the noun pair is identified by both the
pattern-based and distributional methods and ’0’ otherwise. Later, Pennacchiotti
and Pantel (2009) worked within the field of Information Extraction and expanded
Mirkin et al.’s work by augmenting the feature space with a large set of features
extracted from a web-crawl.
Hagiwara et al. (2009) propose an integrated approach for the task of Synonym
identification. They suggest that in order to fully integrate these two orthogonal
methods, one needs to reformulate the distributional similarity approach. Thus,
instead of using common features in different vectors in order to compare the
commonality of the elements’ context, the authors propose to directly create vectors of distributional features, which represent the degree of context commonality
shared by the elements. For example, given the elements “opera” and “ballet”, instead of constructing two different vectors for each element and comparing them
according to some similarity measure, the authors propose to build a single feature
vector, where each entry represents a context the elements share, e.g., given the
sentences “He is going to the opera” and “The Welches are going to the ballet”,
the feature ’go-obj’ will appear in the pair’s joint vector.
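A minimal sketch of this pair-vector construction; combining the two frequencies by taking the minimum is our illustrative choice, not necessarily the weighting used by Hagiwara et al., and the toy context counts are invented.

```python
def pair_vector(contexts_a, contexts_b):
    """Build a single feature vector for a word pair: each entry is a context
    shared by both words, valued by combining the two frequencies (here: min)."""
    return {ctx: min(contexts_a[ctx], contexts_b[ctx])
            for ctx in contexts_a.keys() & contexts_b.keys()}

opera = {"go-obj": 12, "enjoy-obj": 4, "sing-in": 7}
ballet = {"go-obj": 9, "enjoy-obj": 2, "dance-in": 5}
print(pair_vector(opera, ballet))  # {'go-obj': 9, 'enjoy-obj': 2}
```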
In this thesis we aim to follow this line of work and encode co-occurrence cues
at both the sentence level and document level as well as distributional similarity
information, and incorporate these cues as features within a supervised framework
for learning lexical entailment rules between verbs.
2.4 Linguistic Background
As mentioned before, we aim to devise novel and valuable entailment cues that
are specific to verbs and are linguistically-motivated. In the following section, we
will review the related literature in linguistics that has inspired the construction of
many of our novel features.
2.4.1 Fine-Grained Definition of Entailment
The entailment relation between verbs may hold for many different reasons: one
verb may represent an action or a state whose manner is more specific than the
other, e.g., ’dancing’ is more specific than ’moving’ and ‘dance → move’. Or
it may be that one verb denotes an action or state which presupposes another
action/state, e.g., ’winning’ a tournament presupposes ’playing’ in a tournament
and ‘win → play’.
Fellbaum (1998) proposed a verb entailment hierarchy, presented in Figure
2.3, which captures these fine-grained differences. Here, the entailment relation is first divided in two according to the Temporal Inclusion criterion. Then, if the criterion is met, yet another division is performed, this time according to the Troponymy relation:
Figure 2.3: Classification of verb entailment to sub-types according to Fellbaum
(1998)
Temporal Inclusion: A verb V1 is said to temporally include a verb V2 if there
is some stretch of time during which the activities or states denoted by the
two verbs co-occur, but there exists no time during which V2 occurs and V1
does not.
Proper Inclusion: A verb V1 is said to properly include a verb V2 if it temporally
includes it and there exists a stretch of time during which V1 occurs but V2
does not, i.e., V1 and V2 are not temporally co-existent. The direction of
entailment here is V2 → V1 .
Troponymy: A verb V1 is said to be a troponym of V2 if it is temporally (but not
properly) included in V2 . In addition, this relation can be expressed by the
formula “To V1 is to V2 in some particular manner” and is comparable to the
semantic relation of Hyponymy between nouns. The direction of entailment
here is V1 → V2 .
To illustrate the temporal inclusion division, let us examine two examples:
1. ‘dance → move’: the activity of ’dancing’ is temporally included in the
activity of ’moving’, as dancing is an activity that occurs while moving.
Furthermore, dance is not properly included in move, since they are always co-existent: one must necessarily be moving every instant that one is
dancing. And indeed, to dance is to move in some particular manner.
2. ‘dream → sleep’: the activity of ’dreaming’ is temporally included in the
activity of ’sleeping’, as ’dreaming’ is an activity that occurs while ’sleeping’. Furthermore, dream is properly included in sleep, since one cannot
dream whilst one is not sleeping and indeed there exists a stretch of time
when one sleeps and does not dream.
There are two remaining relations in the entailment hierarchy that do not conform to the Temporal Inclusion criterion: Backward Presupposition and Causation.
Backward Presupposition: a verb V1 is said to backward presuppose a verb V2
if the activity denoted by V2 occurs before the activity denoted by V1 and is a
prerequisite of the activity denoted by V1 , with the direction of entailment
being V1 → V2 . For example, ’win’ presupposes ’play’ since the event of
’winning’ occurred after the event of ’playing’ and indeed one must have
played in order to win, and ‘win → play’. We note that the reverse relation
is called Presupposition, with the direction of entailment being reversed to
V2 → V1 .
Causation: is a relation between a verb V1 , the cause, and a verb V2 , the effect.
This relation holds when V1 is a causative verb of change that occurs before
V2 to produce it. V2 is the result of V1 and is a resultative verb, with the direction of entailment being V1 → V2 . For example, the event of showing an
object to a person causes this person to see the object and the corresponding
entailment rule is ‘show → see’.
2.4.2 Discourse Relations and Markers
Discourse relations between textual units are an essential way for humans to properly interpret or produce text. In their seminal paper, Mann and Thompson (1988)
present a theory of text organization named Rhetorical Structure Theory (RST).
RST is a static, descriptive linguistic theory that deals with what makes a text
coherent. It defines a text as coherent due to the discourse (rhetorical) relations
which link together its various segments. Figure 2.4 presents such a tree structure,
which first connects four discourse units into two units by using the ’Purpose’ and
’Non-volitional cause’ discourse relations and subsequently merges the two units
into one coherent discourse unit by using the ’Elaboration’ discourse relation.
Figure 2.4: A Sample Rhetorical Structure Tree, connecting four discourse units
via rhetorical relations in Mann and Thompson (1988)
Discourse relations are semantic relations that connect two text segments,
where the smallest segments are clauses and the largest are paragraphs. These
relations are conceptual and can be marked explicitly in text by discourse markers (also called ’connectives’) e.g., “because”, “so”, “however”, “although”. The
basic idea is that a discourse marker serves to relate the content of connected
segments in a specific type of relationship between the text segments. Many researchers (Hobbs, 1979; Rosner and Stede, 1992; Knott and Dale, 1994, 1996)
have created various taxonomies of discourse markers. Of special interest to us
are the works that try to automatically learn the discourse relations present in a
given text. Marcu and Echihabi (2002) use large amounts of labelled data to automatically learn four discourse relations: contrast, explanation-evidence, condition
and elaboration. These four relations have been chosen as a sufficient taxonomy
of discourse relations and the authors provide a mapping to other taxonomies proposed in previous works (see Figure 2.5). In our work with adopt their taxonomy,
with some alterations, and aim to use these markers as insight to the deeper semantic relations between clauses and their corresponding main verbs (see Chapter
3 for more details).
Figure 2.5: A mapping of discourse relations to relations proposed by other researchers: M&T - Mann and Thompson (1988); Ho - Hobbs (1990); A&L - Asher
and Lascarides (2003); K&S - Knott and Sanders (1998)
Lapata and Lascarides (2004) propose a data-intensive and unsupervised approach that automatically captures the temporal discourse relation between clauses
of the same sentence. To that end, the authors suggest a probabilistic framework
where the temporal relations are learned by gathering informative features, such
as the verb’s tense, aspect, modality and voice, in order to infer a likely ordering of the two clauses. For instance, in the sentence “Many employees will lose
their job once the merge is completed”, ’lose’ which is the verb in the main clause
appears with a future modality, active voice and imperfective aspect, while ’complete’ which is the verb in the subordinate clause appears with no modality, passive
voice and imperfective aspect. In our work, we utilize similar temporal features
in order to infer the prevalent temporal relation between verb pairs, which may
assist in detecting the direction of entailment between the verb pair (see Chapter
4.1.2).
2.4.3 Verb Classes
There are two predominant taxonomies of verb semantics: WordNet (Fellbaum,
1998) and Levin Classes (Levin, 1993). WordNet is an on-line English lexicon
which groups lexicalized concepts into sets of synonyms, termed synsets. These
synsets are in turn connected in a hierarchy by means of lexical and semantic
relations, such as Synonymy, Antonymy and Troponymy. Some relations are specific to verbs, such as Troponymy and Entailment and they assist our annotation
process (see Chapter 5.1.1) and also serve as a baseline in one of our evaluation
settings (see Chapter 5.2).
In contrast, the Levin verb classification aims to capture the close relationship
between the syntax and semantics of verbs. The underlying assumption is that
verb meaning and argument realization, which are alterations in the expression
of verbal arguments such as subject, object and prepositional complement, are
jointly linked . Thus, looking at verbs with shared patterns of argument realization
provides a way of classifying verbs in a meaningful way (see Appendix C for more
details). Levin’s coupling of a verb’s behaviour to its meaning is a predominant
insight that is utilized throughout this work, and we specifically utilize the idea of
verb classification as one of our novel entailment cues (see Chapter 3.2).
Chapter 3
Linguistically-Motivated Cues for Entailment
In order to fully understand our model and its underlying ideas, we will now
outline the motivations for our proposed novel entailment cues: linguistically motivated indicators that are specific to verbs, operate at different levels of verb co-occurrence and cue the existence (or non-existence) of the entailment relation
between verb pairs.
3.1 Verb Co-occurrence
Our focus is on utilizing the information embedded when two verbs appear together (co-occur) within a textual unit. We aim to capture this information in
different levels of co-occurrence, with an emphasis on devising novel cues at the
sentence level, since we believe that co-occurrence at this level bears important
information that has been previously overlooked.
Sentences are hierarchically structured grammatical units, composed of one or
more clauses. We conjecture that the hierarchical organization of a sentence may
be pertinent to the semantic organization of the verbs heading the corresponding
clauses. The clauses can be coordinated such as in “The lizard moved and raised
its head” or subordinate such as in “If we want to win the tournament, we must
practice vigorously”. In the latter case, the clauses can be linked explicitly via a
subordinating conjunction. More formally:
Subordination: one clause is subordinate to another, if it semantically depends
on it and cannot be fathomed without it. The dependent clause is called a
subordinate clause and the independent clause is called the superordinate
(or main) clause.
Subordinating Conjunction: a word that functions to link a subordinate clause
to its superordinate clause. Many discourse markers are also subordinating
conjunctions such as because, unless, since, if.
As described in 2.4.2, Discourse markers are lexical terms such as ‘because’
and ‘however’ that indicate a semantic relation between discourse fragments (propositions or speech acts). We suggest that discourse markers may signal the semantic
relation between the main verbs of the connected clauses. For example, in the sentence “He always snores while he sleeps”, the marker ‘while’ indicates a temporal
inclusion relation between the clauses, indicating that ‘snore → sleep’.
Often the relation between clauses is not expressed explicitly via an overt conjunction, but is still implied by the syntactic structure of the sentence. To that end,
we aim to utilize the grammatical relations as labelled by dependency parsers.
Dependency parsers offer a functional view of the grammatical relationships in a
sentence, where the sentence components are termed “constituents” and are ordered in a hierarchy, where each edge is labelled with the grammatical dependency between the constituents. The hierarchy is rooted by the main verb, which
is either the only non-auxiliary verb in the sentence (see Appendix C for more details on auxiliary verbs), or the non-auxiliary verb
of the superordinate clause. For example, given the sentence “Bills on ports and
immigration were submitted by Senator Brownback, Republican of Kansas”, a dependency parser will output a labelled hierarchy (or tree) such as in Figure 3.1.

Figure 3.1: A dependency parse of the sentence “Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas”
We conjecture that the dependency relation between co-occurring verbs in a
sentence can give insight to the semantic relation between them. For instance,
verbs can be connected via labeled dependency edges expressing that one clause is
an adverbial adjunct of the other (an adverbial is a clause element that typically refers to circumstances of time, space, reason and manner; an adjunct is an optional constituent of a construction, providing auxiliary information). Such co-occurrence structure does not indicate
a deep semantic relation, such as entailment, between the two verbs. For example,
in the sentence “David saw Tamar as she was walking down the street”, the verbs
’see’ and ’walk down’ are connected via an adverbial adjunct relation and indeed
the two events are temporally connected, but neither verb entails the other.
3.2 Verb Semantics Heuristics
Levin (1993) hypothesized that “the behaviour of a verb...is to a large extent determined by its meaning. Thus verb behaviour can be used effectively to probe for
linguistically relevant, pertinent aspects of verb meaning”. We aim to utilize this
hypothesis and look at the behaviour of verbs in a corpus in order to gain insight
to their semantics, i.e., whether a verb describes a state or event, whether it is general or specific etc. This, in turn, provides some notion of the verb’s likelihood to
participate in an entailment relation.
Verb generality Verb-particle constructions (VPC’s) are multi-word expressions
consisting of a head verb and a particle, e.g., switch off (Baldwin and Villavicencio, 2002). We conjecture that the more general a verb is, the more likely it is
to appear with many different particles. In particular, detecting verb generality
may assist in tackling an infamous property of distributional similarity methods,
namely, the difficulty in detecting the direction of entailment (Berant et al., 2012).
For example, the verb ’cover’ appears with many different particles such as ’up’
and ’for’, while the verb ’coat’ does not. Thus, assuming we have evidence for
an entailment relation between the two verbs, this indicator can help us discern
the direction of entailment and determine that ‘coat → cover’. On the other hand,
if the entailing verb (v1 ) is more general than the entailed verb (v2 ) this could be
a cue for non-entailment, since we assume that entailment is a relation that goes
from specific to general. Inspired by Levin’s hypothesis, we look at the number
of different VPC’s a verb appears in, and deduce its generality/specificity from its
propensity to appear in such constructions.
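A toy sketch of this generality cue (ours, not the thesis implementation); the VPC counts and the decision rule are invented for illustration and would in practice be derived from corpus-wide verb-particle extraction.

```python
def particle_diversity(vpc_counts, verb):
    """Number of distinct particles a verb was seen with: a rough generality cue."""
    return len({part for (v, part) in vpc_counts if v == verb})

# Toy VPC observations (invented): (verb, particle) -> corpus count.
vpc_counts = {
    ("cover", "up"): 50, ("cover", "for"): 30, ("cover", "over"): 10,
    ("coat", "with"): 5,
}

def direction_hint(v1, v2):
    """If v1 looks more specific (fewer particles) than v2, prefer 'v1 -> v2'."""
    d1, d2 = particle_diversity(vpc_counts, v1), particle_diversity(vpc_counts, v2)
    if d1 < d2:
        return f"{v1} -> {v2}"
    if d2 < d1:
        return f"{v2} -> {v1}"
    return "no preference"

print(direction_hint("coat", "cover"))  # coat -> cover
```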
Verb classes As described in Chapter 2.4.3, verb classes are sets of semantically-related verbs sharing some linguistic properties. We wished to utilize the Levin classification in order to gain insight to the semantics of verbs and their propensity to entail. However, the Levin classes have been reported to have inconsistencies, with verbs appearing in multiple classes (Kipper et al., 2000). In addition, due to the granularity of the classes (48 main classes and 191 subdivided classes), it was unclear whether there existed a mapping from the classes and their pairwise combinations L × L to their propensity to entail. For these reasons, we chose to
classify verbs according to the major conceptual categories of ’State’ and ’Event’,
adopted by both Jackendoff (1983) and Dowty (1979):
Stative verbs are verbs that express a state, a stable situation which holds for a
certain time interval, e.g., “own”, “see”, “think”, “believe”.
Event verbs are verbs that express an event, a varying situation which evolves
through a series of temporal phases, e.g., “buy”, “run”, “take”.
With the fine-grained definition of entailment in mind (Chapter 2.4.1), we conjecture that the verbs in an entailing verb pair should usually belong to the same
verb class: Temporal Inclusion and Backward Presupposition deal mostly with
events and Troponymy can hold between two events or two states. In Causation,
on the other hand, the cause verb has a complex event structure (Levin, 1993) and
thus belongs to the Event class while the effect verb, denoting the change caused
by the causative verb, is usually a State verb, e.g., ‘show → see’, ‘buy → own’.
Conversely, if v1 belongs to the ’State’ class and v2 belongs to the ’Event’ class
then we deduce that the verbs are less likely to entail.
3.3 Typed Distributional Similarity
As discussed in Chapter 2.1, distributional similarity is the most common source
of information for learning semantic relations between verbs. Yet, we suggest that
on top of standard distributional similarity measures, which take several verbal
arguments (such as subject, object) into account simultaneously, we should also
focus on each type of argument independently. Figure 3.2 shows our proposed
representation of the verb “see” with five feature vectors: one for each verbal
dependent: adverb, subject, object, preposition complement, and a feature vector which holds all the dependents together (termed ’all’). This feature vector
corresponds to the typical feature vector representation in standard distributional
similarity approaches.
Figure 3.2: Typed Distributional Similarity Feature vectors for the verb “see”
In this work, we apply this representation to compute distributional similarity between verbs based on the set of adverbs that modify them. Our hypothesis,
which is based on Distributional Inclusion (see Chapter 2.1), is that adverbs may
contain relevant information for capturing the direction of entailment: If a verb
appears with a small set of adverbs it is more likely to be a specific verb that
already conveys a specific action or state, making some adverbs redundant. For
example, the verb ‘whisper’ conveys a specific manner of talking and usually does
not appear with the adverbs ‘loudly’, ‘clearly’, ‘openly’ and so forth. Since we
adopt the Inclusion Hypothesis of Entailment (Dagan and Glickman, 2004), i.e.,
that the specific verb entails the general verb, then the verb with a smaller set of
adverb modifiers should entail the verb with a larger set of adverb modifiers. For
instance, as shown in Figure 3.3, the verb ‘whisper’ is represented by a small feature vector of adverbs, while the verb ‘talk’ is represented by a large feature vector
of adverbs, indicating that the entailment direction for the verb pair is ‘whisper →
talk’. We surmise that measuring directional similarity based solely on adverb
modifiers could reveal this phenomenon and assist in establishing the direction of
entailment.
Figure 3.3: Adverb-typed Distributional Similarity vectors for the verbs “whisper”
and “talk”
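A minimal sketch of an adverb-based inclusion score in this spirit (ours, not the exact measure used later in the thesis); real vectors would hold weighted frequencies rather than plain sets, and the adverb lists below are invented.

```python
def adverb_inclusion(adverbs_specific, adverbs_general):
    """Fraction of the first verb's adverb modifiers also seen with the second verb.
    A high value for (v1, v2) and a low value for (v2, v1) supports 'v1 -> v2'."""
    if not adverbs_specific:
        return 0.0
    return len(adverbs_specific & adverbs_general) / len(adverbs_specific)

whisper = {"softly", "quietly"}
talk = {"softly", "quietly", "loudly", "clearly", "openly", "freely"}
print(adverb_inclusion(whisper, talk))  # 1.0  -> supports 'whisper -> talk'
print(adverb_inclusion(talk, whisper))  # low  -> 'talk -> whisper' not supported
```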
Chapter 4
An Integrative Framework for Verb Entailment Learning
In the previous chapter we discussed linguistic observations regarding novel cues
that may help in detecting entailment relations between verbs. We next describe
how we incorporated these cues as features within a supervised framework for
learning lexical entailment rules between verbs. We follow prior work on supervised lexical semantics (Mirkin et al., 2006; Hagiwara et al., 2009; Tremper and
Frank, 2011) and address the rule learning task as a classification task. The classification task includes two distinct phases: a training phase, where we obtain a
labelled set of representative verb pairs as a training set for the verb entailment
classifier, and a testing phase, where we apply the classifier on new unlabelled
verb pairs in order to estimate their entailment likelihood. In both training and
testing, verb pairs are represented as feature vectors. We next describe the feature
space model and then proceed to detail the construction of our verb entailment
classifier.
4.1 Representation and Model
We first discuss how our novel indicators, as well as other diverse sources of
information adapted from prior verb semantics works, are encoded as features.
Most of our features are based on information extracted from the target verb pair
Discourse Relations   Discourse Markers
Contrast              although, despite, but, whereas, notwithstanding, though
Cause                 because, therefore, thus
Condition             if, unless
Temporal              whenever, after, before, until, when, finally, during, afterwards, meanwhile

Table 4.1: Discourse relations and their mapped markers or connectives.
co-occurring within varying textual scopes (sentence, document, corpus). Hence,
we group the features according to their related scope. Naturally, when the scope
is small, i.e., at a sentence level, the semantic relation between the verbs is easier
to discern but the information may be sparse. Conversely, when co-occurrence is
loose the relation is harder to discern but coverage is increased.
4.1.1 Sentence-level co-occurrence
Discourse markers As discussed in Chapter 3, discourse markers may signal
relations between the main verbs of adjacent clauses. The literature is abundant
with taxonomies that classify markers to various discourse relations (Mann and
Thompson, 1988; Hovy and Maier, 1993; Knott and Sanders, 1998). Inspired by
Marcu and Echihabi (2002), we employ markers that are mapped to four discourse
relations ’Contrast’, ’Cause’, ’Condition’ and ’Temporal’, as specified in Table
4.1. We chose this rather concise classification, as we did not wish to inflate the feature space with features relating to overly-specified and at times irrelevant discourse relations (irrelevant in the sense that the relation does not link clauses but sentences, e.g., the Elaboration relation).
For a target verb pair (v1 , v2 ) and each discourse relation r, we count the number of times that v1 is the main verb in the main clause, v2 is the main verb in
the subordinate clause, and the clauses are connected via a marker mapped to the
discourse relation r. For example, given the sentence “I stayed here because I’ve
never known a place as beautiful as this”, the verb pair (‘stay’, ‘know’) appears
in the ’Cause’ relation, indicated by the marker ‘because’, where ‘stay’ is in the main clause and ‘know’ is in the subordinate clause. To establish that the
marker does indeed link the appropriate clauses, we verified that the marker is
directly linked to at least one of the verbs in the dependency tree. Next, each
count is normalized by the total number of times (v1 , v2 ) appear with any discourse marker. The same procedure is done when v1 is in the subordinate clause
and v2 in the main clause. We term the features by the relevant discourse relation,
e.g., ‘v1-contrast-v2’ refers to v1 being in the main clause and connected to the
subordinate clause via a ’Contrast’ discourse marker.
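A simplified Python sketch of this feature computation, assuming the clause-level extraction (main verb, subordinate verb, connecting marker) has already been performed over parsed sentences for one target verb pair; the occurrence tuples and function names are our own illustrations.

```python
from collections import Counter

# Marker-to-relation mapping from Table 4.1.
MARKER_TO_RELATION = {
    "although": "contrast", "despite": "contrast", "but": "contrast",
    "whereas": "contrast", "notwithstanding": "contrast", "though": "contrast",
    "because": "cause", "therefore": "cause", "thus": "cause",
    "if": "condition", "unless": "condition",
    "whenever": "temporal", "after": "temporal", "before": "temporal",
    "until": "temporal", "when": "temporal", "finally": "temporal",
    "during": "temporal", "afterwards": "temporal", "meanwhile": "temporal",
}

def discourse_features(occurrences):
    """occurrences: list of (main_verb, subordinate_verb, marker) tuples extracted
    for one target verb pair (v1, v2). Returns features such as 'v1-cause-v2',
    normalized by the total number of marked co-occurrences."""
    counts = Counter()
    for main_verb, sub_verb, marker in occurrences:
        relation = MARKER_TO_RELATION.get(marker.lower())
        if relation:
            counts[f"{main_verb}-{relation}-{sub_verb}"] += 1
    total = sum(counts.values())
    return {feat: c / total for feat, c in counts.items()} if total else {}

# Invented occurrence tuples for one verb pair.
print(discourse_features([("v1", "v2", "because"), ("v1", "v2", "because"),
                          ("v2", "v1", "when")]))
```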
Dependency relations between clauses As noted in Chapter 3, the syntactic
structure of verb co-occurrence can indicate the existence or lack of entailment.
In dependency parsing this may be expressed via the label of the dependency
relation connecting the main and subordinate clauses. In our experiments we used
the UkWac corpus (http://wacky.sslmit.unibo.it/doku.php?id=corpora), which was parsed by the dependency parser MALT (Nivre
et al., 2006). Thus, we identified three pertinent MALT dependency relations
connecting verbs of two clauses. Other relations proved to be mainly erroneous,
either at the dependency parser level or the POS tagging level.
The first relation is the object complement relation ‘obj’. In this case the subordinate clause acts as a clause complement to the main clause. For example,
in the sentence “it surprised me that the lizard could talk”, the verb pair (‘surprise’,‘talk’) is connected via the ‘obj’ relation. The second relation is the adverbial adjunct relation ‘adv’, in which the subordinate clause is an adverbial adjunct
of the main clause e.g., “he gave his consent without thinking about the repercussions”. The third and last relation is the coordination relation ‘coord’, in which the
two clauses form a coordinated structure. In a coordinated structure, as opposed
to a subordinate structure, the sub-units are combined in a non-hierarchical manner
(Bluhdorn, 2008), e.g., “every night my dog Lucky sleeps on the bed and my cat
Flippers naps in the bathtub”. We conjecture that the first two relations could
provide information for identifying non-entailment, i.e., be a negative feature of
entailment, while the third relation could provide information for identifying entailment, i.e., be a positive feature of entailment.
2 http://wacky.sslmit.unibo.it/doku.php?id=corpora
Similar to discourse markers, we compute for each verb pair (v1 ,v2 ) and each
dependency label d the proportion of times that v1 is the main verb of the main
clause, v2 is the main verb of the subordinate clause, and the clauses are connected
via a dependency relation d, out of all the times they are connected by any dependency relation. We term the features by the dependency label, e.g., ‘v1-adv-v2’
refers to v1 being in the main clause and connected to the subordinate clause via
an adverbial adjunct.
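Analogously, a small sketch of the dependency-label counterpart, assuming the clause-linking arcs for a target pair have already been extracted from the MALT parses (the tuple format is an assumption made for illustration).

from collections import Counter

CLAUSE_LABELS = {"obj", "adv", "coord"}   # the three MALT labels of interest

def clause_link_features(arcs):
    """arcs: (head_verb, dependent_verb, label) tuples connecting the main verbs
    of two clauses for the target pair (v1, v2), with v1 as the head."""
    counts = Counter(label for _, _, label in arcs if label in CLAUSE_LABELS)
    total = sum(counts.values())
    return {"v1-%s-v2" % label: c / total for label, c in counts.items()} if total else {}

print(clause_link_features([("surprise", "talk", "obj"), ("surprise", "talk", "coord")]))
# {'v1-obj-v2': 0.5, 'v1-coord-v2': 0.5}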
4.1.2 Features Adapted from Prior Work
In order to create an integrative model for automatically learning verb entailment
rules, we explored ways of utilizing ideas from previous works on automatically
learning various semantic relations between verbs and altering them to better suit
the task of learning verb entailment rules.
VerbOcean Patterns We follow Chklovski and Pantel (2004) and extract occurrences of VerbOcean patterns that are instantiated by the target verb pair. As
mentioned in Chapter 2, VerbOcean patterns were originally grouped into five semantic classes. Based on a preliminary study we conducted (see Appendix A for
the full experiment), we decided to utilize only four strength-class patterns as positive indicators for entailment, e.g., “he scared and even startled me”, and three
antonym-class patterns as negative cues for entailment, e.g., “you can either open
or close the door”. We note that these patterns are also commonly used by RTE
systems3 .
Since the corpus pattern counts were very sparse, we defined for a target verb
pair (v1 ,v2 ) two binary features: the first denotes whether the verb pair instantiates at least one positive pattern, and the second denotes whether the verb pair
instantiates at least one negative pattern. For example, given the aforementioned
sentences, the value of the positive feature for the verb pair (‘startle’,‘scare’) is
‘1’. Patterns are directional, and so the value of (‘scare’,‘startle’) is ‘0’.
3 http://aclweb.org/aclwiki/index.php?title=RTE_Knowledge_Resources#Ablation_Tests
Different Polarity Inspired by the use of verb polarity in Lapata and Lascarides
(2004) and Tremper and Frank (2011), we compute the proportion of times that a
target verb pair (v1 , v2 ) appears in different polarity. For example, in “he didn’t
say why he left”, the verb ’say’ appears in negative polarity and the verb ’leave’ in
positive polarity. Such change in polarity could be an indicator of non-entailment
between the two verbs.
Tense ordering The temporal relation between verbs may provide information
about their semantic relation (Lapata and Lascarides, 2004). Thus, for each verb
pair co-occurrence we extract the verbs’ tenses and order them as follows: past <
present < future. We then add the features ‘tense-v1<tense-v2’, ‘tense-v1=tense-v2’, and ‘tense-v1>tense-v2’, corresponding to the proportion of times the tense
of v1 is smaller, equal to, or bigger than the tense of v2 . For instance, given the
sentence “He was talking to Sarah when the alarm went off”, both the verb ’talk’
and the VPC ’go off’ appear in the past tense, and the extracted feature will be ‘tense-v1=tense-v2’. This ordering indicates the prevalent temporal relation between the
verbs in the corpus and may assist in detecting the direction of entailment:
• ‘tense-v1=tense-v2’ could be indicative of Troponymy or Temporal Inclusion.
• ‘tense-v1<tense-v2’ could be indicative of Causation (v1 is the cause and v2
the effect).
• ‘tense-v1>tense-v2’ could be indicative of Presupposition, or a negative
feature.
In order to deal with the lack of a morphological future tense in English, we heuristically classify a verb
to a future tense if the verb itself appears in present simple form and is preceded by
a modal4 such as “will”, “would”, “shall”. For instance, given the verb pair (leave,
talk) and the sentence “He will talk to her before he leaves”, the extracted tense
ordering feature will be ‘tense-v1<tense-v2’, since ’talk’ appears in the future
tense and ’leave’ appears in the present tense.
4 This heuristic has a linguistic basis, as discussed in Bybee and Fleischman (1995) and Werner (2003).
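A minimal sketch of the tense-ordering feature described above, assuming Penn-Treebank-style POS tags for the verbs; the tag names and the helper are illustrative, not the thesis code.

TENSE_ORDER = {"past": 0, "present": 1, "future": 2}
MODALS = {"will", "would", "shall"}          # triggers for the future-tense heuristic

def heuristic_tense(pos_tag, preceding_word):
    """Map a verb's POS tag to past/present/future, applying the modal heuristic."""
    if pos_tag == "VBD":
        return "past"
    if pos_tag in ("VB", "VBP", "VBZ") and preceding_word in MODALS:
        return "future"
    return "present"

def tense_feature(tense_v1, tense_v2):
    a, b = TENSE_ORDER[tense_v1], TENSE_ORDER[tense_v2]
    if a < b:
        return "tense-v1<tense-v2"
    if a > b:
        return "tense-v1>tense-v2"
    return "tense-v1=tense-v2"

# "He will talk to her before he leaves", pair (leave, talk):
print(tense_feature(heuristic_tense("VBZ", "he"),       # leaves -> present
                    heuristic_tense("VB", "will")))      # talk   -> future
# tense-v1<tense-v2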
Co-reference Following Chambers and Jurafsky (2008), in every co-occurrence
of a target verb pair (v1 ,v2 ) in a sentence, we extract for each verb the set of
arguments at either the subject or object positions, denoted A1 and A2 (for v1 and
v2 , respectively). We then compute the proportion of co-occurrences in which v1
and v2 share an argument, i.e., A1 ∩ A2 ≠ ∅, out of all the sentence co-occurrences
in which both A1 and A2 are non-empty. For instance, in the sentence fragment
“You will discover as you ride through..” the verbs ’discover’ and ’ride’ share
an argument at their subject position. We later present a similar feature which
is computed for argument co-reference at the document level. The intuition for
both features as proposed in Chambers and Jurafsky (2008) is: “verbs sharing coreferring arguments are semantically connected by virtue of narrative discourse
structure” and as such, might stand in an entailment relation.
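A small sketch of this proportion, assuming the subject/object argument sets for each sentence-level co-occurrence have already been extracted.

def shared_argument_proportion(cooccurrences):
    """cooccurrences: (args_v1, args_v2) pairs, one per sentence, where each element
    is the set of subject/object arguments observed for that verb in the sentence."""
    considered = shared = 0
    for args1, args2 in cooccurrences:
        if args1 and args2:                 # both argument sets are non-empty
            considered += 1
            if args1 & args2:               # A1 and A2 intersect
                shared += 1
    return shared / considered if considered else 0.0

# "You will discover as you ride through..": both verbs share the subject 'you'.
print(shared_argument_proportion([({"you"}, {"you"}), ({"he"}, {"city"})]))   # 0.5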
Syntactic and lexical distance Following Tremper and Frank (2011), for every sentence-level co-occurrence we compute the distance d in dependency edges between the co-occurring verbs. We then compute three features corresponding to three bins, indicating whether d < 3, 3 ≤ d ≤ 7, or d > 7. We normalize the bin counts by the number of times
the verbs co-occur within a sentence and term the features according to their associated bin, e.g., ‘gram-path-upto-3’ refers to v1 and v2 appearing in a sentence
with at most three dependency edges connecting them in the dependency parse
of the sentence. Similar features are computed for the distance in words (lexical
distance), with bins 0 < d < 5, 5 ≤ d ≤ 10, d > 10. These features provide
insight into the relatedness of the verbs and we hypothesize that as the distance of
appearance is larger, the co-occurrence is less meaningful.
Sentence-level PMI Pointwise mutual information (PMI) between v1 and v2 is
computed, where the co-occurrence scope is a sentence. Higher PMI should hint
at semantically related verbs.
PMI_{sen}(v_1, v_2) = \log \frac{P_{sen}(v_1, v_2)}{P_{sen}(v_1, v^*) \cdot P_{sen}(v_2, v^*)}

where P_{sen}(\cdot) is the probability of two verbs co-occurring in a sentence and v^* is any non-auxiliary verb in the UkWac corpus. The P_{sen}(\cdot) probabilities are estimated as:

P_{sen}(v_i, v_j) = \frac{count_{sen}(v_i, v_j)}{\sum_{v, v^* \in V} count_{sen}(v, v^*)}

where count_{sen}(v_i, v_j) is the number of sentences in which v_i and v_j co-occur, and V is the set of all non-auxiliary verbs in the UkWac corpus.
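The following sketch estimates the sentence-level PMI from toy verb-per-sentence data; it is a simplified stand-in for the corpus-scale computation described above.

import math
from collections import Counter
from itertools import combinations

def sentence_pmi(sentences, v1, v2):
    """sentences: lists of (lemmatized, non-auxiliary) verbs, one list per sentence."""
    joint = Counter()
    for verbs in sentences:
        for a, b in combinations(sorted(set(verbs)), 2):
            joint[(a, b)] += 1
    total = sum(joint.values())
    def p(a, b):                       # joint co-occurrence probability
        return joint[tuple(sorted((a, b)))] / total
    def p_any(v):                      # probability of co-occurring with any verb
        return sum(c for pair, c in joint.items() if v in pair) / total
    return math.log(p(v1, v2) / (p_any(v1) * p_any(v2)))

sents = [["injure", "attack"], ["injure", "attack", "flee"], ["attack", "flee"]]
print(round(sentence_pmi(sents, "injure", "attack"), 3))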
4.1.3 Document-level co-occurrence
The following group of features addresses co-occurrence of a target verb pair
within the same document. These features are less sparse, but tend to capture
coarser semantic relations between the target verbs.
Narrative score Chambers and Jurafsky (2008) suggested a method for learning
sequences of actions or events (expressed by verbs) in which a single entity is
involved. They proposed a PMI-like narrative score:
PMI_{Cham}(e(v_1, r_1), e(v_2, r_2)) = \log \frac{P(e(v_1, r_1), e(v_2, r_2))}{P(e(v_1, r_1)) \cdot P(e(v_2, r_2))}
Where e(v1 , r1 ) represents the event denoted by the verb v1 and the dependency relation r1 , e.g., e(push, subject), and similarly e(v2 , r2 ) represents the
event denoted by the verb v2 and the dependency relation r2 , e.g., e(drive, object).
Ultimately, this score estimates whether a pair consisting of a verb and one of its
dependency relations (v1 , r1 ) is narratively-related to another such pair (v2 , r2 ).
Their estimation is based on quantifying the likelihood that two verbs will share
an argument that instantiates both the dependency position (v1 , r1 ) and (v2 , r2 )
within documents in which the two verbs co-occur.
For example, given the document “Lindsay Lohan was prosecuted for DUI. Lindsay Lohan was convicted of DUI.” the pairs (‘prosecute’,‘subj’) and (‘convict’,‘subj’)
share the argument ‘Lindsay Lohan’ and are thus part of a narrative chain. Such
narrative relations may provide cues to the semantic relatedness of the verb pair.
We thus compute for every target verb pair nine features using their narrative score. In four features, r1 = r2 and the common dependency is either a subject, an object, a preposition complement (e.g., “we meet at the station”), or an adverb; these are termed chambers-subj, chambers-obj, and so on. In the next three features, r1 ≠ r2 and r1, r2 denote either a subject, object, or preposition complement (adverbs never instantiate the subject, object or preposition complement positions); these are termed chambers-subj-obj and so on. Last, we add as two features the average of the four features where r1 = r2 (termed chambers-same), and the average of the three features where r1 ≠ r2 (termed chambers-diff).
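A rough sketch of a narrative-style score in this spirit; the estimation below is a simplification of Chambers and Jurafsky's original formulation, and the event tuples are hypothetical.

import math
from collections import Counter

def narrative_pmi(events, slot1, slot2):
    """events: (verb, dependency, argument, doc_id) tuples.  Scores how often the two
    event slots are filled by the same argument within a document, relative to how
    often each slot occurs on its own."""
    slot_counts = Counter()
    joint = Counter()
    fillers = {}
    for verb, dep, arg, doc in events:
        slot_counts[(verb, dep)] += 1
        fillers.setdefault((doc, arg), set()).add((verb, dep))
    for slots in fillers.values():
        for a in slots:
            for b in slots:
                if a < b:
                    joint[(a, b)] += 1
    total = sum(slot_counts.values())
    p_joint = joint[tuple(sorted((slot1, slot2)))] / total
    if not p_joint:
        return float("-inf")
    return math.log(p_joint / ((slot_counts[slot1] / total) * (slot_counts[slot2] / total)))

events = [("prosecute", "subj", "lindsay lohan", 1), ("convict", "subj", "lindsay lohan", 1)]
print(narrative_pmi(events, ("prosecute", "subj"), ("convict", "subj")))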
Document-level PMI Similar to sentence-level PMI, we compute the PMI between v1 and v2 , but this time the co-occurrence scope is a document.
PMI_{doc}(v_1, v_2) = \log \frac{P_{doc}(v_1, v_2)}{P_{doc}(v_1, v^*) \cdot P_{doc}(v_2, v^*)}

where P_{doc}(\cdot) is the probability of two verbs co-occurring in a document and v^* is any non-auxiliary verb in the UkWac corpus. The P_{doc}(\cdot) probabilities are estimated as:

P_{doc}(v_i, v_j) = \frac{count_{doc}(v_i, v_j)}{\sum_{v, v^* \in V} count_{doc}(v, v^*)}

where count_{doc}(v_i, v_j) is the number of documents in which v_i and v_j co-occur, and V is the set of all non-auxiliary verbs in the UkWac corpus.
4.1.4 Corpus-level statistics
The final group of features ignores sentence or document boundaries and is based
on overall corpus statistics.
Distributional similarity Following our hypothesis regarding typed distributional similarity (Chapter 3.3), we represent the verbs as vectors in five different feature spaces, where each vector space corresponds to a different argument or modifier: the verb’s subjects and objects (the proper noun ‘Dmitri’ is the subject and ‘book’ is the object of the verb ’read’ in the sentence "Dmitri reads an interesting book"), prepositional complements (the noun ’station’ in the sentence "We meet at every station"), adverbs (the adverb ’exactly’ in the sentence "They look exactly alike"), and the joint vector that concatenates all of the aforementioned vectors.
Following this representation, we first compute for each verb and each argument a separate vector that counts the number of times each word in the corpus
instantiates the argument position of that verb. We also compute a vector that is
the concatenation of the previous separate vectors, which captures the standard
distributional similarity statistics. We then apply three state-of-the-art distributional similarity measures, Lin (Lin, 1998), Weeds precision (Weeds et al., 2004)
and BInc (Szpektor and Dagan, 2008), to compute for every verb pair a similarity
score between each of the five count vectors6 . In total, there are 15 typed distributional similarity features, one for each combination of the five argument types and the three distributional similarity measures. We term each feature by its combined distributional similarity measure and argument type, e.g., weeds-prep and lin-all represent
the Weeds measure over prepositional complements and the Lin measure over all
arguments, respectively.
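For illustration, a minimal sketch of one typed, directional score (a 'weeds-obj'-style feature) over hypothetical object-slot vectors; in the actual model the vector weights are verb-argument PMIs computed from UkWac.

def weeds_precision(u, v):
    """Directional Weeds precision between two weighted feature vectors (dicts mapping
    argument word -> weight): the share of u's weight mass covered by v's features."""
    shared = sum(w for f, w in u.items() if f in v)
    total = sum(u.values())
    return shared / total if total else 0.0

# Hypothetical object-slot vectors for two verbs.
obj_vectors = {
    "carry":     {"bag": 2.1, "box": 1.7, "gun": 0.9},
    "transport": {"bag": 1.5, "box": 1.2, "passenger": 2.0},
}
print(weeds_precision(obj_vectors["carry"], obj_vectors["transport"]))   # the 'weeds-obj' score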
Verb classes Following our discussion in Chapter 3.2, we designed a feature f
which heuristically estimates for each target verb v ∈ (v1 , v2 ) its likelihood to
belong to the ’Stative’ verb class, by computing the proportion of times v appears
in progressive tense out of all v’s corpus occurrences. The intuition is that stative
verbs usually do not appear in the progressive tense, e.g., the progressive form of
the stative verb ’know’, ‘knowing’, has a low corpus frequency. Then, given a verb
pair (v1 ,v2 ) and their corresponding stative features f1 and f2 , we add two features
f1 · f2 and f1/f2, which capture the interaction between the verb classes of the two verbs. As previously discussed, we hypothesize that certain class configurations relate to entailment while others relate to non-entailment. For instance, a higher f1 · f2 means both verbs lean towards the ’Event’ verb class, and as such are more likely to entail. A lower f1/f2 means that v1 is highly associated with the ’State’ verb
class while v2 is highly associated with the ’Event’ verb class, and as such are less
likely to entail.
6 We employ the common practice of using the PMI between a verb and an argument, rather than the argument count, as the argument’s weight.
Verb generality For each verb v ∈ (v1 , v2 ), we add as a feature the number of different particles it appears with in the corpus, cv , following the hypothesis that this is a cue to its generality (see Chapter 3.2). Then, given a verb pair (v1 ,v2 ) and their corresponding counts cv1 and cv2 , we add the feature cv1 /cv2 . We expect that when cv1 /cv2 is high, v1 is more general than v2 , which is a negative cue for entailment.
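A small sketch of these two corpus-level feature groups over hypothetical counts (the counts and names are illustrative).

def stative_score(progressive_count, total_count):
    """f: the proportion of a verb's corpus occurrences that are in the progressive."""
    return progressive_count / total_count if total_count else 0.0

def class_and_generality_features(f1, f2, particles_v1, particles_v2):
    return {
        "stative-product": f1 * f2,                     # high -> both verbs lean 'Event'
        "stative-ratio":   f1 / f2 if f2 else 0.0,      # low  -> v1 'State', v2 'Event'
        "generality":      particles_v1 / particles_v2 if particles_v2 else 0.0,
    }

f_know = stative_score(120, 50000)      # stative verbs rarely appear in the progressive
f_run  = stative_score(9000, 40000)
print(class_and_generality_features(f_know, f_run, particles_v1=12, particles_v2=3))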
In summary, we compute for each verb pair (v1 , v2 ) 63 features, which are in
turn passed as input to a machine learning classifier. We next describe a framework for selecting and analyzing the predictive power of these features.
4.1.5 Feature Selection and Analysis
Since our model contains many novel features, it was important to investigate their
utility for detecting verb entailment. To that end, we implemented several feature
selection methods as suggested by Guyon and Elisseeff (2003). Feature ranking
methods are widely used mainly because of their computational efficiency, as they
require only the computation of n scores (n being the number of features in the
feature space). However, Guyon and Elisseeff (2003) outline several examples
where a feature that is completely ’useless’ on its own (i.e., ranked very low in
the feature ranking procedure), but provides a significant performance improvement when joined with other features. Thus, it is useful to look at subsets of
features together, and not just in isolation. These methods belong to the following
approaches:
Feature Ranking Approach Given a set of labelled examples, feature ranking
methods make use of a scoring function S computed between each example’s
feature values and its corresponding label (in our case +1 if ‘v1 → v2 ’ and -1
otherwise). To that end, we implemented two methods for feature ranking: the first
uses the Pearson Correlation criteria as its scoring function and the second uses
the performance of a classifier built with a single feature as its scoring function.
The Pearson correlation criterion is defined as:

R(i) = \frac{cov(X_i, Y)}{\sqrt{var(X_i)\, var(Y)}}

where X_i is the random variable corresponding to the i-th component of an input feature vector and Y is the random variable corresponding to the appropriate label \in \{1, -1\}. The estimate we used in order to approximate the Pearson criterion is:

R^*(i) = \frac{\sum_{k=1}^{m} (x_{k,i} - \bar{x}_i)(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{m} (x_{k,i} - \bar{x}_i)^2 \sum_{k=1}^{m} (y_k - \bar{y})^2}}
Where xk,i is the value of the ith feature in example k, x̄i is the average of
values for feature i, yk is the label of example k and ȳ is the average of labels.
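A minimal sketch of ranking features by this estimate, over toy feature values and labels.

import math

def pearson_score(xs, ys):
    """R*(i) for one feature: xs are its values over all examples, ys the labels in {+1, -1}."""
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    norm = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / norm if norm else 0.0

def rank_features(feature_matrix, labels):
    """feature_matrix: dict of feature name -> list of values, one value per example."""
    scores = {name: pearson_score(vals, labels) for name, vals in feature_matrix.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

features = {"weeds-adverb": [0.4, 0.1, 0.5, 0.0], "v1-adv-v2": [0.0, 0.6, 0.1, 0.7]}
print(rank_features(features, [1, -1, 1, -1]))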
Single-feature classifier The underlying idea here is to select features according to their individual predictive power, using the performance of a classifier built
with only a single feature as the ranking criterion. To that end, we performed the
following:
1. For each feature:
(a) Sort the feature values of all examples, in both descending and ascending order (to account for class polarity).
(b) Go through the sorted list, consider each example’s value as the separating threshold and compute the corresponding F1 measure.
(c) We now have two maxima, one for each class polarity. Compare them
and choose the maximal.
2. Sort the features according to their maximal F1 value, to create a ranked list
of features.
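A minimal sketch of the threshold sweep described above, for a single feature over toy values.

def best_f1_threshold(values, labels):
    """Try each example's value as a separating threshold, in both polarities,
    and return the best F1 achieved by the resulting single-feature classifier."""
    best = 0.0
    for polarity in (1, -1):
        for threshold in values:
            predicted = [polarity if v >= threshold else -polarity for v in values]
            tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
            fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == -1)
            fn = sum(1 for p, y in zip(predicted, labels) if p == -1 and y == 1)
            if tp:
                prec, rec = tp / (tp + fp), tp / (tp + fn)
                best = max(best, 2 * prec * rec / (prec + rec))
    return best

print(best_f1_threshold([0.9, 0.2, 0.7, 0.1], [1, -1, 1, -1]))   # 1.0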
We now have two ranked lists of our 63 features, which we utilize in order to
evaluate the discriminative power of each feature. These rankings also allow us
to corroborate or refute our hypotheses regarding the polarity of the feature, i.e.,
whether it is a positive (or negative) cue for entailment, as we conjectured in the
previous chapter.
Figure 4.1: The backward elimination procedure, adapted from Rakotomamonjy (2003).
Feature Subset Selection The feature subset selection approach examines features in the context of classification, in order to find a subset of features that together have good predictive power. Guyon and Elisseeff describe three methodologies for feature subset selection, and we chose to implement a variant of the
embedded methodology, as it offers a simple yet powerful way to address the
problem of feature selection. We implemented a greedy algorithm, backward
elimination, which removes the minimally weighted7 feature at each subset size
of features. Figure 4.1 delineates the backward elimination procedure. As a result
of applying the procedure on our feature set, we have a ranked list of features
according to their effect on the classifier (Rakotomamonjy, 2003). Looking at the
top of the list allows us to discern redundant features, while looking at the bottom
of the list provides us with a different view of the features’ utility, as there might be
features that did not show high correlation and usefulness on their own, but when
applied in tandem with other features, provide useful discriminative information.
7 Each feature f is given a weight |wf | according to its relevance and role in the classification process.
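A sketch of the backward elimination loop, in which the weighting function stands in for re-training the SVM on each feature subset (the toy weights are illustrative).

def backward_elimination(features, train_and_weigh):
    """features: feature names; train_and_weigh(subset) is assumed to train a linear
    classifier on the subset and return {feature: |weight|}.  Returns the features
    ranked by elimination order (last removed = most useful), as in Figure 4.1."""
    remaining = list(features)
    eliminated = []
    while remaining:
        weights = train_and_weigh(remaining)
        weakest = min(remaining, key=lambda f: weights[f])
        remaining.remove(weakest)
        eliminated.append(weakest)
    return eliminated[::-1]

toy_weights = {"pmi-doc": 0.9, "weeds-adv": 0.8, "tense-eq": 0.1}
print(backward_elimination(list(toy_weights), lambda subset: toy_weights))
# ['pmi-doc', 'weeds-adv', 'tense-eq']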
4.2 Learning Framework
We next describe our procedure for learning a generic verb entailment classifier,
which can be used to estimate the entailment likelihood for any given verb pair
(v1 , v2 ). Supervised approaches learn from a set of positive and negative examples of entailment. We construct such a set by starting with a list of candidate
verbs, termed seeds. For each seed, we extract top k candidates for entailment
using either a pre-computed distributional similarity score or an outside resource
such as WordNet. We then have a list of verb pairs of the form (seed, candidate)
and (candidate, seed), which will be annotated for entailment in order to create a
labelled set of examples (see Chapter 5.1.1). This set is utilized as a training set
for a Support Vector Machines (SVM) classifier.
SVM is a prominent learning method used for binary classification, introduced
by Cortes and Vapnik (1995). The basic idea is to find a hyperplane (termed the
separating hyperplane) which separates the d-dimensional8 data into two classes,
with a maximal margin between them.
8 Where d is the number of features.
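A minimal sketch of the learning step; scikit-learn's LinearSVC stands in here for the SVM-perf implementation that was actually used, and the feature vectors and labels are toy stand-ins for the real data.

from sklearn.svm import LinearSVC

X = [
    [0.19, 0.15, 0.11],    # e.g. (weeds-adverb, v1-coord-v2, sentence-PMI) for one pair
    [0.02, 0.01, 0.00],
    [0.22, 0.10, 0.14],
    [0.01, 0.03, 0.02],
]
y = [1, -1, 1, -1]         # +1 if 'v1 -> v2' was annotated as entailing

classifier = LinearSVC()
classifier.fit(X, y)
print(classifier.decision_function([[0.18, 0.12, 0.10]]))   # entailment score for a new pair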
Chapter 5
Evaluation
In this section, we will evaluate our proposed model and establish that our linguistically motivated entailment cues and their integration as features in a supervised
learning framework significantly outperform previous state-of-the-art baselines.
There does not exist, at present, a common framework for evaluating entailment
rules and resources (Mirkin et al., 2009). However, one option for evaluating
entailment rules is to ask annotators to directly judge the correctness of each entailment pair, which yields an explicit and optimal judgment of rules (Dagan and
Glickman, 2004) but can prove to be time and resource-consuming. Another option is to measure the rules’ impact on an end task within an NLP application.
This evaluation approach shows the resource’s importance and its relevancy in a
real-world application setting. In this section we utilize these evaluation settings
and demonstrate the improved performance of our learning model.
5.1 Evaluation on a Manually Annotated Verb Pair Dataset
As argued by Dagan and Glickman (2004), the best judgment for a given entailment pair can be provided by human experts, since “people attribute meanings
to texts and can perform inferences over them”. In order to obtain direct results
assessment, we conducted a manual evaluation of our model based on human
judgments (Szpektor et al., 2007).
5.1.1 Experimental Setting
To evaluate our model, we constructed a dataset containing verb pairs annotated
for entailment and non-entailment. We started by randomly sampling 50 verbs
from a list of the 3500 most common words in English, according to Rundell
and Fox (2003), which we denoted as seed verbs. Next, we extracted the 20
most similar verbs to each seed verb according to the Lin similarity measure (Lin,
1998), which was computed on the RCV1 corpus1 . Then, for each seed verb
vs and one of its extracted similar verbs vsi we generated the two directed pairs
(vs , vsi ) and (vsi , vs ), which represent the candidate rules ‘vs → vsi ’ and ‘vsi → vs ’
respectively. To reduce noise, we filtered out verb pairs where one of the verbs
is an auxiliary or a light verb such as ’do’, ’get’ and ’have’ (see Appendix C for
the full list). This resulted in 812 verb pairs as our dataset2 , which were manually
annotated by two annotators as representing either a valid entailment rule or not,
according to the following framework:
Annotation Framework Supervised approaches learn from a set of positive and
negative examples for entailment. Thus, we need to label the candidate verb pairs
as either positive or negative examples of entailment. A positive example is an
ordered verb pair (v1 , v2 ), where v1 entails v2 . A negative example is an ordered
verb pair (v1 , v2 ), where v1 does not entail v2 . To perform the annotation, we generally followed the rule-based approach for entailment rule annotation (Lin and Pantel, 2001; Szpektor et al., 2004), with the following guidelines:
1. The example rule ‘v1 → v2 ’ is judged as correct if the annotator could think
of reasonable contexts under which the rule holds, i.e., v1 entails v2 . The
entailment relation does not have to hold under every possible context, as
long as the context which it holds for is a natural one and not too obscure
or anecdotal. For example, the verb ’win’ can be used in the context of
winning a war or winning a game. The latter use is frequent enough to infer
that the rule ‘win → play’ is correct, since in order to win a game you must have first played the game, despite the fact that “winning” a war does not entail “playing” a war.

1 http://trec.nist.gov/data/reuters/reuters.html
2 The data set is available at http://www.cs.biu.ac.il/~nlp/downloads/verb-pair-annotation.html
2. The rule ‘v1 → v2 ’ is judged as incorrect if no reasonable context can be
found as evidence to the correctness of the entailment rule, i.e., v1 does not
entail v2 .
We wanted to ascertain that the manual annotations comply with WordNet. We
therefore verified that verb pairs appearing under certain WordNet relations were
also manually annotated as entailing. The relations are either directly mapped
to entailment through the ’Entailment’ and ’Cause’ relations, or mapped to more
general relations that are commonly used by RTE systems3 : ’Hyponymy’ and
’Synonymy’.

3 http://aclweb.org/aclwiki/index.php?title=RTE_Knowledge_Resources
In total 315 verb pairs were labeled as entailing (the rule ‘v1 → v2 ’ was judged
as correct) and 497 verb pairs were labeled as non-entailing (the rule ‘v1 → v2 ’
was judged as incorrect). The Inter-Annotator Agreement (IAA) for a random
sample of 100 pairs was moderate (0.47), as expected from the rule-based approach (Szpektor et al., 2007).
For each verb pair, all 63 features within our model were computed using
the UkWac corpus (Baroni et al., 2009). UkWac is the largest freely available
resource for English with over 2 billion words, complete with POS tags and dependency parsing. For classification, we utilized SVM-perf’s (Joachims, 2005)
linear SVM implementation with default parameters (maximizing F1), and evaluated our model by performing 10-fold Cross Validation over the labeled dataset.
5.1.2 Feature selection and analysis
As discussed in Chapter 4.1.5, we followed the feature ranking methods proposed
by Guyon and Elisseeff (2003) to investigate the utility of our proposed features.
Table 5.1 depicts the 10 features most positively correlated with entailment according to the Pearson correlation measure.

Rank   Feature               Pearson Score   Co-occurrence Level
1      Weeds-adverb          0.1945          Corpus
2      Chambers-obj          0.1646          Document
3      v1-coord-v2           0.1529          Sentence
4      Weeds-pmod            0.1431          Corpus
5      BInc-adverb           0.1405          Corpus
6      Weeds-all             0.1337          Corpus
7      BInc-obj              0.1198          Corpus
8      v2-coord-v1           0.1153          Sentence
9      Sentence-level PMI    0.1082          Sentence
10     Weeds-sbj             0.1024          Corpus

Table 5.1: Top 10 positively correlated features according to the Pearson correlation score and their co-occurrence level.

From the table, it is clear that distributional similarity features, appearing at ranks 1, 4-7 and 10, are amongst the most positively correlated with entailment, which is in line with prior work (Geffet and
Dagan, 2005; Kotlerman et al., 2010). Looking more closely, our suggestion for
typed distributional similarity proved to be useful, and indeed most of the highly
correlated distributional similarity features are typed measures. Standing out are
the adverb-typed measures, with two features in the top 10, including the highest, ‘Weeds-adverb’, and ‘BInc-adverb’. We also note that the highly correlated
distributional similarity measures are directional, Weeds and BInc.
The table also indicates that document and sentence-level co-occurrence contributes positively to entailment detection. This includes both the Chambers narrative measure, with the typed feature Chambers-obj, and the coordinated clauses
with the ‘v1-coord-v2’. Finally, we note that PMI at the sentence level is highly
correlated with entailment even more than at the document level (PMI documents
is further down the list, at rank 22) since the local textual scope is more indicative,
though sparser.
Table 5.2 depicts the 10 most negatively correlated features with entailment according to the Pearson correlation measure. Looking at the table, one immediately
sees that many of our novel co-occurrence features at the sentence level contribute
useful information for identifying non-entailment.

Rank   Feature                   Pearson Score   Co-occurrence Level
1      v1-adv-v2                 -0.1056         Sentence
2      v2-obj-v1                 -0.0860         Sentence
3      v1-obj-v2                 -0.0846         Sentence
4      v2-obj-v1                 -0.0841         Sentence
5      tense-v1 > tense-v2       -0.0751         Sentence
6      lexical-distance 0-5      -0.0710         Sentence
7      verb generality f1/f2     -0.0688         Corpus
8      tense-v1 < tense-v2       -0.0638         Sentence
9      VerbOcean positive        -0.0619         Sentence
10     v1-temporal-v2            -0.0549         Sentence

Table 5.2: Top 10 negatively correlated features according to the Pearson correlation score and their co-occurrence level.

For example, as hypothesized in Chapter 4.1.1, verbs connected via an adverbial adjunct (‘v2-adverb-v1’ and
‘v1-adverb-v2’) or an object complement (‘v1-obj-v2’ and ‘v2-obj-v1’) are negatively correlated with entailment. In addition, the novel ‘verb generality’ feature,
as well as the tense difference feature (‘tense-v1 > tense-v2’) also prove to be
relatively strong cues for non-entailment. We note, however, that the averaged absolute value of the negative scores (0.0755) is somewhat smaller than the averaged
value of the positive scores (0.1158), which could slightly decrease the impact of
the negative signals.
Table 5.3 presents the ranking of features according to the single-feature classifier method.

Rank   Feature               F1 of Feature   Co-occurrence Level
1      Chambers-obj          0.5784          Document
2      v1-coord-v2           0.5748          Sentence
3      Sentence-level PMI    0.5695          Sentence
4      Chambers-adverb       0.5680          Document
5      v2-coord-v1           0.5644          Sentence
6      Chambers-same dep     0.5607          Document
7      Chambers-subject      0.5599          Document
8      Chambers-pmod         0.5598          Document
9      PMI documents         0.559           Document
10     v1-condition-v2       0.559           Sentence

Table 5.3: Top 10 features according to their single feature classifier F1 score and their co-occurrence level.

Looking at the results, we note that the features ‘v1-coord-v2’, ‘v2-coord-v1’ and ‘Sentence-level PMI’ are both positively correlated with entailment and achieve high F1 scores in the single-feature classification. We also note that all presented features relate either to the sentence or the document level of verb co-occurrence. Specifically, 50% of the features are variants of the narrative score (4.1.3), which relates to verb co-occurrence at the document level.

In order to fully grasp a feature’s contribution, it is crucial to view it in the context of other subsets of features. To that end, we implemented the backward elimination algorithm, as described in Chapter 4.1.5. Table 5.4 shows the features
found to be useful by the SVM classifier at almost every subset, i.e., features that
remained until the last 10 iterations (Maximal Weight). The table also presents the
features whose removal caused the minimal change in the first 10 iterations (Minimal Weight). Looking at the maximal weighted features, we note again the equal
distribution of features from all co-occurrences levels: Sentence, Document and
Corpus. We also notice that most of the features have appeared in the previous top
10 features tables. Three features, however, have consistently been at the bottom
of the feature ranking methods: ‘Tense-v1=Tense-v2’, ‘Chambers-obj-pmod’ and
‘Positive VerbOcean’. We see the converse phenomenon with the top 10 minimal
features, with four features appearing both in the previous top features ranking
lists and in the minimal weights features list. In order to further understand these
results, we checked the Pearson correlation score between the features, and saw
that there are interactions between the features that influence their utility for the
classifier. For instance, the feature ranked 2nd in the list, ‘v2-adv-v1’, has a high
correlation with the lexical distance feature 0 < dist < 5, in accordance with the
adverbial adjunct structure which usually occurs when the verbs appear close to
each other, e.g., “David saw Tamar as she was walking down the street”.
Max. Weight Feature     Co-occurrence Level     Min. Weight Feature     Co-occurrence Level
PMI documents           Document                Weeds-obj               Corpus
Binc-adv                Corpus                  v2-adv-v1               Sentence
Lin-sbj                 Corpus                  diff-polarity           Sentence
Weeds-adv               Corpus                  v1-temporal-v2          Sentence
Positive VerbOcean      Sentence                gram-path-upto-3        Sentence
PMI Sentences           Sentence                v2-coord-v1             Document
Chambers-obj            Document                Chambers-adv            Corpus
Chambers-obj-pmod       Document                v1-adv-v2               Sentence
Tense-v1=Tense-v2       Sentence                v2-generality           Corpus
Tense-v1>Tense-v2       Sentence                Binc-all                Corpus

Table 5.4: Feature ranking according to maximal and minimal feature weights, with each feature’s verb co-occurrence level.

To conclude, our feature analysis shows that features at all levels (sentence, document and corpus) contain valuable information for verb entailment learning
and detection, and thus should be combined together. Furthermore, many of our
novel features are amongst the highly correlated features, showing that our intuitions when devising a rich set of verb-specific and linguistically-motivated features were correct. Our feature subset selection analysis, however, shows that
the learning process is more complex, with many interactions between features
and correlations. Thus, it is best to judge the features’ contribution quantitatively
within the classifier, for example using evaluation metrics such as Precision and
F1 on a test set. We shall next describe the results of such evaluation settings,
starting with the evaluation results on a manually-annotated dataset.
5.1.3 Results and Analysis
We next present the experimental results and analysis that show the improved performance of our novel verb entailment model compared to the following baselines,
which were mostly taken from or inspired by prior work:
Random: A simple decision rule: for any pair (v1 , v2 ), randomly classify as
“yes” with a probability equal to the number of entailing verb pairs out of all verb pairs in the labeled dataset (i.e., 315/812 = 0.387).
VO-KB: A simple unsupervised rule: for any pair (v1 , v2 ), classify as “yes” if
the pair appears in the strength relation in the VerbOcean knowledge-base, which
was computed over web counts, and is commonly used in RTE systems4 .
VO-UkWac: A simple unsupervised rule: for any pair (v1 , v2 ), classify as “yes”
if the value of the positive VerbOcean feature is ‘1’(see Chapter 4.1.2 for details
on the VerbOcean feature construction).
TDS: Includes the 15 distributional similarity features in our supervised model.
This baseline extends Berant et al. (2012), who trained an entailment classifier
over several distributional similarity features. This baseline provides an evaluation of the discriminative power of distributional similarity alone, without co-occurrence features.
TDS+VO: Includes the 15 distributional similarity features and the two VerbOcean features in our supervised model. This baseline is inspired by Mirkin
et al. (2006), who combined distributional similarity features and Hearst patterns
(Hearst, 1992) for learning entailment between nouns.
All: Our full-blown model, including all features described in Chapter 4.1.
For all tested methods, we performed 10-fold cross validation and macro-averaged Precision, Recall and F1 over the 10 folds. Table 5.5 presents the results
of our full-blown model as well as the baselines.
4 http://aclweb.org/aclwiki/index.php?title=RTE_Knowledge_Resources#Ablation_Tests

Method      P%      R%      F1%
All         48.73   85.09   61.81
TDS         47.59   60.44   52.94
TDS+VO      46.55   59.73   52.02
Random      36.27   35.23   35.74
VO-KB       53.57   14.28   22.55
VO-UkWac    22.64   3.80    6.52

Table 5.5: Average precision, recall and F1 for the evaluated models.

We first note that, as expected, the VerbOcean baselines (VO-KB and VO-UkWac) provide low recall, due to the sparseness of rigid pattern instantiation for verbs both in the UkWac corpus and on the web. They do, however, provide precise rules, with VO-KB being the most precise baseline. Second, it is the combination of all types of information sources that yields the best performance: our complete model, employing the full set of features, outperforms all other models in
terms of both recall and F1. Its improvement in terms of F1 over the second best
model (TDS), which includes all distributional similarity features, is 17% and
is statistically significant according to the Wilcoxon signed-rank test at the 0.01
level (Wilcoxon, 1945). This result shows the benefits of integrating linguistically
motivated co-occurrence features with traditional pattern-based and distributional
similarity information. Another interesting observation is that the VO features
seem to slightly decrease precision and recall, possibly due to their sparsity.
To further investigate the contribution of features at various co-occurrence levels, we trained and tested our model with all possible combinations of feature
groups corresponding to a certain co-occurrence scope: sentence, document and
corpus. Table 5.6 presents the results of these tests.
Co-occurrence Level             P%      R%      F1%     Statistical Diff.
Sent+Doc+Corpus-level (ALL)     48.73   85.09   61.81   –
Sent+Corpus-level               47.75   84.44   60.88   –
Sent+Doc-level                  47.82   84.20   60.82   –
Doc+Corpus-level                48.15   83.22   60.74   –
Sent-level                      44.71   84.65   58.27   0.01
Doc-level                       44.03   79.22   56.39   0.01
Corpus-level                    45.73   66.36   53.9    0.01

Table 5.6: Average precision, recall, F1, and the statistical significance level of the difference in F1 for each combination of co-occurrence levels.

The most notable result of this analysis is that sentence-level features play an important role within our model. Sentence-level features alone (Sent-level) provide the best discriminative power for verb entailment among the single-level groups, with a statistically significant improvement in F1 compared to the document and corpus levels. Yet, we note that sentence-level features alone do not capture all the information within our model, and they should be combined with one of the other feature groups to reach performance close to the complete model. Furthermore, it is interesting to examine the statistical significance levels: our full model outperforms classifiers built with features from a single level of co-occurrence with a statistical significance of 0.01, while for classifiers composed of features from two different levels
of co-occurrence, the difference is not statistically significant. This shows that by
combining only two co-occurrence levels, we almost reach the performance of the
full model. Specifically, as can be seen in Table 5.6, the Sent+Doc-level model
performs almost as well as the full model, with no statistically significant difference. This subset may be a good substitute for the full model, since its features
are easier to extract from large corpora, as they may be computed in an on-line
fashion, processing one document at a time, as opposed to corpus-level features
whose computation must be done off-line.
As a final analysis (Table 5.7), we randomly sampled 20 correct entailment
rules learned by our model but missed by the typed distributional similarity classifier (TDS). Looking at Table 5.7, our overall impression is that employing co-occurrence information helps to better capture entailment sub-types other than ‘synonymy’ and ‘troponymy’. For example, our model recognizes ‘acquire → own’, corresponding to the ‘cause-effect’ entailment sub-type, and ‘stand → stand up’, corresponding to the ‘causation’ entailment sub-type.
Entailment Rule        Entailment Relation
abuse → harm           Troponymy
examine → study        Synonymy
acquire → own          Cause-Effect
carry → transport      Troponymy
ruin → destroy         Synonymy
identify → examine     Presupposition
abuse → harass         Troponymy
begin → start          Synonymy
eliminate → defeat     Troponymy
identify → detect      Presupposition

Table 5.7: 10 randomly selected correct entailment rules learned by our model but missed by TDS.
5.2 Evaluation within Automatic Content Extraction task
A natural way to evaluate entailment resources is to test their utility in a real-world
NLP application that aims to solve an end task, such as an Information Extraction
(IE) system. We tested our verb entailment model and its learned entailment rules
by using them as a simple IE system. In our experiments we utilized the ACE
2005 event detection task, a standard IE benchmark, in order to compare our verb
entailment model against state-of-the-art baselines. We will next describe and
demonstrate how our proposed model can be utilized to improve performance on
the ACE event detection task. We note that the results are not claimed to be in any
way optimal, since our emphasis is on using the ACE task as a comparative tool,
rather than building an effective and complete IE system that solves the task. The
rest of the section is organized as follows: In 5.2.1 we describe how we utilized
the ACE dataset for evaluating verb entailment rules and in 5.2.2 how we built
a verb entailment resource for the ACE task. Then, we describe the evaluation
metrics and results of comparing our model against two types of baselines: the
first type is a rank-based baseline and the second type is decision-rule-based. In
total, we compare our model against six resources and present the results along
with a detailed analysis in 5.2.3.
5.2.1 Experimental Setting
In the ACE event detection task, when given a set of sentences, an IE system
needs to identify mentions of certain events in each sentence. The task utilizes
the ACE 2005 annotated event corpus as its training set5 . This corpus contains
15,724 sentences, each annotated with one or more events out of 33 possible event
types such as “Attack”, “Divorce” and “Start-Position”. For instance, the sentence
“They have been married for three years.” is annotated as mentioning the event
“Marry”. In order to utilize the ACE dataset for evaluating verb entailment rules,
we generally follow the evaluation methodology described in Szpektor and Dagan
(2008), where the authors worked with 26 event types6 and represented each ACE
event type with a few typical verbs, termed as seeds, that were selected from the
textual description of the event in the ACE guidelines7 . For instance, for the event
“Start-Position” the seed verbs are “start”, “found” and “launch”.
We start our evaluation by learning for each seed, lexical entailment rules of
the form ‘candidate → seed’. We then use these rules to retrieve all sentences
containing either seed or candidate (this is referred to as a match), and label the
retrieved sentences with the event mapped to the seed. For example, if our model
learns the entailment rule ‘maim → injure’ for the seed ’injure’ which is in turn
mapped to the ’Injure’ event, then all sentences containing the candidate verb
’maim’ will also be labelled as ’Injure’ event mentions. We then compute various
evaluation metrics on these matches and compare our resource’s performance to
other state-of-the-art baselines. We next elaborate on how we constructed a verb
entailment resource for the ACE task.
5 http://www.itl.nist.gov/iad/mig//tests/ace/ace05/resource/
6 After filtering out events that ambiguously correspond to multiple distinct verbs.
7 http://projects.ldc.upenn.edu/ace/docs/English-Events-Guidelines_v5.4.3.pdf
5.2.2 Building a Verb Entailment Resource for ACE
The resource is constructed based on the UkWac corpus as follows:
(a) Construct a list of verb pairs of the form (candidate, seed) for each seed
(b) Represent each (candidate, seed) pair as a vector of features, as described in
Chapter 4.1
(c) Train a classifier on an augmented labelled data-set
(d) Retrieve a ranked list of rules of the form ‘candidate → seed’ according to the
verb pair’s classification score
(a) Constructing a list of candidate verb pairs The candidates for each seed s are verbs that co-occur with s in at least 5 sentences in the UkWac corpus (we filtered out the verbs ’be’, ’do’, ’will’, ’take’, ’give’ and ’get’). This
corpus was constructed by extracting text from crawled HTML pages. This meant
that we encountered many non-words in the corpus, such as banner advertisements
and HTML code. To filter this noise, the candidate verbs were required to also
appear in WordNet as non-auxiliary verbs. Last, since the number of candidates
was very large (~70,000), we filtered out verb pairs with negative ’PMI sentences’ scores, so as to remain with a smaller number of candidates that are semantically related and have meaningful sentence-level co-occurrences. In total, we now have
40,285 semantically related verb pairs of the form (candidate, seed).
(b) Representing verb pairs as feature vectors Each (candidate, seed) pair is
converted into a feature vector that includes our manually-engineered features.
As before, we collect the feature values from the UkWac corpus and perform
normalizations as described in Section 4.1.
(c) Training a classifier on an augmented labelled data-set Since this task
deals with the classification of over 40,000 verb pairs, we wanted to expand our
training set such that our model will be trained on a representative and sufficiently
large amount of examples. We thus augmented the original 812 pairs list (5.1.1) with positive and negative examples for entailment, retrieved from WordNet (Fellbaum, 1998). A pair of verbs is considered as a positive example for entailment
if the verbs appear under the following WordNet relations: Entailment, Cause,
Hyponymy and Synonymy. A pair of verbs is considered as a negative example for entailment if the verbs appear as Antonyms (e.g., ‘buy’ and ‘sell’) or as Co-hyponyms, i.e., words that share a common Hypernym; e.g., “cruise” and “raft”
are co-hyponyms since they share the common hypernym “transport”. In total, we
had 3174 verb pairs which were manually annotated for entailment in the manner
described in 5.1.1. We then used the annotated verb pairs as a training set for
SVM-perf’s (Joachims, 2005) linear SVM implementation with default parameters, maximizing Precision.
(d) Ranking the verb pairs The classifier gives a score representing the verb
pair’s likelihood of belonging to the entailing class. To illustrate, suppose the
verb pair (remand, arrest) was given a high score by our classifier. This means
that the classifier is relatively certain that remand entails arrest. We utilize these
classification scores in order to rank the list of entailing candidates for each event.
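A small sketch of this ranking step, assuming a scikit-learn-style linear classifier and hypothetical feature vectors for the (candidate, seed) pairs.

from sklearn.svm import LinearSVC

def rank_candidates(candidate_features, classifier, top_k=20):
    """candidate_features: dict mapping a candidate verb to the feature vector of the
    pair (candidate, seed); rules 'candidate -> seed' are ranked by decision score."""
    scored = {cand: float(classifier.decision_function([vec])[0])
              for cand, vec in candidate_features.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Toy training data standing in for the augmented labelled set of step (c).
clf = LinearSVC().fit([[0.2, 0.1], [0.0, 0.0], [0.3, 0.2], [0.01, 0.02]], [1, -1, 1, -1])
print(rank_candidates({"maim": [0.25, 0.15], "follow": [0.01, 0.0]}, clf, top_k=2))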
5.2.3 Evaluation Metrics and Results
After building a verb entailment resource for the ACE task, our goal now is to evaluate the resource’s performance in comparison to other state-of-the-art resources.
We will use two types of resources as baselines: the first is rank-based, e.g., a
classifier built with Distributional Similarity features, where the pairs are ranked
according to the appropriate classification scores. The second type is a decision-rule-based resource, such as VerbOcean, where the pairs are given a binary score
according to the following decision rule: ’1’ if they appear under certain VerbOcean patterns, ’0’ otherwise. To evaluate the rank-based type, we will create
resources at different cut-off points of the ranked list of verb pairs. This method
is not suitable for evaluating the decision-rule-based resources, since they are inherently sparse and have a binary scoring method which does not create a natural
ranked list. Thus, we will evaluate these sparse baselines by aggregating scores
for all 40,275 proposed verb pairs and computing the overall MAP measure.
Evaluation of Rank-based Baselines Following Szpektor and Dagan (2008),
we test each learned ranked resource by taking the top K entailment rules for each
seed verb, where K ranges from 0 to 100. This parameter controls the number
of rules that can be used. For example, if K = 20 then for every seed s we will
consider only the 20 rules with highest score that expand s. Naturally, when K is
small the recall will decrease, but precision will increase.
For each such resource we perform the following: we first search for all sentences containing either the seed verbs or the candidate expansion verbs of an
event e. We mark each retrieved sentence as correct if e has been annotated as
having an event mention in said sentence. We next compute the following standard evaluation metrics:
Precision(e) = \frac{\#\,\text{correct matches of } e\text{'s verbs}}{\#\,\text{total matches of } e\text{'s verbs}}

Recall(e) = \frac{\#\,\text{correct matches of } e\text{'s verbs}}{\#\,\text{total annotations of } e}

F1(e) = \frac{2 \cdot Precision(e) \cdot Recall(e)}{Precision(e) + Recall(e)}
Where e’s verbs are either the seed verbs associated with it, or the entailing
candidate expansions. To aggregate the scores, we computed micro-averaged Recall,
Precision and F1 over all event types.
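One way to pool the per-event counts into micro-averaged scores is sketched below; the counts are hypothetical.

def micro_prf(per_event_counts):
    """per_event_counts: (correct_matches, total_matches, total_annotations) per event type."""
    correct = sum(c for c, _, _ in per_event_counts)
    matches = sum(m for _, m, _ in per_event_counts)
    annotated = sum(a for _, _, a in per_event_counts)
    precision = correct / matches if matches else 0.0
    recall = correct / annotated if annotated else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# e.g. 'Injure': 30 correct of 80 matches, 60 annotated mentions; 'Attack': 10/25, 40.
print(micro_prf([(30, 80, 60), (10, 25, 40)]))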
Evaluation Results: We compare our model to other rank-based resources, with
a randomly created resource as a lower bound:
(a) DistSim: Includes the three distributional similarity features (Lin, Weeds and
Balanced Inclusion) used in our supervised model. This baseline extends
Berant et al. (2012), who trained an entailment classifier over several distributional similarity features. This baseline provides an evaluation of the
discriminative power of distributional similarity alone, without co-occurrence
features.
(b) DistSim+VO: Includes the three distributional similarity features and the
two VerbOcean features in our supervised model. This baseline is inspired
by Mirkin et al. (2006), who combined distributional similarity features and
Hearst patterns (Hearst, 1992) for learning entailment between nouns.
(c) Random: A simple decision rule: for any pair (candidate, seed), randomly
classify the pair as entailing with a probability equal to the ratio of entailing
verb pairs out of all verb pairs in the labeled training set (i.e., 1379/3174 = 0.434).
(d) All: Our full-blown model, including all features described in Chapter 4.1.
Figure 5.1 presents micro-averaged Precision, Recall and F1 of the evaluated
methods for different cutoff points. We first notice that all resources provide rules
that are less precise than using the seeds without expansions (K = 0). A possible
reason for this phenomenon is that many of the surface verb seeds are ambiguous,
having multiple meanings. For instance, the seed verb ’release’ can be used in
many contexts, yet in ACE it is defined to be used only in the sense of releasing
a criminal from detention. A possible solution to the ambiguity problem is to
only apply entailment rules in specific contexts (as exemplified in (Szpektor et al.,
2008)), but this extension falls out of the scope of our work. Thus, since we learn
entailment rules without a specific context and meaning, a rule which is correct for
a certain meaning of the seed but incorrect for the specific event-mapped meaning, has the potential of incurring many false matches, resulting in a decreased
precision rate. For example, the rule ‘unleash → release’ is a valid rule in most
contexts of the seed verb ’release’, but not in the specific context of the event type
’Release-Parole’ in ACE. For a detailed analysis of this phenomenon, we refer the
reader to our Error Analysis section (5.2.4).
Figure 5.1: Evaluated rank-based methods and their micro-averaged F1 as a function of the top K rules.

The graphs in Figure 5.1 show that using just the distributional similarity measures of Lin, Weeds and Balanced Inclusion increases recall, but dramatically decreases precision already at the cutoff point of K = 20. As a result, F1 only decreases for this method. Adding the VerbOcean features slightly increases the performance of the DistSim model, and shows the complementary nature of the distributional similarity and pattern-based approaches, as discussed in previous works (Mirkin et al., 2006; Hagiwara et al., 2009). But it is our verb entailment model that achieves the best performance, with a statistically significant improvement in terms of both Precision and F1 at the 0.01 level, compared to all baselines except DistSim+VO, where we achieve statistical significance at the 0.05 level. Our model yields a more accurate resource, as evident from its much slower rate of decline in precision compared to the other baselines. Our model’s Recall increases moderately compared to the other baselines, but it is still substantially better than not using any rules, with no statistically significant difference compared to the other baselines, apart from the DistSim baseline, where the difference is at the 0.1 level.
In order to further investigate the contribution of features from various verb
co-occurrence levels, we compared the results of three rank-based resources, each
comprising of features from a distinct level of co-occurrence. Figure 5.2 presents
the results of taking the top K rules for each seed verb, where K ranges from 0 to
100.
Figure 5.2: Co-occurrence level classifiers and their micro-averaged F1 as a function of the top K rules.

Looking at the graphs, we notice that as the level of co-occurrence gets broader, from the sentence level through the document level to the corpus level, the resulting resource has broader coverage but is less precise, showing an increase in recall and a decrease in precision. This is in accordance with our assertions about the nature of verb co-occurrence at different levels (see Chapter 3.1). The sentence-level resource is the best performing resource in terms of F1, and it also contains the vast
majority (66%) of our novel features. This affirms the importance of our linguistic
features in capturing valuable positive and negative information of entailment that
exists when two verbs co-occur within the same sentence.
Evaluation of Decision Rule-based Baselines We next compare our model
against the more sparse resources, which were either constructed by using a strict
decision rule, e.g., the verb pair appears under certain VerbOcean patterns, or by
using an existing lexical resource such as WordNet. As mentioned, due to the
inherent sparsity of these decision-rule-based resources and their binary scoring
method, we cannot use the cutoff method (see Table 5.8 for a comparison of the
average sparseness of these methods). Instead, we will take into account all 40,275
proposed verb pairs and compute the Mean Average Precision (MAP) measure to
compare the entailment rule bases’ quality. MAP is a common evaluation metric
when comparing several ranked lists, which computes the average of the Average Precision (AP) scores of each list. Here, each event has a list of rules ranked by the resource and we compute the event’s AP by averaging the precision values when truncating the ranked list after each correct match, i.e., after we found a sentence containing a verb associated with event e and that sentence was indeed annotated as mentioning event e.

Resource     Avg. # of Positive Rules    Avg. #Positive / Avg. #Candidate Rules
All Model    80                          0.0893
VO-KB        3                           0.0033
VO-Ukwac     9                           0.0089
WordNet      23                          0.0256

Table 5.8: Average number of positive verb pairs per seed with their frequency compared to the average total of 875 candidate rules per seed.
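A minimal sketch of the AP and MAP computation over ranked match lists (toy data).

def average_precision(ranked_matches):
    """ranked_matches: booleans down one event's ranked list; True marks a correct match.
    AP averages the precision obtained at each correct match."""
    hits, precisions = 0, []
    for rank, correct in enumerate(ranked_matches, start=1):
        if correct:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(per_event_rankings):
    return sum(average_precision(r) for r in per_event_rankings) / len(per_event_rankings)

print(mean_average_precision([[True, False, True], [False, True]]))   # 0.6667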
Evaluation Results: We compare our model against the following sparse resources:
(a) VO-KB: a pair (candidate, seed) will be scored with 1 if it appears under the
“stronger-than” relation in the original VerbOcean knowledge-base as constructed by Chklovski and Pantel (2004), and 0 otherwise.
(b) VO-Ukwac: a pair (candidate, seed) will be scored with 1 if it appears under
certain VerbOcean patterns in the Ukwac corpus, as described in 4.1.2, and 0
otherwise.
(c) WordNet: a pair (candidate, seed) will be scored with 1 if it appears under
certain WordNet relations, as described in 5.1.1, and 0 otherwise.
Table 5.9 presents the results of comparing the ranks of all 40,275 verb pairs
by computing the MAP measure. Looking at the table, our model outperforms
the VerbOcean baselines with a statistically significant improvement at the 0.01
level, and at the level of 0.05 for the WordNet baseline. It is interesting to note
the improvement of the VO-KB resource over the VO-UkWac resource, which demonstrates that using a large amount of web queries yields more robust results than using a limited-size corpus.
Resource     MAP
All Model    0.139
VO-KB        0.117
VO-Ukwac     0.103
WordNet      0.124

Table 5.9: MAP results of our model and the decision-rule-based resources.
To conclude, our ACE experiments demonstrate that our model creates an accurate and broader-coverage resource, compared to state-of-the-art models and
knowledge bases. We also ascertain the importance of using features at different levels of co-occurrence and show marked improvement as a result of using
them. However, our system, like all other evaluated systems, suffers from incorrect event matches. In the next section we examine these incorrect event matches
and highlight the reasons behind them.
5.2.4 Error Analysis
In order to perform error analysis, we randomly sampled from the corpus 100
false-positive examples, i.e., sentences that according to our model contained an
event mention of an event e, but the ACE annotators did not annotate the sentence
as containing a mention of event e. The investigation of the sampled examples assisted us in identifying the reasons for these decision errors, whose distribution is presented
in Table 5.10.
Naïve matching mechanism Our evaluation framework involves a simple matching algorithm: first each sentence is tokenized, then each token is lemmatized
according to a few manually constructed rules. Once we have these lemmatized
tokens, we look for an appearance of the seed or candidate verb. This requires little pre-processing at a cost of many wrong matches, since we lack part-of-speech
annotations: 44% of the false positive matches were due to the fact that the verb lemmas were matched to either nouns or adjectives.

Type of Error                               Mentions
Naïve matching mechanism                    46
Incorrect rule learned                      32
Context-related expanding verb              13
Ambiguous seed                              6
Valid entailing verb, not substitutable     3

Table 5.10: Error types and their frequency from a random sample of 100 false matches

In addition, some errors were
caused by the lemmatizer, such as the “bite - bit” error presented in Table 5.11,
where the noun ’bit’ was mistaken to be a past tense occurrence of the verb ’bite’.
Event      Verb       Sentence Fragment
Convict    convict    those favoring the rule say it gives convicts incentive to behave..
Injure     bite       a little bit right after the war started..
Start-Org  institute  from Russian institutes they visited and built..
End-Org    end        had returned to their jobs since the end of the war..

Table 5.11: Example errors due to the naive matching mechanism
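A small sketch of such a naïve matcher; the lemmatization rules below are illustrative rather than the manually constructed rules actually used, but they reproduce the kind of false positive shown above.

import re

IRREGULAR = {"bit": "bite", "went": "go", "left": "leave"}     # toy irregular-past lookup
SUFFIXES = [(r"ies$", "y"), (r"ing$", ""), (r"ed$", ""), (r"s$", "")]

def naive_lemma(token):
    if token in IRREGULAR:
        return IRREGULAR[token]
    for pattern, repl in SUFFIXES:
        if re.search(pattern, token):
            return re.sub(pattern, repl, token)
    return token

def event_matches(sentence, verbs):
    """Mark the sentence as mentioning an event if any of the event's seed or
    candidate verbs appears among the naively lemmatized tokens."""
    lemmas = {naive_lemma(tok) for tok in re.findall(r"[a-z']+", sentence.lower())}
    return sorted(v for v in verbs if v in lemmas)

# Without POS tags, the noun 'bit' is lemmatized to 'bite' and wrongly matched:
print(event_matches("a little bit right after the war started", {"bite", "injure"}))
# ['bite']  -> a false-positive 'Injure' mention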
Context-related expanding verb The errors here are due to verbs that are invalid as a direct substitute for the seed, but which do share a semantic relatedness with the seed and the contexts in which it appears; e.g., the expanding verb “accuse” can appear in contexts of a ’Charge-Indict’ event, because usually when
a person is charged for a crime, he is accused of committing the crime, but this
does not strictly comply with the Lexical Entailment definition. Other examples
include ‘arrest → convict’, ‘kill → attack’ and ‘sue → jail’.
Seed ambiguity Some ACE seeds are ambiguous to begin with. Thus, verbs
that would lead to correct event detection for at least one of the seed’s senses,
sometimes lead to wrong matches. For instance, the seed ’appeal’ can either mean
’take a court case to a higher court for review’, or it can mean ’request earnestly’.
If we look at the second sense, then ’plead’ is a correct expansion of appeal, but
the ACE task deals only with the first sense. Table 5.12 presents examples of this
error type.
Event            Entailment rule       Sentence Fragment
Release-Parole   unleash → release     President Bush, who unleashed hell acting on faulty intelligence..
Appeal           plead → appeal        the Canadian province of British Columbia pleaded no contest to driving drunk..
Execute          carry out → execute   a huge step forward in carrying out the US-backed Middle East policy..

Table 5.12: Example errors due to seed ambiguity
Invalid expanding words The expanding words do not have an evident connection to the seed’s contexts, e.g., ‘hit → arrest’ and ‘follow → arrest’.
Valid entailing verbs, not substitutable These errors are due to the sometimes non-overlapping definitions of Lexical Entailment and the ACE event detection task. A system is required to identify mentions of an event that are evident in the sentence, by matching certain content words, while in Entailment we are interested in the events that can be inferred from certain content words. For example, the learned rule ‘divorce → marry’ follows the presupposition sub-type of entailment, since if a person has divorced his spouse, he must have first been married to her. But for the ACE sentence “The Welches disclosed their plans to divorce a year ago”, the event ’Marry’ was not mentioned in the sentence, although one can infer that it did occur from the appearance of the verb “divorce”.
To conclude, our error analysis shows that:
(a) Almost half of the errors were caused by the naïve matching procedure.
(b) Correct rules with entailing verbs applied in inappropriate contexts, together with context-related rules, yield about a fourth of the errors.
(c) Only 32% of the errors originate from using invalid expansion rules.
Thus, the potential of our learned entailment rule resource is much higher than indicated by the evaluation results. For instance, if we were to use a more sophisticated matching algorithm or add a disambiguation module to handle context and seed ambiguity, we could avoid many of these errors and increase performance.
Chapter 6
Conclusions and Future Work
6.1 Conclusions
The main contribution of this thesis is the design of novel linguistically motivated
cues that capture both positive and negative entailment information which is embedded when two verbs co-occur at the sentence, document and corpus levels.
Our model incorporates these novel cues as features together with useful features
adapted from prior work to combine co-occurrence and distributional similarity
information about verb pairs and create an integrative framework for the detection
of verb entailment.
Our experiment over a manually labeled dataset showed that our model outperforms state-of-the-art algorithms with a statistically significant improvement. Further feature analysis indicated that features at all levels (sentence, document and corpus) contain valuable information for verb entailment learning and detection, and thus should be combined. Furthermore, many of our novel features were amongst the highly correlated features, showing that our intuitions when devising a rich set of verb-specific and linguistically-motivated features were correct. In a second experiment, we performed an application-oriented evaluation of our novel verb entailment model in an applied NLP setting and demonstrated that our model creates an accurate and broad-coverage resource compared to state-of-the-art models and knowledge bases. We also ascertained the importance of using features at different levels of co-occurrence and showed a marked improvement as a result of using them. To conclude, this work demonstrates that by
exploiting the linguistic properties of verb co-occurrence, one can achieve more
precise and scalable methods for verb entailment rule learning and detection. We
believe that this opens the door for further integration of linguistic insights into
the applied field of NLP.
6.2 Future Work
Our study highlighted the potential of integrating linguistically-motivated cues at
different levels of co-occurrence. Yet, there are several extensions we have in
mind for future work in this direction.
Fine-grained Classification As explained in Chapter 2.4.1, entailment is a semantic relation comprising several relation sub-types such as troponymy, temporal inclusion and backward presupposition. We hypothesize that these sub-types manifest themselves differently in corpora and that, by training a different classifier for each sub-type, we will add discriminative power which will, in turn, improve the entailment model. We believe that since our features were designed with this entailment hierarchy in mind, they correspond to different sub-types of entailment, and a natural research direction would be to train a multi-class classifier that classifies every pair of verbs (v1, v2) into one of the sub-types of entailment. Then, we can deduce whether ‘v1 → v2’ based on these scores. Similarly, we can change our annotation guidelines to be specific for every sub-type of entailment and then aggregate the fine-grained annotations to see if we get a more cohesive annotated data set. A first attempt at this approach can be seen in Appendix B.
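As an illustration only, the following sketch shows how such a multi-class setup could look with scikit-learn; the sub-type label set and the assumption that feature vectors (the sentence-, document- and corpus-level features of Chapter 4) are already extracted are illustrative, not a committed design.

from sklearn.svm import LinearSVC

SUBTYPES = ["troponymy", "temporal_inclusion", "presupposition",
            "causation", "synonymy", "non_entailment"]

def train_subtype_classifier(features, subtype_labels):
    # features: one feature vector per verb pair; subtype_labels: one sub-type per pair.
    classifier = LinearSVC()  # trains one-vs-rest classifiers internally
    classifier.fit(features, subtype_labels)
    return classifier

def entails(classifier, feature_vector):
    # v1 -> v2 is deduced whenever any entailment sub-type is predicted.
    return classifier.predict([feature_vector])[0] != "non_entailment"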
Utilizing Typed Distributional Similarity As discussed in 3.3, we can change
the traditional representation of verbs in distributional similarity approaches and
represent them as vectors in different feature spaces, with each vector space corresponding to a different argument or modifier, and improve the discriminative
power of the feature space. We can, in a similar way, represent nouns as vectors
in different feature spaces, for instance to compare a noun’s adjectives. An interesting research question is whether this new typed representation can improve the
performance of a model which learns entailment rules between nouns.
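A minimal sketch of this typed representation follows, assuming hypothetical per-slot co-occurrence counts; the particular choice of slots and the simple averaging of per-slot cosines are illustrative assumptions.

from collections import Counter
from math import sqrt

def cosine(c1, c2):
    # Cosine similarity between two sparse count vectors.
    num = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys())
    den = sqrt(sum(v * v for v in c1.values())) * sqrt(sum(v * v for v in c2.values()))
    return num / den if den else 0.0

def typed_similarity(verb1_slots, verb2_slots,
                     slots=("subject", "object", "modifier")):
    # verbX_slots: dict mapping a slot name to a Counter of fillers seen in that slot.
    sims = [cosine(verb1_slots.get(s, Counter()), verb2_slots.get(s, Counter()))
            for s in slots]
    return sum(sims) / len(sims)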
Extending Our Model to Predicates Predicates are grammatical constructions
which are an extension of verbs. Simply put, a predicate is the main verb and any
auxiliaries that accompany the main verb in a given sentence. In many predicate
structures, the main verb is a light verb such as ’take’, e.g., the predicate ’take a
picture’ in “Shiri took a picture of Yoav”, or an auxiliary verb such as ’is’, e.g.,
the predicate ’is born in’ in “Sheila was born in Hampstead”. It would be interesting to see if our model boosts performance on the task of predicative entailment,
which is a more general notion of verb entailment and has the potential to directly
benefit Open Information Extraction systems (see Fader et al. (2011)).
Adding a Paragraph Co-occurrence Level As exemplified in our ablation tests in 5.1.3 and 5.2.3, the use of different levels of verb co-occurrence is imperative for detecting the entailment relation between verb pairs. A natural extension to this approach is to utilize yet another co-occurrence level: the paragraph level. This co-occurrence level is interesting since it presents a broader scope than the sentence level (thus increasing recall), while, since a paragraph can be thought of as a single discourse unit, it still presents a tighter and more semantically-related scope than the document level (as shown in Pekar (2008)).
Utilizing RTE as an Additional Evaluation Setting An RTE system has to judge
whether a given text segment entails a hypothesis sentence. One can utilize this
framework in order to test our generated rules by providing them as an additional
knowledge resource to an entailment system, for instance the system developed
at Bar-Ilan NLP lab (Stern and Dagan, 2011). The evaluation will compare our
classifier’s entailment rule-set with other state-of-the-art rule-sets, to measure the
added value of our classifier in this framework. This task is broader in scale
than ACE, but since it involves many components and knowledge resources, it is
relatively harder to empirically ascertain whether the rule-set contributes to the
performance of the RTE system. Still, it would be beneficial to see the effect of
using our model in such a broad-scale and important task.
Bibliography
1. Shuya Abe, Kentaro Inui, and Yuji Matsumoto. Acquiring event relation
knowledge by learning co-occurrence patterns and fertilizing co-occurrence
samples with verbal nouns. In Proceedings of The 3rd International Joint
Conference on Natural Language Processing, 2008.
2. Nicholas Asher and Alex Lascarides. Logics of conversation. Cambridge
University Press, 2003.
3. Timothy Baldwin and Aline Villavicencio. Extracting the unextractable: a
case study on verb-particles. In proceedings of the International Conference
on Computational Linguistics, 2002.
4. Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. The
wacky wide web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226, 2009.
5. Jonathan Berant, Ido Dagan, and Jacob Goldberger. Learning entailment relations by global graph structure optimization. Computational Linguistics, 38
(1):73–111, 2012.
6. Hardarik Bluhdorn. Subordination and coordination in syntax, semantics and
discourse: Evidence from the study of connectives. John Benjamins Publishing Company, 2008.
7. Joan L. Bybee and Suzanne Fleischman. Modality in grammar and discourse.
John Benjamins Publishing Company, 1995.
8. Nathanael Chambers and Dan Jurafsky. Unsupervised learning of narrative
event chains. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2008.
10. Timothy Chklovski and Patrick Pantel. VerbOcean: Mining the web for fine-grained semantic verb relations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2004.
10. Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine
learning, 20(3):273–297, 1995.
11. Ido Dagan and Oren Glickman. Probabilistic textual entailment: Generic
applied modeling of language variability. In Proceedings of the Workshop on
Learning Methods for Text Understanding and Mining (PASCAL), 2004.
12. Dekang Lin and Patrick Pantel. DIRT - discovery of inference rules from text.
In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery
and Data Mining, 2001.
13. David R Dowty. Word meaning and Montague grammar. Springer, 1979.
14. Anthony Fader, Stephen Soderland, and Oren Etzioni. Identifying relations
for open information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2011.
15. Christiane Fellbaum. English verbs as a semantic net. International Journal
of Lexicography, 3(4):278–301, 1990.
16. Christiane Fellbaum. WordNet: An Electronic Lexical Database (Language,
Speech, and Communication). The MIT Press, 1998.
17. Maayan Geffet and Ido Dagan. The distributional inclusion hypotheses and
lexical entailment. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2005.
18. Barbara J. Grosz, Scott Weinstein, and Aravind K. Joshi. Centering: A framework for modeling the local coherence of discourse. Computational Linguistics, 21(2):203–225, 1995.
19. Isabelle Guyon and Andre Elisseeff. An introduction to variable and feature
selection. Journal of Machine Learning Research, 3:1157–1182, 2003.
20. Masato Hagiwara, Yasuhiro Ogawa, and Katsuhiko Toyama. Supervised synonym acquisition using distributional features and syntactic patterns. Information and Media Technologies, 4(2):558–582, 2009.
21. Z. Harris. Distributional structure. Word, 10(23):146–162, 1954.
22. Marti Hearst. Automatic acquisition of hyponyms from large text corpora. In
Proceedings of the 14th International Conference on Computational Linguistics, 1992.
23. Jerry Hobbs. Coherence and coreference. Cognitive Science, 3:67–90, 1979.
24. Jerry R. Hobbs. Literature and Cognition. CSLI Publications, 1990.
25. Eduard Hovy and Elisabeth Maier. Organizing Discourse Structure Relations
using Metafunctions. Pinter Publishing, 1993.
26. Ray Jackendoff. Semantics and Cognition. The MIT Press, 1983.
27. T. Joachims. A support vector method for multivariate performance measures.
In Proceedings of the 22nd international conference on Machine learning,
2005.
28. Karin Kipper, Hoa Trang Dang, and Martha Palmer. Class-based construction
of a verb lexicon. In Proceedings of the National Conference on Artificial
Intelligence, 2000.
29. Alistair Knott and Robert Dale. Using linguistic phenomena to motivate a set
of coherence relations. Discourse processes, 18(1):35–62, 1994.
30. Alistair Knott and Robert Dale. Choosing a set of coherence relations for text
generation: a data-driven approach. Trends in Natural Language Generation
An Artificial Intelligence Perspective, 1:47–67, 1996.
31. Alistair Knott and Ted Sanders. The classification of coherence relations and
their linguistic markers. Journal of Pragmatics, 30(2):135–175, 1998.
32. Lili Kotlerman, Ido Dagan, Idan Szpektor, and Maayan Zhitomirsky-Geffet.
Directional distributional similarity for lexical inference. Natural Language
Engineering, 16(4):359–389, 2010.
33. Solomon Kullback and Richard A. Leibler. On information and sufficiency.
The Annals of Mathematical Statistics, 22(1):79–86, 1951.
34. Mirella Lapata and Frank Keller. Web-based models for natural language
processing. ACM Transactions on Speech and Language Processing, 2(1):
3–34, 2005.
35. Mirella Lapata and Alex Lascarides. Inferring sentence-internal temporal relations. In Proceedings of the North American Chapter of ACL: Human Language Technologies, 2004.
36. Beth Levin. English Verb Classes and Alternations: A Preliminary Investigation. University Of Chicago Press, 1993.
37. Dekang Lin. An information-theoretic definition of similarity. In Proceedings
of the 15th international conference on Machine Learning, 1998.
38. Dekang Lin, Shaojun Zhao, Lijuan Qin, and Ming Zhou. Identifying synonyms among distributionally similar words. In Proceedings of the International Joint Conference on Artificial Intelligence, 2003.
39. William Mann and Sandra Thompson. Rhetorical structure theory: Toward a
functional theory of text organization. Text, 8(3):243–281, 1988.
40. Daniel Marcu and Abdessamad Echihabi. An unsupervised approach to
recognizing discourse relations. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2002.
41. Shachar Mirkin, Ido Dagan, and Maayan Geffet. Integrating pattern-based
and distributional similarity methods for lexical entailment acquisition. In
Proceedings of the International Conference on Computational Linguistics,
2006.
42. Shachar Mirkin, Ido Dagan, and Eyal Shnarch. Evaluating the inferential
utility of lexical-semantic resources. In Proceedings of the 12th Conference
of the European Chapter of the Association for Computational Linguistics,
2009.
43. Joakim Nivre, Johan Hall, and Jens Nilsson. Maltparser: A data-driven
parser-generator for dependency parsing. In Proceedings of the International
Conference on Language Resources and Evaluation, 2006.
44. Viktor Pekar. Discovery of event entailment knowledge from text corpora.
Computer Speech and Language, 22(1):1–16, 2008.
45. Marco Pennacchiotti and Patrick Pantel. Entity extraction via ensemble semantics. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP), 2009.
46. Alain Rakotomamonjy. Variable selection using svm-based criteria. Journal
of Machine Learning Research, 3:1357–1370, 2003.
47. Dietmar Rosner and Manfred Stede. Customizing RST for the automatic production of technical manuals. Aspects of Automated Natural Language Generation, 1:199–214, 1992.
48. Michael Rundell and Gwyneth Fox. Macmillan Essential Dictionary for
Learners of English. Macmillan Education, 2003.
49. R. Snow, D. Jurafsky, and A. Y. Ng. Learning syntactic patterns for automatic
hypernym discovery. In Proceedings of the 17th Conference on Advances in
Neural Information Processing Systems, 2005.
50. Asher Stern and Ido Dagan. A confidence model for syntactically-motivated
entailment proofs. In Proceedings of the Conference on Recent Advances in
Natural Language Processing, 2011.
51. Idan Szpektor and Ido Dagan. Learning entailment rules for unary templates.
In Proceedings of the International Conference on Computational Linguistics, 2008.
52. Idan Szpektor, Hristo Tanev, and Ido Dagan. Scaling web-based acquisition of
entailment relations. In Proceedings of the conference on Empirical methods
in natural language processing, 2004.
53. Idan Szpektor, Eyal Shnarch, and Ido Dagan. Instance based evaluation of
entailment rule acquisition. In Proceedings of the 45th Annual Meeting of the
Association of Computational Linguistics, 2007.
54. Idan Szpektor, Ido Dagan, Roy Bar-Haim, and Jacob Goldberger. Contextual
preferences. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2008.
55. Galina Tremper and Anette Frank. Extending semantic relation classification
to presupposition relations between verbs. In Proceedings of the NP Syntax
and Information Structure Workshop, 2011.
56. Julie Weeds and David Weir. A general framework for distributional similarity. In Proceedings of the conference on Empirical Methods in Natural
Language Processing (EMNLP), 2003.
57. Julie Weeds, David Weir, and Diana McCarthy. Characterising measures of
lexical distributional similarity. In Proceedings of the 20th international conference on Computational Linguistics, 2004.
58. Thomas A. Werner. Deducing the Future and Distinguishing the Past: Temporal Interpretation in Modal Sentences in English. Rutgers University, 2003.
59. Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics
Bulletin, 1(6):80–83, 1945.
60. Fabio Massimo Zanzotto, Marco Pennacchiotti, and Maria Teresa Pazienza.
Discovering asymmetric entailment relations between verbs using selectional
preferences. In Proceedings of the 23rd International Conference on Computational Linguistics, 2006.
Appendix A
VerbOcean Web Patterns Experiment
As mentioned in Chapter 4.1.2, we performed a preliminary study of the VerbOcean patterns in the context of Entailment. Our goal was to find out whether these
lexical-syntactic patterns can be used to detect lexical entailment between verbs.
To that end, we ran an experiment using the Web, which we will next describe.
A.1 Experimental Setting
1. We chose 50 commonly used verbs as seeds.
2. For each of the 50 verbs we took the top 20 most similar verbs according
to the Lin similarity measure. This resulted in 1000 verb pairs with some
semantic association between them.
3. We manually annotated the 1000 pairs in a four-way manner (v1 → v2, v2 → v1, v1 ↔ v2, Non-entailment). This resulted in 1,952 ordered verb pairs, annotated in a binary way (after removing duplicates).
4. For each (v1, v2) pair and the 33 VerbOcean patterns, we ran Bing queries of the form: "v1 pattern v2".
For each pattern p, we computed Recall, Precision and F1 measures to check if pattern p is a good indicator of verb entailment (i.e., verbs that get a high count when appearing with p are more likely to entail than not to entail; for the Antonymy relation patterns, high counts should instead indicate non-entailment). We then represented each verb pair (v1, v2) as a vector of features. The features are:
1. For each pattern, a binary feature denoting whether the pair got a zero or non-zero count with the pattern
2. For each pattern, the actual count that the pair got with the pattern
3. For each pattern, the log(count + 1) that the pair got with the pattern
4. For each pattern, the normalized count
This resulted in 264 features (132 for each verb pair assignment: 33 patterns × 4 feature types).
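The following is a hedged sketch of how these per-pattern features might be assembled, assuming counts[(v1, v2, p)] holds the web hit count for the query "v1 p v2" and that normalization divides by a simple total count; the exact normalization used in the experiment is not specified here.

from math import log

def pattern_features(v1, v2, patterns, counts, total):
    # Builds the 4 feature types for every pattern: with 33 patterns this yields
    # 132 features per ordered verb pair, 264 for both directions.
    feats = []
    for p in patterns:
        c = counts.get((v1, v2, p), 0)
        feats.append(1.0 if c > 0 else 0.0)        # binary: zero vs. non-zero count
        feats.append(float(c))                     # raw count
        feats.append(log(c + 1))                   # log(count + 1)
        feats.append(c / total if total else 0.0)  # normalized count (assumed total)
    return feats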
A.2 Results
We compared the Recall-Precision scores of the patterns against that of the baseline (which is to output yes on all candidate verb pairs given by the Lin similarity measure), whose precision is 14.4, recall is 100 and F1 is 25.2. We checked where the curve began to look similar to the baseline (i.e., where precision was close to 14.4): if the recall at that point was low, we concluded that the pattern was akin to a random pattern and thus did not give a meaningful signal of entailment.
For patterns that are supposed to be a negative signal, the Recall-Precision curve was not suitable for comparison. Instead, we plotted the ROC curve and computed the AUC (area under the ROC curve). In this case, we looked for the patterns with high AUC. The patterns that looked promising are:
1. Positive signals for entailment: Xed even Yed (Y → X), Xed and even Yed (Y → X), Xed or at least Yed (X → Y), Xed by Ying the (X → Y).
2. Negative signals for entailment: either Xing or Ying (Y → X) (AUC 0.55), Xed but Yed (AUC 0.57 for both directions), to X but Y (Y → X) (AUC 0.569).
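As an illustration, a minimal sketch of the AUC check for a single pattern follows, using scikit-learn's roc_auc_score; treating the raw pattern count as the ranking score and negating it for negative-signal patterns are assumptions about the setup rather than the exact procedure used.

from sklearn.metrics import roc_auc_score

def pattern_auc(counts, gold_entailment_labels, negative_signal=False):
    # counts: web hit count of the pattern for each verb pair;
    # gold_entailment_labels: 1 if the pair entails, 0 otherwise.
    # For a negative-signal pattern a high count should indicate non-entailment,
    # so the count is negated before scoring.
    scores = [-c for c in counts] if negative_signal else list(counts)
    return roc_auc_score(gold_entailment_labels, scores)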
Appendix B
Fine-grained Verb Entailment Annotation Guidelines
In this task, you are given an ordered pair of verbs (v1,v2) and you need to decide
whether v1 entails v2. We adopt the textual entailment framework, where a text
fragment T is said to entail another text fragment H, if a person reading T will
infer that H is most likely true. In our case this means that v1 entails v2 if you can
think of reasonable contexts under which, for the given pair, one of the following
statements is true:
1. v1 is a verb expressing a specific manner or elaboration of v2. The two
verbs must occur at the same time. The test frame to use in this case is: to
v1 is a way of v2-ing (v2 is in gerund form). For example, to march is a way
of walking, to bake is a way of cooking, and to donate is a way of giving. In
all these examples v1 is a particular way of performing v2 and both verbs
occur at the same time.
2. v1 describes an event or state that occurs while v2 occurs and rarely occurs
without v2. For example, for the pair (snore, sleep) we note that snoring
almost always occurs while sleeping, that is, people will infer that if someone is snoring, then that someone is most likely sleeping. Other examples
include (walk, step), (ambush, wait). However, in (beep, ring) one could
say that a beep is a part of a ring, but beep is more general than just a sound
that occurs when ringing, and can often appear outside the event of ring and
so the relation does not hold for this pair.
3. v2 happens before v1 and must have occurred in order for v1 to occur (i.e.,
v1 presupposes v2). For example, win entails play since the event of play
happens before win and in order for the event of winning to occur, then
one must have played. Please note that win is an ambiguous verb and under
some contexts this rule will not hold. See the next paragraph for a clarification
on this matter. Other examples: if you employ someone then it presupposes
that he was hired, and so employ entails hire. However, you can say that if
you detect something then you must have noticed it first, but the two events
are not disjoint and cannot be temporally separated (the event of noticing
ended and then you detected the object) and so the relation does not hold
for this pair.
4. v1 is a causative verb of change (a verb that causes change in state, position
etc) and v2 is a non-causative verb that occurred because v1 occurred before
it (i.e., the two events do not occur at the same time). For example, the event
of buying something causes the buyer to own that something. Or giving X
to person Y causes person Y to have X.
5. v1 is a synonym of v2 (and vice versa), i.e., v1 and v2 express the same
meaning, and can be considered as alternative ways of conveying the same
information.
In all cases, ambiguity might arise since we look at the verbs out of context and
each verb can be used in different contexts and have several different meanings
(senses). For example, the verb ’win’ can be used as winning a war or winning a
game. The latter use is frequent enough to infer that winning presupposes playing
(since in order to win a game you must have first played the game). The rule is that v1 entails v2 if there are reasonable, non-anecdotal contexts in which one of the five previously described cases holds.
Appendix C
Verb Syntax and Semantics
C.1 Auxiliary and Light Verbs
An auxiliary verb is a verb which adds functional or grammatical meaning to the
clause in which it appears, e.g., to express tense, aspect and voice. Auxiliary verbs
usually accompany a main verb, which provides the main semantic content of the
clause in which it appears. A light verb is a verb which carries little semantic
content on its own and usually forms light-verb constructions (LVCs) with other
verbs or nouns. In our work, when we look at the meaningful co-occurrence of
verbs, we take into account only non-auxiliary and non-light verbs, and manually
filter out, when needed, the following auxiliary verbs: ’be’, ’have’, ’do’, and the
following light verbs: ’have’, ’get’, ’take’ and ’give’.
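A small sketch of this filter follows, with the verb lists taken from the text; applying it to lemmatized verb tokens is an assumption about where in the pipeline the filtering happens.

AUXILIARY_VERBS = {"be", "have", "do"}
LIGHT_VERBS = {"have", "get", "take", "give"}

def is_content_verb(lemma):
    # Keep only non-auxiliary, non-light verbs when counting verb co-occurrences.
    return lemma not in AUXILIARY_VERBS | LIGHT_VERBS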
C.2 Verb Syntax
As mentioned in Chapter 1, verbs are the primary vehicle for describing events and
expressing relations between entities. Raising verbs are a special type of verb, atypical in that they do not express an event or a state and, as such, do not stand in a semantic relation with any arguments. Thus, the subject of the sentence is semantically related to the main verb of the subordinate clause, and not to the main verb of the main (superordinate) clause, as is usually the case. This effect is
illustrated in the following sentences (with the raising verbs in bold):
1. Shlomi seems to be happy
2. It appears that Noga left
3. Sagi happens to be a trombone player
4. Shmuel used to chew tobacco
Similar to raising verbs, control verbs can take an infinitive clause complement, but unlike raising verbs, control verbs have semantic content; they semantically select their arguments, that is, their appearance strongly influences the nature
of the arguments they take.
1. Sharon wants to fly to Japan next year.
2. Modi refused to join the strike.
3. Sara will try to finish her assignments by Friday.
Control verbs appear in contexts that look just like the contexts for raising
verbs, i.e., they are the main verbs in the main clause, with an infinitive clause as a
subordinate. However, a control verb can be viewed as expressing an intent about
an activity or event which is denoted by its complement, as opposed to raising
verbs that cannot be viewed as linked in a meaningful way to their complements.
C.3 Levin Classes
As mentioned in Chapter 2.4.3, Levin’s underlying assumption is that verb meaning and argument realization (alternations in the expression of verbal arguments such as subject, object and prepositional complement) are jointly linked.
To illustrate, let’s examine the following examples:
The verbs “spray” and “load” can realize their arguments in two different ways
(called the locative alternation):
1. Gideon sprayed water on the plants.
2. Gideon sprayed the plants with water.
1. Dana loaded apples into the cart.
2. Dana loaded the cart with apples.
However, semantically related verbs such as “pour” and “dump” can only realize
their arguments in one way:
1. Gideon poured water on the plants.
2. (UNGRAMMATICAL) Gideon poured the plants with water.
1. Dana dumped apples into the cart.
2. (UNGRAMMATICAL) Dana dumped the cart with apples.
Thus, we can group the first two verbs into one class and the last two verbs into another.