Modeling Word Senses Using Lexical Substitution
Domagoj Alagić
Faculty of Electrical Engineering and Computing, University of Zagreb
Text Analysis and Knowledge Engineering Lab
Unska 3, 10000 Zagreb
Email: [email protected]
Abstract – Modeling word meaning is beneficial in various natural language processing (NLP) applications, such as information retrieval, machine translation, and automatic text simplification, among others. The central aspect of word meaning is its senses – the meanings a word can bear. A prototypical NLP task that tackles this problem is word sense disambiguation (WSD). This task comprises assigning the right sense to a word in context, and therefore requires a predefined sense inventory. As such inventories are often task-specific and of inappropriate granularity, new tasks that do not have this requirement have been introduced. One of them is lexical substitution, the task of finding a meaning-preserving substitute for a polysemous word in context. By asking a system (or annotators) to come up with synonyms for a word in context instead of asking them to pick the right sense from a fixed sense inventory, we indirectly elicit word senses in a more natural way. This paper describes the task in detail: it outlines the available datasets for English, explains the evaluation of lexical substitution systems, and presents the work done on this topic, outlining the most prominent supervised and unsupervised machine learning approaches.

Keywords – natural language processing, machine learning, lexical semantics, computational semantics, polysemy, word sense, lexical substitution.

I. INTRODUCTION

Understanding natural language and leveraging the knowledge contained within it proves to be an extremely difficult, yet amazingly rewarding problem in the field of artificial intelligence. With the increasing popularity of the internet in recent years, the amount of freely available unstructured text data is getting larger and larger, evidencing the need for intelligent text analytics. It therefore comes as no surprise that the research field of natural language processing (NLP) has become quite prominent. Understanding natural language most of the time boils down to understanding single words and generalizing those meanings to bigger text fragments, building on the principle of semantic composition. In general terms, this is exactly what lexical semantics, a sub-discipline of linguistics, is concerned with – modeling the way in which the meanings of lexical items contribute to the meanings of the phrases and sentences in which they appear [1]. Correspondingly, in natural language processing (as well as in the related field of computational linguistics), this study falls within the purview of computational lexical semantics. This discipline's backbone is often machine learning, which allows word representations to be learnt automatically from corpus statistics.

One of the most prototypical and longest-standing NLP applications concerned with word senses is word sense disambiguation (WSD), the task of computationally determining the meaning of a word in its context [2]. To illustrate, consider the following two sentences, both containing the noun plane:

1) The flight was delayed due to trouble with the plane.
2) Any line joining two points on a plane lies wholly on that plane.

The noun plane is a polysemous word, as it holds multiple meanings (i.e., senses) depending on the context it is found in. In the first sentence, it bears the meaning of (1) an aircraft that has a fixed wing and is powered by propellers or jets, whereas in the second sentence it denotes (2) an unbounded two-dimensional shape. Given a list of possible senses, a WSD system is required to predict, in most cases, the best-fitting sense for a given context. The sentences used as examples above, as well as their senses, are taken from the English WordNet [3], a large manually constructed lexical resource.

Even though WSD has proven beneficial in various NLP applications, it still comes with its fair share of problems. Perhaps the most prominent one is that it relies on a fixed set of senses for each word, the so-called sense inventory. Even though sense inventories are usually manually compiled by experts (e.g., linguists), they are, more often than not, of inappropriate granularity, something that WordNet is often criticized for [4, 5]. This stems from the fact that different applications often require different levels of granularity, making evaluation of WSD across applications difficult. Even though there have been various attempts to amend for inappropriate sense inventories, for instance, a multi-label and graded sense annotation scheme [6], they still remain the main issue of WSD systems.

The task of lexical substitution, i.e., the replacement of a target word in context with a suitable substitute, often a synonym or a near-synonym, was proposed in the hope of avoiding these problems. This task circumvents the need for a fixed sense inventory by allowing participants (systems or annotators) to indirectly elicit word meanings (and their synonyms) by providing appropriate substitutes for words in context.

This paper gives an overview of the lexical substitution task, outlining the available datasets for English, explaining the evaluation of lexical substitution systems, and presenting the work done on this topic.

The rest of the paper is organized as follows. Section II describes the lexical substitution task, while Section III introduces the available datasets for this task. Section IV outlines the evaluation metrics commonly used to evaluate the performance of lexical substitution systems, as well as the inter-annotator agreement between the annotators constructing lexical substitution datasets. We proceed by presenting and categorizing the different models used on this task (Section V). Lastly, Section VI concludes the paper.
II. TASK

The task of lexical substitution, sometimes called contextual synonym expansion [7], was proposed as a task by McCarthy and Navigli [8] (following the earlier ideas of McCarthy [9]) at the Semantic Evaluations 2007 (SEMEVAL-2007) workshop. The task involves finding a meaning-preserving substitute for a polysemous word (the so-called target word) in a context, all while preserving the sentence's grammaticality. To illustrate this, consider the following example, instance bright.a 1 from the LEXSUB dataset (cf. Section III) [8]:

He was bright and independent and proud.

Given this sentence, suitable substitutes for the word bright may be intelligent, smart, or clever, among others. In contrast, substitutes like luminous and blue represent a different sense of the word, or make no sense, respectively. Additionally, know-it-all is not a viable choice, as it does not fit into the sentence (part-of-speech mismatch), even though it bears somewhat the same meaning. Sometimes it is allowed to use hypernyms or multiword expressions (MWEs) as well.

Technically, solving the lexical substitution task comes in two steps:

1) finding the set of candidate substitutes for a given word;
2) finding the best candidate(s), given the word's context.

The first step, also called candidate synonym collection [7], includes collecting all possible candidate substitutes from lexical resources or corpora. Note that this step does not account for the compatibility of substitute candidates with a given context. After all the candidates are obtained, the next step, also called context-based fitness scoring [7], selects the best substitute candidates. This step is usually tackled as a ranking problem, which results in a list of candidate substitutes ranked by their compatibility with the given context.
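To make the decomposition concrete, the following minimal Python sketch wires the two steps together. The candidate source and the context scorer are deliberately left as plug-in functions (concrete choices are discussed in Section V), and the toy usage at the bottom is purely illustrative rather than taken from any of the cited systems.

```python
# A minimal skeleton of the two-step decomposition: (1) candidate synonym
# collection, which ignores the context, and (2) context-based fitness
# scoring, which returns a ranked list of candidates.
def lexical_substitution(sentence, target, candidate_source, context_scorer):
    # Step 1: collect candidate substitutes for the target word.
    candidates = candidate_source(target)
    # Step 2: score each candidate against the context and rank.
    scored = [(context_scorer(c, sentence, target), c) for c in candidates]
    return [c for _, c in sorted(scored, reverse=True)]

# Toy usage: a hand-written candidate list and a trivial scorer.
ranked = lexical_substitution(
    "He was bright and independent and proud.", "bright",
    candidate_source=lambda t: {"intelligent", "smart", "luminous"},
    context_scorer=lambda c, s, t: 1.0 if c in {"intelligent", "smart"} else 0.0)
print(ranked)
```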
As mentioned in the introduction, the lexical substitution task is interesting because it solves some of the problems occurring in traditional WSD setups. Most importantly, the task does not presume any predefined sense inventory, but rather models word meaning through the provided word substitutes. Another point is that lexical substitution might seem a more natural task, and is often easier for the annotators, as they are not trying to explicitly select the most appropriate predefined meaning of the word (devised by somebody else) in context, but only to come up with its synonyms or near-synonyms.

Similarities with other tasks. Lexical substitution shares similarities with a number of other NLP tasks. Perhaps the most similar task is that of query expansion from information retrieval (IR), where, given an input query, a system needs to come up with additional terms similar to the ones in the original query. Another related task is lexical simplification [10], where a word in context needs to be replaced with a suitable, but cognitively simpler synonym. This amounts to implementing a simplicity constraint in the ranking step of lexical substitution. On the other hand, if the candidate generation step is changed to search for synonyms in another language, one ends up with lexical (word-for-word) machine translation.

The following sections give an overview of the approaches for tackling both the substitute generation and substitute ranking subtasks, their evaluation, the available datasets, and the proposed systems.

III. DATASETS

To build lexical substitution systems that can address the task, we require data that a system can either learn from or be evaluated on. Unfortunately, datasets are almost always costly and time-consuming to produce, as they require significant human effort to ensure their quality and reliability, something that cannot be achieved automatically. In recent years, crowdsourcing [11] has started to gain popularity. Crowdsourcing is based on the idea that it is possible to obtain reliable data by soliciting contributions from a large group of people (usually online). Nonetheless, using crowdsourcing as a means of compiling a dataset has its downsides as well – even though it is easy and cheap to obtain data, some effort needs to be put into ensuring that the collected data is of sufficient quality, as the online users annotating the data may not always be well-intentioned [12]. Many researchers in NLP [13, 11, 14] recognized this as a viable strategy and started using one of the many popular crowdsourcing platforms, such as Amazon Mechanical Turk (https://www.mturk.com/mturk/welcome) and CrowdFlower (https://www.crowdflower.com/), among others.

Types of lexical substitution datasets. Lexical substitution datasets come in two types: lexical sample and all-words. Whereas a lexical sample dataset provides data for a set of target words selected in some manner (most usually hand-picked), all-words datasets consider all the content words (usually nouns, adjectives, verbs, and adverbs) from a running text. Both of these approaches have their merits and downsides in terms of both building the dataset and using it. The lexical sample approach makes dataset compilation a bit more tedious (due to the need to pick the words), but at the same time it offers a more controlled experimental environment, as words with specific attributes can be selected. On the other hand, all-words datasets are more realistic and easier to prepare; moreover, using all the content words from a running text facilitates model generalization, as a model is expected to encounter many words with highly varying attributes, not only carefully selected ones.

Unfortunately, only a handful of datasets for the task of lexical substitution are available, and, what is more, mostly only for English. This section shortly describes the available datasets. Dataset statistics are reported in Table I. (CS denotes whether the dataset was compiled using crowdsourcing.)
TABLE I. DATASET STATISTICS

Dataset   | Type           | CS  | POS cover  | #target words | #instances
LEXSUB    | lexical sample | no  | N, A, V, R | 201           | 2,002
TBWSI-2   | lexical sample | yes | N          | 1,012         | 24,647
COINCO    | all-words      | yes | N, A, V, R | 15,629        | 15,629
LEXSUB dataset. This lexical sample dataset for English was originally introduced along with the task itself at SEMEVAL-2007 [8]. The sample comprises 201 words: nouns, verbs, adjectives, and adverbs, selected in both a manual and an automatic manner. For each sampled word, the authors sampled ten sentences in which it occurs, amounting to 2,002 sentences in total (eight of them are not annotated, possibly by mistake). Five native-speaker annotators were asked to provide substitutes for the target word in each of the sentences. Each substitute is assigned a frequency that denotes how many annotators suggested it for a given sentence.

TBWSI-2 dataset. This is another, albeit larger, lexical sample dataset for English, presented in [15]. It comprises 24,647 sentences for a total of 1,012 frequent target nouns. Substitutes were collected using an iterative three-step crowdsourcing procedure. In contrast to the LEXSUB dataset, substitutes are grouped per word sense, as the main goal of this work was to create a crowdsourced sense inventory using lexical substitution as the backbone of the approach.

COINCO dataset. In contrast to both the LEXSUB and TBWSI-2 datasets, this is an all-words dataset for English, compiled by Kremer et al. [16]. It comprises 15,629 target words across 15,629 sentences (7,117 nouns, 4,617 verbs, 2,470 adjectives, and 1,425 adverbs). Such a large number of sentences and target words required crowdsourcing-based annotation.

IV. EVALUATION

Proper evaluation plays a key role in every NLP application, and lexical substitution is no exception. It therefore comes as no surprise that a lot of attention was given to devising a robust evaluation scheme for lexical substitution at SEMEVAL-2007 [8].

Given an instance (a sentence and a target word), the evaluation of lexical substitution boils down to comparing the substitute sets (or, more precisely, ranked substitute sets) obtained from human annotators (i.e., the gold-standard substitutes) to those of a system. The final performance score is then obtained by aggregating the per-instance scores.

Evaluating system-provided substitutes against the gold-standard ones can be done in a few ways. Even though it is most typically done using the original evaluation metrics proposed at SEMEVAL-2007, a somewhat more recent attempt to improve these metrics was made by Jabbari et al. [17]. Lastly, it is also possible to use general information retrieval (IR) metrics.

Before delving into a more detailed introduction of these metrics, we first formally describe the data. Let T = {t_1, t_2, ..., t_i, ..., t_N} denote the set of N instances for which there are at least two substitutes given by the annotators. Also, we define A = {a_1, a_2, ..., a_i, ..., a_M} as the set of those instances from T (A ⊆ T) for which the system has predicted at least one substitute. We denote the ranked set of system-predicted substitutes for instance t_i as S_i. For each instance t_i, we presume a multiset union of annotator (the so-called gold-standard) substitutes H_i (as each substitute could have been proposed by multiple annotators). To make this clearer, consider the example instance from Section II to be t_1. For this instance, the annotators have provided both the substitutes intelligent and clever three times, and the substitute smart once. Therefore, H_1 = {intelligent : 3, clever : 3, smart : 1}.

Moreover, let count_i(s) denote a function which returns the number of occurrences of the substitute s in H_i. We also define maxcount_i as the maximum number of occurrences of a single substitute in H_i. For simplicity, we use #P to denote the multiset cardinality of a substitute set P according to the gold standard H_i, as shown in (1).

$$\#P = \sum_{s \in P} \mathrm{count}_i(s) \qquad (1)$$

Additionally, as H_i is not a set but rather a multiset, we presume that #H_i simply returns the total number of annotator-provided substitutes for t_i (possible duplicates included). To exemplify, if we presume our system provided the substitute set S_1 = {intelligent, smart, brilliant} for our running example, count_1(smart) returns 1, whereas count_1(brilliant) returns 0, as the word brilliant is not in the gold-standard set. Also, #S_1 = 3 + 1 + 0 = 4 and maxcount_1 = 3.

Some metrics in [8] require the notion of a mode substitute m_i – the unique most frequent annotator substitute, which may not exist for all instances. In line with that, by filtering out the instances that do not contain a mode in their H_i, we construct the mode variants of T and A and denote them with T^m and A^m, respectively. For this reason, we also define bg_i to be the system's best substitute (provided first in the substitute list) for instance t_i. In our running example, there is no mode, as both intelligent and clever were proposed three times. On the other hand, the system's best guess (bg_1) is intelligent, as it is provided first in S_1.
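As a small illustration of this notation, the following Python snippet (a sketch written for this survey, not code from any of the cited systems) encodes the running example and the helpers count_i, #P, and maxcount_i.

```python
# The gold multiset H_1 of the running example and a ranked system output S_1,
# together with helpers mirroring count_i, the multiset cardinality #P,
# and maxcount_i.
from collections import Counter

H1 = Counter({"intelligent": 3, "clever": 3, "smart": 1})   # gold multiset H_1
S1 = ["intelligent", "smart", "brilliant"]                   # ranked system substitutes S_1

def count(H, s):
    """count_i(s): number of annotators who proposed substitute s."""
    return H.get(s, 0)

def multiset_cardinality(H, P):
    """#P: sum of gold frequencies over the substitutes in P."""
    return sum(count(H, s) for s in P)

print(multiset_cardinality(H1, S1))   # #S_1 = 3 + 1 + 0 = 4
print(sum(H1.values()))               # #H_1 = 7
print(max(H1.values()))               # maxcount_1 = 3
```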
A. LEXSUB-based evaluation

McCarthy and Navigli [8] introduce two separate subtasks on which lexical substitution systems should be evaluated – BEST and OOT (out-of-ten).

BEST. The system is free to provide as many substitutes as it sees fit, but the score for each correct one is divided by the total number of provided substitutes. The first substitute in the list of provided substitutes is considered to be the system's best substitute. As providing multiple substitutes hurts the overall score, the system cannot simply provide as many substitutes as possible to increase its score, but it can still provide a couple of them if it is not certain which response is better. The authors also introduce mode variants of precision and recall, Mode P and Mode R, which evaluate how successful the system is in guessing the single best substitute (i.e., the gold-standard mode). Instances are scored according to (2).

$$\mathrm{best}(i) = \frac{\#S_i}{\#H_i \cdot |S_i|}, \qquad \mathrm{best\text{-}mode}(i) = \begin{cases} 1 & \text{if } bg_i = m_i, \\ 0 & \text{otherwise.} \end{cases} \qquad (2)$$
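Continuing the sketch above, the per-instance scores from (2) can be computed as follows; note that best-mode is undefined for the running example, since it has no mode.

```python
# Per-instance best scores, cf. (2). best_mode returns None when the instance
# has no unique most frequent gold substitute, i.e., it is skipped when
# averaging over the mode variants T^m and A^m.
from collections import Counter

H1 = Counter({"intelligent": 3, "clever": 3, "smart": 1})
S1 = ["intelligent", "smart", "brilliant"]

def best(H, S):
    # best(i) = #S_i / (#H_i * |S_i|)
    num = sum(H.get(s, 0) for s in S)           # #S_i
    return num / (sum(H.values()) * len(S))     # #H_i * |S_i|

def best_mode(H, S):
    top = H.most_common(2)
    has_mode = len(top) == 1 or top[0][1] > top[1][1]
    return float(S[0] == top[0][0]) if has_mode else None

print(round(best(H1, S1), 3))   # 4 / (7 * 3) = 0.19
print(best_mode(H1, S1))        # None: t_1 has no mode
```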
OOT. The system may provide up to ten substitutes (i.e., |S_i| ≤ 10) without being penalized for providing more than one substitute. The way of scoring an individual instance is therefore quite similar to the one used for the BEST subtask; see (3). Notice that this way of scoring does not punish the model for wrong predictions either. This implies that it makes no sense for the model not to provide the maximum number of substitutes as its output. As a result, this subtask actually evaluates the coverage of a system.

$$\mathrm{oot}(i) = \frac{\#S_i}{\#H_i}, \qquad \mathrm{oot\text{-}mode}(i) = \begin{cases} 1 & \text{if } m_i \in S_i, \\ 0 & \text{otherwise.} \end{cases} \qquad (3)$$

Overall precision and recall scores for both the BEST and OOT subtasks are obtained by averaging over A and T, respectively. Additionally, the mode variants of these metrics are obtained by averaging over A^m and T^m, respectively.
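The oot scores from (3) admit an equally small sketch, again self-contained for the running example.

```python
# Per-instance oot scores, cf. (3); no penalty is applied for wrong substitutes.
from collections import Counter

H1 = Counter({"intelligent": 3, "clever": 3, "smart": 1})
S1 = ["intelligent", "smart", "brilliant"]   # up to ten substitutes are allowed

def oot(H, S):
    # oot(i) = #S_i / #H_i
    return sum(H.get(s, 0) for s in S) / sum(H.values())

def oot_mode(H, S):
    # 1 if the gold mode is among the substitutes; None when no mode exists
    top = H.most_common(2)
    has_mode = len(top) == 1 or top[0][1] > top[1][1]
    return float(top[0][0] in S) if has_mode else None

print(round(oot(H1, S1), 3))   # 4 / 7 = 0.571
print(oot_mode(H1, S1))        # None for the running example
```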
B. LEXSUB-based evaluation revisited

Relatively recently, Jabbari et al. [17] proposed modifications to the original evaluation metrics (cf. the previous section). The first modification concerns the best metric, which is in most cases well below 1 even for a system that performs optimally on the task (i.e., returns a single correct most frequent response for the instance). The reason behind this is that the score for instance t_i is also divided by #H_i, whose size depends on the number of substitutes proposed by the annotators for this instance. They fix this by dividing the score by maxcount_i instead of #H_i, as shown in (4).

$$\mathrm{new\text{-}best}(i) = \frac{\#S_i}{\mathrm{maxcount}_i \cdot |S_i|} \qquad (4)$$

Additionally, they find that best-mode has some downsides as well. First, it is too brittle, as the overall score is calculated by averaging only over the instances that contain modes (T^m and A^m). Secondly, it is lossy, as it assigns no score to responses that are suboptimal (i.e., that do not have the correct substitute listed as the best one). To account for this, they use the metric shown in (5).

$$\mathrm{new\text{-}best\text{-}mode}(i) = \frac{\mathrm{count}_i(bg_i)}{\mathrm{maxcount}_i} \qquad (5)$$

However, they give up on the precision score and use only recall, obtained by averaging over T (the same as for best and best-mode). The metric still assigns a score of 1 to an instance for which the correct substitute is the best guess, but it also partially rewards the instances on which the system failed to predict the most frequent annotator substitute.

The authors also criticize the coverage-based metrics, oot and oot-mode, for not punishing the incorrect substitutes provided by a system. In the original metrics, a system is encouraged to provide up to ten substitutes without worrying about any kind of penalization for multiple (possibly incorrect) substitutes. To account for this, they use the original oot as a per-instance recall metric, new-R, and provide a new per-instance precision metric, new-P; see (6).

$$\mathrm{new\text{-}P}(i) = \frac{\#S_i}{\#S_i + k\,|S_i - H_i|}, \qquad \mathrm{new\text{-}R}(i) = \frac{\#S_i}{\#H_i} \qquad (6)$$

The term |S_i − H_i| represents the number of incorrect substitutes, i.e., a penalty term whose effect is modified using a penalty factor k, which might be set to 1 in the simplest case. In the end, the overall score is calculated by averaging these scores across T (the same can be done for the F1 score, the harmonic mean of precision and recall).
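A sketch of the modified per-instance scores (4)-(6) of Jabbari et al. [17] for the running example, with the penalty factor k left as a parameter.

```python
# Modified per-instance scores of Jabbari et al. [17], cf. (4)-(6).
from collections import Counter

H1 = Counter({"intelligent": 3, "clever": 3, "smart": 1})
S1 = ["intelligent", "smart", "brilliant"]

def new_best(H, S):
    # (4): divide by maxcount_i instead of #H_i
    return sum(H.get(s, 0) for s in S) / (max(H.values()) * len(S))

def new_best_mode(H, S):
    # (5): partial credit for a suboptimal best guess
    return H.get(S[0], 0) / max(H.values())

def new_precision(H, S, k=1.0):
    # (6): the |S_i - H_i| incorrect substitutes are penalized with factor k
    correct = sum(H.get(s, 0) for s in S)
    wrong = sum(1 for s in S if s not in H)
    return correct / (correct + k * wrong)

def new_recall(H, S):
    # the original oot score, reused as per-instance recall
    return sum(H.get(s, 0) for s in S) / sum(H.values())

print(round(new_best(H1, S1), 3))        # 4 / (3 * 3) = 0.444
print(round(new_best_mode(H1, S1), 3))   # 3 / 3 = 1.0
print(round(new_precision(H1, S1), 3))   # 4 / (4 + 1) = 0.8
print(round(new_recall(H1, S1), 3))      # 4 / 7 = 0.571
```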
C. Ranking-based evaluation

As a last way of evaluating lexical substitution, one can always use any of the ranking comparison metrics popular in information retrieval (IR), assuming that the gold-standard substitutes are ranked by how many times they were proposed by the annotators. Perhaps the most popular ones are generalized average precision (GAP), introduced in [18], and precision at the k-th rank (P@k). As P@k is a rather straightforward metric, we only describe GAP here. First, we introduce average precision (AP) in (7).

$$\mathrm{AP}_i = \frac{1}{|H_i|} \sum_{j=1}^{|S_i|} x_j\, p_j, \qquad p_j = \frac{1}{j} \sum_{k=1}^{j} x_k \qquad (7)$$

The term x_j is a binary variable indicating whether the j-th item (as ranked by the system) is in the gold-standard set or not. If we generalize x_j to denote the gold-standard weight of the j-th item, or 0 if it is not in the gold-standard set, we can define GAP according to (8).

$$\mathrm{GAP}_i = \frac{1}{R_i} \sum_{j=1}^{|S_i|} I(x_j)\, p_j, \qquad R_i = \sum_{k=1}^{|H_i|} I(y_k)\, \bar{y}_k \qquad (8)$$

We presume that I(x_j) = max(x_j, 0), and that ȳ_k is an average weight over the ranked gold-standard set of substitutes y_1, y_2, ..., y_{|H_i|}. In simple terms, GAP is just a variant of AP that weights the substitutes by their gold-standard frequency. This enables a smoother, more realistic comparison of two ranked substitute sets.
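The following sketch computes GAP for the running example under the reconstruction in (7)-(8), following the common formulation in which p_j is the running average of the gold weights along the system ranking and the normalizer is computed from the gold frequencies sorted in decreasing order; details left implicit in the text (e.g., tie-breaking) are simplified here.

```python
# GAP for a single instance, cf. (8); the gold frequencies serve as weights.
from collections import Counter

H1 = Counter({"intelligent": 3, "clever": 3, "smart": 1})
S1 = ["intelligent", "smart", "brilliant"]

def gap(H, S):
    x = [H.get(s, 0) for s in S]                            # gold weights of the system ranking
    p = [sum(x[:j + 1]) / (j + 1) for j in range(len(x))]   # running averages p_j
    numerator = sum(p_j for x_j, p_j in zip(x, p) if x_j > 0)
    y = sorted(H.values(), reverse=True)                    # ideal (gold) ranking
    y_bar = [sum(y[:k + 1]) / (k + 1) for k in range(len(y))]
    return numerator / sum(y_bar)

print(round(gap(H1, S1), 3))   # (3 + 2) / (3 + 3 + 7/3) = 0.6
```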
D. Calculating inter-annotator agreement

In order to assess the reliability of a compiled human-annotated dataset, one usually reports an inter-annotator agreement (IAA), a measure that tells how much the annotators agree in annotating the dataset.

The way of calculating IAA for lexical substitution datasets was proposed at SEMEVAL-2007. We formalize it as follows. First, we use C_i to denote all the annotator pairs that were required to annotate t_i. As each annotator from a pair provided a substitute set for t_i, we use ⟨r_i, r_i'⟩ to denote their responses. We compute the pairwise agreement (PA) as shown in (9).

$$\mathrm{PA} = \frac{1}{|T|} \sum_{t_i \in T} \frac{1}{|C_i|} \sum_{\langle r_i, r_i' \rangle \in C_i} \frac{|r_i \cap r_i'|}{|r_i \cup r_i'|} \qquad (9)$$

The pairwise agreement with the mode, PA^m, is also introduced, according to (10). We denote the set of all provided substitute sets for t_i as R_i.

$$\mathrm{PA}^m = \frac{1}{|T^m|} \sum_{t_i \in T^m} \frac{1}{|R_i|} \sum_{r_i \in R_i} I(m_i) \qquad (10)$$

The term I(m_i) is 1 if m_i ∈ r_i, and 0 otherwise.
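A minimal sketch of the pairwise agreement in (9); the toy annotation data below is invented for illustration only.

```python
# Pairwise inter-annotator agreement, cf. (9): for every instance and every
# annotator pair, the overlap of their substitute sets is averaged.
from itertools import combinations

def pairwise_agreement(instances):
    """`instances` maps an instance id to a list of per-annotator substitute sets."""
    total = 0.0
    for responses in instances.values():
        pairs = list(combinations(responses, 2))
        agree = sum(len(r & r2) / len(r | r2) for r, r2 in pairs if r | r2)
        total += agree / len(pairs)
    return total / len(instances)

toy = {
    "t1": [{"intelligent", "clever"}, {"intelligent", "smart"}, {"clever"}],
}
print(round(pairwise_agreement(toy), 3))   # 0.278 for this toy instance
```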
V. LEXICAL SUBSTITUTION MODELS

Work on lexical substitution either addresses both the generation of plausible substitute candidates and their ranking [19, 20, 21], or focuses only on the ranking step [22, 23, 24, 25, 26, 27]. Even though substitute candidate generation and ranking are both an integral part of an end-to-end lexical substitution system that can generalize to unseen targets, most work subsequent to the original SEMEVAL-2007 task has focused almost exclusively on candidate ranking. The motivation behind this mostly lies in the easier comparison of ranking models, as the ranking step does not need to be concerned with the generation of possible substitute candidates.

A. Generating substitute candidates

There has been an almost exclusive focus on obtaining the set of substitute candidates from lexical resources. As many available lexical resources (e.g., thesauri, wordnets) contain information about synonymy relations between words, this approach turns out to be rather straightforward.

One such resource is the English WordNet [28], extremely popular in the literature [7, 19, 20, 21, 29, 30, 31]. The simplest approach is to fetch all the words from the target's synsets, but some also use words from synsets accessible by following a single link [21] or via the similar to, entailment, and also see relations [32], or decide to include the hypernyms of the target word [20]. Some researchers [7] also use TransGraph [33], a large multilingual graph-based resource from which synonyms can be deduced as well. Expectedly, various thesauri, such as the Oxford American Writer's Thesaurus (OAWT) [34] (e.g., in [19]), Roget's Thesaurus [35] (e.g., in [7, 21, 36]), the Macquarie Thesaurus [37] (e.g., in [29, 38]), and the Microsoft Encarta Thesaurus [7, 20], are often used as well.

Corpus-based approaches to extracting substitute candidates do exist, but they are quite rare. Hawker [29] used Web 1T [39], a dataset of n-gram counts obtained from a very large corpus, along with WordNet and the Macquarie Thesaurus, to come up with potential candidates. Moreover, some works [7] derive substitute candidates from a corpus using Lin's distributional similarity [40] to identify similar words.
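As an illustration of such resource-based candidate generation, the sketch below uses NLTK's WordNet interface (assuming the WordNet data is installed). Which relations to follow is a design choice, mirroring the variants cited above rather than reproducing any single system.

```python
# Candidate generation from WordNet: lemmas of the target's own synsets plus,
# optionally, lemmas reachable over a single relation link (hypernyms,
# similar-tos, also-sees, entailments).
from nltk.corpus import wordnet as wn

def wordnet_candidates(target, pos, include_related=True):
    pool = set()
    for synset in wn.synsets(target, pos=pos):
        related = [synset]
        if include_related:
            related += (synset.hypernyms() + synset.similar_tos()
                        + synset.also_sees() + synset.entailments())
        for syn in related:
            pool.update(l.name().replace("_", " ") for l in syn.lemmas())
    pool.discard(target)
    return pool

print(sorted(wordnet_candidates("bright", wn.ADJ))[:10])
```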
B. Substitute candidate ranking

When experimenting with models for substitute candidate ranking, one must have available a dataset that provides the gold-standard substitute candidates to be ranked by a system (instead of the system automatically generating its own). Most researchers simply obtain these by pooling the annotated gold-standard substitutes from all the instances of each target type (lemma and POS tag). Additionally, MWE substitutes are usually discarded from the gold-standard substitute sets (as well as any instances left with no substitutes after this step).

This section briefly outlines prominent supervised and unsupervised models, as well as some honorable mentions.

Supervised methods. Surprisingly, only a couple of papers deal with supervised substitute candidate ranking.

Szarvas et al. [32] abandoned the need for a separate classifier for each word in a training set, and instead created a classifier that can handle any word, as long as a list of its synonyms is available. They accomplish this by using delexicalized (i.e., non-lexical) features, which have the same semantics regardless of the target word, context, or substitute candidate. For their experiments, they use a Maximum Entropy (MaxEnt) binary classifier that predicts whether a given substitute candidate is appropriate in a given context or not. The final candidate ranking is then naturally obtained by sorting the candidates by their posterior probability of being an appropriate substitute. They use a wide range of features, ranging from local n-gram frequencies obtained from web data and shallow syntactic features (e.g., part-of-speech patterns) to hypernymy information and the number of senses in WordNet. To make sure they are actually evaluating their model on unseen words, they split the dataset at the word level, i.e., some words (and all their instances) are part of the training set, whereas others are part of the test set. They evaluate on the LEXSUB and TBWSI-2 datasets, mostly ignoring the original evaluation metrics and sticking with GAP and P@1.

Szarvas et al. [41] upgrade their previous work by experimenting with different learning-to-rank models, such as EXPENS, RANKBOOST, RANKSVM, and LAMBDAMART. They repeat the evaluation on LEXSUB and TBWSI-2, showing that LAMBDAMART outperforms the other methods. They also compare their models to previous work on LEXSUB, showing that some of their models significantly outperform [32], as well as some other models from the literature.
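A hedged sketch of this delexicalized set-up: scikit-learn's logistic regression stands in for the MaxEnt learner, and the features and training pairs below are toy placeholders rather than the n-gram, syntactic, and WordNet features used in the actual system.

```python
# Delexicalized supervised ranking: featurize (context, candidate) pairs in a
# word-independent way, train a binary "acceptable substitute?" classifier,
# and rank candidates by predicted probability.
import numpy as np
from sklearn.linear_model import LogisticRegression

def delexicalized_features(sentence, target, candidate):
    # Placeholder features only; the cited system uses web n-gram frequencies,
    # shallow syntax, and WordNet-based statistics.
    return [len(candidate), abs(len(candidate) - len(target)),
            float(candidate.endswith(target[-2:]))]

# Hypothetical training pairs: (sentence, target, candidate, is_good_substitute)
train = [("He was bright and proud.", "bright", "intelligent", 1),
         ("He was bright and proud.", "bright", "luminous", 0),
         ("The plane was delayed.", "plane", "aircraft", 1),
         ("The plane was delayed.", "plane", "surface", 0)]
X = np.array([delexicalized_features(s, t, c) for s, t, c, _ in train])
y = np.array([t[-1] for t in train])
clf = LogisticRegression().fit(X, y)

def rank(sentence, target, candidates):
    feats = np.array([delexicalized_features(sentence, target, c) for c in candidates])
    probs = clf.predict_proba(feats)[:, 1]
    return sorted(zip(candidates, probs), key=lambda p: -p[1])

print(rank("She gave a bright answer.", "bright", ["clever", "shiny"]))
```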
Unsupervised methods. Models from this category mostly tackle the problem of constructing a contextualized distributional representation of word meaning (word meaning in context) [42, 43, 22, 23, 44, 24, 45, 25, 46, 47]. Even though lexical substitution and word meaning in context have different goals, solving the latter provides simple means of solving the former as well. Once a contextualized vector of word meaning is acquired, it is simple to check which substitute candidate generates the vector most similar to the original one (in terms of, for example, cosine similarity), and is thus the best candidate. That is why most models dealing with word meaning in context are evaluated on lexical substitution datasets.
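The following sketch illustrates this generic recipe with a deliberately naive contextualization (averaging the target vector with its context-word vectors) and cosine-based ranking; the random vectors merely stand in for pre-trained embeddings and the recipe is not that of any single cited model.

```python
# Rank substitute candidates by cosine similarity to a context-sensitive
# vector of the target word.
import numpy as np

rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=50) for w in
           ["bright", "intelligent", "luminous", "proud", "independent", "he", "was", "and"]}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def contextualized(target, context_words):
    ctx = [vectors[w] for w in context_words if w in vectors]
    return (vectors[target] + np.mean(ctx, axis=0)) / 2 if ctx else vectors[target]

def rank_by_context(target, context_words, candidates):
    t_vec = contextualized(target, context_words)
    return sorted(candidates, key=lambda c: cosine(vectors[c], t_vec), reverse=True)

context = ["he", "was", "and", "independent", "and", "proud"]
print(rank_by_context("bright", context, ["intelligent", "luminous"]))
```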
Erk and Padó [23] introduced a structured vector space model and abandoned having a single vector for representing word meaning. Instead, they propose having multiple vector representations of a single word, one being a traditional co-occurrence vector, and the others encoding selectional preferences for one particular relation that the word supports. The contextualized word meaning representation is then calculated by combining the basic word representation with the inverse selectional preference vectors of the other words in the context. In their work, Thater et al. [45] tackled the task in a similar manner, but used second-order co-occurrence vectors to improve the performance.

Reisinger and Mooney [44] and Erk and Padó [22] considered an exemplar-based (also called multi-prototype) approach, in which each word is represented by a set of similar contexts (exemplars) in which it occurs.

There has also been a relatively new strand of research on word meaning in context that is closely intertwined with lexical substitution. The so-called substitute vectors (or paradigmatic representations [48]) aim to represent the context of a target word with the potential substitutes (fillers) that could fit in its place. Every word in the vector is weighted by how good a fit it is in that context. This context representation is opposite to the traditional first-order word context representations based on the neighbouring words of a target word. Melamud et al. [46] follow that approach. First, they construct a substitute vector, using a smoothed 5-gram language model to measure substitute appropriateness. The second step consists of computing a weighted average of the substitute vectors of all the contexts in which the target word occurs (weighting is done according to substitute vector similarity).

A number of approaches focus on directly computing the felicity of a substitute candidate, instead of generating word-meaning-in-context representations. Melamud et al. [27] propose a very simple approach to substitute candidate ranking based on the popular skip-gram word embedding model [49, 50]. What makes their approach interesting is that they use the context embeddings internally generated by the model, something that is usually ignored. They estimate the suitability of a substitute candidate for a target word in context as a combination of two types of similarities: one between the substitute candidate and the target word type, and the other between the substitute candidate and the other words in the context. They experiment with four different weighting schemes and report the results.
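As an illustration, the sketch below implements an additive target-plus-context similarity in the spirit of Melamud et al.'s [27] measures (only one of their weighting schemes, with random stand-in embeddings), where context words are scored against the model's context embeddings rather than its word embeddings.

```python
# Additive candidate scoring: similarity to the target word combined with
# similarities to the context words (the latter via context embeddings).
import numpy as np

rng = np.random.default_rng(1)
word_vecs = {w: rng.normal(size=50) for w in ["bright", "intelligent", "luminous"]}
ctx_vecs = {w: rng.normal(size=50) for w in ["independent", "proud"]}

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def add_score(candidate, target, context):
    ctx = [c for c in context if c in ctx_vecs]
    target_sim = cos(word_vecs[candidate], word_vecs[target])
    context_sim = sum(cos(word_vecs[candidate], ctx_vecs[c]) for c in ctx)
    return (target_sim + context_sim) / (len(ctx) + 1)

for cand in ["intelligent", "luminous"]:
    print(cand, round(add_score(cand, "bright", ["independent", "proud"]), 3))
```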
Other methods. Besides the prominent approaches mentioned above, there are a lot of simpler models that use interesting methodologies and features worth mentioning.

Sinha and Mihalcea [7] define simple count-based models for substitute candidate ranking. First, they generate all the inflections of a substitute candidate. Next, they replace the target word in a sentence with each of the generated inflections and obtain large-corpus counts of all the n-grams (with n and other modifications depending on the model) that contain the candidate. To obtain the final score of a substitute candidate, they sum up the counts, and all the candidates are then ranked by their score. What is interesting about their approach is that they also define a supervised model that predicts which combination of lexical resources and counting models works best.
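A simplified sketch of such count-based scoring; ngram_count is a placeholder for a lookup into a large n-gram collection such as Web 1T, and inflection generation is omitted.

```python
# Count-based candidate scoring: substitute the candidate into the sentence,
# collect the n-grams around the substitution point, and sum their counts.
def candidate_score(tokens, target_index, candidate, ngram_count, n=3):
    substituted = tokens[:target_index] + [candidate] + tokens[target_index + 1:]
    score = 0
    for start in range(max(0, target_index - n + 1), target_index + 1):
        ngram = substituted[start:start + n]
        if len(ngram) == n:
            score += ngram_count(tuple(ngram))
    return score

# Toy usage with a tiny hand-made count table.
counts = {("was", "smart", "and"): 120, ("was", "luminous", "and"): 2}
lookup = lambda ng: counts.get(ng, 0)
tokens = "he was bright and independent".split()
for cand in ["smart", "luminous"]:
    print(cand, candidate_score(tokens, 2, cand, lookup))
```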
Giuliano et al. [19] devised two approaches. For the first, they used Latent Semantic Analysis (LSA), trained on the British National Corpus (BNC), to come up with semantic representations of the context and of each substitute candidate; they then rank the substitutes by the similarity between their representation and the context representation. For the second approach, they replace the target word in a context with each of the substitute candidates and extract all the context n-grams (n being 2–5) containing it. A substitute score is then obtained as the sum of frequencies of the extracted n-grams in the large Web 1T corpus [39].

Hassan et al. [20] use six different models for candidate ranking and obtain the final substitute candidate score as a sum of weighted reciprocal ranks of all the models, with the weights fit using a genetic algorithm. The models they use comprise simple resource-based models, corpus n-gram frequency models, a language model, an LSA-similarity-based model, and a model based on machine translation. The last model offers an interesting approach – first, a sentence is translated back and forth between English and French; then they search the translation for any of the substitute candidates (or their inflections) and score them accordingly.

VI. CONCLUSION

Coping with polysemous words in NLP applications is both an interesting and a challenging matter. Traditional ways of dealing with word senses, such as word sense disambiguation, are impractical due to their requirement of having a fixed sense inventory. In this paper we gave a survey of a somewhat novel task that does not pose this requirement – the lexical substitution task. We explained the task and discussed some similarities with other NLP tasks. We also briefly presented the available datasets for English lexical substitution. Moreover, we explained the possible approaches to evaluating lexical substitution systems. As the task comprises two subtasks, substitute generation and substitute ranking, we outlined prominent approaches to both of them, showing that unsupervised ones are more popular.

Considering that lexical substitutes give us somewhat new information about word meaning in context, without making the usual first-order information obsolete, we think that lexical substitution has a lot of potential in many NLP tasks. As it is really simple (and natural) to collect synonyms or near-synonyms of words through lexical substitution, we hypothesize that it might be a great tool to facilitate efficient word sense induction. A possible avenue of research following this idea would be to cluster the substitutes – by doing this across words, we would acquire synsets, and by doing this across instances of a single word, we would come up with the word senses themselves, along the lines of Biemann [51]. However, what makes our approach different is that we aim to construct large-scale inventories (and not only for nouns), enriched continuously over time by a crowdsourced effort. We argue that such an approach is feasible, as the task itself is extremely easy for the annotators (which might not be the case for WSD setups with too fine-grained sense inventories), making it perfectly suitable for use with crowdsourcing.

REFERENCES

[1] R. Mitkov, The Oxford Handbook of Computational Linguistics, ser. Oxford Handbooks in Linguistics. Oxford University Press, USA, 2003.
[2] R. Navigli, "Word sense disambiguation: A survey," ACM Computing Surveys (CSUR), vol. 41, no. 2, p. 10, 2009.
[3] C. Fellbaum, WordNet. Wiley Online Library, 1998.
[4] P. Edmonds and A. Kilgarriff, "Introduction to the special issue on evaluating word sense disambiguation systems," Natural Language Engineering, vol. 8, no. 04, pp. 279–291, 2002.
[5] B. Snyder and M. Palmer, "The English all-words task," in Proceedings of Senseval-3, 2004, pp. 41–43.
[6] K. Erk and D. McCarthy, "Graded word sense assignment," in Proceedings of EMNLP 2009, vol. 1. ACL, 2009, pp. 440–449.
[7] R. Sinha and R. Mihalcea, "Combining lexical resources for contextual synonym expansion," in International Conference Recent Advances in Natural Language Processing, RANLP, 2009, pp. 404–410.
[8] D. McCarthy and R. Navigli, "Semeval-2007 task 10: English lexical substitution task," in Proceedings of the 4th International Workshop on Semantic Evaluations. ACL, 2007, pp. 48–53.
[9] D. McCarthy, “Lexical substitution as a task for wsd
evaluation,” in Proceedings of the ACL-02 workshop on
Word sense disambiguation: recent successes and future
directions-Volume 8.
Association for Computational
Linguistics, 2002, pp. 109–115.
[10] L. Specia, S. K. Jauhar, and R. Mihalcea, “Semeval-2012
task 1: English lexical simplification,” in Proceedings of
the Sixth International Workshop on Semantic Evaluation, vol. 2. Association for Computational Linguistics,
2012, pp. 347–355.
[11] R. Snow, B. O’Connor, D. Jurafsky, and A. Y. Ng,
“Cheap and fast—but is it good?: evaluating non-expert
annotations for natural language tasks,” in Proceedings of
the conference on empirical methods in natural language
processing. Association for Computational Linguistics,
2008, pp. 254–263.
[12] M. Lease, “On quality control and machine learning in
crowdsourcing.” Human Computation, vol. 11, p. 11,
2011.
[13] D. Jurgens, “Embracing ambiguity: A comparison of
annotation methodologies for crowdsourcing word sense
labels.” in HLT-NAACL, 2013, pp. 556–562.
[14] A. Wang, C. D. V. Hoang, and M.-Y. Kan, “Perspectives on crowdsourcing annotations for natural language
processing,” Language resources and evaluation, vol. 47,
no. 1, pp. 9–31, 2013.
[15] C. Biemann, “Turk Bootstrap Word Sense Inventory 2.0:
A Large-Scale Resource for Lexical Substitution.” in
LREC, 2012, pp. 4038–4042.
[16] G. Kremer, K. Erk, S. Padó, and S. Thater, "What Substitutes Tell Us - Analysis of an 'All-Words' Lexical Substitution Corpus," in EACL, 2014, pp. 540–549.
[17] S. Jabbari, M. Hepple, and L. Guthrie, “Evaluation metrics for the lexical substitution task,” in Human Language
Technologies: The 2010 Annual Conference of the North
American Chapter of the Association for Computational
Linguistics. Association for Computational Linguistics,
2010, pp. 289–292.
[18] K. Kishida, Property of average precision and its generalization: An examination of evaluation indicator for
information retrieval experiments. National Institute of
Informatics Tokyo, Japan, 2005, vol. 2005.
[19] C. Giuliano, A. Gliozzo, and C. Strapparava, “FBK-irst:
Lexical substitution task exploiting domain and syntagmatic coherence,” in Proceedings of the 4th International
Workshop on Semantic Evaluations.
Association for
Computational Linguistics, 2007, pp. 145–148.
[20] S. Hassan, A. Csomai, C. Banea, R. Sinha, and R. Mihalcea, “UNT: SubFinder: Combining knowledge sources
for automatic lexical substitution,” in Proceedings of the
4th International Workshop on Semantic Evaluations.
ACL, 2007, pp. 410–413.
[21] D. Yuret, “KU: Word sense disambiguation by substitution,” in Proceedings of the 4th International Workshop
on Semantic Evaluations. Association for Computational
Linguistics, 2007, pp. 207–213.
[22] K. Erk and S. Padó, “Exemplar-based models for word
meaning in context,” in Proceedings of the acl 2010
conference short papers. Association for Computational
Linguistics, 2010, pp. 92–97.
[23] ——, “A structured vector space model for word meaning
in context,” in Proceedings of the Conference on Empir-
ical Methods in Natural Language Processing. Association for Computational Linguistics, 2008, pp. 897–906.
[24] S. Thater, G. Dinu, and M. Pinkal, "Ranking paraphrases in context," in Proceedings of the 2009 Workshop on Applied Textual Inference. Association for Computational Linguistics, 2009, pp. 44–47.
[25] S. Thater, H. Fürstenau, and M. Pinkal, "Word meaning in context: A simple and effective vector model," in IJCNLP, 2011, pp. 1134–1143.
[26] G. Dinu, S. Thater, and S. Laue, "A comparison of models of word meaning in context," in Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2012, pp. 611–615.
[27] O. Melamud, O. Levy, I. Dagan, and I. Ramat-Gan, "A simple word embedding model for lexical substitution," in Proceedings of NAACL-HLT, 2015, pp. 1–7.
[28] G. A. Miller, "WordNet: a lexical database for English," Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
[29] T. Hawker, "USYD: WSD and lexical substitution using the Web1T corpus," in Proceedings of the 4th International Workshop on Semantic Evaluations. Association for Computational Linguistics, 2007, pp. 446–453.
[30] D. Martinez, S. N. Kim, and T. Baldwin, "MELB-MKB: Lexical substitution system based on relatives in context," in Proceedings of the 4th International Workshop on Semantic Evaluations. Association for Computational Linguistics, 2007, pp. 237–240.
[31] S. Zhao, L. Zhao, Y. Zhang, T. Liu, and S. Li, "HIT: Web based scoring method for English lexical substitution," in Proceedings of the 4th International Workshop on Semantic Evaluations. Association for Computational Linguistics, 2007, pp. 173–176.
[32] G. Szarvas, C. Biemann, I. Gurevych et al., "Supervised all-words lexical substitution using delexicalized features," in HLT-NAACL, 2013, pp. 1131–1141.
[33] O. Etzioni, K. Reiter, S. Soderland, M. Sammer, and T. Center, "Lexical translation with application to image search on the Web," Machine Translation Summit XI, 2007.
[34] C. A. Lindberg, Oxford American Writer's Thesaurus. Oxford University Press, USA, 2012.
[35] B. A. Kipfer, "Roget's New Millennium Thesaurus, 1/e," 2007.
[36] G. Dahl, A.-M. Frassica, and R. Wicentowski, "SW-AG: Local context matching for English lexical substitution," in Proceedings of the 4th International Workshop on Semantic Evaluations. ACL, 2007, pp. 304–307.
[37] J. R. L.-B. Bernard, The Macquarie Thesaurus: The Book of Words. Macquarie Library, 1986.
[38] S. Mohammad, G. Hirst, and P. Resnik, "TOR, TORMD: Distributional profiles of concepts for unsupervised word sense disambiguation," in Proceedings of the 4th International Workshop on Semantic Evaluations. Association for Computational Linguistics, 2007, pp. 326–333.
[39] T. Brants and A. Franz, "Web 1T 5-gram corpus version 1.1," Tech. Rep., 2006.
[40] D. Lin, "An information-theoretic definition of similarity," in ICML, vol. 98, 1998, pp. 296–304.
[41] G. Szarvas, R. Busa-Fekete, and E. Hüllermeier, "Learning to rank lexical substitutions," in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 2013, pp. 1926–1932.
[42] K. Erk, "Vector space models of word meaning and phrase meaning: A survey," Language and Linguistics Compass, vol. 6, no. 10, pp. 635–653, 2012.
[43] K. Erk, D. McCarthy, and N. Gaylord, "Measuring word meaning in context," Computational Linguistics, vol. 39, no. 3, pp. 511–554, 2013.
[44] J. Reisinger and R. J. Mooney, "Multi-prototype vector-space models of word meaning," in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2010, pp. 109–117.
[45] S. Thater, H. Fürstenau, and M. Pinkal, "Contextualizing semantic representations using syntactically enriched vector models," in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2010, pp. 948–957.
[46] O. Melamud, I. Dagan, and J. Goldberger, "Modeling word meaning in context with substitute vectors," in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics, Denver, USA, 2015.
[47] G. Dinu and M. Lapata, "Measuring distributional similarity in context," in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2010, pp. 1162–1172.
[48] M. A. Yatbaz, E. Sert, and D. Yuret, "Learning syntactic categories using paradigmatic representations of word context," in Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 2012, pp. 940–951.
[49] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.
[50] O. Levy and Y. Goldberg, "Dependency-based word embeddings," in Proceedings of ACL 2014, 2014, pp. 302–308.
[51] C. Biemann, "Creating a system for lexical substitutions from scratch using crowdsourcing," Language Resources and Evaluation, vol. 47, no. 1, pp. 97–122, 2013.