Modeling Word Senses Using Lexical Substitution

Domagoj Alagić
Text Analysis and Knowledge Engineering Lab
Faculty of Electrical Engineering and Computing, University of Zagreb
Unska 3, 10000 Zagreb
Email: [email protected]

Abstract — Modeling word meaning is beneficial in various natural language processing (NLP) applications, such as information retrieval, machine translation, and automatic text simplification, among others. The central aspect of word meaning is its senses – the meanings a word can bear. A prototypical NLP task that tackles this problem is word sense disambiguation (WSD). This task comprises assigning the right sense to a word in context, and therefore it requires a predefined sense inventory. As such inventories are often task-specific and of inappropriate granularity, new tasks that do not have this requirement have been introduced. One of them is lexical substitution, the task of finding a meaning-preserving substitute for a polysemous word in context. By asking a system (or annotators) to come up with synonyms for a word in context, instead of asking them to pick the right sense from a fixed sense inventory, we indirectly elicit word senses in a more natural way. This paper describes the task in detail: it outlines the available datasets for English, explains the evaluation of lexical substitution systems, and presents the work done on this topic, outlining the most prominent supervised and unsupervised machine learning approaches.

Keywords — natural language processing, machine learning, lexical semantics, computational semantics, polysemy, word sense, lexical substitution.
I. INTRODUCTION

Understanding natural language and leveraging the knowledge contained within it proves to be an extremely difficult, yet amazingly rewarding problem in the field of artificial intelligence. With the increasing popularity of the internet in recent years, the amount of freely available unstructured text data is getting larger and larger, evidencing the need for intelligent text analytics. It therefore comes as no surprise that the research field of natural language processing (NLP) has become quite prominent.

Understanding natural language most of the time boils down to understanding single words and generalizing those meanings to bigger text fragments, building on the principle of semantic composition. In general terms, this is exactly what lexical semantics, a sub-discipline of linguistics, is concerned with – modeling the way in which the meanings of lexical items contribute to the meanings of the phrases and sentences in which they appear [1]. Correspondingly, in natural language processing (as well as in the related field of computational linguistics), this study falls within the purview of computational lexical semantics. The backbone of this discipline is often machine learning, which allows word representations to be learnt automatically from corpus statistics.

One of the most prototypical and longest-standing NLP applications concerned with word senses is word sense disambiguation (WSD), the task of computationally determining the meaning of a word in its context [2]. To illustrate, consider the following two sentences, both containing the noun plane:

1) The flight was delayed due to trouble with the plane.
2) Any line joining two points on a plane lies wholly on that plane.

The noun plane is a polysemous word, as it holds multiple meanings (i.e., senses) depending on the context in which it is found. In the first sentence it bears the meaning of (1) an aircraft that has a fixed wing and is powered by propellers or jets, whereas in the second sentence it denotes (2) an unbounded two-dimensional shape. Given a list of possible senses, a WSD system is required to predict, in most cases, the best-fitting sense for a given context. The example sentences above, as well as their senses, are taken from the English WordNet [3], a large manually constructed lexical resource.

Even though WSD has proven beneficial in various NLP applications, it still comes with its own fair share of problems. Perhaps the most prominent one is that it relies on a fixed set of senses for each word, the so-called sense inventory. Even though sense inventories are usually compiled manually by experts (e.g., linguists), they are, more often than not, of inappropriate granularity, something that WordNet is often criticized for [4, 5]. This stems from the fact that different applications often require different levels of granularity, making the evaluation of WSD across applications difficult. Even though there have been various attempts to amend for inappropriate sense inventories, for instance a multi-label and graded sense annotation scheme [6], they remain the main issue of WSD systems.

The task of lexical substitution, i.e., replacing a target word in context with a suitable substitute, often a synonym or a near-synonym, was proposed in the hope of avoiding these problems. The task circumvents the need for a fixed sense inventory by allowing participants (systems or annotators) to indirectly elicit word meanings (and their synonyms) by providing appropriate substitutes for words in context. This paper gives an overview of the lexical substitution task, outlining the available datasets for English, explaining the evaluation of lexical substitution systems, and presenting the work done on this topic.

The rest of the paper is organized as follows. Section II describes the lexical substitution task, while Section III introduces the available datasets. Section IV outlines the evaluation metrics commonly used to measure the performance of lexical substitution systems, as well as the inter-annotator agreement between the annotators constructing lexical substitution datasets. Section V presents and categorizes the different models used for this task. Lastly, Section VI concludes the paper.

II. TASK

The task of lexical substitution, sometimes called contextual synonym expansion [7], was proposed by McCarthy and Navigli [8] (following the earlier ideas of McCarthy [9]) at the Semantic Evaluations 2007 (SEMEVAL-2007) workshop.
The task involves finding a meaning-preserving substitute for a polysemous word (the so-called target word) in a context, all while preserving the grammaticality of the sentence. To illustrate, consider the following example (instance bright.a 1 from the LEXSUB dataset, cf. Section III) [8]:

He was bright and independent and proud.

Given this sentence, suitable substitutes for the word bright may be intelligent, smart, or clever, among others. In contrast, substitutes like luminous and blue represent a different sense of the word or make no sense, respectively. Additionally, know-it-all is not a viable choice, as it does not fit into the sentence (part-of-speech mismatch), even though it bears roughly the same meaning. Sometimes it is allowed to use hypernyms or multiword expressions (MWEs) as well.

Technically, solving the lexical substitution task comes in two steps:

1) finding the set of candidate substitutes for a given word;
2) finding the best candidate(s), given the word's context.

The first step, also called candidate synonym collection [7], includes collecting all possible candidate substitutes from lexical resources or corpora; a minimal sketch of this step is given below. Note that this step does not account for the compatibility of substitute candidates with a given context. After all the candidates are obtained, the next step, also called context-based fitness scoring [7], selects the best substitute candidates. This step is usually tackled as a ranking problem, which results in a list of candidate substitutes ranked by their compatibility with the given context.
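Neither the task definition nor this paper prescribes a particular implementation of candidate collection, but the step can be made concrete with a small sketch. The snippet below assumes NLTK and its WordNet data are available (nltk.download("wordnet")); the function name and the choice to also include hypernyms are illustrative only, loosely mirroring the resource-based strategies surveyed in Section V.

```python
# Minimal sketch of step 1 (candidate collection) using WordNet via NLTK.
# Assumes the WordNet data has been downloaded; names are illustrative.
from nltk.corpus import wordnet as wn

def candidate_substitutes(lemma, pos):
    """Collect lemmas from all synsets of the target word (plus their hypernyms)."""
    candidates = set()
    for synset in wn.synsets(lemma, pos=pos):
        for name in synset.lemma_names():
            candidates.add(name.replace("_", " "))
        # Some systems widen the pool with hypernyms; adjectives simply have none.
        for hyper in synset.hypernyms():
            candidates.update(n.replace("_", " ") for n in hyper.lemma_names())
    candidates.discard(lemma)          # the target itself is not a substitute
    return sorted(candidates)

print(candidate_substitutes("bright", wn.ADJ))
```

Note that, exactly as stated above, nothing in this step looks at the sentence: the output is a context-independent pool that the ranking step must filter.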
As mentioned in the introduction, the lexical substitution task is interesting because it avoids some of the problems of traditional WSD setups. Most importantly, the task does not presume any predefined sense inventory, but instead models word meaning through the provided substitutes. Another point is that lexical substitution is arguably a more natural task, and often easier for the annotators, as they do not have to explicitly select the most appropriate predefined meaning of a word (devised by somebody else) in context, but only have to come up with its synonyms or near-synonyms.

Similarities with other tasks. Lexical substitution shares similarities with a number of other NLP tasks. Perhaps the most similar task is query expansion from information retrieval (IR), where, given an input query, a system needs to come up with additional terms similar to the ones in the original query. Another related task is lexical simplification [10], where a word in context needs to be replaced with a suitable but cognitively simpler synonym; this amounts to adding a simplicity constraint to the ranking step of lexical substitution. On the other hand, if the candidate generation step is changed to search for synonyms in another language, one ends up with lexical (word-for-word) machine translation.

The following sections give an overview of the approaches for tackling both the substitute generation and substitute ranking subtasks, their evaluation, the available datasets, and the proposed systems.

III. DATASETS

To build lexical substitution systems, we require data that a system can either learn from or be evaluated on. Unfortunately, datasets are almost always costly and time-consuming to produce, as they require significant human effort to ensure their quality and reliability, something that cannot be achieved automatically. In recent years, crowdsourcing [11] has started to gain popularity. Crowdsourcing is based on the idea that it is possible to obtain reliable data by soliciting contributions from a large group of people (usually online). Nonetheless, using crowdsourcing as a means of compiling a dataset has its downsides as well – even though it is easy and cheap to obtain data, some effort needs to be put into assuring that the collected data is of sufficient quality, as the online users annotating the data may not always be well-intentioned [12]. Many researchers in NLP [13, 11, 14] have recognized crowdsourcing as a viable strategy and started using one of the many popular platforms, such as Amazon Mechanical Turk (https://www.mturk.com/mturk/welcome) and CrowdFlower (https://www.crowdflower.com/), among others.

Types of lexical substitution datasets. Lexical substitution datasets come in two types: lexical sample and all-words. Whereas a lexical sample dataset provides data for a set of target words selected in some manner (most usually hand-picked), all-words datasets consider all the content words (usually nouns, adjectives, verbs, and adverbs) in a running text. Both approaches have their merits and downsides, in terms of both building the dataset and using it. The lexical sample approach makes dataset compilation a bit more tedious (owing to the need to pick the words), but at the same time it offers a more controlled experimental environment, as words with specific attributes can be selected. On the other hand, all-words datasets are more realistic and easier to prepare; using all the content words from a running text also facilitates model generalization, as a model is expected to encounter many words with highly varying attributes, not only carefully selected ones.

Unfortunately, only a handful of datasets for the task of lexical substitution are available, and mostly only for English. This section briefly describes the available datasets. Dataset statistics are reported in Table I (CS denotes whether the dataset was compiled using crowdsourcing).

TABLE I. DATASET STATISTICS

  Dataset  | Type           | CS  | POS cover  | #target words | #instances
  LEXSUB   | lexical sample | no  | N, A, V, R | 201           | 2,002
  TBWSI-2  | lexical sample | yes | N          | 1,012         | 24,647
  COINCO   | all-words      | yes | N, A, V, R | 15,629        | 15,629

LEXSUB dataset. This lexical sample dataset for English was originally introduced along with the task itself at SEMEVAL-2007 [8]. The sample comprises 201 words – nouns, verbs, adjectives, and adverbs – selected both manually and automatically. For each sampled word, the authors sampled ten sentences in which it occurs, amounting to 2,002 sentences in total (eight of them are not annotated, possibly by mistake). Five native-speaker annotators were asked to provide substitutes for the target word in each of the sentences. Each substitute is assigned a frequency that denotes how many annotators suggested it for a given sentence.
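As a small illustration of how such annotated instances can be held in memory (this is only a sketch and does not reproduce the official distribution format of any of the datasets), one can pair each sentence and target with a counter of gold substitutes:

```python
# Minimal in-memory representation of one annotated instance; the gold
# substitutes map to the number of annotators who proposed them.
# Field names are illustrative, not a prescribed format.
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Instance:
    target: str                                       # lemma and POS, e.g. "bright.a"
    sentence: str                                     # the context sentence
    gold: Counter = field(default_factory=Counter)    # substitute -> #annotators

ex = Instance(
    target="bright.a",
    sentence="He was bright and independent and proud.",
    gold=Counter({"intelligent": 3, "clever": 3, "smart": 1}),
)
print(ex.gold.most_common())
```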
TBWSI-2 dataset. This is another, albeit larger, lexical sample dataset for English, presented in [15]. It comprises 24,647 sentences for a total of 1,012 frequent target nouns. Substitutes were collected using an iterative three-step crowdsourcing procedure. In contrast to the LEXSUB dataset, substitutes are grouped per word sense, as the main goal of this work was to create a crowdsourced sense inventory using lexical substitution as the backbone of the approach.

COINCO dataset. In contrast to both the LEXSUB and TBWSI-2 datasets, this is an all-words dataset for English, compiled by Kremer et al. [16]. It comprises 15,629 target words across 15,629 sentences (7,117 nouns, 4,617 verbs, 2,470 adjectives, and 1,425 adverbs). Such a large number of sentences and target words required crowdsourcing-based annotation.

IV. EVALUATION

Proper evaluation plays a key role in every NLP application, and lexical substitution is no exception. It therefore comes as no surprise that a lot of attention was given to devising a robust evaluation scheme for lexical substitution at SEMEVAL-2007 [8]. Given an instance (a sentence and a target word), the evaluation of lexical substitution boils down to comparing the (more precisely, ranked) substitute sets obtained from human annotators (i.e., the gold-standard substitutes) to those of a system. The final performance score is then obtained by aggregating per-instance scores.

Evaluating system-provided substitutes against the gold-standard ones can be done in a few ways. Even though it is most typically done using the original evaluation metrics proposed at SEMEVAL-2007, a somewhat more recent attempt to improve these metrics was made by Jabbari et al. [17]. Lastly, it is also possible to use general information retrieval (IR) metrics.

Before delving into a more detailed introduction of these metrics, we first formally describe the data. Let T = {t_1, t_2, ..., t_i, ..., t_N} denote a set of N instances for which there are at least two substitutes given by the annotators. Also, we define A = {a_1, a_2, ..., a_i, ..., a_M} as the set of those instances from T (A ⊆ T) for which the system has predicted at least one substitute. We denote the ranked set of system-predicted substitutes for instance t_i as S_i. For each instance t_i, we presume a multiset union of annotator (the so-called gold-standard) substitutes H_i (as each substitute could have been proposed by multiple annotators). To make this clearer, consider the example instance from Section II to be t_1. For this instance, the annotators provided both the substitutes intelligent and clever three times, and the substitute smart once. Therefore, H_1 = {intelligent : 3, clever : 3, smart : 1}.

Moreover, let count_i(s) denote a function that returns the number of occurrences of the substitute s in H_i. We also define maxcount_i as the maximum number of occurrences of a single substitute in H_i. For simplicity, we use #P to denote the multiset cardinality of a substitute set P according to the gold standard H_i, as shown in (1):

    #P = \sum_{s \in P} count_i(s)                                    (1)

Additionally, as H_i is not a set but a multiset, we presume that #H_i simply returns the total number of annotator-provided substitutes for t_i (possible duplicates included). To exemplify, if we presume our system provided the substitute set S_1 = {intelligent, smart, brilliant} for our running example, count_1(smart) returns 1, whereas count_1(brilliant) returns 0, as the word brilliant is not in the gold-standard set. Also, #S_1 = 3 + 1 + 0 = 4 and maxcount_1 = 3.

Some metrics in [8] also require the notion of a mode substitute m_i – a unique most-frequent annotator substitute, which may not exist for all instances. In line with that, by filtering out the instances whose H_i does not contain a mode, we construct the mode variants of T and A and denote them T^m and A^m, respectively. In addition, we define bg_i to be the system's best guess (the substitute provided first in the substitute list) on instance t_i. In our running example there is no mode, as both intelligent and clever are proposed three times. On the other hand, the system's best guess bg_1 is intelligent, as it is provided first in S_1.
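A minimal sketch of this bookkeeping in Python is given below; it mirrors the definitions above (count_i, maxcount_i, #P from Eq. (1), and the mode) on the running example. The function names are illustrative.

```python
# Per-instance bookkeeping used by the evaluation metrics: count_i, maxcount_i,
# the multiset cardinality #P of a substitute list, and the mode substitute
# (which exists only if a single substitute is most frequent).
from collections import Counter

H1 = Counter({"intelligent": 3, "clever": 3, "smart": 1})  # gold multiset H_1

def count(H, s):
    return H.get(s, 0)

def maxcount(H):
    return max(H.values())

def multiset_cardinality(H, P):
    """#P from Eq. (1): sum of gold frequencies of the substitutes in P."""
    return sum(count(H, s) for s in P)

def mode(H):
    """Unique most frequent gold substitute, or None if it is not unique."""
    top = maxcount(H)
    best = [s for s, c in H.items() if c == top]
    return best[0] if len(best) == 1 else None

S1 = ["intelligent", "smart", "brilliant"]   # ranked system substitutes
print(multiset_cardinality(H1, S1))          # 3 + 1 + 0 = 4
print(maxcount(H1))                          # 3
print(mode(H1))                              # None: intelligent and clever tie
```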
A. LEXSUB-based evaluation

McCarthy and Navigli [8] introduce two separate subtasks on which lexical substitution systems should be evaluated – BEST and OOT (out-of-ten).

BEST. The system is free to provide as many substitutes as it sees fit, but the score for each correct one is divided by the total number of provided substitutes. The first substitute in the list is considered to be the system's best substitute. As providing multiple substitutes hurts the overall score, the system cannot simply provide as many substitutes as possible to increase its score, but it can still provide a couple of them if it is not certain which response is best. The authors also introduce mode variants of precision and recall, Mode P and Mode R, which evaluate how successful the system is at guessing the single best substitute (i.e., the gold-standard mode). Instances are scored according to (2):

    best(i) = #S_i / (#H_i · |S_i|),    best-mode(i) = 1 if bg_i = m_i, 0 otherwise    (2)

OOT. The system may provide up to ten substitutes (i.e., |S_i| ≤ 10) without being penalized for providing more than one substitute. The scoring of an individual instance is therefore quite similar to that of the BEST subtask; see (3). Notice that this scoring does not punish the model for wrong predictions either, which implies that it never makes sense for the model to output fewer than the maximum number of substitutes. As a result, this subtask effectively evaluates the coverage of a system.

    oot(i) = #S_i / #H_i,    oot-mode(i) = 1 if m_i ∈ S_i, 0 otherwise    (3)

Overall precision and recall scores for both the BEST and OOT subtasks are obtained by averaging over A and T, respectively. Additionally, the mode variants of these metrics are obtained by averaging over A^m and T^m, respectively.

B. LEXSUB-based evaluation revisited

Relatively recently, Jabbari et al. [17] proposed modifications to the original evaluation metrics (cf. the previous section). The first modification concerns the best metric, which is in most cases well below 1 even for a system that performs optimally on the task (i.e., returns a single correct most-frequent response for the instance). The reason is that the score for instance t_i is divided by #H_i, whose size depends on the number of substitutes proposed by the annotators for that instance. They fix this by dividing the score by maxcount_i instead of #H_i, as shown in (4):

    new-best(i) = #S_i / (maxcount_i · |S_i|)                          (4)

Additionally, they find that best-mode has some downsides as well. First, it is too brittle, as the overall score is calculated by averaging only over the instances that contain modes (T^m and A^m). Secondly, it is lossy, as it assigns no score to responses that are suboptimal (i.e., that do not list the correct substitute first). To account for this, they use the metric shown in (5). However, they give up on the precision score and use only recall, obtained by averaging over T (the same as for best and best-mode). The metric still assigns a score of 1 to an instance whose best guess is the correct substitute, but it also partially rewards the instances on which the system failed to predict the most frequent annotator substitute.

    new-best-mode(i) = count_i(bg_i) / maxcount_i                      (5)

The authors also criticize the coverage-based metrics, oot and oot-mode, for not punishing the incorrect substitutes provided by a system. In the original metrics, a system is encouraged to provide up to ten substitutes without worrying about any penalization for multiple (possibly incorrect) substitutes. To account for this, they use the original oot as a per-instance recall metric and provide a new per-instance precision metric; see (6):

    new-R(i) = #S_i / #H_i,    new-P(i) = #S_i / (#S_i + k · |S_i − H_i|)    (6)

The term |S_i − H_i| represents the number of incorrect substitutions, i.e., a penalty term whose effect is modified using a penalty factor k, which might be set to 1 in the simplest case. In the end, the overall score is calculated by averaging these scores across T (the same can be done for the F1 score, the harmonic mean of precision and recall).
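The per-instance scores from Eqs. (2)–(6) are straightforward to implement. The sketch below is an illustration of the formulas above (not an official scorer); it assumes the gold substitutes are stored as a Counter and that the system list S is non-empty.

```python
# Per-instance scores of Eqs. (2)-(6). H is the gold Counter (substitute -> freq),
# S is the ranked list of system substitutes, k is the penalty factor of new-P.
from collections import Counter

def hash_card(H, P):                  # "#P" from Eq. (1)
    return sum(H.get(s, 0) for s in P)

def best(H, S):                       # Eq. (2), left-hand part
    return hash_card(H, S) / (sum(H.values()) * len(S))

def oot(H, S):                        # Eq. (3), assumes |S| <= 10
    return hash_card(H, S) / sum(H.values())

def new_best(H, S):                   # Eq. (4)
    return hash_card(H, S) / (max(H.values()) * len(S))

def new_best_mode(H, S):              # Eq. (5): partial credit for the best guess
    return H.get(S[0], 0) / max(H.values())

def new_precision(H, S, k=1.0):       # Eq. (6), right-hand part
    wrong = sum(1 for s in S if s not in H)
    return hash_card(H, S) / (hash_card(H, S) + k * wrong)

H = Counter({"intelligent": 3, "clever": 3, "smart": 1})
S = ["intelligent", "smart", "brilliant"]
print(best(H, S), oot(H, S), new_best(H, S), new_best_mode(H, S), new_precision(H, S))
```

Averaging these per-instance values over A or T (and over A^m or T^m for the mode variants) yields the corpus-level precision and recall figures described above.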
C. Ranking-based evaluation

As a last way of evaluating lexical substitution, one can always use any of the ranking comparison metrics popular in information retrieval (IR), assuming that the gold-standard substitutes are ranked by how many times they were proposed by the annotators. Perhaps the most popular ones are generalized average precision (GAP), introduced in [18], and precision at rank k (P@k). As P@k is a rather straightforward metric, we only describe GAP here. First, we introduce average precision (AP) in (7):

    AP_i = (1 / |H_i|) · \sum_{j=1}^{|S_i|} x_j p_j,    p_j = (1 / j) · \sum_{k=1}^{j} x_k    (7)

The term x_j is a binary variable indicating whether the j-th item (as ranked by the system) is in the gold-standard set or not. If we generalize x_j to denote the gold-standard weight of the j-th item (or 0 if it is not in the gold-standard set), we can define GAP according to (8):

    GAP_i = (1 / R_i) · \sum_{j=1}^{|S_i|} I(x_j) p_j,    R_i = \sum_{k=1}^{|H_i|} I(y_k) ȳ_k    (8)

We presume that I(x_j) equals 1 if x_j > 0 and 0 otherwise, and that ȳ_k is the average weight of the first k substitutes in the ranked gold-standard set y_1, y_2, ..., y_{|H_i|}. In simple terms, GAP is just a variant of AP that weights the substitutes by their gold-standard frequency. This enables a smoother, more realistic comparison of two ranked substitute sets.

D. Calculating inter-annotator agreement

In order to assess the reliability of a compiled human-annotated dataset, one usually reports inter-annotator agreement (IAA), a measure that tells how much the annotators agree when annotating the dataset. The way of calculating IAA for lexical substitution datasets was proposed at SEMEVAL-2007. We formalize it as follows. First, let C_i denote the set of all annotator pairs that were required to annotate t_i. As each annotator in a pair provided a substitute set for t_i, we use ⟨r_i, r_i'⟩ to denote their responses. We compute pairwise agreement (PA) as shown in (9):

    PA = (1 / |T|) · \sum_{t_i ∈ T} (1 / |C_i|) · \sum_{⟨r_i, r_i'⟩ ∈ C_i} |r_i ∩ r_i'| / |r_i ∪ r_i'|    (9)

The pairwise agreement with the mode, PA^m, is also introduced, according to (10). We denote the set of all substitute sets provided for t_i as R_i:

    PA^m = (1 / |T^m|) · \sum_{t_i ∈ T^m} (1 / |R_i|) · \sum_{r_i ∈ R_i} I(m_i)    (10)

The term I(m_i) is 1 if m_i ∈ r_i, and 0 otherwise.
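The following sketch implements GAP as reconstructed in Eq. (8) and the per-instance pairwise agreement of Eq. (9), under the assumptions stated above (I(x) indicates a positive weight, ȳ_k is the running average of the sorted gold weights). It is illustrative, not the official SEMEVAL scorer.

```python
# GAP (Eq. 8) for one instance and pairwise agreement (Eq. 9, inner part) for
# one instance; corpus-level scores are the means over T (or T^m).
from collections import Counter
from itertools import combinations

def gap(system, gold):
    """system: ranked substitute list; gold: Counter of substitute -> frequency."""
    weights = [gold.get(s, 0) for s in system]          # x_j: gold weight or 0
    num, running = 0.0, 0.0
    for j, x in enumerate(weights, start=1):
        running += x                                     # sum of weights up to rank j
        if x > 0:                                        # I(x_j)
            num += running / j                           # weighted p_j
    denom, acc = 0.0, 0.0
    for k, y in enumerate(sorted(gold.values(), reverse=True), start=1):
        acc += y
        denom += acc / k                                 # I(y_k) * ybar_k (all y_k > 0)
    return num / denom if denom else 0.0

def pairwise_agreement(response_sets):
    """Mean Jaccard overlap over all annotator pairs for one instance."""
    pairs = list(combinations(response_sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

print(gap(["intelligent", "smart", "brilliant"],
          Counter({"intelligent": 3, "clever": 3, "smart": 1})))
print(pairwise_agreement([{"intelligent", "clever"}, {"clever", "smart"}, {"clever"}]))
```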
V. LEXICAL SUBSTITUTION MODELS

Work in lexical substitution either addresses both the generation of plausible substitute candidates and their ranking [19, 20, 21], or focuses only on the ranking step [22, 23, 24, 25, 26, 27]. Even though candidate generation and candidate ranking are both integral parts of an end-to-end lexical substitution system that can generalize to unseen targets, most work subsequent to the original SEMEVAL-2007 task has focused almost exclusively on candidate ranking. The motivation behind this mostly lies in the easier comparison of ranking models, as the ranking step does not need to be concerned with the generation of possible substitute candidates.

A. Generating substitute candidates

There has been an almost exclusive focus on obtaining the set of substitute candidates from lexical resources. As many available lexical resources (e.g., thesauri, wordnets) contain information about synonymy relations between words, this approach turns out to be rather straightforward. One such resource is the English WordNet [28], extremely popular in the literature [7, 19, 20, 21, 29, 30, 31]. The simplest approach is to fetch all the words from the target's synsets, but some also use words from synsets accessible by following a single link [21] or via the similar to, entailment, and also see relations [32], or decide to include the hypernyms of the target word [20]. Some researchers [7] also use TransGraph [33], a large multilingual graph-based resource from which synonyms can be deduced as well. Expectedly, various thesauri, such as the Oxford American Writer's Thesaurus (OAWT) [34] (e.g., in [19]), Roget's Thesaurus [35] (e.g., in [7, 21, 36]), the Macquarie Thesaurus [37] (e.g., in [29, 38]), and the Microsoft Encarta Thesaurus [7, 20], are often used as well.

Corpus-based approaches to extracting substitute candidates do exist, but they are quite rare. Hawker [29] used Web 1T [39], a dataset of n-gram counts obtained from a very large corpus, along with WordNet and the Macquarie Thesaurus, to come up with potential candidates. Moreover, some works [7] derive substitute candidates from a corpus using Lin's distributional similarity [40] to identify similar words.

B. Substitute candidate ranking

When experimenting with models for substitute candidate ranking, one must have a dataset available that provides the gold-standard substitute candidates to be ranked by a system (instead of the system generating its own). Most researchers obtain this by pooling the annotated gold-standard substitutes from all the instances of each target type (lemma and POS tag). Additionally, MWE substitutes are usually discarded from the gold-standard substitute sets (as are the instances left with no substitutes after this step). This section will briefly outline prominent supervised and unsupervised models, as well as some honorable mentions.

Supervised methods. Surprisingly, only a couple of papers deal with supervised substitute candidate ranking. Szarvas et al. [32] abandoned the need for a separate classifier for each word in a training set and instead created a classifier that can handle any word, as long as a list of its synonyms is available. They accomplish this by using delexicalized (i.e., non-lexical) features, which have the same semantics regardless of the target word, context, or substitute candidate. For their experiments, they use a Maximum Entropy (MaxEnt) binary classifier that predicts whether a given substitute candidate is appropriate in a given context or not. The final candidate ranking is then naturally obtained by sorting the candidates by their posterior probability of being an appropriate substitute. They use a wide range of features, ranging from local n-gram frequencies obtained from web data and shallow syntactic features (e.g., part-of-speech patterns) to hypernymy information and the number of senses in WordNet. To make sure they are actually evaluating their model on unseen words, they split the dataset at the word level, i.e., some words (and all their instances) are part of the training set, whereas others are part of the test set. They evaluate on the LEXSUB and TBWSI-2 datasets, mostly ignoring the original evaluation metrics and sticking with GAP and P@1.

Szarvas et al. [41] upgrade their previous work by experimenting with different learning-to-rank models, such as EXPENS, RANKBOOST, RANKSVM, and LAMBDAMART. They repeat the evaluation on LEXSUB and TBWSI-2, showing that LAMBDAMART outperforms the other methods. They also compare their models to previous work on LEXSUB, showing that some of their models significantly outperform [32], as well as some other models from the literature.
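As a rough illustration of this supervised setup (and not the actual system or feature set of Szarvas et al.), the sketch below trains a binary logistic regression classifier – the standard MaxEnt formulation – on two made-up delexicalized features and ranks candidates by the predicted probability of being an appropriate substitute. All feature values and candidate names are placeholders.

```python
# Sketch: rank substitute candidates with a binary classifier over
# delexicalized (word-independent) features. Features here are placeholders,
# e.g. [relative n-gram frequency with the candidate substituted in,
#       number of WordNet senses of the candidate]; label 1 = accepted.
import numpy as np
from sklearn.linear_model import LogisticRegression   # binary MaxEnt

X_train = np.array([[0.9, 2], [0.7, 5], [0.1, 8], [0.2, 1]], dtype=float)
y_train = np.array([1, 1, 0, 0])

clf = LogisticRegression().fit(X_train, y_train)

candidates = ["intelligent", "smart", "luminous"]
X_test = np.array([[0.8, 3], [0.6, 4], [0.05, 2]], dtype=float)
scores = clf.predict_proba(X_test)[:, 1]               # P(appropriate | features)
ranking = [c for _, c in sorted(zip(scores, candidates), reverse=True)]
print(ranking)
```

Because the features carry no lexical identity, the same trained model can, in principle, be applied to target words never seen in training, which is exactly the point of the delexicalized design described above.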
Unsupervised methods. Models in this category mostly tackle the problem of constructing a contextualized distributional representation of word meaning (word meaning in context) [42, 43, 22, 23, 44, 24, 45, 25, 46, 47]. Even though lexical substitution and word meaning in context have different goals, solving the latter also provides a simple means of solving the former. Once a contextualized vector of word meaning is acquired, it is simple to check which substitute candidate generates the vector most similar to the original one (in terms of, for example, cosine similarity) and is thus the best candidate. That is why most models dealing with word meaning in context evaluate on lexical substitution datasets.

Erk and Padó [23] introduced a structured vector space model and abandoned the idea of representing word meaning with a single vector. Instead, they propose multiple vector representations of a single word: one is a traditional co-occurrence vector, whereas the others encode selectional preferences for particular relations that the word supports. The contextualized word meaning representation is then calculated by combining the basic word representation with the inverse selectional preference vectors of the other words in context. Thater et al. [45] tackled the task in a similar manner, but used second-order co-occurrence vectors to improve the performance. Reisinger and Mooney [44] and Erk and Padó [22] considered an exemplar-based (also called multi-prototype) approach, in which each word is represented by a set of similar contexts (exemplars) in which it occurs.

There has also been a relatively new strand of research on word meaning in context that is closely intertwined with lexical substitution. The so-called substitute vectors (or paradigmatic representations [48]) aim to represent the context of a target word by the potential substitutes (fillers) that could fit in its place. Every word in the vector is weighted by how good a fit it is in the context. This context representation is the opposite of traditional first-order context representations based on the neighbouring words of a target word. Melamud et al. [46] follow this approach. First, they construct a substitute vector, using a smoothed 5-gram language model to measure substitute appropriateness. The second step computes a weighted average of the substitute vectors of all the contexts in which the target word occurs (the weighting is done according to substitute vector similarity).

A number of approaches focus on directly computing the felicity of a substitute candidate, instead of generating word-meaning-in-context representations. Melamud et al. [27] propose a very simple approach to substitute candidate ranking based on the popular skip-gram word embedding model [49, 50]. What makes their approach interesting is that they use the context embeddings internally generated by the model, something that is usually ignored. They estimate the suitability of a substitute candidate for a target word in context as a combination of two types of similarities: one between the substitute candidate and the target word, and the other between the substitute candidate and the other words in the context. They experiment with four different weighting schemes and report the results.

Other methods. Besides the prominent approaches mentioned above, there are many simpler models that use interesting methodologies and features worth mentioning. Sinha and Mihalcea [7] define simple count-based models for substitute candidate ranking. First, they generate all the inflections of a substitute candidate. Next, they replace the target word in a sentence with all the generated inflections and obtain a large-corpus count of all n-grams (n and other modifications depending on the model) that contain the candidate. To obtain the final score of a substitute candidate, they sum up the counts, and all the candidates are then ranked by their score. What is interesting about their approach is that they also define a supervised model that predicts which combination of lexical resources and models works best.

Giuliano et al. [19] devised two approaches. For the first, they used Latent Semantic Analysis (LSA), trained on the British National Corpus (BNC), to come up with semantic representations of the context and of each substitute candidate; the substitutes are ranked by the similarity of their representation to the context representation. For the second approach, they replace the target word in a context with each of the substitute candidates and extract all the context n-grams (n being 2–5) containing it. A substitute's score is then obtained as the sum of the frequencies of the extracted n-grams in the large Web 1T corpus [39].

Hassan et al. [20] use six different models for candidate ranking and obtain the final substitute candidate score as a sum of weighted reciprocal ranks of all the models, with the weights fit using a genetic algorithm. The models they use comprise simple resource-based models, corpus n-gram frequency models, a language model, an LSA-similarity-based model, and a model based on machine translation. The last model offers an interesting approach – first, the sentence is translated back and forth between English and French; the translation is then searched for any of the substitute candidates (or their inflections), which are scored accordingly.

VI. CONCLUSION

Coping with polysemous words in NLP applications is both an interesting and a challenging matter. Traditional ways of dealing with word senses, such as word sense disambiguation, are impractical due to their requirement of a fixed sense inventory. In this paper we gave a survey of a somewhat novel task that does not pose this requirement – the lexical substitution task. We explained the task and discussed some similarities with other NLP tasks. We also briefly presented the available datasets for English lexical substitution. Moreover, we explained the possible approaches to evaluating lexical substitution systems. As the task comprises two subtasks, substitute generation and substitute ranking, we outlined prominent approaches to both of them, showing that unsupervised ones are more popular.

Considering that lexical substitutes give us somewhat new information about word meaning in context, without making the usual first-order information obsolete, we think that lexical substitution has a lot of potential in many NLP tasks. As it is really simple (and natural) to collect synonyms or near-synonyms of words through lexical substitution, we hypothesize that it might be a great tool to facilitate efficient word sense induction. A possible avenue of research following this idea would be to cluster the substitutes – by doing this across words, we would acquire synsets, and by doing this across instances of a single word, we would arrive at the word senses themselves, along the lines of Biemann [51]. However, what makes our approach different is that we aim to construct large-scale inventories (and not only for nouns), enriched continuously over time by a crowdsourced effort. We argue that such an approach is feasible, as the task itself is extremely easy for the annotators (which might not be the case for WSD setups with too fine-grained sense inventories), making it perfectly suitable for crowdsourcing.
REFERENCES

[1] R. Mitkov, The Oxford Handbook of Computational Linguistics, ser. Oxford Handbooks in Linguistics. Oxford University Press, USA, 2003.
[2] R. Navigli, “Word sense disambiguation: A survey,” ACM Computing Surveys (CSUR), vol. 41, no. 2, p. 10, 2009.
[3] C. Fellbaum, WordNet. Wiley Online Library, 1998.
[4] P. Edmonds and A. Kilgarriff, “Introduction to the special issue on evaluating word sense disambiguation systems,” Natural Language Engineering, vol. 8, no. 04, pp. 279–291, 2002.
[5] B. Snyder and M. Palmer, “The English all-words task,” in Proceedings of Senseval-3, 2004, pp. 41–43.
[6] K. Erk and D. McCarthy, “Graded word sense assignment,” in Proceedings of EMNLP 2009, vol. 1. ACL, 2009, pp. 440–449.
[7] R. Sinha and R. Mihalcea, “Combining lexical resources for contextual synonym expansion,” in International Conference Recent Advances in Natural Language Processing, RANLP, 2009, pp. 404–410.
[8] D. McCarthy and R. Navigli, “Semeval-2007 task 10: English lexical substitution task,” in Proceedings of the 4th International Workshop on Semantic Evaluations. ACL, 2007, pp. 48–53.
[9] D. McCarthy, “Lexical substitution as a task for WSD evaluation,” in Proceedings of the ACL-02 Workshop on Word Sense Disambiguation: Recent Successes and Future Directions – Volume 8. Association for Computational Linguistics, 2002, pp. 109–115.
[10] L. Specia, S. K. Jauhar, and R. Mihalcea, “Semeval-2012 task 1: English lexical simplification,” in Proceedings of the Sixth International Workshop on Semantic Evaluation, vol. 2. Association for Computational Linguistics, 2012, pp. 347–355.
[11] R. Snow, B. O’Connor, D. Jurafsky, and A. Y. Ng, “Cheap and fast—but is it good?: Evaluating non-expert annotations for natural language tasks,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2008, pp. 254–263.
[12] M. Lease, “On quality control and machine learning in crowdsourcing,” Human Computation, vol. 11, p. 11, 2011.
[13] D. Jurgens, “Embracing ambiguity: A comparison of annotation methodologies for crowdsourcing word sense labels,” in HLT-NAACL, 2013, pp. 556–562.
[14] A. Wang, C. D. V. Hoang, and M.-Y. Kan, “Perspectives on crowdsourcing annotations for natural language processing,” Language Resources and Evaluation, vol. 47, no. 1, pp. 9–31, 2013.
[15] C. Biemann, “Turk Bootstrap Word Sense Inventory 2.0: A large-scale resource for lexical substitution,” in LREC, 2012, pp. 4038–4042.
[16] G. Kremer, K. Erk, S. Padó, and S. Thater, “What substitutes tell us – analysis of an “all-words” lexical substitution corpus,” in EACL, 2014, pp. 540–549.
[17] S. Jabbari, M. Hepple, and L. Guthrie, “Evaluation metrics for the lexical substitution task,” in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2010, pp. 289–292.
[18] K. Kishida, Property of average precision and its generalization: An examination of evaluation indicator for information retrieval experiments. National Institute of Informatics, Tokyo, Japan, 2005, vol. 2005.
[19] C. Giuliano, A. Gliozzo, and C. Strapparava, “FBK-irst: Lexical substitution task exploiting domain and syntagmatic coherence,” in Proceedings of the 4th International Workshop on Semantic Evaluations. Association for Computational Linguistics, 2007, pp. 145–148.
[20] S. Hassan, A. Csomai, C. Banea, R. Sinha, and R. Mihalcea, “UNT: SubFinder: Combining knowledge sources for automatic lexical substitution,” in Proceedings of the 4th International Workshop on Semantic Evaluations. ACL, 2007, pp. 410–413.
[21] D. Yuret, “KU: Word sense disambiguation by substitution,” in Proceedings of the 4th International Workshop on Semantic Evaluations. Association for Computational Linguistics, 2007, pp. 207–213.
[22] K. Erk and S. Padó, “Exemplar-based models for word meaning in context,” in Proceedings of the ACL 2010 Conference Short Papers. Association for Computational Linguistics, 2010, pp. 92–97.
[23] ——, “A structured vector space model for word meaning in context,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2008, pp. 897–906.
[24] S. Thater, G. Dinu, and M. Pinkal, “Ranking paraphrases in context,” in Proceedings of the 2009 Workshop on Applied Textual Inference. Association for Computational Linguistics, 2009, pp. 44–47.
[25] S. Thater, H. Fürstenau, and M. Pinkal, “Word meaning in context: A simple and effective vector model,” in IJCNLP, 2011, pp. 1134–1143.
[26] G. Dinu, S. Thater, and S. Laue, “A comparison of models of word meaning in context,” in Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2012, pp. 611–615.
[27] O. Melamud, O. Levy, I. Dagan, and I. Ramat-Gan, “A simple word embedding model for lexical substitution,” in Proceedings of NAACL-HLT, 2015, pp. 1–7.
[28] G. A. Miller, “WordNet: A lexical database for English,” Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
[29] T. Hawker, “USYD: WSD and lexical substitution using the Web1T corpus,” in Proceedings of the 4th International Workshop on Semantic Evaluations. Association for Computational Linguistics, 2007, pp. 446–453.
[30] D. Martinez, S. N. Kim, and T. Baldwin, “MELB-MKB: Lexical substitution system based on relatives in context,” in Proceedings of the 4th International Workshop on Semantic Evaluations. Association for Computational Linguistics, 2007, pp. 237–240.
[31] S. Zhao, L. Zhao, Y. Zhang, T. Liu, and S. Li, “HIT: Web based scoring method for English lexical substitution,” in Proceedings of the 4th International Workshop on Semantic Evaluations. Association for Computational Linguistics, 2007, pp. 173–176.
[32] G. Szarvas, C. Biemann, I. Gurevych et al., “Supervised all-words lexical substitution using delexicalized features,” in HLT-NAACL, 2013, pp. 1131–1141.
[33] O. Etzioni, K. Reiter, S. Soderland, M. Sammer, and T. Center, “Lexical translation with application to image search on the Web,” Machine Translation Summit XI, 2007.
[34] C. A. Lindberg, Oxford American Writer’s Thesaurus. Oxford University Press, USA, 2012.
[35] B. A. Kipfer, “Roget’s New Millennium Thesaurus, 1/e,” 2007.
[36] G. Dahl, A.-M. Frassica, and R. Wicentowski, “SW-AG: Local context matching for English lexical substitution,” in Proceedings of the 4th International Workshop on Semantic Evaluations. ACL, 2007, pp. 304–307.
[37] J. R. L.-B. Bernard, The Macquarie Thesaurus: The Book of Words. Macquarie Library, 1986.
[38] S. Mohammad, G. Hirst, and P. Resnik, “TOR, TORMD: Distributional profiles of concepts for unsupervised word sense disambiguation,” in Proceedings of the 4th International Workshop on Semantic Evaluations. Association for Computational Linguistics, 2007, pp. 326–333.
[39] T. Brants and A. Franz, “Web 1T 5-gram corpus version 1.1,” Tech. Rep., 2006.
[40] D. Lin, “An information-theoretic definition of similarity,” in ICML, vol. 98, 1998, pp. 296–304.
[41] G. Szarvas, R. Busa-Fekete, and E. Hüllermeier, “Learning to rank lexical substitutions,” in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 2013, pp. 1926–1932.
[42] K. Erk, “Vector space models of word meaning and phrase meaning: A survey,” Language and Linguistics Compass, vol. 6, no. 10, pp. 635–653, 2012.
[43] K. Erk, D. McCarthy, and N. Gaylord, “Measuring word meaning in context,” Computational Linguistics, vol. 39, no. 3, pp. 511–554, 2013.
[44] J. Reisinger and R. J. Mooney, “Multi-prototype vector-space models of word meaning,” in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2010, pp. 109–117.
[45] S. Thater, H. Fürstenau, and M. Pinkal, “Contextualizing semantic representations using syntactically enriched vector models,” in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2010, pp. 948–957.
[46] O. Melamud, I. Dagan, and J. Goldberger, “Modeling word meaning in context with substitute vectors,” in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics, Denver, USA, 2015.
[47] G. Dinu and M. Lapata, “Measuring distributional similarity in context,” in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2010, pp. 1162–1172.
[48] M. A. Yatbaz, E. Sert, and D. Yuret, “Learning syntactic categories using paradigmatic representations of word context,” in Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 2012, pp. 940–951.
[49] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
[50] O. Levy and Y. Goldberg, “Dependency-based word embeddings,” in Proceedings of ACL 2014, 2014, pp. 302–308.
[51] C. Biemann, “Creating a system for lexical substitutions from scratch using crowdsourcing,” Language Resources and Evaluation, vol. 47, no. 1, pp. 97–122, 2013.