Evaluating multi-sense embeddings for semantic resolution monolingually and in word translation

Gábor Borbély
Department of Algebra, BME (Budapest University of Technology and Economics)
Egry József u. 1, 1111 Budapest, Hungary
[email protected]

Márton Makrai
Institute for Linguistics, HAS (Hungarian Academy of Sciences)
Benczúr u. 33, 1068 Budapest, Hungary
[email protected]

Dávid Nemeskey and András Kornai
Institute for Computer Science, HAS
Kende u. 13-17, 1111 Budapest, Hungary
[email protected], [email protected]

Abstract

Multi-sense word embeddings (MSEs) model different meanings of word forms with different vectors. We propose two new methods for evaluating MSEs by their degree of semantic resolution, measuring the detail of the sense clustering. One method is based on monolingual dictionaries; the other exploits the principle that words may be considered ambiguous to the extent that the postulated senses translate to different words in some other language.

1 Introduction

Multi-sense word embeddings (MSEs) (Huang et al., 2012; Neelakantan et al., 2014b; Bartunov et al., 2015) attempt to represent different word senses with different vectors. MSEs have been evaluated for lexical relatedness going back to the seminal Reisinger and Mooney (2010). Their motivation was psychological (category learning) rather than lexicographic: in fact, they "do not assume that clusters correspond to traditional word senses. Rather, we only rely on clusters to capture meaningful variation in word usage." In other words, the evaluation itself was not psychological: it is unclear whether their vectors correspond to different concepts in the human mind. As they note, usage splits words more finely (with synonyms and near-synonyms ending up in distant clusters) than semantics does.

Here we propose testing MSEs for their fit with the semantics by two different methods, one based on monolingual dictionaries, the other on bilingual considerations.
Items listed in traditional monolingual dictionaries for the same word form are supposed to correspond to different concepts. We propose the correlation between the number of senses of each word form in the embedding and in the manually created inventory as an approximate measure of how well embedding vectors correspond to concepts in speakers' minds – this is discussed in Section 2.

The other evaluation method is based on the linear method of Mikolov et al. (2013b), who formulate word translation as a linear mapping from the source-language embedding to the target one, trained on a seed of a few thousand word pairs. Our proposal is to perform such translations from MSEs, with the idea that what are different senses in the source language will very often translate to different words in the target language. This way, we can use single-sense embeddings on the target side and thereby reduce the noise of MSEs – this is discussed in Section 3.

Altogether, we present a preliminary evaluation of five MSE implementations by these two methods on two languages, English and Hungarian (some of the 20 combinations are deferred to the full paper): CMultiVec (Huang et al., 2012), as reimplemented by Jeremy Salwen (https://github.com/antonyms/CMultiVec); the learning process of Neelakantan et al. (2014a) with adaptive sense numbers (we report results using their release MSEs vectors.MSSG.300D.30K as well as their tool itself, calling both Neela); the parametrized Bayesian learner of Bartunov et al. (2015), where the number of senses is controlled by a parameter α for semantic resolution, referred to as AdaGram; topical embeddings (Liu et al., 2015), a combination of neural frameworks with LDA topic models; and jiweil (Li and Jurafsky, 2015), which was published to test MSEs in downstream tasks.
Some very preliminary conclusions are offered in Section 4, more in regard to the feasibility of the two evaluation methods we propose than about the merits of these systems.

We emphasize at the outset that our evaluation proposal probes an aspect of MSEs, semantic resolution, which is not well measured by the well-known word sense disambiguation (WSD) task that aims at classifying occurrences of a word form into different elements of a sense inventory pre-defined by some experts. Since our goal is to probe the granularity of the inventory itself, we are more interested in word sense induction (WSI), the task of discovering sense clusters among occurrences (Schütze, 1998).

2 Conceptual lexical items

The differentiation of word senses is fraught with difficulties, especially when we wish to distinguish homophony, the use of the same written or spoken form to express different concepts, such as Russian mir 'world' and mir 'peace', from polysemy, where speakers feel that the two senses are very strongly connected, as in Hungarian nap 'day' and nap 'sun'. To quote Zgusta (1971): "Of course it is a pity that we have to rely on the subjective interpretations of the speakers, but we have hardly anything else on hand". Etymology makes clear that different languages make different lump/split decisions in the conceptual space, so much so that translational relatedness can, to a remarkable extent, be used to recover the universal clustering (Youna et al., 2016).

Another confounding factor is part of speech (POS). Very often, the entire distinction is lodged in the POS, as in divorce (Noun) and divorce (Verb), while at other times this is less clear; compare the verbal to bank 'rely on a financial institution' and to bank 'tilt'.
Clearly the former is strongly related to the nominal bank 'financial institution', while the semantic relation 'sloping sideways' that connects the tilting of the airplane to the side of the river is somewhat less direct, and not always perceived by the speakers. This problem affects our sources as well: the Collins-COBUILD dictionary (CED, Sinclair (1987)) starts with the semantic distinctions and subordinates POS distinctions to these, while the Longman dictionary (LDOCE, Boguraev and Briscoe (1989)) starts with a POS-level split and puts the semantic split below it. Of the Hungarian lexicographic sources, the Comprehensive Dictionary of Hungarian (NSZ, Ittzés (2011)) is closer to CED, while the Explanatory Dictionary of Hungarian (EKSZ, Pusztai (2003)) is closer to LDOCE in this regard. The corpora we rely on are UMBC Webbase (Han et al., 2013) for English and Webkorpusz (Halácsy et al., 2004) for Hungarian.

Table 1 summarizes the major word sense statistics (mean and variance) both for our lexicographic sources and for the automatically generated MSEs.

resource            size     mean   std
CED                 82,024   1.030  0.206
LDOCE               30,265   1.137  0.394
EKSZ               121,578   1.012  0.119
NSZ (b)              5,594   1.029  0.191
huang              100,232   1.553  2.161
CMultiVec/UMBC     100,000   4.844  0.384
Neela               99,156   1.605  0.919
Neela/WK1           99,156   1.005  0.072
Neela/WK2           99,156   1.016  0.215
jiweil/WK senses   285,856   2.483  1.181
jiweil/WK cntxt    285,856   2.600  1.239
Adagram a05        238,462   1.062  0.276
Adagram            238,462   1.626  0.910

Table 1: The size (in words), mean, and standard deviation of the number of word senses in lexicographic and automatically generated resources

The figure of merit we propose is the correlation between the number of senses obtained by the automatic method and by the manual (lexicographic) method. We experiment with Spearman and Pearson correlations, Jensen-Shannon and KL divergence, cosine similarity, and Cohen's κ for rater agreement.
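As a concrete illustration of the correlation-based figure of merit, the sketch below computes Spearman and Pearson correlations between two sense-count inventories restricted to their shared vocabulary. The word lists and counts are made-up toy data, not our actual resources; Spearman is implemented by hand (via average ranks, which matters here since sense counts are heavily tied) to keep the example self-contained.

```python
from math import sqrt

def pearson(xs, ys):
    # standard Pearson product-moment correlation
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def average_ranks(xs):
    # 1-based ranks; tied values receive the mean of the ranks they span
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    # Spearman = Pearson over the (tie-averaged) ranks
    return pearson(average_ranks(xs), average_ranks(ys))

# hypothetical per-word sense counts from an MSE and a dictionary
mse  = {"bank": 3, "mir": 2, "day": 1, "run": 4}
dico = {"bank": 4, "mir": 2, "day": 2, "run": 5}
shared = sorted(mse.keys() & dico.keys())
xs = [mse[w] for w in shared]
ys = [dico[w] for w in shared]
print(round(spearman(xs, ys), 3), round(pearson(xs, ys), 3))
```

In real use, `xs` and `ys` would be the sense counts of the tens of thousands of shared words reported in Table 2, not four toy entries.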
The entropy-based measures are not reported, however, as they failed to meaningfully distinguish between the various resource pairs. The cosine similarities and κ values should also be taken with a grain of salt: the former does not take the exact number of senses into account (hence the high perceived similarity between CMultiVec and CED/LDOCE), while the latter penalises all disagreements the same, regardless of how far off the guesses are. Nevertheless, all four reported metrics agree on the best pairing.

First, we note that the dictionaries themselves are quite well correlated with each other: CED and LDOCE have Spearman (Pearson) correlation .266 (.277); EKSZ and NSZ have .648 (.677), considerably larger both because we only used a subsample of NSZ (the letter b), so there are only 5,363 words to compare, and because NSZ and EKSZ come from the same Hungarian lexicographic tradition, while CED and LDOCE never shared personnel or editorial outlook. Table 2 shows correlations of the sense numbers attributed to each word by different resources, comparing automated to lexicographic (above the line) and different forms of automated to each other (below). As is clear from the numbers, only one system, Neela, shows high correlation with a lexical resource (LDOCE), and only two systems, AdaGram and Neela, correlate well with each other (different parametrizations of the same system are of course often well correlated with one another).

We tried to filter bogus word senses from the MSEs by applying a simple heuristic: sense vectors within a cosine distance of 0.05 of zero or of another sense of the same word were thrown away. As expected, the adaptive methods, Neela and AdaGram, were unaffected, but jiweil and CMultiVec benefited slightly from the change.

In the full paper we will also discuss a method for estimating the ambiguity of a word directly from dictionaries automatically created from parallel corpora and available for many languages.
Since such dictionaries are very noisy, to obtain a robust ambiguity estimate we compute the perplexity of the unigram frequencies of the translations in the target language, and average these numbers over all target languages.

3 Cross-linguistic treatment of concepts

Since monolingual dictionaries are an expensive resource, we also propose an automatic evaluation of MSEs based on the discovery of Mikolov et al. (2013b) that embeddings of different languages are so similar that a linear transformation can map vectors of the source-language words to the vectors of their translations. The method uses a seed dictionary of a few thousand words to learn translation as a linear mapping $W : \mathbb{R}^{d_1} \to \mathbb{R}^{d_2}$ from the source (monolingual) embedding to the target one: the translation $z_i \in \mathbb{R}^{d_2}$ of a source word $x_i \in \mathbb{R}^{d_1}$ is approximately its image $W x_i$ under the mapping. The translation model is trained with linear regression on the seed dictionary,

$$\min_W \sum_i \|W x_i - z_i\|^2,$$

and can be used to collect translations for the whole vocabulary by choosing $z_i$ to be the nearest neighbor of $W x_i$. We follow Mikolov et al. (2013b) in using different metrics, Euclidean distance in training and cosine similarity in the collection of translations. This choice is theoretically unmotivated, but it seems to work better than a more consistent use of metrics; see Xing et al. (2015) for opposing results.

In a multi-sense embedding scenario, we take a single-sense (traditional) embedding as the source model and a multi-sense embedding as the target model. The evaluation of a specific target MSE model is done in three ways, referred to as global, first sense, and all senses respectively. The tools that generate MSEs all provide fallbacks to single-sense embeddings in the form of so-called global vectors. The global method can be considered a baseline: a traditional, single-sense translation.
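The training and collection steps can be sketched in a few lines of numpy. This is a minimal illustration, not our actual pipeline: the dimensions, the "embeddings" (random matrices generated from a hidden linear map plus noise, so that a linear model is recoverable), and the vocabulary are all synthetic stand-ins; real use would load pre-trained source and target vectors for the seed pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, n_seed = 50, 40, 1000

# synthetic stand-ins for the embeddings of the seed pairs:
# rows of X are source vectors x_i, rows of Z their translations z_i
W_hidden = rng.normal(size=(d1, d2))
X = rng.normal(size=(n_seed, d1))
Z = X @ W_hidden + 0.01 * rng.normal(size=(n_seed, d2))

# least-squares fit of min_W sum_i ||W x_i - z_i||^2, in row form X W ~ Z
W, _, _, _ = np.linalg.lstsq(X, Z, rcond=None)

def translate(x, target_vectors, k=1):
    """Indices of the k cosine-nearest rows of target_vectors to the image of x."""
    pred = x @ W
    t = target_vectors / np.linalg.norm(target_vectors, axis=1, keepdims=True)
    sims = t @ (pred / np.linalg.norm(pred))
    return np.argsort(-sims)[:k]

# sanity check: the image of a seed source vector should land near its own translation
print(translate(X[0], Z, k=3))
```

Note the metric mismatch discussed above is reproduced here: the fit minimizes Euclidean error, while `translate` ranks candidates by cosine similarity.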
In the first sense method, we trained the translation matrix to map a source vector into the first sense vector of the target word, but during the evaluation all senses of the target word were accepted as valid translations. When considering all senses, the translation matrix was trained to map into all of the target senses, i.e. $\sum_i \sum_j \|W x_i - z_{ij}\|^2$ was minimized, where the $z_{ij}$ denote the different senses of $z_i$.

Figure 1: Linear translation of word senses. The Hungarian word jelentés is ambiguous between 'meaning' and 'report'. The two senses are identified by the "neighboring" words értelmezés 'interpretation' and tanulmány 'memorandum'.

The quality of the translation was measured by training on 5k of the most frequent word pairs and evaluating on another thousand seed pairs. The seed dictionary was extracted by Tiedemann (2012) from the OpenSubtitles corpus [2]. In the evaluations shown in the first two sections of Table 3, the English embedding was pre-trained by Mikolov et al. (2013a) on the Google News corpus [3]. We trained the target Hungarian embeddings on the Hungarian Webcorpus with different tools. We measure precision in the standard manner: prec@k is the ratio of test items for which at least one of the gold translations of the source word is among the k nearest neighbors of the computed vector in the target space. Table 3 also shows an evaluation based on translating from Hungarian to English: here we trained the Hungarian embedding on Webcorpus with word2vec and took a pre-trained English MSE (Neelakantan et al., 2014b).

Resources compared              words 1  words 2  shared  Spearman  Pearson    cos  Cohen-κ
Neela vs CED                     99156    82024   23508     0.089    0.090   0.869   0.023
Neela vs LDOCE                   99156    30265   21715     0.226    0.222   0.882   0.073
huang vs CED                    100232    82024   23706     0.078    0.08    0.593   0.0308
huang vs LDOCE                  100232    30265   21763     0.28     0.289   0.65    0.124
CMultiVec/UMBC vs CED           100000    82024   25231     0.011    0.016   0.954   0
CMultiVec/UMBC vs LDOCE         100000    30265   23702     0.037    0.035   0.936   0
Neela 2 vs EKSZ                 771870    67546   45401     0.067    0.067   0.878   0.03
jiweil sense vs EKSZ            285856    67546   32007    -0.023   -0.022   0.891  -0.002
Adagram vs EKSZ                 238462    67546   26739     0.086    0.094   0.859   0.005
Adagram.a05 vs EKSZ             238462    67546   26739     0.0879   0.103   0.920   0.039
Neela vs huang                   99156   100232   99156     0.349    0.349   0.647   0.111
Neela vs CMultiVec/UMBC          99156   100000   69007     0.066    0.068   0.882  -0.001
Neela 2 vs jiweil sense         771870   285856  283083    -0.014   -0.074   0.845  -0.007
Neela 2 vs jiweil context       771870   285856  283083     0.123    0.129   0.872   0.004
Adagram vs Neela 2              238462   771870  199370     0.389    0.478   0.897   0.025
Adagram vs jiweil sense         238462   285856  201291     0.034    0.020   0.793   0.0038
Adagram vs jiweil context       238462   285856  201291     0.139    0.223   0.836   0.027
Adagram.a05 vs Neela 2          238462   771870  199370     0.421    0.498   0.946   0.188
Adagram.a05 vs jiweil sense     238462   285856  201291    -0.010   -0.055   0.869  -0.008
Adagram.a05 vs jiweil context   238462   285856  201291     0.095    0.132   0.890   0.0005
Adagram vs Adagram.a05          238462   238462  238461     0.453    0.505   0.907   0.071

Table 2: Word sense distribution similarity between various resources

system   method       prec@1  prec@5  prec@10
neela    global        .217    .387    .440
neela    first sense   .135    .237    .286
neela    all senses    .116    .208    .247
adagram  global        .140    .273    .320
adagram  first sense   .141    .300    .350
adagram  all senses    .156    .315    .380
jiweil   global        .217    .377    .449
jiweil   first sense   .165    .336    .409
jiweil   all senses    .135    .283    .368

Table 3: Translation precision

4 Conclusions

To summarize, we have proposed evaluating word embeddings in terms of their semantic resolution (granularity of word senses), to serve as a proxy for WSD, WSI, and MT tasks, over and above the gains that MSEs show over single-sense embeddings in similarity and analogy tasks. The central linguistic/semantic/psychological property we wish to capture is that of a concept, the underlying word sense unit.
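The prec@k measure used in our translation evaluation reduces to a few lines. The sketch below uses illustrative toy data (the Hungarian–English pairs and rankings are invented for the example, not drawn from our actual results): a test item counts as a hit when any of its gold translations appears among the top-k ranked candidates.

```python
def prec_at_k(predictions, gold, k):
    # predictions: {source_word: candidate translations ranked by similarity}
    # gold: {source_word: set of acceptable translations}
    hits = sum(1 for w, ranked in predictions.items()
               if set(ranked[:k]) & gold[w])
    return hits / len(predictions)

# toy ranked candidates, as if read off nearest neighbors of W x
preds = {
    "nap":      ["sun", "moon", "day"],
    "jelentés": ["report", "meaning", "sense"],
    "mir":      ["village", "world", "peace"],
}
gold = {"nap": {"day", "sun"}, "jelentés": {"meaning"}, "mir": {"peace"}}
print(round(prec_at_k(preds, gold, 1), 3))  # only "nap" scores at k=1
```

Because a hit only requires one gold translation in the top k, prec@k is monotone non-decreasing in k, as the rows of Table 3 illustrate.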
To the extent that standard lexicographic practice offers a reasonably robust notion (this is of course debatable, but we consider a correlation of 0.27 over a large vocabulary a strong indication of consistency), this is something that MSEs should aim at capturing. We leave the matter of aligning word senses in different dictionaries for future work, but we expect that by (manual or automated) alignment the inter-dictionary (inter-annotator) agreement can be improved considerably, to provide a more robust gold standard.

At this point everything we do is done in software [4], so other researchers can accurately reproduce these kinds of evaluations. Whether a 'gold' sense-disambiguated dictionary should be produced beyond the publicly available CED is not entirely clear, and we hope the referees and the other workshop participants will weigh in on this matter.

[2] http://www.opensubtitles.org/
[3] https://code.google.com/archive/p/word2vec/
[4] Some glue code for this project can be found at https://github.com/hlt-bme-hu/multiwsi

References

Sergey Bartunov, Dmitry Kondrashkin, Anton Osokin, and Dmitry Vetrov. 2015. Breaking sticks and ambiguities with adaptive skip-gram. ArXiv preprint.

Branimir K. Boguraev and Edward J. Briscoe. 1989. Computational Lexicography for Natural Language Processing. Longman.

Péter Halácsy, András Kornai, László Németh, András Rung, István Szakadát, and Viktor Trón. 2004. Creating open language resources for Hungarian. In Proc. LREC2004, pages 203–210.

Lushan Han, Abhay Kashyap, Tim Finin, James Mayfield, and Jonathan Weese. 2013. UMBC_EBIQUITY-CORE: Semantic textual similarity systems. In Proceedings of the 2nd Joint Conference on Lexical and Computational Semantics, pages 44–52.

Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes.
In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL '12, pages 873–882. Association for Computational Linguistics, Stroudsburg, PA, USA.

Nóra Ittzés, editor. 2011. A magyar nyelv nagyszótára III-IV [The Comprehensive Dictionary of Hungarian III-IV]. Akadémiai Kiadó.

Jiwei Li and Dan Jurafsky. 2015. Do multi-sense embeddings improve natural language understanding? In EMNLP.

Fei Liu, Jeffrey Flanigan, Sam Thomson, Norman Sadeh, and Noah A. Smith. 2015. Toward abstractive summarization using semantic representations. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1077–1086. Association for Computational Linguistics, Denver, Colorado.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In Y. Bengio and Y. LeCun, editors, Proceedings of the ICLR 2013.

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013b. Exploiting similarities among languages for machine translation. ArXiv preprint arXiv:1309.4168.

Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2014a. Efficient non-parametric estimation of multiple embeddings per word in vector space. In EMNLP.

Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2014b. Efficient non-parametric estimation of multiple embeddings per word in vector space. ArXiv preprint arXiv:1504.06654.

Ferenc Pusztai, editor. 2003. Magyar értelmező kéziszótár [Explanatory Dictionary of Hungarian]. Akadémiai Kiadó.

Joseph Reisinger and Raymond J. Mooney. 2010. Multi-prototype vector-space models of word meaning. In The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 109–117. Association for Computational Linguistics.

Hinrich Schütze. 1998. Automatic word sense discrimination. Computational Linguistics.

John M. Sinclair. 1987.
Looking up: An Account of the COBUILD Project in Lexical Computing. Collins ELT.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Ugur Dogan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). European Language Resources Association (ELRA), Istanbul, Turkey.

Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. 2015. Normalized word embedding and orthogonal transform for bilingual word translation. In NAACL.

Hyejin Youna, Logan Sutton, Eric Smith, Cristopher Moore, Jon F. Wilkins, Ian Maddieson, William Croft, and Tanmoy Bhattacharya. 2016. On the universal structure of human lexical semantics. PNAS 113(7):1766–1771.

Ladislav Zgusta. 1971. Manual of Lexicography. Academia, Prague.