Gathering Information about Word Similarity from Neighbor Sentences

Natalia Loukachevitch and Aleksei Alekseev
Research Computing Center of Lomonosov Moscow State University, Moscow, Russia
louk [email protected], [email protected]

Abstract. In this paper we present the first results of detecting word semantic similarity on the Russian translations of the Miller-Charles and Rubenstein-Goodenough sets prepared for the first Russian word semantic similarity evaluation, Russe-2015. The experiments were carried out on three text collections: Russian Wikipedia, a news collection, and their united collection. We found that the best results in detecting lexical paradigmatic relations are achieved using the combination of word2vec with a new type of feature based on word co-occurrence in neighbor sentences.

Keywords: Russian word semantic similarity, evaluation, neighbor sentences, word2vec, Spearman's correlation

1 Introduction

Knowledge about semantic word similarity can be used in various tasks of natural language processing and information retrieval, including word sense disambiguation, text clustering, textual entailment, search query expansion, etc. To assess how well current methods can automatically determine word semantic similarity, various datasets have been created. The best-known gold standards for this task include the Rubenstein-Goodenough (RG) dataset [21], the Miller-Charles (MC) dataset [18], and WordSim-353 [7]. These datasets contain word pairs annotated by human subjects according to lexical semantic similarity. Results of automatic approaches are compared with the datasets using Pearson or Spearman's correlation [1]. Some researchers distinguish similarity proper between word senses, which is mainly based on paradigmatic relations (synonymy, hyponymy, or hyperonymy), from relatedness, which rests on other types of lexical relations.
Agirre et al. [1] subdivided the WordSim-353 dataset into two subsets: the WordSim-353 similarity set and the WordSim-353 relatedness set. The former consists of word pairs classified as synonyms, antonyms, identical, or hyponym-hyperonym, together with unrelated pairs. The latter contains word pairs connected by other relations, together with unrelated pairs. To support word similarity research in other languages, the English similarity tests have been translated. Gurevych translated the RG and MC datasets into German [9]; Hassan and Mihalcea translated them into Spanish, Arabic, and Romanian [11]; Postma and Vossen translated the datasets into Dutch [20]. In 2015 the first Russian evaluation of word semantic similarity was organized [19]¹. One of the datasets created for the evaluation was a set of word pairs with similarity scores obtained by human judgments (the hj-dataset). The Russe hj-dataset was prepared by translating the existing English datasets MC, RG, and WordSim-353. One traditional way to detect semantic word similarity is the distributional approach, which is based on similar word neighbors in sentences. In this paper, we present a study of the possible contribution of features based on the location of words in neighbor sentences. We suppose that co-occurrence of words in the same sentence can indicate that they are components of a collocation or stand in a relatedness relation [22], whereas frequent co-occurrence in neighbor sentences indicates paradigmatic similarity in particular. We conduct our experiments on the Russe hj test set. We then work with the Russian translations of the RG and MC datasets prepared during the Russe evaluation. We show that the newly introduced features allow us to achieve the best results on the Russian translations of the RG and MC datasets.
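To make the evaluation protocol concrete, the comparison of automatic scores against human judgments via Spearman's rank correlation can be sketched as follows. The word-pair scores below are invented for illustration and are not taken from any of the datasets; the tie-free ranking is a simplification.

```python
# Minimal sketch of the evaluation protocol: Spearman's rank correlation
# between human similarity judgments and model scores for the same word
# pairs. All scores are hypothetical; ties are assumed absent.

def rank(values):
    # rank 1 = smallest value; simplified version that assumes no ties
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman(xs, ys):
    # rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), valid when there are no ties
    n = len(xs)
    rx, ry = rank(xs), rank(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

human = [3.92, 3.05, 1.10, 0.08]   # hypothetical gold judgments
model = [0.81, 0.66, 0.35, 0.02]   # hypothetical model similarities
print(spearman(human, model))      # perfectly concordant rankings -> 1.0
```

Only the ordering of the scores matters, which is why rank correlation (rather than Pearson) is the standard metric for these datasets.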
2 Related work

The existing approaches to calculating semantic similarity between words are based on various resources and features. One well-known approach utilizes the lexical relations described in manually created thesauri such as WordNet or Roget's thesaurus [1, 5], or online resources such as Wikipedia [8]. Another type of approach to detecting semantic word similarity is the distributional one, which accounts for shared neighbors of words [14]. In these approaches a matrix is constructed in which each row represents a word w and each column corresponds to a context of w. The value of each matrix cell represents the association between the word wi and the context cj. A popular measure of this association is pointwise mutual information (PMI), and especially positive PMI (PPMI), in which all negative values of PMI are replaced by 0 [2, 15]. Similarity between two words can then be calculated, for example, as the scalar product of their context vectors. Recently, neural-network-based approaches, in which words are embedded into a low-dimensional space, have appeared and come to be used in lexical semantic tasks. In particular, the word2vec tool², a program for creating word embeddings, became popular. In [2, 15] traditional distributional methods are compared with the new neural-network approaches on several lexical tasks. Baroni et al. [2] found that the new approaches consistently outperform traditional ones in most lexical tasks. For example, on the RG test traditional approaches achieve a Spearman's correlation of 0.74, while the neural approach obtains 0.84.

¹ http://russe.nlpub.ru/
² https://code.google.com/archive/p/word2vec/

Levy et al. [15] tried to find the best parameters for each approach and showed that on the WordSim-353 subsets, the word2vec similarity is better on the WordSim-353 similarity subset (0.773 vs.
0.755), while traditional distributional models based on PPMI weighting achieve better results on the WordSim-353 relatedness subset (0.688 vs. 0.623). They conclude that word2vec provides a robust baseline: "while it might not be the best method for every task, it does not significantly underperform in any scenario." Some authors propose combining various features to improve the detection of word semantic similarity. Agirre et al. (2009) use a supervised combination of WordNet-based similarity, distributional similarity, syntactic similarity, and context-window similarity, achieving a Spearman's correlation of 0.96 on the RG set and 0.83 on the WordSim-353 similarity subset. The best result achieved on the Russe-2015 hj test set [19], a Spearman's correlation of 0.7625, was obtained with a supervised approach combining word2vec similarity scores calculated on a united text collection (consisting of the ruwac web corpus, the lib.ru fiction collection, and Russian Wikipedia), a synonym database, a prefix dictionary, and orthographic similarity [16]. The second result (0.7187) was obtained by applying word2vec to two text corpora: the Russian National Corpus (RNC) and a news corpus. The word2vec scores from the news corpus were used if the target words were absent from the RNC [13]. The best result achieved with a traditional distributional approach was 0.7029 [19].

3 Datasets and text collections

In this study we use the similarity datasets created for Russe, the first Russian evaluation of word semantic similarity [19]. The Russe hj dataset was constructed by translating three English datasets: the Miller-Charles set, the Rubenstein-Goodenough set, and WordSim-353. The RG set consists of 65 word pairs ranging from synonymous pairs (e.g., car – automobile) to completely unrelated terms (e.g., noon – string). The MC set is a subset of the RG dataset consisting of 30 word pairs. The WordSim-353 set consists of 353 word pairs. The Miller-Charles set is also a subset of the WordSim-353 dataset.
The noun pairs in each set were annotated by human subjects using scales of semantic similarity. These datasets were joined together and translated into Russian in a consistent way. The word similarity scores for the joined dataset were obtained by crowdsourcing from Russian native speakers. Each annotator was asked to assess the similarity of each pair using four possible similarity values. All estimated word pairs were subdivided into a training set (66 word pairs) for adapting the participants' methods and a test set used for evaluation (335 word pairs), the Russe hj test set. Correlations with human judgments were calculated in terms of Spearman's rank correlation. In this study we present our results on the Russe hj test set (335 word pairs). In addition, we singled out the Russian translations of the MC and RG sets from the whole Russe hj similarity dataset and report our results on each of them. We use two text collections: Russian Wikipedia (0.24 B tokens) and a Russian news collection (0.45 B tokens). We also experimented on the united collection formed from these two collections.

4 Using neighbor sentences for the word similarity task

Analyzing the variety of features used for word semantic similarity detection, we could not find any work utilizing co-occurrence of words in neighbor sentences. However, repetitions of semantically related words in natural language texts play a very important role in providing the cohesion of text fragments [10], that is, the device for "sticking together" different parts of a text. Cohesion can be based on the use of semantically related words, reference, ellipsis, and conjunctions. Among these types the most frequent is lexical cohesion, which is created with sense-related words: repetitions, synonyms, hyponyms, etc. When analyzing a specific text, the location of words in sentences can be employed for constructing lexical chains.
In classical works [3, 12], the location of words in neighbor sentences is used for adding new elements to the current lexical chain. The distance in sentences between referential candidates is an important feature in tasks such as anaphora resolution and coreference chain construction [6]. Thus, in a text, subsequent sentences are often lexically attached to previous ones. This allows us to suppose that the analysis of word distribution in neighbor sentences can give additional information about semantic similarity. If two words frequently co-occur near each other, this often indicates that they are components of the same collocation. Frequent word co-occurrence in the same sentences often means that the words are participants in the same situations [22]. But if words frequently appear in neighbor sentences, this could show that they are semantically or thematically similar. In this study, we calculate co-occurrence of words in two sentences that are direct neighbors of each other and analyze its importance for lexical similarity detection in the form of two basic features:

– frequency of co-occurrence of words in neighbor sentences (NS), where the words must be located in different sentences: one word in the first sentence and the other in the second sentence;
– frequency of co-occurrence of words in a large window equal to two sentences (TS).

We consider these features in two variant weightings: pointwise mutual information (pmiNS, pmiTS) and normalized pointwise mutual information (npmiNS, npmiTS). It should also be mentioned that both types of features do not require any tuning because they are based on sentence boundaries, that is, they are self-adaptive to a text collection. To compare these features with features extracted from a window within the same sentence, we calculated the inside-sentence window feature (IS), with its variants pmiIS and npmiIS.
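The NS and TS counts defined above, together with their PMI and NPMI weightings, can be sketched as follows. The corpus format (a list of lemmatized sentences) and the window-level probability estimates are our own assumptions; the paper does not fix these implementation details.

```python
import math

def neighbor_counts(sentences, w1, w2):
    """Count co-occurrences of w1 and w2 over all adjacent sentence pairs.
    NS: the words fall in different sentences of the pair.
    TS: the words fall anywhere inside the two-sentence window."""
    n = ns = ts = f1 = f2 = 0
    for s1, s2 in zip(sentences, sentences[1:]):
        n += 1
        a, b = set(s1), set(s2)
        window = a | b
        f1 += w1 in window            # unigram count over two-sentence windows
        f2 += w2 in window
        ts += w1 in window and w2 in window
        ns += (w1 in a and w2 in b) or (w1 in b and w2 in a)
    return n, ns, ts, f1, f2

def pmi(joint, f1, f2, n):
    # pointwise mutual information with window-level probabilities (assumption)
    return math.log((joint / n) / ((f1 / n) * (f2 / n)))

def npmi(joint, f1, f2, n):
    # normalized PMI, bounded in [-1, 1]
    return pmi(joint, f1, f2, n) / -math.log(joint / n)

# Toy lemmatized corpus, invented for illustration
corpus = [["car", "road"], ["automobile", "fast"], ["noon", "lunch"],
          ["car", "drive"]]
n, ns, ts, f1, f2 = neighbor_counts(corpus, "car", "automobile")
print(ns, ts, round(npmi(ns, f1, f2, n), 3))   # -> 1 1 -0.262
```

Because the counting unit is the sentence pair, the only "window size" is supplied by the sentence boundaries themselves, which is what makes the features tuning-free.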
The IS features show co-occurrence of a word pair within a specific word window inside a sentence. We experimented with various sizes of word windows and present only the best results, usually based on a large window of 20 words (10 words to the left and 10 words to the right within a sentence). In fact, in most cases the IS window is equal to the whole sentence. In addition, we calculated npmi for word pair co-occurrences within documents (npmiDoc).

Features       News Collection  Ru-Wiki  United Collection
NS             0.519            0.494    0.543
pmiNS          0.632            0.598    0.660
npmiNS         0.638            0.612    0.677
TS             0.604            0.580    0.607
pmiTS          0.645            0.627    0.657
npmiTS         0.648            0.636    0.670
IS             0.577            0.601    0.594
pmiIS          0.645            0.633    0.652
npmiIS         0.663            0.658    0.681
npmiDoc        0.611            0.567    0.641
best of w2v    0.651            0.618    0.680
NS+w2v         0.659            0.633    0.674
pmiNS+w2v      0.687            0.647    0.712
npmiNS+w2v     0.661            0.652    0.704
TS+w2v         0.679            0.662    0.689
pmiTS+w2v      0.694            0.663    0.710
npmiTS+w2v     0.679            0.662    0.697
pmiIS+w2v      0.693            0.670    0.710
npmiIS+w2v     0.689            0.674    0.705
npmiDoc+w2v    0.659            0.619    0.692

Table 1. The results obtained on the Russe hj test set. The best single features and pairs of features are highlighted. The best w2v window size for all collections is 5.

To compare with state-of-the-art approaches, we utilized the word2vec tool for processing our collections (w2v). For preprocessing, the text collections were lemmatized and all function words were removed; thus, the word2vec similarity between words was calculated only on contexts containing nouns, verbs, adjectives, and adverbs. We used the CBOW model with 500 dimensions; window sizes were varied to produce the best results. We also considered simple combinations of pairs of features, calculated as the summation of the ranks of word pairs in the orderings determined by those features. Table 1 presents the results on three collections for the Russe hj test set: the news collection, Ru-Wiki, and the united collection.
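The rank-summation combination just described can be sketched as follows. The word pairs and scores are illustrative, and the tie-free sorting is a simplification of whatever tie handling was actually used.

```python
def rank_sum_combination(scores_a, scores_b):
    """Combine two similarity scorings of the same word pairs by summing the
    ranks each scoring assigns (rank 1 = most similar). Pairs with a lower
    summed rank are considered more similar by the combined feature."""
    def ranks(scores):
        order = sorted(scores, key=scores.get, reverse=True)
        return {pair: r for r, pair in enumerate(order, start=1)}
    ra, rb = ranks(scores_a), ranks(scores_b)
    combined = {pair: ra[pair] + rb[pair] for pair in scores_a}
    return sorted(combined, key=combined.get)   # most similar first

# Illustrative scores for three word pairs under two features
w2v_sim = {("car", "automobile"): 0.81, ("food", "fruit"): 0.44,
           ("noon", "string"): 0.05}
pmi_ns  = {("car", "automobile"): 2.1, ("food", "fruit"): 1.7,
           ("noon", "string"): -0.3}
print(rank_sum_combination(w2v_sim, pmi_ns))
```

Working on ranks rather than raw scores sidesteps the incomparable scales of word2vec cosine similarities and (n)PMI values, which is presumably why the combination is done this way.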
One can see that the best results obtained from a single feature for all three corpora are based on the inside-sentence co-occurrence feature npmiIS. The best results for two of the collections also correlate with the inside-sentence features. In our opinion, this is because the Russe hj test set contains many related (non-paradigmatic) word pairs. However, the best result (0.712) for this set is achieved by summing the word rankings obtained with two features: word2vec and pmiNS, that is, pointwise mutual information calculated on co-occurrence of words in neighbor sentences. This result is very close to the second result achieved during the Russe evaluation (0.717), but that approach used a three times larger news collection and the balanced Russian National Corpus (RNC) [13]. Thus, on the Russe test set, the best result was obtained with additional accounting for co-occurrence of words in neighbor sentences, though it should be noted that the combination of word2vec with an inside-sentence feature produced a very similar result (0.710).

Features     Dataset        News Collection  Ru-Wiki  United Collection
npmiNS       RG             0.759            0.751    0.813
             MC             0.803            0.833    0.827
npmiTS       RG             0.779            0.808    0.820
             MC             0.834            0.832    0.823
npmiIS       RG             0.744            0.803    0.811
             MC             0.805            0.824    0.817
best of w2v  RG (wind=2-5)  0.791            0.805    0.858
             MC (wind=1-3)  0.812            0.833    0.848
NS+w2v       RG             0.833            0.798    0.860
             MC             0.841            0.785    0.848
npmiNS+w2v   RG             0.835            0.850    0.868
             MC             0.837            0.861    0.868
TS+w2v       RG             0.841            0.842    0.870
             MC             0.872            0.816    0.881
npmiTS+w2v   RG             0.831            0.807    0.863
             MC             0.858            0.850    0.863
npmiIS+w2v   RG             0.826            0.851    0.857
             MC             0.852            0.843    0.867

Table 2. The results obtained on the Russian RG and MC sets. The best single features and pairs of features are highlighted. The best word2vec windows are indicated.
The specificity of this set is that it contains paradigmatic relations (similarity proper) together with relatedness relations from the translated WordSim-353 set. Therefore we decided to study the proposed features on the RG and MC sets, which were constructed for measuring paradigmatic relations between words. Table 2 presents the results obtained for the Russian translations of the RG and MC sets. Only the best features and their pairs are shown. The presented results are the first obtained for these Russian datasets. We can see a quite different picture on these sets than on the Russe hj test set. In both cases the maximal values of Spearman's rank correlation for all three collections are obtained with the combination of word2vec and features based on co-occurrence of words in neighbor sentences. The best single features are mainly the neighbor-sentence features. Our best result for the Russian MC set (0.881) is achieved with the pair of features (word2vec, TS – the co-occurrence of words in the window of two sentences) on the united text collection of Ru-Wiki and the news collection. The best result for the Russian RG set (0.870) is obtained with the same combination of features on the same collection. The achieved results are much better than the results for each single feature. In our opinion, this means that frequent co-occurrence of words in neighbor sentences carries additional useful information for detecting paradigmatic lexical similarity.

5 Conclusion

In this paper we presented the first results of detecting word semantic similarity on the Russian translations of the MC and RG sets prepared for the first Russian word semantic evaluation, Russe-2015. The experiments were performed on three text collections: Russian Wikipedia, a news collection, and the united collection of Ru-Wiki and the news collection.
We found that the best results in detecting lexical paradigmatic relations are achieved using the combination of word2vec with the new type of feature proposed in this paper – the co-occurrence of words in neighbor sentences. We suppose that this feature combination is especially useful for improving the extraction of sense-related words from medium-sized collections. We plan to continue the study of the proposed features on other datasets, including the Russian translation of the WordSim-353 dataset and the original English datasets. In addition, we plan to test the possible contribution of these neighbor-sentence features in probabilistic topic modeling [4].

Acknowledgments. This work was partially supported by the Russian Foundation for Basic Research, grant N14-07-00383.

References

1. Agirre, E., Alfonseca, E., Hall, K., Kravalova, J., Pasca, M., Soroa, A.: A study on similarity and relatedness using distributional and WordNet-based approaches. The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 19–27 (2009)
2. Baroni, M., Dinu, G., Kruszewski, G.: Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of ACL-2014, 238–247 (2014)
3. Barzilay, R., Elhadad, M.: Using lexical chains for text summarization. Advances in Automatic Text Summarization, 111–121 (1999)
4. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022 (2003)
5. Budanitsky, A., Hirst, G.: Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1), 13–47 (2006)
6. Fernandes, E.R., dos Santos, C.N., Milidiu, R.L.: Latent trees for coreference resolution. Computational Linguistics (2014)
7.
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., Ruppin, E.: Placing Search in Context: The Concept Revisited. Proceedings of the 10th International Conference on World Wide Web, 406–414 (2001)
8. Gabrilovich, E., Markovitch, S.: Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. Proceedings of IJCAI, 6–12 (2007)
9. Gurevych, I.: Using the Structure of a Conceptual Network in Computing Semantic Relatedness. Proceedings of the 2nd International Joint Conference on Natural Language Processing, Jeju Island, South Korea, 767–778 (2005)
10. Halliday, M., Hasan, R.: Cohesion in English. Routledge (2014)
11. Hassan, S., Mihalcea, R.: Cross-lingual Semantic Relatedness Using Encyclopedic Knowledge. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Vol. 3, Singapore, 1192–1201 (2009)
12. Hirst, G., St-Onge, D.: Lexical chains as representations of context for the detection and correction of malapropisms. WordNet: An Electronic Lexical Database, 305–332 (1998)
13. Kutuzov, A., Kuzmenko, E.: Comparing Neural Lexical Models of a Classic National Corpus and a Web Corpus: The Case for Russian. In Computational Linguistics and Intelligent Text Processing. Springer, 47–58 (2015)
14. Lapesa, G., Evert, S.: A large scale evaluation of distributional semantic models: Parameters, interactions and model selection. Transactions of the Association for Computational Linguistics, 2, 531–545 (2014)
15. Levy, O., Goldberg, Y., Dagan, I.: Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3, 211–225 (2015)
16. Lopukhin, K.A., Lopukhina, A.A., Nosyrev, G.V.: The impact of different vector space models and supplementary techniques on Russian semantic similarity task. In Computational Linguistics and Intellectual Technologies: Papers from the Annual Conference. Dialogue, Vol. 2, 145–153 (2015)
17.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 3111–3119 (2013)
18. Miller, G.A., Charles, W.G.: Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1), 1–28 (1991)
19. Panchenko, A., Loukachevitch, N., Ustalov, D., Paperno, D., Meyer, C., Konstantinova, N.: Russe: The first workshop on Russian semantic similarity. In Computational Linguistics and Intellectual Technologies: Papers from the Annual Conference. Dialogue, Vol. 2, 89–105 (2015)
20. Postma, M., Vossen, P.: What implementation and translation teach us: the case of semantic similarity measures in wordnets. Proceedings of Global WordNet Conference GWC-2014, Tartu, Estonia, 133–141 (2014)
21. Rubenstein, H., Goodenough, J.B.: Contextual Correlates of Synonymy. Communications of the ACM, 8(10), 627–633 (1965)
22. Sahlgren, M.: The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. Ph.D. thesis, University of Stockholm (2006)