
Gathering Information about Word Similarity from Neighbor Sentences
Natalia Loukachevitch1 and Aleksei Alekseev1
Research Computing Center of
Lomonosov Moscow State University, Moscow, Russia,
louk [email protected], [email protected]
Abstract. In this paper we present the first results of detecting word semantic similarity on the Russian translations of the Miller-Charles and Rubenstein-Goodenough sets prepared for the first Russian word semantic similarity evaluation, Russe-2015. The experiments were carried out on three text collections: Russian Wikipedia, a news collection, and their united collection. We found that the best results in detecting lexical paradigmatic relations are achieved by combining word2vec with a new type of feature based on word co-occurrence in neighbor sentences.

Keywords: Russian word semantic similarity, evaluation, neighbor sentences, word2vec, Spearman's correlation
1 Introduction
The knowledge about semantic word similarity can be used in various tasks
of natural language processing and information retrieval including word sense
disambiguation, text clustering, textual entailment, search query expansion, etc.
To assess the ability of current methods to determine word semantic similarity automatically, various datasets have been created. The best-known gold standards for this task include the Rubenstein-Goodenough (RG) dataset [21], the Miller-Charles (MC) dataset [18], and WordSim-353 [7]. These datasets contain word pairs annotated by human subjects according to lexical semantic similarity. Results of automatic approaches are compared with the datasets using Pearson or Spearman's correlation [1].
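As a concrete illustration, Spearman's correlation between system scores and human judgments is the Pearson correlation of the two rank vectors. A minimal self-contained sketch (the toy score lists below are invented for illustration):

```python
def rankdata(values):
    """Assign 1-based ranks to values; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the rank vectors."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

Identical orderings give rho = 1, reversed orderings give rho = -1, regardless of the absolute score scales, which is why rank correlation suits the comparison of automatic scores with human judgments.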
Some researchers distinguish proper similarity between word senses, which is mainly based on paradigmatic relations (synonyms, hyponyms, or hyperonyms), from relatedness, which is established by other types of lexical relations. Agirre et al. [1] subdivided the WordSim-353 dataset into two subsets: the WordSim-353 similarity set and the WordSim-353 relatedness set. The former consists of word pairs classified as synonyms, antonyms, identical, or hyponym-hyperonym, plus unrelated pairs. The latter contains word pairs connected by other relations, plus unrelated pairs.
To support word similarity research in other languages, the English similarity tests have been translated. Gurevych translated the RG and MC datasets into German [9]; Hassan and Mihalcea translated them into Spanish, Arabic, and Romanian [11]; Postma and Vossen translated the datasets into Dutch [20].
In 2015 the first Russian evaluation of word semantic similarity was organized [19]^1. One of the datasets created for the evaluation was a set of word pairs with similarity scores obtained from human judgments (the hj-dataset). The Russe hj-dataset was prepared by translating the existing English datasets MC, RG, and WordSim-353.
One traditional family of approaches to detecting semantic word similarity is distributional approaches, which are based on words sharing similar neighbors in sentences. In this paper, we study the possible contribution of features based on the location of words in neighbor sentences. We suppose that co-occurrence of words in the same sentence can indicate that they are components of a collocation or stand in a relatedness relation [22], whereas frequent co-occurrence in neighbor sentences points specifically to paradigmatic similarity between words. We conduct our experiments on the Russe hj test set. Then we work with the Russian translations of the RG and MC datasets prepared during the Russe evaluation. We show that the newly introduced features allow us to achieve the best results on the Russian translations of the RG and MC datasets.
2 Related work
The existing approaches to calculating semantic similarity between words are based on various resources and their features. One well-known approach utilizes the lexical relations described in manually created thesauri such as WordNet or Roget's thesaurus [1, 5], or online resources such as Wikipedia [8].
Another type of approach to detecting semantic word similarity is distributional approaches, which account for shared neighbors of words [14]. In these approaches a matrix is constructed in which each row represents a word w and each column corresponds to a context of w. The value of each matrix cell represents the association between the word w_i and the context c_j. A popular measure of this association is pointwise mutual information (PMI), and especially positive PMI (PPMI), in which all negative values of PMI are replaced by 0 [2, 15]. Similarity between two words can then be calculated, for example, as the scalar product of their context vectors.
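A PPMI-weighted distributional model of the kind described above can be sketched as follows. This is a minimal illustration, not the exact setup of any cited system; the function names and the toy counts in the usage below are ours:

```python
import math
from collections import defaultdict

def ppmi_vectors(cooc, word_freq, ctx_freq, total):
    """Build PPMI-weighted sparse context vectors.
    cooc maps (word, context) -> co-occurrence count;
    word_freq / ctx_freq are marginal counts; total is the total count mass."""
    vecs = defaultdict(dict)
    for (w, c), n in cooc.items():
        # PMI(w, c) = log [ P(w, c) / (P(w) * P(c)) ]; PPMI keeps positives only
        pmi = math.log((n * total) / (word_freq[w] * ctx_freq[c]))
        if pmi > 0:
            vecs[w][c] = pmi
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(c, 0.0) for c, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Words that occur with the same contexts in similar proportions end up with nearly parallel PPMI vectors and thus a cosine (or scalar-product) similarity close to 1.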
Recently, neural-network-based approaches, in which words are embedded into a low-dimensional space, have appeared and come into use in lexical semantic tasks. In particular, the word2vec tool^2, a program for creating word embeddings, has become popular. In [2, 15] traditional distributional methods are compared with the new neural-network approaches on several lexical tasks. Baroni et al. [2] found that the new approaches consistently outperform traditional approaches in most lexical tasks. For example, on the RG test traditional approaches achieve a Spearman's rank correlation of 0.74, while the neural approach obtains 0.84.
^1 http://russe.nlpub.ru/
^2 https://code.google.com/archive/p/word2vec/
Levy et al. [15] tried to find the best parameters for each approach and showed that on the WordSim-353 subsets, word2vec similarity is better on the WordSim-353 similarity subset (0.773 vs. 0.755), while traditional distributional models based on PPMI weighting achieve better results on the WordSim-353 relatedness subset (0.688 vs. 0.623). They conclude that word2vec provides a robust baseline: "while it might not be the best method for every task, it does not significantly underperform in any scenario."
Some authors propose combining various features to improve the detection of word semantic similarity. Agirre et al. [1] use a supervised combination of WordNet-based similarity, distributional similarity, syntactic similarity, and context-window similarity, and achieve 0.96 Spearman's correlation on the RG set and 0.83 on the WordSim-353 similarity subset.
The best result achieved on the Russe-2015 hj test set [19], a Spearman's correlation of 0.7625, was obtained with a supervised approach combining word2vec similarity scores calculated on a united text collection (consisting of the ruwac web corpus, the lib.ru fiction collection, and Russian Wikipedia), a synonym database, a prefix dictionary, and orthographic similarity [16]. The second-best result (0.7187) was obtained by applying word2vec to two text corpora: the Russian National Corpus (RNC) and a news corpus. The word2vec scores from the news corpus were used if the target words were absent from the RNC [13]. The best result achieved with a traditional distributional approach was 0.7029 [19].
3 Datasets and text collections
In this study we use the similarity datasets created for the first Russian evaluation on word semantic similarity, Russe [19]. The Russe hj dataset was constructed by translating three English datasets: the Miller-Charles set, the Rubenstein-Goodenough set, and WordSim-353.
The RG set consists of 65 word pairs ranging from synonym pairs (e.g., car – automobile) to completely unrelated terms (e.g., noon – string). The MC set is a subset of the RG dataset, consisting of 30 word pairs. The WordSim-353 set consists of 353 word pairs. The Miller-Charles set is also a subset of the WordSim-353 dataset. The noun pairs in each set were annotated by human subjects using scales of semantic similarity.
These datasets were joined together and translated into Russian in a consistent way. The word similarity scores for the joined dataset were obtained by crowdsourcing from Russian native speakers. Each annotator was asked to assess the similarity of each pair using four possible similarity values.
All annotated word pairs were subdivided into a training set (66 word pairs) for adapting the participants' methods and a test set used for evaluation (335 word pairs), the Russe hj test set. Correlations with human judgments were calculated in terms of Spearman's rank correlation.
In this study we present our results on the Russe hj test set (335 word pairs). In addition, we singled out the Russian translations of the MC and RG sets from the whole Russe hj similarity dataset and show our results on each dataset.
We use two text collections: Russian Wikipedia (0.24 B tokens) and a Russian news collection (0.45 B tokens). In addition, we experimented on the union of these two collections.
4 Using neighbor sentences for the word similarity task
Analysing the variety of features used for word semantic similarity detection, we could not find any work utilizing co-occurrence of words in neighbor sentences. However, repetitions of semantically related words in natural language texts play a very important role in providing the cohesion of text fragments [10], that is, the device for "sticking together" different parts of a text. Cohesion can be based on the use of semantically related words, reference, ellipsis, and conjunctions. Among these types the most frequent is lexical cohesion, which is created by sense-related words such as repetitions, synonyms, hyponyms, etc.
When analyzing a specific text, the location of words in sentences can be employed for constructing lexical chains. In classical works [3, 12], the location of words in neighbor sentences is used for adding new elements to the current lexical chain. The distance in sentences between referential candidates is an important feature in tasks such as anaphora resolution and coreference chain construction [6].
Thus, in a text, subsequent sentences are often lexically attached to previous sentences. This allows us to suppose that the analysis of word distribution in neighbor sentences can give additional information about semantic similarity. If two words frequently co-occur near each other, this often indicates that they are components of the same collocation. Frequent word co-occurrence in the same sentence often means that the words are participants in the same situations [22]. But if words frequently occur in neighbor sentences, this suggests that they are semantically or thematically similar.
In this study, we calculate co-occurrence of words in two sentences that are direct neighbors of each other and analyze its importance for lexical similarity detection in the form of two basic features:
– frequency of co-occurrence of words in neighbor sentences (NS), where the words must be located in different sentences: one word in the first sentence and the other in the second sentence,
– frequency of co-occurrence of words in a large window equal to two sentences (TS).
We consider these features with two weighting variants: pointwise mutual information (pmiNS, pmiTS) and normalized pointwise mutual information (npmiNS, npmiTS). It should also be mentioned that both types of features do not require any tuning because they are based on sentence boundaries, that is, they are self-adaptive to a text collection.
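The NS and TS counts described above can be sketched as follows. This is a minimal illustration of the definitions (the function names, the toy sentences, and the simplified NPMI helper are ours, not the exact implementation used in the experiments):

```python
import math

def ns_count(sentences, w1, w2):
    """NS feature: count adjacent sentence pairs where one target word
    occurs in the first sentence and the other in the second sentence.
    Each sentence is a list of (lemmatized) tokens."""
    count = 0
    for s1, s2 in zip(sentences, sentences[1:]):
        if (w1 in s1 and w2 in s2) or (w2 in s1 and w1 in s2):
            count += 1
    return count

def ts_count(sentences, w1, w2):
    """TS feature: count two-sentence windows containing both words."""
    count = 0
    for s1, s2 in zip(sentences, sentences[1:]):
        window = set(s1) | set(s2)
        if w1 in window and w2 in window:
            count += 1
    return count

def npmi(p_xy, p_x, p_y):
    """Normalized PMI: pmi(x, y) / -log p(x, y), ranging over [-1, 1]."""
    if p_xy == 0:
        return -1.0
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)
```

Raw NS/TS counts can then be converted to probabilities over all two-sentence windows and weighted with PMI or NPMI, yielding the pmiNS/npmiNS and pmiTS/npmiTS variants.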
To compare these features with features extracted from a window within the same sentence, we calculated the inside-sentence window feature (IS), with its variants pmiIS and npmiIS. These features reflect co-occurrence of a word pair in a specific word window within a sentence. We experimented with various window sizes and present only the best results, usually based on a large window of 20 words (10 words to the left and 10 words to the right within a sentence). In fact, in most cases the IS window is equal to the whole sentence. In addition, we calculated npmi for word pair co-occurrences within documents (npmiDoc).
Features       News        Ru-Wiki     United
               Collection  Collection  Collection
NS             0.519       0.494       0.543
pmiNS          0.632       0.598       0.660
npmiNS         0.638       0.612       0.677
TS             0.604       0.580       0.607
pmiTS          0.645       0.627       0.657
npmiTS         0.648       0.636       0.670
IS             0.577       0.601       0.594
pmiIS          0.645       0.633       0.652
npmiIS         0.663       0.658       0.681
npmiDoc        0.611       0.567       0.641
best of w2v    0.651       0.618       0.680
NS+w2v         0.659       0.633       0.674
pmiNS+w2v      0.687       0.647       0.712
npmiNS+w2v     0.661       0.652       0.704
TS+w2v         0.679       0.662       0.689
pmiTS+w2v      0.694       0.663       0.710
npmiTS+w2v     0.679       0.662       0.697
pmiIS+w2v      0.693       0.670       0.710
npmiIS+w2v     0.689       0.674       0.705
npmiDoc+w2v    0.659       0.619       0.692

Table 1. The results obtained on the Russe hj test set. The best w2v window size for all collections is 5.
To compare with state-of-the-art approaches, we utilized the word2vec tool for processing our collections (w2v). For preprocessing, the text collections were lemmatized and all function words were removed; thus, word2vec similarity between words was calculated only on contexts containing nouns, verbs, adjectives, and adverbs. We used the CBOW model with 500 dimensions; window sizes were varied to produce the best results.
We also considered simple combinations of pairs of features, calculated as the summation of the ranks of word pairs in the orderings determined by those features. Table 1 presents the results for the Russe hj test set on three collections: the news collection, Ru-Wiki, and the united collection.
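The rank-summation combination can be sketched as follows. This is a minimal illustration with invented scores: each feature induces an ordering of the word pairs, and a pair's combined score is the sum of its ranks in the two orderings, with a lower rank sum meaning higher combined similarity:

```python
def rank_sum_combination(scores_a, scores_b):
    """Combine two similarity measures defined over the same word pairs.
    scores_a / scores_b map a word pair to a similarity score (higher =
    more similar). Each pair gets rank 1 for the most similar pair in each
    ordering; the returned value is the sum of the two ranks."""
    def ranks(scores):
        ordered = sorted(scores, key=scores.get, reverse=True)
        return {pair: i + 1 for i, pair in enumerate(ordered)}
    ra, rb = ranks(scores_a), ranks(scores_b)
    return {pair: ra[pair] + rb[pair] for pair in scores_a}
```

The resulting rank sums define a new ordering of the word pairs, which can then be correlated with the human judgments in the usual way.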
One can see that the best results obtained from a single feature for all three corpora are based on the inside-sentence co-occurrence feature npmiIS. The best feature-pair results for two collections also involve the inside-sentence features. In
Features      Dataset          News        Ru-Wiki     United
                               Collection  Collection  Collection
npmiNS        RG               0.759       0.751       0.813
              MC               0.803       0.833       0.827
npmiTS        RG               0.779       0.808       0.820
              MC               0.834       0.832       0.823
npmiIS        RG               0.744       0.803       0.811
              MC               0.805       0.824       0.817
best of w2v   RG (wind=2-5)    0.791       0.805       0.858
              MC (wind=1-3)    0.812       0.833       0.848
NS+w2v        RG               0.833       0.798       0.860
              MC               0.841       0.785       0.848
npmiNS+w2v    RG               0.835       0.850       0.868
              MC               0.837       0.861       0.868
TS+w2v        RG               0.841       0.842       0.870
              MC               0.872       0.816       0.881
npmiTS+w2v    RG               0.831       0.807       0.863
              MC               0.858       0.850       0.863
npmiIS+w2v    RG               0.826       0.851       0.857
              MC               0.852       0.843       0.867

Table 2. The results obtained on the Russian RG and MC sets. The best word2vec window sizes are indicated in the Dataset column.
our opinion, this is because the Russe hj test set contains many related (non-paradigmatic) word pairs. However, the best result (0.712) for this set is achieved by summing the word rankings obtained with two features, word2vec and pmiNS, i.e., pointwise mutual information calculated on co-occurrence of words in neighbor sentences. This result is very close to the second-best result achieved during the Russe evaluation (0.717), but that approach used a three times larger news collection and the balanced Russian National Corpus (RNC) [13].
Thus, on the Russe test set, the best result was obtained with additional accounting for co-occurrence of words in neighbor sentences, though it should be noted that the combination of word2vec with an inside-sentence feature produced a very similar result (0.710). The specificity of this set is that it contains paradigmatic relations (proper semantic similarity) together with relatedness relations from the translated WordSim-353 set.
Therefore we decided to study the proposed features on the RG and MC sets, which were constructed for measuring paradigmatic relations between words. Table 2 presents the results obtained for the Russian translations of the RG and MC sets. Only the best features and their pairs are shown. The presented results are the first results obtained for these Russian datasets.
We can see a quite different picture on these sets than on the Russe hj test set. In both cases the maximal values of Spearman's rank correlation for all three collections are obtained with the combination of word2vec and features based on co-occurrence of words in neighbor sentences. The best single features are also mainly neighbor-sentence features.
Our best result for the Russian MC set (0.881) is achieved with the pair of features (word2vec, TS, i.e., the co-occurrence of words in the window of two sentences) on the united text collection of Ru-Wiki and the news collection. The best result for the Russian RG set (0.870) is obtained with the same combination of features on the same collection. The achieved results are much better than the results for each single feature. In our opinion, this means that frequent co-occurrence of words in neighbor sentences carries additional useful information for detecting paradigmatic lexical similarity.
5 Conclusion
In this paper we presented the first results of detecting word semantic similarity on the Russian translations of the MC and RG sets prepared for the first Russian word semantic evaluation, Russe-2015. The experiments were performed on three text collections: Russian Wikipedia, a news collection, and the united collection of Ru-Wiki and the news collection.
We found that the best results in the detection of lexical paradigmatic relations are achieved using the combination of word2vec with the new type of feature proposed in this paper: the co-occurrence of words in neighbor sentences. We suppose that this feature combination is especially useful for improving the extraction of sense-related words from medium-sized collections.
We plan to continue the study of the proposed features on other datasets, including the Russian translation of the WordSim-353 dataset and the original English datasets. In addition, we intend to test possible contributions of these neighbor-sentence features in probabilistic topic modeling [4].
Acknowledgments. This work was partially supported by the Russian Foundation for Basic Research, grant N14-07-00383.
References
1. Agirre, E., Alfonseca, E., Hall, K., Kravalova, J., Pasca, M., Soroa, A.: A study on similarity and relatedness using distributional and WordNet-based approaches. Proceedings of the 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 19–27 (2009)
2. Baroni, M., Dinu, G., Kruszewski, G.: Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. Proceedings of ACL-2014, 238–247 (2014)
3. Barzilay, R., Elhadad, M.: Using lexical chains for text summarization. Advances in Automatic Text Summarization, 111–121 (1999)
4. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022 (2003)
5. Budanitsky, A., Hirst, G.: Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1), 13–47 (2006)
6. Fernandes, E.R., dos Santos, C.N., Milidiú, R.L.: Latent trees for coreference resolution. Computational Linguistics (2014)
7. Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., Ruppin, E.: Placing search in context: the concept revisited. Proceedings of the 10th International Conference on World Wide Web, 406–414 (2001)
8. Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using Wikipedia-based explicit semantic analysis. Proceedings of IJCAI, 6–12 (2007)
9. Gurevych, I.: Using the structure of a conceptual network in computing semantic relatedness. Proceedings of the 2nd International Joint Conference on Natural Language Processing, Jeju Island, South Korea, 767–778 (2005)
10. Halliday, M., Hasan, R.: Cohesion in English. Routledge (2014)
11. Hassan, S., Mihalcea, R.: Cross-lingual semantic relatedness using encyclopedic knowledge. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Vol. 3, Singapore, 1192–1201 (2009)
12. Hirst, G., St-Onge, D.: Lexical chains as representations of context for the detection and correction of malapropisms. WordNet: An Electronic Lexical Database, 305–332 (1998)
13. Kutuzov, A., Kuzmenko, E.: Comparing neural lexical models of a classic national corpus and a web corpus: the case for Russian. Computational Linguistics and Intelligent Text Processing. Springer, 47–58 (2015)
14. Lapesa, G., Evert, S.: A large scale evaluation of distributional semantic models: parameters, interactions and model selection. Transactions of the Association for Computational Linguistics, 2, 531–545 (2014)
15. Levy, O., Goldberg, Y., Dagan, I.: Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3, 211–225 (2015)
16. Lopukhin, K.A., Lopukhina, A.A., Nosyrev, G.V.: The impact of different vector space models and supplementary techniques on Russian semantic similarity task. Computational Linguistics and Intellectual Technologies: Papers from the Annual Conference Dialogue, Vol. 2, 145–153 (2015)
17. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 3111–3119 (2013)
18. Miller, G.A., Charles, W.G.: Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1), 1–28 (1991)
19. Panchenko, A., Loukachevitch, N., Ustalov, D., Paperno, D., Meyer, C., Konstantinova, N.: Russe: the first workshop on Russian semantic similarity. Computational Linguistics and Intellectual Technologies: Papers from the Annual Conference Dialogue, Vol. 2, 89–105 (2015)
20. Postma, M., Vossen, P.: What implementation and translation teach us: the case of semantic similarity measures in wordnets. Proceedings of the Global WordNet Conference GWC-2014, Tartu, Estonia, 133–141 (2014)
21. Rubenstein, H., Goodenough, J.B.: Contextual correlates of synonymy. Communications of the ACM, 8(10), 627–633 (1965)
22. Sahlgren, M.: The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. Ph.D. thesis, University of Stockholm (2006)