Evaluating multi-sense embeddings for semantic resolution
monolingually and in word translation
Gábor Borbély
Department of Algebra, BME∗
Egry József u. 1
1111 Budapest, Hungary
[email protected]
Márton Makrai
Institute for Linguistics, HAS†
Benczúr u. 33
1068 Budapest, Hungary
[email protected]
Dávid Nemeskey and András Kornai
Institute for Computer Science, HAS†
Kende u. 13-17
1111 Budapest, Hungary
[email protected], [email protected]
Abstract
Multi-sense word embeddings (MSEs)
model different meanings of word forms
with different vectors. We propose two
new methods for evaluating MSEs by their
degree of semantic resolution, measuring
the detail of the sense clustering. One
method is based on monolingual dictionaries, and the other exploits the principle that words may be ambiguous insofar as the postulated senses translate to different words in some other language.
1 Introduction
Multi-sense word embeddings (MSEs) (Huang
et al., 2012; Neelakantan et al., 2014b; Bartunov
et al., 2015) attempt to represent different word
senses with different vectors. MSEs have been
evaluated for lexical relatedness going back to
the seminal Reisinger and Mooney (2010). Their
motivation was psychological (category learning)
rather than lexicographic: in fact they “do not assume that clusters correspond to traditional word
senses. Rather, we only rely on clusters to capture meaningful variation in word usage.” In other
words, the evaluation itself was not psychological:
it is unclear whether their vectors correspond to
different concepts in the human mind. As they
note, usage splits words more finely (with synonyms and near-synonyms ending up in distant
clusters) than semantics.
∗ Budapest University of Technology and Economics
† Hungarian Academy of Sciences
Here we propose testing MSEs for their fit
with the semantics, by two different methods, one
based on monolingual dictionaries, the other on
bilingual considerations. Items listed in traditional
monolingual dictionaries for the same word form
are supposed to correspond to different concepts.
We propose the correlation between the number of senses of each word form in the embedding and in the manually created inventory as an approximate measure of how well embedding vectors correspond to concepts in speakers' minds; this is discussed in Section 2.
The other evaluation method is based on the linear method of Mikolov et al. (2013b), who formulate word translation as a linear mapping from
the source language embedding to the target one,
trained on a seed of a few thousand word pairs.
Our proposal is to perform such translations from
MSEs, with the idea that what are different senses
in the source language will very often translate to
different words in the target language. This way,
we can use single-sense embeddings on the target
side and thereby reduce the noise of MSEs – this
is discussed in Section 3.
Altogether, we present a preliminary evaluation of five MSE implementations by these two
methods on two languages, English and Hungarian (some of the 20 combinations are deferred to
the full paper): CMultiVec (Huang et al., 2012), as reimplemented by Jeremy Salwen (https://github.com/antonyms/CMultiVec); the learning process of Neelakantan et al. (2014a) with adaptive sense numbers (we report results using their release MSEs vectors.MSSG.300D.30K and their tool itself, calling both Neela); the
parametrized Bayesian learner of Bartunov et al.
(2015) where the number of senses is controlled
by a parameter α for semantic resolution, referred
to as AdaGram; topical embeddings (Liu et al.,
2015), a combination of neural frameworks with
LDA topic models; and jiweil (Li and Jurafsky,
2015), which was published to test MSEs in downstream tasks. Some very preliminary conclusions
are offered in Section 4, more in regards to the feasibility of the two evaluation methods we propose
than about the merits of these systems.
We emphasize at the outset that our evaluation proposal probes an aspect of MSEs, semantic resolution, which is not well measured by the well-known word sense disambiguation (WSD) task that aims at classifying occurrences of a word form into the different elements of a sense inventory pre-defined by experts. Since our goal is to probe the granularity of the inventory itself, we are more interested in word sense induction (WSI), the task of discovering sense clusters among occurrences (Schütze, 1998).
2 Conceptual lexical items
The differentiation of word senses is fraught with difficulties, especially when we wish to distinguish homophony, using the same written or spoken form to express different concepts, such as Russian mir ‘world’ and mir ‘peace’, from polysemy, where speakers feel that the two senses are very strongly connected, such as Hungarian nap ‘day’ and nap ‘sun’. To quote Zgusta (1971):
“Of course it is a pity that we have to rely on the
subjective interpretations of the speakers, but we
have hardly anything else on hand”. Etymology
makes clear that different languages make different lump/split decisions in the conceptual space,
so much so that translational relatedness can, to a
remarkable extent, be used to recover the universal
clustering (Youna et al., 2016).
Another confounding factor is part of speech
(POS). Very often, the entire distinction is lodged
in the POS, as in divorce (Noun) and divorce
(Verb), while at other times this is less clear,
compare the verbal to bank ‘rely on a financial
institution’ and to bank ‘tilt’. Clearly the former is strongly related to the nominal bank ‘financial institution’ while the semantic relation
‘sloping sideways’ that connects the tilting of the airplane to the side of the river is somewhat less direct, and not always perceived by the speakers. This problem affects our sources as well: the Collins-COBUILD dictionary (CED, Sinclair (1987)) starts with the semantic distinctions and subordinates POS distinctions to these, while the Longman dictionary (LDOCE, Boguraev and Briscoe (1989)) starts with a POS-level split and puts the semantic split below. Of the Hungarian lexicographic sources, the Comprehensive Dictionary of Hungarian (NSZ, Ittzés (2011)) is closer to CED, while the Explanatory Dictionary of Hungarian (EKSZ, Pusztai (2003)) is closer to LDOCE in this regard. The corpora we rely on are UMBC Webbase (Han et al., 2013) for English and Webkorpusz (Halácsy et al., 2004) for Hungarian. Table 1 summarizes the major word sense statistics (mean and variance) both for our lexicographic sources and for the automatically generated MSEs.

resource            size     mean   std
CED                  82,024  1.030  0.206
LDOCE                30,265  1.137  0.394
EKSZ                121,578  1.012  0.119
NSZ (b)               5,594  1.029  0.191
huang               100,232  1.553  2.161
CMultiVec/UMBC      100,000  4.844  0.384
Neela                99,156  1.605  0.919
Neela/WK1            99,156  1.005  0.072
Neela/WK2            99,156  1.016  0.215
jiweil/WK senses    285,856  2.483  1.181
jiweil/WK cntxt     285,856  2.600  1.239
Adagram a05         238,462  1.062  0.276
Adagram             238,462  1.626  0.910

Table 1: The size (in words), mean, and standard deviation of the number of word senses in lexicographic and automatically generated resources
The figure of merit we propose is the correlation between the number of senses obtained by the automatic method and by the manual (lexicographic)
method. We experiment both with Spearman and
Pearson values, Jensen-Shannon and KL divergence, cosine similarity and Cohen’s κ for rater
agreement. The entropy-based measures are not
reported, however, as they failed to meaningfully
distinguish between the various resource pairs.
The cosine similarities and κ values should also be
taken with a grain of salt: the former does not take
the exact number of senses into account (hence the
high perceived similarity between CMultiVec and
CED/LDOCE), while the latter penalises all disagreements the same, regardless of how far the
guesses are. Nevertheless, all four reported metrics agree on the best pairing.
First, we note that the dictionaries themselves
are quite well correlated with each other: CED
and LDOCE have Spearman (Pearson) correlation
.266 (.277); EKSZ and NSZ have .648 (.677),
considerably larger both because we only used
a subsample of NSZ (the letter b) so there are
only 5,363 words to compare, and because NSZ
and EKSZ come from the same Hungarian lexicographic tradition, while CED and LDOCE never
shared personnel or editorial outlook. Table 2
shows correlations of sense numbers attributed to
each word by different resources, comparing automated to lexicographic (above the line) and different forms of automated (below). As is clear from
the numbers, only one system, Neela, shows high
correlation with a lexical resource (LDOCE), and
only two systems, AdaGram and Neela correlate
well with each other (different parametrizations of
the same system are of course often well correlated to one another).
We tried to filter bogus word senses from the
MSEs by applying a simple heuristic: sense vectors within a cosine distance of 0.05 to zero or
to another sense of the same word were thrown
away. As expected, the adaptive methods, Neela
and AdaGram, were unaffected, but jiweil and
CMultiVec benefited slightly from the change.
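The heuristic can be sketched as follows. Reading "within a cosine distance of 0.05 to zero" as a near-zero-norm check is our interpretation, flagged in the comments; the toy vectors are invented.

```python
import numpy as np

def prune_senses(vectors, eps=0.05):
    """Return indices of the sense vectors to keep for one word.

    A sense is dropped if it is nearly the zero vector (our reading of
    "distance to zero"), or within cosine distance eps of an already
    kept sense of the same word.
    """
    kept = []
    for i, v in enumerate(vectors):
        norm = np.linalg.norm(v)
        if norm < eps:                       # degenerate, near-zero sense
            continue
        u = v / norm
        # cosine distance = 1 - cosine similarity
        if any(1 - u @ (vectors[j] / np.linalg.norm(vectors[j])) < eps
               for j in kept):
            continue                         # near-duplicate of a kept sense
        kept.append(i)
    return kept

# four candidate senses: a duplicate pair, a distinct one, a near-zero one
senses = np.array([[1.0, 0.0], [0.999, 0.02], [0.0, 1.0], [1e-6, 0.0]])
```

On this toy input only the first and third vectors survive.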
In the full paper we will also discuss a method for estimating the ambiguity of a word directly from dictionaries automatically created from parallel corpora and available for many languages. Since such dictionaries are very noisy, to obtain a robust ambiguity estimate we compute the perplexity of the unigram frequencies of the translations in the target language, and average these numbers over all target languages.
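The perplexity estimate amounts to exponentiating the entropy of the normalized translation frequencies. A sketch with invented counts:

```python
import math

def perplexity(counts):
    """Perplexity of the unigram distribution given by raw counts."""
    total = sum(counts)
    probs = [c / total for c in counts]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy

# hypothetical translation frequencies of Hungarian 'jelentés'
# in two target languages
translations = {
    'en': {'meaning': 40, 'report': 40},
    'de': {'Bedeutung': 30, 'Bericht': 60, 'Meldung': 10},
}
score = sum(perplexity(list(f.values()))
            for f in translations.values()) / len(translations)
```

A word with a single dominant translation gets a perplexity near 1, while a word whose translations are evenly split over n target words gets exactly n, so the average over languages is a soft sense count.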
3 Cross-linguistic treatment of concepts
Since monolingual dictionaries are an expensive
resource, we also propose an automatic evaluation
of MSEs based on the discovery of Mikolov et al.
(2013b) that embeddings of different languages
are so similar that a linear transformation can map
vectors of the source language words to the vectors
of their translations.
The method uses a seed dictionary of a few
thousand words to learn translation as a linear
mapping W : R^{d₁} → R^{d₂} from the source (monolingual) embedding to the target: the translation z_i ∈ R^{d₂} of a source word x_i ∈ R^{d₁} is approximately its image W x_i under the mapping. The translation model is trained with linear regression on the seed dictionary,

    min_W Σ_i ‖W x_i − z_i‖²,

and can be used to collect translations for the whole vocabulary by choosing z_i to be the nearest neighbor of W x_i.
We follow Mikolov et al. (2013b) in using different metrics: Euclidean distance in training and cosine similarity when collecting translations. This choice is theoretically unmotivated, but it seems to work better than a more consistent use of metrics; see Xing et al. (2015) for opposing results.
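The training and lookup steps can be sketched with synthetic data as follows; this illustrates the Mikolov et al. (2013b) recipe with its mixed-metric choice (Euclidean fit, cosine lookup), not our actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
W_true = rng.normal(size=(d, d))
X = rng.normal(size=(40, d))   # source-side seed vectors
Z = X @ W_true                 # target-side seed vectors (noise-free toy data)

# training: min_W sum_i ||W x_i - z_i||^2; with row vectors this is
# the least-squares problem X W ≈ Z
W, *_ = np.linalg.lstsq(X, Z, rcond=None)

def translate(x, target_vectors):
    """Index of the target vector cosine-closest to the mapped source."""
    y = x @ W
    sims = (target_vectors @ y) / (
        np.linalg.norm(target_vectors, axis=1) * np.linalg.norm(y))
    return int(np.argmax(sims))
```

On the noise-free toy data every seed word translates back to its own target vector; with real embeddings the regression only approximates the mapping.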
In a multi-sense embedding scenario, we take
a single-sense (traditional) embedding as source
model, and a multi-sense embedding as target
model. The evaluation of a specific target MSE
model is done in three ways, referred to as global, first sense, and all senses respectively.
The tools that generate MSEs all provide fallbacks to single-sense embeddings in the form of so-called global vectors. The global method can be considered a baseline: a traditional, single-sense translation. In the first sense method, we
trained the translation matrix to map a source vector into the first sense vector of the target word,
but during the evaluation, all senses of the target
word were accepted as a valid translation. When
considering all senses, the translation matrix was trained to map into all of the target senses, i.e.,

    Σ_i Σ_j ‖W x_i − z_{ij}‖²

was minimized, where z_{ij} denotes the different senses of z_i.
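The three methods thus differ only in which target vectors are paired with the source seed vectors. A sketch of assembling the training pairs, with hypothetical lookup tables (all names here are illustrative):

```python
def make_pairs(seed, source_vec, target_senses, target_global, mode):
    """Assemble (x, z) regression pairs for one of the three methods.

    seed: list of (source_word, target_word); source_vec and
    target_global map a word to one vector; target_senses maps a word
    to its list of sense vectors.
    """
    pairs = []
    for s, t in seed:
        if mode == 'global':
            pairs.append((source_vec[s], target_global[t]))
        elif mode == 'first sense':
            pairs.append((source_vec[s], target_senses[t][0]))
        elif mode == 'all senses':
            for z in target_senses[t]:   # one (x_i, z_ij) pair per sense j
                pairs.append((source_vec[s], z))
    return pairs

# toy seed dictionary and embeddings
seed = [('kutya', 'dog')]
src = {'kutya': (1.0, 0.0)}
senses = {'dog': [(0.0, 1.0), (1.0, 1.0)]}
glob = {'dog': (0.5, 0.5)}
```

In the all senses setting a target word with j senses contributes j pairs, which is exactly the double sum minimized above.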
Figure 1: Linear translation of word senses. The Hungarian word jelentés is ambiguous between ‘meaning’ and ‘report’. The two senses are identified by the “neighboring” words értelmezés ‘interpretation’ and tanulmány ‘memorandum’.
Resources compared             words 1   words 2   shared words   Spearman   Pearson     cos   Cohen-κ
Neela vs CED                     99156     82024          23508      0.089     0.090   0.869     0.023
Neela vs LDOCE                   99156     30265          21715      0.226     0.222   0.882     0.073
huang vs CED                    100232     82024          23706      0.078     0.08    0.593     0.0308
huang vs LDOCE                  100232     30265          21763      0.28      0.289   0.65      0.124
CMultiVec/UMBC vs CED           100000     82024          25231      0.011     0.016   0.954     0
CMultiVec/UMBC vs LDOCE         100000     30265          23702      0.037     0.035   0.936     0
Neela 2 vs EKSZ                 771870     67546          45401      0.067     0.067   0.878     0.03
jiweil sense vs EKSZ            285856     67546          32007     -0.023    -0.022   0.891    -0.002
Adagram vs EKSZ                 238462     67546          26739      0.086     0.094   0.859     0.005
Adagram.a05 vs EKSZ             238462     67546          26739      0.0879    0.103   0.920     0.039
-----------------------------------------------------------------------------------------------------
Neela vs huang                   99156    100232          99156      0.349     0.349   0.647     0.111
Neela vs CMultiVec/UMBC          99156    100000          69007      0.066     0.068   0.882    -0.001
Neela 2 vs jiweil sense         771870    285856         283083     -0.014    -0.074   0.845    -0.007
Neela 2 vs jiweil context       771870    285856         283083      0.123     0.129   0.872     0.004
Adagram vs Neela 2              238462    771870         199370      0.389     0.478   0.897     0.025
Adagram vs jiweil sense         238462    285856         201291      0.034     0.020   0.793     0.0038
Adagram vs jiweil context       238462    285856         201291      0.139     0.223   0.836     0.027
Adagram.a05 vs Neela 2          238462    771870         199370      0.421     0.498   0.946     0.188
Adagram.a05 vs jiweil sense     238462    285856         201291     -0.010    -0.055   0.869    -0.008
Adagram.a05 vs jiweil context   238462    285856         201291      0.095     0.132   0.890     0.0005
Adagram vs Adagram.a05          238462    238462         238461      0.453     0.505   0.907     0.071
4 Conclusions
To summarize, we have proposed evaluating word
embeddings in terms of their semantic resolution
(granularity of word senses), to serve as a proxy
for WSD, WSI, and MT tasks, over and above the
gains that MSEs have over single-sense embeddings in similarity and analogy tasks.
The central linguistic/semantic/psychological
The quality of the translation was measured by training on the 5k most frequent word pairs and evaluating on another thousand seed pairs. The seed dictionary was extracted by Tiedemann (2012) from the OpenSubtitles corpus.²
In the evaluations shown in the first two sections of Table 3 the English embedding was pre-trained by Mikolov et al. (2013a) on the Google News corpus.³ We trained the target Hungarian embeddings on the Hungarian Webcorpus with different tools. We measure precision in the standard manner: prec@k is the ratio of test items for which at least one of the gold translations of the source word is among the k nearest neighbors of the computed vector in the target space. Table 3 shows an evaluation based on translating from Hungarian to English: here we trained the Hungarian embedding on Webcorpus with word2vec and took a pre-trained English MSE (Neelakantan et al., 2014b).
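prec@k as defined above can be sketched as follows, with a toy target space and hypothetical names:

```python
import numpy as np

def prec_at_k(mapped, gold_sets, target_matrix, target_words, k):
    """mapped: (n_test, d) array of W-mapped source vectors;
    gold_sets: one set of acceptable target words per test item."""
    norms = np.linalg.norm(target_matrix, axis=1)
    hits = 0
    for y, gold in zip(mapped, gold_sets):
        # cosine similarity of the mapped vector to every target vector
        sims = target_matrix @ y / (norms * np.linalg.norm(y))
        top_k = [target_words[i] for i in np.argsort(-sims)[:k]]
        hits += bool(gold & set(top_k))   # any gold translation in top k?
    return hits / len(gold_sets)

# toy target embedding: three words at the unit axes
words = ['a', 'b', 'c']
targets = np.eye(3)
mapped = np.array([[0.9, 0.1, 0.0], [0.0, 0.2, 1.0]])
gold = [{'a'}, {'b'}]
```

On this toy data the first item is a hit already at k = 1, while the second item's gold word only enters the neighbor list at k = 2.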
Table 2: Word sense distribution similarity between various resources
system    method        prec@1   prec@5   prec@10
neela     global         .217     .387     .440
neela     first sense    .135     .237     .286
neela     all senses     .116     .208     .247
adagram   global         .140     .273     .320
adagram   first sense    .141     .300     .350
adagram   all senses     .156     .315     .380
jiweil    global         .217     .377     .449
jiweil    first sense    .165     .336     .409
jiweil    all senses     .135     .283     .368

Table 3: Translation precision
property we wish to capture is that of a concept,
the underlying word sense unit. To the extent standard lexicographic practice offers a reasonably robust notion (this is of course debatable, but we
consider a correlation of 0.27 over a large vocabulary a strong indication of consistency), this is
something that MSEs should aim at capturing. We
leave the matter of aligning word senses in different dictionaries for future work, but we expect
that by (manual or automated) alignment the interdictionary (inter-annotator) agreement can be improved considerably, to provide a more robust gold
standard.
At this point everything we do is done in software,⁴ so other researchers can accurately reproduce these kinds of evaluations. Whether a
² http://www.opensubtitles.org/
³ https://code.google.com/archive/p/word2vec/
⁴ Some glue code for this project can be found at https://github.com/hlt-bme-hu/multiwsi
‘gold’ sense-disambiguated dictionary should be
produced beyond the publicly available CED is not
entirely clear, and we hope the referees and the
other workshop participants will weigh in on this
matter.
References
Sergey Bartunov, Dmitry Kondrashkin, Anton Osokin, and Dmitry Vetrov. 2015. Breaking sticks
and ambiguities with adaptive skip-gram. arXiv preprint.
Branimir K. Boguraev and Edward J. Briscoe.
1989. Computational Lexicography for Natural
Language Processing. Longman.
Péter Halácsy, András Kornai, László Németh,
András Rung, István Szakadát, and Viktor Trón.
2004. Creating open language resources for
Hungarian. In Proc. LREC2004. pages 203–
210.
Lushan Han, Abhay Kashyap, Tim Finin,
James Mayfield, and Jonathan Weese. 2013.
UMBC_EBIQUITY-CORE: Semantic textual
similarity systems. In Proceedings of the 2nd
Joint Conference on Lexical and Computational
Semantics. pages 44–52.
Eric H. Huang, Richard Socher, Christopher D.
Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and
multiple word prototypes. In Proceedings of
the 50th Annual Meeting of the Association for
Computational Linguistics: Long Papers - Volume 1. Association for Computational Linguistics, Stroudsburg, PA, USA, ACL ’12, pages
873–882.
Nóra Ittzés, editor. 2011. A magyar nyelv nagyszótára III-IV. Akadémiai Kiadó.
Jiwei Li and Dan Jurafsky. 2015. Do multi-sense
embeddings improve natural language understanding? In EMNLP.
Fei Liu, Jeffrey Flanigan, Sam Thomson, Norman
Sadeh, and Noah A. Smith. 2015. Toward abstractive summarization using semantic representations. In Proceedings of the 2015 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies. Association for
Computational Linguistics, Denver, Colorado,
pages 1077–1086.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word
representations in vector space. In Y. Bengio
and Y. LeCun, editors, Proceedings of the ICLR
2013.
Tomas Mikolov, Quoc V Le, and Ilya Sutskever.
2013b. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.
Arvind Neelakantan, Jeevan Shankar, Alexandre
Passos, and Andrew McCallum. 2014a. Efficient non-parametric estimation of multiple embeddings per word in vector space. In EMNLP.
Arvind Neelakantan, Jeevan Shankar, Alexandre
Passos, and Andrew McCallum. 2014b. Efficient non-parametric estimation of multiple
embeddings per word in vector space. arXiv
preprint arXiv:1504.06654.
Ferenc Pusztai, editor. 2003. Magyar értelmező
kéziszótár. Akadémiai Kiadó.
Joseph Reisinger and Raymond J Mooney. 2010.
Multi-prototype vector-space models of word
meaning. In The 2010 Annual Conference of
the North American Chapter of the Association
for Computational Linguistics. Association for
Computational Linguistics, pages 109–117.
Hinrich Schütze. 1998. Automatic word sense discrimination. Computational Linguistics.
John M. Sinclair. 1987. Looking up: an account
of the COBUILD project in lexical computing.
Collins ELT.
Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Ugur Dogan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios
Piperidis, editors, Proceedings of the Eight International Conference on Language Resources
and Evaluation (LREC’12). European Language Resources Association (ELRA), Istanbul,
Turkey.
Chao Xing, Chao Liu, RIIT CSLT, Dong Wang,
China TNList, and Yiye Lin. 2015. Normalized
word embedding and orthogonal transform for
bilingual word translation. In NAACL.
Hyejin Youna, Logan Sutton, Eric Smith, Cristopher Moore, Jon F. Wilkins, Ian Maddieson,
William Croft, and Tanmoy Bhattacharya.
2016. On the universal structure of human lexical semantics. PNAS 113(7):1766–1771.
Ladislav Zgusta. 1971. Manual of lexicography.
Academia, Prague.