Not All Contexts Are Equal: Automatic Identification of Antonyms, Hypernyms, Co-Hyponyms, and Synonyms in DSMs
Enrico Santus, Qin Lu, Alessandro Lenci and Chu-Ren Huang

Modeling Human Language Ability
• In the last decades, NLP has achieved impressive progress in modeling human language ability, developing a large number of applications:
– Information Retrieval (IR)
– Information Extraction (IE)
– Question Answering (QA)
– Machine Translation (MT)
– Others…

The Need for Resources
• These applications were improved not only through better algorithms, but also through the use of better lexical resources and ontologies (Lenci, 2008a):
– WordNet
– SUMO
– DOLCE
– ConceptNet
– Others…

Automatic Creation of Resources
• As the relevance of these resources has grown, systems for their automatic creation have assumed a key role in NLP.
• Handmade resources are in fact:
– Arbitrary
– Expensive to create
– Time consuming
– Difficult to keep updated

Semantic Relations as Building Blocks
• Entities and relations have been identified as the main building blocks of these resources (Herger, 2014).
• NLP has focused on methods for the automatic extraction and representation of entities and relations, in order to:
– Increase the effectiveness of such resources;
– Reduce the costs of development and updates.
• Yet, we are still far from achieving satisfying results.

Semantic Relation Identification and Discrimination in DSMs
• The distributional approach was chosen because it is:
– Completely unsupervised (Turney and Pantel, 2010);
– Portable to any language for which large corpora can be collected (ibid.);
– Applicable to a large range of tasks (ibid.);
– Cognitively plausible (Lenci, 2008b);
– Strong in identifying similarity (ibid.).

From Syntagmatic to Paradigmatic Relations
• The main semantic relations (i.e. synonymy, antonymy, hypernymy, meronymy) are also called paradigmatic semantic relations.
• Paradigmatic relations are concerned with the possibility of substitution in the same syntagmatic contexts.
• They should be considered in opposition to the syntagmatic relations, which – instead – are concerned with the position in the sentence (syntagm). (de Saussure, 1916)

Distributional Semantics
• Distributional Semantics can be used to derive paradigmatic relations from syntagmatic ones.
• It relies on the Distributional Hypothesis (Harris, 1954), according to which:
1. At least some aspects of the meaning of a linguistic expression depend on its distribution in contexts;
2. The degree of similarity between two linguistic expressions is a function of the similarity of the contexts in which they occur.

Vector Space Models
• Starting from the Distributional Hypothesis, computational models have been developed that represent words as vectors, whose dimensions contain the Strength of Association (SoA) with the contexts.
– SoA is generally the frequency of co-occurrence or the mutual information (PMI, PPMI, LMI, etc.)
– Contexts may be single words within a window or within a syntactic structure, pairs of words, etc.
• Words are therefore spatially represented, and their meaning is given by their proximity to other vectors in such a vector space, often also referred to as a semantic space (Turney and Pantel, 2010).

Distributional Semantic Models: Similarity as Proximity
• DSMs are known for their ability to identify semantically similar lexemes.
• The vector cosine is generally used:

cos(w1, w2) = (w1 · w2) / (||w1|| ||w2||)

• The vector cosine returns a value between 0 and 1, where 0 means paradigmatically totally unrelated and 1 means distributionally identical. (Santus et al., 2014a-c)

Shortcomings of DSMs: Semantic Relations
• Unfortunately, the definition of distributional similarity is so loose that under its umbrella fall not only near-synonyms (e.g. nice-good), but also:
– hypernyms (e.g. car-vehicle)
– co-hyponyms (e.g. car-motorbike)
– antonyms (e.g. good-bad)
– meronyms (e.g. dog-tail)
• Words holding these relations have in fact similar distributions.
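As a minimal sketch of the vector cosine used above (plain Python over sparse context vectors; the toy vectors are hypothetical, not drawn from the actual DSM):

```python
from math import sqrt

def cosine(v1, v2):
    """Cosine similarity between two sparse vectors (dicts context -> SoA weight)."""
    # Dot product over the shared contexts only.
    dot = sum(w * v2[c] for c, w in v1.items() if c in v2)
    norm1 = sqrt(sum(w * w for w in v1.values()))
    norm2 = sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

# Hypothetical toy vectors: identical distributions score close to 1,
# disjoint distributions score 0.
car = {"drive": 2.0, "wheel": 1.0}
auto = {"drive": 2.0, "wheel": 1.0}
sky = {"cloud": 3.0}
print(cosine(car, auto))  # distributionally identical: ~1.0
print(cosine(car, sky))   # no shared contexts: 0.0
```

With non-negative weights such as co-occurrence frequencies or LMI, the score stays in [0, 1], matching the interpretation on the slide.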
(Santus et al., 2014a-c; 2015a-c)

How to Identify and Discriminate Semantic Relations
• Identification of semantic relations (classification) consists in classifying word-pairs according to the semantic relation they hold. The F1 score is generally used to evaluate the accuracy of the algorithm.
• Semantic relation discrimination (relation retrieval) consists in returning a list of word-pairs, sorted according to a score that aims to predict a specific relation. Average Precision is generally used to evaluate the accuracy of the algorithm.

Distributional Semantic Model
• All the experiments described in the next slides are performed on a standard window-based DSM recording co-occurrences with the nearest X content words to the left and right of each target word.
– In most of our experiments, we have used X = 2 or 5, because small windows are most appropriate for paradigmatic relations.
• Co-occurrences were extracted from a combination of the freely available ukWaC and WaCkypedia corpora, and weighted with Local Mutual Information (LMI; Evert, 2005). (Santus et al., 2014a-c; 2015a-c)

NEAR-SYNONYMY

Near-Synonymy: TOEFL & ESL
• Near-synonym: a word having the same or nearly the same meaning as another in the language, as happy, joyful, elated.
• Similarity is the main organizer of the semantic lexicon (Landauer and Dumais, 1997).
• Two common tests for evaluating methods of near-synonymy identification are the TOEFL and ESL tests.
• These tests consist of several questions: a word is provided and the algorithm should find the most similar one (the near-synonym) among four possible choices.
– TOEFL (Test of English as a Foreign Language): 80 questions, with four choices each;
– ESL (English as a Second Language): 50 questions, with four choices each. (Santus et al., 2016b; 2016d)

APSyn: Hypothesis
• We have developed APSyn (Average Precision for Synonyms), a variation of the Average Precision measure that aims to automatically identify near-synonyms in corpora.
• The measure is based on the following hypothesis:
– Similar words not only occur in similar contexts, but also tend to share their most relevant contexts.
• E.g. good-nice will share contexts like pretty, very, rather, quite, weather, etc. (SketchEngine: Diffs)

APSyn: Method
• To identify the most related contexts, we rank them according to Local Mutual Information (LMI; Evert, 2005).
– LMI is similar to Pointwise Mutual Information (PMI), but it is not biased towards low-frequency elements.
• In our experiments, after having ranked the contexts, we pick the top N, where 100 ≤ N ≤ 1000.
• At this point, the intersection of the top N contexts of the two target words in the word-pair is evaluated and weighted according to the average rank of the shared contexts.

APSyn: Definition
• For every feature f included in the intersection between the top N features of w1 (i.e. N(F1)) and w2 (i.e. N(F2)), APSyn adds 1 divided by the average rank of the feature among the top LMI-ranked features of w1 (i.e. rank1(f)) and w2 (i.e. rank2(f)).
• Expected scores:
– High scores for synonyms
– Low scores or zero for less similar words

Experiments
1. Questions were transformed into word-pairs: PROBL_WORD – POSSIB_CHOICE
2. APSyn scores were assigned to all the word-pairs.
3. In every question, the word-pairs were ranked in decreasing order, according to APSyn.
4. If the right answer was ranked first, we added to the final score 0.25 times the number of WRONG ANSWERS present in our DSM.
• BASELINES
– The Cosine and Co-Occurrence baselines are provided for comparison.
– The random baseline is 25%.
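The APSyn definition above can be sketched as follows (a minimal illustration in plain Python; the LMI-ranked context lists are hypothetical toy data, not the actual DSM):

```python
def apsyn(ranked_contexts_1, ranked_contexts_2, n=100):
    """APSyn: sum of 1 / average rank over the shared top-N contexts.

    Each argument is a list of contexts sorted by decreasing LMI,
    so position 0 holds the most associated context (rank 1).
    """
    # rank1(f), rank2(f): 1-based positions in the two ranked lists
    rank1 = {f: i + 1 for i, f in enumerate(ranked_contexts_1[:n])}
    rank2 = {f: i + 1 for i, f in enumerate(ranked_contexts_2[:n])}
    shared = rank1.keys() & rank2.keys()
    return sum(1.0 / ((rank1[f] + rank2[f]) / 2.0) for f in shared)

# Hypothetical toy rankings: near-synonyms share highly ranked contexts.
good = ["pretty", "very", "rather", "quite", "weather"]
nice = ["very", "pretty", "quite", "sunny", "rather"]
unrelated = ["awful", "terribly"]
print(apsyn(good, nice, n=5))       # high: many shared top contexts
print(apsyn(good, unrelated, n=5))  # 0.0: no shared contexts
```

Shared contexts near the top of both rankings contribute close to 1 each, while contexts shared only low in the rankings contribute little, which is what makes the measure sensitive to the *most relevant* contexts.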
– The average score of non-English US college applicants on the TOEFL is 64.5%.

Discussion
• APSyn, without any optimization, is still not as good as the state of the art (100% on TOEFL and 82% on ESL).
• However, it is:
– completely unsupervised (therefore applicable to other languages);
– linguistically grounded (therefore it captures some linguistic properties).
• And it performs:
– better than the random baseline, the vector cosine and the co-occurrence baseline, on our DSM;
– very similarly to foreign students who take the TOEFL test.
• The value of N:
– the smaller N (close to 100), the better the performance of APSyn.
• This is probably due to the fact that when N is too big, not only the most relevant contexts are considered. In order to optimize the performance, N can be learnt from a training set. (Santus et al., 2016b; 2016d)

ANTONYMY

Antonymy: Importance & Definition
• Antonymy:
– is one of the main relations shaping the organization of semantic memory (together with near-synonymy and hypernymy);
– although it is essential for many NLP tasks (e.g. MT, SA, etc.), current approaches to antonymy identification are still weak in discriminating antonyms from synonyms.
• Antonymy is in fact:
– similar to synonymy in many respects (e.g. distributional behavior);
– hard to define:
• there are many subtypes of antonymy;
• even native speakers of a language do not always agree on classifying word-pairs as antonyms. (Mohammad et al., 2008; 2013)

Antonymy: Definition
• Over the years, scholars from different disciplines have tried to:
– define antonymy;
– classify the different subtypes of antonymy.
• Kempson (1977) defined antonyms as word-pairs with a "binary incompatible relation", such that the presence of one meaning entails the absence of the other.
• giant – dwarf vs.
giant – person
• Cruse (1986) identified an important property of antonymy and called it the paradox of simultaneous similarity and difference between antonyms:
– Antonyms are similar in every dimension of meaning except a specific one.
• giant = dwarf, except for the size (big vs. small)

Antonymy: Co-Occurrence Hypothesis
• Most of the unsupervised work on antonymy identification is based on the co-occurrence hypothesis:
– antonyms co-occur in the same sentence more often than expected by chance (e.g. in coordinate contexts of the form A and/or B).
• Do you prefer meat or vegetables?
• Shortcoming: other semantic relations are also characterized by this property (e.g. co-hyponyms, near-synonyms).
• Do you prefer a dog or a cat?
• Is she only pretty or wonderful? (Santus et al., 2014b-c)

APAnt: Hypothesis
• If we consider the paradox of simultaneous similarity and difference between antonyms, we have the following distributional correlate:

MEANING → DISTRIBUTIONAL BEHAVIOUR
SYNONYMS: similar in every dimension → similar distributional behaviors
ANTONYMS: similar in every dimension except one → ???

• We can fill the empty field with:
– "Similar distributional behaviors, except for one dimension of meaning".
• Since giant and dwarf are similar in every dimension of meaning except the one related to size → they occur in similar contexts, except for those related to that dimension.
• We can also assume that the dimension of meaning in which they differ is a salient one, and – by consequence – that they will behave distributionally differently in their most relevant contexts.
• Size is a salient dimension for both giant and dwarf, and they are expected to have a different distributional behavior for this dimension of meaning (i.e. big vs. small).

APAnt: Method
• APAnt (Average Precision for Antonyms) is defined as the inverse of APSyn (Santus et al., 2014b-c; 2015b-c).
– Note:
• 1/vector cosine = "no distributional similarity";
• while 1/APSyn = "not sharing the most salient contexts".
• Recall the APSyn definition: for every feature f included in the intersection between the top N features of w1 (i.e. N(F1)) and w2 (i.e. N(F2)), APSyn adds 1 divided by the average rank of the feature among the top LMI-ranked features of w1 (i.e. rank1(f)) and w2 (i.e. rank2(f)); it gives high scores to synonyms, and low scores or zero to less similar words.
• Expected scores for APAnt:
– High scores for words not sharing many top contexts (antonyms or unrelated words)
– Low scores for words sharing many top contexts (near-synonyms)

APAnt: Evaluation
• We performed several antonym retrieval experiments to evaluate APAnt.
• For the evaluation, we relied on three main datasets, which contain word-pairs labeled with the semantic relations they hold:
– BLESS (Baroni and Lenci, 2011)
• Hypernyms, Co-Hyponyms, Meronyms, etc.
– Lenci/Benotto (Santus et al., 2014b-c)
• Antonyms, Synonyms and Hypernyms
– EVALution 1.0 (Santus et al., 2015a)
• Hypernyms, Meronyms, Synonyms, Antonyms, etc.
• DSMs: 2 content words on the left and the right.

APAnt: Experiment 1 – Information Retrieval
• APAnt scores were assigned to all the word-pairs in the dataset.
• Word-pairs were ranked in decreasing order, first by the first word in the pair and then by the APAnt value.
• Average Precision is used to evaluate the rank (Kotlerman et al., 2010). It returns a value between 0 and 1, where 0 is returned if all antonyms are at the bottom, and 1 if they are all at the top.
• Results for the Lenci/Benotto dataset (2,232 word-pairs), and by POS, are provided.
• BASELINES
– Vector Cosine and Co-Occurrence Baseline. (Santus et al., 2014b-c)

APAnt – Discussion
• The evaluation was performed on:
– 2,232 word-pairs:
• about 50% antonyms
• about 50% synonyms
• APAnt outperforms the vector cosine and co-occurrence baselines on the full dataset.
– The co-occurrence and the cosine promote synonyms.
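Since APAnt is defined as the inverse of APSyn, a minimal sketch is (plain Python, hypothetical toy data; the handling of an empty intersection, returning infinity as the maximal "no sharing" score, is our assumption and is not specified on the slides):

```python
def apsyn(ranked_contexts_1, ranked_contexts_2, n=100):
    """APSyn over two context lists sorted by decreasing LMI (rank 1 first)."""
    rank1 = {f: i + 1 for i, f in enumerate(ranked_contexts_1[:n])}
    rank2 = {f: i + 1 for i, f in enumerate(ranked_contexts_2[:n])}
    return sum(1.0 / ((rank1[f] + rank2[f]) / 2.0)
               for f in rank1.keys() & rank2.keys())

def apant(ranked_contexts_1, ranked_contexts_2, n=100):
    """APAnt = 1 / APSyn: high when the top contexts are NOT shared.

    ASSUMPTION: when no top contexts are shared (APSyn = 0), we return
    infinity rather than dividing by zero.
    """
    score = apsyn(ranked_contexts_1, ranked_contexts_2, n)
    return float("inf") if score == 0.0 else 1.0 / score

# Hypothetical toy rankings: antonyms differ in their most salient
# contexts (big vs. small), sharing only a less relevant one (story).
giant = ["big", "tall", "huge", "story"]
dwarf = ["small", "short", "tiny", "story"]
nice = ["pretty", "very", "quite"]
good = ["very", "pretty", "quite"]
print(apant(giant, dwarf, n=4))  # high: few shared top contexts
print(apant(nice, good, n=3))    # low: near-synonyms share their top contexts
```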
• APAnt outperforms the vector cosine and co-occurrence baselines also across the different POS:
– Best results are obtained for NOUNS
– Worst results are obtained for ADJECTIVES
• It is in fact likely that opposite adjectives share their main contexts more than nouns do (e.g. cold/hot can be used to describe the same entity, while giant/dwarf cannot).
• N = 100 is the best value in our settings.

HYPERNYMY

Hypernymy: Hypothesis
• Another measure, for the identification of hypernyms, was proposed in Santus et al. (2014a): SLQS.
• Given a word-pair, SLQS evaluates the generality of the N most related contexts of the two words, under the hypothesis that:
– hypernyms tend to occur with more general contexts (e.g. animal → eat) than hyponyms (e.g. dog → bark).
• Generality is evaluated in terms of the median Shannon entropy of the N most related contexts: the higher the median entropy, the more general the word is considered.

SLQS: Method
• The N most LMI-related contexts for both words are selected (100 ≤ N ≤ 250).
• For each context c, we calculate its entropy (Shannon, 1948):

H(c) = - Σ_i p(f_i | c) log2 p(f_i | c)

• Then, for each word w, we pick the median entropy E_w among its N most LMI-related contexts.
• Finally, we calculate SLQS as:

SLQS(w1, w2) = 1 - E_w1 / E_w2

• Expected results:
– SLQS = 0 if the words in the pair have similar generality
– SLQS > 0 if w2 is more general
– SLQS < 0 if w2 is less general

SLQS: Experiment 1
• Task: identify the directionality of the pair.
• DSM: 2-window
• Dataset: BLESS (Baroni and Lenci, 2011)
– 1,277 hypernyms
• Note: all of them are in the hyponym-hypernym order; therefore we expect SLQS > 0.
• Results:
– SLQS obtains 87% precision, outperforming both WeedsPrec, which is based on the Inclusion Hypothesis, and the frequency baselines. (Santus et al., 2014)

SLQS: Experiment 2
• Task: information retrieval.
– Given hypernyms, coordinates, meronyms and randoms, score them in such a way that the hypernyms are ranked on top.
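The SLQS procedure can be sketched as follows (plain Python with base-2 Shannon entropy; the context-frequency counts are hypothetical toy data, not corpus statistics):

```python
from math import log2
from statistics import median

def entropy(freqs):
    """Shannon entropy (bits) of a context, from the frequencies of the
    words it co-occurs with."""
    total = sum(freqs)
    return -sum((f / total) * log2(f / total) for f in freqs if f > 0)

def slqs(top_contexts_1, top_contexts_2):
    """SLQS = 1 - E(w1)/E(w2), where E(w) is the median entropy of the
    top-N LMI-related contexts of w. Each argument maps a context to the
    co-occurrence frequencies observed for that context."""
    e1 = median(entropy(f) for f in top_contexts_1.values())
    e2 = median(entropy(f) for f in top_contexts_2.values())
    return 1.0 - e1 / e2

# Hypothetical toy data: the hypernym's contexts (e.g. 'eat') co-occur
# fairly evenly with many words (high entropy), while the hyponym's
# contexts (e.g. 'bark') are more selective (low entropy).
dog_contexts = {"bark": [90, 5, 5], "leash": [80, 10, 10]}
animal_contexts = {"eat": [30, 25, 25, 20], "live": [28, 26, 24, 22]}
print(slqs(dog_contexts, animal_contexts))  # > 0: animal more general than dog
```

Because the score flips sign when the arguments are swapped, the same function also decides the directionality of a pair, as in Experiment 1.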
• DSM: 2-window
• Dataset: BLESS (Baroni and Lenci, 2011)
– 1,277 → hypernyms, coordinates, meronyms and randoms.
• We combined SLQS and the cosine, as they respectively capture generality and similarity.
• Results:
– SLQS*Cosine obtains 59% AP (Kotlerman et al., 2010), outperforming WeedsPrec (which is based on the Inclusion Hypothesis), the cosine and the frequency baselines. (Santus et al., 2014)

HYPERNYMY, CO-HYPONYMY AND RANDOMS

ROOT13: A Supervised Method
• Task: classification of Hypernyms, Co-Hyponyms and Randoms
• Classifier: Random Forest
• Features:
– Cosine, co-occurrence frequency, frequency of w1-w2, entropy of w1-w2 (Turney and Pantel, 2010; Shannon, 1948)
– Shared: size of the intersection between the top 1k associated contexts of the two terms according to the LMI score (Evert, 2005)
– APSyn: for every context in the intersection between the top 1k associated contexts of the two terms, this measure adds 1 divided by its average rank in the term-context lists (Santus et al., 2014b)
– Diff Freqs: difference between the terms' frequencies
– Diff Entrs: difference between the terms' entropies
– C-Freq 1, 2: two features storing the average frequency among the top 1k associated contexts of each term
– C-Entr 1, 2: two features storing the average entropy among the top 1k associated contexts of each term (Shannon, 1948)
• Dataset: combination of BLESS, Lenci/Benotto and EVALution 1.0
– 9,600 pairs: 33% Hypernyms, 33% Co-Hyponyms and 33% Randoms

ROOT13: Experiment 1
• Baseline: Cosine
• Accuracy: measured with F1
• Three classes:
– 88.3% vs. 57.6%
• Hyper-Coord:
– 93.4% vs. 60.2%
• Hyper-Random:
– 92.3% vs. 65.5%
• Coord-Random:
– 97.3% vs. 81.5% (Santus et al., 2016)

CONCLUSIONS

Conclusions
• Extracting properties from the most related contexts (entropy, intersection, frequency, etc.) seems to provide important information about semantic relations.
• APSyn, APAnt, SLQS and ROOT13 are all methods that try to investigate and combine such properties.
• The former three methods have obtained good results in the tasks we have performed, without any particular optimization. Moreover, they are:
– Unsupervised (and therefore applicable to several languages)
– Linguistically grounded (they tell us something about word usage)
• Up to now, we have identified the most related contexts with Local Mutual Information (Evert, 2005), but it is likely that we will start using Positive Pointwise Mutual Information, as most of the literature uses it.
• We are currently developing a system to automatically extract as many statistical properties as possible, evaluating their correlation with semantic relations.
• Briefly: Not All Contexts Are Equal

Thank you
Enrico Santus – The Hong Kong Polytechnic University
Alessandro Lenci – University of Pisa
Qin Lu – The Hong Kong Polytechnic University
Sabine Schulte im Walde – Institute for Natural Language Processing, University of Stuttgart
Frances Yung – Nara Institute of Science and Technology