Anca Dinu, University of Bucharest
ACED 15, Bucharest 2013
Acknowledgement: This work was supported by a grant of the Romanian National Authority for Scientific Research, CNCS–UEFISCDI, project number PN-II-ID-PCE-2011-3-0959.

Outline: What this paper is about; Formal versus Distributional Semantics – a complementarity perspective; The compositionality problem for Distributional Semantics; Our idea – dropping the raw frequencies, using rankings; Experimental results; Further work; Conclusions.

This paper focuses on an alternative method of measuring word-word semantic relatedness in the Distributional Semantics (DS) framework. Distributional Semantics is an approach to semantics that is based on the contexts of words (and linguistic expressions) in large corpora.

Frege: linguistic signs have a reference and a sense. Formal Semantics studies “meaning” as “reference”. Distributional Semantics focuses on “meaning” as “sense”, leading to the “language as use” view.

• Focus of FS: grammatical words: prepositions, articles, quantifiers, coordination, auxiliary verbs, pronouns, negation.
• Focus of DS: content words: nouns, adjectives, verbs, adverbs.

Formal semantics gives an elaborate and elegant account of the productive and systematic nature of language. Then why bother with Distributional Semantics? Because many semantic issues come from the lexicon of content words and not from grammatical terms. Because it is a solvable problem. And because there are a number of applications for it (such as information retrieval tasks or search engines).

For instance: “In order for a cow to be brown most of its body's surface should be brown, though not its udders, eyes, or internal organs. A brown crystal, on the other hand, needs to be brown both inside and outside. . . Furthermore, in order for a cow to be brown, brown should be the animal's natural color. A table, on the other hand, is brown even if it is only painted brown. But while a table or a cow are not brown if covered with brown sugar, a cookie is. In short, what is to be brown varies from one linguistic context to another.” (Lahav 1993:76)

The problem of lexical semantics is primarily a problem of size: even considering the many regularities found in the content lexicon, a hand-by-hand analysis is simply not feasible. There have been important attempts to tackle the lexicon problem from the point of view of formal semantics, such as Pustejovsky's (1995) theory of the Generative Lexicon.

Distributional Semantics is rather like a concrete implementation of a feature-based semantic representation, akin to the Generative Lexicon (Pustejovsky, 1995) or Lexical Conceptual Structures (Jackendoff, 1990). However, unlike these theories, DS models can be induced fully automatically, on a large scale, from corpus data. DS assumes that the statistical distribution of words in context plays a key role in characterizing their semantic behavior. The idea that word co-occurrence statistics extracted from text corpora can provide a basis for semantic representations can be traced back at least to Firth (1957), “You shall know a word by the company it keeps”, and to Harris (1954), “words that occur in similar contexts tend to have similar meanings”.
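To make the distributional hypothesis concrete, the following is a minimal sketch (not part of the paper) of how window-based co-occurrence counts can be collected from a corpus; the toy sentences and the window size are illustrative assumptions.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, k=10):
    """Count how often each pair of words occurs within k words of each other."""
    counts = defaultdict(int)
    for tokens in sentences:
        for i, w in enumerate(tokens):
            # look only at the k tokens following w, so each unordered pair is seen once
            for c in tokens[i + 1 : i + 1 + k]:
                counts[(w, c)] += 1
                counts[(c, w)] += 1
    return counts

# Toy corpus (illustrative only; the real setting is a large corpus such as Wikipedia)
corpus = [
    "the rich merchant bought silver and gold".split(),
    "the poor village was struck by disease".split(),
]
counts = cooccurrence_counts(corpus, k=10)
print(counts[("rich", "silver")])   # 1
print(counts[("poor", "disease")])  # 1
```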
For instance, by looking at a word's context, one can infer some of its meaning:
• tasty tnassiorc / greasy tnassiorc / tnassiorc with butter / tnassiorc for breakfast → FOOD
• He filled the wampimuk, passed it around and we all drank some → DRINK
• We found a little, hairy wampimuk sleeping behind the tree → ANIMAL

Traditionally, vector space models were applied to measure word similarity in the DS framework. A word is represented by a vector whose elements are derived from the co-occurrences of the word in various contexts, such as: windows of words (Lund and Burgess (1996)), grammatical dependencies (Lin (1998), Pado and Lapata (2007)), or richer contexts consisting of dependency links and selectional preferences on the argument positions (Erk and Pado (2008)).

For instance, for the English Wikipedia corpus (Hinrich Schütze, Distributional Semantics, 2012), with co-occurrence defined as occurrence within k = 10 words of each other:

cooc.(rich, silver)  = 186    cooc.(poor, silver)  = 34
cooc.(rich, disease) = 17     cooc.(poor, disease) = 162
cooc.(rich, society) = 143    cooc.(poor, society) = 228

The similarity between two words is measured with the cosine of the angle between their vectors (a minimal sketch of this computation is given at the end of this section). Small angle: silver and gold are similar. Instead of a space with only two dimensions, it is usual to employ a very high-dimensional space with a large number of vectors represented in it. We can compute the nearest neighbors of any word in this word space.

Nearest neighbors of “silver”: 1.000 silver / 0.865 bronze / 0.842 gold / 0.836 medal / 0.826 medals / 0.761 relay / 0.740 medalist / 0.737 coins / 0.724 freestyle / 0.720 metre / 0.716 coin / 0.714 copper / 0.712 golden / 0.706 event / 0.701 won / 0.700 foil / 0.698 Winter / 0.684 Pan / 0.680 vault / 0.675 jump

Nearest neighbors of “disease”: 1.000 disease / 0.858 Alzheimer / 0.852 chronic / 0.846 infectious / 0.843 diseases / 0.823 diabetes / 0.814 cardiovascular / 0.810 infection / 0.807 symptoms / 0.805 syndrome / 0.801 kidney / 0.796 liver / 0.788 Parkinson / 0.787 disorders / 0.787 coronary / 0.779 complications / 0.778 cure / 0.778 disorder / 0.778 Crohn / 0.773 bowel

The vectors in the space have been words so far. But we can also represent other entities, like phrases, sentences, paragraphs, documents, even entire books.

Compositionality problem: how to obtain the distribution vector of a phrase?
Option 1: The distributions of phrases – even sentences – can be obtained from corpora, but those distributions are very sparse, and observing them does not account for productivity in language.
Option 2: Use vector addition or the product of two or more word vectors to compute the phrase distribution, but addition and multiplication are commutative in a word-based model:
▪ [[The cat chases the mouse]] = [[The mouse chases the cat]].
There is an excellent program that takes syntax into account: “Frege in Space: A Program for Compositional Distributional Semantics” by Baroni et al., 2012.

This paper presents an alternative method of measuring word-word semantic relatedness in the DS framework. Instead of representing words in vector spaces, we represent them as rankings of co-occurring words ordered by their co-occurrence frequency.
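As a minimal sketch of the vector-space similarity described above (not code from the paper), the toy counts for rich and poor can be treated as a two-dimensional space and compared with the cosine measure:

```python
import math

# Each word is represented by its co-occurrence counts with the context words
# (rich, poor), using the toy figures quoted above from the Wikipedia example.
vectors = {
    "silver":  [186, 34],
    "disease": [17, 162],
    "society": [143, 228],
}

def cosine(u, v):
    """Cosine of the angle between two co-occurrence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

print(cosine(vectors["silver"], vectors["society"]))   # ≈ 0.68: fairly similar profiles
print(cosine(vectors["silver"], vectors["disease"]))   # ≈ 0.28: rather different profiles
```

In the full model the same computation runs over thousands of context dimensions, which is how nearest-neighbor lists like the ones above are produced.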
Advantages: simpler; presumably more robust, since rankings of features are expected to vary less than the raw frequencies with the choice of corpus; opens the perspective of experimenting with new methods of composing (distributional) meaning by aggregating rankings instead of combining (adding, multiplying) vectors.

We used the Wacky corpus (Baroni et al., 2009), which is lemmatized and POS-tagged. We extracted from the Wacky corpus the co-occurrence vectors for the words in the WS-353 Test (Finkelstein et al. 2002), in a 10-word window. The WS-353 Test is a semantic relatedness test set consisting of 353 word pairs and a gold standard defined as the mean value of evaluations by up to 17 human judges. Completely unrelated words were assigned a value of 0, while identical words were assigned a value of 10. The WS-353 Test has been widely used in the literature and has become the de facto standard for semantic relatedness evaluation.

For each of the 437 distinct target words in the WS-353 Test, we computed its co-occurrence frequency with the words in the corpus. This resulted in 437 rankings of words (features) ordered in decreasing order of their co-occurrence frequency. Then we computed 4 different measures of similarity (Jaro, Rank, Inverse Rank and MeanRank) for each pair of rankings representing a pair of words in the WS-353 Test.

The best Spearman correlation with human judgements (0.46) was achieved by the Jaro distance, defined as

Jaro(S1, S2) = 1/3 · (c/|S1| + c/|S2| + (c − t)/c),

where t is the number of character transpositions in S1 and S2 (i.e. the number of common characters that occur in different positions, divided by 2) and c is the number of common characters (a sketch implementation is given at the end of this section). This is close to the baseline Spearman correlation obtained by DSMs, which is around 0.5.

When inspecting the worst mismatches between human and machine relatedness judgements for pairs of words, we observed that most of them follow a pattern: lower values assigned by humans correspond to much higher values computed by the machine, such as:

Pair                      Humans    Machine
(month, hotel)            1.81      6.239567
(money, operation)        3.31      6.40989
(king, cabbage)           0.23      4.171145
(coast, forest)           3.15      6.409761
(rooster, voyage)         0.62      4.656631
(governor, interview)     3.25      6.08319
(drink, car)              3.04      5.931482
(day, summer)             3.94      6.576498
(architecture, century)   3.78      5.927852
(morality, marriage)      3.69      5.450308

Our method performs around the baseline for DSMs, so there is hope for improvement. It is also less computationally expensive. It provides a new framework for experimenting with distributional semantic compositionality. Thank you!

Contextual information – bag of words: The first category of models aims at integrating the widest possible range of context information without recourse to linguistic structure. The best-known work in this category is Schütze (1998). He first computes “first-order” vector representations for word meaning by collecting co-occurrence counts from the entire corpus. Then, he determines “second-order” vectors for individual word instances in their context, which is taken to be a simple surface window, by summing up all first-order vectors of the words in this context. The resulting vectors form sense clusters.
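A minimal sketch of the Jaro measure defined above, written over arbitrary sequences so that it can be applied to ranked lists of context words rather than character strings; the toy rankings below are invented for illustration and are not the rankings extracted from Wacky.

```python
def jaro(s1, s2):
    """Jaro similarity between two sequences (here: rankings of co-occurring words)."""
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    c = 0  # c = number of common elements (matches found within the window)
    for i, x in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == x:
                matched1[i] = matched2[j] = True
                c += 1
                break
    if c == 0:
        return 0.0
    # t = number of common elements appearing in a different order, divided by 2
    t, j = 0, 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[j]:
                j += 1
            if s1[i] != s2[j]:
                t += 1
            j += 1
    t /= 2
    return (c / len(s1) + c / len(s2) + (c - t) / c) / 3

# Toy rankings: context words in decreasing order of co-occurrence frequency
rank_tiger = ["cat", "jaguar", "zoo", "stripes", "wildlife"]
rank_jaguar = ["cat", "tiger", "car", "zoo", "wildlife"]
print(jaro(rank_tiger, rank_jaguar))  # ≈ 0.73
```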
Contextual information – including syntactic information: Erk and Pado (A Structured Vector Space Model for Word Meaning in Context, Katrin Erk and Sebastian Pado, Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 897–906, Honolulu, October 2008) encode each word as a combination of one vector that models the lexical meaning of the word and a set of vectors, each of which represents the selectional preferences for a particular syntactic relation that the word supports (such as subj, obj, mod); a rough sketch of this combination idea is given at the end of this section. Wu and Schuler (Structured Composition of Semantic Vectors, Stephen Wu and William Schuler, Proceedings of the International Conference on Computational Semantics, Oxford, UK, 2011) propose a structured vectorial semantic framework, in which semantic vectors are defined and composed in syntactic context.

Cases where simple word space models fail:
• Antonyms are judged to be similar: “disease” and “cure”.
• Ambiguity: “Cambridge”.
• Homonymy: “bank”.
• Non-specificity (a word that occurs in a large variety of different contexts and has few or no specific semantic associations): “person”.
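As a rough illustration of the structured approach described at the start of this section (not Erk and Pado's exact formulation), a word's in-context vector can be obtained by combining its lexical vector with the selectional-preference vector that its syntactic partner supplies for the relation linking them; the vectors, the relation label and the component-wise product below are all illustrative assumptions.

```python
# Toy vectors over four arbitrary context dimensions (illustrative only)
lexical = {
    "ball": [5.0, 1.0, 3.0, 0.5],           # lexical vector for the noun "ball"
}
preferences = {
    ("catch", "obj"): [4.0, 0.2, 2.5, 0.1],  # what "catch" prefers in its object slot
}

def contextualize(word, head, relation):
    """In-context vector for `word` governed by `head` via `relation` (sketch only)."""
    word_vec = lexical[word]
    pref_vec = preferences[(head, relation)]
    # component-wise product: dimensions supported by both the word and the
    # selectional preference are strengthened, the others are damped
    return [a * b for a, b in zip(word_vec, pref_vec)]

print(contextualize("ball", "catch", "obj"))  # [20.0, 0.2, 7.5, 0.05]
```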