NATURAL LANGUAGE PROCESSING (COM4513/6513)
DISTRIBUTED WORD REPRESENTATIONS
Andreas Vlachos
[email protected]
Department of Computer Science, University of Sheffield

SO FAR
Word senses and their disambiguation.
We considered senses to be discrete representations of meaning.
Thus, each sense is equally different from every other sense.

IN THIS LECTURE
We will challenge the discreteness of word senses.
Instead of senses, we will map each word to a vector.

A QUICK TEST
What is tesguino?
  A bottle of tesguino is on the table.
  Everybody likes tesguino.
  Tesguino makes you drunk.
  We make tesguino out of corn.
"You shall know a word by the company it keeps" (Firth, 1957).
Thus, two words are similar if they appear in similar contexts.

WORD-CONTEXT MATRIX
Typically very sparse.
The shorter the window, the more syntactic the representations; the longer the window, the more semantic.
In the example above (a window of 7 words): syntactic or semantic?

CONTEXT VECTORS
The most basic word representation; similar intuition to the term-document matrix in information retrieval.
one-hot:  x_apricot = [1, 0, 0, ..., 0] ∈ ℝ^|V|,  x_pineapple = [0, 1, 0, ..., 0] ∈ ℝ^|V|
          similarity(x_apricot, x_pineapple) = 0
context:  x_apricot = [0, 1, 1, ..., 0] ∈ ℝ^|V|,  x_pineapple = [0, 1, 1, ..., 0] ∈ ℝ^|V|
          similarity(x_apricot, x_pineapple) > 0

SIMILARITY
inner-product(w, v) = w · v = Σ_{i=1}^{|V|} w_i v_i
Problem: more frequent words appear in more contexts and will be similar to everything.
Solution: divide by the vector lengths (a.k.a. cosine):
cosine(w, v) = (w · v) / (|w| |v|) = Σ_i w_i v_i / (√(Σ_i w_i²) √(Σ_i v_i²))
Other options: Dice, Jaccard, etc.

POINTWISE MUTUAL INFORMATION
Raw counts are OK, but frequent words (articles, pronouns, etc.) can dominate without being informative.
Pointwise mutual information measures how often two events occur relative to how often they would occur if they were independent:
PMI(word, context) = log₂ [ P(word, context) / (P(word) P(context)) ]
Positive values quantify relatedness. Negative values? Usually ignored:
positivePMI(word, context) = max(PMI(word, context), 0)

CHOICE OF CONTEXTS
We can refine contexts using:
  their part-of-speech tags (bank_V vs. bank_N)
  syntactic dependencies (eat_dobj vs. eat_subj)
We can weight contexts according to their distance from the word: the further away, the lower the weight.
Rare contexts dominate PMI, so replace
P(context) = #(context) / Σ_c #(c)
with the smoothed
P_α(context) = #(context)^α / Σ_c #(c)^α,  α = 0.75

SINGULAR VALUE DECOMPOSITION
PPMI matrices are good, but:
  high dimensional
  very sparse
Dimensionality reduction using truncated singular value decomposition:
PPMI_{|V|×|C|} ≈ W_{|V|×k} S_{k×k} C_{k×|C|}
The approximation is good: it removes noise and redundancy.
Same as latent semantic analysis for term-document matrices.
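A minimal Python/numpy sketch tying the pieces above together: build a positive PMI matrix from a toy word-context count matrix, reduce it with truncated SVD, and compare words with cosine similarity. The vocabulary and counts below are invented purely for illustration; α = 0.75 follows the slides, while k = 2 is an arbitrary choice for such a tiny matrix.

import numpy as np

# Toy word-context co-occurrence counts (rows: target words, columns: context words).
# The words and counts are made up purely for illustration.
words = ["apricot", "pineapple", "digital", "information"]
contexts = ["computer", "data", "pinch", "result", "sugar"]
counts = np.array([
    [0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1],
    [2, 1, 0, 1, 0],
    [1, 6, 0, 4, 0],
], dtype=float)

def ppmi(counts, alpha=0.75):
    """Positive PMI with context-distribution smoothing (alpha = 0.75)."""
    total = counts.sum()
    p_wc = counts / total                       # joint P(word, context)
    p_w = counts.sum(axis=1, keepdims=True) / total
    # Smoothed context probabilities: raise counts to alpha before normalising,
    # which damps the influence of rare contexts.
    c_alpha = counts.sum(axis=0) ** alpha
    p_c = c_alpha / c_alpha.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0                # zero counts -> PMI of 0
    return np.maximum(pmi, 0.0)                 # keep only positive values

def cosine(u, v):
    """Cosine similarity: dot product divided by the vector lengths."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

M = ppmi(counts)

# Truncated SVD: keep the top-k singular values to get dense k-dimensional vectors.
k = 2
U, S, Vt = np.linalg.svd(M, full_matrices=False)
W = U[:, :k] * S[:k]                            # dense word vectors (|V| x k)

print(cosine(M[0], M[1]))   # sparse PPMI vectors: apricot vs. pineapple
print(cosine(W[0], W[1]))   # dense SVD vectors:   apricot vs. pineapple

On realistic vocabularies the count matrix is huge and sparse, which is exactly why the truncated decomposition (or the skip-gram model below) is used instead of the raw PPMI vectors.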
SKIP-GRAM (MIKOLOV ET AL. 2013)
Running SVD on large matrices is expensive.
Let's look at one word at a time:
P(context | w_t) = P(w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} | w_t)        (skip-gram)
                 = ∏_{w_c ∈ context} P(w_c | w_t)                     (independence)
P(w_c | w_t) = exp(w_c · w_t) / Σ_{w'_c ∈ V} exp(w'_c · w_t)          (word vectors)
A giant logistic regression classifier: the words are the labels and the parameters are the word vectors.
Negative sampling: negative training examples are subsampled intelligently.

EACH WORD HAS TWO EMBEDDINGS
We can discard the context word embeddings, add them to the target word embeddings, or concatenate the two.
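A minimal sketch, in the same Python/numpy style, of skip-gram trained with negative sampling on the toy tesguino sentences from earlier. The corpus, the hyperparameters (dim, window, k_neg, lr) and the uniform sampling of negatives are simplifications for illustration; word2vec draws negatives from a smoothed unigram distribution and is far more heavily engineered.

import numpy as np

rng = np.random.default_rng(0)

# A tiny toy corpus; the sentences are made up purely for illustration.
corpus = "we make tesguino out of corn everybody likes tesguino".split()
vocab = sorted(set(corpus))
word2id = {w: i for i, w in enumerate(vocab)}
ids = [word2id[w] for w in corpus]

V, dim, window, k_neg, lr = len(vocab), 25, 2, 5, 0.05

# Each word gets two embeddings: one as a target (W) and one as a context (C).
W = (rng.random((V, dim)) - 0.5) / dim
C = (rng.random((V, dim)) - 0.5) / dim

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(50):
    for pos, t in enumerate(ids):
        # Collect the context words within the window around position pos.
        lo, hi = max(0, pos - window), min(len(ids), pos + window + 1)
        for c in (ids[i] for i in range(lo, hi) if i != pos):
            # One positive example plus k_neg randomly drawn negatives,
            # i.e. a small binary logistic regression per (target, context) pair.
            negs = rng.integers(0, V, size=k_neg)
            grad_t = np.zeros(dim)
            g = sigmoid(C[c] @ W[t]) - 1.0        # gradient for the positive pair
            grad_t += g * C[c]
            C[c] -= lr * g * W[t]
            for n in negs:
                g = sigmoid(C[n] @ W[t])          # gradient for a negative pair
                grad_t += g * C[n]
                C[n] -= lr * g * W[t]
            W[t] -= lr * grad_t

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Compare two learned target vectors.
print(cosine(W[word2id["tesguino"]], W[word2id["corn"]]))

Note how each word ends up with two embeddings, W (as a target) and C (as a context); as on the slide above, the context embeddings can be discarded, added to, or concatenated with the target embeddings.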
EVALUATION
Intrinsic:
  similarity: order word pairs according to their semantic similarity
  in-context similarity: substitute a word in a sentence without changing its meaning
  analogy: Athens is to Greece what Rome is to ...?
Extrinsic: use the vectors to improve performance in a task.
They are an easy way to take advantage of unlabeled data to do semi-supervised learning.

BEST WORD VECTORS?
High-dimensional (processed) counts, or low-dimensional neural/SVD vectors?
A recent paper by Levy et al. (2015) showed that the choices of context window size, rare word removal and context distribution smoothing matter more.
The choice of texts to obtain the counts matters: more text is better, and low-dimensional methods scale better.
Learning the vectors makes it possible to incorporate domain knowledge, e.g. that bass should be more similar to fish.

WHAT ABOUT POLYSEMY?
All occurrences of a word (and all its senses) are represented by one vector. How do we handle polysemy?
  All senses are present in the vector.
  Given a task, it is often useful to adapt the vectors to represent the appropriate sense.

LIMITATIONS
Antonyms appear in similar contexts, so it is hard to distinguish them from synonyms.
Compositionality: what is the meaning of a sequence of words? While we might be able to obtain context vectors for short phrases, this doesn't scale to whole sentences, paragraphs, etc.

BIBLIOGRAPHY
Jurafsky & Martin, Chapter 19
Omer Levy's article
Turian et al.'s paper on how to use them

COMING UP NEXT
Language modeling