Lecture 7: Word Embeddings
Kai-Wei Chang
CS @ University of Virginia
[email protected]
Course webpage: http://kwchang.net/teaching/NLP16

This lecture
- Learning word vectors (cont.)
- Representation learning in NLP

Recap: Latent Semantic Analysis
- Data representation
  - Encode single-relational data in a matrix
    - Co-occurrence (e.g., from a general corpus)
    - Synonyms (e.g., from a thesaurus)
- Factorization
  - Apply SVD to the matrix to find latent components
- Measuring degree of relation
  - Cosine of latent vectors

Recap: Mapping to Latent Space via SVD
- C (m×n) ≈ U (m×k) Σ (k×k) Vᵀ (k×n)
- SVD generalizes the original data
  - Uncovers relationships not explicit in the thesaurus
  - Term vectors are projected into a k-dimensional latent space
- Word similarity: cosine of two column vectors in ΣVᵀ (k×n)

Low-rank approximation
- Frobenius norm (C is an m×n matrix):
  ||C||_F = sqrt( Σ_{i=1..m} Σ_{j=1..n} |c_ij|² )
- Rank of a matrix:
  - How many vectors in the matrix are linearly independent of each other

Low-rank approximation
- Low-rank approximation problem:
  min_X ||C − X||_F  s.t.  rank(X) = k
- If I can only use k independent vectors to describe the points in the space, what are the best choices?
- Essentially, we minimize the "reconstruction loss" under a low-rank constraint

Low-rank approximation
- Assume the rank of C is r
- SVD: C = U Σ Vᵀ, with Σ = diag(σ_1, σ_2, …, σ_r, 0, …, 0), i.e., r non-zero values on the diagonal
- Zero out the r − k trailing values: Σ' = diag(σ_1, σ_2, …, σ_k, 0, …, 0)
- C' = U Σ' Vᵀ is the best rank-k approximation:
  C' = argmin_X ||C − X||_F  s.t.  rank(X) = k

Word2Vec
- LSA: a compact representation of the co-occurrence matrix
- Word2Vec: predict surrounding words (skip-gram)
  - Similar to using co-occurrence counts (Levy & Goldberg 2014, Pennington et al. 2014)
- Easy to incorporate new words or sentences

Word2Vec
- Similar to a language model, but predicting the next word is not the goal
- Idea: words that are semantically similar often occur near each other in text
- Embeddings that are good at predicting neighboring words are also good at representing similarity

Skip-gram vs. continuous bag-of-words (CBOW)
- What are the differences?
- Skip-gram predicts each surrounding word from the center word; CBOW predicts the center word from its surrounding context
(figure: the two architectures side by side)

Objective of Word2Vec (skip-gram)
- Maximize the log-likelihood of the context words w_{t−m}, …, w_{t−1}, w_{t+1}, …, w_{t+m} given the center word w_t; equivalently, minimize the negative log-likelihood
  J(θ) = − Σ_t Σ_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t)
- m is usually 5–10

Objective of Word2Vec (skip-gram)
- How do we model log P(w_{t+j} | w_t)?
  P(w_{t+j} | w_t) = exp(u_{w_{t+j}} · v_{w_t}) / Σ_{w' ∈ V} exp(u_{w'} · v_{w_t})
- The softmax function, again!
- Every word has two vectors:
  - v_w: used when w is the center word
  - u_w: used when w is the outside (context) word

How to update?
- P(w_{t+j} | w_t) = exp(u_{w_{t+j}} · v_{w_t}) / Σ_{w' ∈ V} exp(u_{w'} · v_{w_t})
- How do we minimize J(θ)? Gradient descent!
- How do we compute the gradient? (A small sketch follows below.)
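To make the question above concrete, here is a minimal NumPy sketch of the softmax P(o | c) and of the gradient identity ∂ log P(o | c)/∂v_c = u_o − E_{w∼P(w|c)}[u_w] that is derived a few slides later. The vocabulary size, embedding dimension, and random vectors are placeholders chosen only for illustration; they are not part of the lecture.

```python
# A minimal sketch of the skip-gram softmax and its gradient w.r.t. v_c.
# Vocabulary size, dimension, and the random vectors are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                      # toy vocabulary size and embedding dimension
U = rng.normal(size=(V, d))       # u_w: "outside" (context) vectors, one row per word
Vc = rng.normal(size=(V, d))      # v_w: "center" vectors, one row per word

def p_context_given_center(c):
    """Softmax over the vocabulary: P(o | c) for every candidate context word o."""
    scores = U @ Vc[c]            # u_w . v_c for all w
    scores -= scores.max()        # subtract the max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

def grad_log_p_wrt_vc(o, c):
    """Gradient of log P(o | c) with respect to the center vector v_c."""
    p = p_context_given_center(c)
    expected_u = p @ U            # E_{w ~ P(w|c)}[u_w]
    return U[o] - expected_u      # u_o - E[u_w]

c, o = 3, 7                       # arbitrary center / context word ids
print(p_context_given_center(c)[o])   # P(o | c)
print(grad_log_p_wrt_vc(o, c))        # the gradient vector
```

Intuitively, each update moves v_c toward the observed context vector u_o and away from the model's expected context vector; the expectation term is exactly what negative sampling (later in this lecture) approximates with a handful of sampled words.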
Recap: Calculus
- Gradient: for xᵀ = (x_1, x_2, x_3),
  ∇f(x) = ( ∂f(x)/∂x_1, ∂f(x)/∂x_2, ∂f(x)/∂x_3 )ᵀ
- If f(x) = a · x (also written aᵀx), then ∇f(x) = a

Recap: Calculus
- Chain rule: if y = f(u) and u = g(x) (i.e., y = f(g(x))), then
  dy/dx = (dy/du)(du/dx)
- Practice: compute dy/dx for
  1. y = x⁴ + 6x³
  2. y = ln(x² + 5)
  3. y = exp(x² + 3x + 2)

Other useful formulas
- y = exp(x)  ⇒  dy/dx = exp(x)
- y = log(x)  ⇒  dy/dx = 1/x
- When I say log (in this course), I usually mean ln

Example
- Assume the vocabulary set is V. We have one center word c and one context word o.
- What is the conditional probability P(o | c)?
  P(o | c) = exp(u_o · v_c) / Σ_{w' ∈ V} exp(u_{w'} · v_c)
- What is the gradient of the log-likelihood w.r.t. v_c?
  ∂ log P(o | c) / ∂v_c = u_o − E_{w ∼ P(w|c)}[u_w]

Gradient descent
- min_w J(w)
- Update w: w ← w − η ∇J(w)

Local minimum vs. global minimum
(figure: a non-convex objective with a local minimum and a global minimum)

Stochastic gradient descent
- Let J(w) = (1/n) Σ_{i=1..n} J_i(w)
- Gradient descent update rule: w ← w − (η/n) Σ_{i=1..n} ∇J_i(w)
- Stochastic gradient descent:
  - Approximate (1/n) Σ_{i=1..n} ∇J_i(w) by the gradient at a single example, ∇J_i(w) (why?)
  - At each step: randomly pick an example i, then update w ← w − η ∇J_i(w)

Negative sampling
- With a large vocabulary set, stochastic gradient descent is still not enough (why? the expectation below still sums over the whole vocabulary)
  ∂ log P(o | c) / ∂v_c = u_o − E_{w ∼ P(w|c)}[u_w]
- Let's approximate it again!
  - Only sample a few words that do not appear in the context
  - Essentially, put more weight on positive samples

More about Word2Vec — relation to LSA
- LSA factorizes a matrix of co-occurrence counts
- Levy and Goldberg (2014) prove that the skip-gram model implicitly factorizes a (shifted) PMI matrix!
- PMI(w, c) = log [ P(w, c) / (P(w) P(c)) ] = log [ #(w, c) · |D| / (#(w) #(c)) ]

All problems solved?

Continuous semantic representations
(figure: word clusters in embedding space, e.g., {sunny, cloudy, rainy, windy}, {car, cab, wheel}, {emotion, feeling, sad, joy})

Semantics needs more than similarity
- Tomorrow will be rainy. / Tomorrow will be sunny.
- Similar(rainy, sunny)? Antonym(rainy, sunny)?

Polarity-Inducing LSA [Yih, Zweig, Platt 2012]
- Data representation
  - Encode two opposite relations in a matrix using "polarity"
  - Synonyms & antonyms (e.g., from a thesaurus)
- Factorization
  - Apply SVD to the matrix to find latent components
- Measuring degree of relation
  - Cosine of latent vectors

Encode synonyms & antonyms in a matrix
- Joyfulness: joy, gladden; sorrow, sadden
- Sad: sorrow, sadden; joy, gladden
- Rows are thesaurus groups, columns are target words; polarity is induced by the sign (synonyms +1, antonyms −1)

                          joy   gladden   sorrow   sadden   goodwill
  Group 1: "joyfulness"    1       1        -1       -1        0
  Group 2: "sad"          -1      -1         1        1        0
  Group 3: "affection"     0       0         0        0        1

- Cosine score > 0: synonyms
- Cosine score < 0: antonyms
  (a small numeric check of these scores follows below)
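As a sanity check on the polarity trick, here is a small NumPy sketch that builds the 3×5 matrix from the table above, applies a truncated SVD as in the earlier LSA recap, and compares cosine scores. The choice of rank k = 2 and the use of the columns of ΣVᵀ as word vectors follow that recap slide; everything else is an illustrative assumption rather than the paper's exact setup.

```python
# Polarity-inducing matrix from the slide: synonyms +1, antonyms -1.
# Cosine between latent word vectors comes out positive for synonyms,
# negative for antonyms.
import numpy as np

words = ["joy", "gladden", "sorrow", "sadden", "goodwill"]
#        rows: "joyfulness", "sad", "affection" thesaurus groups
C = np.array([[ 1,  1, -1, -1, 0],
              [-1, -1,  1,  1, 0],
              [ 0,  0,  0,  0, 1]], dtype=float)

# Truncated SVD: keep k latent dimensions, take word vectors as columns of S V^T.
U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
word_vecs = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dim latent vector per word

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

w = dict(zip(words, word_vecs))
print(cos(w["joy"], w["gladden"]))   # > 0: synonyms
print(cos(w["joy"], w["sorrow"]))    # < 0: antonyms
```

On this toy matrix the synonym pair scores +1 and the antonym pair scores −1, which is exactly the behavior the "inducing polarity" encoding is designed to produce.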
Continuous representations for entities
(figure: entity embeddings relating DemocraticParty, RepublicanParty, GeorgeWBush, LauraBush, and MichelleObama, with one entity left as "?"; a hypothetical sketch of this kind of query appears below)

Continuous representations for entities
- Useful resources for NLP applications
  - Semantic parsing & question answering
  - Information extraction
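Purely as an illustration of how continuous entity representations could be queried, here is a hypothetical sketch of an analogy-style lookup over an entity embedding table. The vectors are random placeholders (a real table would come from a model trained on a knowledge base or an entity-linked corpus), BarackObama is an added candidate that does not appear on the slides, and reading the figure's "?" as an analogy query is an assumption.

```python
# Hypothetical analogy-style query over entity embeddings.
# The embedding table is random placeholder data, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
entities = ["DemocraticParty", "RepublicanParty",
            "GeorgeWBush", "LauraBush", "BarackObama", "MichelleObama"]
E = {name: rng.normal(size=50) for name in entities}   # placeholder 50-d vectors

def most_similar(query, exclude):
    """Return the candidate entity whose vector has the highest cosine with `query`."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    candidates = [e for e in E if e not in exclude]
    return max(candidates, key=lambda e: cos(E[e], query))

# "GeorgeWBush is to LauraBush as ? is to MichelleObama"
query = E["GeorgeWBush"] - E["LauraBush"] + E["MichelleObama"]
print(most_similar(query, exclude={"GeorgeWBush", "LauraBush", "MichelleObama"}))
```

With random placeholder vectors the printed answer is arbitrary; the point is only the query pattern (vector arithmetic plus nearest-neighbor search), which is what makes such representations useful for semantic parsing, question answering, and information extraction.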