
Lecture 7: Word Embeddings
Kai-Wei Chang
CS @ University of Virginia
[email protected]
Course webpage: http://kwchang.net/teaching/NLP16
This lecture
v Learning word vectors (Cont.)
v Representation learning in NLP
Recap: Latent Semantic Analysis
v Data representation
v Encode single-relational data in a matrix
v Co-occurrence (e.g., from a general corpus)
v Synonyms (e.g., from a thesaurus)
v Factorization
v Apply SVD to the matrix to find latent
components
v Measuring degree of relation
v Cosine of latent vectors
Recap: Mapping to Latent Space via SVD
C ≈ U Σ Vᵀ   (C: d×n, U: d×k, Σ: k×k, Vᵀ: k×n)
v SVD generalizes the original data
v Uncovers relationships not explicit in the thesaurus
v Term vectors projected to k-dim latent space
v Word similarity: cosine of two column vectors in ΣVᵀ
Low rank approximation
v Frobenius norm: for an m×n matrix C,
  ||C||_F = √( Σ_{i=1}^{m} Σ_{j=1}^{n} |c_ij|² )
v Rank of a matrix:
  v the maximum number of linearly independent row (or column) vectors in the matrix
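A quick numpy check of these two definitions (the small matrix below is an arbitrary example, not from the lecture):

    import numpy as np

    C = np.array([[1., 2., 3.],
                  [2., 4., 6.],   # second row = 2 * first row, so the rank drops to 2
                  [0., 1., 1.]])

    # Frobenius norm: square root of the sum of squared entries
    frob = np.sqrt((C ** 2).sum())
    print(frob, np.linalg.norm(C, 'fro'))   # the two values agree

    # Rank: number of linearly independent rows (or columns)
    print(np.linalg.matrix_rank(C))         # 2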
Low rank approximation
v Low rank approximation problem:
  min_X ||C − X||_F   s.t.  rank(X) = k
v If I can only use k independent vectors to describe the points in the space, what are the best choices?
Essentially, we minimize the "reconstruction loss" under a low-rank constraint
Low rank approximation
v Assume rank of 𝐢 is r
v SVD: 𝐢 = π‘ˆΞ£π‘‰ ' , Ξ£ = diag(𝜎8 , 𝜎5 … 𝜎P , 0,0,0, … 0)
𝜎8 0 0
Ξ£ = 0 β‹± 0
0 0 0
π‘Ÿ non-zeros
v Zero-out the r βˆ’ π‘˜ trailing values
Ξ£β€² = diag(𝜎8 , 𝜎5 … 𝜎U , 0,0,0, … 0)
v 𝐢 V = UΞ£V 𝑉 ' is the best k-rank approximation:
C V = π‘Žπ‘Ÿπ‘”min ||𝐢 βˆ’ 𝑋||/ 𝑠. 𝑑. π‘Ÿπ‘Žπ‘›π‘˜ 𝑋 = π‘˜
=
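A minimal numpy sketch of this construction, using a small random matrix as a stand-in for the co-occurrence matrix C (the sizes and seed are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    C = rng.standard_normal((20, 15))
    k = 5

    # Full SVD, then zero out all but the k largest singular values
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    s_trunc = s.copy()
    s_trunc[k:] = 0.0
    C_k = U @ np.diag(s_trunc) @ Vt          # best rank-k approximation of C

    print(np.linalg.matrix_rank(C_k))        # k
    print(np.linalg.norm(C - C_k, 'fro'))    # the minimal Frobenius reconstruction loss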
Word2Vec
v LSA: a compact representation of the co-occurrence matrix
v Word2Vec: predict surrounding words (skip-gram)
  v Similar to using co-occurrence counts (Levy & Goldberg 2014; Pennington et al. 2014)
  v Easy to incorporate new words or sentences
Word2Vec
v Similar to a language model, but predicting the next word is not the goal
v Idea: words that are semantically similar often occur near each other in text
v Embeddings that are good at predicting neighboring words are also good at representing similarity
Skip-gram vs. Continuous bag-of-words
v What are the differences?
Skip-gram vs. Continuous bag-of-words
v Skip-gram predicts the surrounding context words from the center word; CBOW predicts the center word from its surrounding context
Objective of Word2Vec (Skip-gram)
v Maximize the log likelihood of the context words
  w_{t−m}, w_{t−m+1}, …, w_{t−1}, w_{t+1}, w_{t+2}, …, w_{t+m}
  given the center word w_t
v Equivalently, minimize J(θ) = − Σ_t Σ_{−m ≤ j ≤ m, j ≠ 0} log p(w_{t+j} | w_t)
v m is usually 5–10
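To make the objective concrete, a tiny sketch of what it sums over: the (center word, context word) pairs drawn from every window (the sentence and window size below are made up for illustration):

    # Toy corpus; window size m = 2 for illustration
    sentence = "the quick brown fox jumps over the lazy dog".split()
    m = 2

    pairs = []
    for t, center in enumerate(sentence):
        for j in range(-m, m + 1):
            if j != 0 and 0 <= t + j < len(sentence):
                pairs.append((center, sentence[t + j]))   # (w_t, w_{t+j})

    print(pairs[:6])
    # Skip-gram maximizes the sum of log p(context | center) over all these pairs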
Objective of Word2Vec (Skip-gram)
v How to model log P(w_{t+j} | w_t)?
  p(w_{t+j} | w_t) = exp(u_{w_{t+j}} · v_{w_t}) / Σ_{w'} exp(u_{w'} · v_{w_t})
v The softmax function, again!
v Every word has 2 vectors:
  v v_w : when w is the center word
  v u_w : when w is the outside word (context word)
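A minimal numpy sketch of this softmax probability; the toy vocabulary, dimension, and random initialization are assumptions for illustration:

    import numpy as np

    vocab = ["cat", "dog", "sat", "mat", "the"]
    dim = 8
    rng = np.random.default_rng(0)
    V = rng.standard_normal((len(vocab), dim))   # v_w: center-word vectors
    U = rng.standard_normal((len(vocab), dim))   # u_w: outside (context) vectors

    def p_context_given_center(context, center):
        # softmax over the whole vocabulary: exp(u_o . v_c) / sum_w' exp(u_w' . v_c)
        c, o = vocab.index(center), vocab.index(context)
        scores = U @ V[c]                # dot product with every outside vector
        scores -= scores.max()           # numerical stability
        probs = np.exp(scores) / np.exp(scores).sum()
        return probs[o]

    print(p_context_given_center("sat", "cat"))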
How to update?
  p(w_{t+j} | w_t) = exp(u_{w_{t+j}} · v_{w_t}) / Σ_{w'} exp(u_{w'} · v_{w_t})
v How to minimize J(θ)?
v Gradient descent!
v How to compute the gradient?
Recap: Calculus
v Gradient: for x = (x₁, x₂, x₃)ᵀ,
  ∇φ(x) = ( ∂φ(x)/∂x₁, ∂φ(x)/∂x₂, ∂φ(x)/∂x₃ )ᵀ
v If φ(x) = a · x (also written aᵀx), then
  ∇φ(x) = a
Recap: Calculus
v If y = f(u) and u = g(x) (i.e., y = f(g(x))), then
  dy/dx = (dy/du)(du/dx)
v Exercises:
  1. y = x³ + 6
  2. y = ln(x² + 5)
  3. y = exp(x² + 3x + 2)
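As a quick sanity check (not part of the original slides), the chain-rule answer for exercise 2 can be verified numerically with a finite difference:

    import math

    # y = ln(x^2 + 5); the chain rule gives dy/dx = 2x / (x^2 + 5)
    f = lambda x: math.log(x ** 2 + 5)
    analytic = lambda x: 2 * x / (x ** 2 + 5)

    x, h = 1.5, 1e-6
    numeric = (f(x + h) - f(x - h)) / (2 * h)   # central finite difference
    print(analytic(x), numeric)                 # the two values agree closely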
Other useful formulas
v y = exp(x)  ⇒  dy/dx = exp(x)
v y = log(x)  ⇒  dy/dx = 1/x
When I say log (in this course), I usually mean ln
Example
v Assume the vocabulary set is W. We have one center word c and one context word o.
v What is the conditional probability p(o | c)?
  p(o | c) = exp(u_o · v_c) / Σ_{w'} exp(u_{w'} · v_c)
v What is the gradient of the log likelihood w.r.t. v_c?
  ∂ log p(o | c) / ∂v_c = u_o − E_{w ∼ P(w|c)}[u_w]
Gradient Descent
min_w J(w)
Update w:  w ← w − η ∇J(w)
Local minimum vs. global minimum
Stochastic gradient descent
v Let J(w) = (1/n) Σ_{j=1}^{n} J_j(w)
v Gradient descent update rule:
  w ← w − (η/n) Σ_{j=1}^{n} ∇J_j(w)
v Stochastic gradient descent:
  v Approximate (1/n) Σ_{j=1}^{n} ∇J_j(w) by the gradient at a single example, ∇J_i(w) (why?)
  v At each step:
    Randomly pick an example i
    w ← w − η ∇J_i(w)
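A minimal sketch of the stochastic update rule on a toy least-squares objective (the objective and data here are stand-ins, not the word2vec loss):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.standard_normal(100)

    # J_i(w) = 0.5 * (x_i . w - y_i)^2, so grad J_i(w) = (x_i . w - y_i) * x_i
    def grad_i(w, i):
        return (X[i] @ w - y[i]) * X[i]

    eta, w = 0.01, np.zeros(3)

    # Stochastic gradient descent: one randomly chosen example per step
    for step in range(5000):
        i = rng.integers(len(X))
        w -= eta * grad_i(w, i)

    print(w)   # close to [1.0, -2.0, 0.5]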
Negative sampling
v With a large vocabulary set, stochastic gradient descent is still not enough (why?)
  ∂ log p(o | c) / ∂v_c = u_o − E_{w ∼ P(w|c)}[u_w]
v Let's approximate it again!
  v Only sample a few words that do not appear in the context
  v Essentially, put more weight on positive samples
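A rough sketch of one negative-sampling update for a (center, context) pair; the vocabulary size, number of negatives, uniform sampling distribution, and learning rate below are simplifying assumptions (word2vec itself draws negatives from a smoothed unigram distribution):

    import numpy as np

    rng = np.random.default_rng(0)
    W, dim, K = 1000, 50, 5                    # vocab size, dimension, negatives per pair
    U = 0.01 * rng.standard_normal((W, dim))   # outside vectors u_w
    V = 0.01 * rng.standard_normal((W, dim))   # center vectors v_w

    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    def sgns_update(c, o, eta=0.05):
        # One update for center word c and observed context word o.
        negs = rng.integers(W, size=K)   # sampled words assumed absent from the context
        grad_vc = np.zeros(dim)

        # positive sample: push u_o . v_c up
        g = sigmoid(U[o] @ V[c]) - 1.0
        grad_vc += g * U[o]
        U[o] -= eta * g * V[c]

        # negative samples: push u_neg . v_c down
        for n in negs:
            g = sigmoid(U[n] @ V[c])
            grad_vc += g * U[n]
            U[n] -= eta * g * V[c]

        V[c] -= eta * grad_vc

    sgns_update(c=3, o=17)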
More about Word2Vec – relation to LSA
v LSA factorizes a matrix of co-occurrence counts
v Levy and Goldberg (2014) prove that the skip-gram model implicitly factorizes a (shifted) PMI matrix!
v PMI(w, c) = log [ P(w | c) / P(w) ]
            = log [ P(w, c) / (P(w) P(c)) ]
            = log [ #(w, c) · |D| / (#(w) #(c)) ]
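A small numpy sketch of the last equality, building a (shifted) PMI matrix from raw counts; the toy count matrix and the shift value k are made up for illustration:

    import numpy as np

    # toy co-occurrence counts #(w, c): rows = words, columns = contexts
    counts = np.array([[10., 0., 2.],
                       [ 1., 5., 0.],
                       [ 0., 3., 4.]])

    D = counts.sum()                                 # |D|: total number of (w, c) pairs
    w_counts = counts.sum(axis=1, keepdims=True)     # #(w)
    c_counts = counts.sum(axis=0, keepdims=True)     # #(c)

    with np.errstate(divide='ignore'):
        pmi = np.log(counts * D / (w_counts * c_counts))

    k = 5                                  # number of negative samples in skip-gram
    shifted_pmi = pmi - np.log(k)          # what SGNS implicitly factorizes (Levy & Goldberg 2014)
    sppmi = np.maximum(shifted_pmi, 0)     # positive variant commonly used in practice
    print(np.round(sppmi, 2))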
All problems solved?
Continuous Semantic Representations
[Figure: words embedded in a continuous semantic space: sunny, cloudy, rainy, windy; car, cab, wheel; emotion, feeling, joy, sad]
Semantics Needs More Than Similarity
Tomorrow will be rainy.
Tomorrow will be sunny.
similar(rainy, sunny)?
antonym(rainy, sunny)?
Polarity Inducing LSA
[Yih, Zweig, Platt 2012]
v Data representation
v Encode two opposite relations in a matrix using
β€œpolarity”
v Synonyms & antonyms (e.g., from a thesaurus)
v Factorization
v Apply SVD to the matrix to find latent
components
v Measuring degree of relation
v Cosine of latent vectors
Encode Synonyms & Antonyms in Matrix
v Joyfulness: joy, gladden; sorrow, sadden
v Sad: sorrow, sadden; joy, gladden
v Target word: row-vector; inducing polarity

                         joy   gladden   sorrow   sadden   goodwill
Group 1: "joyfulness"     1       1        -1       -1        0
Group 2: "sad"           -1      -1         1        1        0
Group 3: "affection"      0       0         0        0        1

Cosine score: positive for synonyms, negative for antonyms
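A quick numpy check of the intended effect: with polarity encoded, synonym vectors get cosine +1 and antonym vectors cosine −1 (in the actual method the cosine is taken on the latent vectors after SVD, but the sign pattern is the point):

    import numpy as np

    # columns: joy, gladden, sorrow, sadden, goodwill (as in the table above)
    M = np.array([[ 1,  1, -1, -1, 0],    # group 1: "joyfulness"
                  [-1, -1,  1,  1, 0],    # group 2: "sad"
                  [ 0,  0,  0,  0, 1]],   # group 3: "affection"
                 dtype=float)

    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    joy, gladden, sorrow = M[:, 0], M[:, 1], M[:, 2]
    print(cos(joy, gladden))   #  1.0  -> synonyms
    print(cos(joy, sorrow))    # -1.0  -> antonyms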
Continuous representations for entities
[Figure: entity embeddings showing Democratic Party, Republican Party, George W. Bush, Laura Bush, Michelle Obama, and a missing entity marked "?"]
Continuous representations for entities
β€’ Useful resources for NLP applications
β€’ Semantic Parsing & Question Answering
β€’ Information Extraction