Improving Distributional Similarity
with Lessons Learned from Word Embeddings
Presented by Jiaxing Tan
Some slides are adapted from the original authors' presentation
Outline
• Background
• Hyperparameters to experiment with
• Experiment and Result
Motivation for word vector representations
• Compare Word Similarity & Relatedness
• How similar is iPhone to iPad?
• How related is AlphaGo to Google?
• In a search engine, we represent the query and documents as vectors in order to compare their similarity
• Representing words as vectors allows easy computation of similarity
Approaches for Representing Words
Distributional Semantics (Count):
• Used since the 90's
• Sparse word-context PMI/PPMI matrix
• Decomposed with SVD (dense)
Word Embeddings (Predict):
• Inspired by deep learning
• word2vec (SGNS) (Mikolov et al., 2013)
• GloVe (Pennington et al., 2014)
Underlying theory: the Distributional Hypothesis (Harris, 1954; Firth, 1957)
"Similar words occur in similar contexts"
Contributions
1) Identifying the existence of new hyperparameters
• Not always mentioned in papers
2) Adapting the hyperparameters across algorithms
• Must understand the mathematical relation between algorithms
3) Comparing algorithms across all hyperparameter settings
• Over 5,000 experiments
What is word2vec?
• word2vec is not a single algorithm
• It is a software package for representing words as vectors, containing:
  • Two distinct models
    • CBoW
    • Skip-Gram (SG)
  • Various training methods
    • Negative Sampling (NS)
    • Hierarchical Softmax
  • A rich preprocessing pipeline
    • Dynamic Context Windows
    • Subsampling
    • Deleting Rare Words
Skip-Grams with Negative Sampling (SGNS)
Marco saw a furry little wampimuk hiding in the tree.
Extracted (word, context) pairs, collected into the data D:
  wampimuk → furry
  wampimuk → little
  wampimuk → hiding
  wampimuk → in
  …
"word2vec Explained…", Goldberg & Levy, arXiv 2014
Skip-Grams with Negative Sampling (SGNS)
• SGNS finds a vector 𝑤 for each word 𝑤 in our vocabulary 𝑉𝑊
• Each such vector has 𝑑 latent dimensions (e.g. 𝑑 = 100)
• Effectively, it learns a matrix 𝑊 (of size 𝑉𝑊 × 𝑑) whose rows represent 𝑉𝑊
• Key point: it also learns a similar auxiliary matrix 𝐶 (of size 𝑉𝐶 × 𝑑) of context vectors
• In fact, each word has two embeddings, and 𝑊 ≠ 𝐶, e.g.:
  𝑤:wampimuk = (−3.1, 4.15, 9.2, −6.5, …)
  𝑐:wampimuk = (−5.6, 2.95, 1.4, −1.3, …)
"word2vec Explained…", Goldberg & Levy, arXiv 2014
Skip-Grams with Negative Sampling (SGNS)
• Maximize: σ(𝑤 ⋅ 𝑐) for every context 𝑐 that was observed with 𝑤:
  (wampimuk, furry), (wampimuk, little), (wampimuk, hiding), (wampimuk, in)
• Minimize: σ(𝑤 ⋅ 𝑐′) for every context 𝑐′ that was hallucinated with 𝑤:
  (wampimuk, Australia), (wampimuk, cyber), (wampimuk, the), (wampimuk, 1985)
"word2vec Explained…", Goldberg & Levy, arXiv 2014
Skip-Grams with Negative Sampling (SGNS)
• "Negative Sampling"
• SGNS samples 𝑘 contexts 𝑐′ at random as negative examples
• "Random" = the unigram distribution over contexts:
  P(𝑐) = #𝑐 / |D|
• Spoiler: Changing this distribution has a significant effect
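To make the objective on the preceding slides concrete, here is a minimal NumPy sketch of a single SGNS update for one observed (word, context) pair. The function names, learning rate, and toy setup are illustrative assumptions, not word2vec's actual code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(W, C, w_id, c_id, unigram_probs, k=5, lr=0.025, rng=np.random):
    """One SGNS update for an observed (word, context) pair.
    W, C: word and context embedding matrices (vocabulary x d).
    unigram_probs: distribution used to draw the k negative contexts."""
    # k "hallucinated" contexts, drawn from the (possibly smoothed) unigram distribution
    neg_ids = rng.choice(len(unigram_probs), size=k, p=unigram_probs)

    w = W[w_id].copy()
    grad_w = np.zeros_like(w)

    # push sigma(w . c) towards 1 for the observed context
    g = 1.0 - sigmoid(w @ C[c_id])
    grad_w += g * C[c_id]
    C[c_id] += lr * g * w

    # push sigma(w . c') towards 0 for each sampled negative context
    for n_id in neg_ids:
        g = -sigmoid(w @ C[n_id])
        grad_w += g * C[n_id]
        C[n_id] += lr * g * w

    W[w_id] += lr * grad_w
```

Sampling the negatives from the unigram distribution (rather than uniformly) is exactly the choice the "Spoiler" bullet refers to; smoothing that distribution is revisited later as a hyperparameter.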
What is SGNS learning?
• Levy & Goldberg (NIPS 2014) prove that, for a large enough 𝑑 and enough iterations, the optimum satisfies
  𝑤 ⋅ 𝑐 = PMI(𝑤, 𝑐) − log 𝑘,   where 𝑘 is the number of negative samples
• Equivalently, 𝑊 (𝑉𝑊 × 𝑑) times 𝐶ᵀ (𝑑 × 𝑉𝐶) equals M^PMI − log 𝑘: the word-context PMI matrix, shifted by a global constant
"Neural Word Embeddings as Implicit Matrix Factorization", Levy & Goldberg, NIPS 2014
New Hyperparameters
• Preprocessing
• Dynamic Context Windows (dyn, win)
• Subsampling (sub)
• Deleting Rare Words (del)
• Postprocessing
• Adding Context Vectors (w+c)
• Eigenvalue Weighting (eig)
• Vector Normalization (nrm)
• Association Metric
• Shifted PMI (neg)
• Context Distribution Smoothing (cds)
Dynamic Context Windows
Marco saw a furry little wampimuk hiding in the tree.
The probabilities (or weights) with which each specific context word is included in the training data, for window size 4 centered on "wampimuk":
• word2vec (samples the window size):  saw 1/4, a 2/4, furry 3/4, little 4/4 | hiding 4/4, in 3/4, the 2/4, tree 1/4
• GloVe (weights co-occurrences by 1/distance):  saw 1/4, a 1/3, furry 1/2, little 1/1 | hiding 1/1, in 1/2, the 1/3, tree 1/4
The Word-Space Model (Sahlgren, 2006)
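Below is a minimal sketch of word2vec-style dynamic context windows, assuming a pre-tokenized corpus; extract_pairs is a hypothetical helper name, not part of any released toolkit.

```python
import random

def extract_pairs(tokens, max_win=4, dynamic=True, rng=random):
    """Yield (word, context) pairs. With dynamic=True, the effective window
    for each token is sampled uniformly from 1..max_win, so a context at
    distance d is kept with probability (max_win - d + 1) / max_win,
    reproducing the 1/4, 2/4, 3/4, 4/4 weighting above for max_win = 4."""
    for i, word in enumerate(tokens):
        win = rng.randint(1, max_win) if dynamic else max_win
        for j in range(max(0, i - win), min(len(tokens), i + win + 1)):
            if j != i:
                yield word, tokens[j]

# Example: context words sampled around "wampimuk"
sent = "marco saw a furry little wampimuk hiding in the tree".split()
print([c for w, c in extract_pairs(sent) if w == "wampimuk"])
```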
Subsampling
• Randomly removes words that are more frequent than a threshold t: a token of a word with corpus frequency f is removed with probability p = 1 − sqrt(t / f)
• t = 10⁻⁵ in the experiments
• In practice this mostly removes very frequent (stop) words
• The removal of tokens is done before the corpus is processed into word-context pairs (a sketch follows below)
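A minimal sketch of this subsampling step, using the removal probability 1 − sqrt(t/f) from the slide with f taken as relative frequency; subsample is a hypothetical helper name.

```python
import math
import random

def subsample(tokens, t=1e-5, rng=random):
    """Drop frequent tokens before window extraction: a token of a word
    with relative frequency f is removed with probability 1 - sqrt(t / f)."""
    total = len(tokens)
    counts = {}
    for w in tokens:
        counts[w] = counts.get(w, 0) + 1
    kept = []
    for w in tokens:
        f = counts[w] / total
        p_remove = 1.0 - math.sqrt(t / f) if f > t else 0.0
        if rng.random() >= p_remove:
            kept.append(w)
    return kept
```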
Delete Rare Words
• Ignore words that are rare in the training corpus
• These tokens are removed from the corpus before creating context windows
• This narrows the distances between the remaining tokens
• As a result, it inserts word-context pairs that did not exist in the original corpus with the same window size (see the sketch below)
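A corresponding sketch for rare-word deletion, again applied before window extraction; delete_rare and min_count are illustrative assumptions.

```python
def delete_rare(tokens, min_count=5):
    """Remove tokens of words seen fewer than min_count times, *before*
    window extraction, so the surviving words move closer together and
    can form new word-context pairs."""
    counts = {}
    for w in tokens:
        counts[w] = counts.get(w, 0) + 1
    return [w for w in tokens if counts[w] >= min_count]
```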
Adding Context Vectors
• SGNS creates word vectors 𝑤
• SGNS also creates auxiliary context vectors 𝑐
• So do GloVe and SVD
• Instead of just 𝑤, represent a word as: 𝑤 + 𝑐
• Introduced by Pennington et al. (2014)
• In that work, applied only to GloVe (a sketch follows below)
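A minimal sketch of the w+c post-processing, assuming the rows of W and C are aligned to the same vocabulary; the helper names are hypothetical.

```python
import numpy as np

def combined_representation(W, C):
    """Post-processing variant w+c: represent each word by the sum of its
    word vector and its context vector (rows of W and C must correspond
    to the same vocabulary, in the same order)."""
    return W + C

# Similarity is then computed on the combined vectors, e.g. with cosine:
def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```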
Eigenvalue Weighting
• Truncated SVD factorizes the (P)PMI matrix as M ≈ U_d ⋅ Σ_d ⋅ V_dᵀ; the word vectors (W) and context vectors (C) derived from it are typically represented by
  W = U_d ⋅ Σ_d    C = V_d
• Add a parameter p to control the weight of the eigenvalue matrix Σ:
  W = U_d ⋅ Σ_d^p  (e.g. p = 1, 0.5, or 0; see the sketch below)
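A minimal sketch of SVD-based embeddings with eigenvalue weighting, assuming a dense PPMI matrix M for simplicity (realistic vocabularies would need a sparse solver such as scipy.sparse.linalg.svds); svd_embeddings is a hypothetical helper name.

```python
import numpy as np

def svd_embeddings(M, d=100, p=0.5):
    """Factorize a PPMI matrix M with truncated SVD and weight the
    singular values by the exponent p (the eig hyperparameter):
    p = 1 is the 'correct' SVD; p = 0.5 and p = 0 are the symmetric
    variants that the paper finds work better for similarity tasks."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    W = U[:, :d] * (S[:d] ** p)   # word vectors: U_d * Sigma_d^p
    C = Vt[:d].T                  # context vectors: V_d
    return W, C
```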
Vector Normalization (nrm)
• Normalize to unit length (L2 normalization)
• Different types:
• Row
• Column
• Both
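A one-function sketch of the standard row (L2) normalization; normalize_rows is a hypothetical helper name.

```python
import numpy as np

def normalize_rows(W, eps=1e-8):
    """L2-normalize each row (word vector) to unit length, the standard
    nrm setting; with unit rows, the dot product equals cosine similarity."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W / np.maximum(norms, eps)
```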
k in SGNS learning?
• k is the number of negative samples; at the SGNS optimum:
  𝑤 ⋅ 𝑐 = PMI(𝑤, 𝑐) − log 𝑘
• So k also shifts the PMI matrix; count-based methods can imitate this with Shifted PPMI: max(PMI(𝑤, 𝑐) − log 𝑘, 0)  (see the sketch below)
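A minimal sketch of the count-based analogue, Shifted PPMI, built from a dense word-by-context co-occurrence matrix; shifted_ppmi is a hypothetical helper name, and real corpora would require sparse matrices.

```python
import numpy as np

def shifted_ppmi(counts, k=5):
    """Build a Shifted PPMI matrix max(PMI(w, c) - log k, 0) from a dense
    co-occurrence count matrix (rows = words, columns = contexts)."""
    total = counts.sum()
    p_w = counts.sum(axis=1, keepdims=True) / total
    p_c = counts.sum(axis=0, keepdims=True) / total
    p_wc = counts / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0          # undefined cells (zero counts) -> 0
    return np.maximum(pmi - np.log(k), 0)
```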
Context Distribution Smoothing
• The original calculation of PMI:
  PMI(𝑤, 𝑐) = log [ P(𝑤, 𝑐) / (P(𝑤) ⋅ P(𝑐)) ]
• If 𝑐 is a rare word, P(𝑐) is tiny, so PMI(𝑤, 𝑐) can still be very high
• How to solve this? Smooth P(𝑐) with an exponent α (= 0.75):
  P(𝑐) = #𝑐 / Σ_{𝑐′ ∈ V_C} #𝑐′        P_α(𝑐) = (#𝑐)^α / Σ_{𝑐′ ∈ V_C} (#𝑐′)^α
Context Distribution Smoothing
• We can adapt context distribution smoothing to PMI!
• Replace P(𝑐) with P_0.75(𝑐):
  PMI_0.75(𝑤, 𝑐) = log [ P(𝑤, 𝑐) / (P(𝑤) ⋅ P_0.75(𝑐)) ]
• Consistently improves PMI on every task
• Always use Context Distribution Smoothing! (see the sketch below)
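A minimal sketch of PMI with context distribution smoothing, mirroring the formula above; cds_pmi and the dense-matrix setup are illustrative assumptions.

```python
import numpy as np

def cds_pmi(counts, alpha=0.75):
    """PMI with context distribution smoothing: the context marginal is
    computed from counts raised to alpha (0.75), which deflates the PMI
    of rare contexts."""
    total = counts.sum()
    p_wc = counts / total
    p_w = counts.sum(axis=1, keepdims=True) / total
    c_counts = counts.sum(axis=0, keepdims=True)
    p_c_alpha = (c_counts ** alpha) / (c_counts ** alpha).sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c_alpha))
    pmi[~np.isfinite(pmi)] = 0.0
    return pmi
```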
Experiment and Result
• Experiment Setup
• Result
Experiment Setup
• 9 Hyperparameters
• 6 New
• 4 Word Representation Algorithms
• PPMI
• SVD
• SGNS
• GloVe
• 8 Benchmarks
• 6 Word Similarity Tasks
• 2 Analogy Tasks
• 5,632 experiments
Experiment Setup
• Word Similarity
• WordSim353 (Finkelstein et al., 2002) partitioned into two datasets, WordSim
Similarity and WordSim Relatedness (Zesch et al., 2008; Agirre et al., 2009);
• Bruni et al.’s (2012) MEN dataset;
• Radinsky et al.’s (2011) Mechanical Turk dataset
• Luong et al.’s (2013) Rare Words dataset
• Hill et al.’s (2014) SimLex-999 dataset
• All these datasets contain word pairs together with human-assigned
similarity scores.
• The word vectors are evaluated by ranking the pairs according to their
cosine similarities, and measuring the correlation with the human ratings.
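A minimal sketch of this evaluation protocol, using Spearman correlation from SciPy; evaluate_similarity and the vocab mapping are illustrative assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(W, vocab, pairs, human_scores):
    """Rank word pairs by cosine similarity and report Spearman's rho
    against the human ratings. vocab maps words to row indices of W;
    pairs whose words are out of vocabulary are skipped."""
    sims, gold = [], []
    for (a, b), score in zip(pairs, human_scores):
        if a in vocab and b in vocab:
            u, v = W[vocab[a]], W[vocab[b]]
            sims.append(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
            gold.append(score)
    rho, _ = spearmanr(sims, gold)
    return rho
```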
Experiment Setup
• Analogy
• MSR’s analogy dataset (Mikolov et al., 2013c)
• Google’s analogy dataset (Mikolov et al., 2013a)
• The two analogy datasets present questions of the form "a is to a* as b is to b*", where b* is hidden and must be guessed from the entire vocabulary (see the sketch below).
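A minimal sketch of answering such questions with the additive (3CosAdd) rule over L2-normalized vectors; solve_analogy is a hypothetical helper name, and the paper also reports a multiplicative (3CosMul) variant.

```python
import numpy as np

def solve_analogy(W_norm, vocab, a, a_star, b):
    """Guess b* for 'a is to a* as b is to b*' with the 3CosAdd rule:
    argmax over the vocabulary of cos(x, a*) - cos(x, a) + cos(x, b).
    W_norm must have L2-normalized rows; the three query words themselves
    are excluded from the candidates."""
    id2word = {i: w for w, i in vocab.items()}
    target = W_norm[vocab[a_star]] - W_norm[vocab[a]] + W_norm[vocab[b]]
    scores = W_norm @ target
    for w in (a, a_star, b):
        scores[vocab[w]] = -np.inf
    return id2word[int(np.argmax(scores))]
```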
Results
• Since time is limited, let's jump straight to the results
Overall Results
• Hyperparameters often have stronger effects than algorithms
• Hyperparameters often have stronger effects than more data
• Prior superiority claims were not accurate
• If time permits, I will show some details
Results: Hyper-parameters vs. Algorithms
(result tables for the hyper-parameter settings, the hyper-parameter vs. algorithm comparison, and SVD are omitted here)
• No dominant method
Practical Guide
• Always use context distribution smoothing (cds = 0.75) to modify
PMI.
• Do not use SVD "correctly" (eig = 1). Instead, use one of the symmetric variants (eig = 0.5 or 0).
• SGNS is a robust baseline. Moreover, it is the fastest method to train, and cheapest (by far) in terms of disk space and memory consumption.
• With SGNS, prefer many negative samples.
• For both SGNS and GloVe, it is worthwhile to experiment with the w+c variant.