Improving Distributional Similarity with Lessons Learned from Word Embeddings (Levy, Goldberg & Dagan, TACL 2015)
Presented by Jiaxing Tan
Some slides from the original paper presentation

Outline
• Background
• Hyperparameters to experiment with
• Experiments and results

Motivation for word vector representations
• Compare word similarity and relatedness
  • How similar is iPhone to iPad?
  • How related is AlphaGo to Google?
• In a search engine, queries and documents are represented as vectors so that their similarity can be compared
• Representing words as vectors allows easy computation of similarity

Approaches for Representing Words
• Distributional Semantics (Count)
  • Used since the 90's
  • Sparse word-context PMI/PPMI matrix
  • Decomposed with SVD (dense)
• Word Embeddings (Predict)
  • Inspired by deep learning
  • word2vec (SGNS) (Mikolov et al., 2013)
  • GloVe (Pennington et al., 2014)
• Underlying theory: the Distributional Hypothesis (Harris, '54; Firth, '57): "Similar words occur in similar contexts"

Contributions
1) Identifying the existence of new hyperparameters
  • Not always mentioned in papers
2) Adapting the hyperparameters across algorithms
  • Requires understanding the mathematical relation between the algorithms
3) Comparing algorithms across all hyperparameter settings
  • Over 5,000 experiments

What is word2vec?
• word2vec is not a single algorithm
• It is a software package for representing words as vectors, containing:
  • Two distinct models: CBoW and Skip-Gram (SG)
  • Various training methods: Negative Sampling (NS) and Hierarchical Softmax
  • A rich preprocessing pipeline: dynamic context windows, subsampling, deleting rare words

Skip-Grams with Negative Sampling (SGNS)
"Marco saw a furry little wampimuk hiding in the tree."
• With a context window of 2, the target word "wampimuk" yields the word-context pairs (wampimuk, furry), (wampimuk, little), (wampimuk, hiding), (wampimuk, in), which are added to the training data D
("word2vec Explained…", Goldberg & Levy, arXiv 2014)

Skip-Grams with Negative Sampling (SGNS)
• SGNS finds a vector w for each word w in our vocabulary V_W
• Each such vector has d latent dimensions (e.g. d = 100)
• Effectively, it learns a matrix W of size |V_W| × d whose rows represent the words of V_W
• Key point: it also learns a similar auxiliary matrix C of size |V_C| × d of context vectors
• In fact, each word has two embeddings, and in general W ≠ C, e.g.
  w:wampimuk = (−3.1, 4.15, 9.2, −6.5, …)
  c:wampimuk = (−5.6, 2.95, 1.4, −1.3, …)

Skip-Grams with Negative Sampling (SGNS)
• Maximize σ(w · c) for pairs (w, c) that were observed in the data, e.g.
  (wampimuk, furry), (wampimuk, little), (wampimuk, hiding), (wampimuk, in)
• Minimize σ(w · c′) for pairs (w, c′) that were "hallucinated", e.g.
  (wampimuk, Australia), (wampimuk, cyber), (wampimuk, the), (wampimuk, 1985)

Skip-Grams with Negative Sampling (SGNS)
• "Negative Sampling": SGNS samples k contexts c′ at random as negative examples
• "Random" = the unigram distribution P(c) = #c / |D|
• Spoiler: changing this distribution has a significant effect
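To make the pair extraction and negative sampling concrete, below is a minimal sketch in plain Python. It is illustrative only (hypothetical helper names, a fixed rather than dynamic window, no training loop), not the actual word2vec code.

```python
import random
from collections import Counter

def skipgram_pairs(tokens, window=2):
    """Collect (word, context) pairs within a fixed-size window around each token."""
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((w, tokens[j]))
    return pairs

def with_negatives(pairs, k=5, seed=0):
    """Attach k negative contexts per pair, drawn from the (unsmoothed) unigram
    distribution over contexts, P(c) = #c / |D|."""
    rng = random.Random(seed)
    counts = Counter(c for _, c in pairs)
    contexts, weights = zip(*counts.items())
    return [(w, c, rng.choices(contexts, weights=weights, k=k)) for w, c in pairs]

tokens = "marco saw a furry little wampimuk hiding in the tree".split()
examples = with_negatives(skipgram_pairs(tokens, window=2), k=5)
```

In word2vec itself the negatives are drawn from statistics over the whole corpus, and (as discussed later) from a smoothed version of the unigram distribution; here the sampling pool is just the contexts of this toy sentence.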
What is SGNS learning?
• Levy & Goldberg prove that, for a large enough d and enough iterations, the optimum satisfies
  w · c = PMI(w, c) − log k
  where k is the number of negative samples
• In other words, SGNS implicitly factorizes the word-context PMI matrix, shifted by a global constant:
  W · Cᵀ = M^PMI − log k,  with W of size |V_W| × d and C of size |V_C| × d
("Neural Word Embeddings as Implicit Matrix Factorization", Levy & Goldberg, NIPS 2014)

New Hyperparameters
• Preprocessing
  • Dynamic Context Windows (dyn, win)
  • Subsampling (sub)
  • Deleting Rare Words (del)
• Postprocessing
  • Adding Context Vectors (w+c)
  • Eigenvalue Weighting (eig)
  • Vector Normalization (nrm)
• Association Metric
  • Shifted PMI (neg)
  • Context Distribution Smoothing (cds)

Dynamic Context Windows
"Marco saw a furry little wampimuk hiding in the tree."
• The probability that each specific context word is included in the training data, for a window of size 4 around "wampimuk":
               saw   a     furry  little  wampimuk  hiding  in    the   tree
    word2vec:  1/4   2/4   3/4    4/4     (target)  4/4     3/4   2/4   1/4
    GloVe:     1/4   1/3   1/2    1/1     (target)  1/1     1/2   1/3   1/4
• (In GloVe these act as 1/distance co-occurrence weights rather than sampling probabilities)
(The Word-Space Model, Sahlgren, 2006)

Subsampling
• Randomly removes words that are more frequent than some threshold t with probability p = 1 − sqrt(t / f), where f marks the word's corpus frequency
• t = 10⁻⁵ in the experiments
• In effect, this removes many occurrences of stop words
• The removal of tokens is done before the corpus is processed into word-context pairs, so subsampling also widens the effective context window

Deleting Rare Words
• Ignore words that are rare in the training corpus
• These tokens are removed from the corpus before creating the context windows
• This narrows the distance between the remaining tokens
• As a result, new word-context pairs that did not exist in the original corpus appear under the same window size

Adding Context Vectors
• SGNS creates word vectors w and auxiliary context vectors c; so do GloVe and SVD
• Instead of representing a word by w alone, represent it by w + c
• Introduced by Pennington et al. (2014); in the original work it was applied only to GloVe
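As a rough illustration of the w+c variant, here is a sketch assuming W and C are row-aligned |V| × d matrices produced by SGNS, SVD, or GloVe; the random matrices below are merely stand-ins for trained embeddings.

```python
import numpy as np

# Stand-ins for trained embeddings: W holds word vectors, C holds context vectors,
# with row i of both matrices corresponding to the same vocabulary item.
rng = np.random.default_rng(0)
V, d = 10_000, 300
W = rng.normal(size=(V, d))
C = rng.normal(size=(V, d))

# The w+c variant represents each word by the sum of its two embeddings.
W_plus_C = W + C

# Similarity queries then use the combined vectors exactly as before.
def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim = cosine(W_plus_C[0], W_plus_C[1])
```

Because the combined representation mixes the word and context spaces, its cosine similarities incorporate first-order terms (w · c) on top of the usual second-order ones, which is why the variant can help on some tasks and hurt on others.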
Eigenvalue Weighting
• The word matrix W and the context matrix C derived with SVD are typically taken as
  W^SVD = U_d · Σ_d,  C^SVD = V_d
• Add a parameter p to control the eigenvalue matrix Σ:
  W = U_d · Σ_d^p
• p = 1 is the factorization SVD actually provides; p = 0.5 and p = 0 are symmetric variants that often work better for similarity tasks

Vector Normalization (nrm)
• Normalize the vectors to unit length (L2 normalization)
• Different variants: normalize rows, columns, or both

k in SGNS learning
• k is the number of negative samples
  SGNS: w · c = PMI(w, c) − log k
• Besides being a sampling parameter, k therefore also shifts the PMI matrix by − log k
• This shift, − log k, can also be applied directly to the count-based PPMI matrix (the "Shifted PMI (neg)" hyperparameter)

Context Distribution Smoothing
• The original definition of PMI:
  PMI(w, c) = log [ P(w, c) / (P(w) · P(c)) ]
• If c is a rare word, P(c) is tiny and the PMI can still be very high, even for uninformative contexts
• How to solve this? Smooth P(c) with an exponent α (= 0.75):
  P(c) = #c / Σ_{c′ ∈ V_C} #c′
  P_α(c) = (#c)^α / Σ_{c′ ∈ V_C} (#c′)^α

Context Distribution Smoothing
• We can adapt context distribution smoothing to PMI!
• Replace P(c) with P_0.75(c):
  PMI_0.75(w, c) = log [ P(w, c) / (P(w) · P_0.75(c)) ]
• Consistently improves PMI on every task
• Always use context distribution smoothing!
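To ground the association-metric hyperparameters, here is a minimal sketch (plain NumPy, dense matrices, illustrative function name) of building a PPMI matrix with context distribution smoothing and an optional shift from a list of (word, context) pairs such as those produced earlier:

```python
import numpy as np

def ppmi_matrix(pairs, alpha=0.75, shift_k=1):
    """Build a dense PPMI matrix from (word, context) pairs, with context
    distribution smoothing (cds = alpha) and an optional shift (neg = shift_k)."""
    words = sorted({w for w, _ in pairs})
    contexts = sorted({c for _, c in pairs})
    wi = {w: i for i, w in enumerate(words)}
    ci = {c: i for i, c in enumerate(contexts)}

    counts = np.zeros((len(words), len(contexts)))
    for w, c in pairs:
        counts[wi[w], ci[c]] += 1

    total = counts.sum()
    p_wc = counts / total                               # P(w, c)
    p_w = counts.sum(axis=1, keepdims=True) / total     # P(w)
    c_counts = counts.sum(axis=0, keepdims=True)
    p_c_alpha = c_counts**alpha / (c_counts**alpha).sum()  # P_alpha(c)

    with np.errstate(divide="ignore"):
        pmi = np.log(p_wc / (p_w * p_c_alpha))          # -inf where counts are zero
    return np.maximum(pmi - np.log(shift_k), 0), words, contexts

# Usage with pairs from the earlier sketch:
# M, rows, cols = ppmi_matrix(skipgram_pairs(tokens, window=2), alpha=0.75, shift_k=5)
```

Setting shift_k > 1 reproduces the shifted-PPMI (neg) variant, mirroring the − log k term that SGNS introduces through its k negative samples.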
Experiment and Result
• Experiment setup
• Results

Experiment Setup
• 9 hyperparameters (6 of them new)
• 4 word representation algorithms: PPMI, SVD, SGNS, GloVe
• 8 benchmarks: 6 word similarity tasks, 2 analogy tasks
• 5,632 experiments in total

Experiment Setup: Word Similarity
• WordSim353 (Finkelstein et al., 2002), partitioned into two datasets, WordSim Similarity and WordSim Relatedness (Zesch et al., 2008; Agirre et al., 2009)
• Bruni et al.'s (2012) MEN dataset
• Radinsky et al.'s (2011) Mechanical Turk dataset
• Luong et al.'s (2013) Rare Words dataset
• Hill et al.'s (2014) SimLex-999 dataset
• All these datasets contain word pairs together with human-assigned similarity scores
• The word vectors are evaluated by ranking the pairs according to their cosine similarities and measuring the correlation with the human ratings

Experiment Setup: Analogies
• MSR's analogy dataset (Mikolov et al., 2013c)
• Google's analogy dataset (Mikolov et al., 2013a)
• Both datasets present questions of the form "a is to a* as b is to b*", where b* is hidden and must be guessed from the entire vocabulary

Results
• Time is limited, so let's jump straight to the results

Overall Results
• Hyperparameters often have stronger effects than algorithms
• Hyperparameters often have stronger effects than more data
• Prior superiority claims were not accurate
• If time allows, I will show some details

[Detailed result tables omitted in this text version: per-hyperparameter results, hyperparameter vs. algorithm comparisons, and SVD-specific results; no dominant method emerges.]

Practical Guide
• Always use context distribution smoothing (cds = 0.75) to modify PMI
• Do not use SVD "correctly" (eig = 1); instead, use one of the symmetric variants
• SGNS is a robust baseline; it is also the fastest method to train and by far the cheapest in terms of disk space and memory consumption
• With SGNS, prefer many negative samples
• For both SGNS and GloVe, it is worthwhile to experiment with the w+c variant
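As a closing illustration of the guide, the sketch below maps these recommendations onto one widely used SGNS trainer. It assumes gensim 4.x parameter names and is not part of the paper; treat it as a starting point rather than a definitive recipe.

```python
from gensim.models import Word2Vec

def train_sgns(sentences):
    """Train SGNS with settings in the spirit of the Practical Guide.
    `sentences` is assumed to be an iterable of token lists (a real tokenized corpus)."""
    return Word2Vec(
        sentences=sentences,
        sg=1,              # Skip-Gram with negative sampling
        negative=15,       # "prefer many negative samples" (neg)
        ns_exponent=0.75,  # context distribution smoothing (cds = 0.75)
        sample=1e-5,       # subsampling threshold t = 10^-5 (sub)
        window=5,          # (dynamic) context window size (win)
        min_count=5,       # delete rare words (del)
        vector_size=300,   # d
        workers=4,
    )

# Example usage (assuming `tokenized_corpus` is available):
# model = train_sgns(tokenized_corpus)
# model.wv.most_similar("iphone", topn=5)
```

Note that gensim's ns_exponent already defaults to 0.75, so context distribution smoothing is on by default there; the w+c variant would additionally require the context matrix, which the trainer keeps internally.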