On the Role of Seed Lexicons in Learning Bilingual Word Embeddings
Ivan Vulić and Anna Korhonen, University of Cambridge
[email protected]
ACL 2016; Berlin; August 8, 2016

Word Embeddings

Dense representations → real-valued low-dimensional vectors
Word embedding induction → learn word-level features which generalise well across tasks and languages
Word embeddings capture interesting and universal regularities.

Motivation

The NLP community has developed useful features for several tasks, but finding features that are...
1. task-invariant (POS tagging, SRL, NER, parsing, ...) (monolingual word embeddings)
2. language-invariant (English, Dutch, Chinese, Spanish, ...) (bilingual word embeddings → this talk)
...is non-trivial and time-consuming (20+ years of feature engineering...)
Goal: learn word-level features which generalise across tasks and languages.

Word Embeddings

Representation of each word w ∈ V: vec(w) = [f_1, f_2, ..., f_dim]
Word representations live in the same shared semantic (or embedding) space!
[Image courtesy of Gouws et al., ICML 2015]

Bilingual Word Embeddings (BWEs)

Representation of a word w_1^S ∈ V^S: vec(w_1^S) = [f_1^1, f_2^1, ..., f_dim^1]
Exactly the same kind of representation for w_2^T ∈ V^T: vec(w_2^T) = [f_1^2, f_2^2, ..., f_dim^2]
Language-independent word representations in the same shared semantic (or embedding) space!

Monolingual vs. Bilingual

Q1 → How do we align semantic spaces in two different languages?
Q2 → Which bilingual signals are used for the alignment?
See also: [Upadhyay et al.: Cross-Lingual Models of Word Embeddings: An Empirical Comparison; ACL 2016]

Two desirable properties:
P1 → Leverage (large) monolingual training sets, tied together through a bilingual signal, to learn a shared space in a scalable and widely applicable manner across languages and domains
P2 → Use as inexpensive a bilingual signal as possible

BWEs and Bilingual Signals

(Type 1) Jointly learn and align BWEs using parallel-only data [Hermann and Blunsom, ACL 2014; Chandar et al., NIPS 2014]
(Type 2) Jointly learn and align BWEs using monolingual and parallel data [Gouws et al., ICML 2015; Soyer et al., ICLR 2015; Shi et al., ACL 2015]
(Type 3) Learn BWEs from comparable document-aligned data [Vulić and Moens, ACL 2015, JAIR 2016]
(Type 4) Align pretrained monolingual embedding spaces using seed lexicons [Mikolov et al., arXiv 2013; Lazaridou et al., ACL 2015]

BWEs: Type 4 (Post-Hoc Mapping with Seed Lexicons)

Bilingual signal → word translation pairs
Learn to transform the pre-trained source-language embeddings into a space where the distance between each word and its translation pair is minimised.

Key Question → Could BWE learning be improved by making more intelligent choices when deciding over seed lexicon entries?
We analyse a spectrum of seed lexicons with respect to controllable parameters such as:
lexicon source, lexicon size, translation method, translation pair reliability, ...

Basic Framework

Monolingual WE model → skip-gram with negative sampling (SGNS) [Mikolov et al., NIPS 2013]
Bilingual signal → N word translation pairs (x_i, y_i), i = 1, ..., N
Transformation between spaces → we assume a linear mapping [Mikolov et al., arXiv 2013; Dinu et al., ICLR WS 2015]:

min_{W ∈ R^{d_S × d_T}} ||XW − Y||_F^2 + λ||W||_F^2

X → source-language vectors for the words in the training set
Y → target-language vectors for the words in the training set
W → translation (or transformation) matrix
(N.B.: a max-margin framework [Lazaridou et al., ACL 2015] yields similar insights.)

A Hybrid Model: Type 3 + Type 4

A type-hybrid procedure which retains only highly reliable translation pairs obtained by a Type 3 model as a seed lexicon for Type 4 models satisfies both P1 and P2.
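The regularised objective above is ridge regression with a matrix-valued target, so W has a closed-form solution. A minimal sketch in numpy, with toy random matrices standing in for the real SGNS embeddings (`learn_mapping` is an illustrative name, not from the paper):

```python
import numpy as np

def learn_mapping(X, Y, lam=1.0):
    """Solve min_W ||XW - Y||_F^2 + lam * ||W||_F^2 in closed form.

    X: (N, dS) source-language vectors of the N seed lexicon entries
    Y: (N, dT) target-language vectors of their translations
    Returns W: (dS, dT), the translation (transformation) matrix.
    """
    dS = X.shape[1]
    # Ridge solution: W = (X^T X + lam * I)^(-1) X^T Y
    return np.linalg.solve(X.T @ X + lam * np.eye(dS), X.T @ Y)

# Toy check: if the target vectors really are a linear map of the
# source vectors, the mapping is recovered (up to regularisation).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
W_true = rng.normal(size=(4, 3))
Y = X @ W_true

W = learn_mapping(X, Y, lam=1e-6)
print(np.allclose(W, W_true, atol=1e-3))  # prints True
```

At test time, a source word vector x is mapped as xW and its translation is retrieved as the nearest target-language neighbour of xW.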
Type 3 model used: [Vulić and Moens, JAIR 2016]

Seed Lexicon Source and Translation Method

Previous work → 5K most frequent words translated using a dictionary or Google Translate (GT)
To simulate this setup:
(1) Start from the BNC frequency list of the 6,318 most frequent English lemmas [Kilgarriff, Journal of Lexicography 1997]
(2) Translate them to other languages using GT → BNC+GT
Why not translate the BNC list using a Type 3 model? → BNC+HYB
Or use the frequency list of a Type 3 model? → HFQ+HYB
Or simply use words shared between the two languages? [Kiros et al., NIPS 2015] → ORTHO

Seed Lexicon Size

Previous work → typically 5K training pairs
We also investigate more extreme settings:
Limited setting: only 100-500 pairs?
Testing the "the more the merrier" hypothesis → 40K-50K training pairs?

Translation Pair Reliability

Using a Type 3 model, it is possible to control the reliability of induced translation pairs.
The symmetry constraint → use only pairs that are mutual nearest neighbours as training pairs → BNC+HYB+SYM and HFQ+HYB+SYM
Without the constraint → BNC+HYB+ASYM and HFQ+HYB+ASYM
Symmetry with a threshold → an even more conservative reliability criterion:
sim(x_i, y_i) − sim(x_i, z_i) > THR
sim(y_i, x_i) − sim(y_i, w_i) > THR
(z_i and w_i are the runner-up translation candidates in each direction)

Experimental Setup

Task → bilingual lexicon learning (BLL)
Goal → build a non-probabilistic bilingual lexicon of word translations
Test sets → ground-truth word translation pairs built for three language pairs: Spanish (ES)-English (EN), Dutch (NL)-EN and Italian (IT)-EN [Vulić and Moens, NAACL 2013, EMNLP 2013] (similar relative performance on other BLL test sets)
Evaluation metric → top-1 accuracy (Acc1) (similar model rankings with Acc5 and Acc10)

Baseline BWE Models

Type 1 → BiCVM [Hermann and Blunsom, ACL 2014]
Type 2 → BilBOWA [Gouws et al., ICML 2015]
Type 3 → BWESG with length-ratio shuffle [Vulić and Moens, JAIR 2016]
Type 4 → linear mapping (BNC+GT) [Mikolov et al., arXiv 2013; Dinu et al., ICLR WS 2015]
→ All baselines trained with the standard suggested settings (more in the paper)
→ Baselines use similar training data as our Type 4 models, e.g., Polyglot Wiki plus Europarl for BilBOWA, document-aligned LinguaTools Wiki for BWESG

Training Setup and Data (Our Models)

Monolingual SGNS on Polyglot Wikipedias
Standard pre-processing and SGNS hyper-parameters (window size: 4)
We report results with d = 300 for all models (similar results with d = 40, 64, 500)

Ranked Lists with Different Seed Lexicons

[Table: top-ranked English translations of the Spanish word "casamiento" under each seed lexicon (BNC+GT, BNC+HYB+ASYM, BNC+HYB+SYM, HFQ+HYB+ASYM, HFQ+HYB+SYM, ORTHO). The GT and HYB lexicons all rank marriage-related words at the top (marriage, marry, marrying, wedding, wed, betrothal, elopement), while ORTHO retrieves unrelated words (maría, señor, doña, juana, noche, amor, guerra).]

Experiments

Experiment I: Standard BLL Setting (5K seed lexicons)
[Table: Acc1 scores on ES-EN, NL-EN and IT-EN for the baselines (BiCVM, BilBOWA, BWESG, BNC+GT) and our seed lexicons (ORTHO, BNC+HYB+ASYM, BNC+HYB+SYM, HFQ+HYB+ASYM, HFQ+HYB+SYM); SYM lexicon sizes are 3388, 2738 and 3145 pairs; * marks significant improvements. ORTHO scores very low on two of the three pairs (0.233 and 0.224), while the +SYM hybrid lexicons obtain the best scores overall.]
→ Document-level semantic spaces can provide seed lexicons
→ Reliability matters

Experiment II: Lexicon Size (Spanish-English and Dutch-English)
[Figures: Acc1 scores against lexicon size (0.1k to 50k training pairs) for BNC+GT, BNC+HYB+ASYM, BNC+HYB+SYM, HFQ+HYB+ASYM, HFQ+HYB+SYM and ORTHO.]
→ BNC+SYM and HFQ+SYM are the best models overall

Experiment III: Translation Pair Reliability (Spanish-English and Dutch-English)
[Figures: Acc1 scores against lexicon size (1k to 40k training pairs) for THR = None, 0.01, 0.025, 0.05, 0.075 and 0.1.]
→ Stricter selection criteria can help (but not necessarily)

Experiment IV: Suggesting Word Translations in Context (another task; 6K seed lexicons)
[Table: scores on ES-EN, NL-EN and IT-EN for the baselines, our seed lexicons (SYM lexicon sizes: 3839, 3117 and 3693 pairs; HFQ+HYB+SYM also with THR = None, 0.01 and 0.025), a no-context baseline and the best system of [Vulić and Moens, EMNLP 2014]; * marks significant improvements.]

Conclusion and Future Work

Type 4 BWE models (post-hoc mapping with seed lexicons) are very effective, but...
The choice of training pairs and their reliability matter.
(Excellent results with a hybrid BWE model that can train on monolingual data and use only document alignments as supervision)
More sophisticated reliability measures? Other models of pair selection?
Other context types and mapping functions?
Other languages? Language pairs with scarce resources?

Questions?
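The symmetry-with-threshold criterion from the reliability experiments can be sketched as follows: keep a pair only if the two words are mutual nearest neighbours and the similarity margin over the runner-up candidate exceeds THR in both translation directions. A minimal illustration (numpy; `reliable_pairs` is an illustrative name, and `S`, `T` are toy unit-normalised vectors, not the paper's actual embeddings):

```python
import numpy as np

def reliable_pairs(S, T, thr=0.0):
    """Return (source index, target index) pairs that are mutual nearest
    neighbours AND beat the second-best candidate by more than `thr`
    in both translation directions.

    S: (nS, d) source vectors; T: (nT, d) target vectors (unit norm,
    so dot products are cosine similarities; nS, nT >= 2 assumed).
    """
    sim = S @ T.T                    # sim[i, j] = sim(s_i, t_j)
    best_t = sim.argmax(axis=1)      # nearest target for each source word
    best_s = sim.argmax(axis=0)      # nearest source for each target word
    pairs = []
    for i, j in enumerate(best_t):
        if best_s[j] != i:           # symmetry constraint violated
            continue
        runner_t = np.sort(sim[i])[-2]     # second-best target for s_i
        runner_s = np.sort(sim[:, j])[-2]  # second-best source for t_j
        if sim[i, j] - runner_t > thr and sim[i, j] - runner_s > thr:
            pairs.append((i, int(j)))
    return pairs

S = np.array([[1.0, 0.0], [0.0, 1.0]])
T = np.array([[0.98, 0.199], [0.199, 0.98]])
print(reliable_pairs(S, T, thr=0.5))   # [(0, 0), (1, 1)]
print(reliable_pairs(S, T, thr=0.9))   # [] - margin too small
```

Raising `thr` trades lexicon size for reliability, which is exactly the tension Experiment III probes.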