On the Role of Seed Lexicons in Learning Bilingual Word Embeddings

Ivan Vulić and Anna Korhonen
University of Cambridge
[email protected]
ACL 2016; Berlin; August 8, 2016
Word Embeddings

Dense representations → real-valued low-dimensional vectors

Word embedding induction → learn word-level features which generalise well across tasks and languages

Word embeddings capture interesting and universal regularities (e.g., vec(king) − vec(man) + vec(woman) ≈ vec(queen))
Motivation

The NLP community has developed useful features for several tasks, but finding features that are...

1. task-invariant (POS tagging, SRL, NER, parsing, ...) (monolingual word embeddings)
2. language-invariant (English, Dutch, Chinese, Spanish, ...) (bilingual word embeddings → this talk)

...is non-trivial and time-consuming (20+ years of feature engineering...)

Learn word-level features which generalise across tasks and languages
Word Embeddings

Representation of each word w ∈ V:
vec(w) = [f_1, f_2, ..., f_dim]

Word representations in the same shared semantic (or embedding) space!

Image courtesy of [Gouws et al., ICML 2015]
Bilingual Word Embeddings (BWEs)

Representation of a word w_1^S ∈ V^S:
vec(w_1^S) = [f_1^1, f_2^1, ..., f_dim^1]

Exactly the same representation for w_2^T ∈ V^T:
vec(w_2^T) = [f_1^2, f_2^2, ..., f_dim^2]

Language-independent word representations in the same shared semantic (or embedding) space!
Bilingual Word Embeddings

Monolingual vs. Bilingual

Q1 → How to align semantic spaces in two different languages?
Q2 → Which bilingual signals are used for the alignment?

See also: [Upadhyay et al.: Cross-Lingual Models of Word Embeddings: An Empirical Comparison; ACL 2016]
Bilingual Word Embeddings

Two desirable properties:

P1 → Leverage (large) monolingual training sets tied together through a bilingual signal in order to learn a shared space in a scalable and widely applicable manner across languages and domains

P2 → Use as inexpensive a bilingual signal as possible
BWEs and Bilingual Signals

(Type 1) Jointly learn and align BWEs using parallel-only data
[Hermann and Blunsom, ACL 2014; Chandar et al., NIPS 2014]

(Type 2) Jointly learn and align BWEs using monolingual and parallel data
[Gouws et al., ICML 2015; Soyer et al., ICLR 2015; Shi et al., ACL 2015]

(Type 3) Learn BWEs from comparable document-aligned data
[Vulić and Moens, ACL 2015, JAIR 2016]

(Type 4) Align pretrained monolingual embedding spaces using seed lexicons
[Mikolov et al., arXiv 2013; Lazaridou et al., ACL 2015]
BWEs: Type 4
Post-Hoc Mapping with Seed Lexicons

Bilingual signal → word translation pairs

Learn to transform the pre-trained source-language embeddings into a space where the distance between a word and its translation pair is minimised
BWEs: Type 4
Post-Hoc Mapping with Seed Lexicons

Key Question → Could BWE learning be improved by making more intelligent choices when deciding over seed lexicon entries?

We analyse a spectrum of seed lexicons with respect to controllable parameters such as:
Lexicon source
Lexicon size
Translation method
Translation pair reliability
...
Basic Framework

Monolingual WE model → Skip-gram with negative sampling (SGNS)
[Mikolov et al., NIPS 2013]

Bilingual signal → N word translation pairs (x_i, y_i), i = 1, ..., N

Transformation between spaces → we assume a linear mapping
[Mikolov et al., arXiv 2013; Dinu et al., ICLR WS 2015]

min_{W ∈ R^{d_S × d_T}} ||XW − Y||²_F + λ||W||²_F

X → source-language vectors for words from the training set
Y → target-language vectors for words from the training set
W → translation (or transformation) matrix

(n.b.: the max-margin framework [Lazaridou et al., ACL 2015] yields similar insights)
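This ridge-regression objective has a closed-form solution, W = (XᵀX + λI)⁻¹XᵀY, which can be sketched in a few lines of NumPy (an illustrative sketch, not the authors' code; `learn_mapping` and the toy data are ours):

```python
import numpy as np

def learn_mapping(X, Y, lam=1.0):
    """Closed-form ridge solution to min_W ||XW - Y||_F^2 + lam * ||W||_F^2.

    X: (N, dS) source-language vectors for the seed translation pairs.
    Y: (N, dT) target-language vectors for the same pairs.
    Returns the mapping W of shape (dS, dT).
    """
    dS = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(dS), X.T @ Y)

# Toy sanity check: recover a known linear map from noisy seed pairs.
rng = np.random.default_rng(0)
W_true = rng.normal(size=(5, 4))
X = rng.normal(size=(200, 5))
Y = X @ W_true + 0.01 * rng.normal(size=(200, 4))
W = learn_mapping(X, Y, lam=0.1)
print(np.allclose(W, W_true, atol=0.1))  # prints True
```

At test time a source word is translated by mapping its vector through W and retrieving the nearest target-language vector.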
A Hybrid Model: Type 3 + Type 4

A type-hybrid procedure which retains only highly reliable translation pairs obtained by a Type 3 model as the seed lexicon for Type 4 models satisfies P1 and P2.

Type 3 model used: [Vulić and Moens, JAIR 2016]
Seed Lexicon Source and Translation Method

Previous work → 5K most frequent words translated using a dictionary or Google Translate (GT)

To simulate this setup:
(1) Start from the BNC frequency list of the 6,318 most frequent English lemmas
[Kilgarriff, Journal of Lexicography 1997]
(2) Translate them to other languages using GT → BNC+GT

Why not translate BNC using a Type 3 model? → BNC+HYB
Or use the frequency list of a Type 3 model? → HFQ+HYB
Or simply use words shared between two languages? → ORTHO
[Kiros et al., NIPS 2015]
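The last option can be sketched directly: take word forms that occur in both vocabularies as (trivial) translation pairs. This is our illustration of the ORTHO idea, not the paper's code; the `min_len` filter is an illustrative heuristic, not a setting from the paper:

```python
def ortho_lexicon(vocab_src, vocab_tgt, min_len=4):
    """Seed pairs from identically spelled words in both vocabularies
    (the ORTHO-style lexicon); min_len drops short, ambiguous strings."""
    shared = set(vocab_src) & set(vocab_tgt)
    return sorted((w, w) for w in shared if len(w) >= min_len)

# Dutch-English toy vocabularies: shared spellings become seed pairs.
print(ortho_lexicon(["de", "hotel", "water", "gaan"],
                    ["the", "hotel", "water", "go"]))
# [('hotel', 'hotel'), ('water', 'water')]
```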
Seed Lexicon Size

Previous work → typically 5K training pairs

We also investigate more extreme settings:
Limited setting: only 100-500 pairs?
Testing the "the more the merrier" hypothesis → 40K-50K training pairs?
Translation Pair Reliability

Using a Type 3 model, it is possible to control the reliability of induced translation pairs

The symmetry constraint → using only pairs that are mutual nearest neighbours as training pairs
→ BNC+HYB+SYM and HFQ+HYB+SYM

Without the constraint → BNC+HYB+ASYM and HFQ+HYB+ASYM

Symmetry with a threshold → even more conservative reliability criteria:
sim(x_i, y_i) − sim(x_i, z_i) > THR
sim(y_i, x_i) − sim(y_i, w_i) > THR
(z_i and w_i denote the respective second-best candidates)
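The symmetry constraint and the threshold criterion can be sketched as follows (a minimal illustration using cosine similarity; `select_reliable_pairs` is our name, not the authors' implementation):

```python
import numpy as np

def select_reliable_pairs(XS, XT, thr=None):
    """Pick (i, j) index pairs where source word i and target word j are
    mutual nearest neighbours under cosine similarity; with thr set, also
    require the top match to beat the runner-up by more than thr on both
    sides. Assumes at least two candidate words per language.

    XS: (nS, d) source embeddings, XT: (nT, d) target embeddings.
    """
    S = XS / np.linalg.norm(XS, axis=1, keepdims=True)
    T = XT / np.linalg.norm(XT, axis=1, keepdims=True)
    sim = S @ T.T                      # (nS, nT) cosine similarities
    nn_st = sim.argmax(axis=1)         # best target for each source word
    nn_ts = sim.argmax(axis=0)         # best source for each target word
    pairs = []
    for i, j in enumerate(nn_st):
        if nn_ts[j] != i:              # symmetry: must be mutual NNs
            continue
        if thr is not None:
            # margin over the second-best candidate, in both directions
            second_t = np.partition(sim[i], -2)[-2]
            second_s = np.partition(sim[:, j], -2)[-2]
            if sim[i, j] - second_t <= thr or sim[i, j] - second_s <= thr:
                continue
        pairs.append((i, int(j)))
    return pairs
```

With `thr=None` this keeps exactly the mutual-nearest-neighbour (SYM) pairs; raising `thr` discards pairs whose best match is not clearly separated from the runner-up.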
Experimental Setup

Task → Bilingual lexicon learning (BLL)

Goal → build a non-probabilistic bilingual lexicon of word translations

Test Sets → ground-truth word translation pairs built for three language pairs: Spanish (ES)-, Dutch (NL)-, Italian (IT)-English (EN)
[Vulić and Moens, NAACL 2013, EMNLP 2013]
(Similar relative performance on other BLL test sets)

Evaluation Metric → Top 1 accuracy (Acc_1)
(Similar model rankings with Acc_5 and Acc_10)
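In this setting, Top-1 accuracy reduces to nearest-neighbour retrieval in the shared space: map each source test word, retrieve the cosine-nearest target word, and compare with the gold translation. A minimal sketch (all names are illustrative, not from the paper):

```python
import numpy as np

def acc1(X_test, gold_idx, W, T):
    """Top-1 BLL accuracy: map source test vectors with W, retrieve the
    cosine-nearest target word, and compare with gold translation indices.

    X_test: (n, dS) source vectors, gold_idx: (n,) gold target row indices,
    W: (dS, dT) learned mapping, T: (nT, dT) target vocabulary matrix.
    """
    mapped = X_test @ W
    mapped /= np.linalg.norm(mapped, axis=1, keepdims=True)
    Tn = T / np.linalg.norm(T, axis=1, keepdims=True)
    pred = (mapped @ Tn.T).argmax(axis=1)
    return float((pred == np.asarray(gold_idx)).mean())

# Sanity check: with an identity map and test words taken from the
# target matrix itself, every retrieval is correct.
rng = np.random.default_rng(1)
T = rng.normal(size=(6, 4))
print(acc1(T[[2, 5]], [2, 5], np.eye(4), T))  # prints 1.0
```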
Baseline BWE Models

Type 1 → BiCVM [Hermann and Blunsom, ACL 2014]
Type 2 → BilBOWA [Gouws et al., ICML 2015]
Type 3 → BWESG with length-ratio shuffle [Vulić and Moens, JAIR 2016]
Type 4 → Linear mapping (BNC+GT) [Mikolov et al., arXiv 2013; Dinu et al., ICLR WS 2015]

→ All baselines trained with standard suggested settings (more in the paper)
→ Baselines use similar training data as our Type 4 models, e.g., Polyglot Wiki plus Europarl for BilBOWA, document-aligned LinguaTools Wiki for BWESG
Training Setup and Data (Our Models)

Monolingual SGNS on Polyglot Wikipedias
Standard pre-processing and SGNS hyper-parameters (window size: 4)

We report results with d = 300 for all models
(similar results with d = 40, 64, 500)
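The SGNS objective behind these monolingual vectors can be sketched compactly. This is a toy NumPy version for illustration only; the talk's experiments use standard word2vec tooling, and all names below are ours:

```python
import numpy as np

def train_sgns(corpus, dim=25, window=4, neg=5, epochs=30, lr=0.05, seed=0):
    """Toy skip-gram with negative sampling (SGNS) [Mikolov et al., NIPS 2013].

    corpus: list of token lists. Returns (sorted vocab, input word vectors).
    Duplicate negative samples in one update are rare and ignored here.
    """
    rng = np.random.default_rng(seed)
    vocab = sorted({w for sent in corpus for w in sent})
    idx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    W_in = rng.normal(scale=0.1, size=(V, dim))   # word ("input") vectors
    W_out = np.zeros((V, dim))                    # context ("output") vectors

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    for _ in range(epochs):
        for sent in corpus:
            ids = [idx[w] for w in sent]
            for pos, centre in enumerate(ids):
                lo, hi = max(0, pos - window), min(len(ids), pos + window + 1)
                for ctx in ids[lo:pos] + ids[pos + 1:hi]:
                    # one positive context word + `neg` random negatives
                    targets = [ctx] + list(rng.integers(0, V, size=neg))
                    labels = np.array([1.0] + [0.0] * neg)
                    scores = W_out[targets] @ W_in[centre]
                    g = sigmoid(scores) - labels  # gradient wrt the scores
                    grad_in = g @ W_out[targets]
                    W_out[targets] -= lr * np.outer(g, W_in[centre])
                    W_in[centre] -= lr * grad_in
    return vocab, W_in
```

A Type 4 pipeline would train this once per language and then align the two resulting spaces with the seed-lexicon mapping described earlier.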
Ranked Lists with Different Seed Lexicons

[Table: top-ranked English translations retrieved for the Spanish query word casamiento under different seed lexicons (BNC+GT, BNC+HYB+ASYM, BNC+HYB+SYM, HFQ+HYB+ASYM, HFQ+HYB+SYM, ORTHO). BNC+GT returns: marriage, marry, marrying, betrothal, wedding, wed, elopement. The four HYB lexicons return similarly marriage-related lists in varying orders (marriage, marry, marrying, wedding, betrothal, wed, remarry, marriages, betrothed, elopement, with occasional noise such as daughter), while ORTHO returns unrelated words: maría, señor, doña, juana, noche, amor, guerra.]
Experiments
Experiment I: Standard BLL Setting (5K seed lexicons)

Model                 ES-EN   NL-EN   IT-EN
BiCVM (Type 1)        0.532   0.583   0.569
BilBOWA (Type 2)      0.632   0.636   0.647
BWESG (Type 3)        0.676   0.626   0.643
BNC+GT (Type 4)       0.677   0.641   0.646
ORTHO                 0.233   0.506   0.224
BNC+HYB+ASYM          0.673   0.626   0.644
BNC+HYB+SYM           0.681   0.658*  0.663*
  (3388; 2738; 3145 pairs)
HFQ+HYB+ASYM          0.596   0.657*  0.667*
HFQ+HYB+SYM           0.673   0.695*  0.635

→ Document-level semantic spaces can provide seed lexicons
→ Reliability matters
Experiments
Experiment II: Lexicon Size (Spanish-English)

[Figure: Acc_1 scores vs. seed lexicon size (0.1k to 50k pairs) for BNC+GT, BNC+HYB+ASYM, BNC+HYB+SYM, HFQ+HYB+ASYM, HFQ+HYB+SYM, and ORTHO.]
Experiments
Experiment II: Lexicon Size (Dutch-English)

[Figure: Acc_1 scores vs. seed lexicon size (0.1k to 50k pairs) for the same six lexicons.]

→ BNC+SYM and HFQ+SYM are the best models overall
Experiments
Experiment III: Translation Pair Reliability (Spanish-English)

[Figure: Acc_1 scores vs. lexicon size (1k to 40k pairs) for THR = None, 0.01, 0.025, 0.05, 0.075, 0.1.]
Experiments
Experiment III: Translation Pair Reliability (Dutch-English)

[Figure: Acc_1 scores vs. lexicon size (1k to 40k pairs) for the same THR settings.]

→ Stricter selection criteria can help (but not necessarily)
Experiments
Experiment IV: Another Task - Suggesting Word Translations in Context (6K seed lexicons)

Model                 ES-EN   NL-EN   IT-EN
No Context            0.406   0.433   0.408
Best System           0.703   0.712   0.789
[Vulić and Moens, EMNLP 2014]
BiCVM (Type 1)        0.506   0.586   0.522
BilBOWA (Type 2)      0.586   0.656   0.589
BWESG (Type 3)        0.783   0.858   0.792
BNC+GT (Type 4)       0.794   0.858   0.783

[Remaining rows: ORTHO, BNC+HYB+ASYM, BNC+HYB+SYM (3839; 3117; 3693 pairs), HFQ+HYB+ASYM, and HFQ+HYB+SYM with THR ∈ {None, 0.01, 0.025} score in the 0.647-0.875 range, with the best HYB configurations (marked *) outperforming BNC+GT and the best previously reported system.]
Conclusion and Future Work

Type 4 BWE models (post-hoc mapping with seed lexicons) are very effective, but...

The choice of training pairs and their reliability matter

(Excellent results with a hybrid BWE model that can train on monolingual data and use only document alignments as supervision)

More sophisticated reliability measures? Other models of pair selection? Other context types and mapping functions? Other languages? Language pairs with scarce resources?
Questions?