Don’t cram two completely different meanings
into a single !&??@#ˆ$% vector!
Or should you?
(Cf. Mooney 2014)
Hinrich Schütze, Yadollah Yaghoobzadeh
Center for Information and Language Processing, LMU Munich
2017-04-07
Outline
1. Problems
2. Simulations
3. Applications
4. A modest proposal
5. Representation learning
Problem 1: Conflation in ambiguity
Intuition: Each point in embedding space represents a distinct meaning.
Conflation: Two meanings are conflated if their sum cannot be disentangled into the two component meanings.
Example: Let $\vec{a}, \vec{b}, \vec{c}, \vec{d}$ be different meanings.
If $\vec{a} + \vec{b} = \vec{c} + \vec{d}$, then $\vec{a}$ and $\vec{b}$ are conflated (and so are $\vec{c}$ and $\vec{d}$).
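A minimal numeric illustration (the vectors are mine, not from the talk): two different meaning pairs can have exactly the same sum, so the sum alone cannot be disentangled into its components.

```python
import numpy as np

# Two different pairs of "meaning" vectors with an identical sum:
a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
c, d = np.array([0.8, 0.3]), np.array([0.2, 0.7])

# a + b == c + d: from the sum alone, the two pairs are indistinguishable.
print(np.allclose(a + b, c + d))  # True
```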
Conflation in ambiguity
[Figure: 2D embedding plot with points for “airplane”, “spacecraft”, “car”, and “boat”.]
Conflation can definitely happen.
But is it a problem in practice?
Cramming two meanings into one vector: The alternative?
For each sense, learn a separate embedding.
Most common approaches:
(i) cluster contexts to define senses
(ii) use a resource (e.g., WordNet) to define senses
Cf. Schütze (1992); Reisinger & Mooney (2010); Huang, Socher, Manning &
Ng (2012); Neelakantan, Shankar, Passos & McCallum (2014); Jauhar, Dyer &
Hovy (2015); Rothe & Schütze (2015); Flekova & Gurevych (2016); Pilehvar,
Camacho-Collados, Navigli & Collier (2017)
Problem 2: Ambiguity incompatible with topology?
[Figure: t-SNE-style plot with the sense vectors $\vec{s}_1$ (legal-case, litigation, lawsuit) and $\vec{s}_2$ (outfit, clothing, apparel) and the word vector $\vec{w}$ of “suit” between them.]
Two senses of “suit”: litigation vs. clothing.
Let’s represent the two senses using the embeddings $\vec{s}_1$, $\vec{s}_2$.
Plausible approach: the embedding $\vec{w}$ of “suit” is $0.5(\vec{s}_1 + \vec{s}_2)$.
But $\vec{w}$ is not close to either sense (“litigation” / “clothing”)!
Does that mean we cannot cram two meanings into one vector?
Only if we want this to hold: closeness in meaning → closeness in the t-SNE plot.
But t-SNE plots are misleading!
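A quick numeric check of why the 2D picture misleads (a sketch; dimensions and seed are arbitrary): in high dimensions, the average of two nearly orthogonal sense vectors stays close to both of them in cosine terms, even though a 2D projection can make it look far from each.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 300

# Two random sense vectors; in high dimensions they are nearly orthogonal.
s1, s2 = rng.standard_normal(dim), rng.standard_normal(dim)
w = 0.5 * (s1 + s2)  # single vector for the ambiguous word

cos = lambda x, y: x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
print(round(cos(s1, s2), 2))                        # ~0.0: senses barely overlap
print(round(cos(w, s1), 2), round(cos(w, s2), 2))   # each ~0.7: w is close to BOTH
```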
Simulations
Simulation 1:
Is it possible to cram two meanings into one vector?
Simulation 2:
How many meanings can we cram into one vector?
Setup for ambiguity experiment
Define a PCFG (cf. Yaghoobzadeh & Schütze 2016)
The PCFG models ambiguity in natural language
Generate a corpus using the PCFG
Train the embedding model on the corpus
Evaluate quality of learned embeddings
Ambiguity grammar (PCFG) that generates the corpus
$P(A\,V_1\,B \mid S) = 9/20$
$P(C\,V_2\,D \mid S) = \beta \cdot 1/20$
$P(A\,V_2\,B \mid S) = (1-\beta) \cdot 1/20$
$P(C\,W_1\,D \mid S) = 9/20$
$P(A\,W_2\,B \mid S) = \beta \cdot 1/20$
$P(C\,W_2\,D \mid S) = (1-\beta) \cdot 1/20$

$P(a_i \mid A) = 1/10$ for $0 \le i \le 9$
$P(b_i \mid B) = 1/10$ for $0 \le i \le 9$
$P(c_i \mid C) = 1/10$ for $0 \le i \le 9$
$P(d_i \mid D) = 1/10$ for $0 \le i \le 9$
$P(v_i \mid V_1) = 1/45$ for $5 \le i \le 49$
$P(v_i \mid V_2) = 1/5$ for $0 \le i \le 4$
$P(w_i \mid W_1) = 1/45$ for $5 \le i \le 49$
$P(w_i \mid W_2) = 1/5$ for $0 \le i \le 4$

($\beta$: skewedness parameter)
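A minimal sketch of a generator for this grammar (a possible implementation; the talk gives only the rules):

```python
import random

BETA = 0.5  # skewedness parameter (0.5 = the two sense probabilities are equal)

# Top-level rules P(X Y Z | S) of the ambiguity grammar, with probabilities:
RULES = [
    (("A", "V1", "B"), 9 / 20),
    (("C", "V2", "D"), BETA / 20),
    (("A", "V2", "B"), (1 - BETA) / 20),
    (("C", "W1", "D"), 9 / 20),
    (("A", "W2", "B"), BETA / 20),
    (("C", "W2", "D"), (1 - BETA) / 20),
]

# Terminal rules: each nonterminal emits one word uniformly from its range.
RANGES = {
    "A": ("a", 0, 9), "B": ("b", 0, 9), "C": ("c", 0, 9), "D": ("d", 0, 9),
    "V1": ("v", 5, 49), "V2": ("v", 0, 4),
    "W1": ("w", 5, 49), "W2": ("w", 0, 4),
}

def sample_sentence():
    (nts,) = random.choices([r for r, _ in RULES],
                            weights=[p for _, p in RULES], k=1)
    return " ".join(f"{RANGES[nt][0]}{random.randint(*RANGES[nt][1:])}"
                    for nt in nts)

corpus = [sample_sentence() for _ in range(100_000)]
```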
Corpus generated by the PCFG
a4 ab-v8 b8     a4 w0 b8
a2 ab-v8 b6     a2 w0 b6
c3 cd-v4 d8     c3 w0 d8
c4 cd-v4 d2     c4 w0 d2
Two types of contexts: a-b contexts and c-d contexts
Unambiguous words (only one context): ab-v8, cd-v4, many more
Ambiguous words (occur in both contexts): w0
Ambiguity: Experiment
Learn embeddings from corpus
Train an SVM for the binary classification task
“can this word occur in an A-B context?”
Test set: ambiguous words $w_0, w_1, w_2, w_3, w_4$
Skewedness: $\alpha \in \{1.0, 1.1, 1.2, \ldots, 2.0\}$, $\beta = 2^{-\alpha}$
($\alpha = 1.0$: balanced, $\alpha = 2.0$: skewed)
50 trials
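A sketch of this classification step as I read it (the embedding dict and the train/test split are my assumptions; the talk gives no code):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Placeholder embeddings; in the real experiment these are learned
# from the PCFG corpus by pmi, lbl, cbow, skip, cwin, or sskip.
emb = {f"{p}{i}": rng.standard_normal(50) for p in "vw" for i in range(50)}

# Training data: unambiguous words, labeled by their single context type.
ab_words = [f"v{i}" for i in range(5, 50)]  # occur only in A-B contexts
cd_words = [f"w{i}" for i in range(5, 50)]  # occur only in C-D contexts
X = np.array([emb[w] for w in ab_words + cd_words])
y = np.array([1] * len(ab_words) + [0] * len(cd_words))

clf = LinearSVC().fit(X, y)

# Test set: the ambiguous words occur in BOTH context types, so an
# embedding that preserves both senses should be classified positive here.
test = [f"w{i}" for i in range(5)]
print(clf.predict(np.array([emb[w] for w in test])))
```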
Can embeddings accurately represent ambiguous words?
Recap: We train an SVM on the binary classification task:
“can this word occur in an A-B context?”
Hypothesis 1. This does not work:
one of the senses is not represented by the embedding.
→ correct for skewed sense distributions
Hypothesis 2. This does work:
both senses are accurately represented by the embedding.
→ correct for balanced sense distributions
A single vector is fully capable of representing two completely different meanings unless...
A single vector may not be capable of representing two completely different meanings if the sense distribution is skewed.
[Figure: accuracy of disambiguation (y-axis, 0.0–1.0) vs. skewedness (x-axis, 1.0 = balanced sense distribution to 2.0 = skewed sense distribution), for the models pmi, lbl, cbow, skip, cwin, sskip.]
Takeaway
A combination of meanings of similar frequency
is easier to represent in one vector.
A combination of meanings of different frequencies
is harder to represent in one vector.
Cf. Schütze (1992); Reisinger & Mooney (2010); Huang, Socher, Manning &
Ng (2012); Neelakantan, Shankar, Passos & McCallum (2014); Jauhar, Dyer &
Hovy (2015); Rothe & Schütze (2015); Flekova & Gurevych (2016); Pilehvar,
Camacho-Collados, Navigli & Collier (2017)
Are skewed sense distributions frequent? Yes.
“see”
Main sense: to perceive with the eyes.
Less frequent sense: seat of authority of a bishop.
“lead”
Main sense: to cause to go with one.
Less frequent sense: (“lead to”) to cause to happen.
“company”
Main sense: commercial business.
Less frequent sense: a body of soldiers.
Cf. Kilgarriff (2004); Calvo & Gelbukh (2015); Postma, Izquierdo, Agirre, Rigau & Vossen (2016)
Experimental setup
150 context types
In each context type: 128 different words can occur (unique to this context)
For each word, randomly generate 128 occurrences (contexts)
Add noise
Create pseudowords by conflating $2^k$ words, $1 \le k \le 7$
Task. Given the embedding of a pseudoword:
“Can this pseudoword occur in a particular context?”
Cf. Schütze (1992); Gale, Church & Yarowsky (1992); Pilehvar & Navigli (2014)
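A sketch of the pseudoword construction as I read it (identifiers and the toy data are mine): every occurrence of each of the $2^k$ chosen words is replaced by one shared token.

```python
import random

random.seed(0)

# Toy stand-ins for the generated data (the real setup has 150 context
# types with 128 unique words each, 128 occurrences per word, plus noise):
word_list = [f"c{c:03d}w{w:03d}" for c in range(4) for w in range(8)]
corpus = [[random.choice(word_list) for _ in range(5)] for _ in range(20)]

def conflate(corpus, words):
    """Replace all occurrences of the given words with one shared token."""
    token = "@".join(words)
    vocab = set(words)
    return [[token if w in vocab else w for w in sent] for sent in corpus]

k = 3  # 2^k-way ambiguous pseudoword; k = 1..7 in the experiment
corpus = conflate(corpus, random.sample(word_list, 2 ** k))
```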
Corpus generated by the PCFG
Original corpus:             After conflation into a pseudoword:
x059r4 c059w002 y059r8       x059r4 pseudoword y059r8
x059r6 c059w002 y059r2       x059r6 pseudoword y059r2
x059r3 c059w002 y059r5       x059r3 pseudoword y059r5
x059r2 c059w002 y059r1       x059r2 pseudoword y059r1
x122r2 c122w013 y122r0       x122r2 pseudoword y122r0
x122r5 c122w013 y122r8       x122r5 pseudoword y122r8
x122r3 c122w013 y122r4       x122r3 pseudoword y122r4
x122r1 c122w013 y122r8       x122r1 pseudoword y122r8
The more senses are conflated in a pseudoword, the lower the disambiguation performance.
[Figure: accuracy of disambiguation (y-axis, 0.3–1.0) vs. number of senses (x-axis, 2 to 100, log scale).]
Takeaway
A combination of a small number of meanings
is easier to represent in one vector.
A combination of a large number of meanings
is harder to represent in one vector.
Cf. Schütze (1992); Reisinger & Mooney (2010); Huang, Socher, Manning &
Ng (2012); Neelakantan, Shankar, Passos & McCallum (2014); Jauhar, Dyer &
Hovy (2015); Rothe & Schütze (2015); Flekova & Gurevych (2016); Pilehvar,
Camacho-Collados, Navigli & Collier (2017)
Are there 128-way ambiguous words? Yes.
“run”
140 senses! (counting bullet points and phrases as senses)
Applications
Application 1: Sentiment analysis
Application 2: Named entity typing
Rotation of embedding space into interpretable subspaces
Rothe, Ebert & Schütze (2016); Rothe & Schütze (2016)
Find $R$ that minimizes:
$$\sum_{(v,w) \in L_{\text{same-polarity}}} |PR(\vec{v} - \vec{w})| \;-\; \sum_{(v,w) \in L_{\text{different-polarity}}} |PR(\vec{v} - \vec{w})|$$
($R$: orthogonal rotation of the embedding space; $P$: projection onto the polarity subspace)
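A minimal optimization sketch of this objective for a one-dimensional polarity subspace (my simplification: with $P$ selecting a single dimension, $PR$ reduces to a unit vector $u$; PyTorch and the toy data are my choices, not the papers'):

```python
import torch

torch.manual_seed(0)
dim, n = 100, 50

# Toy stand-ins for difference vectors v - w of labeled word pairs
# (in the real setup these come from a polarity lexicon and embeddings):
same_pairs = torch.randn(n, dim)  # pairs with the same polarity
diff_pairs = torch.randn(n, dim)  # pairs with different polarity

u = torch.randn(dim, requires_grad=True)
opt = torch.optim.Adam([u], lr=0.01)

for step in range(500):
    un = u / u.norm()  # unit direction: stands in for P·R with a 1-dim subspace
    # shrink same-polarity distances along u, stretch different-polarity ones
    loss = (same_pairs @ un).abs().sum() - (diff_pairs @ un).abs().sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```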
Basic idea: Rotate the embedding space
Put all polarity info on a single dimension, the x-dimension (x-dimension = polarity dimension).
Maximize red distances (distances between words of different polarity).
Minimize blue distances (distances between words of identical polarity).
[Figure: 2D plot of “bad”, “joy”, “fun” with the x-axis as the polarity dimension.]
Use rotation R for ambiguity analysis
Train 400-dimensional word2vec embeddings on Twitter
Train R to yield a 1-dimensional polarity subspace
Create 691 self-antonyms that conflate two words w1 and w2
that are very dissimilar in the polarity subspace
and very similar in the orthogonal complement
Example: “poverty@wealth”
In the Twitter corpus: substitute constituent words with self-antonyms
Train 100-dimensional embeddings on this new corpus
Cf. Adel & Schütze (2014); Santus, Lu, Lenci & Huang (2014); Pham, Lazaridou & Baroni (2015); Ono, Miwa & Sasaki (2015); Nguyen, Schulte im Walde & Vu (2016)
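A sketch of the substitution step (the function and the toy sentences are mine):

```python
def substitute_self_antonyms(corpus, antonym_pairs):
    """Replace each constituent word by its self-antonym token, so both
    "poverty" and "wealth" become the single token "poverty@wealth"."""
    mapping = {}
    for w1, w2 in antonym_pairs:
        mapping[w1] = mapping[w2] = f"{w1}@{w2}"
    return [[mapping.get(w, w) for w in sent] for sent in corpus]

corpus = [["poverty", "is", "rising"], ["wealth", "brings", "comfort"]]
print(substitute_self_antonyms(corpus, [("poverty", "wealth")]))
# [['poverty@wealth', 'is', 'rising'], ['poverty@wealth', 'brings', 'comfort']]
```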
Embeddings of self-antonyms
This is the worst case of cramming two completely different
meanings into a single !&??@#ˆ$% vector!
Hypothesis 1. This doesn’t work:
the two meanings get conflated. → wrong
Hypothesis 2. There is no problem:
the embedding is a complete and accurate representation of the self-antonym. → correct
Experimental setup:
Train two classifiers on the embeddings:
classifier pos-vs-rest: positive vs. negative+neutral
classifier neg-vs-rest: negative vs. positive+neutral
How do these classifiers classify self-antonyms?
Classification of self-antonyms
classifier pos-vs-rest
  training set size: 2997 (positive: 999, negative+neutral: 1998)
  test set size: 691 (self-antonyms: 691)
  accuracy on test: 93%

classifier neg-vs-rest
  training set size: 2997 (negative: 999, positive+neutral: 1998)
  test set size: 691 (self-antonyms: 691)
  accuracy on test: 80%
Takeaway
Even self-antonyms can be represented in one vector.
Embeddings distinguish occurrence in neutral contexts from
occurrence in a mix of positive and negative contexts.
Neutral contexts vs. mixed positive/negative contexts
What if a neutral word occurs in a mix of positive/negative contexts?
Can it still be distinguished from a self-antonym?
Try to find neutral words close to self-antonyms:
“civil” is a close neighbor of “slavery@equality”.
So a neutral word that occurs in a mix of positive and negative contexts (“civil”) may not be distinguishable from a polarity self-antonym (“slavery@equality”)!
Good representation of self-antonyms: Effect of dimensionality
Use again our 100-dimensional Twitter embeddings
Train R to yield a 1-dimensional polarity subspace
Question: What is the distribution of positive, negative, neutral, and self-antonymic words in this 1-dimensional polarity subspace?
[Figure: density of negative, self-antonym, neutral, and positive words by value in the 1-dimensional sentiment subspace (x-axis, −2 to 2).]
Takeaway
A combination of meanings is easier to represent in one vector of high dimensionality.
A combination of meanings is harder to represent in one vector of low dimensionality.
Cf. Li & Jurafsky (2015)
Entity embeddings (learned with word2vec)
Embedding-based entity typing:
Given the embedding of an entity, predict the entity’s correct types.
[Figure: the entity embedding $\vec{v}(\text{Obama})$ is mapped to scores for the types city, food, person, musician, politician, author, athlete, nobelist, location.]
Cf. Wang, Zhang, Feng & Chen (2014); Yogatama, Gillick & Lazic (2015); Neelakantan & Chang (2015); Yaghoobzadeh & Schütze (2015, 2017)
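A sketch of one standard realization (per-type binary classifiers on top of the entity embedding; the library choice and toy data are mine, and the cited papers differ in details):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

TYPES = ["city", "food", "person", "musician", "politician",
         "author", "athlete", "nobelist", "location"]

# Toy stand-ins: entity embeddings and gold type indicator vectors
# (real inputs: embeddings learned with word2vec, types from a KB).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 100))        # 200 entities, 100-dim embeddings
Y = rng.integers(0, 2, (200, len(TYPES)))  # one binary label per type

clf = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

entity = rng.standard_normal((1, 100))     # embedding of a new entity
print(dict(zip(TYPES, clf.predict(entity)[0])))
```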
Combination of meanings in a vector: typical combinations
Systematic ambiguity is a hallmark of language, e.g., metonymy:
“I live in Valencia.” → city
“Valencia won the game.” → soccer club
Many city names have this city/soccer “ambiguity”.
Combination of meanings in a vector: untypical combinations
Some combinations of entity types are untypical, e.g., musician and town:
Johann Sebastian Bach → musician
Bach (on the Danube, in Bavaria) → town
Combination of meanings in a vector: typical vs. untypical
Hypothesis 1: Typical combinations are easier.
Valencia (city) vs. Valencia (soccer club)
Two meanings that typically cooccur in the world (and presumably have similar embeddings) do not get conflated and are both preserved in the embedding. → correct
Hypothesis 2: Untypical combinations are easier.
Bach (town) vs. Bach (musician)
Two meanings that typically don’t cooccur in the world (and presumably have “more orthogonal” embeddings) do not get conflated and are both preserved in the embedding. → wrong
The more typical a combination of types/senses is, the higher the disambiguation performance.
[Figure: (smoothed) F1 of disambiguation (y-axis, 0.00–0.30) vs. typicality of the combination as PMI (x-axis, 0.30–0.50, untypical to typical combinations).]
Takeaway
A combination of two meanings that is typical is easier to represent in one vector.
A combination of two meanings that is untypical is harder to represent in one vector.
Cf. Rodd, Gaskell & Marslen-Wilson (2000)
Summary: Do not cram k different meanings into a single n-dimensional vector
if one of the meanings is infrequent,
if this combination of meanings is untypical,
if k is too large, or
if n is too small.
In principle, there is no problem with cramming two (or more) completely different meanings into one vector, even for self-antonyms.
How to define linguistic units with nice ambiguity properties (balanced sense distribution, no untypical meaning combinations, not too many senses)?
These units cannot be words. Perhaps human language processing is also not generally based on word units?
Are skewed sense distributions frequent? Yes.
“company”
“lead”
“see”
Skewed and highly polysemous words: Two observations
Humans don’t seem to have a problem with this:
misunderstandings due to ambiguity are rare. Why?
So far, we have made a big implicit assumption:
that the linguistic units we should represent as vectors are words.
Linguistic units with nice ambiguity properties
Simplest approach: consider all possible units, then select a subset of good units.
Important constraint: a unit must be easily recognizable.
E.g., if a complex unit requires disambiguation for recognition, then it doesn’t help us.
Thought experiment for this talk: consider all units of length 10 characters.
Cookie-cutter segmentation (1)
Cf. Asgari & Mofrad (2015, 2016)
Input:
the renaissance arrived in the iberian peninsula through the mediterranean possessions of the aragonese crown and the city of valencia.
Space is just a regular character:
the@renaissance@arrived@in@the@iberian@peninsula@through@th
Cookie-cutter segmentation for cookie-cutter size 10:
@the@renai ssance@arr ived@in@th e@iberian@ peninsula@
through@th e@mediterr anean@poss essions@of @the@arago
nese@crown @and@the@c ity@of@val encia.@ear
Cookie-cutter segmentation (2)
Cf. Asgari & Mofrad (2015, 2016)
Input:
the renaissance arrived in the iberian peninsula through the mediterranean possessions of the aragonese crown and the city of valencia.
Cookie-cutter segmentation for cookie-cutter size 10:
@the@renai ssance@arr ived@in@th e@iberian@
the@renais sance@arri ved@in@the @iberian@p
he@renaiss ance@arriv ed@in@the@ iberian@pe
e@renaissa nce@arrive d@in@the@i berian@pen
@renaissan ce@arrived @in@the@ib erian@peni
renaissanc e@arrived@ in@the@ibe rian@penin
enaissance @arrived@i n@the@iber ian@penins
naissance@ arrived@in @the@iberi an@peninsu
aissance@a rrived@in@ the@iberia n@peninsul
issance@ar rived@in@t he@iberian @peninsula
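A minimal sketch of this segmentation (straightforward from the slides; only the function name is mine):

```python
def cookie_cutter(text, k=10):
    """Return the k shifted segmentations of text into k-character units."""
    s = text.replace(" ", "@")  # space is just a regular character
    return [[s[i:i + k] for i in range(shift, len(s) - k + 1, k)]
            for shift in range(k)]

text = "the renaissance arrived in the iberian peninsula"
for units in cookie_cutter(text):
    print(" ".join(units))
```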
Cookie-cutter segmentation (3)
Cf. Asgari & Mofrad (2015, 2016)
Cookie-cutter segmentation for cookie-cutter size 10 (all ten shifts, as above):
the corpus thus generated contains all observed potential units.
Cookie-cutter segmented text: Embeddings
For cookie-cutter size k, generate k copies of the corpus, each shifted by 1. (We set k = 10.)
Run an embedding learning algorithm on the corpus. (We use word2vec.)
Result: each (sufficiently) frequent k-gram has an embedding.
Corpus: Wikipedia
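A sketch of this training step with gensim (library choice and parameter values are mine, for illustration; the real corpus is Wikipedia, not one sentence):

```python
from gensim.models import Word2Vec

k = 10
s = "the renaissance arrived in the iberian peninsula".replace(" ", "@")

# k shifted copies of the (toy) corpus, segmented into 10-character units:
sentences = [[s[i:i + k] for i in range(shift, len(s) - k + 1, k)]
             for shift in range(k)]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
print(model.wv.most_similar("he@renaiss", topn=3))
```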
Highly polysemous words: 10-grams as directly observable senses
Many instances of “run” occur in contexts that have two properties:
They narrow down the 140 senses of “run” to a small subset.
“run” is part of a directly observable 10-gram.
Directly observable sense of “run”: ools@run@b
Selected nearest neighbors:
hools,@man   (schools, managed)
bsidised@b   (subsidised by)
nistered@b   (administered by)
ls@owned@b   (schools owned by)
,@funded@b   (, funded by)
Directly observable sense of “run”: a@two@run@
Selected nearest neighbors:
a@three@ru   (a three run home run)
e@run@home   (three run home run)
it@a@three   (hit a three run home run)
@3@run@hom   (3 run home run)
a@walk@off   (a walk off home run)
Directly observable sense of “run”: d@at@run@t
Selected nearest neighbors:
d@at@runti   (linked at runtime)
d@at@compi   (known at compile time)
d@executab   (writable and executable)
d@dynamica   (determined dynamically)
@o@log@n@t   (O(n log n) time)
Directly observable sense of “run”: @run@afoul
Selected nearest neighbors:
@get@a@lot   (get a lot)
@arise@out   (arise out)
@have@lots   (have lots)
@be@a@sort   (be a sort of)
@take@care   (take care)
Directly observable sense of “run”: icken@run@
Selected nearest neighbors:
ken@little   (Chicken Little)
ot@chicken   (Robot Chicken)
porky@pig@   (Porky Pig)
duck@soup@   (Duck Soup)
m@chicken@   (the film Chicken Little/Run/Hawk/...)
Cosine similarities: Completely different meanings

                     ools@run@b  a@two@run@  d@at@run@t  @run@afoul  icken@run@
schools run by           1.00        0.13        0.13        0.09        0.10
a two run home run       0.13        1.00        0.06        0.09        0.17
linked at run time       0.13        0.06        1.00        0.05        0.08
run afoul                0.09        0.09        0.05        1.00        0.15
Chicken Run              0.10        0.17        0.08        0.15        1.00
Less frequent senses: 10-grams as directly observable senses
Many instances of less frequent senses of a word w occur in contexts that have two properties:
They make it likely that the less frequent sense of w is used.
w is part of a directly observable 10-gram.
Less frequent sense “body of soldiers” of “company”.
Directly observable 10-gram: th@company
Selected nearest neighbors:
nk@company   (tank company)
th@cavalry   (8th cavalry)
th@battali   (4th battalion)
th@marines   (4th marines)
h@regiment   (5th regiment)
Less frequent sense “cause to happen” of “lead (to)”.
Directly observable 10-gram: ly@led@to
Selected nearest neighbors:
ly@caused@   (mostly caused by)
ly@due@to@   (mainly due to)
ly@lead@to   (eventually lead to)
ly@helped@   (greatly helped his boxing ability)
ly@earned@   (Buck has finally earned Tia’s love)
Less frequent sense “seat of a bishop” of “see”.
Directly observable 10-gram: the@see@of
Selected nearest neighbors:
pal@see@of   (episcopal see of)
shopric@of   (bishopric of)
hbishop@of   (archbishop of)
the@archde   (the archdeacon)
the@archbi   (the archbishopric)
A modest proposal: Summary
By using words as basic linguistic units, we make this unnecessarily hard for ourselves.
Research challenge: Define an objective that replaces tokenization with the derivation of linguistic units that are optimized “information packages”.
In this talk: a thought experiment, not a rigorous evaluation.
Main point: Words are problematic as fundamental linguistic units. We should look for alternatives.
Why representation learning
My title, “Don’t cram two completely different meanings into a single !&??@#ˆ$% vector! Or should you?”, only makes sense in the context of representation learning.
Alternatives to representation learning:
Start with everything
Start with nothing
Start with everything. Example: HPSG
Cf. Kaplan & Bresnan (1982); Joshi (1985); Mel’čuk (1988); Pollard & Sag
(1994); Gross (1997); Böhmová, Hajič, Hajičová & Hladká (2003); Steedman &
Baldridge (2011)
Start with nothing: “Classical” machine learning
Example: Train an HMM part-of-speech tagger on the Brown corpus.
Consider P(dog|NN).
Before training starts, we know nothing about this probability!
(Of course, there is a huge amount of high-quality information in the annotation.)
[Diagram: a spectrum from “start with nothing” to “start with everything”, with representation learning in between.]
Why representation learning
Robustness
Transfer learning, domain adaptation
Cognitive plausibility
Abundance of unlabeled data, scarcity of labeled data
Jackendoff & Wittenberg’s “Hierarchy of Grammars” (HOG) model
Vector representations are a natural fit for Ferreira’s Good-Enough model?
Cf. Bengio, Courville & Vincent (2014)
Jackendoff & Wittenberg’s HOG model
There is a hierarchy of grammars:
a simple grammar at the bottom, a complex grammar at the top.
Language processing operates on all levels in parallel.
Example of the top level: long-distance dependencies, multiple center-embedding.
Example of the bottom level: Ferreira’s Good-Enough (GE) model.
Ferreira’s Good-Enough (GE) model
“. . . the language comprehension system creates syntactic and semantic representations that are merely ‘good enough’ (GE) given the task that the comprehender needs to perform. GE representations contrast with ones that are detailed, complete, and accurate with respect to the input.”
Evidence for Good-Enough model (1): How many of each
type of animal did Moses take on the ark?
Evidence for Good-Enough model (2)
(terrible bus accident right on US-Mexico border)
Where should the authorities bury the survivors?
Evidence for Good-Enough model (3)
“While Mary bathed the baby played in the crib”
“Did Mary bathe the baby?” “Yes”
Evidence for Good-Enough model (4)
“The dog was bitten by the man.”
“Who was the agent?” “the dog”
Ferreira’s Good-Enough (GE) model: Summary
There is solid evidence that humans employ some form of “shallow” / “good enough” comprehension.
GE does not replace, but supplements, complete and accurate comprehension (cf. Jackendoff & Wittenberg).
The Good-Enough model is accepted by a large number of experimental psychologists.
Deep learning, Good-Enough, ambiguity
Deep learning has been wildly successful.
But there is no evidence that it does “true” complex NLP?
Perhaps its success is due to the fact that you can get very far in NLP with shallow processing.
So maybe deep learning is NLP’s Good-Enough model?
Which is good: we need a Good-Enough model.
But then how to handle ambiguity in this Good-Enough model is a fundamental question we must answer.