Don't cram two completely different meanings into a single !&??@#ˆ$% vector! Or should you?
Hinrich Schütze, Yadollah Yaghoobzadeh
Center for Information and Language Processing, LMU Munich
2017-04-07
Cf. Mooney (2014)

Outline
1 Problems
2 Simulations
3 Applications
4 A modest proposal
5 Representation learning

Part 1: Problems

Problem 1: Conflation in ambiguity
Intuition: each point in embedding space represents a distinct meaning.
Conflation: two meanings are conflated if their sum cannot be disentangled into the two component meanings.
Example: let a, b, c, d be vectors for different meanings. If a + b = c + d, then a and b are conflated (and so are c and d).

Conflation in ambiguity
[Figure: four words in embedding space: airplane, spacecraft, car, boat.]
Conflation can definitely happen. But is it a problem in practice?

Cramming two meanings into one vector: The alternative?
For each sense, learn a separate embedding. Most common approaches: (i) cluster contexts to define senses; (ii) use a resource (e.g., WordNet) to define senses.
Cf. Schütze (1992); Reisinger & Mooney (2010); Huang, Socher, Manning & Ng (2012); Neelakantan, Shankar, Passos & McCallum (2014); Jauhar, Dyer & Hovy (2015); Rothe & Schütze (2015); Flekova & Gurevych (2016); Pilehvar, Camacho-Collados, Navigli & Collier (2017)

Problem 2: Ambiguity incompatible with topology?
[Figure: t-SNE-style plot with the clothing sense s2 (near "outfit", "apparel"), the litigation sense s1 (near "legal case", "lawsuit"), and the word vector w for "suit" between them.]
Two senses of "suit": litigation vs. clothing. Let's represent the two senses using the embeddings s1, s2.
Plausible approach: the embedding w of "suit" is 0.5(s1 + s2).
But w is not close to either sense ("litigation" / "clothing")!
Does that mean we cannot cram two meanings into one vector? Only if we want this to hold: closeness in meaning → closeness in the t-SNE plot. But t-SNE plots are misleading!

Simulations
Simulation 1: Is it possible to cram two meanings into one vector?
Simulation 2: How many meanings can we cram into one vector?
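Neither problem depends on a particular embedding model. Before turning to the simulations, here is a minimal numeric sketch of both: the conflation equation from Problem 1 and the averaged "suit" vector from Problem 2 (NumPy; all vectors are made-up toy values, not learned embeddings).

```python
import numpy as np

# Problem 1: conflation. Four toy 2-d meaning vectors with a + b = c + d.
a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
c, d = np.array([0.7, 0.3]), np.array([0.3, 0.7])
print(np.allclose(a + b, c + d))  # True: the sum alone cannot tell {a, b} from {c, d}

# Problem 2: averaging two sense vectors yields a vector close to neither sense.
def cos(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

s1 = np.array([1.0, 0.0])      # toy "litigation" sense of "suit"
s2 = np.array([0.0, 1.0])      # toy "clothing" sense of "suit"
w = 0.5 * (s1 + s2)            # one vector for the ambiguous word
print(cos(w, s1), cos(w, s2))  # ~0.71 each: w sits exactly between the senses
```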
Part 2: Simulations

Setup for ambiguity experiment (cf. Yaghoobzadeh & Schütze (2016))
Define a PCFG; the PCFG models ambiguity in natural language.
Generate a corpus using the PCFG.
Train the embedding model on the corpus.
Evaluate the quality of the learned embeddings.

Ambiguity grammar (PCFG) that generates the corpus
P(A V1 B | S) = 9/20
P(A V2 B | S) = (1 − β) · 1/20
P(A W2 B | S) = β · 1/20
P(C W1 D | S) = 9/20
P(C V2 D | S) = β · 1/20
P(C W2 D | S) = (1 − β) · 1/20
P(a_i | A) = 1/10, 0 ≤ i ≤ 9
P(b_i | B) = 1/10, 0 ≤ i ≤ 9
P(c_i | C) = 1/10, 0 ≤ i ≤ 9
P(d_i | D) = 1/10, 0 ≤ i ≤ 9
P(v_i | V1) = 1/45, 5 ≤ i ≤ 49
P(v_i | V2) = 1/5, 0 ≤ i ≤ 4
P(w_i | W1) = 1/45, 5 ≤ i ≤ 49
P(w_i | W2) = 1/5, 0 ≤ i ≤ 4
β is the skewedness parameter.

Corpus generated by the PCFG
a4 ab-v8 b8
a2 ab-v8 b6
c3 cd-v4 d8
c4 cd-v4 d2
a4 w0 b8
a2 w0 b6
c3 w0 d8
c4 w0 d2
Two types of contexts: a-b contexts and c-d contexts.
Unambiguous words (occur in only one context type): ab-v8, cd-v4, and many more.
Ambiguous words (occur in both context types): w0.

Ambiguity: Experiment
Learn embeddings from the corpus.
Train an SVM for the binary classification task "can this word occur in an A-B context?"
Test set: the ambiguous words w0, w1, w2, w3, w4.
Skewedness: α ∈ {1.0, 1.1, 1.2, . . . , 2.0}, β = 2^(−α) (α = 1.0: balanced sense distribution, α = 2.0: skewed).
50 trials.

Can embeddings accurately represent ambiguous words?
Recap: we train an SVM on the binary classification task "can this word occur in an A-B context?"
Hypothesis 1. This does not work: one of the senses is not represented by the embedding. → correct for skewed sense distributions
Hypothesis 2. This does work: both senses are accurately represented by the embedding. → correct for balanced sense distributions

A single vector is fully capable of representing two completely different meanings unless . . .
[Figure: accuracy of disambiguation (y-axis, 0.0 to 1.0) for the models pmi, lbl, cbow, skip, cwin and sskip, plotted against α from 1.0 (balanced sense distribution) to 2.0 (skewed sense distribution); accuracy drops as skewedness increases.]
A single vector may not be capable of representing two completely different meanings if the sense distribution is skewed.

Takeaway
A combination of meanings of similar frequency is easier to represent in one vector.
A combination of meanings of different frequencies is harder to represent in one vector.
Cf. Schütze (1992); Reisinger & Mooney (2010); Huang, Socher, Manning & Ng (2012); Neelakantan, Shankar, Passos & McCallum (2014); Jauhar, Dyer & Hovy (2015); Rothe & Schütze (2015); Flekova & Gurevych (2016); Pilehvar, Camacho-Collados, Navigli & Collier (2017)
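For concreteness, a minimal sketch of the corpus-generation step of this pipeline, following the grammar above. The token names (v8, w0, ...) and the sampling code are assumptions; the embedding and SVM steps are only indicated in comments.

```python
import random

def emit(nt):
    """Sample a terminal for nonterminal nt according to the lexicon above."""
    if nt in ("A", "B", "C", "D"):
        return nt.lower() + str(random.randrange(10))    # a0..a9 etc., each 1/10
    letter = "v" if nt.startswith("V") else "w"
    if nt.endswith("1"):
        return letter + str(random.randrange(5, 50))     # V1/W1: 45 words, each 1/45
    return letter + str(random.randrange(5))             # V2/W2: 5 words, each 1/5

def sample_sentence(beta):
    """Sample one sentence from S using the six rules and probabilities above."""
    rules = [(("A", "V1", "B"), 9 / 20), (("A", "V2", "B"), (1 - beta) / 20),
             (("A", "W2", "B"), beta / 20), (("C", "W1", "D"), 9 / 20),
             (("C", "V2", "D"), beta / 20), (("C", "W2", "D"), (1 - beta) / 20)]
    nts = random.choices([r for r, _ in rules], weights=[p for _, p in rules])[0]
    return " ".join(emit(nt) for nt in nts)

random.seed(0)
alpha = 1.5                                              # skewedness, 1.0..2.0
corpus = [sample_sentence(beta=2 ** -alpha) for _ in range(100000)]
# Next (not shown): train the embedding model on `corpus`, then train an SVM
# on the embeddings for "can this word occur in an A-B context?" (test: w0..w4).
```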
Are skewed sense distributions frequent? Yes.
"see": main sense: to perceive with the eyes; less frequent sense: seat of authority of a bishop.
"lead": main sense: to cause to go with one; less frequent sense ("lead to"): to cause to happen.
"company": main sense: commercial business; less frequent sense: a body of soldiers.
Cf. Kilgarriff (2004); Calvo & Gelbukh (2015); Postma, Izquierdo, Agirre, Rigau & Vossen (2016)

Experimental setup (cf. Schütze (1992); Gale, Church & Yarowsky (1992); Pilehvar & Navigli (2014))
150 context types.
In each context type, 128 different words can occur (unique to this context type).
For each word, randomly generate 128 occurrences (contexts).
Add noise.
Create pseudowords by conflating 2^k words, 1 ≤ k ≤ 7.
Task: given the embedding of a pseudoword, decide "can this pseudoword occur in a particular context?"

Generated corpus, before and after pseudoword substitution
x059r4 c059w002 y059r8  →  x059r4 pseudoword y059r8
x059r6 c059w002 y059r2  →  x059r6 pseudoword y059r2
x059r3 c059w002 y059r5  →  x059r3 pseudoword y059r5
x059r2 c059w002 y059r1  →  x059r2 pseudoword y059r1
x122r2 c122w013 y122r0  →  x122r2 pseudoword y122r0
x122r5 c122w013 y122r8  →  x122r5 pseudoword y122r8
x122r3 c122w013 y122r4  →  x122r3 pseudoword y122r4
x122r1 c122w013 y122r8  →  x122r1 pseudoword y122r8
(c059w002 is word 002 of context type 059; here the pseudoword conflates c059w002 and c122w013.)

The more senses are conflated in a pseudoword, the lower the disambiguation performance.
[Figure: accuracy of disambiguation (y-axis, 0.3 to 1.0) against the number of conflated senses (x-axis, log scale, 2 to 128); accuracy falls steadily as the number of senses grows.]

Takeaway
A combination of a small number of meanings is easier to represent in one vector.
A combination of a large number of meanings is harder to represent in one vector.
Cf. Schütze (1992); Reisinger & Mooney (2010); Huang, Socher, Manning & Ng (2012); Neelakantan, Shankar, Passos & McCallum (2014); Jauhar, Dyer & Hovy (2015); Rothe & Schütze (2015); Flekova & Gurevych (2016); Pilehvar, Camacho-Collados, Navigli & Collier (2017)

Are there 128-way ambiguous words? Yes.
"run" has 140 senses! (counting bullet points and phrases in the dictionary entry as senses)
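For concreteness, the pseudoword construction from the experimental setup above could be sketched as follows (the joining convention, names, and sampling are assumptions; the slides do not specify an implementation).

```python
import random

def make_pseudoword(corpus, words, k):
    """Replace every occurrence of 2**k randomly chosen words by one shared token."""
    group = random.sample(words, 2 ** k)
    pseudo = "@".join(group)                  # e.g. "c059w002@c122w013" for k = 1
    mapping = {w: pseudo for w in group}
    return [mapping.get(tok, tok) for tok in corpus], pseudo

# Usage: for k = 1..7, rewrite the corpus, retrain the embeddings, and train one
# classifier per context type on "can this (pseudo)word occur in this context?"
```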
Part 3: Applications

Applications
Application 1: Sentiment analysis
Application 2: Named entity typing

Rotation of embedding space into interpretable subspaces (Rothe, Ebert & Schütze (2016); Rothe & Schütze (2016))
Find R that minimizes
Σ_{(v,w) ∈ L_same-polarity} |P R (v − w)| − Σ_{(v,w) ∈ L_different-polarity} |P R (v − w)|
where v, w are word vectors, L_same-polarity / L_different-polarity are training pairs of words with the same / different polarity, and P projects onto the polarity subspace.

Basic idea: Rotate the embedding space
Put all polarity information on a single dimension, the x-dimension (x-dimension = polarity dimension).
Maximize the red distances: distances between words of different polarity (e.g., "bad" vs. "joy", "fun").
Minimize the blue distances: distances between words of identical polarity (e.g., "joy" vs. "fun").
[Figure, three animation steps: the space is rotated until "bad" is separated from "joy" and "fun" along the polarity dimension.]

Use rotation R for ambiguity analysis
Train 400-dimensional word2vec embeddings on Twitter.
Train R to yield a 1-dimensional polarity subspace.
Create 691 self-antonyms, each conflating two words w1 and w2 that are very dissimilar in the polarity subspace and very similar in the orthogonal complement. Example: "poverty@wealth".
Cf. Adel & Schütze (2014); Santus, Lu, Lenci & Huang (2014); Pham, Lazaridou & Baroni (2015); Ono, Miwa & Sasaki (2015); Nguyen, Schulte im Walde & Vu (2016)
In the Twitter corpus, substitute the constituent words with their self-antonyms.
Train 100-dimensional embeddings on this new corpus.

Embeddings of self-antonyms
This is the worst case of cramming two completely different meanings into a single !&??@#ˆ$% vector!
Hypothesis 1. This doesn't work: the two meanings get conflated. → wrong
Hypothesis 2. There is no problem: the embedding is a complete & accurate representation of the self-antonym. → correct
Experimental setup: train two classifiers on the embeddings:
classifier pos-vs-rest: positive vs. negative+neutral
classifier neg-vs-rest: negative vs. positive+neutral
How do these classifiers classify self-antonyms?
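Before the results: a minimal sketch of what the two classifiers could look like, assuming precomputed embeddings and a word-to-polarity lexicon as inputs. Using a linear SVM from scikit-learn is an assumption; the slides only speak of classifiers.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_rest(emb, polarity, target):
    """Train a target-vs-rest classifier on word embeddings.

    emb: dict word -> embedding vector (assumed given)
    polarity: dict word -> "pos" | "neg" | "neu" (assumed given)
    """
    words = [w for w in polarity if w in emb]
    X = np.stack([emb[w] for w in words])
    y = np.array([polarity[w] == target for w in words])
    return LinearSVC().fit(X, y)

# pos_clf = train_one_vs_rest(emb, polarity, "pos")  # positive vs. negative+neutral
# neg_clf = train_one_vs_rest(emb, polarity, "neg")  # negative vs. positive+neutral
# Self-antonyms like "poverty@wealth" are then fed to both classifiers; if both
# fire, both polarities are recoverable from the single vector.
```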
Classification of self-antonyms

classifier pos-vs-rest
training set size: 2997 (999 positive, 1998 negative+neutral)
test set size: 691 (691 self-antonyms)
accuracy on test: 93%

classifier neg-vs-rest
training set size: 2997 (999 negative, 1998 positive+neutral)
test set size: 691 (691 self-antonyms)
accuracy on test: 80%

Takeaway
Even self-antonyms can be represented in one vector.
Embeddings distinguish occurrence in neutral contexts from occurrence in a mix of positive and negative contexts.

Neutral contexts vs. a mix of positive/negative contexts
What if a neutral word occurs in a mix of positive/negative contexts? Can it still be distinguished from a self-antonym?
Try to find neutral words close to self-antonyms: "civil" is a close neighbor of "slavery@equality".
So a neutral word that occurs in a mix of positive and negative contexts ("civil") may not be distinguishable from a polarity self-antonym ("slavery@equality")!

Good representation of self-antonyms: Effect of dimensionality
Use again our 100-dimensional Twitter embeddings.
Train R to yield a 1-dimensional polarity subspace.
Question: what is the distribution of positive, negative, neutral and self-antonymic words in this 1-dimensional polarity subspace?
[Figure: density curves for negative, self-antonym, neutral and positive words over the value in the sentiment subspace (roughly −2 to 2).]

Takeaway (cf. Li & Jurafsky (2015))
A combination of meanings is easier to represent in one vector of high dimensionality.
A combination of meanings is harder to represent in one vector of low dimensionality.

Entity embeddings (learned with word2vec)
[Figure: visualization of entity embeddings.]

Embedding-based entity typing: given the embedding, predict the correct types of the entity
[Figure: the entity vector v(Obama) is mapped to scores for the types city, food, person, musician, politician, author, athlete, nobelist, location.]
Cf. Wang, Zhang, Feng & Chen (2014); Yogatama, Gillick & Lazic (2015); Neelakantan & Chang (2015); Yaghoobzadeh & Schütze (2015, 2017)
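A minimal sketch of embedding-based entity typing as multi-label classification, with one binary classifier per type. Using logistic regression is an assumption consistent with, but not dictated by, the cited work; the input dictionaries are assumed given.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_typing(entity_emb, entity_types, all_types):
    """Train one binary classifier per entity type.

    entity_emb: dict entity -> embedding vector
    entity_types: dict entity -> set of gold type names
    """
    entities = list(entity_types)
    X = np.stack([entity_emb[e] for e in entities])
    models = {}
    for t in all_types:                  # e.g. "city", "musician", "politician"
        y = np.array([t in entity_types[e] for e in entities])
        models[t] = LogisticRegression().fit(X, y)
    return models

# Prediction: all types whose classifier scores the entity vector above 0.5, e.g.
# {t for t, m in models.items() if m.predict_proba([v_obama])[0, 1] > 0.5}
```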
Combination of meanings in a vector: typical combinations
Systematic ambiguity is a hallmark of language. E.g., metonymy:
"I live in Valencia." → city
"Valencia won the game." → soccer club
Many city names have this city/soccer "ambiguity".

Combination of meanings in a vector: untypical combinations
Some combinations of entity types are untypical, e.g., musician and town:
Johann Sebastian Bach → musician
Bach (on the Danube, in Bavaria) → town

Combination of meanings in a vector: typical vs. untypical
Hypothesis 1: Typical combinations are easier. Valencia (city) vs. Valencia (soccer club): two meanings that typically cooccur in the world (and presumably have similar embeddings) do not get conflated and are both preserved in the embedding. → correct
Hypothesis 2: Untypical combinations are easier. Bach (town) vs. Bach (musician): two meanings that typically don't cooccur in the world (and presumably have "more orthogonal" embeddings) do not get conflated and are both preserved in the embedding. → wrong

The more typical a combination of types/senses is, the higher the disambiguation performance.
[Figure: (smoothed) F1 of disambiguation (y-axis, 0.00 to 0.30) against the PMI of the type combination (x-axis, 0.30 to 0.50, from untypical to typical combinations); F1 rises with PMI. A sketch of the PMI computation follows below.]

Takeaway (cf. Rodd, Gaskell & Marslen-Wilson (2000))
A combination of two meanings that is typical is easier to represent in one vector.
A combination of two meanings that is untypical is harder to represent in one vector.

Summary: Do not cram k different meanings into a single n-dimensional vector
if one of the meanings is infrequent,
if this combination of meanings is untypical,
if k is too large,
if n is too small.
In principle, there is no problem with cramming two (or more) completely different meanings into one vector, even for self-antonyms.
How can we define linguistic units with nice ambiguity properties (balanced sense distribution, no untypical meaning combinations, not too many senses)? These units cannot be words.
Perhaps human language processing also is generally not based on word units?
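As referenced in the typicality figure above: the PMI used to score how typical a type combination is could be computed over entity-to-type assignments roughly as follows. The exact counting scheme used in the experiments is not given on the slides, so this is an assumption.

```python
import math
from collections import Counter

def type_pair_pmi(entity_types):
    """PMI of type cooccurrence across entities: high PMI = typical combination.

    entity_types: dict entity -> set of type names (assumed given)
    """
    n = len(entity_types)
    single, pair = Counter(), Counter()
    for types in entity_types.values():
        for t in types:
            single[t] += 1
        for a in types:
            for b in types:
                if a < b:                       # count each unordered pair once
                    pair[(a, b)] += 1
    return {(a, b): math.log((c / n) / ((single[a] / n) * (single[b] / n)))
            for (a, b), c in pair.items()}

# E.g. ("city", "soccer club") should score high, ("musician", "town") low.
```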
Part 4: A modest proposal

Are skewed sense distributions frequent? Yes. Recall "company", "lead", "see".

Skewed and highly polysemous words: Two observations
Humans don't seem to have a problem with this: misunderstandings due to ambiguity are rare. Why?
So far, we have made a big implicit assumption: that the linguistic units we should represent as vectors are words.

Linguistic units with nice ambiguity properties
Simplest approach: consider all possible units, then select a subset of good units.
Important constraint: a unit must be easily recognizable. E.g., if a complex unit requires disambiguation for recognition, then that doesn't help us.
Thought experiment for this talk: consider all units of length 10 characters.

Cookie-cutter segmentation (1) (cf. Asgari & Mofrad (2015, 2016))
Input: "the renaissance arrived in the iberian peninsula through the mediterranean possessions of the aragonese crown and the city of valencia."
Space is just a regular character: the@renaissance@arrived@in@the@iberian@peninsula@through@th...
Cookie-cutter segmentation for cookie-cutter size 10:
@the@renai ssance@arr ived@in@th e@iberian@ peninsula@ through@th e@mediterr anean@poss essions@of @the@arago nese@crown @and@the@c ity@of@val encia.@ear

Cookie-cutter segmentation (2) (cf. Asgari & Mofrad (2015, 2016))
Same input, all ten shifted segmentations for cookie-cutter size 10:
@the@renai ssance@arr ived@in@th e@iberian@
the@renais sance@arri ved@in@the @iberian@p
he@renaiss ance@arriv ed@in@the@ iberian@pe
e@renaissa nce@arrive d@in@the@i berian@pen
@renaissan ce@arrived @in@the@ib erian@peni
renaissanc e@arrived@ in@the@ibe rian@penin
enaissance @arrived@i n@the@iber ian@penins
naissance@ arrived@in @the@iberi an@peninsu
aissance@a rrived@in@ the@iberia n@peninsul
issance@ar rived@in@t he@iberian @peninsula

Cookie-cutter segmentation (3) (cf. Asgari & Mofrad (2015, 2016))
The corpus thus generated contains all observed potential units.

Cookie-cutter segmented text: Embeddings
For cookie-cutter size k, generate k copies of the corpus, each shifted by 1. (We set k = 10.)
Run an embedding learning algorithm on this corpus. (We use word2vec; corpus: Wikipedia.)
Result: each (sufficiently) frequent k-gram has an embedding.
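A minimal sketch of cookie-cutter segmentation with k shifted copies, reproducing the segmentations shown above. The final word2vec call is indicative only (assuming gensim).

```python
def cookie_cutter(text, k=10):
    """Return k copies of the text, segmented into k-grams at k different offsets."""
    s = text.replace(" ", "@")        # space is just a regular character
    return [[s[i:i + k] for i in range(off, len(s) - k + 1, k)]
            for off in range(k)]

corpora = cookie_cutter("the renaissance arrived in the iberian peninsula ...")
for shift in corpora[:2]:
    print(" ".join(shift[:4]))
# the@renais sance@arri ved@in@the @iberian@p
# he@renaiss ance@arriv ed@in@the@ iberian@pe

# Then, e.g. with gensim: Word2Vec(sentences=corpora, ...) gives one embedding
# per (sufficiently frequent) 10-gram.
```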
Highly polysemous words: 10-grams as directly observable senses
Many instances of "run" occur in contexts that have two properties:
they narrow down the 140 senses of "run" to a small subset, and
"run" is part of a directly observable 10-gram.

Directly observable sense of "run": ools@run@b
Selected nearest neighbors (with glosses):
hools,@man  (schools, managed)
bsidised@b  (subsidised by)
nistered@b  (administered by)
ls@owned@b  (schools owned by)
,@funded@b  (, funded by)

Directly observable sense of "run": a@two@run@
Selected nearest neighbors (with glosses):
a@three@ru  (a three run home run)
e@run@home  (three run home run)
it@a@three  (hit a three run home run)
@3@run@hom  (3 run home run)
a@walk@off  (a walk off home run)

Directly observable sense of "run": d@at@run@t
Selected nearest neighbors (with glosses):
d@at@runti  (linked at runtime)
d@at@compi  (known at compile time)
d@executab  (writable and executable)
d@dynamica  (determined dynamically)
@o@log@n@t  (O(n log n) time)

Directly observable sense of "run": run@afoul
Selected nearest neighbors (with glosses):
@get@a@lot  (get a lot)
@arise@out  (arise out)
@have@lots  (have lots)
@be@a@sort  (be a sort of)
@take@care  (take care)

Directly observable sense of "run": icken@run
Selected nearest neighbors (with glosses):
ken@little  (Chicken Little)
ot@chicken  (Robot Chicken)
porky@pig@  (Porky Pig)
duck@soup@  (Duck Soup)
m@chicken@  (the film Chicken Little/Run/Hawk/...)

Cosine similarities: completely different meanings
                                ools@run@b  a@two@run@  d@at@run@t  @run@afoul  icken@run@
schools run by     (ools@run@b)    1.00        0.13        0.13        0.09        0.10
a two run home run (a@two@run@)    0.13        1.00        0.06        0.09        0.17
linked at run time (d@at@run@t)    0.13        0.06        1.00        0.05        0.08
run afoul          (@run@afoul)    0.09        0.09        0.05        1.00        0.15
Chicken Run        (icken@run@)    0.10        0.17        0.08        0.15        1.00

Less frequent senses: 10-grams as directly observable senses
Many instances of less frequent senses of a word w occur in contexts that have two properties:
they make it likely that the less frequent sense of w is used, and
w is part of a directly observable 10-gram.
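The nearest-neighbor tables above are plain cosine-similarity queries over the 10-gram embeddings. A minimal sketch in pure NumPy; `vocab` (a list of 10-grams) and the embedding matrix `E` are assumed to come from the word2vec run described earlier.

```python
import numpy as np

def nearest(query, vocab, E, topn=5):
    """Cosine nearest neighbors of a 10-gram; E has one row per entry of vocab."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)   # unit-normalize rows
    sims = En @ En[vocab.index(query)]                  # cosine with every 10-gram
    order = np.argsort(-sims)
    return [(vocab[i], float(sims[i])) for i in order[1:topn + 1]]  # skip query itself

# nearest("ools@run@b", vocab, E)
# -> [("hools,@man", ...), ("bsidised@b", ...), ("nistered@b", ...), ...]
```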
Less frequent sense "body of soldiers" of "company"
Directly observable 10-gram: th@company
Selected nearest neighbors (with glosses):
nk@company  (tank company)
th@cavalry  (8th cavalry)
th@battali  (4th battalion)
th@marines  (4th marines)
h@regiment  (5th regiment)

Less frequent sense "cause to happen" of "lead (to)"
Directly observable 10-gram: ly@led@to
Selected nearest neighbors (with glosses):
ly@caused@  (mostly caused by)
ly@due@to@  (mainly due to)
ly@lead@to  (eventually lead to)
ly@helped@  (greatly helped his boxing ability)
ly@earned@  (Buck has finally earned Tia's love)

Less frequent sense "seat of a bishop" of "see"
Directly observable 10-gram: the@see@of
Selected nearest neighbors (with glosses):
pal@see@of  (episcopal see of)
shopric@of  (bishopric of)
hbishop@of  (archbishop of)
the@archde  (the archdeacon)
the@archbi  (the archbishopric)

A modest proposal: Summary
By using words as basic linguistic units, we make this unnecessarily hard for ourselves.
Research challenge: define an objective that replaces tokenization with the derivation of linguistic units that are optimized "information packages".
In this talk: a thought experiment, not a rigorous evaluation.
Main point: words are problematic as fundamental linguistic units. We should look for alternatives.

Part 5: Representation learning

Why representation learning?
My title, "Don't cram two completely different meanings into a single !&??@#ˆ$% vector! Or should you?", only makes sense in the context of representation learning.
Alternatives to representation learning: start with everything; start with nothing.

Start with everything. Example: HPSG
Cf. Kaplan & Bresnan (1982); Joshi (1985); Mel'čuk (1988); Pollard & Sag (1994); Gross (1997); Böhmová, Hajič, Hajičová & Hladká (2003); Steedman & Baldridge (2011)

Start with nothing: "Classical" machine learning
Example: train an HMM part-of-speech tagger on the Brown corpus.
Consider P(dog|NN): before training starts, we know nothing about this probability!
(Of course, there is a huge amount of high-quality information in the annotation.)
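For concreteness, the "start with nothing" point in code: the emission probability P(dog|NN) is just a relative frequency estimated from the annotated corpus. A maximum-likelihood sketch; smoothing and the transition model are omitted.

```python
from collections import Counter

def emission_probs(tagged_corpus):
    """MLE of P(word | tag) from (word, tag) pairs, e.g. from the Brown corpus."""
    tag_counts, pair_counts = Counter(), Counter()
    for word, tag in tagged_corpus:
        tag_counts[tag] += 1
        pair_counts[(word, tag)] += 1
    return lambda word, tag: pair_counts[(word, tag)] / tag_counts[tag]

# p = emission_probs(brown_tagged)   # brown_tagged: list of (word, tag) pairs
# p("dog", "NN")                     # zero knowledge before counting; pure data
```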
[Figure: a spectrum of approaches, with representation learning between the two extremes "start with nothing" and "start with everything".]

Why representation learning (cf. Bengio, Courville & Vincent (2014))
Robustness.
Transfer learning, domain adaptation.
Cognitive plausibility.
Abundance of unlabeled data, scarcity of labeled data.
Jackendoff & Wittenberg's "Hierarchy of Grammars" (HOG) model.
Vector representations are a natural fit for Ferreira's Good-Enough model?

Jackendoff & Wittenberg's HOG model
There is a hierarchy of grammars: simple grammar at the bottom, complex grammar at the top. Language processing operates on all levels in parallel.
Example of the top level: long-distance dependencies, multiple center-embedding.
Example of the bottom level: Ferreira's Good-Enough (GE) model.

Ferreira's Good-Enough (GE) model
". . . the language comprehension system creates syntactic and semantic representations that are merely 'good enough' (GE) given the task that the comprehender needs to perform. GE representations contrast with ones that are detailed, complete, and accurate with respect to the input."

Evidence for the Good-Enough model (1)
"How many of each type of animal did Moses take on the ark?"

Evidence for the Good-Enough model (2)
(a terrible bus accident right on the US-Mexico border) "Where should the authorities bury the survivors?"

Evidence for the Good-Enough model (3)
"While Mary bathed the baby played in the crib." "Did Mary bathe the baby?" "Yes."

Evidence for the Good-Enough model (4)
"The dog was bitten by the man." "Who was the agent?" "The dog."

Ferreira's Good-Enough (GE) model: Summary
There is solid evidence that humans employ some form of "shallow" / "good enough" comprehension.
GE does not replace, but instead supplements, complete & accurate comprehension (cf. Jackendoff & Wittenberg).
The Good-Enough model is accepted by a large number of experimental psychologists.

Deep learning, Good-Enough, ambiguity
Deep learning has been wildly successful. But there is no evidence that it does "true" complex NLP?
Perhaps its success is due to the fact that you can get very far in NLP with shallow processing?
So maybe deep learning is NLP's Good-Enough model? Which is good: we need a Good-Enough model.
But then the question of how to handle ambiguity in this Good-Enough model is a fundamental question we must answer.