Distributional semantics III:
One distributional semantics, many semantic relations
Marco Baroni
UPF Computational Semantics Course

Outline
- Attributional and relational similarity
- Distributional Memory
- Conclusion

Is distributional semantics really "general purpose"?
- Most standard tasks (synonym detection, similarity judgments, categorization, coordinate priming...) evaluate taxonomic similarity
- Taxonomic similarity is fundamental, but a general-purpose model should also:
  - capture other semantic relations
  - distinguish between different semantic relations

Other semantic relations (Baroni & Lenci, in press)
[Figure: results by relation type (TAXO, RELENT, PART, QUALITY, ACTIVITY, FUNCTION, LOCATION) for the SVD and SUBJECTS models]

Different types of semantic relations
Nearest neighbours of book in a ukWaC-based SVD-ed semantic space:
- reading      (function)
- story        (part?)
- article      (coordinate)
- writing      (agentive)
- excerpt      (part?)
- writer       (relent)
- commentary   (coordinate)
- to write     (agentive)
- interesting  (quality)

Finding and distinguishing semantic relations
- Classic distributional semantic models are based on attributional similarity
  - Single words/concepts that tend to share contexts/attributes are similar
- However, corpora (especially large ones) also contain many instances of pairs (or longer tuples) of words/concepts that, by the very fact that they co-occur, will tend to be related, and not only by taxonomic similarity (parts, functions, activities...)
- Pairs that occur in similar contexts (in particular: connected by similar words and structures) will tend to be related, with the shared contexts acting as a cue to the nature of their relation, i.e., measuring their relational similarity (Turney 2006)

Attributional and relational similarity (Turney 2006)
- Policeman is attributionally similar to soldier, since they both occur in contexts such as: kill X, with gun, for security
- The pair policeman-gun is relationally similar to teacher-book, since they both occur in contexts in which they are connected by with, use, of
- Structuralists might call this: looking for paradigmatic relations by means of syntagmatic relations

Finding and distinguishing semantic relations
- Find non-taxonomic semantic relations:
  - Look at direct co-occurrences of word pairs in texts (when we talk about a concept, we are likely to also mention its parts, function, etc.)
- Distinguish between different semantic relations:
  - Use the contexts of pairs to measure pair similarity, and group them into coherent relation types by their contexts
  - Analogical reasoning (Turney 2006, 2008)

General purpose?
- It seems like we need to collect statistics to build two different models:
  - word-by-context statistics to measure attributional similarity
  - word-pair-by-context statistics to measure relational similarity

General purpose? Should we care?
- No? The dominant paradigm in computational linguistics is "special purpose":
  - define the task
  - build appropriate training resources
  - extract generalizations from the training data, and use them to tackle the task

General purpose? Should we care?
- Yes! 20 years of statistical methods in semantics have produced no general-purpose resource such as WordNet, FrameNet, ConceptNet, etc.
  - With the possible exception of Dekang Lin's (1998) thesaurus
- Going back to the corpus to collect ad-hoc statistics is time- and resource-consuming, and leaves little time to think of the "next steps" (what do you do with the resources you collected?)
- Human semantic memory is general purpose!
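To make the contrast between the two kinds of statistics concrete, here is a minimal Python sketch (the toy sentences, the window size and the use of infix words as pair contexts are illustrative assumptions, not the setup of any model discussed in these slides):

    from collections import Counter

    def attributional_counts(sentences, window=2):
        """Word-by-context statistics: each word is described by the
        words that occur around it."""
        counts = Counter()
        for sent in sentences:
            for i, w in enumerate(sent):
                for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                    if j != i:
                        counts[(w, sent[j])] += 1
        return counts

    def relational_counts(sentences, targets):
        """Word-pair-by-context statistics: each co-occurring pair of
        target words is described by the words between them."""
        counts = Counter()
        for sent in sentences:
            positions = [i for i, w in enumerate(sent) if w in targets]
            for a in positions:
                for b in positions:
                    if a < b:
                        infix = " ".join(sent[a + 1:b])
                        counts[((sent[a], sent[b]), infix)] += 1
        return counts

    sents = [["the", "policeman", "shot", "with", "a", "gun"],
             ["the", "teacher", "taught", "with", "a", "book"]]
    print(attributional_counts(sents)[("policeman", "shot")])   # 1
    print(relational_counts(sents, {"policeman", "gun", "teacher", "book"}))
    # counts: (policeman, gun) via 'shot with a', (teacher, book) via 'taught with a'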
General purpose?
- It seems like we need to collect statistics to build two different models:
  - word-by-context statistics to measure attributional similarity
  - word-pair-by-context statistics to measure relational similarity
- Baroni and Lenci (2009): if we go for an attributional semantic space where contexts include links, the two models are different views of the same tuple statistics graph, which can be collected once and for all from the corpus
  - The extended example below uses dependency paths as links, but links could also be pattern-based
- A distributional semantic model is a set of weighted concept+link+concept tuples collected from the corpus; different matrices are generated on demand from these tuples, on a task-specific basis
- We refer to this approach as Distributional Memory (DM), by analogy with human semantic memory

Outline
- Attributional and relational similarity
- Distributional Memory
- Conclusion

One set of tuple statistics from the corpus

    concept1    link        concept2     weight
    soldier     use         gun            41.0
    gun         use−1       soldier        41.0
    policeman   with        gun            30.5
    gun         with−1      policeman      30.5
    kill        obj         victim        915.4
    victim      obj−1       kill          915.4
    kill        subj_tr     soldier      1306.9
    soldier     subj_tr−1   kill         1306.9

The DM graph
[Figure: the DM graph. Concept nodes (die, kill, teacher, victim, school, policeman, soldier, gun, handbook) are connected by weighted link edges such as subj_tr, subj_in, obj, with, use, in, at]

Different matrix views of the DM graph

Concept by Link+Concept (CxLC):

                subj_in−1 die   subj_tr−1 kill   obj−1 kill   with gun   use gun
    teacher             109.4              0.0          9.9        0.0       0.0
    soldier            4547.5           1306.9       8948.3      105.9      41.0
    policeman            68.6             38.2        538.1       30.5       7.4

Concept+Concept by Link (CCxL):

                           in        at     with    use
    teacher school    11894.4    7020.1     28.9    0.0
    teacher handbook      2.5       0.0      3.2   10.1
    soldier gun           2.8      10.3    105.9   41.0
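A minimal Python sketch of the "one tuple list, many views" idea, using a few of the toy tuples above (the dictionary-of-sparse-rows representation is an assumption made for illustration, not the actual DM implementation):

    from collections import defaultdict

    # The single underlying data structure: weighted concept+link+concept tuples
    tuples = [
        ("soldier", "use", "gun", 41.0),
        ("policeman", "with", "gun", 30.5),
        ("kill", "obj", "victim", 915.4),
        ("kill", "subj_tr", "soldier", 1306.9),
    ]

    def cxlc_view(tuples):
        """Concept-by-Link+Concept view: one sparse row per concept,
        dimensions are (link, other concept) pairs; inverse links are
        added so that, e.g., victim gets an (obj-1, kill) dimension."""
        rows = defaultdict(dict)
        for c1, link, c2, w in tuples:
            rows[c1][(link, c2)] = w
            rows[c2][(link + "-1", c1)] = w
        return rows

    def ccxl_view(tuples):
        """Concept+Concept-by-Link view: one sparse row per concept pair,
        dimensions are the links connecting the two concepts."""
        rows = defaultdict(dict)
        for c1, link, c2, w in tuples:
            rows[(c1, c2)][link] = w
        return rows

    print(cxlc_view(tuples)["gun"])
    # {('use-1', 'soldier'): 41.0, ('with-1', 'policeman'): 30.5}
    print(ccxl_view(tuples)[("soldier", "gun")])
    # {'use': 41.0}

Both views are generated on demand from the same tuple list, which is the point of the DM approach: the corpus is processed once, and task-specific matrices are derived afterwards.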
Same underlying corpus model, different similarities
Empirical validation from Baroni & Lenci (2009)
- ukWaC pre-parsed with MINIPAR (Lin 1998)
- Target concepts/words: top 20k most frequent nouns, top 5k most frequent verbs
- Links:
  - the top 30 most frequent direct V-N dependency paths (e.g., kill+obj+victim)
  - the top 30 preposition-mediated N-N or V-N paths (e.g., soldier+with+gun)
  - the top 50 transitive-verb-mediated N-N paths (e.g., soldier+use+gun)
  - all the inverse relations of the above (e.g., victim+obj−1+kill)
- Weighting by Local MI (Evert 2008)
- 69 million tuples extracted
- No SVD on the attributional/relational matrices, although in principle it would be possible

Same underlying corpus model, different similarities
Empirical validation from Baroni & Lenci (2009)
- The emphasis is on the model's generality and adaptivity
- The goal is to achieve state-of-the-art results (not necessarily the best score), without task-specific parameter tuning

Attributional similarity
The Concept-by-Link+Concept (CxLC) space
- Rows: target concepts
- Columns: nodes linked to the targets in the graph, "typed" with the corresponding edge label (links)
- Standard dependency-based linked model à la Grefenstette, Lin, Curran & Moens, etc.

                subj_in−1 die   subj_tr−1 kill   obj−1 kill   with gun   use gun
    teacher             109.4              0.0          9.9        0.0       0.0
    victim             1335.2             22.4        915.4        0.0       0.0
    soldier            4547.5           1306.9       8948.3      105.9      41.0
    policeman            68.6             38.2        538.1       30.5       7.4

Rubenstein & Goodenough results

    model          r       model        r
    ukWaCDV       0.70     DV+         0.62
    DM            0.64     ukWaCHAL    0.61
    ukWaCWindow   0.63     cosDV+      0.47

Categorization
- 44 concrete nouns (ESSLLI 2008 Distributional Semantics shared task)
  - 24 natural entities
    - 15 animals: 7 birds (eagle), 8 ground animals (lion)
    - 9 plants: 4 fruits (banana), 5 greens (onion)
  - 20 artifacts
    - 13 tools (hammer), 7 vehicles (car)
- Clustering with CLUTO
  - 6-way (birds, ground animals, fruits, greens, tools and vehicles), 3-way (animals, plants and artifacts) and 2-way (natural and artificial entities) clustering
- Cluster evaluation again in terms of purity

Categorization results

    model         6-way   3-way   2-way
    Katrenko        89     100      80
    Peirsman+       82      84      86
    ukWaCDV         80      75      61
    DM              77      79      59
    ukWaCHAL        75      68      68
    Peirsman−       73      71      61
    ukWaCWindow     70      68      59
    Shaoul          41      52      55

Relational similarity
The Concept+Concept-by-Link (CCxL) space
- Same tuples, different matrix view:
  - Rows: concept pairs
  - Columns: links between the concept pairs

                           in        at     with    use
    teacher school    11894.4    7020.1     28.9    0.0
    teacher handbook      2.5       0.0      3.2   10.1
    soldier gun           2.8      10.3    105.9   41.0

Relational tasks
- Finding the most similar pair to a target pair (recognizing SAT analogies)
- Defining a relation type by example pairs, and deciding whether a target instantiates the relation type (SemEval semantic relation classification)
- Unsupervised clustering of relation types (McRae nominal relation data set)

Recognizing SAT analogies
- 374 SAT multiple-choice questions (Turney 2006)
- Each question includes 1 target pair (stem) and 5 answer pairs
- The task is to choose the pair most analogous to the stem

    stem:     mason : stone
    answers:  teacher : chalk
              carpenter : wood    (the correct choice)
              soldier : gun
              photograph : camera
              book : word

- A relational analogue to the TOEFL task!
- Approach: for each stem pair, select the answer pair with the highest cosine in CCxL space
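A hedged sketch of this selection step in Python (cosine over the sparse CCxL rows; the pair vectors and link weights below are invented for illustration):

    import math

    def cosine(u, v):
        """Cosine similarity between two sparse vectors stored as dicts."""
        dot = sum(w * v.get(dim, 0.0) for dim, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def answer_sat(stem, answers, ccxl):
        """Pick the answer pair whose link vector is most similar to the stem's."""
        stem_vec = ccxl.get(stem, {})
        return max(answers, key=lambda pair: cosine(stem_vec, ccxl.get(pair, {})))

    # toy CCxL rows with made-up links and weights
    ccxl = {("mason", "stone"):    {"carve": 5.0, "work_with": 3.0},
            ("carpenter", "wood"): {"carve": 4.0, "work_with": 2.5},
            ("teacher", "chalk"):  {"write_with": 6.0}}
    print(answer_sat(("mason", "stone"),
                     [("carpenter", "wood"), ("teacher", "chalk")], ccxl))
    # ('carpenter', 'wood')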
Recognizing SAT analogies: results

    model        % correct     model       % correct
    LRA             56.1       KnowBest       43.0
    PERT            53.3       DM−            42.3
    PairClass       52.1       LSA            42.0
    VSM             47.1       AttrMax        35.0
    DM+             45.3       AttrAvg        31.0
    PairSpace       44.9       AttrMin        27.3
    k-means         44.0       Random         20.0

Domain analogies
- Turney (2008) extends the relational approach to entire analogical domains:

    solar system → atom
    sun          → nucleus
    planet       → electron
    mass         → charge
    attracts     → attracts
    revolves     → revolves
    gravity      → electromagnetism

Semantic relation classification
- SemEval-2007 Task 04: Semantic Relations between Nominals
- 7 relation types: Cause-Effect, Instrument-Agency, Product-Producer, Origin-Entity, Theme-Tool, Part-Whole, Content-Container
- Instances harvested with patterns from the Web, and manually labeled as hits or misses
- For each relation, 140 training examples (about 50% hits), about 80 test cases

Semantic relation classification
- For each relation, we build hit and miss prototype vectors by averaging across the vectors of the training examples in CCxL space
- For each test pair, we base the hit/miss choice on cosine similarity to the hit and miss average vectors

Prototype vectors
- Simply sum the vectors of the relevant class to obtain the prototype (average, centroid) vector
- No need to rescale the resulting vector (e.g., to mean values) as long as we are only going to look at its angle/cosine with other vectors
[Figure: two scatter plots contrasting the summed and the rescaled (mean) prototype vector: the direction, and hence the cosine, is the same]

Semantic relation classification results

    model         precision   recall      F    accuracy
    UCD-FC             66.1     66.7   64.8        66.0
    UCB                62.7     63.0   62.7        65.4
    ILK                60.5     69.5   63.8        63.5
    DM+                60.3     62.6   61.1        63.3
    UMELB-B            61.5     55.7   57.8        62.7
    SemEval avg        59.2     58.7   58.0        61.1
    DM−                56.7     58.2   57.1        59.0
    UTH                56.1     57.1   55.9        58.8
    majority           81.3     42.9   30.8        57.0
    probmatch          48.5     48.5   48.5        51.7
    UC3M               48.2     40.3   43.1        49.9
    alltrue            48.5    100.0   64.8        48.5
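A minimal sketch of the hit/miss prototype strategy, reusing the sparse-vector cosine defined in the SAT sketch above (in the real experiment the vectors are CCxL rows of the SemEval training pairs; here everything is schematic):

    def prototype(vectors):
        """Sum the sparse vectors of a class; the unnormalized sum points in
        the same direction as the average, which is all the cosine needs."""
        proto = {}
        for vec in vectors:
            for dim, w in vec.items():
                proto[dim] = proto.get(dim, 0.0) + w
        return proto

    def classify(test_vec, hit_vecs, miss_vecs):
        """Label a test pair 'hit' if it is closer to the hit prototype
        than to the miss prototype (cosine as defined earlier)."""
        hit_proto = prototype(hit_vecs)
        miss_proto = prototype(miss_vecs)
        if cosine(test_vec, hit_proto) >= cosine(test_vec, miss_proto):
            return "hit"
        return "miss"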
Can we discover relation types?
- In both previous tasks, relation types are (implicitly) given by example (bottle-drink exemplifies the "function" relation type, etc.)
- Can we discover relation types in the CCxL space, like we found concept types (animals, fruit, etc.) in the CxLC space, by unsupervised similarity-based clustering?
- At least some evidence that we do, from Baroni et al. (submitted)
  - Using a different underlying graph from the one of the other experiments!

Relation clustering
- 205 prototypically related concept pairs from spontaneously produced concept descriptions (McRae et al.'s 2005 conceptual norms)
- Manual annotation by McRae and colleagues into 4 classes:
  - 85 part relations: motorcycle-engine
  - 31 category relations: pineapple-fruit
  - 25 location relations: pig-farm
  - 64 (nominally expressed) function relations: ship+cargo
- Clustering with CLUTO in CCxL space

Clustering relation types: results

    clust   part   cat   loc   funct   descriptive links
    1         77     0     1      10   P_of_C, C_'s_P, C_with_P, C_have_P, P_on_C
    2          1    31     0       1   P_such_as_C, P_like_C, C_as_P, P_as_C, P_from_C
    3          2     0    24      21   C_on_P, P_with_C, C_in_P, C_from_P, C_to_P
    4          5     0     0      32   C_of_P, P_from_C, C_for_P, P_in_C, P_by_C

The problem with function
- Function as annotated in the McRae norms (as recoded by us) is a mixed bag:
  - Activities: pen+writing
  - Objects and participants in the functional activity: pen+page, pen+author
  - Fillers of prototypical functions: cup+tea
  - mushroom+hallucinogen (category?), ship+cargo (part?), cherry+cake (location?)

Ad interim summary
- Attributional and relational similarity can be seen as different views of the same corpus-estimated statistical model
- Empirical evidence that this is a viable strategy
- Empirical evidence that relational similarity can tackle the problems of finding different classes of related words/concepts, and identifying the nature of the relation

The Devil's Advocate
- Relational similarity works for arbitrary relations
- Including taxonomic ones: "category" above (= hypo-/hypernymy), synonym detection...
  - Counterintuitively, given a sufficiently large corpus you can do fairly well on the TOEFL synonym test (73.75% accuracy) based on direct synonym co-occurrence (Turney 2001)
- Then why do we need attributional similarity at all?

Taxonomic as relational

    soldier         and             policeman
    sing            and             dance
    artist          such_as         painter
    accident        or              incident
    Garbanzo_bean   also_known_as   chickpea

Attributional similarity and generalization
- Relational similarity is based on the direct co-occurrence of the concepts in a pair
- If a pair does not co-occur in the corpus, we have no relational information about it
- Consequently, methods based on relational similarity need to rely on huge corpora
  - Turney uses a 50 billion word corpus
  - Rapp and others got better TOEFL synonym results using attributional similarity on a 100M word corpus than Turney with relational similarity on the WWW
- More importantly, the whole point of human language is to allow us to express new notions by combining words and their meanings:
  - He devoured the topinambur (unattested, totally fine once you know topinambur is a vegetable)
  - He devoured the sympathy (unattested, bad)

Attributional similarity and generalization
- Suppose that neither devour+obj+topinambur nor devour+obj+sympathy was seen in the corpus, and thus they are not in the tuple graph
- Extract tuples that contain relational information about devour and its objects (devour+obj+meat, devour+obj+banana, etc.)
- Build an attributional vector in CxLC space for the prototypical object of devour, by summing the meat, banana, etc. vectors
- Measure the attributional similarity of topinambur and sympathy to the devour object prototype
  - The prototype and topinambur vectors will probably share high values on features such as obj−1_cook, obj−1_eat, on_table, etc.

Mixing relational and attributional similarity to model selectional preferences
- Basic idea from Padó et al. (2007)
- Correlation with human acceptability judgments of noun-verb pairs from McRae et al. (1997) (100 pairs, 36 raters) and Padó (2007) (211 pairs, ~20 raters per pair):
  - how typical is deer as an object of shoot?
- For each verb we build its prototype subj (obj) argument vector:
  - sum the vectors of the 50 nouns with the highest weight on the appropriate link to the verb (e.g., the top 50 nouns connected to shoot by an obj link)
  - NB: if the target noun is in the prototype set, its vector is subtracted from the prototype: we treat each argument as topinambur!
- The cosine between the prototype and the candidate noun in CxLC space is taken as the model's "plausibility judgment" about the noun occurring as the relevant verb argument
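A hedged sketch of this procedure, again reusing the cosine helper from the SAT sketch (cxlc is assumed to map each noun to its sparse CxLC row; with the inverse-link convention used earlier, direct objects of a verb carry weight on the ('obj-1', verb) dimension):

    def argument_prototype(verb, link, nouns, cxlc, exclude=None, k=50):
        """Prototype argument vector for a verb slot: sum the CxLC rows of
        the k nouns with the highest weight on the (link, verb) dimension,
        e.g. link='obj-1' for the nouns occurring as objects of the verb."""
        ranked = sorted(nouns,
                        key=lambda n: cxlc.get(n, {}).get((link, verb), 0.0),
                        reverse=True)
        proto = {}
        for n in ranked[:k]:
            if n == exclude:   # leave the test noun out of its own prototype
                continue
            for dim, w in cxlc.get(n, {}).items():
                proto[dim] = proto.get(dim, 0.0) + w
        return proto

    def plausibility(verb, link, noun, nouns, cxlc):
        """Model 'plausibility judgment': cosine between the noun's CxLC
        row and the prototype built from the verb's typical arguments."""
        proto = argument_prototype(verb, link, nouns, cxlc, exclude=noun)
        return cosine(cxlc.get(noun, {}), proto)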
Modeling selectional preferences: results
- Performance measured by the Spearman ρ correlation coefficient between the average human ratings and the model predictions (Padó et al. 2007)

                    McRae                Padó
    model         coverage    ρ       coverage    ρ
    Padó             56       41         97      51
    DM               96       28         98      50
    ParCos           91       21         98      48
    ukWaCDV          96       21         98      39
    ukWaCHAL         96       12         98      29
    ukWaCWindow      96       12         98      27
    Resnik           94        3         98      24

Acceptability of some potential objects of kill

    object         cosine
    kangaroo        0.51
    person          0.45
    robot           0.15
    hate            0.11
    flower          0.11
    stone           0.05
    fun             0.05
    book            0.04
    conversation    0.03
    sympathy        0.01

Acceptability of some potential instruments of kill

    instrument     cosine
    hammer          0.26
    stone           0.25
    brick           0.18
    smile           0.15
    flower          0.12
    antibiotic      0.12
    person          0.12
    heroin          0.12
    kindness        0.07
    graduation      0.04

Modeling selectional preferences
- Once more, no need to collect ad-hoc statistics
- We have all the necessary relational information (to find the typical subjects/objects of a verb) and attributional information (to compare the prototype subject/object to the candidate) in the underlying tuple graph

Other views of the graph
- Concept-by-Link+Concept: attributional similarity
- Concept+Concept-by-Link: relational similarity
- Concept+Link-by-Concept: "slot" similarity, coming next
- Link-by-Concept+Concept: "type" similarity, left to further work
- Direct Concept+(Link+)+Concept ranking: properties of concepts

The Concept+Link-by-Concept (CLxC) space
- Rows: concept+link pairs
- Columns: concepts related to the concept+link pairs in the graph

                    teacher    victim   soldier   policeman
    kill subj_tr       0.0       22.4    1306.9       38.2
    kill obj           9.9      915.4    8948.3      538.1
    die subj_in      109.4     1335.2    4547.5       68.6

Argument alternations
- Measuring the similarity between verb slots can be used to study syntactic alternations (Joanis et al. 2008)
- Causative/inchoative alternations (Levin 1993): the patient of break can also surface as its (inchoative) subject, whereas this does not happen with mince
  - The cook broke the vase → The vase broke
  - The cook minced the meat → *The meat minced

Argument alternations in the CLxC space
- For alternating verbs, the direct object vector (verb + obj) should be similar to the intransitive subject vector (verb + subj-in)
  - the same things that are broken break
- For non-alternating verbs, the two slots should not be similar
  - the things that are minced are different from those that mince them

Argument alternations: experiments
- 402 verbs extracted from Levin's classes (Levin 1993):
  - 232 alternating causatives/inchoatives (break)
  - 170 non-alternating transitives (mince)
- Median per-verb pairwise cosines among slots:

                           alternating   non-alternating
    subj-intr vs subj-tr       0.28           0.29
    subj-intr vs obj           0.31           0.09
    subj-tr vs obj             0.16           0.11
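A hedged sketch of this diagnostic (clxc is assumed to map (verb, slot) pairs to their sparse CLxC rows; the decision threshold is an invented illustration, not a value from the experiments):

    def slot_similarity(verb, clxc):
        """Cosine (as defined earlier) between a verb's direct-object slot
        and its intransitive-subject slot in the CLxC view."""
        return cosine(clxc.get((verb, "obj"), {}),
                      clxc.get((verb, "subj_in"), {}))

    def looks_causative_inchoative(verb, clxc, threshold=0.2):
        """Alternating verbs like 'break' should score high (the same things
        that are broken break); non-alternating 'mince' should score low."""
        return slot_similarity(verb, clxc) > threshold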
Outline
- Attributional and relational similarity
- Distributional Memory
- Conclusion

The main points: attributional and relational similarity
- Traditional distributional semantics models capture attributional similarity (similarity of words that share many contexts, and thus, presumably, many properties)
- Attributional similarity tends to capture taxonomically related concepts (synonyms, antonyms, etc.), but:
  - it does not capture other important semantic relations (part, function, etc.)
  - it does not distinguish between relations of different types (e.g., coordinates vs. categories)
- Both issues can be tackled by measuring the relational similarity of pairs of co-occurring terms, by comparing the contexts (in particular: links) in which they co-occur
  - Strongly related terms will tend to co-occur, no matter what type of relation they instantiate
  - Similarity of pairs in terms of contexts/links will group together pairs that instantiate the same relation, allowing supervised or unsupervised relation classification

The main points: one model, many views
- To build link-based attributional similarity models, we need to collect, from the corpus, weighted concept+link+concept tuples (one concept will be represented by the vector of co-occurring link+other-concept pairs, with the corresponding weights as dimensions)
- Instead of seeing the tuple list as a temporary resource in building the concept-by-link+concept matrix, we should see it as the basic data structure of our distributional semantics model
- From the same tuples (conveniently represented as a graph) we can extract a relational similarity matrix, by representing concept pairs by the vector of links that connect them, with the corresponding weights (same as above) as dimensions
- Thus, attributional and relational similarity are different views of the same distributional data, harvested once and for all from the corpus

The main points: attributional similarity for generalization
- Relational similarity can capture a wider range of relations than attributional similarity, and provide some form of typing of these relations
  - Taxonomic relations might also be captured by relational similarity
- However, relational similarity can only tell us something about word pairs that co-occur in our source corpus
- We can combine relational and attributional similarity to perform productive semantic generalizations
  - From concepts for which we have direct evidence that they are in a certain relation (relational similarity) to attributionally similar concepts

The main points: views of the tuple graph
- From weighted concept+link+concept tuples, we can build four kinds of matrices to compare different kinds of objects, all of semantic interest:
  - concept-by-link+concept: attributional similarity
  - concept+concept-by-link: relational similarity
  - concept+link-by-concept: slot similarity
  - link-by-concept+concept: link/type similarity