Mining the Web for Meaning (previously "Mining the Web for Synonyms")
Peter Turney
National Research Council of Canada
Institute for Information Technology

Outline
● ECML-PKDD 2001
  ● Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL
● ECML-PKDD 2011
  ● Mining the Web for Meaning
    – synonymy
    – analogy
    – hypernymy
    – meronymy
    – composition
    – and beyond ...

Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL
● Pointwise Mutual Information
  ● measure of word similarity based on probability of co-occurrence
    PMI(word1, word2) = log2( p(word1 & word2) / (p(word1) × p(word2)) )
● PMI-IR = Pointwise Mutual Information + Information Retrieval
  ● idea: use hit counts from a web search engine to estimate p(...)

Application for PMI-IR: Sentiment Analysis
● semantic orientation is the evaluative character of a word
  ● positive orientation: good, nice, excellent, positive, fortunate, ...
  ● negative orientation: bad, nasty, poor, negative, unfortunate, ...
● use PMI to calculate semantic orientation
    SO(word) = PMI(word, “excellent”) - PMI(word, “poor”)
● rate reviews as positive or negative based on average SO
  ● cars: 84.0%
  ● banks: 80.0%
  ● movies: 65.8%
  ● travel: 70.5%

Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL
● TOEFL (Test of English as a Foreign Language) synonym question
    Stem:      levied
    Choices:   (a) imposed
               (b) believed
               (c) requested
               (d) correlated
    Solution:  (a) imposed
● PMI-IR result: accuracy of 74% on the TOEFL test
● conclusion: simple idea works well with a huge corpus

Various TOEFL Results
● http://aclweb.org/aclwiki/index.php?title=TOEFL_Synonym_Questions

Beyond Synonyms: Analogies
● SAT (Scholastic Aptitude Test) analogy question
    Stem:      mason:stone
    Choices:   (a) teacher:chalk
               (b) carpenter:wood
               (c) soldier:gun
               (d) photograph:camera
               (e) book:word
    Solution:  (b) carpenter:wood
● like the TOEFL synonyms, but word pairs instead of single words
● analogies instead of synonyms
● relations between words instead of properties of single words

Various SAT Results
● http://aclweb.org/aclwiki/index.php?title=SAT_Analogy_Questions

Synonyms to Analogies, TOEFL to SAT
● TOEFL synonyms
  ● humans: 64.5%
  ● Rapp (2003): 92.5%
  ● word is represented by a row in a word-context matrix
  ● word similarity is cosine of angle between row vectors
● SAT analogies
  ● humans: 57%
  ● Turney (2006): 56.1%
  ● word pair relation is represented by a row in a pair-pattern matrix
  ● pair similarity is cosine of angle between row vectors

SAT Analogies
● traffic:street::water:riverbed
● traffic is to street as water is to riverbed

    pattern           traffic:street    water:riverbed
    “X in the Y”      615 hits          91 hits
    “Y on X”          6 hits            0 hits
    “Y with X”        478 hits          11 hits
    “X from the Y”    136 hits          14 hits
    “X when Y”        2 hits            0 hits
    total             1237 hits         116 hits

SAT Analogies
● relational similarity of two pairs is cosine of two vectors
  ● traffic:street pattern frequency vector
  ● water:riverbed pattern frequency vector
  ● similarity(traffic:street, water:riverbed) = cosine of vector angle

    Stem pair: traffic:street
    Choices:                  Cosine
    (a) ship:gangplank        0.318
    (b) crop:harvest          0.572
    (c) car:garage            0.687
    (d) pedestrians:feet      0.497
    (e) water:riverbed        0.692

Vector Space Models of Semantics
● Turney, P.D., and Pantel, P. (2010), From frequency to meaning: Vector space models of semantics, Journal of Artificial Intelligence Research (JAIR), 37, 141-188.
  ● survey paper; argues that the vector space model is a natural consequence of the Distributional Hypothesis
● Distributional Hypothesis
  ● words that occur in similar contexts tend to have similar meanings (Harris, 1954; Firth, 1957)
● Statistical Semantics Hypothesis
  ● statistical patterns of human word usage can be used to figure out what people mean (Weaver, 1955; Furnas et al., 1983)

More Semantics
● Turney, P.D. (2011), Analogy perception applied to seven tests of word comprehension, Journal of Experimental and Theoretical Artificial Intelligence, 23 (3), 343-362.
  ● 374 analogy questions from SAT
  ● 80 synonym questions from TOEFL
  ● 50 synonym questions from ESL
  ● 136 synonym-antonym questions from ESL
  ● 160 synonym-antonym questions from computational linguistics
  ● 144 similar-associated-both questions from cognitive psychology
  ● 600 noun-modifier relation classification problems from computational linguistics

Analogical Mappings
● mapping from solar system to Rutherford-Bohr atom
  ● source = Copernican solar system (more familiar; 1514-1543)
  ● target = Rutherford-Bohr atomic model (less familiar; 1911-1913)

Analogical Mappings
● Latent Relation Mapping Engine (LRME) uses vectors for analogical mapping
● input to LRME:

    Source A        Target B
    planet          revolves
    attracts        atom
    revolves        attracts
    sun             electromagnetism
    gravity         nucleus
    solar system    charge
    mass            electron

Analogical Mappings
● output of LRME:

    Source A        Mapping M    Target B
    solar system    →            atom
    sun             →            nucleus
    planet          →            electron
    mass            →            charge
    attracts        →            attracts
    revolves        →            revolves
    gravity         →            electromagnetism

Semantic Composition
● vectors work well as representations of individual words, but what about phrases, sentences, paragraphs, …?
● given a vector for “dog” and a vector for “house”, can we calculate a vector for “dog house”?
● Jeff Mitchell and Mirella Lapata (2008, 2009, 2010)
  ● vector-based models of semantic composition
  ● element-wise multiplication of vectors of probabilities

Vectors and Logic
● can we apply AND, OR, and NOT to vectors?
● given a vector for “bass” and a vector for “fish”, can we exclude the “fish” sense of “bass” and focus on the musical instrument sense of “bass”?
● Dominic Widdows (2004)
  ● “bass” AND NOT “fish” = projection of “bass” vector into the orthogonal complement of the “fish” vector

Analog and Digital
● vectors:
  ● analog, real-valued, continuous, distributed, diffused, spatial, geometrical
● words:
  ● digital, Boolean-valued, discrete, concentrated, exclusive, symbolic, logical
● how to reconcile vectors and words?

Analog and Digital
● semantic, prior, analog, spatial, vector
    <p(H_1) = 0.03, p(H_2) = 0.06, ..., p(H_n) = 0.02>
● input, event, episode, evidence
● episodic, posterior, digital, symbolic
    <q(H_1) = 0, q(H_2) = 1, ..., q(H_n) = 0>

Conclusion
● Statistical Semantics Hypothesis:
  ● statistical patterns of human word usage can be used to figure out what people mean
  ● Furnas, Landauer, Gomez, and Dumais (1983)
  ● Turney and Pantel (2010)

Conclusion
● 2001: Mining the Web for Synonyms
● 2011: Mining the Web for Meaning
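
Worked sketch: PMI-IR and pattern-vector cosine
The PMI, semantic orientation, and cosine formulas from the earlier slides can be tried out with a few lines of Python. This is only a minimal sketch, not the code behind the reported results. The function names, the index size TOTAL_PAGES, and every hit count in the TOEFL part are invented for illustration; only the five-pattern traffic:street and water:riverbed counts come from the slides. The cosine over those five patterns will not reproduce the 0.692 shown on the slide, which presumably rests on a much larger pattern vector.

```python
import math

TOTAL_PAGES = 1_000_000  # hypothetical size of the search index


def pmi(hits_joint, hits_a, hits_b, total=TOTAL_PAGES):
    """log2( p(a & b) / (p(a) * p(b)) ), estimating each p as hits / total."""
    if min(hits_joint, hits_a, hits_b) == 0:
        return float("-inf")  # no co-occurrence evidence at all
    return math.log2(hits_joint * total / (hits_a * hits_b))


def semantic_orientation(word_hits, hits_with_excellent, hits_with_poor,
                         excellent_hits, poor_hits, total=TOTAL_PAGES):
    """SO(word) = PMI(word, "excellent") - PMI(word, "poor")."""
    return (pmi(hits_with_excellent, word_hits, excellent_hits, total)
            - pmi(hits_with_poor, word_hits, poor_hits, total))


def cosine(u, v):
    """Cosine of the angle between two equal-length frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# TOEFL-style question with invented hit counts (not real search results).
stem_hits = 1_000  # hits for "levied"
choice_hits = {"imposed": 5_000, "believed": 20_000,
               "requested": 10_000, "correlated": 4_000}
joint_hits = {"imposed": 200, "believed": 50,
              "requested": 40, "correlated": 5}

# Pick the choice with the highest PMI against the stem.
best_choice = max(
    choice_hits,
    key=lambda c: pmi(joint_hits[c], stem_hits, choice_hits[c]),
)

# Pattern frequency vectors from the slide, in the order:
# "X in the Y", "Y on X", "Y with X", "X from the Y", "X when Y".
traffic_street = [615, 6, 478, 136, 2]
water_riverbed = [91, 0, 11, 14, 0]
relational_similarity = cosine(traffic_street, water_riverbed)
```

With these made-up counts, best_choice is "imposed", matching the solution on the TOEFL slide. Real use would replace the dictionaries with hit counts returned by a search engine and would need a rule for zero counts other than the simple one assumed here.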