Using Abstract Context to Detect Figurative Language

Nazneen Fatema Rajani, Edaena Salinas, and Raymond Mooney
University of Texas at Austin, Austin, TX
[email protected]

Abstract

Metaphor is ubiquitous in language, and in many applications it is important to distinguish between the literal and figurative use of language. Therefore, a SemEval task was recently introduced for detecting figurative language. This paper introduces a new approach to this task that uses a rich combination of features, particularly ones that measure the degree of abstractness of the words in a potentially metaphorical phrase and of the words in the surrounding context. Experimental results demonstrate that this approach achieves state-of-the-art performance on the SemEval 2013 phrasal semantics dataset.

1 Introduction

Idioms are ubiquitous in human language, and explicit knowledge of idioms from resources such as dictionaries and knowledge bases has been shown to provide only limited coverage (Widdows and Dorow, 2005). Hence, methods are needed for automatically identifying idiomatic expressions. One fundamental task is identifying whether a potentially idiomatic phrase is being used in its figurative or literal sense. Although some idiomatic phrases rarely occur in their literal sense, a number of them are commonly used both literally and figuratively. As discussed by Li and Sporleder (2010), figurative-language detection is important for Machine Translation because it can prevent mistakenly translating a phrase using its literal sense when it is intended figuratively. For example, translating the sentence "It would be a sad thing to kick the bucket without having been to Alaska" from English to Spanish and back to English using Google Translate produces "It would be very sad to stretch the leg without being in Alaska".
Detecting figurative language is also useful for other tasks such as Information Retrieval. In this paper we explore a wide range of contextual features that allow a logistic-regression classifier to predict whether a phrase is being used literally or figuratively. In particular, features that measure the degree of abstractness of the words in the phrase and of the words in the surrounding context were found to be very useful.

2 Feature Identification and Extraction

We extracted a variety of contextual features related to lexical semantics, abstractness, and co-occurrence. All of the features were aimed at identifying whether a phrase was being used figuratively.

Bag of Words: The frequency of occurrence of each word in the context of the target phrase was used as a feature. For each document in the training corpus, all standard English stop-words were first removed, along with the target phrases whose literal/figurative usage was to be determined. The remaining words, separated by white-space, were each treated as a feature. This allows individual words in the context of a given phrase to help determine the nature of its usage.

Concreteness Measure: This measure assigns a score to a word depending on its degree of abstractness. For example, words like milk, talking, and heavy refer to concrete things, events, or properties, whereas words such as politics, calculating, and functional refer to ideas and concepts that are distant from immediate perception and are hence abstract. We used the MRC Psycholinguistic Database Machine Usable Dictionary (Coltheart, 1981), which contains a list of 4,295 words with concreteness scores between 100 and 700. We used the average concreteness score (where known) of the words in the context as an additional feature, the idea being that a figurative usage of a phrase may have relatively more abstract words in its context than a literal usage of the same phrase.
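As a concrete sketch of the concreteness feature, the snippet below averages MRC-style scores over a context. The lookup table is a small hypothetical excerpt: the words come from the examples above, but the numeric scores are illustrative stand-ins, not the actual MRC values.

```python
# Sketch of the concreteness feature. MRC_SCORES is a tiny hypothetical
# excerpt standing in for the MRC Psycholinguistic Database, whose real
# scores range from 100 (most abstract) to 700 (most concrete).
MRC_SCORES = {
    "milk": 612, "talking": 527, "heavy": 555,               # concrete-ish
    "politics": 285, "calculating": 311, "functional": 296,  # abstract-ish
}

def avg_concreteness(context_words):
    """Average concreteness over the context words with a known score;
    words missing from the database are simply skipped."""
    known = [MRC_SCORES[w] for w in context_words if w in MRC_SCORES]
    return sum(known) / len(known) if known else 0.0

# A figurative usage should tend to have a more abstract (lower-scoring)
# context than a literal usage:
print(avg_concreteness(["milk", "heavy", "talking"]))               # higher
print(avg_concreteness(["politics", "functional", "calculating"]))  # lower
```

Because the MRC list is small, many context words have no score; skipping unknown words (rather than treating them as zero) keeps the average from being dragged down by vocabulary gaps.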
Phrase-to-Context Relatedness: This feature measures how related the words in a phrase are to the context in which it appears. For example, the phrase bread and butter occurring in a literal sense would have words related to "food" or "eating" in the context, while a figurative use of the phrase will tend to occur with semantically less-related words. In order to compute phrase-to-context relatedness, we used WordNet Similarity. The first step determines a WordNet sense for each word in the context as well as in the target phrase. Senses were computed using the All Words package, which assigns a sense to every word depending on how closely it is related to the other words in its context (Pedersen and Kolhatkar, 2009). The word senses are represented by second-order co-occurrence vectors of their dictionary (WordNet) definitions. The relatedness of two senses is then computed as the cosine of their representative vectors (Pedersen and Kolhatkar, 2009). The total phrase-to-context relatedness is computed by summing the relatedness between every word in the phrase and every word in the given instance and then normalizing by the number of words in the phrase and in the instance:

\[
\text{Relatedness} = \frac{1}{|P| \cdot |U|} \sum_{p \in P} \sum_{u \in U} \text{sim}(p, u) \tag{1}
\]

where p represents a word in the phrase P, u represents a word in the context U, and sim is the cosine similarity.

Topic-to-Context Relatedness: This feature adds a supplementary resource by extracting a list of related topic words for the given phrase. We extracted two lists of topic words for each phrase in the training corpus: one for the literal sense and the other for the figurative sense.
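Equation 1 can be sketched directly in code. Here the sense vectors are toy stand-ins for the second-order co-occurrence vectors that the All Words sense tagger would produce.

```python
import math

def cosine(v1, v2):
    """Cosine similarity between two sense vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

def relatedness(phrase_vecs, context_vecs):
    """Equation 1: sum the similarity between every phrase word and
    every context word, then normalize by |P| * |U|."""
    total = sum(cosine(p, u) for p in phrase_vecs for u in context_vecs)
    return total / (len(phrase_vecs) * len(context_vecs))

# Toy example: one phrase word, two context words -- one pointing the
# same direction (sim = 1.0) and one orthogonal (sim = 0.0), so the
# normalized relatedness is (1.0 + 0.0) / (1 * 2) = 0.5.
print(relatedness([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]))  # 0.5
```

The normalization makes scores comparable across instances of different lengths, so the feature reflects average semantic fit rather than context size.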
For example, for the phrase bread and butter some of the topic words extracted were:

literal: tea, food, bread, butter, evening, drink, dish, eating
figurative: council, people, great, year, value, oil, west

Topic words were extracted using the Maui algorithm, which builds on the Keyphrase Extraction Algorithm (KEA) and is used for automatic topic indexing (Medelyan, 2009). We used the model trained on the SemEval 2010 keyphrase extraction track. Once we obtained the topic words, we computed how related each word from a context was to each word from the topic list. This was done in the same way as for phrase-to-context relatedness, except that instead of comparing to every word in the phrase we compared to every word in the topic list. This approach allows determining whether the current context of a phrase is more semantically similar to typical literal or figurative uses of the phrase. The two relatedness features together constitute the semantic features of the model.

3 Experimental Evaluation

3.1 Data

We used the data for the SemEval 2013 phrasal semantics task (Korkontzelos et al., 2013). This data was created using a list of English idioms obtained from Wiktionary and manually removing phrases that were unlikely to be used in the literal sense. An example from the dataset of the phrase in the bag being used figuratively is "Come off it, lighten up for once. Australian win is not gonna happen, it's in the bag for us, or I will eat my modem. I hope you're hungry". The same phrase is used literally in the dataset as "Is any of this coming back to you, agent Mulder? MULDER: No. Can I see what's in the bag now? CURTIS: Be my guest". The SemEval task was subdivided into two subtasks. The first, "known phrases", subtask uses the same phrases in the training, development, and test sets. The second, "unknown phrases" (or "unseen phrases"), subtask has disjoint phrases in the training, development, and test sets.
A few instances in the dataset were classified as being both figurative and literal; in these cases, we duplicated the instance and added a copy to both the literal and figurative classes. Altogether there are 1,427 and 1,117 training instances, 359 and 343 development instances, and 595 and 519 test instances for the known-phrase and unknown-phrase tasks respectively. Table 1 displays the list of known phrases from the dataset.

Table 1: Known phrases in the dataset

    At the end of the day
    Bread and butter
    Break a leg
    Drop the ball
    In the bag
    In the fast lane
    Play ball
    Rub it in
    Through the roof
    Under the microscope

3.2 Evaluation Methodology and Metrics

In this section we discuss our methods for training and testing figurative-language detectors using the features described in Section 2. We used LIBLINEAR's L2-regularized Logistic Regression (L2LR) (Fan et al., 2008) to build classifiers using the extracted features. Classification accuracy on the test set was used to evaluate performance. The cost parameter C was tuned using the development set. The L2LR parameters used are discussed further in Section 4.

4 Results and Discussion

In this section we present our results and compare them to those of the teams that participated in this SemEval 2013 task. This section also contains discussion and analysis of the results obtained.

4.1 Results

We obtained the highest accuracy of 80% using C = 0.59 and the bag-of-words along with the degree-of-concreteness features on the known phrases test set. For the unseen phrases dataset, we obtained the highest accuracy of 64% using all of the aforementioned features. The accuracies obtained for different feature subsets on the known and unseen phrases test sets are shown in Table 2. BOW stands for bag-of-words features, Abstract is the concreteness feature, and Semantic denotes the two relatedness measures described in Section 2. All the accuracies reported are the micro-average accuracies on the test sets. For the seen phrases, we obtained the highest accuracy for the phrase Play ball and the lowest accuracy for the phrase In the fast lane.

Table 2: Accuracies for subsets of features on the test sets for known and unseen phrases

    Features       Known Phrases   Unseen Phrases
    All features   0.7882          0.6416
    Bag-of-words   0.7749          0.6171
    BOW+Abstract   0.8003          0.6301
    BOW+Semantic   0.7899          0.6146

Table 3 compares our results with those obtained by the teams participating in the competition in the known and unknown phrases categories. The baseline is obtained by labeling all the test instances with the majority class in training. It can be observed that our accuracy for known phrases is substantially better than the second best, while we are close to the second best in the unseen phrases category.

Table 3: Results for the Known and Unseen Phrases subtasks for participating teams; US represents our accuracy

    Category         Participant   Accuracy
    Known Phrases    US            0.8003
                     IIRG 3        0.7795
                     UNAL 2        0.7542
                     UNAL 1        0.7222
                     IIRG 1        0.5303
                     BASELINE      0.5034
                     IIRG 2        0.5017
    Unseen Phrases   UNAL 1        0.6680
                     UNAL 2        0.6448
                     US            0.6416
                     BASELINE      0.6158
                     CLaC 1        0.5502

4.2 Discussion

Our overall accuracy for the known phrases is 2% better than the second best, and this may be near the best accuracy that can be obtained by training and testing on the given corpus. The IIRG 3 algorithm used a bag-of-words built by selecting the canonical form of all the noun tokens in training; each token was labeled as figurative or literal depending on its frequency of occurrence within a given context. They used a threshold of 2, so a given token must occur at least 2 times in a given context for it to be added to the bag-of-words (Byrne et al., 2013). We obtained a maximum F1 score of 0.80 for the figurative class by using the bag-of-words and concreteness score as features.
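The training setup described in Section 3.2 can be illustrated with a toy stand-in: a pure-Python L2-regularized logistic regression fit by batch gradient descent (LIBLINEAR optimizes essentially the same objective with a far more sophisticated solver), with the cost parameter C tuned on a development set. The data below is synthetic; of the C grid, only 0.59 comes from the paper.

```python
import math

def train_l2lr(X, y, C, epochs=300, lr=0.05):
    """Minimize (1/2)||w||^2 + C * sum of logistic losses -- essentially
    the objective LIBLINEAR's L2-regularized LR optimizes."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = list(w)                      # gradient of the L2 penalty
        for xi, yi in zip(X, y):
            z = sum(a * b for a, b in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted P(figurative)
            for j, xij in enumerate(xi):
                grad[j] += C * (p - yi) * xij
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad)]
    return w

def accuracy(w, X, y):
    preds = [1 if sum(a * b for a, b in zip(w, xi)) > 0 else 0 for xi in X]
    return sum(int(p == t) for p, t in zip(preds, y)) / len(y)

# Synthetic train/dev split: feature 0 separates the two classes.
X_train = [[1.0, 0.2], [0.9, -0.1], [0.8, 0.3],
           [-1.0, 0.1], [-0.9, -0.2], [-1.1, 0.0]]
y_train = [1, 1, 1, 0, 0, 0]
X_dev, y_dev = [[1.2, 0.0], [-1.2, 0.1]], [1, 0]

# Tune C on the dev set, as in Section 3.2 (0.59 was the best value
# reported for the known-phrases task).
best_C = max([0.01, 0.1, 0.59, 1.0],
             key=lambda C: accuracy(train_l2lr(X_train, y_train, C),
                                    X_dev, y_dev))
```

Selecting C on the development set rather than the test set, as done here, is what keeps the reported test accuracies honest.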
We did not achieve the best accuracy for unseen phrases because all our features were designed to learn from the abstractness measure and semantics of the training corpus, which was not possible for unseen phrases, so the training was largely uninformative. This indicates that to solve such a task we need to extract features that are independent of the particular idiomatic phrase and its context. This is a hard task even for human judgment. The UNAL 1 algorithm used a variety of cohesiveness features, such as various combinations of the Pointwise Mutual Information (PMI) measure and the term frequency (tf) and inverse document frequency (idf) scores (Jimenez et al., 2013).

While experimenting with and analyzing the various features extracted from the data, we observed that bag-of-words by itself proved to be significantly accurate on the known phrases, better than the baseline. However, as expected, it did not perform well on the unseen phrases, because the words that occurred in training almost never occurred in test. The concreteness feature on its own unexpectedly did not perform well, the reason being that the list of words provided by the MRC Database is very limited and only a few of those words actually occurred in our corpus. However, when used along with the bag-of-words, the concreteness feature boosted accuracy significantly. The semantic features combined with the bag-of-words did not significantly affect accuracy on the known phrases and actually deteriorated performance on the unseen phrases, because the semantic features were over-fitting the training instances.

5 Related Work

Most previous work on idiom classification is based on the type-extraction method. As mentioned in Li and Sporleder (2010), this method relies on the fact that idioms have properties that help differentiate them from other expressions, such as a degree of syntactic and lexical fixedness.
One example is that they are not used in the passive voice: for instance, the bucket was kicked instead of kick the bucket. In addition, although there has been linguistic research on the properties of idioms, there is still no agreement on which properties characterize them. Moreover, there has been work on identifying those properties automatically (Fazly et al., 2009); however, that approach focused on a particular type of idiom formed by combining a frequent verb with a noun, such as shoot the breeze and make a face. Numerous other studies have focused on the token-based approach. One example is a minimally supervised algorithm that distinguishes whether the sense in which a verb is being used is figurative or literal (Birke and Sarkar, 2006). Since it focuses on the verb, which is a particular token, this approach does not consider the linguistic properties previously mentioned.

6 Conclusion

Metaphor is ubiquitous, and recognizing the figurative use of phrases is an important, challenging task. An algorithm for distinguishing literal and figurative senses of a word can potentially improve many NLP applications such as machine translation, search, and automatic summarization. We introduced a new method that uses a variety of features to predict whether a given phrase has been used literally or figuratively. Features based on the abstractness of the words in the context were shown to be particularly useful. The method was shown to obtain state-of-the-art results on the recent SemEval phrasal semantics task.

References

Julia Birke and Anoop Sarkar. 2006. A clustering approach for the nearly unsupervised recognition of nonliteral language. In Proceedings of EACL, volume 6, pages 329–336.

Lorna Byrne, Caroline Fenlon, and John Dunnion. 2013. IIRG: A naive approach to evaluating phrasal semantics. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 103–107, Atlanta, Georgia, USA, June. Association for Computational Linguistics.

Max Coltheart. 1981. The MRC psycholinguistic database. The Quarterly Journal of Experimental Psychology, 33(4):497–505.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research, 9:1871–1874.

Afsaneh Fazly, Paul Cook, and Suzanne Stevenson. 2009. Unsupervised type and token identification of idiomatic expressions. Computational Linguistics, 35(1):61–103.

Sergio Jimenez, Claudia Becerra, and Alexander Gelbukh. 2013. UNAL: Discriminating between literal and figurative phrasal usage using distributional statistics and POS tags. In Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, Georgia, USA.

Ioannis Korkontzelos, Torsten Zesch, Fabio Zanzotto, and Chris Biemann. 2013. SemEval-2013 Task 5: Evaluating phrasal semantics. April.

Linlin Li and Caroline Sporleder. 2010. Using Gaussian mixture models to detect figurative language in context. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 297–300. Association for Computational Linguistics.

Olena Medelyan. 2009. Human-competitive automatic topic indexing. Ph.D. thesis, The University of Waikato.

Ted Pedersen and Varada Kolhatkar. 2009. WordNet::SenseRelate::AllWords: A broad coverage word sense tagger that maximizes semantic relatedness. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Demonstration Session, pages 17–20. Association for Computational Linguistics.

Dominic Widdows and Beate Dorow. 2005. Automatic extraction of idioms using graph analysis and asymmetric lexicosyntactic patterns. In Proceedings of the ACL-SIGLEX Workshop on Deep Lexical Acquisition, pages 48–56. Association for Computational Linguistics.