Using Abstract Context to Detect Figurative Language

Nazneen Fatema Rajani, Edaena Salinas, Raymond Mooney
University of Texas at Austin
Austin, TX
[email protected] [email protected] [email protected]
Abstract

Metaphor is ubiquitous in language, and in many applications it is important to distinguish between the literal and figurative use of language. Therefore, a SemEval task was recently introduced for detecting figurative language. This paper introduces a new approach to this task that uses a rich combination of features, particularly ones that measure the degree of abstractness of the words in a potentially metaphorical phrase and the words in the surrounding context. Experimental results demonstrate that this approach achieves state-of-the-art performance on the SemEval 2013 phrasal semantics dataset.
1 Introduction
Idioms are ubiquitous in human language, and explicit knowledge of idioms from resources such as
dictionaries and knowledge bases has been shown
to provide only limited coverage (Widdows and
Dorow, 2005). Hence, methods are needed for automatically identifying idiomatic expressions. One
fundamental task is identifying if a potentially idiomatic phrase is being used in its figurative or literal sense. Although some idiomatic phrases rarely
occur in their literal sense, a number of them are
commonly used both literally and figuratively. As
discussed by Li and Sporleder (2010), figurative-language detection is important for machine translation because it can prevent mistakenly translating
a phrase using its literal sense when it is intended
figuratively. For example, translating the sentence
“It would be a sad thing to kick the bucket without
having been to Alaska” from English to Spanish and
back to English using Google Translate produces “It
would be very sad to stretch the leg without being in
Alaska”. Detecting figurative language is also useful for other tasks such as Information Retrieval.
In this paper we explore a wide range of contextual features that allow a logistic-regression classifier to predict whether a phrase is being used literally or figuratively. In particular, features that measure the degree of abstractness of the words in the phrase and the words in the surrounding context were found to be very useful.
2 Feature Identification and Extraction
We extracted a variety of contextual features related to lexical semantics, abstractness, and co-occurrence. All of the features were aimed at identifying whether a phrase was being used figuratively.
Bag of Words: The frequency of occurrence of each word in the context of the target phrase was used as a feature. For each document in the training corpus, all standard English stop-words were first removed, along with the target phrase whose literal/figurative usage was to be determined. Each remaining whitespace-separated word was treated as a feature. This allows individual words in the context of a given phrase to help determine the nature of its usage.
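As a concrete illustration, the following is a minimal Python sketch of this feature extraction; the stop-word list and tokenizer are simplified stand-ins, not the exact resources used in our experiments.

```python
# A minimal sketch of the bag-of-words feature extraction, assuming
# simplified stand-ins for the stop-word list and tokenizer.
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "it", "of", "to", "and", "in"}  # abbreviated list

def bow_features(context, target_phrase):
    """Count context words, excluding stop-words and the target phrase itself."""
    # Remove the target phrase whose literal/figurative use is to be determined.
    text = context.lower().replace(target_phrase.lower(), " ")
    # Whitespace tokenization, as described above.
    tokens = [t.strip(".,!?\"'") for t in text.split()]
    return Counter(t for t in tokens if t and t not in STOP_WORDS)

print(bow_features("He kicked the bucket after a long illness.", "kicked the bucket"))
# Counter({'he': 1, 'after': 1, 'long': 1, 'illness': 1})
```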
Concreteness Measure: This measure assigns a score to a word depending on its degree of abstractness. For example, words like milk, talking, and heavy refer to concrete things, events, or properties, whereas words such as politics, calculating, and functional refer to ideas and concepts that are distant from immediate perception and are hence abstract. We used the MRC Psycholinguistic Database Machine Usable Dictionary (Coltheart, 1981), which contains a list of 4,295 words with concreteness scores between 100 and 700. We used the average concreteness score (where known) of the words in the context as an additional feature. The idea is that a figurative usage of a phrase may have relatively more abstract words in its surrounding context than a literal usage of the same phrase.
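The sketch below shows how this feature could be computed, assuming the MRC entries have been loaded into a dictionary mapping each word to its score; the scores shown are illustrative placeholders, not actual MRC values.

```python
# A sketch of the average-concreteness feature, assuming the MRC entries
# have been loaded into a dict mapping word -> score in [100, 700].
# The scores below are made-up placeholders for illustration.
MRC_CONCRETENESS = {"milk": 670, "heavy": 530, "politics": 270, "functional": 310}

def avg_concreteness(context_words):
    """Average the concreteness scores of context words found in the MRC list."""
    scores = [MRC_CONCRETENESS[w] for w in context_words if w in MRC_CONCRETENESS]
    return sum(scores) / len(scores) if scores else None  # None if no word is covered

print(avg_concreteness(["politics", "of", "milk"]))  # -> 470.0
```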
Phrase to Context Relatedness: This feature measures how related the words in a phrase are to the context in which it appears. For example, when the phrase bread and butter occurs in its literal sense, words related to “food” or “eating” tend to appear in the context, while a figurative use of the phrase tends to occur with semantically less-related words. In order to compute phrase to context relatedness, we used WordNet Similarity. The first step determines a WordNet sense for each word in the context as well as in the target phrase. Senses were computed using the AllWords package, which assigns a sense to every word depending on how closely it is related to the other words in its context (Pedersen and Kolhatkar, 2009). The word senses are represented by second-order co-occurrence vectors of their dictionary (WordNet) definitions, and the relatedness of two senses is computed as the cosine of their representative vectors (Pedersen and Kolhatkar, 2009). The total phrase to context relatedness is computed by summing the relatedness between every word in the phrase and every word in the given instance and then normalizing by the number of words in the phrase and in the instance:
$$\text{Relatedness} = \frac{\sum_{p \in P} \sum_{u \in U} \text{sim}(p, u)}{|P| \cdot |U|} \quad (1)$$

where $p$ represents a word in the phrase $P$, $u$ represents a word in the context $U$, and $\text{sim}$ is the cosine similarity.
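A direct implementation of Equation (1) is straightforward once a similarity function is available. In the sketch below the sim argument is pluggable; the toy similarity is only a stand-in for the WordNet-based cosine similarity described above.

```python
# A direct sketch of Equation (1): sum word-pair similarities and normalize
# by |P| * |U|. The paper uses the cosine of second-order co-occurrence
# vectors from WordNet::SenseRelate; the toy similarity below is a stand-in.
def phrase_context_relatedness(phrase_words, context_words, sim):
    total = sum(sim(p, u) for p in phrase_words for u in context_words)
    return total / (len(phrase_words) * len(context_words))

# Toy stand-in similarity: 1.0 for identical words, else 0.0.
toy_sim = lambda a, b: 1.0 if a == b else 0.0
print(phrase_context_relatedness(["bread", "butter"], ["bread", "toast"], toy_sim))  # 0.25
```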
Topic to Context Relatedness: This feature draws on a supplementary resource: a list of topic words related to the given phrase. We extracted two lists of topic words for each phrase in the training corpus, one for the literal sense and the other for the figurative sense. For example, for the phrase bread and butter, some of the topic words extracted were:

literal: tea, food, bread, butter, evening, drink, dish, eating
figurative: council, people, great, year, value, oil, west
Topic words were extracted using the Maui algorithm, which builds on the Keyphrase Extraction Algorithm (KEA) and is used for automatic topic indexing (Medelyan, 2009). We used the model trained on the SemEval 2010 keyphrase extraction track. Once we obtained the topic words, we computed how related each word from a context was to each word from the topic list. This was done the same way as for Phrase to Context Relatedness, except that instead of comparing to every word in the phrase, we compared to every word in the topic list. This approach allows us to determine whether the current context of a phrase is more semantically similar to typical literal or figurative uses of the phrase.
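The sketch below illustrates this computation, reusing the same normalized-sum form as Equation (1) but comparing the context against a topic list; the topic lists and similarity function here are illustrative.

```python
# Sketch of the topic-to-context relatedness features, assuming topic word
# lists for the literal and figurative senses have already been extracted
# (e.g. with Maui); the lists and the sim function here are illustrative.
def topic_context_relatedness(topic_words, context_words, sim):
    total = sum(sim(t, u) for t in topic_words for u in context_words)
    return total / (len(topic_words) * len(context_words))

literal_topics = ["tea", "food", "bread", "butter", "eating"]
figurative_topics = ["council", "people", "value", "year"]
context = ["fresh", "bread", "with", "tea"]

toy_sim = lambda a, b: 1.0 if a == b else 0.0
# Two features: relatedness of the context to each sense's topic list.
print(topic_context_relatedness(literal_topics, context, toy_sim))     # 0.1 (literal-leaning)
print(topic_context_relatedness(figurative_topics, context, toy_sim))  # 0.0
```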
The two relatedness features together contributed
to the semantic features of the model.
3 Experimental Evaluation

3.1 Data
We used the data for the SemEval 2013 phrasal semantics task (Korkontzelos et al., 2013). This data was created by taking a list of English idioms obtained from Wiktionary and manually removing phrases that were unlikely to be used in the literal sense.
An example from the dataset for the phrase in the
bag being used figuratively is “Come off it, lighten
up for once. Australian win is not gonna happen,
it’s in the bag for us, or I will eat my modem. I hope
you’re hungry”. The same phrase is used literally
in the dataset as “Is any of this coming back to you,
agent Mulder? MULDER: No. Can I see what’s in
the bag now? CURTIS: Be my guest”.
The SemEval task was subdivided into two subtasks. The first, “known phrases,” subtask uses the same phrases in the training, development, and test sets. The second, “unknown phrases” or “unseen phrases,” subtask has disjoint sets of phrases in the training, development, and test sets. A few instances in the dataset were labeled as both figurative and literal; in these cases, we duplicated the instance and added a copy to both the literal and figurative classes. Altogether there are 1,427 and 1,117 training instances, 359 and 343 development instances, and 595 and 519 test instances for the known-phrase and unknown-phrase tasks respectively. Table 1 displays the list of known phrases from the dataset.

Table 1: Known phrases in the dataset

Phrases
At the end of the day
Bread and butter
Break a leg
Drop the ball
In the bag
In the fast lane
Play ball
Rub it in
Through the roof
Under the microscope
3.2 Evaluation Methodology and Metrics

In this section we discuss our methods for training and testing figurative-language detectors using the features described in Section 2. We used LIBLINEAR's L2-regularized logistic regression (L2LR) (Fan et al., 2008) to build classifiers using the extracted features. Classification accuracy on the test set was used to evaluate performance. The cost parameter C was tuned using the development set. The L2LR parameters used are discussed further in Section 4.
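A sketch of this training and tuning procedure is shown below, using scikit-learn's liblinear solver as a stand-in for the LIBLINEAR package; the grid of C values is illustrative (the value actually selected is reported in Section 4), and X_train, y_train, X_dev, y_dev are assumed feature matrices and labels.

```python
# Sketch of the training setup: L2-regularized logistic regression with the
# cost parameter C selected on the development set. scikit-learn's liblinear
# solver stands in for the LIBLINEAR package used in the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression

def tune_and_train(X_train, y_train, X_dev, y_dev, c_grid=np.linspace(0.01, 2.0, 50)):
    best_c, best_acc = None, -1.0
    for c in c_grid:
        clf = LogisticRegression(penalty="l2", C=c, solver="liblinear")
        clf.fit(X_train, y_train)
        acc = clf.score(X_dev, y_dev)  # development-set accuracy
        if acc > best_acc:
            best_c, best_acc = c, acc
    # Refit with the best C; test-set accuracy is then reported on held-out data.
    return LogisticRegression(penalty="l2", C=best_c, solver="liblinear").fit(X_train, y_train)
```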
4 Results and Discussion

In this section we present our results and compare them to those of the teams that participated in this SemEval 2013 task. We also discuss and analyze the results obtained.

4.1 Results

We obtained the highest accuracy of 80% using C = 0.59 and the bag-of-words along with the degree-of-concreteness features on the known-phrases test set. For the unseen-phrases dataset, we obtained the highest accuracy of 64% using all of the aforementioned features. The accuracies obtained for different feature subsets on the known and unseen phrases test sets are shown in Table 2, where BOW stands for the bag-of-words features, Abstract is the concreteness feature, and Semantic is the pair of relatedness measures described in Section 2. All reported accuracies are micro-average accuracies on the test sets. For the known phrases, we obtained the highest accuracy for the phrase Play ball and the lowest accuracy for the phrase In the fast lane.

Table 3 compares our results with those obtained by the teams participating in the competition for the known and unknown phrases categories. The baseline labels all test instances with the majority class in training. Our accuracy for known phrases is substantially better than the second best, while we are close to the second best for the unseen-phrases category.

Table 2: Accuracies for subsets of features on the test sets for known and unseen phrases

Features        Known Phrases   Unseen Phrases
All features    0.7882          0.6416
Bag-of-words    0.7749          0.6171
BOW+Abstract    0.8003          0.6301
BOW+Semantic    0.7899          0.6146

Table 3: Results for the known and unseen phrases subtasks for participating teams; US denotes our system

Category          Participant   Accuracy
Known Phrases     US            0.8003
                  IIRG 3        0.7795
                  UNAL 2        0.7542
                  UNAL 1        0.7222
                  IIRG 1        0.5303
                  BASELINE      0.5034
                  IIRG 2        0.5017
Unseen Phrases    UNAL 1        0.6680
                  UNAL 2        0.6448
                  US            0.6416
                  BASELINE      0.6158
                  CLaC 1        0.5502

4.2 Discussion
Our overall accuracy for the known phrases is 2% better than the second best, and may be close to the best accuracy obtainable by training and testing on the given corpus. The IIRG 3 system built a bag-of-words over the canonical forms of the noun tokens in training, labeling each token as figurative or literal depending on its frequency of occurrence within a given context; a token had to occur at least 2 times in a given context to be added to the bag-of-words (Byrne et al., 2013). We obtained a maximum F1 score of 0.80 for the figurative class using the bag-of-words and concreteness features.
We did not achieve the best accuracy for unseen phrases because all of our features learn from the abstractness and semantics of the training corpus, which does not transfer to unseen phrases, so training provided little useful signal. This indicates that solving such a task requires features that are independent of the particular idiomatic phrase and its context, which is hard even for human judgment. The UNAL 1 system used a variety of cohesiveness features, such as various combinations of Pointwise Mutual Information (PMI) and term frequency (tf) and inverse document frequency (idf) scores (Jimenez et al., 2013).
In experimenting with and analyzing the various features extracted from the data, we observed that bag-of-words by itself was quite accurate on the known phrases, well above the baseline. However, as expected, it did not perform well on the unseen phrases because the words that occurred in training almost never occurred in test. The concreteness feature on its own unexpectedly did not perform well, because the word list provided by the MRC Database is very limited and only a few of those words actually occurred in our corpus. However, when combined with the bag-of-words, the concreteness feature boosted accuracy significantly. The semantic features combined with the bag-of-words did not significantly affect accuracy on the known phrases and actually hurt performance on unseen phrases, because the semantic features over-fit the training instances.
5 Related Work
Most previous work on idiom classification is based on the type-extraction method. As discussed by Li and Sporleder (2010), this method exploits the fact that idioms have properties that help differentiate them from other expressions, such as a degree of syntactic and lexical fixedness. For example, idioms typically resist the passive voice: the bucket was kicked loses the idiomatic sense of kick the bucket.
In addition, although there has been linguistic research on the properties of idioms, there is still no agreement on which properties characterize them. There has also been work on identifying those properties automatically (Fazly et al., 2009); however, that approach focused on a particular type of idiom formed by combining a frequent verb with a noun, such as shoot the breeze and make a face.

Numerous studies have instead taken a token-based approach. One example is a minimally supervised algorithm that distinguishes whether the sense in which a verb is being used is figurative or literal (Birke and Sarkar, 2006). Since it focuses on the verb, a particular token, this approach does not consider the linguistic properties mentioned above.
6 Conclusion
Metaphor is ubiquitous, and recognizing the figurative use of phrases is an important, challenging task. An algorithm for distinguishing literal and figurative senses of a phrase can potentially improve many NLP applications such as machine translation, search, and automatic summarization. We introduced a new method that uses a variety of features to predict whether a given phrase has been used literally or figuratively. Features based on the abstractness of the words in the context were shown to be particularly useful. The method was shown to obtain state-of-the-art results on the recent SemEval phrasal semantics task.
References
Julia Birke and Anoop Sarkar. 2006. A clustering approach for the nearly unsupervised recognition of nonliteral language. In Proceedings of EACL, volume 6,
pages 329–336.
Lorna Byrne, Caroline Fenlon, and John Dunnion. 2013. IIRG: A naive approach to evaluating phrasal semantics. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 103–107, Atlanta, Georgia, USA, June. Association for Computational Linguistics.
Max Coltheart. 1981. The MRC Psycholinguistic Database. The Quarterly Journal of Experimental Psychology, 33(4):497–505.
Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research, 9:1871–1874.
Afsaneh Fazly, Paul Cook, and Suzanne Stevenson.
2009. Unsupervised type and token identification
of idiomatic expressions. Computational Linguistics,
35(1):61–103.
Sergio Jimenez, Claudia Becerra, and Alexander Gelbukh. 2013. UNAL: Discriminating between literal and figurative phrasal usage using distributional statistics and POS tags. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, Georgia, USA, June. Association for Computational Linguistics.
Ioannis Korkontzelos, Torsten Zesch, Fabio Massimo Zanzotto, and Chris Biemann. 2013. SemEval-2013 Task 5: Evaluating phrasal semantics. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, Georgia, USA, June. Association for Computational Linguistics.
Linlin Li and Caroline Sporleder. 2010. Using Gaussian mixture models to detect figurative language in
context. In Human Language Technologies: The 2010
Annual Conference of the North American Chapter of
the Association for Computational Linguistics, pages
297–300. Association for Computational Linguistics.
Olena Medelyan. 2009. Human-competitive automatic topic indexing. Ph.D. thesis, The University of
Waikato.
Ted Pedersen and Varada Kolhatkar. 2009. WordNet::SenseRelate::AllWords: A broad coverage word sense tagger that maximizes semantic relatedness. In Proceedings of Human Language Technologies: The 2009
Annual Conference of the North American Chapter of
the Association for Computational Linguistics, Companion Volume: Demonstration Session, pages 17–20.
Association for Computational Linguistics.
Dominic Widdows and Beate Dorow. 2005. Automatic
extraction of idioms using graph analysis and asymmetric lexicosyntactic patterns. In Proceedings of the
ACL-SIGLEX Workshop on Deep Lexical Acquisition,
pages 48–56. Association for Computational Linguistics.