Pun Location
Jennings Jin, Justin Kim, Ian Lin, Grace Wu
{jinjenni, jskimmer, tiannis, gcywu}@umich.edu
Abstract

This report investigates the concept of detecting homographic puns (words with the same spelling but two different meanings) within a sentence. Given a statement containing a homographic pun, we analyze each word with five different metrics. These scores are then totalled, and the word with the highest score is returned as the pun candidate.
1 Introduction
Recent advances in NLP have led to an influx of chatbots which serve a wide variety of purposes, from planning your next vacation to sending you daily images of cats. One area that these chatbots often struggle with is understanding and responding to humor. Detection of humor will be a necessary step in the creation of flexible and natural-sounding chatbots. We focus on the detection of puns whose funniness lies in lexical ambiguity.
Puns are a form of humor in which a statement can be interpreted in multiple ways. Often, one interpretation is a normal (and often reasonable) usage, while the other, though still plausible, is more outlandish or unrelated. There are various types of puns, but in this project we have limited the scope to detecting the position of a homographic pun. A homographic pun is defined as a pun in which the same word (with the same spelling) adopts two different meanings. For example:
• It’s OK to watch an elephant bathe as they
usually have their trunks on.
• An elephant’s opinion carries a lot of
weight.
• Energizer Bunny arrested – charged with
battery.
In the first example, the word trunks has two meanings: the first is the part of the elephant's body, and the second is a term for swimming shorts. The goal of the model in this case is to identify trunks as the pun word.
2 Related Works
Related works offer us a baseline for how we want our algorithm to perform and a goal that we want our algorithm to surpass. They also give us perspective on which ideas may or may not be feasible, different scoring systems we could use, others' approaches to determining whether a word is a pun, and so on.
Automatic Disambiguation of English
Puns, Miller, Tristan and Gurevych, Iryna
Word sense disambiguation typically assumes that each word in a document or
sentence has a fixed underlying meaning. In
the case of puns this is false, and Miller and
Gurevych explore using traditional word sense
disambiguation methods to identify meanings
of puns, using their manually annotated collection of puns. They did so by slightly modifying
Lesk algorithms (simplified, simplified extended, and simplified lexically expanded) to
return the two top-scoring senses of a word, as
opposed to the single top-scoring sense of a
word. We also decided to use Lesk similarity
scores as one of our metrics, and performed the
same modifications to the algorithm that they
did.
Towards the Automatic Detection and
Identification of English Puns, Miller, Tristan
and Turkovic, Mladen
Lexical polysemy (word sense ambiguity) is used deliberately for humor in the form of puns. This paper outlines general concepts regarding
pun detection and its relationship with word
sense disambiguation, similar to the previous
paper. The authors discuss the challenges for
various kinds of pun detection, one of which
is the lack of well annotated training data.
They created a sense inventory using WordNet,
which we ended up using as well. Rather than using WordNet as a metric, however, they use it to create a custom program to assist with the construction of a large manually annotated set of puns.
Play on Words: Predicting Punniness with
Statistics and Semantics, Kao, Justine and
Tan, Jireh
This paper focuses on detection of homophone puns – puns that are phonetically
ambiguous, e.g. feet and feat – and prediction
of funniness given a pun. The main feature
they utilize is the semantic ambiguity between
different senses of the pun word and the rest of
the context. In general, if the context contained
a pun, they found that replacing the pun word
with its counterpart maintained a similar level
of coherence, but doing the same in non-pun
contexts resulted in a sharp drop in coherence.
While their research is different in dataset
and goal from our own project, we used a
similar idea in one of our metrics, replacing
words in a sentence with synonyms and analyzing the coherence of the sentence afterwards.
The Funny Thing About Incongruity:
A Computational Model of Humor in Puns,
Kao, Justine and Levy, Roger and Goodman,
Noah D.
In this paper, the authors analyze humor in
homophone puns, which we described in the
exploration of the previous paper. Their model
focuses on ambiguity and distinctiveness.
They define ambiguity as a word having multiple definitions with a high usage rate, and distinctiveness as a word that not only makes sense in multiple contexts, but is also supported by other words in the context.
While this paper is also different in dataset
and goal, we used similar ideas of ambiguity
and distinctiveness in our WordNet metric.
Our measure of distinctiveness is also simpler
because our algorithm works assuming that
there is a pun word in the sentence.
From Humor Recognition to Irony Detection: The Figurative Language of Social
Media, Reyes, Antonio and Rosso, Paolo and
Buscaldi, Davide
Humor uses a different linguistic strategy
than regular text in order to produce an effect.
This paper shows how humor can be analyzed
through linguistic devices. Features that they
used to analyze humor were ambiguity, polarity, unexpectedness, and emotion. The latter
three features were irrelevant to our project, but
ambiguity was one of the central ideas of pun
detection. Reyes et al. used Google N-grams
to measure sentence complexity to help measure ambiguity. We looked into using Google
N-grams as well, but ended up using Microsoft
N-grams.
Humor Recognition and Humor Anchor Extraction, Yang, Diyi et al.
This paper also focuses on humor detection,
as well as humor recognition. They identify the
latent semantic structures behind humor, which
include incongruity, ambiguity, phonetic style,
and personal affect. It is thus similar to the previous article, but this research uses a bag-of-words baseline to create sets of words that differentiate humor from non-humor. In our project, we also use a bag of words, but instead create sets of senses for every word to compare with the senses of other words.
3 Dataset
Our dataset consists of 98 contexts that contain a single homographic pun. Our puns
were pulled from website(s) that focused on
single-word puns. After we extracted the puns from the website(s), we manually sorted through the results to find the contexts that fit our specification, which was that each context should contain a single homographic pun. We annotated our data with the following format:
An elephant's opinion carries a lot of weight
weight

Energizer Bunny arrested -- charged with battery
battery
Each context is on its own line, and is followed by the word that is the pun.
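As a rough illustration, the following Python sketch reads this annotation format into (context, pun word) pairs; the file name puns.txt is only a placeholder for wherever the annotated data is stored.

# Minimal sketch: read alternating context / pun-word lines into pairs.
# "puns.txt" is a hypothetical file name for the annotated data.
def load_puns(path="puns.txt"):
    with open(path, encoding="utf-8") as f:
        lines = [line.strip() for line in f if line.strip()]
    # Every context line is immediately followed by its annotated pun word.
    return list(zip(lines[0::2], lines[1::2]))

# e.g. [("An elephant's opinion carries a lot of weight", "weight"), ...]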
4 Approach

4.1 Lesk Similarity Scores
This metric uses the classic bag of words
method in a manner very similar to the Lesk
algorithm. For every word, we generate a bag
of words from the definitions and example usages of its senses. We then compare the
overlap in the bag of words for every sense with
the bag of words for other terms in the context.
Intuitively, senses that fit in the context should
have a high overlap and thereby a high score.
To convert sense scores to pun probabilities per
word, we go with the intuition that pun words
should have exactly two high-scoring senses. Therefore, the score returned for each word is:
1 - (topScore-secondScore)/(topScore) +
(secondScore - thirdScore)/(secondScore)
For each context, the scores are normalized to sum to one and can be interpreted
as probabilities according to this particular
similarity score.
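A minimal sketch of this metric, assuming NLTK's WordNet interface, is given below. The gloss-bag construction and tokenization are simplified guesses, and the sketch compares each sense's gloss bag directly with the surrounding context words (a simplified-Lesk-style choice), not our exact implementation.

from nltk.corpus import wordnet as wn

def gloss_bag(sense):
    # Bag of words built from a sense's definition and example usages.
    words = sense.definition().split()
    for example in sense.examples():
        words.extend(example.split())
    return set(w.lower().strip(".,") for w in words)

def lesk_pun_score(word, context_tokens):
    # Overlap of each sense's gloss bag with the rest of the context.
    context = set(t.lower() for t in context_tokens if t.lower() != word.lower())
    overlaps = sorted((len(gloss_bag(s) & context) for s in wn.synsets(word)),
                      reverse=True)
    if len(overlaps) < 3 or overlaps[1] == 0:
        return 0.0
    top, second, third = overlaps[:3]
    # High when the top two senses score similarly and the third drops off.
    return 1 - (top - second) / top + (second - third) / second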
4.2 Wordnet Similarity Scores
Similarity scores expanded on the logic that
homographic pun words have multiple – two,
specifically – definitions that fit within the
context. We decided to take advantage of Wordnet, a lexical database of the English language
created by researchers at Princeton University,
to help us perform this task. The following is
an example of how we used the library:
>>> from nltk.corpus import wordnet as Wordnet
>>> Wordnet.synsets('dog')
[Synset('dog.n.01'), Synset('frump.n.01'),
 Synset('dog.n.03'), Synset('cad.n.01'),
 Synset('frank.n.02'), Synset('pawl.n.01'),
 Synset('andiron.n.01'), Synset('chase.v.01')]
>>> dog = Wordnet.synset('dog.n.01')
>>> cat = Wordnet.synset('cat.n.01')
>>> dog.path_similarity(cat)
0.2...
In the example, we see how we use Wordnet to find all of the senses of a word, and then
compare two senses of two different words to
find how 'similar' they are. The similarity score ranges from 0 to 1, where 1 is a perfect similarity, and it is based on the shortest path that connects the senses. Here, 'dog.n.01' and 'cat.n.01'
have a relatively high similarity score of 0.2.
In our algorithm, we performed this comparison between every sense of every word and every sense of every other word. At first, our algorithm gave a weight – a probability that the
word is a pun – to a word based on whether or
not that word had only two senses that scored
highly. For example, a word with only two high senses has a high score; a word with only one high sense has a low score; and a word whose senses all score similarly has a low score. However, this weighting scheme had very low accuracy because of sense overlap: many senses returned were 'close' in human terms, so multiple senses of a word would score highly. We found that
the metric performed better when we weighted
each word as the sum of the similarity scores for
all of its senses.
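A rough sketch of this final weighting, again assuming NLTK's WordNet interface, is shown below; part-of-speech handling and the exact normalization are simplified and not necessarily what our code does.

from nltk.corpus import wordnet as wn

def wordnet_weight(word, other_words):
    # Sum of path similarities between every sense of `word`
    # and every sense of every other word in the context.
    total = 0.0
    for sense in wn.synsets(word):
        for other in other_words:
            for other_sense in wn.synsets(other):
                sim = sense.path_similarity(other_sense)
                if sim is not None:  # None when no path connects the senses
                    total += sim
    return total

def wordnet_scores(tokens):
    # Per-word weights, normalized to sum to one over the context.
    weights = [wordnet_weight(w, [o for o in tokens if o != w]) for w in tokens]
    norm = sum(weights) or 1.0
    return [w / norm for w in weights]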
4.3 N-Gram Synonyms
This metric sought to capture a word's ambiguity by seeing how well the synonyms of the various senses of the word fit into the context, using bigram probabilities. For every word in the sentence, we listed out all of its senses using Wordnet. For each of those senses, we generated a list of synonyms, using the Wordnet function lemma_names, that would
capture the meaning of that sense. For example, in the sentence Me and geography have a
rocky relationship, the word rocky has multiple
senses. For one sense, the list of synonyms was
bouldery, bouldered, stony and for a different
sense, the list of synonyms was rough. We took
each of those synonyms and formed an N-gram
with every other word in the sentence, averaged
the probabilities returned for the N-grams by
the Microsoft Web Language Model API, and
gave each sense a sense score.
To find which word was most likely to be
the pun, we wanted to select the word which
had the following properties: (1) has multiple
senses, (2) the top two senses each have a high
score, (3) the difference in the score between
the top two senses is small. We felt that this
would be a metric of the ambiguity of a word's senses in the context of the sentence, which is
an indicator that the word is a pun. To capture
these properties, we used the same metric as in
the Bag of Words Similarity method:
1 - (topScore-secondScore)/(topScore) +
(secondScore - thirdScore)/(secondScore)
Again, the scores are normalized across
the context, and are interpreted as probabilities
for being the pun word.
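The sketch below outlines this sense scoring. The bigram_prob function is only a placeholder for the Microsoft Web Language Model API call (any bigram language model could be substituted), and the surrounding logic is our reading of the description above rather than the exact implementation.

from nltk.corpus import wordnet as wn

def bigram_prob(w1, w2):
    # Placeholder: probability of the bigram (w1, w2) from some language model.
    raise NotImplementedError("plug in a bigram language model here")

def sense_scores(word, other_tokens):
    # For each sense, average the bigram probabilities of its synonyms
    # paired with every other word in the sentence.
    scores = []
    for sense in wn.synsets(word):
        probs = [bigram_prob(syn.replace("_", " "), other)
                 for syn in sense.lemma_names()
                 for other in other_tokens]
        if probs:
            scores.append(sum(probs) / len(probs))
    return sorted(scores, reverse=True)

def ngram_synonym_score(word, other_tokens):
    s = sense_scores(word, other_tokens)
    if len(s) < 3 or s[1] == 0:
        return 0.0
    top, second, third = s[:3]
    return 1 - (top - second) / top + (second - third) / second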
4.4 N-Gram Sets
This metric is built on the assumption that the
keywords in a homographic pun are related to
each other in some manner, so if we take every
permutation of words in a sentence we can then
score the bi-grams with some metric (e.g. PMI,
or log-likelihood). The assumption would then
be that if a word is highly related to other words
in the sentence, that word and its related words
would be more likely to be the pun-words. For
example:
• I don’t like mechanical keyboards, they’re
not my type.
We then enumerate all permutations of
words. Among them we have:
• Mechanical keyboards
• Keyboards type
Each of these bigrams has a high association score, and they share the word keyboards, so we
can assume that we have:
• Output set: Mechanical, keyboards, type
We can then take these three words (which have high association scores) and give them additional weight when determining candidates for the pun-word. One of the pitfalls of this method is that bigrams of common tokens such as stop-words might have high scores if they appear frequently together. To prevent this, stop-words could be removed. Additional exploration could also be done using different score types or additional datasets.
4.5 Word Placement
Most pun words occur at the end of the sentence. This proportion increases when we process
the sentence and remove stopwords. We created a metric that weighs words near the end of
the sentence more heavily. Each word is given
a weight of two to the power of the index of
that word, and this metric returns the normalized vector of these weights.
5 Results
Here is an example result of our algorithm on
the pun: My bakery was burned down last night,
and now my business is toast, with the pun word
being toast.
[Figure 1: Scoring methods]

As you can see, the results of each scoring method differ quite a lot. However, nearly all of them give toast a fairly high score, causing it to have the highest total score. This is the goal behind our usage of the individual metrics: while each metric may incorrectly predict the pun word, together they improve the odds of predicting the correct word.

Below is a chart of the accuracy of the 5 individual metrics along with the total accuracy on the entire set of 99 puns:

Table 1: Accuracy for algorithms

Metric             Accuracy
Location           .7171
Bag of Words       .1717
Wordnet            .2323
N-Gram Synonyms    .2323
N-Gram Sets        .1818
Total              .5757
We found that individually, each of the 4
sense scores had an accuracy of about 20%.
The baseline score of predicting the last token
had an accuracy of 71.17%. The 5 scores together received an accuracy of 57.57%. Although the combination of scores had a lower
accuracy than the baseline, it's important to note the diversity of our individual scores. As mentioned before, each sense score correctly classified the pun word in about 20 contexts, for a total of 80. Of those 80 contexts, we found that 55
of them were unique, meaning the set of puns
that each metric correctly classified was quite
different.
Our approach is also fairly scalable and efficient. Each context can be classified independently and doesn't require any initial training. The bottleneck of our program is making the Microsoft N-Gram API calls, and we optimize that by batching our calls over the entire set.
6 Conclusion
We limited our scope to homographic puns and
investigated which features to look for in these
kinds of puns. Our different algorithms all had
some success in predicting the pun location,
which suggests that our intuition for using word
sense ambiguity as a feature for a pun is correct.
An interesting point is that each algorithm generally predicted the pun location correctly for a different set of puns. It could be
the case that there are certain kinds of puns for
which a certain algorithm performs better than
others. If this is the case, then we would like
to investigate features that suggest which algorithm should be applied.
We plan to use neural networks to see if we can dynamically weight our metrics based on the pun and thereby increase the accuracy of our predictions. We intend to pursue this line of inquiry and submit those results, along with the contents of the rest of our paper, to the SemEval 2017 challenge for Pun Detection and Location.
7 Individual Contributions
All reports and presentations were done together. Below are our specific code contributions.

Jennings Jin: NGramSets, Crawler

Justin Kim: NGramSynonyms and final formatting

Ian Lin: Bag of words scoring and main driver program

Grace Wu: Location scoring, wordNet similarity, and Crawler.
Acknowledgments
We would like to thank Rada Mihalcea and
Shibamouli Lahiri for their help in the class as
well as on the project. We would also like to
thank Tristan Miller for providing guidance for
the SemEval challenge.
References

Kao, Justine, Roger Levy, and Noah D. Goodman. The Funny Thing About Incongruity: A Computational Model of Humor in Puns. Proceedings of the 35th Annual Conference of the Cognitive Science Society, 728-733.

Kao, Justine and Jireh Tan. Play on Words: Predicting Punniness with Statistics and Semantics. Stanford University, CA.

Miller, Tristan and Iryna Gurevych. Automatic Disambiguation of English Puns. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2015), Association for Computational Linguistics, Stroudsburg, PA, 2015.

Miller, Tristan and Mladen Turkovic. Towards the Automatic Detection and Identification of English Puns. The European Journal of Humor Research, 4(1):59-75.

Reyes, Antonio, Paolo Rosso, and Davide Buscaldi. From Humor Recognition to Irony Detection: The Figurative Language of Social Media. Data & Knowledge Engineering, 74:1-12, 2012.

Yang, Diyi et al. Humor Recognition and Humor Anchor Extraction. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015.