Inside Jokes: Identifying Humorous Cartoon Captions

Mbonteh Ekuh Jude
htw saar – Hochschule für Technik und Wirtschaft des Saarlandes
Seminar “Internet in den Medien”
Sommersemester 2016
Abstract—A funny joke does not just make us laugh; it also benefits our brain in multiple ways. Motivated by the idea of creating a computational model of humor that can detect funny cartoon captions, a group of researchers at Microsoft teamed up with a cartoonist at The New Yorker magazine to understand the language used in cartoon captions and the features that make these captions be deemed funny. The study makes use of a large collection of crowd-sourced caption data submitted to the New Yorker cartoon caption contest.
We first look at how judgments of the humorousness of different captions were acquired. For each cartoon, the captions classified as funnier were paired with less funny ones; the caption pairs were then examined and the major differences between the funny and the less funny captions identified. Based on this information, a classifier was built that can identify the funnier caption automatically. Given captions based on the same joke, the classifier picks the funnier caption 69% of the time, and 64% of the time for any pair of captions. Finally, the classifier was used to help the cartoon contest judge find the best captions, significantly reducing the judge's workload.
I. INTRODUCTION
Humor plays a very important role in multiple contexts: it eases interaction in social gatherings, increases attention span, and much more, making it a vital tool in education; it can also be of huge importance in advertisement when used correctly. But humor is mainly a human trait, and this makes it vary across different cultural and religious backgrounds.
A lot has been done in fields like psychology and linguistics to better understand how humor works and how this knowledge could be used to create actual humor models. As of yet, very little work has been done in computer science on developing humor models, making it a new and exciting challenge for computer scientists.
In this report we identify and examine features of pairs of humorous cartoon captions. The experiment is based on a large crowd-sourced data set from the New Yorker magazine's weekly contest. The New Yorker holds a weekly cartoon caption contest in which readers are given a cartoon and asked to submit caption suggestions for it. The submitted captions are then analyzed: the judge selects a shortlist of the funniest captions, and members of the editorial staff narrow it down to three finalists. All three finalist captions are then published in a later issue, and readers vote for their favorites.
Figure 1. Sample prototype diagram of the classifier; the caption inputs may hinge on the same joke or on different jokes.
Study Main Contributions. The contributions are the results of a study conducted by a group of researchers, and this paper is mainly based on those results. The goal of the study was the construction of a computational model for predicting the relative humorousness of cartoon captions. A joke classifier (Figure 1) was built to predict which of two captions is funnier without deep image analysis or text understanding, by leveraging human tagging of scenes and automated analysis of linguistic features.
The research [1] was conducted by:
Robert Mankoff, cartoonist at The New Yorker
Dafna Shahaf, researcher at Microsoft
Eric Horvitz, researcher at Microsoft
Michael Cavna, a writer at The Washington Post, later covered the study in the media [10].
For the experiment, a data set of crowd-sourced captions from the New Yorker competition was used. With the help of human judgments, different variations of the same jokes were identified and analyzed to find factors affecting the level of perceived humor. The joke phrase within a caption was also identified automatically, and the intensity of humor of different jokes and their relation to the cartoon quantified. A classifier was constructed which, given any two captions for a particular cartoon, determines which caption is funnier. The classifier achieves 69% accuracy for captions hinging on the same joke, and 64% accuracy comparing any two
captions. A Swiss-system tournament was then implemented to rank all captions. On average, all of the judge's top-10 captions were ranked within the top 55.8%, showing that these methods can significantly reduce the workload faced by judges in selecting the best captions during a cartoon caption contest.
II. ANALYZING AND COMPARING CAPTIONS

For the purpose of the study, 16 New Yorker cartoons along with their submitted captions were analyzed. Similar captions were grouped using the clustering algorithm of King et al. [2]. The three largest clusters for each cartoon were picked, and 10 entries were selected from each cluster: five captions of 5-6 words and five of 9-10 words, a range chosen based on the median word length of the submitted captions. Figure 2 shows an example cartoon and nine of the captions submitted for it [3]. In the cartoon, a car salesperson attempts to sell a strange hybrid creature that appears to be part car, part animal. The judge's shortlist of captions appears under the cartoon.

Figure 2. Example cartoon from the New Yorker contest, with the shortlist of submitted captions.

A. Acquiring Judgments

To find out what makes some captions funnier, humor judgments were obtained from crowd workers recruited via Mechanical Turk (an online outsourcing platform). The workers were shown batches of five randomly drawn captions and asked to rank them by funniness. Each answer provided 10 pairwise comparisons (#1 is funnier than #2, #3, #4 and #5; #2 is funnier than #3, #4 and #5; and so on). Pairs of captions that achieved high agreement among the rankers were selected (80% agreement or more, similar length, ranked by at least five people). Using 30 batches, each
one ranked by 30-35 workers, gave a total of 1,016 tasks and 10,160 pairwise rankings. Only 35% of the unique pairs that were ranked by at least five people achieved 80% agreement, resulting in 754 pairs and indicating the difficulty of the task.
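Since each answer is a full ranking of one batch, expanding it into pairwise judgments is mechanical. The following minimal Python sketch (with hypothetical placeholder captions) illustrates the expansion:

    from itertools import combinations

    def pairwise_comparisons(ranked_captions):
        # The list is ordered from funniest to least funny, so every pair
        # (a, b) produced here means "a was judged funnier than b".
        return list(combinations(ranked_captions, 2))

    batch = ["caption 1", "caption 2", "caption 3", "caption 4", "caption 5"]
    assert len(pairwise_comparisons(batch)) == 10  # C(5, 2) = 10 per answer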
B. Hypotheses: Humor-Indicating Features

To turn the task of finding the funny captions into a machine learning problem, features that could help determine whether a caption is funny were selected. Each feature was then evaluated by the percentage of caption pairs in which it takes a higher value in the funnier caption. The findings are presented in Table I for same-joke pairs and in Table II for different-joke pairs.
Feature                      %Funnier is higher
Perplexity (1-gram)          0.45**
Perplexity (2-gram)          0.49**
Perplexity (3-gram)          0.46*
Perplexity (4-gram)          0.46*
POS Perplexity (1-gram)      0.45*
POS Perplexity (2-gram)      0.52
POS Perplexity (3-gram)      0.5
POS Perplexity (4-gram)      0.51
Sentiment                    0.61*
Readability                  0.57**, 0.56*
Proper Nouns                 0.48
Indefinite articles          0.43
3rd Person                   0.58
Location (quarters)          0.53, 0.42*, 0.49, 0.55*

Table I
SAME JOKE: PERCENTAGE OF CAPTION PAIRS IN WHICH A FEATURE HAS HIGHER NUMERICAL VALUE IN THE FUNNIER CAPTION. (ASTERISKS INDICATE LEVELS OF STATISTICAL SIGNIFICANCE.)
1) Unusual Language: Based on the hypothesis that funny captions may tend to use unusual language, a language model was used to test this supposition. The model takes an input string (a caption) and outputs the probability of that caption; the lower the probability, the higher the perplexity and the more unusual the language. The model was trained on 2 GB of ClueWeb data [4], meant to represent common language. The distinctiveness of the language was computed with n-gram models of various orders, perplexity being lowest under the 1-gram model, highest under the 4-gram model, and in between for the 2-gram and 3-gram models. Interestingly, the higher the perplexity, the more a caption stands out; but surprisingly, funnier cartoon captions tend to use less distinctive (lower-perplexity) language. This reinforces the idea of keeping the caption simple and readable, because with high perplexity the reader may even miss out on the joke completely,
say, by not understanding the principal joke phrase due to the unusual language.
Similar to the perplexity feature, the part-of-speech perplexity (POS perplexity) feature was obtained by replacing the words in the caption with their corresponding parts of speech, keeping the caption's syntactic structure while abstracting away its meaning. Again, funny captions tended to have less distinctive grammar across the n-gram models.
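The exact language-model configuration is not reproduced here, but the following sketch shows the idea: score a caption against an add-one-smoothed n-gram model trained on a background corpus (a toy stand-in for the ClueWeb sample).

    import math
    from collections import Counter

    def ngram_perplexity(caption, corpus, n=2):
        # Count n-grams and their (n-1)-gram contexts in the background corpus.
        ngrams = Counter(zip(*[corpus[i:] for i in range(n)]))
        contexts = Counter(zip(*[corpus[i:] for i in range(n - 1)]))
        vocab_size = len(set(corpus))
        log_prob, count = 0.0, 0
        for i in range(len(caption) - n + 1):
            gram = tuple(caption[i:i + n])
            numerator = ngrams[gram] + 1  # add-one smoothing
            denominator = (contexts[gram[:-1]] if n > 1 else len(corpus)) + vocab_size
            log_prob += math.log(numerator / denominator)
            count += 1
        # Higher perplexity means the caption looks less like common language.
        return math.exp(-log_prob / max(count, 1))

    corpus = "the cat sat on the mat and the dog sat on the rug".split()
    print(ngram_perplexity("the cat sat".split(), corpus, n=2))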
2) Taking Expert Advice: In an article, Patrick House, one of the New Yorker contest winners, emphasizes the use of simple language [5]. He advises contestants to use "common, simple, monosyllabic words" and to "steer clear of proper nouns that could potentially alienate." Based on this advice, a feature was built that scores the readability of a caption with a readability index function. The results of examining this feature show that readability scored a significantly higher percentage on the funnier scale, especially for captions hinging on the same joke, whereas the joke location feature, examined more closely below, was more relevant for captions hinging on different jokes.
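The particular readability index is not specified here; as an illustration, the Flesch reading-ease score from the third-party textstat package could serve as the index function:

    import textstat  # pip install textstat

    def readability(caption: str) -> float:
        # Flesch reading ease: higher scores mean simpler, more readable text.
        return textstat.flesch_reading_ease(caption)

    # Hypothetical example caption:
    print(readability("Just how big a tip are we talking about?"))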
3) Joke Location: To measure the impact of the location of the main joke phrase, each caption was partitioned into four quarters, quarter 1 being the beginning of the caption. Interestingly (Table I), less funny captions mostly had their joke phrase in the second quarter, close to the beginning of the sentence, whereas placing the joke phrase at the end was the most effective strategy.
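A minimal sketch of the quarter computation, assuming the joke phrase has already been located (its start index is a hypothetical input):

    def joke_quarter(caption_tokens, joke_start):
        # Map the joke phrase's start position to a quarter of the caption:
        # 1 = beginning, 4 = end.
        relative = joke_start / max(len(caption_tokens) - 1, 1)
        return min(int(relative * 4) + 1, 4)

    tokens = "Do not worry, he only growls when the price drops".split()
    print(joke_quarter(tokens, 8))  # joke near the end -> quarter 4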
4) Sentiment: A caption is considered to have negative sentiment when it contains negative words (bad, terrible, mistake, not). To find out how useful negative words are in a caption, results from an earlier study on humor by Mihalcea and Pulman [6] were used as a starting point. That study found humor to be mostly associated with negative orientation (negative words). Surprisingly, in this experiment the crowd workers rated positive cartoon captions as funnier than negative ones, contradicting the earlier research. One possible reason for the contrast is the preferences of the crowd workers: their responses were mainly based on their personal taste in humor, which varies from person to person with factors like religious or cultural background. Another possible reason is the inability of the sentiment analyzer to detect the negative words in a caption.
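A toy version of the negative-word check (the study used a proper sentiment analyzer; the word list below is an illustrative stand-in):

    # Illustrative stand-in for a real sentiment lexicon.
    NEGATIVE_WORDS = {"bad", "terrible", "mistake", "not"}

    def has_negative_sentiment(caption: str) -> bool:
        words = (w.strip(".,!?'\"").lower() for w in caption.split())
        return any(w in NEGATIVE_WORDS for w in words)

    print(has_negative_sentiment("This was a terrible mistake."))  # True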
Figure 3. Cartoon from the New Yorker contest and the tags associated
with it. One list describes the general setting, or context, of the cartoon.
The other describes its anomalies.
III. SAME JOKES VS DIFFERENT JOKES

There were some important differences between analyzing and comparing captions based on the same joke and captions based on different jokes. Same-joke captions revolve around the same humor context; in Figure 2, for example, all nine captions revolve around the idea of the car being a hybrid. Captions based on different jokes, on the other hand, can vary completely in context, making it even more difficult
to examine different jokes: finding the relationship between the cartoon and its caption, let alone narrowing it down to the main joke phrase within the caption, becomes much more challenging.
To find the joke phrase, the words with the lowest likelihood (highest distinctiveness) with respect to the 4-gram model mentioned earlier were selected. For words belonging to a phrase, the entire phrase was selected. The method proved very accurate at finding the main joke in a caption; see Figure 4 for an example.
Figure 4. Identifying joke phrases: phrases in red have low likelihood under the language model. Phrases in purple do not appear in the vocabulary, and are often useful for finding the joke (as well as typos).
The red phrases are the ones identified through the language model, and the purple ones are out-of-vocabulary words, another useful signal for identifying the joke. Typos are often picked up as out-of-vocabulary words; if they happen to spell a real word, they are sometimes picked up as low-likelihood words (see, for example, the "anti-theft devise", including the misspelling of "device"). Also note that the algorithm misses several expressions, e.g., "tiger in the tank".
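A sketch of this heuristic, under the assumption that we have a vocabulary and a per-word likelihood function for the background model (the study used a 4-gram model; both arguments here are hypothetical inputs):

    def find_joke_words(caption_tokens, vocab, likelihood, k=2):
        # Out-of-vocabulary words are flagged directly; they often mark the
        # joke (or a typo).
        oov = [w for w in caption_tokens if w.lower() not in vocab]
        # Among known words, the k least likely (most distinctive) words are
        # taken as joke-phrase candidates.
        known = [w for w in caption_tokens if w.lower() in vocab]
        candidates = sorted(known, key=likelihood)[:k]
        return candidates, oov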
IV. SUMMARY OF RESULTS

Tables I and II show the results obtained from examining the features for same-joke and different-joke caption pairs, respectively. The statistics in the tables clearly show which features scored better percentages on the funny-caption scale. From a lexical point of view, funny captions used simpler language, which made them more readable; this feature had almost the same magnitude for same and different jokes.
Feature                      %Funnier is higher
Perplexity (1-gram)          0.48
Perplexity (2-gram)          0.51*
Perplexity (3-gram)          0.52*
Perplexity (4-gram)          0.4***
POS Perplexity (1-gram)      0.45*
POS Perplexity (2-gram)      0.5
POS Perplexity (3-gram)      0.5
POS Perplexity (4-gram)      0.51
Indefinite articles          0.43
3rd Person                   0.45**
Min Perplexity               0.46*****
Length of Joke Phrase        0.42*****
Location (quarters)          0.46, 0.41*****, 0.46*, 0.43****
Proper Nouns                 0.35*****
Sentiment                    0.49
Readability                  0.52, 0.51
Similarities (context, anomaly,
  maxdiff, avgdiff)          0.48*, 0.48, 0.47**, 0.47**

Table II
DIFFERENT JOKES: PERCENTAGE OF CAPTION PAIRS IN WHICH A FEATURE HAS HIGHER NUMERICAL VALUE IN THE FUNNIER CAPTION. (ASTERISKS INDICATE LEVELS OF STATISTICAL SIGNIFICANCE.)
Readability and sentiment both played a role in both scenarios, but they were no longer of great significance for different jokes, as they were for same jokes. Instead, proper nouns and third-person words became very significant for different jokes, with funnier captions using significantly less of both. Same-joke captions use similar proper nouns most of the time; the situation is different when considering multiple jokes, since many captions rely entirely on proper nouns for the joke to be understood. For example, two of the sample captions for the cartoon in Figure 3 read:

"Oh my God, Florenz Ziegfeld has gone to heaven!"
"I'm afraid Mr. Burns is gone for the afterlife."

A possible explanation for the poor results of such captions is that names which are not universally recognizable may leave readers confused, whereas replacing the proper nouns with pronouns avoids this issue. Improved versions of these captions could read:

"Oh my God, he has gone to heaven!"
"I'm afraid he is gone for the afterlife."

Simply replacing the names with personal pronouns could significantly increase the humor value of the captions.
V. USE CASE

Finally, the classifier that compares pairs of cartoon captions was put to use. The main goal was to use this computational model to reduce the workload of judges during a cartoon caption contest, saving them the time of manually going through thousands of captions. The classifier was therefore tested to see whether it could explore and filter caption submissions down to a smaller set, helping the judges find the best captions in less time.
To conduct this test, the classifier was used in a tournament algorithm, pairing captions against each other to find the best ones. Since the goal was to find a set of the best captions, a non-elimination scheme, the Swiss-system tournament, was employed. In this system, competitors are paired randomly in the first round; in later rounds they are paired based on their performance so far: competitors are grouped by total points, and edges are added between players who have not been matched yet, ensuring that no two competitors face each other more than once. The Blossom algorithm was then used to compute a maximal matching. For each cartoon, the classifier was trained on the data of all other cartoons, to simulate a new contest. Whenever two captions were paired, the classifier's prediction determined the outcome; since many pairs of captions are incomparable, ties were allowed. A victory was worth 3 points, and a tie was worth 1 point.
Figure 5 shows some of the top-ranking (left) and bottom-ranking (right) captions for one of the contest cartoons. Many of the captions ranked at the very bottom were very long (15-42 words each); to make the comparison between captions fair, only captions of length 5-10 words were used.
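A sketch of one Swiss-system round under these assumptions: networkx's blossom-based matching stands in for the Blossom step, predict_winner stands in for the trained classifier, and pairing by smallest point difference approximates the grouping by total points described above.

    import itertools
    import networkx as nx  # pip install networkx

    def swiss_round(captions, points, history, predict_winner):
        # points: dict index -> score; history: set of frozensets of pairs
        # already played; predict_winner(i, j) -> winning index or None (tie).
        G = nx.Graph()
        G.add_nodes_from(range(len(captions)))
        for a, b in itertools.combinations(range(len(captions)), 2):
            if frozenset((a, b)) in history:
                continue  # never pair the same two captions twice
            # Prefer pairings between captions with similar point totals.
            G.add_edge(a, b, weight=-abs(points[a] - points[b]))
        # Blossom-based maximum matching pairs up as many captions as possible.
        for a, b in nx.max_weight_matching(G, maxcardinality=True):
            history.add(frozenset((a, b)))
            winner = predict_winner(a, b)
            if winner is None:       # incomparable pair: score as a tie
                points[a] += 1
                points[b] += 1
            else:
                points[winner] += 3  # a win is worth 3 points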
Figure 5. Tournament results: high-ranking (left) and low-ranking (right) captions; only captions of length 5-10 words are used.
VI. RELATED WORKS
Unlike in fields like linguistics and psychology, the study of humor in computer science is still in its infancy, and only a few prototypes of computational humor exist. In this section we briefly review research contributions from these fields.
Psychology: Several studies in psychology have focused specifically on cartoons. For example, creativity was found to be linked to the ability to generate cartoon captions [7]. Jones et al. demonstrated the importance of the relationship between picture and caption: subjects viewed picture and caption either simultaneously or sequentially (the picture appearing five seconds before the caption, or vice versa), and more vigorous changes in heart rate were detected when the elements were shown in sequence.
Linguistics: There has also been plenty of research on theories of humor and the linguistics of humor [9], with most studies attempting to develop symbolic and linguistic accounts of what makes a text humorous.
Computer Science: The JAPE punning-riddle generator [8] is one of the first attempts at computational humor; it generates puns automatically, and related systems achieved humor by mixing terms from different domains, exploiting incongruity theory. On the humor-understanding side, Mihalcea and Pulman [6] characterized linguistic features of humorous texts.
VII. ARTICLE PRESENTATION IN MEDIA VS PAPER

Michael Cavna wrote an article based on the results of the research paper "Inside Jokes: Identifying Humorous Cartoon Captions". The article was published by The Washington Post under the title "Can robots now write quality jokes? How the notion is becoming less laughable" [10]. In this section we look at the manner in which the work is presented in the research paper in contrast to the presentation style in the media. For both we consider who the target audience is, the language used, and how the technical details are elaborated.
1) Audience: It is important to point out that the main audience of the paper are topic experts, or individuals with a well-grounded background on the subject matter, whereas the main audience of the media article is the general public; this is clearly noticeable in the style of writing. The research paper goes into the details of the methods and techniques used in the course of the experiment and presents the results in a scientific manner. The media article, on the other hand, simply tries to give the general public an overview of what the study was about, its results, and the impact those results could have.

2) Language and Technical Details: With the different audiences in mind, the writing style clearly varies. Compare, for example:

Paper: "We developed a classifier that could pick the funnier of two captions 64% of the time and used it to find the best captions, significantly reducing the load on the cartoon contest's judges."

Media: "The tale's takeaway is intended to be that computers, no matter how knowledgeable, will never be able to fathom the linguistic kinks and quirks of the human mind."

Clearly the paper relies on facts and statistics, and algorithmic and computer science terms are used and explained in detail, while all these scientific details are omitted from the media article, making it easy for the general public to get the big picture of what the research was about.
VIII. CONCLUSION AND FUTURE WORK

The challenge of learning the degree of humor perceived for combinations of captions and cartoons was examined. An important set of features was extracted from the linguistic properties of captions and from the interplay between captions and pictures. With a large amount of crowd-sourced
cartoon caption data from the New Yorker, a classifier was implemented that could pick the funnier of two captions 64% of the time; this classifier was then used to help cartoon contest judges find the best captions, significantly reducing their workload. The framework developed can serve as a basis for further research in computational humor, which is still in its infancy. Future directions include a more detailed analysis of the context and anomalies represented in the cartoons, as well as their influence on setting up tension, questions, or confusion. Another interesting domain is humor generation for captions and cartoons. In principle, the classifier could be used to improve an existing caption, making it possible to identify weaknesses of the caption and suggest improvements (for example, replacing a word with a simpler synonym). Furthermore, it would be important to explore the generative process employed by people in coming up with captions, for example by trying to understand the influence of the visual salience of the core context and anomalies of a scene on the focus and linguistic structure of a joke. Finally, humor being mainly a human trait, it varies across different cultural and religious backgrounds; it would be an interesting challenge to personalize the classifier with respect to personal taste in humor. Automated generation and recognition of humor could be useful for modulating attention, engagement, and retention of concepts, and thus has numerous interesting applications, including use in education, health, and advertising.
Acknowledgments: The content of this work is mainly based on the research of the authors mentioned above.
REFERENCES

[1] D. Shahaf, E. Horvitz and R. Mankoff, "Inside Jokes: Identifying Humorous Cartoon Captions", in KDD, 2015.
[2] B. King, R. Jha, D. R. Radev and R. Mankoff, "Random walk factoid annotation for collective discourse", in ACL, pages 249-254, 2013.
[3] "Tesla stock moves on April Fools' joke", http://blogs.wsj.com/moneybeat/2015/04/01/tesla-stock-moves-on-april-fools-joke, 2015.
[4] E. Gabrilovich, M. Ringgaard and A. Subramanya, "FACC1: Freebase annotation of ClueWeb corpora", 2013.
[5] P. House, "How to win the New Yorker cartoon caption contest", 2008.
[6] R. Mihalcea and S. Pulman, "Characterizing humour: An exploration of features in humorous texts", in Computational Linguistics and Intelligent Text Processing, pages 337-347, Springer, 2007.
[7] D. M. Brodzinsky and J. Rubien, "Humor production as a function of sex of subject, creativity, and cartoon content", Journal of Consulting and Clinical Psychology, 44(4):597, 1976.
[8] K. Binsted and G. Ritchie, "An implemented model of punning riddles", in AAAI'94, 1994.
[9] S. Attardo, "Linguistic Theories of Humor", Approaches to Semiotics, Mouton de Gruyter, 1994.
[10] M. Cavna, "Can robots now write quality jokes? How the notion is becoming less laughable", The Washington Post, September 2, 2015. https://www.washingtonpost.com/news/comic-riffs/wp/2015/09/02/can-robots-now-write-quality-jokes-how-the-notion-is-becoming-less-laughable/