Who Said What? - Predicting which Friends Character Said What Line

Ariel Weingarten, University of California San Diego, [email protected]
Stephanie Chen, University of California San Diego, [email protected]
Danilo Gasques, University of California San Diego, [email protected]
Markus Dücker, Hasso Plattner Institute, [email protected]
1. THE DATASET
The dataset for this project was scraped from a Friends fan
site [1] that had fan-transcribed scripts for every episode.
We converted every line spoken in each episode into a JSON
object with the following schema:
{
  "character": "<character name>",
  "line": "<line of dialogue>",
  "episodeNum": <episode number>,
  "actions": ["<list of>", "<actions>"]
}
Figure 1: Schema for JSON data.
Parsing 1990s HTML is fraught with complexity, and our
dataset definitely contains some malformed entries, but lines
of dialogue for all of the main characters should be correct.
Although we started off with 70,380 lines of script, once we began delving into the details of each line, we found that a good deal of cleaning was needed.
Because the lines were transcribed by a vast host of volunteers (super-fans of the television show), we encountered inconsistent character-naming conventions. Some volunteers used nicknames and abbreviations for characters (e.g. MNCA for Monica, Rach for Rachel). We discovered these by examining the data and sorting lines by character, and then created a dictionary to map each nickname to the correct character name.
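A minimal sketch of this normalization step is shown below; the alias table is a small illustrative subset, not our full mapping.

# Map fan-transcribed aliases to canonical character names.
# Illustrative subset only; the full dictionary was built by hand from the data.
ALIASES = {
    "MNCA": "Monica",
    "RACH": "Rachel",
    "PHOE": "Phoebe",
    "CHAN": "Chandler",
}

def normalize_character(name):
    """Return the canonical character name for a transcribed speaker label."""
    cleaned = name.strip()
    return ALIASES.get(cleaned.upper(), cleaned.title())

# normalize_character("Rach") -> "Rachel", normalize_character("MNCA") -> "Monica"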
Additionally, some of the lines were actually not spoken by
characters in the show, but rather stored information about
the episode (e.g. name of the director or casting director),
which for our purposes would not do, so we removed them.
We also removed lines from characters whose names were longer than 20 characters, because most of these were names like "The man with the black hat" or "The woman in the waiting room". These were clearly episode extras, not one of our six titular characters, nor among the main supporting cast, so we removed their lines from the data set as well.
To ensure the lines contained only the words spoken by characters, we also had to remove action cues from the script. A character's action, such as [Reach across the table for the mug] or [Hugs Joey tightly], would not help us predict the words the character would speak, or which character would speak which words, so we removed all action cues from the script. Thankfully, the action cues were consistently enclosed in brackets, so they were easy to identify.
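A sketch of this step using a regular expression (assuming, as in our data, that action cues never nest):

import re

# Action cues are enclosed in square brackets, e.g. "[Hugs Joey tightly]".
ACTION_CUE = re.compile(r"\[[^\]]*\]")

def strip_actions(line):
    """Remove bracketed action cues and collapse any leftover whitespace."""
    without_cues = ACTION_CUE.sub(" ", line)
    return " ".join(without_cues.split())

# strip_actions("Sure! [Reach across the table for the mug] Here you go.")
# -> "Sure! Here you go."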
At the end of the data cleaning process, we had a set of
61,898 lines, which was a reduction by more than 12% from
the original set of lines.
2. DATA EXPLORATION
Before starting on any prediction task, we took a closer look at the data. All of the following insights are reported after performing the cleaning steps described above.
2.1 Line Breakdown by Character
Breaking down the lines by character gives a first overview of who speaks the most. The average number of lines per character is around 8,500. In Figure 2 we see that Ross and Rachel speak much more than the other characters, while Phoebe talks much less. Interestingly, this still leads to an almost balanced distribution when looking only at the characters' gender, with 25,757 male lines vs. 25,200 female lines. Besides the six main cast members, there are also 10,941 lines spoken by other characters.
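The per-character counts behind Figure 2 can be reproduced with a simple tally over the cleaned records; in this sketch, lines is assumed to be the list of parsed JSON objects from Figure 1.

from collections import Counter

MAIN_CAST = {"Monica", "Rachel", "Phoebe", "Ross", "Chandler", "Joey"}

def line_counts(lines):
    """Count lines per speaker; speakers outside the main cast are bucketed as 'Other'."""
    counts = Counter()
    for entry in lines:
        speaker = entry["character"]
        counts[speaker if speaker in MAIN_CAST else "Other"] += 1
    return counts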
Figure 2: Across all ten seasons of the show, the number of lines spoken by each of the six main characters.

2.2 Average Word Length by Character
Next we looked at the average word length over all lines spoken by each character. We show the effect of short lines by taking an additional measurement only on lines with more than 5 words.

Figure 3: Average word length per character

2.3 Average Line Length by Character
Regarding the average length of lines (determined by counting characters), all main characters are very similar.

Figure 4: Average number of words per line

3. PREDICTIVE TASK
Given this complete collection of all lines spoken by the six main characters in the TV show Friends (seasons 1 to 10), we want to train a model that can predict, for a previously unseen and unlabeled line, the character who spoke it. This is a supervised multi-class prediction problem: each instance in the training set is labeled with the character who spoke it, and we have a total of 6 classes (the main characters) to predict.
Each instance of our training data represents one spoken line, consisting of features that can be derived from the unstructured text (content-based and structure-based). Additionally, we experimented with context-based features that look at the lines spoken directly before and after the current one.
To give a realistic example of the prediction task, here are six randomly drawn lines (one per character):
• “jean-claude van damme. i didn’t know he was in this movie, he is so hot.” Monica
• “what?!” Rachel
• “no!! wait, wait, wait!! oh please, hold it up so i can listen.” Phoebe
• “hey you’re right. yeah, it’s kinda been like us again a little bit.” Joey
• “thank you for that! i was not flirting.” Chandler
• “okay, okay, awkward question. the hospital knows you took two, right?” Ross
This random example already hints at the difficulty of correctly predicting the character from a given short line. Without any context, even the most avid Friends fan won’t be able to correctly name the speaker of the “what?!” line; there is simply not enough information. We addressed this in our model by excluding all lines with fewer than 4 words from the dataset.
3.1 Baseline
To the best of our knowledge, no one else has built a similar predictor for the Friends TV show that we could use as a benchmark. Therefore we assume a “lazy” model as our baseline: a predictor that simply guesses a character at random. Given a dataset with a balanced class distribution, this random predictor is expected to have an accuracy of 1/6 (about 16.7%) for picking the right character. Another way to look at this would be a “lazy” model that predicts the same character for all lines, which would result in the same accuracy as the random model.

Figure 5: Expected Classification Accuracy for the Naive Classifier
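Both “lazy” baselines can be made explicit with scikit-learn's DummyClassifier; the sketch below is illustrative rather than the exact code we ran.

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

def baseline_accuracies(X_train, y_train, X_val, y_val):
    """Validation accuracy of random guessing and of always predicting
    the most frequent character."""
    results = {}
    for strategy in ("uniform", "most_frequent"):
        baseline = DummyClassifier(strategy=strategy, random_state=0)
        baseline.fit(X_train, y_train)
        results[strategy] = accuracy_score(y_val, baseline.predict(X_val))
    return results

# With six roughly balanced classes, both come out near 1/6 (about 16.7%).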
3.2 Model Evaluation
We are using four metrics to evaluate our models, as suggested by [2]:
• % correct classification
• Precision
• Recall
• F-score
Given that we are facing a multi-class problem, we averaged each of these metrics over the 6 classes when presenting results.
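The sketch below shows one way to compute these macro-averaged metrics with scikit-learn (illustrative; y_true and y_pred are the gold and predicted character labels).

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """Accuracy plus precision, recall and F-score macro-averaged over the 6 characters."""
    precision, recall, f_score, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro")
    return {"accuracy": accuracy_score(y_true, y_pred),
            "precision": precision,
            "recall": recall,
            "f_score": f_score}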
3.3 Features
We are using tf-idf (term frequency - inverse document frequency) values for each character's N most frequent unigrams. We used the raw frequency as the tf value:

tf(t, d) = f_{t,d}

where f_{t,d} is the number of times term t occurs in document d. We used the standard formula to compute idf:

idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )

where D is the set of documents.
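These two formulas translate directly into code; in the sketch below each document is a list of tokens, and a character's document is the concatenation of all of that character's lines (a sketch, not our optimized implementation).

import math
from collections import Counter

def tf(term, document_tokens):
    """Raw count of the term in one document."""
    return Counter(document_tokens)[term]

def idf(term, documents):
    """log(|D| / |{d in D : t in d}|); only defined if the term occurs somewhere."""
    containing = sum(1 for tokens in documents if term in tokens)
    return math.log(len(documents) / containing)

def tf_idf(term, document_tokens, documents):
    return tf(term, document_tokens) * idf(term, documents)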
4. THE MODEL
We employ the logistic regression model provided by the sklearn Python library. Our parameters for the model are:
• L2 regularization with strength 1.0
• the Newton conjugate gradient (newton-cg) optimizer
• no class weighting
• a one-vs-rest (one against all) classification approach, which treats each class as a separate binary classification task
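In scikit-learn terms this corresponds roughly to the configuration below (a sketch; multi_class="ovr" simply makes the one-vs-rest scheme explicit).

from sklearn.linear_model import LogisticRegression

# L2 penalty with inverse regularization strength C = 1.0, the Newton
# conjugate-gradient solver, no class weighting, one-vs-rest scheme.
model = LogisticRegression(
    penalty="l2",
    C=1.0,
    solver="newton-cg",
    class_weight=None,
    multi_class="ovr",
)
# model.fit(X_train, y_train); predictions = model.predict(X_val)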
4.1 Trying different features
After an initial model using unigram tf-idf values, we experimented with other features such as:
• phrase type (i.e. whether the sentence ends with a period (.), an exclamation mark (!), a question mark (?), etc.)
• occurrence of main cast names in the spoken text
• count of punctuation marks
• number of words
• average word length
• ratios comparing the current line to the known number-of-words and average-word-length metrics per character
• absolute counts per grammatical category (i.e. count of nouns, verbs, adjectives, etc.)
• name of the previous and next speaker
• season number and episode number of the line
We tried different combinations of unigram tf-idf values with these features and, unfortunately, they either reduced our accuracy or yielded negligible improvements. It turned out that the best way to improve our model was to add more tf-idf terms to it.

4.2 How we optimized the Model
Because our feature set is composed of unigram tf-idf values, we experimented with the number of unigrams to include and used performance on the validation set to pick the optimal value.
We lowercased the entire corpus of lines and removed punctuation when counting unigram frequency. We also experimented with stemming using the PyStemmer module and with ignoring stopwords using an English stopword dictionary. These latter measures increased the prediction performance only slightly.
In the end, this produced a feature vector of length 4,801: 800 features for each of the 6 characters' tf-idf on their top 800 unigrams, plus one bias term.
To compute the tf-idf values, we started off by writing our own method from scratch. However, this method was complicated and too slow for quickly evaluating different numbers of tf-idf terms. We then switched to the TfidfVectorizer in the sklearn module, a pre-written implementation that calculates the values we needed, which sped up our computation considerably.
When we used the top 50 unigrams, we reached 25.6% accuracy on the validation set. As we increased the unigram count, we saw the accuracy increase to a maximum of 31.6% when using the top 800 unigrams. Further increases in the unigram count actually led to a decrease in accuracy, so our best result from this model came from using the top 800 unigrams.

Figure 6: Percentage Correctly Predicted. Performance is quantitatively better than the Naive Classifier.

Figure 7: Precision, Recall and F-Score across varying numbers of unigrams
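A sketch of the per-character tf-idf feature construction described in this section, using TfidfVectorizer; it approximates the pipeline above (stemming and stop-word handling are omitted), and the function and variable names are illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

def top_unigrams(lines_by_character, n_terms=800):
    """Rank unigrams per character, treating the concatenation of all of a
    character's lines as a single document (six documents in total)."""
    characters = sorted(lines_by_character)
    documents = [" ".join(lines_by_character[c]) for c in characters]
    ranker = TfidfVectorizer(lowercase=True)
    scores = ranker.fit_transform(documents).toarray()
    vocab = np.array(ranker.get_feature_names_out())
    return {c: list(vocab[np.argsort(scores[i])[-n_terms:]])
            for i, c in enumerate(characters)}

def line_features(lines, per_character_vocab):
    """Represent each line by its tf-idf weights over the union of the six
    per-character vocabularies (our actual model appended a bias term)."""
    vocabulary = sorted({t for terms in per_character_vocab.values() for t in terms})
    return TfidfVectorizer(vocabulary=vocabulary).fit_transform(lines)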
4.3 Trying different classifiers
To make sure that our model's low performance was not due to the probabilistic assumptions of logistic regression, we also trained the model using support vector machines (SVMs) with both a Gaussian (RBF) kernel and a linear kernel. The results can be seen in Table 1.

Table 1: Comparison between Logistic Regression and Support Vector Machines for n=4801 (800 unigrams per character and bias)
Method                Accuracy   Precision   Recall    F1
Logistic Regression   31.91%     32%         31.91%    31.85%
SVM (rbf)             25.69%     24.09%      25.93%    24.17%
SVM (linear)          25.69%     24.09%      25.23%    23.82%

4.4 Challenges
Thankfully, we did not have problems with overfitting because we split our data at the start into 80% for training and validation and 20% for testing. Within the 80% for training and validation, 80% of that data was allocated to the training set and the remaining 20% to the validation set.

5. RELATED LITERATURE
Predicting which character said a given line of dialogue is essentially an authorship attribution task. We consulted a number of papers and took a look at the features they used, the training methods they employed, and the results they achieved.

5.1 The One With All The Quantifiable Friendships
We found one other party that had attempted some predictive work on Friends, in a web report called The One With All The Quantifiable Friendships. This author sought to quantify the closeness of any two of the six main characters by identifying how often they shared a plot line. The author did so by manually coding "dynamics", or interactions between characters, in every scene across all 236 episodes. Ultimately, she found that the closest three friend pairs (in descending order) were Chandler and Monica, Chandler and Joey, and Rachel and Ross.

5.2 Stylometric Analysis for Authorship Attribution on Twitter
Authorship attribution for tweets is a useful comparison to our task because both tweets and lines of dialogue in a script are short. Bhargava et al. [3] employed a combination of lexical and syntactic features for their task. We were not able to borrow many of the syntactic features because they would not, or would only rarely, occur in a screenplay for a television show:
• Number of words written with all capital letters
• Use of a quotation
• Number of special characters
The lexical features were fairly straightforward and did not provide any new ideas:
• Tweet length in characters
• Hashtags
Despite the restricted sample size, Bhargava et al. were able to achieve impressive results. For 10-15 authors and 200 samples per author, they achieved an F-score in the range of 85.59% to 90.56%. When they attempted to increase the number of authors to 20, the F-score dropped to 64.48%.
5.3 Authorship Attribution: What's Easy and What's Hard?
This work examined different data sets to identify which kinds make it easier to attribute authorship using character n-grams and Bayesian multi-class regression. The authors found that the "simplest kind of authorship attribution problem...is the one in which we are given a small, closed set of candidate authors and are asked to attribute an anonymous text to one of them." [4] This is the kind of predictive task we seek to do with our own data set, so we believed that because our set is also closed (to only six speakers/characters), we had a good shot at replicating this paper's success.
Overall, the authors attained an accuracy rate near or above
80% for each problem they tackled by constructing a feature
set of common words and n-grams, ”used in conjunction
with either Bayesian logistic regression or support vector
machines as a learning algorithm.”
5.4 Automatic source attribution of text: a neural networks approach
This paper used a word-frequency classification technique to demonstrate the use of neural networks for automatic source attribution in textual analysis [5]. The authors preferred identifying the sources of texts, rather than their authors, by bucketing authors by style and linguistic distinction. The motivation behind this was that one author can have more than one style of writing (e.g. William Shakespeare), which the model should detect.
To prepare the data, the authors removed symbols such as quotation marks, single quotes and hyphens, joined consecutive spaces, and converted all capital letters to lower-case. Similarly, we decided to remove punctuation and also converted everything to lower-case. The features used were all of the remaining words in the corpora. The authors then applied a 5-layer, 420-million-connection neural network, which ultimately achieved confidence rates ranging from 95 to 99% for 12 of the 19 corpora; these included texts such as the King James Version of the Bible and works by Noam Chomsky and William Shakespeare. Their lowest-performing corpus was the book of Mark from the New American Standard Bible translation, on which they attained only a 72% confidence. Overall, we were impressed that their network was able to correctly identify the source of the Bible corpora, even though the Bible contains writing by multiple authors.
6. RESULTS
From the web report entitled “The One With All The Quantifiable Friendships,” we took the author's findings into consideration as we began exploring our own data, and modified their approach. Whereas the author had predicted which friend pairs were closest by manually coding which characters appeared in a scene together, we decided to see which characters most frequently called another directly by name. Our rationale was that the closer a pair of characters were, the more frequently they would speak each other's name aloud, whether speaking directly to that character or speaking about him or her to another character.
Our top three pairs (in descending order) were Rachel and Ross, Joey and Ross, and Monica and Chandler. Two of our three pairs matched the findings of the other author, which was interesting because we used a different strategy to measure closeness between pairs of characters.
From the paper on the stylometric analysis of tweets, we were not able to borrow many of the syntactic features because they would not, or would only rarely, occur in a screenplay for a television show such as Friends. Nonetheless, we did consider the length of lines as one feature on which to base our model, which was something suggested by that paper.
Next, regarding the paper on authorship attribution: because the authors put forth that a closed data set would generate better results, we believed our data set matched the characteristics they described, as we had cleaned the script of 10 seasons' worth of episodes down to only the lines spoken by our six main characters.
Lastly, after considering the findings of the paper on automatic source attribution of text using neural networks, we believe that this could be a method to explore if we were to continue improving the accuracy of predictions on this data set. Although we did not implement a neural network for our Friends script analysis, we believe our data set could be a good candidate for this method, especially considering that the scripts were written by different authors (scriptwriters) over the 10-year course of the show's run, just as the paper used corpora written by different authors but attributed to one overall source (e.g. the Bible).
Given the considerations from related literature, and after analyzing the performance of our various unigram models (described in section 4), we identified the 800 Unigram Model as our best one. The graphs below show the performance of the 800 Unigram Model on the test set.

Figure 8: Percentage of the Test Set Correctly Predicted by our 800 Unigram Model

Figure 9: Performance of our 800 Unigram Model, measured with Precision, Recall and F-Score.

6.1 Comparison to Alternatives
Our model had a classification accuracy of 29.9432%, which is 1.8 times better than the random classifier that guesses one of the six available options (the six main characters).
However, this accuracy rate is a far cry from the classification accuracies achieved in the related work presented in the previous section, which were all about 80% or higher, particularly the one that utilized a five-layer neural network.

6.2 Which Feature Representations Worked Well? (and which didn't)
The best-performing features were the tf-idf values for the most common unigrams. These clearly outperformed all the other features that we tried. By adding more tf-idf terms as features we were able to increase the accuracy to almost 32%, at which point we hit a barrier that we could not overcome either by switching the predictor model or by adding and experimenting with other features. This approach only seems to be successful when a large number of terms is included, as it is not possible to capture the differences between the characters with just a small set of terms. This can be attributed to the nature of this sitcom, where the characters talk mainly about mundane things and the same topics keep recurring, which results in a high correlation among the most frequent unigrams.
We found one way to significantly improve the model beyond 32%: looking at the context of the given line and including the names of the previous and next speakers as an encoded feature. This increased accuracy to 42%, but it can also be seen as “cheating” because it effectively reduces the number of potential characters for a specific prediction. On the other hand, it is similar to how a human would try to identify the character behind a given line, by relating it back to the context and situation in which it was spoken.
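A sketch of how such a context feature can be encoded (one-hot vectors for the previous and next speaker, appended to the tf-idf features; illustrative only):

import numpy as np

MAIN_CAST = ["Chandler", "Joey", "Monica", "Phoebe", "Rachel", "Ross"]

def speaker_context(prev_speaker, next_speaker):
    """One-hot encode the previous and next speaker; twelve extra columns,
    all zeros when the neighbouring speaker is not one of the main six."""
    feature = np.zeros(2 * len(MAIN_CAST))
    if prev_speaker in MAIN_CAST:
        feature[MAIN_CAST.index(prev_speaker)] = 1.0
    if next_speaker in MAIN_CAST:
        feature[len(MAIN_CAST) + MAIN_CAST.index(next_speaker)] = 1.0
    return feature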
6.3 Interpretation of Model Parameters
Our model used a regularization parameter of 1.0, which means that we did not need to employ a particularly high level of regularization to achieve a generalizable predictor.

6.4 Why did the Proposed Model Fail?
A large contributor to the poor performance of our model was the difficulty of coming up with features that are good at differentiating between the characters. There is a huge overlap between the most common words per character. Due to the short nature of the given lines, they mostly consist of common words. It is possible to identify words that are mostly unique to one character, but such words also have a low term frequency and therefore will not help us classify the majority of that character's lines.
To illustrate the prevalence of common words among the characters, here are each character's top 10 terms ordered by their tf-idf (treating all of a character's lines as one document):
• monica: and, is, it, oh, that, the, this, to, what, you
• rachel: and, im, it, just, oh, that, the, to, what, you
• phoebe: and, it, no, oh, okay, so, that, the, to, you
• ross: and, im, is, it, no, that, the, to, what, you
• chandler: and, im, is, it, me, that, the, to, what, you
• joey: and, hey, it, me, that, the, to, what, yeah, you
Even when we removed stop-words (which did not improve our final score), the top words sorted according to their tf-idf are not helpful for distinguishing between characters. For instance, the top 10 terms for Phoebe and Ross are:
• phoebe: hey, just, know, like, oh, okay, really, right, yeah, yknow
• ross: hey, just, know, like, mean, oh, okay, right, uh, yeah
7. REFERENCES
[1] Nikki. Crazy for Friends scripts, 2004.
[2] Marina Sokolova and Guy Lapalme. A systematic analysis of performance measures for classification tasks. Information Processing and Management, 2009. http://atour.iro.umontreal.ca/rali/sites/default/files/publis/SokolJIPM09.pdf.
[3] Mudit Bhargava, Pulkit Mehndiratta, and Krishna Asawa. Stylometric analysis for authorship attribution on Twitter. In Proceedings of the Second International Conference on Big Data Analytics - Volume 8302, BDA 2013, pages 37-47, New York, NY, USA, 2013. Springer-Verlag New York, Inc.
[4] Moshe Koppel, Jonathan Schler, and Shlomo Argamon. Authorship attribution: What's easy and what's hard? Journal of Law and Policy, 21, 2013.
[5] Foaad Khosmood and Franz Kurfess. Automatic source attribution of text: A neural networks approach. In 2005 IEEE International Joint Conference on Neural Networks, volume 5. IEEE, 2005.