Who Said What? - Predicting which Friends Character Said What Line

Ariel Weingarten, Stephanie Chen, Danilo Gasques
University of California San Diego
[email protected], [email protected], [email protected]

Markus Dücker
Hasso Plattner Institute
[email protected]

1. THE DATASET
The dataset for this project was scraped from a Friends fan site [1] that hosts fan-transcribed scripts for every episode. We converted every line spoken in each episode into a JSON object with the following schema:

{
  "character": "<character name>",
  "line": "<line of dialogue>",
  "episodeNum": <episode number>,
  "actions": ["<list of actions>"]
}

Figure 1: Schema for JSON data.

Parsing 1990s HTML is fraught with complexity, and our dataset certainly contains some malformed entries, but the lines of dialogue for all of the main characters should be correct.

Although we started with 70,380 lines of script, once we began delving into the details behind each line we found that considerable cleaning was needed. Because the lines were transcribed by a large group of volunteers (super-fans of the television show), character naming was inconsistent. Some volunteers used nicknames and abbreviations for characters (e.g. MNCA for Monica, Rach for Rachel), which we discovered by examining the data and sorting lines by character; we then built a dictionary that maps each nickname to the canonical character name.

Additionally, some lines were not spoken by characters at all but instead stored information about the episode (e.g. the name of the director or casting director); these were removed. We also removed lines attributed to characters whose names were longer than 20 characters, because these names were almost always descriptions such as "The man with the black hat" or "The woman in the waiting room" - clearly episode extras, not one of our six titular characters or the main supporting cast.

Finally, to ensure that each line contained only the words actually spoken, we stripped action cues from the script. A character's action, such as [Reaches across the table for the mug] or [Hugs Joey tightly], does not help us predict which words a character would speak, or which character would speak which word. Thankfully, action cues were consistently enclosed in brackets, so we were able to identify and remove them.

At the end of the data cleaning process we had a set of 61,898 lines, a reduction of more than 12% from the original set of lines.
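The cleaning steps above can be summarized in a short script. This is a minimal sketch, not our exact code: the input file name raw_lines.json is hypothetical, the nickname dictionary shows only an illustrative subset of the mapping we built, and the metadata filter shown here is a simplified stand-in for the checks we actually applied.

import json
import re

# Illustrative subset of the nickname dictionary; the real mapping was
# built by inspecting every distinct speaker name in the transcripts.
NICKNAMES = {"MNCA": "Monica", "Rach": "Rachel", "Chan": "Chandler"}

ACTION_CUE = re.compile(r"\[[^\]]*\]")  # action cues are enclosed in brackets

def clean(entries):
    cleaned = []
    for entry in entries:
        name = NICKNAMES.get(entry["character"], entry["character"])
        # Simplified stand-in for dropping episode metadata entries and
        # lines attributed to descriptive extras (names over 20 characters).
        if name.lower() in {"written by", "directed by"} or len(name) > 20:
            continue
        # Strip bracketed action cues, keeping only the spoken words.
        line = ACTION_CUE.sub("", entry["line"]).strip()
        if line:
            cleaned.append({**entry, "character": name, "line": line})
    return cleaned

if __name__ == "__main__":
    with open("raw_lines.json") as f:   # hypothetical input file
        entries = json.load(f)
    print(len(clean(entries)), "lines after cleaning")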
2. DATA EXPLORATION
Before starting any prediction task we took a closer look at the data. All of the following insights are reported after performing the cleaning steps described above.

2.1 Line Breakdown by Character
Breaking down the lines by character gives a first overview of who speaks the most. The average number of lines per main character is around 8,500. In Figure 2 we see that Ross and Rachel speak much more than the other characters, while Phoebe speaks much less. Interestingly, this still leads to an almost balanced distribution by gender, with 25,757 male lines versus 25,200 female lines. Besides the six main cast members there are also 10,941 lines spoken by other characters.

Figure 2: Across all ten seasons of the show, the number of lines spoken by each of the main six characters.

2.2 Average Word Length by Character
Next we looked at the average word length over all lines spoken by each character. To show the effect of short lines, we report an additional measurement restricted to lines with more than 5 words.

Figure 3: Average word length per character.

2.3 Average Line Length by Character
Regarding the average length of lines (measured by counting characters), all main characters are very similar.

Figure 4: Average number of words per line.

3. PREDICTIVE TASK
Given this complete collection of all lines spoken by the six main characters in the TV show Friends (seasons 1 to 10), we want to train a model that predicts, for a previously unseen and unlabeled line, the character who spoke it. This is a supervised multi-class prediction problem: each instance in the training set is labeled with the character who spoke it, and there are 6 classes (the main characters) to predict. Each instance of our training data represents one spoken line, with features derived from the unstructured text (content-based and structure-based). Additionally, we experimented with context-based features that look at the lines spoken directly before and after the current one.

To give a realistic example of the prediction task, here are six randomly drawn lines (one per character):

• "jean-claude van damme. i didn't know he was in this movie, he is so hot." Monica
• "what?!" Rachel
• "no!! wait, wait, wait!! oh please, hold it up so i can listen." Phoebe
• "hey you're right. yeah, it's kinda been like us again a little bit." Joey
• "thank you for that! i was not flirting." Chandler
• "okay, okay, awkward question. the hospital knows you took two, right?" Ross

This random sample already hints at the difficulty of correctly predicting the character from a given short line. Without any context, even the most avid Friends fan will not be able to name the speaker of the "what?!" line; there is simply not enough information. We addressed this in our model by excluding all lines with fewer than 4 words from the dataset.

3.1 Baseline
To the best of our knowledge no one else has built a similar predictor for the Friends TV show that we could use as a benchmark. We therefore assume a "lazy" baseline that simply predicts at random. Given a dataset with a balanced class distribution, this random predictor is expected to pick the right character with an accuracy of 1/6 (about 16.7%). Another way to look at this would be a "lazy" model that predicts the same character for all lines, which would result in roughly the same accuracy as the random model.

Figure 5: Expected Classification Accuracy for the Naive Classifier.

3.2 Model Evaluation
We use four metrics to evaluate our models, as suggested by [2]:

• % correct classification
• Precision
• Recall
• F-score

Because we are facing a multi-class problem, the results presented here average each of these metrics over the 6 classes.
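These per-class averages can be computed with scikit-learn, which we already use for the model itself. The sketch below uses toy y_true/y_pred placeholders and macro averaging; the paper does not state precisely which averaging variant was used, so treat average="macro" as an assumption (weighted or micro averaging are equally plausible choices).

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy labels; in our setting these would be the six main characters.
y_true = ["Ross", "Rachel", "Monica", "Ross", "Joey", "Phoebe"]
y_pred = ["Ross", "Monica", "Monica", "Chandler", "Joey", "Ross"]

accuracy = accuracy_score(y_true, y_pred)
# average="macro" computes each metric per class and then averages over
# the classes, as described in Section 3.2.
precision, recall, f_score, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f_score:.3f}")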
3.3 Features
We use tf-idf (term frequency - inverse document frequency) values for each character's N most frequent unigrams. We used the raw frequency as the tf value:

tf(t, d) = rawFrequency_{t,d}

and the standard formula to compute idf:

idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )

4. THE MODEL
We employ the Logistic Regression model provided by the sklearn Python library. Our parameter settings are:

• l2 regularization with strength 1.0
• the Newton conjugate gradient (newton-cg) optimizer
• no class weighting
• a one-vs-all classification approach, which treats each class as a separate binary classification task

4.1 Trying different features
After an initial model using unigram tf-idf values, we experimented with other features, such as:

• phrase type (i.e. does the sentence end with a period (.), exclamation mark (!), question mark (?), etc.)
• occurrence of main cast names in the spoken text
• count of punctuation marks
• number of words
• average word length
• ratios comparing the current line to the known number-of-words and average-word-length statistics per character
• absolute counts per grammatical category (i.e. count of nouns, verbs, adjectives, etc.)
• name of the previous and next speaker
• season number and episode number of the line

We tried different combinations of unigram tf-idf values with these features and, unfortunately, they either reduced our accuracy or yielded negligible improvements. It turned out that the best way to improve our model was simply to add more tf-idf terms to it.

4.2 How we optimized the Model
Because our feature set consists of unigram tf-idf values, we experimented with the number of unigrams to include and used performance on the validation set to pick the optimal value. We lowercased the entire corpus of lines and removed punctuation when counting unigram frequencies. We also experimented with stemming using the PyStemmer module and with ignoring stopwords using an English stopword dictionary; these measures increased prediction performance only slightly. In the end, this produced a feature vector of length 4,801: 800 tf-idf features for each of the 6 characters (their top 800 unigrams) plus one bias term.

When we used the top 50 unigrams, we reached 25.6% accuracy on the validation set. As we increased the unigram count, accuracy increased to a maximum of 31.6% when using the top 800 unigrams. Increasing the unigram count further actually led to a decrease in accuracy, so our best result from this model came from using 800 unigrams.

Figure 6: Percentage Correctly Predicted. Performance is quantitatively better than the Naive Classifier.

Figure 7: Precision, Recall and F-Score across various numbers of unigrams.

4.3 Trying different classifiers
To make sure that our model's low performance was not due to the probabilistic assumptions of Logistic Regression, we also trained the model using support vector machines (SVM) with both a Gaussian (rbf) kernel and a linear kernel. The results can be seen in Table 1.

Table 1: Comparison between Logistic Regression and Support Vector Machines for n=4801 (800 unigrams per character and bias)

Method              | Accuracy | Precision | Recall | F1
Logistic Regression | 31.91%   | 32%       | 31.91% | 31.85%
SVM (rbf)           | 25.69%   | 25.93%    | 25.69% | 25.23%
SVM (linear)        | 24.09%   | 24.17%    | 24.09% | 23.82%

4.4 Challenges
To compute term frequency - inverse document frequency, we started off by writing our own method from scratch. However, this method was very complicated and too slow when trying to evaluate different numbers of tf-idf features quickly. We then switched to the TfidfVectorizer in the sklearn module, a pre-written library routine that calculates the numbers we needed, which sped up our computation time.

Thankfully, we did not have problems with overfitting because we split our data at the start into 80% training and validation, and 20% test. Within the 80% for training and validation, 80% of that data was allocated to the training set, and the remaining 20% was used for the validation set.
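The overall setup described in Section 4 can be sketched as follows. This is a simplified illustration, not our exact pipeline: the input file cleaned_lines.json is hypothetical, and a single shared TfidfVectorizer vocabulary capped at 4,800 terms stands in for our per-character top-800 unigram vocabularies. The hyperparameters mirror the ones listed above (l2 with strength 1.0, newton-cg, one-vs-all).

import json
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical cleaned dataset: list of {"character": ..., "line": ...} objects.
with open("cleaned_lines.json") as f:
    data = json.load(f)

# Lowercase and strip punctuation before counting unigrams (Section 4.2).
texts = [re.sub(r"[^\w\s]", "", d["line"].lower()) for d in data]
labels = [d["character"] for d in data]

# 80/20 split into (train+validation) and test, then 80/20 again for validation.
X_trval, X_test, y_trval, y_test = train_test_split(texts, labels, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trval, y_trval, test_size=0.2, random_state=0)

# Simplification: one shared vocabulary of 4,800 terms instead of
# 800 most-frequent unigrams per character.
vectorizer = TfidfVectorizer(max_features=4800)
X_train_tfidf = vectorizer.fit_transform(X_train)

# One-vs-all logistic regression with l2 regularization and newton-cg.
clf = OneVsRestClassifier(
    LogisticRegression(C=1.0, penalty="l2", solver="newton-cg", max_iter=1000)
)
clf.fit(X_train_tfidf, y_train)

val_pred = clf.predict(vectorizer.transform(X_val))
print("validation accuracy:", accuracy_score(y_val, val_pred))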
5. RELATED LITERATURE
Predicting which character said a given line of dialogue is essentially an authorship attribution task. We consulted a number of papers and looked at the features they used, the training methods they employed, and the results they achieved.

5.1 Stylometric Analysis for Authorship Attribution on Twitter
Authorship attribution for tweets is a useful comparison to our task because both tweets and lines of dialogue in a script are short. Bhargava et al. [3] employed a combination of lexical and syntactic features for their task. We were not able to borrow many of the syntactic features because they would not, or would only rarely, occur in a screenplay for a television show. The lexical features were fairly straightforward and did not provide any new ideas:

• Tweet length in characters
• Hashtags
• Number of special characters
• Number of words written with all capital letters
• Use of a quotation

Despite the restricted sample size, Bhargava et al. were able to achieve impressive results. For 10-15 authors and 200 samples per author, they achieved an F-score in the range of 85.59% to 90.56%. When they increased the number of authors to 20, the F-score dropped to 64.48%.

5.2 The One With All The Quantifiable Friendships
We found one other party that had attempted some predictive work on Friends, in a web report called The One With All The Quantifiable Friendships. This author sought to quantify the closeness of any two of the six main characters by identifying how often they shared a plot line. The author did so by manually coding the "dynamics" or interactions between characters in every scene across all 236 episodes. Ultimately, she found that the closest three friend pairs (in descending order) were Chandler and Monica, Chandler and Joey, and Rachel and Ross.

5.3 Authorship Attribution: What's Easy and What's Hard?
This work examined different data sets to identify which kinds make it easier to attribute an author, using character n-grams and Bayesian multi-class regression. The authors found that the "simplest kind of authorship attribution problem...is the one in which we are given a small, closed set of candidate authors and are asked to attribute an anonymous text to one of them." [4] This is the kind of predictive task we seek to do with our own data set, so we believed that because our set is also closed (to only six speakers/characters), we had a good shot at replicating this paper's success. Overall, the authors attained an accuracy rate near or above 80% for each problem they tackled by constructing a feature set of common words and n-grams, "used in conjunction with either Bayesian logistic regression or support vector machines as a learning algorithm."

5.4 Automatic source attribution of text: a neural networks approach
This paper used a word-frequency classification technique to demonstrate the application of neural networks to autonomous attribution in textual analysis [5]. The authors preferred identifying sources of texts, rather than authors, bucketing authors by style and linguistic distinction. The motivation was that one author can have more than one style of writing (e.g. William Shakespeare), which the model should detect.
To prepare the data, the authors removed symbols such as the quotation mark, single quote and hyphen, collapsed double spaces, and converted all capital letters to lower-case. Similarly, we decided to remove punctuation and convert everything to lower-case. The features used were all the remaining words in the corpora. The authors then applied a five-layer, 420 million-connection neural network, which ultimately achieved confidence rates ranging from 95% to 99% for 12 of the 19 corpora, including texts such as the King James Version of the Bible and works by Noam Chomsky and William Shakespeare. Their lowest performing corpus was the book of Mark from the New American Standard Bible translation, on which they attained only 72% confidence. Overall, we were impressed that their network was able to correctly identify the source of corpora from the Bible, which contains writing by multiple authors.

6. RESULTS
From the paper entitled "The One With All The Quantifiable Friendships," we took the author's findings into consideration as we began exploring our own data, and modified her approach. Whereas the author had predicted which friend pairs were closest by manually coding which characters appeared in a scene together, we decided to see which characters most frequently called another directly by name. Our rationale was that the closer a pair of characters were, the more frequently they would speak each other's name aloud, whether speaking directly to that character or speaking about him or her to another character.

Our top three pairs (in descending order) were Rachel and Ross, Joey and Ross, and Monica and Chandler. Two of our three pairs matched the findings of this other author, which was interesting because we utilized a different strategy to measure closeness between pairs of characters.
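A minimal sketch of this name-mention count follows, assuming the cleaned JSON records from Section 1. The simple case-insensitive substring match and the omission of nickname handling are simplifications; the function and variable names are illustrative.

from collections import Counter

MAIN_CAST = ["Monica", "Rachel", "Phoebe", "Ross", "Chandler", "Joey"]

def closeness_by_name_mentions(lines):
    """Count how often each main character says another main character's name.

    `lines` is the cleaned dataset: dicts with "character" and "line" keys.
    The symmetric pair total is used as a rough closeness score.
    """
    mentions = Counter()
    for entry in lines:
        speaker = entry["character"]
        if speaker not in MAIN_CAST:
            continue
        text = entry["line"].lower()
        for other in MAIN_CAST:
            if other != speaker and other.lower() in text:
                mentions[(speaker, other)] += 1
    # Combine directions so (Ross, Rachel) and (Rachel, Ross) count as one pair.
    pairs = Counter()
    for (a, b), count in mentions.items():
        pairs[tuple(sorted((a, b)))] += count
    return pairs.most_common()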
From the paper on the stylometric analysis of tweets, we were not able to borrow many of the syntactic features because they would not, or would only rarely, occur in a screenplay for a television show such as Friends. Nonetheless, we did consider the length of lines as one feature on which to base our model, which was something suggested by this paper.

Next, regarding the paper on authorship attribution: because the authors put forth that a closed data set would generate better results, we believed our data set matched the characteristics they described, as we cleaned the script of 10 seasons' worth of episodes down to only the lines spoken by our main six characters.

Lastly, after considering the findings of the paper on automatic source attribution of text using neural networks, we believe that this could be a method to explore if we were to continue improving prediction accuracy on this data set. Although we did not implement a neural network for our Friends script analysis, we believe our data set could be a good candidate for this method, especially considering that the scripts were written by different authors (scriptwriters) over the 10-year course of the show's run, just as the paper used corpora written by different authors but attributed to one overall source (e.g. the Bible).

Given these considerations from related literature, and after analyzing the performance of our various unigram models (described in section 4), we identified the 800 Unigram Model as our best one. The graphs below show the performance of our 800 Unigram Model on the test set.

Figure 8: Percentage of the Test Set Correctly Predicted by our 800 Unigram Model.

Figure 9: Performance of our 800 Unigram Model, measured with Precision, Recall and F-Score.

6.1 Comparison to Alternatives
Our model had a classification accuracy of 29.9432%, which is 1.8 times better than the random classifier that guesses one of the six available options (the six main characters). However, this accuracy rate is a far cry from the classification accuracies achieved in the related work presented in the previous section, which were all around 80% or higher, particularly the approach that utilized a five-layer neural network.

6.2 Which Feature Representations Worked Well? (and which didn't)
The best performing features were the tf-idf values of the most common unigrams. These clearly outperformed all the other features that we tried. By adding more tf-idf terms as features we were able to increase accuracy to almost 32%, at which point we hit a barrier that we were not able to overcome, neither by switching the predictor model nor by adding and experimenting with other features. This approach only seems to be successful when a large number of terms is included, as it is not possible to capture the differences between the characters in just a small set of terms. We attribute this to the nature of the sitcom: the characters talk mainly about mundane things, and topics repeat, which results in a high correlation among the most frequent unigrams.

We found one way to significantly improve the model beyond 32%: looking at the context of the given line and including the names of the previous and next speakers as an encoded feature. This increased accuracy to 42%, but it can also be seen as "cheating" because it effectively reduces the number of potential characters for a specific prediction. On the other hand, it is similar to how humans would try to identify the character behind a given line, by relating it back to the context and situation in which it was spoken.
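One way such a context feature can be encoded is to one-hot encode the previous and next speaker and append those columns to the tf-idf matrix. The sketch below is illustrative rather than our exact implementation: the function and variable names are hypothetical, and for brevity the encoder is fit on the full matrix rather than on training data only, as it should be in practice.

import numpy as np
from scipy.sparse import hstack
from sklearn.preprocessing import OneHotEncoder

def add_context_features(tfidf_matrix, speakers):
    """Append one-hot previous/next-speaker columns to a sparse tf-idf matrix.

    `speakers[i]` is the character who spoke line i, in episode order.
    """
    prev_speaker = ["<none>"] + speakers[:-1]   # speaker of the preceding line
    next_speaker = speakers[1:] + ["<none>"]    # speaker of the following line
    context = np.array(list(zip(prev_speaker, next_speaker)))
    encoder = OneHotEncoder(handle_unknown="ignore")
    context_onehot = encoder.fit_transform(context)
    # Stack the encoded context columns next to the tf-idf features.
    return hstack([tfidf_matrix, context_onehot])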
6.3 Interpretation of Model Parameters
Our model used a regularization parameter of 1.0, which means that we did not need a particularly high level of regularization to achieve a generalizable predictor.

6.4 Why did the Proposed Model Fail?
A large contributor to the poor performance of our model was the difficulty of finding features that are good at differentiating between the characters. There is a huge overlap between the most common words per character, and because the lines are short they consist mostly of common words. It is possible to identify words that are mostly unique to one character, but such words also have a low term frequency and therefore do not help with classifying the majority of that character's lines.

To illustrate the prevalence of common words among the characters, here are each character's top 10 terms ordered by tf-idf (treating all lines from the same character as one document):

• monica: and, is, it, oh, that, the, this, to, what, you
• rachel: and, im, it, just, oh, that, the, to, what, you
• phoebe: and, it, no, oh, okay, so, that, the, to, you
• ross: and, im, is, it, no, that, the, to, what, you
• chandler: and, im, is, it, me, that, the, to, what, you
• joey: and, hey, it, me, that, the, to, what, yeah, you

Even when we removed stop-words (which did not improve our final score), the top words sorted by tf-idf are not helpful for distinguishing between characters. For instance, the top 10 terms for Phoebe and Ross become:

• phoebe: hey, just, know, like, oh, okay, really, right, yeah, yknow
• ross: hey, just, know, like, mean, oh, okay, right, uh, yeah

7. REFERENCES
[1] Nikki. Crazy for Friends scripts, 2004.
[2] Marina Sokolova and Guy Lapalme. A systematic analysis of performance measures for classification tasks. Information Processing and Management, 2009. http://atour.iro.umontreal.ca/rali/sites/default/files/publis/Sokol JIPM09.pdf.
[3] Mudit Bhargava, Pulkit Mehndiratta, and Krishna Asawa. Stylometric analysis for authorship attribution on Twitter. In Proceedings of the Second International Conference on Big Data Analytics - Volume 8302, BDA 2013, pages 37-47, New York, NY, USA, 2013. Springer-Verlag New York, Inc.
[4] Moshe Koppel, Jonathan Schler, and Shlomo Argamon. Authorship attribution: What's easy and what's hard? Journal of Law and Policy, 21, 2013.
[5] Foaad Khosmood and Franz Kurfess. Automatic source attribution of text: A neural networks approach. In 2005 IEEE International Joint Conference on Neural Networks, volume 5. IEEE, 2005.