Success with Style: Using Writing Style to Predict the Success of Novels

Vikas Ganjigunte Ashok    Song Feng    Yejin Choi
Department of Computer Science
Stony Brook University
Stony Brook, NY 11794-4400
vganjiguntea, songfeng, [email protected]

Abstract

Predicting the success of literary works is a curious question among publishers and aspiring writers alike. We examine the quantitative connection, if any, between writing style and successful literature. Based on novels over several different genres, we probe the predictive power of statistical stylometry in discriminating successful literary works, and identify characteristic stylistic elements that are more prominent in successful writings. Our study reports for the first time that statistical stylometry can be surprisingly effective in discriminating highly successful literature from its less successful counterparts, achieving accuracy up to 84%. Closer analyses lead to several new insights into the characteristics of the writing style in successful literature, including findings that are contrary to conventional wisdom with respect to good writing style and readability.

1 Introduction

Predicting the success of novels is a curious question among publishers, professional book reviewers, and aspiring and even expert writers alike. There are potentially many influencing factors, some of which concern the intrinsic content and quality of the book, such as interestingness, novelty, style of writing, and an engaging storyline, but external factors such as social context and even luck can play a role. As a result, recognizing successful literary work is a hard task even for experts working in the publishing industry.
Indeed, even some of the best sellers and award winners can go through several rejections before they are picked up by a publisher.1 Perhaps due to the obvious complexity of the problem, there has been little previous work that attempts to build statistical models that predict the success of literary works based on their intrinsic content and quality. Some previous studies do touch on the notion of stylistic aspects in successful literature: extensive studies in Literature discuss the literary styles of significant authors (e.g., Ellegård (1962), McGann (1998)), while others consider content characteristics such as plots, characteristics of characters, action, emotion, genre, and cast of best-selling novels and blockbuster movies (e.g., Harvey (1953), Hall (2012), Yun (2011)). All these studies, however, are qualitative in nature, as they rely on the knowledge and insights of human experts on literature. To our knowledge, no prior work has undertaken a systematic quantitative investigation of the overarching characterization of the writing style in successful literature. In consideration of the widely different styles of authorship (e.g., Escalante et al. (2011), Peng et al. (2003), Argamon et al. (2003)), it is not even readily clear whether there might be common stylistic elements that help discriminate highly successful works from their less successful counterparts. In this work, we present the first study that investigates this unstudied and unexpected connection between stylistic elements and literary success. The key findings of our research reveal that there exist distinct linguistic patterns shared among successful literature, at least within the same genre, making it possible to build a model with surprisingly high accuracy (up to 84%) in predicting the success of a novel.

1 E.g., Paul Harding's "Tinkers", which won the 2010 Pulitzer Prize for Fiction, and J. K. Rowling's "Harry Potter and the Philosopher's Stone", which sold over 450 million copies.
This result is surprising for two reasons. First, we tackle the hard task of predicting the success of novels written by previously unseen authors, avoiding incidental learning of authorship signatures; this matters because previous research has demonstrated that one can achieve very high accuracy in authorship attribution (as high as 96% in some experimental setups) (e.g., Raghavan et al. (2010), Feng et al. (2012)). Second, we aim to discriminate highly successful novels from less successful, but nonetheless published, books written by professional writers, which are undoubtedly of higher quality than average writing. It is important to note that the task we tackle here is much harder than discriminating highly successful works from those that have not even passed the scrutinizing eyes of publishers. In order to quantify the success of literary works, and to obtain corresponding gold-standard labels, one needs to first define "success". For practical convenience, we largely rely on the download counts available at Project Gutenberg as a surrogate measure of the success of novels. For a small number of novels, however, we also consider award recipients (e.g., Pulitzer, Nobel) and Amazon's sales records to define a novel's success. We also extend our empirical study to movie scripts, where we quantify the success of movies based on the average review scores at imdb.com. We leave analysis based on other measures of literary success as future research. In this study, we do not attempt to separate success based on literary quality (award winners) from success based on popularity (commercial hits, often in spite of bad literary quality), mainly because it is not practical to determine whether high download counts are due to only one reason or the other. We expect, however, that in many cases the two aspects of success are likely to coincide.
In the case of the corpus obtained from Project Gutenberg, where most of our experiments are conducted, we expect that the download counts are more indicative of success based on literary quality (which then may have resulted in popularity) rather than popularity without quality.

GENRE                 #BOOKS   τ−    τ+
Adventure                409    17   100
Detective / Mystery      374    25    90
Fiction                 1148     7   125
Historical Fiction       374    25   115
Love Stories             342    16    85
Poetry                   580     9    70
Science Fiction          902    30   100
Short Stories           1117     9   224

Table 1: # of books available per genre at Gutenberg with download thresholds used to define the more successful (≥ τ+) and less successful (≤ τ−) classes.

We examine several genres in fiction and movie scripts, e.g., adventure stories, mystery, fiction, historical fiction, sci-fi, short stories, as well as poetry, and present systematic analyses based on lexical and syntactic features which have been known to be effective in a variety of NLP tasks ranging from authorship attribution (e.g., Raghavan et al. (2010)), genre detection (e.g., Rayson et al. (2001), Douglas and Broussard (2000)), and gender identification (e.g., Sarawgi et al. (2011)) to native language detection (e.g., Wong and Dras (2011)). Our empirical results demonstrate that (1) statistical stylometry can be surprisingly effective in discriminating successful literature, achieving accuracy up to 84%, and (2) some elements of successful styles are genre-dependent while others are more universal. In addition, this research yields (3) findings that are somewhat contrary to conventional wisdom with respect to the connection between successful writing styles and readability, (4) interesting correlations between sentiment / connotation and literary success, and finally, (5) comparative insights between fiction and nonfiction with respect to successful writing style.

2 Dataset Construction

For our experiments, we procure novels from Project Gutenberg.2
Project Gutenberg houses over 40,000 books available for free download in electronic format and provides a catalog containing brief descriptions (title, author, genre, language, download count, etc.) of these books. We experiment with the genres in Table 1, which have a sufficient number of books to allow us to construct reasonably sized datasets. We use the download counts in the Gutenberg catalog as a surrogate measure of the degree of success of novels. For each genre, we determine two thresholds on download counts, a lower bound τ+ and an upper bound τ−, as shown in Table 1, and categorize books with at least τ+ downloads as more successful and books with at most τ− downloads as less successful. These thresholds are set to obtain at least 50 books for each class and each genre. To balance the data, for each genre we construct a dataset of 100 novels (50 per class). We make sure that no single author has more than 2 books in the resulting dataset, and in the majority of cases only one book has been taken from each author.3 Furthermore, we make sure that books from the same author do not show up in both training and test data. These constraints ensure that we learn general linguistic patterns of successful novels, rather than the particular writing style of a few successful authors.

2 http://www.gutenberg.org/
3 The complete list of novels used for each genre in our dataset is available at http://www.cs.stonybrook.edu/˜ychoi/successwithstyle/

Figure 1: Differences in POS tag distribution between more successful and less successful books across different genres. Negative (positive) value indicates higher percentage in the less (more) successful class.

3 Methodology

In what follows, we describe five different aspects of linguistic style that we measure quantitatively. The first three correspond to features that have been frequently utilized in previous studies of related tasks, e.g., genre detection (e.g., Kessler et al.
(1997)) and authorship attribution (e.g., Stamatatos (2009)), while the last two are newly explored in this work.

I. Lexical Choices: unigrams and bigrams.

II. Distribution of Word Categories: Many previous studies have shown that the distribution of part-of-speech (POS) tags alone can reveal surprising insights on genre and authorship (e.g., Koppel and Schler (2003)); hence we examine their distributions with respect to the success of literary works.

III. Distribution of Grammar Rules: Recent studies have reported that features based on CFG rules are helpful in authorship attribution (e.g., Raghavan et al. (2010), Feng et al. (2012)). We experiment with four different encodings of production rules:
• Γ: lexicalized production rules (all production rules, including those with terminals).
• ΓG: lexicalized production rules prepended with the grandparent node.
• γ: unlexicalized production rules (all production rules except those with terminals).
• γG: unlexicalized production rules prepended with the grandparent node.
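As a concrete illustration, the four encodings can be extracted from a bracketed parse as follows. This is a minimal sketch, not the authors' implementation; we assume the "grandparent" annotation prepends the label of the node above the rule's left-hand side, and we use a tiny tuple-based tree reader in place of a full parser:

```python
# Sketch (not the authors' code) of the four production-rule encodings.
# Assumption: "grandparent" prepends the label of the rule's parent node.

def parse_bracketed(s):
    """Tiny reader for Penn-style bracketed parses -> (label, children);
    leaves are plain word strings."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()

    def read(i):
        label = tokens[i + 1]       # tokens[i] is "("
        i += 2
        children = []
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = read(i)
            else:
                child, i = tokens[i], i + 1
            children.append(child)
        return (label, children), i + 1

    return read(0)[0]

def production_features(tree, parent="ROOT"):
    """Collect Gamma (lexicalized), Gamma^G, gamma (unlexicalized), gamma^G."""
    feats = {"lex": [], "lex_gp": [], "unlex": [], "unlex_gp": []}
    label, children = tree
    rhs = " ".join(c[0] if isinstance(c, tuple) else c for c in children)
    rule = f"{label} -> {rhs}"
    feats["lex"].append(rule)                     # Gamma
    feats["lex_gp"].append(f"{parent}^{rule}")    # Gamma^G
    if all(isinstance(c, tuple) for c in children):  # drop rules with terminals
        feats["unlex"].append(rule)               # gamma
        feats["unlex_gp"].append(f"{parent}^{rule}")  # gamma^G
    for c in children:
        if isinstance(c, tuple):
            sub = production_features(c, parent=label)
            for k in feats:
                feats[k].extend(sub[k])
    return feats
```

For the parse "(S (NP (DT The) (NN dog)) (VP (VBD barked)))", γ contains "S -> NP VP" and "NP -> DT NN" but not the lexicalized "DT -> The".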
FEATURE           Adven  Myster  Fiction  Histor  Love   Poetr  Sci-fi  Short   Avg    Avg w/o Histor
POS               74.0   63.9    72.0     47.0    65.9   63.0   63.0    67.0    64.5   66.9
Unigram           84.0   73.0    75.0     60.0    82.0   71.0   61.0    57.0    70.3   71.8
Bigram            81.0   73.0    75.0     51.0    72.0   70.0   59.0    57.0    67.2   69.5
Γ                 73.0   71.0    75.0     54.0    78.0   74.0   71.0    77.0    71.6   74.1
ΓG                75.0   74.0    75.0     58.0    81.0   72.0   76.0    77.0    73.5   75.7
γ                 72.0   70.0    65.0     53.0    70.0   66.0   64.0    71.0    66.3   68.2
γG                72.0   69.0    74.0     55.0    75.0   69.0   67.0    73.0    69.2   71.2
Γ+Unigram         79.0   73.0    73.0     59.0    80.0   73.0   71.0    73.0    72.6   74.5
ΓG+Unigram        80.0   74.0    74.0     56.0    82.0   72.0   73.0    72.0    72.8   75.2
γ+Unigram         82.0   72.0    73.0     56.0    81.0   69.0   62.0    59.0    69.2   71.1
γG+Unigram        80.0   73.0    74.0     58.0    82.0   70.0   65.0    58.0    70.0   71.7
PHR               74.0   65.0    65.0     56.0    64.0   62.0   69.0    71.0    65.7   67.1
PHR+CLS           75.0   69.0    64.0     61.0    59.0   62.0   69.0    67.0    65.7   66.4
PHR+Unigram       80.0   74.0    71.0     56.0    79.0   73.0   67.0    66.0    70.7   72.8
PHR+CLS+Unigram   80.0   75.0    71.0     56.0    79.0   73.0   66.0    66.0    70.7   72.8

Table 2: Classification results in accuracy (%).

IV. Distribution of Constituents: PCFG grammar rules are overly specific to draw a big picture of the distribution of large, recursive syntactic units. We hypothesize that the distribution of constituents can serve this purpose, and that it will reveal interesting and more interpretable insights into the writing styles of highly successful literature. Despite its relative simplicity, we are not aware of previous work that looks at the distribution of constituents directly. In particular, we are interested in examining the distribution of phrasal and/or clausal tags as follows:
(i) Phrasal tag percent (PHR) - percentage distribution of phrasal tags.4
(ii) Clausal tag percent (CLS) - percentage distribution of clausal tags.

V. Distribution of Sentiment and Connotation: Finally, we examine whether the distribution of sentiment and connotation words, and their polarity, has any correlation with the success of literary works.
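The PHR and CLS features in (i) and (ii) amount to normalized counts over the phrasal and clausal constituent labels. A minimal sketch (our own illustration, not the authors' code; we assume the standard Penn Treebank tag inventories, 21 phrasal plus 5 clausal tags, i.e., the 26-tag set the paper refers to):

```python
from collections import Counter

# Assumed standard Penn Treebank constituent label sets.
PHRASAL = {"ADJP", "ADVP", "CONJP", "FRAG", "INTJ", "LST", "NAC", "NP", "NX",
           "PP", "PRN", "PRT", "QP", "RRC", "UCP", "VP", "WHADJP", "WHADVP",
           "WHNP", "WHPP", "X"}
CLAUSAL = {"S", "SBAR", "SBARQ", "SINV", "SQ"}

def tag_percentages(label_lists, tagset):
    """label_lists: one list of constituent labels per parsed sentence.
    Returns {tag: share (%) of all tagset occurrences in the book}."""
    counts = Counter(t for labels in label_lists for t in labels if t in tagset)
    total = sum(counts.values())
    return {t: 100.0 * n / total for t, n in counts.items()} if total else {}
```

For example, tag_percentages([["NP", "VP", "NP", "PP", "S"]], PHRASAL) yields NP: 50%, VP: 25%, PP: 25%; the clausal S is ignored by the phrasal feature.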
We are not aware of any previous work that looks into this connection.

4 Prediction Performance

We use the LibLinear SVM (Fan et al., 2008) with L2 regularization, tuned over the training data, and all performance is based on 5-fold cross-validation. We take 1,000 sentences from the beginning of each book. POS features are encoded as unit-normalized frequencies and all other features are encoded as tf-idf.5

4 The percentage of any phrasal tag is the count of occurrences of that tag over the sum of counts of all phrasal tags.
5 POS tags are obtained using the Stanford POS tagger (Toutanova and Manning, 2000), and parse trees are based on the Stanford parser (Klein and Manning, 2003).

Prediction Results Table 2 shows the classification results. The best performance reaches as high as 84% in accuracy. In fact, in all genres except for history, the best performance is at least 74%, if not higher. Another notable observation is that even in the poetry genre, which is not prose, the accuracy gets as high as 74%. This level of performance is not entirely anticipated, given that (1) the test data consists of books written only by previously unseen authors, (2) each author has a widely different writing style, (3) we do not have training data at scale, and (4) we aim to tackle the hard task of discriminating highly successful books from less successful, but nonetheless successful, ones, as all of them were, after all, good enough to be published.6

Prediction with Varying Thresholds of Download Counts Before we proceed to a comprehensive analysis of the writing styles that are prominent in more successful literature (§5), in Table 3 we present how the prediction accuracy varies as we adjust the definition of more-successful and less-successful literature, by gradually increasing (decreasing) the threshold τ− (τ+).
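The tf-idf encoding of the non-POS features can be sketched as follows. This is a simplified illustration; the exact tf-idf variant and normalization used in the paper are not specified, so raw term frequency with a log(N/df) idf is an assumption, and the resulting vectors are what would be fed to the LibLinear SVM:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: one token list per book (e.g., unigrams from the first 1,000
    sentences). Returns sparse {term: tf-idf} dicts, one per book.
    Weighting (raw tf, idf = log(N/df)) is an assumed variant."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))       # document frequency of each term
    n = len(docs)
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]
```

Note that terms occurring in every book receive weight 0; these vectors, with class labels derived from the τ+/τ− thresholds, are the input to the 5-fold cross-validated SVM.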
As we reduce the gap between τ− and τ+, the performance decreases, which shows that there are indeed notable statistical differences in linguistic patterns between novels with high and low download counts, and that the stylistic difference (and thereby the prediction accuracy) increases monotonically as we increase the gap between the two classes. This is particularly interesting because the size of the training data set is actually monotonically decreasing (making it harder to achieve high accuracy) as we increase the separation between τ− and τ+.

τ−    τ+     ACCURACY
17    100    84.0
25     90    78.4
35     80    77.6
45     70    76.4
55     60    73.5

Table 3: Accuracy (%) with varying thresholds of download counts for ADVENTURE with unigram features.

6 In our pilot study, we also experimented with the binary classification task of discriminating highly successful books from those that were not even published (unpublished online novels), and it was a much easier task, as expected.

5 Analysis of Successful Writing Styles

5.1 Insights Based on Lexical Choices

It is apparent from Table 2 that unigram features yield curiously high performance in many genres. We therefore examine the discriminative unigrams for ADVENTURE, shown in Table 4. Interestingly, less successful books rely on verbs that are explicitly descriptive of actions and emotions (e.g., "wanted", "took", "promised", "cried", "cheered"), while more successful books favor verbs that describe thought-processing (e.g., "recognized", "remembered") and verbs that serve the purpose of quotes and reports (e.g., "say"). Also, more successful books use discourse connectives and prepositions more frequently, while less successful books rely more on topical words that could be almost cliché, e.g., "love", typical locations, and involve more extreme (e.g., "breathless") and negative (e.g., "risk") words.

Less Successful
CATEGORY                   UNIGRAMS
Negative                   never, risk, worse, slaves, hard, murdered, bruised, heavy, prison
Body Parts                 face, arm, body, skins
Location                   room, beach, bay, hills, avenue, boat, door
Emotional / Action Verbs   want, went, took, promise, cry, shout, jump, glare, urge
Extreme Words              never, very, breathless, sacred, slightest, absolutely, perfectly
Love Related               desires, affairs

More Successful
CATEGORY                   UNIGRAMS
Negation                   not
Report / Quote             said, words, says
Self Reference             I, me, my
Connectives                and, which, though, that, as, after, but, where, what, whom, since, whenever
Prepositions               up, into, out, after, in, within
Thinking Verbs             recognized, remembered

Table 4: Discriminative unigrams for ADVENTURE.

5.2 Distribution of Sentiment & Connotation

We also determine the distribution of sentiment and connotation words separately for each class (Table 5) to check whether there exists a connection to successful writing styles.7 We first compare the distributions of sentiment and connotation over all words. As can be seen in Table 5 – Top, there are no notable differences. However, when we compare the distributions with respect to discriminative unigrams only (i.e., features with non-zero weights), as shown in Table 5 – Bottom, we find substantial differences in all genres. In particular, the discriminative unigrams that characterize less successful novels involve significantly more sentiment-laden words.

7 We use the MPQA subjectivity lexicon (Wilson et al., 2005) and the connotation lexicon (Feng et al., 2013) for determining the sentiment and connotation of words, respectively.

5.3 Distribution of Word Categories

A summarized analysis of the POS distribution across all genres is reported in Table 6. It can be seen that prepositions, nouns, pronouns, determiners and adjectives are predictive of highly successful books, whereas less successful books are characterized by a higher percentage of verbs, adverbs, and foreign words. Per-genre distributions of POS tags are visualized in Figure 1.
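The per-class POS comparison behind Table 6 and Figure 1 can be sketched as follows (our own illustration of the computation, not the authors' code; we assume per-book relative tag frequencies averaged within each class):

```python
from collections import Counter

def mean_pos_distribution(books):
    """books: one list of POS tags per book. Returns the average of
    per-book relative tag frequencies."""
    totals = Counter()
    for tags in books:
        n = len(tags)
        for tag, k in Counter(tags).items():
            totals[tag] += k / n
    return {t: v / len(books) for t, v in totals.items()}

def pos_diff(more_succ, less_succ):
    """Positive DIFF: tag relatively more frequent in the more successful
    class; negative: favored in the less successful class."""
    d_more = mean_pos_distribution(more_succ)
    d_less = mean_pos_distribution(less_succ)
    return {t: d_more.get(t, 0.0) - d_less.get(t, 0.0)
            for t in set(d_more) | set(d_less)}
```

A tag absent from one class simply contributes zero for that class, so the DIFF values are directly comparable across tags.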
Interestingly, some POS tags show almost universal patterns (e.g., prepositions (IN), NNP, WP, VB), while others are more genre-specific.

Top: all unigrams (+: more successful, −: less successful)
          Adven       Myster      Fiction     Histor      Love         Poetr       Sci-fi      Short
          +     −     +     −     +     −     +     −     +      −     +     −     +     −     +     −
+ve S     4.7   4.9   4.8   4.6   5.6   4.9   5.0   5.1   5.5    5.1   6.3   5.7   4.1   3.7   4.7   4.8
-ve S     4.0   4.0   4.0   4.0   4.3   4.2   4.2   4.2   4.1    4.2   4.3   4.3   2.9   2.9   3.8   4.0
Total S   8.7   8.9   8.9   8.7   9.9   9.0   9.2   9.3   9.6    9.3  10.6   9.9   7.0   6.7   8.5   8.9
+ve C    22.3  22.5  22.3  22.5  23.7  23.0  23.0  23.2  23.34  23.3  23.8  22.9  21.2  20.6  22.6  22.7
-ve C    19.4  19.6  19.8  19.8  20.3  19.5  19.2  19.4  20.2   20.4  17.7  17.4  16.6  16.7  18.3  18.9
Total C  41.7  42.1  42.1  42.3  44.0  42.5  42.3  42.6  43.5   43.7  41.5  40.3  37.9  37.3  41.0  41.6

Bottom: discriminative unigrams
           Adven  Myster  Fiction  Histor  Love    Poetr  Sci-fi  Short
Less successful (−):
+ve S       3.5    4.1     3.7      3.0    3.4     3.9    7.3     5.1
-ve S       5.5    6.3     5.5      4.7    5.1     5.8    9.0     7.3
Total S     9.1   10.4     9.2      7.7    8.5     9.7   16.3    12.4
+ve C      12.9   14.3    12.9     11.5   12.0    14.0   19.6    16.5
-ve C      14.1   15.2    13.7     12.4   12.9    14.3   20.0    17.0
Total C    27.0   29.5    26.6     23.9   24.87   28.3   39.7    33.5
More successful (+):
+ve S       1.8    2.0     1.4      1.0    1.3     2.0    5.9     2.7
-ve S       3.4    3.6     2.9      1.9    2.6     3.3    8.0     4.8
Total S     5.2    5.6     4.3      3.0    3.9     5.2   13.9     7.5
+ve C       8.9    9.8     8.5      6.2    7.7     9.6   19.2    11.9
-ve C       9.8   10.9     9.9      7.0    8.5    10.3   19.7    13.3
Total C    18.7   20.7    18.4     13.2   16.1    19.8   38.9    25.2

Table 5: Top: distribution of sentiment (connotation) among all unigrams. Bottom: distribution of sentiment (connotation) among discriminative unigrams. 'S' and 'C' stand for sentiment and connotation, respectively.

In Relation to Journalism Style The work of Douglas and Broussard (2000) reveals that informative writing (journalism) involves increased use of nouns, prepositions, determiners and coordinating conjunctions, whereas imaginative writing (novels) involves more use of verbs and adverbs, as has also been confirmed by Rayson et al. (2001). Comparing their findings with Table 6, we find that highly
successful books tend to bear closer resemblance to informative articles.

More Successful
CATEGORY       SUB-CATEGORY              DIFF
Prepositions   General                    0.00592
Determiners    General                    0.00226
Nouns          General                    0.00189
Nouns          Plural                     0.00016
Nouns          Proper (Singular)          0.00118
Coord. conj.   General                    0.00102
Numbers        General                    0.00081
Pronouns       Possessive                 0.00042
Pronouns       WH                         5.4E-05
Pronouns       Possessive WH              0.00102
Adjectives     Superlative                0.00011

Less Successful
CATEGORY       SUB-CATEGORY              DIFF
Adverbs        General                   -0.00272
Adverbs        WH                        -0.00028
Verbs          Base                      -0.00239
Verbs          Non-3rd sing. present     -0.00084
Verbs          Past tense                -0.00041
Verbs          Past participle           -0.00039
Verbs          3rd person sing. present  -0.00036
Verbs          Modal                     -0.00091
Foreign        General                   -0.00067
Symbols        General                   -0.00018
Interjections  General                   -0.00016

Table 6: Top discriminative POS tags.

5.4 Distribution of Constituents

It can be seen in Table 2 that deep syntactic features expressed in terms of different encodings of production rules consistently yield good performance across almost all genres. However, production rules are overly specific to yield more generalized, interpretable, high-level insights (Feng et al., 2012). Therefore, similarly to word categories (POS), we consider the categories of nonterminal nodes of the parse trees, in particular phrasal and clausal tags, as they represent the gist of constituent structure that goes beyond the shallow syntactic information represented by POS. Table 8 shows how the distribution of phrasal and clausal tags differs for successful books when computed over all genres. Positive (negative) DIFF values indicate that the corresponding tags are favored in more successful (less successful) books when counted across all genres. We also report the number of genres (#Genres) in which the individual difference is positive / negative.
In terms of phrasal tags, we find that more successful books are composed of a higher percentage of prepositional phrases (PP), noun phrases (NP) and wh-noun phrases (WHNP), whereas less successful books are composed of a higher percentage of verb phrases (VP), adverb phrases (ADVP), interjections (INTJ) and fragments (FRAG). Notice that this observation is in line with our earlier findings on the distribution of POS. In regard to clausal tags, more successful books involve more of the clausal tags that are necessary for complex and inverted sentence structure (SBAR, SBARQ and SQ), whereas less successful books rely more on simple sentence structure (S). Figure 2 visualizes the distribution of these phrasal and clausal tags.

It is also worth mentioning that phrasal and clausal tags alone can yield classification performance that is generally better than that of POS tags, in spite of the very small feature set (26 tags in total). In fact, constituent tags deliver the best performance in the case of the historical fiction genre (Table 2).

Figure 2: Difference between phrasal and clausal tag percentage distributions of more successful and less successful books across different genres. Specifically, we plot D− − D+, where D+ is the phrasal tag distribution (in %) of more successful books and D− is the phrasal tag distribution (in %) of less successful books.

READABILITY INDICES   More Succ.   Less Succ.
FOG index                 9.88        9.80
Flesch index             87.48       87.64

Table 7: Readability scores: lower FOG and higher Flesch indicate higher readability.

Connection to Readability Pitler and Nenkova (2008) provide comprehensive insights into the assessment of readability. In their work, among the most discriminating features characterizing text with better readability is increased use of verb phrases (VP).
Interestingly, contrary to the conventional wisdom that readability is a desirable quality of good writing, our findings in Table 8 suggest that increased use of VP correlates strongly with the writing style at the opposite end of the spectrum from highly successful novels. As a secondary way of probing the connection between readability and the writing style of successful literature, we also compute two different readability measures that have been used widely in prior literature (e.g., Sierra et al. (1992), Blumenstock (2008), Ali et al. (2010)): (i) the Flesch reading ease score (Flesch, 1948), and (ii) the Gunning FOG index (Gunning, 1968). The overall weighted average readability scores are reported in Table 7. Again, we find that less successful novels have higher readability than more successful ones. The work of Sawyer et al. (2008) provides yet another interesting contrasting point: the authors found that award-winning academic papers in marketing journals correlate strongly with increased readability, characterized by a higher percentage of simple sentences. We conjecture that this opposite trend is likely due to the difference between fiction and nonfiction, leaving further investigation as future research. In sum, our analysis reveals an intriguing and unexpected observation on the connection between readability and literary success: they correlate in opposite directions.
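The two indices can be computed as follows (a sketch; the syllable counter is a rough vowel-group heuristic, and published implementations differ in such details):

```python
import re

def syllables(word):
    # Rough heuristic: count maximal vowel groups, minimum one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(sentences):
    """sentences: list of word lists. Higher score = more readable."""
    words = [w for s in sentences for w in s]
    return (206.835 - 1.015 * len(words) / len(sentences)
            - 84.6 * sum(map(syllables, words)) / len(words))

def gunning_fog(sentences):
    """Lower score = more readable; 'complex' words have >= 3 syllables."""
    words = [w for s in sentences for w in s]
    complex_words = sum(1 for w in words if syllables(w) >= 3)
    return 0.4 * (len(words) / len(sentences)
                  + 100.0 * complex_words / len(words))
```

Both indices reward short sentences and short words, which is exactly the property that, per Tables 7 and 8, aligns with the less successful class.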
Surely our findings only demonstrate correlation between readability and literary success, not to be confused with causation. We conjecture that the conceptual complexity of highly successful literary work might require a syntactic complexity that goes against readability.

Phrasal    +        −        DIFF      #+Gen/#−Gen
ADJP      0.030    0.031    -6E-4     5/3
INTJ      0.030    0.031    -6E-4     5/3
ADVP      0.052    0.054    -0.002    2/6
CONJP     3E-4     3E-4      2E-5     5/3
FRAG      0.008    0.008    -1E-4     2/6
LST       2E-4     1E-4      5E-5     6/2
NAC       9E-6     6E-6      3E-6     5/3
NP        0.459    0.453     0.005    6/2
NX        1E-4     1E-4     -4E-7     3/5
PP        0.122    0.117     0.005    7/1
PRN       0.005    0.004     2E-4     4/4
PRT       0.010    0.010    -5E-4     3/5
QP        0.001    0.001     7E-5     6/2
RRC       8E-5     8E-5      6E-6     6/2
UCP       8E-4     7E-4      1E-4     8/0
VP        0.292    0.300    -0.008    1/7
WHADJP    2E-4     2E-4     -5E-5     1/7
WHADVP    0        0         0        –
WHNP      0.013    0.012     0.001    8/0
WHPP      0.001    9E-4      1E-4     6/2
X         0.001    0.001    -4E-5     4/4

Clausal    +        −        DIFF      #+Gen/#−Gen
SBAR      0.166    0.164     0.002    4/4
SQ        0.020    0.018     0.002    7/1
SBARQ     0.014    0.013     0.001    7/1
SINV      0.018    0.018    -6E-5     5/3
S         0.781    0.785    -0.004    3/5

Table 8: Overall phrasal / clausal tag distribution and analysis. All values are rounded to 3-5 decimal places.

6 Literature beyond Project Gutenberg

One might wonder how prediction algorithms trained on the dataset based on Project Gutenberg might perform on books not included at Gutenberg. This section attempts to address that question. Due to the limited availability of electronic books that are free of charge, however, we could not procure more than a handful of books.8

8 We report our prediction results on all books beyond Project Gutenberg of which we managed to get electronic copies, i.e., the results in Table 9 are not cherry-picked.

6.1 Highly Successful Books

First, we apply the classifiers trained on the Project Gutenberg dataset (all genres merged) to a few extremely successful novels (Pulitzer Prize and National Award recipients, etc.). Table 9 shows the results of
two classification options: (1) KL-divergence based, and (2) unigram-feature based.

More Successful
BOOK                                             D_KL(more)  D_KL(less)  SU   SΓ
"Don Quixote" – Miguel De Cervantes               0.139       0.152       +    +
"Other Voices, Other Rooms" – Truman Capote       0.014       0.010*      +    +
"The Fixer" – Bernard Malamud                     0.013       0.015       +    +
"Robinson Crusoe" – Daniel Defoe                  0.042       0.051       +    +
"The Old Man and the Sea" – Ernest Hemingway      0.065       0.060*      +    +
"A Tale of Two Cities" – Charles Dickens          0.027       0.030       +    +
"Independence Day" – Richard Ford                 0.031       0.026*      +    +
"Rabbit At Rest" – John Updike                    0.047       0.048       +    +
"American Pastoral" – Philip Roth                 0.039       0.043       +    +
"Dr. Jekyll and Mr. Hyde" – Robert Stevenson      0.036       0.037       +    +

Less Successful
BOOK                                             D_KL(more)  D_KL(less)  SU   SΓ
"The Lost Symbol" – Dan Brown                     0.046       0.042       −    −
"The Magic Barrel" – Bernard Malamud              0.0288      0.0284      +*   −
"Two Soldiers" – William Faulkner                 0.130       0.117       −    +*
"My Life as a Man" – Philip Roth                  0.046*      0.052       −    +*

Table 9: Prediction on books beyond Gutenberg. Entries marked with * are incorrect predictions.

Although KL-divergence based prediction was not part of the classifiers that we explored in the previous sections, we include it here mainly to provide better insight into which well-known books share closer structural similarity to the more or less successful writing style. As the probability model, we use the distributions of phrasal tags, as these give us insight into deep syntactic structure while suppressing potential noise due to topical variance. Table 9 shows the symmetrised KL divergence between each of the previously unseen novels and the collection of books from Gutenberg corresponding to the more successful (less successful) label. For prediction, the label with the smaller KL divergence is chosen. Based only on the distribution of 26 phrasal tags, the KL-divergence classifier is able to make correct predictions on 7 out of 10 books, a surprisingly high performance based on a mere 26 features.
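The KL-divergence option can be sketched as follows (our own illustration under stated assumptions: epsilon smoothing for tags missing from one distribution, since the paper does not spell out its smoothing):

```python
import math

def sym_kl(p, q, eps=1e-9):
    """Symmetrised KL divergence between two {tag: probability} dicts,
    with epsilon smoothing for tags absent from either distribution
    (an assumed detail)."""
    tags = set(p) | set(q)

    def kl(a, b):
        return sum(a.get(t, eps) * math.log(a.get(t, eps) / b.get(t, eps))
                   for t in tags)

    return kl(p, q) + kl(q, p)

def predict(book, more_succ, less_succ):
    """Label the book by whichever class's phrasal-tag distribution
    is closer in symmetrised KL."""
    return "more" if sym_kl(book, more_succ) <= sym_kl(book, less_succ) else "less"
```

With only the 26 constituent-tag probabilities as input, this is the entire classifier; the two reference distributions are estimated from the more and less successful Gutenberg collections.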
Of course, considering only the distribution of phrasal tags is significantly less informed than considering the numerous other features that have shown substantially better performance, e.g., unigrams and CFG rewrite rules. Therefore, we also present the SVM classifier trained on unigram features. It turns out unigram features are powerful enough to make correct predictions for all ten books in Table 9.

Hemingway and Minimalism It is instructive to consider where and why the KL-divergence-based approach fails. In fact, when we included Hemingway's The Old Man and the Sea in the test set, we were expecting some level of confusion when relying only on high-level syntactic structure, as Hemingway's signature style is minimalism, with 70% of his sentences being simple sentences. Not surprisingly, more adequately informed classifiers, e.g., the SVM with unigram features, are still able to recognize Hemingway's writing as that of a highly successful author.

6.2 Less Successful Books

In order to obtain less successful books, we consider the Amazon seller's rank included in the product details of a book. The less successful books considered in Table 9 had an Amazon seller's rank beyond 200k (a higher rank indicating less commercial success), except Dan Brown's The Lost Symbol, which we included mainly because of the negative critiques it attracted from the media despite its commercial success. As shown in Table 9, all three classifiers make (arguably) correct predictions on Dan Brown's book.9 This result also supports our earlier assumption on the nature of the novels available at Project Gutenberg: that they would be more representative of literary success than of general popularity (with or without literary quality).

7 Predicting Success of Movie Scripts

We have seen successful results in the novel domain, but can stylometry-based prediction work on very different domains, such as screenplays?
Unlike novels, movie scripts are mostly dialogue, which is likely to be more informal. Also, one should keep in mind that much of the success of a movie depends on factors beyond the quality of the writing of the script, such as the quality of acting, the popularity of actors, budgets, the artistic taste of directors and producers, editing, and so forth.

9 The most notable pattern based on phrasal tag analysis is a significantly increased use of fragments (FRAG), which associates strongly with less successful books in our dataset.

FEATURE          Adven  Fanta  Roman  Thril
POS               62.0   58.0   61.7   56.0
Unigram           62.0   81.3   70.0   80.0
Bigrams           73.3   84.7   80.8   76.0
Γ                 66.0   81.3   70.0   76.0
ΓG                62.0   69.3   86.7   60.0
γ                 62.0   81.3   78.3   76.0
γG                69.3   77.3   77.5   68.0
Γ+Uni             62.0   85.3   70.0   76.0
ΓG+Uni            54.7   81.3   70.0   76.0
γ+Uni             58.0   89.3   70.0   76.0
γG+Uni            58.0   84.7   70.0   76.0
PHR               46.0   42.7   65.8   80.0
PHR+CLS           76.7   31.3   65.8   80.0
PHR+Uni           62.0   81.3   70.0   80.0
PHR+CLS+Uni       62.0   81.3   70.0   80.0

Table 10: Classification results on movie dialogue data (rating ≥ 8 vs. rating ≤ 5.5).

We use the Movie Script Dataset introduced in Danescu-Niculescu-Mizil and Lee (2011), which includes the dialogue scripts of 617 movies. The average rating of all movies is 6.87. We consider movies with an IMDb rating ≥ 8 as "more successful" and those with an IMDb rating ≤ 5.5 as "less successful". We combine all the dialogue of each movie and filter out movies with fewer than 200 sentences. There are 11 genres ("ADVENTURE", "FANTASY", "ROMANCE", "THRILLER", "ACTION", "COMEDY", "CRIME", "DRAMA", "HORROR", "MYSTERY", "SCI-FI") with 15 movies or more per class; we take 15 movies per class and perform classification with the same experimental setting as in Table 2. In Table 10, we show some of the example genres with relatively successful outcomes, reaching as high as 89.3% accuracy in the FANTASY genre. We note, however, that in many other genres the prediction did not work as well as it did in the novel domain.
We suspect that there are at least two reasons for this. The first is the very limited data size: only 15 instances per class with the rating threshold we selected for defining movie success. The second is the many other external factors that can also influence the success of movies, as discussed earlier.

8 Related Work

Predicting success of novels and movies: To the best of our knowledge, our work is the first to provide quantitative insights into the previously unstudied connection between writing style and the success of literary works. There has been some previous work aiming to gain insight into the secret recipe of successful books, but most of it was qualitative, based on only a dozen or so books, and focused mainly on high-level content, such as the personalities of protagonists and antagonists or the nature of plots (e.g., Harvey (1953), Yun (2011)). In contrast, our work examines a considerably larger collection of books (800 in total) over eight different sub-genres, providing insights into the lexical, syntactic, and discourse patterns that characterize the writing styles commonly shared among successful literature. Another relevant line of work is in a different domain, movies (Yun, 2011); however, that prediction is based only on external, non-textual information, such as the reputation of actors and directors and the power of distribution systems, without analyzing the actual content of the movie scripts.

Text quality and readability: Louis (2012) explored various features that measure the quality of text, which has some high-level connections to our work. Combining the insights of Louis (2012) with our results, we find that the characteristics of text quality explored there, readability of text in particular, do not correspond to the prominent writing style of highly successful literature. A number of other works have focused on predicting and measuring readability (e.g., Kate et al.
(2010), Pitler and Nenkova (2008), Schwarm and Ostendorf (2005), Heilman and Eskenazi (2006), and Collins-Thompson et al. (2004)), employing various linguistic features. There is an important difference, however, in the nature of the text selected for analysis: most studies of readability focus on differentiating good writing from noticeably bad writing, often involving machine-generated text or text written by ESL students. In contrast, our work essentially deals with differentiating good writing from even better writing. After all, all the books we analyzed were written by expert writers who passed the scrutinizing eyes of publishers, so it is reasonable to expect that the writing quality of even the less successful books is respectable.

Predicting success among academic papers: In the domain of academic papers, which belong to the broad genre of non-fiction, Sawyer et al. (2008) investigated the stylistic characteristics of award-winning papers in marketing journals and found that readability plays an important role. Combined with our study, which focuses on fiction and creative writing, this suggests that the recipe for successful publication can be very different depending on whether the work is fiction or nonfiction. The work of Bergsma et al. (2012) is also somewhat relevant to ours in that it included differentiating the writing styles of workshop papers from those of major conference papers, where the latter would generally be considered more successful.

9 Conclusion

We presented the first quantitative study that learns to predict the success of literary works based on their writing styles. Our empirical results demonstrated that statistical stylometry can be surprisingly effective in discriminating successful literature, achieving accuracy of up to 84% in the novel domain and 89% in the movie domain.
Furthermore, our study yielded several insights, including: the lexical and syntactic elements of successful styles, the connection between successful writing style and readability, the connection between sentiment/connotation and literary success, and, last but not least, comparative insights into the successful writing styles of fiction and nonfiction.

Acknowledgments

This research was supported in part by the Stony Brook University Office of the Vice President for Research, and in part by a gift from Google. We thank the anonymous reviewers, Steve Greenspan, and Mike Collins for helpful comments and suggestions, Alex Berg for the title, and Arun Nampally for helping with the preliminary work.

References

Omar Ali, Ilias N. Flaounas, Tijl De Bie, Nick Mosdell, Justin Lewis, and Nello Cristianini. 2010. Automating news content analysis: An application to gender bias and readability. Journal of Machine Learning Research - Proceedings Track, 11:36–43.

Shlomo Argamon, Moshe Koppel, Jonathan Fine, and Anat Rachel Shimoni. 2003. Gender, genre, and writing style in formal written texts. Text, 23(3):321–346.

Shane Bergsma, Matt Post, and David Yarowsky. 2012. Stylometric analysis of scientific articles. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 327–337. Association for Computational Linguistics.

Joshua E. Blumenstock. 2008. Automatically assessing the quality of Wikipedia articles.

Kevyn Collins-Thompson and James P. Callan. 2004. A language modeling approach to predicting reading difficulty. In HLT-NAACL, pages 193–200.

Cristian Danescu-Niculescu-Mizil and Lillian Lee. 2011. Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs. In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, ACL 2011.
Dan Douglas and Kathleen M. Broussard. 2000. Longman grammar of spoken and written English. TESOL Quarterly, 34(4):787–788.

Alvar Ellegård. 1962. A Statistical Method for Determining Authorship: The Junius Letters, 1769–1772, volume 13. Göteborg: Acta Universitatis Gothoburgensis.

Hugo J. Escalante, Thamar Solorio, and Manuel Montes-y-Gómez. 2011. Local histograms of character n-grams for authorship attribution. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, volume 1, pages 288–298.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research, 9:1871–1874.

Song Feng, Ritwik Banerjee, and Yejin Choi. 2012. Characterizing stylistic elements in syntactic structure. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1522–1533. Association for Computational Linguistics.

Song Feng, Jun Seok Kang, Polina Kuznetsova, and Yejin Choi. 2013. Connotation lexicon: A dash of sentiment beneath the surface meaning. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Sofia, Bulgaria, August. Association for Computational Linguistics.

Rudolph Flesch. 1948. A new readability yardstick. Journal of Applied Psychology, 32(3):221.

Robert Gunning. 1968. The Technique of Clear Writing. McGraw-Hill, New York.

James W. Hall. 2012. Hit Lit: Cracking the Code of the Twentieth Century's Biggest Bestsellers. Random House Digital, Inc.

John Harvey. 1953. The content characteristics of best-selling novels. Public Opinion Quarterly, 17(1):91–114.

Michael Heilman and Maxine Eskenazi. 2006. Language learning: Challenges for intelligent tutoring systems. In Proceedings of the Workshop on Intelligent Tutoring Systems for Ill-Defined Domains,
Eighth International Conference on Intelligent Tutoring Systems, pages 20–28.

Rohit J. Kate, Xiaoqiang Luo, Siddharth Patwardhan, Martin Franz, Radu Florian, Raymond J. Mooney, Salim Roukos, and Chris Welty. 2010. Learning to predict readability using diverse linguistic features. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 546–554. Association for Computational Linguistics.

Brett Kessler, Geoffrey Nunberg, and Hinrich Schütze. 1997. Automatic detection of text genre. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, pages 32–38. Association for Computational Linguistics.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Volume 1, pages 423–430. Association for Computational Linguistics.

Moshe Koppel and Jonathan Schler. 2003. Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, volume 69, page 72.

Annie Louis. 2012. Automatic metrics for genre-specific text quality. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop, page 54.

Jerome McGann. 1998. The Poetics of Sensibility: A Revolution in Literary Style. Oxford University Press.

Fuchun Peng, Dale Schuurmans, Shaojun Wang, and Vlado Keselj. 2003. Language independent authorship attribution using character level language models. In Proceedings of the Tenth Conference of the European Chapter of the Association for Computational Linguistics, Volume 1, pages 267–274. Association for Computational Linguistics.

Emily Pitler and Ani Nenkova. 2008.
Revisiting readability: A unified framework for predicting text quality. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '08, pages 186–195, Stroudsburg, PA, USA. Association for Computational Linguistics.

Sindhu Raghavan, Adriana Kovashka, and Raymond Mooney. 2010. Authorship attribution using probabilistic context-free grammars. In Proceedings of the ACL 2010 Conference Short Papers, pages 38–42. Association for Computational Linguistics.

Paul Rayson, Andrew Wilson, and Geoffrey Leech. 2001. Grammatical word class variation within the British National Corpus sampler. Language and Computers, 36(1):295–306.

Ruchita Sarawgi, Kailash Gajulapalli, and Yejin Choi. 2011. Gender attribution: Tracing stylometric evidence beyond topic and genre. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pages 78–86. Association for Computational Linguistics.

Alan G. Sawyer, Juliano Laran, and Jun Xu. 2008. The readability of marketing journals: Are award-winning articles better written? Journal of Marketing, 72(1):108–117.

Sarah E. Schwarm and Mari Ostendorf. 2005. Reading level assessment using support vector machines and statistical language models. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, ACL '05, pages 523–530, Stroudsburg, PA, USA. Association for Computational Linguistics.

Arlene E. Sierra, Mark A. Bisesi, Terry L. Rosenbaum, and E. James Potchen. 1992. Readability of the radiologic report. Investigative Radiology, 27(3):236–239.

Efstathios Stamatatos. 2009. A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.

Kristina Toutanova and Christopher D. Manning. 2000. Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In EMNLP/VLC 2000, pages 63–70.

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005.
Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 347–354. Association for Computational Linguistics.

Sze-Meng Jojo Wong and Mark Dras. 2011. Exploiting parse structures for native language identification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1600–1610. Association for Computational Linguistics.

Chang-Joo Yun. 2011. Performance evaluation of intelligent prediction models on the popularity of motion pictures. In Proceedings of the 4th International Conference on Interaction Sciences (ICIS 2011), pages 118–123. IEEE.