Distinguishing Poets Based on Writing Style

1. Introduction

Consider an anonymous document, or a document under suspicion of plagiarism. Examples include letters submitted as forensic evidence and unsigned historical documents. We wanted to train an AI to correctly identify the author of such a document, given a list of possible authors and their known writing samples. For this project, we chose to focus on poets and to use style as the primary differentiator. In the case of poetry, style is often a more significant differentiator than meaning, because poems can be semantically difficult to interpret for people and AI alike.

But what, exactly, about style makes a poet distinctive? Human readers of poetry can often recognize an author's style even when reading a new poem. Can a machine do the same? Can a machine do better?

An AI capable of recognizing authors' writing styles has both practical and scholarly potential. Identifying the author of a text could lend insight to forensic and historical cases and catch instances of plagiarism. But it could also lend insight into the defining aspects of great authors' work. Not only could it find evidence for whether a particular play was actually written by Shakespeare, it could provide numerical insight into what makes Shakespeare "sound like" Shakespeare.

2. Task Definition

For this project, we considered two tasks for distinguishing poets: clustering and multi-class classification.

Our clustering algorithm takes as input the complete works of a set of poets, as well as a feature extractor to apply to those works. As output, it produces a clustering of the poems. To evaluate this task, we consider whether poems by the same poet are clustered together, and the purity of each cluster with respect to the poets included in it. We also take into consideration whether poems by different authors but from the same school of poetry are clustered together.

The multi-class classifier trains on a sampling of works from a set of poets, labeled with the correct authors. Using a provided feature extractor, it learns weights for each feature for each poet. Once trained, it takes as input a new poem and predicts the author based on the new poem's features and the learned feature weights. To evaluate success, we use the zero-one loss function and compare against the results that would be obtained by predicting a random author.

3. Related Work

3.1 At Stanford

http://nlp.stanford.edu/courses/cs224n/2010/reports/rof-karalevy-kmontag.pdf

The goal of The Metric System was to take prose and generate semantically similar, English-language text that follows a specified poetic meter. The primary challenge of the project was to extract meaning from prose and maintain semantic consistency; adhering to a specific meter is a more easily quantifiable task than extracting and reformulating semantics. The project's goal of "exploring quantitative approaches to processing and evaluating poetic verse" aligned with our own preliminary goals for feature extraction. The primary difference is that we used these features to identify poems, where they used them to generate poems. Because their work was generative, they placed a greater emphasis on semantic features.
We modeled our feature extractor for metrical distance on the metrical distance function that they used to evaluate the meter of their generated poems; the function derived the meter of a line from syllabic stresses and calculated the edit distance from the line's scansion to an archetypal metrical pattern, such as iambic pentameter.

http://web.stanford.edu/~jurafsky/kaojurafsky12.pdf

The goal of this project was to extract features from the works of award-winning poets and to identify the features that mark these works as high-quality. We shared the goal of quantitatively evaluating style, although our ultimate goal was classification rather than aesthetic evaluation. Our feature extractor examines many of the phonemic features used in this project to evaluate a poem's beauty.

3.2 In the News

http://bits.blogs.nytimes.com/2012/01/03/software-helps-identify-anonymous-writers-or-helps-them-stay-that-way/?_php=true&_type=blogs&_r=0

In 2012, graduate students at Drexel University released two programs. One helped to identify the author of an anonymous 500-word prose writing sample from a pool of fewer than 50 suspects, each with roughly 6,500 words of known writing samples; the other piece of software altered writing samples in order to better mask their authors. Our model mirrors that of the Drexel project in that we assume our possible poets come from a limited pool, and we focused on prolific writers, who could provide the greatest number of known writing samples.

http://latimesblogs.latimes.com/jacketcopy/2010/05/shakespeares-style.html

Stylometry, the study of linguistic style, has applications ranging from forensic linguistics to determining the authorship of famous texts such as Shakespeare's works and the Federalist Papers. In this article, a lab at Claremont McKenna College applied computer-driven stylometrics to Shakespeare's works to gain insight into the authorship of works attributed to him, asking whether Shakespeare's style evolved or whether another author stepped in for his later works.

4. Approach

4.1 Feature Extraction

Each feature of our feature extractor focused on a stylistic element of poetry. We developed our features in three categories: diction, poetic devices, and quantifiable elements of style. For diction, we used unigram and bigram features, with and without punctuation. The poetic devices that we looked at were simile, alliteration, assonance, rhyme, caesura, enjambment, and metrical pattern. The quantifiable elements that we measured were average word length, average sentence length, and the average length of a work. We focused on elements that were common enough among our poets to be comparatively meaningful, but not so universal that they failed to differentiate between poets. We modeled our feature vectors as sparse vectors.

4.2 K-Means Clustering

We applied K-Means clustering as an unsupervised learning technique to gauge the efficacy of our features in differentiating between poets. Categorizing poets manually is a costly process, which makes this a suitable unsupervised learning task. Our goal at this stage was to develop our feature extractor such that each cluster roughly represented a school of poetry. The centroid of each cluster would then represent the stylistic ideal or average of that particular school. Schools of poetry are generally categorized by a common style or a common ethos.
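As a concrete illustration of this setup, the sketch below clusters poems represented as sparse feature dictionaries. It is a minimal sketch rather than our actual pipeline: it assumes scikit-learn's DictVectorizer and KMeans, and the function names (cluster_poems, extract_features) are illustrative only.

```python
# Minimal sketch: K-Means over sparse feature vectors built from feature dictionaries.
# Assumes scikit-learn is available; not the project's exact implementation.
from sklearn.feature_extraction import DictVectorizer
from sklearn.cluster import KMeans

def cluster_poems(poems, extract_features, k):
    """poems: list of poem texts; extract_features: poem -> {feature_name: value} dict."""
    feature_dicts = [extract_features(poem) for poem in poems]
    vectorizer = DictVectorizer(sparse=True)       # sparse feature vectors
    X = vectorizer.fit_transform(feature_dicts)
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    return model.fit_predict(X)                    # cluster index for each poem
```

The resulting cluster assignments can then be compared against the poets (or their schools of poetry) to judge how well a given set of features separates them.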
Given that our feature extractor focused on syntactic stylistic elements, rather than semantics, we hoped to achieve some success categorizing schools of poetry that are characterized by common stylistic elements.

4.3 Multi-Class Classification

We implemented a multi-class classifier in which each class represented a poet in our data set. We experimented with feature vectors composed of different combinations of features, on different training and test sets, to minimize error. We looked for a combination of features that were common enough across the poets in our data set to be nontrivial, but used differently enough to be meaningful differentiators.

5. Results and Error Analysis

5.1 K-Means Clustering Results

To start, we performed K-Means clustering using the unigram features, as well as some numerical features (poem length, average sentence length, and average word length) and a sampling of literary device features measuring the presence of rhyme, enjambment, simile, and alliteration. We looked at the results of clustering poems by William Shakespeare (1564-1616), Emily Dickinson (1830-1886), Anne Sexton (1928-1974), and Sylvia Plath (1932-1963). Emily Dickinson is considered a predecessor of the Modernist movement,[1] while Anne Sexton and Sylvia Plath are both leading poets of the Confessional movement.[2] Shakespeare was a far more prolific poet than Sylvia Plath (with the others falling somewhere in between), so to standardize the number of poems in our K-Means input, we chose 10 poems at random from each author.

[1] http://www.egs.edu/library/emily-dickinson/biography/
[2] http://www.poetryarchive.org/poet/anne-sexton

The results of K-Means differed depending on the particular poems chosen, in addition to the variation that results from the algorithm's different starting centroids in each run. Some of the clusters were not unreasonable, however. On some runs of K-Means, most of Anne Sexton's poems and all of Sylvia Plath's poems were grouped together into one cluster containing only a single other poem by Shakespeare and Dickinson. This is a good result, given that Sylvia Plath and Anne Sexton were both Confessional poets, a movement that Shakespeare and Dickinson were not part of. In this same run, many of Shakespeare's and Dickinson's poems clustered together in the two other clusters. Given that Shakespeare and Dickinson wrote at very different times, and with styles that do not appear particularly similar to a human reader, this result is less encouraging. However, Shakespeare was one of Emily Dickinson's great influences, and she referenced and responded to him in much of her work.[3] Considering that our features in this run focused mainly on word choices, and that literary references often rely on particular images that are detectable in word choices, this was not altogether discouraging. The resulting clusters can be seen below.

On the other hand, other runs of the K-Means algorithm were far less insightful. Often, the clusters generated by K-Means varied greatly in size, with most of the poems grouping together in one cluster and the rest forming multiple smaller groups. In a set of runs involving the complete works of Shakespeare, Emily Dickinson, and Robert Frost, one particular Shakespeare poem, "Venus and Adonis," formed a cluster by itself in nearly every trial. Because of this outlier behavior, we removed that particular poem from our standard Shakespeare data set. But the "big cluster" pattern is one that we often still saw.
Below are the results of a "big cluster" grouping of a different set of Emily Dickinson, Shakespeare, Anne Sexton, and Sylvia Plath poems. In this case, no particular clustering by poetry movement is apparent at all.

[3] http://muse.jhu.edu/books/9781613760901

This mixing of poets became even more apparent when we ran the algorithm on the works of ten or twelve poets, rather than three or four, as shown below.

Using different feature extractors on the poems before clustering sometimes produced better results. An example of this can be seen with Emily Dickinson and William Shakespeare. As we found in our earlier experiments, even in the best results obtained using unigrams, poem length, average sentence length, average word length, rhymes, enjambment, simile, and alliteration, poems by Shakespeare and Emily Dickinson tended to cluster together. But clustering these same two authors using a measure of meter alone produced better clusters.

When we performed K-Means using unigrams and the other original features, we obtained clusters of sizes 118 and 22 (in actuality, there were 81 Shakespeare poems and 59 Dickinson poems to consider). These clusters had a purity of 0.6786, where purity is measured by assigning each cluster to its majority poet and computing the fraction of poems that would be correctly classified; formally:[4]

    purity(Ω, C) = (1/N) Σ_k max_j |ω_k ∩ c_j|

where Ω = {ω_1, ω_2, ..., ω_K} is the set of clusters, C = {c_1, c_2, ..., c_J} is the set of classes, and N is the total number of poems.

[4] This evaluation of the purity of clustering (and the formula) comes from http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html

Meanwhile, clustering with only the iambic pentameter meter level feature (a measure of the average deviance of the poem's lines from iambic pentameter) obtained clusters of sizes 75 and 65, with a purity of 0.8000. Clustering with the iambic meter level feature (a generalization of the iambic pentameter measure that calculates the average deviance of lines from iambic meter of any length) produced clusters of sizes 97 and 43, with a purity of 0.8714. Thus we see that both metrical feature extractors, when used alone in the K-Means run, produced purer and more evenly sized clusters.

5.2 Multi-Class Classification Results

With our multi-class classifier, we considered poems by 10 poets of varied times and schools of poetry: Elizabeth Bishop, E.E. Cummings, Emily Dickinson, Robert Frost, Allen Ginsberg, Philip Larkin, Robert Lowell, Sylvia Plath, William Shakespeare, and Percy Bysshe Shelley. We divided the poems by all of the authors into training and test sets, placing a random selection of ¾ of each author's works in the training set and the remaining ¼ in the test set. We then calculated the training and test error, and tuned our parameters and features to minimize the test error. Given that our classifier chose one poet out of ten potential outputs, simply predicting a random poet could be expected to have an error rate of about 90%. We used this error rate as a baseline against which to evaluate our actual predictors.

To assess our features, we tested them in isolation first, then in combinations. The best single features were the unigram and bigram word features, which simply extracted all of the unigrams or bigrams from the poems and put their frequency counts into the feature vector. Interestingly, we found that unigrams performed substantially better (about 0.47 test error instead of 0.57) when punctuation was stripped from the words, while bigrams performed slightly better when punctuation was preserved.
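For reference, a simplified version of these word-level extractors is sketched below. The tokenization details (whitespace splitting and the punctuation-stripping regular expression) are assumptions for illustration, not our exact implementation.

```python
# Simplified unigram/bigram extractors with optional punctuation stripping.
import re
from collections import Counter

def tokenize(text, strip_punctuation=True):
    """Whitespace tokenization, optionally stripping punctuation from each token."""
    tokens = text.split()
    if strip_punctuation:
        tokens = [re.sub(r"[^\w']+", "", t) for t in tokens]
        tokens = [t for t in tokens if t]
    return tokens

def unigram_features(poem, strip_punctuation=True):
    """Word frequency counts (unigrams did better with punctuation stripped)."""
    return Counter(tokenize(poem, strip_punctuation))

def bigram_features(poem, strip_punctuation=False):
    """Adjacent word-pair frequency counts (bigrams did slightly better keeping punctuation)."""
    tokens = tokenize(poem, strip_punctuation)
    return Counter(zip(tokens, tokens[1:]))
```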
While at first surprising, this result could have to do with the fact that bigrams already capture more contextual information about the text, while unigrams focus on the counts of the words themselves. Including punctuation augments the contextual information of bigrams, but throws off the frequency counts of some of the unigrams.

In both cases, standardizing the case of the letters hurt the accuracy slightly. Case could be expected, like punctuation, to provide a bit of contextual information (about whether a word comes first in a sentence, for example). But there is another aspect to its predictive power as well, since certain poets broke the conventions of capitalization. Emily Dickinson, for example, would capitalize important non-proper nouns in her work, while E. E. Cummings often wrote completely in lowercase.

Quantitative Features

We next considered features looking at some of the quantifiable aspects of style: the length of the poems, and the average lengths of the sentences and words used. We tried these features first alone, then in combination with unigrams and bigrams. The results of the single-feature tests and the tests with unigrams are shown below.

Alone, the feature considering average word length had a decent amount of predictive power, and poem length had a very slight amount of predictive power. But when combined with unigrams, the classifier actually had a slightly higher error rate than when it used the unigrams alone. At least in the case of average word length, this was almost certainly the result of the two features being correlated.

Even worse was the average sentence length feature, which actually resulted in a classifier that performed worse than the random baseline. Even when combined with unigrams, the classifier was comparable to, and even slightly worse than, random. While we designed the average sentence length feature on the assumption that certain authors use longer sentences than others, this is perhaps a pattern more commonly seen in prose than in poetry. Furthermore, many prose and poetic authors alike vary their sentence lengths greatly to make different points. It is likely that the average sentence length feature found patterns in the training data that occurred only by chance, given the particular poems in the training set.

Poetic (Literary) Device Features

As with the quantitative features, the poetic device features performed poorly when tested alone. But when combined with unigrams, a couple of them actually improved on the accuracy of the unigrams-only predictor. In particular, the features that considered the frequency of rhyming words, or extracted a particular rhyme scheme from the poem, combined well with unigrams. Alone, the classifier with the rhyming words feature extractor had a test error rate of 0.9178, but in combination with unigrams the test error rate was 0.41096, better than the 0.4657 test error rate of the classifier using the unigrams extractor alone. Similarly, the classifier with only the rhyme scheme feature extractor had a test error rate of 0.9041, while the classifier using both rhyme scheme and unigrams had the same improved test error rate of 0.41096.

The best of all the poetic device features, however, were the ones considering the meter of the poem's lines. Results using our two best metrical features are shown above. The meter pattern feature extractor parses the meter of each line, producing a sequence of 1's for stressed syllables and 0's for unstressed syllables.
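A rough sketch of this scansion step appears below. It assumes stress information from the CMU Pronouncing Dictionary via the pronouncing package, takes the first listed pronunciation of each word, and simply skips out-of-vocabulary words; the edit-distance helper is a standard Levenshtein implementation rather than our exact code.

```python
# Approximate scansion of a line and its edit distance from iambic pentameter.
import pronouncing  # CMU Pronouncing Dictionary lookup

def line_stress_pattern(line):
    """Return a line's scansion as a string of 0s (unstressed) and 1s (stressed)."""
    pattern = []
    for word in line.split():
        phones = pronouncing.phones_for_word(word.strip(".,;:!?\"'-").lower())
        if not phones:
            continue  # skip words missing from the dictionary in this simplified sketch
        # stresses() gives one digit (0, 1, or 2) per syllable; treat secondary stress as stressed.
        pattern.extend("1" if s in "12" else "0" for s in pronouncing.stresses(phones[0]))
    return "".join(pattern)

def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def iambic_pentameter_distance(line):
    """Edit distance from a line's scansion to the iambic pentameter archetype (0101010101)."""
    return edit_distance(line_stress_pattern(line), "01" * 5)
```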
The feature vector stores the counts of each metrical pattern (coded in 0's and 1's) seen for that author. Likewise, the iambic meter level feature extractor parses the meter of each line, but instead of storing the metrical patterns themselves, it produces an average edit distance of the poem's meter from perfect iambic meter of the same number of syllables. (A third feature, not shown, measures the average distance from iambic pentameter in particular, and performed slightly less well.)

Alone, the meter pattern extractor actually produced a classifier with a test error rate of 0.5616, slightly better than the classifier using unigrams with punctuation. The iambic meter level, while less impressive, was also quite predictive, particularly considering that this feature produces only a single value in the feature vector. When the meter pattern feature was combined with unigrams (no punctuation), we achieved a test error rate of 0.3562, the best throughout all of our experiments. Using the iambic meter level feature along with unigrams gave a test error rate of 0.3973, still a substantial improvement on the unigrams alone. Combining both metrical features with unigrams caused the test error to jump to 0.5342, most likely because the two metrical features are somewhat correlated.

We also experimented with some transformations of the iambic meter level. Since the feature measures a numerical average edit distance from iambic meter, we considered the possibility that the relationship might not be linear. We experimented with using the squared average distance instead, or 1 divided by the distance. However, neither of these transformations improved the test error. Also not particularly helpful was combining more than two of our features. Though we tried many combinations of our better-performing features, nothing quite matched the accuracy of using the unigram and meter pattern features together. It seems that sometimes a simpler model really is best.

6. Conclusion

Our results provide interesting quantitative insights into the factors that contribute to what we perceive as poetic style: the efficacy of the various features serves as a measure of how much varying each element affects style. Our results revealed that each poetic device was, on its own, unable to meaningfully differentiate between poets. A possible reason for this result is that many different kinds of poets use the same poetic devices to different effects; in other words, poetic devices measure minute choices in a poem, and these choices become significant only in the greater context of the poem. It may be more meaningful to measure the effect to which each poet puts a device, rather than the raw frequency of the device itself. While it may not be particularly meaningful to note that one poet uses caesura 5 times in a poem while another uses caesura 2 times, it would be more useful to note that one poet uses caesura to build suspense while another uses it to break a monotonous rhythm and highlight a particular line.

The observation that poetic devices become more significant when placed in the greater context of the work (in the caesura example, through semantics) was illustrated by the relative success of rhyme and metrical pattern in differentiating between poets, as compared to the other poetic devices. Both rhyme and metrical pattern provide a more comprehensive view of a poem because they are comparative devices that look at how words relate to other words, and lines to other lines.
For example, looking at rhyme scheme was more productive than looking at either rhyming words or pairs of rhyming lines, because it captures a more comprehensive picture of the poem.

The most successful quantitative feature was word length, followed by poem length and then sentence length. The success of word length makes sense when one considers the success of metrical patterns as a differentiator: at a higher level, word length and metrical patterns play a similar role in determining the less easily measured flow of a poem. Sentence length was a less useful metric because varying sentence length corresponds to attempts to vary meter and syntactic complexity. Poem length was a less significant differentiator because poets generally do not stick to a single poetic form, writing only sonnets or only epics, for example.

The most significant features looked at diction. Unigram and bigram features were the most powerful features that we implemented. Even without the additional information of syntax and semantics, diction helps point to the author of a work, because certain words are more prevalent in different time periods and because word use can also reflect semantic and syntactic complexity. Looking at diction also reveals when poets tend to focus on specific themes, because they will use the vocabulary associated with those themes at high frequency.

Style, especially poetic style, is generally considered an abstract concept. It has as much to do with the feeling that a poet's works tend to evoke as it does with the mechanics of their poetry. In the course of this project, our feature experimentation brought us closer to understanding what contributes to that poetic signature.