
Distinguishing Poets Based on Writing Style
1. Introduction
Consider any anonymous document or document under suspicion of plagiarism. Examples
include letters in forensic evidence and unsigned historical documents. We wanted to train an
AI to correctly identify the author of such a document based on a list of possible authors and
their known writing samples.
For this project, we chose to focus on poets and use style as the primary differentiator. In the
case of poetry, style is often a more significant differentiator than meaning because poems can
be semantically difficult to interpret for people and AI alike. But what, exactly, about style
makes a poet distinctive? Human readers of poetry can often recognize an author’s style even
when reading a new poem. Can a machine do the same? Can a machine do better?
An AI capable of recognizing authors’ writing styles has both practical and scholarly potential.
Identifying the author of a text could lend insight to forensic and historical cases, and catch
instances of plagiarism. But it could also lend insight into the defining aspects of great authors’
work. Not only could it find evidence for whether a particular play was actually written by
Shakespeare, it could provide numerical insight into what makes Shakespeare “sound like”
Shakespeare.
2. Task Definition
For this project, we considered two tasks of distinguishing poets: clustering and multi-class
classification.
Our clustering algorithm takes as input the complete works of a set of poets, as well as a feature
extractor to apply to those works. As output, it produces a clustering of the poems. To evaluate
this task, we consider whether poems by the same poet are clustered together, and the purity of
each cluster with respect to the poets included in it. We also take into consideration whether
poems by different authors but from the same school of poetry are clustered together.
The multi-class classifier trains on a sampling of works from a set of poets, labeled with their correct authors. Using a provided feature extractor, it learns weights for each feature for each
poet. Once trained, it takes as input a new poem and predicts the author based on the new
poem’s features and the learned feature weights. To evaluate the success, we use the zero-one
loss function, and compare to the results that would be obtained by predicting a random author.
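Concretely, for a test set of $n$ poems with true authors $y_i$ and predicted authors $\hat{y}_i$, the zero-one loss is

$$\text{Loss}_{0\text{-}1} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[\hat{y}_i \neq y_i],$$

and a baseline that guesses uniformly at random among $K$ candidate poets has an expected loss of $1 - 1/K$.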
3. Related Work
3.1 At Stanford
http://nlp.stanford.edu/courses/cs224n/2010/reports/rof-karalevy-kmontag.pdf
The goal of The Metric System was to take prose and generate semantically similar, English-language text that follows a specified poetic meter. The primary challenge of the project was to
extract meaning from prose and maintain semantic consistency; adhering to a specific meter is a
more easily quantifiable task than extracting and reformulating semantics. The goal for this
project of “exploring quantitative approaches to processing and evaluating poetic verse” aligned
with our own preliminary goals for feature extraction. The primary difference is that we used
these features to identify poems, where they used them to generate poems. Because their work
was generative, they placed a greater emphasis on semantic features. We modeled our metrical distance feature extractor on the metrical distance function they used to evaluate the meter of the generated poems; that function derives a line’s meter from syllabic stresses and calculates the edit distance from the line’s scansion to an archetypal metrical pattern, such as iambic pentameter.
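The details of that function are not reproduced here, but the idea is simple; the sketch below assumes each line’s scansion is already available as a string of stress marks ('0' for unstressed, '1' for stressed) and computes a standard Levenshtein edit distance to an archetypal pattern such as iambic pentameter.

```python
def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein edit distance between two stress strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Archetypal iambic pentameter: five unstressed/stressed pairs.
IAMBIC_PENTAMETER = "01" * 5

def metrical_distance(scansion: str) -> int:
    """Distance from a line's scansion (e.g. '0101010101') to iambic pentameter."""
    return edit_distance(scansion, IAMBIC_PENTAMETER)
```

A perfectly iambic ten-syllable line scores 0; lines that add, drop, or invert stresses accumulate edits.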
http://web.stanford.edu/~jurafsky/kaojurafsky12.pdf
The goal of this project was to extract features from award-winning poets and identify those
features that qualified these works as high-quality. We had the common goal of quantitatively
evaluating style, although our ultimate goal was classification, instead of aesthetic evaluation.
Our feature extractor examines many of the phonemic features used in this project to evaluate a
poem’s beauty.
3.2 In the News
http://bits.blogs.nytimes.com/2012/01/03/software-helps-identify-anonymous-writers-or-helps-them-stay-that-way/?_php=true&_type=blogs&_r=0
In 2012, graduate students at Drexel University released two programs. One helped identify the author of an anonymous 500-word prose writing sample from a pool of fewer than 50 suspects, each with roughly 6,500 words of known writing samples; the other piece of software altered
writing samples in order to better mask their authors. Our model mirrors that of the Drexel
project in that we assume that our possible poets are in a limited pool and we focused on prolific
writers, who could provide the greatest number of known writing samples.
http://latimesblogs.latimes.com/jacketcopy/2010/05/shakespeares-style.html
Stylometry, the study of linguistic style, has applications ranging from
forensic linguistics to authorship of famous texts such as Shakespeare’s works and the Federalist
Papers. In this article, a lab at Claremont McKenna College applied computer-driven stylometrics to works attributed to Shakespeare to gain insight into their authorship, asking whether Shakespeare’s style evolved over time or another author stepped in for his later works.
4. Approach
4.1 Feature Extraction
Each feature of our feature extractor focused on a stylistic element of poetry. We developed our
features in three categories: diction, poetic devices, and quantifiable elements of style. To look
at diction, we looked at unigram and bigram features, with and without punctuation. The poetic
devices that we looked at were simile, alliteration, assonance, rhyme, caesura, enjambment, and
metrical pattern. The quantifiable elements that we measured were average word length,
average sentence length, and the average length of a work. We focused specifically on elements that were common enough among our poets to be comparatively meaningful, but not so universal that they failed to differentiate between poets. We modeled our feature vectors as sparse vectors.
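As an illustration of this sparse representation, a stripped-down extractor covering the diction and quantitative features might look like the following (the tokenization and feature names here are illustrative rather than our exact definitions):

```python
import re
from collections import Counter

def extract_features(poem: str) -> dict:
    """Sparse feature vector (feature name -> value) for a single poem.

    Covers unigram counts and the quantitative features; the full extractor
    also included bigrams and the poetic device features.
    """
    words = re.findall(r"[\w']+", poem.lower())
    sentences = [s for s in re.split(r"[.!?]+", poem) if s.strip()]

    features = Counter(f"unigram:{w}" for w in words)
    features["avg_word_length"] = sum(map(len, words)) / max(len(words), 1)
    features["avg_sentence_length"] = len(words) / max(len(sentences), 1)
    features["poem_length"] = len(words)
    return dict(features)
```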
4.2 K-Means Clustering
We applied K-means clustering as an unsupervised learning technique to gauge the efficacy of
our features in differentiating between poets. Because manually categorizing poets is a costly process, the problem is well suited to unsupervised learning. Our goal at this stage was to develop our feature
extractor such that each cluster roughly represented a school of poetry. The centroid of each
cluster would represent the stylistic ideal or average of that particular school.
Schools of poetry are generally categorized by a common style or a common ethos. Given that
our feature extractor focused on syntactic stylistic elements, rather than semantics, we hoped to
achieve some success categorizing schools of poetry that are characterized by common stylistic
elements.
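A minimal sketch of this clustering setup, assuming scikit-learn and a feature extractor like the one sketched above (our actual pipeline may have differed in its details):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.cluster import KMeans

def cluster_poems(poems, extract_features, k):
    """Cluster a list of poem strings into k groups based on their sparse features."""
    vectorizer = DictVectorizer()
    X = vectorizer.fit_transform([extract_features(p) for p in poems])  # sparse matrix
    return KMeans(n_clusters=k, n_init=10).fit_predict(X)               # cluster label per poem
```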
4.3 Multi-Class Classification
We implemented a multi-class classifier in which each class represented a poet in our data set.
We experimented with feature vectors composed of different combinations of features on
different testing and training sets to minimize error. We looked for a combination of features that were common enough across the poets in our data set to be nontrivial, but employed differently enough by each poet to be meaningful differentiators.
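The description above (one weight per feature per poet, with the prediction going to the highest-scoring poet) can be realized in several ways; the sketch below uses a simple multiclass perceptron over sparse feature dictionaries, which should be read as one possible implementation rather than our exact training procedure.

```python
from collections import defaultdict

def score(weights, features):
    """Dot product between a sparse weight vector and a sparse feature vector."""
    return sum(weights.get(f, 0.0) * v for f, v in features.items())

def train_classifier(examples, poets, epochs=10):
    """Multiclass perceptron: examples is a list of (feature_dict, poet) pairs.

    Returns a {poet: {feature: weight}} mapping.
    """
    weights = {p: defaultdict(float) for p in poets}
    for _ in range(epochs):
        for features, poet in examples:
            predicted = max(poets, key=lambda p: score(weights[p], features))
            if predicted != poet:
                for f, v in features.items():
                    weights[poet][f] += v       # promote the true poet
                    weights[predicted][f] -= v  # demote the mistaken prediction
    return weights

def predict_poet(weights, features):
    """Predict the poet whose weight vector scores the new poem highest."""
    return max(weights, key=lambda p: score(weights[p], features))
```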
5. Results and Error Analysis
5.1 K-Means Clustering Results
To start, we performed K-Means clustering using the unigram features, as well as some
numerical features--poem length, average sentence length, average word length--and a sampling
of literary device features measuring the presence of rhymes, enjambment, simile, and
alliteration. We looked at the results of clustering poems by William Shakespeare (1564-1616),
Emily Dickinson (1830-1886), Anne Sexton (1928-1974), and Sylvia Plath (1932-1963). Emily
Dickinson is considered a predecessor of the Modernist movement,1 while Anne Sexton and
Sylvia Plath are both leading poets of the Confessional Movement.2 Shakespeare was a far more
prolific poet than Sylvia Plath (with the others falling somewhere in between), so to standardize
the number of poems in our K-Means input, we chose 10 poems at random from each author.
The results of K-Means differed depending on the particular poems chosen in addition to the
variation that results from the algorithm’s different starting centroids in each run. Some of the
clusters were not unreasonable, however. On some runs of K-Means, most of Anne Sexton’s
poems and all of Sylvia Plath’s poems were grouped together into one cluster containing only a
single other poem by Shakespeare and Dickinson. This is a good result given that Sylvia Plath
and Anne Sexton were both Confessional poets, a movement that Shakespeare and Dickinson
were not part of. In this same run, many of Shakespeare and Dickinson’s poems clustered
together in the two other clusters. Given that Shakespeare and Dickinson wrote at very different
times, and with styles that do not appear particularly similar to a human reader, this result is
less encouraging. However, Shakespeare was one of Emily Dickinson’s great influences, and she referenced and responded to him in much of her work.3 Considering that our features in this run focused mainly on word choices, and that literary references often rely on particular images which are detectable in word choices, this was not altogether discouraging.

1. http://www.egs.edu/library/emily-dickinson/biography/
2. http://www.poetryarchive.org/poet/anne-sexton
The resulting clusters can be seen below:
On the other hand, other runs of the K-means algorithm were far less insightful. Often, the
clusters generated by K-means varied greatly in size, with most of the poems grouping together
in one cluster, and the rest forming multiple smaller groups. In a set of runs involving the
complete works of Shakespeare, Emily Dickinson, and Robert Frost, one particular Shakespeare
poem, “Venus and Adonis,” formed a cluster by itself in nearly every trial. Because of this outlier
behavior, we removed that particular poem from our standard Shakespeare data set. But the
“big cluster” pattern is one that we often still see.
Below are the results of a “big cluster” grouping of a different set of Emily Dickinson,
Shakespeare, Anne Sexton, and Sylvia Plath poems. In this case, no particular clustering by
poetry movement is apparent at all.
3. http://muse.jhu.edu/books/9781613760901
This mixing of poets became even more apparent when we ran the algorithm on the works of ten
or twelve poets, rather than three or four, as shown below.
Using different feature extractors on the poems before clustering sometimes produced better
results. An example of this can be seen with Emily Dickinson and William Shakespeare. As we
found in our earlier experiments, even in the best results obtained using unigrams, poem length,
average sentence length, average word length, rhymes, enjambment, simile, and alliteration,
poems by Shakespeare and Emily Dickinson tended to cluster together.
But clustering these same two authors using a measure of meter alone produced better clusters.
When we performed K-Means using unigrams and the other original features, we obtained
clusters of sizes 118 and 22 (in actuality, there were 81 Shakespeare poems and 59 Dickinson
poems to consider). These clusters had a purity of 0.6786 where purity is measured by assigning
each cluster to the majority poet, and computing the fraction of poems that would be correctly
classified, formally:4

$$\mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} \left| \omega_k \cap c_j \right|$$

where $\Omega = \{\omega_1, \omega_2, \ldots, \omega_K\}$ is the set of clusters, $C = \{c_1, c_2, \ldots, c_J\}$ is the set of classes, and $N$ is the total number of poems.
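The same computation as a small sketch in code, assuming the cluster assignments and the true poet labels are parallel lists:

```python
from collections import Counter

def purity(cluster_labels, true_poets):
    """Fraction of poems classified correctly if each cluster is assigned its majority poet."""
    clusters = {}
    for cluster, poet in zip(cluster_labels, true_poets):
        clusters.setdefault(cluster, []).append(poet)
    correct = sum(Counter(poets).most_common(1)[0][1] for poets in clusters.values())
    return correct / len(true_poets)
```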
Meanwhile, clustering with only the iambic pentameter meter level feature (a measure of the average deviation of the poem’s lines from iambic pentameter) obtained clusters of size 75 and 65, with a purity of 0.8000. Clustering with the iambic meter level feature (a generalization of the iambic pentameter measure that calculates the average deviation of lines from iambic meter of any length) produced clusters of size 97 and 43, with a purity of 0.8714. Thus we see that both metrical feature extractors, when used alone in the K-Means run, produced purer and more evenly sized clusters.
5.2 Multi-Class Classification Results
With our multi-class classifier, we considered poems by 10 poets of varied times and schools of
poetry: Elizabeth Bishop, E.E. Cummings, Emily Dickinson, Robert Frost, Allen Ginsberg, Philip
Larkin, Robert Lowell, Sylvia Plath, William Shakespeare, and Percy Bysshe Shelley. We divided
poems by all of the authors into training and test sets, using a random selection of ¾ of each
author’s works in the training set, and the remaining ¼ in the test set. We then calculated the
training and test error, and tuned our parameters and features to minimize the test error. Given
that our classifier chose one poet out of ten potential outputs, simply predicting a random poet
could be expected to have an error rate of about 90%. We used this error rate as a baseline to
evaluate our actual predictors.
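A sketch of this evaluation setup, assuming a poems_by_author mapping from poet names to lists of poems (the helper names here are illustrative):

```python
import random

def split_by_author(poems_by_author, train_fraction=0.75):
    """Randomly split each author's poems into training (3/4) and test (1/4) sets."""
    train, test = [], []
    for author, poems in poems_by_author.items():
        shuffled = list(poems)
        random.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        train += [(p, author) for p in shuffled[:cut]]
        test += [(p, author) for p in shuffled[cut:]]
    return train, test

def zero_one_error(predict, test):
    """Fraction of test poems whose author is predicted incorrectly.

    With 10 authors, a uniformly random predictor is wrong about 90% of the
    time, which is the baseline quoted above.
    """
    return sum(predict(poem) != author for poem, author in test) / len(test)
```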
To assess our features, we tested them in isolation first, then in combinations. The best single
features were the unigram and bigram word features, which simply extracted all of the unigrams
or bigrams from the poems and put their frequency counts into the feature vector.
4. This evaluation of the purity of clustering (and the formula above) comes from http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html
Interestingly, we found that unigrams performed substantially better (about 0.47 test error
instead of 0.57) when punctuation was stripped from the words, while bigrams performed
slightly better when preserving punctuation. While at first surprising, this result could have to
do with the fact that bigrams already capture more contextual information about the text, while
unigrams focus on the counts of the words themselves. Including punctuation augments the
contextual information of bigrams, but throws off the frequency counts of some of the unigrams.
In both cases, standardizing the case of the letters hurt the accuracy slightly. Case could be
expected, like punctuation, to provide a bit of contextual information (about whether a word
comes first in a sentence, for example). But there is another aspect to its predictive power as
well, since certain poets broke the conventions of capitalization. Emily Dickinson, for example,
would capitalize important non-proper nouns in her work, while E. E. Cummings often wrote
completely in lowercase.
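The punctuation and case variants come down to small differences in tokenization before counting; a minimal sketch (the regular expressions are illustrative, not our exact tokenizer):

```python
import re
from collections import Counter

def unigram_features(poem, strip_punctuation=True, lowercase=False):
    """Unigram counts with optional punctuation stripping and case folding."""
    text = poem.lower() if lowercase else poem
    if strip_punctuation:
        tokens = re.findall(r"[\w']+", text)          # words only, punctuation dropped
    else:
        tokens = re.findall(r"[\w']+|[^\w\s]", text)  # punctuation kept as separate tokens
    return Counter(f"unigram:{t}" for t in tokens)
```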
Quantitative Features
We next considered features looking at some of the quantifiable aspects of style: the length of
the poems, and average length of sentences and words used. We tried these features first alone,
then in combination with unigrams and bigrams. The results of the single feature tests and the
tests with unigrams are shown below:
Alone, the feature considering average word length had a decent amount of predictive power,
and the poem length had a very slight amount of predictive power. But when combined with
unigrams, the classifier actually had a slightly higher error rate than when it used the unigrams
alone. At least in the case of the average word length, this was almost certainly the result of the
two features being correlated.
Even worse was the average sentence length feature, which actually resulted in a classifier that
performed worse than the random baseline. Even when combined with unigrams, the classifier
was comparable to and even slightly worse than random. While we designed the average
sentence length feature on the thought that certain authors use longer sentences than others,
this is perhaps a pattern more commonly seen in prose than poetry. Furthermore, many prose
and poetic authors alike vary their sentence lengths greatly to make different points. It is
likely that the average sentence length feature found patterns in the training data that occurred
only by chance, given the particular poems in the training set.
Poetic (Literary) Device Features
As with the quantitative features, the poetic device features performed poorly when tested alone.
But when combined with unigrams, a couple of them actually improved on the accuracy of the
unigrams-only predictor.
In particular, the features that considered the frequency of rhyming words, or extracted a
particular rhyme scheme from the poem, combined well with unigrams. Alone, the classifier
with a rhyming words feature extractor had a test error rate of 0.9178, but in combination with
unigrams, the test error rate was 0.41096 -- better than the test error rate of 0.4657 for the
classifier using the unigrams extractor only. Similarly, the classifier with the rhyme scheme
feature extractor only had a test error rate of 0.9041, while the classifier using both rhyme
scheme and unigrams had the same improved test error rate of 0.41096.
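How rhymes were detected is not spelled out above; the sketch below uses a crude orthographic heuristic (two line endings “rhyme” if they match from their last vowel onward) purely to illustrate how a rhyme scheme string such as 'ABAB' can be extracted. A phoneme-based test, for example against the CMU Pronouncing Dictionary, would be more faithful.

```python
import re

def rhyme_key(word):
    """Crude rhyme key: the word from its last vowel onward (a stand-in for true phonemic rhyme)."""
    word = word.lower().strip(".,;:!?\"'()-")
    match = re.search(r"[aeiouy][^aeiouy]*$", word)
    return match.group(0) if match else word

def rhyme_scheme(poem):
    """Label each line with a letter so that rhyming line endings share a letter (e.g. 'ABAB')."""
    scheme, seen = [], {}
    for line in poem.splitlines():
        words = line.split()
        if not words:
            continue
        key = rhyme_key(words[-1])
        if key not in seen:
            seen[key] = chr(ord("A") + len(seen))   # next unused letter
        scheme.append(seen[key])
    return "".join(scheme)
```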
The best of all the poetic device features, however, were the ones considering the meter of the
poem’s lines.
Results using our two best metrical features are shown above. The meter pattern feature
extractor parses the meter of each line, producing a sequence of 1’s for stressed syllables and 0’s
for unstressed syllables. The feature vector stores the counts of each metrical pattern (coded in
0’s and 1’s) seen for that author. Likewise, the iambic meter level feature extractor parses the
meter of each line. But instead of storing the lines of meter themselves, this feature extractor
produces an average edit distance of the poem’s meter from perfect iambic meter of the same
number of syllables. (A third feature, not shown, measures the average distance from iambic
pentameter in particular, and performed slightly less well).
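A sketch of the meter pattern extraction, assuming the CMU Pronouncing Dictionary (via NLTK) as the source of syllable stresses; words missing from the dictionary are simply skipped, and our actual scansion procedure may have differed:

```python
import re
from collections import Counter
from nltk.corpus import cmudict  # requires nltk.download('cmudict')

PRONUNCIATIONS = cmudict.dict()

def line_stress_pattern(line):
    """Return a '1'/'0' string of syllable stresses for one line of verse."""
    pattern = []
    for word in re.findall(r"[a-z']+", line.lower()):
        if word in PRONUNCIATIONS:
            for phoneme in PRONUNCIATIONS[word][0]:   # first listed pronunciation
                if phoneme[-1].isdigit():             # vowel phonemes carry a stress digit
                    pattern.append("0" if phoneme[-1] == "0" else "1")
    return "".join(pattern)

def meter_pattern_features(poem):
    """Counts of each line-level stress pattern, as described for the meter pattern feature."""
    patterns = [line_stress_pattern(l) for l in poem.splitlines() if l.strip()]
    return Counter(f"meter:{p}" for p in patterns if p)
```

The iambic meter level feature can then reuse the edit_distance helper sketched earlier, averaging each line pattern’s distance from '01' repeated out to the line’s syllable count.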
Alone, the meter pattern extractor actually produced a classifier with a test error rate of 0.5616 -- slightly better than the classifier using unigrams with punctuation. The iambic meter level, while
less impressive, was also quite predictive, particularly considering that this feature produced
only one value in the feature vector. When combined with unigrams (no punctuation), we
achieved a test error rate of 0.3562, the best throughout all of our experiments. Using the iambic
meter level feature along with unigrams had a test error rate of 0.3973, which was still a
substantial improvement on the unigrams alone. Combining both metrical features with
unigrams caused the test error to jump to 0.5342, most likely because the two metrical features
are somewhat correlated.
We also experimented with some transformations of the iambic meter level. Since the feature
measures a numerical average edit distance from iambic meter, we considered the possibility
that the relationship might not be linear. We experimented with using the squared average
distance instead, or 1 divided by the distance. However, neither of these transformations
improved the test error.
Also not particularly helpful was combining more than two of our features. Though we tried
many combinations of our better performing features, nothing quite matched the accuracy level
of using the unigrams and meter pattern features. It seems that sometimes a simpler model
really is best.
6. Conclusion
Our results provide interesting quantitative insights into what factors contribute to what we
perceive as poetic style. The efficacy of various features is a qualitative measure of how much
varying that feature affects style. Our results revealed that each poetic device was, on its own, unable to meaningfully differentiate between poets. A possible reason behind this result is that
many different kinds of poets use the same poetic devices to different effects; in other words,
poetic devices measure minute choices in a poem and these choices are significant in the greater
context of a poem. It may be more meaningful to measure the effect each poet achieves with a device, rather than the frequency with which the device is used. While it may not be particularly
meaningful to note that one poet uses caesura 5 times in a poem while another uses caesura 2
times, it would be more useful to note that one poet uses caesura to build suspense while
another uses it to break a monotonous rhythm to highlight a particular line. The observation
that poetic devices become more significant when they are placed in the greater context of the
work, in this case through semantics, was illustrated in the relative success of rhyme and
metrical pattern in differentiating between poets, as compared to other poetic devices. Both
rhyme and metrical pattern are poetic devices that provide a more comprehensive view of a
poem because they are comparative devices that look at how words relate to other words and
lines to other lines. For example, looking at rhyme scheme was more productive than looking at
either rhyming words or pairs of rhyming lines because it captures a more comprehensive
picture of the poem.
The most successful quantitative feature was word length, followed by poem length, and then
sentence length. The success of word length makes sense given the success of metrical patterns as a differentiator: at a higher level, word length and metrical patterns occupy a similar role in determining the less easily measured flow of a poem. Sentence length was a less useful metric because varying sentence length corresponds to attempts to vary meter and syntactic complexity. Poem length was a less significant differentiator because poets generally do not stick to one poetic form, writing only sonnets or only epics, for example.
The most significant features looked at diction. Unigram and bigram features were the most
powerful features that we implemented. Even without the additional information of syntax and
semantics, diction helps to point to the author of a work because certain words are more
prevalent in different time periods and word use can also be used to interpret semantic and
syntactic complexity. Looking at diction also reveals when poets tend to focus on specific themes
because they will have a high frequency for vocabulary associated with those themes.
Style, especially poetic style, is generally considered an abstract concept. It has as much to do
with the feeling that a poet’s works tend to evoke as it does with the mechanics of their poetry. In the course of this project, our feature experimentation brought us closer to understanding what contributes to that poetic signature.