Using Synonyms for Author Recognition

Abstract. An approach for identifying authors using synonym sets is presented.
Drawing on modern psycholinguistic research, we justify the basis of our theory. Having formally defined the operations needed to algorithmically determine authorship, we present the results of applying our method to a corpus of
classic literature. We argue that this technique is both accurate as an author
identification tool and applicable to other domains in computer science, such as
speaker recognition.
1 Introduction and Motivation
Current research in the area of Stylometry focuses on detecting idiosyncrasies in
written literature in order to identify the author. We present a novel approach for
identifying authors in written text that is also applicable to identifying speakers from
automatically transcribed discourse.
In this paper, we argue that the lexical choices an author or speaker makes when
many synonyms are available are idiosyncratic to the point of providing
identification. Modern psycholinguistic evidence indicates that this is a well-founded
approach. Whereas previous methods of author attribution have focused on linguistic
elements peculiar to written text, the use of synonyms does not require such elements
as punctuation to be effective, and so may be applied to automatically transcribed
speech as well.
Such a flexible approach is ideal for situations such as a Smart Home environment,
where requests made in natural language could originate from sources as varied as a
personal computer and a mounted microphone. Traditional methods might require
both a vocal analyzer and a text-specific analyzer to identify the source. A
synonym-based approach incurs no such dependence on disparate technologies.
The ability to discern the source of a document or statement also has value in the
area of knowledge acquisition. One goal of knowledge acquisition is to gain the most
accurate knowledge possible. However, not all sources of information are equally
reliable, and so such a system could be designed not to trust them equally. The first step
toward this type of learning is the ability to correctly identify the source of a piece of
knowledge (i.e. a document). The presented method of author recognition aims to be
adaptable to all of these applications.
2 Related Work
Author attribution is a well-studied area of artificial intelligence. Formal methods for
determining authorship have even older roots. The field of Stylometry was thriving
well before the turn of the twentieth century, with several documented methods of
analyzing texts to settle disputed authorship. Linguistic idiosyncrasies that have been
identified as characteristic of an author include everything from counting keywords to
analyzing punctuation.
2.1 Related Efforts in Stylometry
The classification of documents by content (as opposed to by author) has been one
area in which similar techniques have been employed. Keywords have been shown
by many studies to be very effective in this area. This idea has been taken further by those
such as Paek [10], who used keywords in image descriptions to classify the content of
the images.
In the field of Stylometry, several algorithmic approaches have been applied to
author attribution. For instance, Brinegar [3] and Glover and Hirst [5] used distributions
of word-length statistics to determine authorship, while others, including Morton [9],
used sentence length for identification. The number of distinct words in a set of
documents was studied by Holmes [7] and Thisted and Efron [14].
2.2 Psycholinguistic Foundations
In the development of this method, consideration was given to not only the empirical
correctness of its results, but also to the theoretical foundations on which it is built.
Current psycholinguistic research suggests that synonyms are a part of language that is
affected by one’s environment.
Developed by the Cognitive Science Laboratory at Princeton University, the
WordNet lexical database is itself rooted in modern psycholinguistic theory [13]. It is
organized largely according to how humans are believed to understand words. It is no
coincidence that synonym sets are at the heart of WordNet.
In Holland’s 1991 study, when subjects were asked to think of as many synonyms
as possible for a set of words, a small but significant priming effect was found [6].
That is to say, subjects were more likely to produce as synonyms words that
they had previously seen. This suggests that the synonyms produced by an individual
are a product of experience, which is highly individual.
Let us digress a moment and further pursue the concept of priming. Associative
priming, the process by which one concept leads to another (e.g., a dog might cause
one to think of a bone), is a result of spreading activation [1]. The way in which
activation potentials spread through a neural network depends upon which connections
have been made and how often they have been asserted. According to Hebbian
brain theory, connections that fire together are strengthened and thus more likely to
fire together in the future. Thus, from the perspective of cognitive psychology, it
makes sense that though many synonyms may be activated by a concept, the word at
the forefront of one’s mind will be based on experience. The uniqueness of this
experience, and the corresponding uniqueness of synonym choice, may be exploited
to determine authorship.
3 Theory
3.1 Definitions
As a simple formalization of this theory, let us begin by defining a set of authors α
that have been encountered by our system before any identification is processed.
We then define λi as the lexicon corresponding to author αi, so that |λi| is that
author’s vocabulary richness. Next, consider two functions that may be applied to a
word w in λ: occ(w), the number of times word w was encountered, and syn(w), the
number of synonyms of w.
Consider a threshold θ, the minimum number of synonyms that a word must have
before it is considered an idiosyncratic identifier of an author. Note that it is the use
of a sufficiently large value of θ that provides a reasonable running time for our
algorithm. Now we define the filtered lexicon σi ⊆ λi as author αi’s lexicon restricted
to words with sufficiently large synonym sets, i.e., each word σij satisfies
syn(σij) ≥ θ.
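The construction of the filtered lexicon σi can be sketched as follows. This is an illustrative Python sketch under our own naming, not the implementation used in the experiments, and the toy synonym counts are hypothetical:

```python
from collections import Counter

def build_filtered_lexicon(words, syn, theta):
    """Build the filtered lexicon sigma_i: words having at least
    theta synonyms, each mapped to its occurrence count occ(w).

    words: iterable of tokens from an author's texts
    syn:   function returning the number of synonyms of a word
    theta: minimum synonym count for a word to be kept
    """
    occ = Counter(words)  # occ(w) over the full lexicon lambda_i
    return {w: n for w, n in occ.items() if syn(w) >= theta}

# Hypothetical synonym counts standing in for a real lexical database.
SYN = {"big": 12, "say": 9, "the": 0, "walk": 7}

sigma = build_filtered_lexicon(
    ["big", "big", "the", "say", "walk", "the"],
    lambda w: SYN.get(w, 0),
    theta=8)
# Only "big" and "say" have syn(w) >= 8, so only they survive filtering.
```

Raising `theta` shrinks the filtered lexicon, which is precisely the tractability lever discussed in Sect. 3.2.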
Having considered the task of learning each author’s style, we can examine the case
of identifying the work of an unknown author αu, where αu ∈ α. The heart of the
algorithm is the intersection of the filtered lexicon of the unknown author with that of
each known author:

ρi = σi ∩ σu.    (1)
Here, some special considerations must be made as to exactly how the intersection
of these two lexicons is to be calculated. Since we have also associated a number of
occurrences occ(w) with each element of σ, we need to determine how to evaluate
this function for an intersection. For our purposes, let the number of occurrences for a
word used by both authors i and j be

occ( σi ∩ σj ) = min( occ(σi), occ(σj) ).    (2)
Finally, we calculate the match factor for each author:

µ(i,u) = Σ_{j=1}^{|ρi|} occ(ρij) · syn(ρij).    (3)

The hypothesized author αi of the text is the one attaining the maximum value of the
match factor over all n = |α| known authors:

µ(i,u) = max( µ(1,u), µ(2,u), … , µ(n,u) ).    (4)
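Equations (1)–(4) can be sketched in a few lines of Python. This is a hedged illustration with hypothetical words and synonym counts, not the original implementation:

```python
def match_factor(sigma_i, sigma_u, syn):
    """mu(i,u) per Eqs. (1)-(3): intersect the two filtered lexicons,
    take the smaller occurrence count for each shared word (Eq. 2),
    and weight it by the word's synonym count (Eq. 3)."""
    shared = sigma_i.keys() & sigma_u.keys()  # rho = sigma_i ∩ sigma_u
    return sum(min(sigma_i[w], sigma_u[w]) * syn(w) for w in shared)

def identify(authors, sigma_u, syn):
    """Return the known author maximizing mu(i,u), per Eq. (4).

    authors: dict mapping author name -> filtered lexicon sigma_i
    """
    return max(authors, key=lambda a: match_factor(authors[a], sigma_u, syn))

# Toy filtered lexicons with invented synonym counts.
SYN = {"grand": 10, "merry": 9, "thee": 8}
authors = {
    "dickens":     {"grand": 5, "merry": 7},
    "shakespeare": {"thee": 9, "grand": 1},
}
unknown = {"merry": 3, "grand": 2, "thee": 1}
best = identify(authors, unknown, lambda w: SYN.get(w, 0))
# "dickens" scores min(5,2)*10 + min(7,3)*9 = 47, beating "shakespeare".
```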
3.2 Tractability
Though, at first glance, calculating σi ∩ σu for every author in α may spark complexity
concerns, we have already addressed the problem of tractability by means of θ.
Recall that σ is a subset of λ and, as we will see, can be a very small subset indeed if
the threshold is set high enough. Even though it seems we are cutting many possible
matches, our theory holds that even for reasonably small θ we are cutting the less
important words: for any word with syn(w) < θ, the author’s choice was constrained
by the lexicon of the language itself, due to a lack of synonyms.
3.3 Parallelization
By means of the independence of the various stages of the training and, later, the discernment process, this algorithm is an excellent candidate for parallelization (see Fig.
1). Each author and, in fact, each document that is added to the training set α can be
evaluated separately, leaving the merging of the sets to a central dispatcher. Since the
string operations necessary to calculate the statistics are far more processor-intensive
than calculating the union of said sets, a very high rate of speedup is possible.
However, true to Amdahl’s law, the stage of identifying the work of an unknown
author is dependent on having all training data prepared. Still, each intersection
σi ∩ σu and the subsequent calculations of µ(i,u) are independent of one another, giving yet another opportunity for parallelization. In the end, it is reasonable to distribute
the calculations associated with each individual author to a separate parallel process.
Fig. 1. An illustration of the major parts of the algorithm that may be run in parallel
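The per-author fan-out described above might be sketched as follows. Threads are used here only to show the structure of the distribution; a real deployment chasing speedup on CPU-bound string processing would use separate processes or machines, and all names and data are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def match_factor(sigma_i, sigma_u, syn):
    """mu(i,u): sum over shared words of min-occurrence times synonym count."""
    shared = sigma_i.keys() & sigma_u.keys()
    return sum(min(sigma_i[w], sigma_u[w]) * syn(w) for w in shared)

def identify_parallel(authors, sigma_u, syn):
    """Score every known author concurrently. Each mu(i,u) is independent
    of the others, so the per-author work can be fanned out and the scores
    merged by a central dispatcher."""
    names = list(authors)
    with ThreadPoolExecutor() as pool:
        scores = list(pool.map(
            lambda name: match_factor(authors[name], sigma_u, syn), names))
    return names[scores.index(max(scores))]

# Same hypothetical data as before; each author is scored in its own task.
SYN = {"grand": 10, "merry": 9, "thee": 8}
authors = {
    "dickens":     {"grand": 5, "merry": 7},
    "shakespeare": {"thee": 9, "grand": 1},
}
best = identify_parallel(authors, {"merry": 3, "grand": 2, "thee": 1},
                         lambda w: SYN.get(w, 0))
```

The sequential reduction at the end (taking the maximum) is the small serial fraction Amdahl’s law refers to.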
4 Implementation
4.1 WordNet
As our method requires the ability to determine the number of synonyms of a word,
we chose WordNet1 to accomplish this task. In development since 1985,
WordNet is now one of the foremost lexical resources in computational linguistics.
With over 118,000 word forms, it encompasses a substantial portion of the English
language [13].
In WordNet, each word is linked to one or more senses. These senses, in turn, can
reside in synonym sets. Thus, we can find whether a word shares at least one of its
senses with another word, making the two, by definition, synonyms. Furthermore,
this gives us the number of synonyms of a word: we sum the number of unique
members of the word’s synonym sets.
Though unimportant from a theoretical standpoint, it should be noted that the actual
version of WordNet used in this research was not the traditional ‘C’ library, but rather
a normalized database format for PostgreSQL.2 This allowed the number of synonyms
for a word to be determined in the execution of a single SQL statement.
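The single-statement synonym lookup can be illustrated with a toy normalized schema. The table and column names below are invented for this sketch (using SQLite for self-containment) and do not reproduce the actual WordNet SQL Builder schema:

```python
import sqlite3

# Toy word/sense tables standing in for the normalized WordNet database.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE word  (id INTEGER PRIMARY KEY, lemma TEXT);
    CREATE TABLE sense (word_id INTEGER, synset_id INTEGER);
""")
db.executemany("INSERT INTO word VALUES (?, ?)",
               [(1, "big"), (2, "large"), (3, "grand"), (4, "walk")])
db.executemany("INSERT INTO sense VALUES (?, ?)",
               [(1, 100), (2, 100), (3, 100), (3, 200), (4, 300)])

def syn_count(lemma):
    """Number of distinct other words sharing at least one synset with
    `lemma`, computed in a single SQL statement."""
    row = db.execute("""
        SELECT COUNT(DISTINCT s2.word_id)
        FROM word w
        JOIN sense s1 ON s1.word_id = w.id
        JOIN sense s2 ON s2.synset_id = s1.synset_id
                     AND s2.word_id <> w.id
        WHERE w.lemma = ?""", (lemma,)).fetchone()
    return row[0]
```

Here `syn_count("big")` counts "large" and "grand" (shared synset 100), while `syn_count("walk")` is 0, mirroring how syn(w) is obtained from the database in one query.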
4.2 Corpus
The texts used to train the system consisted of works of classic literature by Charles
Dickens and William Shakespeare. As these texts are in the public domain, they are
freely available for review by the curious reader.3
Table 1. Texts included in our experimental corpus. The corpus comprised 286,898
words in total

Set                Total Words  Included Texts
Dickens Train      65,157       Battle of Life, Chimes, To be Read at Dusk
Dickens Test       80,832       Cricket on the Hearth, Three Ghost Stories, A Christmas Carol
Shakespeare Train  73,863       Comedy of Errors, Hamlet, Romeo and Juliet
Shakespeare Test   67,046       Julius Caesar, Henry V, Macbeth

1 WordNet 2.1 is available for download from the Princeton Cognitive Science Laboratory at
http://wordnet.princeton.edu/.
2 WordNet SQL Builder 1.2.0 for WordNet 2.1, the application used to generate a PostgreSQL
database of WordNet, is available for download at http://wnsqlbuilder.sourceforge.net/.
3 The texts listed here are available from Project Gutenberg at http://www.gutenberg.org/.
The selected works of Dickens include the so-called “Christmas Stories”: A
Christmas Carol, The Chimes, The Cricket on the Hearth, and The Battle of Life,
all written within the same decade. The other Dickens texts are the short stories “To
be Read at Dusk” and “Three Ghost Stories.” The writings of Shakespeare that were
analyzed are some of his more famous comedies and tragedies: The Comedy of
Errors, Hamlet, Romeo and Juliet, Julius Caesar, Henry V, and Macbeth. These
works of Shakespeare span a 16-year period of his writing career.
In total, the corpus contained 286,898 words: the works of Shakespeare contributed
140,909, while Dickens’ texts constituted the remaining 145,989.
Each author used a relatively large vocabulary, though still a very small portion of the
English language as a whole. Shakespeare’s lexicon had over 13,000 words while
Dickens’ had in excess of 12,000.
The corpus was divided into four sections. Most obviously, the texts were grouped
by author. Secondly, each author’s texts were divided in half, resulting in groups of
about 72,000 ± 9% words. These sets were designated “train” and “test,” where the
“train” set was used in acquiring λ, the characteristic lexicon of an author; a match
factor µ(i,u) was then calculated for each train-test pair σTrain ∩ σTest.
5 Results
5.1 Author Identification Results
For our test data, the results showed that there is a well-defined difference between
the match factor of a correct pair and that of an incorrect pair (see Fig. 2 and Fig. 3).
This trend is consistent across all tested values of the threshold θ. In the cases of both
Dickens and Shakespeare, each test set was matched with its corresponding training set.
Moving beyond the simple fact that the system produced the correct answers, let us
take note of the margin by which the correct answer was ranked above the others. The
strongest case in the set was that of the Shakespeare training set versus the two test
sets. With the difference between the matching and non-matching set being roughly
17%, there is little question that this test run was a success.
In the case of the Dickens training set run against the test sets, the result is less
decisive, with a 10% difference between the matching and non-matching values.
However, upon further consideration, it makes sense that the authors should have a
good deal of their vocabulary in common. Though the unique qualities of an author’s
style may be subtle, we still have the ability to detect these subtleties and exploit them
to determine authorship.
[Figure 2 plots the match factor µ(i,u) (roughly 0 to 250,000) against threshold values
15–30 for the four train-test pairs: Shakespeare Train – Shakespeare Test, Shakespeare
Train – Dickens Test, Dickens Train – Shakespeare Test, and Dickens Train – Dickens Test.]

Fig. 2. The match factor µ(i,u) correctly correlates each test set with the training set
written by the same author. Thus, given a set known to be by a certain author, the
system can discern one author from another
[Figure 3 plots the percent difference in match factors (roughly 0% to 25%) against
threshold values 15–30 for Shakespeare Train vs. the two test sets and Dickens Train
vs. the two test sets.]

Fig. 3. By applying sufficiently large values of the synonym threshold θ, the number of
synonym matches, and therefore the computational overhead, is greatly reduced
[Figure 4 plots the number of synonym matches (roughly 2,000 falling toward 0)
against threshold values 15–30 for the four train-test pairs.]

Fig. 4. By applying sufficiently large values of the synonym threshold θ, the number
of synonym matches, and therefore the computational overhead, is greatly reduced.
Due to our particular interest in words with large numbers of synonyms, the cutting
of words with small synonym sets does not negatively affect the algorithm’s ability
to distinguish authors from one another
[Figure 5 plots the reduction in synonym matches (roughly 75% to 100%) against
threshold values 15–30 for the four train-test pairs.]

Fig. 5. Even moderate values of θ yield in excess of 90% reduction in the synonym
matches between authors
5.2 Threshold Performance Results
As expected, the cutting of words having few synonyms was successful in reducing
the size of ρ for each author. In the case of Shakespeare’s works, there were over
11,900 unique words; yet, after filtering with a threshold of 30, only 539 words had
to be examined, giving a 94% overall reduction in the size of ρ (see Fig. 4 and
Fig. 5).
The fact that the threshold was so successful in identifying those words that were
useful and discarding others is key in making this algorithm tractable for larger sets of
authors. Even higher values of θ do not negatively impact the accuracy of discerning
authorship.
6 Future Work
6.1 Multilingual Implementation
The next step in our work is to determine whether the method will be successful in
other languages. Since the only language-dependent part of our algorithm is WordNet,
we simply need to query a lexical database for our target language.
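Such a port amounts to swapping the synonym-counting backend. A minimal sketch of that seam, with entirely hypothetical per-language counters, might look like:

```python
from typing import Callable, Dict

# The only language-dependent component is the synonym counter, so a port
# means binding a counter backed by a WordNet-style database for the target
# language. The names and counts below are illustrative, not any actual
# multilingual WordNet API.
SynCounter = Callable[[str], int]

def make_lexicon_filter(syn: SynCounter, theta: int):
    """Bind a language-specific synonym counter into the filtering step."""
    def filter_lexicon(occ: Dict[str, int]) -> Dict[str, int]:
        return {w: n for w, n in occ.items() if syn(w) >= theta}
    return filter_lexicon

# English and a hypothetical Spanish backend share the same code path.
english = make_lexicon_filter(lambda w: {"big": 12}.get(w, 0), theta=8)
spanish = make_lexicon_filter(lambda w: {"grande": 11}.get(w, 0), theta=8)
```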
Currently, there is much work being done on the creation and automatic generation
of WordNet databases for various languages. One such implementation is MEANING,
a project using the web as a huge corpus in an effort to build a multilingual WordNet.
Already, the project has produced a web-based interface to view their progress.4
6.2 Integration with an Automatic Speech Transcriber
One key point of this algorithm is that it is not dependent upon the peculiarities of the
written word. That is, the system is not hindered by the absence of punctuation in text
generated by an automatic speech transcription system. Provided the transcription
system can accurately generate text from speech, the results of our method of author
recognition should carry over seamlessly to speaker recognition.
One possible implementation of this involves Carnegie Mellon’s Sphinx Live
Decoder. Using Hidden Markov Models, this program has the ability to transcribe
voice input to text on the fly [8]. This type of system would be ideal for situations in
which input could come from disparate sources such as keyboards and microphones.
4 A web-based demo of the MEANING project can be found at
http://www.lsi.upc.es/~nlp/meaning/.
6.3 Integration into a Smart Home Environment
In a Smart Home environment, it is key to provide a truly natural experience for the
home’s inhabitants. Thus, it follows that the Smart Home should be able to take vocal
requests and that it should be able to discern who is making these requests. This will
allow the system’s responses to be more appropriate. For example, a truly intelligent
Smart Home would not be likely to comply with a three-year-old child’s demand for
ten dozen cookies. An author-speaker recognition system using the methods proposed
here is the first step in making this scenario a reality.
7 Conclusions
Research using synonyms to recognize authors shows much promise for the future.
Because the method is grounded in psycholinguistic theory, we can be confident that
it has a solid foundation on which future work can be built. Furthermore, the fact that
this method can be optimized using higher threshold values and distributed processing
allows it to be used in situations where running time is a consideration. Having
demonstrated its ability to accurately identify authors in this domain, we look forward
to applying our new theory to further areas of use.
References
1. Anderson, J.R.: Cognitive Psychology and its Implications. W.H. Freeman and Company,
New York (1995)
2. Baayen, H., van Halteren, H., Neijt, A., Tweedie, F.: An experiment in authorship
attribution. In: JADT 2002: 6es Journées internationales d’Analyse statistique des Données
Textuelles (2002)
3. Brinegar, C.: Mark Twain and the Quintus Curtius Snodgrass Letters: A Statistical Test of
Authorship. Journal of the American Statistical Association 58, 85–96 (1963)
4. Chaski, C.: Who’s At the Keyboard? Recent Results in Authorship Attribution.
International Journal of Digital Evidence, Spring (2005)
5. Glover, A., Hirst, G.: Detecting stylistic inconsistencies in collaborative writing. In:
Sharples, M., van der Geest, T. (eds.) The New Writing Environment: Writers at Work in a
World of Technology. Springer-Verlag, London (1996)
6. Holland, C.R.: Does synonym priming exist on a word completion task? Doctoral Thesis,
Case Western Reserve University, Psychology (1992)
7. Holmes, D.: Authorship Attribution. Computers and the Humanities 28, 87–106. Kluwer
Academic Publishers, Netherlands (1994)
8. Huang, X., et al.: The Sphinx-II Speech Recognition System. Computer Speech and
Language (1993)
9. Morton, A.: The Authorship of Greek Prose. Journal of the Royal Statistical Society (A)
128, 169–233 (1965)
10. Paek, S., et al.: Integration of Visual and Text-Based Approaches for the Content Labeling
and Classification of Photographs. ACM SIGIR (1999)
11. Reiter, E., Sripada, S.: Contextual Influences on Near-Synonym Choice. In: Proceedings of
INLG-2004, pp. 161–170 (2004)
12. Kaster, A., Siersdorfer, S., Weikum, G.: Combining Text and Linguistic Document
Representations for Authorship Attribution. In: SIGIR Workshop: Stylistic Analysis of
Text for Information Access (STYLE), Salvador, Bahia, Brazil (2005)
13. Miller, G.A.: WordNet: A Lexical Database for English. Communications of the ACM
38(11) (1995)
14. Thisted, R., Efron, B.: Did Shakespeare Write a Newly-discovered Poem? Biometrika 74,
445–455 (1987)
15. Uzuner, Ö., Katz, B.: A Comparative Study of Language Models for Book and Author
Recognition. Lecture Notes in Computer Science, vol. 3651, p. 969. Springer-Verlag
GmbH (2005)