Identifying the Authors of Suspect Email

Malcolm Corney *
Email: [email protected]  Phone: +61 7 3864 4304

Alison Anderson *
Email: [email protected]  Phone: +61 7 3864 2465

George Mohay *
Email: [email protected]  Phone: +61 7 3684 1964

Olivier de Vel **
Email: [email protected]  Phone: +61 8 8259 7254

* Information Security Research Centre, Faculty of Information Technology, Queensland University of Technology, GPO Box 2434, Brisbane 4001, AUSTRALIA

** Defence Science and Technology Organisation, PO Box 1500, Edinburgh SA 5111, AUSTRALIA

Abstract

In this paper, we present the results of an investigation into identifying the authorship of email messages by analysing the contents and style of the messages themselves. A set of stylistic features applicable to text in general and an extended set of email-specific structural features were identified. A Support Vector Machine learning method was used to discriminate between the authorship classes. Through a series of baseline experiments on non-email data, it was found that approximately 20 email messages of approximately 100 words each should be sufficient to discriminate authorship in most cases. These results were confirmed with a corpus of email data, and performance was further enhanced when a set of email-specific features was added. This outcome has important implications for the management of problems such as email abuse, anonymous email messages and computer forensics.

Keywords: Authorship Attribution, Email, Stylometry, Machine Learning, Support Vector Machine, Data Mining

1. Introduction

Email is not only the most frequently used Internet application; it is at the same time transforming workplace behaviour. In particular, there has been a noticeable shift in the way workers treat electronic documents as opposed to paper documents. Analyses of email writing style [1] situate email communication somewhere between the informality of the spoken word and the relative formality of an official memo or letter. Email message production involves so little effort that workers are encouraged to use email as their dominant form of communication, and the speedy and casual nature of email provides an environment for pushing the boundaries of acceptable behaviour. The act of constructing and sending an email message, however, is more significant than may be realised, since stored email is potentially a persistent record of an act, an instruction, a response, an intention or even an attitude.

As the usage of email has grown, so too have the responsibilities of employers and workplace administrators, who have never before had the task of enforcing so many sociolegal policies. Email evidence can be central in cases of sexual or racial harassment, threats, bullying and even unfair dismissal. Email can also be critical evidence for or against accusations of copyright violation, plagiarism and disputed authorship. Nonetheless, email can be spoofed or anonymised, and therein lies the central problem of successfully using email texts as the formal record of any event. While some work has been done on preventing email spoofing and anonymisation in local, intranet situations, this does not address the prime difficulty we face with email in the wider sphere, where enforced identification and authentication are socially and technically vexed. Hence, while enforced authentication of all email may arguably be desirable, it is currently infeasible.
In the meantime, policy administrators need tools to demonstrate that social, legal and company rules are and will be enforced, and tools to identify email authorship where spoofing or anonymisation may have occurred. This paper describes the results of applying a machine learning technique using a Support Vector Machine (SVM) to the problem of identifying possible authors of a suspect email. SVM-based identification appears to give useful results on text samples as small as 100 to 200 words, especially when email construction habits are taken into account. Although not yet appropriate for use in formal forensics, for example in court cases, it promises to become a useful adjunct in policy enforcement.

2. Context of this Research

Employers often permit some private email usage by their employees as long as efficiency does not suffer. Recent cases [2] confirm, however, that the employer is entitled to monitor computer-based activities for behaviour that may be illegal or against company rules. While the monitoring of network traffic to identify potential computer intrusions is now well established, it is now clear that message content analysis can be an important adjunct to traffic analysis in identifying illegal activity. Such activity is very likely to be insider activity and may be against policy rather than a direct compromise of computer security. Typically, such cases need to be investigated rather than prevented. However, little has been done to show how data mining techniques might be used to identify authors from their email style in cases where authorship of such email is spoofed or disputed.

An obvious way to approach this problem is to exploit research in literary stylometry, i.e. authorship identification from writing style analysis. In this paper, we present the results of applying an SVM to email messages, using stylometric features that have previously been recognised as successful in the case of ordinary text. We describe experiments which test the reliability of certain style markers, and also of email structure, as author identifiers, and which establish the smallest possible text length giving useful results. The experiments described include analysis of both email and non-email texts. The paper reviews relevant work in stylometry and in the use of the SVM learning framework, and then shows how these ideas have been integrated to develop email-specific stylometrics amenable to machine learning techniques. Following this, we present an account of a sequence of controlled experiments aimed at defining the limitations and scope of the SVM learning software in this context. Finally, we discuss the forensic and practical implications of email content surveillance.

3. Stylometry and Email

Stylometry, a development of literary stylistics, assumes that an author has distinctive writing habits and that these are exhibited in features like core vocabulary use, sentence complexity and phraseology. Another assumption is that these habits are unconscious: so ingrained that they are difficult to conceal deliberately, or at least require a conscious effort to do so. Further, stylometry seeks to establish methods for style feature extraction and associated metrics for assessing text similarity [3]. An example demonstrates that, with restricted choice, any experienced reader can apply a kind of stylometry. Which of the following examples was written more recently? Are there two authors here or only one? Which example was written by a native English speaker?
Telltale features include fluency, grammar, syntax, spelling, punctuation and vocabulary.

Example A

England a commercial country! Yes; as Venice was. She may excel other nations in commerce, but yet it is not that in which she most prides herself, in which she most excels.

Example B

Just another query about the results to the project, will results be available by 14th Dec? Because if not I was told that i will not be able to attend the graduations. If you will be willing to write the letter for me, I can meet up with you for any further discussions.

The general problem of deciding the authorship of an anonymous text, based on comparisons with texts of known authorship, is, however, far more challenging. An anonymous text sample may indeed have been written by some other author entirely. Stylometry uses metrics for features such as fluency, grammar, syntax and spelling. Based on such metrics, an assessment can be made of how similar or dissimilar the anonymous text is to the known-authorship texts. Hence, its success depends both on having quantities of known-author text available and on restricting choice to a small set of possible authors. Its techniques also require texts to be comparable in the sense that they come from similar genres: poetry, prose, drama etc. Traditionally, stylometry has been applied in cases of disputed literary authorship, such as the Federalist Papers [4, 5], a longstanding problem in stylometry involving the authorship of anonymous essays by one of three known authors.

According to [6], at least 1000 proposals for "style markers" exist in stylometric research. The following incomplete list of example markers shows that they may be categorised as character-based, word-based, sentence-based, document-based, structural or syntactic. These and similar markers have been tested with mixed success, either alone, in combination, or augmented by grammatical or syntactic mark-up:

• Letter frequencies
• n-gram frequencies (overlapping n-character sequences)
• Function word usage (short structure-determining words: common adverbs, auxiliary verbs, conjunctions, determiners, numbers, prepositions and pronouns)
• Vocabulary richness (number of different words used)
• Lexical richness (word frequency as a function of full text)
• Distribution of syllables per word
• Word frequencies
• Hapax legomena (words used once only)
• Hapax dislegomena (words used twice only)
• Word length distribution
• Word collocations (words frequently used together)
• Sentence length
• Preferred word positions
• Prepositional phrase structure
• Distribution of parts of speech
• Phrasal composition grammar

While most style metrics appear to work some of the time (i.e. on some texts and some authors), the most reliably successful features have, in general, been function words and n-grams. A number of successful experiments with function words have been reported [4, 5, 7]. N-grams overlap to some extent with function words, since frequent short words count higher, but n-gram frequencies also take into account some punctuation and other structural properties of the text. Most reports, e.g. [8], indicate that 2-grams or 3-grams give the best discrimination. The effectiveness of n-grams derives from the fact that they are a successful summary marker, one that can substitute for other markers: they capture something about the author's favourite vocabulary as well as sentence structure.
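To make these two marker families concrete, the following Python sketch shows one plausible way to mine normalised character 2-gram and function word frequencies from a text sample. It is illustrative only: the tiny function word list and the tokenisation rule are our own simplifying assumptions, not the feature definitions used in the experiments below.

```python
from collections import Counter
import re

# A tiny stand-in for a full function word list (real lists have 100+ entries).
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "but", "in", "on", "to", "is", "it", "that"}

def char_2gram_frequencies(text: str) -> dict:
    """Overlapping character 2-grams, normalised by the total 2-gram count."""
    grams = [text[i:i + 2] for i in range(len(text) - 1)]
    counts = Counter(grams)
    total = sum(counts.values()) or 1
    return {g: c / total for g, c in counts.items()}

def function_word_frequencies(text: str) -> dict:
    """Frequency of each function word, normalised by the token count N."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n = len(tokens) or 1
    counts = Counter(t for t in tokens if t in FUNCTION_WORDS)
    return {w: counts[w] / n for w in FUNCTION_WORDS}

sample = "Just another query about the results to the project."
print(function_word_frequencies(sample)["the"])      # 2/9 for this sample
print(len(char_2gram_frequencies(sample)), "distinct 2-grams")
```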
A recent report [9], however, suggests that even successful style markers may nonetheless be sensitive to differences in genre, topic etc. when the text corpus is small. These results question the existence of authorial fingerprints, at least for small samples. Clearly, therefore, there are problems with applying stylometry to email authorship, given that typical email message size is small. From the above, it would seem that differing genre, topic, learning corpus size and text sample size could confound identification. The degree to which email texts are sensitive to these effects is not known. This work extends some initial investigations into the classification of email message authorship [10], and it is for the reasons outlined above that we have set out to conduct a series of controlled experiments on both non-email and email texts: to identify the extent of this sensitivity and, at the same time, to establish which style markers are most effective.

Fortunately, email has structural features that pure text lacks. While some of these features are covered by style markers (paragraphing and tabbing), others, such as the use of greeting and/or farewell text and the inclusion of a signature, may be as habitual as vocabulary or syntax. In the experiments described below, we examined the effects of using stylometry's most promising style markers in conjunction with an analysis of features unique to email. These email structural features must be identified with care in order to allow for the widest possible variety of message formats and to account for adulterating included text. Our chosen email features are listed in Section 5, "Conduct of the Experiments". We acknowledge that the features described above can be deliberately disguised, but point out that it may be difficult to disguise all of them at once, in the same message, without considerable planning and drafting, and that such planning may leave a trail of its own.

4. Experimental Approach

Our experimental approach is based on the fact that the stylistic features of an email message and the structural features of its layout can be reduced to a set of numerical values by data mining the text and other fields of the message being analysed. We then hypothesise that these values can be thought of as defining a pattern of authorship. If this authorship profile or pattern exists, a machine learning technique can be used to discriminate between different authors. Classification of documents of unknown authorship can then be attempted.

Support vector machines [11] are based on the structural risk minimisation principle from computational learning theory. The idea of structural risk minimisation is to find a hypothesis that guarantees the lowest true error for a classification problem. The true error of the hypothesis is the probability that the learnt classifier model will make an error on an unseen and randomly selected test example. The values of the features selected for a classification task are transformed into a hyperspace, and the support vector machine finds a hyperplane that separates the positive and negative examples of the training set with a maximum margin. The hyperplane that separates the examples is based on a kernel function, which can be linear, polynomial, a radial basis function or any other function that the user chooses. The training examples that are closest to the hyperplane, and thus define it, are called support vectors.
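As a minimal sketch of this idea, the following code trains a binary max-margin classifier on style feature vectors, using scikit-learn as a modern stand-in for the SVMlight package used in our experiments; the feature values are invented for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# One row per text sample, one column per style feature (e.g. normalised
# function word frequencies); +1 = suspect author, -1 = other authors.
X = np.array([[0.12, 0.30], [0.10, 0.28], [0.14, 0.31],
              [0.40, 0.02], [0.38, 0.05], [0.42, 0.04]])
y = np.array([1, 1, 1, -1, -1, -1])

model = SVC(kernel="linear").fit(X, y)   # kernel could also be polynomial, RBF, etc.
print("support vectors:", model.support_vectors_)  # the examples defining the hyperplane
print("prediction:", model.predict([[0.11, 0.29]]))
```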
The SVM classification technique is limited to performing binary classifications. When the classification problem has more than two classes, one of the classes is made the positive class, all of the others are made negative, and a classifier model is learned. For n classes, n classifier models are learned, with each class in turn being made the positive class for its corresponding classifier model. The learnt classifier models generated by the Support Vector Machine can then be used to classify unseen test examples. The output of this classification step determines whether or not the test example belongs to a particular class.

When analysing the results of classification, a data point can belong to the positive or the negative class, and in either case it can be classified correctly or incorrectly. The classification result for a single data point can therefore be considered a true positive, true negative, false positive or false negative result. The results can be collected into a two-way confusion matrix, as shown in Figure 1. It is necessary to produce a two-way confusion matrix for each authorship class in the experiment.

Figure 1: Two-way Confusion Matrix

                           Predicted Class
                           Positive                    Negative
Actual    Positive         True Positive frequency     False Negative frequency
Class     Negative         False Positive frequency    True Negative frequency

A set of statistics can then be calculated from the two-way confusion matrix to determine the level of success of the classification experiment [12]. The error rate gives an indication of how many data points were incorrectly classified. The precision statistic measures the impact of false positives on the results, and the recall statistic measures the impact of false negatives. Precision and recall can be combined into a single statistic, F1, by calculating their harmonic mean. Formulae for these statistics are shown below:

    E = \frac{FP + FN}{TP + FP + FN + TN}

    P = \frac{TP}{TP + FP}

    R = \frac{TP}{TP + FN}

    F_1 = \frac{2 \times R \times P}{R + P}

To get an indication of the overall success of a classification experiment, the macro-averaged error rate and F1 statistics [13] can be calculated across the authorship classes using the following formulae:

    F_1^{(M)} = \frac{1}{n} \sum_{i=1}^{n} F_{1, AC_i}    and    E^{(M)} = \frac{1}{n} \sum_{i=1}^{n} E_{AC_i}

where AC_i is the author class (i = 1, 2, ..., n). To compensate for document frequency, the statistics for each authorship class are inversely weighted by the number of data points in that class [10]. The weighted macro-averaged F1 and error rate can then be calculated as:

    F_1^{(M)} = \frac{\sum_{i=1}^{n} (1 - w_{AC_i}) F_{1, AC_i}}{n - 1}    and    E^{(M)} = \frac{\sum_{i=1}^{n} (1 - w_{AC_i}) E_{AC_i}}{n - 1}

where w_{AC_i} is the document frequency weight

    w_{AC_i} = \frac{N_{AC_i}}{\sum_{i=1}^{n} N_{AC_i}}

and N_{AC_i} is the number of documents in authorship class AC_i.
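The following sketch works these statistics through for a set of assumed toy confusion-matrix counts; the numbers are illustrative only and are not data from the experiments.

```python
def class_stats(tp, fp, fn, tn):
    e = (fp + fn) / (tp + fp + fn + tn)   # error rate E
    p = tp / (tp + fp)                    # precision P
    r = tp / (tp + fn)                    # recall R
    f1 = 2 * r * p / (r + p)              # harmonic mean of P and R
    return e, f1

# One confusion matrix per authorship class: (TP, FP, FN, TN, class size N_ACi).
classes = [(18, 2, 3, 77, 21), (30, 5, 2, 63, 32), (9, 1, 4, 86, 13)]

n = len(classes)
total_docs = sum(c[4] for c in classes)
weights = [c[4] / total_docs for c in classes]            # w_ACi
stats = [class_stats(tp, fp, fn, tn) for tp, fp, fn, tn, _ in classes]

# Weighted macro-averages: each class is inversely weighted by its size.
f1_macro = sum((1 - w) * f1 for w, (_, f1) in zip(weights, stats)) / (n - 1)
err_macro = sum((1 - w) * e for w, (e, _) in zip(weights, stats)) / (n - 1)
print(f"weighted macro F1 = {f1_macro:.3f}, weighted macro error = {err_macro:.3f}")
```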
5. Conduct of the Experiments

The methodology used in these experiments was first tested by running baseline tests on large documents within a single genre, to determine whether or not authorship could be classified using the proposed scheme. The documents used were a corpus of Information Technology PhD theses written by three authors. A series of experiments was run on these documents to identify the most useful features and to determine how much text data is required for reliable authorship categorisation. These experiments are detailed in Sections 5.1 to 5.3. Following completion of these baseline experiments, further tests were conducted on a corpus of email messages, as outlined in Section 5.4.

The email corpus contains 253 email messages in total from four authors, comprising over 23,200 words. The messages vary in length from 0 to 964 words, with an average length of 92 words. No attempt was made to control the topic or style of these email messages, and their topics are therefore quite diverse.

5.1 Feature Selection

A list of features that have been used by other researchers in the field of authorship attribution was compiled. Only those features that had been shown to be successful were considered. This list was split into sets of features with a similar basis (character-based, word-based, document-based, function words, word length frequency). In total, 184 stylistic features were used in these experiments. Definitions of the stylistic features are shown in Table 1. In the table, the following definitions apply:

• C = total number of characters in the document chunk or email message
• N = total number of tokens (i.e. words) in the document or email message
• V = total number of distinct words in the document or email message

Email messages, by nature, do not contain a constant number of words from message to message. To overcome this variability, the features were normalised, where possible, as ratios of the frequency of some property (e.g. the number of upper-case letters) to a summary property (e.g. the total number of letters).

These tests were carried out on the PhD theses using individual feature sets and combinations of these sets. The documents were split into chunks of text containing 1000 words. The values of the features proposed for discriminating authorship were mined from the documents and prepared using 10-fold stratified cross-validation [12] for classification with SVMlight [14]. Table 2 shows the classification results for some feature sets and for some combinations of feature sets.
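As an indication of how such normalised ratios can be mined, the sketch below computes the character-based ratios listed in Table 1 (shown next). The exact definitions, such as the distinction between space and white-space characters, follow our reading of the table and may differ in detail from the original implementation.

```python
import string

def character_features(text: str) -> dict:
    """Normalised character-based ratios in the style of Table 1."""
    c = len(text) or 1
    letters = sum(ch.isalpha() for ch in text)
    upper = sum(ch.isupper() for ch in text)
    digits = sum(ch.isdigit() for ch in text)
    spaces = text.count(" ")
    tabs = text.count("\t")
    whitespace = sum(ch.isspace() for ch in text) or 1
    punct = sum(ch in string.punctuation for ch in text)
    return {
        "letters/C": letters / c,
        "upper/C": upper / c,
        "digits/C": digits / c,
        "whitespace/C": whitespace / c,
        "spaces/whitespace": spaces / whitespace,
        "tabs/C": tabs / c,
        "tabs/whitespace": tabs / whitespace,
        "punctuation/C": punct / c,
    }

print(character_features("Hello,\tWorld! 42"))
```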
Table 1: List of Stylistic Features

Character-based (C), 10 features:
  total number of characters in words / C
  total number of letters (a-z) / C
  total number of upper-case characters / C
  total number of digit characters / C
  total number of white-space characters / C
  total number of space characters / C
  total number of spaces / total number of white-space characters
  total number of tab characters / C
  total number of tab characters / total number of white-space characters
  total number of punctuation characters / C

Word-based (W), 20 features:
  average word length (in characters)
  vocabulary richness (V / N)
  total number of function words / N
  total number of short words (1-3 characters) / N
  count of hapax legomena / N
  count of hapax legomena / V
  count of hapax dislegomena / N
  Guiraud's R*, Herdan's C*, Rubet's K*, Maas' A*, Dugast's U*, Luk'janenkov & Neistov's measure*, Brunet's W*, Honore's H*, Sichel's S*, Yule's K*, Simpson's D*, Herdan's V*, Entropy*

Document-based (D), 2 features:
  number of blank lines / total number of lines
  average sentence length (in words)

Function word frequency distribution (F), 122 features:
  frequency of each function word / N

Word length frequency distribution (L), 30 features:
  frequency of occurrence of words of length 1 to 30 / N

* These features are implemented as described in Tweedie & Baayen [15].

5.2 Number of Words per Document Chunk

As email messages often have fewer than 100 words, it was necessary to demonstrate that the technique is capable of reliably discriminating authorship between small chunks of text, i.e. chunks of a size comparable with typical email messages. The feature sets found to be best for discriminating authorship in the experiments whose results are listed in Table 2 were then used to determine the authorship of texts of decreasing size. The documents from the previous experiments with the PhD theses were split into chunks of 100, 200 and 500 words, and the classification tests were re-run. The classification results are shown in Figure 2 for the set of function word features, for the set of 2-grams, and for a combination of the following features: character-based, word-based, document-based, function word frequencies and word length distribution frequencies.

Table 2: The Effect of Feature Sets on Classification of Authorship

Feature Sets                                Weighted Macro-Averaged   Weighted Macro-Averaged
                                            Error Rate (%)            F1 (%)
Character-based (C)                          8.5                      86.7
Word-based (W)                               6.9                      89.4
Function Word Frequency (F)                  2.7                      95.6
Word Length Frequency Distribution (L)      12.7                      81.4
2-grams                                      0.8                      98.8
C + W + F                                    1.2                      98.0
C + W + L                                    4.1                      93.8
C + F + L                                    0.3                      99.6
F + L + W                                    0.3                      99.6
C + F + L + W + Document-based (D)           0.3                      99.6

[Figure 2: The Effect of Chunk Size on Classification Results for Different Feature Sets. F1 (%) versus number of words per chunk for the function word features, 2-grams and the combined C+W+D+F+L feature set.]

5.3 Number of Documents per Author

When applying the methodology to a corpus of email messages, a set of authentic email messages from each suspected author is needed. One desired result of this experimentation was to identify how many samples of text are needed from each author for reliable classification. The following set of tests was performed to find this minimum number of email messages per authorship class. The PhD theses were split into between 10 and 50 chunks of 200 and 500 words by randomly sampling the documents, ensuring that the document chunks were mutually exclusive. Each set of data was tested using a combination of all of the stylistic feature sets to determine the classification results. These results are shown in Figure 3.

[Figure 3: The Effect of Number of Data Points on Classification Results. F1 (%) versus number of chunks for 500-word and 200-word chunks.]
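The sampling scheme just described can be sketched as follows; this is a plausible reconstruction under our own simplifying assumptions, not the original sampling code.

```python
import random
import re

def sample_chunks(text: str, chunk_words: int, num_chunks: int, seed: int = 0):
    """Return num_chunks non-overlapping chunks of chunk_words words each,
    drawn at random from the document."""
    words = re.findall(r"\S+", text)
    starts = list(range(0, len(words) - chunk_words + 1, chunk_words))
    if len(starts) < num_chunks:
        raise ValueError("document too short for the requested sample")
    rng = random.Random(seed)
    chosen = rng.sample(starts, num_chunks)   # distinct starts => mutually exclusive
    return [" ".join(words[s:s + chunk_words]) for s in chosen]

# e.g. twenty 200-word chunks from one thesis (hypothetical file name):
# chunks = sample_chunks(open("thesis.txt").read(), chunk_words=200, num_chunks=20)
```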
5.4 Classification of Email Messages

After the baseline tests on non-email data were completed, a set of experiments was carried out on the corpus of email messages detailed previously. The corpus was initially tested using the basic stylistic features of Table 1, followed by further tests in which extra features were mined from the structure of each email message. The details of these structural features are shown in Table 3.

Table 3: List of Email Message Structure Features

Email-based (E), 5 features:
  position of requoted text in the message (7 possibilities)
  absence / presence of greeting text
  absence / presence of farewell text
  absence / presence of signature
  number of attachments

HTML tag frequency distribution (H), 22 features:
  HTML tag frequency / total number of HTML tags

A comparison of classification results for these email messages, tested with and without the extra structural features, is shown in Table 4.

Table 4: Comparison of Classification of Email With and Without Email Structural Features

Feature Set                           Weighted Macro-Averaged   Weighted Macro-Averaged
                                      Error Rate (%)            F1 (%)
Function Words                        19.8                      57.1
Function Words + Email Structure       9.6                      79.7
C + F + L + W                         17.3                      62.9
C + F + L + W + Email Structure        8.2                      82.4
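To illustrate how structural features of the kind listed in Table 3 might be mined from a raw message, the sketch below uses Python's standard email parser. The greeting and farewell word lists are illustrative assumptions, not those used in the experiments, and the sketch covers only a subset of the Table 3 features.

```python
from email import message_from_string

GREETINGS = ("hi", "hello", "dear", "g'day")        # assumed lists, for illustration
FAREWELLS = ("regards", "cheers", "thanks", "sincerely")

def structural_features(raw_message: str) -> dict:
    msg = message_from_string(raw_message)
    body = msg.get_payload() if not msg.is_multipart() else ""
    attachments = sum(
        part.get_content_disposition() == "attachment" for part in msg.walk()
    ) if msg.is_multipart() else 0
    lines = [ln.strip() for ln in body.splitlines() if ln.strip()]
    first = lines[0].lower() if lines else ""
    last_few = " ".join(lines[-3:]).lower() if lines else ""
    return {
        "has_greeting": any(first.startswith(g) for g in GREETINGS),
        "has_farewell": any(f in last_few for f in FAREWELLS),
        "has_requoted_text": any(ln.startswith(">") for ln in lines),
        "num_attachments": attachments,
    }

raw = "From: [email protected]\nSubject: results\n\nHi Bob,\n> earlier text\nSee attached.\nCheers,\nAlice\n"
print(structural_features(raw))
```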
6. Discussion of Results

In the conduct of our experiments, the best single feature set for discriminating authorship has consistently been the set of function words. Some tests were performed in which the different types of function words, such as adverbs, auxiliary verbs, prepositions and pronouns, were used as separate feature sets. The preposition and pronoun sets performed slightly better than the others, but better results were obtained when the full set of function words was used.

Intuitively, the word length frequency distribution should give some discrimination between authorship classes, as this distribution could be linked to an author's level of education. However, of all the stylistic features used, the word length frequency distribution is the poorest feature set when used alone.

The use of n-grams as a feature set provides very good classification results. It seems, however, that n-grams discriminate between authors' documents on the basis of content rather than style alone. Some tests were conducted in which an unseen document from an author was used as the test set for a series of learned classifiers. The unseen document had a very high error rate when 2-grams were used as the features, but a lower error rate when the stylistic features defined in Table 1 were used. The two documents had different topics, and it seems that topic-specific words inflate the 2-gram metrics of the different documents, so that documents are discriminated on this basis.

The best classification results are achieved when all of the feature sets in Table 1 are combined; the incremental addition of feature sets shows a persistent improvement in classification. Most of the stylistic features mined from text for classification are normalised with respect to the number of words in the document. For small documents, many of the features will therefore have values of zero, and there must be some threshold number of words that a document must contain before it can be classified. The results on the number of words required for reliable classification, shown in Figure 2, indicate that good results are achieved with 200 words per document, and that as few as 100 words may be satisfactory in some cases. This is an encouraging result, as it approximates the size of many emails. It is interesting to note from Figure 2 that there is little difference in classification results across chunk sizes when 2-grams are used as the feature set. This is again thought to be due to various 2-gram frequencies being inflated by content-specific words in the documents. This effect requires further investigation.

In some cases, the number of email messages available from a suspect author may be low. Before the classification of suspect email messages can take place, it is necessary to know the minimum number of messages required to determine a pattern of authorship. The results from the tests on the number of documents required, displayed in Figure 3, show that once the number of documents reaches 20, for documents of 200 and 500 words, the classification efficiency seems to level off. This is once again an encouraging result, and it requires confirmation with a larger corpus of data.

The previous results were all obtained on report-style documents, and for all tests the features were mined from document segments containing the same number of words. The data set of email messages could not be controlled in this manner. Fortunately, email messages do have a structure, and this structure can be utilised as another set of features. Table 4 shows the classification results from tests undertaken when the email-specific features (greeting, farewell and signature use, HTML use, attachments and requote position) are added to the stylistic features mined from the email message text. The improvement in classification due to the features mined from message structure is quite dramatic. The email messages used in these tests were not controlled in any manner; it may be that better results could be obtained if some minimum number of words were imposed on the data.

7. Conclusion

Our experiments have shown that it is possible to carry out effective authorship identification of typical email messages in some circumstances. The extent to which this is possible in wider circumstances is still to be determined. The experiments have shown that the classification of documents from different authors with an SVM machine-learning algorithm provides a systematic way of determining the relative effectiveness of raw style markers. Our experiments show that a selected subset of these markers from the field of stylistics may be effective enough to create an authorship identification tool; more importantly, we have shown that they are good discriminators at an email-sized level. These experiments also show that the natural structure of email provides additional author-specific features, and that these, in conjunction with stylometry, give a better result. Stepwise SVM experiments are a convenient way to expose the marginal improvements of individual features and feature combinations, and to show whether these are worth using on the population to hand.
More experimentation is needed to decide whether authors have particular sensitivity to particular markers or, more importantly, whether the learner can be spoofed. As yet, the SVM authorship identifier does not approach the status of a forensic stylistics tool. Expert evidence such as forensic linguistics can only be considered 'scientific' in the legal sense if it has court-accepted attributes, i.e. empirical testing, known error rates, standard procedures, peer review etc. [16]. The SVM learner, however, is a convenient platform for establishing the first two of these. Meanwhile, as a tool for narrowing a suspect list, it has applications beyond email abuse and, by excluding whole classes of suspects, can suggest avenues for investigation by other means.

8. References

1. Baron, N., Letters by Phone or Speech by Other Means: The Linguistics of Email. Language and Communication, 1998. 18: p. 133-170.
2. Sallis, P.J., Computer-mediated Communication: Experiments with E-mail Readability. Information Sciences, 2000. 123: p. 45-53.
3. Burrows, J.F., Computers and the Study of Literature, in Computers and Written Text, C. Butler, Editor. 1992, Blackwell: Oxford. p. 167-204.
4. Mosteller, F. and D. Wallace, Applied Bayesian and Classical Inference: The Case of the Federalist Papers. 1984, New York: Springer-Verlag.
5. Tweedie, F., S. Singh, and D.I. Holmes, Neural Network Applications in Stylometry: The Federalist Papers. Computers and the Humanities, 1996. 30(1): p. 1-10.
6. Rudman, J., The State of Authorship Attribution Studies: Some Problems and Solutions. Computers and the Humanities, 1998. 31(4): p. 351-365.
7. Craig, H., Authorial Attribution and Computational Stylistics: If You Can Tell Authors Apart, Have You Learned Anything About Them? Literary and Linguistic Computing, 1999. 14(1): p. 103-113.
8. Kjell, B., Authorship Attribution of Text Samples using Neural Networks and Bayesian Classifiers, in IEEE International Conference on Systems, Man and Cybernetics. 1994.
9. Baayen, H., et al., Back to the Cave of Shadows: Stylistic Fingerprints in Authorship Attribution, in The ALLC/ACH 2000 Conference. 2000, University of Glasgow.
10. de Vel, O., Mining E-mail Authorship, in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2000, Boston, MA, USA.
11. Vapnik, V.N., The Nature of Statistical Learning Theory. 1995, New York: Springer-Verlag.
12. Witten, I.H. and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. The Morgan Kaufmann Series in Data Management Systems. 2000, San Francisco, California, USA: Morgan Kaufmann Publishers.
13. Yang, Y., An Evaluation of Statistical Approaches to Text Categorization. Information Retrieval, 1999. 1(1): p. 67-88.
14. Joachims, T., Making Large-Scale SVM Learning Practical, in Advances in Kernel Methods - Support Vector Learning, B. Scholkopf, C.J.C. Burges, and A. Smola, Editors. 1999, MIT Press.
15. Tweedie, F. and H. Baayen, How Variable May a Constant Be? Measures of Lexical Richness in Perspective. Computers and the Humanities, 1998. 32(5): p. 323-352.
16. Chaski, C.E., Who Wrote It: Steps Toward a Science of Authorship Identification. National Institute of Justice Journal, 1997 (September): p. 15-22.