Identifying the Authors of Suspect Email

Malcolm Corney *
Email: [email protected]
Phone: +61 7 3864 4304

Alison Anderson *
Email: [email protected]
Phone: +61 7 3864 2465

George Mohay *
Email: [email protected]
Phone: +61 7 3684 1964

Olivier de Vel **
Email: [email protected]
Phone: +61 8 8259 7254

* Information Security Research Centre
Faculty of Information Technology
Queensland University of Technology
GPO Box 2434, Brisbane 4001
AUSTRALIA

** Defence Science and Technology Organisation
PO Box 1500
Edinburgh SA 5111
AUSTRALIA
Abstract
In this paper, we present the results of an investigation into identifying the authorship of
email messages by analysis of the contents and style of the email messages themselves. A set
of stylistic features applicable to text in general and an extended set of email-specific
structural features were identified. A Support Vector Machine learning method was used to
discriminate between the authorship classes. Through a series of baseline experiments on
non-email data, it was found that approximately 20 email messages of approximately 100
words each should be sufficient to discriminate authorship in most cases. These results were
confirmed with a corpus of email data, and performance was further enhanced when a set of
email-specific features was added. This outcome has important implications
in the management of such problems as email abuse, anonymous email messages and
computer forensics.
Keywords: Authorship Attribution, Email, Stylometry, Machine Learning, Support Vector
Machine, Data Mining
1. Introduction
Email is not only the most frequently used Internet application; it is also transforming
workplace behaviour. In particular, there has been a noticeable shift in the way workers treat
electronic documents as opposed to paper documents. Analyses of email writing style [1]
situate email communication somewhere between the informality of the spoken word and the
relative formality of an official memo or letter. Producing an email message involves so little
effort that workers are encouraged to use email as their dominant form of communication,
and the speedy, casual nature of email provides an environment for pushing the boundaries of
acceptable behaviour. The act of constructing and sending an
email message, however, is more significant than may be realised, since stored email is
potentially a persistent record of an act, an instruction, a response, an intention or even an
attitude.
As the usage of email has grown, so too have the responsibilities of employers and workplace
administrators who have never before had the task of enforcing so many sociolegal policies.
Email evidence can be central in cases of sexual or racial harassment, threats, bullying and
even unfair dismissal. Email can also be critical evidence for or against accusations of
copyright violation, plagiarism and disputed authorship. Nonetheless, email can be spoofed
or anonymised, and therein lies the central problem of successfully using email texts as the
formal record of any event.
While some work has been done on preventing email spoofing and anonymization in local,
intranet settings, this does not address the prime difficulty we face with email in the wider
sphere, where enforced identification and authentication are socially and technically vexed.
Hence, while enforced authentication of all email may arguably be desirable, it is currently
infeasible. In the meantime, policy administrators need tools to demonstrate that social, legal
and company rules are and will be enforced and tools to identify email authorship where
spoofing or anonymization may have occurred.
This paper describes the results of applying a machine learning technique using a Support
Vector Machine (SVM) to the problem of identifying possible authors of a suspect email.
SVM-based identification appears to give useful results on text samples as small as 100 to
200 words, especially when email construction habits are taken into account. Although not
yet appropriate for use in formal forensics, for example in court cases, it promises to become
a useful adjunct in policy enforcement.
2. Context of this Research
Employers often permit some private email usage by their employees as long as efficiency
does not suffer. Recent cases [2] confirm, however, that the employer is entitled to monitor
computer-based activities for behaviour that may be illegal or against company rules. While
the monitoring of network traffic to identify potential computer intrusions is now well
established, it is now clear that message content analysis can be an important adjunct to
traffic analysis to identify illegal activity. Such activity is very likely to be insider activity
and may be against policy, rather than a direct compromise of computer security. Typically,
such cases need to be investigated rather than prevented. However, little has been done to
show how data mining techniques might be used to identify the authors from their email style
in cases where authorship of such email is spoofed or disputed.
An obvious way in which to approach this problem is to exploit research in literary
stylometry, i.e. authorship identification from writing style analysis. In this paper, we present
the results of SVM application to email messages, using stylometric features that have
previously been recognized as successful in the case of ordinary text. We describe
experiments which test the reliability of certain style markers and also email structure as
author identifiers, and which establish the smallest possible text length giving useful results.
The experiments described include analysis of both email and non-email texts.
The paper reviews relevant work in stylometry and use of the SVM learning framework and
then shows how these ideas have been integrated for developing email specific stylometrics
amenable to machine learning techniques. Following this, we present an account of a
sequence of controlled experiments aimed at defining the limitations and scope of the SVM
learning software in this context. Finally, we discuss the forensic and practical implications
of email content surveillance.
3. Stylometry and Email
Stylometry, a development of literary stylistics, assumes that an author has distinctive writing
habits and that these are exhibited in features like core vocabulary use, sentence complexity
and phraseology. Another assumption is that these habits are unconscious, so ingrained that
they are difficult to conceal deliberately, or at least require a conscious effort to do so. Further,
stylometry seeks to establish methods for style feature extraction and associated metrics for
assessing text similarity [3].
An example demonstrates that with restricted choice, any experienced reader can apply a
kind of stylometry. Which of the following examples was written more recently? Are there
two authors here or only one? Which example was written by a native English speaker?
Telltale features include fluency, grammar, syntax, spelling, punctuation and vocabulary.
Example A
"England a commercial country! Yes; as Venice was. She may excel other nations in
commerce, but yet it is not that in which she most prides herself, in which she most excels."

Example B
"Just another query about the results to the project, will results be available by 14th Dec?
Because if not I was told that i will not be able to attend the graduations. If you will be
willing to write the letter for me, I can meet up with you for any further discussions."
The general problem of deciding about authorship of an anonymous text, based on
comparisons with known authorship texts is, however, far more challenging. An anonymous
text sample may indeed have been written by some other author entirely. Stylometry uses
metrics for features such as fluency, grammar, syntax and spelling. Based on such metrics,
an assessment of how similar or dissimilar the anonymous text is to the known authorship
texts can be made. Hence, its success depends both on having quantities of known-author
text available and restricting choice to a small set of possible authors. Its techniques also
require texts to be comparable in the sense that they come from similar genres: poetry, prose,
drama etc. Traditionally, stylometry has been applied in cases of disputed literary authorship,
such as the Federalist Papers [4, 5], a longstanding problem in stylometry involving the
authorship of anonymous essays by one of three known authors.
According to [6], at least 1000 proposals for "style markers" exist in stylometric research.
The following incomplete list of example markers shows that they may be categorised as
character-based, word-based, sentence-based, document-based, structural or syntactic. These
and similar markers have been tested with mixed success either alone, in combination or
augmented by grammatical or syntactic mark-up:
• Letter frequencies
• n-gram frequencies (overlapping n-character sequences)
• Function word usage (short structure-determining words: common adverbs, auxiliary
  verbs, conjunctions, determiners, numbers, prepositions and pronouns)
• Vocabulary richness (number of different words used)
• Lexical richness (word frequency as a function of full text)
• Distribution of syllables per word
• Word frequencies
• Hapax legomena (words used once only)
• Hapax dislegomena (words used twice only)
• Word length distribution
• Word collocations (words frequently used together)
• Sentence length
• Preferred word positions
• Prepositional phrase structure
• Distribution of parts of speech
• Phrasal composition grammar
While most style metrics appear to work some of the time (i.e. on some texts and some
authors), the most reliably successful features have, in general, been function words and
n-grams. A number of successful experiments with function words have been reported [4, 5,
7]. N-grams overlap to some extent with function words, since frequent short words weigh
heavily in the counts, but n-gram frequencies also take into account some punctuation and
other structural properties of the text. Most reports, e.g. [8], indicate that 2-grams or 3-grams
give the best discrimination. The effectiveness of n-grams derives from the fact that an
n-gram profile is a successful summary marker, one that can substitute for other markers: it
captures something about the author's favourite vocabulary as well as sentence structure.
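
As an illustration, the following is a minimal sketch of how overlapping character 2-gram
frequencies might be computed. It uses plain Python; the function name and normalization
are our own illustrative choices, not an implementation from the experiments reported here.

```python
from collections import Counter

def ngram_frequencies(text, n=2):
    """Relative frequencies of overlapping character n-grams.

    Punctuation and spacing are retained, so the profile reflects
    structural habits as well as favourite vocabulary.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return {}
    counts = Counter(grams)
    total = len(grams)
    return {gram: count / total for gram, count in counts.items()}

# Example: the 2-gram profile of a short sample.
profile = ngram_frequencies("She may excel other nations in commerce.")
```

Comparing two such normalized profiles (e.g. by cosine similarity) gives a crude measure of
stylistic closeness between text samples.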
A recent report [9], however, suggests that even successful style markers may nonetheless be
sensitive to differences in genre, topic etc. when the text corpus is small. These results
question the existence of authorial fingerprints, at least for small samples.
Clearly, therefore, there are problems in applying stylometry to email authorship, given that
typical email messages are small. From the above, it would seem that differing genre, topic,
learning corpus size and text sample size could all confound identification. The degree to
which email texts are sensitive to these effects is not known. The current work extends some
initial investigations into the classification of email message authorship [10], and it is for the
reasons outlined above that we have conducted a series of controlled experiments on both
non-email and email texts, so as to identify the extent of this sensitivity and, at the same time,
to establish which style markers are most effective.
Fortunately, email has structural features that pure text lacks. While some of these features
are covered by style markers (paragraphing and tabbing), others such as the usage of greeting
and/or farewell text and the inclusion of a signature may be as habitual as vocabulary or
syntax. In the experiments described below, we examined the effects of using stylometry's
most promising style markers in conjunction with an analysis of features unique to email.
These email structural features must be identified with care in order to allow for the widest
possible variety of message formats and to account for adulterating included text. Our
chosen email features are listed in Section 5, “Conduct of the Experiments”.
We acknowledge that the features described above can be deliberately disguised, but point
out that it may be difficult to disguise all of them at once, in the same message, without
considerable planning and drafting and that such planning may leave a trail of its own.
4. Experimental Approach
Our experimental approach is based on the fact that the stylistic features from an email
message and the structural features from the layout of an email message can be reduced to a
set of numerical values by data mining the text and other fields of the email message being
analysed. We then hypothesise that these values can be thought of as defining a pattern of
authorship. If this authorship profile or pattern exists, a machine learning technique can be
used to discriminate between different authors. Classification of documents of unknown
authorship can then be attempted.
Support vector machines [11] are based on the structural risk minimization principle from
computational learning theory. The idea of structural risk minimization is to find a
hypothesis that guarantees the lowest true error for a classification problem. The true error of
the hypothesis is the probability that the learnt classifier model will make an error on an
unseen and randomly selected test example.
The values of the features selected for a classification task are transformed into a hyperspace
and the support vector machine finds a hyperplane that separates the positive and negative
examples from the training set with a maximum margin. The hyperplane that separates the
examples is based on a kernel function that can be linear, polynomial, a radial basis function
or any other function that the user chooses. The training examples that are closest to the
hyperplane and thus define the hyperplane are called Support Vectors.
The SVM classification technique is limited to performing binary classifications. When the
classification problem has more than two classes, one of the classes is made the positive class
and all others are made negative and a classifier model is learned. For n classes, n classifier
models are learned with each class being made the positive class for its corresponding
classifier model. The learnt classifier models generated by the Support Vector Machine can
then be used to classify unseen test examples. The output of this classification step
determines whether the test example belongs to a particular class or not.
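
The experiments reported below used SVMlight [14]; purely as a hedged illustration of the
one-classifier-per-class scheme just described, the following sketch uses scikit-learn's
LinearSVC as a stand-in. All names here are illustrative, not code from the experiments.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_one_per_class(X, y):
    """Learn one binary SVM per authorship class (one-vs-rest).

    For each class, its examples are labelled positive and all
    other classes' examples negative, as described above.
    """
    models = {}
    for cls in np.unique(y):
        labels = np.where(y == cls, 1, -1)
        model = LinearSVC()  # linear kernel; sklearn.svm.SVC offers polynomial or RBF kernels
        model.fit(X, labels)
        models[cls] = model
    return models

def classify(models, x):
    """Assign a test example to the class whose classifier scores it highest."""
    scores = {cls: m.decision_function(x.reshape(1, -1))[0] for cls, m in models.items()}
    return max(scores, key=scores.get)
```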
When analysing the results of classification, a data point can belong to the positive or
negative class and in either case it can be classified correctly or incorrectly. The
classification result for a single data point can be considered to be a true positive, true
negative, false positive or false negative result. The results can be collected into a two-way
confusion matrix, as shown in Figure 1. It is necessary to produce a two-way confusion
matrix for each authorship class in the experiment.

                              Predicted Class
                              Positive                   Negative
  Actual Class   Positive     True Positive frequency    False Negative frequency
                 Negative     False Positive frequency   True Negative frequency

Figure 1    Two-way Confusion Matrix
A set of statistics can then be calculated from the two-way confusion matrix to determine the
level of success of the classification experiment [12]. The error rate gives an indication of
how many data points were incorrectly classified. The precision statistic measures the impact
of false positives on the results and the recall statistic measures the impact of false negatives.
The precision and recall statistics can be combined into a single statistic, F1, by calculating
the harmonic mean of the precision and recall values. Formulae for these statistics are shown
below.
$$\text{Error Rate } (E) = \frac{FP + FN}{TP + FP + FN + TN}$$

$$\text{Precision } (P) = \frac{TP}{TP + FP}$$

$$\text{Recall } (R) = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2 \times R \times P}{R + P}$$
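
These formulae translate directly into code. The following sketch is a plain transcription of
the definitions above; the zero-denominator guards are our addition, not part of the paper's
definitions.

```python
def classification_stats(tp, fp, fn, tn):
    """Error rate, precision, recall and F1 from a two-way confusion matrix."""
    error_rate = (fp + fn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * recall * precision / (recall + precision) if (recall + precision) else 0.0
    return {"E": error_rate, "P": precision, "R": recall, "F1": f1}

# Example: 18 TP, 2 FP, 3 FN, 77 TN gives E = 0.05 and F1 of roughly 0.878.
stats = classification_stats(tp=18, fp=2, fn=3, tn=77)
```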
To get an indication of the overall success of a classification experiment, the macro-averaged
error rate and F1 statistics [13] can be calculated across the authorship classes using the
following formulae:

$$F_1^{(M)} = \frac{\sum_{i=1}^{n} F_{1,AC_i}}{n} \qquad \text{and} \qquad E^{(M)} = \frac{\sum_{i=1}^{n} E_{AC_i}}{n}$$

where $AC_i$ is the author class ($i = 1, 2, \ldots, n$).
To compensate for document frequency, the statistics for each authorship class are inversely
weighted by the number of data points in each class [10]. The weighted macro-averaged F1
and error rate can then be calculated:

$$F_1^{(M)} = \frac{\sum_{i=1}^{n} (1 - w_{AC_i})\, F_{1,AC_i}}{n-1} \qquad \text{and} \qquad E^{(M)} = \frac{\sum_{i=1}^{n} (1 - w_{AC_i})\, E_{AC_i}}{n-1}$$

where $w_{AC_i}$ is the document frequency weight:

$$w_{AC_i} = \frac{N_{AC_i}}{\sum_{i=1}^{n} N_{AC_i}}$$

and $N_{AC_i}$ is the number of documents in authorship class $AC_i$.
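
A minimal sketch of this weighted averaging, assuming the per-class statistics have already
been computed (the names are illustrative):

```python
def weighted_macro_average(f1_per_class, error_per_class, docs_per_class):
    """Weighted macro-averaged F1 and error rate.

    Each class statistic is weighted by (1 - w_i), where w_i is the
    fraction of all documents in class i; the weights sum to n - 1,
    which is why the denominator is n - 1 rather than n.
    """
    n = len(docs_per_class)
    total = sum(docs_per_class)
    weights = [1 - d / total for d in docs_per_class]
    f1 = sum(w * f for w, f in zip(weights, f1_per_class)) / (n - 1)
    error = sum(w * e for w, e in zip(weights, error_per_class)) / (n - 1)
    return f1, error

# Example: three authorship classes holding 50, 30 and 20 documents respectively.
f1_m, e_m = weighted_macro_average([0.95, 0.90, 0.85], [0.03, 0.06, 0.09], [50, 30, 20])
```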
5. Conduct of the Experiments
The methodology used in these experiments was first tested by running baseline tests on
large documents within a single genre, to determine whether or not authorship could be
classified using the proposed scheme. The documents used were a corpus of Information
Technology PhD theses written by three authors. A series of experiments was run on these
documents to identify the most useful features and to determine how much text was required
for reliable authorship categorization. These experiments are detailed in Sections 5.1 to 5.3.
Following completion of these baseline experiments, further tests were conducted on a corpus
of email messages, as outlined in Section 5.4. The email corpus contains 253 email messages
in total from 4 authors, amounting to over 23,200 words. The email messages vary in length
from 0 to 964 words, with an average length of 92 words. No attempt was made to control
the topic or style of these email messages, and their topics are therefore quite diverse.
5.1 Feature Selection
A list of features that have been used in work by other researchers in the field of authorship
attribution was compiled. Only those features that had been indicated to be successful were
considered. This list of features was split into sets with a similar basis (character-based,
word-based, document-based, function words, word length frequency). In total, 184 stylistic
features were used in these experiments. Definitions of the stylistic features are shown in
Table 1. In the table, the following definitions apply:
• C = total number of characters in the document chunk or email message
• N = total number of tokens (i.e. words) in the document or email message
• V = total number of distinct words in the document or email message
Email messages, by nature, do not contain a constant number of words from message to
message. To compensate for this variation, the features were normalized, where possible, as
the ratio of the frequency of some property (e.g. the number of upper-case letters) to a
summary property (e.g. the total number of letters).
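
To make this normalization concrete, here is a minimal sketch for a handful of the
character-based features of Table 1. The exact character classes used in the experiments are
not specified in the paper, so the predicates below are assumptions.

```python
import string

def character_features(text):
    """A few of Table 1's character-based features, each normalized by C."""
    C = len(text) or 1  # guard against division by zero on empty text
    letters = sum(ch.isalpha() for ch in text)
    upper = sum(ch.isupper() for ch in text)
    digits = sum(ch.isdigit() for ch in text)
    spaces = sum(ch.isspace() for ch in text)
    punct = sum(ch in string.punctuation for ch in text)
    return [letters / C, upper / C, digits / C, spaces / C, punct / C]
```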
These tests were carried out on the PhD theses using individual feature sets and combinations
of those sets. The documents were split into chunks of text containing 1000 words each. The
values of the features proposed for discrimination of authorship were mined from the
documents and prepared using 10-fold stratified cross-validation [12] for classification with
SVMlight [14]. Table 2 shows the classification results for some feature sets and for some
combinations of feature sets.
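
A sketch of the chunking and evaluation set-up follows, again using scikit-learn as a
stand-in for SVMlight; feature mining is abstracted away, and all names are illustrative.

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def chunk_words(document, chunk_size=1000):
    """Split a document into consecutive, non-overlapping chunks of chunk_size words."""
    words = document.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words) - chunk_size + 1, chunk_size)]

# X: one mined feature vector per chunk; y: the author of each chunk.
# For integer class labels, cross_val_score stratifies the 10 folds by class:
# scores = cross_val_score(LinearSVC(), X, y, cv=10)
```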
Character-based (C), 10 features:
• total number of characters in words / C
• total number of letters (a – z) / C
• total number of upper-case characters / C
• total number of digit characters / C
• total number of white-space characters / C
• total number of space characters / C
• total number of space characters / total number of white-space characters
• total number of tab characters / C
• total number of tab characters / total number of white-space characters
• total number of punctuation characters / C

Word-based (W), 20 features:
• average word length (in characters)
• vocabulary richness (V / N)
• total number of function words / N
• total number of short words (1 – 3 characters) / N
• count of hapax legomena / N
• count of hapax legomena / V
• count of hapax dislegomena / N
• Guiraud's R*
• Herdan's C*
• Rubet's K*
• Maas' A*
• Dugast's U*
• Luk'janenkov & Neistov's measure*
• Brunet's W*
• Honore's H*
• Sichel's S*
• Yule's K*
• Simpson's D*
• Herdan's V*
• Entropy*

Document-based (D), 2 features:
• number of blank lines / total number of lines
• average sentence length (in words)

Function word frequency distribution (F), 122 features:
• frequency of each function word / N

Word length frequency distribution (L), 30 features:
• frequency of occurrence of words of length 1 to 30 / N

Table 1    List of Stylistic Features
* These features are implemented as described in Tweedie & Baayen [15]
5.2 Number of Words per Document Chunk
As email messages often have fewer than 100 words, it was necessary to demonstrate that the
technique is capable of reliably discriminating authorship between small chunks of text, i.e.
chunks of a size comparable with typical email messages. The feature sets found to be best
for discriminating authorship in the experiments of Table 2 were then used to determine the
authorship of texts of decreasing size. The documents from the previous experiments with
the PhD theses were split into chunks of 100, 200 and 500 words and the classification tests
were re-run. The classification results are shown in Figure 2 for the set of function word
features, for the set of 2-grams, and for a combination of the following feature sets:
character-based, word-based, document-based, function word frequencies and word length
distribution frequencies.
Feature Sets                                  Weighted Macro-      Weighted Macro-
                                              Averaged Error       Averaged F1
                                              Rate (%)             (%)
Character-based (C)                           8.5                  86.7
Word-based (W)                                6.9                  89.4
Function Word Frequency (F)                   2.7                  95.6
Word Length Frequency Distribution (L)        12.7                 81.4
2-grams                                       0.8                  98.8
C + W + F                                     1.2                  98.0
C + W + L                                     4.1                  93.8
C + F + L                                     0.3                  99.6
F + L + W                                     0.3                  99.6
C + F + L + W + Document-based (D)            0.3                  99.6

Table 2    The Effect of Feature Sets on Classification of Authorship
[Figure 2 plots F1 (%) against the number of words per chunk (0 to 1000) for three feature
sets: function words, 2-grams, and the combined C + W + D + F + L set.]

Figure 2    The Effect of Chunk Size on Classification Results for Different Feature Sets

5.3 Number of Documents per Author
When applying the methodology to a corpus of email messages, a set of authentic email
messages from each suspected author is needed. One desired result of this experimentation
was to identify how many samples of text are needed from each author for reliable
classification. The following set of tests was performed to find this minimum number of
email messages per authorship class. The PhD theses were split into between 10 and 50
chunks of 200 and 500 words by randomly sampling the documents, ensuring that each
document chunk was mutually exclusive of the others. Each set of data was tested using a
combination of all of the stylistic feature sets to determine the results of classification. These
results are shown in Figure 3.
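
One plausible realisation of this mutually exclusive sampling is sketched below; the paper
does not give its exact procedure, so the approach and names here are assumptions.

```python
import random

def sample_exclusive_chunks(document, num_chunks, chunk_size, seed=0):
    """Randomly sample mutually exclusive chunks of chunk_size words.

    The document is first divided into non-overlapping candidate chunks,
    then a random subset is drawn, so no two sampled chunks share any text.
    """
    words = document.split()
    candidates = [" ".join(words[i:i + chunk_size])
                  for i in range(0, len(words) - chunk_size + 1, chunk_size)]
    rng = random.Random(seed)
    return rng.sample(candidates, min(num_chunks, len(candidates)))
```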
[Figure 3 plots F1 (%) against the number of chunks per author (0 to 100) for chunks of 500
words and chunks of 200 words.]

Figure 3    The Effect of Number of Data Points on Classification Results

5.4 Classification of Email Messages
After the baseline tests on non-email data were completed, a set of experiments was then
carried out on the corpus of email messages detailed previously. The corpus of email
messages was initially tested using the basic stylistic features of Table 1, followed by further
tests where extra features were mined from the structure of each email message. The details
of these structural features are shown in Table 3.
Email-based (E), 5 features:
• position of requoted text in the message (7 possibilities)
• absence / presence of greeting text
• absence / presence of farewell text
• absence / presence of signature
• number of attachments

HTML tag frequency distribution (H), 22 features:
• frequency of each HTML tag / total number of HTML tags

Table 3    List of Email Message Structure Features
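
As an illustration of how the presence/absence features might be mined, here is a minimal
sketch. The word lists and heuristics are illustrative assumptions, not the detectors used in
the experiments, and the requote-position and HTML features are omitted for brevity.

```python
GREETINGS = ("hi", "hello", "dear", "g'day")          # illustrative word lists only
FAREWELLS = ("regards", "cheers", "thanks", "sincerely")

def email_structure_features(body, num_attachments):
    """Sketch of some of the email-based (E) features from Table 3."""
    lines = [ln.strip().lower() for ln in body.splitlines() if ln.strip()]
    has_greeting = float(bool(lines) and lines[0].startswith(GREETINGS))
    has_farewell = float(any(ln.startswith(FAREWELLS) for ln in lines[-3:]))
    has_signature = float("\n-- \n" in body)          # conventional "-- " signature delimiter
    return [has_greeting, has_farewell, has_signature, float(num_attachments)]
```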
A comparison of results from the classification of these email messages tested with and
without the use of the extra structural features is shown in Table 4.
Feature Set                                   Weighted Macro-      Weighted Macro-
                                              Averaged Error       Averaged F1
                                              Rate (%)             (%)
Function Words                                19.8                 57.1
Function Words + Email Structure              9.6                  79.7
C + F + L + W                                 17.3                 62.9
C + F + L + W + Email Structure               8.2                  82.4

Table 4    Comparison of Classification of Email With and Without Email Structural Features

6. Discussion of Results
In the conduct of our experiments, the best single feature for discrimination of authorship has
consistently been the set of function words. Some tests were performed where the different
types of function words, such as adverbs, auxiliary verbs, prepositions and pronouns, were
used as separate feature sets. The preposition and pronoun function word sets performed
slightly better than the others but better results were obtained when the full set of function
words was used.
Intuitively, the word length frequency distribution should give some discrimination between
authorship classes, as this distribution could be linked to an author’s level of education.
However, of all of the stylistic features used, the word length frequency distribution is the
poorest feature set when used alone.
The use of n-grams as a feature set provides very good classification results. It seems,
however, that n-grams discriminate between authors' documents based on content rather than
on style alone. Some tests were conducted where an unseen document from an author was
used as the test set for a series of learned classifiers. The unseen document had a very high
error rate when 2-grams were used as the features, but the error rate was lower when the
stylistic features defined in Table 1 were used. The two documents had different topics, and
it seems that topic-specific words inflate the 2-gram frequencies of the different documents,
so that discrimination occurs on this basis.
The best classification results are achieved when all of the feature sets in Table 1 are
combined, and the incremental addition of feature sets shows a persistent improvement in
classification. Most of the stylistic features mined from text are normalized with respect to
the number of words in the document, so for small documents many of the features will have
values of zero. There must therefore be some threshold number of words that a document
must contain before it can be classified. The results from the work on the number of words
required for reliable classification, shown in Figure 2, indicate that good results are achieved
with 200 words per document, and that as few as 100 words may be satisfactory in some
cases. This is an encouraging result, as these sizes approximate the length of many email
messages.
It is interesting to note from Figure 2 that there is little difference in classification results for
different chunk sizes when 2-grams are used as the feature set. This is again thought to be
due to various 2-gram frequencies being inflated by content-specific words in the documents.
This effect requires further investigation.
In some cases, the number of email messages available from a suspect author may be low.
Before the classification of suspect email messages can take place, it is necessary to know
the minimum number of messages required to determine a pattern of authorship. The results
from the tests on the number of documents required, displayed in Figure 3, show that once
the number of documents reaches 20, for documents of 200 and 500 words, the classification
efficiency seems to level off. This is once again an encouraging result, though it requires
confirmation with a larger corpus of data.
The previous results were all obtained on report-style documents and, for all tests, the
features were mined from equally sized segments of these documents. The data set of email
messages could not be controlled in this manner. Fortunately, email messages do have a
structure, and this structure can be utilised as another set of features. Table 4 shows the
classification results from tests undertaken when the email-specific features (greeting,
farewell and signature use, HTML use, attachments and requote position) are added to the
stylistic features that can be mined from the email message text. The improvement in
classification of the email messages due to the features mined from their structure is quite
dramatic. The email messages used in these tests were not controlled in any manner; it may
be that better results could be obtained if a minimum number of words were imposed on the
data.
7. Conclusion
Our experiments have shown that it is possible to carry out effective authorship identification
of typical email messages in some circumstances. The extent to which this is possible in
wider circumstances is still to be determined.
The experiments have shown that the classification of documents from different authors with
an SVM machine-learning algorithm provides a systematic way of determining the relative
effectiveness of raw style markers. Our experiments show that a selected subset of these
markers from the field of stylistics may be effective enough to create an authorship
identification tool; more importantly, we have shown that they are good discriminators at an
email-sized level.
These experiments also show that the natural structure of email provides additional
author-specific features, and these in conjunction with stylometry give a better result. Stepwise SVM
experiments are a convenient way to expose the marginal improvements of individual
features and feature combinations and to show whether these are worth using on the
population to hand. More experimentation is needed to decide whether authors have
particular sensitivity to particular markers or more importantly, whether the learner can be
spoofed.
As yet, the SVM authorship identifier does not approach the status of a forensic stylistics
tool. Expert evidence such as forensic linguistics can only be considered 'scientific' in the
legal sense if it has court-accepted attributes, i.e. empirical testing, known error rates,
standard procedures, peer review etc. [16]. The SVM learner, however, is a convenient
platform for establishing the first two of these. Meanwhile, as a tool for narrowing a suspect
list, it has applications beyond email abuse and by excluding whole classes of suspects can
suggest avenues for investigation by other means.
8. References
1. Baron, N., Letters by Phone or Speech by Other Means: The Linguistics of Email.
Language and Communication, 1998. 18: p. 133-170.

2. Sallis, P.J., Computer-mediated Communication: Experiments with E-mail
Readability. Information Sciences, 2000. 123: p. 45-53.

3. Burrows, J.F., Computers and the Study of Literature, in Computers and Written Text,
C. Butler, Editor. 1992, Blackwell: Oxford. p. 167-204.

4. Mosteller, F. and D. Wallace, Applied Bayesian and Classical Inference: The Case of
the Federalist Papers. 1984, New York: Springer-Verlag.

5. Tweedie, F., S. Singh, and D.I. Holmes, Neural Network Applications in Stylometry:
The Federalist Papers. Computers and the Humanities, 1996. 30(1): p. 1-10.

6. Rudman, J., The State of Authorship Attribution Studies: Some Problems and
Solutions. Computers and the Humanities, 1998. 31(4): p. 351-365.

7. Craig, H., Authorial Attribution and Computational Stylistics: If You Can Tell
Authors Apart, Have You Learned Anything About Them? Literary and Linguistic
Computing, 1999. 14(1): p. 103-113.

8. Kjell, B., Authorship Attribution of Text Samples using Neural Networks and
Bayesian Classifiers, in IEEE International Conference on Systems, Man and Cybernetics.
1994.

9. Baayen, H., et al., Back to the Cave of Shadows: Stylistic Fingerprints in Authorship
Attribution. 2000, The ALLC/ACH 2000 Conference, University of Glasgow.

10. de Vel, O., Mining E-mail Authorship, in ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining. 2000. Boston, MA, USA.

11. Vapnik, V.N., The Nature of Statistical Learning Theory. 1995, New York:
Springer-Verlag.

12. Witten, I.H. and E. Frank, Data Mining: Practical Machine Learning Tools and
Techniques with Java Implementations. The Morgan Kaufmann Series in Data Management
Systems. 2000, San Francisco, California, USA: Morgan Kaufmann Publishers.

13. Yang, Y., An Evaluation of Statistical Approaches to Text Categorization. Journal of
Information Retrieval, 1999. 1(1): p. 67-88.

14. Joachims, T., Making Large-Scale SVM Learning Practical, in Advances in Kernel
Methods - Support Vector Learning, B. Scholkopf, C.J.C. Burges, and A. Smola, Editors.
1999, MIT Press.

15. Tweedie, F. and H. Baayen, How Variable May a Constant Be? Measures of Lexical
Richness in Perspective. Computers and the Humanities, 1998. 32(5): p. 323-352.

16. Chaski, C.E., Who Wrote It: Steps Toward a Science of Authorship Identification.
National Institute of Justice Journal, 1997 (September): p. 15-22.