Sequential Aggregation of Textual Features for
Domain Independent Author Identification
Sekventiella textuella särdrag för ämnesoberoende
författarbestämning
LINDA ERIKSSON
[email protected]
June 22, 2014
Degree project in Computer Science
KTH Royal Institute of Technology
Stockholm, Sweden
Supervisor: Jussi Karlgren
Examiner: Jens Lagergren
Abstract
Sequential Aggregation of Textual Features
for Domain Independent Author Identification
In the area of Author Identification, many approaches have been proposed for identifying the author of a written text. By identifying the individual variation that can be found in texts, features can be calculated. These feature values are commonly calculated by normalizing them to an average value over the whole text. When this kind of Simple feature is used, much of the variation that can be found in texts is not captured. This project instead uses the sequential nature of the text to define Sequential features at sentence level. The theory is that the Sequential features will be able to capture more of the variation that can be found in the texts than the Simple features. To evaluate these features, a classification of authors was made on several different datasets. The results showed that the Sequential features perform better than the Simple features in some cases; however, the difference was not large enough to confirm the theory that they are better than the Simple features.
Sammanfattning
Sekventiella textuella särdrag för
ämnesoberoende författarbestämning
In the area of author identification, many different approaches have been used to identify the author of a written text. By identifying the individual variation that distinguishes texts from each other, different features can be calculated. The values of these features are usually calculated by normalizing them to a mean value over the whole text. When this kind of Simple feature is used, much of the variation that distinguishes texts from each other is hidden. The goal of this project is instead to use the sequential nature of the text as the basis for defining Sequential features at sentence level. The theory is that the Sequential features will be able to identify more of the variation that can be found in texts than the Simple features. To evaluate these features, a classification of authors was performed on several different datasets. The results showed that the Sequential features performed better than the Simple features in some cases, but the difference was not large enough to confirm the theory that they are better than the Simple features.
Table of Contents

1 Introduction
2 Background
   2.1 Finding individual variation
   2.2 Project description
3 Theory
   3.1 The Sequential Minimal Optimization algorithm
   3.2 Confusion matrix
4 Method
   4.1 Software
      4.1.1 Weka
   4.2 Datasets
      4.2.1 PAN14 dataset
      4.2.2 C50 dataset
      4.2.3 LAT94 dataset and GH95 dataset
      4.2.4 Tasa dataset
   4.3 Features
      4.3.1 Simple features
      4.3.2 Sequential features
   4.4 ARFF generation
   4.5 Text files
   4.6 Classification
   4.7 Macro average
   4.8 Confusion matrices
5 Experiments and Results
   5.1 Performance of features
   5.2 Clustering of authors
      5.2.1 C50 dataset
      5.2.2 LAT94 dataset
      5.2.3 GH95 dataset
6 Discussion
   6.1 Performance of individual features
   6.2 Unexpected results
   6.3 Simple vs Sequential features
   6.4 Classification performance on the PAN14 dataset
   6.5 Confusion matrices and clusters
   6.6 Genre and domain independence
7 Conclusion
Bibliography
Appendix
   A Confusion matrix C50
   B Confusion matrix LAT94
   C Confusion matrix GH95
   D Confusion matrix C50 with removed elements
   E Confusion matrix LAT94 with removed elements
   F Confusion matrix GH95 with removed elements
List of Tables

1  Simple features.
2  Binary Patterns.
3  Sequential features.
4  Performance of Simple features.
5  Performance of Sequential features.
6  Performance of Simple features on the PAN14 dataset.
7  Performance of Sequential features on the PAN14 dataset.
8  Cluster 1 in the LAT94 dataset.
9  Cluster 2 in the LAT94 dataset.
10 Cluster 1 in the GH95 dataset.
11 Cluster 2 in the GH95 dataset.
12 Cluster 3 in the GH95 dataset.
13 Performance of Simple features, taken from Table 4.
14 Performance of Sequential features, taken from Table 5.

List of Figures

1  Support Vector Machine.
2  Example of a Confusion matrix.
3  Example of an ARFF file.
4  Confusion matrix for the clusters in the LAT94 dataset.
5  Confusion matrix for the clusters in the GH95 dataset.
6  Confusion matrix for the C50 dataset.
7  Confusion matrix for the LAT94 dataset.
8  Confusion matrix for the GH95 dataset.
9  Confusion matrix for the C50 dataset where elements less than or equal to 5 are removed.
10 Confusion matrix for the LAT94 dataset where elements less than or equal to 5 are removed.
11 Confusion matrix for the GH95 dataset where elements less than or equal to 5 are removed.
1 Introduction
Much of the individual variation that can be found in written text can be used to distinguish authors from each other. This variation can be lexical, structural, bound to genre or dependent on extralinguistic factors, such as the author's age, gender or background. The uniqueness of an author's writing style can be established by studying this variation, and the writing style can in turn be used to identify the author of a written text. Author Identification is an important problem in many areas, such as law and journalism, where knowing the author of a document, e.g. a ransom note, can be of great importance [1].
Identifying the variation which makes each author unique can be a challenging
task, both for humans and when using a computational approach. Today most of
the measures on texts are calculated using the words or phrases in the text. These
measures are called features. These feature values can be normalized and averaged
over the whole text. These values can then be compared between texts with the
intention of distinguishing the texts from each other.
When the feature values are normalized and averaged, much of the variation that can be found in texts is not captured. Instead, this project uses the sequential nature of the text to define features at sentence level. The goal is to investigate whether these features are able to capture more of the individual variation that can be found in written text than features which are normalized and averaged over the whole text.
2 Background
This Section presents some background knowledge in the area of the project and
the project description.
2.1 Finding individual variation
In the area of Author Identification, features represent variation that can be found in texts and that distinguishes authors from each other. Selecting good features is important for obtaining good classification results. The study by Li et al. [2] investigated how to find the optimal features among a set of extracted features in an English and a Chinese dataset. By choosing only the features that perform well, they got the best possible performance from the classifier. For the English dataset, the set of optimal features gave a classification accuracy of 99.01%, compared to 97.85% with the full feature set. For the Chinese dataset they achieved a classification accuracy of 93.56% with the optimal features and 92.42% with the full feature set. The results showed that, for both datasets, the classification accuracy was better when using only the optimal features than when using the full feature set. The features were extracted from the datasets using a Genetic Algorithm: by applying the genetic operations implemented in a genetic algorithm, the solution containing the optimal features could be represented. Overall, Li et al. seemed to have a good approach to the task of sorting out the best features for a specific classification task. When trying to identify the author of a text, some sort of feature selection can be used; Adams et al. have investigated this approach [3]. Their approach was to use Author Identification techniques based on genetic and evolutionary feature selection, similar to what Li et al. did [2]. The difference between their approaches was that Adams et al. created a heuristic for feature selection while Li et al. introduced a method for identifying the key features of authors, with the intention of helping to trace author identities in cybercrime investigations [2].
Today, a common approach in machine learning to avoid the risk of overfitting is to use cross validation. This technique was used by Adams et al. [3]. Overfitting occurs when the model starts to describe the noise in the training data instead of the underlying statistical structure. This results in bad performance on the test data and a classifier that generalizes badly. When using cross validation the dataset is divided into smaller subsets, usually a training set, a test set and a validation set, and this sampling of test and training data is repeated a number of times. The training set is used for training, while the test set is used to analyse how well the trained classifier works on unseen instances (how well it generalizes to unseen data). The validation set is used during training to see how well the learner is learning; its purpose is to help define a stop condition that tells the learner when to stop.
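As an illustration of how such a cross validation can be run, a minimal sketch using Weka's Java API is given below. This was not the evaluation protocol used in this project, the classifier choice is arbitrary, and the file name train.arff is a placeholder.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationSketch {
    public static void main(String[] args) throws Exception {
        // Load a dataset in ARFF format (placeholder file name).
        Instances data = new DataSource("train.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // the last attribute is the class

        // 10-fold cross validation: the data is split into 10 folds and each fold
        // is used once as test data while the remaining folds are used for training.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new SMO(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}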
Gabrilovich and Markovitch [4] studied how to generate features based on domain-specific and common-sense knowledge. Using the Open Directory Project (ODP) [5] as background knowledge, they generated new features from the ODP. The texts were divided into series of contexts, where each context could be a sentence or a paragraph. Gabrilovich and Markovitch believed that looking at the whole text at once could be misleading and therefore chose to build the feature extractor on the sentences or paragraphs found in texts. These starting points are similar to the ones of this project.
Another way to generate features is to use lexical chains. Jayarajan et al. [6]
believed that by using a simple Bag of Words (BoW) model many of the linguistic
and semantic features in texts would be ignored. A BoW model ignores grammar
and word order, but keeps track of the number of times each word occurs. Instead of using a BoW model they used lexical chains which consisted of semantic
information that was encoded from the documents [6]. Their experiments showed
that by using these features they achieved better classification results compared
to when using features based on the BoW model. They also achieved a 30% reduction of the dimensionality of the feature vectors compared to the feature vectors generated with the BoW model.
Another approach to author identification was made by Huang et al. in [7]. They investigated the importance of several different kinds of features, such as Lexical, Syntactic, Structural and Content-specific features, when used for Author Identification on online messages. Their research showed that using all four of these feature types increased the performance of their classifier. They developed three classifiers using different classification techniques: Decision Trees, Back Propagation Neural Networks and Support Vector Machines (SVM). The testing was done on both an English and a Chinese dataset. The study showed that the classifier using the SVM algorithm outperformed the others on both datasets. On the English dataset the SVM classifier had a correct classification rate of 97.69% using all feature types, compared to 96.66% for the classifier using Back Propagation Neural Networks. The Decision Tree classifier was quite far behind the others and could only classify 93.36% of the authors in the English dataset correctly. Of the three approaches, the SVM classifier had the best performance in all tests. Huang et al. also had some interesting ideas about how differently a classifier can perform depending on the language of the dataset. The Chinese language is built up very differently from English; for example, it does not contain any spaces between words, which requires additional language processing to separate out the words compared to an English dataset [7]. This is important to consider when designing systems that should work with several different languages.
The study by Karlgren and Eriksson [8] was the initial inspiration for the project. The motivation for their study was to investigate whether features which are typically measured and then averaged over a whole text would capture more of the variation that can be found in texts if patterns of occurrence were measured instead. Using a sliding windows approach with different window lengths, they computed sequential features, which were then stored as binary patterns. In their study, this line of investigation was not pursued to a functional conclusion, and it remains to be investigated whether their approach, which showed some promising results, can be useful.
2.2 Project description
This degree project is motivated by some of the initial results presented by Karlgren and Eriksson [8]. The idea is to further investigate the value of sequential features defined by locally and sequentially aggregating observations of some linguistic item, using sliding windows of different lengths. The theory is that by using the sequential nature of the text instead of mean values, more of the variation that can be found in texts can be captured; mean values do not capture much of this variation. The sequential features shall be implemented using a sliding windows approach (see Section 4.3.2). Examples of features are word length, sentence length, use of adverbials and use of pronouns. The feature representation shall be able to identify variation in texts and it should be possible to use the feature representation as a background for Author Identification. All of the features used in the project were defined by the project supervisor, since the intention of this project was neither to evaluate the linguistic characteristics of text nor to invent new features.
3 Theory
This Section presents the classification algorithm used in the project and the representation that was used to present the classification results.
3.1 The Sequential Minimal Optimization algorithm
The Sequential Minimal Optimization (SMO) algorithm was introduced by John Platt and is an alternative algorithm for training a Support Vector Machine (SVM) [9]. The SVM algorithm is used in many different areas of machine learning and is one of the most popular algorithms [10, p. 119]. An SVM can transform the input to a high-dimensional space, which makes it possible to separate the classes linearly even when they are not linearly separable in the original input space. When using an SVM, the margin between the classes is maximized to make the probability of data points ending up in the wrong class as small as possible. Some of the advantages of an SVM are that it has very good generalization properties and works very well with few training samples [11]. One disadvantage is that SVMs are hard to implement efficiently [11]. Figure 1 illustrates a simple classification problem using an SVM classifier. The classes are separated by the optimal hyperplane, which has the maximum margin to both of the classes. The data points (filled) which lie closest to the margin in each class are called support vectors [10, p. 121].
[Figure: two classes of data points in the x-y plane, separated by the optimal hyperplane with the maximum margin between them.]

Figure 1: Support Vector Machine.
Instead of solving the optimization problem with a large quadratic programming loop, as a regular SVM training algorithm does, the SMO algorithm splits the problem into the smallest possible sub-problems, which are solved analytically. This makes the SMO algorithm faster and more scalable than the original SVM training algorithm [9].
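For reference, the optimization problem that SVM training algorithms, including SMO, solve can be written in its dual form as follows (a standard textbook formulation, not taken from this report):

\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i \;-\; \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)
\quad \text{subject to} \quad 0 \le \alpha_i \le C, \;\; \sum_{i=1}^{n} \alpha_i y_i = 0,

where y_i in {-1, +1} are the class labels, K is the kernel function and C is a regularization parameter. Instead of solving this quadratic program numerically, SMO repeatedly selects a pair of multipliers (alpha_i, alpha_j) and solves the resulting two-variable sub-problem analytically while keeping the equality constraint satisfied.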
3.2 Confusion matrix
The representation of results used in the project was Confusion matrices. A confusion matrix is a square matrix which visualizes the performance of a classifier. All of the classes that exist in the data are presented in the horizontal and vertical directions of the matrix [10, p. 32]. Figure 2 shows how the confusion matrices used in this project are represented. The vertical direction presents the actual class while the horizontal direction presents the predicted class. The main diagonal (marked with C) in the matrix therefore represents the number of correct classifications for the classes. All of the other elements (marked with x) in the matrix represent the number of incorrect classifications between the actual class and the predicted class defined by the row and column of the matrix.
 a  b  c  d  e  f  g  h  i  j    <-- classified as
 C  x  x  x  x  x  x  x  x  x  |  a = Class-1
 x  C  x  x  x  x  x  x  x  x  |  b = Class-2
 x  x  C  x  x  x  x  x  x  x  |  c = Class-3
 x  x  x  C  x  x  x  x  x  x  |  d = Class-4
 x  x  x  x  C  x  x  x  x  x  |  e = Class-5
 x  x  x  x  x  C  x  x  x  x  |  f = Class-6
 x  x  x  x  x  x  C  x  x  x  |  g = Class-7
 x  x  x  x  x  x  x  C  x  x  |  h = Class-8
 x  x  x  x  x  x  x  x  C  x  |  i = Class-9
 x  x  x  x  x  x  x  x  x  C  |  j = Class-10

Figure 2: Example of a Confusion matrix.
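From such a matrix the overall correct classification rate can be computed directly; with M_ij denoting the number of documents of actual class i that were predicted as class j, the standard definition (stated here for clarity) is

\text{correct classification rate} = \frac{\sum_i M_{ii}}{\sum_i \sum_j M_{ij}},

i.e. the sum of the diagonal elements divided by the sum of all elements.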
4 Method
This Section will describe all of the different methods that were used in the project.
4.1 Software
This Section will present the software that was used in the project.
4.1.1 Weka
Weka is a Data Mining software written in Java which contains a large collection of
Machine Learning algorithms [12]. Weka has a graphical user interface which makes
it possible to apply the algorithms directly to a dataset. The algorithms can also
be invoked in Java code [12], which was used in this project. The Weka software
is available at their website [13]. Weka uses a file format called Attribute-Relation
File Format (ARFF) to represent the data used for classification. Figure 3 presents
a small example of an ARFF file. This example has three attributes, two numeric
and one nominal. Each of the lines under @data represents one instance. In the
project each instance represents a text file and each attribute represents a feature.
To be able to use the classification algorithms implemented in the Weka Software
all the training and test data in the project had to be stored in this file format.
@relation name
@attribute att1 numeric
@attribute att2 numeric
@attribute att3 {yes, no}
@data
0,0,no
0,1,yes
1,0,yes
1,1,yes
Figure 3: Example of an ARFF file.
4.2 Datasets
This Section will present the datasets that were used in the project. One thing to
note is that these datasets are not designed to test the same question. One dataset
is designed to test the question “Did author X write this text?” whereas the other
datasets are designed to test the question “Who of these authors wrote this text?”.
Even though these questions are very similar, the results will be very different (see Section 5).
4.2.1 PAN14 dataset
PAN is a competition which is held every year and includes classification tasks such as Author Identification, Author Profiling and Plagiarism Detection. PAN has publicly available training datasets which can be downloaded from their website [14]. In this project the training dataset for the Author Identification task of 2014 was used. This dataset can be downloaded at [15]. It consists of a collection of text documents in four different languages: Dutch, Greek, English and Spanish. In this project the English dataset, containing 300 tasks, was used. Each task consists of one document with an unknown author and a number of documents with a known author. The goal is to correctly determine whether or not the unknown document belongs to the same author as the known documents. The training dataset also includes a file containing the correct answers, where each classification task is listed together with a letter: Y represents that the authors are the same and N that they are not.
4.2.2 C50 dataset
The C50 dataset was downloaded from the UCI Machine Learning Repository [16]. It consists of one training set and one test set; these sets do not overlap. Each of the sets contains 2500 documents (50 authors with 50 documents each) in text format. All of the documents are written in English and belong to the same subtopic, which minimizes the possibility of classifying documents by topic instead of by the unique features which represent each author.
4.2.3 LAT94 dataset and GH95 dataset
The LAT94 dataset and the GH95 dataset are two separate datasets which consist of a large number of articles from the Los Angeles Times from 1994 and the Glasgow Herald from 1995, respectively. These datasets have been used in CLEF information retrieval evaluation campaigns [17]. Each file contains multiple articles, and to be able to use these articles they had to be extracted into separate files and tagged with the name of the author. This resulted in 6685 articles for the LAT94 dataset and 10595 articles for the GH95 dataset. Articles that had no specified author or that had multiple authors were excluded. The remaining articles in the LAT94 and GH95 datasets were sorted by author, and the 50 authors with the largest number of articles in each dataset were used to create two new datasets. These new datasets were used to evaluate the performance of the classifier.
4.2.4 Tasa dataset
The Tasa dataset was created from a large text file containing school essays with unknown authors. The file was split up into a large number of smaller files: starting from the beginning of the file, a random number (between 30 and 50) of sentences was selected and a new file was generated, and this continued until the end of the Tasa file. The resulting dataset consists of 19901 files, each containing between 30 and 50 sentences. Since the starting and end points of the essays were not easily distinguishable, they could not be taken into account when splitting the file. The newly generated files can therefore contain text from several essays with different authors, which will create anomalies when trying to classify these files. These files were used as examples of texts not belonging to the known author in the PAN14 dataset.
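A minimal sketch of this kind of splitting is given below. The file names, the naive sentence-splitting rule and the random seed are assumptions made for illustration and do not reproduce the project's actual implementation.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Random;

public class TasaSplitter {
    public static void main(String[] args) throws IOException {
        String text = new String(Files.readAllBytes(Paths.get("tasa.txt")), StandardCharsets.UTF_8);
        // Very naive sentence splitting on ., ! and ? (an assumption for this sketch).
        String[] sentences = text.split("(?<=[.!?])\\s+");

        Random rnd = new Random(1);
        int fileNo = 0;
        int pos = 0;
        while (pos < sentences.length) {
            int chunk = 30 + rnd.nextInt(21);            // 30-50 sentences per output file
            int end = Math.min(pos + chunk, sentences.length);
            StringBuilder sb = new StringBuilder();
            for (int i = pos; i < end; i++) {
                sb.append(sentences[i]).append(' ');
            }
            Files.write(Paths.get("tasa_" + (fileNo++) + ".txt"),
                        sb.toString().getBytes(StandardCharsets.UTF_8));
            pos = end;
        }
    }
}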
4.3 Features
This section describes the features that were implemented in the project.
4.3.1 Simple features
The Simple features implemented in this project refer to features which calculate an average value over the whole text, e.g., the average word length. The Simple features also include features which count occurrences of words. These features were implemented as a test at the beginning of the project and were later used when comparing feature performances. Table 1 presents all of the Simple features implemented in the project; the column N shows how many features of each kind were implemented.
N  Feature description
1  Average sentence length: The average sentence length.
1  Average word length: The average word length.
1  Adverbial: The total number of adverbials.
1  Pronoun: The total number of pronouns.
1  Utterance: The total number of utterance verbs.
3  Ly-words: The total number of -ly words in the beginning, middle and end of a sentence. This feature does not look at all words ending with -ly, only the words defined in the list of words used for this specific feature.
7  Mendenhall length X: The Mendenhall feature [18] consists of 7 features where each feature records the number of words with length X. For the first 6 features X is in the range 1-6 and for the last feature X is > 6.

Table 1: Simple features.
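To illustrate how average-based values of this kind can be computed, the sketch below calculates the average word length and the Mendenhall word-length counts for a text. The tokenization rule is an assumption made for the example and is not taken from the project's implementation.

import java.util.Arrays;

public class SimpleFeaturesSketch {
    // Returns { averageWordLength, count(len=1), ..., count(len=6), count(len>6) } for a text.
    static double[] simpleFeatures(String text) {
        String[] words = text.toLowerCase().split("[^a-z]+"); // naive tokenization (assumption)
        double[] mendenhall = new double[7];                  // word lengths 1-6 and >6
        long totalChars = 0;
        int n = 0;
        for (String w : words) {
            if (w.isEmpty()) continue;
            n++;
            totalChars += w.length();
            int bucket = Math.min(w.length(), 7) - 1;         // lengths 7 and longer share the last bucket
            mendenhall[bucket]++;
        }
        double avgWordLength = n == 0 ? 0.0 : (double) totalChars / n;
        double[] features = new double[8];
        features[0] = avgWordLength;
        System.arraycopy(mendenhall, 0, features, 1, 7);
        return features;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(simpleFeatures("The cat sat on the mat.")));
    }
}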
4.3.2 Sequential features
The Sequential features used an approach called Sliding Windows, where the text was divided into sequences of sentences which from here on are referred to as windows. For a window of size X, the following X sentences in the text were considered. The windows were generated by iterating through the text and moving the starting point one sentence forward in each iteration step, until the end of the text was reached. The features were applied to the sentences contained in each window. All of the features were represented by a binary condition: if the condition was met, the sentence got the value 1, otherwise it got the value 0. This gave a binary pattern of ones and zeros for the specified window size. For all of the Sequential features, window sizes 1-4 were used, and the total number of binary combinations was

2^1 + 2^2 + 2^3 + 2^4 = 30,

where each of the combinations is one of the binary patterns presented in Table 2.
Size  Binary patterns
1     0, 1
2     00, 01, 10, 11
3     000, 001, 010, 011, 100, 101, 110, 111
4     0000, 0001, 0010, 0011, 0100, 0101, 0110, 0111, 1000, 1001, 1010, 1011, 1100, 1101, 1110, 1111

Table 2: Binary Patterns.
Each feature value was recorded as an integer in the range of 0–100. This value
corresponded to the percentage of occurrences for each binary pattern. All of the
Sequential features that were implemented are presented in Table 3.
N   Feature description
30  Sentence length: Measures the pattern of long vs. short sentences. A sentence is considered to be long if the number of characters is more than 30 (including white spaces, separators etc.).
30  Word length: Measures the pattern of sentences with long vs. short words. A sentence is considered to have long words if the average word length is >= 5.
30  Coherence: Measures the coherence value between sentences. A sentence is considered to be coherent with the following sentences if 60 % or more of the words reoccur.
30  Pronoun: Measures if each sentence contains any pronouns (I, you, he, she etc.).
30  First person pronoun: Measures if each sentence contains any First person pronouns (I, we, me etc.).
30  Second person pronoun: Measures if each sentence contains any Second person pronouns (you, yours etc.).
30  Third person pronoun: Measures if each sentence contains any Third person pronouns (he, she, their etc.).
30  Relative pronoun: Measures if each sentence contains any Relative pronouns (that, when, which etc.).
30  Intensive/Reflexive pronoun: Measures if each sentence contains any Intensive/Reflexive pronouns (myself, yourself, itself etc.).
30  Adverbial: Measures if each sentence contains any Adverbials (after, almost etc.).
30  Clausal adverbial: Measures if each sentence contains any Clausal adverbials (suddenly, immediately, apparently etc.).
30  Utterance: Measures if each sentence contains any Utterance verbs (acknowledge, admit, allow etc.).
30  Amplifiers: Measures if each sentence contains any Amplifiers (absolutely, altogether, completely etc.).
30  Hedges: Measures if each sentence contains any Hedges (apparently, appear, around etc.).
30  Enneg: Measures if each sentence contains any Negative sentiment terms (abhor, abuse, alarm etc.).
30  Enpos: Measures if each sentence contains any Positive sentiment terms (adore, agree, amazing etc.).
30  Negation: Measures if each sentence contains any Negations (cannot, nor, not etc.).
30  Think verbs: Measures if each sentence contains any Think verbs (fear, hope etc.).
30  Overt emotion: Measures if each sentence contains any Explicit emotional terms (anger, fury etc.).
30  Time expressions: Measures if each sentence contains any Time expressions (monday, tuesday etc.).
30  Auxes and Modals: Measures if each sentence contains any Auxiliary or Modal verbs (will, would etc.).
30  Overt signals: Measures if each sentence contains any Overt signals of X (next, last etc.). Overt signals are explicit textual and discourse markers which are used to bind together bits of the text to a coherent whole.
30  Present tense: Measures if each sentence contains any Present tense (do, does etc.).
30  Past tense: Measures if each sentence contains any Past tense (did, was etc.).
30  Words ending with -ing: Measures if each sentence contains any words ending with -ing.
30  Introduced: Measures in each sentence if all of the words with a "the" in front of them have been introduced before (in the same window).

Table 3: Sequential features.
The following two definitions (in quotation marks) were given by the project supervisor Jussi Karlgren as a clarification of the Present tense, Past tense and Introduced features.
“Since English has a very spare morphology, the tense of the main verb cannot
be determined with any precision, at least not without considerable computation.
For this purpose the text is sampled, and certain common verbs are used to identify
the tense of the clause (e.g. Have vs Had).”
“For each word token immediately preceded by the, checks sentence in the sliding window to see if that word has been used previously in the window. This is
intended to capture the tendency of some authors to use terms in definite form,
assuming their reference to be inferable from discourse context without explicit
introduction.”
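To make the sliding-window representation concrete, the sketch below computes the pattern percentages for one binary condition and one window size. The condition (a sentence is long if it has more than 30 characters) corresponds to the Sentence length feature in Table 3, but the code is an illustrative reconstruction and not the project's actual implementation.

import java.util.HashMap;
import java.util.Map;

public class SequentialFeatureSketch {
    // For a given window size, returns the percentage (0-100) of windows that show
    // each binary pattern, e.g. "01" = a short sentence followed by a long one.
    static Map<String, Integer> patternPercentages(String[] sentences, int windowSize) {
        Map<String, Integer> counts = new HashMap<>();
        int windows = sentences.length - windowSize + 1;
        for (int start = 0; start < windows; start++) {
            StringBuilder pattern = new StringBuilder();
            for (int i = start; i < start + windowSize; i++) {
                // Binary condition: a sentence is "long" if it has more than 30 characters.
                pattern.append(sentences[i].length() > 30 ? '1' : '0');
            }
            counts.merge(pattern.toString(), 1, Integer::sum);
        }
        Map<String, Integer> percentages = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            percentages.put(e.getKey(), Math.round(100f * e.getValue() / windows));
        }
        return percentages;       // patterns that never occur are simply absent (value 0)
    }

    public static void main(String[] args) {
        String[] sents = {
            "Short one.",
            "This sentence is considerably longer than thirty characters.",
            "Another short one.",
            "And one more fairly long sentence to complete the small example text."
        };
        System.out.println(patternPercentages(sents, 2));
    }
}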
4.4 ARFF generation
To be able to use the classification algorithms implemented in Weka, all the training and test data had to be stored in the ARFF format (see Section 4.1.1). Several different versions of an ARFF generator had to be implemented, depending on the format of the training and test sets used in the project.
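A minimal sketch of how such an ARFF file can be generated with Weka's own classes is given below (Weka 3.7-style API; the attribute names, values and output file are placeholders rather than the project's actual generator).

import java.io.File;
import java.util.ArrayList;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.core.converters.ArffSaver;

public class ArffGenerationSketch {
    public static void main(String[] args) throws Exception {
        // Two numeric feature attributes and one nominal class attribute.
        ArrayList<Attribute> attributes = new ArrayList<>();
        attributes.add(new Attribute("feature1"));
        attributes.add(new Attribute("feature2"));
        ArrayList<String> authors = new ArrayList<>();
        authors.add("author1");
        authors.add("author2");
        attributes.add(new Attribute("author", authors));

        Instances data = new Instances("example", attributes, 0);
        data.setClassIndex(data.numAttributes() - 1);

        // One instance per text file: feature values followed by the class value.
        double[] values = new double[data.numAttributes()];
        values[0] = 4.6;
        values[1] = 0.42;
        values[2] = authors.indexOf("author1");
        data.add(new DenseInstance(1.0, values));

        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("example.arff"));
        saver.writeBatch();
    }
}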
4.5 Text files
All of the words used to define each feature were specified in text files, one for each
feature. These text files were parsed in the implementation and the words were
compared to the words in the texts.
4.6 Classification
The classification algorithm used in the project was the SMO algorithm (see Section 3.1). The algorithm was already implemented in Weka and could be invoked from the Java code.
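In outline, training the SMO classifier on a training ARFF file and evaluating it on a separate test ARFF file can be done as sketched below; the file names are placeholders and the sketch is not the project's exact code.

import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClassificationSketch {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("train.arff").getDataSet();
        Instances test = new DataSource("test.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        SMO classifier = new SMO();
        classifier.buildClassifier(train);        // train the SVM with the SMO algorithm

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(classifier, test);     // evaluate on the unseen test set
        System.out.println(eval.toSummaryString()); // correct classification rate etc.
        System.out.println(eval.toMatrixString());  // confusion matrix
    }
}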
4.7 Macro average
Two different techniques can be used to calculate the correct classification rate
of a classifier. The Macro average technique uses equal weight for classes while
Micro average uses equal weight for data points. In this project the Macro average
measure which is implemented in Weka was used.
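Stated as formulas (standard definitions, added here for clarity and not quoted from Weka's documentation), with K classes, n_k test documents of class k and TP_k correctly classified documents of class k:

\text{macro average} = \frac{1}{K} \sum_{k=1}^{K} \frac{TP_k}{n_k}, \qquad
\text{micro average} = \frac{\sum_{k=1}^{K} TP_k}{\sum_{k=1}^{K} n_k}.

The macro average thus weights every class equally, regardless of how many documents it contains, while the micro average weights every document equally.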
4.8 Confusion matrices
Confusion matrices (see Section 3.2) were used in the project to present the classification results. These matrices were also used to identify clusters of authors (see
Section 5.2).
5 Experiments and Results
The following Section presents the performance of the classifier and some experiments on the datasets presented in Section 4.2.
5.1 Performance of features
All of the features that were implemented were tested independently on all datasets.
The C50 dataset consists of a training and a test set. The training set contains
2500 text documents with 50 unique authors having 50 documents each. The test
set has the same authors and the same number of documents as the training set.
The documents in the training and the test set do not overlap. The LAT94 dataset contains 6685 text documents with 50 unique authors, each of whom has between 88 and 319 documents. The GH95 dataset contains 10595 text documents with 50 unique authors, each of whom has between 131 and 372 documents. Both the LAT94 and the GH95 datasets were randomly split in half; one half was used for training and the other for testing.
Table 4 presents the performance of the Simple features on these datasets. The columns in the table present the correct classification rates for each feature. It shows that the feature with the best performance on all three datasets in Table 4 was the Mendenhall length X feature. By itself it was able to correctly classify 9.32% of the C50 documents, 12.66% of the LAT94 documents and 13.42% of the GH95 documents. Intuitively, the Mendenhall feature vector will capture the distribution of the shortest words in the text, many of which are grammatical function words and not words based on content, such as “a”, “is” and “was”. The good performance of this feature testifies to individual variation in the grammatical structure of the texts in the datasets. When combining all of the
Simple features the performance rose to 11.12% of the C50 documents, 16.90% of
the LAT94 documents and 18.54% of the GH95 documents.
The performance of the Sequential features is presented in Table 5. It shows that
the best feature in all three datasets was the Introduced feature. By itself it was
able to classify 4.88% of the C50 documents, 10.63% of the LAT94 documents and
7.30% of the GH95 documents. When combining all of the Sequential features the
performance rose to 12.68% of the C50 documents, 18.59% of the LAT94 documents and 23.48% of the GH95 documents.
Table 6 shows the performance of the Simple features on the PAN14 dataset. It
shows that the best feature was the Ly-words feature which had a performance of
59.33%. The performance of all the features combined was 55.67%.
Table 7 shows the performance of the Sequential features on the PAN14 dataset.
It shows that the best feature was the Sentence length feature which had a performance of 60.00%. The performance of all the features combined was 55.00%.
The classification rates in Tables 6 and 7 cannot be compared to the rates in Tables 4 and 5, since the PAN14 dataset consisted of binary classification tasks while the other datasets had one correct class out of 50 (see Section 4.2). This makes the probability of an incorrect classification much higher in the other datasets.
Feature                       C50 (%)   LAT94 (%)   GH95 (%)
Average sentence length        4.6000      6.5640     4.7565
Average word length            2.8800      6.5640     5.3983
Adverbial                      3.4000      4.7633     4.2658
Pronoun                        2.3200      5.3732     6.7950
Utterance                      2.1200      4.7633     3.9449
Ly-words                       2.2000      4.8504     3.3409
Mendenhall length X            9.3200     12.6634    13.4202
ALL FEATURES                  11.1200     16.9039    18.5353

Table 4: Performance of Simple features.
Feature                        C50 (%)   LAT94 (%)   GH95 (%)
Sentence length                 4.1200      8.6552     5.7569
Word length                     4.2800      7.9582     6.3231
Coherence                       4.5200      9.4685     5.7380
Pronoun                         3.5600      6.0122     6.6629
First person pronoun            2.3200      5.6056     5.0019
Second person pronoun           2.0800      5.0537     3.1899
Third person pronoun            2.9200      6.2155     6.4553
Relative pronoun                3.6400      4.3276     4.9830
Intensive/Reflexive pronoun     2.1200      4.9085     3.1899
Adverbial                       4.8400      8.0744     5.6059
Clausal adverbial               2.3600      4.9666     3.1144
Utterance                       2.9200      4.7052     3.6240
Amplifiers                      2.2000      4.8795     3.1333
Hedges                          2.0400      4.6762     4.1148
Enneg                           3.3600      5.0247     4.6433
Enpos                           2.5200      5.6346     4.3790
Negation                        2.6400      4.8214     3.5863
Think verbs                     2.9200      4.7052     3.6240
Overt emotion                   2.1200      4.9376     3.2654
Time expressions                2.8800      5.6346     3.9071
Auxes and Modals                3.9600      7.5225     6.2288
Overt signals                   2.5600      5.2570     4.6810
Present tense                   3.8800      5.9251     5.0396
Past tense                      3.2400      5.1990     4.7565
Words ending with -ing          2.5200      8.6262     5.8513
Introduced                      4.8800     10.6303     7.3046
ALL FEATURES                   12.6800     18.5884    23.4806

Table 5: Performance of Sequential features.
Feature                    PAN14 (%)
Average sentence length        49.00
Average word length            50.33
Adverbials                     47.67
Pronouns                       47.67
Utterance                      49.00
Ly-words                       59.33
Mendenhall length X            48.67
ALL FEATURES                   55.67

Table 6: Performance of Simple features on the PAN14 dataset.
Feature                        PAN14 (%)
Sentence length                    60.00
Word length                        52.67
Coherence                          55.33
Pronoun                            49.00
First person pronoun               45.33
Second person pronoun              49.00
Third person pronoun               52.67
Relative pronoun                   53.00
Intensive/Reflexive pronoun        50.33
Adverbial                          54.00
Clausal adverbial                  53.67
Utterance                          50.67
Amplifiers                         53.00
Hedges                             53.33
Enneg                              56.33
Enpos                              47.00
Negation                           57.67
Think verbs                        50.67
Overt emotion                      51.00
Time expressions                   52.67
Auxes and Modals                   53.00
Overt signals                      45.00
Present tense                      50.67
Past tense                         49.67
Words ending with -ing             49.33
Introduced                         54.33
ALL FEATURES                       55.00

Table 7: Performance of Sequential features on the PAN14 dataset.
5.2 Clustering of authors
The results from the classification were presented in confusion matrices (see Appendix A-C). In these matrices, elements placed on the main diagonal represent the number of documents that were correctly classified in each class. In these matrices some clusters of authors could be observed. These clusters contained authors whose documents were often incorrectly classified as belonging to each other. To help identify these clusters, all of the incorrect classifications between a pair of authors that were less than or equal to 5 were removed. These new matrices are presented in Appendix D-F. The following subsections present the clusters found in the confusion matrices for the C50, LAT94, and GH95 datasets.
5.2.1 C50 dataset
In the C50 dataset no clusters of authors could be found. The original confusion
matrix can be seen in Appendix A and the confusion matrix with the removed
elements can be seen in Appendix D.
5.2.2 LAT94 dataset
In the LAT94 dataset two clusters of authors could be found. In these clusters
the authors were often confused with each other whereas they were not as often
confused with authors not belonging to the same cluster. The first cluster from the
LAT94 dataset is presented in Table 8 and the second cluster is presented in Table
9. In these tables the number of correctly and incorrectly classified documents in
the clusters are presented. The values in the table are obtained from the original
confusion matrix in Appendix B. The rows in the tables present the actual class and the columns present the predicted class. Each author in the table is represented by the letter combination given by the classifier, which can be seen in the confusion matrix in Appendix B.
Author    i    q   ac
i        16    5    5
q         3   36    9
ac       12   22   18

Table 8: Cluster 1 in the LAT94 dataset.
Author    p   ab
p        22   10
ab       10   37

Table 9: Cluster 2 in the LAT94 dataset.
After clustering some of the authors together, each of these clusters could be seen as a representation of a type of author. One idea that came up during these experiments was that it would be interesting to see if it was easier to identify a type of author rather than a specific author. To test this theory, all of the documents in the dataset belonging to the authors in the clusters were merged into a new dataset where each cluster represented a type of author. On this dataset a new classification was made, and the result of this classification can be seen in Figure 4. Looking at the confusion matrix, it can be seen that the number of correct classifications was much larger than the number of incorrect classifications. The number of correctly classified instances was 392 and the number of incorrectly classified instances was 48, which gives a correct classification rate of 89.09%.
   a    b    <-- classified as
  93   17  | a = CLUSTER-2
  31  299  | b = CLUSTER-1

Figure 4: Confusion matrix for the clusters in the LAT94 dataset.
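The correct classification rate reported above follows directly from the matrix in Figure 4, where the diagonal elements are the correct classifications:

\frac{93 + 299}{93 + 17 + 31 + 299} = \frac{392}{440} \approx 0.8909 = 89.09\,\%.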
5.2.3 GH95 dataset
For the GH95 dataset the same approach as for the LAT94 dataset was used. First the clusters were identified by looking at the confusion matrix for the GH95 dataset in Appendix F; in this case three clusters were found. These clusters are presented in Tables 10, 11 and 12. The values are taken from the original confusion matrix in Appendix C.
Author    f   ad   as   at
f        87    6    2    9
af       12   53    2   36
as       15    7    6   14
at       20   28    1   72

Table 10: Cluster 1 in the GH95 dataset.
Author    e    z   aj
e        56   21   25
z        17   97   10
aj       27   17   69

Table 11: Cluster 2 in the GH95 dataset.
Author    q    w
q        60   10
w        22   27

Table 12: Cluster 3 in the GH95 dataset.
The next step was to merge all of the documents in each cluster and then perform a new classification on these clusters. The result of the classification using the new clusters is presented in Figure 5. Looking at the confusion matrix in Figure 5, it can be seen that 1154 documents were correctly classified and 114 documents were incorrectly classified, which gives a correct classification rate of 91.01%.
    a    b    c    <-- classified as
  457   14   21  | a = CLUSTER-2
   26  164   15  | b = CLUSTER-3
   28   10  533  | c = CLUSTER-1

Figure 5: Confusion matrix for the clusters in the GH95 dataset.
6 Discussion
This Section discusses the results presented in Section 5.
6.1 Performance of individual features
The performance of the individual features shown in Tables 4–7 varies considerably
across datasets and feature types. This can depend on the type of documents, the
amount of training data, how large the variation between the authors in the datasets
is, and many more factors. When studying the performance of all Sequential features combined, it can be seen that the best performance was achieved on the GH95 dataset, where the correct classification rate was 23.48%. As described in Section 5.1, the GH95 dataset consisted of 50 different authors and 10595 articles, where each author had between 131 and 372 articles. It can be seen that the classifier performed well
above the performance of a random classifier. Since each author in the dataset does
not have exactly the same number of articles, the prior probability for each class
is not the same. A completely random classifier with equal prior probabilities
for each class would have had a correct classification rate of around 2.0% for a
dataset containing 50 classes. In the GH95 case the prior probability for each
class will vary and the probability for a document belonging to a class with many
documents will be higher than the probability of a document belonging to a class
with a smaller number of documents. Even in this case it should be safe to say that
a random classifier applied to the GH95 dataset would have a performance rate of
2.0% plus or minus a few percentage points. This value can be compared to the
classifier using the Sequential features which had a performance rate of 23.48% on
the GH95 dataset. It could almost classify 1 out of 4 documents correctly using
only Sequential features. This performance rate is much higher than it would be
using a random classifier. When studying the performance of the classifier using
the Sequential features on the other two datasets in Table 5 it can be seen that it
also performs well above the performance rate of a random classifier on the other
two datasets.
6.2 Unexpected results
While the performance of the classifier was well above random, it was far lower than expected and is likely not to be of practical use in its present form. The low performance can depend on many different factors, such as the types of features that were implemented, the classification algorithm, the feature representation, or other implementation choices. The Sliding Windows approach and the choice to represent the feature value for each binary pattern (see Section 4.3.2) as a percentage of how often it occurs in the text could also be implementation choices that need to be questioned. Looking at the individual performance of the features, some of them perform quite well on their own, such as the Introduced feature on the LAT94 dataset, which had a performance rate of 10.63%. Considering the quite good performance of many of the individual features, it was expected that the performance of all Sequential features combined would be higher than it was for the three datasets in Table 5. As mentioned above, the best feature was the Introduced feature, which is strongly bound to the sequential nature of the text. The Introduced feature, as explained in Table 3, measures in each sentence whether all of the words with a “the” in front of them have been introduced: each word token preceded by “the” is noted, and if at least one occurrence of that token has been observed in the preceding sliding window, that word is considered to have been introduced by the author. It may be that more features that are strictly bound to the sequential nature of the text need to be implemented to increase the performance of the classifier. Another feature that supports this theory is the Coherence feature, which had a performance rate of 9.47% on the LAT94 dataset, making it the second best Sequential feature on this dataset. It is also strictly bound to the sequential nature of the text, since it measures how coherent a sentence is with the other sentences in the same window. A sentence is considered to be coherent with the other sentences in the window if 60.0% or more of the words reoccur more than once. This is worth keeping in mind when choosing which features to implement.
6.3 Simple vs Sequential features
The purpose of this project was to implement and evaluate a representation of Sequential features. The theory was that these features can capture more of the individual variation that distinguishes texts from each other, compared to more common types of features which are represented as mean values over the whole text. Some of the features in the project were implemented both as Sequential features and as Simple features (see Sections 4.3.1 and 4.3.2): the Adverbial, Pronoun and Utterance features presented in Tables 1 and 3. The performance of these features is presented in Tables 13 and 14 (the complete tables can be seen in Section 5.1). Studying these performance rates, it can be seen that in six out of nine cases the Sequential features performed better than the Simple features, in some cases considerably better. It should also be noted that in the three cases where the Simple features performed better than the Sequential features, the difference in performance was much smaller than in the opposite cases. This is a promising result that encourages further studies in the area, even though the theory that the Sequential features are better than the Simple features could not be confirmed.
Feature       C50 (%)   LAT94 (%)   GH95 (%)
Adverbial      3.4000      4.7633     4.2658
Pronoun        2.3200      5.3732     6.7950
Utterance      2.1200      4.7633     3.9449

Table 13: Performance of Simple features, taken from Table 4.
Feature       C50 (%)   LAT94 (%)   GH95 (%)
Adverbial      4.8400      8.0744     5.6059
Pronoun        3.5600      6.0122     6.6629
Utterance      2.9200      4.7052     3.6240

Table 14: Performance of Sequential features, taken from Table 5.
6.4 Classification performance on the PAN14 dataset
One of the datasets used for evaluating the classifier was the PAN14 dataset described in Section 4.2.1. The structure of this dataset was very different from the structure of the other datasets used in this project. The PAN14 dataset consists of a number of classification tasks where each task has a number of documents with a known author and one document with an unknown author. This author can either be the same as the author of the known documents or not. This classification task was hard since there were only example documents for one of the classes. Using the SMO algorithm in Weka with only one example class resulted in all unknown documents being predicted to belong to the same class as the known documents, which was not desirable. When using the SMO algorithm in Weka there is a possibility to define a cost matrix which penalizes the classifier differently depending on which class is incorrectly classified. This cost matrix was used in an effort to force the classifier not to guess on the known class just because it had more training data: the value in the cost matrix corresponding to an incorrect classification of the unknown class was set higher than the other values. Even when penalizing the algorithm in this way, it still guessed on the known class every time. At this point another dataset had to be introduced which could be used as example texts not belonging to the known class. The dataset that was used was the Tasa dataset described in Section 4.2.4. A number of documents from this dataset were randomly chosen to represent the other class during classification. The classification performance is presented in Tables 6 and 7. Studying these tables, it is obvious that the performance is very bad on this dataset compared to the other datasets: since all of the classification tasks in this dataset are binary classifications, a correct classification rate of around 50-60% is only slightly better than chance. The bad performance on this dataset can depend on several factors; the most probable reason is the Tasa dataset. The Tasa files were generated from one large file containing many unlabelled student essays. As described in Section 4.2.4, the essays have no easily distinguishable starting and end points, nor any defined authors. This means that when this dataset is used as a representation of the other class, the documents can contain text from several essays written by different authors, which makes it hard to find any similarities between the features in this class. Another problem is that the SMO algorithm must choose one of the classes as the predicted class, and it is not very likely that the unknown document will have any similarities with the documents from the Tasa dataset. To get good classification results on the PAN14 dataset it is important to find a suitable dataset for representing the “other” class when using the SMO algorithm. Another possibility would be to try other machine learning algorithms or other classification techniques for this kind of classification task.
6.5 Confusion matrices and clusters
The final classification results were presented as confusion matrices (see Appendix A-F); these matrices show the predicted vs the actual class of all documents in the test datasets. As described in Section 5.2, clusters of authors could be found in the confusion matrices for the LAT94 and GH95 datasets. These clusters can be seen as a representation of a type of author. The clusters defined a new dataset where each cluster became a new class containing all of the documents of all of the authors in the cluster. The theory behind this experiment was that it would be easier to correctly classify a type of author than a specific author. The result of the experiment showed that the correct classification rate increased when applying the classifier to the clusters, in line with the theory. For the LAT94 dataset, containing 2 clusters, the correct classification rate was 89.09%, and for the GH95 dataset, containing 3 clusters, it was 91.01%. These results are very good compared to the results on the original datasets (see Table 5), where the performance was 18.59% and 23.48%, respectively. One thing to note is that the classification in this case is a bit easier, since the new datasets only had 2 or 3 classes while the original datasets had 50, which makes the probability of an incorrect classification much higher in the original datasets, since there were more classes to choose from. Even if the classification on the new datasets created from the clusters of authors is easier, a major improvement in performance could be seen. These experiments support the theory that, on the datasets used in this project, it is easier to classify a type of author than individual authors. In this experiment it could not easily be seen what specific variation in the authors' writing styles made them end up in the same cluster, but this could be an interesting aspect to evaluate in future studies.
6.6 Genre and domain independence
If lexical features are used, i.e., features based on occurrences of individual words, the classifier can partially start to model the genre in which the text is written instead of modelling the writing style of the author. This is a common approach in authorship attribution tasks. It may lead to excellent classification results on test data sets where documents of the same genre and same domain are being tested, but it is unclear whether it will translate to useful real-life performance. In this project none of the implemented features kept track of the words that were used by the authors. In this way, the classifier will be more independent of the genre and topical domain of the texts during classification than a classifier that keeps track of which words are used by the authors. The performance of the classifier could probably be increased if such features were implemented, but modelling genres was not the purpose of this project.
7 Conclusion
This project investigated the individual variation (features) that can be found in written text. The theory was that by using the sequential nature of the text and defining features at sentence level, it could be possible to capture more of the variation that distinguishes texts from each other. Features defined as mean values over the whole text (Simple features) do not capture this kind of variation, and a large amount of information that could be used to distinguish the texts from each other is therefore lost when such features are used.

The features were evaluated by constructing a classifier with the purpose of identifying the author of text documents in a number of datasets. The experiments showed that the Sequential features perform better than the Simple features in some cases; however, the difference was not large enough to confirm the theory that they are better than the Simple features.

Experiments also showed that by identifying clusters of authors and generating new datasets where the classes were defined by these clusters, the performance of the classifier could be improved.

It should also be noted that authorship attribution is a challenging task for humans as well. In many machine learning tasks the computational approach is to scale up tasks that humans do well to large datasets, but this project addresses a task which humans find quite difficult. The features under consideration are not easily observed by humans, and even if they were observed they can be very hard to explain, even if the reader would note that a text can be attributed to some author.

Even if the theory that the Sequential features are better than the Simple features could not be confirmed, the results encourage further studies in the area. One part that can be investigated further is why the performance of the Sequential features did not improve more when aggregating them: is the problem in the feature representation or somewhere else? The clustering of authors is also a possible subject for further studies: what similarities can be found between the authors in each cluster? Due to time constraints these questions were not evaluated further in the project, but they can serve as ideas for further studies in the area.
Bibliography
[1] PAN 2014 Website. http://www.uni-weimar.de/medien/webis/research/events/pan-14/pan14-web/author-identification.html. (Visited on 06/01/2014).
[2] J. Li, R. Zheng, and H. Chen. "From fingerprint to writeprint". In: Communications of the ACM 49.4 (2006), pp. 76–82.
[3] J. Adams, H. Williams, J. Carter, and G. Dozier. "Genetic Heuristic Development: Feature selection for author identification". In: 2013 IEEE Symposium on Computational Intelligence in Biometrics and Identity Management (CIBIM). Piscataway, NJ, USA, 2013, pp. 36–41.
[4] E. Gabrilovich and S. Markovitch. Feature Generation for Text Categorization Using World Knowledge. 32000 Haifa, Israel: Computer Science Department, Technion, 2005.
[5] The Open Directory Project. http://www.dmoz.org/docs/en/about.html. (Visited on 06/01/2014).
[6] D. Jayarajan, D. Deodhare, and B. Ravindran. Lexical Chains as Document Features. Centre for Artificial Intelligence and Robotics, Defence R&D Organisation, Bangalore, India; Dept. of CSE, IIT Madras, Chennai, India, 2008.
[7] Z. Huang, R. Zheng, J. Li, and H. Chen. "A framework for authorship identification of online messages: writing-style features and classification techniques". In: Journal of the American Society for Information Science and Technology 57.3 (2006), pp. 378–93. ISSN: 1532-2882.
[8] J. Karlgren and G. Eriksson. "Authors, Genre, and Linguistic Convention". In: SIGIR Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (2007).
[9] J. C. Platt. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. In: Advances in Kernel Methods - Support Vector Learning, 1998.
[10] S. Marsland. Machine Learning: An Algorithmic Perspective. Chapman & Hall/CRC, 2009.
[11] Ö. Ekeberg. Support Vector Machines. http://www.csc.kth.se/utbildning/kth/kurser/DD2431/ml12/schedule/05-svm-2x2.pdf. 2012. (Visited on 02/27/2014).
[12] Weka 3: Data Mining Software in Java. http://www.cs.waikato.ac.nz/ml/weka/index.html. (Visited on 04/14/2014).
[13] Weka 3, Download. http://www.cs.waikato.ac.nz/ml/weka/downloading.html. (Visited on 04/14/2014).
[14] PAN 2014 Website. http://pan.webis.de/. (Visited on 04/14/2014).
[15] PAN 2014 Training dataset. http://www.webis.de/research/corpora/corpus-pan-labs09-today/pan-14/pan14-data/pan14-author-verification-training-corpus-2014-04-03.zip. (Visited on 04/14/2014).
[16] UCI Machine Learning Repository, Reuter 50 50 Dataset. https://archive.ics.uci.edu/ml/datasets/Reuter_50_50. (Visited on 04/14/2014).
[17] The Cross-Language Evaluation Forum. http://www.clef-initiative.eu/. (Visited on 06/05/2014).
[18] T. C. Mendenhall. "The characteristic curves of composition". In: Science IX (1887), pp. 237–249.
Appendix
This section contains the confusion matrices from Section 5 that were too large to include in the body of the report.
A Confusion matrix C50
[Table not reproduced in this rendering: the full confusion matrix for the C50 dataset, in which each row corresponds to the true author and each column (labelled a–ax) to the author the documents were classified as.]
Figure 6: Confusion matrix for the C50 dataset.
B Confusion matrix LAT94
[Table not reproduced in this rendering: the full confusion matrix for the LAT94 dataset, in which each row corresponds to the true author and each column (labelled a–ax) to the author the documents were classified as.]
Figure 7: Confusion matrix for the LAT94 dataset.
C Confusion matrix GH95
[Table not reproduced in this rendering: the full confusion matrix for the GH95 dataset, in which each row corresponds to the true author and each column (labelled a–ax) to the author the documents were classified as.]
Figure 8: Confusion matrix for the GH95 dataset.
D Confusion matrix C50 with removed elements
[Table not reproduced in this rendering: the C50 confusion matrix with all elements less than or equal to 5 removed, in which each row corresponds to the true author and each column (labelled a–ax) to the predicted author.]
Figure 9: Confusion matrix for the C50 dataset where elements less than or equal to 5 are removed.
E Confusion matrix LAT94 with removed elements
[Table not reproduced in this rendering: the LAT94 confusion matrix with all elements less than or equal to 5 removed, in which each row corresponds to the true author and each column (labelled a–ax) to the predicted author.]
Figure 10: Confusion matrix for the LAT94 dataset where elements less than or equal to 5 are removed.
F Confusion matrix GH95 with removed elements
[Table not reproduced in this rendering: the GH95 confusion matrix with all elements less than or equal to 5 removed, in which each row corresponds to the true author and each column (labelled a–ax) to the predicted author.]
Figure 11: Confusion matrix for the GH95 dataset where elements less than or equal to 5 are removed.