Sequential Aggregation of Textual Features for Domain Independent Author Identification
Sekventiella textuella särdrag för ämnesoberoende författarbestämning

LINDA ERIKSSON
[email protected]
June 22, 2014

Degree project in Computer Science
KTH Royal Institute of Technology, Stockholm, Sweden
Supervisor: Jussi Karlgren
Examiner: Jens Lagergren

Abstract

In the area of Author Identification, many approaches have been made to identify the author of a written text. By identifying the individual variation that can be found in texts, features can be calculated. These feature values are commonly calculated by normalizing them to an average value over the whole text. When this kind of Simple feature is used, much of the variation that can be found in texts is not captured. This project instead uses the sequential nature of the text to define Sequential features at sentence level. The theory is that the Sequential features will be able to capture more of the variation that can be found in texts, compared to the Simple features. To evaluate these features, a classification of authors was made on several different datasets. The results showed that the Sequential features perform better than the Simple features in some cases, but the difference was not large enough to confirm the theory that they are better than the Simple features.

Table of Contents

1 Introduction
2 Background
  2.1 Finding individual variation
  2.2 Project description
3 Theory
  3.1 The Sequential Minimal Optimization algorithm
  3.2 Confusion matrix
4 Method
  4.1 Software
    4.1.1 Weka
  4.2 Datasets
    4.2.1 PAN14 dataset
    4.2.2 C50 dataset
    4.2.3 LAT94 dataset and GH95 dataset
    4.2.4 Tasa dataset
  4.3 Features
    4.3.1 Simple features
    4.3.2 Sequential features
  4.4 ARFF generation
  4.5 Text files
  4.6 Classification
  4.7 Macro average
  4.8 Confusion matrices
5 Experiments and Results
  5.1 Performance of features
  5.2 Clustering of authors
    5.2.1 C50 dataset
    5.2.2 LAT94 dataset
    5.2.3 GH95 dataset
6 Discussion
  6.1 Performance of individual features
  6.2 Unexpected results
  6.3 Simple vs Sequential features
  6.4 Classification performance on the PAN14 dataset
  6.5 Confusion matrices and clusters
  6.6 Genre and domain independence
7 Conclusion
Bibliography
Appendix
  A Confusion matrix C50
  B Confusion matrix LAT94
  C Confusion matrix GH95
  D Confusion matrix C50 with removed elements
  E Confusion matrix LAT94 with removed elements
  F Confusion matrix GH95 with removed elements

List of Tables

1 Simple features.
2 Binary Patterns.
3 Sequential features.
4 Performance of Simple features.
5 Performance of Sequential features.
6 Performance of Simple features on the PAN14 dataset.
7 Performance of Sequential features on the PAN14 dataset.
8 Cluster 1 in the LAT94 dataset.
9 Cluster 2 in the LAT94 dataset.
10 Cluster 1 in the GH95 dataset.
11 Cluster 2 in the GH95 dataset.
12 Cluster 3 in the GH95 dataset.
13 Performance of Simple features, taken from Table 4.
14 Performance of Sequential features, taken from Table 5.

List of Figures

1 Support Vector Machine.
2 Example of a Confusion matrix.
3 Example of an ARFF file.
4 Confusion matrix for the clusters in the LAT94 dataset.
5 Confusion matrix for the clusters in the GH95 dataset.
6 Confusion matrix for the C50 dataset.
7 Confusion matrix for the LAT94 dataset.
8 Confusion matrix for the GH95 dataset.
9 Confusion matrix for the C50 dataset where elements less than or equal to 5 are removed.
10 Confusion matrix for the LAT94 dataset where elements less than or equal to 5 are removed.
11 Confusion matrix for the GH95 dataset where elements less than or equal to 5 are removed.
1 Introduction

Much of the individual variation that can be found in written text can be used to distinguish authors from each other. This variation can be lexical or structural, bound to genre, or depend on extralinguistic factors such as the author's age, gender or background. The uniqueness of an author's writing style can be established by studying this variation, and the writing style can in turn be used to identify the author of a written text. Author Identification is an important problem in many areas, such as law and journalism, where knowing the author of a document, e.g. a ransom note, can be of great importance [1]. Identifying the variation which makes each author unique can be a challenging task, both for humans and when using a computational approach.

Today most of the measures on texts are calculated using the words or phrases in the text. These measures are called features. The feature values are commonly normalized and averaged over the whole text, and the resulting values can then be compared between texts with the intention of distinguishing the texts from each other. When normalizing and averaging the feature values, much of the variation that can be found in texts is not captured. Instead, this project intends to use the sequential nature of the text to define features at sentence level. The goal is to investigate whether these features can capture more of the individual variation that can be found in written text compared to features which are normalized and averaged over the whole text.

2 Background

This Section presents some background knowledge in the area of the project and the project description.

2.1 Finding individual variation

In the area of Author Identification, features represent variation that can be found in text and are used to distinguish authors from each other. Selecting good features is important to get good classification results. The study by Li et al. [2] investigated how to find the optimal features among a set of extracted features in an English and a Chinese dataset. By choosing only the features that perform well, they got the best possible performance from the classifier. For the English dataset they achieved a classification accuracy of 99.01% with the set of optimal features, compared to 97.85% with the full feature set. For the Chinese dataset they achieved a classification accuracy of 93.56% with the optimal features and 92.42% with the full feature set. The results showed that for both datasets the classification accuracy was better using only the optimal features than using the full feature set. The features were extracted from the datasets using a Genetic Algorithm: by applying the genetic operations implemented in a genetic algorithm, the solution containing the optimal features could be represented. Overall, Li et al. seemed to have a good approach to solving the task of sorting out the best features for a specific classification task.
When trying to identify the author of a text, some sort of feature selection can be used; Adams et al. have investigated this approach [3]. Their approach was to use Author Identification techniques based on genetic and evolutionary feature selection, similar to what Li et al. did [2]. The difference between their approaches was that Adams et al. created a heuristic for feature selection, while Li et al. introduced a method for identifying the key features of authors with the intention of helping trace author identities in cybercrime investigations [2].

Today, a common approach in machine learning to avoid the risk of overfitting is to use cross validation. This technique was used by Adams et al. [3]. Overfitting occurs when the model starts to describe the noise in the training data instead of the underlying statistical regularities. This results in bad performance on the test data and a classifier that generalizes badly. When using cross validation the dataset is divided into smaller subsets, usually a training set, a test set and a validation set, and this sampling of test and training data is repeated a number of times. The training set is used for training, while the test set is used to analyse how well the trained classifier works on unseen instances (how well it generalizes to unseen data). The validation set is used during training to see how well the learner is learning; its purpose is to help define a stop condition that tells the learner when to stop.

Gabrilovich and Markovitch [4] studied how to generate features based on domain-specific and common-sense knowledge. Using the Open Directory Project (ODP) [5] as background knowledge, they generated new features from the ODP. The texts were divided into series of contexts, where each context could be a sentence or a paragraph. Gabrilovich and Markovitch believed that looking at the whole text at once could be misleading and therefore chose to build the feature extractor on the sentences or paragraphs found in the texts. These starting points are similar to the ones of this project.

Another way to generate features is to use lexical chains. Jayarajan et al. [6] believed that by using a simple Bag of Words (BoW) model many of the linguistic and semantic features in texts would be ignored. A BoW model ignores grammar and word order, but keeps track of the number of times each word occurs. Instead of using a BoW model they used lexical chains, which consisted of semantic information encoded from the documents [6]. Their experiments showed that these features gave better classification results than features based on the BoW model. They also reduced the dimensionality of the feature vectors by 30% compared to the feature vectors generated with the BoW model.

Another approach to author identification was made by Huang et al. [7]. They investigated the importance of several different kinds of features, such as lexical, syntactic, structural and content-specific features, when used for Author Identification on online messages. Their research showed that using all four of these feature types increased the performance of their classifier. They developed three classifiers using different classification techniques: Decision Trees, Back Propagation Neural Networks and Support Vector Machines (SVM).
The testing was done on both an English and a Chinese dataset. The study showed that the classifier using the SVM algorithm outperformed the others on both datasets. On the English dataset the SVM classifier had a correct classification rate of 97.69% using all feature types, compared to 96.66% for the classifier using Back Propagation Neural Networks. The Decision Tree classifier was quite far behind the others: it could only classify 93.36% of the authors in the English dataset correctly. Of the three approaches, the SVM classifier had the best performance in all tests. Huang et al. also had some interesting observations about how differently a classifier can perform depending on the language of the dataset. Chinese is built up very differently from English; for example, it does not contain any spaces between words, which requires additional language processing to separate out the words compared to an English dataset [7]. This is important to consider when designing systems that should work with several different languages.

The study by Karlgren and Eriksson [8] was the initial inspiration for this project. The motivation for their study was to investigate whether features which are typically measured and then averaged over a whole text would capture more of the variation that can be found in texts if patterns of occurrence were measured instead. Using a sliding windows approach with different window lengths, they computed sequential features, which were then stored as binary patterns. In their study this line of investigation was not pursued to a functional conclusion, and it remains to be investigated whether their approach, which showed some promising results, can be useful.

2.2 Project description

This degree project is motivated by some of the initial results presented by Karlgren and Eriksson [8]. The idea is to further investigate the value of sequential features defined by locally and sequentially aggregating observations of some linguistic item, using sliding windows of different lengths. The theory is that using the sequential nature of the text, instead of mean values, captures more of the variation that can be found in texts; mean values do not capture much of this variation. The sequential features shall be implemented using a sliding windows approach (see Section 4.3.2). Examples of features are word length, sentence length, use of adverbials and use of pronouns. The feature representation shall be able to identify variation in texts and it should be possible to use the feature representation as a basis for Author Identification. All of the features used in the project were defined by the project supervisor, since the intention of this project was neither to evaluate the linguistic characteristics of text nor to invent new features.

3 Theory

This Section presents the classification algorithm used in the project and the representation that was used to present the classification results.

3.1 The Sequential Minimal Optimization algorithm

The Sequential Minimal Optimization (SMO) algorithm was developed by John Platt and is an alternative algorithm for training a Support Vector Machine (SVM) [9]. The SVM is used in many different areas of machine learning and is one of the most popular algorithms [10, p. 119].
The SVM can transform the input to a high-dimensional space, which makes it possible to linearly separate classes that cannot be separated linearly in the original input space. When using a SVM the margins between the classes are maximized, to make the probability of data points ending up in the wrong class as small as possible. Some of the advantages of using a SVM are that it has very good generalization properties and works very well with few training samples [11]. One disadvantage is that SVMs are hard to implement efficiently [11]. Figure 1 illustrates a simple classification problem using a SVM classifier. The classes are separated by the optimal hyperplane, which has the maximum margin to both of the classes. The data points (filled) which lie closest to the margin in each class are called support vectors [10, p. 121].

Figure 1: Support Vector Machine. (Two classes in the x-y plane separated by the optimal hyperplane with maximum margin.)

Instead of using a large quadratic programming loop for solving the optimization problem, like a regular SVM training algorithm, the SMO algorithm splits the problem into the smallest possible sub-problems, which are solved analytically. This makes the SMO algorithm faster and more scalable than the original SVM training algorithm [9].

3.2 Confusion matrix

The representation of results used in the project was confusion matrices. A confusion matrix is a square matrix which visualizes the performance of a classifier. All of the classes that exist in the data are presented in the horizontal and vertical directions of the matrix [10, p. 32]. Figure 2 shows how the confusion matrices used in this project are represented. The vertical direction presents the actual class while the horizontal direction presents the predicted class. The main diagonal (marked with C) in the matrix therefore represents the number of correct classifications for each class. All other elements (marked with x) represent the number of incorrect classifications between the actual class and the predicted class defined by the row and column of the matrix.

  a b c d e f g h i j   <-- classified as
  C x x x x x x x x x | a = Class-1
  x C x x x x x x x x | b = Class-2
  x x C x x x x x x x | c = Class-3
  x x x C x x x x x x | d = Class-4
  x x x x C x x x x x | e = Class-5
  x x x x x C x x x x | f = Class-6
  x x x x x x C x x x | g = Class-7
  x x x x x x x C x x | h = Class-8
  x x x x x x x x C x | i = Class-9
  x x x x x x x x x C | j = Class-10

Figure 2: Example of a Confusion matrix.

4 Method

This Section describes the different methods that were used in the project.

4.1 Software

This Section presents the software that was used in the project.

4.1.1 Weka

Weka is a Data Mining software suite written in Java which contains a large collection of Machine Learning algorithms [12]. Weka has a graphical user interface which makes it possible to apply the algorithms directly to a dataset. The algorithms can also be invoked from Java code [12], which is what was done in this project. The Weka software is available at its website [13]. Weka uses a file format called Attribute-Relation File Format (ARFF) to represent the data used for classification. Figure 3 presents a small example of an ARFF file. This example has three attributes, two numeric and one nominal. Each of the lines under @data represents one instance. In this project each instance represents a text file and each attribute represents a feature. To be able to use the classification algorithms implemented in the Weka software, all training and test data in the project had to be stored in this file format.

  @relation name
  @attribute att1 numeric
  @attribute att2 numeric
  @attribute att3 {yes, no}
  @data
  0,0,no
  0,1,yes
  1,0,yes
  1,1,yes

Figure 3: Example of an ARFF file.
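To make the connection between the ARFF format, the SMO algorithm (Section 3.1) and the confusion matrices (Section 3.2) concrete, the following minimal sketch shows how Weka can be invoked from Java code. The thesis does not list its implementation, so the file names and the assumption that the author label is the last attribute are illustrative only.

  import weka.classifiers.Evaluation;
  import weka.classifiers.functions.SMO;
  import weka.core.Instances;
  import weka.core.converters.ConverterUtils.DataSource;

  public class SmoExample {
      public static void main(String[] args) throws Exception {
          // Load training and test data from ARFF files (file names are hypothetical).
          Instances train = new DataSource("train.arff").getDataSet();
          Instances test = new DataSource("test.arff").getDataSet();
          // The author label is assumed to be the last attribute.
          train.setClassIndex(train.numAttributes() - 1);
          test.setClassIndex(test.numAttributes() - 1);

          // Train Weka's SMO support vector machine on the training set.
          SMO smo = new SMO();
          smo.buildClassifier(train);

          // Evaluate on the held-out test set and print the confusion matrix.
          Evaluation eval = new Evaluation(train);
          eval.evaluateModel(smo, test);
          System.out.println(eval.toSummaryString());
          System.out.println(eval.toMatrixString());
      }
  }

The same Evaluation class can also perform cross validation (see Section 2.1) via its crossValidateModel method when no separate test set is available.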
4.2 Datasets

This Section presents the datasets that were used in the project. One thing to note is that the datasets are not designed to test the same question. One dataset is designed to test the question "Did author X write this text?", whereas the other datasets are designed to test the question "Which of these authors wrote this text?". Even though these questions are very similar, the results will be very different (see Section 5).

4.2.1 PAN14 dataset

PAN is a competition held every year which includes classification tasks such as Author Identification, Author Profiling and Plagiarism Detection. PAN has publicly available training datasets which can be downloaded from its website [14]. In this project the training dataset for the Author Identification task of 2014 was used. This dataset can be downloaded at [15]. It consists of a collection of text documents in four different languages: Dutch, Greek, English and Spanish. In this project the English dataset containing 300 tasks was used. Each task consists of one document with an unknown author and a number of documents with a known author. The goal is to correctly classify whether or not the unknown document belongs to the same author as the known documents. The training dataset also includes a file containing the correct answers, where each classification task is listed together with a letter: Y represents that the authors are the same and N that they are not.

4.2.2 C50 dataset

The C50 dataset was downloaded from the UCI Machine Learning Repository [16]. It consists of one training set and one test set; these sets do not overlap. Each set contains 2500 documents (50 authors with 50 documents each) in text format. All of the documents are written in English and belong to the same subtopic, which minimizes the possibility of classifying documents by topic instead of by the features which characterize each author.

4.2.3 LAT94 dataset and GH95 dataset

The LAT94 dataset and the GH95 dataset are two separate datasets which consist of a large number of articles from the Los Angeles Times from 1994 and the Glasgow Herald from 1995, respectively. These datasets have been used in CLEF information retrieval evaluation campaigns [17]. Each file contains multiple articles, and to be able to use the articles they had to be extracted into separate files and tagged with the name of the author. This resulted in 6685 articles for the LAT94 dataset and 10595 articles for the GH95 dataset. Articles with no specified author or with multiple authors were excluded. The remaining articles in the LAT94 and GH95 datasets were sorted by author, and the 50 authors with the largest number of articles in each dataset were used to create two new datasets. These new datasets were used to evaluate the performance of the classifier.

4.2.4 Tasa dataset

The Tasa dataset was created from a large text file containing school essays with unknown authors. The file was split up into a large number of smaller files: starting from the beginning of the file, a random number (between 30 and 50) of sentences was selected and a new file was generated. This continued until the end of the Tasa file, resulting in a new dataset consisting of 19901 files where each file had between 30 and 50 sentences. Since the starting points and end points of the essays were not easily distinguishable, they could not be taken into account when splitting the file. The newly generated files can therefore contain text from several essays with different authors, which will create anomalies when trying to classify these files. These files were used as examples of texts not belonging to the known author in the PAN14 dataset.
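The splitting of the Tasa file described above could, under some simplifying assumptions, look like the sketch below. The thesis does not show its implementation; the file names and the regular-expression sentence splitter are illustrative only.

  import java.nio.charset.StandardCharsets;
  import java.nio.file.*;
  import java.util.*;

  public class TasaSplitter {
      public static void main(String[] args) throws Exception {
          // Read the large Tasa file and split it into sentences.
          // Splitting on sentence-final punctuation is a simplification.
          byte[] raw = Files.readAllBytes(Paths.get("tasa.txt"));
          String[] sentences = new String(raw, StandardCharsets.UTF_8).split("(?<=[.!?])\\s+");

          Random rnd = new Random();
          int fileNo = 0;
          for (int i = 0; i < sentences.length; ) {
              // Pick a random chunk size between 30 and 50 sentences.
              int size = 30 + rnd.nextInt(21);
              int end = Math.min(i + size, sentences.length);
              String chunk = String.join(" ", Arrays.copyOfRange(sentences, i, end));
              Files.write(Paths.get("tasa_" + (fileNo++) + ".txt"), chunk.getBytes(StandardCharsets.UTF_8));
              i = end;
          }
      }
  }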
4.3 Features

This Section describes the features that were implemented in the project.

4.3.1 Simple features

The Simple features implemented in this project refer to features which calculate an average value over the whole text, e.g. the average word length, as well as features which count occurrences of words. These features were implemented as a test in the beginning of the project and were later used when comparing feature performance. Table 1 presents all of the Simple features implemented in the project. Column N shows how many features of each kind were implemented.

N  Feature description
1  Average sentence length: The average sentence length.
1  Average word length: The average word length.
1  Adverbial: The total number of adverbials.
1  Pronoun: The total number of pronouns.
1  Utterance: The total number of utterance verbs.
3  Ly-words: The total number of -ly words in the beginning, middle and end of a sentence. This feature does not look at all words ending with -ly, only the words defined in the list of words used for this specific feature.
7  Mendenhall length X: The Mendenhall feature [18] consists of 7 features where each feature records the number of words with length X. For the first 6 features X is in the range 1-6 and for the last feature X is > 6.

Table 1: Simple features.

4.3.2 Sequential features

The Sequential features use an approach called Sliding Windows, where the text is divided into sequences of sentences which from here on are referred to as windows. For a window of size X, the following X sentences in the text are considered. The windows were generated by iterating through the text and moving the starting point one sentence forward in each iteration step, until the end of the text was reached. The features were applied to the sentences contained in each window. All of the features were represented by a binary condition: if the condition was met, the sentence got the value 1, otherwise it got the value 0. This gave a binary pattern of ones and zeros for the specified window size. For all of the Sequential features, window sizes 1-4 were used, so the total number of binary combinations was 2^1 + 2^2 + 2^3 + 2^4 = 2 + 4 + 8 + 16 = 30, where each combination is one of the binary patterns presented in Table 2.

Size  Binary pattern
1     0, 1
2     00, 01, 10, 11
3     000, 001, 010, 011, 100, 101, 110, 111
4     0000, 0001, 0010, 0011, 0100, 0101, 0110, 0111, 1000, 1001, 1010, 1011, 1100, 1101, 1110, 1111

Table 2: Binary Patterns.

Each feature value was recorded as an integer in the range 0-100. This value corresponded to the percentage of occurrences of each binary pattern.
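As a concrete illustration of the sliding windows encoding, the following sketch computes the pattern percentages for a single binary condition, the Sentence length condition defined in Table 3 below. It is a minimal reconstruction rather than the project's actual code: how rounding and patterns that never occur were handled is not specified in the thesis.

  import java.util.*;

  public class SequentialFeature {
      /** Returns, for one binary condition, the percentage of windows of sizes 1-4
          in which each binary pattern occurs. Patterns that never occur are simply
          absent from the map (implicitly 0). */
      static Map<String, Integer> slidingWindowFeature(String[] sentences) {
          Map<String, Integer> percentages = new LinkedHashMap<>();
          for (int size = 1; size <= 4; size++) {
              int windows = sentences.length - size + 1;
              if (windows <= 0) {
                  continue; // text shorter than the window size
              }
              Map<String, Integer> counts = new LinkedHashMap<>();
              for (int start = 0; start < windows; start++) {
                  StringBuilder pattern = new StringBuilder();
                  for (int i = start; i < start + size; i++) {
                      // Binary condition: a sentence is "long" if it has more than 30 characters.
                      pattern.append(sentences[i].length() > 30 ? '1' : '0');
                  }
                  counts.merge(pattern.toString(), 1, Integer::sum);
              }
              for (Map.Entry<String, Integer> e : counts.entrySet()) {
                  percentages.put(e.getKey(), Math.round(100f * e.getValue() / windows));
              }
          }
          return percentages;
      }
  }

Each of the other Sequential features in Table 3 only changes the binary condition applied to a sentence; the window handling stays the same.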
All of the Sequential features that were implemented are presented in Table 3.

N   Feature description
30  Sentence length: Measures the pattern of long vs. short sentences. A sentence is considered to be long if the number of characters is more than 30 (including white spaces, separators etc.).
30  Word length: Measures the pattern of sentences with long vs. short words. A sentence is considered to have long words if the average word length is >= 5.
30  Coherence: Measures the coherence value between sentences. A sentence is considered to be coherent with the following sentences if 60% or more of the words reoccur.
30  Pronoun: Measures if each sentence contains any pronouns (I, you, he, she etc.).
30  First person pronoun: Measures if each sentence contains any First person pronouns (I, we, me etc.).
30  Second person pronoun: Measures if each sentence contains any Second person pronouns (you, yours etc.).
30  Third person pronoun: Measures if each sentence contains any Third person pronouns (he, she, their etc.).
30  Relative pronoun: Measures if each sentence contains any Relative pronouns (that, when, which etc.).
30  Intensive/Reflexive pronoun: Measures if each sentence contains any Intensive/Reflexive pronouns (myself, yourself, itself etc.).
30  Adverbial: Measures if each sentence contains any Adverbials (after, almost etc.).
30  Clausal adverbial: Measures if each sentence contains any Clausal adverbials (suddenly, immediately, apparently etc.).
30  Utterance: Measures if each sentence contains any Utterance verbs (acknowledge, admit, allow etc.).
30  Amplifiers: Measures if each sentence contains any Amplifiers (absolutely, altogether, completely etc.).
30  Hedges: Measures if each sentence contains any Hedges (apparently, appear, around etc.).
30  Enneg: Measures if each sentence contains any Negative sentiment terms (abhor, abuse, alarm etc.).
30  Enpos: Measures if each sentence contains any Positive sentiment terms (adore, agree, amazing etc.).
30  Negation: Measures if each sentence contains any Negations (cannot, nor, not etc.).
30  Think verbs: Measures if each sentence contains any Think verbs (fear, hope etc.).
30  Overt emotion: Measures if each sentence contains any Explicit emotional terms (anger, fury etc.).
30  Time expressions: Measures if each sentence contains any Time expressions (monday, tuesday etc.).
30  Auxes and Modals: Measures if each sentence contains any Auxiliary or Modal verbs (will, would etc.).
30  Overt signals: Measures if each sentence contains any Overt signals of X (next, last etc.). Overt signals are explicit textual and discourse markers which are used to bind together bits of the text to a coherent whole.
30  Present tense: Measures if each sentence contains any Present tense (do, does etc.).
30  Past tense: Measures if each sentence contains any Past tense (did, was etc.).
30  Words ending with -ing: Measures if each sentence contains any words ending with -ing.
30  Introduced: Measures in each sentence if all of the words with a "the" in front of them have been introduced before (in the same window).

Table 3: Sequential features.

The following two definitions (in quotation marks) were given by the project supervisor Jussi Karlgren as a clarification of the Present tense, Past tense and Introduced features.

"Since English has a very spare morphology, the tense of the main verb cannot be determined with any precision, at least not without considerable computation. For this purpose the text is sampled, and certain common verbs are used to identify the tense of the clause (e.g. Have vs Had)."

"For each word token immediately preceded by the, checks sentence in the sliding window to see if that word has been used previously in the window. This is intended to capture the tendency of some authors to use terms in definite form, assuming their reference to be inferable from discourse context without explicit introduction."
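The Introduced feature is defined only through the quoted description above, so the following is one plausible reading of its per-sentence condition rather than the project's code. The tokenization, the lower-casing and the exact scope of "previously in the window" are assumptions.

  import java.util.Set;

  public class IntroducedFeature {
      /** Binary condition for one sentence: every token directly preceded by "the"
          must already occur in seenEarlier, the set of lower-cased tokens from the
          preceding sentences of the sliding window. */
      static boolean allIntroduced(String[] sentenceTokens, Set<String> seenEarlier) {
          for (int i = 1; i < sentenceTokens.length; i++) {
              if (sentenceTokens[i - 1].equalsIgnoreCase("the")
                      && !seenEarlier.contains(sentenceTokens[i].toLowerCase())) {
                  return false; // a "the X" whose X was not introduced earlier in the window
              }
          }
          return true;
      }
  }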
4.4 ARFF generation

To be able to use the classification algorithms implemented in Weka, all training and test data had to be stored in the ARFF format (see Section 4.1.1). Several different versions of an ARFF generator had to be implemented, depending on the format of the training and test sets used in the project.

4.5 Text files

All of the words used to define each feature were specified in text files, one for each feature. These text files were parsed in the implementation and the words were compared to the words in the texts.

4.6 Classification

The classification algorithm used in the project was the SMO algorithm (see Section 3.1). The algorithm was already implemented in Weka and could be invoked from the Java code.

4.7 Macro average

Two different techniques can be used to calculate the correct classification rate of a classifier: the Macro average technique gives equal weight to each class, while the Micro average gives equal weight to each data point. For example, with two classes of 90 and 10 test documents where 90 and 0 are classified correctly, the micro average is 90% while the macro average is 50%; if instead 81 and 9 are correct, both measures give 90%. In this project the Macro average measure implemented in Weka was used.

4.8 Confusion matrices

Confusion matrices (see Section 3.2) were used in the project to present the classification results. These matrices were also used to identify clusters of authors (see Section 5.2).
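To make the ARFF generation step of Section 4.4 concrete, the sketch below writes a feature matrix in the format shown in Figure 3, with one numeric attribute per feature and the author as a nominal class attribute. The relation and attribute names are illustrative; the project's actual generators are not shown in the thesis.

  import java.io.PrintWriter;
  import java.util.List;

  public class ArffWriter {
      /** Writes one ARFF file where each row holds the feature values of one text
          file followed by its author label. */
      static void write(String path, List<String> featureNames, List<String> authors,
                        List<double[]> rows, List<String> rowAuthors) throws Exception {
          try (PrintWriter out = new PrintWriter(path)) {
              out.println("@relation authorship");
              for (String feature : featureNames) {
                  out.println("@attribute " + feature + " numeric");
              }
              out.println("@attribute author {" + String.join(",", authors) + "}");
              out.println("@data");
              for (int i = 0; i < rows.size(); i++) {
                  StringBuilder line = new StringBuilder();
                  for (double value : rows.get(i)) {
                      line.append(value).append(',');
                  }
                  out.println(line.append(rowAuthors.get(i)));
              }
          }
      }
  }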
5 Experiments and Results

The following Section presents the performance of the classifier and some experiments on the datasets presented in Section 4.2.

5.1 Performance of features

All of the features that were implemented were tested independently on all datasets. The C50 dataset consists of a training and a test set. The training set contains 2500 text documents with 50 unique authors having 50 documents each. The test set has the same authors and the same number of documents as the training set, and the documents in the two sets do not overlap. The LAT94 dataset contains 6685 text documents with 50 unique authors, where each author has between 88 and 319 documents. The GH95 dataset contains 10595 text documents with 50 unique authors, where each author has between 131 and 372 documents. Both the LAT94 and the GH95 datasets were randomly split in half; one half was used for training and the other for testing.

Table 4 presents the performance of the Simple features on these datasets. The columns in the table present the correct classification rates for each feature. The feature with the best performance on all three datasets in Table 4 was the Mendenhall length X feature. By itself it was able to correctly classify 9.32% of the C50 documents, 12.66% of the LAT94 documents and 13.42% of the GH95 documents. Intuitively, the Mendenhall feature vector will capture the distribution of the shortest words in the text, many of which are grammatical function words rather than content words, such as "a", "is" and "was". The good performance of this feature points to individual variation in the grammatical structure of the texts in the datasets. When combining all of the Simple features the performance rose to 11.12% of the C50 documents, 16.90% of the LAT94 documents and 18.54% of the GH95 documents.

The performance of the Sequential features is presented in Table 5. The best feature on all three datasets was the Introduced feature. By itself it was able to classify 4.88% of the C50 documents, 10.63% of the LAT94 documents and 7.30% of the GH95 documents correctly. When combining all of the Sequential features the performance rose to 12.68% of the C50 documents, 18.59% of the LAT94 documents and 23.48% of the GH95 documents.

Table 6 shows the performance of the Simple features on the PAN14 dataset. The best feature was the Ly-words feature, which had a performance of 59.33%; the performance of all features combined was 55.67%. Table 7 shows the performance of the Sequential features on the PAN14 dataset. The best feature was the Sentence length feature, which had a performance of 60.00%; the performance of all features combined was 55.00%. The classification rates in Tables 6 and 7 cannot be compared to the rates in Tables 4 and 5, since the PAN14 dataset consists of binary classification tasks while the other datasets have one correct class out of 50 (see Section 4.2). This makes the probability of an incorrect classification much higher in the other datasets.

Feature                    C50 (%)   LAT94 (%)   GH95 (%)
Average sentence length     4.6000    6.5640      4.7565
Average word length         2.8800    6.5640      5.3983
Adverbial                   3.4000    4.7633      4.2658
Pronoun                     2.3200    5.3732      6.7950
Utterance                   2.1200    4.7633      3.9449
Ly-words                    2.2000    4.8504      3.3409
Mendenhall length X         9.3200   12.6634     13.4202
ALL FEATURES               11.1200   16.9039     18.5353

Table 4: Performance of Simple features.

Feature                       C50 (%)   LAT94 (%)   GH95 (%)
Sentence length                4.1200    8.6552      5.7569
Word length                    4.2800    7.9582      6.3231
Coherence                      4.5200    9.4685      5.7380
Pronoun                        3.5600    6.0122      6.6629
First person pronoun           2.3200    5.6056      5.0019
Second person pronoun          2.0800    5.0537      3.1899
Third person pronoun           2.9200    6.2155      6.4553
Relative pronoun               3.6400    4.3276      4.9830
Intensive/Reflexive pronoun    2.1200    4.9085      3.1899
Adverbial                      4.8400    8.0744      5.6059
Clausal adverbial              2.3600    4.9666      3.1144
Utterance                      2.9200    4.7052      3.6240
Amplifiers                     2.2000    4.8795      3.1333
Hedges                         2.0400    4.6762      4.1148
Enneg                          3.3600    5.0247      4.6433
Enpos                          2.5200    5.6346      4.3790
Negation                       2.6400    4.8214      3.5863
Think verbs                    2.9200    4.7052      3.6240
Overt emotion                  2.1200    4.9376      3.2654
Time expressions               2.8800    5.6346      3.9071
Auxes and Modals               3.9600    7.5225      6.2288
Overt signals                  2.5600    5.2570      4.6810
Present tense                  3.8800    5.9251      5.0396
Past tense                     3.2400    5.1990      4.7565
Words ending with -ing         2.5200    8.6262      5.8513
Introduced                     4.8800   10.6303      7.3046
ALL FEATURES                  12.6800   18.5884     23.4806

Table 5: Performance of Sequential features.

Feature                    PAN14 (%)
Average sentence length    49.00
Average word length        50.33
Adverbials                 47.67
Pronouns                   47.67
Utterance                  49.00
Ly-words                   59.33
Mendenhall length X        48.67
ALL FEATURES               55.67

Table 6: Performance of Simple features on the PAN14 dataset.

Feature                       PAN14 (%)
Sentence length               60.00
Word length                   52.67
Coherence                     55.33
Pronoun                       49.00
First person pronoun          45.33
Second person pronoun         49.00
Third person pronoun          52.67
Relative pronoun              53.00
Intensive/Reflexive pronoun   50.33
Adverbial                     54.00
Clausal adverbial             53.67
Utterance                     50.67
Amplifiers                    53.00
Hedges                        53.33
Enneg                         56.33
Enpos                         47.00
Negation                      57.67
Think verbs                   50.67
Overt emotion                 51.00
Time expressions              52.67
Auxes and Modals              53.00
Overt signals                 45.00
Present tense                 50.67
Past tense                    49.67
Words ending with -ing        49.33
Introduced                    54.33
ALL FEATURES                  55.00

Table 7: Performance of Sequential features on the PAN14 dataset.

5.2 Clustering of authors

The results from the classification were presented in confusion matrices (see Appendix A-C).
In these matrices, the elements on the main diagonal represent the number of documents that were correctly classified for each class. In the matrices some clusters of authors could be observed. These clusters contained authors whose documents were often incorrectly classified as belonging to each other. To help identify the clusters, all incorrect classifications between a pair of authors that were less than or equal to 5 were removed; these new matrices are presented in Appendix D-F. The following subsections present the clusters found in the confusion matrices for the C50, LAT94 and GH95 datasets.

5.2.1 C50 dataset

In the C50 dataset no clusters of authors could be found. The original confusion matrix can be seen in Appendix A and the confusion matrix with the removed elements can be seen in Appendix D.

5.2.2 LAT94 dataset

In the LAT94 dataset two clusters of authors could be found. In these clusters the authors were often confused with each other, whereas they were not as often confused with authors outside the cluster. The first cluster from the LAT94 dataset is presented in Table 8 and the second cluster in Table 9. The tables present the number of correctly and incorrectly classified documents in the clusters; the values are obtained from the original confusion matrix in Appendix B. The rows present the actual class and the columns the predicted class. Each author in the tables is represented by the letter combination given by the classifier, which can be seen in the confusion matrix in Appendix B.

Author    i    q   ac
i        16    5    5
q         3   36    9
ac       12   22   18

Table 8: Cluster 1 in the LAT94 dataset.

Author    p   ab
p        22   10
ab       10   37

Table 9: Cluster 2 in the LAT94 dataset.

After clustering some of the authors together, each of these clusters can be seen as a representation of a type of author. One idea that came up during these experiments was that it would be interesting to see whether it is easier to identify a type of author than a specific author. To test this theory, all of the documents in the dataset belonging to the authors in the clusters were merged into a new dataset where each cluster represented a type of author (a sketch of this relabelling is shown after Figure 4). A new classification was made on this dataset and the result can be seen in Figure 4. The confusion matrix shows that the number of correct classifications was much larger than the number of incorrect classifications: 392 instances were correctly classified and 48 instances were incorrectly classified, which gives a correct classification rate of 89.09%.

   a    b   <-- classified as
  93   17 | a = CLUSTER-2
  31  299 | b = CLUSTER-1

Figure 4: Confusion matrix for the clusters in the LAT94 dataset.
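The merged datasets used in the cluster experiments can be produced by replacing each author label with its cluster label before the ARFF file for the merged dataset is generated. The sketch below illustrates the idea for the two LAT94 clusters; the thesis does not show how the merging was implemented, so the mapping and method are illustrative only.

  import java.util.HashMap;
  import java.util.Map;

  public class ClusterRelabel {
      // Author letter combinations from Tables 8 and 9 mapped to their cluster labels.
      static final Map<String, String> LAT94_CLUSTERS = new HashMap<>();
      static {
          LAT94_CLUSTERS.put("i", "CLUSTER-1");
          LAT94_CLUSTERS.put("q", "CLUSTER-1");
          LAT94_CLUSTERS.put("ac", "CLUSTER-1");
          LAT94_CLUSTERS.put("p", "CLUSTER-2");
          LAT94_CLUSTERS.put("ab", "CLUSTER-2");
      }

      /** Returns the cluster label for an author, or null if the author's documents
          are not part of the merged dataset. */
      static String relabel(String author) {
          return LAT94_CLUSTERS.get(author);
      }
  }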
5.2.3 GH95 dataset

For the GH95 dataset the same approach as for the LAT94 dataset was used. First the clusters were identified by looking at the confusion matrix for the GH95 dataset in Appendix F; in this case three clusters were found. These clusters are presented in Tables 10, 11 and 12. The values are taken from the original confusion matrix in Appendix C.

Author    f   ad   as   at
f        87    6    2    9
af       12   53    2   36
as       15    7    6   14
at       20   28    1   72

Table 10: Cluster 1 in the GH95 dataset.

Author    e    z   aj
e        56   21   25
z        17   97   10
aj       27   17   69

Table 11: Cluster 2 in the GH95 dataset.

Author    q    w
q        60   10
w        22   27

Table 12: Cluster 3 in the GH95 dataset.

The next step was to merge all of the documents in each cluster and then perform a new classification on these clusters. The result of the classification using the new clusters is presented in Figure 5. The confusion matrix in Figure 5 shows that 1154 documents were correctly classified and 114 documents were incorrectly classified, which gives a correct classification rate of 91.01%.

    a    b    c   <-- classified as
  457   14   21 | a = CLUSTER-2
   26  164   15 | b = CLUSTER-3
   28   10  533 | c = CLUSTER-1

Figure 5: Confusion matrix for the clusters in the GH95 dataset.

6 Discussion

This Section discusses the results presented in Section 5.

6.1 Performance of individual features

The performance of the individual features shown in Tables 4-7 varies considerably across datasets and feature types. This can depend on the type of documents, the amount of training data, how large the variation between the authors in the datasets is, and many other factors. The best performance of the combined Sequential features was achieved on the GH95 dataset, where the correct classification rate was 23.48%. As described in Section 5.1, the GH95 dataset consists of 50 different authors and 10595 articles, where each author has between 131 and 372 articles. The classifier performed well above the level of a random classifier. Since the authors in the dataset do not have exactly the same number of articles, the prior probability for each class is not the same. A completely random classifier with equal prior probabilities for each class would have a correct classification rate of around 2.0% for a dataset containing 50 classes. In the GH95 case the prior probability varies between classes, and the probability that a document belongs to a class with many documents is higher than the probability that it belongs to a class with fewer documents. Even so, it should be safe to say that a random classifier applied to the GH95 dataset would have a performance rate of 2.0% plus or minus a few percentage points. This can be compared to the classifier using the Sequential features, which had a performance rate of 23.48% on the GH95 dataset: it could classify almost 1 out of 4 documents correctly using only Sequential features, far more than a random classifier would. The performance of the classifier using the Sequential features on the other two datasets in Table 5 is also well above that of a random classifier.

6.2 Unexpected results

While the performance of the classifier was well above random, it was far lower than expected and is likely not to be of practical use in its present form. The low performance can depend on many different factors, such as which features were implemented, the classification algorithm, the feature representation or other implementation choices. The sliding windows approach and the choice to represent the feature value for each binary pattern (see Section 4.3.2) as the percentage of how often it occurs in the text are also implementation choices that may need to be questioned. Some of the features have quite good individual performance, like the Introduced feature on the LAT94 dataset, which had a performance rate of 10.63%.
It was expected that the performance of all Sequential features combined would be higher than it was for the three datasets in Table 5, considering the quite good performance of many of the individual features. As mentioned above, the best feature was the Introduced feature, which is strongly bound to the sequential nature of the text. The Introduced feature, as explained in Table 3, measures in each sentence whether all of the words with a "the" in front of them have been introduced: each word token preceded by "the" is noted, and if at least one occurrence of that token has been observed earlier in the sliding window, that word is considered to have been introduced by the author. It may be that more features that are strictly bound to the sequential nature of the text need to be implemented to increase the performance of the classifier. Another feature that supports this theory is the Coherence feature, which had a performance rate of 9.47% on the LAT94 dataset, the second best Sequential feature on this dataset. It is also strictly bound to the sequential nature of the text, since it measures how coherent a sentence is with the other sentences in the same window; a sentence is considered to be coherent with the other sentences in the window if 60.0% or more of the words reoccur. This is worth keeping in mind when choosing which features to implement.

6.3 Simple vs Sequential features

The purpose of this project was to implement and evaluate a representation of Sequential features. The theory was that these features can capture more of the individual variation that distinguishes texts from each other, compared to more common types of features which are represented as mean values over the whole text. Some of the features in the project were implemented both as Sequential features and as Simple features (see Sections 4.3.1 and 4.3.2): the Adverbial, Pronoun and Utterance features presented in Tables 1 and 3. The performance of these features is presented in Tables 13 and 14 (the complete tables can be seen in Section 5.1). In six out of nine cases the performance of the Sequential features was better than the performance of the Simple features, in some cases considerably better. It should also be noted that in the three cases where the Simple features performed better than the Sequential features, the difference in performance was much smaller than in the opposite cases. This is a promising result that encourages further studies in the area, even though the theory of the Sequential features being better than the Simple features could not be confirmed.

Feature     C50 (%)   LAT94 (%)   GH95 (%)
Adverbial   3.4000    4.7633      4.2658
Pronoun     2.3200    5.3732      6.7950
Utterance   2.1200    4.7633      3.9449

Table 13: Performance of Simple features, taken from Table 4.

Feature     C50 (%)   LAT94 (%)   GH95 (%)
Adverbial   4.8400    8.0744      5.6059
Pronoun     3.5600    6.0122      6.6629
Utterance   2.9200    4.7052      3.6240

Table 14: Performance of Sequential features, taken from Table 5.

6.4 Classification performance on the PAN14 dataset

One of the datasets used for evaluating the classifier was the PAN14 dataset described in Section 4.2.1. The structure of this dataset was very different from the structure of the other datasets used in this project.
The PAN14 dataset consists of a number of classification tasks where each task has a number of documents with a known author and one document with an unknown author. This author can either be the same as the author of the known documents or not. This classification task was hard since there were only example documents for one of the classes. Using the SMO algorithm in Weka with only one example class resulted in all unknown documents being predicted to belong to the same class as the known documents, which was not desirable. When using the SMO algorithm in Weka it is possible to define a cost matrix which penalizes the classifier differently depending on which class is incorrectly classified. This cost matrix was used in an effort to force the classifier not to guess the known class just because it had more training data: the value in the cost matrix corresponding to an incorrect classification of the unknown class was set higher than the other values. Even when penalizing the algorithm in this way, it still guessed the known class every time. At this point another dataset had to be introduced to provide example texts not belonging to the known class. The dataset used was the Tasa dataset described in Section 4.2.4, and a number of documents from this dataset were randomly chosen to represent the other class during classification.

The classification performance is presented in Tables 6 and 7. The performance on this dataset is very poor compared to the other datasets, considering that all of the classification tasks are binary classifications and a random guess would already be correct about half the time. The poor performance can depend on several factors, but the most probable reason is the Tasa dataset. The Tasa files were generated from one large file containing many unlabelled student essays. As described in Section 4.2.4, the essays have no easily distinguishable starting and end points, nor any defined authors. When this dataset is used as a representation of the other class, a single file can therefore contain text from several essays written by different authors, which makes it hard to find any similarities between the features in this class. Another problem is that the SMO algorithm must choose one of the classes as the predicted class, and it is not very likely that the unknown document will have any similarities with the documents from the Tasa dataset. To get good classification results on the PAN14 dataset it is important to find a suitable dataset for representing the "other" class when using the SMO algorithm. Another possibility would be to try other machine learning algorithms or other classification techniques for this kind of classification task.

6.5 Confusion matrices and clusters

The final classification results were presented as confusion matrices (see Appendix A-F); these matrices show the predicted vs. the actual document class for all documents in the test datasets. As described in Section 5.2, clusters of authors could be found in the confusion matrices for the LAT94 and GH95 datasets. These clusters can be seen as a representation of a type of author. The clusters defined a new dataset where each cluster became a new class containing all of the documents of all the authors in the cluster. The theory behind this experiment was that it would be easier to correctly classify a type of author than a specific author.
The result of the experiment showed that the correct classification rate increased when the classifier was applied to the clusters, in line with the theory. For the LAT94 dataset, containing 2 clusters, the correct classification rate was 89.09%, and for the GH95 dataset, containing 3 clusters, it was 91.01%. These results are very good compared to the results on the original datasets (see Table 5), where the performance was 18.59% and 23.48%, respectively. One thing to note is that the classification in this case is somewhat easier, since the new datasets only had 2 or 3 classes while the original datasets had 50, which makes the probability of an incorrect classification much higher in the original datasets. Even if the classification on the new datasets created from the clusters of authors is easier, a major improvement in performance could be seen. These experiments support the theory that, on the datasets used in this project, it is easier to classify a type of author than individual authors. In this experiment it could not easily be seen what specific variation in the authors' writing styles made them end up in the same cluster, but this could be an interesting aspect to evaluate in future studies.

6.6 Genre and domain independence

If lexical features are used, i.e., features based on occurrences of individual words, the classifier can partially start to model the genre in which the text is written instead of modelling the writing style of the author. This is a common approach in authorship attribution tasks. It may lead to excellent classification results when the test documents belong to the same genre and domain as the training documents, but it is unclear whether it will translate to useful real-life performance. In this project none of the implemented features kept track of the words used by the authors. In this way, the classifier is more independent of the genre and topical domain of the texts during classification than a classifier that keeps track of which words the authors use. The performance of the classifier could probably be increased if such features were implemented, but modelling genres was not the purpose of this project.

7 Conclusion

This project investigated the individual variation (features) that can be found in written text. The theory was that by using the sequential nature of the text and defining features at sentence level, it could be possible to capture more of the variation that distinguishes texts from each other. Features defined as mean values over the whole text (Simple features) do not capture this kind of variation, and a large amount of information that could be used to distinguish texts from each other is therefore lost when such features are used. The features were evaluated by constructing a classifier with the purpose of identifying the author of text documents in a number of datasets. The experiments showed that the Sequential features perform better than the Simple features in some cases, but the difference was not large enough to confirm the theory that they are better than the Simple features. The experiments also showed that by identifying clusters of authors and generating new datasets where the classes were defined by these clusters, the performance of the classifier could be improved. It should also be noted that authorship attribution is a challenging task for humans as well.
In many machine learning tasks the computational approach is to scale up tasks that humans do well to large datasets, but this project addresses a task which humans find quite difficult. The features under consideration are not easily observed by humans, and even when they are observed they can be very hard to explain, even if the reader notes that a text can be attributed to some author. Even though the theory of the Sequential features being better than the Simple features could not be confirmed, the results encourage further studies in the area. One part that can be investigated further is why the performance of the Sequential features did not improve more when they were aggregated: is the problem in the feature representation or somewhere else? The clustering of authors is also a possible subject for further studies: what similarities can be found between the authors in each cluster? Due to time constraints these questions were not evaluated further, but they can serve as ideas for further studies in the area.

Bibliography

[1] PAN 2014 Website. http://www.uni-weimar.de/medien/webis/research/events/pan-14/pan14-web/author-identification.html. (Visited on 06/01/2014).
[2] J. Li, R. Zheng, and H. Chen. "From fingerprint to writeprint". In: Communications of the ACM 49.4 (2006), pp. 76-82.
[3] J. Adams, H. Williams, J. Carter, and G. Dozier. "Genetic Heuristic Development: Feature selection for author identification". In: 2013 IEEE Symposium on Computational Intelligence in Biometrics and Identity Management (CIBIM). Piscataway, NJ, USA, 2013, pp. 36-41.
[4] E. Gabrilovich and S. Markovitch. Feature Generation for Text Categorization Using World Knowledge. Haifa, Israel: Computer Science Department, Technion, 2005.
[5] The Open Directory Project. http://www.dmoz.org/docs/en/about.html. (Visited on 06/01/2014).
[6] D. Jayarajan, D. Deodhare, and B. Ravindran. Lexical Chains as Document Features. Centre for Artificial Intelligence and Robotics, Defence R&D Organisation, Bangalore, India, and Dept. of CSE, IIT Madras, Chennai, India, 2008.
[7] Z. Huang, R. Zheng, J. Li, and H. Chen. "A framework for authorship identification of online messages: writing-style features and classification techniques". In: Journal of the American Society for Information Science and Technology 57.3 (2006), pp. 378-393. ISSN: 1532-2882.
[8] J. Karlgren and G. Eriksson. "Authors, Genre, and Linguistic Convention". In: SIGIR Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (2007).
[9] J. C. Platt. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. Advances in kernel methods - Support vector learning, 1998.
[10] S. Marsland. Machine Learning: An Algorithmic Perspective. Chapman & Hall/CRC, 2009.
[11] Ö. Ekeberg. Support Vector Machines. http://www.csc.kth.se/utbildning/kth/kurser/DD2431/ml12/schedule/05-svm-2x2.pdf. 2012. (Visited on 02/27/2014).
[12] Weka 3: Data Mining Software in Java. http://www.cs.waikato.ac.nz/ml/weka/index.html. (Visited on 04/14/2014).
[13] Weka 3, Download. http://www.cs.waikato.ac.nz/ml/weka/downloading.html. (Visited on 04/14/2014).
[14] PAN 2014 Website. http://pan.webis.de/. (Visited on 04/14/2014).
[15] PAN 2014 Training dataset. http://www.webis.de/research/corpora/corpus-pan-labs09-today/pan-14/pan14-data/pan14-author-verification-training-corpus-2014-04-03.zip. (Visited on 04/14/2014).
[16] UCI Machine Learning Repository, Reuter 50 50 Dataset.
Appendix

This section contains the confusion matrices from Section 5 that were too large to include in the body of the report.

A Confusion matrix C50

[Figure 6: Confusion matrix for the C50 dataset.]
B Confusion matrix LAT94

[Figure 7: Confusion matrix for the LAT94 dataset.]
C Confusion matrix GH95

[Figure 8: Confusion matrix for the GH95 dataset.]
D Confusion matrix C50 with removed elements

[Figure 9: Confusion matrix for the C50 dataset where elements less than or equal to 5 are removed.]

E Confusion matrix LAT94 with removed elements

[Figure 10: Confusion matrix for the LAT94 dataset where elements less than or equal to 5 are removed.]
F Confusion matrix GH95 with removed elements

[Figure 11: Confusion matrix for the GH95 dataset where elements less than or equal to 5 are removed.]