Arabic Text Author Identification Using Support Vector Machines

Journal of Advanced Computer Science and Technology Research, Vol.4 No.1, March 2014, 1-11
Rebhi Baraka, Samar Salem, Mona Abu Hussien, Nidaa Nayef, and Wala Abu Shaban
Article Info
Received: 12th December 2013
Accepted: 21st January 2014
Published online: 1st March 2014
Faculty of Information Technology, Islamic University of Gaza, Gaza, Palestine
[email protected]
e-ISSN: 2231-8852
ABSTRACT
A model for Arabic text author identification is proposed. It classifies a set of Arabic text documents
with unknown authorship by capturing the style of each author through features extracted from the
text. The identification process is achieved through five phases which are: documents collection,
dataset preparation, features extraction, features optimization and classification model building. The
model relies on Support Vector Machines (SVM) and combines two feature types on two domains:
Political Analysis Articles and Literature. The experiments show that the model is effective with
classification accuracy that may reach 100%.
Keywords: Arabic NLP, Data Mining, Data Classification, Part of Speech Tagging
1. Introduction
Author identification is a typical problem in natural language processing. It is the task of
identifying the author of a given text. It can be considered as a typical classification problem,
where a set of documents with known authors are used for training and the aim is to
automatically determine the corresponding author of an anonymous text. This can be used in
a broad range of applications, to analyze ancient documents, in plagiarism detection which
can be used to establish whether claimed authorship is valid, in forensic investigations to
verify the authorship of e-mails and messages (Taş, 2007).
In contrast to other classification tasks, it is not clear which features of a text should be
used to classify an author. So, the main issue in computer-based author identification is to
identify a set of features that represents the author’s writing style. These are then used to
classify the authors of selected anonymous texts. A set of feature types can be used to classify
authors; these include word-level, character-level, syntactic, semantic and lexical features
(Stamatatos, 2009).
This work deals with Arabic text author identification where we classify a set of Arabic
text documents with unknown authorship by capturing the style of each author through
features extracted from the text. There are many challenges in choosing the right features that
can capture the style of the writer more accurately and also in choosing the methods and
approaches to use for author classification.
In order to classify authors, we need to gather text-documents for different writers mainly
from the internet. Then, using NLP tools, we extract a set of features, mainly word-level, syntactic and lexical features, to capture the authors’ style. Feature reduction is required
to get the most relevant features to be used to build the classifier. Support Vector Machine
(SVM) is proposed as a classification method due to its tested performance in text
classification and categorization applications.
The proposed author identification model consists of five phases which are: documents
collection, dataset preparation, features extraction, features optimization and classification
model building.
In the documents collection phase we collect a set of Arabic text documents in Political
Analysis Articles and Literature fields. Then in the dataset preparation phase we process the
documents and organize them into separate directories for each author. The feature extraction
phase is the most essential since a set of features are extracted from the documents to capture
the style of writers. For the feature optimization phase, the previously extracted features are
optimized to get the most relevant features that represent the style of authors. The final phase
is classification model building and evaluation to test the model performance.
The rest of the paper consists of the following sections: Section 2 presents related works,
Section 3 presents the proposed author identification model. Section 4 presents the
experiments and results of realizing the model. Section 5 concludes the paper and suggests
future work.
2. Related Work
Jing (2011) applies an approach based on dependency grammar to Chinese authorship identification. The approach consists of 4 steps: data collection, feature extraction, feature optimization and identification. For the feature set, they proposed dependency as a new syntactic-level feature combined with 3 other features: empty word, part-of-speech and punctuation. Principal component
analysis (PCA) is used for feature reduction and SVM for classification. Experiments showed
that combining the four features gives better performance than any other feature combination.
Taş (2007) presents a fully automated approach to the author identification of Turkish text
by adapting a set of style markers to the analysis of the text. For identifying an author, 35
style markers are determined for a set of 20 different authors. They tested and compared
several machine learning algorithms for classification. Maximum success is obtained with
Naïve Bayes Multinomial with a rate of 80%, using CFS subset evaluator with rank search
method.
Chaurasia (2011) proposes a novel approach for author identification which depends on
using an initial character n-gram approach with real-world text. They dealt only with one start split of every term (the initial split). After text preprocessing, the n-gram profile is generated; then the n-gram profile of each document is compared against the profiles of all documents in the training classes using a dissimilarity algorithm. The accuracy is higher than for other types of n-grams.
Zhao (2006) proposes a new approach for author attribution based on relative entropy
which is an approach motivated from information theory. Kullback-Leibler Divergence
(KLD) is used as a classifier to measure how different two distributions are. They also tested the SVM algorithm and Bayesian networks and compared their performance with that of KLD for different types of features, including function words, POS tags and punctuation. Their experiments showed that KLD classification is an effective method for general text categorization, while SVM results in better accuracy and precision for author attribution.
Diederich (2003) introduces an SVM algorithm as an approach for author attribution due
to its ability to process a very large number of features without the need for feature selection.
A comparison is held between several classification methods and SVM to test its
performance on word-level features especially word forms and bigrams of function words.
The experiments showed that SVM achieves good performance for author identification. They also showed that bigrams of function words yield performance superior to that of the other classifiers, though less than that of full word forms.
Stamatatos (2009) presents a survey of the latest advances in automated approaches used
in author attribution. He examines the characteristics of these approaches for text
representation and text classification, and also the evaluation methodologies and criteria used
in author attribution studies. The survey distinguishes several types of stylometric features to
quantify the writing style including lexical features, character features, syntactic, semantic
and application-specific features.
Mccombe (2002) investigates a set of computer-based author identification methods to determine which of them are most effective. The work introduces several programs used for author identification, explaining how to use each program, the algorithms and mathematical methods behind them, and the results obtained from each. The most effective method, as stated, is the one based on letter unigrams, used for 300-word e-mail author identification.
Author identification depends heavily on classification approaches, and the SVM classification algorithm has been widely used in text classification. In our research we rely on SVM for classification after preprocessing the Arabic text documents and extracting features. Next we present some works that have used SVM in their classification process.
Joachims (1998) explores and identifies the use of SVM as a text classifier. The word stems
are used to represent the text features, where words are considered features if they are not
stop-words and if they occurred at least 3 times in training data. Information gain is used for
feature selection. The experiments compared SVM with 4 other learning methods used for categorization. The results show that SVM achieves better precision, recall and overall performance than the other methods.
Al-Harbi (2008) introduces an automatic approach for Arabic text classification based on
two classification algorithms: SVM and C5.0. They use 7 different datasets to cover different
subject domains. Lexical Features (single word) are extracted and chi squared statistics are
used to compute the dependency between the term and the class for feature selection. The
experiments are conducted using RAPIDMINER to implement SVM algorithm and
Clementine for C5.0 decision tree algorithm. The results show that C5.0 outperformed the
SVM in accuracy by about 10%.
El-Shishtawy (2009) proposes a method to extract key phrases from Arabic text by using
statistical measures such as term frequency and combine it with linguistic knowledge for
better results. The knowledge includes syntactic rules such as part-of-speech tags and word
frequency; also, the abstract form of Arabic words is used instead of their stems. Analysis of
Variance (ANOVA) test is used to evaluate the selected features. The learning model is built
using Linear Discriminant Analysis (LDA). The experiments show that the presented system
has better performance than the existing Arabic extractor system with doubled precision and
recall obtained mainly from linguistic knowledge.
Saad (2010) examines the impact of text preprocessing and term weighting schemes on
Arabic text classification. They also develop a new combination of term weighting schemes to enhance text classification. The results show that using term weighting and stemming reduces dataset dimensionality, since stemming reduces many terms to their roots; classification accuracy is also much better for stemmed terms than for Bags-of-Tokens (BOT) without stemming, for all combinations.
3. Author Identification Model
This section describes the model for author identification. The model is translated into a
series of steps starting from Arabic text documents collection and ending with classification
and evaluation.
The proposed model shown in Figure 1 represents the author identification process defined
as a series of steps performed in sequence. These steps are: Data collection, data preparation,
feature extraction, feature optimization, classification and finally evaluation. The documents
from each author are processed at each step to obtain the information we need to identify
authorship until the actual classification is performed and the classifier is evaluated.
Fig. 1: Authorship Identification Model
Next we explain the author identification process represented by the model in more details.
3.1 Data Collection
Data collection is the process of gathering a set of text documents in a specific domain for later processing to achieve a specific purpose.
We collect multiple Arabic text documents for different writers in two domains: Literature
and Political Analysis Articles. The number and length of each document is different for all
authors in both domains. For the Political Analysis Articles domain we have 3 authors (3 classes), and for the Literature domain we started with 3 authors then added two more, for a total of 5 authors (the literature documents, being stories and novels, are much larger than the Political Analysis Articles). Table 1 shows the total number and size of documents for each
author in each domain.
Table 1: The number and size of documents for Political Analysis Articles and Literature
domains

Author               Number of Documents   Size of Documents
Political Analysis Articles Domain
Fahmi Howeidi        80                    613 KB
Saleh Naami          89                    745 KB
Yasser Zaatrah       86                    632 KB
Literature Domain
Abdullah Tayeh       10                    0.99 MB
Nadia Khost          14                    0.97 MB
Najeeb Mahfoz        13                    866 KB
Ghassan Kanafani     10                    495 KB
Gobran K. Gobran     11                    417 KB
Although the number of documents in the Political Analysis Articles domain is large compared to that of the Literature domain, their total size is much smaller. Because we need consistency in the number of attributes extracted from the text documents of the two domains, the number of Political Analysis Articles for each author is chosen based on an approximation of the number of attributes that will be extracted from them.
3.2 Data Preparation
Data preparation involves eliminating redundant or unnecessary parts of the data which cause noise.
Punctuation marks and special characters are removed from text documents. After that, the
documents of each author are renamed with the author initials and a unique serial number
then stored in a separate directory named by the author (class) name. This stage results in 3
directories for the Political Analysis Articles domain and 5 directories for the Literature
domain.
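The preparation step above can be sketched in Python. This is a minimal illustration under stated assumptions: the cleaning pattern and the initials-plus-serial-number naming scheme are our reading of the description, not the exact procedure used in the paper.

```python
import re

# Hypothetical cleaning step: remove punctuation marks and special
# characters, keeping word characters (including Arabic letters) and whitespace.
def clean_document(text: str) -> str:
    return re.sub(r"[^\w\s]", " ", text)

# Map each author's documents to file names built from the author's
# initials plus a serial number; each author then forms one class directory.
def prepare_author_documents(author: str, texts: list[str]) -> dict[str, str]:
    initials = "".join(word[0] for word in author.split())
    return {
        f"{initials}{i:03d}.txt": clean_document(text)
        for i, text in enumerate(texts, start=1)
    }
```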
3.3 Feature Extraction
This is the most essential phase in author identification where we perform feature
extraction on the text documents to represent the style of each author. We extract two types of features: lexical and syntactic. For the lexical features we extract word bi-grams and for the
syntactic we extract Part-of-Speech (POS) tags n-grams. Here we describe the extraction of
each feature in more details.
3.3.1 Word Bi-grams
For the first feature, we used RapidMiner to generate the word n-grams by using the
“Process Documents from Files” operator in the text processing package. To generate word
n-grams, the text documents go through tokenization, stop-word filtering, stemming and generation of n-grams with a maximum length of 2, so the n-gram operator generates word bi-grams. Finally, we set the word vector creation scheme parameter in the “Process Documents
from Files” operator to compute the Term Frequency-Inverse Document Frequency (TF-IDF)
for each bi-gram. The output from this step is the word bi-grams extracted from the
documents with the normalized TF-IDF computed for each bi-gram in each document.
The Tokenize Operator splits the text document into word tokens, then these tokens are
filtered to remove the stop words (because they occur frequently in all documents, so they
don’t distinguish between authors’ styles). After all stop words are removed, the stems of the remaining tokens are extracted. Finally the word bi-grams which are used to train the classifier
are generated from the stemmed tokens.
3.3.2 POS Tags n-grams
The extraction of the second feature is performed in two steps: the tagging of the text
document and tags n-gram generation.
Step 1: to tag the text documents we used the Arabic Fast Tagger model included in
Stanford POS tagger software tool, where each document is tagged separately. This is done
by running the tagger on the text to be tagged. The output of this process is a text that contains each word paired with its POS tag.
Step 2: to generate the n-grams for the tags we take the tags from the resulting text and
perform the same process we used for the word bi-gram extraction.
The output is the tag bi-grams with the TF-IDF computed for each one in each document, where each row represents the normalized TF-IDF values of the POS tag n-grams (columns) for a specific text document. The class label is specified for each text document instance. After both features are extracted, they are combined into a Comma-Separated Values (CSV) file, the dataset, which is used in the classification.
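Step 2 can be sketched as follows; the slash-separated `word/TAG` format is an assumption about the tagger's plain-text output, and the tag names in the example are illustrative.

```python
# Recover the tag sequence from the tagger output and form tag bi-grams,
# mirroring the word bi-gram generation step.
def tag_bigrams(tagged_text: str) -> list[str]:
    tags = [token.rsplit("/", 1)[1] for token in tagged_text.split() if "/" in token]
    return [f"{a} {b}" for a, b in zip(tags, tags[1:])]

# Example with a hypothetical tagged sentence:
print(tag_bigrams("ذهب/VBD الولد/DTNN الى/IN المدرسة/DTNN"))
# ['VBD DTNN', 'DTNN IN', 'IN DTNN']
```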
3.4 Feature Optimization
Feature optimization involves choosing the most relevant features extracted from a piece of text or a corpus to reduce the size of the data and obtain the most useful information for classification purposes.
In the feature extraction phase, a large number of bi-grams are extracted from the text documents; these may not all be useful for text classification. So, to enhance the classification process, we prune the resulting word bi-grams, removing those that occur in fewer than 3 documents. Part of the optimization is also performed in the feature extraction phase through stop-word filtering and stemming.
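The pruning step amounts to a document-frequency threshold; a minimal sketch, where each document is represented simply as the set of bi-grams it contains:

```python
from collections import Counter

# Keep only bi-grams whose document frequency is at least min_df,
# i.e. drop bi-grams occurring in fewer than 3 documents.
def prune_by_document_frequency(doc_bigrams: list[set[str]], min_df: int = 3) -> set[str]:
    df = Counter(bg for doc in doc_bigrams for bg in doc)
    return {bg for bg, count in df.items() if count >= min_df}
```

In scikit-learn's `TfidfVectorizer`, the equivalent setting would be `min_df=3`.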
3.5 Classification
It is the process of assigning a label or a category to a piece of data to help identify it. In
this phase we build the classifier using the LIBLINEAR SVM classification tool in WEKA.
To train the classifier, we read the dataset files into WEKA, choose LIBLINEAR from the classifier list, then set the SVM type parameter to L2-loss support vector machines (primal). L2 is the commonly used type for text document classification due to its speed and accuracy (Jing, 2011). We set the cost parameter to 1 (its default value) and then start training.
This process is performed for each domain separately.
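Outside WEKA, a comparable setup is scikit-learn's `LinearSVC`, which also wraps LIBLINEAR. This is a stand-in sketch with placeholder data, not the authors' exact configuration: `LinearSVC`'s squared-hinge loss corresponds to the L2-loss SVM, and `C=1.0` matches the cost parameter described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Placeholder documents and author labels.
docs = [
    "siyasa tahlil maqal awwal", "siyasa tahlil maqal thani",
    "qissa adab riwaya ula", "qissa adab riwaya thania",
]
labels = ["author_A", "author_A", "author_B", "author_B"]

# Bi-gram TF-IDF features, as in the feature extraction phase.
X = TfidfVectorizer(ngram_range=(2, 2)).fit_transform(docs)

clf = LinearSVC(C=1.0)   # squared-hinge (L2) loss, cost parameter = 1
clf.fit(X, labels)
```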
3.6. Evaluation
The final phase is the evaluation of the model by testing the results obtained from
experiments using predefined criteria such as performance and accuracy.
We tested the results using 10-fold cross validation, which is set from the WEKA test options in the Explorer interface, and then compared the model accuracy and the per-class accuracy using the true positive rate (TP Rate) and confusion matrix shown in the main results GUI of the WEKA Explorer.
4. Experiments and Results
This section presents the experiments we performed on Arabic text documents for
authorship identification using WEKA for SVM classifier building and evaluation. The
experiments are performed on features extracted from documents in two domains; Literature
and Political Analysis Articles. For the Literature domain the experiments are separated into
two parts: the first is performed on 3 classes and the second is performed on 5 classes to help
in evaluating the performance of the classification model. The experiments are divided into
two parts: the first part is performed on the word bigrams feature vector only without the POS
tags, and the second part includes both features. The reason we perform the experiments this
way is to see whether combining the two features will have any effect on the classification
accuracy.
Accuracy is the basic measure of evaluation. The classification model is evaluated using
10 fold cross validation to show the actual performance of the classification model.
10 Fold Cross Validation is a technique for estimating the performance of a predictive
model by dividing the data into 10 parts which are used to train and test the model. The data
is broken down into 10 subsets of size n/10; the model is then trained on 9 of them and tested on the remaining one. The process is repeated 10 times and the mean accuracy is taken.
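The procedure can be reproduced on synthetic stand-in data (the feature vectors and author labels below are fabricated for illustration, not the paper's dataset):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in for the feature vectors: 100 instances, 2 authors,
# drawn from two well-separated Gaussians.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 20)), rng.normal(3.0, 1.0, (50, 20))])
y = np.array(["A"] * 50 + ["B"] * 50)

# cv=10 performs 10-fold cross validation: train on 9 folds, test on 1,
# repeated 10 times; the mean of the 10 accuracies estimates performance.
scores = cross_val_score(LinearSVC(C=1.0), X, y, cv=10)
print(f"mean accuracy: {scores.mean():.2f}")
```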
Testing Accuracy is defined as the percentage of correctly classified documents out of the total number of documents.
Confusion Matrix is a specific table that allows visualization of the performance of an
algorithm where each column of the matrix represents the instances in a predicted class,
while each row represents the instances in an actual class. The name stems from the fact that
it makes it easy to see if the system is confusing two classes, for example mislabeling one as
another (Provos, 2011).
We introduce the confusion matrix because it provides a clear view of the results of
classifying the instances of each class which allow us to identify the number of correctly and
incorrectly classified instances and to identify the class to which they are mislabeled.
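For example, with hypothetical predictions for two authors (labels and predictions below are invented for illustration):

```python
from sklearn.metrics import confusion_matrix

# Rows: actual class; columns: predicted class, as described above.
y_true = ["A", "A", "B", "B", "B"]
y_pred = ["A", "B", "B", "B", "A"]
cm = confusion_matrix(y_true, y_pred, labels=["A", "B"])
print(cm)
# [[1 1]
#  [1 2]]

# Per-class TP rate: correctly classified instances of a class divided
# by the total instances of that class (the diagonal over the row sums).
tp_rate = cm.diagonal() / cm.sum(axis=1)
```

Off-diagonal cells show exactly which class the mislabeled instances were assigned to, which is why we report the matrix alongside accuracy.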
4.1 Experiment 1
The first experiment aims to train and test the classification model on the word bigrams
feature only using 10 fold cross validation to obtain the actual accuracy of the classifier. The
Political Analysis Articles classification model is first tested with an accuracy of around 93%
and then the two models of the Literature domain (first with 3 classes then 5) are tested with
an accuracy of 95% and 91%, respectively. The result of testing the Political Analysis Articles domain classification model is presented in Fig. 2 (we do not show the rest of the testing results due to space limitations).
Fig. 2: The result of testing model on word bi-grams feature for Political Analysis Articles
The dataset is split into 10 parts; 9 are used for training the model and 1 is used to test it. The figure shows the testing accuracy of 92.9412%, which is the mean accuracy of the 10 runs of the 10-fold cross validation. It also presents the TP rate and the confusion matrix,
which provides an overview of the model accuracy for each class.
4.2 Experiment 2
The second experiment aims to train and test the classification model on the word bigrams
feature combined with the POS tags feature using 10 fold cross validation to obtain the actual
estimation of the classifier performance. The Political Analysis Articles classification model
is first tested with an accuracy of 100% and then the two models of the Literature domain (first with 3 classes then 5) are tested with accuracies of 97% and 98%, respectively. The result
of testing the Political Analysis Articles domain classification model is presented in Fig. 3
(again, we do not show the rest of the testing results due to space limitation).
Fig. 3: The result of testing the model on word bi-grams and POS Tags n-gram features for
Political Analysis Articles
The figure shows the testing results of 10 fold cross validation for the Political Analysis
Articles domain on both features; the word bi-grams and the POS tags n-grams. As the figure
shows the TP rate for all the classes is 1 which means that all instances of each class are
classified correctly. This is also demonstrated by the confusion matrix, where all instances were predicted correctly without any mislabeling. Also, as in all previous experiments, the
time taken to build the classification model is relatively short.
Table 2 summarizes the results of all the experiments performed on both domains and a
graphical representation is presented in Fig. 4.
Table 2: The results for classification model evaluation of Political Analysis Articles and
Literature domains

Domain                        No Tags Experiments   With Tags Experiments
Political Analysis Articles   92.94%                100%
Literature 3-classes          95%                   97%
Literature 5-classes          91%                   98%
Fig. 4: The representation of the results of evaluating the classification model for Literature
and Political Analysis Articles domains
The evaluation of both experiments performed on the Political Analysis Articles and the
Literature domain demonstrate that even though the word bi-grams feature results in high
classification accuracy, combining the POS tags n-grams feature with the word bi-grams
feature results in even higher accuracy. The results also show that linear SVM is an effective algorithm for text document classification: it can handle the large number of features and attributes that usually result from text document representation, and its computation time is relatively short even with a large number of instances.
5. Conclusion and Future Work
This research deals with the problem of author identification for Arabic text using Support
Vector Machines (SVM). We performed several experiments on Arabic text documents taken
from two domains: Political Analysis Articles and Literature. For the document
representation we combined two types of features, lexical and syntactic, where we used the word bi-grams lexical feature and the Part-of-Speech (POS) tags n-grams syntactic feature.
To ensure the classifier accuracy, we conducted two experiments separately. The first one
was performed on a dataset of the word bi-grams feature only for both domains and the
second was performed on a dataset that combines both features.
The tested accuracy with both feature types was higher in both domains than with a single feature type. This demonstrates that combining more than one feature type enhances the classification process. The tested accuracy for all experiments also shows
that SVM is an effective algorithm for text documents classification that results in accuracy
that may reach 100%.
This research showed a significant result for text document classification using support
vector machines. Further enhancements can be introduced such as the introduction of new
feature types with different combinations, the processing of a larger corpus with various
documents and larger number of authors, and the processing of metaphorical and rhymed text
documents in the literature field.
References
Al-Harbi, S., et al. (2008). Automatic Arabic Text Classification. The 9th International Conference on the Statistical Analysis of Textual Data. Lyon, France.
Chaurasia, M. A. (2011). An Empirical Study on Author Affirmation. International Journal
of Electrical & Computer Sciences, 11(1).
Diederich, J. K. (2003). Authorship attribution with support vector machines. Applied
intelligence, 19(1-2), 109-123.
El-Shishtawy, T., et al. (2009). Arabic Keyphrase Extraction using Linguistic knowledge and
Machine Learning Techniques. Proceedings of the Second International Conference
on Arabic Language Resources and Tools. Cairo, Egypt: The MEDAR Consortium.
Jing, W. Y. (2011). Authorship Identification for Chinese Texts Based on Dependency
Grammar. Journal of Convergence Information Technology, 6(6).
Joachims, T. (1998). Text categorization with support vector machines: learning with many
relevant features. Proceedings of ECML-98, 10th European Conference on Machine
Learning, 1398.
Mccombe, N. (2002). Methods of Author Identification. Trinity College Dublin.
Provos, K. (2011). Confusion Matrix. Retrieved November 2013, from Wikipedia: http://en.wikipedia.org/wiki/Confusion_matrix
Saad, M. K. (2010). Arabic Text Classification Using Decision Trees. 12th international
workshop on computer science and information technologies CSIT’2010. Saint Petersburg, Russia.
Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the
American Society for information Science and Technology, 60(3), 538-556.
Taş, T., et al. (2007). Author Identification for Turkish Texts. Journal of Arts and Sciences, 7.
Zhao, Y. Z. (2006). Using relative entropy for authorship attribution. Information Retrieval
Technology, 92-105.