Journal of Advanced Computer Science and Technology Research, Vol.4 No.1, March 2014, 1-11

Arabic Text Author Identification Using Support Vector Machines

Rebhi Baraka, Samar Salem, Mona Abu Hussien, Nidaa Nayef, and Wala Abu Shaban
Faculty of Information Technology, Islamic University of Gaza, Gaza, Palestine
[email protected]

Article Info: Received: 12th December 2013; Accepted: 21st January 2014; Published online: 1st March 2014
e-ISSN: 2231-8852

ABSTRACT
A model for Arabic text author identification is proposed. It classifies a set of Arabic text documents with unknown authorship by capturing the style of each author through features extracted from the text. The identification process is achieved through five phases: document collection, dataset preparation, feature extraction, feature optimization, and classification model building. The model relies on Support Vector Machines (SVM) and combines two feature types on two domains: Political Analysis Articles and Literature. The experiments show that the model is effective, with classification accuracy that may reach 100%.

Keywords: Arabic NLP, Data Mining, Data Classification, Part of Speech Tagging

1. Introduction
Author identification is the task of identifying the author of a given text. It can be treated as a typical classification problem in natural language processing: a set of documents with known authors is used for training, and the aim is to automatically determine the author of an anonymous text. This has a broad range of applications: analyzing ancient documents, plagiarism detection to establish whether claimed authorship is valid, and forensic investigations to verify the authorship of e-mails and messages (Taş, 2007). In contrast to other classification tasks, it is not clear which features of a text should be used to characterize an author.
So the main issue in computer-based author identification is to identify a set of features that represents the author's writing style. These are then used to classify the authors of selected anonymous texts. Several feature types can be used to classify authors, including word-level, character-level, syntactic, semantic, and lexical features (Stamatatos, 2009). This work deals with Arabic text author identification, where we classify a set of Arabic text documents with unknown authorship by capturing the style of each author through features extracted from the text. There are many challenges in choosing the right features that can capture the style of the writer accurately, and also in choosing the methods and approaches to use for author classification.

In order to classify authors, we gather text documents for different writers, mainly from the internet. Then, using NLP tools, we extract a set of features, basically word-level, syntactic, and lexical features, to capture the authors' style. Feature reduction is required to get the most relevant features for building the classifier. Support Vector Machines (SVM) are proposed as the classification method due to their tested performance in text classification and categorization applications. The proposed author identification model consists of five phases: document collection, dataset preparation, feature extraction, feature optimization, and classification model building. In the document collection phase we collect a set of Arabic text documents in the Political Analysis Articles and Literature fields. Then, in the dataset preparation phase, we process the documents and organize them into separate directories for each author. The feature extraction phase is the most essential, since a set of features is extracted from the documents to capture the style of the writers.
For the feature optimization phase, the previously extracted features are optimized to get the most relevant features that represent the style of authors. The final phase is classification model building and evaluation to test the model performance. The rest of the paper consists of the following sections: Section 2 presents related work, Section 3 presents the proposed author identification model, Section 4 presents the experiments and results of realizing the model, and Section 5 concludes the paper and suggests future work.

2. Related Work
Jing (2011) applies an approach based on dependency grammar to Chinese authorship identification. The approach consists of four steps: data collection, feature extraction, feature optimization, and identification. For the feature set, dependency is proposed as a new syntactic-level feature and combined with three other features (empty words, part-of-speech, and punctuation) to form the whole set. Principal component analysis (PCA) is used for feature reduction and SVM for classification. Experiments showed that combining the four features gives better performance than any other feature combination. Taş (2007) presents a fully automated approach to author identification for Turkish text by adapting a set of style markers to the analysis of the text. For identifying an author, 35 style markers are determined for a set of 20 different authors. Several machine learning algorithms are tested and compared for classification. Maximum success is obtained with Naïve Bayes Multinomial at a rate of 80%, using the CFS subset evaluator with the rank search method. Chaurasia (2011) proposes a novel approach for author identification based on initial character n-grams applied to real-world text, dealing only with the initial segment of every term.
After text preprocessing, an n-gram profile is generated for each document and compared against the profiles of all documents in the training classes using a dissimilarity algorithm. The accuracy is higher than that of other types of n-gram. Zhao (2006) proposes a new approach to author attribution based on relative entropy, an approach motivated by information theory. Kullback-Leibler Divergence (KLD) is used as a classifier to measure how different two distributions are. They also tested the SVM algorithm and Bayes networks and compared their performance with that of KLD for different types of features, including function words, POS tags, and punctuation. Their experiments showed that KLD classification is an effective method for general text categorization, while SVM results in better accuracy and precision for author attribution. Diederich (2003) introduces the SVM algorithm as an approach for author attribution due to its ability to process a very large number of features without the need for feature selection. Several classification methods are compared with SVM to test its performance on word-level features, especially word forms and bigrams of function words. The experiments showed that SVM achieves good performance for author identification. They also showed that bigrams of function words give a performance superior to the other classifiers but perform less well than full word forms. Stamatatos (2009) presents a survey of the latest advances in automated approaches to author attribution. He examines the characteristics of these approaches for text representation and text classification, as well as the evaluation methodologies and criteria used in author attribution studies.
The survey distinguishes several types of stylometric features that quantify writing style, including lexical, character, syntactic, semantic, and application-specific features. Mccombe (2002) investigates a set of computer-based author identification methods to determine which are the most effective. The work introduces the programs used in the author identification process, explains how to use each program, describes the algorithms and mathematical methods involved, and reports the results obtained from each. The most effective method reported is the one based on letter unigrams, used for 300-word e-mail author identification.

Author identification depends much on classification approaches, and the SVM classification algorithm has been widely used in text classification. In our research, we depend on SVM for classification after preprocessing the Arabic text documents and extracting features. Next we present some works that have used SVM in their classification process. Joachims (1998) explores the use of SVM as a text classifier. Word stems are used to represent the text features, where words are considered features if they are not stop words and if they occur at least 3 times in the training data. Information gain is used for feature selection. The experiments compared SVM with four other learning methods used for categorization. The results show that SVM has better precision, recall, and overall performance than the other methods. Al-Harbi (2008) introduces an automatic approach for Arabic text classification based on two classification algorithms: SVM and C5.0. Seven different datasets are used to cover different subject domains. Lexical features (single words) are extracted, and chi-squared statistics are used to compute the dependency between a term and a class for feature selection. The experiments are conducted using RapidMiner for the SVM algorithm and Clementine for the C5.0 decision tree algorithm.
The results show that C5.0 outperformed SVM in accuracy by about 10%. El-Shishtawy (2009) proposes a method to extract key phrases from Arabic text using statistical measures such as term frequency combined with linguistic knowledge for better results. The knowledge includes syntactic rules such as part-of-speech tags and word frequency; also, the abstract form of Arabic words is used instead of their stems. An analysis of variance (ANOVA) test is used to evaluate the selected features. The learning model is built using Linear Discriminant Analysis (LDA). The experiments show that the presented system performs better than an existing Arabic extractor system, with doubled precision and recall obtained mainly from the linguistic knowledge. Saad (2010) examines the impact of text preprocessing and term weighting schemes on Arabic text classification and develops a new combination of term weighting schemes to enhance it. The results show that using term weighting and stemming reduces dataset dimensionality, since stemming reduces many terms to their root; classification accuracy is also much better for stemmed terms than for Bags-of-Tokens (BOT) without stemming, for all combinations.

3. Author Identification Model
This section describes the model for author identification. The model is translated into a series of steps starting from Arabic text document collection and ending with classification and evaluation. The proposed model, shown in Figure 1, represents the author identification process defined as a series of steps performed in sequence: data collection, data preparation, feature extraction, feature optimization, classification, and finally evaluation.
The documents from each author are processed at each step to obtain the information needed to identify authorship, until the actual classification is performed and the classifier is evaluated.

Fig. 1: Authorship Identification Model

Next we explain the author identification process represented by the model in more detail.

3.1 Data Collection
Data collection is the process of gathering a set of text documents in a specific domain for later processing to achieve a specific purpose. We collect multiple Arabic text documents for different writers in two domains: Literature and Political Analysis Articles. The number and length of the documents differ across authors in both domains. For the Political Analysis Articles domain we have 3 authors (3 classes); for the Literature domain we started with 3 authors and then added two more, for a total of 5 authors (literature documents, being stories and novels, are much larger than Political Analysis Articles). Table 1 shows the total number and size of documents for each author in each domain.

Table 1: The number and size of documents for the Political Analysis Articles and Literature domains

Political Analysis Articles Domain
  Fahmi Howeidi      80 documents   613 KB
  Saleh Naami        89 documents   745 KB
  Yasser Zaatrah     86 documents   632 KB

Literature Domain
  Abdullah Tayeh     10 documents   0.99 MB
  Nadia Khost        14 documents   0.97 MB
  Najeeb Mahfoz      13 documents   866 KB
  Ghassan Kanafani   10 documents   495 KB
  Gobran K. Gobran   11 documents   417 KB

Although the number of documents in the Political Analysis Articles domain is large compared to that of the Literature domain, their total size is much smaller. Because we need consistency in the number of attributes extracted from the text documents of the two domains, the number of Political Analysis Articles for each author is chosen based on an approximation of the number of attributes that will be extracted from them.
3.2 Data Preparation
Preparing the data for further processing involves eliminating redundant or unnecessary parts that cause noise in the data. Punctuation marks and special characters are removed from the text documents. After that, the documents of each author are renamed with the author's initials and a unique serial number, then stored in a separate directory named after the author (class). This stage results in 3 directories for the Political Analysis Articles domain and 5 directories for the Literature domain.

3.3 Feature Extraction
This is the most essential phase in author identification, where we perform feature extraction on the text documents to represent the style of each author. We extract two types of features: lexical and syntactic. For the lexical features we extract word bi-grams, and for the syntactic features we extract Part-of-Speech (POS) tag n-grams. Here we describe the extraction of each feature in more detail.

3.3.1 Word Bi-grams
For the first feature, we used RapidMiner to generate the word n-grams using the "Process Documents from Files" operator in the text processing package. To generate word n-grams, the text documents go through tokenization, stop word filtering, stemming, and n-gram generation, with a maximum length of 2 set on the n-gram operator to produce word bi-grams. Finally, we set the word vector creation scheme parameter of the "Process Documents from Files" operator to compute the Term Frequency-Inverse Document Frequency (TF-IDF) for each bi-gram. The output of this step is the set of word bi-grams extracted from the documents, with the normalized TF-IDF computed for each bi-gram in each document. The Tokenize operator splits each text document into word tokens; these tokens are then filtered to remove stop words (which occur frequently in all documents and therefore do not distinguish between authors' styles).
After all stop words are removed, the stems of the remaining tokens are extracted. Finally, the word bi-grams used to train the classifier are generated from the stemmed tokens.

3.3.2 POS Tag N-grams
The extraction of the second feature is performed in two steps: tagging the text documents and generating the tag n-grams. Step 1: to tag the text documents we used the Arabic Fast Tagger model included in the Stanford POS tagger software tool, where each document is tagged separately by running the tagger on the text. The output of this process is a text that contains each word with its POS tag. Step 2: to generate the n-grams for the tags, we take the tags from the resulting text and perform the same process used for word bi-gram extraction. The output is the tag bi-grams with TF-IDF computed for each one in each document, where each row represents the normalized TF-IDF values of the POS tag n-grams (represented in columns) for a specific text document. The class label is specified for each instance of the text documents. After both features are extracted, they are combined into a Comma Separated Values (CSV) file, the dataset, which is used in the classification.

3.4 Feature Optimization
Feature optimization involves choosing the most relevant features extracted from a piece of text or a corpus, to reduce the size of the data and get the most useful information for classification purposes. In the feature extraction phase, a large number of bi-grams are extracted from the text documents; these may not all be useful for text classification. So, in order to enhance the classification process, we prune the resulting word bi-grams to remove those that occur in fewer than 3 documents. Part of the optimization is also performed in the feature extraction phase through stop word filtering and stemming.
3.5 Classification
Classification is the process of assigning a label or category to a piece of data to help identify it. In this phase we build the classifier using the LIBLINEAR SVM classification tool in WEKA. To train the classifier, we read the dataset files into WEKA, choose LIBLINEAR from the classifier list, and set the SVM type parameter to L2-loss support vector machines (primal). L2 is the type commonly used for text document classification due to its speed and accuracy (Jing, 2011). We set the cost parameter to 1 (its default value) and then start training. This process is performed for each domain separately.

3.6 Evaluation
The final phase is the evaluation of the model by testing the results obtained from the experiments using predefined criteria such as performance and accuracy. We tested the results using 10-fold cross validation, which is set from the WEKA test options in the Explorer interface, and then compared the model accuracy and the per-class accuracy from the true positive rate (TP rate) and the confusion matrix shown in the main results GUI of the WEKA Explorer.

4. Experiments and Results
This section presents the experiments we performed on Arabic text documents for authorship identification, using WEKA for SVM classifier building and evaluation. The experiments are performed on features extracted from documents in two domains: Literature and Political Analysis Articles. For the Literature domain, the experiments are separated into two parts: the first is performed on 3 classes and the second on 5 classes, to help evaluate the performance of the classification model. The experiments are also divided into two parts by feature set: the first part is performed on the word bi-grams feature vector only, without the POS tags, and the second part includes both features. The reason we perform the experiments this way is to see whether combining the two features has any effect on the classification accuracy. Accuracy is the basic measure of evaluation.
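The classification and evaluation setup described in Sections 3.5 and 3.6 can be sketched as follows. This is an illustrative stand-in, not the paper's WEKA configuration: scikit-learn's LinearSVC wraps the same LIBLINEAR library, and its defaults (squared hinge, i.e. L2 loss, and C=1) match the settings described above, while the feature vectors below are synthetic rather than the extracted TF-IDF dataset.

```python
# Sketch of L2-loss linear SVM training with 10-fold cross validation.
# Assumptions: LinearSVC stands in for WEKA's LIBLINEAR wrapper, and the
# feature matrix is synthetic, not the paper's TF-IDF dataset.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(42)
# 20 "documents" per author, drawn from well-separated clusters so the
# toy task is learnable.
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 3)),
               rng.normal(1.0, 0.1, size=(20, 3))])
y = np.array(["author_A"] * 20 + ["author_B"] * 20)

# L2-loss (squared hinge) linear SVM with cost parameter C = 1.
clf = LinearSVC(loss="squared_hinge", C=1.0)

# 10-fold cross validation: train on 9 folds, test on the held-out fold,
# repeat 10 times, then score the pooled predictions.
pred = cross_val_predict(clf, X, y, cv=10)
print("accuracy:", accuracy_score(y, pred))

# Rows are actual classes, columns are predicted classes.
print(confusion_matrix(y, pred, labels=["author_A", "author_B"]))
```

The per-class TP rate reported by WEKA corresponds to each diagonal entry of the confusion matrix divided by its row sum.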
The classification model is evaluated using 10-fold cross validation to show its actual performance. 10-fold cross validation is a technique for estimating the performance of a predictive model: the data is broken into 10 sets of size n/10, the model is trained on 9 of them and tested on the remaining one, the process is repeated 10 times, and the mean accuracy is taken. Testing accuracy is defined as the percentage of correctly classified documents out of the total number of documents. A confusion matrix is a table that allows visualization of the performance of an algorithm, where each column of the matrix represents the instances in a predicted class and each row represents the instances in an actual class. The name stems from the fact that it makes it easy to see whether the system is confusing two classes, for example mislabeling one as another (Provos, 2011). We introduce the confusion matrix because it provides a clear view of the results of classifying the instances of each class, which allows us to identify the number of correctly and incorrectly classified instances and the class to which misclassified instances are assigned.

4.1 Experiment 1
The first experiment aims to train and test the classification model on the word bi-grams feature only, using 10-fold cross validation to obtain the actual accuracy of the classifier. The Political Analysis Articles classification model is tested first, with an accuracy of around 93%; the two models of the Literature domain (first with 3 classes, then 5) are then tested, with accuracies of 95% and 91%. The result of testing the Political Analysis Articles domain classification model is presented in Fig. 2 (we do not show the rest of the testing results due to space limitation).

Fig. 2: The result of testing the model on the word bi-grams feature for Political Analysis Articles

The dataset is split into 10 parts: 9 are used for training the model and 1 is used to test it. The figure shows the testing accuracy of 92.9412%, which is the mean accuracy of the 10 runs of the 10-fold cross validation. It also presents the TP rate and the confusion matrix, which provide an overview of the model accuracy for each class.

4.2 Experiment 2
The second experiment aims to train and test the classification model on the word bi-grams feature combined with the POS tags feature, using 10-fold cross validation to obtain the actual estimate of classifier performance. The Political Analysis Articles classification model is tested first, with an accuracy of 100%; the two models of the Literature domain (first with 3 classes, then 5) are then tested, with accuracies of 97% and 98%. The result of testing the Political Analysis Articles domain classification model is presented in Fig. 3 (again, we do not show the rest of the testing results due to space limitation).

Fig. 3: The result of testing the model on the word bi-grams and POS tag n-gram features for Political Analysis Articles

The figure shows the testing results of 10-fold cross validation for the Political Analysis Articles domain on both features, the word bi-grams and the POS tag n-grams. As the figure shows, the TP rate for all classes is 1, which means that all instances of each class are classified correctly. This is also demonstrated by the confusion matrix, where all instances are predicted correctly without any mislabeling. As in all previous experiments, the time taken to build the classification model is relatively short. Table 2 summarizes the results of all the experiments performed on both domains, and a graphical representation is presented in Fig. 4.
Table 2: The results of the classification model evaluation for the Political Analysis Articles and Literature domains

Domain                        No Tags   With Tags
Political Analysis Articles   92.94%    100%
Literature 3-classes          95%       97%
Literature 5-classes          91%       98%

Fig. 4: Representation of the results of evaluating the classification model for the Literature and Political Analysis Articles domains

The evaluation of both experiments performed on the Political Analysis Articles and Literature domains demonstrates that, even though the word bi-grams feature alone results in high classification accuracy, combining the POS tag n-grams feature with the word bi-grams feature results in even higher accuracy. The results also show that linear SVM is an effective algorithm for text document classification: it can handle the large number of features and attributes that usually result from text document representation, and its computation time is relatively short even with a large number of instances.

5. Conclusion and Future Work
This research deals with the problem of author identification for Arabic text using Support Vector Machines (SVM). We performed several experiments on Arabic text documents taken from two domains: Political Analysis Articles and Literature. For the document representation we combined two types of features, lexical and syntactic: the word bi-grams lexical feature and the Part-of-Speech (POS) tag n-grams syntactic feature. To assess the classifier accuracy, we conducted two experiments separately: the first was performed on a dataset of the word bi-grams feature only for both domains, and the second on a dataset that combines both features. The tested accuracy of the dataset with both feature types was higher for both domains than when one feature type was used.
This demonstrates that combining more than one feature type enhances the classification process. The tested accuracy for all experiments also shows that SVM is an effective algorithm for text document classification, with accuracy that may reach 100%. This research showed a significant result for text document classification using support vector machines. Further enhancements can be introduced, such as new feature types in different combinations, a larger corpus with more varied documents and a larger number of authors, and the processing of metaphorical and rhymed text documents in the literature field.

References

Al-Harbi, S. et al. (2008). Automatic Arabic Text Classification. The 9th International Conference on the Statistical Analysis of Textual Data. Lyon, France.

Chaurasia, M. A. (2011). An Empirical Study on Author Affirmation. International Journal of Electrical & Computer Sciences, 11(1).

Diederich, J. K. (2003). Authorship attribution with support vector machines. Applied Intelligence, 19(1-2), 109-123.

El-Shishtawy, T. a.-S. (2009). Arabic Keyphrase Extraction using Linguistic Knowledge and Machine Learning Techniques. Proceedings of the Second International Conference on Arabic Language Resources and Tools. Cairo, Egypt: The MEDAR Consortium.

Jing, W. Y. (2011). Authorship Identification for Chinese Texts Based on Dependency Grammar. Journal of Convergence Information Technology, 6(6).

Joachims, T. (1998). Text categorization with support vector machines: learning with many relevant features. Proceedings of ECML-98, 10th European Conference on Machine Learning, 1398.

Mccombe, N. (2002). Methods of Author Identification. Trinity College Dublin.

Provos, K. (2011). Confusion Matrix. Retrieved 11 2013, from Wikipedia: http://en.wikipedia.org/wiki/Confusion_matrix
Saad, M. K. (2010). Arabic Text Classification Using Decision Trees. 12th International Workshop on Computer Science and Information Technologies CSIT'2010. Saint Petersburg, Russia.

Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3), 538-556.

Taş, T. a. (2007). Author Identification for Turkish Texts. Journal of Arts and Sciences, 7.

Zhao, Y. Z. (2006). Using relative entropy for authorship attribution. Information Retrieval Technology, 92-105.