KNN Arabic Text Categorization Using IG Feature Selection

Dr. Ghassan Kanaan, Amman Al-Ahliyya University, Jordan, [email protected]
Dr. Riyad Al-Shalabi, Amman Al-Ahliyya University, Jordan, [email protected]
AbdAllah Al-Akhras, Yarmouk University, Jordan, [email protected]

ABSTRACT
This project presents an implementation of an automatic KNN Arabic text categorizer. Six hundred Arabic text documents belonging to 6 categories were used to test the classifier. The main objective of this project is to build an automatic KNN Arabic text categorizer and to test the effectiveness of information gain (IG), the feature selection method used in this project. The study concluded that the effectiveness of the improved classifier is very good, reaching a macro-recall of 0.793 and a macro-precision of 0.627, and that the effectiveness of the KNN classifier increases as the training size increases. For specific categories, agriculture has the best average recall, 0.922, while the best precision, 0.899, was obtained on the economy category.

Keywords: Text Categorization, KNN, Similarity Measures, Vector Model, Term Weighting, Keyword Extraction

1 Introduction
Text categorization (TC), or text classification, is the task of assigning a number of appropriate categories to a text document. This categorization process has many applications such as document routing, document management, or document dissemination. Traditionally, each incoming document is analyzed and categorized manually by domain experts based on its content, and extensive human resources have to be spent on carrying out such a task. To facilitate the process of text categorization, automatic categorization schemes are required; their goal is to build categorization models that can be used to classify text documents automatically [22].

In this research we built an automatic Arabic text categorizer based on information gain (IG) feature selection, which many studies suggest is among the best feature selection methods overall [5, 6]. Normalized TF*IDF is used as the term weighting scheme, and the Jaccard coefficient is used as the similarity measure.

Traditionally, methods for selecting subsets of features that provide the best performance are divided into wrappers and filters [1]. Wrappers utilize the learning machine as a fitness (evaluation) function and search for the best features in the space of all feature subsets. This formulation of the problem allows the use of standard optimization techniques, with the additional complication that the fitness function has a probabilistic nature. Wrappers depend heavily on the inductive principle of the learning model and may suffer from excessive computational complexity, since the problem itself is NP-hard. In contrast to wrappers, filters are typically based on selecting the best features in one pass, although more complex approaches have also been studied [2]. In domains such as text categorization or gene selection, filters are still dominant [3]. Evaluating one feature at a time, filters estimate its usefulness for the prediction process according to various metrics [4-7]. Besides wrappers and filters, some authors distinguish embedded methods as a separate category of feature selection algorithms [3, 8]. Embedded methods are incorporated into the learning procedure, and hence are also dependent on the model.
In fact, almost any learner can be considered some form of an embedded feature selection algorithm, where the particular estimation of features' usefulness may result in their weighting [9], elimination [10] or construction of new features [11].

Figure 1.1: Overview of the categorization process (input documents → text preprocessing → document indexing → keyword selection → categorization algorithm → evaluation)

As shown in Figure 1.1, the categorization process consists of five main steps: data preprocessing, document indexing, keyword (feature) selection, the categorization algorithm and, finally, evaluation of the categorization task. These steps are described in the next sections.

1.1 Data preprocessing
The first step after the text documents are input to the system is preprocessing, which is very important for natural language text. In this step the stop words are removed and stemming is applied to the remaining text of each document, which helps reduce the feature space of the problem (an illustrative sketch is given at the end of this overview).

1.2 Indexing
In this step the system builds the index data structure, which is usually an inverted file, and calculates the weight of each term in the documents. Many weighting schemes are used, such as the Boolean model and the term frequency (TF) scheme; the best scheme is normalized TF*IDF.

1.3 Feature selection
Feature selection is a very important step in the categorization process, because it selects the keywords that best represent the text documents. Selecting too few keywords harms the accuracy of the system, while selecting too many keywords makes the classifier very slow. Many feature selection methods have been proposed and used in previous research, such as document frequency (DF), mutual information (MI), term strength (TS) and information gain (IG), which is the method used in this research.

1.4 Categorization algorithm
In this step a categorization algorithm is applied to classify the text documents based on the index data structure; the technique used in this research is the KNN algorithm.

1.5 Evaluation
The evaluation of the classifier is done using several measures of effectiveness, such as recall, precision and the F-measure (the evaluation measures are discussed in Section 4).
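As a concrete illustration of the preprocessing step of Section 1.1, the following Python sketch removes stop words and applies a simple light-stemming rule. The stop-word list and affix sets are small illustrative samples chosen for the example, not the exact resources used in this study.

```python
# Toy Arabic light stemmer: drop stop words, then strip one common prefix
# and one common suffix if the remaining stem stays long enough.
STOP_WORDS = {"في", "من", "على", "إلى", "عن", "أن", "هذا", "التي", "الذي"}
PREFIXES = ("ال", "وال", "بال", "كال", "فال", "لل")
SUFFIXES = ("ات", "ان", "ون", "ين", "ها", "ية", "ة", "ه", "ي")

def light_stem(word):
    """Strip at most one prefix and one suffix, keeping stems of length >= 3."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word

def preprocess(text):
    """Tokenize on whitespace, remove stop words and light-stem the rest."""
    return [light_stem(w) for w in text.split() if w not in STOP_WORDS]

print(preprocess("الزراعة في الأردن"))  # ['زراع', 'أردن']
```

In practice the stop-word list and affix sets would be much larger; the point of the sketch is only to show how light stemming keeps stems rather than roots, as described in Step 1 of Section 5.2.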
2 Literature Review
A significant body of research has been produced in the feature selection area. Excellent surveys, systematizations and journal special issues on feature selection algorithms have been presented in the recent past [1, 3, 8, 13]. Searching for the best features within the wrapper framework, Kohavi and John [1] define an optimal subset of features with respect to a particular classification model: the best feature subset given the model is the one that provides the highest accuracy. Numerous wrappers have been proposed to date. These techniques include heuristic approaches such as forward selection, backward elimination [14], hill-climbing, and best-first or beam search [15], randomized algorithms such as genetic algorithms [16] or simulated annealing [17], as well as their combinations [18]. In general, wrappers explore the power set of all features, starting with no features, all features, or a random selection thereof. The optimality criterion has also been tackled within the filtering framework. Koller and Sahami select the best feature subset based strictly on the joint probability distributions [19]: a feature subset Z ⊆ X is optimal if p(Y | X) = p(Y | Z). The difference between the probability distributions is measured by the relative entropy, or Kullback-Leibler distance. This problem formulation naturally leads to the backward elimination search strategy. Hence, relevance and optimality do not imply each other.

Naturally, in the case of high-dimensional data sets containing thousands of features, filters are preferred to wrappers. The domains most commonly considered within the filtering framework are text categorization [3] and gene expression [21, 22]. A significant difference between the two is that text categorization systems typically contain both a large number of features and a large number of examples, while gene expression data usually contain a limited number of examples, making the problem statistically underdetermined. The most commonly used filters are based on information-theoretic or statistical principles. For example, information gain and the χ² goodness-of-fit test have become baseline methods. However, both require discretized features and are difficult to "normalize" when features are multi-valued. Several other approaches frequently used are Relief [23, 24], the Gini index [11], relevance [25], average absolute weight of evidence [26], bi-normal separation [6], etc. Benchmarking studies across several domains are provided in [5-7]. Some examples of embedded methods are decision tree learners such as ID3 [27] and CART [11], or the support vector machine approaches of Guyon et al. [10] and Weston et al. [28]. For example, in the recursive feature elimination approach of Guyon et al. [10], an initial model is trained using all the features; features are then iteratively removed in a greedy fashion until the largest margin of separation is reached. Good surveys of embedded techniques are given by Blum and Langley [8] and Guyon and Elisseeff [3].

3 Data Set
This section discusses the data set used for evaluating the implementation of the improved KNN Arabic text categorizer. Since no Arabic corpus was publicly available, we had to create our own corpus. For this purpose we collected newspaper articles from different newspapers and news websites available online, including Al-Jazeera, Al-Safeer, Fares.net and Al-Dostor. The data set consists of 600 Arabic documents of different lengths belonging to 6 categories: Economy (اقتصاد), Health (صحة), Politics (سياسة), Sport (رياضة), Agriculture (زراعة) and Science (علوم). We use 100 documents for each category; Table 3.1 gives the number of documents per category. Each document was labeled manually based on its contents and the domain in which it was found. Each document is stored in a separate file, and the documents of the same category are stored in a separate directory (a small illustrative loading sketch is given after Table 3.1).

Table 3.1: Number of documents per class
  Category               Number of documents
  Economy (اقتصاد)        100
  Health (صحة)            100
  Politics (سياسة)         100
  Sport (رياضة)            100
  Agriculture (زراعة)       100
  Science (علوم)           100
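As an illustrative sketch of this directory-per-category layout, the following Python code loads such a corpus into memory; the root path, file handling and encoding are assumptions made for the example, not details taken from the original system.

```python
import os

def load_corpus(root_dir):
    """Load a directory-per-category corpus: root_dir/<category>/<document file>."""
    corpus = []  # list of (category, document text) pairs
    for category in sorted(os.listdir(root_dir)):
        cat_dir = os.path.join(root_dir, category)
        if not os.path.isdir(cat_dir):
            continue
        for filename in sorted(os.listdir(cat_dir)):
            path = os.path.join(cat_dir, filename)
            with open(path, encoding="utf-8") as f:  # assumed encoding
                corpus.append((category, f.read()))
    return corpus

# Example (hypothetical path): documents = load_corpus("arabic_corpus")
```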
4 Evaluation Measures
This section discusses the evaluation measures used in our research to assess the effectiveness of the proposed system. The experimental evaluation of classifiers typically tries to assess the effectiveness of the classifier [39]. Classification can be binary (the classifier decides whether or not to assign a category to a test document) or ranking-based (the classifier provides a ranked list of candidate categories for the test document). Ranking classifiers can be evaluated as a set of binary classifiers by thresholding the scores of the ranking list.

Table 4.1: Possible decisions in the classification process
                           Correct decision (by expert)
  Classifier decision        Yes       No
  Yes                         A         B
  No                          C         D

Table 4.1 shows the different decisions that can occur during the categorization process, where:
A is the number of documents assigned "yes" by the classifier when "yes" is correct (by the user judgment);
B is the number of documents assigned "yes" by the classifier when "no" is correct;
C is the number of documents assigned "no" by the classifier when "yes" is correct;
D is the number of documents assigned "no" by the classifier when "no" is correct.

4.1 The Recall Measure
Recall (R) is the ratio of the number of documents correctly assigned to a category to the total number of documents that actually belong to that category:

R = \frac{A}{A + C}    (4.1)

4.2 The Precision Measure
Precision (P) is the ratio of the number of documents correctly assigned to a category to the total number of documents assigned to that category:

P = \frac{A}{A + B}    (4.2)

4.3 The Error Measure
Another, less important measure is the error (E), which measures the errors occurring in the categorization process:

E = \frac{B + C}{A + B + C + D}    (4.3)

4.4 Other Measures
Usually recall or precision alone is not used to evaluate the categorization process, because a classifier can have a high recall value but a low precision value, so many researchers use measures that combine the standard recall and precision:
1) The breakeven point, proposed and discussed by Lewis [23], is defined as the point at which recall and precision are equal.
2) The F-measure combines recall and precision using the formula

F = \frac{2 \times Recall \times Precision}{Recall + Precision}    (4.4)

The F-measure has been used in many studies, such as [21, 22].

4.5 Global Averaging Measures
The most popular measures for global averaging are the macro-average and the micro-average; both can be used with the traditional recall and precision.
• Macro-average: the local recall or precision values are evaluated first, and the global average is then obtained by dividing their total by the number of categories, as in the following formulas, where M is the number of categories and the superscript M denotes the macro-average:

R^M = \frac{1}{M} \sum_{i=1}^{M} R_i    (4.5)

P^M = \frac{1}{M} \sum_{i=1}^{M} P_i    (4.6)
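As a minimal sketch of these measures, assuming the per-category counts A, B and C from Table 4.1 are already available, the following Python functions compute recall, precision, the F-measure and the macro-average; the function names and the toy counts in the example are illustrative only.

```python
def recall(a, c):
    """R = A / (A + C): fraction of a category's documents that were assigned to it."""
    return a / (a + c) if (a + c) else 0.0

def precision(a, b):
    """P = A / (A + B): fraction of assigned documents that were correct."""
    return a / (a + b) if (a + b) else 0.0

def f_measure(r, p):
    """F = 2RP / (R + P)."""
    return 2 * r * p / (r + p) if (r + p) else 0.0

def macro_average(values):
    """Macro-average: mean of the per-category values."""
    return sum(values) / len(values)

# Example with hypothetical per-category counts (A, B, C):
counts = {"economy": (80, 30, 20), "sport": (90, 10, 10)}
recalls = [recall(a, c) for a, b, c in counts.values()]
precisions = [precision(a, b) for a, b, c in counts.values()]
print(macro_average(recalls), macro_average(precisions))
```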
5 Research Methodology
This section briefly describes the methodology of the automatic KNN Arabic text categorizer used in the project.

5.1 Overview
The methodology used in this study is the KNN technique, with normalized TF*IDF as the weighting scheme and information gain (IG) for keyword selection. To classify a test document, we compute the similarity between the test document and all the training documents, take the k nearest neighbors, sum the similarities of the neighbors that belong to the same category, and assign the test document to the category with the highest sum; this category is the classifier's decision.

5.2 Brief Description
In this section we briefly describe the steps of building the automatic KNN Arabic text classifier.

Step 1: The user selects the number of documents per category to use as training documents (the number of training documents is the same for all categories; if the user selects 20, for example, then each category contributes 20 training documents, so the total number of training documents is 20 * 6 = 120). The system extracts all the words in the documents (600 Arabic documents), eliminates the stop words, and finds the stem of each term using a light stemming technique, since stemmed terms work better for IR systems and for feature selection [7]. The user also selects the k-value, which determines the number of nearest-neighbor documents used to make the classification decision.

Step 2: The inverted file is built for the training documents. It contains each term (as a stem), the document number, the frequency of the term in that document, and the category number (1 for Economy "اقتصاد", 2 for Health "صحة", 3 for Politics "سياسة", 4 for Sport "رياضة", 5 for Agriculture "زراعة", 6 for Science "علوم").

Step 3: We compute the weight of each term in each document using normalized TF*IDF, because this scheme is better than the raw term frequency or the Boolean model:

W_{ij} = \frac{Freq_{ij}}{MaxFreq_{j}} \times \log_2\left(\frac{N}{n_i}\right)

where Freq_ij is the frequency of term i in document j, MaxFreq_j is the maximum frequency of any term in document j, N is the number of documents in the collection (in this research N = 600), n_i is the number of documents in which term i appears at least once, and W_ij is the weight of term i in document j.

Step 4: Our keyword (feature) selection algorithm is applied: it keeps the stems whose information gain exceeds a threshold value. Information gain measures the goodness of a term globally with respect to all categories; it "measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a document" [5]. In general, if we have categories c_i, where i ranges from 1 to m (in this project m = 6), then the information gain of a term t is defined by the function IG(t) as follows:

IG(t) = -\sum_{i=1}^{m} Pr(c_i) \log Pr(c_i) + Pr(t) \sum_{i=1}^{m} Pr(c_i \mid t) \log Pr(c_i \mid t) + Pr(\bar{t}) \sum_{i=1}^{m} Pr(c_i \mid \bar{t}) \log Pr(c_i \mid \bar{t})

where Pr(c_i) is estimated as the fraction of documents that belong to class c_i, Pr(t) as the fraction of documents in which the term t occurs at least once, Pr(c_i | t) as the fraction of documents containing t that belong to class c_i, and Pr(c_i | t̄) as the fraction of documents not containing t that belong to class c_i. The system then stores these keywords in the index-terms array, together with their frequencies, document numbers, weights and categories.
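The following sketch illustrates Steps 3 and 4 under the definitions above: normalized TF*IDF weighting and information-gain scoring. Documents are assumed to be pre-tokenized lists of stems, and the data structures, names and toy corpus are illustrative assumptions rather than the exact implementation used in this study.

```python
import math
from collections import Counter

def tfidf_weights(doc_stems, all_docs):
    """Normalized TF*IDF: W_ij = (Freq_ij / MaxFreq_j) * log2(N / n_i)."""
    n = len(all_docs)
    df = Counter()                       # n_i: number of documents containing term i
    for stems in all_docs:
        df.update(set(stems))
    freq = Counter(doc_stems)
    max_freq = max(freq.values())
    return {t: (f / max_freq) * math.log2(n / df[t]) for t, f in freq.items()}

def information_gain(term, docs, labels):
    """IG(t) = -sum P(c)logP(c) + P(t) sum P(c|t)logP(c|t) + P(~t) sum P(c|~t)logP(c|~t)."""
    n = len(docs)
    classes = set(labels)

    def sum_p_log_p(probs):
        return sum(p * math.log2(p) for p in probs if p > 0)

    def conditional(labels_subset):
        if not labels_subset:
            return 0.0
        return sum_p_log_p([labels_subset.count(c) / len(labels_subset) for c in classes])

    p_c = [labels.count(c) / n for c in classes]
    with_t = [lab for stems, lab in zip(docs, labels) if term in stems]
    without_t = [lab for stems, lab in zip(docs, labels) if term not in stems]
    p_t = len(with_t) / n
    return -sum_p_log_p(p_c) + p_t * conditional(with_t) + (1 - p_t) * conditional(without_t)

# Example with a toy corpus of stemmed documents (hypothetical data):
docs = [["اقتصاد", "سوق"], ["رياضة", "فريق"], ["اقتصاد", "بنك"]]
labels = ["economy", "sport", "economy"]
print(information_gain("اقتصاد", docs, labels))
print(tfidf_weights(docs[0], docs))
```

In the real system the IG scores would be computed for every stem in the training set and only the stems above the chosen threshold would be kept as index terms.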
After this step, all weight and similarity calculations are based on the index-terms array.

Step 5: The system takes the remaining documents as the testing set (for example, if the user selects 60 documents per category for training, the other 40 documents of each category form the testing set).

Step 6: For each testing document the system removes stop words, applies stemming and computes the weights of its terms. The classifier then computes the similarity between the test document and each of the training documents using the Jaccard similarity measure, given by the following equation:

JaccardSim_{ij} = \frac{\sum_{k=1}^{m} W_{ki} W_{kj}}{\sum_{k=1}^{m} W_{ki}^2 + \sum_{k=1}^{m} W_{kj}^2 - \sum_{k=1}^{m} W_{ki} W_{kj}}

where W_ki is the weight of term k in document i and W_kj is the weight of term k in document j.

Step 7: The classifier finds the k training documents most similar to the test document (k being the value the user selected at the beginning), sums the similarities of the neighbors belonging to each category, and the category with the highest sum becomes the classifier's decision.

Step 8: The classifier updates the local precision and recall counts for the test document; Steps 6 to 8 are repeated for each test document until all testing documents have been classified.

Step 9: Finally, the system computes the local recall and precision for each category, and then the average local recall and precision over all categories of the data set. This is done for each run with a different training-to-testing ratio, and finally the macro-recall and macro-precision are computed for the whole data set.
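A minimal sketch of Steps 6 and 7, assuming each document is already represented as a dictionary mapping index terms to their TF*IDF weights; the helper names and the toy vectors in the example are illustrative, not the original implementation.

```python
from collections import defaultdict

def jaccard_sim(wi, wj):
    """Extended Jaccard coefficient between two weighted term vectors (dicts)."""
    dot = sum(wi[t] * wj[t] for t in wi.keys() & wj.keys())
    denom = sum(v * v for v in wi.values()) + sum(v * v for v in wj.values()) - dot
    return dot / denom if denom else 0.0

def knn_classify(test_vec, training, k):
    """training: list of (category, weight-dict). Sum the similarities of the
    k nearest neighbours per category and return the best-scoring category."""
    sims = sorted(((jaccard_sim(test_vec, vec), cat) for cat, vec in training),
                  reverse=True)[:k]
    scores = defaultdict(float)
    for sim, cat in sims:
        scores[cat] += sim
    return max(scores, key=scores.get)

# Example with tiny hypothetical weight vectors:
training = [("economy", {"bank": 0.9, "market": 0.5}),
            ("sport", {"team": 0.8, "match": 0.6}),
            ("economy", {"market": 0.7, "trade": 0.4})]
print(knn_classify({"market": 0.6, "bank": 0.3}, training, k=3))  # -> economy
```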
6 Results and Testing
We tested the system built in this research many times; after the best threshold was found, we report the following results, based on running the system five times, each time with a different training set (60, 120, 180, 240 and 360 training documents).

Figure 6.1: Average recall and precision (Jaccard) for different training sizes (60-360 training documents).

Figure 6.1 shows that as the number of training documents increases, the accuracy and effectiveness of the system also increase. We also notice that recall is always greater than precision under the Jaccard similarity measure (the similarity measure used in this study).

Figure 6.2: Average results (recall and precision, Jaccard) for each category.

Figure 6.2 shows the average results for all the categories over the five trials. The sport and agriculture categories have the best recall values, while the economy category has the best precision, with a reasonable recall as well. The k-value is very important in the KNN algorithm: a small value of k gives low performance in terms of recall and precision, and a very high value of k also gives low performance, so a suitable value of k must be selected. We suggest k = 19, which gave good results. Figures 6.3 to 6.7 present the results for the different training sizes.

Figure 6.3: Recall and precision per category using 60 training documents.
Figure 6.4: Recall and precision per category using 120 training documents.
Figure 6.5: Recall and precision per category using 180 training documents.
Figure 6.6: Recall and precision per category using 240 training documents.
Figure 6.7: Recall and precision per category using 360 training documents.

7 Conclusions and Recommendations

7.1 Summary
The automatic Arabic KNN classifier implemented in this study was evaluated for automatic Arabic text categorization. A corpus consisting of 600 Arabic text documents belonging to 6 categories was collected from the internet and used in the evaluation. The classifier uses information gain (IG) for feature selection. The data set was preprocessed by removing stop words and applying light stemming, followed by feature selection, term weighting (normalized TF*IDF) and similarity computation (Jaccard).

7.2 Conclusions
Based on the results of the study, the following conclusions may be warranted:
• The results obtained in this study are applicable to the Arabic language.
• The automatic Arabic KNN classifier can be trained to classify using a small number of training documents. The results showed that the best F-measure was obtained with a 60% training set.
• Using the Jaccard similarity measure, the improved KNN classifier reached a macro-recall of about 0.793 and a macro-precision of about 0.627.
• The value of k is very important and strongly affects the performance of the KNN classifier; very low values of k are as unsuitable as very high values. The study recommends k = 19 as the best value.
• For specific categories, agriculture has the best average recall, 0.922, while the best precision, 0.899, was obtained on the economy category.
Holmes, "Benchmarking attribute selection techniques for discrete class data mining," IEEE Trans. Knowledge and Data Engineering, Vol. 15, No. 6, 2003, pp. 1437-1447. A. L. Blum, and P. Langley, "Selection of relevant features and examples in machine learning," [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] Artificial Intelligence, Vol. 97, No. (1-2), 1997, pp. 245-271. N. Littlestone, "Learning quickly when irrelevant attributes abound: a new linear threshold algorithm," Machine Learning, Vol. 2, 1998, pp. 285-318. I. Guyon, et al., "Gene selection for cancer classification using support vector machines," Machine Learning, 2002, Vol. 46, No. (1-3), 2002, pp. 389-422. L. Breiman, "Classification and regression trees," 1984, Belmont, CA. E. Frank, and I. H. Witten, "Using a permutation test for attribute selection in decision trees," International Conference on Machine Learning, 1998, pp. 152-160. M. Dash, and H. Liu, "Feature selection for classification," Intelligent Data Analysis, Vol. 1, No. 3, 1997, pp. 131-156. C. M. Bishop, "Neural networks for pattern recognition," 1995, Oxford University Press.15. Caruana, R. and D. Freitag. Greedy attribute selection. International Conference on Machine Learning. 1994, pp. 28-36. H. Vafaie, and I. F. Imam, "Feature selection methods. Genetic algorithms vs. greedy like search," International Conference on Fuzzy and Intelligent Control Systems, 1994. J. Doak, "An evaluation of feature selection methods and their application to computer security," 1992, Technical Report CSE-92-18. University of California at Davis. G. Brassard, and P. Bratley, "Fundamentals of algorithms," Prentice Hall, 1996. D. Koller, and M. Sahami, "Toward optimal feature selection," International Conference on Machine Learning, 1996, pp. 284-292. G. H. John, R. Kohavi and K. Pfleger., "Irrelevant features and the subset selection problem," [20] [21] [22] [23] International Conference on Machine Learning, 1994, pp. 121-129. J. Li, and L. Wong, "Identifying good diagnostic gene groups from gene expression profiles using the concept of emerging patterns," Bioinformatics, 2002, Vol. 18, No.5, pp. 725-734. F. Sebastiani, "A Tutorial on automated text categorization," In Analia Amandi and Ricardo zunino, editors, proceeding of ASAI-99, 1st argentinian Symposium on artificial intelligence, 1999, pp. 7-35, Buenos aires, AR. W. Lam and C. Y. Ho, "Using a Generalized Instance Set for Automatic Text Categorization," 1998, SIGIR’98, pp. 81-89. D. D. Lewis, "Naïve bayes at forty: The independent assumption in information retrieval," In Proceeding s of ECML-98. 10th European conference on machine learning, Chemnitz, Germany, 1998, pp. 4-15