Urdu Text Classification

Abbas Raza Ali
National University of Computer and Emerging Sciences
Block-B, Faisal Town, Lahore, Pakistan
[email protected]

Maliha Ijaz
National University of Computer and Emerging Sciences
Block-B, Faisal Town, Lahore, Pakistan
[email protected]

ABSTRACT
This paper compares two statistical techniques for text classification, Naïve Bayes and Support Vector Machines, in the context of the Urdu language. A large corpus is used for training and testing the classifiers. Since these classifiers cannot interpret the raw dataset directly, language-specific preprocessing techniques are applied to it to generate a standardized, reduced-feature lexicon. Urdu is a morphologically rich language, which makes these tasks complex. Statistical characteristics of the corpus and lexicon are measured and show that the text preprocessing module performs satisfactorily. The empirical results show that Support Vector Machines outperform the Naïve Bayes classifier in terms of classification accuracy.

Keywords
Corpus, information retrieval, lexicon, Naïve Bayes, normalization, feature selection, text classification, text mining, Urdu.

1. INTRODUCTION
Text classification is the process of automatically classifying unknown text by suggesting the most probable class to which it belongs. As electronic information increases day by day, text classification has become a key technique for organizing large amounts of data for analysis and processing [9]. It is involved in many applications such as text filtering, document organization, classification of news stories, and searching for interesting information on the web. These systems are language specific and have mostly been designed for English, while no such work has been done for the Urdu language. Developing a classification system for Urdu documents is therefore a challenging task, owing to the morphological richness of the language and the scarcity of resources such as automatic tools for tokenization, feature selection and stemming.

Two different classifiers based on supervised learning techniques are developed and their accuracies are compared on the given dataset. From the experiments, the Naïve Bayes classifier is found to be more efficient than the Support Vector Machines; however, the Support Vector Machines achieve higher classification accuracy.

The overall system is divided into three main components:
1) Acquisition, compilation and labeling of the text documents of the corpus
2) Preprocessing of the raw corpus to generate a standardized, reduced-feature lexicon
3) Training of statistical classifiers on the preprocessed data to classify test data

The detailed architecture of the system, along with its three components, is shown in Figure 1.
Figure 1. Architecture of the Urdu text classification system: corpus acquisition and document preprocessing (lexicon-based tokenization, diacritics elimination, normalization, stop-word elimination, affix-based stemming), followed by Naïve Bayes training and classification (estimating P(Term | class) and P(class), choosing the class that maximizes P(class | Term)) and SVM training and classification (calculating normalized term frequencies, learning α, w and w0, choosing the class that maximizes α·r·K(xtest, x) + w0).

2. CORPORA
A large dataset is usually needed to obtain good classification accuracy from a statistical system. For this purpose a large text corpus of 19.3 million words, collected from different online Urdu news websites, is used [3]. The corpus was manually classified into six domains, namely news, sports, finance, culture, consumer information and personal communication. The breakup of documents for each class is given in Table 1 in the next section. The corpus is divided into two parts: the training set contains 90% of the documents from each class and the remaining 10% of the documents are used as the test set.

3. DOCUMENT PREPROCESSING
Statistical classifiers generally require the input dataset to be preprocessed into the format they expect. This preprocessing is language-dependent, so general text mining techniques need to be adapted before they can be applied to an Urdu corpus.

3.1 Tokenization
Words are derived from the corpus on the basis of white spaces and punctuation marks. The corpus also contains sequences of words written without any white space or punctuation, as well as some non-Urdu words. To resolve this problem every candidate word is looked up in a 'tokenization lexicon' and becomes a token if it is found; otherwise it is eliminated. The tokenization lexicon was manually prepared and gathered from different sources and contains 220,760 unique entries. The class-wise statistics of the input dataset at different stages of preprocessing are given in Table 1.

Table 1. Analysis of documents at different levels of preprocessing stages
Class                     Documents   Tokens       Types     Terms
News                      17,501      8,957,259    78,649    54,817
Sports                    3,388       1,666,304    21,473    16,622
Finance                   1,766       1,162,019    16,144    11,951
Culture                   1,088       3,845,117    57,486    37,493
Consumer Information      1,046       1,980,723    26,433    19,781
Personal Communication    1,278       1,685,424    34,614    25,588
Total                     26,067      19,296,846   234,799   166,252

3.2 Diacritics Elimination
Diacritics are used in Urdu text to alter the pronunciation of a word, but they are optional characters in the language. In the current corpus, less than one percent of the words are diacritized. In order to standardize the corpus, the diacritics are completely removed, so that, for example, two words that differ only in the diacritics zabar ( َ ) and zer ( ِ ), such as the words for 'house' and 'surrounded', are mapped to a single undiacritized word. The list of diacritical marks in Urdu is given in Appendix A.

3.3 Normalization
Some Urdu alphabets have more than one Unicode code point because they are shaped similarly to Arabic alphabets. Such characters are replaced by their alternate Urdu alphabets so that multiple copies of the same word are not created. Some examples of un-normalized alphabets and their mapping to Urdu characters are shown in Appendix B.

3.4 Stop Words Elimination
Stop words are functional words of a language and are meaningless in the context of text classification. They are eliminated from the lexicon in order to reduce its size, using a list of the most frequent words known as a stop word list. To obtain the stop word list, the frequency of every term in the corpus is calculated and a 'word frequency lexicon' is prepared. The threshold frequency is chosen by manually analyzing that lexicon; it was observed that the top 116 high-frequency words belong to the functional class. A stop word list comprising these 116 high-frequency words is therefore gathered and eventually eliminated from the final lexicon. Some of the most frequent words among the stop words are given in Table 2.

Table 2. Some most frequent words extracted from the corpus
Word    Frequency        Word    Frequency
اور      743,949          —       368,155
—        582,882          —       306,103
—        575,545          —       281,922
—        466,908          —       254,385
—        413,788          اس      244,017
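To make the preprocessing flow of Sections 3.1–3.4 concrete, a small Python sketch is given below. It is only an illustration under assumed resources: the diacritic set, the character mappings, the tiny lexicon and the stop word list are placeholders standing in for the paper's appendices, the 220,760-entry tokenization lexicon and the 116-word stop word list.

```python
# Illustrative sketch of the preprocessing steps of Sections 3.1-3.4.
# The diacritic set, character mappings and word lists are small placeholders;
# the paper's full tables live in its appendices and lexicon files.
import re

URDU_DIACRITICS = "\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652"  # tanween, zabar, zer, pesh, shadd, sukun
NORMALIZATION_MAP = {
    "\u064A": "\u06CC",  # Arabic yeh -> Urdu yeh (example mapping)
    "\u0643": "\u06A9",  # Arabic kaf -> Urdu kaf (example mapping)
    "\u0629": "\u06C3",  # Arabic teh marbuta -> its Urdu counterpart (example mapping)
}

def normalize(text: str) -> str:
    """Remove optional diacritics and map Arabic-shaped code points to Urdu ones."""
    text = re.sub("[" + URDU_DIACRITICS + "]", "", text)
    return "".join(NORMALIZATION_MAP.get(ch, ch) for ch in text)

def tokenize(text: str, tokenization_lexicon: set) -> list:
    """Split on white space and punctuation; keep only candidates found in the lexicon."""
    candidates = re.split("[\\s\u06D4\u061F\u060C.,;:!?()]+", text)  # includes Urdu full stop, question mark, comma
    return [w for w in candidates if w and w in tokenization_lexicon]

def remove_stop_words(tokens: list, stop_words: set) -> list:
    """Drop the high-frequency functional words collected in the stop word list."""
    return [t for t in tokens if t not in stop_words]

# Example usage with tiny placeholder resources:
lexicon = {"پاکستان", "کھیل", "خبر", "اور", "کی"}   # stands in for the 220,760-entry lexicon
stop_words = {"اور", "کی"}                           # stands in for the 116-word stop word list
raw = "پاکستان اور کھیل کی خبر"
terms = remove_stop_words(tokenize(normalize(raw), lexicon), stop_words)
print(terms)
```

In the full system the output of this stage would then be passed to the affix-based stemmer described in the next section.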
3.5 Stemming
Stemming is the process of reducing a word to its root form; it often consists of removing derivational affixes [13]. Affix-elimination based stemming is applied here in order to merge multiple related word forms. An affix list containing 417 prefixes and 73 suffixes of the Urdu language is used for this purpose, which reduced the number of terms in the lexicon by 24%. The following algorithm is applied to every token to stem it:

1) Pick the first and the last character of the token separately and search for them in the affix lists. If they are not found, concatenate the second and second-last letters and search again.
2) Continue the process until an affix is found in the list, then search the remaining part of the word in the 'tokenization lexicon'; if it is found, retain it and eliminate the remaining (prefix or suffix) part of the token.
3) If no prefix or suffix string is found at all, retain the original word as it is.

For example, for the token صحتمند, ص is added to the prefix string and د is added to the suffix string, and these are looked up in the prefix and suffix lists respectively. When they are not found, ح is appended so the prefix string becomes صح, and ن is appended so the suffix string becomes ند, and both are looked up again. Since they are still not found, ت is appended and the prefix string becomes صحت, and م is appended and the suffix string becomes مند, and they are looked up in the respective lists. Finally مند is found in the suffix list, so the rest of the token, صحت, is looked up in the lexicon; it is found there, and the token صحتمند is therefore reduced to صحت after stemming.

3.6 Statistical Properties
Some statistical properties of the cleaned lexicon extracted from the raw corpus are analyzed using Zipf's law and Heaps' law, and show satisfactory results.

1) According to Zipf's law, the ith most frequent term occurs with a frequency inversely proportional to i, for some constant c:

\mathrm{frequency}_i = \frac{c}{i}    (1)

It models the distribution of terms in a collection, which implies that documents belonging to the same class will have similar frequency distributions [13]. Figure 2 shows that the frequencies of the most common terms are inversely proportional to their rank in the current corpus.

Figure 2. Frequency distribution of terms over the entire collection

2) According to Heaps' law, the vocabulary size of a corpus is estimated using (2), which predicts the number of distinct words in a collection as a function of its size [13]:

\mathrm{vocabulary\ size} = K \times (\mathrm{corpus\ size})^{\beta}    (2)

where K and β are constants; K typically lies between 30 and 90 and β is approximately 0.45. Figure 3 shows how the vocabulary size increases as the size of the current corpus grows.

Figure 3. Estimation of vocabulary size
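The affix-elimination procedure of Section 3.5 can be sketched in a few lines of Python. The sketch below assumes in-memory prefix and suffix sets and the tokenization lexicon; the tiny lists in the usage example are placeholders, and the choice to keep searching when the remaining part is not in the lexicon is one reading of step 2, not necessarily the authors' exact implementation.

```python
# Minimal sketch of the affix-elimination stemmer described in Section 3.5.
# The prefix/suffix sets and the lexicon below are tiny placeholders for the
# paper's 417 prefixes, 73 suffixes and 220,760-entry tokenization lexicon.

def stem(token: str, prefixes: set, suffixes: set, lexicon: set) -> str:
    """Grow prefix and suffix candidates one character at a time (step 1); as soon as one
    matches an affix, keep the remaining part if it is a known word (step 2); otherwise
    return the token unchanged (step 3)."""
    for i in range(1, len(token)):
        prefix_candidate = token[:i]      # first i characters of the token
        suffix_candidate = token[-i:]     # last i characters of the token
        if prefix_candidate in prefixes and token[i:] in lexicon:
            return token[i:]              # drop the matched prefix
        if suffix_candidate in suffixes and token[:-i] in lexicon:
            return token[:-i]             # drop the matched suffix
    return token                          # no usable affix found: keep the original word

# Example usage, mirroring the worked example in the text:
print(stem("صحتمند", prefixes=set(), suffixes={"مند"}, lexicon={"صحت"}))   # -> صحت
```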
4. NAÏVE BAYES
Naïve Bayes is a supervised learning technique that is used efficiently in text classification [1]. It is based on Bayes' theorem with an independence assumption [5]. Using Bayes' rule, the probability of a document belonging to a class is

P(Class \mid Document) = \frac{P(Document \mid Class) \, P(Class)}{P(Document)}    (3)

where P(Document | Class) is the conditional probability of the document given the class, and P(Class) and P(Document) are the prior and evidence probabilities of the class and the document respectively. The independence assumption is used to calculate the conditional probability, where the probability of each document feature (Term_i) is independent of the others [2]. The class that maximizes (3) is selected.

Figure 4. Architecture of text classification using Naïve Bayes: each class generates documents Document_1, ..., Document_m, and each document is represented by its terms Term_1, ..., Term_n.

Under the independence assumption,

P(Document) = P(Term_1) \times \cdots \times P(Term_n) = \prod_{i=1}^{n} P(Term_i)    (4)

P(Document) is constant for all classes, so by ignoring it and applying the independence assumption (4) to (3), the class is chosen as

\arg\max_{i} \left[ \prod_{j=1}^{n} P(Term_j \mid Class_i) \, P(Class_i) \right]    (5)

P(Term_j \mid Class_i) = \frac{\mathrm{count}(Term_j, Class_i)}{\mathrm{count}(Term_j)}    (6)

P(Class_i) = \frac{\mathrm{count}(\text{documents in class } i)}{\mathrm{count}(\text{documents})}    (7)

In (6) the count(Term_j, Class_i) can be zero, because the training data is not large enough to represent every term in every class, which makes the overall estimate equal to zero. To eliminate the zeros, the conditional probability is re-estimated by assigning a very small non-zero constant value, a technique known as smoothing [13]. A very simple smoothing technique is to add one to all the counts and divide by the vocabulary size V to normalize the overall probabilities. This is known as Laplace smoothing and is usually suitable for unigram-based language models such as Naïve Bayes [14]:

P(Term_j \mid Class_i) = \frac{\mathrm{count}(Term_j, Class_i) + 1}{\mathrm{count}(Term_j) + V}    (8)

After estimating the conditional (8) and prior (7) probability parameters during the training phase, a test document is classified as

\mathrm{best\ class} = \arg\max_{c \in C} \left[ \prod_{j=1}^{n} P(Term_j \mid Class) \, P(Class) \right]    (9)

Many conditional probabilities are multiplied in (9), which can result in floating point underflow [13]. Hence, by taking logarithms, (9) becomes

\mathrm{best\ class} = \arg\max_{i} \left[ \sum_{j=1}^{n} \log P(Term_j \mid Class_i) + \log P(Class_i) \right]    (10)

4.1 Algorithm
The algorithm is divided into three independent modules given below.

Preprocessing
1) L ← lexicon based tokenization
2) NL ← text normalization of L
3) T ← high frequency words elimination of NL
4) term ← affix based stemming of T

Training
5) C ← {class1, class2, ..., classk}
6) D ← {document1, document2, ..., documentm}
7) V ← {term1, term2, ..., termn}
8) for each c ∈ C
9)     Nc ← total documents Dc in class c
10)    prior[c] ← Nc / N
11)    tokensc ← tokens of all documents Dc in class c
12)    for each t ∈ V
13)        Tct ← frequency of token t in class c
14)        Tt ← frequency of token t in all classes
15)    end for
16)    for each t ∈ V
17)        P[t][c] ← (Tct + 1) / (Tt + V)
18)    end for
19) end for

Classification
20) T ← total tokens in test document d
21) for each c ∈ C
22)     score[c] ← log(prior[c])
23)     for each t ∈ T
24)         score[c] ← score[c] + log(P[t][c])
25)     end for
26) end for
27) best class ← max(score)
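The training and classification modules of Section 4.1 map almost directly onto code. The following Python sketch uses the paper's smoothed estimate from equation (8) and the log-score rule of equation (10); the toy documents and variable names are illustrative only and are not the paper's actual data.

```python
# Minimal sketch of the Naive Bayes procedure of Section 4.1, using the
# smoothed estimate P(t|c) = (Tct + 1) / (Tt + V) from equation (8).
import math
from collections import Counter, defaultdict

def train(documents):
    """documents: list of (class_label, [terms]) pairs produced by the preprocessing stage."""
    class_doc_count = Counter(label for label, _ in documents)
    term_class_count = defaultdict(Counter)       # Tct: term frequency per class
    term_total_count = Counter()                  # Tt: term frequency over all classes
    for label, terms in documents:
        term_class_count[label].update(terms)
        term_total_count.update(terms)
    vocab_size = len(term_total_count)            # V
    total_docs = sum(class_doc_count.values())    # N
    prior = {c: n / total_docs for c, n in class_doc_count.items()}
    return prior, term_class_count, term_total_count, vocab_size

def classify(test_terms, prior, term_class_count, term_total_count, vocab_size):
    """Pick the class with the highest log score, as in equation (10)."""
    best_class, best_score = None, float("-inf")
    for c in prior:
        score = math.log(prior[c])
        for t in test_terms:
            tct = term_class_count[c][t]
            tt = term_total_count[t]
            score += math.log((tct + 1) / (tt + vocab_size))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Example usage with toy preprocessed documents:
model = train([("sports", ["میچ", "کھیل"]), ("finance", ["بینک", "روپیہ"])])
print(classify(["کھیل", "میچ"], *model))          # -> "sports"
```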
5. SUPPORT VECTOR MACHINES
The SVM is a supervised learning technique that is very effective in text classification. It finds a hyperplane h with maximum margin m that separates two classes, and at test time a data point is classified depending on which side of the hyperplane it lies on [10]:

h(\vec{x}) = \vec{x}^{T} \vec{w} + w_0    (11)

m = \frac{2}{\|\vec{w}\|}    (12)

where \vec{x}^{t} is the vector of terms of a document belonging to class r^{t} ∈ {1, ..., k}, and \vec{w} and w_0 are the weights associated with each document vector and the threshold respectively. The data points closest to the decision surface (at least two of them) determine the margin of the classifier and are known as support vectors; the others are non-support vectors [13]. In text classification the data is usually not linearly separable, so a penalty C is introduced for data points crossing the margin, known as misclassified points.

Figure 5. Geometric representation of a two-class, linearly separable Support Vector Machine. The hyperplane h separates positive and negative training examples with maximum margin m = 2/||w||; the support vectors lie on the margin (α > 0), and ξ are slack variables associated with each data point, equal to zero for correctly classified points.

The margin between the hyperplanes can be maximized by minimizing \|\vec{w}\| with respect to the parameters \vec{w}, w_0 and ξ:

\min_{\vec{w},\, w_0,\, \xi} \; \frac{1}{2}\|\vec{w}\|^{2} + C \sum_{t} \xi^{t}    (13)

subject to the constraint that every data point satisfies (11):

r^{t}\,(\vec{x}^{t} \cdot \vec{w} + w_0) \ge 1 - \xi^{t}, \qquad \xi^{t} \ge 0, \qquad \forall\, t = 1, \ldots, N    (14)

This can be solved by introducing a Lagrange multiplier α for every data point [10]; the Lagrangian becomes

L(\vec{w}, w_0, \alpha, \xi, \mu) = \frac{1}{2}\|\vec{w}\|^{2} + C \sum_{t} \xi^{t} - \sum_{t} \alpha^{t} \left[ r^{t}(\vec{x}^{t} \cdot \vec{w} + w_0) - 1 + \xi^{t} \right] - \sum_{t} \mu^{t} \xi^{t}    (15)

By solving (15) the parameter values are

\vec{w} = \sum_{t} \alpha^{t} r^{t} \vec{x}^{t}    (16)

w_0 = r^{t} - \vec{x}^{t} \cdot \vec{w}    (17)

During the training process \vec{w} and w_0 are calculated from (16) and (17) given the parameter C, w_0 is averaged over all data points, and the Lagrangian takes the dual form

L = \sum_{t} \alpha^{t} - \frac{1}{2} \sum_{t} \sum_{s} \alpha^{t} \alpha^{s} r^{t} r^{s} (\vec{x}^{t})^{T} \vec{x}^{s}, \qquad 0 \le \alpha^{t} \le C    (18)

Non-linearly separable data is transformed into a higher dimensional feature space where it can be separated linearly. This is done by replacing the inner products (\vec{x}^{t})^{T} \vec{x}^{s} in (18) with a Kernel function that expresses them as a dot product in the higher dimension [9]. Some Kernel functions are:

Linear Kernel: K(x, y) = x \cdot y    (19)

Polynomial Kernel of degree d: K(x, y) = (x \cdot y + 1)^{d}    (20)

Radial basis function Kernel of width σ: K(x, y) = e^{-\|x - y\|^{2} / (2\sigma^{2})}    (21)

The final decision function using a Kernel becomes

g(\vec{x}) = \sum_{s} \alpha^{s} r^{s} K(\vec{x}^{s}, \vec{x}) + w_0    (22)

The SVM is basically a binary classifier, but a multi-class problem can be resolved by splitting it into k binary problems. Every ith classifier constructs a hyperplane between class i and the k−1 remaining classes. The training phase learns k support vector machines, and at classification time the class with the maximum score is selected [16], so the multi-class version of (22) becomes

g(\vec{x}) = \arg\max_{k} \left[ \sum_{s} \alpha_{k}^{s} r_{k}^{s} K(\vec{x}^{s}, \vec{x}) + w_{0k} \right]    (23)

5.1 Algorithm
The algorithm for non-linearly separable multi-class text classification is stated below.

Preprocessing
1) lexicon based tokenization
2) text normalization
3) high frequency words elimination
4) affix based stemming
5) terms are used as features and 'normalized term frequency' as their values

Training
6) m ← maximum number of terms in a document
7) n ← total number of documents
8) N ← m * n
9) D ← {document1, document2, ..., documentn}
10) x ← {term1, term2, ..., termN}
11) r ← {class1, class2, ..., classn}
12) α ← Lagrange multipliers
13) .* ← operator which multiplies vectors element by element
14) for each i ∈ D
15)     w[i] ← α .* r .* x[i]
16) end for
17) for each t ∈ x
18)     w0 ← w0 + (r[t] − x[t] * w)
19) end for
20) w0 ← w0 / n

Classification
21) xtest ← {term1, term2, ..., termm}
22) sv ← {support vector1, support vector2, ..., support vectort}
23) for each c ∈ r
24)     for each i ∈ sv
25)         Σ ← Σ + α[c][i] * r[c] * K(xtest, sv[i])
26)     end for
27)     score[c] ← Σ + w0
28) end for
29) best class ← max(score)
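As a complement to the pseudocode in Section 5.1, the sketch below shows only the kernel decision rule of equations (19)–(23), i.e. the classification side. It assumes that the per-class support vectors, Lagrange multipliers α, labels r and thresholds w0 have already been produced by some SVM trainer; the toy one-vs-rest models are illustrative numbers, not learned values.

```python
# Minimal numpy sketch of the kernel decision rule in equations (19)-(23) and the
# classification loop of Section 5.1. The per-class parameters are assumed to come
# from an existing SVM trainer; the numbers below are illustrative only.
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Radial basis function kernel of width sigma, as in equation (21)."""
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def decision(x_test, support_vectors, alpha, r, w0, kernel=rbf_kernel):
    """Binary decision value g(x) = sum_s alpha_s * r_s * K(x_s, x) + w0, equation (22)."""
    return sum(a * label * kernel(sv, x_test)
               for a, label, sv in zip(alpha, r, support_vectors)) + w0

def classify(x_test, models):
    """One-vs-rest multi-class rule of equation (23): pick the class with the largest score."""
    scores = {c: decision(x_test, m["sv"], m["alpha"], m["r"], m["w0"]) for c, m in models.items()}
    return max(scores, key=scores.get)

# Example usage with two toy one-vs-rest models over 3-dimensional term-frequency vectors:
models = {
    "sports":  {"sv": [np.array([0.9, 0.1, 0.0]), np.array([0.1, 0.8, 0.1])],
                "alpha": [0.7, 0.4], "r": [+1, -1], "w0": 0.05},
    "finance": {"sv": [np.array([0.1, 0.8, 0.1]), np.array([0.9, 0.1, 0.0])],
                "alpha": [0.6, 0.5], "r": [+1, -1], "w0": -0.02},
}
print(classify(np.array([0.85, 0.1, 0.05]), models))   # -> "sports"
```

Training itself, i.e. solving the dual (18) under 0 ≤ α ≤ C, is typically delegated to an off-the-shelf quadratic programming or SVM solver.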
6. RESULTS
The accuracy of the system is measured separately for both classifiers using (24):

\mathrm{Accuracy} = \frac{\mathrm{number\ of\ documents\ correctly\ classified}}{\mathrm{total\ number\ of\ documents}}    (24)

The Naïve Bayes classifier's accuracy is measured by applying the preprocessing features (stop word elimination and stemming) one by one to the test dataset.

Table 3. Accuracy of the Naïve Bayes classifier at different levels of preprocessing
Class                     Baseline (%)   Stopword elimination (%)   Stemming (%)   Stopword elimination & stemming (%)
News                      90.94          91.77                      91.83          93.35
Sports                    32.15          73.45                      22.12          40.12
Finance                   21.96          19.65                      12.20          27.09
Culture                   7.92           8.93                       5.05           7.32
Consumer Information      14.95          15.76                      12.90          13.76
Personal Communication    30.65          33.78                      21.08          32.09
Overall                   71.31          76.79                      70.08          72.31

From the above results it is observed that stemming is not useful for Urdu, whereas stop word elimination improves the classifier's overall accuracy. The SVM classifier's accuracy is therefore measured using only the stop-word-eliminated, unstemmed lexicon, which achieves the maximum overall accuracy with Naïve Bayes. Only overall accuracies are recorded, instead of class-wise accuracies, for different values of C and different Kernel functions. The baseline system uses simple term frequency (TF) as the feature value with a linear function; the remaining columns use normalized term frequencies with the linear and non-linear Kernel functions.

Table 4. Accuracy of the Support Vector Machines classifier
C          Baseline (%)   Linear (%)   Polynomial (%)   RBF (%)
1          77.79          69.10        84.48            89.53
1000       77.06          81.23        81.09            88.66
20000      78.60          84.39        91.03            93.34
Maximum    78.60          84.39        91.03            93.34

From the above results it is clear that normalized term frequencies (maximum accuracy 93.34%) are better than the simple frequency of a term (maximum accuracy 78.60%). Another observation is that the Kernel functions perform very well compared to the linear function and to the Naïve Bayes classifier.

7. CONCLUSION
The experimental results show that Urdu text classification using Naïve Bayes is very efficient, but not as accurate as Support Vector Machines. The Naïve Bayes classifier is evaluated by adding preprocessing features one by one, from which it is concluded that the stemming algorithm presented in this paper decreases the overall system accuracy. The SVM classifier is evaluated by varying the value of C and the Kernel function, and it performs well consistently. It is also concluded that normalized term frequency is a better feature value than simple term frequency. The statistical characteristics show that the lexicon extracted from the corpus has a good vocabulary size and term frequency distribution. This successful attempt motivates the exploration of other areas of information retrieval in the context of the Urdu language.

8. REFERENCES
[1] Zhang, H. 2004. The Optimality of Naive Bayes. In: Proceedings of the 17th International FLAIRS Conference, Florida, USA.
[2] Rish, I. 2001. An Empirical Study of the Naive Bayes Classifier. In: Proceedings of the IJCAI Workshop on Empirical Methods in Artificial Intelligence, Seattle, USA.
[3] Ijaz, M. and Hussain, S. 2007. Corpus Based Urdu Lexicon Development. In: Proceedings of the Conference on Language Technology (CLT07), Peshawar, Pakistan.
[4] Lowd, D. and Domingos, P. 2005. Naive Bayes Models for Probability Estimation. In: Proceedings of ICML, Germany.
[5] Dai, W., Xue, G. R., Yang, Q., and Yu, Y. 2007. Transferring Naive Bayes Classifiers for Text Classification. In: Proceedings of the 22nd AAAI Conference on Artificial Intelligence, British Columbia, Canada.
[6] Joachims, T. 2001. A Statistical Learning Model of Text Classification for Support Vector Machines. In: Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), New Orleans, USA.
[7] Zhuang, D., Zhang, B., Yang, Q., Yan, J., Chen, Z., and Chen, Y. 2005. Efficient Text Classification by Weighted Proximal SVM. In: Proceedings of the International Conference on Data Mining (ICDM), Houston, Texas, USA.
[8] Joachims, T. 2005. A Support Vector Method for Multivariate Performance Measures. In: Proceedings of the 22nd International Conference on Machine Learning (ICML), Bonn, Germany.
[9] Joachims, T. 1998. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In: Proceedings of ECML-98, 10th European Conference on Machine Learning, Chemnitz, Germany.
[10] Lee, Y., Lin, Y., and Wahba, G. 2001. Multicategory Support Vector Machines. In: Proceedings of Computing Science and Statistics, Vol. 33, the Interface Foundation, California, USA.
[11] Al-Harbi, S., Almuhareb, A., Al-Thubaity, A., Khorsheed, M. S., and Al-Rajeh, A. 2008. Automatic Arabic Text Classification. In: Proceedings of JADT 2008.
[12] Joachims, T. 1997. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. In: Proceedings of the 14th International Conference on Machine Learning, TN, USA.
[13] Manning, C. D., Raghavan, P., and Schütze, H. 2008. Introduction to Information Retrieval. Cambridge University Press.
[14] Jurafsky, D. and Martin, J. H. 2000. Speech and Language Processing. Prentice Hall.