Urdu Text Classification

Abbas Raza Ali
National University of Computers and Emerging Sciences
Block-B, Faisal Town, Lahore, Pakistan
[email protected]

Maliha Ijaz
National University of Computers and Emerging Sciences
Block-B, Faisal Town, Lahore, Pakistan
[email protected]
ABSTRACT
This paper compares statistical techniques for text classification, Naïve Bayes and Support Vector Machines, in the context of the Urdu language. A large corpus is used for training and testing the classifiers. Since the classifiers cannot interpret the raw dataset directly, language-specific preprocessing techniques are applied to it to generate a standardized and reduced-feature lexicon. Urdu is a morphologically rich language, which makes these tasks complex. Statistical characteristics of the corpus and lexicon are measured and show satisfactory results for the text preprocessing module. The empirical results show that Support Vector Machines outperform the Naïve Bayes classifier in terms of classification accuracy.
The overall system is divided into three main components:
1) Acquisition, compilation and labeling of the text documents of the corpus
2) Preprocessing of the raw corpus to generate a standardized and reduced-feature lexicon
3) Training of statistical classifiers on the preprocessed data to classify test data
The detailed architecture of the system, along with its three components, is shown in Figure 1.
Keywords
Corpus, information retrieval, lexicon, Naïve Bayes,
normalization, feature selection, text classification, text mining,
Urdu.
1. INTRODUCTION
Text classification is the process of automatically classifying unknown text by suggesting the most probable class to which it belongs. As electronic information increases day by day, text classification becomes a key technique for organizing large amounts of data for analysis and processing [9]. It is involved in many applications such as text filtering, document organization, classification of news stories, and searching for interesting information on the web. These are language-specific systems, mostly designed for English, and no such work has been done for the Urdu language. Developing classification systems for Urdu documents is therefore a challenging task due to the morphological richness of the language and the scarcity of resources such as automatic tools for tokenization, feature selection, stemming, etc.
Two different classifiers based on supervised learning techniques are developed and their accuracies are compared on the given dataset. From the experiments, the Naïve Bayes classifier is found to be more efficient than the Support Vector Machines; however, the Support Vector Machines achieve higher classification accuracy.
[Figure 1. Architecture of the Urdu text classification system: corpus acquisition, document preprocessing (lexicon based tokenization, normalization, diacritics elimination, stop word elimination, affix based stemming, normalized term frequency calculation), and the training and classification modules of the Naïve Bayes and SVM classifiers.]
2. CORPORA
A large dataset is usually needed in order to obtain good classification accuracy from a statistical system. For that purpose, a large text corpus of 19.3 million words, collected from different online Urdu news websites, is used [3]. The corpus was manually classified into six domains, namely news, sports, finance, culture, consumer information, and personal communication.
The breakup of documents for each class is given in Table 1 in the next section. The corpus is divided into two parts: the training set contains 90% of the documents from each class, and the remaining 10% of the documents are used as the test set.
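A per-class 90/10 split of the labeled documents can be sketched as below. This is a minimal illustration, assuming the corpus is available as a Python mapping from class name to a list of document strings; the function name split_corpus and the toy corpus are assumptions, not part of the released resources.

```python
import random

def split_corpus(corpus, train_ratio=0.9, seed=7):
    """Split each class of a {class_name: [documents]} mapping into train and test parts."""
    train, test = {}, {}
    rng = random.Random(seed)
    for label, documents in corpus.items():
        docs = list(documents)
        rng.shuffle(docs)                      # shuffle before splitting
        cut = int(len(docs) * train_ratio)     # 90% of each class for training
        train[label], test[label] = docs[:cut], docs[cut:]
    return train, test

# Toy usage with two classes
corpus = {"news": ["doc1", "doc2", "doc3"], "sports": ["doc4", "doc5"]}
train_set, test_set = split_corpus(corpus)
```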
3. DOCUMENT PREPROCESSING
Statistical classifiers mostly require the input dataset to be preprocessed into the format they specify. This preprocessing is language-dependent, so general text mining techniques need to be modified in order to apply them to the Urdu corpus.
3.1 Tokenization
Words are derived from the corpus on the basis of white spaces and punctuation marks. The corpus also contains multiple words written together without any white space or punctuation mark, as well as some non-Urdu words. To resolve this problem, every word is looked up in a 'tokenization lexicon' and becomes a token if found; otherwise it is eliminated. The 'tokenization lexicon' is manually prepared and gathered from different sources and contains 220,760 unique entries. The class-wise statistics of the input dataset at the different stages of its development are given in Table 1.

Table 1. Analysis of documents at different levels of preprocessing stages

Class                     Documents    Tokens        Types      Terms
News                      17,501       8,957,259     78,649     54,817
Sports                    3,388        1,666,304     21,473     16,622
Finance                   1,766        1,162,019     16,144     11,951
Culture                   1,088        3,845,117     57,486     37,493
Consumer Information      1,046        1,980,723     26,433     19,781
Personal Communication    1,278        1,685,424     34,614     25,588
Total                     26,067       19,296,846    234,799    166,252
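The lookup-based tokenization of Section 3.1 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the lexicon is assumed to be a plain set of valid Urdu word forms, and candidate strings are produced by a simple split on whitespace and punctuation.

```python
import re

def tokenize(text, lexicon):
    """Split raw text on whitespace/punctuation and keep only entries found in the lexicon."""
    candidates = re.split(r"[\s.,;:!?\u061F\u060C\u06D4]+", text)  # includes Urdu ؟ ، ۔
    return [w for w in candidates if w and w in lexicon]

# Toy usage: only words present in the lexicon survive as tokens
lexicon = {"خبر", "کھیل"}
tokens = tokenize("خبر ۔ کھیل xyz", lexicon)   # -> ['خبر', 'کھیل']
```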
3.2 Diacritics Elimination
Diacritics are used in Urdu text to alter the pronunciation of a word, but they are optional characters in the language. In the current corpus, less than one percent of the words are diacritized. In order to standardize the corpus, the diacritics are completely removed; for example, a diacritized form meaning 'house' and another meaning 'surrounded' are both mapped onto the same un-diacritized string. The list of diacritical marks in Urdu is given in Appendix A.
3.3 Normalization
Some Urdu alphabets have more than one Unicode code point because they are shaped similarly to Arabic alphabets. Such characters are replaced by their alternate Urdu alphabets to avoid creating multiple copies of a word. Some examples of un-normalized alphabets and their mapping onto Urdu characters are shown in Appendix B.
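Diacritics elimination (Section 3.2) and normalization (Section 3.3) both amount to character-level deletion and mapping, which can be sketched as below. The particular code points are illustrative assumptions (Arabic Yeh/Kaf mapped to their Urdu forms, plus a few common diacritics); the paper's full mappings are in Appendices A and B.

```python
# Map Arabic-range code points to their Urdu equivalents (illustrative subset).
NORMALIZATION_MAP = {
    "\u064A": "\u06CC",  # ARABIC YEH  -> FARSI/URDU YEH
    "\u0643": "\u06A9",  # ARABIC KAF  -> KEHEH
    "\u0629": "\u06C3",  # TEH MARBUTA -> TEH MARBUTA GOAL
}
# Optional diacritical marks to delete (illustrative subset: fatha, kasra, damma, sukun, shadda).
DIACRITICS = {"\u064E", "\u0650", "\u064F", "\u0652", "\u0651"}

def normalize(token):
    """Replace alternate Unicode forms and strip diacritics from a token."""
    mapped = "".join(NORMALIZATION_MAP.get(ch, ch) for ch in token)
    return "".join(ch for ch in mapped if ch not in DIACRITICS)
```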
3.4 Stop Words Elimination
Stop words are the functional words of a language and are meaningless in the context of text classification. They are eliminated from the lexicon in order to reduce its size, using a list of the most frequent words known as a stop word list.
In order to obtain the stop word list, the frequency of every term in the corpus is calculated and compiled into a 'word frequency lexicon'. A threshold frequency is chosen by manually analyzing that lexicon. It was observed that the top 116 high-frequency words belong to the functional class, so a stop word list comprising these 116 words is gathered and eventually eliminated from the final lexicon. Some of the most frequent words from the stop word list are given in Table 2.

Table 2. Some most frequent words extracted from the corpus

Word    Frequency        Word    Frequency
اور     743,949          …       368,155
…       582,882          …       306,103
…       575,545          …       281,922
…       466,908          …       254,385
…       413,788          اس      244,017
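Building the frequency lexicon and taking the top-ranked entries as stop words can be sketched as below; the cutoff of 116 comes from the manual analysis described above, while the function and variable names are illustrative.

```python
from collections import Counter

def build_stop_word_list(tokenized_documents, top_n=116):
    """Count term frequencies over all documents and return the top_n most frequent words."""
    freq = Counter()
    for tokens in tokenized_documents:
        freq.update(tokens)
    return {word for word, _ in freq.most_common(top_n)}

def remove_stop_words(tokens, stop_words):
    """Drop stop words from a tokenized document."""
    return [t for t in tokens if t not in stop_words]
```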
3.5 Stemming
Stemming is the process of reducing a word to its root form; it often consists of removing derivational affixes [13]. 'Affix elimination' based stemming is applied in order to merge multiple related word forms. An affix list containing 417 prefixes and 73 suffixes of the Urdu language is used for this purpose, which reduced the number of terms in the lexicon by 24%. The following algorithm is applied to every token to stem it:
1) Pick the first and the last character of the token separately and search for them in the affix list. If they are not found, concatenate the second (and second-last) letter and search again.
2) Continue the process until an affix is found in the list, then search the remaining part of the word in the 'tokenization lexicon'; if it is found, retain it and eliminate the remaining (prefix or suffix) part of the token.
3) If no prefix or suffix string is found at all, retain the original word as it is.
For example, for the token صحتمند, ص is added to the prefix string and د is added to the suffix string, and these are looked up in the prefix and suffix lists respectively. When they are not found, ح is appended so that the prefix string becomes صح, and ن is appended so that the suffix string becomes ند, and the lists are searched again. They are still not found, so ت is appended to the prefix string, which becomes صحت, and م is appended to the suffix string, which becomes مند, and both are looked up in their respective lists once more. Finally, مند is found in the suffix list, so the rest of the token, صحت, is looked up in the tokenization lexicon; it is found there, and the token صحتمند is therefore reduced to صحت after stemming.
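A sketch of this affix-stripping procedure is given below, assuming the prefix list, suffix list, and tokenization lexicon are available as Python sets; it follows the three steps above but is not the authors' code.

```python
def stem(token, prefixes, suffixes, lexicon):
    """Strip a matching prefix or suffix whose remainder is a known lexicon entry."""
    # Grow the candidate prefix and suffix strings one character at a time (step 1).
    for i in range(1, len(token)):
        prefix, suffix = token[:i], token[-i:]
        if prefix in prefixes and token[i:] in lexicon:   # step 2, prefix side
            return token[i:]
        if suffix in suffixes and token[:-i] in lexicon:  # step 2, suffix side
            return token[:-i]
    return token                                          # step 3: no affix matched

# Toy usage mirroring the worked example (the lists here are assumptions, not the paper's)
prefixes, suffixes, lexicon = set(), {"مند"}, {"صحت"}
print(stem("صحتمند", prefixes, suffixes, lexicon))        # -> صحت
```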
3.6 Statistical Properties
Some statistical properties of the cleaned lexicon extracted from the raw corpus are analyzed using Zipf's law and Heaps' law, and they show satisfactory results.
1) According to Zipf's law, the i-th most frequent term occurs with a frequency inversely proportional to i, for some constant c:

    frequency_i = c / i    (1)

It models the distribution of the terms in a collection, which implies that documents belonging to the same class will have similar frequency distributions [13]. Figure 2 shows that the frequencies of the most common terms are inversely proportional to their rank in the current corpus.
[Figure 2. Frequency distribution of terms over the entire collection]
2) According to Heaps' law, the vocabulary size of a corpus is estimated using (2), which predicts the number of distinct words in the collection [13]:

    vocabulary size = K × (corpus size)^β    (2)

where K and β are constants; K typically varies between 30 and 90 and β is about 0.45. Figure 3 shows how the vocabulary size increases as the size of the current corpus grows.
[Figure 3. Estimation of vocabulary size]
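Both laws can be checked directly against the term-frequency lexicon; the sketch below assumes the term counts are already available as a dictionary, estimates the Zipf constant c at each rank, and evaluates the Heaps prediction for a given corpus size (the value K=60 is just a mid-range assumption).

```python
def zipf_constants(term_frequencies):
    """For rank i, estimate c = frequency_i * i; values stay roughly constant if Zipf's law holds."""
    ranked = sorted(term_frequencies.values(), reverse=True)
    return [rank * freq for rank, freq in enumerate(ranked, start=1)]

def heaps_vocabulary_size(corpus_size, K=60, beta=0.45):
    """Predict the number of distinct words via Heaps' law, V = K * N**beta."""
    return K * corpus_size ** beta

# Example: predicted vocabulary size for a 19.3-million-token corpus
print(round(heaps_vocabulary_size(19_296_846)))
```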
4. NAÏVE BAYES
Naïve Bayes is a supervised learning technique that is used efficiently in text classification [1]. It is based on Bayes' theorem with an independence assumption [5]. Using Bayes' rule, the probability of a document being in a class is:

    P(Class | Document) = P(Document | Class) × P(Class) / P(Document)    (3)

P(Document | Class) is the conditional probability of the document given the class, while P(Class) and P(Document) are the prior and evidence probabilities of the class and the document respectively. The independence assumption is used to calculate the conditional probability, where the probability of each document feature (Term_i) is independent of the others [2]. The class that maximizes (3) is selected.
[Figure 4. Architecture of text classification using Naïve Bayes: a class generates documents Document_1 … Document_m, and each document is represented by its terms Term_1 … Term_n.]

    P(Document) = P(Term_1) × … × P(Term_n) = ∏_{i=1}^{n} P(Term_i)    (4)
P(Document) is constant over all classes, so by ignoring it and applying the independence assumption (4) to (3), the quantity to be maximized becomes:

    arg max_i [ ∏_{j=1}^{n} P(Term_j | Class_i) × P(Class_i) ]    (5)

where P(Document | Class_i) has been factored into the product of term probabilities. The parameters are estimated from the training data as:

    P(Term_j | Class_i) = count(Term_j, Class_i) / count(Term_j)    (6)

    P(Class_i) = count(documents in class i) / count(documents)    (7)

In (6) the count(Term_j, Class_i) can be zero, because the training data is not large enough to represent every term in every class, and this makes the overall estimate equal to zero. To eliminate these zeros, the conditional probability is re-estimated by assigning a small non-zero value to unseen counts, a step known as smoothing [13]. A very simple smoothing technique is to add one to all the counts and to add the vocabulary size V to the denominator so that the probabilities remain normalized. This technique is known as Laplace smoothing and is usually suitable for unigram-based language models such as Naïve Bayes [14]:

    P(Term_j | Class_i) = (count(Term_j, Class_i) + 1) / (count(Term_j) + V)    (8)
After estimating the conditional (8) and prior (7) probability parameters during the training phase, a test document is classified as:

    best class = arg max_{c ∈ C} [ ∏_{j=1}^{n} P(Term_j | c) × P(c) ]    (9)

Many conditional probabilities are multiplied in (9), which can result in floating point underflow [13]. Hence, by applying a logarithm, (9) becomes:

    best class = arg max_{c ∈ C} [ ∑_{j=1}^{n} log P(Term_j | c) + log P(c) ]    (10)
4.1 Algorithm
The algorithm is divided into three independent modules, given below.
Preprocessing
1) L ← lexicon based tokenization
2) NL ← text normalization of L
3) T ← high frequency words elimination of NL
4) term ← affix based stemming of T
Training
5) C ← {class1, class2, …, classk}
6) D ← {document1, document2, …, documentm}
7) V ← {term1, term2, …, termn}
8) for each c ε C
9)   Nc ← total documents Dc in class c
10)  prior[c] ← Nc / N
11)  tokensc ← tokens of all documents Dc in class c
12)  for each t ε V
13)    Tct ← frequency of token t in class c
14)    Tt ← frequency of token t in all classes
15)  end for
16)  for each t ε V
17)    P[t][c] ← (Tct + 1) / (Tt + V)
18)  end for
19) end for
Classification
20) T ← total tokens in test document d
21) for each c ε C
22)   score[c] ← log(prior[c])
23)   for each t ε T
24)     score[c] ← score[c] + log(P[t][c])
25)   end for
26) end for
27) best class ← max(score)
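A compact Python rendering of this training and classification procedure is sketched below. It follows the pseudocode above (Laplace smoothing with the all-class token count Tt in the denominator, log-space scoring), but the data structures and function names are illustrative rather than the authors' implementation.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs_by_class):
    """docs_by_class: {class_name: [token lists]} -> (priors, P[term][class], vocabulary)."""
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    class_counts = {c: Counter(t for doc in docs for t in doc)
                    for c, docs in docs_by_class.items()}
    vocab = set().union(*(cc.keys() for cc in class_counts.values()))
    total_counts = Counter()
    for cc in class_counts.values():
        total_counts.update(cc)                  # Tt: frequency of t over all classes
    priors = {c: len(docs) / total_docs for c, docs in docs_by_class.items()}
    cond = defaultdict(dict)
    for c, cc in class_counts.items():
        for t in vocab:
            cond[t][c] = (cc[t] + 1) / (total_counts[t] + len(vocab))   # eq (8)
    return priors, cond, vocab

def classify(tokens, priors, cond, vocab):
    """Return the class with the highest log-space score, eq (10)."""
    scores = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for t in tokens:
            if t in vocab:                       # unseen terms are skipped
                score += math.log(cond[t][c])
        scores[c] = score
    return max(scores, key=scores.get)
```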
5. SUPPORT VECTOR MACHINES
SVM is a supervised learning technique that is very effective in text classification. It finds a hyperplane h with maximum margin m that separates two classes; at test time a data point is classified depending on the side of the hyperplane on which it lies [10]:

    h(x) = x^T · w + w0    (11)

    m = 2 / ||w||    (12)

where x^t is the vector of terms of a document belonging to class r^t (for the two-class case r^t ∈ {+1, -1}), and w and w0 are the weight vector associated with the document vectors and the threshold respectively.
At least two data points closest to the decision surface determine the margin of the classifier; these are known as support vectors, and the others are non-support vectors [13]. In text classification the data is usually not linearly separable, so a penalty C is introduced for data points crossing the margin, known as misclassified points.
[Figure 5. Geometric representation of a two-class linearly separable Support Vector Machine. The hyperplane h separates positive and negative training examples with maximum margin m = 2/||w||; ξ are slack variables associated with each data point and are zero for correctly classified points. Support vectors have α > 0, while non-support vectors have α = 0.]
The margin can be maximized by minimizing ||w|| with respect to the parameters w, w0 and ξ:

    min over w, w0, ξ of  (1/2) ||w||^2 + C ∑_t ξ^t    (13)

with the constraint that all data points must satisfy

    r^t (x^t · w + w0) ≥ 1 - ξ^t,   ξ^t ≥ 0,   for all t = 1, …, N    (14)

This can be solved by introducing a Lagrange multiplier α^t for every data point [10]; the Lagrangian becomes:

    L(w, w0, α, ξ, µ) = (1/2) ||w||^2 + C ∑_t ξ^t - ∑_t α^t [ r^t (x^t · w + w0) - 1 + ξ^t ] - ∑_t µ^t ξ^t    (15)

By solving (15), the parameter values are:

    w = ∑_t α^t r^t x^t    (16)

    w0 = r^t - x^t · w    (17)

During the training process, w and w0 are calculated from (16) and (17) for a given penalty parameter C, with w0 averaged over all data points. Substituting these back, the dual of the Lagrangian becomes:

    L_d = ∑_t α^t - (1/2) ∑_t ∑_s α^t α^s r^t r^s (x^t)^T x^s,   subject to 0 ≤ α^t ≤ C    (18)

Non-linearly separable data is transformed to a higher-dimensional feature space where it can be separated linearly. This is done by replacing the inner products (x^t)^T x^s in (18) with a Kernel function that expresses them as a dot product in the higher dimension [9]. Some Kernel functions are:

    Linear Kernel: K(x, y) = x · y    (19)

    Polynomial Kernel of degree d: K(x, y) = (x · y + 1)^d    (20)

    Radial basis function Kernel of width σ: K(x, y) = exp( -||x - y||^2 / (2σ^2) )    (21)
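The three Kernels in (19)-(21) are one-liners; a sketch using plain Python lists follows (a numerical library such as numpy would normally be used, but explicit loops keep the formulas visible).

```python
import math

def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))                 # eq (19)

def polynomial_kernel(x, y, d=2):
    return (linear_kernel(x, y) + 1) ** d                   # eq (20)

def rbf_kernel(x, y, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))            # eq (21)
```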
The final decision function using a Kernel becomes:

    g(x) = ∑_t α^t r^t K(x^t, x) + w0    (22)

SVM is basically a binary classifier, but a multi-class case can be resolved by splitting it into k binary problems. Every i-th classifier constructs a hyperplane between class i and the k-1 remaining classes. The training phase learns k support vector machines, and at classification time the class with the maximum score is selected [16]. So the multi-class version of (22) becomes:

    g(x) = arg max_k [ ∑_t α^t_k r^t_k K_k(x^t, x) + w0_k ]    (23)
5.1 Algorithm
The algorithm for non-linearly separable multi-class text classification is stated below.
Preprocessing
1) lexicon based tokenization
2) text normalization
3) high frequency words elimination
4) affix based stemming
5) terms are used as features and 'normalized term frequency' as their values
Training
6) m ← maximum number of terms in a document
7) n ← total number of documents
8) N ← m * n
9) D ← {document1, document2, …, documentn}
10) x ← {term1, term2, …, termN}
11) r ← {class1, class2, …, classn}
12) α ← Lagrange multipliers
13) .* denotes element-by-element multiplication of vectors
14) for each i ε D
15)   w[i] ← α .* r .* x[i]
16) end for
17) for each t ε x
18)   w0 ← w0 + (r[t] - x[t] * w)
19) end for
20) w0 ← w0 / n
Classification
21) xtest ← {term1, term2, …, termm}
22) sv ← {support vector1, support vector2, …, support vectort}
23) for each c ε r
24)   for each i ε sv
25)     Σ ← Σ + α[c][i] * r[c] * K(xtest, sv[i])   (Σ is reset to 0 for each class c)
26)   end for
27)   score[c] ← Σ + w0
28) end for
29) best class ← max(score)
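In Python, the one-versus-rest scoring loop of the classification phase might look as follows. This is a sketch under the assumption that one binary SVM per class has already been trained (its support vectors, multipliers α, labels r, and bias w0); the kernel argument can be any of the functions sketched earlier.

```python
def ovr_classify(x_test, models, kernel):
    """models: {class_name: (support_vectors, alphas, labels, w0)} -> best-scoring class."""
    scores = {}
    for class_name, (support_vectors, alphas, labels, w0) in models.items():
        s = 0.0
        for sv, alpha, r in zip(support_vectors, alphas, labels):
            s += alpha * r * kernel(x_test, sv)   # eq (22) for this one-vs-rest machine
        scores[class_name] = s + w0
    return max(scores, key=scores.get)            # eq (23): class with the maximum score
```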
6. RESULTS
The system accuracy is measured separately for both classifiers using (24):

    Accuracy = number of documents correctly classified / total number of documents    (24)

The Naïve Bayes classifier's accuracy is measured by eliminating features (stop words and stemming) one by one in the test dataset; the class-wise results are given in Table 3.

Table 3. Accuracy of the Naïve Bayes classifier at different levels of preprocessing

Class                     Stop word elimination     Stop word          Stemming    Baseline
                          & stemming (%)            elimination (%)    (%)         (%)
News                      90.94                     91.77              91.83       93.35
Sports                    32.15                     73.45              22.12       40.12
Finance                   21.96                     19.65              12.20       27.09
Culture                   7.92                      8.93               5.05        7.32
Consumer Information      14.95                     15.76              12.90       13.76
Personal Communication    30.65                     33.78              21.08       32.09
Overall                   71.31                     76.79              70.08       72.31

From these results it is observed that stemming is not useful for Urdu, while stop word elimination improves the overall classifier accuracy. The SVM classifier's accuracy is therefore measured only on the stop-word-eliminated, un-stemmed lexicon, which achieved the maximum overall accuracy with Naïve Bayes. Only overall accuracies are recorded, instead of class-wise results, for different values of C and different Kernel functions. The baseline system consists of simple term frequency (TF) feature values with a linear function; after that, normalized term frequencies are used with the Kernel functions.

Table 4. Accuracy of the Support Vector Machines classifier

              Baseline      Kernel Functions
C             (%)           Linear (%)    Polynomial (%)    RBF (%)
1             77.79         69.10         84.48             81.09
1000          77.06         81.23         89.53             88.66
20000         78.60         84.39         91.03             93.34
Maximum       78.60         84.39         91.03             93.34

From these results it is clear that normalized term frequencies (maximum accuracy 93.34%) are better than the simple frequency of a term (maximum accuracy 78.60%). Another observation is that the non-linear Kernel functions perform very well compared to the linear function and to the Naïve Bayes classifier.
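The two quantities behind these comparisons, the normalized term frequency used as the SVM feature value and the accuracy in (24), can be sketched as below. The normalization shown (raw count divided by document length) is one common choice and is an assumption, since the exact formula is not spelled out above.

```python
from collections import Counter

def normalized_term_frequencies(tokens):
    """Raw term counts divided by document length (one plausible normalization)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

def accuracy(predicted_labels, true_labels):
    """Eq (24): fraction of documents whose predicted class matches the true class."""
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)
```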
7. CONCLUSION
The experimental results show that Urdu text classification using Naïve Bayes is very efficient but not as accurate as Support Vector Machines. The Naïve Bayes classifier is evaluated by eliminating features one by one, from which it is concluded that the stemming algorithm presented in this paper decreases the overall system accuracy. The SVM classifier is evaluated by varying the value of C and the Kernel function, and it performs well consistently. It is also concluded that normalized term frequency is a better feature value than simple term frequency. The statistical characteristics show that the lexicon extracted from the corpus has a good vocabulary size and term frequency distribution. This successful attempt motivates the exploration of other areas of information retrieval in the context of the Urdu language.
8. REFERENCES
[1] Zhang, H. 2004. The Optimality of Naive Bayes. In:
Proceedings of 17th International FLAIRS Conference,
Florida, USA.
[2] Rish, I. 2001. An empirical study of the naive Bayes classifier. In: Proceedings of the IJCAI Workshop on Empirical Methods in Artificial Intelligence, Seattle, USA.
[3] Ijaz, M. and Hussain, S. 2007. Corpus Based Urdu Lexicon Development. In: Proceedings of the Conference on Language Technology (CLT07), Peshawar, Pakistan.
[4] Lowd, D., and Domingos, P. 2005. Naive Bayes Models for Probability Estimation. In: Proceedings of ICML, Germany.
[5] Dai, W., Xue, G. R., Yang, Q., and Yu, Y. 2007. Transferring Naive Bayes Classifiers for Text Classification. In: Proceedings of the 22nd AAAI Conference on Artificial Intelligence, British Columbia, Canada.
[6] Joachims, T. 2001. A Statistical Learning Model of Text
Classification for Support Vector Machines. In: Proceedings
of the Conference on Research and Development in
Information Retrieval (SIGIR), New Orleans, USA.
[7] Zhuang, D., Zhang, B., Yang, Q., Yan, J., Chen, Z., and
Chen Y. 2005. Efficient Text Classification by Weighted
Proximal SVM. In: Proceedings of International Conference
on Data Mining (ICDM), Houston, Texas, USA.
[8] Joachims, T. 2005. A Support Vector Method for
Multivariate Performance Measures. In: Proceedings of the
22nd International Conference on Machine Learning (ICML),
Bonn, Germany.
[9] Joachims, T. 1998. Text Categorization with Support Vector
Machines: Learning with many Relevant Features. In:
Proceedings of ECML-98, 10th European Conference on
Machine Learning, Dorint-Parkhotel, Chemnitz, Germany.
[10] Lee, Y., Lin, Y., and Wahba, G. 2001. Multicategory
Support Vector Machines. In: Proceedings of Computing
Science and Statistics Vol. 33, the Interface Foundation,
California, USA.
[11] Al-Harbi, S., Almuhareb, A., Al-Thubaity, A., Khorsheed,
M. S., and Al-Rajeh, A. 2008. Automatic Arabic Text
Classification. In: Proceedings of Actes JADT'2008 en ligne.
[12] Joachims, T., Hamza, T., and Noaman, H. M. 1997. A
Probabilistic Analysis of the Rocchio Algorithm with TFIDF
for Text Categorization. In: Proceedings of the 14th
International Conference on Machine Learning, TN, USA.
[13] Manning, C. D., Raghavan, P., and Schütze, H. 2008. Introduction to Information Retrieval, Cambridge University Press.
[14] Jurafsky, D., and Martin, J. H. 2000. Speech and Language Processing, Prentice Hall.