
KNN Arabic Text Categorization Using IG Feature Selection
Dr. Ghassan Kanaan
Amman Al-Ahliyya University, Jordan
[email protected]
Dr. Riyad Al-Shalabi
Amman Al-Ahliyya University, Jordan
[email protected]
AbdAllah Al-Akhras
Yarmouk University, Jordan
[email protected]
ABSTRACT
This project presents an implementation of an automatic KNN Arabic text categorizer. Six hundred Arabic text documents belonging to 6 categories were tested using the classifier. The main objective of this project is to build an automatic KNN Arabic text categorizer and to test the effectiveness of the information gain (IG) method used for feature selection. The study concluded that the effectiveness of the improved classifier is very good, reaching a macro-recall of 0.793 and a macro-precision of 0.627, and we also conclude that the effectiveness of the KNN classifier increases as the training size increases. For specific categories we conclude that agriculture has the best average recall, which reached 0.922, while the best precision, 0.899, was obtained on the economy category.
Key Words: Text Categorization, kNN, Similarity Measures, Vector Model, Term Weighting, Keyword Extraction
1 Introduction
Text categorization (TC) or text
classification is the task of assigning a
number of appropriate categories to a text
document. This categorization process has
many applications such as document
routing, document management, or
document dissemination. Traditionally
each incoming document is analyzed and
categorized manually by domain experts
based on the content of the document.
Extensive human resources have to be
spent on carrying out such a task. To
facilitate the process of text categorization,
automatic categorization schemes are
required. The goal of such schemes is to build categorization models that can be used to classify text documents automatically [22].
In this research we have built an automatic Arabic text categorizer based on information gain (IG) feature selection, which many studies suggest is among the best feature selection methods overall [5, 6]. Normalized TF*IDF was used as the term weighting scheme, and the Jaccard coefficient was used as the similarity measure.
Traditionally, methods for selecting
subsets of features that provide best
performance results are divided into
wrappers and filters [1]. Wrappers utilize
the learning machine as a fitness
(evaluation) function and search for the
best features in the space of all feature
subsets. This formulation of the problem
allows the use of standard optimization techniques, with the additional complication that the fitness function has a probabilistic nature. Wrappers highly
depend on the inductive principle of the
learning model and may suffer from
excessive computational complexity since
the problem itself is NP-hard. In contrast to
wrappers, filters are typically based on
selecting the best features in one pass,
although more complex approaches have
also been studied [2]. In domains such as
text categorization or gene selection filters
are still dominant [3]. Evaluating one
feature at a time, filters estimate its
usefulness for the prediction process
according to various metrics [4-7].
Besides wrappers and filters, some authors
distinguish embedded methods as a
separate category of feature selection
algorithms [3, 8]. Embedded methods are
incorporated into the learning procedure,
and hence are also dependent on the
model. In fact, almost any learner can be considered some form of an embedded feature selection algorithm, where the particular estimation of features' usefulness may result in their weighting [9], elimination [10], or construction of new features [11].
Figure 1.1: Overview of the categorization process (input documents → text preprocessing → document indexing → keyword selection → categorization algorithm → evaluation)
As shown in Figure 1.1, the categorization process consists of five main steps: data preprocessing, document indexing, keyword (feature) selection, the categorization algorithm, and finally evaluation of the categorization task. These steps are described in the next sections.
1.1 Data preprocessing
Preprocessing is the first step after the text documents are input into the system, and it is very important for natural language text. In this step the stop words are removed and stemming is applied to the remaining text of each document; this helps reduce the feature space of the problem.
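As an illustration, the following is a minimal Python sketch of this preprocessing step; the stop-word set and the prefix/suffix lists are small illustrative assumptions, since the paper does not specify the exact stop-word list or light stemmer used.

```python
# Minimal sketch of stop-word removal and light stemming (hypothetical
# stop-word and affix lists; not the exact resources used in the paper).

ARABIC_STOP_WORDS = {"في", "من", "على", "إلى", "عن", "أن", "هذا", "التي"}  # illustrative subset
PREFIXES = ("ال", "وال", "بال", "كال", "فال", "لل")            # common Arabic prefixes
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي")   # common Arabic suffixes

def light_stem(token: str) -> str:
    """Strip one common prefix and one common suffix, keeping a stem of at least 2 characters."""
    for p in PREFIXES:
        if token.startswith(p) and len(token) - len(p) >= 2:
            token = token[len(p):]
            break
    for s in SUFFIXES:
        if token.endswith(s) and len(token) - len(s) >= 2:
            token = token[:-len(s)]
            break
    return token

def preprocess(text: str) -> list[str]:
    """Tokenize on whitespace, remove stop words, and light-stem the remaining tokens."""
    return [light_stem(t) for t in text.split() if t not in ARABIC_STOP_WORDS]
```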
1.2 Indexing
In this step the system builds the index data structure, which is usually an inverted file, and calculates the weight of each term in each document. Many weighting schemes are used, such as the Boolean model and the term frequency (TF) scheme; the best scheme is normalized TF*IDF.
1.3 Feature selection
Feature selection is a very important step in the categorization process, because it selects the keywords that best represent the text documents. Selecting too few keywords reduces the accuracy of the system, while selecting too many keywords makes the classifier very slow. Many feature selection methods have been proposed and used in previous research, such as document frequency (DF), mutual information (MI), term strength (TS), and information gain (IG), which is the method used in this research.
1.4 Categorization algorithm
In this step we apply a categorization algorithm to classify the text documents based on the index data structure; the technique used in this research is the KNN algorithm.
1.5 Evaluation
The evaluation of the classifier or categorizer is done using standard effectiveness measures such as recall, precision, and the F1 measure (several evaluation measures are discussed in Section 4).
2 Literature Review
A significant body of research has been
produced in the feature selection area.
Excellent surveys, systematizations, and
journal special issues on feature selection
algorithms have been presented in the
recent past [1, 3, 8, 13]. Searching for
the best features within the wrapper
framework, Kohavi and John [1] define an
optimal subset of features with respect to a
particular classification model.
The best feature subset given the model is
the one that provides the highest accuracy.
Numerous wrappers have been proposed to
date. These techniques include heuristic
approaches such as forward selection,
backward elimination [14], hill-climbing,
best-first or beam search [15], randomized
algorithms such as genetic algorithms [16]
or simulated annealing [17], as well as
their combinations [18]. In general,
wrappers explore the power set of all
features starting with no features, all
features, or a random selection thereof.
The optimality criterion was also tackled within the filtering framework. Koller and Sahami select the best feature subset based strictly on the joint probability distributions [19]; a feature subset Z ⊆ X is optimal if p(Y | X) = p(Y | Z). The difference between the probability distributions was measured by the relative entropy, or Kullback-Leibler distance. This problem formulation naturally leads to the backward elimination search strategy.
Hence, relevance and optimality do not imply each other. Naturally, in cases of high-dimensional datasets containing thousands of features, filters are preferred to wrappers. The domains most commonly considered within the filtering framework are text categorization [3] and gene expression [21, 22]. A significant difference between the two is that text categorization systems typically contain both a large number of features and a large number of examples, while gene expression data usually contain a limited number of examples, pushing the problem toward being statistically underdetermined. The most commonly used filters are based on information-theoretic or statistical principles. For example, information gain or χ² goodness-of-fit tests have become baseline methods. However, both require discretized features and are difficult to "normalize" when features are multi-valued. Several other approaches frequently used are Relief [23, 24], the gini-index [11], relevance [25], average absolute weight of evidence [26], bi-normal separation [6], etc. Various benchmarking studies across several domains are provided in [5-7]. Some
examples of embedded methods are
decision tree learners such as ID3 [27] and
CART [11] or the support vector machine
approaches of Guyon et al. [10] and
Weston et al. [28]. For example, in the
recursive feature elimination approach of
Guyon et al. [10], an initial model is
trained using all the features. Then,
features are iteratively removed in a greedy
fashion until the largest margin of
separation is reached. Good surveys of
embedded techniques are given by Blum
and Langley [8] and Guyon and Elisseeff
[3].
3 Data Set
This section discusses the specification of the data set used for evaluating the implementation of the improved KNN Arabic text categorizer.
Since no Arabic corpus was publicly available, we had to create our own corpus. For this purpose we collected many newspaper articles from different newspapers and news websites available online, including Al-Jazeera, Al-Safeer, Fares.net and Al-Dostor. The data set consists of 600 Arabic documents of different lengths belonging to 6 categories: Economy ("اقتصاد"), Health ("صحة"), Politics ("سياسة"), Sport ("رياضة"), Agriculture ("زراعة") and Science ("علوم"). We use 100 documents for each category; Table 3.1 shows the number of documents per category. Each document was labeled manually based on its contents and the domain in which it was found; each document is stored in a separate file, and the documents of the same category are stored in a separate directory.
Table 3.1: Number of documents per class

Category Name              Number of Documents Used
Economy ("اقتصاد")          100
Health ("صحة")              100
Politics ("سياسة")          100
Sport ("رياضة")             100
Agriculture ("زراعة")       100
Science ("علوم")            100
Table 4.1: Different decisions that can happen in the classification process

                         Correct Decision (by Expert)
Classifier Decision      Yes        No
Yes                      A          B
No                       C          D

Table 4.1 shows the different decisions that can happen during the categorization process, where:
• A is the number of documents assigned "yes" by the classifier where "yes" is correct (by the user judgment).
• B is the number of documents assigned "yes" by the classifier where "no" is correct (by the user judgment).
• C is the number of documents assigned "no" by the classifier where "yes" is correct (by the user judgment).
• D is the number of documents assigned "no" by the classifier where "no" is correct (by the user judgment).
4 Evaluation Measures
This section discusses and explains the evaluation measures that have been used in our research to measure the effectiveness of the proposed system. The experimental evaluation of classifiers typically tries to assess the effectiveness of the classifier [39].
Classification can be binary (whether or not to assign a category to a test document) or a ranking categorization that provides a ranked list of the potential categories for the test document. If we want to evaluate such ranking classifiers, we can turn them into binary classifiers (multiple binary classifiers) by thresholding the scores of the ranking list.
4.1 The Recall Measure
Recall (R) is defined as the ratio of the number of documents correctly assigned to a category (a) to the total number of documents that actually belong to that category (a + c). Recall is defined as follows:

$$R = \frac{a}{a+c} \qquad (4.1)$$
4.2 The Precision Measure
Precision (P) is defined as the ratio of the number of documents correctly assigned to a category (a) to the total number of documents assigned to that category (a + b). Precision is defined as follows:

$$P = \frac{a}{a+b} \qquad (4.2)$$
4.3 The Error Measure
Another, less important, measure is the error (E) measure, which measures the error occurring within the categorization process. The error measure takes the following form:

$$E = \frac{b+c}{a+b+c+d} \qquad (4.3)$$
4.4 Other Measures
Usually recall or precision alone is not used for the evaluation of the categorization process, because we can have a high recall value but a low precision value, so many researchers use measures that combine the standard recall and precision:
1) The breakeven point, proposed and discussed by Lewis [23], is defined as the point at which recall and precision are equal.
2) The F-measure combines recall and precision using the formula

$$F = \frac{2 \times Recall \times Precision}{Recall + Precision} \qquad (4.4)$$

The F-measure has been used in many studies, such as [21, 22].
4.5 Global Averaging Measures
The most popular measures for global averaging are the macro-average and the micro-average; both can be used with the traditional recall and precision:
• Macro-average: the local precision or recall is evaluated for each category first, and the global average is then obtained by dividing the total by the number of categories, as in the following formulas, where M denotes the macro-average and m is the number of categories:

$$R^{M} = \frac{\sum_{i=1}^{m} R_i}{m} \qquad (4.5)$$

$$P^{M} = \frac{\sum_{i=1}^{m} P_i}{m} \qquad (4.6)$$
5 Research Methodology
This section briefly describes the methodology of the automatic KNN Arabic text categorizer used in the project.
5.1 Overview
The methodology used in this study is the KNN technique, with normalized TF*IDF as the weighting scheme and information gain (IG) for keyword selection. Finally, when we test a document we compute the similarity between it and all the training documents, take the k nearest neighbors, sum the similarities of the neighbors that belong to the same training category, and the category with the highest sum is the classifier's decision.
5.2 Brief Description
In this section we briefly describe the steps of building the automatic KNN Arabic text classifier.
Step 1:
The user selects the number of documents to use as training documents for each category (the number of training documents is the same for every category; if the user selects 20, for example, then each category contributes 20 training documents, so the total number of training documents is 20 × 6 = 120 for the data set). The system extracts all the words in the documents (600 Arabic documents), eliminates the stop words, and finds the stem of each term using a light stemming technique, because stems are better for IR systems and for feature selection [7]. The user also selects the k-value, which determines the number of nearest-neighbor documents used for the classifier decision.
Step 2:
The system builds the inverted file (for the training documents), which contains each term (as a stem), the document number, the frequency of the term in that document, and the category number (1 for Economy "اقتصاد", 2 for Health "صحة", 3 for Politics "سياسة", 4 for Sport "رياضة", 5 for Agriculture "زراعة", 6 for Science "علوم").
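A minimal sketch of such an inverted file is shown below; the in-memory representation (a term-to-postings dictionary) and the toy documents are assumptions for illustration only.

```python
# Minimal sketch of the inverted file built in Step 2
# (assumed representation: term -> list of (doc_id, term_frequency, category_id)).
from collections import Counter, defaultdict

def build_inverted_file(training_docs):
    """training_docs: iterable of (doc_id, category_id, stemmed_tokens)."""
    inverted = defaultdict(list)
    for doc_id, category_id, tokens in training_docs:
        for term, freq in Counter(tokens).items():
            inverted[term].append((doc_id, freq, category_id))
    return inverted

# Example with two toy documents (category 1 = Economy, 4 = Sport)
docs = [(1, 1, ["اقتصاد", "سوق", "سوق"]), (2, 4, ["رياضة", "فريق"])]
index = build_inverted_file(docs)
```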
Step 3:
We compute the weight of each term in each document based on normalized TF*IDF, because this scheme is better than the raw term frequency or the Boolean model:

$$W_{ij} = \frac{Freq_{ij}}{MaxFreq_{j}} \times \log_2\left(\frac{N}{n_i}\right)$$

Where:
• $Freq_{ij}$ is the frequency of term $i$ in document $j$.
• $MaxFreq_{j}$ is the maximum frequency of any term in document $j$.
• $N$ is the number of documents in the collection (in this research N = 600 documents).
• $n_i$ is the number of documents in which term $i$ appears at least once.
• $W_{ij}$ is the weight of term $i$ in document $j$.
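The following sketch computes these normalized TF*IDF weights for a single document, following the formula above; the document-frequency dictionary and the toy values in the example are illustrative assumptions.

```python
# Sketch of the normalized TF*IDF weighting of Step 3.
import math
from collections import Counter

def tfidf_weights(doc_tokens: list[str], doc_freq: dict[str, int], n_docs: int) -> dict[str, float]:
    """doc_freq[t] = number of documents containing term t (n_i); n_docs = N."""
    counts = Counter(doc_tokens)
    max_freq = max(counts.values())                          # MaxFreq_j
    return {
        term: (freq / max_freq) * math.log2(n_docs / doc_freq[term])
        for term, freq in counts.items()
    }

# Example: a toy document in a hypothetical 600-document collection
weights = tfidf_weights(["سوق", "سوق", "اقتصاد"], {"سوق": 40, "اقتصاد": 90}, 600)
```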
Step 4:
We apply our keyword (feature) extraction algorithm, which keeps the stems whose information gain is higher than a threshold value.
Information gain measures the goodness of a term globally with respect to all categories: "it measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a document" [5]. In general, if we have categories $C_i$ where $i$ ranges from 1 to $m$ (in this project $m = 6$), then the information gain of a term $t$ is defined by the function IG(t) as in the following equation:
$$IG(t) = -\sum_{i=1}^{m} \Pr(C_i)\log \Pr(C_i) + \Pr(t)\sum_{i=1}^{m} \Pr(C_i \mid t)\log \Pr(C_i \mid t) + \Pr(\bar{t})\sum_{i=1}^{m} \Pr(C_i \mid \bar{t})\log \Pr(C_i \mid \bar{t})$$
where $\Pr(C_i)$ is estimated as the fraction of documents that belong to class $C_i$ out of the total number of documents, $\Pr(t)$ is estimated from the fraction of documents in which the term $t$ occurs at least once, $\Pr(C_i \mid t)$ is estimated from the fraction of documents containing the term $t$ that belong to class $C_i$, and $\Pr(C_i \mid \bar{t})$ is estimated from the fraction of documents not containing the term $t$ that belong to class $C_i$.
After that, the system stores these keywords in the index terms array, together with their frequency, document number, weight, and category. After this step, all weight and similarity calculations are based on the index terms array.
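A possible implementation of this information gain computation from document counts is sketched below; the counts in the example and the choice of threshold are assumptions, not values reported in the paper.

```python
# Sketch of the information gain computation used for feature selection (Step 4).
# Probabilities are estimated from document counts as described above; terms whose
# IG falls below a chosen threshold (an assumed value) are discarded.
import math

def information_gain(n_docs: int, docs_per_class: list[int],
                     docs_with_term: int, docs_with_term_per_class: list[int]) -> float:
    def entropy_terms(probs):
        return sum(p * math.log2(p) for p in probs if p > 0)

    p_c = [n / n_docs for n in docs_per_class]                         # Pr(Ci)
    p_t = docs_with_term / n_docs                                      # Pr(t)
    p_not_t = 1 - p_t                                                  # Pr(t̄)
    p_c_given_t = ([n / docs_with_term for n in docs_with_term_per_class]
                   if docs_with_term else [])
    docs_without = [c - w for c, w in zip(docs_per_class, docs_with_term_per_class)]
    n_without = n_docs - docs_with_term
    p_c_given_not_t = [n / n_without for n in docs_without] if n_without else []

    return (-entropy_terms(p_c)
            + p_t * entropy_terms(p_c_given_t)
            + p_not_t * entropy_terms(p_c_given_not_t))

# Example: 600 documents, 6 classes of 100 each; a term occurring in 60 documents,
# 50 of which belong to class 1 (hypothetical counts).
ig = information_gain(600, [100] * 6, 60, [50, 2, 2, 2, 2, 2])
```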
Step 5:
The system takes the remaining documents as the testing set (for example, if the user selects 60 documents per category for training, then 60 documents per category form the training set and the remaining 40 documents per category form the testing set).
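A minimal sketch of this per-category split, assuming the corpus is given as one list of document identifiers per category (a hypothetical layout), could look as follows.

```python
# Sketch of the per-category train/test split described in Step 5
# (assumed data layout: category_id -> list of document ids).

def split_per_category(docs_by_category: dict[int, list[str]], n_train: int):
    """Return (training, testing) lists of (category_id, doc_id) pairs."""
    training, testing = [], []
    for category_id, doc_ids in docs_by_category.items():
        training += [(category_id, d) for d in doc_ids[:n_train]]
        testing += [(category_id, d) for d in doc_ids[n_train:]]
    return training, testing

# Example: 100 documents per category, 60 per category used for training
# train, test = split_per_category(corpus, n_train=60)
```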
Step 6:
For each testing document the system removes stop words, applies stemming, and computes the weights of its terms; finally, the classifier computes the similarity between the test document and each of the training documents using the Jaccard similarity measure, which is represented by the following equation:
$$JaccardSim_{ij} = \frac{\sum_{k=1}^{m} W_{ki} \times W_{kj}}{\sum_{k=1}^{m} W_{ki}^{2} + \sum_{k=1}^{m} W_{kj}^{2} - \sum_{k=1}^{m} \left(W_{ki} \times W_{kj}\right)}$$
Where:
• $W_{ki}$ is the weight of term $k$ in document $i$, and
• $W_{kj}$ is the weight of term $k$ in document $j$.
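The sketch below implements this extended Jaccard similarity for two documents represented as term-to-weight dictionaries (an assumed representation); the weight values in the example are made up.

```python
# Sketch of the extended Jaccard similarity between two weighted documents (Step 6),
# each represented as a {term: weight} dictionary.

def jaccard_sim(wi: dict[str, float], wj: dict[str, float]) -> float:
    dot = sum(wi[t] * wj[t] for t in wi.keys() & wj.keys())   # sum of W_ki * W_kj
    sum_sq_i = sum(w * w for w in wi.values())                 # sum of W_ki^2
    sum_sq_j = sum(w * w for w in wj.values())                 # sum of W_kj^2
    denom = sum_sq_i + sum_sq_j - dot
    return dot / denom if denom else 0.0

# Example with toy weight vectors
print(jaccard_sim({"سوق": 0.8, "اقتصاد": 0.5}, {"سوق": 0.6, "مال": 0.4}))
```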
Step 7:
Now the classifier ranks the training documents by their similarity to the test document and finds the k nearest neighbor documents (depending on the value of k, which the user selected at the beginning). Based on these nearest neighbors, it takes the summation of the similarities that belong to the same category; the category with the highest value is the classifier's decision.
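A compact sketch of this decision rule is given below; it takes a similarity function as a parameter (for example the jaccard_sim sketch above) and defaults k to 19, the value recommended later in the paper.

```python
# Sketch of the KNN decision rule of Step 7: take the k most similar training
# documents and assign the category with the largest summed similarity.
from collections import defaultdict

def knn_classify(test_weights, training_docs, sim_fn, k=19):
    """training_docs: list of (category_id, weight_dict);
    sim_fn: similarity function, e.g. jaccard_sim from the previous sketch."""
    neighbors = sorted(((sim_fn(test_weights, w), cat) for cat, w in training_docs),
                       reverse=True)[:k]
    votes = defaultdict(float)
    for sim, cat in neighbors:
        votes[cat] += sim                      # sum similarities per category
    return max(votes, key=votes.get)           # category with the highest sum
```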
Step 8:
The classifier computes the local precision and recall for the test document; the system then repeats these steps for each test document until all the testing documents have been tested.
Step 9:
Finally, the system computes the local recall and local precision for each category, and then the average local precision and average local recall over all the categories (for the data set). This is done for each run with a different training-to-testing ratio; finally, the macro-recall and macro-precision are computed for the whole data set.
6 Results and Testing
We tested the system built in this research many times, and once the best threshold was found, we report the following results based on running the system five times, each time with a different training-set size (60, 120, 180, 240, and 360 training documents).
Figure 6.1: Average recall and precision for different training sizes
Figure 6.1 shows that as the number of training documents increases, the accuracy and effectiveness of the system also increase. We also notice that recall is always greater than precision under the Jaccard similarity measure (the similarity measure used in this study).
Figure 6.2: Average results for each category
Figure 6.2 shows the average results for all the categories based on the five trials. We notice that the sport category has the best recall value, followed by the agriculture category, while in terms of precision the economy category has the best value and its recall is also reasonable.
The k-value is very important in the KNN algorithm, because a small value of k means low performance in terms of recall and precision, while a very high value of k also means low performance, so we must select a suitable value of k. We suggest setting k to 19, which gave good results.
Figures 6.3 to 6.7 present the results for the different training sizes.
Figure 6.3: Recall and precision using 60 training documents
Figure 6.4: Recall and precision using 120 training documents
Figure 6.5: Recall and precision using 180 training documents
Figure 6.6: Recall and precision using 240 training documents
Figure 6.7: Recall and precision using 360 training documents
7 Conclusions and Recommendations
7.1 Summary
The automatic Arabic KNN classifier implemented in this study was evaluated for automatic Arabic text categorization. A corpus (consisting of 600 Arabic text documents belonging to 6 categories) was collected from the internet and used in the evaluation. The automatic Arabic KNN classifier used information gain (IG) for feature selection. The data set was preprocessed by removing stop words and applying light stemming, followed by feature selection, term weighting (normalized TF*IDF), and similarity computation (Jaccard).
7.2 Conclusion
Based on the results of the study, the following conclusions may be warranted:
• The results obtained in this study are applicable to the Arabic language.
• The automatic Arabic KNN classifier can be taught to classify using a small number of training documents. The results showed that the best F-measure was obtained with a 60% training data set.
• Using the Jaccard similarity measure, a macro-recall of about 0.793 and a macro-precision of about 0.627 were reached in this study using the improved KNN classifier.
• The value of k is very important and strongly affects the performance of the KNN classifier; very low values of k perform poorly, as do very high values. The study recommends k = 19 as the best value.
• For specific categories we conclude that agriculture has the best average recall, which reached 0.922, while the best precision, 0.899, was obtained on the economy category.
References
[1] R. Kohavi and G. John, "Wrappers for feature selection," Artificial Intelligence, Vol. 97, No. 1-2, 1997, pp. 273-324.
[2] H. Almuallim and T. G. Dietterich, "Learning with many irrelevant features," National Conference on Artificial Intelligence, 1992, pp. 547-552.
[3] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, Vol. 3, 2003, pp. 1157-1182.
[4] I. Kononenko, "On biases in estimating multi-valued attributes," International Joint Conference on Artificial Intelligence, 1995, pp. 1034-1040.
[5] Y. Yang and J. P. Pedersen, "A comparative study on feature selection in text categorization," International Conference on Machine Learning, 1997, pp. 412-420.
[6] G. Forman, "An extensive empirical study of feature selection metrics for text classification," Journal of Machine Learning Research, Vol. 3, 2003, pp. 1289-1305.
[7] M. A. Hall and G. Holmes, "Benchmarking attribute selection techniques for discrete class data mining," IEEE Transactions on Knowledge and Data Engineering, Vol. 15, No. 6, 2003, pp. 1437-1447.
[8] A. L. Blum and P. Langley, "Selection of relevant features and examples in machine learning," Artificial Intelligence, Vol. 97, No. 1-2, 1997, pp. 245-271.
[9] N. Littlestone, "Learning quickly when irrelevant attributes abound: a new linear threshold algorithm," Machine Learning, Vol. 2, 1988, pp. 285-318.
[10] I. Guyon et al., "Gene selection for cancer classification using support vector machines," Machine Learning, Vol. 46, No. 1-3, 2002, pp. 389-422.
[11] L. Breiman, "Classification and Regression Trees," Belmont, CA, 1984.
[12] E. Frank and I. H. Witten, "Using a permutation test for attribute selection in decision trees," International Conference on Machine Learning, 1998, pp. 152-160.
[13] M. Dash and H. Liu, "Feature selection for classification," Intelligent Data Analysis, Vol. 1, No. 3, 1997, pp. 131-156.
[14] C. M. Bishop, "Neural Networks for Pattern Recognition," Oxford University Press, 1995.
[15] R. Caruana and D. Freitag, "Greedy attribute selection," International Conference on Machine Learning, 1994, pp. 28-36.
[16] H. Vafaie and I. F. Imam, "Feature selection methods: genetic algorithms vs. greedy-like search," International Conference on Fuzzy and Intelligent Control Systems, 1994.
[17] J. Doak, "An evaluation of feature selection methods and their application to computer security," Technical Report CSE-92-18, University of California at Davis, 1992.
[18] G. Brassard and P. Bratley, "Fundamentals of Algorithms," Prentice Hall, 1996.
[19] D. Koller and M. Sahami, "Toward optimal feature selection," International Conference on Machine Learning, 1996, pp. 284-292.
[20] G. H. John, R. Kohavi and K. Pfleger, "Irrelevant features and the subset selection problem," International Conference on Machine Learning, 1994, pp. 121-129.
[21] J. Li and L. Wong, "Identifying good diagnostic gene groups from gene expression profiles using the concept of emerging patterns," Bioinformatics, Vol. 18, No. 5, 2002, pp. 725-734.
[22] F. Sebastiani, "A tutorial on automated text categorization," in A. Amandi and R. Zunino, editors, Proceedings of ASAI-99, 1st Argentinian Symposium on Artificial Intelligence, Buenos Aires, AR, 1999, pp. 7-35.
[23] W. Lam and C. Y. Ho, "Using a generalized instance set for automatic text categorization," SIGIR'98, 1998, pp. 81-89.
[24] D. D. Lewis, "Naïve Bayes at forty: The independence assumption in information retrieval," Proceedings of ECML-98, 10th European Conference on Machine Learning, Chemnitz, Germany, 1998, pp. 4-15.