
Feature Reduction Techniques for Arabic Text Categorization
Rehab Duwairi
Department of Computer Information Systems, Jordan University of Science and Technology,
Irbid, Jordan. E-mail: [email protected]
Mohammad Nayef Al-Refai
Department of Computer Science, Jordan University of Science and Technology, Irbid, Jordan.
E-mail: [email protected]
Natheer Khasawneh
Department of Computer Engineering, Jordan University of Science and Technology, Irbid, Jordan.
E-mail: [email protected]
This paper presents and compares three feature reduction techniques that were applied to Arabic text. The
techniques include stemming, light stemming, and word
clusters. The effects of the aforementioned techniques
were studied and analyzed on the K-nearest-neighbor
classifier. Stemming reduces words to their stems. Light
stemming, by comparison, removes common affixes from
words without reducing them to their stems. Word clusters group synonymous words into clusters and each
cluster is represented by a single word. The purpose of
employing the previous methods is to reduce the size of
document vectors without affecting the accuracy of the
classifiers. The comparison metrics include document vector size, classification time, and accuracy (in terms
of precision and recall). Several experiments were carried out using four different representations of the same
corpus: the first version uses stem-vectors, the second
uses light stem-vectors, the third uses word clusters, and
the fourth uses the original words (without any transformation) as representatives of documents. The corpus
consists of 15,000 documents that fall into three categories: sports, economics, and politics. In terms of vector
sizes and classification time, the stemmed vectors occupied the smallest space and required the least time to classify a testing dataset that consists of 6,000 documents. The light-stemmed vectors surpassed the other three representations in terms of classification accuracy.
Introduction
The exponential growth in the availability of online information and in Internet usage has created an urgent demand
for fast and useful access to information (Correa & Ludermir,
2002; Ker & Chen, 2000; Pierre, 2000). People need help
in finding, filtering, and managing resources. Furthermore,
today’s large repositories of information present the problem
of how to analyze the information and how to facilitate navigation to the information. This mass of information must be
organized to make it comprehensible to people, and the most
successful paradigm is to categorize different documents
according to their topics.
Text categorization, or text classification, is one of many
information management tasks. It is a way of assigning documents to predefined categories based on document contents.
Categorization is generally done to organize information
automatically. The need for automated classification arises mainly from the rapid growth and change of the Web, where manual organization becomes almost impossible without expending massive time and effort (Pierre, 2000).
Information retrieval, text routing, filtering, and understanding are some examples of the wide range of applications of text
categorization (Dumais, Platt, Heckerman, & Sahami, 1998).
Many categorization algorithms have been applied to text
categorization, for example, the Naïve Bayes probabilistic
classifiers (Eyheramendy, Lewis, & Madigan, 2003), Decision Tree classifiers (Bednar, 2006), Neural Networks (Basu,
Walters, & Shepherd, 2003), K-nearest-neighbor classifiers
(KNN) (Gongde, Hui, David, Yaxin, & Kieran, 2004) and
Support Vector Machines (Sebastiani, 2005).
With the increasing size of datasets used in text classification, the number and quality of features provided to describe
the data have become a relevant and challenging problem.
There is a need for effective feature reduction strategies (Yan
et al., 2005; Yang & Pedersen, 1997). Some standard feature
reduction processes can initially be used to reduce the high
number of features such as eliminating stopwords, stemming,
and removing very frequent/infrequent words. Feature selection strategies discover a subset of features that are relevant
to the task to be learned and that causes a reduction in the
training and testing data sizes (Seo, Ankolekar, & Sycara,
2004). The classifier built with only the relevant subset of
features would have better predictive accuracy than the classifier built from the entire set of features (Mejia-Lavalle &
Morales, 2006). If we keep working on a high dimension
dataset space, two main problems may arise: computational
complexity and overfitting.
In this paper the researchers present and compare three
heuristic feature selection measures for Arabic text: stemming, light stemming, and word clusters. Stemming reduces
words to their stems (roots). Light stemming, on the other
hand, removes common affixes from words without reducing them to their roots. Word clusters partition the words
that appear in a document into clusters based on the synonymy relation. Afterwards each cluster is represented by
a single word (called cluster representative). The effects of
the above three techniques, in addition to the case of using the
original words of a document, as feature selection techniques, were assessed for text categorization. The assessment
framework includes comparing the document vector sizes,
preprocessing and classifications times, and classifier accuracy. The KNN classifier was applied to an Arabic dataset.
The dataset consists of 15,000 Arabic documents; the documents are collected, filtered, and classified manually into
three categories: Sports, Economics, and Politics. The smallest vector sizes and the shortest times were achieved in the
case of stemming, while the highest classifier accuracy in
terms of precision and recall was obtained in the case of light
stemming.
This paper is organized as follows: The first section
is the introduction; the second section describes the proposed framework, which consists of feature selection techniques and the classification subsystem. The third section
presents and analyzes the results of this paper. Finally, the
last section summarizes the conclusions and briefly highlights
future work.
System Framework
The Proposed Feature Selection Measures
Stemming algorithms are needed in many applications
such as natural language processing, compression of data, and
information retrieval systems. Very little work in the literature
utilizes stemming algorithms for Arabic text categorization,
such as the work of Sawaf, Zaplo, and Ney (2001), and the
work of Elkourdi, Bensaid, and Rachidi (2004), and Duwairi
(2006). Applying stemming algorithms as a feature selection
method reduces the number of features because lexical forms
(of words) are derived from basic building blocks and, hence,
many features that are generated from the same stem are represented as one feature (their stem). This technique reduces
the size of document vectors and increases the speed of
learning and categorization phases for many classifiers, especially for classifiers that scan the whole training dataset for
each test document. The stemming algorithm of Al-Shalabi,
Kanaan, and Al-Serhan (2003) was followed here as a feature
selection method. Most Arabic word roots consist of three letters; very few roots have four, five, or six letters. The algorithm
reported in Al-Shalabi et al. (2003) finds the three-letter roots
for Arabic words without depending on any root or pattern
files. For example, using Al-Shalabi et al.’s algorithm would reduce the Arabic words المكتبة, الكاتب, and الكتاب, which mean “the library”, “the writer”, and “the book”, respectively, to one stem, كتب, which means “write”.
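To illustrate the feature reduction effect of stemming, the sketch below collapses the three surface forms above into a single feature. The lookup table is a hypothetical stand-in for the Al-Shalabi et al. (2003) root extractor, which computes roots algorithmically rather than from a table.

# Hypothetical stand-in for a root extractor; the real algorithm derives the
# three-letter root from the word itself instead of looking it up.
TOY_ROOTS = {"المكتبة": "كتب", "الكاتب": "كتب", "الكتاب": "كتب"}

def stem(word):
    return TOY_ROOTS.get(word, word)

# Three distinct surface forms collapse into the single feature "كتب",
# which is how stemming shrinks the document vector.
words = ["المكتبة", "الكاتب", "الكتاب"]
print({stem(w) for w in words})   # -> {'كتب'}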
The main idea for using light stemming is that many word
variants do not have similar meanings or semantics. However,
these word variants are generated from the same root. Thus,
root extraction algorithms affect the meanings of words. Light
stemming, by comparison, aims to enhance the categorization
performance while retaining the words’ meanings. It removes
some defined prefixes and suffixes from the word instead of
extracting the original root (Aljlayl & Frieder, 2002). For
example, the word الكتاب means “the book” and the word الكاتبون means “the writers”; they are extracted from the same root كتب, which means “write”, but they have different meanings. Thus, the stemming approach reduces their semantics. The light stemming approach, on the other hand, maps the word الكتاب, which means “the book”, to كتاب, which means “book”, and stems the word الكاتبون, which means “the writers”, to كاتب, which means “writer”. Light stemming
keeps the words’ meanings unaffected. We applied the light
stemming approach of Aljlayl and Frieder (2002) as a feature
selection method. The basis of their light stemming algorithm
consists of several rounds that attempt to locate and remove
the most frequent prefixes and suffixes from the word.
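A minimal sketch of this idea is given below; the prefix and suffix lists are illustrative assumptions only and do not reproduce the exact affix lists or removal rounds of Aljlayl and Frieder (2002).

# Simplified light stemming: strip a few common Arabic affixes without
# reducing the word to its root. The affix lists are illustrative only.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]   # assumed examples
SUFFIXES = ["ون", "ات", "ان", "ين", "ها", "ة"]              # assumed examples

def light_stem(word, min_len=3):
    """Remove at most one prefix and one suffix, keeping the word readable."""
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) - len(p) >= min_len:
            word = word[len(p):]
            break
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= min_len:
            word = word[:-len(s)]
            break
    return word

print(light_stem("الكتاب"))   # "the book" -> "كتاب" ("book")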
Word clustering aggregates synonymous words, which
have various syntactical forms but have similar meanings,
into clusters. For example, two verbs that share the meaning “run” would be aggregated into a single cluster even though they have different roots. Without clustering, classifiers cannot treat such words as correlated words that provide similar semantic interpretations. A word cluster vector is created for every document
by partitioning the words that appear in that document
into groups based on their synonymy relation. Every cluster is then represented by a single word—the one that is
commonly used in that context. To alleviate minor syntactical variations among words in the same cluster, the words
were light stemmed. Using this approach, a document vector would consist of cluster representatives only and their
frequencies. This fundamentally reduces the size of document vectors. The distribution of words into clusters is
performed by carrying out a thesaurus lookup for every word
that appears in a document. If that word matches a cluster
(list) in the thesaurus then that word is replaced by that list’s
representative.
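The following sketch shows how a cluster-based document vector might be assembled by thesaurus lookup; the thesaurus entries are invented placeholders, not the manually built thesaurus used in this work.

# Illustrative thesaurus: each cluster representative maps to its synonym list.
# These entries are placeholders for the real, manually built thesaurus.
THESAURUS = {
    "ركض": ["جرى", "عدا", "ركض"],      # verbs meaning "run" (assumed)
    "اقتصاد": ["اقتصاد", "مال"],        # economics-related terms (assumed)
}

# Invert the thesaurus once: word -> cluster representative.
WORD_TO_REPRESENTATIVE = {
    word: rep for rep, synonyms in THESAURUS.items() for word in synonyms
}

def to_cluster_vector(light_stemmed_words):
    """Replace each word by its cluster representative (if any) and count
    term frequencies, so the vector holds representatives only."""
    vector = {}
    for word in light_stemmed_words:
        rep = WORD_TO_REPRESENTATIVE.get(word, word)  # unknown words kept as-is
        vector[rep] = vector.get(rep, 0) + 1
    return vector

print(to_cluster_vector(["جرى", "ركض", "مال"]))   # -> {'ركض': 2, 'اقتصاد': 1}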
Since the dataset consists of three categories, sports, politics, and economics, we built a thesaurus for Arabic terms
that are related to these topics. Many resources were utilized
to build this thesaurus, such as several Arabic dictionaries, other electronic resources such as the Microsoft Office 2003 thesaurus, and many Internet sites that provide information about the Arabic terms of politics, sports, and economics, or their synonymous terms.
The Classification Subsystem
The motivation of our research is to assess the effects of
feature reduction methods on the KNN classifier. Therefore,
the documents in the dataset were processed and represented
in four versions and then the classifier was run on each version. Each document in the dataset was represented using the
following four vectors:
A stem vector: where words that appear in the document were
reduced to their stems.
A light stem vector: where words are processed by removing common affixes using the algorithm described previously
(Aljlayl & Frieder, 2002).
A word-clusters vector: where synonymous words that appear in the document are represented by a single word that is their cluster representative.
A word vector: where words of a document are used as is, without any transformation.
Figure 1 shows the main modules in the system. The
following paragraphs describe each of these modules.
The preprocessor: preprocessing aims to represent documents in a format that is understandable to the classifier. Common functions for preprocessors include document
conversion, stopword removal, and term weighting.
The stemmer and light stemmer modules apply stemming
and light stemming to the keywords of a document, respectively. The word cluster module groups synonymous words.
After each transformation the keywords are weighted using
term frequency (TF).
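A rough sketch of this preprocessing step is shown below, assuming the document text is already plain Unicode. The stopword list is a tiny placeholder, and the transform argument stands in for whichever feature reduction function (stemming, light stemming, or cluster lookup) is being evaluated.

import re
from collections import Counter

# Tiny placeholder stopword list; the actual Arabic stopword list used in the
# experiments is not reproduced here.
STOPWORDS = {"في", "من", "على", "إلى", "عن", "أن"}

def preprocess(text, transform):
    """Remove tags, punctuation marks, and stopwords, apply the chosen feature
    reduction transform, and weight the surviving terms by term frequency."""
    text = re.sub(r"<[^>]+>", " ", text)             # strip markup tags
    tokens = re.findall(r"\w+", text, re.UNICODE)    # drop punctuation marks
    tokens = [t for t in tokens if t not in STOPWORDS]
    return Counter(transform(t) for t in tokens)     # TF-weighted vector

# Example: a word vector uses the tokens as-is, without any transformation.
word_vector = preprocess("<p>الاقتصاد في الأردن</p>", lambda t: t)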
The KNN classifier takes as input a test document (represented using the four vectors described above) and assigns a
label to that document by calculating its similarity to every
document in the training dataset. The training dataset was
also represented using the four representational vectors used
for test documents. The label of the test document is determined based on the labels of the closest K neighbors to that
document. The best value of K was 10 (for this dataset) and
it was determined experimentally.
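A compact sketch of this classification step is given next. It assumes TF vectors stored as dictionaries and uses cosine similarity as the closeness measure, which is an assumption since the paper does not state the exact similarity function; K defaults to the experimentally chosen value of 10.

import math
from collections import Counter

def cosine(v1, v2):
    """Cosine similarity between two sparse TF vectors (dict: term -> count)."""
    dot = sum(w * v2.get(t, 0) for t, w in v1.items())
    norm = math.sqrt(sum(w * w for w in v1.values())) * \
           math.sqrt(sum(w * w for w in v2.values()))
    return dot / norm if norm else 0.0

def knn_classify(test_vector, training_set, k=10):
    """training_set is a list of (tf_vector, label) pairs; the label of the
    test document is the majority label among its K most similar neighbours."""
    neighbours = sorted(training_set,
                        key=lambda item: cosine(test_vector, item[0]),
                        reverse=True)[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]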
Experimentation and Results Analysis
Dataset Description and Vector Sizes
The dataset consists of 15,000 Arabic text documents.
These documents were prepared manually by collecting them
from online resources. The dataset was filtered and classified manually into three categories: politics, sports, and
economics. Each category consists of 5,000 documents.
The dataset was divided into two parts: training and testing. The testing data consist of 6,000 documents, 2,000
documents per category. The training data, on the other hand,
consist of 9,000 documents, 3,000 documents per category.
Every document was represented by four vectors depending on the feature reduction technique employed, namely word, word-cluster, stemmed, and light-stemmed vectors.
Table 1 describes the characteristics of the four versions
of document vectors. The purpose of this table is to show that
feature reduction methods reduce the dataset size, and hence
minimize the required memory space to handle the dataset.
As can be seen from the table, the stemmed vectors occupied the least space (35 MB) and contained the smallest number of features (5,341,696). This is expected, as stemming reduces several
words to one stem. The largest vectors in terms of size and
number of features were the word vectors (where no feature reduction technique was employed). Again, this is expected, as words with minor syntactical variations would each be represented as a distinct keyword on its own. Finally, the light-stemmed vectors occupied 59.5 MB with 7,092,884 features. This is larger than the stemmed vectors because only certain prefixes and suffixes are removed from the words before storing them in their corresponding vectors. Word-cluster vectors fall between the stemmed and light-stemmed versions.

TABLE 1. The properties of the training dataset in terms of size and number of features.

Training dataset version     Size in megabytes    Total number of keywords for 9,000 documents
Word vectors                 86 MB                8,987,278
Stemmed vectors              35 MB                5,341,696
Light-stemmed vectors        59.5 MB              7,092,884
Word-cluster vectors         57 MB                6,845,538

TABLE 2. The elapsed classification time (in seconds) for 6,000 test documents.

Experiment                   Preprocessing time    Classification time    Total time
Word vectors                 1,778                 12,013                 13,791
Stemmed vectors              1,252                 9,773                  11,025
Light-stemmed vectors        1,433                 10,886                 12,319
Word-cluster vectors         1,881                 10,155                 12,036
Preprocessing and Classification Times
The experiments were performed by categorizing the test
documents (6,000 documents) in four cases based on the
feature reduction method utilized. In every case the KNN
classifier was used. The experiments were carried out on a
Pentium IV personal computer (PC) with 256 MB of RAM. Table 2 shows the elapsed preprocessing and classification times for all test documents. Preprocessing time
depends on the activities performed during this process. In
our work, preprocessing includes the removal of punctuation
marks, tags, and stopwords, which is common to all experiments. Preprocessing also includes term weighting, which is
common to all experiments but also depends on the feature
reduction algorithm. Term weighting time is proportional to the number of terms in a given document: the more terms, the higher the preprocessing time.
Table 2 shows that the lowest preprocessing time was
achieved in the case of stemming. This is because the size
of the vectors is smaller when compared with the other three
vector types. Also, the stemming algorithm utilized is efficient in the sense that it needs to scan a given word only once
to deduce its stem (Al-Shalabi et al., 2003). The next-best preprocessing time was achieved in the case of light stemming.
To a certain extent, stemming and light stemming are similar
in the sense that they both need to process every word in a
document either by running the stemming or light stemming
algorithms. The worst preprocessing time was in the case
of word clusters, as this requires accessing the thesaurus in
addition to the document to create a document vector.
Classification time indicates the time necessary to classify the 6,000 test documents using the KNN classifier. As
can be seen from the table, classifying documents using the
stemmed vectors needed the least time, as document vector sizes are rather small. Classifying documents using the
light-stemmed and word-cluster vectors consumed 10,886
and 10,155 seconds, respectively. The two values slightly
vary because vector sizes of the two methods are similar (see
Table 1). Word vectors needed the most time to classify the
collection of test documents; again, this is due to the fact that
word vectors are the largest. To sum up, classification time
using the KNN classifier is proportional to document vector
sizes: the smaller the vector sizes the smaller the classification
time.
The last column in Table 2 shows the total time. Total time
is the sum of the preprocessing time and classification time.
Stemmed vectors achieved the lowest total time, while the largest total time was observed in the case of word vectors.
Classifier Accuracy Versus Feature Reduction Techniques
This section investigates the effects of feature reduction
techniques on classifier accuracy. The accuracy of the classifier is judged by the standard precision and recall values
widely used by the research community. These were originally used to evaluate information retrieval systems and are
currently used to assess the accuracy of classifiers. Assume
that we have a binary classification problem. This means
we have only one category to predict (say, C). The sample of documents consists of both documents that belong to
this category and documents that do not belong to the category. Let TP (true positives) be the number of documents that
were classified by the classifier to be members of C and they
are true members of C (human agrees with classifier). Let
FN (false negatives) be the number of documents that truly
belong to C but the classifier failed to single them out. Let FP
(false positives) be the number of documents that were misclassified by the classifier to belong to C. Finally, let TN (true
negatives) be the number of documents that were truly classified not to belong to C. Recall is defined as TP/(TP + FN)
and Precision is given by TP/(TP + FP).
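These definitions translate directly into code; the counts in the usage line are hypothetical and not taken from the paper's results.

def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical confusion counts for one category (not the paper's results):
print(precision_recall(tp=1800, fp=250, fn=200))   # -> (0.878..., 0.9)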
Table 3 shows the precision values for the politics, economics, and sports documents that were fed to the KNN
classifier. Every group of test documents was fed to the classifier four times: once in the form of word vectors, the second in
the form of stemmed vectors, then as light-stemmed vectors,
and finally as word-clusters vectors.
As can be seen from the table the highest value of precision
was achieved for politics documents, in the case of word
cluster vectors. The precision for the light stemming case was slightly less than the word clusters case. The worst precision value for the politics documents was in the case of the word vectors.
TABLE 3. Classifier accuracy (using precision) for word vectors, stemmed vectors, light-stemmed vectors, and word-clusters vectors.

             Word vectors    Stemmed vectors    Light-stemmed vectors    Word-clusters vectors
Politics     0.722           0.7448             0.83187                  0.8404
Economics    0.7642          0.92433            0.9334                   0.90938
Sports       0.95813         0.9873             0.9806                   0.9763
TABLE 4. Classifier accuracy (using recall) for word vectors, stemmed vectors, light-stemmed vectors, and word-clusters vectors.

             Word vectors    Stemmed vectors    Light-stemmed vectors    Word-clusters vectors
Politics     0.9435          0.956              0.9475                   0.9295
Economics    0.6565          0.7085             0.82                     0.838
Sports       0.801           0.938              0.9635                   0.9495
FIG. 2. Average precision of the classifier.
FIG. 3. Average recall of the classifier.
The highest precision value for the economics documents,
by comparison, was achieved in the case of light-stemmed
vectors; the lowest value was achieved in the case of the word
vectors. Finally, the highest precision value for the sports documents was achieved in the case of light-stemmed vectors
and the worst was in the case of word vectors. The conclusion is that using words without applying any stemming, light stemming, or word clustering algorithm results in a classifier that is too sensitive to minor syntactical variations of the same word; these words are therefore treated as uncorrelated, which adversely affects the precision of the classifier. Figure 2 depicts the average precision for
all test documents (politics, economics, and sports); the best
average value was achieved in the case of light stemming and
the worst average value resulted in the case of word vectors.
Table 4 shows the recall values for the three categories.
The highest recall for politics documents was achieved in the case where the documents were represented as stemmed vectors. The highest recall value for economics documents, by comparison, was achieved when the documents were represented as word-cluster vectors. Finally, the highest recall for sports documents was achieved in the case where the documents were represented as light-stemmed vectors.
Figure 3 shows the average recall values for all test
documents against the four employed feature reduction
techniques. The two best values were achieved in the cases
of light stemming and word clusters, respectively.
Conclusions and Future Work
In this study we applied three feature selection methods
for Arabic text categorization. The Arabic dataset was collected manually from Internet sites such as Arabic journals.
The dataset was filtered and classified manually into three
categories: politics, sports, and economics. Each category
consists of 5,000 documents. The dataset was divided into
two parts: training and testing. The testing data consist of
6,000 documents, 2,000 documents for each category. The
training data consist of 9,000 documents, 3,000 documents
per category.
The techniques used for feature selection are stemming
(Al-Shalabi et al., 2003), light stemming (Aljlayl & Frieder,
2002), and word clusters. Stemming finds the three-letter
roots for Arabic words without depending on any root or
pattern files. Light stemming, on the other hand, removes the
common suffixes and prefixes from the words. Word clustering groups synonymous light stems using a prepared thesaurus and then chooses a light-stemmed word to represent the cluster.
The KNN classifier was used to classify the test documents. The experiments were done in the following manner:
the KNN classifier was run four times on four versions of the
dataset. In the first version, a document is represented as a
vector that includes all the words that appear in that document. In the second version, words of a given document are
reduced to their stems (roots) then the corresponding document vector is created. In the third version the words that
constitute a given document were light stemmed and then
the corresponding vector is generated. In the final version, the
words of a given document were grouped into clusters based
on the synonymy relation and each cluster was represented by
a single word (cluster representative). Afterwards, that document vector was formed by using cluster representatives
only.
Our experiments have shown that stemming reduces vector
sizes, and therefore improves the classification time. However, it adversely affects the accuracy of the classifier in
terms of precision and recall. The precision and recall reached
their highest values when using the light stemming approach.
These results are of interest to anyone working in Arabic
information retrieval, text filtering, or text categorization.
In the future we plan to extend this framework to include
statistical feature selection techniques such as χ2, information gain, and the Gini index (Shang et al., 2007; Shankar &
Karypis, 2000). Finally, the thesaurus, which we used in
this work, was built manually. We plan to improve this thesaurus by utilizing automatic algorithms and then having language experts screen the synonymy lists.
References
Aljlayl, M., & Frieder, O. (2002). On Arabic search: Improving the retrieval
effectiveness via a light stemming approach. In Proceedings of the ACM
11th Conference on Information and Knowledge Management (pp. 340–
347). New York: ACM Press.
Al-Shalabi, R., Kanaan, G., & Al-Serhan, H. (2003, December). A new
approach for extracting Arabic roots. Paper presented at the International Arab Conference on Information Technology (ACIT), Alexandria,
Egypt.
Basu, A., Walters, C., & Shepherd, M. (2003). Support vector machines
for text categorization. In Proceedings of the 36th Annual Hawaii International Conference on System Sciences (pp. 103–109). Los Alamitos,
California: IEEE Press. Retrieved July 2, 2009, from http://ieeexplore.
ieee.org/stamp/stamp.jsp?tp=&arnumber=1174243&isnumber=26341
Bednar, P. (2006, January). Active learning of SVM and decision tree
classifiers for text categorization. Paper presented at the Fourth Slovakian-Hungarian Joint Symposium on Applied Machine Intelligence, Herlany,
Slovakia.
Correa, R.F., & Ludermir, T.B. (2002, November). Automatic text categorization: Case study. Paper presented at the VII Brazilian Symposium on
Neural Networks, Pernambuco, Brazil.
Dumais, S., Platt, J., Heckerman, D., & Sahami, M. (1998). Inductive learning algorithms and representations for text categorization. In Proceedings
of the Seventh International Conference on Information and Knowledge
Management (pp. 148–155). New York: ACM Press.
Duwairi, R.M. (2006). Machine learning for Arabic text categorization. Journal of the American Society for Information Science and Technology,
57(8), 1005–1010.
Elkourdi, M., Bensaid, A., & Rachidi, T. (2004). Automatic Arabic document categorization based on the naïve Bayes algorithm. In Proceedings
of COLING 20th Workshop on Computational Approaches to Arabic Script-based Languages (pp. 51–58). Retrieved July 2, 2009, from
http://www.arabicscript.org/W5/pdf/proceedings.pdf
Eyheramendy, S., Lewis, D., & Madigan, D. (2003). On the naïve Bayes
model for text categorization. Paper presented at the Ninth International
Conference on Artificial Intelligence and Statistics, Key West, FL.
Gongde, G., Hui, W., David, A.B., Yaxin, B., & Kieran, G. (2004). An
kNN model-based approach and its application in text categorization.
In A. Gelbukh (Ed.), Proceedings of the Fifth International Conference on Intelligent Text Processing and Computational Linguistics
(CICLing) (pp. 559–570). Lecture Notes in Computer Science, Vol. 2945.
Berlin/Heidelberg, Germany: Springer.
Ker, S., & Chen, J. (2000). A text categorization based on summarization technique. In J. Klavans & J. Gonzalo (Eds.), Proceedings of the
ACL-2000 Workshop on Recent Advances in Natural Language Processing and Information Retrieval, held in conjunction with the 38th Annual
Meeting of the Association for Computational Linguistics (pp. 79–83).
New Brunswick, NJ: The Association for Computational Linguistics.
Mejia-Lavalle, M., & Morales, E. (2006). Feature selection considering attribute inter-dependencies. In International Workshop on Feature
Selection for Data Mining: Interfacing Machine Learning and Statistics
(pp. 50–58). Providence, RI: American Mathematical Society.
Pierre, J. (2000, September). Practical issues for automated categorization
of web pages. Paper presented at the 2000 Workshop on the Semantic
Web, Lisbon, Portugal. Retrieved May 29, 2009, from http://citeseer.ist.
psu.edu/pierre00practical.html
Sawaf, H., Zaplo, J., & Ney, H. (2001, July). Statistical classification methods
for Arabic news articles. Paper presented at the Arabic Natural Language
Processing Workshop, Toulouse, France.
Sebastiani, F. (2005). Text categorization. In A. Zanasi (Ed.). Text mining
and its applications to intelligence, CRM and knowledge management
(pp. 109–129). Southampton, UK: WIT Press.
Seo, Y., Ankolekar, A., & Sycara, K. (2004). Feature selection for extracting
semantically rich words. Technical Report CMU-RI-TR-04-18, Robotics
Institute, Carnegie Mellon University, Pittsburgh, PA.
Shang, W., Huang, H., Zhu, H., Lin, Y., Qu, Y., & Wang, Z. (2007). A novel
feature selection algorithm for text categorization. Expert Systems with
Applications, 33(1), 1–5.
Shankar, S., & Karypis, G. (2000). A feature weight adjustment algorithm
for document categorization. In Proceedings of the Sixth ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining. New
York: ACM Press.
Yan, J., Liu, N., Zhang, B., Yan, S., Chen, Z., Cheng, Q., Fan, W., & Ma, W.
(2005). OCFS: Optimal orthogonal centroid feature selection for text categorization. In Proceedings of the 28th Annual International ACM SIGIR
Conference (SIGIR’2005) (pp. 122–129). New York: ACM Press.
Yang, Y., & Pedersen, J. (1997). A comparative study on feature selection
in text categorization. In J.D.H. Fisher (Ed.). The Fourteenth International Conference on Machine Learning (ICML’97) (pp. 412–420).
San Francisco: Morgan Kaufmann.