
Text Categorization with a Small Number of
Labeled Training Examples
A Thesis Presented
by
Kang Hyuk Lee
Submitted to the University of Sydney in fulfillment
of the requirements for the degree of
Doctor of Philosophy
September 1, 2003
School of Information Technologies
University of Sydney
List of Publications
Parts of this research have been published in the following:
K. H. Lee, J. Kay, and B. H. Kang. Keyword Association Network: A Statistical Multiterm Indexing Approach for Document Categorization. In Proceedings of the 5th
Australasian Document Computing Symposium, pages 9-16, December 1, 2000.
K. H. Lee, J. Kay, and B. H. Kang. Lazy Linear Classifier and Rank-in-Score
Threshold in Similarity-Based Text Categorization. International Conference on
Machine Learning Workshop on Text Learning (TextML’2002), Sydney, Australia,
pages 36-43, July 8, 2002.
K. H. Lee, J. Kay, B. H. Kang, and U. Rosebrock. A Comparative Study on Statistical
Machine Learning Algorithms and Thresholding Strategies for Automatic Text
Categorization. The 7th Pacific Rim International Conference on Artificial Intelligence
(PRICAI-02), Tokyo, Japan, pages 444-453, August 18-22, 2002.
K. H. Lee, J. Kay, and B. H. Kang. Active Learning: Applying RinSCut Thresholding
Strategy to Uncertainty Sampling. The 16th Australian Joint Conference on Artificial
Intelligence, Perth, Australia, December 3-5, 2003 (In press).
Acknowledgments
My great thanks to my supervisor, Associate Professor Judy Kay, for showing
special patience as I pursued my own particular area of interest. She enthusiastically
supported me in doing research in an area that is not one of her specialized areas. I know
that she spent a lot of time giving suggestions and constructive criticism during the
development of this thesis. I also thank her for the financial support in the form of a
postgraduate scholarship.
Thanks to Dr. Byeong H. Kang and his research students at the University of
Tasmania for their time and interest in my research area. He provided me with a great
deal of guidance in machine learning and encouraged me to keep doing this research.
My thanks should go to my wife, Soo Jung, and my two children, Eugene and
Yusang. They showed endless love, understanding, and belief throughout the difficult
situation I created. Thanks to them for letting me know how much I love them.
My deepest thanks to my parents for their love and support without which this
thesis could not have appeared.
Thanks to Mr. Kang-Kil Lee and his family for caring for my family in Sydney,
Australia.
Many thanks to people in the School of Information Technologies for their
patience as I used so much CPU time and disk space on the servers shared with them.
Abstract
This thesis describes the investigation and development of supervised and semi-supervised learning approaches to similarity-based text categorization systems. The aim is to use a small number of manually labeled examples for training while still maintaining effectiveness.
The purpose of text categorization is to automatically assign arbitrary raw
documents to predefined categories based on their contents. Text categorization
involves several sub-phases that should be combined effectively for the success of the
overall system. We explore those sub-phases in terms of an approach to similarity-based text categorization that has shown good categorization performance.
Supervised approaches to text categorization usually require a large number of
training examples to achieve a high level of effectiveness. Each training example requires human involvement for labeling. So, labeling such a large number of documents for training poses a
considerable burden on human experts who need to read each document and assign it to
appropriate categories. With this problem in mind, our goal was to develop a text
categorization system that uses fewer labeled examples for training to achieve a given
level of performance.
We describe our new similarity-based learning algorithm (KAN) and thresholding
strategies (RinSCut variants). KAN was designed to give appropriate weights to terms
according to their semantic content and importance by using their co-occurrence
information and the discriminating power values for similarity computation. In this
way, KAN defines more statistical information for each given training example,
compared with other, conventional similarity-based algorithms. After investigating the
existing common thresholding strategies, we designed, for multi-class text
categorization in which documents may belong to variable numbers of categories,
RinSCut variants that combine the strengths of previously investigated thresholding
strategies. Our thresholding strategies are general and so can be applied to other
similarity-based text processing tasks as well as other similarity-based learning
algorithms.
To reduce the number of labeled examples needed to achieve a given level of
performance, we developed uncertainty-based, selective sampling methods. Rather
than relying on random sampling for candidate training examples, our text
categorization system uses selective sampling methods to actively choose candidate
examples for training that are likely to be more informative than other examples. Then,
the system either presents these to the human expert to assign their correct labels, or it
assigns them its own predicted labels. We applied these uncertainty selective sampling
methods to our own classifiers. However, our sampling methods are quite general and could be used with any machine learning approach.
We conducted extensive comparative experiments on two standard test collections
(the Reuters-21578 and the 20-Newsgroups). We present the experimental results
using a standard evaluation method, F1, for micro and macro-averaged performance.
The results show that KAN and RinSCut variants work better than other widely used
techniques. They also demonstrate that our uncertainty selective sampling methods, in
most cases, outperform random sampling in terms of the number of labeled training examples needed to achieve a given level of performance. One exceptional case, in which selective sampling failed, occurred in the micro-averaged performance on the multi-class text categorization task.
Contents
1 Introduction...................................................................................
1.1 Text Categorization Process................................................................................4
1.2 Outline of the Thesis.........................................................................................12
2 Text Categorization and Previous Work..............................................................
2.1 Text Categorization ...........................................................................................15
2.1.1 A Definition of Text Categorization .....................................................15
2.1.2 Ambiguities in Natural Language Text ..................................................16
2.1.3 Knowledge Engineering versus Machine Learning Approach...............17
2.1.4 Difficulties for the Machine Learning Approach..................................18
2.2 Preprocessing.....................................................................................................18
2.2.1 Feature Extraction .................................................................................19
2.2.2 Representation ......................................................................................21
2.2.3 Feature Selection ...................................................................................22
2.3 Learning Classifiers............................................................................................25
2.3.1 Similarity-Based Learning Algorithms ..................................................26
2.3.1.1 Profile-Based Linear Algorithms .............................................26
2.3.1.2 Instance-Based Lazy Algorithm..............................................27
2.3.2 Thresholding Strategy ...........................................................................28
2.3.2.1 Rank-Based (RCut)..................................................................30
2.3.2.2 Score-Based (SCut)..................................................................30
2.3.2.3 Proportion-based Assignment (PCut) .....................................31
2.4 Active Learning..................................................................................................31
2.4.1 Uncertainty Sampling............................................................................32
2.4.2 Committee-Based Sampling ..................................................................34
2.5 Evaluation Methods ..........................................................................................34
2.5.1 Performance Measures of Effectiveness ...............................................35
2.5.2 Micro and Macro Averaging .................................................................39
2.5.3 Data Splitting ........................................................................................41
3 Keyword Association Network (KAN)...............................................................
3.1 Objectives and Motivation................................................................................44
3.2 Overall Approach..............................................................................................47
3.2.1 Constructing KAN .................................................................................48
3.2.2 Relationship Measure ...........................................................................51
3.2.3 Discriminating Power Function.............................................................52
3.2.4 Applying KAN to Text Categorization .................................................55
3.3 Computational Complexity...............................................................................57
4 RinSCut: New Thresholding Strategy................................................................
4.1 Motivation.........................................................................................................59
4.2 Desired Properties.............................................................................................61
4.3 Overall Approach..............................................................................................62
4.3.1 Defining Ambiguous Zone ....................................................................62
4.3.2 Defining RCut Threshold.......................................................................67
5 Evaluation I: KAN and RinSCut.....................................................................
5.1 Data Sets Used ..................................................................................................68
5.1.1 Reuters-21578.......................................................................................70
5.1.2 20-Newsgroups.....................................................................................75
5.2 Text Preprocessing ............................................................................................78
5.2.1 Feature Extraction .................................................................................78
5.2.2 Feature Weighting..................................................................................79
5.2.3 Feature Selection ...................................................................................81
5.3 Experiments on the Number of Features...........................................................81
5.3.1 Experimental Setup ...............................................................................82
5.3.2 Results and Analysis.............................................................................85
5.4 Experiments on the Number of Training Examples...........................................93
5.4.1 Experimental Setup ...............................................................................93
5.4.2 Results and Analysis.............................................................................95
6 Learning with Selective Sampling ....................................................................
6.1 Goal and Issues................................................................................................110
6.1.1 Homogeneous versus Heterogeneous Approach.................................111
6.1.2 Using Positive-Certain Examples for Training....................................111
6.2 Overall Approach............................................................................................113
6.2.1 Computing Uncertainty Values...........................................................114
6.2.2 Defining Certain and Uncertain Documents with RinSCut .................115
7 Evaluation II: Uncertainty Selective Sampling ..............................................
7.1 Data Sets Used and Text Processing ...............................................................118
7.2 Classifiers Implemented and Evaluated...........................................................119
7.3 Sampling Methods Compared.........................................................................120
7.4 Results and Analysis.......................................................................................121
8 Conclusions.....................................................................................
8.1 Contributions...................................................................................................137
8.2 Future Work ....................................................................................................139
Appendices.................................................................................................................142
Appendix A: Stop-list ................................................................................................143
Bibliography ...............................................................................................................148
List of Tables
2.1  Contingency table for category ci. .................................................................36
2.2  Global contingency table for category set C ..................................................41
5.1  The 53 categories of the Reuters-21578 data set used in our experiments (part 1) .......................72
5.2  The 53 categories of the Reuters-21578 data set used in our experiments (part 2) .......................73
5.3  The 53 categories of the Reuters-21578 data set used in our experiments (part 3) .......................74
5.4  The 20 categories of the 20-Newsgroups corpus ...........................................75
5.5  Statistics for the unique features in the Reuters-21578 corpus ......................83
5.6  Statistics for the unique features in the 20-Newsgroups corpus ....................84
5.7  The percentage of training data and the number of training documents used in each round. .......94
5.8  The best micro-averaged F1 and its classifier in each round on the Reuters-21578 corpus. ........106
5.9  The best macro-averaged F1 and its classifier in each round on the Reuters-21578 corpus. ........107
List of Figures
1.1  Initial-learning model for text categorization. ..................................................6
1.2  Categorization and learning model for text categorization. ..............................8
1.3  Learning with selective sampling. ...................................................................10
2.1  The per-document and per-category thresholding strategies. .........................29
2.2  Uncertainty sampling algorithm. .....................................................................33
3.1  An example of the similarity computation with semantic and informative ambiguities. ...............45
3.2  An example for constructing KAN. ..................................................................50
3.3  Algorithm for generating frequent 2-feature sets F2. ......................................50
3.4  Network representation generated from the example. ....................................51
3.5  Graphs of S against df, when λ = 0.35, 0.50, and 0.65. ..................................54
3.6  An example of network representation for handling annotated training examples. .....................57
4.1  Comparability of similarity scores and thresholding strategies. .....................60
4.2  Ambiguous zone between ts_top(ci) and ts_bottom(ci) for a given category ci. ..........................64
4.3  The locations of ts_max_top(ci), ts_min_top(ci), ts_max_bottom(ci), ts_min_bottom(ci), mod_ts_top(ci), and mod_ts_bottom(ci) in the ordered list of similarity scores. ......66
5.1  An example of the Reuters-21578 document (identification number 9, assigned to the earn category and used for the training set). ......71
5.2  An example of the 20-Newsgroups document (assigned to the alt.atheism newsgroup). ................78
5.3  The resulting tokenized file after the feature extraction for the example document in Figure 5.1. ......79
5.4  The TFIDF weighting scheme based on the term frequency and document frequency file. ...............80
5.5  F1 performance of Rocchio on the Reuters-21578 corpus (6,984 training documents used). .............86
5.6  F1 performance of k-NN on the Reuters-21578 corpus (6,984 training documents used). ................87
5.7  F1 performance of WH on the Reuters-21578 corpus (6,984 training documents used). ..................87
5.8  F1 performance of KAN on the Reuters-21578 corpus (6,984 training documents used). .................88
5.9  F1 performance of Rocchio on the 20-Newsgroups corpus (all the training documents used in each split). ......90
5.10  F1 performance of k-NN on the 20-Newsgroups corpus (all the training documents used in each split). ........91
5.11  F1 performance of WH on the 20-Newsgroups corpus (all the training documents used in each split). .........91
5.12  F1 performance of KAN on the 20-Newsgroups corpus (all the training documents used in each split). ........92
5.13  Micro-averaged F1 of each classifier on the Reuters-21578 corpus with SCut. .............................97
5.14  Macro-averaged F1 of each classifier on the Reuters-21578 corpus with SCut. ............................97
5.15  Micro-averaged F1 of each classifier on the 20-Newsgroups corpus with RCut. ............................99
5.16  Macro-averaged F1 of each classifier on the 20-Newsgroups corpus with RCut. ............................99
5.17  Micro-averaged F1 of each classifier on the 20-Newsgroups corpus with RCut (with “truly random sampling”). ......100
5.18  Macro-averaged F1 of each classifier on the 20-Newsgroups corpus with RCut (with “truly random sampling”). ......100
5.19  Micro-averaged performance comparison of SCut and RinSCut variants on Rocchio (the Reuters-21578 corpus used). ......102
5.20  Macro-averaged performance comparison of SCut and RinSCut variants on Rocchio (the Reuters-21578 corpus used). ......102
5.21  Micro-averaged performance comparison of SCut and RinSCut variants on WH (the Reuters-21578 corpus used). ......103
5.22  Macro-averaged performance comparison of SCut and RinSCut variants on WH (the Reuters-21578 corpus used). ......103
5.23  Micro-averaged performance comparison of SCut and RinSCut variants on k-NN (the Reuters-21578 corpus used). ......104
5.24  Macro-averaged performance comparison of SCut and RinSCut variants on k-NN (the Reuters-21578 corpus used). ......104
5.25  Micro-averaged performance comparison of SCut and RinSCut variants on KAN (the Reuters-21578 corpus used). ......105
5.26  Macro-averaged performance comparison of SCut and RinSCut variants on KAN (the Reuters-21578 corpus used). ......105
6.1  Flow of unlabeled documents in our selective sampling. ......113
6.2  Definition of certain and uncertain examples using ts_top(ci) and ts_bottom(ci) for a given category ci. ......116
7.1  Micro-averaged F1 of KAN+RCut on the 20-Newsgroups (RS: random sampling, SS-U: selective sampling of uncertain examples, SS-U&C[1000]: selective sampling of uncertain and certain examples [1,000 certain examples], SS-U&C[500]: selective sampling of uncertain and certain examples [500 certain examples]). ......122
7.2  Macro-averaged F1 of KAN+RCut on the 20-Newsgroups (RS: random sampling, SS-U: selective sampling of uncertain examples, SS-U&C[1000]: selective sampling of uncertain and certain examples [1,000 certain examples], SS-U&C[500]: selective sampling of uncertain and certain examples [500 certain examples]). ......123
7.3  Micro-averaged F1 of KAN+GRinSCut on the Reuters-21578 (RS: random sampling, SS-U: selective sampling of uncertain examples, SS-U&C [500]: selective sampling of uncertain and certain examples [500 certain examples], SS-U&C [250]: selective sampling of uncertain and certain examples [250 certain examples]). ......125
7.4  Macro-averaged F1 of KAN+GRinSCut on the Reuters-21578 (RS: random sampling, SS-U: selective sampling of uncertain examples, SS-U&C [500]: selective sampling of uncertain and certain examples [500 certain examples], SS-U&C [250]: selective sampling of uncertain and certain examples [250 certain examples]). ......126
7.5  Micro-averaged F1 of KAN+LRinSCut on the Reuters-21578 (RS: random sampling, SS-U: selective sampling of uncertain examples, SS-U&C [500]: selective sampling of uncertain and certain examples [500 certain examples], SS-U&C [250]: selective sampling of uncertain and certain examples [250 certain examples]). ......127
7.6  Macro-averaged F1 of KAN+LRinSCut on the Reuters-21578 (RS: random sampling, SS-U: selective sampling of uncertain examples, SS-U&C [500]: selective sampling of uncertain and certain examples [500 certain examples], SS-U&C [250]: selective sampling of uncertain and certain examples [250 certain examples]). ......128
Chapter 1
Introduction
The work in this thesis explores supervised and semi-supervised machine learning
approaches to text categorization: an algorithm that exploits word co-occurrence information and discriminating power values, new approaches to thresholding, and the
important goal of reducing the number of labeled training examples.
The amount of textual information that is stored digitally in electronic forms is
huge and increasing. With the so-called information overload problem, caused by the growing availability and heavy use of such electronic textual information, there has been increasing interest in tools that can help organize and describe the large amount of online textual information for later retrieval and use. One successful and directly applicable paradigm for helping users make good and quick selections of textual information of interest is to classify documents according to their topics. The main text classification tasks, which are usually considered distinct, are information retrieval, text categorization, information filtering, and document clustering [Lew92a]. However, the boundaries between them are not sharp, as all involve grouping of
documents based on their contents. Even though most machine learning methods in this
research have been developed for text categorization, they are quite general and
applicable to other text classification tasks that are usually based on some similarity
(or distance) measures between documents.
Text categorization is, simply, the task of automatically assigning arbitrary
documents to predefined categories (topics or classes) based on their content. There
are two different approaches to text categorization. One approach assigns each document to a single category, the one it appears to fit best. The other approach, also studied in this thesis, allows a document to be assigned to all categories that it matches well. Traditionally, human experts, who are knowledgeable about the
categories, conduct text categorization manually. While making it possible to categorize
documents, based on their semantic content and the user’s own conceptual model, this
manual approach requires substantial human resources to read each document and
decide its appropriate categories. As a result, the number of classified documents tends
to be very small. This is generally a severe obstacle for the success of fully manual text
categorization systems.
Automatic approaches to text categorization involve the construction of a
categorization scheme, which is used for categorizing previously unseen documents. A
categorization scheme consists of the knowledge base (i.e., a set of classifiers for all
predefined categories) that captures the concepts describing the categories. In a
supervised learning approach, the knowledge base is learned from natural language texts
with category information (i.e., labels).
Most natural languages are ambiguous and usually yield a large number of input
features (words, terms, tokens, or attributes) to the supervised learning process. These
characteristics of natural language texts make the target concepts complex. This means
the learning process, if it is to give an accurate knowledge base, requires a large number
of labeled examples. In many practical situations, such a large number of labeled
documents for training is not readily available, since manually labeling them is such a
big burden on human experts. This bottleneck is a critical problem for an automatic
approach.
Also, a large number of training documents may cause a delay in the incorporation
of new training examples into an existing knowledge base. This follows from the time required for an extensive learning process over the existing training documents with new
ones added. It is, therefore, important for the success of a text categorization system to
develop machine learning methods that automate the text categorization task and give
reasonable effectiveness with fewer labeled training documents. The need for such a
text categorization system becomes more apparent with the efforts to automatically
classify personal textual information (for example, electronic mail). In this case, the
system must respond quickly and the small number of labeled documents may be
insufficient for effective learning.
In this research, our main goal is to develop supervised and semi-supervised
learning methods for an automated text categorization system that can achieve good
performance with a small set of labeled examples. To achieve this goal, we address
two learning tasks:
1. Constructing an accurate knowledge base with a small set of training
examples.
Various supervised machine learning techniques have been successfully
applied to text categorization and their effectiveness has usually been
evaluated over the full set of available training examples. By contrast,
relatively little research has been conducted on learning knowledge bases
from a small number of training examples.
Besides effectiveness, another important aspect of a new learning
method is its efficiency in time and space. There are many situations where
the text categorization system should respond quickly to users’
information needs and any feedback they provide. If a text categorization
system requires long processing times in the incorporation of new training
examples into an existing knowledge base, this could be a critical factor
against that categorization system.
2. Determining the most informative unlabeled examples for training, based on
the knowledge base learned from existing labeled examples.
Reduction of the number of training documents needed for achieving a
given level of text categorization performance can be accomplished by
selecting, for labeling and training, only the most informative examples.
Selective sampling (or active learning) [CAL94], as opposed to random
sampling, refers to any learning method involving active selection of
candidate training documents. It selects only the most informative
documents, by filtering a stream of unlabeled examples. This filtering
process is usually based on the estimated uncertainty values of unlabeled
examples. Uncertainty of a document under a given category is measured
by comparing the document’s similarity score with the category’s
threshold. In previous semi-supervised approaches to uncertainty selective
sampling, the most uncertain examples are selected. These are the
documents with similarity scores closest to the category’s threshold
[LC94]. Then, these approaches ask human experts to label the selected documents. In this thesis, we
explore the possibility that the most positive-certain documents could
have benefits for training. If using such positive-certain documents leads to
performance improvements, a further advantage, central to the goal of this
thesis, is that they do not require human involvement for labeling: they are
labeled automatically by the system.
A knowledge base (or classifiers) used for uncertainty measurement can be
the same (homogeneous approach) or different (heterogeneous approach)
from the knowledge base used for categorizing new documents. The main
reason for adopting a heterogeneous approach is that the type of
knowledge base used for categorization is too computationally expensive
to build and use for selective sampling [LC94]. So, it is highly desirable,
rather than constructing a new knowledge base, to use the available
classifiers if they are accurate and fast enough for uncertainty selective
sampling.
In this research project, we develop and investigate machine learning methods for
text categorization that give good performance with small numbers of labeled training
documents. Our supervised learning approaches are effective and not computationally expensive with respect to the number of input features. Our semi-supervised learning
approaches to uncertainty selective sampling directly use the same type of classifiers
for determining which of a set of unlabeled examples are most useful for training. In the
following subsections, we examine our text categorization process in more detail and
outline this thesis.
1.1 Text Categorization Process
The main task of a machine learning approach to text categorization is to automatically
build a knowledge base that can be used for assigning previously unseen documents to
appropriate categories. Because of ambiguities in natural languages and complex
concepts of categories, learning an accurate knowledge base generally requires a large
number of training documents that are manually labeled. In the real world, however,
obtaining these is rarely practical and is sometimes impossible. This problem provides
our motivation for developing a text categorization system that achieves good
categorization results with small numbers of labeled training examples, quickly
incorporates newly labeled examples into the existing categorization scheme, and, as a
result, allows a homogeneous approach to the uncertainty selective sampling of candidate
training examples. This section describes the text categorization process in our system
and briefly examines typical techniques.
Initial-Learning
Figure 1.1 shows the initial-learning model for constructing the initial knowledge base. This is intended to operate with a small set of labeled examples. The resulting initial knowledge base can be used for predicting categories of new documents and for
uncertainty selective sampling of informative examples for future training. In this
model, a small number of unlabeled raw documents is randomly selected. These are presented to human experts, who assign their actual labels. Then, in the preprocessor, these
labeled documents are transformed to a representation that is readable and suitable for
a machine learning algorithm, in the ‘learner’ box of Figure 1.1.
The common representation adopted by most machine learning algorithms is the
vector space model (“bag-of-words”) [Sal91]. In this representation method, each
document is represented by a list of extracted features (words, terms, tokens, or
attributes). Their associated weights are computed from either the existence or the
frequency of each feature in a given document. Various techniques from natural
language processing can be applied to extract informative features. The common techniques used for feature extraction are the stop-list (or negative dictionary, as it is referred to in [Fox90]) to remove common words such as the, a, of, to, is, and was, and word stemming to reduce different words to their common stem (for example, children and childhood → child). Then, each feature can be weighted by a Boolean value indicating whether the feature appears in a given document, or by a numeric weight derived from its frequency of occurrence. Applying more sophisticated text processing techniques, such as part-of-speech tagging [Bri92, Bri95] and n-gram (phrase)
generation [SP97, BT98], may also improve performance. However, there is a trade-off
between their substantial preprocessing time and small benefits in text categorization
[SM99].
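
To make these preprocessing steps concrete, the following is a minimal Python sketch of the bag-of-words transformation. The stop-word fragment and the crude suffix-stripping rule are illustrative placeholders only; they stand in for the actual stop-list (Appendix A) and stemmer used in this thesis.

import re
from collections import Counter

# Illustrative stop-list fragment; the thesis uses a much longer list (Appendix A).
STOP_WORDS = {"the", "a", "of", "to", "is", "and", "was", "in", "for"}

def crude_stem(token):
    """Very rough suffix stripping, standing in for a real stemmer."""
    for suffix in ("hood", "ren", "ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def bag_of_words(text):
    """Turn a raw document into a term-frequency vector (a 'bag of words')."""
    tokens = re.findall(r"[a-z]+", text.lower())           # split at punctuation and whitespace
    tokens = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(tokens)                                  # feature -> frequency weight

print(bag_of_words("The children read the biscuits catalogue to the dog."))
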
Figure 1.1: Initial-learning model for text categorization (randomly selected unlabeled raw documents are labeled by the human expert, transformed by the preprocessor into document representations with labels, and passed to the learner, which builds the knowledge base).
One of the common characteristics of natural languages is their high dimensionality
in the feature space. Even for a moderately sized corpus, the categorization task may
be confronted with several thousand features and tens of predefined categories. The
computational cost of learning a knowledge base for problems of this huge size is
prohibitive for most machine learning algorithms. Feature selection, which is considered a preprocessing step, has been successfully applied to reduce this huge
feature space without a loss of categorization performance [YP97]. It eliminates many
words that appear evenly across categories, as being uninformative. Previous work in
feature selection [KS96, MG99] has shown that one can achieve a significant
performance improvement by applying feature selection methods rather than using the
full feature set. An important issue in applying any feature space reduction method is
to define how many features should be chosen for each category. Finding an optimal
feature set size is affected by the characteristics of text data and the chosen machine
learning algorithm. Therefore, a good choice typically requires many experiments on a
variety of feature set sizes, to evaluate the effectiveness of classifiers.
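
As an illustration of this step, the sketch below ranks candidate features per category using a simple document-frequency contrast score. This scoring function is only a stand-in for measures such as information gain or chi-square, and is not the feature selection method evaluated in later chapters.

from collections import defaultdict

def select_features(docs, labels, k):
    """Keep the k highest-scoring features per category.

    docs:   list of feature -> weight dicts (as produced by a preprocessor)
    labels: list of category names, one per document
    The score contrasts how often a feature occurs inside a category with how
    often it occurs outside; evenly spread features score near zero.
    """
    df_in = defaultdict(lambda: defaultdict(int))   # category -> feature -> document frequency
    df_all = defaultdict(int)                       # feature -> overall document frequency
    n_cat = defaultdict(int)                        # category -> number of documents
    for doc, cat in zip(docs, labels):
        n_cat[cat] += 1
        for f in doc:
            df_in[cat][f] += 1
            df_all[f] += 1

    n_docs = len(docs)
    selected = {}
    for cat in df_in:
        def score(f):
            p_in = df_in[cat][f] / n_cat[cat]
            p_out = (df_all[f] - df_in[cat][f]) / max(n_docs - n_cat[cat], 1)
            return p_in - p_out
        selected[cat] = sorted(df_in[cat], key=score, reverse=True)[:k]
    return selected
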
A set of training data in the reduced representations, with category information, is
presented to the learner. The task of the learner is to analyze the set of training examples
and to construct classifiers that predict the categories of new documents. It typically
generates one classifier for each category and each classifier has an associated threshold
value.
A wide range of machine learning algorithms have been developed and applied to
learn classifiers (see, for example, [ADW94, BSA94, LR94, Yan94, WPW95, LSCP96,
MRG96, Joa97, CH98, Joa98, MN98, YG98, RS99]). One general group of learning
algorithms that has shown good text categorization performance is similarity-based
[MG96, Yan99, HK00]. In this group of algorithms, mapping from a new document to
a particular category is based on the comparison of the category’s threshold and the
similarity score for the new document. The threshold values in the classifiers are set by applying a thresholding strategy, which is usually separate from the learning algorithm; the learning algorithm itself is used only for the similarity computation between a new document and each category. With a small number of training examples available, an
important issue in the initial-learning-model is that the resulting knowledge base must
be accurate enough to be used for both categorization and subsequent uncertainty
selective sampling. As shown in Figure 1.1, this learning process involves and relies
upon two critical steps: the document representation, based on feature selection, and
the learner (machine learning algorithm and thresholding strategy).
Categorization and Learning
We now describe the process of text categorization where there is user feedback
that initiates the learning process. Once initial-learning has been completed, the
resulting knowledge base can be used to predict the categories of a new document. The
results of prediction are presented to the human expert and then, if there is any feedback
on this prediction, learning is initiated to update the current knowledge base. This
categorization and learning model is described in Figure 1.2.
Figure 1.2: Categorization and learning model for text categorization (real-world raw documents are transformed by the preprocessor into document representations; the predictor uses the knowledge base to produce predicted categories for each document; feedback from the human expert is passed to the learner, which updates the knowledge base).
Like the raw documents in the initial learning model of Figure 1.1, any real-world
document first needs to be transformed to a representation in the preprocessor. To
make the binary decision between a document and each category, the predictor
computes similarity scores for every document-category pair, based on the document
representation and classifiers. The decision on the categories for the documents is made
by thresholding these similarity scores. A wide range of measures can be used for the
similarity computation in the predictor (See [Har75, Dam95] for some possible
similarity measures). One of the most widely used similarity measures is the cosine of
the angle between two vectors, computed as the inner product of two normalized
vectors [Sal89]. We choose this measure since it has performed well throughout the text categorization literature, and exploration of this aspect is not the core of our research.
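
For concreteness, a minimal sketch of this cosine measure over sparse feature-weight dictionaries is given below; the dictionary representation is an assumption made here for illustration only.

import math

def cosine(doc, profile):
    """Cosine of the angle between two sparse weight vectors, computed as the
    inner product of the two length-normalized vectors."""
    dot = sum(w * profile.get(f, 0.0) for f, w in doc.items())
    norm_d = math.sqrt(sum(w * w for w in doc.values()))
    norm_p = math.sqrt(sum(w * w for w in profile.values()))
    if norm_d == 0.0 or norm_p == 0.0:
        return 0.0
    return dot / (norm_d * norm_p)

# e.g. cosine({"wheat": 2.0, "export": 1.0}, {"wheat": 0.8, "grain": 0.5}) is about 0.76
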
Any feedback from human experts should be incorporated quickly into the current
knowledge base to make it more accurate. As well as the simple category information
on new documents, the expert may modify the predefined categories: adding new
categories or deleting some predefined categories. Some approaches, such as [RMW95,
SMB96, NH98], allow human experts to annotate documents by indicating or writing some of the important terms in a given document. The creation and manipulation of such
annotated documents is an important area in the field of text categorization, but is
outside the scope of this thesis.
Learning with Selective Sampling
Figure 1.3 shows another type of learning model – learning with selective sampling
– in our system. Rather than relying on random sampling of documents for labeling and
training as in Figure 1.1, this model uses a sampler to automatically select the candidate
training examples. Among the selected informative examples, this model presents
uncertain examples to the human expert for labeling and directly uses positive-certain
examples without manual labeling. The expected result is a reduction in the number of labeled training examples needed for a given level of text categorization performance, compared with that required by random sampling. This offers the potential of improved
performance in our target area, where we would like to achieve a desired performance
level with fewer examples than random sampling requires.
Figure 1.3: Learning with selective sampling (the sampler filters unlabeled documents into informative raw documents, which are either manually labeled by the human expert or automatically labeled; the preprocessor produces document representations with labels for the learner, which maintains knowledge base 1, used by the predictor; the sampler is homogeneous if it uses knowledge base 1 and heterogeneous if it uses a separate knowledge base 2).
In this research, selective sampling is based on the uncertainty of the
categorization for a document [LG94]. This is measured by comparing the estimated
similarity score and threshold. To make a distinction between uncertain and positive-certain examples, we design a new scheme based on our own thresholding strategy,
RinSCut.
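
The sketch below illustrates the classic form of uncertainty selection described above, picking the documents whose similarity scores fall closest to a per-category threshold. The score function is passed in as a parameter (for example, the cosine measure sketched earlier); this does not reflect the RinSCut-based scheme developed in Chapter 6.

def most_uncertain(unlabeled, classifiers, thresholds, score_fn, batch_size):
    """Classic uncertainty sampling: select the unlabeled documents whose
    similarity score lies closest to the corresponding category threshold.

    unlabeled:   list of document representations
    classifiers: dict mapping category -> category profile (e.g. a weight vector)
    thresholds:  dict mapping category -> similarity-score threshold
    score_fn:    function (document, profile) -> similarity score
    """
    def uncertainty(doc):
        # Smallest distance to any category's decision boundary; a small
        # distance means the current classifiers are least certain about doc.
        return min(abs(score_fn(doc, classifiers[c]) - thresholds[c])
                   for c in classifiers)
    # The most informative documents are those with the smallest distances.
    return sorted(unlabeled, key=uncertainty)[:batch_size]
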
Computing the uncertainty value of a document under a given category requires a
classifier for that category. Since the number of unlabeled documents is huge, the classifier
for uncertainty selective sampling should be cheap to build and use. As shown in
Figure 1.3, the uncertainty measure in the sampler can be based on the same type of
knowledge base, knowledge base 1 in Figure 1.3, as the one used for categorization
(homogeneous) or a different type of knowledge base, knowledge base 2,
(heterogeneous). As mentioned earlier, the heterogeneous approach is preferred if the
homogeneous approach is computationally expensive. However, building a separate
type of knowledge base for selective sampling imposes another cost to the system.
Moreover, a poor-quality knowledge base that is cheap to run may yield unreliable uncertainty values for documents. The effectiveness of the classifiers that are used for
selective sampling might be an important factor for successful selective sampling. As a
result, if the type of knowledge base used for categorization is accurate and not
expensive to run, a homogeneous approach is preferred to a heterogeneous approach.
Another consideration in uncertainty selective sampling is the use of the most positive-certain documents without human expert labeling. Unlike uncertain
documents, positive-certain examples are ones to which the text categorization system can, with some confidence, assign category membership. So, they can be used for
training without the category labels from human experts. This suggests that the
sampler in this extended learning model should have the ability to distinguish between
uncertain and positive-certain examples. Previous approaches to selective sampling
continue identifying candidate training documents until either there are no more
unlabeled documents or the human expert is unwilling to label more examples. In other
words, they have no stop-point mechanism to stop selecting examples in a given
category. If the sampler keeps choosing training examples for a given category after the
target concept description is learned, it may waste the human expert’s time. The
positive-certain documents could be less informative than uncertain ones for training.
However, using them for training could lead to performance improvements and, more
importantly, it does not require any human labeling cost.
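
A rough sketch of the routing logic implied here is given below. It assumes two per-category thresholds named ts_top and ts_bottom (following the notation of Figures 4.2 and 6.2) and a score function passed in as a parameter; the actual definition of these thresholds comes from RinSCut and is developed in later chapters, so this is an illustration rather than the thesis's method.

def route_documents(unlabeled, classifiers, ts_top, ts_bottom, score_fn):
    """Split unlabeled documents into positive-certain and uncertain pools.

    A document scoring at or above ts_top[c] for some category c is treated as
    positive-certain and is given c as an automatic label; a document whose
    score falls inside the ambiguous zone [ts_bottom[c], ts_top[c]) for some
    category is routed to the human expert for manual labeling.
    """
    auto_labeled, to_expert = [], []
    for doc in unlabeled:
        scores = {c: score_fn(doc, classifiers[c]) for c in classifiers}
        certain_cats = [c for c, s in scores.items() if s >= ts_top[c]]
        uncertain = any(ts_bottom[c] <= s < ts_top[c] for c, s in scores.items())
        if certain_cats:
            auto_labeled.append((doc, certain_cats))   # system-assigned labels
        elif uncertain:
            to_expert.append(doc)                      # needs a manual label
    return auto_labeled, to_expert
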
1.2 Outline of the Thesis
The main goal of our text categorization system is to achieve good text
categorization performance with smaller numbers of labeled training examples. Because
natural languages are unstructured and target concepts of predefined categories are
complex, learning accurate classifiers requires a large number of labeled examples. The
labeling process needs human involvement. So, labeling such a large number of
documents is time-consuming, tedious, and sometimes error-prone [HW90, ADW94,
VM94]. This thesis explores machine learning approaches which offer the promise of
significantly reducing this labeling cost, operating on a smaller number of labeled
training examples to construct the classifiers that are accurate and cheap enough to use
as a foundation for uncertainty selective sampling.
In this research project, we begin by looking at previous work in text
categorization. We focus on similarity-based approaches that have shown good
categorization results. These investigated results, reported in Chapter 2, highlight the
important properties our text categorization system should have.
In Chapter 3, we introduce the Keyword Association Network (KAN), a new machine learning algorithm that we have developed and applied to compute the similarity scores of new documents. It promises increased accuracy because it attempts to resolve their
ambiguities. KAN is, essentially, a framework for exploiting word co-occurrence
statistics and each feature’s discriminating power in the similarity computation. The
promise of this approach is that such statistical information might provide a
reasonably accurate initial-classifier with small numbers of labeled training examples.
Also, in Chapter 4, we introduce a new thresholding strategy, rank-in-score
(RinSCut), to find the optimal thresholds for the categories. This has been one of the
relatively unexplored research areas in text classification, even though it has significant
impact on the overall effectiveness of results. RinSCut is designed to combine strengths
of two common thresholding strategies, rank-based (RCut) and score-based (SCut). It is
designed to make an online decision on new documents, reduce the risk of overfitting to
a small number of training examples, and give thresholds by optimizing both local and
global performance.
In Chapter 5, we use two standard test collections, Reuters-21578 [R21578] and
20-Newsgroups [20News], for comparative experiments. First, we perform experiments to tune feature selection and find a suitable feature space size for each learning algorithm. Then, we assess KAN and RinSCut against other widely used
similarity-based learning algorithms and thresholding strategies. We show the
effectiveness and efficiency of our approaches for building classifiers and for use in
uncertainty selective sampling. KAN’s efficiency derives from a reduced feature space
that gives performance improvements over the full feature set. The effectiveness of
RinSCut is tested on the Reuters-21578 corpus, since it is designed for multi-class text
categorization problems.
Chapter 6 presents uncertainty selective sampling methods based on our new
approach, RinSCut. We report several comparative experiments in Chapter 7 to show
the effectiveness of our homogeneous approach for selective sampling methods. Then,
we finish this thesis with conclusions and possible directions for future work in
Chapter 8.
Chapter 2
Text Categorization and Previous Work
Text categorization is the task of assigning previously unseen documents to
appropriate predefined categories. The supervised machine learning approach makes
this automatic, by learning classifiers from a set of training examples. For most
supervised learning algorithms, building accurate classifiers needs a large volume of
manually labeled examples. This manual labeling process is time-consuming, expensive,
and will have some level of inconsistency [HW90, ADW94, VM94]. This problem
motivates our work towards a text categorization system that can achieve a
satisfactory level of performance with fewer training examples.
Many machine learning algorithms have been developed and applied to the
construction of classifiers. They can usually be grouped into rule-based, probability-based, and similarity-based learning algorithms. This thesis focuses on the similarity-based approach. This builds upon the large volume of previous work in the area of text
categorization that has adopted this approach. We chose the similarity-based approach
as a foundation for building more accurate classifiers because it offers the possibility of
exploring statistical information that may capture the target concepts hidden in
documents.
As shown in Chapter 1, our text categorization system is complex in that it
consists of multiple models and multiple phases in each model. Each phase has a huge
impact on the others; an unsatisfactory partial result or delay in processing in any one
phase may result in the failure of the overall text categorization process.
In this chapter, we describe various techniques for each phase of similarity-based
text categorization systems. This establishes what has been previously achieved and
also gives an indication of desirable properties for our text categorization system.
2.1 Text Categorization
In this section, we give an overview of text categorization. We first define the text
categorization task and discuss ambiguities in most natural languages on which
classifiers should be built. Then, we discuss two general approaches, “knowledge
engineering” and “machine learning”, to the construction of classifiers and why we are
focusing on a machine learning approach. This section also describes characteristics of
the domain of text categorization that make this task difficult for a machine learning
approach.
2.1.1 A Definition of Text Categorization
Text categorization (also known as text classification) is the automated assignment
of natural language text to appropriate thematic categories, based on its content. A set
of categories is predefined manually. This task can be formalized as binary
categorization: to determine a Boolean value b ∈ {T, F} for each pair (dj, ci) ∈ D × C,
where D is a domain of documents and C = {c1, c2, … , c|C|} is a set of predefined
categories. The value T assigned to (dj, ci) indicates a decision to assign dj to ci, while
the value F indicates a decision not to assign dj to ci.
A function Φ : D × C → {T, F} that describes how documents might be categorized is called the classifier (also known as a hypothesis, model, or rule). The similarity-based classifier that this thesis is concerned with typically includes a weight vector w and a threshold τ for each category. For a given input vector v for document dj ∈ D, the assignment decision is b = T if w⋅v ≥ τ, and b = F if w⋅v < τ.
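
As a minimal illustration of this decision rule, the following sketch evaluates one (document, category) pair over sparse vectors; the dictionary representation and the name tau are assumptions made for illustration only.

def assign(v, w, tau):
    """Binary assignment decision for one (document, category) pair:
    returns True (T) if the inner product of the document vector v and the
    category's weight vector w reaches the threshold tau, False (F) otherwise."""
    score = sum(weight * w.get(feature, 0.0) for feature, weight in v.items())
    return score >= tau
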
Two different types of text categorization task can be identified depending on the number of categories that could be assigned to each document. The first type, in which exactly one category is assigned to each dj ∈ D, is regarded as the single-class (or non-overlapping categories) text categorization task. The second type, in which any number of categories from zero to |C| may be assigned to each dj ∈ D, is called the multi-class (or overlapping categories) task [Seb02]. A special type of multi-class text
categorization is one where each document is assigned to the same number k, where k >
1, of categories. The answer to the question of which type of text categorization
should be adopted for a given text categorization system depends on the application
and characteristics of the corpus.
2.1.2 Ambiguities in Natural Language Text
In most content-based text classification systems, an important issue is how they
can capture the meaning of the natural language texts. Obtaining accurate classifiers
requires the system to understand the natural languages at some level. Understanding
natural languages, however, is a difficult task due to ambiguities in them:
1. The same sentence may have different meanings. For example, consider a
sentence like “Salespeople sold the dog biscuits” (an example from
[Cha97]). This sentence can be interpreted in two different ways: (1) the
salespeople are selling the dog-biscuits and (2) the salespeople are selling
biscuits to dogs.
2. There is a large number of synonyms – syntactically different words
with the same or similar meanings – in natural languages. It is regarded as
good writing style not to repeatedly use the same word for expressing a
particular idea (or concept). Synonyms allow the same idea to be
expressed by different words that have a similar meaning.
3. Polysemy refers to an ambiguity where words which are spelled the same
can have different meanings in different sentences or documents. For
example, the word “bat” may mean (1) an implement used in sports to hit
the ball or (2) a flying mammal.
Resolving such ambiguities is probably beneficial to text categorization when there
are many words in common across categories, even though it may not have a huge
impact on the overall text categorization performance.
2.1.3
Knowledge Engineering versus Machine Learning Approach
There are two different ways of constructing classifiers, the function Φ : D × C →
{T, F}. They are the “knowledge engineering” and “machine learning” approaches. In the
knowledge engineering approach, human experts (including knowledge engineers and
domain experts) manually create a set of rules that correctly categorize previously
unseen documents under given categories. While allowing for semantically-oriented text
categorization, by defining controlled vocabularies which can be interpreted by the text
categorization system [BG01], manually determining such a solution imposes a
considerable workload on human experts. This makes it time-consuming and expensive.
Also, this manual approach may cause inconsistency since human experts often
disagree on the assigned categories of documents and even one person may categorize
documents inconsistently [ADW94, VM94]. As a result, the knowledge engineering approach suffers from the bottleneck of encoding large amounts of incomplete and potentially conflicting expert knowledge.
The machine learning approach to text categorization is to automatically build the
classifiers by learning the concept descriptions of the categories. One type of machine
learning, applied to text categorization, is “supervised learning”. This requires a set of
pre-labeled (pre-categorized) training documents for generating classifiers. By contrast,
“unsupervised learning” refers to the task of automatically identifying a set of
categories from a set of unlabeled documents and grouping these unlabeled documents
under these identified categories [Mer98]. This task is typically called document
clustering and is sometimes confused with text categorization that is the focus of our
work.
The advantages of the machine learning approach over knowledge engineering are
the considerable reduction in the volume of work required from human experts,
consistent text categorization, and the capability of easily adjusting the generated
classifier to handle different types of documents (such as newspaper articles,
newsgroup postings, electronic mail, etc.) and even languages other than English.
2.1.4 Difficulties for the Machine Learning Approach
The unstructured format of natural language text, and the diversity of target
concepts associated with the categories, present interesting challenges to the content-based application of machine learning algorithms. The large number of input features,
that seem necessary for the construction of classifiers, overwhelms most text
categorization systems. For most machine learning algorithms, increasing the number of
features means that they have to use more training examples to obtain the same level of
text categorization performance. This large number of training examples and features
may be computationally intractable for most machine learning algorithms, by requiring
unacceptably large processing time and memory.
Of the large number of features, there are usually many features that appear in
most documents. These words can be considered irrelevant, in the sense that such
features are evenly distributed throughout documents and, as a result, have no
discriminating power. It is important for the efficiency and effectiveness of the system
to select an efficient subset of features, by removing these irrelevant ones. However, it
is a difficult task since a reasonable feature subset size might be different across the
categories and some informative features for a given category could be distributed
across several categories. For example, depending on the level of concept complexity,
some categories require a large number of features to describe their concepts while
others need a relatively small number of features. Also, informative features in the
overlapping categories might be evenly distributed across such overlapping categories
and could be considered as irrelevant ones.
2.2 Preprocessing
Text preprocessing is the basic and critical stage needed for most text classification
tasks. It transforms all raw documents to a suitable form, called a representation, that
is readable and usable by the relevant machine learning algorithms. In most similarity-based text categorization systems, a document is represented by a set of extracted
features and their associated weights.
For a content-based text categorization system to categorize documents
effectively, it needs to understand natural language text at some level by resolving
ambiguities in extracted terms as described in Section 2.1.2. Also, the large number of
irrelevant features in full feature space may cause a significant drop in the text
categorization performance.
A wide range of disambiguation and feature space reduction methods have been
proposed from various research areas, such as information retrieval, natural language
processing, and machine learning [DDFL+90, YP97, MG99]. Some methods might
have obvious advantages in effectiveness over others. However, another important
consideration is the speed of preprocessing that is an important factor in most text
categorization systems. This section describes the typical and most popular methods
that have been applied to text preprocessing.
2.2.1 Feature Extraction
Feature extraction is the process of identifying features, or types of information,
contained within the documents. It is one of the first and critical steps of almost every
text classification system. It is these extracted features that machine learning algorithms
for text categorization use to find the target concept descriptions of categories.
Feature extraction first divides documents into separate terms at punctuation and
white space. Then, more complicated text processing methods could be used to find
the more informative features. They include techniques such as removing stop-list
terms, stemming different words to common roots, and identifying phrases. The stop-list and stemming are common forms of natural language processing in most text
categorization systems, while the use of phrases for text categorization needs more
care because most phrases extracted have extremely low frequency in the corpus and
they may result in a much larger feature space.
A stop-list contains stop-words (or common words) that are not useful to keep
for learning the target concepts. During the feature extraction from documents, any
words appearing in a stop-list are removed. There is not any general theory to create a
stop-list. Choosing stop-words for a stop-list involves many arbitrary human
decisions. Two of the most common types of stop-words are:
1. words that have little semantic meaning, and
2. words whose frequency of occurrence in the corpus is greater than some level.
The second type of stop-words is domain-specific. In order to make text
categorization systems more general, we use, in this work, only the first type of stop-words. The list of stop-words used in this research is shown in Appendix A. Most
stop words in this Appendix are mainly from [SW99].
Another widely used text preprocessing technique is word stemming. In English, it
reduces different words to common roots by removing word endings or suffixes.
Stemming algorithms have been designed to handle plurals (cars _ car) and to make a
group of simple synonyms (children and childhood _ child). One of the common
stemming algorithms is the Porter’s algorithm [Por80]. Although it has been identified
as having some problems in [CX95], it gives consistent performance improvements
across a range of text classification tasks. Also, it is commonly accepted in natural
language processing that Porter’s algorithm is better than most other stemming
algorithms. This work adopts Porter’s stemming algorithm and applies it to the words
remaining after removing stop-words.
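To make this pipeline concrete, the short sketch below (an illustration only, not the exact implementation used in this work) removes stop-words and applies Porter stemming. It assumes the PorterStemmer provided by the NLTK library, and it uses a tiny hypothetical stop-list in place of the full list in Appendix A.

import re
from nltk.stem import PorterStemmer  # assumes the NLTK library is available

# Tiny stand-in for the stop-list in Appendix A.
STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
              "in", "is", "it", "of", "on", "or", "that", "the", "to", "with"}

stemmer = PorterStemmer()

def extract_features(text):
    # Split on punctuation and white space, drop stop-words, then stem the rest.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

features = extract_features("The salespeople sold the dog biscuits.")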
An interesting approach for feature extraction in all content-based text
classification tasks is the identification and use of phrases (or n-grams, which are sequences of words of length n) in addition to, or in place of, single words. A phrase
usually consists of two or more single words that occur sequentially in natural language
text. The phrase extraction can be motivated statistically or syntactically. A statistical
phrase denotes any sequence of words that occurs contiguously in a text, while a syntactic phrase refers to any phrase that is extracted based on a grammar of the language. Some
examples of syntactic phrases are noun phrases and verb phrases.
Regardless of which approach is used for phrase extraction, using phrases for text
categorization seems a feasible way to improve categorization performance, in that
• a set of words has a much smaller degree of ambiguity than its constituent individual words, and
• as a result, phrases can be a better linguistic textual unit than single words for expressing a complex concept description.
Lewis conducted a number of experiments to examine the effects of using phrases
for text categorization [Lew92a, Lew92b]. He extracted all the syntactic noun phrases
by applying part-of-speech tagging and showed that using those phrases actually gave
worse categorization performance than using only single words. The main reasons he
identified for this are the higher dimensionality in feature space and extremely low
frequency of extracted phrases in the corpus.
More recently, a number of researchers [Fur98, MG98, SSS98] have investigated
the issue of using phrases. They performed experiments on n-grams of various lengths that were extracted using term frequency. Their experimental results showed that using word sequences of length two or three increased categorization performance slightly, while longer sequences actually reduced it.
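As an illustration of statistical phrase extraction (not the exact procedure of the studies cited above), the following sketch counts contiguous word sequences of length two and three and keeps only those occurring at least a minimum number of times in the corpus.

from collections import Counter

def extract_ngrams(tokenized_docs, lengths=(2, 3), min_count=2):
    # Count every contiguous word sequence of the requested lengths.
    counts = Counter()
    for tokens in tokenized_docs:
        for n in lengths:
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    # Keep only the sequences frequent enough to be useful as features.
    return {gram: c for gram, c in counts.items() if c >= min_count}

docs = [["machine", "learning", "for", "text", "categorization"],
        ["text", "categorization", "with", "machine", "learning"]]
phrases = extract_ngrams(docs)  # e.g. ("machine", "learning"), ("text", "categorization")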
2.2.2 Representation
Machine learning algorithms need a suitable representation for raw documents.
The common representation, adopted by most machine learning algorithms, is the
vector space model or “bag-of-words” representation where a document is represented
by a vector over the set of extracted features. Each document vector corresponds to a certain point in n-dimensional space, where n is the number of features, and the value of each component is either Boolean, indicating the existence of the feature in a given document, or its frequency of
occurrence. All other information, such as the feature’s position and order, in a
document, is lost. The simplicity and effectiveness of the vector space model makes it
the most popular representation method in content-based text classification systems.
The core of similarity-based categorization is that similar documents have feature
vectors that are close in the n-dimensional space.
In the vector space model, each document dj is transformed to a vector vj = (x1, x2,
... , xn). Here, n is the number of unique features and each xk is the value of the kth
feature. In similarity-based text categorization, a feature vector is usually calculated as
a combination of two common weighting schemes:
• the term frequency, TFk, is the number of times the kth feature occurs in document dj, and
• the inverse document frequency, IDFk, is log(|A| / DFk), where DFk is the number of documents that contain the kth feature and |A| is the total number of documents in the training set A.
The role of IDF is to capture the notion that terms occurring in only a few documents
are more informative than ones that appear in many documents. Then, xk is computed
as TFk×IDFk.
Because document lengths may vary widely, a length normalization factor should
be applied to the term weighting function. The weighting equation that is used in this
work [BSAS95] is given as:
xk = [(log TFk + 1.0) × IDFk] / sqrt( ∑k=1,n [(log TFk + 1.0) × IDFk]² ),   where TFk > 0
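A minimal sketch of this weighting scheme, assuming the document frequencies DFk have already been counted over the training set, is shown below; it is our illustration of the equation above, not the code used in the experiments.

import math

def tfidf_vector(term_freqs, doc_freqs, num_train_docs):
    # term_freqs: {feature: TF in this document}; doc_freqs: {feature: DF over the training set A}.
    raw = {}
    for k, tf in term_freqs.items():
        if tf > 0 and doc_freqs.get(k, 0) > 0:
            idf = math.log(num_train_docs / doc_freqs[k])
            raw[k] = (math.log(tf) + 1.0) * idf
    # Length normalization so that long and short documents are comparable.
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {k: v / norm for k, v in raw.items()} if norm > 0 else raw

x = tfidf_vector({"apple": 3, "computer": 1}, {"apple": 20, "computer": 50}, 1000)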
The vector space model assumes that features (and their occurrences) are
independent. Ambiguities – synonymy and polysemy – in natural language text make
this assumption inaccurate. One way of resolving those ambiguities is to use the
relationships between features when giving them weights.
Latent semantic indexing (LSI) [DDFL+90] is a semantics-based representation
method to overcome ambiguity problems by using the term inter-relationships. In
information retrieval, LSI has been successful. In the similarity computation, it can exploit related co-occurring terms that do not appear in a given query, and it reduces the high dimensionality by using only a reduced k-dimensional feature space. LSI has been
used in a few text classification systems [WPW95, HPS96]. While it offers the
potential for a better representation method for resolving ambiguities of text than the
vector space model, its principal disadvantage is that it is computationally expensive.
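For illustration only (LSI is not the representation adopted in this work), a reduced k-dimensional representation can be obtained with a truncated singular value decomposition of the term-by-document matrix, for instance with NumPy:

import numpy as np

def lsi_document_vectors(term_doc_matrix, k):
    # Truncated SVD: keep only the k largest singular values and directions.
    U, s, Vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
    # Each column is a document expressed in the reduced k-dimensional space.
    return np.diag(s[:k]) @ Vt[:k, :]

A = np.array([[1.0, 0.0, 1.0],   # rows: terms
              [1.0, 1.0, 0.0],   # columns: documents
              [0.0, 1.0, 1.0]])
reduced = lsi_document_vectors(A, k=2)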
2.2.3 Feature Selection
A major problem in the text categorization task is the high dimensionality of the
feature space. Even for a moderately sized corpus, the text categorization task may
have to deal with thousands of features and tens of categories. Most sophisticated
machine learning algorithms applied to text categorization cannot scale to this huge
number of features. The learning process in such a high-dimensional feature space may require unacceptably long processing time and a large number of training documents, since not all features in the representation are relevant and beneficial for learning
classifiers. As a result, it is highly desirable to reduce the feature space without
removing potentially useful features for the target concepts of categories.
The stop-word removal and word stemming methods that were described in
Section 2.2.1 can be viewed as dimensionality reduction methods. However, the main advantage of applying these methods is the reduction in the size of each document, not in the size of the full feature set. As a result, high dimensionality of the feature space
may still exist even after applying them. They still leave the dimensionality
prohibitively large for machine learning algorithms (especially for some learning
algorithms that are trying to use the inter-relationships among features). This means
that text categorization needs more aggressive methods to reduce the size of overall
feature set.
Various feature selection (also called dimensionality reduction) methods have been
proposed and successfully applied for this aggressive reduction of the feature set
without sacrificing text categorization performance. They include document frequency,
mutual information, information gain, OddsRatio, and the χ² statistic [MG99, YP97]. The
main idea behind all feature selection methods is to eliminate many non-informative
features that appear evenly across categories. The criteria for the choice of feature
selection methods are not clear since the effectiveness of each method in text
categorization is affected by the characteristics of the test corpus and the chosen
machine learning algorithm.
A difficulty with most sophisticated feature selection methods is that they are
very time-consuming and so it is not practical or possible to perform the feature
selection process whenever new training examples are available. This high time-complexity is a critical problem for our text categorization system, in which one of the main characteristics is to quickly incorporate any new information into the current
knowledge base. However, this problem should not preclude use of feature selection
methods in the learning process: the presence or absence of a feature in the reduced
representation should not be changed frequently with the addition of a small number of
newly labeled examples.
There are two different ways of conducting dimensionality reduction, depending
on whether the method is applied locally to each category or globally to the set of all
categories. For example, in the local application of an information gain function
[Qui93] the information gain of a feature wk in a specific category ci is defined to be:
IG(wk, ci) = −[Pr(ci)·log Pr(ci) + Pr(c̄i)·log Pr(c̄i)] + ∑w∈{wk, w̄k} ∑c∈{ci, c̄i} Pr(w)·[Pr(c|w)·log Pr(c|w)]
where Pr(wk) is the probability that feature wk occurs and w̄k means that the feature does not occur. Pr(ci|wk) is the conditional probability of the category ci given that the
feature wk occurs. This equation assumes that both category and feature are binary
valued. Thus, frequency is not used as the value of features. The globally computed
information gain of wk for the category set C is the sum of local information gain
values.
IG(wk, C) = ∑i=1,|C| IG(wk, ci)
In this project, we used document frequency and information gain together for
feature selection. These methods have been widely used and regarded as effective
feature selection methods in text categorization. We removed all features occurring in fewer training documents than a small threshold, and then used information gain locally
to choose the subset of features for each category. The reason for using the local
feature selection is that if feature selection is applied globally, selected features in the
subset could be mainly from a small number of categories and, as a result, frequency
values of most features may be zero in some categories. We performed many
experiments by varying the size of feature set to find the optimal feature space that
gives the best text categorization performance.
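The sketch below illustrates this two-step selection for one category: a document-frequency cutoff followed by ranking the remaining features with the (binary-valued) information gain defined above. The helper names and the default thresholds are ours, chosen for illustration only.

import math
from collections import Counter

def binary_entropy(p):
    return -(p * math.log(p) + (1 - p) * math.log(1 - p)) if 0 < p < 1 else 0.0

def info_gain(docs, labels, feature, category):
    # docs: list of feature sets; labels: list of category sets (binary occurrence, as in the equation).
    n = len(docs)
    p_cat = sum(1 for lab in labels if category in lab) / n
    gain = binary_entropy(p_cat)                       # -[Pr(c)logPr(c) + Pr(~c)logPr(~c)]
    for present in (True, False):
        subset = [lab for d, lab in zip(docs, labels) if (feature in d) == present]
        if subset:
            p_c_given_w = sum(1 for lab in subset if category in lab) / len(subset)
            gain -= (len(subset) / n) * binary_entropy(p_c_given_w)
    return gain

def select_features(docs, labels, category, min_df=3, subset_size=100):
    df = Counter(f for d in docs for f in set(d))
    candidates = [f for f, c in df.items() if c >= min_df]        # document-frequency cutoff
    ranked = sorted(candidates, key=lambda f: info_gain(docs, labels, f, category), reverse=True)
    return ranked[:subset_size]                                   # local (per-category) feature subset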
2.3 Learning Classifiers
From representations of labeled documents, the learner in our learning model
builds a knowledge base in which a classifier for each individual category is stored. The
learner usually includes two kinds of learning algorithms since the construction of a
classifier for each category ci ∈ C consists of both the definition of a function Φi that gives an estimated categorization value for a given document dj ∈ D and the definition of a threshold τi. This classifier is then used to categorize previously unseen documents. For the assignment decision on a document dj under category ci, the classifier computes an estimated categorization value Φi(dj) and tests it against the threshold τi: dj is categorized under ci if Φi(dj) ≥ τi, and it is not categorized under ci if Φi(dj) < τi.
For the definition of Φi, a wide range of machine learning algorithms have been
developed and applied. These include rule-based induction algorithms [ADW94,
CH98, MRG96], linear learning algorithms [Joa97, LSCP96, BSA94], k-Nearest
Neighbor [YG98, Yan94], naive Bayes probabilistic algorithms [Joa97, LR94, MN98],
support vector machines [Joa98], and neural networks [WPW95, RS99]. Among them,
one general group of learning algorithms that has shown good categorization
performance is similarity-based. In this group of algorithms, a function Φi is designed to
return a similarity score for an estimated categorization value. In this research, we
adopt and investigate a similarity-based approach to text categorization. This approach
can utilize semantic information that might exist implicitly within documents and could
be extracted by analyzing statistical information about the document set. In particular,
we are interested in exploring co-location and frequency of terms. We believe that, with
a small number of training examples, this approach can give improved categorization
results.
In similarity-based text categorization, an important, but unexplored, research area
is the thresholding strategy to find the optimal thresholds for the categories. The major
focus of research in the text categorization literature has been placed on the definition of Φi, while the definition of the threshold τi has had far less attention, as finding the optimal threshold τi for a category ci is often seen as a trivial task. This could be true
for the rule-based and probability-based approaches. Actually, the rule-based
approach does not require any threshold value in the classifiers. The probability-based
algorithms have a theoretical base for analytically determining the optimal threshold
value, which is trivially 0.5. In similarity-based text categorization, there is no theoretical basis for the analytical determination of the optimal threshold, so finding an optimal threshold τi must be done empirically. This
becomes much more difficult with a small number of training examples and in cases
where there are some rare categories that have few positive training examples. The
importance of the threshold τi is no less than that of the function Φi, since the presence of any
single unreliable threshold value could downgrade overall text categorization
performance.
This section examines existing typical similarity-based learning algorithms for a
function _i and thresholding strategies for a threshold _i.
2.3.1 Similarity-Based Learning Algorithms
Essentially, the function Φi for the similarity score of a new document should be
designed to give more weight to informative terms than non-informative ones. The
similarity-based algorithms can be again grouped into two main classes: profile-based
linear learning algorithms and instance-based lazy algorithms.
2.3.1.1 Profile-Based Linear Algorithms
Some linear algorithms build a generalized profile for each target category that is in
the form of a weighted list of features [LSCP96]. A linear learning algorithm that is
trying to derive an explicit profile is called a profile-based linear algorithm. Its
advantage over the instance-based algorithms, and other sophisticated algorithms like
neural networks, is that such an explicit profile is a more understandable representation
for human experts. As a result, they can update the profile according to their
preference for terms. In a generalized profile, each feature is associated with a weight computed from a set of training examples. One of the typical profile-based
linear learning algorithms is based on Rocchio relevance feedback [Roc71]. In the
Rocchio algorithm, each category ci has a vector of the form wi = (y1, y2, ... , yn) and each
yk is computed as follows:
yk = α × [ ∑dj∈ci xk(dj) / |ci| ] − β × [ ∑dj∉ci xk(dj) / (|A| − |ci|) ]
Here, α and β are adjustment parameters for positive and negative examples, xk(dj) is
the weight of the kth feature in a document dj, |ci| is the number of positive documents
in the category ci, and |A| is the total number of documents in the training set. The
similarity value between a category and a new document is obtained as the inner
product between the two corresponding feature vectors. The problem in this classifier
is that some informative features in a rare category with a small number of positive
documents will have small weights if they appear occasionally in the negative
examples.
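A sketch of this profile construction over sparse TFIDF dictionaries is given below; the particular default values for the adjustment parameters are illustrative, not those used in [LSCP96] or in our experiments.

def rocchio_profile(train_docs, positive_ids, alpha=16.0, beta=4.0):
    # train_docs: {doc_id: {feature: weight}}; positive_ids: documents labeled with the category.
    pos = [d for i, d in train_docs.items() if i in positive_ids]
    neg = [d for i, d in train_docs.items() if i not in positive_ids]
    features = {f for d in train_docs.values() for f in d}
    profile = {}
    for f in features:
        pos_part = sum(d.get(f, 0.0) for d in pos) / len(pos) if pos else 0.0
        neg_part = sum(d.get(f, 0.0) for d in neg) / len(neg) if neg else 0.0
        profile[f] = alpha * pos_part - beta * neg_part
    return profile

def similarity(profile, doc_vector):
    # Inner product between the category profile and the document vector.
    return sum(w * doc_vector.get(f, 0.0) for f, w in profile.items())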
One of the popular profile-based linear algorithms is Widrow-Hoff (WH) [WS85].
WH is an on-line algorithm that updates weight vectors by using one training example
at a time. For each category ci, the new weight of the kth feature yk,j in the vector wj is
calculated from the old weight vector wj−1 and the vector vj of a new document dj:
yk,j = yk,j−1 − 2η(wj−1⋅vj − bj)xk,j
In this equation, wj−1⋅vj is the inner product of the two vectors, η is the learning rate parameter, bj is the category label of the new document (1 if the new document is positive and 0 if it is negative), and xk,j is the value of the kth feature in vj. Typically,
the initial weight vector w0 is set to have all zeros, w0 = (0, … , 0). In a comparative
evaluation [LSCP96], WH has shown improved performance over Rocchio.
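A minimal sketch of this on-line update over sparse vectors follows; the learning rate value is illustrative only.

def widrow_hoff_update(w, doc_vector, label, eta=0.5):
    # One WH step: w is the current profile, label is 1 (positive) or 0 (negative).
    prediction = sum(w.get(f, 0.0) * x for f, x in doc_vector.items())   # w_{j-1} . v_j
    error = prediction - label                                           # (w_{j-1} . v_j - b_j)
    for f, x in doc_vector.items():
        w[f] = w.get(f, 0.0) - 2.0 * eta * error * x
    return w

w = {}                                                                   # w_0 = (0, ..., 0)
stream = [({"apple": 0.5, "computer": 0.8}, 1), ({"grape": 0.9}, 0)]     # (vector, label) pairs
for vec, b in stream:
    w = widrow_hoff_update(w, vec, b)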
2.3.1.2 Instance-Based Lazy Algorithm
The k-Nearest Neighbor (k-NN) algorithm is an instance-based lazy learning
algorithm that operates directly on training documents. As a result, this algorithm does
not involve any pre-learning process for determining a generalized absolute weight for
each feature in the profile of a category. The main motivation of the k-NN algorithm is based
on the reasoning that a document itself has more representative power than its
generalized profile. For categorizing a new incoming document dj under a given
category ci, k-NN computes its similarity scores to all documents in the training set A.
These resulting similarity scores are then sorted into descending order. The final similarity score, Φi(dj), is the sum of the similarity scores of the positive documents of the category ci among the set of k top-ranking documents, KA:
Φi(dj) = ∑dz∈KA SM(dj, dz) × bzi
Here, bzi = 1 if dz is a positive example for ci, and bzi = 0 if dz is a negative example for ci. SM(dj, dz) represents some similarity measure between the two documents and is usually computed using the cosine function, defined as:
SM(dj, dz) = ∑k=1,n (xkj × xkz) / [ sqrt(∑k=1,n (xkj)²) × sqrt(∑k=1,n (xkz)²) ]
where xkj and xkz are the weights of the kth feature in the n-dimensional vectors of dj and dz, respectively. This is the standard form of k-NN, as in [Yan94].
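The scoring step can be sketched as follows over sparse document vectors; this is a simplified illustration of the standard form above, with an arbitrary default for k.

import math

def cosine(a, b):
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def knn_score(new_doc, train_docs, positive_ids, k=30):
    # train_docs: {doc_id: vector}; the score is the sum over positive documents among the k nearest.
    sims = sorted(((cosine(new_doc, v), i) for i, v in train_docs.items()),
                  key=lambda pair: pair[0], reverse=True)
    return sum(s for s, i in sims[:k] if i in positive_ids)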
While giving good performance in the text categorization literature [Yan99, YX99],
k-NN has several drawbacks. In k-NN, it is difficult to find the optimal k value when
training with documents that are unevenly distributed across categories. A large value
of k may work well with common categories that have many positive documents, but
it could be problematic for some rare categories that have fewer positive documents
than the k value. Also, due to the lack of generalized feature weights, some noisy
examples that have been pre-categorized wrongly by human experts may have a direct
and significant impact on the quality of the ranking. Furthermore, time complexity of
k-NN is O(m) where m is the number of training examples, since it requires the
similarity computation on every example. As a result, the processing time needed for categorizing a new document becomes quite long when the training set is large.
2.3.2 Thresholding Strategy
The last step in obtaining a mapping from a new document to relevant categories
can be achieved by threshold values that are tested against the resulting similarity
scores. In similarity-based text categorization, as discussed earlier, the optimal
threshold for each category must be derived experimentally from labeled documents.
Existing common techniques are rank-based thresholding (RCut), proportion-based
assignment (PCut), and score-based optimization (SCut). As illustrated in Figure 2.1,
RCut is the per-document thresholding strategy that compares similarity scores of
categories for a document. Both SCut and PCut, on the other hand, are per-category
strategies that compare the similarity scores of documents within a category. Yang
[Yan99, Yan01] has reviewed these techniques and summarized their extensive
evaluation on various corpora.
[Figure 2.1 depicts a matrix of similarity scores Φi(dj), with documents d1, d2, … , dn as columns and categories c1, c2, … , cm as rows. The per-document strategy (RCut) compares the scores in a document's column across categories, while the per-category strategies (SCut and PCut) compare the scores in a category's row across documents.]
Figure 2.1 The per-document and per-category thresholding strategies.
As noted in [Yan01], the thresholding strategy is an important post-processing
step in text categorization. It has a significant impact on the performance of classifiers.
Finding the optimal thresholding strategy is a difficult task that is heavily influenced
by the user interests, the characteristics of the data set, and the adopted machine
learning algorithms. This suggests that combining the strengths of the existing
thresholding strategies may be useful. This subsection describes the above three
common techniques to identify some desired properties of our new thresholding
method.
2.3.2.1 Rank-Based (RCut)
The rank-based thresholding strategy (RCut) is a per-document strategy that sorts
the similarity scores of categories for each document dj. Then, it assigns a “YES”
decision to the k top-ranking categories. The threshold k is predefined automatically
by optimizing the global performance on the training set A. The same value of k is
applied to all new documents by assuming that they may belong to the same number
of categories.
In the multi-class categorization problem, where documents may belong to a
variable number of categories, RCut usually gives good micro-averaged performance
(defined later in Section 2.5.2) since the selection of k mainly depends on some
frequent categories. However, when this globally optimized threshold k is 1 and many
rare categories have overlapping concepts with other categories (i.e., they have many
identical documents), this strategy may result in low macro-averaged performance
(defined later in Section 2.5.2). Furthermore, RCut is not suitable in the situation where
many documents have no appropriate categories, since it is always trying to assign all
documents to the same number of categories.
2.3.2.2 Score-Based (SCut)
The score-based strategy (SCut) learns an optimal threshold for each category ci.
The optimal threshold ts(ci) is a similarity score that optimizes the local performance
of category ci. If the local performance of each category is the primary concern and the
test documents belong to a variable number of categories, this strategy may be a better
choice than RCut. However, it is not trivial to find an optimal threshold for each
category. This problem becomes more apparent with a small set of training examples.
SCut may give too high or too low thresholds for some rare categories (i.e., overfitting
to training data) and so it can lower the global categorization performance as well as the
local performance for the rare categories.
For the multi-class text categorization task, the overfitting of SCut with a small number of training examples and for some rare categories indicates that we need a new thresholding strategy that, like SCut, gives flexibility in the number of categories assigned to each new document, while also mitigating the problem of unreliable thresholds in some rare categories.
2.3.2.3 Proportion-based Assignment (PCut)
Like SCut, the proportion-based assignment strategy (PCut) is a per-category
thresholding strategy. Given a ranking list of similarity scores for all documents for
each category ci, PCut assigns a “YES” decision to t top-ranking test documents. The
threshold t(ci) is computed by the rule t(ci) = |A| × Pr(ci) × x, where |A| is the number of documents in a training set A, Pr(ci) is the probability of category ci in A, and x is a real-valued parameter given by the user or predetermined automatically in the same way as k for RCut. PCut assumes that the proportion of positive documents in the
training set will be consistent with the test set.
While PCut performs well in text categorization experiments [Yan99], one of its main disadvantages is that it cannot be used for on-line text categorization, since it can be applied only when there is a pool of new documents to be categorized. This weak point is the main reason why PCut has not been applied in many practical text categorization systems, in which a delayed system response to new documents is unacceptable. For example, it is unsuitable for e-mail categorization systems.
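To make the three strategies concrete, the following sketch shows how each turns a matrix of similarity scores into YES decisions; the dictionary-based interface is a simplification of ours, not the implementation evaluated in [Yan99, Yan01].

def rcut(scores, k=1):
    # scores: {doc: {category: score}} -> each document gets its k top-ranking categories.
    return {d: sorted(s, key=s.get, reverse=True)[:k] for d, s in scores.items()}

def scut(scores, thresholds):
    # thresholds: {category: ts(ci)} tuned per category on training data.
    return {d: [c for c, v in s.items() if v >= thresholds[c]] for d, s in scores.items()}

def pcut(scores, category_priors, x=1.0):
    # Assign the t(ci) = |pool| * Pr(ci) * x top-scoring documents to each category.
    assigned = {d: [] for d in scores}
    docs = list(scores)
    for c, prior in category_priors.items():
        t = int(round(len(docs) * prior * x))
        ranked = sorted(docs, key=lambda d: scores[d].get(c, 0.0), reverse=True)
        for d in ranked[:t]:
            assigned[d].append(c)
    return assigned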
2.4 Active Learning
In most supervised learning tasks, the more training examples we provide to the
system, the better it performs. Typical experiments in text categorization literature
show that the system usually needs several thousands of human-labeled training
examples to get reasonable text categorization performance. However, given human
resource limitations, obtaining such a large number of training examples is expensive.
This problem suggests an active learning approach that controls the process of
sampling training examples to achieve a given level of performance with fewer training
examples.
Active learning refers to any learning method that actively participates in the
collection of training examples on which the system is trained [CAL94]. The main idea
of active learning is to achieve a given level of system performance with fewer
training examples. This is achieved without relying on random sampling. Instead, it
constructs new examples or selects a small number of optimally informative examples
for the user to classify.
Generating artificial new examples is one type of active learning approach that
was used in [Ang87]. However, most work in text categorization has focused on a
selective sampling approach [CAL94, LC94, LT97] in which the system selects the
most informative examples for labeling from a large set of unlabeled examples. In this
section, we investigate two typical approaches for selective sampling: uncertainty
sampling and committee-based sampling.
2.4.1 Uncertainty Sampling
Lewis and Gale in [LG94] proposed the uncertainty-sampling method for active
learning. Its effectiveness has been demonstrated in text categorization [LG94, LC94]
and other text learning tasks [TCM99]. Uncertainty sampling selects unlabeled
examples for labeling, based on the level of uncertainty about their correct category. A
text categorization system using an uncertainty-sampling method examines unlabeled
documents and computes the uncertainty values for the predicted category
membership of all examples. Then, those examples with largest uncertainties are
selected as the most informative ones for training and presented to human experts for
labeling. The uncertainty of an example is typically estimated by comparing its
numeric similarity score with the threshold of the category. The most uncertain
(informative) example is the one whose score is closest to the threshold. Figure 2.2
shows the pseudo-code for the uncertainty-sampling algorithm.
• Create an initial knowledge base (a set of classifiers)
• UNTIL “there are no more unlabeled examples” OR “human experts are unwilling to label more examples”:
(1) Apply the current knowledge base to each unlabeled example
(2) Find the k examples with the highest uncertainty values
(3) Have human experts label these k examples
(4) Train the new knowledge base on all labeled examples to this point
Figure 2.2 Uncertainty sampling algorithm.
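A sketch of the selection step, where the uncertainty of an example is taken as the closeness of its similarity score to the category threshold (as described above), might look as follows; the scoring function, the pool of unlabeled documents, and the ask_expert labeling interface are assumptions for illustration.

def select_uncertain(unlabeled_docs, score_fn, threshold, k=10):
    # Return the k documents whose scores are closest to the category threshold.
    return sorted(unlabeled_docs, key=lambda d: abs(score_fn(d) - threshold))[:k]

# One sampling round (pool, current_classifier, and ask_expert are hypothetical):
# batch = select_uncertain(pool, score_fn=current_classifier, threshold=0.4, k=10)
# labeled.update({doc: ask_expert(doc) for doc in batch})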
The knowledge base (a set of classifiers) used for estimating uncertainties of
unlabeled examples can be the same type (homogeneous) or a different type
(heterogeneous) from that used for the categorization of new documents. Even though
a heterogeneous approach to uncertainty sampling requires an additional construction
cost, it is essential when the existing classifiers are too computationally expensive to
use for uncertainty sampling of training examples. On the other hand, a homogeneous
approach may be preferred to a heterogeneous one if the existing classifiers are efficient
enough to build and run for the uncertainty sampling method.
For our main goal in this thesis, another important issue in the uncertainty
sampling of candidate training examples is whether or not the text categorization
system has a mechanism for showing the boundaries of the threshold regions to which
the uncertain and certain examples should belong. With such a mechanism, the system
can use the positive-certain examples for training, without asking the human experts
for their correct category. Also, the system can stop the sampling process for
uncertain examples when it believes that no more uncertain examples are available for a
given category. Such a mechanism may save human experts from labeling more
examples after the system has learnt an accurate category concept. Also, when the
input stream of unlabeled examples is infinite, this mechanism might be effective, since
the system can choose only the examples whose numeric scores are within the defined
threshold region.
2.4.2 Committee-Based Sampling
The other type of selective sampling method is committee-based sampling [DE95,
LT97]. In this method, diverse committee members are created from a given training
data set. Each committee member is an implementation of a different machine learning
algorithm. Then, each committee member is asked to predict the labels of examples.
The system selects informative examples based on the degree of disagreement among
the committee members and then presents them to a human expert for labeling. The most informative example is the one on whose label the committee members disagree most. Again, the same issues described for uncertainty sampling are
critical and have yet to be explored.
2.5 Evaluation Methods
We need evaluation methods to compare various text classifiers. Evaluation of a
classifier can be conducted by measuring its efficiency and its effectiveness. Efficiency refers to the ability of a classifier to run fast and is typically measured by the elapsed processor time; since elapsed processor time is already well-defined, it does not need to be explained further in this thesis. Efficiency of a classifier can usually be
measured on two dimensions: learning efficiency (i.e., the time a machine learning
algorithm takes to generate a classifier from a set of training examples) and
categorization efficiency (i.e., the time the classifier takes to assign appropriate
categories to a new document). Because of the unstable nature of parameters on which
the evaluation depends, efficiency is rarely used as the singular performance measure in
text categorization. However, efficiency is important for the practical application of
the system.
A much more common evaluation method for text categorization systems is
effectiveness: this refers to the ability to take the right decisions on the categorization
of new incoming documents. There are several commonly used performance measures
of effectiveness. However, there is no agreement on one single measure for use in all
applications. Indeed, the type of measure that is preferable depends on the
characteristics of the test data set and on the user’s interests. The absence of one
optimal measure of effectiveness makes it very difficult to compare the relative
effectiveness of classifiers.
In the next section, we will examine various performance measures of effectiveness
that have been widely used for the evaluation of text categorization systems. Then, we
will turn to two different issues in the evaluation of text categorization systems:
averaging performance values of all categories to get a representative single value of the
system’s performance and splitting an initial corpus into training and test data sets.
2.5.1 Performance Measures of Effectiveness
While a number of different conventional performance measures are available for
the effectiveness evaluation for text categorization, the definition of almost all
measures is based on the same 2×2 contingency table model that is constructed as
shown in Table 2.1.
In this table, ‘YES’ and ‘NO’ represent a binary decision given to each document
dj under category ci. Each entry in the table indicates the number of documents of the
specified type:
• TPi: the number of true positive documents that the system predicted were YES, and were in fact in the category ci.
• FPi: the number of false positive documents that the system predicted were YES, but actually were not in the category ci.
• FNi: the number of false negative documents that the system predicted were NO, but were in fact in the category ci.
• TNi: the number of true negative documents that the system predicted were NO, and actually were not in the category ci.
Here, note that the larger the TPi and TNi values are (or the smaller the FPi and FNi values are), the more effective the classifier for ci is.
Table 2.1 Contingency table for category ci.

  category ci                              label by human expert
                                     YES is correct      NO is correct
  label by        predicted YES           TPi                 FPi
  the system      predicted NO            FNi                 TNi
Given such a two-way contingency table, most conventional performance
measures compute a single value from the four values in the table. The standard
performance measures for classic information retrieval research are recall and precision, which have also been frequently adopted for the evaluation of text categorization.
These measures are computed as follows.
• Recall = TPi / (TPi + FNi),   if TPi + FNi > 0
• Precision = TPi / (TPi + FPi),   if TPi + FPi > 0
Recall measures the proportion of documents that are predicted to be YES and
correct, against all documents that are actually correct. Precision, in contrast, is the
proportion of documents which are both predicted to be YES and are actually correct,
against all documents that are predicted YES. In general, the higher precision is, the
lower recall becomes, and vice versa. For example, we can achieve very high precision
by rarely predicting ‘YES’ (i.e., by setting a very high threshold value) or very high
recall by rarely predicting ‘NO’ (i.e., by setting very low threshold value). For this
reason, they are seldom used alone as a sole measure of effectiveness. Instead, it is
common in the literature to show two associated values of recall and precision at each
level.
Other performance measures that are purely based on the contingency table are
accuracy and error. They are defined as follows:
• Accuracy = (TPi + TNi) / |D|,   where |D| = TPi + FPi + FNi + TNi > 0
• Error = (FPi + FNi) / |D|,   where |D| = TPi + FPi + FNi + TNi > 0
While commonly used for performance measures in the machine learning literature,
accuracy and error are not frequently used in text categorization. Their low popularity
in text categorization may be explained, based on their definitions. The accuracy and
error are defined as the proportion of documents that are correctly predicted and the
proportion of documents that are wrongly predicted, respectively. Both measures have |D|, the total number of documents, in their denominator. As
criticized in [Yan99], a large value of |D| makes accuracy insensitive to a small change
in the value of TP (true positive) or TN (true negative). Likewise, the variations in the
value of FP (false positive) or FN (false negative) will have a tiny impact on the value
of error. Also, for the rare categories that have a small number of positive documents
assigned, a trivial rejecter (i.e., a classifier that rejects every document for a category)
may give much better performance (i.e., the larger value of accuracy and the smaller
value of error) than non-trivial classifiers. As a consequence, with a large data set and
low average probability of each document belonging to a given category, these two
measures are not sufficiently sensitive.
As briefly discussed earlier, neither recall nor precision makes sense in isolation
from each other because of the tradeoff between them. In actual practice, a trivial
classifier that gives YES decisions to every document-category pair, (dj, ci), will have
perfect recall (i.e., Recall = 1), but an extremely low value of precision. Most text
categorization systems that have been adjusted to have high recall will sacrifice
precision, and vice versa. Usually, users want the system to have high recall and high
precision. However, it is generally difficult for them to choose between two systems
where one has higher recall and the other has higher precision. For this reason, it is
usually preferable to evaluate a system’s effectiveness by using a measure that
combines recall and precision.
Among the various combined measures, the break-even point and Fβ are the most frequently used in text categorization. They are defined as follows:
• Break-even point (BEP) is the value at which Recall = Precision
• Fβ = [(β² + 1) × Recall × Precision] / [(β² × Precision) + Recall],   where 0 ≤ β ≤ ∞
The value of BEP is the value of precision that is tuned to be equal to recall and it is
computed by repeatedly varying the thresholds of a given category to plot precision as
a function of recall. If there are no values of precision and recall that are exactly equal,
the interpolated BEP is computed by averaging the closest values of precision and
recall. Yang [Yan99] noted that interpolated BEP may not be a reliable effectiveness
measure when no values of precision and recall are close enough. Also, as described in
[Seb02], Lewis, who proposed the break-even point in [Lew92a, Lew92b], noted that it
is unclear whether a system that achieves a high value of BEP also obtains high scores
on other performance measures.
Fβ, which was first proposed in [Rij79], is another common choice in text categorization; it is a function of recall, precision, and a parameter called β. This parameter allows differential weighting of the importance of recall and precision. The value of β can range from 0 to ∞ (infinity): β = 0 means that the system ignores recall, whereas β = ∞ means that it ignores precision. Usually, β = 1 (i.e., recall and precision are viewed as having equal importance) has been used for this measure.
• F1 = (2 × Recall × Precision) / (Recall + Precision)
Note that, when Recall = Precision, F1 will have the same value as recall (or precision). So, apart from the unreliable character of interpolated BEP, BEP is always less than or equal to the optimal value of F1 [Yan99].
In this research, we use F1 for the effectiveness measure since balanced recall and
precision is our main concern for our comparative analysis of classifiers.
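These measures are straightforward to compute from the contingency counts; a short illustrative sketch is:

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * recall * precision / (recall + precision) if recall + precision > 0 else 0.0
    return precision, recall, f1

print(precision_recall_f1(tp=8, fp=2, fn=4))   # (0.8, 0.666..., 0.727...)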
2.5.2 Micro and Macro Averaging
For a given category set C = {c1, c2, … , c|C|} and a document set D, binary
assignment decisions will generate a total of |C| contingency tables. An important issue
which arises is how to compute a single value of effectiveness by averaging them. Two
different averaging methods are available: macro-averaging and micro-averaging.
Macro-averaging first computes effectiveness measures locally for each
contingency table, and then computes a single value by averaging over all the resulting
local measures. For instance, macro-averaging recall, MA-Recall, is computed as
follows:
• MA-Recall = ( ∑i=1,|C| Recalli ) / |C|,   where Recalli = TPi / (TPi + FNi) and TPi + FNi > 0
Macro-averaging can be viewed as a per-category averaging method that gives
equal weight to each category. So, a macro-averaging measure will be a good indicator
of the ability of classifiers to work well on rare categories which have a small number
of positive documents.
Micro-averaging considers all binary decisions as a single global group by making a
global contingency table, and then computes a single effectiveness measure by
summing over all individual decisions. For example, micro-averaging recall, MI-Recall,
is computed based on the global contingency table in Table 2.2:
• MI-Recall = ∑i=1,|C| TPi / ∑i=1,|C| (TPi + FNi)
Micro-averaging gives equal weight to every individual decision of (dj, ci). So, it
can be considered a per-document averaging method. Whether macro-averaging or
micro-averaging is more informative obviously depends on the purpose of
categorization and characteristics of test data set. However, micro-averaging seems to
be the preferable averaging method in the literature [Seb02]. In this thesis, we will use both macro-averaging and micro-averaging for the evaluation of classifiers.
Comparing these two measures gives an indication of the impact of the rare categories
that could be hidden in micro-averaging.
Table 2.2 Global contingency table for category set C.

  category set C = {c1, c2, … , c|C|}            label by human expert
                                          YES is correct        NO is correct
  label by        predicted YES           ∑i=1,|C| TPi          ∑i=1,|C| FPi
  the system      predicted NO            ∑i=1,|C| FNi          ∑i=1,|C| TNi
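Given one (TPi, FPi, FNi) triple per category, the two averaging schemes can be sketched as below; the example categories are, of course, invented.

def f1_score(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return 2 * recall * precision / (recall + precision) if recall + precision > 0 else 0.0

def macro_micro_f1(tables):
    # tables: {category: (tp, fp, fn)}.
    macro_f1 = sum(f1_score(*t) for t in tables.values()) / len(tables)     # equal weight per category
    TP, FP, FN = (sum(t[i] for t in tables.values()) for i in range(3))
    micro_f1 = f1_score(TP, FP, FN)                                         # equal weight per decision
    return macro_f1, micro_f1

tables = {"common": (90, 10, 10), "rare": (1, 1, 5)}
macro, micro = macro_micro_f1(tables)   # the rare category pulls the macro-average down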
2.5.3 Data Splitting
Evaluation of classifiers in machine learning approaches needs a test set which is
different from a training set used for classifier construction. Theoretically, the
documents in a test set cannot be involved in the learning process for classifiers for
which they will be tested. Otherwise, evaluation results would likely be too good to be
achieved realistically. So, before the classifier construction, an initial set of documents,
H = {d1, d2, … , d|H|} ⊂ D, should be split into two disjoint sets (a training set and a test set).
Two standard splitting methods are train-and-test and k-fold cross-validation.
In the train-and-test approach, an initial set H is partitioned into a training set A =
{d1, … , d|A|} and a test set E = {d|A|+1, … , d|H|}. These splits are, in most cases,
constructed based on random sampling, or sometimes based on real-world data flow
(i.e., earlier documents for the training set and later ones for the test set) if documents
have a time-flag like electronic mail. This approach can give biased evaluation results
since it cannot select from all the documents for training. However, the train-and-test
approach has been frequently used in text categorization since it makes experiments
much faster and makes comparisons amongst systems easier. Some data sets have
predefined splits and so they make it easier for different researchers to compare their
text categorization results.
An alternative to the train-and-test approach is k-fold cross-validation. It has the
advantage of minimizing variations due to the biased sampling of training data. This
splitting approach partitions an initial data set H into k different sets (E1, E2, … , Ek)
where positive and negative documents for each category are equally distributed among
them. Then, k experiments are iteratively run by applying the train-and-test approach
on k train-test pairs (Ai = H − Ei, Ei), and then the final performance result is computed
by averaging the k runs. For a data set with no predefined splits, we will use a cross-validation approach for data splitting.
One consideration is in the construction of n small subsets from a given large
training set A. It is an important issue for our work since we have to examine whether a
classifier works well with few training examples. What we have to do is to further split
a training set A in n disjoint sets (S1, S2, …, Sn), not necessarily of equal size, and then
construct n new training sets (SA1, SA2, …, SAn) where each SAi = SAi−1 ∪ Si (with SA0 = ∅). The
effectiveness of a classifier with a small number of training examples could be assessed
based on the results of n experiments conducted on n train-test pairs (SAi, E). As
discussed earlier, each Si can be built in a random manner. However, random sampling,
here, can cause the problem of an extremely uneven distribution of positive documents
across categories. The documents of a few frequent categories (i.e., category with a
large number of positive documents) would likely dominate in Si. We view this
problem as a major factor, causing biased experimental results. So, we use semi-random
sampling that randomly selects documents but controls the distribution of positive
documents for each category. As an example, consider a two-class (c1 and c2) text
categorization problem where the number of training documents |A| = 1,000, the
number of positive documents for category c1 is 900, and the number of positive
documents for category c2 is 100. When S1 should contain 10% of |A| (100 documents), semi-random sampling selects the same fraction of positive documents from each category (i.e., 90 from c1 and 10 from c2). In this thesis, we will use semi-random sampling unless otherwise stated.
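A sketch of this semi-random (per-category stratified) sampling is shown below; for simplicity it assumes that each document is listed under a single category, whereas the corpora used later may be multi-labeled.

import random

def semi_random_sample(doc_ids_by_category, fraction, seed=0):
    # Randomly select the same fraction of positive documents from each category.
    rng = random.Random(seed)
    sample = []
    for category, ids in doc_ids_by_category.items():
        size = int(round(len(ids) * fraction))
        sample.extend(rng.sample(ids, size))
    return sample

pool = {"c1": ["c1_%d" % i for i in range(900)], "c2": ["c2_%d" % i for i in range(100)]}
s1 = semi_random_sample(pool, fraction=0.10)   # 90 documents from c1 and 10 from c2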
Some classifiers have internal parameters, such as thresholds, that should be
optimized empirically based on a set of training documents. For parameter
optimization, it is often the case that a training set A is further partitioned into two
different sets, namely a training set R = {d1, … , d|R|} and a validation set V = {d|R|+1, … ,
d|A|}. Also, this validation set V must be kept separate from the test set E and thus
must be used only for parameter tuning. One question in the construction of the
validation set is, of the documents in an initial training set A, what fraction should a
validation set V have? With a small set size of A, this question may be much more
difficult (sometimes impossible) to answer, since the number of documents in A may
not be enough even for training itself. Note that, in our experimental settings, the
training set SAi will be very small compared to a given initial training set A and it will
not be practical to divide SAi into two subsets. In this thesis, we do not make separate
validation sets, but use the training set A for the parameter optimization.
Chapter 3
Keyword Association Network (KAN)
In this chapter, we describe our Keyword Association Network (KAN) approach
to the definition of a function Φi that returns the estimated categorization values of new
documents for a specific category ci. We discuss the motivation of KAN, its overall
approach to text categorization, and its computational complexity.
3.1 Objectives and Motivation
The main goal of this research project is to develop text categorization methods
that work well with a small number of training examples. For this goal, we propose a
learning method to define the function Φ in a classifier and its motivation is to achieve
the following two important objectives:
1. to give a feature the appropriate weight according to its semantic meaning
in a given document, and
2. to remove the influence of irrelevant features by giving more weight to the
discriminating (informative) features.
In text categorization, many words appear frequently in documents of several
different categories. These features are problematic for the similarity computation
between a classifier and a new document since they are more ambiguous than other
words that mainly appear in a single category. We have noticed that the ambiguity
these features may have can be characterized into two types: semantic and informative
ambiguities.
[Figure 3.1 sketches two categories, computer and farm. The profile (or positive documents) of the category computer contains features such as apple, windows, and computer, while the profile of the category farm contains features such as apple, grape, and computer. A new document dj containing apple, grape, and computer is compared against both categories in the similarity computation, producing Φcomputer(dj) and Φfarm(dj), even though a human expert assigned the category farm to dj. In this example, “apple” has semantic ambiguity and “computer” has informative ambiguity.]
Figure 3.1 An example of the similarity computation with semantic and informative ambiguities.
Semantic ambiguity is well-defined in the information retrieval literature and it
refers to differences in meanings. As an example in Figure 3.1, consider the feature,
“apple”, that conveys different meanings in two different categories, computer and
farm. A good text categorization system should be able to differentiate its different
meanings in each category, i.e. “apple” is a kind of fruit in the category farm and a kind
of computer system in the category computer. Suppose we want to perform a
similarity computation between two objects: new document dj that should be
categorized under the category farm, and a profile (or positive documents for the
instance-based learning algorithms) of the category computer. With conventional
learning algorithms that compute the similarity score based on a single-term indexing
approach like the vector space model, the feature “apple” may have a large weight if it
has large weights in both objects and, as a result, this document may be incorrectly
considered as being computer-related (i.e., Φcomputer(dj) > Φfarm(dj)). Human experts can
differentiate the meaning of the feature “apple” in the document dj and the same word
in the category computer by looking at other words in both objects. This suggests that
a promising approach to this problem is to exploit co-occurrence information of
“apple” with other features in computing the weight of the feature “apple”. This
approach can reduce the weight of “apple” in the similarity computation since pairs of
words in the document dj are less likely to appear in the category computer.
The more frequent type of ambiguity in text categorization is informative
ambiguity. This refers to differences in importance. In fact, informative disambiguation
is the aim of most machine learning algorithms applied to text categorization. However,
since the same documents may belong to more than one category and many features
may occur frequently in several different categories, this aim is a difficult one to
achieve for most learning algorithms that are based on a single-term indexing approach.
For example, since the current farming industry uses computer technology, many new
documents that should be assigned under the category farm may contain computer-related terms like the feature “computer” as shown in Figure 3.1. Obviously, these
computer-related terms may have minor importance in the documents of the category
farm. As in the example of semantic ambiguity, if we adopt conventional learning
algorithms based on single-term indexing, similarity scores between the documents in
the category farm and the category computer may be large and, as a result, some of
documents in the category farm will be categorized under the category computer.
Again, word co-occurrence statistics are a potentially effective source of information
for resolving this informative ambiguity, since co-occurrences of computer-related and
farm-related words may be low in the category computer.
Another challenge for resolving informative ambiguities of features is to use their
discriminating power (numeric value) in assessing the similarity scores of new
documents. In text categorization, some categories may have one or two discriminating
features. Identifying the existence of those features in new documents is enough for the
categorization task. Using the same feature set size across all categories and giving an
equal opportunity to them to participate in a similarity computation could be a major
cause of low system performance. A number of discriminating power functions have
been investigated and applied to feature selection as a preprocessing step that selects a
subset of features [KS96, MG99]. Using the discriminating power of each feature in a
similarity computation is a promising approach to the elimination of the impact of
irrelevant features and, as a result, to significant performance improvements.
The basis of our approach is that using word co-occurrence information and
discriminating power values of features in a similarity computation could result in the
semantic and informative disambiguation and, as a result, would lead to improved text
categorization performance with a small number of training examples. Another
important part of the motivation for this approach is that word pairs could be more
understandable to users than a single word. Pazzani’s usability study [Paz00] showed
that people prefer word pair explanations for categorization. In our approach, the automatically extracted word pairs used for the categorization of new documents could therefore also be used to increase user acceptance of the learned profiles of categories.
3.2 Overall Approach
In similarity-based text categorization, each raw document is represented by the
features extracted and their associated weights, based on a single-term indexing scheme
like TFIDF described in section 2.2.2. With the representations of training documents,
most conventional machine learning algorithms construct the function Φ based only on the single features that documents have in common [Seb02], and the function Φ then computes the similarity scores of new documents based, again, only on the matching single features.
KAN is a new machine learning approach to the construction of the function Φ that
exploits word co-occurrence information extracted from a set of training documents and
discriminating power value of each feature in similarity computation for resolving both
semantic and informative ambiguities. KAN consists of the construction of the network
of co-occurring features from a collection of documents, the definition of a relationship
measure between two features, and the definition of the discriminating power of each
feature in a given category. In this section, we discuss these three basic parts of KAN
and how it can be used for text categorization.
3.2.1 Constructing KAN
Previous work showed that it is possible to automatically find words that are semantically similar to a given word, based on the co-occurrence of words [Rug92, SCAT92]. Such word co-occurrence information has been used for various text learning tasks, including semantic feature indexing [DDFL+90], automatic thesaurus generation [CY92], automatic query expansion [XC00], and text mining tasks [KMRT+94, SSC97]. KAN is also constructed by means of a network representation based on word co-occurrences in the training document set.
Let us assume that there is a set of n unique features F = (w1, w2, ... , wn) and a set of m documents D = (d1, d2, ... , dm). Here, each dj = (w1, w2, ... , wk) is a non-empty subset of F. The construction of KAN for a given category ci is based on the document frequency (DF) of two co-occurring features wi and wj, DF(wi,wj | ci), which is defined as follows:
• DF(wi,wj | ci) is the number of positive documents in the category ci that contain both features wi and wj.
The problem of building the network is to find the pairs of features that satisfy a user-specified minimum document frequency (minDF).
As an example, let us suppose that we have the set of positive documents for a particular category, information technology, shown in Figure 3.2. In this example, through the preprocessing steps (a stemming algorithm and a stop-list, as mentioned in Section 2.2), we obtain a set of distinct features F that can be considered informative. Given the set of documents D, the set of unique features F, and a user-specified minDF (2 in this example), the algorithm in Figure 3.3 finds the frequent 2-feature sets, F2, which are pairs of features that occur together in the set of documents often enough to satisfy the given minimum document frequency. Note that the feature "file" occurs in just one document, so it is not considered at this point since it does not satisfy the given minDF. CF2 is the set of candidate 2-feature sets generated from the frequent 1-feature sets F1.
Figure 3.4 shows the resulting network representation for the above example. Each node represents a feature and its document frequency in the given category, and the integer on the edge between two nodes gives the document frequency of that pair of features. When the similarity score between this category and a new document is calculated, if the document contains both features "apple" and "computer", they are considered informative according to this network, and their relationship measure (explained in the next section) boosts the similarity computation, as explained in Section 3.2.4.
set of unique features: F = {apple, windows, computer, web, www, file}

set of documents (D):
  d1 = (apple, windows, computer)
  d2 = (windows, computer)
  d3 = (apple, computer, web, www)
  d4 = (file, web, www)

F1 (frequent 1-feature sets), with DF:
  apple (2), windows (2), computer (3), web (2), www (2)

CF2 (candidate 2-feature sets), with DF:
  (apple, windows) 1, (apple, computer) 2, (apple, web) 1, (apple, www) 1,
  (windows, computer) 2, (windows, web) 0, (windows, www) 0,
  (computer, web) 1, (computer, www) 1, (web, www) 2

F2 (frequent 2-feature sets), with DF:
  (apple, computer) 2, (windows, computer) 2, (web, www) 2

Figure 3.2    An example for constructing KAN.
// Step 1: find the frequent 1-feature sets that satisfy the
// user-specified minimum document frequency (minDF).
for all d ∈ D do
    for all wi ∈ F do
        if wi exists in d then
            wi.count++
F1 = { wi ∈ F | wi.count ≥ minDF }

// Step 2: generate the candidate feature pairs from F1 = {w1, w2, ... , wk}.
CF2 = ∅
for (i = 1; i < k; i++) do
    CF2 = CF2 ∪ { (wi, wi+1), (wi, wi+2), ... , (wi, wk) }

// Step 3: keep the feature pairs that satisfy the required
// document frequency (minDF).
for all d ∈ D do
    for all cf ∈ CF2 do
        if both features in cf exist in d then
            cf.count++
F2 = { cf ∈ CF2 | cf.count ≥ minDF }

Figure 3.3    Algorithm for generating frequent 2-feature sets F2.
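To make the procedure concrete, the following Python sketch reproduces the steps of Figure 3.3 on the example data of Figure 3.2. It is our own illustration rather than the implementation used in this research; the function name frequent_feature_sets and the variable names are ours.

```python
from itertools import combinations

def frequent_feature_sets(documents, min_df=2):
    """Return (F1, F2): features and feature pairs whose document
    frequency in `documents` is at least `min_df` (cf. Figure 3.3)."""
    # Step 1: count, for every feature, the number of documents containing it.
    df1 = {}
    for doc in documents:
        for w in set(doc):
            df1[w] = df1.get(w, 0) + 1
    f1 = {w: c for w, c in df1.items() if c >= min_df}

    # Step 2: candidate pairs built from the frequent single features.
    candidates = list(combinations(sorted(f1), 2))

    # Step 3: keep the candidate pairs co-occurring in at least min_df documents.
    df2 = {pair: 0 for pair in candidates}
    for doc in documents:
        doc_set = set(doc)
        for wi, wj in candidates:
            if wi in doc_set and wj in doc_set:
                df2[(wi, wj)] += 1
    f2 = {pair: c for pair, c in df2.items() if c >= min_df}
    return f1, f2

# The documents of Figure 3.2 (category "information technology").
docs = [
    ["apple", "windows", "computer"],
    ["windows", "computer"],
    ["apple", "computer", "web", "www"],
    ["file", "web", "www"],
]
F1, F2 = frequent_feature_sets(docs, min_df=2)
print(F1)  # {'apple': 2, 'windows': 2, 'computer': 3, 'web': 2, 'www': 2}
print(F2)  # {('apple', 'computer'): 2, ('computer', 'windows'): 2, ('web', 'www'): 2}
```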
[Figure: network with nodes computer(3), windows(2), apple(2), web(2), and www(2), and edges apple-computer (2), windows-computer (2), and web-www (2)]

Figure 3.4    Network representation generated from the example.

3.2.2 Relationship Measure
The degree of relationship between two features is represented by a confidence value, which we refer to as CONF(wi,wj): our confidence that wi is related to wj. This measure was used to find association rules in [AMST+96], which have been identified as an important tool for knowledge discovery in huge transactional databases. The notions of co-occurrence and frequency in this research correspond exactly to the frequent 2-itemsets and the support of association rules. In the network structure of the previous section, the confidence value measures how the presence of one feature in a given document may influence the presence of another feature in a particular category.
When a category ci has a set of n unique features F = (w1, w2, ... , wn), the ith feature's confidence value with respect to the jth feature, CONF(wi,wj), is defined as follows:

• CONF(wi,wj) = DF(wi,wj | ci) / DF(wi | ci)        …. Formula 3.1

where DF(wi,wj | ci) is the document frequency of the two co-occurring features wi and wj in the category ci and DF(wi | ci) is the document frequency of the feature wi in the
category ci. Note that this rule is asymmetric, i.e. CONF(wj,wi) has a different value
from CONF(wi,wj) due to the different denominator. High confidence of wi to wj is
interpreted as an indicator that the semantic meaning and importance of feature wi can
be determined by the existence of the feature wj.
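As a small illustration of Formula 3.1, the following sketch computes CONF from the document frequencies of the Figure 3.2 example. The helper name conf and the dictionary layout are ours, not part of the thesis implementation.

```python
def conf(w_i, w_j, pair_df, single_df):
    """CONF(wi, wj) = DF(wi, wj | ci) / DF(wi | ci)  (Formula 3.1)."""
    key = tuple(sorted((w_i, w_j)))          # pair DF is stored unordered here
    return pair_df.get(key, 0) / single_df[w_i]

# Counts from Figure 3.2: DF(apple, computer) = 2, DF(apple) = 2, DF(computer) = 3.
single_df = {"apple": 2, "windows": 2, "computer": 3, "web": 2, "www": 2}
pair_df = {("apple", "computer"): 2, ("computer", "windows"): 2, ("web", "www"): 2}

print(conf("apple", "computer", pair_df, single_df))   # 1.0        (2 / 2)
print(conf("computer", "apple", pair_df, single_df))   # 0.666...   (2 / 3), asymmetric
```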
The set of 2-feature sets can be reduced further by keeping only the frequent 2-feature sets that satisfy a user-defined minimum confidence (minCONF). Based on minCONF, some weak relationships in the network representation may be filtered out because of their low confidence. In this research, we do not perform this filtering of weak relationships: the optimal value of minCONF cannot be found easily, and such weak relationships should have only a minor impact on KAN's performance because KAN is designed to keep their influence small, as explained in the following sections.
3.2.3 Discriminating Power Function
The discriminating power of each feature is an important factor in resolving the informative ambiguities of features that occur in several different categories. As a result, it is critical for achieving high performance in text categorization. KAN integrates this discriminating power into the similarity computation between a new document and a category, applying a discriminating power function similar to the scheme used in [SK00].
Let m be the number of categories. For the jth feature, wj, we prepare the weight vector hj = (df1,j, df2,j, ... , dfm,j). Here, each dfi,j is computed as follows:

• dfi,j = DF(wj | ci) / DF(wj)        …. Formula 3.2

where DF(wj | ci) is the document frequency of wj in the ith category ci and DF(wj) is the document frequency of wj over all documents in all categories. Shankar in [SK00] computed the discriminating power value of the jth feature, Pj, as follows:

• Pj = ∑ i=1..m (ĥi,j)^2        …. Formula 3.3

where ĥj is the one-norm scaled vector of hj, so each ĥi,j is:

• ĥi,j = dfi,j / ∑ i=1..m (dfi,j)        …. Formula 3.4
Pj has the lowest value, 1/m, if the feature wj is evenly distributed across all the categories, and the largest value, 1, if it appears in only one category. The same Pj is then used in all categories to adjust the weight of feature wj. One drawback of this scheme is that it sharply reduces the weights of some informative features in several categories that share an overlapping concept (i.e., features that appear frequently in several categories, and are informative in all of them, will have a small Pj).
To overcome this drawback of Pj, we use dfi,j defined in Formula 3.2 as the discriminating power of the jth feature wj in the ith category ci. In order to further reduce the impact of irrelevant features, dfi,j is transformed to Si,j as follows:

• Si,j = dfi,j^(λ / dfi,j)        …. Formula 3.5

where the range of λ is 0 < λ < 1. The graph of the df and S values of the features has an S shape about the point λ, as shown in Figure 3.5. This graph depicts the S value associated with each value of df when λ takes three different values, 0.35, 0.50, and 0.65. It shows that df values below λ are penalized further.
To address the problem of setting λ for a particular text categorization task, let us consider the 2-category categorization problem where |C| = 2, such as a spam-mail filtering task [AKCS00, SDHH98, CM01]. In this case, the categorization must split new e-mails into two disjoint categories, junk and non-junk. Since only two categories are involved, a feature with a df of 0.5 cannot be considered informative. Such a feature should have a much smaller S value than 0.5 and, as a result, λ should be greater than 0.5 (i.e., 0.5 < λ < 1). In general, the value of λ should be lower as the number of predefined categories increases and higher as it decreases. In this research, we set λ to 0.35 because of the large number of predefined categories in the data sets.
[Figure: S plotted against df for λ = 0.35, 0.50, and 0.65; y-axis S, x-axis df, both from 0.0 to 1.0]

Figure 3.5    Graphs of S against df, when λ = 0.35, 0.50, and 0.65.
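A minimal sketch of Formulas 3.2 and 3.5, assuming the raw document frequencies are already available; the function names are ours, and the example values only illustrate the shape of the transformation.

```python
def discriminating_df(df_in_category, df_overall):
    """df_ij = DF(wj | ci) / DF(wj)  (Formula 3.2)."""
    return df_in_category / df_overall

def s_value(df_ij, lam=0.35):
    """S_ij = df_ij ** (lam / df_ij)  (Formula 3.5): df values below
    lambda are pushed further down, values above lambda are boosted."""
    return df_ij ** (lam / df_ij)

for df in (0.1, 0.35, 0.5, 0.9):
    print(df, round(s_value(df, lam=0.35), 4))
# 0.1 -> 0.0003, 0.35 -> 0.35, 0.5 -> 0.6156, 0.9 -> 0.9599
```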
3.2.4 Applying KAN to Text Categorization
Unlike the profile-based linear algorithms discussed in Section 2.3.1.1, KAN does not fix the feature weights during the learning process. The absolute weight of each feature in a particular category is determined on the fly when a new document is seen. This approach, we believe, helps assign appropriate weights to features that occur frequently in several different categories, according to their semantic meaning and importance in the given document.
Suppose a new document g, having a vector of the form (z1, z2, ... , zn), becomes available. Then, we calculate the weight vector (y1, y2, ... , yn) for a category ci as follows:

• yk = [ ∑ j=1..|A|, dj∈ci  xk(dj) × |ci|^-1 ] × [ 1 + Sk × ∑ l=1..n, l≠k (CONF(wl,wk) × bl) ]        …. Formula 3.6

  xk(dj): the weight of the kth feature in a training example dj
  |A|: the number of documents in the training set A
  |ci|: the number of positive documents in the category ci
  Sk: the discriminating power of the kth feature in the category ci
  CONF(wl,wk): the confidence measure of the lth feature wl to the kth feature wk
  bl: the existence label of the lth feature in the new document g, which is 1 if the feature appears in the new document and 0 if it does not

Then, the similarity score between the category ci and the new document g is the dot product of the two vectors:

• fi(g) = ∑ k=1..n (yk × zk)        …. Formula 3.7
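The sketch below shows how Formulas 3.6 and 3.7 can be combined when a new document arrives. It assumes the per-category statistics (the training weights xk(dj), the S values, and the CONF table) have already been computed; the data structures and names are ours and are not the thesis implementation.

```python
def kan_similarity(new_doc_weights, new_doc_features, category):
    """Score a new document g against one category (Formulas 3.6 and 3.7).

    new_doc_weights  : dict feature -> z_k (e.g., TFIDF weight in g)
    new_doc_features : set of features appearing in g (the b_l labels)
    category         : dict with
        'train_weights' : dict feature -> list of x_k(d_j) over positive docs
        'n_positive'    : |c_i|
        'S'             : dict feature -> S_k
        'conf'          : dict (w_l, w_k) -> CONF(w_l, w_k)
    """
    score = 0.0
    for w_k, z_k in new_doc_weights.items():
        x_sum = sum(category['train_weights'].get(w_k, []))
        base = x_sum / category['n_positive']                 # averaged weight in c_i
        boost = sum(category['conf'].get((w_l, w_k), 0.0)     # co-occurrence boost
                    for w_l in new_doc_features if w_l != w_k)
        y_k = base * (1.0 + category['S'].get(w_k, 0.0) * boost)   # Formula 3.6
        score += y_k * z_k                                    # Formula 3.7 dot product
    return score
```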
One interesting property of this automated text categorization system, although not empirically tested in this research, is its potential ability to handle annotated training documents in which human experts have indicated informative terms as well as category information. In KAN, such manually indicated terms, declared to be informative in a given category, can be handled through the construction of a set of discriminating features [LKKR02].
Let R be the set of features that satisfy the following condition:

• wk ∈ R if (Sk ≥ minS) or (wk is indicated by human experts in annotated documents)

where minS is the user-defined minimum discriminating power value. Then, when a new document is available, the weight yk of the kth feature wk in a category ci can be computed as follows:

• yk = [ ∑ j=1..|A|, dj∈ci  xk(dj) × |ci|^-1 ] × [ 1 + δ ]        …. Formula 3.8

  where δ = Sk × ∑ l=1..n, l≠k (CONF(wl,wk) × bl)   if wk ∈ R
        δ = 0                                        if wk ∉ R

This equation is essentially the same as Formula 3.6. The difference is that only the features in R receive the additional weight computed from their relationships with other features, while the features not in R have only the basic weights given by the single-term weighting scheme. This means that when wk is indicated as important by a human expert, as shown in Figure 3.6, it is incorporated in Formula 3.8 regardless of the value of Sk.
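A brief sketch of the modification in Formula 3.8: the co-occurrence boost is applied only to features in the discriminating set R. The function name and arguments are ours, chosen to mirror the quantities in Formula 3.8.

```python
def weight_with_discriminating_set(base, s_k, boost, w_k, R):
    """Formula 3.8: only features in R receive the KAN boost.

    base  : averaged single-term weight of w_k in the category
    s_k   : discriminating power S_k of w_k
    boost : sum of CONF(w_l, w_k) over features w_l present in the new document
    R     : set of discriminating features (S_k >= minS, or expert-indicated)
    """
    delta = s_k * boost if w_k in R else 0.0
    return base * (1.0 + delta)
```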
Figure 3.6 shows an example of the KAN network representation for handling such a discriminating feature set in a given category, grain. In this example, the three nodes "agriculture", "grain", and "wheat" are the discriminating features. The first two, "agriculture" and "grain", satisfy minS, while "wheat" does not satisfy it but is indicated as an informative feature by human experts. These three features will contribute most of the overall similarity scores of new documents.
[Figure: KAN network for the category grain (minS = 0.35), with nodes program, grain (S = 0.65), china, ship, wheat (S = 0.29, SUPT = 19), U.S., agriculture (S = 0.54), and trade; "agriculture", "grain", and "wheat" form the discriminating feature set]

Figure 3.6    An example of network representation for handling annotated training examples.
3.3 Computational Complexity
A potential drawback of KAN is that it would not be suitable for a high-dimensional feature space, because its computational cost quickly becomes prohibitive as n increases, where n is the number of unique features in the feature space. When computing the weight of the jth feature in a given category, KAN requires a confidence value for each pair of features. In simplified O() notation (i.e., treating mn^2 + pn as n^2), the computational complexity of KAN is O(n^2) in both construction time and similarity computation time. So, to make KAN practical for text categorization, it is critical to reduce the feature space by applying feature selection algorithms.
If KAN's performance with a reduced feature set is comparable with that of the full feature set, and such a reduced feature set has a reasonable input size that is manageable by KAN, we can say that KAN is efficient in terms of processing time. This is the subject of the experiments in Section 5.3.
Chapter 4
RinSCut: New Thresholding Strategy
The final decisions that map new documents to relevant categories can be made by thresholding the similarity scores computed by the classifiers. For similarity-based classifiers, the thresholding strategy is a critical research area that has a huge impact on overall text categorization effectiveness. This chapter describes our new thresholding strategy, the rank-in-score thresholding strategy (RinSCut): its motivation, an analysis of its desired properties, and the overall approach to defining the optimal threshold value for a given category ci.
4.1 Motivation
As Yang notes [Yan01], thresholding strategies constitute an unexplored research
area in text categorization. Indeed, most work implicitly assumes that finding the
optimal thresholding strategy is a trivial task. However, this assumption is not true. In
fact, the selection of the optimal thresholding strategy, discussed in section 2.3.2,
depends on several aspects.
1. Different characteristics of machine learning algorithms
Some learning algorithms produce similarity scores that are more comparable within a document than within a category. For such algorithms, as shown in Figure 4.1, the RCut thresholding strategy would be better than SCut and PCut; the latter strategies work better with similarity scores that are more comparable within a category. However, the type of similarity scores an adopted algorithm will generate is not clear until one compares the empirical results obtained by applying all the existing thresholding strategies.
                 d1        d2        ...    dn      (documents)
  c1             f1(d1)    f1(d2)    ...    f1(dn)
  c2             f2(d1)    f2(d2)    ...    f2(dn)
  ...
  cm             fm(d1)    fm(d2)    ...    fm(dn)
  (categories)

  Scores more comparable within a category (along a row): SCut and PCut.
  Scores more comparable within a document (down a column): RCut.

Figure 4.1    Comparability of similarity scores and thresholding strategies.
2. Different user interests
The choice of the thresholding strategy is heavily affected by different user
interests. For example, when a user is interested in the local performance of
each category, SCut would be preferred since it optimizes the macro-averaged
performance. RCut will be effective when the user is interested in the global
performance (i.e., micro-averaged performance). The difficulty is that user
interests are not constant, changing over time.
3. The multi-class text categorization problem
In real-world applications, text categorization systems do not know how many categories could be assigned to each new document, nor whether that number will be constant over all new documents. This means that typical text categorization tasks are multi-class categorization problems in which new documents belong to a variable number of categories. For multi-class categorization, SCut and PCut seem to be the optimal choices because RCut assigns the same number of categories to all new documents. However, this is not necessarily the case, because of the different characteristics of the machine learning algorithms and the different user interests discussed earlier.
In the absence of an outstanding strategy, it is apparent that finding the optimal
thresholding strategy for any given algorithm and data set is difficult. Addressing this
problem is a challenge in similarity-based text categorization. One way of overcoming
this problem is to design a new thresholding strategy that jointly uses the strengths of
different existing strategies. This is the motivation for our invention and evaluation of a
new thresholding strategy, RinSCut.
4.2 Desired Properties
In developing our new thresholding strategy, we have noticed that RinSCut should
have the following desired properties.
1. Online text categorization
Among the three thresholding strategies, PCut is the only one that uses the proportional information of each category observed in the training set. While using such proportional information makes PCut work well on a test set that has a category distribution similar to the training set, it also makes PCut unsuitable for online text categorization: with PCut, the categorization of each new document must be postponed until a pool of new documents has accumulated. The ability to make an online decision is highly desirable for many text classification systems, especially e-mail categorization and information filtering systems, where delayed decisions on new documents are not allowed.
2. Optimizing both local and global effectiveness
RCut would be effective when the global (micro-averaged) performance is the
primary concern. On the other hand, SCut could be superior to RCut when the
local performance of each category (macro-averaged performance) is tested.
Combining the strength of both thresholding strategies offers the potential for
a new strategy that optimizes both local and global performance.
3. Avoiding the risk of overfitting a small number of training documents
Our main goal in this research is to make a text categorization system that
works reasonably well with a small number of training examples. With the
small size of training data, SCut has a high risk of overfitting the training data
and, as a result, it can give some rare categories unreliable thresholds that
significantly lower the macro-averaged and micro-averaged effectiveness. By
contrast, RCut is less sensitive than SCut to the problem of overfitting.
From the analysis of desirable properties, we want a new thresholding strategy
that: (1) has the ability to categorize new documents online, (2) gives thresholds which
optimize both macro-averaged and micro-averaged performance, and (3) is insensitive
to the problem of overfitting to training data in some rare categories.
4.3 Overall Approach
We now describe our new thresholding strategy, rank-in-score (RinSCut), which is designed to use the strengths of two existing strategies, RCut and SCut, and to have the desirable properties discussed in the previous section.
4.3.1 Defining Ambiguous Zone
As we have already noted, a weakness of SCut is that it has a high risk of overfitting some rare categories (i.e., SCut gives unreliable thresholds that are too high or too low). To deal with this problem, we define a range around the threshold value given by SCut. For new documents whose similarity scores fall within this range, the categorization decision depends on another thresholding strategy, such as RCut, that is relatively insensitive to the overfitting problem and optimizes the global performance of the system.
To combine the strengths of RCut and SCut and thereby overcome the weaknesses of both thresholding strategies, RinSCut finds two threshold scores, ts_top(ci) and ts_bottom(ci), for each category ci. As shown in Figure 4.2, the range between these two threshold values is considered an ambiguous zone for the category ci. The computation of the two thresholds is based on ts(ci), the optimal threshold score from SCut; ND(ci), the set of negative training documents with similarity scores above ts(ci); and PD(ci), the set of positive training documents with similarity scores below ts(ci), as follows [LKK02]:
• ts_top(ci) = ts(ci) + [ ∑ dj∈ND(ci) (fi(dj) − ts(ci)) ] / |ND(ci)|        …. Formula 4.1

• ts_bottom(ci) = ts(ci) − [ ∑ dj∈PD(ci) (ts(ci) − fi(dj)) ] / |PD(ci)|        …. Formula 4.2

where fi(dj) is the similarity score of document dj in the category ci, and |ND(ci)| and |PD(ci)| are the numbers of documents in ND(ci) and PD(ci) respectively. ts_top(ci) and ts_bottom(ci) fall back to ts(ci) if no document is in ND(ci) or PD(ci), respectively. The ambiguous zone is used for new documents for which the categorization decision for this category cannot be made with the threshold ts(ci) given by SCut alone.
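A minimal sketch of Formulas 4.1 and 4.2, assuming the SCut threshold ts and the scored, labeled training documents for one category are given; the function and variable names are ours.

```python
def ambiguous_zone(ts, scored_training_docs):
    """Return (ts_top, ts_bottom) for one category (Formulas 4.1 and 4.2).

    scored_training_docs: list of (similarity_score, is_positive) pairs.
    """
    nd = [s for s, pos in scored_training_docs if not pos and s > ts]   # ND(ci)
    pd = [s for s, pos in scored_training_docs if pos and s < ts]       # PD(ci)
    ts_top = ts if not nd else ts + sum(s - ts for s in nd) / len(nd)
    ts_bottom = ts if not pd else ts - sum(ts - s for s in pd) / len(pd)
    return ts_top, ts_bottom
```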
[Figure: training documents sorted by similarity score in descending order for category ci, from the highest score hs(ci) down to the lowest score ls(ci). ts_top(ci) lies above ts(ci) from SCut and ts_bottom(ci) lies below it; the range between them is the ambiguous zone. Negative documents scoring above ts(ci) form ND(ci), and positive documents scoring below ts(ci) form PD(ci).]

Figure 4.2    Ambiguous zone between ts_top(ci) and ts_bottom(ci) for a given category ci.
For an assignment decision on a new document g with similarity score fi(g) for the category ci, RinSCut assigns a "YES" decision if fi(g) ≥ ts_top(ci) and a "NO" decision if fi(g) < ts_bottom(ci). If fi(g) falls in the ambiguous zone between ts_top(ci) and ts_bottom(ci), the final assignment decision depends on the rank threshold k from RCut.
This broad approach has been explored elsewhere, using a different method based on user-defined parameters [LKKR02]. A possible modification to the computation of ts_top(ci) and ts_bottom(ci) is to apply user-defined maximum and minimum values to these two thresholds in order to prevent them from becoming too high or too low. This can be done with user-defined real-valued parameters, para_max and para_min, between 0 and 1, which are used to compute ts_max_top(ci), ts_min_top(ci), ts_max_bottom(ci), and ts_min_bottom(ci):
• ts_max_top(ci) = ts(ci) + [hs(ci) − ts(ci)] × para_max
• ts_min_top(ci) = ts(ci) + [hs(ci) − ts(ci)] × para_min
• ts_max_bottom(ci) = ts(ci) − [ts(ci) − ls(ci)] × para_max
• ts_min_bottom(ci) = ts(ci) − [ts(ci) − ls(ci)] × para_min
where hs(ci) is the highest similarity score and ls(ci) is the lowest score in a given category ci. The modified thresholds mod_ts_top(ci) and mod_ts_bottom(ci) are then defined as follows:

• mod_ts_top(ci)    = ts_max_top(ci)    if ts_top(ci) > ts_max_top(ci)
                      ts_min_top(ci)    if ts_top(ci) < ts_min_top(ci)
                      ts_top(ci)        otherwise

• mod_ts_bottom(ci) = ts_max_bottom(ci) if ts_bottom(ci) < ts_max_bottom(ci)
                      ts_min_bottom(ci) if ts_bottom(ci) > ts_min_bottom(ci)
                      ts_bottom(ci)     otherwise
Their expected locations in the ordered list of similarity scores of training documents
are shown in Figure 4.3. This method has been investigated and evaluated in
[LKKR02]. It appears that users find it difficult to suggest such parameters.
Accordingly, we test only the unmodified version of thresholds, ts_top and ts_bottom,
in this thesis.
[Figure: training documents sorted by similarity score in descending order for category ci, from hs(ci) down to ls(ci). Above ts(ci), ts_max_top(ci) and ts_min_top(ci) bound the possible location of mod_ts_top(ci); below ts(ci), ts_min_bottom(ci) and ts_max_bottom(ci) bound the possible location of mod_ts_bottom(ci).]

Figure 4.3    The locations of ts_max_top(ci), ts_min_top(ci), ts_max_bottom(ci), ts_min_bottom(ci), mod_ts_top(ci), and mod_ts_bottom(ci) in the ordered list of similarity scores.
4.3.2 Defining RCut Threshold
For new documents whose similarity scores fall within the ambiguous zone for a given category ci, the rank threshold k is used to make the final decision. This threshold k can be defined in two ways: by optimizing the global performance (GRinSCut) or by optimizing the local performance of each category (LRinSCut). In GRinSCut, the same k value is applied to all categories, as in RCut; in LRinSCut, each category may have a different k value.
Using a globally optimized k value may be a good choice if we want high micro-averaged performance. But when the local performance of each category is the primary concern, and some rare categories prevent the use of SCut, LRinSCut will be more effective.
For a category ci and a new document g whose similarity score fi(g) lies between ts_top(ci) and ts_bottom(ci), RinSCut sorts the similarity scores of the categories to obtain the k top-ranking categories. Its final decision is "YES" if the category ci is among these k categories and "NO" if it is not.
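Putting the pieces together, the following sketch shows the RinSCut decision for one category, assuming the two thresholds and the rank threshold k have already been determined; the names and data structures are ours, not the thesis code.

```python
def rinscut_decision(score, ts_top, ts_bottom, category, all_scores, k):
    """Decide YES/NO for one category under RinSCut.

    score      : f_i(g), the similarity of the new document to this category
    all_scores : dict category -> similarity score of the new document
    k          : rank threshold from RCut (global for GRinSCut,
                 per-category for LRinSCut)
    """
    if score >= ts_top:
        return True                     # confidently above the ambiguous zone
    if score < ts_bottom:
        return False                    # confidently below the ambiguous zone
    # Ambiguous zone: fall back to the rank threshold.
    top_k = sorted(all_scores, key=all_scores.get, reverse=True)[:k]
    return category in top_k
```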
Chapter 5
Evaluation I: KAN and RinSCut
We described, in Chapters 3 and 4, our approaches, KAN and RinSCut, for building accurate classifiers from a small number of training examples. In this chapter we discuss the details of the evaluative experiments. The chapter begins by providing information on the data sets used in the experiments. Then, we explain how the raw documents were preprocessed. Finally, we discuss the experimental results, which demonstrate the efficiency and effectiveness of our approaches to text categorization.
5.1 Data Sets Used
The experiments in this thesis were conducted using two data sets, Reuters-21578
[R21578] and 20-Newsgroups [20News]. We used these corpora because of the
following characteristics they have in common:
1. The documents in the corpora are real-world ones.
What we mean by real-world documents is that they are not machine-generated. All the documents in these corpora were written by humans for some purpose other than testing text categorization systems.
2. These corpora can be considered standard data sets that have been used in testing many text categorization systems.
These corpora are publicly available and have been widely used for experimental work in various text categorization systems (see, for example, [CS96, Joa98, YX99, HK00] for the Reuters-21578 and [MN98, NMTM00, NG00] for 20-Newsgroups).
While many researchers have used both corpora in the tests of their
categorization systems, there has unfortunately been no standard way of using
these corpora and, as a result, many of these experiments have been carried out
in different conditions. This makes meaningful comparisons among the
systems somewhat difficult. In order to guarantee reliable cross-system
comparison, we have to conduct the experiments: (1) on the same documents
and categories, (2) with the same splitting method for the training and test set,
(3) with the same evaluation method, and (4) using the same part of each
document (i.e., title or body, or both).
3. Each corpus contains a large number of documents.
In general, the evaluations for text categorization systems require large
numbers of labeled documents. A small set of documents that is insufficient to
prepare both training and test set would result in biased experimental results.
There are large numbers of pre-categorized documents (about 20,000) in each
data set. This is enough for assessing the unbiased performance of a particular
text categorization system.
4. Each corpus contains many predefined categories.
Many categories are available in each corpus (20 categories in the 20-Newsgroups and 135 categories in the "Topics" group of the Reuters-21578). Some text categorization tasks involve a very small number of categories (for example, spam-mail filtering is a 2-class text categorization task) and are generally easier.
In the following subsections, we explain each data set in detail.
5.1.1 Reuters-21578
The Reuters-21578 corpus consists of a set of 21,578 Reuters newswire articles
from 1987. Each document was assigned to categories by human experts from Reuters
Ltd and Carnegie Group Inc. From 1990 through 1993, the formatting and
documentation of articles was done by David D. Lewis and his coworkers.
This first version of the data was called the Reuters-22173 [R22173] consisting of
22,173 Reuters newswire articles. In 1996, based on the Reuters-22173 corpus, further
formatting was done, a variety of typographical errors were corrected, and 595
duplicate articles were removed. This new version of the collection has 21,578
documents and thus is called the Reuters-21578 collection. It has been widely used for
the experimental work in text categorization.
Recently, a new Reuters collection, called Reuters Corpus Volume 1 [RCV1], has
been made available and will likely replace the Reuters-21578 as the standard Reuters
collection for text categorization. We did not use this new version of Reuters collection
because it was not available when our research started.
In the Reuters-21578 collection, there are 672 categories in total across 6 different
groups: “Topics”, “Places”, “People”, “Organizations”, “Exchanges”, and
“Companies”. In this corpus, “Companies” has no categories. Most text categorization
research has been done using the “Topics” group that consists of 135 economic subject
categories.
In using the Reuters-21578 corpus for text categorization, there are 3 standardized
predefined splitting methods for dividing a set of available data into a training set and a
test set. These splitting methods are the Modified Lewis (“ModLewis”), the Modified
Apte (“ModApte”), and the Modified Hayes (“ModHayes”). We chose the
“ModApte” splitting method since it has been the most widely used splitting method
in text categorization evaluations on this corpus.
We used only the body of each article, since many articles have no title and the text of the title usually appears in the body. Figure 5.1 provides an example of an extracted Reuters-21578 document. It is document #9 in the earn category, and was used for training.
Champion Products Inc said its board of directors approved a two-for-one
stock split of its common shares for shareholders of record as of April 1, 1987.
The company also said its board voted to recommend to shareholders at the
annual meeting April 23 an increase in the authorized capital stock from five
mln to 25 mln shares.
Figure 5.1    An example of a Reuters-21578 document (identification number 9, assigned to the earn category and used in the training set).
Instead of analyzing all 135 categories in the “Topics” group, we chose the
categories having at least 10 articles in both training set and test set. This results in 53
categories. This gives a corpus of 6,984 training documents and 3,265 test documents
across these 53 categories. Many documents in both training and test set may belong
to multiple categories. The 53 categories and the numbers of training and test
documents in each category are listed in three tables (Tables 5.1, 5.2, and 5.3). In these
tables, the rows are ordered based on the number of training documents, from highest
to lowest.
The documents in the Reuters-21578 are unevenly distributed across the
categories. Most are assigned to the first two categories: “earn” and “acq” (See Table
5.1). Since micro-averaged performance depends heavily on such frequent categories,
adding other extremely rare categories into the corpus would have a very minor impact
on the resulting micro-averaged performance. As a result, if the micro-averaged
performance is considered, our micro-averaged performance results in this thesis might
be comparable with others which include all the categories with small numbers of
articles. For the macro-averaged performance, our results may not be comparable since
the performance of rare categories has equal importance with the frequent categories.
Table 5.1 The 53 categories of the Reuters-21578 data set used in our experiments (part 1).

category name     number of training documents    number of test documents
earn              2,709 (38.8%)                   1,066 (32.6%)
acq               1,488 (21.3%)                     722 (22.1%)
money-fx            460 (6.6%)                      222 (6.8%)
grain               394 (5.6%)                      179 (5.5%)
crude               349 (5.0%)                      215 (6.6%)
trade               337 (4.8%)                      177 (5.4%)
interest            289 (4.1%)                      133 (4.1%)
wheat               198 (2.8%)                       89 (2.7%)
ship                191 (2.7%)                      103 (3.2%)
corn                159 (2.3%)                       63 (1.9%)
sugar               118 (1.7%)                       57 (1.7%)
oilseed             117 (1.7%)                       65 (2.0%)
coffee              110 (1.6%)                       33 (1.0%)
dlr                  96 (1.4%)                       72 (2.2%)
gold                 94 (1.3%)                       39 (1.2%)
gnp                  92 (1.3%)                       61 (1.9%)
money-supply         87 (1.2%)                       39 (1.2%)
veg-oil              86 (1.2%)                       50 (1.5%)
livestock            73 (1.0%)                       39 (1.2%)
soybean              73 (1.0%)                       38 (1.2%)
nat-gas              72 (1.0%)                       54 (1.7%)
Table 5.2 The 53 categories of the Reuters-21578 data set used in our experiments (part 2).

category name     number of training documents    number of test documents
bop                  62 (0.9%)                       39 (1.2%)
cpi                  60 (0.9%)                       41 (1.3%)
carcass              50 (0.7%)                       25 (0.8%)
cocoa                50 (0.7%)                       18 (0.6%)
reserves             48 (0.7%)                       25 (0.8%)
copper               47 (0.7%)                       30 (0.9%)
jobs                 41 (0.6%)                       27 (0.8%)
iron-steel           40 (0.6%)                       25 (0.8%)
cotton               38 (0.5%)                       24 (0.7%)
yen                  36 (0.5%)                       22 (0.7%)
ipi                  35 (0.5%)                       22 (0.7%)
rubber               35 (0.5%)                       14 (0.4%)
rice                 35 (0.5%)                       32 (1.0%)
alum                 33 (0.5%)                       25 (0.8%)
barley               33 (0.5%)                       15 (0.5%)
gas                  30 (0.4%)                       24 (0.7%)
meal-feed            30 (0.4%)                       20 (0.6%)
palm-oil             29 (0.4%)                       13 (0.4%)
sorghum              23 (0.3%)                       11 (0.3%)
silver               21 (0.3%)                       15 (0.5%)
pet-chem             20 (0.3%)                       21 (0.6%)
Table 5.3 The 53 categories of the Reuters-21578 data set used in our experiments (part 3).

category name            number of training documents    number of test documents
rapeseed                    18 (0.3%)                       17 (0.5%)
tin                         18 (0.3%)                       15 (0.5%)
wpi                         17 (0.2%)                       12 (0.4%)
strategic-metal             16 (0.2%)                       16 (0.5%)
lead                        15 (0.2%)                       20 (0.6%)
orange                      15 (0.2%)                       10 (0.3%)
hog                         15 (0.2%)                       11 (0.3%)
heat                        14 (0.2%)                       11 (0.3%)
soy-oil                     14 (0.2%)                       11 (0.3%)
fuel                        13 (0.2%)                       15 (0.5%)
soy-meal                    13 (0.2%)                       13 (0.4%)
total (part 1, 2, and 3)  6,984 (100.0%)                  3,265 (100.0%)
In this thesis, to construct the learning curves of the implemented classifiers as a function of the number of training documents used, we conducted a series of experiments with increasing numbers of training examples. The training set for each round is a superset of the one for the previous round. The same test set (i.e., the 3,265 test examples in Table 5.3) was used for testing the generated classifiers in each round.
5.1.2 20-Newsgroups
The 20-Newsgroups data set is a collection of approximately 20,000 newsgroup
articles posted to 20 different Usenet discussion groups. Since it was set up and used
by Ken Lang in [Lan95], this corpus has been a popular benchmark data set that has
been frequently used for experiments in various text classification systems.
Table 5.4 The 20 categories of the 20-Newsgroups corpus.
category name
alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc
Table 5.4 lists the 20 newsgroups. Unlike the Reuters-21578 corpus, the documents in this corpus are partitioned (nearly) evenly across the 20 newsgroups, and each document belongs to exactly one newsgroup. As a result, there are about 1,000 documents in each newsgroup. Some of the newsgroups are very closely related to each other and can be considered as forming a hierarchical structure. For example, the two categories talk.politics.guns and talk.politics.mideast can be regarded as child categories of the super-category talk.politics. Because of this hierarchical relationship among the newsgroups, 20-Newsgroups has also been used as a benchmark data set for hierarchical text categorization [MRMN98].
Figure 5.2 shows an example document from the 20-Newsgroups corpus. It was
posted to the alt.atheism newsgroup. In using this corpus for our experiments, we used
the text in the “Subject” header and body. Making use of other textual information like
“From” and “Sender” headers may result in performance improvements. We did not
consider this in this thesis since it is not general, but specific to the characteristics of a
particular data set. Also, most other evaluations [MN98, NMTM00] on this data set
removed header information.
The task of assigning a document to a single category (single-class text
categorization) is quite different from the task of assigning a document to a variable
number of relevant categories (multi-class text categorization). Because each document
in the 20-Newsgroups corpus belongs to exactly one category, this corpus has been
primarily used for the single-class text categorization that assigns each document to the
single, most appropriate category. Other data sets, such as the Reuters-21578 data set
are used for evaluating a system's ability to perform the multi-class text categorization
task.
Unlike the Reuters-21578 corpus, the 20-Newsgroups corpus has no predefined method for splitting it into a training and a test set. The splitting method we adopted is k-fold cross-validation, discussed in Section 2.5.3, with k set to 5. As a result, in each experiment, 20 percent of the documents were used for the test set and the remaining 80 percent for the training set. Due to time constraints, this 5-fold cross-validation was performed just once. The learning curve of each classifier was generated by selecting varying numbers of training documents from the 80 percent of the examples making up the training set.
Path:cantaloupe.srv.cs.cmu.edu!magnesium.club.cc.cmu.edu!news.sei.cmu.edu!fs7.e
ce.cmu.edu!europa.eng.gtefsd.com!darwin.sura.net!ra!cs.umd.edu!mimsy!mangoe
From: [email protected] (Charley Wingate)
Newsgroups: alt.atheism
Subject: Re: A Little Too Satanic
Message-ID: <[email protected]>
Date: 25 Apr 93 02:48:25 GMT
Sender: [email protected]
Lines: 34
Kent Sandvik and Jon Livesey made essentially the same response, so this time
Kent's article gets the reply:
>I agree, but this started at one particular point in time, and we
>don't know when this starting point of 'accurately copied scriptures'
>actually happened.
This begs the question of whether it ever "started"-- perhaps because accuracy was
always an intention.
>Even worse, if the events in NT were not written by eye witness accounts (a
>high probability looking at possible dates when the first Gospels were
>ready) then we have to take into account all the problems with information
>forwarded with the 'telephone metaphor', indeed.
It makes little difference if you have eyewitnesses or people one step away
(reporters, if you will). As I said earlier, the "telephone" metaphor is innately
bad, because the purpose of a game of telephone is contrary to the aims of writing
these sorts of texts. (Also, I would point out that, by the standards generally
asserted in this group, the distinction between eyewitnesses and others is hollow,
since nobody can be shown to be an eyewitness, or indeed, even shown to be the
author of the text.)
There is no evidence that the "original" texts of either the OT or the NT are
largely lost over time in a sea of errors, "corrections", additions and deletions. In
the case of the NT, the evidence is strongly in the other direction: the Textus R.
and the Nestle-Aland text do not differ on more than a low level of significance.
It is reasonable to assume a similar situation for the OT, based on the NT as a
model.
-C. Wingate
+ "The peace of God, it is no peace,
+ but strife closed in the sod.
[email protected]
+ Yet, brothers, pray for but one thing:
tove!mangoe
+ the marv'lous peace of God."
Figure 5.2    An example of a 20-Newsgroups document (assigned to the alt.atheism newsgroup).
5.2 Text Preprocessing
All raw documents in both the training and the test set must be converted into representations suitable for efficient learning. Converting a raw text into such a representation involves three sub-processes: (1) identifying meaningful features, which could be single words or phrases, (2) giving an appropriate weight to each feature, and (3) finding a reduced feature set that is computationally tractable for the machine learning algorithms without sacrificing text categorization performance. Because of the diverse techniques available for each sub-process, the overall conversion involves many somewhat arbitrary decisions. Our principle here is to keep the conversion simple by using the typical and most popular methods for similarity-based text categorization. This section describes the methods chosen for preprocessing the raw documents.
5.2.1 Feature Extraction
Feature extraction is an automatic procedure for detecting the meaningful tokens in a text. These extracted tokens are called features in the information retrieval field. This procedure is one of the critical natural language processing tasks for most text classification systems in the domain of textual information analysis.
There are several possible tokenizing methods. In this thesis, we extracted tokens from a raw document by breaking the string of characters at white space and at all characters other than alphabetical and numerical ones. We then trimmed the tokens to remove any surrounding punctuation marks (e.g., "shares." becomes "shares", and a quoted "telephone" loses its quotation marks). We did not correct any misspellings in the extracted tokens. We also omitted numeric information, considering only alphabetical characters as candidate features. All the extracted tokens were converted to lower case. The common words were then removed using the stop-list in Appendix A, and a stemming algorithm was applied to the remaining tokens. Figure 5.3 shows the resulting tokenized file after the feature extraction process for the example document in Figure 5.1.
champion product inc board director approv two stock split common share
sharehold record april compani board vote recommend sharehold annual meet
april increas author capit stock five mln mln share
Figure 5.3    The resulting tokenized file after feature extraction for the example document in Figure 5.1.

5.2.2 Feature Weighting
In these experiments, we used the vector space model, the common representation
method adopted by most similarity-based text classification systems. In the vector
space model, the weight of each feature is usually computed by the TFIDF weighting
scheme.
As shown in Figure 5.4, the TFIDF weights of features are computed based on
two frequency files: the term frequency (TF) and document frequency (DF) file. For
each document, the term frequency file, in which each feature has an integer number
indicating the term frequency in that document, is generated based on its tokenized file
as shown in Figure 5.3. This term frequency file is then used for updating the global
document frequency file, in which an integer for each feature indicates the number of
training documents that contain this feature in the training set. Based on these two
frequency files and the TFIDF weighting equation (discussed previously in Section
2.2.2), each raw document is finally transformed to our target representation, a vector
space model.
[Figure: for each document, a term frequency file (each feature and its term frequency, e.g., annual, approv, april, author, board, capit, ...); a global document frequency file (each feature and the number of training documents containing it, e.g., annual, approv, april, asset, author, avg, billion, board, ...); and the resulting TFIDF file for each document (each feature and its TFIDF weight, e.g., annual 0.225, approv 0.261, april 0.255)]

Figure 5.4    The TFIDF weighting scheme based on the term frequency and document frequency file.
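A minimal sketch of the weighting step in Figure 5.4, applied to already-tokenized documents (such as the output in Figure 5.3). It uses the common tf × log(N/df) form with cosine normalization as an illustration; the exact TFIDF variant of Section 2.2.2 may differ in detail, and the function name is ours.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Turn tokenized documents into TFIDF vectors (cf. Figure 5.4)."""
    n_docs = len(tokenized_docs)
    # Global document frequency file: how many documents contain each feature.
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))

    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)                        # term frequency file for this document
        weights = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
        norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
        vectors.append({w: v / norm for w, v in weights.items()})
    return vectors
```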
5.2.3 Feature Selection
Even after applying the stop-list and the stemming algorithm, the feature space usually has a dimensionality too high to be computationally tractable for most learning algorithms. This problem highlights the need for an aggressive feature reduction method. Previous work on feature selection [YP97] demonstrated that better results can be achieved with a reduced feature set.
Feature selection should target two types of features: (1) extremely low-frequency features and (2) high-frequency features that appear almost evenly across categories. To reduce the first type, we applied a document frequency filter, selecting features occurring in at least 2 documents of the same category. To reduce the second type, we adopted information gain, explained in Section 2.2.3. The features remaining after the document frequency filter are sorted in descending order of information gain, and we then select the top-ranked features for each category.
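A sketch of the two-stage selection just described: a per-category document frequency filter followed by ranking on information gain. The information gain computation uses the standard binary (feature present/absent) formulation, which may only approximate the exact equation of Section 2.2.3; the function names are ours.

```python
import math
from collections import Counter

def select_features(docs, labels, category, min_df=2, top_n=50):
    """Keep features occurring in >= min_df positive documents of `category`,
    then rank them by information gain for that category and keep top_n."""
    positives = [set(d) for d, y in zip(docs, labels) if y == category]
    negatives = [set(d) for d, y in zip(docs, labels) if y != category]

    # Stage 1: document frequency filter inside the category.
    df_pos = Counter(w for d in positives for w in d)
    candidates = [w for w, c in df_pos.items() if c >= min_df]

    # Stage 2: rank the survivors by information gain.
    def entropy(p):
        return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    n_pos, n_neg = len(positives), len(negatives)
    n_all = n_pos + n_neg
    base = entropy(n_pos / n_all)

    def info_gain(w):
        pos_with = sum(w in d for d in positives)
        neg_with = sum(w in d for d in negatives)
        with_w, without_w = pos_with + neg_with, n_all - (pos_with + neg_with)
        gain = base
        if with_w:
            gain -= (with_w / n_all) * entropy(pos_with / with_w)
        if without_w:
            gain -= (without_w / n_all) * entropy((n_pos - pos_with) / without_w)
        return gain

    return sorted(candidates, key=info_gain, reverse=True)[:top_n]
```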
5.3 Experiments on the Number of Features
Our preliminary experimental results in [LKK00] showed that KAN gives a significant performance improvement over a typical similarity-based learning algorithm, Rocchio. In this section, we investigate further the effectiveness of the similarity-based learning algorithms explored in this thesis, using the various feature subsets obtained by applying the feature selection algorithms.
The main goal of the following set of experiments is to verify whether a larger number of input features always gives better results. If this is not the case for a given learning algorithm, we want to find an optimal feature subset size, one that is more effective than the full feature set.
5.3.1 Experimental Setup
We have implemented the four similarity-based learning algorithms (Rocchio, WH, k-NN, and KAN) explained in Section 2.3.1 and Chapter 3. Their parameter settings, used throughout these evaluations, are as follows:
(1) Rocchio
To compute the vectors of the categories (i.e., the profiles of the categories), we used β = 16 and γ = 4 in the equation in Section 2.3.1.1, as suggested in [BSA94].
(2) WH
We set the learning rate parameter η to 0.25, as this value has been used for implementations of WH in other literature [LSCP96, LH98].
(3) k-NN
The values of k used in these experiments for k-NN are 10, 20, 30, 40, and 50; we then chose the k value with the best performance in each experiment.
(4) KAN
For computing the discriminating power S of each feature, we used 0.4 for λ.
The data sets used in these experiments are the Reuters-21578 and the 20-Newsgroups. All the training examples available in each split were used for each experiment. The thresholding strategy adopted for each data set is SCut for the Reuters-21578 (our corpus for the multi-class text categorization task) and RCut for the 20-Newsgroups (our corpus for the single-class text categorization task). The optimal RCut value for the 20-Newsgroups is always 1, since every document belongs to exactly one category. We did not apply the RinSCut strategy because our focus in these experiments is on the learning algorithms with feature selection.
In order to construct the learning curve of a given learning algorithm over different sizes of input feature sets, we ran a series of experiments on each corpus, varying the number of input features. The feature selection, based on document frequency and information gain, selected from 10 to 250 features for each category.
Tables 5.5 and 5.6 show the statistics of the unique features in the Reuters-21578 and the 20-Newsgroups, respectively. In these tables, the first column shows the number of features selected for each category, the second column the number of unique features in the training set, and the last column the average number of unique features in each category. Note that, once the number of selected features per category exceeds roughly 110, there is little difference in the number of unique features.
Table 5.5 Statistics for the unique features in the Reuters-21578 corpus.

number of features         number of unique features    average number of unique
chosen for each category   in the training set          features in each category
10                            338                        4.5
30                            967                       13.0
50                          1,623                       22.4
70                          2,210                       30.4
90                          2,729                       37.1
110                         3,193                       42.5
130                         3,614                       47.5
150                         3,952                       50.9
170                         4,228                       53.3
190                         4,458                       55.3
210                         4,649                       56.8
230                         4,800                       57.9
250                         4,939                       59.1
Table 5.6 Statistics for the unique features in the 20-Newsgroups corpus.

number of features         number of unique features    average number of unique
chosen for each category   in the training set          features in each category
10                            185.2                      8.6
30                            521.8                     22.9
50                            797.6                     32.6
70                          1,021.8                     38.7
90                          1,192.6                     41.7
110                         1,312.0                     42.2
130                         1,401.4                     41.9
150                         1,461.6                     41.0
170                         1,491.4                     39.5
190                         1,503.6                     37.9
210                         1,511.4                     37.4
230                         1,512.6                     36.9
250                         1,514.6                     36.8
This indicates that selecting more features per category eventually reduces the number of unique features per category; that is, it increases the number of common features that occur evenly across many categories. For a given learning algorithm, if the unreduced full feature set does not perform as well as some reduced feature subsets, the high number of common features in the full feature space might provide an explanation.
5.3.2 Results and Analysis
The graphs in Figures 5.5 to 5.8 show the F1 measures (see Section 2.5.1 for the definition) of the four similarity-based learning algorithms on the Reuters-21578 data set. Each figure depicts the two F1 measures, micro- and macro-averaged, of a given algorithm, using all the training data available (i.e., the 6,984 documents shown in Table 5.3). In these figures, the Y axis indicates F1 performance on the test set and the X axis indicates the number of input features at which that F1 performance was achieved.
Throughout the experimental results, we can observe that the macro-averaged F1 is clearly lower than the micro-averaged F1. One possible reason is that, in the Reuters-21578, many of the 53 categories are rare, with only a small number of documents in the training set, as shown in Tables 5.1 to 5.3. It is much more difficult to categorize new documents with classifiers constructed from such small training sets. This seems to cause very low F1 measures for the rare categories and, consequently, a low macro-averaged F1.
[Figures 5.5 to 5.8: micro- and macro-averaged F1 plotted against the number of features per category (10 to 250)]

Figure 5.5    F1 performance of Rocchio on the Reuters-21578 corpus (6,984 training documents used).
Figure 5.6    F1 performance of k-NN on the Reuters-21578 corpus (6,984 training documents used).
Figure 5.7    F1 performance of WH on the Reuters-21578 corpus (6,984 training documents used).
Figure 5.8    F1 performance of KAN on the Reuters-21578 corpus (6,984 training documents used).
From these experimental results, we can see that, in the Reuters-21578 corpus, all
the similarity-based learning algorithms give the best results with a relatively small
number of input features (around 10 to 70). Adding more features into the feature
space fails to produce any performance benefits while it requires more time to run the
classifiers. This result is consistent with the findings reported on other feature
selection studies [Mla98, MG99, Yan99], even though they used different test data
sets.
Figures 5.9 to 5.12 present the experimental results on the 20-Newsgroups data set. Unlike the results on the Reuters-21578 corpus, in this case all the learning algorithms give nearly equal micro- and macro-averaged F1 performance. Such identical results are obtained when the denominator of the evaluation measure is the same for every category. For example, consider the computation of the micro- and macro-averaged recall. With Di the denominator and Ni the numerator for category i in both measures, and |C| the number of categories, the micro- and macro-averaged recall, MI-Recall and MA-Recall respectively (as explained in Section 2.5.2), are defined as follows:
MI-Recall = (N1 + ... + N|C|) / (D1 + ... + D|C|)

MA-Recall = (N1/D1 + ... + N|C|/D|C|) / |C|
          = N1 / (D1 × |C|) + ... + N|C| / (D|C| × |C|)

where Di = TPi + FNi.

When the number of correct examples in each category (i.e., TPi + FNi) is equal across all the categories, the denominators Di become equal, say to D. So MI-Recall will be equal to MA-Recall, as follows:

MA-Recall = (N1 + ... + N|C|) / (D × |C|) = MI-Recall
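A quick numerical check of this identity, using hypothetical per-category counts in which every category has the same number of correct test documents:

```python
# Hypothetical counts: 4 categories, each with D_i = TP_i + FN_i = 100 correct documents.
N = [80, 60, 90, 70]          # correctly assigned documents (TP_i)
D = [100, 100, 100, 100]      # TP_i + FN_i per category

mi_recall = sum(N) / sum(D)
ma_recall = sum(n / d for n, d in zip(N, D)) / len(D)
print(mi_recall, ma_recall)   # 0.75 0.75 -- identical when all D_i are equal
```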
Note the following characteristics of the 20-Newsgroups: (1) the documents are evenly distributed across the 20 categories, and (2) each document belongs to only one category. As discussed in Section 5.1.2, we select 20 percent of the documents in each category for the test set. Because the documents are evenly distributed across the categories in the full data set, all the categories again have the same number of documents in the test set, and each document has only one correct category label. These characteristics of the 20-Newsgroups make the denominator of recall the same for each category (i.e., the value of TPi + FNi is the same in each category). By contrast, the F1 measure combines recall and precision, as discussed in Section 2.5.1, and involves three entries, TPi, FNi, and FPi (but not TNi); the total of these three entries can differ between categories even with the above characteristics of the 20-Newsgroups. We think that the equal number of training documents in each category may be another main reason for the almost identical micro- and macro-averaged F1 performance in these experiments. We will investigate this issue further in the following section.
From the charts in Figures 5.9 to 5.12, we can also observe that each similarity-based learning algorithm shows a learning curve similar to the one obtained on the Reuters-21578 corpus. In other words, for each learning algorithm there is no significant F1 advantage from a larger number of input features. As on the Reuters-21578, we can conclude that feature selection is advantageous both for fast pre- and post-processing and for performance.
[Figures 5.9 to 5.12: micro- and macro-averaged F1 plotted against the number of features per category (10 to 250)]

Figure 5.9    F1 performance of Rocchio on the 20-Newsgroups corpus (all the training documents used in each split).
Figure 5.10   F1 performance of k-NN on the 20-Newsgroups corpus (all the training documents used in each split).
Figure 5.11   F1 performance of WH on the 20-Newsgroups corpus (all the training documents used in each split).
Figure 5.12   F1 performance of KAN on the 20-Newsgroups corpus (all the training documents used in each split).
As shown in Figure 5.11, the WH algorithm gives its best results at 50 features. For the other three learning algorithms, adding features beyond 50 does not give a significant performance improvement, while using fewer than 50 input features gives somewhat worse performance than the larger feature sets.
Finding the optimal feature subset size is very time consuming, since it requires repeating the same experiment at every possible feature set size. Furthermore, the optimal size could change with the size of the training data. For all the similarity-based algorithms in the following set of experiments, we use 50 features per category. This appears close to the optimal feature subset size, since the result each algorithm achieved at 50 is very close to, or the same as, its best result, and a small variance in this number has little effect.
5.4 Experiments on the Number of Training Examples
In the previous section, the experimental results showed that increasing the
number of input features gave no benefits for our investigated similarity-based machine
learning algorithms. The best text categorization results were for reduced feature sets.
Even so, we observed small differences between their performance results. Note,
however, that these experimental results are based on the use of all the training
examples available in a training set that usually contains several thousands labeled
documents.
In an authentic situation, it is not realistic to prepare such large numbers of labeled
training documents, since manually labeling them imposes a huge cost on the human
experts. Consequently, an interesting question arises as to whether a particular
classifier works reasonably well with the small number of training examples available at a
particular time. There is reason to predict that our methods, KAN and the variants of
RinSCut, may work better than the other techniques, since they have been designed to
use more information about the labeled training documents.
In this section, we explore this issue. To draw the learning curve of each
classifier as a function of the amount of input training data, we evaluated each
classifier on various sizes of training set. This is an important issue, not only for
the classifiers themselves, but also for selective sampling, since the
effectiveness of selective sampling might be affected by the quality of classifiers that
are usually built from a small number of training examples.
Some results of these experiments have been published previously in [LKK02,
LKKR02]. In this section, we report much more extensive experiments on both
collections (for example, 5-fold cross-validation on the 20-Newsgroups) and present
these updated results.
5.4.1 Experimental Setup
Like the experiments in Section 5.3, we used four machine learning algorithms
(Rocchio, WH, k-NN, and KAN). These learning algorithms were evaluated with
SCut on the Reuters-21578 corpus and with RCut on the 20-Newsgroups corpus.
For the evaluation of our thresholding strategies, the two variants of RinSCut, we used
only the Reuters-21578, since RinSCut has been developed for multi-class text
categorization tasks. The parameter settings for the implemented learning algorithms
are the same as those described in Section 5.3.1.
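For readers unfamiliar with the two conventional strategies, the sketch below gives a simplified rendering of their decision rules (the exact procedures and threshold tuning are those described in Chapter 4; the function and parameter names here are ours): RCut assigns the t top-ranked categories to each document, while SCut assigns every category whose similarity score reaches that category's own threshold ts(ci).

    # Simplified RCut and SCut decision rules for a single document.

    def rcut(scores, t=1):
        """RCut: assign the t top-ranked categories to the document.
        scores: dict mapping category -> similarity score."""
        ranked = sorted(scores, key=scores.get, reverse=True)
        return set(ranked[:t])

    def scut(scores, thresholds):
        """SCut: assign every category whose score reaches its own threshold ts(ci)."""
        return {c for c, s in scores.items() if s >= thresholds[c]}

    scores = {"earn": 0.81, "acq": 0.44, "grain": 0.12}
    print(rcut(scores, t=1))                                        # {'earn'}
    print(scut(scores, {"earn": 0.5, "acq": 0.4, "grain": 0.3}))    # {'earn', 'acq'}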
To construct a learning curve for each classifier, we conducted each experiment by
increasing the size of the training set over the 10 rounds shown in Table 5.7, which
gives the percentage of the overall training examples and the number of training
examples used in each round for each data set. The training set for each round is a
superset of the one for the previous round.
One question in conducting these experiments is how many times we should
repeat the experiment for each round. Intuitively, the more often we repeat the
experiment for each round, the closer the observed performance comes to a meaningful,
normal performance distribution. With this in mind, and with concern for time, we chose
three repetitions. Thus, each 10-round experiment was conducted 3 times (repeating the
experiment more times may be necessary to gain more stable results) and, so, the
resulting performance of a particular classifier in each round is the average of these
three experimental results.
For the experiment in each round, we applied document frequency and
information gain for feature selection and took 50 features per category, as
discussed in Section 5.3.2.
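The following sketch illustrates this per-category selection step under our reading of it (it is not the thesis implementation; the exact scoring and any tie-breaking follow Section 5.3): candidate terms are first screened by document frequency and the top k are then kept by information gain for a category-versus-rest split.

    import math
    from collections import Counter

    def information_gain(docs, labels, term, category):
        """Information gain of a binary term for a category-vs-rest split."""
        n = len(docs)
        def entropy(pos, total):
            if total == 0 or pos in (0, total):
                return 0.0
            p = pos / total
            return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
        pos_total = sum(1 for l in labels if l == category)
        n_t = sum(1 for d in docs if term in d)
        pos_t = sum(1 for d, l in zip(docs, labels) if term in d and l == category)
        pos_not = pos_total - pos_t
        prior = entropy(pos_total, n)
        cond = (n_t / n) * entropy(pos_t, n_t) + ((n - n_t) / n) * entropy(pos_not, n - n_t)
        return prior - cond

    def select_features(docs, labels, category, k=50, min_df=2):
        """Screen terms by document frequency, then keep the top-k by IG."""
        df = Counter(t for d in docs for t in set(d))
        candidates = [t for t, c in df.items() if c >= min_df]
        candidates.sort(key=lambda t: information_gain(docs, labels, t, category),
                        reverse=True)
        return candidates[:k]

    docs = [["wheat", "grain", "export"], ["grain", "price"], ["stock", "price"]]
    labels = ["grain", "grain", "acq"]
    # e.g. ['grain', 'stock']: IG is symmetric for positive and negative indicators.
    print(select_features(docs, labels, "grain", k=2, min_df=1))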
Table 5.7 The percentage of training data and the number of training documents used in each round.

round    Reuters-21578                         20-Newsgroups
         percentage of        number of        percentage of        number of
         training examples    examples         training examples    examples
1        1.5%                 106              1.0%                 160
2        3.0%                 212              2.0%                 320
3        5.3%                 371              3.0%                 480
4        9.9%                 689              5.0%                 800
5        18.2%                1,272            10.0%                1,600
6        34.5%                2,409            16.0%                2,560
7        52.9%                3,696            32.0%                5,120
8        73.5%                5,136            50.0%                8,000
9        88.8%                6,202            70.0%                11,200
10       100.0%               6,984            100.0%               15,998

5.4.2 Results and Analysis
Figures 5.13 and 5.14 depict the learning curve for each similarity-based learning
algorithm with SCut in the micro and macro-averaged F1, respectively. These charts
show the learning trace of each algorithm as a function of the size of training data used
to build each classifier on the Reuters-21578 corpus.
Rocchio gives lower performance than the other algorithms on both measures
across all the rounds, with the exception of the lowest micro-averaged performance of
k-NN at rounds 1 and 2. However, k-NN shows similar performance to KAN after
round 3 in the micro-averaged F1. WH shows stable performance, with no extremely
low F1 measure at any round in either the micro or macro-averaged F1, but
its performance is consistently slightly lower than the best performance across all the
rounds. In macro-averaged F1, KAN, in general, achieves better results than the other
algorithms.
Our new algorithm, KAN, does not show better performance than the other
three algorithms at every round. However, we can conclude that KAN is, on the
Reuters-21578 data set, a better learning algorithm than the others, since it achieves the
best performance in most rounds and, even when it does not, it shows performance
similar to the best of any other algorithm. For example, in Figure 5.13, k-NN shows
quite good performance, similar to KAN after round 4, but, with a very small number
of training examples (i.e., rounds 1 to 3), its micro-averaged measures are much lower
than the best measures at each round. By contrast, KAN works well even with such a
small number of training examples.
Figure 5.13 Micro-averaged F1 of each classifier on the Reuters-21578 corpus with SCut. [Chart: F1 of Rocchio, k-NN, KAN, and WH against rounds 1 to 10.]
Figure 5.14 Macro-averaged F1 of each classifier on the Reuters-21578 corpus with SCut. [Chart: F1 of Rocchio, k-NN, KAN, and WH against rounds 1 to 10.]
Figures 5.15 and 5.16 show the micro and macro-averaged F1 performance of each
learning algorithm with RCut on the 20-Newsgroups data set. Each algorithm achieves a
very similar learning trace on both micro and macro-averaged measures, as in
Section 5.3. Also, all the learning algorithms achieve very similar
performance at each round. These experimental results indicate that there is no
learning algorithm that works much better than the others on the 20-Newsgroups
data set.
As explained in Section 5.3, such similar results for the learning algorithms are
probably due to the fact that text categorization on the 20-Newsgroups is a single-class
categorization task. So, we may conclude that single-class categorization is a simpler
task than multi-class categorization, because it gives stable performance that hardly
changes with the specific learning algorithm applied.
Also, note that we extracted the same proportion of training examples from each
category for rounds 1 to 9. Since the training data at round 10 is evenly distributed
across the 20 categories, all the categories have nearly the same number of training
examples at each round. In a real situation, however, it is unrealistic for each category
to always have the same number of training examples. It is therefore interesting to see
whether or not the results of Figures 5.15 and 5.16 change if we use "truly random
sampling", which extracts examples without preserving an even distribution of training
examples across categories.
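The contrast between the two sampling schemes can be sketched as follows (illustrative code, not the thesis implementation): stratified sampling draws the same fraction from every category, while "truly random sampling" draws uniformly from the whole pool and may therefore leave the categories unevenly represented.

    import random
    from collections import defaultdict

    def stratified_sample(docs, labels, fraction, seed=0):
        """Draw the same fraction from every category (as in rounds 1 to 9)."""
        rng = random.Random(seed)
        by_cat = defaultdict(list)
        for doc, label in zip(docs, labels):
            by_cat[label].append(doc)
        sample = []
        for items in by_cat.values():
            k = max(1, round(fraction * len(items)))
            sample.extend(rng.sample(items, k))
        return sample

    def truly_random_sample(docs, n, seed=0):
        """Draw uniformly from the whole pool; categories may end up uneven."""
        rng = random.Random(seed)
        return rng.sample(list(docs), n)

    docs = ["d1", "d2", "d3", "d4", "d5", "d6"]
    labels = ["acq", "acq", "acq", "acq", "grain", "grain"]
    print(stratified_sample(docs, labels, fraction=0.5))   # balanced across categories
    print(truly_random_sample(docs, n=3))                  # may favour one category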
Figures 5.17 and 5.18 show the micro and macro-averaged F1 performances of each
algorithm on the 20-Newsgroups with "truly random sampling", which may cause an uneven
distribution of training documents at each round. In these experiments, we select the
same number of training documents at each round as shown in Table 5.7.
From these charts, we can see that an uneven distribution of training examples
leads to quite different results. Each learning algorithm gives lower performance than
it does with the even distribution. So, we can conclude that an uneven distribution of
training examples can make text categorization harder [YG96].
Figure 5.15 Micro-averaged F1 of each classifier on the 20-Newsgroups corpus with RCut. [Chart: F1 of Rocchio, k-NN, KAN, and WH against rounds 1 to 10.]
Figure 5.16 Macro-averaged F1 of each classifier on the 20-Newsgroups corpus with RCut. [Chart: F1 of Rocchio, k-NN, KAN, and WH against rounds 1 to 10.]
Figure 5.17 Micro-averaged F1 of each classifier on the 20-Newsgroups corpus with RCut (with "truly random sampling"). [Chart: F1 of Rocchio, k-NN, KAN, and WH against rounds 1 to 10.]
Figure 5.18 Macro-averaged F1 of each classifier on the 20-Newsgroups corpus with RCut (with "truly random sampling"). [Chart: F1 of Rocchio, k-NN, KAN, and WH against rounds 1 to 10.]
Note that k-NN and WH algorithms do not perform as well when training
documents are unevenly distributed across categories. KAN and Rocchio achieve better
results than the other two algorithms and their performances are very similar at all the
rounds in both micro and macro-averaged F1.
Also, by comparing the micro and macro-averaged results for each algorithm in
Figures 5.17 and 5.18, we can see that the two averaged measures of each algorithm are
only slightly different. As discussed in Section 5.3, an even distribution of training
examples causes almost identical micro and macro-averaged F1 measures on the
20-Newsgroups.
Now, let us look at the performance comparison of the three thresholding
strategies – SCut, GRinSCut, and LRinSCut – in a similarity-based learning algorithm.
Since these three thresholding strategies have mainly been developed for the multi-class
text categorization task, we ran the experiments only on the Reuters-21578 corpus.
Figures 5.19 through 5.26 show the performance of our RinSCut variants with each
algorithm, which should be compared to the performance of SCut shown in
Figures 5.13 and 5.14. Figures 5.19 and 5.20 show that, across all the rounds, our
RinSCut variants consistently give Rocchio considerable performance improvements
over SCut in the micro and macro-averaged F1. The results of GRinSCut and
LRinSCut are very similar in the macro-averaged F1. However, in the micro-averaged F1
in Figure 5.19, the advantage of GRinSCut is noticeable, showing a considerable
performance improvement, especially at rounds 2 and 3.
As shown in Figures 5.21 and 5.22, the WH algorithm with the two RinSCut variants
gives very unstable micro and macro-averaged performance. Also, it gives, in general,
lower performance than with SCut on both measures.
In Figure 5.23, k-NN with the RinSCut strategies achieves slightly better micro-averaged
results than with SCut, except at round 4, while giving similar macro-averaged
performance to SCut across all the rounds in Figure 5.24.
Figure 5.19 Micro-averaged performance comparison of SCut and RinSCut variants on Rocchio (the Reuters-21578 corpus used). [Chart: F1 against rounds 1 to 10.]
Figure 5.20 Macro-averaged performance comparison of SCut and RinSCut variants on Rocchio (the Reuters-21578 corpus used). [Chart: F1 against rounds 1 to 10.]
Figure 5.21 Micro-averaged performance comparison of SCut and RinSCut variants on WH (the Reuters-21578 corpus used). [Chart: F1 against rounds 1 to 10.]
Figure 5.22 Macro-averaged performance comparison of SCut and RinSCut variants on WH (the Reuters-21578 corpus used). [Chart: F1 against rounds 1 to 10.]
Figure 5.23 Micro-averaged performance comparison of SCut and RinSCut variants on k-NN (the Reuters-21578 corpus used). [Chart: F1 against rounds 1 to 10.]
Figure 5.24 Macro-averaged performance comparison of SCut and RinSCut variants on k-NN (the Reuters-21578 corpus used). [Chart: F1 against rounds 1 to 10.]
Figure 5.25 Micro-averaged performance comparison of SCut and RinSCut variants on KAN (the Reuters-21578 corpus used). [Chart: F1 against rounds 1 to 10.]
Figure 5.26 Macro-averaged performance comparison of SCut and RinSCut variants on KAN (the Reuters-21578 corpus used). [Chart: F1 against rounds 1 to 10.]
Figures 5.25 and 5.26 show that KAN with the RinSCut variants achieves slightly better
results than with SCut on both measures, except for the low micro-averaged F1 of
GRinSCut at rounds 1 and 2 in Figure 5.25. However, KAN with GRinSCut at round 3
gives a significant performance improvement that is very close to the best micro-averaged
F1 performance at round 9.
Tables 5.8 and 5.9 show the best F1 performance and its classifier (i.e., a learning
algorithm and thresholding strategy) at each round in the micro and macro-averaged F1
respectively on the Reuters-21578 data set.
Table 5.8 The best micro-averaged F1 and its classifier in each round on the Reuters-21578 corpus.

round    best micro-avg. F1    classifier
1        0.500                 Rocchio-SCut
2        0.669                 Rocchio-GRinSCut
3        0.726                 KAN-GRinSCut
4        0.756                 KAN-GRinSCut
5        0.775                 KAN-GRinSCut
6        0.784                 k-NN-LRinSCut
7        0.794                 k-NN-GRinSCut
8        0.793                 KAN-LRinSCut
9        0.788                 KAN-LRinSCut
10       0.790                 k-NN-GRinSCut
In Table 5.8, we can see that our RinSCut thresholding strategies (GRinSCut and
LRinSCut) work well in most rounds, except round 1. Note that GRinSCut gives the
best micro-averaged performance more frequently than LRinSCut. This matches
well with our aim in developing GRinSCut: it was designed to improve the micro-averaged
performance. Also, KAN appears 5 times in this table, at rounds 3, 4, 5, 8, and
9, as the best learning algorithm with the RinSCut variants. This suggests that it offers
advantages for smaller training sets.
Table 5.9 The best macro-averaged F1 and its classifier in each round on the Reuters-21578 corpus.

round    best macro-avg. F1    classifier
1        0.230                 k-NN-LRinSCut
2        0.335                 Rocchio-GRinSCut
3        0.444                 KAN-LRinSCut
4        0.537                 KAN-GRinSCut
5        0.605                 KAN-GRinSCut
6        0.634                 KAN-LRinSCut
7        0.645                 KAN-LRinSCut
8        0.641                 KAN-LRinSCut
9        0.629                 KAN-LRinSCut
10       0.629                 KAN-LRinSCut
The advantage of KAN and the RinSCut variants is much more obvious in the macro-averaged
F1 performance in Table 5.9. In this table, KAN achieves the best F1
performance in 8 rounds and the RinSCut thresholding strategies outperform SCut across
all the rounds. In addition, LRinSCut, designed for improved macro-averaged
performance, appears 7 times as the best thresholding strategy.
In this section, we have described our extensive experiments on the Reuters-21578
and 20-Newsgroups data sets to assess the effects of our methods for text
categorization. The empirical results on the Reuters-21578 show that KAN
outperforms Rocchio and WH, and achieves slightly better results than k-NN. The
improved results for our thresholding strategies are stronger: the two variants
(GRinSCut and LRinSCut) of RinSCut show considerable performance improvements in
the micro and macro-averaged F1 performance for all learning algorithms except WH.
Although KAN with the RinSCut variants does not give better performance than the
other techniques at every round, it seems to slightly outperform them overall. On the
basis of the experimental results, it seems that the best choice among the compared
techniques for the Reuters-21578 data set (i.e., for the multi-class categorization
task) would be KAN with the RinSCut variants.
Chapter 6
Learning with Selective Sampling
Supervised learning approaches to text categorization require a large number of
annotated (or labeled) documents for training to achieve a high level of performance.
The problem in real contexts, however, is that gathering such a large number of
accurately annotated training documents is difficult since it is very time-consuming and
error-prone [HW90, ADW94, VM94].
An emerging research area in text categorization is active learning [CAL94], where
the system actively participates in the collection of training documents, rather than
relying on random sampling. There are usually two types of active learning: (1)
generating artificial new training documents and (2) selecting the most informative
documents from a pool of unlabeled ones. In this chapter, we investigate the latter
type of active learning (i.e., selective sampling) since unlabeled documents for training
are generally plentiful.
As we described in the introduction to Chapter 1, the primary goal of selective
sampling is to reduce the number of labeled training examples that are needed to
achieve a particular performance level. The selective sampling process is typically
performed by selecting and using the most informative examples in the available set of
unlabeled raw documents. Our method of determining informative ones is based on the
uncertainty values of unlabeled examples and so-called uncertainty sampling [LG94,
LC94]; the document that has the largest uncertainty value is considered the most
informative document for training.
In this chapter, we discuss some issues that, we believe, have a significant impact
on the quality of the selective sampling process.
6.1 Goal and Issues
Our main goal in this thesis is to develop a machine learning approach to text
categorization that can achieve a high performance level with fewer annotated training
examples. Through the previous chapters, we have described development of our own
methods (KAN and RinSCut variants) for this goal and verified that they work well,
even with a small number of labeled training documents.
Regardless of the specific learning algorithm (and/or a specific thresholding
strategy) applied to text categorization, one promising approach towards our goal is to
have some control over the sampling process of training examples. Since each
document is generally different from the others, some documents may be very helpful
(or informative) for learning accurate classifiers and others not so. Searching for and
using such helpful documents for training the classifiers is the main purpose of the
selective sampling process. The expected desirable effects of selective sampling are as
follows:
1. We can build accurate classifiers quickly by using a relatively small number
of training documents.
2. We can save human experts from labeling many uninformative documents
that are not helpful for training the classifiers.
The typical method (which is also adopted in this thesis) for finding informative
examples is based on uncertainty values that are computed by comparing the
threshold value of a given category with the similarity values of unlabeled documents.
The main issues we discuss in this chapter are:
1. Using homogeneous or heterogeneous types of knowledge base (i.e.,
classifiers)
2. Directly using the most positive-certain documents for training without
human labeling
6.1.1 Homogeneous versus Heterogeneous Approach
When computing the uncertainty value of a given unlabeled document, we need a
classifier that has already been built from the available training examples. This classifier
can be the same type of classifier as the one used for the categorization of new
documents, or it can be a totally different type of classifier that is only used for the
selective sampling process.
It seems that the homogeneous approach (i.e., using the same type of classifier for
both tasks) is preferable for computing the uncertainty values of unlabeled
documents, since it does not require the additional cost of building a different type of
classifier. However, the heterogeneous approach has been used in [LC94] for selective
sampling and, as mentioned in that work, the main reason for using it is that the
existing type of classifier used for document categorization is either not suitable for
the computation of uncertainty values or too computationally expensive to build and
use with a large number of training documents. As a result, it seems that the choice of
approach depends on the computational complexity of the existing classifiers.
In this chapter, we focus on our techniques (KAN and the RinSCut variants) and their
categorization performance improvements through selective sampling. As discussed in
Section 3.3, the computational complexity of KAN is O(n²), where n is the number of
features in the vector space. This could be quite problematic when applying KAN
to selective sampling. So, to be a practical classifier, KAN must show reasonable
performance with a reduced feature set of manageable size. Fortunately, as
shown in Chapter 5, KAN, like the other similarity-based learning algorithms,
achieved its highest categorization performance with a reduced feature set that had
far fewer features than the unreduced full feature set. So, our adopted approach to
selective sampling is the homogeneous approach that directly uses the existing
classifiers built by KAN, avoiding the additional cost of building new classifiers.
6.1.2 Using Positive-Certain Examples for Training
As discussed earlier, selective sampling finds, for a given category, the most
uncertain documents, those whose category membership is most ambiguous. Then, it
presents some of the most uncertain documents to human experts, asking for their
correct category labels.
One possibility arising from the process of selective sampling is the use of a set of
positive-certain documents for a given category to achieve performance improvements.
Such positive-certain documents, which are also used as negative ones for the other
categories, could be less informative than the uncertain ones used in previous selective
sampling methods. Even so, it is plausible that using positive-certain documents will
have positive effects on the categorization performance.
We can expect that if the text categorization system has a scheme that
automatically finds the uncertain documents, it can also locate the most positive-certain
documents, those that must be categorized under a given category. This automatic
scheme for locating positive-certain documents may be quite advantageous if using
them for training leads to performance improvements, since it does not require any
work from human experts. Goldman and Zhou's work [GZ00] can also label unlabeled
data, but it uses two different classifiers.
A few positive-certain examples could be in error, and such automatically and
wrongly classified documents will hurt the categorization performance. This problem
is one of the main reasons why we choose the homogeneous approach, in which the
system uses the same type of relatively accurate learning algorithm (like KAN) for both
the selective sampling of informative documents (uncertain and/or positive-certain
documents) and the categorization of new documents.
Figure 6.1 depicts the flow of unlabeled documents for each iteration in our
selective sampling approach. The sampler, here, identifies two types of documents:
uncertain and positive-certain documents. Note that previous selective sampling used
only the uncertain documents, requiring manual labeling. In addition to these uncertain
examples, our selective sampling method uses, for training, positive-certain documents
that are automatically labeled by the system.
113
human expert
sampler
most
uncertain
documents
unlabeled
documents
training
documents
most
certain
documents
Figure 6.1
manually labeled
automatically labeled
Flow of unlabeled documents in our selective sampling.
6.2 Overall Approach
The selective sampling approach we are interested in is referred to as uncertainty
sampling, since it is based on the uncertainty values of unlabeled documents.
Uncertainty sampling was first introduced and discussed in [LG94]. In that work,
only the uncertain documents, which should be labeled by human experts, were used.
Our uncertainty sampling approach is different from this original method in
that it uses both uncertain and positive-certain documents. So, we need a new scheme
that makes a distinction between the uncertain and certain documents in the set of
available unlabeled documents.
In the following subsections, we explain the way of computing the uncertainty
values and our new scheme, in which we can define the positive-certain documents as
well as the uncertain documents.
6.2.1 Computing Uncertainty Values
To compute the uncertainty values of unlabeled documents for uncertainty
sampling, the text categorization system needs the classifiers that are usually
built from the existing training documents. As discussed earlier, these classifiers in our
system are the same as the classifiers that are used for the categorization of new
documents.
Based on homogeneous uncertainty sampling, the system computes the similarity
score si(uj) for the ith category ci and the jth unlabeled document uj. Previous
uncertainty selective sampling approaches computed the uncertainty value UCTi(uj)
from its certainty value CTi(uj) as follows:

• UCTi(uj) = − CTi(uj)    (Formula 6.1)

Then, CTi(uj) is defined as follows:

• CTi(uj) = |si(uj) − ts(ci)|    (Formula 6.2)

where ts(ci) is typically the optimal threshold from the SCut thresholding strategy.
From these two formulas, we can see that uncertainty is defined as the opposite
of certainty, and the largest possible uncertainty value in the set of unlabeled
documents is obtained by the document whose similarity score is closest to the
threshold ts(ci). As a result, this largest uncertainty value is 0 when si(uj) = ts(ci).
Previous uncertainty sampling approaches select, in each iteration, only a fixed number
of uncertain documents that are closest to the threshold, and the human experts must
annotate them with their correct category labels.
Note that the above formulas do not distinguish between a document with a
similarity score below ts(ci) and one with a similarity score above ts(ci). For example,
when ts(ci) = 20, si(ui) = 15, and si(uj) = 25, the certainty values of the two unlabeled
documents, ui and uj, for a given category ci are the same, 5, and so are their
uncertainty values. Our uncertainty selective sampling method must differentiate them
in order to define the positive-certain documents and the negative-certain documents.
In the following section, we describe how to do this with our own thresholding
strategy, RinSCut.
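A minimal sketch of Formulas 6.1 and 6.2 (function names are ours) makes the point concrete: certainty is the distance of a similarity score from the category threshold, uncertainty is its negation, and scores on either side of ts(ci) receive identical values.

    def certainty(similarity, threshold):
        return abs(similarity - threshold)            # CTi(uj), Formula 6.2

    def uncertainty(similarity, threshold):
        return -certainty(similarity, threshold)      # UCTi(uj), Formula 6.1

    # With ts(ci) = 20, the two documents from the example above receive the
    # same uncertainty value, even though one scores below and one above ts(ci).
    print(uncertainty(15, 20), uncertainty(25, 20))   # -5 -5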
6.2.2 Defining Certain and Uncertain Documents with RinSCut
The key difference of our selective sampling from the previous conventional
approaches is that our system distinguishes the uncertain, positive-certain, and
negative-certain examples, and uses them for training. In this research, we do not use
the negative-certain documents, because positive documents for one category are also
used as negative ones for the other categories. The training set therefore already had
many negative examples, so we did not explore the power of using negative-certain
documents. Our method is especially significant for the positive-certain documents it
defines, because the text categorization system uses them without the correct category
labels from the human experts.
For this new scheme of selective sampling, we use the RinSCut thresholding
strategy, introduced and explained in Chapter 4. As explained in Section 4.3.1, RinSCut
defines the ambiguous zone for a given category ci using ts_top(ci) and
ts_bottom(ci). In our approach, this ambiguous zone is used for differentiating the
uncertain, positive-certain, and negative-certain examples in the set of unlabeled
documents.
Figure 6.2 shows and defines the three ranges of similarity scores of unlabeled
documents as follows:
• ts_bottom(ci) ≤ si(uncertain documents) < ts_top(ci)
• si(positive-certain documents) ≥ ts_top(ci)
• si(negative-certain documents) < ts_bottom(ci)
Figure 6.2 Definition of certain and uncertain examples using ts_top(ci) and ts_bottom(ci) for a given category ci. [Diagram: unlabeled examples sorted by similarity score for category ci in descending order; documents with scores at or above ts_top(ci) are positive-certain, documents with scores between ts_bottom(ci) and ts_top(ci) are uncertain (this zone contains ts(ci) from SCut), and documents with scores below ts_bottom(ci) are negative-certain.]
In this figure, the uncertain documents for the category ci are those whose
similarity scores fall in the ambiguous zone, the positive-certain documents are those
with similarity scores above ts_top(ci), and the negative-certain documents are those
with similarity scores below ts_bottom(ci).
In our uncertainty selective sampling methods, the system selects, at each
iteration, a number of uncertain documents with the largest uncertainty values (i.e.,
with similarity scores closest to the threshold), which must also lie within the
ambiguous zone. The system then presents them to human experts for their category
labels. In addition, our selective sampling method automatically selects the positive-certain
documents, with similarity scores above ts_top(ci), and uses them directly for
training without asking a human expert for their correct category information.
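The sketch below is our own rendering of this partitioning and selection step (ts_top, ts_bottom, and ts are the thresholds from RinSCut and SCut described in Chapter 4; the function names are illustrative).

    def partition_unlabeled(scored_docs, ts_top, ts_bottom):
        """scored_docs: iterable of (document, similarity score) pairs for ci."""
        positive_certain, uncertain, negative_certain = [], [], []
        for doc, score in scored_docs:
            if score >= ts_top:
                positive_certain.append(doc)        # labeled automatically
            elif score >= ts_bottom:
                uncertain.append((doc, score))      # candidates for the human expert
            else:
                negative_certain.append(doc)        # not used in this research
        return positive_certain, uncertain, negative_certain

    def most_uncertain(uncertain, ts, k):
        """The k in-zone documents whose scores are closest to ts(ci)."""
        return [doc for doc, score in
                sorted(uncertain, key=lambda pair: abs(pair[1] - ts))[:k]]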
Chapter 7
Evaluation II: Uncertainty Selective Sampling
In Chapter 5, we conducted comparative experiments on our new techniques
(KAN and RinSCut variants) for text categorization. The experimental results showed
that our techniques work slightly better than other widely used methods. However,
like those other methods, they still require a large number of training examples to
achieve a high level of performance. The sampling method used in those experiments
for selecting the training examples was random sampling.
In Chapter 6, we described another type of sampling method – the uncertainty
selective sampling. This is a mechanism that finds more informative examples in
unlabeled documents and uses them for training. The desirable and expected result is
that the categorization performance, especially with a small number of randomly
selected training data, could be significantly improved by using the same number of
informative examples.
In this chapter, we conduct experiments on this uncertainty selective sampling
method to explore this potential performance improvements for text categorization.
7.1 Data Sets Used and Text Processing
The same two data sets, Reuters-21578 and 20-Newsgroups, that were used in
Chapter 5 are also used in this evaluation. For the training and test sets in each corpus,
we use the same splitting methods that were described in Section 5.1.1 for the
Reuters-21578 and in Section 5.1.2 for the 20-Newsgroups. To convert raw documents
to their representations, we also use the same text preprocessing methods described in
Section 5.2.
For feature space reduction, we use the same feature selection algorithms as the
ones used in Chapter 5 and select 50 unique features from each category as evaluated
and explained in Section 5.3.
7.2 Classifiers Implemented and Evaluated
The classifiers (learning algorithm + thresholding strategy) implemented and
evaluated for the categorization of new documents in the experiments are
KAN+GRinSCut, KAN+LRinSCut, and KAN+RCut. For the Reuters-21578 data set, the
following classifiers were implemented and evaluated with selective sampling:
(1) KAN+GRinSCut
KAN learning algorithm and GRinSCut thresholding strategy
(2) KAN+LRinSCut
KAN learning algorithm and LRinSCut thresholding strategy
And, for the 20-Newsgroups data set, the following classifier was evaluated:
(1) KAN+RCut
KAN learning algorithm and RCut thresholding strategy
These classifiers on each corpus are built from the available training examples and
evaluated against the test data set. The goal of the experiments in this chapter is to see
whether or not their categorization performances are improved by applying the
uncertainty selective sampling methods described in the following section, compared
with the random sampling.
7.3 Sampling Methods Compared
In this chapter, we want to compare the following sampling methods:
(1) Random sampling (RS)
The documents in the training set are randomly selected from the set of
unlabeled examples and then, manually labeled by human experts.
(2) Selective sampling of uncertain examples (SS-U)
For the training set, the most uncertain documents are selected based on their
uncertainty scores and then, manually labeled by human experts.
(3) Selective sampling of uncertain and certain examples (SS-U&C)
As well as the most uncertain documents, a set of positive-certain documents
that is automatically labeled by the system is also added to the training set.
The classifiers described in Section 7.2 were built on the training examples selected
by one of the above sampling methods. Also, as discussed in Chapter 6, our uncertainty
selective sampling methods (SS-U and SS-U&C) are based on the homogeneous
approach that uses the same type of classifier as the one used for the categorization of
new (or test) documents. So, each classifier was used for the categorization of test
documents, used again for the selective sampling methods, and then re-built using the
training examples selected by the selective sampling methods.
For the number of positive-certain examples that are automatically labeled by the
system and used for training in the SS-U&C sampling method, we choose 500 and 250
examples for the Reuters-21578 (6,984 training examples in total) and 1,000 and 500
examples for the 20-Newsgroups (about 16,000 training examples in total). Also, to
obtain generalized and reliable results for the evaluation of the random sampling
method, we conducted the experiments 3 times in each data split for both data sets.
7.4 Results and Analysis
For the results in this section, we use the following experimental methodology. To
build an initial classifier, we randomly select the same number, n, of positive training
examples for each category (n = 2 for the Reuters-21578, 106 examples in total, and
n = 4 for the 20-Newsgroups, 80 examples in total). Also, at each iteration, the same
number of examples (i.e., 106 examples for the Reuters-21578 and 80 examples for the
20-Newsgroups) is selected based on the adopted sampling method and added to the
training set. For the SS-U&C sampling, an additional set of k positive-certain examples
(500 and 250 for the Reuters-21578, and 1,000 and 500 for the 20-Newsgroups),
automatically labeled by the classifier constructed at the previous iteration, is also
added to the training set. These k positive-certain examples are almost evenly
distributed across categories. Then, the classifier that is incrementally re-built from the
training set is evaluated against the test set.
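The following high-level sketch summarizes this iterative procedure for the SS-U&C method. The callables stand in for the KAN+RinSCut components and the human expert; they are placeholders for illustration, not the system's actual interfaces.

    def ss_uc_experiment(labeled, unlabeled, test_set, per_iteration, k_certain,
                         iterations, build_classifier, select_uncertain,
                         select_positive_certain, ask_expert, evaluate):
        results = []
        classifier = build_classifier(labeled)              # initial classifier
        for _ in range(iterations):
            # Most uncertain documents: manually labeled, then added to training.
            uncertain = select_uncertain(classifier, unlabeled, per_iteration)
            labeled = labeled + [(doc, ask_expert(doc)) for doc in uncertain]
            unlabeled = [doc for doc in unlabeled if doc not in uncertain]
            # Positive-certain documents: labeled automatically by the classifier.
            labeled = labeled + select_positive_certain(classifier, unlabeled, k_certain)
            classifier = build_classifier(labeled)          # incrementally re-built
            results.append(evaluate(classifier, test_set))
        return results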
Figures 7.1 and 7.2 show the micro and macro-averaged F1 of the KAN+RCut
classifier, evaluated with the three different sampling methods on the 20-Newsgroups
corpus. In these charts, the curves of the uncertainty selective sampling methods stop
when they achieve the target performance of the random sampling. Also, note that the
RS sampling on this data set used "truly random sampling". So, the performance of RS
in these experiments shows results similar to Figures 5.17 and 5.18, not Figures 5.15
and 5.16, in Section 5.4.2. The micro and macro-averaged F1 measures of each sampling
method are almost identical due to the characteristics of this data set explained in
Section 5.3.2. The advantage of the SS-U sampling becomes obvious after 320 training
examples. But, initially, with 240 examples, its performance is slightly worse than the
random sampling, RS. Its low performance seems to be due to the inaccurate initial
classifier that is learned with the small number of training examples. The advantage of
the SS-U&C variants (SS-U&C[1000] and SS-U&C[500]) is clear even at the initial
point, but they failed to give better results after 400 training examples. The reason for
this is that the documents incorrectly classified as positive-certain affect the
categorization performance. Also, SS-U&C[1000] results in slightly better
performance than SS-U&C[500]. This result suggests that positive-certain examples
may be less informative than uncertain ones and, as a result, we need a fairly large
number of positive-certain examples to make use of their advantage for selective
sampling.
Figure 7.1 Micro-averaged F1 of KAN+RCut on the 20-Newsgroups (RS: random sampling, SS-U: selective sampling of uncertain examples, SS-U&C[1000]: selective sampling of uncertain and certain examples [1,000 certain examples], SS-U&C[500]: selective sampling of uncertain and certain examples [500 certain examples]). [Chart: F1 against the number of manually labeled training examples, 160 to 1,280.]
Figure 7.2 Macro-averaged F1 of KAN+RCut on the 20-Newsgroups (RS: random sampling, SS-U: selective sampling of uncertain examples, SS-U&C[1000]: selective sampling of uncertain and certain examples [1,000 certain examples], SS-U&C[500]: selective sampling of uncertain and certain examples [500 certain examples]). [Chart: F1 against the number of manually labeled training examples, 160 to 1,280.]
From these results, we can see that the uncertainty selective sampling methods require
a much smaller number of labeled training examples than the RS sampling. For
example, to achieve a given level of performance, say the 0.573 micro-averaged F1 that
the RS sampling reaches at 1,280 examples in Figure 7.1, the SS-U requires 480 training
examples, and the SS-U&C[1000] and SS-U&C[500] need 800 and 880 examples,
respectively. This represents a 62.5% saving on the required examples for the SS-U,
and 37.5% and 31.2% savings for the SS-U&C[1000] and SS-U&C[500], over the
random sampling.
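The savings quoted here follow from a simple ratio: the fraction of manually labeled examples avoided relative to random sampling at the same target F1, as the small check below shows.

    def saving(random_examples, selective_examples):
        return 1 - selective_examples / random_examples

    print(f"SS-U:         {saving(1280, 480):.1%}")   # 62.5%
    print(f"SS-U&C[1000]: {saving(1280, 800):.1%}")   # 37.5%
    print(f"SS-U&C[500]:  {saving(1280, 880):.1%}")   # 31.2%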
Figures 7.3 through 7.6 show the experimental results on the Reuters-21578
corpus. The learning traces of KAN+GRinSCut with four sampling methods are
presented in Figures 7.3 and 7.4.
In Figure 7.3, we can see that none of the uncertainty selective sampling methods
shows any desirable effect over the random sampling for the micro-averaged F1.
These results are mainly due to the fact that the documents in the Reuters-21578
corpus are very unevenly distributed across the categories. In Tables 5.1 through 5.3 in
Chapter 5, we can see that more than 50% of the examples belong to two categories,
"earn" and "acq", in both the training and test sets. In this situation, the micro-averaged
measure mainly depends on the performance on these two categories. The randomly
selected documents in the training set of each trial might be mainly from the "earn" and
"acq" categories, and the classifiers built from this unevenly distributed training
set might work well on the test examples of the two frequent categories. By
contrast, the documents selected by the selective sampling methods are evenly
distributed across categories. Also, note in Figure 7.3 that the SS-U&C variants
perform much better than the SS-U sampling. This superior result of SS-U&C, against
SS-U, for the micro-averaged performance is probably due to the large number of
positive-certain examples added to the training set, since these examples can alleviate
the sparseness of training examples for the frequent categories like "earn" and
"acq".
Figure 7.3 Micro-averaged F1 of KAN+GRinSCut on the Reuters-21578 (RS: random sampling, SS-U: selective sampling of uncertain examples, SS-U&C[500]: selective sampling of uncertain and certain examples [500 certain examples], SS-U&C[250]: selective sampling of uncertain and certain examples [250 certain examples]). [Chart: F1 against the number of manually labeled training examples, 106 to 1,378.]
Figure 7.4 Macro-averaged F1 of KAN+GRinSCut on the Reuters-21578 (RS: random sampling, SS-U: selective sampling of uncertain examples, SS-U&C[500]: selective sampling of uncertain and certain examples [500 certain examples], SS-U&C[250]: selective sampling of uncertain and certain examples [250 certain examples]). [Chart: F1 against the number of manually labeled training examples, 106 to 1,378.]
Figure 7.5 Micro-averaged F1 of KAN+LRinSCut on the Reuters-21578 (RS: random sampling, SS-U: selective sampling of uncertain examples, SS-U&C[500]: selective sampling of uncertain and certain examples [500 certain examples], SS-U&C[250]: selective sampling of uncertain and certain examples [250 certain examples]). [Chart: F1 against the number of manually labeled training examples, 106 to 1,378.]
Figure 7.6 Macro-averaged F1 of KAN+LRinSCut on the Reuters-21578 (RS: random sampling, SS-U: selective sampling of uncertain examples, SS-U&C[500]: selective sampling of uncertain and certain examples [500 certain examples], SS-U&C[250]: selective sampling of uncertain and certain examples [250 certain examples]). [Chart: F1 against the number of manually labeled training examples, 106 to 1,378.]
Figure 7.4 shows the macro-averaged F1 of KAN+GRinSCut. Unlike for the
micro-averaged performance, the advantage of the SS-U sampling is apparent for the
macro-averaged measure. It achieves 0.50 macro-averaged F1 at 848 examples, while the
random sampling needs 1,378 examples to achieve this level of performance. So, the
SS-U shows a 38.4% saving on the number of documents required, over the random
sampling. The advantages of the SS-U&C variants are not clear with the small numbers
of labeled examples used. But, after 954 examples, both SS-U&C[500] and
SS-U&C[250] work well and achieve 0.502 and 0.500 macro-averaged F1 at 954 and 1,166
examples, respectively. These results show a 30.7% saving for SS-U&C[500] and a
15.3% saving for SS-U&C[250]. As with the 20-Newsgroups corpus, we can see that
using the larger number of positive-certain examples (in this case, 500) achieves better
performance than using 250 positive-certain examples.
Figures 7.5 and 7.6 depict the micro and macro-averaged F1 performance of
another classifier, KAN+LRinSCut, which uses the locally optimized RinSCut, with
the sampling methods on the Reuters-21578 data set. In Figure 7.5, the SS-U shows a
somewhat unstable learning curve in the micro-averaged performance and, like
KAN+GRinSCut in Figure 7.3, it failed to reach, with fewer labeled examples, the 0.765
micro-averaged F1 that RS achieves at 1,378 examples. The SS-U&C variants work
better than the other sampling methods with small numbers of labeled examples. But,
after 530 examples for SS-U&C[500] and 374 examples for SS-U&C[250], their
performances are worse than RS.
For the macro-averaged performance of KAN+LRinSCut in Figure 7.6, all the
uncertainty sampling methods consistently work better than the RS. To achieve 0.500
macro-averaged F1, while the RS needs 1,378 examples, the SS-U needs 848 training
examples, representing a 38.4% saving in the required number of examples. The
SS-U&C[500] and SS-U&C[250] need 954 and 1,166 examples, showing 30.7% and
15.3% savings, respectively.
From these results on the Reuters-21578, we can see that, with an uneven
distribution of test documents, the optimal choice among the sampling methods becomes
difficult, and the choice will depend on the user's needs. For example, if micro-averaged
performance is the primary concern on the Reuters-21578, the optimal sampling
method will be the RS sampling method; otherwise, it will be the SS-U selective
sampling. If both micro and macro-averaged measures are of concern, the optimal
choice seems to be the SS-U&C sampling method.
Also, the experimental results on both data sets show that using more positive-certain
examples (i.e., 1,000 examples for the 20-Newsgroups and 500 for the Reuters-21578)
works slightly better than using the smaller number of positive-certain examples (500 and
250 examples, respectively). However, we cannot say definitively that using more
positive-certain examples is better, since the performance difference is not pronounced
and using a larger number of training examples would make the overall text
categorization process slower.
Chapter 8
Conclusions
Our goal in this research was to investigate and develop supervised and semi-supervised machine learning approaches to text categorization, including (1) an
algorithm that exploits word co-occurrence information and discriminating power
value, (2) new approaches to thresholding, and (3) semi-supervised learning
approaches to selective sampling, for the important goal of reducing the number of
labeled training examples to achieve a given level of performance. The type of text
categorization we investigated was similarity-based, where the classifier returns the
predicted category labels of new documents based on their similarity scores.
To achieve our goal, we built text categorization systems to which we applied new
classifiers (KAN and RinSCut variants) and uncertainty selective sampling methods.
KAN is a new learning algorithm that was designed to give a term the appropriate
weight according to its semantic meaning and importance. For this, KAN uses the
feature’s co-occurrence information and discriminating power value in a given category.
Another important research area in similarity-based text categorization concerns the
thresholding strategy. A thresholding strategy is indispensable for similarity-based classifiers and has a significant
impact on categorization performance. We investigated existing common thresholding
techniques and developed RinSCut variants that were designed to combine the
strengths of the existing thresholding strategies. Finally, we explored uncertainty
selective sampling methods. Rather than relying on random sampling, our selective
sampling methods actively choose candidate training examples. Our selective sampling
methods are based on the estimated uncertainty value of a given unlabeled example. To
avoid additional cost in building a different type of classifier for selective sampling, we
adopted a homogeneous approach that uses the same type of classifier as that in the
categorization of new documents. As well as exploring conventional selective sampling
methods that use only the most uncertain examples for training (in this thesis, referred
to as the SS-U), we developed another type of selective sampling method (SS-U&C)
that picks and uses a set of positive-certain examples with the uncertain ones. The
main advantage of SS-U&C sampling is that the positive-certain documents
recommended do not require a human labeling process since they are thought to be
positive by the system.
Extensive text categorization experiments were conducted on two standard test
collections: the 20-Newsgroups and the Reuters-21578. The 20-Newsgroups corpus is
suitable for the evaluation of the single-class (non-overlapping classes) categorization
task, while the Reuters-21578 is for the multi-class categorization task. Both
collections are real-world ones, contain a large number of pre-categorized documents,
have many predefined categories, and can be regarded as standard collections for testing
text categorization systems.
The key conclusions drawn from the experimental results are as follows:
•
For all the similarity-based learning algorithms implemented and tested in
this research, we found in Section 5.3 that using a large number of features
failed to give a significant performance improvement on either data set. As
a result, aggressive feature space reduction was possible, giving both fast
processing and better performance.
•
We compared KAN against other typical similarity-based learning
algorithms in Section 5.4, using the existing conventional thresholding
strategies (i.e., RCut for the 20-Newsgroups and SCut for the Reuters-21578)
and varying the number of training examples used for training. The
best performance and KAN's performance in each round on the Reuters-21578
corpus are summarized below. The best performance in each case is shown in
bold. We observed that KAN achieved the best performance on the Reuters-21578
in most rounds and, even when it did not, KAN showed performance close to the
best of the others.
round    best F1              KAN's F1
         micro     macro      micro     macro
1        0.500     0.227      0.455     0.188
2        0.624     0.298      0.624     0.264
3        0.627     0.399      0.627     0.399
4        0.719     0.469      0.696     0.469
5        0.742     0.572      0.733     0.572
6        0.743     0.602      0.743     0.602
7        0.756     0.597      0.756     0.597
8        0.755     0.605      0.752     0.605
9        0.757     0.596      0.757     0.596
10       0.750     0.584      0.750     0.584
For the 20-Newsgroups, all the learning algorithms showed similar
performance results when the training examples were evenly distributed
across categories. But, with an uneven distribution of training examples,
caused by "truly random sampling", there were greater differences between
the learning algorithms. The best performance and KAN's performance in each
round on the 20-Newsgroups corpus are shown in the table below.
round    best F1              KAN's F1
         micro     macro      micro     macro
1        0.396     0.383      0.396     0.381
2        0.497     0.482      0.497     0.482
3        0.524     0.506      0.524     0.506
4        0.536     0.513      0.536     0.512
5        0.570     0.553      0.563     0.540
6        0.597     0.583      0.590     0.570
7        0.623     0.607      0.618     0.597
8        0.662     0.649      0.662     0.643
9        0.666     0.660      0.666     0.652
10       0.736     0.735      0.712     0.701
Again, the best performance in each case is shown in bold. KAN and
Rocchio achieved similar results in this situation and outperformed k-NN
and WH in most rounds.
•
To assess the effects of the RinSCut thresholding strategy, experiments
were performed on the Reuters-21578 (i.e., for multi-class text
categorization) only. The F1 performance of the tested thresholding
strategies with all the similarity-based algorithms at round 10 (i.e., using all
the training examples) is shown below:
algorithm    SCut                GRinSCut            LRinSCut
             micro     macro     micro     macro     micro     macro
Rocchio      0.634     0.412     0.736     0.549     0.752     0.573
WH           0.681     0.570     0.732     0.517     0.438     0.442
k-NN         0.692     0.570     0.790*    0.552     0.786     0.578
KAN          0.750     0.584     0.780     0.615     0.786     0.629*
In this table, the performance measures marked with * represent the overall top
results in micro and macro-averaged performance. Also, the table below
shows the percentage performance improvements of the RinSCut variants
against SCut.
algorithm    SCut                best in RinSCut variants
             micro     macro     micro             macro
Rocchio      0.634     0.412     0.752 (18.6%)     0.573 (39.1%)
WH           0.681     0.570     0.732 (7.5%)      0.517 (-9.3%)
k-NN         0.692     0.570     0.790 (14.2%)     0.578 (1.4%)
KAN          0.750     0.584     0.786 (4.8%)      0.629 (7.7%)
It appears that our RinSCut variants (GRinSCut and LRinSCut) gave
considerable performance improvements for all the learning algorithms
except WH. Especially for Rocchio, the RinSCut variants clearly outperformed
SCut. Even though Rocchio with the RinSCut variants did not give the best
results across all the rounds, its performance was close to the best results the
other classifiers achieved. This result showed that the thresholding
strategy, an unexplored research area, is important for similarity-based text
categorization.
•
Based on the experiments on the Reuters-21578, we were able to say that
the best combination among the compared methods in this research seems
to be KAN with the RinSCut variants for multi-class categorization. We
found that KAN with the LRinSCut performed best on macro-averaged
performance, while KAN with the GRinSCut achieved the second best
performance which is very close to the best performance of k-NN on
macro-averaged performance.
•
We compared the uncertainty selective sampling methods (SS-U, SS-U&C[1000], and SS-U&C[500]) and random sampling (RS) with
KAN+RCut on the 20-Newsgroups corpus in Section 7.4. We observed that
all the selective sampling methods require fewer labeled examples for
training to achieve a given level of performance, as shown in the tables below.
target micro-averaged F1 = 0.573

sampling method    number of labeled examples    F1       savings
RS                 1,280                         0.573    0%
SS-U               400                           0.581    62.5%
SS-U&C[1000]       800                           0.576    37.5%
SS-U&C[500]        880                           0.580    31.2%

target macro-averaged F1 = 0.549

sampling method    number of labeled examples    F1       savings
RS                 1,280                         0.549    0%
SS-U               400                           0.550    68.7%
SS-U&C[1000]       560                           0.552    56.2%
SS-U&C[500]        720                           0.560    43.7%
With more than 320 training examples, the SS-U gave better results than
the SS-U&C variants. The reason for this seems to be that some of the
positive-certain documents in the SS-U&C variants were incorrectly
categorized. These incorrect examples appeared to lower the performance
of SS-U&C compared with the SS-U sampling method.
•
The comparative experiments on sampling methods with KAN+GRinSCut
and KAN+LRinSCut were conducted on the Reuters-21578 data set. For
the micro-averaged performance, all the selective sampling methods failed
to show a clear advantage. This result was mainly caused by an uneven
distribution of test examples in this data set. However, SS-U&C variants
showed much better performance than SS-U (see Figures 7.3 and 7.5 in
Chapter 7). Their micro-averaged performances were very close to RS. By
contrast, for the macro-averaged performance, there was a clear advantage over the RS
sampling. This is shown in the table below.
target macro-averaged F1 = 0.497 with KAN+GRinSCut

sampling method    number of labeled examples    F1       savings
RS                 1,378                         0.497    0%
SS-U               848                           0.502    38.4%
SS-U&C[500]        954                           0.507    30.7%
SS-U&C[250]        1,166                         0.500    15.3%

target macro-averaged F1 = 0.500 with KAN+LRinSCut

sampling method    number of labeled examples    F1       savings
RS                 1,378                         0.500    0%
SS-U               848                           0.528    38.4%
SS-U&C[500]        954                           0.538    30.7%
SS-U&C[250]        1,166                         0.533    15.3%
The savings of each selective sampling method on the number of labeled
examples required to achieve a given level of performance are the same in the
two summarized tables. The conclusions drawn from these results are (1) if
micro-averaged performance is the primary concern, the RS sampling
should be used, (2) otherwise, the SS-U sampling could be the optimal
choice, and (3) if both averaged measures matter, the choice of
sampling method for training examples might be SS-U&C, since it did not
show the worst performance on either the micro or the macro-averaged
measure.
8.1 Contributions
The following contributions were made by this research in the area of text
categorization.
•
Clarification of the high dimensionality problem to which most
sophisticated machine learning algorithms cannot scale.
One of the major problems in text categorization is the high dimensionality of the
feature space. From the results of extensive experiments on this issue, we
found that learning accurate concepts of categories does not require a large
number of input features in either corpus (the 20-Newsgroups and the
Reuters-21578) and so the system needs aggressive feature reduction for faster
processing and for better performance results.
•
Obtaining improved experimental results by applying word co-occurrence
information with the discriminating power values.
Previous work [Lew92a, Lew92b] showed that using phrases does not lead to
any improvements in text categorization performance. This is probably
because of the sparsity of such phrases in a given data set. We noted that using
word co-occurrence information could be effective for resolving some of the
semantic and informative ambiguities each term can have. By applying this
word co-occurrence information with the discriminating power values of
features in KAN, we achieved better results compared with other, conventional
similarity-based learning algorithms – k-Nearest Neighbor (k-NN), Widrow-Hoff
(WH), and Rocchio.
•
Combining the strengths of existing thresholding strategies to develop a
new strategy (RinSCut) that works better in multi-class text categorization.
Thresholding strategies in similarity-based text categorization are an underexplored research area that needs more attention. We developed new strategies, the RinSCut variants, by combining the strengths of two existing strategies, RCut and SCut. Experimental results for multi-class text categorization showed that our RinSCut variants gave a significant improvement over most similarity-based learning algorithms. (A sketch of this kind of decision rule is given at the end of this list.)
•
Experimental results showing that our uncertainty selective sampling methods, combined with our classifiers (KAN+RinSCut variants), significantly reduce the number of labeled training examples needed to reach the level of performance achieved by random sampling.
Supervised learning approaches to text categorization usually need a large number of human-labeled examples to achieve a high level of performance. However, manually labeling such a large number of examples is difficult and sometimes impractical. Our two kinds of selective sampling methods required fewer manually labeled training examples than random sampling did. (The selection step is sketched at the end of this list.)
•
Evaluation of the effectiveness of our proposed methods using the standard evaluation measure, F1, in both micro- and macro-averaged form.
Our proposed methods were evaluated using the F1 measure, one of the standard evaluation measures in text categorization. We report both the micro- and the macro-averaged performance, since developing a method that improves only one of these averages is sometimes considered trivial. (The standard definitions are restated at the end of this list.)
•
Generality of our proposed methods.
Our thresholding and selective sampling methods have been developed and
applied to text categorization. They are quite general and applicable to other
similarity-based text classification tasks.
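As a concrete illustration of the RinSCut idea referred to above, the following is a minimal sketch, under our own assumptions, of a rank-in-score decision rule: each category has an upper and a lower score bound, scores above the upper bound are accepted, scores below the lower bound are rejected, and documents falling in the ambiguous zone between the bounds are decided by their rank, as in RCut. The way the bounds are computed from training data and the rank cut-off value are not shown here.

    def rinscut_assign(scores, upper, lower, rank_cut=1):
        """Assign categories to one test document.
        scores: {category: similarity score}
        upper, lower: {category: score bound} learned from training data."""
        ranked = sorted(scores, key=scores.get, reverse=True)
        assigned = set()
        for rank, category in enumerate(ranked, start=1):
            score = scores[category]
            if score >= upper[category]:
                assigned.add(category)            # clearly above the ambiguous zone
            elif score > lower[category] and rank <= rank_cut:
                assigned.add(category)            # ambiguous zone: fall back to rank
        return assigned

In this sketch the rule reduces to a pure score threshold (SCut-like behaviour) when the two bounds coincide, and to a pure rank cut (RCut-like behaviour) when the zone spans the whole score range.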
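The selection step of the uncertainty sampling methods can likewise be pictured with a simplified sketch rather than the exact procedure used in our experiments. Here, documents whose best similarity score lies closest to the decision threshold are treated as the most uncertain and are recommended for manual labeling (as in SS-U), while documents scoring well above the threshold can additionally be taken as positive-certain and kept with their predicted labels (as in the SS-U&C variants). The threshold, batch size, and margin are illustrative parameters.

    def select_training_candidates(best_scores, threshold, n_uncertain,
                                   certain_margin=None):
        """best_scores: {doc_id: highest similarity score over all categories}
        for the current unlabeled pool.  Returns (uncertain, certain) ids."""
        # Most uncertain first: smallest gap between the score and the threshold.
        by_uncertainty = sorted(best_scores,
                                key=lambda d: abs(best_scores[d] - threshold))
        uncertain = by_uncertainty[:n_uncertain]     # sent to a human for labeling

        certain = []
        if certain_margin is not None:               # SS-U&C-style variant
            certain = [d for d in best_scores
                       if best_scores[d] >= threshold + certain_margin]
        return uncertain, certain                    # certain keep predicted labels

As the results summarised above suggest, some of the positive-certain documents can be incorrectly categorized, which is the cost of adding them without manual checking.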
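For reference, the standard definitions behind the reported figures are

    F_1 = \frac{2PR}{P + R}, \qquad P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN},

where TP, FP, and FN are counted per category. Micro-averaging pools these counts over all categories before computing precision and recall,

    P^{\mu} = \frac{\sum_{c} TP_c}{\sum_{c} (TP_c + FP_c)}, \qquad R^{\mu} = \frac{\sum_{c} TP_c}{\sum_{c} (TP_c + FN_c)},

whereas macro-averaging computes F1 for each category and takes the unweighted mean over the category set C, F_1^{M} = \frac{1}{|C|} \sum_{c \in C} F_{1,c}. Micro-averaged results are therefore dominated by the frequent categories, while macro-averaged results weight all categories equally.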
8.2 Future Work
There are a variety of directions related to this research that could be explored in future work.
•
Empirical studies on the optimal frequency of the feature selection process.
One of the main stages that slows the learning process in a text categorization system is feature selection. In the experiments in this research, we performed feature selection whenever new training documents became available. It would be interesting to see whether there is an optimal training-set size at which feature selection can be halted, i.e., a size beyond which adding more examples and re-running feature selection no longer makes any significant difference to the list of extracted features. If such an optimal size exists, the overall learning time could be decreased significantly. (One way to detect this is sketched after this list.)
•
Further experiments to tune the parameters for KAN.
The choice of the value for the λ parameter in KAN is likely to be affected by the characteristics of a given data set, which indicates that λ needs to be tuned for each data set. In this research, however, we established this value manually and intuitively. It should be possible to determine it automatically, and with such an automatically tuned parameter value KAN might show further improved results. (A simple tuning loop is sketched after this list.)
•
Removing low relationship scores in KAN.
In KAN, we used all the relationship scores computed among the features, on the assumption that low relationship scores have only a minor impact on the predictions of KAN. Removing such low relationship scores may improve the effectiveness of KAN.
•
More studies on ways to define the ambiguous zone in RinSCut.
The range of the ambiguous zone in RinSCut was defined using upper and lower bounds computed automatically from the available training examples. The computation of these two bounds still overfits a small number of training examples, and more attention should be given to finding a way to avoid this overfitting. For a given data set, this implies experimenting with a range of bound-setting strategies and exploring them systematically.
•
More studies on the distribution of recommended training examples in the
uncertainty selective sampling methods.
We allocated nearly the same number of uncertain examples (and, for SS-U&C, positive-certain examples) to each category, that is, we kept an even distribution of recommended examples across categories. When the test documents in a data set are unevenly distributed, as in the Reuters-21578 corpus, varying the number of training examples recommended for each category may increase the micro-averaged performance. It would therefore be fruitful to explore the effects of an uneven distribution of recommended training examples and to examine what proportion of them is actually correct for a given category.
•
Applying our proposed methods to other data sets.
More experiments are needed to evaluate our methods on other corpora, such as the Reuters Corpus Volume 1 [RCV1]. Applying our methods (KAN, the RinSCut thresholding strategies, and the uncertainty selective sampling methods) to the categorization of other types of documents, such as web documents, is also plausible and may give different outcomes.
•
Applying our methods to other text-based classification tasks.
Our methods have been developed mainly for text categorization. We would like to apply them to other text-based classification tasks, such as text clustering, the task of automatically grouping similar documents.
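For the first item above, one simple way to detect that feature selection can be halted (a sketch under our own assumptions, not a procedure used in this thesis) is to re-run selection after each batch of new training documents and stop once successive feature lists are nearly identical, for example as measured by their Jaccard overlap:

    def jaccard(features_a, features_b):
        """Overlap between two feature lists, in [0, 1]."""
        a, b = set(features_a), set(features_b)
        return len(a & b) / len(a | b) if (a or b) else 1.0

    def selection_can_stop(previous_features, current_features, stability=0.95):
        """True when the newly selected feature list has stabilised."""
        return jaccard(previous_features, current_features) >= stability

The 0.95 stability threshold and the batch size at which selection is re-run are illustrative choices that would themselves need empirical study.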
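For the second item, the λ parameter could be tuned automatically with a straightforward held-out validation sweep. The sketch below assumes generic training and evaluation callables and an illustrative candidate grid, since the actual KAN routines are not shown here:

    def tune_lambda(train_fn, eval_fn, candidates=(0.1, 0.3, 0.5, 0.7, 0.9)):
        """train_fn(lam) -> classifier trained with that lambda value;
        eval_fn(classifier) -> validation score (e.g. macro-averaged F1)."""
        best_lam, best_score = None, float("-inf")
        for lam in candidates:
            score = eval_fn(train_fn(lam))     # train and evaluate one candidate
            if score > best_score:
                best_lam, best_score = lam, score
        return best_lam

Cross-validation over the training set could replace the single held-out split when labeled data are scarce.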
We have explored supervised and semi-supervised machine learning approaches to text categorization. For the important goal of reducing the number of labeled training examples needed to achieve a given level of performance, we have developed KAN, which exploits word co-occurrence information and discriminating power values; the RinSCut variants, which are new approaches to thresholding; and uncertainty selective sampling methods. We conducted extensive experiments on two standard test collections, in which we carefully evaluated KAN and the RinSCut variants and demonstrated their effectiveness in terms of F1, the standard evaluation metric. We also explored novel approaches to selective sampling, which showed the desirable effect of decreasing the number of labeled training examples needed to achieve a given level of macro-averaged performance.
Appendices
Appendix A: Stop-list
A
able
about
above
abruptly
absolutely
according
accordingly
accurately
across
actively
actual
actually
adequately
after
afterward
afterwards
again
against
ago
ahead
alas
all
alike
almost
alone
along
already
also
although
altogether
always
am
among
amongst
an
and
another
any
anybody
anyhow
anymore
anyone
anything
anyway
anywhere
apparently
approximately
are
around
as
aside
ask
asks
asked
asking
at
automatically
away
b
badly
barely
basically
be
beautifully
became
become
becomes
becoming
been
being
because
before
behind
beneath
beside
besides
between
beyond
bitterly
both
briefly
but
by
c
came
can
cannot
carefully
casually
certain
certainly
chiefly
clearly
come
comes
comfortably
coming
commonly
completely
consciously
consequently
considerably
consistently
constantly
continually
continuously
correctly
could
currently
d
deeply
definitely
deliberately
depending
desperately
despite
did
directly
do
does
doing
done
doubtless
during
e
each
eagerly
earnestly
easily
economically
effectively
either
else
elsewhere
emotionally
enough
entirely
equally
especially
essentially
etcetera
even
evenly
eventually
ever
every
everybody
everyday
everyone
everything
everywhere
evidently
except
exclusively
extremely
f
fairly
favorably
few
fewer
finally
firmly
for
forever
formerly
fortunately
frankly
from
fully
further
furthered
furthering
furthermore
furthers
g
gave
generally
gently
get
getting
gets
give
given
gives
giving
go
goes
going
gone
got
gradually
greatly
h
had
happily
hardly
has
hastily
have
having
he
heavily
hence
henceforth
her
here
herself
hey
highly
him
himself
his
historically
honestly
how
however
i
ideally
if
immediately
in
incidentally
including
increasingly
indeed
independently
indirectly
individually
inevitably
initially
instantly
instead
into
invariably
is
it
its
itself
j
just
k
knew
know
knowing
known
knows
l
largely
lately
latter
least
lest
let
lets
letting
lightly
likely
likewise
literally
locally
logically
loosely
loudly
m
made
mainly
make
makes
making
may
maybe
me
meanwhile
mentally
merely
might
more
moreover
most
mostly
much
must
my
myself
n
namely
naturally
near
nearby
nearer
nearest
nearly
neatly
necessary
necessarily
need
needed
needs
needing
neither
never
nevertheless
newly
next
no
nobody
non
none
nor
normally
not
notably
nothing
now
nowadays
nowhere
o
ok
obviously
occasionally
oh
of
off
officially
often
on
one
once
only
openly
or
ordinarily
originally
other
others
otherwise
ought
our
ourselves
out
over
own
p
painfully
paradoxically
partially
particularily
partly
patiently
per
perfectly
perhaps
permanently
personally
physically
plainly
possible
possibly
practically
precisely
preferably
presumably
previously
primarily
principally
privately
probably
promptly
properly
publicly
purely
put
puts
q
quietly
quite
r
rarely
rather
readily
really
reasonably
recently
regarding
regardless
regularly
relatively
repeatedly
respectively
s
safely
said
same
satisfactorily
saw
say
says
saying
scarcely
see
seeing
seem
seemed
seeming
seemingly
seems
seen
sees
seldom
separately
seriously
several
severely
shall
sharply
she
shortly
should
silently
similarly
simply
since
slightly
slowly
smoothly
so
socially
softly
solely
some
somebody
someday
somehow
someone
someplace
something
sometime
sometimes
somewhat
somewhere
soon
specifically
squarely
steadily
still
strictly
strongly
subsequently
substantially
successfully
such
sufficiently
supposedly
sure
surely
surprisingly
t
take
taking
taken
takes
tell
telling
tells
temporarily
than
that
the
their
them
themselves
then
there
thereafter
thereby
therefore
therefrom
therein
thereof
thereto
thereupon
therewith
these
they
thing
things
this
thoroughly
those
thou
though
thoughtfully
through
throughout
thus
tightly
to
together
told
too
took
totally
toward
towards
traditionally
truly
typically
u
ultimately
unanimously
unconsciously
under
undoubtedly
unexpectedly
unfortunately
unless
unlike
until
up
upon
upward
us
usual
usually
utterly
v
vaguely
versus
very
via
vigorously
violently
virtually
vs
w
was
we
went
were
what
whatever
when
whenever
where
whereabouts
whereas
whereby
wherefore
wherein
whereof
whereupon
wherever
whether
which
whichever
while
who
wholly
whose
why
widely
wildly
will
with
within
without
would
x
y
yeah
yet
you
your
yours
z
Bibliography
[20News]
The 20-Newsgroups collection, collected by Ken Lang, may be freely
available for research purpose only from,
http://www.ai.mit.edu/people/jrennie/20Newsgroups/.
[ADW94]
C. Apte, F. Damerau, and S. M. Weiss. Automated Learning of
Decision Rules for Text Categorization. ACM Transactions of
Information Systems, 12(3), pages 233-251, 1994
[AKCS00]
I. Androutsopoulos, J. Koutsias, K. Chandrinos and C. Spyropoulos.
An Experimental Comparison of Naive Bayesian and Keyword-Based
Anti-Spam Filtering with Personal E-mail Messages. In Proceedings of
the 23rd annual international ACM SIGIR conference on Research and
development in information retrieval, pages 160-167, 2000.
[AMST+96]
R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast Discovery of Association Rules. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, pages 307-328, 1996.
[Ang87]
D. Angluin. Queries and Concept Learning, Machine Learning, 2(4),
pages 319-342, 1987.
[BG01]
T. Brasethvik and J. A. Gulla. Natural Language Analysis for Semantic
Document Modeling. Data & Knowledge Engineering, 38, pages 45-62,
2001.
[Bri92]
E. Brill. A Simple Rule-based Part-of-Speech Tagger. In Proceedings of
the 3rd Annual Conference on Applied Natural Language Processing
(ACL), Trento, Italy, pages 152-155, 1992.
[Bri95]
E. Brill. Unsupervised Learning of Disambiguation Rules for Part of
Speech Tagging. In Proceedings of the 3rd Workshop on Very Large
Corpora, pages 1-13, 1995.
[BSA94]
C. Buckley, G. Salton, and J. Allan. The Effect of Adding Relevance
Information in a Relevance Feedback Environment. ACM SIGIR
Conference on Research and Development in Information Retrieval
(SIGIR'94), pages 292-300, 1994.
[BSAS95]
C. Buckley, G. Salton, J. Allan, and A. Singhal. Automatic Query
Expansion Using SMART: TREC 3. The Third Text Retrieval
Conference (TREC-3), National Institute of Standards and Technology
Special Publication 500-207, Gaithersburg, MD, 1995.
[BT98]
A. W. Black and P. Taylor. Assigning Phrase Breaks from Part-of-Speech Sequences. Computer Speech and Language, 12(2), pages 99-117, 1998.
[CAL94]
D. Cohn, L. Atlas, and R. Ladner. Improving Generalization with
Active Learning. Machine Learning, 15(2), pages 201-221, 1994.
[CH98]
W. W. Cohen and H. Hirsh. Joins that Generalize: Text Classification
Using WHIRL. In Proceedings of the 4th International Conference on
Knowledge Discovery and Data Mining (KDD-98), pages 169-173,
New York, NY, 1998.
[Cha97]
E. Charniak. Statistical Techniques for Natural Language Parsing. AI
Magazine, 18(4), pages 33-44, 1997.
[CM01]
X. Carreras and L. Màrquez. Boosting Trees for Anti-Spam Email
Filtering. In Proceedings of RANLP-01, 4th International Conference
on Recent Advances in Natural Language Processing, Tzigov Chark,
BG, 2001.
[CS96]
W. W. Cohen and Y. Singer. Context-sensitive Learning Methods for
Text Categorization. In Proceedings of the 19th Annual International
ACM SIGIR Conference on Research and Development in
Information Retrieval (SIGIR’96), pages 307-315, 1996.
[CX95]
W. B. Croft and J. Xu. Corpus-Specific Stemming using Word Form
Co-occurrence. In Proceedings of the 4th Annual Symposium on
Document Analysis and Information Retrieval, pages 147-159, Las
Vegas, Nevada, April 1995.
[CY92]
C. J. Crouch and B. Yang. Experiments in Automatic Statistical
Thesaurus Construction. In Proceedings of the 15th Annual
International ACM SIGIR Conference on Research and Development
in Information Retrieval (SIGIR’92), pages 77-88, 1992.
[Dam95]
M. Damashek. Gauging Similarity via N-Grams: Language-Independent Sorting, Categorization, and Retrieval of Text. Science,
267, February 1995.
[DDFL+90]
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R.
Harshman. Indexing by Latent Semantic Analysis. Journal of the
Society for Information Science, 41(6), pages 391-407, 1990.
[DE95]
I. Dagan and S. P. Engelson, Committee-Based Sampling for Training
Probabilistic Classifiers. In Proceedings of the 12th International
Conference on Machine Learning, pages 150-157, 1995.
[Fox90]
C. Fox. A Stop List for General Text. SIGIR Forum, 24:1-2, pages 19-35, 1990.
[Fur98]
J. Furnkranz. A Study Using n-gram Features for Text Categorization.
Technical Report OEFAI-TR-9830, Austrian Institute for Artificial
Intelligence, 1998.
[GZ00]
S. Goldman and Y. Zhou. Enhancing Supervised Learning with
Unlabeled Data. In Proceedings of the 17th International Conference
on Machine Learning, pages 327-334, 2000.
[Har75]
J. A. Hartigan. Clustering Algorithms. John Willey & Sons, 1975.
[HK00]
E. H. Han and G. Karypis. Centroid-Based Document Classification:
Analysis and Experimental Results. In Proceedings of the 4th
European Conference on Principles and Practice of Knowledge
Discovery in Databases (PKDD), pages 424-431, September 2000.
[HPS96]
D. Hull, J. Pedersen, and H. Schuetze. Document Routing as Statistical
Classification. In AAAI Spring Symposium on Machine Learning in
Information Access, Palo Alto, CA, March 1996.
[HW90]
P. Hayes and S. Weinstein. CONSTRUE/TIS: A System for Content-Based Indexing of a Database of News Stories. In Second Annual
Conference on Innovative Applications of Artificial Intelligence, pages
49-64, 1990.
[Joa97]
T. Joachims. A Probabilistic Analysis of the Rocchio Algorithm with
TFIDF for Text Categorization. In Proceedings of the 14th
International Conference on Machine Learning (ICML’97), pages 143-151, 1997.
[Joa98]
T. Joachims. Text Categorization with Support Vector Machines:
Learning with Many Relevant Features. In Machine Learning: ECML-98, Tenth European Conference on Machine Learning, pages 137-142,
1998.
[KMRT+94]
M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and I. Verkamo. Finding Interesting Rules from Large Sets of Discovered Association Rules. In Proceedings of the 3rd International Conference on Information and Knowledge Management (CIKM'94), pages 401-407, 1994.
[KS96]
D. Koller and M. Sahami. Toward Optimal Feature Selection. In
Proceedings of the 13th International Conference on Machine Learning
(ICML’96), pages 284-292, 1996.
[Lan95]
K. Lang. NewsWeeder: Learning to Filter Netnews. In Proceedings of
the 12th International Machine Learning Conference (ICML’95), Lake
Tahoe, CA, Morgan Kaufmann, San Francisco, pages 331-339, 1995.
[LC94]
D. D. Lewis and J. Catlett. Heterogeneous Uncertainty Sampling for
Supervised Learning. In Proceedings of the Eleventh International
Conference on Machine Learning, San Francisco, CA., Morgan
Kaufmann, pages 148-156, 1994.
[Lew92a]
D. D. Lewis. Representation and Learning in Information Retrieval.
Ph.D. Thesis, Department of Computer Science, University of
Massachusetts, Amherst, MA, 1992.
[Lew92b]
D. D. Lewis. An Evaluation of Phrasal and Clustered Representations
on a Text Categorization Task. In Proceedings of the 15th Annual
International ACM SIGIR Conference on Research and Development
in Information Retrieval, June 21-24, Copenhagen, Denmark, pages
37-50, 1992.
[LG94]
D. D. Lewis and W. Gale. A Sequential Algorithm for Training Text
Classifiers. In Proceedings of the 17th Annual International ACM
SIGIR Conference on Research and Development in Information
Retrieval, pages 3-12, 1994.
[LH98]
W. Lam and C. Y. Ho. Using A Generalized Instance Set for
Automatic Text Categorization. In Proceedings of the 21st Annual
International ACM SIGIR Conference on Research and Development
in Information Retrieval, August 24-28, Melbourne, Australia, pages
81-89, 1998.
[LKK00]
K. H. Lee, J. Kay, and B. H. Kang. Keyword Association Network: A Statistical Multi-term Indexing Approach for Document Categorization. In Proceedings of the 5th Australasian Document Computing Symposium, pages 9-16, December 2000.
[LKK02]
K. H. Lee, J. Kay, and B. H. Kang. Lazy Linear Classifier and Rank-in-Score Threshold in Similarity-Based Text Categorization. ICML Workshop on Text Learning (TextML’2002), Sydney, Australia, pages 36-43, July 2002.
[LKKR02]
K. H. Lee, J. Kay, B. H. Kang, and U. Rosebrock. A Comparative
Study on Statistical Machine Learning Algorithms and Thresholding
Strategies for Automatic Text Categorization. The 7th Pacific Rim
International Conference on Artificial Intelligence (PRICAI-02),
Tokyo, Japan, pages 444-453, August 2002.
[LR94]
D. Lewis and M. Ringuette. A Comparison of two learning algorithms
for text categorization. In Third Annual Symposium on Document
Analysis and Information Retrieval, pages 81-93, 1994.
[LSCP96]
D. D. Lewis, R. E. Schapire, J. P. Callan, and R. Papka. Training
Algorithms for Linear Text Classifiers. In Proceedings of the 19th
Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval (SIGIR’96), pages 298-306,
1996.
[LT97]
R. Liere and P. Tadepalli. Active Learning with Committees for Text
Categorization. In Proceedings of the 14th National Conference on
Artificial Intelligence, pages 591-596, 1997.
[Mer98]
D. Merkl. Text Classification with Self-Organizing Maps: Some
Lessons Learned. Neurocomputing, 21:1-3, pages 61-77, 1998.
[MG96]
I. Moulinier and J. G. Ganascia. Applying an Existing Machine Learning Algorithm to Text Categorization. In S. Wermter, E. Riloff, and G. Scheler (eds.), Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, Springer-Verlag, Berlin, pages 343-354, 1996.
[MG98]
D. Mladenic and M. Grobelnik. Word Sequences as Features in Text
Learning. In Proceedings of the 17th Electrotechnical and Computer
Science Conference (ERK98), Ljubljana, Slovenia, pages 145-148,
1998.
[MG99]
D. Mladenic and M. Grobelnik. Feature Selection for Unbalanced
Class Distribution and Naive Bayes. In Proceedings of the 16th
International Conference on Machine Learning, pages 258-267, 1999.
[Mla98]
D. Mladenic. Feature Subset Selection in Text-learning. In Proceedings
of the 10th European Conference on Machine Learning (ECML’98),
pages 95-100, 1998.
[MN98]
A. McCallum and K. Nigam. A Comparison of Event Models for
Naive Bayes Text Classifiers. In AAAI-98 Workshop on Learning for
Text Categorization, pages 41-48, 1998.
[MRG96]
I. Moulinier, G. Raskinis, and J. Ganascia. Text Categorization: A
Symbolic Approach. In Proceedings of the 5th Annual Symposium on
Document Analysis and Information Retrieval (SDAIR-96), pages 87-99, 1996.
[MRMN98]
A. McCallum, R. Rosenfeld, T. Mitchell, and A. Y. Ng. Improving
text classification by shrinkage in a hierarchy of classes. In Proceedings
of the 15th International Conference on Machine Learning, San
Francisco, CA, pages 359-367, 1998.
[NG00]
K. Nigam and R. Ghani. Understanding the Behavior of Co-training. In
Proceedings of the KDD-2000 Workshop on Text Mining, 2000.
[NH98]
K. Nagao and K. Hasida. Automatic Text Summarization Based on the
Global Document Annotation. In Proceedings of COLING-ACL’98,
pages 917-921, 1998.
[NMTM00]
K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text
Classification from Labeled and Unlabeled Documents using EM.
Machine Learning, 39(2/3), pages 103-134, 2000.
[Paz00]
M. J. Pazzani. Representation of Electronic Mail Filtering Profiles: A
User Study. Intelligent User Interfaces, pages 202-206, 2000.
[Por80]
M. F. Porter. An Algorithm for Suffix Stripping. Program, 14(3),
pages 130-137, July 1980.
[Qui93]
J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan
Kaufmann, 1993.
[R21578]
The Reuters-21578 collection, originally collected and labeled by
Carnegie Group Inc and Reuters Ltd, may be freely available for
research purpose only from,
http://www.daviddlewis.com/resources/testcollections/reuters21578/,
previous location of the collection was,
http://www.research.att.com/~lewis/reuters21578.html.
[R22173]
The Reuters-22173 collection, originally collected and labeled by
Carnegie Group Inc and Reuters Ltd, may be freely available by
anonymous ftp for research purpose only from,
ftp://ciir-ftp.cs.umass.edu:/pub/reuters1.
[RCV1]
The new Reuters collection, called Reuters Corpus Volume 1, has recently been made available by Reuters Ltd and may be freely available for research purpose only from,
http://about.reuters.com/researchandstandards/corpus/.
[Rij79]
C. J. van Rijsbergen. Information Retrieval. 2nd Edition, Butterworths,
London, UK, 1979.
[RMW95]
M. Röscheisen, C. Mogensen, and T. Winograd. Beyond Browsing:
Shared Comments, SOAPs, Trails, and On-line Communities. In
Proceedings of the 3rd International World Wide Web Conference,
Darmstadt, Germany, pages 739-749, April 1995.
[Roc71]
J. Rocchio. Relevance Feedback in Information Retrieval. In G. Salton,
editor, The SMART Retrieval System: Experiments in Automatic
Document Processing, Prentice-Hall Inc., pages 313-323, 1971.
[RS99]
M. E. Ruiz and P. Srinivasan. Hierarchical Neural Networks for Text
Categorization. In Proceedings of the 22nd ACM International
Conference on Research and Development in Information Retrieval
(SIGIR-99), pages 281-282, 1999.
[Rug92]
G. Ruge. Experiments on Linguistically Based Term Associations.
Information Processing & Management, 28(3), pages 317-332, 1992.
[Sal89]
G. Salton. Automatic Text Processing: The Transformation, Analysis,
and Retrieval of Information by Computer. Addison-Wesley, Reading,
Massachusetts, 1989.
[Sal91]
G. Salton. Developments in Automatic Text Retrieval. Science, Vol.
253, pages 974-979, 1991.
[SCAT92]
S. Sekine, J. Carroll, A. Ananiadou, and J. Tsujii. Automatic Learning
for Semantic Collocation. Proceedings of the Third Conference on
Applied Natural Language Processing, ACL, pages 104-110, 1992.
[SDHH98]
M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz. A Bayesian
Approach to Filtering Junk E-mail. In Proceedings of AAAI'98
Workshop Learning for Text Categorization, Madison, Wisconsin,
pages 55-62, 1998.
[Seb02]
F. Sebastiani. Machine Learning in Automated Text Categorization.
ACM Computing Surveys, 34(1), pages 1-47, 2002.
[SK00]
S. Shankar and G. Karypis. A Feature Weight Adjustment for
Document Categorization. The 6th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, August 20-23,
Boston, MA, 2000.
[SM99]
S. Scott and S. Matwin. Feature Engineering for Text Classification. In
Proceedings of the 16th International Conference on Machine
Learning, pages 379-388, 1999.
[SMB96]
M. A. Schickler, M. S. Mazer, and C. Brooks. Pan-Browser Support
for Annotations and Other Meta-Information on the World Wide Web.
Computer Networks and ISDN Systems, 28, pages 1063-1074, 1996.
[SP97]
L. Saul and F. Pereira. Aggregate and Mixed-order Markov Models for
Statistical Language Processing. In C. Cardie and R. Weischedel (eds.),
Proceedings of the 2nd Conference on Empirical Methods in Natural
Language Processing, pages 81-89, 1997.
[SSC97]
L. Singh, P. Scheuermann and B. Chen. Generating Association Rules
from Semi-Structured Documents Using an Extended Concept
Hierarchy. In Proceedings of the 6th International Conference on
Information and Knowledge Management (CIKM’97), pages 193-200,
1997.
[SSS98]
R. E. Schapire, Y. Singer, and A. Singhal. Boosting and Rocchio
Applied to Text Filtering. In Proceedings of the 21st Annual
International Conference on Research and Development in Information
Retrieval, pages 215-223, 1998.
[SW99]
The stop words from the ‘SuperJournal’ research project in the U.K.
This project was conducted over three years from 1996 to 1998, as
part of the Electronic Libraries Program (eLib). These stop words may
be freely available for research purpose only from,
http://www.mimas.ac.uk/sj/application/demo/stopword.html.
[TCM99]
C. Thompson, M. E. Califf, and R.J. Mooney. Active Learning for
Natural Language Parsing and Information Extraction. In Proceedings
of the 16th International Conference on Machine Learning, pages 406-414, 1999.
[VM94]
A. Vorstermans and J. P. Martens. Automatic Labeling of Corpora for
Speech Synthesis Development. In Proceedings of ICSLP’94, pages
1747-1750, 1994.
[WPW95]
E. Wiener, J. O. Pedersen, and A. S. Weigend. A Neural Network
Approach to Topic Spotting. In Proceedings of the Fourth Annual
Symposium on Document Analysis and Information Retrieval
(SDAIR’95), pages 317-332, 1995.
[WS85]
B. Widrow and S. D. Stearns. Adaptive Signal Processing. Prentice-Hall Inc., Englewood Cliffs, NJ, 1985.
[XC00]
J. Xu and W. B. Croft. Improving the Effectiveness of Information
Retrieval with Local Context Analysis. ACM Transactions on
Information Systems, 18(1), pages 79-112, January 2000.
[Yan01]
Y. Yang. A Study on Thresholding Strategies for Text Categorization.
In Proceedings of the 24th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval
(SIGIR’01), pages 137-145, 2001.
[Yan94]
Y. Yang. Expert Network: Effective and Efficient Learning from
Human Decisions in Text Categorization and Retrieval. In Proceedings
of the 17th International ACM SIGIR Conference on Research and
Development in Information Retrieval, pages 13-22, 1994.
[Yan99]
Y. Yang. An Evaluation of Statistical Approaches to Text
Categorization. Journal of Information Retrieval, 1(1/2), pages 67-88,
1999.
[YG98]
T. Yavuz and A. Guvenir. Application of k-Nearest Neighbor on
Feature Projections Classifier to Text Categorization. In Proceedings
of the 13th International Symposium on Computer and Information
Sciences – ISCIS’98, U. Gudukbay, T. Dayar, A. Gursoy, E. Gelenbe
(eds.), Antalya, Turkey, pages 135-142, 1998.
[YP97]
Y. Yang and J. O. Pedersen. A Comparative Study on Feature
Selection in Text Categorization. In Proceedings of the 14th
International Conference on Machine Learning (ICML’97), pages 412-420, 1997.
[YX99]
Y. Yang and X. Liu. A Re-examination of Text Categorization
Methods. In Proceedings of International ACM Conference on
Research and Development in Information Retrieval (SIGIR’99), pages
42-49, 1999.