Text Categorization with a Small Number of Labeled Training Examples

A Thesis Presented by Kang Hyuk Lee

Submitted to the University of Sydney in fulfillment of the requirements for the degree of Doctor of Philosophy

September 1, 2003
School of Information Technologies, University of Sydney

List of Publications

Some of the contents of this research were published in the following:

K. H. Lee, J. Kay, and B. H. Kang. Keyword Association Network: A Statistical Multiterm Indexing Approach for Document Categorization. In Proceedings of the 5th Australasian Document Computing Symposium, pages 9-16, December 1, 2000.

K. H. Lee, J. Kay, and B. H. Kang. Lazy Linear Classifier and Rank-in-Score Threshold in Similarity-Based Text Categorization. In Proceedings of the International Conference on Machine Learning Workshop on Text Learning (TextML'2002), Sydney, Australia, pages 36-43, July 8, 2002.

K. H. Lee, J. Kay, B. H. Kang, and U. Rosebrock. A Comparative Study on Statistical Machine Learning Algorithms and Thresholding Strategies for Automatic Text Categorization. In Proceedings of the 7th Pacific Rim International Conference on Artificial Intelligence (PRICAI-02), Tokyo, Japan, pages 444-453, August 18-22, 2002.

K. H. Lee, J. Kay, and B. H. Kang. Active Learning: Applying RinSCut Thresholding Strategy to Uncertainty Sampling. In Proceedings of the 16th Australian Joint Conference on Artificial Intelligence, Perth, Australia, December 3-5, 2003 (in press).

Acknowledgments

My great thanks to my supervisor, Associate Professor Judy Kay, for showing special patience as I pursued my own particular area of interest. She enthusiastically supported my doing research in an area outside her own specialties, and I know she spent a great deal of time offering suggestions and constructive criticism during the development of this thesis. I also thank her for the financial support in the form of a postgraduate scholarship.

Thanks to Dr. Byeong H. Kang and his research students at the University of Tasmania for their time and interest in my research area. He provided me with a great deal of guidance in machine learning and encouraged me to keep pursuing this research.

My thanks go to my wife, Soo Jung, and my two children, Eugene and Yusang. They showed endless love, understanding, and belief through the difficult situation I created. Thanks to them for letting me know how much I love them. My deepest thanks to my parents for their love and support, without which this thesis could not have appeared. Thanks to Mr. Kang-Kil Lee and his family for caring for my family in Sydney, Australia. Many thanks to the people in the School of Information Technologies for their patience as I consumed a great deal of CPU time and disk space on shared servers.

Abstract

This thesis describes the investigation and development of supervised and semi-supervised learning approaches to similarity-based text categorization systems that use a small number of manually labeled examples for training while maintaining effectiveness. The purpose of text categorization is to automatically assign arbitrary raw documents to predefined categories based on their contents. Text categorization involves several sub-phases that must be combined effectively for the success of the overall system. We explore those sub-phases in terms of an approach to similarity-based text categorization that has shown good categorization performance.

Supervised approaches to text categorization usually require a large number of training examples to achieve a high level of effectiveness. Each training example must be labeled by a human, so labeling such a large number of documents for training poses a considerable burden on the experts who must read each document and assign it to appropriate categories.
With this problem in mind, our goal was to develop a text categorization system that uses fewer labeled examples for training to achieve a given level of performance. We describe our new similarity-based learning algorithm (KAN) and thresholding strategies (the RinSCut variants). KAN was designed to give appropriate weights to terms according to their semantic content and importance, using term co-occurrence information and discriminating-power values in the similarity computation. In this way, KAN captures more statistical information for each training example than other, conventional similarity-based algorithms. After investigating the existing common thresholding strategies, we designed the RinSCut variants, which combine the strengths of previously investigated thresholding strategies, for multi-class text categorization in which documents may belong to variable numbers of categories. Our thresholding strategies are general and so can be applied to other similarity-based text processing tasks as well as other similarity-based learning algorithms.

To reduce the number of labeled examples needed to achieve a given level of performance, we developed uncertainty-based selective sampling methods. Rather than relying on random sampling for candidate training examples, our text categorization system uses selective sampling to actively choose candidate examples that are likely to be more informative than others. The system then either presents these to a human expert for labeling or assigns them its own predicted labels. We applied these uncertainty selective sampling methods to our own classifiers; however, they are quite general and could be used with any machine learning approach. We conducted extensive comparative experiments on two standard test collections (Reuters-21578 and 20-Newsgroups).
We present the experimental results using a standard evaluation method, F1, for micro- and macro-averaged performance. The results show that KAN and the RinSCut variants work better than other widely used techniques. They also demonstrate that our uncertainty selective sampling methods, in most cases, outperform random sampling in terms of the number of labeled training examples needed to achieve a given level of performance. The one exceptional case, in which selective sampling failed, occurred in micro-averaged performance on the multi-class text categorization task.

Contents

1 Introduction
  1.1 Text Categorization Process
  1.2 Outline of the Thesis
2 Text Categorization and Previous Work
  2.1 Text Categorization
    2.1.1 A Definition of Text Categorization
    2.1.2 Ambiguities in Natural Language Text
    2.1.3 Knowledge Engineering versus Machine Learning Approach
    2.1.4 Difficulties for the Machine Learning Approach
  2.2 Preprocessing
    2.2.1 Feature Extraction
    2.2.2 Representation
    2.2.3 Feature Selection
  2.3 Learning Classifiers
    2.3.1 Similarity-Based Learning Algorithms
      2.3.1.1 Profile-Based Linear Algorithms
      2.3.1.2 Instance-Based Lazy Algorithm
    2.3.2 Thresholding Strategy
      2.3.2.1 Rank-Based (RCut)
      2.3.2.2 Score-Based (SCut)
      2.3.2.3 Proportion-Based Assignment (PCut)
  2.4 Active Learning
    2.4.1 Uncertainty Sampling
    2.4.2 Committee-Based Sampling
  2.5 Evaluation Methods
    2.5.1 Performance Measures of Effectiveness
    2.5.2 Micro and Macro Averaging
    2.5.3 Data Splitting
3 Keyword Association Network (KAN)
  3.1 Objectives and Motivation
  3.2 Overall Approach
    3.2.1 Constructing KAN
    3.2.2 Relationship Measure
    3.2.3 Discriminating Power Function
    3.2.4 Applying KAN to Text Categorization
  3.3 Computational Complexity
4 RinSCut: New Thresholding Strategy
  4.1 Motivation
  4.2 Desired Properties
  4.3 Overall Approach
    4.3.1 Defining Ambiguous Zone
    4.3.2 Defining RCut Threshold
5 Evaluation I: KAN and RinSCut
  5.1 Data Sets Used
    5.1.1 Reuters-21578
    5.1.2 20-Newsgroups
  5.2 Text Preprocessing
    5.2.1 Feature Extraction
    5.2.2 Feature Weighting
    5.2.3 Feature Selection
  5.3 Experiments on the Number of Features
    5.3.1 Experimental Setup
    5.3.2 Results and Analysis
  5.4 Experiments on the Number of Training Examples
    5.4.1 Experimental Setup
    5.4.2 Results and Analysis
6 Learning with Selective Sampling
  6.1 Goal and Issues
    6.1.1 Homogeneous versus Heterogeneous Approach
    6.1.2 Using Positive-Certain Examples for Training
  6.2 Overall Approach
    6.2.1 Computing Uncertainty Values
    6.2.2 Defining Certain and Uncertain Documents with RinSCut
7 Evaluation II: Uncertainty Selective Sampling
  7.1 Data Sets Used and Text Processing
  7.2 Classifiers Implemented and Evaluated
  7.3 Sampling Methods Compared
  7.4 Results and Analysis
8 Conclusions
  8.1 Contributions
  8.2 Future Work
Appendices
  Appendix A: Stop-list
Bibliography

List of Tables

2.1 Contingency table for category ci.
2.2 Global contingency table for category set C.
5.1 The 53 categories of the Reuters-21578 data set used in our experiments (part 1).
5.2 The 53 categories of the Reuters-21578 data set used in our experiments (part 2).
5.3 The 53 categories of the Reuters-21578 data set used in our experiments (part 3).
5.4 The 20 categories of the 20-Newsgroups corpus.
5.5 Statistics for the unique features in the Reuters-21578 corpus.
5.6 Statistics for the unique features in the 20-Newsgroups corpus.
5.7 The percentage of training data and the number of training documents used in each round.
5.8 The best micro-averaged F1 and its classifier in each round on the Reuters-21578 corpus.
5.9 The best macro-averaged F1 and its classifier in each round on the Reuters-21578 corpus.

List of Figures

1.1 Initial-learning model for text categorization.
1.2 Categorization and learning model for text categorization.
1.3 Learning with selective sampling.
2.1 The per-document and per-category thresholding strategies.
2.2 Uncertainty sampling algorithm.
3.1 An example of the similarity computation with semantic and informative ambiguities.
3.2 An example for constructing KAN.
3.3 Algorithm for generating frequent 2-feature sets F2.
3.4 Network representation generated from the example.
3.5 Graphs of S against df, when λ = 0.35, 0.50, and 0.65.
3.6 An example of network representation for handling annotated training examples.
4.1 Comparability of similarity scores and thresholding strategies.
4.2 Ambiguous zone between ts_top(ci) and ts_bottom(ci) for a given category ci.
4.3 The locations of ts_max_top(ci), ts_min_top(ci), ts_max_bottom(ci), ts_min_bottom(ci), mod_ts_top(ci), and mod_ts_bottom(ci) in the ordered list of similarity scores.
5.1 An example of the Reuters-21578 document (identification number 9, assigned to the earn category and used for the training set).
5.2 An example of the 20-Newsgroups document (assigned to the alt.atheism newsgroup).
5.3 The resulting tokenized file after the feature extraction for the example document in Figure 5.1.
5.4 The TFIDF weighting scheme based on the term frequency and document frequency file.
5.5 F1 performance of Rocchio on the Reuters-21578 corpus (6,984 training documents used).
5.6 F1 performance of k-NN on the Reuters-21578 corpus (6,984 training documents used).
5.7 F1 performance of WH on the Reuters-21578 corpus (6,984 training documents used).
5.8 F1 performance of KAN on the Reuters-21578 corpus (6,984 training documents used).
5.9 F1 performance of Rocchio on the 20-Newsgroups corpus (all the training documents used in each split).
5.10 F1 performance of k-NN on the 20-Newsgroups corpus (all the training documents used in each split).
5.11 F1 performance of WH on the 20-Newsgroups corpus (all the training documents used in each split).
5.12 F1 performance of KAN on the 20-Newsgroups corpus (all the training documents used in each split).
5.13 Micro-averaged F1 of each classifier on the Reuters-21578 corpus with SCut.
5.14 Macro-averaged F1 of each classifier on the Reuters-21578 corpus with SCut.
5.15 Micro-averaged F1 of each classifier on the 20-Newsgroups corpus with RCut.
5.16 Macro-averaged F1 of each classifier on the 20-Newsgroups corpus with RCut.
5.17 Micro-averaged F1 of each classifier on the 20-Newsgroups corpus with RCut (with "truly random sampling").
5.18 Macro-averaged F1 of each classifier on the 20-Newsgroups corpus with RCut (with "truly random sampling").
5.19 Micro-averaged performance comparison of SCut and RinSCut variants on Rocchio (the Reuters-21578 corpus used).
5.20 Macro-averaged performance comparison of SCut and RinSCut variants on Rocchio (the Reuters-21578 corpus used).
5.21 Micro-averaged performance comparison of SCut and RinSCut variants on WH (the Reuters-21578 corpus used).
5.22 Macro-averaged performance comparison of SCut and RinSCut variants on WH (the Reuters-21578 corpus used).
5.23 Micro-averaged performance comparison of SCut and RinSCut variants on k-NN (the Reuters-21578 corpus used).
5.24 Macro-averaged performance comparison of SCut and RinSCut variants on k-NN (the Reuters-21578 corpus used).
5.25 Micro-averaged performance comparison of SCut and RinSCut variants on KAN (the Reuters-21578 corpus used).
5.26 Macro-averaged performance comparison of SCut and RinSCut variants on KAN (the Reuters-21578 corpus used).
6.1 Flow of unlabeled documents in our selective sampling.
6.2 Definition of certain and uncertain examples using ts_top(ci) and ts_bottom(ci) for a given category ci.
7.1 Micro-averaged F1 of KAN+RCut on the 20-Newsgroups (RS: random sampling; SS-U: selective sampling of uncertain examples; SS-U&C[1000]: selective sampling of uncertain and certain examples [1,000 certain examples]; SS-U&C[500]: selective sampling of uncertain and certain examples [500 certain examples]).
7.2 Macro-averaged F1 of KAN+RCut on the 20-Newsgroups (RS: random sampling; SS-U: selective sampling of uncertain examples; SS-U&C[1000]: selective sampling of uncertain and certain examples [1,000 certain examples]; SS-U&C[500]: selective sampling of uncertain and certain examples [500 certain examples]).
7.3 Micro-averaged F1 of KAN+GRinSCut on the Reuters-21578 (RS: random sampling; SS-U: selective sampling of uncertain examples; SS-U&C[500]: selective sampling of uncertain and certain examples [500 certain examples]; SS-U&C[250]: selective sampling of uncertain and certain examples [250 certain examples]).
7.4 Macro-averaged F1 of KAN+GRinSCut on the Reuters-21578 (RS: random sampling; SS-U: selective sampling of uncertain examples; SS-U&C[500]: selective sampling of uncertain and certain examples [500 certain examples]; SS-U&C[250]: selective sampling of uncertain and certain examples [250 certain examples]).
7.5 Micro-averaged F1 of KAN+LRinSCut on the Reuters-21578 (RS: random sampling; SS-U: selective sampling of uncertain examples; SS-U&C[500]: selective sampling of uncertain and certain examples [500 certain examples]; SS-U&C[250]: selective sampling of uncertain and certain examples [250 certain examples]).
7.6 Macro-averaged F1 of KAN+LRinSCut on the Reuters-21578 (RS: random sampling; SS-U: selective sampling of uncertain examples; SS-U&C[500]: selective sampling of uncertain and certain examples [500 certain examples]; SS-U&C[250]: selective sampling of uncertain and certain examples [250 certain examples]).

Chapter 1
Introduction

The work in this thesis explores supervised and semi-supervised machine learning approaches to text categorization: an algorithm that exploits word co-occurrence information and discriminating-power values, new approaches to thresholding, and the important goal of reducing the number of labeled training examples.

The amount of textual information stored digitally in electronic forms is huge and increasing. With the so-called information overload problem, caused by the growing availability and heavy use of such electronic textual information, there has been increasing interest in tools that can help to organize and describe the large amount of online textual information for later retrieval and use. One successful and directly applicable paradigm for helping users make good, quick selections of textual information of interest is to classify documents according to their topics. The main text classification tasks, usually considered distinct, are information retrieval, text categorization, information filtering, and document clustering [Lew92a].
However, the boundaries between them are not sharp, as all involve grouping documents based on their contents. Even though most machine learning methods in this research have been developed for text categorization, they are quite general and applicable to other text classification tasks that are usually based on some similarity (or distance) measure between documents.

Text categorization is, simply, the task of automatically assigning arbitrary documents to predefined categories (topics or classes) based on their content. There are two different approaches to text categorization. One approach assigns each document to a single category, the one it appears to fit best. The other approach, also studied in this thesis, allows each document to be assigned to all categories that it matches well.

Traditionally, human experts who are knowledgeable about the categories conduct text categorization manually. While this manual approach makes it possible to categorize documents based on their semantic content and the user's own conceptual model, it requires substantial human resources to read each document and decide its appropriate categories. As a result, the number of classified documents tends to be very small. This is generally a severe obstacle to the success of fully manual text categorization systems.

Automatic approaches to text categorization involve the construction of a categorization scheme, which is used for categorizing previously unseen documents. A categorization scheme consists of the knowledge base (i.e., a set of classifiers for all predefined categories) that captures the concepts describing the categories. In a supervised learning approach, the knowledge base is learned from natural language texts with category information (i.e., labels). Most natural languages are ambiguous and usually yield a large number of input features (words, terms, tokens, or attributes) for the supervised learning process.
These characteristics of natural language texts make the target concepts complex. This means the learning process, if it is to produce an accurate knowledge base, requires a large number of labeled examples. In many practical situations, such a large number of labeled documents for training is not readily available, since manually labeling them is such a big burden on human experts. This bottleneck is a critical problem for an automatic approach. Also, a large number of training documents may delay the incorporation of new training examples into an existing knowledge base, because of the time required for an extensive learning process over the existing training documents together with the new ones. It is, therefore, important for the success of a text categorization system to develop machine learning methods that automate the text categorization task and give reasonable effectiveness with fewer labeled training documents. The need for such a text categorization system becomes more apparent with efforts to automatically classify personal textual information (for example, electronic mail). In this case, the system must respond quickly, and the small number of labeled documents may be insufficient for effective learning.

In this research, our main goal is to develop supervised and semi-supervised learning methods for an automated text categorization system that can achieve good performance with a small set of labeled examples. To achieve this goal, we briefly discuss two learning tasks:

1. Constructing an accurate knowledge base with a small set of training examples. Various supervised machine learning techniques have been successfully applied to text categorization, and their effectiveness has usually been evaluated over the full set of available training examples. In other words, relatively little research has been conducted on learning knowledge bases from a small number of training examples.
Besides effectiveness, another important aspect of a new learning method is its efficiency in time and space. There are many situations where a text categorization system should respond quickly to users' information needs and any feedback they provide. If a text categorization system requires long processing times to incorporate new training examples into an existing knowledge base, this can count decisively against it.

2. Determining the most informative unlabeled examples for training, based on the knowledge base learned from existing labeled examples. The number of training documents needed to achieve a given level of text categorization performance can be reduced by selecting, for labeling and training, only the most informative examples. Selective sampling (or active learning) [CAL94], as opposed to random sampling, refers to any learning method involving active selection of candidate training documents. It selects only the most informative documents by filtering a stream of unlabeled examples. This filtering process is usually based on the estimated uncertainty values of unlabeled examples. The uncertainty of a document under a given category is measured by comparing the document's similarity score with the category's threshold.

In previous semi-supervised approaches to uncertainty selective sampling, the most uncertain examples are selected: the documents with similarity scores closest to the category's threshold [LC94]. These are then presented to human experts for labeling. In this thesis, we explore the possibility that the most positive-certain documents could also be beneficial for training. If using such positive-certain documents leads to performance improvements, a further advantage for the central goal of this thesis is that they do not require human involvement in labeling: they are labeled automatically by the system.
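To make the selection criterion concrete, the core of uncertainty sampling can be sketched in a few lines. This is a minimal illustration only: the fixed threshold, toy scores, and document identifiers below are assumptions for the sketch, not the KAN classifiers or RinSCut thresholds developed later in this thesis.

```python
def uncertainty(score: float, threshold: float) -> float:
    """Uncertainty is highest when a document's similarity score
    lies closest to the category's decision threshold."""
    return -abs(score - threshold)

def select_uncertain(doc_ids, score_of, threshold, k):
    """Return the k documents whose similarity scores are nearest the
    threshold; these are the candidates presented to a human expert."""
    ranked = sorted(doc_ids,
                    key=lambda d: uncertainty(score_of(d), threshold),
                    reverse=True)
    return ranked[:k]

# Toy similarity scores from some similarity-based classifier
# (illustrative values, not real experimental output).
scores = {"d1": 0.91, "d2": 0.49, "d3": 0.12, "d4": 0.55}
picked = select_uncertain(scores, scores.get, threshold=0.5, k=2)
# picked == ["d2", "d4"]: the two documents closest to the threshold
```

Documents far above the threshold (such as d1 here) are the "positive-certain" examples discussed above: the classifier is confident about them, so they can be labeled automatically rather than shown to a human.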
A knowledge base (or set of classifiers) used for uncertainty measurement can be the same as (homogeneous approach) or different from (heterogeneous approach) the knowledge base used for categorizing new documents. The main reason for adopting a heterogeneous approach is that the type of knowledge base used for categorization is too computationally expensive to build and use for selective sampling [LC94]. So, rather than constructing a new knowledge base, it is highly desirable to use the available classifiers if they are accurate and fast enough for uncertainty selective sampling.

In this research project, we develop and investigate machine learning methods for text categorization that give good performance with small numbers of labeled training documents. Our supervised learning approaches are effective and not computationally expensive in terms of the number of input features. Our semi-supervised learning approaches to uncertainty selective sampling directly use the same type of classifiers for determining which of a set of unlabeled examples are most useful for training. In the following subsections, we examine our text categorization process in more detail and outline this thesis.

1.1 Text Categorization Process

The main task of a machine learning approach to text categorization is to automatically build a knowledge base that can be used for assigning previously unseen documents to appropriate categories. Because of the ambiguities in natural languages and the complex concepts of categories, learning an accurate knowledge base generally requires a large number of manually labeled training documents. In the real world, however, obtaining these is rarely practical and is sometimes impossible.
This problem provides our motivation for developing a text categorization system that achieves good categorization results with small numbers of labeled training examples, quickly incorporates newly labeled examples into the existing categorization scheme, and, as a result, allows a homogeneous approach to the uncertainty selective sampling of candidate training examples. This section describes the text categorization process in our system and briefly examines typical techniques.

Initial-Learning

Figure 1.1 shows the initial-learning model, for constructing the initial knowledge base. It is intended to operate with a small set of labeled examples. The resulting initial knowledge base can be used for predicting the categories of new documents and for uncertainty selective sampling of informative examples for future training. In this model, a small number of unlabeled raw documents are randomly selected. These are presented to human experts for labeling. Then, in the preprocessor, the labeled documents are transformed to a representation that is readable by, and suitable for, a machine learning algorithm, shown in the 'learner' box of Figure 1.1. The common representation adopted by most machine learning algorithms is the vector space model ("bag-of-words") [Sal91]. In this representation, each document is represented by a list of extracted features (words, terms, tokens, or attributes). Their associated weights are computed from either the existence or the frequency of each feature in a given document. Various techniques from natural language processing can be applied to extract informative features. The common techniques used for feature extraction are the stop-list (or negative dictionary, as it is referred to in [Fox90]), which removes common words such as the, a, of, to, is, and was, and word stemming, which reduces different words to the same stem, for example children and childhood → child.
Then, each feature can be weighted using either a Boolean value indicating whether the feature appears in a given document or a numeric value derived from its frequency of occurrence. Applying other, more complicated text processing techniques, such as part-of-speech tagging [Bri92, Bri95] and n-gram (phrase) generation [SP97, BT98], may also improve performance. However, there is a trade-off between their substantial preprocessing time and their small benefits in text categorization [SM99].

Figure 1.1 Initial-learning model for text categorization.

One of the common characteristics of natural languages is the high dimensionality of their feature space. Even for a moderately sized corpus, the categorization task may be confronted by several thousands of features and tens of predefined categories. The computational cost of learning a knowledge base for problems of this size is prohibitive for most machine learning algorithms. Feature selection, which is considered a preprocessing step, has been successfully applied to reduce this huge feature space without a loss of categorization performance [YP97]. It eliminates, as uninformative, the many words that appear evenly across categories. Previous work in feature selection [KS96, MG99] has shown that one can achieve a significant performance improvement by applying feature selection methods rather than using the full feature set. An important issue in applying any feature space reduction method is deciding how many features should be chosen for each category. The optimal feature set size is affected by the characteristics of the text data and the chosen machine learning algorithm. Therefore, a good choice typically requires many experiments on a variety of feature set sizes, to evaluate the effectiveness of the classifiers.
A set of training data in the reduced representations, with category information, is presented to the learner. The task of the learner is to analyze the set of training examples and to construct classifiers that predict the categories of new documents. It typically generates one classifier for each category, and each classifier has an associated threshold value. A wide range of machine learning algorithms has been developed and applied to learn classifiers (see, for example, [ADW94, BSA94, LR94, Yan94, WPW95, LSCP96, MRG96, Joa97, CH98, Joa98, MN98, YG98, RS99]). One general group of learning algorithms that has shown good text categorization performance is the similarity-based group [MG96, Yan99, HK00]. In this group of algorithms, the mapping from a new document to a particular category is based on the comparison of the category's threshold and the similarity score for the new document. The threshold values in classifiers are predefined by applying a thresholding strategy. This is usually separate from the learning algorithms, which are used only for the similarity computation between a new document and each category. With a small number of training examples available, an important issue in the initial-learning model is that the resulting knowledge base must be accurate enough to be used for both categorization and subsequent uncertainty selective sampling. As shown in Figure 1.1, this learning process involves and relies upon two critical steps: the document representation, based on feature selection, and the learner (machine learning algorithm and thresholding strategy).

Categorization and Learning

We now describe the process of text categorization where there is user feedback that initiates the learning process. Once initial-learning has been completed, the resulting knowledge base can be used to predict the categories of a new document.
The results of the prediction are presented to the human expert and then, if there is any feedback on the prediction, learning is initiated to update the current knowledge base. This categorization and learning model is described in Figure 1.2.

Figure 1.2 Categorization and learning model for text categorization.

Like the raw documents in the initial-learning model of Figure 1.1, any real-world document first needs to be transformed to a representation in the preprocessor. To make the binary decision between a document and each category, the predictor computes similarity scores for every document-category pair, based on the document representation and the classifiers. The decision on the categories for the documents is made by thresholding these similarity scores. A wide range of measures can be used for the similarity computation in the predictor (see [Har75, Dam95] for some possible similarity measures). One of the most widely used similarity measures is the cosine of the angle between two vectors, computed as the inner product of the two normalized vectors [Sal89]. We choose this measure since it performs well throughout the text categorization literature and exploration of this aspect is not the core of our research. Any feedback from human experts should be incorporated quickly into the current knowledge base to make it more accurate. As well as simple category information on new documents, the expert may modify the predefined categories: adding new categories or deleting some predefined ones. Some approaches, such as [RMW95, SMB96, NH98], allow human experts to annotate documents, indicating or writing some important terms in a given document. The creation and manipulation of such annotated documents is an important area in the field of text categorization, but is outside the scope of this thesis.
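The cosine measure adopted here follows directly from its definition as the inner product of two length-normalized vectors. A minimal sketch, using hypothetical term-weight vectors:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term-weight vectors:
    the inner product of the two length-normalized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_u * norm_v)

# Vectors pointing in the same direction score 1.0;
# vectors sharing no features score 0.0.
print(cosine([1.0, 2.0], [2.0, 4.0]))  # 1.0 (up to rounding)
print(cosine([1.0, 0.0], [0.0, 3.0]))  # 0.0
```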
Learning with Selective Sampling

Figure 1.3 shows another type of learning model in our system: learning with selective sampling. Rather than relying on random sampling of documents for labeling and training as in Figure 1.1, this model uses a sampler to automatically select the candidate training examples. Among the selected informative examples, this model presents uncertain examples to the human expert for labeling and directly uses positive-certain examples without manual labeling. The expected result is a reduction in the number of labeled training examples needed for a given level of text categorization performance, compared with the number required by random sampling. This offers the potential of improved performance in our target area, where we would like to achieve a desired performance level with fewer examples than random sampling requires.

Figure 1.3 Learning with selective sampling.

In this research, selective sampling is based on the uncertainty of the categorization for a document [LG94]. This is measured by comparing the estimated similarity score and the threshold. To make a distinction between uncertain and positive-certain examples, we design a new scheme based on our own thresholding strategy, RinSCut. Computing the uncertainty value of a document under a given category requires a classifier for that category. Since the number of unlabeled documents is huge, the classifier used for uncertainty selective sampling should be cheap to build and use.
As shown in Figure 1.3, the uncertainty measure in the sampler can be based on the same type of knowledge base as the one used for categorization (homogeneous), knowledge base 1 in Figure 1.3, or on a different type, knowledge base 2 (heterogeneous). As mentioned earlier, the heterogeneous approach is preferred if the homogeneous approach is computationally expensive. However, building a separate type of knowledge base for selective sampling imposes another cost on the system. Moreover, a poor-quality knowledge base that is cheap to run may yield unreliable uncertainty values for documents. The effectiveness of the classifiers used for selective sampling can be an important factor in successful selective sampling. As a result, if the type of knowledge base used for categorization is accurate and not expensive to run, a homogeneous approach is preferable to a heterogeneous one. Another consideration in uncertainty selective sampling is the use of the most positive-certain documents without labeling by a human expert. Unlike uncertain documents, positive-certain examples are ones to which the text categorization system can, with some confidence, attach category membership. So, they can be used for training without category labels from human experts. This suggests that the sampler in this extended learning model should have the ability to distinguish between uncertain and positive-certain examples. Previous approaches to selective sampling continue identifying candidate training documents until either there are no more unlabeled documents or the human expert is unwilling to label more examples. In other words, they have no stop-point mechanism to stop selecting examples for a given category. If the sampler keeps choosing training examples for a given category after the target concept description has been learned, it may waste the human expert's time. The positive-certain documents could be less informative than uncertain ones for training.
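As a rough illustration of the distinction the sampler must draw, the sketch below uses a single symmetric margin around a category's threshold to separate uncertain from positive-certain documents. This symmetric margin is a simplification invented for illustration only; the scheme actually used in this thesis is based on RinSCut and is defined in later chapters.

```python
def triage(score, threshold, margin):
    """Classify one unlabeled document's similarity score as
    'uncertain' (near the threshold: ask the human expert),
    'positive-certain' (well above the threshold: auto-label),
    or 'other' (neither). The symmetric margin is a
    simplification for illustration."""
    if score >= threshold + margin:
        return "positive-certain"
    if abs(score - threshold) < margin:
        return "uncertain"
    return "other"

print(triage(0.95, threshold=0.5, margin=0.1))  # positive-certain
print(triage(0.55, threshold=0.5, margin=0.1))  # uncertain
```

Positive-certain documents are added to the training set with the category's label attached automatically, at no human labeling cost.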
However, using them for training could lead to performance improvements and, more importantly, it incurs no human labeling cost.

1.2 Outline of the Thesis

The main goal of our text categorization system is to achieve good text categorization performance with smaller numbers of labeled training examples. Because natural languages are unstructured and the target concepts of predefined categories are complex, learning accurate classifiers requires a large number of labeled examples. The labeling process needs human involvement. So, labeling such a large number of documents is time-consuming, tedious, and sometimes error-prone [HW90, ADW94, VM94]. This thesis explores machine learning approaches which offer the promise of significantly reducing this labeling cost, operating on a smaller number of labeled training examples to construct classifiers that are accurate and cheap enough to use as a foundation for uncertainty selective sampling. In this research project, we begin by looking at previous work in text categorization. We focus on similarity-based approaches that have shown good categorization results. These investigations, reported in Chapter 2, highlight the important properties our text categorization system should have. In Chapter 3, we introduce the Keyword Association Network (KAN), a new machine learning algorithm that we have developed and applied to compute the similarity scores of new documents. It promises increased accuracy because it attempts to resolve ambiguities in natural language text. KAN is, essentially, a framework for exploiting word co-occurrence statistics and each feature's discriminating power in the similarity computation. The promise of this approach is that such statistical information might provide a reasonably accurate initial classifier with small numbers of labeled training examples. Also, in Chapter 4, we introduce a new thresholding strategy, rank-in-score (RinSCut), to find the optimal thresholds for the categories.
This has been one of the relatively unexplored research areas in text classification, even though it has a significant impact on the overall effectiveness of results. RinSCut is designed to combine the strengths of two common thresholding strategies, rank-based (RCut) and score-based (SCut). It is designed to make an online decision on new documents, reduce the risk of overfitting to a small number of training examples, and give thresholds by optimizing both local and global performance. In Chapter 5, we use two standard test collections, Reuters-21578 [R21578] and 20-Newsgroups [20News], in comparative experiments. First, we perform experiments to tune feature selection, finding a suitable size of feature space for each learning algorithm. Then, we assess KAN and RinSCut against other widely used similarity-based learning algorithms and thresholding strategies. We show the effectiveness and efficiency of our approaches for building classifiers and for use in uncertainty selective sampling. KAN's efficiency derives from a reduced feature space that gives performance improvements over the full feature set. The effectiveness of RinSCut is tested on the Reuters-21578 corpus, since it is designed for multi-class text categorization problems. Chapter 6 presents uncertainty selective sampling methods based on our new approach, RinSCut. We report several comparative experiments in Chapter 7 to show the effectiveness of our homogeneous approach for selective sampling methods. Then, we finish this thesis with conclusions and possible directions for future work in Chapter 8.

Chapter 2 Text Categorization and Previous Work

Text categorization is the task of assigning previously unseen documents to appropriate predefined categories. The supervised machine learning approach makes this automatic, by learning classifiers from a set of training examples.
For most supervised learning algorithms, building accurate classifiers needs a large volume of manually labeled examples. This manual labeling process is time-consuming, expensive, and subject to some level of inconsistency [HW90, ADW94, VM94]. This problem motivates our work towards a text categorization system that can achieve a satisfactory level of performance with fewer training examples. Many machine learning algorithms have been developed and applied to the construction of classifiers. They can usually be grouped into rule-based, probability-based, and similarity-based learning algorithms. This thesis focuses on the similarity-based approach. This builds upon the large volume of previous work in the area of text categorization that has adopted this approach. We chose the similarity-based approach as a foundation for building more accurate classifiers because it offers the possibility of exploiting statistical information that may capture the target concepts hidden in documents. As shown in Chapter 1, our text categorization system is complex in that it consists of multiple models and multiple phases in each model. Each phase has a huge impact on the others; an unsatisfactory partial result or delay in processing in any one phase may result in the failure of the overall text categorization process. In this chapter, we describe various techniques for each phase of similarity-based text categorization systems. This establishes what has been previously achieved and also gives an indication of desirable properties for our text categorization system.

2.1 Text Categorization

In this section, we give an overview of text categorization. We first define the text categorization task and discuss the ambiguities in most natural languages on which classifiers must be built. Then, we discuss two general approaches, "knowledge engineering" and "machine learning", to the construction of classifiers, and why we are focusing on a machine learning approach.
This section also describes characteristics of the domain of text categorization that make this task difficult for a machine learning approach.

2.1.1 A Definition of Text Categorization

Text categorization (also known as text classification) is the automated assignment of natural language text to appropriate thematic categories, based on its content. A set of categories is predefined manually. This task can be formalized as binary categorization: to determine a Boolean value b ∈ {T, F} for each pair (dj, ci) ∈ D × C, where D is a domain of documents and C = {c1, c2, … , c|C|} is a set of predefined categories. The value T assigned to (dj, ci) indicates a decision to assign dj to ci, while the value F indicates a decision not to assign dj to ci. A function Φ : D × C → {T, F} that describes how documents might be categorized is called the classifier (also known as a hypothesis, model, or rule). The similarity-based classifier that this thesis is concerned with typically includes a weight vector w and a threshold τ for each category. For a given input vector v for document dj ∈ D, the assignment decision on the input vector v is b = T if w⋅v ≥ τ, and b = F if w⋅v < τ. Two different types of text categorization task can be identified, depending on the number of categories that could be assigned to each document. The first type, in which exactly one category is assigned to each dj ∈ D, is regarded as the single-class (or non-overlapping categories) text categorization task. The second type, in which any number of categories from zero to |C| may be assigned to each dj ∈ D, is called the multi-class (or overlapping categories) task [Seb02]. A special type of multi-class text categorization is one where each document is assigned to the same number k, where k > 1, of categories. The answer to the question of which type of text categorization should be adopted for a given text categorization system depends on the application and the characteristics of the corpus.
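The binary assignment decision defined above (assign the document exactly when the weighted score reaches the category's threshold) can be sketched as follows; the weight vector, document vector, and threshold value are hypothetical.

```python
def assign(w, v, tau):
    """Binary assignment decision for one document-category pair:
    True (assign) if the weighted score w.v reaches the
    category's threshold tau, False otherwise."""
    score = sum(wi * vi for wi, vi in zip(w, v))
    return score >= tau

w = [0.6, 0.8, 0.0]  # hypothetical category weight vector
v = [1.0, 0.5, 2.0]  # hypothetical document vector
print(assign(w, v, tau=0.9))  # True: 0.6*1.0 + 0.8*0.5 = 1.0 meets tau = 0.9
```

In the multi-class setting, this decision is made independently for every category, so a document may be assigned to zero, one, or several categories.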
2.1.2 Ambiguities in Natural Language Text

In most content-based text classification systems, an important issue is how to capture the meaning of natural language texts. Obtaining accurate classifiers requires the system to understand natural language at some level. Understanding natural language, however, is a difficult task due to its ambiguities:
1. The same sentence may have different meanings. For example, consider a sentence like "Salespeople sold the dog biscuits" (an example from [Cha97]). This sentence can be interpreted in two different ways: (1) the salespeople are selling the dog-biscuits and (2) the salespeople are selling biscuits to dogs.
2. There is a large number of synonyms, syntactically different words with the same or similar meanings, in natural languages. It is regarded as good writing style not to repeatedly use the same word for expressing a particular idea (or concept). Synonyms allow the same idea to be expressed by different words that have a similar meaning.
3. Polysemy refers to an ambiguity where words that are spelled the same can have different meanings in different sentences or documents. For example, the word "bat" may mean (1) an implement used in sports to hit a ball or (2) a flying mammal.
Resolving such ambiguities is probably beneficial to text categorization when there are many words in common across categories, even though it may not have a huge impact on the overall text categorization performance.

2.1.3 Knowledge Engineering versus Machine Learning Approach

There are two different ways of constructing classifiers, the function Φ : D × C → {T, F}: the "knowledge engineering" and "machine learning" approaches. In the knowledge engineering approach, human experts (including knowledge engineers and domain experts) manually create a set of rules that correctly categorize previously unseen documents under given categories.
While it allows for semantically-oriented text categorization, through the definition of controlled vocabularies that can be interpreted by the text categorization system [BG01], manually determining such a solution imposes a considerable workload on human experts. This makes it time-consuming and expensive. This manual approach may also cause inconsistency, since human experts often disagree on the assigned categories of documents and even one person may categorize documents inconsistently [ADW94, VM94]. As a result, the knowledge engineering approach suffers from the bottleneck of encoding large amounts of incomplete and potentially conflicting expert knowledge. The machine learning approach to text categorization automatically builds the classifiers by learning the concept descriptions of the categories. One type of machine learning applied to text categorization is "supervised learning". This requires a set of pre-labeled (pre-categorized) training documents for generating classifiers. By contrast, "unsupervised learning" refers to the task of automatically identifying a set of categories from a set of unlabeled documents and grouping these unlabeled documents under the identified categories [Mer98]. This task is typically called document clustering and is sometimes confused with text categorization, which is the focus of our work. The advantages of the machine learning approach over knowledge engineering are the considerable reduction in the volume of work required from human experts, consistent text categorization, and the capability of easily adjusting the generated classifiers to handle different types of documents (such as newspaper articles, newsgroup postings, electronic mail, etc.) and even languages other than English.
2.1.4 Difficulties for the Machine Learning Approach

The unstructured format of natural language text, and the diversity of target concepts associated with the categories, present interesting challenges to the content-based application of machine learning algorithms. The large number of input features that seems necessary for the construction of classifiers overwhelms most text categorization systems. For most machine learning algorithms, increasing the number of features means that they have to use more training examples to obtain the same level of text categorization performance. This large number of training examples and features may be computationally intractable for most machine learning algorithms, by requiring unacceptably large processing time and memory. Of the large number of features, there are usually many features that appear in most documents. These words can be considered irrelevant, in the sense that such features are evenly distributed throughout documents and, as a result, have no discriminating power. It is important for the efficiency and effectiveness of the system to select an efficient subset of features, by removing these irrelevant ones. However, this is a difficult task, since a reasonable feature subset size might differ across the categories and some informative features for a given category could be distributed across several categories. For example, depending on the level of concept complexity, some categories require a large number of features to describe their concepts while others need a relatively small number. Also, informative features in overlapping categories might be evenly distributed across those categories and could be considered irrelevant.

2.2 Preprocessing

Text preprocessing is the basic and critical stage needed for most text classification tasks.
It transforms all raw documents to a suitable form, called a representation, that is readable and usable by the relevant machine learning algorithms. In most similarity-based text categorization systems, a document is represented by a set of extracted features and their associated weights. For a content-based text categorization system to categorize documents effectively, it needs to understand natural language text at some level by resolving ambiguities in extracted terms, as described in Section 2.1.2. Also, the large number of irrelevant features in the full feature space may cause a significant drop in text categorization performance. A wide range of disambiguation and feature space reduction methods have been proposed from various research areas, such as information retrieval, natural language processing, and machine learning [DDFL+90, YP97, MG99]. Some methods might have obvious advantages in effectiveness over others. However, another important consideration is the speed of preprocessing, which is an important factor in most text categorization systems. This section describes the typical and most popular methods that have been applied to text preprocessing.

2.2.1 Feature Extraction

Feature extraction is the process of identifying features, or types of information, contained within the documents. It is one of the first and critical steps of almost every text classification system. It is these extracted features that machine learning algorithms for text categorization use to find the target concept descriptions of categories. Feature extraction first divides documents into separate terms at punctuation and white space. Then, more complicated text processing methods can be used to find more informative features. These include techniques such as removing stop-list terms, stemming different words to common roots, and identifying phrases.
The stop-list and stemming are common forms of natural language processing in most text categorization systems, while the use of phrases for text categorization needs more care, because most extracted phrases have extremely low frequency in the corpus and they may result in a much larger feature space. A stop-list contains stop-words (or common words) that are not useful for learning the target concepts. During feature extraction from documents, any words appearing in the stop-list are removed. There is no general theory for creating a stop-list. Choosing stop-words for a stop-list involves many arbitrary human decisions. Two of the most common types of stop-words are:
1. words that have little semantic meaning and
2. words whose frequency of occurrence in the corpus is greater than some level.
The second type of stop-word is domain-specific. In order to make text categorization systems more general, we use, in this work, only the first type of stop-words. The list of stop-words used in this research is shown in Appendix A. Most stop-words in this Appendix are from [SW99]. Another widely used text preprocessing technique is word stemming. In English, it reduces different words to common roots by removing word endings or suffixes. Stemming algorithms have been designed to handle plurals (cars → car) and to group simple synonyms (children and childhood → child). One of the most common stemming algorithms is Porter's algorithm [Por80]. Although it has been identified as having some problems in [CX95], it gives consistent performance improvements across a range of text classification tasks. Also, it is commonly accepted in natural language processing that Porter's algorithm is better than most other stemming algorithms. This work adopts Porter's stemming algorithm and applies it to the words remaining after stop-word removal.
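A minimal sketch of these two preprocessing steps follows. The stop-list here is a tiny hypothetical subset of the full list in Appendix A, and the stemmer is a deliberately crude suffix stripper standing in for Porter's algorithm, which applies staged rewrite rules rather than the single pass shown.

```python
# Hypothetical mini stop-list (the thesis uses the list in Appendix A).
STOP_WORDS = {"the", "a", "of", "to", "is", "and", "was"}

def crude_stem(word):
    """A deliberately crude suffix stripper standing in for
    Porter's algorithm; it removes one matching suffix at most."""
    for suffix in ("ren", "hood", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def extract_features(text):
    """Tokenize on whitespace, drop stop-words, stem the rest."""
    tokens = text.lower().split()
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(extract_features("the children of the child"))  # ['child', 'child']
```

After these steps, "children" and "child" map to the same feature, so their occurrences are counted together in the document representation.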
An interesting approach to feature extraction in content-based text classification tasks is the identification and use of phrases (or n-grams, which are sequences of words of length n) in addition to, or in place of, single words. A phrase usually consists of two or more single words that occur sequentially in natural language text. Phrase extraction can be motivated statistically or syntactically. A statistical phrase denotes any sequence of words that occurs contiguously in a text, and a syntactic phrase refers to any phrase that is extracted based on a grammar of the language. Some examples of syntactic phrases are noun phrases and verb phrases. Regardless of which approach is used for phrase extraction, using phrases for text categorization seems a feasible way to improve categorization performance, in that
• a set of words has a much smaller degree of ambiguity than its constituent individual words and,
• as a result, phrases can be a better linguistic textual unit than single words for expressing a complex concept description.
Lewis conducted a number of experiments to examine the effects of using phrases for text categorization [Lew92a, Lew92b]. He extracted all the syntactic noun phrases by applying part-of-speech tagging and showed that using those phrases actually gave worse categorization performance than using only single words. The main reasons he identified for this are the higher dimensionality of the feature space and the extremely low frequency of extracted phrases in the corpus. More recently, a number of researchers [Fur98, MG98, SSS98] have investigated the use of phrases. They performed experiments on various lengths of n-grams that were extracted using term frequency. Their experimental results showed that using word sequences of length two or three increased categorization performance slightly, while longer sequences actually reduced performance.
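Statistical phrase extraction, enumerating every contiguous word sequence of length n, is straightforward to sketch; the example sentence is invented.

```python
def ngrams(tokens, n):
    """Statistical phrases: every sequence of n words that occurs
    contiguously in the token stream."""
    return [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

tokens = "interest rates rise again".split()
print(ngrams(tokens, 2))  # ['interest rates', 'rates rise', 'rise again']
```

A document of m tokens yields m − n + 1 such phrases, which is why adding n-grams inflates the feature space so quickly.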
2.2.2 Representation

Machine learning algorithms need a suitable representation of raw documents. The common representation, adopted by most machine learning algorithms, is the vector space model or "bag-of-words" representation, where a document is represented by a set of extracted features and their associated weights. Each document vector corresponds to a point in n-dimensional space, where n is the number of features, and each component's value is either Boolean, indicating the existence of the feature in a given document, or the frequency of occurrence. All other information in a document, such as each feature's position and order, is lost. The simplicity and effectiveness of the vector space model make it the most popular representation method in content-based text classification systems. The core of similarity-based categorization is that similar documents have feature vectors that are close in the n-dimensional space. In the vector space model, each document dj is transformed to a vector vj = (x1, x2, ... , xn). Here, n is the number of unique features and each xk is the value of the kth feature. In similarity-based text categorization, a feature vector is usually calculated as a combination of two common weighting schemes:
• the term frequency, TFk, is the number of times the kth feature occurs in document dj and,
• the inverse document frequency, IDFk, is log(|A| / DFk), where DFk is the number of documents that contain the kth feature and |A| is the total number of documents in the training set A.
The role of IDF is to capture the notion that terms occurring in only a few documents are more informative than ones that appear in many documents. Then, xk is computed as TFk × IDFk. Because document lengths may vary widely, a length normalization factor should be applied to the term weighting function.
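The IDF definition above can be illustrated over a toy training set; the tokenized documents below are hypothetical.

```python
import math

def idf(training_docs):
    """Inverse document frequency for every feature:
    IDF_k = log(|A| / DF_k), where DF_k counts the training
    documents that contain feature k."""
    total = len(training_docs)
    df = {}
    for doc in training_docs:
        for feature in set(doc):  # count each document at most once
            df[feature] = df.get(feature, 0) + 1
    return {f: math.log(total / n) for f, n in df.items()}

# Toy training set A of tokenized documents.
A = [["wheat", "price"], ["wheat", "export"], ["price", "oil", "price"]]
weights = idf(A)
print(weights["oil"] > weights["wheat"])  # True: rarer terms get higher IDF
```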
The weighting equation that is used in this work [BSAS95] is given as:

xk = (log TFk + 1.0) × IDFk / √( ∑k=1,n [(log TFk + 1.0) × IDFk]² )     where TFk > 0

The vector space model assumes that features (and their occurrences) are independent. Ambiguities (synonymy and polysemy) in natural language text make this assumption inaccurate. One way of resolving these ambiguities is to use the relationships between features when assigning their weights. Latent semantic indexing (LSI) [DDFL+90] is a semantics-based representation method that overcomes the ambiguity problems by using the inter-relationships between terms. In information retrieval, LSI has been successful. In the similarity computation, it exploits coalescing terms that do not exist in a given query, and it reduces the high dimensionality by using only a reduced k-dimensional feature space. LSI has been used in a few text classification systems [WPW95, HPS96]. While it offers the potential for a better representation method than the vector space model for resolving the ambiguities of text, its principal disadvantage is that it is computationally expensive.

2.2.3 Feature Selection

A major problem in the text categorization task is the high dimensionality of the feature space. Even for a moderately sized corpus, the text categorization task may have to deal with thousands of features and tens of categories. Most sophisticated machine learning algorithms applied to text categorization cannot scale to this huge number of features. The learning process on such a high-dimensional feature space may require unacceptably long processing time and a large number of training documents, since not all features in the representation are relevant and beneficial for learning classifiers. As a result, it is highly desirable to reduce the feature space without removing potentially useful features for the target concepts of the categories.
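The log-TF weighting with cosine length normalization above can be sketched as follows (an illustrative implementation under our own naming, not code from [BSAS95]):

```python
import math

def idf_table(training_docs):
    """IDF_k = log(|A| / DF_k), where DF_k counts training documents containing term k."""
    df = {}
    for doc in training_docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    n = len(training_docs)
    return {term: math.log(n / count) for term, count in df.items()}

def tfidf_vector(doc_tokens, idf):
    """x_k = (log TF_k + 1) * IDF_k, then divide by the vector's Euclidean length."""
    tf = {}
    for term in doc_tokens:
        tf[term] = tf.get(term, 0) + 1
    raw = {t: (math.log(f) + 1.0) * idf.get(t, 0.0) for t, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm > 0 else raw
```

After normalization every document vector has unit length, so the inner product of two vectors equals their cosine similarity.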
The stop-word removal and word stemming methods that were described in Section 2.1.1 can be viewed as dimensionality reduction methods. However, the main effect of applying these methods is a reduction in the size of each document, not in the size of the full feature set. As a result, the high dimensionality of the feature space may persist even after applying them. They still leave the dimensionality prohibitively large for machine learning algorithms (especially for learning algorithms that try to use the inter-relationships among features). This means that text categorization needs more aggressive methods to reduce the size of the overall feature set. Various feature selection (also called dimensionality reduction) methods have been proposed and successfully applied for this aggressive reduction of the feature set without sacrificing text categorization performance. They include document frequency, mutual information, information gain, OddsRatio, and the χ² statistic [MG99, YP97]. The main idea behind all feature selection methods is to eliminate the many non-informative features that appear evenly across categories. The criteria for the choice of feature selection method are not clear, since the effectiveness of each method in text categorization is affected by the characteristics of the test corpus and the chosen machine learning algorithm. A difficulty with most sophisticated feature selection methods is that they are very time-consuming, so it is not practical or possible to perform the feature selection process whenever new training examples become available. This high time-complexity is a critical problem for our text categorization system, one of whose main characteristics is to quickly incorporate any new information into the current knowledge base.
However, this problem should not preclude the use of feature selection methods in the learning process: the presence or absence of a feature in the reduced representation should not change frequently with the addition of a small number of newly labeled examples. There are two different ways of conducting dimensionality reduction, depending on whether the method is applied locally to each category or globally to the set of all categories. For example, in the local application of an information gain function [Qui93], the information gain of a feature wk in a specific category ci is defined to be:

IG(wk, ci) = −[Pr(ci)·log Pr(ci) + Pr(c̄i)·log Pr(c̄i)] + ∑w∈{wk, w̄k} ∑c∈{ci, c̄i} Pr(w)·[Pr(c|w)·log Pr(c|w)]

where Pr(wk) is the probability that feature wk occurs and w̄k means that the feature does not occur. Pr(ci|wk) is the conditional probability of the category ci given that the feature wk occurs. This equation assumes that both category and feature are binary-valued; thus, frequency is not used as the value of features. The globally computed information gain of wk for the category set C is the sum of the local information gain values:

IG(wk, C) = ∑i=1,|C| IG(wk, ci)

In this project, we used document frequency and information gain together for feature selection. These methods have been widely used and are regarded as effective feature selection methods in text categorization. We removed all features occurring in fewer than a small threshold number of training documents, and then used information gain locally to choose the subset of features for each category. The reason for using local feature selection is that if feature selection is applied globally, the selected features could come mainly from a small number of categories and, as a result, the frequency values of most features may be zero in some categories. We performed many experiments, varying the size of the feature set, to find the optimal feature space that gives the best text categorization performance.
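The local information gain above, with binary-valued category and feature, can be computed directly from four document counts. The following sketch is our own reading of the equation (the counting interface is hypothetical):

```python
import math

def info_gain(n_total, n_ci, n_wk, n_both):
    """Local IG(w_k, c_i) from document counts: n_total training documents,
    n_ci in category c_i, n_wk containing feature w_k, and n_both
    containing w_k AND belonging to c_i."""
    def plogp(p):
        return p * math.log(p) if p > 0 else 0.0

    p_c = n_ci / n_total
    # Category entropy term: -[Pr(c)logPr(c) + Pr(~c)logPr(~c)]
    gain = -(plogp(p_c) + plogp(1 - p_c))
    # Sum over w in {w_k, not-w_k} and c in {c_i, not-c_i}
    for n_w, n_cw in ((n_wk, n_both), (n_total - n_wk, n_ci - n_both)):
        if n_w == 0:
            continue
        p_w = n_w / n_total
        p_c_given_w = n_cw / n_w
        gain += p_w * (plogp(p_c_given_w) + plogp(1 - p_c_given_w))
    return gain
```

A feature that occurs in exactly the positive documents attains the full category entropy, while a feature spread evenly across positives and negatives has zero gain.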
2.3 Learning Classifiers

From representations of labeled documents, the learner in our learning model builds a knowledge base in which a classifier for each individual category is stored. The learner usually includes two kinds of learning algorithms, since the construction of a classifier for each category ci ∈ C consists of both the definition of a function φi that gives an estimated categorization value for a given document dj ∈ D and the definition of a threshold τi. This classifier is then used to categorize previously unseen documents. For the assignment decision on a document dj under category ci, the classifier computes an estimated categorization value φi(dj) and tests it against the threshold τi: dj is categorized under ci if φi(dj) ≥ τi, and it is not if φi(dj) < τi. For the definition of φi, a wide range of machine learning algorithms has been developed and applied. These include rule-based induction algorithms [ADW94, CH98, MRG96], linear learning algorithms [Joa97, LSCP96, BSA94], k-Nearest Neighbor [YG98, Yan94], naive Bayes probabilistic algorithms [Joa97, LR94, MN98], support vector machines [Joa98], and neural networks [WPW95, RS99]. Among them, one general group of learning algorithms that has shown good categorization performance is similarity-based. In this group of algorithms, the function φi is designed to return a similarity score as the estimated categorization value. In this research, we adopt and investigate a similarity-based approach to text categorization. This approach can utilize semantic information that might exist implicitly within documents and could be extracted by analyzing statistical information about the document set. In particular, we are interested in exploring the co-location and frequency of terms. We believe that, with a small number of training examples, this approach can give improved categorization results.
In similarity-based text categorization, an important, but unexplored, research area is the thresholding strategy for finding the optimal thresholds for the categories. The major focus of research in the text categorization literature has been on the definition of φi, while the definition of the threshold τi has received far less attention, as finding the optimal threshold τi for a category ci is often seen as a trivial task. This may be true for the rule-based and probability-based approaches: the rule-based approach does not require any threshold value in its classifiers, and the probability-based algorithms have a theoretical basis for analytically determining the optimal threshold value, which is trivially 0.5. In similarity-based text categorization, without a theoretical basis for the analytical determination of the optimal threshold, finding an optimal threshold τi must be done empirically. This becomes much more difficult with a small number of training examples and in cases where there are rare categories with few positive training examples. The importance of the threshold τi is no less than that of the function φi, since the presence of any single unreliable threshold value can degrade overall text categorization performance. This section examines typical existing similarity-based learning algorithms for the function φi and thresholding strategies for the threshold τi.

2.3.1 Similarity-Based Learning Algorithms

Essentially, the function φi for the similarity score of a new document should be designed to give more weight to informative terms than to non-informative ones. The similarity-based algorithms can be further grouped into two main classes: profile-based linear learning algorithms and instance-based lazy algorithms.

2.3.1.1 Profile-Based Linear Algorithms

Some linear algorithms build a generalized profile for each target category in the form of a weighted list of features [LSCP96].
A linear learning algorithm that tries to derive an explicit profile is called a profile-based linear algorithm. Its advantage over the instance-based algorithms, and over other sophisticated algorithms like neural networks, is that such an explicit profile is a more understandable representation for human experts. As a result, they can update the profile according to their preference for terms. In a generalized profile, each feature is associated with a weight computed from a set of training examples. One of the typical profile-based linear learning algorithms is based on Rocchio relevance feedback [Roc71]. In the Rocchio algorithm, each category ci has a vector of the form wi = (y1, y2, ... , yn) and each yk is computed as follows:

yk = [β × ∑dj∈ci xk(dj) × |ci|⁻¹] − [γ × ∑dj∉ci xk(dj) × (|A| − |ci|)⁻¹]

Here, β and γ are adjustment parameters for positive and negative examples, xk(dj) is the weight of the kth feature in a document dj, |ci| is the number of positive documents in the category ci, and |A| is the total number of documents in the training set. The similarity value between a category and a new document is obtained as the inner product of the two corresponding feature vectors. The problem with this classifier is that some informative features in a rare category with a small number of positive documents will have small weights if they appear occasionally in the negative examples. Another popular profile-based linear algorithm is Widrow-Hoff (WH) [WS85]. WH is an on-line algorithm that updates the weight vector using one training example at a time.
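The Rocchio profile construction can be sketched as below. This is an illustrative implementation of the equation above, not code from [Roc71]; the default β and γ values are our own assumption, as the thesis does not fix them here:

```python
def rocchio_profile(vectors, labels, beta=16.0, gamma=4.0):
    """y_k = beta * mean of x_k over positive documents
           - gamma * mean of x_k over negative documents."""
    pos = [v for v, y in zip(vectors, labels) if y]
    neg = [v for v, y in zip(vectors, labels) if not y]
    n = len(vectors[0])
    profile = []
    for k in range(n):
        p = sum(v[k] for v in pos) / len(pos) if pos else 0.0
        q = sum(v[k] for v in neg) / len(neg) if neg else 0.0
        profile.append(beta * p - gamma * q)
    return profile

def rocchio_score(profile, v):
    """Similarity of a new document: inner product with the category profile."""
    return sum(w * x for w, x in zip(profile, v))
```

The weakness noted above is visible here: a feature of a rare category is penalized by the gamma term whenever it also occurs in a few negative documents.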
For each category ci, the new weight yk,j of the kth feature in the vector wj is calculated from the old weight vector wj−1 and the vector vj of a new document dj:

yk,j = yk,j−1 − 2η(wj−1·vj − bj)xk,j

In this equation, wj−1·vj is the cosine value of the two vectors, η is the learning rate parameter, bj is the category label of the new document (1 if the new document is positive and 0 if it is negative), and xk,j is the value of the kth feature in vj. Typically, the initial weight vector w0 is set to all zeros, w0 = (0, … , 0). In a comparative evaluation [LSCP96], WH showed improved performance over Rocchio.

2.3.1.2 Instance-Based Lazy Algorithm

The k-Nearest Neighbor (k-NN) algorithm is an instance-based lazy learning algorithm that operates directly on the training documents. As a result, this algorithm does not involve any pre-learning process for determining a generalized absolute weight for each feature in the profile of a category. The main motivation of the k-NN algorithm is the reasoning that a document itself has more representative power than its generalized profile. To categorize a new incoming document dj under a given category ci, k-NN computes its similarity scores to all documents in the training set A. These resulting similarity scores are then sorted into descending order. The final similarity score, φi(dj), is the sum of the similarity scores of the documents in the category ci among the set of k top-ranking documents, KA:

φi(dj) = ∑dz∈KA SM(dj, dz)·bzi

where bzi = 1 if dz is a positive example for ci and bzi = 0 if dz is a negative example for ci. SM(dj, dz) represents some similarity measure between two documents and is usually computed using the cosine function, defined as:

SM(dj, dz) = ∑k=1,n (xkj × xkz) / √( ∑k=1,n (xkj)² × ∑k=1,n (xkz)² )

where xkj and xkz are the weights of the kth feature in the n-dimensional vectors of dj and dz, respectively. This is the standard form of k-NN, as in [Yan94].
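The k-NN scoring rule above can be sketched directly (our illustrative code, not from [Yan94]): rank all training documents by cosine similarity to the new document and sum the similarities of the positive examples among the top k.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def knn_score(new_vec, train_vecs, train_labels, k):
    """phi_i(d_j): sum of similarities of positive examples among the
    k top-ranking training documents."""
    sims = sorted(((cosine(new_vec, v), y) for v, y in zip(train_vecs, train_labels)),
                  reverse=True)
    return sum(s for s, y in sims[:k] if y)
```

Note that every call ranks the full training set, which is the O(m) categorization cost criticized in the next paragraph.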
While giving good performance in the text categorization literature [Yan99, YX99], k-NN has several drawbacks. In k-NN, it is difficult to find the optimal k value when training with documents that are unevenly distributed across categories. A large value of k may work well with common categories that have many positive documents, but it can be problematic for rare categories that have fewer positive documents than the k value. Also, due to the lack of generalized feature weights, noisy examples that have been wrongly pre-categorized by human experts may have a direct and significant impact on the quality of the ranking. Furthermore, the time complexity of k-NN is O(m), where m is the number of training examples, since it requires a similarity computation against every example. As a result, the processing time needed for categorizing a new document is quite long with a large training set.

2.3.2 Thresholding Strategy

The last step in obtaining a mapping from a new document to relevant categories is achieved by threshold values that are tested against the resulting similarity scores. In similarity-based text categorization, as discussed earlier, the optimal threshold for each category must be derived empirically from labeled documents. Existing common techniques are rank-based thresholding (RCut), proportion-based assignment (PCut), and score-based optimization (SCut). As illustrated in Figure 2.1, RCut is a per-document thresholding strategy that compares the similarity scores of categories for a document. Both SCut and PCut, on the other hand, are per-category strategies that compare the similarity scores of documents within a category. Yang [Yan99, Yan01] has reviewed these techniques and summarized their extensive evaluation on various corpora.

[Figure 2.1 depicts a matrix of similarity scores φi(dj) with categories c1, … , cm as rows and documents d1, … , dn as columns: SCut and PCut operate per row (per-category strategies), while RCut operates per column (per-document strategy).]

Figure 2.1 The per-document and per-category thresholding strategies.

As noted in [Yan01], the thresholding strategy is an important post-processing step in text categorization. It has a significant impact on the performance of classifiers. Finding the optimal thresholding strategy is a difficult task that is heavily influenced by the user's interests, the characteristics of the data set, and the adopted machine learning algorithms. This suggests that combining the strengths of the existing thresholding strategies may be useful. This subsection describes the three common techniques above, in order to identify some desired properties for our new thresholding method.

2.3.2.1 Rank-Based (RCut)

The rank-based thresholding strategy (RCut) is a per-document strategy that sorts the similarity scores of the categories for each document dj. Then, it assigns a "YES" decision to the k top-ranking categories. The threshold k is predetermined automatically by optimizing the global performance on the training set A. The same value of k is applied to all new documents, under the assumption that they belong to the same number of categories. In the multi-class categorization problem, where documents may belong to a variable number of categories, RCut usually gives good micro-averaged performance (defined later in Section 2.5.2), since the selection of k mainly depends on some frequent categories. However, when this globally optimized threshold k is 1 and many rare categories have overlapping concepts with other categories (i.e., they share many identical documents), this strategy may result in low macro-averaged performance (defined later in Section 2.5.2). Furthermore, RCut is not suitable when many documents have no appropriate categories, since it always tries to assign every document to the same number of categories.
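RCut is simple to state in code: sort the per-category scores of one document and take the top k categories. A minimal sketch (our own, with a hypothetical function name):

```python
def rcut_assign(scores, k):
    """Assign YES to the k top-ranking categories for one document.
    `scores` maps category name -> similarity score for this document."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])
```

The rigidity criticized above is visible here: every document receives exactly k categories, even a document that belongs to none.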
2.3.2.2 Score-Based (SCut)

The score-based strategy (SCut) learns an optimal threshold for each category ci. The optimal threshold ts(ci) is the similarity score that optimizes the local performance on category ci. If the local performance of each category is the primary concern and the test documents belong to a variable number of categories, this strategy may be a better choice than RCut. However, it is not trivial to find an optimal threshold for each category. This problem becomes more apparent with a small set of training examples. SCut may give too high or too low thresholds for some rare categories (i.e., overfitting to the training data) and so it can lower the global categorization performance as well as the local performance on the rare categories. For the multi-class text categorization task, the overfitting of SCut to a small number of training examples and to rare categories indicates that we need a new thresholding strategy that, like SCut, allows flexibility in the number of categories assigned to each new document, while also mitigating the problem of unreliable thresholds for rare categories.

2.3.2.3 Proportion-Based Assignment (PCut)

Like SCut, the proportion-based assignment strategy (PCut) is a per-category thresholding strategy. Given a ranked list of similarity scores of all documents for each category ci, PCut assigns a "YES" decision to the t top-ranking test documents. The threshold t(ci) is computed by the rule t(ci) = |A| × Pr(ci) × x, where |A| is the number of documents in the training set A, Pr(ci) is the probability of category ci in A, and x is a real-valued parameter given by the user or predetermined automatically in the same way as k for RCut. PCut assumes that the proportion of positive documents in the training set will be consistent with that in the test set.
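SCut's empirical threshold search can be sketched as an exhaustive scan over candidate scores, picking the one that maximizes a local measure such as F1 (this sketch is ours; the thesis does not fix the local measure at this point):

```python
def local_f1(tp, fp, fn):
    """F1 from one category's counts; 0 when there are no true positives."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def scut_threshold(scores, labels):
    """Pick the score threshold maximizing local F1 on one category's
    training documents. `scores` are similarity scores, `labels` binary."""
    best_t, best_f = 0.0, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        f = local_f1(tp, fp, fn)
        if f > best_f:
            best_t, best_f = t, f
    return best_t
```

With only a handful of training scores per rare category, the candidate set is tiny, which is exactly the overfitting risk described above.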
While performing well in text categorization experiments [Yan99], one of the main disadvantages of PCut is that it cannot be used for on-line text categorization, since it can be applied only when there is a pool of new documents to be categorized. This weakness is the main reason why PCut has not been applied in many practical text categorization systems, in which a delayed system response to new documents is unacceptable. For example, it is unsuitable for e-mail categorization systems.

2.4 Active Learning

In most supervised learning tasks, the more training examples we provide to the system, the better it performs. Typical experiments in the text categorization literature show that a system usually needs several thousand human-labeled training examples to achieve reasonable text categorization performance. However, given human resource limitations, obtaining such a large number of training examples is expensive. This problem suggests an active learning approach that controls the process of sampling training examples, in order to achieve a given level of performance with fewer training examples. Active learning refers to any learning method that actively participates in the collection of the training examples on which the system is trained [CAL94]. The main idea of active learning is to achieve a given level of system performance with fewer training examples. This is achieved without relying on random sampling. Instead, the system constructs new examples or selects a small number of optimally informative examples for the user to classify. Generating artificial new examples is one type of active learning approach, used in [Ang87]. However, most work in text categorization has focused on a selective sampling approach [CAL94, LC94, LT97], in which the system selects the most informative examples for labeling from a large set of unlabeled examples. In this section, we investigate two typical approaches to selective sampling: uncertainty sampling and committee-based sampling.
2.4.1 Uncertainty Sampling

Lewis and Gale [LG94] proposed the uncertainty sampling method for active learning. Its effectiveness has been demonstrated in text categorization [LG94, LC94] and other text learning tasks [TCM99]. Uncertainty sampling selects unlabeled examples for labeling based on the level of uncertainty about their correct category. A text categorization system using uncertainty sampling examines unlabeled documents and computes uncertainty values for the predicted category membership of all examples. Then, the examples with the largest uncertainties are selected as the most informative ones for training and are presented to human experts for labeling. The uncertainty of an example is typically estimated by comparing its numeric similarity score with the threshold of the category. The most uncertain (informative) example is the one whose score is closest to the threshold. Figure 2.2 shows the pseudo-code for the uncertainty sampling algorithm.

• Create an initial knowledge base (a set of classifiers)
• UNTIL "there are no more unlabeled examples" OR "human experts are unwilling to label more examples":
  (1) Apply the current knowledge base to each unlabeled example
  (2) Find the k examples with the highest uncertainty values
  (3) Have human experts label these k examples
  (4) Train the new knowledge base on all labeled examples to this point

Figure 2.2 Uncertainty sampling algorithm.

The knowledge base (a set of classifiers) used for estimating the uncertainties of unlabeled examples can be of the same type (homogeneous) or of a different type (heterogeneous) from that used for the categorization of new documents. Even though a heterogeneous approach to uncertainty sampling incurs an additional construction cost, it is essential when the existing classifiers are too computationally expensive to use for uncertainty sampling of training examples.
On the other hand, a homogeneous approach may be preferred to a heterogeneous one if the existing classifiers are efficient enough to build and run for the uncertainty sampling method. For our main goal in this thesis, another important issue in the uncertainty sampling of candidate training examples is whether or not the text categorization system has a mechanism for delimiting the boundaries of the threshold regions to which the uncertain and certain examples should belong. With such a mechanism, the system can use the positive-certain examples for training without asking the human experts for their correct category. Also, the system can stop the sampling process for uncertain examples when it believes that no more uncertain examples are available for a given category. Such a mechanism may save human experts from labeling more examples after the system has learnt an accurate category concept. Also, when the input stream of unlabeled examples is infinite, this mechanism might be effective, since the system can choose only the examples whose numeric scores fall within the defined threshold region.

2.4.2 Committee-Based Sampling

The other type of selective sampling method is committee-based sampling [DE95, LT97]. In this method, diverse committee members are created from a given training data set. Each committee member is an implementation of a different machine learning algorithm. Each committee member is then asked to predict the labels of examples. The system selects informative examples based on the degree of disagreement among the committee members and presents them to a human for labeling. The most informative example is the one on which the committee members' predictions disagree most. Again, the same issues described for uncertainty sampling are critical and have yet to be explored.

2.5 Evaluation Methods

We need evaluation methods to compare various text classifiers.
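One common way to quantify the committee's disagreement on an example is the entropy of the members' vote distribution, a measure sometimes called vote entropy in the selective sampling literature. The sketch below is illustrative only; the thesis does not commit to a particular disagreement measure here:

```python
import math

def vote_entropy(votes):
    """Disagreement on one example: entropy of the label distribution
    over the committee members' predictions."""
    n = len(votes)
    ent = 0.0
    for label in set(votes):
        p = votes.count(label) / n
        ent -= p * math.log(p)
    return ent

def select_by_disagreement(committee_votes, k):
    """committee_votes: example id -> list of labels predicted by members.
    Return the k examples with the highest disagreement."""
    ranked = sorted(committee_votes,
                    key=lambda ex: vote_entropy(committee_votes[ex]),
                    reverse=True)
    return ranked[:k]
```

An example on which all members agree has zero entropy and is never selected ahead of a contested one.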
Evaluation of a classifier can be conducted by measuring its efficiency and its effectiveness. Efficiency refers to the ability of a classifier to run fast and is typically measured by elapsed processor time. Elapsed processor time is already well defined, so it need not be explained in this thesis. The efficiency of a classifier can usually be measured along two dimensions: learning efficiency (i.e., the time a machine learning algorithm takes to generate a classifier from a set of training examples) and categorization efficiency (i.e., the time the classifier takes to assign appropriate categories to a new document). Because of the unstable nature of the parameters on which the evaluation depends, efficiency is rarely used as the sole performance measure in text categorization. However, efficiency is important for the practical application of a system. A much more common evaluation criterion for text categorization systems is effectiveness: this refers to the ability to make the right decisions in the categorization of new incoming documents. There are several commonly used performance measures of effectiveness. However, there is no agreement on one single measure for use in all applications. Indeed, which measure is preferable depends on the characteristics of the test data set and on the user's interests. The absence of one optimal measure of effectiveness makes it very difficult to compare the relative effectiveness of classifiers. In the next section, we will examine various performance measures of effectiveness that have been widely used for the evaluation of text categorization systems. Then, we will turn to two further issues in the evaluation of text categorization systems: averaging the performance values of all categories to obtain a representative single value of the system's performance, and splitting an initial corpus into training and test data sets.
2.5.1 Performance Measures of Effectiveness

While a number of different conventional performance measures are available for the effectiveness evaluation of text categorization, almost all of them are defined in terms of the same 2×2 contingency table model, constructed as shown in Table 2.1. In this table, 'YES' and 'NO' represent the binary decision given to each document dj under category ci. Each entry in the table indicates the number of documents of the specified type:
• TPi: the number of true positive documents that the system predicted were YES, and were in fact in the category ci.
• FPi: the number of false positive documents that the system predicted were YES, but actually were not in the category ci.
• FNi: the number of false negative documents that the system predicted were NO, but were in fact in the category ci.
• TNi: the number of true negative documents that the system predicted were NO, and actually were not in the category ci.
Note that the larger the TPi and TNi values (or the smaller the FPi and FNi values), the more effective the classifier for ci is.

Table 2.1 Contingency table for category ci.

                               label by human expert
  label by the system     YES is correct    NO is correct
  predicted YES                TPi               FPi
  predicted NO                 FNi               TNi

Given such a two-way contingency table, most conventional performance measures compute a single value from the four values in the table. The standard performance measures in classic information retrieval research are recall and precision, which have also been frequently adopted for the evaluation of text categorization. These measures are computed as follows:

• Recall = TPi / (TPi + FNi)        if TPi + FNi > 0
• Precision = TPi / (TPi + FPi)     if TPi + FPi > 0

Recall measures the proportion of documents that are predicted to be YES and are correct, against all documents that are actually correct.
Precision, in turn, is the proportion of documents which are both predicted to be YES and actually correct, against all documents that are predicted YES. In general, the higher the precision, the lower the recall, and vice versa. For example, we can achieve very high precision by rarely predicting 'YES' (i.e., by setting a very high threshold value) or very high recall by rarely predicting 'NO' (i.e., by setting a very low threshold value). For this reason, they are seldom used alone as a sole measure of effectiveness. Instead, it is common in the literature to report the associated pair of recall and precision values at each level. Other performance measures that are purely based on the contingency table are accuracy and error. They are defined as follows:

• Accuracy = (TPi + TNi) / |D|      where |D| = TPi + FPi + FNi + TNi > 0
• Error = (FPi + FNi) / |D|         where |D| = TPi + FPi + FNi + TNi > 0

While commonly used as performance measures in the machine learning literature, accuracy and error are not frequently used in text categorization. Their low popularity in text categorization may be explained by their definitions. Accuracy and error are defined as the proportion of documents that are correctly predicted and the proportion of documents that are wrongly predicted, respectively. Both measures have |D|, the total number of documents, in their denominator. As criticized in [Yan99], a large value of |D| makes accuracy insensitive to a small change in the value of TP (true positive) or TN (true negative). Likewise, variations in the value of FP (false positive) or FN (false negative) have a tiny impact on the value of error. Also, for rare categories that have a small number of positive documents assigned, a trivial rejecter (i.e., a classifier that rejects every document for a category) may give much better performance (i.e., a larger value of accuracy and a smaller value of error) than non-trivial classifiers.
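The measures defined so far all derive from the four contingency counts; a minimal sketch of their computation (ours, for illustration):

```python
def contingency(predicted, actual):
    """Counts (TP, FP, FN, TN) for one category from parallel lists
    of binary YES/NO decisions."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    return tp, fp, fn, tn

def recall(tp, fp, fn, tn):
    return tp / (tp + fn) if tp + fn else 0.0

def precision(tp, fp, fn, tn):
    return tp / (tp + fp) if tp + fp else 0.0

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)
```

The trivial-rejecter critique is easy to reproduce: with one positive among a thousand documents, predicting NO everywhere yields accuracy 0.999 but recall 0.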
As a consequence, with a large data set and a low average probability of each document belonging to a given category, these two measures are not sensitive. As briefly discussed earlier, neither recall nor precision makes sense in isolation from the other, because of the tradeoff between them. In practice, a trivial classifier that gives a YES decision to every document-category pair (dj, ci) will have perfect recall (i.e., Recall = 1) but an extremely low value of precision. Most text categorization systems that have been adjusted to have high recall will sacrifice precision, and vice versa. Usually, users want the system to have both high recall and high precision. However, it is generally difficult to choose between two systems where one has higher recall and the other has higher precision. For this reason, it is usually preferable to evaluate a system's effectiveness by using a measure that combines recall and precision. Among the various combined measures, the break-even point and Fβ are the most frequently used in text categorization. They are defined as follows:

• Break-even point (BEP) is the value at which Recall = Precision
• Fβ = (β² + 1) × Recall × Precision / (β² × Precision + Recall)     where 0 ≤ β ≤ ∞

The value of BEP is the value of precision that is tuned to be equal to recall; it is computed by repeatedly varying the thresholds of a given category to plot precision as a function of recall. If there are no values of precision and recall that are exactly equal, the interpolated BEP is computed by averaging the closest values of precision and recall. Yang [Yan99] noted that interpolated BEP may not be a reliable effectiveness measure when no values of precision and recall are close enough. Also, as described in [Seb02], Lewis, who proposed the break-even point in [Lew92a, Lew92b], noted that it is unclear whether a system that achieves a high value of BEP also obtains high scores on other performance measures.
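The Fβ formula above translates directly into code (an illustrative helper of our own):

```python
def f_beta(recall, precision, beta=1.0):
    """F_beta = (beta^2 + 1) * P * R / (beta^2 * P + R); 0 when undefined."""
    denom = beta * beta * precision + recall
    return (beta * beta + 1) * precision * recall / denom if denom else 0.0
```

Setting beta = 0 recovers precision (recall is ignored), and large beta values approach recall, matching the weighting behavior described in the next paragraph.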
Fβ, which was first proposed in [Rij79], is another common choice in text categorization; it is a function of recall, precision, and a parameter β. This parameter allows differential weighting of the importance of recall and precision. The value of β can range from 0 to ∞ (infinity): β = 0 means that the system ignores recall, whereas β = ∞ means that it ignores precision. Usually, β = 1 (i.e., recall and precision are viewed as having equal importance) is used for this measure:

• F1 = (2 × Recall × Precision) / (Recall + Precision)

Note that, when Recall = Precision, F1 has the same value as recall (or precision); hence BEP is always less than or equal to the optimal value of F1. Given this, and the unreliable character of the interpolated BEP [Yan99], we use F1 as the effectiveness measure in this research, since balanced recall and precision is our main concern in our comparative analysis of classifiers.

2.5.2 Micro and Macro Averaging

For a given category set C = {c1, c2, … , c|C|} and a document set D, binary assignment decisions generate a total of |C| contingency tables. An important issue which arises is how to compute a single value of effectiveness by averaging them. Two different averaging methods are available: macro-averaging and micro-averaging. Macro-averaging first computes effectiveness measures locally for each contingency table, and then computes a single value by averaging over all the resulting local measures. For instance, macro-averaged recall, MA-Recall, is computed as follows:

• MA-Recall = (∑i=1,|C| Recalli) / |C|, where Recalli = TPi / (TPi + FNi) and TPi + FNi > 0

Macro-averaging can be viewed as a per-category averaging method that gives equal weight to each category. So, a macro-averaged measure is a good indicator of the ability of classifiers to work well on rare categories which have a small number of positive documents.
Micro-averaging considers all binary decisions as a single global group by constructing a global contingency table, and then computes a single effectiveness measure by summing over all individual decisions. For example, micro-averaged recall, MI-Recall, is computed from the global contingency table in Table 2.2:

• MI-Recall = (∑i=1,|C| TPi) / (∑i=1,|C| (TPi + FNi))

Micro-averaging gives equal weight to every individual decision (dj, ci). So, it can be considered a per-document averaging method. Whether macro-averaging or micro-averaging is more informative obviously depends on the purpose of the categorization and the characteristics of the test data set. However, micro-averaging seems to be the preferred averaging method in the literature [Seb02]. In this thesis, we will use both macro-averaging and micro-averaging for the evaluation of classifiers. Comparing these two measures gives an indication of the impact of the rare categories that could be hidden in micro-averaging.

Table 2.2 Global contingency table for category set C = {c1, c2, … , c|C|}.

                              label by human expert
  label by the system         YES is correct        NO is correct
  predicted YES               ∑i=1,|C| TPi          ∑i=1,|C| FPi
  predicted NO                ∑i=1,|C| FNi          ∑i=1,|C| TNi

2.5.3 Data Splitting

Evaluation of classifiers in machine learning approaches needs a test set which is different from the training set used for classifier construction. In principle, the documents in a test set cannot be involved in the learning process of the classifiers on which they will be tested. Otherwise, evaluation results would likely be too good to be achieved realistically. So, before classifier construction, an initial set of documents, H = {d1, d2, … , d|H|} ⊂ D, should be split into two disjoint sets (a training set and a test set). Two standard splitting methods are train-and-test and k-fold cross-validation.
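The contrast between the two averaging methods can be sketched with invented per-category counts (one frequent category and one rare one); the computation follows the MA-Recall and MI-Recall definitions above.

```python
# Sketch of macro- vs micro-averaged recall over per-category contingency
# tables. The (TP_i, FN_i) counts are invented for illustration.

tables = [(90, 10),   # c1: 100 positives, 90 retrieved
          (1, 9)]     # c2: 10 positives, only 1 retrieved

# MA-Recall: average of the per-category recalls (equal weight per category).
ma_recall = sum(tp / (tp + fn) for tp, fn in tables) / len(tables)

# MI-Recall: recall of the pooled (global) contingency table.
mi_recall = sum(tp for tp, _ in tables) / sum(tp + fn for tp, fn in tables)
```

Macro-averaging (0.5 here) exposes the poor recall on the rare category, while micro-averaging (91/110 ≈ 0.83) is dominated by the frequent one — exactly the hidden-rare-category effect mentioned above.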
In the train-and-test approach, an initial set H is partitioned into a training set A = {d1, … , d|A|} and a test set E = {d|A|+1, … , d|H|}. These splits are, in most cases, constructed by random sampling, or sometimes based on real-world data flow (i.e., earlier documents for the training set and later ones for the test set) if documents carry a time stamp, as electronic mail does. This approach can give biased evaluation results since it cannot select from all the documents for training. However, the train-and-test approach has been frequently used in text categorization since it makes experiments much faster and makes comparisons amongst systems easier. Some data sets have predefined splits, which makes it easier for different researchers to compare their text categorization results.

An alternative to the train-and-test approach is k-fold cross-validation. It has the advantage of minimizing variations due to the biased sampling of training data. This splitting approach partitions an initial data set H into k different sets (E1, E2, … , Ek) in which positive and negative documents for each category are equally distributed. Then, k experiments are run by applying the train-and-test approach to the k train-test pairs (Ai = H − Ei, Ei), and the final performance result is computed by averaging the k runs. For a data set with no predefined splits, we will use a cross-validation approach for data splitting.

One consideration is the construction of n small subsets from a given large training set A. This is an important issue for our work since we have to examine whether a classifier works well with few training examples. What we have to do is further split a training set A into n disjoint sets (S1, S2, …, Sn), not necessarily of equal size, and then construct n new training sets (SA1, SA2, …, SAn) where SA1 = S1 and each SAi = SAi−1 + Si.
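The nested training sets can be sketched as follows; the toy document IDs and slice boundaries are our own, not from the thesis.

```python
# Sketch of the nested training sets SA_1, SA_2, ..., SA_n built from disjoint
# slices S_1 ... S_n of a training set A, as described above.

A = [f"d{i}" for i in range(1, 11)]        # a toy training set of 10 documents
slices = [A[0:2], A[2:5], A[5:10]]         # S_1, S_2, S_3 (not of equal size)

nested = []                                 # SA_1, SA_2, SA_3
current = []
for s in slices:
    current = current + s                   # SA_i = SA_{i-1} + S_i
    nested.append(current)
```

Each experiment i then trains on nested[i] against the fixed test set E, showing how performance grows as more labeled examples become available.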
The effectiveness of a classifier with a small number of training examples can then be assessed from the results of n experiments conducted on the n train-test pairs (SAi, E). As discussed earlier, each Si can be built in a random manner. However, random sampling here can cause the problem of an extremely uneven distribution of positive documents across categories: the documents of a few frequent categories (i.e., categories with a large number of positive documents) would likely dominate Si. We view this problem as a major factor causing biased experimental results. So, we use semi-random sampling, which selects documents randomly but controls the distribution of positive documents for each category. As an example, consider a two-class (c1 and c2) text categorization problem where the number of training documents |A| = 1,000, the number of positive documents for category c1 is 900, and the number of positive documents for category c2 is 100. When S1 should contain 10% (100) of |A|, semi-random sampling selects the same fraction of positive documents from each category (i.e., 90 from c1 and 10 from c2). In this thesis, we will use semi-random sampling unless otherwise stated.

Some classifiers have internal parameters, such as thresholds, that should be optimized empirically on a set of training documents. For parameter optimization, it is often the case that a training set A is further partitioned into two different sets, namely a training set R = {d1, … , d|R|} and a validation set V = {d|R|+1, … , d|A|}. This validation set V must be kept separate from the test set E and must be used only for parameter tuning. One question in the construction of the validation set is: of the documents in an initial training set A, what fraction should the validation set V contain? With a small set size of A, this question may be much more difficult (sometimes impossible) to answer, since the number of documents in A may not be enough even for training itself.
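The semi-random sampling described above can be sketched as follows; the category sizes mirror the worked example (900 for c1, 100 for c2), and the function name is ours.

```python
# Hedged sketch of semi-random sampling: draw the same fraction of positive
# documents from each category, so frequent categories cannot dominate.

import random

def semi_random_sample(by_category, fraction, seed=0):
    rng = random.Random(seed)
    sample = []
    for docs in by_category.values():
        k = round(len(docs) * fraction)      # same fraction per category
        sample.extend(rng.sample(docs, k))   # random, but distribution-controlled
    return sample

by_category = {"c1": [f"c1_d{i}" for i in range(900)],
               "c2": [f"c2_d{i}" for i in range(100)]}
s1 = semi_random_sample(by_category, 0.10)   # 90 from c1 and 10 from c2
```

A real multi-label corpus would need extra care, since a document can be positive for several categories at once; the sketch assumes disjoint category membership as in the two-class example.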
Note that, in our experimental settings, the training set SAi will be very small compared to the given initial training set A, and it will not be practical to divide SAi into two subsets. In this thesis, we do not make separate validation sets, but use the training set A for the parameter optimization.

Chapter 3 Keyword Association Network (KAN)

In this chapter, we describe our Keyword Association Network (KAN) approach to the definition of a function φi that returns the estimated categorization values of new documents for a specific category ci. We discuss the motivation of KAN, its overall approach to text categorization, and its computational complexity.

3.1 Objectives and Motivation

The main goal of this research project is to develop text categorization methods that work well with a small number of training examples. To this end, we propose a learning method to define the function φ in a classifier, and its motivation is to achieve the following two important objectives:

1. to give a feature the appropriate weight according to its semantic meaning in a given document, and
2. to remove the influence of irrelevant features by giving more weight to the discriminating (informative) features.

In text categorization, many words appear frequently in documents of several different categories. These features are problematic for the similarity computation between a classifier and a new document since they are more ambiguous than other words that mainly appear in a single category. We have noticed that the ambiguity these features may have can be characterized into two types: semantic and informative ambiguities.

[Figure 3.1 here: profiles (positive documents) of the categories computer and farm, and a new document dj containing "apple", "grape", and "computer", which a human expert assigned to the category farm. In this example, "apple" has semantic ambiguity and "computer" has informative ambiguity.]
Figure 3.1 An example of the similarity computation with semantic and informative ambiguities.

Semantic ambiguity is well defined in the information retrieval literature; it refers to differences in meaning. As an example, consider the feature "apple" in Figure 3.1, which conveys different meanings in the two categories computer and farm. A good text categorization system should be able to differentiate its meanings in each category, i.e., "apple" is a kind of fruit in the category farm and a kind of computer system in the category computer. Suppose we want to perform a similarity computation between two objects: a new document dj that should be categorized under the category farm, and a profile (or the positive documents, for instance-based learning algorithms) of the category computer. With conventional learning algorithms that compute the similarity score based on a single-term indexing approach like the vector space model, the feature "apple" may have a large weight if it has large weights in both objects and, as a result, this document may be incorrectly considered computer-related (i.e., φcomputer(dj) > φfarm(dj)). Human experts can differentiate the meaning of the feature "apple" in the document dj from the same word in the category computer by looking at other words in both objects. This suggests that a promising approach to this problem is to exploit co-occurrence information of "apple" with other features in computing its weight. This approach can reduce the weight of "apple" in the similarity computation since pairs of words in the document dj are less likely to appear in the category computer.

The more frequent type of ambiguity in text categorization is informative ambiguity. This refers to differences in importance. In fact, informative disambiguation is the aim of most machine learning algorithms applied to text categorization.
However, since the same documents may belong to more than one category and many features may occur frequently in several different categories, this aim is a difficult one to achieve for most learning algorithms that are based on a single-term indexing approach. For example, since the current farming industry uses computer technology, many new documents that should be assigned to the category farm may contain computer-related terms like the feature "computer" shown in Figure 3.1. Obviously, these computer-related terms may have minor importance in the documents of the category farm. As in the example of semantic ambiguity, if we adopt conventional learning algorithms based on single-term indexing, similarity scores between the documents in the category farm and the category computer may be large and, as a result, some of the documents in the category farm will be categorized under the category computer. Again, word co-occurrence statistics are a potentially effective source of information for resolving this informative ambiguity, since co-occurrences of computer-related and farm-related words may be low in the category computer.

Another challenge in resolving the informative ambiguities of features is to use their discriminating power (a numeric value) in assessing the similarity scores of new documents. In text categorization, some categories may have one or two discriminating features, and identifying the existence of those features in new documents is enough for the categorization task. Using the same feature set size across all categories and giving all features an equal opportunity to participate in a similarity computation could be a major cause of low system performance. A number of discriminating power functions have been investigated and applied to feature selection as a preprocessing step that selects a subset of features [KS96, MG99].
Using the discriminating power of each feature in the similarity computation is a promising approach to eliminating the impact of irrelevant features and, as a result, to significant performance improvements. The basis of our approach is that using word co-occurrence information and the discriminating power values of features in a similarity computation could achieve semantic and informative disambiguation and, as a result, lead to improved text categorization performance with a small number of training examples. Another important part of the motivation for this approach is that word pairs could be more understandable to users than single words. Pazzani's usability study [Paz00] showed that people prefer word-pair explanations for categorization. In our approach, the automatically extracted word pairs used for the categorization of new documents can be used to increase user acceptance of the learned profiles of categories.

3.2 Overall Approach

In similarity-based text categorization, each raw document is represented by its extracted features and their associated weights, based on a single-term indexing scheme like TFIDF described in Section 2.2.2. From the representations of training documents, most conventional machine learning algorithms construct the function φ based only on the matching single features they have in common [Seb02], and the function φ then computes the similarity scores of new documents, again based only on the matching single features. KAN is a new machine learning approach to the construction of the function φ that exploits word co-occurrence information extracted from a set of training documents, together with the discriminating power value of each feature, in the similarity computation, in order to resolve both semantic and informative ambiguities.
KAN consists of the construction of a network of co-occurring features from a collection of documents, the definition of a relationship measure between two features, and the definition of the discriminating power of each feature in a given category. In this section, we discuss these three basic parts of KAN and how it can be used for text categorization.

3.2.1 Constructing KAN

Previous work showed that it is possible to automatically find words that are semantically similar to a given word, based on the co-occurrence of words [Rug92, SCAT92]. Such word co-occurrence information has been used for various text learning tasks including semantic feature indexing [DDFL+90], automatic thesaurus generation [CY92], automatic query expansion [XC00], and text mining tasks [KMRT+94, SSC97]. KAN, too, is constructed as a network representation based on word co-occurrences in the training document set. Let us assume that there is a set of n unique features F = {w1, w2, ... , wn} and a set of m documents D = {d1, d2, ... , dm}. Here, each dj = (w1, w2, ... , wk) is a nonempty subset of F. The construction of KAN for a given category ci is based on the document frequency (DF) of two co-occurring features wi and wj, DF(wi,wj | ci), which is defined as follows:

• DF(wi,wj | ci) is the number of positive documents in the category ci that contain both features wi and wj.

The problem of building the network is to find pairs of features that satisfy a user-specified minimum document frequency (minDF). As an example, suppose that we have a set of positive documents in a particular category, information technology, as in Figure 3.2. In this example, through the preprocessing steps (a stemming algorithm and a stop-list, as mentioned in Section 2.2), we obtain a set of distinct features F that can be considered informative.
With the given set of documents D, set of unique features F, and user-specified minDF (in this example, 2), the algorithm in Figure 3.3 finds the frequent 2-feature sets, F2: pairs of features that occur together in the set of documents and satisfy the given minimum document frequency. Note that the feature "file" occurs in just one document, so it is not considered at this point since it does not satisfy the given minDF. CF2 is the set of candidate 2-feature sets generated from the frequent 1-feature sets F1. Figure 3.4 shows the resulting network representation for the above example. Each node presents a feature and its document frequency in the given category, and the integer on the edge between two nodes shows the document frequency of the pair of features. In the calculation of a similarity score between this category and a new document, if the document contains both features "apple" and "computer", they are considered informative according to this network, and their relationship measure (explained in the next section) will be increased in the similarity computation, as explained in Section 3.2.4.

set of unique features: F = {apple, windows, computer, web, www, file}

set of documents (D):
d1 = (apple, windows, computer)
d2 = (windows, computer)
d3 = (apple, computer, web, www)
d4 = (file, web, www)

{F1}                 {CF2}                       {F2}
feature set   DF     feature set          DF     feature set          DF
apple         2      apple, windows       1      apple, computer      2
windows       2      apple, computer      2      windows, computer    2
computer      3      apple, web           1      web, www             2
web           2      apple, www           1
www           2      windows, computer    2
                     windows, web         0
                     windows, www         0
                     computer, web        1
                     computer, www        1
                     web, www             2

Figure 3.2 An example for constructing KAN. F1: frequent 1-feature sets; CF2: candidate 2-feature sets; F2: frequent 2-feature sets.

// Remove uncommon features that do not satisfy the user-specified
// minimum document frequency (minDF).
for all d ∈ D do
    for all wi ∈ F do
        if wi exists in d then wi.count++
F1 = { wi ∈ F | wi.count ≥ minDF }

// Generate candidate feature pairs from F1 = {w1, w2, … , wk}.
CF2 = ∅
for (i = 1; i < k; i++) do
    CF2 = CF2 ∪ {(wi, wi+1), (wi, wi+2), … , (wi, wk)}

// Keep the feature pairs that satisfy the required document frequency (minDF).
for all d ∈ D do
    for all cfi ∈ CF2 do
        if both words in cfi exist in d then cfi.count++
F2 = { cfi ∈ CF2 | cfi.count ≥ minDF }

Figure 3.3 Algorithm for generating frequent 2-feature sets F2.

[Figure 3.4 here: the network generated from the example, with nodes apple(2), windows(2), computer(3), web(2), and www(2), and edges apple–computer (2), windows–computer (2), and web–www (2).]

Figure 3.4 Network representation generated from the example.

3.2.2 Relationship Measure

The degree of relationship between two features is represented by a confidence value which we refer to as CONF(wi,wj): our confidence that wi is related to wj. This measure was used to find association rules in [AMST+96], which have been identified as an important tool for knowledge discovery in huge transactional databases. The notions of co-occurrence and frequency in this research are exactly the same as the 2-frequent itemsets and support of association rules. In the network structure built in the previous section, the confidence value measures how the presence of one feature in a given document may influence the presence of another feature in a particular category. When a category ci has a set of n unique features F = {w1, w2, ... , wn}, the ith feature's confidence value to the jth feature, CONF(wi,wj), is defined as follows:

• CONF(wi,wj) = DF(wi,wj | ci) / DF(wi | ci) …. Formula 3.1

where DF(wi,wj | ci) is the document frequency of the two co-occurring features wi and wj in the category ci, and DF(wi | ci) is the document frequency of the feature wi in the category ci. Note that this measure is asymmetric, i.e., CONF(wj,wi) has a different value from CONF(wi,wj) due to the different denominator. A high confidence of wi to wj is interpreted as an indicator that the semantic meaning and importance of the feature wi can be determined by the existence of the feature wj.
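The construction algorithm and Formula 3.1 can be sketched as follows; this is our own minimal reimplementation using the Figure 3.2 example, not the thesis's code.

```python
# Minimal sketch of KAN construction (frequent 1- and 2-feature sets under a
# minimum document frequency) and the confidence measure, using the example
# documents of Figure 3.2.

from itertools import combinations

docs = [{"apple", "windows", "computer"},
        {"windows", "computer"},
        {"apple", "computer", "web", "www"},
        {"file", "web", "www"}]
min_df = 2

df1 = {}                                    # document frequency of each feature
for d in docs:
    for w in d:
        df1[w] = df1.get(w, 0) + 1
f1 = sorted(w for w, c in df1.items() if c >= min_df)   # frequent 1-feature sets

df2 = {}                                    # document frequency of candidate pairs
for pair in combinations(f1, 2):            # CF2, generated from F1
    df2[pair] = sum(1 for d in docs if pair[0] in d and pair[1] in d)
f2 = {p: c for p, c in df2.items() if c >= min_df}      # frequent 2-feature sets

def conf(wi, wj):
    """CONF(wi, wj) = DF(wi, wj) / DF(wi) -- asymmetric, as in Formula 3.1."""
    return df2.get(tuple(sorted((wi, wj))), 0) / df1[wi]
```

Here f2 recovers exactly the three edges of Figure 3.4, and conf("apple", "computer") = 1.0 while conf("computer", "apple") = 2/3, illustrating the asymmetry of the measure.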
The problem of finding large 2-feature sets can be extended to finding all frequent 2-feature sets that also satisfy a user-defined minimum confidence (minCONF). Based on minCONF, some weak relationships in the network representation could be filtered out due to their low confidence. In this research, we do not perform this filtering of weak relationships, since the optimal value of minCONF cannot be found easily and such weak relationships should have only a minor impact on KAN's performance, because KAN is designed to keep their impact small, as explained in the next sections.

3.2.3 Discriminating Power Function

The discriminating power of each feature is an important factor in resolving the informative ambiguities of features that occur in several different categories. As a result, it is critical for achieving high performance in text categorization. KAN integrates it into the similarity computation between a new document and a category. In KAN, we apply a discriminating power function similar to the scheme used in [SK00]. Let m be the number of categories. For the jth feature, wj, we prepare the weight vector hj = (df1,j, df2,j, ... , dfm,j). Here, each dfi,j is computed as follows:

• dfi,j = DF(wj | ci) / DF(wj) …. Formula 3.2

where DF(wj | ci) is the document frequency of wj in the ith category ci and DF(wj) is the document frequency of wj over all documents in all categories. Shankar in [SK00] computed the discriminating power value of the jth feature, Pj, as follows:

• Pj = ∑i=1,m (ĥi,j)² …. Formula 3.3

where ĥj is the one-norm scaled vector of hj, so each ĥi,j is:

• ĥi,j = dfi,j / ∑i=1,m dfi,j …. Formula 3.4

Pj has the lowest value, 1/m, if a feature wj is evenly distributed across all the categories, and the largest value, 1, if it appears in only one category. The same Pj is then used for all categories to adjust the weight of the feature wj.
One drawback of this scheme is that it sharply reduces the weights of some informative features in several categories that share an overlapping concept (i.e., features that appear frequently in several categories, and are informative in all of them, will have a small Pj). To overcome this drawback of Pj, we use dfi,j, defined in Formula 3.2, as the discriminating power of the jth feature wj in the ith category ci. In order to further reduce the impact of irrelevant features, dfi,j is transformed into Si,j as follows:

• Si,j = dfi,j^(λ / dfi,j) …. Formula 3.5

where the range of λ is 0 < λ < 1. The graph of S against df has an S shape about the point λ, as shown in Figure 3.5. The figure depicts the S value associated with each value of df when λ takes three different values: 0.35, 0.50, and 0.65. It shows that df values below λ are further penalized. To address the problem of setting λ for a particular text categorization task, consider a 2-category categorization problem where |C| = 2, like a spam-mail filtering task [AKCS00, SDHH98, CM01]. In this case, the categorization must split new e-mails into two disjoint categories, junk and non-junk. Since only two categories are involved, a feature having a df of 0.5 cannot be considered informative. Such a feature should have a much smaller S value than 0.5 and, as a result, λ should be greater than 0.5 (i.e., 0.5 < λ < 1). The value of λ should be lower as the number of predefined categories increases and higher as it decreases. In this research, we set λ = 0.35 because of the large number of predefined categories in our data sets.
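The transform in Formula 3.5 can be sketched numerically; this is our reading of the formula, with λ = 0.35 as used in this thesis.

```python
# Sketch of the discriminating-power transform S = df ** (lam / df):
# the curve passes through the fixed point (lam, lam), boosting df values
# above lam and pushing those below it further down.

def s_value(df, lam=0.35):
    return df ** (lam / df)

at_lam = s_value(0.35)       # equals 0.35: the fixed point of the transform
weak = s_value(0.10)         # well below lam: heavily penalized
strong = s_value(0.80)       # above lam: mildly boosted
```

This matches the S-shaped behaviour described above: features concentrated in the category (high df) keep or gain weight, while features spread thinly (low df) are suppressed.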
[Figure 3.5 here: graphs of S against df for λ = 0.35, 0.50, and 0.65; each curve passes through the point (λ, λ), boosting df values above λ and penalizing those below it.]

Figure 3.5 Graphs of S against df, when λ = 0.35, 0.50, and 0.65.

3.2.4 Applying KAN to Text Categorization

Unlike the profile-based linear algorithms discussed in Section 2.3.1.1, KAN does not fix the feature weights during the learning process. The absolute weight of each feature in a particular category is determined on the fly on seeing a new document. This approach, we believe, helps give appropriate weights to those features that occur frequently in several different categories, according to their semantic meaning and importance in the given document. Suppose a new document g, having the vector form (z1, z2, … , zn), becomes available. Then we calculate the weight vector (y1, y2, ... , yn) for a category ci as follows:

• yk = [∑dj∈ci xk(dj) / |ci|] × [1 + Sk × ∑l=1,n; l≠k (CONF(wl,wk) × bl)] …. Formula 3.6

where
xk(dj): the weight of the kth feature in a training example dj
|A|: the number of documents in the training set A
|ci|: the number of positive documents in the category ci
Sk: the discriminating power of the kth feature in the category ci
CONF(wl,wk): the confidence measure of the lth feature wl to the kth feature wk
bl: the existence label of the lth feature in the new document g: 1 if the feature appears in the new document and 0 if it does not.

Then, the similarity score between the category ci and the new document g is the dot product of the two vectors:

• φi(g) = ∑k=1,n (yk × zk) …. Formula 3.7

One interesting property of this automated text categorization system, though not empirically tested in this research, is its potential ability to handle annotated training documents that contain informative terms indicated by human experts as well as their category information. In KAN, such manually indicated terms, declared to be informative in a given category, can be handled through the construction of a set of discriminating features [LKKR02]. Let R be the set of features that satisfy the following condition:

• wk ∈ R if (Sk ≥ minS) or (wk is indicated by human experts in annotated documents)

where minS is the user-defined minimum discriminating power value. Then, when a new document is available, the weight yk of the kth feature wk in a category ci can be computed as follows:

• yk = [∑dj∈ci xk(dj) / |ci|] × [1 + δ] …. Formula 3.8

where δ = Sk × ∑l=1,n; l≠k (CONF(wl,wk) × bl) if wk ∈ R, and δ = 0 if wk ∉ R.

This equation is essentially the same as Formula 3.6. The difference is that only the features in R receive the additional weight computed from relationships with other features, while the features not in R have only the basic weight based on the single-term weighting scheme. This means that when wk is indicated as important by a human expert, as shown in Figure 3.6, it is incorporated in Formula 3.8 regardless of the value of Sk. Figure 3.6 shows an example of the network representation of KAN for handling such a discriminating feature set in a given category, grain. In this example, the three nodes "agriculture", "grain", and "wheat" are presented as the discriminating features. The first two, "agriculture" and "grain", satisfy minS, while "wheat" does not satisfy it but is indicated as an informative feature by human experts. These three features will provide most of the overall similarity scores of new documents.
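Formulas 3.6 and 3.7 can be sketched with toy numbers (all values invented; this is our reading of the formulas, not the thesis's implementation):

```python
# Hedged sketch of Formulas 3.6 and 3.7: y_k combines the mean training weight
# of feature k over the positive documents of c_i with co-occurrence evidence
# from the features present in the new document g.

def kan_weights(x_rows, s, conf, b):
    """x_rows: feature-weight vectors of the positive documents of c_i;
    s[k] = S_k; conf[l][k] = CONF(w_l, w_k); b[l] = 1 iff feature l occurs in g."""
    n = len(s)
    y = []
    for k in range(n):
        base = sum(row[k] for row in x_rows) / len(x_rows)     # mean weight in c_i
        boost = s[k] * sum(conf[l][k] * b[l] for l in range(n) if l != k)
        y.append(base * (1 + boost))
    return y

def similarity(y, z):                       # Formula 3.7: dot product
    return sum(yk * zk for yk, zk in zip(y, z))

x_rows = [[0.4, 0.2], [0.6, 0.0]]           # two positive documents, two features
s = [0.5, 0.3]
conf = [[0.0, 0.8], [0.5, 0.0]]             # conf[0][1] = CONF(w1, w2), etc.
b = [1, 1]                                   # both features occur in g
y = kan_weights(x_rows, s, conf, b)
score = similarity(y, [0.7, 0.1])            # phi_i(g) for the new document g
```

Note how y is recomputed per new document (through b), which is the "on-the-fly" weighting described above.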
[Figure 3.6 here: a KAN for the category grain with minS = 0.35; nodes include program, grain (S = 0.65), china, ship, wheat (S = 0.29, indicated by human experts), U.S., agriculture (S = 0.54), and trade.]

Figure 3.6 An example of network representation for handling annotated training examples.

3.3 Computational Complexity

A potential drawback of KAN is that it may not be suitable for high-dimensional feature spaces, because the computational cost quickly becomes prohibitive as n increases, where n is the number of unique features in the feature space. When computing the weight of the jth feature in a given category, KAN requires a confidence value for each pair of features. In simplified O() notation, the computational complexity of KAN is O(n²) in both construction time and similarity computation time. So, to make KAN practical for text categorization, it is critical to reduce the feature space by applying feature selection algorithms. If KAN's performance with a reduced feature set is comparable with the full feature set, and such a reduced feature set has a reasonable input size that is manageable by KAN, we can say that KAN is efficient in terms of processing time. This is the subject of the experiments in Section 5.3.

Chapter 4 RinSCut: New Thresholding Strategy

The final decisions for the mapping from new documents to relevant categories are made by thresholding the similarity scores computed by the classifiers. For similarity-based classifiers, the thresholding strategy is a critical research area that has a large impact on the overall text categorization effectiveness. This chapter describes our new thresholding strategy, the rank-in-score thresholding strategy (RinSCut): its motivation, an analysis of its desired properties, and the overall approach to the definition of the optimal threshold value τi for a given category ci.

4.1 Motivation

As Yang notes [Yan01], thresholding strategies constitute an unexplored research area in text categorization.
Indeed, most work implicitly assumes that finding the optimal thresholding strategy is a trivial task. This assumption is not true. In fact, the selection of the optimal thresholding strategy, discussed in Section 2.3.2, depends on several aspects.

1. Different characteristics of machine learning algorithms. Some learning algorithms produce similarity scores that are more comparable within a document than within a category. For such algorithms, as shown in Figure 4.1, the RCut thresholding strategy would be better than SCut and PCut; these latter strategies work better with similarity scores that are more comparable within a category. However, the type of similarity scores an adopted algorithm will generate is not clear until one compares the empirical results obtained by applying all the existing thresholding strategies.

[Figure 4.1 here: a matrix of similarity scores φi(dj) for categories c1 … cm (rows) and documents d1 … dn (columns); scores that are more comparable within a category (along a row) suit SCut and PCut, while scores that are more comparable within a document (down a column) suit RCut.]

Figure 4.1 Comparability of similarity scores and thresholding strategies.

2. Different user interests. The choice of thresholding strategy is heavily affected by different user interests. For example, when a user is interested in the local performance of each category, SCut would be preferred since it optimizes the macro-averaged performance. RCut will be effective when the user is interested in the global performance (i.e., micro-averaged performance). The difficulty is that user interests are not constant; they change over time.

3. The multi-class text categorization problem. In real-world applications, text categorization systems do not know how many categories could be assigned to each new document, nor whether that number of categories will be constant over all new documents.
This means that typical text categorization tasks are multi-class categorization problems in which new documents belong to a variable number of categories. For multi-class categorization, SCut and PCut seem to be the optimal choices, because RCut assigns the same number of categories to every new document. However, this is not necessarily the case, because of the different characteristics of machine learning algorithms and the different user interests discussed earlier.

In the absence of an outstanding strategy, it is apparent that finding the optimal thresholding strategy for any given algorithm and data set is difficult. Addressing this problem is a challenge in similarity-based text categorization. One way of overcoming it is to design a new thresholding strategy that jointly uses the strengths of the different existing strategies. This is the motivation for our invention and evaluation of a new thresholding strategy, RinSCut.

4.2 Desired Properties

In developing our new thresholding strategy, we identified the following desired properties for RinSCut.

1. Online text categorization. Among the three thresholding strategies, PCut is the only one that uses the proportional category information observed in a training set. While using such proportional information makes PCut work well on a test set that shows a similar category distribution to the training set, it also makes PCut unsuitable for online text categorization: with PCut, the categorization of each new document must be postponed until a pool of new documents has accumulated. The ability to make an online decision is highly desirable for many text classification systems, especially for e-mail categorization systems and information filtering systems where delayed decisions on new documents are not acceptable.

2. Optimizing both local and global effectiveness. RCut would be effective when the global (micro-averaged) performance is the primary concern.
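To make the contrast concrete, here is a hypothetical sketch of the two decision rules discussed above (per-document RCut versus per-category SCut); the scores and thresholds are invented.

```python
# Illustrative contrast of RCut and SCut on one document's similarity scores.
# RCut keeps the top-t categories; SCut compares each score to its category's
# threshold. All values are hypothetical.

def rcut(scores, t=1):
    """Assign the t highest-scoring categories, regardless of score level."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:t])

def scut(scores, thresholds):
    """Assign every category whose score reaches its own threshold."""
    return {c for c, s in scores.items() if s >= thresholds[c]}

scores = {"farm": 0.8, "computer": 0.5, "sport": 0.1}
by_rank = rcut(scores, t=1)                      # always exactly t categories
by_score = scut(scores, {"farm": 0.6, "computer": 0.4, "sport": 0.3})
```

by_score holds two categories here, showing how a per-category threshold naturally handles the variable number of categories per document that the multi-class setting demands, while RCut's fixed t cannot.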
On the other hand, SCut could be superior to RCut when the local performance of each category (macro-averaged performance) is tested. Combining the strengths of both thresholding strategies offers the potential for a new strategy that optimizes both local and global performance.

3. Avoiding the risk of overfitting a small number of training documents
Our main goal in this research is to build a text categorization system that works reasonably well with a small number of training examples. With a small training set, SCut has a high risk of overfitting the training data and, as a result, can give some rare categories unreliable thresholds that significantly lower both the macro-averaged and the micro-averaged effectiveness. By contrast, RCut is less sensitive than SCut to the problem of overfitting.

From this analysis of desirable properties, we want a new thresholding strategy that: (1) has the ability to categorize new documents online, (2) gives thresholds that optimize both macro-averaged and micro-averaged performance, and (3) is insensitive to the problem of overfitting the training data in rare categories.

4.3 Overall Approach

We now describe our new thresholding strategy, rank-in-score (RinSCut), which is designed to use the strengths of two existing strategies, RCut and SCut, and to have the desirable properties discussed in the previous section.

4.3.1 Defining the Ambiguous Zone

As we have already noted, a weakness of SCut is that it has a high risk of overfitting some rare categories (i.e., SCut gives unreliable thresholds that are too high or too low). To deal with this problem, we define a range around the threshold value given by SCut. For new documents whose similarity scores fall within this range, the categorization decision depends on another thresholding strategy, such as RCut, that is relatively insensitive to the overfitting problem and optimizes the global performance of the system.
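As a concrete reference point for the two strategies RinSCut builds on, the following sketch applies RCut and SCut decisions to a small similarity-score matrix. This is our own toy illustration; the function and variable names are not from the thesis.

```python
# Illustrative sketch of the two base thresholding strategies that RinSCut
# combines. Our own toy code, not the thesis implementation.

def rcut_assign(scores, k):
    """RCut: assign each document its k top-ranking categories."""
    decisions = {}
    for doc, cat_scores in scores.items():
        ranked = sorted(cat_scores, key=cat_scores.get, reverse=True)
        decisions[doc] = set(ranked[:k])
    return decisions

def scut_assign(scores, thresholds):
    """SCut: assign category c whenever the score reaches the per-category
    threshold ts(c) tuned on the training set."""
    return {doc: {c for c, s in cat_scores.items() if s >= thresholds[c]}
            for doc, cat_scores in scores.items()}

scores = {"d1": {"earn": 0.9, "acq": 0.4, "grain": 0.1},
          "d2": {"earn": 0.2, "acq": 0.3, "grain": 0.25}}

print(rcut_assign(scores, k=1))
# d1 -> {'earn'}, d2 -> {'acq'}: always exactly k categories per document
print(scut_assign(scores, {"earn": 0.5, "acq": 0.35, "grain": 0.3}))
# d1 -> {'earn', 'acq'}, d2 -> set(): a variable number per document
```

With k = 1, RCut forces exactly one category per document, while SCut lets the number vary per document, which is why SCut and PCut seem the natural choices for multi-class tasks.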
To combine the strengths of RCut and SCut and, as a result, to overcome the weaknesses of both thresholding strategies, RinSCut finds two threshold scores, ts_top(ci) and ts_bottom(ci), for each category ci. As shown in Figure 4.2, the range between these two threshold values is considered an ambiguous zone for the category ci. The computation of the two thresholds is based on ts(ci), the optimal threshold score from SCut; ND(ci), the set of negative training documents with similarity scores above ts(ci); and PD(ci), the set of positive training documents with similarity scores below ts(ci), as follows [LKK02]:

ts_top(ci) = ts(ci) + (1 / |ND(ci)|) · Σ_{dj ∈ ND(ci)} [φi(dj) − ts(ci)]    (Formula 4.1)

ts_bottom(ci) = ts(ci) − (1 / |PD(ci)|) · Σ_{dj ∈ PD(ci)} [ts(ci) − φi(dj)]    (Formula 4.2)

where φi(dj) is the similarity value of document dj in the category ci, and |ND(ci)| and |PD(ci)| are the numbers of documents in ND(ci) and PD(ci), respectively. ts_top(ci) and ts_bottom(ci) default to ts(ci) if ND(ci) or PD(ci), respectively, is empty. The ambiguous zone is used for new documents for which the categorization decision for this category cannot be made with the threshold ts(ci) given by SCut alone.

[Figure 4.2: Ambiguous zone between ts_top(ci) and ts_bottom(ci) for a given category ci. The training data are sorted by similarity score in descending order, from the highest score hs(ci) to the lowest score ls(ci). ts_top(ci) lies above ts(ci), among the negative documents ND(ci) with scores above ts(ci); ts_bottom(ci) lies below ts(ci), among the positive documents PD(ci) with scores below ts(ci).]

For an assignment decision on a new document g to the category ci with similarity score φi(g), RinSCut assigns a "YES" decision if φi(g) ≥ ts_top(ci) and a "NO" decision if φi(g) < ts_bottom(ci).
If φi(g) is in the ambiguous zone between ts_top(ci) and ts_bottom(ci), the final assignment decision depends on the rank threshold k from RCut. This broad approach has been explored elsewhere, using a different method based on user-defined parameters [LKKR02].

A possible modification to the computation of ts_top(ci) and ts_bottom(ci) is to apply user-defined maximum and minimum values for these two thresholds, in order to prevent them from taking values that are too high or too low. This could be done with user-defined real-valued parameters, para_max and para_min, between 0 and 1 that are used for computing ts_max_top(ci), ts_min_top(ci), ts_max_bottom(ci), and ts_min_bottom(ci):

ts_max_top(ci) = ts(ci) + [hs(ci) − ts(ci)] × para_max
ts_min_top(ci) = ts(ci) + [hs(ci) − ts(ci)] × para_min
ts_max_bottom(ci) = ts(ci) − [ts(ci) − ls(ci)] × para_max
ts_min_bottom(ci) = ts(ci) − [ts(ci) − ls(ci)] × para_min

where hs(ci) is the highest similarity score and ls(ci) is the lowest score in a given category ci. The modified thresholds, mod_ts_top(ci) and mod_ts_bottom(ci), are then defined as follows:

mod_ts_top(ci) =
    ts_max_top(ci)   if ts_top(ci) > ts_max_top(ci)
    ts_min_top(ci)   if ts_top(ci) < ts_min_top(ci)
    ts_top(ci)       otherwise

mod_ts_bottom(ci) =
    ts_max_bottom(ci)   if ts_bottom(ci) < ts_max_bottom(ci)
    ts_min_bottom(ci)   if ts_bottom(ci) > ts_min_bottom(ci)
    ts_bottom(ci)       otherwise

Their expected locations in the ordered list of similarity scores of training documents are shown in Figure 4.3. This method has been investigated and evaluated in [LKKR02]. It appears that users find it difficult to suggest such parameters. Accordingly, we test only the unmodified versions of the thresholds, ts_top and ts_bottom, in this thesis.
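The threshold computation of Formulas 4.1 and 4.2, together with the resulting three-way decision rule, can be sketched as follows. This is a minimal illustration under our own naming, not the thesis implementation.

```python
# Sketch of RinSCut for one category ci: the two thresholds of Formulas 4.1
# and 4.2 plus the three-way decision rule. Illustrative naming is ours.

def rinscut_thresholds(ts, pos_scores, neg_scores):
    """ts: optimal SCut threshold ts(ci); pos_scores / neg_scores:
    similarity scores of positive / negative training documents for ci."""
    nd = [s for s in neg_scores if s > ts]   # ND(ci): negatives above ts(ci)
    pd = [s for s in pos_scores if s < ts]   # PD(ci): positives below ts(ci)
    ts_top = (ts + sum(s - ts for s in nd) / len(nd)) if nd else ts
    ts_bottom = (ts - sum(ts - s for s in pd) / len(pd)) if pd else ts
    return ts_top, ts_bottom

def rinscut_decide(score, ts_top, ts_bottom, rank, k):
    """YES above ts_top, NO below ts_bottom; inside the ambiguous zone,
    fall back on the RCut rank threshold k (rank of ci for this document)."""
    if score >= ts_top:
        return True
    if score < ts_bottom:
        return False
    return rank <= k

ts_top, ts_bottom = rinscut_thresholds(
    ts=0.5, pos_scores=[0.3, 0.4, 0.8, 0.9], neg_scores=[0.1, 0.2, 0.6, 0.7])
# ts_top ~ 0.65 and ts_bottom ~ 0.35: scores in between defer to the rank k
```

Note how the two overfitting-prone positives (0.3, 0.4) and negatives (0.6, 0.7) around ts(ci) widen the zone in which the rank-based decision takes over.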
[Figure 4.3: The locations of ts_max_top(ci), ts_min_top(ci), ts_max_bottom(ci), ts_min_bottom(ci), mod_ts_top(ci), and mod_ts_bottom(ci) in the list of training-document similarity scores sorted in descending order, from the highest score hs(ci) down to the lowest score ls(ci), with ts(ci) from SCut in between.]

4.3.2 Defining the RCut Threshold

For new documents that have similarity scores within the ambiguous zone for a given category ci, the rank threshold k is used to make the final decision. This threshold k can be defined in two ways: by optimizing the global performance (GRinSCut) or by optimizing the local performance of each category (LRinSCut). In GRinSCut, the same k value is applied to all categories, as in RCut; in LRinSCut, each category may have a different k value. Using a globally optimized k value may be a good choice if we want high micro-averaged performance. But when the local performance of each category is the primary concern and some rare categories prevent use of SCut, LRinSCut will be more effective.

For a category ci and a new document g whose similarity score φi(g) lies between ts_top(ci) and ts_bottom(ci), RinSCut sorts the similarity scores of the categories to find the k top-ranking categories. Its final decision is "YES" if the category ci is among those k categories and "NO" otherwise.

Chapter 5
Evaluation I: KAN and RinSCut

We described, in Chapters 3 and 4, our approaches, KAN and RinSCut, for building accurate classifiers from a small number of training examples. In this chapter we discuss the details of the evaluative experiments. The chapter begins by providing information on the data sets that we used in the experiments.
Then, we explain how the raw documents were preprocessed. Finally, we discuss the experimental results, which demonstrate the efficiency and effectiveness of our approaches to text categorization.

5.1 Data Sets Used

The experiments in this thesis were conducted using two data sets, Reuters-21578 [R21578] and 20-Newsgroups [20News]. We used these corpora because of the following characteristics they have in common:

1. The documents in the corpora are real-world ones.
By real-world documents, we mean that they are not machine-generated. All the documents in these corpora were written by humans for some purpose other than testing text categorization systems.

2. These corpora can be considered standard data sets that have been used in testing many text categorization systems.
These corpora are publicly available and have been widely used for experimental work on various text categorization systems (see, for example, [CS96, Joa98, YX99, HK00] for Reuters-21578 and [MN98, NMTM00, NG00] for 20-Newsgroups). While many researchers have used both corpora in tests of their categorization systems, there has unfortunately been no standard way of using them and, as a result, many of these experiments have been carried out under different conditions. This makes meaningful comparisons among the systems somewhat difficult. In order to guarantee a reliable cross-system comparison, we have to conduct the experiments: (1) on the same documents and categories, (2) with the same method of splitting the data into training and test sets, (3) with the same evaluation method, and (4) using the same part of each document (i.e., title, body, or both).

3. Each corpus contains a large number of documents.
In general, evaluations of text categorization systems require large numbers of labeled documents. A set of documents too small to prepare both a training and a test set would result in biased experimental results.
There are large numbers of pre-categorized documents (about 20,000) in each data set. This is enough for assessing the unbiased performance of a particular text categorization system.

4. Each corpus contains many predefined categories.
Many categories are available in each corpus (20 categories in 20-Newsgroups and 135 categories in the "Topics" group of Reuters-21578). Some text categorization tasks involve a very small number of categories (for example, spam-mail filtering is a 2-class text categorization task) and are generally easier.

In the following subsections, we explain each data set in detail.

5.1.1 Reuters-21578

The Reuters-21578 corpus consists of a set of 21,578 Reuters newswire articles from 1987. Each document was assigned to categories by human experts from Reuters Ltd and Carnegie Group Inc. From 1990 through 1993, the formatting and documentation of the articles was done by David D. Lewis and his coworkers. This first version of the data, called Reuters-22173 [R22173], consisted of 22,173 Reuters newswire articles. In 1996, based on the Reuters-22173 corpus, further formatting was done, a variety of typographical errors were corrected, and 595 duplicate articles were removed. The new version of the collection has 21,578 documents and is thus called the Reuters-21578 collection. It has been widely used for experimental work in text categorization. Recently, a new Reuters collection, called Reuters Corpus Volume 1 [RCV1], has been made available and will likely replace Reuters-21578 as the standard Reuters collection for text categorization. We did not use this new collection because it was not available when our research started. In the Reuters-21578 collection, there are 672 categories in total across 6 different groups: "Topics", "Places", "People", "Organizations", "Exchanges", and "Companies". In this corpus, "Companies" has no categories.
Most text categorization research has been done using the "Topics" group, which consists of 135 economic subject categories. For dividing the Reuters-21578 corpus into a training set and a test set, there are 3 standard predefined splitting methods: the Modified Lewis ("ModLewis"), the Modified Apte ("ModApte"), and the Modified Hayes ("ModHayes"). We chose the "ModApte" split since it has been the most widely used in text categorization evaluations on this corpus. We used only the body of each article, since many articles have no title and the text of the title usually appears in the body. Figure 5.1 provides an example of the extracted Reuters-21578 documents. It is document #9 in the earn category, and was used for training.

Champion Products Inc said its board of directors approved a two-for-one stock split of its common shares for shareholders of record as of April 1, 1987. The company also said its board voted to recommend to shareholders at the annual meeting April 23 an increase in the authorized capital stock from five mln to 25 mln shares.

Figure 5.1 An example Reuters-21578 document (identification number 9, assigned to the earn category and used in the training set).

Instead of analyzing all 135 categories in the "Topics" group, we chose the categories having at least 10 articles in both the training set and the test set. This results in 53 categories, giving a corpus of 6,984 training documents and 3,265 test documents across these 53 categories. Many documents in both the training and the test set may belong to multiple categories. The 53 categories and the numbers of training and test documents in each category are listed in three tables (Tables 5.1, 5.2, and 5.3). In these tables, the rows are ordered by the number of training documents, from highest to lowest.
The documents in Reuters-21578 are unevenly distributed across the categories. Most are assigned to the first two categories, "earn" and "acq" (see Table 5.1). Since micro-averaged performance depends heavily on such frequent categories, adding other extremely rare categories to the corpus would have a very minor impact on the resulting micro-averaged performance. As a result, our micro-averaged performance results in this thesis might be comparable with those of studies that include all the categories, even ones with small numbers of articles. Our macro-averaged results may not be comparable, since for macro-averaging the performance of rare categories has equal importance with that of the frequent categories.

Table 5.1 The 53 categories of the Reuters-21578 data set used in our experiments (part 1).

category name     training documents   test documents
earn              2,709 (38.8%)        1,066 (32.6%)
acq               1,488 (21.3%)          722 (22.1%)
money-fx            460 (6.6%)           222 (6.8%)
grain               394 (5.6%)           179 (5.5%)
crude               349 (5.0%)           215 (6.6%)
trade               337 (4.8%)           177 (5.4%)
interest            289 (4.1%)           133 (4.1%)
wheat               198 (2.8%)            89 (2.7%)
ship                191 (2.7%)           103 (3.2%)
corn                159 (2.3%)            63 (1.9%)
sugar               118 (1.7%)            57 (1.7%)
oilseed             117 (1.7%)            65 (2.0%)
coffee              110 (1.6%)            33 (1.0%)
dlr                  96 (1.4%)            72 (2.2%)
gold                 94 (1.3%)            39 (1.2%)
gnp                  92 (1.3%)            61 (1.9%)
money-supply         87 (1.2%)            39 (1.2%)
veg-oil              86 (1.2%)            50 (1.5%)
livestock            73 (1.0%)            39 (1.2%)
soybean              73 (1.0%)            38 (1.2%)
nat-gas              72 (1.0%)            54 (1.7%)

Table 5.2 The 53 categories of the Reuters-21578 data set used in our experiments (part 2).
category name     training documents   test documents
bop                  62 (0.9%)            39 (1.2%)
cpi                  60 (0.9%)            41 (1.3%)
carcass              50 (0.7%)            25 (0.8%)
cocoa                50 (0.7%)            18 (0.6%)
reserves             48 (0.7%)            25 (0.8%)
copper               47 (0.7%)            30 (0.9%)
jobs                 41 (0.6%)            27 (0.8%)
iron-steel           40 (0.6%)            25 (0.8%)
cotton               38 (0.5%)            24 (0.7%)
yen                  36 (0.5%)            22 (0.7%)
ipi                  35 (0.5%)            22 (0.7%)
rubber               35 (0.5%)            14 (0.4%)
rice                 35 (0.5%)            32 (1.0%)
alum                 33 (0.5%)            25 (0.8%)
barley               33 (0.5%)            15 (0.5%)
gas                  30 (0.4%)            24 (0.7%)
meal-feed            30 (0.4%)            20 (0.6%)
palm-oil             29 (0.4%)            13 (0.4%)
sorghum              23 (0.3%)            11 (0.3%)
silver               21 (0.3%)            15 (0.5%)
pet-chem             20 (0.3%)            21 (0.6%)

Table 5.3 The 53 categories of the Reuters-21578 data set used in our experiments (part 3).

category name     training documents   test documents
rapeseed             18 (0.3%)            17 (0.5%)
tin                  18 (0.3%)            15 (0.5%)
wpi                  17 (0.2%)            12 (0.4%)
strategic-metal      16 (0.2%)            16 (0.5%)
lead                 15 (0.2%)            20 (0.6%)
orange               15 (0.2%)            10 (0.3%)
hog                  15 (0.2%)            11 (0.3%)
heat                 14 (0.2%)            11 (0.3%)
soy-oil              14 (0.2%)            11 (0.3%)
fuel                 13 (0.2%)            15 (0.5%)
soy-meal             13 (0.2%)            13 (0.4%)
total (parts 1, 2, and 3)   6,984 (100.0%)   3,265 (100.0%)

In this thesis, to construct learning curves for the implemented classifiers based on the number of training documents used, we conducted a series of experiments with increasing numbers of training examples. The training set for each round is a superset of the one for the previous round. The same test set (i.e., the 3,265 test examples in Table 5.3) was used for testing the generated classifiers in each round.

5.1.2 20-Newsgroups

The 20-Newsgroups data set is a collection of approximately 20,000 newsgroup articles posted to 20 different Usenet discussion groups. Since it was set up and used by Ken Lang in [Lan95], this corpus has been a popular benchmark data set, frequently used for experiments in various text classification systems.

Table 5.4 The 20 categories of the 20-Newsgroups corpus.
category name
alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc

The 20 newsgroups are listed in Table 5.4. Unlike the Reuters-21578 corpus, the documents in this corpus are partitioned (nearly) evenly across the 20 newsgroups, and each document belongs to exactly one newsgroup. As a result, there are about 1,000 documents in each newsgroup. Some of the newsgroups are very closely related to each other and can be considered as forming a hierarchical structure. For example, the two categories talk.politics.guns and talk.politics.mideast can be regarded as child categories of the super-category talk.politics. Because of these hierarchical relationships between the available newsgroups, 20-Newsgroups has also been used as a benchmark data set for hierarchical text categorization [MRMN98]. Figure 5.2 shows an example document from the 20-Newsgroups corpus; it was posted to the alt.atheism newsgroup. In using this corpus for our experiments, we used the text in the "Subject" header and the body. Making use of other textual information, such as the "From" and "Sender" headers, might result in performance improvements. We did not consider this in this thesis since it is not general but specific to the characteristics of a particular data set; also, most other evaluations [MN98, NMTM00] on this data set removed the header information. The task of assigning a document to a single category (single-class text categorization) is quite different from the task of assigning a document to a variable number of relevant categories (multi-class text categorization).
Because each document in the 20-Newsgroups corpus belongs to exactly one category, this corpus has primarily been used for single-class text categorization, which assigns each document to the single most appropriate category. Other data sets, such as Reuters-21578, are used for evaluating a system's ability to perform the multi-class text categorization task. Unlike the Reuters-21578 corpus, the 20-Newsgroups corpus has no predefined method for splitting it into training and test sets. The splitting method we adopted is the k-fold cross-validation discussed in Section 2.5.3, with k = 5. As a result, in each experiment, 20 percent of the total number of documents was used for the test set and the remaining 80 percent for the training set. Due to time limits, this 5-fold cross-validation was performed just once. Also, the learning curve of each classifier was generated by selecting varying numbers of training documents from the training set's 80 percent of the total examples.

Path: cantaloupe.srv.cs.cmu.edu!magnesium.club.cc.cmu.edu!news.sei.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!darwin.sura.net!ra!cs.umd.edu!mimsy!mangoe
From: [email protected] (Charley Wingate)
Newsgroups: alt.atheism
Subject: Re: A Little Too Satanic
Message-ID: <[email protected]>
Date: 25 Apr 93 02:48:25 GMT
Sender: [email protected]
Lines: 34

Kent Sandvik and Jon Livesey made essentially the same response, so this time Kent's article gets the reply:

>I agree, but this started at one particular point in time, and we
>don't know when this starting point of 'accurately copied scriptures'
>actually happened.

This begs the question of whether it ever "started"-- perhaps because accuracy was always an intention.
>Even worse, if the events in NT were not written by eye witness accounts (a
>high probability looking at possible dates when the first Gospels were
>ready) then we have to take into account all the problems with information
>forwarded with the 'telephone metaphor', indeed.

It makes little difference if you have eyewitnesses or people one step away (reporters, if you will). As I said earlier, the "telephone" metaphor is innately bad, because the purpose of a game of telephone is contrary to the aims of writing these sorts of texts. (Also, I would point out that, by the standards generally asserted in this group, the distinction between eyewitnesses and others is hollow, since nobody can be shown to be an eyewitness, or indeed, even shown to be the author of the text.) There is no evidence that the "original" texts of either the OT or the NT are largely lost over time in a sea of errors, "corrections", additions and deletions. In the case of the NT, the evidence is strongly in the other direction: the Textus R. and the Nestle-Aland text do not differ on more than a low level of significance. It is reasonable to assume a similar situation for the OT, based on the NT as a model.

-C. Wingate + "The peace of God, it is no peace, + but strife closed in the sod. [email protected] + Yet, brothers, pray for but one thing: tove!mangoe + the marv'lous peace of God."

Figure 5.2 An example 20-Newsgroups document (posted to the alt.atheism newsgroup).

5.2 Text Preprocessing

All raw documents in both the training and the test set must be converted into representations suitable for efficient learning.
Converting a raw text into such a representation involves three sub-processes: (1) identifying meaningful features, which could be single words or phrases; (2) giving an appropriate weight to each feature; and (3) finding a reduced feature set that is computationally tractable for the machine learning algorithms without sacrificing text categorization performance. Because of the diverse techniques available for each sub-process, the overall conversion involves many arbitrary decisions. Our principle here is to maintain the simplicity of this conversion process by using the typical and most popular methods for similarity-based text categorization. This section describes the methods chosen for preprocessing the raw documents.

5.2.1 Feature Extraction

Feature extraction is an automatic procedure for detecting the meaningful tokens in a text. The extracted tokens are called features in the information retrieval field. This procedure is one of the critical natural language processing tasks for most text classification systems in the domain of textual information analysis. There are several possible tokenizing methods. In this thesis, we extracted tokens from a raw document by breaking up the string of characters at white space and at all non-alphanumeric characters. Then, we trimmed the tokens to remove any surrounding punctuation marks (e.g., shares. → shares and "telephone" → telephone). We did not correct any misspellings in the extracted tokens. We also omitted numeric information, considering only alphabetical tokens as candidate features. All the extracted tokens were converted to lower case. Then, the common words were removed based on the stop-list in Appendix A, and a stemming algorithm was applied to the remaining tokens. Figure 5.3 shows the resulting tokenized file after the feature extraction process for the example document in Figure 5.1.
champion product inc board director approv two stock split common share sharehold record april compani board vote recommend sharehold annual meet april increas author capit stock five mln mln share

Figure 5.3 The resulting tokenized file after feature extraction for the example document in Figure 5.1.

5.2.2 Feature Weighting

In these experiments, we used the vector space model, the common representation method adopted by most similarity-based text classification systems. In the vector space model, the weight of each feature is usually computed by the TFIDF weighting scheme. As shown in Figure 5.4, the TFIDF weights of features are computed from two frequency files: the term frequency (TF) file and the document frequency (DF) file. For each document, a term frequency file, in which each feature has an integer indicating its frequency in that document, is generated from its tokenized file (as in Figure 5.3). This term frequency file is then used to update the global document frequency file, in which an integer for each feature indicates the number of training documents that contain that feature. Based on these two frequency files and the TFIDF weighting equation (discussed in Section 2.2.2), each raw document is finally transformed into our target representation in the vector space model.

[Figure 5.4: The TFIDF weighting scheme based on the term frequency and document frequency files: a per-document term frequency file and the global document frequency file are combined to produce a TFIDF file for each document.]

5.2.3 Feature Selection

Even after applying the stop-list and stemming algorithm, the feature space usually has a high dimensionality that is not computationally tractable for most learning algorithms.
This problem highlights the need for an aggressive feature reduction method. Previous work [YP97] on feature selection demonstrated that better results can be achieved with a reduced feature set. Two types of features at which feature selection methods should be targeted are (1) extremely low-frequency features and (2) high-frequency features that appear almost evenly across categories. To reduce the first type, we applied a document frequency criterion, selecting only features occurring in at least 2 documents of the same category. To reduce the second type, we adopted information gain, which was explained in Section 2.2.3. The features remaining after the document frequency criterion are sorted in descending order of information gain, and we then picked the desired numbers of features according to their information gain values.

5.3 Experiments on the Number of Features

Our preliminary experimental results in [LKK00] had shown that KAN gives a significant performance improvement over a typical similarity-based learning algorithm, Rocchio. In this section, we investigate further the effectiveness of the similarity-based learning algorithms explored in this thesis, based on the various feature subsets obtained by applying the feature selection algorithms. The main goal of the following set of experiments is to verify whether a larger number of input features always gives better results. If this is not the case for a given learning algorithm, we want to find an optimal size of feature subset, one that is more effective than the full feature set.

5.3.1 Experimental Setup

We have implemented the four similarity-based learning algorithms (Rocchio, WH, k-NN, and KAN) explained in Section 2.3.1 and Chapter 3.
Their parameter settings, used throughout these evaluations, are as follows:

(1) Rocchio. To compute the vectors of categories (i.e., category profiles), we used β = 16 and γ = 4 in the equation in Section 2.3.1.1, as suggested in [BSA94].

(2) WH. We set the learning rate parameter η to 0.25, as this value has been used for implementations of WH elsewhere [LSCP96, LH98].

(3) k-NN. The values of k used in these experiments for k-NN are 10, 20, 30, 40, and 50. We then chose the k value with the best performance result in each experiment.

(4) KAN. For computing the discriminating power S of each feature, we used 0.4 for λ.

The data sets used in these experiments are Reuters-21578 and 20-Newsgroups. All the training examples available in each split were used for each experiment. The thresholding strategy adopted is SCut for Reuters-21578 (our corpus for the multi-class text categorization task) and RCut for 20-Newsgroups (our corpus for the single-class task). The optimal RCut value for 20-Newsgroups is always 1, since every document belongs to exactly one category. We did not apply the RinSCut strategy because our focus in these experiments is on the learning algorithms with feature selection.

In order to construct the learning curve for a given learning algorithm using different sizes of input feature sets, we ran a series of experiments on each corpus, varying the number of input features. The feature selection, based on document frequency and information gain, took from 10 to 250 features for each category. Tables 5.5 and 5.6 show the statistics for the unique features in Reuters-21578 and 20-Newsgroups, respectively. In these tables, the first column shows the number of features selected for each category, the second column shows the number of unique features in the training set, and the last column shows the average number of unique features in each category.
Note in these tables that, once the number of selected features per category exceeds roughly 110, there is little difference in the number of unique features.

Table 5.5 Statistics for the unique features in the Reuters-21578 corpus.

features chosen per category   unique features in training set   average unique features per category
10        338       4.5
30        967      13.0
50      1,623      22.4
70      2,210      30.4
90      2,729      37.1
110     3,193      42.5
130     3,614      47.5
150     3,952      50.9
170     4,228      53.3
190     4,458      55.3
210     4,649      56.8
230     4,800      57.9
250     4,939      59.1

Table 5.6 Statistics for the unique features in the 20-Newsgroups corpus.

features chosen per category   unique features in training set   average unique features per category
10        185.2     8.6
30        521.8    22.9
50        797.6    32.6
70      1,021.8    38.7
90      1,192.6    41.7
110     1,312.0    42.2
130     1,401.4    41.9
150     1,461.6    41.0
170     1,491.4    39.5
190     1,503.6    37.9
210     1,511.4    37.4
230     1,512.6    36.9
250     1,514.6    36.8

This indicates that selecting more features per category eventually yields few new unique features and, as a result, leads to a larger number of common features that occur evenly across many categories. For a given learning algorithm, if the unreduced full feature set does not work as well as some reduced feature subsets, the high number of common features in the full feature space might provide an explanation.

5.3.2 Results and Analysis

The graphs in Figures 5.5 to 5.8 show the F1 measures (see Section 2.5.1 for the definition) for the four similarity-based learning algorithms on the Reuters-21578 data set. Each figure depicts the two F1 measures, micro- and macro-averaged, of a given algorithm using all available training data (i.e., the 6,984 documents shown in Table 5.3). In these figures, the Y axis indicates F1 performance on the test data set and the X axis indicates the number of input features at which that F1 performance was achieved.
Throughout these experimental results, we can observe that the macro-averaged F1 is clearly lower than the micro-averaged F1. One possible reason is that, in the Reuters-21578, many of the 53 categories are rare, with a small number of documents in the training set, as shown in Tables 5.1 to 5.3. As a result, it is much more difficult to categorize new documents with classifiers constructed from such small training sets. This seems to cause very low F1 measures for the rare categories and, consequently, a low F1 on average.

Figure 5.5 F1 performance of Rocchio on the Reuters-21578 corpus (6,984 training documents used).

Figure 5.6 F1 performance of k-NN on the Reuters-21578 corpus (6,984 training documents used).

Figure 5.7 F1 performance of WH on the Reuters-21578 corpus (6,984 training documents used).

Figure 5.8 F1 performance of KAN on the Reuters-21578 corpus (6,984 training documents used).

From these experimental results, we can see that, on the Reuters-21578 corpus, all the similarity-based learning algorithms give their best results with a relatively small number of input features (around 10 to 70). Adding more features to the feature space fails to produce any performance benefit, while it requires more time to run the classifiers.
This result is consistent with findings reported in other feature selection studies [Mla98, MG99, Yan99], even though they used different test data sets.

Figures 5.9 to 5.12 present the experimental results on the 20-Newsgroups data set. Unlike the results on the Reuters-21578 corpus, in this case all the learning algorithms give nearly equal micro- and macro-averaged F1 performances. Such identical results are obtained when the value of the denominator in the evaluation equation is the same for every category. For example, consider the computation of the micro- and macro-averaged recall. Let Di and Ni be the denominator and numerator, respectively, in the computation of recall for category i, and let |C| be the number of categories. Then the micro- and macro-averaged recall, MI-Recall and MA-Recall respectively (as explained in Section 2.5.2), are defined as follows:

  MI-Recall = (N1 + ... + N|C|) / (D1 + ... + D|C|)

  MA-Recall = (N1/D1 + ... + N|C|/D|C|) / |C|
            = N1/(D1 × |C|) + ... + N|C|/(D|C| × |C|)

where Di = TPi + FNi. When the number of correct examples in each category (i.e., TPi + FNi) is equal across all the categories, the denominators Di become equal, say D. So MI-Recall will be equal to MA-Recall:

  MA-Recall = (N1 + ... + N|C|) / (D × |C|) = MI-Recall

Note the following characteristics of the 20-Newsgroups: (1) all the documents are evenly distributed across the 20 categories, and (2) every document belongs to exactly one category. As discussed in Section 5.1.2, we select 20 percent of the documents in each category for the test set. Because of the even distribution of documents across categories in the full data set, all the categories again have the same number of documents in the test set. Also, each document has only one category as its correct label. These characteristics of the 20-Newsgroups make the denominator of recall the same for each category (i.e., the value of TPi + FNi is the same in each category).
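The equality above is easy to check numerically. The sketch below computes micro- and macro-averaged recall from per-category (TP, FN) counts; the counts themselves are made-up illustration values, not results from these experiments.

```python
def micro_recall(counts):
    """counts: list of (TP, FN) pairs, one per category."""
    tp = sum(t for t, _ in counts)
    fn = sum(f for _, f in counts)
    return tp / (tp + fn)

def macro_recall(counts):
    """Average of per-category recalls."""
    return sum(t / (t + f) for t, f in counts) / len(counts)

# Equal denominators (TP + FN = 50 in every category): the two averages agree.
balanced = [(40, 10), (25, 25), (10, 40)]
assert abs(micro_recall(balanced) - macro_recall(balanced)) < 1e-12

# Unequal denominators: they diverge, macro being dominated by rare categories.
skewed = [(90, 10), (1, 9)]
print(micro_recall(skewed))  # 0.8272727272727273
print(macro_recall(skewed))  # 0.5
```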
By contrast, the F1 measure is a combination of recall and precision, as discussed in Section 2.5.1, and involves only three entries, TPi, FNi, and FPi (note: not TNi). The total value of these three entries for each category can differ even given the above characteristics of the 20-Newsgroups. For the almost identical micro- and macro-averaged F1 performances in these experiments, we think that the same number of training documents in each category may be another main reason. We investigate this issue further in the following section.

From the charts in Figure 5.9 to Figure 5.12, we can also observe that each similarity-based learning algorithm gives a learning curve similar to the one obtained on the Reuters-21578 corpus. In other words, for each learning algorithm there is no significant F1 advantage from a larger number of input features. As on the Reuters-21578, we can conclude that feature selection is advantageous for fast pre- and post-processing and also for performance.

Figure 5.9 F1 performance of Rocchio on the 20-Newsgroups corpus (all the training documents used in each split).

Figure 5.10 F1 performance of k-NN on the 20-Newsgroups corpus (all the training documents used in each split).

Figure 5.11 F1 performance of WH on the 20-Newsgroups corpus (all the training documents used in each split).
Figure 5.12 F1 performance of KAN on the 20-Newsgroups corpus (all the training documents used in each split).

As shown in Figure 5.11, the WH algorithm gives its best results at 50 features. For the other three learning algorithms, adding more features beyond 50 does not give a significant performance improvement. However, with fewer than 50 input features, performance is somewhat worse than with the larger feature sets. Finding the optimal size of the feature subset is very time consuming, since it requires repeating the same experiment at every possible feature-set size. Furthermore, such an optimal size could change with a different size of training data. For all the similarity-based algorithms in the following set of experiments, we use 50 features per category. This appears close to the optimal feature-subset size, since the result each algorithm achieved at 50 features is very close to, or the same as, its best result, and a small variance in this number has little effect.

5.4 Experiments on the Number of Training Examples

In the previous section, the experimental results showed that increasing the number of input features gave no benefit for the similarity-based machine learning algorithms we investigated. The best text categorization results were obtained with reduced feature sets. Even so, we observed small differences between their performance results. Note, however, that these experimental results are based on the use of all the training examples available in a training set that usually contains several thousand labeled documents. In a realistic situation, it is not feasible to prepare such large numbers of labeled training documents, since manually labeling them imposes a huge cost on the human experts.
Consequently, an interesting question arises as to whether a particular classifier works reasonably well with the small number of training examples available at a particular time. There is reason to predict that our methods, KAN and the variants of RinSCut, may work better than their counterpart techniques, since they were designed to use more information about the labeled training documents. In this section, we explore this issue. To draw the learning curve of each classifier as a function of the amount of input training data, we evaluated each classifier on training sets of various sizes. This is an important issue not only for the classifiers themselves but also for selective sampling, since the effectiveness of selective sampling might be affected by the quality of classifiers that are usually built from a small number of training examples. Some results of these experiments were published previously in [LKK02, LKKR02]. In this section, we report much more extensive experiments on both collections (for example, 5-fold cross-validation on the 20-Newsgroups) and present these updated results.

5.4.1 Experimental Setup

As in the experiments in Section 5.3, we used four machine learning algorithms (Rocchio, WH, k-NN, and KAN). These learning algorithms were evaluated with SCut on the Reuters-21578 corpus and RCut on the 20-Newsgroups corpus. For the evaluation of our thresholding strategies, the two variants of RinSCut, we used only the Reuters-21578, since RinSCut was developed for multi-class text categorization tasks. The parameter settings for the implemented learning algorithms are the same as the parameters described in Section 5.3.1. To construct a learning curve for each classifier, we conducted each experiment over 10 rounds of increasing training-set size.
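The round-by-round construction of such a learning curve can be sketched as follows: each round's training set is a growing prefix of a shuffled pool (so later rounds are supersets of earlier ones), and each round's score is averaged over a few repeated shuffles. The pool, fractions, and placeholder evaluation function are illustrative assumptions, not the actual experimental harness.

```python
import random

def nested_training_sets(pool, fractions, seed=0):
    """Shuffle once, then take growing prefixes so each round's training set
    is a superset of the previous round's."""
    docs = list(pool)
    random.Random(seed).shuffle(docs)
    return [docs[:max(1, round(len(docs) * f))] for f in fractions]

def learning_curve(pool, fractions, evaluate, repetitions=3):
    """Average `evaluate(train_set)` over several shuffles per round.
    `evaluate` stands in for a full train-and-score run on held-out data."""
    curve = []
    for round_idx, _ in enumerate(fractions):
        scores = [evaluate(nested_training_sets(pool, fractions, seed=r)[round_idx])
                  for r in range(repetitions)]
        curve.append(sum(scores) / repetitions)
    return curve

# Toy check: with a score that simply grows with training-set size, the
# resulting curve is non-decreasing across the 10 rounds.
pool = list(range(1000))
fracs = [0.015, 0.03, 0.053, 0.099, 0.182, 0.345, 0.529, 0.735, 0.888, 1.0]
curve = learning_curve(pool, fracs, evaluate=lambda train: len(train))
assert all(a <= b for a, b in zip(curve, curve[1:]))
```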
Table 5.7 shows, for each round and each data set, the percentage of the overall training examples used and the corresponding number of training examples. The training set for each round is a superset of the one for the previous round. One question in conducting these experiments is how many times we should repeat the experiment for each round. Intuitively, the more we repeat the experiment for each round, the more stable and meaningful the resulting performance distribution will be. With this in mind, and with concern for time, we chose three repetitions. Thus, each 10-round experiment was conducted 3 times (more repetitions might be necessary to gain still more stable results), and the resulting performance of a particular classifier in each round is the average of these three experimental results. For the experiment in each round, we applied document frequency and information gain for feature selection and took 50 features for every category, as discussed in Section 5.3.2.

Table 5.7 The percentage of training data and the number of training documents used in each round.

         Reuters-21578                20-Newsgroups
round    percentage    number         percentage    number
1           1.5%          106            1.0%          160
2           3.0%          212            2.0%          320
3           5.3%          371            3.0%          480
4           9.9%          689            5.0%          800
5          18.2%        1,272           10.0%        1,600
6          34.5%        2,409           16.0%        2,560
7          52.9%        3,696           32.0%        5,120
8          73.5%        5,136           50.0%        8,000
9          88.8%        6,202           70.0%       11,200
10        100.0%        6,984          100.0%       15,998

5.4.2 Results and Analysis

Figures 5.13 and 5.14 depict the learning curve of each similarity-based learning algorithm with SCut in the micro- and macro-averaged F1, respectively. These charts show the learning trace of each algorithm as a function of the size of training data used to build each classifier on the Reuters-21578 corpus.
Rocchio gives lower performance than the other algorithms on both measures across all the rounds, with the exception that k-NN has the lowest micro-averaged performance at rounds 1 and 2. However, k-NN shows performance similar to KAN after round 3 in the micro-averaged F1. WH shows stable performance, with no extremely low F1 at any round in either the micro- or macro-averaged F1, but its performance is consistently slightly lower than the best performance across all the rounds. In macro-averaged F1, KAN in general achieves better results than the other algorithms.

Our new algorithm, KAN, does not show better performance than the other three algorithms at every round. However, we can conclude that KAN is, on the Reuters-21578 data set, a better learning algorithm than the others, since it achieves the best performance in most rounds and, even when it does not, it shows performance similar to the best of any other algorithm. For example, in Figure 5.13, k-NN shows quite good performance, similar to KAN, after round 4, but with a very small number of training examples (i.e., rounds 1 to 3) its micro-averaged measures are much lower than the best measures at each round. By contrast, KAN works well even with such a small number of training examples.

Figure 5.13 Micro-averaged F1 of each classifier on the Reuters-21578 corpus with SCut.

Figure 5.14 Macro-averaged F1 of each classifier on the Reuters-21578 corpus with SCut.

Figures 5.15 and 5.16 show the micro- and macro-averaged F1 performance of each learning algorithm with RCut on the 20-Newsgroups data set. Each algorithm shows a very similar learning trace, and for each algorithm the micro- and macro-averaged measures are nearly the same, as in Section 5.3.
Also, all the learning algorithms achieve very similar performance at each round. These experimental results indicate that there is no superior learning algorithm that works much better than the others on the 20-Newsgroups data set. As explained in Section 5.3, such similar results for the learning algorithms are probably due to the fact that text categorization on the 20-Newsgroups is a single-class categorization task. So, we may conclude that single-class categorization is a simpler task than multi-class categorization, because it gives stable performance that hardly changes with the specific learning algorithm applied.

Also, note that we extracted the same proportion of training examples from each category for rounds 1 to 9. Since the training data at round 10 is evenly distributed across the 20 categories, all the categories have nearly the same number of training examples at each round. In a real situation, however, it is unrealistic for each category to always have the same number of training examples. It is therefore interesting to see whether the results of Figures 5.15 and 5.16 change if we use "truly random sampling", which extracts examples without preserving an even distribution of training examples across categories.

Figures 5.17 and 5.18 show the micro- and macro-averaged F1 performance of each algorithm on the 20-Newsgroups with "truly random sampling", which may cause an uneven distribution of training documents at each round. In these experiments, we select the same number of training documents at each round as shown in Table 5.7. From these charts, we can see that an uneven distribution of training examples leads to quite different results: each learning algorithm gives lower performance than with the even distribution. So, we can conclude that an uneven distribution of training examples can make text categorization harder [YG96].
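The contrast between the even, per-category extraction used for rounds 1 to 9 and "truly random sampling" can be sketched as follows; the category labels and document pool here are toy data, not the 20-Newsgroups.

```python
import random
from collections import Counter

def stratified_sample(pool, n, seed=0):
    """Draw n docs while preserving each category's share of the pool.
    pool: list of (doc_id, category) pairs; assumes n divides evenly."""
    rng = random.Random(seed)
    by_cat = {}
    for doc in pool:
        by_cat.setdefault(doc[1], []).append(doc)
    per_cat = n // len(by_cat)
    sample = []
    for docs in by_cat.values():
        sample.extend(rng.sample(docs, per_cat))
    return sample

def truly_random_sample(pool, n, seed=0):
    """Draw n docs ignoring categories; per-category counts may be uneven."""
    return random.Random(seed).sample(pool, n)

# Toy pool: 4 categories with 100 docs each (even, like the 20-Newsgroups).
pool = [(i, f"cat{i % 4}") for i in range(400)]
even = Counter(c for _, c in stratified_sample(pool, 40))
rand = Counter(c for _, c in truly_random_sample(pool, 40))
print(sorted(even.values()))  # [10, 10, 10, 10]
print(sorted(rand.values()))  # typically uneven, e.g. some categories over-drawn
```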
Figure 5.15 Micro-averaged F1 of each classifier on the 20-Newsgroups corpus with RCut.

Figure 5.16 Macro-averaged F1 of each classifier on the 20-Newsgroups corpus with RCut.

Figure 5.17 Micro-averaged F1 of each classifier on the 20-Newsgroups corpus with RCut (with "truly random sampling").

Figure 5.18 Macro-averaged F1 of each classifier on the 20-Newsgroups corpus with RCut (with "truly random sampling").

Note that the k-NN and WH algorithms do not perform as well when training documents are unevenly distributed across categories. KAN and Rocchio achieve better results than the other two algorithms, and their performances are very similar at all the rounds in both micro- and macro-averaged F1. Also, by comparing the micro- and macro-averaged results of each algorithm in Figures 5.17 and 5.18, we can see that the two averaged measures of each algorithm are only slightly different. As discussed in Section 5.3, an even distribution of training examples causes almost identical micro- and macro-averaged F1 measures on the 20-Newsgroups.

Now, let us look at the performance comparison of the three thresholding strategies - SCut, GRinSCut, and LRinSCut - in each similarity-based learning algorithm. Since these thresholding strategies had mainly been developed for the multi-class text categorization task, we ran the experiments only on the Reuters-21578 corpus. Figures 5.19 through 5.26 show the performance of our RinSCut variants with each algorithm; these should be compared to the performance of SCut shown in Figures 5.13 and 5.14.
Figures 5.19 and 5.20 show that, across all the rounds, our RinSCut variants consistently give Rocchio considerable performance improvements over SCut in both the micro- and macro-averaged F1. The results of GRinSCut and LRinSCut are very similar in the macro-averaged F1. However, in the micro-averaged F1 in Figure 5.19, the advantage of GRinSCut is noticeable, showing a considerable performance improvement, especially at rounds 2 and 3.

As shown in Figures 5.21 and 5.22, the WH algorithm with the two RinSCut variants gives very unstable micro- and macro-averaged performance. It also gives, in general, lower performance results than with SCut on both measures.

In Figure 5.23, k-NN with the RinSCut strategies achieves slightly better micro-averaged results than with SCut, except at round 4, while giving macro-averaged performance similar to SCut across all the rounds in Figure 5.24.

Figure 5.19 Micro-averaged performance comparison of SCut and RinSCut variants on Rocchio (the Reuters-21578 corpus used).

Figure 5.20 Macro-averaged performance comparison of SCut and RinSCut variants on Rocchio (the Reuters-21578 corpus used).

Figure 5.21 Micro-averaged performance comparison of SCut and RinSCut variants on WH (the Reuters-21578 corpus used).

Figure 5.22 Macro-averaged performance comparison of SCut and RinSCut variants on WH (the Reuters-21578 corpus used).
Figure 5.23 Micro-averaged performance comparison of SCut and RinSCut variants on k-NN (the Reuters-21578 corpus used).

Figure 5.24 Macro-averaged performance comparison of SCut and RinSCut variants on k-NN (the Reuters-21578 corpus used).

Figure 5.25 Micro-averaged performance comparison of SCut and RinSCut variants on KAN (the Reuters-21578 corpus used).

Figure 5.26 Macro-averaged performance comparison of SCut and RinSCut variants on KAN (the Reuters-21578 corpus used).

Figures 5.25 and 5.26 show that KAN with the RinSCut variants achieves slightly better results than with SCut on both measures, except for the low micro-averaged F1 of GRinSCut at rounds 1 and 2 in Figure 5.25. However, KAN with GRinSCut at round 3 gives a significant performance improvement that is very close to the best micro-averaged F1 performance at round 9.

Tables 5.8 and 5.9 show the best F1 performance and its classifier (i.e., a learning algorithm and thresholding strategy) at each round in the micro- and macro-averaged F1, respectively, on the Reuters-21578 data set.

Table 5.8 The best micro-averaged F1 and its classifier in each round on the Reuters-21578 corpus.

round    best micro-avg. F1    classifier
1             0.500            Rocchio-SCut
2             0.669            Rocchio-GRinSCut
3             0.726            KAN-GRinSCut
4             0.756            KAN-GRinSCut
5             0.775            KAN-GRinSCut
6             0.784            k-NN-LRinSCut
7             0.794            k-NN-GRinSCut
8             0.793            KAN-LRinSCut
9             0.788            KAN-LRinSCut
10            0.790            k-NN-GRinSCut

In Table 5.8, we can see that our RinSCut thresholding strategies (GRinSCut and LRinSCut) work well in all rounds except round 1.
Note that GRinSCut gives the best micro-averaged performance more frequently than LRinSCut. This matches our aim in developing GRinSCut: it was designed to improve the micro-averaged performance. Also, KAN appears 5 times in this table, at rounds 3, 4, 5, 8, and 9, as the best learning algorithm with the RinSCut variants. This suggests that it offers advantages for smaller training sets.

Table 5.9 The best macro-averaged F1 and its classifier in each round on the Reuters-21578 corpus.

round    best macro-avg. F1    classifier
1             0.230            k-NN-LRinSCut
2             0.335            Rocchio-GRinSCut
3             0.444            KAN-LRinSCut
4             0.537            KAN-GRinSCut
5             0.605            KAN-GRinSCut
6             0.634            KAN-LRinSCut
7             0.645            KAN-LRinSCut
8             0.641            KAN-LRinSCut
9             0.629            KAN-LRinSCut
10            0.629            KAN-LRinSCut

The advantage of KAN and the RinSCut variants is much more obvious in the macro-averaged F1 performance in Table 5.9. In this table, KAN achieves the best F1 performance in 8 rounds, and the RinSCut thresholding strategies outperform SCut across all the rounds. In addition, LRinSCut, designed to improve macro-averaged performance, appears 7 times as the best thresholding strategy.

In this section, we have described our extensive experiments on the Reuters-21578 and 20-Newsgroups data sets to assess the effects of our methods for text categorization. The empirical results on the Reuters-21578 show that KAN outperforms Rocchio and WH, and achieves slightly better results than k-NN. The improved results for our thresholding strategies are stronger: the two variants of RinSCut (GRinSCut and LRinSCut) show considerable performance improvements in both the micro- and macro-averaged F1 for all learning algorithms except WH. Although KAN with the RinSCut variants does not give better performance than the counterpart techniques at every round, it seems to slightly outperform them overall.
On the basis of the experimental results, it seems that the best choice among these techniques for the Reuters-21578 data set (i.e., for the multi-class categorization task) would be KAN with the RinSCut variants.

Chapter 6 Learning with Selective Sampling

Supervised learning approaches to text categorization require a large number of annotated (or labeled) documents for training to achieve a high level of performance. The problem in real contexts, however, is that gathering such a large number of accurately annotated training documents is difficult, since it is very time-consuming and error-prone [HW90, ADW94, VM94]. An emerging research area in text categorization is active learning [CAL94], where the system actively participates in the collection of training documents rather than relying on random sampling. There are usually two types of active learning: (1) generating artificial new training documents and (2) selecting the most informative documents from a pool of unlabeled ones. In this chapter, we investigate the latter type of active learning (i.e., selective sampling), since unlabeled documents for training are generally plentiful.

As described in the introduction in Chapter 1, the primary goal of selective sampling is to reduce the number of labeled training examples needed to achieve a particular performance level. The selective sampling process is typically performed by selecting and using the most informative examples in the available set of unlabeled raw documents. Our method of determining informative examples is based on the uncertainty values of unlabeled examples, so-called uncertainty sampling [LG94, LC94]: the document with the largest uncertainty value is considered the most informative document for training.

In this chapter, we discuss some issues that, we believe, have a significant impact on the quality of the selective sampling process.
6.1 Goal and Issues

Our main goal in this thesis is to develop a machine learning approach to text categorization that can achieve a high performance level with fewer annotated training examples. In the previous chapters, we described the development of our own methods (KAN and the RinSCut variants) toward this goal and verified that they work well even with a small number of labeled training documents. Regardless of the specific learning algorithm (and/or thresholding strategy) applied to text categorization, one promising approach towards our goal is to have some control over the sampling process for training examples. Since each document is generally different from the others, some documents may be very helpful (or informative) for learning accurate classifiers and others not so. Searching for and using such helpful documents to train the classifiers is the main purpose of the selective sampling process. The expected desirable effects of selective sampling are as follows:

1. We can build accurate classifiers quickly by using a relatively small number of training documents.

2. We can save human experts from labeling many uninformative documents that are not helpful for training the classifiers.

The typical method (also adopted in this thesis) for finding informative examples is based on uncertainty values, which are computed by comparing the similarity values of unlabeled documents with the threshold value of a given category. The main issues we discuss in this chapter are:

1. Using a homogeneous or heterogeneous type of knowledge base (i.e., classifier)

2. Directly using the most positive-certain documents for training, without human labeling

6.1.1 Homogeneous versus Heterogeneous Approach

When computing the uncertainty value of a given unlabeled document, we need a classifier that has already been built from the available training examples.
This classifier can be of the same type as the one used for the categorization of new documents, or it can be a totally different type of classifier that is used only for the selective sampling process. The homogeneous approach (i.e., using the same type of classifier for both tasks) seems preferable for computing the uncertainty values of unlabeled documents, since it does not require the additional cost of building a different type of classifier. However, the heterogeneous approach has been used for selective sampling in [LC94] and, as mentioned in that work, the main reason for using it is that the existing type of classifier used for document categorization may be unsuitable for the computation of uncertainty values, or too computationally expensive to build and use with a large number of training documents. As a result, the choice of approach seems to depend on the computational complexity of the existing classifiers.

In this chapter, we focus on our techniques (KAN and the RinSCut variants) and their categorization performance improvements through selective sampling. As discussed in Section 3.3, the computational complexity of KAN is O(n²), where n is the number of features in the vector space. This could be quite problematic when applying KAN to selective sampling. So, to be a practical classifier, KAN must show reasonable performance with a reduced feature set of manageable size. Fortunately, as shown in Chapter 5, KAN, like the other similarity-based learning algorithms, achieved its highest categorization performance with a reduced feature set that had far fewer features than the unreduced full feature set. So, our adopted approach to selective sampling is the homogeneous approach, which directly uses the existing classifiers built by KAN and does not require the additional cost of building new classifiers.
6.1.2 Using Positive-Certain Examples for Training

As discussed earlier, selective sampling finds, for a given category, the most uncertain documents, those whose category membership is most ambiguous. It then presents some of the most uncertain documents to human experts, asking for their correct category labels. One possibility arising from the selective sampling process is the use of a set of positive-certain documents for a given category to achieve performance improvements. Such positive-certain documents, which are also used as negative examples for the other categories, could be less informative than the uncertain ones used in previous selective sampling methods. Even so, it is plausible that using positive-certain documents will have positive effects on categorization performance. If the text categorization system has a scheme for automatically finding the uncertain documents, it can also locate the most positive-certain documents, those that must be categorized under a given category. This automatic scheme for locating positive-certain documents may be quite advantageous if using them for training leads to performance improvements, since it does not require any work from human experts. Goldman and Zhou's work [GZ00] can also label unlabeled data, but it uses two different classifiers. A few positive-certain examples could be in error, and such automatically and wrongly classified documents will affect the categorization performance. This problem is one of the main reasons why we choose the homogeneous approach, in which the system uses the same type of relatively accurate learning algorithm (like KAN) both for the selective sampling of informative documents (uncertain and/or positive-certain documents) and for the categorization of new documents.

Figure 6.1 depicts the flow of unlabeled documents in each iteration of our selective sampling approach. The sampler defines two types of documents: uncertain and positive-certain documents.
Note that previous selective sampling approaches used only the uncertain documents, which require manual labeling. In addition to these uncertain examples, our selective sampling method uses, for training, positive-certain documents that are automatically labeled by the system.

[Figure: the sampler routes unlabeled documents either to a human expert (the most uncertain documents, manually labeled) or directly into the training set (the most certain documents, automatically labeled).]

Figure 6.1 Flow of unlabeled documents in our selective sampling.

6.2 Overall Approach

The selective sampling approach we are interested in is referred to as uncertainty sampling, since it is based on the uncertainty values of unlabeled documents. Uncertainty sampling was first introduced and discussed in [LG94]. In that work, only the uncertain documents, which must be labeled by human experts, were used. Our uncertainty sampling approach differs from this original method in that it uses both uncertain and positive-certain documents. So, we need a new scheme that distinguishes between the uncertain and certain documents in the available set of unlabeled documents.

In the following subsections, we explain the way uncertainty values are computed and our new scheme, in which we can define the positive-certain documents as well as the uncertain documents.

6.2.1 Computing Uncertainty Values

To compute the uncertainty values of unlabeled documents for uncertainty sampling, the text categorization system needs the classifiers that are usually built from the existing training documents. As discussed earlier, these classifiers in our system are the same as the classifiers used for the categorization of new documents. Based on homogeneous uncertainty sampling, the system computes the similarity score simi(uj) for the ith category ci and the jth unlabeled document uj.
Previous uncertainty selective sampling approaches computed the uncertainty value UCT_i(u_j) from the certainty value CT_i(u_j) as follows:

    UCT_i(u_j) = −CT_i(u_j)    (Formula 6.1)

where CT_i(u_j) is defined as:

    CT_i(u_j) = |s_i(u_j) − ts(c_i)|    (Formula 6.2)

and ts(c_i) is typically the optimal threshold from the SCut thresholding strategy. These two formulas show that uncertainty is defined as the opposite of certainty: the largest possible uncertainty value in the set of unlabeled documents is obtained by the document whose similarity score is closest to the threshold ts(c_i). This largest uncertainty value is 0, reached when s_i(u_j) = ts(c_i). In each iteration, previous uncertainty sampling approaches select only a number of uncertain documents closest to the threshold, and the human experts must annotate them with their correct category labels.

Note that the above formulas do not distinguish a document with a similarity score below ts(c_i) from one with a score above it. For example, when ts(c_i) = 20, s_i(u_i) = 15, and s_i(u_j) = 25, the two unlabeled documents u_i and u_j have the same certainty and uncertainty values, 5, in the category c_i. Our uncertainty selective sampling method must differentiate them in order to define the positive-certain and negative-certain documents. In the following section, we describe how to do this with our own thresholding strategy, RinSCut.

6.2.2 Defining Certain and Uncertain Documents with RinSCut

The key difference between our selective sampling and previous conventional approaches is that our system distinguishes the uncertain, positive-certain, and negative-certain examples, and uses them for training. In this research, we do not use negative-certain documents, because positive documents for one category already serve as negative examples for the other categories.
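Formulas 6.1 and 6.2 can be sketched directly in code. The following is an illustrative sketch, not the thesis implementation; the function names `certainty` and `uncertainty` are ours:

```python
def certainty(score: float, threshold: float) -> float:
    """CT_i(u_j) = |s_i(u_j) - ts(c_i)|  (Formula 6.2)."""
    return abs(score - threshold)

def uncertainty(score: float, threshold: float) -> float:
    """UCT_i(u_j) = -CT_i(u_j)  (Formula 6.1).

    The document closest to the threshold gets the largest
    (least negative) uncertainty value, with a maximum of 0
    when the score equals the threshold exactly.
    """
    return -certainty(score, threshold)
```

With ts(c_i) = 20, documents scoring 15 and 25 both get uncertainty −5, which illustrates the limitation discussed above: the formulas cannot tell a near-positive document from a near-negative one.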
The training set already contained many negative examples, so we did not explore the value of using negative-certain documents. Our method is especially significant for the positive-certain documents, because the text categorization system uses them without correct category labels from human experts. For this new selective sampling scheme, we use the RinSCut thresholding strategy, introduced and explained in Chapter 4. As explained in Section 4.3.1, RinSCut defines an ambiguous zone for a given category c_i using ts_top(c_i) and ts_bottom(c_i). In our approach, this ambiguous zone is used to differentiate the uncertain, positive-certain, and negative-certain examples in the set of unlabeled documents. Figure 6.2 defines the three ranges of similarity scores of unlabeled documents as follows:

• ts_bottom(c_i) ≤ s_i(uncertain documents) < ts_top(c_i)
• s_i(positive-certain documents) ≥ ts_top(c_i)
• s_i(negative-certain documents) < ts_bottom(c_i)

[Figure 6.2: Definition of certain and uncertain examples using ts_top(c_i) and ts_bottom(c_i) for a given category c_i. Unlabeled examples are sorted by similarity score in descending order: positive-certain documents lie above ts_top(c_i), uncertain documents fall between ts_top(c_i) and ts_bottom(c_i), a zone that contains the SCut threshold ts(c_i), and negative-certain documents lie below ts_bottom(c_i).]

In this figure, the uncertain documents for the category c_i are those whose similarity scores fall in the ambiguous zone, the positive-certain documents are those with similarity scores above ts_top(c_i), and the negative-certain documents are those with similarity scores below ts_bottom(c_i).
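The three-way split above can be sketched as a small routine. This is a hypothetical sketch under the assumption that documents arrive as (id, score) pairs; the name `partition` and the data shapes are ours, not the thesis code:

```python
def partition(scored_docs, ts_top, ts_bottom):
    """Split (doc_id, similarity_score) pairs for a category c_i into
    positive-certain (score >= ts_top), uncertain (in the ambiguous
    zone [ts_bottom, ts_top)), and negative-certain (score < ts_bottom)."""
    positive, uncertain, negative = [], [], []
    for doc_id, score in scored_docs:
        if score >= ts_top:
            positive.append(doc_id)    # auto-labeled positive for c_i
        elif score >= ts_bottom:
            uncertain.append(doc_id)   # sent to a human expert
        else:
            negative.append(doc_id)    # negative-certain (unused here)
    return positive, uncertain, negative
```

For example, with ts_top = 0.8 and ts_bottom = 0.3, a document scoring 0.9 is positive-certain, 0.5 is uncertain, and 0.1 is negative-certain.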
In our uncertainty selective sampling methods, the system selects, in each iteration, a number of uncertain documents with the largest uncertainty values (i.e., with similarity scores closest to the threshold) that also lie within the ambiguous zone, and presents them to human experts for category labels. In addition, our selective sampling method automatically selects the positive-certain documents with similarity scores above ts_top(c_i) and uses them directly for training, without asking a human expert for their category information.

Chapter 7

Evaluation II: Uncertainty Selective Sampling

In Chapter 5, we conducted comparative experiments on our new techniques (KAN and the RinSCut variants) for text categorization. The experimental results showed that our techniques work slightly better than other widely used methods. However, like those other methods, they still require a large number of training examples to achieve a high level of performance. The sampling method used in those experiments for selecting the training examples was random sampling. In Chapter 6, we described another type of sampling method, uncertainty selective sampling: a mechanism that finds more informative examples among unlabeled documents and uses them for training. The expected result is that categorization performance, especially with a small number of training examples, can be significantly improved by replacing randomly selected examples with the same number of informative ones. In this chapter, we conduct experiments on this uncertainty selective sampling method to explore these potential performance improvements for text categorization.

7.1 Data Sets Used and Text Processing

The same two data sets used in Chapter 5, Reuters-21578 and 20-Newsgroups, are also used in this evaluation.
For the training and test sets in each corpus, we use the same splitting methods described in Section 5.1.1 for the Reuters-21578 and Section 5.1.2 for the 20-Newsgroups. To convert raw documents into their representations, we also use the same text preprocessing methods described in Section 5.2. For feature space reduction, we use the same feature selection algorithms as in Chapter 5 and select 50 unique features from each category, as evaluated and explained in Section 5.3.

7.2 Classifiers Implemented and Evaluated

The classifiers (learning algorithm + thresholding strategy) implemented and evaluated for the categorization of new documents in the experiments are KAN+GRinSCut, KAN+LRinSCut, and KAN+RCut. For the Reuters-21578 data set, the following classifiers were implemented and evaluated with selective sampling:

(1) KAN+GRinSCut: KAN learning algorithm with the GRinSCut thresholding strategy
(2) KAN+LRinSCut: KAN learning algorithm with the LRinSCut thresholding strategy

For the 20-Newsgroups data set, the following classifier was evaluated:

(1) KAN+RCut: KAN learning algorithm with the RCut thresholding strategy

These classifiers are built from the available training examples of each corpus and evaluated against the test set. The goal of the experiments in this chapter is to see whether their categorization performance improves when the uncertainty selective sampling methods described in the following section are applied, compared with random sampling.

7.3 Sampling Methods Compared

In this chapter, we compare the following sampling methods:

(1) Random sampling (RS): The documents in the training set are randomly selected from the set of unlabeled examples and then manually labeled by human experts.
(2) Selective sampling of uncertain examples (SS-U): The most uncertain documents are selected for the training set based on their uncertainty scores and then manually labeled by human experts.
(3) Selective sampling of uncertain and certain examples (SS-U&C): In addition to the most uncertain documents, a set of positive-certain documents automatically labeled by the system is added to the training set.

The classifiers described in Section 7.2 were built on the training examples selected by one of the above sampling methods. As discussed in Chapter 6, our uncertainty selective sampling methods (SS-U and SS-U&C) follow the homogeneous approach, which uses the same type of classifier for sampling as for the categorization of new (test) documents. Thus, each classifier was used to categorize the test documents, used again for selective sampling, and then rebuilt from the training examples selected by the sampling method. For the number of positive-certain examples automatically labeled by the system and used for training in SS-U&C, we chose 500 and 250 examples for the Reuters-21578 (6,984 training examples in total) and 1,000 and 500 examples for the 20-Newsgroups (about 16,000 training examples in total). To obtain generalized and reliable results for the evaluation of the random sampling method, we ran its experiments three times on each data split for both data sets.

7.4 Results and Analysis

The results in this section use the following experimental methodology. To build an initial classifier, we randomly select the same number, n, of positive training examples for each category (n = 2 for the Reuters-21578, 106 examples in total, and n = 4 for the 20-Newsgroups, 80 examples in total). In each iteration, the same number of examples (i.e., 106 for the Reuters-21578 and 80 for the 20-Newsgroups) is selected using the adopted sampling method and added to the training set.
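The categorize/sample/rebuild cycle described above can be sketched as a generic loop. This is a minimal sketch, not the thesis code: `build_classifier`, `sampler`, and `evaluate` are hypothetical stand-ins for the system's actual components (the sampler would implement RS, SS-U, or SS-U&C):

```python
def run_sampling_experiment(build_classifier, sampler, evaluate,
                            unlabeled, iterations, batch_size):
    """Homogeneous selective-sampling loop: the same classifier is used
    to sample from the unlabeled pool and is then rebuilt from the
    enlarged training set and evaluated after each iteration."""
    training = []                        # (document, label) pairs
    clf = build_classifier(training)     # initial classifier
    history = []
    pool = list(unlabeled)
    for _ in range(iterations):
        batch = sampler(clf, pool, batch_size)   # pick informative documents
        training.extend(batch)
        selected = {doc for doc, _ in batch}
        pool = [d for d in pool if d not in selected]
        clf = build_classifier(training)         # rebuilt, same classifier type
        history.append(evaluate(clf))
    return history
```

The key design point, matching the homogeneous approach, is that one `build_classifier` serves both sampling and categorization, so no second classifier type has to be trained for sampling alone.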
For SS-U&C sampling, an additional k positive-certain examples (500 or 250 for the Reuters-21578, and 1,000 or 500 for the 20-Newsgroups), automatically labeled by the classifier built in the previous iteration, are added to the training set. These k positive-certain examples are distributed almost evenly across categories. The classifier incrementally rebuilt from the training set is then evaluated against the test set.

Figures 7.1 and 7.2 show the micro and macro-averaged F1 of the KAN+RCut classifier, evaluated with the three sampling methods on the 20-Newsgroups corpus. In these charts, the curves of the uncertainty selective sampling methods stop once they reach the target performance of random sampling. Note also that RS sampling on this data set used "truly random sampling", so its performance in these experiments is similar to Figures 5.17 and 5.18, not Figures 5.15 and 5.16 in Section 5.4.2. The micro and macro-averaged F1 measures of each sampling method are almost identical, due to the characteristics of this data set explained in Section 5.3.2. The advantage of SS-U sampling becomes obvious after 320 training examples; initially, with 240 examples, its performance is slightly worse than random sampling, RS. This low initial performance appears to be due to the inaccurate initial classifier learned from the small number of training examples. The advantage of the SS-U&C variants (SS-U&C[1000] and SS-U&C[500]) is clear even at the initial point, but they fail to give better results after 400 training examples. The reason is that documents incorrectly classified as positive-certain degrade the categorization performance. Also, SS-U&C[1000] performs slightly better than SS-U&C[500].
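The near-even spread of the k automatically labeled positive-certain examples across categories can be realized with a simple round-robin pick. This is an illustrative sketch under our own assumptions (candidates pre-grouped per category, most certain first); the name `spread_evenly` is ours:

```python
def spread_evenly(per_category, k):
    """Pick up to k (doc_id, category) pairs total, cycling through the
    categories round-robin so the chosen positive-certain examples are
    distributed almost evenly across categories.

    per_category: dict mapping category -> ordered list of candidate
    doc ids (most certain first)."""
    pools = {c: list(docs) for c, docs in per_category.items()}
    picked = []
    while len(picked) < k and any(pools.values()):
        for category, docs in pools.items():
            if docs and len(picked) < k:
                picked.append((docs.pop(0), category))
    return picked
```

When one category runs out of positive-certain candidates, the remaining picks are shared among the categories that still have some, which keeps the distribution as even as the candidate pools allow.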
This result suggests that positive-certain examples may be less informative than uncertain ones; consequently, a fairly large number of positive-certain examples is needed to gain any advantage from them in selective sampling.

[Figure 7.1: Micro-averaged F1 of KAN+RCut on the 20-Newsgroups, plotted against the number of manually labeled training examples (160 to 1,280). RS: random sampling; SS-U: selective sampling of uncertain examples; SS-U&C[1000] and SS-U&C[500]: selective sampling of uncertain and certain examples with 1,000 and 500 certain examples, respectively.]

[Figure 7.2: Macro-averaged F1 of KAN+RCut on the 20-Newsgroups, with the same sampling methods and axes as Figure 7.1.]

From these results, we can see that the uncertainty selective sampling methods require far fewer labeled training examples than RS sampling. For example, to achieve the 0.573 micro-averaged F1 that RS sampling reaches at 1,280 examples in Figure 7.1, SS-U requires 480 training examples, while SS-U&C[1000] and SS-U&C[500] need 800 and 880 examples, respectively. This represents a 62.5% saving in required examples for SS-U, and 37.5% and 31.2% savings for SS-U&C[1000] and SS-U&C[500], over random sampling.

Figures 7.3 through 7.6 show the experimental results on the Reuters-21578 corpus. The learning traces of KAN+GRinSCut with the four sampling methods are presented in Figures 7.3 and 7.4.
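The quoted savings are simple relative reductions in the number of manually labeled examples. A one-line helper (ours, for illustration) reproduces them:

```python
def labeling_saving(rs_examples: int, ss_examples: int) -> float:
    """Percentage of manually labeled examples saved by a selective
    sampling method, relative to the random sampling baseline."""
    return 100.0 * (rs_examples - ss_examples) / rs_examples

# Figures quoted above for Figure 7.1 (target micro-averaged F1 = 0.573):
# SS-U reaches it with 480 of RS's 1,280 examples, SS-U&C[1000] with 800,
# and SS-U&C[500] with 880 — savings of 62.5%, 37.5%, and 31.25%
# (reported as 31.2% in the text).
```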
In Figure 7.3, none of the uncertainty selective sampling methods shows a desirable effect on micro-averaged F1 over random sampling. These results are mainly due to the fact that the documents in the Reuters-21578 corpus are very unevenly distributed across categories. As Tables 5.1 through 5.3 in Chapter 5 show, more than 50% of examples belong to two categories, "earn" and "acq", in both training and test sets. In this situation, the micro-averaged measure depends mainly on the performance on these two categories. The randomly selected documents in each trial's training set are likely to come mainly from the "earn" and "acq" categories, so the classifiers built from this unevenly distributed training set work well on the test examples of the two frequent categories. By contrast, the documents selected by the selective sampling methods are evenly distributed across categories. Note also in Figure 7.3 that the SS-U&C variants perform much better than SS-U sampling. This superior micro-averaged result of SS-U&C is probably due to the large number of positive-certain examples added to the training set, since these examples alleviate the sparsity of training examples for the frequent categories like "earn" and "acq".

[Figure 7.3: Micro-averaged F1 of KAN+GRinSCut on the Reuters-21578, plotted against the number of manually labeled training examples (106 to 1,378). RS: random sampling; SS-U: selective sampling of uncertain examples; SS-U&C[500] and SS-U&C[250]: selective sampling of uncertain and certain examples with 500 and 250 certain examples, respectively.]

[Figure 7.4: Macro-averaged F1 of KAN+GRinSCut on the Reuters-21578, with the same sampling methods and axes as Figure 7.3.]

[Figure 7.5: Micro-averaged F1 of KAN+LRinSCut on the Reuters-21578, with the same sampling methods and axes as Figure 7.3.]

[Figure 7.6: Macro-averaged F1 of KAN+LRinSCut on the Reuters-21578, with the same sampling methods and axes as Figure 7.3.]

Figure 7.4 shows the macro-averaged F1 of KAN+GRinSCut. Unlike the micro-averaged performance, the advantage of SS-U sampling is apparent on the macro-averaged measure. It achieves 0.50 macro-averaged F1 at 848 examples, while random sampling needs 1,378 examples to reach this level of performance, a 38.4% saving in the number of documents required. The advantages of the SS-U&C variants are not clear with small numbers of labeled examples.
After 954 examples, however, both SS-U&C[500] and SS-U&C[250] work well, achieving 0.502 and 0.500 macro-averaged F1 at 954 and 1,166 examples, respectively. These results represent a 30.7% saving for SS-U&C[500] and a 15.3% saving for SS-U&C[250]. As in the 20-Newsgroups corpus, using the larger number of positive-certain examples (here, 500) achieves better performance than using the smaller number (250). Figures 7.5 and 7.6 depict the micro and macro-averaged F1 of another classifier, KAN+LRinSCut, which uses the locally optimized RinSCut, with the sampling methods on the Reuters-21578 data set. In Figure 7.5, SS-U shows a somewhat unstable learning curve in micro-averaged performance and, like KAN+GRinSCut in Figure 7.3, fails to reach with fewer examples the 0.765 micro-averaged F1 that RS achieves at 1,378 examples. The SS-U&C variants work better than the other sampling methods with small numbers of labeled examples, but after 530 examples for SS-U&C[500] and 374 examples for SS-U&C[250], their performance falls below RS. For the macro-averaged performance of KAN+LRinSCut in Figure 7.6, all the uncertainty sampling methods consistently outperform RS. To achieve 0.500 macro-averaged F1, RS needs 1,378 examples while SS-U needs 848 training examples, a 38.4% saving in the required number of examples. SS-U&C[500] and SS-U&C[250] need 954 and 1,166 examples, representing 30.7% and 15.3% savings, respectively. From these results on the Reuters-21578, we can see that with an uneven distribution of test documents, the optimal choice of sampling method becomes difficult and will depend on the user's needs. For example, if micro-averaged performance is the primary concern on the Reuters-21578, the optimal choice is RS sampling; otherwise, it is SS-U selective sampling.
If both micro and macro-averaged measures matter, the optimal choice seems to be the SS-U&C sampling method. The experimental results on both data sets also show that using more positive-certain examples (i.e., 1,000 for the 20-Newsgroups and 500 for the Reuters-21578) works slightly better than using fewer (500 and 250 examples, respectively). However, we cannot conclude that more positive-certain examples are always better, since the performance difference is small and a larger number of training examples makes the overall text categorization process slower.

Chapter 8

Conclusions

Our goal in this research was to investigate and develop supervised and semi-supervised machine learning approaches to text categorization, including (1) an algorithm that exploits word co-occurrence information and discriminating power values, (2) new approaches to thresholding, and (3) semi-supervised learning approaches to selective sampling, all aimed at the important goal of reducing the number of labeled training examples needed to achieve a given level of performance. The type of text categorization we investigated was similarity-based, where the classifier predicts the category labels of new documents based on their similarity scores. To achieve our goal, we built text categorization systems that apply our new classifiers (KAN and the RinSCut variants) and uncertainty selective sampling methods. KAN is a new learning algorithm designed to give a term an appropriate weight according to its semantic meaning and importance; to do so, KAN uses a feature's co-occurrence information and discriminating power value in a given category. Another important research area in similarity-based text categorization concerns the thresholding strategy. It is indispensable for classifiers and has a significant impact on categorization performance.
We investigated existing common thresholding techniques and developed the RinSCut variants, designed to combine the strengths of the existing thresholding strategies. Finally, we explored uncertainty selective sampling methods. Rather than relying on random sampling, our selective sampling methods actively choose candidate training examples based on the estimated uncertainty value of each unlabeled example. To avoid the additional cost of building a different type of classifier for selective sampling, we adopted a homogeneous approach that uses the same type of classifier as that used for the categorization of new documents. As well as exploring conventional selective sampling methods that use only the most uncertain examples for training (referred to in this thesis as SS-U), we developed another type of selective sampling method (SS-U&C) that also picks and uses a set of positive-certain examples alongside the uncertain ones. The main advantage of SS-U&C sampling is that the recommended positive-certain documents do not require human labeling, since the system judges them to be positive. Extensive text categorization experiments were conducted on two standard test collections: the 20-Newsgroups and the Reuters-21578. The 20-Newsgroups corpus is suitable for evaluating the single-class (non-overlapping categories) categorization task, while the Reuters-21578 suits the multi-class task. Both collections are real-world data sets, contain a large number of pre-categorized documents, have many predefined categories, and are regarded as standard collections for testing text categorization systems. The key conclusions drawn from the experimental results are as follows:

• For all the similarity-based learning algorithms implemented and tested in this research, we found in Section 5.3 that using a large number of features failed to give a significant performance improvement on either data set.
As a result, aggressive feature space reduction was possible, giving both faster processing and better performance.

• We compared KAN against other typical similarity-based learning algorithms in Section 5.4, using the existing conventional thresholding strategies (i.e., RCut for the 20-Newsgroups and SCut for the Reuters-21578) and varying the number of training examples. The best and KAN's performance in each round on the Reuters-21578 corpus are summarized below. We observed that KAN achieved the best performance in most rounds on the Reuters-21578 and, even when it did not, its performance was close to the best of the others.

round | best F1 (micro / macro) | KAN's F1 (micro / macro)
  1   |     0.500 / 0.227       |     0.455 / 0.188
  2   |     0.624 / 0.298       |     0.624 / 0.264
  3   |     0.627 / 0.399       |     0.627 / 0.399
  4   |     0.719 / 0.469       |     0.696 / 0.469
  5   |     0.742 / 0.572       |     0.733 / 0.572
  6   |     0.743 / 0.602       |     0.743 / 0.602
  7   |     0.756 / 0.597       |     0.756 / 0.597
  8   |     0.755 / 0.605       |     0.752 / 0.605
  9   |     0.757 / 0.596       |     0.757 / 0.596
 10   |     0.750 / 0.584       |     0.750 / 0.584

For the 20-Newsgroups, all the learning algorithms showed similar performance when the training examples were evenly distributed across categories. With an uneven distribution of training examples, caused by "truly random sampling", the differences between the learning algorithms were greater. The best and KAN's performance in each round on the 20-Newsgroups corpus are shown in the table below.

round | best F1 (micro / macro) | KAN's F1 (micro / macro)
  1   |     0.396 / 0.383       |     0.396 / 0.381
  2   |     0.497 / 0.482       |     0.497 / 0.482
  3   |     0.524 / 0.506       |     0.524 / 0.506
  4   |     0.536 / 0.513       |     0.536 / 0.512
  5   |     0.570 / 0.553       |     0.563 / 0.540
  6   |     0.597 / 0.583       |     0.590 / 0.570
  7   |     0.623 / 0.607       |     0.618 / 0.597
  8   |     0.662 / 0.649       |     0.662 / 0.643
  9   |     0.666 / 0.660       |     0.666 / 0.652
 10   |     0.736 / 0.735       |     0.712 / 0.701

KAN and Rocchio achieved similar results in this situation and outperformed k-NN and WH in most rounds.
• To assess the effects of the RinSCut thresholding strategy, experiments were performed on the Reuters-21578 only (i.e., for multi-class text categorization). The F1 performance of the tested thresholding strategies with all the similarity-based algorithms at round 10 (i.e., using all the training examples) is shown below; the values marked * are the overall top results in micro and macro-averaged performance.

algorithm | SCut (micro / macro) | GRinSCut (micro / macro) | LRinSCut (micro / macro)
Rocchio   |    0.634 / 0.412     |     0.736 / 0.549        |     0.752 / 0.573
WH        |    0.681 / 0.570     |     0.732 / 0.517        |     0.438 / 0.442
k-NN      |    0.692 / 0.570     |     0.790* / 0.552       |     0.786 / 0.578
KAN       |    0.750 / 0.584     |     0.780 / 0.615        |     0.786 / 0.629*

The table below shows the percentage improvements of the best RinSCut variant over SCut:

algorithm | SCut (micro / macro) | best RinSCut variant (micro / macro)
Rocchio   |    0.634 / 0.412     |    0.752 (18.6%) / 0.573 (39.1%)
WH        |    0.681 / 0.570     |    0.732 (7.5%) / 0.517 (-9.3%)
k-NN      |    0.692 / 0.570     |    0.790 (14.2%) / 0.578 (1.4%)
KAN       |    0.750 / 0.584     |    0.786 (4.8%) / 0.629 (7.7%)

Our RinSCut variants (GRinSCut and LRinSCut) gave considerable performance improvements for all the learning algorithms except WH; the improvement was especially large for Rocchio. Even though Rocchio with the RinSCut variants did not give the best results across all the rounds, its performance was close to the best results the other classifiers achieved. This result showed that the thresholding strategy, a largely unexplored research area, is important for similarity-based text categorization.

• Based on the experiments on the Reuters-21578, the best combination among the compared methods in this research appears to be KAN with the RinSCut variants for multi-class categorization.
We found that KAN with LRinSCut performed best on the macro-averaged measure, while KAN with GRinSCut achieved the second-best performance, very close to the best macro-averaged performance of k-NN.

• We compared the uncertainty selective sampling methods (SS-U, SS-U&C[1000], and SS-U&C[500]) and random sampling (RS) with KAN+RCut on the 20-Newsgroups corpus in Section 7.4. As the tables below show, all the selective sampling methods require fewer labeled training examples to achieve a given level of performance.

Target micro-averaged F1 = 0.573:
sampling method | labeled examples |  F1   | savings
RS              |      1,280       | 0.573 |   0%
SS-U            |        480       | 0.581 | 62.5%
SS-U&C[1000]    |        800       | 0.576 | 37.5%
SS-U&C[500]     |        880       | 0.580 | 31.2%

Target macro-averaged F1 = 0.549:
sampling method | labeled examples |  F1   | savings
RS              |      1,280       | 0.549 |   0%
SS-U            |        400       | 0.550 | 68.7%
SS-U&C[1000]    |        560       | 0.552 | 56.2%
SS-U&C[500]     |        720       | 0.560 | 43.7%

With more than 320 training examples, SS-U gave better results than the SS-U&C variants. The likely reason is that some of the positive-certain documents in the SS-U&C variants were incorrectly categorized; these incorrect examples appeared to lower the performance of SS-U&C compared with SS-U sampling.

• The comparative experiments on sampling methods with KAN+GRinSCut and KAN+LRinSCut were conducted on the Reuters-21578 data set. For the micro-averaged performance, none of the selective sampling methods showed a clear advantage. This was mainly caused by the uneven distribution of test examples in this data set. However, the SS-U&C variants performed much better than SS-U (see Figures 7.3 and 7.5 in Chapter 7), with micro-averaged performance very close to RS. By contrast, for the macro-averaged measure there was a clear advantage over RS sampling, as shown in the tables below.
Target macro-averaged F1 = 0.497 with KAN+GRinSCut:
sampling method | labeled examples |  F1   | savings
RS              |      1,378       | 0.497 |   0%
SS-U            |        848       | 0.502 | 38.4%
SS-U&C[500]     |        954       | 0.507 | 30.7%
SS-U&C[250]     |      1,166       | 0.500 | 15.3%

Target macro-averaged F1 = 0.500 with KAN+LRinSCut:
sampling method | labeled examples |  F1   | savings
RS              |      1,378       | 0.500 |   0%
SS-U            |        848       | 0.528 | 38.4%
SS-U&C[500]     |        954       | 0.538 | 30.7%
SS-U&C[250]     |      1,166       | 0.533 | 15.3%

The savings of each selective sampling method in the number of labeled examples required to reach the target performance are the same in the two tables. The conclusions drawn from these results are: (1) if micro-averaged performance is the primary concern, RS sampling should be used; (2) otherwise, SS-U sampling could be the optimal choice; and (3) if both averaged measures matter, SS-U&C may be the best choice, since it was never worst on either the micro or the macro-averaged measure.

8.1 Contributions

This research made the following contributions to the area of text categorization.

• Clarification of the high-dimensionality problem, to which most sophisticated machine learning algorithms cannot scale. One of the major problems in text categorization is the high dimensionality of the feature space. From extensive experiments on this issue, we found that learning accurate category concepts does not require a large number of input features in either corpus (the 20-Newsgroups and the Reuters-21578), so the system should use aggressive feature reduction for faster processing and better performance.

• Improved experimental results from applying word co-occurrence information with discriminating power values. Previous work [Lew92a, Lew92b] showed that using phrases does not lead to improvements in text categorization performance, probably because of the sparsity of such phrases in a given data set.
We noted that word co-occurrence information could be effective for resolving some of the semantic and informative ambiguities each term can have. By applying this word co-occurrence information with the discriminating power values of features in KAN, we achieved better results than other, conventional similarity-based learning algorithms: k-Nearest Neighbor (k-NN), Widrow-Hoff (WH), and Rocchio.

• Combining the strengths of existing thresholding strategies into a new strategy (RinSCut) that works better in multi-class text categorization. Thresholding strategies in similarity-based text categorization are an underexplored research area that needs more attention. We developed new strategies, the RinSCut variants, by combining the strengths of two existing strategies, RCut and SCut. Experimental results for multi-class text categorization showed that our RinSCut variants gave a significant improvement with most similarity-based learning algorithms.

• Experimental results showing that our uncertainty selective sampling methods with our classifiers (KAN + RinSCut variants) significantly reduce the number of labeled training examples required, relative to random sampling, to achieve a given level of performance. Supervised learning approaches to text categorization usually need a large number of human-labeled examples to achieve a high level of performance, but manually labeling such a large number of examples is difficult and sometimes impractical. Both of our selective sampling methods used fewer manually labeled training examples than random sampling required.

• Evaluation of the effectiveness of our proposed methods using the standard evaluation measure, F1, in both micro and macro-averaged form. Our proposed methods were evaluated using the F1 measure, one of the standard evaluation measures in text categorization.
We also reported this measure in both micro- and macro-averaged performance, since developing a method for only one averaging scheme is sometimes considered trivial.

• Generality of our proposed methods. Our thresholding and selective sampling methods have been developed for and applied to text categorization, but they are quite general and applicable to other similarity-based text classification tasks.

8.2 Future Work

There are a variety of possible directions related to this research that can be explored in future work.

• Empirical studies on the optimal frequency of the feature selection process. One of the main stages that slow the learning process in a text categorization system is feature selection. In the experiments in this research, we performed feature selection whenever new training documents became available. It would be interesting to see whether there is an optimal training-set size at which feature selection can be halted (i.e., a size beyond which adding more examples and re-running feature selection no longer produces any significant difference in the lists of extracted features). If such an optimal size exists, we can significantly decrease the overall learning time.

• Further experiments to tune the parameters for KAN. The choice of the value for the λ parameter in KAN is likely to be affected by the characteristics of a given data set. This indicates that we need to tune the λ parameter for each data set. However, in this research, we established this value manually and intuitively. It should be possible to determine it automatically, and with such an automatically tuned optimal parameter value, KAN might show further improved results.

• Removing low relationship scores in KAN. In KAN, we used all the relationship scores computed among the features, on the assumption that low relationship scores have only a minor impact on KAN's predictions.
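One simple realization of such pruning is to drop every feature-pair score below a cut-off. The threshold value and the dictionary layout below are assumptions for illustration only, not KAN's internal representation:

```python
def prune_relationships(scores, threshold):
    """Drop feature-pair relationship scores below `threshold`.
    `scores` maps (term_a, term_b) -> relationship score."""
    return {pair: s for pair, s in scores.items() if s >= threshold}

scores = {("wheat", "grain"): 0.82,
          ("wheat", "said"): 0.03,
          ("oil", "barrel"): 0.57}
pruned = prune_relationships(scores, threshold=0.1)
# only the two strong relationships survive
```

Besides any effectiveness gain, pruning shrinks the network, which would also reduce prediction cost.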
Removing such low relationship scores may improve the effectiveness of KAN.

• More studies on ways to define the ambiguous zone in RinSCut. The range of the ambiguous zone in RinSCut was defined using upper and lower bounds that were computed automatically from the available training examples. The computation of these two bounds still overfits a small number of training examples. More attention should be given to finding a way to avoid this overfitting problem; this implies that, for a given data set, we need to perform experiments with a range of strategies and explore the issue systematically.

• More studies on the distribution of recommended training examples in the uncertainty selective sampling methods. We allocated nearly the same number of uncertain examples (and positive-certain examples for SS-U&C) to each category (i.e., we kept an even distribution of recommended examples across categories). When the test documents in a given data set show an uneven distribution, as in the Reuters-21578 corpus, varying the number of training examples recommended for each category in our selective sampling methods may increase the micro-averaged performance. It would therefore be fruitful to explore the effects of an uneven distribution of training examples and to examine what proportion of these training examples is actually correct for a given category.

• Applying our proposed methods to other data sets. More experiments are needed to evaluate our methods on other corpora, such as the Reuters Corpus Volume 1 [RCV1]. We also note that applying our methods to the categorization of other types of documents, such as web documents, is plausible and may give different outcomes for KAN, the RinSCut thresholding strategy, and our uncertainty selective sampling methods.

• Applying our methods to other text-based classification tasks. Our methods have been developed mainly for text categorization.
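The even per-category allocation of uncertain examples discussed above can be sketched roughly as follows. The scoring interface and the use of distance to a single decision threshold as the uncertainty measure are simplifying assumptions for illustration, not the thesis's exact RinSCut-based criterion:

```python
def select_uncertain_per_category(unlabeled, score_fn, thresholds, k):
    """For each category, pick the k unlabeled documents whose scores
    lie closest to that category's decision threshold, giving an even
    allocation of recommended examples across categories."""
    picks = {}
    for cat, thr in thresholds.items():
        ranked = sorted(unlabeled, key=lambda d: abs(score_fn(d, cat) - thr))
        picks[cat] = ranked[:k]
    return picks

# toy scores standing in for a similarity-based classifier
scores = {("d1", "wheat"): 0.90, ("d2", "wheat"): 0.52, ("d3", "wheat"): 0.10,
          ("d1", "oil"): 0.10, ("d2", "oil"): 0.45, ("d3", "oil"): 0.85}
picks = select_uncertain_per_category(
    ["d1", "d2", "d3"],
    lambda d, c: scores[(d, c)],
    {"wheat": 0.5, "oil": 0.5},
    k=1)
# d2 sits nearest the threshold for both categories
```

Varying `k` per category, rather than holding it fixed, is exactly the uneven-allocation experiment proposed above.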
We would like to apply our methods to other classification tasks, such as text clustering, which is the task of automatically grouping similar documents.

We have explored supervised and semi-supervised machine learning approaches to text categorization. For the important goal of reducing the number of labeled training examples needed to achieve a given level of performance, we have developed KAN, which exploits word co-occurrence information and discriminating power values; the RinSCut variants, which are new approaches to thresholding; and uncertainty selective sampling methods. We conducted extensive experiments on two standard test collections. In these, we carefully evaluated KAN and the RinSCut variants, demonstrating their effectiveness in terms of F1, the standard evaluation metric. We also explored novel approaches to selective sampling, which showed the desirable effect of decreasing the number of labeled training examples needed to achieve a given level of macro-averaged performance.

Appendices

Appendix A: Stop-list

A able about above abruptly absolutely according accordingly accurately across actively actual actually adequately after afterward afterwards again against ago ahead alas all alike almost alone along already also although altogether always am among amongst an and another any anybody anyhow anymore anyone anything anyway anywhere apparently approximately are around as aside ask asks asked asking at automatically away b badly barely basically be beautifully became become becomes becoming been being because before behind beneath beside besides between beyond bitterly both briefly but by c came can cannot carefully casually certain certainly chiefly clearly come comes comfortably coming commonly completely consciously consequently considerably consistently constantly continually continuously correctly could currently d deeply definitely deliberately depending desperately despite did directly do does doing done doubtless during e each eagerly earnestly
easily economically effectively either else elsewhere emotionally enough entirely equally especially essentially etcetera even evenly eventually ever every everybody everyday everyone everything everywhere evidently except exclusively extremely f fairly favorably few fewer finally firmly for forever formerly fortunately frankly from fully further furthered furthering furthermore furthers g gave generally gently get getting gets give given gives giving go goes going gone got gradually greatly h had happily hardly has hastily have having he heavily hence henceforth her here herself hey highly him himself his historically honestly how however i ideally if immediately in incidentally including increasingly indeed independently indirectly individually inevitably initially instantly instead into invariably is it its itself j just k knew know knowing known knows l largely lately latter least lest let lets letting lightly likely likewise literally locally logically loosely loudly m made mainly make makes making may maybe me meanwhile mentally merely might more moreover most mostly much must my myself n namely naturally near nearby nearer nearest nearly neatly necessary necessarily need needed needs needing neither never nevertheless newly next no nobody non none nor normally not notably nothing now nowadays nowhere o ok obviously occasionally oh of off officially often on one once only openly or ordinarily originally other others otherwise ought our ourselves out over own p painfully paradoxically partially particularily partly patiently per perfectly perhaps permanently personally physically plainly possible possibly practically precisely preferably presumably previously primarily principally privately probably promptly properly publicly purely put puts q quietly quite r rarely rather readily really reasonably recently regarding regardless regularly relatively repeatedly respectively s safely said same satisfactorily saw say says saying scarcely see seeing seem
seemed seeming seemingly seems seen sees seldom separately seriously several severely shall sharply she shortly should silently similarly simply since slightly slowly smoothly so socially softly solely some somebody someday somehow someone someplace something sometime sometimes somewhat somewhere soon specifically squarely steadily still strictly strongly subsequently substantially successfully such sufficiently supposedly sure surely surprisingly t take taking taken takes tell telling tells temporarily than that the their them themselves then there thereafter thereby therefore therefrom therein thereof thereto thereupon therewith these they thing things this thoroughly those thou though thoughtfully through throughout thus tightly to together told too took totally toward towards traditionally truly typically u ultimately unanimously unconsciously under undoubtedly unexpectedly unfortunately unless unlike until up upon upward us usual usually utterly v vaguely versus very via vigorously violently virtually vs w was we went were what whatever when whenever where whereabouts whereas whereby wherefore wherein whereof whereupon wherever whether which whichever while who wholly whose why widely wildly will with within without would x y yeah yet you your yours z

Bibliography

[20News] The 20-Newsgroups collection, collected by Ken Lang, is freely available for research purposes from http://www.ai.mit.edu/people/jrennie/20Newsgroups/.

[ADW94] C. Apte, F. Damerau, and S. M. Weiss. Automated Learning of Decision Rules for Text Categorization. ACM Transactions on Information Systems, 12(3), pages 233-251, 1994.

[AKCS00] I. Androutsopoulos, J. Koutsias, K. Chandrinos, and C. Spyropoulos. An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 160-167, 2000.

[AMST+96] A.
Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast Discovery of Association Rules. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, pages 307-328, 1996.

[Ang87] D. Angluin. Queries and Concept Learning. Machine Learning, 2(4), pages 319-342, 1987.

[BG01] T. Brasethvik and J. A. Gulla. Natural Language Analysis for Semantic Document Modeling. Data & Knowledge Engineering, 38, pages 45-62, 2001.

[Bri92] E. Brill. A Simple Rule-based Part-of-Speech Tagger. In Proceedings of the 3rd Annual Conference on Applied Natural Language Processing (ACL), Trento, Italy, pages 152-155, 1992.

[Bri95] E. Brill. Unsupervised Learning of Disambiguation Rules for Part of Speech Tagging. In Proceedings of the 3rd Workshop on Very Large Corpora, pages 1-13, 1995.

[BSA94] C. Buckley, G. Salton, and J. Allan. The Effect of Adding Relevance Information in a Relevance Feedback Environment. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'94), pages 292-300, 1994.

[BSAS95] C. Buckley, G. Salton, J. Allan, and A. Singhal. Automatic Query Expansion Using SMART: TREC 3. The Third Text Retrieval Conference (TREC-3), National Institute of Standards and Technology Special Publication 500-207, Gaithersburg, MD, 1995.

[BT98] A. W. Black and P. Taylor. Assigning Phrase Breaks from Part-of-Speech Sequences. Computer Speech and Language, 12(2), pages 99-117, 1998.

[CAL94] D. Cohn, L. Atlas, and R. Ladner. Improving Generalization with Active Learning. Machine Learning, 15(2), pages 201-221, 1994.

[CH98] W. W. Cohen and H. Hirsh. Joins that Generalize: Text Classification Using WHIRL. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (KDD-98), New York, NY, pages 169-173, 1998.

[Cha97] E. Charniak. Statistical Techniques for Natural Language Parsing. AI Magazine, 18(4), pages 33-44, 1997.

[CM01] X. Carreras and L. Màrquez.
Boosting Trees for Anti-Spam Email Filtering. In Proceedings of RANLP-01, 4th International Conference on Recent Advances in Natural Language Processing, Tzigov Chark, BG, 2001.

[CS96] W. W. Cohen and Y. Singer. Context-sensitive Learning Methods for Text Categorization. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'96), pages 307-315, 1996.

[CX95] W. B. Croft and J. Xu. Corpus-Specific Stemming Using Word Form Co-occurrence. In Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, Nevada, pages 147-159, April 1995.

[CY92] C. J. Crouch and B. Yang. Experiments in Automatic Statistical Thesaurus Construction. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'92), pages 77-88, 1992.

[Dam95] M. Damashek. Gauging Similarity via N-Grams: Language-Independent Sorting, Categorization, and Retrieval of Text. Science, 267, February 1995.

[DDFL+90] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6), pages 391-407, 1990.

[DE95] I. Dagan and S. P. Engelson. Committee-Based Sampling for Training Probabilistic Classifiers. In Proceedings of the 12th International Conference on Machine Learning, pages 150-157, 1995.

[Fox90] C. Fox. A Stop List for General Text. SIGIR Forum, 24(1-2), pages 19-35, 1990.

[Fur98] J. Furnkranz. A Study Using n-gram Features for Text Categorization. Technical Report OEFAI-TR-98-30, Austrian Research Institute for Artificial Intelligence, 1998.

[GZ00] S. Goldman and Y. Zhou. Enhancing Supervised Learning with Unlabeled Data. In Proceedings of the 17th International Conference on Machine Learning, pages 327-334, 2000.

[Har75] J. A. Hartigan. Clustering Algorithms. John Wiley & Sons, 1975.

[HK00] E. H. Han and G. Karypis.
Centroid-Based Document Classification: Analysis and Experimental Results. In Proceedings of the 4th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), pages 424-431, September 2000.

[HPS96] D. Hull, J. Pedersen, and H. Schuetze. Document Routing as Statistical Classification. In AAAI Spring Symposium on Machine Learning in Information Access, Palo Alto, CA, March 1996.

[HW90] P. Hayes and S. Weinstein. CONSTRUE/TIS: A System for Content-Based Indexing of a Database of News Stories. In Second Annual Conference on Innovative Applications of Artificial Intelligence, pages 49-64, 1990.

[Joa97] T. Joachims. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. In Proceedings of the 14th International Conference on Machine Learning (ICML'97), pages 143-151, 1997.

[Joa98] T. Joachims. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In Machine Learning: ECML-98, 10th European Conference on Machine Learning, pages 137-142, 1998.

[KMRT+94] M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding Interesting Rules from Large Sets of Discovered Association Rules. In Proceedings of the 3rd International Conference on Information and Knowledge Management (CIKM'94), pages 401-407, 1994.

[KS96] D. Koller and M. Sahami. Toward Optimal Feature Selection. In Proceedings of the 13th International Conference on Machine Learning (ICML'96), pages 284-292, 1996.

[Lan95] K. Lang. NewsWeeder: Learning to Filter Netnews. In Proceedings of the 12th International Machine Learning Conference (ICML'95), Lake Tahoe, CA, Morgan Kaufmann, San Francisco, pages 331-339, 1995.

[LC94] D. D. Lewis and J. Catlett. Heterogeneous Uncertainty Sampling for Supervised Learning. In Proceedings of the Eleventh International Conference on Machine Learning, San Francisco, CA, Morgan Kaufmann, pages 148-156, 1994.

[Lew92a] D. D. Lewis.
Representation and Learning in Information Retrieval. Ph.D. Thesis, Department of Computer Science, University of Massachusetts, Amherst, MA, 1992.

[Lew92b] D. D. Lewis. An Evaluation of Phrasal and Clustered Representations on a Text Categorization Task. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, June 21-24, Copenhagen, Denmark, pages 37-50, 1992.

[LG94] D. D. Lewis and W. Gale. A Sequential Algorithm for Training Text Classifiers. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3-12, 1994.

[LH98] W. Lam and C. Y. Ho. Using a Generalized Instance Set for Automatic Text Categorization. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 24-28, Melbourne, Australia, pages 81-89, 1998.

[LKK00] K. H. Lee, J. Kay, and B. H. Kang. Keyword Association Network: A Statistical Multi-term Indexing Approach for Document Categorization. In Proceedings of the 5th Australasian Document Computing Symposium, pages 9-16, December 2000.

[LKK02] K. H. Lee, J. Kay, and B. H. Kang. Lazy Linear Classifier and Rank-in-Score Threshold in Similarity-Based Text Categorization. ICML Workshop on Text Learning (TextML'2002), Sydney, Australia, pages 36-43, July 2002.

[LKKR02] K. H. Lee, J. Kay, B. H. Kang, and U. Rosebrock. A Comparative Study on Statistical Machine Learning Algorithms and Thresholding Strategies for Automatic Text Categorization. The 7th Pacific Rim International Conference on Artificial Intelligence (PRICAI-02), Tokyo, Japan, pages 444-453, August 2002.

[LR94] D. Lewis and M. Ringuette. A Comparison of Two Learning Algorithms for Text Categorization. In Third Annual Symposium on Document Analysis and Information Retrieval, pages 81-93, 1994.

[LSCP96] D. D. Lewis, R. E. Schapire, J. P. Callan, and R. Papka.
Training Algorithms for Linear Text Classifiers. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'96), pages 298-306, 1996.

[LT97] R. Liere and P. Tadepalli. Active Learning with Committees for Text Categorization. In Proceedings of the 14th National Conference on Artificial Intelligence, pages 591-596, 1997.

[Mer98] D. Merkl. Text Classification with Self-Organizing Maps: Some Lessons Learned. Neurocomputing, 21(1-3), pages 61-77, 1998.

[MG96] I. Moulinier and J. G. Ganascia. Applying an Existing Machine Learning Algorithm to Text Categorization. In S. Wermter, E. Riloff, and G. Scheler (eds.), Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, Springer-Verlag, Berlin, pages 343-354, 1996.

[MG98] D. Mladenic and M. Grobelnik. Word Sequences as Features in Text Learning. In Proceedings of the 17th Electrotechnical and Computer Science Conference (ERK98), Ljubljana, Slovenia, pages 145-148, 1998.

[MG99] D. Mladenic and M. Grobelnik. Feature Selection for Unbalanced Class Distribution and Naive Bayes. In Proceedings of the 16th International Conference on Machine Learning, pages 258-267, 1999.

[Mla98] D. Mladenic. Feature Subset Selection in Text-learning. In Proceedings of the 10th European Conference on Machine Learning (ECML'98), pages 95-100, 1998.

[MN98] A. McCallum and K. Nigam. A Comparison of Event Models for Naive Bayes Text Classifiers. In AAAI-98 Workshop on Learning for Text Categorization, pages 41-48, 1998.

[MRG96] I. Moulinier, G. Raskinis, and J. Ganascia. Text Categorization: A Symbolic Approach. In Proceedings of the 5th Annual Symposium on Document Analysis and Information Retrieval (SDAIR-96), pages 87-99, 1996.

[MRMN98] A. McCallum, R. Rosenfeld, T. Mitchell, and A. Y. Ng. Improving Text Classification by Shrinkage in a Hierarchy of Classes.
In Proceedings of the 15th International Conference on Machine Learning, San Francisco, CA, pages 359-367, 1998.

[NG00] K. Nigam and R. Ghani. Understanding the Behavior of Co-training. In Proceedings of the KDD-2000 Workshop on Text Mining, 2000.

[NH98] K. Nagao and K. Hasida. Automatic Text Summarization Based on the Global Document Annotation. In Proceedings of COLING-ACL'98, pages 917-921, 1998.

[NMTM00] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning, 39(2/3), pages 103-134, 2000.

[Paz00] M. J. Pazzani. Representation of Electronic Mail Filtering Profiles: A User Study. Intelligent User Interfaces, pages 202-206, 2000.

[Por80] M. F. Porter. An Algorithm for Suffix Stripping. Program, 14(3), pages 130-137, July 1980.

[Qui93] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[R21578] The Reuters-21578 collection, originally collected and labeled by Carnegie Group Inc. and Reuters Ltd., is freely available for research purposes from http://www.daviddlewis.com/resources/testcollections/reuters21578/; the previous location of the collection was http://www.research.att.com/~lewis/reuters21578.html.

[R22173] The Reuters-22173 collection, originally collected and labeled by Carnegie Group Inc. and Reuters Ltd., is freely available by anonymous ftp for research purposes from ftp://ciir-ftp.cs.umass.edu:/pub/reuters1.

[RCV1] The new Reuters collection, Reuters Corpus Volume 1, has recently been made available by Reuters Ltd. and is freely available for research purposes from http://about.reuters.com/researchandstandards/corpus/.

[Rij79] C. J. van Rijsbergen. Information Retrieval. 2nd Edition, Butterworths, London, UK, 1979.

[RMW95] M. Röscheisen, C. Mogensen, and T. Winograd. Beyond Browsing: Shared Comments, SOAPs, Trails, and On-line Communities.
In Proceedings of the 3rd International World Wide Web Conference, Darmstadt, Germany, pages 739-749, April 1995.

[Roc71] J. Rocchio. Relevance Feedback in Information Retrieval. In G. Salton, editor, The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice-Hall Inc., pages 313-323, 1971.

[RS99] M. E. Ruiz and P. Srinivasan. Hierarchical Neural Networks for Text Categorization. In Proceedings of the 22nd ACM International Conference on Research and Development in Information Retrieval (SIGIR-99), pages 281-282, 1999.

[Rug92] G. Ruge. Experiments on Linguistically Based Term Associations. Information Processing & Management, 28(3), pages 317-332, 1992.

[Sal89] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, Massachusetts, 1989.

[Sal91] G. Salton. Developments in Automatic Text Retrieval. Science, Vol. 253, pages 974-979, 1991.

[SCAT92] S. Sekine, J. Carroll, S. Ananiadou, and J. Tsujii. Automatic Learning for Semantic Collocation. In Proceedings of the Third Conference on Applied Natural Language Processing, ACL, pages 104-110, 1992.

[SDHH98] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz. A Bayesian Approach to Filtering Junk E-mail. In Proceedings of the AAAI'98 Workshop on Learning for Text Categorization, Madison, Wisconsin, pages 55-62, 1998.

[Seb02] F. Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1), pages 1-47, 2002.

[SK00] S. Shankar and G. Karypis. A Feature Weight Adjustment for Document Categorization. The 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 20-23, Boston, MA, 2000.

[SM99] S. Scott and S. Matwin. Feature Engineering for Text Classification. In Proceedings of the 16th International Conference on Machine Learning, pages 379-388, 1999.

[SMB96] M. A. Schickler, M. S. Mazer, and C. Brooks.
Pan-Browser Support for Annotations and Other Meta-Information on the World Wide Web. Computer Networks and ISDN Systems, 28, pages 1063-1074, 1996.

[SP97] L. Saul and F. Pereira. Aggregate and Mixed-order Markov Models for Statistical Language Processing. In C. Cardie and R. Weischedel (eds.), Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing, pages 81-89, 1997.

[SSC97] L. Singh, P. Scheuermann, and B. Chen. Generating Association Rules from Semi-Structured Documents Using an Extended Concept Hierarchy. In Proceedings of the 6th International Conference on Information and Knowledge Management (CIKM'97), pages 193-200, 1997.

[SSS98] R. E. Schapire, Y. Singer, and A. Singhal. Boosting and Rocchio Applied to Text Filtering. In Proceedings of the 21st Annual International Conference on Research and Development in Information Retrieval, pages 215-223, 1998.

[SW99] The stop words from the 'SuperJournal' research project in the U.K. This project was conducted over three years, from 1996 to 1998, as part of the Electronic Libraries Program (eLib). The stop words are freely available for research purposes from http://www.mimas.ac.uk/sj/application/demo/stopword.html.

[TCM99] C. Thompson, M. E. Califf, and R. J. Mooney. Active Learning for Natural Language Parsing and Information Extraction. In Proceedings of the 16th International Conference on Machine Learning, pages 406-414, 1999.

[VM94] A. Vorstermans and J. P. Martens. Automatic Labeling of Corpora for Speech Synthesis Development. In Proceedings of ICSLP'94, pages 1747-1750, 1994.

[WPW95] E. Wiener, J. O. Pedersen, and A. S. Weigend. A Neural Network Approach to Topic Spotting. In Proceedings of the Fourth Annual Symposium on Document Analysis and Information Retrieval (SDAIR'95), pages 317-332, 1995.

[WS85] B. Widrow and S. D. Stearns. Adaptive Signal Processing. Prentice-Hall Inc., Englewood Cliffs, NJ, 1985.

[XC00] J. Xu and W. B. Croft.
Improving the Effectiveness of Information Retrieval with Local Context Analysis. ACM Transactions on Information Systems, 18(1), pages 79-112, January 2000.

[Yan01] Y. Yang. A Study on Thresholding Strategies for Text Categorization. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'01), pages 137-145, 2001.

[Yan94] Y. Yang. Expert Network: Effective and Efficient Learning from Human Decisions in Text Categorization and Retrieval. In Proceedings of the 17th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 13-22, 1994.

[Yan99] Y. Yang. An Evaluation of Statistical Approaches to Text Categorization. Journal of Information Retrieval, 1(1/2), pages 67-88, 1999.

[YG98] T. Yavuz and A. Guvenir. Application of k-Nearest Neighbor on Feature Projections Classifier to Text Categorization. In Proceedings of the 13th International Symposium on Computer and Information Sciences (ISCIS'98), U. Gudukbay, T. Dayar, A. Gursoy, and E. Gelenbe (eds.), Antalya, Turkey, pages 135-142, 1998.

[YP97] Y. Yang and J. O. Pedersen. A Comparative Study on Feature Selection in Text Categorization. In Proceedings of the 14th International Conference on Machine Learning (ICML'97), pages 412-420, 1997.

[YX99] Y. Yang and X. Liu. A Re-examination of Text Categorization Methods. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99), pages 42-49, 1999.