Improving Text Classification using Local Latent Semantic Indexing

Tao Liu, Nankai University, China ([email protected])
Zheng Chen, Benyu Zhang, Wei-ying Ma, Microsoft Research Asia ({zhengc,byzhang,wyma}@microsoft.com)
Gongyi Wu, Nankai University, China ([email protected])

Abstract

Latent Semantic Indexing (LSI) has been shown to be extremely useful in information retrieval, but it is not an optimal representation for text classification. When applied to the whole training set (global LSI), it consistently degrades classification performance, because this completely unsupervised method concentrates only on representation and ignores class discrimination. Some local LSI methods have been proposed to improve classification by utilizing class discrimination information, but their improvements over the original term vector are still very limited. In this paper, we propose a new local LSI method called "Local Relevancy Weighted LSI" that improves text classification by performing a separate Singular Value Decomposition (SVD) on a transformed local region of each class. Experimental results show that our method clearly outperforms global LSI and traditional local LSI methods on classification, while using a much smaller LSI dimension.

1. Introduction

Text classification is one of the core problems in text mining. Its task is to automatically assign predefined classes or categories to text documents [19]. A great deal of research has been done on text classification, and SVM has proved to be among the best algorithms for the task [6]. However, the performance of a text classifier is strongly affected by the underlying feature representation. First, the inherent high dimensionality, with tens of thousands of terms even for a moderate-sized text collection, is prohibitively expensive for many learning algorithms [19] and easily raises the overfitting problem [11]. Second, polysemy (one word can have different meanings) and synonymy (different words can describe the same concept) interfere with forming appropriate classification functions and make the task very difficult [7,22].

There are typically two types of algorithms for reducing the feature space used in classification. One type is the so-called "feature selection" algorithms, which select a subset of the most representative features from the original feature space [8,20]. The other type is called "feature extraction", which transforms the original feature space into a smaller one to reduce the dimension. Compared with feature selection, feature extraction not only reduces the dimensionality of the feature space greatly, but also alleviates the polysemy and synonymy problems to a certain degree.

The most representative feature extraction algorithm is Latent Semantic Indexing (LSI), an automatic method that transforms the original textual data into a smaller semantic space by taking advantage of some of the implicit higher-order structure in the associations of words with text objects [1,2]. The transformation is computed by applying truncated singular value decomposition (SVD) to the term-by-document matrix. After SVD, terms that are used in similar contexts are merged together, so documents using different terminology to talk about the same concept can be positioned near each other in the new space [17]. Although LSI was originally proposed as an information retrieval method, it has been widely used in text classification as well.
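To make the transformation concrete, the following is a minimal Python sketch of the truncated-SVD step, assuming a tf-idf weighted term-by-document matrix; the random matrix and the choice of k are illustrative stand-ins, not values from this paper.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Stand-in for a real tf-idf corpus: 5000 terms (rows) x 1000 documents (columns).
X = sparse_random(5000, 1000, density=0.01, format="csc")

k = 200                    # number of latent dimensions to keep (illustrative)
U, s, Vt = svds(X, k=k)    # truncated SVD: X is approximately U @ np.diag(s) @ Vt

# Each column of np.diag(s) @ Vt is a document in the k-dimensional semantic
# space; each row of U maps a term into that space.
doc_vectors = (np.diag(s) @ Vt).T   # shape: (n_docs, k)
```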
For example, [19] used LSI to cut off noise during the training process, [22] performed SVD on an expanded term-by-document matrix that includes both the training examples and background knowledge, and [18] performed SVD on a term-by-document matrix whose terms include both single-word and phrase terms.

When LSI is applied to text classification, there are two common approaches. The first is "global LSI", which performs SVD directly on the entire training document collection to generate the new feature space. This method is completely unsupervised; it pays no attention to the class labels of the training data. It does nothing to improve the discrimination power between document classes, so it yields no better, and sometimes even worse, classification performance than the original term vector [15]. The second is "local LSI", which performs a separate SVD on the local region of each topic [4,5,12,17]. Compared with global LSI, this method uses the class information effectively and improves on global LSI greatly. However, because all documents in the local region are weighted equally, the improvements over the original term vector are still very limited.

Note that in local LSI, all documents in the local region are treated identically in the SVD computation. Intuitively, however, different documents should play different roles in the final feature space: documents more relevant to the topic should contribute more to the local semantic space than less relevant ones. Based on this idea, we propose a new local LSI method, "Local Relevancy Weighted LSI (LRW-LSI)", which selects documents into the local region in a smooth way. In other words, LRW-LSI gives each document in the local region a different weight according to its relevancy before performing SVD, so that the local semantic space can be extracted more accurately. The experimental results presented later support this idea: LRW-LSI is much better than global LSI and ordinary local LSI methods in classification performance, while using a much smaller LSI dimension.

The rest of this paper is organized as follows. Section 2 gives a brief introduction to global LSI and the different local LSI methods. Section 3 proposes our new Local Relevancy Weighted LSI method. Section 4 describes several experiments evaluating different LSI methods for text classification and discusses their results. Section 5 concludes the paper.

2. Related Works

The most straightforward way to apply LSI to text classification is the global LSI method, which performs SVD directly on the entire training set; testing documents are then transformed by simply projecting them onto the left singular matrix produced in the original decomposition [15,18,19,22]. Global LSI is completely unsupervised: it makes no use of class labels, word order, syntactic relations or other existing information. It aims at deriving an optimal representation of the original data in a lower-dimensional space in the mean-squared-error sense, but this representation does not help the discrimination of classes. This is especially harmful when some original terms are particularly good at discriminating a category/class; that discrimination power may be lost in the new semantic space. Hence, global LSI does not perform as well as expected and can even degrade classification performance.
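For reference, here is a sketch of the projection ("folding-in") of a testing document described above, continuing the decomposition from the previous sketch; the scaling by the singular values follows the standard LSI folding-in formula, which is an assumption, since the cited papers differ in such details.

```python
import numpy as np

def fold_in(q, U, s):
    """Project a term-space document vector q into the k-dim LSI space."""
    return (U.T @ q) / s   # the standard folding-in formula, S^{-1} U^T q

# e.g., for a test document given as a tf-idf column of a sparse X_test:
# q_lsi = fold_in(X_test[:, 0].toarray().ravel(), U, s)
```

A classifier (e.g., an SVM) is then trained and evaluated on these k-dimensional vectors instead of the original term vectors.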
Furthermore, Wiener, Pedersen and Weigend [17] found that global LSI performed increasingly worse as topic frequency decreased, because infrequent topics are usually represented by infrequent terms, and infrequent terms may be projected out of the LSI representation as noise.

To integrate the class information, Hull [4] first proposed the concept of local LSI, which performs SVD on the local region of a topic so that the most important local structure, crucial for separating relevant documents from non-relevant documents, can be captured. A drawback of Hull's solution is that the local region is defined by the relevant/positive documents only and thus contains no discriminative information, which makes the improvement in classification performance very limited. It is therefore necessary to add other documents to balance the local region. [5,12,17] did this by extending the local region with the non-relevant documents that are the most difficult to distinguish from the relevant ones. Introducing the nearest non-relevant documents provides the most valuable discrimination information and was found to be more effective than using relevant documents alone.

Compared to global LSI, whose SVD cost is incurred only once, local LSI has an increased cost because a separate SVD has to be computed for each class. However, each SVD is applied only to a local region, whose matrix is far smaller than the one used in global LSI, so the computation can be extremely fast.

The following sub-sections give a brief introduction to the local LSI methods. All of them generate the local region in a similar way: each document in the training set is first assigned a relevancy score for a topic, and the documents whose scores are higher than a predefined threshold are picked to form the local region.

2.1. Relevant Documents Selecting Method (RDS)

RDS defines the local region of a topic as the relevant documents only [4]. It is the simplest method, but the local region contains no discrimination information, so its ability to improve classification performance is very limited. Moreover, the frequency of topic occurrence varies greatly from topic to topic. For example, in the Reuters-21578 data, the biggest topic "earn" covers roughly 30% of the documents, while only five training documents belong to the topic "platinum". This makes it very difficult to select a suitable optimal LSI dimension for each topic. Hence, RDS is not a recommended local LSI method.

2.2. Query-specific Screening Method (QS)

QS is the most widely used local LSI method. It defines the local region of a topic as the n most similar documents, where similarity is measured by the inner-product score against a query vector generated either by Rocchio-expansion or from the m most predictive terms of the topic.

2.2.1. Query by Rocchio-expansion. Rocchio-expansion is the simplest method of combining relevant and non-relevant documents [10]. It has been widely used to define the query of a topic as the mean of the relevant documents in the training set [3,12,13].
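As a concrete illustration of the Rocchio variant of QS, here is a small sketch of local-region selection; the function name and the dense-matrix assumption are illustrative, not from the cited work.

```python
import numpy as np

def qs_local_region(X, y, n):
    """X: dense (n_docs, n_terms) tf-idf matrix; y: 0/1 numpy array of topic labels."""
    query = X[y == 1].mean(axis=0)        # Rocchio-expansion: mean of the relevant docs
    scores = X @ query                    # inner-product relevancy scores
    top = np.argsort(scores)[::-1][:n]    # indices of the n most similar documents
    return top, scores
```

The local term-by-document matrix built from these top-n rows is then decomposed with a separate truncated SVD for this topic.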
2.2.2. Query by predictive terms. Wiener et al. [17] defined the query as the 100 most predictive terms of the topic, where the predictive score of a term was measured by the Relevancy Score (RS). Many other measures can be used to rank terms for a topic, such as Information Gain (IG), the χ² statistic (CHI) and Mutual Information (MI) [8,20]. Since there is no big difference between RS, IG and CHI when selecting a small set of terms [17], in this paper we consider only two representative measures: CHI and MI. CHI, defined in equation (1), measures the association between a term and a topic. MI, defined in equation (2), measures the importance of a term to a topic only through the term's presence in the topic's documents.

$$\chi^2(t,c) = \frac{N \times \left(p(t,c)\,p(\bar{t},\bar{c}) - p(t,\bar{c})\,p(\bar{t},c)\right)^2}{p(t)\,p(\bar{t})\,p(c)\,p(\bar{c})} \qquad (1)$$

$$MI(t,c) = \log\frac{p(t,c)}{p(t)\,p(c)} = \log p(t \mid c) - \log p(t) \qquad (2)$$
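Expressed in document counts, equations (1) and (2) can be computed as in the sketch below, where a is the number of documents containing term t in class c, b the number containing t outside c, c_ the number in c without t, and d the number outside c without t (these variable names are illustrative).

```python
import math

def chi_square(a, b, c_, d):
    """Equation (1) in document counts: N*(ad - bc)^2 / ((a+b)(c_+d)(a+c_)(b+d))."""
    n = a + b + c_ + d
    den = (a + b) * (c_ + d) * (a + c_) * (b + d)
    return n * (a * d - b * c_) ** 2 / den if den else 0.0

def mutual_info(a, b, c_, d):
    """Equation (2): log( p(t,c) / (p(t) * p(c)) )."""
    n = a + b + c_ + d
    p_tc, p_t, p_c = a / n, (a + b) / n, (a + c_) / n
    return math.log(p_tc / (p_t * p_c)) if p_tc else float("-inf")
```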
2.3. SVM Screening Method (SS)

Both RDS and QS can be viewed as a classification process for generating the local region. Similarly, for each class an SVM classifier can be trained on the training documents and then used to classify each training document to obtain its confidence value; the n most confident documents are picked as the local region of that class. This method is called the SVM Screening method.

3. Local Relevancy Weighted LSI

As introduced above, in local LSI each document in the training set is first assigned a relevancy score for a topic, and the documents whose scores are larger than a predefined threshold are selected to generate the local region. SVD is then performed on the local region to produce the local semantic space. This process can be described as the step curve in Figure 1: a 0/1 weighting is used to generate the local region, where documents whose scores exceed the threshold are weighted 1 and all others are weighted 0.

The 0/1 weighting is a simple but crude way to generate the local region. It assumes that the selected documents are equally important in the SVD computation. However, each document plays a different role in the local semantic space, and the more relevant documents should contribute more to it, and vice versa. Furthermore, in real applications the size of the local region is very difficult to tune. When the size is too big, non-relevant documents may greatly outnumber relevant documents, so the SVD may pay more attention to the semantic structure of the non-relevant documents and the local semantic space may be heavily biased. On the other hand, when the size is too small, the local region may not contain enough discriminative information, so the local semantic space may also be limited.

Based on these problems, an intuitive idea arises, described by the smooth curve in Figure 1: introduce documents into the local region in a smooth way. In other words, the relevancy value of each document to a class is further used to weight the document, so that more relevant documents are introduced with higher weights and contribute more to the SVD computation. A better local semantic space, which results in better classification performance, can thus be extracted to separate positive documents from negative documents.

[Figure 1. Local LSI (step curve) and Local Relevancy Weighted LSI (smooth curve): document weight as a function of relevancy score.]

This new method is named "Local Relevancy Weighted LSI (LRW-LSI)". For each class, assume an initial classifier IC has been trained on the training documents in the term-vector representation; it can be a Rocchio classifier, a term-query classifier or an SVM classifier, as introduced in the last Section.

The training process of LRW-LSI then contains the following six steps. At the first step, the initial classifier IC of topic c is used to assign an initial relevancy score (rs) to each training document. At step two, each training document is weighted according to equation (3); the weighting function is a Sigmoid function with two parameters a and b that shape the curve. At step three, the top n documents are selected to generate the local term-by-document matrix of topic c. At step four, a truncated SVD is performed to generate the local semantic space. At step five, all other weighted training documents are folded into the new space. At the final step, all training documents, in their local LSI representation, are used to train a real classifier RC of topic c.

$$d_i' = d_i \cdot f(rs_i), \quad \text{where } f(rs_i) = \frac{1}{1 + e^{-a(rs_i + b)}} \qquad (3)$$

In the testing process, an incoming testing document is first classified by the initial classifier IC to get its initial relevancy score. It is then weighted according to equation (3) and folded into the local semantic space to get its local LSI vector. Finally, the local LSI vector is classified by the classifier RC to decide whether the document belongs to topic c or not.

Note that the parameters a and b of the Sigmoid function define how smoothly the documents are introduced. With suitable parameter settings, for example when a equals 0 or when a and b are large enough, LRW-LSI behaves exactly like local LSI, so local LSI is just a special case of LRW-LSI.
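A condensed sketch of this training procedure is given below, assuming an initial classifier with an sklearn-style decision_function and a dense tf-idf matrix; for simplicity all documents (including the local region) are folded in at the end, and all helper names are illustrative.

```python
import numpy as np
from scipy.sparse.linalg import svds

def sigmoid_weight(rs, a=5.0, b=0.2):
    """Equation (3): f(rs) = 1 / (1 + exp(-a * (rs + b)))."""
    return 1.0 / (1.0 + np.exp(-a * (rs + b)))

def lrw_lsi_space(X, ic, n=350, k=10, a=5.0, b=0.2):
    """X: dense (n_docs, n_terms) tf-idf matrix; ic: trained initial classifier."""
    rs = ic.decision_function(X)                 # step 1: initial relevancy scores
    Xw = X * sigmoid_weight(rs, a, b)[:, None]   # step 2: weight each document
    top = np.argsort(rs)[::-1][:n]               # step 3: top-n local region
    U, s, _ = svds(Xw[top].T, k=k)               # step 4: local truncated SVD
    Z = (Xw @ U) / s                             # step 5: fold all weighted docs in
    return U, s, Z                               # step 6: train RC on Z (not shown)
```

At test time, a document is weighted by the sigmoid of its IC score, projected with the same U and s, and then passed to the real classifier RC.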
4. Experiments

In this Section, we first evaluate four local LSI methods, QS-Roc, QS-CHI, QS-MI and SS, and compare them with the original term vector and global LSI; we then evaluate the Local Relevancy Weighted LSI method. SVMlight (http://svmlight.joachims.org/) is chosen as the classification algorithm, SVDPACKC/sis (http://www.netlib.org/svdpack/) is used to perform SVD, and the F-measure is used to evaluate the classification results. Two common data sets are used: Reuters-21578 and Industry Sector. Before classification, a standard stop-word list is used to remove common stop words, and stemming is applied to convert variations of the same word into their base form. Terms that appear in fewer than 3 documents are then removed. Finally, tf*idf (with the "ltc" option) is used to assign the weight of each term in each document.

4.1. Data Sets

Text classification performance varies greatly across datasets, so we choose two text collections: Reuters-21578 (http://www.daviddlewis.com/resources/testcollections) and Industry Sector (http://www-2.cs.cmu.edu/afs/cs.cmu.edu).

Reuters-21578 (Reuters) is the most widely used text collection for text classification. There are 21578 documents and 135 categories in this corpus. The frequency of topic occurrence varies greatly from topic to topic; for example, the most frequent topic "earn" appears in more than 30% of the documents, while "platinum" is mentioned in only five training documents. In our experiments, we chose only the 25 most frequent topics and used the "Lewis" split, which results in 6314 training examples and 2451 testing examples.

Industry Sector (IS) is a collection of web pages belonging to companies from various economic sectors. There are 105 topics and 9652 web pages in this dataset. Compared to Reuters, the topics are more evenly distributed across the documents. A subset of the 14 categories whose sizes are bigger than 130 is selected for the experiments, and a random split of 70% training examples and 30% testing examples is produced, which results in 2030 training examples and 911 testing examples.

4.2. Classification Algorithm

Support Vector Machine (SVM) is chosen in our experiments; it is very popular and has proved to be one of the best classification algorithms for text classification [18,21]. SVM was originally introduced by Vapnik in 1995 for solving two-class pattern recognition problems [16]. It is based on the Structural Risk Minimization principle from computational learning theory. The idea is to find a hypothesis h that separates positive examples from negative examples with maximum margin. In linear SVM, the hyper-plane is represented as w · x − b = 0, where the normal vector w and the constant b are determined by the distance from the hyper-plane to the nearest positive and negative examples.

4.3. Performance Measures

For evaluating the effectiveness of classification, we use the standard recall, precision and F1 measures. Recall is the ratio of correct assignments by the system to the total number of correct assignments. Precision is the ratio of correct assignments by the system to the total number of the system's assignments. The F1 measure, initially introduced by van Rijsbergen [9], combines recall (r) and precision (p) with equal weight:

$$F_1 = \frac{2rp}{r + p} \qquad (4)$$

There are two ways to measure the average F1 of a binary classifier over multiple categories, namely macro-averaging and micro-averaging. The former first computes F1 for each category and then averages them, while the latter first computes precision and recall over all categories and uses them to calculate F1. Macro-averaging thus gives equal weight to the performance on every category, regardless of how rare or how common a category is, whereas micro-averaging favors the performance on common categories. Both are calculated in our experiments.
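The two averages can be computed as in the following sketch, which uses scikit-learn as an assumed tooling choice (the paper does not name a library); y_true and y_pred are 0/1 indicator arrays over documents and topics.

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([[1, 0], [0, 1], [1, 1]])  # toy 0/1 labels for two topics
y_pred = np.array([[1, 0], [1, 1], [1, 0]])  # toy predictions

micro_f1 = f1_score(y_true, y_pred, average="micro")  # pool all decisions, then F1
macro_f1 = f1_score(y_true, y_pred, average="macro")  # per-topic F1, then average
```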
4.4. Experimental Results and Discussion

4.4.1. Global LSI and Local LSI. The first experiment compares four local LSI methods, QS-Roc, QS-CHI, QS-MI and SS, with the original term vector and global LSI, where QS-Roc means the query is defined by Rocchio-expansion, and QS-CHI and QS-MI mean the query is defined as the top 100 predictive terms selected by the CHI and MI measures, respectively. For Reuters-21578, the local region is defined as five times the number of positive documents, but no fewer than 350 and no more than 1500 documents, similar to [17]. For Industry Sector, the local region is defined as five times the number of positive documents.

Table 1. Classification results on Reuters

Method      Dimension   Micro-F1   Macro-F1
Term        >5000       94.02      81.50
Global LSI  250         92.64      73.26
QS-Roc      200         92.90      84.12
QS-CHI      200         93.78      84.79
QS-MI       200         93.99      85.60
SS          200         91.69      83.49

The classification results on Reuters-21578 using the different methods are shown in Table 1. As can be seen, the term vector is the best representation for the SVM classifier. Compared to the term vector, global LSI degrades the performance markedly, especially on macro-averaged F1 (a relative 10.1% reduction). There is no remarkable difference among the local LSI methods, whose micro-averaged F1 drops slightly while their macro-averaged F1 improves slightly.

Since micro-averaging favors performance on common topics while macro-averaging gives equal weight to all topics, we split the 25 topics of Reuters-21578 evenly into three frequency levels (high, medium and low) for a further analysis, as in [17]. Table 2 shows the results. As can be seen, global LSI performs increasingly worse as topic frequency decreases, because the signal of infrequent topics is so weak that it tends to be treated as noise and projected out of the new semantic space. Contrary to global LSI, local LSI focuses on the local structure and performs increasingly better as topic frequency decreases. For example, for low-frequency topics, both F1 measures drop by nearly 29% under global LSI but improve by nearly 13% under local LSI, compared to the term vector.

Table 2. Classification results on Reuters over three topic frequency ranges

            Micro-F1               Macro-F1
Method      High   Med    Low     High   Med    Low
Term        95.3   88.3   71.3    85.3   87.8   71.1
Global LSI  94.6   83.1   50.7    83.8   81.7   50.7
Local LSI   94.6   89.6   80.4    85.1   89.3   80.1

Similar results are found on the Industry Sector data, as shown in Table 3. The best performance is again produced by the term vector. Since all topics have similar frequency, which can be viewed as medium frequency, both F1 measures are greatly reduced by global LSI and slightly reduced by local LSI.

Table 3. Classification results on Industry Sector

Method      Dimension   Micro-F1   Macro-F1
Term        >5000       80.91      80.61
Global LSI  250         63.95      63.99
QS-Roc      200         75.08      76.03
QS-CHI      200         76.55      77.81
QS-MI       200         75.56      76.07
SS          200         75.81      77.11

4.4.2. Local Relevancy Weighted LSI. For Local Relevancy Weighted LSI, we use the SVM classifier as the initial classifier IC to generate each document's initial relevancy score, and the parameters a and b of the Sigmoid function are initially set to 5.0 and 0.2.

[Figure 2. LRW-LSI results on Reuters: micro- and macro-averaged F1 against the LRW-LSI dimension (10 to 150), with the term-vector scores as reference lines.]

[Figure 3. LRW-LSI results on Industry Sector: micro- and macro-averaged F1 against the LRW-LSI dimension (10 to 150), with the term-vector scores as reference lines.]

Figure 2 and Figure 3 display the classification results on Reuters-21578 and Industry Sector; the dotted lines for the term vector are displayed only as reference points for performance comparison. From these figures, the following observations can be made:
First, compared to the term vector, LRW-LSI improves both F1 measures greatly on both data sets. For example, using 10 dimensions on Reuters-21578, the micro-averaged F1 is improved by 0.8% and the macro-averaged F1 by 7.4%; using 20 dimensions on Industry Sector, the micro-averaged F1 is improved by 9.4% and the macro-averaged F1 by 10.2%.

Second, the optimal dimension of LRW-LSI is much smaller than that of global LSI and local LSI. For example, on Reuters-21578 the optimal dimension of LRW-LSI is only 10, while that of global LSI is 250 and that of local LSI is 200. Such a small dimension also means that the computation is much faster than for global LSI and local LSI. Table 4 shows the run time of the different LSI methods on a PC with a Pentium III 500 MHz CPU and 256 MB of memory; the run time includes both the training and the testing procedures. As can be seen, the term vector is the fastest, needing only a few hundred seconds. Global LSI needs much more time due to the costly SVD computation on the entire training set. Although the SVD computation on a single local region is very fast, the overall computation over all topics is very high, so local LSI is not expected to be used in practice. Like local LSI, LRW-LSI has to perform a separate SVD on the local region of each topic, but its low LSI dimension makes it extremely fast: it needs less than 3 times the run time of the term vector, so it can be widely used in practice.

Table 4. Run time of different methods (seconds)

Method      Reuters-21578   Industry Sector
Term        372             291
Global LSI  4702            1905
Local LSI   6508            5210
LRW-LSI     1080            752

Third, as the LSI dimension increases, the performance decreases slowly, but even at a relatively high dimension it remains equal to or above that of the term vector. Using 150 dimensions, for example, on Reuters-21578 the micro-averaged F1 is equal and the macro-averaged F1 is still improved by 2.4%; on Industry Sector, the micro-averaged F1 is still improved by 5.1% and the macro-averaged F1 by 4.2%.

Fourth, while local LSI performs similarly to the term vector and LRW-LSI performs much better than the term vector, LRW-LSI beats local LSI not only in performance but also in requiring a much smaller LSI dimension. This means that LRW-LSI is more effective at separating relevant documents from nearby non-relevant documents. To find out the reason, we project the documents onto the coordinate plane spanned by the first and second local LSI factors, as in [4,5]. Figure 6 shows the local LSI projection and Figure 7 the LRW-LSI projection for the "interest" topic of Reuters-21578, where the red asterisks represent relevant documents and the blue circles represent non-relevant documents. As can be seen, local LSI separates relevant from non-relevant documents only to a certain degree; a big chunk of documents remains mixed together. In comparison, LRW-LSI is much more successful at class separation: the non-relevant documents are clustered tightly around the origin, while the relevant documents are widely distributed.

[Figure 6. Local LSI projection for the "interest" topic of Reuters: documents plotted on the first two local LSI factors.]

[Figure 7. LRW-LSI projection for the "interest" topic of Reuters: documents plotted on the first two LRW-LSI factors.]

In the previous experiments, we set the parameters of the Sigmoid function by experience. To learn the influence of the parameter settings, we conduct two further experiments with different tuning strategies: the first fixes parameter a at 5.0 and varies parameter b from 0 to 1.0, and the second fixes parameter b at 0.0 and varies parameter a from 0.0 to 6.0. In both tests, 10 dimensions are used for classification.

[Figure 4. Parameter b tuning on Reuters: micro- and macro-averaged F1 as b varies, with a fixed at 5.0.]

[Figure 5. Parameter a tuning on Reuters: micro- and macro-averaged F1 as a varies, with b fixed at 0.0.]

Figure 4 and Figure 5 show the results on Reuters-21578; we do not display the results on Industry Sector because they are similar. As can be seen, LRW-LSI is more easily influenced by parameter b than by parameter a, but the performance influence of both parameters is very small, so in general LRW-LSI is insensitive to both parameters.
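For reference, plots like Figures 6 and 7 can be produced with a short sketch such as the one below, reusing the Z vectors from the earlier LRW-LSI sketch; the matplotlib usage and function name are illustrative.

```python
import matplotlib.pyplot as plt

def plot_first_two_factors(Z, y, title):
    """Z: (n_docs, k) local LSI vectors; y: 0/1 numpy array of topic labels."""
    plt.scatter(Z[y == 0, 0], Z[y == 0, 1], marker="o", label="non-relevant")
    plt.scatter(Z[y == 1, 0], Z[y == 1, 1], marker="*", label="relevant")
    plt.xlabel("first factor")
    plt.ylabel("second factor")
    plt.title(title)
    plt.legend()
    plt.show()
```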
5. Conclusion

In this paper, we propose a new Local Relevancy Weighted LSI (LRW-LSI) method to improve text classification performance. The method is developed from local LSI, but differs from it in that the documents in the local region are introduced using a smooth descending curve, so that documents more relevant to the topic are assigned higher weights. The local SVD can therefore concentrate on modeling the semantic information that is actually most important for the classification task. The experimental results verify this idea and show that LRW-LSI is quite effective: it can greatly improve classification performance while using a much smaller dimension than the global LSI and local LSI methods.

Another contribution of this work is a comparative study of global LSI and several local LSI methods for text classification. We find that the local space is more suitable for LSI than the global space. Global LSI can optimally represent the whole original data in a low-dimensional space, but it does nothing to optimize the discrimination of the topics, so it consistently degrades classification performance. Local LSI captures the important local structure that is crucial for separating relevant documents from nearby non-relevant documents, so it succeeds in maintaining or slightly improving classification performance in a low-dimensional space.

Acknowledgments

The authors would like to thank Weiguo Fan and Wensi Xi from Virginia Polytechnic Institute and State University for fruitful discussions.

References

[1] Berry, M.W., Dumais, S.T. & O'Brien, G.W., Using Linear Algebra for Intelligent Information Retrieval, SIAM Review, 37:573-595, 1995.
[2] Deerwester, S., Dumais, S.T., Landauer, T.K., Furnas, G.W. & Harshman, R.A., Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science, 41(6):391-407, 1990.
[3] Hearst, M., Pedersen, J., Pirolli, P. & Schutze, H., Xerox Site Report: Four TREC-4 Tracks, in D. Harman (ed.), TREC-4, pp. 97-119, 1996.
[4] Hull, D., Improving Text Retrieval for the Routing Problem using Latent Semantic Indexing, Proc. of SIGIR'94, pp. 282-289, 1994.
[5] Hull, D.A., Information Retrieval using Statistical Classification, PhD thesis, Stanford University, 1995.
[6] Joachims, T., Text Categorization with Support Vector Machines: Learning with Many Relevant Features, Proc. of ECML'98, pp. 137-142, 1998.
[7] Lewis, D.D. & Ringuette, M., A Comparison of Two Learning Algorithms for Text Categorization, Proc. of SDAIR'94, pp. 81-93, 1994.
[8] Liu, T., Liu, S., Chen, Z. & Ma, W., An Evaluation on Feature Selection for Text Clustering, Proc. of ICML'03, pp. 488-495, 2003.
[9] van Rijsbergen, C.J., Information Retrieval, 2nd edition, Butterworths, 1979.
[10] Rocchio, J.J., Relevance Feedback in Information Retrieval, in G. Salton (ed.), The SMART Retrieval System: Experiments in Automatic Document Processing, pp. 313-323, 1971.
[11] Sebastiani, F., Machine Learning in Automated Text Categorization, ACM Computing Surveys, 34(1):1-47, 2002.
[12] Schutze, H., Hull, D.A. & Pedersen, J.O., A Comparison of Classifiers and Document Representations for the Routing Problem, Proc. of SIGIR'95, pp. 229-237, 1995.
[13] Schutze, H., Pedersen, J.O. & Hearst, M.A., Xerox TREC-3 Report: Combining Exact and Fuzzy Predictors, in D. Harman (ed.), TREC-3, 1995.
[14] Schutze, H. & Silverstein, C., Projections for Efficient Document Clustering, Proc. of SIGIR'97, pp. 74-81, 1997.
[15] Torkkola, K., Linear Discriminant Analysis in Document Classification, 2002.
[16] Vapnik, V., The Nature of Statistical Learning Theory, Springer, 1995.
[17] Wiener, E., Pedersen, J.O. & Weigend, A.S., A Neural Network Approach to Topic Spotting, Proc.
of SDAIR'95, pp. 317-332, 1995.
[18] Wu, H. & Gunopulos, D., Evaluating the Utility of Statistical Phrases and Latent Semantic Indexing for Text Classification, Proc. of ICDM'02, pp. 713-716, 2002.
[19] Yang, Y., Noise Reduction in a Statistical Approach to Text Categorization, Proc. of SIGIR'95, pp. 256-263, 1995.
[20] Yang, Y. & Pedersen, J.O., A Comparative Study on Feature Selection in Text Categorization, Proc. of ICML'97, pp. 412-420, 1997.
[21] Yang, Y. & Liu, X., A Re-examination of Text Categorization Methods, Proc. of SIGIR'99, pp. 42-49, 1999.
[22] Zelikovitz, S. & Hirsh, H., Using LSI for Text Classification in the Presence of Background Text, Proc. of CIKM'01, pp. 113-118, 2001.