A Feature Selection Method Based on Fisher’s Discriminant Ratio for Text Sentiment Classification

Suge Wang1,3, Deyu Li2,3, Yingjie Wei4, and Hongxia Li1

1 School of Mathematics Science, Shanxi University, Taiyuan 030006, China
2 School of Computer & Information Technology, Shanxi University, Taiyuan 030006, China
3 Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan 030006, China
4 Science Press, Beijing 100717, China
[email protected]

Abstract. With the rapid growth of e-commerce, product reviews on the Web have become an important information source for customers’ decision making when they intend to buy a product. Since the reviews are often too numerous for customers to read through, automatically classifying them into sentiment orientation categories (i.e., positive/negative) has become a research problem. In this paper, an effective feature selection method based on Fisher’s discriminant ratio is proposed for sentiment classification of product review texts. To validate the effectiveness of the proposed method, we compare it with methods based on information gain and mutual information, with a support vector machine adopted as the classifier. Six subexperiments are conducted by combining the different feature selection methods with two kinds of candidate feature sets. On 1006 car review documents, the experimental results indicate that Fisher’s discriminant ratio based on word frequency estimation performs best, with an F value of 83.3%, when the candidate features are the words that appear in both positive and negative texts.

1 Introduction

Owing to its openness, virtualization and sharing, the Internet has rapidly become a platform for people to express their opinions, attitudes, feelings and emotions.
W. Liu et al. (Eds.): WISM 2009, LNCS 5854, pp. 88–97, 2009. © Springer-Verlag Berlin Heidelberg 2009

With the rapid development of e-commerce, product reviews on the Web have become a very important information source for customers’ decision making when they plan to buy products online or offline. Since the reviews are often too numerous for customers to read through, an automatic text sentiment classifier can help a customer quickly grasp the review orientations (e.g., positive/negative) about a product. Text sentiment classification aims to automatically judge the sentiment orientation, positive (‘thumbs up’) or negative (‘thumbs down’), of a review text by mining and analyzing the subjective information in the text, such as standpoint, view, attitude and mood. Text sentiment classification can be widely applied in many fields, such as public opinion analysis, online product tracking, question answering systems and text summarization.

However, review texts on the Web, such as those on BBSs, blogs or forum websites, are often non-structured or semi-structured. Different from the case of structured data, feature selection is a crucial problem in classifying non-structured or semi-structured data. Although there has been a recent surge of interest in text sentiment classification, the state-of-the-art techniques are much less mature than those for text topic classification. This is partially because topics are represented objectively and explicitly with keywords, while sentiments are expressed with subtlety: they are hidden in a large amount of subjective information in the text, such as standpoint, view, attitude and mood. Therefore, text sentiment classification requires deeper analysis and understanding of the textual statements and is thus more challenging.
In recent years, by employing machine learning techniques (e.g., Naive Bayes (NB), maximum entropy classification (ME), and support vector machines (SVM)), much research has been conducted on English text sentiment classification [1-7] and on Chinese text sentiment classification [8-11]. One major difficulty of the text sentiment classification problem is the high dimension of the features used to describe texts, which raises big hurdles in applying many sophisticated learning algorithms. The aim of feature selection is to reduce the original feature set by removing features considered irrelevant for text sentiment classification, so as to improve classification accuracy and decrease the running time of learning algorithms [11]. Tan et al. (2008) presented an empirical study of sentiment categorization on Chinese documents [12]. In their work, four feature selection methods, Information Gain (IG), Mutual Information (MI), CHI and Document Frequency (DF), were adopted. Their experimental results indicate that IG performs best for sentiment term selection and that SVM exhibits the best performance for Chinese text sentiment classification. Yi et al. (2003) developed and tested two feature term selection algorithms based on a mixture language model and the likelihood ratio [13]. Wang et al. (2007) presented a hybrid method for feature selection based on the category distinguishing capability of feature words and IG [10].

In this paper, from the viewpoint of a feature’s contribution to distinguishing text categories, an effective feature selection method is proposed for text sentiment classification. By using two probability estimation methods, based on Boolean values and on word frequencies respectively, and combining them with two candidate feature sets, four feature selection approaches are obtained. Experiments with a support vector machine (SVM) classifier indicate that the approach is effective for review text sentiment classification.
The remainder of this paper is organized as follows. Section 2 presents the proposed feature selection methods. Section 3 describes the text sentiment classification process, and Section 4 presents the experiments used to evaluate the effectiveness of the proposed approach together with a discussion of the results. Section 5 concludes with closing remarks and future directions.

2 Feature Selection Method Based on Fisher’s Discriminant Ratio

Fisher’s linear discriminant is one of the most efficient approaches to dimension reduction in statistical pattern recognition [14]. Its main idea can be briefly described as follows. Suppose that there are two classes of sample points in a d-dimensional data space. We hope to find a line in the original space such that the projections of the sample points onto the line can be separated as well as possible by some point on the line. In other words, the larger the squared difference between the means of the two classes of projected sample points is, and the smaller the within-class scatters are at the same time, the better the expected line is. More formally, we construct the following function, the so-called Fisher’s discriminant ratio:

J_F(w) = \frac{(m_1 - m_2)^2}{S_1^2 + S_2^2}    (1)

where w is the direction vector of the expected line, and m_i and S_i^2 (i = 1, 2) are the mean and the within-class scatter of the projected samples of the i-th class, respectively. The above idea thus amounts to finding a w such that J_F(w) achieves its maximum.

2.1 Improved Fisher’s Discriminant Ratio

In fact, Fisher’s discriminant ratio can be adapted to evaluate the category distinguishing capability of a feature by taking the feature itself as the projection dimension in place of w. For this purpose, Fisher’s discriminant ratio is reconstructed as follows:

F(t_k) = \frac{(E(t_k|P) - E(t_k|N))^2}{D(t_k|P) + D(t_k|N)}    (2)

where E(t_k|P) and E(t_k|N) are the conditional means of the feature t_k with respect to the categories P and N, respectively, and D(t_k|P) and D(t_k|N) are the conditional variances of t_k with respect to the categories P and N, respectively.
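As a concrete illustration of Formula (2) (a minimal sketch of our own; the paper itself gives no code, and the function name and zero-scatter guard are illustrative assumptions), the ratio can be computed from the per-document values of a feature in each class, whatever estimation method produces those values:

```python
def fisher_ratio(pos_vals, neg_vals):
    """F(t_k) of Formula (2): squared difference of the conditional means
    divided by the sum of the conditional variances."""
    ep = sum(pos_vals) / len(pos_vals)                          # E(t_k|P)
    en = sum(neg_vals) / len(neg_vals)                          # E(t_k|N)
    dp = sum((x - ep) ** 2 for x in pos_vals) / len(pos_vals)   # D(t_k|P)
    dn = sum((x - en) ** 2 for x in neg_vals) / len(neg_vals)   # D(t_k|N)
    if dp + dn == 0:   # guard for the degenerate zero-scatter case (our addition)
        return 0.0
    return (ep - en) ** 2 / (dp + dn)

# toy values of a feature in 3 positive and 3 negative texts:
print(round(fisher_ratio([1, 1, 0], [0, 0, 0]), 3))  # well separated -> 2.0
print(round(fisher_ratio([1, 0, 1], [1, 0, 0]), 3))  # weakly separated -> 0.25
```

A larger ratio means the feature’s values differ more between the two categories relative to their spread within each category, which is exactly what makes it a useful significance measure for feature selection.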
It is obvious that (E(t_k|P) - E(t_k|N))^2 is the between-class scatter degree and D(t_k|P) + D(t_k|N) is the sum of the within-class scatter degrees.

In applications, the probabilities involved in Formula (2) can be estimated in different ways. In this paper, two methods, based on Boolean values and on frequencies respectively, are adopted for probability estimation.

(1) Fisher’s discriminant ratio based on Boolean value

Let d_{P,i} (i = 1, 2, ..., m) and d_{N,j} (j = 1, 2, ..., n) denote the i-th positive text and the j-th negative text, respectively. Random variables d_{P,i}(t_k) and d_{N,j}(t_k) are defined as follows:

d_{P,i}(t_k) = 1 if t_k occurs in d_{P,i}, and 0 otherwise;
d_{N,j}(t_k) = 1 if t_k occurs in d_{N,j}, and 0 otherwise.

Let m_1 (n_1) and m_0 (n_0) be the numbers of positive (negative) texts with feature t_k and without feature t_k, respectively. Obviously, the random variables d_{P,i}(t_k) and d_{N,j}(t_k) have the following distributions:

P(d_{P,i}(t_k) = l) = m_l / m, l = 1, 0;    P(d_{N,j}(t_k) = l) = n_l / n, l = 1, 0.

It should be noted that the d_{P,i}(t_k) (i = 1, 2, ..., m) are independent and identically distributed, and so are the d_{N,j}(t_k) (j = 1, 2, ..., n). Hence d_{P,i}(t_k) (i = 1, 2, ..., m) can be regarded as a sample of size m. Then the conditional means and conditional variances of the feature t_k with respect to the categories P and N in Formula (2) can be estimated by Formulas (3)-(6):

E(t_k|P) = E\left(\frac{1}{m}\sum_{i=1}^{m} d_{P,i}(t_k)\right) = \frac{1}{m}\sum_{i=1}^{m} E(d_{P,i}(t_k)) = \frac{m_1}{m}    (3)

E(t_k|N) = E\left(\frac{1}{n}\sum_{j=1}^{n} d_{N,j}(t_k)\right) = \frac{n_1}{n}    (4)

D(t_k|P) = \frac{1}{m}\sum_{i=1}^{m}\left(d_{P,i}(t_k) - \frac{m_1}{m}\right)^2    (5)

D(t_k|N) = \frac{1}{n}\sum_{j=1}^{n}\left(d_{N,j}(t_k) - \frac{n_1}{n}\right)^2    (6)
Hence, we have

F(t_k) = \frac{(E(t_k|P) - E(t_k|N))^2}{D(t_k|P) + D(t_k|N)} = \frac{(m_1 n - m n_1)^2}{mn^2 \sum_{i=1}^{m}\left(d_{P,i}(t_k) - \frac{m_1}{m}\right)^2 + nm^2 \sum_{j=1}^{n}\left(d_{N,j}(t_k) - \frac{n_1}{n}\right)^2}    (7)

Fisher’s discriminant ratio F(t_k) based on Boolean values is subsequently denoted by F_B(t_k).

(2) Fisher’s discriminant ratio based on frequency

Obviously, Fisher’s discriminant ratio based on Boolean values does not consider how often a feature appears in a given text, but only whether it appears. In order to examine the influence of frequency on the significance of a feature, another probability estimation method, based on frequency, is adopted in Formula (2). Let v_{P,i} and v_{N,j} be the numbers of word tokens of the texts d_{P,i} and d_{N,j}, and let w_{P,i}(t_k) and w_{N,j}(t_k) be the frequencies of t_k in d_{P,i} and d_{N,j}, respectively. Then the conditional means and conditional variances of the feature t_k with respect to the categories P and N in Formula (2) can be estimated by Formulas (8)-(11):

E(t_k|P) = \frac{1}{m}\sum_{i=1}^{m}\frac{w_{P,i}(t_k)}{v_{P,i}}    (8)

E(t_k|N) = \frac{1}{n}\sum_{j=1}^{n}\frac{w_{N,j}(t_k)}{v_{N,j}}    (9)

D(t_k|P) = \frac{1}{m}\sum_{i=1}^{m}\left(\frac{w_{P,i}(t_k)}{v_{P,i}} - E(t_k|P)\right)^2    (10)

D(t_k|N) = \frac{1}{n}\sum_{j=1}^{n}\left(\frac{w_{N,j}(t_k)}{v_{N,j}} - E(t_k|N)\right)^2    (11)

Hence, we have

F(t_k) = \frac{\left(\frac{1}{m}\sum_{i=1}^{m}\frac{w_{P,i}(t_k)}{v_{P,i}} - \frac{1}{n}\sum_{j=1}^{n}\frac{w_{N,j}(t_k)}{v_{N,j}}\right)^2}{\frac{1}{m}\sum_{i=1}^{m}\left(\frac{w_{P,i}(t_k)}{v_{P,i}} - E(t_k|P)\right)^2 + \frac{1}{n}\sum_{j=1}^{n}\left(\frac{w_{N,j}(t_k)}{v_{N,j}} - E(t_k|N)\right)^2} = \frac{mn\left(n\sum_{i=1}^{m}\frac{w_{P,i}(t_k)}{v_{P,i}} - m\sum_{j=1}^{n}\frac{w_{N,j}(t_k)}{v_{N,j}}\right)^2}{n^3\sum_{i=1}^{m}\left(m\frac{w_{P,i}(t_k)}{v_{P,i}} - \sum_{i=1}^{m}\frac{w_{P,i}(t_k)}{v_{P,i}}\right)^2 + m^3\sum_{j=1}^{n}\left(n\frac{w_{N,j}(t_k)}{v_{N,j}} - \sum_{j=1}^{n}\frac{w_{N,j}(t_k)}{v_{N,j}}\right)^2}    (12)

Fisher’s discriminant ratio F(t_k) based on frequency is subsequently denoted by F_F(t_k).
2.2 Process of Feature Selection

(1) Candidate feature sets

In order to compare the classification effects of features from different regions, we design two word sets as candidate feature sets. One of them, denoted by U, consists of all words in the text set. The other candidate feature set, I, contains all words that appear in both positive and negative texts.

(2) Features used in the classification model

The idea of Fisher’s discriminant ratio implies that it can be used as a significance measure of features for classification problems. The larger the value of Fisher’s discriminant ratio of a feature is, the stronger the classification capability of the feature is. So we compute the value of Fisher’s discriminant ratio for every candidate feature, rank the features in descending order, and then choose the top features of a given number.

3 The Process of Text Sentiment Classification

The whole experiment process is divided into training and testing parts:

(1) Segment and preprocess the review texts.
(2) Using the methods introduced in Section 2, obtain the candidate feature sets, and then select the features for the classification model.
(3) Express each text as a feature weight vector. Here, the feature weights are computed using TFIDF [2].
(4) Train the support vector classification machine on the training data to obtain a classifier.
(5) Test the performance of the classifier on the testing data.

It has been verified that support vector machines possess the best performance for the text sentiment classification problem [1,10,11]. Therefore we adopt a support vector machine to construct the classifier in this paper.

4 Experiment Result

4.1 Experiment Corpus and Evaluation Measures

(1) We collected 1006 Chinese review texts about 11 car brands published on http://www.xche.com.cn/baocar/ from January 2006 to March 2007. The corpus contains 578 positive reviews and 428 negative reviews.
In total, the reviews contain about one million words.

(2) To evaluate the effectiveness of the proposed feature selection methods, three classical evaluation measures commonly used in text classification, Precision, Recall and F1 value, are adopted in this paper. By PP (PN), RP (RN) and FP (FN) we denote the Precision, Recall and F1 value of the positive (negative) review texts, respectively. By F we denote the F1 value over all review texts.

4.2 Experiment Result and Analysis

For convenience, a simple notation, Candidate Feature Set + Feature Selection Method, indicates which candidate feature set and which feature selection method were adopted in an experiment. For example, U + F_B(t_k) stands for an experiment adopting the candidate feature set U and the feature significance measure F_B(t_k).

Experiment 1: In order to examine the influence of the feature dimension on classification performance, 500, 1000 and 2000 features are selected from the candidate feature set I by using the methods in Section 2. The results are shown in Table 1.

Table 1. Classification effects of different feature dimensions using the I + F_F(t_k) approach

Measure | 500   | 1000  | 2000
PP      | 82.65 | 82.90 | 80.39
RP      | 85.57 | 89.39 | 89.39
FP      | 84.08 | 86.00 | 84.60
PN      | 79.81 | 84.20 | 82.80
RN      | 75.53 | 75.06 | 70.12
FN      | 77.61 | 79.30 | 75.79
F       | 81.3  | 83.3  | 81.2

Table 1 shows that almost all evaluation measures are best at a feature dimension of 1000. In other words, a larger number of features does not necessarily imply a better classification result. So the feature dimension in all subsequent experiments is 1000.

Experiment 2: The aim of this experiment is to compare the effects of six feature selection approaches, each with 1000 dimensions, on text sentiment classification. Here MI and IG stand for Mutual Information and Information Gain, respectively. The text sentiment classification results are shown in Table 2.
Table 2. Classification effects of six feature selection approaches with 1000 dimensions

Approach     | PP(%) | RP(%) | FP(%) | PN(%) | RN(%) | FN(%) | F(%)
U + MI       | 71.68 | 49.39 | 58.48 | 53.25 | 74.35 | 62.06 | 62.84
U + IG       | 81.19 | 88.52 | 84.61 | 82.58 | 72.00 | 76.72 | 81.50
U + F_B(t_k) | 81.75 | 90.61 | 85.86 | 85.25 | 72.24 | 77.95 | 82.80
U + F_F(t_k) | 76.73 | 84.35 | 80.32 | 78.85 | 65.41 | 70.15 | 76.30
I + F_B(t_k) | 81.67 | 85.04 | 83.32 | 79.25 | 74.12 | 76.60 | 80.40
I + F_F(t_k) | 82.90 | 89.39 | 86.00 | 84.20 | 75.06 | 79.29 | 83.30

From Table 2 it can be seen that:

(1) Among the three feature significance measures, Mutual Information, Information Gain and Fisher’s discriminant ratio, MI performs worst.
(2) Comparing IG with Fisher’s discriminant ratio, only in the case U + F_F(t_k) is the performance of Fisher’s discriminant ratio worse than that of IG.
(3) The performance of the two variants of Fisher’s discriminant ratio depends on the candidate feature set to some extent. Fisher’s discriminant ratio based on Boolean values performs better when U is adopted as the candidate set.
(4) Among all six approaches, Fisher’s discriminant ratio based on frequency performs best when the candidate feature set is I.

Remarks:

(1) Since the computing time of the feature selection process depends on the size of the candidate feature set, it is advisable to design a smaller candidate feature set, such as I in this paper.
(2) For the text sentiment classification problem, several feature selection methods, such as MI, IG, CHI and DF, have been compared in the literature [11]. Among these methods, IG was validated as the best in past research. However, the experiments in this paper show that Fisher’s discriminant ratio based on frequency is a better choice than IG for feature selection.

5 Conclusions

Text sentiment classification can be widely applied to text filtering, online opinion tracking, public opinion poll analysis, and chat systems.
However, compared with traditional topic classification, more factors need to be considered in text sentiment classification; in particular, feature selection for text sentiment classification is more difficult. In this paper, at the granularity level of words, a new feature selection method based on Fisher’s discriminant ratio is proposed. To validate the effectiveness of the proposed method, we compared it with methods based on information gain and mutual information, with a support vector machine adopted as the classifier. Six subexperiments were conducted by combining the different feature selection methods with two candidate feature sets. The experimental results show that I + F_F(t_k) obtains the best classification effectiveness, with an F value of 83.3%. Our further research will focus on establishing a sentiment knowledge base built on vocabulary, syntax, semantics and ontology.

Acknowledgements. This work was supported by the National Natural Science Foundation (No. 60875040), the Foundation of Doctoral Program Research of the Ministry of Education of China (No. 200801080006), the Key Project of Science and Technology Research of the Ministry of Education of China (No. 207018), the Natural Science Foundation of Shanxi Province (No. 2007011042), the Open-End Fund of the Key Laboratory of Shanxi Province, the Shanxi Foundation for Tackling Key Problems in Science and Technology (No. 051129), and the Science and Technology Development Foundation of Colleges in Shanxi Province (No. 200611002).

References

[1] Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up? Sentiment Classification Using Machine Learning Techniques. In: EMNLP 2002 (2002)
[2] Turney, P.D., Littman, M.L.: Unsupervised Learning of Semantic Orientation from A Hundred-billion-word Corpus.
Technical Report EGB-1094, National Research Council Canada (2002)
[3] Turney, P.D., Littman, M.L.: Measuring Praise and Criticism: Inference of Semantic Orientation from Association. ACM Transactions on Information Systems (TOIS) 21(4), 315–346 (2003)
[4] Mullen, T., Collier, N.: Sentiment Analysis Using Support Vector Machines with Diverse Information Sources. In: Lin, D., Wu, D. (eds.) Proceedings of EMNLP 2004, Barcelona, Spain, July 2004, pp. 412–418 (2004)
[5] Gamon, M.: Sentiment Classification on Customer Feedback Data: Noisy Data, Large Feature Vectors, and the Role of Linguistic Analysis. In: Proceedings of the 20th International Conference on Computational Linguistics (2004)
[6] Kennedy, A., Inkpen, D.: Sentiment Classification of Movie and Product Reviews Using Contextual Valence Shifters. In: Workshop on the Analysis of Informal and Formal Information Exchange during Negotiations (2005)
[7] Chaovalit, P., Zhou, L.: Movie Review Mining: A Comparison Between Supervised and Unsupervised Classification Approaches. In: Proceedings of the 38th Hawaii International Conference on System Sciences, Big Island, Hawaii, pp. 1–9 (2005)
[8] Ye, Q., Lin, B., Li, Y.: Sentiment Classification for Chinese Reviews: A Comparison Between SVM and Semantic Approaches. In: The 4th International Conference on Machine Learning and Cybernetics, ICMLC 2005 (June 2005)
[9] Ye, Q., Zhang, Z., Law, R.: Sentiment Classification of Online Reviews to Travel Destinations by Supervised Machine Learning Approaches. Expert Systems with Applications (in press)
[10] Wang, S., Wei, Y., Li, D., Zhang, W., Li, W.: A Hybrid Method of Feature Selection for Chinese Text Sentiment Classification. In: Proceedings of the 4th International Conference on Fuzzy Systems and Knowledge Discovery, pp. 435–439.
IEEE Computer Society, Los Alamitos (2007)
[11] Abbasi, A., Chen, H., Salem, A.: Sentiment Analysis in Multiple Languages: Feature Selection for Opinion Classification in Web Forums. ACM Transactions on Information Systems 26(3) (2008)
[12] Tan, S., Zhang, J.: An Empirical Study of Sentiment Analysis for Chinese Documents. Expert Systems with Applications 34, 2622–2629 (2008)
[13] Yi, J., Nasukawa, T., Bunescu, R., Niblack, W.: Sentiment Analyzer: Extracting Sentiments About A Given Topic Using Natural Language Processing Techniques. In: Proceedings of the 3rd IEEE International Conference on Data Mining, pp. 427–434 (2003)
[14] Webb, A.R.: Statistical Pattern Recognition, 2nd edn. John Wiley & Sons, Inc., Chichester
[15] Yang, Y., Liu, X.: A Re-examination of Text Categorization Methods. In: SIGIR 1999, pp. 42–49 (1999)
[16] Yang, Y., Pedersen, J.O.: A Comparative Study on Feature Selection in Text Categorization. In: ICML 1997, pp. 412–420 (1997)