A Feature Selection Method Based on Fisher’s
Discriminant Ratio for Text Sentiment Classification
Suge Wang1,3, Deyu Li2,3, Yingjie Wei4, and Hongxia Li1
1 School of Mathematics Science, Shanxi University, Taiyuan 030006, China
2 School of Computer & Information Technology, Shanxi University, Taiyuan 030006, China
3 Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan 030006, China
4 Science Press, Beijing 100717, China
[email protected]
Abstract. With the rapid growth of e-commerce, product reviews on the Web have become an important information source for customers' decision making when they intend to buy some product. As the reviews are often too many for customers to go through, how to automatically classify them into different sentiment orientation categories (i.e., positive/negative) has become a research problem. In this paper, an effective feature selection method based on Fisher's discriminant ratio is proposed for product review text sentiment classification. To validate the effectiveness of the proposed method, we compared it with methods based on information gain and mutual information respectively, with a support vector machine adopted as the classifier. Six sub-experiments were conducted by combining the different feature selection methods with two kinds of candidate feature sets. On a corpus of 1006 car review documents, the experimental results indicate that Fisher's discriminant ratio based on word frequency estimation achieves the best performance, with an F value of 83.3%, when the candidate features are the words that appear in both positive and negative texts.
1 Introduction
Owing to its openness, virtualization and sharing, the Internet has rapidly become a platform for people to express their opinions, attitudes, feelings and emotions. With the rapid development of e-commerce, product reviews on the Web have become a very important information source for customers' decision making when they plan to buy products online or offline. As the reviews are often too many for customers to go through, an automatic text sentiment classifier can help a customer rapidly grasp the review orientations (e.g., positive/negative) about some product. Text sentiment classification aims to automatically judge the sentiment orientation, positive ('thumbs up') or negative ('thumbs down'), of a review text by mining and analyzing the subjective information in the text, such as standpoint, view, attitude and mood. Text sentiment classification can be widely applied in many fields such as public opinion analysis,
online product tracking, question answering systems and text summarization. However, review texts on the Web, such as those on BBS, blogs or forum websites, are often non-structured or semi-structured. Unlike for structured data, feature selection is a crucial problem in classifying non-structured or semi-structured data.
Although there has been a recent surge of interest in text sentiment classification, the state-of-the-art techniques for it are much less mature than those for text topic classification. This is partially attributed to the fact that topics are represented objectively and explicitly with keywords, while sentiments are expressed with subtlety. Text sentiments are hidden in a large amount of subjective information in the text, such as standpoint, view, attitude and mood. Therefore, text sentiment classification requires deeper analysis and understanding of textual statement information and is thus more challenging. In recent years, by employing machine learning techniques such as Naive Bayes (NB), maximum entropy classification (ME) and support vector machines (SVM), much research has been conducted on English text sentiment classification [1-7] and on Chinese text sentiment classification [8-11].
One major particularity, and difficulty, of the text sentiment classification problem is the high dimension of the features used to describe texts, which raises big hurdles in applying many sophisticated learning algorithms. The aim of feature selection methods is to reduce the original feature set by removing features that are considered irrelevant for text sentiment classification, so as to improve classification accuracy and decrease the running time of learning algorithms [11]. Tan et al. (2008) presented an empirical study of sentiment categorization on Chinese documents [12]. In their work four feature selection methods, Information Gain (IG), Mutual Information (MI), CHI and Document Frequency (DF), were adopted. Their experimental results indicate that IG performs best for sentimental term selection and SVM exhibits the best performance for Chinese text sentiment classification. Yi et al. (2003) developed and tested two feature term selection algorithms based on a mixture language model and likelihood ratio [13]. Wang et al. (2007) presented a hybrid method for feature selection based on the category distinguishing capability of feature words and IG [10].
In this paper, from the viewpoint of a feature's contribution to distinguishing text categories, an effective feature selection method is proposed for text sentiment classification. By estimating the probabilities involved in two ways, i.e., by Boolean value and by word frequency, two variants of the feature selection measure are obtained, which combine with two candidate feature sets to give four feature selection approaches. Experiments using a support vector machine (SVM) indicate that this approach is effective for review text sentiment classification.
The remainder of this paper is organized as follows: Section 2 presents the proposed feature selection methods. Section 3 describes the process of text sentiment classification. Section 4 presents the experiments used to evaluate the effectiveness of the proposed approach and discusses the results. Section 5 concludes with closing remarks and future directions.
2 Feature Selection Method Based on Fisher’s Discriminant Ratio
Fisher's linear discriminant is an efficient approach for dimension reduction in statistical pattern recognition [14]. Its main idea can be briefly described as follows.
Suppose that there are two classes of sample points in a d-dimensional data space. We hope to find a line in the original space such that the projections of the sample points onto the line can be separated as much as possible by some point on the line. In other words, the bigger the square of the difference between the means of the two classes of projected sample points, and at the same time the smaller the within-class scatters, the better the expected line. More formally, we construct the following function, the so-called Fisher's discriminant ratio:
$$J_F(w) = \frac{(m_1 - m_2)^2}{S_1^2 + S_2^2} \qquad (1)$$

where $w$ is the direction vector of the expected line, and $m_i$ and $S_i^2$ ($i = 1, 2$) are the mean and the within-class scatter of the $i$th class respectively. The above idea is then to find a $w$ such that $J_F(w)$ achieves its maximum.
2.1 Improved Fisher’s Discriminant Ratio
In fact, Fisher's discriminant ratio can be adapted to evaluate the category distinguishing capability of a single feature by taking the projection direction $w$ to be the dimension corresponding to that feature. For this purpose, Fisher's discriminant ratio is reconstructed as follows:
$$F(t_k) = \frac{(E(t_k \mid P) - E(t_k \mid N))^2}{D(t_k \mid P) + D(t_k \mid N)} \qquad (2)$$

where $E(t_k \mid P)$ and $E(t_k \mid N)$ are the conditional means of the feature $t_k$ with respect to the categories P and N respectively, and $D(t_k \mid P)$ and $D(t_k \mid N)$ are the conditional variances of $t_k$ with respect to P and N respectively. It is obvious that $(E(t_k \mid P) - E(t_k \mid N))^2$ is the between-class scatter degree and $D(t_k \mid P) + D(t_k \mid N)$ is the sum of the within-class scatter degrees.
In applications, the probabilities involved in Formula (2) can be estimated in different ways. In this paper, two kinds of methods respectively based on Boolean value
and frequency are adopted for probability estimation.
(1) Fisher’s discriminant ratio based on Boolean value
Let $d_{P,i}$ ($i = 1, 2, \ldots, m$) and $d_{N,j}$ ($j = 1, 2, \ldots, n$) denote the $i$th positive text and the $j$th negative text respectively. Random variables $d_{P,i}(t_k)$ and $d_{N,j}(t_k)$ are defined as follows:

$$d_{P,i}(t_k) = \begin{cases} 1 & \text{if } t_k \text{ occurs in } d_{P,i} \\ 0 & \text{otherwise} \end{cases} \qquad d_{N,j}(t_k) = \begin{cases} 1 & \text{if } t_k \text{ occurs in } d_{N,j} \\ 0 & \text{otherwise} \end{cases}$$
Let $m_1$ ($n_1$) and $m_0$ ($n_0$) be the numbers of positive (negative) texts containing feature $t_k$ and not containing feature $t_k$ respectively. Obviously, the random variables $d_{P,i}(t_k)$ and $d_{N,j}(t_k)$ follow the distributions

$$P(d_{P,i}(t_k) = l) = m_l / m, \quad l = 1, 0; \qquad P(d_{N,j}(t_k) = l) = n_l / n, \quad l = 1, 0.$$

It should be noted that the $d_{P,i}(t_k)$ ($i = 1, 2, \ldots, m$) are independent with the same distribution, and so are the $d_{N,j}(t_k)$ ($j = 1, 2, \ldots, n$). Thus $d_{P,i}(t_k)$ ($i = 1, 2, \ldots, m$) can be regarded as a sample of size $m$. Then the conditional means and the conditional variances of the feature $t_k$ with respect to the categories P and N in Formula (2) can be estimated by using Formulas (3)-(6).
$$E(t_k \mid P) = E\Big(\frac{1}{m}\sum_{i=1}^{m} d_{P,i}(t_k)\Big) = \frac{1}{m}\sum_{i=1}^{m} E(d_{P,i}(t_k)) = \frac{m_1}{m} \qquad (3)$$

$$E(t_k \mid N) = E\Big(\frac{1}{n}\sum_{j=1}^{n} d_{N,j}(t_k)\Big) = \frac{n_1}{n} \qquad (4)$$

$$D(t_k \mid P) = \frac{1}{m}\sum_{i=1}^{m}\Big(d_{P,i}(t_k) - \frac{m_1}{m}\Big)^2 \qquad (5)$$

$$D(t_k \mid N) = \frac{1}{n}\sum_{j=1}^{n}\Big(d_{N,j}(t_k) - \frac{n_1}{n}\Big)^2 \qquad (6)$$

Hence, we have

$$F(t_k) = \frac{(E(t_k \mid P) - E(t_k \mid N))^2}{D(t_k \mid P) + D(t_k \mid N)} = \frac{(m_1 n - m n_1)^2}{m n^2 \sum_{i=1}^{m}\big(d_{P,i}(t_k) - \frac{m_1}{m}\big)^2 + n m^2 \sum_{j=1}^{n}\big(d_{N,j}(t_k) - \frac{n_1}{n}\big)^2} \qquad (7)$$

The Fisher's discriminant ratio $F(t_k)$ based on Boolean value is subsequently denoted by $F_B(t_k)$.
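To make the Boolean-value estimation concrete, the following is a minimal Python sketch (not from the original paper) that computes $F_B(t_k)$ directly from Formulas (3)-(7); the toy reviews, the set representation of texts, and the handling of the zero-scatter case are our own illustrative assumptions.

```python
def fisher_ratio_boolean(pos_texts, neg_texts, term):
    """F_B(t_k): Fisher's discriminant ratio with Boolean (presence/absence)
    estimation, following Formulas (3)-(7). Texts are sets of segmented words."""
    m, n = len(pos_texts), len(neg_texts)
    # d_{P,i}(t_k) and d_{N,j}(t_k): 1 if the term occurs in the text, else 0
    dp = [1 if term in text else 0 for text in pos_texts]
    dn = [1 if term in text else 0 for text in neg_texts]
    mean_p, mean_n = sum(dp) / m, sum(dn) / n          # Formulas (3)-(4)
    var_p = sum((x - mean_p) ** 2 for x in dp) / m     # Formula (5)
    var_n = sum((x - mean_n) ** 2 for x in dn) / n     # Formula (6)
    denom = var_p + var_n
    # Zero within-class scatter (a perfectly separating term) is treated here
    # as maximal significance; this guard is an assumption, not from the paper.
    return (mean_p - mean_n) ** 2 / denom if denom > 0 else float("inf")

# Hypothetical toy data: each review is a set of segmented words
pos = [{"good", "engine"}, {"good", "comfortable"}, {"engine"}]
neg = [{"noisy"}, {"good", "noisy"}]
print(fisher_ratio_boolean(pos, neg, "good"))  # ~0.059
```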
(2) Fisher's discriminant ratio based on frequency
It is obvious that Fisher's discriminant ratio based on Boolean value does not consider how often a feature appears in a text, but only whether or not it appears. In order to examine the influence of frequency on the significance of a feature, another probability estimation method, based on frequency, is adopted in Formula (2).
Let $v_{P,i}$ and $v_{N,j}$ be the numbers of word tokens of the texts $d_{P,i}$ and $d_{N,j}$, and let $w_{P,i}(t_k)$ and $w_{N,j}(t_k)$ be the frequencies of $t_k$ appearing in $d_{P,i}$ and $d_{N,j}$ respectively. Then the conditional means and the conditional variances of the feature $t_k$ with respect to the categories P and N in Formula (2) can be estimated by using Formulas (8)-(11).

$$E(t_k \mid P) = \frac{1}{m}\sum_{i=1}^{m}\frac{w_{P,i}(t_k)}{v_{P,i}} \qquad (8)$$

$$E(t_k \mid N) = \frac{1}{n}\sum_{j=1}^{n}\frac{w_{N,j}(t_k)}{v_{N,j}} \qquad (9)$$

$$D(t_k \mid P) = \frac{1}{m}\sum_{i=1}^{m}\Big(\frac{w_{P,i}(t_k)}{v_{P,i}} - E(t_k \mid P)\Big)^2 \qquad (10)$$

$$D(t_k \mid N) = \frac{1}{n}\sum_{j=1}^{n}\Big(\frac{w_{N,j}(t_k)}{v_{N,j}} - E(t_k \mid N)\Big)^2 \qquad (11)$$
Hence, we have

$$F(t_k) = \frac{(E(t_k \mid P) - E(t_k \mid N))^2}{D(t_k \mid P) + D(t_k \mid N)} = \frac{\Big(\frac{1}{m}\sum_{i=1}^{m}\frac{w_{P,i}(t_k)}{v_{P,i}} - \frac{1}{n}\sum_{j=1}^{n}\frac{w_{N,j}(t_k)}{v_{N,j}}\Big)^2}{\frac{1}{m}\sum_{i=1}^{m}\Big(\frac{w_{P,i}(t_k)}{v_{P,i}} - E(t_k \mid P)\Big)^2 + \frac{1}{n}\sum_{j=1}^{n}\Big(\frac{w_{N,j}(t_k)}{v_{N,j}} - E(t_k \mid N)\Big)^2} \qquad (12)$$

The Fisher's discriminant ratio $F(t_k)$ based on frequency is subsequently denoted by $F_F(t_k)$.
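Analogously, a minimal sketch (our own illustration) of $F_F(t_k)$ under Formulas (8)-(11), assuming each text is given as a list of word tokens; the toy data are hypothetical.

```python
def fisher_ratio_frequency(pos_texts, neg_texts, term):
    """F_F(t_k): Fisher's discriminant ratio with relative-frequency
    estimation, following Formulas (8)-(11). Texts are lists of word tokens."""
    m, n = len(pos_texts), len(neg_texts)
    # w(t_k)/v: relative frequency of the term within each text
    rp = [text.count(term) / len(text) for text in pos_texts]
    rn = [text.count(term) / len(text) for text in neg_texts]
    mean_p, mean_n = sum(rp) / m, sum(rn) / n          # Formulas (8)-(9)
    var_p = sum((x - mean_p) ** 2 for x in rp) / m     # Formula (10)
    var_n = sum((x - mean_n) ** 2 for x in rn) / n     # Formula (11)
    denom = var_p + var_n
    return (mean_p - mean_n) ** 2 / denom if denom > 0 else float("inf")

# Hypothetical segmented reviews as token lists
pos = [["good", "engine", "good"], ["comfortable", "good"]]
neg = [["noisy", "engine"], ["bad", "noisy", "engine"]]
print(fisher_ratio_frequency(pos, neg, "engine"))  # 1.8
```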
2.2 Process of Feature Selection
(1) Candidate feature sets
In order to compare the classification effects of features drawn from different regions, we design two kinds of word sets as the candidate feature sets. One of them, denoted by $U$, consists of all words in the text set. The other candidate feature set, $I$, contains all words which appear in both positive and negative texts.
(2) Features used in the classification model
The idea of Fisher's discriminant ratio implies that it can be used as a significance measure of features for classification problems: the larger the value of Fisher's discriminant ratio of a feature, the stronger the classification capability of that feature. So we can compute the value of Fisher's discriminant ratio for every feature, rank the features in descending order, and then choose a certain number of the best ones, as sketched below.
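As a sketch of this selection step (our own illustration, reusing the scoring functions sketched in Section 2.1), the candidate sets $U$ and $I$ and the top-k ranking might be built as follows; k = 1000 mirrors the dimension fixed later in the experiments.

```python
def candidate_sets(pos_texts, neg_texts):
    """Build the two candidate feature sets: U contains all words in the
    text set; I contains the words occurring in both classes."""
    vocab_p = {w for text in pos_texts for w in text}
    vocab_n = {w for text in neg_texts for w in text}
    return vocab_p | vocab_n, vocab_p & vocab_n

def select_features(candidates, pos_texts, neg_texts, score_fn, k=1000):
    """Score every candidate with Fisher's discriminant ratio (score_fn is
    F_B or F_F), rank in descending order, and keep the k best features."""
    ranked = sorted(candidates,
                    key=lambda t: score_fn(pos_texts, neg_texts, t),
                    reverse=True)
    return ranked[:k]
```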
3 The Process of Text Sentiment Classification
The whole experiment process is divided into training and testing parts.
(1) Perform word segmentation preprocessing of the review texts.
(2) By using the methods introduced in Section 2, obtain the candidate feature sets,
and then select the features for the classification model.
(3) Express each text in the form of a feature weight vector. Here, feature weights are computed using TFIDF [2].
(4) Train the support vector classification machine by using training data, and then
obtain a classifier.
(5) Test the performance of the classifier by using testing data.
It has been verified that support vector machines achieve the best performance for the text sentiment classification problem [1,10,11]. Therefore we adopt a support vector machine to construct the classifier in this paper. A sketch of steps (3)-(5) follows.
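The paper does not state which SVM implementation was used; as a minimal sketch under the assumption of scikit-learn, steps (3)-(5) with TF-IDF weighting restricted to the selected features could look like this (the function name, test split and linear kernel are our choices, not the authors').

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.svm import LinearSVC

def run_pipeline(texts, labels, selected_features):
    """Steps (3)-(5): TF-IDF vectors over the selected features only,
    SVM training on the training split, evaluation on the test split."""
    # Texts are assumed to be already word-segmented and space-joined
    # (step (1)); the vocabulary is restricted to the features from step (2)
    vectorizer = TfidfVectorizer(vocabulary=sorted(selected_features))
    X = vectorizer.fit_transform(texts)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=0.2, random_state=0)
    clf = LinearSVC().fit(X_tr, y_tr)   # linear-kernel SVM classifier
    print(classification_report(y_te, clf.predict(X_te)))
```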
4 Experimental Results
4.1 Experimental Corpus and Evaluation Measures
(1) We collected 1006 Chinese review texts about 11 car brands, published on http://www.xche.com.cn/baocar/ from January 2006 to March 2007. The corpus contains 578 positive reviews and 428 negative reviews, with about one million words in total.
(2) To evaluate the effectiveness of the proposed feature selection methods, three classical evaluation measures generally used in text classification, Precision, Recall and F1 value, are adopted in this paper. By PP (PN), RP (RN) and FP (FN) we denote the Precision, Recall and F1 value of positive (negative) review texts respectively. By F we denote the F1 value over all review texts.
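For concreteness, a sketch of how the per-class measures are computed from classification counts; the counts shown are hypothetical, not from the experiments.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for one class, from true positives,
    false positives and false negatives of that class."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Hypothetical counts: for the negative class, the "true positives" are
# the correctly classified negative reviews
pp, rp, fp_ = precision_recall_f1(tp=90, fp=20, fn=10)   # PP, RP, FP
pn, rn, fn_ = precision_recall_f1(tp=70, fp=10, fn=20)   # PN, RN, FN
```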
4.2 Experiment Result and Analysis
For convenience, a simple notation, Candidate Feature Set + Feature Selection Method, indicates which candidate feature set and which feature selection method were adopted in an experiment. For example, $U + F_B(t_k)$ stands for the experiment in which the candidate feature set $U$ and the feature significance measure $F_B(t_k)$ are adopted.
Experiment 1: In order to examine the influence of the feature dimension on classification performance, 500, 1000 and 2000 features are selected from the candidate feature set $I$ by using the methods in Section 2. The experimental results are shown in Table 1.
Table 1. Classification effects of different feature dimensions using the $I + F_F(t_k)$ approach (%)

Dimension   PP      RP      FP      PN      RN      FN      F
500         82.65   85.57   84.08   79.81   75.53   77.61   81.3
1000        82.90   89.39   86.00   84.20   75.06   79.30   83.3
2000        80.39   89.39   84.60   82.80   70.12   75.79   81.2
Table 1 shows that almost all evaluation measures are best at the 1000-feature dimension. In other words, a larger number of features does not necessarily imply a better classification result. Therefore the feature dimension in the subsequent experiments is fixed at 1000.
Experiment 2: The aim of this experiment is to compare the effects of 6 kinds of feature selection approaches, each with dimension 1000, for text sentiment classification. Here MI and IG stand for Mutual Information and Information Gain respectively. The text sentiment classification results are shown in Table 2.
Table 2. Classification effects based on 6 kinds of feature selection approaches with dimension 1000

Approach        PP(%)   RP(%)   FP(%)   PN(%)   RN(%)   FN(%)   F(%)
U + MI          71.68   49.39   58.48   53.25   74.35   62.06   62.84
U + IG          81.19   88.52   84.61   82.58   72.00   76.72   81.50
U + FB(tk)      81.75   90.61   85.86   85.25   72.24   77.95   82.80
U + FF(tk)      76.73   84.35   80.32   78.85   65.41   70.15   76.30
I + FB(tk)      81.67   85.04   83.32   79.25   74.12   76.60   80.40
I + FF(tk)      82.90   89.39   86.00   84.20   75.06   79.29   83.30
From Table 2 it can be seen that:
(1) Among the three feature significance measures, Mutual Information, Information Gain and Fisher's discriminant ratio, the performance of MI is the worst.
(2) Comparing IG with Fisher's discriminant ratio, only in the case of $U + F_F(t_k)$ is the performance of Fisher's discriminant ratio worse than that of IG.
(3) The performances of the two variants of Fisher's discriminant ratio rely on the candidate feature set to some extent. Fisher's discriminant ratio based on Boolean value performs better when $U$ is adopted as the candidate set.
(4) Among all 6 approaches, Fisher's discriminant ratio based on frequency performs best when the candidate feature set is $I$.
Remarks: (1) Since the computing time of the feature selection process depends upon the size of the candidate feature set, it is advisable to design a smaller candidate feature set, such as $I$ in this paper. (2) For the text sentiment classification problem, many feature selection methods such as MI, IG, CHI and DF have been compared in the literature [11]. Among these methods, IG was validated to be the best in past research. However, the experiments in this paper show that Fisher's discriminant ratio based on frequency is a better choice than IG for feature selection.
5 Conclusions
Text sentiment classification can be widely applied to text filtering, online opinion tracking, public-opinion poll analysis, and chat systems. However, compared with traditional subject classification, more factors need to be considered in text sentiment classification. In particular, feature selection for text sentiment classification is more difficult. In this paper, at the granularity level of words, a new feature selection method based on Fisher's discriminant ratio is proposed. To validate the effectiveness of the proposed method, we compared it with methods based on information gain and mutual information respectively, with a support vector machine adopted as the classifier. Six sub-experiments were conducted by combining the different feature selection methods with two kinds of candidate feature sets. The experimental results show that $I + F_F(t_k)$ obtains the best classification effectiveness, with an F value of 83.3%. Our further research will focus on establishing a sentiment knowledge base built on vocabulary, syntax, semantics and ontology.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (No. 60875040), the Doctoral Program Research Foundation of the Ministry of Education of China (No. 200801080006), the Key Project of Science and Technology Research of the Ministry of Education of China (No. 207018), the Natural Science Foundation of Shanxi Province (No. 2007011042), the Open-End Fund of the Key Laboratory of Shanxi Province, the Shanxi Foundation for Tackling Key Problems in Science and Technology (No. 051129), and the Science and Technology Development Foundation of Colleges in Shanxi Province (No. 200611002).
References
[1] Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up? Sentiment Classification Using Machine Learning Techniques. In: Proceedings of EMNLP 2002 (2002)
[2] Turney, P.D., Littman, M.L.: Unsupervised Learning of Semantic Orientation from a Hundred-Billion-Word Corpus. Technical Report ERB-1094, National Research Council Canada (2002)
[3] Turney, P.D., Littman, M.L.: Measuring Praise and Criticism: Inference of Semantic Orientation from Association. ACM Transactions on Information Systems (TOIS) 21(4), 315–346 (2003)
[4] Mullen, T., Collier, N.: Sentiment Analysis Using Support Vector Machines with Diverse Information Sources. In: Lin, D., Wu, D. (eds.) Proceedings of EMNLP 2004, Barcelona, Spain, July 2004, pp. 412–418 (2004)
[5] Gamon, M.: Sentiment Classification on Customer Feedback Data: Noisy Data, Large Feature Vectors, and the Role of Linguistic Analysis. In: Proceedings of the 20th International Conference on Computational Linguistics (2004)
[6] Kennedy, A., Inkpen, D.: Sentiment Classification of Movie and Product Reviews Using Contextual Valence Shifters. In: Workshop on the Analysis of Informal and Formal Information Exchange during Negotiations (2005)
[7] Chaovalit, P., Zhou, L.: Movie Review Mining: A Comparison Between Supervised and Unsupervised Classification Approaches. In: Proceedings of the 38th Hawaii International Conference on System Sciences, Big Island, Hawaii, pp. 1–9 (2005)
[8] Ye, Q., Lin, B., Li, Y.: Sentiment Classification for Chinese Reviews: A Comparison Between SVM and Semantic Approaches. In: The 4th International Conference on Machine Learning and Cybernetics (ICMLC 2005) (June 2005)
[9] Ye, Q., Zhang, Z., Law, R.: Sentiment Classification of Online Reviews to Travel Destinations by Supervised Machine Learning Approaches. Expert Systems with Applications (in press)
[10] Wang, S., Wei, Y., Li, D., Zhang, W., Li, W.: A Hybrid Method of Feature Selection for Chinese Text Sentiment Classification. In: Proceedings of the 4th International Conference on Fuzzy Systems and Knowledge Discovery, pp. 435–439. IEEE Computer Society, Los Alamitos (2007)
[11] Abbasi, A., Chen, H., Salem, A.: Sentiment Analysis in Multiple Languages: Feature Selection for Opinion Classification in Web Forums. ACM Transactions on Information Systems 26(3) (2008)
[12] Tan, S., Zhang, J.: An Empirical Study of Sentiment Analysis for Chinese Documents. Expert Systems with Applications 34, 2622–2629 (2008)
[13] Yi, J., Nasukawa, T., Bunescu, R., Niblack, W.: Sentiment Analyzer: Extracting Sentiments About a Given Topic Using Natural Language Processing Techniques. In: Proceedings of the 3rd IEEE International Conference on Data Mining, pp. 427–434 (2003)
[14] Webb, A.R.: Statistical Pattern Recognition, 2nd edn. John Wiley & Sons, Chichester (2002)
[15] Yang, Y., Liu, X.: A Re-examination of Text Categorization Methods. In: SIGIR, pp. 42–49 (1999)
[16] Yang, Y., Pedersen, J.O.: A Comparative Study on Feature Selection in Text Categorization. In: ICML, pp. 412–420 (1997)