Machine Learning Approaches to
Sentiment Classification
CMPUT 551: Course Project
Winter, 2005
SW5
Shane Bergsma
Darryl Jung
Rejean Lau
Yunping Wang
Advisors:
Dr. Shaojun Wang
Dr. Russ Greiner
Department of Computing Science
University of Alberta
Project website:
http://www.cs.ualberta.ca/~bergsma/Movies/
Table of Contents
List of Figures ..................................................................................................................... 3
List of Tables ...................................................................................................................... 3
1. Introduction..................................................................................................................... 4
2. Related Work .................................................................................................................. 7
2.1 Automated Text Categorization ...................................................................7
2.1.1 Document Indexing............................................................................................ 8
2.1.2 Dimensionality Reduction ................................................................................ 9
2.2 Sentiment Classification ............................................................................10
3 Experimental Methodology ........................................................................................... 12
3.1 Baseline Data Set......................................................................................12
3.2 Approaches to Movie Review Sentiment Classification .............................14
3.2.1 Baseline Approach ........................................................................................... 14
3.2.2 Objective Sentence Filtering............................................................................ 14
3.2.3 Web-mining of New Features.......................................................................... 16
3.2.4 Sentence-level Classification ........................................................................... 19
3.2.5 Hybrid Model................................................................................................... 20
3.3 Machine Learning Algorithms ....................................................................22
3.3.1 Support Vector Machines ................................................................................ 22
3.3.2 Naïve Bayes ..................................................................................................... 22
3.3.3 Decision Trees ................................................................................................. 23
3.3.4 Logistic Regression.......................................................................................... 24
3.3.5 Summary of Machine Learning Algorithms .................................................... 26
4. Results........................................................................................................................... 26
4.1 Comparison of Sentiment Classification using Different ML Algorithms ....26
4.2 Learning Curve with Logistic Regression ..................................................29
4.3 Hybrid Classification Results .....................................................................30
4.4 Improved Sentence-level Classification with Position Information.............32
5. Conclusion .................................................................................................................... 32
5.1 Contributions .............................................................................................32
5.2 Future Work...............................................................................................33
6. References..................................................................................................................... 34
List of Figures
Figure 1: Learning Task Diagram....................................................................................... 5
Figure 2: Performance Task Diagram................................................................................. 6
Figure 3: Word Co-occurrence Matrix ............................................................................... 8
Figure 4: Pang et al. 2002 Results ................................................................................... 11
Figure 5: Subjectivity-extracted document classifier (Pang and Lee, 2004) ..................... 15
Figure 6: Sample IMDb.com User-rating ......................................................................... 17
Figure 7: Document preprocessing for sentence based polarity features ......................... 19
Figure 8: SVM-hyperplane distances as Constituent Output/Hybrid Input...................... 21
Figure 9: Batch Gradient Ascent for Logistic Regression Algorithm .............................. 25
Figure 10: Baseline Bow datasets evaluated by Logistic Regression (convergence condition |w_{i+1} - w_i| ≤ 0.01 and step size = 0.05) ....................................................... 25
Figure 11: Machine Learning Results for four Different Approaches.............................. 27
Figure 12: Learning Curve with Logistic Regression....................................................... 29
List of Tables
Table 1: Examples of sentence polarity types................................................................... 20
Table 2: ML Tool Summary ............................................................................................. 26
Table 3: Machine Learning Results for Different Approaches......................................... 26
Table 4: SVM 10-fold Cross-Validation Performance (%) .............................................. 30
Table 5: Hybrid 10-Fold Cross-Validation Relative Drop in Performance (%)............... 30
Table 6: Classification results on the IMDb sentiment movie corpus using fixed-length sentence-level polarity feature vectors with SVM and 10-fold cross-validation............. 32
1. Introduction
It is a familiar fact that a vast number of documents are available on-line.
This has motivated fruitful research into many areas like data mining and the semantic
web. One such area of research falling under natural language processing (NLP) is that
of automated text categorization.
Here, one uses a machine to categorize a given
document according to one or more of its properties. Most of this NLP research has
focused attention on the problem of topic categorization.
That is, given a document as
raw text, the problem is to determine what its topic is (e.g., politics, business, sports).
Some have looked at automated text categorization based on other properties. Pang and Lee (2002) describe efforts to categorize documents according to their author, publisher,
source, "brow" (high-brow vs. low-brow), and genre (e.g., editorial, essay).
Recently, many web sites have emerged that offer reviews of items like books,
cars, snow tires, vacation destinations, etc. They describe the items in some detail and
evaluate them as good/bad, preferred/not preferred. So, there is motivation to categorize
these reviews in an automated way by a property other than topic, namely, by what is
called their 'sentiment' or 'polarity'.
That is, whether they recommend or do not
recommend a particular item. One speaks of a review as having positive or negative
polarity. (Note that the term 'sentiment' as used in this context does not necessarily imply
anything emotive; it does require that a judgement pro or con has been made in a
document).
Now, such automated categorization by sentiment, if it worked effectively, would
have many applications. First, it would help users quickly classify and organize on-line reviews of goods and services, political commentaries, etc. Secondly, categorization by sentiment would help businesses handle 'form-free' customer feedback. They could use it to classify and tabulate such feedback automatically and thereby determine, for instance, the percentage of happy clientele without actually having to read any customer input. Not only businesses but governments and non-profit organizations might benefit from such an application. Thirdly, categorization by sentiment could be used to filter email and other messages. A mail program might use it to eliminate so-called 'flames'. Finally, a word processor might employ it to warn an author that he is using bombastic or other undesirable language.
In this light, there is ample motivation to investigate automated categorization by sentiment.
In this report, we investigate the use of machine learning to produce a classifier
that categorizes a raw text document by its sentiment. Using machine learning for this
purpose is rather new. Until 2002, attempts at automated categorization by sentiment
were all knowledge-based (Pang and Lee, 2002).
We focus in particular on movie reviews for several reasons: (1) There are several
on-line movie review databases. For instance, the IMDb (Internet Movie Database) contains thousands of reviews. (2) Pang et al. at Cornell offer a large database of movie reviews
which they have labelled by sentiment.
(3) Movie reviews have a clear indication of
sentiment such as "two thumbs up" or "****" from which it is easy to gather labels. And
(4) Turney (2002) reported that movie reviews are one of the most difficult domains to
categorize by sentiment. Note that, although we are only considering movie reviews, the
lessons learned from this research apply beyond them.
Our algorithms and data
representation methods should apply to any kinds of documents containing sentiment
content.
Our learning task is to take text documents as inputs and to produce as output a
classifier.
More precisely, the input will consist of movie reviews whose content (excluding rating information) is represented in a format described below (the bag-of-words format), and the output will be various classifiers that categorize according to polarity.
Figure 1: Learning Task Diagram
A classifier produced by carrying out the learning task will have the performance
task of taking a movie review as an input and producing as output a classification of its
polarity--positive or negative. We will measure the success of the performance by its
percentage accuracy using 10-fold cross validation and we will use as our baseline the
results of Pang and Lee (2004).
Figure 2: Performance Task Diagram
The problem of sentiment categorization can be seen to be challenging when
compared to that of topic categorization. The latter relies heavily on observing so-called
key words—e.g., 'stock market', 'parliament'. One might think that a similar strategy
would work for sentiment categorization—just look for words like 'horrible', 'great' and
classify accordingly.
Counter-intuitively, this strategy does not perform well for
sentiment categorization. For instance, there is no negative word in "How could anyone
pay to see this movie?" Literally, that's simply a question. Or "This movie fills a well-needed void," which looks positive at first blush. Tokens like "still" and "?" seem to make a difference (Pang and Lee, 2002, §4). Memory and machine performance considerations also make the problem of sentiment categorization non-trivial. The bag-of-words strategy discussed below uses feature vectors of length greater than 14,000. The popular machine learning toolkit WEKA crashes when employing certain learning algorithms under such conditions (even after the Java VM memory allocation is increased).
The rest of this report will consider the topics of related work, our experimental
methodology, results, and conclusions. Under related work, we will discuss the problem
of automated text categorization in general, and the specific work of Pang and Lee,
Mullen and Collier, and Turney.
The experimental methodology section will describe
the data set, the machine learning algorithms that we use, and our various strategies
including sentence-level classification and a hybrid model. The results section will
compare the strategies and algorithms and provide a learning curve.
Finally, under
conclusions, we summarize our contributions and indicate directions for future work.
2. Related Work
This section outlines related work in automated text categorization and other
sentiment classification approaches.
2.1 Automated Text Categorization
The role of text categorization is to assign a document to one of a number of
pre-defined categories. Normally we wish this process to be automated -- a computer
program inspects the document and decides on the category without any manual
involvement.
Furthermore, in supervised machine learning approaches, we ask the
computer program automatically to learn the classification function from a number of
labelled example documents. To do this, a representation of the document in a form
understandable by a computer is necessary. This process of translating a document from
text into a feature vector representation is called Document Indexing and is explained
below. Once the document is represented as a feature vector, standard machine learning
approaches can be used to induce a classifier on the training set. Often, however, the
resulting representation has a problematically large number of features. We discuss
dimensionality-reduction techniques that address this issue below.
Many problems in Natural Language Processing can be viewed as a
categorization task. For example, authorship identification involves assigning a work
(the text) to an author (the category) based on analyzing previous works by a group of
authors.
Text categorization techniques are needed for language identification for
unknown languages, filtering documents (such as unwanted e-mail), routing documents
(for example, to the appropriate service representative), word sense disambiguation (the
categories are the word senses), hierarchical categorization of web pages (such as Yahoo!
categories), topic identification (including the presence of relevant, new, or developing
news items in a stream of news stories) and other tasks of indexing or organizing digital
content for a variety of purposes (Sebastiani, 2002).
In this project, we also approach sentiment classification as a text categorization
exercise. The two types of sentiment, positive and negative, are adopted as the category
labels. We represent both documents and individual sentences as feature vectors, and
learn and test sentiment classifiers on these representations.
2.1.1 Document Indexing
Creating a feature vector or other representation of a document is a process that is
known in the IR community as indexing. There are a variety of ways to represent textual
data in feature vector form; however, most are based on word co-occurrence patterns. In these approaches, a vocabulary is defined for the representation, consisting of all words that might be important to classification. This is usually done by extracting all words occurring above a certain number of times (perhaps 3 times), and defining the feature space so that each dimension corresponds to one of these words.
Figure 3: Word Co-occurrence Matrix
When representing a given textual instance (perhaps a document or a sentence),
the value of each dimension (also known as an attribute) is assigned based on whether the
word corresponding to that dimension occurs in the given textual instance.
If the
document consists of only one word, then only that corresponding dimension will have a
value, and every other dimension (i.e., every other attribute) will be zero. This is known
as the ``bag of words'' approach. Figure 3 shows this feature vector representation, with
the vocabulary along the rows and each column representing a document.
One important question is what values to use when the word is present. Perhaps
the most common approach is to weight each present word using its frequency in the
document and perhaps its frequency in the training corpus as a whole. The most common
weighting function is the tfidf (term frequency-inverse document frequency) measure, but
other approaches exist (Salton and Buckley, 1988). In most sentiment classification work, a binary weighting function is used: assigning 1 if the word is present and 0 otherwise has been shown to be most effective (Pang and Lee, 2002). In our work, we
verified the superiority of this function over tfidf and use it in all our representations.
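To make the indexing step concrete, the following short Python sketch (our own illustration, not the project's actual preprocessing scripts) builds a vocabulary from words occurring at least three times in a training corpus and then produces binary presence/absence vectors:

from collections import Counter

def build_vocabulary(documents, min_count=3):
    # Count word occurrences over the whole training corpus.
    counts = Counter(word for doc in documents for word in doc.split())
    kept = sorted(word for word, c in counts.items() if c >= min_count)
    # Each retained word gets its own dimension in the feature space.
    return {word: i for i, word in enumerate(kept)}

def binary_vector(document, vocab):
    # Binary bag-of-words: 1 if the word is present, 0 otherwise.
    vec = [0] * len(vocab)
    for word in set(document.split()):
        if word in vocab:
            vec[vocab[word]] = 1
    return vec

docs = ["a great film , a great cast", "a terrible film", "great acting"]
vocab = build_vocabulary(docs, min_count=2)          # {'a': 0, 'film': 1, 'great': 2}
print(binary_vector("a great great movie", vocab))   # [1, 0, 1]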
2.1.2 Dimensionality Reduction
Large feature vectors can be impractical for machine learning. First of all, there
are sheer feasibility issues: we, in fact, found difficulty using the Weka toolkit on many
of our data sets. Secondly, there are overfitting issues. Large numbers of features with
few example instances can lead to excellent accuracy on the training set but poor
portability to new data. Performing ``dimensionality reduction'' or ``feature selection''
has been shown to prevent overfitting and improve effectiveness (Sebastiani, 2002).
Given that it might be a good idea to remove some attributes, how does one decide which
attributes to remove?
The main issue is which words to include in the vocabulary, as this defines the
number of dimensions in the feature space. As was mentioned above, we usually restrict
words in the vocabulary to those that occur 3 or more times in the training set, preventing
many very uncommon words from having their own dimension in feature space. In our
experiments, we employ this technique.
Another common approach is the removal of a list of so-called ``stop words,''
such as ``the, of, in, on.'' These are very common in all natural language text and provide
little semantic content. Sometimes, the removal of all ``function words'' -- all articles,
prepositions, conjunctions, etc. -- is done (Sebastiani, 2002). In our work, we tried using
Dekang Lin's parser Minipar (Lin, 1998) to remove all words except adjectives.
However, although this made our feature vectors small enough for some further ML algorithms, it reduced overall SVM performance. Furthermore, we tried a standard list
of stop words taken from the BOW-implementation of McCallum (1996), but this also
reduced performance. Pang et al. (2002) also report that stop word removal impairs
performance. This seems to have to do with the different usage of some common words
in positive versus negative sentiment documents. The word ``still'' for example, seems to
be associated more with positive documents (``The acting was terrible, still I enjoyed
myself.''), while the question mark punctuation seems to flag negative reviews (``I kept
wondering, where can I get a barf bag?''). Thus, in all of our experiments, stop-word removal is not employed.
Finally, stemming can be used to group similar words (differing only in suffix)
into the same dimension (Porter, 1980). In our work, stemming was done using the
Porter stemmer (Porter, 1980), adapted from the BOW-implementation of McCallum
(1996).
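As a brief illustration of the effect of stemming, suffix variants of a word collapse onto a single vocabulary entry. (Our experiments used the stemmer adapted from the BOW toolkit; the NLTK implementation below is only a convenient stand-in.)

from nltk.stem.porter import PorterStemmer  # requires the nltk package

stemmer = PorterStemmer()
# Suffix variants map to the same stem, and hence to the same feature dimension.
print([stemmer.stem(w) for w in ["act", "acted", "acting", "actor"]])
# e.g. ['act', 'act', 'act', 'actor']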
There are a number of other approaches to dimensionality reduction in the text
categorization literature. Standard feature selection approaches have been used as a preprocessing step, such as information-theoretic or chi-squared-based term-selection
functions. Further term-clustering techniques have been adopted, as well as the use of
latent semantic indexing to reduce the dimensionality by singular value decomposition
methods (Sebastiani, 2002). Preliminary tests using feature selection in Weka were
negative. This agrees with other published literature. Gabrilovich and Markovitch (2004) show that any level of feature selection impairs the performance of SVM on the movie review data set.
Nevertheless, further investigation of advanced feature selection techniques
remains possible in future experimentation.
2.2 Sentiment Classification
Pang, Lee, and Vaithyanathan published a seminal paper (2002) on the topic of
automated categorization by sentiment. They were the first to apply machine learning to
the problem of sentiment categorization. For their training and testing domains, they
decided to use movie reviews for the reasons mentioned above. The source for their
movie reviews was IMDb.
Their experiments used 700 positive and 700 negative
reviews (so they had a uniform class distribution). To prepare these reviews, they
removed any rating information or html tags from them. For learning algorithms, Pang et
al. considered naïve Bayes, maximum entropy, and support vector machines. They used only 3-fold cross-validation to evaluate the classifiers they produced.
The following table summarizes the results of Pang et al.'s 2002 experiments.
Figure 4: Pang et al. 2002 Results
Quick inspection of the table shows that they obtain the best results for classifying into
sentiment (row 2) by using the bag-of-words representation—that is, by using unigrams
(single words) as attributes and their presence/absence as values.
Using these values
does 10% better than using word frequency as values (row 1)—a fact that is somewhat
surprising since word frequency information helps in the case of topic categorization.
Using unigrams as attributes does about 5% better than using bigrams (2 word sequences)
as attributes (row 4)—one might intuitively think that bigrams would improve
performance as they allow one to distinguish between 'not good' and 'very good'. Using
unigrams as attributes does slightly better than using unigrams together with their parts of
speech (row 5)—again, one might intuitively think that adding the part-of-speech tag to a
word would improve performance as that would allow the system to distinguish between,
say, 'book' used as a noun and used as a verb. Note that using only adjectives is not
particularly beneficial (row 6). Using 2633 adjectives performs worse than simply using
the 2633 most frequent unigrams even though adjectives might be thought to carry much
of a document's sentiment content.
These results were of great help to us in that they allowed us prima facie to rule
out many possible ways to represent a document's content for the purposes of sentiment
categorization. They justified our decision to use the bag-of-words representation as our
initial representation.
Pang and Lee (2004) investigated subjective sentence extraction. The results showed no conclusive improvement over the full, unfiltered documents; however, comparable performance was obtained.
Sentence polarity labelling was investigated by Turney (2002) using pointwise
mutual information. Here the semantic orientation of a phrase is calculated using
reference words such as “excellent” and “poor”, and more words were found using an
Altavista search and proximity to those words. Machine learning was not used; rather, the overall polarity of a document was predicted from the average of the semantic orientations of the phrases in the document.
Like us, Tony Mullen and Nigel Collier (2004) also looked at combining new
information sources with the standard bag of words approach. They developed several
measures of phrase and word favourability and built Support Vector Machine (SVM)
classifiers using these measures. These were combined with the standard bag-of-words
classifier using a hybrid SVM. In this approach, the input features to the hybrid model
come from the distances to the dividing hyperplane of the constituent SVM classifiers.
An SVM classifier is then learned from these hybrid features. In our approach, we also
combine output SVM hyperplane-distances as features in a new classifier, but we found
that learning a simple naïve bayes classifier from the hybrid features achieves the highest
performance.
Also note, in this paper Mullen and Collier report achieving the highest
performance to date using the Cornell data set, 86% with 10-fold cross-validation. In
fact, this paper was published concurrently with Pang and Lee's 2004 work, where 87% was
achieved using only the basic bag-of-words SVM approach.
3 Experimental Methodology
3.1 Baseline Data Set
Our basic data set consists of 2000 movie reviews, 1000 labelled positive and 1000 labelled negative (so they have a uniform class distribution). These were downloaded from Bo Pang's web page: http://www.cs.cornell.edu/people/pabo/movie-review-data/.
Pang et al. obtained these from an IMDb newsgroup. They selected only reviews that
gave a rating in terms of stars or a numerical score and they only considered clearly
positive and clearly negative reviews. To restrict the number of reviews written by any
particular author, they imposed an upper bound of 19 reviews per author per sentiment
category. In order to prepare the reviews, they removed any rating information or html
tags from them. A sample positive movie review is given below:
truman ( " true-man " ) burbank is the perfect name for jim carrey's character in this film .
president truman was an unassuming man who became known worldwide , in spite of ( or was it because of ) his
stature .
" truman " also recalls an era of plenty following a grim war , an era when planned communities built by government
scientists promised an idyllic life for americans .
and burbank , california , brings to mind the tonight show and the home of nbc .
if hollywood is the center of the film world , burbank is , or was , the center of tv's world , the world where our
protagonist lives .
combine all these names and concepts into " truman burbank , " and you get something that well describes him and his
artificial world .
truman leads the perfect life .
his town , his car , and his wife are picture perfect .
his idea of reality comes under attack one day when a studio light falls from the sky .
the radio explains that an overflying airplane started coming apart .
but then why would an airplane be carrying a studio light ?
the next day during the drive to work , the radio jams and he starts picking up a voice that exactly describes his
movements .
he is so distracted that he nearly hits a pedestrian .
when the radio comes back to normal , the announcer warns listeners to drive carefully .
his suspicion aroused , he wanders around the town square looking for other oddities .
the world appears to be functioning properly until he enters an office building and tries to take the elevator .
the elevator doors open up on a small lounge with people on coffee breaks .
a grip sees truman him and quickly moves a paneled door , made to look like the back of an elevator , into place .
two security guards grab him and throw him out .
truman is really suspicious now .
it gets even worse the next day when his wife , a nurse , describes an elevator accident in the building where he saw
the lounge .
" it's best not to think about it , " she says , trying vainly to change truman's memory .
truman becomes determined to see who or what is behind this apparently elaborate hoax at his expense .
at every turn he is stopped by an amazing coincidence that just happens to keep him in his own little town .
his last hope is to quell his fear of the ocean and sail to the edge of the world .
you know by now that truman's life is the subject of a television program .
his actions are " real " but everything else is carefully scripted , from the death of his father to the choice of his wife .
truman is determined to find out what the big hoax is .
meanwhile , christof , the all-seeing creator of truman's world does his best to keep him unaware and happy .
it's sort of like westworld told from the robots' point of view , or jurassic park from the dinosaurs' point of view .
we root for the captive of the cage-world .
our protagonist is counting on " chaos theory " to help him escape his elaborate trap .
the story , written by andrew niccol ( writer/director of gattaca ) , introduces some interesting questions , such as the
ethics of subjecting a person to this type of life , or the psychological impact of learning that your entire life has all been
fake .
although these questions came to mind , i don't think the film itself asked them .
it certainly didn't address them or try to answer them .
i was particularly disappointed that the film didn't deal more with the trauma of learning one's life is a tv show .
carrey's performance at the end showed a smidgen of truman's pain , but i almost felt that he got over it too easily for
the sake of the film's pacing .
earlier in the movie i found myself wondering if it would be better for truman to find out the truth or whether i should root
for him to be well .
the two seemed exclusive of one another , but weir and niccol didn't see it that way .
perhaps it's not fair to criticize a movie for what it isn't , but it seems like there were some missed opportunities here .
but on its own terms , the movie is well made .
sight , sound and pacing are all handled competently .
much of the first part of the movie is the truman show .
the scenes are all apparently shot from hidden cameras , with snoots and obstructions covering the corners of the
screen .
one hidden camera is apparently in his car radio , the green led numbers obscuring the lower part of the screen .
the music is well-chosen and scored .
the film opens with what sounds like family drama theme music , when truman's world is still beautiful and perfect .
when the movie ends , the score sounds more like a frantic , driven , tangerine dream opus , while still keeping the
same timbre .
philip glass' epic music ( from powaqqatsi ) permeates truman's scenes of suspicion and awakening .
( glass has a small cameo as a keyboardist for the show . )
and the pacing of the story was brisk .
there was no unnecessarily long setup explaining the concept behind the truman show , just a few quick title cards , a
few interviews , and then right into the show , and the movie .
one of the first scenes is of the studio light falling ; there was no token scene of truman's idyllic life before it falls apart ,
because it wasn't necessary , we pick up the story at the first sign of trouble , and no sooner .
there's also no point in the movie where the plot slows down .
it's a quick , straight shot to the movie's end .
in terms of overall quality , i would compare the truman show to niccol's gattaca .
both films are well made with interesting stories set in interesting worlds .
but neither film really felt like it capitalized on all the great ideas ; neither film " clicked " and became an instant classic .
nevertheless , i look forward to niccol's next film , whatever it may be .
Sample Positive Review
Note that although all letters are lower-case, punctuation is preserved.
We also used the subjectivity data set downloaded again from Bo Pang's web
page as well as a positive/negative sentence corpus based on Yahoo Movie Reviews.
These will be described below.
3.2 Approaches to Movie Review Sentiment Classification
3.2.1 Baseline Approach
The baseline approach is the standard bag-of-words representation applied to the
movie database. As mentioned above, we do not filter stop words, but we do perform
stemming. Results using the baseline representation will allow comparison with other
approaches in the literature.
3.2.2 Objective Sentence Filtering
A second approach by Pang and Lee (2004) investigated performance after objective sentences were removed. This is a kind of document pre-processing before applying the document to the baseline classifier. The idea is that objective sentences could only hurt the classification, as they give no indication of a document's polarity. Pang and Lee (2004) demonstrated that this objectively-filtered training set achieved results which could approach but not exceed the baseline results.
The procedure, which we will refer to as the objectively-filtered document classifier, is
shown in Figure 5.
Figure 5: Subjectivity-extracted document classifier (Pang and Lee, 2004)
The procedure for extracting the subjective sentences is quite simple. Pang and Lee (2004) created a second training corpus containing objective and subjective reviews, and reapplied the baseline classifier at the sentence level. The objective reviews consist of 5000 sentences taken from plot summaries in the IMDb database, ensured not to overlap with the primary document sentiment corpus. The subjective reviews were retrieved from www.rottentomatoes.com, comprising 5000 movie review snippets (e.g., "bold, imaginative, and impossible to resist"). We will refer to this second subjective/objective training corpus as the objectively filtered training corpus (available at http://www.cs.cornell.edu/people/pabo/movie-review-data/ under Subjectivity datasets).
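A rough Python sketch of the filtering pipeline is given below. The two classifier functions, sentence_is_subjective and document_polarity, are hypothetical placeholders standing in for the sentence-level subjectivity classifier and the baseline document classifier described above; reviews are assumed to be stored one sentence per line, as in the Cornell data.

def filter_objective_sentences(document, sentence_is_subjective):
    # Keep only the sentences that the subjectivity classifier labels subjective.
    sentences = document.split("\n")
    return "\n".join(s for s in sentences if sentence_is_subjective(s))

def classify_filtered(document, sentence_is_subjective, document_polarity):
    # Subjectivity extraction followed by the baseline polarity classifier.
    subjective_only = filter_objective_sentences(document, sentence_is_subjective)
    return document_polarity(subjective_only)  # returns '+' or '-'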
3.2.3 Web-mining of New Features
One unique contribution of this work is the mining from the web of new features
for movie review sentiment classification. The key idea behind this approach is that
background information about the product being reviewed may assist in classifying the
sentiment of the review itself.
For example, suppose we know that car A won car of the year in a recent magazine
article. Perhaps we also see very high ratings of car A in customer surveys. Now
suppose a review of car A arrives in which our standard sentiment classification
approaches cannot ascertain whether it is positive or negative. By default, perhaps it
makes sense to classify this review as positive.
At the same time, if car B consistently scores very low on customer surveys and
reports by industry analysts, and an ambiguous review for car B arrives, we would surely
be better off classifying this review as negative by default.
On the web, this kind of background information about movies can be found
easily. By visiting the IMDb.com website, a program can automatically extract a movie's
year of release, what awards it has won, its box office revenue, etc. In our experiments,
we have seen the most useful web-mined feature for movie classification is the online
user rating. Visitors to the IMDb homepage of each movie have the option to assign a
score to the movie, from 0.0 to 10.0, and IMDb.com provides the average score of the
online user ratings. A dozen movies in the top-250 have received over 100,000 votes, but
the majority of movies in the Cornell database have user ratings based on between one and ten thousand votes.
Figure 6: Sample IMDb.com User-rating
To gather this information, we needed to automatically determine the IMDb
homepage of each movie in our data set. This always corresponds to a particular ``IMDb
code'' for each movie.
For example, the IMDb URL for ``Titanic'' is
``http://www.IMDb.com/title/tt0120338/'', where the code is the final string ``tt0120338.''
Once we gather the code for each movie, we can visit its homepage and extract
interesting information.
To extract the codes, we first extracted the movie title and date from the original
html newsgroup postings used to generate the formatted Cornell data set. The actual
Cornell files do not always include the name of the film.
We then used a web document fetching program included in Dekang Lin's Language and Text Analysis Tools (LaTaT) (Lin, 2001). We searched for the movie using the IMDb search URL, for example, ``http://www.IMDb.com/find?q=titanic'' for Titanic, and gathered this page with the document fetcher. This page was then parsed for a
hyperlink to the homepage of the movie, occurring on the same line with the year of
release of the film. By validating the hyperlink with the year, we ensure we select the
correct title code when there are multiple movies with the same title, or sequels to a
movie also returned with the search. A small portion of our codes had to be manually
determined when this process was not successful.
Also, some incorrect links (two movies in the same year with a similar title, for example) were caught when randomly spot-checking the generated URLs.
We built a list of these codes for every unique movie in the 2000-movie review
database. This list is available at http://www.cs.ualberta.ca/~bergsma/Movies/. We then
visited the homepage for each movie, and extracted four features:
- user-rating
- year
- whether Oscar-nominated or not
- whether Best Picture Oscar winner or not
Finally, as an additional feature, we used the document-fetching code to extract
the number of pages returned by the Google search engine when we enter the movie title together with the word ``sucks,'' and the movie title together with the word ``awesome.'' The
ratio of these two counts was provided as the fifth feature to our web-mined classifier.
This provides us with a sense of the sentiment towards the movie on the entire web,
rather than just with IMDb.com visitors.
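The sketch below shows how the five web-mined features could be assembled into a single vector per movie. All of the helper functions (fetch_page, the parse_* routines, and google_hit_count) are hypothetical placeholders: the actual parsing in our scripts was regular-expression based and specific to the IMDb and Google pages as they appeared at the time.

def web_mined_features(imdb_code, title, fetch_page, parse_user_rating, parse_year,
                       parse_oscar_nominated, parse_best_picture, google_hit_count):
    # Fetch the movie's IMDb homepage from its title code, e.g. "tt0120338".
    page = fetch_page("http://www.imdb.com/title/%s/" % imdb_code)
    rating = parse_user_rating(page)          # average user rating, 0.0 to 10.0
    year = parse_year(page)
    nominated = parse_oscar_nominated(page)   # 1 or 0
    best_picture = parse_best_picture(page)   # 1 or 0
    # Hit counts for "<title> awesome" versus "<title> sucks" on the whole web.
    awesome = google_hit_count('"%s" awesome' % title)
    sucks = google_hit_count('"%s" sucks' % title)
    ratio = awesome / float(sucks + 1)         # +1 avoids division by zero
    return [rating, year, nominated, best_picture, ratio]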
Given these five features, the question is how to incorporate them into the review sentiment classification task. In a sense they represent ``prior'' information about
the movies, and perhaps should be included as additional features in the bag of words
classifier. However, we found this approach resulted in no significant improvement in
performance using the SVM for classification. This observation will be explained in our
results section below.
Rather than incorporating this information into the bag-of-words baseline
classifier, we decided to build separate classifiers for the reviews using only these web-mined features. It is important to realize that these classifiers will thus classify reviews
without using any information from the review itself! They make their decision purely
based on background information about the movie. This method is not as crazy as it may
seem -- in fact, we show very high performance using background information only.
Furthermore, we achieve our highest performance using this classifier in combination
with the other classifiers outlined in this section, as one hybrid classifier. The web-mined
features are in fact shown to be the most important single component of this hybrid
approach.
Also, note that in preliminary tests, we saw no advantage to using all five web-mined features rather than simply the user-rating and year. This seems to be because movies that win Oscars and Best Picture awards and are spoken of as ``awesome'' tend to have
very high user ratings, so these additional features are largely redundant.
3.2.4 Sentence-level Classification
Sentence-level polarity classification is an extension of Pang and Lee's subjectivity extraction method.
Documents undergo the pre-processing depicted in Figure 7.
Figure 7: Document preprocessing for sentence based polarity features
Referring to Figure 7, the input comprises documents Di, each a set of sentences {S1, S2, S3, ..., Slength}, where each document is labelled with the + class or the - class. The first classifier labels the sentences as either subjective (S) or objective (O); these are the outputs of the subjective/objective classifier described above. A second classifier then relabels the subjective sentences as either + or -, as output by a positive/negative sentiment binary classifier.
For training of the positive/negative classifier, 1500 positive and 1500 negative
sentiment ‘sentence’ reviews were obtained from Yahoo movie reviews, and a baseline
classifier was applied. The output of the second classifier leaves each document in sentence-polarity form, where each sentence's polarity belongs to the set {+, -, O}. A + label means the sentence was classified as positive sentiment. A - label means the sentence was classified as negative sentiment. An O label means the sentence is objective
or has neutral sentiment.
Table 1: Examples of sentence polarity types
Example                        Class
Stunning visual effects.       Positive (+)
The plot moves too slowly.     Negative (-)
Neo dodges bullets.            Objective (O)
Re-representing the sentence polarity feature vectors in fixed length format
After pre-processing the documents into the sentence polarity feature
representation, the feature vectors are still of variable length (corresponding to the
number of sentences in the document). One transformation to a fixed-length feature vector is the vector <+, -, O>, where each feature represents the frequency of that particular label in the document. For example, if document D = {+,+,+,O,O,-,-,+,-}, then the fixed-length feature vector would be <4,3,2>. Of course, we can normalize the feature vector by its length, to negate the effect of differences in document length: <4,3,2>/sqrt(4^2+3^2+2^2).
A further representation incorporates the position of the sentences into the feature vector. This is done by applying the fixed-length feature vector to each quarter of the document, giving a 12-feature representation <q1+, q1-, q1O, q2+, q2-, q2O, q3+, q3-, q3O, q4+, q4-, q4O>. For example, q1+, q1-, and q1O represent, respectively, the number of positive, negative, and neutral polarity sentence labels in the document's first quarter. Again, vector-length normalization can be applied to negate the effect of differences in document length.
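The following sketch (our own illustration) computes both fixed-length representations: the three-feature count vector, its length-normalized variant, and the 12-feature version that counts labels separately within each quarter of the document.

import math

def polarity_counts(labels):
    # labels is a list over sentences, each '+', '-' or 'O'.
    return [labels.count('+'), labels.count('-'), labels.count('O')]

def normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec

def quarter_counts(labels):
    # 12 features: <+, -, O> counts within each quarter of the document.
    n = len(labels)
    vec = []
    for q in range(4):
        quarter = labels[q * n // 4:(q + 1) * n // 4]
        vec.extend(polarity_counts(quarter))
    return vec

doc = ['+', '+', '+', 'O', 'O', '-', '-', '+', '-']
print(polarity_counts(doc))             # [4, 3, 2]
print(normalize(polarity_counts(doc)))  # approximately [0.74, 0.56, 0.37]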
Finally, in the comparison tests below, sentence-level classification was evaluated
using a single feature value – the ratio of positive sentences to negative sentences in each
document.
3.2.5 Hybrid Model
With the above four approaches to developing feature vectors for machine learning of movie review sentiment classification, we have four different classifiers, each with their unique strengths, to classify the movie review documents. We thus adopted the
approach of Mullen and Collier (2004) to combine the outputs of the four approaches into
one ``hybrid'' model. In this work, we limit our constituent classifiers to the SVM versions of the approaches outlined above.
Figure 8: SVM-hyperplane distances as Constituent Output/Hybrid Input
Using SVM-versions, as with the approach of Mullen and Collier, the inputs to
the hybrid model can then be the outputs of the four individual classifiers, in this case, the
distance from the dividing SVM-hyperplanes for the four constituent SVM classifiers.
These four-feature vectors can then in turn provide data for a second layer of training
with another machine learning algorithm. We experimented with the hybrid data in
Weka, and found Naive Bayes to be the most successful.
Also, note that because we saw no difference between using all the new web-mined features and just using the user rating and year, we went with this simpler constituent
classifier for the hybrid tests. This reduces the chance of overfitting with the SVM and
allows for faster running of the algorithm.
We test this model using 10-fold cross-validation as with the other approaches.
However, because there are two layers of training, we must divide up the 90% training
portion between the constituent classifiers and the hybrid model. In each of the ten
iterations of the cross-validation, we provide 80% of the data for learning the baseline,
objectivity-filtered, web-mined and sentence-level classifiers, and 10% for learning of the
hybrid model. Each of our ten folds has a turn as the test set and as the hybrid training
data, and then another eight turns in the training set.
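A simplified sketch of the stacking step follows. Here each entry of constituent_distances stands for a function giving a document's signed distance from the hyperplane of one constituent SVM (in practice these values are read from the SVMlight output), and the second-stage learner is naïve Bayes. We used Weka for this stage, so the scikit-learn call below is only an illustrative stand-in.

from sklearn.naive_bayes import GaussianNB

def hybrid_features(doc, constituent_distances):
    # One feature per constituent classifier: its signed hyperplane distance.
    return [distance(doc) for distance in constituent_distances]

def train_hybrid(hybrid_docs, hybrid_labels, constituent_distances):
    # Second-stage naive Bayes trained on the four distance features.
    X = [hybrid_features(doc, constituent_distances) for doc in hybrid_docs]
    return GaussianNB().fit(X, hybrid_labels)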
3.3 Machine Learning Algorithms
3.3.1 Support Vector Machines
Support Vector Machines (SVMs) are a supervised machine learning
classification technique which uses a kernel function to map an input feature space into a
new space where the classes are linearly separable. For a demonstration of the use of
SVMs in text categorization and references to further information, we refer the reader to
Joachims (1998).
For our support vector machine implementation, we used Thorsten Joachims's SVMlight (Joachims, 1999). This implementation uses a sparse vector representation
(suitable for reducing the size of the sparse files representing our indexed set), has a fast
optimization algorithm, and can handle many thousands of features and training
instances. All experiments are with default parameter settings, including a linear kernel.
This is in line with the use of SVM by Pang et al. (2002, 2004).
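For reference, SVMlight reads a sparse input format in which each line is a label followed by index:value pairs with 1-based feature indices, e.g. ``+1 3:1 117:1 2045:1''. A minimal Python wrapper around the svm_learn and svm_classify binaries, assuming they are on the path, might look like this:

import subprocess

def write_svmlight(filename, vectors, labels):
    # SVMlight sparse format: "<+1|-1> <index>:<value> ..." with 1-based indices.
    with open(filename, "w") as f:
        for vec, label in zip(vectors, labels):
            pairs = " ".join("%d:%g" % (i + 1, v) for i, v in enumerate(vec) if v != 0)
            f.write("%+d %s\n" % (label, pairs))

# Tiny illustrative data set: two binary bag-of-words vectors with labels +1 / -1.
write_svmlight("train.dat", [[1, 0, 1], [0, 1, 0]], [+1, -1])
subprocess.check_call(["svm_learn", "train.dat", "model"])            # default linear kernel
subprocess.check_call(["svm_classify", "train.dat", "model", "out"])  # predictions written to "out"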
3.3.2 Naïve Bayes
The Naïve Bayes assumption of attribute independence works well for text
categorization at the word feature level. When the number of attributes is large, the
independence assumption allows for the parameters of each attribute to be learned
separately, greatly simplifying the learning process.
There are two different event models. The multi-variate Bernoulli model treats the document as the event, with the binary occurrence of each word being an attribute of that event. This simpler model fails to account for multiple occurrences of a word within the same document. If multiple word occurrences are meaningful, then a multinomial model should be used instead, in which the words themselves become the events and a multinomial distribution accounts for multiple occurrences.
Detailed descriptions of the multi-variate Bernoulli and multinomial models, as well as an empirical comparison, are given by McCallum and Nigam (1998), who report an average 27% reduction in error using the multinomial model over the multi-variate Bernoulli model on popular data sets.
For the naïve Bayes implementation, we used the BOW toolkit (McCallum, 1996). The event model used was the word-level (multinomial) model, as the natural language processing community has generally found that the multinomial model performs better than the multi-variate Bernoulli model for larger vocabularies.
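To illustrate the difference between the two event models, the sketch below trains both on the same toy word-count matrix; the Bernoulli variant binarizes the counts, while the multinomial variant uses them directly. (The project itself used the BOW toolkit; the scikit-learn classes here are only a stand-in.)

from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy corpus: word-count vectors over a 3-word vocabulary, labels +1 / -1.
counts = [[2, 0, 1], [0, 3, 0], [1, 0, 2], [0, 1, 0]]
labels = [1, -1, 1, -1]

multinomial = MultinomialNB().fit(counts, labels)          # events are word occurrences
bernoulli = BernoulliNB(binarize=0.5).fit(counts, labels)  # events are documents; counts binarized
print(multinomial.predict([[1, 0, 1]]), bernoulli.predict([[1, 0, 1]]))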
3.3.3 Decision Trees
No one as far as we know has applied the decision tree algorithm to the problem
of automated categorization by sentiment. Decision trees have performed well for the
problem of automated topic categorization (Gabrilovich and Markovitch, 2004). We
employed Ross Quinlan's well-known C4.5 decision tree learner and its successor C5.0. C4.5 builds on the ID3 algorithm to construct a tree, extending it to handle missing data and attributes with continuous ranges. In order to
produce trees having desirable properties like minimality, ID3 ranks the attributes of a
given training set according to their information gain and constructs a tree by employing
the following strategy: supposing that it has already constructed the tree from the root
node down to a node n, ID3 assigns to node n the attribute whose information gain is
greatest among the attributes it has not yet assigned to any of the nodes on the path from
the root node to node n.
As can be seen in Table 3 below, the C4.5 classifier that was learned on the training data performed the worst of all the algorithms in the table. With the baseline approach, it had 66.0% accuracy and with the objectivity-filtered approach, it had 65.5% accuracy. This did not improve by using winnowing (i.e., feature selection).
Unlike C4.5, the C5.0 learner allows adaptive boosting based on Freund and
Schapire’s work. Roughly, boosting generates several trees and takes a (weighted) vote
of the trees' classifications for a given instance. This construction may be outlined as
follows: At trial 1, construct a decision tree as before. At trial 2, construct another
decision tree by paying more attention to the mistaken cases in trial 1. This second tree
will also make mistakes. …At trial n+1, construct a decision tree by paying more
attention to the mistaken cases in trial n. Stop after a specified number of trials or when
there is sufficient convergence or when divergence begins.
Adaptive boosting (10-trial) on average reduces the error rate by about 25%.
Concerning our data, adaptive boosting did not improve performance on the smaller
‘web-mined’ and ‘sentence-level’ data sets. It did, however, significantly boost the
performance on the ‘baseline’ data set by increasing the accuracy from 66.0% to 74.2%.
It also significantly boosted the performance on the ‘objectivity-filtered’ data set by
increasing the accuracy from 65.5% to 77.0%.
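As a rough, present-day stand-in for C5.0's boosting option (our experiments used Quinlan's implementation directly), the same trial-by-trial reweighting idea with a weighted vote over the resulting trees is available as AdaBoost in scikit-learn:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# 10 boosting trials over depth-limited decision trees, combined by a weighted
# vote, in the spirit of C5.0's 10-trial adaptive boosting.
booster = AdaBoostClassifier(DecisionTreeClassifier(max_depth=5), n_estimators=10)
# booster.fit(train_vectors, train_labels); predictions = booster.predict(test_vectors)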
3.3.4 Logistic Regression
Logistic Regression is a popular and well-established algorithm in Statistics,
however it is strangely absent from data mining applications. Most people think Logistic
Regression cannot handle the very large datasets to which other modern data mining tools
have been applied (Komarek & Moore 2003).
Recently, Komarek and Moore (2003) showed that the predictive performance of Logistic Regression equals or exceeds that of more recent algorithms (e.g., SVM and naïve Bayes) in some circumstances, especially for large sparse datasets with binary output (e.g., +/-) and a large number of binary-valued attributes. This property of Logistic Regression is very attractive for our project, because there are 13,377 binary-valued attributes in the Baseline Bow and Objectivity Filtered datasets.
We implemented batch gradient ascent for the Logistic Regression algorithm (Figure 9) in Matlab. Since a large number of attributes (13,377) must be handled for the Baseline Bow and Objectivity Filtered datasets, we set the convergence condition to |w_{i+1} - w_i| ≤ 0.01 and the step size to 0.05. With this parameter setting, Logistic Regression achieves 85.3% accuracy under ten-fold cross-validation on the Baseline Bow dataset. Convergence always occurred at about 125-130 iterations (Figure 10). From Figure 10, it seems possible to improve the convergence behaviour, but experiments with step sizes from 0.03 to 0.1 did not improve our results significantly. We also tried a smaller convergence threshold; however, the program then takes a very long time to return a result. In the end, we chose the convergence condition |w_{i+1} - w_i| ≤ 0.01 and step size 0.05 as our parameter setting for the Baseline Bow and Objectivity Filtered evaluations. For the other two datasets, Web-Mined and Sentence-Level, we set a smaller convergence condition, |w_{i+1} - w_i| ≤ 0.001, with step size 0.05, since there are only 5 attributes in Web-Mined and one attribute in Sentence-Level.
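Our actual implementation was written in Matlab; the NumPy sketch below shows the same batch gradient-ascent update, with labels mapped to {0, 1}, the convergence test |w_{i+1} - w_i| ≤ 0.01 and step size 0.05:

import numpy as np

def logistic_regression(X, y, step=0.05, tol=0.01, max_iter=1000):
    # X: (n_examples, n_features) array of 0/1 attributes; y: labels in {0, 1}.
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X.dot(w)))   # predicted probabilities
        w_new = w + step * X.T.dot(y - p)      # batch gradient-ascent step
        if np.linalg.norm(w_new - w) <= tol:   # convergence: |w_{i+1} - w_i| <= tol
            return w_new
        w = w_new
    return w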
Figure 9: Batch Gradient Ascent for Logistic Regression Algorithm
Figure 10: Baseline Bow dataset evaluated by Logistic Regression (convergence condition |w_{i+1} - w_i| ≤ 0.01 and step size = 0.05)
3.3.5 Summary of Machine Learning Algorithms
For the sake of reference, we summarize the tools used in this project in Table 2.
Table 2: ML Tool Summary
         Baseline Bow     Objectivity Filtered   Web-Mined        Sentence-Level
SVM      SVMlight         SVMlight               SVMlight         SVMlight
NB       BOW-Toolkit      BOW-Toolkit            Weka             Weka
LR       MATLAB           MATLAB                 MATLAB           MATLAB
C4.5     Quinlan's C5.0   Quinlan's C5.0         Quinlan's C5.0   Quinlan's C5.0
Hybrid   Naïve Bayes – Weka
4. Results
4.1 Comparison of Sentiment Classification using Different ML
Algorithms
In this project, we evaluated the four versions of the sentiment datasets (Baseline
Bow, Objectivity Filtered, Web-Mined and Sentence-Level) using four machine-learning
techniques (SVM, Naïve Bayes, Logistic Regression and Decision Tree C4.5). Baseline
Bow and Objectivity Filtered datasets, both with 13,377 binary-valued attributes, were proposed by Pang and Lee (2004). The other two datasets, Web-Mined and Sentence-Level, were proposed by our team as discussed in sections 3.2.3 and 3.2.4; the Web-Mined dataset has 5 attributes and the Sentence-Level dataset has one attribute (the ratio of positive sentences to negative sentences in each document).
Table 3: Machine Learning Results for Different Approaches
        Baseline Bow   Objectivity Filtered   Web-Mined   Sentence-Level
SVM     86.4           86.2                   77.6        74.8
NB      79.5           83.4                   72.3        75.5
LR      85.3           84.9                   74.7        75.2
C4.5    66.0           65.5                   77.2        75.2
Figure 11: Machine Learning Results for four Different Approaches
The average accuracies computed by ten-fold cross-validation using different
machine-learning techniques (SVM, Naïve Bayes, Logistic Regression and Decision Tree
C4.5) over the four datasets (Baseline Bow, Objectivity Filtered, Web-Mined and Sentence-Level) are listed in Table 3. Generally, the results from SVM, Naïve Bayes and Logistic Regression were quite close. The predictive performance of SVM on the Baseline Bow polarity dataset reaches 86.4% accuracy, slightly below the best result (87.15%) reported by Pang and Lee (2004). Similarly, SVM has an average accuracy of 86.2% on the Objectivity Filtered polarity dataset, which is close to the best result
(86.4%) published by Pang and Lee (2004). C4.5 has the worst performance on both
Baseline Bow and Objectivity Filtered polarity datasets. The implication is that C4.5 may
not handle datasets with very many attributes well. For the other two datasets built by our team, Web-Mined and Sentence-Level, the predictive accuracies achieved by SVM are 77.6% and 74.8% respectively. Although these accuracies are clearly below those for the Baseline Bow and Objectivity Filtered datasets, they still reach about 75% accuracy with far fewer attributes. (Further improvements to sentiment classification performance are reported in sections 4.3 and 4.4.)
In general, the predictive performance of the Web-Mined and Sentence-Level approaches is below that of the Baseline Bow and Objectivity Filtered approaches (Figure 11). However, the Web-Mined and Sentence-Level representations have great potential for sentiment classification because of their low dimensionality, fast computation and low usage of computing resources. We believe it is possible to refine these two new approaches and achieve better performance.
From a machine learning point of view, while the predictive performance of the
three algorithms SVM, Naïve Bayes and Logistic Regression is comparable,
the performance of the Decision Tree algorithm is far below the others. The interesting
observation is that the predictive performance of Logistic Regression is comparable with
some modern techniques such as SVM and Naïve Bayes on the sentiment classification
problem.
4.2 Learning Curve with Logistic Regression
Figure 12: Learning Curve with Logistic Regression
Logistic regression was used to study the learning curve in our project. The ten-fold cross-validation learning curve for each dataset using Logistic Regression is shown in Figure 12. We sample training sets of 20, 50, 100, 200, 400 and 600 data points. Each point
on the learning curve is based on 10 different sub-datasets of the specified size, drawn
randomly from the original dataset.
In Figure 12, as the number of data points grows, the predictive accuracy on the Baseline Bow, Objectivity Filtered and Web-Mined datasets increases, especially for Baseline Bow and Objectivity Filtered. However, the predictive performance of Sentence-Level does not change much. In other words, more training data is very helpful for improving the accuracy of the Baseline Bow and Objectivity Filtered approaches.
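The sampling procedure behind each learning-curve point can be summarized by the sketch below; evaluate_cv is a placeholder for our ten-fold cross-validation routine.

import random

def learning_curve(examples, evaluate_cv, sizes=(20, 50, 100, 200, 400, 600), repeats=10):
    # For each size, average accuracy over `repeats` random sub-datasets.
    curve = {}
    for size in sizes:
        accuracies = [evaluate_cv(random.sample(examples, size)) for _ in range(repeats)]
        curve[size] = sum(accuracies) / float(repeats)
    return curve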
4.3 Hybrid Classification Results
We next tested the hybrid approach, combining the outputs of the SVM classifiers for the four different document representations.
Table 4: SVM 10-fold Cross-Validation Performance (%)
System                      Accuracy
Baseline                    86.35
Objectivity-Filtered        86.15
Web-Mined Features          77.60
Sentence-level Classified   74.80
Hybrid                      90.20
We achieve very impressive performance using the hybrid model, 90.2%, roughly
3% higher than any results so far published in the literature. In fact, in our testing, we
find the hybrid model to outperform the best-so-far-reported bag-of-words approach by
about 4% (Table 4).
Table 5: Hybrid 10-Fold Cross-Validation Relative Drop in Performance (%)
With Removal of:            Accuracy
Baseline                    89.05
Objectivity-Filtered        88.80
Web-Mined Features          88.75
Sentence-level Classified   89.75
None                        90.20
One interesting question would be, what are the relative contributions of each of
the constituent models to this high performance? This can be assessed by removing each
one of the constituent classifiers' output features from the hybrid classifier, and measuring the drop in performance. Intuitively, one might expect the relative importance to vary with the relative performance of the constituents, but this turns out not to be correct. In fact, we see that removing the web-mined features results in the largest drop in performance (Table 5), even though these only perform third-best when tested alone (Table 4). However, the sentence-level constituent does turn out to be both the least
important to the hybrid (with only a very small drop in performance, from 90.2 to 89.75,
when left out), and the worst performer on the stand-alone task.
The reason for the importance of the web-mined model is that it is the most
``orthogonal'' to the other sentiment measures. Each of the other three constituents
depends on words in the original reviews, whereas the web-mined features, as we
mentioned earlier, classify the documents purely based on background information.
We now offer some justification for taking the hybrid, stacked-classifier approach
to incorporating the web-mined information. Earlier, we mentioned that including the
web-mined information as features within the baseline SVM classifier resulted in no
significant improvement in performance. This phenomenon can now be explained.
We see above that using the web-mined features with an SVM achieves 77.6%
accuracy on the test set, while the baseline system achieves 86.35%. The information in
both of these classifiers appears to be important to classification: they are within
10% of each other in terms of accuracy. The difference is much greater, however, when we
look at training set accuracy. The web-mined features also achieve around 78% on the
training set – there are only five features, and with 1800 elements to train with,
overfitting is not an issue. When the SVM system learns a classifier based on the training
documents, however, it can achieve very high levels of performance on the training set
itself – typically around 97% performance in our experiments. Thus, we see that some
amount of overfitting occurs as specific words are chosen by the SVM to optimize
classification performance on the training set. The SVM finds it can do quite well on the
training set relying on these key terms – why bother incorporating the web-mined
features, which only ever provide 77% accuracy?
In a hybrid approach, however, the SVM's true classification accuracy is observed during the second stage of training, on the reserved 10% of training documents.
The hybrid classifier can learn to downweight the contribution of the
baseline classifier output when the document's classification is only slightly on one side of
the dividing hyperplane, and rely more on the web-mined classification in these
instances. In summary, if the web-mined features and BOW-features are combined at the
training stage, the web-mined features are ignored because of the high training-set
accuracy that is possible with the word features alone. If they are combined in a second
stage, the hybrid classifier learns the realistic classification performance of the two
systems, and can combine them optimally on this set.
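The two-stage combination can be summarized in code. The sketch below only illustrates the stacking idea (scikit-learn again, with LinearSVC standing in for both the constituent and hybrid classifiers; the project's actual setup is in Section 3.2.5): constituents are fit on 90% of the training data, and the hybrid is fit on their decision values over the reserved 10%, so it sees realistic rather than overfit accuracies.

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

def train_stacked(X_views, y):
    # X_views: one feature matrix per document representation.
    idx = np.arange(len(y))
    first, second = train_test_split(idx, test_size=0.1, random_state=0)

    # Stage 1: fit each constituent on the 90% split.
    constituents = [LinearSVC().fit(X[first], y[first]) for X in X_views]

    # Stage 2: fit the hybrid on the constituents' signed distances to their
    # hyperplanes, measured on the reserved 10% they were not trained on.
    Z = np.column_stack([clf.decision_function(X[second])
                         for clf, X in zip(constituents, X_views)])
    hybrid = LinearSVC().fit(Z, y[second])
    return constituents, hybrid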
4.4 Improved Sentence-level Classification with Position Information
The classification results obtained on the IMDb sentiment movie corpus using a linear SVM classifier on the <+,-,O> sentence-level polarity feature vectors are shown in Table 6.
Table 6: Classification accuracy on the IMDb sentiment movie corpus using fixed-length sentence-level polarity feature vectors with an SVM and 10-fold cross-validation.

W/O position (unnormalized)     74.6%
W/O position (normalized)       74.4%
With position (unnormalized)    75.5%
With position (normalized)      74.8%
The best result, 75.5% accuracy, was obtained by incorporating position information without length normalization.
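To illustrate, a hypothetical construction of such feature vectors is sketched below: each review is reduced to counts of its +, - and O sentences, either raw or length-normalized, and the "with position" variant keeps separate counts for the beginning, middle and end of the review. The exact feature encoding used in our experiments is described in Section 3.2.4 and may differ in detail.

from collections import Counter

def polarity_vector(labels, use_position=False, normalize=False):
    # labels: per-sentence polarity tags of one review, e.g. ['+', 'O', '-'].
    def counts(part):
        c = Counter(part)
        v = [c['+'], c['-'], c['O']]
        return [x / max(len(part), 1) for x in v] if normalize else v

    if not use_position:
        return counts(labels)

    # Split the review into first, middle and last thirds by sentence position.
    n = len(labels)
    third = max(n // 3, 1)
    return (counts(labels[:third]) +
            counts(labels[third:n - third]) +
            counts(labels[n - third:]))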
5. Conclusion
5.1 Contributions
In our project, we proposed two new approaches (Web-Mined and Sentence-Level) to the sentiment classification problem, and evaluated their performance using several machine learning techniques. Both approaches have great potential for sentiment classification because of their low dimensionality, fast computation and low use of computing resources. Currently both approaches achieve an average accuracy of about 75%.
Moreover, we proposed a hybrid classifier that combines four sentiment classification approaches (the Baseline BOW, Objectivity-Filtered, Web-Mined and Sentence-Level classifiers). The hybrid classifier achieves 90.2% accuracy, a significant improvement over the best previously published result of 87.15%, obtained by Pang and Lee (2004) using an SVM.
We built a new sentence polarity dataset with 3000 sentences; it may be useful to
other researchers investigating a Sentence-Level classifier approach to solving the
sentiment classification problem.
We found that Logistic Regression was comparable with SVM and Naïve Bayes in terms of predictive accuracy. Logistic Regression is a special case of conditional random fields, yet it is not widely used in data mining applications. Based on our observations, we believe that Logistic Regression can play a role in sentiment classification in the future.
5.2 Future Work
Our results indicate that the Web-Mined classifier both improves the performance of the hybrid classifier and performs competitively on its own. This suggests that the web-mined representation provides a different "view" of the document, perhaps allowing co-training of the bag-of-words classifiers along with the web-mined representation. In fact, it would be interesting to see what performance is achievable on the movie review classification task using only the web-mined classifier to generate training labels, and then learning bag-of-words classifiers from this data. This would be, effectively, an unsupervised approach to sentiment classification, requiring no manual labelling of data; instead, we would rely on the putative labels of the web-mined classifier to train a bag-of-words classifier.
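A minimal sketch of this unsupervised scheme follows, assuming a function web_mined_predict that returns the web-mined classifier's putative label for a review (the name is hypothetical), and using scikit-learn for the bag-of-words stage purely as an illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def bootstrap_bow_classifier(reviews, web_mined_predict):
    # No manual labels: the web-mined classifier supplies putative labels.
    pseudo_labels = [web_mined_predict(r) for r in reviews]

    # Train a standard bag-of-words SVM on these putative labels.
    vectorizer = CountVectorizer(binary=True)
    X = vectorizer.fit_transform(reviews)
    classifier = LinearSVC().fit(X, pseudo_labels)
    return vectorizer, classifier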
In our project, the predictive performance of Logistic Regression was comparable with that of SVM and Naïve Bayes, yet Logistic Regression is seldom mentioned in the related literature. We would like to further investigate its use for the sentiment classification problem.
Finally, it would be very interesting and useful if we could extend our
methodology to other sentiment tasks, such as music reviews, car reviews, etc., in order
to verify and improve the algorithm. Work on this aspect of the project will be part of our
research focus in the future.
6. References
Gabrilovich, Evgeniy and Shaul Markovitch. 2004. Text categorization with many redundant features: using aggressive feature selection to make SVMs competitive with C4.5. In ICML '04: Twenty-first International Conference on Machine Learning, New York, NY, USA. ACM Press.
Joachims, Thorsten. 1998. Text categorization with Support Vector Machines: learning with many relevant features. In Claire Nédellec and Céline Rouveirol, editors, Proceedings of ECML-98, 10th European Conference on Machine Learning, number 1398, pages 137-142, Chemnitz, DE. Springer Verlag, Heidelberg, DE.
Joachims, Thorsten. 1999. Making large-scale SVM learning practical. In B. Schölkopf and C.
Burges, editors, Advances in Kernel Methods. MIT-Press.
Komarek, Paul R. and Andrew W. Moore. 2003. Fast Robust Logistic Regression for Large Sparse Datasets with Binary Outputs.
Lin, Dekang. 1998. Dependency-based evaluation of MINIPAR. In Proceedings of the Workshop on the Evaluation of Parsing Systems, First International Conference on Language Resources and Evaluation.
Lin, Dekang. 2001. LaTaT: Language and Text Analysis Tools. In Proceedings of the First
International Conference on Human Language Technology Research.
McCallum, Andrew Kachites. 1996. BOW: A toolkit for statistical language modeling, text
retrieval, classification and clustering.
McCallum, Andrew and Kamal Nigam. 1998. A comparison of Event Models for Naïve Bayes
Text Categorization.
Mullen, Tony and Nigel Collier. 2004. Sentiment analysis using Support Vector Machines with diverse information sources. In Dekang Lin and Dekai Wu, editors, Proceedings of EMNLP 2004, pages 412-418, Barcelona, Spain. Association for Computational Linguistics.
Pang, Bo and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity
summarization based on minimum cuts. In Proceedings of the ACL, pages 271-278.
Pang, Bo, Lillian Lee and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 79-86.
Porter, Martin F. 1980. An algorithm for suffix stripping. Program, 14(3):130-137.
Quinlan, J. Ross. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers.
Salton, Gerald and Chris Buckley. 1988. Term weighting approaches in automatic text retrieval.
Information Processing and Management, 24(5):513-523.
Sebastiani, Fabrizio. 2002. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1-47.