Sentiment Mining in Large News Datasets
Roshan Sumbaly, Shakti Sinha
May 10, 2009
1 Abstract

Sentiment Mining (SM) is the area of research that attempts to build automatic systems to determine opinion from text written in natural language [1]. Most of the effort in the past has gone into mining sentiment in user-generated content, i.e. review sites, forums, discussion groups, blogs, etc. In our work we explore the application of sentiment mining to a more structured dataset: the New York Times (NYT) Annotated Corpus [10], a collection of all the articles published by the NYT between 1987 and 2007.

Previous efforts at SM can be classified into two broad areas: statistical approaches and linguistic approaches. We take a middle path by using both. We start with a small seed of positive and negative words and an extensive list of adjectives. From this we build up a set of word k-grams, using approaches like word-word correlation and word-person proximity, that finally help us classify sentences statistically as expressing either a positive or a negative opinion towards a person.

Our objective in this project is to determine the opinions expressed in the articles towards the people annotated in them, and to compute the resulting trends. Our final system can provide the opinion towards a person (a number between 0, most negative, and 1, most positive) over a span of 20 years. To evaluate our work we tried to correlate our findings with important events in history at which opinion of a person would plausibly have changed.

2 Introduction

The opinions of other people have always been an important piece of information for most of us during any decision-making process. With the advent of computers and finally the WWW, it is now possible to have a repository of 'opinions'. This ever-growing repository can be mined to find trends in opinions about entities like people, products, organizations, etc. Thus SM can help in the following areas:
(a) Recommendation systems: as a subcomponent that prevents recommending items which receive negative feedback.
(b) Businesses and organizations: finding consumer sentiment and opinion to understand marketing effectiveness.
(c) Advertisement networks: placing an ad when a product is praised, or a competitor's ad when a product is criticized. Recently companies like Facebook and Twitter have also shown interest in this area [2][3][4].
The field of SM picked up momentum around 2001, with hundreds of papers now published on the subject every year. The reasons for this sudden increase in interest are:
(a) the rise of machine learning algorithms in natural language processing and information retrieval,
(b) the availability of datasets, together with bigger and faster computational resources to mine them, and
(c) the realization that the resulting trends can support decision making in various domains.
Most SM research has revolved around either binary sentiment classification (positive or negative) or assigning a score for positivity or negativity.

The next question is how Sentiment Mining compares with standard text categorization. In particular, is it possible to use text classifiers directly by training them on data with positive and negative opinions? This cannot be done, because the opinion value of a sentence differs across subjects: the same sentence can be positive about one person and negative about another.

So what are the available techniques for mining sentiment? The approaches taken in the past have been either statistical or linguistic. Statistical techniques rely heavily on mathematical comparison of the occurrences and counts of negative or positive statements in the text, whereas the linguistic approach tries to build a set of rules and compare the analysed text against them [5]. In our work we have built a sentiment mining and classification system based on both statistical and linguistic approaches.

We used the annotated New York Times corpus, which is a collection of 1.8 million articles whose annotations include the names of the people who feature in each article. The following are the statistics of the New York Times corpus:

Table 1: New York Times Corpus

Total number of words    1,130,621,175
Number of articles       1,873,783
Number of persons        2,372,244
Format                   NITF XML documents
3 Algorithms and Techniques

The aim of our project is to build a classifier which takes a person name and a sentence as input and outputs a score between 0 and 1 quantifying the opinion expressed in the sentence about the person (0 being most negative and 1 being most positive). The features we use are the word k-grams occurring in the sentence that contains the name of the subject person. Each word k-gram has an assigned value which indicates how likely the sentence containing it is to be positive or negative. Such k-gram values can vary from corpus to corpus, so in our project we compute the values ourselves instead of using available opinion lexicons in which such values are already computed.

Our algorithm can be split into two phases, viz. the training phase and the query phase. The first phase consists of three sub-parts: (a) preprocessing the XML files of the NYT corpus, (b) producing training data, and (c) training a classifier. The second phase is query processing, wherein given a query (person, sentence) we generate a score signifying the positive / negative sentiment towards the person in the sentence. Determining the opinion score expressed in a set of sentences is not trivial, and there are many approaches that can be attempted. Moreover, the method used to determine opinion can vary with the characteristics of the data being analyzed.
3.1 Training phase

Preprocessing the XML files

The NYT corpus is in a custom XML format with various fields. We used the parser provided with the corpus to extract the important fields, namely the publication date, the names of the people present in an article and the actual body of the article. The parsed tuples were stored in two tables, one per sentence and one per article: t1<sentence_id, article_id, sentence, date> and t2<article_id, people>.
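As a concrete illustration, the sketch below loads parsed articles into two tables shaped like t1 and t2. It uses SQLite and a hypothetical iterator of parsed articles purely for illustration; the actual system used the parser shipped with the corpus and stored the tables in Aster nCluster.

import sqlite3

def load_corpus(parsed_articles, db_path="nyt.db"):
    """parsed_articles yields (article_id, date, people, body) tuples,
    a stand-in for the output of the corpus parser."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS t1 ("
                "sentence_id INTEGER PRIMARY KEY, article_id TEXT, "
                "sentence TEXT, date TEXT)")
    con.execute("CREATE TABLE IF NOT EXISTS t2 (article_id TEXT, person TEXT)")
    for article_id, date, people, body in parsed_articles:
        # naive sentence split, for illustration only
        for sentence in body.split(". "):
            con.execute("INSERT INTO t1 (article_id, sentence, date) "
                        "VALUES (?, ?, ?)", (article_id, sentence, date))
        for person in people:
            con.execute("INSERT INTO t2 VALUES (?, ?)", (article_id, person))
    con.commit()
    return con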
Producing training data

The training data for the classifier is a set of sentences, each labeled as expressing a positive or a negative opinion about a person. Since the classifier has to be specific to the news data we are working with, we use sentences from the corpus itself for training. The method we use for automatically labeling these sentences is described below. We start with seed sets of positive and negative words:

positive_seedset = {good, nice, excellent, positive, fortunate, correct, superior}
negative_seedset = {bad, nasty, poor, negative, unfortunate, wrong, inferior}

These words were suggested in [7] and are strongly positive or negative irrespective of context. We use this seed set to expand our set of positive and negative terms with reference to the NYT corpus. We do this by counting the co-occurrences of adjectives with the seed words. The following is the MapReduce implementation for finding the co-occurrences.

// adjectives: extensive list of adjectives
// stop_words: extensive list of stop words
map_constructor()
    loadfiles(positive_seedset, negative_seedset, adjectives)

// p_count: positive seed words in the sentence
// n_count: negative seed words in the sentence
map(t1.sentence)
    iterate over rows (sentences)
        for all (adjectives in sentence,
                 not in positive_seedset or negative_seedset)
            emit(adjective, <p_count, n_count>)

reduce(word, list<p_count, n_count>)
    iterate over list
        p_count_total += p_count
        n_count_total += n_count
    emit(word, <p_count_total, n_count_total>)
The output of the above MapReduce is a set of (adjective, <p_count_total, n_count_total>) tuples, where p_count_total and n_count_total are the numbers of co-occurrences with positive and negative seed words respectively. We then empirically take the top 300 words sorted by p_count_total / (1 + n_count_total) as the set of positive words (positive_nyt) and the top 300 words sorted by n_count_total / (1 + p_count_total) as the set of negative words (negative_nyt). (We add 1 to the denominator to avoid division by zero.)
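The following is a minimal single-machine sketch of this expansion step (counting seed co-occurrences and ranking adjectives). The sentence iterator and the adjective set are assumed to come from the tables and word lists described above.

from collections import defaultdict

POS_SEED = {"good", "nice", "excellent", "positive", "fortunate", "correct", "superior"}
NEG_SEED = {"bad", "nasty", "poor", "negative", "unfortunate", "wrong", "inferior"}

def expand_seed(sentences, adjectives, top_n=300):
    """sentences: iterable of strings; adjectives: set of adjective strings."""
    counts = defaultdict(lambda: [0, 0])   # adjective -> [p_count_total, n_count_total]
    for sentence in sentences:
        words = set(sentence.lower().split())
        p = len(words & POS_SEED)           # positive seed words in this sentence
        n = len(words & NEG_SEED)           # negative seed words in this sentence
        for adj in (words & adjectives) - POS_SEED - NEG_SEED:
            counts[adj][0] += p
            counts[adj][1] += n
    positive_nyt = sorted(counts, key=lambda a: counts[a][0] / (1 + counts[a][1]),
                          reverse=True)[:top_n]
    negative_nyt = sorted(counts, key=lambda a: counts[a][1] / (1 + counts[a][0]),
                          reverse=True)[:top_n]
    return positive_nyt, negative_nyt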
Using these NYT-specific positive and negative words, we then try to identify a set of people towards whom the opinion is consistently positive or negative. For this we again use co-occurrence with the sets of positive and negative words as the metric, via the following MapReduce:
map_constructor()
    loadfiles(negative_nyt, positive_nyt)

map(t2.people, t1.sentence)
    iterate over rows
        for every person in t2.people
            if person in t1.sentence
                for every negative word in sentence
                    emit(person, <-1>)
                for every positive word in sentence
                    emit(person, <+1>)

reduce(person, list<indicator>)
    iterate over list
        if indicator == -1
            ncount++
        else
            pcount++
    emit(person, <pcount, ncount>)

The output of this MapReduce is the set of tuples (person, <pcount, ncount>). The task now is to use these tuples to create the training data for the classifier. We have to select a set of people such that:
a) they have strongly positive or negative values, so that we can assume with high confidence that the sentences they occur in are positive or negative, and
b) they occur in a sufficiently large number of sentences, so that we have enough training data.
Keeping this in mind, we select 20 positive persons by first sorting the tuples by pcount (this ensures we select frequently mentioned persons), keeping only the top 100 people, and then, within those 100, selecting the top 20 sorted by pcount / (1 + ncount) (this ensures we select people with high positive and low negative opinion). We follow a similar process for negative persons, and thus obtain the sets positive_people and negative_people.
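A compact sketch of this selection is shown below, assuming person_counts maps each person to the (pcount, ncount) pair emitted above. The negative side is assumed to mirror the positive side, which the text only describes as "a similar process".

def select_people(person_counts, popular=100, top=20):
    def pick(frequency, ratio):
        # keep the `popular` most frequently co-occurring people first,
        # then rank them by the opinion ratio and keep the `top` best
        candidates = sorted(person_counts, key=frequency, reverse=True)[:popular]
        return sorted(candidates, key=ratio, reverse=True)[:top]

    positive_people = pick(lambda p: person_counts[p][0],
                           lambda p: person_counts[p][0] / (1 + person_counts[p][1]))
    negative_people = pick(lambda p: person_counts[p][1],
                           lambda p: person_counts[p][1] / (1 + person_counts[p][0]))
    return positive_people, negative_people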
We now assume that every sentence containing a person from positive_people is positive and every sentence containing a person from negative_people is negative. These sentences form the training data for the classifier.

Training the Classifier

The classifier we use is a multinomial Naive Bayes classifier with term trigrams as features. We briefly describe our motivation for using k-grams instead of unigrams. Intuitively, k-grams are more capable of capturing sentiment than unigrams. If we use only unigrams, seeing words like 'not' and 'good' tells us only that they are present in the sentence, i.e. the bag-of-words approach. K-grams let us capture the case where a word like 'not' modifies another unigram like 'good'. For example, for a sentence (after removing stop words) like 'He is not a winner in life', the unigrams [not, winner, life] cannot reveal the negativity, whereas the bigrams [not winner, winner life] can. This is supported by the work of Hang Cui et al. [8], who show that using higher-order grams is beneficial when dealing with large corpora.
As mentioned in the previous section, we use the sentences containing a person from positive_people or negative_people as our training data. We generate k-grams from these sentences using a technique employed in question answering systems [9]: k-grams are generated only within a window around the person's name. This is based on the observation that most of the keywords expressing an opinion about a person occur close to the person's name.
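A small sketch of such proximity k-grams is given below; it also applies the PERSON-token substitution discussed in the observations section. The window size of five tokens is an illustrative choice, since the text only states that a window around the person's name is used.

def proximity_kgrams(sentence, person, k=3, window=5):
    # replace the person's name with a generic token, then take k-grams
    # from a fixed window of tokens around that token
    tokens = sentence.replace(person, "PERSON").split()
    if "PERSON" not in tokens:
        return []
    i = tokens.index("PERSON")
    span = tokens[max(0, i - window): i + window + 1]
    return [" ".join(span[j:j + k]) for j in range(len(span) - k + 1)]

# proximity_kgrams("He said Bush won the election", "Bush")
# -> ['He said PERSON', 'said PERSON won', 'PERSON won the', 'won the election']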
For training the classifier, we have implemented the following MapReduce.

map_constructor()
    pp = loadintoset(positive_people)
    np = loadintoset(negative_people)

map(t2.people, t1.sentence)
    iterate over rows
        for every person in t2.people
            if person in t1.sentence
                if person in pp[i]
                    for all proximity kgrams
                        emit(kgram, <i>)
                else if person in np[i]
                    for all proximity kgrams
                        emit(kgram, <-i>)
// s[kgram] = (count1, count2, ..., count40)
reduce(kgram, list<i>)
    iterate over list
        if i > 0
            s[kgram].count<i>++
        else
            s[kgram].count<-i>++
    emit(kgram, s[kgram])
The result of the reduce phase is each k-gram together with its counts with respect to the people in positive_people and negative_people. Using this result we generate a table of the natural-log posterior probabilities of each k-gram, stored as (kgram, <posteriorP, posteriorN>).
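The estimator behind posteriorP and posteriorN is not spelled out here; one plausible reading, sketched below, is the standard multinomial Naive Bayes log-likelihood with Laplace smoothing, applied to k-gram counts aggregated over the 20 positive and 20 negative people.

import math

def posterior_table(kgram_counts):
    """kgram_counts maps kgram -> (count over positive_people, count over negative_people)."""
    total_p = sum(p for p, _ in kgram_counts.values())
    total_n = sum(n for _, n in kgram_counts.values())
    vocab = len(kgram_counts)
    table = {}
    for kgram, (p, n) in kgram_counts.items():
        posterior_p = math.log((p + 1) / (total_p + vocab))  # log P(kgram | positive)
        posterior_n = math.log((n + 1) / (total_n + vocab))  # log P(kgram | negative)
        table[kgram] = (posterior_p, posterior_n)
    return table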
3.2 Query phase

The objective of the query phase is to compute opinion values for all persons in the corpus, over all days on which they feature in a news article. We do this through another MapReduce.
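The per-sentence score computed in this phase can be written compactly as follows: sum the stored log posteriors over the proximity k-grams and squash the difference with a logistic function, matching the 1 / (1 + e^(opN - opP)) expression in the pseudocode below. This sketch reuses the proximity_kgrams and posterior table sketches from the previous subsections.

import math

def sentence_score(sentence, person, posteriors):
    op_p = op_n = 0.0
    for kgram in proximity_kgrams(sentence, person):
        if kgram in posteriors:
            p, n = posteriors[kgram]
            op_p += p
            op_n += n
    # 0 = most negative, 1 = most positive
    return 1.0 / (1.0 + math.exp(op_n - op_p))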
The MapReduce implementation that computes opinions over one year is given below; it can easily be extended to 20 years.
map_constructor()
    post = loadintoset(kgram, <posteriorP, posteriorN>)

map(t2.people, t1.date, t1.sentence)
    iterate over rows
        for every person in t2.people
            if person in t1.sentence
                for all proximity kgrams
                    opP += post[kgram].posteriorP
                    opN += post[kgram].posteriorN
                emit(person, <date, 1 / (1 + e^(opN - opP))>)

// yearop[person] = initializeset(op1, op2, ..., op366)
reduce(person, list<date, opinion>)
    iterate over list
        for every date
            add opinion to yearop[person].op<date>
    compute average yearop[person].op<i>, for all i
    emit(person, yearop[person])
The map phase computes the opinion for a person with respect to a sentence and outputs the tuple (person, <date, opinion>). The reduce phase aggregates the tuples for each person and computes the average opinion value for each day. All these values are stored in the database, and we can then query the opinion values for any person.
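A single-machine sketch of this aggregation step, assuming the (person, date, opinion) tuples emitted by the map phase:

from collections import defaultdict

def daily_opinion(scored_tuples):
    # (person, date) -> [sum of opinions, number of sentences]
    acc = defaultdict(lambda: [0.0, 0])
    for person, date, opinion in scored_tuples:
        acc[(person, date)][0] += opinion
        acc[(person, date)][1] += 1
    return {key: total / count for key, (total, count) in acc.items()}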
Figure 1: System Implementation

4 Experimental Setup

Our system is implemented on Aster Data's nCluster running on Amazon EC2. Aster provides an SQL interface for running MapReduce operations. All our MapReduce functions were packaged as JARs and uploaded to the EC2 cluster, and were then invoked through SQL using a JDBC interface.

5 Results and Evaluation

We built our full system incrementally, running experiments and evaluating the performance at every stage before proceeding. The first stage was the selection of the 300 positive and 300 negative adjectives. On visual examination of these adjectives, we found that the majority were being classified correctly by our system. The following are the top 10 positive and negative adjectives for the year 2003.
Positives      Negatives
spicy          tory
pleasant       disturbed
delightful     rich
delicious      handicapped
homely         unemployed
ripe           wealthy
tasty          inadequate
static         disgusted
helpful        linguistic
superb         primitive
The next thing to verify was whether the 40 people (20 positive and 20 negative) were good representatives of the full set. To verify this, we used an exhaustive set of 8,000 positive and negative words [6] as the seed set, generated a list of the top 40 people from it, and took that list as our benchmark. On comparing it with our list of positive / negative people we found a healthy overlap (a high Jaccard coefficient), and hence continued building our system on the assumption that the 40 people generated from our own seed set were suitable.
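The overlap measure referred to here is the Jaccard coefficient of the two 40-person lists:

def jaccard(list_a, list_b):
    a, b = set(list_a), set(list_b)
    return len(a & b) / len(a | b)

# e.g. jaccard(benchmark_top40, our_top40) close to 1.0 indicates strong agreement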
We also performed a manual check on our seed lists of 20 positive-opinion and 20 negative-opinion people. This was done by reading samples of news articles written about these people in the corresponding year and evaluating whether they were being classified correctly. Below is the list of 20 people from 1998 about whom there was strongly positive (or negative) opinion in the articles they appeared in. Note that this should not be interpreted as the person themselves being good or bad.
Positive Opinion        Negative Opinion
Asimov Eric             Sharif Nawaz
Green Adolph            Louima Abner
De Dominique            Jones Corbin
Holzman Red             Wright Susan Webber
Woods Tiger             Livoti Francis
Robbins Jerome          Carey Ron
Close Chuck             Bin Laden Osama
Bach Sebastian          Rubin Robert
Rios Marcelo            Willey Kathleen
Clemens Roger           Mckinney Gene
Gershwin George         Abubakar Abdulsalam
Seinfeld Jerry          Hun Sen
Williams Serena         Harris Darrel
Bernstein Leonard       Johnson Holloway
Pak Se Ri               Chernomyrdin Viktor
De Kooning Willem       Hubbell Webster
Williams Bernie         Spitzer Eliot
Pollock Jackson         Abacha Sani
Hingis Martina          Obuchi Keizo
Van Gogh Vincent        Primakov Yevgeny
For evaluating the final opinion numbers, we tried to find events that would have caused a change in opinion towards a person, and checked whether our opinion scores reflected those events. We provide some examples which demonstrate this.

Figure 2: Opinion towards Osama Bin Laden in 2001. We can observe a sharp drop around September 11th.

We also compared the Gallup ratings for US Presidents with the opinion values we computed, and observed that the crests and troughs of the two curves are similar.

Figure 3: Computed opinion towards George Bush in 2001, compared with Gallup ratings

Figure 4: Computed opinion towards Bill Clinton in 1999, compared with Gallup ratings
6 Observations

The following are some of the interesting observations we made during the course of this work:
• We found that our seed list of positively opinionated people tended to include people about whom obituaries had been published. This actually validates our method, as obituaries generally have a very positive orientation towards their subject.

• Another set of people that were generally positive were sportsmen. This is primarily because articles about them contain phrases like 'nice shot', 'winner', 'perfect score', etc. Tiger Woods shows up in many of our positive-people lists.

• The initial seed of negative people tends to include people who may not have done anything negative themselves but who, because of their area of work, are generally mentioned in negative articles. For example, in the 1998 list shown above, Susan Webber Wright, a district court judge, appears among the negative people. This is primarily due to her involvement in the Paula Jones - Bill Clinton case, which brought her national attention.

• Negative people also tend to include criminals and persons on trial. Politicians are also very common in this list, since the media always seems to be critical of their work.
6.1 Things that worked

• Proximity k-grams: The decision to use proximity k-grams instead of unigrams over the whole sentence gave better results.

• Using adjectives to grow the seed set: This added linguistic information to our seed set.

• Replacing person names by a token 'PERSON': This makes the k-grams more generic. It helps in capturing phrases like 'Bush won' or 'Obama won' as 'PERSON won'.

6.2 Things that didn't work

• Creating the seed set of people from the 14 seed words: We initially tried creating the seed list of people using only the 14 seed words. This did not work, since very few people co-occurred with these words. It indicated the need for a larger and more corpus-dependent seed set of words (which was then obtained using the adjectives list).

• Variable-length k-grams: When we used variable-length k-grams, our top results were filled with trigrams. This did not improve the results and was in fact more computationally expensive.

• Considering only those k-grams that contain the person's name: We tried to represent proximity by considering only k-grams containing the PERSON token, but found the general method of fixed-length k-grams in a window around the person to work better. We realized that if a k-gram with a PERSON token is important, our algorithm will handle it anyway.
7 Conclusions

In our work we experimented with using the information inherent in the data rather than fixed NLP rules. Our results indicate that simple data mining techniques can give good results, without the need for NLP techniques, if the corpus is sufficiently large.

8 References

[1] Bo Pang and Lillian Lee. Opinion Mining and Sentiment Analysis.
[2] http://www.facebook.com/lexicon/new/?sentiment - Facebook's sentiment score based on Wall posts.
[3] http://languagewrong.tumblr.com/d - Blog of Facebook engineer Roddy Lindsay.
[4] http://www.techcrunch.com/2009/02/15/mining-the-thought-stream/
[5] http://en.wikipedia.org/wiki/Sentiment_analysis
[6] http://www.cs.pitt.edu/mpqa/
[7] Peter D. Turney et al. Measuring Praise and Criticism: Inference of Semantic Orientation from Association.
[8] Hang Cui et al. Comparative Experiments on Sentiment Classification for Online Product Reviews.
[9] http://infolab.stanford.edu/~ullman/mining/2008/slides/RelationExtraction.ppt
[10] http://corpus.nytimes.com/