
A new ANEW: Evaluation of a word list for
sentiment analysis in microblogs
Finn Årup Nielsen
DTU Informatics, Technical University of Denmark, Lyngby, Denmark.
[email protected], http://www.imm.dtu.dk/~fn/
Abstract. Sentiment analysis of microblogs such as Twitter has recently gained a fair amount of attention. One of the simplest sentiment analysis approaches compares the words of a posting against a labeled word list, where each word has been scored for valence, i.e., a "sentiment lexicon" or "affective word list". There exist several affective word lists, e.g., ANEW (Affective Norms for English Words), developed before the advent of microblogging and sentiment analysis. I wanted to examine how well ANEW and other word lists perform for the detection of sentiment strength in microblog posts in comparison with a new word list specifically constructed for microblogs. I used manually labeled postings from Twitter scored for sentiment. Using a simple word matching I show that the new word list may perform better than ANEW, though not as well as the more elaborate approach found in SentiStrength.
1 Introduction
Sentiment analysis has become popular in recent years. Web services, such as
socialmention.com, may even score microblog postings on Identi.ca and Twitter
for sentiment in real-time. One approach to sentiment analysis starts with labeled
texts and uses supervised machine learning trained on the labeled text data to
classify the polarity of new texts [1]. Another approach creates a sentiment
lexicon and scores the text based on some function that describes how the words
and phrases of the text match the lexicon. This approach is, e.g., at the core
of the SentiStrength algorithm [2].
It is unclear what the best way is to build a sentiment lexicon. There exist several word lists labeled with emotional valence, e.g., ANEW [3], General Inquirer, OpinionFinder [4], SentiWordNet and WordNet-Affect, as well as the word list included in the SentiStrength software [2]. These word lists differ in the words they include, e.g., some do not include strong obscene words and Internet slang acronyms, such as "WTF" and "LOL". The inclusion of such terms could be important for reaching good performance when working with the short informal texts found in Internet fora and microblogs. Word lists may also differ in whether the words are scored with sentiment strength or just positive/negative polarity.
I have begun to construct a new word list with sentiment strength and the inclusion of Internet slang and obscene words. Although we have used it for sentiment analysis on Twitter data [5], we have not yet validated it. Data sets with
manually labeled texts can evaluate the performance of the different sentiment
analysis methods. Researchers increasingly use Amazon Mechanical Turk (AMT)
for creating labeled language data, see, e.g., [6]. Here I take advantage of this
approach.
2 Construction of word list
My new word list was initially set up in 2009 for tweets downloaded for online sentiment analysis in relation to the United Nations Climate Conference (COP15). Since then it has been extended. The version termed AFINN-96, distributed on the Internet^1, has 1468 different words, including a few phrases. The newest version has 2477 unique words, including 15 phrases that were not used for this study. Like SentiStrength^2, it uses a scoring range from -5 (very negative) to +5 (very positive). For ease of labeling I only scored for valence, leaving out, e.g., subjectivity/objectivity, arousal and dominance. The words were scored manually by the author.
The word list was initiated from a set of obscene words [7, 8] as well as a few positive words. It was gradually extended by examining Twitter postings collected for COP15, particularly the postings that scored high on sentiment using the list as it grew; see the sketch below. I included words from the public domain Original Balanced Affective Word List^3 by Greg Siegle. Later I added Internet slang by browsing the Urban Dictionary^4, including acronyms such as WTF, LOL and ROFL. The most recent additions come from the large word list by Steven J. DeRose, The Compass DeRose Guide to Emotion Words.^5 The words of DeRose are categorized but not scored for valence with numerical values. Together with the DeRose words I browsed Wiktionary and the synonyms it provided to further enhance the list. In some cases I used Twitter to determine in which contexts a word appeared. I also used the Microsoft Web n-gram similarity Web service ("Clustering words based on context similarity"^6) to discover relevant words. I do not distinguish between word categories, so to avoid ambiguities I excluded words such as patient, firm, mean, power and frank. Words such as "surprise", with high arousal but variable sentiment, were not included in the word list.
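A minimal sketch of the kind of helper that could support this iterative extension is given here: re-score the collected postings with the current partial list and inspect the highest-scoring ones for new candidate words. The function names, the toy lexicon and the regular-expression tokenization are my own illustrative assumptions, not part of the original procedure.

import re

def score_with_current_list(text, lexicon):
    # Sum the valences of the words that the growing lexicon already covers.
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0) for word in words)

def postings_to_inspect(tweets, lexicon, n=20):
    # Return the n postings with the largest absolute sentiment score.
    return sorted(tweets,
                  key=lambda tweet: abs(score_with_current_list(tweet, lexicon)),
                  reverse=True)[:n]

# Toy example:
partial_lexicon = {"bad": -3, "wtf": -4, "good": 3}
tweets = ["COP15 going so bad wtf", "nice weather in Lyngby today"]
print(postings_to_inspect(tweets, partial_lexicon, n=1))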
Most of the positive words were labeled with +2 and most of the negative words with -2; see the histogram in Figure 1. I typically rated strong obscene words, e.g., as listed in [7], with either -4 or -5. The word list has a bias towards negative words (1598, corresponding to 65%) compared to positive words (878). A single phrase was labeled with valence 0. The bias corresponds closely to the bias found in the OpinionFinder sentiment lexicon (4911 (64%) negative and 2718 positive words).
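A small sketch of how the valence histogram (Figure 1) and the negative/positive bias can be computed from the word list. It assumes a plain-text file with one word and one integer valence per line, separated by a tab; the exact file format and file name are assumptions here.

from collections import Counter

def load_word_list(path):
    # Read "word<TAB>valence" lines into a dict (assumed file format).
    lexicon = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            word, valence = line.rstrip("\n").rsplit("\t", 1)
            lexicon[word] = int(valence)
    return lexicon

def valence_statistics(lexicon):
    # Histogram of valences and the counts behind the negative/positive bias.
    histogram = Counter(lexicon.values())
    negative = sum(1 for valence in lexicon.values() if valence < 0)
    positive = sum(1 for valence in lexicon.values() if valence > 0)
    return histogram, negative, positive

# Example with a hypothetical file name:
# histogram, negative, positive = valence_statistics(load_word_list("AFINN.txt"))
# print(histogram, negative / (negative + positive))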
1 http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=59819
2 http://sentistrength.wlv.ac.uk/
3 http://www.sci.sdsu.edu/CAL/wordlist/origwordlist.html
4 http://www.urbandictionary.com
5 http://www.derose.net/steve/resources/emotionwords/ewords.html
6 http://web-ngram.research.microsoft.com/similarity/
I compared the score of each word with the mean valence of ANEW. Figure 2 shows a scatter plot for this comparison, yielding a Spearman's rank correlation of 0.81 when words are matched directly and only words in the intersection of the two word lists are included. I also tried to match entries in ANEW and my word list by applying Porter word stemming (on both word lists) and WordNet lemmatization (on my word list) as implemented in NLTK [9]. The results did not change significantly.

[Fig. 1. Histogram of my valences.]

[Fig. 2. Correlation between ANEW and my new word list. Pearson correlation = 0.91, Spearman correlation = 0.81, Kendall correlation = 0.63.]

When splitting ANEW at valence 5 and my list at valence 0 I find a few discrepancies: aggressive, mischief, ennui, hard, silly, alert, mischiefs, noisy. Word stemming generates a few further discrepancies, e.g., alien/alienation, affection/affected, profit/profiteer.

Apart from ANEW I also examined the General Inquirer and OpinionFinder word lists. As these word lists report polarity, I associated words with positive sentiment with the valence +1 and negative words with the valence -1. I furthermore obtained the sentiment strength from SentiStrength via its Web service^7 and converted its positive and negative sentiments to one single value by selecting the one with the numerically largest value and zeroing the sentiment if the positive and negative sentiment magnitudes were equal.
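As a concrete illustration of the two conversions just described, a minimal sketch follows. The parsing of the SentiStrength Web service response is assumed to have happened already (positive strength in 1..5, negative strength in -5..-1), and the function names are mine.

def polarity_to_valence(polarity):
    # Map a General Inquirer/OpinionFinder polarity label to +1 or -1.
    return 1 if polarity == "positive" else -1

def sentistrength_single_value(positive, negative):
    # Keep the numerically largest of the two strengths; zero the result on a tie.
    if abs(positive) == abs(negative):
        return 0
    return positive if abs(positive) > abs(negative) else negative

print(sentistrength_single_value(1, -2))   # -> -2
print(sentistrength_single_value(2, -2))   # -> 0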
3 Twitter data
For evaluating and comparing the word list with ANEW, General Inquirer, OpinionFinder and SentiStrength, a data set of 1,000 tweets labeled with AMT was applied. These labeled tweets were collected by Alan Mislove for the Twittermood/"Pulse of a Nation"^8 study [10]. Each tweet was rated ten times to get a more reliable estimate of the human-perceived mood, and each rating was a sentiment strength given as an integer between 1 (negative) and 9 (positive). The average over the ten values represented the canonical "ground truth" for this study. The tweets were not used during the construction of the word list.
7 http://sentistrength.wlv.ac.uk/
8 http://www.ccs.neu.edu/home/amislove/twittermood/
[Table 1. Example tweet scoring of the tweet "ear infection making it impossible 2 sleep headed 2 the doctors 2 get new prescription so fucking early", with per-word valences from my list, ANEW, General Inquirer (GI), OpinionFinder (OF) and SentiStrength (SS); most words score 0 in every lexicon. -5 has been subtracted from the original ANEW score. SentiStrength reported "positive strength 1 and negative strength -2".]
To compute a sentiment score of a tweet I identified words and found the valence for each word by lookup in the sentiment lexicons. The sum of the valences of the words divided by the number of words represented the combined sentiment strength of the tweet. I also tried a few other weighting schemes: the sum of valences without normalization for the number of words, normalization of the sum by the number of words with non-zero valence, choosing the most extreme valence among the words, and quantizing the tweet valences to +1, 0 and -1. For ANEW I also applied a version with matching using the NLTK WordNet lemmatizer.
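A minimal sketch of this scoring and of the alternative weighting schemes, assuming a lexicon dictionary mapping lower-case words to integer valences; the scheme names and the tokenization regular expression are my own.

import re

def word_valences(tweet, lexicon):
    # Look up the valence of every identified word; unknown words score 0.
    words = re.findall(r"[a-z']+", tweet.lower())
    return [lexicon.get(word, 0) for word in words]

def tweet_score(tweet, lexicon, scheme="mean"):
    valences = word_valences(tweet, lexicon)
    if not valences:
        return 0.0
    if scheme == "mean":          # sum divided by the number of words (default)
        return sum(valences) / len(valences)
    if scheme == "sum":           # sum without normalization
        return float(sum(valences))
    if scheme == "nonzero_mean":  # normalize by the number of sentiment-bearing words
        hits = [v for v in valences if v != 0]
        return sum(hits) / len(hits) if hits else 0.0
    if scheme == "extreme":       # the single most extreme valence
        return float(max(valences, key=abs))
    if scheme == "sign":          # quantize the tweet score to +1, 0 or -1
        total = sum(valences)
        return float((total > 0) - (total < 0))
    raise ValueError(scheme)

print(tweet_score("so fucking early", {"fucking": -4}))   # -> -1.33... (i.e., -4/3)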
4 Results

My word tokenization identified 15,768 words in total among the 1,000 tweets, with 4,095 unique words. 422 of these 4,095 words hit my 2,477-word list, while the corresponding number for ANEW was 398 of its 1034 words. Of the 3392 words in General Inquirer that I labeled with non-zero sentiment, 358 were found in our Twitter corpus, and for OpinionFinder this number was 562 from a total of 6442; see Table 1 for a scored example tweet.

I found my list to have a higher correlation with the labeling from the AMT (Pearson correlation: 0.564, Spearman's rank correlation: 0.596, see the scatter plot in Figure 3) than ANEW had (Pearson: 0.525, Spearman: 0.544). In my application of the General Inquirer word list it did not perform well, having a considerably lower AMT correlation than my list and ANEW (Pearson: 0.374, Spearman: 0.422). OpinionFinder, with its 90% larger lexicon, performed better than General Inquirer but not as well as my list and ANEW (Pearson: 0.458, Spearman: 0.491). The SentiStrength analyzer showed superior performance, with a Pearson correlation of 0.610 and a Spearman correlation of 0.616; see Table 2.

[Fig. 3. Scatter plot of sentiment strengths for 1,000 tweets, with AMT sentiment plotted against the sentiment found by application of my word list. Pearson correlation = 0.564, Spearman correlation = 0.596, Kendall correlation = 0.439.]

Table 2. Pearson correlations between sentiment strength detection methods on 1,000 tweets. AMT: Amazon Mechanical Turk, GI: General Inquirer, OF: OpinionFinder, SS: SentiStrength.

        My    ANEW   GI     OF     SS
AMT    .564   .525  .374   .458   .610
My            .696  .525   .675   .604
ANEW                .592   .624   .546
GI                         .705   .474
OF                                .512
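The correlations in Table 2 relate the per-tweet scores of each method to the mean AMT ratings. A minimal sketch of that evaluation step, assuming two aligned lists of per-tweet numbers and using SciPy; the toy numbers are illustrative.

from scipy.stats import pearsonr, spearmanr, kendalltau

def evaluate(predicted_scores, amt_means):
    # Return the (Pearson, Spearman, Kendall) correlation coefficients.
    return (pearsonr(predicted_scores, amt_means)[0],
            spearmanr(predicted_scores, amt_means)[0],
            kendalltau(predicted_scores, amt_means)[0])

# Example with toy numbers:
print(evaluate([0.1, -0.5, 0.0, 0.7], [6.2, 3.1, 5.0, 7.4]))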
I saw little effect of the different tweet sentiment scoring approaches: for ANEW the four different Pearson correlations were in the range 0.522–0.526. For my list I observed correlations in the range 0.543–0.581, with the extreme-valence scoring as the lowest and sum scoring without normalization the highest. With quantization of the tweet scores to +1, 0 and -1 the correlation only dropped to 0.548. For the Spearman correlation the sum scoring with normalization for the number of words appeared as the one with the highest value (0.596).
To examine whether the difference in performance between the application of ANEW and my list is due to a different lexicon or a different scoring, I looked at the intersection between the two word lists. With a direct match this intersection consisted of 299 words. Building two new sentiment lexicons with these 299 words, one with the valences from my list, the other with valences from ANEW, and applying them to the Twitter data, I found that the Pearson correlations were 0.49 and 0.52, to ANEW's advantage.

[Fig. 4. Evolution of word list performance: performance growth with word list extension from 5 words to 2477 words. Upper panel: Pearson correlation, lower panel: Spearman rank correlation, generated from 50 resamples among the 2477 words.]
5 Discussion
With the simple word list approach to sentiment analysis I found my list to perform slightly ahead of ANEW. However, the more elaborate sentiment analysis in SentiStrength showed the overall best performance, with a correlation to the AMT labels of 0.610. This figure is close to the correlations reported in the evaluation of the SentiStrength algorithm on 1,041 MySpace comments (0.60 and 0.56) [2].
Even though General Inquirer and OpinionFinder have the largest word lists, I found I could not make them perform as well as SentiStrength, my list and ANEW for sentiment strength detection in microblog postings. The two former lists both score words on polarity rather than strength, which could explain the difference in performance.
Is the difference between my list and ANEW due to better scoring or more words? The analysis of the intersection between the two word lists indicated that the ANEW scoring is better. The slightly better performance of my list with the entire lexicon may be due to its inclusion of Internet slang and obscene words.
Newer methods, e.g., as implemented in SentiStrength, use a range of techniques: detection of negation, handling of emoticons and spelling variations [2]. The present application of my list used none of these approaches and might have benefited from them. However, the SentiStrength evaluation showed that valence switching at negation and emoticon detection might not necessarily increase the performance of sentiment analyzers (Tables 4 and 5 in [2]).
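For illustration, a minimal sketch of the kind of negation handling such methods apply; this was not used in the present study, and the negator set and function are my own illustrative assumptions.

NEGATORS = {"not", "no", "never", "n't", "dont", "don't", "cant", "can't"}

def valences_with_negation(words, lexicon):
    # Flip the valence of a word when it is directly preceded by a negator.
    valences = []
    for i, word in enumerate(words):
        valence = lexicon.get(word, 0)
        if i > 0 and words[i - 1] in NEGATORS:
            valence = -valence           # valence switching at negation
        valences.append(valence)
    return valences

print(valences_with_negation("not good at all".split(), {"good": 3}))  # [0, -3, 0, 0]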
The evolution of the performance (Figure 4) suggests that the addition of
words to my list might still improve its performance slightly.
Although my list comes out slightly ahead of ANEW in Twitter sentiment analysis, ANEW is still preferable for scientific psycholinguistic studies, as its scoring has been validated across several persons. Also note that ANEW's standard deviation was not used in the scoring; using it might have improved ANEW's performance.
Acknowledgment. I am grateful to Alan Mislove and Sune Lehmann for providing the 1,000 tweets with the Amazon Mechanical Turk labels and to Steven J. DeRose and Greg Siegle for providing their word lists. Mislove, Lehmann and Daniela Balslev also provided input to the article. I thank the Danish Strategic Research Councils for generous support to the 'Responsible Business in the Blogosphere' project.
References
1. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Foundations and Trends
in Information Retrieval 2(1-2) (2008) 1–135
2. Thelwall, M., Buckley, K., Paltoglou, G., Cai, D., Kappas, A.: Sentiment strength
detection in short informal text. Journal of the American Society for Information
Science and Technology 61(12) (2010) 2544–2558
3. Bradley, M.M., Lang, P.J.: Affective norms for English words (ANEW): Instruction
manual and affective ratings. Technical Report C-1, The Center for Research in
Psychophysiology, University of Florida (1999)
4. Wilson, T., Wiebe, J., Hoffmann, P.: Recognizing contextual polarity in phrase-level sentiment analysis. In: Proceedings of the conference on Human Language
Technology and Empirical Methods in Natural Language Processing, Stroudsburg,
PA, USA, Association for Computational Linguistics (2005)
5. Hansen, L.K., Arvidsson, A., Nielsen, F.Å., Colleoni, E., Etter, M.: Good friends,
bad news — affect and virality in Twitter. Accepted for The 2011 International
Workshop on Social Computing, Network, and Services (SocialComNet 2011)
(2011)
6. Akkaya, C., Conrad, A., Wiebe, J., Mihalcea, R.: Amazon Mechanical Turk for
subjectivity word sense disambiguation. In: Proceedings of the NAACL HLT 2010
Workshop on Creating Speech and Language Data with Amazon’s Mechanical
Turk, Association for Computational Linguistics (2010) 195–203
7. Baudhuin, E.S.: Obscene language and evaluative response: an empirical study.
Psychological Reports 32 (1973)
8. Sapolsky, B.S., Shafer, D.M., Kaye, B.K.: Rating offensive words in three television
program contexts. BEA 2008, Research Division (2008)
9. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python. O’Reilly,
Sebastopol, California (June 2009)
10. Biever, C.: Twitter mood maps reveal emotional states of America. The New
Scientist 207(2771) (July 2010) 14