
A new ANEW: Evaluation of a word list for
sentiment analysis in microblogs
Finn Årup Nielsen
DTU Informatics, Technical University of Denmark, Lyngby, Denmark.
[email protected], http://www.imm.dtu.dk/~fn/
Abstract. Sentiment analysis of microblogs such as Twitter has recently gained a fair amount of attention. One of the simplest sentiment analysis approaches compares the words of a posting against a labeled word list, where each word has been scored for valence: a "sentiment lexicon" or "affective word list". There exist several affective word lists, e.g., ANEW (Affective Norms for English Words), developed before the advent of microblogging and sentiment analysis. I wanted to examine how well ANEW and other word lists perform for the detection of sentiment strength in microblog posts in comparison with a new word list specifically constructed for microblogs. I used manually labeled postings from Twitter scored for sentiment. Using simple word matching I show that the new word list may perform better than ANEW, though not as well as the more elaborate approach found in SentiStrength.
1 Introduction
Sentiment analysis has become popular in recent years. Web services, such as socialmention.com, may even score microblog postings on Identi.ca and Twitter for sentiment in real-time. One approach to sentiment analysis starts with labeled texts and uses supervised machine learning trained on the labeled text data to classify the polarity of new texts [1]. Another approach creates a sentiment lexicon and scores a text based on some function that describes how the words and phrases of the text match the lexicon. This approach is, e.g., at the core of the SentiStrength algorithm [2].
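As a minimal sketch of such a lexicon-based scorer (the three-word lexicon is only illustrative; the valence of "fucking" is taken from Table 1 below, and the word splitting and averaging follow the supplementary listings):

import re

# Illustrative mini-lexicon; the real word lists are described in Section 2.
lexicon = {'good': 3, 'bad': -3, 'fucking': -4}

def sentiment(text):
    # Split on non-word characters and average the valences of all words
    words = re.split(r'\W+', text.lower())
    return float(sum(lexicon.get(word, 0) for word in words)) / len(words)

print(sentiment('so fucking early'))  # prints -1.333...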
It is unclear what the best way to build a sentiment lexicon is. There exist several word lists labeled with emotional valence, e.g., ANEW [3], General Inquirer, OpinionFinder [4], SentiWordNet and WordNet-Affect, as well as the word list included in the SentiStrength software [2]. These word lists differ in the words they include, e.g., some do not include strong obscene words and Internet slang acronyms, such as "WTF" and "LOL". The inclusion of such terms could be important for reaching good performance when working with the short informal texts found in Internet fora and microblogs. Word lists may also differ in whether the words are scored with sentiment strength or just positive/negative polarity.
I have begun to construct a new word list with sentiment strength and the inclusion of Internet slang and obscene words. Although we have used it for sentiment analysis on Twitter data [5] we have not yet validated it. Data sets with manually labeled texts make it possible to evaluate the performance of different sentiment analysis methods. Researchers increasingly use Amazon Mechanical Turk (AMT) for creating such labeled language data, see, e.g., [6]. Here I take advantage of this approach.
2 Construction of word list
My new word list was initially set up in 2009 for tweets downloaded for online sentiment analysis in relation to the United Nations Climate Conference (COP15). Since then it has been extended. The version termed AFINN-96, distributed on the Internet (footnote 1), has 1468 different words, including a few phrases. The newest version has 2477 unique words, including 15 phrases that were not used for this study. Like SentiStrength (footnote 2), it uses a scoring range from −5 (very negative) to +5 (very positive). For ease of labeling I only scored for valence, leaving out, e.g., subjectivity/objectivity, arousal and dominance. The words were scored manually by the author.
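The list is stored as tab-separated word/valence pairs, which is the format the loading code in the supplementary listings assumes. A sketch (Python 2, as in the listings; the entry for "amazing" is an illustrative example, while "fucking" is scored as in Table 1):

# Nielsen2009Responsible_emotion.csv holds one entry per line, e.g.:
#
#   amazing<TAB>4
#   fucking<TAB>-4
#
afinn = dict(map(lambda (k, v): (k, int(v)),
                 [line.split('\t')
                  for line in open('Nielsen2009Responsible_emotion.csv')]))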
The word list was initiated from a set of obscene words [7, 8] as well as a few positive words. It was gradually extended by examining Twitter postings collected for COP15, particularly the postings which scored high on sentiment using the list as it grew. I included words from the public domain Original Balanced Affective Word List (footnote 3) by Greg Siegle. Later I added Internet slang by browsing the Urban Dictionary (footnote 4), including acronyms such as WTF, LOL and ROFL. The most recent additions come from the large word list by Steven J. DeRose, The Compass DeRose Guide to Emotion Words (footnote 5). The words of DeRose are categorized but not scored for valence with numerical values. Together with the DeRose words I browsed Wiktionary and the synonyms it provided to further enhance the list. In some cases I used Twitter to determine in which contexts a word appeared. I also used the Microsoft Web n-gram similarity Web service ("Clustering words based on context similarity", footnote 6) to discover relevant words. I do not distinguish between word categories, so to avoid ambiguities I excluded words such as patient, firm, mean, power and frank. Words such as "surprise", which have high arousal but variable sentiment, were not included in the word list.
Most of the positive words were labeled with +2 and most of the negative words with −2, see the histogram in Figure 1. I typically rated strong obscene words, e.g., as listed in [7], with either −4 or −5. The word list has a bias towards negative words (1598, corresponding to 65%) compared to positive words (878). A single phrase was labeled with valence 0. The bias corresponds closely to the bias found in the OpinionFinder sentiment lexicon (4911 (64%) negative and 2718 positive words).
Footnotes:
1 http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=59819
2 http://sentistrength.wlv.ac.uk/
3 http://www.sci.sdsu.edu/CAL/wordlist/origwordlist.html
4 http://www.urbandictionary.com
5 http://www.derose.net/steve/resources/emotionwords/ewords.html
6 http://web-ngram.research.microsoft.com/similarity/
I compared the score of each word with the mean valence of ANEW. Figure 2 shows a scatter plot for this comparison, yielding a Spearman's rank correlation of 0.81 when words are directly matched and only words in the intersection of the two word lists are included. I also tried to match entries in ANEW and my word list by applying Porter word stemming (on both word lists) and WordNet lemmatization (on my word list) as implemented in NLTK [9]. The results did not change significantly.

Fig. 1. Histogram of my valences. (Title: "Histogram of valences for my word list"; x-axis: my valences; y-axis: absolute frequency.)

When splitting ANEW at valence 5 and my list at valence 0 I find a few discrepancies: aggressive, mischief, ennui, hard, silly, alert, mischiefs, noisy. Word stemming generates a few further discrepancies, e.g., alien/alienation, affection/affected, profit/profiteer.

Fig. 2. Correlation between ANEW and my new word list. (Pearson correlation = 0.91, Spearman correlation = 0.81, Kendall correlation = 0.63; x-axis: my list; y-axis: ANEW.)

Apart from ANEW I also examined the General Inquirer and OpinionFinder word lists. As these word lists report polarity, I associated words with positive sentiment with the valence +1 and negative words with −1. I furthermore obtained the sentiment strength from SentiStrength via its Web service (footnote 7) and converted its positive and negative sentiments to one single value by selecting the one with the numerically largest value and zeroing the sentiment if the positive and negative sentiment magnitudes were equal.
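This conversion of SentiStrength's two scores to a single value, wrapped in a function here for clarity, follows the corresponding lines of the supplementary listing Nielsen2011New_sentistrength.py:

def combine_sentistrength(positive, negative):
    # positive is 1..5 and negative is -1..-5 as reported by SentiStrength
    if positive > abs(negative):
        return positive
    elif abs(negative) > positive:
        return negative
    else:
        return 0  # equal magnitudes zero the sentiment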
3 Twitter data
For evaluating and comparing the word list with ANEW, General Inquirer, OpinionFinder and SentiStrength, a data set of 1,000 tweets labeled with AMT was applied. These labeled tweets were collected by Alan Mislove for the Twittermood/"Pulse of a Nation" (footnote 8) study [10]. Each tweet was rated ten times to get a more reliable estimate of the human-perceived mood, and each rating was a sentiment strength given as an integer between 1 (negative) and 9 (positive). The average over the ten values represented the canonical "ground truth" for this study. The tweets were not used during the construction of the word list.
Footnotes:
7 http://sentistrength.wlv.ac.uk/
8 http://www.ccs.neu.edu/home/amislove/twittermood/
Table 1. Example tweet scoring. −5 has been subtracted from the original ANEW score. SentiStrength reported "positive strength 1 and negative strength −2"; it scores the tweet as a whole, so only its combined value is shown.

Word           My     ANEW    GI    OF    SS
ear             0      0       0     0
infection       0     -3.34   -1     0
making          0      0       0     0
it              0      0       0     0
impossible      0      0       0    -1
2               0      0       0     0
sleep           0      2.2     0     0
headed          0      0       0     0
2               0      0       0     0
the             0      0       0     0
doctors         0      0       0     0
2               0      0       0     0
get             0      0       0     0
new             0      0       0     0
prescription    0      0       0     0
so              0      0       0     0
fucking        -4      0       0     0
early           0      0       0     0
Sum            -4     -1.14   -1    -1    -2
To compute a sentiment score for a tweet I identified its words and found the valence of each word by lookup in the sentiment lexicons. The sum of the valences of the words divided by the number of words represented the combined sentiment strength for a tweet. I also tried a few other weighting schemes: the sum of valences without normalization by the number of words, normalization of the sum by the number of words with non-zero valence, choosing the most extreme valence among the words, and quantizing the tweet valences to +1, 0 and −1. For ANEW I also applied a version where words were matched using the NLTK WordNet lemmatizer.
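Expressed compactly, the weighting schemes are as follows (a sketch condensing the corresponding lines of the supplementary listings; valences is the list of word valences looked up for one tweet, and the most-extreme-valence scheme corresponds to the extremevalence function in the listings):

import numpy as np

def tweet_scores(valences):
    nonzeros = len([v for v in valences if v]) or 1
    mean = float(sum(valences)) / len(valences)
    return {'mean': mean,                               # sum normalized by number of words
            'sum': sum(valences),                       # no normalization
            'nonzero': float(sum(valences)) / nonzeros, # normalized by matching words only
            'quant': np.sign(mean)}                     # quantized to +1, 0 or -1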
4 Results
My word tokenization identified 15,768 words in total among the 1,000 tweets, with 4,095 unique words. 422 of these 4,095 words hit my 2,477-word list, while the corresponding number for ANEW was 398 of its 1,034 words. Of the 3,392 General Inquirer words that I labeled with non-zero sentiment, 358 were found in our Twitter corpus, and for OpinionFinder this number was 562 from a total of 6,442; see Table 1 for a scored example tweet.

I found my list to have a higher correlation with the labeling from the AMT (Pearson correlation: 0.564, Spearman's rank correlation: 0.596, see the scatter plot in Figure 3) than ANEW had (Pearson: 0.525, Spearman: 0.544). In my application of the General Inquirer word list it did not perform well, having a considerably lower AMT correlation than my list and ANEW (Pearson: 0.374, Spearman: 0.422). OpinionFinder with its 90% larger lexicon performed better than General Inquirer but not as well as my list and ANEW (Pearson: 0.458, Spearman: 0.491). The SentiStrength analyzer showed superior performance with a Pearson correlation of 0.610 and a Spearman correlation of 0.616, see Table 2.

Fig. 3. Scatter plot of sentiment strengths for 1,000 tweets with AMT sentiment plotted against sentiment found by application of my word list. (Pearson correlation = 0.564, Spearman correlation = 0.596, Kendall correlation = 0.439.)

Table 2. Pearson correlations between sentiment strength detection methods on 1,000 tweets. AMT: Amazon Mechanical Turk, GI: General Inquirer, OF: OpinionFinder, SS: SentiStrength.

        My     ANEW   GI     OF     SS
AMT     .564   .525   .374   .458   .610
My             .696   .525   .675   .604
ANEW                  .592   .624   .546
GI                           .705   .474
OF                                  .512
I saw little effect of the different tweet sentiment scoring approaches: for ANEW the four different Pearson correlations were in the range 0.522–0.526. For my list I observed correlations in the range 0.543–0.581, with the extreme scoring as the lowest and the sum scoring without normalization as the highest. With quantization of the tweet scores to +1, 0 and −1 the correlation only dropped to 0.548. For the Spearman correlation the sum scoring with normalization by the number of words appeared as the one with the highest value (0.596).
To examine whether the difference in performance between the application of ANEW and my list is due to a different lexicon or a different scoring, I looked at the intersection between the two word lists. With a direct match this intersection consisted of 299 words. Building two new sentiment lexicons with these 299 words, one with the valences from my list, the other with valences from ANEW, and applying them on the Twitter data, I found that the Pearson correlations were 0.49 and 0.52, to ANEW's advantage.

Fig. 4. Performance growth with word list extension from 5 words to 2477 words. Upper panel: Pearson correlation, lower panel: Spearman rank correlation, generated from 50 resamples among the 2477 words. (x-axis: word list size, with ticks at 5, 100, 300, 500, 1000, 1500, 2000 and 2477.)
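The curves in Figure 4 can be generated along the following lines. This is only a sketch: the corresponding script (Nielsen2011New_evolution.py) is present only as a header in the supplementary listings, so the details, e.g., the score function and the exact handling of the subsets, are my assumptions; afinn and tweets are assumed loaded as in the listings, and the subset sizes are read off the figure axis.

import random
import re
import numpy as np
from scipy.stats.stats import spearmanr

def score(text, lexicon):
    words = re.split(r'\W+', text.lower())
    return float(sum(lexicon.get(w, 0) for w in words)) / len(words)

sizes = [5, 100, 300, 500, 1000, 1500, 2000, 2477]
for size in sizes:
    correlations = []
    for resampling in range(50):
        # Score all tweets with a random subset of the full word list
        sublist = dict((w, afinn[w]) for w in random.sample(afinn.keys(), size))
        scores = [score(t['text'], sublist) for t in tweets]
        correlations.append(spearmanr(scores, [t['score_amt'] for t in tweets])[0])
    print(size, np.mean(correlations))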
5 Discussion
On the simple word list approach to sentiment analysis I found my list performing slightly ahead of ANEW. However, the more elaborate sentiment analysis in SentiStrength showed the overall best performance, with a correlation to the AMT labels of 0.610. This figure is close to the correlations reported in the evaluation of the SentiStrength algorithm on 1,041 MySpace comments (0.60 and 0.56) [2]. Even though General Inquirer and OpinionFinder have the largest word lists, I could not make them perform as well as SentiStrength, my list and ANEW for sentiment strength detection in microblog postings. The two former lists both score words on polarity rather than strength, which could explain the difference in performance.
Is the difference between my list and ANEW due to better scoring or more words? The analysis of the intersection between the two word lists indicated that the ANEW scoring is better. The slightly better performance of my list with the entire lexicon may thus be due to its inclusion of Internet slang and obscene words.
Newer methods, e.g., as implemented in SentiStrength, use a range of techniques: detection of negation, handling of emoticons and spelling variations [2]. The present application of my list used none of these approaches and might have benefited from them. However, the SentiStrength evaluation showed that valence switching at negation and emoticon detection might not necessarily increase the performance of sentiment analyzers (Tables 4 and 5 in [2]).
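For concreteness, valence switching at negation could look as follows (an illustration of the general technique, not SentiStrength's actual rule; the set of negating words is made up):

NEGATIONS = set(['not', 'never', "don't", 'no'])

def score_with_negation(words, lexicon):
    total = 0
    for i, word in enumerate(words):
        valence = lexicon.get(word, 0)
        if i > 0 and words[i - 1] in NEGATIONS:
            valence = -valence  # switch valence after a negating word
        total += valence
    return total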
The evolution of the performance (Figure 4) suggests that the addition of
words to my list might still improve its performance slightly.
Although my list comes out slightly ahead of ANEW in Twitter sentiment analysis, ANEW is still preferable for scientific psycholinguistic studies, as its scoring has been validated across several persons. Also note that ANEW's standard deviations were not used in the scoring here; using them might have improved its performance.
Acknowledgment I am grateful to Alan Mislove and Sune Lehmann for providing the 1,000 tweets with the Amazon Mechanical Turk labels and to Steven
J. DeRose and Greg Siegle for providing their word lists. Mislove, Lehmann and
Daniela Balslev also provided input to the article. I thank the Danish Strategic Research Councils for generous support to the ‘Responsible Business in the
Blogosphere’ project.
References
1. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Foundations and Trends
in Information Retrieval 2(1-2) (2008) 1–135
2. Thelwall, M., Buckley, K., Paltoglou, G., Cai, D., Kappas, A.: Sentiment strength
detection in short informal text. Journal of the American Society for Information
Science and Technology 61(12) (2010) 2544–2558
3. Bradley, M.M., Lang, P.J.: Affective norms for English words (ANEW): Instruction
manual and affective ratings. Technical Report C-1, The Center for Research in
Psychophysiology, University of Florida (1999)
4. Wilson, T., Wiebe, J., Hoffmann, P.: Recognizing contextual polarity in phrase-level sentiment analysis. In: Proceedings of the conference on Human Language
Technology and Empirical Methods in Natural Language Processing, Stroudsburg,
PA, USA, Association for Computational Linguistics (2005)
5. Hansen, L.K., Arvidsson, A., Nielsen, F.Å., Colleoni, E., Etter, M.: Good friends,
bad news — affect and virality in Twitter. Accepted for The 2011 International
Workshop on Social Computing, Network, and Services (SocialComNet 2011)
(2011)
6. Akkaya, C., Conrad, A., Wiebe, J., Mihalcea, R.: Amazon Mechanical Turk for
subjectivity word sense disambiguation. In: Proceedings of the NAACL HLT 2010
Workshop on Creating Speech and Language Data with Amazon's Mechanical
Turk, Association for Computational Linguistics (2010) 195–203
7. Baudhuin, E.S.: Obscene language and evaluative response: an empirical study.
Psychological Reports 32 (1973)
8. Sapolsky, B.S., Shafer, D.M., Kaye, B.K.: Rating offensive words in three television
program contexts. BEA 2008, Research Division (2008)
9. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python. O’Reilly,
Sebastopol, California (June 2009)
10. Biever, C.: Twitter mood maps reveal emotional states of America. The New
Scientist 207(2771) (July 2010) 14
Supplementary material
A Extra results
Table 3. Pearson correlation matrix. The first data row is the correlation for the Amazon Mechanical Turk (AMT) labeling. AFINN is my list. Mislove: sentiment score computed in the "Pulse of a Nation" study, AFINN-96: the earlier version of my word list, AFINN: sum of word valences with weighting over number of words in text, AFINN(q): (+1, 0, −1)-quantization, AFINN(s): sum of word valences, AFINN(a): average of non-zero word valences, ANEW(r): ANEW with "raw" (direct match) word matching and sum of word valences weighted by number of words, ANEW(l): ANEW with NLTK WordNet lemmatizer word match, ANEW(rs): raw match with sum of word valences, ANEW(a): average of non-zero word valences.

          Mislove AFINN-96 AFINN  AFINN(q) AFINN(s) AFINN(a) ANEW(r) ANEW(l) ANEW(rs) ANEW(a)
AMT       0.506   0.546    0.564  0.548    0.581    0.564    0.526   0.527   0.526    0.523
Mislove           0.564    0.578  0.588    0.583    0.616    0.780   0.760   0.855    0.950
AFINN-96                   0.970  0.734    0.846    0.802    0.660   0.643   0.580    0.582
AFINN                             0.754    0.870    0.824    0.696   0.682   0.599    0.598
AFINN(q)                                   0.836    0.889    0.531   0.528   0.592    0.606
AFINN(s)                                            0.884    0.547   0.536   0.628    0.594
AFINN(a)                                                     0.572   0.563   0.606    0.620
ANEW(r)                                                              0.973   0.853    0.819
ANEW(l)                                                                      0.829    0.796
ANEW(rs)                                                                              0.893
Table 4. Spearman rank correlation matrix. For an explanation of the columns see Table 3.

          Mislove AFINN-96 AFINN  AFINN(q) AFINN(s) AFINN(a) ANEW(r) ANEW(l) ANEW(rs) ANEW(a)
AMT       0.507   0.592    0.596  0.580    0.591    0.581    0.545   0.548   0.537    0.526
Mislove           0.630    0.635  0.615    0.625    0.635    0.884   0.857   0.895    0.950
AFINN-96                   0.970  0.912    0.941    0.925    0.656   0.646   0.650    0.647
AFINN                             0.941    0.969    0.954    0.668   0.657   0.661    0.656
AFINN(q)                                   0.944    0.933    0.622   0.620   0.644    0.634
AFINN(s)                                            0.959    0.630   0.622   0.662    0.644
AFINN(a)                                                     0.641   0.635   0.651    0.644
ANEW(r)                                                              0.972   0.946    0.936
ANEW(l)                                                                      0.918    0.905
ANEW(rs)                                                                              0.943
Listings
#!/usr/bin/env python
#
# This program goes through the word list and captures double entries
# and ordering problems.
#
# $Id: Nielsen2011New_check.py,v 1.4 2011/03/16 11:34:24 fn Exp $

filebase = '/home/fnielsen/'
filename_afinn = filebase + 'fnielsen/data/Nielsen2009Responsible_emotion.csv'

# Report lines that do not consist of exactly a word and a valence
lines = [line.split('\t') for line in open(filename_afinn)]
for line in lines:
    if len(line) != 2:
        print(line)

afinn = map(lambda (k, v): (k, int(v)),
            [line.split('\t') for line in open(filename_afinn)])
words = [word for word, valence in afinn]
swords = sorted(list(set(words)))

# Report the first word that is duplicated or out of alphabetical order
for n in range(len(words)):
    if words[n] != swords[n]:
        print(words[n] + " " + swords[n])
        break


#!/usr/bin/env python
#
# Construct histogram of AFINN valences.
#
# $Id: Nielsen2011New_hist.py,v 1.2 2011/03/14 20:53:17 fn Exp $

import numpy as np
import pylab
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

filebase = '/home/fnielsen/'
filename = filebase + 'fnielsen/data/Nielsen2009Responsible_emotion.csv'

# Read the word list as a word -> valence dictionary
afinn = dict(map(lambda (k, v): (k, int(v)),
                 [line.split('\t') for line in open(filename)]))

# Histogram of the valences on the -5 to +5 scale
pylab.hist(afinn.values(), bins=11, range=(-5.5, 5.5), facecolor=(0, 1, 0))
pylab.xlabel('My valences')
pylab.ylabel('Absolute frequency')
pylab.title('Histogram of valences for my word list')
pylab.show()
# pylab.savefig(filebase + 'fnielsen/eps/' + 'Nielsen2011New_hist.eps')
#!/usr/bin/env python
#
# $Id: Nielsen2011New_example.py,v 1.2 2011/04/11 17:31:22 fn Exp $

import csv
import re
import numpy as np
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import sent_tokenize
import pylab
from scipy.stats.stats import kendalltau, spearmanr
import simplejson

# This variable determines the data set
mislove = 2

# Filenames
filebase = '/home/fn/'
filename_afinn = filebase + 'fnielsen/data/Nielsen2009Responsible_emotion.csv'
filename_afinn96 = 'AFINN-96.txt'
filename_mislove1 = "results.txt"
filename_mislove2a = "turk.txt"
filename_mislove2b = "turk-results.txt"
filename_anew = filebase + 'data/ANEW.TXT'
filename_gi = filebase + "data/inqtabs.txt"
filename_of = filebase + "data/subjclueslen1-HLTEMNLP05.tff"

# Word splitter pattern
pattern_split = re.compile(r"\W+")

# AFINN word lists: word -> integer valence
afinn = dict(map(lambda (k, v): (k, int(v)),
                 [line.split('\t') for line in open(filename_afinn)]))
afinn96 = dict(map(lambda (k, v): (k, int(v)),
                   [line.split('\t') for line in open(filename_afinn96)]))

# ANEW: word -> mean valence with 5 subtracted to center the scale
anew = dict(map(lambda l: (l[0], float(l[2]) - 5),
                [re.split('\s+', line)
                 for line in open(filename_anew).readlines()[41:1075]]))

# OpinionFinder: polarity converted to +1/-1
of = {}
for line in open(filename_of).readlines():
    elements = re.split('(\s|\=)', line)
    if elements[22] == 'positive':
        of[elements[10]] = +1
    elif elements[22] == 'negative':
        of[elements[10]] = -1

# General Inquirer: 'Positiv'/'Negativ' converted to +1/-1; words with
# multiple senses (marked with '#') are only kept if the senses agree
csv_reader = csv.reader(open(filename_gi), delimiter='\t')
header = []
gi = {}
previousword = []
previousvalence = []
for row in csv_reader:
    if not header:
        header = row
    elif len(row) > 2:
        word = re.search("\w+", row[0].lower()).group()
        if row[2] == "Positiv":
            valence = 1
        elif row[3] == "Negativ":
            valence = -1
        else:
            valence = 0
        if re.search("#", row[0].lower()):
            if previousword == word:
                if previousvalence == []:
                    previousvalence = valence
                elif previousvalence != valence:
                    previousvalence = 0
            else:
                if previousvalence:
                    gi[previousword] = previousvalence
                previousword = word
                previousvalence = []
        elif valence:
            gi[word] = valence

# Lemmatizer for WordNet
lemmatizer = WordNetLemmatizer()

def words2anewsentiment(words):
    """
    Convert words to sentiment based on ANEW via WordNet Lemmatizer
    """
    sentiment = 0
    for word in words:
        if word in anew:
            sentiment += anew[word]
            continue
        lword = lemmatizer.lemmatize(word)
        if lword in anew:
            sentiment += anew[lword]
            continue
        lword = lemmatizer.lemmatize(word, pos='v')
        if lword in anew:
            sentiment += anew[lword]
    return sentiment / len(words)

def extremevalence(valences):
    """
    Return the most extreme valence. If extremes have different sign then
    zero is returned
    """
    imax = np.argsort(np.abs(valences))[-1]
    extremes = filter(lambda v: abs(v) == abs(valences[imax]), valences)
    extremes_samesign = filter(lambda v: v == valences[imax], valences)
    if extremes == extremes_samesign:
        return valences[imax]
    else:
        return 0

def corrmatrix2latex(C, columns=None):
    """
    s = corrmatrix2latex(C)
    print(s)
    """
    s = '\n\\begin{tabular}{' + ('r' * (C.shape[0] - 1)) + '}\n'
    if columns:
        s += " & ".join(columns[1:]) + " \\\\\n\\hline\n"
    for n in range(C.shape[0]):
        row = []
        for m in range(1, C.shape[1]):
            if m > n:
                row.append(" %.3f " % C[n, m])
            else:
                row.append("      ")
        s += " & ".join(row) + ' \\\\\n'
    s += '\\end{tabular}\n'
    return s

def spearmanmatrix(data):
    """
    Spearman r rank correlation matrix
    """
    C = np.zeros((data.shape[1], data.shape[1]))
    for n in range(data.shape[1]):
        for m in range(data.shape[1]):
            C[n, m] = spearmanr(data[:, n], data[:, m])[0]
    return C

def kendallmatrix(data):
    """
    Kendall tau rank correlation matrix
    """
    C = np.zeros((data.shape[1], data.shape[1]))
    for n in range(data.shape[1]):
        for m in range(data.shape[1]):
            C[n, m] = kendalltau(data[:, n], data[:, m])[0]
    return C

# Read Mislove CSV Twitter data: 'tweets' an array of dictionaries
if mislove == 1:
    csv_reader = csv.reader(open(filename_mislove1), delimiter='\t')
    header = []
    tweets = []
    for row in csv_reader:
        if not header:
            header = row
        else:
            tweets.append({'id': row[0],
                           'quant': int(row[1]),
                           'score_our': float(row[2]),
                           'score_mean': float(row[3]),
                           'score_std': float(row[4]),
                           'text': row[5],
                           'scores': map(int, row[6:])})
elif mislove == 2:
    if False:
        tweets = simplejson.load(open('tweets_with_sentistrength.json'))
    else:
        csv_reader = csv.reader(open(filename_mislove2a), delimiter='\t')
        tweets = []
        for row in csv_reader:
            tweets.append({'id': row[2],
                           'score_mislove2': float(row[1]),
                           'text': row[3]})
        csv_reader = csv.reader(open(filename_mislove2b), delimiter='\t')
        tweets_dict = {}
        header = []
        for row in csv_reader:
            if not header:
                header = row
            else:
                tweets_dict[row[0]] = {
                    'id': row[0],
                    'score_mislove': float(row[1]),
                    'score_amt_wrong': float(row[2]),
                    'score_amt': np.mean(map(int, re.split("\s+", " ".join(row[4:]))))}
        for n in range(len(tweets)):
            tweets[n]['score_mislove'] = tweets_dict[tweets[n]['id']]['score_mislove']
            tweets[n]['score_amt_wrong'] = tweets_dict[tweets[n]['id']]['score_amt_wrong']
            tweets[n]['score_amt'] = tweets_dict[tweets[n]['id']]['score_amt']

# Add sentiments to 'tweets'
for n in range(len(tweets)):
    words = pattern_split.split(tweets[n]['text'].lower())
    tweets[n]['words'] = words
    afinn_sentiments = map(lambda word: afinn.get(word, 0), words)
    afinn_sentiment = float(sum(afinn_sentiments)) / len(afinn_sentiments)
    afinn96_sentiments = map(lambda word: afinn96.get(word, 0), words)
    afinn96_sentiment = float(sum(afinn96_sentiments)) / len(afinn96_sentiments)
    anew_sentiments = map(lambda word: anew.get(word, 0), words)
    anew_sentiment = float(sum(anew_sentiments)) / len(anew_sentiments)
    gi_sentiments = map(lambda word: gi.get(word, 0), words)
    gi_sentiment = float(sum(gi_sentiments)) / len(gi_sentiments)
    of_sentiments = map(lambda word: of.get(word, 0), words)
    of_sentiment = float(sum(of_sentiments)) / len(of_sentiments)
    tweets[n]['sentiment_afinn96'] = afinn96_sentiment
    tweets[n]['sentiment_afinn'] = afinn_sentiment
    tweets[n]['sentiment_afinn_quant'] = np.sign(afinn_sentiment)
    tweets[n]['sentiment_afinn_sum'] = sum(afinn_sentiments)
    nonzeros = len(filter(lambda nonzero: nonzero, afinn_sentiments))
    if not nonzeros: nonzeros = 1
    tweets[n]['sentiment_afinn_nonzero'] = sum(afinn_sentiments) / nonzeros
    tweets[n]['sentiment_afinn_extreme'] = extremevalence(afinn_sentiments)
    tweets[n]['sentiment_anew_raw'] = anew_sentiment
    tweets[n]['sentiment_anew_lemmatize'] = words2anewsentiment(words)
    tweets[n]['sentiment_anew_raw_sum'] = sum(anew_sentiments)
    nonzeros = len(filter(lambda nonzero: nonzero, anew_sentiments))
    if not nonzeros: nonzeros = 1
    tweets[n]['sentiment_anew_raw_nonzeros'] = sum(anew_sentiments) / nonzeros
    tweets[n]['sentiment_gi_raw'] = gi_sentiment
    tweets[n]['sentiment_of_raw'] = of_sentiment

# Index for example tweet (Table 1 in the article)
index = 10
words = tweets[index]['words'][:-1]
s = ""
s += "Text: & \\multicolumn{%d}{c}{" % len(words) + tweets[index]['text'] + "} \\\\[1pt] \\hline\n"
s += "Words: & " + " & ".join(words) + " \\\\[1pt]\n"
s += "My & " + " & ".join([str(afinn.get(w, 0)) for w in words]) + \
     " & " + str(sum([afinn.get(w, 0) for w in words])) + " \\\\[1pt]\n"
s += "ANEW & " + " & ".join([str(anew.get(w, 0)) for w in words]) + \
     " & " + str(sum([anew.get(w, 0) for w in words])) + " \\\\[1pt]\n"
s += "GI & " + " & ".join([str(gi.get(w, 0)) for w in words]) + \
     " & " + str(sum([gi.get(w, 0) for w in words])) + " \\\\[1pt]\n"
s += "OF & " + " & ".join([str(of.get(w, 0)) for w in words]) + \
     " & " + str(sum([of.get(w, 0) for w in words])) + " \\\\[1pt]\n"
s += "SS " + " & " * (len(words) + 1) + " -1 " + " \\\\[1pt] \hline\n"
# 'Ear infection making it impossible 2 sleep. headed 2 the doctors 2
# get new prescription. so fucking' has positive strength 1 and negative
# strength -2
print(s)

# Confusion matrix between quantized AMT and AFINN scores
score_amt = np.asarray([t['score_amt'] for t in tweets])
score_afinn = np.asarray([t['sentiment_afinn'] for t in tweets])
t = """
                     AMT
      positive neutral negative
AFINN positive %d %d %d
      neutral  %d %d %d
      negative %d %d %d
""" % (sum((1 * (score_amt > 5)) * (1 * (score_afinn > 0))),
       sum((1 * (score_amt == 5)) * (1 * (score_afinn > 0))),
       sum((1 * (score_amt < 5)) * (1 * (score_afinn > 0))),
       sum((1 * (score_amt > 5)) * (1 * (score_afinn == 0))),
       sum((1 * (score_amt == 5)) * (1 * (score_afinn == 0))),
       sum((1 * (score_amt < 5)) * (1 * (score_afinn == 0))),
       sum((1 * (score_amt > 5)) * (1 * (score_afinn < 0))),
       sum((1 * (score_amt == 5)) * (1 * (score_afinn < 0))),
       sum((1 * (score_amt < 5)) * (1 * (score_afinn < 0))))
print(t)
# 0.1*(277+299+5)
#!/usr/bin/env python
#
# The program compares AFINN and ANEW word lists.
#
# (Copied from Hansen2010Diffusion and extended.)
#
# $Id: Hansen2010Diffusion_anew.py,v 1.3 2010/12/15 15:50:39 fn Exp $

from nltk.stem.wordnet import WordNetLemmatizer
import nltk
import numpy as np
import pylab
import re
from scipy.stats.stats import kendalltau, spearmanr
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

filebase = '/home/fn/'
filename = filebase + 'fnielsen/data/Nielsen2009Responsible_emotion.csv'
afinn = dict(map(lambda (k, v): (k, int(v)),
                 [line.split('\t') for line in open(filename)]))

# ANEW valences kept on the original 1-9 scale here
filename = filebase + 'data/ANEW.TXT'
anew = dict(map(lambda l: (l[0], float(l[2])),
                [re.split('\s+', line)
                 for line in open(filename).readlines()[41:1075]]))

lemmatizer = WordNetLemmatizer()
stemmer = nltk.PorterStemmer()
anew_stem = dict([(stemmer.stem(word), valence)
                  for word, valence in anew.items()])

def word2anewsentiment_raw(word):
    return anew.get(word, None)

def word2anewsentiment_wordnet(word):
    sentiment = None
    if word in anew:
        sentiment = anew[word]
    else:
        lword = lemmatizer.lemmatize(word)
        if lword in anew:
            sentiment = anew[lword]
        else:
            lword = lemmatizer.lemmatize(word, pos='v')
            if lword in anew:
                sentiment = anew[lword]
    return sentiment

def word2anewsentiment_stem(word):
    return anew_stem.get(stemmer.stem(word), None)

# Match with direct word lookup
sentiments_raw = []
for word in afinn.keys():
    sentiment_anew = word2anewsentiment_raw(word)
    if sentiment_anew:
        sentiments_raw.append((afinn[word], sentiment_anew))

# Match via WordNet lemmatization; print sign discrepancies
sentiments_wordnet = []
for word in afinn.keys():
    sentiment_anew = word2anewsentiment_wordnet(word)
    if sentiment_anew:
        sentiments_wordnet.append((afinn[word], sentiment_anew))
        if (afinn[word] > 0 and sentiment_anew < 5) or \
           (afinn[word] < 0 and sentiment_anew > 5):
            print(word)

# Match via Porter stemming; print sign discrepancies
sentiments_stem = []
for word in afinn.keys():
    sentiment_stem_anew = word2anewsentiment_stem(word)
    if sentiment_stem_anew:
        sentiments_stem.append((afinn[word], sentiment_stem_anew))
        if (afinn[word] > 0 and sentiment_stem_anew < 5) or \
           (afinn[word] < 0 and sentiment_stem_anew > 5):
            print(word)

sentiments_raw = np.asarray(sentiments_raw)
pylab.figure(1)
pylab.plot(sentiments_raw[:, 0], sentiments_raw[:, 1], '.')
pylab.xlabel('Our list')
pylab.ylabel('ANEW')
pylab.title('Correlation between sentiment word lists (Direct match)')
pylab.text(1, 3, "Pearson correlation = %.2f" % np.corrcoef(sentiments_raw.T)[1, 0])
pylab.text(1, 2.5, "Spearman correlation = %.2f" % spearmanr(sentiments_raw[:, 0], sentiments_raw[:, 1])[0])
pylab.text(1, 2, "Kendall correlation = %.2f" % kendalltau(sentiments_raw[:, 0], sentiments_raw[:, 1])[0])
# pylab.savefig(filebase + 'fnielsen/eps/' + 'Nielsen2011New_anew_raw.eps')
# pylab.show()

sentiments = np.asarray(sentiments_wordnet)
pylab.figure(2)
pylab.plot(sentiments[:, 0], sentiments[:, 1], '.')
pylab.xlabel('Our list')
pylab.ylabel('ANEW')
pylab.title('Correlation between sentiment word lists (WordNet lemmatizer)')
pylab.text(1, 3, "Pearson correlation = %.2f" % np.corrcoef(sentiments.T)[1, 0])
pylab.text(1, 2.5, "Spearman correlation = %.2f" % spearmanr(sentiments[:, 0], sentiments[:, 1])[0])
pylab.text(1, 2, "Kendall correlation = %.2f" % kendalltau(sentiments[:, 0], sentiments[:, 1])[0])
# pylab.savefig(filebase + 'fnielsen/eps/' + 'Nielsen2011New_anew_wordnet.eps')
# pylab.show()

sentiments_stem = np.asarray(sentiments_stem)
pylab.figure(3)
pylab.plot(sentiments_stem[:, 0], sentiments_stem[:, 1], '.')
pylab.xlabel('My list')
pylab.ylabel('ANEW')
pylab.title('Correlation between sentiment word lists (Porter stemmer)')
pylab.text(1, 3, "Correlation = %.2f" % np.corrcoef(sentiments_stem.T)[1, 0])
pylab.text(1, 2.5, "Spearman correlation = %.2f" % spearmanr(sentiments_stem[:, 0], sentiments_stem[:, 1])[0])
pylab.text(1, 2, "Kendall correlation = %.2f" % kendalltau(sentiments_stem[:, 0], sentiments_stem[:, 1])[0])
# pylab.savefig(filebase + 'fnielsen/eps/' + 'Nielsen2011New_anew_stem.eps')
# pylab.show()
#!/usr/bin/env python
#
# $Id: Nielsen2011New.py,v 1.10 2011/03/16 13:41:36 fn Exp $

import csv
import re
import numpy as np
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import sent_tokenize
import pylab
from scipy.stats.stats import kendalltau, spearmanr
import simplejson

# This variable determines the data set
mislove = 2

# Filenames
filebase = '/home/fnielsen/'
filename_afinn = filebase + 'fnielsen/data/Nielsen2009Responsible_emotion.csv'
filename_afinn96 = 'AFINN-96.txt'
filename_mislove1 = "results.txt"
filename_mislove2a = "turk.txt"
filename_mislove2b = "turk-results.txt"
filename_anew = filebase + 'data/ANEW.TXT'
filename_gi = filebase + "data/inqtabs.txt"
filename_of = filebase + "data/subjclueslen1-HLTEMNLP05.tff"

# Word splitter pattern
pattern_split = re.compile(r"\W+")

# The reading of the word lists (afinn, afinn96, anew, of, gi), the helper
# functions (words2anewsentiment, extremevalence, corrmatrix2latex,
# spearmanmatrix, kendallmatrix), the reading of the Mislove Twitter data
# and the loop adding the sentiments to 'tweets' are identical to
# Nielsen2011New_example.py above, except that for mislove == 2 the tweets
# are here loaded with their SentiStrength scores from the JSON file
# written by the SentiStrength script below:
#
#     tweets = simplejson.load(open('tweets_with_sentistrength.json'))

# Numpy matrix
if mislove == 1:
    columns = ["AMT", 'AFINN', 'AFINN(q)', 'AFINN(s)', 'AFINN(a)',
               'ANEW(r)', 'ANEW(l)', 'ANEW(rs)', 'ANEW(a)']
    sentiments = np.matrix([[t['score_mean'],
                             t['sentiment_afinn'],
                             t['sentiment_afinn_quant'],
                             t['sentiment_afinn_sum'],
                             t['sentiment_afinn_nonzero'],
                             t['sentiment_anew_raw'],
                             t['sentiment_anew_lemmatize'],
                             t['sentiment_anew_raw_sum'],
                             t['sentiment_anew_raw_nonzeros']] for t in tweets])
elif mislove == 2:
    columns = ["AMT",
               'AFINN', 'AFINN(q)', 'AFINN(s)', 'AFINN(a)', 'AFINN(x)',
               'ANEW(r)', 'ANEW(l)', 'ANEW(rs)', 'ANEW(a)',
               "GI", "OF", "SS"]
    sentiments = np.matrix([[t['score_amt'],
                             t['sentiment_afinn'],
                             t['sentiment_afinn_quant'],
                             t['sentiment_afinn_sum'],
                             t['sentiment_afinn_nonzero'],
                             t['sentiment_afinn_extreme'],
                             t['sentiment_anew_raw'],
                             t['sentiment_anew_lemmatize'],
                             t['sentiment_anew_raw_sum'],
                             t['sentiment_anew_raw_nonzeros'],
                             t['sentiment_gi_raw'],
                             t['sentiment_of_raw'],
                             t['sentistrength']] for t in tweets])

# Scatter plot of my list against the Amazon Mechanical Turk labeling
x = np.asarray(sentiments[:, 1]).flatten()
y = np.asarray(sentiments[:, 0]).flatten()
pylab.plot(x, y, '.')
pylab.xlabel('My list')
pylab.ylabel('Amazon Mechanical Turk')
pylab.text(0.1, 2, "Pearson correlation = %.3f" % np.corrcoef(x, y)[1, 0])
pylab.text(0.1, 1.6, "Spearman correlation = %.3f" % spearmanr(x, y)[0])
pylab.text(0.1, 1.2, "Kendall correlation = %.3f" % kendalltau(x, y)[0])
pylab.title('Scatterplot of sentiment strengths for tweets')
pylab.show()
# pylab.savefig(filebase + 'fnielsen/eps/Nielsen2011New_tweet_scatter.eps')

# Ordinary correlation coefficient
C = np.corrcoef(sentiments.transpose())
s1 = corrmatrix2latex(C, columns)
print(s1)
f = open(filebase + '/fnielsen/tex/Nielsen2011New_corrmatrix.tex', 'w')
f.write(s1)
f.close()

# Spearman rank correlation
C2 = spearmanmatrix(sentiments)
s2 = corrmatrix2latex(C2, columns)
print(s2)
f = open(filebase + '/fnielsen/tex/Nielsen2011New_spearmanmatrix.tex', 'w')
f.write(s2)
f.close()

# Kendall rank correlation
C3 = kendallmatrix(sentiments)
s3 = corrmatrix2latex(C3, columns)
print(s3)
f = open(filebase + '/fnielsen/tex/Nielsen2011New_kendallmatrix.tex', 'w')
f.write(s3)
f.close()
#! / u s r / b i n / env python
#
#
Th i s s c r i p t w i l l c a l l t h e S e n t i S t r e n g t h Web s e r v i c e wi t h t h e t e x t from
#
t h e 1000 t w e e t s and w r i t e a JSON f i l e wi t h t w e e t s and t h e S e n t i S t r e n g t h .
#
# $ I d : N i e l s e n 2 0 1 1 N e w s e n t i s t r e n g t h . py , v 1 . 1 2 0 1 1 / 0 3 / 1 3 1 9 : 1 2 : 4 6 f n Exp $
import c s v
import r e
import numpy a s np
import p y l a b
import random
from s c i p y import s p a r s e
from s c i p y . s t a t s . s t a t s import
import s i m p l e j s o n
import s y s
k e n d a l l t a u , sp e a r m a n r
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
reload(sys)
sys.setdefaultencoding('utf-8')

from time import sleep, time
from urllib import FancyURLopener, urlencode, urlretrieve
from xml.sax.saxutils import unescape as unescape_saxutils


class MyOpener(FancyURLopener):
    version = 'RBBBot, Finn Aarup Nielsen (http://www.imm.dtu.dk/~fn/, fn@imm.dtu.dk)'

# This variable determines the data set
mislove = 2

# Filenames
filebase = '/home/fn/'
filename_afinn = filebase + 'fnielsen/data/Nielsen2009Responsible_emotion.csv'
filename_mislove1 = "results.txt"
filename_mislove2a = "turk.txt"
filename_mislove2b = "turk-results.txt"

url_base = "http://sentistrength.wlv.ac.uk/results.php?"

# The results page is assumed to contain fragments such as
# "positive strength <b>3</b>" and "negative strength <b>-2</b>"
pattern_positive = re.compile("positive strength <b>(\d)</b>")
pattern_negative = re.compile("negative strength <b>(\-\d)</b>")

# Word splitter pattern, e.g., splitting "Hello, world!"
# into ['Hello', 'world', '']
pattern_split = re.compile(r"\W+")

myopener = MyOpener()

# Read Mislove CSV Twitter data: 'tweets' an array of dictionaries
if mislove == 1:
    csv_reader = csv.reader(open(filename_mislove1), delimiter='\t')
    header = []
    tweets = []
    for row in csv_reader:
        if not header:
            header = row
        else:
            tweets.append({'id': row[0],
                           'quant': int(row[1]),
                           'score_our': float(row[2]),
                           'score_mean': float(row[3]),
                           'score_std': float(row[4]),
                           'text': row[5],
                           'scores': map(int, row[6:])})
elif mislove == 2:
    csv_reader = csv.reader(open(filename_mislove2a), delimiter='\t')
    tweets = []
    for row in csv_reader:
        tweets.append({'id': row[2],
                       'score_mislove2': float(row[1]),
                       'text': row[3]})
    csv_reader = csv.reader(open(filename_mislove2b), delimiter='\t')
    tweets_dict = {}
    header = []
    for row in csv_reader:
        if not header:
            header = row
        else:
            tweets_dict[row[0]] = {'id': row[0],
                                   'score_mislove': float(row[1]),
                                   'score_amt_wrong': float(row[2]),
                                   'score_amt': np.mean(map(int,
                                       re.split("\s+", " ".join(row[4:]).strip())))
                                   }

for n in range(len(tweets)):
    tweets[n]['score_mislove'] = tweets_dict[tweets[n]['id']]['score_mislove']
    tweets[n]['score_amt_wrong'] = tweets_dict[tweets[n]['id']]['score_amt_wrong']
    tweets[n]['score_amt'] = tweets_dict[tweets[n]['id']]['score_amt']

for n in range(len(tweets)):
    url = url_base + urlencode({'text': tweets[n]['text']})
    try:
        html = myopener.open(url).read()
        positive = int(pattern_positive.findall(html)[0])
        negative = int(pattern_negative.findall(html)[0])
        tweets[n]['sentistrength_positive'] = positive
        tweets[n]['sentistrength_negative'] = negative
        if positive > abs(negative):
            tweets[n]['sentistrength'] = positive
        elif abs(negative) > positive:
            tweets[n]['sentistrength'] = negative
        else:
            tweets[n]['sentistrength'] = 0
    except Exception, e:
        error = str(e)
        tweets[n]['sentistrength_error'] = error
    print(n)

simplejson.dump(tweets, open("tweets_with_sentistrength.json", "w"))
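The JSON file written by the script above can later be read back for analysis. A minimal sketch, assuming the file and key names used above:

import simplejson

# Load the tweets with the SentiStrength scores attached by the script above.
tweets = simplejson.load(open("tweets_with_sentistrength.json"))
scores = [t['sentistrength'] for t in tweets if 'sentistrength' in t]
print("Scored %d of %d tweets" % (len(scores), len(tweets)))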
#!/usr/bin/env python
#
# Generates a plot of the evolution of the performance as
# the word list is extended.
#
# $Id: Nielsen2011New_evolution.py,v 1.2 2011/03/13 23:48:38 fn Exp $

import csv
import re
import numpy as np
import pylab
import random
from scipy import sparse
from scipy.stats.stats import kendalltau, spearmanr

# This variable determines the data set
mislove = 2

# Filenames
filebase = '/home/fnielsen/'
filename_afinn = filebase + 'fnielsen/data/Nielsen2009Responsible_emotion.csv'
filename_mislove1 = "results.txt"
filename_mislove2a = "turk.txt"
filename_mislove2b = "turk-results.txt"

# Word splitter pattern
pattern_split = re.compile(r"\W+")

afinn = dict(map(lambda (k, v): (k, int(v)),
                 [line.split('\t') for line in open(filename_afinn)]))
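# The word list file is assumed to hold one tab-separated word/valence
# pair per line, e.g. (hypothetical entries, not quoted from the file):
#   happy<TAB>3
#   terrible<TAB>-3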
# Read Mislove CSV Twitter data: 'tweets' an array of dictionaries
if mislove == 1:
    csv_reader = csv.reader(open(filename_mislove1), delimiter='\t')
    header = []
    tweets = []
    for row in csv_reader:
        if not header:
            header = row
        else:
            tweets.append({'id': row[0],
                           'quant': int(row[1]),
                           'score_our': float(row[2]),
                           'score_mean': float(row[3]),
                           'score_std': float(row[4]),
                           'text': row[5],
                           'scores': map(int, row[6:])})
elif mislove == 2:
    csv_reader = csv.reader(open(filename_mislove2a), delimiter='\t')
    tweets = []
    for row in csv_reader:
        tweets.append({'id': row[2],
                       'score_mislove2': float(row[1]),
                       'text': row[3]})
    csv_reader = csv.reader(open(filename_mislove2b), delimiter='\t')
    tweets_dict = {}
    header = []
    for row in csv_reader:
        if not header:
            header = row
        else:
            tweets_dict[row[0]] = {'id': row[0],
                                   'score_mislove': float(row[1]),
                                   'score_amt_wrong': float(row[2]),
                                   'score_amt': np.mean(map(int,
                                       re.split("\s+", " ".join(row[4:]).strip())))
                                   }

for n in range(len(tweets)):
    tweets[n]['score_mislove'] = tweets_dict[tweets[n]['id']]['score_mislove']
    tweets[n]['score_amt_wrong'] = tweets_dict[tweets[n]['id']]['score_amt_wrong']
    tweets[n]['score_amt'] = tweets_dict[tweets[n]['id']]['score_amt']
allwords = []
for n in range(len(tweets)):
    words = pattern_split.split(tweets[n]['text'].lower())
    tweets[n]['words'] = words
    allwords.extend(words)
print("All words: %d" % len(allwords))
print("Number of unique words in Twitter corpus: %d" % len(set(allwords)))

terms = afinn.keys()
# terms = list(set(allwords).intersection(afinn.keys()))
print("Number of unique words matched to word list: %d" % len(terms))

term2index = dict(zip(terms, range(len(terms))))
M = sparse.lil_matrix((len(tweets), len(terms)))  # Sparse
M = np.zeros((len(tweets), len(terms)))
for n in range(len(tweets)):
    for word in tweets[n]['words']:
        if term2index.has_key(word):
            M[n, term2index[word]] = afinn[word]
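# M is a tweets-by-terms matrix holding the AFINN valence of each matched
# word, so a tweet's total score with the full word list is its row sum;
# a sketch (not used below, where column subsets are sampled instead):
#   score_afinn_full = M.sum(axis=1)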
score_amt = [t['score_amt'] for t in tweets]

# Fix the resampling seed for reproducibility (not strictly necessary)
random.seed(a=1729)

K = [1, 10, 30, 50, 70, 100, 150, 200, 250, 300, 350, 400, len(terms)]
K = [5, 100, 300, 500, 1000, 1500, 2000, len(terms)]
I = range(len(terms))
resamples = 50
R = np.zeros((resamples, len(K)))
S = np.zeros((resamples, len(K)))
for n in range(len(K)):
    for m in range(resamples):
        J = random.sample(I, K[n])
        score_afinn = M[:, J].sum(axis=1)
        R[m, n] = np.corrcoef(score_amt, score_afinn)[0, 1]
        S[m, n] = spearmanr(score_amt, score_afinn)[0]

# pylab.figure(1)
pylab.subplot(2, 1, 1)
pylab.boxplot(R, positions=K, widths=40)
pylab.ylabel('Pearson correlation')
# pylab.xlabel('Word list size')
pylab.title('Evolution of word list performance')
pylab.axis((0, 2600, -0.05, 0.65))
pylab.show()
# pylab.figure(2)
pylab.subplot(2, 1, 2)
pylab.boxplot(S, positions=K, widths=40)
pylab.ylabel('Spearman rank correlation')
pylab.xlabel('Word list size')
# pylab.title('Evolution of word list performance')
pylab.axis((0, 2600, 0.35, 0.6))
pylab.show()
# pylab.savefig(filebase + 'fnielsen/eps/Nielsen2011New_evolution.eps')
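The boxplots summarize the variability of the resampled correlations graphically; a numeric summary can be obtained directly from the R matrix, e.g. (a sketch, not part of the original script):

for k, r in zip(K, np.median(R, axis=0)):
    # Median Pearson correlation across the 50 resamplings for each size.
    print("Word list size %4d: median Pearson correlation %.3f" % (k, r))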
#!/usr/bin/env python
#
# $Id: Nielsen2011New_anew_afinn.py,v 1.1 2011/03/13 17:18:10 fn Exp $

import csv
import re
import numpy as np
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import sent_tokenize
import pylab
from scipy.stats.stats import kendalltau, spearmanr

# This variable determines the data set
mislove = 2

# Filenames
filebase = '/home/fnielsen/'
filename_afinn = filebase + 'fnielsen/data/Nielsen2009Responsible_emotion.csv'
filename_afinn96 = 'AFINN-96.txt'
filename_mislove1 = "results.txt"
filename_mislove2a = "turk.txt"
filename_mislove2b = "turk-results.txt"
filename_anew = filebase + 'data/ANEW.TXT'
filename_gi = filebase + "data/inqtabs.txt"
filename_of = filebase + "data/subjclueslen1-HLTEMNLP05.tff"

# Word splitter pattern
pattern_split = re.compile(r"\W+")
afinn = dict(map(lambda (k, v): (k, int(v)),
                 [line.split('\t') for line in open(filename_afinn)]))

# ANEW
anew = dict(map(lambda l: (l[0], float(l[2]) - 5),
                [re.split('\s+', line)
                 for line in open(filename_anew).readlines()[41:1075]]))
anew_afinn = set(anew.keys()).intersection(afinn.keys())
afinn_intersect = dict([(k, afinn[k]) for k in anew_afinn])
anew_intersect = dict([(k, anew[k]) for k in anew_afinn])
# Read Mislove CSV Twitter data: 'tweets' an array of dictionaries
if mislove == 1:
    csv_reader = csv.reader(open(filename_mislove1), delimiter='\t')
    header = []
    tweets = []
    for row in csv_reader:
        if not header:
            header = row
        else:
            tweets.append({'id': row[0],
                           'quant': int(row[1]),
                           'score_our': float(row[2]),
                           'score_mean': float(row[3]),
                           'score_std': float(row[4]),
                           'text': row[5],
                           'scores': map(int, row[6:])})
elif mislove == 2:
    csv_reader = csv.reader(open(filename_mislove2a), delimiter='\t')
    tweets = []
    for row in csv_reader:
        tweets.append({'id': row[2],
                       'score_mislove2': float(row[1]),
                       'text': row[3]})
    csv_reader = csv.reader(open(filename_mislove2b), delimiter='\t')
    tweets_dict = {}
    header = []
    for row in csv_reader:
        if not header:
            header = row
        else:
            tweets_dict[row[0]] = {'id': row[0],
                                   'score_mislove': float(row[1]),
                                   'score_amt_wrong': float(row[2]),
                                   'score_amt': np.mean(map(int,
                                       re.split("\s+", " ".join(row[4:]).strip())))
                                   }

for n in range(len(tweets)):
    tweets[n]['score_mislove'] = tweets_dict[tweets[n]['id']]['score_mislove']
    tweets[n]['score_amt_wrong'] = tweets_dict[tweets[n]['id']]['score_amt_wrong']
    tweets[n]['score_amt'] = tweets_dict[tweets[n]['id']]['score_amt']
# Compute sentiment for each tweet
amt = []
afinn_intersect_sentiment = []
anew_intersect_sentiment = []
for n in range(len(tweets)):
    amt.append(tweets[n]['score_amt'])
    words = pattern_split.split(tweets[n]['text'].lower())
    afinn_sentiments = map(lambda word: afinn_intersect.get(word, 0), words)
    afinn_intersect_sentiment.append(float(sum(afinn_sentiments)) / len(afinn_sentiments))
    anew_sentiments = map(lambda word: anew_intersect.get(word, 0), words)
    anew_intersect_sentiment.append(float(sum(anew_sentiments)) / len(anew_sentiments))

# Report Pearson and Spearman correlations between the Amazon Mechanical
# Turk scores and the two word list scorings
print(np.corrcoef(amt, afinn_intersect_sentiment)[0, 1])
print(np.corrcoef(amt, anew_intersect_sentiment)[0, 1])
print(spearmanr(amt, afinn_intersect_sentiment)[0])
print(spearmanr(amt, anew_intersect_sentiment)[0])
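kendalltau is imported but not used in this listing; a natural extension (a sketch, not part of the original script) would report the Kendall rank correlations for the same comparison:

print(kendalltau(amt, afinn_intersect_sentiment)[0])
print(kendalltau(amt, anew_intersect_sentiment)[0])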