
Sentiment lexicons
Chris Potts
Linguist 287 / CS 424P: Extracting Social Meaning and Sentiment, Fall 2010
Sep 21
1 Overview
Goals
• Gain new insights into how sentiment is expressed lexically.
• Begin developing resources that are useful for higher-level classification tasks (phrases, sentences, documents, websites, . . . ).
• Explore different philosophies for how to build large sentiment lexicons:
– what to classify (strings (with parts-of-speech) (in context) . . . )?
– which sentiment categories?
– comparatively small hand-built resources or comparatively large wild ones (or both)?
Plan
§3 Hand annotated/compiled lexicons: Harvard Inquirer, WordNet(s), Micro-WNOp.
§4 A few notes on assessment: accuracy, precision, recall, and friends.
§5 WordNet-based approaches, with and without scores.
§6 Distributional approaches using less-structured corpora.
§7 Summary assessment of theories reviewed.
Code: http://code.google.com/p/linguist287/source/browse/#svn/trunk/lexicons
2 Linguistic and psycholinguistic intuitions
What are we classifying (organizing, characterizing)?
(1) gross
(2) (gross, Adj)
(3) (gross, Noun)
(4) (gross, Verb)
(5) gross out
(6) GROSS!!!
(7) gross {soup/amount}
(8) the soup was gross — 1 star!
(9) the slasher movie was gross — 5 stars!
(10) Susie's work is good. (But not great, superb?)
(11) What a {pleasure/disappointment/day}!
(12) {This/That} Kissinger is really something!
(13) Damn! (impressed by a friend's juggling)
(14) Damn! (you lost at ping-pong)
(15) Damn! (as said by Tony Soprano)
(16) Damn! (as said by Obama)
3 Human annotated/compiled lexicons
3.1 Harvard Inquirer (Stone et al. 1966)
Download: http://www.wjh.harvard.edu/~inquirer/spreadsheet_guide.htm
Documentation: http://www.wjh.harvard.edu/~inquirer/homecat.htm
       Entry         Positiv  Negativ  Hostile  . . . (184 classes)  Othtags   Defined
1      A                                                             DET ART   article: Indefinite singular article–some or any one
2      ABANDON                Negativ                                SUPV
3      ABANDONMENT            Negativ                                Noun
4      ABATE                  Negativ                                SUPV
5      ABATEMENT              Negativ                                Noun
...
35     ABSENT#1                                                      Modif
36     ABSENT#2                                                      SUPV
...
11788  ZONE                                                          Noun

Table 1: Harvard Inquirer fragment. '#n' differentiates senses. 5,394 words have definitions.
Binary category values: 'Yes' = category name; 'No' = blank.
Category   Words
Positiv    1,915
Negativ    2,291
Hostile      833
Strong     1,902
Weak         755
Active     2,045
Passive      911

(a) A few prominent sentiment categories.

Positiv   Negativ
Pstv      Ngtv
Strong    Weak
Active    Passive
Pleasur   Pain
Ovrst     Undrst
Yes       No

(b) Some sentiment oppositions. Pstv and Ngtv are old versions of Positiv and Negativ. Ovrst and Undrst are over- and under-stated.

Categories             Correlation
{RspOth, RspTot}        0.85
{WltOth, WltTot}        0.84
{NUMB, CARD}            0.84
...
{Negativ, Hostile}      0.48
...
{Strong, Power}         0.34
...
{Active, Passive}      -0.13
{Positiv, Ngtv}        -0.15
{Negativ, Pstv}        -0.15
{Positiv, Negativ}     -0.22

(c) Some correlations between categories: highest and lowest with selections in between.

Table 2: A high-level look at some of the Inquirer categories.
3.2 WordNet (Miller 1995; Fellbaum 1998)
Download/documentation: http://wordnet.princeton.edu/
Web interface: http://wordnetweb.princeton.edu/perl/webwn
APIs: http://wordnet.princeton.edu/wordnet/related-projects/
• idle, a
– Synset: idle·a·01
Definition: not in action or at work
Examples : {an idle laborer, idle drifters, the idle rich, an idle mind}
Also sees: {ineffective·a·01, unemployed·a·01}
Similar tos: {unengaged·s·01, . . . , bone-idle·s·01}
...
Lemma: idle·a·01·idle
· Antonyms: {busy·a·01·busy}
· Derivationally related forms: {faineance·n·01·idleness}
· Pertainyms: { }
· ...
– Synset: baseless·s·01
Definition: without a basis in reason or fact
Examples: {baseless gossip, unfounded suspicions, . . . , unwarranted jealousy}
Also sees: { }
Similar tos: {unsupported·a·01}
...
Lemma: baseless·s·01·baseless
· Antonyms: { }
· Derivationally related forms: { }
· Pertainyms: { }
· ...
Lemma: baseless·s·01·groundless
· ...
– ...
Table 3: Example — from strings to lexical structure. Fields marked with the empty set happen to be empty for this example but can be filled.
In WordNet 3.0 (Miller 2009), the string idle with category a has 7 synsets. Collectively, those
synsets have 18 lemmas. Each synset and each lemma can reach other synsets (lemmas) via
relations like those exemplified above . . . (see sec. 5.1).
OpenThesaurus (German): http://www.openthesaurus.de/
More globally: http://www.globalwordnet.org/
(with free downloads for at least Arabic, Danish, French, Hindi, Russian, Tamil)
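The structure in Table 3 is easy to pull out programmatically. A minimal sketch using NLTK's WordNet interface (NLTK is just one convenient API, not the course code):

from nltk.corpus import wordnet as wn  # requires the 'wordnet' data package

for synset in wn.synsets('idle', pos='a'):
    print(synset.name(), '-', synset.definition())
    print('  examples:', synset.examples())
    print('  also sees:', synset.also_sees())
    print('  similar tos:', synset.similar_tos())
    for lemma in synset.lemmas():
        print('  lemma:', lemma.name())
        print('    antonyms:', lemma.antonyms())
        print('    derivationally related:', lemma.derivationally_related_forms())
        print('    pertainyms:', lemma.pertainyms())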
3.3 Micro-WNOp (Cerini et al. 2007)
Documentation and download: http://www.unipv.it/wnop
       Pos     Neg     Synset
1      1       0       true·a·2 real·a·4
2      1       0       illustrious·a·1 famous·a·1 . . .
3      0.5     0       real·a·6 tangible·a·2
4      0.25    0       existent·a·2 real·a·1
5      0.125   0.125   real·a·2
...
110    0       0       demand·v·6

(a) 'Common': Five evaluators working together, 110 synsets.
       Evaluator 1     Evaluator 2     Evaluator 3
       Pos1   Neg1     Pos2   Neg2     Pos3   Neg3    Synset
1      1      0        1      0        1      0       good·a·15 well·a·2
2      1      0        1      0        0.75   0       sweet-smelling·a·1 perfumed·a·2 . . .
3      1      0        1      0        1      0       good·a·23 unspoilt·a·1 unspoiled·a·1
4      0.5    0        0.25   0        0.25   0       hot·a·16
...
496    0.5    0        1      0        0.5    0       heal·v·3 bring_around·v·2 cure·v·1

(b) 'Group 1': Three evaluators working separately, 496 synsets. Complete agreement on 197 (40%). Polarity agreement (sign(Pos1−Neg1) = sign(Pos2−Neg2) = sign(Pos3−Neg3)) on 387 (78%).
       Evaluator 1     Evaluator 2
       Pos1   Neg1     Pos2   Neg2    Synset
1      0      1        0      1       forlorn·a·1 godforsaken·a·2 lorn·a·1 desolate·a·2
2      0      1        0      1       rotten·a·2
3      1      0        1      0       intimate·a·2 cozy·a·2 informal·a·4
4      0      0        0      0       federal·a·1
...
499    0      0        0      0       term·v·1

(c) 'Group 2': Two evaluators working separately, 499 synsets. Complete agreement on 395 (79%). Polarity agreement (sign(Pos1−Neg1) = sign(Pos2−Neg2)) on 471 (90.4%).

Table 4: The three groups of the Micro-WNOp. The strings listed under Synset identify lemmas.
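The agreement figures in tab. 4(b–c) are simple to recompute. A small sketch of the polarity-agreement check, assuming the group files have been parsed into per-evaluator (Pos, Neg) pairs (the data layout here is illustrative, not the distribution's own format):

def sign(x):
    return (x > 0) - (x < 0)

def polarity_agreement(rows):
    """rows: list of synset records, each a list of (pos, neg) pairs, one per evaluator."""
    agree = sum(
        1 for pairs in rows
        if len({sign(pos - neg) for pos, neg in pairs}) == 1
    )
    return agree / len(rows)

# e.g., the Group 2 rows 1-4 from tab. 4(c):
rows = [[(0, 1), (0, 1)], [(0, 1), (0, 1)], [(1, 0), (1, 0)], [(0, 0), (0, 0)]]
print(polarity_agreement(rows))  # -> 1.0 (all four rows agree in polarity)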
           Pos   Neg    Obj   Total
a (adj)     62    45     50     157
n (noun)    30    58    273     361
r (adv)      4     0     24      28
v (verb)    27    36     93     156
Total      123   139    440     702

(a) By synset.

           Pos   Neg    Obj   Total
a          151   135     89     375
n           83   144    663     890
r            9     0     53      62
v           87   112    233     432
Total      330   391  1,038   1,759

(b) By lemma.

           Pos   Neg    Obj   Total
a          105   113     62     280
n           66   107    483     656
r            7     0     37      44
v           68    90    178     336
Total      246   310    760   1,316

(c) By 〈string, tag〉 pair.

(d) [Histograms, by lemma, of the Pos, Neg, and Obj scores.] The distribution of scores, by lemma. Objectivity values are 1 − (Pos + Neg), so a score of 1 there means that both Positive and Negative were 0. Relatively few of the synsets received non-0/1 ratings.

Table 5: Micro-WNOp limited to the 702 synsets on which all annotators agreed exactly on all values (tab. 5(a)). Pos means that the positive score was higher, Neg that the negative score was higher, and Obj that the two scores were the same (even if they were greater than 0).
3.4 POS tags: relating Harvard Inquirer and WordNet/Micro-WNOp

Inquirer Othtags value                            WordNet tag
matches 'noun' (case-insensitive)                 ↦ n
matches 'verb' or 'supv' (case-insensitive)       ↦ v
matches 'modif' (case-insensitive)                ↦ a
is 'LY' or matches ' LY' (case-sensitive)         ↦ r
matches none of the above                         ↦ no change
Table 6: Partial mapping from Inquirer categories to WordNet categories. The mapping is partial
for two reasons. First, Inquirer includes determiners, prepositions, and other parts of speech not
found in WordNet (and hence not represented at all in Micro-WNOp). Second, even setting those
classes aside, Inquirer parts-of-speech are highly subcategorized and don’t necessarily fall into the
four-way WordNet typology.
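A sketch of this mapping as a function (the regular expressions follow Table 6; the function name is mine):

import re

def inquirer_to_wordnet_pos(othtags):
    """Map an Inquirer Othtags string to a WordNet POS tag, per Table 6."""
    if re.search(r'noun', othtags, re.IGNORECASE):
        return 'n'
    if re.search(r'verb|supv', othtags, re.IGNORECASE):
        return 'v'
    if re.search(r'modif', othtags, re.IGNORECASE):
        return 'a'
    if othtags == 'LY' or re.search(r' LY', othtags):  # case-sensitive
        return 'r'
    return None  # no change: determiners, prepositions, etc.

print(inquirer_to_wordnet_pos('SUPV'))     # -> 'v'
print(inquirer_to_wordnet_pos('DET ART'))  # -> None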
3.5 InquirerWN, the WordNet subset of Inquirer
           Positive   Negative   Objective   Total
a (adj)          29         11         101     141
n (noun)        638        727       2,906   4,271
r (adv)          52         31          95     178
v (verb)        366        621       1,167   2,154
Total         1,085      1,390       4,269   6,744
Table 7: InquirerWN, the subset of Inquirer that is also in WordNet, at the level of 〈string, tag〉 pairs. Inquirer makes sense distinctions, but I am unsure how to align them with WordNet lemmas. I am not sure why these numbers are larger than those of Rao and Ravichandran (2009:tab. 1).
3.6 Inquirer and Micro-WNOp

                   Inq      Overlap   Micro
Without POS        8,064    576       650
With common POS    9,508    429       887
With full POS      10,057   429       887
Table 8: 〈string, tag〉 pairs in the gold-standard corpora. The differences from left to middle are
due to the different ways in which tags disambiguate strings in the two corpora. Inquirer includes
many word classes that are not in WordNet, hence not in Micro-WNOp. Some of the non-overlap
on both sides is due to formatting differences that could be corrected.
3.7 Thinking ahead to assessment
• Not just strings: words with part-of-speech categories and definitions.
• Multiple n-valued sentiment categories, for different values of n (mostly 2 or 3).
• Continuous values for constructing scales.
Before looking at specific proposals: how might we use these resources to evaluate proposals?
Reminder: There are 8,640 strings (sense disambiguations removed) in the Inquirer, but there are
13,588,391 unigrams in the Google N-grams vocabulary (Brants and Franz 2006),1 so we need
robust methods for projecting beyond these gold-standard resources.
1. http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
4 A few notes on assessment
                    Predicted
                 Pos    Neg     Obj
Observed  Pos     15     10     100
          Neg     10     15      10
          Obj     10    100   1,000

(a) Accuracy = sum_diagonal / sum_all_cells. Typically appropriate only if the categories are equal in size.

(b) Precision = true_positive / (true_positive + false_positive): the correct guesses penalized by the number of incorrect guesses. You can often get high precision for a category C by rarely guessing C, but this will ruin your recall.

(c) Recall = true_positive / (true_positive + false_negative): the correct guesses penalized by the number of missed items. You can often get high recall for a category C by always guessing C, but this will ruin your precision.

Table 9: Confusion matrices. These values should all be interpreted as bounded by [0, 1]. (If you are tempted to divide by 0, return 0.)
Definition 1 (F1). F1 = 2 · (precision · recall) / (precision + recall)
Note: F1 gives equal weight to precision and recall. However, depending on what you’re studying,
you might value one more than the other.
Definition 2 (Macroaverage). Average precision/recall/F1 over all the classes of interest.
Definition 3 (Microaverage). Sum corresponding cells to create a 2×2 confusion matrix, and
calculate precision = recall = F1 in terms of that new matrix.
      Pos   Neg    Obj
Pos    15    10    100
Neg    10    15     10
Obj    10   100  1,000

⇒ one 2×2 matrix per category:

       Pos  Other          Neg  Other           Obj   Other
Pos     15    110    Neg    15     20    Obj  1,000     110
Other   20  1,125    Other 110  1,125    Other  110      50

⇒ summed:

      Yes     No
Yes  1,030    240
No     240  2,300
Note: Before using these summary measures, look at the distribution of your by-category numbers
to make sure that you’re not losing important structure.
For more: Manning and Schütze 1999:§8.1; Manning et al. 2009:§8.3, 13.6.
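As a concrete illustration of defs. 1–3, here is a small sketch that computes per-category precision/recall/F1 plus the macro- and microaverages for the Table 9 matrix (plain Python; the counts are the ones above):

labels = ["Pos", "Neg", "Obj"]
# Rows = observed, columns = predicted, as in Table 9.
M = [[15, 10, 100],
     [10, 15, 10],
     [10, 100, 1000]]
n = len(labels)

def safe_div(a, b):
    return a / b if b else 0.0  # "if you are tempted to divide by 0, return 0"

def prf(i):
    tp = M[i][i]
    fp = sum(M[r][i] for r in range(n)) - tp  # predicted i, observed something else
    fn = sum(M[i][c] for c in range(n)) - tp  # observed i, predicted something else
    p, r = safe_div(tp, tp + fp), safe_div(tp, tp + fn)
    return p, r, safe_div(2 * p * r, p + r)

for i, lab in enumerate(labels):
    print(lab, prf(i))

macro_f1 = sum(prf(i)[2] for i in range(n)) / n
# Microaverage: sum the per-category 2x2 matrices; with one label per item this
# collapses to sum_diagonal / sum_all_cells, so precision = recall = F1.
micro = safe_div(sum(M[i][i] for i in range(n)), sum(sum(row) for row in M))
print(macro_f1, micro)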
5 WordNet-based approaches
5.1 Simple sense/sentiment propagation
Hypothesis: Sentiment is constant throughout regions of lexically related items. Thus, sentiment
properties of hand-built seed-sets will be preserved as we follow WordNet relations out from them.
5.1.1 Algorithm (Valitutti et al. 2004; Esuli and Sebastiani 2006; sec. 5.3 for precedents)
WORDNETSENSEPROPAGATE(S, iter)
  Input
    S: a list of synset seed-sets. For example: {brilliant·s·01, win·n·01}, {sadly·r·01, gross·a·01}
    iter: the number of iterations
  Output
    T: a LENGTH(S) × (1+iter) synset matrix; row i is S[i], its 1st propagation, ..., its iter-th propagation
1  initialize T: a LENGTH(S) × (1+iter) matrix such that T[i][1] = S[i] for 1 ≤ i ≤ LENGTH(S)
2  for i ← 1 to iter
3    for j ← 1 to LENGTH(S)
4      newSame ← SAMEPOLARITY(T[j][i])
5      others ← ⋃ T[k][i] for 1 ≤ k ≤ LENGTH(S), k ≠ j     ▷ The other seed-sets in this column.
6      newDiff ← OTHERPOLARITY(others)
7      T[j][i+1] ← (newSame ∪ newDiff)                     ▷ For the experiments, I first calculate all the propagation sets and then eliminate their pairwise intersection from each, to ensure no overlap.
8  return T

SAMEPOLARITY(synsets)
1  newsynsets ← { }
2  for s ∈ synsets
3    newsynsets ← newsynsets ∪ {s} ∪ AlsoSees(s) ∪ SimilarTos(s)   ▷ Synset-level relations.
4    for lemma ∈ Lemmas(s)                                          ▷ Lemma-level relations.
5      for altLemma ∈ (DerivationallyRelatedForms(lemma) ∪ Pertainyms(lemma))
6        newsynsets ← newsynsets ∪ {Synset(altLemma)}
7  return newsynsets

OTHERPOLARITY(synsets)
1  newsynsets ← { }
2  for s ∈ synsets
3    for lemma ∈ Lemmas(s)                                          ▷ Lemma-level relations.
4      for altLemma ∈ Antonyms(lemma)
5        newsynsets ← newsynsets ∪ {Synset(altLemma)}
6  return newsynsets
Free variables The seed-sets, the WordNet relations called in SAMEPOLARITY and
OTHERPOLARITY, the number of iterations, the decision to remove overlap.
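A minimal NLTK-based sketch of SAMEPOLARITY and OTHERPOLARITY (again NLTK rather than the course code; the free variables above correspond to the relation lists used here):

from nltk.corpus import wordnet as wn

def same_polarity(synsets):
    new = set()
    for s in synsets:
        # Synset-level relations.
        new.update([s], s.also_sees(), s.similar_tos())
        # Lemma-level relations.
        for lemma in s.lemmas():
            for alt in lemma.derivationally_related_forms() + lemma.pertainyms():
                new.add(alt.synset())
    return new

def other_polarity(synsets):
    new = set()
    for s in synsets:
        for lemma in s.lemmas():
            for alt in lemma.antonyms():
                new.add(alt.synset())
    return new

seeds = [{wn.synset('good.a.01')}, {wn.synset('bad.a.01')}]
print(len(same_polarity(seeds[0])), len(other_polarity(seeds[0])))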
5.1.2 Experiment
Formulate seed-sets by hand, using intuitions. Use the algorithm to derive positive, negative, and
objective sets of 〈string, pos〉 pairs.
positive    excellent, good, nice, positive, fortunate, correct, superior
negative    nasty, bad, poor, negative, unfortunate, wrong, inferior
objective   administrative, financial, geographic, constitute, analogy, ponder, material, public, department, measurement, visual
Table 10: Seed-sets.
[Figure 1 plots, for each of the three categories (Positive, Negative, Objective), the word count against the iteration number (0–20).]

Figure 1: Growth in the size of the sets as the number of iterations grows.

[Figure 2 plots, for Inquirer and for Micro-WNOp, the percentage of items in the gold standard unreached by propagation against the iteration number (0–20), for each of the three categories.]

Figure 2: Percentage of items from the gold standard that are unreached by propagation.
5.1.3 Assessment
Definition 4 (Scoring the propagation algorithm).

Derived by propagation   Matching Inquirer category           Micro-WNOp
positive                 Positiv                              Positive > Negative
negative                 Negativ                              Positive < Negative
objective                InquirerWN − (Positiv ∪ Negativ)     Positive = Negative
Options:
i. Score only 〈string, pos〉 pairs that are in the intersection of InquirerWN (Micro-WNOp) with
the union of the three derived sets.
Drawbacks:
ii. Treat all items not reached by the propagation algorithm as members of objective.
Drawbacks:
iii. Create a category ‘missing’ to penalize effectiveness measures for the other categories.
Drawbacks:
[Figure 3 shows F1 by iteration (1–19) for Positive, Negative, and Objective, assessed against Inquirer (left panels) and Micro-WNOp (right panels), one row of panels per scoring option.]

(a) F1 using def. 4, option (i): "Score only 〈string, pos〉 pairs that are in the intersection of InquirerWN (Micro-WNOp) with the union of the three derived sets."

(b) F1 using def. 4, option (ii): "Treat all items not reached by the propagation algorithm as members of objective."

(c) F1 using def. 4, option (iii): "Create a category 'missing' to penalize effectiveness measures for the other categories."

Figure 3: Using F1 to assess the propagation algorithm against Inquirer and Micro-WNOp. The three scoring options emphasize different aspects of the algorithm.
5.2 Propagation with continuous values
Hypothesis: As in sec. 5.1, but with the additional claim that a word’s strength for property X is
connected to (and hence can be estimated from) the number of X -words it is lexically related to.
5.2.1 Algorithm (Blair-Goldensohn et al. 2008, henceforth G08)
Seed-sets: P (pos.), N (neg.), M (obj.)

Score vector s0:
  s0_i = +1 if w_i ∈ P
         −1 if w_i ∈ N
          0 otherwise

Matrix A:
  a_{i,j} = 1+λ  if i = j
            +λ   if w_i ∈ syn(w_j) and w_i ∉ M
            −λ   if w_i ∈ ant(w_j) and w_i ∉ M
             0   otherwise

Iteration: repeated A ∗ s_i: A ∗ s0 = s1; A ∗ s1 = s2; . . .

Output: the sentiment scores from the final vector and, for each item, change its final sign to its initial sign if the two differ.

Small example

λ = 0.2
syn(〈superb, a〉) = {〈great, a〉}
syn(〈great, a〉) = {〈superb, a〉, 〈good, a〉}
syn(〈good, a〉) = { }

Seed-sets: P = {〈superb, a〉}, N = { }, M = { }

Score vector s0: 〈good, a〉 0.0, 〈great, a〉 0.0, 〈superb, a〉 1.0

Matrix A:
               〈good, a〉  〈great, a〉  〈superb, a〉
〈good, a〉       1.2         0.2          0.0
〈great, a〉      0.2         1.2          0.2
〈superb, a〉     0.0         0.0          1.2

Iteration:
              s0     s1     s2     s3     s4     s5     s6
〈good, a〉     0.00   0.00   0.04   0.14   0.35   0.70   1.28
〈great, a〉    0.00   0.20   0.48   0.87   1.42   2.19   3.26
〈superb, a〉   1.00   1.20   1.44   1.73   2.07   2.49   2.99

Rescale? G08 propose to rescale scores:

  s_i ← log(abs(s_i)) · sign(s_i)  if abs(s_i) > 1
        0                          otherwise

However, in the above example, we would lose the sentiment score for good if we stopped before iteration 6. In my experiments, rescaling resulted in dramatically fewer non-0 values.
Free variables The nature of the seed-sets, the number of iterations, the weight λ,
whether to rescale, which if any WordNet relations to use beyond syn and ant.
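A small numpy sketch of the propagation on the three-word example above (my own reconstruction of the setup described by G08, not their code):

import numpy as np

words = ["good", "great", "superb"]
lam = 0.2
A = np.array([[1 + lam, lam,     0.0],
              [lam,     1 + lam, lam],
              [0.0,     0.0,     1 + lam]])  # matrix A from the small example
s0 = np.array([0.0, 0.0, 1.0])               # +1 for seeds in P, -1 for N, 0 otherwise

s = s0.copy()
for _ in range(6):                           # repeated A * s_i
    s = A @ s

# Sign correction: revert an item's sign to its seed sign if the two differ.
flipped = (np.sign(s) != np.sign(s0)) & (s0 != 0)
s[flipped] = np.abs(s[flipped]) * np.sign(s0[flipped])

# Optional rescaling, as G08 propose.
rescaled = np.zeros_like(s)
mask = np.abs(s) > 1
rescaled[mask] = np.log(np.abs(s[mask])) * np.sign(s[mask])
print(dict(zip(words, s.round(2))), dict(zip(words, rescaled.round(2))))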
5.2.2 G08 results
Positive                   Negative
Good a (7.73)              Ugly a (-5.88)
Swell a (5.55)             Dull a (-4.98)
Naughty a (-5.48)          Tasteless a (-4.38)
Intellectual a (5.07)      Displace v (-3.65)
Gorgeous a (3.52)          Beelzebub n (-2.29)
Irreverent a (3.26)        Bland a (-1.95)
Angel n (3.06)             Regrettably r (-1.63)
Luckily r (1.68)           Tardily r (-1.06)

"In our experiments, the original seed-set contained 20 negative and 47 positive words that were selected by hand to maximize domain coverage, as well as 293 neutral words that largely consist of stop words. [. . . ] Running the algorithm resulted in an expanded sentiment lexicon of 5,705 positive and 6,605 negative words, some of which are shown in Table 2 with their final scores. Adjectives form nearly 90 percent of the induced vocabulary, followed by verbs, nouns and finally adverbs."

G08's Table 2 caption: "Example terms from our induced sentiment lexicon, along with their scores and part-of-speech tags (adjective = a, adverb = r, noun = n, verb = v). The range of scores found by our algorithm is [-7.42, 7.73]."

G08 do not stop here. They go on to use their lexicon for classification and summarization. We'll return to those portions of the paper in the next two classes.

[The original page reproduces an excerpt from G08 describing those steps: each sentence gets a raw score and a 'purity' score from the lexicon, and these feed a maximum entropy classifier trained on 3,916 hand-labeled sentences.]
5.2.3 Our experiment
Micro-WNOp provides the set of seed-sets: we use the entire set of positive, negative, and objective
sets as defined in tab. 5(c). The idea is that we will use this algorithm, not to learn new sentiment
lexicons, but rather to learn the strength relationships between words of the same polarity.
                     P        N      M          Total
Seed-set             246      310    760        1,316
After propagation    1,021    978    154,585    156,584

Table 11: Experimental results. The combined seed-sets are the entire Micro-WNOp corpus. The total after propagation is the number of lemmas in WordNet 3.0. The gain in the positive and negative categories is not as large as G08 saw. This is almost certainly due to the fact that the Micro-WNOp corpus does not cover as much of the domain as their seed-sets did.
[Figure 4 shows histograms of the scores: one panel with all non-0 scores and one "Detail" panel zooming in near 0.]

Figure 4: Distribution of non-0 scores in the final lexicon. The vast majority are close to 0. The largest is 185.12 (valour) and the smallest is −61.1 (rophy, 'street names for flunitrazepan' (!?)).
5.2.4 Assessment
61 examples from Micro-WNOp are missing from the resulting lexicon. This is because Micro-WNOp was created with WordNet 2.0 synsets, but the propagation algorithm was run over WordNet 3.0. For details on relating these for the sake of Micro-WNOp, see Baccianella et al. 2010:§4.
If we study the intersection of Micro-WNOp with the score-propagation results, then the propagation algorithm scores perfectly when it comes to simply classifying items according to their basic polarity (positive, negative, objective). This is as expected; the algorithm is designed to preserve the scores given in s0.
The more interesting assessment uses the continuous values:
Definition 5 (Scoring the propagation algorithm). We evaluate agreement on the set of all two-membered sets {〈wA, x〉, 〈wB, x〉}, where
• Micro(〈w, x〉) = the larger of the two scores Positive or Negative for w in Micro-WNOp, picking arbitrarily if they are both the same.
• Goog(〈w, x〉) = the score for 〈w, x〉 from the experiment summarized in sec. 5.2.3.
The gold standard Micro-WNOp and our results agree for all cases on the basic polarity, which means that Micro(〈w, x〉) will never pick the positive score as the higher one when Goog(〈w, x〉) delivers a negative score, nor the reverse. Thus, we can compare absolute values:
The experimental prediction is correct if, for R ∈ {<, >, =},
  Micro(〈wA, x〉) R Micro(〈wB, x〉)  and  abs(Goog(〈wA, x〉)) R abs(Goog(〈wB, x〉));
else the experimental prediction is incorrect.
            stronger   weaker    equal
stronger       7,575    5,532   24,301
weaker         6,345    6,599   22,150
equal         17,516   19,722  215,594

(a) Confusion matrix for the scores experiment.

            Precision   Recall   F1
stronger    0.24        0.20     0.22
weaker      0.21        0.19     0.20
equal       0.82        0.85     0.84

(b) Effectiveness.

just stronger/weaker      54%
stronger/weaker/equal     71%

(c) Accuracy.

Table 12: Assessment of scores
5.3 Similar approaches and ideas
Hu and Liu 2004  One of the first to use this kind of propagation for sentiment. Their algorithm is very similar to the one in sec. 5.1.1, though its inputs and outputs are strings rather than synsets, and it restricts attention to adjectives (positive and negative).

Kim and Hovy 2006  Sense propagation over WordNet from small seed-sets, dealing with overlap via maximum likelihood, which also delivers sentiment scores/rankings for words (strings).

Godbole et al. 2007  WordNet propagation relative to general weblog topic categories, with scores based on distance from seed-set words, taking into account the dangers of mixed paths.

Andreevskaia and Bergler 2006  Propagation using a limited set of basic WordNet relations, but with creative use of the glosses to further expand the lexicon. The word lists of Hatzivassiloglou and McKeown 1997 (see sec. 6.2) are used to formulate the seed-sets, and the Harvard Inquirer is the gold standard.

Rao and Ravichandran 2009  Label propagation using WordNet supplemented with other hand-built resources. Evaluation is partly with Harvard Inquirer. The approach is extended to Hindi and French with strong results.

SentiWordNet (Esuli and Sebastiani 2006; Baccianella et al. 2010)  Built by first propagating out from the positive and negative seed-sets in tab. 10 and then building a classifier from the resulting synsets. The features in the classifier are not the synsets themselves, but rather the synsets of their definitions, as provided by the Princeton Annotated Gloss Corpus:

  very good ; the highest quality ; "made an excellent speech" [. . . ]
  very·3 good·1   high·3 quality·1   make·2 excellent·3 speech·1
  very·4 good·3   high·4 quality·3   made·3 good·4

1. http://sentiwordnet.isti.cnr.it/
2. http://wordnet.princeton.edu/glosstag.shtml

[The original handout reproduces excerpts, tables, and figures from the cited papers alongside these summaries.]
6 Distributional approaches
6.1 Web-based propagation
Hypothesis: The basic hypothesis of sec. 5.2 is correct, and we can move past the confines of
WordNet by relating distributional similarity to sentiment.
6.1.1 Algorithm (Velikovich et al. 2010)
Input:       G = (V, E), w_ij ∈ [0, 1], P, N, γ ∈ R, T ∈ N
Output:      pol ∈ R^|V|
Initialize:  pol_i, pol+_i, pol-_i = 0, for all i
             pol+_i = 1.0 for all v_i ∈ P and pol-_i = 1.0 for all v_i ∈ N
             set α_ii = 1, and α_ij = 0 for all i ≠ j
 1. for v_i ∈ P
 2.   F = {v_i}
 3.   for t : 1 . . . T
 4.     for (v_k, v_j) ∈ E such that v_k ∈ F
 5.       α_ij = max{α_ij, α_ik · w_kj}
 6.       F = F ∪ {v_j}
 7. for v_j ∈ V
 8.   pol+_j = Σ_{v_i ∈ P} α_ij
 9. Repeat steps 1-8 using N to compute pol-
10. β = Σ_i pol+_i / Σ_i pol-_i
11. pol_i = pol+_i − β · pol-_i, for all i
12. if |pol_i| < γ then pol_i = 0.0, for all i

Figure 1: Graph Propagation Algorithm.
6.1.2 Example
• G is a cosine similarity graph
• P and N are seed-sets
• γ is a threshold: sentiment scores below it will be rounded to 0
• T is the number of iterations

Definition 6 (Cosine similarity). Let A and B be vectors of co-occurrence counts for words wA and wB defined over a fixed, ordered vocabulary (rows in tab. 13(b)) of length n. Their cosine similarity:

  cosim(wA, wB) = ( Σ_{i=1}^{n} A_i ∗ B_i ) / ( sqrt(Σ_{i=1}^{n} A_i^2) ∗ sqrt(Σ_{i=1}^{n} B_i^2) )

amazing superb
superb movie
superb movie
superb movie
superb movie
superb movie
amazing movie
amazing movie
cool superb

(a) Corpus.

          amazing   cool   movie   superb
amazing      0        0      2       1
cool         0        0      0       1
movie        2        0      0       5
superb       1        1      5       0

(b) Co-occurrence matrix.

          amazing   cool   movie   superb
amazing     1.0     0.45    0.42    0.86
cool        0.45    1.0     0.93    0.0
movie       0.42    0.93    1.0     0.07
superb      0.86    0.0     0.07    1.0

(c) Cosine similarity matrix.

Table 13: Example corpus and similarity measures. Fig. 5 continues the example.

[The original page also reproduces §2.2–2.3 of Velikovich et al. (2010), on building the phrase graph from the web and on why they use this propagation rather than standard label propagation.]
[Figure 5 panels:]

(a) Cosine similarity graph for the positive sub-graph of a small corpus. 0-valued edges not represented. Edge weights (from tab. 13(c)): superb–amazing 0.86, superb–movie 0.07, amazing–movie 0.42, amazing–cool 0.45, movie–cool 0.93; each node also carries a self-edge of weight 1.0.

(b) Initial matrix α0 with seed-set P = {superb}: α_{s,s} = 1.0 and α_{s,x} = 0 for every other node x.

(c) Matrix α1. At this point F = {superb}: α_{s,a} = α_{s,s} ∗ cos(s, a) = 0.86, α_{s,m} = α_{s,s} ∗ cos(s, m) = 0.07, α_{s,c} = 0.

(d) Matrix α2. F has expanded to all the nodes in fig. 5(a) reachable from superb in one step: α_{s,a} = 0.86, α_{s,m} = max{α1_{s,m}, α1_{s,a} ∗ cos(a, m)} = 0.36, α_{s,c} = max{α1_{s,m} ∗ cos(m, c), α1_{s,a} ∗ cos(a, c)} = 0.36. cool and movie benefit from their closeness to amazing, which is close to the seed-word. Your values might be different if you modify α in place as in line 6 of the algorithm, which involves some αn values in αn calculations.

Figure 5: The inner loop (lines 3-6) of the web propagation algorithm with P = {superb}.
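A minimal numpy sketch of this inner loop on the four-node example, under my reading of the pseudocode (best-path products, updated in place; as the figure caption notes, the exact values depend on the update order):

import numpy as np

nodes = ["amazing", "cool", "movie", "superb"]
# Cosine similarities from tab. 13(c); the diagonal entries are the self-edges.
W = np.array([[1.00, 0.45, 0.42, 0.86],
              [0.45, 1.00, 0.93, 0.00],
              [0.42, 0.93, 1.00, 0.07],
              [0.86, 0.00, 0.07, 1.00]])

def propagate(seed, T=2):
    """Return alpha[j]: the best product-weighted path from the seed in <= T steps."""
    n = len(nodes)
    alpha = np.zeros(n)
    alpha[seed] = 1.0
    frontier = {seed}
    for _ in range(T):
        for k in sorted(frontier):            # lines 4-6 of the algorithm
            for j in range(n):
                if j != k and W[k, j] > 0:
                    alpha[j] = max(alpha[j], alpha[k] * W[k, j])
                    frontier = frontier | {j}
    return alpha

alpha = propagate(nodes.index("superb"), T=2)
print(dict(zip(nodes, alpha.round(2))))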
6.1.3 Velikovich et al.'s (2010) lexicon
“For this study, we used an English graph where the node set V was based on all n-grams up to
length 10 extracted from 4 billion web pages. This list was filtered to 20 million candidate phrases
using a number of heuristics including frequency and mutual information of word boundaries. A
context vector for each candidate phrase was then constructed based on a window of size six aggregated over all mentions of the phrase in the 4 billion documents. The edge set E was constructed
by first, for each potential edge (vi , v j ), computing the cosine similarity value between context vectors. All edges (vi , v j ) were then discarded if they were not one of the 25 highest weighted edges
adjacent to either node vi or v j .”
Phrase length   1        2         3        4       5     6    7    8   9
# of phrases    37,449   108,631   27,822   3,489   598   71   29   4   1
Table 14: Sizes for the lexicons that Velikovich et al. (2010) obtained from their Web graph. This
method can assign scores to quasi-phrase-level things. The gains diminish sharply after 4. If you
want to try your own experiments you might use the 4- or 5-gram data from the Google N-grams
corpus (Brants and Franz 2006) to build your graph.
                All Phrases   Pos. Phrases   Neg. Phrases   Recall wrt other lexicons
                                                            Wilson et al.  WordNet LP  Web GP
Wilson et al.      7,628         2,718          4,910           100%           37%        2%
WordNet LP        12,310         5,705          6,605            21%          100%        3%
Web GP           178,104        90,337         87,767            70%           48%      100%

Table 1: Lexicon statistics. Wilson et al. is the lexicon used in Wilson et al. (2005), WordNet LP is the lexicon constructed by Blair-Goldensohn et al. (2008) that uses label propagation algorithms over a graph constructed through WordNet, and Web GP is the web-derived lexicon from this study.
POSITIVE PHRASES
Typical:               cute, fabulous, cuddly, plucky, ravishing, spunky, enchanting, precious, charming, stupendous
Multiword expressions: once in a life time, state - of - the - art, fail - safe operation, just what the doctor ordered, out of this world, top of the line, melt in your mouth, snug as a bug, out of the box, more good than bad
Spelling variations:   loveable, nicee, niice, cooool, coooool, koool, kewl, cozy, cosy, sikk

NEGATIVE PHRASES
Typical:               dirty, repulsive, crappy, sucky, subpar, horrendous, miserable, lousy, abysmal, wretched
Multiword expressions: run of the mill, out of touch, over the hill, flash in the pan, bumps in the road, foaming at the mouth, dime a dozen, pie - in - the - sky, sick to my stomach, pain in my ass
Vulgarity:             fucking stupid, fucked up, complete bullshit, shitty, half assed, jackass, piece of shit, son of a bitch, sonofabitch, sonuvabitch

Table 3: Example positive and negative phrases from web lexicon.
Table 15: Additional information about the Velikovich et al. (2010) lexicon. 'Wilson et al.' refers to Wilson et al. (2005), which uses a largely hand-annotated corpus that goes beyond unigrams. WordNet LP is the approach of sec. 5.2. The recall numbers provide information about the extent to which the lexicons overlap.
6.2 Coordination and related features
Hypothesis: The morphosyntactic properties of coordination provide reliable information about adjectival oppositions and lexical polarities (Hatzivassiloglou and McKeown 1997, henceforth H&M97).
6.2.1 Data
H&M97 “For our experiments, we use the 21 million word 1987 Wall
Street Journal corpus, automatically annotated with part-of-speech tags
using the PARTS tagger (Church, 1988).”
Here 105,423 coordinated adjectives drawn from user-supplied reviews
at IMDB.com and POS-tagged using the Stanford tagger. Homework 1,
problem 3, links to the data file (in CSV format) and asks you to explore
it yourself looking for patterns.
http://stanford.edu/class/cs424p/assignments/cs424p-assign01.html
Thus, I focus on the basic procedure rather than on feature selection. I
was too lazy to get at the syntactic features H&M97 use, but my data
include new features about the contexts of utterance.
6.2.2 Ground truth
H&M97 The authors classified the adjectives in their corpus appearing
20 or more times, removing those that were problematically ambiguous
and checking subsets with other annotators too. The resulting set contained 1,336 adjectives (657 positive, 679 negative).
Here Coordinations in which both members are classified as Positiv or
Negativ in the Inquirer. 1,342 adjectives (620 positive, 722 negative).
6.2.3 Baseline
H&M97 “since 77.84% of all links from conjunctions indicate same orientation, we can achieve this level of performance by always guessing that
a link is of the same-orientation type.”
             Same polarity    Different polarity
Their data   78%              22%
Our data     76% (72,003)     24% (22,784)

Table 16: Baseline accuracy. Always guessing 'same polarity' results in a recall value of 0 for 'different polarity'.
terr.ible and point.less
love.ly and compassionate
warm and gener.ous
silly and hard
cruel but sympathet.ic
trag.ic and somber
point.less and silly
hand.some and enjoy.able
help.less and ignorant
wrong or weird
dumb and dread.ful
not right and wise
excellent and perfect
bizarre but good
criminal and destruct.ive
dead or alive
health.y and robust
significant and controvers.ial
evil and cynic.al
religi.ous and hero.ic
master.ful and meaning.ful
styl.ish and true
rich and wonder.ful
harm.less and witt.y
strict but realist.ic
decent and not bad
gruff and gentle
cred.ible and worth.y
attract.ive and real
moralist.ic or sympathet.ic
cruel and autocratic
sad and un.necessary
ignorant and not will.ing
obscure or pretenti.ous
mature and precise
in.significant and point.less
realist.ic or blood.y
authent.ic and potent
tire.some or lame
creat.ive and dark
real and majest.ic
dumb and scar.ed
clean and crazy
naive and courage.ous
poor and poor
un.attract.ive but gentle
sad and triumphant
meaning.ful and sens.ible
cynic.al and diabolic.al
casual and ridicul.ous
faith.ful but dead
not true and ill
rampant and hide.ous
malignant and venom.ous
humble and real
erroneous and dead.ly
nice and subtle
vain and ridicul.ous
thought.less and pathet.ic
vivid and creat.ive
un.comfort.able and nerv.ous
effect.ive and grotesque
harsh but comic
grace.ful and talent.ed
6.2.4 The but rule
H&M97 “conjunctions using but exhibit the opposite pattern, usually involving adjectives of different orientations. Thus, a revised but still simple rule predicts a different-orientation link if
the two adjectives have been seen in a but conjunction, and a same-orientation link otherwise,
assuming the two adjectives were seen connected by at least one conjunction.”
Accuracy: Their data 81.81%; Our data 74%.

(a) Accuracy. In addition, I picked randomly balanced subgroups with 20,000 coordination tokens in each category. The baseline is then 50%, but the average accuracy of the but-rule over 10 runs was 67%.

                       Predicted
             Different      Same
Different       12,201    16,170
Same            10,787    66,265

(b) Confusion matrix.

             Precision   Recall   F1
Different    53%         43%      47%
Same         47%         86%      61%

(c) Effectiveness (our data).

Table 17: Assessing the but-rule.
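A sketch of the revised rule as a function over the coordination data (the record layout here is invented for illustration; the homework CSV has its own column names):

def but_rule(pair, seen_with_but):
    """Predict 'different' iff the adjective pair has been seen in a but conjunction."""
    return "different" if frozenset(pair) in seen_with_but else "same"

# seen_with_but would be built in one pass over the coordination tokens:
coordinations = [("terrible", "pointless", "and"), ("bizarre", "good", "but")]
seen_with_but = {frozenset((a, b)) for a, b, coord in coordinations if coord == "but"}

print(but_rule(("good", "bizarre"), seen_with_but))        # -> 'different'
print(but_rule(("terrible", "pointless"), seen_with_but))  # -> 'same'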
6.2.5 Logistic regression
    Coefficient                    Description                                         Estimate   Standard error
0   Intercept                                                                           1.56       0.01
1   Coord=but                      the coordinator is but                              −0.85       0.04
2   but-rule                       as described in sec. 6.2.4                          −1.37       0.02
3   Conj1Negated                   the first conjunct has an external negation          1.04       0.15
4   Conj2Negated                   the second conjunct has an external negation         0.62       0.07
5   StemsMatch                     at least one adj is complex and the stems match     −4.05       0.37
6   Conj1Negated:Conj2Negated      interaction: both conjuncts negated externally      −3.35       0.29

Table 18: Coefficient estimates with standard errors. All of the features are binary: 1 if the description is true, else 0. A negative coefficient indicates that the feature biases in favor of different orientation for the adjectives, whereas a positive coefficient indicates a bias for same orientation. There are clearly other features and interactions that could be tried — data homework 1, problem 3, asks you to start exploring other options.
(a) not out.stand.ing but not aw.ful:

Pr(same) = logit⁻¹( 1.56 + (−0.85 ∗ 1) + (−1.37 ∗ 1) + (1.04 ∗ 1) + (0.62 ∗ 1) + (−4.05 ∗ 0) + (−3.35 ∗ (1 ∗ 1)) ) = 0.09

(b) healthy and robust:

Pr(same) = logit⁻¹( 1.56 + (−0.85 ∗ 0) + (−1.37 ∗ 0) + (1.04 ∗ 0) + (0.62 ∗ 0) + (−4.05 ∗ 0) + (−3.35 ∗ (0 ∗ 0)) ) = 0.82

Table 19: Examples
Definition 7 (Logistic regression predictions). The prediction is correct if the polarity differs and Pr(same) ≤ 0.5, or if the polarity is the same and Pr(same) > 0.5.
                         Predicted
Empirical     Different      Same
Different        3,082     19,702
Same             1,627     70,376

(a) Confusion matrix.

             Precision   Recall   F1
Different    65%         14%      22%
Same         78%         98%      87%

(b) Effectiveness.

Table 20: Interim results: The above model correctly classifies 77% of the samples, using def. 7.
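The predictions in Table 19 are just the inverse logit of a weighted feature sum. A tiny sketch using the Table 18 estimates (the feature names are mine):

import math

coefs = {"Intercept": 1.56, "Coord=but": -0.85, "but-rule": -1.37,
         "Conj1Negated": 1.04, "Conj2Negated": 0.62, "StemsMatch": -4.05,
         "Conj1Negated:Conj2Negated": -3.35}

def pr_same(features):
    """features: dict of binary feature values; the interaction term is added automatically."""
    features = dict(features, Intercept=1)
    features["Conj1Negated:Conj2Negated"] = features.get("Conj1Negated", 0) * features.get("Conj2Negated", 0)
    eta = sum(coefs[name] * val for name, val in features.items())
    return 1.0 / (1.0 + math.exp(-eta))  # inverse logit

# Table 19(a): "not outstanding but not awful"
print(round(pr_same({"Coord=but": 1, "but-rule": 1, "Conj1Negated": 1, "Conj2Negated": 1, "StemsMatch": 0}), 2))  # ~0.09
# Table 19(b): "healthy and robust"
print(round(pr_same({}), 2))  # ~0.83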
6.2.6 Dissimilarity measure

Definition 8 (Dissimilarity). d is a partial function from (words × words) into [0, 1]:

  d(x, y) = 1.0 − the median fitted value of (x, y), if available; else
            1.0 − the median fitted value of (y, x), if available; else
            0.5

Note: The fitted model will deliver multiple values for adjective pairs that appear in different environments. As far as I can tell, H&M97 don't specify how to move from these values to a unique one. I chose the median because it seems to have a good chance of being in the right area in cases where there are multiple values, some small and some large.
Note: This isn’t a distance measure because it fails to satisfy the triangle inequality (the sum of
the lengths of any two sides of a triangle is greater than the length of the third). As a result, it
isn’t appropriate for use in the objective functions for centroid clustering algorithms like k-means
(Hatzivassiloglou and McKeown 1993).
6.2.7 Clustering via the Exchange Method (Späth 1977, 1980)
EXCHANGEMETHOD(words, d, k)
  Input
    words: an ordered list of strings
    d: a partial function from (words × words) into [0, 1], as in def. 8
    k: the number of classes
  Output
    C: a minimal distance k-celled indexed partition of words
1  C ← RANDOMPARTITION(words, k)
2  objVal ← OBJFUNC(C, d)
3  while TRUE
4    C, newVal ← EXCHANGE(C, objVal, words, d)
5    if newVal < objVal
6      then objVal ← newVal
7      else return C                            ▷ Termination: the previous run through words produced no improvements.

EXCHANGE(C, objVal, words, d)
  Notation: If C = {c1 . . . cn}, then C^{w: i↦j} = {c1 . . . (ci − {w}) . . . (cj ∪ {w}) . . . cn}
1  for w ∈ words
2    ci ← CLASSOF(w, C)
3    initialize candidates                      ▷ Dictionary storing viable cell–value pairs.
4    for cj ∈ (C − ci):
5      newVal ← OBJFUNC(C^{w: i↦j}, d)
6      if newVal < objVal
7        then candidates[cj] ← newVal
8    if candidates ≠ { }                        ▷ If any exchanges would be improvements,
9      then cj, objVal ← ARGMINWITHMIN(candidates)   ▷ then pick the biggest improvement.
10          C ← C^{w: i↦j}
11 return C, objVal

OBJFUNC(C, d)
1  return Σ_{c ∈ C} (1/|c|) Σ_{x,y ∈ c, x ≠ y} d(x, y)
   ▷ In the inner summation, you needn't iterate over (c × c), but rather only over the two-membered subsets of c. In addition, pairs without values in d can be treated as a group.
Cluster correction? H&M97 take the additional step of ‘cluster correction’ after the exchange
method has converged. In my experiments, this always led to worse results, presumably because
it has the potential to raise the objective value. It might be worth exploring a version where such
correction is followed by additional iterations of EXCHANGE, since this could pop us out of a local
minimum.
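A compact sketch of the exchange method, assuming d is stored as a dict keyed by frozenset pairs with 0.5 as the fallback of def. 8 (my own rendering of the pseudocode above, not H&M97's implementation):

import random

def obj_func(partition, d):
    total = 0.0
    for cell in partition:
        cell = list(cell)
        if not cell:
            continue
        # Sum d over the two-membered subsets of the cell; unseen pairs default to 0.5.
        pair_sum = sum(d.get(frozenset((x, y)), 0.5)
                       for i, x in enumerate(cell) for y in cell[i + 1:])
        total += pair_sum / len(cell)
    return total

def exchange_method(words, d, k, seed=0):
    random.seed(seed)
    partition = [set() for _ in range(k)]
    for w in words:                                      # random initial partition
        partition[random.randrange(k)].add(w)
    obj_val = obj_func(partition, d)
    improved = True
    while improved:                                      # stop when a full pass yields no improvement
        improved = False
        for w in words:
            i = next(idx for idx, cell in enumerate(partition) if w in cell)
            candidates = {}
            for j in range(k):
                if j == i:
                    continue
                partition[i].remove(w); partition[j].add(w)   # tentative move
                new_val = obj_func(partition, d)
                partition[j].remove(w); partition[i].add(w)   # undo
                if new_val < obj_val:
                    candidates[j] = new_val
            if candidates:
                j = min(candidates, key=candidates.get)       # biggest improvement
                partition[i].remove(w); partition[j].add(w)
                obj_val = candidates[j]
                improved = True
    return partition

d = {frozenset(("good", "great")): 0.1, frozenset(("good", "awful")): 0.9,
     frozenset(("great", "awful")): 0.9, frozenset(("awful", "terrible")): 0.1}
print(exchange_method(["good", "great", "awful", "terrible"], d, k=2))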
[Figure 6 shows two copies of a graph over {good, great, excellent, bad, awful, terrible}, with edges labeled by median fitted values, before and after an exchange step involving excellent. The corresponding objective values:]

(top)     ((1 − 0.4) + (1 − 0.9) + (1 − 0.3))/3 + ((1 − 0.1) + (1 − 0.9) + (1 − 0.01))/3 = 1.13

(bottom)  ((1 − 0.4) + (1 − 0.9) + (1 − 0.3) + (1 − 0.8) + (1 − 0.9) + (1 − 0.2))/4 + (1 − 0.9)/2 = 0.675

Figure 6: Exchange of excellent.
6.2.8 Cluster labeling
H&M97 “The clustering algorithm separates each component of the graph into two groups of
adjectives, but does not actually label the adjectives as positive or negative. [. . . ] We compute the
average frequency of the words in each group, expecting the group with higher average frequency
to contain the positive terms.”
Here The frequencies for unigrams in the Google N-grams corpus (Brants and Franz 2006).
Free variables The choice of clustering algorithm; the inherent randomness of the initial partition; the order in which the words are visited (which can lead to different local optima); the nature of the dictionary d, which is determined by the features in the logistic regression; and the frequency measure that determines the class labels.
6.2.9 Assessment
α   Adjectives in test set (|A_α|)   Links in test set (|L_α|)   Avg. links per adjective   Accuracy   Ratio of average group frequencies
2              730                          2,568                        7.04                78.08%              1.8699
3              516                          2,159                        8.37                82.56%              1.9235
4              369                          1,742                        9.44                87.26%              1.3486
5              236                          1,238                       10.49                92.37%              1.4040

Table 3: Evaluation of the adjective classification and labeling methods.
[Figure 7: Assessment of the coordination experiment with two slightly different fits. Panels: "Fit 1: Final class sizes", "Fit 2: Final class sizes", "Fit 1: F1", "Fit 2: F1"; x-axis: trial (1–10); plotted values: Positive F1, Negative F1, Accuracy. The cluster labeling procedure was effective, in that no trial had under 50% accuracy, which would have been a sign that we got the labels backwards. Drops in effectiveness seem correlated with imbalances in the initial random split, as we see with trial 6 on the left and trial 9 on the right.]

This page also reproduces excerpts from H&M97's description of their clustering, labeling, and evaluation: their procedure repeatedly moves the adjective whose reassignment most reduces the objective function, subject to a constraint based on Rousseeuw's (1987) silhouettes; it is a steepest-descent hill-climbing method that is guaranteed to converge, but generally to a local rather than the global minimum (the problem is NP-complete; Garey and Johnson 1979), so they rerun it from different random starting partitions. Their evaluation splits 1,336 adjectives and 2,838 conjunction- and morphology-based links into training and test sets parameterized by α ≥ 2, the minimum number of links each test adjective must have. Accuracy ranges from 78% on the sparsest test set to more than 92% when more links are present, and the ratio of the average group frequencies correctly identified the positive subgroup in every case.
6.3 Similar approaches and ideas
Clustering based on specific properties Hatzivassiloglou and McKeown (1993) apply the method
of sec. 6.2 to adjective co-occurrence inside noun phrases.
Polarity by association Turney and Littman (2003) infer the polarity of new words from their distributional association with seed-sets of known polarity.
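As a rough sketch of this idea in PMI-based form, the following scores a word by its summed association with positive seeds minus its summed association with negative seeds; the count dictionaries and the particular seed words are illustrative assumptions, not a reproduction of Turney and Littman's exact setup.

    import math

    # Illustrative seed sets in the spirit of Turney and Littman (2003).
    POS_SEEDS = ["good", "nice", "excellent", "positive", "fortunate", "correct", "superior"]
    NEG_SEEDS = ["bad", "nasty", "poor", "negative", "unfortunate", "wrong", "inferior"]

    def pmi(w1, w2, cooccur, counts, total):
        """Pointwise mutual information from raw counts.
        cooccur[(w1, w2)]: co-occurrence count; counts[w]: unigram count;
        total: number of tokens (or windows) in the corpus."""
        joint = cooccur.get((w1, w2), 0)
        if joint == 0 or counts.get(w1, 0) == 0 or counts.get(w2, 0) == 0:
            return 0.0  # treat unseen pairs as uninformative
        return math.log2((joint / total) / ((counts[w1] / total) * (counts[w2] / total)))

    def semantic_orientation(word, cooccur, counts, total):
        """Positive score suggests the word patterns with the positive seeds."""
        pos = sum(pmi(word, s, cooccur, counts, total) for s in POS_SEEDS)
        neg = sum(pmi(word, s, cooccur, counts, total) for s in NEG_SEEDS)
        return pos - neg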
Vector-space semantics The distributional evidence employed by Velikovich et al. (2010) (sec. 6.1)
is one specific algorithm falling under the rubric of vector-space semantics, which is reviewed quite
thoroughly by Turney and Pantel (2010).
7 Summary assessment
The approaches differ along too many interesting dimensions to permit fair head-to-head comparisons, but here is my overall take on them one-by-one:
Simple WordNet propagation (sec. 5.1) Interesting and informative. It suggests that sentiment and lexical relatedness are linked, which bodes well for approaches that leverage lexical relatedness to get at sentiment. The proposals quickly summarized in sec. 5.3 show that we can deal effectively with impure paths and uneven strengths. The major foundational weaknesses: (i) different seed-sets can give very different results, and (ii) the lexicon cannot outgrow WordNet.
WordNet propagation with scores (sec. 5.2) Deepens and expands our understanding of the
connection between lexical relatedness and sentiment. Its foundational weaknesses seem to me to
be just those of the above.
Web-based propagation Uses the basic insight of the above, but breaks free of WordNet, thereby
addressing concern (ii). Moreover, I strongly suspect that it can perform well with fewer than 20
billion words, especially since the 6- to 9-grams were not contributing much. (I was getting positive initial results from an 8-million-word corpus of short online review summary lines, restricting
attention to just unigrams.) Weakness (i) — dependence on seed-sets — remains.
Coordination (and similar morphosyntactic ideas) Helps to highlight, in a quantitative way,
the relationships between sentiment and particular words and constructions. Thus, it has substantial linguistic interest. Its major weakness is that it is limited by our cleverness in coming up
with useful constructions (an analogue of the seed-set issue for the above), but perhaps Snow et al. (2005) have already shown us how to shake off this sort of limitation.
A note on seed-sets Perhaps the dependence on seed-sets is a point of scientific interest — we as
researchers use our knowledge of language and the domain to pick worthwhile seed-sets, as a way
of extending our limited but high-precision knowledge out to vast quantities of new data. That is,
perhaps (i) above is not a weakness.
References
Andreevskaia, Alina and Sabine Bergler. 2006. Mining WordNet for a fuzzy sentiment: Sentiment
tag extraction from WordNet glosses. In Proceedings of the European Chapter of the Association
for Computational Linguistics (EACL).
Baccianella, Stefano; Andrea Esuli; and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of the Seventh
International Conference on Language Resources and Evaluation, 2200–2204. European Language
Resources Association.
Blair-Goldensohn, Sasha; Kerry Hannan; Ryan McDonald; Tyler Neylon; George A. Reis; and Jeff
Reynar. 2008. Building a sentiment summarizer for local service reviews. In WWW Workshop on
NLP in the Information Explosion Era (NLPIX). Beijing, China.
Brants, Thorsten and Alex Franz. 2006. Web 1T 5-gram version 1. Linguistic Data Consortium,
Philadelphia.
Cerini, S.; V. Compagnoni; A. Demontis; M. Formentelli; and G. Gandini. 2007. Micro-WNOp: A
gold standard for the evaluation of automatically compiled lexical resources for opinion mining. In Andrea Sansò, ed., Language Resources and Linguistic Theory: Typology, Second Language
Acquisition, English Linguistics. Milan: Franco Angeli Editore.
Esuli, Andrea and Fabrizio Sebastiani. 2006. SentiWordNet: A publicly available lexical resource
for opinion mining. In Proceedings of the 5th Conference on Language Resources and Evaluation,
417–422. Genova.
Fellbaum, Christiane, ed. 1998. WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.
Godbole, Namrata; Manjunath Srinivasaiah; and Steven Skiena. 2007. Large-scale sentiment analysis for news and blogs. In Proceedings of the International Conference on Weblogs and Social
Media.
Hatzivassiloglou, Vasileios and Kathleen R. McKeown. 1993. Towards the automatic identification
of adjectival scales: Clustering adjectives according to meaning. In Proceedings of the 31st Annual
Meeting of the Association for Computational Linguistics, 172–182. ACL.
Hatzivassiloglou, Vasileios and Kathleen R. McKeown. 1997. Predicting the semantic orientation
of adjectives. In Proceedings of the 35th Annual Meeting of the ACL and the 8th Conference of the
European Chapter of the ACL, 174–181. ACL.
Hu, Minqing and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the
10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 168–177.
ACM.
Kim, Soo-Min and Eduard H. Hovy. 2006. Identifying and analyzing judgment opinions. In Robert C.
Moore; Jeff A. Bilmes; Jennifer Chu-Carroll; and Mark Sanderson, eds., Proceedings of the Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, 200–207. ACL.
Manning, Christopher D.; Prabhakar Raghavan; and Hinrich Schütze. 2009. An Introduction to
Information Retrieval. Cambridge University Press.
Manning, Christopher D. and Hinrich Schütze. 1999. Foundations of Statistical Natural Language
Processing. Cambridge, MA: MIT Press.
Miller, George A. 1995. WordNet: A lexical database for English. Communications of the ACM
38(11):39–41.
Miller, George A. 2009. WordNet — about us. WordNet. Princeton University.
Rao, Delip and Deepak Ravichandran. 2009. Semi-supervised polarity lexicon induction. In Proceedings of the 12th Conference of the European Chapter of the ACL, 675–682. ACL.
Snow, Rion; Daniel Jurafsky; and Andrew Y. Ng. 2005. Learning syntactic patterns for automatic
hypernym discovery. In Lawrence K. Saul; Yair Weiss; and Léon Bottou, eds., Advances in Neural
Information Processing Systems 17, 1297–1304. Cambridge, MA: MIT Press.
Späth, Helmuth. 1977. Computational experiences with the exchange method. European Journal of Operational Research 1(1):23–31.
Späth, Helmuth. 1980. Cluster Analysis Algorithms for Data Reduction and Classification of Objects.
New York: Ellis Horwood.
Stone, Philip J.; Dexter C. Dunphy; Marshall S. Smith; and Daniel M. Ogilvie. 1966. The General
Inquirer: A Computer Approach to Content Analysis. Cambridge, MA: MIT Press.
Turney, Peter D. and Michael L. Littman. 2003. Measuring praise and criticism: Inference of semantic
orientation from association. ACM Transactions on Information Systems 21(4):315–346.
Turney, Peter D. and Patrick Pantel. 2010. From frequency to meaning: Vector space models of
semantics. Journal of Artificial Intelligence Research 37:141–188.
Valitutti, Alessandro; Carlo Strapparava; and Oliviero Stock. 2004. Developing affective lexical
resources. PsychNology Journal 2(1):61–83.
Velikovich, Leonid; Sasha Blair-Goldensohn; Kerry Hannan; and Ryan McDonald. 2010. The viability of web-derived polarity lexicons. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL 2010).
Wilson, Theresa; Janyce Wiebe; and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.