Word Sense Disambiguation of Opinionated Words
Using Extended Gloss Overlap
Bernadette Rosario C. Razon
De La Salle University
[email protected]

Charibeth K. Cheng
De La Salle University
[email protected]
ABSTRACT
The increasing use of digital technology and social media has
prompted government organizations to take advantage of these in
order to gather feedback and opinions from the general public on
policies and laws. Since people’s opinions vary, there is a need to
present these in an orderly manner for faster and more efficient
decision and policy-making. This is where opinion classification
comes in. Existing systems such as VoxPop [5] are able to classify
texts according to polarity with the help of SentiWordNet, which
assigns subjectivity and polarity scores to WordNet synsets.
However, the current algorithm that VoxPop uses yields an
accuracy rate of only 50.5%.
Word sense disambiguation (WSD) is the task of “identifying
the meaning of words in context in a computational manner” [12].
It is often used in improving tasks such as information retrieval,
part-of-speech tagging and machine translation. Some algorithms
in word sense disambiguation involve comparing a target word to
the words surrounding it, while others are example-based.
This research integrated a word sense disambiguation (WSD) algorithm into the task of polarity classification. The extended gloss overlap algorithm was used, which compares two synsets and scores their relatedness by looking for phrasal matches between their glosses. With the integration of WSD into VoxPop, its accuracy rating increased to 60%.
Keywords
Word Sense Disambiguation, Sentiment Analysis
1. INTRODUCTION
With the ever-increasing use of technology, especially social media, it is not at all surprising that information is exchanged at a much faster pace. The amount of data being shared, especially through the World Wide Web, shows no signs of decreasing; on the contrary, the increasing presence of platforms such as social networking sites and online forums provides more venues for exchanging ideas and opinions.
However, it is also because of the increasing volume of data on the Web that the need for a structured and organized presentation of certain data becomes highly relevant. For example, there may be numerous online reviews of a newly released product, and a potential buyer may be interested in reading only the positive reviews. It would be tedious to read through so many articles just to see whether they are positive or not. It would be beneficial if these types of data were automatically classified into their proper polarity so as to save time and effort.
VoxPop is "a web-based opinion detection and classification system which organizes, analyzes and manages English commentaries" [5]. Akin to online forums, the system accepts as input comments on certain topics. It then detects the opinions, classifies them according to their polarity, and further clusters them according to subtopics. For the opinion classification submodule, VoxPop makes use of SentiWordNet to identify word polarity. SentiWordNet was created using semi-supervised synset classification to assign three scores (Objective, Positive and Negative) that determine the polarity of each synset. The scores were derived using eight ternary classifiers [6].
Polarity classification aims to determine whether a given text has a positive, negative, or in some cases neutral stance in relation to a particular issue or topic. It can be done at the word, sentence, or even document level. Generally, algorithms make use of existing lexicons or word sets for word-level polarity to aid in classifying polarity at higher levels.
Although there has been much work involving polarity classification, improvements can still be done. An example is VoxPop [5], which detects and classifies opinionated parts of commentaries by polarity. It employs a three-tiered approach to polarity classification. Using scores found in SentiWordNet [6], the system takes the average score of each of the synsets of a word and chooses the polarity with the higher score as the overall polarity of the word. The scores are then used for both sentence-level and commentary-level classification. This approach, although straightforward, only yields an accuracy of 50.5%. One probable reason is that the system does not take into consideration the context in which a term appears.
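To make the baseline concrete, the following is a minimal sketch of word-level scoring by averaging synset scores. The toy lexicon and function name are illustrative stand-ins for SentiWordNet lookups and do not correspond to the actual VoxPop implementation:

# Hypothetical miniature lexicon: each word maps to the (positive, negative)
# scores of its synsets, standing in for SentiWordNet entries.
TOY_LEXICON = {
    "relentless": [(0.125, 0.625), (0.0, 0.5)],
    "honesty": [(0.75, 0.0), (0.5, 0.0)],
}

def word_polarity(word):
    """Average the positive and negative scores over all synsets of a word
    and return the polarity with the higher average (the baseline strategy)."""
    synsets = TOY_LEXICON.get(word)
    if not synsets:
        return "neutral", 0.0
    avg_pos = sum(p for p, _ in synsets) / len(synsets)
    avg_neg = sum(n for _, n in synsets) / len(synsets)
    if avg_pos > avg_neg:
        return "positive", avg_pos
    if avg_neg > avg_pos:
        return "negative", avg_neg
    return "neutral", avg_pos

print(word_polarity("relentless"))  # ('negative', 0.5625)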
The rest of the paper is organized as follows: Section 2 discusses the literature related to the study. Section 3 presents the methodology of the research, including the testing methods. Section 4 presents the results and analysis, while Section 5 presents recommendations for further research on the topic.

2. RELATED LITERATURE
2.1 Polarity Classification
The work described in [19] makes use of a list of seed words in order to determine the polarity of other words. They make use of a modified log-likelihood ratio to determine the polarity of the target word based on its collocation frequency with the seed words in the input sentence. The resulting scores of the words are then averaged in order to determine the polarity of the opinion sentence.

On the other hand, a subjectivity lexicon comprising words tagged as negative, positive and neutral was used in the work described in [9]. They employ a rule-based algorithm based on the notion that sentiment-bearing sentences follow certain structures and that these structures can be encapsulated into a number of rules. Chunked text is taken as input and rules are fired using a cascade of transducers, proceeding from word- to phrase-level and finally to sentence-level polarity classification.

2.2 Word Sense Disambiguation
In the case of the Lesk algorithm [11], the dictionary definitions
of two terms are compared and the appropriate sense for each
word is determined by the frequency of common words between the definitions. This algorithm is said to be responsible for paving the way for more research on word sense disambiguation [13].
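As a rough illustration of the original Lesk idea, the sketch below picks the sense of a word whose gloss shares the most words with the gloss of a neighboring word. The glosses are invented stand-ins for dictionary definitions, echoing the pine cone example from [11]:

# Invented glosses standing in for dictionary definitions.
GLOSSES = {
    ("pine", 1): "kinds of evergreen tree with needle-shaped leaves",
    ("cone", 1): "solid body which narrows to a point",
    ("cone", 2): "fruit of certain evergreen tree",
}

def overlap(gloss_a, gloss_b):
    """Count the words shared by two glosses (simplified Lesk overlap)."""
    return len(set(gloss_a.split()) & set(gloss_b.split()))

def best_sense(word, neighbor_sense):
    """Choose the sense of `word` whose gloss overlaps most with the
    gloss of the already-known sense of a neighboring word."""
    neighbor_gloss = GLOSSES[neighbor_sense]
    candidates = [key for key in GLOSSES if key[0] == word]
    return max(candidates, key=lambda key: overlap(GLOSSES[key], neighbor_gloss))

print(best_sense("cone", ("pine", 1)))  # ('cone', 2), whose gloss overlaps most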
Another algorithm, the extended gloss overlap [4], has its foundations in the original Lesk algorithm. However, the algorithm gives a higher score to phrasal matches instead of treating glosses as a bag of terms regardless of word order. Furthermore, the words compared are not limited to those present in the input text. Glosses of related words obtained through WordNet also play a role in determining the sense of a term.
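The following is a minimal sketch of the scoring idea under one common formulation, in which a shared phrase of n consecutive words contributes n squared to the score, so phrasal matches outweigh the same words matched individually. In the full algorithm the glosses of related synsets (e.g. hypernyms) are also compared; that step is omitted here for brevity:

def phrase_overlap_score(gloss_a, gloss_b):
    """Score two glosses by their shared word sequences: a shared phrase of
    n consecutive words contributes n**2 (simplified; gloss_b words may be
    reused, unlike in the full algorithm)."""
    a, b = gloss_a.split(), gloss_b.split()
    score = 0
    used = [False] * len(a)
    # Look for the longest shared phrases first.
    for length in range(len(a), 0, -1):
        for i in range(len(a) - length + 1):
            if any(used[i:i + length]):
                continue
            phrase = a[i:i + length]
            if any(phrase == b[j:j + length] for j in range(len(b) - length + 1)):
                score += length ** 2
                for k in range(i, i + length):
                    used[k] = True
    return score

# A shared two-word phrase scores 4, versus 2 for two isolated word matches.
print(phrase_overlap_score("evergreen tree with cones", "an evergreen tree"))  # 4
print(phrase_overlap_score("tree that stays evergreen", "an evergreen tree"))  # 2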
The gloss vector algorithm also compares the glosses of two
terms or concepts. One distinct feature of the algorithm is that it
uses a co-occurrence matrix representing the frequency of any two
words appearing in a WordNet gloss. The glosses of the two terms
to be compared are represented through vectors, and for each term
the algorithm outputs a vector representing the overall concept of
the term.
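A small self-contained sketch of the gloss vector idea follows: each gloss word is mapped to its row in a word co-occurrence matrix, the rows are summed into one vector per gloss, and the two gloss vectors are then compared (cosine similarity is the usual choice). The tiny vocabulary and counts are invented for illustration:

import math

# Invented co-occurrence counts: how often each word co-occurs with the
# context words (church, tree, fruit) across a corpus of glosses.
VOCAB = {"sacred": (4, 0, 0), "evergreen": (0, 5, 1), "cone": (0, 3, 2)}

def gloss_vector(gloss):
    """Sum the co-occurrence rows of the gloss words into a single vector."""
    vec = [0, 0, 0]
    for word in gloss.split():
        for i, count in enumerate(VOCAB.get(word, (0, 0, 0))):
            vec[i] += count
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(cosine(gloss_vector("evergreen cone"), gloss_vector("sacred")))     # 0.0
print(cosine(gloss_vector("evergreen cone"), gloss_vector("evergreen")))  # ~0.99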
An algorithm called the Chain Algorithm for Disambiguation
(CHAD) [15] also compares the glosses of words. In this case
words are disambiguated in triplets such that the sense of the 3rd
word is determined with the help of the senses of the two words
before it. The algorithm, however, does not consider the senses of
succeeding words.
The work in [17] describes an algorithm for disambiguating
Chinese adjectives. It uses pointwise mutual information to
compute word association strength between nouns and adjectives
in the training data, and uses the results to disambiguate the
adjectives. It achieved very high precision but very low recall.
A task called subjectivity word sense disambiguation is described in [2]. Using machine learning features for WSD, terms are labeled as subjective or objective depending on how they are used in a sentence and whether they have an impact on the overall polarity of the sentence. However, the approach did not significantly improve on the original classifier in terms of accurately labeling sentence-level polarity.
3. METHODOLOGY
The research focused on integrating word sense disambiguation (WSD) into the opinion classification task and seeing how this would affect overall results. The word sense disambiguation algorithm used was the extended gloss overlap described in [4], because it had the highest accuracy in disambiguating terms compared to other similar WSD algorithms [5].
The algorithm was integrated into VoxPop, a system that
detects opinions in commentaries and classifies these by topic as
well as polarity. The original VoxPop algorithm makes use of
adjectives and adverbs in determining a commentary’s polarity.
However, this research also made use of nouns and verbs in the
classification task as well as in the word sense disambiguation
algorithm. This is because certain nouns (e.g. honesty) and verbs
(e.g. agree) also lean towards a certain polarity. Furthermore,
only the polarity scores of the word sense identified by the WSD
algorithm are used.
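A minimal sketch of this change is given below; the per-sense scores and the choose_sense placeholder (which stands in for the extended gloss overlap step) are hypothetical:

# Hypothetical (positive, negative) scores for two senses of "agree",
# standing in for SentiWordNet entries.
SENSE_SCORES = {
    ("agree", 1): (0.625, 0.0),    # to be in accord
    ("agree", 2): (0.0, 0.125),    # to suit, as in "the food agrees with me"
}

def choose_sense(word, context_words):
    """Placeholder for the WSD step (e.g. extended gloss overlap);
    here it simply returns sense 1 for illustration."""
    return (word, 1)

def disambiguated_polarity(word, context_words):
    """Use only the scores of the sense chosen by the WSD algorithm,
    instead of averaging the scores of every sense of the word."""
    pos, neg = SENSE_SCORES[choose_sense(word, context_words)]
    if pos > neg:
        return "positive", pos
    if neg > pos:
        return "negative", neg
    return "neutral", pos

print(disambiguated_polarity("agree", ["we", "all", "agree"]))  # ('positive', 0.625)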
Two sets of data were used. Both data sets consist of commentaries taken from the Inbox World section of The Philippine Star. The first set of data consists of 200 commentaries that were used to evaluate the original VoxPop system. These commentaries were primarily evaluated by a linguist, and each commentary was manually classified as positive, negative or neutral. The linguist's evaluation was then compared with the results yielded by the system in the different tests. In order to see if a single annotator's bias has a significant effect on the accuracy, the system was also tested on another set of inter-annotated data. The second set consists of 474 commentaries annotated by three individuals.

To examine the effect of integrating WSD in the opinion classification task, three types of tests were done on each data set. The first type of test was conducted using the VoxPop system without the WSD algorithm integrated. The second type used the VoxPop system with the extended gloss overlap; however, this test only included the main gloss of each content word in the data. The third type used the VoxPop system integrated with the extended gloss overlap algorithm, this time including the glosses of related synsets.

Also, each set of tests on the system used different combinations of parts of speech. The groupings are as follows: adjectives and adverbs; adjectives, adverbs and nouns; adjectives, adverbs and verbs; and lastly adjectives, adverbs, nouns and verbs. Adjectives and adverbs were retained in all four groups because these form the basic components of an opinionated phrase or sentence.

Furthermore, the accuracy of the resources used to determine word polarity was also examined. For each type of test run on the system, SentiWordNet 1.0 and SentiWordNet 3.0 were used alternately to see if the improved accuracy of SentiWordNet 3.0 [3] would also improve overall results.
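For clarity, the sketch below simply enumerates the resulting test grid, three algorithm variants by four part-of-speech groups by two SentiWordNet versions by two data sets; the labels are shorthand for the configurations described above:

from itertools import product

ALGORITHMS = ["no WSD",
              "extended gloss overlap, main gloss only",
              "extended gloss overlap, with related synsets"]
POS_GROUPS = ["adj+adv", "adj+adv+noun", "adj+adv+verb", "adj+adv+noun+verb"]
RESOURCES = ["SentiWordNet 1.0", "SentiWordNet 3.0"]
DATA_SETS = ["linguist-annotated (200)", "inter-annotated (474)"]

# Every combination is run once and scored for commentary-level accuracy.
configurations = list(product(ALGORITHMS, POS_GROUPS, RESOURCES, DATA_SETS))
print(len(configurations))  # 3 * 4 * 2 * 2 = 48 test runs
for algorithm, pos_group, resource, data_set in configurations[:3]:
    print(algorithm, "|", pos_group, "|", resource, "|", data_set)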
4. RESULTS AND ANALYSIS
4.1 Linguist-evaluated data
The succeeding four tables contain the results of the tests performed using the original data compiled by the VoxPop proponents.

Table 1. Results on VoxPop data using adjectives and adverbs
Algorithm                                         SWN1     SWN3
No WSD                                            46.46%   48.48%
Extended Gloss Overlap without Related Synsets    46.97%   48.48%
Extended Gloss Overlap with Related Synsets       46.97%   46.46%

Table 2. Results on VoxPop data using nouns, adjectives and adverbs
Algorithm                                         SWN1     SWN3
No WSD                                            43.94%   45.45%
Extended Gloss Overlap without Related Synsets    45.45%   48.99%
Extended Gloss Overlap with Related Synsets       43.43%   49.49%

Table 3. Results on VoxPop data using verbs, adjectives and adverbs
Algorithm                                         SWN1     SWN3
No WSD                                            46.46%   46.97%
Extended Gloss Overlap without Related Synsets    45.45%   47.47%
Extended Gloss Overlap with Related Synsets       42.93%   50.00%

Table 4. Results on VoxPop data using nouns, verbs, adjectives and adverbs
Algorithm                                         SWN1     SWN3
No WSD                                            41.92%   41.92%
Extended Gloss Overlap without Related Synsets    44.44%   45.96%
Extended Gloss Overlap with Related Synsets       45.45%   45.45%
Generally, the use of SentiWordNet 3.0 yielded better results, as
compared to the use of SentiWordNet 1.0. This observation can be
seen across many combinations of word sense disambiguation
algorithms. Since SentiWordNet 3.0 used disambiguated WordNet
glosses in determining polarity scores, the resource as a whole is
said to be 20% more precise as compared to its predecessor [3].
However, even this generalization cannot be trusted completely
because of the highly varying degrees of difference which range
anywhere from 0% to more than 7%.
4.2 Inter-annotated data
The purpose of evaluating the commentaries by inter-annotator
agreement is to see how different people view similar
commentaries.
Table 5 gives a summary of the annotators' evaluation.

Table 5. Statistics on Annotators' Evaluation
Number of commentaries with completely different classification       80
Number of commentaries with at least 2 same polarity classifications  394
Number of commentaries with 3 same polarity classifications           139
Of the 474 commentaries that were evaluated, only 139 or 29.32% were tagged with the same polarity by all three annotators. On the other hand, 394 commentaries were tagged with the same polarity by at least two annotators. That left 80 commentaries, which were tagged with different polarities by each annotator. This shows that it is difficult for humans to unanimously concur on a commentary's polarity.
For a direct comparison of the results on the linguist-annotated data and the inter-annotated data, only the results for the unanimously inter-annotated data are considered. Tables 6 to 9 show the results for tests run using the inter-annotated data, each table showing the results for one part-of-speech group.
Table 6. Results on unanimously classified data using adjectives and adverbs
Algorithm                                         SWN1     SWN3
No WSD                                            53.24%   54.68%
Extended Gloss Overlap without Related Synsets    53.96%   55.40%
Extended Gloss Overlap with Related Synsets       53.96%   55.40%
Table 7. Results on unanimously classified data using nouns, adjectives and adverbs
Algorithm                                         SWN1     SWN3
No WSD                                            53.96%   57.55%
Extended Gloss Overlap without Related Synsets    57.55%   60.43%
Extended Gloss Overlap with Related Synsets       55.40%   56.83%
Table 8. Results on unanimously classified data using verbs, adjectives and adverbs
Algorithm                                         SWN1     SWN3
No WSD                                            59.71%   54.68%
Extended Gloss Overlap without Related Synsets    57.55%   55.40%
Extended Gloss Overlap with Related Synsets       49.64%   52.52%

Table 9. Results on unanimously classified data using verbs, nouns, adjectives and adverbs
Algorithm                                         SWN1     SWN3
No WSD                                            60.43%   57.55%
Extended Gloss Overlap without Related Synsets    63.31%   60.43%
Extended Gloss Overlap with Related Synsets       53.24%   54.68%

There is a slight improvement in the accuracy as compared to the results of the tests run on the VoxPop data. However, the improvement is not significant.

4.3 Factors
Although using SentiWordNet 3.0 yielded better results as opposed to using SentiWordNet 1.0, the overall accuracy of the system across all combinations of algorithms and parts of speech is quite low. The following were identified as factors that contributed to the low improvement:

4.3.1 Presence of highly polarized words
Certain terms in a commentary may have polarities that are too high, rendering the polarities of other words insignificant. Take the following statement as an example:

"Maybe this could be attributed to Chief Justice Puno's relentless efforts in cleansing the ranks of the judiciary."

The linguist classified this commentary as positive, while the system classified it as negative. This result was consistent in all combinations of part-of-speech groups, algorithms and SentiWordNet resources. The fault lies in the polarity of the word relentless, because it has a high negative score in both SentiWordNet 1.0 and 3.0.

4.3.2 Presence of contextual valence shifters
Words such as not, too, should and although change the polarity of the words next to them. However, this is not considered by the VoxPop formula. In the VoxPop data, most commentaries classified as neutral by the linguist are classified as either positive or negative by the system. The statement

"Integrity and honesty do not translate to good governance in solving our country's economy and complex society full of backbiters."

was classified by the linguist as being neutral. The system classified it as negative. In the statement, the term not shifts the polarity of the whole sentence. In this case, the term should only influence the succeeding word or the phrase that contains it.

4.3.3 Accuracy of resources used
The resources used also have their own deficiencies, such as words tagged with the incorrect part of speech, as well as terms and/or particular senses of terms having no equivalent score in SentiWordNet.
4.3.4 Inaccuracy of the WSD algorithm
There are also instances where the word sense disambiguation
algorithm assigned the incorrect sense to a word. For example, in
the sentence
“That is why we need a leader, a president with
spiritual and moral aptitude, who will influence
Filipinos by the way he lives.”
the term spiritual was tagged with its 4th sense. The table below shows the different glosses of spiritual as an adjective:

Table 10. Glosses for the term "spiritual"
Sense   Gloss
1       concerned with sacred matters or religion or the church
2       concerned with or affecting the spirit or soul
3       lacking material body or form or substance
4       resembling or characteristic of a phantom
Seeing the glosses, it is more appropriate to tag the term spiritual
with its second sense since its gloss is more in line with what the
statement wants to imply.
5. RECOMMENDATIONS
The research can be improved in many aspects. One of these is the creation of resources to help process commentaries that are written in both English and Tagalog. A good number of the commentaries use Filipino words either as part of an English sentence or in entire sentences altogether. Creating a Filipino WordNet or even a Filipino SentiWordNet and using these together with their English counterparts may help increase the accuracy of the opinion classification task, and these resources can eventually be used in other Natural Language Processing tasks as well.
In creating the sentiment resource, the three-type scoring
concept of SentiWordNet can be employed as this seems to be
more flexible. However, the scores of each word should differ or
be adjusted based on the domain they are used in. Also, the word
list should include both English and Filipino terms used in
commentaries and other forms of media. Most importantly, even
if the scores may be automatically or semi-automatically
generated, there must be some sort of validation by at least three
experts of a particular domain. This is to ensure that the scores to
be used per word are more or less considered correct.
Another recommendation is using WordNet 3.0 in testing the
effectiveness of the WSD task as well as the opinion classification
task. Since the creation of SentiWordNet 3.0 used glosses of
WordNet 3.0, the rate of finding the polarity of a certain term or
sense in SentiWordNet 3.0 may increase. Also, the senses
assigned to words in the WSD task may be more appropriate.
The significance of contextual valence shifters [14] in a
commentary should also be studied. Apart from negatives such as
not and neither, valence shifters also include intensifiers (e.g.
rather, deeply) and modal operators (e.g., could, should), which
can signal that succeeding words are not necessarily opinions but
rather suggestions. There are also presuppositional items (e.g. She
made the cut. vs. She barely made the cut.) and connectors (e.g.
Although Rose is very smart, she is quite lazy.). These are aspects
of speech that are not covered by the current VoxPop formula
since it does not take into account how combinations of words can
have a different meaning as opposed to when they’re taken as
separate lexical items.
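As an illustration of how such shifters could be folded into word-level scoring, the following is a minimal sketch; the shifter lists, weights and scope rule (each shifter affects only the next opinion-bearing word) are simplifying assumptions, not the VoxPop formula:

# Simplified shifter inventory with assumed weights: negators flip a score,
# intensifiers and diminishers scale it.
NEGATORS = {"not", "neither", "never"}
INTENSIFIERS = {"deeply": 1.5, "rather": 0.8, "barely": 0.5}

# Toy word-level polarity scores (positive > 0, negative < 0).
POLARITY = {"honesty": 0.6, "good": 0.7, "lazy": -0.6}

def shifted_scores(tokens):
    """Apply any pending shifter to the next scored word, then reset it."""
    flip, scale, scores = False, 1.0, []
    for token in tokens:
        if token in NEGATORS:
            flip = True
        elif token in INTENSIFIERS:
            scale = INTENSIFIERS[token]
        elif token in POLARITY:
            score = POLARITY[token] * scale
            scores.append(-score if flip else score)
            flip, scale = False, 1.0
    return scores

print(shifted_scores("integrity and honesty do not translate to good governance".split()))
# [0.6, -0.7]: "not" flips the score of "good", as in the example above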
Modifications in the VoxPop formula should also be considered. Computing polarities by phrases or clauses instead of taking the polarities of individual terms may prove to be more accurate, especially if the presence of contextual valence shifters is taken into account. This is because valence shifters can change the polarity of the word immediately next to them, or even of a group of words, depending on their type.

Another recommendation is to tackle sentiment classification based on the relationship of commentaries to a hook question or thread title. The current classification of positive, negative and neutral can be ambiguous in some topics. Take the question "Should the death penalty be reinstated?" as an example. A commentary stating something like "The death penalty should be reinstated because it can potentially lessen the crime rate in the Philippines." can be interpreted as positive by someone who agrees with it. However, for someone who is against the death penalty this commentary is negative. In other words, the polarity of a commentary can be highly relative to an individual's beliefs and values. It would make more sense if a commentary were classified based on its agreement with the given topic.

The last recommendation is adding domain knowledge to the opinion classification task. Since the SentiWordNet resources were made for use in multiple domains, certain words can have different polarities when contextualized to a specific domain.
6. REFERENCES
[1] Agarwal, A., Biadsy, F., & McKeown, K. (2009). Contextual Phrase-Level Polarity Analysis using Lexical Affect Scoring and Syntactic N-grams. Proceedings of the 12th Conference of the European Chapter of the ACL (pp. 24-32). Athens, Greece: Association for Computational Linguistics.
[2] Akkaya, C., Wiebe, J., & Mihalcea, R. (2009). Subjectivity Word Sense Disambiguation. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (pp. 190-199). Singapore: Association for Computational Linguistics.
[3] Baccianella, S., Esuli, A., & Sebastiani, F. (2010). SENTIWORDNET 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining. Proceedings of the 7th Conference on Language Resources and Evaluation (LREC'10) (pp. 2200-2204). Valletta, Malta: European Language Resources Association (ELRA).
[4] Banerjee, S., & Pedersen, T. (2003). Extended Gloss Overlaps as a Measure of Semantic Relatedness. Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (pp. 805-810). Acapulco, Mexico.
[5] Bautista, G. Z., Garcia, M. A., & Tan, R. J. (2010, September 1). VoxPop: Automated Opinion Detection and Classification with Data Clustering. Manila, Philippines.
[6] Esuli, A., & Sebastiani, F. (2006). SENTIWORDNET: A Publicly Available Lexical Resource for Opinion Mining. Proceedings of the 5th Conference on Language Resources and Evaluation (pp. 417-422).
[7] Fellbaum, C. (Ed.). (1998). WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.
[8] Howes, D. (2001). e-Legislation: Law-Making in the Digital Age. McGill Law Journal, 47(1), 39-57.
[9] Klenner, M., Petrakis, S., & Fahrni, A. (2009). Robust Compositional Polarity Classification. Proceedings of the International Conference on Recent Advances in Natural Language Processing 2009 (pp. 180-184). Borovets, Bulgaria: Association for Computational Linguistics.
[10] Kolte, S. G., & Bhirud, S. G. (2008). Word Sense Disambiguation Using WordNet Domains. 2008 First International Conference on Emerging Trends in Engineering and Technology (pp. 1187-1191). Los Alamitos, California: IEEE Computer Society.
[11] Lesk, M. (1986). Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. Proceedings of the 5th Annual International Conference on Systems Documentation (pp. 24-26). New York: Association for Computing Machinery.
[12] Navigli, R. (2009). Word Sense Disambiguation: A Survey. ACM Computing Surveys, 41(2).
[13] Pedersen, T., Banerjee, S., & Patwardhan, S. (2005). Maximizing Semantic Relatedness to Perform Word Sense Disambiguation. Minneapolis: University of Minnesota Supercomputing Institute.
[14] Polanyi, L., & Zaenen, A. (2004). Contextual Valence Shifters. Proceedings of the AAAI Spring Symposium on Exploring Attitude and Affect in Text (pp. 106-111). California: Stanford University.
[15] Tatar, D., Serban, G., Mihis, A., Lupea, M., Lupsa, D., & Frentiu, M. (2007). A Chain Dictionary Method for Word Sense Disambiguation and Applications. Proceedings of the International Conference on Knowledge Engineering, Principles and Techniques (pp. 41-49). Cluj-Napoca.
[16] Wawer, A. (2010). Is Sentiment a Property of Synsets? Evaluating Resources for Sentiment Classification using Machine Learning. Proceedings of the Seventh Conference on International Language Resources and Evaluation (pp. 1101-1104). Valletta, Malta: European Language Resources Association (ELRA).
[17] Wu, Y., Wang, M., & Jin, P. (2008). Disambiguating Sentiment Ambiguous Adjectives. 2008 International Conference on Natural Language Processing and Knowledge Engineering (pp. 1-8). Beijing: Institute of Electrical and Electronics Engineers.
[18] Yarowsky, D. (1993). One sense per collocation. Proceedings of the Workshop on Human Language Technology (pp. 266-271). Princeton, New Jersey: Association for Computational Linguistics.
[19] Yu, H., & Hatzivassiloglou, V. (2003). Towards Answering Opinion Questions: Separating Facts from Opinions and Identifying the Polarity of Opinion Sentences. Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing (pp. 129-136). Morristown, NJ, USA: Association for Computational Linguistics.