An M Tech Dissertation Titled

Twitter Sentiment Analysis using Hybrid Naïve Bayes

Submitted in partial fulfilment towards the award of the degree of MASTER OF TECHNOLOGY IN COMPUTER ENGINEERING by Mr. Harsh Vrajesh Thakkar

Supervisor(s): Dr. Dhiren Patel

2012-2013
Department of Computer Engineering
SARDAR VALLABHBHAI NATIONAL INSTITUTE OF TECHNOLOGY, SURAT

Declaration

I hereby declare that the work presented in this dissertation report entitled “Twitter Sentiment Analysis using Hybrid Naïve Bayes” by me, Harsh Vrajesh Thakkar, bearing Roll No: P11CO010, and submitted to the Computer Engineering Department at Sardar Vallabhbhai National Institute of Technology, Surat, is an authentic record of my own work carried out during the period July 2012 to June 2013 under the supervision of Dr. Dhiren Patel. The matter presented in this report has not been submitted by me to any other University/Institute for any purpose. Neither the source code therein nor the content of the project report has been copied or downloaded from any other source. I understand that my result grades would be revoked if this is later found to be untrue.

______________________ (Harsh V. Thakkar)

CERTIFICATE

This is to certify that the dissertation report entitled “Twitter Sentiment Analysis using Hybrid Naïve Bayes”, submitted by Harsh Vrajesh Thakkar, bearing Roll No: P11CO010, in partial fulfilment of the requirements for the award of the degree of MASTER OF TECHNOLOGY in Computer Engineering at the Computer Engineering Department of Sardar Vallabhbhai National Institute of Technology, Surat, is a record of his own work carried out as part of the coursework for the year 2012-13. To the best of our knowledge, the matter embodied in the report has not been submitted elsewhere for the award of any degree or diploma.

Certified by ____________________ (Dr.
Dhiren Patel)
Professor, Department of Computer Engineering, S V National Institute of Technology, Surat – 395007, India

_______________________________
Mr. Udai Pratap Rao, PG Incharge, M Tech in Computer Engineering, SVNIT, Surat

Mr. Rakesh Gohil, Head, Department of Computer Engineering, S V National Institute of Technology, Surat – 395007, India

Department of Computer Engineering
SARDAR VALLABHBHAI NATIONAL INSTITUTE OF TECHNOLOGY, SURAT (2012-13)

Approval Sheet

This is to state that the Dissertation Report entitled “Twitter Sentiment Analysis using Hybrid Naïve Bayes” submitted by Mr. Harsh Vrajesh Thakkar (Admission No: P11CO010) is approved for the award of the degree of Master of Technology in Computer Engineering.

Board of Examiners: Examiners, Supervisor(s), Chairman, Head, Department of Computer Engineering
Date: ___________ Place: ______________

Acknowledgements

The journey of my Master of Technology has so far been very exciting and full of challenges. I would like to acknowledge that, apart from God’s and my parents’ blessings, the success of this dissertation work is the result of the sheer encouragement and guidance of my guide, Dr. Dhiren R. Patel. I am very thankful to him for being there for me and empowering me with his methodology and expertise. He has given me ample space to explore research interests of my own, which I personally believe to be a burning requirement for any natural research enthusiast. I am also grateful to Mr. Rakesh Gohil, the head of the Computer Engineering department, and his staff for giving me a lot of their valuable time, as well as the staff of the Central Computer Centre for allowing me to stay in the laboratory till late. Last but not least, I would like to thank my friends, who stood by me like a shadow whenever I needed their assistance, and who gave me the strength in my weak and wrecked moments to carry on. Today I am here because of the efforts of all of these wonderful people.

Harsh V.
Thakkar (P11CO010)

Abstract

Millions of users share opinions on diverse aspects of life and politics every day using microblogging over the internet. Microblogging websites are rich sources of data for opinion mining and sentiment analysis. In this dissertation work, we focus on using Twitter for sentiment analysis: extracting opinions about events, products and people, and using them to understand current trends or the state of the art. Twitter allows its users a limit of only 140 characters; this restriction forces the user to be concise as well as expressive at the same moment. This ultimately makes Twitter an ocean of sentiments. Twitter also provides developer-friendly streaming. We crawled datasets of over 4 million tweets with a custom-designed crawler for sentiment analysis purposes. We propose a hybrid Naïve Bayes classifier by integrating an English lexical dictionary (SentiWordNet) into the existing machine learning Naïve Bayes classifier algorithm. Hybrid Naïve Bayes classifies tweets into positive and negative classes. Experimental results demonstrate the superiority of hybrid Naïve Bayes over existing approaches on multi-sized datasets consisting of a variety of keywords, yielding over 90 percent accuracy in general and 98.59 percent accuracy in the best case. In our research, we worked with English; however, the proposed technique can be used with any other language, provided a lexical dictionary for that language is available.

Table of Contents

Chapter 1 Introduction 1-5
1.1 Motivation 1
1.2 Problem description 2
1.2.1 Why sentiment analysis? 2
1.2.2 Why NLP? . . .
2
1.2.3 Applications of sentiment analysis 3
1.3 The network: Twitter 4
1.4 Contribution 4
1.5 Thesis outline 5
Chapter 2 Theoretical Background and Literature Review 5-17
2.1 General sentiment analysis 6
2.2 Issues in sentiment analysis 7
2.3 Classification of approaches 7
2.3.1 Knowledge-based approach 7
2.3.2 Relationship-based approach 8
2.3.3 Language models approach 8
2.3.4 Discourse structures and semantics approach 9
2.4 Twitter specific approaches 9
2.4.1 Lexical analysis approach 10
2.4.2 Machine learning approach 11
2.4.3 Hybrid approach 12
2.5 Performance review 14
2.5.1 Lexical approach performance . . .
14
2.5.2 Machine learning approach performance 15
2.5.3 Hybrid approach performance 16
2.6 Chapter conclusion 16
Chapter 3 Design and Analysis of proposed approach 18-33
3.1 Problem statement 18
3.1.1 Problem introduction 18
3.1.2 Problem definition 19
3.2 Proposed approach: Hybrid Naive Bayes 19
3.2.1 Data collection: Twitter API 20
3.2.2 Preprocessing 22
3.2.3 Training data 23
3.2.4 Sentiment analysis: The classifier 24
Chapter 4 Implementation Methodology 26-30
4.1 Experimental setup 26
4.2 Evaluation 30
4.3 Test application 30
Chapter 5 Results and Analysis 34-42
5.1 Results . . .
34
Chapter 6 Conclusion and Future Work 43-44
6.1 Conclusion 43
6.2 Future work 44
Bibliography 45-47

List of Figures

Figure 2.1 Generic architecture of a lexical approach classifier. 11
Figure 2.2 Generic architecture of a machine learning approach classifier. 13
Figure 3.1 Process steps followed by hybrid naive bayes. 20
Figure 3.2 System architecture of hybrid naive bayes approach. 21
Figure 3.3 A sequence of intermediate preprocessing steps taking place at this level. 22
Figure 4.1 Polarity triangle of synset in SentiWordNet. 30
Figure 4.2 Class diagram of Tweet Sentiment Analyzer. 32
Figure 5.1 Comparison of classifier performances, Dataset size (no. of tweets) vs Accuracy (%). 36
Figure 5.2 The most informative features of hybrid naive bayes classifier. 37
Figure 5.3 The base naive bayes classifier in action on a Windows platform with 50k tweets dataset. 37
Figure 5.4 Accuracy of base naive bayes classifier with a 50k tweets dataset. 38
Figure 5.5 The hybrid naive bayes classifier in action on a Windows platform with 50k tweets dataset. 39
Figure 5.6 Accuracy of hybrid naive bayes classifier with a 50k tweets dataset. 40
Figure 5.7 Base naive bayes classifier in action on a Linux server with 10k tweets dataset. 41
Figure 5.8 Accuracy of base naive bayes classifier on a Linux server with a 10k tweets dataset. 42

List of Tables

Table 2.1 Performance of lexical approach variants 14
Table 2.2 Performance of machine learning approach variants 15
Table 2.3 Performance of hybrid approach variants 17
Table 4.1 General system requirements for our approach. 26
Table 5.1 Performance of base naive bayes classifier 34
Table 5.2 Performance of hybrid naive bayes classifier 35
“I don’t know if I will have the time to write anymore letters because I might be too busy trying to participate. So if this does end up being the last letter I just want you to know that I was in a bad place before I started high school and you helped me. Even if you didn’t know what I was talking about or know someone who has gone through it, you made me not feel alone. Because I know there are people who say all these things don’t happen. And there are people who forget what it’s like to be 16 when they turn 17. I know these will all be stories someday. And our pictures will become old photographs. We’ll all become somebody’s mom or dad. But right now these moments are not stories. This is happening, I am here and I am looking at her. And she is so beautiful. I can see it. This one moment when you know you’re not a sad story. You are alive, and you stand up and see the lights on the buildings and everything that makes you wonder. And you’re listening to that song and that drive with the people you love most in this world. And in this moment I swear, We are infinite.”

- Charlie’s last letter (The Perks of Being a Wallflower, 2012)

Chapter 1 Introduction

1.1 Motivation

With the proliferation of Web 2.0 applications such as microblogging, forums and social networks came reviews, comments, recommendations, ratings and feedback generated by users. This user-generated content can be about virtually anything, including politicians, products, people, events, etc. With the explosion of user-generated content came the need for companies, politicians, service providers, social psychologists, analysts and researchers to mine and analyze the content for different uses. The bulk of this user-generated content required the use of automated techniques for mining and analysis. Cases of bulk user-generated content that have been studied are blogs [1] and product/movie reviews [2].
Microblogging has become a very popular communication tool among Internet users. Millions of messages appear daily on popular websites that provide microblogging services, such as Twitter (http://www.twitter.com), Tumblr (http://www.tumblr.com) and Facebook (http://www.facebook.com). Users of these services write about their lives, share opinions on a variety of topics and discuss current issues. Because of the free format of messages and the easy accessibility of microblogging platforms, Internet users tend to shift from traditional communication tools to microblogging services. As more and more users post about the products and services they use and express their political and religious views, microblogging websites become valuable sources of people’s opinions and sentiments. Such data can be efficiently used for marketing and social studies.

Sentiment analysis is an exhaustive research field which has been studied for decades. It was initially used to analyze sentiment in long texts such as letters, emails and so on. It has also been deployed in the field of pre- and post-crime analysis of criminal activities. A variety of approaches have been applied to it. Applying this field to the microblogging fraternity is a challenging job. This challenge became our motivation. Needless to say, we are not the first to work in this area. There has been substantial research in both machine learning and lexical approaches to sentiment analysis for social networks. We try to improve the existing approaches by adding new variants to the research.

1.2 Problem Description

We propose a hybrid approach for sentiment analysis that is a combination of a machine learning algorithm and a special lexical dictionary.

1.2.1 WHY SENTIMENT ANALYSIS?

Every day, an enormous amount of data is created on social networks, blogs and other media and diffused into the World Wide Web.
This huge data contains very crucial opinion-related information that can be used to benefit businesses and other aspects of commercial and scientific industries. Manual tracking and extraction of this useful information is not feasible; thus, sentiment analysis is required. Sentiment analysis is the phenomenon of extracting sentiments or opinions from reviews expressed by users over a particular subject, area or product online. It is an application of natural language processing, computational linguistics, and text analytics to identify subjective information in source data. It groups the sentiments into categories like “positive” or “negative”. Thus, it determines the general attitude of the speaker or writer with respect to the topic in context.

1.2.2 WHY NLP?

Natural language processing (NLP) is the technology dealing with our most ubiquitous product: human language, as it appears in emails, web pages, tweets, product descriptions, newspaper stories, social media, and scientific articles, in thousands of languages and varieties. In the past decade, successful natural language processing applications have become part of our everyday experience, from spelling and grammar correction in word processors to machine translation on the web, from email spam detection to automatic question answering, from detecting people’s opinions about products or services to extracting appointments from your email. The greatest challenge of sentiment analysis is to design application-specific algorithms and techniques that can analyze human language accurately.

1.2.3 Applications of sentiment analysis

The following are the major applications of sentiment analysis in real-world scenarios.

• Product and service reviews - The most common application of sentiment analysis is in the area of reviews of consumer products and services. There are many websites that provide automated summaries of reviews about products and their specific aspects.
A notable example of that is “Google Product Search”.

• Reputation monitoring - Twitter and Facebook are a focal point of many sentiment analysis applications. The most common application is monitoring the reputation of a specific brand on Twitter and/or Facebook.

• Result prediction - By analyzing sentiments from relevant sources, one can predict the probable outcome of a particular event. For instance, sentiment analysis can provide substantial value to candidates running for various positions. It enables campaign managers to track how voters feel about different issues and how they relate to the speeches and actions of the candidates.

• Decision making - Another important application is that sentiment analysis can be used as an important factor assisting decision-making systems, for instance in financial market investment. There are numerous news items, articles, blogs, and tweets about each public company. A sentiment analysis system can use these various sources to find articles that discuss the companies and aggregate the sentiment about them into a single score that can be used by an automated trading system. One such system is The Stock Sonar (http://www.thestocksonar.com/), developed by Digital Trowel, which graphically shows the daily positive and negative sentiment about each stock alongside the graph of the stock’s price.

1.3 The network: Twitter

Twitter is an online social networking and microblogging service that enables its users to send and read text-based messages called “tweets”. Tweets are publicly visible by default, but senders can restrict message delivery to a limited crowd. Twitter is one of the largest microblogging services, with over 500 million registered users as of 2012.
Statistics revealed by Infographic Labs (http://infographiclabs.com/) suggest that back in the year 2012, 175 million tweets were communicated on a daily basis. There is a large mass of people using Twitter to express sentiments, which makes it an interesting and challenging choice for sentiment analysis. When so much attention is being paid to Twitter, why not monitor and cultivate methods to analyze these sentiments? Twitter has been selected with the following purposes in mind.

• Twitter is an open-access social network.
• Twitter is an ocean of sentiments (limited to 140 characters, i.e. high sentiment density).
• Twitter provides a user-friendly API, making it easier to mine sentiments in real time.

1.4 Contribution

In this thesis, we propose a Hybrid Naive Bayes classifier which is the combination of a machine learning algorithm (Naive Bayes) and a special lexical dictionary (SentiWordNet, http://www.sentiwordnet.isti.cnr.it). We crawled multi-size datasets consisting of approximately 4 million tweets with a variety of the most popular keywords, such as “#ironman3”, “#amitabhbachhan”, “#Google”, “#twitter” and “#robertdowneyjr”, for training and testing purposes. We test the proposed Hybrid Naive Bayes approach using the Natural Language Toolkit and observe that it outperforms the existing approaches, delivering competitive results with 98.59 percent accuracy (best case).

1.5 Thesis Outline

This thesis is organised as follows:

• Chapter 1: Introduces the work of this thesis.
• Chapter 2: An exhaustive survey of existing approaches.
• Chapter 3: Proposed approach and goals.
• Chapter 4: Details the implementation methodology.
• Chapter 5: Discussion of results and analysis.
• Chapter 6: Concludes the work and explores avenues for future work.

Chapter 2 Theoretical Background and Literature Review

Sentiment analysis caught attention as one of the most active research areas with the explosion of social networks.
The enormous user-generated content resulting from these social media contained valuable information in the form of reviews, opinions, etc., about products, events and people. Most sentiment analysis studies use machine learning approaches, which require a large amount of user-generated content for training. The research on sentiment analysis so far has mainly focused on two things: identifying whether a given textual entity is subjective or objective, and identifying the polarity of subjective texts [3].

In the following sections, we briefly review the literature on both general and Twitter-specific sentiment analysis. Starting with general sentiment analysis, we also discuss the issues that make sentiment analysis a more difficult task than other text-classification tasks. Later on, we move to the focus area of this thesis, i.e. Twitter-specific approaches. In a study, [4] present a comprehensive review of the literature written before 2008. Most of the material on general sentiment analysis is based on their review.

2.1 General Sentiment Analysis

Sentiment analysis has been carried out on a range of topics. For example, there are sentiment analysis studies for movie reviews [4], product reviews [5], and news and blogs ([3], [6]).

2.2 Issues in Sentiment Analysis

Research reveals that sentiment analysis is more difficult than traditional topic-based text classification, despite the fact that the number of classes in sentiment analysis is smaller than the number of classes in topic-based classification [4]. In sentiment analysis, the classes to which a piece of text is assigned are usually negative or positive. They can also be other binary classes or multi-valued classes, such as classification into “positive”, “negative” and “neutral”, but they are still fewer than the number of classes in topic-based classification.
The main reason sentiment analysis is tougher than topic-based classification is that the latter can be done using keywords, while this does not work well in sentiment analysis [2]: a variety of features have to be taken into account. Other reasons for the difficulty are: sentiment can be expressed in subtle ways without any perceived use of negative words; it is difficult to determine whether a given text is objective or subjective (the line between objective and subjective texts is thin); it is difficult to determine the opinion holder (for example, is it the opinion of the author or the opinion of a commenter?); and there are other factors such as dependency on the domain and on the order of words [3]. Further challenges of sentiment analysis are dealing with sarcasm, irony, and/or negation.

2.3 Classification of approaches

Sentiment analysis is formulated as a computational linguistics problem. The classification can be approached from different perspectives depending on the nature of the task at hand and the perspective of the person carrying out the sentiment analysis. The familiar approaches are discourse-driven, relationship-driven, language-model-driven, or keyword-driven. We discuss these approaches in the subsequent subsections.

2.3.1 Knowledge-based approach

In this approach, sentiment is calculated as a function of some keywords. The main task is the construction of sentiment discriminatory-word lexicons that indicate a particular class, such as the positive or negative class. The polarity of the words in the lexicon is determined prior to the sentiment analysis work. There are variations to how the lexicon is created.
For example, lexicons can be created by starting with some seed words and then using linguistic heuristics to add more words to them, or by starting with some seed words and adding other words to them based on frequency in a text [2]. For certain applications, there are publicly available discriminatory-word lexicons for use in sentiment analysis. Twitrratr (http://twitrratr.com) provides an opinion tracking service of public sentiments for Twitter sentiment analysis.

2.3.2 Relationship-based approach

In this approach, the classification task is approached from the different relationships that may exist between features and components. Such relationships include relationships between discourse participants and relationships between product features. For instance, if one wants to know the sentiment of customers about a product brand, one may compute it as a function of the sentiments on its different features or components.

2.3.3 Language models approach

In this approach, classification is done by building n-gram language models. A gram is a token or lexical item taken into consideration for training and classification; an n-gram is a sequence of n such chosen tokens. Generally, in this approach the frequencies of n-grams are used. In traditional information retrieval and topic-oriented classification, the frequency of n-grams gives good results. The frequency is converted to TF-IDF to take a term’s importance in the document to be classified into account. In a study, [3] show that in the sentiment classification of movie review blogs, term presence gives better results than term frequency. They indicate that uni-gram presence is more suited for sentiment analysis. But later, ([3], [5]) found that bi-grams and tri-grams worked better than uni-grams in a study of sentiment classification of product reviews.

Note: A feature, in terms of sentiment analysis, is the chosen set of words used for training in the supervised learning technique, i.e.
the “word dictionary” which is fed into the classifier for the classification of sentiments in the incorporated text. TF-IDF stands for term frequency-inverse document frequency: a term’s frequency in a document, weighted by how rare the term is across documents.

2.3.4 Discourse structures and semantics approach

This approach is dominant in applications where prior definition of the classes is not possible. When text is encountered, it is classified into the best category it fits (in the context of its objective). Based on the similarity of the semantics of words in the text, they are grouped together and tagged into classes. For example, in reviews the overall sentiment is usually expressed at the end of the text [2]. As a result, the approach in this case might be discourse-driven, in which the sentiment of the whole review is obtained as a function of the sentiment of the different discourse components in the review and the discourse relations that exist between them. For instance, the sentiment of a paragraph at the end of the review might be given more weight in determining the sentiment of the whole review. Semantics can be used for role identification of agents where there is a need to do so. For example, “India beat Australia” is different from “Australia beat India”.

2.4 Twitter specific approaches

The main difference between sentiment analysis of Twitter and of documents is that Twitter-based approaches are more specific towards determining the polarity of words (mainly adjectives), whereas document-based approaches are specific towards the task of determining features in the text. There are three major approaches for Twitter-specific sentiment analysis:

• Lexical analysis approach
• Machine learning approach
• Hybrid approach

Using one or a combination of the different approaches, one can employ one or a combination of lexical and machine learning techniques.
Specifically, one can use unsupervised techniques, supervised techniques or a combination of them. We first review the lexical approaches, which focus on building successful dictionaries, then the machine learning approaches, which are primarily concerned with feature vectors, and finally a combination of both, i.e. the hybrid approach.

2.4.1 Lexical analysis approach

A lexical approach typically utilizes a dictionary or lexicon of pre-tagged words. Each word that is present in a text is compared against the dictionary. If a word is present in the dictionary, then its polarity value is added to the “total polarity score” of the text. For example, if a match has been found with the word “excellent”, which is annotated in the dictionary as positive, then the total polarity score of the text is increased. If the total polarity score of a text is positive, then that text is classified as positive; otherwise it is classified as negative. Although naive in nature, many variants of this lexical approach have been reported to perform well ([7], [11], [8]). Since the classification of a statement depends upon the score it receives, there is a large volume of work devoted to discovering which lexical information works best. As a starting point for the field, [9] demonstrated that the subjectivity of an evaluative sentence could be determined through the use of a hand-tagged lexicon comprised solely of adjectives. They report over 80 percent accuracy on single phrases. Extending this work, [7] utilized the same methodology and hand-tagged adjective lexicon as [9], but tested the paradigm on a dataset composed of movie reviews. They reported a much lower accuracy rate of about 62 percent. Moving away from hand-tagged lexicons, Turney [2] utilized an Internet search engine to determine the polarity of words that would be included in the lexicon [7].
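The dictionary-scoring scheme described above can be sketched in a few lines of Python. The mini-lexicon and its polarity values here are hypothetical, for illustration only; a real system would use a large pre-tagged lexicon such as SentiWordNet.

```python
# Minimal sketch of the lexical (dictionary-scoring) approach.
# The lexicon below is a hypothetical toy example.
LEXICON = {
    "excellent": 1.0, "good": 0.8, "great": 0.9,
    "bad": -0.8, "terrible": -1.0, "poor": -0.7,
}

def lexical_polarity(text):
    """Sum the polarity of every lexicon word found in the text;
    a positive total score means the text is classified as positive."""
    score = sum(LEXICON.get(word, 0.0) for word in text.lower().split())
    return "positive" if score > 0 else "negative"

print(lexical_polarity("The movie was excellent , acting was great"))  # positive
print(lexical_polarity("the service was terrible and poor"))           # negative
```

Note that, as in the scheme described in the text, words absent from the dictionary contribute nothing to the score, which is why lexicon coverage largely determines the accuracy of this family of classifiers.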
Turney [2] performed two AltaVista search queries: one with a target word conjoined with the word “good”, and a second with the target word conjoined with the word “bad”. (AltaVista, in.altavista.com, was previously a popular internet search engine, which later lost ground to Google.) The polarity of the target word was determined by the search that returned the most hits. This approach improved accuracy to 65 percent. In a study, [8] and [10] chose to use the WordNet database (wordnet.princeton.edu, a lexical database for the English language that groups words into sets of synonyms called synsets, provides short, general definitions, and records the semantic relations between these synonym sets) to determine the polarity of words. They compared a target word to two pivot words (usually “good” and “bad”) to find the minimum path distance between the target word and the pivot words in the WordNet hierarchy. The minimum path distance was converted to an incremental score, and this value was stored with the word in the dictionary. The reported accuracy of this approach was 64 percent [10]. An alternative to the WordNet metric, proposed by [11], was to compute the semantic orientation of a word [7]. By subtracting a word’s association strength to a set of negative words from its association strength to a set of positive words, [11] were able to achieve an accuracy rate of 82 percent using two different semantic orientation statistical metrics. Figure 2.1 shows the generic architecture of a lexical approach classifier.

Figure 2.1: Generic architecture of a lexical approach classifier.

2.4.2 Machine Learning approach

The other main avenue of research within this area has utilized supervised machine learning techniques.
Within the machine learning approach, a series of feature vectors is chosen and a collection of tagged corpora is provided for training a classifier, which can then be applied to an untagged corpus of text. In a machine learning approach, the selection of features is crucial to the success rate of the classification. Most commonly, a variety of unigrams (single words from a document) or n-grams (two or more words from a document in sequential order) are chosen as feature vectors. Other proposed features include the number of positive words, the number of negating words, and the length of a document. Support Vector Machines (SVMs) ([14], [15]) and the Naive Bayes algorithm [16] are the most commonly employed classification techniques. The reported classification accuracy ranges between 63 percent and 84 percent, but these results depend upon the features selected. Since the majority of sentiment analysis approaches use machine learning techniques, the features of text are often represented as feature vectors. The following features are commonly used in sentiment analysis.

• TF-IDF (Term Frequency–Inverse Document Frequency): as discussed in section 2.3.4, this weights the counts of the "terms" or "words" being taken into account. These terms may be uni-grams, bi-grams or higher-order n-grams. Which of these yields better results is still an open question: [4] claim the superiority of uni-grams over bi-grams in movie review sentiment analysis, whereas [5] argue that "bi-grams and tri-grams give better results" on the basis of their product review classification analysis.

• POS (Part-Of-Speech) tags: English is by nature an ambiguous language; one particular word may have more than one meaning depending upon its context of use. POS tags are used to disambiguate word sense, which in turn is used to guide feature selection [3].
For instance, these tags can be used to single out adjectives and adverbs, since they generally serve as sentiment indicators [2]. It was later realised, however, that adjectives perform worse than the same number of uni-grams chosen on the basis of frequency.

• Syntax and negation: The use of collocations and other syntactic features can improve performance. In the classification of short texts, algorithms using syntactic features and algorithms using n-gram features were reported to yield the same performance [3].

Figure 2.2 shows the generic architecture of a machine learning approach.

Figure 2.2: Generic architecture of a machine learning approach classifier.

2.4.3 Hybrid approach

Some approaches use a combination of the other approaches. One such combined approach is that of [17]. They start with two word lexicons and unlabelled data. With the two discriminatory-word lexicons (negative and positive), they create pseudo-documents containing all the words of the chosen lexicon. They then compute the cosine similarity between these pseudo-documents and the unlabelled documents. Based on the cosine similarity, a document is assigned either positive or negative sentiment. These labelled documents are then used to train a Naive Bayes classifier. In a similar combined approach, [18] defined "a unified framework" that allows one to use background lexical information in terms of word-class associations, and to refine this information for specific domains using any available training examples. They proposed and used a different approach which they called the pooling multinomial classifier, a variant of Multinomial Naive Bayes (based on the multinomial distribution). Unlike [17], they used manually labelled data for training. They report that their experiments reveal that the incorporation of lexical knowledge improves performance.
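As a toy illustration of the pseudo-document labelling step of [17]: a document is compared against a positive and a negative pseudo-document by cosine similarity and assigned the nearer sentiment. The miniature lexicons below are hypothetical stand-ins for the real word lists.

```python
import math

def bow(text):
    """Bag-of-words term counts for a text."""
    counts = {}
    for tok in text.lower().split():
        counts[tok] = counts.get(tok, 0) + 1
    return counts

def cosine(a, b):
    """Cosine similarity between two bag-of-words count dicts."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical miniature lexicons turned into pseudo-documents.
POS_PSEUDO = bow("good great excellent happy love")
NEG_PSEUDO = bow("bad poor terrible sad hate")

def pseudo_label(text):
    """Assign the sentiment of the nearer pseudo-document."""
    doc = bow(text)
    return ("positive" if cosine(doc, POS_PSEUDO) >= cosine(doc, NEG_PSEUDO)
            else "negative")

print(pseudo_label("what a great and happy day"))  # positive
```

Documents labelled this way can then serve as (noisy) training data for a Naive Bayes classifier, which is the second stage of the combined approach.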
They obtained better performance with their approach than approaches using lexical knowledge or training data in isolation, or other approaches using combined techniques. There are also combined approaches that are complementary, in that different classifiers are used in such a way that one classifier contributes to another [19].

2.5 Performance review

Sentiment analysis has been applied to a variety of applications such as movie reviews, blog reviews, etc. In their study, [20] evaluated the performance of the earlier discussed approaches on a news and movie review application for Twitter. For this purpose they employed the following resources.

1. Cornell Movie Review dataset of tagged blogs (1000 positive and 1000 negative) [2].
2. List of 2000 positive words and 2000 negative words from the General Inquirer lists of adjectives [12].
3. Yahoo! Web Search API [13].
4. Porter Stemmer [21].
5. WordNet Java API [22].
6. Stanford Log Linear POS Tagger built with the Penn Treebank tag set [23].
7. WEKA Machine Learning Java API (only used for machine learning) [24].
8. SVM-Light Machine Learning Implementation [25].

The performance of the various approaches reported by [20] is as follows.

2.5.1 Lexical approach performance

The performance of the lexical approach variants is shown in table 2.1.

Table 2.1: Performance of lexical approach variants [20].

    Approach                  Accuracy
    Baseline                  50.0
    Baseline + Stemming       50.2
    Baseline + Yahoo! words   57.7
    Baseline + WordNet        60.4
    Baseline + all            55.7

Combining all three variants (i.e. stemming, Yahoo! words, and WordNet) leads to a surprising drop in accuracy. By simultaneously increasing the number of words in the dictionary, applying a stemming algorithm, and assigning a weight to each dictionary word, they substantially increased the size of the dictionary.
The proportion of positive and negative token matches did not change; they hypothesize that for each positive match found, it was equally likely that a negative match would be found, creating a neutral net effect. Given these results, it appears that the selection of words included in the dictionary is very important for the lexical approach. If the dictionary is too sparse or too exhaustive, one risks over- or under-analyzing the results, leading to a decrease in performance. In accordance with previous findings, these results confirm that it is difficult to surpass the 65 percent accuracy level using a purely lexical approach.

2.5.2 Machine Learning approach performance

During the initial experimentation phase of the machine learning approach, [20] used the WEKA package [24] to obtain a general idea of ML algorithms. Their preliminary experiments indicated that the SVM and Naive Bayes algorithms are the most accurate, with closely matched results.

Table 2.2: Performance of machine learning approach variants [20].

    Approach                      SVM-Light   Naive Bayes
    Unigram integer               77.4        77.1
    Unigram binary                77.0        75.5
    Unigram integer + aggregate   68.2        77.3
    Unigram binary + aggregate    65.4        77.5

The Naive Bayes approach performs quite well in all variants (delivering greater than 75 percent accuracy). The unigram results classified by the SVM algorithm are at a similar level. After initially running the SVM-Light experiments, [20] observed that the confusion matrix results were heavily skewed to the negative side for all the feature representation vectors. To correct this imbalance, they tried introducing a threshold variable into the results. This threshold variable was set to -0.2, and whenever a blog was assigned a value greater than -0.2 it was classified as positive. Although this modification did succeed in evening out the confusion matrix, it did not increase the overall accuracy of the results.
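The threshold correction described above amounts to shifting the decision boundary of the classifier. A minimal sketch, assuming the raw SVM decision values are available as a list of floats:

```python
def classify_with_threshold(decision_values, threshold=-0.2):
    """Re-label raw decision values: anything above the shifted
    threshold counts as positive, countering a negative skew.
    The default of -0.2 is the value reported in [20]."""
    return ["positive" if v > threshold else "negative"
            for v in decision_values]

print(classify_with_threshold([-0.5, -0.1, 0.3]))
# ['negative', 'positive', 'positive']
```

With the default threshold of 0.0, the middle value (-0.1) would have been labelled negative; shifting the boundary moves such borderline cases to the positive class, which evens out the confusion matrix without necessarily improving overall accuracy.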
The results from tables 2.1 and 2.2 clearly indicate the superiority of machine learning approaches: even the worst of the ML results is better than the best of the lexical results. It seems that the lexical approaches rely too heavily on semantic information. As both the lexical and ML approaches demonstrated, the inclusion of "dictionary" information in an experiment does not automatically increase a method's performance. Even though the ML results are superior, one must not forget that for an ML approach to be successful, a large corpus of tagged training data must first be collected and annotated, which can be a challenging and expensive task.

2.5.3 Hybrid approach performance

Finally, a few researchers have proposed hybrid approaches for analyzing Twitter sentiments. For example, [26] and [17] propose a method to automatically create a training corpus using microblog-specific features like emoticons, which is subsequently used to train a classifier. Distant supervision is used by [27] to acquire sentiment data: they treat tweets ending in positive emoticons like [":)", ":-)", ":D"] as positive and those ending in [":(", ":-("] as negative. They built models using Naive Bayes, Maximum Entropy and Support Vector Machines (SVMs), and report that SVM outperforms the other classifiers. In terms of feature space, they try uni-gram and bi-gram models in conjunction with POS features, and note that the unigram model outperforms all other models; specifically, bigrams and POS features do not help. Table 2.3 shows the performance of hybrid approach variants.

2.6 Chapter conclusion

It is clear from the analysis of the literature that machine learning approaches have so far proved outstanding in delivering accurate results. Nevertheless, each approach has an edge depending upon the area of application.
With a small number of features, the lexical analysis technique remains respectable, whereas with large feature sets the machine learning technique dominates.

Table 2.3: Performance of hybrid approach variants.

    Approach                                                          Accuracy
    SVM and NB classifiers trained with Usenet newsgroups data [19]   70.0
    Class-two NB classifier trained with unlabelled data [17]         64.0
    Class-two NB classifier trained with Twitter data [18]            84.0

Both approaches have pros and cons. Lexical analysis is a ready-to-use technique that does not require any prior classification or training on datasets; it can be applied directly to live data, provided that the feature set is large. In the machine learning technique, by contrast, the classifier first needs to be fed or "trained" with raw datasets and tuned to cluster the sentiments into pre-defined classes; it then works efficiently on large texts with large feature support, but few features lead to low accuracy. In this chapter we presented a detailed literature review of the existing approaches and techniques. The next chapter describes the detailed design and analysis of the proposed hybrid naive bayes approach.

Chapter 3
Design and Analysis of Proposed Approach

We propose a hybrid approach for analysing the sentiments of Twitter users, drawing our inspiration from nature's inter-species hybridization of living organisms. The implementation methodology and the experimental setup are discussed in chapter 4.

3.1 Problem Statement

3.1.1 Problem Introduction

The field of sentiment analysis has always been a challenging area of research. Applying sentiment analysis to Open Social Networks (OSNs) like Twitter, MySpace and Facebook is a current trend in research. Millions of people use microblogging services like Twitter to express their views and opinions on day-to-day affairs.
Machine learning approaches have so far proved to outperform lexical approaches in terms of accuracy; however, the speed of lexical approaches is noteworthy. Almost all approaches aimed at classifying Twitter sentiments have so far fallen prey to the time-versus-performance dilemma. In the literature survey we summarized machine learning approaches yielding accuracy of up to 80 percent. We firmly believe that by hybridizing the lexical and machine learning approaches, results of up to 90 percent or more can be realised (see chapter 5).

3.1.2 Problem Definition

To propose a hybrid approach yielding competitive results, by hybridizing machine learning and lexical approaches, which captures and analyses the sentiments of users in an open social network like Twitter for exploring public opinion.

3.2 Proposed approach: Hybrid Naive Bayes

The respective strengths of the lexical approach (its speed) and the machine learning approach (its accuracy) are well known. A lexical approach is fast because of the predefined features (e.g. a dictionary) it employs for extracting sentiments: having a dictionary to refer to at runtime drastically reduces time consumption. To improve the accuracy of a lexical approach, however, the feature set has to be increased drastically, i.e. a very large dictionary of a variety of words with their frequencies has to be provided at runtime. This increases the overhead of the system and hence performance suffers; there is thus a constant trade-off between performance and time. On the other hand, machine learning approaches recursively learn and tune their features; given large input datasets, this improves accuracy well beyond what any lexical approach can achieve. However, because of this runtime tuning and learning, the system suffers a drastic increase in processing time.
Our goal is to propose an approach that combines the lexical and machine learning approaches and thereby exploits the best features of both. For this purpose we employ a Naive Bayes classifier and empower it with the English lexical dictionary SentiWordNet (refer chapter 4, section 1). Our hybrid naive bayes follows the customary four steps, namely data collection, preprocessing, training the classifier, and classification, shown in fig. 3.1. In the following sections we discuss each step in detail, one at a time. Figure 3.2 shows the system architecture of our proposed approach; the labels Phase I and Phase II shown in figure 3.2 are the deployment phases, which are discussed in the next chapter.

Figure 3.1: Process steps followed by hybrid naive bayes.

3.2.1 Data collection: Twitter API

For classification and for training the classifier we need Twitter data, for which we make use of the APIs Twitter provides. Twitter provides two APIs: the Streaming API1 and the REST API2. The difference between them is that the Streaming API supports long-lived connections and provides data in near real-time, whereas the REST APIs support short-lived connections and are rate-limited (one can download a certain amount of data [~150 tweets per hour] per day, but no more). The REST APIs allow access to Twitter data such as status updates and user info regardless of time; however, Twitter does not make data older than about a week available, so REST access is limited to data tweeted within roughly the last week. Thus, while the REST API allows access to these accumulated data, the Streaming API enables access to data as it is being tweeted.

1 Twitter Stream API [https://dev.twitter.com/docs/streaming-apis]: Twitter gives developers and researchers easy access to realtime Twitter data through its Stream API, subject to given constraints on API rate and other privacy policies.
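As an illustrative sketch only, a search-style GET request can be assembled from the notable arguments of the search API (q, lang, rpp, page); the endpoint shown is assumed from the then-current v1 search interface, since the exact URL is not reproduced in this report.

```python
from urllib.parse import urlencode

# Assumed historical v1 search endpoint (illustrative only; this
# endpoint has since been retired by Twitter).
BASE = "http://search.twitter.com/search.json"

def search_url(q, lang="en", rpp=100, page=1):
    """Build a search request URL returning `rpp` JSON-formatted
    results per page for the query string `q`."""
    return BASE + "?" + urlencode({"q": q, "lang": lang,
                                   "rpp": rpp, "page": page})

print(search_url("stuff"))
```

For example, `search_url("stuff")` would correspond to a request for 100 posts containing the word "stuff", formatted as JSON.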
2 http://dev.Twitter.com/do

Figure 3.2: System architecture of hybrid naive bayes approach.

The search API provides users the ability to access Twitter's search functionality. It uses GET requests and returns results formatted as ATOM or JSON; JSON is recommended for its compactness. A maximum of 100 results are returned per page, for a maximum of 15 pages. Search requests take the form of a GET request with the following notable arguments.

• q: The query string (required).
• lang: Restricts results to the specified language, using its ISO 639-1 code.
• rpp: Results to return per page.
• page: Result page number to return.

An example is a URL request which returns 100 posts containing the word "stuff", formatted using JSON. Next we follow a preprocessing step, which is essential for the tweets/text acquired in this step.

3.2.2 Preprocessing

The tweets gathered from Twitter are a mixture of URLs and other non-sentimental data such as hashtags ("#"), annotations ("@") and retweets ("RT"). To obtain n-gram features, we first have to tokenize the text input; tweets pose a problem for standard tokenizers designed for formal, regular text. Figure 3.3 displays the various intermediate preprocessing steps; these are the features to be taken account of by the classifier.

Figure 3.3: A sequence of intermediate preprocessing steps taking place at this level.

We discuss each feature deployed in brief.

• Language detection: Since we are mainly interested in English text, all tweets are separated into English and non-English data. This is possible using NLTK's language detection feature.

• Tokenize: For a sample input text, say "Today the weather is sunny and beautiful", tokenizers divide strings into lists of substrings known as tokens.
Tokenizing the text makes it easy to separate out unnecessary symbols and punctuation and to filter out only those words that can add value to the sentiment polarity score of the text.

• Constructing n-grams: We make a set of n-grams out of consecutive words. A negation (such as "no" or "not") is attached to the word which precedes or follows it. For example, the sentence "I do not like fish" forms the bigrams "I do+not", "do+not like" and "not+like fish". Such a procedure improves the accuracy of the classification, since negation plays a special role in opinion and sentiment expression [28].

• Stop words: In information retrieval, it is a common tactic to ignore very common words such as "a", "an", "the", etc., since their appearance in a post does not provide any useful information for classifying a document. Since the query term itself should not be used to determine the sentiment of the post with respect to it, every query term is replaced with a QUERY keyword. Although this makes it somewhat of a stop word, it can still be useful when not using a bag-of-words model and the location of the query in relation to other words becomes important.

• Strip smileys: Many microblogging posts use emoticons to convey emotion, making them very useful for sentiment analysis. A range of about 30 emoticons, including ":)", ":(", ":D", "=]", ":]", "=)", "=[", "=(", are replaced with either a SMILE or a FROWN keyword. In addition, variations of laughter such as "haha" or "ahahaha" are all replaced with a single LAUGH keyword.

• Erratic casing: To address the problem of posts containing varied casings of words (e.g. "HeLLo"), we sanitize the input by lower-casing all words, which provides some consistency in the lexicon.

• Punctuation: In microblogging posts, it is common to use excessive punctuation in order to sidestep proper grammar and to convey emotion.
By identifying a series of exclamation marks, or a combination of exclamation and question marks, before removing all punctuation, relevant features are retained while more consistency is maintained.

3.2.3 Training Data

To label the text precisely into the respective classes and thus achieve the highest possible accuracy, we plan to train the classifier using pre-labelled Twitter data itself. Pre-labelled Twitter training data is not freely available, since this year Twitter changed its data privacy policies and no longer allows open/free sharing of Twitter content. However, Twitter does state that using or downloading its content for individual research purposes is acceptable.

Labelled data. Since we do not have direct access to pre-labelled Twitter data, we planned to crawl it ourselves. We crawled datasets of various sizes, with various keywords, totalling approximately 4 million tweets, using a custom Python-scripted crawler (refer chapter 4). Data obtained in this way is certainly not labelled, so to address this issue we crawl Twitter and form two different datasets: the first consisting of all the positive-sentiment tweets, i.e. those containing [":)", ":-)", ":-D", ":D", "B-)"], and the second consisting of all the negative-sentiment tweets, i.e. those containing [":(", ":-(", ":'(", "X(", "X-("]. We then feed these datasets to the classifier for training; they function almost like the hand-labelled datasets used in other sentiment analysis domains.

3.2.4 Sentiment Analysis - The classifier

Naive Bayes was our first choice, based on the inference from the literature review carried out in chapter 2. Naive Bayes is an algorithm based on the Bayesian probability model. In general, all Bayesian models are derivatives of the well-known Bayes rule, which states that the probability of a hypothesis given certain evidence, i.e.
the posterior probability of the hypothesis, can be obtained in terms of the prior probability of the evidence, the prior probability of the hypothesis, and the conditional probability of the evidence given the hypothesis. Mathematically,

    P(H|E) = P(H) P(E|H) / P(E)                                   (3.1)

where
P(H|E) - posterior probability of the hypothesis,
P(H)   - prior probability of the hypothesis,
P(E)   - prior probability of the evidence,
P(E|H) - conditional probability of the evidence given the hypothesis.

Or, in a simpler form:

    Posterior = (Prior × Likelihood) / Evidence                   (3.2)

To explain the concept, let us take an example. Suppose we have a new tweet to be classified into either the positive or the negative class, and that among the previously classified tweets the positive tweets are twice as numerous as the negative ones. Since the new tweet's class is not known, the problem is to estimate correctly the class into which the tweet should be categorised. This can be found with Bayes rule, by calculating the probabilities of the tweet being positive or negative. Hence, from eq. 3.1 we have:

    P(n|p) = P(n) P(p|n) / P(p)                                   (3.3)

Since there are twice as many positive tweets as negative ones, it is reasonable to believe that a new case (which has not been observed yet) is twice as likely to belong to the positive class rather than the negative one. In Bayesian analysis, this belief is known as the prior probability. Prior probabilities are based on previous experience - in this case the percentages of positive and negative tweets - and are often used to predict outcomes before they actually happen. Thus we can write:

    Prior probability of a positive tweet: P(p) = (No. of positive tweets) / (Total no. of tweets)
    Prior probability of a negative tweet: P(n) = (No. of negative tweets) / (Total no. of tweets)

Let there be a total of 6k tweets (where k = 10^3), 4k of which are positive and 2k negative. Our prior probabilities for class membership are:

    Prior probability of a positive tweet: P(p) = 4k/6k = 4/6 = 2/3
    Prior probability of a negative tweet: P(n) = 2k/6k = 2/6 = 1/3

The likelihood of the tweet falling into either of the classes is taken as equal, since we have only two classes; so the likelihood of X = 0.5. Calculating the posterior probability of the new tweet, say X, being positive or negative gives:

• Posterior probability of X being positive = (prior probability of positive) × (likelihood of X being positive) = 2/3 × 1/2 = 1/3 ≈ 33.34 percent chance of X being positive.
• Posterior probability of X being negative = (prior probability of negative) × (likelihood of X being negative) = 1/3 × 1/2 = 1/6 ≈ 16.67 percent chance of X being negative.

Thus this tweet falls into the positive class. In our case we have two hypotheses and many other features, on the basis of which the class with the highest probability is chosen for the tweet whose sentiment is being predicted. After every classification step, all the probabilities are recalculated and updated accordingly.

Chapter 4
Implementation Methodology

In this chapter the experimental setup and evaluation methodology for the proposed approach are discussed.

4.1 Experimental Setup

To test the proposed approach, we created a setup with the following system requirements. We tested our approach on both Linux and Windows platforms: a Dell Optiplex 980 Windows 7 Core-i5 (64-bit) machine equipped with 4 GB of RAM, and a Linux server system with a quad-core processor equipped with 8 GB of RAM. The general requirements are shown in table 4.1.

Table 4.1: General system requirements for our approach.

    Component                        Type
    Operating system                 Windows XP/7/8, Linux (Ubuntu Server 12.04)
    Processor                        C2D/i3/i5/i7 (32/64 bit)
    Min. memory (RAM)                ≥ 4 GB
    Min. storage                     20 GB
    Bandwidth                        Uninterrupted high-speed Internet (1 Mbps connection)
    Software and third-party tools   Linked Media Framework 2.3.5, NLTK 2.0, and SentiWordNet 3.0

The tools and technology used are as follows.

• Python 2.7 (implementation language): Python is a general-purpose, interpreted, high-level programming language whose design philosophy emphasizes code readability. Its syntax is clear and expressive, and Python has a large and comprehensive standard library with more than 25 thousand extension modules. We use Python for developing the backend of the test application crawler; this and the other modules implemented are discussed later.

• NLTK (language processing modules and validation): The Natural Language Toolkit (NLTK) is an open-source module for processing human language in Python, created in 2001 as part of a computational linguistics course in the Department of Computer and Information Science at the University of Pennsylvania. NLTK provides inbuilt support and easy-to-use interfaces for over 50 corpora and lexical resources. NLTK was designed with four goals in mind:

1. Simplicity: Provide an intuitive framework along with substantial building blocks, giving users practical knowledge of NLP without getting bogged down in the tedious housekeeping usually associated with processing annotated language data.
2. Consistency: Provide a uniform framework with consistent interfaces and data structures, and easily guessable method names.
3. Extensibility: Provide a structure into which new software modules can easily be accommodated, including alternative implementations and competing approaches to the same task.
4. Modularity: Provide components that can be used independently, without needing to understand the rest of the toolkit.

• LMF (persistent database): the Linked Media Framework. Unlike other database applications, this application requires a persistent database.
This is because we have to query Twitter in realtime and collect a certain number of tweets, so the database connection should be open all the time. This feature is only available in some database server applications, namely Google App Engine1 and the Linked Media Framework (LMF)2. The core component of the Linked Media Framework is a Linked Data Server that allows data to be exposed following the Linked Data Principles:

– Use URIs as names for things.
– Use HTTP URIs, so that people can look up those names.
– When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
– Include links to other URIs, so that they can discover more things.

The Linked Data Server implemented as part of the LMF goes beyond the Linked Data principles by extending them with Linked Data Updates and by integrating management of metadata and content, making both accessible in a uniform way. These extensions are described in more detail in LinkedMediaPrinciples. In addition to the Linked Data Server, the LMF Core also offers a highly configurable Semantic Search service and a SPARQL endpoint. Setting up and using the Semantic Search component is described in SemanticSearch; accessing the SPARQL endpoint is described in SPARQLEndpoint. Whereas the extension of the Linked Data principles is already conceptually well described, a proper specification and extension of Semantic Search and the SPARQL endpoint for Linked Data servers is still a work in progress. LMF consists of several modules, some of them optional, that extend the functionality of the Linked Media Server:

– LMF Semantic Search: offers a highly configurable Semantic Search service based on Apache SOLR. Several semantic search indexes can be configured in the same LMF instance.
– LMF Linked Data Cache: implements a cache for the Linked Data Cloud that is transparently used when querying the content of the LMF using either LDPath, SPARQL (to some extent) or the Semantic Search component. In case a local resource links to a remote resource in the Linked Data Cloud and this relationship is queried, the remote resource will be retrieved in the background and cached locally.
– LMF Reasoner: implements a rule-based reasoner that allows processing of Datalog-style rules over RDF triples; the LMF Reasoner will be based on the reasoning component developed in the KiWi project, the predecessor of the LMF.
– LMF Text Classification: provides basic statistical text classification services; multiple classifiers can be created, trained with sample data and used to classify texts into categories.
– LMF Versioning: implements versioning of metadata updates; the module allows getting metadata snapshots of a resource at any time in its history and provides an implementation of the Memento protocol.
– LMF Stanbol Integration: allows integrating with Apache Stanbol for content analysis and interlinking; the LMF provides some automatic configuration of Stanbol for common tasks.

1 Google App Engine: Google App Engine (often referred to as GAE or simply App Engine) is a platform-as-a-service (PaaS) cloud computing platform for developing and hosting web applications in Google-managed data centers. Applications are sandboxed and run across multiple servers. App Engine offers automatic scaling for web applications: as the number of requests for an application increases, App Engine automatically allocates more resources for the web application to handle the additional demand.
2 LMF: The Linked Media Framework is an easy-to-setup server application that bundles together some key open source projects to offer advanced services for linked media management.
• SentiWordNet 3.0: SentiWordNet is a lexical resource for opinion mining. It assigns to each synset of WordNet three sentiment scores: positivity, negativity and objectivity. WordNet groups English words into sets of synonyms called "synsets", provides short, general definitions, and records the various semantic relations between these synonym sets. The purpose is twofold: to produce a combination of dictionary and thesaurus that is more intuitively usable, and to support automatic text analysis and artificial intelligence applications. The database and software tools have been released under a BSD-style license and can be downloaded and used freely; the database can also be browsed online3. SentiWordNet is the result of research carried out by Andrea Esuli and Fabrizio Sebastiani. Given that the sum of the opinion-related scores assigned to a synset is always 1.0, it is possible to display these values in a triangle whose vertices are the maximum possible values of the three dimensions observed. Figure 4.1 shows the graphical model designed to display the scores of a synset; this model is used in the Web-based graphical user interface through which SentiWordNet can be freely accessed at their website4.

3 SentiWordNet available at: http://sentiwordnet.isti.cnr.it

Figure 4.1: Polarity triangle of a synset in SentiWordNet [courtesy: http://sentiwordnet.isti.cnr.it].

4.2 Evaluation

Accuracy is the ability of the classifier to correctly classify and label new tweets into their respective classes. To measure it, we use the Natural Language Toolkit (NLTK). The classifier is first run on the crawled Twitter data, classifying tweets for the keywords decided previously, and is thus trained. Thereafter, the classifier is run through NLTK's accuracy analysis function, which tests it on its own corpora. Thus, we obtain validated results.
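The accuracy measure used in this evaluation is simply the percentage of tweets whose predicted label matches the true label; a minimal sketch:

```python
def accuracy(predicted, actual):
    """Percentage of tweets whose predicted class matches the true class."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

# Toy example: 3 of 4 labels match.
print(accuracy(["pos", "neg", "pos", "pos"],
               ["pos", "neg", "neg", "pos"]))  # 75.0
```

This is the same quantity NLTK's accuracy analysis function computes, reported here on a 0-100 scale to match the result tables in chapter 5.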
4.3 Test Application

The application "TSA" (Tweet Sentiment Analyzer) has been deployed in two phases (refer chapter 3), namely:

• Phase I: Deploying the TSA (alpha) system with the base naive Bayes classifier and harvesting analyzer results.

• Phase II: Hybridizing the naive Bayes classifier (beta) by embedding the SentiWordNet 3.0 lexicon dictionary (software available at http://patty.isti.cnr.it/esuli/software/SentiWordNet). The analyzer is run through NLTK tests and the results are extracted. Later, LMF 2.3.5 is integrated for analyzing real-time Twitter data on the fly.

Figure 4.2 shows the class diagram of the TSA application, generated using the PyNSource tool. PyNSource can convert Python source into Java files (just the class declarations, not the body of the code), so the resulting Java files can be imported into a more sophisticated UML modelling tool that understands Java. There the UML can be auto-laid-out, and the resulting diagrams printed or captured as screenshots. A script can even automate the Java code generation, so that re-importing into a Java-based UML modelling tool brings in only the changes to the diagram without losing the custom layout. PyNSource can also generate Delphi (Object Pascal) code.

Figure 4.2: Class diagram of Tweet Sentiment Analyzer.

The classes in olive green are the root classes. The other mixed-color classes serve as intermediate functions for the root classes, while the ones in greyish-yellow are the terminal classes.

Chapter 5 Results and Analysis

In this chapter, the results of Phase I and Phase II are presented along with their analysis.

5.1 Results

The results of both the existing classifier and the proposed hybrid classifier are presented and compared in the order of deployment, i.e. Phase I & II (refer chapter 4).
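As context for these results, the Phase II hybridization idea can be sketched as follows: smoothed naive Bayes word likelihoods are combined with a shift contributed by tokens found in a sentiment lexicon. The tiny lexicon, the training pairs, and the weighting factor LEXICON_WEIGHT below are illustrative assumptions, not the actual SentiWordNet data or the exact combination scheme used by the TSA application.

```python
import math
from collections import Counter

# Illustrative polarity lexicon (stand-in for SentiWordNet scores).
LEXICON = {"great": 0.8, "love": 0.7, "terrible": -0.9, "hate": -0.8}
LEXICON_WEIGHT = 1.0  # assumed relative weight of the lexicon component

def train_counts(labelled_tweets):
    """Count word occurrences per class label."""
    counts = {"pos": Counter(), "neg": Counter()}
    for text, label in labelled_tweets:
        counts[label].update(text.lower().split())
    return counts

def hybrid_score(text, counts):
    """Positive score => classify positive, negative => classify negative."""
    tokens = text.lower().split()
    vocab = set(counts["pos"]) | set(counts["neg"])
    score = 0.0
    for tok in tokens:
        # Laplace-smoothed log-likelihood ratio (naive Bayes component).
        p = (counts["pos"][tok] + 1) / (sum(counts["pos"].values()) + len(vocab) + 1)
        n = (counts["neg"][tok] + 1) / (sum(counts["neg"].values()) + len(vocab) + 1)
        score += math.log(p / n)
        # Lexicon component: shift by the prior polarity if the word is known.
        score += LEXICON_WEIGHT * LEXICON.get(tok, 0.0)
    return score

training = [("love this movie", "pos"), ("hate this movie", "neg")]
counts = train_counts(training)
print("pos" if hybrid_score("i love it, great film", counts) > 0 else "neg")
```

The lexicon term is what gives the hybrid classifier useful polarity information "beforehand", even for words that are rare in the training data; shorthand tokens like "2day" miss the lexicon lookup, which matches the accuracy degradation discussed in chapter 5.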
• TSA Phase I: Tests were carried out using multiple Twitter datasets consisting of a mixture of new and old keywords such as "#ironman3", "#amitabhbachhan", "#Google", "#twitter", and "#robertdowneyjr". Our datasets vary in size from a few hundred to a couple of million tweets, with the goal of testing scalability while also ensuring performance. The performance of the base naive Bayes classifier with respect to dataset size is shown in Table 5.1 (K = thousand, M = million).

Table 5.1: Performance of base naive bayes classifier.

Dataset size | 1K    | 10K   | 50K   | 100K  | 1M
Accuracy (%) | 28.54 | 29.57 | 50.77 | 59.96 | 70.02

• TSA Phase II: After integrating the SentiWordNet lexicon dictionary, the same procedure was carried out on the hybrid naive Bayes classifier, and the results in Table 5.2 were harvested. Note that the same datasets were used for both classifiers.

Table 5.2: Performance of hybrid naive bayes classifier.

Dataset size | 1K    | 10K   | 50K   | 100K  | 1M
Accuracy (%) | 63.95 | 97.50 | 98.60 | 95.48 | 94.50

Comparing the results of the base naive Bayes to the hybrid naive Bayes, we can straightforwardly conclude that the proposed hybrid classifier clearly outperforms the earlier approach. The main reason for the latter approach's higher accuracy is the availability of polarity values beforehand. However, it can be observed that, in the case of the hybrid naive Bayes, there is a gradual and constant decline in accuracy as the number of tweets increases above 50K. The main factors behind this phenomenon are:

– Quality of dataset - It is observed that while crawling a large dataset of tweets, periods of crawler inactivity resulting from network failure may lead to a broken or noisy dataset.

– Use of shorthand words - i.e.
people generally tend to write short forms of words: "today" becomes "2day", "that" is written as "dat", and so on. These kinds of mixed literal lexicons, when searched through the dictionary, do not count as a match or hit, thus resulting in degraded classifier accuracy.

• Most Informative Features: After training the naive Bayes classifier with the tweets and comments, we asked the classifier to show the 25 most informative features for recognizing whether a text should be classified as positive or negative. The ratio (neg:pos) or (pos:neg) indicates how much more often the particular word has been used in one sentiment than in the other. For instance, the word "neverknow" has been used 6.2 times more often as a negative sentiment than a positive sentiment in the text (last word in figure 5.2).

Figure 5.1: Comparison of classifier performances, Dataset size (no. of tweets) vs Accuracy (%).

Some snapshots of the working system are presented. Figures 5.3 and 5.4 display the base naive Bayes classifier working on a Windows platform. Figures 5.5 and 5.6 display the hybrid naive Bayes classifier working on a Windows platform. Figures 5.7 and 5.8 display the base naive Bayes classifier working on a Linux server platform.

Figure 5.2: The most informative features of the hybrid naive Bayes classifier.

Figure 5.3: The base naive Bayes classifier in action on a Windows platform with a 50K tweets dataset.

Figure 5.4: Accuracy of the base naive Bayes classifier with a 50K tweets dataset.

Figure 5.5: The hybrid naive Bayes classifier in action on a Windows platform with a 50K tweets dataset.

Figure 5.6: Accuracy of the hybrid naive Bayes classifier with a 50K tweets dataset.

Figure 5.7: The base naive Bayes classifier in action on a Linux server with a 10K tweets dataset.
Figure 5.8: Accuracy of the base naive Bayes classifier on a Linux server with a 10K tweets dataset.

Chapter 6 Conclusion and Future Work

6.1 Conclusion

From the biological hybridization of species in real life, it is well known that the limitations of both species can be overcome. Applying the same attitude in our proposal, the experimental studies performed through chapters 4 and 5 successfully show that hybridizing the existing machine learning and lexical analysis techniques for sentiment classification yields comparatively more accurate results. For all the datasets of 10K tweets or larger, we recorded consistent accuracy of ≥ 90%. Given the success of the hybrid naive Bayes, it can be applied to other related sentiment analysis applications such as financial sentiment analysis (stock market opinion mining), customer feedback services, and so on.

6.2 Future Work

A substantial amount of work remains to be carried out; here we point toward possible future avenues of research.

• Interpreting Sarcasm: The proposed approach is currently incapable of interpreting sarcasm. In general, sarcasm is the use of irony to mock or convey contempt; in the context of the current work, sarcasm transforms the polarity of an apparently positive or negative utterance into its opposite. This limitation can be overcome by an exhaustive study of the fundamentals of "discourse-driven sentiment analysis". The main goal of this approach is to empirically identify lexical and pragmatic factors that distinguish sarcastic, positive, and negative usage of words.

• Multi-lingual support: Due to the lack of a multi-lingual lexical dictionary, it is currently not feasible to develop a multi-language sentiment analyzer. Further research can be carried out in making the classifiers language independent, as shown by [30].
The authors have proposed a sentiment analysis system based on support vector machines; a similar approach can be applied to our system to make it language independent.

Bibliography

[1] M. Bautin, L. Vijayarenu, and S. Skiena, "International sentiment analysis for news and blogs", In Second International Conference on Weblogs and Social Media (ICWSM), 2008.

[2] P. Turney, "Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews", In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 417-424, 2002.

[3] B. Pang and L. Lee, "Opinion mining and sentiment analysis", Foundations and Trends in Information Retrieval, vol. 2, no. 1-2, pp. 1-135, 2008.

[4] B. Pang and L. Lee, "Using very simple statistics for review search: An exploration", In Proceedings of the International Conference on Computational Linguistics (COLING), 2008.

[5] K. Dave, S. Lawrence, and D. Pennock, "Mining the peanut gallery: Opinion extraction and semantic classification of product reviews", pp. 519-528, 2003.

[6] N. Godbole, M. Srinivasaiah, and S. Skiena, "Large-scale sentiment analysis for news and blogs", 2007.

[7] A. Kennedy and D. Inkpen, "Sentiment Classification of Movie and Product Reviews Using Contextual Valence Shifters", Computational Intelligence, pp. 110-125, 2006.

[8] J. Kamps, M. Marx, and R. Mokken, "Using WordNet to Measure Semantic Orientation of Adjectives", LREC 2004, vol. IV, pp. 1115-1118, 2004.

[9] V. Hatzivassiloglou and J. Wiebe, "Effects of Adjective Orientation and Gradability on Sentence Subjectivity", In Proceedings of the 18th International Conference on Computational Linguistics, New Brunswick, NJ, 2000.

[10] A. Andreevskaia, S. Bergler, and M. Urseanu, "All Blogs Are Not Made Equal: Exploring Genre Differences in Sentiment Tagging of Blogs", In International Conference on Weblogs and Social Media (ICWSM-2007), Boulder, CO, 2007.

[11] P. Turney and M.
Littman, "Measuring Praise and Criticism: Inference of Semantic Orientation from Association", ACM Transactions on Information Systems, pp. 315-346, 2003.

[12] P. Stone, J. Dunphy, and D. Smith, "The General Inquirer: A Computer Approach to Content Analysis", MIT Press, Cambridge, 1966.

[13] Yahoo! Search Web Services, Online: http://developer.yahoo.com/search/, 2007.

[14] J. Akshay, "A Framework for Modeling Influence, Opinions and Structure in Social Media", In Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence, Vancouver, BC, pp. 1933-1934, July 2007.

[15] K. Durant and M. Smith, "Mining Sentiment Classification from Political Web Logs", In Proceedings of the Workshop on Web Mining and Web Usage Analysis of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (WebKDD-2006), Philadelphia, 2006.

[16] S. Prasad, "Micro-blogging sentiment analysis using bayesian classification methods", 2010.

[17] A. Pak and P. Paroubek, "Twitter based system: Using twitter for disambiguating sentiment ambiguous adjectives", pp. 436-439, 2010.

[18] B. Liu, X. Li, W. S. Lee, and P. S. Yu, "Text classification by labeling words", In Proceedings of the National Conference on Artificial Intelligence, pp. 425-430, AAAI Press / MIT Press, 2004.

[19] P. Melville, W. Gryc, and R. D. Lawrence, "Sentiment analysis of blogs by combining lexical knowledge with text classification", In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1275-1284, 2009.

[20] R. Prabowo and M. Thelwall, "Sentiment analysis: A combined approach", Journal of Informetrics, vol. 3, no. 2, pp. 143-157, 2009.

[21] M. Annett and G. Kondrak, "A comparison of sentiment analysis techniques: Polarizing movie blogs", pp. 25-35, 2008.

[22] Porter Stemmer (Java implementation), Online: http://tartarus.org/~martin/PorterStemmer/java.txt, 2007.

[23] C. Fellbaum, "WordNet: An Electronic Lexical Database.
Language, Speech, and Communication Series", MIT Press, Cambridge, 1998.

[24] K. Toutanova and C. Manning, "Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger", In Proceedings of EMNLP/VLC-2000, Hong Kong, China, pp. 63-71, 2000.

[25] I. H. Witten and E. Frank, "Data Mining: Practical Machine Learning Tools and Techniques", 2nd edition, Morgan Kaufmann, San Francisco, 2005.

[26] A. Go, R. Bhayani, and L. Huang, "Twitter sentiment classification using distant supervision", Technical report, Stanford, 2009.

[27] J. Read, "Using emoticons to reduce dependency in machine learning techniques for sentiment classification", In Proceedings of the Association for Computational Linguistics (ACL), 2005.

[28] T. Joachims, "Making Large-Scale SVM Learning Practical", In Advances in Kernel Methods - Support Vector Learning, pp. 169-184, MIT Press, Cambridge, 1999.

[29] J. Wiebe, T. Wilson, and C. Cardie, "Annotating expressions of opinions and emotions in language", Language Resources and Evaluation (formerly Computers and the Humanities), vol. 39, no. 2/3, pp. 164-210, 2005.

[30] S. Narr, M. Hülfenhaus, and S. Albayrak, "Language-independent twitter sentiment analysis", The 5th SNA-KDD Workshop, 2011.