
DEGREE PROJECT IN TECHNOLOGY,
FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2016
Method performance difference of sentiment analysis on social media databases
Sentiment classification in social media
HENRIK JOHANSSON
ANTON LILJA
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION
Degree Project in Computer Science, DD143X
Supervisor: Richard Glassey
Examiner: Örjan Ekeberg
CSC, KTH 2016-05-11
Abstract
As the amount of available data has exploded with the increased use of social media, the interest in performing sentiment analysis has grown. However, as the source and nature of the data have changed, it is possible that the known methods will not perform as before. The purpose of this paper is to examine whether such a difference exists and whether the methods can be improved by preprocessing the data. The results show that there is a difference and that, on this new type of data, a lexicon approach may be a better choice than a machine learning based one. Preprocessing the data gives some, but no large, improvements.
Sammanfattning
The explosion of available data brought on by the increased use of social media has raised the interest in performing sentiment analysis. However, since the source and content of the analysed data have changed, it is possible that the methods in use will perform differently. The purpose of this study is to examine whether such a difference exists and whether the accuracy of the methods can be increased by preprocessing the data. The results show that there is a difference and that a lexical analysis may be a better approach than a method based on machine learning. Preprocessing the data shows some, but in this context no large, improvement of the results.
Contents
1 Introduction
  1.1 Problem statement
    1.1.1 Scope
  1.2 Purpose

2 Background
  2.1 Sentiment Analysis
  2.2 Classification
  2.3 Lexicon
  2.4 Machine learning
    2.4.1 Preprocessing
  2.5 Data
    2.5.1 Social media
    2.5.2 Datasets
  2.6 Tools
    2.6.1 NLTK
    2.6.2 Scikit-learn
    2.6.3 AFINN

3 Method
  3.1 Data formatting
    3.1.1 Data requirements
    3.1.2 Formatting
    3.1.3 Preprocessing
  3.2 Testing procedure
    3.2.1 Lexicon
    3.2.2 Machine learning

4 Results

5 Discussion
  5.1 Machine learning
  5.2 Lexicon
  5.3 Implementation
  5.4 Applications and future research

6 Conclusion

Bibliography
Chapter 1
Introduction
In recent years the amount of available data has exploded as a result of the ever increasing use of social media. Today millions upon millions of thoughts, ideas and opinions are shared online. This opens up the possibility of getting a glimpse of what people talk and think about. The data is worthless on its own but can be a goldmine if properly analysed.
Extracting the general feeling, the sentiment, of a text can be interesting for both research and business purposes. For instance, a company may easily track the popularity of its brand, providing feedback on advertisement campaigns. The potential of the field has earned it significant interest over a long time, stretching back further than the use of social media.
Early research focused on other sources of data, most commonly customer reviews online. User reviews have the benefit of often having both a text and a grade, providing the correct answers to a researcher. However, there is a great deal of difference between most online interaction and the text found in a user review. This is reason to question old truths about which algorithms and settings are most effective in classifying data by the feelings it conveys.
1.1 Problem statement
There is no known method to correctly extract the feelings in a text. Even humans sometimes misunderstand each other and fail to accurately pick up the subtle cues which convey the feelings in text. As it is an imprecise field, the accuracy of a method is of high importance.
Previous thesis work on method and algorithm accuracy (Hindersson & Lousseief 2015) suggested the question of which methods achieve higher accuracy depending on the social media dataset. This study aims to answer the question posed, and to be able to answer it better the following problem statement is used:
• Does the accuracy of the machine learning and lexicon based sentiment classification methods differ depending on what source of data is used?
  – If there is a difference in accuracy between the datasets, why is it so?
  – Is it possible to improve the accuracy of the different methods?
1.1.1 Scope
Not every method and social media database can be tested, and some limitations are therefore needed to define the scope. There are many different methods, databases, settings and tunings that could be used to answer the questions in the problem statement. Trying every existing method would be too time consuming, both concerning implementation and runtime. Therefore this study will only try a limited number of databases, methods and settings.
The databases will be limited to already annotated, freely available databases of social media and reviews in English, as we do not have the time or resources to collect and annotate data ourselves. We want to compare social media with already studied data sources, and studying data in other languages might affect the results, as the available tools may not be developed to handle them.
The methods being tested will be some of the most used ones, as the less used (and most likely less accurate) methods have limited value from an industry point of view. Focusing on the most used methods also ensures that tools exist, which simplifies the implementation; this is necessary as we have limited time and experience.
As the focus regarding settings is whether the accuracy of the methods can change, rather than exactly how a maximal result is reached, only a small number of different settings will be explored. The small number will both keep implementation time down and limit the amount of results that need to be analyzed. The settings selected will also be the ones that are simple to implement with the chosen tools.
To summarize, the study will test a limited number of methods and datasets, due to constraints in time and experience. We will strive for a result which reflects what may be used in real applications of sentiment analysis. The study will focus on known methods needing little work to implement.
1.2 Purpose
Previous research on sentiment analysis has often focused on datasets of user reviews (Medhat et al. 2014). However, as social media is exploding in size and importance in people's lives, it is far more relevant to research this data.
The purpose of this study is to examine how some common sentiment analysis methods perform when the source of data is social media. By using state of the art implementations, the results will give a good picture of which algorithms have the greatest potential in this field of sentiment analysis on social media.
As well as answering which algorithms perform well on social media, this paper aims to examine what impact tuning the methods has on the accuracy, and how big that impact is. In combination this gives a starting point for further research into what should be studied and developed to achieve greater accuracy in sentiment analysis on social media.
Chapter 2
Background
This chapter describes the subject of sentiment analysis, how classification is done, how the lexical method works, how the machine learning methods work, what data is used and what the tools are. It serves as support for understanding the results and the discussion.
2.1 Sentiment Analysis
Sentiment analysis is the subject of using natural language processing to extract sentiment out of text. This can be done in several ways, but the most commonly used are lexicon and machine learning based methods (Medhat et al. 2014). The methods' purposes are the same, but they achieve their results in different ways: lexicon based methods use sentiment lexicons, while machine learning based methods train classifiers on data.
2.2 Classification
Splitting data into groups or classes is known as the problem of classification. This is a common problem, and natural language processing is no exception. The methods presented here may also be used in classification of other types of data. There are also a large number of methods not presented here, since they are outperformed in this rather specific field of classifying text.
Classifying data based on the feeling it conveys is known as sentiment classification. It can be used to classify any groups of feelings, but most common is either a binary scale of positive and negative or one including a neutral option (Medhat et al. 2014, p. 1). The two groups of methods that will be examined in this paper are the lexicon based and machine learning based methods. These are the classification methods most commonly used in sentiment classification.
2.3 Lexicon
One method for extracting sentiment from text is a lexical approach with special lexicons. The lexicons used in this approach contain words that have been classified with sentiments. Words such as “bad” or “horrible” might have a negative classification, while words such as “good” or “amazing” might have a positive classification. A scoring function can then assign every word in a text a sentiment (Pang & Lee 2008, p. 27). See figure 2.1 for a general picture of how this might be done.
Figure 2.1. An illustration of how the lexicon based approach classifies text
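A minimal sketch of this idea in Python (the tiny lexicon and its scores below are made up for illustration; real sentiment lexicons such as AFINN contain thousands of manually scored words):

# A tiny, made-up word-score lexicon; real lexicons such as AFINN
# contain thousands of manually scored words.
LEXICON = {"good": 2, "amazing": 4, "bad": -2, "horrible": -3}

def lexicon_sentiment(text):
    """Sum the scores of all known words; unknown words count as zero."""
    return sum(LEXICON.get(word, 0) for word in text.lower().split())

print(lexicon_sentiment("an amazing movie with a good story"))  # 6, positive
print(lexicon_sentiment("horrible plot and bad acting"))        # -5, negative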
2.4 Machine learning
Machine learning can be explained as having the computer learn to recognize patterns by mining data rather than by being programmed with rules (Joachims 1998). It can be used to solve many different problems, classification being one of them. The methods used to solve the classification problem belong to the group known as supervised learning algorithms. They require annotated data to learn from, which might be hard or time consuming to acquire. This is a major drawback of using a machine learning approach. See figure 2.2 for a schematic of the learning process.
One of the simplest machine learning methods is Naive Bayes, which relies on the assumption that all features (typically words, in classification of texts) are independent. This assumption is of course false for most natural language texts, where something like the order of the words plays a huge role. The method has however proved to be quite powerful despite its simplicity (Pang et al. 2002).
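As a rough sketch, NLTK's Naive Bayes classifier can be trained on word-presence features like this (the four annotated examples are made up; a real training set holds thousands of documents):

import nltk

def word_features(text):
    # Each word becomes an independent boolean feature, in line with
    # the independence assumption described above.
    return {word: True for word in text.lower().split()}

# Made-up annotated examples; a real training set holds thousands of documents.
train = [(word_features("a truly great movie"), "pos"),
         (word_features("what a horrible waste of time"), "neg"),
         (word_features("great acting and a great story"), "pos"),
         (word_features("simply horrible and bad"), "neg")]

classifier = nltk.NaiveBayesClassifier.train(train)
print(classifier.classify(word_features("a great story")))  # expected: pos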
Figure 2.2. An illustration of how the machine learning based approach classifies
text.
Another machine learning method is the Support Vector Machine (SVM). The method is based on creating a plane in which the data is placed and then splitting it into a given number of groups or classes, as seen in figure 2.3. The separating vector should have as large a margin as possible to the classes, thereby maximizing the chance that new data will be sorted to the right side of the vector and thus into the right class (Tong & Koller 1998). The method can be extended to handle classification where the data is not linearly separable, giving it an edge over other machine learning algorithms. This is however unlikely to be needed in the realm of classifying natural language, as the data is generally sparse, which suits an SVM approach (Hoschs 2016).
Figure 2.3. H1 fails to classify the training data into the two classes; H2 and H3 do, but H2 has a minimal margin. H3 is therefore used by the SVM, as it has a maximal margin to both classes. Figure: (ZackWeinberg 2012).
The last method to be tested is called decision tree. The method creates a tree which is then used to classify data. Each node in the tree contains a boolean expression regarding some feature of the data. To classify a piece of data it is taken from the root to a leaf, traveling through the nodes of the tree based on the features. Each leaf node contains a class, and the data is classified upon reaching a leaf. The tree is created using a recursive method of replacing the leaf with the lowest probability of successfully classifying data with a tree (Medhat et al. 2014).
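A sketch of how the SVM and decision tree classifiers can be instantiated and trained with scikit-learn, the toolkit presented in section 2.6.2 (the tiny feature matrix and vocabulary are made up for illustration):

from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Each row is a document as word counts over the made-up vocabulary
# ["good", "great", "bad", "horrible"]; labels: 1 = positive, 0 = negative.
X = [[1, 1, 0, 0],
     [0, 2, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 2, 0]]
y = [1, 1, 0, 0]

for classifier in (LinearSVC(), DecisionTreeClassifier()):
    classifier.fit(X, y)
    # Classify a new document containing "great" once and "bad" once.
    print(type(classifier).__name__, classifier.predict([[0, 1, 1, 0]]))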
2.4.1 Preprocessing
The data, in this case text, must be parsed before being fed to the algorithms. There are a number of ways to represent the words as features; the two most common are bag of words and n-grams.

In the bag of words model the features of a document are represented by a set of booleans, where the value of each boolean represents whether a word exists in the document or not. The bag of words model can be improved to include the number of times a word occurs, but it remains a very simple representation, as the order of the words is lost. The n-gram representation does not consider each word on its own but sequences of n consecutive words. This representation can increase the accuracy but, as it is harder to implement, is beyond the scope of this paper (Manning et al. 2009).
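As an illustrative sketch, scikit-learn's CountVectorizer supports both representations: binary=True gives the boolean bag of words, while ngram_range switches to n-grams (the two example sentences are made up):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the movie was good", "the movie was not good"]

# Boolean bag of words: 1 if the word occurs in the document, 0 otherwise.
bow = CountVectorizer(binary=True)
print(bow.fit_transform(docs).toarray())
print(sorted(bow.vocabulary_))

# Bigrams keep some word order: "not good" becomes a feature of its own.
bigrams = CountVectorizer(ngram_range=(2, 2))
print(bigrams.fit_transform(docs).toarray())
print(sorted(bigrams.vocabulary_))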
Not all words in a text have an emotional meaning, and such words are therefore not interesting when doing sentiment analysis. These words are known as stopwords. A lexicon method would ignore these words by assigning them the value zero in its calculations, but a machine learning algorithm may try to find patterns involving them. Preprocessing the data by removing all stopwords may improve the accuracy of machine learning algorithms, as it hinders them from finding rules based on these words, rules which would generate false results as they are based on coincidences rather than real patterns in natural language.
2.5 Data
The subject of sentiment analysis gained speed in the early 2000s with the rise of the internet. The data used in this early history of the subject was mainly reviews collected from websites aggregating them (Pang & Lee 2008, p. 4). The availability and structure of data are important and will be presented in this section.
2.5.1 Social media
With the explosion of available data following the use of social media, the opportunity to mine this data has likewise increased. However, the nature and content of the data are different from other sources of natural language text, such as reviews, news stories, scientific articles or books. The language is simpler and grammatically worse than standard text (Baldwin et al. 2013).
2.5.2 Datasets
Several datasets have been created by different organizations and universities to
be used in research and experimentation in sentiment analysis. Three datasets are
presented here. The datasets are extracted from the micro-blogging service Twitter
and the Internet Movie DataBase (IMDB).
Twitter

Sentiment140
Sentiment140 is a twitter dataset compiled at Stanford university using tweets (1-140 character documents) which have been machine-annotated using an emoticon heuristic (Go et al. 2015). This heuristic assumes that emoticons such as ”:)” indicate a positive sentiment and ”:(” a negative sentiment. The data is divided into a training and a testing part. The training part consists of 1 600 000 tweets, with half annotated as positive and the other half as negative. The test set is 359 tweets, of which 182 are positive and 177 negative.
Sanders analytics twitter corpus
Sanders analytics twitter corpus, which this paper will shorten to ”Sanders” (Sanders 2011), is a small dataset of 5513 tweets sorted into the sentiments ”Positive”, ”Negative”, ”Neutral” and ”Irrelevant”. Approximately 1000 of the tweets in the set are labeled ”Positive” or ”Negative”, which is useful for the purpose of this study. The data was acquired by Sanders Analytics by downloading tweets from specific hashtags using the twitter API. All the tweets are annotated manually by humans.
IMDB

Large Movie Review Dataset
Large Movie Review Dataset 1.0 is a dataset of IMDB (Internet Movie DataBase) reviews with accompanying sentiment (Maas et al. 2011). The dataset consists of 50 000 reviews, where half are intended for training and the other half for testing. Both the training and the testing set are split evenly between positive and negative sentiment. The data is classified using IMDB's review score system, where a negative review has a score of ≤ 4 out of 10 and a positive review a score of ≥ 7 out of 10.
2.6 Tools
Tools and libraries for research in sentiment analysis already exist, and the aim of this section is to present some of these tools.
2.6.1 NLTK
NLTK or “Natural Language Toolkit” is a toolkit built with Python that is used as a framework for general natural language processing needs (Natural Language ToolKit 2016). It can be used for its naive Bayes classifier and for removing stopwords during data preprocessing.
2.6.2 Scikit-learn
Scikit-learn is a toolkit that consists of different machine learning algorithms that can be used in data science and sentiment analysis (scikit-learn 2014). It is implemented in Python and built on other frameworks like NumPy, SciPy and matplotlib. It contains classifier algorithms such as the Support Vector Machine and decision tree.
2.6.3 AFINN
AFINN is a tool that is used to acquire a sentiment score for text using the lexicon based approach (Nielsen 2011)(Nielsen 2016). AFINN uses a human-compiled sentiment lexicon to classify text with sentiment.
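A minimal usage sketch of the Python AFINN package from the repository cited above:

from afinn import Afinn

afinn = Afinn()
# score() sums the lexicon values of the words it recognizes; a positive
# total suggests positive sentiment and a negative total negative sentiment.
print(afinn.score("This is utterly excellent!"))  # 3.0
print(afinn.score("This movie was horrible."))    # a negative score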
Chapter 3
Method
The method of the study is divided into two parts: formatting of the data and the testing procedure used to acquire the accuracy results. This chapter describes these processes.
3.1 Data formatting

3.1.1 Data requirements
One of the most important elements of this study is the problem of acquiring usable data. To be able to answer the questions posed in the problem statement, we need data that adheres to some requirements:
• The data from all the databases must be in a format the tools can read, or be convertible into one. This is needed so that the methods' accuracy can be compared between the databases.
• The data needs to be annotated with a sentiment of positive or negative, so that the accuracy of the methods can be determined.
The datasets presented in the background all adhere to these requirements, which is why they were chosen.
3.1.2 Formatting
The datasets presented were acquired in different file types and formats. Adapting the tools to all the different data would take too much time and code for this study, which creates the need for a common format. The format decided on was Comma Separated Values (CSV) files. Python can handle CSV files, and this format can be transformed to the tools' different input specifications. The common format was structured with the tweet/review text and the sentiment of said text, as in the example below. The data was extracted and reformatted with Python code and bash shell scripts to save time.

“Sentiment”, “Document text”

This format is all that the tools need to be able to get accuracy out of the algorithms.
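A sketch of reading this common format with Python's standard csv module (the filename is hypothetical):

import csv

# "dataset.csv" is a hypothetical file in the common format described
# above: each row holds the sentiment label and the document text.
with open("dataset.csv") as csv_file:
    for sentiment, text in csv.reader(csv_file):
        print(sentiment, text)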
3.1.3 Preprocessing
To be able to see if different characteristics of the text had an impact on the results, some preprocessing was done for the machine learning approach. The lexicon approach was expected to read the text as it is, with no difference to the results. The preprocessing steps, sketched in code after the list below, were:
• Everything was converted to lower-case. This is done to avoid the same word being interpreted differently.
• Removal of the hash character ”#” using regular expressions. This was done so that words such as ”#amazing” would turn into ”amazing”; otherwise they would be interpreted as different words.
• Substitution of usernames. Usernames on twitter begin with an ”@” character. All words beginning with this character were replaced with the string ”USERNAME”. Usernames should not alter the sentiment of the text and were therefore normalized away.
• Removing stopwords from the text. This was done using the stopword dataset supplied by NLTK.
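A sketch of these steps as a single function (it assumes the NLTK stopword corpus has been downloaded once with nltk.download('stopwords')):

import re
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))

def preprocess(text):
    text = text.lower()                       # convert to lower-case
    text = re.sub(r"@\w+", "USERNAME", text)  # substitute @usernames
    text = text.replace("#", "")              # remove the hash character
    # remove stopwords using NLTK's stopword list
    return " ".join(w for w in text.split() if w not in STOPWORDS)

print(preprocess("@anna This is #amazing, the BEST movie!"))
# -> "USERNAME amazing, best movie!"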
3.2 Testing procedure
The approaches were implemented differently. To make it possible to replicate the study and understand some of the consequences of the results, the versions of all the tools used are presented in table 3.1.
Tool          Version number
Python        2.7.11
NLTK          3.2.1
AFINN         GitHub commit: 720367d
scikit-learn  0.17.1
NumPy         1.11.0

Table 3.1. Version numbers of the tools used.
3.2.1 Lexicon
The lexicon based approach used the sentiment scoring tool AFINN (Nielsen 2011)(Nielsen 2016). It was used in a previous thesis study for its ease of use and low implementation effort (Sommar & Wielondek 2015); it was used in this study for the same reasons.
AFINN has a function for scoring the sentiment of a text, based on word scores ranging from -5 to 5. All texts with values ≤ -1 were annotated negative and texts with values ≥ 1 were annotated positive. The scoring function was used on all the separate tweets or reviews in the test sets to get a negative or positive sentiment. For the Sanders dataset the whole set of 1091 tweets was scored, instead of only a part. The sentiment result from AFINN was then compared to the actual sentiment to calculate the accuracy of the method depending on the dataset. See figure 3.1 below.
Figure 3.1. An illustration of how the accuracy was acquired for the lexicon based
approach.
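A sketch of the scoring and thresholding described above (the two labelled texts are a made-up stand-in for a real test set):

from afinn import Afinn

afinn = Afinn()

# Hypothetical (text, gold label) pairs standing in for a real test set.
test_set = [("I love this, great stuff", "positive"),
            ("What a horrible day", "negative")]

correct = 0
for text, gold in test_set:
    score = afinn.score(text)
    # Scores <= -1 are annotated negative, scores >= 1 positive.
    if score >= 1:
        predicted = "positive"
    elif score <= -1:
        predicted = "negative"
    else:
        predicted = None  # neutral scores match neither annotation
    correct += predicted == gold

print("accuracy:", correct / float(len(test_set)))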
3.2.2 Machine learning
The machine learning based approach used the toolkits NLTK and scikit-learn for scoring the sentiment of a text. As described in the background, each classifier algorithm needs to be trained before being tested and used. We trained the Naive Bayes classifier, the SVM and the decision tree with the training data. Both the Sentiment140 and the IMDB dataset had training data bundled with the test data, but for the Sanders dataset another approach had to be used. For the Sanders database the tweets were shuffled and then split into 1000 tweets for training and 91 tweets for testing; this was repeated 20 times to get an average accuracy.
The trained classifiers were then used on the testing data to extract sentiment from the text. This was performed on each dataset with the corresponding classifier. Repeating the training and testing on the Sanders dataset was needed because the dataset was much smaller and yielded large differences in results between repetitions. The extracted sentiment was then compared to the actual sentiment of each dataset, and the accuracy could be calculated. For each dataset all three classifiers were trained and then tested on the testing data for the same dataset. See figure 3.2 below.
Figure 3.2. An illustration of how the accuracy was acquired for the machine
learning based approach.
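A sketch of the repeated shuffle-and-split evaluation used for the Sanders dataset, here with the Naive Bayes classifier standing in for any of the three (texts and labels are assumed to be parallel lists read from the common CSV format):

import random
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def average_accuracy(texts, labels, rounds=20, train_size=1000):
    """Shuffle, split, train and score; return the accuracy averaged over rounds."""
    accuracies = []
    data = list(zip(texts, labels))
    for _ in range(rounds):
        random.shuffle(data)
        train, test = data[:train_size], data[train_size:]
        vectorizer = CountVectorizer()
        X_train = vectorizer.fit_transform([text for text, _ in train])
        X_test = vectorizer.transform([text for text, _ in test])
        classifier = MultinomialNB()
        classifier.fit(X_train, [label for _, label in train])
        accuracies.append(classifier.score(X_test, [label for _, label in test]))
    return sum(accuracies) / len(accuracies)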
Chapter 4
Results
Tables 4.1 to 4.3 show the accuracy of the different machine learning methods with different levels of preprocessing. The results from the decision tree method are incomplete due to the unreasonable computing time required. Table 4.4 shows the results from the lexicon method. Lastly, figure 4.1 compares the best result for each method.
              Naive Bayes   SVM      Decision Tree
Sentiment140  80.45%        77.37%   No data
Sanders       70.05%        66.48%   63.79%
IMDB          80.35%        78.26%   No data

Table 4.1. The accuracy of the machine learning methods with no preprocessing.
The Naive Bayes method achieves the highest accuracy on all datasets, with accuracies between 70% and 80%. Significantly lower accuracy is found for all methods when the Sanders dataset is used.
              Naive Bayes   SVM      Decision Tree
Sentiment140  78.83%        76.88%   No data
Sanders       71.15%        66.21%   63.74%
IMDB          80.89%        77.94%   No data

Table 4.2. The accuracy of the machine learning methods with the twitter syntax removed.
The results are similar to table 4.1; Naive Bayes is the best performing method for all datasets. The removal of twitter syntax has little impact, mostly increasing the accuracy by less than 1%. For some methods and datasets the accuracy decreases.
              Naive Bayes   SVM      Decision Tree
Sentiment140  81.56%        79.05%   No data
Sanders       71.18%        67.91%   62.80%
IMDB          80.29%        78.00%   No data

Table 4.3. The accuracy of the machine learning methods with the twitter syntax removed as well as stopwords.
The results are similar to table 4.1; Naive Bayes is the best performing method for all datasets. The removal of stopwords increases the accuracy on the Sentiment140 dataset by 2-3 percentage points, while giving either a gain or a loss of approximately 1% for the other datasets (compared to table 4.2).
              Lexicon
Sentiment140  76.87%
Sanders       84.38%
IMDB          69.67%

Table 4.4. The accuracy of the lexicon method.
The results from the lexicon method show that it performs best on the Sanders dataset and worst on the reviews from IMDB.
Figure 4.1. Graphical representation of the best result for each method on each dataset.
Chapter 5
Discussion
5.1 Machine learning
Comparing the lexicon method's performance on the reviews from IMDB (table 4.4) with the results of the machine learning methods shows that the machine learning algorithms perform better on the IMDB reviews.
Preprocessing the data by removing twitter specific syntax and stopwords gives, in most cases, some improvement to the machine learning algorithms. With the exception of the decision tree method and the IMDB dataset, the best accuracy is found in table 4.3. It should be easier to find the patterns which carry sentimental meaning if white noise is removed, which is reflected in our results. Why the decision tree behaves differently is hard to explain. If all algorithms behaved like that, one could suggest that the hashtags and the stopwords actually contain information, but as it is only one algorithm on a rather small dataset, it should be treated as an anomaly rather than anything else. Overall the machine learning algorithms can be configured to perform better, even if the difference is quite small.
5.2 Lexicon
The lexicon method is almost as good as machine learning on the Sentiment140 dataset and outperforms all machine learning methods on the Sanders database. One possible explanation for why the lexicon method performs well on the twitter datasets is that the language in them is more compact and simple than in a movie review.
The fact that the language used in social media may be far from correct could be a problem if a lexicon method is used. Unlike machine learning methods, which find patterns in any data, a lexicon method is constructed to understand a language. This could be a problem if the language in the data is different from what the lexicon method is constructed to classify. However, from the results this cannot be concluded to be a problem, as the lexicon method performs well.
As seen in table 4.4 the lexicon method performs vastly differently on the two twitter datasets, with almost a 10 percentage point difference. While the machine learning methods do perform better when trained with a larger set, there is no clear reason why a lexicon method would behave like this.
One possible explanation is that the datasets differ in more ways than size. Even if both datasets consist of tweets, they have been annotated differently: Sanders by hand and Sentiment140 automatically. There is a possibility that this is the source of the difference: the lexicon method disagrees with the program which annotated Sentiment140 but agrees with the human interpretation of sentiment in tweets. However, a very likely source of this difference is the natural variation of a method's accuracy over different datasets. As the Sanders dataset is small, an accidental extreme result is possible.
5.3 Implementation
The great strength of machine learning is that, if presented with more data, it will perform better. Our results reflect this, as the larger database Sentiment140 gives better results for all machine learning algorithms compared to the smaller Sanders database.
Increasing the amount of data to improve the machine learning approaches comes at a price, as the computing time and the amount of memory needed increase. This is especially true if NLTK is used, as it is far from efficient in either aspect. Since NLTK is written in Python it is slow, which poses a real problem. During the work several of the tests failed because of memory errors with the Windows distribution of NLTK, and since the tests were time consuming, the cost of testing many configurations quickly climbed. The construction of decision trees had to be partly omitted because the trees took too long to build, given the tools at hand and the size of the data.
There are other toolkits and implementations available, written in other languages, which most likely are more efficient and better suited for large scale computations. However, these toolkits may prove harder to work with and in the end not be worth the time. As we found NLTK to be widely used in research, we assumed it to be representative of the algorithms, even if NLTK may not be the best tool for processing large sets of data.
5.4 Applications and future research
Our results show that the lexicon based method has potential in the field of sentiment analysis on social media and should not be underestimated. It could therefore be used in real applications and should be the subject of further research, for example analyzing more datasets with more implementations, as datasets from other sources within social media and different lexicons would probably give other results. In our study we did not compare runtime as a variable, but for reasons of practicality and interest this should also be researched.
Chapter 6
Conclusion
Switching the source of data from reviews to social media does affect the accuracy of sentiment analysis methods. A lexicon approach shows potential, as it performs better on social media datasets than the machine learning methods, possibly due to the compact nature of the social media data. The accuracy of the machine learning methods can be improved through preprocessing, but the improvement is small. A more significant gain in accuracy is achieved by increasing the size of the training dataset.
Bibliography
Baldwin, T., Cook, P., Lui, M., MacKinlay, A. & Wang, L. (2013), How noisy social
media text, how diffrnt social media sources?
URL: http://cs.unb.ca/~ccook1/ijcnlp2013-socmed.pdf
Go, A., Bhayani, R. & Huang, L. (2015), ‘Sentiment140’.
URL: http://help.sentiment140.com/for-students/
Hindersson, T. & Lousseief, E. (2015), ‘Smirking or smiling smileys? - evaluating
the use of emoticons to determine sentimental mood’.
URL: http://www.diva-portal.org/smash/get/diva2:811037/FULLTEXT01.pdf
Hoschs, W. L. (2016), ‘Machine learning’.
URL: http://global.britannica.com/technology/machine-learning
Joachims, T. (1998), ‘Text categorization with support vector machines’.
URL: http://www.cs.cornell.edu/people/tj/publications/joachims_98a.pdf
Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y. & Potts, C. (2011),
Learning word vectors for sentiment analysis, in ‘Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies’, Association for Computational Linguistics, Portland, Oregon, USA, pp. 142–150.
URL: http://www.aclweb.org/anthology/P11-1015
Manning, C. D., Raghavan, P. & Schütze, H. (2009), An Introduction to Information
Retrieval, Cambridge University Press, Cambridge, England.
Medhat, W., Hassan, A. & Korashy, H. (2014), ‘Sentiment analysis algorithms and
applications: A survey’.
URL: http://www.sciencedirect.com/science/article/pii/S2090447914000550
Natural Language ToolKit (2016).
URL: http://www.nltk.org/
Nielsen, F. Å. (2011), A new ANEW: evaluation of a word list for sentiment analysis in microblogs, in M. Rowe, M. Stankovic, A.-S. Dadzie & M. Hardey, eds,
‘Proceedings of the ESWC2011 Workshop on ’Making Sense of Microposts’: Big
things come in small packages’, Vol. 718 of CEUR Workshop Proceedings, pp. 93–
98.
URL: http://ceur-ws.org/Vol-718/paper_16.pdf
Nielsen, F. Å. (2016), ‘Afinn sentiment analysis in python’.
URL: https://github.com/fnielsen/afinn
Pang, B. & Lee, L. (2008), ‘Opinion mining and sentiment analysis’.
URL: http://www.cs.cornell.edu/home/llee/omsa/omsa.pdf
Pang, B., Lee, L. & Vaithyanathan, S. (2002), Thumbs up? sentiment classification
using machine learning techniques, pp. 79–86.
URL: http://www.cs.cornell.edu/home/llee/papers/sentiment.pdf
Sanders, N. J. (2011), ‘Sanders-twitter sentiment corpus’.
URL: http://www.sananalytics.com/lab/twitter-sentiment/
scikit-learn (2014).
URL: http://scikit-learn.org/stable/index.html
Sommar, F. & Wielondek, M. (2015), ‘Combining lexicon- and learning-based approaches for improved performance and convenience in sentiment classification’.
URL: http://www.diva-portal.org/smash/get/diva2:811021/FULLTEXT01.pdf
Tong, S. & Koller, D. (1998), ‘Support vector machine active learning with applications to text classification’, p. 1.
URL: http://ai.stanford.edu/~koller/Papers/Tong+Koller:ICML00.pdf
ZackWeinberg (2012), ‘Svm separating hyperplanes’.
URL: https://commons.wikimedia.org/wiki/File:Svm_separating_hyperplanes_(SVG).svg