An M Tech Dissertation Titled

Twitter Sentiment Analysis using Hybrid Naïve Bayes

Submitted in partial fulfilment towards the award of the degree of MASTER OF TECHNOLOGY IN COMPUTER ENGINEERING by Mr. Harsh Vrajesh Thakkar

Supervisor(s): Dr. Dhiren Patel

2012-2013
Department of Computer Engineering
SARDAR VALLABHBHAI NATIONAL INSTITUTE OF TECHNOLOGY, SURAT

Declaration

I hereby declare that the work presented in this dissertation report entitled “Twitter Sentiment Analysis using Hybrid Naïve Bayes” by me, Harsh Vrajesh Thakkar, bearing Roll No: P11CO010, and submitted to the Computer Engineering Department at Sardar Vallabhbhai National Institute of Technology, Surat, is an authentic record of my own work carried out during the period July 2012 to June 2013 under the supervision of Dr. Dhiren Patel. The matter presented in this report has not been submitted by me to any other University/Institute for any purpose. Neither the source code therein nor the content of the project report has been copied or downloaded from any other source. I understand that my result grades would be revoked if this is later found to be untrue.

______________________ (Harsh V. Thakkar)

CERTIFICATE

This is to certify that the dissertation report entitled “Twitter Sentiment Analysis using Hybrid Naïve Bayes”, submitted by Harsh Vrajesh Thakkar, bearing Roll No: P11CO010, in partial fulfilment of the requirements for the award of the degree of MASTER OF TECHNOLOGY in Computer Engineering at the Computer Engineering Department of Sardar Vallabhbhai National Institute of Technology, Surat, is a record of his own work carried out as part of the coursework for the year 2012-13. To the best of our knowledge, the matter embodied in the report has not been submitted elsewhere for the award of any degree or diploma.

Certified by ____________________ (Dr.
Dhiren Patel)
Professor, Department of Computer Engineering, S V National Institute of Technology, Surat – 395007, India

_______________________________
Mr. Udai Pratap Rao, PG Incharge, M Tech in Computer Engineering, SVNIT, Surat

Mr. Rakesh Gohil, Head, Department of Computer Engineering, S V National Institute of Technology, Surat – 395007, India

Department of Computer Engineering
SARDAR VALLABHBHAI NATIONAL INSTITUTE OF TECHNOLOGY, SURAT (2012-13)

Approval Sheet

This is to state that the Dissertation Report entitled “Twitter Sentiment Analysis using Hybrid Naïve Bayes” submitted by Mr. Harsh Vrajesh Thakkar (Admission No: P11CO010) is approved for the award of the degree of Master of Technology in Computer Engineering.

Board of Examiners: Examiners, Supervisor(s), Chairman, Head, Department of Computer Engineering
Date: ___________ Place: ______________

Acknowledgements

The journey of my Master of Technology has so far been very exciting and full of challenges. I would like to acknowledge that, apart from God’s and my parents’ blessings, the success of this dissertation work is the result of the sheer encouragement and guidance of my guide, Dr. Dhiren R. Patel. I am very thankful to him for being there for me and empowering me with his methodology and expertise. He has given me ample space to explore research interests of my own, which I personally believe to be a burning requirement for any natural research enthusiast. I am also grateful to Mr. Rakesh Gohil, the head of the Computer Engineering department, and his staff for giving me a lot of their valuable time, as well as the staff of the Central Computer Centre for allowing me to stay in the laboratory till late. Last but not least, I would like to thank my friends, who stood by me like a shadow whenever I needed their assistance, and who gave me the strength in my weak and wrecked moments to carry on. Today I am here because of the efforts of all of these wonderful people.

Harsh V.
Thakkar (P11CO010)

Abstract

Millions of users share opinions on diverse aspects of life and politics every day using microblogging over the internet. Microblogging websites are rich sources of data for opinion mining and sentiment analysis. In this dissertation work, we focus on using Twitter for sentiment analysis: extracting opinions about events, products and people, and using them to understand current trends or the state of the art. Twitter allows its users a limit of only 140 characters; this restriction forces the user to be concise as well as expressive at the same moment. This ultimately makes Twitter an ocean of sentiments. Twitter also provides developer-friendly streaming. We crawled datasets of over 4 million tweets with a custom-designed crawler for sentiment analysis purposes. We propose a hybrid Naïve Bayes classifier by integrating an English lexical dictionary (SentiWordNet) into the existing machine learning Naïve Bayes classifier algorithm. Hybrid Naïve Bayes classifies tweets into positive and negative classes. Experimental results demonstrate the superiority of hybrid Naïve Bayes over existing approaches on multi-sized datasets consisting of a variety of keywords, yielding over 90 percent accuracy in general and 98.59 percent accuracy in the best case. In our research, we worked with English; however, the proposed technique can be used with any other language, provided a lexical dictionary for that language is available.

Table of Contents

Chapter 1 Introduction 1-5
1.1 Motivation 1
1.2 Problem description 2
1.2.1 Why sentiment analysis? 2
1.2.2 Why NLP? . . .
2
1.2.3 Applications of sentiment analysis 3
1.3 The network: Twitter 4
1.4 Contribution 4
1.5 Thesis outline 5
Chapter 2 Theoretical Background and Literature Review 5-17
2.1 General sentiment analysis 6
2.2 Issues in sentiment analysis 7
2.3 Classification of approaches 7
2.3.1 Knowledge-based approach 7
2.3.2 Relationship-based approach 8
2.3.3 Language models approach 8
2.3.4 Discourse structures and semantics approach 9
2.4 Twitter specific approaches 9
2.4.1 Lexical analysis approach 10
2.4.2 Machine learning approach 11
2.4.3 Hybrid approach 12
2.5 Performance review 14
2.5.1 Lexical approach performance . . .
14
2.5.2 Machine learning approach performance 15
2.5.3 Hybrid approach performance 16
2.6 Chapter conclusion 16
Chapter 3 Design and Analysis of proposed approach 18-33
3.1 Problem statement 18
3.1.1 Problem introduction 18
3.1.2 Problem definition 19
3.2 Proposed approach: Hybrid Naive Bayes 19
3.2.1 Data collection: Twitter API 20
3.2.2 Preprocessing 22
3.2.3 Training data 23
3.2.4 Sentiment analysis: The classifier 24
Chapter 4 Implementation Methodology 26-30
4.1 Experimental setup 26
4.2 Evaluation 30
4.3 Test application 30
Chapter 5 Results and Analysis 34-42
5.1 Results . . .
34
Chapter 6 Conclusion and Future Work 43-44
6.1 Conclusion 43
6.2 Future work 44
Bibliography 45-47

List of Figures

Figure 2.1 Generic architecture of a lexical approach classifier. 11
Figure 2.2 Generic architecture of a machine learning approach classifier. 13
Figure 3.1 Process steps followed by hybrid naive bayes. 20
Figure 3.2 System architecture of hybrid naive bayes approach. 21
Figure 3.3 A sequence of intermediate preprocessing steps taking place at this level. 22
Figure 4.1 Polarity triangle of synset in SentiWordNet. 30
Figure 4.2 Class diagram of Tweet Sentiment Analyzer. 32
Figure 5.1 Comparison of classifier performances, Dataset size (no. of tweets) vs Accuracy (%). 36
Figure 5.2 The most informative features of hybrid naive bayes classifier. 37
Figure 5.3 The base naive bayes classifier in action on a Windows platform with 50k tweets dataset. 37
Figure 5.4 Accuracy of base naive bayes classifier with a 50k tweets dataset. 38
Figure 5.5 The hybrid naive bayes classifier in action on a Windows platform with 50k tweets dataset. 39
Figure 5.6 Accuracy of hybrid naive bayes classifier with a 50k tweets dataset. 40
Figure 5.7 Base naive bayes classifier in action on a Linux server with 10k tweets dataset. 41
Figure 5.8 Accuracy of base naive bayes classifier on a Linux server with a 10k tweets dataset. 42

List of Tables

Table 2.1 Performance of lexical approach variants 14
Table 2.2 Performance of machine learning approach variants 15
Table 2.3 Performance of hybrid approach variants 17
Table 4.1 General system requirements for our approach. 26
Table 5.1 Performance of base naive bayes classifier 34
Table 5.2 Performance of hybrid naive bayes classifier 35
“I don’t know if I will have the time to write anymore letters because I might be too busy trying to participate. So if this does end up being the last letter I just want you to know that I was in a bad place before I started high school and you helped me. Even if you didn’t know what I was talking about or know someone who has gone through it, you made me not feel alone. Because I know there are people who say all these things don’t happen. And there are people who forget what it’s like to be 16 when they turn 17. I know these will all be stories someday. And our pictures will become old photographs. We’ll all become somebody’s mom or dad. But right now these moments are not stories. This is happening, I am here and I am looking at her. And she is so beautiful. I can see it. This one moment when you know you’re not a sad story. You are alive, and you stand up and see the lights on the buildings and everything that makes you wonder. And you’re listening to that song and that drive with the people you love most in this world. And in this moment I swear, We are infinite.”

- Charlie’s last letter (The Perks of Being a Wallflower, 2012)

Chapter 1 Introduction

1.1 Motivation

With the proliferation of Web 2.0 applications such as microblogging, forums and social networks came reviews, comments, recommendations, ratings and feedback generated by users. This user-generated content can be about virtually anything, including politicians, products, people, events, etc. With the explosion of user-generated content came the need for companies, politicians, service providers, social psychologists, analysts and researchers to mine and analyze the content for different uses. The bulk of this user-generated content required the use of automated techniques for mining and analysis. Cases of bulk user-generated content that have been studied are blogs [1] and product/movie reviews [2].
Microblogging has become a very popular communication tool among Internet users. Millions of messages appear daily on popular websites that provide microblogging services, such as Twitter (http://www.twitter.com), Tumblr (http://www.tumblr.com) and Facebook (http://www.facebook.com). Users of these services write about their lives, share opinions on a variety of topics and discuss current issues. Because of the free format of messages and the easy accessibility of microblogging platforms, Internet users tend to shift from traditional communication tools to microblogging services. As more and more users post about the products and services they use and express their political and religious views, microblogging websites become valuable sources of people’s opinions and sentiments. Such data can be efficiently used for marketing and social studies.

Sentiment analysis is an exhaustive research field which has been studied for decades. It was initially used to analyze sentiment in long texts such as letters, emails and so on. It has also been deployed in the field of pre- and post-crime analysis of criminal activities. A variety of approaches have been applied to it. Applying this field to the microblogging fraternity is a challenging job. This challenge became our motivation. Needless to say, we are not the first to work in this area. There has been substantial research in both machine learning and lexical approaches to sentiment analysis for social networks. We try to improve the existing approaches by adding new variants to the research.

1.2 Problem Description

We propose a hybrid approach for sentiment analysis that is a combination of a machine learning algorithm and a special lexical dictionary.

1.2.1 WHY SENTIMENT ANALYSIS?

Every day, an enormous amount of data is created on social networks, blogs and other media and diffused into the World Wide Web.
This huge data contains very crucial opinion-related information that can be used to benefit businesses and other aspects of commercial and scientific industries. Manual tracking and extraction of this useful information is not feasible; thus, sentiment analysis is required. Sentiment analysis is the phenomenon of extracting sentiments or opinions from reviews expressed by users over a particular subject, area or product online. It is an application of natural language processing, computational linguistics, and text analytics to identify subjective information in source data. It groups the sentiments into categories like “positive” or “negative”. Thus, it determines the general attitude of the speaker or writer with respect to the topic in context.

1.2.2 WHY NLP?

Natural language processing (NLP) is the technology dealing with our most ubiquitous product: human language, as it appears in emails, web pages, tweets, product descriptions, newspaper stories, social media, and scientific articles, in thousands of languages and varieties. In the past decade, successful natural language processing applications have become part of our everyday experience, from spelling and grammar correction in word processors to machine translation on the web, from email spam detection to automatic question answering, from detecting people’s opinions about products or services to extracting appointments from your email. The greatest challenge of sentiment analysis is to design application-specific algorithms and techniques that can analyze human language accurately.

1.2.3 Applications of sentiment analysis

The following are the major applications of sentiment analysis in real-world scenarios.

• Product and service reviews - The most common application of sentiment analysis is in the area of reviews of consumer products and services. There are many websites that provide automated summaries of reviews about products and their specific aspects.
A notable example of that is “Google Product Search”.

• Reputation monitoring - Twitter and Facebook are a focal point of many sentiment analysis applications. The most common application is monitoring the reputation of a specific brand on Twitter and/or Facebook.

• Result prediction - By analyzing sentiments from relevant sources, one can predict the probable outcome of a particular event. For instance, sentiment analysis can provide substantial value to candidates running for various positions. It enables campaign managers to track how voters feel about different issues and how they relate to the speeches and actions of the candidates.

• Decision making - Another important application is that sentiment analysis can be used as an important factor assisting decision-making systems, for instance in financial market investment. There are numerous news items, articles, blogs, and tweets about each public company. A sentiment analysis system can use these various sources to find articles that discuss the companies and aggregate the sentiment about them into a single score that can be used by an automated trading system. One such system is The Stock Sonar (http://www.thestocksonar.com/), developed by Digital Trowel, which graphically shows the daily positive and negative sentiment about each stock alongside the graph of the stock’s price.

1.3 The network: Twitter

Twitter is an online social networking and microblogging service that enables its users to send and read text-based messages called “tweets”. Tweets are publicly visible by default, but senders can restrict message delivery to a limited crowd. Twitter is one of the largest microblogging services, with over 500 million registered users as of 2012.
Statistics revealed by Infographic Labs (http://infographiclabs.com/) suggest that back in the year 2012, 175 million tweets were communicated on a daily basis. There is a large mass of people using Twitter to express sentiments, which makes it an interesting and challenging choice for sentiment analysis. When so much attention is being paid to Twitter, why not monitor and cultivate methods to analyze these sentiments? Twitter has been selected with the following purposes in mind.

• Twitter is an open-access social network.
• Twitter is an ocean of sentiments (limited to 140 characters, i.e. high sentiment density).
• Twitter provides a user-friendly API, making it easier to mine sentiments in real time.

1.4 Contribution

In this thesis, we propose a Hybrid Naive Bayes classifier which is the combination of a machine learning algorithm (Naive Bayes) and a special lexical dictionary (SentiWordNet, http://www.sentiwordnet.isti.cnr.it). We crawled multi-size datasets consisting of approximately 4 million tweets with a variety of the most popular keywords, such as “#ironman3”, “#amitabhbachhan”, “#Google”, “#twitter” and “#robertdowneyjr”, for training and testing purposes. We test the proposed Hybrid Naive Bayes approach using the Natural Language Toolkit and observe that it outperforms the existing approaches, delivering competitive results with 98.59 percent accuracy (best case).

1.5 Thesis Outline

This thesis is organised as follows:

• Chapter 1: Introduces the work of this thesis.
• Chapter 2: An exhaustive survey of existing approaches.
• Chapter 3: Proposed approach and goals.
• Chapter 4: Details the implementation methodology.
• Chapter 5: Discussion of results and analysis.
• Chapter 6: Concludes the work and explores avenues for future work.

Chapter 2 Theoretical Background and Literature Review

Sentiment analysis caught attention as one of the most active research areas with the explosion of social networks.
The enormous user-generated content resulting from these social media contained valuable information in the form of reviews, opinions, etc., about products, events and people. Most sentiment analysis studies use machine learning approaches, which require a large amount of user-generated content for training. The research on sentiment analysis so far has mainly focused on two things: identifying whether a given textual entity is subjective or objective, and identifying the polarity of subjective texts [3].

In the following sections, we briefly review the literature on both general and Twitter-specific sentiment analysis. Starting with general sentiment analysis, we also discuss the issues that make sentiment analysis a more difficult task than other text-classification tasks. Later on, we move to the focus area of this thesis, i.e. Twitter-specific approaches. In a study, [4] present a comprehensive review of the literature written before 2008. Most of the material on general sentiment analysis is based on their review.

2.1 General Sentiment Analysis

Sentiment analysis has been carried out on a range of topics. For example, there are sentiment analysis studies for movie reviews [4], product reviews [5], and news and blogs ([3], [6]).

2.2 Issues in Sentiment Analysis

Research reveals that sentiment analysis is more difficult than traditional topic-based text classification, despite the fact that the number of classes in sentiment analysis is smaller than the number of classes in topic-based classification [4]. In sentiment analysis, the classes to which a piece of text is assigned are usually negative or positive. They can also be other binary classes or multi-valued classes, such as classification into “positive”, “negative” and “neutral”, but they are still fewer than the number of classes in topic-based classification.
The main reason sentiment analysis is tougher than topic-based classification is that the latter can be done using keywords, while this does not work well in sentiment analysis [2]: a variety of features have to be taken into account. Other reasons for the difficulty are: sentiment can be expressed in subtle ways without any perceived use of negative words; it is difficult to determine whether a given text is objective or subjective (the line between objective and subjective texts is thin); it is difficult to determine the opinion holder (for example, is it the opinion of the author or the opinion of a commenter?); and there are other factors such as dependency on the domain and on the order of words [3]. Further challenges of sentiment analysis are dealing with sarcasm, irony, and/or negation.

2.3 Classification of approaches

Sentiment analysis is formulated as a computational linguistics problem. The classification can be approached from different perspectives depending on the nature of the task at hand and the perspective of the person carrying out the sentiment analysis. The familiar approaches are discourse-driven, relationship-driven, language-model-driven, or keyword-driven. We discuss these approaches in the subsequent subsections.

2.3.1 Knowledge-based approach

In this approach, sentiment is calculated as a function of some keywords. The main task is the construction of sentiment discriminatory-word lexicons that indicate a particular class, such as the positive or negative class. The polarity of the words in the lexicon is determined prior to the sentiment analysis work. There are variations to how the lexicon is created.
For example, lexicons can be created by starting with some seed words and then using linguistic heuristics to add more words to them, or by starting with some seed words and adding other words to them based on frequency in a text [2]. For certain applications, there are publicly available discriminatory-word lexicons for use in sentiment analysis. Twitrratr (http://twitrratr.com) provides an opinion tracking service of public sentiments for Twitter sentiment analysis.

2.3.2 Relationship-based approach

In this approach, the classification task is approached from the different relationships that may exist between features and components. Such relationships include relationships between discourse participants and relationships between product features. For instance, if one wants to know the sentiment of customers about a product brand, one may compute it as a function of the sentiments on its different features or components.

2.3.3 Language models approach

In this approach, classification is done by building n-gram language models. A gram is a token or lexical item taken into consideration for training and classification; an n-gram is a sequence of n such chosen tokens. Generally, in this approach the frequencies of n-grams are used. In traditional information retrieval and topic-oriented classification, the frequency of n-grams gives good results. The frequency is converted to TF-IDF to take a term’s importance in the document to be classified into account. In a study, [3] show that in the sentiment classification of movie review blogs, term presence gives better results than term frequency. They indicate that uni-gram presence is more suited for sentiment analysis. But later, ([3], [5]) found that bi-grams and tri-grams worked better than uni-grams in a study of sentiment classification of product reviews.

Note: A feature, in terms of sentiment analysis, is the chosen set of words used for training in the supervised learning technique, i.e.
the “word dictionary” which is fed into the classifier for the classification of sentiments in the incorporated text. TF-IDF stands for term frequency-inverse document frequency: a term’s frequency in a document, weighted by how rare the term is across documents.

2.3.4 Discourse structures and semantics approach

This approach is dominant in applications where prior definition of the classes is not possible. When text is encountered, it is classified into the best category it fits (in the context of its objective). Based on the similarity of the semantics of words in the text, they are grouped together and tagged into classes. For example, in reviews the overall sentiment is usually expressed at the end of the text [2]. As a result, the approach in this case might be discourse-driven, in which the sentiment of the whole review is obtained as a function of the sentiment of the different discourse components in the review and the discourse relations that exist between them. For instance, the sentiment of a paragraph at the end of the review might be given more weight in determining the sentiment of the whole review. Semantics can be used for role identification of agents where there is a need to do so. For example, “India beat Australia” is different from “Australia beat India”.

2.4 Twitter specific approaches

The main difference between sentiment analysis of Twitter and of documents is that Twitter-based approaches are more specific towards determining the polarity of words (mainly adjectives), whereas document-based approaches are specific towards the task of determining features in the text. There are three major approaches for Twitter-specific sentiment analysis:

• Lexical analysis approach
• Machine learning approach
• Hybrid approach

Using one or a combination of the different approaches, one can employ one or a combination of lexical and machine learning techniques.
Specifically, one can use unsupervised techniques, supervised techniques or a combination of them. We first review the lexical approaches, which focus on building successful dictionaries, then the machine learning approaches, which are primarily concerned with feature vectors, and finally a combination of both, i.e. the hybrid approach.

2.4.1 Lexical analysis approach

A lexical approach typically utilizes a dictionary or lexicon of pre-tagged words. Each word that is present in a text is compared against the dictionary. If a word is present in the dictionary, then its polarity value is added to the “total polarity score” of the text. For example, if a match has been found with the word “excellent”, which is annotated in the dictionary as positive, then the total polarity score of the text is increased. If the total polarity score of a text is positive, then that text is classified as positive; otherwise it is classified as negative. Although naive in nature, many variants of this lexical approach have been reported to perform well ([7], [11], [8]). Since the classification of a statement depends upon the score it receives, there is a large volume of work devoted to discovering which lexical information works best. As a starting point for the field, [9] demonstrated that the subjectivity of an evaluative sentence could be determined through the use of a hand-tagged lexicon comprised solely of adjectives. They report over 80 percent accuracy on single phrases. Extending this work, [7] utilized the same methodology and hand-tagged adjective lexicon as [9], but tested the paradigm on a dataset composed of movie reviews. They reported a much lower accuracy rate of about 62 percent. Moving away from hand-tagged lexicons, Turney [2] utilized an Internet search engine to determine the polarity of words that would be included in the lexicon [7].
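The dictionary-scoring scheme described above can be sketched in a few lines of Python. The mini-lexicon and its polarity values here are hypothetical, for illustration only; a real system would use a large pre-tagged lexicon such as SentiWordNet.

```python
# Minimal sketch of the lexical (dictionary-scoring) approach.
# The lexicon below is a hypothetical toy example.
LEXICON = {
    "excellent": 1.0, "good": 0.8, "great": 0.9,
    "bad": -0.8, "terrible": -1.0, "poor": -0.7,
}

def lexical_polarity(text):
    """Sum the polarity of every lexicon word found in the text;
    a positive total score means the text is classified as positive."""
    score = sum(LEXICON.get(word, 0.0) for word in text.lower().split())
    return "positive" if score > 0 else "negative"

print(lexical_polarity("The movie was excellent , acting was great"))  # positive
print(lexical_polarity("the service was terrible and poor"))           # negative
```

Note that, as in the scheme described in the text, words absent from the dictionary contribute nothing to the score, which is why lexicon coverage largely determines the accuracy of this family of classifiers.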
Turney [2] performed two AltaVista search queries: one with a target word conjoined with the word “good”, and a second with the target word conjoined with the word “bad”. (AltaVista, in.altavista.com, was previously a popular internet search engine, which later lost ground to Google.) The polarity of the target word was determined by the search that returned the most hits. This approach improved accuracy to 65 percent. In a study, [8] and [10] chose to use the WordNet database (wordnet.princeton.edu, a lexical database for the English language that groups words into sets of synonyms called synsets, provides short, general definitions, and records the semantic relations between these synonym sets) to determine the polarity of words. They compared a target word to two pivot words (usually “good” and “bad”) to find the minimum path distance between the target word and the pivot words in the WordNet hierarchy. The minimum path distance was converted to an incremental score, and this value was stored with the word in the dictionary. The reported accuracy of this approach was 64 percent [10]. An alternative to the WordNet metric, proposed by [11], was to compute the semantic orientation of a word [7]. By subtracting a word’s association strength to a set of negative words from its association strength to a set of positive words, [11] were able to achieve an accuracy rate of 82 percent using two different semantic orientation statistical metrics. Figure 2.1 shows the generic architecture of a lexical approach classifier.

Figure 2.1: Generic architecture of a lexical approach classifier.

2.4.2 Machine Learning approach

The other main avenue of research within this area has utilized supervised machine learning techniques.
Within the machine learning approach, a series of feature vectors is chosen and a collection of tagged corpora is provided for training a classifier, which can then be applied to an untagged corpus of text. In a machine learning approach, the selection of features is crucial to the success rate of the classification. Most commonly, a variety of unigrams (single words from a document) or n-grams (two or more words from a document in sequential order) are chosen as feature vectors. Other proposed features include the number of positive words, the number of negating words, and the length of a document. Support Vector Machines (SVMs) ([14], [15]) and the Naive Bayes algorithm [16] are the most commonly employed classification techniques. The reported classification accuracy ranges between 63 percent and 84 percent, but these results depend upon the features selected. Since the majority of sentiment analysis approaches use machine learning techniques, the features of text are often represented as feature vectors. The following features are commonly used in sentiment analysis.

• TF-IDF (Term Frequency–Inverse Document Frequency): as discussed in section 2.3.4, this weights the counts of the "terms" or "words" being taken into account. These terms may be uni-grams, bi-grams or higher-order n-grams. Which of these yields better results is still an open question: [4] claim the superiority of uni-grams over bi-grams in movie review sentiment analysis, whereas [5] argue that "bi-grams and tri-grams give better results" on the basis of their product review classification analysis.

• POS (Part-Of-Speech) tags: English is by nature an ambiguous language; one particular word may have more than one meaning depending upon its context of use. POS tags are used to disambiguate word sense, which in turn is used to guide feature selection [3].
For instance, these tags can be used to single out adjectives and adverbs, since they generally serve as sentiment indicators [2]. It was later realised, however, that adjectives perform worse than the same number of uni-grams chosen on the basis of frequency.

• Syntax and negation: The use of collocations and other syntactic features can improve performance. In the classification of short texts, algorithms using syntactic features and algorithms using n-gram features were reported to yield the same performance [3].

Figure 2.2 shows the generic architecture of a machine learning approach.

Figure 2.2: Generic architecture of a machine learning approach classifier.

2.4.3 Hybrid approach

Some approaches use a combination of the other approaches. One such combined approach is that of [17]. They start with two word lexicons and unlabelled data. With the two discriminatory-word lexicons (negative and positive), they create pseudo-documents containing all the words of the chosen lexicon. They then compute the cosine similarity between these pseudo-documents and the unlabelled documents. Based on the cosine similarity, a document is assigned either positive or negative sentiment. These labelled documents are then used to train a Naive Bayes classifier. In a similar combined approach, [18] defined "a unified framework" that allows one to use background lexical information in terms of word-class associations, and to refine this information for specific domains using any available training examples. They proposed and used a different approach which they called the pooling multinomial classifier, a variant of Multinomial Naive Bayes (based on the multinomial distribution). Unlike [17], they used manually labelled data for training. They report that their experiments reveal that the incorporation of lexical knowledge improves performance.
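As a toy illustration of the pseudo-document labelling step of [17]: a document is compared against a positive and a negative pseudo-document by cosine similarity and assigned the nearer sentiment. The miniature lexicons below are hypothetical stand-ins for the real word lists.

```python
import math

def bow(text):
    """Bag-of-words term counts for a text."""
    counts = {}
    for tok in text.lower().split():
        counts[tok] = counts.get(tok, 0) + 1
    return counts

def cosine(a, b):
    """Cosine similarity between two bag-of-words count dicts."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical miniature lexicons turned into pseudo-documents.
POS_PSEUDO = bow("good great excellent happy love")
NEG_PSEUDO = bow("bad poor terrible sad hate")

def pseudo_label(text):
    """Assign the sentiment of the nearer pseudo-document."""
    doc = bow(text)
    return ("positive" if cosine(doc, POS_PSEUDO) >= cosine(doc, NEG_PSEUDO)
            else "negative")

print(pseudo_label("what a great and happy day"))  # positive
```

Documents labelled this way can then serve as (noisy) training data for a Naive Bayes classifier, which is the second stage of the combined approach.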
They obtained better performance with their approach than approaches using lexical knowledge or training data in isolation, or other approaches using combined techniques. There are also combined approaches that are complementary, in that different classifiers are used in such a way that one classifier contributes to another [19].

2.5 Performance review

Sentiment analysis has been applied to a variety of applications such as movie reviews, blog reviews, etc. In their study, [20] evaluated the performance of the earlier discussed approaches on a news and movie review application for Twitter. For this purpose they employed the following resources.

1. Cornell Movie Review dataset of tagged blogs (1000 positive and 1000 negative) [2].
2. List of 2000 positive words and 2000 negative words from the General Inquirer lists of adjectives [12].
3. Yahoo! Web Search API [13].
4. Porter Stemmer [21].
5. WordNet Java API [22].
6. Stanford Log Linear POS Tagger built with the Penn Treebank tag set [23].
7. WEKA Machine Learning Java API (only used for machine learning) [24].
8. SVM-Light Machine Learning Implementation [25].

The performance of the various approaches reported by [20] is as follows.

2.5.1 Lexical approach performance

The performance of the lexical approach variants is shown in table 2.1.

Table 2.1: Performance of lexical approach variants [20].

    Approach                  Accuracy
    Baseline                  50.0
    Baseline + Stemming       50.2
    Baseline + Yahoo! words   57.7
    Baseline + WordNet        60.4
    Baseline + all            55.7

Combining all three variants (i.e. stemming, Yahoo! words, and WordNet) leads to a surprising drop in accuracy. By simultaneously increasing the number of words in the dictionary, applying a stemming algorithm, and assigning a weight to each dictionary word, they substantially increased the size of the dictionary.
The proportion of positive and negative token matches did not change; they hypothesize that for each positive match found, it was equally likely that a negative match would be found, creating a neutral net effect. Given these results, it appears that the selection of words included in the dictionary is very important for the lexical approach. If the dictionary is too sparse or too exhaustive, one risks over- or under-analyzing the results, leading to a decrease in performance. In accordance with previous findings, these results confirm that it is difficult to surpass the 65 percent accuracy level using a purely lexical approach.

2.5.2 Machine Learning approach performance

During the initial experimentation phase of the machine learning approach, [20] used the WEKA package [24] to obtain a general idea of ML algorithms. Their preliminary experiments indicated that the SVM and Naive Bayes algorithms are the most accurate, with closely matched results.

Table 2.2: Performance of machine learning approach variants [20].

    Approach                      SVM-Light   Naive Bayes
    Unigram integer               77.4        77.1
    Unigram binary                77.0        75.5
    Unigram integer + aggregate   68.2        77.3
    Unigram binary + aggregate    65.4        77.5

The Naive Bayes approach performs quite well in all variants (delivering greater than 75 percent accuracy). The unigram results classified by the SVM algorithm are at a similar level. After initially running the SVM-Light experiments, [20] observed that the confusion matrix results were heavily skewed to the negative side for all the feature representation vectors. To correct this imbalance, they tried introducing a threshold variable into the results. This threshold variable was set to -0.2, and whenever a blog was assigned a value greater than -0.2 it was classified as positive. Although this modification did succeed in evening out the confusion matrix, it did not increase the overall accuracy of the results.
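The threshold correction described above amounts to shifting the decision boundary of the classifier. A minimal sketch, assuming the raw SVM decision values are available as a list of floats:

```python
def classify_with_threshold(decision_values, threshold=-0.2):
    """Re-label raw decision values: anything above the shifted
    threshold counts as positive, countering a negative skew.
    The default of -0.2 is the value reported in [20]."""
    return ["positive" if v > threshold else "negative"
            for v in decision_values]

print(classify_with_threshold([-0.5, -0.1, 0.3]))
# ['negative', 'positive', 'positive']
```

With the default threshold of 0.0, the middle value (-0.1) would have been labelled negative; shifting the boundary moves such borderline cases to the positive class, which evens out the confusion matrix without necessarily improving overall accuracy.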
The results from tables 2.1 and 2.2 clearly indicate the superiority of machine learning approaches: even the worst of the ML results is better than the best of the lexical results. It seems that the lexical approaches rely too heavily on semantic information. As both the lexical and ML approaches demonstrated, the inclusion of "dictionary" information in an experiment does not automatically increase a method's performance. Even though the ML results are superior, one must not forget that for an ML approach to be successful, a large corpus of tagged training data must first be collected and annotated, which can be a challenging and expensive task.

2.5.3 Hybrid approach performance

Finally, a few researchers have proposed hybrid approaches for analyzing Twitter sentiments. For example, [26] and [17] propose a method to automatically create a training corpus using microblog-specific features like emoticons, which is subsequently used to train a classifier. Distant supervision is used by [27] to acquire sentiment data: they treat tweets ending in positive emoticons like [":)", ":-)", ":D"] as positive and those ending in [":(", ":-("] as negative. They built models using Naive Bayes, Maximum Entropy and Support Vector Machines (SVMs), and report that SVM outperforms the other classifiers. In terms of feature space, they try uni-gram and bi-gram models in conjunction with POS features, and note that the unigram model outperforms all other models; specifically, bigrams and POS features do not help. Table 2.3 shows the performance of hybrid approach variants.

2.6 Chapter conclusion

It is clear from the analysis of the literature that machine learning approaches have so far proved outstanding in delivering accurate results. Nevertheless, each approach has an edge depending upon the area of application.
With a small number of features, the lexical analysis technique remains respectable, whereas with large feature sets the machine learning technique dominates.

Table 2.3: Performance of hybrid approach variants.

    Approach                                                          Accuracy
    SVM and NB classifiers trained with Usenet newsgroups data [19]   70.0
    Class-two NB classifier trained with unlabelled data [17]         64.0
    Class-two NB classifier trained with Twitter data [18]            84.0

Both approaches have pros and cons. Lexical analysis is a ready-to-use technique that does not require any prior classification or training on datasets; it can be applied directly to live data, provided that the feature set is large. In the machine learning technique, by contrast, the classifier first needs to be fed or "trained" with raw datasets and tuned to cluster the sentiments into pre-defined classes; it then works efficiently on large texts with large feature support, but few features lead to low accuracy. In this chapter we presented a detailed literature review of the existing approaches and techniques. The next chapter describes the detailed design and analysis of the proposed hybrid naive bayes approach.

Chapter 3
Design and Analysis of Proposed Approach

We propose a hybrid approach for analysing the sentiments of Twitter users, drawing our inspiration from nature's inter-species hybridization of living organisms. The implementation methodology and the experimental setup are discussed in chapter 4.

3.1 Problem Statement

3.1.1 Problem Introduction

The field of sentiment analysis has always been a challenging area of research. Applying sentiment analysis to Open Social Networks (OSNs) like Twitter, MySpace and Facebook is a current trend in research. Millions of people use microblogging services like Twitter to express their views and opinions on day-to-day affairs.
Machine learning approaches have so far proved to outperform lexical approaches in terms of accuracy; however, the speed of lexical approaches is noteworthy. Almost all approaches aimed at classifying Twitter sentiments have so far fallen prey to the time-versus-performance dilemma. In the literature survey we summarized machine learning approaches yielding accuracy of up to 80 percent. We firmly believe that by hybridizing the lexical and machine learning approaches, results of up to 90 percent or more can be realised (see chapter 5).

3.1.2 Problem Definition

To propose a hybrid approach yielding competitive results, by hybridizing machine learning and lexical approaches, which captures and analyses the sentiments of users in an open social network like Twitter for exploring public opinion.

3.2 Proposed approach: Hybrid Naive Bayes

The respective strengths of the lexical approach (its speed) and the machine learning approach (its accuracy) are well known. A lexical approach is fast because of the predefined features (e.g. a dictionary) it employs for extracting sentiments: having a dictionary to refer to at runtime drastically reduces time consumption. To improve the accuracy of a lexical approach, however, the feature set has to be increased drastically, i.e. a very large dictionary of a variety of words with their frequencies has to be provided at runtime. This increases the overhead of the system and hence performance suffers; there is thus a constant trade-off between performance and time. On the other hand, machine learning approaches recursively learn and tune their features; given large input datasets, this improves accuracy well beyond what any lexical approach can achieve. However, because of this runtime tuning and learning, the system suffers a drastic increase in processing time.
Our goal is to propose an approach that combines the lexical and machine learning approaches and thereby exploits the best features of both. For this purpose we employ a Naive Bayes classifier and empower it with the English lexical dictionary SentiWordNet (refer chapter 4, section 1). Our hybrid naive bayes follows the customary four steps, namely data collection, preprocessing, training the classifier, and classification, shown in fig. 3.1. In the following sections we discuss each step in detail, one at a time. Figure 3.2 shows the system architecture of our proposed approach; the labels Phase I and Phase II shown in figure 3.2 are the deployment phases, which are discussed in the next chapter.

Figure 3.1: Process steps followed by hybrid naive bayes.

3.2.1 Data collection: Twitter API

For classification and for training the classifier we need Twitter data, for which we make use of the APIs Twitter provides. Twitter provides two APIs: the Streaming API1 and the REST API2. The difference between them is that the Streaming API supports long-lived connections and provides data in near real-time, whereas the REST APIs support short-lived connections and are rate-limited (one can download a certain amount of data [~150 tweets per hour] per day, but no more). The REST APIs allow access to Twitter data such as status updates and user info regardless of time; however, Twitter does not make data older than about a week available, so REST access is limited to data tweeted within roughly the last week. Thus, while the REST API allows access to these accumulated data, the Streaming API enables access to data as it is being tweeted.

1 Twitter Stream API [https://dev.twitter.com/docs/streaming-apis]: Twitter gives developers and researchers easy access to realtime Twitter data through its Stream API, subject to given constraints on API rate and other privacy policies.
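As an illustrative sketch only, a search-style GET request can be assembled from the notable arguments of the search API (q, lang, rpp, page); the endpoint shown is assumed from the then-current v1 search interface, since the exact URL is not reproduced in this report.

```python
from urllib.parse import urlencode

# Assumed historical v1 search endpoint (illustrative only; this
# endpoint has since been retired by Twitter).
BASE = "http://search.twitter.com/search.json"

def search_url(q, lang="en", rpp=100, page=1):
    """Build a search request URL returning `rpp` JSON-formatted
    results per page for the query string `q`."""
    return BASE + "?" + urlencode({"q": q, "lang": lang,
                                   "rpp": rpp, "page": page})

print(search_url("stuff"))
```

For example, `search_url("stuff")` would correspond to a request for 100 posts containing the word "stuff", formatted as JSON.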
2 http://dev.Twitter.com/do

Figure 3.2: System architecture of hybrid naive bayes approach.

The search API provides users the ability to access Twitter's search functionality. It uses GET requests and returns results formatted as ATOM or JSON; JSON is recommended for its compactness. A maximum of 100 results are returned per page, for a maximum of 15 pages. Search requests take the form of a GET request with the following notable arguments.

• q: The query string (required).
• lang: Restricts results to the specified language, using its ISO 639-1 code.
• rpp: Results to return per page.
• page: Result page number to return.

An example is a URL request which returns 100 posts containing the word "stuff", formatted using JSON. Next we follow a preprocessing step, which is essential for the tweets/text acquired in this step.

3.2.2 Preprocessing

The tweets gathered from Twitter are a mixture of URLs and other non-sentimental data such as hashtags ("#"), annotations ("@") and retweets ("RT"). To obtain n-gram features, we first have to tokenize the text input; tweets pose a problem for standard tokenizers designed for formal, regular text. Figure 3.3 displays the various intermediate preprocessing steps; these are the features to be taken account of by the classifier.

Figure 3.3: A sequence of intermediate preprocessing steps taking place at this level.

We discuss each feature deployed in brief.

• Language detection: Since we are mainly interested in English text, all tweets are separated into English and non-English data. This is possible using NLTK's language detection feature.

• Tokenize: For a sample input text, say "Today the weather is sunny and beautiful", tokenizers divide strings into lists of substrings known as tokens.
Tokenizing the text makes it easy to separate out unnecessary symbols and punctuation and to filter out only those words that can add value to the sentiment polarity score of the text.

• Constructing n-grams: We make a set of n-grams out of consecutive words. A negation (such as "no" or "not") is attached to the word which precedes or follows it. For example, the sentence "I do not like fish" forms the bigrams "I do+not", "do+not like" and "not+like fish". Such a procedure improves the accuracy of the classification, since negation plays a special role in opinion and sentiment expression [28].

• Stop words: In information retrieval, it is a common tactic to ignore very common words such as "a", "an", "the", etc., since their appearance in a post does not provide any useful information for classifying a document. Since the query term itself should not be used to determine the sentiment of the post with respect to it, every query term is replaced with a QUERY keyword. Although this makes it somewhat of a stop word, it can still be useful when not using a bag-of-words model and the location of the query in relation to other words becomes important.

• Strip smileys: Many microblogging posts use emoticons to convey emotion, making them very useful for sentiment analysis. A range of about 30 emoticons, including ":)", ":(", ":D", "=]", ":]", "=)", "=[", "=(", are replaced with either a SMILE or a FROWN keyword. In addition, variations of laughter such as "haha" or "ahahaha" are all replaced with a single LAUGH keyword.

• Erratic casing: To address the problem of posts containing varied casings of words (e.g. "HeLLo"), we sanitize the input by lower-casing all words, which provides some consistency in the lexicon.

• Punctuation: In microblogging posts, it is common to use excessive punctuation in order to sidestep proper grammar and to convey emotion.
By identifying a series of exclamation marks, or a combination of exclamation and question marks, before removing all punctuation, relevant features are retained while more consistency is maintained.

3.2.3 Training Data

To label the text precisely into the respective classes and thus achieve the highest possible accuracy, we plan to train the classifier using pre-labelled Twitter data itself. Pre-labelled Twitter training data is not freely available, since this year Twitter changed its data privacy policies and no longer allows open/free sharing of Twitter content. However, Twitter does state that using or downloading its content for individual research purposes is acceptable.

Labelled data. Since we do not have direct access to pre-labelled Twitter data, we planned to crawl it ourselves. We crawled datasets of various sizes, with various keywords, totalling approximately 4 million tweets, using a custom Python-scripted crawler (refer chapter 4). Data obtained in this way is certainly not labelled, so to address this issue we crawl Twitter and form two different datasets: the first consisting of all the positive-sentiment tweets, i.e. those containing [":)", ":-)", ":-D", ":D", "B-)"], and the second consisting of all the negative-sentiment tweets, i.e. those containing [":(", ":-(", ":'(", "X(", "X-("]. We then feed these datasets to the classifier for training; they function almost like the hand-labelled datasets used in other sentiment analysis domains.

3.2.4 Sentiment Analysis - The classifier

Naive Bayes was our first choice, based on the inference from the literature review carried out in chapter 2. Naive Bayes is an algorithm based on the Bayesian probability model. In general, all Bayesian models are derivatives of the well-known Bayes rule, which states that the probability of a hypothesis given certain evidence, i.e.
the posterior probability of the hypothesis, can be obtained in terms of the prior probability of the evidence, the prior probability of the hypothesis, and the conditional probability of the evidence given the hypothesis. Mathematically,

    P(H|E) = P(H) P(E|H) / P(E)                                   (3.1)

where
P(H|E) - posterior probability of the hypothesis,
P(H)   - prior probability of the hypothesis,
P(E)   - prior probability of the evidence,
P(E|H) - conditional probability of the evidence given the hypothesis.

Or, in a simpler form:

    Posterior = (Prior × Likelihood) / Evidence                   (3.2)

To explain the concept, let us take an example. Suppose we have a new tweet to be classified into either the positive or the negative class, and that among the previously classified tweets the positive tweets are twice as numerous as the negative ones. Since the new tweet's class is not known, the problem is to estimate correctly the class into which the tweet should be categorised. This can be found with Bayes rule, by calculating the probabilities of the tweet being positive or negative. Hence, from eq. 3.1 we have:

    P(n|p) = P(n) P(p|n) / P(p)                                   (3.3)

Since there are twice as many positive tweets as negative ones, it is reasonable to believe that a new case (which has not been observed yet) is twice as likely to belong to the positive class rather than the negative one. In Bayesian analysis, this belief is known as the prior probability. Prior probabilities are based on previous experience - in this case the percentages of positive and negative tweets - and are often used to predict outcomes before they actually happen. Thus we can write:

    Prior probability of a positive tweet: P(p) = (No. of positive tweets) / (Total no. of tweets)
    Prior probability of a negative tweet: P(n) = (No. of negative tweets) / (Total no. of tweets)

Let there be a total of 6k tweets (where k = 10^3), 4k of which are positive and 2k negative. Our prior probabilities for class membership are:

    Prior probability of a positive tweet: P(p) = 4k/6k = 4/6 = 2/3
    Prior probability of a negative tweet: P(n) = 2k/6k = 2/6 = 1/3

The likelihood of the tweet falling into either of the classes is taken as equal, since we have only two classes; so the likelihood of X = 0.5. Calculating the posterior probability of the new tweet, say X, being positive or negative gives:

• Posterior probability of X being positive = (prior probability of positive) × (likelihood of X being positive) = 2/3 × 1/2 = 1/3 ≈ 33.34 percent chance of X being positive.
• Posterior probability of X being negative = (prior probability of negative) × (likelihood of X being negative) = 1/3 × 1/2 = 1/6 ≈ 16.67 percent chance of X being negative.

Thus this tweet falls into the positive class. In our case we have two hypotheses and many other features, on the basis of which the class with the highest probability is chosen for the tweet whose sentiment is being predicted. After every classification step, all the probabilities are recalculated and updated accordingly.

Chapter 4
Implementation Methodology

In this chapter the experimental setup and evaluation methodology for the proposed approach are discussed.

4.1 Experimental Setup

To test the proposed approach, we created a setup with the following system requirements. We tested our approach on both Linux and Windows platforms: a Dell Optiplex 980 Windows 7 Core-i5 (64-bit) machine equipped with 4 GB of RAM, and a Linux server system with a quad-core processor equipped with 8 GB of RAM. The general requirements are shown in table 4.1.

Table 4.1: General system requirements for our approach.

    Component                        Type
    Operating system                 Windows XP/7/8, Linux (Ubuntu Server 12.04)
    Processor                        C2D/i3/i5/i7 (32/64 bit)
    Min. memory (RAM)                ≥ 4 GB
    Min. storage                     20 GB
    Bandwidth                        Uninterrupted high-speed Internet (1 Mbps connection)
    Software and third-party tools   Linked Media Framework 2.3.5, NLTK 2.0, and SentiWordNet 3.0

The tools and technology used are as follows.

• Python 2.7 (implementation language): Python is a general-purpose, interpreted, high-level programming language whose design philosophy emphasizes code readability. Its syntax is clear and expressive, and Python has a large and comprehensive standard library with more than 25 thousand extension modules. We use Python for developing the backend of the test application crawler; this and the other modules implemented are discussed later.

• NLTK (language processing modules and validation): The Natural Language Toolkit (NLTK) is an open-source module for processing human language in Python, created in 2001 as part of a computational linguistics course in the Department of Computer and Information Science at the University of Pennsylvania. NLTK provides inbuilt support and easy-to-use interfaces for over 50 corpora and lexical resources. NLTK was designed with four goals in mind:

1. Simplicity: Provide an intuitive framework along with substantial building blocks, giving users practical knowledge of NLP without getting bogged down in the tedious housekeeping usually associated with processing annotated language data.
2. Consistency: Provide a uniform framework with consistent interfaces and data structures, and easily guessable method names.
3. Extensibility: Provide a structure into which new software modules can easily be accommodated, including alternative implementations and competing approaches to the same task.
4. Modularity: Provide components that can be used independently, without needing to understand the rest of the toolkit.

• LMF (persistent database): the Linked Media Framework. Unlike other database applications, this application requires a persistent database.
This is because we have to query Twitter in realtime and collect a certain number of tweets, so the database connection should be open all the time. This feature is only available in some database server applications, namely Google App Engine1 and the Linked Media Framework (LMF)2. The core component of the Linked Media Framework is a Linked Data Server that allows data to be exposed following the Linked Data Principles:

– Use URIs as names for things.
– Use HTTP URIs, so that people can look up those names.
– When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
– Include links to other URIs, so that they can discover more things.

The Linked Data Server implemented as part of the LMF goes beyond the Linked Data principles by extending them with Linked Data Updates and by integrating management of metadata and content, making both accessible in a uniform way. These extensions are described in more detail in LinkedMediaPrinciples. In addition to the Linked Data Server, the LMF Core also offers a highly configurable Semantic Search service and a SPARQL endpoint. Setting up and using the Semantic Search component is described in SemanticSearch; accessing the SPARQL endpoint is described in SPARQLEndpoint. Whereas the extension of the Linked Data principles is already conceptually well described, a proper specification and extension of Semantic Search and the SPARQL endpoint for Linked Data servers is still a work in progress. LMF consists of several modules, some of them optional, that extend the functionality of the Linked Media Server:

– LMF Semantic Search: offers a highly configurable Semantic Search service based on Apache SOLR. Several semantic search indexes can be configured in the same LMF instance.
– LMF Linked Data Cache: implements a cache for the Linked Data Cloud that is transparently used when querying the content of the LMF using either LDPath, SPARQL (to some extent) or the Semantic Search component. In case a local resource links to a remote resource in the Linked Data Cloud and this relationship is queried, the remote resource will be retrieved in the background and cached locally.
– LMF Reasoner: implements a rule-based reasoner that allows processing of Datalog-style rules over RDF triples; the LMF Reasoner will be based on the reasoning component developed in the KiWi project, the predecessor of the LMF.
– LMF Text Classification: provides basic statistical text classification services; multiple classifiers can be created, trained with sample data and used to classify texts into categories.
– LMF Versioning: implements versioning of metadata updates; the module allows getting metadata snapshots of a resource at any time in its history and provides an implementation of the Memento protocol.
– LMF Stanbol Integration: allows integrating with Apache Stanbol for content analysis and interlinking; the LMF provides some automatic configuration of Stanbol for common tasks.

1 Google App Engine: Google App Engine (often referred to as GAE or simply App Engine) is a platform-as-a-service (PaaS) cloud computing platform for developing and hosting web applications in Google-managed data centers. Applications are sandboxed and run across multiple servers. App Engine offers automatic scaling for web applications: as the number of requests for an application increases, App Engine automatically allocates more resources for the web application to handle the additional demand.
2 LMF: The Linked Media Framework is an easy-to-setup server application that bundles together some key open source projects to offer advanced services for linked media management.
• SentiWordNet 3.0: SentiWordNet is a lexical resource for opinion mining. It assigns to each synset of WordNet three sentiment scores: positivity, negativity and objectivity. WordNet groups English words into sets of synonyms called "synsets", provides short, general definitions, and records the various semantic relations between these synonym sets. The purpose is twofold: to produce a combination of dictionary and thesaurus that is more intuitively usable, and to support automatic text analysis and artificial intelligence applications. The database and software tools have been released under a BSD-style license and can be downloaded and used freely; the database can also be browsed online3. SentiWordNet is the result of research carried out by Andrea Esuli and Fabrizio Sebastiani. Given that the sum of the opinion-related scores assigned to a synset is always 1.0, it is possible to display these values in a triangle whose vertices are the maximum possible values of the three dimensions observed. Figure 4.1 shows the graphical model designed to display the scores of a synset; this model is used in the Web-based graphical user interface through which SentiWordNet can be freely accessed at their website4.

3 SentiWordNet available at: http://sentiwordnet.isti.cnr.it

Figure 4.1: Polarity triangle of a synset in SentiWordNet [courtesy: http://sentiwordnet.isti.cnr.it].

4.2 Evaluation

Accuracy is the ability of the classifier to correctly classify and label new tweets into their respective classes. To measure it, we use the Natural Language Toolkit (NLTK). The classifier is first run on the crawled Twitter data, classifying tweets for the keywords decided previously, and is thus trained. Thereafter, the classifier is run through NLTK's accuracy analysis function, which tests it on its own corpora. Thus, we obtain validated results.
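The accuracy measure used in this evaluation is simply the percentage of tweets whose predicted label matches the true label; a minimal sketch:

```python
def accuracy(predicted, actual):
    """Percentage of tweets whose predicted class matches the true class."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

# Toy example: 3 of 4 labels match.
print(accuracy(["pos", "neg", "pos", "pos"],
               ["pos", "neg", "neg", "pos"]))  # 75.0
```

This is the same quantity NLTK's accuracy analysis function computes, reported here on a 0-100 scale to match the result tables in chapter 5.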
4.3 Test Application

The application "TSA" (Tweet Sentiment Analyzer) has been deployed in two phases (refer chapter 3), namely:

• Phase I: Deploying the TSA (alpha) system with the base naive Bayes classifier and harvesting analyzer results.

• Phase II: Hybridizing the naive Bayes classifier (beta) by embedding the SentiWordNet 3.0 lexicon dictionary (software available at http://patty.isti.cnr.it/esuli/software/SentiWordNet). The analyzer is run through NLTK tests and the results are extracted. Later, LMF 2.3.5 is integrated for analyzing real-time Twitter data on the fly.

Figure 4.2 shows the class diagram of the TSA application, generated using the PyNSource tool. PyNSource can convert Python source into Java files (just the class declarations, not the body of the code), so the resulting Java files can be imported into a more sophisticated UML modelling tool that understands Java. There the UML can be auto-laid-out, and the resulting diagrams printed or captured as screenshots. A script can even automate the Java code generation, so that re-importing into a Java-based UML modelling tool brings in only the changes to the diagram without losing the custom layout. PyNSource can also generate Delphi (Object Pascal) code.

Figure 4.2: Class diagram of Tweet Sentiment Analyzer.

The classes in olive green are the root classes. The other mixed-color classes serve as intermediate functions for the root classes, while the ones in greyish-yellow are the terminal classes.

Chapter 5 Results and Analysis

In this chapter, the results of Phase I and Phase II are presented along with their analysis.

5.1 Results

The results of both the existing classifier and the proposed hybrid classifier are presented and compared in the order of deployment, i.e. Phase I & II (refer chapter 4).
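As context for these results, the Phase II hybridization idea can be sketched as follows: smoothed naive Bayes word likelihoods are combined with a shift contributed by tokens found in a sentiment lexicon. The tiny lexicon, the training pairs, and the weighting factor LEXICON_WEIGHT below are illustrative assumptions, not the actual SentiWordNet data or the exact combination scheme used by the TSA application.

```python
import math
from collections import Counter

# Illustrative polarity lexicon (stand-in for SentiWordNet scores).
LEXICON = {"great": 0.8, "love": 0.7, "terrible": -0.9, "hate": -0.8}
LEXICON_WEIGHT = 1.0  # assumed relative weight of the lexicon component

def train_counts(labelled_tweets):
    """Count word occurrences per class label."""
    counts = {"pos": Counter(), "neg": Counter()}
    for text, label in labelled_tweets:
        counts[label].update(text.lower().split())
    return counts

def hybrid_score(text, counts):
    """Positive score => classify positive, negative => classify negative."""
    tokens = text.lower().split()
    vocab = set(counts["pos"]) | set(counts["neg"])
    score = 0.0
    for tok in tokens:
        # Laplace-smoothed log-likelihood ratio (naive Bayes component).
        p = (counts["pos"][tok] + 1) / (sum(counts["pos"].values()) + len(vocab) + 1)
        n = (counts["neg"][tok] + 1) / (sum(counts["neg"].values()) + len(vocab) + 1)
        score += math.log(p / n)
        # Lexicon component: shift by the prior polarity if the word is known.
        score += LEXICON_WEIGHT * LEXICON.get(tok, 0.0)
    return score

training = [("love this movie", "pos"), ("hate this movie", "neg")]
counts = train_counts(training)
print("pos" if hybrid_score("i love it, great film", counts) > 0 else "neg")
```

The lexicon term is what gives the hybrid classifier useful polarity information "beforehand", even for words that are rare in the training data; shorthand tokens like "2day" miss the lexicon lookup, which matches the accuracy degradation discussed in chapter 5.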
• TSA Phase I: Tests were carried out using multiple Twitter datasets consisting of a mixture of new and old keywords such as "#ironman3", "#amitabhbachhan", "#Google", "#twitter", and "#robertdowneyjr". Our datasets vary in size from a few hundred to a couple of million tweets, with the goal of testing scalability while also ensuring performance. The performance of the base naive Bayes classifier with respect to dataset size is shown in Table 5.1 (K = thousand, M = million).

Table 5.1: Performance of base naive bayes classifier.

Dataset size | 1K    | 10K   | 50K   | 100K  | 1M
Accuracy (%) | 28.54 | 29.57 | 50.77 | 59.96 | 70.02

• TSA Phase II: After integrating the SentiWordNet lexicon dictionary, the same procedure was carried out on the hybrid naive Bayes classifier, and the results in Table 5.2 were harvested. Note that the same datasets were used for both classifiers.

Table 5.2: Performance of hybrid naive bayes classifier.

Dataset size | 1K    | 10K   | 50K   | 100K  | 1M
Accuracy (%) | 63.95 | 97.50 | 98.60 | 95.48 | 94.50

Comparing the results of the base naive Bayes to the hybrid naive Bayes, we can straightforwardly conclude that the proposed hybrid classifier clearly outperforms the earlier approach. The main reason for the latter approach's higher accuracy is the availability of polarity values beforehand. However, it can be observed that, in the case of the hybrid naive Bayes, there is a gradual and constant decline in accuracy as the number of tweets increases above 50K. The main factors behind this phenomenon are:

– Quality of dataset - It is observed that while crawling a large dataset of tweets, periods of crawler inactivity resulting from network failure may lead to a broken or noisy dataset.

– Use of shorthand words - i.e.
people generally tend to write short forms of words: "today" becomes "2day", "that" is written as "dat", and so on. These kinds of mixed literal lexicons, when searched through the dictionary, do not count as a match or hit, thus resulting in degraded classifier accuracy.

• Most Informative Features: After training the naive Bayes classifier with the tweets and comments, we asked the classifier to show the 25 most informative features for recognizing whether a text should be classified as positive or negative. The ratio (neg:pos) or (pos:neg) indicates how much more often the particular word has been used in one sentiment than in the other. For instance, the word "neverknow" has been used 6.2 times more often as a negative sentiment than a positive sentiment in the text (last word in figure 5.2).

Figure 5.1: Comparison of classifier performances, Dataset size (no. of tweets) vs Accuracy (%).

Some snapshots of the working system are presented. Figures 5.3 and 5.4 display the base naive Bayes classifier working on a Windows platform. Figures 5.5 and 5.6 display the hybrid naive Bayes classifier working on a Windows platform. Figures 5.7 and 5.8 display the base naive Bayes classifier working on a Linux server platform.

Figure 5.2: The most informative features of the hybrid naive Bayes classifier.

Figure 5.3: The base naive Bayes classifier in action on a Windows platform with a 50K tweets dataset.

Figure 5.4: Accuracy of the base naive Bayes classifier with a 50K tweets dataset.

Figure 5.5: The hybrid naive Bayes classifier in action on a Windows platform with a 50K tweets dataset.

Figure 5.6: Accuracy of the hybrid naive Bayes classifier with a 50K tweets dataset.

Figure 5.7: The base naive Bayes classifier in action on a Linux server with a 10K tweets dataset.
Figure 5.8: Accuracy of the base naive Bayes classifier on a Linux server with a 10K tweets dataset.

Chapter 6 Conclusion and Future Work

6.1 Conclusion

From the biological hybridization of species in real life, it is well known that the limitations of both species can be overcome. Applying the same attitude in our proposal, the experimental studies performed through chapters 4 and 5 successfully show that hybridizing the existing machine learning and lexical analysis techniques for sentiment classification yields comparatively more accurate results. For all the datasets of 10K tweets or larger, we recorded consistent accuracy of ≥ 90%. Given the success of the hybrid naive Bayes, it can be applied to other related sentiment analysis applications such as financial sentiment analysis (stock market opinion mining), customer feedback services, and so on.

6.2 Future Work

A substantial amount of work remains to be carried out; here we point toward possible future avenues of research.

• Interpreting Sarcasm: The proposed approach is currently incapable of interpreting sarcasm. In general, sarcasm is the use of irony to mock or convey contempt; in the context of the current work, sarcasm transforms the polarity of an apparently positive or negative utterance into its opposite. This limitation can be overcome by an exhaustive study of the fundamentals of "discourse-driven sentiment analysis". The main goal of this approach is to empirically identify lexical and pragmatic factors that distinguish sarcastic, positive, and negative usage of words.

• Multi-lingual support: Due to the lack of a multi-lingual lexical dictionary, it is currently not feasible to develop a multi-language sentiment analyzer. Further research can be carried out in making the classifiers language independent, as shown by [30].
The authors have proposed a sentiment analysis system based on support vector machines; a similar approach can be applied to our system to make it language independent.

Bibliography

[1] M. Bautin, L. Vijayarenu, and S. Skiena, "International sentiment analysis for news and blogs", In Second International Conference on Weblogs and Social Media (ICWSM), 2008.

[2] P. Turney, "Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews", In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 417-424, 2002.

[3] B. Pang and L. Lee, "Opinion mining and sentiment analysis", Foundations and Trends in Information Retrieval, vol. 2, no. 1-2, pp. 1-135, 2008.

[4] B. Pang and L. Lee, "Using very simple statistics for review search: An exploration", In Proceedings of the International Conference on Computational Linguistics (COLING), 2008.

[5] K. Dave, S. Lawrence, and D. Pennock, "Mining the peanut gallery: Opinion extraction and semantic classification of product reviews", pp. 519-528, 2003.

[6] N. Godbole, M. Srinivasaiah, and S. Skiena, "Large-scale sentiment analysis for news and blogs", 2007.

[7] A. Kennedy and D. Inkpen, "Sentiment Classification of Movie and Product Reviews Using Contextual Valence Shifters", Computational Intelligence, pp. 110-125, 2006.

[8] J. Kamps, M. Marx, and R. Mokken, "Using WordNet to Measure Semantic Orientation of Adjectives", LREC 2004, vol. IV, pp. 1115-1118, 2004.

[9] V. Hatzivassiloglou and J. Wiebe, "Effects of Adjective Orientation and Gradability on Sentence Subjectivity", In Proceedings of the 18th International Conference on Computational Linguistics, New Brunswick, NJ, 2000.

[10] A. Andreevskaia, S. Bergler, and M. Urseanu, "All Blogs Are Not Made Equal: Exploring Genre Differences in Sentiment Tagging of Blogs", In International Conference on Weblogs and Social Media (ICWSM-2007), Boulder, CO, 2007.

[11] P. Turney and M.
Littman, "Measuring Praise and Criticism: Inference of Semantic Orientation from Association", ACM Transactions on Information Systems, pp. 315-346, 2003.

[12] P. Stone, J. Dunphy, and D. Smith, "The General Inquirer: A Computer Approach to Content Analysis", MIT Press, Cambridge, 1966.

[13] Yahoo! Search Web Services, Online: http://developer.yahoo.com/search/, 2007.

[14] J. Akshay, "A Framework for Modeling Influence, Opinions and Structure in Social Media", In Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence, Vancouver, BC, pp. 1933-1934, July 2007.

[15] K. Durant and M. Smith, "Mining Sentiment Classification from Political Web Logs", In Proceedings of the Workshop on Web Mining and Web Usage Analysis of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (WebKDD-2006), Philadelphia, 2006.

[16] S. Prasad, "Micro-blogging sentiment analysis using bayesian classification methods", 2010.

[17] A. Pak and P. Paroubek, "Twitter based system: Using twitter for disambiguating sentiment ambiguous adjectives", pp. 436-439, 2010.

[18] B. Liu, X. Li, W. S. Lee, and P. S. Yu, "Text classification by labeling words", In Proceedings of the National Conference on Artificial Intelligence, pp. 425-430, AAAI Press / MIT Press, 2004.

[19] P. Melville, W. Gryc, and R. D. Lawrence, "Sentiment analysis of blogs by combining lexical knowledge with text classification", In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1275-1284, 2009.

[20] R. Prabowo and M. Thelwall, "Sentiment analysis: A combined approach", Journal of Informetrics, vol. 3, no. 2, pp. 143-157, 2009.

[21] M. Annett and G. Kondrak, "A comparison of sentiment analysis techniques: Polarizing movie blogs", pp. 25-35, 2008.

[22] Porter Stemmer (Java implementation), Online: http://tartarus.org/~martin/PorterStemmer/java.txt, 2007.

[23] C. Fellbaum, "WordNet: An Electronic Lexical Database.
Language, Speech, and Communication Series", MIT Press, Cambridge, 1998.

[24] K. Toutanova and C. Manning, "Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger", In Proceedings of EMNLP/VLC-2000, Hong Kong, China, pp. 63-71, 2000.

[25] I. H. Witten and E. Frank, "Data Mining: Practical Machine Learning Tools and Techniques", 2nd edition, Morgan Kaufmann, San Francisco, 2005.

[26] A. Go, R. Bhayani, and L. Huang, "Twitter sentiment classification using distant supervision", Technical report, Stanford, 2009.

[27] J. Read, "Using emoticons to reduce dependency in machine learning techniques for sentiment classification", In Proceedings of the Association for Computational Linguistics (ACL), 2005.

[28] T. Joachims, "Making Large-Scale SVM Learning Practical", In Advances in Kernel Methods - Support Vector Learning, pp. 169-184, MIT Press, Cambridge, 1999.

[29] J. Wiebe, T. Wilson, and C. Cardie, "Annotating expressions of opinions and emotions in language", Language Resources and Evaluation (formerly Computers and the Humanities), vol. 39, no. 2/3, pp. 164-210, 2005.

[30] S. Narr, M. Hülfenhaus, and S. Albayrak, "Language-independent twitter sentiment analysis", The 5th SNA-KDD Workshop, 2011.