International Journal of Recent Innovation in Engineering and Research
Scientific Journal Impact Factor - 3.605 by SJIF
e- ISSN: 2456 – 2084
STRUCTURED DATA ANALYTICS OF SOCIAL NETWORK REVIEWS
USING SENTIMENT ANALYSIS - SVM ON HADOOP
Rajamohan V.R¹ and Dr. M Krishnamurthy²
¹ PG Scholar, Department of Computer Science and Engineering, KCG College of Technology, Chennai, India
² Professor, Department of Computer Science and Engineering, KCG College of Technology, Chennai, India
Abstract- Millions of users store information on social media every day. These data are gathered as
Big Data and analysed to extract useful information. Sentiment dictionaries contain numerous
inaccuracies and therefore cannot reliably categorize opinion results. Sentiment-based analysis is
the key to categorizing user feedback. The FSM and EEM algorithms used for word processing are not
efficient enough for analysing the results. The proposed model implements the SVM algorithm to
analyse Big Data collected from a social network. A Twitter-like application is created, and its
users' tweets are the input to the Big Data HDFS system. The data are stored in the DataNodes, and
the index is maintained in the NameNode. Tweets are clustered and classified based on the extracted
keywords. The tweets are analysed using sentiment analysis, and positive and negative tweets are
classified using the SVM algorithm. MapReduce is also implemented. The results are used to conclude
whether the analysed data are useful or not.
Key Words- Sentiment analysis, opinion poll, social network, SVM algorithm, Big Data, Twitter
Analysis.
I. INTRODUCTION
The digital age, also referred to as the information society, is characterized by ever-growing
volumes of information. Driven by the current generation of web applications, nearly limitless
connectivity, and an insatiable desire for sharing information, in particular among younger
generations, the volume of user-generated social media content is growing rapidly and is likely to
grow further as more people join in the future. People using the Web are constantly invited to share
their opinions and preferences with the rest of the world, which has led to an explosion of
opinionated blogs, reviews of products and services, and comments on tweets, pictures and other
content. This type of web-based content is increasingly recognized as a source of data that has
added value for multiple application domains. People use social media every day to upload new
information at any time. This information covers many categories, such as news, status updates,
geographic reports and breaking news. The opinions expressed in various web and media outlets
(e.g., blogs, newspapers) are an important yardstick for the success of a product or a government
policy. For instance, a product with consistently good reviews is likely to sell well. The general
approach to determining the overall orientation (i.e., positive or negative) of a sentence or
document is to analyse the orientations of its individual words. Sentiment dictionaries are utilized
to facilitate the summarization. There are numerous works that, given a sentiment lexicon, analyse
the structure of a sentence or document to infer its orientation, the holder of an opinion, the
sentiment of the opinion, etc. Several sentiment dictionaries have been created manually or
automatically, e.g., General Inquirer (GI), OpinionFinder (OF), Appraisal Lexicon (AL),
SentiWordNet (SWN) and Q-WordNet (QW). QW and SWN are lexical resources that classify the synsets
(senses) in WordNet according to their polarities. We call them sentiment sense dictionaries (SSD).
The word-level dictionaries (GI, OF and AL) consist of words manually annotated with their
corresponding polarities. We have noticed the following problems with the sentiment dictionaries.
They exhibit substantial (intra-dictionary) inaccuracies. For example, European has a negative
polarity in QW, while intuitively this synset has a neutral polarity. They have (inter-dictionary)
inconsistencies. For example, the adjective fast is positive in AL and negative in OF. These
sentiment dictionaries do not address the concept of polarity consistency of words/synsets. We
concentrate on the
@IJRIER-All rights Reserved -2017
Volume: 02 Issue: 04 April– 2017 (IJRIER)
concept of (in)consistency in this paper. This paper defines consistency among the polarities of
words/synsets within and across sentiment dictionaries and gives methods to check it. We provide two
examples to illustrate the problem addressed in this paper. The first example is the verbs deny and
disprove, which have positive and negative polarities, respectively, in OF. According to WordNet,
both words have a unique sense, which they share: disprove, deny (prove to be false), as in “The
physicist disproved his colleagues’ theories”. Assuming that WordNet has complete information about
the two words, it is rather strange that the words have distinct polarities. By manually checking
two authoritative English dictionaries, Oxford and Cambridge, we confirmed that the information
about deny and disprove in WordNet is the same as that in these dictionaries. So, the problem seems
to originate in OF. The second example is the verbs tantalize and taunt, which have positive and
negative polarities, respectively, in OF. They also have a unique sense in WordNet, which they
share. Again, there is a contradiction. In this case the Oxford dictionary mentions a sense of
tantalize that is missing from WordNet: “excite the senses or desires of (someone)”. This sense
conveys a positive polarity. Hence, tantalize conveys a positive sentiment when used with this
sense. In summary, these dictionaries contain conflicting information, and their use in sentiment
analysis tasks can give conflicting results. Manual checking of sentiment dictionaries for
inconsistency is a difficult endeavour. We regard word pairs such as deny and disprove as
inconsistent, and we aim to remove these inconsistencies from sentiment dictionaries. Note that an
inconsistency found via polarity analysis is not exclusively attributable to one party, i.e.,
either the sentiment dictionary or WordNet. Instead, as emphasized by the above examples, some
inconsistencies lie in the sentiment dictionaries, while others lie in WordNet. Hence, a by-product
of our polarity consistency analysis is that it can also find likely places where WordNet needs
closer attention.
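The two kinds of inconsistency described above can be illustrated with a minimal sketch. It is not
the checking method of this paper; it only uses the polarities quoted in the examples (fast in AL
and OF, deny and disprove in OF). The data layout and the function names
`inter_dictionary_conflicts` and `intra_dictionary_conflicts` are assumptions made for illustration.

```python
from collections import defaultdict

# Polarities taken from the examples in the text. The layout of this
# structure is a hypothetical choice, not a real dictionary format.
DICTIONARIES = {
    "AL": {"fast": "positive"},
    "OF": {"fast": "negative", "deny": "positive", "disprove": "negative"},
}

# Word pairs that share their only WordNet sense (the deny/disprove example).
SHARED_SENSE = [("deny", "disprove")]


def inter_dictionary_conflicts(dictionaries):
    """Return words that receive different polarities in different dictionaries."""
    seen = defaultdict(dict)
    for name, entries in dictionaries.items():
        for word, polarity in entries.items():
            seen[word][name] = polarity
    return {w: p for w, p in seen.items() if len(set(p.values())) > 1}


def intra_dictionary_conflicts(entries, shared_sense):
    """Return word pairs that share one sense but differ in polarity."""
    return [
        (a, b)
        for a, b in shared_sense
        if a in entries and b in entries and entries[a] != entries[b]
    ]


if __name__ == "__main__":
    print(inter_dictionary_conflicts(DICTIONARIES))
    print(intra_dictionary_conflicts(DICTIONARIES["OF"], SHARED_SENSE))
```

With the example data, the first check reports fast and the second reports the pair deny and
disprove. Real dictionaries would additionally require handling of words with several senses.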
II. RELATED WORK
Many studies address text mining. This paper considers user tweets on recent trends or news. The
analysis gives the count of users holding the majority opinion on a topic. The proposed work builds
on the existing papers discussed below.
Pedro Domingos and Geoff Hulten [8] proposed “Mining High-Speed Data Streams” for mining
high-speed data, where large amounts of data are analysed within a short period of time. They also
propose mapping and searching the content so that text data can be accessed accurately. They use a
decision tree to sort and store the data in a database, which makes it easy to retrieve data quickly
and to search the text data with the respective search key. For text mining, the Very Fast Decision
Tree (VFDT) learner handles attributes poorly and needs a large amount of memory. A further drawback
of this approach is that the accuracy of the result depends on the complexity of the concept.
E. Breck, Y. Choi, and C. Cardie [5] proposed “Identifying expressions of opinion in context”,
which describes how user opinions are expressed in text data. Extracting information about
subjectivity is an area of great interest to a variety of public and private interests. The authors
argue that successfully pursuing this research will require the same expression-level identification
as in factual information extraction. Their method is the first to directly approach the task of
extracting these expressions. It identifies whether a user expression is happy or sad, good or bad.
According to this research, it covers only 5% of user expressions. The authors also aim to build on
this expression-level identification towards systems that present the user with a comprehensive view
of the opinions expressed in text.
A. L. Maas, R. E. Daly, P. Pham, D. Huang, A. Ng, and C. Potts [7], in “Learning word vectors
for sentiment analysis”, describe text mining of a large data set to produce a robust benchmark
for movie reviews. Their work is similar to this paper but covers only movie reviews, whereas this
paper gathers all user tweets and analyses the required text data from the data sets. The authors
present a vector space model that learns word representations capturing semantic and sentiment
information. The model’s probabilistic foundation gives a theoretically justified technique for
word vector induction as an alternative to the overwhelming number of matrix factorization-based
techniques commonly used. The model is parametrized as a log-bilinear model following recent
success in using similar
Available Online at : www.ijrier.com
techniques for language models (Bengio et al., 2003; Collobert and Weston, 2008; Mnih and Hinton,
2007), and it is related to probabilistic latent topic models (Blei et al., 2003; Steyvers and
Griffiths, 2006). The authors parametrize the topical component of the model in a manner that aims
to capture word representations instead of latent topics. In their experiments, the method performed
better than LDA, which models latent topics directly. They extended the unsupervised model to
incorporate sentiment information and showed how this extended model can leverage the abundance of
sentiment-labeled texts available online to yield word representations that capture both sentiment
and semantic relations. They show the benefit of such representations on two sentiment
classification tasks, using existing datasets as well as a larger one that the authors release for
future research. These tasks involve relatively simple sentiment information, but the model is
highly flexible in this regard; it can be used to characterize a wide variety of annotations, and
thus is broadly applicable in the growing areas of sentiment analysis and retrieval.
C. Danescu-N.-M., G. Kossinets, J. Kleinberg, and L. Lee [2], in “How opinions are received by
online communities”, study how opinions are evaluated on amazon.com. Rather than asking “What did
Y think of X?”, they ask “What did Z think of Y’s opinion of X?”, and they analyse these data sets
in terms of social behaviour. This relates to the current paper, which also analyses the text data
of social network users. The authors observe that helpfulness evaluations on a site like Amazon.com
provide a way to assess how opinions are evaluated by members of an on-line community at a very
large scale. One result shows that the perceived helpfulness of a review depends not just on its
content, but also on the relation of its score to other scores. This dependence on the score
contrasts with a number of theories from sociology and social psychology, but is consistent with a
simple and natural model of individual bias in the presence of a mixture of opinion distributions.
The authors point out several directions for further research. First, the robustness of their
results across independent populations suggests that the phenomenon may be relevant to other
settings in which the evaluation of expressed opinions is a key social dynamic. Second, variations
in the effect (such as the magnitude of deviations above or below the mean) can be used to form
hypotheses about differences in the collective behaviour of the underlying populations. Finally, it
would be very interesting to consider social feedback mechanisms that might be capable of modifying
the observed effects, and to consider the possible outcomes of such a design problem for systems
enabling the expression and dissemination of opinions.
M. Kim and E. Hovy [3], in “Determining the sentiment of opinions”, proposed an approach to
polling opinions on social information. Sentiment recognition is a challenging and difficult part of
understanding opinions. The authors plan to extend their work to more difficult cases, such as
sentences with weak-opinion-bearing words or sentences with multiple opinions about a topic. To
improve identification of the opinion holder, they plan to use a parser to associate regions more
reliably with holders. They also intend to try other learning techniques, such as decision lists or
SVM. Nonetheless, as their experiments show, encouraging results can be obtained even with
relatively simple models and only a small amount of manual seeding effort.
III. PROPOSED SYSTEM
A. TWITTER APPLICATION
This project starts with a Twitter-like application, created using advanced Java technologies such
as JSP and Servlets. In the application, a user registers a new login with personal details such as
name, age, sex, mail ID, mobile number, user name and password. After registration, users can log in
to their personal accounts. A user can see other users' tweets and post new tweets that other users
can see. The server stores the user tweets and maintains the chat domain. The admin has full control
to create the mapping and start the sentiment analysis.
B. DATABASE
The database stores the user names and passwords, which are used to verify users at login. The
database also stores every new tweet and serves the stored tweets to all other users. The admin
accesses the database, collects the required tweets and data, and controls which tweets are gathered
from the database.
C. HADOOP
Hadoop is operated from the command line. It first loads the required information from the lib and
JAR files of the package, which contain what the Hadoop software needs in order to run. The Hadoop
code holds the SVM algorithm that counts the positive and negative tweets. Hadoop separates storage
duties between the NameNode and the DataNodes. The NameNode stores metadata about the data held in
the DataNodes, and the DataNodes store the actual data. In a multi-node cluster, the NameNode and
the DataNodes run on different machines. There is only one NameNode in a cluster and many
DataNodes, which is why the NameNode is called a single point of failure. There is also a Secondary
NameNode (SNN), which can run on a different system. It does not act as a NameNode; it stores the
image of the primary NameNode at certain checkpoints and is used as a backup to restore the
NameNode. The JobTracker is the master service that creates and runs jobs. It runs on the master
node, here the same machine as the NameNode, and allocates work to the TaskTrackers, which run on
the DataNodes. The TaskTrackers run the tasks and report their status to the JobTracker. In
YARN-based Hadoop versions, the ResourceManager (RM) is responsible for tracking the resources in a
cluster and scheduling applications (e.g., MapReduce jobs). Users are free to pick any state-store
for RM restart, but must use the ZooKeeper-based state-store to support RM high availability. The
ZooKeeper-based state-store supports a fencing mechanism to avoid a split-brain situation in which
multiple RMs assume they are active and edit the state-store at the same time.
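The division of labour between the NameNode and the DataNodes can be illustrated with a toy model.
The sketch below is not Hadoop code and does not use any Hadoop API; the classes `NameNode` and
`DataNode`, the block size and the round-robin replica placement are simplifying assumptions chosen
only to show that the NameNode keeps the index while the DataNodes keep the bytes.

```python
class DataNode:
    """Toy DataNode: stores the actual block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes


class NameNode:
    """Toy NameNode: stores only metadata (which DataNodes hold which block)."""

    def __init__(self, datanodes, block_size=16, replication=2):
        self.datanodes = datanodes
        self.block_size = block_size
        self.replication = replication
        self.index = {}  # filename -> list of (block_id, [DataNode names])

    def put(self, filename, data):
        """Split data into blocks and place each block on several DataNodes."""
        entries = []
        for offset in range(0, len(data), self.block_size):
            number = offset // self.block_size
            block_id = f"{filename}#{number}"
            chunk = data[offset:offset + self.block_size]
            # Hypothetical placement policy: consecutive nodes, round-robin.
            targets = [
                self.datanodes[(number + r) % len(self.datanodes)]
                for r in range(self.replication)
            ]
            for node in targets:
                node.blocks[block_id] = chunk
            entries.append((block_id, [node.name for node in targets]))
        self.index[filename] = entries

    def get(self, filename):
        """Reassemble a file by reading each block from any node that holds it."""
        by_name = {node.name: node for node in self.datanodes}
        parts = []
        for block_id, holders in self.index[filename]:
            holder = next(by_name[h] for h in holders if block_id in by_name[h].blocks)
            parts.append(holder.blocks[block_id])
        return b"".join(parts)
```

In this toy model, losing the contents of one DataNode still leaves a readable replica of every
block, whereas losing the `index` of the NameNode makes all files unreachable. This mirrors why the
NameNode is described above as a single point of failure.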
IV. SVM ALGORITHM
The SVM algorithm provides the counts of positive and negative user tweets. The Hadoop command
passes the tweets to be analysed to the SVM algorithm. The system architecture describes the flow of
this paper. It starts with the user tweets, which are stored in the database. The server then hands
the data to the Hadoop server for sentiment analysis. Hadoop uses the SVM algorithm to count the
positive and negative tweets. The SVM algorithm takes as input the user tweets, the keyword, the
alternative keyword, and the files that list the positive and negative words. Using these words, the
SVM algorithm separates the user tweets into classes with their respective counts. The result is
shown on the Hadoop result page of the GUI for the user to view.
Figure: SYSTEM ARCHITECTURE
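The paper does not give the training procedure or the feature representation of its SVM, so the
following is only a generic sketch of a linear SVM for two-class tweet classification. It assumes
bag-of-words counts over a small hypothetical vocabulary and trains with sub-gradient descent on the
regularised hinge loss; the tweets, the vocabulary and all function names are illustrative and are
not taken from the implemented system.

```python
def featurize(text, vocabulary):
    """Bag-of-words counts over a fixed vocabulary."""
    words = text.lower().split()
    return [float(words.count(term)) for term in vocabulary]


def decision(x, weights, bias):
    """Signed score of a feature vector under the linear model."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def train_linear_svm(samples, labels, epochs=200, learning_rate=0.1, reg=0.01):
    """Sub-gradient descent on reg/2 * ||w||^2 + max(0, 1 - y * f(x))."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            violated = y * decision(x, weights, bias) < 1.0
            # Gradient step on the L2 regulariser.
            weights = [w * (1.0 - learning_rate * reg) for w in weights]
            if violated:
                # Sub-gradient step on the hinge loss for this sample.
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
    return weights, bias


def predict(text, vocabulary, weights, bias):
    """Return +1 for a positive tweet and -1 for a negative tweet."""
    return 1 if decision(featurize(text, vocabulary), weights, bias) >= 0 else -1


# Hypothetical vocabulary and labelled tweets, for illustration only.
VOCABULARY = ["good", "great", "bad", "poor"]
TRAINING = [
    ("kcg college is good", 1),
    ("great campus and good faculty", 1),
    ("bad canteen", -1),
    ("poor transport and bad roads", -1),
]
SAMPLES = [featurize(text, VOCABULARY) for text, _ in TRAINING]
LABELS = [label for _, label in TRAINING]
WEIGHTS, BIAS = train_linear_svm(SAMPLES, LABELS)
```

A trained model of this kind assigns a class from learned word weights, which differs from simply
counting positive and negative words. In practice a library implementation and a much larger
labelled corpus would be used.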
The SVM algorithm works on the text-mining dataset. It first gathers the dataset from the database.
It then collects the data and sorts them by the user's search keyword to minimize the dataset. Then
this dataset is reduced to the required tweets, and Hadoop works on the reduced dataset. It
separates the data into a positive tweet count and a negative tweet count, using the keyword sets
stored in files in the database. Hadoop takes each user tweet and adds +1 for every positive word
found in it and -1 for every negative keyword. If the final word count is a positive integer, the
tweet is counted as a positive tweet; otherwise, it is counted as a negative tweet.
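The counting steps described above can be sketched as a map step and a reduce step. This is a plain
Python illustration and not the Hadoop job of this project; the word lists, the sample keywords in
the usage note and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical word lists; the project reads these from set files.
POSITIVE_WORDS = {"good", "great", "excellent"}
NEGATIVE_WORDS = {"bad", "poor", "worst"}


def matches(tweet, keywords):
    """Keep a tweet if it mentions the keyword or the alternative keyword."""
    text = tweet.lower()
    return any(keyword.lower() in text for keyword in keywords)


def score(tweet):
    """Add +1 for every positive word and -1 for every negative word."""
    total = 0
    for word in tweet.lower().split():
        word = word.strip(".,!?")
        if word in POSITIVE_WORDS:
            total += 1
        elif word in NEGATIVE_WORDS:
            total -= 1
    return total


def map_tweet(tweet):
    """Map step: emit (label, 1); any non-positive score counts as negative."""
    label = "positive" if score(tweet) > 0 else "negative"
    return (label, 1)


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per label."""
    counts = Counter()
    for label, n in pairs:
        counts[label] += n
    return dict(counts)


def analyse(tweets, keywords):
    """Filter by keyword, then run the map and reduce steps."""
    selected = [tweet for tweet in tweets if matches(tweet, keywords)]
    return reduce_counts(map_tweet(tweet) for tweet in selected)
```

For example, `analyse(tweets, ["KCG COLLEGE", "KCTECH"])` returns a dictionary of counts per label.
Note that, following the rule stated above, a tweet with a score of zero is counted as negative,
so neutral tweets inflate the negative count.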
V. RESULTS AND ANALYSIS
The sample result is the output for the search keyword “KCG COLLEGE” and the alternative keyword
“KCTECH”. The GUI result page shows a positive tweet count of 1022, a negative tweet count of 213,
and 51827 tweets overall in the database. The graph below shows the results of different analyses
with different tweet counts. The first analysis asks whether KCG COLLEGE is good or bad: 1022 users
tweeted positively and 213 users tweeted negatively. The next analysis concerns the Sunandha murder
case: 422 users tweeted that it was murder and 22 users tweeted that it was suicide. The next
analysis concerns the Cauvery river water issue: 1088 users supported the need for river water and
248 users opposed giving the river water to Tamil Nadu. The final analysis in this graph concerns
the farmer issue: 1180 users supported providing farmers with what they need and 210 users did not.
In the same way, this paper analyses many issues and recent trends through user tweets.
Figure: GRAPH FOR DIFFERENT ANALYSIS RESULT
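The raw counts reported above are easier to compare as shares. The short sketch below only
re-expresses the reported numbers; the topic labels and the helper name `majority_share` are
illustrative choices.

```python
# Counts reported in the text: (majority count, minority count) per analysis.
RESULTS = {
    "KCG College": (1022, 213),
    "Sunandha case": (422, 22),
    "Cauvery river water issue": (1088, 248),
    "Farmer issue": (1180, 210),
}
TOTAL_TWEETS_IN_DATABASE = 51827


def majority_share(first, second):
    """Fraction of the classified tweets that fall into the larger class."""
    return first / (first + second)


if __name__ == "__main__":
    for topic, (first, second) in RESULTS.items():
        classified = first + second
        print(f"{topic}: {classified} classified tweets, "
              f"{majority_share(first, second):.1%} in the majority class")
```

For the KCG College query, 1022 + 213 = 1235 tweets were classified, which is a small fraction of
the 51827 tweets in the database, and roughly 83% of the classified tweets are positive.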
VI. CONCLUSION AND FUTURE WORK
This paper addressed the problem of checking polarity consistency for sentiment word dictionaries.
This project proves that this problem is NP-complete and shows that, in practice, polarity
inconsistencies of words both within a dictionary and across dictionaries can be found using SAT
solvers. Sets of inconsistent words are pinpointed, which allows the dictionaries to be improved.
This project gives the result of analysing text tweets that hold positive and negative reviews about
any topic. The server holds the final output of the analysis, and the administrator then decides
whether the final result for the topic is a positive or a negative opinion. This work uses MapReduce
and delivers good results to the user and the admin. The SVM method is used to perform the reduce
function through the category and sub-category given by the administrator. Future work will gather
data sets from different social media applications, and more of them, to analyse user opinions.
Bayesian data analysis is intended to improve the parameters of the data content so that the
required data can be analysed. The Bayesian method observes the data format and places similar data
into a collective database from which the relevant data can be fetched. This helps to sort the
relevant data easily for MapReduce. The Bayesian method and the SVM algorithm provide the
required information to the server or admin to proceed further. Hadoop then gives the count of the
user tweets to the administrator.
REFERENCES
[1] B. Pang and L. Lee, “A sentimental education: Sentiment analysis using subjectivity summarization based on
minimum cuts,” in Proc. 42nd Annu. Meeting Assoc. Comput. Linguistics, 2004, pp. 271–278.
[2] C. Danescu-N.-M., G. Kossinets, J. Kleinberg, and L. Lee, “How opinions are received by online communities: A
case study on amazon.com helpfulness votes,” in Proc. 18th Int. Conf. World Wide Web, 2009, pp. 141–150.
[3] M. Kim and E. Hovy, “Determining the sentiment of opinions,” in Proc. 20th Int. Conf. Comput. Linguistics, 2004,
pp. 1367–1373.
[4] H. Takamura, T. Inui, and M. Okumura, “Extracting semantic orientations of words using spin model,” in Proc. 43rd
Annu. Meeting Assoc. Comput. Linguistics, 2005, pp. 133–140.
[5] E. Breck, Y. Choi, and C. Cardie, “Identifying expressions of opinion in context,” in Proc. 20th Int. Joint Conf.
Artif. Intell., 2007, pp. 2683–2688.
[6] X. Ding and B. Liu, “Resolving object and attribute coreference in opinion mining,” in Proc. 23rd Int. Conf.
Comput. Linguistics, 2010, pp. 268–276.
[7] A. L. Maas, R. E. Daly, P. Pham, D. Huang, A. Ng, and C. Potts, “Learning word vectors for sentiment analysis,” in
Proc. 49th Annu. Meeting Assoc. Comput. Linguistics, 2011, pp. 142–150.
[8] P. Domingos and G. Hulten, “Mining High-Speed Data Streams,” in Sixth ACM SIGKDD International
Conference, 2000.
[9] Kim Schouten and Flavius Frasincar, “Survey on Aspect-Level Sentiment Analysis,” IEEE Transactions on
Knowledge and Data Engineering, Vol. 28, No. 3, March 2016.
[10] R. Chen and K. Sivakumar, “Collective Mining of Bayesian Networks from Distributed Heterogeneous Data,” in
Proc. Knowledge and Information Systems, March 2014, Volume 6, Issue 2, pp. 164–187.
[11] L. Jia, C. Yu, and W. Meng, “The Effect of Negation on Sentiment Analysis and Retrieval Effectiveness,” in
Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM 2009). ACM, 2009,
pp. 1827–1830.
[12] M. Thelwall, K. Buckley, and G. Paltoglou, “Sentiment Strength Detection for the Social Web,” Journal of the
American Society for Information Science and Technology, vol. 63, no. 1, pp. 163–173, 2012.
[13] R. Narayanan, B. Liu, and A. Choudhary, “Sentiment Analysis of Conditional Sentences,” in Proceedings of the
Conference on Empirical Methods in Natural Language Processing (EMNLP 2009). ACL, 2009, pp. 180–189.
[14] H. Lakkaraju, C. Bhattacharyya, I. Bhattacharya, and S. Merugu, “Exploiting Coherence for the Simultaneous
Discovery of Latent Facets and Associated Sentiments,” in SIAM International Conference on Data Mining 2011
(SDM 2011). SIAM, 2011, pp. 498–509.