Toward Social Media Opinion Mining for Sustainability Research

Computational Sustainability: Papers from the 2015 AAAI Workshop
Rundong Du, Zhongming Lu, Arka Pandit, Da Kuang,
John Crittenden, Haesun Park
Georgia Institute of Technology
Atlanta, Georgia 30332
Abstract
We propose to introduce social media opinion mining research into the field of computational sustainability. Opinion mining from social media can be a faster and less expensive alternative to traditional surveys and polling, on which much sustainability research is based. We describe a framework for such analysis, and examine the challenges in our proposed framework and the current status of research on those challenges. We also propose some possible research directions for tackling these challenges.
Introduction
In the area of urban sustainability and resilience, it is usually important to understand people’s attitudes toward certain products, amenities or designs. For example, urban planners and policy makers may want to know how people choose between “conventional sprawling community” and “smart growth neighborhood” (Lu et al. 2014); green product manufacturers may want to understand what makes consumers choose non-sustainable products over their green products (or the other way around) (Pickett-Baker and Ozaki 2008; Yam-Tang and Chan 1998); sustainability researchers and educators may want to know which sustainability-related hot topics people are talking about and people’s attitudes toward them. Traditionally, this information is obtained by conducting and analyzing surveys. However, conducting surveys is usually expensive: a survey with 1000 respondents usually costs tens of thousands of dollars to run (Braunsberger, Wybenga, and Gates 2007).
On the other hand, in today’s digital age, hundreds of millions of people are sharing their thoughts and opinions on social media like Twitter every day, and most social media content can be freely accessed by the public. It would be of great help if we could extract sustainability-related opinions from social media, as a faster and less expensive alternative to traditional survey and polling methods. Such analysis also significantly increases the variety of potential opinions, since people are talking about all kinds of things on social media.
For these reasons, we propose to bring social media opinion mining research based on topic and sentiment analysis into the field of computational sustainability. We will present a general framework for mining topics and opinions from social media. We will then summarize related work and explain why new methods need to be developed. We use Twitter as an example in this paper, but our framework also applies to other social media.
Framework for Mining Topics and Opinions from Social Media
For survey and polling methods, the first step is usually to carefully design questions that reflect the information we want to gather. However, mining opinions from social media is a rather passive process, which needs a different workflow. Given that tweets about sustainability are relatively rare compared to those discussing hot topics, it is better to first know what kind of information we can actually acquire from social media, so that we can design the questions and analysis methods in a more targeted and efficient way. Therefore, we propose the following framework, as illustrated in Figure 1.
Figure 1: Illustration of our proposed framework. Diagram elements: Social Media (e.g. Twitter); Step 1: Data Collector, Representative Data Collection; Step 2: Targeted Topic Modeling, Clusters of Related Documents; Step 3: Opinion Analysis, Meaningful Report.
Copyright © 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
In this framework, the first step is to collect data from social media. For Twitter, this step is usually done by querying the Twitter API. In our experience, while people do talk about sustainability on Twitter, it is hardly a hot topic. Directly querying Twitter’s sample API, which only returns a small random sample of all public statuses, will yield very few tweets about sustainability, far from enough for analysis. In a 5-month sample containing 293 million tweets, we found fewer than ten thousand tweets that contain the keywords we are interested in. Therefore, it is better to use the filter API to get tweets containing a set of predefined keywords.
The second step is to discover and clean up topics from the collected data by using a topic model (e.g. Blei, Ng, and Jordan 2003). The collected data may be about a specific product or a general topic. In the case of a specific product, such as solar panels, topic discovery can help product manufacturers understand how different attributes of their products affect consumers’ choices. In the case of a general topic, such as distributed energy, topic discovery is essential for researchers to get a more concrete view of people’s thoughts expressed on social media. However, traditional topic models are good at discovering the salient topics, or major themes, underlying a text collection, while the specific sustainability-related keywords we are targeting might not be revealed in the discovered topics. Therefore, we propose targeted topic models that aim at finding topics related to one’s needs and producing document clusters of better quality, so that related documents are separated from noise.
After the second step, documents will be clustered into the discovered topics, and we will have a good knowledge of the potential topics that we can design questions for. The last step is then to apply sentiment/opinion analysis to the clustered documents. It will help product manufacturers know which attributes of their products are liked or disliked by the public, and will help researchers know which green technologies are adopted more widely by the public. To answer specific survey questions, one may need to design specialized methods to analyze the data.
Challenges, Related Work and Future Directions
The idea of using Twitter as a polling mechanism is not new. O’Connor et al. (2010) revealed the potential of using text streams as a substitute or supplement for traditional polling, by showing correlation between sentiment word frequencies in contemporaneous Twitter messages and some survey results on consumer confidence and political opinions. They then proposed a more general goal of developing “query-driven sentiment analysis where one can ask more varied questions of what people are thinking based on text they are already writing”, which is very challenging and requires more research.
Their illuminating work also suggests that more advanced NLP techniques may be useful for more accurate estimation of public opinions. Such NLP techniques were referred to as “message retrieval” (identifying messages related to the topic) and “opinion estimation” (determining sentiment of these messages) in their paper, which also correspond to the two fundamental tasks in our framework: targeted topic discovery and sentiment analysis.
While much research has been done in the area of topic modeling and sentiment analysis, new methods still need to be developed for social media content. For topic discovery, researchers have built statistical models (Blei, Ng, and Jordan 2003; Hofmann 1999) and developed fast and large-scale algorithms (Kim and Park 2011; Kuang and Park 2013). However, those methods usually do not perform well on Twitter data due to the restricted length of Twitter messages, and they are not able to find high-quality targeted topics. Traditional sentiment analysis methods have similar issues on Twitter data (Kouloumpis, Wilson, and Moore 2011). While there are already many open challenges in the area of sentiment analysis, features of Twitter messages such as restricted length, casual language style, mixed use of symbols (like hashtags) and words, and a high frequency of grammar and spelling errors make such analysis even harder. Some efforts have been made to address these issues (Ramage, Dumais, and Liebling 2010; Hong and Davison 2010; Agarwal et al. 2011; Go, Bhayani, and Huang 2009), but building robust and trustworthy topic models and sentiment analysis methods remains an open challenge.
We may need a series of research efforts to tackle these challenges in query-driven analysis, targeted topic discovery and sentiment analysis. Although solving some of these issues perfectly may be as hard as developing human-level language processing methods, there is still a lot we can do to develop approximation algorithms or alternative methods that are good enough for practical use, as in many AI areas.
For query-driven analysis, we can investigate different types of survey questions and establish a general framework for converting survey problems into data analysis methods. We can also identify and utilize the types of survey questions that can be answered without very accurate sentiment analysis.
For targeted topic discovery, we are currently working on better pre-processing techniques and encoding methods for Twitter texts, which may make existing topic models more accurate on Twitter data. Also, we are looking for an unsupervised measure of topic/cluster quality to solve the issue of evaluating targeted topic models without labeled data.
For sentiment analysis, which is harder, we can try to develop aggregate opinion mining methods that do not require high accuracy of individual sentiment analysis (O’Connor et al. 2010).
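As an illustration of aggregate opinion estimation from sentiment word frequencies, the following is a minimal sketch in the spirit of O’Connor et al. (2010), not a reproduction of their method or of ours. Instead of classifying each tweet, it counts per day how many messages contain a positive word and how many contain a negative word, takes the ratio, and smooths it over a trailing window. The word lists, the helper names `daily_sentiment_ratio` and `moving_average`, and the sample tweets are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical toy lexicons (assumption); a real study would use an
# established subjectivity lexicon and proper tweet tokenization.
POSITIVE = {"love", "great", "clean", "efficient"}
NEGATIVE = {"hate", "expensive", "ugly", "broken"}


def daily_sentiment_ratio(tweets):
    """Return {day: positive_message_count / negative_message_count}.

    `tweets` is an iterable of (day, text) pairs. A message counts once
    as positive (or negative) if it contains at least one lexicon word.
    Days without any negative message are skipped to avoid dividing by zero.
    """
    pos = defaultdict(int)
    neg = defaultdict(int)
    for day, text in tweets:
        words = set(text.lower().split())
        if words & POSITIVE:
            pos[day] += 1
        if words & NEGATIVE:
            neg[day] += 1
    return {day: pos[day] / neg[day] for day in sorted(neg) if neg[day] > 0}


def moving_average(values, window):
    """Trailing mean over at most `window` most recent values."""
    smoothed = []
    for i in range(len(values)):
        recent = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(recent) / len(recent))
    return smoothed


# Hypothetical sample data: (day index, tweet text).
tweets = [
    (1, "love my new solar panels"),
    (1, "solar panels are too expensive"),
    (2, "great and efficient solar panels"),
    (2, "love solar power"),
    (2, "installation was expensive"),
    (3, "hate this broken inverter"),
    (3, "clean energy is great"),
]

ratios = daily_sentiment_ratio(tweets)            # {1: 1.0, 2: 2.0, 3: 1.0}
smoothed = moving_average(list(ratios.values()), window=2)  # [1.0, 1.5, 1.5]
```

Because the ratio aggregates over many messages, errors on individual tweets tend to matter less than they would for per-message classification, which is the motivation for aggregate methods. The resulting smoothed series could then be compared against a survey time series.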
Conclusion
Opinion mining based on topic modeling and sentiment analysis is a promising alternative to surveys and polling for sustainability researchers, urban planners and green product manufacturers. Plenty of research can be done in this area, which can not only enrich the field of computational sustainability but also motivate new research in natural language processing that is applicable to many other areas.
References
Agarwal, A.; Xie, B.; Vovsha, I.; Rambow, O.; and Passonneau, R. 2011. Sentiment analysis of Twitter data. In Proceedings of the Workshop on Languages in Social Media,
30–38. Association for Computational Linguistics.
Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent
Dirichlet allocation. Journal of Machine Learning Research 3:993–1022.
Braunsberger, K.; Wybenga, H.; and Gates, R. 2007. A
comparison of reliability between telephone and web-based
surveys. Journal of Business Research 60(7):758–764.
Go, A.; Bhayani, R.; and Huang, L. 2009. Twitter sentiment
classification using distant supervision. CS224N Project Report, Stanford 1–12.
Hofmann, T. 1999. Probabilistic latent semantic indexing.
In Proceedings of the 22nd annual international ACM SIGIR
conference on Research and development in information retrieval, 50–57. ACM.
Hong, L., and Davison, B. D. 2010. Empirical study of topic
modeling in twitter. In Proceedings of the First Workshop on
Social Media Analytics, 80–88. ACM.
Kim, J., and Park, H. 2011. Fast nonnegative matrix factorization: An active-set-like method and comparisons. SIAM
Journal on Scientific Computing 33(6):3261–3281.
Kouloumpis, E.; Wilson, T.; and Moore, J. 2011. Twitter
sentiment analysis: The good the bad and the omg! ICWSM
11:538–541.
Kuang, D., and Park, H. 2013. Fast rank-2 nonnegative
matrix factorization for hierarchical document clustering. In
Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, 739–747.
ACM.
Lu, Z.; Southworth, F.; Crittenden, J.; and Dunham-Jones,
E. 2014. Market potential for smart growth neighbourhoods
in the USA: A latent class analysis on heterogeneous preference and choice. Urban Studies 0042098014550956.
O’Connor, B.; Balasubramanyan, R.; Routledge, B. R.; and
Smith, N. A. 2010. From tweets to polls: Linking text
sentiment to public opinion time series. ICWSM 11:122–
129.
Owoputi, O.; O’Connor, B.; Dyer, C.; Gimpel, K.; Schneider, N.; and Smith, N. A. 2013. Improved part-of-speech
tagging for online conversational text with word clusters. In
HLT-NAACL, 380–390.
Pang, B., and Lee, L. 2008. Opinion mining and sentiment
analysis. Foundations and Trends in Information Retrieval
2(1-2):1–135.
Pickett-Baker, J., and Ozaki, R. 2008. Pro-environmental
products: marketing influence on consumer purchase decision. Journal of Consumer Marketing 25(5):281–293.
Ramage, D.; Dumais, S. T.; and Liebling, D. J. 2010. Characterizing microblogs with topic models. ICWSM 10:1–1.
Yam-Tang, E. P., and Chan, R. Y. 1998. Purchasing behaviours and perceptions of environmentally harmful products. Marketing Intelligence & Planning 16(6):356–362.