
Real-time Sentiment Analysis in Social Media
Streams: The 2013 Confederation Cup Case
Paulo Rodrigo Cavalin, Maíra A. de C. Gatti,
Cicero Nogueira dos Santos and Claudio Pinhanez
IBM Research
Av. Tutóia 1157, São Paulo – Brazil
Email: {pcavalin, mairacg, cicerons, csantosp}@br.ibm.com
Abstract—From sport games to marketing campaigns, millions
of people react and interact on Online Social Networks (OSN),
especially on microblogging services such as Twitter. As people
react to these events, a large stream of data might be generated
per minute, and this stream could be processed for analyzing
human behavior on the network during the event’s lifespan.
Monitoring, analyzing and delivering these results in real time
and with enough accuracy is challenging. For this reason, in
this paper we present a system for real-time Machine Learning-based sentiment analysis of streaming Twitter data with high
availability and scalability. This system was implemented and
tested during games of the 2013 FIFA Confederations Cup for
Portuguese messages. The solution can be easily adapted to other
events and types of text categorization tasks.
I. INTRODUCTION
The messages posted on microblogs consist of short messages (tweets) that people use to provide updates on their
activities, observations and/or interesting content, directly or
indirectly to others [1]. The most famous microblogging online
social network is Twitter. The content of a tweet can be a
status update where the user says what she is doing, thinking
or feeling about herself or about an event that is happening
within or outside Twitter. Such events could be sports games,
such as soccer or basketball, or marketing campaigns about
politics, brands or products. In sports, for instance, a lot of
sentiment is generally expressed during each game, and people tend to
express such sentiment on OSNs. Recently, several approaches
have been proposed to tackle the problem of classifying the
sentiment of tweets using different methods. Among them we
can cite Machine Learning (ML), with supervised and semi-supervised learning approaches [2], [3], and lexicon-based
methods, which are unsupervised approaches [4], [5].
Given the large stream of data that is constantly generated
as people react to these events, it has become more and more
infeasible to manually process and analyze the whole stream
during the event’s lifespan. An automated system is needed
to analyze all this data as fast as possible, i.e. within seconds.
However, designing such a system is not trivial. First, the
system must be reliable enough to avoid information loss. This
calls for a highly available system, with redundancy and fault
tolerance mechanisms in place. Second, it must be able to deal
with high throughput rates, which leads to an infrastructure that
allows parallelism management. Thirdly, sentiment classifiers
should be able to work with limited time and space resources.
The training phase should handle unbalanced sentiment distributions and, in real time, the classifier should be adaptive.
In this paper, we present a system for real-time ML-based sentiment analysis of streaming Twitter data with high
availability and scalability, using the IBM InfoSphere Streams
(ISS) platform1 . IBM ISS allows applications to ingest,
analyze and correlate information as it arrives from many
real-time sources. The solution can handle very high data
throughput rates, up to millions of events or messages per
second. It is worth noting that it is not the goal of this paper
to advance the state of the art in natural language processing
nor in sentiment analysis. But to the best of our knowledge,
this is the first work that presents a real-time highly available
and reliable solution using ML-based classifiers for sentiment
analysis of messages posted during a high-audience event for
Portuguese.
The proposed solution has been implemented for performing sentiment analysis of streaming Twitter data during
games of the 2013 FIFA Confederations Cup. We ran the
implemented architecture on the IBM Smart Cloud Enterprise
(SCE)2 . During the matches, throughput spikes of 60 thousand
tweets per minute were observed. This solution can be
easily adapted to other events and types of text categorization
tasks. We also show how accurate our solution is.
This paper is organized as follows. Section II presents the
problem description and the related work in an integrated view.
In Section III we present the proposed solution for sentiment
analysis for the Confederations Cup, and in Section IV we
describe the validation of the system during the games. Finally, we conclude the paper and present the future work in
Section V.
II. PROBLEM DESCRIPTION
The FIFA Confederations Cup is an 8-team soccer competition that happens one year before the FIFA World Cup Finals,
generally hosted in the same country that will host the latter.
The 8 teams include each of FIFA’s six continental champions,
the host nation and the last World Cup champion. The FIFA 2013
Confederations Cup, hosted in Brazil, included teams from the
following nations: Brazil, Italy, Japan, Mexico, Nigeria, Spain,
Tahiti, and Uruguay. Italy represented the European continent
since Spain had won both the last World Cup and the Euro
Cup, and Italy was the runner-up in the latter.
Before, during and after each game, millions of messages
are posted on online social networks, most notably on Twitter,
with topics related to the games of the event. People express
what they feel about the players, the coach, the tactics and
so on. Many people have advice for the coach, for instance
which players should or should not be in the line-up, and
they share this advice on Twitter. This is a common practice
in Brazil, since soccer is very popular in the country. As a
consequence, by analyzing the sentiment of these tweets one
may understand what people want to say to the coach and the
players, and bring insights to them.

1 http://www.ibm.com/software/products/us/en/infosphere-streams/
2 http://www.ibm.com/services/us/en/cloud-enterprise/
Sentiment analysis consists of evaluating whether a message
expresses some sentiment about a specific topic. The sentiment
might be related to specific classes, such as positive, negative,
and neutral, or to different degrees of likeness, like the popular
five-star approach of review websites.

In the literature, sentiment analysis has generally been
handled as a Natural Language Processing task [8]. Several
approaches employ Machine Learning techniques to classify
features based on bags of n-grams or Part-of-Speech tags. Many
of the proposed approaches have been applied to problems such
as review websites and business intelligence [9], where the
language used tends to be more formal and less prone to
mistakes, since people take enough time to write these texts.

More recently, though, increasing attention has been paid
to sentiment analysis on large social networks such as Twitter
[10], [11]. Besides the challenges imposed by the limited size
of the messages, the massive data generated every day on
these data sources presents a continuously-growing source of
knowledge to tune the parameters of the classifiers.

With less space available for writing on microblogs, people
tend to find alternative and faster ways to write the same word,
like the English words ‘straight’ and ‘str8’, or ‘together’,
‘2gthr’ and ‘2getter’ [6]. Furthermore, microblogs are prone
to typing errors, owing to either messages written in a hurry
or the typing interface used, which is often a mobile device [7].

That said, this work is related to a real use case: analyzing,
in real time, the sentiment of tweets about the Brazilian team
during the FIFA 2013 Confederations Cup. We focused on
tweets in Portuguese, aiming at analyzing tweets posted mainly
by Brazilian users, i.e. fans of the Brazilian team.

Nonetheless, monitoring, analyzing and delivering these
results in real time to the coach and the players is challenging.
Sentiment analysis on Twitter is known to be non-trivial. In
addition, processing large amounts of tweets requires an
architecture and infrastructure that support real-time processing.

III. SYSTEM DESIGN

In this section we explain, in greater detail, how we gathered
data to train the sentiment classifier, how we trained it, and
how we implemented the entire system.

A. Main Workflow

Fig. 1. Main workflow. [The figure shows the text stream flowing through
seven modules: 1. Relevant text selection, 2. Tokenizer, 3. Preprocessing,
4. Topic classification, 5. Sentiment classification, 6. Postprocessing and
7. Statistics computation. The modules are supported by domain knowledge,
language knowledge, a topic list and a classifier, and produce interval i’s
analysis result.]

The proposed architecture is shown in Figure 1. The
workflow is divided into the following modules:

1) Relevant text selection: consists of evaluating whether
the text in the current message is relevant or not for the
application, considering domain knowledge. Messages
that are not of interest are simply discarded.
2) Tokenizer: this process consists of dividing the text into
a list of sub-strings, i.e. tokens, that represent meaningful
entities for the domain.
3) Preprocessing: the tokenized text is then preprocessed
for variability reduction, according to the language and
also the domain.
4) Topic classification: this step consists of finding topics
of interest in the text. All topics found are added to the
topics output list. If no topic is found, the tweet is
discarded.
5) Sentiment classification: an algorithm is used to classify
the sentiment of the text, considering a predefined set of
classes, e.g. positive, negative and neutral.
6) Postprocessing: the main goal of this module is to
prepare the previously-computed tokens for the subsequent
statistical analysis.
7) Statistics computation: in this module we compute the
main statistics for the interval. These statistics are related
to topics, terms, users, and so forth, and the sentiment of
the texts to which they are related. After the predefined
duration of the interval is over, the results are sent to the
output.

Note that different methods can be used to implement each
of the aforementioned modules, according to the application,
domain, language, and so forth.
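To make the data flow of Figure 1 concrete, the following is a minimal, single-threaded sketch in Python that chains the seven modules as plain functions over a stream of texts. The stage functions passed as arguments (select_relevant, tokenize, and so on) are hypothetical placeholders; in the deployed system each stage is an InfoSphere Streams operator.

# Minimal single-threaded sketch of the workflow in Figure 1.
# The stage functions are hypothetical placeholders for the real operators.
def process_stream(texts, select_relevant, tokenize, preprocess,
                   classify_topics, classify_sentiment, postprocess,
                   update_statistics):
    for text in texts:
        if not select_relevant(text):               # 1) relevant text selection
            continue
        tokens = preprocess(tokenize(text))         # 2) tokenizer, 3) preprocessing
        topics = classify_topics(tokens)            # 4) topic classification
        if not topics:                              # tweets without topics are discarded
            continue
        sentiment = classify_sentiment(tokens)      # 5) sentiment classification
        tokens = postprocess(tokens)                # 6) postprocessing
        update_statistics(tokens, topics, sentiment)  # 7) statistics computation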
B. Corpus
In order to train the classifiers for sentiment analysis, we
have built a training corpus comprising data posted on Twitter
during four friendly matches of the Brazilian team in 2013,
against the following teams: Italy, England, Bolivia and Chile.
About 1 million tweets were gathered from these
games.
We built an online interface for collaborative labeling of
the tweets into seven classes: Certainly Negative (CN),
Negative (N), Maybe Negative (MN), Neutral (E), Maybe
Positive (MP), Positive (P), and Certainly Positive (CP). We
divided both negative and positive sentiment into three distinct
classes to capture the degree of confidence with which the user
is able to associate a tweet with one of these two main
sentiment classes, where Maybe and Certainly represent less
and more confidence, respectively. An additional class, Don’t
Know (D), represents tweets for which the sentiment could not
be identified by the user.
Fifteen users participated in this labeling process, resulting
in a total of 2,092 tweets labeled (523 per game). It is worth
mentioning that the tweets to be labeled were selected based
on a roulette scheme, so as to balance the number of tweets
per game. Figure 2 shows the distribution of samples for each
class.
Fig. 2. The number of samples manually labeled in each sentiment class
(classes CN, N, MN, E, MP, P, CP and D; y-axis: total number of samples,
from 0 to 500).
1) Offline Learning: We used the corpus described in the
previous section to design a Machine Learning-based approach
for sentiment analysis. We considered a Naïve Bayes classifier
based on [10], reaching accuracy levels of about 82% with
only the positive and negative classes.
To train and validate the classifier for the given application, we made use of 4-fold cross-validation, where in each
resampling one game is used as the test set and the others
as the training set. For training, the CP and P classes were
combined into the single class P, and the same was done for
CN and N, so the classifier considered three distinct classes: N, E,
and P. Although it is known that the inclusion of the neutral
class generally results in lower accuracy [10], we opted to
consider this class since a significant amount of tweets can be
classified into it (see Figure 2). It is worth noting that the
class D was discarded for both training and test. The classes
MN and MP, on the other hand, were discarded for training,
with the goal of defining less noisy class boundaries for the
classifiers, but were kept for test, also merged into the N and P
classes respectively, to allow us to evaluate how the classifiers
behaved for these cases.
The average recognition rate achieved on the test sets was
about 60%, using the combination of unigrams and bigrams
for the feature set (we evaluated n-grams up to trigrams). For
preprocessing, we considered the same steps described in the
next section.
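As an illustration of this evaluation protocol, the sketch below runs the leave-one-game-out (4-fold) cross-validation with a Naive Bayes classifier over unigram and bigram counts. It assumes the labeled corpus is available as (text, label, game) triples; the scikit-learn pipeline is only a stand-in for the classifier actually used in the system.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def leave_one_game_out(corpus):
    """corpus: list of (text, label, game) with label in {'N', 'E', 'P'}.
    Returns the average accuracy over the four game-based folds."""
    games = sorted({game for _, _, game in corpus})
    accuracies = []
    for held_out in games:
        train = [(t, y) for t, y, g in corpus if g != held_out]
        test = [(t, y) for t, y, g in corpus if g == held_out]
        model = make_pipeline(
            CountVectorizer(ngram_range=(1, 2)),  # unigram + bigram counts
            MultinomialNB(),
        )
        model.fit([t for t, _ in train], [y for _, y in train])
        accuracies.append(model.score([t for t, _ in test],
                                      [y for _, y in test]))
    return sum(accuracies) / len(accuracies)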
C. Architecture Implementation
Here we provide additional details for the implementation
of the proposed architecture. First, we describe in detail how
we implemented each module shown in Figure 1, then we
explain how we optimized this implementation to run on a
parallel architecture.
1) Relevant text selection: we consider a set of dictionaries
that contain domain-specific knowledge, for
instance player names and common moves of soccer
players, and look for these entries in the text of each
tweet. The use of these dictionaries is intended to
classify tweets related to soccer. Five dictionaries
were considered. The first one contains words that,
by appearing only once in a tweet, yield strong
evidence that the tweet is about the desired domain,
for example the usernames of the Twitter accounts of
the football players. The second dictionary is composed of more common words that should appear
together at least twice. The third one includes words
that are useful to identify that a tweet is not relevant,
even though it contains some words that might be
associated with soccer but also with other domains.
The fourth dictionary has some positive and negative
adjectives, and the fifth one contains common names
of players, which are evaluated together with the
second or the fourth dictionary. These dictionaries
are evaluated in sequence using AQL queries. In the
following listing we illustrate how the first dictionary
is read from a file, and then how the dictionary is
used to match the text of the current tweet (R.text).
-- Load the domain dictionary with strong-evidence terms (Portuguese).
create dictionary First_Dict
    from file 'first.dict'
    with language as 'pt';

-- Mark every tweet whose text matches an entry of First_Dict.
create view Futebol1 as
    extract
        R.text as text,
        dictionary 'First_Dict' on
        R.text as match
    from Document R;
In the end, we select the tweet based on the count of
matched words from these dictionaries, as in this listing:

-- Aggregate the per-dictionary match indicators of each tweet.
create view ResultadoUnidoConsolidado as
    select R.text as getFootball,
           Sum(R.identificador) as filter
    from ResultadoUnido R
    group by R.text;
The tweets that do not contain any of the dictionary words,
or that contain words from the third dictionary, are discarded.
2) Tokenizer: we decompose the tweet into words,
punctuations, numbers, URLs, usernames, and hashtags. Blank characters, such as spaces, tabulations
and new lines, guide the tokenizer, except when
punctuation occurs together with another type of
token. In this case, the punctuation and the other
token are decoupled. In addition, domain knowledge
was used to find proper nouns composed of more than
one token, such as ‘Mario Balotelli’. In this case,
after the tokens ‘Mario’ and ’Balotelli’ are found,
a dictionary containing compound words is used to
convert them to a single token, i.e. ‘Mario Balotelli’.
In addition, if a punctuation mark occurs more than once
in a row, these occurrences are converted to a single
token representing that punctuation but with more
intensity, which might be evidence of sentiment.
For instance, the sequence ‘!!!!!!!’ is converted to the
token ‘!!’.
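A minimal sketch of the two tokenizer behaviors just described, i.e. collapsing repeated punctuation into a single intensified token and merging known multi-token proper nouns. The regular expression and the compound-noun dictionary are illustrative assumptions, not the actual resources of the system.

import re

# Hypothetical dictionary of compound proper nouns used to merge tokens.
COMPOUND_NOUNS = {("mario", "balotelli"): "Mario Balotelli"}

def tokenize(text):
    # Split into URLs, usernames, hashtags, words and punctuation runs.
    raw = re.findall(r"https?://\S+|@\w+|#\w+|\w+|[^\w\s]+", text)
    tokens = []
    for tok in raw:
        # Collapse runs of the same punctuation mark ('!!!!!!!' -> '!!').
        if not any(c.isalnum() for c in tok) and len(set(tok)) == 1 and len(tok) > 1:
            tok = tok[0] * 2
        tokens.append(tok)
    # Merge adjacent tokens that form a known compound proper noun.
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(t.lower() for t in tokens[i:i + 2])
        if pair in COMPOUND_NOUNS:
            merged.append(COMPOUND_NOUNS[pair])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(tokenize("Golaço do Mario Balotelli!!!!!!!"))
# ['Golaço', 'do', 'Mario Balotelli', '!!']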
3) Preprocessing: the steps considered for this module are: 1)
all URLs and usernames are converted to special
tokens representing all URLs and usernames, that is,
'URL' and 'USERNAME'; 2) the tokens
containing non-alphanumeric characters are removed,
except if they represent some special token such
as punctuation; 3) the tokens that contain several
repeated occurrences of the same letters are normalized, if possible, to the correct word with the
help of a language dictionary; 4) all words in a prespecified black list are removed, e.g. some stopwords; 5) semantic and lexical word disambiguation
is conducted on words that yield similar meanings or
that are mistyped (either on purpose or by mistake),
i.e. with the aid of a domain dictionary we check
all tokens that may represent the same entity or
expression and convert them to a single token that
represents that entity or expression.
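The sketch below illustrates steps 1 to 4 of this preprocessing under simplifying assumptions, with a toy language dictionary and stop-word list standing in for the real resources.

import re

# Toy resources standing in for the real language dictionary and black list.
KNOWN_WORDS = {"gol", "golaço", "lindo"}
STOP_WORDS = {"de", "do", "da", "que", "e"}

def preprocess(tokens):
    out = []
    for tok in tokens:
        low = tok.lower()
        if low.startswith("http"):               # 1) normalize URLs
            out.append("URL")
        elif low.startswith("@"):                # 1) normalize usernames
            out.append("USERNAME")
        elif not low.isalnum() and not re.fullmatch(r"[!?.,]+", low):
            continue                             # 2) drop non-alphanumeric tokens (keep punctuation)
        elif low in STOP_WORDS:
            continue                             # 4) drop black-listed words
        else:
            # 3) collapse repeated letters ('goooool' -> 'gol') when the result is a known word.
            squeezed = re.sub(r"(.)\1+", r"\1", low)
            out.append(squeezed if squeezed in KNOWN_WORDS and low not in KNOWN_WORDS else tok)
    return out

print(preprocess(["Goooool", "do", "Neymar", "!!", "http://t.co/x"]))
# ['gol', 'Neymar', '!!', 'URL']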
4) Topic classification: for this task we implemented a
lexicon-based topic categorizer. By considering a
list containing all desired topics, in our case the
players, the coach, the team itself and the referees,
we compare all current tokens with the entries in this
list. As a result, all topics found in the tokens are
added to the topics output list. If no topic is found,
the tweet is discarded.
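A sketch of this lexicon-based matching, assuming a small hypothetical topic lexicon that maps surface forms to canonical topic names:

# Hypothetical topic lexicon: surface form -> canonical topic.
TOPIC_LEXICON = {
    "neymar": "Neymar",
    "felipão": "Coach",
    "seleção": "Team",
    "juiz": "Referee",
}

def classify_topics(tokens):
    """Return the list of topics found in the tokens; an empty list
    means the tweet is discarded by the workflow."""
    found = []
    for tok in tokens:
        topic = TOPIC_LEXICON.get(tok.lower())
        if topic is not None and topic not in found:
            found.append(topic)
    return found

print(classify_topics(["gol", "Neymar", "!!", "URL"]))  # ['Neymar']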
5) Sentiment classification: we make use of a Naïve
Bayes classifier to classify the sentiment of the current tweet into three distinct classes: positive, negative, and neutral. For this task, both bag-of-words and
bigram features are considered. Even though domain
knowledge is not explicitly used by this module
in the proposed architecture, domain knowledge is
implicitly used by means of the previous modules
and the training of the classifier, which may benefit from
the use of domain-specific training data.
6) Postprocessing: one goal of this module is to conduct
stemming/lemmatization to reduce the number of
different words that might represent the same content,
and to remove other unwanted tokens that were not
removed in the preprocessing since they were useful
for sentiment analysis. For the lemmatization, we
consider a two-step procedure. First, a part-of-speech
tagger is used to find the verbs in the current list of
tokens. Then, the verbs are stemmed. For removing
the remaining unwanted tokens, we consider another
previously-defined black list.
7) Statistics computation: in this module we compute
statistics for Interval i, where 1 ≤ i. Such statistics
take into account the topics, terms, co-occurrences of
terms and co-occurrences of topics with terms that
appeared in the interval. That is, the goal is to count
the number of occurrences of each of these types
of statistics, for each sentiment, and to output those
which occur the most.
In greater detail, the process of finding terms is
straightforward: each token corresponds to a term
except when it is in the topics list. In that case, the
token is a topic. Pairs of terms are computed by
taking into account the co-occurrence of terms up to
distance D, which is pre-defined. Topic/term pairs
are similar to the latter, but consider the co-occurrence of a token representing a topic with a
token that is a term. All occurrences of these types
of information are stored in a map containing as key
the topic, term, pair of terms, or pair of topic/term (a
different map is used for each type), and the value
corresponds to the number of tweets observed for
each sentiment class in the current interval, i.e. a
single dimension array containing one position for
positive, one for negative and another for neutral.
After this is computed, the values in the map are
incremented or inserted. If a key is in the map already,
the value corresponding to the current sentiment of
the tweet is incremented by one. Otherwise, a new
key is inserted with value 1 for the current sentiment
and 0 for all the others.
When the interval is over, i.e. a tweet from the next
interval has been observed in the input, the maps are
sorted in a descending way, by considering the sum
of all sentiments. Next, these sorted lists are saved as
the statistics for interval i, and the maps are reset to
start computing the statistics for the new interval,
i.e. interval i + 1.
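The counting scheme described above can be sketched as follows. The interval handling, distance D and sentiment indices are illustrative simplifications of what the Streams operators actually maintain.

from collections import defaultdict

SENTIMENTS = {"positive": 0, "negative": 1, "neutral": 2}
D = 3  # maximum token distance for co-occurrence (illustrative value)

class IntervalStatistics:
    """Per-interval counts of topics, terms and co-occurrences, per sentiment."""

    def __init__(self):
        # Each map: key -> [positive, negative, neutral] tweet counts.
        self.topics = defaultdict(lambda: [0, 0, 0])
        self.terms = defaultdict(lambda: [0, 0, 0])
        self.term_pairs = defaultdict(lambda: [0, 0, 0])
        self.topic_term_pairs = defaultdict(lambda: [0, 0, 0])

    def update(self, tokens, topics, sentiment):
        s = SENTIMENTS[sentiment]
        topic_set = set(topics)
        for topic in topic_set:
            self.topics[topic][s] += 1
        for term in {t for t in tokens if t not in topic_set}:
            self.terms[term][s] += 1
        # Co-occurrences within distance D: term/term and topic/term pairs.
        for i, a in enumerate(tokens):
            for b in tokens[i + 1:i + 1 + D]:
                if a in topic_set and b not in topic_set:
                    self.topic_term_pairs[(a, b)][s] += 1
                elif b in topic_set and a not in topic_set:
                    self.topic_term_pairs[(b, a)][s] += 1
                elif a not in topic_set and b not in topic_set:
                    self.term_pairs[(a, b)][s] += 1

    def report(self, top_k=10):
        # Sort each map by the sum over all sentiments, in descending order.
        def top(counts):
            return sorted(counts.items(), key=lambda kv: sum(kv[1]),
                          reverse=True)[:top_k]
        return {"topics": top(self.topics), "terms": top(self.terms),
                "term pairs": top(self.term_pairs),
                "topic/term pairs": top(self.topic_term_pairs)}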
Another important implementation aspect that we took into
account was a further decomposition of each module to take
better advantage of a distributed architecture. In Figure 3 we
illustrate how the Preprocessing module was further decomposed. The
same was done for the other modules, considering the basic
operations they perform.
IV. RESULTS
The implemented system was tested during several games
of local Brazilian soccer competitions and other friendly
matches of the Brazilian national team. The system was
officially put in use during the FIFA 2013 Confederations Cup.
Although we designed the system to be able to handle
variations of user activity, our experience running it during
these games was beyond our expectations. Game followers
react extremely quickly and in large numbers to the main
events of a game.

Fig. 3. Preprocessing module decomposed into sub-modules (Remove URLs,
Remove usernames, Remove non-alphanumeric, Remove repeated, Remove
stop-words, and Disambiguation, applied in sequence from the input tweet to
the output).

Figure 4 shows the number of tweets (grouped in blocks
of 5 minutes) during one of the semifinals of the tournament,
the game Brazil 2x1 Uruguay. We started collecting one hour
before the game started (at 15:00, considering Brazil’s local
timezone) and finished the collection about half an hour
after the end of the game (at 18:30). The graph depicts only
the tweets selected for analysis, and the proportion between
tweets classified as positive, neutral, and negative in each
5-minute block. The graph also shows where the main events
of the game happened.

Fig. 4. Total of tweets posted during the Brazil vs Uruguay game (y-axis:
number of tweets per 5-minute block, broken down into negative, neutral
and positive; x-axis: game time from 1' to 90').

It is clear from Figure 4 that the number of tweets follows
closely the main events of the game. In the first half, a penalty
kick defended by the Brazilian goalkeeper increased almost
10-fold the number of tweets produced by fans. The reaction to
the goal from Uruguay in the early second half (1x1) was less
pronounced, but the scoring of the winning goal (2x1) by the
end of the game reached almost 50 thousand tweets produced
by fans and analyzed by the system in 5 minutes. We also see
a lot of chat after the game ends, in a pattern of decay taking
about 30 minutes, which we consistently observed in the other
games.

The focus and polarity of the comments also change wildly
as the game unfolds, following the performance and events of
individual players. Figure 5 shows the progression of negative
tweets for 5 players of the Brazilian national team during
the same game. There is a strong negative showing for the
defender David Luiz just after he caused an unnecessary
penalty infraction. Notice that the peak of negative sentiment
happens some minutes after the event, since it takes some time
for the audience to react to it. Similarly, there is a peak for the
other defender, Thiago Silva, in the early second half, as he is
blamed for the tying goal of Uruguay minutes before.

Fig. 5. Progression of negative tweets for 5 selected players of the Brazilian
national team (Bernard, David Luiz, Hulk, Neymar and Thiago Silva; y-axis:
number of tweets; x-axis: game time from 1' to 90').

However, not all major conversations are directly tied to
game plays. Figure 5 shows that the negative sentiment towards
the Brazilian forward Hulk is strong throughout the game,
sometimes peaking following other events of the game not
shown in the graph. Brazilian fans have expressed, consistently
during all four games of the cup, that they do not want
him in the team.

Finally, we have also observed in this and other games that
there are occasions in which a parallel conversation emerges
during the game, triggered by an event of the game. For
example, in the Brazil 2x1 Uruguay game, a player from
Brazil provoked another player by sending kisses when he
was substituted in the second half. Comments about this event
evolved into a more complex conversation about kissing and
irony, persisting well beyond the end of the game, generating
at least 7,000 tweets.

As this analysis shows, the volume of tweets during a large
event such as the ones we have monitored varies wildly and is
largely unpredictable. As noticed, even the game occurrences
are not always direct good predictors of tweet volume. In other
words, predicting demand in this kind of task is very difficult,
so to adequately manage this kind of workload we need flexible
but powerful architectures such as the one proposed in this
paper.

To complement this analysis, Figure 6 depicts the users
engaged on Twitter during the game. We observe that the
number of users that post messages increases in the same
intervals in which the number of messages increased. Nevertheless, by keeping track of the cumulative number of users
that started posting messages during the game, we can see that
the increase in the number of users is almost linear, with little
change during peaks. This suggests that most users that posted
during the peaks are generally the ones that had already posted
some message previously, and that they did not get engaged in
the social network because of the specific occurrence that
increased the number of messages related to the event, but
because of the event itself.

Fig. 6. User engagement during the Brazil versus Uruguay game (cumulative
and per-interval total of users; x-axis: game time from 1' to 90').
V. CONCLUSIONS AND FUTURE WORK
In this paper we presented the results for real-time sentiment analysis of social networks, aiming at keeping track of
what users post during a given event and the sentiment of their
tweets about it.
The system was tested during the FIFA 2013 Confederations Cup, to analyze the messages posted by Brazilian users
(in Portuguese) during each game of the Brazilian national team.
As future work, the evaluation of the architecture on
other types of events, for instance marketing campaigns, may
help evaluate how the system can be adapted and used in
other domains. Another promising direction is to investigate
unsupervised methods to help implement the system. In
this case, the adaptation to other domains and other languages
would benefit greatly from the results of these methods.
REFERENCES

[1] K. Ehrlich and N. S. Shami, “Microblogging inside and outside the
workplace,” in ICWSM, 2010.
[2] A. Celikyilmaz, D. Hakkani-Tur, and J. Feng, “Probabilistic model-based
sentiment analysis of twitter messages,” in Spoken Language Technology
Workshop (SLT), 2010 IEEE, 2010, pp. 79–84.
[3] A. Bakliwal, P. Arora, S. Madhappan, N. Kapre, M. Singh, and V. Varma,
“Mining sentiments from tweets,” in Proceedings of the 3rd Workshop in
Computational Approaches to Subjectivity and Sentiment Analysis. Jeju,
Korea: Association for Computational Linguistics, July 2012, pp. 11–18.
[Online]. Available: http://www.aclweb.org/anthology/W12-3704
[4] L. Li, Y. Xia, and P. Zhang, “An unsupervised approach to sentiment
word extraction in complex sentiment analysis,” International Journal of
Knowledge and Language Processing, vol. 2, no. 1, 2011.
[5] A. Hogenboom, D. Bal, F. Frasincar, M. Bal, F. de Jong, and U. Kaymak,
“Exploiting emoticons in sentiment analysis,” in Proceedings of the 28th
Annual ACM Symposium on Applied Computing, ser. SAC ’13. New York,
NY, USA: ACM, 2013, pp. 703–710. [Online]. Available:
http://doi.acm.org/10.1145/2480486.2480498
[6] F. Liu, F. Weng, and X. Jiang, “A broad-coverage normalization system
for social media language,” in Proceedings of the 50th Annual Meeting of
the Association for Computational Linguistics: Long Papers - Volume 1,
ser. ACL ’12. Stroudsburg, PA, USA: Association for Computational
Linguistics, 2012, pp. 1035–1044. [Online]. Available:
http://dl.acm.org/citation.cfm?id=2390524.2390662
[7] A. P. Felt and D. Wagner, “Phishing on mobile devices,” in W2SP, 2011.
[8] A. Agarwal, B. Xie, I. Vovsha, O. Rambow, and R. Passonneau,
“Sentiment analysis of twitter data,” in LSM ’11: Proceedings of the
Workshop on Languages in Social Media, 2011, pp. 30–38.
[9] B. Pang and L. Lee, “Opinion mining and sentiment analysis,”
Foundations and Trends in Information Retrieval, vol. 2, no. 1-2,
pp. 1–135, 2008.
[10] A. Go, L. Huang, and R. Bhayani, “Twitter sentiment analysis,” Stanford
University, Tech. Rep. CS224N, 2009.
[11] G. Paltoglou and M. Thelwall, “Twitter, myspace, digg: Unsupervised
sentiment analysis in social media,” ACM Transactions on Intelligent
Systems and Technology, vol. 3, no. 4, pp. 66:1–66:19, 2012.