(Big) Data Analytics: From Word Counts to Population Opinions

Mark Keane
Insight@University College Dublin
October 2014 ~ RSS ~ Edinburgh
September 2014/EPIC
Outline
• What’s New About (Big) Data Analytics
• 3 Sample Cases:
– Google Queries Predicting Epidemics
– Networks of Influence
– Financial Opinions in a Stockmarket Bubble
• Take Home Messages
October 2014/RSS-Edinburgh
What’s New ?
Four Vs of Big Data
What’s New ?: The Suggestion of a…
• Brave new world of (new) data analysis…that can
• Handle vast amounts of data effortlessly…with
• Instant press-of-a-button answers…from
• Vast server farms of (almost free) computation
• But… there are significant issues
• And… there is a lot that is “old” (familiar)
What’s Old ?
• Good old-fashioned data analysis
• Many statistical ideas are very familiar
• Many research problems are familiar
• Proper collection of data is important
• Proper treatment of data is critical
What’s Really New ? An Approach
• Tipping-point with Very Large Data Sets
» from 100s to 1,000,000,000s of data points
• Unusual Types of Data
» video, text, thumbs-up, unstructured data
• Non-standard Data Sources
» social media (FB, Tweets), news, phones
• Data is not conventionally-measured
» the sensing devices are doing other things
In this New Big-Data World…
• Who we know, says a lot about who we are…
– Facebook friends, LinkedIn network, tweet followers
• What we write, says a lot about what we think…
– text in books, news, blogs, social media and so on
• Where we are located, says a lot about us…
– location-based sensing, GPS, IP-addresses
• What we do, says a lot about our decisions/interests…
– what we buy, web-sites visited, YouTube videos watched,
news re-tweeted, items shared and so on…
Three Sample Cases…
Finding Flu Outbreaks…
Case 1: Predicting Flu from Searches
• Google Flu Trends (GFT):
– aggregates search data, counting influenza keywords
• US Centers for Disease Control and Prevention (CDC):
– tracks influenza-like illnesses (ILIs) in outpatient data
• From 2003-2009:
– GFT showed high correlations with ILI stats (ILINet)
…until the 2009 influenza virus A (H1N1) pandemic [pH1N1]
Cook, S., Conrad, C., Fowlkes, A. L., & Mohebbi, M. H. (2011). Assessing Google flu trends performance in
the United States during the 2009 influenza virus A (H1N1) pandemic. PloS one, 6, e23610.
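The "high correlations" between GFT and ILINet can be checked with a plain Pearson coefficient over matched weekly series. The following is a minimal sketch, assuming two equal-length lists of weekly values; the function name and the numbers are made up for illustration and are not GFT or CDC data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical weekly values (illustrative only, not real GFT/ILINet figures).
gft_estimate = [1.2, 1.5, 2.1, 3.0, 2.6, 1.9]
ilinet_rate = [1.0, 1.4, 2.3, 3.2, 2.4, 1.7]
r = pearson_r(gft_estimate, ilinet_rate)
```

A high r on past seasons says the two series moved together; it does not guarantee the relationship holds when search behaviour changes, as it did in 2009.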
Good Correlations (Initially…)
Cook, S., Conrad, C., Fowlkes, A. L., & Mohebbi, M. H. (2011). Assessing Google flu trends performance in
the United States during the 2009 influenza virus A (H1N1) pandemic. PloS one, 6, e23610.
Hang on a sec…
In 2009, Google modified the model with new search terms…
The Message
• What we do, says a lot about our concerns…
– if I think I have flu, I look it up on Google
• Here, people’s illness is being defined by
– their search behaviour and keywords
• Population behaviour can be predicted (in locations)
– by aggregating these searches
• But,
– proper treatment of data is critical (keywords, normalising)
– a model is needed of what leads a user to use a certain search term
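One way to read "normalising": raw counts of flu-related queries rise whenever overall search traffic rises, so counts are usually expressed as a share of all queries in the same region and week. The following is a minimal sketch under that assumption; the keyword set, the query logs and the function name are hypothetical, and this is not a description of how GFT itself works.

```python
FLU_KEYWORDS = {"flu", "influenza", "fever"}  # hypothetical keyword list

def flu_query_share(queries, keywords=FLU_KEYWORDS):
    """Fraction of queries containing at least one flu keyword."""
    if not queries:
        return 0.0
    hits = sum(1 for q in queries if keywords & set(q.lower().split()))
    return hits / len(queries)

# Two hypothetical weeks with the same number of flu queries
# but different total search volume.
week_a = ["flu symptoms", "cheap flights", "weather dublin", "fever remedies"]
week_b = ["flu symptoms", "fever remedies"] + ["football scores"] * 6

share_a = flu_query_share(week_a)
share_b = flu_query_share(week_b)
```

Both weeks contain two flu queries, yet the share halves in the second week because total traffic doubled. The choice of keywords matters just as much: a term people search for out of curiosity rather than illness would inflate the share.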
Networks of Influence…
Case 2: Showing Networks of Influence
• Tracking news on Social Networks
– terrorists release YouTube videos
– politicians comment on Facebook
– celebs tweet intimacies
• Who you comment on, what you comment on, and where can reveal networks of influence
• Storyful is using an Insight system to curate lists of sources and propose new ones by analysing social networks
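The idea of proposing new sources from the network can be sketched with a simple rule: rank accounts outside the curated list by how many list members follow them. The account names, the `follows` mapping and the function name are hypothetical, and this is an illustration under that assumption rather than the Insight/Storyful system itself.

```python
from collections import Counter

def propose_sources(curated, follows, top_n=3):
    """Rank non-list accounts by number of curated members following them."""
    votes = Counter()
    for member in curated:
        for account in follows.get(member, ()):
            if account not in curated:
                votes[account] += 1
    # Sort by vote count (descending), then name, for a stable ordering.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [account for account, _ in ranked[:top_n]]

# Hypothetical curated list and follower data.
curated = {"reporter_a", "reporter_b", "reporter_c"}
follows = {
    "reporter_a": {"reporter_b", "stringer_x", "agency_y"},
    "reporter_b": {"stringer_x", "agency_y"},
    "reporter_c": {"stringer_x", "blogger_z"},
}
suggestions = propose_sources(curated, follows)
```

Here `stringer_x` is ranked first because all three list members follow that account. A practical system would likely combine such network signals with content signals before a human curator accepts a suggestion.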
Curated Lists of Sources (Large)
Greene, D., Sheridan, G., Smyth, B., & Cunningham, P. (2012). Aggregating content and network information to
curate Twitter user lists. In Proc. 4th ACM RecSys Workshop on Recommender Systems & The Social Web.
Automated Recommendation…
Greene, D., Sheridan, G., Smyth, B., & Cunningham, P. (2012). Aggregating content and network information to
curate Twitter user lists. In Proc. 4th ACM RecSys Workshop on Recommender Systems & The Social Web.
Networks in Syrian Conflict
Network of Syria-related Twitter
accounts active during late 2013
O'Callaghan, D., Prucha, N., Greene, D., Conway, M., Carthy, J., & Cunningham, P. (2014). Online Social Media in the
Syria Conflict: Encompassing the Extremes and the In-Betweens. arXiv preprint arXiv:1401.7535.
European Parliament Networks
Data analysed for 584
MEPs on Twitter during
July-Sept 2014.
Cross, J. P., & Greene, D. (2014). Tracking information flows in the Council of the European Union: A social network analysis. Under review.
Political Groupings…
Data analysed for 584
MEPs on Twitter during
July-Sept 2014.
Cross & Greene (2014)
The Outlier Party…
Data analysed for 584
MEPs on Twitter during
July-Sept 2014.
Cross & Greene (2014)
The Message
• Who we know, says a lot about who we are…
– Facebook friends, LinkedIn network, tweet followers
• I can be defined by
– the people I know/like/respect/follow (homophily)
• My behaviour can be predicted by
– assuming that like people act alike
• But,
– accuracy of those relationships is critical
– may not generalise from one domain to another
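The homophily assumption ("like people act alike") can be sketched as a majority vote over a person's contacts. The names, labels, `network` mapping and function name below are hypothetical; this is an illustration under that assumption, not a method taken from the cited studies.

```python
from collections import Counter

def predict_from_neighbours(person, network, known_labels):
    """Predict a label as the most common known label among contacts.

    Returns None when no contact has a known label. Ties are broken
    alphabetically so the result is deterministic.
    """
    counts = Counter(
        known_labels[n] for n in network.get(person, ()) if n in known_labels
    )
    if not counts:
        return None
    best = max(counts.values())
    return min(label for label, c in counts.items() if c == best)

# Hypothetical contacts and known affiliations.
network = {"dana": ["ann", "bob", "cat"], "eve": []}
known_labels = {"ann": "party_1", "bob": "party_1", "cat": "party_2"}
guess = predict_from_neighbours("dana", network, known_labels)
```

The guess is only as good as the links: a "follow" made out of curiosity or opposition counts the same as one made out of agreement, which is where the accuracy caveat bites.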
Tracking Bubble Behaviour…
Case 3: Tracking Herding & Market Bubbles
• Word frequencies reveal power-laws (Zipf’s Law)
• Bubble would show in herd-like use of language
• Power laws change systematically with herding
• Sentiment of phrases should also be trackable
Gerow, A., & Keane, M. T. (2011, July). Mining the web for the voice of the herd to track stock
market bubbles. IJCAI-2011. AAAI Press.
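Zipf's law says that word frequency falls off roughly as a power of rank, f(r) ∝ r^(-α), so log-frequency plotted against log-rank is close to a straight line with slope -α. The following is a minimal sketch that estimates α by ordinary least squares on the log-log values; the fitting choice and the function name are assumptions for illustration and not necessarily the procedure used in Gerow & Keane (2011).

```python
import math
from collections import Counter

def zipf_exponent(words):
    """Estimate alpha in f(r) ~ r**(-alpha) by least squares on log-log data."""
    freqs = sorted(Counter(words).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return -slope

# Synthetic corpus whose counts follow 1/rank exactly (120, 60, 40, 30, 24).
corpus = []
for rank, word in enumerate(["the", "of", "and", "to", "in"], start=1):
    corpus.extend([word] * (120 // rank))
alpha = zipf_exponent(corpus)
```

On the synthetic corpus the estimate comes out at approximately 1. Tracking how such an exponent shifts between time windows is one way to make "power laws change systematically with herding" measurable.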
Zipf’s Law & Moby Dick
Agreement ‘tween Commentators…
Agreeing to Agree in Power Laws of Words
Analysing Text in News
• 17,713 finance articles (FT, NYT, BBC)
• 4 years (Jan 2006-Jan 2010) including the 2007 crash
• 10,418,266 words; we extract nouns and verbs
• Correlations for verb distributions show:
– DJIA (r = .79), FTSE-100 (r = .78), NIKKEI-225 (r = .73)
• NB: prediction is another matter
Gerow, A., & Keane, M. T. (2011, July). Mining the web for the voice of the herd to track stock
market bubbles. IJCAI-2011. AAAI Press.
The Message
• What we write, says a lot about what we think…
– text in books, news, blogs, social media and so on
• Here, agreement in a population is being captured by
– carefully treated word frequencies
• Population beliefs can be tracked
– by a distributional analysis of changes in words
• But,
– proper treatment of words is critical (stop-words, syntax)
– sentiment analysis had to be based on human judgements
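"Carefully treated word frequencies" can be sketched as follows: drop stop-words, turn counts into relative frequencies for each period, and compare periods with a distance between distributions. Jensen-Shannon divergence is used here as one reasonable choice; the stop-word list, the sample sentences, the function names and the choice of measure are all assumptions for illustration, not the talk's stated method.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}  # tiny illustrative list

def word_distribution(text, stop_words=STOP_WORDS):
    """Relative frequencies of non-stop-words in a whitespace-tokenised text."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two word distributions."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl(a):
        # Kullback-Leibler divergence of a from the mixture m.
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

# Hypothetical text from two periods (no punctuation, for simplicity).
period_1 = word_distribution("the market is rising and shares rally")
period_2 = word_distribution("the market is falling and shares slump")
shift = js_divergence(period_1, period_2)
```

Without the stop-word step, high-frequency function words such as "the" and "is" would dominate both distributions and mask the change in content words.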
Some Conclusions…
In this New Big-Data World…
• Who we know, says a lot about who we are…
– Facebook friends, LinkedIn network, tweet followers
• What we write, says a lot about what we think…
– text in books, news, blogs, social media and so on
• Where we are located, says a lot about us…
– location-based sensing, GPS, IP-addresses
• What we do, says a lot about our decisions/interests…
– what we buy, web-sites visited, YouTube videos watched,
news re-tweeted, items shared and so on…
NOW ROUTINELY AVAILABLE AT A SMARTPHONE NEAR YOU
Promises and Caveats…
• Data analytics holds promise in tracking and predicting:
– population actions, beliefs, opinions, illnesses…
– changes in those actions, beliefs, opinions, illnesses…
• Challenges are in finding:
– right treatment of the data: selection/collation of data is still
critical, as is combining multiple data-sources
– right analytic methods: which, if any, are appropriate
– right interpretations: old-fashioned exclusion of variables and interpretation
The End