(Big) Data Analytics: From Word Counts to Population Opinions
Mark Keane
Insight@University College Dublin
October 2014 ~ RSS ~ Edinburgh

[Slides 2-6: images only; footer reads "September 2014/EPIC"]

Outline
• What’s New About (Big) Data Analytics
• 3 Sample Cases:
  – Google Queries Predicting Epidemics
  – Networks of Influence
  – Financial Opinions in a Stock Market Bubble
• Take Home Messages

What’s New?

Four Vs of Big Data

What’s New?: The Suggestion of a…
• Brave new world of (new) data analysis…that can
• Handle vast amounts of data effortlessly…with
• Instant press-of-a-button answers…from
• Vast server farms of (almost free) computation
• But…there are significant issues
• And…there is a lot that is “old” (familiar)

What’s Old?
• Good old-fashioned data analysis
• Many statistical ideas are very familiar
• Many research problems are familiar
• Proper collection of data is important
• Proper treatment of data is critical

What’s Really New? An Approach
• Tipping point with Very Large Data Sets
  » from 100s to 1,000,000,000s of data points
• Unusual Types of Data
  » video, text, thumbs-up, unstructured data
• Non-standard Data Sources
  » social media (FB, Tweets), news, phones
• Data is not conventionally measured
  » the sensing devices are doing other things

In this New Big-Data World…!
• Who we know says a lot about who we are…
  – Facebook friends, LinkedIn network, Twitter followers
• What we write says a lot about what we think…
  – text in books, news, blogs, social media and so on
• Where we are located says a lot about us…
  – location-based sensing, GPS, IP addresses
• What we do says a lot about our decisions/interests…
  – what we buy, websites visited, YouTube videos watched, news retweeted, items shared and so on…

Three Sample Cases…

Finding Flu Outbreaks…

Case 1: Predicting Flu from Searches
• Google Flu Trends (GFT):
  – aggregates search data, counting influenza keywords
• US Centers for Disease Control and Prevention (CDC):
  – tracks influenza-like illnesses (ILIs) in outpatient data
• From 2003-2009:
  – GFT showed high correlations with ILI stats (ILINet)
  – …until the 2009 influenza virus A (H1N1) pandemic [pH1N1]
Cook, S., Conrad, C., Fowlkes, A. L., & Mohebbi, M. H. (2011). Assessing Google flu trends performance in the United States during the 2009 influenza virus A (H1N1) pandemic. PLoS ONE, 6, e23610.

Good Correlations (Initially…)
Cook et al. (2011)
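The kind of comparison behind "high correlations with ILI stats" can be sketched in a few lines. This is only an illustration, not the GFT model: the weekly figures are invented, and the helper names `query_share` and `pearson_r` are hypothetical. It normalises flu-keyword counts by total search volume and then computes a Pearson correlation against an ILI series.

```python
from math import sqrt


def query_share(flu_queries, total_queries):
    """Normalise weekly flu-keyword counts by total weekly search volume."""
    return [f / t for f, t in zip(flu_queries, total_queries)]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length (at least 2 points)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical weekly figures, invented purely for illustration.
flu_queries = [120, 150, 310, 520, 480, 260]
total_queries = [100_000, 110_000, 120_000, 125_000, 118_000, 105_000]
ili_percent = [1.1, 1.3, 2.4, 4.0, 3.7, 2.2]  # stand-in for an ILINet series

shares = query_share(flu_queries, total_queries)
r = pearson_r(shares, ili_percent)
print(f"r = {r:.2f}")
```

A high r on past data says nothing about whether the relationship will hold once search behaviour changes, which is the point of the next slide.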
Hang on a sec…
In 2009, Google modified the model with new search terms…

The Message
• What we do says a lot about our concerns…
  – if I think I have flu, I look it up on Google
• Here, people’s illness is being defined by
  – their search behaviour and keywords
• Population behaviour can be predicted (in locations)
  – by aggregating these searches
• But,
  – proper treatment of data is critical (keywords, normalising)
  – a model is needed of what leads a user to use a certain search term

Networks of Influence…

Case 2: Showing Networks of Influence
• Tracking news on social networks
  – terrorists release YouTube videos
  – politicians comment on Facebook
  – celebs tweet intimacies
• Who you comment on, what you comment on and where can reveal networks of influence
• Storyful is using an Insight system to curate the lists of sources and propose new ones, by analysing social networks

Curated Lists of Sources (Large)
D. Greene, G. Sheridan, B. Smyth, & P. Cunningham (2012). Aggregating content and network information to curate Twitter user lists. In Proc. 4th ACM RecSys Workshop on Recommender Systems & The Social Web.

Automated Recommendation…
Greene et al. (2012)
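One simple way to "propose new sources" from network information is to count how many members of a curated list already follow a candidate account. The sketch below is an assumption made for illustration and is not the method of Greene et al. (2012), which also aggregates content information; the function `recommend_sources` and all account names are hypothetical.

```python
from collections import Counter


def recommend_sources(follows, curated, top_n=3):
    """Rank accounts outside the curated list by how many curated members follow them.

    follows: dict mapping an account name to the set of accounts it follows.
    curated: set of account names already on the curated list.
    """
    votes = Counter()
    for member in curated:
        for followed in follows.get(member, set()):
            if followed not in curated:
                votes[followed] += 1
    # Sort by vote count (descending), then by name so ties are deterministic.
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Hypothetical follower graph, invented purely for illustration.
follows = {
    "reporter_a": {"reporter_b", "stringer_x", "agency_y"},
    "reporter_b": {"stringer_x", "agency_y"},
    "reporter_c": {"stringer_x"},
}
curated = {"reporter_a", "reporter_b", "reporter_c"}

suggestions = recommend_sources(follows, curated)
print(suggestions)
```

Counting follows alone rewards popular accounts regardless of topic, which is one reason a real system would combine network signals with what the accounts actually post.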
Networks in Syrian Conflict
Network of Syria-related Twitter accounts active during late 2013
O'Callaghan, D., Prucha, N., Greene, D., Conway, M., Carthy, J., & Cunningham, P. (2014). Online Social Media in the Syria Conflict: Encompassing the Extremes and the In-Betweens. arXiv preprint arXiv:1401.7535.

European Parliament Networks
Data analysed for 584 MEPs on Twitter during July-Sept 2014.
J. P. Cross & D. Greene (2014). Tracking information flows in the Council of the European Union: A social network analysis. Under review.

Political Groupings…
Data analysed for 584 MEPs on Twitter during July-Sept 2014.
Cross & Greene (2014)

The Outlier Party…
Data analysed for 584 MEPs on Twitter during July-Sept 2014.
Cross & Greene (2014)

The Message
• Who we know says a lot about who we are…
  – Facebook friends, LinkedIn network, Twitter followers
• I can be defined by
  – the people I know/like/respect/follow (homophily)
• My behaviour can be predicted by
  – assuming that like people act alike
• But,
  – accuracy of those relationships is critical
  – may not generalise from one domain to another

Tracking Bubble Behaviour…

Case 3: Tracking Herding & Market Bubbles
• Word frequencies reveal power laws (Zipf’s Law)
• A bubble would show in herd-like use of language
• Power laws change systematically with herding
• Sentiment of phrases should also be trackable
Gerow, A., & Keane, M. T. (2011, July). Mining the web for the voice of the herd to track stock market bubbles. IJCAI-2011. AAAI Press.
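Zipf’s Law says that a word’s frequency falls off roughly as a power of its frequency rank, so a plot of log frequency against log rank is close to a straight line. A minimal sketch of how that slope might be estimated is given below. It is an assumption for illustration, not the analysis pipeline of Gerow & Keane (2011); the helper names `rank_frequency` and `zipf_slope` and the sample sentence are hypothetical.

```python
from collections import Counter
from math import log


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, with rank 1 for the most frequent word."""
    counts = Counter(tokens)
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank)."""
    pairs = rank_frequency(tokens)
    if len(pairs) < 2:
        raise ValueError("need at least two distinct words to fit a slope")
    xs = [log(rank) for rank, _ in pairs]
    ys = [log(freq) for _, freq in pairs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical snippet of commentary, invented purely for illustration.
text = "the market rose and the market fell and the herd followed the market"
slope = zipf_slope(text.split())
print(f"fitted log-log slope: {slope:.2f}")
```

Tracking how such a fitted slope shifts from one time window to the next is one way to look for the systematic change in power laws that the slide associates with herding. Real corpora would also need the careful word treatment (stop-words, syntax) that the talk stresses.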
Zipf’s Law & Moby Dick

Agreement ’tween Commentators…
Agreeing to Agree in Power Laws of Words

Analysing Text in News
• 17,713 finance articles (FT, NYT, BBC)
• 4 years (Jan 2006-Jan 2010) including 2007 crash
• 10,418,266 words; we extract nouns and verbs
• Correlations for verb distributions show:
  – DJIA (r = .79), FTSE-100 (r = .78), NIKKEI-225 (r = .73)
• NB: prediction is another matter
Gerow, A., & Keane, M. T. (2011, July). Mining the web for the voice of the herd to track stock market bubbles. IJCAI-2011. AAAI Press.

The Message
• What we write says a lot about what we think…
  – text in books, news, blogs, social media and so on
• Here, agreement in a population is being captured by
  – carefully treated word frequencies
• Population beliefs can be tracked
  – by a distributional analysis of changes in words
• But,
  – proper treatment of words is critical (stop-words, syntax)
  – sentiment analysis had to be based on human judgements

Some Conclusions…

In this New Big-Data World…!
• Who we know says a lot about who we are…
  – Facebook friends, LinkedIn network, Twitter followers
• What we write says a lot about what we think…
  – text in books, news, blogs, social media and so on
• Where we are located says a lot about us…
  – location-based sensing, GPS, IP addresses
• What we do says a lot about our decisions/interests…
  – what we buy, websites visited, YouTube videos watched, news retweeted, items shared and so on…

[Same slide repeated with a diagonal text overlay; the legible fragments appear to read "NOW ROUTINELY AVAILABLE … SMARTPHONE NEAR YOU"]

Promises and Caveats…
• Data analytics bears promise in tracking and predicting:
  – population actions, beliefs, opinions, illnesses…
  – changes in those actions, beliefs, opinions, illnesses…
• Challenges are in finding:
  – right treatment of the data: selection/collation of data is still critical, combining multiple data sources
  – right analytic methods: which, if any, are appropriate
  – right interpretations: old-fashioned exclusion-of-vars/interpretation

The End