Is BIG necessarily Better?

A few disappointing stories …
• Google Flu Trends
• 1936 US Presidential Election
• Finding interesting patterns

Story 1: Can Big Data Predict Flu Trends?
[Figure: disease incidence over time. Can Big Data disease detection pick up signals earlier? Traditional public health: confirmed information lagged 2-3 weeks.]

Using Big Data To Predict Flu Trends
• Importance: 250,000 to 500,000 deaths from respiratory illnesses worldwide
• During flu season, more people enter search queries concerning flu
• Each year 90 million American adults search the web for info about specific illnesses = LOTS OF DATA

Previous Attempts
• A Swedish website counted queries in order to track flu activity
• There was a strong correlation between the frequency of search terms containing “flu” and “influenza” and virologic surveillance data
• Small data
  – These models look for a very limited number of queries

Google Flu Trends
“We want to estimate flu activity based on more than just a few queries”
“We've found that certain search terms are good indicators of flu activity. Google Flu Trends uses aggregated Google search data to estimate current flu activity around the world in near real-time.”
http://www.google.org/flutrends/
• Predictions by Google Flu are 1-2 weeks ahead of CDC’s ILI (influenza-like illness) surveillance reports [Ginsberg Nature ‘09]

Very promising retrospective comparison!
“In April 2009, Dr. Brilliant said it epitomized the power of Google’s vaunted engineering prowess to make the world a better place, and he predicted that it would save untold numbers of lives.”

Google Flu Trends
• Took 50 million (!!) of the most common search queries from 2003 to 2008 and counted them weekly for each state
  – Normalized the data by dividing each count by the total searches for the week (thereby getting a percentage)
  – Tested each query for correlation with CDC data
  – Ranked the queries from most to least correlated
• Google added the top-ranked queries together, varying how many were included, to see what number would yield the most accurate results
• The magic number is 45

Google Flu Trends
[Figure: Google Flu estimate vs. national data; panels: U.S., Australia]

The Wonder of Big Data Analytics!
• Generate accurate estimates faster than CDC
  – CDC takes one to two weeks to process data and generate a flu activity report
  – It takes Google one to two days to generate an estimate
• Faster estimates mean that health officials can quickly direct resources to where the need is greatest
• And it is theory-free!

As time goes by …
• GFT’s predictions were way off

What happened?
• No one outside Google really knows!
• Some reasonable explanations
  – Panic strikes
  – N ≠ ALL
  – Algorithm dynamics
It is not the size that matters
Big data present great opportunities, BUT improper use may lead to erroneous predictions

Story 2: Can Big Data Predict Election Results?
• Two famous pre-election polls
  – a Literary Digest poll predicted the defeat of President Roosevelt in 1936, and
  – the Gallup Poll (and almost all other polls) predicted the defeat of President Truman in 1948.
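The claim that "it is not the size that matters" can be illustrated with a small simulation before looking at the polls in detail. This is only a sketch under assumed numbers: the population size, the true support level, the sample sizes, and the 3x over-representation of opponents are all hypothetical, chosen for illustration and not taken from any real poll.

```python
import random


def simulate_polls(n_population=200_000, true_support=0.62,
                   big_n=50_000, small_n=2_000, seed=0):
    """Compare a large biased sample with a small simple random sample.

    All parameters are hypothetical illustration values.
    """
    rng = random.Random(seed)

    # Each voter is True (supports the candidate) or False (opposes).
    population = [rng.random() < true_support for _ in range(n_population)]

    # Biased frame (assumption): opponents are 3x as likely to be on the
    # mailing list and to respond. Drawn with replacement for simplicity.
    weights = [1.0 if supports else 3.0 for supports in population]
    big_biased = rng.choices(population, weights=weights, k=big_n)

    # Simple random sample: every voter equally likely, no replacement.
    small_random = rng.sample(population, small_n)

    return {
        "truth": sum(population) / n_population,
        "big_biased": sum(big_biased) / big_n,
        "small_random": sum(small_random) / small_n,
    }


if __name__ == "__main__":
    for name, share in simulate_polls().items():
        print(f"{name:>12}: {share:.3f}")
```

Under these assumptions the large sample stays far from the true support level no matter how many responses are added, because more data only shrinks the random error, not the selection bias.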

1936 election and the Literary Digest survey
• The magazine had predicted every election (successfully) since 1916
• Sent out 10 million surveys
  – 2.4 million responded
• Prediction: Roosevelt 43%
• Actual: Roosevelt 62%
• (Literary Digest went bankrupt soon after)

George Gallup did it right
• Roosevelt’s percentage
  – Actual election result: 62%
  – Literary Digest prediction: 43%
  – Gallup’s prediction of the Digest prediction (based on a sample of 2,000): 44%
  – Gallup’s prediction of the election result (based on a sample of 50,000): 56%

What went wrong (Literary Digest’s prediction)?
• Context: Great Depression
  – 9 million unemployed; real income down 33%
  – Landon: “Cut spending” versus Roosevelt: “Balance people’s budgets before government’s budget”
• Polling
  – Survey sent out to 10 million people drawn from the subscription list, telephone directories, and club membership lists
  – 2.4 million responded

Sources of bias
• Sampling not representative
  – Biased toward better-off groups (and more Republican)
• The rich are the ones who own a telephone, are club members, etc.
• Voluntary response bias
  – The anti-Roosevelt forces were angry, and had a higher response rate!

George Gallup did it right
• Gallup used random sampling
  – Every combination of people has an equal chance of being selected

The Year the Polls Elected Dewey
• Another classic fiasco
• 1948: Harry Truman (Dem) vs. Thomas Dewey (Rep)
• All major polls (even Gallup) predicted Dewey would win by 5 percentage points

What went wrong?
• Quota sampling
  – Each interviewer was assigned a fixed quota of subjects in certain categories (race, sex, age)
  – Within each category, interviewers were free to choose whom to interview
  – This left room for human choice, and inevitable bias
• Republicans were easier to reach
  – Had telephones, permanent addresses, “nicer” neighborhoods
• Gallup stopped polling altogether about two weeks before the election
  – There were two third-party candidacies, support for which was hard to predict
• In the election itself, Truman ran relatively weakly in the Northeast but much more strongly in the farm belt and the West.
  – Hence the famous Chicago Tribune photo

Quota sampling is biased
• Republican bias in the Gallup Poll

  Year   Prediction of GOP vote (%)   Actual GOP vote (%)   Error in favor of GOP (points)
  1936   44                           38                    6
  1940   48                           45                    3
  1944   48                           46                    2
  1948   50                           45                    5

• Quota sampling was eventually abandoned for random sampling
• Repeated evidence points to the superiority of random sampling!

Once again …
• Big data can be useful only if they are properly used
• Understand how the data is collected
• Understand the quality of the data that you are collecting
• Account for noise in your analysis if needed

Story 3: Can Big Data Find Interesting Patterns?
What is happening here?

How to ensure I.I.D.?
• In clinical testing, we carefully choose the sample to ensure I.I.D. so that the test is valid
  – Independent: patients are not related
  – Identical: similar numbers of male/female, young/old, … in cases and controls
  Note that sex, age, … don’t need to appear in the contingency table
• In big-data analytics, and in much data-mining work, people hardly ever do this!
  – Is this sound?

Can Big Data Find Interesting Patterns?
Challenge: Separating causal factors from confounding factors; making good inferences.
Taking A
• Men = 100 (63%)
• Women = 60 (37%)
Taking B
• Men = 210 (91%)
• Women = 20 (9%)
Men taking A
• History = 80 (80%)
• No history = 20 (20%)
Men taking B
• History = 55 (26%)
• No history = 155 (74%)

Once again …
• Need to understand the available data
• Need to understand the assumptions of the tools and ensure the data actually comply with them
• Need to account for other factors that may influence the outcome

How can we get more out of big data?

Analytics Algorithms
• Pick the right tool
• Understand the algorithm
  – The strengths and limitations
  – The assumptions
  – The parameter settings
• Tuning vs. default settings

Databases
• Pick the right database
  – Model, consistency level, etc.
• Understand its features
  – The strengths and limitations

Statistics
• Ensure the sample is representative
• Understand what test to use
  – t-test, χ² test, etc.
• Get the null hypothesis right, and exploit domain knowledge properly

Summary
• Big data can be beneficial: there are many success stories
• Improper use can have drastic consequences
• Need to
  – Understand the data collection process, and manage the noise
  – Know the algorithms
  – Apply the right statistical analysis
  – Use the database technologies
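
As a worked example of "apply the right statistical analysis", the counts from the "Taking A / Taking B" slide can be checked for the "identical" half of I.I.D. The sketch below computes Pearson's χ² statistic (without continuity correction) for a 2x2 table and compares it with 3.841, the 5% critical value at one degree of freedom. The function and variable names are illustrative, and the choice of this particular test is an assumption rather than something the slides prescribe.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 table of counts.

    Assumes no row or column total is zero.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# 5% critical value of the chi-square distribution with 1 degree of freedom.
CRITICAL_5PCT_DF1 = 3.841

# Rows: taking A, taking B. Columns: men, women.
drug_by_sex = [[100, 60], [210, 20]]

# Men only. Rows: taking A, taking B. Columns: history, no history.
men_drug_by_history = [[80, 20], [55, 155]]

if __name__ == "__main__":
    for name, table in [("drug vs. sex", drug_by_sex),
                        ("drug vs. history (men)", men_drug_by_history)]:
        stat = chi_square_2x2(table)
        verdict = "differ" if stat > CRITICAL_5PCT_DF1 else "look comparable"
        print(f"{name}: chi-square = {stat:.2f}, groups {verdict}")
```

Both statistics come out far above the critical value, so the A and B groups are not comparable in sex or in history; any difference in outcome between A and B could therefore be due to these factors rather than to the treatment itself.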