Is BIG necessarily Better?
1
A few disappointing stories …
• Google Flu Trends
• 1936 US Presidential Election
• Finding interesting patterns
2
Story 1: Can Big Data Predict Flu Trends?
[Figure: disease incidence over time. Traditional public health reporting gives confirmed information, but it is lagged 2‐3 weeks. Can big data disease detection pick up signals earlier?]
3
Using Big Data To Predict Flu Trends
• Importance: 250,000 ‐ 500,000 deaths from respiratory illnesses worldwide each year
• During flu season, more people enter search queries concerning flu
• Each year 90 million American adults search the web for info about specific illnesses = LOTS OF DATA
4
Previous Attempts
• A Swedish website counted queries in order to track flu activity
• There was a strong correlation between frequency of search terms containing “flu” and “influenza” and virologic surveillance data
• Small data
– These models look for a very limited number of queries
5
Google Flu Trends
“We want to estimate flu activity based on more than just a few queries”
“We've found that certain search terms are good indicators of flu activity. Google Flu Trends uses aggregated Google search data to estimate current flu activity around the world in near real‐time.” http://www.google.org/flutrends/
• Predictions by Google Flu are 1‐2 weeks ahead of CDC’s ILI (Influenza‐like illness) surveillance reports
[Ginsberg Nature ‘09]
6
Very promising retrospective comparison!
“In April 2009, Dr. Brilliant said it epitomized the power of Google’s vaunted engineering prowess to make the world a better place, and he predicted that it would save untold numbers of lives.”
7
Google Flu Trends
• Took 50 million (!!) of the most common search queries between 2003‐2008 and did a weekly count for each state
– Each query is tested for correlation with CDC data
– Normalized data by dividing count by total searches for the week (thereby getting a percentage)
– Ranked from most to least correlated
• Google added top ranked queries together to see what number would yield the most accurate results
• The magic number is 45
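The steps on this slide (normalize, correlate, rank, add the top queries together) can be sketched in a few lines. This is a simplified illustration on made-up data, not Google's actual implementation: the function names, the synthetic query counts, and the use of plain Pearson correlation as the accuracy measure are all assumptions.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def select_queries(counts, totals, ili):
    """Rank queries by correlation with ILI data, then pick the best top-k sum.

    counts: {query: [weekly count, ...]}, totals: [weekly total searches, ...],
    ili: [weekly ILI rate from surveillance, ...]. All names are hypothetical.
    """
    # Normalize: divide each weekly count by that week's total searches.
    shares = {q: [c / t for c, t in zip(cs, totals)] for q, cs in counts.items()}
    # Rank queries from most to least correlated with the surveillance data.
    ranked = sorted(shares, key=lambda q: pearson(shares[q], ili), reverse=True)
    # Add the top-k queries together and keep the k with the best correlation.
    best_k, best_r = 0, -1.0
    summed = [0.0] * len(ili)
    for k, q in enumerate(ranked, start=1):
        summed = [s + v for s, v in zip(summed, shares[q])]
        r = pearson(summed, ili)
        if r > best_r:
            best_k, best_r = k, r
    return ranked, best_k, best_r


# Synthetic data (all made up): 104 weeks with a yearly flu-like cycle.
random.seed(0)
weeks = 104
ili = [2.0 + 1.5 * math.sin(2 * math.pi * w / 52) for w in range(weeks)]
totals = [1_000_000 + random.randint(-50_000, 50_000) for _ in range(weeks)]
counts = {}
for i in range(5):  # five queries that track flu activity, with noise
    counts[f"flu_query_{i}"] = [
        int(t * (0.001 * v + random.uniform(0, 0.0005))) for v, t in zip(ili, totals)
    ]
for i in range(5):  # five unrelated queries
    counts[f"other_query_{i}"] = [int(t * random.uniform(0.001, 0.003)) for t in totals]

ranked, best_k, best_r = select_queries(counts, totals, ili)
print(ranked[:5], best_k, round(best_r, 3))
```

Note that nothing in this procedure asks why a query correlates with flu activity. That is what makes the approach "theory‐free", and also what makes it fragile when search behavior changes.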
8
Google Flu Trends
9
Google Flu Trends
[Figure: Google Flu estimate versus national data, for the U.S. and Australia]
10
The Wonder of Big Data Analytics!
• Generate accurate estimates faster than the CDC
– CDC takes one to two weeks to process data and generate a flu activity report
– It takes Google one to two days to generate an estimate
• Faster estimates mean that health officials can quickly direct resources to where the need is greatest
• And it is theory‐free!
11
As time goes by …
• GFT’s predictions were way off
12
What happened?
• No one outside Google really knows!
• Some reasonable explanations
– Panic strikes
– N ≠ ALL
– Algorithm dynamics
It is not the size that matters
Big data present great opportunities
BUT improper use may lead to erroneous predictions
13
Story 2: Can Big Data Predict Election Results?
• Two famous pre‐election polls
– a Literary Digest poll predicted the defeat of President Roosevelt in 1936, and
– the Gallup Poll (and almost all other polls) predicted the defeat of President Truman in 1948.
14
1936 election and the Literary Digest survey
• The magazine had successfully predicted every election since 1916
• Sent out 10 million surveys
– 2.4 million responded
• Prediction: Roosevelt 43%
• Actual: Roosevelt 62%
• (Literary Digest went bankrupt soon after)
15
George Gallup did it right
• Roosevelt’s percentage
– Actual election result: 62%
– Literary Digest prediction: 43%
– Gallup’s prediction of the Digest prediction (based on sample of 2,000): 44%
– Gallup’s prediction of the election result (based on sample of 50,000): 56%
16
What went wrong (Literary Digest’s prediction)?
• Context: Great Depression
– 9 million unemployed; real income down 33%
– Landon: “Cut spending” versus Roosevelt: “Balance people’s budgets before government’s budget”
• Polling
– Survey sent out to 10 million people from subscription list, telephone directories, club membership lists
– 2.4 million responded
17
Sources of bias
• Sampling not representative
– Biased toward better off groups (and more Republican)
• The better off were the ones who owned telephones, belonged to clubs, etc.
• Voluntary response bias
– The anti‐Roosevelt forces were angry, and had a higher response rate!
18
George Gallup did it right
• Gallup used random sampling
– Every combination of people has an equal chance of being selected
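A small simulation can make the contrast concrete: a huge poll drawn from a biased list with voluntary response, against a much smaller simple random sample. Everything below is a made‐up illustration. The population size, the share of well‐off voters, and the frame and response probabilities are assumptions chosen only to mimic the two sources of bias described above, not historical estimates.

```python
import random

random.seed(1936)

# Hypothetical electorate: (is_well_off, votes_roosevelt) for each voter.
population = []
for _ in range(200_000):
    well_off = random.random() < 0.30
    p_roosevelt = 0.40 if well_off else 0.70
    population.append((well_off, random.random() < p_roosevelt))

true_share = sum(v for _, v in population) / len(population)


def digest_style_poll(pop):
    """Huge mail-out from a biased frame, with voluntary response."""
    responses = []
    for well_off, roosevelt in pop:
        # Sampling bias: the well off are far more likely to be on the lists.
        in_frame = random.random() < (0.80 if well_off else 0.20)
        # Response bias: anti-Roosevelt voters are more likely to reply.
        responds = random.random() < (0.20 if roosevelt else 0.30)
        if in_frame and responds:
            responses.append(roosevelt)
    return responses


def simple_random_sample(pop, n):
    """Every set of n voters is equally likely to be drawn."""
    return [roosevelt for _, roosevelt in random.sample(pop, n)]


digest = digest_style_poll(population)
srs = simple_random_sample(population, 2_000)
digest_share = sum(digest) / len(digest)
srs_share = sum(srs) / len(srs)

print(f"true {true_share:.3f}  biased n={len(digest)} -> {digest_share:.3f}  "
      f"random n={len(srs)} -> {srs_share:.3f}")
```

Under these assumptions the biased poll has several times more respondents than the random sample, yet lands much further from the true share. Collecting more responses the same way would not help, because the error comes from who ends up in the sample, not from how many.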
19
The Year the Polls Elected Dewey
• Another classic fiasco
• 1948: Harry Truman (Dem) vs. Thomas Dewey (Rep)
• All major polls predicted Dewey would win by 5 percentage points (even Gallup)
20
What went wrong?
• Quota Sampling
– Each interviewer assigned a fixed quota of subjects in certain categories (race, sex, age)
– In each category, interviewers are free to choose
– Left room for human choice, and inevitable bias
• Republicans were easier to reach
– Had telephones, permanent addresses, “nicer” neighborhoods
• Gallup stopped polling altogether about two weeks before the election
– There were two third‐party candidacies, support for which was hard to predict
• In the election itself, Truman ran relatively weakly in the Northeast but much more strongly in the farm belt and the West.
– Hence the famous Chicago Tribune photo
21
Quota Sampling biased
• Republican bias in Gallup Poll
Year   Prediction of GOP vote (%)   Actual GOP vote (%)   Error in favor of GOP (points)
1936   44                           38                    6
1940   48                           45                    3
1944   48                           46                    2
1948   50                           45                    5
• Quota sampling eventually abandoned for random sampling
• Repeated evidence points to superiority of random sampling!
22
Once again …
• Big data can be useful only if they are properly used
• Understand how the data is collected
• Understand the quality of data that you are collecting
• Account for noise in your analysis if needed
23
Story 3: Can Big Data Find Interesting Patterns?
What is happening here?
24
How to ensure I.I.D.?
• In clinical testing, we carefully choose the sample to ensure I.I.D. so that the test is valid
– Independent: Patients are not related
– Identical: Similar # of male/female, young/old, … in cases and controls
Note that sex, age, … don’t need to appear in the contingency table
• In big‐data analytics, and in many data mining works, people hardly ever do this!
– Is this sound?
25
Can Big Data Find Interesting Patterns?
Challenge : Separating causal factors from confounding factors; making good inferences.
Taking A
• Men = 100 (63%)
• Women = 60 (37%)
Taking B
• Men = 210 (91%)
• Women = 20 (9%)
Men taking A
• History = 80 (80%)
• No history = 20 (20%)
Men taking B
• History = 55 (26%)
• No history = 155 (74%)
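One way to read these counts is to check whether the groups taking A and B have a similar make‐up before comparing their outcomes. The sketch below recomputes the shares and applies a standard 2x2 chi‐square statistic to each breakdown. The function name and the choice of test are illustrative assumptions, not something the slide prescribes.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Counts from the slide.
sex = {"A": {"men": 100, "women": 60}, "B": {"men": 210, "women": 20}}
history_men = {"A": {"history": 80, "no history": 20},
               "B": {"history": 55, "no history": 155}}

for name, table in (("sex", sex), ("history (men only)", history_men)):
    for drug, row in table.items():
        total = sum(row.values())
        shares = ", ".join(f"{k} {v / total:.0%}" for k, v in row.items())
        print(f"{name}, taking {drug}: {shares}")
    # Unpack the two rows of the 2x2 table: row A is (a, b), row B is (c, d).
    (a, b), (c, d) = (tuple(table[g].values()) for g in ("A", "B"))
    stat = chi_square_2x2(a, b, c, d)
    print(f"  chi-square = {stat:.1f} (3.84 is the 5% critical value, 1 d.f.)")
```

Both statistics are far above the critical value, so the two groups differ sharply in sex and in history. Any difference in outcome between A and B could therefore come from these factors rather than from the treatment itself.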
26
Once again …
• Need to understand the available data
• Need to understand the assumptions of tools and ensure the data actually comply with them
• Need to account for other factors that may influence the outcome
27
How can we get more out of big data?
28
How can we get more out of big data?
29
Analytics Algorithms
• Pick the right tool
• Understand the algorithm
– The strengths and limitations
– The assumptions
– The parameter settings
• Tuning vs default settings
30
Databases
• Pick the right database
– Model, consistency level, etc
• Understand its features
– The strengths and limitations
31
Statistics
• Ensure sample is representative
• Understand what test to use
– t‐test, 2 test, etc
• Get the null hypothesis right, and exploit domain knowledge properly
32
Summary
• Big data can be beneficial – many success stories
• Improper use can have drastic consequences
• Need to
– Understand data collection process, and manage the noise
– Know the algorithms
– Apply the right statistical analysis
– Use the database technologies
33