
Statistical Paradises and Paradoxes
in Big Data
Xiao-Li Meng
Department of Statistics,
Harvard University
Thanks to many students and colleagues
1
Paradises
• Much larger general pipeline:
  [Chart: Statistics Concentration (Major) size at Harvard College]
• Much better airplane conversations
• Golden era for methodological research
• Emerging theoretical foundations
2
Berkeley Group: Integrating Stat/Prob with CS, ML, IS, and Math
• Rigorous theory of the trade-off between
statistical and computational efficiency,
under confidentiality, etc., based on
classical statistical decision theory.
• Wide-ranging statistical machine learning
theory, methodology, algorithms, using
empirical process, signal processing &
information theory (e.g., MDL principle).
• Automated Targeted Learning and Super Learning,
built upon well-established semiparametric and
nonparametric theory.
• Algebraic statistics, e.g., studying
statistical hypothesis testing via algebraic
geometry and computational and
combinatorial techniques.
• ……
3
BFF group: Integrating Bayes, Frequentist, and Fiducial perspectives
• Fusion learning via confidence distributions (CD)
• Combining results from multiple analyses under
possibly different perspectives
4
Jianqing Fan’s Group (Princeton):
Bringing statistical theory and methods to the forefront of Big Data
Fan et al. (2014) Challenges of Big Data Analysis
National Science Review (China) 1: 293-314
Salient features of Big Data
• Heterogeneity (Individuality)
• Noise accumulation
• Spurious correlation
• Incidental endogeneity
5
Great Promises and Grand Challenges
Multi-Resolution Inference
Multi-Phase Inference
Multi-Source Inference
o Meng (2014) A Trio of Inference Problems That Could Win You a Nobel
Prize in Statistics (if you help fund it). COPSS 50th Anniversary Volume.
o Blocker and Meng (2013) The Potential and Perils of Preprocessing:
Building New Foundations. Bernoulli, 19, 1176-1211.
o Xie and Meng (2016) Dissecting Multiple Imputation from a Multiphase
Inference Perspective: What Happens When God’s, Imputer’s and
Analyst’s Models are Uncongenial? (With discussion). Statistica Sinica,
to appear.
6
OnTheMap Project of US Census Bureau
• Developed by LED (Local
Employment Dynamics).
• Users zoom into any region of
the US for paired employee-employer information.
• Used diverse data sources:
surveys and administrative
datasets with confidential
information.
Thanks to Jeremy Wu of the Census Bureau.
7
Multi-Resolution
8
Multi-Phase
• To protect confidentiality, the displayed data are synthetic:
draws from a posterior.
• Each data source itself has gone through multiple
“clean up” processes, most of which are gray boxes
or even black boxes.
9
Multi-Source
• Built from more than 20 data sources in the LEHD
(Longitudinal Employer-Household Dynamics) system.
• Survey Samples: Monthly survey of 60,000 households
covering only 0.05% of households.
• Administrative Records: Unemployment insurance wage
records covering more than 90% of the US workforce;
Never intended for inference purposes.
• Census Data: Quarterly census of earnings and wages
covering 98% of US jobs.
10
A Trio of NP-Hard Inference Problems
• Multi-Resolution: How do we infer estimands whose resolution far
exceeds that of any feasible estimator? Can such inference be
qualitatively robust even when it cannot be quantitatively robust?
• Multi-Phase: (Big) Data are almost never collected, preprocessed,
and analyzed in a single phase. What theory and methods
accommodate this multi-phase setup?
• Multi-Source: Which one is better: a survey sample covering 1% or
an administrative record covering 95% of the population? How
should we combine information from these sources? Is it worth
combining?
11
So which one is better for estimating the population mean:
a 1% simple random sample (SRS) or a 95% administrative
(observational) dataset (AD)?
(Audience poll)
1. 1% SRS
2. 95% AD
3. It depends!
4. Is this a trick question?
12
A fundamental principle of statistics: the Variance-Bias Tradeoff

$$\text{Total Error} = \text{Variance} + \text{Bias}^2$$

• Probabilistic SRS: $\dfrac{1-f_s}{n}\,S^2 + 0$
• Large non-probability data: $\approx 0 + r^2\,\dfrac{1-f_a}{f_a}\,S^2$

• $f$ is the fraction of the population covered: $f = n/N$
• $r$ is the correlation between the (honest) responded/recorded
  value $X$ and the probability of response/recording
• “Big Data Paradox”: the larger the data, the more
  pronounced the bias
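As a rough illustration of these formulas, here is a minimal sketch; the function names and the values of N, S2, and r below are illustrative assumptions, not figures from the talk:

```python
# Minimal sketch (illustration only): MSE of the sample mean under the
# slide's two formulas. N, S2, and r are assumed values, not from the talk.
N = 320_000_000  # rough US population size

def mse_srs(n, S2):
    # Simple random sample: no bias, variance (1 - f) * S2 / n with f = n / N.
    f = n / N
    return (1 - f) * S2 / n

def mse_ad(f, r, S2):
    # Large non-probability dataset: variance ~ 0,
    # squared bias = r^2 * (1 - f) / f * S2.
    return r**2 * (1 - f) / f * S2

S2, r = 1.0, 0.05
print(f"1% SRS: {mse_srs(n=N // 100, S2=S2):.2e}")  # ~3.1e-07
print(f"95% AD: {mse_ad(f=0.95, r=r, S2=S2):.2e}")  # ~1.3e-04
```

Even with the modest correlation r = 0.05 assumed here, the 95% administrative dataset's error is hundreds of times larger than the 1% SRS's: the Big Data Paradox in numbers.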
13
For estimating a population mean, if r = 0.1, how large does an AD, as a
percentage of the US population, need to be in order to produce a more
accurate sample average than an SRS with n = 100?
(Audience poll)
1. <0.5% (1.6M)
2. 5% (16M)
3. 10% (32M)
4. 20% (64M)
5. 50% (160M)
6. 75% (240M)
7. 90% (288M)
8. >95% (303M)
14
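A back-of-envelope check (my addition, equating the two error formulas from the previous slide and ignoring the SRS's finite-population correction):

$$\frac{S^2}{n} = r^2\,\frac{1-f_a}{f_a}\,S^2
\;\Longrightarrow\;
\frac{1}{100} = (0.1)^2\,\frac{1-f_a}{f_a}
\;\Longrightarrow\;
f_a = \frac{1}{2}.$$

So the AD must cover more than half of the US population, i.e., over 160 million people, to beat a simple random sample of 100.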
Big Data: Big Size or Big Fraction?
• Size matters, but only after having quality
• Importance of combining non-probabilistic samples
with probabilistic ones, however small the latter are.
• More does NOT guarantee better:
  “I got more data, my model is more refined,
  but my estimator is getting worse! Am I just dumb?”
  (Meng and Xie, 2014, Econometric Reviews, 33, 218-250)
15
So when/why do we need Big Data?
• Individualized treatments (e.g., medical;
educational; marketing; news)
• Inference/prediction with very weak signal-to-noise
ratios (e.g., climate change)
• Understanding deeply connected (spatial)
networks and (temporal) dynamics
16
What does Big Data mean for you?
We see you and others more clearly
17
Gift: Treatment for you based only on data from people like you.
Curse: No one is perfectly like you.
18
Personalized Treatment:
Sounds heavenly, but where on
Earth did they find the right
guinea pig for me?
Liu and Meng (2014). A Fruitful Resolution to
Simpson’s Paradox via Multi-Resolution Inference.
The American Statistician, 68, 17-29.
19
A Painful Problem
20
Kidney Stone Treatment
C. R. Charig, D. R. Webb, S. R. Payne, J. E. Wickham (March 1986).
Br Med J (Clin Res Ed) 292 (6524): 879-882.

                Treatment A        Treatment B
Overall         78% (273/350)      83% (289/350)
Small stones    93% (81/87)        87% (234/270)
Large stones    73% (192/263)      69% (55/80)

A: Open Surgery;
B: Percutaneous Nephrolithotomy
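A quick sketch (my addition) reproducing the reversal from the table's counts:

```python
# Sketch (my addition): Simpson's paradox with the kidney-stone counts above.
# Treatment A wins within each stone size yet loses overall.
counts = {  # treatment -> stone size -> (successes, total)
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

for treatment, groups in counts.items():
    for size, (s, n) in groups.items():
        print(f"{treatment} {size}: {s}/{n} = {s / n:.0%}")
    s_tot = sum(s for s, _ in groups.values())
    n_tot = sum(n for _, n in groups.values())
    print(f"{treatment} overall: {s_tot}/{n_tot} = {s_tot / n_tot:.0%}")
```

A is better within each subgroup but worse overall because it was assigned mostly the hard (large-stone) cases; the conditional comparison is the relevant one here.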
21
[Charts: successful vs. unsuccessful outcomes. Treatment A: 93% small stones, 73% large stones, 78% overall. Treatment B: 87% small stones, 69% large stones, 83% overall.]
The uneven distribution of stone sizes across treatments makes
the overall success rates misleading.
22
Simpson’s Paradox
• Dealing with the disparities between
aggregated analysis and disaggregated
analyses
• Determining the right level (primary
resolution) for analysis
• Understanding the bias-variance (relevance-robustness) trade-off
23
So what would be the right resolution?
Let’s take a Car Talk challenge (7/11/2015)
24
From Car Talk: “You test positive for D by a test with
95% accuracy. What’s the chance you actually have D, given
that the prevalence of D is 0.1%?”
(Audience poll)
1. 1-5%
2. 5-10%
3. 10-25%
4. 25-50%
5. 50-75%
6. 75-95%
7. Could be anything
8. I have no idea.
25
It could be anything …
depending on the meaning of “accuracy” and …
• We need to know how accurate the test is among
  those without the disease (specificity) AND among
  those with the disease (sensitivity)
• The probability could be 1 if specificity = 100%
• For a rare disease, overall accuracy ≈ specificity
• Then the answer is less than 2%, if this were a
  random screening test
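Spelling out the “less than 2%” (my addition), with prevalence 0.1% and sensitivity = specificity = 95%:

$$P(D \mid +) = \frac{0.95 \times 0.001}{0.95 \times 0.001 + 0.05 \times 0.999} \approx 1.87\%.$$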
26
100,000 People for Screening
• 0.1% have D → 100 with D: 95% test positive → 95 pos; 5% → 5 neg
• 99.9% without D → 99,900: 5% test positive → 4,995 pos; 95% → 94,905 neg
P(D | positive) = 95/(95 + 4,995) = 1.87%

1,000 with Symptoms
• 10% have D → 100 with D: 95% test positive → 95 pos; 5% → 5 neg
• 90% without D → 900: 5% test positive → 45 pos; 95% → 855 neg
P(D | positive) = 95/(95 + 45) = 67.9%
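The two tree calculations above, in a minimal sketch (my addition) via Bayes' theorem; the function name is mine:

```python
# Sketch (my addition): P(D | positive test) for the two scenarios above.
def p_disease_given_positive(prevalence, sensitivity=0.95, specificity=0.95):
    """Bayes: true positives / (true positives + false positives)."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

print(f"{p_disease_given_positive(0.001):.2%}")  # random screening: ~1.87%
print(f"{p_disease_given_positive(0.10):.2%}")   # symptomatic group: ~67.86%
```

The only thing that changes between the two runs is the prior (prevalence), which is exactly the point of the slide.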
Conditioning is the Soul of Statistics
--- Joe Blitzstein
27
Bayes’ Theorem
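For reference, the theorem itself (my addition; the slide shows only the quote):

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$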
When the facts change, I change my opinion. What
do you do, sir?
~ John Maynard Keynes
28
Useful Statistical Principles/Concepts for Data Science
Data Selection and Replication Mechanisms:
Randomization, sampling, experiments, observational studies, missing
data mechanisms; latent variables/constructs; potential outcomes;
confidentiality protection
Conditioning vs. Marginalizing:
Disaggregation vs. aggregation, sub-population analysis,
individualized inference, Simpson’s paradox, ecological fallacy
Bias-Variance Trade-off:
Efficiency vs. robustness, relevance vs. robustness; model
predictability vs. fit
Inference principles/perspectives:
Likelihood principle; Bayesian thinking; fiducial argument for
objectivity; uncertainty quantification
…
29
A Traditional Statistical Theme/Aim:
Seeking representative samples to infer about populations
A Big-Data Statistical Theme/Aim:
Constructing approximating populations to infer about individuals
[Diagram: Targeted Individual ← Approximating Population]
30
One more V for Big Data: Veracity
31
I find your presentation …
(Audience poll)
1. Inspiring and thought-provoking
2. Informative and I learned a few things
3. Confusing and not very helpful
4. What a waste of my time!
32