Statistical Paradises and Paradoxes in Big Data

Xiao-Li Meng
Department of Statistics, Harvard University
Thanks to many students and colleagues

Paradises
• A much larger general pipeline [chart: Statistics concentration (major) size at Harvard College]
• Much better airplane conversations
• A golden era for methodological research
• Emerging theoretical foundations

Berkeley Group: Integrating Stat/Prob with CS, ML, IS, and Math
• A rigorous theory of the trade-off between statistical and computational efficiency, under confidentiality constraints, etc., based on classical statistical decision theory.
• Wide-ranging statistical machine learning theory, methodology, and algorithms, using empirical processes, signal processing, and information theory (e.g., the MDL principle).
• Automated Targeted Learning and Super Learning, built upon well-established semiparametric and nonparametric theory.
• Algebraic statistics, e.g., studying statistical hypothesis testing via algebraic geometry and computational and combinatorial techniques.
• ...

BFF Group: Integrating Bayesian, Frequentist, and Fiducial Perspectives
• Fusion learning via confidence distributions (CDs)
• Combining results from multiple analyses conducted under possibly different perspectives

Jianqing Fan's Group (Princeton): Bringing Statistical Theory and Methods to the Forefront of Big Data
Fan et al. (2014). Challenges of Big Data Analysis. National Science Review (China) 1: 293-314.
Salient features of Big Data:
• Heterogeneity (individuality)
• Noise accumulation
• Spurious correlation
• Incidental endogeneity

Great Promises and Grand Challenges
• Multi-Resolution Inference
• Multi-Phase Inference
• Multi-Source Inference

o Meng (2014). A Trio of Inference Problems That Could Win You a Nobel Prize in Statistics (If You Help Fund It). COPSS 50th Anniversary Volume.
o Blocker and Meng (2013). The Potential and Perils of Preprocessing: Building New Foundations. Bernoulli 19, 1176-1211.
o Xie and Meng (2016). Dissecting Multiple Imputation from a Multi-Phase Inference Perspective: What Happens When God's, Imputer's and Analyst's Models Are Uncongenial? (With discussion.) Statistica Sinica, to appear.

OnTheMap Project of the US Census Bureau
• Developed by LED (Local Employment Dynamics).
• Users can zoom into any region of the US for paired employee-employer information.
• Built from diverse data sources: surveys and administrative datasets containing confidential information.
(Thanks to Jeremy Wu of the Census Bureau.)

Multi-Resolution
[OnTheMap display at multiple zoom levels.]

Multi-Phase
• To protect confidentiality, the displayed data are synthetic: draws from a posterior distribution.
• Each data source has itself gone through multiple "clean-up" processes, most of which are gray boxes or even black boxes.

Multi-Source
• Built from more than 20 data sources in the LEHD (Longitudinal Employer-Household Dynamics) system.
• Survey samples: a monthly survey of 60,000 households, covering only 0.05% of households.
• Administrative records: unemployment insurance wage records covering more than 90% of the US workforce; never intended for inference purposes.
• Census data: a quarterly census of earnings and wages covering 98% of US jobs.

A Trio of NP-Hard Inference Problems
• Multi-Resolution: How do we infer estimands at a resolution far exceeding that of any possible estimator? Is it possible for such inference to be qualitatively robust even if it cannot be quantitatively robust?
• Multi-Phase: (Big) data are almost never collected, preprocessed, and analyzed in a single phase. What theory and methods accommodate this multi-phase setup?
• Multi-Source: Which is better: a survey sample covering 1% of the population or an administrative record covering 95%? How should we combine information from these sources? Is it worth combining?

Audience poll: So which one is better for estimating the population mean: a 1% simple random sample (SRS) or a 95% administrative (observational) dataset (AD)?
1. 1% SRS
2. 95% AD
3. It depends!
4. Is this a trick question?

A Fundamental Principle of Statistics: the Variance-Bias Trade-off

Total Error = Variance + Bias^2
• Probabilistic SRS: [(1 - f_s)/n] S^2 + 0
• Large non-probability data: ≈ 0 + r^2 [(1 - f_a)/f_a] S^2

where
• f is the fraction of the population in the data: f = n/N;
• r is the correlation between the (honest) responded/recorded value X and the probability of response/recording;
• the "Big Data Paradox": the larger the data, the more pronounced the bias.

Audience poll: For estimating a population mean, if r = 0.1, how large does an AD, as a percentage of the US population, need to be in order to produce a more accurate sample average than an SRS with n = 100 does? (A sketch of the calculation follows the options.)
1. <0.5% (1.6M)
2. 5% (16M)
3. 10% (32M)
4. 20% (64M)
5. 50% (160M)
6. 75% (240M)
7. 90% (288M)
8. >95% (303M)
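To make the poll concrete, here is a minimal sketch (in Python; the slides themselves contain no code) that plugs the two error expressions above into the comparison. The population size N = 320M is an assumption inferred from the answer options (95% of 320M is roughly 303M); n = 100 and r = 0.1 come from the question.

```python
# Sketch, assuming the trade-off formulas on the slide above:
#   SRS: MSE ≈ [(1 - f_s)/n] * S^2,  with f_s = n/N
#   AD:  MSE ≈ r^2 * [(1 - f_a)/f_a] * S^2
# S^2 cancels in the comparison, so we set S^2 = 1.

N = 320_000_000   # assumed US population, consistent with "303M = >95%"
n = 100           # SRS size from the poll
r = 0.1           # correlation between X and its recording probability

def mse_srs(n, N):
    f_s = n / N
    return (1 - f_s) / n

def mse_ad(f_a, r):
    return r**2 * (1 - f_a) / f_a

# Break-even: r^2 (1 - f)/f = (1 - f_s)/n  =>  f = 1 / (1 + (1 - f_s)/(n r^2))
f_star = 1 / (1 + mse_srs(n, N) / r**2)
print(f"break-even AD fraction: {f_star:.3f} (~{f_star * N / 1e6:.0f}M people)")
print(f"MSE check: SRS = {mse_srs(n, N):.6f}, AD = {mse_ad(f_star, r):.6f}")
```

Under these assumptions the break-even point is f = 1/2: the administrative dataset must cover about 50% of the population (roughly 160M people, option 5) before its sample average beats that of a 100-person SRS. This is the "Big Data Paradox" at work.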
Big Data: Big Size or Big Fraction?
• Size matters, but only after having quality.
• It is important to combine non-probabilistic samples with probabilistic ones, however small the latter are.
• More does NOT guarantee better: "I got more data, my model is more refined, but my estimator is getting worse! Am I just dumb?" (Meng and Xie, 2014, Econometric Reviews, 218-250)

So When/Why Do We Need Big Data?
• Individualized treatments (e.g., medical, educational, marketing, news)
• Inference/prediction with a very weak signal-to-noise ratio (e.g., climate change)
• Understanding deeply connected (spatial) networks and (temporal) dynamics

What Does Big Data Mean for You?
We see you and others more clearly.
• Gift: treatment for you based only on data from people like you.
• Curse: no one is perfectly like you.

Personalized Treatment
Sounds heavenly, but where on Earth did they find the right guinea pig for me?
Liu and Meng (2014). A Fruitful Resolution to Simpson's Paradox via Multi-Resolution Inference. The American Statistician 68, 17-29.

A Painful Problem: Kidney Stone Treatment
Charig, C. R., Webb, D. R., Payne, S. R., and Wickham, O. E. (March 1986). Br Med J (Clin Res Ed) 292(6524): 879-882.

                 Treatment A       Treatment B
  Overall        78% (273/350)     83% (289/350)
  Small stones   93% (81/87)       87% (234/270)
  Large stones   73% (192/263)     69% (55/80)

A: open surgery; B: percutaneous nephrolithotomy.

Within each stone size, Treatment A has the higher success rate (93% vs. 87% for small stones; 73% vs. 69% for large stones), yet Treatment B looks better overall (83% vs. 78%). The uneven distribution of stone sizes across treatments makes the overall success rates misleading.

Simpson's Paradox (see the numerical check following this list)
• Dealing with the disparities between aggregated and disaggregated analyses
• Determining the right level (the primary resolution) for analysis
• Understanding the bias-variance (relevance-robustness) trade-off
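The reversal is easy to verify numerically. The following sketch (my addition; not from the slides) recomputes the success rates from the Charig et al. counts and confirms that Treatment A wins within each stone-size stratum while Treatment B wins in the aggregate.

```python
# Kidney stone counts from Charig et al. (1986): (successes, attempts).
data = {
    "A (open surgery)": {"small": (81, 87),   "large": (192, 263)},
    "B (percutaneous)": {"small": (234, 270), "large": (55, 80)},
}

for treatment, strata in data.items():
    total_s = sum(s for s, n in strata.values())
    total_n = sum(n for s, n in strata.values())
    by_stratum = ", ".join(
        f"{size}: {s}/{n} = {s/n:.0%}" for size, (s, n) in strata.items()
    )
    print(f"{treatment}: {by_stratum}; overall {total_s}/{total_n} = {total_s/total_n:.0%}")

# A beats B on small stones (93% vs 87%) and on large stones (73% vs 69%),
# yet B beats A overall (83% vs 78%): A received most of the hard
# (large-stone) cases, so aggregation flips the comparison.
```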
So What Would Be the Right Resolution? Let's Take a CarTalk Challenge (7/11/2015)

From CarTalk: "You are tested positive for D by a test with 95% accuracy. What's the chance you actually have D, given that the prevalence of D is 0.1%?"

Audience poll:
1. 1-5%
2. 5-10%
3. 10-25%
4. 25-50%
5. 50-75%
6. 75-95%
7. Could be anything
8. I have no idea.

It Could Be Anything ... Depending on the Meaning of "Accuracy"
• We need to know how accurate the test is among those without the disease (specificity) AND among those with the disease (sensitivity).
• The probability could even be 1: if specificity = 100%, there are no false positives, so a positive test implies disease.
• For a rare disease, overall accuracy ≈ specificity, since almost everyone tested is disease-free.
• The answer is then less than 2%, if this was a random screening test.

Screening tree (a short computational check appears at the end of these notes):

Random screening: 100,000 people, prevalence 0.1%
• 100 with D: 95% test positive (95 pos), 5% test negative (5 neg)
• 99,900 without D: 5% test positive (4,995 pos), 95% test negative (94,905 neg)
• P(D | positive) = 95/(95 + 4,995) = 1.87%

Testing 1,000 people with symptoms, prevalence 10%
• 100 with D: 95 pos, 5 neg
• 900 without D: 45 pos, 855 neg
• P(D | positive) = 95/(95 + 45) = 67.9%

Conditioning is the soul of statistics. --- Joe Blitzstein

Bayes' Theorem
"When the facts change, I change my opinion. What do you do, sir?" ~ John Maynard Keynes

Useful Statistical Principles/Concepts for Data Science
• Data selection and replication mechanisms: randomization, sampling, experiments, observational studies, missing-data mechanisms; latent variables/constructs; potential outcomes; confidentiality protections
• Conditioning vs. marginalizing: disaggregation vs. aggregation, sub-population analysis, individualized inference, Simpson's paradox, the ecological fallacy
• Bias-variance trade-off: efficiency vs. robustness, relevance vs. robustness; model predictability vs. fitness
• Inference principles/perspectives: the likelihood principle; Bayesian thinking; the fiducial argument for objectivity; uncertainty quantification
• ...

A Traditional Statistical Theme/Aim: seeking representative samples to infer about populations.
A Big-Data Statistical Theme/Aim: constructing approximating populations to infer about targeted individuals.

One More V for Big Data: Veracity

Audience poll: I find your presentation ...
1. Inspiring and thought-provoking
2. Informative, and I learned a few things
3. Confusing and not very helpful
4. What a waste of my time!
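Coda: the screening-tree arithmetic referenced earlier, expressed directly as Bayes' theorem. This sketch (my addition; the talk presents only the tree) reproduces both the 1.87% random-screening answer and the 67.9% with-symptoms answer, using the tree's 95% sensitivity and 95% specificity.

```python
def posterior_positive(prevalence, sensitivity, specificity):
    """P(D | test positive) via Bayes' theorem."""
    true_pos = prevalence * sensitivity            # P(D and positive)
    false_pos = (1 - prevalence) * (1 - specificity)  # P(no D and positive)
    return true_pos / (true_pos + false_pos)

# Random screening: prevalence 0.1%, sensitivity and specificity both 95%.
print(f"{posterior_positive(0.001, 0.95, 0.95):.2%}")  # -> 1.87%

# The same test among people with symptoms, where prevalence is 10%.
print(f"{posterior_positive(0.10, 0.95, 0.95):.2%}")   # -> 67.86%
```

The prevalence argument is what changes between the two scenarios, which is exactly the point of "conditioning is the soul of statistics": the same test yields wildly different posterior probabilities under different conditioning information.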