Understanding Generalization in Adaptive Data Analysis
Vitaly Feldman

Overview
• Adaptive data analysis
  o Motivation
  o Definitions
  o Basic techniques
  With Dwork, Hardt, Pitassi, Reingold, Roth [DFHPRR 14, 15]
• New results [F, Steinke 17]
• Open problems

Learning problem
$\mathbb{E}_{x \sim P}[\mathrm{Loss}(f, x)] = \,?$
• Distribution $P$ over domain $X$
• Data $S = (x_1, \dots, x_n) \sim P^n$
• An analysis $A$ produces a model $f = A(S)$

Statistical inference
An algorithm $A$ receives $n$ i.i.d. samples from $P$ and outputs $f = A(S)$. Classical theory provides generalization guarantees for $f$ via model complexity, Rademacher complexity, stability, online-to-batch conversions, ...

Data analysis is adaptive
Steps depend on previous analyses of the same dataset: the analyst(s) choose $A_1$ and observe $v_1$, then choose $A_2$ and observe $v_2$, and so on up to $A_k, v_k$. Examples: exploratory data analysis, feature selection, model stacking, hyper-parameter tuning, shared datasets, ...

Thou shalt not test hypotheses suggested by data: the "quiet scandal of statistics" [Leo Breiman, 1992].

ML practice
Data is split into a training set and a testing set; the test error of the trained model $f$ estimates $\mathbb{E}_{x \sim P}[\mathrm{Loss}(f, x)]$.

ML practice now
Data is split into training, validation, and testing sets; many models are trained and compared on the validation set before the final $f$ reaches the test set, whose error should still estimate $\mathbb{E}_{x \sim P}[\mathrm{Loss}(f, x)]$.

Adaptive data analysis [DFHPRR 14]
The analyst(s) issue queries $A_1, \dots, A_k$ adaptively; the algorithm holds $S = (x_1, \dots, x_n) \sim P^n$ and responds with $v_1, \dots, v_k$.
Goal: given $S$, compute $v_i$'s "close" to running $A_i$ on fresh samples.
• Each analysis is a query
• Design an algorithm for answering adaptively-chosen queries

Adaptive statistical queries
Statistical query oracle [Kearns 93]: each query is a function $\phi_i : X \to [0,1]$ with $A_i(S) \equiv \frac{1}{n} \sum_{x \in S} \phi_i(x)$; for example, $\phi_i = \mathrm{Loss}(f, \cdot)$.
Requirement: $|v_i - \mathbb{E}_{x \sim P}[\phi_i(x)]| \le \tau$ with probability $1 - \beta$.
SQs can measure correlations, moments, and accuracy/loss, so the oracle suffices to run any statistical query algorithm.

Answering non-adaptive SQs
Given $k$ non-adaptive query functions $\phi_1, \dots, \phi_k$ and $n$ i.i.d. samples from $P$, estimate $\mathbb{E}_{x \sim P}[\phi_i(x)]$ by the empirical mean $\mathcal{E}_S[\phi_i] = \frac{1}{n} \sum_{x \in S} \phi_i(x)$. Then $n = O\!\left(\frac{\log(k/\beta)}{\tau^2}\right)$ samples suffice.

Answering adaptively-chosen SQs
What if we use $\mathcal{E}_S[\phi_i]$ for adaptive queries? For some constant $\beta > 0$ the empirical means fail at constant accuracy once roughly $k \ge n^2$; variable selection, boosting, bagging, step-wise regression, etc. all generate such adaptive query sequences.
Data splitting (fresh samples for each query) needs $n = O\!\left(\frac{k \log k}{\tau^2}\right)$.

Answering adaptive SQs [DFHPRR 14]
There exists an algorithm that answers $k$ adaptively chosen SQs with accuracy $\tau$ for $n = \tilde{O}\!\left(\frac{\sqrt{k}}{\tau^{2.5}}\right)$; [Bassily, Nissim, Smith, Steinke, Stemmer, Ullman 15] improve this to $n = \tilde{O}\!\left(\frac{\sqrt{k}}{\tau^{2}}\right)$.
Generalizes to low-sensitivity analyses, i.e., $|A_i(S) - A_i(S')| \le \frac{1}{n}$ whenever $S, S'$ differ in a single element: estimates $\mathbb{E}_{S \sim P^n}[A_i(S)]$ within $\tau$.

Differential privacy [Dwork, McSherry, Nissim, Smith 06]
A randomized algorithm $M$ is $(\varepsilon, \delta)$-differentially private if for any two datasets $S, S'$ that differ in one element:
$\forall Z \subseteq \mathrm{range}(M), \quad \Pr[M(S) \in Z] \le e^{\varepsilon} \cdot \Pr[M(S') \in Z] + \delta$.

DP implies generalization
• Differential privacy is stability: it implies strongly uniform replace-one stability, and hence generalization in expectation.
• DP composes adaptively: the composition of $k$ $\varepsilon$-DP algorithms is, for every $\delta > 0$, $\big(O(\varepsilon \sqrt{k \log(1/\delta)}), \delta\big)$-DP [Dwork, Rothblum, Vadhan 10].
• DP implies generalization with high probability [DFHPRR 14, BNSSSU 15].

Value perturbation [DMNS 06]
Answer a low-sensitivity query $A$ (with $\max_{S,S'} |A(S) - A(S')| \le 1/n$) by $A(S) + \zeta$, where $\zeta \sim N(0, \sigma^2)$ is Gaussian noise.
Given $n$ samples this achieves error $\approx \Delta(A) \cdot \sqrt{n} \cdot k^{1/4}$, where $\Delta(A) = \max_{S,S'} |A(S) - A(S')|$ is the worst-case sensitivity. But $\Delta(A) \cdot \sqrt{n}$ could be much larger than the standard deviation of $A$.
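Value perturbation is simple enough to sketch in code. The snippet below is a minimal illustration rather than the algorithm of [DFHPRR 14] or [BNSSSU 15]: the class name, the constants, and the advanced-composition noise calibration are all illustrative choices of mine.

```python
import numpy as np

class GaussianSQOracle:
    """Sketch of value perturbation for adaptively chosen statistical
    queries: answer each query phi: X -> [0,1] with its empirical mean
    on S plus Gaussian noise. The noise scale is calibrated so that the
    k-fold composition is roughly (epsilon, delta)-DP via advanced
    composition; the constants are illustrative, not tight."""

    def __init__(self, samples, k, epsilon=0.5, delta=1e-6, seed=None):
        self.samples = list(samples)
        self.rng = np.random.default_rng(seed)
        n = len(self.samples)
        # The empirical mean of a [0,1]-valued query has sensitivity 1/n;
        # spreading the privacy budget over k queries via advanced
        # composition gives per-query noise ~ sqrt(k log(1/delta)) / (eps n).
        self.sigma = np.sqrt(2.0 * k * np.log(1.0 / delta)) / (epsilon * n)

    def answer(self, phi):
        # The analyst may choose phi after seeing all previous answers.
        empirical = float(np.mean([phi(x) for x in self.samples]))
        return empirical + self.rng.normal(0.0, self.sigma)

# Hypothetical usage: the second query depends on the first answer.
rng = np.random.default_rng(0)
data = rng.uniform(0.0, 1.0, size=10_000)
oracle = GaussianSQOracle(data, k=100, seed=1)
a1 = oracle.answer(lambda x: x)              # estimate E[x]
a2 = oracle.answer(lambda x: float(x > a1))  # adaptively chosen query
```

The content of the theorems above is that, despite the adaptivity, such noisy answers stay within $\approx \tau$ of the true expectations $\mathbb{E}_{x \sim P}[\phi_i(x)]$ once $n \gtrsim \sqrt{k}/\tau^2$.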
Beyond low-sensitivity [F, Steinke 17]
There exists an algorithm that, for any adaptively-chosen sequence $A_1, \dots, A_k : X^t \to \mathbb{R}$, given $n = \tilde{O}(\sqrt{k} \cdot t)$ i.i.d. samples from $P$, outputs values $v_1, \dots, v_k$ such that w.h.p. for all $i$:
$\big|\mathbb{E}_{S \sim P^t}[A_i(S)] - v_i\big| \le 2\sigma_i, \quad \text{where } \sigma_i = \sqrt{\mathrm{Var}_{S \sim P^t}[A_i(S)]}$.
For statistical queries $\phi_i : X \to [-B, B]$, given $n$ samples the error scales as $\sqrt{\mathrm{Var}_{x \sim P}[\phi_i(x)]} \cdot \frac{k^{1/4}}{\sqrt{n}}$, whereas value perturbation gives only $B \cdot \frac{k^{1/4}}{\sqrt{n}}$.

Stable Median
Split $S$ into $m$ disjoint sub-samples $S_1, \dots, S_m$ of size $t$ each (so $n = tm$), and compute $y_1 = A_i(S_1), \dots, y_m = A_i(S_m)$.
Find an approximate median of $y_1, \dots, y_m$ with DP relative to $S$: a value $v$ greater than the bottom $1/3$ and smaller than the top $1/3$ of the values.

Median algorithms
Computing a DP median requires discretization to a ground set $Y$ with $|Y| = r$.
• Upper bound: $2^{O(\log^* r)}$ samples; lower bound: $\Omega(\log^* r)$ samples [Bun, Nissim, Stemmer, Vadhan 15].
• Exponential mechanism [McSherry, Talwar 07]: output $v \in Y$ with probability $\propto \exp\!\big(-\varepsilon \, \big|\#\{j : v \le y_j\} - \frac{m}{2}\big|\big)$; uses $m = O\!\left(\frac{\log r}{\varepsilon}\right)$ sub-samples.
Stability and confidence amplification for the price of one log factor!
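The private median at the heart of this scheme can be sketched directly from the exponential mechanism. In the snippet below the function name, the discretization grid, and the $\varepsilon/2$ weighting are my own choices; the mechanism itself is [McSherry, Talwar 07] applied to rank distance from the middle.

```python
import numpy as np

def dp_median(values, ground_set, epsilon=1.0, rng=None):
    """Exponential mechanism for a DP approximate median over a
    discretized ground set Y: output v in Y with probability
    proportional to exp(-(epsilon/2) * cost(v)), where cost(v) is the
    distance of v's rank among the values from m/2. Changing one value
    shifts every cost by at most 1 (sensitivity 1), giving epsilon-DP."""
    rng = rng or np.random.default_rng()
    values = np.sort(np.asarray(values, dtype=float))
    m = len(values)
    ground_set = np.asarray(ground_set, dtype=float)
    ranks = np.searchsorted(values, ground_set, side="right")  # #{j : y_j <= v}
    costs = np.abs(ranks - m / 2.0)
    log_w = -(epsilon / 2.0) * costs
    weights = np.exp(log_w - log_w.max())  # subtract max for stability
    return rng.choice(ground_set, p=weights / weights.sum())

# Hypothetical use in the sub-sample-and-aggregate scheme above:
# run A_i on m disjoint sub-samples, then release a DP approximate
# median of the m results.
S = np.random.default_rng(1).normal(5.0, 2.0, size=9_000)
ys = [chunk.mean() for chunk in np.split(S, 90)]  # y_j = A_i(S_j), m = 90
grid = np.linspace(-5.0, 15.0, 2_001)             # discretized Y, r = 2001
v = dp_median(ys, grid, epsilon=0.5)
```

W.h.p. $v$ lands between the bottom and top thirds of the $y_j$'s, and since each $y_j$ is $A_i$ evaluated on a fresh sub-sample, the quantile-preservation argument below turns $v$ into an accurate estimate of $\mathbb{E}_{S \sim P^t}[A_i(S)]$.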
Analysis
Differential privacy approximately preserves quantiles [F, Steinke 17]: let $M$ be a DP algorithm that on input $S \in X^n$ outputs a function $\phi : X \to \mathbb{R}$ and a value $v \in \mathbb{R}$. Then w.h.p. over $S \sim P^n$ and $(\phi, v) \leftarrow M(S)$:
$\Pr_{y \sim P}[v \le \phi(y)] \approx \Pr_{y \sim S}[v \le \phi(y)]$.
• If $v$ is within the $[1/3, 2/3]$ empirical quantiles then $v$ is within the $[1/4, 3/4]$ true quantiles.
• If $A_i$ is well-concentrated on $P$, the $[1/4, 3/4]$ true quantiles lie within the mean $\pm 2\sigma$, so $v$ does too, and high-probability bounds are easy to prove.

Limits
Any algorithm for answering $k$ adaptively chosen SQs with accuracy $\tau$ requires* $n = \tilde{\Omega}(\sqrt{k}/\tau)$ samples [Hardt, Ullman 14; Steinke, Ullman 15].
*in sufficiently high dimension or under crypto assumptions
Verification of responses to queries needs only $n = O(\sqrt{m} \log k)$, where $m$ is the number of queries that failed verification:
  o Data splitting if overfitting [DFHPRR 14]
  o Reusable holdout [DFHPRR 15]
  o Maintaining a public leaderboard in a competition [Blum, Hardt 15]

Open problems
Analysts without side information about $P$:
• Queries depend only on previous answers. Does there exist an SQ analyst whose queries require more than $O(\log k)$ samples to answer (with 0.1 accuracy/confidence)?
• Fixed "natural" analyst / learning algorithm:
  o Gradient descent for stochastic convex optimization

Stochastic convex optimization
• Convex body $K = B_2^d(1) = \{x : \|x\|_2 \le 1\}$
• Class $F$ of convex 1-Lipschitz functions: $F = \{f \text{ convex} : \forall x \in K, \|\nabla f(x)\|_2 \le 1\}$
Given $f_1, \dots, f_n$ sampled i.i.d. from an unknown $P$ over $F$, minimize the true (expected) objective $F_P(x) \doteq \mathbb{E}_{f \sim P}[f(x)]$ over $K$:
• Find $\bar{x}$ s.t. $F_P(\bar{x}) \le \min_{x \in K} F_P(x) + \varepsilon$

Gradient descent
ERM via projected gradient descent on $F_S(x) \doteq \frac{1}{n} \sum_i f_i(x)$:
Initialize $x_1 \in K$; for $t = 1$ to $T$, set $x_{t+1} = \mathrm{Project}_K(x_t - \eta \cdot \nabla F_S(x_t))$; output $\bar{x} = \frac{1}{T} \sum_t x_t$.
With fresh samples at every step, $\|\nabla F_S(x_t) - \nabla F_P(x_t)\|_2 \le 1/\sqrt{n}$; when the same $S$ is reused, the sample complexity of this ERM is unknown.
• Uniform convergence: $n = \tilde{O}(d/\varepsilon^2)$ samples (tight [F. 16])
• SGD solves the problem using $n = O(1/\varepsilon^2)$ samples [Robbins, Monro 51; Polyak 90]
Overall: $d/\varepsilon^2$ statistical queries with accuracy $\varepsilon$ in $1/\varepsilon^2$ adaptive rounds.
• Sample splitting: $n = O(\log d / \varepsilon^4)$ samples
• DP: $n = O(\sqrt{d}/\varepsilon^3)$ samples

Conclusions
✓ Real-valued analyses (without any assumptions)
• Going beyond tools from DP
  o Other notions of stability for outcomes
  o Max / mutual information
• Generalization beyond uniform convergence
• Using these techniques in practice
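As a closing illustration of the last open problem, here is the gradient-descent analyst written as an adaptive sequence of statistical queries. This is a sketch under my own choices (the `grad_f` interface, zero initialization, fixed step size): each iteration issues $d$ queries, the coordinates of the empirical gradient at $x_t$, and the next iterate, hence the next batch of queries, depends on all previous answers.

```python
import numpy as np

def project_l2_ball(x):
    """Project onto K = B_2^d(1), the unit Euclidean ball."""
    norm = np.linalg.norm(x)
    return x if norm <= 1.0 else x / norm

def erm_projected_gd(grad_f, samples, d, steps, eta):
    """Projected gradient descent on the empirical objective
    F_S(x) = (1/n) sum_i f_i(x). Each iteration asks d statistical
    queries (the coordinates of the average gradient at x_t), and
    x_{t+1}, hence the next queries, depends on all previous answers.
    grad_f(f, x) is assumed to return the (sub)gradient of the sample
    objective f at x."""
    x = np.zeros(d)  # x_1 in K
    iterates = []
    for _ in range(steps):
        g = np.mean([grad_f(f, x) for f in samples], axis=0)  # d adaptive SQs
        x = project_l2_ball(x - eta * g)
        iterates.append(x)
    return np.mean(iterates, axis=0)  # average iterate x-bar

# Hypothetical usage: linear objectives f_a(x) = <a, x> with ||a||_2 <= 1,
# so grad_f(a, x) = a and the average iterate approaches -abar/||abar||.
rng = np.random.default_rng(2)
A = rng.normal(size=(1_000, 20))
A /= np.maximum(np.linalg.norm(A, axis=1, keepdims=True), 1.0)
xbar = erm_projected_gd(lambda a, x: a, list(A), d=20, steps=400, eta=0.05)
```

With $T = 1/\varepsilon^2$ rounds this is exactly the $d/\varepsilon^2$ statistical queries in $1/\varepsilon^2$ adaptive rounds counted above; whether this particular analyst can be answered with far fewer samples than the generic DP bound of $O(\sqrt{d}/\varepsilon^3)$ is exactly what is open.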