
Understanding Generalization in Adaptive Data Analysis
Vitaly Feldman
Overview
• Adaptive data analysis
  o Motivation
  o Definitions
  o Basic techniques
  With Dwork, Hardt, Pitassi, Reingold, Roth [DFHPRR 14,15]
• New results [F, Steinke 17]
• Open problems
𝐄π‘₯βˆΌπ‘ƒ [πΏπ‘œπ‘ π‘ (𝑓, π‘₯)]=?
Learning
problem
Distribution 𝑃
over domain 𝑋
Model
Data
𝑓 = 𝐴(𝑆)
𝑆 = π‘₯1 , … , π‘₯𝑛 ∼ 𝑃𝑛
Analysis 𝐴
3
Statistical inference
• Data: $S$, $n$ i.i.d. samples from $P$
• Algorithm $A$ outputs $f = A(S)$
• Theory provides generalization guarantees for $f$: model complexity, Rademacher complexity, stability, online-to-batch, …
Data analysis is adaptive
Steps depend on previous analyses of the same dataset: the data analyst(s) run $A_1, A_2, \ldots, A_k$ on $S$ and observe the results $v_1, v_2, \ldots, v_k$.
• Exploratory data analysis
• Feature selection
• Model stacking
• Hyper-parameter tuning
• Shared datasets
• …
"Thou shalt not test hypotheses suggested by data"
"Quiet scandal of statistics" [Leo Breiman, 1992]
ML practice
Split the data into a Training set and a Testing set, and train $f$ on the training data.
Test error of $f \approx \mathbf{E}_{x\sim P}[\mathrm{Loss}(f,x)]$
ML practice now
Split the data into Training, Validation, and Testing sets: train $f_\theta$ on the training data, select $\theta$ on the validation data to obtain $f$, and evaluate on the testing data.
Test error of $f \approx \mathbf{E}_{x\sim P}[\mathrm{Loss}(f,x)]$
Adaptive data analysis [DFHPRR 14]
The data analyst(s) issue analyses $A_1, \ldots, A_k$ to an algorithm holding $S = x_1, \ldots, x_n \sim P^n$ and receive answers $v_1, \ldots, v_k$.
Goal: given $S$, compute $v_i$'s "close" to running $A_i$ on fresh samples.
• Each analysis is a query
• Design an algorithm for answering adaptively-chosen queries
Adaptive statistical queries
Statistical query oracle [Kearns 93]: the analyst(s) adaptively issue queries $\phi_i : X \to [0,1]$ to an oracle holding $S = x_1, \ldots, x_n \sim P^n$, where
$$A_i(S) \equiv \frac{1}{n} \sum_{x \in S} \phi_i(x).$$
Example: $\phi_i(x) = \mathrm{Loss}(f, x)$.
Requirement: $\left|v_i - \mathbf{E}_{x\sim P}[\phi_i(x)]\right| \le \tau$ with prob. $1 - \beta$.
• Can measure correlations, moments, accuracy/loss
• Run any statistical query algorithm
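To make the interface concrete, here is a minimal Python sketch of the naive oracle that answers each query with the empirical mean on a fixed sample $S$ (the class name EmpiricalSQOracle and the toy data are my own illustration, not from the talk):

    import numpy as np

    class EmpiricalSQOracle:
        """Answers statistical queries phi: X -> [0, 1] with the empirical mean on S,
        i.e. A_i(S) = (1/n) * sum_{x in S} phi_i(x). Accurate for non-adaptive queries,
        but offers no protection against adaptively-chosen ones."""

        def __init__(self, S):
            self.S = np.asarray(S)

        def answer(self, phi):
            # phi maps a single data point to a value in [0, 1]
            return np.mean([phi(x) for x in self.S])

    # Example: toy distribution P = uniform over [0, 1]^5; query the mean of coordinate 0.
    rng = np.random.default_rng(0)
    oracle = EmpiricalSQOracle(rng.uniform(size=(1000, 5)))
    print(oracle.answer(lambda x: x[0]))   # close to the true value 0.5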
Answering non-adaptive SQs
Given $k$ non-adaptive query functions $\phi_1, \ldots, \phi_k$ and $n$ i.i.d. samples from $P$, estimate $\mathbf{E}_{x\sim P}[\phi_i(x)]$.
Use the empirical mean $\mathbf{E}_S[\phi_i] = \frac{1}{n} \sum_{x \in S} \phi_i(x)$:
$$n = O\!\left(\frac{\log(k/\beta)}{\tau^2}\right)$$
Answering adaptively-chosen SQs
What if we use $\mathbf{E}_S[\phi_i]$? For some constant $\beta > 0$, adaptively-chosen queries can force
$$n \ge \frac{k}{\tau^2}.$$
(Variable selection, boosting, bagging, step-wise regression, …)
Data splitting: $n = O\!\left(\frac{k \cdot \log k}{\tau^2}\right)$
Answering adaptive SQs
[DFHPRR 14] There exists an algorithm that can answer $k$ adaptively chosen SQs with accuracy $\tau$ for
$$n = O\!\left(\frac{\sqrt{k}}{\tau^{2.5}}\right).$$
Data splitting: $O\!\left(\frac{k}{\tau^2}\right)$
[Bassily, Nissim, Smith, Steinke, Stemmer, Ullman 15]: $n = O\!\left(\frac{\sqrt{k}}{\tau^{2}}\right)$
Generalizes to low-sensitivity analyses: $\left|A_i(S) - A_i(S')\right| \le \frac{1}{n}$ when $S, S'$ differ in a single element; estimates $\mathbf{E}_{S\sim P^n}[A_i(S)]$ within $\tau$.
Differential privacy [Dwork, McSherry, Nissim, Smith 06]
A randomized algorithm $M$ is $(\epsilon, \delta)$-differentially private if for any two datasets $S, S'$ that differ in one element, the output distributions of $M(S)$ and $M(S')$ have bounded ratio:
$$\forall Z \subseteq \mathrm{range}(M), \quad \Pr[M(S) \in Z] \le e^{\epsilon} \cdot \Pr[M(S') \in Z] + \delta.$$
Differential privacy is stability
• DP implies generalization: it implies strongly uniform replace-one stability and generalization in expectation, and DP implies generalization with high probability [DFHPRR 14, BNSSSU 15].
• DP composes adaptively: the composition of $k$ $\epsilon$-DP algorithms is, for every $\delta > 0$, $\left(\epsilon\sqrt{k \log(1/\delta)},\, \delta\right)$-DP [Dwork, Rothblum, Vadhan 10].
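For a sense of scale, a worked instance of the stated bound (the full [Dwork, Rothblum, Vadhan 10] statement has an additional lower-order $k\epsilon(e^\epsilon - 1)$ term): composing $k = 100$ algorithms that are each $0.01$-DP, with $\delta = 10^{-6}$, gives roughly
$$0.01 \cdot \sqrt{100 \cdot \ln(10^{6})} \approx 0.37,$$
versus $k\epsilon = 1$ from basic composition.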
Value perturbation [DMNS 06]
Answer a low-sensitivity query $A$, with $\max_{S,S'} \left|A(S) - A(S')\right| \le 1/n$, by releasing $A(S) + \zeta$ where $\zeta$ is Gaussian $N(0, \tau)$.
Given $n$ samples, this achieves error $\approx \Delta(A) \cdot \sqrt{n} \cdot k^{1/4}$, where $\Delta(A)$ is the worst-case sensitivity $\max_{S,S'} \left|A(S) - A(S')\right|$.
$\Delta(A) \cdot \sqrt{n}$ could be much larger than the standard deviation of $A$.
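A minimal sketch of value perturbation for a statistical query (the noise scale below is an illustrative placeholder chosen to match the $\frac{B}{\sqrt{n}} \cdot k^{1/4}$ error scaling up to constants and log factors; a real deployment would calibrate it from a DP composition budget over the $k$ queries):

    import numpy as np

    def perturbed_sq_answer(S, phi, sigma, rng):
        """Answer the query phi: X -> [0, 1] with A(S) + zeta, where A(S) is the
        empirical mean (sensitivity 1/n) and zeta ~ N(0, sigma^2)."""
        return np.mean([phi(x) for x in S]) + rng.normal(0.0, sigma)

    rng = np.random.default_rng(0)
    S = rng.uniform(size=(1000, 3))
    k = 50                                   # number of queries we plan to answer
    sigma = k ** 0.25 / np.sqrt(len(S))      # placeholder noise scale ~ k^(1/4) / sqrt(n)
    print(perturbed_sq_answer(S, lambda x: x[0], sigma, rng))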
Beyond low-sensitivity
[F, Steinke 17] There exists an algorithm that, for any adaptively-chosen sequence $A_1, \ldots, A_k : X^t \to \mathbb{R}$, given $n = O(\sqrt{k} \cdot t)$ i.i.d. samples from $P$, outputs values $v_1, \ldots, v_k$ such that w.h.p. for all $i$:
$$\left|\mathbf{E}_{S\sim P^t}[A_i(S)] - v_i\right| \le 2\sigma_i, \quad \text{where } \sigma_i = \sqrt{\mathbf{Var}_{S\sim P^t}[A_i(S)]}.$$
For statistical queries $\phi_i : X \to [-B, B]$, given $n$ samples the error scales as
$$\sqrt{\frac{\mathbf{Var}_{x\sim P}[\phi_i(x)]}{n}} \cdot k^{1/4}, \quad \text{versus value perturbation: } \frac{B}{\sqrt{n}} \cdot k^{1/4}.$$
Stable Median
Split $S$ into disjoint subsamples $S_1, S_2, \ldots, S_m$ of size $t$ each ($n = tm$), run $A_i$ on every subsample to obtain values $y_1, \ldots, y_m$, and let $U = \{y_1, \ldots, y_m\}$.
Find an approximate median $v$ with DP relative to $U$: a value $v$ greater than the bottom $1/3$ and smaller than the top $1/3$ of $U$.
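A minimal sketch of the split-and-evaluate step (the function name split_and_evaluate is mine); the DP median release for the resulting $U$ is sketched after the next slide:

    import numpy as np

    def split_and_evaluate(S, A, t, rng):
        """Split S into m = len(S) // t disjoint chunks of size t, run the analysis A
        on each chunk, and return the multiset U = {y_1, ..., y_m} of its values."""
        S = np.asarray(S)
        perm = rng.permutation(len(S))
        m = len(S) // t
        return np.array([A(S[perm[j * t:(j + 1) * t]]) for j in range(m)])

    rng = np.random.default_rng(0)
    S = rng.normal(size=2000)                # n = t * m samples
    U = split_and_evaluate(S, A=lambda chunk: chunk.mean(), t=20, rng=rng)
    print(len(U), U[:3])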
Median algorithms
Requires discretization: ground set $T$, $|T| = r$.
• Upper bound: $2^{O(\log^* r)}$ samples; lower bound: $\Omega(\log^* r)$ samples [Bun, Nissim, Stemmer, Vadhan 15]
• Exponential mechanism [McSherry, Talwar 07]: output $v \in T$ with prob. $\propto e^{-\epsilon \left|\#\{y \in U \,:\, v \le y\} - \frac{m}{2}\right|}$; uses $O\!\left(\frac{\log r}{\epsilon}\right)$ samples
Stability and confidence amplification for the price of one log factor!
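A sketch of the exponential-mechanism median over a discretized ground set $T$, following the score above (note that the standard privacy analysis would use $\epsilon/2$ in the exponent for pure $\epsilon$-DP; the discretization and parameters below are illustrative):

    import numpy as np

    def dp_median(U, T, eps, rng):
        """Sample v in T with probability proportional to
        exp(-eps * |#{y in U : y <= v} - m/2|), an approximate median of U."""
        U = np.sort(np.asarray(U))
        m = len(U)
        ranks = np.searchsorted(U, T, side="right")       # #{y in U : y <= v} for each v in T
        scores = -np.abs(ranks - m / 2.0)
        weights = np.exp(eps * (scores - scores.max()))   # shift by max for numerical stability
        return rng.choice(T, p=weights / weights.sum())

    rng = np.random.default_rng(0)
    U = rng.normal(loc=0.3, scale=0.1, size=100)          # the per-chunk values y_1, ..., y_m
    T = np.linspace(-1.0, 1.0, 2001)                      # ground set of size r = 2001
    print(dp_median(U, T, eps=1.0, rng=rng))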
Analysis
Differential privacy approximately preserves quantiles.
[F, Steinke 17] Let $M$ be a DP algorithm that on input $U \in Y^m$ outputs a function $\phi: Y \to \mathbb{R}$ and a value $v \in \mathbb{R}$. Then w.h.p. over $U \sim D^m$ and $(\phi, v) \leftarrow M(U)$:
$$\Pr_{y\sim D}[v \le \phi(y)] \approx \Pr_{y\sim U}[v \le \phi(y)].$$
If $v$ is within the $\left[\frac{1}{3}, \frac{2}{3}\right]$ empirical quantiles, then $v$ is within the $\left[\frac{1}{4}, \frac{3}{4}\right]$ true quantiles, and hence within mean $\pm 2\sigma$.
If $\phi$ is well-concentrated on $D$, then it is easy to prove high-probability bounds.
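The last step is Chebyshev's inequality: if $\phi$ has mean $\mu$ and variance $\sigma^2$ under $D$, then $\Pr_{y\sim D}[|\phi(y) - \mu| \ge 2\sigma] \le 1/4$, so every point between the $1/4$ and $3/4$ true quantiles of $\phi(y)$ lies in $[\mu - 2\sigma, \mu + 2\sigma]$; combined with the quantile-preservation statement above, the released $v$ lands within mean $\pm 2\sigma$.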
Limits
Any algorithm for answering $k$ adaptively chosen SQs with accuracy $\tau$ requires* $n = \Omega(\sqrt{k}/\tau)$ samples [Hardt, Ullman 14; Steinke, Ullman 15].
*in sufficiently high dimension or under crypto assumptions
Verification of responses to queries: $n = O(\sqrt{c} \log k)$, where $c$ is the number of queries that failed verification.
o Data splitting if overfitting [DFHPRR 14]
o Reusable holdout [DFHPRR 15]
o Maintaining a public leaderboard in a competition [Blum, Hardt 15]
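For a flavor of the reusable holdout referenced above, here is a minimal sketch in the spirit of the Thresholdout algorithm of [DFHPRR 15] (the threshold, noise scale, and budget below are illustrative placeholders, not the paper's exact settings):

    import numpy as np

    class ReusableHoldout:
        """Returns the training-set estimate of a statistical query unless it disagrees
        with the holdout estimate by more than a noisy threshold; in that case a noisy
        holdout estimate is returned and the overfitting budget is charged."""

        def __init__(self, train, holdout, threshold=0.04, scale=0.01, budget=100, seed=0):
            self.train, self.holdout = np.asarray(train), np.asarray(holdout)
            self.threshold, self.scale, self.budget = threshold, scale, budget
            self.rng = np.random.default_rng(seed)

        def answer(self, phi):
            train_val = np.mean([phi(x) for x in self.train])
            hold_val = np.mean([phi(x) for x in self.holdout])
            if abs(train_val - hold_val) <= self.threshold + self.rng.laplace(0, self.scale):
                return train_val               # no sign of overfitting: answer from training data
            if self.budget <= 0:
                raise RuntimeError("overfitting budget exhausted")
            self.budget -= 1
            return hold_val + self.rng.laplace(0, self.scale)

    rng = np.random.default_rng(1)
    ro = ReusableHoldout(train=rng.uniform(size=(500, 3)), holdout=rng.uniform(size=(500, 3)))
    print(ro.answer(lambda x: x[0]))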
Open problems
• Analysts without side information about $P$ (queries depend only on previous answers): does there exist an SQ analyst whose queries require more than $O(\log k)$ samples to answer (with 0.1 accuracy/confidence)?
• Fixed "natural" analyst / learning algorithm
  o Gradient descent for stochastic convex optimization
Stochastic convex optimization
• Convex body $K = \mathbb{B}_2^d(1) \doteq \{x \,:\, \|x\|_2 \le 1\}$
• Class $F$ of convex 1-Lipschitz functions: $F = \{f \text{ convex} \mid \forall x \in K,\ \|\nabla f(x)\|_2 \le 1\}$
Given $f_1, \ldots, f_n$ sampled i.i.d. from an unknown $P$ over $F$, minimize the true (expected) objective $f_P(x) \doteq \mathbf{E}_{f\sim P}[f(x)]$ over $K$:
• Find $\bar{x}$ s.t. $f_P(\bar{x}) \le \min_{x \in K} f_P(x) + \epsilon$
Gradient descent
ERM via projected gradient descent on $f_S(x) \doteq \frac{1}{n} \sum_i f_i(x)$:
• Initialize $x_1 \in K$
• For $t = 1$ to $T$: $x_{t+1} = \mathrm{Project}_K\!\left(x_t - \eta \cdot \nabla f_S(x_t)\right)$, where $\nabla f_S(x_t) = \frac{1}{n} \sum_i \nabla f_i(x_t)$
• Output: $\frac{1}{T} \sum_t x_t$
With fresh samples in each step: $\left\|\nabla f_S(x_t) - \nabla f_P(x_t)\right\|_2 \le 1/\sqrt{n}$
Sample complexity is unknown
• Uniform convergence: $O\!\left(\frac{d}{\epsilon^2}\right)$ samples (tight [F. 16])
• SGD solves using $O\!\left(\frac{1}{\epsilon^2}\right)$ samples [Robbins, Monro 51; Polyak 90]
Overall: $d/\epsilon^2$ statistical queries with accuracy $\epsilon$ in $1/\epsilon^2$ adaptive rounds
• Sample splitting: $O\!\left(\frac{\log d}{\epsilon^4}\right)$ samples
• DP: $O\!\left(\frac{\sqrt{d}}{\epsilon^3}\right)$ samples
Conclusions
✓ Real-valued analyses (without any assumptions)
• Going beyond tools from DP
  o Other notions of stability for outcomes
  o Max/mutual information
• Generalization beyond uniform convergence
• Using these techniques in practice