
Understanding Generalization in Adaptive Data Analysis
Vitaly Feldman
Overview
• Adaptive data analysis
  o Motivation
  o Definitions
  o Basic techniques
  With Dwork, Hardt, Pitassi, Reingold, Roth [DFHPRR 14,15]
• New results [F, Steinke 17]
• Open problems
𝐄π‘₯βˆΌπ‘ƒ [πΏπ‘œπ‘ π‘ (𝑓, π‘₯)]=?
Learning
problem
Distribution 𝑃
over domain 𝑋
Model
Data
𝑓 = 𝐴(𝑆)
𝑆 = π‘₯1 , … , π‘₯𝑛 ∼ 𝑃𝑛
Analysis 𝐴
3
Statistical inference
• Data: $S$, $n$ i.i.d. samples from $P$
• Algorithm $A$ outputs $f = A(S)$
• Theory provides generalization guarantees for $f$: model complexity, Rademacher complexity, stability, online-to-batch, …
Data analysis is adaptive
Steps depend on previous analyses of the same dataset: the data analyst(s) run $A_1, A_2, \ldots, A_k$ on $S$ and observe the results $v_1, v_2, \ldots, v_k$.
• Exploratory data analysis
• Feature selection
• Model stacking
• Hyper-parameter tuning
• Shared datasets
• …
"Thou shalt not test hypotheses suggested by data"
"Quiet scandal of statistics" [Leo Breiman, 1992]
ML practice
Split the data into a Training set and a Testing set, and train $f$ on the training data.
Test error of $f \approx \mathbf{E}_{x\sim P}[\mathrm{Loss}(f,x)]$
ML practice now
Split the data into Training, Validation, and Testing sets: train $f_\theta$ on the training data, select $\theta$ on the validation data to obtain $f$, and evaluate on the testing data.
Test error of $f \approx \mathbf{E}_{x\sim P}[\mathrm{Loss}(f,x)]$
Adaptive data analysis [DFHPRR 14]
The data analyst(s) issue analyses $A_1, \ldots, A_k$ to an algorithm holding $S = x_1, \ldots, x_n \sim P^n$ and receive answers $v_1, \ldots, v_k$.
Goal: given $S$, compute $v_i$'s "close" to running $A_i$ on fresh samples.
• Each analysis is a query
• Design an algorithm for answering adaptively-chosen queries
Adaptive statistical queries
Statistical query oracle [Kearns 93]: the analyst(s) adaptively issue queries $\phi_i : X \to [0,1]$ to an oracle holding $S = x_1, \ldots, x_n \sim P^n$, where
$$A_i(S) \equiv \frac{1}{n} \sum_{x \in S} \phi_i(x).$$
Example: $\phi_i(x) = \mathrm{Loss}(f, x)$.
Requirement: $\left|v_i - \mathbf{E}_{x\sim P}[\phi_i(x)]\right| \le \tau$ with prob. $1 - \beta$.
• Can measure correlations, moments, accuracy/loss
• Run any statistical query algorithm
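To make the interface concrete, here is a minimal Python sketch of the naive oracle that answers each query with the empirical mean on a fixed sample $S$ (the class name EmpiricalSQOracle and the toy data are my own illustration, not from the talk):

    import numpy as np

    class EmpiricalSQOracle:
        """Answers statistical queries phi: X -> [0, 1] with the empirical mean on S,
        i.e. A_i(S) = (1/n) * sum_{x in S} phi_i(x). Accurate for non-adaptive queries,
        but offers no protection against adaptively-chosen ones."""

        def __init__(self, S):
            self.S = np.asarray(S)

        def answer(self, phi):
            # phi maps a single data point to a value in [0, 1]
            return np.mean([phi(x) for x in self.S])

    # Example: toy distribution P = uniform over [0, 1]^5; query the mean of coordinate 0.
    rng = np.random.default_rng(0)
    oracle = EmpiricalSQOracle(rng.uniform(size=(1000, 5)))
    print(oracle.answer(lambda x: x[0]))   # close to the true value 0.5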
Answering non-adaptive SQs
Given $k$ non-adaptive query functions $\phi_1, \ldots, \phi_k$ and $n$ i.i.d. samples from $P$, estimate $\mathbf{E}_{x\sim P}[\phi_i(x)]$.
Use the empirical mean $\mathbf{E}_S[\phi_i] = \frac{1}{n} \sum_{x \in S} \phi_i(x)$:
$$n = O\!\left(\frac{\log(k/\beta)}{\tau^2}\right)$$
Answering adaptively-chosen SQs
What if we use $\mathbf{E}_S[\phi_i]$? For some constant $\beta > 0$, adaptively-chosen queries can force
$$n \ge \frac{k}{\tau^2}.$$
(Variable selection, boosting, bagging, step-wise regression, …)
Data splitting: $n = O\!\left(\frac{k \cdot \log k}{\tau^2}\right)$
Answering adaptive SQs
[DFHPRR 14] There exists an algorithm that can answer $k$ adaptively chosen SQs with accuracy $\tau$ for
$$n = O\!\left(\frac{\sqrt{k}}{\tau^{2.5}}\right).$$
Data splitting: $O\!\left(\frac{k}{\tau^2}\right)$
[Bassily, Nissim, Smith, Steinke, Stemmer, Ullman 15]: $n = O\!\left(\frac{\sqrt{k}}{\tau^{2}}\right)$
Generalizes to low-sensitivity analyses: $\left|A_i(S) - A_i(S')\right| \le \frac{1}{n}$ when $S, S'$ differ in a single element; estimates $\mathbf{E}_{S\sim P^n}[A_i(S)]$ within $\tau$.
Differential privacy [Dwork, McSherry, Nissim, Smith 06]
A randomized algorithm $M$ is $(\epsilon, \delta)$-differentially private if for any two datasets $S, S'$ that differ in one element, the output distributions of $M(S)$ and $M(S')$ have bounded ratio:
$$\forall Z \subseteq \mathrm{range}(M), \quad \Pr[M(S) \in Z] \le e^{\epsilon} \cdot \Pr[M(S') \in Z] + \delta.$$
Differential privacy is stability
• DP implies generalization: it implies strongly uniform replace-one stability and generalization in expectation, and DP implies generalization with high probability [DFHPRR 14, BNSSSU 15].
• DP composes adaptively: the composition of $k$ $\epsilon$-DP algorithms is, for every $\delta > 0$, $\left(\epsilon\sqrt{k \log(1/\delta)},\, \delta\right)$-DP [Dwork, Rothblum, Vadhan 10].
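For a sense of scale, a worked instance of the stated bound (the full [Dwork, Rothblum, Vadhan 10] statement has an additional lower-order $k\epsilon(e^\epsilon - 1)$ term): composing $k = 100$ algorithms that are each $0.01$-DP, with $\delta = 10^{-6}$, gives roughly
$$0.01 \cdot \sqrt{100 \cdot \ln(10^{6})} \approx 0.37,$$
versus $k\epsilon = 1$ from basic composition.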
Value perturbation [DMNS 06]
Answer a low-sensitivity query $A$, with $\max_{S,S'} \left|A(S) - A(S')\right| \le 1/n$, by releasing $A(S) + \zeta$ where $\zeta$ is Gaussian $N(0, \tau)$.
Given $n$ samples, this achieves error $\approx \Delta(A) \cdot \sqrt{n} \cdot k^{1/4}$, where $\Delta(A)$ is the worst-case sensitivity $\max_{S,S'} \left|A(S) - A(S')\right|$.
$\Delta(A) \cdot \sqrt{n}$ could be much larger than the standard deviation of $A$.
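A minimal sketch of value perturbation for a statistical query (the noise scale below is an illustrative placeholder chosen to match the $\frac{B}{\sqrt{n}} \cdot k^{1/4}$ error scaling up to constants and log factors; a real deployment would calibrate it from a DP composition budget over the $k$ queries):

    import numpy as np

    def perturbed_sq_answer(S, phi, sigma, rng):
        """Answer the query phi: X -> [0, 1] with A(S) + zeta, where A(S) is the
        empirical mean (sensitivity 1/n) and zeta ~ N(0, sigma^2)."""
        return np.mean([phi(x) for x in S]) + rng.normal(0.0, sigma)

    rng = np.random.default_rng(0)
    S = rng.uniform(size=(1000, 3))
    k = 50                                   # number of queries we plan to answer
    sigma = k ** 0.25 / np.sqrt(len(S))      # placeholder noise scale ~ k^(1/4) / sqrt(n)
    print(perturbed_sq_answer(S, lambda x: x[0], sigma, rng))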
Beyond low-sensitivity
[F, Steinke 17] There exists an algorithm that, for any adaptively-chosen sequence $A_1, \ldots, A_k : X^t \to \mathbb{R}$, given $n = O(\sqrt{k} \cdot t)$ i.i.d. samples from $P$, outputs values $v_1, \ldots, v_k$ such that w.h.p. for all $i$:
$$\left|\mathbf{E}_{S\sim P^t}[A_i(S)] - v_i\right| \le 2\sigma_i, \quad \text{where } \sigma_i = \sqrt{\mathbf{Var}_{S\sim P^t}[A_i(S)]}.$$
For statistical queries $\phi_i : X \to [-B, B]$, given $n$ samples the error scales as
$$\sqrt{\frac{\mathbf{Var}_{x\sim P}[\phi_i(x)]}{n}} \cdot k^{1/4}, \quad \text{versus value perturbation: } \frac{B}{\sqrt{n}} \cdot k^{1/4}.$$
Stable Median
Split $S$ into disjoint subsamples $S_1, S_2, \ldots, S_m$ of size $t$ each ($n = tm$), run $A_i$ on every subsample to obtain values $y_1, \ldots, y_m$, and let $U = \{y_1, \ldots, y_m\}$.
Find an approximate median $v$ with DP relative to $U$: a value $v$ greater than the bottom $1/3$ and smaller than the top $1/3$ of $U$.
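A minimal sketch of the split-and-evaluate step (the function name split_and_evaluate is mine); the DP median release for the resulting $U$ is sketched after the next slide:

    import numpy as np

    def split_and_evaluate(S, A, t, rng):
        """Split S into m = len(S) // t disjoint chunks of size t, run the analysis A
        on each chunk, and return the multiset U = {y_1, ..., y_m} of its values."""
        S = np.asarray(S)
        perm = rng.permutation(len(S))
        m = len(S) // t
        return np.array([A(S[perm[j * t:(j + 1) * t]]) for j in range(m)])

    rng = np.random.default_rng(0)
    S = rng.normal(size=2000)                # n = t * m samples
    U = split_and_evaluate(S, A=lambda chunk: chunk.mean(), t=20, rng=rng)
    print(len(U), U[:3])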
Median algorithms
Requires discretization: ground set $T$, $|T| = r$.
• Upper bound: $2^{O(\log^* r)}$ samples; lower bound: $\Omega(\log^* r)$ samples [Bun, Nissim, Stemmer, Vadhan 15]
• Exponential mechanism [McSherry, Talwar 07]: output $v \in T$ with prob. $\propto e^{-\epsilon \left|\#\{y \in U \,:\, v \le y\} - \frac{m}{2}\right|}$; uses $O\!\left(\frac{\log r}{\epsilon}\right)$ samples
Stability and confidence amplification for the price of one log factor!
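A sketch of the exponential-mechanism median over a discretized ground set $T$, following the score above (note that the standard privacy analysis would use $\epsilon/2$ in the exponent for pure $\epsilon$-DP; the discretization and parameters below are illustrative):

    import numpy as np

    def dp_median(U, T, eps, rng):
        """Sample v in T with probability proportional to
        exp(-eps * |#{y in U : y <= v} - m/2|), an approximate median of U."""
        U = np.sort(np.asarray(U))
        m = len(U)
        ranks = np.searchsorted(U, T, side="right")       # #{y in U : y <= v} for each v in T
        scores = -np.abs(ranks - m / 2.0)
        weights = np.exp(eps * (scores - scores.max()))   # shift by max for numerical stability
        return rng.choice(T, p=weights / weights.sum())

    rng = np.random.default_rng(0)
    U = rng.normal(loc=0.3, scale=0.1, size=100)          # the per-chunk values y_1, ..., y_m
    T = np.linspace(-1.0, 1.0, 2001)                      # ground set of size r = 2001
    print(dp_median(U, T, eps=1.0, rng=rng))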
Analysis
Differential privacy approximately preserves quantiles.
[F, Steinke 17] Let $M$ be a DP algorithm that on input $U \in Y^m$ outputs a function $\phi: Y \to \mathbb{R}$ and a value $v \in \mathbb{R}$. Then w.h.p. over $U \sim D^m$ and $(\phi, v) \leftarrow M(U)$:
$$\Pr_{y\sim D}[v \le \phi(y)] \approx \Pr_{y\sim U}[v \le \phi(y)].$$
If $v$ is within the $\left[\frac{1}{3}, \frac{2}{3}\right]$ empirical quantiles, then $v$ is within the $\left[\frac{1}{4}, \frac{3}{4}\right]$ true quantiles, and hence within mean $\pm 2\sigma$.
If $\phi$ is well-concentrated on $D$, then it is easy to prove high-probability bounds.
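The last step is Chebyshev's inequality: if $\phi$ has mean $\mu$ and variance $\sigma^2$ under $D$, then $\Pr_{y\sim D}[|\phi(y) - \mu| \ge 2\sigma] \le 1/4$, so every point between the $1/4$ and $3/4$ true quantiles of $\phi(y)$ lies in $[\mu - 2\sigma, \mu + 2\sigma]$; combined with the quantile-preservation statement above, the released $v$ lands within mean $\pm 2\sigma$.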
Limits
Any algorithm for answering $k$ adaptively chosen SQs with accuracy $\tau$ requires* $n = \Omega(\sqrt{k}/\tau)$ samples [Hardt, Ullman 14; Steinke, Ullman 15].
*in sufficiently high dimension or under crypto assumptions
Verification of responses to queries: $n = O(\sqrt{c} \log k)$, where $c$ is the number of queries that failed verification.
o Data splitting if overfitting [DFHPRR 14]
o Reusable holdout [DFHPRR 15]
o Maintaining a public leaderboard in a competition [Blum, Hardt 15]
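For a flavor of the reusable holdout referenced above, here is a minimal sketch in the spirit of the Thresholdout algorithm of [DFHPRR 15] (the threshold, noise scale, and budget below are illustrative placeholders, not the paper's exact settings):

    import numpy as np

    class ReusableHoldout:
        """Returns the training-set estimate of a statistical query unless it disagrees
        with the holdout estimate by more than a noisy threshold; in that case a noisy
        holdout estimate is returned and the overfitting budget is charged."""

        def __init__(self, train, holdout, threshold=0.04, scale=0.01, budget=100, seed=0):
            self.train, self.holdout = np.asarray(train), np.asarray(holdout)
            self.threshold, self.scale, self.budget = threshold, scale, budget
            self.rng = np.random.default_rng(seed)

        def answer(self, phi):
            train_val = np.mean([phi(x) for x in self.train])
            hold_val = np.mean([phi(x) for x in self.holdout])
            if abs(train_val - hold_val) <= self.threshold + self.rng.laplace(0, self.scale):
                return train_val               # no sign of overfitting: answer from training data
            if self.budget <= 0:
                raise RuntimeError("overfitting budget exhausted")
            self.budget -= 1
            return hold_val + self.rng.laplace(0, self.scale)

    rng = np.random.default_rng(1)
    ro = ReusableHoldout(train=rng.uniform(size=(500, 3)), holdout=rng.uniform(size=(500, 3)))
    print(ro.answer(lambda x: x[0]))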
Open problems
• Analysts without side information about $P$ (queries depend only on previous answers): does there exist an SQ analyst whose queries require more than $O(\log k)$ samples to answer (with 0.1 accuracy/confidence)?
• Fixed "natural" analyst / learning algorithm
  o Gradient descent for stochastic convex optimization
Stochastic convex optimization
• Convex body $K = \mathbb{B}_2^d(1) \doteq \{x \,:\, \|x\|_2 \le 1\}$
• Class $F$ of convex 1-Lipschitz functions: $F = \{f \text{ convex} \mid \forall x \in K,\ \|\nabla f(x)\|_2 \le 1\}$
Given $f_1, \ldots, f_n$ sampled i.i.d. from an unknown $P$ over $F$, minimize the true (expected) objective $f_P(x) \doteq \mathbf{E}_{f\sim P}[f(x)]$ over $K$:
• Find $\bar{x}$ s.t. $f_P(\bar{x}) \le \min_{x \in K} f_P(x) + \epsilon$
Gradient descent
ERM via projected gradient descent on $f_S(x) \doteq \frac{1}{n} \sum_i f_i(x)$:
• Initialize $x_1 \in K$
• For $t = 1$ to $T$: $x_{t+1} = \mathrm{Project}_K\!\left(x_t - \eta \cdot \nabla f_S(x_t)\right)$, where $\nabla f_S(x_t) = \frac{1}{n} \sum_i \nabla f_i(x_t)$
• Output: $\frac{1}{T} \sum_t x_t$
With fresh samples in each step: $\left\|\nabla f_S(x_t) - \nabla f_P(x_t)\right\|_2 \le 1/\sqrt{n}$
Sample complexity is unknown
• Uniform convergence: $O\!\left(\frac{d}{\epsilon^2}\right)$ samples (tight [F. 16])
• SGD solves using $O\!\left(\frac{1}{\epsilon^2}\right)$ samples [Robbins, Monro 51; Polyak 90]
Overall: $d/\epsilon^2$ statistical queries with accuracy $\epsilon$ in $1/\epsilon^2$ adaptive rounds
• Sample splitting: $O\!\left(\frac{\log d}{\epsilon^4}\right)$ samples
• DP: $O\!\left(\frac{\sqrt{d}}{\epsilon^3}\right)$ samples
Conclusions
✓ Real-valued analyses (without any assumptions)
• Going beyond tools from DP
  o Other notions of stability for outcomes
  o Max/mutual information
• Generalization beyond uniform convergence
• Using these techniques in practice