
Approximate resilience, monotonicity, and the complexity of agnostic learning

Vitaly Feldman (IBM Research – Almaden)
Dana Dachman-Soled (UMD)
Li-Yang Tan (Simons Inst.)
Andrew Wan (IDA)
Karl Wimmer (Duquesne)

SODA 2015
Learning from examples
[Figure: labeled examples (+ / −) are fed to a learning algorithm, which outputs a classifier]
Agnostic learning [V 84; H 92; KSS 94]
[Figure: noisy labeled examples (+ / −) and a classifier h]
Example: (x, ℓ) ∼ P
P: distribution over X × {−1, +1}
Opt_P(C) = min_{f∈C} Pr_{(x,ℓ)∼P}[f(x) ≠ ℓ]
Agnostic learning of a class of functions C with excess error ε: for every P, output w.h.p. h such that
Pr_{(x,ℓ)∼P}[h(x) ≠ ℓ] ≤ Opt_P(C) + ε
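To make the definition concrete, here is a minimal sketch (in Python) of estimating excess error from a sample; the finite class C passed as a list of classifiers is an illustrative assumption, not part of the slide.

```python
# A minimal sketch of estimating excess error from a sample; the finite,
# hypothetical class C (a list of classifiers) is an illustrative assumption.
import numpy as np

def error(h, sample):
    """Empirical estimate of Pr_{(x,l)~P}[h(x) != l]."""
    return float(np.mean([h(x) != l for (x, l) in sample]))

def excess_error(h, C, sample):
    """Empirical error of h minus the empirical error of the best f in C."""
    return error(h, sample) - min(error(f, sample) for f in C)
```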
Complexity of agnostic learning
• Too hard
– Conjunctions over {0,1}^n: n^{O(√n)} for ε = Ω(1) [KKMS 05]
• … but can’t prove lower bounds
Complexity of agnostic learning
Distribution-specific learning over distribution D:
the marginal of P on X equals a fixed D
• For the uniform distribution U over {0,1}^n
– Conjunctions: n^{O(log 1/ε)}
– Halfspaces: n^{O(1/ε^4)} [KKMS 05]; n^{O(1/ε^2)} [FKV 14]
– Monotone juntas?
Our characterization
• ‖g‖_1 = E_{x∼U}[|g(x)|]
• Pol_d: polynomials of degree d over n vars
• Δ(f, Pol_d) = min_{p∈Pol_d} ‖f − p‖_1
THM: For degree d let δ = max_{f∈C} Δ(f, Pol_d) / 2.
Any SQ algorithm for agnostically learning C over U with excess error < δ needs n^{Ω(d)} time
* C is closed under renaming of variables; d ≤ n^{1/3}
Polynomial L_1 regression [KKMS 05]
For degree d let δ = max_{f∈C} Δ(f, Pol_d) / 2.
There exists an SQ algorithm for learning C over U with excess error δ + ε in time poly(n^d / ε)
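A minimal sketch of the L_1 regression step behind this algorithm: fit a degree-d polynomial minimizing empirical L_1 error with a linear program, then threshold. The parity feature basis, the use of scipy's LP solver, and the fixed threshold at 0 are simplifying assumptions, not details from the slide.

```python
# A minimal sketch of degree-d L1 polynomial regression over {0,1}^n in the
# spirit of [KKMS 05]: fit a degree-d polynomial minimizing empirical L1 error
# with a linear program, then threshold. The parity feature basis and the fixed
# threshold at 0 are simplifying assumptions.
from itertools import combinations
import numpy as np
from scipy.optimize import linprog

def parity_features(X, d):
    """Columns chi_S(x) = prod_{i in S} (-1)^{x_i} for all sets S with |S| <= d."""
    signs = 1 - 2 * np.asarray(X)                      # map {0,1} -> {+1,-1}
    cols = [np.ones(len(signs))]
    for k in range(1, d + 1):
        for S in combinations(range(signs.shape[1]), k):
            cols.append(np.prod(signs[:, list(S)], axis=1))
    return np.column_stack(cols)

def l1_regression_classifier(X, labels, d):
    """Return h(x) = sign(p(x)), where p minimizes sum_i |p(x_i) - label_i|."""
    labels = np.asarray(labels, dtype=float)
    A = parity_features(X, d)
    m, num_coeffs = A.shape
    # LP variables: polynomial coefficients c, then slacks s with |A c - labels| <= s.
    objective = np.concatenate([np.zeros(num_coeffs), np.ones(m)])
    A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
    b_ub = np.concatenate([labels, -labels])
    bounds = [(None, None)] * num_coeffs + [(0, None)] * m
    coeffs = linprog(objective, A_ub=A_ub, b_ub=b_ub, bounds=bounds).x[:num_coeffs]
    return lambda X_new: np.where(parity_features(X_new, d) @ coeffs >= 0, 1, -1)
```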
Statistical queries [Kearns 93]
[Figure: SQ learning algorithm sends queries φ_1, φ_2, … to an SQ oracle for P and receives answers v_1, v_2, …]
φ_i : X × {−1,1} → [−1,1],  |v_i − E_P[φ_i(x, ℓ)]| ≤ τ
τ is the tolerance of the query
Complexity T:
• at most T queries
• each of tolerance at least 1/T
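A minimal sketch of the oracle interface, assuming the oracle is simulated from i.i.d. examples of P; the random ±τ perturbation just illustrates that any τ-accurate answer is acceptable.

```python
# A minimal sketch of an SQ oracle simulated from examples of P, assuming the
# query phi maps (x, label) into [-1, 1]. Any answer within tau of the true
# expectation is valid; uniform noise stands in for that slack here.
import numpy as np

def sq_oracle(examples, phi, tau, rng=np.random.default_rng(0)):
    """Return a value v with |v - E_P[phi(x, l)]| <= tau (up to sampling error)."""
    estimate = np.mean([phi(x, l) for (x, l) in examples])
    return float(np.clip(estimate + rng.uniform(-tau, tau), -1.0, 1.0))
```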
SQ algorithms
• PAC/agnostic learning algorithms (except Gaussian elimination)
• Convex optimization (Ellipsoid, iterative methods)
• Expectation maximization (EM)
• SVM (with kernel)
• PCA
• ICA
• ID3
• k-means
• method of moments
• MCMC
• Naïve Bayes
• Neural Networks (backprop)
• Perceptron
• Nearest neighbors
• Boosting
[K 93, BDMN 05, CKLYBNO 06, FPV 14]
Roadmap
• Proof: approximate resilience
• Tools: analysis of approximate resilience
• Application: monotone functions
• Open problems
Approximate resilience
g is d-resilient if for any degree-d polynomial p:
E_U[g(x) p(x)] = 0
All Fourier coefficients of degree at most d are 0
f is α-approximately d-resilient if there exists a d-resilient g: {0,1}^n → [−1,1] such that ‖f − g‖_1 ≤ α
THM: a Boolean f is α-approximately d-resilient if and only if Δ(f, Pol_d) ≥ 1 − α
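A brute-force sketch of the resilience definition for small n (the function g is assumed to be given as a Python callable on bit tuples):

```python
# A brute-force check of d-resilience on {0,1}^n: every Fourier coefficient of
# degree at most d must vanish. Exponential in n; for small illustrative n only.
from itertools import combinations, product
import numpy as np

def fourier_coefficient(g, n, S):
    """hat{g}(S) = E_{x~U}[ g(x) * prod_{i in S} (-1)^{x_i} ]."""
    total = 0.0
    for x in product([0, 1], repeat=n):
        chi = np.prod([(-1) ** x[i] for i in S]) if S else 1.0
        total += g(x) * chi
    return total / 2 ** n

def is_d_resilient(g, n, d, tol=1e-9):
    """True iff hat{g}(S) = 0 for every S with |S| <= d (degree-0 included)."""
    return all(abs(fourier_coefficient(g, n, S)) <= tol
               for k in range(d + 1)
               for S in combinations(range(n), k))
```

For example, parity on 3 bits, g = lambda x: (-1) ** sum(x), passes is_d_resilient(g, 3, 2) but fails is_d_resilient(g, 3, 3).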
From approx. resilience to SQ hardness
C closed under variable renaming ⇒
there exist m = n^{Ω(d)} functions f_1, …, f_m such that the corresponding g_1, …, g_m are uncorrelated
‖f − g‖_1 ≤ α ⇒ there is a P_g such that Pr_{(x,ℓ)∼P_g}[f(x) ≠ ℓ] ≤ α/2
Complexity of any SQ algorithm with error ≤ 1/2 − 1/m^{1/3} is at least m^{1/3} [BFJKMR 94; F 09]
Excess error: (1/2 − 1/m^{1/3}) − α/2 = Δ(f, Pol_d)/2 − 1/m^{1/3}
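One standard way to realize such a P_g (an assumption here; the slide does not spell out the construction): draw x uniformly and set the label to +1 with probability (1 + g(x))/2, so that E[ℓ | x] = g(x).

```python
# A minimal sketch of one standard realization of P_g (an assumption, not taken
# from the slide): x is uniform over {0,1}^n and the label is +1 with
# probability (1 + g(x)) / 2, so that E[label | x] = g(x). For Boolean f with
# ||f - g||_1 <= alpha this gives Pr[f(x) != label] <= alpha / 2.
import numpy as np

def sample_P_g(g, n, rng=np.random.default_rng(0)):
    """Draw one labeled example (x, label) from P_g."""
    x = tuple(int(b) for b in rng.integers(0, 2, size=n))
    label = 1 if rng.random() < (1 + g(x)) / 2 else -1
    return x, label
```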
Bounds on approximate resilience
1. Convert bounds on L_2 distance to an unbounded d-resilient function into bounds on L_1 distance to a [−1,1]-bounded d-resilient function
2. Amplify resilience degree via composition:
If f_1 is α_1-approximately d_1-resilient,
f_2 is α_2-approximately d_2-resilient, and
f_1 has low noise sensitivity,
then F(x_1, x_2, …, x_k) = f_1(f_2(x_1), …, f_2(x_k)) is α_3-approximately (d_1 d_2)-resilient
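The composition itself is simple; a minimal sketch, assuming f_1 takes k values in {−1,+1} and f_2 maps one block of input bits to {−1,+1}:

```python
# A minimal sketch of the degree-amplifying composition
# F(x_1, ..., x_k) = f1(f2(x_1), ..., f2(x_k)), assuming f1 maps k values in
# {-1,+1} to {-1,+1} and f2 maps one block of input bits to {-1,+1}.
def compose(f1, f2, blocks):
    """Apply f2 to each block x_i, then f1 to the resulting k values."""
    return f1([f2(block) for block in blocks])
```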
Monotone functions of k variables
Known results:
• n^{O(√k / ε^2)} [BT 96]
• n^{Ω(1/ε^2)} [KKMS 05]
THM: SQ complexity n^{Ω(√k)} for ε = 1/2 − o(1)
Show existence of an o(1)-approximately Ω(√k)-resilient mon. func.
• Lower bounds on Talagrand’s DNF [BBL 98]
Explicit constructions
• o(1)-approximately 2^{√log k}-resilient mon. func.
TRIBES (disjunction of k/log k conjunctions of log k variables) + amplification
• 2^{log k / loglog k}-resilient Boolean function o(1)-close to a mon. Boolean func.
CycleRun function [Wieder 12] + amplification
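For concreteness, a minimal sketch of TRIBES with outputs in {−1,+1}; the exact rounding of the block width is an assumption.

```python
# A minimal sketch of TRIBES on k bits: an OR of ~k/log k ANDs, each over
# ~log k consecutive bits. The exact rounding of the block width is an assumption.
import math

def tribes(x):
    """Return +1 if some block of ~log k consecutive bits is all ones, else -1."""
    k = len(x)
    width = max(1, round(math.log2(k)))                      # ~ log k
    blocks = [x[i:i + width] for i in range(0, k, width)]    # ~ k / log k blocks
    return 1 if any(all(bit == 1 for bit in block) for block in blocks) else -1
```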
Conclusions and open problems
• Characterization can be extended to general product distributions over more general domains
• Does not hold for non-product distributions [FK14]
• Can the lower bound be obtained via a reduction to a concrete problem? E.g. learning noisy parities
• Other techniques for proving bounds on approximate resilience (or L_1 approximation by polynomials)
• Complexity of distribution-independent agnostic SQ learning?