
Approximate resilience, monotonicity, and the complexity of agnostic learning

Vitaly Feldman (IBM Research – Almaden)
Dana Dachman-Soled (UMD)
Li-Yang Tan (Simons Inst.)
Andrew Wan (IDA)
Karl Wimmer (Duquesne)

SODA 2015
Learning from examples
[Figure: labeled examples (+ / −) are fed to a learning algorithm, which outputs a classifier]
Agnostic learning [V 84; H 92; KSS 94]
[Figure: noisy labeled examples (+ / −) and a classifier h]
Example: (x, ℓ) ∼ P
P: distribution over X × {−1, +1}
Opt_P(C) = min_{f∈C} Pr_{(x,ℓ)∼P}[f(x) ≠ ℓ]
Agnostic learning of a class of functions C with excess error ε: for every P, output w.h.p. h such that
Pr_{(x,ℓ)∼P}[h(x) ≠ ℓ] ≤ Opt_P(C) + ε
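To make the definition concrete, here is a minimal sketch (in Python) of estimating excess error from a sample; the finite class C passed as a list of classifiers is an illustrative assumption, not part of the slide.

```python
# A minimal sketch of estimating excess error from a sample; the finite,
# hypothetical class C (a list of classifiers) is an illustrative assumption.
import numpy as np

def error(h, sample):
    """Empirical estimate of Pr_{(x,l)~P}[h(x) != l]."""
    return float(np.mean([h(x) != l for (x, l) in sample]))

def excess_error(h, C, sample):
    """Empirical error of h minus the empirical error of the best f in C."""
    return error(h, sample) - min(error(f, sample) for f in C)
```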
Complexity of agnostic learning
• Too hard
– Conjunctions over {0,1}^n: n^{O(√n)} for ε = Ω(1) [KKMS 05]
• … but can’t prove lower bounds
Complexity of agnostic learning
Distribution-specific learning over distribution D:
the marginal of P on X equals a fixed D
• For the uniform distribution U over {0,1}^n
– Conjunctions: n^{O(log 1/ε)}
– Halfspaces: n^{O(1/ε^4)} [KKMS 05]; n^{O(1/ε^2)} [FKV 14]
– Monotone juntas?
Our characterization
• ‖g‖_1 = E_{x∼U}[|g(x)|]
• Pol_d: polynomials of degree d over n vars
• Δ(f, Pol_d) = min_{p∈Pol_d} ‖f − p‖_1
THM: For degree d let δ = max_{f∈C} Δ(f, Pol_d) / 2.
Any SQ algorithm for agnostically learning C over U with excess error < δ needs n^{Ω(d)} time
* C is closed under renaming of variables; d ≤ n^{1/3}
Polynomial L_1 regression [KKMS 05]
For degree d let δ = max_{f∈C} Δ(f, Pol_d) / 2.
There exists an SQ algorithm for learning C over U with excess error δ + ε in time poly(n^d / ε)
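A minimal sketch of the L_1 regression step behind this algorithm: fit a degree-d polynomial minimizing empirical L_1 error with a linear program, then threshold. The parity feature basis, the use of scipy's LP solver, and the fixed threshold at 0 are simplifying assumptions, not details from the slide.

```python
# A minimal sketch of degree-d L1 polynomial regression over {0,1}^n in the
# spirit of [KKMS 05]: fit a degree-d polynomial minimizing empirical L1 error
# with a linear program, then threshold. The parity feature basis and the fixed
# threshold at 0 are simplifying assumptions.
from itertools import combinations
import numpy as np
from scipy.optimize import linprog

def parity_features(X, d):
    """Columns chi_S(x) = prod_{i in S} (-1)^{x_i} for all sets S with |S| <= d."""
    signs = 1 - 2 * np.asarray(X)                      # map {0,1} -> {+1,-1}
    cols = [np.ones(len(signs))]
    for k in range(1, d + 1):
        for S in combinations(range(signs.shape[1]), k):
            cols.append(np.prod(signs[:, list(S)], axis=1))
    return np.column_stack(cols)

def l1_regression_classifier(X, labels, d):
    """Return h(x) = sign(p(x)), where p minimizes sum_i |p(x_i) - label_i|."""
    labels = np.asarray(labels, dtype=float)
    A = parity_features(X, d)
    m, num_coeffs = A.shape
    # LP variables: polynomial coefficients c, then slacks s with |A c - labels| <= s.
    objective = np.concatenate([np.zeros(num_coeffs), np.ones(m)])
    A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
    b_ub = np.concatenate([labels, -labels])
    bounds = [(None, None)] * num_coeffs + [(0, None)] * m
    coeffs = linprog(objective, A_ub=A_ub, b_ub=b_ub, bounds=bounds).x[:num_coeffs]
    return lambda X_new: np.where(parity_features(X_new, d) @ coeffs >= 0, 1, -1)
```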
Statistical queries [Kearns 93]
[Figure: SQ learning algorithm sends queries φ_1, φ_2, … to an SQ oracle for P and receives answers v_1, v_2, …]
φ_i : X × {−1,1} → [−1,1],  |v_i − E_P[φ_i(x, ℓ)]| ≤ τ
τ is the tolerance of the query
Complexity T:
• at most T queries
• each of tolerance at least 1/T
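A minimal sketch of the oracle interface, assuming the oracle is simulated from i.i.d. examples of P; the random ±τ perturbation just illustrates that any τ-accurate answer is acceptable.

```python
# A minimal sketch of an SQ oracle simulated from examples of P, assuming the
# query phi maps (x, label) into [-1, 1]. Any answer within tau of the true
# expectation is valid; uniform noise stands in for that slack here.
import numpy as np

def sq_oracle(examples, phi, tau, rng=np.random.default_rng(0)):
    """Return a value v with |v - E_P[phi(x, l)]| <= tau (up to sampling error)."""
    estimate = np.mean([phi(x, l) for (x, l) in examples])
    return float(np.clip(estimate + rng.uniform(-tau, tau), -1.0, 1.0))
```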
SQ algorithms
• PAC/agnostic learning algorithms (except Gaussian elimination)
• Convex optimization (Ellipsoid, iterative methods)
• Expectation maximization (EM)
• SVM (with kernel)
• PCA
• ICA
• ID3
• k-means
• method of moments
• MCMC
• Naïve Bayes
• Neural Networks (backprop)
• Perceptron
• Nearest neighbors
• Boosting
[K 93, BDMN 05, CKLYBNO 06, FPV 14]
Roadmap
• Proof: approximate resilience
• Tools: analysis of approximate resilience
• Application: monotone functions
• Open problems
Approximate resilience
g is d-resilient if for any degree-d polynomial p:
E_U[g(x) p(x)] = 0
All Fourier coefficients of degree at most d are 0
f is α-approximately d-resilient if there exists a d-resilient g: {0,1}^n → [−1,1] such that ‖f − g‖_1 ≤ α
THM: a Boolean f is α-approximately d-resilient if and only if Δ(f, Pol_d) ≥ 1 − α
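A brute-force sketch of the resilience definition for small n (the function g is assumed to be given as a Python callable on bit tuples):

```python
# A brute-force check of d-resilience on {0,1}^n: every Fourier coefficient of
# degree at most d must vanish. Exponential in n; for small illustrative n only.
from itertools import combinations, product
import numpy as np

def fourier_coefficient(g, n, S):
    """hat{g}(S) = E_{x~U}[ g(x) * prod_{i in S} (-1)^{x_i} ]."""
    total = 0.0
    for x in product([0, 1], repeat=n):
        chi = np.prod([(-1) ** x[i] for i in S]) if S else 1.0
        total += g(x) * chi
    return total / 2 ** n

def is_d_resilient(g, n, d, tol=1e-9):
    """True iff hat{g}(S) = 0 for every S with |S| <= d (degree-0 included)."""
    return all(abs(fourier_coefficient(g, n, S)) <= tol
               for k in range(d + 1)
               for S in combinations(range(n), k))
```

For example, parity on 3 bits, g = lambda x: (-1) ** sum(x), passes is_d_resilient(g, 3, 2) but fails is_d_resilient(g, 3, 3).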
From approx. resilience to SQ hardness
C closed under variable renaming ⇒
there exist m = n^{Ω(d)} functions f_1, …, f_m such that the corresponding g_1, …, g_m are uncorrelated
‖f − g‖_1 ≤ α ⇒ there is a P_g such that Pr_{(x,ℓ)∼P_g}[f(x) ≠ ℓ] ≤ α/2
Complexity of any SQ algorithm with error ≤ 1/2 − 1/m^{1/3} is at least m^{1/3} [BFJKMR 94; F 09]
Excess error: (1/2 − 1/m^{1/3}) − α/2 = Δ(f, Pol_d)/2 − 1/m^{1/3}
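One standard way to realize such a P_g (an assumption here; the slide does not spell out the construction): draw x uniformly and set the label to +1 with probability (1 + g(x))/2, so that E[ℓ | x] = g(x).

```python
# A minimal sketch of one standard realization of P_g (an assumption, not taken
# from the slide): x is uniform over {0,1}^n and the label is +1 with
# probability (1 + g(x)) / 2, so that E[label | x] = g(x). For Boolean f with
# ||f - g||_1 <= alpha this gives Pr[f(x) != label] <= alpha / 2.
import numpy as np

def sample_P_g(g, n, rng=np.random.default_rng(0)):
    """Draw one labeled example (x, label) from P_g."""
    x = tuple(int(b) for b in rng.integers(0, 2, size=n))
    label = 1 if rng.random() < (1 + g(x)) / 2 else -1
    return x, label
```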
Bounds on approximate resilience
1. Convert bounds on L_2 distance to an unbounded d-resilient function into bounds on L_1 distance to a [−1,1]-bounded d-resilient function
2. Amplify resilience degree via composition:
If f_1 is α_1-approximately d_1-resilient,
f_2 is α_2-approximately d_2-resilient, and
f_1 has low noise sensitivity,
then F(x_1, x_2, …, x_k) = f_1(f_2(x_1), …, f_2(x_k)) is α_3-approximately (d_1 d_2)-resilient
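The composition itself is simple; a minimal sketch, assuming f_1 takes k values in {−1,+1} and f_2 maps one block of input bits to {−1,+1}:

```python
# A minimal sketch of the degree-amplifying composition
# F(x_1, ..., x_k) = f1(f2(x_1), ..., f2(x_k)), assuming f1 maps k values in
# {-1,+1} to {-1,+1} and f2 maps one block of input bits to {-1,+1}.
def compose(f1, f2, blocks):
    """Apply f2 to each block x_i, then f1 to the resulting k values."""
    return f1([f2(block) for block in blocks])
```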
Monotone functions of k variables
Known results:
• n^{O(√k / ε^2)} [BT 96]
• n^{Ω(1/ε^2)} [KKMS 05]
THM: SQ complexity n^{Ω(√k)} for ε = 1/2 − o(1)
Show existence of an o(1)-approximately Ω(√k)-resilient mon. func.
• Lower bounds on Talagrand’s DNF [BBL 98]
Explicit constructions
• o(1)-approximately 2^{√log k}-resilient mon. func.
TRIBES (disjunction of k/log k conjunctions of log k variables) + amplification
• 2^{log k / loglog k}-resilient Boolean function o(1)-close to a mon. Boolean func.
CycleRun function [Wieder 12] + amplification
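For concreteness, a minimal sketch of TRIBES with outputs in {−1,+1}; the exact rounding of the block width is an assumption.

```python
# A minimal sketch of TRIBES on k bits: an OR of ~k/log k ANDs, each over
# ~log k consecutive bits. The exact rounding of the block width is an assumption.
import math

def tribes(x):
    """Return +1 if some block of ~log k consecutive bits is all ones, else -1."""
    k = len(x)
    width = max(1, round(math.log2(k)))                      # ~ log k
    blocks = [x[i:i + width] for i in range(0, k, width)]    # ~ k / log k blocks
    return 1 if any(all(bit == 1 for bit in block) for block in blocks) else -1
```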
Conclusions and open problems
• Characterization can be extended to general product distributions over more general domains
• Does not hold for non-product distributions [FK14]
• Can the lower bound be obtained via a reduction to a concrete problem? E.g. learning noisy parities
• Other techniques for proving bounds on approximate resilience (or L_1 approximation by polynomials)
• Complexity of distribution-independent agnostic SQ learning?