Agnostically Learning Decision Trees
Parikshit Gopalan MSR-Silicon Valley, IITB’00.
Adam Tauman Kalai MSR-New England
Adam R. Klivans UT Austin
[Figure: a decision tree over variables X1, X2, X3 with 0/1 leaves]
Computational Learning
f: {0,1}^n → {0,1}
Examples: (x, f(x))
Learning: Predict f from examples.
Valiant’s Model
f: {0,1}^n → {0,1}
Halfspaces:
[Figure: labeled examples (x, f(x)) separated into + and − regions by a halfspace]
Assumption: f comes from a nice concept class.
Valiant’s Model
f: {0,1}^n → {0,1}
Decision Trees:
[Figure: labeled examples (x, f(x)) and a decision tree over X1, X2, X3 with 0/1 leaves]
Assumption: f comes from a nice concept class.
The Agnostic Model
[Kearns-Schapire-Sellie’94]
f: {0,1}^n → {0,1}
Decision Trees:
[Figure: labeled examples (x, f(x)) and a decision tree over X1, X2, X3 with 0/1 leaves]
No assumptions about f.
Learner should do as well as best decision tree.
Agnostic Model = Noisy Learning
[Figure: a decision tree plus noise equals the observed function f]
Concept: Message
Truth table: Encoding
Function f: Received word.
Coding: Recover the Message.
Learning: Predict f.
f: {0,1}^n → {0,1}
Uniform Distribution Learning
for Decision Trees
Noiseless Setting:
– No queries: n^{log n} time [Ehrenfeucht-Haussler’89].
– With queries: poly(n). [Kushilevitz-Mansour’91]
Agnostic Setting:
Polynomial time, uses queries. [G.-Kalai-Klivans’08]
Reconstruction for sparse real polynomials in the l1 norm.
The Fourier Transform Method
Powerful tool for uniform distribution learning.
Introduced by Linial-Mansour-Nisan.
– Small-depth circuits [Linial-Mansour-Nisan’89]
– DNFs [Jackson’95]
– Decision trees [Kushilevitz-Mansour’91, O’Donnell-Servedio’06, G.-Kalai-Klivans’08]
– Halfspaces, intersections [Klivans-O’Donnell-Servedio’03, Kalai-Klivans-Mansour-Servedio’05]
– Juntas [Mossel-O’Donnell-Servedio’03]
– Parities [Feldman-G.-Khot-Ponnuswami’06]
The Fourier Polynomial
Let f: {-1,1}^n → {-1,1}.
Write f as a polynomial:
– AND: ½ + ½X1 + ½X2 − ½X1X2
– Parity: X1X2
Parity of S ⊆ [n]: χ_S(x) = ∏_{i∈S} Xi
Write f(x) = Σ_{S⊆[n]} c(S) χ_S(x)
– Σ_S c(S)² = 1.
c(S)²: the weight of S.
[Figure: the truth table of f (standard basis) re-expressed in the basis of parities]
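To make the expansion concrete, here is a small self-contained sketch (illustrative, not from the talk) that brute-forces the coefficients c(S) = E_x[f(x)·χ_S(x)] for tiny n and reproduces the AND example above.

```python
# Sketch: brute-force Fourier coefficients of f: {-1,1}^n -> {-1,1},
# using c(S) = E_x[f(x) * chi_S(x)] over uniformly random x.
from itertools import product, combinations
from math import prod

def fourier_coefficients(f, n):
    points = list(product([-1, 1], repeat=n))
    coeffs = {}
    for k in range(n + 1):
        for S in combinations(range(n), k):
            coeffs[S] = sum(f(x) * prod(x[i] for i in S) for x in points) / len(points)
    return coeffs

# AND under the convention -1 = TRUE, +1 = FALSE:
AND = lambda x: -1 if (x[0] == -1 and x[1] == -1) else 1
print(fourier_coefficients(AND, 2))
# -> {(): 0.5, (0,): 0.5, (1,): 0.5, (0, 1): -0.5}, i.e. 1/2 + 1/2*X1 + 1/2*X2 - 1/2*X1X2
```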
Low Degree Functions
Low-degree functions: most of the weight lies on small subsets.
Examples: halfspaces, small-depth circuits.
Low-degree algorithm [Linial-Mansour-Nisan]: finds the low-degree Fourier coefficients.
Least-squares regression: find a low-degree P minimizing E_x[|P(x) − f(x)|²].
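A minimal sketch of the low-degree approach, assuming i.i.d. uniform examples (x, f(x)) and a degree bound d: estimate each low-degree coefficient by its empirical average, then predict with the sign of the resulting polynomial.

```python
# Sketch of the low-degree algorithm (assumes uniform random examples and a degree bound d):
# estimate c(S) by the sample average of y * chi_S(x), then predict with sign of P.
from itertools import combinations
from math import prod

def low_degree_fit(samples, n, d):
    """samples: list of (x, y) with x a tuple in {-1,1}^n and y in {-1,1}."""
    est = {}
    for k in range(d + 1):
        for S in combinations(range(n), k):
            est[S] = sum(y * prod(x[i] for i in S) for x, y in samples) / len(samples)
    return est

def predict(est, x):
    value = sum(c * prod(x[i] for i in S) for S, c in est.items())
    return 1 if value >= 0 else -1
```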
Sparse Functions
Sparse functions: most of the weight lies on a few subsets.
Decision trees: t leaves ⇒ O(t) subsets.
Sparse algorithm [Kushilevitz-Mansour’91].
Sparse l2 Regression: Find a t-sparse P minimizing E_x[|P(x) − f(x)|²].
Finding the large coefficients: Hadamard decoding [Kushilevitz-Mansour’91, Goldreich-Levin’89].
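For intuition only (this is not the KM algorithm): by Parseval, the best t-sparse l2 approximation simply keeps the t largest-magnitude Fourier coefficients, which KM finds efficiently using membership queries. A brute-force stand-in on small n:

```python
# Brute-force stand-in for sparse l2 regression on tiny n: keep the t
# largest-magnitude coefficients (optimal by Parseval); KM does this efficiently.
from itertools import product, combinations
from math import prod

def top_t_coefficients(f, n, t):
    pts = list(product([-1, 1], repeat=n))
    coeffs = {}
    for k in range(n + 1):
        for S in combinations(range(n), k):
            coeffs[S] = sum(f(x) * prod(x[i] for i in S) for x in pts) / len(pts)
    largest = sorted(coeffs.items(), key=lambda kv: abs(kv[1]), reverse=True)[:t]
    return dict(largest)
```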
Agnostic Learning via l2 Regression?
f: {-1,1}^n → {-1,1}
[Figure: the ±1-valued target f and the best decision tree for it]
Agnostic Learning via l2 Regression?
[Figure: target f vs. the best tree, values in {−1,+1}]
l2 Regression: loss |P(x) − f(x)|². Pay 1 for indecision (P(x) = 0), pay 4 for a mistake.
l1 Regression [KKMS’05]: loss |P(x) − f(x)|. Pay 1 for indecision, pay 2 for a mistake.
Agnostic Learning via l1 Regression
[Figure: target f vs. the best tree]
Thm [KKMS’05]: l1 Regression always gives a
good predictor.
l1 regression for low degree polynomials via
Linear Programming.
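A sketch of that linear program, assuming a fixed feature matrix Phi whose columns are low-degree parities evaluated on the sample (the names and the scipy solver are illustrative choices, not the talk's implementation): minimize the average absolute error by introducing one slack variable per example.

```python
# Sketch of l1 regression via an LP. Phi: m x B matrix of parity features
# chi_S(x_i); y: the +/-1 labels. Minimizes mean |Phi c - y| over coefficients c.
import numpy as np
from scipy.optimize import linprog

def l1_regression_lp(Phi, y):
    m, B = Phi.shape
    # Variables: B free coefficients c, then m slacks e with e_i >= |(Phi c)_i - y_i|.
    c_obj = np.concatenate([np.zeros(B), np.ones(m) / m])        # minimize mean slack
    A_ub = np.block([[Phi, -np.eye(m)], [-Phi, -np.eye(m)]])     # Phi c - y <= e, y - Phi c <= e
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * B + [(0, None)] * m
    res = linprog(c_obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:B]   # fitted Fourier coefficients
```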
Agnostically Learning Decision Trees
Sparse l1 Regression: Find a t-sparse polynomial P minimizing E_x[|P(x) − f(x)|].
Why is this Harder:
l2 is basis independent, l1 is not.
Don’t know the support of P.
[G.-Kalai-Klivans]: Polynomial time algorithm for
Sparse l1 Regression.
The Gradient-Projection Method
P(x) = Σ_S c(S) χ_S(x),  Q(x) = Σ_S d(S) χ_S(x)
L1(P, Q) = Σ_S |c(S) − d(S)|
L2(P, Q) = [Σ_S (c(S) − d(S))²]^{1/2}
Variables: the c(S)’s.
Constraint: Σ_S |c(S)| ≤ t.
Minimize: E_x|P(x) − f(x)|.
The Gradient-Projection Method
Alternate gradient steps on E_x|P(x) − f(x)| with projections back onto the L1 ball Σ_S |c(S)| ≤ t.
The Gradient
[Figure: the ±1-valued f(x) and the current real-valued P(x)]
Increase P(x) where it is too low, decrease it where it is too high:
g(x) = sgn[f(x) − P(x)]
P(x) := P(x) + η·g(x)  (step size η)
Projection onto the L1 ball
[Figure: bar chart of the Fourier spectrum of P, before and after projection]
Currently Σ_S |c(S)| > t; we want Σ_S |c(S)| ≤ t.
Below the cutoff: set the coefficient to 0. Above the cutoff: subtract the cutoff.
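The cutoff rule above matches the standard Euclidean projection onto an L1 ball (soft thresholding with a data-dependent cutoff). A minimal numpy sketch, assuming the spectrum is handed over as a dense coefficient vector; the talk's algorithm applies the same idea to the sparse spectrum it maintains:

```python
import numpy as np

def project_onto_l1_ball(c, t):
    """Project coefficient vector c onto {v : sum(|v|) <= t}: coefficients below
    the cutoff are zeroed, the rest shrink toward 0 by the cutoff amount."""
    if np.abs(c).sum() <= t:
        return c.copy()
    u = np.sort(np.abs(c))[::-1]                 # magnitudes, descending
    cssv = np.cumsum(u) - t
    idx = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * idx > cssv)[0][-1]      # last index kept above the cutoff
    theta = cssv[rho] / (rho + 1)                # the cutoff
    return np.sign(c) * np.maximum(np.abs(c) - theta, 0)
```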
Analysis of Gradient-Projection
[Zinkevich’03]
Progress measure: Squared L2 distance
from optimum P*.
Key Equation:
|P_t − P*|² − |P_{t+1} − P*|² ≥ 2η(L(P_t) − L(P*)) − η²
Within ε of optimal in 1/ε² iterations.
The left-hand side is the progress made in this step; L(P_t) − L(P*) is how suboptimal the current solution is.
A good L2 approximation to P_t suffices.
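Putting the pieces together, a compact sketch of the projected-subgradient loop over an explicit feature matrix. This is a small-scale, exact-gradient illustration only: the function names, step size, and iteration count are assumptions, and the talk's algorithm replaces the exact gradient with the KM approximation.

```python
import numpy as np

def project_l1(c, t):
    # Cutoff projection onto the L1 ball, as in the earlier sketch.
    if np.abs(c).sum() <= t:
        return c
    u = np.sort(np.abs(c))[::-1]
    cssv = np.cumsum(u) - t
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > cssv)[0][-1]
    return np.sign(c) * np.maximum(np.abs(c) - cssv[rho] / (rho + 1), 0)

def sparse_l1_regression(Phi, y, t, eta=0.05, iters=2000):
    """Minimize mean |Phi c - y| over ||c||_1 <= t by projected subgradient descent.
    Phi: m x B matrix of parity features; y: labels in {-1,+1}."""
    m, B = Phi.shape
    c = np.zeros(B)
    best_c, best_loss = c.copy(), np.inf
    for _ in range(iters):
        g = np.sign(y - Phi @ c)                       # g(x) = sgn[f(x) - P(x)]
        c = project_l1(c + eta * (Phi.T @ g) / m, t)   # gradient step, then project
        loss = np.abs(Phi @ c - y).mean()
        if loss < best_loss:
            best_loss, best_c = loss, c.copy()
    return best_c, best_loss
```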
Gradient: g(x) = sgn[f(x) − P(x)].
Projection: [Figure: the Fourier spectrum of P projected onto the L1 ball]
The Gradient
g(x) = sgn[f(x) − P(x)]
Compute a sparse approximation g' = KM(g).
Is g' a good L2 approximation to g?
No: initially g = f, and L2(g, g') can be as large as 1.
Sparse l1 Regression
Approximate gradient: g' = KM(g). The projection compensates for the approximation error.
Variables: the c(S)’s.
Constraint: Σ_S |c(S)| ≤ t.
Minimize: E_x|P(x) − f(x)|.
KM as l2 Approximation
The KM Algorithm:
Input: g: {-1,1}^n → {-1,1}, and sparsity t.
Output: A t-sparse polynomial g' minimizing E_x[|g(x) − g'(x)|²].
Run Time: poly(n,t).
KM as L1 Approximation
The KM Algorithm:
Input: A Boolean function g = Σ_S c(S) χ_S(x), and an error bound ε.
Output: An approximation g' = Σ_S c'(S) χ_S(x) such that |c(S) − c'(S)| ≤ ε for all S ⊆ [n].
Run Time: poly(n, 1/ε)
KM as L1 Approximation
[Figure: bar charts of the Fourier spectra of g and g' = KM(g)]
1) Identify the coefficients larger than ε (only 1/ε² of them).
2) Estimate them via sampling; set the rest to 0.
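Step 2 is a standard sampling estimate. A small sketch, assuming query access to g; the sample size 2·ln(2/δ)/ε² is the usual Hoeffding bound, not a constant taken from the talk:

```python
import math
import random

def estimate_coefficient(g, S, n, eps, delta=0.01):
    """Estimate c(S) = E_x[g(x) * chi_S(x)] to within eps (w.p. 1 - delta),
    by averaging g(x)*chi_S(x) over random queries x."""
    m = math.ceil(2 * math.log(2 / delta) / eps ** 2)   # Hoeffding sample size
    total = 0.0
    for _ in range(m):
        x = [random.choice([-1, 1]) for _ in range(n)]
        chi = 1
        for i in S:
            chi *= x[i]
        total += g(x) * chi
    return total / m
```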
Projection Preserves L1 Distance
[Figure: bar charts of the Fourier spectra of P + g and P + g', with their projection cutoffs]
L1 distance at most 2ε after projection.
Both cutoff lines stop within ε of each other; otherwise, one spectrum would dominate the other.
Projecting onto the L1 ball does not increase L1 distance.
Sparse l1 Regression
• L∞(P, P') ≤ 2ε
• L1(P, P') ≤ 2t
• L2(P, P')² ≤ 4εt
Variables: the c(S)’s.
Constraint: Σ_S |c(S)| ≤ t.
Minimize: E_x|P(x) − f(x)|.
Can take ε = 1/t².
Agnostically Learning Decision Trees
Sparse L1 Regression: Find a sparse polynomial P minimizing E_x[|P(x) − f(x)|].
[G.-Kalai-Klivans’08]: Can get within ε of optimum in poly(t, 1/ε) iterations.
Algorithm for Sparse l1 Regression.
First polynomial time algorithm for
Agnostically Learning Sparse Polynomials.
l1 Regression from l2 Regression
Function f: D → [-1,1], orthonormal basis B.
Sparse l2 Regression: Find a t-sparse polynomial P minimizing E_x[|P(x) − f(x)|²].
Sparse l1 Regression: Find a t-sparse polynomial P minimizing E_x[|P(x) − f(x)|].
[G.-Kalai-Klivans’08]: Given solution to l2
Regression, can solve l1 Regression.
Agnostically Learning DNFs?
Problem: Can we agnostically learn DNFs in
polynomial time? (uniform dist. with queries)
Noiseless Setting: Jackson’s Harmonic Sieve.
Implies weak learner for depth-3 circuits.
Beyond current Fourier techniques.