Rigorous Learning Curve Bounds from Statistical Mechanics
D. Haussler, M. Kearns,
H. S. Seung, N. Tishby
Presentation: Talya Meltzer
Motivation
According to VC theory, minimizing the empirical error within a function class F on a
random sample leads to generalization error bounds (d is the VC-dimension, m is the sample size):
Realizable case: εgen ~ Õ(d/m)
Unrealizable case: εgen − εmin ~ Õ(√(d/m))
The VC-bounds are the best distribution-independent upper bounds
Motivation
Yet, these bounds are vacuous for m<d
And fail to capture the true behavior of
particular learning curves:
Experimental learning curves fit a variety of
functional forms, including exponentials
Curves analyzed using statistical mechanics
methods exhibit phase transitions
(sudden drops in the generalization error)
Main Ideas
Decompose the hypothesis class into
error shells
Attribute to each hypothesis its correct
generalization error, taking the specific
distribution into account
Use the thermodynamic limit method
Identify the correct scale at which to analyze a
learning curve
Express the learning curve as a competition
between an entropy function and an energy
function
Overview: The PAC Learning Model
The hypothesis class: F ⊆ { f : X → Y }
Input: a training set S = {(xi, yi)}_{i=1}^m, where xi ∈ X (the input space) and yi ∈ Y (the label set)
Assumptions:
The examples in the training set S are sampled i.i.d
according to a distribution D over X
D is unknown
D is fixed throughout the learning process
There exists a target function f*:X→Y, i.e. yi = f*(xi)
Goal:
find the target function
Overview: The PAC Learning Model
Training (empirical) error: εtrain(h) = (1/m) Σ_{i=1}^m 1[h(xi) ≠ f*(xi)]
Generalization error: εgen(h) = probD[h(x) ≠ f*(x)]
The class F is PAC-learnable if there exists a
learning algorithm which, given ε and δ, returns h∈F such
that:
The training error of h is minimal
prob[εgen(h) ≤ ε] ≥ 1−δ
The Finite & Realizable Case
The version space: VS(S) = { h ∈ F : ∀x ∈ S, h(x) = f*(x) }
The ε-ball: B(ε) = { h ∈ F : εgen(h) ≤ ε }
If B(ε) includes VS(S), then any function in the
version space has generalization error ≤ ε
probS[ ∀h ∈ VS(S): εgen(h) ≤ ε ] ≥ probS[ VS(S) ⊆ B(ε) ]
The Finite & Realizable Case
probS[VS(S) ⊆ B(ε)] ≥ 1 − Σ_{h∉B(ε)} probS[h ∈ VS(S)] = 1 − Σ_{h∉B(ε)} (1 − εgen(h))^m
The standard cardinality bound: εgen(h) > ε for every h outside of B(ε), and there are at most |F| such functions, so
probS[VS(S) ⊆ B(ε)] ≥ 1 − |F|(1 − ε)^m
Hence, with probability ≥ 1−δ, any h consistent with the sample satisfies
εgen(h) ≤ (1/m) ln(|F|/δ)
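As an illustration, here is a minimal Python sketch of this cardinality bound; the class size, δ and sample sizes below are arbitrary illustrative values, not taken from the paper:

```python
import math

def cardinality_bound(class_size, m, delta):
    """Realizable-case cardinality bound: with probability >= 1 - delta,
    every hypothesis consistent with the m examples has generalization
    error at most ln(|F| / delta) / m."""
    return math.log(class_size / delta) / m

# Illustrative values only: |F| = 2**20, delta = 0.05
for m in (50, 200, 1000):
    print(m, cardinality_bound(2 ** 20, m, 0.05))
```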
Decomposition into error shells
In a finite class, there is a finite number
of possible error values:
0 = ε1 < ε2 < … < εr ≤ 1, with r ≤ |F| < ∞
[Figure: in the realizable case the error shells form concentric rings around the target function; the ε-ball B(ε) contains the inner shells εj ≤ ε, while hypotheses with εgen(h) > ε lie outside it.]
Decomposition into error shells
So we can replace the union bound with the exact expression
probS[VS(S) ⊆ B(εi)] ≥ 1 − Σ_{h∉B(εi)} (1 − εgen(h))^m = 1 − Σ_{j=i+1}^{r} Qj (1 − εj)^m
where Qj is the cardinality of the j-th error shell.
Now, with probability at least 1−δ, any h consistent
with the sample obeys:
εgen(h) ≤ min{ εi : Σ_{j=i+1}^{r} Qj (1 − εj)^m ≤ δ }
To understand the behavior of this bound, we
will use the thermodynamic limit method
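A small Python sketch of this shell-decomposition bound; the shell errors and cardinalities below are made-up illustrative numbers, since the point here is only the form of the computation:

```python
def shell_bound(shells, m, delta):
    """Shell bound: the smallest eps_i whose tail sum over the outer shells,
    sum_{j > i} Q_j * (1 - eps_j)**m, is at most delta.
    `shells` is a list of (eps_j, Q_j) pairs sorted by increasing eps_j;
    the shell at eps = 0 contains the target (and any other zero-error
    functions)."""
    for i, (eps_i, _) in enumerate(shells):
        tail = sum(Q * (1.0 - e) ** m for e, Q in shells[i + 1:])
        if tail <= delta:
            return eps_i
    return 1.0

# Hypothetical shell structure, for illustration only
shells = [(0.0, 1), (0.05, 10), (0.1, 100), (0.2, 10_000), (0.4, 1_000_000)]
print(shell_bound(shells, m=100, delta=0.05))
```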
The Thermodynamic Limit
We consider an infinite sequence of classes
of functions: F1,F2,…,FN,…
FN ⊆ { f : XN → {0,1} }, with N ≥ log2(|FN|)
We are often interested in a parametric class
of functions
The number of functions in the class at any
given error value may have a limiting
asymptotic behavior, as the number of
parameters grows
The Thermodynamic Limit
Rewrite the expression:
Σ_{j=i+1}^{r(N)} Qj^N (1 − εj^N)^m = Σ_{j=i+1}^{r(N)} exp[ ln Qj^N + m ln(1 − εj^N) ]
ln Qj^N — the entropy of the j-th error shell (POSITIVE)
m ln(1 − εj^N) — the minus energy of the j-th error shell (NEGATIVE)
Identify the scaling function t(N): when chosen properly,
it captures the scale at which the learning curve is most interesting
Find a permissible entropy bound s(ε): it should tightly capture the
behavior of (log Qj^N) / t(N)
The Thermodynamic Limit
Formal definitions:
t(N): a mapping from the natural numbers to the
natural numbers, such that t(N) ≤ N
s(ε): a continuous function defined on [0,1]
s(ε) is called a permissible entropy bound if
there exists a natural number N0 such that for all
N ≥ N0 and for all 1 ≤ j ≤ r(N):
(log Qj^N) / t(N) ≤ s(εj^N)
The Thermodynamic Limit
Σ_{j=i+1}^{r(N)} exp[ log Qj^N + m ln(1 − εj^N) ]
= Σ_{j=i+1}^{r(N)} exp[ t(N) ( (log Qj^N)/t(N) + (m/t(N)) ln(1 − εj^N) ) ]
≤ Σ_{j=i+1}^{r(N)} exp[ t(N) ( s(εj^N) + α ln(1 − εj^N) ) ]
α=m/t(N) remains const, as m,N→∞
α controls the competition between the entropy
and the energy
The Thermodynamic Limit
In order to describe infinite systems:
We describe a system in finite size, then let the
size grow to infinity
We normalize extensive variables by the volume: the number of
particles N, the entropy S, and the energy U are replaced by
ρ = N/V (particles per unit volume), s = S/V (entropy per unit volume), u = U/V (energy per unit volume)
We keep the density fixed: ρ = N/V = const, as
N,V → ∞
The Thermodynamic Limit
The Learning System vs. The Thermodynamic System
no. of examples m  ↔  no. of particles N
scaling function t(N) → ∞ as N → ∞ (the class size is |F| ≤ 2^N)  ↔  volume V
α = m/t(N) = const, as m,N → ∞  ↔  density ρ = N/V = const, as N,V → ∞
energy u(ε) = −ln(1−ε)  ↔  energy E
cardinality of the error shell at ε: Q(ε,N)  ↔  no. of states with energy E: Ω(E,V)
entropy S(ε,N) = log Q(ε,N), with permissible entropy bound s(ε) ≥ S(ε,N)/t(N) for N ≥ N0  ↔  entropy S(E,V) = ln Ω(E,V), with entropy per unit volume s(E,V) = S(E,V)/V
The Thermodynamic Limit
Benefits: N is isolated in the factor t(N), and the
remaining factor is the continuous function
s(ε) + α ln(1−ε), evaluated at the points εj^N
Define ε* ∈ [0,1] as the largest ε ∈ [0,1] such that
s(ε) ≥ −α ln(1−ε)
In the thermodynamic limit, under certain conditions,
we can bound the generalization error of any
consistent hypothesis by ε*
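A grid-search sketch of this definition: given a permissible entropy bound s and a value of α, it returns (approximately) the largest ε at which the entropy still dominates the energy. The single-peak entropy bound used in the example (the natural-log binary entropy) is purely an illustrative choice:

```python
import numpy as np

def epsilon_star(s, alpha, grid=None):
    """Largest eps in (0, 1) with s(eps) >= -alpha * ln(1 - eps),
    approximated on a grid.  Returns 0.0 when the energy term dominates
    everywhere, i.e. the bound predicts perfect learning."""
    if grid is None:
        grid = np.linspace(1e-6, 1.0 - 1e-6, 100_000)
    entropy = s(grid)
    energy = -alpha * np.log(1.0 - grid)
    above = grid[entropy >= energy]
    return float(above.max()) if above.size else 0.0

# Illustrative single-peak entropy bound: the natural-log binary entropy.
H = lambda e: -e * np.log(e) - (1.0 - e) * np.log(1.0 - e)
print(epsilon_star(H, alpha=2.0))
```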
The Thermodynamic Limit
We will see that for ε>ε* the thermodynamic limit
of the sum is 0.
Let 0 < τ ≤ 1 be an arbitrarily small quantity, and define:
i(N,τ) = min{ i : εi^N ≥ ε* + τ }
For every ε ∈ [ε*+τ, 1] we have s(ε) < −α ln(1−ε), so
Δ = min{ −α ln(1−ε) − s(ε) : ε ∈ [ε*+τ, 1] } > 0
Summing only over the shells outside ε*+τ:
Σ_{j≥i(N,τ)}^{r(N)} exp[ t(N)( s(εj^N) + α ln(1−εj^N) ) ] ≤ Σ_{j≥i(N,τ)}^{r(N)} exp[ −t(N)Δ ]
The Thermodynamic Limit
Σ_{j≥i(N,τ)}^{r(N)} exp[−t(N)Δ] ≤ (r(N) − i(N,τ)) exp[−t(N)Δ] ≤ r(N) exp[−t(N)Δ] → 0 as N → ∞
The limit will indeed be zero, provided that r(N) = o(exp[t(N)Δ])
Theorem: let s(ε) be any continuous function that is a
permissible entropy bound with respect to the scaling
function t(N), and suppose that r(N)=o(exp[t(N)Δ]) for
any positive constant Δ. Then as m,N→∞ but α=m/t(N)
remains constant, for any positive τ we have:
lim_{N→∞} probS[ VS(S) ⊆ B(ε*+τ) ] = 1
The Thermodynamic Limit
Summary:
ε* is the rightmost crossing point of s(ε) and
-αln(1-ε)
In the thermodynamic limit, any hypothesis h
consistent with m = αt(N) examples will have
εgen(h) ≤ ε* + τ (with probability approaching 1).
Scaled Learning Curves
Extracting scaled learning curves:
Let the value of α vary
Apply the thermodynamic limit method to each
value
Plot the generalization error bound as a
function of α instead of m (hence “scaled”)
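A sketch of this procedure: sweep α, compute the crossing point ε*(α) for each value, and plot the resulting scaled learning curve; the entropy bound used here is again only an illustrative choice (the natural-log binary entropy), not one taken from the paper:

```python
import numpy as np
import matplotlib.pyplot as plt

def epsilon_star(s, alpha):
    # Rightmost grid point at which the entropy still dominates the energy.
    grid = np.linspace(1e-6, 1.0 - 1e-6, 50_000)
    above = grid[s(grid) >= -alpha * np.log(1.0 - grid)]
    return float(above.max()) if above.size else 0.0

# Illustrative single-peak entropy bound (natural-log binary entropy).
H = lambda e: -e * np.log(e) - (1.0 - e) * np.log(1.0 - e)

alphas = np.linspace(0.1, 5.0, 200)
plt.plot(alphas, [epsilon_star(H, a) for a in alphas])
plt.xlabel("alpha = m / t(N)")
plt.ylabel("bound on the generalization error")
plt.title("Scaled learning curve (illustrative entropy bound)")
plt.show()
```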
Artificial Examples
Using a weak permissible entropy bound, s(ε) = 1, for some scaling function t(N)
[Figure: the flat entropy curve s(ε) = 1 together with the energy curves −α1 ln(1−ε), −α2 ln(1−ε), −α3 ln(1−ε); the crossing points ε*(α1), ε*(α2), ε*(α3) move to the left as α grows.]
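A closed form for this figure: with the flat bound the crossing point satisfies 1 = −α ln(1−ε*), so ε*(α) = 1 − e^(−1/α), which behaves like 1/α for large α. In particular, a crossing exists for every α, so this weak bound never predicts perfect learning; contrast this with the single-peak bound on the next slide.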
Artificial Examples
Using a single-peak permissible entropy bound
[Figure: a single-peaked s(ε) together with the energy curves −α1 ln(1−ε), −α2 ln(1−ε), −α3 ln(1−ε); the rightmost crossings ε*(α1), ε*(α2) exist for the smaller values of α, while for the largest α the energy curve clears the peak and no nonzero crossing remains.]
Artificial Examples
Using a different single-peak permissible
entropy bound
Artificial Examples
Using double-peak permissible entropy bound
Phase Transitions
The sudden drops in the learning curves are
called phase transitions
In thermodynamic systems, a phase
transition is the transformation from one
phase to another
A critical point is the set of conditions (such as
temperature and pressure) at which the transition
occurs
Phase Transitions
Well known phase transitions: solid to liquid, liquid to gas...
Phase Transitions – more…
The emergence of
superconductivity in
certain metals when cooled
below a critical temperature
The transition between
ferromagnetic and
paramagnetic phases of
magnetic materials
Phase Transitions & Learning
In some learning curves, we see a transition
from a finite generalization error to perfect
learning
The transition occurs at a critical value αC, i.e. when
the sample reaches the size m = αC·t(N)
At this critical point the system “realizes” the
problem all at once
(Almost) Real Examples
The Ising Perceptron
XN: ℝ^N (the input space)
DN: a spherically symmetric distribution over XN
FN: Ising perceptrons — all weights are constrained to be ±1:
w ∈ {−1,+1}^N, fw(x) = sgn(w·x)
fN: an arbitrary target function, defined by a weight vector w0
(Almost) Real Examples
The Ising Perceptron
Due to the spherically symmetric distribution,
εgen(w) = (1/π) cos⁻¹( (w·w0)/N ) = (1/π) cos⁻¹( 1 − 2 dH(w,w0)/N )
where dH(w,w0) is the Hamming distance between w and w0.
The number of perceptrons at Hamming distance j from the
target is Qj^N = C(N,j) ≤ 2^{N·H(j/N)}, where H is the binary entropy function, so
(1/N) log Qj^N ≤ H(j/N) = H( sin²(π εj^N / 2) )
N will be chosen as t(N), and H(sin²(πε/2)) will be chosen as s(ε)
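As a concrete finite-N illustration of these shells, the sketch below builds the exact error values and shell sizes of an Ising perceptron and feeds them to the shell-decomposition bound from the earlier slides; N, α and δ are arbitrary illustrative choices:

```python
import math

def ising_shells(N):
    """Error shells of the Ising perceptron: shell j holds the C(N, j) weight
    vectors at Hamming distance j from the target w0, each with
    eps_j = (1/pi) * arccos(1 - 2j/N) under the spherically symmetric
    input distribution."""
    return [((1.0 / math.pi) * math.acos(1.0 - 2.0 * j / N), math.comb(N, j))
            for j in range(N + 1)]

def shell_bound(shells, m, delta):
    # Smallest eps_i whose tail sum_{j > i} Q_j (1 - eps_j)**m is <= delta.
    for i, (eps_i, _) in enumerate(shells):
        if sum(Q * (1.0 - e) ** m for e, Q in shells[i + 1:]) <= delta:
            return eps_i
    return 1.0

N = 50
for alpha in (0.5, 1.0, 1.5, 2.0):
    m = int(alpha * N)  # t(N) = N, so m = alpha * N examples
    print(alpha, shell_bound(ising_shells(N), m, delta=0.05))
```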
(Almost) Real Examples
The Ising Perceptron
This is the “single-peak” entropy bound we saw earlier
The phase transition to perfect learning occurs at αC ≈ 1.448
The critical m for perfect learning according to both the VC
and cardinality bounds is m ~ N^{3/2}, rather than m ~ N
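A sketch that evaluates the thermodynamic-limit bound for this entropy curve. One assumption on my part: the entropy is taken in nats (natural-log binary entropy) so that it is directly comparable with the energy term −α ln(1−ε); with that convention, the α at which the nonzero crossing disappears comes out near the αC ≈ 1.448 quoted above:

```python
import numpy as np

def s_ising(eps):
    """Ising-perceptron entropy bound H(sin^2(pi*eps/2)), with H the
    binary entropy measured in nats (an assumption for comparability)."""
    x = np.sin(np.pi * eps / 2.0) ** 2
    return -x * np.log(x) - (1.0 - x) * np.log(1.0 - x)

grid = np.linspace(1e-4, 1.0 - 1e-4, 200_000)

def eps_star(alpha):
    # Rightmost grid point where the entropy still dominates the energy.
    above = grid[s_ising(grid) >= -alpha * np.log(1.0 - grid)]
    return float(above.max()) if above.size else 0.0

for alpha in (1.0, 1.4, 1.44, 1.45, 1.5):
    print(alpha, eps_star(alpha))   # the bound drops to 0 near alpha_c

# alpha_c is where the entropy/energy ratio peaks:
print("alpha_c ~", float((s_ising(grid) / (-np.log(1.0 - grid))).max()))
```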
(Almost) Real Examples
The Ising Perceptron
The right zero crossing
yields the upper bound on
the generalization error
With high probability,
there are no hypotheses
in VS(S) with error less
than the left zero crossing
except for the target itself
VS(S) minus the target is
contained within these
zero crossings
The Thermodynamic Limit
Lower Bound
The thermodynamic limit method can provide a
lower bound to the generalization error
The lower bound shows that the behavior examined
in scaled learning curves, including phase transitions,
can actually occur for certain function classes and
distributions
We will use the energy function 2αε
The qualitative behavior of the curves obtained by
intersecting with 2αε and -α·ln(1-ε) is essentially the
same
The Thermodynamic Limit
Lower Bound
Let s(ε) be any continuous
function such that:
s : [0, 1/2] → [0, 1]
s(0) = 0 and s(ε) ≤ H(ε) for every ε
(e.g. the binary entropy H itself)
We can construct:
a function class sequence FN
over XN
a distribution sequence DN over
XN
a target function sequence fN
such that:
1) s(ε) is a permissible entropy bound with respect to t(N) = N
2) for the largest ε* ≤ ½ for which s(ε*) ≥ 2αε*, there is a constant
probability of finding a consistent hypothesis with εgen(h) ≥ ε*
so ε* is a lower bound on the error of the worst consistent hypothesis
The Finite & Unrealizable Case
The data can be labeled according to a function not
within our class
Or the pairs (x,y) can be sampled from a distribution DN over XN×{0,1},
which can also model noise in the examples
Call u(ε) a permissible energy bound if for any
h in F and any sample size m:
probS[ h ∈ VS(S) ] ≤ exp[ −u(εgen(h))·m ]
For the realizable case we had u(εgen(h)) = −ln(1 − εgen(h)),
and the exact equality probS[ h ∈ VS(S) ] = exp[ −u(εgen(h))·m ]
The Finite & Unrealizable Case
We can always choose:
u(ε) = −ln( 1 − (√ε − √εmin)² )
(and in certain cases we can do better)
The standard cardinality bound obtained: with probability at least 1−δ,
εgen(h) ≤ εmin + 2·√( εmin·ln(|F|/δ) / m ) + ln(|F|/δ) / m
Since the class is finite, we can slice it into error shells
and apply the thermodynamic limit, just as in the
realizable case.
Choosing ε* to be the rightmost intersection of s(ε) and
α·u(ε), we get for any τ>0:
lim_{N→∞} probS[ VS(S) ⊆ B(ε*+τ) ] = 1
The Infinite Case
The covering approach: build a finite γ-cover, F[γ], of the
infinite class, such that εmin(γ) ≤ γ
Apply the thermodynamic limit by building a sequence of
nested covers:
γ1 > γ2 > γ3 > …, with F[γ1] ⊆ F[γ2] ⊆ F[γ3] ⊆ …
The result is a bound on the error in terms of εγ*, the rightmost
crossing point of sγ(ε) and α·uγ(ε)
Trade-off:
The best error achievable in the chosen cover F[γ]
improves as γ→0
The size of F[γ] increases as γ→0
Real World Example
Sufficient Dimensionality Reduction
with Irrelevance Statistics
A. Globerson, G. Chechik, N. Tishby
In this example:
Extracting a single feature using
SDR-IR, for various λ values
Main data: images of all
men with neutral face
expression and light
either from the right or
the left
Irrelevance data: similarly
created with female
images
Real World Example
[Figure: the normalized information about the main data, I[Φ(x),p], and about the irrelevance data, IM[Φ(x),p], as a function of λ; a phase transition occurs at a critical λ.]
Summary
Benefits of the method:
Derives tighter bounds
Allows describing the behavior for small samples as
well, which is useful in practice, where we often want to work with
m ~ d
Captures the phase transitions in learning curves,
including transitions to perfect learning, which can
actually occur experimentally in certain problems
Further work to be done:
Refined extensions to the infinite case