Introduction to Machine Learning
236756
Prof. Nir Ailon

Lecture 2: Uniform Convergence
Cardinality as Complexity Control
Approximation-Estimation Tradeoff
Growth Function as Measure of Complexity
The Statistical Model
• Instance space 𝒳 (aka domain set) — Known
  • Ex. 𝒳 = set of photos of faces
• Label space 𝒴 (aka label set) — Known
  • Ex. 𝒴 = {"glasses", "no glasses"}
• Distribution D over 𝒳 × 𝒴 — Unknown
  • (x, y) ∼ D
  • Pr[x = a, y = b] = "probability of seeing instance a with label b"
• For simplicity, we think of x ∼ D_x, y = f(x), with f (possibly) unknown
  • In most cases, you will be OK!
  • Ex. f(⟨photo of a face without glasses⟩) = "no glasses", Pr[x = ⟨that photo⟩, y = "glasses"] = 0
• "Distribution-free" setting
The Statistical Model
• 𝒳, f: 𝒳 → 𝒴, 𝒴 = {__, __}
• Training set S ∼ D^m — Known
• Output of algorithm A(S) is a prediction rule h: 𝒳 → 𝒴
• Error:  L_{D,f}(h) = Pr_{x∼D_x}[h(x) ≠ f(x)]
  • Aka: generalization error, risk, true error, true loss
  • Whatever you name it… remember it is unknown (WHY??)
• Training error:  L_S(h) = |{i ∈ [m] : h(x_i) ≠ y_i}| / m — Known
  • Aka: empirical risk, empirical error
  • What is L_S(h) in the example?
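A minimal sketch (not from the slides) of how the training error L_S(h) is computed on a toy sample; the domain, labels, and threshold rule below are made-up illustrations.

```python
# Illustrative sketch: training error L_S(h) = |{i : h(x_i) != y_i}| / m on a toy sample.

def training_error(h, S):
    """Empirical error of hypothesis h on sample S = [(x_1, y_1), ..., (x_m, y_m)]."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

# Toy example: instances are real numbers, labels are 0/1,
# and h predicts 1 whenever x >= 0.5 (a made-up threshold rule).
S = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.7, 0)]
h = lambda x: 1 if x >= 0.5 else 0
print(training_error(h, S))  # -> 0.2 (one mistake out of five)
```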
Empirical Risk Minimization
• To be precise:  ERM_H(S) = argmin_{h ∈ H} L_S(h)
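A minimal sketch of the ERM rule over a finite hypothesis class; the small class of threshold rules below is a made-up illustration, not the lecture's example.

```python
# Illustrative ERM sketch: pick the hypothesis with the smallest training error
# from a small, finite hypothesis class H.

def training_error(h, S):
    return sum(1 for x, y in S if h(x) != y) / len(S)

def erm(H, S):
    """ERM_H(S) = argmin over h in H of L_S(h)."""
    return min(H, key=lambda h: training_error(h, S))

# Made-up finite class: threshold rules "predict 1 iff x >= theta" for a few thetas.
def make_threshold(theta):
    return lambda x: 1 if x >= theta else 0

H = [make_threshold(t) for t in (0.2, 0.5, 0.8)]
S = [(0.1, 0), (0.3, 0), (0.6, 1), (0.9, 1)]
best = erm(H, S)
print(training_error(best, S))  # -> 0.0 (theta = 0.5 fits this sample perfectly)
```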
Realizable Case (Finite H)
• If L_{D,f}(h*) = 0 for some h* ∈ H, then L_S(h*) = 0 almost surely (with prob. 1)
• If L_{D,f}(h) > 0, then L_S(h) = 0 with prob (1 − L_{D,f}(h))^m ≤ exp(−m·L_{D,f}(h))
• Does this mean that if L_S(h) = 0 then h is correct?
• No…
• If L_{D,f}(h) ≥ ε, then L_S(h) = 0 with prob ≤ exp(−εm)
• By union bound, with prob ≥ 1 − |H|·exp(−εm), all h ∈ H such that L_{D,f}(h) ≥ ε satisfy L_S(h) > 0
• ⟹ ∀δ, with probability ≥ 1 − δ, if L_S(h) = 0 for some h ∈ H, then all we can say is:
  L_{D,f}(h) < (1/m)·log(|H|/δ)
Realizable Case
• For all δ, ε > 0, with probability ≥ 1 − δ:
  • every h ∈ H with L_{D,f}(h) > (1/m)·log(|H|/δ) satisfies L_S(h) > 0
  • hence L_{D,f}(ERM_H(S)) ≤ L_{D,f}(h*) + (1/m)·log(|H|/δ)
• Proof:
  • ∀h ∈ H:  Pr[L_S(h) = 0] ≤ exp(−L_{D,f}(h)·m)
  • Union bound on at most |H| hypotheses
• If you want a fixed ε, set m = (1/ε)·log(|H|/δ)
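A quick numeric sketch of the realizable-case sample size m = (1/ε)·log(|H|/δ); the particular |H|, ε, δ values are arbitrary illustrations.

```python
import math

# Realizable case (finite H): to guarantee error <= eps with prob >= 1 - delta,
# the slide's bound asks for m >= (1/eps) * log(|H| / delta).
def realizable_sample_size(H_size, eps, delta):
    return math.ceil(math.log(H_size / delta) / eps)

# Arbitrary illustrative numbers.
print(realizable_sample_size(H_size=10**6, eps=0.01, delta=0.01))  # -> 1843
print(realizable_sample_size(H_size=10**3, eps=0.01, delta=0.01))  # -> 1152
```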
• Notice the order of quantifiers:
  • With (high) probability: ∀h ∈ H, L_S(h) is "good"
  • With (low) probability: ∃h ∈ H s.t. L_S(h) is "bad"
The Non-Realizable Case
• What do we do if L_{D,f}(h) > 0 for all h ∈ H?
• Fix h ∈ H
• Fix the number of examples m
• S ∼ D^m
• How is L_S(h) distributed?
  • When the i'th example x_i is drawn:  Pr[h(x_i) ≠ f(x_i)] = L_{D,f}(h)
  • ⟹ m·L_S(h) ∼ Bin(m, L_{D,f}(h))
What Do We Know About the
Binomial Distribution?
• Quite a lot… Assume Z ∼ Bin(m, p)
• E[Z] = mp
• Var[Z] = E[Z²] − (E[Z])² = mp(1 − p)
• By the Central Limit Theorem, as m → ∞, Z tends to a Gaussian. But at what rate?
• Good enough approximation (for now): the Hoeffding bound
Hoeffding Bound
• Assume Z = Σ_{i=1}^{m} Z_i, where:
  • the Z_i's are independent
  • for all i ∈ [m], Z_i ∈ [0, 1]
  • μ = E[Z] = Σ_i E[Z_i]   (linearity of expectation)
• Then:  Pr[|Z − μ| > t] ≤ 2·exp(−2t²/m)
• If Z_i ∼ Ber(p) for all i ∈ [m], then Z ∼ Bin(m, p), so
  Pr[|Z − mp| > t] ≤ 2·exp(−2t²/m)
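A small Monte-Carlo sanity check (an illustration, not part of the lecture) comparing the empirical deviation probability of a Bernoulli sum with the Hoeffding bound 2·exp(−2t²/m); the values of m, p, t and the seed are arbitrary.

```python
import math, random

# Monte-Carlo check of Hoeffding for Z = sum of m Bernoulli(p) variables:
# estimate Pr[|Z - mp| > t] and compare with the bound 2 * exp(-2 t^2 / m).
m, p, t, trials = 100, 0.3, 10, 20000
random.seed(0)

exceed = 0
for _ in range(trials):
    Z = sum(1 for _ in range(m) if random.random() < p)
    if abs(Z - m * p) > t:
        exceed += 1

print("empirical:", exceed / trials)              # much smaller than the bound below
print("Hoeffding:", 2 * math.exp(-2 * t**2 / m))  # = 2*exp(-2) ~ 0.27
```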
Conclusion (Finite H)
• For all h, t:  Pr[|m·L_S(h) − m·L_{D,f}(h)| > t] ≤ 2·exp(−2t²/m)
• Variable change ε = t/m:  Pr[|L_S(h) − L_{D,f}(h)| > ε] ≤ 2·exp(−2mε²)
  • Ex. With probability ≥ 0.99, |L_S(h) − L_{D,f}(h)| ≤ 0.1 with m ≈ 1000.
    To get probability ≥ 0.9999 we need m ≈ 2400.
• By union bound, with prob ≥ 1 − 2|H|·exp(−2mε²):
  ∀h ∈ H: |L_S(h) − L_{D,f}(h)| ≤ ε    ("uniform concentration")
• ⟹ If ĥ = argmin_{h∈H} L_S(h) (the ERM rule), then L_{D,f}(ĥ) ≤ L_{D,f}(h*) + 2ε,
  where h* = argmin_{h∈H} L_{D,f}(h)
• ⟹ ∀δ: w.p. ≥ 1 − δ,  L_{D,f}(ĥ) ≤ L_{D,f}(h*) + 2·√((1/(2m))·log(2|H|/δ))   (Why?)
• This is uniform convergence.
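A numeric sketch of the finite-class bound above: given m, |H|, δ, it evaluates ε = √((1/(2m))·log(2|H|/δ)) and the resulting ERM guarantee L_{D,f}(ĥ) ≤ L_{D,f}(h*) + 2ε. The m, |H|, δ values are arbitrary illustrations.

```python
import math

# Finite-H uniform convergence: with prob >= 1 - delta, every h in H satisfies
# |L_S(h) - L_{D,f}(h)| <= eps, where eps = sqrt(log(2|H|/delta) / (2m)),
# hence ERM is within 2*eps of the best hypothesis in H.
def eps_bound(m, H_size, delta):
    return math.sqrt(math.log(2 * H_size / delta) / (2 * m))

for m in (100, 1000, 10000):
    e = eps_bound(m, H_size=1000, delta=0.01)
    print(f"m={m:6d}  eps={e:.3f}  ERM excess risk <= {2 * e:.3f}")
```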
Agnostic (Non-Realizable)
Case
• If you want a fixed ε, set m = (2/ε²)·log(2|H|/δ)
• For all δ, ε > 0, with probability ≥ 1 − δ:
  • ∀h ∈ H:  |L_{D,f}(h) − L_S(h)| ≤ √((1/(2m))·log(2|H|/δ))
  • L_{D,f}(ERM_H(S)) ≤ L_{D,f}(h*) + 2·√((1/(2m))·log(2|H|/δ))
• Proof:
  • Hoeffding: ∀h ∈ H:  Pr[|L_{D,f}(h) − L_S(h)| ≥ ε] ≤ 2·exp(−2mε²)
  • Union bound over at most |H| hypotheses
• Notice the order of quantifiers:
  • With (high) probability: ∀h ∈ H, L_S(h) is a "good approx"
  • With (low) probability: ∃h ∈ H s.t. L_S(h) is a "bad approx"
• This is a uniform convergence argument: the L_S(h)'s converge to the L_{D,f}(h)'s, and all approximations are likely to be good simultaneously
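A small simulation sketch of the uniform-convergence phenomenon. The distribution (x uniform on [0,1], labels flipped with probability 0.1), the finite class of 100 threshold rules, and the seed are all made-up illustrations; the worst-case gap max_h |L_S(h) − L_{D,f}(h)| should shrink as m grows.

```python
import random

# Simulation sketch (made-up distribution and class): x ~ Uniform[0,1],
# the clean label is 1 iff x >= 0.5, flipped with prob 0.1 (label noise),
# H = threshold rules {x >= theta} for theta in {0.00, 0.01, ..., 0.99}.
random.seed(1)
thetas = [i / 100 for i in range(100)]

def true_error(theta):
    # Closed form for this distribution: error prob is 0.9 on the interval
    # between theta and 0.5 (where h disagrees with the clean rule), 0.1 elsewhere.
    gap = abs(theta - 0.5)
    return 0.9 * gap + 0.1 * (1 - gap)

def sample(m):
    S = []
    for _ in range(m):
        x = random.random()
        y = 1 if x >= 0.5 else 0
        if random.random() < 0.1:
            y = 1 - y
        S.append((x, y))
    return S

for m in (100, 1000, 10000):
    S = sample(m)
    gap = max(abs(sum(1 for x, y in S if (x >= t) != (y == 1)) / m - true_error(t))
              for t in thetas)
    print(f"m={m:6d}  max_h |L_S(h) - L_D(h)| = {gap:.3f}")
```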
Error Decomposition
• L_{D,f}(ERM_H(S)) ≤ L_{D,f}(h*) + 2·√((1/(2m))·log(2|H|/δ))
  • ε_app = L_{D,f}(h*); the square-root term is an upper bound on ε_est
• ε_est = L_{D,f}(ERM_H(S)) − ε_app

  ε_app                                   | ε_est
  ----------------------------------------|---------------------------------------------
  Not a random variable                   | Random variable
  Doesn't depend on algorithm or          | Depends on algorithm and sample size
    sample size                           |
  Decreases as H grows                    | Increases (logarithmically) as H grows
  Does not depend on m                    | Decreases (inverse-polynomially) as m grows
  Captures our prior knowledge            | Captures estimation of true error
                                          |   by empirical error
Bias-Complexity Tradeoff
(holding m fixed)
[Figure: ε_app + ε_est as a function of |H|.
 Small |H| — low complexity, large inductive bias: ε_app large, ε_est small → underfitting.
 Large |H| — high complexity, small inductive bias: ε_app small, ε_est large → overfitting.]
ε vs. m = Upper Bound On ε_est
[Figure: the upper bound ε as a function of m, for δ = 0.01, with one curve for |H| = 1000000 and one for |H| = 1000.]
What determines the curve? Is it really only |H|?
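A sketch that reproduces the kind of curves on this slide: the bound ε(m) = √((1/(2m))·log(2|H|/δ)) for δ = 0.01 and the two |H| values shown. Using matplotlib here is an illustrative choice, not part of the lecture.

```python
import math
import matplotlib.pyplot as plt

# Upper bound on eps_est as a function of m, for delta = 0.01,
# comparing |H| = 10^6 and |H| = 10^3 (the two curves on the slide).
delta = 0.01
ms = list(range(10, 5001, 10))

for H_size in (10**6, 10**3):
    eps = [math.sqrt(math.log(2 * H_size / delta) / (2 * m)) for m in ms]
    plt.plot(ms, eps, label=f"|H| = {H_size}")

plt.xlabel("m")
plt.ylabel("upper bound on eps_est")
plt.legend()
plt.show()
```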
Are All Infinite |H| Equally Bad?
• 𝒳 = [0,1]²,  H = 𝒴^𝒳  (all functions from 𝒳 to 𝒴)
• 𝒳 = [0,1]²,  H = {h_θ : θ ∈ [0,1]},  where h_θ(a,b) = 1 if a ≥ θ and −1 if a < θ
• 𝒳 = [0,1]²,  H = {h_{α,β,θ} : α, β, θ ∈ ℝ},  where h_{α,β,θ}(a,b) = 1 if α·a + β·b ≥ θ and −1 otherwise
• As we shall see, the answer is a resounding NO. To understand why, we'd need a couple of definitions.
The Growth Function of H
• Let C ⊆ 𝒳
• Define H_C to be the restriction of all functions h ∈ H to C:
  H_C = {h|_C : h ∈ H},  where h|_C = restriction of h to the sub-domain C
• The growth function:  τ_H(m) = max{|H_C| : C ⊆ 𝒳, |C| = m}
• Measures how big |H_C| can be when |C| = m
• τ_H(m) is monotonically nondecreasing. (Why?)
• τ_H(m) is distribution free (does not depend on D_X)
• As we shall see, τ_H(m) measures the effective size of H for the purpose of learning from m examples.
Example
        x1  x2  x3  x4  x5  x6  x7  x8  x9
  h1     0   0   1   0   0   0   1   0   0
  h2     0   1   0   0   0   1   0   0   0
  h3     1   0   0   0   1   1   0   0   0
  h4     0   0   0   1   1   0   0   0   1
  h5     0   0   1   0   0   0   0   1   0
  h6     0   1   0   0   0   0   1   0   0
  h7     1   0   0   0   0   1   0   0   0
  h8     0   0   0   0   0   0   0   0   0

τ_H(1) = 2    τ_H(2) = 4    τ_H(3) = ?
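A brute-force sketch that computes τ_H(m) for the table above by enumerating all subsets C of size m and counting the distinct restrictions; fine for this tiny example, but exponential in general.

```python
from itertools import combinations

# Rows = hypotheses h1..h8, columns = points x1..x9 (the table above).
table = [
    [0,0,1,0,0,0,1,0,0],  # h1
    [0,1,0,0,0,1,0,0,0],  # h2
    [1,0,0,0,1,1,0,0,0],  # h3
    [0,0,0,1,1,0,0,0,1],  # h4
    [0,0,1,0,0,0,0,1,0],  # h5
    [0,1,0,0,0,0,1,0,0],  # h6
    [1,0,0,0,0,1,0,0,0],  # h7
    [0,0,0,0,0,0,0,0,0],  # h8
]

def growth(m):
    """tau_H(m) = max over |C| = m of the number of distinct restrictions in H_C."""
    n_points = len(table[0])
    return max(len({tuple(row[i] for i in C) for row in table})
               for C in combinations(range(n_points), m))

# The slide gives tau_H(1) = 2 and tau_H(2) = 4, and asks about tau_H(3).
for m in (1, 2, 3):
    print(f"tau_H({m}) =", growth(m))
```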
Uniform Convergence From Growth
Function
• For every distribution D, function f: 𝒳 → 𝒴, m ≥ 0 and δ ∈ (0,1):
• if S ∼ D^m, then with probability ≥ 1 − δ,
• for all h ∈ H:
  |L_{D,f}(h) − L_S(h)| ≤ (4 + √(log τ_H(2m))) / (δ·√(2m))
• Difficulty in proof: we may still have infinitely many hypotheses to "union bound" against
• Proof idea: if we take two independent samples of size m, call them S and S′, then |L_S(h) − L_{S′}(h)| should look like an estimate for a hypothesis class of size τ_H(2m)…
Shattering and VC Dimension
• A set C ⊆ 𝒳 is said to be shattered by H if H_C consists of exactly all 2^|C| possible functions
• The VC dimension of H, denoted VCdim(H), is the largest number d such that there exists a subset C ⊆ 𝒳 of size d that is shattered by H
• Note: VCdim(H) can be ∞
• Note: If H is finite, then VCdim(H) ≤ log₂|H|
Example
(Same table of hypotheses h1–h8 over points x1–x9 as in the previous Example slide.)
τ_H(1) = 2    τ_H(2) = 4    τ_H(3) < 8

Proof: to shatter a set C of size 3, some h ∈ H must label all three points of C with 1. Only h3 and h4 have three 1's in their row, so the only candidates are C = {x1, x5, x6} or C = {x4, x5, x9}; both options are ruled out by quick inspection. Therefore VCdim(H) = 2.
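A brute-force sketch that checks shattering directly on the table from the Example slide and recovers VCdim = 2; exponential-time, for illustration only.

```python
from itertools import combinations

# Same 8x9 table of hypothesis values as in the Example slide.
table = [
    [0,0,1,0,0,0,1,0,0], [0,1,0,0,0,1,0,0,0], [1,0,0,0,1,1,0,0,0],
    [0,0,0,1,1,0,0,0,1], [0,0,1,0,0,0,0,1,0], [0,1,0,0,0,0,1,0,0],
    [1,0,0,0,0,1,0,0,0], [0,0,0,0,0,0,0,0,0],
]

def shattered(C):
    """C is shattered iff the restrictions H_C realize all 2^|C| label patterns."""
    patterns = {tuple(row[i] for i in C) for row in table}
    return len(patterns) == 2 ** len(C)

def vcdim():
    n = len(table[0])
    best = 0
    for d in range(1, n + 1):
        if any(shattered(C) for C in combinations(range(n), d)):
            best = d
        else:
            break  # if no set of size d is shattered, no larger set is either
    return best

print(vcdim())  # -> 2
```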
Sauer-Shelah Lemma
• If H has finite VC dimension d, then for all m ≥ 1:
  τ_H(m) ≤ Σ_{i=0}^{d} (m choose i)
• In particular, if m > d + 1 then
  τ_H(m) ≤ (em/d)^d
• Note: this estimate is much smaller than 2^m
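A numeric sketch comparing the Sauer-Shelah sum, its (em/d)^d relaxation, and the trivial 2^m bound; the d and m values are arbitrary illustrations.

```python
import math

# Sauer-Shelah: if VCdim(H) = d, then tau_H(m) <= sum_{i=0}^{d} C(m, i),
# and for m > d + 1 also tau_H(m) <= (e*m/d)^d -- polynomial in m,
# much smaller than the trivial bound 2^m.
def sauer_sum(m, d):
    return sum(math.comb(m, i) for i in range(d + 1))

d = 5
for m in (20, 100, 1000):
    poly = (math.e * m / d) ** d
    print(f"m={m:5d}  sum={float(sauer_sum(m, d)):.3e}  (em/d)^d={poly:.3e}  2^m={2.0 ** m:.3e}")
```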
Putting It All Together
• Assume H has VC dimension d
• For every distribution D, function f: 𝒳 → 𝒴, m > d + 1 and δ ∈ (0,1):
• if S ∼ D^m, then with probability ≥ 1 − δ,
• for all h ∈ H:
  |L_{D,f}(h) − L_S(h)| ≤ (4 + √(log τ_H(2m))) / (δ·√(2m)) ≤ O(√(d·log m) / (δ·√m))
• Note: working harder, one can get tighter bounds of
  O(√((d + log(1/δ)) / m))  (agnostic)    or    O((d·log m + log(1/δ)) / m)  (realizable)
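A rough sketch comparing the rates on this slide, with all constants set to 1 purely for illustration (the slide's bounds are order-of-magnitude statements): √(d·log m)/(δ·√m) versus the tighter √((d + log(1/δ))/m) and (d·log m + log(1/δ))/m.

```python
import math

# Rough comparison of the rates on this slide (constants arbitrarily set to 1).
# Note how the 1/delta factor makes the first (growth-function) bound much weaker.
d, delta = 10, 0.01
for m in (10**3, 10**4, 10**5):
    loose = math.sqrt(d * math.log(m)) / (delta * math.sqrt(m))
    agnostic = math.sqrt((d + math.log(1 / delta)) / m)
    realizable = (d * math.log(m) + math.log(1 / delta)) / m
    print(f"m={m:7d}  loose={loose:.3f}  agnostic={agnostic:.3f}  realizable={realizable:.4f}")
```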
Does VC Dimension Also Tell Us
When We Can't Learn?
• What if a problem has infinite VCdim?
• Perhaps we can still learn it?
• In some sense… no
• To understand in what sense, we'll need another definition…
(Realizable) PAC Learning
A hypothesis class H is PAC learnable if
∃ m_H : (0,1)² → ℕ and an algorithm A s.t.:
∀ ε, δ ∈ (0,1) and ∀ D, f: 𝒳 → {0,1}, if "realizable", then for every m ≥ m_H(ε, δ):
  Pr_{S∼D^m}[ L_{D,f}(A(S)) ≤ ε ] ≥ 1 − δ
("Probably" = with probability ≥ 1 − δ;  "Approximately Correct" = error at most ε)
Agnostic PAC Learning
In particular:
Uniform Convergence ⟹ Agnostic PAC Learning!

A hypothesis class H is agnostic PAC learnable if
∃ m_H : (0,1)² → ℕ and an algorithm A s.t.:
∀ ε, δ ∈ (0,1) and ∀ D, f: 𝒳 → {0,1}, for every m ≥ m_H(ε, δ):
  Pr_{S∼D^m}[ L_{D,f}(A(S)) ≤ L_{D,f}(h*) + ε ] ≥ 1 − δ
(Probably Approximately Correct, with h* = argmin_{h∈H} L_{D,f}(h))
Negative Results With VCdim
• Theorem: If VCdim(H) = ∞ then H is not PAC learnable: for any algorithm A and any m ≥ 0, there exist D and h ∈ H s.t. with probability ≥ 1/7 (over the choice of S ∼ D^m), L_{D,h}(A(S)) ≥ 1/8.
• Moreover, if VCdim(H) = d, then one can force the algorithm's error to be at least Ω(d/m).
• Proofs use the statistical "no free lunch" theorem.