Lecture 2

Introduction to
Machine Learning
236756
Prof. Nir Ailon
Lecture 2:
Uniform Convergence
Cardinality as Complexity Control
Approximation-Estimation Tradeoff
Growth Function as Measure of Complexity
The Statistical Model
β€’ Instance space 𝒳 (aka Domain Set)
β€’ Ex. 𝒳 = Set of photos of faces
Known
β€’ Label space 𝒴 (aka Label set)
β€’ Ex. 𝒴 = {β€œglasses”, β€œno glasses”}
Known
β€’ Distribution π’Ÿ over 𝒳 × π’΄
β€’ π‘₯, 𝑦 ∼ π’Ÿ
Unknown
β€’ Pr π‘₯ = π‘Ž, 𝑦 = 𝑏 =
β€œProbability of seeing instance a with label b”
Unknown
β€’ For simplicity, we think of π‘₯ ∼ π’Ÿπ‘₯ and 𝑦 = 𝑓(π‘₯)
In most cases, you will be OK!
β€’ Ex. 𝑓([face photo without glasses]) = β€œno glasses”, and Pr[π‘₯ = [that photo], 𝑦 = β€œglasses”] = 0
(Possibly) Unknown
β€œDistribution-Free” setting
The Statistical Model
[Figure: 𝒳 (face photos), 𝑓: 𝒳 ↦ 𝒴, 𝒴 = {β€œglasses”, β€œno glasses”}]
β€’ Training set 𝑆 ∼ π’Ÿ π‘š
β€’ Output of algorithm 𝐴(𝑆) is a prediction rule β„Ž: 𝒳 ↦ 𝒴
β€’ Error = πΏπ’Ÿ,𝑓(β„Ž) = Prπ‘₯βˆΌπ’Ÿπ‘₯ [β„Ž(π‘₯) β‰  𝑓(π‘₯)]
(Aka: Generalization error, Risk, True error, True loss. Whatever you name it… remember it is unknown. WHY??)
β€’ Training error 𝐿𝑆(β„Ž) = |{𝑖 ∈ [π‘š] : β„Ž(π‘₯𝑖) β‰  𝑦𝑖}| / π‘š
(Aka: Empirical Risk, Empirical Error. Known. What is 𝐿𝑆(β„Ž) in the example?)
Empirical Risk Minimization
πΈπ‘…π‘€β„‹(𝑆) = argminβ„Žβˆˆβ„‹ 𝐿𝑆(β„Ž)
To be precise: πΈπ‘…π‘€β„‹(𝑆) ∈ argminβ„Žβˆˆβ„‹ 𝐿𝑆(β„Ž) (the minimizer need not be unique)
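Below is a minimal, illustrative sketch of the ERM rule for a finite hypothesis class (the names `empirical_risk`, `erm` and the toy threshold class are mine, not from the lecture):

```python
import random

def empirical_risk(h, S):
    """L_S(h): fraction of training examples that h labels incorrectly."""
    return sum(1 for (x, y) in S if h(x) != y) / len(S)

def erm(hypotheses, S):
    """ERM_H(S): return a hypothesis minimizing the empirical risk (ties broken arbitrarily)."""
    return min(hypotheses, key=lambda h: empirical_risk(h, S))

# Toy finite class: thresholds on [0,1]; labels generated by the threshold at 0.3
H = [lambda x, t=t: 1 if x >= t else -1 for t in [i / 10 for i in range(11)]]
f = lambda x: 1 if x >= 0.3 else -1
S = [(x, f(x)) for x in (random.random() for _ in range(50))]

h_hat = erm(H, S)
print(empirical_risk(h_hat, S))  # 0.0 here, since the true f is in H (realizable case)
```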
Realizable Case (Finite β„‹)
β€’ πΏπ’Ÿ,𝑓 (β„Žβˆ— ) = 0 for some β„Žβˆ— ∈ β„‹ then 𝐿𝑆 (β„Žβˆ— ) = 0 a.s.
β€’ If πΏπ’Ÿ,𝑓 β„Ž > 0 then 𝐿𝑠 β„Ž = 0 with prob
π‘š
1 βˆ’ πΏπ’Ÿ,𝑓 (β„Ž) ≀ 𝑒 βˆ’π‘šβ‹…πΏπ’Ÿ,𝑓 (β„Ž)
Almost Surely
(with prob. 1)
β€’ Does this mean that if 𝐿𝑆 β„Ž = 0 then h is correct?
β€’ No…
β€’ If πΏπ’Ÿ,𝑓 β„Ž β‰₯ πœ€ then 𝐿𝑠 β„Ž = 0 with pos. prob ≀ 𝑒 βˆ’π‘šπœ€
β€’ By the union bound, with prob. β‰₯ 1 βˆ’ |β„‹|𝑒^(βˆ’π‘šπœ€), all β„Ž ∈ β„‹ such that πΏπ’Ÿ,𝑓(β„Ž) β‰₯ πœ€ satisfy 𝐿𝑆(β„Ž) > 0
οƒžβˆ€π›Ώ, with probability β‰₯ 1 βˆ’ 𝛿, if 𝐿𝑆 β„Ž = 0 for some
β„Ž ∈ β„‹ then all we can say is:
1
|β„‹|
πΏπ’Ÿ,𝑓 β„Ž < log
π‘š
𝛿
Realizable Case
β€’ For all 𝛿, π‘š > 0
with probability β‰₯ 1 βˆ’ 𝛿
βˆ€β„Ž ∈ β„‹ s.t. πΏπ’Ÿ,𝑓 β„Ž >
𝐿𝑆 β„Ž > 0
1
|β„‹|
log
π‘š
𝛿
πΏπ’Ÿ,𝑓 ERMβ„‹ 𝑆
≀
1
πΏπ’Ÿ,𝑓 β„Žβˆ— + π‘š log
β€’ Proof:
β€’ βˆ€β„Ž ∈ β„‹: Pr 𝐿𝑆 β„Ž = 0 ≀ 𝑒 βˆ’πΏπ’Ÿ,𝑓(β„Ž) π‘š
β€’ Union bound on at most |β„‹| hypotheses
|β„‹|
𝛿
If you want the bound to be a fixed πœ€, set π‘š = (1/πœ€) log(|β„‹|/𝛿) (see the code sketch below)
β€’ Notice the order of quantifiers:
With (high) probability… βˆ€β„Ž ∈ β„‹: 𝐿𝑆(β„Ž) β€œgood”
With (low) probability: βˆƒβ„Ž ∈ β„‹: 𝐿𝑆(β„Ž) β€œbad”
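As a concrete illustration of the sample-size rule above, a small helper (hypothetical name `m_realizable`) that computes π‘š = (1/πœ€) log(|β„‹|/𝛿):

```python
import math

def m_realizable(eps, delta, H_size):
    """Smallest integer m with (1/m)*log(|H|/delta) <= eps, i.e. m >= (1/eps)*log(|H|/delta)."""
    return math.ceil(math.log(H_size / delta) / eps)

print(m_realizable(0.1, 0.01, 1000))   # β‰ˆ116 examples
print(m_realizable(0.1, 0.01, 10**6))  # β‰ˆ185 examples: |H| enters only logarithmically
```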
The Non-Realizable Case
β€’ What do we do if 𝐿𝑆 β„Ž > 0 for all β„Ž ∈ β„‹ ?
β€’ Fix β„Ž ∈ β„‹
β€’ Fix number of examples π‘š
β€’ 𝑆 ∼ π’Ÿπ‘š
β€’ How is 𝐿𝑆 (β„Ž) distributed?
β€’ When the 𝑖’th example π‘₯𝑖 is drawn: Pr[β„Ž(π‘₯𝑖) β‰  𝑓(π‘₯𝑖)] = πΏπ’Ÿ,𝑓(β„Ž)
β‡’ π‘š β‹… 𝐿𝑆(β„Ž) ∼ Bin(π‘š, πΏπ’Ÿ,𝑓(β„Ž))
What Do We Know About the
Binomial Distribution?
β€’ Quite a lot… Assume 𝑍 ∼ Bin(π‘š, 𝑝)
β€’ 𝐸[𝑍] = π‘šπ‘
β€’ Var(𝑍) = 𝐸[𝑍²] βˆ’ (𝐸[𝑍])Β² = π‘šπ‘(1 βˆ’ 𝑝)
β€’ By the Central Limit Theorem, as π‘š β†’ ∞, 𝑍 (suitably normalized) tends to a Gaussian. But at what rate?
β€’ Good enough approximation (for now):
Hoeffding Bound
Hoeffding Bound
β€’ Assume 𝑍 = π‘š
𝑖=1 𝑍𝑖
β€’ 𝑍𝑖 ’s independent
β€’ For all 𝑖 ∈ π‘š , 𝑍𝑖 ≀ 1
πœ‡=𝐸 𝑍 =
𝐸[𝑍𝑖 ]
(Linearity of expectation)
β€’ Then:
2 /π‘š
βˆ’2𝑑
Pr 𝑍 βˆ’ πœ‡ > 𝑑 ≀ 2𝑒
β€’ If 𝑍𝑖 ∼ Ber 𝑝 for all 𝑖 ∈ [π‘š], then 𝑍 ∼ Bin(π‘š, 𝑝)
2 /π‘š
βˆ’2𝑑
Pr 𝑍 βˆ’ π‘šπ‘ > 𝑑 ≀ 2𝑒
Conclusion (Finite β„‹)
β€’ For all β„Ž, 𝑑:
Pr[|π‘šπΏπ‘†(β„Ž) βˆ’ π‘šπΏπ’Ÿ,𝑓(β„Ž)| > 𝑑] ≀ 2𝑒^(βˆ’2𝑑²/π‘š)
β€’ Variable change πœ€ = 𝑑/π‘š:
Pr[|𝐿𝑆(β„Ž) βˆ’ πΏπ’Ÿ,𝑓(β„Ž)| > πœ€] ≀ 2𝑒^(βˆ’2π‘šπœ€Β²)
Ex. With probability β‰₯ 0.99, |𝐿𝑆(β„Ž) βˆ’ πΏπ’Ÿ,𝑓(β„Ž)| ≀ 0.1 with π‘š β‰ˆ 1000. To get probability β‰₯ 0.9999 we need π‘š β‰ˆ 2400.
β€’ By the union bound, with prob. β‰₯ 1 βˆ’ 2|β„‹|𝑒^(βˆ’2π‘šπœ€Β²):
β€’ βˆ€β„Ž ∈ β„‹: |𝐿𝑆(β„Ž) βˆ’ πΏπ’Ÿ,𝑓(β„Ž)| ≀ πœ€
Uniform Concentration
β‡’ if β„Ž = argminβ„Žβˆˆβ„‹ 𝐿𝑆(β„Ž) (the ERM rule), then πΏπ’Ÿ,𝑓(β„Ž) ≀ πΏπ’Ÿ,𝑓(β„Žβˆ—) + 2πœ€
β‡’ βˆ€π›Ώ: w.p. β‰₯ 1 βˆ’ 𝛿, πΏπ’Ÿ,𝑓(β„Ž) ≀ πΏπ’Ÿ,𝑓(β„Žβˆ—) + 2√((1/2π‘š) log(2|β„‹|/𝛿))
Why?
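To see the numbers, a tiny helper (hypothetical name `uc_gap`) that computes the πœ€ at which 2|β„‹|𝑒^(βˆ’2π‘šπœ€Β²) = 𝛿, and the resulting ERM guarantee:

```python
import math

def uc_gap(m, H_size, delta):
    """eps with 2|H|*exp(-2*m*eps^2) = delta, i.e. eps = sqrt(log(2|H|/delta) / (2m)).
    ERM's excess risk over h* is then at most 2*eps."""
    return math.sqrt(math.log(2 * H_size / delta) / (2 * m))

eps = uc_gap(m=1000, H_size=1000, delta=0.01)
print(eps, 2 * eps)  # β‰ˆ0.078 per-hypothesis gap, β‰ˆ0.156 excess risk for ERM
```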
Uniform Convergence
Agnostic (Non-Realizable) Case
β€’ For all 𝛿, π‘š > 0, with probability β‰₯ 1 βˆ’ 𝛿:
βˆ€β„Ž ∈ β„‹: |𝐿𝑆(β„Ž) βˆ’ πΏπ’Ÿ,𝑓(β„Ž)| ≀ √((1/2π‘š) log(2|β„‹|/𝛿))
β‡’ πΏπ’Ÿ,𝑓(ERMβ„‹(𝑆)) ≀ πΏπ’Ÿ,𝑓(β„Žβˆ—) + 2√((1/2π‘š) log(2|β„‹|/𝛿))
If you want the bound to be a fixed πœ€, set π‘š = (2/πœ€Β²) log(2|β„‹|/𝛿) (see the code sketch below)
β€’ Proof:
β€’ Hoeffding: βˆ€β„Ž ∈ β„‹: Pr[|𝐿𝑆(β„Ž) βˆ’ πΏπ’Ÿ,𝑓(β„Ž)| β‰₯ πœ€] ≀ 2𝑒^(βˆ’2πœ€Β²π‘š)
β€’ Union bound over at most |β„‹| hypotheses
β€’ Notice the order of quantifiers:
With (high) probability….
βˆ€β„Ž ∈ β„‹: 𝐿𝑆 (β„Ž) β€œgood approx”
With (low) probability
βˆƒβ„Ž ∈ β„‹: 𝐿𝑆 (β„Ž) β€œbad approx”
β€’ This is a uniform convergence argument: the 𝐿𝑆(β„Ž)’s converge to the πΏπ’Ÿ,𝑓(β„Ž)’s, and all approximations are likely to be good simultaneously
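For comparison with the realizable case, a sketch (hypothetical helper names) of the sample sizes the two β€œfixed πœ€β€ rules ask for:

```python
import math

def m_agnostic(eps, delta, H_size):
    """m >= (2/eps^2) * log(2|H|/delta), from the agnostic bound above."""
    return math.ceil(2 * math.log(2 * H_size / delta) / eps ** 2)

def m_realizable(eps, delta, H_size):
    """m >= (1/eps) * log(|H|/delta), from the realizable bound."""
    return math.ceil(math.log(H_size / delta) / eps)

print(m_realizable(0.1, 0.01, 1000))  # β‰ˆ116
print(m_agnostic(0.1, 0.01, 1000))    # β‰ˆ2441: the 1/eps^2 dependence is much costlier
```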
Error Decomposition
πΏπ’Ÿ,𝑓 ERMβ„‹ 𝑆
≀ πΏπ’Ÿ,𝑓
(β„Žβˆ— ) +
1
2|β„‹|
2
log
2π‘š
𝛿
πœ–app
β€’ πœ–est = πΏπ’Ÿ,𝑓 ERMβ„‹ (𝑆) βˆ’ πœ–app
Upper bound on πœ–est
𝝐app                                       | 𝝐est
Not a random variable                      | Random variable
Doesn’t depend on algorithm or sample size | Depends on algorithm and sample size
Decreases as β„‹ grows                       | Increases as β„‹ grows (logarithmically)
Does not depend on π‘š                       | Decreases as π‘š grows (inverse-polynomially)
Captures our prior knowledge               | Captures estimation of true error by empirical error
Bias-Complexity Tradeoff
(holding π‘š fixed)
πœ–app + πœ–est
πœ–est small
πœ–app large
Underfitting
πœ–est large
πœ–app small
Overfitting
|β„‹|
Low complexity
Large inductive bias
High complexity
Small inductive bias
π‘š Vs. πœ€ = Upper Bound On πœ–est
[Plot: πœ€ (upper bound on πœ–est) vs. π‘š, for 𝛿 = 0.01; one curve for |β„‹| = 1000000 and one for |β„‹| = 1000]
What determines the curve? Is it really only |β„‹|?
Are All Infinite |β„‹| Equally Bad?
β€’ 𝒳 = [0,1]Β², β„‹ = 𝒴^𝒳 (all functions from 𝒳 to 𝒴)
β€’ 𝒳 = [0,1]Β², β„‹ = {β„Žπœ : 𝜏 ∈ [0,1]}, where β„Žπœ(π‘Ž, 𝑏) = 1 if π‘Ž β‰₯ 𝜏 and βˆ’1 if π‘Ž < 𝜏
β€’ 𝒳 = [0,1]Β², β„‹ = {β„Žπ›Ό,𝛽,𝜏 : 𝛼, 𝛽, 𝜏 ∈ ℝ}, where β„Žπ›Ό,𝛽,𝜏(π‘Ž, 𝑏) = 1 if 𝛼⋅π‘Ž + 𝛽⋅𝑏 β‰₯ 𝜏 and βˆ’1 otherwise (both the second and third classes are sketched in code below)
As we shall see, the answer is a resounding NO. To
understand why, we’d need a couple of definitions.
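The second and third classes are easy to write down explicitly; a short illustrative sketch (parameter names are mine):

```python
def h_threshold(tau):
    """h_tau(a, b) = 1 if a >= tau, else -1 (the second coordinate is ignored)."""
    return lambda a, b: 1 if a >= tau else -1

def h_halfspace(alpha, beta, tau):
    """h_{alpha,beta,tau}(a, b) = 1 if alpha*a + beta*b >= tau, else -1."""
    return lambda a, b: 1 if alpha * a + beta * b >= tau else -1

print(h_threshold(0.5)(0.7, 0.1))             # 1, since 0.7 >= 0.5
print(h_halfspace(1.0, -2.0, 0.0)(0.3, 0.4))  # -1, since 0.3 - 0.8 < 0
```

Both classes contain infinitely many hypotheses, yet, as the following slides show, they behave very differently from 𝒴^𝒳.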
The Growth Function of β„‹
β€’ Let 𝐢 βŠ† 𝒳
β€’ Define ℋ𝐢 to be the restriction of all functions β„Ž ∈ β„‹
to 𝐢:
ℋ𝐢 = {β„Ž|𝐢 : β„Ž ∈ β„‹}
β„Ž|𝐢 = restriction of β„Ž to sub-domain 𝐢
β€’ The growth function πœβ„‹(π‘š) = max{|ℋ𝐢| : 𝐢 βŠ† 𝒳, |𝐢| = π‘š}
β€’ Measures how big |ℋ𝐢| can be when |𝐢| = π‘š.
β€’ πœβ„‹(π‘š) is monotonically nondecreasing. (Why?)
β€’ πœβ„‹(π‘š) is distribution free (does not depend on π’Ÿπ’³).
β€’ As we shall see, 𝝉𝓗(π’Ž) measures the effective size of 𝓗 for the purpose of learning from π’Ž examples.
Example
π’™πŸ
π’™πŸ
π’™πŸ‘
π’™πŸ’
π’™πŸ“
π’™πŸ”
π’™πŸ•
π’™πŸ–
π’™πŸ—
β„Ž1
0
0
1
0
0
0
1
0
0
β„Ž2
0
1
0
0
0
1
0
0
0
β„Ž3
1
0
0
0
1
1
0
0
0
β„Ž4
0
0
0
1
1
0
0
0
1
β„Ž5
0
0
1
0
0
0
0
1
0
β„Ž6
0
1
0
0
0
0
1
0
0
β„Ž7
1
0
0
0
0
1
0
0
0
β„Ž8
0
0
0
0
0
0
0
0
0
πœβ„‹ 1 = 2 πœβ„‹ 2 = 4 πœβ„‹ 3 = ?
Uniform Convergence From Growth
Function
β€’ For every distribution π’Ÿ, function 𝑓: 𝒳 ↦ 𝒴, π‘š β‰₯ 0 and 𝛿 ∈ (0,1):
β€’ if 𝑆 ∼ π’Ÿ^π‘š then with probability β‰₯ 1 βˆ’ 𝛿
β€’ for all β„Ž ∈ β„‹:
|πΏπ’Ÿ(β„Ž) βˆ’ 𝐿𝑆(β„Ž)| ≀ (4 + √(log πœβ„‹(2π‘š))) / (𝛿 √(2π‘š))
β€’ Difficulty in proof: May still have infinitely many
hypotheses to β€œunion bound” against
β€’ Proof idea: If we take two independent samples of
size π‘š, call them 𝑆, 𝑆′, then |𝐿𝑆 β„Ž βˆ’ 𝐿𝑆′ β„Ž |
should look like an estimate for a hypothesis class
of size πœβ„‹ (2π‘š)…
Shattering and VC Dimension
β€’ A set 𝐢 βŠ† 𝒳 is said to be shattered by β„‹ if ℋ𝐢
consists exactly of all 2|𝐢| possible functions
β€’ The VC-dimension of β„‹, denoted VCdim(β„‹), is
the largest number 𝑑 such that there exists a
subset 𝐢 βŠ† 𝒳 of size 𝑑 that is shattered by β„‹
β€’ Note: VCdim(β„‹) can be ∞
β€’ Note: If β„‹ is finite, then VCdim(β„‹) ≀ logβ‚‚|β„‹|
Example
π’™πŸ
π’™πŸ
π’™πŸ‘
π’™πŸ’
π’™πŸ“
π’™πŸ”
π’™πŸ•
π’™πŸ–
π’™πŸ—
β„Ž1
0
0
1
0
0
0
1
0
0
β„Ž2
0
1
0
0
0
1
0
0
0
β„Ž3
1
0
0
0
1
1
0
0
0
β„Ž4
0
0
0
1
1
0
0
0
1
β„Ž5
0
0
1
0
0
0
0
1
0
β„Ž6
0
1
0
0
0
0
1
0
0
β„Ž7
1
0
0
0
0
1
0
0
0
β„Ž8
0
0
0
0
0
0
0
0
0
πœβ„‹ 1 = 2 πœβ„‹ 2 = 4 πœβ„‹ 3 < 8
Proof: Only β„Ž3 ,β„Ž4 have three 1’s
Therefore VCDim = 2
in their row. Therefore can only
maybe shatter 𝐢 = {π‘₯1 , π‘₯5 , π‘₯6 }
or 𝐢 = π‘₯4 , π‘₯5 , π‘₯9 . But these
two options ruled out by quick
inspection.
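The same brute-force idea verifies the claim; a self-contained sketch (the table is repeated, helper names are mine):

```python
from itertools import combinations

# Same table as above: rows = h1..h8, columns = x1..x9
H = ["001000100", "010001000", "100011000", "000110001",
     "001000010", "010000100", "100001000", "000000000"]

def shattered(H, C):
    """C is shattered iff all 2^|C| label patterns appear among the restrictions to C."""
    return len({tuple(h[i] for i in C) for h in H}) == 2 ** len(C)

def vcdim(H):
    """Largest d such that some d-point subset of the domain is shattered."""
    n, d = len(H[0]), 0
    for size in range(1, n + 1):
        if any(shattered(H, C) for C in combinations(range(n), size)):
            d = size          # some set of this size is shattered
        else:
            break             # no set of size k shattered => none of size k+1 either
    return d

print(vcdim(H))  # 2, matching the argument above
```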
Sauer-Shelah Lemma
β€’ If β„‹ has finite VC dimension 𝑑 then for all π‘š β‰₯ 1:
πœβ„‹(π‘š) ≀ Σ𝑖=0..𝑑 (π‘š choose 𝑖)
β€’ In particular, if π‘š > 𝑑 + 1 then
πœβ„‹(π‘š) ≀ (π‘’π‘š/𝑑)^𝑑
β€’ Note: This estimate is much smaller than 2^π’Ž
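A quick numeric comparison of the three quantities in the lemma (a sketch; π‘š = 100 and 𝑑 = 3 are arbitrary):

```python
import math

def sauer_bound(m, d):
    """Sum_{i=0}^{d} C(m, i): the Sauer-Shelah bound on tau_H(m)."""
    return sum(math.comb(m, i) for i in range(d + 1))

m, d = 100, 3
print(sauer_bound(m, d))        # 166751
print((math.e * m / d) ** d)    # β‰ˆ7.4e5: the simpler (em/d)^d form, slightly looser
print(2 ** m)                   # 2^100 β‰ˆ 1.27e30: exponentially larger than both
```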
Putting It All Together
β€’ Assume β„‹ has VC dimension 𝑑
β€’ For every distribution π’Ÿ, function 𝑓: 𝒳 ↦ 𝒴, π‘š > 𝑑 + 1 and 𝛿 ∈ (0,1):
β€’ if 𝑆 ∼ π’Ÿ^π‘š then with probability β‰₯ 1 βˆ’ 𝛿
β€’ for all β„Ž ∈ β„‹:
|πΏπ’Ÿ(β„Ž) βˆ’ 𝐿𝑆(β„Ž)| ≀ (4 + √(log πœβ„‹(2π‘š))) / (𝛿 √(2π‘š)) ≀ 𝑂(√(𝑑 β‹… log π‘š) / (𝛿 βˆšπ‘š))
β€’ Note: Working harder, one can get tighter bounds:
Agnostic: 𝑂(√((𝑑 + log(1/𝛿)) / π‘š))    Realizable: 𝑂((𝑑 log π‘š + log(1/𝛿)) / π‘š)
Does VC Dimension Also Tell Us
When We Can’t Learn?
β€’ What if a problem has infinite VCdim?
β€’ Perhaps we can still learn it?
β€’ In some sense… no
β€’ To understand in what sense, we’ll need another
definition…
(Realizable) PAC Learning
A hypothesis class β„‹ is PAC learnable if βˆƒπ‘šβ„‹ : (0,1)Β² β†’ β„• and an algorithm 𝐴 s.t.:
βˆ€ πœ€, 𝛿 ∈ (0,1), π’Ÿ, 𝑓: 𝒳 β†’ {0,1}, if β€œrealizable”, then for π‘š β‰₯ π‘šβ„‹(πœ€, 𝛿):
Prπ‘†βˆΌπ’Ÿ^π‘š [πΏπ’Ÿ,𝑓(𝐴(𝑆)) ≀ πœ€] β‰₯ 1 βˆ’ 𝛿
(β€œProbably”: with probability β‰₯ 1 βˆ’ 𝛿; β€œApproximately Correct”: error ≀ πœ€)
Agnostic PAC Learning
In particular: Uniform Convergence β‡’ Agnostic PAC Learning!
A hypothesis class β„‹ is agnostic PAC learnable if βˆƒπ‘šβ„‹ : (0,1)Β² β†’ β„• and an algorithm 𝐴 s.t.:
βˆ€ πœ€, 𝛿 ∈ (0,1), π’Ÿ, 𝑓: 𝒳 β†’ {0,1}, for π‘š β‰₯ π‘šβ„‹(πœ€, 𝛿):
Prπ‘†βˆΌπ’Ÿ^π‘š [πΏπ’Ÿ,𝑓(𝐴(𝑆)) ≀ πΏπ’Ÿ,𝑓(β„Žβˆ—) + πœ€] β‰₯ 1 βˆ’ 𝛿
(β€œProbably”: with probability β‰₯ 1 βˆ’ 𝛿; β€œApproximately Correct”: excess error ≀ πœ€)
Negative Results With VCdim
β€’ Theorem: If β„‹ has VCdim ∞ then β„‹ is not PAC learnable: For any algorithm 𝐴 and π‘š β‰₯ 0 there exist π’Ÿ, β„Ž ∈ β„‹ s.t. with probability β‰₯ 1/7 (over the choice of 𝑆 ∼ π’Ÿ^π‘š), πΏπ’Ÿ,β„Ž(𝐴(𝑆)) β‰₯ 1/8.
β€’ Moreover, if VCdim(β„‹) = 𝑑, then one can force the algorithm’s error to be at least Ξ©(𝑑/π‘š).
β€’ Proofs use statistical β€œno free lunch” theorem.