Lecture 2

Introduction to
Machine Learning
236756
Prof. Nir Ailon
Lecture 2:
Uniform Convergence
Cardinality as Complexity Control
Approximation-Estimation Tradeoff
Growth Function as Measure of Complexity
The Statistical Model
β€’ Instance space 𝒳 (aka Domain Set)
β€’ Ex. 𝒳 = Set of photos of faces
Known
β€’ Label space 𝒴 (aka Label set)
β€’ Ex. 𝒴 = {β€œglasses”, β€œno glasses”}
Known
β€’ Distribution π’Ÿ over 𝒳 × π’΄
β€’ π‘₯, 𝑦 ∼ π’Ÿ
Unknown
β€’ Pr π‘₯ = π‘Ž, 𝑦 = 𝑏 =
β€œProbability of seeing instance a with label b”
Unknown
β€’ For simplicity, we think of π‘₯ ∼ π’Ÿπ‘₯ and 𝑦 = 𝑓(π‘₯)
In most cases, you will be OK!
β€’ Ex. 𝑓([face photo without glasses]) = β€œno glasses”, and Pr[π‘₯ = [that photo], 𝑦 = β€œglasses”] = 0
(Possibly) Unknown
β€œDistribution-Free” setting
The Statistical Model
[Figure: 𝒳 (face photos), 𝑓: 𝒳 ↦ 𝒴, 𝒴 = {β€œglasses”, β€œno glasses”}]
β€’ Training set 𝑆 ∼ π’Ÿ π‘š
β€’ Output of algorithm 𝐴(𝑆) is a prediction rule β„Ž: 𝒳 ↦ 𝒴
β€’ Error = πΏπ’Ÿ,𝑓(β„Ž) = Prπ‘₯βˆΌπ’Ÿπ‘₯ [β„Ž(π‘₯) β‰  𝑓(π‘₯)]
(Aka: Generalization error, Risk, True error, True loss. Whatever you name it… remember it is unknown. WHY??)
β€’ Training error 𝐿𝑆(β„Ž) = |{𝑖 ∈ [π‘š] : β„Ž(π‘₯𝑖) β‰  𝑦𝑖}| / π‘š
(Aka: Empirical Risk, Empirical Error. Known. What is 𝐿𝑆(β„Ž) in the example?)
Empirical Risk Minimization
πΈπ‘…π‘€β„‹(𝑆) = argminβ„Žβˆˆβ„‹ 𝐿𝑆(β„Ž)
To be precise: πΈπ‘…π‘€β„‹(𝑆) ∈ argminβ„Žβˆˆβ„‹ 𝐿𝑆(β„Ž) (the minimizer need not be unique)
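Below is a minimal, illustrative sketch of the ERM rule for a finite hypothesis class (the names `empirical_risk`, `erm` and the toy threshold class are mine, not from the lecture):

```python
import random

def empirical_risk(h, S):
    """L_S(h): fraction of training examples that h labels incorrectly."""
    return sum(1 for (x, y) in S if h(x) != y) / len(S)

def erm(hypotheses, S):
    """ERM_H(S): return a hypothesis minimizing the empirical risk (ties broken arbitrarily)."""
    return min(hypotheses, key=lambda h: empirical_risk(h, S))

# Toy finite class: thresholds on [0,1]; labels generated by the threshold at 0.3
H = [lambda x, t=t: 1 if x >= t else -1 for t in [i / 10 for i in range(11)]]
f = lambda x: 1 if x >= 0.3 else -1
S = [(x, f(x)) for x in (random.random() for _ in range(50))]

h_hat = erm(H, S)
print(empirical_risk(h_hat, S))  # 0.0 here, since the true f is in H (realizable case)
```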
Realizable Case (Finite β„‹)
β€’ πΏπ’Ÿ,𝑓 (β„Žβˆ— ) = 0 for some β„Žβˆ— ∈ β„‹ then 𝐿𝑆 (β„Žβˆ— ) = 0 a.s.
β€’ If πΏπ’Ÿ,𝑓 β„Ž > 0 then 𝐿𝑠 β„Ž = 0 with prob
π‘š
1 βˆ’ πΏπ’Ÿ,𝑓 (β„Ž) ≀ 𝑒 βˆ’π‘šβ‹…πΏπ’Ÿ,𝑓 (β„Ž)
Almost Surely
(with prob. 1)
β€’ Does this mean that if 𝐿𝑆 β„Ž = 0 then h is correct?
β€’ No…
β€’ If πΏπ’Ÿ,𝑓 β„Ž β‰₯ πœ€ then 𝐿𝑠 β„Ž = 0 with pos. prob ≀ 𝑒 βˆ’π‘šπœ€
β€’ By the union bound, with prob. β‰₯ 1 βˆ’ |β„‹|𝑒^(βˆ’π‘šπœ€), all β„Ž ∈ β„‹ such that πΏπ’Ÿ,𝑓(β„Ž) β‰₯ πœ€ satisfy 𝐿𝑆(β„Ž) > 0
οƒžβˆ€π›Ώ, with probability β‰₯ 1 βˆ’ 𝛿, if 𝐿𝑆 β„Ž = 0 for some
β„Ž ∈ β„‹ then all we can say is:
1
|β„‹|
πΏπ’Ÿ,𝑓 β„Ž < log
π‘š
𝛿
Realizable Case
β€’ For all 𝛿, π‘š > 0
with probability β‰₯ 1 βˆ’ 𝛿
βˆ€β„Ž ∈ β„‹ s.t. πΏπ’Ÿ,𝑓 β„Ž >
𝐿𝑆 β„Ž > 0
1
|β„‹|
log
π‘š
𝛿
πΏπ’Ÿ,𝑓 ERMβ„‹ 𝑆
≀
1
πΏπ’Ÿ,𝑓 β„Žβˆ— + π‘š log
β€’ Proof:
β€’ βˆ€β„Ž ∈ β„‹: Pr 𝐿𝑆 β„Ž = 0 ≀ 𝑒 βˆ’πΏπ’Ÿ,𝑓(β„Ž) π‘š
β€’ Union bound on at most |β„‹| hypotheses
|β„‹|
𝛿
If you want the bound to be a fixed πœ€, set π‘š = (1/πœ€) log(|β„‹|/𝛿) (see the code sketch below)
β€’ Notice the order of quantifiers:
With (high) probability… βˆ€β„Ž ∈ β„‹: 𝐿𝑆(β„Ž) β€œgood”
With (low) probability: βˆƒβ„Ž ∈ β„‹: 𝐿𝑆(β„Ž) β€œbad”
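As a concrete illustration of the sample-size rule above, a small helper (hypothetical name `m_realizable`) that computes π‘š = (1/πœ€) log(|β„‹|/𝛿):

```python
import math

def m_realizable(eps, delta, H_size):
    """Smallest integer m with (1/m)*log(|H|/delta) <= eps, i.e. m >= (1/eps)*log(|H|/delta)."""
    return math.ceil(math.log(H_size / delta) / eps)

print(m_realizable(0.1, 0.01, 1000))   # β‰ˆ116 examples
print(m_realizable(0.1, 0.01, 10**6))  # β‰ˆ185 examples: |H| enters only logarithmically
```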
The Non-Realizable Case
β€’ What do we do if 𝐿𝑆 β„Ž > 0 for all β„Ž ∈ β„‹ ?
β€’ Fix β„Ž ∈ β„‹
β€’ Fix number of examples π‘š
β€’ 𝑆 ∼ π’Ÿπ‘š
β€’ How is 𝐿𝑆 (β„Ž) distributed?
β€’ When the 𝑖’th example π‘₯𝑖 is drawn: Pr[β„Ž(π‘₯𝑖) β‰  𝑓(π‘₯𝑖)] = πΏπ’Ÿ,𝑓(β„Ž)
β‡’ π‘š β‹… 𝐿𝑆(β„Ž) ∼ Bin(π‘š, πΏπ’Ÿ,𝑓(β„Ž))
What Do We Know About the
Binomial Distribution?
β€’ Quite a lot… Assume 𝑍 ∼ Bin(π‘š, 𝑝)
β€’ 𝐸[𝑍] = π‘šπ‘
β€’ Var(𝑍) = 𝐸[𝑍²] βˆ’ (𝐸[𝑍])Β² = π‘šπ‘(1 βˆ’ 𝑝)
β€’ By the Central Limit Theorem, as π‘š β†’ ∞, 𝑍 (suitably normalized) tends to a Gaussian. But at what rate?
β€’ Good enough approximation (for now):
Hoeffding Bound
Hoeffding Bound
β€’ Assume 𝑍 = π‘š
𝑖=1 𝑍𝑖
β€’ 𝑍𝑖 ’s independent
β€’ For all 𝑖 ∈ π‘š , 𝑍𝑖 ≀ 1
πœ‡=𝐸 𝑍 =
𝐸[𝑍𝑖 ]
(Linearity of expectation)
β€’ Then:
2 /π‘š
βˆ’2𝑑
Pr 𝑍 βˆ’ πœ‡ > 𝑑 ≀ 2𝑒
β€’ If 𝑍𝑖 ∼ Ber 𝑝 for all 𝑖 ∈ [π‘š], then 𝑍 ∼ Bin(π‘š, 𝑝)
2 /π‘š
βˆ’2𝑑
Pr 𝑍 βˆ’ π‘šπ‘ > 𝑑 ≀ 2𝑒
Conclusion (Finite β„‹)
β€’ For all β„Ž, 𝑑:
Pr[|π‘šπΏπ‘†(β„Ž) βˆ’ π‘šπΏπ’Ÿ,𝑓(β„Ž)| > 𝑑] ≀ 2𝑒^(βˆ’2𝑑²/π‘š)
β€’ Variable change πœ€ = 𝑑/π‘š:
Pr[|𝐿𝑆(β„Ž) βˆ’ πΏπ’Ÿ,𝑓(β„Ž)| > πœ€] ≀ 2𝑒^(βˆ’2π‘šπœ€Β²)
Ex. With probability β‰₯ 0.99, |𝐿𝑆(β„Ž) βˆ’ πΏπ’Ÿ,𝑓(β„Ž)| ≀ 0.1 with π‘š β‰ˆ 1000. To get probability β‰₯ 0.9999 we need π‘š β‰ˆ 2400.
β€’ By the union bound, with prob. β‰₯ 1 βˆ’ 2|β„‹|𝑒^(βˆ’2π‘šπœ€Β²):
β€’ βˆ€β„Ž ∈ β„‹: |𝐿𝑆(β„Ž) βˆ’ πΏπ’Ÿ,𝑓(β„Ž)| ≀ πœ€
Uniform Concentration
β‡’ if β„Ž = argminβ„Žβˆˆβ„‹ 𝐿𝑆(β„Ž) (the ERM rule), then πΏπ’Ÿ,𝑓(β„Ž) ≀ πΏπ’Ÿ,𝑓(β„Žβˆ—) + 2πœ€
β‡’ βˆ€π›Ώ: w.p. β‰₯ 1 βˆ’ 𝛿, πΏπ’Ÿ,𝑓(β„Ž) ≀ πΏπ’Ÿ,𝑓(β„Žβˆ—) + 2√((1/2π‘š) log(2|β„‹|/𝛿))
Why?
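To see the numbers, a tiny helper (hypothetical name `uc_gap`) that computes the πœ€ at which 2|β„‹|𝑒^(βˆ’2π‘šπœ€Β²) = 𝛿, and the resulting ERM guarantee:

```python
import math

def uc_gap(m, H_size, delta):
    """eps with 2|H|*exp(-2*m*eps^2) = delta, i.e. eps = sqrt(log(2|H|/delta) / (2m)).
    ERM's excess risk over h* is then at most 2*eps."""
    return math.sqrt(math.log(2 * H_size / delta) / (2 * m))

eps = uc_gap(m=1000, H_size=1000, delta=0.01)
print(eps, 2 * eps)  # β‰ˆ0.078 per-hypothesis gap, β‰ˆ0.156 excess risk for ERM
```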
Uniform Convergence
Agnostic (Non-Realizable) Case
β€’ For all 𝛿, π‘š > 0, with probability β‰₯ 1 βˆ’ 𝛿:
βˆ€β„Ž ∈ β„‹: |𝐿𝑆(β„Ž) βˆ’ πΏπ’Ÿ,𝑓(β„Ž)| ≀ √((1/2π‘š) log(2|β„‹|/𝛿))
β‡’ πΏπ’Ÿ,𝑓(ERMβ„‹(𝑆)) ≀ πΏπ’Ÿ,𝑓(β„Žβˆ—) + 2√((1/2π‘š) log(2|β„‹|/𝛿))
If you want the bound to be a fixed πœ€, set π‘š = (2/πœ€Β²) log(2|β„‹|/𝛿) (see the code sketch below)
β€’ Proof:
β€’ Hoeffding: βˆ€β„Ž ∈ β„‹: Pr[|𝐿𝑆(β„Ž) βˆ’ πΏπ’Ÿ,𝑓(β„Ž)| β‰₯ πœ€] ≀ 2𝑒^(βˆ’2πœ€Β²π‘š)
β€’ Union bound over at most |β„‹| hypotheses
β€’ Notice the order of quantifiers:
With (high) probability….
βˆ€β„Ž ∈ β„‹: 𝐿𝑆 (β„Ž) β€œgood approx”
With (low) probability
βˆƒβ„Ž ∈ β„‹: 𝐿𝑆 (β„Ž) β€œbad approx”
β€’ This is a uniform convergence argument: the 𝐿𝑆(β„Ž)’s converge to the πΏπ’Ÿ,𝑓(β„Ž)’s, and all approximations are likely to be good simultaneously
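For comparison with the realizable case, a sketch (hypothetical helper names) of the sample sizes the two β€œfixed πœ€β€ rules ask for:

```python
import math

def m_agnostic(eps, delta, H_size):
    """m >= (2/eps^2) * log(2|H|/delta), from the agnostic bound above."""
    return math.ceil(2 * math.log(2 * H_size / delta) / eps ** 2)

def m_realizable(eps, delta, H_size):
    """m >= (1/eps) * log(|H|/delta), from the realizable bound."""
    return math.ceil(math.log(H_size / delta) / eps)

print(m_realizable(0.1, 0.01, 1000))  # β‰ˆ116
print(m_agnostic(0.1, 0.01, 1000))    # β‰ˆ2441: the 1/eps^2 dependence is much costlier
```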
Error Decomposition
πΏπ’Ÿ,𝑓 ERMβ„‹ 𝑆
≀ πΏπ’Ÿ,𝑓
(β„Žβˆ— ) +
1
2|β„‹|
2
log
2π‘š
𝛿
πœ–app
β€’ πœ–est = πΏπ’Ÿ,𝑓 ERMβ„‹ (𝑆) βˆ’ πœ–app
Upper bound on πœ–est
𝝐app                                       | 𝝐est
Not a random variable                      | Random variable
Doesn’t depend on algorithm or sample size | Depends on algorithm and sample size
Decreases as β„‹ grows                       | Increases as β„‹ grows (logarithmically)
Does not depend on π‘š                       | Decreases as π‘š grows (inverse-polynomially)
Captures our prior knowledge               | Captures estimation of true error by empirical error
Bias-Complexity Tradeoff
(holding π‘š fixed)
πœ–app + πœ–est
πœ–est small
πœ–app large
Underfitting
πœ–est large
πœ–app small
Overfitting
|β„‹|
Low complexity
Large inductive bias
High complexity
Small inductive bias
π‘š Vs. πœ€ = Upper Bound On πœ–est
[Plot: πœ€ (upper bound on πœ–est) vs. π‘š, for 𝛿 = 0.01; one curve for |β„‹| = 1000000 and one for |β„‹| = 1000]
What determines the curve? Is it really only |β„‹|?
Are All Infinite |β„‹| Equally Bad?
β€’ 𝒳 = [0,1]Β², β„‹ = 𝒴^𝒳 (all functions from 𝒳 to 𝒴)
β€’ 𝒳 = [0,1]Β², β„‹ = {β„Žπœ : 𝜏 ∈ [0,1]}, where β„Žπœ(π‘Ž, 𝑏) = 1 if π‘Ž β‰₯ 𝜏 and βˆ’1 if π‘Ž < 𝜏
β€’ 𝒳 = [0,1]Β², β„‹ = {β„Žπ›Ό,𝛽,𝜏 : 𝛼, 𝛽, 𝜏 ∈ ℝ}, where β„Žπ›Ό,𝛽,𝜏(π‘Ž, 𝑏) = 1 if 𝛼⋅π‘Ž + 𝛽⋅𝑏 β‰₯ 𝜏 and βˆ’1 otherwise (both the second and third classes are sketched in code below)
As we shall see, the answer is a resounding NO. To
understand why, we’d need a couple of definitions.
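The second and third classes are easy to write down explicitly; a short illustrative sketch (parameter names are mine):

```python
def h_threshold(tau):
    """h_tau(a, b) = 1 if a >= tau, else -1 (the second coordinate is ignored)."""
    return lambda a, b: 1 if a >= tau else -1

def h_halfspace(alpha, beta, tau):
    """h_{alpha,beta,tau}(a, b) = 1 if alpha*a + beta*b >= tau, else -1."""
    return lambda a, b: 1 if alpha * a + beta * b >= tau else -1

print(h_threshold(0.5)(0.7, 0.1))             # 1, since 0.7 >= 0.5
print(h_halfspace(1.0, -2.0, 0.0)(0.3, 0.4))  # -1, since 0.3 - 0.8 < 0
```

Both classes contain infinitely many hypotheses, yet, as the following slides show, they behave very differently from 𝒴^𝒳.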
The Growth Function of β„‹
β€’ Let 𝐢 βŠ† 𝒳
β€’ Define ℋ𝐢 to be the restriction of all functions β„Ž ∈ β„‹
to 𝐢:
ℋ𝐢 = {β„Ž|𝐢 : β„Ž ∈ β„‹}
β„Ž|𝐢 = restriction of β„Ž to sub-domain 𝐢
β€’ The growth function πœβ„‹(π‘š) = max{|ℋ𝐢| : 𝐢 βŠ† 𝒳, |𝐢| = π‘š}
β€’ Measures how big |ℋ𝐢| can be when |𝐢| = π‘š.
β€’ πœβ„‹(π‘š) is monotonically nondecreasing. (Why?)
β€’ πœβ„‹(π‘š) is distribution free (does not depend on π’Ÿπ’³).
β€’ As we shall see, 𝝉𝓗(π’Ž) measures the effective size of 𝓗 for the purpose of learning from π’Ž examples.
Example
π’™πŸ
π’™πŸ
π’™πŸ‘
π’™πŸ’
π’™πŸ“
π’™πŸ”
π’™πŸ•
π’™πŸ–
π’™πŸ—
β„Ž1
0
0
1
0
0
0
1
0
0
β„Ž2
0
1
0
0
0
1
0
0
0
β„Ž3
1
0
0
0
1
1
0
0
0
β„Ž4
0
0
0
1
1
0
0
0
1
β„Ž5
0
0
1
0
0
0
0
1
0
β„Ž6
0
1
0
0
0
0
1
0
0
β„Ž7
1
0
0
0
0
1
0
0
0
β„Ž8
0
0
0
0
0
0
0
0
0
πœβ„‹ 1 = 2 πœβ„‹ 2 = 4 πœβ„‹ 3 = ?
Uniform Convergence From Growth
Function
β€’ For every distribution π’Ÿ, function 𝑓: 𝒳 ↦ 𝒴, π‘š β‰₯ 0 and 𝛿 ∈ (0,1):
β€’ if 𝑆 ∼ π’Ÿ^π‘š then with probability β‰₯ 1 βˆ’ 𝛿
β€’ for all β„Ž ∈ β„‹:
|πΏπ’Ÿ(β„Ž) βˆ’ 𝐿𝑆(β„Ž)| ≀ (4 + √(log πœβ„‹(2π‘š))) / (𝛿 √(2π‘š))
β€’ Difficulty in proof: May still have infinitely many
hypotheses to β€œunion bound” against
β€’ Proof idea: If we take two independent samples of
size π‘š, call them 𝑆, 𝑆′, then |𝐿𝑆 β„Ž βˆ’ 𝐿𝑆′ β„Ž |
should look like an estimate for a hypothesis class
of size πœβ„‹ (2π‘š)…
Shattering and VC Dimension
β€’ A set 𝐢 βŠ† 𝒳 is said to be shattered by β„‹ if ℋ𝐢
consists exactly of all 2|𝐢| possible functions
β€’ The VC-dimension of β„‹, denoted VCdim(β„‹), is
the largest number 𝑑 such that there exists a
subset 𝐢 βŠ† 𝒳 of size 𝑑 that is shattered by β„‹
β€’ Note: VCdim(β„‹) can be ∞
β€’ Note: If β„‹ is finite, then VCdim(β„‹) ≀ logβ‚‚|β„‹|
Example
π’™πŸ
π’™πŸ
π’™πŸ‘
π’™πŸ’
π’™πŸ“
π’™πŸ”
π’™πŸ•
π’™πŸ–
π’™πŸ—
β„Ž1
0
0
1
0
0
0
1
0
0
β„Ž2
0
1
0
0
0
1
0
0
0
β„Ž3
1
0
0
0
1
1
0
0
0
β„Ž4
0
0
0
1
1
0
0
0
1
β„Ž5
0
0
1
0
0
0
0
1
0
β„Ž6
0
1
0
0
0
0
1
0
0
β„Ž7
1
0
0
0
0
1
0
0
0
β„Ž8
0
0
0
0
0
0
0
0
0
πœβ„‹ 1 = 2 πœβ„‹ 2 = 4 πœβ„‹ 3 < 8
Proof: Only β„Ž3 ,β„Ž4 have three 1’s
Therefore VCDim = 2
in their row. Therefore can only
maybe shatter 𝐢 = {π‘₯1 , π‘₯5 , π‘₯6 }
or 𝐢 = π‘₯4 , π‘₯5 , π‘₯9 . But these
two options ruled out by quick
inspection.
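The same brute-force idea verifies the claim; a self-contained sketch (the table is repeated, helper names are mine):

```python
from itertools import combinations

# Same table as above: rows = h1..h8, columns = x1..x9
H = ["001000100", "010001000", "100011000", "000110001",
     "001000010", "010000100", "100001000", "000000000"]

def shattered(H, C):
    """C is shattered iff all 2^|C| label patterns appear among the restrictions to C."""
    return len({tuple(h[i] for i in C) for h in H}) == 2 ** len(C)

def vcdim(H):
    """Largest d such that some d-point subset of the domain is shattered."""
    n, d = len(H[0]), 0
    for size in range(1, n + 1):
        if any(shattered(H, C) for C in combinations(range(n), size)):
            d = size          # some set of this size is shattered
        else:
            break             # no set of size k shattered => none of size k+1 either
    return d

print(vcdim(H))  # 2, matching the argument above
```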
Sauer-Shelah Lemma
β€’ If β„‹ has finite VC dimension 𝑑 then for all π‘š β‰₯ 1:
πœβ„‹(π‘š) ≀ Σ𝑖=0..𝑑 (π‘š choose 𝑖)
β€’ In particular, if π‘š > 𝑑 + 1 then
πœβ„‹(π‘š) ≀ (π‘’π‘š/𝑑)^𝑑
β€’ Note: This estimate is much smaller than 2^π’Ž
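A quick numeric comparison of the three quantities in the lemma (a sketch; π‘š = 100 and 𝑑 = 3 are arbitrary):

```python
import math

def sauer_bound(m, d):
    """Sum_{i=0}^{d} C(m, i): the Sauer-Shelah bound on tau_H(m)."""
    return sum(math.comb(m, i) for i in range(d + 1))

m, d = 100, 3
print(sauer_bound(m, d))        # 166751
print((math.e * m / d) ** d)    # β‰ˆ7.4e5: the simpler (em/d)^d form, slightly looser
print(2 ** m)                   # 2^100 β‰ˆ 1.27e30: exponentially larger than both
```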
Putting It All Together
β€’ Assume β„‹ has VC dimension 𝑑
β€’ For every distribution π’Ÿ, function 𝑓: 𝒳 ↦ 𝒴, π‘š > 𝑑 + 1 and 𝛿 ∈ (0,1):
β€’ if 𝑆 ∼ π’Ÿ^π‘š then with probability β‰₯ 1 βˆ’ 𝛿
β€’ for all β„Ž ∈ β„‹:
|πΏπ’Ÿ(β„Ž) βˆ’ 𝐿𝑆(β„Ž)| ≀ (4 + √(log πœβ„‹(2π‘š))) / (𝛿 √(2π‘š)) ≀ 𝑂(√(𝑑 β‹… log π‘š) / (𝛿 βˆšπ‘š))
β€’ Note: Working harder, one can get tighter bounds:
Agnostic: 𝑂(√((𝑑 + log(1/𝛿)) / π‘š))    Realizable: 𝑂((𝑑 log π‘š + log(1/𝛿)) / π‘š)
Does VC Dimension Also Tell Us
When We Can’t Learn?
β€’ What if a problem has infinite VCdim?
β€’ Perhaps we can still learn it?
β€’ In some sense… no
β€’ To understand in what sense, we’ll need another
definition…
(Realizable) PAC Learning
A hypothesis class β„‹ is PAC learnable if βˆƒπ‘šβ„‹ : (0,1)Β² β†’ β„• and an algorithm 𝐴 s.t.:
βˆ€ πœ€, 𝛿 ∈ (0,1), π’Ÿ, 𝑓: 𝒳 β†’ {0,1}, if β€œrealizable”, then for π‘š β‰₯ π‘šβ„‹(πœ€, 𝛿):
Prπ‘†βˆΌπ’Ÿ^π‘š [πΏπ’Ÿ,𝑓(𝐴(𝑆)) ≀ πœ€] β‰₯ 1 βˆ’ 𝛿
(β€œProbably”: with probability β‰₯ 1 βˆ’ 𝛿; β€œApproximately Correct”: error ≀ πœ€)
Agnostic PAC Learning
In particular: Uniform Convergence β‡’ Agnostic PAC Learning!
A hypothesis class β„‹ is agnostic PAC learnable if βˆƒπ‘šβ„‹ : (0,1)Β² β†’ β„• and an algorithm 𝐴 s.t.:
βˆ€ πœ€, 𝛿 ∈ (0,1), π’Ÿ, 𝑓: 𝒳 β†’ {0,1}, for π‘š β‰₯ π‘šβ„‹(πœ€, 𝛿):
Prπ‘†βˆΌπ’Ÿ^π‘š [πΏπ’Ÿ,𝑓(𝐴(𝑆)) ≀ πΏπ’Ÿ,𝑓(β„Žβˆ—) + πœ€] β‰₯ 1 βˆ’ 𝛿
(β€œProbably”: with probability β‰₯ 1 βˆ’ 𝛿; β€œApproximately Correct”: excess error ≀ πœ€)
Negative Results With VCdim
β€’ Theorem: If β„‹ has VCdim ∞ then β„‹ is not PAC learnable: For any algorithm 𝐴 and π‘š β‰₯ 0 there exist π’Ÿ, β„Ž ∈ β„‹ s.t. with probability β‰₯ 1/7 (over the choice of 𝑆 ∼ π’Ÿ^π‘š), πΏπ’Ÿ,β„Ž(𝐴(𝑆)) β‰₯ 1/8.
β€’ Moreover, if VCdim(β„‹) = 𝑑, then one can force the algorithm’s error to be at least Ξ©(𝑑/π‘š).
β€’ Proofs use statistical β€œno free lunch” theorem.