
On Bias, Variance, 0/1-Loss, and the Curse-of-Dimensionality (Part 2)
Weiqiang Dong
Function Estimate
• Input: $\boldsymbol{x}$
• Output: $y = f(\boldsymbol{x}) + \varepsilon$,
where $f(\boldsymbol{x})$ (the "target function") is a single-valued deterministic function of $\boldsymbol{x}$ and $\varepsilon$ is a random variable with $E(\varepsilon \mid \boldsymbol{x}) = 0$.
• The goal is to obtain an estimate $\hat{f}(\boldsymbol{x} \mid T)$ of $f(\boldsymbol{x})$ using a training data set $T$.
Estimation Error
• Mean squared error:
$$E_T[y - \hat{f}(\boldsymbol{x} \mid T)]^2 = [f(\boldsymbol{x}) - E_T \hat{f}(\boldsymbol{x} \mid T)]^2 + E_T[\hat{f}(\boldsymbol{x} \mid T) - E_T \hat{f}(\boldsymbol{x} \mid T)]^2 + E_\varepsilon[\varepsilon \mid \boldsymbol{x}]^2$$
1. Square of bias: the extent to which the average prediction over all data sets differs from the desired regression function.
2. Variance: the extent to which the solutions for individual data sets vary around their average (sensitivity to the particular choice of data set).
3. Irreducible prediction error.
Both the bias and the variance contributions can be estimated by simulation, as sketched below.
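A minimal Python sketch of that simulation (my own illustration, not from the slides): it assumes a toy target $f(x) = \sin(2\pi x)$, Gaussian noise, and a cubic polynomial fit as the estimator $\hat{f}(\cdot \mid T)$, then estimates the bias-squared and variance terms by refitting on many simulated training sets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions (not from the slides): a 1-D target and Gaussian noise.
f = lambda x: np.sin(2 * np.pi * x)   # target function f(x)
sigma = 0.3                           # std of eps, with E(eps | x) = 0
x0 = 0.25                             # query point at which the error is decomposed

def fit_predict(x_train, y_train, x_query, degree=3):
    """One estimate f_hat(x_query | T): a cubic least-squares polynomial fit."""
    coefs = np.polyfit(x_train, y_train, degree)
    return np.polyval(coefs, x_query)

n_train, n_reps = 30, 2000
preds = np.empty(n_reps)
for r in range(n_reps):
    x_tr = rng.uniform(0, 1, n_train)                # training set T
    y_tr = f(x_tr) + rng.normal(0, sigma, n_train)   # y = f(x) + eps
    preds[r] = fit_predict(x_tr, y_tr, x0)

bias_sq  = (f(x0) - preds.mean()) ** 2   # [f(x0) - E_T f_hat(x0|T)]^2
variance = preds.var()                   # E_T [f_hat(x0|T) - E_T f_hat(x0|T)]^2
noise    = sigma ** 2                    # irreducible prediction error
print(f"bias^2 = {bias_sq:.4f}, variance = {variance:.4f}, noise = {noise:.4f}")
print(f"total  = {bias_sq + variance + noise:.4f}  (approximately the expected squared error at x0)")
```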
[Figure. Red dot: true model; blue dot: individual estimation.]
It is desirable to have both low bias-squared and low variance, since both contribute to the squared estimation error in equal measure.
Classification
β€’ Input: 𝒙 = {π‘₯1, … , π‘₯𝑛}
β€’ Output: 𝑦 ∈ {0, 1}
β€’ Prediction 𝑦 ∈ {0, 1}
β€’ The goal is to choose 𝑦 𝒙 to minimize inaccuracy as characterized by
the misclassification β€œrisk”
6
• The misclassification risk (2.2) is minimized by the ("Bayes") rule, which achieves the lowest possible risk.
• Let $l_0 = l_1 = 1$; then the misclassification risk is minimized by the ("Bayes") rule
$$y_B(\boldsymbol{x}) = 1[f(\boldsymbol{x}) \ge 1/2].$$
Letting $g(\boldsymbol{x}) = f(\boldsymbol{x}) - 1/2$, this becomes
$$y_B(\boldsymbol{x}) = 1[g(\boldsymbol{x}) \ge 0],$$
and $g(\boldsymbol{x})$ can be considered as the decision boundary.
• The corresponding rule learned from the training data is
$$\hat{y}(\boldsymbol{x}) = 1[\hat{g}(\boldsymbol{x} \mid T) \ge 0],$$
where $\hat{g}(\boldsymbol{x} \mid T)$ is the estimate of the decision boundary $g(\boldsymbol{x})$. A small numeric check of the $1/2$ threshold is sketched below.
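The sketch below is my own toy example (it assumes a known $f(\boldsymbol{x}) = P(y = 1 \mid \boldsymbol{x})$, a logistic curve on $[0,1]$): among several thresholds on $f$, the rule $1[f(\boldsymbol{x}) \ge 1/2]$, i.e. $1[g(\boldsymbol{x}) \ge 0]$, attains the lowest empirical misclassification risk when $l_0 = l_1 = 1$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative assumption: a known f(x) = P(y = 1 | x) for x uniform on [0, 1].
f = lambda x: 1.0 / (1.0 + np.exp(-8.0 * (x - 0.4)))

x = rng.uniform(0, 1, 200_000)
y = (rng.uniform(size=x.size) < f(x)).astype(int)   # labels drawn with P(y = 1 | x) = f(x)

# Empirical misclassification risk of the rule 1[f(x) >= t] for several thresholds t.
for t in [0.3, 0.4, 0.5, 0.6, 0.7]:
    y_hat = (f(x) >= t).astype(int)
    print(f"threshold {t:.1f}: risk = {np.mean(y_hat != y):.4f}")
# With l0 = l1 = 1 the minimum is attained at t = 1/2, i.e. at g(x) = f(x) - 1/2 >= 0.
```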
Classification
• Input: $\boldsymbol{x}$
• Output: $y \in \{0, 1\}$; decision boundary $g(\boldsymbol{x})$, with $y(\boldsymbol{x}) = 1[g(\boldsymbol{x}) \ge 0]$
• Prediction: $\hat{y} \in \{0, 1\}$; estimate of the decision boundary $\hat{g}(\boldsymbol{x} \mid T)$, with $\hat{y}(\boldsymbol{x}) = 1[\hat{g}(\boldsymbol{x} \mid T) \ge 0]$
The training data set $T$ is used to learn a classification rule $\hat{y}(\boldsymbol{x} \mid T)$ for (future) prediction. The usual paradigm for accomplishing this is to use the training data $T$ to form an approximation (estimate) $\hat{g}(\boldsymbol{x} \mid T)$ to $g(\boldsymbol{x})$. Regular function estimation technology can be applied to obtain the estimate $\hat{g}(\boldsymbol{x} \mid T)$, as sketched below.
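A minimal sketch of this paradigm (my own toy data, not from the slides): fit $\hat{f}(\boldsymbol{x} \mid T)$ to the 0/1 labels by ordinary least squares, set $\hat{g}(\boldsymbol{x} \mid T) = \hat{f}(\boldsymbol{x} \mid T) - 1/2$, and classify with $\hat{y}(\boldsymbol{x}) = 1[\hat{g}(\boldsymbol{x} \mid T) \ge 0]$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy training set T (illustrative): two Gaussian classes in 2-D.
n = 200
x_tr = np.vstack([rng.normal(-1.0, 1.0, (n, 2)), rng.normal(+1.0, 1.0, (n, 2))])
y_tr = np.concatenate([np.zeros(n), np.ones(n)])

# "Regular function estimation technology": ordinary least squares on the 0/1 labels
# gives f_hat(x | T); the estimated decision boundary is g_hat(x | T) = f_hat(x | T) - 1/2.
X = np.column_stack([np.ones(len(x_tr)), x_tr])   # add an intercept column
beta, *_ = np.linalg.lstsq(X, y_tr, rcond=None)

def y_hat(x):
    """Classification rule y_hat(x) = 1[g_hat(x | T) >= 0]."""
    f_hat = np.column_stack([np.ones(len(x)), x]) @ beta
    return (f_hat - 0.5 >= 0).astype(int)

# Error on fresh data drawn from the same toy model.
x_te = np.vstack([rng.normal(-1.0, 1.0, (n, 2)), rng.normal(+1.0, 1.0, (n, 2))])
y_te = np.concatenate([np.zeros(n), np.ones(n)])
print("test misclassification rate:", np.mean(y_hat(x_te) != y_te))
```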
Estimation Error
$$E_T[y - \hat{f}(\boldsymbol{x} \mid T)]^2 = [f(\boldsymbol{x}) - E_T \hat{f}(\boldsymbol{x} \mid T)]^2 + E_T[\hat{f}(\boldsymbol{x} \mid T) - E_T \hat{f}(\boldsymbol{x} \mid T)]^2 + E_\varepsilon[\varepsilon \mid \boldsymbol{x}]^2$$
1. Square of bias
2. Variance
3. Irreducible prediction error
• It is desirable to have both low bias-squared and low variance, since both contribute to the squared estimation error in equal measure.
Classification Error
$$y_B(\boldsymbol{x}) = 1[g(\boldsymbol{x}) \ge 0], \qquad \hat{y}(\boldsymbol{x}) = 1[\hat{g}(\boldsymbol{x} \mid T) \ge 0]$$
• The mean estimate $E_T \hat{g}$ may be off from $g$ by a huge margin. It does not matter, as long as $1[E_T \hat{g}(\boldsymbol{x} \mid T) \ge 0] = 1[g(\boldsymbol{x}) \ge 0]$ and we cut down the variance.
Output: $y(\boldsymbol{x}) = 1[g(\boldsymbol{x}) \ge 0]$
Prediction: $\hat{y}(\boldsymbol{x}) = 1[\hat{g}(\boldsymbol{x} \mid T) \ge 0]$
[Figure. Red line: the true decision boundary $g(\boldsymbol{x})$; blue line: an estimate $\hat{g}(\boldsymbol{x} \mid T)$.]
The mean estimate $E_T \hat{g}$ may be off from $g$ by a huge margin. It does not matter, as long as $1[E_T \hat{g}(\boldsymbol{x} \mid T) \ge 0] = 1[g(\boldsymbol{x}) \ge 0]$ and we cut down the variance (illustrated in the sketch below).
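A small numeric illustration of this point (the numbers are my own, chosen for illustration): at a point where $g(\boldsymbol{x}) > 0$, a badly biased but low-variance estimator of $g$ almost never flips the sign, while an unbiased but high-variance estimator often does.

```python
import numpy as np

rng = np.random.default_rng(3)

g_true = 0.10        # true decision function value at some x: g(x) > 0, so y_B(x) = 1
n_reps = 100_000

# Estimator A: badly biased (mean 0.50 instead of 0.10) but low variance.
g_hat_A = rng.normal(loc=0.50, scale=0.05, size=n_reps)
# Estimator B: unbiased but high variance.
g_hat_B = rng.normal(loc=0.10, scale=0.30, size=n_reps)

for name, g_hat in [("A (biased, low variance)", g_hat_A), ("B (unbiased, high variance)", g_hat_B)]:
    mse = np.mean((g_hat - g_true) ** 2)
    wrong = np.mean((g_hat >= 0) != (g_true >= 0))   # P( 1[g_hat >= 0] != 1[g(x) >= 0] )
    print(f"{name}: MSE on g = {mse:.3f},  P(wrong classification) = {wrong:.3f}")
# A has the larger squared error on g, yet it essentially never flips the sign,
# so 1[E g_hat >= 0] = 1[g(x) >= 0]; what matters for classification is the variance.
```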
The aggregation approach often decreases the variance and slightly increases the bias.
The aggregation approach often improves classification accuracy, since it reduces the variance. Example: a single decision tree vs. bagging of trees.
𝑓(π‘₯)
𝑓 π‘₯ + π‘›π‘œπ‘–π‘ π‘’
𝑓 𝒙 𝑇)
𝐸𝑓
Tree: 0.0255 (error) = 0.0003 (bias^2) + 0.0152 (var) + 0.0098 (noise)
Bagging(Tree): 0.0196 (error) = 0.0004 (bias^2) + 0.0092 (var) + 0.0098 (noise) 22
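The sketch below is not the experiment behind those numbers; it only shows how a decomposition of this kind can be computed. It assumes a toy 1-D target with Gaussian noise and uses scikit-learn's DecisionTreeRegressor and BaggingRegressor (assumed available), so the printed values will differ from the slide's.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

rng = np.random.default_rng(4)

# Illustrative setup: 1-D target plus Gaussian noise, decomposed at one query point.
f = lambda x: np.sin(3 * x)
sigma = 0.1
x0 = np.array([[0.5]])
n_train, n_reps = 100, 300

def decompose(make_model):
    preds = np.empty(n_reps)
    for r in range(n_reps):
        X = rng.uniform(0, 2, (n_train, 1))                # training set T
        y = f(X[:, 0]) + rng.normal(0, sigma, n_train)     # y = f(x) + eps
        preds[r] = make_model().fit(X, y).predict(x0)[0]   # f_hat(x0 | T)
    bias_sq = (f(x0[0, 0]) - preds.mean()) ** 2
    var = preds.var()
    noise = sigma ** 2
    return bias_sq, var, noise, bias_sq + var + noise

for name, make_model in [
    ("Tree", lambda: DecisionTreeRegressor()),
    ("Bagging(Tree)", lambda: BaggingRegressor(DecisionTreeRegressor(), n_estimators=50)),
]:
    b2, v, nz, err = decompose(make_model)
    print(f"{name}: {err:.4f} (error) = {b2:.4f} (bias^2) + {v:.4f} (var) + {nz:.4f} (noise)")
```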
• Much research in classification has been devoted to achieving more accurate probability estimates, under the presumption that this will generally lead to more accurate predictions. This need not always be the case, as the small example below illustrates.
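A tiny numeric example of this last point (the numbers are my own): estimator A approximates the true probabilities $f(\boldsymbol{x})$ much more accurately in squared error, yet estimator B, which is far less accurate, reproduces the Bayes classification at every point while A does not.

```python
import numpy as np

# Toy example (my own numbers): true class-1 probabilities at five query points.
f_true  = np.array([0.45, 0.55, 0.48, 0.52, 0.60])
y_bayes = (f_true >= 0.5).astype(int)           # Bayes predictions 1[f(x) >= 1/2]

# Estimator A: small squared error, but on the wrong side of 1/2 at several points.
f_hat_A = np.array([0.52, 0.47, 0.53, 0.46, 0.58])
# Estimator B: larger squared error, but always on the Bayes side of 1/2.
f_hat_B = np.array([0.20, 0.90, 0.15, 0.85, 0.95])

for name, f_hat in [("A", f_hat_A), ("B", f_hat_B)]:
    mse = np.mean((f_hat - f_true) ** 2)
    disagree = np.mean((f_hat >= 0.5).astype(int) != y_bayes)
    print(f"{name}: MSE vs f = {mse:.4f},  disagreement with Bayes rule = {disagree:.1f}")
# A estimates the probabilities more accurately, yet B reproduces the Bayes
# (lowest-risk) classification everywhere while A does not.
```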