Contributions to Decision Tree Induction

Overfitting, Bias/Variance tradeoff
Content of the presentation
• Bias and variance definitions
• Parameters that influence bias and variance
• Bias and variance reduction techniques
Content of the presentation
• Bias and variance definitions:
– A simple regression problem with no input
– Generalization to full regression problems
– A short discussion about classification
• Parameters that influence bias and variance
• Bias and variance reduction techniques
Regression problem - no input
• Goal: predict as well as possible the height of a
Belgian male adult
• More precisely:
– Choose an error measure, for example the squared error.
– Find an estimation ŷ such that the expectation E_y{(y − ŷ)²} over the whole population of Belgian male adults is minimized.
[Figure: distribution of heights over the population, with 180 cm marked]
Regression problem - no input
• The estimation that minimizes this error can be computed by setting its derivative with respect to ŷ to zero (the derivation is written out below).
• So, the estimation which minimizes the error is E_y{y}. It is called the Bayes model.
• But in practice, we cannot compute the exact value of E_y{y} (this would require measuring the height of every Belgian male adult).
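The usual least-squares argument behind this result, written out in LaTeX:

% Minimizing the expected squared error over constant predictions
\[
\frac{\partial}{\partial \hat{y}}\,\mathbb{E}_{y}\{(y-\hat{y})^{2}\}
  = -2\,\mathbb{E}_{y}\{y-\hat{y}\}
  = -2\,(\mathbb{E}_{y}\{y\}-\hat{y}) = 0
  \quad\Longrightarrow\quad \hat{y} = \mathbb{E}_{y}\{y\}.
\]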
Learning algorithm
• As p(y) is unknown, find an estimation ŷ from a sample of individuals, LS = {y1, y2, …, yN}, drawn from the Belgian male adult population.
• Examples of learning algorithms (a minimal sketch follows below):
– the sample mean ŷ1 = (1/N) Σi yi,
– an estimate shrunk toward 180, controlled by a parameter λ (if we know that the height is close to 180).
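As a minimal Python sketch (the function names, the prior value 180, and the strength lam are illustrative assumptions, not taken from the slides), the two estimators could look like this:

import numpy as np

def estimate_mean(ls):
    # y_hat_1: the sample mean, an unbiased estimate of E_y{y}
    return np.mean(ls)

def estimate_shrunk(ls, prior=180.0, lam=10.0):
    # y_hat_2: a hypothetical estimate shrunk toward the prior guess of 180 cm;
    # lam controls how strongly the prior is trusted
    return (np.sum(ls) + lam * prior) / (len(ls) + lam)

With lam = 0 the second estimator reduces to the sample mean; larger lam trades variance for bias.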
Good learning algorithm
• As LS is randomly drawn, the prediction ŷ will also be a random variable, with a distribution p_LS(ŷ).
[Figure: distribution p_LS(ŷ) of the prediction over random learning samples]
• A good learning algorithm should not be good only on one learning sample but on average over all learning samples (of size N) ⇒ we want to minimize:
E = E_LS{E_y{(y − ŷ)²}}
• Let us analyse this error in more detail.
Bias/variance decomposition (1)
• We decompose the expected error over both the output and the learning sample:
E = E_LS{E_y{(y − ŷ)²}}
Bias/variance decomposition (2)
[Figure: distribution of y around E_y{y}, with spread var_y{y}]
E = E_y{(y − E_y{y})²} + E_LS{(E_y{y} − ŷ)²}
First term = residual error = minimal attainable error = var_y{y}
Bias/variance decomposition (3)
• The second term is decomposed further by introducing the average prediction E_LS{ŷ}:
E_LS{(E_y{y} − ŷ)²} = (E_y{y} − E_LS{ŷ})² + E_LS{(ŷ − E_LS{ŷ})²}
Bias/variance decomposition (4)
[Figure: Bayes model E_y{y} and average model E_LS{ŷ} on the ŷ axis; their squared distance is the bias²]
E = var_y{y} + (E_y{y} − E_LS{ŷ})² + …
E_LS{ŷ} = average model (over all LS)
bias² = error between the Bayes model and the average model
Bias/variance decomposition (5)
[Figure: distribution of ŷ around E_LS{ŷ}, with spread var_LS{ŷ}]
E = var_y{y} + bias² + E_LS{(ŷ − E_LS{ŷ})²}
var_LS{ŷ} = estimation variance = consequence of over-fitting
Bias/variance decomposition (6)
[Figure: the three terms var_y{y}, bias², and var_LS{ŷ} shown on the distributions of y and ŷ]
E = var_y{y} + bias² + var_LS{ŷ}
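For reference, the same decomposition written out compactly in LaTeX (identical to the terms above):

% Bias/variance decomposition of the expected squared error
\[
E = \mathbb{E}_{LS}\!\left\{\mathbb{E}_{y}\{(y-\hat{y})^{2}\}\right\}
  = \underbrace{\mathbb{E}_{y}\{(y-\mathbb{E}_{y}\{y\})^{2}\}}_{\mathrm{var}_{y}\{y\}}
  + \underbrace{\left(\mathbb{E}_{y}\{y\}-\mathbb{E}_{LS}\{\hat{y}\}\right)^{2}}_{\mathrm{bias}^{2}}
  + \underbrace{\mathbb{E}_{LS}\{(\hat{y}-\mathbb{E}_{LS}\{\hat{y}\})^{2}\}}_{\mathrm{var}_{LS}\{\hat{y}\}}
\]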
Our simple example
• First estimator, the sample mean ŷ1 = (1/N) Σi yi:
– zero bias, but variance var_y{y}/N
– From statistics, ŷ1 is the best estimate with zero bias
• Second estimator, shrunk toward 180 with parameter λ:
– some bias (unless E_y{y} = 180), but a smaller variance
– So, the first one may not be the best estimator because of variance: there is a bias/variance tradeoff w.r.t. λ (see the simulation sketch below).
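A small Monte-Carlo sketch of this tradeoff; the assumed population mean and standard deviation, the sample size N, and the shrinkage strength lam are all illustrative values, not taken from the slides:

import numpy as np

rng = np.random.default_rng(0)
true_mean, true_std, N, n_samples = 178.0, 7.0, 10, 10000  # assumed values

def estimate_shrunk(ls, prior=180.0, lam=10.0):
    # hypothetical shrinkage estimator toward 180 cm
    return (np.sum(ls) + lam * prior) / (len(ls) + lam)

for name, estimator in [("sample mean", np.mean), ("shrunk toward 180", estimate_shrunk)]:
    preds = np.array([estimator(rng.normal(true_mean, true_std, N))
                      for _ in range(n_samples)])
    bias2 = (preds.mean() - true_mean) ** 2   # squared distance to E_y{y}
    variance = preds.var()                    # var_LS{y_hat}
    print(f"{name}: bias^2 = {bias2:.3f}, variance = {variance:.3f}")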
Regression problem – full (1)
• Actually, we want to find a function ŷ(x) of several inputs ⇒ average over the whole input space.
• The error becomes: E_{x,y}{(y − ŷ(x))²}
• Over all learning sets: E = E_LS{E_{x,y}{(y − ŷ(x))²}}
Regression problem – full (2)
E_LS{E_{y|x}{(y − ŷ(x))²}} = Noise(x) + Bias²(x) + Variance(x)
• Noise(x) = E_{y|x}{(y − h_B(x))²}:
Quantifies how much y varies from h_B(x) = E_{y|x}{y}, the Bayes model.
• Bias²(x) = (h_B(x) − E_LS{ŷ(x)})²:
Measures the error between the Bayes model and the average model.
• Variance(x) = E_LS{(ŷ(x) − E_LS{ŷ(x)})²}:
Quantifies how much ŷ(x) varies from one learning sample to another.
Illustration (1)
• Problem definition:
– One input x, a uniform random variable in [0,1]
– y = h(x) + ε where ε ~ N(0,1)
[Figure: sample of (x, y) points together with the Bayes model h(x) = E_{y|x}{y}]
Illustration (2)
• Low variance, high bias method ⇒ underfitting
[Figure: average model E_LS{ŷ(x)} of a too simple (high bias) method]
Illustration (3)
• Low bias, high variance method ⇒ overfitting (a small sketch contrasting both regimes follows below)
[Figure: average model E_LS{ŷ(x)} of a too complex (high variance) method]
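A tiny sketch of both regimes, assuming h(x) = sin(2πx) as a stand-in for the unknown target and k-nearest neighbours as the learning method (neither choice comes from the slides):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, (100, 1))
y = np.sin(2 * np.pi * x[:, 0]) + rng.normal(0, 1, 100)   # y = h(x) + eps, eps ~ N(0,1)

# Large k: low variance, high bias -> underfitting
underfit = KNeighborsRegressor(n_neighbors=50).fit(x, y)
# k = 1: low bias, high variance -> overfitting
overfit = KNeighborsRegressor(n_neighbors=1).fit(x, y)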
Classification problems (1)
• The mean misclassification error is: E_{x,y}{1(ŷ(x) ≠ y)}
• The best possible model is the Bayes model: h_B(x) = arg max_c P(y = c | x)
• The “average” model is the majority-vote model over learning samples: arg max_c P_LS(ŷ(x) = c)
• Unfortunately, there is no such decomposition of the mean misclassification error into bias and variance terms.
• Nevertheless, we observe the same phenomena.
Classification problems (2)
[Figure: class labels predicted on two learning samples LS1 and LS2, by a single test node (left) and by a full decision tree (right)]
Classification problems (3)
• Bias ⇒ systematic error component (independent of the learning sample)
• Variance ⇒ error due to the variability of the model with respect to the learning sample randomness
• There are errors due to bias and errors due to variance
[Figure: example errors made by one test node and by a full decision tree]
Content of the presentation
• Bias and variance definitions
• Parameters that influence bias and variance:
– Complexity of the model
– Complexity of the Bayes model
– Noise
– Learning sample size
– Learning algorithm
• Bias and variance reduction techniques
Illustrative problem
• Artificial problem with 10 inputs, all uniform random variables in [0,1]
• The true function depends only on 5 inputs:
y(x) = 10·sin(π·x1·x2) + 20·(x3 − 0.5)² + 10·x4 + 5·x5 + ε,
where ε is a N(0,1) random variable
• Experiments (a sketch of the protocol follows below):
– E_LS ⇒ average over 50 learning sets of size 500
– E_{x,y} ⇒ average over 2000 test cases
⇒ Estimate variance and bias (+ residual error)
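A sketch of this protocol in Python; scikit-learn regression trees stand in for the learning algorithm studied in the slides, and the default tree settings are an assumption:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def sample(n):
    # 10 uniform inputs; only the first 5 influence the output
    x = rng.uniform(0, 1, (n, 10))
    y = (10 * np.sin(np.pi * x[:, 0] * x[:, 1]) + 20 * (x[:, 2] - 0.5) ** 2
         + 10 * x[:, 3] + 5 * x[:, 4] + rng.normal(0, 1, n))
    return x, y

x_test, y_test = sample(2000)                      # 2000 test cases
preds = np.array([DecisionTreeRegressor().fit(*sample(500)).predict(x_test)
                  for _ in range(50)])             # 50 learning sets of size 500

variance = preds.var(axis=0).mean()                          # average Variance(x)
bias2_noise = ((y_test - preds.mean(axis=0)) ** 2).mean()    # ~ Bias^2(x) + Noise(x), averaged
error = ((y_test - preds) ** 2).mean()                       # total error = sum of the two above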
Complexity of the model
[Figure: schematic curves of bias², variance, and E = bias² + variance as functions of model complexity]
Usually, the bias is a decreasing function of model complexity, while the variance is an increasing function of it (a complexity sweep illustrating this is sketched below).
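For example, sweeping a complexity parameter (here, an assumed max_depth for regression trees) while reusing sample, x_test, and y_test from the previous sketch:

# Bias tends to decrease and variance to increase as the trees are allowed to grow deeper.
from sklearn.tree import DecisionTreeRegressor

for depth in (1, 2, 4, 8, None):
    preds = np.array([DecisionTreeRegressor(max_depth=depth).fit(*sample(500)).predict(x_test)
                      for _ in range(50)])
    variance = preds.var(axis=0).mean()
    bias2_noise = ((y_test - preds.mean(axis=0)) ** 2).mean()
    print(depth, round(bias2_noise, 2), round(variance, 2))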
Complexity of the model – neural networks
• Error, bias, and variance w.r.t. the number of
neurons in the hidden layer
[Plot: Error, Bias R, and Var R vs. the number of hidden perceptrons (0 to 12)]
Complexity of the model – regression trees
• Error, bias, and variance w.r.t. the number of test
nodes
[Plot: Error, Bias R, and Var R vs. the number of test nodes (0 to 50)]
Complexity of the model – k-NN
• Error, bias, and variance w.r.t. k, the number of
neighbors
[Plot: Error, Bias R, and Var R vs. k (0 to 30)]
Learning problem
• Complexity of the Bayes model:
– At fixed model complexity, bias increases with the complexity of
the Bayes model. However, the effect on variance is difficult to
predict.
• Noise:
– Variance increases with noise and bias is mainly unaffected.
– E.g. with (full) regression trees
[Plot: Error, Noise, Bias R, and Var R vs. the noise standard deviation (0 to 6), for full regression trees]
Learning sample size (1)
• At fixed model complexity, bias remains constant and
variance decreases with the learning sample size. E.g.
linear regression
[Plot: Error, Bias R, and Var R vs. the learning sample size (0 to 2000), for linear regression]
Learning sample size (2)
• When the complexity of the model depends on the learning sample size, both bias and variance decrease with the learning sample size. E.g. regression trees (see the sweep sketched below):
[Plot: Error, Bias R, and Var R vs. the learning sample size (0 to 2000), for regression trees]
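A possible sweep over the learning sample size, again reusing sample, x_test, and y_test from the earlier protocol sketch:

# With unpruned trees, whose size grows with the sample, both terms shrink as N grows.
from sklearn.tree import DecisionTreeRegressor

for n in (50, 100, 250, 500, 1000, 2000):
    preds = np.array([DecisionTreeRegressor().fit(*sample(n)).predict(x_test)
                      for _ in range(50)])
    variance = preds.var(axis=0).mean()
    bias2_noise = ((y_test - preds.mean(axis=0)) ** 2).mean()
    print(n, round(bias2_noise, 2), round(variance, 2))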
Learning algorithms – linear regression
Method          Err²   Bias²+Noise   Variance
Linear regr.     7.0    6.8            0.2
k-NN (k=1)      15.4    5             10.4
k-NN (k=10)      8.5    7.2            1.3
MLP (10)         2.0    1.2            0.8
MLP (10 – 10)    4.6    1.4            3.2
Regr. Tree      10.2    3.5            6.7
• Very few parameters: small variance
• Goal function is not linear: high bias
Learning algorithms – k-NN
Method          Err²   Bias²+Noise   Variance
Linear regr.     7.0    6.8            0.2
k-NN (k=1)      15.4    5             10.4
k-NN (k=10)      8.5    7.2            1.3
MLP (10)         2.0    1.2            0.8
MLP (10 – 10)    4.6    1.4            3.2
Regr. Tree      10.2    3.5            6.7
• Small k: high variance and moderate bias
• High k: smaller variance but higher bias
Learning algorithms - MLP
Method          Err²   Bias²+Noise   Variance
Linear regr.     7.0    6.8            0.2
k-NN (k=1)      15.4    5             10.4
k-NN (k=10)      8.5    7.2            1.3
MLP (10)         2.0    1.2            0.8
MLP (10 – 10)    4.6    1.4            3.2
Regr. Tree      10.2    3.5            6.7
• Small bias
• Variance increases with the model complexity
Learning algorithms – regression trees
Method          Err²   Bias²+Noise   Variance
Linear regr.     7.0    6.8            0.2
k-NN (k=1)      15.4    5             10.4
k-NN (k=10)      8.5    7.2            1.3
MLP (10)         2.0    1.2            0.8
MLP (10 – 10)    4.6    1.4            3.2
Regr. Tree      10.2    3.5            6.7
• Small bias: a (complex enough) tree can approximate any non-linear function
• High variance
Content of the presentation
• Bias and variance definitions
• Parameters that influence bias and variance
• Decision/regression tree variance
• Bias and variance reduction techniques
– Introduction
– Dealing with the bias/variance tradeoff of one algorithm
– Ensemble methods
Bias and variance reduction techniques
• In the context of a given method:
– Adapt the learning algorithm to find the best trade-off
between bias and variance.
– Not a panacea but the least we can do.
– Example: pruning, weight decay.
• Ensemble methods:
– Change the bias/variance trade-off.
– Universal but destroys some features of the initial
method.
– Example: bagging, boosting.
Variance reduction: 1 model (1)
• General idea: reduce the ability of the learning
algorithm to fit the LS
– Pruning
• reduces the model complexity explicitly
– Early stopping
• reduces the amount of search
– Regularization
• reduces the size of the hypothesis space
• weight decay with neural networks consists of penalizing large weight values
(a sketch of these three options follows below)
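A hedged sketch of these three options with scikit-learn; the estimators and parameter values are illustrative choices, not those used in the slides:

from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor

# Pruning-style control: restrict the tree explicitly (depth limit, cost-complexity pruning).
pruned_tree = DecisionTreeRegressor(max_depth=6, ccp_alpha=0.01)

# Early stopping: hold out part of the LS and stop when validation error stops improving.
early_stopped_mlp = MLPRegressor(hidden_layer_sizes=(10,), early_stopping=True)

# Regularization (weight decay): penalize large weights through the L2 term alpha.
weight_decay_mlp = MLPRegressor(hidden_layer_sizes=(10,), alpha=1e-2)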
Variance reduction: 1 model (2)
[Figure: bias², variance, and E = bias² + variance as functions of the level of fitting, with the optimal fitting level marked]
• Selection of the optimal level of fitting:
– a priori (not optimal)
– by cross-validation (less efficient): bias² ≈ error on the learning set, E ≈ error on an independent test set (a cross-validation sketch follows below)
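A possible way to select the fitting level by cross-validation with scikit-learn; the parameter grid and scoring choice are assumptions:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

search = GridSearchCV(
    DecisionTreeRegressor(),
    param_grid={"max_depth": [2, 4, 6, 8, 10, None]},
    cv=5,                                  # 5-fold cross-validation on the LS
    scoring="neg_mean_squared_error",
)
# search.fit(x_ls, y_ls) would pick the depth with the best validated error.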
Variance reduction: 1 model (3)
• Examples:
– Post-pruning of regression trees
– Early stopping of MLP by cross-validation
Method                   E      Bias   Variance
Full regr. tree (250)   10.2    3.5    6.7
Pruned regr. tree (45)   9.1    4.3    4.8
Fully learned MLP        4.6    1.4    3.2
Early stopped MLP        3.8    1.5    2.3
• As expected, variance decreases but bias increases
Ensemble methods
• Combine the predictions of several models built with a learning algorithm in order to improve on the use of a single model
• Two important families (a sketch of both follows below):
– Averaging techniques
• Grow several models independently and simply average their predictions
• E.g. bagging, random forests
• Mainly decrease variance
– Boosting-type algorithms
• Grow several models sequentially
• E.g. AdaBoost, MART
• Mainly decrease bias
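A hedged sketch of both families with scikit-learn; the base learners and ensemble sizes are illustrative assumptions:

from sklearn.ensemble import AdaBoostRegressor, BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Averaging: 50 trees grown independently on bootstrap samples, predictions averaged
# -> mainly reduces variance.
bagged_trees = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50)

# Boosting: shallow trees grown sequentially, each one focusing on the previous errors
# -> mainly reduces bias.
boosted_trees = AdaBoostRegressor(DecisionTreeRegressor(max_depth=3), n_estimators=50)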
Conclusion (1)
• The notions of bias and variance are very useful to predict how changing the learning and problem parameters will affect accuracy. E.g. this explains why very simple methods can work much better than more complex ones on very difficult tasks.
• Variance reduction is a very important topic:
– Reducing bias is easy, but keeping variance low is not as easy.
– Especially in the context of new applications of machine learning to very complex domains: temporal data, biological data, Bayesian network learning, text mining…
• Not all learning algorithms are equal in terms of variance. Trees are among the worst methods according to this criterion.
Conclusion (2)
• Ensemble methods are very effective techniques to reduce bias and/or variance. They can turn a mediocre method into a competitive one in terms of accuracy.