Ch 3. Linear Models for Regression
(2/2)
Pattern Recognition and Machine Learning,
C. M. Bishop, 2006.
Previously summarized by Yung-Kyun Noh
Updated and presented by Rhee, Je-Keun
Biointelligence Laboratory, Seoul National University
http://bi.snu.ac.kr/
Contents
3.4 Bayesian Model Comparison
3.5 The Evidence Approximation
3.5.1 Evaluation of the evidence function
3.5.2 Maximizing the evidence function
3.5.3 Effective number of parameters
3.6 Limitations of Fixed Basis Functions
Bayesian Model Comparison (1/4)
Model selection from a Bayesian perspective
Over-fitting associated with maximum likelihood can be avoided by
marginalizing over the model parameters instead of making point
estimates of their values.
It also allows multiple complexity parameters to be determined
simultaneously as part of the training process (as in the relevance
vector machine).
The Bayesian view of model comparison simply involves the use of
probabilities to represent uncertainty in the choice of model.
Posterior: p(M_i | D) ∝ p(M_i) p(D | M_i)
p(M_i): prior, expressing a preference for different models.
p(D | M_i): model evidence (marginal likelihood), expressing the
preference shown by the data for different models; the parameters
have been marginalized out.
Bayesian Model Comparison (2/4)
Bayes factor: the ratio of model evidences for two models,
p(D | M_i) / p(D | M_j)
Predictive distribution: a mixture distribution, obtained by averaging the
predictive distributions of the individual models, weighted by their
posterior probabilities:
p(t | x, D) = Σ_{i=1}^{L} p(t | x, M_i, D) p(M_i | D)
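As an illustrative sketch, the two quantities above can be combined numerically: normalize prior-times-evidence in log space to get p(M_i | D), then form the weighted mixture. All names and values below are hypothetical.

```python
import numpy as np

# Hypothetical log evidences ln p(D|M_i) and uniform priors p(M_i)
# for three candidate models (values made up for illustration).
log_evidence = np.array([-52.3, -48.1, -49.0])
log_prior = np.log(np.full(3, 1.0 / 3.0))

# Posterior p(M_i|D) ∝ p(M_i) p(D|M_i), normalized in log space
# (log-sum-exp) for numerical stability.
log_post = log_prior + log_evidence
post = np.exp(log_post - np.logaddexp.reduce(log_post))

# Mixture predictive: weight each model's predictive density
# p(t|x, M_i, D) (also made-up values) by its posterior probability.
pred_per_model = np.array([0.21, 0.35, 0.30])
p_t = float(np.sum(post * pred_per_model))
print(post.round(3), round(p_t, 3))
```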
Model evidence:
p(D | M_i) = ∫ p(D | w, M_i) p(w | M_i) dw
Sampling perspective: the marginal likelihood can be viewed as the probability of generating the
data set D from a model whose parameters are sampled at random from the prior.
Posterior distribution over parameters:
p(w | D, M_i) = p(D | w, M_i) p(w | M_i) / p(D | M_i)
The evidence is the normalizing term that appears in the denominator when evaluating the
posterior distribution over the parameters.
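The sampling perspective suggests a crude numerical check: draw parameters from the prior and average the likelihood. A minimal sketch, assuming a Gaussian prior N(0, α⁻¹I) over w and a Gaussian likelihood with fixed precision β (all names and values ours); this simple Monte Carlo estimator is high-variance and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_log_evidence(X, t, alpha=1.0, beta=25.0, n_samples=20000):
    # ln p(D|M) ≈ ln[(1/S) Σ_s p(D|w_s)],  w_s ~ p(w|M) = N(0, α⁻¹ I).
    n, m = X.shape
    ws = rng.normal(0.0, 1.0 / np.sqrt(alpha), size=(n_samples, m))
    resid = t[None, :] - ws @ X.T          # (S, N) residuals per draw
    log_lik = (0.5 * n * np.log(beta / (2 * np.pi))
               - 0.5 * beta * np.sum(resid ** 2, axis=1))
    # Average in log space with log-sum-exp for stability.
    return np.logaddexp.reduce(log_lik) - np.log(n_samples)
```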
Bayesian Model Comparison (3/4)
Consider the case of a model having a single parameter w.
Assume that the posterior distribution is sharply peaked around the most probable
value w_MAP, with width Δw_posterior, and that the prior is flat with width
Δw_prior, so that p(w) = 1/Δw_prior. Then
p(D) = ∫ p(D | w) p(w) dw ≃ p(D | w_MAP) Δw_posterior / Δw_prior
and taking logs,
ln p(D) ≃ ln p(D | w_MAP) + ln(Δw_posterior / Δw_prior)
The first term represents the fit to the data given by the most probable parameter
value; for a flat prior this corresponds to the log likelihood.
The second term penalizes the model according to its complexity. Because
Δw_posterior < Δw_prior, this term is negative, and its magnitude grows as the
parameters become more finely tuned to the data (smaller Δw_posterior).
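A hypothetical numerical illustration: with Δw_prior = 10 and Δw_posterior = 0.1, the penalty is ln(0.1/10) = ln 0.01 ≈ −4.6, so the log evidence sits roughly 4.6 nats below the best-fit log likelihood.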
Bayesian Model Comparison (4/4)
For a model having M parameters, assuming all parameters have the same ratio
Δw_posterior / Δw_prior:
ln p(D) ≃ ln p(D | w_MAP) + M ln(Δw_posterior / Δw_prior)
As we increase the complexity of the model, the first term will typically increase,
because a more complex model is better able to fit the data, whereas the second
term will decrease due to its linear dependence on M.
The optimal model complexity, as determined by the maximum evidence, is given by
a trade-off between these two competing terms.
• A simple model has little variability and so will generate data sets that are
fairly similar to each other.
• A complex model spreads its predictive probability over too broad a range of
data sets and so assigns relatively small probability to any one of them.
The Evidence Approximation (1/2)
Fully Bayesian treatment of linear basis function model
Hyperparameters: α, β.
Prediction: Marginalize w.r.t. hyperparameters as well as w.
Predictive distribution (𝐭 denotes the vector of training target values):
p(t | 𝐭) = ∫∫∫ p(t | w, β) p(w | 𝐭, α, β) p(α, β | 𝐭) dw dα dβ
If the posterior distribution p(α, β | 𝐭) is sharply peaked around values α̂, β̂,
the predictive distribution is obtained simply by marginalizing over w with α and β
fixed to α̂ and β̂:
p(t | 𝐭) ≃ p(t | 𝐭, α̂, β̂) = ∫ p(t | w, β̂) p(w | 𝐭, α̂, β̂) dw
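With α and β held fixed, the remaining integral is the standard Bayesian linear regression predictive from Section 3.3. A minimal NumPy sketch of that computation (variable names ours; Phi is the N×M design matrix):

```python
import numpy as np

def posterior_w(Phi, t, alpha, beta):
    # p(w|𝐭,α,β) = N(w | m_N, S_N), S_N⁻¹ = αI + βΦᵀΦ, m_N = βS_NΦᵀ𝐭.
    A = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(A)
    m_N = beta * S_N @ (Phi.T @ t)
    return m_N, S_N

def predictive(phi_x, m_N, S_N, beta):
    # p(t|x,𝐭) ≈ N(t | m_Nᵀφ(x), 1/β + φ(x)ᵀ S_N φ(x)), α, β fixed.
    return m_N @ phi_x, 1.0 / beta + phi_x @ S_N @ phi_x
```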
The Evidence Approximation (2/2)
p(α, β | 𝐭) ∝ p(𝐭 | α, β) p(α, β)
If the prior p(α, β) is relatively flat, then in the evidence framework the
values α̂, β̂ are obtained by maximizing the marginal likelihood function p(𝐭 | α, β).
This allows the hyperparameters to be determined from the training data alone,
without recourse to cross-validation.
Recall that the ratio α/β is analogous to a regularization parameter.
Maximizing the evidence: either set the derivatives of the evidence function to
zero and iterate the resulting re-estimation equations for α and β, or use the
expectation-maximization (EM) algorithm.
Evaluation of the Evidence Function
Marginal likelihood:
p(𝐭 | α, β) = ∫ p(𝐭 | w, β) p(w | α) dw
            = (β / 2π)^(N/2) (α / 2π)^(M/2) ∫ exp{−E(w)} dw
where
E(w) = β E_D(w) + α E_W(w) = (β/2) ‖𝐭 − Φw‖² + (α/2) wᵀw
Completing the square over w and integrating gives the log marginal likelihood:
ln p(𝐭 | α, β) = (M/2) ln α + (N/2) ln β − E(m_N) − (1/2) ln|A| − (N/2) ln(2π)
with
m_N = β A⁻¹ Φᵀ 𝐭
A = αI + β ΦᵀΦ
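A minimal NumPy transcription of this log evidence (variable names ours; Phi is the N×M design matrix, t the target vector):

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    # ln p(𝐭|α,β) = (M/2)ln α + (N/2)ln β − E(m_N) − (1/2)ln|A| − (N/2)ln 2π
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    E_mN = (0.5 * beta * np.sum((t - Phi @ m_N) ** 2)
            + 0.5 * alpha * m_N @ m_N)
    _, logdetA = np.linalg.slogdet(A)      # A is positive definite
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
            - E_mN - 0.5 * logdetA - 0.5 * N * np.log(2 * np.pi))
```

Evaluating this quantity for models of increasing complexity (e.g. polynomial order) and comparing the values is exactly the model comparison of Section 3.4.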
Evaluation of the Evidence Function
[Figure: plot of the model evidence for models of different complexity]
Maximizing the Evidence Function
Maximization of p(𝐭 | α, β): set its derivatives with respect to α and β to zero.
w.r.t. α:
α = γ / (m_Nᵀ m_N), where γ = Σ_i λ_i / (α + λ_i)
and u_i, λ_i are the eigenvectors and eigenvalues defined by (β ΦᵀΦ) u_i = λ_i u_i.
w.r.t. β:
1/β = (1 / (N − γ)) Σ_{n=1}^{N} { t_n − m_Nᵀ φ(x_n) }²
These are implicit solutions, since γ and m_N themselves depend on α and β; in
practice one iterates the two re-estimation equations until convergence, as
sketched below.
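A minimal sketch of that fixed-point iteration (initialization and names ours):

```python
import numpy as np

def evidence_maximization(Phi, t, alpha=1.0, beta=1.0, n_iter=200, tol=1e-8):
    # Alternate between the re-estimation equations for α and β.
    N, M = Phi.shape
    eig0 = np.linalg.eigvalsh(Phi.T @ Phi)     # eigenvalues of ΦᵀΦ
    for _ in range(n_iter):
        lam = beta * eig0                      # eigenvalues of βΦᵀΦ
        gamma = np.sum(lam / (lam + alpha))    # effective parameter count
        A = alpha * np.eye(M) + beta * Phi.T @ Phi
        m_N = beta * np.linalg.solve(A, Phi.T @ t)
        alpha_new = gamma / (m_N @ m_N)
        beta_new = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)
        converged = (abs(alpha_new - alpha) < tol
                     and abs(beta_new - beta) < tol)
        alpha, beta = alpha_new, beta_new
        if converged:
            break
    return alpha, beta, gamma
```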
Effective Number of Parameters (1/2)
γ = Σ_i λ_i / (λ_i + α): the effective total number of well-determined parameters.
For λ_i ≫ α: λ_i / (λ_i + α) ≈ 1, and the corresponding parameter is well
determined by the data.
For λ_i ≪ α: λ_i / (λ_i + α) ≈ 0, and the corresponding parameter is set close
to zero by the prior.
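A hypothetical two-parameter illustration: if βΦᵀΦ has eigenvalues λ₁ = 100 and λ₂ = 0.01 with α = 1, then γ = 100/101 + 0.01/1.01 ≈ 0.99 + 0.01 = 1.0, i.e. one direction in parameter space is well determined by the data while the other is effectively switched off by the prior.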
Effective Number of Parameters (2/2)
At the optimal α, the quantity 2α E_W(m_N) = α m_Nᵀ m_N equals γ.
[Figure: 2α E_W(m_N) and γ plotted against α, intersecting at the optimal α.]
[Figure: log evidence ln p(𝐭 | α, β) versus α, and test error versus α; the
evidence-optimal α lies close to the value minimizing the test error.]
Limitations of Fixed Basis Functions
Models comprising a linear combination of fixed,
nonlinear basis functions.
Have closed-form solutions to the least-squares problem.
Have a tractable Bayesian treatment.
The difficulty
The basis functions φ_j(x) are fixed before the training data set is observed,
so the number of basis functions typically needs to grow rapidly (often
exponentially) with the dimensionality of the input space. This is a
manifestation of the curse of dimensionality.
Properties of real data sets that alleviate this problem
The data vectors {x_n} typically lie close to a nonlinear manifold whose
intrinsic dimensionality is smaller than that of the input space.
Target variables may have significant dependence on only a small number of
possible directions within the data manifold.