
Ch 3. Linear Models for Regression
(2/2)
Pattern Recognition and Machine Learning,
C. M. Bishop, 2006.
Previously summarized by Yung-Kyun Noh
Updated and presented by Rhee, Je-Keun
Biointelligence Laboratory, Seoul National University
http://bi.snu.ac.kr/
Contents


3.4 Bayesian Model Comparison
3.5 The Evidence Approximation
  3.5.1 Evaluation of the evidence function
  3.5.2 Maximizing the evidence function
  3.5.3 Effective number of parameters
3.6 Limitations of Fixed Basis Functions
(C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
Bayesian Model Comparison (1/4)

Model selection from a Bayesian perspective
- Over-fitting associated with maximum likelihood can be avoided by
marginalizing over the model parameters instead of making point
estimates of their values.
- It also allows multiple complexity parameters to be determined
simultaneously as part of the training process (e.g., the relevance vector
machine).
- The Bayesian view of model comparison simply involves the use of
probabilities to represent uncertainty in the choice of model.

Posterior over models
$$p(\mathcal{M}_i | \mathcal{D}) \propto p(\mathcal{M}_i)\, p(\mathcal{D} | \mathcal{M}_i)$$
- $p(\mathcal{M}_i)$: prior, a preference for different models.
- $p(\mathcal{D} | \mathcal{M}_i)$: model evidence (marginal likelihood), the preference
shown by the data for different models. Parameters have been
marginalized out.
Bayesian Model Comparison (2/4)

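Bayes factor: the ratio of model evidences for two models
$$\frac{p(\mathcal{D} | \mathcal{M}_i)}{p(\mathcal{D} | \mathcal{M}_j)}$$

Predictive distribution: a mixture distribution, obtained by averaging the predictive
distributions of the individual models, weighted by their posterior probabilities.
$$p(t | \mathbf{x}, \mathcal{D}) = \sum_{i=1}^{L} p(t | \mathbf{x}, \mathcal{M}_i, \mathcal{D})\, p(\mathcal{M}_i | \mathcal{D})$$

Model evidence
$$p(\mathcal{D} | \mathcal{M}_i) = \int p(\mathcal{D} | \mathbf{w}, \mathcal{M}_i)\, p(\mathbf{w} | \mathcal{M}_i)\, d\mathbf{w}$$
- Sampling perspective: the marginal likelihood can be viewed as the probability of generating the
data set $\mathcal{D}$ from a model whose parameters are sampled at random from the prior.

Posterior distribution over parameters
$$p(\mathbf{w} | \mathcal{D}, \mathcal{M}_i) = \frac{p(\mathcal{D} | \mathbf{w}, \mathcal{M}_i)\, p(\mathbf{w} | \mathcal{M}_i)}{p(\mathcal{D} | \mathcal{M}_i)}$$
- The evidence is the normalizing term that appears in the denominator when evaluating the
posterior distribution over parameters.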
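As a rough illustration of the quantities on this slide, the sketch below turns log evidences into model posteriors, a Bayes factor, and a model-averaged predictive mean. The function names and the uniform default prior are assumptions made for the example, not part of the original slides.

```python
import math

def model_posteriors(log_evidences, priors=None):
    """Return p(M_i | D) given ln p(D | M_i) and priors p(M_i)."""
    n = len(log_evidences)
    if priors is None:
        priors = [1.0 / n] * n  # assumption: uniform prior over models
    log_joint = [le + math.log(p) for le, p in zip(log_evidences, priors)]
    shift = max(log_joint)  # subtract the maximum for numerical stability
    weights = [math.exp(v - shift) for v in log_joint]
    total = sum(weights)
    return [w / total for w in weights]

def bayes_factor(log_evidence_i, log_evidence_j):
    """Ratio p(D | M_i) / p(D | M_j), computed from log evidences."""
    return math.exp(log_evidence_i - log_evidence_j)

def averaged_predictive_mean(model_means, posteriors):
    """Mean of the mixture predictive: sum_i E[t | x, M_i, D] p(M_i | D)."""
    return sum(m * p for m, p in zip(model_means, posteriors))

# Hypothetical log evidences for two models.
post = model_posteriors([-10.0, -12.0])
print(post, bayes_factor(-10.0, -12.0))
print(averaged_predictive_mean([1.0, 3.0], post))
```

Working in log space avoids underflow, since evidences for realistic data sets are usually extremely small numbers.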
Bayesian Model Comparison (3/4)

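Consider the case of a model having a single parameter w.
- Assume that the posterior distribution is sharply peaked around the most probable
value $w_{\mathrm{MAP}}$, with width $\Delta w_{\mathrm{posterior}}$.
- Assume that the prior is flat with width $\Delta w_{\mathrm{prior}}$, so that $p(w) = 1/\Delta w_{\mathrm{prior}}$. Then
$$p(\mathcal{D}) = \int p(\mathcal{D} | w)\, p(w)\, dw \simeq p(\mathcal{D} | w_{\mathrm{MAP}})\, \frac{\Delta w_{\mathrm{posterior}}}{\Delta w_{\mathrm{prior}}}$$
$$\ln p(\mathcal{D}) \simeq \ln p(\mathcal{D} | w_{\mathrm{MAP}}) + \ln\left(\frac{\Delta w_{\mathrm{posterior}}}{\Delta w_{\mathrm{prior}}}\right)$$
- The first term represents the fit to the data given by the most probable parameter
value, and for a flat prior this would correspond to the log likelihood.
- The second term penalizes the model according to its complexity: because
$\Delta w_{\mathrm{posterior}} < \Delta w_{\mathrm{prior}}$, this term is negative.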
Bayesian Model Comparison (4/4)

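For a model having a set of M parameters (assuming each has the same ratio of posterior to prior width),
$$\ln p(\mathcal{D}) \simeq \ln p(\mathcal{D} | \mathbf{w}_{\mathrm{MAP}}) + M \ln\left(\frac{\Delta w_{\mathrm{posterior}}}{\Delta w_{\mathrm{prior}}}\right)$$
- As we increase the complexity of the model, the first term (the log likelihood at
$\mathbf{w}_{\mathrm{MAP}}$) will typically increase, because a more complex model is better able to fit the data.
- The second term, which is negative, grows in magnitude linearly with M, so the
complexity penalty becomes more severe.
- The optimal model complexity, as determined by the maximum evidence, will be
given by a trade-off between these two competing terms.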
• A simple model has little variability and so will
generate data sets that are fairly similar to each
other.
• A complex model spreads its predictive
probability over too broad a range of data sets and
so assigns relatively small probability to any one of
them.
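The trade-off between data fit and the complexity penalty can be made concrete with a small numerical sketch of the approximate log evidence. The function name and all numbers below are hypothetical, chosen only to illustrate the formula for M parameters with a common width ratio.

```python
import math

def approx_log_evidence(log_lik_map, width_posterior, width_prior, num_params=1):
    """ln p(D) ~= ln p(D | w_MAP) + M * ln(width_posterior / width_prior)."""
    occam_term = num_params * math.log(width_posterior / width_prior)
    return log_lik_map + occam_term

# Hypothetical comparison: the complex model fits better (higher log likelihood)
# but pays a larger penalty because it has more parameters.
simple_model = approx_log_evidence(-12.0, 0.1, 1.0, num_params=1)
complex_model = approx_log_evidence(-10.0, 0.1, 1.0, num_params=5)
print(simple_model, complex_model)
```

With these made-up values the simpler model ends up with the larger approximate evidence, even though its fit term is worse.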
The Evidence Approximation (1/2)

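Fully Bayesian treatment of the linear basis function model
- Hyperparameters: α, β.
- Prediction: marginalize with respect to the hyperparameters as well as w.

Predictive distribution
$$p(t | \mathbf{t}) = \iiint p(t | \mathbf{w}, \beta)\, p(\mathbf{w} | \mathbf{t}, \alpha, \beta)\, p(\alpha, \beta | \mathbf{t})\, d\mathbf{w}\, d\alpha\, d\beta$$
- If the posterior distribution $p(\alpha, \beta | \mathbf{t})$ is sharply peaked around
values $\hat{\alpha}, \hat{\beta}$, the predictive distribution is obtained simply by
marginalizing over w with α, β fixed to the values $\hat{\alpha}, \hat{\beta}$:
$$p(t | \mathbf{t}) \simeq p(t | \mathbf{t}, \hat{\alpha}, \hat{\beta}) = \int p(t | \mathbf{w}, \hat{\beta})\, p(\mathbf{w} | \mathbf{t}, \hat{\alpha}, \hat{\beta})\, d\mathbf{w}$$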
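Once α and β are fixed, the integral over w has a closed form for the Gaussian linear model. The sketch below assumes the standard result from the earlier part of the chapter (not shown on these slides): the predictive distribution is Gaussian with mean $\mathbf{m}_N^T \phi(\mathbf{x})$ and variance $1/\beta + \phi(\mathbf{x})^T \mathbf{S}_N \phi(\mathbf{x})$, where $\mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta \Phi^T \Phi$. The function and argument names are illustrative.

```python
import numpy as np

def predictive(Phi, t, phi_new, alpha, beta):
    """Predictive mean and variance for fixed hyperparameters alpha, beta.

    Phi: (N, M) design matrix, t: (N,) targets,
    phi_new: (M,) basis-function vector of the new input.
    """
    M = Phi.shape[1]
    A = alpha * np.eye(M) + beta * Phi.T @ Phi      # posterior precision S_N^{-1}
    m_N = beta * np.linalg.solve(A, Phi.T @ t)      # posterior mean of w
    mean = phi_new @ m_N
    var = 1.0 / beta + phi_new @ np.linalg.solve(A, phi_new)  # noise + weight uncertainty
    return mean, var

# Hypothetical data lying exactly on the line t = x, with basis functions (1, x).
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
t = np.array([0.0, 1.0, 2.0])
print(predictive(Phi, t, np.array([1.0, 3.0]), alpha=1e-6, beta=25.0))
```

Using `np.linalg.solve` instead of forming the inverse explicitly is a design choice for numerical stability; the result is the same in exact arithmetic.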
The Evidence Approximation (2/2)
$$p(\alpha, \beta | \mathbf{t}) \propto p(\mathbf{t} | \alpha, \beta)\, p(\alpha, \beta)$$

If the prior $p(\alpha, \beta)$ is relatively flat,
- In the evidence framework the values of $\hat{\alpha}, \hat{\beta}$ are obtained by
maximizing the marginal likelihood function $p(\mathbf{t} | \alpha, \beta)$.

Hyperparameters can be determined from the training data alone
with this method (without recourse to cross-validation).
- Recall that the ratio α/β is analogous to a regularization parameter.
- Maximizing the evidence: two approaches
  - Set the derivatives of the evidence function to zero, which yields re-estimation
equations for α and β.
  - Alternatively, use the expectation-maximization (EM) algorithm.
Evaluation of the Evidence Function

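Marginal likelihood
$$p(\mathbf{t} | \alpha, \beta) = \int p(\mathbf{t} | \mathbf{w}, \beta)\, p(\mathbf{w} | \alpha)\, d\mathbf{w}$$
$$p(\mathbf{t} | \alpha, \beta) = \left(\frac{\beta}{2\pi}\right)^{N/2} \left(\frac{\alpha}{2\pi}\right)^{M/2} \int \exp\{-E(\mathbf{w})\}\, d\mathbf{w}$$
$$E(\mathbf{w}) = \beta E_D(\mathbf{w}) + \alpha E_W(\mathbf{w}) = \frac{\beta}{2} \|\mathbf{t} - \Phi \mathbf{w}\|^2 + \frac{\alpha}{2} \mathbf{w}^T \mathbf{w}$$
$$\ln p(\mathbf{t} | \alpha, \beta) = \frac{M}{2} \ln \alpha + \frac{N}{2} \ln \beta - E(\mathbf{m}_N) - \frac{1}{2} \ln |\mathbf{A}| - \frac{N}{2} \ln(2\pi)$$
$$\mathbf{m}_N = \beta \mathbf{A}^{-1} \Phi^T \mathbf{t}$$
$$\mathbf{A} = \alpha \mathbf{I} + \beta \Phi^T \Phi$$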
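The log evidence can be evaluated directly from these expressions. The following is a minimal NumPy sketch; the function name `log_evidence` and the argument names are assumptions for illustration, with `Phi` the N x M design matrix and `t` the target vector.

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """ln p(t | alpha, beta) for the Gaussian linear basis function model."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi       # A = alpha*I + beta*Phi^T Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)       # m_N = beta * A^{-1} Phi^T t
    resid = t - Phi @ m_N
    E_mN = 0.5 * beta * resid @ resid + 0.5 * alpha * m_N @ m_N
    _, logdet_A = np.linalg.slogdet(A)               # ln|A|, computed stably
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
            - E_mN - 0.5 * logdet_A - 0.5 * N * np.log(2.0 * np.pi))

# Hypothetical random data, only to show the call.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(6, 3))
t = rng.normal(size=6)
print(log_evidence(Phi, t, alpha=0.7, beta=2.5))
```

One way to sanity-check such an implementation is to compare it with the log density of $\mathbf{t}$ under a zero-mean Gaussian with covariance $\beta^{-1}\mathbf{I} + \alpha^{-1}\Phi\Phi^T$, which is the same marginal written in data space.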
Evaluation of the Evidence Function
[Figure: model evidence]
Maximizing the Evidence Function

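Maximization of $p(\mathbf{t} | \alpha, \beta)$
- Set the derivatives with respect to α and β to zero.
- With respect to α:
  - $\mathbf{u}_i$ and $\lambda_i$ are the eigenvectors and eigenvalues defined by
$$(\beta \Phi^T \Phi)\, \mathbf{u}_i = \lambda_i \mathbf{u}_i$$
  - Maximizing hyperparameter:
$$\alpha = \frac{\gamma}{\mathbf{m}_N^T \mathbf{m}_N}, \qquad \gamma = \sum_i \frac{\lambda_i}{\alpha + \lambda_i}$$
- With respect to β:
$$\frac{1}{\beta} = \frac{1}{N - \gamma} \sum_{n=1}^{N} \{t_n - \mathbf{m}_N^T \phi(\mathbf{x}_n)\}^2$$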
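Because γ and $\mathbf{m}_N$ themselves depend on α and β, the two stationarity conditions are implicit and are typically solved by iterating them. The sketch below is one possible fixed-point loop, not a definitive implementation; the function name, starting values, iteration cap, and tolerance are all assumptions.

```python
import numpy as np

def maximize_evidence(Phi, t, alpha=1.0, beta=1.0, max_iter=200, tol=1e-6):
    """Iterate the re-estimation equations for alpha and beta."""
    N, M = Phi.shape
    eig0 = np.linalg.eigvalsh(Phi.T @ Phi)           # eigenvalues of Phi^T Phi
    for _ in range(max_iter):
        lam = beta * eig0                            # eigenvalues of beta*Phi^T Phi
        A = alpha * np.eye(M) + beta * Phi.T @ Phi
        m_N = beta * np.linalg.solve(A, Phi.T @ t)
        gamma = np.sum(lam / (alpha + lam))
        alpha_new = gamma / (m_N @ m_N)
        resid = t - Phi @ m_N
        beta_new = (N - gamma) / (resid @ resid)     # 1/beta = sum(resid^2) / (N - gamma)
        converged = (abs(alpha_new - alpha) < tol * alpha
                     and abs(beta_new - beta) < tol * beta)
        alpha, beta = alpha_new, beta_new
        if converged:
            break
    # gamma and m_N come from the last pass, i.e. the previous alpha and beta.
    return alpha, beta, gamma, m_N

# Hypothetical data: a noisy line fitted with polynomial basis functions (1, x, x^2).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
Phi = np.column_stack([np.ones_like(x), x, x**2])
t = 1.0 + 2.0 * x + rng.normal(scale=0.2, size=x.size)
print(maximize_evidence(Phi, t))
```

The eigenvalues of $\Phi^T \Phi$ are computed once and rescaled by the current β on each pass, which avoids repeating the eigendecomposition.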
Effective Number of Parameters (1/2)

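γ: effective total number of well-determined parameters
$$\gamma = \sum_i \frac{\lambda_i}{\alpha + \lambda_i}$$
- $\lambda_i \gg \alpha$: $\lambda_i / (\alpha + \lambda_i) \to 1$ (the parameter is well determined by the data)
- $\lambda_i \ll \alpha$: $\lambda_i / (\alpha + \lambda_i) \to 0$ (the parameter is set mainly by the prior)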
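The behaviour of γ in the two limiting regimes can be checked with a few lines of Python. The eigenvalues below are hypothetical: two are much larger than α and two are much smaller, so γ should come out close to 2.

```python
def effective_num_params(eigenvalues, alpha):
    """gamma = sum_i lambda_i / (alpha + lambda_i)."""
    return sum(lam / (alpha + lam) for lam in eigenvalues)

# Hypothetical eigenvalues of beta * Phi^T Phi.
eigenvalues = [500.0, 50.0, 0.02, 0.001]
gamma = effective_num_params(eigenvalues, alpha=1.0)
print(gamma)
```

Each term lies between 0 and 1, so γ always lies between 0 and the total number of parameters M.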
Effective Number of Parameters (2/2)
[Figure: curves of $2\alpha E_W(\mathbf{m}_N)$, the log evidence $\ln p(\mathbf{t} | \alpha, \beta)$, and the test error, with the optimal α marked.]
Limitations of Fixed Basis Functions

Models comprising a linear combination of fixed,
nonlinear basis functions
- Have closed-form solutions to the least-squares problem.
- Have a tractable Bayesian treatment.

The difficulty
- The basis functions $\phi_j(\mathbf{x})$ are fixed before the training data set is
observed. As a consequence, the number of basis functions needed grows rapidly,
often exponentially, with the dimensionality of the input space. This is a
manifestation of the curse of dimensionality.

Properties of data sets that alleviate this problem
- The data vectors $\{\mathbf{x}_n\}$ typically lie close to a nonlinear manifold
whose intrinsic dimensionality is smaller than that of the input
space.
- Target variables may have significant dependence on only a
small number of possible directions within the data manifold.