
ECE 8443 – Pattern Recognition
ECE 8527 – Introduction to Machine Learning and Pattern Recognition
LECTURE 05: BAYESIAN ESTIMATION
• Objectives:
Bayesian Estimation
Example
• Resources:
D.H.S.: Chapter 3 (Part 2)
J.O.S.: Bayesian Parameter Estimation
A.K.: The Holy Trinity
A.E.: Bayesian Methods
J.H.: Euro Coin
Introduction to Bayesian Parameter Estimation
• In Chapter 2, we learned how to design an optimal classifier if we knew the
prior probabilities, P(ωi), and class-conditional densities, p(x|ωi).
• Bayes: treat the parameters as random variables having some known prior
distribution. Observation of samples converts this to a posterior.
• Bayesian learning: sharpen the a posteriori density, causing it to peak near
the true value.
• Supervised vs. unsupervised: do we know the class assignments of the
training data?
• Bayesian estimation and ML estimation produce very similar results in many
cases.
• Reduces statistical inference (prior knowledge or beliefs about the world) to
probabilities.
ECE 8527: Lecture 05, Slide 1
Class-Conditional Densities
• Posterior probabilities, P(ωi|x), are central to Bayesian classification.
• Bayes formula allows us to compute P(ωi|x) from the priors, P(ωi), and the
likelihood, p(x|ωi).
• But what if the priors and class-conditional densities are unknown?
• The answer is that we can compute the posterior, P(ωi|x), using all of the
information at our disposal (e.g., training data).
• For a training set, D, Bayes formula becomes:
P(i | x, D) 
likelihood  prior

evidence
p(x i , D) P(i D)
c
 p(x  j , D) P( j D)
j 1
• We assume priors are known: P(ωi|D) = P(ωi).
• Also, assume functional independence: Di has no influence on p(x | ωj, Dj) if i ≠ j.
This gives:
P(ωi | x, D) = p(x | ωi, Di) P(ωi) / Σ_{j=1}^{c} p(x | ωj, Dj) P(ωj)
ECE 8527: Lecture 05, Slide 2
The Parameter Distribution
• Assume the parametric form of the evidence, p(x), is known: p(x|θ).
• Any information we have about θ prior to collecting samples is contained in a
known prior density p(θ).
• Observation of samples converts this to a posterior, p(θ|D), which we hope is
peaked around the true value of θ.
• Our goal is to estimate a parameter vector:
p(x | D) = ∫ p(x, θ | D) dθ
• We can write the joint distribution as a product:
p(x | D) = ∫ p(x | θ, D) p(θ | D) dθ = ∫ p(x | θ) p(θ | D) dθ
because the samples are drawn independently.
• This equation links the class-conditional density, p(x | D),
to the posterior, p(θ | D). But numerical solutions are typically required!
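• A minimal numerical sketch (an addition, not from the original slides) of this integral, assuming a univariate Gaussian likelihood with known σ and a Gaussian prior on the mean; the posterior is evaluated on a grid and the integral approximated by a Riemann sum:
```python
import numpy as np
from scipy.stats import norm

# Sketch (assumed model: univariate Gaussian likelihood with known sigma,
# Gaussian prior on the unknown mean theta). Evaluate p(theta | D) on a grid,
# then approximate p(x | D) = integral of p(x | theta) p(theta | D) dtheta.

sigma, mu0, sigma0 = 1.0, 0.0, 2.0            # assumed model constants
D = np.array([0.8, 1.2, 1.0, 1.4, 0.9])       # hypothetical samples

theta = np.linspace(-5.0, 5.0, 2001)          # grid over the unknown mean
dtheta = theta[1] - theta[0]

# Unnormalized posterior: prod_k p(x_k | theta) * p(theta)
log_post = norm.logpdf(D[:, None], loc=theta, scale=sigma).sum(axis=0) \
           + norm.logpdf(theta, loc=mu0, scale=sigma0)
post = np.exp(log_post - log_post.max())
post /= post.sum() * dtheta                   # normalize so it integrates to 1

# Predictive density p(x | D) at a few query points
for x in (0.0, 1.0, 2.0):
    p_x = np.sum(norm.pdf(x, loc=theta, scale=sigma) * post) * dtheta
    print(x, p_x)
```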
ECE 8527: Lecture 05, Slide 3
Univariate Gaussian Case
• Case: only mean unknown: p(x | μ) ~ N(μ, σ²)
• Known prior density: p(μ) ~ N(μ0, σ0²)
• Using Bayes formula:
p(μ | D) = p(D | μ) p(μ) / p(D)
= p(D | μ) p(μ) / ∫ p(D | μ) p(μ) dμ
= α [ p(D | μ) p(μ) ]
= α Π_{k=1}^{n} p(x_k | μ) p(μ)
• Rationale: Once a value of μ is known, the density for x is completely known. α is a normalization factor that depends on the data, D.
ECE 8527: Lecture 05, Slide 4
Univariate Gaussian Case
• Applying our Gaussian assumptions:
p(μ | D) = α Π_{k=1}^{n} [ (1/(√(2π) σ)) exp( -½ ((x_k - μ)/σ)² ) ] · (1/(√(2π) σ0)) exp( -½ ((μ - μ0)/σ0)² )
= α′ exp( -½ [ Σ_{k=1}^{n} ((μ - x_k)/σ)² + ((μ - μ0)/σ0)² ] )
= α″ exp( -½ [ (n/σ² + 1/σ0²) μ² - 2 ( (1/σ²) Σ_{k=1}^{n} x_k + μ0/σ0² ) μ ] )
where factors that do not depend on μ (the x_k² and μ0² terms) have been absorbed into the normalization constants α′ and α″.
ECE 8527: Lecture 05, Slide 5
Univariate Gaussian Case (Cont.)
• Now we need to work this into a simpler form:
p(μ | D) = α′ exp( -½ [ Σ_{k=1}^{n} ((μ - x_k)/σ)² + ((μ - μ0)/σ0)² ] )
= α″ exp( -½ [ (n/σ² + 1/σ0²) μ² - 2 ( (1/σ²) Σ_{k=1}^{n} x_k + μ0/σ0² ) μ ] )
where μ̂n = (1/n) Σ_{k=1}^{n} x_k (the sample mean), so that
p(μ | D) = α″ exp( -½ [ (n/σ² + 1/σ0²) μ² - 2 ( (n/σ²) μ̂n + μ0/σ0² ) μ ] )
ECE 8527: Lecture 05, Slide 6
Univariate Gaussian Case (Cont.)
• p(μ|D) is an exponential of a quadratic function, which makes it a normal
distribution. Because this is true for any n, it is referred to as a reproducing
density.
• p(μ) is referred to as a conjugate prior.
• Write p(μ|D) ~ N(μn, σn²):
p(μ | D) = (1/(√(2π) σn)) exp( -½ ((μ - μn)/σn)² )
• Expand the quadratic term:
p(μ | D) = (1/(√(2π) σn)) exp( -½ (μ² - 2 μ μn + μn²)/σn² )
• Equate coefficients of our two functions:
(1/(√(2π) σn)) exp( -½ (μ² - 2 μ μn + μn²)/σn² ) = α″ exp( -½ [ (n/σ² + 1/σ0²) μ² - 2 ( (n/σ²) μ̂n + μ0/σ0² ) μ ] )
ECE 8527: Lecture 05, Slide 7
Univariate Gaussian Case (Cont.)
• Rearrange terms so that the dependencies on μ are clear:
(1/(√(2π) σn)) exp( -½ μn²/σn² ) exp( -½ [ (1/σn²) μ² - 2 (μn/σn²) μ ] ) = α″ exp( -½ [ (n/σ² + 1/σ0²) μ² - 2 ( (n/σ²) μ̂n + μ0/σ0² ) μ ] )
• Associate terms related to σ² and μ:
1/σn² = n/σ² + 1/σ0²
μn/σn² = (n/σ²) μ̂n + μ0/σ0²
• There is actually a third equation involving terms not related to μ:
(1/(√(2π) σn)) exp( -½ μn²/σn² ) = α (1/(√(2π) σ))^n (1/(√(2π) σ0)) exp( -½ [ (1/σ²) Σ_{k=1}^{n} x_k² + μ0²/σ0² ] )
but we can ignore this since it is not a function of μ and is a complicated
equation to solve.
ECE 8527: Lecture 05, Slide 8
Univariate Gaussian Case (Cont.)
• Two equations and two unknowns. Solve for μn and σn². First, solve for σn²:
1/σn² = n/σ² + 1/σ0²  ⇒  σn² = σ² σ0² / (n σ0² + σ²)
• Next, solve for μn:
μn = σn² ( (n/σ²) μ̂n + (1/σ0²) μ0 )
= ( σ² σ0² / (n σ0² + σ²) ) ( (n/σ²) μ̂n + (1/σ0²) μ0 )
= ( n σ0² / (n σ0² + σ²) ) μ̂n + ( σ² / (n σ0² + σ²) ) μ0
• Summarizing:
μn = ( n σ0² / (n σ0² + σ²) ) μ̂n + ( σ² / (n σ0² + σ²) ) μ0
σn² = σ² σ0² / (n σ0² + σ²)
ECE 8527: Lecture 05, Slide 9




Bayesian Learning
• μn represents our best guess after n samples.
• σn² represents our uncertainty about this guess.
• σn² approaches σ²/n for large n – each additional observation decreases our
uncertainty.
• The posterior, p(μ|D), becomes more sharply peaked as n grows large. This is
known as Bayesian learning.
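• A small sketch (an addition, not from the slides) of the closed-form update derived above, using assumed values; note how σn² shrinks toward σ²/n as n grows:
```python
import numpy as np

# Sketch (assumed values): closed-form Bayesian update of the mean of a
# univariate Gaussian with known variance sigma**2, using mu_n and sigma_n**2.

def bayes_gaussian_mean(D, sigma, mu0, sigma0):
    """Return (mu_n, sigma_n_sq) for samples D."""
    n = len(D)
    mu_hat = np.mean(D)
    mu_n = (n * sigma0**2 / (n * sigma0**2 + sigma**2)) * mu_hat \
         + (sigma**2 / (n * sigma0**2 + sigma**2)) * mu0
    sigma_n_sq = (sigma**2 * sigma0**2) / (n * sigma0**2 + sigma**2)
    return mu_n, sigma_n_sq

rng = np.random.default_rng(0)
true_mu, sigma = 1.5, 1.0
for n in (1, 10, 100, 1000):
    D = rng.normal(true_mu, sigma, size=n)
    mu_n, var_n = bayes_gaussian_mean(D, sigma, mu0=0.0, sigma0=2.0)
    print(n, round(mu_n, 3), round(var_n, 5))   # var_n approaches sigma**2 / n
```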
ECE 8527: Lecture 05, Slide 10
Class-Conditional Density
• How do we obtain p(x|D)? (The derivation is tedious.)
p(x | D) = ∫ p(x | μ) p(μ | D) dμ
= ∫ (1/(√(2π) σ)) exp( -½ ((x - μ)/σ)² ) (1/(√(2π) σn)) exp( -½ ((μ - μn)/σn)² ) dμ
= (1/(2π σ σn)) exp( -½ (x - μn)²/(σ² + σn²) ) f(σ, σn)
where:
f(σ, σn) = ∫ exp( -½ ((σ² + σn²)/(σ² σn²)) ( μ - (σn² x + σ² μn)/(σ² + σn²) )² ) dμ
• Note that:
p(x | D) ~ N(μn, σ² + σn²)
• The conditional mean, μn, is treated as the true mean.
• p(x|D) and P(ωj) can be used to design the classifier.
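• A quick sanity check (an addition, with assumed values): drawing μ from p(μ|D) and then x from p(x|μ) should reproduce draws from N(μn, σ² + σn²).
```python
import numpy as np

# Sketch with assumed values: verify p(x | D) ~ N(mu_n, sigma**2 + sigma_n_sq)
# by ancestral sampling: mu ~ p(mu | D), then x ~ p(x | mu).
rng = np.random.default_rng(1)
mu_n, sigma_n_sq, sigma = 1.2, 0.05, 1.0

mu = rng.normal(mu_n, np.sqrt(sigma_n_sq), size=200_000)
x = rng.normal(mu, sigma)

print(x.mean(), x.var())        # approx mu_n and sigma**2 + sigma_n_sq
```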
ECE 8527: Lecture 05, Slide 11
Multivariate Case
• Assume: p(x | μ) ~ N(μ, Σ) and p(μ) ~ N(μ0, Σ0), where Σ, μ0, and Σ0 are assumed to be known.
• Applying Bayes formula:
p(μ | D) = α Π_{k=1}^{n} p(x_k | μ) p(μ)
= α′ exp( -½ [ μ^t (n Σ^{-1} + Σ0^{-1}) μ - 2 μ^t ( Σ^{-1} Σ_{k=1}^{n} x_k + Σ0^{-1} μ0 ) ] )
which has the form:
p(μ | D) = α″ exp( -½ (μ - μn)^t Σn^{-1} (μ - μn) )
• Once again: p(μ | D) ~ N(μn, Σn), and we have a reproducing density.
ECE 8527: Lecture 05, Slide 12
Estimation Equations
• Equating coefficients between the two Gaussians:
Σn^{-1} = n Σ^{-1} + Σ0^{-1}
Σn^{-1} μn = n Σ^{-1} μ̂n + Σ0^{-1} μ0
where μ̂n = (1/n) Σ_{k=1}^{n} x_k
• The solution to these equations is:
μn = Σ0 ( Σ0 + (1/n) Σ )^{-1} μ̂n + (1/n) Σ ( Σ0 + (1/n) Σ )^{-1} μ0
Σn = Σ0 ( Σ0 + (1/n) Σ )^{-1} (1/n) Σ
• It also follows that: p(x | D) ~ N(μn, Σ + Σn)
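• A minimal sketch (an addition, using assumed toy values) of the multivariate estimation equations above:
```python
import numpy as np

# Sketch (assumed toy values): multivariate Bayesian estimates mu_n and
# Sigma_n for a Gaussian with known covariance Sigma and prior N(mu0, Sigma0).

def bayes_gaussian_mean_mv(X, Sigma, mu0, Sigma0):
    """X is an (n, d) array of samples; returns (mu_n, Sigma_n)."""
    n = X.shape[0]
    mu_hat = X.mean(axis=0)
    A = np.linalg.inv(Sigma0 + Sigma / n)       # (Sigma0 + (1/n) Sigma)^{-1}
    mu_n = Sigma0 @ A @ mu_hat + (Sigma / n) @ A @ mu0
    Sigma_n = Sigma0 @ A @ (Sigma / n)
    return mu_n, Sigma_n

rng = np.random.default_rng(2)
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
X = rng.multivariate_normal([1.0, -1.0], Sigma, size=50)
mu_n, Sigma_n = bayes_gaussian_mean_mv(X, Sigma, mu0=np.zeros(2), Sigma0=np.eye(2))
print(mu_n)
print(Sigma + Sigma_n)      # covariance of the predictive density p(x | D)
```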
ECE 8527: Lecture 05, Slide 13
General Theory
• p(x | D) computation can be applied to any situation in which the unknown
density can be parameterized.
• The basic assumptions are:
 The form of p(x | θ) is assumed known, but the value of θ is not known
exactly.
 Our knowledge about θ is assumed to be contained in a known prior
density p(θ).
 The rest of our knowledge about θ is contained in a set D of n random
variables x1, x2, …, xn drawn independently according to the unknown
probability density function p(x).
ECE 8527: Lecture 05, Slide 14
Formal Solution
• The desired density, p(x|D), is given by:
p(x | D) = ∫ p(x | θ) p(θ | D) dθ
• Using Bayes formula, we can write p(θ|D) as:
p(θ | D) = p(D | θ) p(θ) / ∫ p(D | θ) p(θ) dθ
• and by the independence assumption:
p(D | θ) = Π_{k=1}^{n} p(x_k | θ)
• This constitutes the formal solution to the problem because we have an
expression for the probability of the data given the parameters.
• This also illuminates the relation to the maximum likelihood estimate:
• Suppose p(D|θ) reaches a sharp peak at θ = θ̂.
ECE 8527: Lecture 05, Slide 15
Comparison to Maximum Likelihood
• This also illuminates the relation to the maximum likelihood estimate:
 Suppose p(D|θ) reaches a sharp peak at θ = θ̂.
 p(θ|D) will also peak at the same place if p(θ) is well-behaved.
 p(x|D) will be approximately p(x|θ̂), which is the ML result.
 If the peak of p(D| θ ) is very sharp, then the influence of prior information
on the uncertainty of the true value of θ can be ignored.
 However, the Bayes solution tells us how to use all of the available
information to compute the desired density p(x|D).
ECE 8527: Lecture 05, Slide 16
Recursive Bayes Incremental Learning
• To indicate explicitly the dependence on the number of samples, let:
D^n = {x_1, x_2, ..., x_n}
• We can then write our expression for p(D^n|θ):
p(D^n | θ) = p(x_n | θ) p(D^{n-1} | θ)
where D^0 denotes the empty data set.
• We can write the posterior density using a recursive relation:
p(θ | D^n) = p(D^n | θ) p(θ) / ∫ p(D^n | θ) p(θ) dθ
= p(x_n | θ) p(θ | D^{n-1}) / ∫ p(x_n | θ) p(θ | D^{n-1}) dθ
• where p(θ | D^0) = p(θ).
• This is called Recursive Bayes Incremental Learning because it gives us a
method for incrementally updating our estimates.
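• A minimal sketch (an addition, assuming a univariate Gaussian likelihood with known σ and a Gaussian prior on the mean) of the recursive update; after n samples it reproduces the batch results for μn and σn²:
```python
import numpy as np

# Sketch (assumed model: univariate Gaussian likelihood with known sigma,
# Gaussian prior on the mean). Each new sample updates the posterior
# N(mu, var) recursively, one observation at a time.

def recursive_update(x, mu_prev, var_prev, sigma):
    """Fold one new sample x into the current posterior N(mu_prev, var_prev)."""
    var_new = 1.0 / (1.0 / var_prev + 1.0 / sigma**2)
    mu_new = var_new * (mu_prev / var_prev + x / sigma**2)
    return mu_new, var_new

rng = np.random.default_rng(3)
sigma, mu, var = 1.0, 0.0, 4.0          # prior: N(0, 4)
for x in rng.normal(1.5, sigma, size=20):
    mu, var = recursive_update(x, mu, var, sigma)
print(mu, var)                           # posterior after 20 samples
```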
ECE 8527: Lecture 05, Slide 17
When do ML and Bayesian Estimation Differ?
• For infinite amounts of data, the solutions converge. However, limited data
is always a problem.
• If prior information is reliable, a Bayesian estimate can be superior.
• Bayesian estimates for uniform priors are similar to an ML solution.
• If p(θ| D) is broad or asymmetric around the true value, the approaches are
likely to produce different solutions.
• When designing a classifier using these techniques, there are three
sources of error:
 Bayes Error: the error due to overlapping distributions
 Model Error: the error due to an incorrect model or incorrect assumption
about the parametric form.
 Estimation Error: the error arising from the fact that the parameters are
estimated from a finite amount of data.
ECE 8527: Lecture 05, Slide 18
Noninformative Priors and Invariance
• The information about the prior is based on the designer’s knowledge of the
problem domain.
• We expect the prior distributions to be “translation and scale invariant” –
they should not depend on the actual value of the parameter.
• A prior that satisfies this property is referred to as a
“noninformative prior”:
 The Bayesian approach remains applicable even when little or no prior
information is available.
 Such situations can be handled by choosing a prior density giving equal
weight to all possible values of θ.
 Priors that seemingly impart no prior preference, the so-called
noninformative priors, also arise when the prior is required to be invariant
under certain transformations.
 Frequently, the desire to treat all possible values of θ equitably leads to
priors with infinite mass. Such noninformative priors are called improper
priors.
ECE 8527: Lecture 05, Slide 19
Example of Noninformative Priors
• For example, if we assume the prior distribution of the mean of a continuous
random variable is independent of the choice of origin, the only prior
that could satisfy this is a uniform distribution over the entire real line
(which is improper, since it cannot be normalized).
• Consider a parameter σ, and a transformation of this variable to a
new variable, σ̃ = ln(σ). Suppose we also scale by a positive constant:
σ̂ = ln(aσ) = ln(a) + ln(σ). A noninformative prior on σ is the inverse distribution
p(σ) = 1/σ, which is also improper.
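• A short worked check (an addition, not on the original slide) of why p(σ) = 1/σ is scale invariant: under the change of variables σ′ = aσ, the density keeps the same functional form.
```latex
% Scale invariance of the improper prior p(\sigma) \propto 1/\sigma:
% let \sigma' = a\sigma with a > 0, so \sigma = \sigma'/a and d\sigma = d\sigma'/a.
\[
  p(\sigma)\,d\sigma = \frac{1}{\sigma}\,d\sigma
                     = \frac{a}{\sigma'}\cdot\frac{d\sigma'}{a}
                     = \frac{1}{\sigma'}\,d\sigma' .
\]
% Equivalently, \tilde{\sigma} = \ln\sigma has a translation-invariant
% (uniform) prior, since d\tilde{\sigma} = d\sigma/\sigma.
```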
ECE 8527: Lecture 05, Slide 20
Sufficient Statistics
• Direct computation of p(D|θ) and p(θ|D) for large data sets is challenging
(e.g., neural networks).
• We need a parametric form for p(x|θ) (e.g., Gaussian).
• Gaussian case: computation of the sample mean and covariance, which was
straightforward, contained all the information relevant to estimating the
unknown population mean and covariance.
• This property exists for other distributions.
• A sufficient statistic is a function s of the samples D that contains all the
information relevant to a parameter, θ.
• A statistic, s, is said to be sufficient for θ if p(D|s,θ) is independent of θ:
p ( | s , D ) 
p( D | s, ) p( | s)
 p ( | s )
p( D | s )
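• A small numerical illustration (an addition, assuming a univariate Gaussian model with known σ and a flat grid prior): two data sets with the same size and the same sample mean yield identical posteriors over the unknown mean, which is exactly what sufficiency of the sample mean means.
```python
import numpy as np
from scipy.stats import norm

# Sketch (assumed Gaussian model with known sigma, flat grid prior): the
# theta-dependence of p(D | theta) enters only through the sample mean, so
# two data sets with equal n and equal sample mean give identical posteriors.
sigma = 1.0
theta = np.linspace(-3, 3, 1201)

def posterior(D):
    log_like = norm.logpdf(np.asarray(D)[:, None], loc=theta, scale=sigma).sum(axis=0)
    p = np.exp(log_like - log_like.max())
    return p / p.sum()

D_a = [0.0, 1.0, 2.0]          # sample mean 1.0
D_b = [0.9, 1.0, 1.1]          # same n, same sample mean
print(np.allclose(posterior(D_a), posterior(D_b)))   # True
```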
ECE 8527: Lecture 05, Slide 21
The Factorization Theorem
• Theorem: A statistic, s, is sufficient for θ if and only if p(D|θ) can be written
as: p(D | θ) = g(s, θ) h(D).
• There are many ways to formulate sufficient statistics
(e.g., define a vector of the samples themselves).
• Useful only when the function g() and the sufficient statistic are simple
(e.g., sample mean calculation).
• The factoring of p(D|θ) is not unique:
g(s, )  f (s) g (s, )
h( D)  h( D) / f (s)
• Define a kernel density invariant to scaling:
g̃(s, θ) = g(s, θ) / ∫ g(s, θ′) dθ′
• Significance: most practical applications of parameter estimation involve
simple sufficient statistics and simple kernel densities.
ECE 8527: Lecture 05, Slide 22
Gaussian Distributions
p(D | θ) = Π_{k=1}^{n} (1/((2π)^{d/2} |Σ|^{1/2})) exp( -½ (x_k - θ)^t Σ^{-1} (x_k - θ) )
= (1/((2π)^{nd/2} |Σ|^{n/2})) exp( -½ Σ_{k=1}^{n} [ θ^t Σ^{-1} θ - 2 θ^t Σ^{-1} x_k + x_k^t Σ^{-1} x_k ] )
= exp( -½ [ n θ^t Σ^{-1} θ - 2 θ^t Σ^{-1} Σ_{k=1}^{n} x_k ] ) × (1/((2π)^{nd/2} |Σ|^{n/2})) exp( -½ Σ_{k=1}^{n} x_k^t Σ^{-1} x_k )
• This isolates the θ dependence in the first term, and hence, the sample mean
is a sufficient statistic using the Factorization Theorem.
• The kernel is:
g̃(μ̂n, θ) = (1/((2π)^{d/2} |(1/n)Σ|^{1/2})) exp( -½ (θ - μ̂n)^t ((1/n)Σ)^{-1} (θ - μ̂n) )
ECE 8527: Lecture 05, Slide 23
The Exponential Family
• This can be generalized: p(x | θ) = α(x) exp( a(θ) + b(θ)^t c(x) )
and: p(D | θ) = exp( n a(θ) + b(θ)^t Σ_{k=1}^{n} c(x_k) ) Π_{k=1}^{n} α(x_k) = g(s, θ) h(D)
• Examples:
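• As one concrete illustration (an addition here, not necessarily the table on the original slide), the univariate Gaussian with known variance σ² and unknown mean θ fits this form:
```latex
% Univariate Gaussian with known variance \sigma^2 and unknown mean \theta:
%   p(x \mid \theta) = \frac{1}{\sqrt{2\pi}\,\sigma}
%                      \exp\!\left(-\frac{(x-\theta)^2}{2\sigma^2}\right)
% matches \alpha(x)\exp\big(a(\theta) + b(\theta)\,c(x)\big) with
\[
  \alpha(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{x^2}{2\sigma^2}\right),
  \quad a(\theta) = -\frac{\theta^2}{2\sigma^2},
  \quad b(\theta) = \frac{\theta}{\sigma^2},
  \quad c(x) = x ,
\]
% so the sufficient statistic is s = \sum_{k=1}^{n} c(x_k) = \sum_{k=1}^{n} x_k.
```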
ECE 8527: Lecture 05, Slide 24
Summary
• Introduction of Bayesian parameter estimation.
• The role of the class-conditional distribution in a Bayesian estimate.
• Estimation of the posterior and probability density function assuming the only
unknown parameter is the mean, and the conditional density of the “features”
given the mean, p(x|θ), can be modeled as a Gaussian distribution.
• Bayesian estimates of the mean for the multivariate Gaussian case.
• General theory for Bayesian estimation.
• Comparison to maximum likelihood estimates.
• Recursive Bayesian incremental learning.
• Noninformative priors.
• Sufficient statistics.
• Kernel density.
ECE 8527: Lecture 05, Slide 25
“The Euro Coin”
• Getting ahead a bit, let’s see how we can put these ideas to work on a simple
example due to David MacKay, and explained by Jon Hamaker.
ECE 8527: Lecture 05, Slide 26