Bayesian Decision Theory

Tutorial 6
• Bias and variance of estimators
• The score and Fisher information
• Cramer-Rao inequality
236607 Visual Recognition Tutorial
Estimators and their Properties
• Let $\{p(x\mid\theta)\}$, $\theta\in\Theta$, be a parametric set of distributions.
  Given a sample $D = x^{(n)} = (x_1,\dots,x_n)$ drawn i.i.d. from one of the
  distributions in the set, we would like to estimate its parameter $\theta$
  (thus identifying the distribution).
• An estimator for $\theta$ w.r.t. $D$ is any function $T(D)\in\Theta$;
  notice that an estimator is a random variable.
• How do we measure the quality of an estimator?
• Consistency: an estimator $T$ for $\theta$ is consistent if
  $T(x^{(n)}) \xrightarrow{p} \theta$ as $n\to\infty$.
  This is a (desirable) asymptotic property that motivates us to acquire large
  samples. But we should emphasize that we are also interested in quality measures
  for finite (and small!) sample sizes.
Estimators and their Properties
• Bias: define the bias of an estimator $\hat\theta$ to be
  $b(\hat\theta) = E_\theta[\hat\theta] - \theta$.
  Here the expectation is w.r.t. the distribution $p(x\mid\theta)$.
  The estimator is unbiased if its bias is zero, $b(\hat\theta) = 0$.
• Example: the estimators $x_1$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ for the mean
  of a normal distribution are both unbiased.
  The estimator $\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2$ for its variance is biased,
  whereas the estimator $\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$ is unbiased
  (see the numerical check below).
• Variance: another important property of an estimator is its variance
  $\mathrm{var}_{p(x\mid\theta)}(\hat\theta)$. We would like to find estimators with
  minimum bias and variance.
• Which is more important, bias or variance?
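A quick numerical check makes the bias in the variance example concrete. The following is a minimal Python/NumPy sketch (the sample size, variance, and number of trials are arbitrary choices, not from the tutorial) that averages both variance estimators over many simulated samples:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, n, trials = 0.0, 4.0, 10, 200_000

biased, unbiased = [], []
for _ in range(trials):
    x = rng.normal(mu, np.sqrt(sigma2), size=n)
    ss = np.sum((x - x.mean()) ** 2)
    biased.append(ss / n)          # divides by n: E = (n-1)/n * sigma^2
    unbiased.append(ss / (n - 1))  # divides by n-1: E = sigma^2

print("true variance            :", sigma2)
print("mean of 1/n estimator     :", np.mean(biased))    # ~ 3.6
print("mean of 1/(n-1) estimator :", np.mean(unbiased))  # ~ 4.0
```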
Risky Estimators
• Employ our decision-theoretic framework to measure the
quality of estimators.
• Abbreviate $\hat\theta = T(x^{(n)})$ and consider the squared-error loss function
  $\lambda(\hat\theta,\theta) = (\hat\theta - \theta)^2$.
• The conditional risk associated with $\hat\theta$ when $\theta$ is the true parameter:
  $R(\hat\theta\mid\theta) = E\big[(\hat\theta - \theta)^2\big]
   = \int (\hat\theta - \theta)^2\, p(x^{(n)}\mid\theta)\, dx^{(n)}$
• Claim: $R(\hat\theta\mid\theta) = \mathrm{var}(\hat\theta) + b(\hat\theta)^2$
  (variance + bias$^2$; checked numerically below).
• Proof:
  $E\big[(\hat\theta - \theta)^2\big] = E\big[(\hat\theta - E\hat\theta + E\hat\theta - \theta)^2\big]$
  $= E\big[(\hat\theta - E\hat\theta)^2\big] + 2\,E\big[\hat\theta - E\hat\theta\big](E\hat\theta - \theta) + (E\hat\theta - \theta)^2$
  $= E\big[(\hat\theta - E\hat\theta)^2\big] + (E\hat\theta - \theta)^2 =$ variance + bias$^2$,
  since the cross term vanishes ($E[\hat\theta - E\hat\theta] = 0$).
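The decomposition can be verified by Monte Carlo. A minimal sketch (assuming Gaussian data and using the biased $1/n$ variance estimator as $\hat\theta$; all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, n, trials = 4.0, 10, 200_000

# theta-hat = (1/n) * sum (x_i - xbar)^2, a biased estimator of sigma^2
est = np.array([np.var(rng.normal(0.0, np.sqrt(sigma2), n)) for _ in range(trials)])

risk = np.mean((est - sigma2) ** 2)                      # E[(theta_hat - theta)^2]
var_plus_bias2 = est.var() + (est.mean() - sigma2) ** 2  # variance + bias^2

print(risk, var_plus_bias2)  # the two numbers should agree closely
```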
Bias vs. Variance
• So, for a given level of conditional risk, there is a tradeoff
between bias and variance.
• This tradeoff is among the most important facts in pattern
recognition and machine learning.
• Classical approach: Consider only unbiased estimators and
try to find those with minimum possible variance.
• This approach is not always fruitful:
– Unbiasedness only means that the average of the estimator (w.r.t. $p(x\mid\theta)$)
  is $\theta$. It does not mean the estimate will be near $\theta$ for a particular
  sample (if the variance is large).
– In general, an unbiased estimator is not even guaranteed to exist.
The Score
• The score $v$ of the family $p(x\mid\theta)$ is the random variable
  $v = \frac{\partial}{\partial\theta}\ln p(x\mid\theta)
     = \frac{\frac{\partial}{\partial\theta}p(x\mid\theta)}{p(x\mid\theta)}$.
  It measures the "sensitivity" of $p(x\mid\theta)$ as a function of the parameter $\theta$.
• Claim: $E[v] = 0$.
• Proof:
  $E[v] = \int \frac{\frac{\partial}{\partial\theta}p(x\mid\theta)}{p(x\mid\theta)}\, p(x\mid\theta)\, dx
        = \int \frac{\partial}{\partial\theta}p(x\mid\theta)\, dx
        = \frac{\partial}{\partial\theta}\int p(x\mid\theta)\, dx
        = \frac{\partial}{\partial\theta}\,1 = 0$
• Corollary: $\mathrm{var}[v] = E\big[(v - E[v])^2\big] = E[v^2]$
The Score - Example
• Consider the normal distribution $N(\mu, 1)$:
  $p(x\mid\mu) = \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{1}{2}(x-\mu)^2\right)$
  $\ln p(x\mid\mu) = -\frac{1}{2}\ln(2\pi) - \frac{1}{2}(x-\mu)^2$
• Clearly, $v = \frac{\partial}{\partial\mu}\ln p(x\mid\mu) = x - \mu$
• and $E[v] = E[x-\mu] = E[x] - \mu = 0$,
  $\mathrm{var}(v) = E[v^2] = E\big[(x-\mu)^2\big] = \sigma^2 = 1$
  (see the numerical check below).
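As a quick sanity check (a sketch with an arbitrary $\mu$ and sample size, not part of the tutorial), one can draw from $N(\mu,1)$, evaluate the score $v = x - \mu$, and confirm that its sample mean is near 0 and its sample variance near 1:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, m = 1.5, 1_000_000

x = rng.normal(mu, 1.0, size=m)
v = x - mu                 # score of N(mu, 1), evaluated at the true mu

print(v.mean())            # ~ 0   (E[v] = 0)
print(v.var())             # ~ 1   (var(v) = 1)
```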
The Score - Vector Form
• In the case where $\theta = (\theta_1,\dots,\theta_k)$ is a vector, the score $v$ is
  the vector whose $i$-th component is
  $v_i = \frac{\partial}{\partial\theta_i}\ln p(x\mid\theta)$
• Example: for $N(\mu,\sigma^2)$ with $\theta = (\mu,\sigma)$,
  $p(x\mid\mu,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right)$
  $\ln p(x\mid\mu,\sigma) = -\frac{1}{2}\ln(2\pi) - \ln\sigma - \frac{1}{2\sigma^2}(x-\mu)^2$
  $\frac{\partial}{\partial\mu}\ln p(x\mid\mu,\sigma) = \frac{x-\mu}{\sigma^2}$
  $\frac{\partial}{\partial\sigma}\ln p(x\mid\mu,\sigma) = -\frac{1}{\sigma} + \frac{(x-\mu)^2}{\sigma^3}$
  $v = \left(\frac{x-\mu}{\sigma^2},\; -\frac{1}{\sigma} + \frac{(x-\mu)^2}{\sigma^3}\right)$
  (verified numerically below).
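To check the vector score, one can compare the analytic components above against a numerical derivative of $\ln p(x\mid\mu,\sigma)$. A minimal sketch (the evaluation point and step size are arbitrary choices):

```python
import numpy as np

def log_p(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi) - np.log(sigma) - (x - mu) ** 2 / (2 * sigma ** 2)

x, mu, sigma, eps = 0.7, 0.2, 1.3, 1e-6

# analytic score components
v_mu = (x - mu) / sigma ** 2
v_sigma = -1.0 / sigma + (x - mu) ** 2 / sigma ** 3

# central finite differences of ln p w.r.t. mu and sigma
d_mu = (log_p(x, mu + eps, sigma) - log_p(x, mu - eps, sigma)) / (2 * eps)
d_sigma = (log_p(x, mu, sigma + eps) - log_p(x, mu, sigma - eps)) / (2 * eps)

print(v_mu, d_mu)        # the two values should agree
print(v_sigma, d_sigma)  # the two values should agree
```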
Fisher Information
• Fisher information: designed to provide a measure of how much information the
  parametric probability law $p(x\mid\theta)$ carries about the parameter $\theta$.
• An adequate definition of such information should possess the following properties:
  – The larger the sensitivity of $p(x\mid\theta)$ to changes in $\theta$, the larger
    the information should be.
  – The information should be additive: the information carried by the combined law
    $p(x_1, x_2\mid\theta)$ should be the sum of those carried by $p(x_1\mid\theta)$
    and $p(x_2\mid\theta)$.
  – The information should be insensitive to the sign of the change in $\theta$, and
    preferably positive.
  – The information should be a deterministic quantity; it should not depend on the
    specific random observation.
Fisher Information
• Definition (scalar form): the Fisher information (about $\theta$) is the variance of
  the score,
  $J(\theta) = E\!\left[\left(\frac{\partial}{\partial\theta}\ln p(x\mid\theta)\right)^{2}\right]$
• Example: consider a random variable $x \sim N(\mu,\sigma^2)$:
  $\ln p(x\mid\mu,\sigma) = -\frac{1}{2}\ln(2\pi) - \ln\sigma - \frac{1}{2\sigma^2}(x-\mu)^2$
  $v = \frac{\partial}{\partial\mu}\ln p(x\mid\mu,\sigma) = \frac{x-\mu}{\sigma^2}$
  $J(\mu) = E[v^2] = E\!\left[\left(\frac{x-\mu}{\sigma^2}\right)^{2}\right]
          = \frac{1}{\sigma^4}\,E\big[(x-\mu)^2\big] = 1/\sigma^2$
  (see the numerical check below).
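Since $J(\mu)$ is the variance of the score, it can be estimated directly from simulated scores. A minimal sketch (the value of $\sigma$ is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, m = 0.0, 2.0, 1_000_000

x = rng.normal(mu, sigma, size=m)
v = (x - mu) / sigma ** 2        # score w.r.t. mu

print(np.mean(v ** 2))           # Monte Carlo estimate of J(mu) = E[v^2]
print(1.0 / sigma ** 2)          # analytic value 1/sigma^2 = 0.25
```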
Fisher Information - Cntd.
• Whenever $\theta = (\theta_1,\dots,\theta_k)$ is a vector, the Fisher information is
  the matrix $J(\theta) = \big[J_{i,j}(\theta)\big]$ where
  $J_{i,j}(\theta) = \mathrm{cov}\!\left(\frac{\partial}{\partial\theta_i}\log p(x\mid\theta),\;
                                         \frac{\partial}{\partial\theta_j}\log p(x\mid\theta)\right)$
• Reminder: $\mathrm{cov}(X,Y) = E\big[(X - E[X])(Y - E[Y])\big]$
• Remark: the Fisher information is only defined when the distributions $p(x\mid\theta)$
  satisfy certain regularity conditions (for example, they should be differentiable
  w.r.t. $\theta_i$, and all the distributions in the parametric family must have the
  same support set).
Fisher Information - Cntd.
• Claim: let $x^{(n)} = (x_1,\dots,x_n)$ be i.i.d. random variables $\sim p(x\mid\theta)$.
  The score of $p(x^{(n)}\mid\theta)$ is the sum of the individual scores.
• Proof:
  $v(x^{(n)}) = \frac{\partial}{\partial\theta}\ln p(x^{(n)}\mid\theta)
             = \frac{\partial}{\partial\theta}\ln\prod_i p(x_i\mid\theta)
             = \sum_i \frac{\partial}{\partial\theta}\ln p(x_i\mid\theta)
             = \sum_i v(x_i)$
• Example: if $x^{(n)} = (x_1,\dots,x_n)$ are i.i.d. $\sim N(\mu,\sigma^2)$, the score is
  $v(x^{(n)}) = \frac{\partial}{\partial\mu}\ln p(x^{(n)}\mid\mu,\sigma) = n\,\frac{\bar{x}-\mu}{\sigma^2}$
Fisher Information - Cntd.
• Based on $n$ i.i.d. samples, the Fisher information about $\theta$ is
  $J_n(\theta) = E\!\left[\left(\frac{\partial}{\partial\theta}\ln p(x^{(n)}\mid\theta)\right)^{2}\right]
              = E\big[v^2(x^{(n)})\big]
              = E\!\left[\Big(\sum_{i=1}^{n} v(x_i)\Big)^{2}\right]
              = \sum_{i=1}^{n} E\big[v^2(x_i)\big] = n\,J(\theta)$,
  where the cross terms vanish because the $v(x_i)$ are independent with zero mean.
• Thus, the Fisher information is additive w.r.t. i.i.d. random
variables.
• Example: suppose $x^{(n)} = (x_1,\dots,x_n)$ are i.i.d. $\sim N(\mu,\sigma^2)$. From the
  previous example we know that the Fisher information about the parameter $\mu$ based
  on one sample is $J(\mu) = 1/\sigma^2$. Therefore, based on the entire sample,
  $J_n(\mu) = n/\sigma^2$ (checked numerically below).
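Both facts, that the score of an i.i.d. sample is the sum of the individual scores and that $J_n(\mu) = n\,J(\mu)$, can be verified empirically. A minimal sketch under the same Gaussian assumptions ($n$ and $\sigma$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n, trials = 0.0, 2.0, 5, 200_000

X = rng.normal(mu, sigma, size=(trials, n))
v_n = np.sum((X - mu) / sigma ** 2, axis=1)   # sample score = sum of per-observation scores

print(np.mean(v_n ** 2))   # Monte Carlo estimate of J_n(mu) = E[v_n^2], ~ 1.25
print(n / sigma ** 2)      # analytic value n / sigma^2
```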
The Cramer-Rao Inequality
• Theorem: let $\hat\theta$ be an unbiased estimator for $\theta$. Then
  $\mathrm{var}(\hat\theta) \ge \frac{1}{J(\theta)}$
• Proof: using $E[v] = 0$ we have
  $E\big[(v - Ev)(\hat\theta - E\hat\theta)\big]
   = E\big[v(\hat\theta - E\hat\theta)\big]
   = E[v\hat\theta] - E\hat\theta\,E[v]
   = E[v\hat\theta]$
The Cramer-Rao Inequality - Cntd.
• Now
  $E[v\hat\theta] = \int \frac{\frac{\partial}{\partial\theta}p(x\mid\theta)}{p(x\mid\theta)}\,\hat\theta\, p(x\mid\theta)\, dx
   = \int \frac{\partial}{\partial\theta}p(x\mid\theta)\,\hat\theta\, dx
   = \frac{\partial}{\partial\theta}\int p(x\mid\theta)\,\hat\theta\, dx
   = \frac{\partial}{\partial\theta}E[\hat\theta]
   = \frac{\partial}{\partial\theta}\theta = 1$,
  where the last step uses the unbiasedness $E[\hat\theta] = \theta$.
The Cramer-Rao Inequality - Cntd.
• So, $E\big[(v - Ev)(\hat\theta - E\hat\theta)\big] = E[v\hat\theta] = 1$
• By the Cauchy-Schwarz inequality,
  $1 = \Big(E\big[(v - Ev)(\hat\theta - E\hat\theta)\big]\Big)^{2}
     \le E\big[(v - Ev)^{2}\big]\; E\big[(\hat\theta - E\hat\theta)^{2}\big]
     = E[v^2]\,\mathrm{var}(\hat\theta)
     = J(\theta)\,\mathrm{var}(\hat\theta)$
• Therefore,
  $\mathrm{var}(\hat\theta) \ge \frac{1}{J(\theta)}$
• For a biased estimator we have:
  $\mathrm{var}(\hat\theta) \ge \frac{\left(1 + \frac{\partial}{\partial\theta}\big(E[\hat\theta] - \theta\big)\right)^{2}}{J(\theta)}$
The Cramer-Rao General Case
• The Cramer-Rao inequality is also true in a general (vector) form: the error
  covariance matrix for $\hat{\boldsymbol{\theta}}$ is bounded as follows:
  $C = E\big[(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})^{t}\big] \ge J^{-1}(\boldsymbol{\theta})$
  (in the sense that $C - J^{-1}(\boldsymbol{\theta})$ is positive semidefinite).
The Cramer-Rao Inequality - Cntd.
• Example: let $x^{(n)} = (x_1,\dots,x_n)$ be i.i.d. $\sim N(\mu,\sigma^2)$. From the
  previous example, $J_n(\mu) = n/\sigma^2$.
• Now let $\hat\mu(x^{(n)}) = \frac{1}{n}\sum_{i=1}^{n} x_i$ be an (unbiased) estimator for $\mu$.
  $\mathrm{var}(\hat\mu) = E\big[(\hat\mu - E\hat\mu)^{2}\big] = E\big[(\hat\mu - \mu)^{2}\big]
   = E[\hat\mu^{2}] - 2\mu E[\hat\mu] + \mu^{2} = E[\hat\mu^{2}] - \mu^{2}$
  $E[\hat\mu^{2}] = \frac{1}{n^{2}}\, E\!\left[\Big(\sum_{i=1}^{n} x_i\Big)^{2}\right]
   = \frac{1}{n^{2}}\big(n\sigma^{2} + n^{2}\mu^{2}\big) = \mu^{2} + \sigma^{2}/n$
• So $\mathrm{var}(\hat\mu) = \sigma^{2}/n$, which matches the Cramer-Rao lower bound
  $1/J_n(\mu)$ (see the numerical comparison below).
• Def: an unbiased estimator whose covariance meets the Cramer-Rao lower bound is
  called efficient.
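A Monte Carlo comparison makes the bound tangible: the sample mean attains $1/J_n(\mu) = \sigma^2/n$, while another unbiased estimator of $\mu$, the sample median (unbiased here by symmetry), has a larger variance. A minimal sketch with arbitrary parameters:

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, n, trials = 1.0, 2.0, 25, 100_000

X = rng.normal(mu, sigma, size=(trials, n))
means = X.mean(axis=1)
medians = np.median(X, axis=1)

print("CR lower bound 1/J_n :", sigma ** 2 / n)   # = 0.16
print("var(sample mean)     :", means.var())      # ~ 0.16, meets the bound (efficient)
print("var(sample median)   :", medians.var())    # ~ (pi/2) * sigma^2 / n, above the bound
```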
Efficiency
• Theorem (Efficiency): the unbiased estimator $\hat{\boldsymbol{\theta}}$
  ($E\hat{\boldsymbol{\theta}} = \boldsymbol{\theta}$) is efficient, that is,
  $C = E\big[(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})^{t}\big] = J^{-1}(\boldsymbol{\theta})$,
  iff
  $J(\boldsymbol{\theta})(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}) = \boldsymbol{\nu}$
  (where $\boldsymbol{\nu}$ denotes the score vector).
• Proof (if): if $J(\boldsymbol{\theta})(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}) = \boldsymbol{\nu}$, then
  $E\big[J(\boldsymbol{\theta})(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})^{t} J^{t}(\boldsymbol{\theta})\big]
   = J(\boldsymbol{\theta})\,C\,J^{t}(\boldsymbol{\theta}) = E[\boldsymbol{\nu}\boldsymbol{\nu}^{t}] = J(\boldsymbol{\theta})$,
  meaning $C = J^{-1}(\boldsymbol{\theta})$.
Efficiency
• Only if: recall the cross covariance between $\boldsymbol{\nu}$ and
  $(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})$:
  $E\big[\boldsymbol{\nu}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})^{t}\big] = I$.
  The Cauchy-Schwarz inequality for random variables says
  $I = \Big(E\big[\boldsymbol{\nu}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})^{t}\big]\Big)^{2}
     \le E[\boldsymbol{\nu}\boldsymbol{\nu}^{t}]\; E\big[(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})^{t}\big] = J\,C$,
  and equality (which efficiency, $C = J^{-1}$, provides) holds iff the two variables
  are linearly related:
  $(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}) = \alpha\,\boldsymbol{\nu};\quad C = \alpha^{2} J;\quad \alpha = J^{-1};$
  thus
  $J(\boldsymbol{\theta})(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}) = \boldsymbol{\nu}$
Cramer-Rao Inequality and ML - Cntd.
• Theorem: suppose there exists an efficient estimator $\hat\theta$ for all $\theta$.
  Then the ML estimator $\hat\theta_{ML}$ is $\hat\theta$.
• Proof: by assumption, $\mathrm{var}(\hat\theta) = \frac{1}{J(\theta)}$ for all $\theta$.
  By the previous claim,
  $v = \frac{\partial}{\partial\theta}\log p(x\mid\theta) = J(\theta)(\hat\theta - \theta)$
  for all $\theta$. This holds in particular at $\theta = \hat\theta_{ML}$, and since
  this is a maximum point of the log-likelihood, the left side is zero, so
  $\hat\theta = \hat\theta_{ML}$ (a concrete instance follows).
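A worked instance (not from the original slides, but a direct consequence of the earlier Gaussian examples): for i.i.d. samples from $N(\mu,\sigma^2)$ the score is
$v(x^{(n)}) = n\,\frac{\bar{x}-\mu}{\sigma^2} = J_n(\mu)\,(\bar{x}-\mu)$,
so $\hat\mu = \bar{x}$ satisfies the efficiency condition $J(\theta)(\hat\theta-\theta)=\nu$; setting the score to zero gives $\hat\mu_{ML} = \bar{x}$, so the ML estimator coincides with the efficient estimator, as the theorem asserts.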