Tutorial 6
• Bias and variance of estimators
• The score and Fisher information
• Cramer-Rao inequality
Estimators and their Properties
• Let $\{p(x\,|\,\theta)\}_{\theta\in\Theta}$ be a parametric set of distributions.
  Given a sample $D = x^{(n)} = (x_1,\dots,x_n)$ drawn i.i.d. from one of
  the distributions in the set, we would like to estimate its
  parameter $\theta$ (thus identifying the distribution).
• An estimator for $\theta$ w.r.t. $D$ is any function $T(D)$;
  notice that an estimator is a random variable.
• How do we measure the quality of an estimator?
• Consistency: An estimator $T$ for $\theta$ is consistent if
  $T(x^{(n)}) \xrightarrow{\;p\;} \theta$ as $n \to \infty$.
  This is a (desirable) asymptotic property that motivates us
  to acquire large samples. But we should emphasize that we
  are also interested in measures for finite (and small!)
  sample sizes.
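• Consistency is easy to watch in practice: the sample mean of i.i.d. normal draws settles onto the true parameter as $n$ grows. A minimal sketch (not from the original slides), assuming NumPy and an arbitrarily chosen true mean of 2.0:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 2.0  # true (unknown) parameter, chosen arbitrarily for the demo

# The estimator T(x^(n)) = sample mean; consistency predicts T -> mu as n grows.
for n in [10, 100, 10_000, 1_000_000]:
    sample = rng.normal(loc=mu, scale=1.0, size=n)
    print(n, sample.mean())
# The printed estimates concentrate around mu = 2.0 as n increases.
```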
Estimators and their Properties
• Bias: Define the bias of an estimator $\hat\theta$ to be
  $b(\hat\theta) = E_\theta[\hat\theta] - \theta$.
  Here, the expectation is w.r.t. the distribution $p(x\,|\,\theta)$.
  The estimator is unbiased if its bias is zero: $b(\hat\theta) = 0$.
• Example: the estimators $x_1$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, for the mean
  of a normal distribution, are both unbiased.
  The estimator $\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2$ for its variance is biased,
  whereas the estimator $\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$ is unbiased.
• Variance: another important property of an estimator is its
  variance $\mathrm{var}_{p(x|\theta)}(\hat\theta)$. We would like to find estimators with
  minimum bias and variance.
• Which is more important, bias or variance?
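• The bias of the $1/n$ variance estimator shows up directly in simulation. A minimal sketch, assuming $N(0, 4)$ data (true variance 4) and $n = 5$; averaging each estimator over many samples approximates its expectation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 200_000  # true variance is 4.0

samples = rng.normal(loc=0.0, scale=2.0, size=(trials, n))
xbar = samples.mean(axis=1, keepdims=True)
ss = ((samples - xbar) ** 2).sum(axis=1)  # sum of squared deviations

print("E[(1/n) sum]     ~", (ss / n).mean())        # ~ 3.2 = (n-1)/n * 4, biased
print("E[(1/(n-1)) sum] ~", (ss / (n - 1)).mean())  # ~ 4.0, unbiased
```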
Risky Estimators
• Employ our decision-theoretic framework to measure the
quality of estimators.
• Abbreviate $\hat\theta = T(x^{(n)})$ and consider the squared-error loss
  function $\lambda(\hat\theta,\theta) = (\hat\theta-\theta)^2$.
• The conditional risk associated with $\hat\theta$ when $\theta$ is the true
  parameter is
  $R(\hat\theta\,|\,\theta) = E\big[(\hat\theta-\theta)^2\big] = \int (\hat\theta-\theta)^2\, p(x^{(n)}\,|\,\theta)\, dx^{(n)}$
• Claim: $R(\hat\theta\,|\,\theta) = \mathrm{var}(\hat\theta) + b(\hat\theta)^2$ (variance + bias$^2$)
• Proof:
  $E\big[(\hat\theta-\theta)^2\big] = E\big[(\hat\theta - E\hat\theta + E\hat\theta - \theta)^2\big]$
  $= E\big[(\hat\theta - E\hat\theta)^2\big] + 2\,E\big[\hat\theta - E\hat\theta\big](E\hat\theta - \theta) + (E\hat\theta - \theta)^2$
  $= \mathrm{var}(\hat\theta) + b(\hat\theta)^2$,
  since the cross term vanishes: $E[\hat\theta - E\hat\theta] = 0$.
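• The decomposition $R = \mathrm{var} + \mathrm{bias}^2$ can be verified numerically. A minimal sketch, assuming a deliberately biased estimator $\hat\theta = 0.8\,\bar{x}$ of the mean of $N(2, 1)$ data; both sides are estimated by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, trials = 2.0, 10, 500_000

samples = rng.normal(loc=theta, scale=1.0, size=(trials, n))
theta_hat = 0.8 * samples.mean(axis=1)  # a deliberately biased (shrunk) estimator

risk = ((theta_hat - theta) ** 2).mean()   # E[(theta_hat - theta)^2]
var = theta_hat.var()                      # var(theta_hat)
bias_sq = (theta_hat.mean() - theta) ** 2  # b(theta_hat)^2

print(risk, var + bias_sq)  # the two numbers agree up to Monte Carlo noise
```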
Bias vs. Variance
• So, for a given level of conditional risk, there is a tradeoff
between bias and variance.
• This tradeoff is among the most important facts in pattern
recognition and machine learning.
• Classical approach: Consider only unbiased estimators and
try to find those with minimum possible variance.
• This approach is not always fruitful:
– The unbiasedness only means that the average of the
  estimator (w.r.t. $p(x\,|\,\theta)$) is $\theta$. It doesn't mean it will
  be near $\theta$ for a particular sample (if the variance is large).
– In general, an unbiased estimate is not guaranteed to
exist.
The Score
• The score $v$ of the family $p(x\,|\,\theta)$ is the random variable
  $v = \frac{\partial}{\partial\theta}\ln p(x\,|\,\theta) = \frac{\partial p(x|\theta)/\partial\theta}{p(x\,|\,\theta)}$
  It measures the "sensitivity" of $p(x\,|\,\theta)$ as a function of the
  parameter $\theta$.
• Claim: $E[v] = 0$
• Proof:
  $E[v] = \int \frac{\partial p(x|\theta)/\partial\theta}{p(x\,|\,\theta)}\, p(x\,|\,\theta)\, dx = \int \frac{\partial p(x\,|\,\theta)}{\partial\theta}\, dx = \frac{\partial}{\partial\theta}\int p(x\,|\,\theta)\, dx = \frac{\partial}{\partial\theta}\, 1 = 0$
• Corollary: $\mathrm{var}[v] = E\big[(v - E[v])^2\big] = E[v^2]$
The Score - Example
• Consider the normal distribution $N(\mu, 1)$:
  $p(x\,|\,\mu) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}(x-\mu)^2\right)$
  $\ln p(x\,|\,\mu) = -\frac{1}{2}\ln(2\pi) - \frac{1}{2}(x-\mu)^2$
• Clearly, $v = \frac{\partial}{\partial\mu}\ln p(x\,|\,\mu) = x - \mu$
• and $E[v] = E[x - \mu] = E[x] - \mu = 0$,
  $\mathrm{var}(v) = E[v^2] = E[(x-\mu)^2] = \sigma^2 = 1$
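• Both facts ($E[v] = 0$ and $\mathrm{var}(v) = 1$) are easy to confirm empirically. A minimal sketch, assuming $\mu = 1.5$; the score $v = x - \mu$ is evaluated on draws from $N(\mu, 1)$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 1.5
x = rng.normal(loc=mu, scale=1.0, size=1_000_000)

v = x - mu  # score of N(mu, 1) at the true parameter
print("E[v]   ~", v.mean())  # ~ 0
print("var(v) ~", v.var())   # ~ 1
```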
The Score - Vector Form
• In the case where $\theta = (\theta_1,\dots,\theta_k)$ is a vector, the score $v$ is
  the vector whose $i$-th component is
  $v_i = \frac{\partial}{\partial\theta_i}\ln p(x\,|\,\theta)$
• Example: for $N(\mu,\sigma^2)$,
  $p(x\,|\,\mu,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right)$
  $\ln p(x\,|\,\mu,\sigma) = -\frac{1}{2}\ln(2\pi) - \ln\sigma - \frac{1}{2\sigma^2}(x-\mu)^2$
  $\frac{\partial}{\partial\mu}\ln p(x\,|\,\mu,\sigma) = \frac{x-\mu}{\sigma^2}$
  $\frac{\partial}{\partial\sigma}\ln p(x\,|\,\mu,\sigma) = -\frac{1}{\sigma} + \frac{(x-\mu)^2}{\sigma^3}$
  $v = \left(\frac{x-\mu}{\sigma^2},\; -\frac{1}{\sigma} + \frac{(x-\mu)^2}{\sigma^3}\right)$
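• Each component of the vector score also has zero mean. A minimal sketch, assuming $\mu = 1.0$, $\sigma = 2.0$; the closed-form components derived above are averaged over draws from $N(\mu, \sigma^2)$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0
x = rng.normal(loc=mu, scale=sigma, size=1_000_000)

v_mu = (x - mu) / sigma**2                         # d/dmu    ln p(x | mu, sigma)
v_sigma = -1.0 / sigma + (x - mu) ** 2 / sigma**3  # d/dsigma ln p(x | mu, sigma)

print("E[v_mu]    ~", v_mu.mean())     # ~ 0
print("E[v_sigma] ~", v_sigma.mean())  # ~ 0, since E[(x-mu)^2] = sigma^2
```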
Fisher Information
• Fisher information: designed to provide a measure of how
  much information the parametric probability law $p(x\,|\,\theta)$
  carries about the parameter $\theta$.
• An adequate definition of such information should possess
  the following properties:
  – The larger the sensitivity of $p(x\,|\,\theta)$ to changes in $\theta$, the
    larger should be the information.
  – The information should be additive: the information
    carried by the combined law $p(x_1, x_2\,|\,\theta)$ should be the
    sum of those carried by $p(x_1\,|\,\theta)$ and $p(x_2\,|\,\theta)$.
  – The information should be insensitive to the sign of the
    change in $\theta$, and preferably positive.
  – The information should be a deterministic quantity; it
    should not depend on the specific random observation.
Fisher Information
• Definition (scalar form): the Fisher information (about $\theta$) is
  the variance of the score:
  $J(\theta) = E\left[\left(\frac{\partial}{\partial\theta}\ln p(x\,|\,\theta)\right)^2\right]$
• Example: consider a random variable $x \sim N(\mu, \sigma^2)$:
  $\ln p(x\,|\,\mu,\sigma) = -\frac{1}{2}\ln(2\pi) - \ln\sigma - \frac{1}{2\sigma^2}(x-\mu)^2$
  $v = \frac{\partial}{\partial\mu}\ln p(x\,|\,\mu,\sigma) = \frac{x-\mu}{\sigma^2}$
  $J(\mu) = E[v^2] = \frac{1}{\sigma^4}\, E[(x-\mu)^2] = \frac{\sigma^2}{\sigma^4} = 1/\sigma^2$
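• $J(\mu) = 1/\sigma^2$ can be recovered as the empirical second moment of the score. A minimal sketch, assuming $\mu = 0$, $\sigma = 2$ (so $1/\sigma^2 = 0.25$):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 2.0
x = rng.normal(loc=mu, scale=sigma, size=1_000_000)

v = (x - mu) / sigma**2           # score w.r.t. mu
print("E[v^2] ~", (v**2).mean())  # ~ 1/sigma^2 = 0.25
```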
Fisher Information - Cntd.
• Whenever $\theta = (\theta_1,\dots,\theta_k)$ is a vector, the Fisher information
  is the matrix $J(\theta) = [J_{i,j}(\theta)]$ where
  $J_{i,j}(\theta) = \mathrm{cov}\left(\frac{\partial}{\partial\theta_i}\log p(x\,|\,\theta),\; \frac{\partial}{\partial\theta_j}\log p(x\,|\,\theta)\right)$
• Reminder: $\mathrm{cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big]$
• Remark: the Fisher information is only defined when
  the distributions $p(x\,|\,\theta)$ satisfy some regularity conditions.
  (For example, they should be differentiable w.r.t. $\theta_i$, and
  all the distributions in the parametric family must have the
  same support set.)
Fisher Information - Cntd.
• Claim: Let $x^{(n)} = (x_1,\dots,x_n)$ be i.i.d. random variables $\sim p(x\,|\,\theta)$.
  The score of $p(x^{(n)}\,|\,\theta)$ is the sum of the individual scores.
• Proof:
  $v(x^{(n)}) = \frac{\partial}{\partial\theta}\ln p(x^{(n)}\,|\,\theta) = \frac{\partial}{\partial\theta}\ln \prod_i p(x_i\,|\,\theta) = \sum_i \frac{\partial}{\partial\theta}\ln p(x_i\,|\,\theta) = \sum_i v(x_i)$
• Example: If $x^{(n)} = (x_1,\dots,x_n)$ are i.i.d. $\sim N(\mu,\sigma^2)$, the score is
  $\frac{\partial}{\partial\mu}\ln p(x^{(n)}\,|\,\mu,\sigma) = \sum_{i=1}^{n}\frac{x_i-\mu}{\sigma^2} = \frac{n(\bar{x}-\mu)}{\sigma^2}$
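• The claim says the joint-sample score equals the sum of per-sample scores; for the normal case both equal $n(\bar{x}-\mu)/\sigma^2$. A minimal sketch, assuming $n = 50$, $\mu = 1$, $\sigma = 2$, comparing the two expressions on one sample:

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu, sigma = 50, 1.0, 2.0
x = rng.normal(loc=mu, scale=sigma, size=n)

sum_of_scores = ((x - mu) / sigma**2).sum()   # sum_i v(x_i)
joint_score = n * (x.mean() - mu) / sigma**2  # n(xbar - mu)/sigma^2

print(sum_of_scores, joint_score)  # identical up to floating-point rounding
```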
Fisher Information - Cntd.
• Based on $n$ i.i.d. samples, the Fisher information about $\theta$ is
  $J_n(\theta) = E\left[\left(\frac{\partial}{\partial\theta}\ln p(x^{(n)}\,|\,\theta)\right)^2\right] = E\big[v^2(x^{(n)})\big] = E\Big[\Big(\sum_{i=1}^{n} v(x_i)\Big)^2\Big] = \sum_{i=1}^{n} E\big[v^2(x_i)\big] = nJ(\theta)$
  (the cross terms vanish by independence, since $E[v(x_i)] = 0$).
• Thus, the Fisher information is additive w.r.t. i.i.d. random
  variables.
• Example: Suppose $x^{(n)} = (x_1,\dots,x_n)$ are i.i.d. $\sim N(\mu,\sigma^2)$. From the
  previous example we know that the Fisher information
  about the parameter $\mu$ based on one sample is $J(\mu) = 1/\sigma^2$.
  Therefore, based on the entire sample, $J_n(\mu) = n/\sigma^2$.
The Cramer-Rao Inequality
• Theorem: Let $\hat\theta$ be an unbiased estimator for $\theta$. Then
  $\mathrm{var}(\hat\theta) \ge \frac{1}{J(\theta)}$
• Proof: Using $E[v] = 0$ we have:
  $E\big[(v - Ev)(\hat\theta - E\hat\theta)\big] = E\big[v(\hat\theta - E\hat\theta)\big] = E[v\hat\theta] - E\hat\theta\, Ev = E[v\hat\theta]$
The Cramer-Rao Inequality - Cntd.
• Now
  $E[v\hat\theta] = \int \frac{\partial p(x|\theta)/\partial\theta}{p(x\,|\,\theta)}\, \hat\theta\, p(x\,|\,\theta)\, dx = \int \frac{\partial p(x\,|\,\theta)}{\partial\theta}\, \hat\theta\, dx = \frac{\partial}{\partial\theta}\int p(x\,|\,\theta)\, \hat\theta\, dx = \frac{\partial}{\partial\theta} E[\hat\theta] = \frac{\partial\theta}{\partial\theta} = 1$
  where the last step uses unbiasedness, $E[\hat\theta] = \theta$.
The Cramer-Rao Inequality - Cntd.
• So, $E\big[(v - Ev)(\hat\theta - E\hat\theta)\big] = E[v\hat\theta] = 1$
• By the Cauchy-Schwarz inequality,
  $1 = \Big(E\big[(v - Ev)(\hat\theta - E\hat\theta)\big]\Big)^2 \le E\big[(v - Ev)^2\big]\, E\big[(\hat\theta - E\hat\theta)^2\big] = E[v^2]\,\mathrm{var}(\hat\theta) = J(\theta)\,\mathrm{var}(\hat\theta)$
• Therefore,
  $\mathrm{var}(\hat\theta) \ge \frac{1}{J(\theta)}$
• For a biased estimator we have:
  $\mathrm{var}(\hat\theta) \ge \frac{\left(\frac{\partial}{\partial\theta} E[\hat\theta]\right)^2}{J(\theta)}$
The Cramer-Rao General Case
• The Cramer-Rao inequality also holds in general (vector)
  form: the error covariance matrix of $\hat{\boldsymbol\theta}$ is
  bounded as follows:
  $C \equiv E\big[(\hat{\boldsymbol\theta} - \boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta)^t\big] \ge J^{-1}(\boldsymbol\theta)$
The Cramer-Rao Inequality - Cntd.
• Example: Let $x^{(n)} = (x_1,\dots,x_n)$ be i.i.d. $\sim N(\mu,\sigma^2)$. From the
  previous example, $J_n(\mu) = n/\sigma^2$.
• Now let $\hat\mu(x^{(n)}) = \frac{1}{n}\sum_{i=1}^{n} x_i$ be an (unbiased) estimator for $\mu$.
  $\mathrm{var}(\hat\mu) = E\big[(\hat\mu - E\hat\mu)^2\big] = E[\hat\mu^2] - 2\mu\, E[\hat\mu] + \mu^2 = E[\hat\mu^2] - \mu^2$
  $E[\hat\mu^2] = E\Big[\Big(\frac{1}{n}\sum_{i=1}^{n} x_i\Big)^2\Big] = \frac{1}{n^2}\big(n\sigma^2 + n^2\mu^2\big) = \frac{\sigma^2}{n} + \mu^2$
• So $\mathrm{var}(\hat\mu) = \sigma^2/n$, which matches the Cramer-Rao lower
  bound.
• Def: An unbiased estimator whose covariance meets the
  Cramer-Rao lower bound is called efficient.
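• This efficiency can also be observed by simulation: the empirical variance of the sample mean sits right at the Cramer-Rao bound. A minimal sketch, assuming $n = 20$, $\mu = 0$, $\sigma = 3$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu, sigma, trials = 20, 0.0, 3.0, 200_000

samples = rng.normal(loc=mu, scale=sigma, size=(trials, n))
mu_hat = samples.mean(axis=1)  # the sample-mean estimator

print("var(mu_hat)   ~", mu_hat.var())    # ~ sigma^2 / n = 0.45
print("CR lower bound =", sigma**2 / n)   # 1 / J_n(mu)
```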
Efficiency
• Theorem (Efficiency): The unbiased estimator $\hat{\boldsymbol\theta}$ (i.e., $E\hat{\boldsymbol\theta} = \boldsymbol\theta$) is
  efficient, that is,
  $C \equiv E\big[(\hat{\boldsymbol\theta} - \boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta)^t\big] = J^{-1}(\boldsymbol\theta)$
  iff
  $J(\boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta) = \boldsymbol\nu$
  where $\boldsymbol\nu$ is the (vector) score.
• Proof (If): If $J(\boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta) = \boldsymbol\nu$, then
  $E\big[J(\boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta)^t J^t(\boldsymbol\theta)\big] = J(\boldsymbol\theta)\, C\, J^t(\boldsymbol\theta) = E[\boldsymbol\nu\boldsymbol\nu^t] = J(\boldsymbol\theta)$,
  meaning $C = J^{-1}(\boldsymbol\theta)$.
Efficiency
• Only if: Recall the cross covariance between $\boldsymbol\nu$ and $(\hat{\boldsymbol\theta} - \boldsymbol\theta)$:
  $E\big[\boldsymbol\nu(\hat{\boldsymbol\theta} - \boldsymbol\theta)^t\big] = I$
  The Cauchy-Schwarz inequality for random variables says
  $I^2 = \Big(E\big[\boldsymbol\nu(\hat{\boldsymbol\theta} - \boldsymbol\theta)^t\big]\Big)^2 \le E[\boldsymbol\nu\boldsymbol\nu^t]\, E\big[(\hat{\boldsymbol\theta} - \boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta)^t\big] = JC$
  If $C = J^{-1}$, equality holds, which forces proportionality:
  $(\hat{\boldsymbol\theta} - \boldsymbol\theta) = \alpha\boldsymbol\nu;\quad C = \alpha^2 J;\quad \alpha = J^{-1};$
  thus
  $J(\boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta) = \boldsymbol\nu$
Cramer-Rao Inequality and ML - Cntd.
• Theorem: Suppose there exists an efficient estimator $\hat\theta$
  for all $\theta$. Then the ML estimator $\hat\theta_{ML}$ is $\hat\theta$.
• Proof: By assumption, $\mathrm{var}(\hat\theta) = \frac{1}{J(\theta)}$.
  By the previous claim,
  $\frac{\partial}{\partial\theta}\log p(x\,|\,\theta) = v = J(\theta)(\hat\theta - \theta)$ for all $\theta$.
  This holds at $\theta = \hat\theta_{ML}$, and since this is a maximum point,
  the left side is zero, so $\hat\theta = \hat\theta_{ML}$.
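• For the normal mean, the ML estimator is the sample mean, and the proof's key observation (the score vanishes at the maximizer) is visible numerically. A minimal sketch, assuming $\sigma = 1$, $n = 1000$, and an arbitrarily chosen true mean of 2.0:

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_mu, sigma = 1000, 2.0, 1.0
x = rng.normal(loc=true_mu, scale=sigma, size=n)

mu_ml = x.mean()  # ML estimate of mu for N(mu, sigma^2)

# The joint score n(xbar - mu)/sigma^2 evaluated at mu_ml is exactly zero,
# so the efficient estimator (the sample mean) coincides with the MLE.
score_at_ml = n * (x.mean() - mu_ml) / sigma**2
print(mu_ml, score_at_ml)  # mu_ml ~ 2.0, score_at_ml == 0.0
```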