Statistics 550 Notes 18
Reading: Section 3.3-3.4
I. Finding Minimax Estimators (Section 3.3)
Review: Theorem 3.3.2 is a tool for finding minimax
estimators. It says that if an estimator is a Bayes estimator
and has maximum risk equal to its Bayes risk, then the
estimator is minimax. In particular, a Bayes estimator with
constant risk is a minimax estimator.
Minimax as limit of Bayes rules:
If the parameter space is not bounded, minimax rules are
often not Bayes rules but instead can be obtained as limits
of Bayes rules. To deal with such situations we need an
extension of Theorem 3.3.2. The theorem says that if an
estimator has maximum risk that is equal to the limit of the
Bayes risks of the Bayes estimators for a sequence of
priors, then the estimator is minimax. In particular, an
estimator with constant risk equal to the limit of the Bayes
risks of the Bayes estimators for a sequence of priors is
minimax.
Theorem 3.3.3: Let $\delta^*$ be a decision rule such that $\sup_\theta R(\theta, \delta^*) = r$. Let $\{\pi_k\}$ be a sequence of prior distributions and let $r_k$ be the Bayes risk of the Bayes rule with respect to the prior $\pi_k$. If $r_k \rightarrow r$ as $k \rightarrow \infty$, then $\delta^*$ is minimax.
Proof: Suppose $\delta$ is any other estimator. Then,
$$\sup_\theta R(\theta, \delta) \geq \int R(\theta, \delta)\, d\pi_k(\theta) \geq r_k,$$
and this holds for every $k$. Hence,
$$\sup_\theta R(\theta, \delta) \geq \lim_{k\rightarrow\infty} r_k = r = \sup_\theta R(\theta, \delta^*).$$
Thus, $\delta^*$ is minimax.
Note: Unlike Theorem 3.3.2, even if the Bayes estimators for the priors $\pi_k$ are unique, the theorem does not guarantee that $\delta^*$ is the unique minimax estimator.
Example 2 (Example 3.3.3): $X_1, \ldots, X_n$ iid $N(\theta, 1)$, $\theta \in \mathbb{R}$. Suppose we want to estimate $\theta$ with squared error loss. We will show that $\bar{X}$ is minimax.
First, note that $\bar{X}$ has constant risk $1/n$. Consider the sequence of priors $\pi_k = N(0, k)$. In Notes 16, we showed that the Bayes estimator for squared error loss with respect to the prior $\pi_k$ is
$$\hat{\theta}_k = \frac{n}{n + \frac{1}{k}}\,\bar{X}.$$
The risk function of $\hat{\theta}_k$ is
$$R(\theta, \hat{\theta}_k) = E_\theta\left[\left(\frac{n}{n+\frac{1}{k}}\bar{X} - \theta\right)^2\right] = \left(\frac{n}{n+\frac{1}{k}}\right)^2 \frac{1}{n} + \left(\frac{\frac{1}{k}}{n+\frac{1}{k}}\right)^2 \theta^2.$$
The Bayes risk of $\hat{\theta}_k$ with respect to $\pi_k$ is
$$r_k = \int \left[\left(\frac{n}{n+\frac{1}{k}}\right)^2 \frac{1}{n} + \left(\frac{\frac{1}{k}}{n+\frac{1}{k}}\right)^2 \theta^2\right] \frac{1}{\sqrt{2\pi k}} \exp\left(-\frac{\theta^2}{2k}\right) d\theta = \left(\frac{n}{n+\frac{1}{k}}\right)^2 \frac{1}{n} + \left(\frac{\frac{1}{k}}{n+\frac{1}{k}}\right)^2 k = \frac{1}{n+\frac{1}{k}}.$$
As $k \rightarrow \infty$, $r_k \rightarrow \frac{1}{n}$, which is the constant risk of $\bar{X}$.
Thus, by Theorem 3.3.3, $\bar{X}$ is minimax.
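To see the limit numerically, here is a small Monte Carlo sketch (my own illustration, not from the text; the values of $n$, $k$, and the number of replications are arbitrary choices) that estimates the Bayes risk of $\hat{\theta}_k$ under the prior $\pi_k$ and compares it with the closed form $1/(n + 1/k)$ and the constant risk $1/n$ of $\bar{X}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10           # sample size (illustrative choice)
reps = 200_000   # Monte Carlo replications

for k in [1, 10, 100, 1000]:
    # Draw theta from the prior pi_k = N(0, k), then Xbar | theta ~ N(theta, 1/n).
    theta = rng.normal(0.0, np.sqrt(k), size=reps)
    xbar = rng.normal(theta, 1.0 / np.sqrt(n))
    theta_hat = (n / (n + 1.0 / k)) * xbar        # Bayes estimator under pi_k
    mc_risk = np.mean((theta_hat - theta) ** 2)   # Monte Carlo Bayes risk
    print(f"k={k:5d}: simulated r_k = {mc_risk:.5f}, "
          f"closed form 1/(n + 1/k) = {1.0 / (n + 1.0 / k):.5f}")

print(f"constant risk of Xbar: 1/n = {1.0 / n:.5f}")
```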
II. Unbiased Estimation and Uniformly Minimum
Variance Unbiased Estimates (Section 3.4)
Consider the point estimation problem of estimating $g(\theta)$ when the data $X$ is generated from $p(X \mid \theta)$, where $\theta$ is an unknown parameter, $\theta \in \Theta$.
A fundamental problem in choosing a point estimator
(more generally a decision procedure) is that generally no
procedure dominates all other procedures.
Two approaches we have considered are (1) Bayes –
minimize weighted average of risk; (2) minimax –
minimize worst case risk.
Another approach is to restrict the class of possible
estimators and look for a procedure within a restricted class
that dominates all others.
The most commonly used restricted class of estimators is the class of unbiased estimators:
$$U = \{\delta(X) : E_\theta[\delta(X)] = g(\theta) \text{ for all } \theta \in \Theta\}.$$
Under squared error loss, the risk of an unbiased estimator
is the variance of the unbiased estimator.
Uniformly minimum variance unbiased estimator (UMVU): An estimator $\delta^*(X)$ which has the minimum variance among all unbiased estimators for all $\theta$, i.e.,
$$\text{Var}_\theta(\delta^*(X)) \leq \text{Var}_\theta(\delta(X)) \text{ for all } \theta \in \Theta$$
for all $\delta(X) \in U$.
A UMVU estimator is at least as good as all other unbiased
estimators under squared error loss.
A UMVU estimator might or might not exist.
Example of challenges in finding UMVU estimator
(Poisson unbiased estimation):
Let $X_1, \ldots, X_n$ be iid Poisson($\lambda$). Consider the sample mean $\bar{X}$ and the sample variance
$$S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2.$$
Because the mean and the variance of a Poisson are both $\lambda$, the sample mean and the sample variance are both unbiased estimators of $\lambda$.
To determine the better estimator, $\bar{X}$ or $S^2$, we should now compare variances. We easily have $\text{Var}_\lambda(\bar{X}) = \lambda/n$, but computing $\text{Var}_\lambda(S^2)$ is quite a lengthy calculation. This is one of the first problems in finding a UMVU estimator – the calculations may be long and involved. It turns out that $\text{Var}_\lambda(\bar{X}) < \text{Var}_\lambda(S^2)$ for all $\lambda$.
Even if we can establish that $\bar{X}$ is better than $S^2$, consider the class of estimators
$$W_a(\bar{X}, S^2) = a\bar{X} + (1-a)S^2.$$
For every constant $a$, $E_\lambda[W_a(\bar{X}, S^2)] = \lambda$, so we now have infinitely many unbiased estimators of $\lambda$. Even if $\bar{X}$ is better than $S^2$, is it better than every $W_a(\bar{X}, S^2)$? Furthermore, how can we be sure that there are not other, better unbiased estimators lurking about?
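As a rough numerical check on these claims, the following Monte Carlo sketch (my own illustration; the values of $\lambda$, $n$, the weights $a$, and the number of replications are arbitrary choices) compares the simulated variances of $\bar{X}$, $S^2$, and a few of the combined estimators $W_a$:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n, reps = 3.0, 20, 100_000    # illustrative lambda, sample size, replications

x = rng.poisson(lam, size=(reps, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)         # sample variance with divisor n - 1

for a in [0.0, 0.5, 0.9, 1.0]:
    w = a * xbar + (1 - a) * s2    # W_a is unbiased for lambda for every a
    print(f"a = {a:.1f}: mean = {w.mean():.4f} (target {lam}), variance = {w.var():.4f}")

print(f"lambda / n = {lam / n:.4f}  (theoretical Var(Xbar))")
```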
III. The Information Inequality
The information inequality provides a lower bound on the
variance of an unbiased estimator. If an unbiased estimator
achieves this lower bound, then it must be UMVU.
We will focus on a one-parameter model – the data $X$ is generated from $p(X \mid \theta)$, where $\theta$ is an unknown parameter, $\theta \in \Theta$.
We make two “regularity” assumptions on the model $\{p(X \mid \theta) : \theta \in \Theta\}$:

(I) The support of $p(X \mid \theta)$, $A = \{x : p(x \mid \theta) > 0\}$, does not depend on $\theta$. Also, for all $x \in A$ and $\theta \in \Theta$, $\frac{\partial}{\partial\theta}\log p(x \mid \theta)$ exists and is finite.
(II) If $T$ is any statistic such that $E_\theta(|T|) < \infty$ for all $\theta \in \Theta$, then the operations of integration and differentiation by $\theta$ can be interchanged in $\int T(x)p(x\mid\theta)\,dx$. That is, for integration over $\mathbb{R}^q$,
$$\frac{d}{d\theta}E_\theta[T(X)] = \frac{d}{d\theta}\int T(x)p(x\mid\theta)\,dx = \int T(x)\frac{\partial}{\partial\theta}p(x\mid\theta)\,dx \qquad (1.1)$$
whenever the right hand side of (1.1) is finite.
Assumption II is not useful as written – what is needed are
simple sufficient conditions on p ( x | ) for II to hold.
Classical conditions may be found in an analysis book such as Rudin, Principles of Mathematical Analysis, pp. 236-237.
Assumptions I and II generally hold for a one-parameter
exponential family.
Proposition 3.4.1: If $p(x\mid\theta) = h(x)\exp\{\eta(\theta)T(x) - B(\theta)\}$ is an exponential family and $\eta(\theta)$ has a nonvanishing continuous derivative on $\Theta$, then Assumptions I and II hold.
Recall from Notes 12 the concept of Fisher information. In
Notes 12, we defined the Fisher information for a single
random variable X . Here we define the Fisher
information for data X that might be multidimensional.
For a model $\{p(X \mid \theta) : \theta \in \Theta\}$ and a value of $\theta$, the Fisher information number $I(\theta)$ is defined as
$$I(\theta) = E_\theta\left[\left(\frac{\partial}{\partial\theta}\log p(X \mid \theta)\right)^2\right].$$
Lemma 1 (similar to Lemma 1 from Notes 12): Suppose Assumptions I and II hold and that $E_\theta\left|\frac{\partial}{\partial\theta}\log p(X\mid\theta)\right| < \infty$. Then
$$E_\theta\left[\frac{\partial}{\partial\theta}\log p(X\mid\theta)\right] = 0$$
and thus,
$$I(\theta) = \text{Var}_\theta\left(\frac{\partial}{\partial\theta}\log p(X\mid\theta)\right).$$
Proof: First, we observe that since $\int p(x\mid\theta)\,dx = 1$ for all $\theta \in \Theta$, we have $\frac{\partial}{\partial\theta}\int p(x\mid\theta)\,dx = 0$. Combining this with the identity
$$\frac{\partial}{\partial\theta}p(x\mid\theta) = \left[\frac{\partial}{\partial\theta}\log p(x\mid\theta)\right]p(x\mid\theta),$$
we have
$$0 = \int \frac{\partial}{\partial\theta}p(x\mid\theta)\,dx = \int \left[\frac{\partial}{\partial\theta}\log p(x\mid\theta)\right]p(x\mid\theta)\,dx = E_\theta\left[\frac{\partial}{\partial\theta}\log p(X\mid\theta)\right],$$
where we have interchanged differentiation and integration, which is justified under Assumption II.
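As a quick numerical illustration of Lemma 1 (my own sketch; a single Poisson($\lambda$) observation and the value of $\lambda$ are arbitrary choices), the score $\frac{\partial}{\partial\lambda}\log p(X\mid\lambda) = X/\lambda - 1$ should have mean zero, and its variance and second moment should both equal $I(\lambda) = 1/\lambda$:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, reps = 2.5, 500_000           # illustrative lambda and number of draws

x = rng.poisson(lam, size=reps)
# Score of a single Poisson(lambda) observation:
# d/dlambda [ -lambda + x*log(lambda) - log(x!) ] = x/lambda - 1
score = x / lam - 1.0

print(f"mean of score ~ {score.mean():.5f}   (Lemma 1: equals 0)")
print(f"E[score^2]    ~ {np.mean(score**2):.5f}")
print(f"Var(score)    ~ {score.var():.5f}   (both near 1/lambda = {1 / lam:.5f})")
```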
The information (Cramer-Rao) inequality provides a lower
bound on the variance that an estimator can achieve based
on the Fisher information number of the model.
Theorem 3.4.1 (Information Inequality): Let $T(X)$ be any statistic such that $\text{Var}_\theta(T(X)) < \infty$ for all $\theta$. Denote $E_\theta(T(X))$ by $\psi(\theta)$. Suppose that Assumptions I and II hold and $0 < I(\theta) < \infty$. Then for all $\theta$, $\psi(\theta)$ is differentiable and
$$\text{Var}_\theta(T(X)) \geq \frac{[\psi'(\theta)]^2}{I(\theta)}.$$
The application of the Information Inequality to unbiased estimators is Corollary 3.4.1: Suppose the conditions of Theorem 3.4.1 hold and $T(X)$ is an unbiased estimate of $\theta$. Then
$$\text{Var}_\theta(T(X)) \geq \frac{1}{I(\theta)}.$$
This corollary holds because for an unbiased estimator, $\psi(\theta) = \theta$ so that $\psi'(\theta) = 1$.
Proof of Information Inequality: The proof of the theorem
is a clever application of the Cauchy-Schwarz Inequality.
Stated statistically, the Cauchy-Schwarz Inequality is that for any two random variables $X$ and $Y$,
$$[\text{Cov}(X, Y)]^2 \leq \text{Var}(X)\,\text{Var}(Y).$$
If we rearrange the inequality, we can get a lower bound on the variance of $X$:
$$\text{Var}(X) \geq \frac{[\text{Cov}(X, Y)]^2}{\text{Var}(Y)}.$$
We choose $X$ to be the estimator $T(X)$ and $Y$ to be the quantity $\frac{d}{d\theta}\log p(X\mid\theta)$, and apply the Cauchy-Schwarz Inequality.
First, we compute $\text{Cov}_\theta\left(T(X), \frac{d}{d\theta}\log p(X\mid\theta)\right)$. We have, using Assumption II,
$$E_\theta\left[T(X)\frac{d}{d\theta}\log p(X\mid\theta)\right] = E_\theta\left[T(X)\,\frac{\frac{\partial}{\partial\theta}p(X\mid\theta)}{p(X\mid\theta)}\right] = \int T(x)\,\frac{\frac{\partial}{\partial\theta}p(x\mid\theta)}{p(x\mid\theta)}\,p(x\mid\theta)\,dx = \int T(x)\frac{\partial}{\partial\theta}p(x\mid\theta)\,dx = \frac{d}{d\theta}\int T(x)p(x\mid\theta)\,dx = \frac{d}{d\theta}E_\theta[T(X)] = \psi'(\theta).$$
From Lemma 1, $E_\theta\left[\frac{d}{d\theta}\log p(X\mid\theta)\right] = 0$, so we conclude that
$$\text{Cov}_\theta\left(T(X), \frac{d}{d\theta}\log p(X\mid\theta)\right) = \psi'(\theta).$$
Also from Lemma 1, we have $\text{Var}_\theta\left(\frac{d}{d\theta}\log p(X\mid\theta)\right) = I(\theta)$. Thus, we conclude from the Cauchy-Schwarz inequality applied to $T(X)$ and $\frac{d}{d\theta}\log p(X\mid\theta)$ that
$$\text{Var}_\theta(T(X)) \geq \frac{[\psi'(\theta)]^2}{I(\theta)}.$$
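The key step in the proof is the identity $\text{Cov}_\theta\left(T(X), \frac{d}{d\theta}\log p(X\mid\theta)\right) = \psi'(\theta)$. Here is a small simulation I added to make it concrete (the model $N(\theta, 1)$ and the statistic $T(X) = \bar{X}^2$, for which $\psi(\theta) = \theta^2 + 1/n$ and $\psi'(\theta) = 2\theta$, are purely illustrative choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 1.3, 5, 400_000   # illustrative parameter value, sample size, replications

x = rng.normal(theta, 1.0, size=(reps, n))
xbar = x.mean(axis=1)
T = xbar ** 2                      # psi(theta) = E[Xbar^2] = theta^2 + 1/n, psi'(theta) = 2*theta
score = (x - theta).sum(axis=1)    # d/dtheta log p(X|theta) for n iid N(theta, 1) observations

cov_T_score = np.cov(T, score)[0, 1]
print(f"Cov(T, score) ~ {cov_T_score:.4f},  psi'(theta) = {2 * theta:.4f}")
print(f"Var(T) = {T.var():.4f}  >=  [psi'(theta)]^2 / I(theta) = {(2 * theta) ** 2 / n:.4f}")
```

Here $I(\theta) = n$ for $n$ iid $N(\theta,1)$ observations, so the last line also checks the information bound for this (biased-for-$\theta$) statistic.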
Application of Information Inequality to find UMVU
estimator for Poisson model:
Consider $X_1, \ldots, X_n$ iid Poisson($\lambda$) with parameter space $\lambda > 0$. This is a one-parameter exponential family,
$$p(X\mid\lambda) = \exp\left(-n\lambda + (\log\lambda)\sum_{i=1}^n X_i - \sum_{i=1}^n \log X_i!\right).$$
We have $\eta'(\lambda) = \frac{d}{d\lambda}\log\lambda = \frac{1}{\lambda}$, which is greater than zero over the whole parameter space. Thus, by Proposition 3.4.1, Assumptions I and II hold for this model.
The Fisher information number is
$$I(\lambda) = \text{Var}_\lambda\left(\frac{d}{d\lambda}\log p(X\mid\lambda)\right) = \text{Var}_\lambda\left(\frac{d}{d\lambda}\left[-n\lambda + (\log\lambda)\sum_{i=1}^n X_i - \sum_{i=1}^n\log X_i!\right]\right) = \text{Var}_\lambda\left(-n + \frac{1}{\lambda}\sum_{i=1}^n X_i\right) = \frac{1}{\lambda^2}\,n\,\text{Var}_\lambda(X_i) = \frac{n}{\lambda}.$$
Thus, by the Information Inequality, the lower bound on the variance of an unbiased estimator of $\lambda$ is $\frac{1}{I(\lambda)} = \frac{\lambda}{n}$. The unbiased estimator $\bar{X}$ has $\text{Var}_\lambda(\bar{X}) = \frac{\lambda}{n}$. Thus, $\bar{X}$ achieves the Information Inequality lower bound and is hence a UMVU estimator.
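A short simulation (my own sketch; $\lambda$, $n$, and the number of replications are arbitrary choices) can confirm both pieces of this calculation: the variance of the full-sample score estimates $I(\lambda) = n/\lambda$, and $\text{Var}_\lambda(\bar{X})$ sits at the lower bound $1/I(\lambda) = \lambda/n$:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n, reps = 4.0, 15, 300_000    # illustrative lambda, sample size, replications

x = rng.poisson(lam, size=(reps, n))
xbar = x.mean(axis=1)
# Full-sample score: d/dlambda log p(X|lambda) = -n + (1/lambda) * sum_i X_i
score = -n + x.sum(axis=1) / lam

I_hat = score.var()                # Fisher information number I(lambda) = Var(score)
print(f"estimated I(lambda) = {I_hat:.3f},  formula n/lambda = {n / lam:.3f}")
print(f"Var(Xbar) = {xbar.var():.5f},  lower bound 1/I(lambda) = {lam / n:.5f}")
```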
Comment: While the Information Inequality can be used to
establish that an estimator is UMVU for certain models,
failure of an estimator to achieve the lower bound does not
necessarily mean that the estimator is not UMVU for a
model. There are some models for which no unbiased
estimator achieves the lower bound.
Multiparameter Information Inequality: There is a version
of the information inequality for multiparameter models,
which is given in Theorem 3.4.3 on pg. 186 of Bickel and
Doksum. $\psi'(\theta)$ is replaced by the gradient of the expected value of $T(X)$ with respect to $\theta$, and $1/I(\theta)$ is replaced by the inverse of the Fisher information matrix, where the Fisher information matrix is $\text{Cov}_\theta\left(\nabla_\theta \log p(X\mid\theta)\right)$.
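For concreteness, here is a sketch I added (the $N(\mu, \sigma)$ model with $\theta = (\mu, \sigma)$ and the particular parameter values are my own illustrative choices, not from the notes) that estimates the per-observation Fisher information matrix as the covariance matrix of the score vector and inverts it to get the multiparameter lower bound:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, reps = 1.0, 2.0, 500_000    # illustrative parameter values and number of draws

x = rng.normal(mu, sigma, size=reps)
# Score vector of a single N(mu, sigma) observation with theta = (mu, sigma):
score_mu = (x - mu) / sigma ** 2
score_sigma = -1.0 / sigma + (x - mu) ** 2 / sigma ** 3
scores = np.vstack([score_mu, score_sigma])

I_hat = np.cov(scores)                 # Fisher information matrix = Cov of the score vector
I_exact = np.array([[1 / sigma ** 2, 0.0], [0.0, 2 / sigma ** 2]])
print("estimated I(theta):\n", np.round(I_hat, 4))
print("exact I(theta) for N(mu, sigma):\n", I_exact)
print("inverse (lower bound for the covariance of unbiased estimators of theta):\n",
      np.round(np.linalg.inv(I_hat), 4))
```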
IV. The Information Inequality and Asymptotic Optimality of the MLE
Theorem 2 of Notes 12 was about the asymptotic variance of the MLE. We will rewrite the result of the theorem in terms of the Fisher information number as we have defined it in these notes.
Consider $X_1, \ldots, X_n$ iid from a distribution $p(X_i \mid \theta)$, $\theta \in \Theta$, which satisfies Assumptions I and II. Let
$$I_1(\theta) = \text{Var}_\theta\left(\frac{d}{d\theta}\log p(X_1\mid\theta)\right).$$
Theorem 2 of Notes 12 can be rewritten as

Theorem 2’: Under “regularity conditions” (including Assumptions I and II),
$$\sqrt{n}\left(\hat{\theta}_{MLE} - \theta_0\right) \xrightarrow{L} N\left(0, \frac{1}{I_1(\theta_0)}\right).$$
The relationship between $I(\theta)$ and $I_1(\theta)$ is
$$I(\theta) = \text{Var}_\theta\left(\frac{d}{d\theta}\log p(X\mid\theta)\right) = \text{Var}_\theta\left(\frac{d}{d\theta}\sum_{i=1}^n\log p(X_i\mid\theta)\right) = \text{Var}_\theta\left(\sum_{i=1}^n\frac{d}{d\theta}\log p(X_i\mid\theta)\right) = n\,\text{Var}_\theta\left(\frac{d}{d\theta}\log p(X_1\mid\theta)\right) = nI_1(\theta).$$
Thus, from Theorem 2’, we have that for large $n$, $\hat{\theta}_{MLE}$ is approximately unbiased and has variance $\frac{1}{nI_1(\theta)} = \frac{1}{I(\theta)}$.

By the Information Inequality, the minimum variance of an unbiased estimator is $\frac{1}{I(\theta)}$. Thus the MLE approximately achieves the lower bound of the Information Inequality.
This suggests that for large $n$, among all consistent estimators (which are approximately unbiased for large $n$), the MLE achieves approximately the lowest variance and is hence asymptotically optimal.
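A simulation can make this plausible. The sketch below is my own illustration (the Exponential model with rate $\theta$, for which $I_1(\theta) = 1/\theta^2$ and the MLE is $1/\bar{X}$, plus the specific $\theta_0$, $n$, and replication count, are arbitrary choices); it checks that $\sqrt{n}(\hat{\theta}_{MLE} - \theta_0)$ has variance close to $1/I_1(\theta_0) = \theta_0^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta0, n, reps = 2.0, 1000, 20_000     # illustrative rate, sample size, replications

# Exponential(rate theta): log p(x|theta) = log(theta) - theta*x, so I_1(theta) = 1/theta^2.
x = rng.exponential(1.0 / theta0, size=(reps, n))
mle = 1.0 / x.mean(axis=1)              # MLE of the rate is 1/Xbar

z = np.sqrt(n) * (mle - theta0)
print(f"mean of sqrt(n)*(MLE - theta0): {z.mean():.4f}   (near 0 for large n)")
print(f"variance: {z.var():.4f},  asymptotic value 1/I_1(theta0) = theta0^2 = {theta0 ** 2:.4f}")
```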
Note: Making precise the sense in which the MLE is
asymptotically optimal took many years of brilliant work
by Lucien Le Cam and other mathematical statisticians. It
will be covered in Stat 552.