Joshi, V. M. (1966). "Asymptotic risk of maximum likelihood estimates."

ASYMPTOTIC RISK OF MAXIMUM
LIKELIHOOD ESTIMATES

by

V. M. Joshi
University of North Carolina, Chapel Hill
(on leave from Maharashtra Government, Bombay)

Institute of Statistics Mimeo Series No. 471
May 1966

This research was supported by the Mathematics Division of the Air Force Office of Scientific Research, Contract No. AF-AFOSR-760-65.

DEPARTMENT OF STATISTICS
UNIVERSITY OF NORTH CAROLINA
Chapel Hill, N. C.
1. Introduction.

The main result proved in this paper is a hypothesis conjectured by Chernoff (1956), that with increase in sample size the risk, suitably normalized, of a maximum likelihood estimate (m.l.e.) converges to a limit equal to the variance of the asymptotic distribution of the estimate. The hypothesis is shown to hold generally subject to mild conditions. As the asymptotic variance is a lower bound for the risk function for all estimates, the result establishes a new optimum property of the m.l.e., viz. that it is asymptotically of minimum risk.

Chernoff's above mentioned hypothesis has also been briefly referred to in a recent paper of Yu. V. Linnik and N. M. Mitrofanova (1965).

The efficiency of estimates in general is considered in the last section, in connection with superefficient estimates, and a revised definition of asymptotic efficiency is suggested.
2. Main Result.

We first reproduce the relevant formulae from the above mentioned paper of Chernoff (1956). X is a random variable with a frequency function f(x,θ) with one unknown parameter θ. We consider a sequence of estimates T_n of θ, where

(1)    T_n - θ = O_p(1/√n).

A sequence of loss functions

(2)    L_n(t,θ),  with c_2n(θ) > 0,

is assumed, and of normalized loss functions

(3)    L*_n(t,θ) = [L_n(t,θ) - c_0n(θ)] / c_2n(θ),

where we assume

(4)    L*_n(t,θ) = n[(t-θ)² + ε(t-θ)²],

in which ε → 0 is assumed to hold uniformly in n as t → θ. Then, c being an arbitrary constant > 0,

(5)    lim inf_{n→∞}  E(L*_n(T_n,θ)) / {n E(min[(T_n-θ)², c²/n])}  ≥ 1.

If √n(T_n - θ) has a limiting distribution with second moment σ²(θ), then

(6)    lim_{c→∞} lim_{n→∞} n E(min[(T_n-θ)², c²/n]) = σ²(θ),

so that σ²(θ) is a lower bound for the risk function

(7)    R_n(T_n,θ) = E(L*_n(T_n,θ)).

On the other hand, if

(8)    P(|T_n - θ| > c) = o(1/n)

for each c, it is possible to show that

(9)    lim inf_{n→∞}  n E((T_n-θ)²) / E(L*_n(T_n,θ))  ≥ 1,

so that the normalized risk is sandwiched between the real variance and the asymptotic variance.

Chernoff remarks that, in accordance with the axioms of von Neumann and Morgenstern, the loss function requires to be bounded above, and (8) indicates that he assumes an upper bound C for the loss function in (2), and hence an upper bound nC for the normalized loss function in (3), so that

(10)    L*_n(t,θ) ≤ nC.

Chernoff's conjecture then is that if T_n is the maximum likelihood estimate, the standard derivations of the asymptotic normal distribution of T_n can, without unreasonable modifications, be used to show that

(11)    lim_{n→∞} E(L*_n(T_n,θ)) = asymptotic variance of T_n.
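Chernoff's conjecture (11) can be checked numerically. The sketch below is purely illustrative and not part of the original argument: it takes the exponential density f(x,θ) = θ e^{-θx} (my choice of model; any sufficiently regular family would do), for which the m.l.e. of θ is 1/x̄_n and the asymptotic variance is 1/I(θ) = θ², and estimates the normalized risk n E[(T_n - θ)²] by Monte Carlo.

```python
import numpy as np

# Monte Carlo illustration of (11) for the exponential density
# f(x, theta) = theta * exp(-theta * x): the m.l.e. of the rate theta
# is 1 / x_bar, and the asymptotic variance is 1 / I(theta) = theta**2.
rng = np.random.default_rng(0)
theta = 2.0          # true rate; asymptotic variance = theta**2 = 4
n = 500              # sample size
trials = 20000       # Monte Carlo replications

samples = rng.exponential(scale=1.0 / theta, size=(trials, n))
mle = 1.0 / samples.mean(axis=1)                    # m.l.e. of the rate
normalized_risk = n * np.mean((mle - theta) ** 2)   # n * E[(T_n - theta)^2]

print(normalized_risk)
```

With θ = 2 the normalized risk settles near θ² = 4, as (11) predicts; the small excess over 4 is a finite-sample effect of order 1/n.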
n
We now give a proof of this conjecture.
We asllWDe tha the frequency tunc-
tion f(x,9) satisfies the following conditions, sufficient for asymptotic normality, as stated in Cramerls Mathematical Statistics, (1946,
S. " ..at Writing
f for f(x,9).
2
(i)
For a1nrJst all (Lebesgue measure) x the derivate.
and
~ lo~
~, golo)f
(ii) For every' in A, we have
*~
< F l (x),
r; j 1<
F2 (x) and
functions 1'1 and F2 being (Lebesgue) integrable over (
while
J
,
exist for every 6 belolllling to a non-degenerate inter-
val. A;
the
~e
n
-+10
r~
101;
_10 ,
I < H(x)
10)
10
H(x) f(x,9) dx < M
-10
where M is independent of
(iii) For every
9,
• in A, the integral
10
J
(
01
a ~g f ) 2 f
dx is finite and
-10
positive.
These conditions are sufficient for asymptotic normality of T.
n
For
proving our reSUlt, we assume that f satisfies the following further condition.
1
I
f
J
Ie
1
FURTHER COIIDITIOII: (iv)
For every
• in A, the integrals
JCO:10~f)2 f
<Ix
10
1
and
2
10
2
(H(x)) dx are finite, or in other words the random variables
o 10S2f
o 9
and H(x) have finite variances.
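For a concrete model these conditions are easy to verify. The sketch below is an illustration of mine, not part of the paper: it takes the N(θ,1) density, for which ∂log f/∂θ = x - θ, ∂²log f/∂θ² = -1 and ∂³log f/∂θ³ = 0, so that H(x) may be taken as a constant; conditions (iii) and (iv) then hold with k² = E[(∂log f/∂θ)²] = 1 and both variances finite.

```python
import numpy as np

# Numerical check of conditions (iii) and (iv) for the N(theta, 1)
# density: d(log f)/d(theta) = x - theta, d2(log f)/d(theta)2 = -1
# (constant, so its variance is zero and H(x) may be a constant).
rng = np.random.default_rng(1)
theta = 0.7
x = rng.normal(loc=theta, scale=1.0, size=200_000)

score = x - theta
k2 = np.mean(score ** 2)              # condition (iii): finite, positive (= 1)

second_deriv = np.full_like(x, -1.0)  # d2 log f / d theta2 is identically -1
var_second = np.var(second_deriv)     # condition (iv): finite variance (= 0)

print(k2, var_second)
```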
We now use the same notation as in Cramér (1946) except that we denote the parameter by θ instead of α, and the maximum likelihood estimate by T_n = T_n(x_1, x_2, ..., x_n) instead of α*. θ₀ denotes the unknown true value of the parameter θ, and is assumed to be an interior point of A. We write f_i in place of f(x_i, θ) and use the subscript 0 to indicate that θ is to be put equal to θ₀. Then putting

(12)    B₀ = (1/n) Σ_{i=1}^n (∂log f_i/∂θ)₀,  B₁ = (1/n) Σ_{i=1}^n (∂²log f_i/∂θ²)₀,  B₂ = (1/n) Σ_{i=1}^n H(x_i),  k² = E[(∂log f/∂θ)₀]²,

and expanding the likelihood equation about θ₀ in the usual way, we have

(13)    k√n (T_n - θ₀) = (√n B₀/k) / ( -B₁/k² - (α/2)(T_n - θ₀) B₂/k² ),

where 0 < |α| < 1. The asymptotic normality of (T_n - θ₀) follows from (13). In (13), we now put for brevity

(14)    u_n = k√n (T_n - θ₀),  v_n = -B₁/k² - 1 - (α/2)(T_n - θ₀) B₂/k²,

so that the denominator in the r.h.s. of (13) is 1 + v_n. Hence from (13) and (14),

u_n (1 + v_n) = √n B₀/k = (1/(k√n)) Σ_{i=1}^n (∂log f_i/∂θ)₀,

so that

(15)    u_n² + 2 u_n² v_n ≤ u_n²(1 + v_n)² = (1/(nk²)) [Σ_{i=1}^n (∂log f_i/∂θ)₀]².
Let F_n = F_n(x_1, x_2, ..., x_n) denote the d.f. of the x_i, i = 1, 2, ..., n. We now integrate both sides of (15), with respect to the probability measure determined by F_n, on the subset D of the sample space R_n given by

(16)    D = {x : |T_n - θ₀| < δ},

where x = (x_1, x_2, ..., x_n) denotes a point in R_n and δ is some positive constant whose value shall be determined later. We thus have from (15),

∫_D u_n² dF_n + 2 ∫_D u_n² v_n dF_n ≤ (1/(nk²)) ∫_{R_n} [Σ_{i=1}^n (∂log f_i/∂θ)₀]² dF_n.

Now in the r.h.s. of this inequality, for each i, i = 1, 2, ..., n,

E(∂log f_i/∂θ)₀ = 0,  and by (12), E[(∂log f_i/∂θ)₀]² = k².

Hence, since the variables x_i, i = 1, 2, ..., n, are independently distributed, E[Σ_{i=1}^n (∂log f_i/∂θ)₀]² = nk², and hence

(17)    r.h.s. ≤ 1.

Now by the assumed condition (ii), E{H(x)} exists and is finite. Let

(18)    E₀{H(x)} = μ(θ₀).

Then using (14) and (18), the l.h.s. of the integrated inequality may be written

(19)    ∫_D u_n² dF_n - 2 ∫_D u_n² (B₁/k² + 1) dF_n - ∫_D u_n² α μ(θ₀)(T_n - θ₀)/k² dF_n - ∫_D u_n² α [B₂ - μ(θ₀)](T_n - θ₀)/k² dF_n
        = g₁ + g₂ + g₃ + g₄, say,

where g_i, i = 1, 2, 3, 4, is written for brevity for the i-th term in the r.h.s. of (19).
We now derive upper bounds for |g₂|, |g₃| and |g₄|, by applying Schwarz's inequality and using the fact that, for x ∈ D, by (14) and (16), |u_n| ≤ kδ√n. We have by Schwarz's inequality

(20)    g₂² ≤ 4 ∫_D u_n⁴ dF_n · ∫_D (B₁/k² + 1)² dF_n.

Now in the second integral in (20), using (12),

(21)    B₁/k² + 1 = (1/(nk²)) Σ_{i=1}^n [(∂²log f_i/∂θ²)₀ + k²],

where by (12), for each i, i = 1, 2, ..., n,

E[(∂²log f_i/∂θ²)₀ + k²] = 0,

and by the assumed further condition (iv),

(22)    E[(∂²log f_i/∂θ²)₀ + k²]² = σ₁²(θ₀), say.

Hence, since the x_i are distributed independently,

(23)    ∫_D (B₁/k² + 1)² dF_n ≤ E[B₁/k² + 1]² = σ₁²(θ₀)/(nk⁴).

Again, in the first integral in the r.h.s. of (20), since by (14) and (16) u_n² ≤ nk²δ²,

(24)    ∫_D u_n⁴ dF_n ≤ nk²δ² ∫_D u_n² dF_n = nk²δ² g₁.

Combining (23) and (24) with (20), we have

g₂² ≤ 4 δ² σ₁²(θ₀) g₁ / k²,

so that

(25)    |g₂| ≤ 2 δ σ₁(θ₀) g₁^{1/2} / k.
Next, in g₃ in (19), |α| ≤ 1, by the assumed condition (ii) μ(θ₀) ≤ M, by (16) |T_n - θ₀| ≤ δ, and ∫_D u_n² dF_n = g₁; hence

(26)    |g₃| ≤ M δ g₁ / k².

Lastly, in g₄, since |α| ≤ 1 and |T_n - θ₀| ≤ δ,

(27)    |g₄| ≤ (δ/k²) ∫_D u_n² |B₂ - μ(θ₀)| dF_n.

Applying Schwarz's inequality again, we have

(28)    g₄² ≤ (δ²/k⁴) ∫_D u_n⁴ dF_n · ∫_D [B₂ - μ(θ₀)]² dF_n.

In the second integral in the r.h.s. of (28), by (12),

(29)    B₂ - μ(θ₀) = (1/n) Σ_{i=1}^n [H(x_i) - μ(θ₀)],

where for each i, i = 1, 2, ..., n, by (18), E[H(x_i) - μ(θ₀)] = 0, and by the assumed further condition (iv), E[H(x_i) - μ(θ₀)]² is finite, = σ₂²(θ₀) say. Hence, since the x_i are distributed independently, E[B₂ - μ(θ₀)]² = σ₂²(θ₀)/n, so that in (28),

(30)    ∫_D [B₂ - μ(θ₀)]² dF_n ≤ σ₂²(θ₀)/n.

Now in the first integral in the r.h.s. of (28), since by (14) and (16) u_n² < nk²δ²,

(31)    ∫_D u_n⁴ dF_n ≤ nk²δ² g₁,

and hence

(32)    |g₄| ≤ δ² σ₂(θ₀) g₁^{1/2} / k.
Now in the r.h.s. of (19), obviously

(33)    g₁ + g₂ + g₃ + g₄ ≥ g₁ - |g₂| - |g₃| - |g₄|,

and hence, collecting together the results in (25), (26) and (32), we have from (16), (17) and (19), using (33),

(34)    g₁ (1 - Mδ/k²) - g₁^{1/2} [2δσ₁(θ₀)/k + δ²σ₂(θ₀)/k] ≤ 1.

Denoting the coefficient of g₁^{1/2} in (34) by 2K, where K → 0 as δ → 0, we write (34) as

(35)    g₁ (1 - Mδ/k²) - 2K g₁^{1/2} ≤ 1.

We now assume that δ is sufficiently small so that the coefficient of g₁ in the l.h.s. of (35) is positive; say,

(36)    δ ≤ k²/(2M).

Then, solving the quadratic in g₁^{1/2},

g₁^{1/2} ≤ [K + (K² + 1 - Mδ/k²)^{1/2}] / (1 - Mδ/k²),

so that, substituting the value of 2K from (34) and using (36), the resulting bound for g₁ can be reduced to the form

(37)    g₁ ≤ 1 + A(θ₀) δ,

where A(θ₀) is independent of δ.
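The step from (35) to (37) rests on solving a quadratic inequality in g₁^{1/2}. A small numerical check (the values of K and of a = Mδ/k² below are illustrative choices of mine, not from the paper) confirms that every g₁ satisfying (35) lies below the resulting bound.

```python
import math

# Inequality (35) reads  g1*(1 - a) - 2*K*sqrt(g1) <= 1,  with a = M*delta/k**2.
# Solving the quadratic in sqrt(g1) gives the bound used to reach (37):
#     sqrt(g1) <= (K + sqrt(K**2 + 1 - a)) / (1 - a).
a, K = 0.1, 0.05                      # illustrative values, 0 < a < 1
root = (K + math.sqrt(K ** 2 + 1.0 - a)) / (1.0 - a)
bound = root ** 2                     # resulting upper bound for g1

# Every g1 on a fine grid over [0, 2] that satisfies (35) must lie below it.
ok = True
for i in range(20001):
    g1 = i * 1e-4
    if g1 * (1.0 - a) - 2.0 * K * math.sqrt(g1) <= 1.0:
        ok = ok and (g1 <= bound + 1e-12)

print(bound, ok)
```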
Returning to the loss function in (4), we now write it in the form

(38)    L*_n(t, θ₀) = n (t - θ₀)² [1 + h(t, θ₀)],

where by assumption, given any arbitrary number ε > 0, we can find a δ₀ such that

(39)    |h(t, θ₀)| ≤ ε  for all t such that |t - θ₀| ≤ δ₀.

We now take δ = min[δ₀, k²/(2M)], so that δ satisfies both (36) and (39). Now

(40)    E(L*_n(T_n, θ₀)) = ∫_{R_n} L*_n(T_n, θ₀) dF_n = ∫_D L*_n(T_n, θ₀) dF_n + ∫_{R_n - D} L*_n(T_n, θ₀) dF_n,

D being the subset of R_n defined by (16). Now the first integral in the r.h.s. of (40) is

∫_D n(T_n - θ₀)² dF_n + ∫_D n(T_n - θ₀)² h(T_n, θ₀) dF_n ≤ (1 + ε) ∫_D n(T_n - θ₀)² dF_n,

by (39). Hence, noting the definitions of g₁ and u_n in (19) and (14),

(41)    ∫_D L*_n(T_n, θ₀) dF_n ≤ (1 + ε) g₁ / k².
Then consider the second integral in the r.h.s. of (40). Here we note that since the distribution of T_n is asymptotically normal, by a well known property of the Normal frequency function, (8) holds for T_n. Hence for given δ and ε, by making n sufficiently large, we can make

(42)    P[|T_n - θ₀| > δ] = P[R_n - D] ≤ ε/(nC),

where C is the upper bound in (10). Then from (10) and (42),

(43)    ∫_{R_n - D} L*_n(T_n, θ₀) dF_n ≤ ε.

Now collecting the results in (40), (41) and (43), and using (37), we have

(44)    E(L*_n(T_n, θ₀)) ≤ (1 + ε)(1 + A(θ₀)δ)/k² + ε

for all sufficiently large n. Since ε and δ can be made arbitrarily small, it follows that

(45)    lim sup_{n→∞} E(L*_n(T_n, θ₀)) ≤ 1/k².

But 1/k² is the asymptotic variance of T_n, and hence from (5) and (6),

(46)    lim inf_{n→∞} E(L*_n(T_n, θ₀)) ≥ 1/k².

(45) and (46) together imply

(47)    lim_{n→∞} E(L*_n(T_n, θ₀)) = 1/k²,

which was to be proved.
3. Asymptotic Efficiency.

Here we incidentally point out another optimum property of a m.l.e., which is implied by a relation given in Chernoff (1956), but which has a special significance in relation to superefficient estimates, which is stressed here.

According to Fisher's original concept of asymptotic efficiency (A.E. for short), the A.E. of any estimate T_n is given by

(48)    A.E. = 1 / lim_{n→∞} { n I(θ) E[T_n - θ]² },

where I(θ) is Fisher's measure of information evaluated at θ:

(49)    I(θ) = E[(∂log f/∂θ)²].

It was assumed that this A.E. has an upper bound of 1. This assumption has however turned out to be false, and in fact the A.E. in (48) has no upper bound, so that given any estimate we can construct another uniformly superior to it, and there exists no most efficient estimate. The Fisherian concept of A.E. is thus void.

A superefficient estimate is one whose A.E. is not less than 1 for any θ and exceeds 1 for at least one θ. An example of such an estimate was first presented by J. L. Hodges, Jr. in 1951. The example relates to a Normal population N(θ, 1), and is

(50)    T_n = a x̄_n  if |x̄_n| < n^{-1/4};   T_n = x̄_n  if |x̄_n| ≥ n^{-1/4},

where a is some constant such that 0 ≤ |a| < 1, and x̄_n = (1/n) Σ_{i=1}^n x_i is the sample mean.

Further examples of superefficient estimates were later given by Le Cam (1953), who constructed an example where the set of superefficiency, i.e. the set of values of θ for which the A.E. exceeds 1, is non-denumerable and everywhere dense. It was also proved by him that the set of superefficiency must necessarily be of Lebesgue measure zero.
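Hodges' estimate (50) is easy to simulate. The sketch below is an illustration of mine (with a = 0 for simplicity, and sample sizes of my own choosing): it compares n × M.S.E. of T_n and of the sample mean both at θ = 0 and at θ = n^{-1/4}, exhibiting the superefficiency at the origin alongside the inflated risk nearby.

```python
import numpy as np

# Hodges' superefficient estimate (50) with a = 0, for a N(theta, 1) sample:
#   T_n = 0      if |x_bar| <  n**(-1/4)
#   T_n = x_bar  otherwise.
rng = np.random.default_rng(2)
n, trials = 10_000, 4000

def n_mse(theta):
    """Return n * M.S.E. of Hodges' T_n and of x_bar at the given theta."""
    # x_bar is N(theta, 1/n), so it can be drawn directly.
    xbar = rng.normal(loc=theta, scale=1.0 / np.sqrt(n), size=trials)
    t = np.where(np.abs(xbar) < n ** (-0.25), 0.0, xbar)
    return n * np.mean((t - theta) ** 2), n * np.mean((xbar - theta) ** 2)

hodges_at_0, mean_at_0 = n_mse(0.0)          # at theta = 0
hodges_near, mean_near = n_mse(n ** (-0.25)) # at theta = n**(-1/4)

print(hodges_at_0, mean_at_0)   # Hodges far below 1; x_bar close to 1
print(hodges_near, mean_near)   # Hodges far above 1; x_bar still close to 1
```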
It should be noted here that, though the A.E. of the superefficient estimate T_n in (50) is for all θ at least as high as that of the m.l.e. x̄_n, T_n is not 'superior' to x̄_n, even if we take the M.S.E. (mean squared error) as the sole criterion. While T_n has a lower M.S.E. at θ = 0 for every n, however large, there exist values of θ (in fact an interval of values) in which x̄_n has a lower M.S.E. The observation made sometimes, that according to a 'behaviourist viewpoint' the superefficient estimate T_n is to be preferred to x̄_n, is thus not correct. As pointed out by Le Cam (1953), the same thing occurs in the case of all other superefficient estimates.

The author considers that this reveals a defect in the definition (48), as the possession of a higher A.E. does not imply the possession of a higher degree of some desirable property. The defect in the definition (48) arises from the fact that it takes into consideration the performance of the estimate T_n, for each value of n, at some single point θ = θ₀. The true value of θ is however not known, though as n increases we can locate θ₀ more and more closely with a given degree of confidence. This suggests that a revised definition of A.E. should take into account the performance of T_n, not at a single point θ₀, but in some neighbourhood of it. With these considerations a modified definition of A.E. is proposed below.

Let c_n be a sequence of positive numbers such that

(51)    c_n → 0,  √n c_n → ∞  as n → ∞.

Then the suggested revised definition is

(52)    A.E. = lim inf_{n→∞}  1 / { sup_{θ₀ - c_n < θ < θ₀ + c_n}  n I(θ) E_θ(T_n - θ)² }.
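The effect of the revised definition (52) on Hodges' estimate can be seen numerically. The sketch below is illustrative (my own choices: a = 0, c_n = n^{-1/3}, which satisfies (51), and I(θ) = 1 for N(θ, 1)): the supremum in the denominator of (52) at θ₀ = 0, approximated over a grid, grows without bound as n increases, so the revised A.E. of Hodges' estimate is zero.

```python
import numpy as np

# Revised A.E. (52) for Hodges' estimate (a = 0) at theta_0 = 0, with
# c_n = n**(-1/3), which satisfies (51): c_n -> 0 and sqrt(n)*c_n -> inf.
# For N(theta, 1), I(theta) = 1, so the quantity inside the sup is
# n * E_theta[(T_n - theta)**2], approximated here by Monte Carlo.
rng = np.random.default_rng(3)
trials = 3000

def sup_normalized_risk(n):
    """Approximate sup over a grid in (-c_n, c_n) of n * E[(T_n - theta)^2]."""
    c_n = n ** (-1.0 / 3.0)
    worst = 0.0
    for theta in np.linspace(-c_n, c_n, 21):
        xbar = rng.normal(loc=theta, scale=1.0 / np.sqrt(n), size=trials)
        t = np.where(np.abs(xbar) < n ** (-0.25), 0.0, xbar)
        worst = max(worst, n * np.mean((t - theta) ** 2))
    return worst

# The supremum grows with n, so lim inf of its reciprocal is 0.
print(sup_normalized_risk(1000), sup_normalized_risk(100_000))
```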
The following result is proved in Chernoff (1956, Theorem 1):

(53)    lim_{k→∞} lim inf_{n→∞}  sup_{-k/√n < θ < k/√n}  n I(θ) E_θ(min[(T_n - θ)², k²/n])  ≥ 1.

From the proof of the theorem, it is seen that the following conditions are assumed:

(a) that the frequency function f(x,θ) and the estimate T_n satisfy the 'regularity' conditions sufficient for the Cramér-Rao inequality, such as those given by Wolfowitz (1947);

(b) I(θ) is bounded away from zero in some neighbourhood of θ = 0.

For (52) we now replace condition (b) by the condition

(c) I(θ) is bounded away from zero in some neighbourhood of every point θ₀ in the range of values assumed by θ.

It is then easily seen to be a consequence of (53) that the A.E. defined by (52) has an upper bound of 1. It is further seen that this upper bound is attained for a m.l.e., while for the superefficient estimate T_n the A.E. by the revised definition is nil.
Acknowledgement

The relation given in (53) was pointed out to me by a referee of the Annals in connection with another paper. I have also to thank Mrs. Kay Herring for doing the typing work neatly and speedily.
References

(1) Chernoff, H. (1956). Large sample theory. Ann. Math. Statist., Vol. 27, p. 1-23.

(2) Cramér, Harald (1946). Mathematical Methods of Statistics. Princeton University Press, p. 500-503.

(3) Le Cam, L. (1953). On some asymptotic properties of maximum likelihood estimates and related Bayes estimates. University of California Publications in Statistics, Vol. 1, p. 277-330.

(4) von Neumann, J. and Morgenstern, O. (1947). Theory of Games and Economic Behavior. 2nd ed. Princeton University Press, Princeton, N. J., p. 641.

(5) Linnik, Yu. V. and Mitrofanova, N. M. (1965). Some asymptotic expansions for the distribution of the maximum likelihood estimate. Sankhyā, Series A, Vol. 27, Part 1, p. 73.

(6) Wolfowitz, J. (1947). The efficiency of sequential estimates and Wald's equation for sequential processes. Ann. Math. Statist., Vol. 18, p. 215-230.