
Point Estimation: properties of estimators
• finite-sample properties (CB 7.3)
• large-sample properties (CB 10.1)
1 FINITE-SAMPLE PROPERTIES
How an estimator performs for a finite number of observations $n$.

Estimator: $W$
Parameter: $\theta$

Criteria for evaluating estimators:
• Bias: does $EW = \theta$?
• Variance of $W$ (you would like an estimator with a smaller variance)

Example: $X_1, \ldots, X_n \sim$ i.i.d. $(\mu, \sigma^2)$.
Unknown parameters are $\mu$ and $\sigma^2$.
Consider:
$\hat\mu_n \equiv \frac{1}{n}\sum_i X_i$, estimator of $\mu$;
$\hat\sigma_n^2 \equiv \frac{1}{n}\sum_i (X_i - \bar X_n)^2$, estimator of $\sigma^2$.
Bias: $E\hat\mu_n = \frac{1}{n}\cdot n\mu = \mu$. So unbiased.
$\mathrm{Var}\,\hat\mu_n = \frac{1}{n^2}\cdot n\sigma^2 = \frac{1}{n}\sigma^2$.
$$E\hat\sigma_n^2 = E\Big(\frac{1}{n}\sum_i (X_i - \bar X_n)^2\Big) = \frac{1}{n}\sum_i \big(EX_i^2 - 2EX_i\bar X_n + E\bar X_n^2\big) = \frac{1}{n}\cdot n\Big[(\mu^2+\sigma^2) - 2\Big(\mu^2+\frac{\sigma^2}{n}\Big) + \frac{\sigma^2}{n}+\mu^2\Big] = \frac{n-1}{n}\sigma^2.$$
Hence it is biased.
To fix this bias, consider the estimator $s_n^2 \equiv \frac{1}{n-1}\sum_i (X_i - \bar X_n)^2$; then $Es_n^2 = \sigma^2$ (unbiased).
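A quick Monte Carlo sketch of this bias (illustrative only, not from the notes; the normal distribution, $\mu = 2$, $\sigma = 3$, $n = 10$, and the replication count are arbitrary choices):

```python
# Bias check: the divide-by-n variance estimator is biased downward by the factor
# (n-1)/n, while s^2 (divide by n-1) is unbiased. All parameter values are
# arbitrary illustration choices.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 2.0, 3.0, 10, 200_000

x = rng.normal(mu, sigma, size=(reps, n))
sigma2_hat = x.var(axis=1, ddof=0)   # (1/n) * sum (X_i - Xbar)^2
s2 = x.var(axis=1, ddof=1)           # (1/(n-1)) * sum (X_i - Xbar)^2

print("true sigma^2           :", sigma**2)              # 9
print("E[sigma2_hat] (approx) :", sigma2_hat.mean())     # ~ (n-1)/n * 9 = 8.1
print("E[s^2]        (approx) :", s2.mean())             # ~ 9
```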
The mean-squared error (MSE) of $W$ is $E(W - \theta)^2$. It is a common criterion for comparing estimators.
Decompose: $MSE(W) = VW + (EW - \theta)^2 = \text{Variance} + (\text{Bias})^2$.
Hence, for an unbiased estimator, $MSE(W) = VW$.
Example: $X_1, \ldots, X_n \sim U[0,\theta]$, with $f(x) = 1/\theta$ for $x \in [0,\theta]$.
• Consider the estimator $\hat\theta_n \equiv 2\bar X_n$.
$E\hat\theta_n = 2\cdot\frac{1}{n}\cdot E\sum_i X_i = 2\cdot\frac{1}{n}\cdot\frac{1}{2}n\theta = \theta$. So unbiased.
$MSE(\hat\theta_n) = V\hat\theta_n = \frac{4}{n^2}\sum_i VX_i = \frac{4}{n^2}\cdot n\cdot\frac{\theta^2}{12} = \frac{\theta^2}{3n}$.
• Consider the estimator $\tilde\theta_n \equiv \max(X_1,\ldots,X_n)$.
In order to derive moments, start by deriving the CDF:
$$P(\tilde\theta_n \le z) = P(X_1\le z, X_2\le z, \ldots, X_n\le z) = \prod_{i=1}^n P(X_i\le z) = \begin{cases}\left(\frac{z}{\theta}\right)^n & \text{if } 0\le z\le\theta\\ 1 & \text{if } z > \theta.\end{cases}$$
Therefore $f_{\tilde\theta_n}(z) = n\cdot\left(\frac{z}{\theta}\right)^{n-1}\cdot\frac{1}{\theta}$, for $0\le z\le\theta$.
$$E(\tilde\theta_n) = \int_0^\theta z\cdot n\cdot\left(\frac{z}{\theta}\right)^{n-1}\frac{1}{\theta}\,dz = \frac{n}{\theta^n}\int_0^\theta z^n\,dz = \frac{n}{n+1}\theta.$$
$\mathrm{Bias}(\tilde\theta_n) = -\theta/(n+1)$.
$$E(\tilde\theta_n^2) = \frac{n}{\theta^n}\int_0^\theta z^{n+1}\,dz = \frac{n}{n+2}\theta^2.$$
Hence
$$V\tilde\theta_n = \frac{n}{n+2}\theta^2 - \left(\frac{n}{n+1}\right)^2\theta^2 = \theta^2\frac{n}{(n+2)(n+1)^2}.$$
Accordingly, $MSE = \frac{2\theta^2}{(n+2)(n+1)}$.
Continue the previous example. Redefine $\tilde\theta_n = \frac{n+1}{n}\max(X_1,\ldots,X_n)$. Now both estimators $\hat\theta_n$ and $\tilde\theta_n$ are unbiased.
Which is better? $V\hat\theta_n = \frac{\theta^2}{3n} = O(1/n)$, while
$$V\tilde\theta_n = \left(\frac{n+1}{n}\right)^2 V(\max(X_1,\ldots,X_n)) = \theta^2\frac{1}{n(n+2)} = O(1/n^2).$$
Hence, for $n$ large enough, $\tilde\theta_n$ has a smaller variance, and in this sense it is “better”.
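A simulation sketch of this comparison (illustrative; $\theta = 5$, $n = 50$, and the replication count are arbitrary choices, not from the notes):

```python
# Both estimators of theta are unbiased, but the rescaled maximum has variance
# O(1/n^2) versus O(1/n) for 2*Xbar.
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 5.0, 50, 200_000

x = rng.uniform(0, theta, size=(reps, n))
theta_hat = 2 * x.mean(axis=1)                # 2 * sample mean
theta_tilde = (n + 1) / n * x.max(axis=1)     # rescaled sample maximum

print("Var(2*Xbar)     sim:", theta_hat.var(),   "theory:", theta**2 / (3 * n))
print("Var(max-based)  sim:", theta_tilde.var(), "theory:", theta**2 / (n * (n + 2)))
```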
Best unbiased estimator: if you rank estimators by MSE but restrict yourself to unbiased estimators, then the best estimator is the one with the lowest variance.
A best unbiased estimator is also called the “Uniform minimum variance unbiased
estimator” (UMVUE).
Formally: an estimator $W$ is a UMVUE of $\theta$ if it satisfies:
(i) $E_\theta W = \theta$, for all $\theta$ (unbiasedness);
(ii) $V_\theta W \le V_\theta\tilde W$, for all $\theta$ and all other unbiased estimators $\tilde W$.
The “uniform” condition is crucial, because it is always possible to find estimators which have zero variance for a specific value of $\theta$.
It is difficult in general to verify that an estimator $W$ is UMVUE, since you have to verify condition (ii) of the definition, i.e., that $VW$ is smaller than the variance of every other unbiased estimator.
Luckily, we have an important result for the lowest attainable variance of an estimator.
• Theorem 7.3.9 (Cramer-Rao Inequality): Let $X_1,\ldots,X_n$ be a sample with joint pdf $f(X|\theta)$, and let $W(X)$ be any estimator satisfying
(i) $\frac{d}{d\theta}E_\theta W(X) = \int \frac{\partial}{\partial\theta}\big[W(X)\cdot f(X|\theta)\big]\,dX$;
(ii) $V_\theta W(X) < \infty$.
Then
$$V_\theta W(X) \ge \frac{\left(\frac{d}{d\theta}E_\theta W(X)\right)^2}{E_\theta\left(\frac{\partial}{\partial\theta}\log f(X|\theta)\right)^2}.$$
The RHS above is called the Cramer-Rao Lower Bound.
Proof: CB, pg. 336.
Note: the LHS of condition (i) above is $\frac{d}{d\theta}\int W(X)f(X|\theta)\,dX$, so by Leibniz' rule this condition basically rules out cases where the support of $X$ depends on $\theta$.
• The equality
$$E_\theta\frac{\partial}{\partial\theta}\log f(X|\theta) = \int\frac{1}{f(X|\theta)}\frac{\partial f(X|\theta)}{\partial\theta}\,f(X|\theta)\,dx = \int\frac{\partial}{\partial\theta}f(X|\theta)\,dx = \frac{\partial}{\partial\theta}\cdot 1 = 0 \qquad (1)$$
is noteworthy, because
$$\frac{\partial}{\partial\theta}\log f(X|\theta) = 0$$
is the FOC of the maximum likelihood estimation problem. (Alternatively, as in CB, apply condition (i) of the CR result, using $W(X) = 1$.)
In the i.i.d. case, this becomes the sample average
$$\frac{1}{n}\sum_i\frac{\partial}{\partial\theta}\log f(x_i|\theta) = 0.$$
And by the LLN:
$$\frac{1}{n}\sum_i\frac{\partial}{\partial\theta}\log f(x_i|\theta) \overset{p}{\to} E_{\theta_0}\frac{\partial}{\partial\theta}\log f(x_i|\theta),$$
where $\theta_0$ is the true value of $\theta$. This shows that maximum likelihood estimation of $\theta$ is equivalent to estimation based on the moment condition
$$E_{\theta_0}\frac{\partial}{\partial\theta}\log f(x_i|\theta) = 0,$$
which holds only at the true value $\theta = \theta_0$. (Thus MLE is “consistent” for the true value $\theta_0$, as we'll see later.)
(However, note that Eq. (1) holds at all values of $\theta$, not just $\theta_0$.)
What if the model is “misspecified”, in the sense that the true density of $X$ is $g(x)$ and $f(x|\theta) \ne g(x)$ for all $\theta\in\Theta$ (that is, there is no value of the parameter $\theta$ such that the postulated model $f$ coincides with the true model $g$)? Does Eq. (1) still hold? What is MLE looking for?
• In the i.i.d. case, the CR lower bound can be simplified.
Corollary 7.3.10: if $X_1,\ldots,X_n \sim$ i.i.d. $f(X|\theta)$, then
$$V_\theta W(X) \ge \frac{\left(\frac{d}{d\theta}E_\theta W(X)\right)^2}{n\cdot E_\theta\left(\frac{\partial}{\partial\theta}\log f(X|\theta)\right)^2}.$$
• Up to this point, the Cramer-Rao results are not that operational for finding a “best” estimator, because the estimator $W(X)$ appears on both sides of the inequality. However, for an unbiased estimator, how can you simplify the expression further?
• Example: $X_1,\ldots,X_n \sim$ i.i.d. $N(\mu,\sigma^2)$.
What is the CRLB for an unbiased estimator of $\mu$?
Unbiased $\to$ numerator $= 1$.
$$\log f(x|\theta) = -\log\sqrt{2\pi} - \log\sigma - \frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2$$
$$\frac{\partial}{\partial\mu}\log f(x|\theta) = -\frac{x-\mu}{\sigma}\cdot\frac{-1}{\sigma} = \frac{x-\mu}{\sigma^2}$$
$$E\left(\frac{\partial}{\partial\mu}\log f(X|\theta)\right)^2 = \frac{E(X-\mu)^2}{\sigma^4} = \frac{1}{\sigma^4}VX = \frac{1}{\sigma^2}.$$
Hence the CRLB $= \frac{1}{n\cdot\frac{1}{\sigma^2}} = \frac{\sigma^2}{n}$.
This is the variance of the sample mean, so that the sample mean is a UMVUE for $\mu$.
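A numerical sketch of this CRLB calculation (illustrative assumptions: $N(\mu,\sigma^2)$ with $\mu = 1$, $\sigma = 2$, $n = 25$; these values are not from the notes):

```python
# Check that E[(d/dmu log f)^2] = 1/sigma^2, that the CRLB is sigma^2/n, and that
# the sample mean's variance attains it.
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 1.0, 2.0, 25, 200_000

x = rng.normal(mu, sigma, size=(reps, n))
xbar = x.mean(axis=1)

score = (x - mu) / sigma**2               # per-observation score for mu
fisher_per_obs = (score**2).mean()        # ~ 1/sigma^2 = 0.25

print("E[(d/dmu log f)^2]   :", fisher_per_obs, " theory:", 1 / sigma**2)
print("CRLB  sigma^2/n      :", sigma**2 / n)
print("Var(sample mean) sim :", xbar.var())      # attains the bound
```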
• Sometimes we can simplify the denominator of the CRLB further.
Lemma 7.3.11 (Information inequality): if $f(X|\theta)$ satisfies
$$(*)\qquad \frac{d}{d\theta}E_\theta\left[\frac{\partial}{\partial\theta}\log f(X|\theta)\right] = \int\frac{\partial}{\partial\theta}\left[\frac{\partial}{\partial\theta}\log f(X|\theta)\cdot f(X|\theta)\right]dx,$$
then
$$E_\theta\left(\frac{\partial}{\partial\theta}\log f(X|\theta)\right)^2 = -E_\theta\left[\frac{\partial^2}{\partial\theta^2}\log f(X|\theta)\right].$$
Rough proof:
LHS of (*): Using Eq. (1) above, we get that the LHS of (*) $= 0$.
RHS of (*):
$$\int\frac{\partial}{\partial\theta}\left[\frac{\partial\log f}{\partial\theta}\,f\right]dx = \int\frac{\partial^2\log f}{\partial\theta^2}\,f\,dx + \int\frac{1}{f}\frac{\partial f}{\partial\theta}\cdot\frac{\partial\log f}{\partial\theta}\,f\,dx = E\left[\frac{\partial^2\log f}{\partial\theta^2}\right] + E\left(\frac{\partial\log f}{\partial\theta}\right)^2,$$
using $\frac{1}{f}\frac{\partial f}{\partial\theta} = \frac{\partial\log f}{\partial\theta}$ in the second integral.
Putting the LHS and RHS together yields the desired result.
Note: the LHS of the above condition (*) is just $\frac{d}{d\theta}\int\frac{\partial}{\partial\theta}\log f(X|\theta)\,f(X|\theta)\,dX$, so that, by Leibniz' rule, condition (*) just states that the bounds of integration (i.e., the support of $X$) do not depend on $\theta$. The normal distribution satisfies this (its support is always $(-\infty,\infty)$), but $U[0,\theta]$ does not.
Also, the information inequality depends crucially on the equality $E_\theta\frac{\partial}{\partial\theta}\log f(X|\theta) = 0$, which depends on the correct specification of the model. Thus the information inequality can be used as the basis of a “specification test”. (How?)
• Example: for the previous example, consider the CRLB for an unbiased estimator of $\sigma^2$.
We can use the information inequality, because condition (*) is satisfied for the normal distribution. Hence:
$$\frac{\partial}{\partial\sigma^2}\log f(x|\theta) = -\frac{1}{2\sigma^2} + \frac{1}{2}\frac{(x-\mu)^2}{\sigma^4}$$
$$\frac{\partial}{\partial\sigma^2}\frac{\partial}{\partial\sigma^2}\log f(x|\theta) = \frac{1}{2\sigma^4} - \frac{(x-\mu)^2}{\sigma^6}$$
$$-E\left[\frac{\partial}{\partial\sigma^2}\frac{\partial}{\partial\sigma^2}\log f(x|\theta)\right] = -\frac{1}{2\sigma^4} + \frac{1}{\sigma^4} = \frac{1}{2\sigma^4}.$$
Hence the CRLB is $\frac{2\sigma^4}{n}$.
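A Monte Carlo sketch of the information equality used here (illustrative values: $\mu = 0$, $\sigma^2 = 4$; not from the notes):

```python
# Check E[(d/d(sigma^2) log f)^2] = -E[d^2/d(sigma^2)^2 log f] = 1/(2 sigma^4)
# for the normal density, by simulation.
import numpy as np

rng = np.random.default_rng(3)
mu, s2 = 0.0, 4.0
x = rng.normal(mu, np.sqrt(s2), size=1_000_000)

score = -1 / (2 * s2) + (x - mu) ** 2 / (2 * s2**2)     # d/d(sigma^2) log f
hess = 1 / (2 * s2**2) - (x - mu) ** 2 / s2**3          # d^2/d(sigma^2)^2 log f

print("E[score^2]       :", (score**2).mean())
print("-E[hessian]      :", -hess.mean())
print("theory 1/(2 s^4) :", 1 / (2 * s2**2))            # = 1/32
```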
• Example: $X_1,\ldots,X_n \sim U[0,\theta]$. Check the conditions of the CRLB for an unbiased estimator $W(X)$ of $\theta$.
$\frac{d}{d\theta}EW(X) = 1$ (because it is unbiased), but (illustrating with $n = 1$, so that $f(X|\theta) = 1/\theta$ on $[0,\theta]$)
$$\int\frac{\partial}{\partial\theta}\big[W(X)f(X|\theta)\big]\,dX = \int_0^\theta W(X)\cdot\frac{-1}{\theta^2}\,dX \ne 1 = \frac{d}{d\theta}EW(X).$$
Hence, condition (i) of the theorem is not satisfied.
• But when can the CRLB (if it exists) be attained?
Corollary 7.3.15: $X_1,\ldots,X_n \sim$ i.i.d. $f(X|\theta)$, satisfying the conditions of the CR theorem.
– Let $L(\theta|X) = \prod_{i=1}^n f(X_i|\theta)$ denote the likelihood function.
– Let the estimator $W(X)$ be unbiased for $\theta$.
– Then $W(X)$ attains the CRLB iff you can write
$$\frac{\partial}{\partial\theta}\log L(\theta|X) = a(\theta)\big[W(X) - \theta\big]$$
for some function $a(\theta)$.
• Example: $X_1,\ldots,X_n \sim$ i.i.d. $N(\mu,\sigma^2)$.
Consider estimating $\mu$:
$$\frac{\partial}{\partial\mu}\log L(\theta|X) = \frac{\partial}{\partial\mu}\sum_{i=1}^n\log f(X_i|\theta) = \frac{\partial}{\partial\mu}\sum_{i=1}^n\left[-\log\sqrt{2\pi} - \log\sigma - \frac{1}{2}\frac{(X_i-\mu)^2}{\sigma^2}\right] = \sum_{i=1}^n\frac{X_i-\mu}{\sigma^2} = \frac{n}{\sigma^2}\cdot(\bar X_n - \mu).$$
Hence, the CRLB can be attained (in fact, we showed earlier that the CRLB is attained by $\bar X_n$).
Loss function optimality
Let $X \sim f(X|\theta)$.
Consider a loss function $L(\theta, W(X))$, taking values in $[0,+\infty)$, which penalizes you when your estimator $W(X)$ is “far” from the true parameter $\theta$. Note that $L(\theta, W(X))$ is a random variable, since $X$ (and hence $W(X)$) is random.
Consider estimators which minimize expected loss: that is,
$$\min_{W(\cdot)} E_\theta L(\theta, W(X)) \equiv \min_{W(\cdot)} R(\theta, W(\cdot)),$$
where $R(\theta, W(\cdot))$ is the risk function. (Note: the risk function is not a random variable, because $X$ has been integrated out.)
Loss function optimality is a more general criterion than minimum MSE. In fact, because $MSE(W(X)) = E_\theta\big(W(X)-\theta\big)^2$, the MSE is actually the risk function associated with the quadratic loss function $L(\theta, W(X)) = \big(W(X)-\theta\big)^2$.
Other examples of loss functions:
• Absolute error loss: $|W(X)-\theta|$
• Relative quadratic error loss: $\frac{(W(X)-\theta)^2}{|\theta|+1}$
The exercise of minimizing risk takes a given value of $\theta$ as fixed, so the minimized risk of an estimator depends on which value of $\theta$ you are considering. You might instead be interested in an estimator which does well regardless of the value of $\theta$. (This is analogous to the focus on uniform minimum variance.)
For this different problem, you want to consider a notion of risk which does not depend on $\theta$. Two criteria which have been considered are:
• “Average” risk:
$$\min_{W(\cdot)} \int R(\theta, W(\cdot))\,h(\theta)\,d\theta,$$
where $h(\theta)$ is some weighting function across $\theta$. (In a Bayesian interpretation, $h(\theta)$ is a prior density over $\theta$.)
• Minmax criterion:
$$\min_{W(\cdot)}\max_\theta R(\theta, W(\cdot)).$$
Here you choose the estimator $W(\cdot)$ to minimize the maximum risk $\max_\theta R(\theta, W(\cdot))$, where $\theta$ is set to its “worst” value. So the minmax optimizer is the best that can be achieved in a “worst-case” scenario.
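A Monte Carlo sketch of comparing risk functions at a fixed $\theta$ (illustrative assumptions, not from the notes: normal data with $\mu = 0$, $\sigma = 1$, $n = 20$, and two estimators, the sample mean and the sample median):

```python
# Approximate the risk R(theta, W) = E L(theta, W(X)) under quadratic and absolute
# error loss for two estimators of the normal mean.
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n, reps = 0.0, 1.0, 20, 200_000

x = rng.normal(mu, sigma, size=(reps, n))
estimators = {"mean": x.mean(axis=1), "median": np.median(x, axis=1)}

for name, w in estimators.items():
    quad_risk = ((w - mu) ** 2).mean()    # risk under quadratic loss (= MSE)
    abs_risk = np.abs(w - mu).mean()      # risk under absolute error loss
    print(f"{name:6s}  quadratic risk: {quad_risk:.4f}   absolute risk: {abs_risk:.4f}")
```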
Example: $X_1,\ldots,X_n \sim$ i.i.d. $N(\mu,\sigma^2)$. The sample mean $\bar X_n$ is:
• unbiased
• minimum MSE
• UMVUE
• attains the CRLB
• minimizes expected quadratic loss
2 LARGE SAMPLE PROPERTIES OF ESTIMATORS
It can be difficult to compute MSE, risk functions, etc., for some estimators, especially when the estimator does not resemble a sample average.
Large-sample properties: exploit LLN, CLT
Consider data $\{X_1, X_2, \ldots\}$ from which we construct a sequence of estimators $W_n \equiv W_n(X_1,\ldots,X_n)$, $n = 1, 2, \ldots$. $W_n$ is a random sequence.
Define: we say that $W_n$ is consistent for a parameter $\theta$ iff the random sequence $W_n$ converges (in some stochastic sense) to $\theta$. Strong consistency obtains when $W_n \overset{a.s.}{\to} \theta$. Weak consistency obtains when $W_n \overset{p}{\to} \theta$.
For estimators like sample means, consistency (either weak or strong) follows easily using an LLN. Consistency can be thought of as the large-sample version of unbiasedness.
Define: an M-estimator of $\theta$ is a maximizer of an objective function $Q_n(\theta)$.
Examples:
• MLE: $Q_n(\theta) = \frac{1}{n}\sum_{i=1}^n\log f(x_i|\theta)$.
• Least squares: $Q_n(\theta) = -\sum_{i=1}^n\big[y_i - g(x_i;\theta)\big]^2$ (with the negative sign, so that the M-estimator is a maximizer). OLS is the special case when $g(x_i;\theta) = \alpha + X_i\beta$.
• GMM: $Q_n(\theta) = -G_n(\theta)'W_n(\theta)G_n(\theta)$ (sign again chosen so that the estimator is a maximizer), where
$$G_n(\theta) = \left(\frac{1}{n}\sum_{i=1}^n m_1(x_i;\theta),\ \frac{1}{n}\sum_{i=1}^n m_2(x_i;\theta),\ \ldots,\ \frac{1}{n}\sum_{i=1}^n m_M(x_i;\theta)\right)',$$
an $M\times 1$ vector of sample moment conditions, and $W_n$ is an $M\times M$ weighting matrix.
Notation: Denote the limit objective function $Q_0(\theta) = \operatorname{plim}_{n\to\infty} Q_n(\theta)$ (at each $\theta$). Define $\theta_0 \equiv \operatorname{argmax}_\theta Q_0(\theta)$.
Consistency of M-estimators
Make the following assumptions:
1. $Q_0(\theta)$ is uniquely maximized at some value $\theta_0$ (“identification”).
2. The parameter space $\Theta$ is a compact subset of $\mathbb{R}^K$.
3. $Q_0(\theta)$ is continuous in $\theta$.
4. $Q_n(\theta)$ converges uniformly in probability to $Q_0(\theta)$; that is,
$$\sup_{\theta\in\Theta}|Q_n(\theta) - Q_0(\theta)| \overset{p}{\to} 0.$$
Theorem (Consistency of M-estimators): Under assumptions 1, 2, 3, 4, $\theta_n \overset{p}{\to} \theta_0$.
Proof: We need to show that, for any arbitrarily small neighborhood $N$ containing $\theta_0$, $P(\theta_n\in N) \to 1$.
The uniform convergence condition implies that, for all $\epsilon, \delta > 0$ and $n$ large enough,
$$P\Big(\sup_{\theta\in\Theta}|Q_n(\theta)-Q_0(\theta)| < \epsilon/2\Big) > 1-\delta.$$
The event “$\sup_{\theta\in\Theta}|Q_n(\theta)-Q_0(\theta)| < \epsilon/2$” implies
$$Q_n(\theta_n) - Q_0(\theta_n) < \epsilon/2 \iff Q_0(\theta_n) > Q_n(\theta_n) - \epsilon/2. \qquad (2)$$
Similarly,
$$Q_n(\theta_0) - Q_0(\theta_0) > -\epsilon/2 \implies Q_n(\theta_0) > Q_0(\theta_0) - \epsilon/2. \qquad (3)$$
Since $\theta_n = \operatorname{argmax}_\theta Q_n(\theta)$, so that $Q_n(\theta_n) \ge Q_n(\theta_0)$, Eq. (2) implies
$$Q_0(\theta_n) > Q_n(\theta_0) - \epsilon/2. \qquad (4)$$
Hence, adding Eqs. (3) and (4), we have
$$Q_0(\theta_n) > Q_0(\theta_0) - \epsilon. \qquad (5)$$
So we have shown that
$$\sup_{\theta\in\Theta}|Q_n(\theta)-Q_0(\theta)| < \epsilon/2 \implies Q_0(\theta_n) > Q_0(\theta_0) - \epsilon,$$
so that
$$P\big(Q_0(\theta_n) > Q_0(\theta_0) - \epsilon\big) \ge P\Big(\sup_{\theta\in\Theta}|Q_n(\theta)-Q_0(\theta)| < \epsilon/2\Big) \to 1.$$
Now define $N$ as any open neighborhood of $\mathbb{R}^K$ which contains $\theta_0$, and let $\bar N$ be the complement of $N$ in $\mathbb{R}^K$. Then $\Theta\cap\bar N$ is compact, so that $\max_{\theta\in\Theta\cap\bar N} Q_0(\theta)$ exists. Set $\epsilon = Q_0(\theta_0) - \max_{\theta\in\Theta\cap\bar N} Q_0(\theta)$. Then
$$Q_0(\theta_n) > Q_0(\theta_0) - \epsilon \implies Q_0(\theta_n) > \max_{\theta\in\Theta\cap\bar N} Q_0(\theta) \implies \theta_n\in N,$$
so that
$$P(\theta_n\in N) \ge P\big(Q_0(\theta_n) > Q_0(\theta_0) - \epsilon\big) \to 1.$$
Since the argument above holds for any arbitrarily small neighborhood of $\theta_0$, we are done.
• In general, the limit objective function $Q_0(\theta) = \operatorname{plim}_{n\to\infty} Q_n(\theta)$ may not be that straightforward to determine. But in many cases, $Q_n(\theta)$ is a sample average of some sort:
$$Q_n(\theta) = \frac{1}{n}\sum_i q(x_i|\theta)$$
(e.g. least squares, MLE). Then by a law of large numbers, we conclude that (for all $\theta$)
$$Q_0(\theta) = \operatorname{plim}\frac{1}{n}\sum_i q(x_i|\theta) = E_{x_i} q(x_i|\theta),$$
where $E_{x_i}$ denotes expectation with respect to the true (but unobserved) distribution of $x_i$.
• Most of the time, $\theta_0$ can be interpreted as a “true value”. But if the model is misspecified, then this interpretation doesn't hold (indeed, under misspecification it is not even clear what the “true value” is). So a more cautious way to interpret the consistency result is that
$$\theta_n \overset{p}{\to} \operatorname{argmax}_\theta Q_0(\theta),$$
which holds (given the conditions) whether or not the model is correctly specified.
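A consistency sketch for an M-estimator computed by brute force over a compact grid (the exponential model with rate $\theta_0 = 2$, the grid, and the sample sizes are all illustrative assumptions, not from the notes):

```python
# M-estimator = maximizer of Q_n(theta) = (1/n) sum_i log f(x_i|theta), with
# f(x|theta) = theta * exp(-theta x). The estimate approaches theta0 as n grows.
import numpy as np

rng = np.random.default_rng(5)
theta0 = 2.0
theta_grid = np.linspace(0.1, 10.0, 2000)          # compact parameter space Theta

def m_estimate(x):
    Qn = np.log(theta_grid) - theta_grid * x.mean()    # Q_n(theta) = log(theta) - theta * xbar
    return theta_grid[np.argmax(Qn)]

for n in (50, 500, 5_000, 50_000):
    x = rng.exponential(scale=1 / theta0, size=n)      # true rate theta0 (scale = 1/rate)
    print(f"n = {n:6d}   theta_n = {m_estimate(x):.4f}   (theta0 = {theta0})")
```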
** Of the four assumptions above, the most “high-level” one is the uniform convergence condition. Sufficient conditions for it are:
1. Pointwise convergence: for each $\theta\in\Theta$, $Q_n(\theta) - Q_0(\theta) = o_p(1)$.
2. $Q_n(\theta)$ is stochastically equicontinuous: for every $\epsilon > 0$, $\eta > 0$ there exist a sequence of random variables $\Delta_n(\epsilon,\eta)$ and $n^*(\epsilon,\eta)$ such that for all $n > n^*$, $P(|\Delta_n| > \epsilon) < \eta$, and for each $\theta$ there is an open set $N$ containing $\theta$ with
$$\sup_{\tilde\theta\in N}|Q_n(\tilde\theta) - Q_n(\theta)| \le \Delta_n, \quad\forall n > n^*.$$
Note that both $\Delta_n$ and $n^*$ do not depend on $\theta$: it is a uniform result.
This is an “in probability” version of the deterministic notion of uniform equicontinuity: we say a sequence of deterministic functions $R_n(\theta)$ is uniformly equicontinuous if, for every $\epsilon > 0$ there exist $\delta(\epsilon)$ and $n^*(\epsilon)$ such that for all $\theta$
$$\sup_{\tilde\theta: \|\tilde\theta - \theta\| < \delta}|R_n(\tilde\theta) - R_n(\theta)| \le \epsilon, \quad\forall n > n^*.$$
Asymptotic normality for M-estimators
Define the “score vector”
$$\nabla_{\tilde\theta} Q_n(\theta) = \left(\frac{\partial Q_n(\theta)}{\partial\theta_1}\Big|_{\theta=\tilde\theta},\ \ldots,\ \frac{\partial Q_n(\theta)}{\partial\theta_K}\Big|_{\theta=\tilde\theta}\right).$$
Similarly, define the $K\times K$ Hessian matrix
$$\big[\nabla_{\tilde\theta\tilde\theta} Q_n(\theta)\big]_{i,j} = \frac{\partial^2 Q_n(\theta)}{\partial\theta_i\,\partial\theta_j}\Big|_{\theta=\tilde\theta}, \qquad 1\le i,j\le K.$$
Note that the Hessian is symmetric.
Make the following assumptions:
1. $\theta_n = \operatorname{argmax}_\theta Q_n(\theta) \overset{p}{\to} \theta_0$
2. $\theta_0 \in \operatorname{interior}(\Theta)$
3. $Q_n(\theta)$ is twice continuously differentiable in a neighborhood $N$ of $\theta_0$.
4. $\sqrt{n}\,\nabla_{\theta_0} Q_n(\theta) \overset{d}{\to} N(0,\Sigma)$
5. Uniform convergence of the Hessian: there exists a matrix $H(\theta)$ which is continuous at $\theta_0$ and $\sup_{\theta\in N}\|\nabla_{\theta\theta} Q_n(\theta) - H(\theta)\| \overset{p}{\to} 0$.
6. $H(\theta_0)$ is nonsingular.
Theorem (Asymptotic normality for M-estimators): Under assumptions 1–6,
$$\sqrt{n}(\theta_n - \theta_0) \overset{d}{\to} N(0, H_0^{-1}\Sigma H_0^{-1}),$$
where $H_0 \equiv H(\theta_0)$.
Proof: (sketch) By assumptions 1, 2, 3, $\nabla_{\theta_n} Q_n(\theta) = 0$ (this is the FOC of the maximization problem). Then using the mean-value theorem (with $\bar\theta_n$ denoting the mean value):
$$0 = \nabla_{\theta_n} Q_n(\theta) = \nabla_{\theta_0} Q_n(\theta) + \nabla_{\bar\theta_n\bar\theta_n} Q_n(\theta)\,(\theta_n - \theta_0)$$
$$\implies \sqrt{n}(\theta_n - \theta_0) = -\Big[\underbrace{\nabla_{\bar\theta_n\bar\theta_n} Q_n(\theta)}_{\overset{p}{\to} H_0\ \text{(using A5)}}\Big]^{-1}\underbrace{\sqrt{n}\,\nabla_{\theta_0} Q_n(\theta)}_{\overset{d}{\to} N(0,\Sigma)\ \text{(using A4)}}$$
$$\implies \sqrt{n}(\theta_n - \theta_0) \overset{d}{\to} -H(\theta_0)^{-1} N(0,\Sigma) = N(0, H_0^{-1}\Sigma H_0^{-1}).$$
Note: A5 is a uniform convergence assumption on the sample Hessian. In the analogous step of the Delta method, we used the plim-operator theorem coupled with the “squeeze” result; here that doesn't suffice, because as $n\to\infty$, both $\bar\theta_n$ and the function $G_n(\cdot) = \nabla_{\cdot\cdot} Q_n(\theta)$ are changing.
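A sketch of the sandwich formula $H_0^{-1}\Sigma H_0^{-1}$ for a scalar example (the exponential-rate MLE, $\hat\theta = 1/\bar x$, with $\theta_0 = 2$ and $n = 500$; all illustrative assumptions, not from the notes):

```python
# Compare the sandwich variance to the simulated variance of sqrt(n)(theta_n - theta0).
import numpy as np

rng = np.random.default_rng(6)
theta0, n, reps = 2.0, 500, 20_000

x = rng.exponential(scale=1 / theta0, size=(reps, n))
theta_hat = 1 / x.mean(axis=1)            # MLE solves the FOC 1/theta - xbar = 0

score = 1 / theta0 - x                    # d/dtheta log f(x|theta) = 1/theta - x, at theta0
Sigma_hat = (score**2).mean()             # ~ Var(x) = 1/theta0^2
H_hat = -1 / theta0**2                    # d^2/dtheta^2 log f = -1/theta^2 (nonrandom here)
sandwich = Sigma_hat / H_hat**2           # H^{-1} Sigma H^{-1} in the scalar case

print("sandwich variance                    :", sandwich)   # ~ theta0^2 = 4
print("Var(sqrt(n)(theta_hat - theta0)) sim :", (n * (theta_hat - theta0) ** 2).mean())
```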
2.1 Maximum likelihood estimation
The consistency of the MLE can follow from an application of the theorem above on consistency of M-estimators.
Essentially, as noted above, the consistency theorem showed that, for any M-estimator sequence $\theta_n$:
$$\operatorname{plim}_{n\to\infty}\theta_n = \operatorname{argmax}_\theta Q_0(\theta).$$
For MLE, there is an argument due to Wald (1949), who shows that, in the i.i.d. case, the “limiting likelihood function” (corresponding to $Q_0(\theta)$) is indeed globally maximized at $\theta_0$, the “true value”. Thus, we can directly confirm the identification assumption of the M-estimator consistency theorem. This argument is of interest in itself.
Argument: (summary of Amemiya, pp. 141–142)
• Define $\hat\theta_n^{MLE} \equiv \operatorname{argmax}_\theta\frac{1}{n}\sum_i\log f(x_i|\theta)$. Let $\theta_0$ denote the true value.
• By the LLN: $\frac{1}{n}\sum_i\log f(x_i|\theta) \overset{p}{\to} E_{\theta_0}\log f(x_i|\theta)$, for all $\theta$ (not necessarily the true $\theta_0$).
• By Jensen's inequality: $E_{\theta_0}\log\frac{f(x|\theta)}{f(x|\theta_0)} < \log E_{\theta_0}\frac{f(x|\theta)}{f(x|\theta_0)}$.
• But $E_{\theta_0}\frac{f(x|\theta)}{f(x|\theta_0)} = \int\frac{f(x|\theta)}{f(x|\theta_0)}\,f(x|\theta_0)\,dx = 1$, since $f(x|\theta)$ is a density function, for all $\theta$. (In this step, note the importance of assumption A3 in CB, pg. 516: if $x$ has support depending on $\theta$, then $f(x|\theta)$ will not integrate to 1 for all $\theta$.)
• Hence:
$$E_{\theta_0}\log\frac{f(x|\theta)}{f(x|\theta_0)} < 0, \quad\forall\theta\ne\theta_0$$
$$\implies E_{\theta_0}\log f(x|\theta) < E_{\theta_0}\log f(x|\theta_0), \quad\forall\theta\ne\theta_0$$
$$\implies E_{\theta_0}\log f(x|\theta)\ \text{is maximized at the true}\ \theta_0.$$
This is the “identification” assumption from the M-estimator consistency theorem.
• Analogously, we also know that, for $\delta > 0$,
$$\mu_1 = E_{\theta_0}\log\frac{f(x;\theta_0-\delta)}{f(x;\theta_0)} < 0; \qquad \mu_2 = E_{\theta_0}\log\frac{f(x;\theta_0+\delta)}{f(x;\theta_0)} < 0.$$
By the SLLN, we know that
$$\frac{1}{n}\sum_i\log\frac{f(x_i;\theta_0-\delta)}{f(x_i;\theta_0)} = \frac{1}{n}\big[\log L_n(x;\theta_0-\delta) - \log L_n(x;\theta_0)\big] \overset{a.s.}{\to} \mu_1,$$
so that, with probability 1, $\log L_n(x;\theta_0-\delta) < \log L_n(x;\theta_0)$ for $n$ large enough. Similarly, for $n$ large enough, $\log L_n(x;\theta_0+\delta) < \log L_n(x;\theta_0)$ with probability 1. Hence, for large $n$,
$$\hat\theta_n \equiv \operatorname{argmax}_\theta\log L_n(x;\theta) \in (\theta_0-\delta,\ \theta_0+\delta).$$
That is, the MLE $\hat\theta_n$ is strongly consistent for $\theta_0$.
Note that this argument requires weaker assumptions than the M-estimator consistency theorem above.
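A numerical sketch of the Jensen step, i.e. that $E_{\theta_0}\log f(X|\theta)$ is maximized at $\theta = \theta_0$ (illustrative model: $N(\theta, 1)$ with $\theta_0 = 1$; not from the notes):

```python
# Approximate E_{theta0}[ log f(X|theta) ] over a grid of theta by Monte Carlo;
# the values peak at theta = theta0 = 1.
import numpy as np

rng = np.random.default_rng(7)
theta0 = 1.0
x = rng.normal(theta0, 1.0, size=1_000_000)    # draws from the true density

for th in np.linspace(-1.0, 3.0, 9):
    loglik = -0.5 * np.log(2 * np.pi) - 0.5 * (x - th) ** 2   # log f(x|theta) for N(theta, 1)
    print(f"theta = {th:5.2f}   E[log f(X|theta)] ~ {loglik.mean():.4f}")
```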
Now we consider another idea, efficiency, which can be thought of as the large-sample analogue of the “minimum variance” concept.
For the sequence of estimators $W_n$, suppose that
$$k(n)(W_n - \theta) \overset{d}{\to} N(0,\sigma^2),$$
where $k(n)$ is a polynomial in $n$. Then $\sigma^2$ is called the asymptotic variance of $W_n$.
In “usual” cases, $k(n) = \sqrt{n}$. For example, by the CLT we know that $\sqrt{n}(\bar X_n - \mu) \overset{d}{\to} N(0,\sigma^2)$. Hence, $\sigma^2$ is the asymptotic variance of the sample mean $\bar X_n$.
Definition 10.1.11: An estimator sequence $W_n$ is asymptotically efficient for $\theta$ if
• $\sqrt{n}(W_n - \theta) \overset{d}{\to} N(0, v(\theta))$, where
• the asymptotic variance $v(\theta) = \dfrac{1}{E_{\theta_0}\left(\frac{\partial}{\partial\theta}\log f(X|\theta)\right)^2}$ (the CRLB).
By the asymptotic normality result for M-estimators, we know what the asymptotic distribution of the MLE should be. However, it turns out that, given the information inequality, the MLE's asymptotic distribution can be further simplified.
Theorem 10.1.12: Asymptotic efficiency of the MLE.
Proof: (following Amemiya, pp. 143–144)
• $\hat\theta_n^{MLE}$ satisfies the FOC of the MLE problem:
$$0 = \frac{\partial\log L(\theta|X_n)}{\partial\theta}\Big|_{\theta=\hat\theta_n^{MLE}}.$$
• Using the mean value theorem (with $\theta_n^*$ denoting the mean value):
$$0 = \frac{\partial\log L(\theta|X_n)}{\partial\theta}\Big|_{\theta=\theta_0} + \frac{\partial^2\log L(\theta|X_n)}{\partial\theta^2}\Big|_{\theta=\theta_n^*}\big(\hat\theta_n^{MLE} - \theta_0\big)$$
$$\implies \sqrt{n}\big(\hat\theta_n - \theta_0\big) = \sqrt{n}\,\frac{-\frac{\partial\log L(\theta|X_n)}{\partial\theta}\big|_{\theta=\theta_0}}{\frac{\partial^2\log L(\theta|X_n)}{\partial\theta^2}\big|_{\theta=\theta_n^*}} = \sqrt{n}\,\frac{-\frac{1}{n}\sum_i\frac{\partial\log f(x_i|\theta)}{\partial\theta}\big|_{\theta=\theta_0}}{\frac{1}{n}\sum_i\frac{\partial^2\log f(x_i|\theta)}{\partial\theta^2}\big|_{\theta=\theta_n^*}}. \qquad (**)$$
• Note that, by the LLN,
$$\frac{1}{n}\sum_i\frac{\partial\log f(x_i|\theta)}{\partial\theta}\Big|_{\theta=\theta_0} \overset{p}{\to} E_{\theta_0}\left[\frac{\partial\log f(X|\theta)}{\partial\theta}\Big|_{\theta=\theta_0}\right] = \int\frac{\partial f(x|\theta)}{\partial\theta}\Big|_{\theta=\theta_0}\,dx.$$
Using the same argument as in the information inequality result above, the last term is
$$\int\frac{\partial f}{\partial\theta}\,dx = \frac{\partial}{\partial\theta}\int f\,dx = 0.$$
• Hence, the CLT can be applied to the numerator of (**):
$$\text{numerator of (**)} \overset{d}{\to} N\left(0,\ E_{\theta_0}\Big(\frac{\partial\log f(x_i|\theta)}{\partial\theta}\Big|_{\theta=\theta_0}\Big)^2\right).$$
• By the LLN, and uniform convergence of the Hessian term:
$$\text{denominator of (**)} \overset{p}{\to} E_{\theta_0}\left[\frac{\partial^2\log f(X|\theta)}{\partial\theta^2}\Big|_{\theta=\theta_0}\right].$$
• Hence, by the Slutsky theorem:
$$\sqrt{n}\big(\hat\theta_n - \theta_0\big) \overset{d}{\to} N\left(0,\ \frac{E_{\theta_0}\Big(\frac{\partial\log f(x_i|\theta)}{\partial\theta}\big|_{\theta=\theta_0}\Big)^2}{\Big(E_{\theta_0}\Big[\frac{\partial^2\log f(X|\theta)}{\partial\theta^2}\big|_{\theta=\theta_0}\Big]\Big)^2}\right).$$
• By the information inequality:
$$E_{\theta_0}\Big(\frac{\partial\log f(x_i|\theta)}{\partial\theta}\Big|_{\theta=\theta_0}\Big)^2 = -E_{\theta_0}\Big[\frac{\partial^2\log f(X|\theta)}{\partial\theta^2}\Big|_{\theta=\theta_0}\Big],$$
so that
$$\sqrt{n}\big(\hat\theta_n - \theta_0\big) \overset{d}{\to} N\left(0,\ \frac{1}{E_{\theta_0}\Big(\frac{\partial\log f(x_i|\theta)}{\partial\theta}\big|_{\theta=\theta_0}\Big)^2}\right),$$
so that the asymptotic variance is the CRLB.
Hence, the asymptotic approximation for the finite-sample distribution is
$$\hat\theta_n^{MLE} \overset{a}{\sim} N\left(\theta_0,\ \frac{1}{n}\cdot\frac{1}{E_{\theta_0}\Big(\frac{\partial\log f(x_i|\theta)}{\partial\theta}\big|_{\theta=\theta_0}\Big)^2}\right).$$
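A closing simulation sketch of this efficiency result (illustrative assumptions, not from the notes: a Poisson($\lambda_0$) model with $\lambda_0 = 3$, whose MLE is the sample mean and whose per-observation Fisher information is $1/\lambda_0$):

```python
# The simulated variance of sqrt(n)(MLE - lambda0) should be close to the CRLB,
# which equals lambda0 for the Poisson model.
import numpy as np

rng = np.random.default_rng(8)
lam0, n, reps = 3.0, 1_000, 10_000

x = rng.poisson(lam0, size=(reps, n))
mle = x.mean(axis=1)                          # MLE of lambda is the sample mean

score = x / lam0 - 1.0                        # d/dlambda log f = x/lambda - 1, at lambda0
crlb = 1 / (score**2).mean()                  # 1 / E[score^2] ~ lambda0

print("CRLB (1 / Fisher info), simulated   :", crlb)
print("Var( sqrt(n)(MLE - lambda0) ), sim  :", (n * (mle - lam0) ** 2).mean())   # ~ 3
```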