Point Estimation: properties of estimators
• finite-sample properties (CB 7.3)
• large-sample properties (CB 10.1)

1 FINITE-SAMPLE PROPERTIES

How an estimator performs for a finite number of observations $n$.

Estimator: $W$. Parameter: $\theta$.

Criteria for evaluating estimators:
• Bias: does $EW = \theta$?
• Variance of $W$ (you would like an estimator with a smaller variance).

Example: $X_1,\dots,X_n \sim$ i.i.d. $(\mu,\sigma^2)$. The unknown parameters are $\mu$ and $\sigma^2$. Consider:
• $\hat\mu_n \equiv \frac{1}{n}\sum_i X_i$, an estimator of $\mu$;
• $\hat\sigma_n^2 \equiv \frac{1}{n}\sum_i (X_i-\bar X_n)^2$, an estimator of $\sigma^2$.

Bias: $E\hat\mu_n = \frac{1}{n}\cdot n\mu = \mu$, so $\hat\mu_n$ is unbiased. Its variance is $V\hat\mu_n = \frac{1}{n^2}\cdot n\sigma^2 = \frac{\sigma^2}{n}$.

For the variance estimator,
\[
E\hat\sigma_n^2 = E\Big(\frac{1}{n}\sum_i (X_i-\bar X_n)^2\Big)
= \frac{1}{n}\sum_i \big(EX_i^2 - 2EX_i\bar X_n + E\bar X_n^2\big)
= \frac{1}{n}\cdot n\Big[(\mu^2+\sigma^2) - 2\Big(\mu^2+\frac{\sigma^2}{n}\Big) + \frac{\sigma^2}{n} + \mu^2\Big]
= \frac{n-1}{n}\sigma^2.
\]
Hence it is biased. To fix this bias, consider the estimator $s_n^2 \equiv \frac{1}{n-1}\sum_i (X_i-\bar X_n)^2$; then $Es_n^2 = \sigma^2$ (unbiased).

The mean-squared error (MSE) of $W$ is $E(W-\theta)^2$. It is a common criterion for comparing estimators. Decompose:
\[
MSE(W) = VW + (EW-\theta)^2 = \text{Variance} + (\text{Bias})^2.
\]
Hence, for an unbiased estimator, $MSE(W) = VW$.

Example: $X_1,\dots,X_n \sim U[0,\theta]$, with density $f(x) = 1/\theta$ for $x\in[0,\theta]$.

• Consider the estimator $\hat\theta_n \equiv 2\bar X_n$.
  $E\hat\theta_n = \frac{2}{n}E\sum_i X_i = 2\cdot\frac{1}{n}\cdot n\cdot\frac{\theta}{2} = \theta$, so it is unbiased, and
  \[
  MSE(\hat\theta_n) = V\hat\theta_n = \frac{4}{n^2}\sum_i VX_i = \frac{\theta^2}{3n}.
  \]

• Consider the estimator $\tilde\theta_n \equiv \max(X_1,\dots,X_n)$. In order to derive its moments, start by deriving its CDF:
  \[
  P(\tilde\theta_n \le z) = P(X_1\le z,\dots,X_n\le z) = \prod_{i=1}^n P(X_i\le z) = \Big(\frac{z}{\theta}\Big)^n \ \text{ if } z\le\theta, \text{ and } 1 \text{ otherwise.}
  \]
  Therefore $f_{\tilde\theta_n}(z) = n\big(\frac{z}{\theta}\big)^{n-1}\frac{1}{\theta}$, for $0\le z\le\theta$. Then
  \[
  E\tilde\theta_n = \int_0^\theta z\cdot n\Big(\frac{z}{\theta}\Big)^{n-1}\frac{1}{\theta}\,dz = \frac{n}{\theta^n}\int_0^\theta z^n\,dz = \frac{n}{n+1}\theta,
  \]
  so $\text{Bias}(\tilde\theta_n) = -\theta/(n+1)$. Also
  \[
  E\tilde\theta_n^2 = \frac{n}{\theta^n}\int_0^\theta z^{n+1}\,dz = \frac{n}{n+2}\theta^2,
  \]
  hence $V\tilde\theta_n = \frac{n}{n+2}\theta^2 - \big(\frac{n}{n+1}\big)^2\theta^2 = \frac{n}{(n+2)(n+1)^2}\theta^2$. Accordingly, $MSE(\tilde\theta_n) = \frac{2\theta^2}{(n+2)(n+1)}$.

Continue the previous example. Redefine $\tilde\theta_n \equiv \frac{n+1}{n}\max(X_1,\dots,X_n)$. Now both estimators $\hat\theta_n$ and $\tilde\theta_n$ are unbiased. Which is better?
\[
V\hat\theta_n = \frac{\theta^2}{3n} = O(1/n), \qquad
V\tilde\theta_n = \Big(\frac{n+1}{n}\Big)^2 V(\max(X_1,\dots,X_n)) = \frac{\theta^2}{n(n+2)} = O(1/n^2).
\]
Hence, for $n$ large enough, $\tilde\theta_n$ has a smaller variance, and in this sense it is "better". (A small simulation sketch of this comparison follows the Cramér–Rao theorem below.)

Best unbiased estimator: if you choose the best (in terms of MSE) estimator, and restrict yourself to unbiased estimators, then the best estimator is the one with the lowest variance. A best unbiased estimator is also called the "uniform minimum variance unbiased estimator" (UMVUE). Formally, a UMVUE $W$ of $\theta$ satisfies:
(i) $E_\theta W = \theta$, for all $\theta$ (unbiasedness);
(ii) $V_\theta W \le V_\theta\tilde W$, for all $\theta$ and all other unbiased estimators $\tilde W$.

The "uniform" condition is crucial, because it is always possible to find estimators which have zero variance for a specific value of $\theta$.

It is difficult in general to verify that an estimator $W$ is UMVUE, since you have to verify condition (ii) of the definition, that $VW$ is smaller than the variance of every other unbiased estimator. Luckily, we have an important result for the lowest attainable variance of an estimator.

• Theorem 7.3.9 (Cramér–Rao Inequality): Let $X_1,\dots,X_n$ be a sample with joint pdf $f(X|\theta)$, and let $W(X)$ be any estimator satisfying
  (i) $\frac{d}{d\theta}E_\theta W(X) = \int \frac{\partial}{\partial\theta}\big[W(X)f(X|\theta)\big]\,dX$;
  (ii) $V_\theta W(X) < \infty$.
  Then
  \[
  V_\theta W(X) \ge \frac{\big(\frac{d}{d\theta}E_\theta W(X)\big)^2}{E_\theta\big(\frac{\partial}{\partial\theta}\log f(X|\theta)\big)^2}.
  \]
  The RHS above is called the Cramér–Rao Lower Bound (CRLB). Proof: CB, pg. 336.
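As referenced above, here is a minimal Monte Carlo sketch (not from CB; the values $\theta=2$, $n=50$, and the replication count are arbitrary illustration choices) comparing the two unbiased estimators of $\theta$ in the $U[0,\theta]$ example, $2\bar X_n$ and $\frac{n+1}{n}\max_i X_i$.

```python
import numpy as np

# Monte Carlo comparison of two unbiased estimators of theta for U[0, theta]:
#   theta_hat   = 2 * sample mean            (variance of order 1/n)
#   theta_tilde = (n+1)/n * sample maximum   (variance of order 1/n^2)
rng = np.random.default_rng(0)
theta, n, reps = 2.0, 50, 100_000           # arbitrary illustration values

X = rng.uniform(0.0, theta, size=(reps, n))
theta_hat = 2.0 * X.mean(axis=1)
theta_tilde = (n + 1) / n * X.max(axis=1)

for name, est in [("2*Xbar", theta_hat), ("(n+1)/n*max", theta_tilde)]:
    bias = est.mean() - theta
    var = est.var()
    print(f"{name:12s} bias={bias:+.4f}  var={var:.6f}  MSE={var + bias**2:.6f}")

# Theoretical variances derived above, for comparison:
print("theory: var(2*Xbar) =", theta**2 / (3 * n),
      "  var((n+1)/n*max) =", theta**2 / (n * (n + 2)))
```

The simulated variances should line up with the $\theta^2/(3n)$ and $\theta^2/(n(n+2))$ formulas derived above.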
Note: the LHS of condition (i) above is $\frac{d}{d\theta}\int W(X)f(X|\theta)\,dX$, so by Leibniz' rule this condition basically rules out cases where the support of $X$ depends on $\theta$.

• The equality
\[
E_\theta\frac{\partial}{\partial\theta}\log f(X|\theta)
= \int \frac{1}{f(X|\theta)}\frac{\partial f(X|\theta)}{\partial\theta}\,f(X|\theta)\,dx
= \int \frac{\partial}{\partial\theta}f(X|\theta)\,dx
= \frac{\partial}{\partial\theta}\cdot 1 = 0 \tag{1}
\]
is noteworthy, because
\[
\frac{\partial}{\partial\theta}\log f(X|\theta) = 0
\]
is the FOC of the maximum likelihood estimation problem. (Alternatively, as in CB, apply condition (i) of the CR result, using $W(X) = 1$.)

In the i.i.d. case, this becomes the sample average
\[
\frac{1}{n}\sum_i \frac{\partial}{\partial\theta}\log f(x_i|\theta) = 0.
\]
And by the LLN,
\[
\frac{1}{n}\sum_i \frac{\partial}{\partial\theta}\log f(x_i|\theta) \xrightarrow{p} E_{\theta_0}\frac{\partial}{\partial\theta}\log f(x_i|\theta),
\]
where $\theta_0$ is the true value of $\theta$. This shows that maximum likelihood estimation of $\theta$ is equivalent to estimation based on the moment condition
\[
E_{\theta_0}\frac{\partial}{\partial\theta}\log f(x_i|\theta) = 0,
\]
which holds only at the true value $\theta = \theta_0$. (Thus MLE is "consistent" for the true value $\theta_0$, as we'll see later.)

(However, note that Eq. (1) holds at all values of $\theta$, not just $\theta_0$.)

What if the model is "misspecified", in the sense that the true density of $X$ is $g(x)$, and for all $\theta\in\Theta$, $f(x|\theta) \ne g(x)$ (that is, there is no value of the parameter $\theta$ such that the postulated model $f$ coincides with the true model $g$)? Does Eq. (1) still hold? What is MLE looking for?

• In the i.i.d. case, the CR lower bound can be simplified.
Corollary 7.3.10: if $X_1,\dots,X_n \sim$ i.i.d. $f(X|\theta)$, then
\[
V_\theta W(X) \ge \frac{\big(\frac{d}{d\theta}E_\theta W(X)\big)^2}{n\cdot E_\theta\big(\frac{\partial}{\partial\theta}\log f(X|\theta)\big)^2}.
\]

• Up to this point, the Cramér–Rao results are not that operational for finding a "best" estimator, because the estimator $W(X)$ appears on both sides of the inequality. However, for an unbiased estimator, how can you simplify the expression further?

• Example: $X_1,\dots,X_n \sim$ i.i.d. $N(\mu,\sigma^2)$. What is the CRLB for an unbiased estimator of $\mu$? Unbiasedness implies that the numerator equals 1. Also,
\[
\log f(x|\theta) = -\log\sqrt{2\pi} - \log\sigma - \frac{1}{2}\Big(\frac{x-\mu}{\sigma}\Big)^2,
\]
\[
\frac{\partial}{\partial\mu}\log f(x|\theta) = -\frac{x-\mu}{\sigma}\cdot\frac{-1}{\sigma} = \frac{x-\mu}{\sigma^2},
\]
\[
E\Big(\frac{\partial}{\partial\mu}\log f(X|\theta)\Big)^2 = \frac{E(X-\mu)^2}{\sigma^4} = \frac{1}{\sigma^4}VX = \frac{1}{\sigma^2}.
\]
Hence the CRLB $= \frac{1}{n\cdot\frac{1}{\sigma^2}} = \frac{\sigma^2}{n}$. This is the variance of the sample mean, so the sample mean is a UMVUE for $\mu$.

• Sometimes we can simplify the denominator of the CRLB further.
Lemma 7.3.11 (Information inequality): if $f(X|\theta)$ satisfies
\[
(*)\qquad \frac{d}{d\theta}E_\theta\Big(\frac{\partial}{\partial\theta}\log f(X|\theta)\Big) = \int \frac{\partial}{\partial\theta}\Big[\Big(\frac{\partial}{\partial\theta}\log f(X|\theta)\Big)f(X|\theta)\Big]\,dx,
\]
then
\[
E_\theta\Big(\frac{\partial}{\partial\theta}\log f(X|\theta)\Big)^2 = -E_\theta\Big(\frac{\partial^2}{\partial\theta^2}\log f(X|\theta)\Big).
\]
Rough proof:
LHS of (*): using Eq. (1) above, we get that the LHS of (*) is 0.
RHS of (*):
\[
\int \frac{\partial}{\partial\theta}\Big[\frac{\partial\log f}{\partial\theta}f\Big]dx
= \int \frac{\partial^2\log f}{\partial\theta^2}f\,dx + \int \frac{1}{f}\Big(\frac{\partial f}{\partial\theta}\Big)^2 dx
= E\frac{\partial^2\log f}{\partial\theta^2} + E\Big(\frac{\partial\log f}{\partial\theta}\Big)^2.
\]
Putting the LHS and RHS together yields the desired result.

Note: the LHS of condition (*) above is just $\frac{d}{d\theta}\int\big(\frac{\partial}{\partial\theta}\log f(X|\theta)\big)f(X|\theta)\,dX$, so that, by Leibniz' rule, condition (*) just states that the bounds of integration (i.e., the support of $X$) do not depend on $\theta$. The normal distribution satisfies this (its support is always $(-\infty,\infty)$), but $U[0,\theta]$ does not.

Also, the information inequality depends crucially on the equality $E_\theta\frac{\partial}{\partial\theta}\log f(X|\theta) = 0$, which depends on correct specification of the model. Thus the information inequality can be used as the basis of a "specification test". (How?)

• Example: for the previous example, consider the CRLB for an unbiased estimator of $\sigma^2$. We can use the information inequality, because condition (*) is satisfied for the normal distribution. Hence:
\[
\frac{\partial}{\partial\sigma^2}\log f(x|\theta) = -\frac{1}{2\sigma^2} + \frac{(x-\mu)^2}{2\sigma^4},
\]
\[
\frac{\partial^2}{\partial(\sigma^2)^2}\log f(x|\theta) = \frac{1}{2\sigma^4} - \frac{(x-\mu)^2}{\sigma^6},
\]
\[
-E\,\frac{\partial^2}{\partial(\sigma^2)^2}\log f(x|\theta) = -\frac{1}{2\sigma^4} + \frac{1}{\sigma^4} = \frac{1}{2\sigma^4}.
\]
Hence the CRLB is $\frac{2\sigma^4}{n}$.
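A quick numerical check of the information equality used in this example can be done by simulation. The sketch below (my own illustration; $\mu$, $\sigma^2$, and the replication count are arbitrary) compares the two sides of $E[(\partial\log f/\partial\sigma^2)^2] = -E[\partial^2\log f/\partial(\sigma^2)^2]$ with the closed form $1/(2\sigma^4)$.

```python
import numpy as np

# Monte Carlo check of the information equality for N(mu, sigma^2), viewing
# sigma^2 as the parameter: E[(d log f / d sigma^2)^2] should equal
# -E[d^2 log f / d(sigma^2)^2] = 1/(2*sigma^4).
rng = np.random.default_rng(0)
mu, sigma2, reps = 1.0, 2.0, 1_000_000      # arbitrary illustration values

x = rng.normal(mu, np.sqrt(sigma2), size=reps)
score = -1.0 / (2 * sigma2) + (x - mu) ** 2 / (2 * sigma2 ** 2)   # d log f / d sigma^2
hess  =  1.0 / (2 * sigma2 ** 2) - (x - mu) ** 2 / sigma2 ** 3    # second derivative

print("E[score^2]   ~", (score ** 2).mean())
print("-E[hessian]  ~", -hess.mean())
print("closed form 1/(2*sigma^4) =", 1.0 / (2 * sigma2 ** 2))
```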
• Example: $X_1,\dots,X_n \sim U[0,\theta]$. Check the conditions for the CRLB for an unbiased estimator $W(X)$ of $\theta$:
\[
\frac{d}{d\theta}EW(X) = 1 \quad\text{(because it is unbiased)},
\]
but
\[
\int W(X)\frac{\partial}{\partial\theta}f(X|\theta)\,dX = \int W(X)\cdot\frac{-1}{\theta^2}\,dX \ne 1 = \frac{d}{d\theta}EW(X).
\]
Hence condition (i) of the theorem is not satisfied.

• But when can the CRLB (if it exists) be attained?
Corollary 7.3.15: Let $X_1,\dots,X_n \sim$ i.i.d. $f(x|\theta)$, satisfying the conditions of the CR theorem.
  – Let $L(\theta|X) = \prod_{i=1}^n f(x_i|\theta)$ denote the likelihood function.
  – Let the estimator $W(X)$ be unbiased for $\theta$.
  – Then $W(X)$ attains the CRLB iff you can write
    \[
    \frac{\partial}{\partial\theta}\log L(\theta|X) = a(\theta)\big(W(X)-\theta\big)
    \]
    for some function $a(\theta)$.

• Example: $X_1,\dots,X_n \sim$ i.i.d. $N(\mu,\sigma^2)$. Consider estimating $\mu$:
\[
\frac{\partial}{\partial\mu}\log L(\theta|X)
= \frac{\partial}{\partial\mu}\sum_{i=1}^n \log f(x_i|\theta)
= \frac{\partial}{\partial\mu}\sum_{i=1}^n \Big[-\log\sqrt{2\pi} - \log\sigma - \frac{1}{2}\frac{(x_i-\mu)^2}{\sigma^2}\Big]
= \sum_{i=1}^n \frac{x_i-\mu}{\sigma^2}
= \frac{n}{\sigma^2}(\bar X_n - \mu).
\]
Hence the CRLB can be attained (in fact, we showed earlier that the CRLB is attained by $\bar X_n$).

Loss function optimality

Let $X \sim f(X|\theta)$. Consider a loss function $L(\theta, W(X))$, taking values in $[0,+\infty)$, which penalizes you when your estimator $W(X)$ is "far" from the true parameter $\theta$. Note that $L(\theta, W(X))$ is a random variable, since $X$ (and hence $W(X)$) is random.

Consider estimators which minimize expected loss, that is,
\[
\min_{W(\cdot)} E_\theta L(\theta, W(X)) \equiv \min_{W(\cdot)} R(\theta, W(\cdot)),
\]
where $R(\theta, W(\cdot))$ is the risk function. (Note: the risk function is not a random variable, because $X$ has been integrated out.)

Loss function optimality is a more general criterion than minimum MSE. In fact, because $MSE(W(X)) = E_\theta\big(W(X)-\theta\big)^2$, the MSE is just the risk function associated with the quadratic loss function $L(\theta, W(X)) = (W(X)-\theta)^2$.

Other examples of loss functions:
• Absolute error loss: $|W(X)-\theta|$
• Relative quadratic error loss: $\frac{(W(X)-\theta)^2}{|\theta|+1}$

The exercise of minimizing risk takes a given value of $\theta$ as given, so that the minimized risk of an estimator depends on whichever value of $\theta$ you are considering. You might instead be interested in an estimator which does well regardless of the value of $\theta$. (This is analogous to the focus on uniform minimum variance.) For this different problem, you want a notion of risk which does not depend on $\theta$. Two criteria which have been considered are:

• "Average" risk:
\[
\min_{W(\cdot)} \int R(\theta, W(\cdot))\,h(\theta)\,d\theta,
\]
where $h(\theta)$ is some weighting function across $\theta$. (In a Bayesian interpretation, $h(\theta)$ is a prior density over $\theta$.)

• Minmax criterion:
\[
\min_{W(\cdot)} \max_\theta R(\theta, W(\cdot)).
\]
Here you choose the estimator $W(\cdot)$ to minimize the maximum risk $\max_\theta R(\theta, W(\cdot))$, where $\theta$ is set to the "worst" value. So the minmax optimizer is the best that can be achieved in a "worst-case" scenario.

Example: $X_1,\dots,X_n \sim$ i.i.d. $N(\mu,\sigma^2)$. The sample mean $\bar X_n$ is:
• unbiased
• minimum MSE
• UMVUE
• attains the CRLB
• minimizes expected quadratic loss

2 LARGE SAMPLE PROPERTIES OF ESTIMATORS

It can be difficult to compute MSE, risk functions, etc., for some estimators, especially when the estimator does not resemble a sample average. Large-sample properties exploit the LLN and CLT.

Consider data $\{X_1, X_2, \dots\}$ from which we construct a sequence of estimators $W_n \equiv W_n(X_1,\dots,X_n)$, $n = 1, 2, \dots$. $W_n$ is a random sequence.

Define: we say that $W_n$ is consistent for a parameter $\theta$ iff the random sequence $W_n$ converges (in some stochastic sense) to $\theta$.
Strong consistency obtains when $W_n \xrightarrow{as} \theta$.
Weak consistency obtains when $W_n \xrightarrow{p} \theta$.

For estimators like sample means, consistency (either weak or strong) follows easily using a LLN. Consistency can be thought of as the large-sample version of unbiasedness.
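A minimal sketch of weak consistency in action (my own illustration, assuming a normal model with arbitrary $\mu$, $\sigma$, and sample sizes): the sample mean drifts toward the true $\mu$ along a single realized data sequence as $n$ grows.

```python
import numpy as np

# Illustration of (weak) consistency: the sample mean W_n = Xbar_n drifts
# toward the true mu as n grows, along one realized data sequence.
rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0                        # arbitrary illustration values
x = rng.normal(mu, sigma, size=100_000)

for n in [10, 100, 1_000, 10_000, 100_000]:
    w_n = x[:n].mean()
    print(f"n={n:>6d}   W_n={w_n:+.4f}   |W_n - mu|={abs(w_n - mu):.4f}")
```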
Define: an M-estimator is an estimator of $\theta$ which is a maximizer of an objective function $Q_n(\theta)$.

Examples:
• MLE: $Q_n(\theta) = \frac{1}{n}\sum_{i=1}^n \log f(x_i|\theta)$.
• Least squares: $Q_n(\theta) = -\sum_{i=1}^n [y_i - g(x_i;\theta)]^2$. OLS is the special case when $g(x_i;\theta) = \alpha + x_i\beta$.
• GMM: $Q_n(\theta) = -G_n(\theta)' W_n G_n(\theta)$, where
  \[
  G_n(\theta) = \Big(\frac{1}{n}\sum_{i=1}^n m_1(x_i;\theta),\ \frac{1}{n}\sum_{i=1}^n m_2(x_i;\theta),\ \dots,\ \frac{1}{n}\sum_{i=1}^n m_M(x_i;\theta)\Big)'
  \]
  is an $M\times 1$ vector of sample moment conditions, and $W_n$ is an $M\times M$ weighting matrix.

Notation: denote the limit objective function $Q_0(\theta) = \text{plim}_{n\to\infty} Q_n(\theta)$ (at each $\theta$). Define $\theta_0 \equiv \text{argmax}_\theta Q_0(\theta)$.

Consistency of M-estimators

Make the following assumptions:
1. $Q_0(\theta)$ is uniquely maximized at some value $\theta_0$ ("identification").
2. The parameter space $\Theta$ is a compact subset of $\mathbb{R}^K$.
3. $Q_0(\theta)$ is continuous in $\theta$.
4. $Q_n(\theta)$ converges uniformly in probability to $Q_0(\theta)$; that is,
   \[
   \sup_{\theta\in\Theta}|Q_n(\theta) - Q_0(\theta)| \xrightarrow{p} 0.
   \]

Theorem (Consistency of M-estimators): Under assumptions 1, 2, 3, 4, $\theta_n \xrightarrow{p} \theta_0$.

Proof: We need to show that, for any arbitrarily small neighborhood $N$ containing $\theta_0$, $P(\theta_n \in N) \to 1$.

For $n$ large enough, the uniform convergence condition implies that, for all $\epsilon, \delta > 0$,
\[
P\Big(\sup_{\theta\in\Theta}|Q_n(\theta) - Q_0(\theta)| < \epsilon/2\Big) > 1 - \delta.
\]
The event "$\sup_{\theta\in\Theta}|Q_n(\theta) - Q_0(\theta)| < \epsilon/2$" implies
\[
Q_n(\theta_n) - Q_0(\theta_n) < \epsilon/2 \iff Q_0(\theta_n) > Q_n(\theta_n) - \epsilon/2. \tag{2}
\]
Similarly,
\[
Q_n(\theta_0) - Q_0(\theta_0) > -\epsilon/2 \implies Q_n(\theta_0) > Q_0(\theta_0) - \epsilon/2. \tag{3}
\]
Since $\theta_n = \text{argmax}_\theta Q_n(\theta)$, Eq. (2) implies
\[
Q_0(\theta_n) > Q_n(\theta_0) - \epsilon/2. \tag{4}
\]
Hence, combining Eqs. (3) and (4), we have
\[
Q_0(\theta_n) > Q_0(\theta_0) - \epsilon. \tag{5}
\]
So we have shown that
\[
\sup_{\theta\in\Theta}|Q_n(\theta) - Q_0(\theta)| < \epsilon/2 \implies Q_0(\theta_n) > Q_0(\theta_0) - \epsilon,
\]
so that
\[
P\big(Q_0(\theta_n) > Q_0(\theta_0) - \epsilon\big) \ge P\Big(\sup_{\theta\in\Theta}|Q_n(\theta) - Q_0(\theta)| < \epsilon/2\Big) \to 1.
\]
Now let $N$ be any open neighborhood in $\mathbb{R}^K$ which contains $\theta_0$, and let $\bar N$ be the complement of $N$ in $\mathbb{R}^K$. Then $\Theta\cap\bar N$ is compact, so that $\max_{\theta\in\Theta\cap\bar N} Q_0(\theta)$ exists. Set $\epsilon = Q_0(\theta_0) - \max_{\theta\in\Theta\cap\bar N} Q_0(\theta)$. Then
\[
Q_0(\theta_n) > Q_0(\theta_0) - \epsilon \implies Q_0(\theta_n) > \max_{\theta\in\Theta\cap\bar N} Q_0(\theta) \implies \theta_n \in N,
\]
so that
\[
P(\theta_n \in N) \ge P\big(Q_0(\theta_n) > Q_0(\theta_0) - \epsilon\big) \to 1.
\]
Since the argument above holds for any arbitrarily small neighborhood of $\theta_0$, we are done.

• In general, the limit objective function $Q_0(\theta) = \text{plim}_{n\to\infty} Q_n(\theta)$ may not be that straightforward to determine. But in many cases $Q_n(\theta)$ is a sample average of some sort:
\[
Q_n(\theta) = \frac{1}{n}\sum_i q(x_i|\theta)
\]
(e.g., least squares, MLE). Then by a law of large numbers we conclude that, for all $\theta$,
\[
Q_0(\theta) = \text{plim}\ \frac{1}{n}\sum_i q(x_i|\theta) = E_{x_i} q(x_i|\theta),
\]
where $E_{x_i}$ denotes expectation with respect to the true (but unobserved) distribution of $x_i$.

• Most of the time, $\theta_0$ can be interpreted as a "true value". But if the model is misspecified, then this interpretation doesn't hold (indeed, under misspecification it is not even clear what the "true value" is). So a more cautious way to interpret the consistency result is that
\[
\theta_n \xrightarrow{p} \text{argmax}_\theta Q_0(\theta),
\]
which holds (given the conditions) whether or not the model is correctly specified.

** Of the four assumptions above, the most "high-level" one is the uniform convergence condition. Sufficient conditions for it are:

1. Pointwise convergence: for each $\theta\in\Theta$, $Q_n(\theta) - Q_0(\theta) = o_p(1)$.

2. $Q_n(\theta)$ is stochastically equicontinuous: for every $\epsilon > 0$, $\eta > 0$ there exist a sequence of random variables $\Delta_n(\epsilon,\eta)$ and an $n^*(\epsilon,\eta)$ such that for all $n > n^*$, $P(|\Delta_n| > \epsilon) < \eta$, and for each $\theta$ there is an open set $N$ containing $\theta$ with
\[
\sup_{\tilde\theta\in N}|Q_n(\tilde\theta) - Q_n(\theta)| \le \Delta_n, \quad \forall n > n^*.
\]
Note that both $\Delta_n$ and $n^*$ do not depend on $\theta$: it is a uniform result. This is an "in probability" version of the deterministic notion of uniform equicontinuity: we say a sequence of deterministic functions $R_n(\theta)$ is uniformly equicontinuous if, for every $\epsilon > 0$ there exist $\delta(\epsilon)$ and $n^*(\epsilon)$ such that for all $\theta$,
\[
\sup_{\tilde\theta:\|\tilde\theta-\theta\|<\delta}|R_n(\tilde\theta) - R_n(\theta)| \le \epsilon, \quad \forall n > n^*.
\]
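Before turning to asymptotic normality, here is a minimal numerical sketch of an M-estimator (my own illustration, not from the notes): the MLE of an exponential rate $\theta$, computed by maximizing $Q_n(\theta) = \frac{1}{n}\sum_i \log f(x_i|\theta)$ over a compact grid, as in assumption 2. The true rate, sample size, and grid bounds are arbitrary choices.

```python
import numpy as np

# Minimal M-estimation sketch: MLE of an exponential rate theta, treated as the
# maximizer of Q_n(theta) = (1/n) * sum_i log f(x_i | theta), computed here by a
# crude grid search over a compact parameter space Theta.
rng = np.random.default_rng(0)
theta0, n = 1.5, 5_000                      # arbitrary illustration values
x = rng.exponential(1.0 / theta0, size=n)   # Exponential with rate theta0

grid = np.linspace(0.01, 10.0, 10_000)      # compact Theta = [0.01, 10]
Qn = np.log(grid) - grid * x.mean()         # (1/n) sum_i [log theta - theta*x_i]
theta_n = grid[np.argmax(Qn)]

print("grid-search M-estimator:", theta_n)
print("closed-form MLE 1/xbar :", 1.0 / x.mean())
print("true theta0            :", theta0)
```

For this model $Q_n(\theta) = \log\theta - \theta\bar x_n$, so the grid maximizer should agree (up to grid resolution) with the closed-form MLE $1/\bar x_n$, and both should be close to $\theta_0$ for large $n$.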
Asymptotic normality for M-estimators

Define the "score vector"
\[
\nabla_\theta Q_n(\theta)\big|_{\theta=\tilde\theta} = \Big(\frac{\partial Q_n(\theta)}{\partial\theta_1}\Big|_{\theta=\tilde\theta},\ \dots,\ \frac{\partial Q_n(\theta)}{\partial\theta_K}\Big|_{\theta=\tilde\theta}\Big)'.
\]
Similarly, define the $K\times K$ Hessian matrix
\[
\big[\nabla_{\theta\theta}Q_n(\theta)\big|_{\theta=\tilde\theta}\big]_{i,j} = \frac{\partial^2 Q_n(\theta)}{\partial\theta_i\,\partial\theta_j}\Big|_{\theta=\tilde\theta}, \qquad 1\le i,j\le K.
\]
Note that the Hessian is symmetric.

Make the following assumptions:
1. $\theta_n = \text{argmax}_\theta Q_n(\theta) \xrightarrow{p} \theta_0$.
2. $\theta_0 \in \text{interior}(\Theta)$.
3. $Q_n(\theta)$ is twice continuously differentiable in a neighborhood $\mathcal{N}$ of $\theta_0$.
4. $\sqrt{n}\,\nabla_\theta Q_n(\theta_0) \xrightarrow{d} N(0,\Sigma)$.
5. Uniform convergence of the Hessian: there exists a matrix $H(\theta)$ which is continuous at $\theta_0$ such that $\sup_{\theta\in\mathcal{N}}\|\nabla_{\theta\theta}Q_n(\theta) - H(\theta)\| \xrightarrow{p} 0$.
6. $H(\theta_0)$ is nonsingular.

Theorem (Asymptotic normality for M-estimators): Under assumptions 1–6,
\[
\sqrt{n}(\theta_n - \theta_0) \xrightarrow{d} N\big(0,\ H_0^{-1}\Sigma H_0^{-1}\big),
\]
where $H_0 \equiv H(\theta_0)$.

Proof (sketch): By assumptions 1, 2, 3, $\nabla_\theta Q_n(\theta_n) = 0$ (this is the FOC of the maximization problem). Then using the mean value theorem (with $\bar\theta_n$ denoting the mean value):
\[
0 = \nabla_\theta Q_n(\theta_n) = \nabla_\theta Q_n(\theta_0) + \nabla_{\theta\theta}Q_n(\bar\theta_n)(\theta_n - \theta_0)
\]
\[
\implies \sqrt{n}(\theta_n - \theta_0) = -\big[\underbrace{\nabla_{\theta\theta}Q_n(\bar\theta_n)}_{\xrightarrow{p}\ H_0 \text{ (using A5)}}\big]^{-1}\ \underbrace{\sqrt{n}\,\nabla_\theta Q_n(\theta_0)}_{\xrightarrow{d}\ N(0,\Sigma) \text{ (using A4)}}
\]
\[
\implies \sqrt{n}(\theta_n - \theta_0) \xrightarrow{d} -H(\theta_0)^{-1}N(0,\Sigma) = N\big(0,\ H_0^{-1}\Sigma H_0^{-1}\big).
\]
Note: A5 is a uniform convergence assumption on the sample Hessian. In the analogous step of the Delta method we used the plim-operator theorem coupled with the "squeeze" result. Here that does not suffice, because as $n\to\infty$ both $\bar\theta_n$ and the function $G_n(\cdot) = \nabla_{\theta\theta}Q_n(\cdot)$ are changing.

2.1 Maximum likelihood estimation

The consistency of the MLE can follow by application of the theorem above for consistency of M-estimators. Essentially, as we noted above, what the consistency theorem showed was that, for any M-estimator sequence $\theta_n$: $\text{plim}_{n\to\infty}\theta_n = \text{argmax}_\theta Q_0(\theta)$.

For MLE, there is an argument due to Wald (1949), who shows that, in the i.i.d. case, the "limiting likelihood function" (corresponding to $Q_0(\theta)$) is indeed globally maximized at $\theta_0$, the "true value". Thus we can directly confirm the identification assumption of the M-estimator consistency theorem. This argument is of interest by itself.

Argument (summary of Amemiya, pp. 141–142):
• Define $\hat\theta_n^{MLE} \equiv \text{argmax}_\theta \frac{1}{n}\sum_i \log f(x_i|\theta)$. Let $\theta_0$ denote the true value.
• By the LLN: $\frac{1}{n}\sum_i \log f(x_i|\theta) \xrightarrow{p} E_{\theta_0}\log f(x_i|\theta)$, for all $\theta$ (not necessarily the true $\theta_0$).
• By Jensen's inequality: $E_{\theta_0}\log\frac{f(x|\theta)}{f(x|\theta_0)} < \log E_{\theta_0}\frac{f(x|\theta)}{f(x|\theta_0)}$.
• But $E_{\theta_0}\frac{f(x|\theta)}{f(x|\theta_0)} = \int \frac{f(x|\theta)}{f(x|\theta_0)}f(x|\theta_0)\,dx = 1$, since $f(x|\theta)$ is a density function, for all $\theta$. (In this step, note the importance of assumption A3 in CB, pg. 516: if $x$ has support depending on $\theta$, then $f(x|\theta)$ will not integrate to 1 for all $\theta$.)
• Hence:
\[
E_{\theta_0}\log\frac{f(x|\theta)}{f(x|\theta_0)} < 0,\ \forall\theta\ne\theta_0
\implies E_{\theta_0}\log f(x|\theta) < E_{\theta_0}\log f(x|\theta_0),\ \forall\theta\ne\theta_0
\implies E_{\theta_0}\log f(x|\theta) \text{ is maximized at the true }\theta_0.
\]
This is the "identification" assumption from the M-estimator consistency theorem.

• Analogously, we also know that, for $\delta > 0$,
\[
\mu_1 = E_{\theta_0}\log\frac{f(x;\theta_0-\delta)}{f(x;\theta_0)} < 0, \qquad
\mu_2 = E_{\theta_0}\log\frac{f(x;\theta_0+\delta)}{f(x;\theta_0)} < 0.
\]
By the SLLN, we know that
\[
\frac{1}{n}\sum_i \log\frac{f(x_i;\theta_0-\delta)}{f(x_i;\theta_0)} = \frac{1}{n}\big[\log L_n(x;\theta_0-\delta) - \log L_n(x;\theta_0)\big] \xrightarrow{as} \mu_1,
\]
so that, with probability 1, $\log L_n(x;\theta_0-\delta) < \log L_n(x;\theta_0)$ for $n$ large enough. Similarly, for $n$ large enough, $\log L_n(x;\theta_0+\delta) < \log L_n(x;\theta_0)$ with probability 1. Hence, for large $n$,
\[
\hat\theta_n \equiv \text{argmax}_\theta \log L_n(x;\theta) \in (\theta_0-\delta,\ \theta_0+\delta).
\]
That is, the MLE $\hat\theta_n$ is strongly consistent for $\theta_0$. Note that this argument requires weaker assumptions than the M-estimator consistency theorem above.
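A minimal Monte Carlo sketch of the asymptotic normality result applied to the MLE (my own illustration; the exponential model, $\theta_0=1.5$, $n$, and the replication count are arbitrary assumptions): for the exponential-rate MLE, $H_0^{-1}\Sigma H_0^{-1}$ works out to $\theta_0^2$, which is also the CRLB that appears in the efficiency discussion below.

```python
import numpy as np

# Monte Carlo sketch: for x_i ~ Exponential(rate theta0), the MLE is 1/xbar and
# sqrt(n)*(theta_hat - theta0) should be approximately N(0, theta0^2), since the
# Fisher information is I(theta0) = 1/theta0^2.
rng = np.random.default_rng(0)
theta0, n, reps = 1.5, 500, 20_000          # arbitrary illustration values

x = rng.exponential(1.0 / theta0, size=(reps, n))
theta_hat = 1.0 / x.mean(axis=1)
z = np.sqrt(n) * (theta_hat - theta0)

print("simulated var of sqrt(n)*(theta_hat - theta0):", z.var())
print("asymptotic variance theta0^2:                 ", theta0 ** 2)
```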
Now we consider another idea, efficiency, which can be thought of as the large-sample analogue of the "minimum variance" concept.

For the sequence of estimators $W_n$, suppose that
\[
k(n)\,(W_n - \theta) \xrightarrow{d} N(0,\sigma^2),
\]
where $k(n)$ is some normalizing sequence (typically a power of $n$). Then $\sigma^2$ is called the asymptotic variance of $W_n$.

In "usual" cases, $k(n) = \sqrt{n}$. For example, by the CLT we know that $\sqrt{n}(\bar X_n - \mu) \xrightarrow{d} N(0,\sigma^2)$. Hence $\sigma^2$ is the asymptotic variance of the sample mean $\bar X_n$.

Definition 10.1.11: An estimator sequence $W_n$ is asymptotically efficient for $\theta$ if
• $\sqrt{n}(W_n - \theta) \xrightarrow{d} N(0, v(\theta))$, where
• the asymptotic variance is $v(\theta) = \dfrac{1}{E_{\theta_0}\big(\frac{\partial}{\partial\theta}\log f(X|\theta)\big)^2}$ (the CRLB).

By the asymptotic normality result for M-estimators, we know what the asymptotic distribution of the MLE should be. However, it turns out that, given the information inequality, the MLE's asymptotic distribution can be further simplified.

Theorem 10.1.12: Asymptotic efficiency of the MLE.

Proof (following Amemiya, pp. 143–144):
• $\hat\theta_n^{MLE}$ satisfies the FOC of the MLE problem:
\[
0 = \frac{\partial\log L(\theta|X_n)}{\partial\theta}\Big|_{\theta=\hat\theta_n^{MLE}}.
\]
• Using the mean value theorem (with $\theta_n^*$ denoting the mean value):
\[
0 = \frac{\partial\log L(\theta|X_n)}{\partial\theta}\Big|_{\theta=\theta_0} + \frac{\partial^2\log L(\theta|X_n)}{\partial\theta^2}\Big|_{\theta=\theta_n^*}\big(\hat\theta_n^{MLE}-\theta_0\big)
\]
\[
\implies \sqrt{n}\big(\hat\theta_n^{MLE}-\theta_0\big)
= \frac{-\sqrt{n}\cdot\frac{1}{n}\sum_i \frac{\partial\log f(x_i|\theta)}{\partial\theta}\big|_{\theta=\theta_0}}
{\frac{1}{n}\sum_i \frac{\partial^2\log f(x_i|\theta)}{\partial\theta^2}\big|_{\theta=\theta_n^*}}. \tag{**}
\]
• Note that, by the LLN,
\[
\frac{1}{n}\sum_i \frac{\partial\log f(x_i|\theta)}{\partial\theta}\Big|_{\theta=\theta_0}
\xrightarrow{p} E_{\theta_0}\Big[\frac{\partial\log f(X|\theta)}{\partial\theta}\Big|_{\theta=\theta_0}\Big]
= \int \frac{\partial f(x|\theta)}{\partial\theta}\Big|_{\theta=\theta_0}\,dx.
\]
Using the same argument as in the information inequality result above, the last term is
\[
\int \frac{\partial f}{\partial\theta}\,dx = \frac{\partial}{\partial\theta}\int f\,dx = 0.
\]
• Hence the CLT can be applied to the numerator of (**):
\[
\text{numerator of (**)} \xrightarrow{d} N\bigg(0,\ E_{\theta_0}\Big[\Big(\frac{\partial\log f(x_i|\theta)}{\partial\theta}\Big|_{\theta=\theta_0}\Big)^2\Big]\bigg).
\]
• By the LLN and the uniform convergence of the Hessian term:
\[
\text{denominator of (**)} \xrightarrow{p} E_{\theta_0}\Big[\frac{\partial^2\log f(X|\theta)}{\partial\theta^2}\Big|_{\theta=\theta_0}\Big].
\]
• Hence, by the Slutsky theorem:
\[
\sqrt{n}\big(\hat\theta_n-\theta_0\big) \xrightarrow{d}
N\Bigg(0,\ \frac{E_{\theta_0}\Big[\Big(\frac{\partial\log f(x_i|\theta)}{\partial\theta}\big|_{\theta_0}\Big)^2\Big]}{\Big(E_{\theta_0}\Big[\frac{\partial^2\log f(X|\theta)}{\partial\theta^2}\big|_{\theta_0}\Big]\Big)^2}\Bigg).
\]
• By the information inequality,
\[
E_{\theta_0}\Big[\Big(\frac{\partial\log f(x_i|\theta)}{\partial\theta}\Big|_{\theta_0}\Big)^2\Big] = -E_{\theta_0}\Big[\frac{\partial^2\log f(X|\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big],
\]
so that
\[
\sqrt{n}\big(\hat\theta_n-\theta_0\big) \xrightarrow{d}
N\Bigg(0,\ \frac{1}{E_{\theta_0}\Big[\Big(\frac{\partial\log f(x_i|\theta)}{\partial\theta}\big|_{\theta_0}\Big)^2\Big]}\Bigg),
\]
so that the asymptotic variance is the CRLB. Hence the asymptotic approximation to the finite-sample distribution is
\[
\hat\theta_n^{MLE} \overset{a}{\sim}
N\Bigg(\theta_0,\ \frac{1}{n}\cdot\frac{1}{E_{\theta_0}\Big[\Big(\frac{\partial\log f(x_i|\theta)}{\partial\theta}\big|_{\theta_0}\Big)^2\Big]}\Bigg).
\]
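In practice the information in the final approximation is unknown and is estimated at $\hat\theta_n$. The sketch below (my own illustration, going slightly beyond the notes; it again assumes the exponential-rate model with arbitrary $\theta_0$ and $n$) forms an approximate standard error two ways, using the average squared score and using minus the average Hessian, which the information equality says should agree.

```python
import numpy as np

# Using theta_hat ~a N(theta0, 1/(n*I(theta0))): estimate the Fisher information
# at the MLE either by the average squared score or by minus the average Hessian
# (information equality), and form an approximate standard error.
# Assumed model for illustration: x_i ~ Exponential with rate theta,
# log f(x|theta) = log(theta) - theta*x, so the MLE is theta_hat = 1/xbar.
rng = np.random.default_rng(0)
theta0, n = 1.5, 2_000                      # arbitrary illustration values
x = rng.exponential(1.0 / theta0, size=n)

theta_hat = 1.0 / x.mean()
score = 1.0 / theta_hat - x                 # d/dtheta log f(x_i|theta) at theta_hat
hess = -1.0 / theta_hat ** 2 * np.ones(n)   # d^2/dtheta^2 log f(x_i|theta) at theta_hat

info_score = np.mean(score ** 2)            # average-squared-score estimate of I
info_hess = -np.mean(hess)                  # negative-Hessian estimate of I
print(f"theta_hat = {theta_hat:.4f}")
print(f"se via squared scores   = {np.sqrt(1.0 / (n * info_score)):.4f}")
print(f"se via negative Hessian = {np.sqrt(1.0 / (n * info_hess)):.4f}")
print(f"closed form theta_hat/sqrt(n) = {theta_hat / np.sqrt(n):.4f}")
```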