Point Estimation: properties of estimators
• finite-sample properties (CB 7.3)
• large-sample properties (CB 10.1)

1 FINITE-SAMPLE PROPERTIES

How an estimator performs for a finite number of observations $n$.

Estimator: $W$. Parameter: $\theta$.

Criteria for evaluating estimators:
• Bias: does $EW = \theta$?
• Variance of $W$ (you would like an estimator with a smaller variance).

Example: $X_1,\dots,X_n \sim$ i.i.d. $(\mu,\sigma^2)$. The unknown parameters are $\mu$ and $\sigma^2$. Consider:
• $\hat\mu_n \equiv \frac{1}{n}\sum_i X_i$, an estimator of $\mu$;
• $\hat\sigma_n^2 \equiv \frac{1}{n}\sum_i (X_i-\bar X_n)^2$, an estimator of $\sigma^2$.

Bias: $E\hat\mu_n = \frac{1}{n}\cdot n\mu = \mu$, so $\hat\mu_n$ is unbiased. Its variance is $V\hat\mu_n = \frac{1}{n^2}\cdot n\sigma^2 = \frac{\sigma^2}{n}$.

For the variance estimator,
\[
E\hat\sigma_n^2 = E\Big(\frac{1}{n}\sum_i (X_i-\bar X_n)^2\Big)
= \frac{1}{n}\sum_i \big(EX_i^2 - 2EX_i\bar X_n + E\bar X_n^2\big)
= \frac{1}{n}\cdot n\Big[(\mu^2+\sigma^2) - 2\Big(\mu^2+\frac{\sigma^2}{n}\Big) + \frac{\sigma^2}{n} + \mu^2\Big]
= \frac{n-1}{n}\sigma^2.
\]
Hence it is biased. To fix this bias, consider the estimator $s_n^2 \equiv \frac{1}{n-1}\sum_i (X_i-\bar X_n)^2$; then $Es_n^2 = \sigma^2$ (unbiased).

The mean-squared error (MSE) of $W$ is $E(W-\theta)^2$. It is a common criterion for comparing estimators. Decompose:
\[
MSE(W) = VW + (EW-\theta)^2 = \text{Variance} + (\text{Bias})^2.
\]
Hence, for an unbiased estimator, $MSE(W) = VW$.

Example: $X_1,\dots,X_n \sim U[0,\theta]$, with density $f(x) = 1/\theta$ for $x\in[0,\theta]$.

• Consider the estimator $\hat\theta_n \equiv 2\bar X_n$.
  $E\hat\theta_n = \frac{2}{n}E\sum_i X_i = 2\cdot\frac{1}{n}\cdot n\cdot\frac{\theta}{2} = \theta$, so it is unbiased, and
  \[
  MSE(\hat\theta_n) = V\hat\theta_n = \frac{4}{n^2}\sum_i VX_i = \frac{\theta^2}{3n}.
  \]

• Consider the estimator $\tilde\theta_n \equiv \max(X_1,\dots,X_n)$. In order to derive its moments, start by deriving its CDF:
  \[
  P(\tilde\theta_n \le z) = P(X_1\le z,\dots,X_n\le z) = \prod_{i=1}^n P(X_i\le z) = \Big(\frac{z}{\theta}\Big)^n \ \text{ if } z\le\theta, \text{ and } 1 \text{ otherwise.}
  \]
  Therefore $f_{\tilde\theta_n}(z) = n\big(\frac{z}{\theta}\big)^{n-1}\frac{1}{\theta}$, for $0\le z\le\theta$. Then
  \[
  E\tilde\theta_n = \int_0^\theta z\cdot n\Big(\frac{z}{\theta}\Big)^{n-1}\frac{1}{\theta}\,dz = \frac{n}{\theta^n}\int_0^\theta z^n\,dz = \frac{n}{n+1}\theta,
  \]
  so $\text{Bias}(\tilde\theta_n) = -\theta/(n+1)$. Also
  \[
  E\tilde\theta_n^2 = \frac{n}{\theta^n}\int_0^\theta z^{n+1}\,dz = \frac{n}{n+2}\theta^2,
  \]
  hence $V\tilde\theta_n = \frac{n}{n+2}\theta^2 - \big(\frac{n}{n+1}\big)^2\theta^2 = \frac{n}{(n+2)(n+1)^2}\theta^2$. Accordingly, $MSE(\tilde\theta_n) = \frac{2\theta^2}{(n+2)(n+1)}$.

Continue the previous example. Redefine $\tilde\theta_n \equiv \frac{n+1}{n}\max(X_1,\dots,X_n)$. Now both estimators $\hat\theta_n$ and $\tilde\theta_n$ are unbiased. Which is better?
\[
V\hat\theta_n = \frac{\theta^2}{3n} = O(1/n), \qquad
V\tilde\theta_n = \Big(\frac{n+1}{n}\Big)^2 V(\max(X_1,\dots,X_n)) = \frac{\theta^2}{n(n+2)} = O(1/n^2).
\]
Hence, for $n$ large enough, $\tilde\theta_n$ has a smaller variance, and in this sense it is "better". (A small simulation sketch of this comparison follows the Cramér–Rao theorem below.)

Best unbiased estimator: if you choose the best (in terms of MSE) estimator, and restrict yourself to unbiased estimators, then the best estimator is the one with the lowest variance. A best unbiased estimator is also called the "uniform minimum variance unbiased estimator" (UMVUE). Formally, a UMVUE $W$ of $\theta$ satisfies:
(i) $E_\theta W = \theta$, for all $\theta$ (unbiasedness);
(ii) $V_\theta W \le V_\theta\tilde W$, for all $\theta$ and all other unbiased estimators $\tilde W$.

The "uniform" condition is crucial, because it is always possible to find estimators which have zero variance for a specific value of $\theta$.

It is difficult in general to verify that an estimator $W$ is UMVUE, since you have to verify condition (ii) of the definition, that $VW$ is smaller than the variance of every other unbiased estimator. Luckily, we have an important result for the lowest attainable variance of an estimator.

• Theorem 7.3.9 (Cramér–Rao Inequality): Let $X_1,\dots,X_n$ be a sample with joint pdf $f(X|\theta)$, and let $W(X)$ be any estimator satisfying
  (i) $\frac{d}{d\theta}E_\theta W(X) = \int \frac{\partial}{\partial\theta}\big[W(X)f(X|\theta)\big]\,dX$;
  (ii) $V_\theta W(X) < \infty$.
  Then
  \[
  V_\theta W(X) \ge \frac{\big(\frac{d}{d\theta}E_\theta W(X)\big)^2}{E_\theta\big(\frac{\partial}{\partial\theta}\log f(X|\theta)\big)^2}.
  \]
  The RHS above is called the Cramér–Rao Lower Bound (CRLB). Proof: CB, pg. 336.
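As referenced above, here is a minimal Monte Carlo sketch (not from CB; the values $\theta=2$, $n=50$, and the replication count are arbitrary illustration choices) comparing the two unbiased estimators of $\theta$ in the $U[0,\theta]$ example, $2\bar X_n$ and $\frac{n+1}{n}\max_i X_i$.

```python
import numpy as np

# Monte Carlo comparison of two unbiased estimators of theta for U[0, theta]:
#   theta_hat   = 2 * sample mean            (variance of order 1/n)
#   theta_tilde = (n+1)/n * sample maximum   (variance of order 1/n^2)
rng = np.random.default_rng(0)
theta, n, reps = 2.0, 50, 100_000           # arbitrary illustration values

X = rng.uniform(0.0, theta, size=(reps, n))
theta_hat = 2.0 * X.mean(axis=1)
theta_tilde = (n + 1) / n * X.max(axis=1)

for name, est in [("2*Xbar", theta_hat), ("(n+1)/n*max", theta_tilde)]:
    bias = est.mean() - theta
    var = est.var()
    print(f"{name:12s} bias={bias:+.4f}  var={var:.6f}  MSE={var + bias**2:.6f}")

# Theoretical variances derived above, for comparison:
print("theory: var(2*Xbar) =", theta**2 / (3 * n),
      "  var((n+1)/n*max) =", theta**2 / (n * (n + 2)))
```

The simulated variances should line up with the $\theta^2/(3n)$ and $\theta^2/(n(n+2))$ formulas derived above.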
Note: the LHS of condition (i) above is $\frac{d}{d\theta}\int W(X)f(X|\theta)\,dX$, so by Leibniz' rule this condition basically rules out cases where the support of $X$ depends on $\theta$.

• The equality
\[
E_\theta\frac{\partial}{\partial\theta}\log f(X|\theta)
= \int \frac{1}{f(X|\theta)}\frac{\partial f(X|\theta)}{\partial\theta}\,f(X|\theta)\,dx
= \int \frac{\partial}{\partial\theta}f(X|\theta)\,dx
= \frac{\partial}{\partial\theta}\cdot 1 = 0 \tag{1}
\]
is noteworthy, because
\[
\frac{\partial}{\partial\theta}\log f(X|\theta) = 0
\]
is the FOC of the maximum likelihood estimation problem. (Alternatively, as in CB, apply condition (i) of the CR result, using $W(X) = 1$.)

In the i.i.d. case, this becomes the sample average
\[
\frac{1}{n}\sum_i \frac{\partial}{\partial\theta}\log f(x_i|\theta) = 0.
\]
And by the LLN,
\[
\frac{1}{n}\sum_i \frac{\partial}{\partial\theta}\log f(x_i|\theta) \xrightarrow{p} E_{\theta_0}\frac{\partial}{\partial\theta}\log f(x_i|\theta),
\]
where $\theta_0$ is the true value of $\theta$. This shows that maximum likelihood estimation of $\theta$ is equivalent to estimation based on the moment condition
\[
E_{\theta_0}\frac{\partial}{\partial\theta}\log f(x_i|\theta) = 0,
\]
which holds only at the true value $\theta = \theta_0$. (Thus MLE is "consistent" for the true value $\theta_0$, as we'll see later.)

(However, note that Eq. (1) holds at all values of $\theta$, not just $\theta_0$.)

What if the model is "misspecified", in the sense that the true density of $X$ is $g(x)$, and for all $\theta\in\Theta$, $f(x|\theta) \ne g(x)$ (that is, there is no value of the parameter $\theta$ such that the postulated model $f$ coincides with the true model $g$)? Does Eq. (1) still hold? What is MLE looking for?

• In the i.i.d. case, the CR lower bound can be simplified.
Corollary 7.3.10: if $X_1,\dots,X_n \sim$ i.i.d. $f(X|\theta)$, then
\[
V_\theta W(X) \ge \frac{\big(\frac{d}{d\theta}E_\theta W(X)\big)^2}{n\cdot E_\theta\big(\frac{\partial}{\partial\theta}\log f(X|\theta)\big)^2}.
\]

• Up to this point, the Cramér–Rao results are not that operational for finding a "best" estimator, because the estimator $W(X)$ appears on both sides of the inequality. However, for an unbiased estimator, how can you simplify the expression further?

• Example: $X_1,\dots,X_n \sim$ i.i.d. $N(\mu,\sigma^2)$. What is the CRLB for an unbiased estimator of $\mu$? Unbiasedness implies that the numerator equals 1. Also,
\[
\log f(x|\theta) = -\log\sqrt{2\pi} - \log\sigma - \frac{1}{2}\Big(\frac{x-\mu}{\sigma}\Big)^2,
\]
\[
\frac{\partial}{\partial\mu}\log f(x|\theta) = -\frac{x-\mu}{\sigma}\cdot\frac{-1}{\sigma} = \frac{x-\mu}{\sigma^2},
\]
\[
E\Big(\frac{\partial}{\partial\mu}\log f(X|\theta)\Big)^2 = \frac{E(X-\mu)^2}{\sigma^4} = \frac{1}{\sigma^4}VX = \frac{1}{\sigma^2}.
\]
Hence the CRLB $= \frac{1}{n\cdot\frac{1}{\sigma^2}} = \frac{\sigma^2}{n}$. This is the variance of the sample mean, so the sample mean is a UMVUE for $\mu$.

• Sometimes we can simplify the denominator of the CRLB further.
Lemma 7.3.11 (Information inequality): if $f(X|\theta)$ satisfies
\[
(*)\qquad \frac{d}{d\theta}E_\theta\Big(\frac{\partial}{\partial\theta}\log f(X|\theta)\Big) = \int \frac{\partial}{\partial\theta}\Big[\Big(\frac{\partial}{\partial\theta}\log f(X|\theta)\Big)f(X|\theta)\Big]\,dx,
\]
then
\[
E_\theta\Big(\frac{\partial}{\partial\theta}\log f(X|\theta)\Big)^2 = -E_\theta\Big(\frac{\partial^2}{\partial\theta^2}\log f(X|\theta)\Big).
\]
Rough proof:
LHS of (*): using Eq. (1) above, we get that the LHS of (*) is 0.
RHS of (*):
\[
\int \frac{\partial}{\partial\theta}\Big[\frac{\partial\log f}{\partial\theta}f\Big]dx
= \int \frac{\partial^2\log f}{\partial\theta^2}f\,dx + \int \frac{1}{f}\Big(\frac{\partial f}{\partial\theta}\Big)^2 dx
= E\frac{\partial^2\log f}{\partial\theta^2} + E\Big(\frac{\partial\log f}{\partial\theta}\Big)^2.
\]
Putting the LHS and RHS together yields the desired result.

Note: the LHS of condition (*) above is just $\frac{d}{d\theta}\int\big(\frac{\partial}{\partial\theta}\log f(X|\theta)\big)f(X|\theta)\,dX$, so that, by Leibniz' rule, condition (*) just states that the bounds of integration (i.e., the support of $X$) do not depend on $\theta$. The normal distribution satisfies this (its support is always $(-\infty,\infty)$), but $U[0,\theta]$ does not.

Also, the information inequality depends crucially on the equality $E_\theta\frac{\partial}{\partial\theta}\log f(X|\theta) = 0$, which depends on correct specification of the model. Thus the information inequality can be used as the basis of a "specification test". (How?)

• Example: for the previous example, consider the CRLB for an unbiased estimator of $\sigma^2$. We can use the information inequality, because condition (*) is satisfied for the normal distribution. Hence:
\[
\frac{\partial}{\partial\sigma^2}\log f(x|\theta) = -\frac{1}{2\sigma^2} + \frac{(x-\mu)^2}{2\sigma^4},
\]
\[
\frac{\partial^2}{\partial(\sigma^2)^2}\log f(x|\theta) = \frac{1}{2\sigma^4} - \frac{(x-\mu)^2}{\sigma^6},
\]
\[
-E\,\frac{\partial^2}{\partial(\sigma^2)^2}\log f(x|\theta) = -\frac{1}{2\sigma^4} + \frac{1}{\sigma^4} = \frac{1}{2\sigma^4}.
\]
Hence the CRLB is $\frac{2\sigma^4}{n}$.
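A quick numerical check of the information equality used in this example can be done by simulation. The sketch below (my own illustration; $\mu$, $\sigma^2$, and the replication count are arbitrary) compares the two sides of $E[(\partial\log f/\partial\sigma^2)^2] = -E[\partial^2\log f/\partial(\sigma^2)^2]$ with the closed form $1/(2\sigma^4)$.

```python
import numpy as np

# Monte Carlo check of the information equality for N(mu, sigma^2), viewing
# sigma^2 as the parameter: E[(d log f / d sigma^2)^2] should equal
# -E[d^2 log f / d(sigma^2)^2] = 1/(2*sigma^4).
rng = np.random.default_rng(0)
mu, sigma2, reps = 1.0, 2.0, 1_000_000      # arbitrary illustration values

x = rng.normal(mu, np.sqrt(sigma2), size=reps)
score = -1.0 / (2 * sigma2) + (x - mu) ** 2 / (2 * sigma2 ** 2)   # d log f / d sigma^2
hess  =  1.0 / (2 * sigma2 ** 2) - (x - mu) ** 2 / sigma2 ** 3    # second derivative

print("E[score^2]   ~", (score ** 2).mean())
print("-E[hessian]  ~", -hess.mean())
print("closed form 1/(2*sigma^4) =", 1.0 / (2 * sigma2 ** 2))
```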
• Example: $X_1,\dots,X_n \sim U[0,\theta]$. Check the conditions for the CRLB for an unbiased estimator $W(X)$ of $\theta$:
\[
\frac{d}{d\theta}EW(X) = 1 \quad\text{(because it is unbiased)},
\]
but
\[
\int W(X)\frac{\partial}{\partial\theta}f(X|\theta)\,dX = \int W(X)\cdot\frac{-1}{\theta^2}\,dX \ne 1 = \frac{d}{d\theta}EW(X).
\]
Hence condition (i) of the theorem is not satisfied.

• But when can the CRLB (if it exists) be attained?
Corollary 7.3.15: Let $X_1,\dots,X_n \sim$ i.i.d. $f(x|\theta)$, satisfying the conditions of the CR theorem.
  – Let $L(\theta|X) = \prod_{i=1}^n f(x_i|\theta)$ denote the likelihood function.
  – Let the estimator $W(X)$ be unbiased for $\theta$.
  – Then $W(X)$ attains the CRLB iff you can write
    \[
    \frac{\partial}{\partial\theta}\log L(\theta|X) = a(\theta)\big(W(X)-\theta\big)
    \]
    for some function $a(\theta)$.

• Example: $X_1,\dots,X_n \sim$ i.i.d. $N(\mu,\sigma^2)$. Consider estimating $\mu$:
\[
\frac{\partial}{\partial\mu}\log L(\theta|X)
= \frac{\partial}{\partial\mu}\sum_{i=1}^n \log f(x_i|\theta)
= \frac{\partial}{\partial\mu}\sum_{i=1}^n \Big[-\log\sqrt{2\pi} - \log\sigma - \frac{1}{2}\frac{(x_i-\mu)^2}{\sigma^2}\Big]
= \sum_{i=1}^n \frac{x_i-\mu}{\sigma^2}
= \frac{n}{\sigma^2}(\bar X_n - \mu).
\]
Hence the CRLB can be attained (in fact, we showed earlier that the CRLB is attained by $\bar X_n$).

Loss function optimality

Let $X \sim f(X|\theta)$. Consider a loss function $L(\theta, W(X))$, taking values in $[0,+\infty)$, which penalizes you when your estimator $W(X)$ is "far" from the true parameter $\theta$. Note that $L(\theta, W(X))$ is a random variable, since $X$ (and hence $W(X)$) is random.

Consider estimators which minimize expected loss, that is,
\[
\min_{W(\cdot)} E_\theta L(\theta, W(X)) \equiv \min_{W(\cdot)} R(\theta, W(\cdot)),
\]
where $R(\theta, W(\cdot))$ is the risk function. (Note: the risk function is not a random variable, because $X$ has been integrated out.)

Loss function optimality is a more general criterion than minimum MSE. In fact, because $MSE(W(X)) = E_\theta\big(W(X)-\theta\big)^2$, the MSE is just the risk function associated with the quadratic loss function $L(\theta, W(X)) = (W(X)-\theta)^2$.

Other examples of loss functions:
• Absolute error loss: $|W(X)-\theta|$
• Relative quadratic error loss: $\frac{(W(X)-\theta)^2}{|\theta|+1}$

The exercise of minimizing risk takes a given value of $\theta$ as given, so that the minimized risk of an estimator depends on whichever value of $\theta$ you are considering. You might instead be interested in an estimator which does well regardless of the value of $\theta$. (This is analogous to the focus on uniform minimum variance.) For this different problem, you want a notion of risk which does not depend on $\theta$. Two criteria which have been considered are:

• "Average" risk:
\[
\min_{W(\cdot)} \int R(\theta, W(\cdot))\,h(\theta)\,d\theta,
\]
where $h(\theta)$ is some weighting function across $\theta$. (In a Bayesian interpretation, $h(\theta)$ is a prior density over $\theta$.)

• Minmax criterion:
\[
\min_{W(\cdot)} \max_\theta R(\theta, W(\cdot)).
\]
Here you choose the estimator $W(\cdot)$ to minimize the maximum risk $\max_\theta R(\theta, W(\cdot))$, where $\theta$ is set to the "worst" value. So the minmax optimizer is the best that can be achieved in a "worst-case" scenario.

Example: $X_1,\dots,X_n \sim$ i.i.d. $N(\mu,\sigma^2)$. The sample mean $\bar X_n$ is:
• unbiased
• minimum MSE
• UMVUE
• attains the CRLB
• minimizes expected quadratic loss

2 LARGE SAMPLE PROPERTIES OF ESTIMATORS

It can be difficult to compute MSE, risk functions, etc., for some estimators, especially when the estimator does not resemble a sample average. Large-sample properties exploit the LLN and CLT.

Consider data $\{X_1, X_2, \dots\}$ from which we construct a sequence of estimators $W_n \equiv W_n(X_1,\dots,X_n)$, $n = 1, 2, \dots$. $W_n$ is a random sequence.

Define: we say that $W_n$ is consistent for a parameter $\theta$ iff the random sequence $W_n$ converges (in some stochastic sense) to $\theta$.
Strong consistency obtains when $W_n \xrightarrow{as} \theta$.
Weak consistency obtains when $W_n \xrightarrow{p} \theta$.

For estimators like sample means, consistency (either weak or strong) follows easily using a LLN. Consistency can be thought of as the large-sample version of unbiasedness.
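A minimal sketch of weak consistency in action (my own illustration, assuming a normal model with arbitrary $\mu$, $\sigma$, and sample sizes): the sample mean drifts toward the true $\mu$ along a single realized data sequence as $n$ grows.

```python
import numpy as np

# Illustration of (weak) consistency: the sample mean W_n = Xbar_n drifts
# toward the true mu as n grows, along one realized data sequence.
rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0                        # arbitrary illustration values
x = rng.normal(mu, sigma, size=100_000)

for n in [10, 100, 1_000, 10_000, 100_000]:
    w_n = x[:n].mean()
    print(f"n={n:>6d}   W_n={w_n:+.4f}   |W_n - mu|={abs(w_n - mu):.4f}")
```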
Define: an M-estimator is an estimator of $\theta$ which is a maximizer of an objective function $Q_n(\theta)$.

Examples:
• MLE: $Q_n(\theta) = \frac{1}{n}\sum_{i=1}^n \log f(x_i|\theta)$.
• Least squares: $Q_n(\theta) = -\sum_{i=1}^n [y_i - g(x_i;\theta)]^2$. OLS is the special case when $g(x_i;\theta) = \alpha + x_i\beta$.
• GMM: $Q_n(\theta) = -G_n(\theta)' W_n G_n(\theta)$, where
  \[
  G_n(\theta) = \Big(\frac{1}{n}\sum_{i=1}^n m_1(x_i;\theta),\ \frac{1}{n}\sum_{i=1}^n m_2(x_i;\theta),\ \dots,\ \frac{1}{n}\sum_{i=1}^n m_M(x_i;\theta)\Big)'
  \]
  is an $M\times 1$ vector of sample moment conditions, and $W_n$ is an $M\times M$ weighting matrix.

Notation: denote the limit objective function $Q_0(\theta) = \text{plim}_{n\to\infty} Q_n(\theta)$ (at each $\theta$). Define $\theta_0 \equiv \text{argmax}_\theta Q_0(\theta)$.

Consistency of M-estimators

Make the following assumptions:
1. $Q_0(\theta)$ is uniquely maximized at some value $\theta_0$ ("identification").
2. The parameter space $\Theta$ is a compact subset of $\mathbb{R}^K$.
3. $Q_0(\theta)$ is continuous in $\theta$.
4. $Q_n(\theta)$ converges uniformly in probability to $Q_0(\theta)$; that is,
   \[
   \sup_{\theta\in\Theta}|Q_n(\theta) - Q_0(\theta)| \xrightarrow{p} 0.
   \]

Theorem (Consistency of M-estimators): Under assumptions 1, 2, 3, 4, $\theta_n \xrightarrow{p} \theta_0$.

Proof: We need to show that, for any arbitrarily small neighborhood $N$ containing $\theta_0$, $P(\theta_n \in N) \to 1$.

For $n$ large enough, the uniform convergence condition implies that, for all $\epsilon, \delta > 0$,
\[
P\Big(\sup_{\theta\in\Theta}|Q_n(\theta) - Q_0(\theta)| < \epsilon/2\Big) > 1 - \delta.
\]
The event "$\sup_{\theta\in\Theta}|Q_n(\theta) - Q_0(\theta)| < \epsilon/2$" implies
\[
Q_n(\theta_n) - Q_0(\theta_n) < \epsilon/2 \iff Q_0(\theta_n) > Q_n(\theta_n) - \epsilon/2. \tag{2}
\]
Similarly,
\[
Q_n(\theta_0) - Q_0(\theta_0) > -\epsilon/2 \implies Q_n(\theta_0) > Q_0(\theta_0) - \epsilon/2. \tag{3}
\]
Since $\theta_n = \text{argmax}_\theta Q_n(\theta)$, Eq. (2) implies
\[
Q_0(\theta_n) > Q_n(\theta_0) - \epsilon/2. \tag{4}
\]
Hence, combining Eqs. (3) and (4), we have
\[
Q_0(\theta_n) > Q_0(\theta_0) - \epsilon. \tag{5}
\]
So we have shown that
\[
\sup_{\theta\in\Theta}|Q_n(\theta) - Q_0(\theta)| < \epsilon/2 \implies Q_0(\theta_n) > Q_0(\theta_0) - \epsilon,
\]
so that
\[
P\big(Q_0(\theta_n) > Q_0(\theta_0) - \epsilon\big) \ge P\Big(\sup_{\theta\in\Theta}|Q_n(\theta) - Q_0(\theta)| < \epsilon/2\Big) \to 1.
\]
Now let $N$ be any open neighborhood in $\mathbb{R}^K$ which contains $\theta_0$, and let $\bar N$ be the complement of $N$ in $\mathbb{R}^K$. Then $\Theta\cap\bar N$ is compact, so that $\max_{\theta\in\Theta\cap\bar N} Q_0(\theta)$ exists. Set $\epsilon = Q_0(\theta_0) - \max_{\theta\in\Theta\cap\bar N} Q_0(\theta)$. Then
\[
Q_0(\theta_n) > Q_0(\theta_0) - \epsilon \implies Q_0(\theta_n) > \max_{\theta\in\Theta\cap\bar N} Q_0(\theta) \implies \theta_n \in N,
\]
so that
\[
P(\theta_n \in N) \ge P\big(Q_0(\theta_n) > Q_0(\theta_0) - \epsilon\big) \to 1.
\]
Since the argument above holds for any arbitrarily small neighborhood of $\theta_0$, we are done.

• In general, the limit objective function $Q_0(\theta) = \text{plim}_{n\to\infty} Q_n(\theta)$ may not be that straightforward to determine. But in many cases $Q_n(\theta)$ is a sample average of some sort:
\[
Q_n(\theta) = \frac{1}{n}\sum_i q(x_i|\theta)
\]
(e.g., least squares, MLE). Then by a law of large numbers we conclude that, for all $\theta$,
\[
Q_0(\theta) = \text{plim}\ \frac{1}{n}\sum_i q(x_i|\theta) = E_{x_i} q(x_i|\theta),
\]
where $E_{x_i}$ denotes expectation with respect to the true (but unobserved) distribution of $x_i$.

• Most of the time, $\theta_0$ can be interpreted as a "true value". But if the model is misspecified, then this interpretation doesn't hold (indeed, under misspecification it is not even clear what the "true value" is). So a more cautious way to interpret the consistency result is that
\[
\theta_n \xrightarrow{p} \text{argmax}_\theta Q_0(\theta),
\]
which holds (given the conditions) whether or not the model is correctly specified.

** Of the four assumptions above, the most "high-level" one is the uniform convergence condition. Sufficient conditions for it are:

1. Pointwise convergence: for each $\theta\in\Theta$, $Q_n(\theta) - Q_0(\theta) = o_p(1)$.

2. $Q_n(\theta)$ is stochastically equicontinuous: for every $\epsilon > 0$, $\eta > 0$ there exist a sequence of random variables $\Delta_n(\epsilon,\eta)$ and an $n^*(\epsilon,\eta)$ such that for all $n > n^*$, $P(|\Delta_n| > \epsilon) < \eta$, and for each $\theta$ there is an open set $N$ containing $\theta$ with
\[
\sup_{\tilde\theta\in N}|Q_n(\tilde\theta) - Q_n(\theta)| \le \Delta_n, \quad \forall n > n^*.
\]
Note that both $\Delta_n$ and $n^*$ do not depend on $\theta$: it is a uniform result. This is an "in probability" version of the deterministic notion of uniform equicontinuity: we say a sequence of deterministic functions $R_n(\theta)$ is uniformly equicontinuous if, for every $\epsilon > 0$ there exist $\delta(\epsilon)$ and $n^*(\epsilon)$ such that for all $\theta$,
\[
\sup_{\tilde\theta:\|\tilde\theta-\theta\|<\delta}|R_n(\tilde\theta) - R_n(\theta)| \le \epsilon, \quad \forall n > n^*.
\]
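Before turning to asymptotic normality, here is a minimal numerical sketch of an M-estimator (my own illustration, not from the notes): the MLE of an exponential rate $\theta$, computed by maximizing $Q_n(\theta) = \frac{1}{n}\sum_i \log f(x_i|\theta)$ over a compact grid, as in assumption 2. The true rate, sample size, and grid bounds are arbitrary choices.

```python
import numpy as np

# Minimal M-estimation sketch: MLE of an exponential rate theta, treated as the
# maximizer of Q_n(theta) = (1/n) * sum_i log f(x_i | theta), computed here by a
# crude grid search over a compact parameter space Theta.
rng = np.random.default_rng(0)
theta0, n = 1.5, 5_000                      # arbitrary illustration values
x = rng.exponential(1.0 / theta0, size=n)   # Exponential with rate theta0

grid = np.linspace(0.01, 10.0, 10_000)      # compact Theta = [0.01, 10]
Qn = np.log(grid) - grid * x.mean()         # (1/n) sum_i [log theta - theta*x_i]
theta_n = grid[np.argmax(Qn)]

print("grid-search M-estimator:", theta_n)
print("closed-form MLE 1/xbar :", 1.0 / x.mean())
print("true theta0            :", theta0)
```

For this model $Q_n(\theta) = \log\theta - \theta\bar x_n$, so the grid maximizer should agree (up to grid resolution) with the closed-form MLE $1/\bar x_n$, and both should be close to $\theta_0$ for large $n$.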
Asymptotic normality for M-estimators

Define the "score vector"
\[
\nabla_\theta Q_n(\theta)\big|_{\theta=\tilde\theta} = \Big(\frac{\partial Q_n(\theta)}{\partial\theta_1}\Big|_{\theta=\tilde\theta},\ \dots,\ \frac{\partial Q_n(\theta)}{\partial\theta_K}\Big|_{\theta=\tilde\theta}\Big)'.
\]
Similarly, define the $K\times K$ Hessian matrix
\[
\big[\nabla_{\theta\theta}Q_n(\theta)\big|_{\theta=\tilde\theta}\big]_{i,j} = \frac{\partial^2 Q_n(\theta)}{\partial\theta_i\,\partial\theta_j}\Big|_{\theta=\tilde\theta}, \qquad 1\le i,j\le K.
\]
Note that the Hessian is symmetric.

Make the following assumptions:
1. $\theta_n = \text{argmax}_\theta Q_n(\theta) \xrightarrow{p} \theta_0$.
2. $\theta_0 \in \text{interior}(\Theta)$.
3. $Q_n(\theta)$ is twice continuously differentiable in a neighborhood $\mathcal{N}$ of $\theta_0$.
4. $\sqrt{n}\,\nabla_\theta Q_n(\theta_0) \xrightarrow{d} N(0,\Sigma)$.
5. Uniform convergence of the Hessian: there exists a matrix $H(\theta)$ which is continuous at $\theta_0$ such that $\sup_{\theta\in\mathcal{N}}\|\nabla_{\theta\theta}Q_n(\theta) - H(\theta)\| \xrightarrow{p} 0$.
6. $H(\theta_0)$ is nonsingular.

Theorem (Asymptotic normality for M-estimators): Under assumptions 1–6,
\[
\sqrt{n}(\theta_n - \theta_0) \xrightarrow{d} N\big(0,\ H_0^{-1}\Sigma H_0^{-1}\big),
\]
where $H_0 \equiv H(\theta_0)$.

Proof (sketch): By assumptions 1, 2, 3, $\nabla_\theta Q_n(\theta_n) = 0$ (this is the FOC of the maximization problem). Then using the mean value theorem (with $\bar\theta_n$ denoting the mean value):
\[
0 = \nabla_\theta Q_n(\theta_n) = \nabla_\theta Q_n(\theta_0) + \nabla_{\theta\theta}Q_n(\bar\theta_n)(\theta_n - \theta_0)
\]
\[
\implies \sqrt{n}(\theta_n - \theta_0) = -\big[\underbrace{\nabla_{\theta\theta}Q_n(\bar\theta_n)}_{\xrightarrow{p}\ H_0 \text{ (using A5)}}\big]^{-1}\ \underbrace{\sqrt{n}\,\nabla_\theta Q_n(\theta_0)}_{\xrightarrow{d}\ N(0,\Sigma) \text{ (using A4)}}
\]
\[
\implies \sqrt{n}(\theta_n - \theta_0) \xrightarrow{d} -H(\theta_0)^{-1}N(0,\Sigma) = N\big(0,\ H_0^{-1}\Sigma H_0^{-1}\big).
\]
Note: A5 is a uniform convergence assumption on the sample Hessian. In the analogous step of the Delta method we used the plim-operator theorem coupled with the "squeeze" result. Here that does not suffice, because as $n\to\infty$ both $\bar\theta_n$ and the function $G_n(\cdot) = \nabla_{\theta\theta}Q_n(\cdot)$ are changing.

2.1 Maximum likelihood estimation

The consistency of the MLE can follow by application of the theorem above for consistency of M-estimators. Essentially, as we noted above, what the consistency theorem showed was that, for any M-estimator sequence $\theta_n$: $\text{plim}_{n\to\infty}\theta_n = \text{argmax}_\theta Q_0(\theta)$.

For MLE, there is an argument due to Wald (1949), who shows that, in the i.i.d. case, the "limiting likelihood function" (corresponding to $Q_0(\theta)$) is indeed globally maximized at $\theta_0$, the "true value". Thus we can directly confirm the identification assumption of the M-estimator consistency theorem. This argument is of interest by itself.

Argument (summary of Amemiya, pp. 141–142):
• Define $\hat\theta_n^{MLE} \equiv \text{argmax}_\theta \frac{1}{n}\sum_i \log f(x_i|\theta)$. Let $\theta_0$ denote the true value.
• By the LLN: $\frac{1}{n}\sum_i \log f(x_i|\theta) \xrightarrow{p} E_{\theta_0}\log f(x_i|\theta)$, for all $\theta$ (not necessarily the true $\theta_0$).
• By Jensen's inequality: $E_{\theta_0}\log\frac{f(x|\theta)}{f(x|\theta_0)} < \log E_{\theta_0}\frac{f(x|\theta)}{f(x|\theta_0)}$.
• But $E_{\theta_0}\frac{f(x|\theta)}{f(x|\theta_0)} = \int \frac{f(x|\theta)}{f(x|\theta_0)}f(x|\theta_0)\,dx = 1$, since $f(x|\theta)$ is a density function, for all $\theta$. (In this step, note the importance of assumption A3 in CB, pg. 516: if $x$ has support depending on $\theta$, then $f(x|\theta)$ will not integrate to 1 for all $\theta$.)
• Hence:
\[
E_{\theta_0}\log\frac{f(x|\theta)}{f(x|\theta_0)} < 0,\ \forall\theta\ne\theta_0
\implies E_{\theta_0}\log f(x|\theta) < E_{\theta_0}\log f(x|\theta_0),\ \forall\theta\ne\theta_0
\implies E_{\theta_0}\log f(x|\theta) \text{ is maximized at the true }\theta_0.
\]
This is the "identification" assumption from the M-estimator consistency theorem.

• Analogously, we also know that, for $\delta > 0$,
\[
\mu_1 = E_{\theta_0}\log\frac{f(x;\theta_0-\delta)}{f(x;\theta_0)} < 0, \qquad
\mu_2 = E_{\theta_0}\log\frac{f(x;\theta_0+\delta)}{f(x;\theta_0)} < 0.
\]
By the SLLN, we know that
\[
\frac{1}{n}\sum_i \log\frac{f(x_i;\theta_0-\delta)}{f(x_i;\theta_0)} = \frac{1}{n}\big[\log L_n(x;\theta_0-\delta) - \log L_n(x;\theta_0)\big] \xrightarrow{as} \mu_1,
\]
so that, with probability 1, $\log L_n(x;\theta_0-\delta) < \log L_n(x;\theta_0)$ for $n$ large enough. Similarly, for $n$ large enough, $\log L_n(x;\theta_0+\delta) < \log L_n(x;\theta_0)$ with probability 1. Hence, for large $n$,
\[
\hat\theta_n \equiv \text{argmax}_\theta \log L_n(x;\theta) \in (\theta_0-\delta,\ \theta_0+\delta).
\]
That is, the MLE $\hat\theta_n$ is strongly consistent for $\theta_0$. Note that this argument requires weaker assumptions than the M-estimator consistency theorem above.
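A minimal Monte Carlo sketch of the asymptotic normality result applied to the MLE (my own illustration; the exponential model, $\theta_0=1.5$, $n$, and the replication count are arbitrary assumptions): for the exponential-rate MLE, $H_0^{-1}\Sigma H_0^{-1}$ works out to $\theta_0^2$, which is also the CRLB that appears in the efficiency discussion below.

```python
import numpy as np

# Monte Carlo sketch: for x_i ~ Exponential(rate theta0), the MLE is 1/xbar and
# sqrt(n)*(theta_hat - theta0) should be approximately N(0, theta0^2), since the
# Fisher information is I(theta0) = 1/theta0^2.
rng = np.random.default_rng(0)
theta0, n, reps = 1.5, 500, 20_000          # arbitrary illustration values

x = rng.exponential(1.0 / theta0, size=(reps, n))
theta_hat = 1.0 / x.mean(axis=1)
z = np.sqrt(n) * (theta_hat - theta0)

print("simulated var of sqrt(n)*(theta_hat - theta0):", z.var())
print("asymptotic variance theta0^2:                 ", theta0 ** 2)
```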
Now we consider another idea, efficiency, which can be thought of as the large-sample analogue of the "minimum variance" concept.

For the sequence of estimators $W_n$, suppose that
\[
k(n)\,(W_n - \theta) \xrightarrow{d} N(0,\sigma^2),
\]
where $k(n)$ is some normalizing sequence (typically a power of $n$). Then $\sigma^2$ is called the asymptotic variance of $W_n$.

In "usual" cases, $k(n) = \sqrt{n}$. For example, by the CLT we know that $\sqrt{n}(\bar X_n - \mu) \xrightarrow{d} N(0,\sigma^2)$. Hence $\sigma^2$ is the asymptotic variance of the sample mean $\bar X_n$.

Definition 10.1.11: An estimator sequence $W_n$ is asymptotically efficient for $\theta$ if
• $\sqrt{n}(W_n - \theta) \xrightarrow{d} N(0, v(\theta))$, where
• the asymptotic variance is $v(\theta) = \dfrac{1}{E_{\theta_0}\big(\frac{\partial}{\partial\theta}\log f(X|\theta)\big)^2}$ (the CRLB).

By the asymptotic normality result for M-estimators, we know what the asymptotic distribution of the MLE should be. However, it turns out that, given the information inequality, the MLE's asymptotic distribution can be further simplified.

Theorem 10.1.12: Asymptotic efficiency of the MLE.

Proof (following Amemiya, pp. 143–144):
• $\hat\theta_n^{MLE}$ satisfies the FOC of the MLE problem:
\[
0 = \frac{\partial\log L(\theta|X_n)}{\partial\theta}\Big|_{\theta=\hat\theta_n^{MLE}}.
\]
• Using the mean value theorem (with $\theta_n^*$ denoting the mean value):
\[
0 = \frac{\partial\log L(\theta|X_n)}{\partial\theta}\Big|_{\theta=\theta_0} + \frac{\partial^2\log L(\theta|X_n)}{\partial\theta^2}\Big|_{\theta=\theta_n^*}\big(\hat\theta_n^{MLE}-\theta_0\big)
\]
\[
\implies \sqrt{n}\big(\hat\theta_n^{MLE}-\theta_0\big)
= \frac{-\sqrt{n}\cdot\frac{1}{n}\sum_i \frac{\partial\log f(x_i|\theta)}{\partial\theta}\big|_{\theta=\theta_0}}
{\frac{1}{n}\sum_i \frac{\partial^2\log f(x_i|\theta)}{\partial\theta^2}\big|_{\theta=\theta_n^*}}. \tag{**}
\]
• Note that, by the LLN,
\[
\frac{1}{n}\sum_i \frac{\partial\log f(x_i|\theta)}{\partial\theta}\Big|_{\theta=\theta_0}
\xrightarrow{p} E_{\theta_0}\Big[\frac{\partial\log f(X|\theta)}{\partial\theta}\Big|_{\theta=\theta_0}\Big]
= \int \frac{\partial f(x|\theta)}{\partial\theta}\Big|_{\theta=\theta_0}\,dx.
\]
Using the same argument as in the information inequality result above, the last term is
\[
\int \frac{\partial f}{\partial\theta}\,dx = \frac{\partial}{\partial\theta}\int f\,dx = 0.
\]
• Hence the CLT can be applied to the numerator of (**):
\[
\text{numerator of (**)} \xrightarrow{d} N\bigg(0,\ E_{\theta_0}\Big[\Big(\frac{\partial\log f(x_i|\theta)}{\partial\theta}\Big|_{\theta=\theta_0}\Big)^2\Big]\bigg).
\]
• By the LLN and the uniform convergence of the Hessian term:
\[
\text{denominator of (**)} \xrightarrow{p} E_{\theta_0}\Big[\frac{\partial^2\log f(X|\theta)}{\partial\theta^2}\Big|_{\theta=\theta_0}\Big].
\]
• Hence, by the Slutsky theorem:
\[
\sqrt{n}\big(\hat\theta_n-\theta_0\big) \xrightarrow{d}
N\Bigg(0,\ \frac{E_{\theta_0}\Big[\Big(\frac{\partial\log f(x_i|\theta)}{\partial\theta}\big|_{\theta_0}\Big)^2\Big]}{\Big(E_{\theta_0}\Big[\frac{\partial^2\log f(X|\theta)}{\partial\theta^2}\big|_{\theta_0}\Big]\Big)^2}\Bigg).
\]
• By the information inequality,
\[
E_{\theta_0}\Big[\Big(\frac{\partial\log f(x_i|\theta)}{\partial\theta}\Big|_{\theta_0}\Big)^2\Big] = -E_{\theta_0}\Big[\frac{\partial^2\log f(X|\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big],
\]
so that
\[
\sqrt{n}\big(\hat\theta_n-\theta_0\big) \xrightarrow{d}
N\Bigg(0,\ \frac{1}{E_{\theta_0}\Big[\Big(\frac{\partial\log f(x_i|\theta)}{\partial\theta}\big|_{\theta_0}\Big)^2\Big]}\Bigg),
\]
so that the asymptotic variance is the CRLB. Hence the asymptotic approximation to the finite-sample distribution is
\[
\hat\theta_n^{MLE} \overset{a}{\sim}
N\Bigg(\theta_0,\ \frac{1}{n}\cdot\frac{1}{E_{\theta_0}\Big[\Big(\frac{\partial\log f(x_i|\theta)}{\partial\theta}\big|_{\theta_0}\Big)^2\Big]}\Bigg).
\]
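In practice the information in the final approximation is unknown and is estimated at $\hat\theta_n$. The sketch below (my own illustration, going slightly beyond the notes; it again assumes the exponential-rate model with arbitrary $\theta_0$ and $n$) forms an approximate standard error two ways, using the average squared score and using minus the average Hessian, which the information equality says should agree.

```python
import numpy as np

# Using theta_hat ~a N(theta0, 1/(n*I(theta0))): estimate the Fisher information
# at the MLE either by the average squared score or by minus the average Hessian
# (information equality), and form an approximate standard error.
# Assumed model for illustration: x_i ~ Exponential with rate theta,
# log f(x|theta) = log(theta) - theta*x, so the MLE is theta_hat = 1/xbar.
rng = np.random.default_rng(0)
theta0, n = 1.5, 2_000                      # arbitrary illustration values
x = rng.exponential(1.0 / theta0, size=n)

theta_hat = 1.0 / x.mean()
score = 1.0 / theta_hat - x                 # d/dtheta log f(x_i|theta) at theta_hat
hess = -1.0 / theta_hat ** 2 * np.ones(n)   # d^2/dtheta^2 log f(x_i|theta) at theta_hat

info_score = np.mean(score ** 2)            # average-squared-score estimate of I
info_hess = -np.mean(hess)                  # negative-Hessian estimate of I
print(f"theta_hat = {theta_hat:.4f}")
print(f"se via squared scores   = {np.sqrt(1.0 / (n * info_score)):.4f}")
print(f"se via negative Hessian = {np.sqrt(1.0 / (n * info_hess)):.4f}")
print(f"closed form theta_hat/sqrt(n) = {theta_hat / np.sqrt(n):.4f}")
```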