ADAPTING FOR HETEROSCEDASTICITY IN REGRESSION MODELS

Raymond J. Carroll (1) and David Ruppert (2)
University of North Carolina
and
Leonard A. Stefanski
North Carolina State University

(1) Research supported by the Air Force Office of Scientific Research, AFOSR-F-49620-85-C-0144.
(2) Research supported by the National Science Foundation, MCS-8100748.

ABSTRACT

We investigate the limiting behavior of a class of one-step M-estimators in heteroscedastic regression models. The mean function is assumed to be known up to parameters, but the variance function is considered an unknown function of a p-dimensional vector. The variance function is to be estimated nonparametrically by a function of the absolute residuals from the current fit to the mean. Under a variety of conditions, we discuss when the estimates adapt for scale, i.e., estimate the regression parameter just as well as if the scale function were known. Connections with the theory of optimal semiparametric estimation are made.

Key Words and Phrases: Heteroscedasticity, adaptation, nonparametric regression, M-estimator, robustness, generalized least squares.

INTRODUCTION

We study aspects of the effect of estimating weights in a generalization of the heteroscedastic regression model considered by Carroll & Ruppert (1982a). The observations are (y_i, x_i) for i = 1, ..., N, where y_i is the response and x_i is the vector of predictors. Let L(y|x) and S(y|x) define location and scale for the distribution of y given x. For example, L(y|x) could be the mean of y given x, while S(y|x) could be the standard deviation. The model is specified conditionally on the x_i by

(1.1)    L(y|x) = f(x,β) ;

(1.2)    S^2(y|x) = v_0(x) = v(x,θ) .

Throughout it will be understood that S(y|x) depends on a p-dimensional subvector of x or on a p-dimensional predictor z. If L is expectation and S is standard deviation, then (1.1)-(1.2) specify the usual heteroscedastic regression model. In (1.2), the vector parameter θ is typically unknown.

Carroll & Ruppert (1982a) consider a class of M-estimators for the regression parameter β. Let θ̂ be any N^{1/2}-consistent estimator of θ, let ψ be an odd function with derivative ψ_1, let f_β be the derivative of f with respect to β, and let v_θ be the derivative of v with respect to θ. Define

(1.3)    ε = y − f(x,β) ;   e(β,θ) = ε / v^{1/2}(x,θ) ;   z(β,θ) = −(d/dβ) e(β,θ) .

We assume that f and v are such that the following limit result holds for some positive definite covariance matrix W_1:

(1.4)    N^{−1/2} Σ_{i=1}^{N} z_i(β,θ) ψ{ e_i(β,θ) }  ⇒  Normal(0, W_1) .

Define β̂(θ) to be a solution in β of the estimating equation

(1.5)    N^{−1/2} Σ_{i=1}^{N} z_i(β,θ) ψ{ e_i(β,θ) } = 0 .

If for some positive definite matrix W_2

(1.6)    N^{−1} Σ_{i=1}^{N} z_i(β,θ) ψ_1{ e_i(β,θ) } z_i(β,θ)^T  →_p  W_2 ,

then we would have the asymptotic limit result

(1.7)    N^{1/2} { β̂(θ) − β }  ⇒  Normal( 0, W_2^{−1} W_1 W_2^{−T} ) .

In most applications θ is unknown, and thus β̂(θ) is also unknown. A natural estimator of β is then β̂(θ̂), where θ̂ is an N^{1/2}-consistent estimator of θ. A major question is to find conditions under which β̂(·) adapts for estimating θ, i.e., β̂(θ̂) and β̂(θ) have the same limiting distribution (1.7). This has important statistical consequences, because if adaptation for θ occurs, then at least for large samples we can pretend that θ is known and use standard methods of inference. For example, if ψ(w) = w in (1.4), then (1.1) and (1.2) generally imply that the solution to (1.5) is a generalized least squares estimator, computed by a weighted least squares algorithm with weights the inverse of v(x_i,θ). When adaptation occurs, inference can be made as if the weights were known, at least in the limit.
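To fix ideas, here is a minimal simulation sketch of the model (1.1)-(1.2) and of the weighted (generalized) least squares estimator with known weights, i.e., the ψ(w) = w case just described. The particular mean function, variance function, and sample size are hypothetical choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical heteroscedastic linear model: f(x, beta) = beta0 + beta1 * x,
# with variance function v(x, theta) = exp(theta * x) assumed known here.
N = 500
beta_true = np.array([1.0, 2.0])
theta_true = 1.5
x = rng.uniform(0.0, 2.0, N)
v = np.exp(theta_true * x)                       # v(x, theta) = S^2(y|x)
y = beta_true[0] + beta_true[1] * x + np.sqrt(v) * rng.standard_normal(N)

X = np.column_stack([np.ones(N), x])             # design matrix for f(x, beta)

# Generalized (weighted) least squares with weights 1 / v(x_i, theta),
# i.e. the solution of (1.5) when psi(w) = w and theta is known.
W = 1.0 / v
beta_gls = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * y))

# Ordinary least squares for comparison (ignores the heteroscedasticity).
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print("GLS estimate:", beta_gls)
print("OLS estimate:", beta_ols)
```

The rest of the paper concerns what happens when v(x,θ), or the nonparametric v_0(x), is unknown and the weights must themselves be estimated.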
For discussions which use second order calculations to understand the effect of estimating weights, see Rothenberg (1984) and Carroll, Ruppert & Wu (1986).

An easy extension of Theorem 1 in Carroll & Ruppert (1982a) states that, under regularity conditions, β̂(θ̂) has the same asymptotic normal limit distribution as if θ were known as long as

(1.8)    N^{−1} Σ_{i=1}^{N} e_i(β,θ) ψ_1{ e_i(β,θ) } z_i(β,θ) v_θ(x_i,θ) / v(x_i,θ)  →_p  0 .

For generalized least squares, ψ(w) = w and condition (1.8) is almost always satisfied. For general M-estimators, condition (1.8) essentially reduces to an assumption of symmetry of the distribution of ε_i given x_i.

In practice, the form of the location function f(x_i,β) may be easily specified, but the scale function v(x_i,θ) is less clear, especially when the dimension p of x is greater than one. There are at least three strategies for coping with this case. The first is to assume that the scale is a function of the mean response. This reduces the scale estimation problem to a single dimension. See Carroll & Ruppert (1982a) and Davidian & Carroll (1986). A second strategy is to use an empirical model for the scale function v(x,θ), for example a quadratic response surface or its square root. This approach has not been tried very often in the literature, although Box & Meyer (1985) seem to suggest the idea. A third approach, which we investigate, is to estimate the scale function nonparametrically. Carroll (1982) considered the case (1.2) with ψ the identity function. He proposed that the unknown variance function v_0 be estimated by nonparametric kernel regression on the squared residuals from an unweighted least squares fit. Though v_0 is unknown, he showed that one can estimate β by generalized least squares just as well as if v_0 were known, so that the limit result (1.7) holds. Unfortunately, while the result is interesting, the conditions of his proof are unnecessarily stringent. Subsequent to Carroll's paper, Matloff, Rose & Tai (1984) performed a simulation study, while Muller & Stadtmuller (1986) considered a fixed design. Robinson (1986) obtained Carroll's result for more than one dimension under far weaker regularity conditions. This is a nice piece of work, and we will borrow techniques from it where appropriate.

In this paper, we study adaptation in a broader framework by considering a class of one-step weighted M-estimators with weights estimated nonparametrically. The model for the means is allowed to be nonlinear. We allow a fairly broad class of smoothers, including kernel and nearest neighbor regression estimates. It is not the particular smoother that matters, but rather that the smoother satisfy certain reasonable moment conditions. In particular, robust smoothers could be used.

In the next section we introduce the class of estimators and the basic set of conditions. In sections 3 and 4 we use our work to provide examples for which adaptation occurs. In section 5, we link our work to the theory of information bounds for semiparametric models. In section 6 we address a few brief remarks to the case that the scale function is known to be dependent on the mean response. All proofs are in the Appendix.
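As an illustration of the third strategy, the following is a minimal sketch of the kind of procedure studied by Carroll (1982): fit by unweighted least squares, estimate v_0 by kernel regression of the squared residuals on x, and refit by weighted least squares. The kernel, bandwidth, and data-generating choices below are hypothetical and only meant to fix ideas; they are not prescriptions from the paper.

```python
import numpy as np

def kernel_smooth(x_eval, x, z, bandwidth):
    """Nadaraya-Watson kernel regression of z on x, evaluated at x_eval."""
    w = np.exp(-0.5 * ((x_eval[:, None] - x[None, :]) / bandwidth) ** 2)
    return (w @ z) / np.maximum(w.sum(axis=1), 1e-12)

rng = np.random.default_rng(1)
N = 400
x = rng.uniform(0.0, 2.0, N)
X = np.column_stack([np.ones(N), x])
beta_true = np.array([1.0, 2.0])
v0 = (0.5 + x) ** 2                                  # unknown variance function v_0(x)
y = X @ beta_true + np.sqrt(v0) * rng.standard_normal(N)

# Step 1: unweighted least squares fit.
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
resid2 = (y - X @ beta_ols) ** 2

# Step 2: nonparametric estimate of v_0 by smoothing squared residuals on x.
v_hat = np.maximum(kernel_smooth(x, x, resid2, bandwidth=0.2), 1e-3)

# Step 3: generalized least squares with the estimated weights.
W = 1.0 / v_hat
beta_gls = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * y))
print(beta_gls)
```

Adaptation is the claim that, under conditions, this estimator has the same limit distribution as the one that uses the true v_0.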
SECTION 2: The Estimators and Main Results

The choice of one-step weighted M-estimators avoids the necessity of specifying extraneous conditions for asymptotic normality in nonlinear regression. The conditions include those studied by Carroll (1982) and Robinson (1986) as special cases.

Let β* be any N^{1/2}-consistent estimate of β. Write

ε = ε(β) = ε(y,x,β) = y − f(x,β) ;   d(β) = d(x,β) = f_β(x,β) .

It is most compact to treat (y,x) as a random variable, the translation to the case of fixed x's being immediate. In what follows we use the notation of Pollard (1984), so that for any random variable G(y,x), P{ G(y,x) } is the expectation of G, and P_N{ G(y,x) } is the average of the values G(x_i,y_i). If g(x) is any estimate of the scaling function v_0(x), then as in Bickel (1975) a weighted one-step M-estimator of β is

(2.1)    β̂ = β* + { A_N(β*,g) }^{−1} B_N(β*,g) ,   where

         A_N(β,g) = P_N{ g(x)^{−1} d(β) d(β)^T ψ_1( ε(β) g(x)^{−1/2} ) } ,

         B_N(β,g) = P_N{ g(x)^{−1/2} d(β) ψ( ε(β) g(x)^{−1/2} ) } .

Bickel (1975) calls this a one-step estimate of Type 1. Our technical arguments are for the Type 1 estimates, but they are easily modified to apply to his Type 2 estimates. If ψ(w) = w, then (2.1) is a generalized least squares estimate with weights the inverse of g(x). Ordinarily, one first computes β̂ based on unweighted least squares residuals, updates the preliminary estimator, and recomputes β̂; this process may be repeated a few times.
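A minimal sketch of the one-step estimator (2.1) follows, assuming a Huber ψ and a kernel estimate of the scale function; the particular ψ, the smoother and bandwidth, and the constant q_N = N^{−γ} with γ = 1/4 (q_N is defined in the next paragraph) are hypothetical illustration choices.

```python
import numpy as np

def huber_psi(w, c=1.345):
    """Huber psi function and its derivative psi_1."""
    psi = np.clip(w, -c, c)
    psi1 = (np.abs(w) <= c).astype(float)
    return psi, psi1

def one_step_weighted_m(y, X, beta_star, g):
    """One-step weighted M-estimator (2.1) for a linear mean f(x, beta) = X beta.

    g : estimated scale function v_0 evaluated at the design points.
    """
    eps = y - X @ beta_star                      # residuals epsilon(beta*)
    r = eps / np.sqrt(g)                         # standardized residuals
    psi, psi1 = huber_psi(r)
    # A_N(beta*, g) = P_N{ g^{-1} d d^T psi_1(eps g^{-1/2}) }
    A = (X * (psi1 / g)[:, None]).T @ X / len(y)
    # B_N(beta*, g) = P_N{ g^{-1/2} d psi(eps g^{-1/2}) }
    B = (X * (psi / np.sqrt(g))[:, None]).sum(axis=0) / len(y)
    return beta_star + np.linalg.solve(A, B)

# Hypothetical use: beta* from unweighted least squares, g from a kernel smooth
# of squared residuals plus q_N = N^{-1/4}.
rng = np.random.default_rng(2)
N = 300
x = rng.uniform(0, 2, N)
X = np.column_stack([np.ones(N), x])
v0 = (0.5 + x) ** 2
y = X @ np.array([1.0, 2.0]) + np.sqrt(v0) * rng.standard_normal(N)

beta_star = np.linalg.lstsq(X, y, rcond=None)[0]
resid2 = (y - X @ beta_star) ** 2
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.2) ** 2)
g_hat = (K @ resid2) / K.sum(axis=1) + N ** (-0.25)   # smoother plus q_N
print(one_step_weighted_m(y, X, beta_star, g_hat))
```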
Let q_N = N^{−γ} for a sufficiently small positive γ. Let ĝ(x) equal q_N plus an estimate of v_0(x) based on the residuals ε_i(β*). The addition of the small amount q_N avoids problems with infinite weights. Let v̂_0(x) be an estimate of v_0(x) based on the true unknown errors ε_i(β), and let v̂(x) = v̂_0(x) + q_N. In this and the next section, rather than adding q_N to the smoothers, we could without change have taken the maximum of q_N and the smoother. The key conditions on these estimates are as follows.

(A.1)    infimum_x v_0(x) > 0 .

(A.2)    N^{a(1)−δ} P_N{ v̂_0(x) − v_0(x) }^2 →_p 0 for all δ > 0, where a(1) = 4/(4+p) .

(A.3)    N^{a(2)−δ} P_N{ ĝ(x) − v_0(x) }^2 →_p 0 for all δ > 0, where a(2) > 0 .

Assumption (A.2) holds for kernel and nearest neighbor estimates under appropriate conditions; see section 4. The constant a(1) is the optimal rate of convergence for nonparametric regression estimates, although slower rates could have been used. We are not restricting attention to smoothing only squared residuals. Condition (A.3) is often easy to check. For instance, for kernel and linear smoothers such as nearest neighbor estimates having "bandwidth" not depending on the responses, a(2) ≥ 1 > a(1) at least under minimal regularity conditions, and in some instances a(2) > 1. It is easy to develop conditions for these to hold in the case of a linear smoother applied to squared residuals, but in the interest of space we forego the details.

If x has compact support, then under additional regularity conditions the convergence in (A.2)-(A.3) is uniform in x; e.g., in (A.2),

sup_x | v̂_0(x) − v_0(x) | →_p 0 .

See section 4 for a further discussion. In that case the estimator would be constructed as in Bickel (1975), we could take q_N = 0, and ψ would not need to be differentiable. Where possible, we wish to avoid the assumption of compactness. If the x's are not confined to a compact set, the convergence of ĝ and v̂_0 is not uniform and we must assume more smoothness for ψ. The following conditions are reasonable, most of them trivial in linear regression. We use the notation || · || for the Euclidean norm of the argument.

(A.4)    ψ_1 exists and is continuous, and for constants c_ψ and M_ψ(1), M_ψ(2) ≤ ∞,

         | ψ(a) − ψ(b) | ≤ c_ψ minimum( M_ψ(1), |a − b| ) ;   | ψ_1(a) − ψ_1(b) | ≤ c_ψ minimum( M_ψ(2), |a − b| ) .

Conditions (A.5)-(A.12) are moment and local-smoothness requirements on d, f and ψ. They include finite moments of the form P{ ||d(β)||^2 ψ_1^2( e(β) ) } < ∞ and P{ ||d(β)||^2 [ 1 + |e(β)| I( M_ψ(1) = ∞ ) ]^2 } < ∞; a fourth-moment condition on f_ββ(x,β); bounds on P[ sup ||d(β+Δ_1) − d(β+Δ_2)||^k ] over ||Δ_1||, ||Δ_2|| ≤ M N^{−1/2}, for k = 1, 2 and all M > 0; and local equicontinuity, in the same neighborhoods, of G(Δ_1,Δ_2) = d(β+Δ_1) d(β+Δ_1)^T ψ_1{ ε(β+Δ_1) / Δ_2^{1/2} } and of averages of sup ||f_β(x,β)||^{2k} s^{2j}( e(β) ) for (k = 1, 2; j = 0, 1; s = ψ) or (k = 1, j = 1, s = ψ_1).

(A.13)   Write H_{k,j,n}(β,x) = ε^j(β) || (∂^n/∂β^n) f(x,β) ||^k / v_0(x). If M_ψ(2) = ∞ in (A.4), then P{ H_{k,j,n}(β,x) } < ∞ for n = 2, j = 2, k = 1; if M_ψ(2) < ∞ in (A.4), then P{ H_{k,j,n}(β,x) } < ∞ for n = 2, k = 1, j = 0.

For linear regression and generalized least squares, (A.4)-(A.13) simplify to ||x||^4, ε^4(β) and ||x||^2 v_0(x) having finite expectations.

The first step is a preliminary expansion for β̂.

LEMMA 1: Under the conditions (1.4), (1.6) and (A.1)-(A.13),

(2.2)    N^{1/2}( β̂ − β ) = { P[ v_0(x)^{−1} d(β) d(β)^T ψ_1( ε(β) v_0(x)^{−1/2} ) ] }^{−1} N^{1/2} P_N{ ĝ(x)^{−1/2} d(β) ψ( ε(β) ĝ(x)^{−1/2} ) } + o_p(1) .

In principle, we can obtain the asymptotic distribution of β̂ by direct methods using (2.2). However, sometimes it is possible to replace ĝ in (2.2) by v̂, the estimator of v_0 based only on the errors ε_i(β).

THEOREM 1: Suppose that a(2) > 1.0 in assumption (A.3) and that either of the following holds:

(B.1)    P{ ε^2(β) ||d(β)||^2 } < ∞ ;

(B.2)    for some q(1) and q(2),
         N^{q(1)} sup_x P[ { ĝ(x) − v̂(x) }^2 / v_0(x) ] → 0 , and
         N^{q(2)} sup_x P[ | H(Δ_1,Δ_2) − H(Δ_2,0) | ] → 0 for ||Δ_1|| ≤ M N^{−q(1)} and ||Δ_2|| ≤ M N^{−q(2)} ,
         where H(Δ_1,Δ_2) = ψ( ε(β) / ( v_0(x) + Δ_1 + Δ_2 )^{1/2} ) .

Then, under the conditions of LEMMA 1,

(2.3)    the expansion (2.2) continues to hold with ĝ(x) replaced by v̂(x).

COROLLARY 1: In the definition of ĝ, do not add the positive constant q_N. Suppose that { ĝ − v_0 } and { v̂ − v_0 } converge in probability to zero uniformly at the rate N^{−γ}. Then the conclusion to THEOREM 1 still holds.

SECTION 3: M-estimators in the Symmetric Case

If, given x_i, ε_i(β) = y_i − f(x_i,β) is symmetrically distributed about zero, then M-estimators adapt for heteroscedasticity, i.e., they have the same distribution as if the scaling function v_0 were known. In the previous section we assumed only that the estimator v̂_0 was a function of the errors ε_i(β). In this section we make the further assumption that v̂_0 is an even function of these errors. In effect, when estimating the scaling function v_0 we are restricting attention to nonparametric regression estimators which are functions of the absolute residuals from the current fit β*, as did Carroll (1982) and Robinson (1986). It makes perfectly good sense to use the residuals in this way to gain understanding of the variance structure. Assumption (C.2) below is typically even weaker than (A.2), because in proving the latter one typically shows that the expectation of the left hand side of (A.2) converges to zero.
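A small Monte Carlo sketch of this adaptation claim (formalized in Theorem 2 below) may be useful: with symmetric errors, compare the sampling variability of the weighted fit that uses the true scale function with the fit that uses a nonparametric scale estimate. All numerical choices here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

def fit_wls(X, y, w):
    """Weighted least squares with weights w (the psi(t) = t case of (2.1))."""
    return np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

def kernel_smooth(x, z, bw):
    K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / bw) ** 2)
    return (K @ z) / K.sum(axis=1)

N, reps = 200, 500
slope_known, slope_est = [], []
for _ in range(reps):
    x = rng.uniform(0, 2, N)
    X = np.column_stack([np.ones(N), x])
    v0 = (0.5 + x) ** 2
    # symmetric (here normal) errors, so the symmetry assumption holds
    y = X @ np.array([1.0, 2.0]) + np.sqrt(v0) * rng.standard_normal(N)
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    v_hat = np.maximum(kernel_smooth(x, (y - X @ b_ols) ** 2, 0.25), 1e-3)
    slope_known.append(fit_wls(X, y, 1.0 / v0)[1])
    slope_est.append(fit_wls(X, y, 1.0 / v_hat)[1])

print("sd of slope, true weights:     ", np.std(slope_known))
print("sd of slope, estimated weights:", np.std(slope_est))
```

Adaptation says the two sampling distributions agree in the limit; in finite samples the estimated-weights version is typically somewhat more variable.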
THEOREM 2: Assume

(C.1)    P{ ||d(β)||^2 [ 1 + ||d(β)||^2 ψ^4( ε(β) / v_0(x)^{1/2} ) / v_0^2(x) ] } < ∞ ;

(C.2)    E[ P_N{ ( v̂_0(x) − v_0(x) )^2 / v_0^2(x) } ] → 0 ;

(C.3)    if M_ψ(1) = ∞ in (A.4), then P{ ε^4(β) / v_0^2(x) } < ∞ .

Further assume (1.4) and the conclusion to THEOREM 1. Then the estimators (2.1) adapt for heteroscedasticity and have the same limit distribution (1.7) as if the scaling function v_0 were known.

For asymmetric errors, when ψ is not the identity function, THEOREM 2 typically fails. As we indicate below, it is possible to compute the limit distribution in this case, although we will not pursue the matter in any detail. To see what happens, consider a problem of dimension p = 1. By a Taylor series and (A.2), under sufficient regularity conditions and setting q_N = 0 for convenience, we have

(3.1)    D_N = N^{1/2} P_N{ d(β) v_0(x)^{−1/2} ψ( ε(β) v_0(x)^{−1/2} ) } + N^{1/2} P_N{ G(ε,β,x) [ v̂_0(x) − v_0(x) ] } + o_p(1) ,

where G(ε,β,x) = d(β) (∂/∂w){ w^{−1/2} ψ( ε(β) w^{−1/2} ) } evaluated at w = v_0(x). As indicated in the next section, we can replace G(ε,β,x) by its conditional expectation given x, which is

G*(x) = −(1/2) v_0(x)^{−2} d(β) E[ ε(β) ψ_1{ ε(β) / v_0(x)^{1/2} } | x ] .

The second component of (3.1) has a nontrivial limit distribution. Thus, the limit distribution of β̂ typically will not be the same as if v_0 were known.

SECTION 4: Adaptation for Generalized Least Squares

If ψ(w) = w, the estimator (2.1) is a generalized least squares estimator. Under weak conditions, Robinson (1986) proves adaptation in linear regression using a variant of the nearest neighbor device, the smoother being applied to squared residuals. His ĝ(x_i) does not use the ith residual y_i − f(x_i,β*). His proof is easily extended to nonlinear regression, as we now indicate. First consider linear smoothers based on squared residuals:

(4.1)    ĝ(x) = q_N + N^{−1} Σ_{i=1}^{N} c_N(x_i,x) ε_i^2(β*) ;

(4.2)    N^{−1} Σ_{i=1}^{N} c_N(x_i,x) = 1 for all x .

Define the expectation of v̂_0 given the design as

(4.3)    v*(x) = N^{−1} Σ_{i=1}^{N} c_N(x_i,x) E{ ε_i^2(β) | x_i } .

LEMMA 1 and THEOREM 1 still hold if we replace v_0(x) in (A.2) by v*(x). The modified (A.2) is easy to verify under weak conditions. Following Robinson (1986), choose the weight function c_N(x,z) to be of nearest neighbor type with c_N(x,x) = 0. THEOREM 1 eliminates the nonlinear regression function, and we can apply Robinson's proof virtually without change, the only complication being the addition of q_N.

Here is a second application of THEOREM 1. Suppose the support of x is compact. If the variance is to be modeled nonparametrically as a function of p ≤ 3 predictors, then we can obtain an adaptation result based on the usual nearest neighbor regression estimators which use the ith residual in computing an estimate of v_0(x), i.e., c_N(x,x) ≠ 0. For reasons of space and interest we forego the details.

Some background is useful. Uniform convergence of nonparametric regression estimators when x has compact support has been proved for kernel estimators by Collomb & Haerdle (1986) and for nearest neighbor estimators by Devroye (1978). Applying these results to our problem does not yield the weakest conditions, as the results assume that the marginal distribution of x is very smooth. Average mean squared error convergence as in (A.2) has been discussed by Marron & Haerdle (1986) for kernel estimators with x compactly supported and having a smooth distribution. Results of Mack (1981) can be used to prove (A.2) for nearest neighbor estimators under similar conditions.
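A minimal sketch of a linear smoother of the form (4.1) follows, using k-nearest-neighbor weights that sum to one and give no weight to the ith point itself (c_N(x,x) = 0). The neighborhood size k, the choice of q_N, and the data are hypothetical.

```python
import numpy as np

def nn_variance_estimate(x, resid, k=20, q_N=0.0):
    """Leave-one-out k-nearest-neighbor smooth of squared residuals,
    a linear smoother of the form (4.1) with weights summing to one."""
    n = len(x)
    g = np.empty(n)
    r2 = resid ** 2
    for i in range(n):
        d = np.abs(x - x[i])
        d[i] = np.inf                      # c_N(x_i, x_i) = 0: omit own residual
        nbrs = np.argsort(d)[:k]           # k nearest neighbors in the design
        g[i] = r2[nbrs].mean() + q_N       # equal weights 1/k, plus q_N
    return g

# Hypothetical use with residuals from a preliminary unweighted fit:
rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 2, 200))
X = np.column_stack([np.ones_like(x), x])
y = X @ np.array([1.0, 2.0]) + (0.5 + x) * rng.standard_normal(len(x))
b0 = np.linalg.lstsq(X, y, rcond=None)[0]
g_hat = nn_variance_estimate(x, y - X @ b0, k=20, q_N=len(x) ** -0.25)
w = 1.0 / g_hat
print(np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y)))
```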
For nearest neighbor estimators it is far easier to prove (A.2) with v_0 replaced by v*, and then to note that condition (D.1) below is the same as Robinson's Lemma 8, which is a powerful result.

Here is a third application of THEOREM 1. One difficulty with using squared residuals to estimate the scaling function is the tendency for a few wildly aberrant values to distort the picture; see, for example, Figure 2 in Hinkley (1985). We have found it more useful to smooth absolute rather than squared residuals, a robust smoother being even better. Adaptation holds when absolute rather than squared residuals are smoothed. We give one set of strong conditions. Set c_N(x,x) = 0, define s(x,β) = E{ |ε(β)| given x }, and consider the smoothers

ĝ(x)^{1/2} = q_N + N^{−1} Σ_{i=1}^{N} c_N(x_i,x) | ε_i(β*) | ;   v*(x)^{1/2} = N^{−1} Σ_{i=1}^{N} c_N(x_i,x) s(x_i,β) .

THEOREM 3: Consider linear regression in which the support of x is bounded; since this is linear regression, β has dimension p. Assume that the fourth moment of ε(β) given x is bounded and that s(x,β) is strictly bounded away from zero. Suppose further that (A.4) through (A.13) hold and that

(D.1)    P[ P_N{ | v*(x)^{1/2} − v_0(x)^{1/2} |^2 } ] → 0 ;

(D.2)    β* is restricted to the set of values (k N^{1/2})^{−1} [ i_1, ..., i_p ], where i_1, ..., i_p are integers;

(D.3)    v̂(x)^{1/2} − v*(x)^{1/2} satisfies (A.2);

(D.4)    a rate condition on the nearest neighbor smoother, which holds if the variance is a function of three or fewer predictors;

(D.5)    N^{1/2} P_N{ d(β) ε(β) v*(x)^{−1/2} [ v̂(x)^{−1/2} − v*(x)^{−1/2} ] } →_p 0 .

Then the limit distribution of generalized least squares is the same as if ĝ were replaced by v_0.

Condition (D.1) is the same as Robinson's key and powerful Lemma 8. Assumption (D.5) is essentially Robinson's key Proposition 2, substituting our |ε(β)| for his ε^2(β) and our d(β)/v*(x)^{1/2} for his d(β).
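A minimal sketch of scale estimation from absolute residuals with a robust smoother, in the spirit of the estimators discussed around THEOREM 3: a leave-one-out running median of |residuals| over nearby design points, rescaled and squared to give a variance estimate. The window size, the normal-errors rescaling constant 0.6745, and the data (including one deliberately aberrant response) are hypothetical illustration choices, and the median smoother is a robust substitute for the linear smoother in the theorem, not the theorem's own estimator.

```python
import numpy as np

def robust_scale_estimate(x, resid, k=15, q_N=0.0):
    """Leave-one-out running median of |residuals| over the k nearest design
    points, rescaled (assuming roughly normal errors) and squared."""
    n = len(x)
    root_g = np.empty(n)
    a = np.abs(resid)
    for i in range(n):
        d = np.abs(x - x[i])
        d[i] = np.inf                                    # c_N(x_i, x_i) = 0
        nbrs = np.argsort(d)[:k]
        root_g[i] = np.median(a[nbrs]) / 0.6745 + q_N    # median(|eps|) ~ 0.6745 * sd
    return root_g ** 2

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 2, 150))
X = np.column_stack([np.ones_like(x), x])
y = X @ np.array([1.0, 2.0]) + (0.5 + x) * rng.standard_normal(len(x))
y[75] += 25.0                                  # one wild value that would distort a squared-residual smooth
b0 = np.linalg.lstsq(X, y, rcond=None)[0]
g_hat = robust_scale_estimate(x, y - X @ b0, k=15, q_N=len(x) ** -0.25)
w = 1.0 / g_hat
print(np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y)))
```

A single aberrant value inflates a local average of squared residuals far more than a local median of absolute residuals, which is the motivation given above for smoothing absolute residuals robustly.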
SECTION 5: Semiparametric Models and Information Bounds

An alternative approach is optimal estimation in semiparametric models; see Begun, et al. (1983) and Bickel (1982). The easiest context for applying this theory is the location-scale model

y_i = f(x_i,β) + v_0^{1/2}(x_i) e_i ,

where, given x_i, the { e_i } are independent and identically distributed random variables with a known density function ω(·), but the scaling function v_0(·) is unknown. Our previous results do not require that the e_i(β) be independent and identically distributed conditionally on the x_i. The usual approach in the semiparametric literature is to find out how well one can do estimating β when v_0 is unknown, i.e., to find a matrix I(ω) such that

N^{1/2}( β̃ − β )  ⇒  Normal(0, S)   implies   S ≥ I(ω)^{−1} .

The matrix I(ω) is called the semiparametric information bound. Any estimator achieving this bound is said to be asymptotically efficient. We can use the results of section 3 to construct asymptotically efficient estimators for β. Suppose that ω(·) is a symmetric density, and let

ψ(w) = −(∂/∂w) log{ ω(w) } .

Suppose further that the regression function f and the nonparametric regression estimators (ĝ, v̂_0) satisfy (A.1)-(A.13), one of (B.1)-(B.2) with a(2) > 1.0, and (C.1)-(C.3). Then our estimators (2.1) are asymptotically efficient in the semiparametric sense, because they have the same limit distribution as maximum likelihood with v_0 known. As an added benefit of THEOREM 2, even if the errors are incorrectly specified but still symmetric, our estimators will behave as if v_0 were known.

That the information bound is the same as if v_0 were known is an easy informal calculation using the theory of Begun, et al. (1983). To apply their theory, set v_0(x) = σ^2 v*(x), where v* is a density with respect to a dominating measure; the calculations are immediate. It is also clear that if one is willing to assume symmetric, independent and identically distributed errors, then the information bound does not change even if the density ω(·) is unknown. It should be possible to construct asymptotically efficient estimators in this case; this program has already been carried out by Bickel (1982) for homoscedastic regression. One can also contemplate ω(·) being known and asymmetric, or unknown and possibly asymmetric, but we leave this problem to others.

SECTION 6: Variance a Function of the Mean

Often, the variance can be modeled as a function of the mean response. Assume that the data are normally distributed with mean f(x,β) and variance modeled parametrically as σ^2 v_m( f(x,β), θ ). It is well known that generalized least squares estimates of β have the same limit distribution (1.7) as weighted least squares with known weights. Jobson & Fuller (1980) showed that the normal theory maximum likelihood estimate of β is asymptotically more efficient than generalized least squares. However, the maximum likelihood estimate has some disadvantages. Generalized least squares has the robustness property that its asymptotic distribution at the normal model is not dependent on assumptions about higher moments; McCullagh (1983) shows that this is not the case for the maximum likelihood estimate. It does not take a particularly nasty distribution before the maximum likelihood estimate becomes less efficient than generalized least squares. Carroll & Ruppert (1982b) note that the influence function of the maximum likelihood estimate is quadratic in the errors and that, unlike generalized least squares, the maximum likelihood estimate is not robust to a small misspecification of the variance function. We have seen no real data examples where even the potential gain in efficiency for maximum likelihood is over 30%, and it is our belief that "asymptopia" occurs for the maximum likelihood estimate only for much larger sample sizes than are necessary for generalized least squares.

With this background in mind, let us turn to the semiparametric regression model with scale depending on the mean. Referring for notation to (1.3), assume that the errors satisfy

(6.1)    ε_i(β) = v_m^{1/2}{ f(x_i,β) } e_i ,   equivalently   e_i = ε_i(β) / v_m^{1/2}{ f(x_i,β) } ,

where, given x_i, the { e_i } are independent and identically distributed. Each of the two ways of writing the errors in (6.1) has an interesting interpretation. The first indicates that the variances are not constant, so that we could apply the techniques of sections 2-4; this is almost certain to be inefficient if the dimension of x_i is of any size. The second way of writing the model suggests a variant of generalized least squares: take the squared residuals from the current fit and regress them nonparametrically on the predicted values from the current fit. For symmetric errors, linear regression, and with stringent regularity conditions, Carroll (1982) showed that this form of generalized least squares has the limit distribution (1.7). It would be interesting to improve upon Carroll's result, but we leave this to another time.
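A minimal sketch of that variant of generalized least squares, assuming the squared residuals are smoothed against the fitted values with a kernel regression; the smoother, bandwidth, number of iterations, and simulated data are hypothetical choices.

```python
import numpy as np

def kernel_smooth(t_eval, t, z, bw):
    K = np.exp(-0.5 * ((t_eval[:, None] - t[None, :]) / bw) ** 2)
    return (K @ z) / K.sum(axis=1)

def gls_variance_from_mean(y, X, n_iter=3, bw=0.3):
    """Generalized least squares with the variance estimated nonparametrically
    as a function of the mean: smooth squared residuals on fitted values."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]     # start from unweighted fit
    for _ in range(n_iter):
        mu = X @ beta                               # current predicted values
        v_hat = kernel_smooth(mu, mu, (y - mu) ** 2, bw)
        w = 1.0 / np.maximum(v_hat, 1e-3)
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return beta

# Hypothetical example with standard deviation proportional to the mean:
rng = np.random.default_rng(5)
x = rng.uniform(1, 3, 300)
X = np.column_stack([np.ones_like(x), x])
mu = X @ np.array([0.5, 1.5])
y = mu + 0.3 * mu * rng.standard_normal(len(x))
print(gls_variance_from_mean(y, X))
```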
The second form of (6.1) also suggests a semiparametric model. We consider only the case that the errors in (6.1) have known symmetric density ω(·). As in the parametric case, it does not follow that, having found an asymptotically efficient semiparametric estimator, one should use it. Our calculations outlined below indicate that for normally distributed data the efficient semiparametric estimator will suffer the same drawbacks as the parametric maximum likelihood estimators: the efficient influence function is quadratic in the errors and hence will be sensitive to nonnormality. We conjecture that there is an analogue to the Carroll & Ruppert (1982b) result, i.e., that the estimators will also be affected by small misspecifications of the variance model. It is not clear that the increase in efficiency at the model can often be achieved in reasonable size samples and for realistic problems.

The calculations we give here are informal, meant to indicate the form of the efficient influence function. We assume that the mean function is that of the linear model, so that μ = x^T β. Define the auxiliary functions q(μ) = (1/2) (∂/∂μ) ln{ v_m*(μ) } and SC( e(β) ) = −(∂/∂t) ln{ ω(t) } evaluated at t = e(β); SC(·) is the usual score function in the univariate case. The derivative of the loglikelihood with respect to β is

(6.2)    −q(μ) x { 1 + e(β) SC( e(β) ) } − x SC( e(β) ) / { σ^2 v_m*(μ) }^{1/2} ,

and we also write

(6.3)    t(μ) = E{ x | x^T β = μ } .

Using the notation of Begun, et al. (1983), if the density function of the data is f_d, the local derivative of the function v_m* takes the form

(6.4)    2 ( A_1 b_1 ) / f_d^{1/2} ;

this local derivative depends on the data only through μ and |e(β)|. As in Begun, et al. (1983) or Bickel & Ritov (1986), the efficient score function is defined by

(6.5)    the score (6.2) minus a correction term d ,

where d is to be chosen so that the expectation of the product of (6.5) and (6.4) is zero. Since the relevant correction is itself a function only of μ and |e(β)|, it suffices to set d = 0. Thus, the efficient score function is

(6.6)    −q(μ) { x − E[x|μ] } { 1 + e(β) SC( e(β) ) } − x SC( e(β) ) / { σ^2 v_m*(μ) }^{1/2} .

The information bound I(ω) is the covariance matrix of the efficient score, which is easy to compute because the two components of (6.6) are uncorrelated by symmetry. The covariance matrix of the second term in (6.6) is the information bound of section 5, where one did not know that the variance was a function of the mean. This means that there is extra information available when one knows that the variance is a function of the mean response; the only exception is when x = E[x|μ]. Since the efficient semiparametric score differs from the usual score (6.2), this is a case where there is a loss of asymptotic efficiency for not knowing the form of the variance function.

For normally distributed errors, (6.6) shows that the efficient semiparametric score function is quadratic in the errors, not linear as is the case for generalized least squares. The same thing happens in the parametric case, and the same concerns we have raised previously apply. The possible gains in efficiency are less than in the parametric case. It is not clear to us that, when one is assuming ω(·) known, one should try to extract this extra information from the data.

We can carry out the same informal calculations when the density ω(·) of the errors is symmetric and the density s(·) of the design is unknown. The local derivatives for ω and s are (respectively)

2 ( A_2 b_2 ) / f_d^{1/2} = 2 b_2( e(β) ) / ω^{1/2}( e(β) )   and   2 ( A_3 b_3 ) / f_d^{1/2} = 2 b_3( x ) / s^{1/2}( x ) .

These are orthogonal to (6.6), so the information bound and the efficient score function are the same as if ω and s were known.
While one can construct uniformly efficient estimates of β when the errors are symmetric, it would be more interesting to know whether such estimates are actually efficient in samples of small to moderate size.

REFERENCES

Begun, J. M., Hall, W. J., Huang, W. M. & Wellner, J. A. (1983). Information and asymptotic efficiency in parametric-nonparametric models. Annals of Statistics 11, 432-452.

Bickel, P. J. (1975). One-step Huber estimates in the linear model. Journal of the American Statistical Association 70, 428-436.

Bickel, P. J. (1982). On adaptive estimation. Annals of Statistics 10, 647-671.

Bickel, P. J. & Ritov, Y. (1986). Efficient estimation in the errors in variables model. Annals of Statistics 14, 000-000.

Box, G. E. P. & Meyer, R. D. (1985). Dispersion effects from fractional designs. Technometrics 28, 19-28.

Carroll, R. J. (1982). Adapting for heteroscedasticity in linear models. Annals of Statistics 10, 1224-1233.

Carroll, R. J. & Ruppert, D. (1982a). A comparison between maximum likelihood and generalized least squares in a heteroscedastic linear model. Journal of the American Statistical Association 77, 878-882.

Carroll, R. J. & Ruppert, D. (1982b). Robust estimation in heteroscedastic linear models. Annals of Statistics 10, 429-441.

Carroll, R. J., Ruppert, D. & Wu, C. F. J. (1986). Variance expansions and the bootstrap in generalized least squares. Preprint.

Collomb, G. & Haerdle, W. (1986). Strong uniform convergence rates in robust nonparametric time series analysis and prediction: kernel regression estimation from dependent observations. Stochastic Processes and Their Applications, to appear.

Davidian, M. & Carroll, R. J. (1986). An asymptotic theory of variance function estimators. Preprint.

Devroye, L. (1978). The uniform convergence of nearest neighbor regression function estimators and their application in optimization. IEEE Transactions on Information Theory IT-24, 142-151.

Hinkley, D. V. (1985). Transformation diagnostics for linear models. Biometrika 72, 487-496.

Jobson, J. D. & Fuller, W. A. (1980). Least squares estimation when the covariance matrix and parameter vector are functionally related. Journal of the American Statistical Association 75, 176-181.

Mack, Y. P. (1981). Local properties of k-NN regression estimators. SIAM Journal on Algebraic and Discrete Methods 2, 311-323.

Marron, J. S. & Haerdle, W. (1986). Random approximations to some measures of accuracy in nonparametric curve estimation. Journal of Multivariate Analysis, to appear.

Matloff, N., Rose, R. & Tai, R. (1984). A comparison of two methods for estimating optimal weights in regression analysis. Journal of Statistical Computation and Simulation 19, 265-274.

McCullagh, P. (1983). Quasi-likelihood functions. Annals of Statistics 11, 59-67.

Muller, H. G. & Stadtmuller, U. (1986). Estimation of heteroscedasticity in regression analysis. Preprint.

Pollard, D. (1984). Convergence of Stochastic Processes. Springer-Verlag, New York.

Robinson, P. M. (1986). Asymptotically efficient estimation in the presence of heteroscedasticity of unknown form. Econometrica, 000-000.

Rothenberg, T. J. (1984). Approximate normality of generalized least squares estimates. Econometrica 52, 811-825.

APPENDIX

Let h(β,v) = d(x,β)/v(x)^{1/2} and r(β,v) = ε(β)/v(x)^{1/2}. The proof of LEMMA 1 is accomplished through a series of propositions. Throughout, T_N is used as generic notation for the random variable currently under discussion.
PROPOSITION 1: As N → ∞, T_N →_p 0, where

T_N = N^{1/2} P_N{ h(β*,ĝ) [ h(β_p,ĝ) ψ_1( r(β_p,ĝ) ) − h(β*,ĝ) ψ_1( r(β*,ĝ) ) ] }

and β_p lies on the line connecting β* and β.

Proof: By a Taylor series in β, with β_p on the line connecting β* and β and N^{1/2}( β_p − β ) = O_p(1), assumption (A.4) reduces the claim to two convergence statements, (7.1), which controls the terms involving the difference d(β*) − d(β_p), and (7.2), which controls the remaining terms. Since ĝ(x) ≥ q_N and γ is small, (7.2) follows from the Cauchy-Schwarz inequality, and (7.1) follows from (A.7) and (A.8).

PROPOSITION 2: Let s(x,ε) be any positive function with P{ s^2(x,ε) } < ∞. If γ is small, then

T_N = P_N{ s(x,ε) ĝ(x)^{−2} } = O_p(1) .

Proof: Since P_N{ s(x,ε) / v_0^2(x) } = O_p(1), it suffices that P_N{ s(x,ε) | ĝ(x)^{−2} − v_0(x)^{−2} | } = o_p(1). If ĝ^{−1} and v_0^{−1} are bounded by c > 1, then

| ĝ(x)^{−2} − v_0(x)^{−2} | ≤ 2 c^2 | ĝ(x) − v_0(x) | / q_N^2 .

It thus suffices to show that q_N^{−2} P_N{ s(x,ε) | ĝ(x) − v_0(x) | } = o_p(1). Using the Cauchy-Schwarz inequality and assumptions (A.2) and (A.3), the result follows for γ sufficiently small.

PROPOSITION 3: As N → ∞, T_N →_p 0, where T_N collects the terms of the expansion involving the second derivative f_ββ.

Proof: By a two-term Taylor series of f_β, done componentwise, it suffices that

(7.3)    P_N{ f_ββ(x,β) ψ( r(β,ĝ) ) } →_p 0 ;   (7.4)    P_N{ f_βββ(x,β_p) ψ( r(β,ĝ) ) } →_p 0 .

Assumption (A.13) is sufficient to prove (7.4). For (7.3), it suffices that

(7.5)    P_N{ f_ββ(x,β) [ ψ{ r(β,ĝ) } − ψ{ r(β,v_0) } ] } →_p 0 ;

(7.6)    P_N{ f_ββ(x,β) ψ{ r(β,v_0) } [ ĝ(x) − v_0(x) ] [ ĝ(x) v_0(x) ]^{−1} } →_p 0 .

Because of (A.1), (A.9) and the Cauchy-Schwarz inequality, (7.6) follows from (7.7), a convergence statement for the normalized average of { ĝ(x) − v_0(x) }^2. By adding and subtracting v(x) = v_0(x) + q_N in the numerator of (7.7) and then applying (A.2) and (A.3), (7.7) holds as long as P_N{ ĝ(x)^{−2} } = O_p(1), which follows from PROPOSITION 2 with s(x,ε) ≡ 1.

Consider now (7.5), and first the case M_ψ(1) = ∞ in (A.4). Writing Δ_N(x) = | ĝ(x) − v_0(x) |^2 / q_N^2, the square of (7.5) is bounded by

P_N{ ĝ(x)^{−2} || f_ββ(x,β) ε(β) ||^2 } × P_N{ Δ_N(x) } = C_N^{(1)} × C_N^{(2)} .

By (A.9) and PROPOSITION 2, C_N^{(1)} = O_p(1); by (A.2)-(A.3), C_N^{(2)} →_p 0. Thus (7.5) holds when M_ψ(1) = ∞, and it suffices to consider M_ψ(1) < ∞. Let ψ_{Hc}(v) = max( −c, min(v,c) ) be the Huber ψ function. PROPOSITION 2, (A.9) and the Cauchy-Schwarz inequality combine to show that it suffices that for every c < ∞,

P_N{ ψ_{Hc}[ ε^2(β) | ĝ(x) − v_0(x) |^2 / ( v_0(x) ĝ(x) [ v_0(x)^{1/2} + ĝ(x)^{1/2} ]^2 ) ] } →_p 0 .

By (A.1), monotonicity of ψ_{Hc}, and the fact that ĝ(x) ≥ q_N, it suffices that

(7.8)    P_N{ ψ_{Hc}[ ε^2(β) Δ_N(x) ] } →_p 0 .

Fix K > 0 and decompose the left hand side of (7.8) as A_{1N} + A_{2N}, where

A_{1N} = P_N{ ψ_{Hc}[ ε^2(β) Δ_N(x) ] I( ε^2(β) ≥ K ) } ≤ c P_N{ I( ε^2(β) ≥ K ) } →_p c P{ I( ε^2(β) ≥ K ) } ,

A_{2N} = P_N{ ψ_{Hc}[ ε^2(β) Δ_N(x) ] I( ε^2(β) < K ) } ≤ K P_N{ Δ_N(x) } →_p 0 .

If we let N → ∞ and then K → ∞, (7.8) follows.

PROPOSITION 4: As N → ∞, T_N = P_N{ | ĝ(x)^{−1} − v_0(x)^{−1} |^2 } = o_p( N^{−2γ} ).

Proof: By (A.1), for constants c_1 and c_2, T_N ≤ c_1 P_N{ ĝ(x)^{−2} | ĝ(x) − v_0(x) |^2 } + c_2 P_N{ | ĝ(x) − v_0(x) |^2 }. The proof is completed by (A.2)-(A.3) and PROPOSITION 2.

PROPOSITION 5: As N → ∞, T_N = A_N(β,ĝ) − P{ h(β,v_0) h(β,v_0)^T ψ_1( r(β,v_0) ) } →_p 0.

Proof: By (A.10) and (1.6), it suffices that

P_N{ h(β,ĝ) h(β,ĝ)^T ψ_1( r(β,ĝ) ) − h(β,v_0) h(β,v_0)^T ψ_1( r(β,v_0) ) } →_p 0 .

By PROPOSITION 4 and (A.11),

P_N{ d(β) d(β)^T ψ_1( r(β,ĝ) ) [ ĝ(x)^{−1} − v_0(x)^{−1} ] } →_p 0 ,

so it suffices to prove (7.9), in which only the difference ψ_1( r(β,ĝ) ) − ψ_1( r(β,v_0) ) appears. If M_ψ(2) = ∞ in (A.4), then by (A.12) and the Cauchy-Schwarz inequality it suffices that

P_N{ | ĝ(x)^{−1/2} − v_0(x)^{−1/2} |^2 } →_p 0 ;

for a constant c, by (A.1) this last term can be bounded and handled by PROPOSITION 4. If M_ψ(2) < ∞ in (A.4), then by (A.12), to prove (7.9) it suffices to show that for every c < ∞ a truncated average converges to zero; this is (7.8), which was proved in PROPOSITION 3.

Proof of LEMMA 1: Combine PROPOSITIONS 1-5.
PROPOSITION 6: As N → ∞,

T_N = N^{1/2} P_N[ d(β) ψ( r(β,ĝ) ) { ĝ(x)^{1/2} − v̂(x)^{1/2} } / { ĝ(x)^{1/2} v̂(x)^{1/2} } ] →_p 0 .

Proof: Under (B.1) or (B.2) with a(2) > 1.0, and choosing γ small, an application of the Cauchy-Schwarz inequality shows that it suffices to verify (7.10), an O_p(1) bound which follows from (A.11).

Proof of THEOREM 1: By (1.4) and PROPOSITION 6, we must show that

T_N = N^{1/2} P_N{ h(β,v̂) [ ψ( r(β,ĝ) ) − ψ( r(β,v̂) ) ] } →_p 0 .

This is routine under (B.1) and (B.2).

Proof of THEOREM 2: It suffices that Δ_N^{(1)} →_p 0 and Δ_N^{(2)} →_p 0, where

Δ_N^{(1)} = N^{1/2} P_N{ d(β) ψ( ε(β) / v_0(x)^{1/2} ) [ v̂(x)^{−1/2} − v_0(x)^{−1/2} ] } ,

Δ_N^{(2)} = N^{1/2} P_N{ v̂(x)^{−1/2} d(β) [ ψ( ε(β) / v̂(x)^{1/2} ) − ψ( ε(β) / v_0(x)^{1/2} ) ] } .

Recall that ψ is odd. Because { ε_i(β) } has a symmetric distribution and v̂ is an even function of the errors, it follows that for j = 1, 2, Δ_N^{(j)} is N^{1/2} times an average of N mean zero uncorrelated (but not independent) random variables. Thus,

|| P{ Δ_N^{(1)} Δ_N^{(1)T} } || ≤ P[ P_N{ ||d(β)||^2 ψ^2( ε(β) / v_0(x)^{1/2} ) | v̂(x)^{−1/2} − v_0(x)^{−1/2} |^2 } ] .

By (C.1) and the Cauchy-Schwarz inequality, Δ_N^{(1)} →_p 0 as long as

(7.11)    P[ P_N{ | v̂(x)^{−1/2} − v_0(x)^{−1/2} |^2 } ] → 0 ,

and this follows routinely from (A.1) and (A.2). Because Δ_N^{(2)} is N^{1/2} times a sum of uncorrelated random variables, we have by (A.4) and (C.1) that P| Δ_N^{(2)} |^2 → 0 as long as

(7.12)    P[ P_N{ T_N } ] → 0 ,   where   T_N = v̂(x)^{−1} min^2[ M_ψ(1), | ε(β) | | v̂(x)^{−1/2} − v_0(x)^{−1/2} | ] .

Fix c > 0 and write T_N = T_N^{(1)} + T_N^{(2)} = T_N I( |ε(β)| > c ) + T_N I( |ε(β)| ≤ c ). By (C.2) and mimicking the proof of PROPOSITION 2,

(7.13)    limsup_N P[ P_N{ v̂(x)^{−2} } ] < ∞ .

We thus need only show that

(7.14)    lim_{c→∞} lim_{N→∞} P[ P_N{ T_N^{(1)} } ] = 0 .

If M_ψ(1) < ∞, then by the Cauchy-Schwarz inequality,

P[ P_N{ T_N^{(1)} } ] ≤ M_ψ(1)^2 [ P{ I( |ε(β)| > c ) } ]^{1/2} { P[ P_N{ v̂(x)^{−2} } ] }^{1/2} ;

invoking (7.13), equation (7.14) now follows. We thus need only verify (7.12) when M_ψ(1) = ∞. By (C.2)-(C.3) and a proof similar to that of PROPOSITION 2, (7.12) follows by applying the Cauchy-Schwarz inequality and (7.11).

Proof of THEOREM 3 (Sketch): As in Bickel (1982, page 653), from (D.2) it suffices to prove all steps with β* replaced by β*_N = β + Δ_N / N^{1/2}, where Δ_N → Δ_0 is a deterministic sequence. Using compactness and the boundedness of v*,

| ĝ(x)^{1/2} − v̂(x)^{1/2} | ≤ c N^{−1/2} { 1 + | v̂(x)^{1/2} − v*(x)^{1/2} | } ,

so that (A.3) holds with a(2) ≥ 1.0 and v_0 replaced by v*, and LEMMA 1 holds. To prove the analogue of THEOREM 1, it suffices to prove PROPOSITION 6, as the other step is similar. Write T_N as in PROPOSITION 6 and

S_N = N^{1/2} P_N{ v*(x)^{−3/2} d(β) ε(β) [ ĝ(x)^{1/2} − v̂(x)^{1/2} ] } .

Since d(β) = d(x,β) = x is bounded and since we have replaced β* by β*_N, for some c_4 > 0,

| ĝ(x)^{1/2} − v̂(x)^{1/2} | ≤ c_4 N^{−1/2} ,

so that | T_N − S_N | →_p 0 by (D.4) and (A.6). Thus, to prove the analogue of THEOREM 1, it suffices that S_N →_p 0. Using Bickel's trick, the boundedness of d(β) = d(x,β) = x, and the bounded conditional fourth moment of ε(β), one shows directly that the covariance of S_N converges to zero, as desired.

The next step in the proof of THEOREM 3 is to show that the remaining term Q_N →_p 0. By algebra, Q_N = C_N^{(1)} + C_N^{(2)}, where

C_N^{(1)} = N^{1/2} P_N{ d(β) ε(β) v*(x)^{−1/2} [ v̂(x)^{−1/2} − v*(x)^{−1/2} ] } ;

these two terms converge to zero by (D.5) and (D.4) respectively. We are finished once we show that the final term converges to zero in probability. Note that since s(x,β) is bounded away from zero, so too is v*(x). The proof is completed by noting that this last random variable has mean zero and covariance converging to zero by (D.1).