Consistency, Efficiency and Robustness of Conditional Disparity Methods

Giles Hooker and Anand Vidyashankar

November 18, 2012

Abstract

This report demonstrates the consistency and asymptotic normality of minimum-disparity estimators based on conditional density estimates. In particular, it provides $L_1$ consistency results for unrestricted conditional density estimators and for conditional density estimators under homoscedastic model restrictions. It defines a generic formulation of disparity estimates in conditionally-specified models and demonstrates their asymptotic consistency and normality. In particular, for regression models with more than one continuous response or covariate, we demonstrate that disparity estimators based on unrestricted conditional density estimates have a bias that is larger than $n^{-1/2}$; however, for univariate homoscedastic models conditioned on a univariate continuous covariate, this bias can be removed.

1 Framework and Assumptions

Throughout the following, we assume that we observe $\{(X_{n1}(\omega), X_{n2}(\omega), Y_{n1}(\omega), Y_{n2}(\omega)),\ n \geq 1\}$, i.i.d. random variables where $X_{n1}(\omega) \in \mathbb{R}^{d_x}$, $X_{n2}(\omega) \in S_x$, $Y_{n1}(\omega) \in \mathbb{R}^{d_y}$, $Y_{n2}(\omega) \in S_y$ for countable sets $S_x$ and $S_y$, with joint distribution
\[
g(x_1, x_2, y_1, y_2) = P(X_2 = x_2, Y_2 = y_2) P(X_1 \in dx_1, Y_1 \in dy_1 \mid x_2, y_2),
\]
and define the marginal and conditional densities
\[
h(x_1, x_2) = \sum_{y_2 \in S_y} \int g(x_1, x_2, y_1, y_2) \, dy_1, \qquad
f(y_1, y_2 | x_1, x_2) = \frac{g(x_1, x_2, y_1, y_2)}{h(x_1, x_2)}
\]
on the support of $(x_1, x_2)$. Further, for a scalar $Y_{n1} \in \mathbb{R}$, removing $Y_{n2}$ (or setting it to be identically 0) we define the conditional expectation
\[
m(x_1, x_2) = \int y_1 f(y_1 | x_1, x_2) \, dy_1
\]
and note that in homoscedastic regression models we write, for some density $f^*(e)$,
\[
f(y_1 | x_1, x_2) = f^*(y_1 - m(x_1, x_2)). \tag{1.1}
\]
Within this context, the use of disparity methods to estimate parameters in linear regression was treated in Pak and Basu (1998) from the point of view of placing a disparity on the score equations. We take a considerably more direct approach here. The following regularity structures may be assumed in the theorems below:

(D1) $g$ is bounded and continuous in $x_1$ and $y_1$.

(D2) All third derivatives of $g$ with respect to $x_1$ and $y_1$ exist, are continuous and bounded.

(D3) The support of $h$, $\mathcal{X} \subset \mathbb{R}^{d_x} \otimes S_x$, is compact and $h^- = \inf_{(x_1,x_2) \in \mathcal{X}} h(x_1, x_2) > 0$.

We note that under these conditions, continuity of $h$ and $f$ in $x_1$ and $y_1$ is inherited from $g$. In the case of homoscedastic models we also assume

(E1) $\int |\nabla_e f^*(e)| \, de < \infty$.

We will form kernel estimates for these quantities as follows:
\[
\hat{g}_n(x_1, x_2, y_1, y_2, \omega) = \frac{1}{n c_{nx_2}^{d_x} c_{ny_2}^{d_y}} \sum_{i=1}^n K_x\!\left(\frac{x_1 - X_{i1}(\omega)}{c_{nx_2}}\right) K_y\!\left(\frac{y_1 - Y_{i1}(\omega)}{c_{ny_2}}\right) I_{x_2}(X_{i2}(\omega)) I_{y_2}(Y_{i2}(\omega)) \tag{1.2}
\]
\[
\hat{h}_n(x_1, x_2, \omega) = \frac{1}{n c_{nx_2}^{d_x}} \sum_{i=1}^n K_x\!\left(\frac{x_1 - X_{i1}(\omega)}{c_{nx_2}}\right) I_{x_2}(X_{i2}(\omega)) = \sum_{y_2 \in S_y} \int_{\mathbb{R}^{d_y}} \hat{g}_n(x_1, x_2, y_1, y_2, \omega) \, dy_1 \tag{1.3}
\]
\[
\hat{f}_n(y_1, y_2 | x_1, x_2, \omega) = \frac{\hat{g}_n(x_1, x_2, y_1, y_2, \omega)}{\hat{h}_n(x_1, x_2, \omega)} \tag{1.4}
\]
where $K_x$ and $K_y$ are densities on the spaces $\mathbb{R}^{d_x}$ and $\mathbb{R}^{d_y}$ respectively. Further conditions on these are detailed below. In the case of homoscedastic regression models we further define a homoscedastic conditional density via the Nadaraya-Watson estimator:
\[
\hat{m}_n(x_1, x_2, \omega) = \frac{\sum_{i=1}^n Y_{i1}(\omega) K_x\!\left(\frac{x_1 - X_{i1}(\omega)}{c_{nx_2}}\right) I_{x_2}(X_{i2}(\omega))}{\sum_{i=1}^n K_x\!\left(\frac{x_1 - X_{i1}(\omega)}{c_{nx_2}}\right) I_{x_2}(X_{i2}(\omega))} \tag{1.6}
\]
\[
\hat{f}^*_n(e, \omega) = \frac{1}{n c_{ny_2}^{d_y}} \sum_{i=1}^n K_y\!\left(\frac{e - (Y_i(\omega) - \hat{m}_n(X_{i1}(\omega), X_{i2}(\omega)))}{c_{ny_2}}\right) \tag{1.7}
\]
\[
\tilde{f}_n(y_1 | x_1, x_2, \omega) = \hat{f}^*_n(y_1 - \hat{m}_n(x_1, x_2, \omega), \omega), \tag{1.8}
\]
with the notation $c_{ny_2}$ maintained as a bandwidth for the sake of consistency.
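Before stating the regularity conditions on the kernels and bandwidths, it may help to see the estimators (1.2)-(1.8) in computational form. The following Python sketch is our own illustration, not part of the formal development: it assumes product Gaussian kernels for $K_x$ and $K_y$, integer-coded discrete components $X_2$ and $Y_2$, a single bandwidth for each of $x$ and $y$, and a scalar response in the homoscedastic case.

```python
import numpy as np

def prod_gauss(u):
    """Product Gaussian kernel evaluated row-wise; u has shape (n, d)."""
    d = u.shape[1]
    return np.exp(-0.5 * (u ** 2).sum(axis=1)) / (2.0 * np.pi) ** (d / 2.0)

def g_hat(x1, x2, y1, y2, X1, X2, Y1, Y2, cx, cy):
    """Joint kernel density estimate (1.2) at the point (x1, x2, y1, y2)."""
    n, dx = X1.shape
    dy = Y1.shape[1]
    ind = (X2 == x2) & (Y2 == y2)            # I_{x2}(X_{i2}) I_{y2}(Y_{i2})
    kx = prod_gauss((x1 - X1) / cx)          # K_x((x1 - X_{i1}) / c_{nx2})
    ky = prod_gauss((y1 - Y1) / cy)          # K_y((y1 - Y_{i1}) / c_{ny2})
    return (kx * ky * ind).sum() / (n * cx ** dx * cy ** dy)

def h_hat(x1, x2, X1, X2, cx):
    """Marginal kernel density estimate (1.3) at (x1, x2)."""
    n, dx = X1.shape
    kx = prod_gauss((x1 - X1) / cx)
    return (kx * (X2 == x2)).sum() / (n * cx ** dx)

def f_hat(y1, y2, x1, x2, X1, X2, Y1, Y2, cx, cy):
    """Unrestricted conditional density estimate (1.4)."""
    return (g_hat(x1, x2, y1, y2, X1, X2, Y1, Y2, cx, cy)
            / h_hat(x1, x2, X1, X2, cx))

def m_hat(x1, x2, X1, X2, Y, cx):
    """Nadaraya-Watson estimate (1.6) for a scalar response Y of shape (n,)."""
    w = prod_gauss((x1 - X1) / cx) * (X2 == x2)
    return (w * Y).sum() / w.sum()

def f_tilde(y1, x1, x2, X1, X2, Y, cx, cy):
    """Homoscedastic estimate (1.7)-(1.8): kernel density of the
    Nadaraya-Watson residuals, shifted by m_hat at the evaluation point."""
    n = len(Y)
    resid = np.array([Y[i] - m_hat(X1[i], X2[i], X1, X2, Y, cx)
                      for i in range(n)])
    e = y1 - m_hat(x1, x2, X1, X2, Y, cx)
    return prod_gauss((e - resid).reshape(-1, 1) / cy).sum() / (n * cy)

# A small example: one continuous covariate, one binary discrete covariate.
rng = np.random.default_rng(0)
X1 = rng.normal(size=(200, 1))
X2 = rng.integers(0, 2, size=200)
Y = 2.0 * X1[:, 0] + X2 + rng.normal(scale=0.5, size=200)
print(f_tilde(1.5, np.array([0.3]), 1, X1, X2, Y, cx=0.3, cy=0.3))
```

Note that the homoscedastic estimate pools all $n$ residuals into a single density $\hat f^*_n$, which is the dimension reduction exploited in Section 6.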
For the sake of notational compactness, we define measures $\nu$ and $\mu$ on $\mathbb{R}^{d_x} \otimes S_x$ and $\mathbb{R}^{d_y} \otimes S_y$ respectively, given by the product of Lebesgue and counting measure, and define $y = (y_1, y_2)$ and $x = (x_1, x_2)$. Where needed, we will write, for any function $F(x_1, x_2, y_1, y_2)$,
\[
\sum_{x_2 \in S_x} \sum_{y_2 \in S_y} \int\!\!\int F(x_1, x_2, y_1, y_2) \, dx_1 \, dy_1 = \int\!\!\int F(x_1, x_2, y_1, y_2) \, d\mu(y) \, d\nu(x) = \int\!\!\int F(x, y) \, d\mu(y) \, d\nu(x). \tag{1.9}
\]
Throughout we will attempt to keep the notation explicit where possible and, in particular, will not make use of this convention in Sections 2 and 3, where the distinction between discrete-valued and continuous-valued random variables is particularly important.

Throughout we make the following assumptions on the kernels $K_x$ and $K_y$:

(K1) $K_x$ and $K_y$ are densities on $\mathbb{R}^{d_x}$ and $\mathbb{R}^{d_y}$ respectively.

(K2) For some finite $K^+$, $\sup_{x_1 \in \mathbb{R}^{d_x}} K_x(x_1) < K^+$ and $\sup_{y_1 \in \mathbb{R}^{d_y}} K_y(y_1) < K^+$.

(K3) $\|x_1\|^{2d_x} K_x(x_1) \to 0$ as $\|x_1\| \to \infty$ and $\|y_1\|^{2d_y} K_y(y_1) \to 0$ as $\|y_1\| \to \infty$.

(K4) $K_x(x_1) = K_x(-x_1)$ and $K_y(y_1) = K_y(-y_1)$.

(K5) $\int \|x_1\|^2 K_x(x_1) \, dx_1 < \infty$ and $\int \|y_1\|^2 K_y(y_1) \, dy_1 < \infty$.

(K6) $K_x$ and $K_y$ have bounded variation and finite modulus of continuity.

We also assume the following properties of the bandwidths. These will be given in terms of the number of observations falling at each combination of values of the discrete variables:
\[
n(x_2) = \sum_{i=1}^n I_{x_2}(X_{2i}(\omega)), \quad n(y_2) = \sum_{i=1}^n I_{y_2}(Y_{2i}(\omega)), \quad n(x_2, y_2) = \sum_{i=1}^n I_{x_2}(X_{2i}(\omega)) I_{y_2}(Y_{2i}(\omega)).
\]

(B1) $c_{nx_2} \to 0$, $c_{ny_2} \to 0$.

(B2) $n(x_2) c_{nx_2}^{d_x} \to \infty$ for all $x_2 \in S_x$ and $n(x_2, y_2) c_{nx_2}^{d_x} c_{ny_2}^{d_y} \to \infty$ for all $(x_2, y_2) \in S_x \otimes S_y$.

(B3) $n(x_2) c_{nx_2}^{2d_x} \to \infty$.

(B4) $n(x_2, y_2) c_{nx_2}^{2d_x} c_{ny_2}^{2d_y} \to \infty$.

(B5) $\sum_{n(x_2)=1}^\infty c_{nx_2}^{-d_x} e^{-\gamma n(x_2) c_{nx_2}^{d_x}} < \infty$ for all $\gamma > 0$.

(B6) $n(y_2) c_{ny_2}^4 \to 0$ if $d_y = 1$ and $n(x_2) c_{nx_2}^4 \to 0$ if $d_x = 1$,

where the sums are taken to be over all observations in the case that $X_2$ or $Y_2$ are singletons.

2 Consistency Results for Conditional Densities over Spaces of Mixed Types

In this section we provide a number of $L_1$ consistency results for kernel estimates of densities and conditional densities of multivariate random variables in which some coordinates take values in Euclidean space while others take values on a discrete set. Pointwise consistency of conditional density estimates of this form can be found in, for example, Li and Racine (2007). However, we are unaware of equivalent $L_1$ results, which will be necessary for our development of conditional disparity-based inference. Throughout, we have assumed that both the conditioning variable $x$ and the response $y$ are multivariate with both types of coordinates. The specialization to univariate models, or to models with only discrete or only continuous variables in either $x$ or $y$ (and to unconditional densities), is readily seen to be covered by our results as well. We begin with a result on densities for random variables taking discrete values.

Theorem 2.1. Let $\{X_n : n \geq 1\}$ denote a collection of i.i.d. discrete random variables with support $S$ where $S$ is a countable set. Let $f(t) = P(X_1 = t)$, $t \in S$, and
\[
f_n(t, \omega) = \frac{1}{n} \sum_{k=1}^n I_{\{t\}}(X_k(\omega));
\]
then there exists a set $B$ with $P(B) = 1$ such that for $\omega \in B$,
\[
\lim_{n \to \infty} \sum_{t \in S} |f(t) - f_n(t, \omega)| = 0.
\]

Proof. By the strong law of large numbers, for each $t$ there exists a set $A_t$ such that $P(A_t) = 0$ and on $A_t^c$, $f_n(t, \omega)$ converges to $f(t)$. Let $B = \cap_{t \in S} A_t^c$. Then $P(B) = 1$ and for $\omega \in B$, $f_n(t, \omega)$ converges to $f(t)$ pointwise for all $t$. Notice that $\sum_{t \in S} f_n(t, \omega) = 1$. Let $D_n = \{t \in S : f(t) \geq f_n(t, \omega)\}$.
Hence it follows that
\[
0 = \sum_{t \in S} (f(t) - f_n(t, \omega)) = \sum_{t \in S} (f(t) - f_n(t, \omega)) I_{D_n}(t) + \sum_{t \in S} (f(t) - f_n(t, \omega)) I_{D_n^c}(t),
\]
and hence
\[
\sum_{t \in S} (f(t) - f_n(t, \omega)) I_{D_n}(t) = \sum_{t \in S} (f_n(t, \omega) - f(t)) I_{D_n^c}(t). \tag{2.10}
\]
Now,
\[
\sum_{t \in S} |f(t) - f_n(t, \omega)| = \sum_{t \in S} |f(t) - f_n(t, \omega)| I_{D_n}(t) + \sum_{t \in S} |f(t) - f_n(t, \omega)| I_{D_n^c}(t) = \sum_{t \in S} (f(t) - f_n(t, \omega)) I_{D_n}(t) + \sum_{t \in S} (f_n(t, \omega) - f(t)) I_{D_n^c}(t).
\]
Hence, using (2.10), we see that
\[
\sum_{t \in S} |f(t) - f_n(t, \omega)| = 2 \sum_{t \in S} (f(t) - f_n(t, \omega)) I_{D_n}(t).
\]
Alternatively, we can express this equality as
\[
\sum_{t \in S} |f(t) - f_n(t, \omega)| = 2 \int_S (f(t) - f_n(t, \omega)) I_{D_n}(t) \, d\nu(t),
\]
where $\nu(\cdot)$ is the counting measure. Now fix $\omega \in B$ and set $a_n(t, \omega) = (f(t) - f_n(t, \omega)) I_{D_n}(t)$. Notice that $0 \leq a_n(t, \omega) \leq f(t)$ and $\int_S f(t) \, d\nu(t) = 1 < \infty$. Hence, by the dominated convergence theorem applied to the counting measure space $S$, we see that
\[
\lim_{n \to \infty} \sum_{t \in S} |f(t) - f_n(t, \omega)| = 2 \lim_{n \to \infty} \int_S (f(t) - f_n(t, \omega)) I_{D_n}(t) \, d\nu(t) = 2 \int_S \lim_{n \to \infty} (f(t) - f_n(t, \omega)) I_{D_n}(t) \, d\nu(t) = 0.
\]
Since $P(B) = 1$ and $\lim_{n \to \infty} \sum_{t \in S} |f(t) - f_n(t, \omega)| = 0$, we have almost sure $L_1$-consistency of the empirical probabilities.

Using this, we can now provide the $L_1$ convergence of densities of mixed types; pointwise and uniform convergence is included for completeness.

Theorem 2.2. Let $\{(X_{n1}, X_{n2}), n \geq 1\}$ be given as in Section 1. Under Assumptions D1-D2, K1-K6 and B1-B2 there exists a set $B$ with $P(B) = 1$ such that for all $\omega \in B$,
\[
\lim_{n \to \infty} \sum_{x_2 \in S_x} \int \left| \hat{h}_n(x_1, x_2, \omega) - h(x_1, x_2) \right| dx_1 = 0. \tag{2.11}
\]
For almost all $(x_1, x_2) \in \mathbb{R}^{d_x} \times S_x$, there exists a set $B_{(x_1,x_2)}$ with $P(B_{(x_1,x_2)}) = 1$ such that for all $\omega \in B_{(x_1,x_2)}$,
\[
\hat{h}_n(x_1, x_2, \omega) \to h(x_1, x_2). \tag{2.12}
\]
Further, under Assumptions D3 and B5 there exists a set $B_s$ with $P(B_s) = 1$ such that for all $\omega \in B_s$,
\[
\sup_{(x_1, x_2) \in \mathcal{X}} \left| \hat{h}_n(x_1, x_2, \omega) - h(x_1, x_2) \right| \to 0. \tag{2.13}
\]

Proof. We first re-write
\[
\hat{h}_n(x_1, x_2, \omega) = \frac{n_{x_2}(\omega)}{n} \cdot \frac{1}{n_{x_2}(\omega) c_{nx_2}^{d_x}} \sum_{X_{i2}(\omega) = x_2} K_x\!\left(\frac{x_1 - X_{i1}(\omega)}{c_{nx_2}}\right) = \hat{p}_{x_2}(\omega) \tilde{h}_{x_2}(x_1, \omega)
\]
and observe that for every $x_2$ there is a set $A_{x_2}$ with $P(A_{x_2}) = 1$ such that for $\omega \in A_{x_2}$, $n_{x_2}(\omega) \to \infty$, and hence from Devroye and Györfi (1985, Chap. 3, Theorem 1) there is a set $C_{x_2} \subset A_{x_2}$ with $P(C_{x_2}) = 1$ such that
\[
\int \left| \tilde{h}_{x_2}(x_1, \omega) - h(x_1 | x_2) \right| dx_1 \to 0.
\]
We can now write
\[
\sum_{x_2 \in S_x} \int \left| \hat{h}_n(x_1, x_2, \omega) - h(x_1, x_2) \right| dx_1 \leq \sum_{x_2 \in S_x} p(x_2) \int \left| \tilde{h}_{x_2}(x_1, \omega) - h(x_1 | x_2) \right| dx_1 + \sum_{x_2 \in S_x} \left| \hat{p}(x_2, \omega) - p(x_2) \right|,
\]
where the second term converges for $\omega \in D$ with $P(D) = 1$ from Theorem 2.1, and the first term converges on $E = \cap_{x_2 \in S_x} C_{x_2}$ from the dominated convergence theorem applied with the observation that
\[
\sum_{x_2 \in S_x} p(x_2) \int \left| \tilde{h}_{x_2}(x_1, \omega) - h(x_1 | x_2) \right| dx_1 \leq \sum_{x_2 \in S_x} p(x_2) \int \left( \tilde{h}_{x_2}(x_1, \omega) + h(x_1 | x_2) \right) dx_1 \leq 2.
\]
Hence (2.11) holds for $\omega \in B = D \cap E$ with $P(B) = 1$. To demonstrate (2.12), we observe that (2.11) requires that for almost all $(x_1, x_2)$, $\hat{h}_n(x_1, x_2, \omega) - h(x_1, x_2) \to 0$. Finally, (2.13) follows from the finiteness of $S_x$ under Assumption D3 and the uniform convergence of kernel density estimates under Assumption B5 (see, e.g., Nadaraya, 1989, Chap. 4).

The next theorem demonstrates convergence that is uniformly $L_1$; that is, the $L_1$ distance measured in $(y_1, y_2)$ at each $(x_1, x_2)$ converges uniformly over $(x_1, x_2)$. We begin by considering only the continuous-valued random variables, where $g(x_1, y_1)$ and $\hat{g}_n(x_1, y_1)$ will denote the corresponding density and kernel density estimate.

Theorem 2.3.
Let {(Xn1 , Yn1 ), n ≥ 1} be given as in Section 1 under assumptions D1 - D2, K1 - K6 and B1-B2 then for almost all x1 there exists a set Bx1 with P (Bx1 ) = 1 such that for all ω ∈ Bx1 Z |ĝn (x1 , y1 , ω) − g(x1 , y1 )| dy1 → 0. (2.14) Further, if Assumptions D3 and B5 hold, there exists a set B with P (B) = 1 such that for all ω ∈ B Z sup |ĝn (x1 , y1 , ω) − g(x1 , y1 )| dy1 → 0. (2.15) x1 ∈X Proof. For (2.14) we observe that Z Z Z |ĝn (x1 , y1 , ω) − g(x1 , y1 )| dy1 dx1 = T (x1 )dx1 →0 almost surely with T (x1 ) > 0, see Devroye and Györfi (1985, Chap. 3 Theorem 1). Thus T (x1 ) → 0 for almost all x1 . 6 Turning to (2.15), first we demonstrate the result on the expectation of ĝn (x1 , y1 ). We observe that we can choose bn(y2 ) → ∞ with bn(y2 ) cny2 → 0 so that Z sup |Eĝn (x1 , y1 ) − g(x1 , y1 )| dy1 x1 ∈X Z Z Z K (u)K (v)g(x + c u, y + c v)dudv − g(x , y ) = sup x y 1 nx2 1 ny2 1 1 dy1 x1 ∈X Z |g(x01 , y10 ) − g(x1 , y1 )| dy1 + Ky (bn(y2 ) ) ≤ sup sup sup sup g(x1 , y1 ) x1 ∈X x01 ∈X y10 :ky10 −y1 k>cny2 bn(y2 ) x1 ∈X ,y1 ∈Rdy ≤ M bdny cdnyy 2 + Ky (bn(y2 ) ) → 0. We now observe that since X is compact, it can be given a disjoint covering. Further, since the modulus of continuity of K is finite, for any , we can take a covering X = N [n Aj j=1 and define X Nn 1 u − x1 (an1 (x1 ), . . . , aNn (x1 )) = arg min sup aj (x1 )IAj (u) Kx − cnx2 a1 ,...,aNn u∈X cnx2 j=1 and Kn+ (u, x) = Nn X ajn (x1 )IAj (u) j=1 such that 1 u − x1 + sup sup Kx − Kn (u, x1 ) < cnx2 3 x∈X u∈X cnx2 x for some constant C depending on where we also observe that for any j, supx ajn (x1 ) < when Nn = C /cdnx 2 + dx K /cnx2 . From here we define n 1 X + y1 − Y1i (ω) ĝn∗ (x, y, ω) = dy Kn (x1 , X1i (ω))Ky cny2 ncny2 i=1 and observe that Z Z sup |ĝn (x1 , y1 , ω) − Eĝn (x1 , y1 )| dy1 ≤ sup |ĝn (x1 , y1 , ω) − ĝn∗ (x1 , y1 , ω)| dy1 x1 ∈X x1 ∈X Z + sup |Eĝn (x1 , y1 ) − Eĝn∗ (x1 , y1 )| dy1 x1 ∈X Z + sup |ĝn∗ (x1 , y1 , ω) − Eĝn∗ (x1 , y1 )| dy1 x1 ∈X Z 2 ≤ + sup |ĝn∗ (x1 , y1 , ω) − Eĝn∗ (x1 , y1 )| dy1 . 3 x1 ∈X 7 We thus only need to demonstrate that Z X ∗ ∗ P sup |ĝn (x1 , y1 , ω) − Eĝn (x1 , y1 )| dy1 > < ∞ x1 ∈X n and the result will follow from the Borel-Cantelli lemma. To do this we observe that ĝn∗ (x1 , y1 , ω) is constant in x1 over partitions Aj and thus Z Z sup |ĝn∗ (x1 , y1 , ω) − Eĝn∗ (x1 , y1 )| dy1 ≤ max |ĝn∗ (x1j , y1 , ω) − Eĝn∗ (x1j , y1 )| dy1 j∈1,...,Nn x1 ∈X = max qj ((X11 (ω), Y11 (ω)), . . . , (Xn1 (ω), Yn1 (ω))) j for any x1j ∈ Aj . Then 0 |qj ((x11 , y11 ), . . . , (xn1 , yn1 )) − qj ((x011 , y11 ), . . . , (xn1 , yn1 ))| Z 0 y − y11 y1 − y11 1 1 0 dy1 ajn (x11 )Ky − a (x )K ≤ jn 11 y ncny cny cny 2 2 2 2 supx∈X aj (x) ≤ n 2K + ≤ dx ncnx2 Hence by the bounded difference inequality 2d 2 ncnxx 2 P |qj ((X11 , Y11 ), . . . , (Xn1 , Yn1 )) − Eqj ((X11 , Y11 ), . . . , (Xn1 , Yn1 ))| > < 2e− 8K +2 . 2 We now observe that Z Eqj ((X11 (ω), Y11 (ω)), . . . , (Xn1 (ω), Yn1 (ω))) ≤ q 2 ∗ ∗ min E gˆn (y1 , x1j ), E (ĝn (y1 , x1j ) − E gˆn (y1 , x1j )) dy1 ∗ −dy /2 −dx ≤ Cg n−1/2 cny cnx2 2 for some constant Cg since E (ĝn∗ (y1 , x1j ) ∗ 2 − E gˆn (y1 , x1j )) ≤ Z 1 d ncnyy 2 ajn (x1 )2 Ky (u)2 g(x1 , y1 + cny2 v)dudv. −d /2 −dx Collecting terms, for n sufficiently large, Cg n−1/2 cny2y cnx < /2 by Assumption B4 and 2 P Z sup x1 ∈X 2 nc2 nx2 |ĝn∗ (x1 , y1 , ω) − Eĝn∗ (x1 , y1 )| dy1 > ≤ 2Nn e− 8K +2 2d ≤ which is summable under assumptions B5 8 2 x 2 2C − ncnx+2 8K e dx cnx2 We can now readily extend the above result to the mixed data-types case Theorem 2.4. 
Let {(Xn1 , Xn2 , Yn1 , Yn2 ), n ≥ 1} be given as in Section 1 under assumptions D1 - D2, K1 - K6 and B1-B2 then for almost all (x1 , x2 ) there exists a set B(x1 ,x2 ) with P (B(x1 ,x2 ) ) = 1 such that for all ω ∈ B(x1 ,x2 ) X Z |ĝn (x1 , x2 , y1 , y2 , ω) − g(x1 , x2 , y1 , y2 )| dy1 → 0. (2.16) y2 ∈Sy Further, if Assumptions D3 and B5 hold, there exists a set B with P (B) = 1 such that for all ω ∈ B X Z |ĝn (x1 , x2 , y1 , y2 , ω) − g(x1 , x2 , y1 , y2 )| dy1 → 0. (2.17) sup (x1 ,x2 )∈X y ∈S 2 y Proof. (2.16) follows from the same arguments as (2.14) and the proof is omitted here. To demonstrate (2.17), from assumption D3, Sx is finite and it is therefore sufficient to demonstrate the result at each value of x2 . Writing g̃n (x1 , y1 |x2 , y2 , ω) = n X 1 d y x n(x1 , x2 )cdnx 2 cny2 Ky i=1 y1 − Y1i (ω) cny2 and p̂(x2 , y2 ) = Kx x1 − X1i (ω) cnx2 Ix2 (X2i (ω))Iy2 (Y2i (ω)) n(x2 , y2 ) n we can write sup x1 ∈X X Z |ĝn (x1 , x2 , y1 , y2 , ω) − g(x1 , x2 , y1 , y2 )| dy1 y2 ∈Sy ≤ sup x1 ∈X X Z p(x2 , y2 ) |g̃n (x1 , y1 |x2 , y2 , ω) − g(x1 , y1 |x2 , y2 )| dy1 y2 ∈Sy + sup x1 ∈X X |p̂(x2 , y2 ) − p(x2 , y2 )| ĥn (x1 , x2 , ω). y2 ∈Sy From Theorem 2.2, ĥn (x1 , x2 , ω) is bounded and the second term converges almost surely. Further, the integral in the first term converges almost surely for each y2 by Theorem 2.3. Using that Z X sup p(x2 , y2 ) |g̃n (x1 , y1 |x2 , y2 , ω) − g(x1 , y1 |x2 , y2 )| dy1 ≤ sup ĥn (x1 , x2 , ω) + h(x1 , x2 ) x1 ∈X x1 ∈X y2 ∈Sy is also almost surely bounded, the result now follows from the dominated convergence theorem. The results above can now be readily extended to equivalent L1 results for conditional densities. Theorem 2.5. Let {(Xn1 , Xn2 , Yn1 , Yn2 ), n ≥ 1} be given as in Section 1n under assumptions D1- D2, K1 - K6 and B1-B2 9 1. For almost all x = (x1 , x2 ) ∈ Rdx ⊗Sx there exists a set Bx with P (Bx ) = 1 such that for all ω ∈ Bx X Z (2.18) fˆn (y1 , y2 |x1 , x2 , ω) − f (y1 , y2 |x1 , x2 ) dy1 → 0. y2 ∈Sy 2. There exists a set B1 with P (B1 ) = 1 such that for all ω ∈ B1 , X X Z h(x1 , x2 ) fˆn (y1 , y2 |x1 , x2 , ω) − f (y1 , y2 |x1 , x2 ) dy1 dx1 → 0. (2.19) x2 ∈Sx y2 ∈Sy 3. If further, Assumptions D3 and B5 X Z sup (x1 ,x2 )∈X y ∈S 2 y hold, ˆ fn (y1 , y2 |x1 , x2 , ω) − f (y1 , y2 |x1 , x2 ) dy1 → 0. Proof. We begin with (2.18) and observe that X Z fˆn (y1 , y2 |x1 , x2 , ω) − f (y1 , y2 |x1 , x2 ) dy1 (2.20) (2.21) y2 ∈Sy ≤ X Z ĝn (x1 , x2 , y1 , y2 , ω) − g(x1 , x2 , y1 , y2 ) dy1 h(x1 , x2 ) y2 ∈Sy + X Z y2 ∈Sy = 1 1 ĝn (x1 , x2 , y1 , y2 , ω) − dy ĥn (x1 , x2 , ω) h(x1 , x2 ) 1 Z 1 |ĝn (x1 , x2 , y1 , y2 , ω) − g(x1 , x2 , y1 , y2 )| dy1 h(x1 , x2 ) ĥn (x1 , x2 , ω) + 1 − . h(x1 , x2 ) (2.22) For the first term in (2.22), we observe that h(x1 , x2 ) > 0 and the result follows from the first part of Theorem 2.4. For the second term, for almost all (x1 , x2 ), ĥn (x1 , x2 , ω) → h(x1 , x2 ) with probability 1 from the second part of Theorem 2.2. 10 To demonstrated (2.19), we observe X X Z h(x1 , x2 ) fˆn (y1 , y2 |x1 , x2 , ω) − f (y1 , y2 |x1 , x2 ) dy1 dx1 x2 ∈Sx y2 ∈Sy ≤ X X Z Z |ĝn (x1 , x2 , y1 , y2 , ω) − g(x1 , x2 , y1 , y2 )| dy1 dx1 x2 ∈Sx y2 ∈Sy + X X Z Z x2 ∈Sx y2 ∈Sy = X X Z Z 1 1 h(x1 , x2 )ĝn (x1 , x2 , y1 , y2 , ω) − dy dx ĥn (x1 , x2 , ω) h(x1 , x2 ) 1 1 |ĝn (x1 , x2 , y1 , y2 , ω) − g(x1 , x2 , y1 , y2 )| dy1 dx1 x2 ∈Sx y2 ∈Sy + X Z ĥn (x1 , x2 , ω) − h(x1 , x2 ) dx1 x2 ∈Sx where both terms in the final line converge to zero with probability 1 from Theorem 2.2. 
For (2.20), we follow the proof of (2.18) and observe that taking a surpremum for the first term in (2.22) gives Z 1 |ĝn (x1 , x2 , y1 , y2 , ω) − g(x1 , x2 , y1 , y2 )| dy1 sup (x1 ,x2 )∈X h(x1 , x2 ) Z 1 |ĝn (x1 , x2 , y1 , y2 , ω) − g(x1 , x2 , y1 , y2 )| dy1 ≤ − sup h (x1 ,x2 )∈X →0 from Theorem 2.4 while the supremum over the second term converges to zero by applying Theorem 2.4. 3 Consistency Results for Homoscedastic Conditional Densities In this section we study the conditional density estimates defined in Hansen (2004) for continuous-valued random variables in which a non-parametric location family is assumed: f (y1 |x1 , x2 ) = f ∗ (y − m(x1 , x2 )) for some f ∗ that does not depend on (x1 , x2 ) and a regression function m. This is estimated according to (1.6-1.8). Here we show that essentially all the consistency properties for conditional density estimates demonstrated in Section 2 continue to hold for the estimate (1.6-1.8). Lemma 3.1. Let {(Xn1 , Xn2 , Yn1 ), n ≥ 1} be given as in Section 1 with the restriction (1.1), under assumptions D1-D3, K1 - K6, B1-B2 and B5 there exists a set B with P (B) = 1 such that for all ω ∈ B. sup |m̂n (x1 , x2 , ω) − m(x1 , x2 )| → 0. (x1 ,x2 )∈X Proof. We observe that for any x2 m̂n (x1 , x2 , ω) is a Nadaraya-Watson estimator using the data restricted to X2 (ω) = x2 . Since the set Sx is finite, the result follows from Theorem 2.1 of Nadaraya (Chap. 4 1989) 11 Theorem 3.1. Let {(Xn1 , Xn2 , Yn1 ), n ≥ 1} be given as in Section 1 with the restriction (1.1), under assumptions D1-D3, E1 K1 - K6, B1-B2 and and B5 there exists a set B with P (B) = 1 such that for all ω∈B Z ˆ∗ (3.23) fn (e, ω) − f ∗ (e) de → 0 and sup (x1 ,x2 )∈X Z ˆ fn (y1 |x1 , x2 , ω) − f ∗ (y1 − m(x1 , x2 )) dy1 → 0. (3.24) Proof. We begin by defining f˜n∗ (e) to be the distribution of Yi (ω) − m̃(Xi1 , Xi2 ) for any m̃(x1 , x2 ) with at least two continuous derivatives in x1 : Z f˜∗ (e, m̃) = f ∗ (e − z)f + (z, m̃)dz (3.25) where f + (e, m̃) gives the density of m̃(x1 , x2 ) − m(x1 , x2 ) for (x1 , x2 ) distributed according to h(x1 , x2 ). We also define its kernel density estimate fˆn (e, m̃, ω) = 1 n X d ncnyy 2 i=1 K e − Yi (ω) − m̃(Xi1 (ω), Xi2 (ω)) cny2 (3.26) ∗ where fˆn (e, m̂n , ω) = fˆm (e, ω). To demonstrate (3.23), we break the integral into Z Z Z ˆ fn (e, m̂n , ω) − f ∗ (e) de ≤ fˆn (e, m̂n , ω) − f˜∗ (e, m̂n ) de + f˜∗ (e, m̂n ) − f ∗ (e) de. We begin by observing that the second term converges by a mild generalization of Deroye and Lugosi (2001, Theorem 9.1) given in Lemma 3.2 below with Gn (e) = f + (e, m̂n ) and the observation that Lemma 3.1 implies that the support of f + (e, m̂n ) shrinks to zero. For the first term, we follow the arguments of Deroye and Lugosi (2001, Theorems 9.2-9.4) for m̂n replaced with any bounded continuous m̃ and observe that the convergence can be given in terms that are independent of m̃. In particular, following Deroye and Lugosi (2001) in using the bounded difference inequality we have that for any g, Z Z 2 ˆ ˆ P fn (e, m̃, ω) − g(e) de − E fn (e, m̃, ω) − g(e) de > ≤ 2e− /2n hence Z Z ˆ fn (e, m̃, ω) − f˜∗ (e, m̃) de − E fˆn (e, m̃, ω) − f˜∗ (e, m̃) de → 0 almost surely by the Borel-Cantelli lemma. It is thus sufficient to establish the convergence of Z Z Z E fˆn (e, m̃, ω) − f˜(e, m̃) de ≤ E fˆn (e, m̃, ω) − f˜(e, m̃) de + E 12 ˆ fn (e, m̃, ω) − E fˆn (e, m̃, ω) de. 
For the first term, we replace fˆn (e, m̃, ω) with fˆn∗ (e, m̃, ω) by employing the kernel K + (e) = Ky (e)1kek<bn with bn → ∞ in place of Ky so that Z Z ˆ ∗ ˆ de ≤ Ky (e)de → 0 f (e, m̃, ω) − E f (e, m̃, ω) E n n e≥bn and observe that replacing Gn with d K + (e/cny2 )/cnyy 2 in Lemma 3.2 provides convergence whenever |m̂n (x1 , x2 , ω) − m(x1 , x2 )| ≤ L sup (x1 ,x2 )∈X for some finite L. This occurs with probability 1. Turning to the second term, we also make use of K + and show that Z Z ˆ ˆ E fn (e, m̃, ω) − E fn (e, m̃, ω) de ≤ 2 E fˆn (e, m̃, ω) − E fˆn (e, m̃, ω) de + ! r Z 2 ˜ ˆ ˆ de ≤ 2 min f (e, m̃), E fn (e, m̃, ω) − E fn (e, m̃, ω) Z + 2 E fˆn (e, m̃, ω) − f˜(e, m̃) de where we observe that 2 E fˆn (e, m̃, ω) − E fˆn (e, m̃, ω) ≤ n ≤ Z 1 d cnyy 2 1 2 K+ e−t cny2 sup f˜(e, m̃) d ncnyy 2 e∈Rdy Z f˜(t, m̃)dt K +2 (e)de →0 from Assumption 2. The convergence in made uniform in m̃ from the observation that sup f˜(e, m̃) ≤ sup f ∗ (e). e∈Rdy e∈Rdy For (3.24) we observe that Z ˆ∗ fn (y1 − m̂n (x1 , x2 , ω), ω) − f ∗ (y1 − m(x1 , x2 )) dy1 Z ≤ fˆn∗ (y1 − m̂n (x1 , x2 , ω), ω) − f ∗ (y1 − m̂n (x1 , x2 , ω)) dy1 Z + |f ∗ (y1 − m̂n (x1 , x2 , ω)) − f ∗ (y1 − m(x1 , x2 ))| dy1 . The first term converges to zero from (3.23) while we bound the second term by Z sup |m̂n (x1 , x2 , ω) − m(x1 , x2 )| |∇y f ∗ (y1 )| dy → 0 (x1 ,x2 )∈X from assumption E1 and Lemma 3.1. 13 Lemma 3.2. Let Gn be a sequence of positive, integrable functions on Rd with contained in [−Mn Mn ]d with Mn → 0. Then for any density f on Rd we have Z Z f (x − y)Gn (y)dy − f (x) dx → 0. lim R Gn (x)dx = 1 and support n→∞ Further, defining the class of densities with support on [−L, L]d by DL = {h : kxk > L ⇒ h(x) = 0} we have Z Z Z Z dx → 0. f (x − y − z)h(z)G (y)dzdy − f (x − z)h(z)dx lim sup n n→∞ h∈D L Proof. This proof is a generalization of Deroye and Lugosi (2001, Theorem 9.1) and follows from nearly identical arguments. First, assuming the result is true on any dense subspace of functions g, we have Z Z Z Z dx f (x − y − z)h(z)dzG (y)dy − f (x − z)h(z)dz n Z Z Z f (x − y − z)Gn (y)dy − f (x − z) h(z)dzdx ≤ Z Z Z ≤ |f (x − y) − g(x − y)| Gn (y)dydx + |f (x) − g(x)| dx Z Z + g(x − y)Gn (y)dy − g(x) dx Z ≤ 2 |f (x) − g(x)| dx + o(1) where the first integral can be made as small as desired without reference to h(z). It is therefore sufficient to prove the theorem in a dense subclass. In this case we take the class of Lipschitz densities with compact support. For any such f , let the support of f be contained in [−M ∗ M ∗ ]d and |f (y) − f (x)| ≤ Ckx − yk, x, y ∈ Rd . Then we observe that for any h(z) ∈ DL Z Z | (f (y − z) − f (x − z))h(z)dz| ≤ |f (y − z) − f (x − z)| h(z)dz ≤ |f (y) − f (x)| ≤ Ckx − yk 14 so that uniformly over DL . Z Z Z f (x − y − z)h(z)Gn (y)dzdy − f (x − z)h(z)dz dx Z Z Z ≤ f (x − y − z)Gn (y)dzdy − f (x − z) h(z)dzdx Z Z Z = f (x − y − z)Gn (y)dy − f (x − z) Gn (y)dy h(z)dzdx Z Z ≤ |f (x − y − z) − f (x − z)| h(z)Gn (y)dzdydx Z Z ≤ Ckyk|Gn (y)|dydx [−M ∗ −L−Mn ,M ∗ +L+Mn ]d √ ≤ (2M ∗ + 2L + 2Mn )d CMn d → 0. 4 Consistency of Minimum Disparity Estimators for Conditional Models In this section we define minimum disparity estimators for the conditionally specified models based on distributions and data defined in Section 1. A parametric model is given f (y1 , y2 |x1 , x2 ) = φ(y1 , y2 |x1 , x2 , θ), where we assume that the (Xi1 , Xi2 ) are independently drawn from a distribution h(x1 , x2 ) which is not parametrically specified. 
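As a concrete instance of such a parametric family, one might take $\phi$ to be a Gaussian linear regression whose intercept depends on the discrete covariate. This sketch is our own illustration; the Gaussian form, the parameterization and all names below are assumptions rather than part of the paper's framework.

```python
import numpy as np

def phi(y1, x1, x2, a, b, sigma):
    """A hypothetical conditional model phi(y1 | x1, x2, theta):
    Gaussian with mean a[x2] + b @ x1 and standard deviation sigma,
    where theta = (a, b, sigma) and x2 indexes the intercept."""
    mu = a[x2] + b @ x1
    return np.exp(-0.5 * ((y1 - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Two discrete levels, one continuous covariate:
a, b, sigma = np.array([0.5, 1.5]), np.array([2.0]), 0.5
print(phi(1.2, np.array([0.3]), 1, a, b, sigma))
```

Note that $h(x_1, x_2)$ never enters $\phi$; the covariate distribution remains unspecified throughout.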
For this model, the maximum likelihood estimator for $\theta$ given observations $(Y_i, X_i)$, $i = 1, \ldots, n$ is
\[
\hat{\theta}_{MLE} = \arg\max_{\theta \in \Theta} \sum_{i=1}^n \log \phi(Y_{i1}, Y_{i2} | X_{i1}, X_{i2}, \theta)
\]
with attendant asymptotic variance $I(\theta_0)^{-1}$, where
\[
I(\theta_0) = -\sum_{y_2 \in S_y} \sum_{x_2 \in S_x} \int\!\!\int \nabla_\theta^2 \left[ \log \phi(y_1, y_2 | x_1, x_2, \theta_0) \right] \phi(y_1, y_2 | x_1, x_2, \theta_0) h(x_1, x_2) \, dy_1 \, dx_1
\]
when the specified parametric model is correct at $\theta = \theta_0$. In the context of disparity estimation, for every value $(x_1, x_2)$ we define the conditional disparity between $f$ and $\phi$ as
\[
D(f, \phi | x_1, x_2, \theta) = \sum_{y_2 \in S_y} \int C\left( \frac{f(y_1, y_2 | x_1, x_2)}{\phi(y_1, y_2 | x_1, x_2, \theta)} - 1 \right) \phi(y_1, y_2 | x_1, x_2, \theta) \, dy_1;
\]
these are combined over the observed $(X_{i1}, X_{i2})$ by
\[
D_n(f, \theta) = \frac{1}{n} \sum_{i=1}^n D(f, \phi | X_{i1}, X_{i2}, \theta)
\]
or
\[
\tilde{D}_n(f, \theta) = \sum_{x_2 \in S_x} \int D(f, \phi | x_1, x_2, \theta) \hat{h}_n(x_1, x_2) \, dx_1,
\]
with limiting case
\[
D_\infty(f, \theta) = \sum_{x_2 \in S_x} \int D(f, \phi | x_1, x_2, \theta) h(x_1, x_2) \, dx_1.
\]
We now define the conditional minimum disparity estimator as
\[
\hat{\theta}_n^D = \arg\min_{\theta \in \Theta} D_n(\hat{f}_n, \theta)
\]
with $\hat{f}_n$ defined as either (1.4) or (1.8) and $C$ a strictly convex function from $\mathbb{R}$ to $[-1, \infty)$ with a unique minimum at 0. Classical choices of $C$ include $e^{-x} - 1$, resulting in the negative exponential disparity (NED), and $(\sqrt{x+1} - 1)^2 - 1$, which corresponds to Hellinger distance (HD).

Under this definition, we first establish the existence and consistency of $\hat{\theta}_n^D$. To do so we note that disparity results all rely on the boundedness of $D(f, \phi | X_i, \theta)$ over $\theta$ and $f$, and a condition of the form that for any conditional densities $g$ and $f$,
\[
\sup_{\theta \in \Theta} |D(g, \phi | X_{i1}, X_{i2}, \theta) - D(f, \phi | X_{i1}, X_{i2}, \theta)| \leq K \sum_{y_2 \in S_y} \int |g(y_1, y_2 | X_{i1}, X_{i2}) - f(y_1, y_2 | X_{i1}, X_{i2})| \, dy_1. \tag{4.27}
\]
In the case of Hellinger distance (Beran, 1977), $D(g, \theta) \leq 2$ and (4.27) follows from Minkowski's inequality. For the alternative class of divergences studied in Park and Basu (2004), boundedness of $D$ is established from $\sup_{t \in [-1, \infty)} |C'(t)| \leq C^* < \infty$, which also provides (suppressing the arguments $(y_1, y_2 | x_1, x_2)$ of the densities and $\theta$ of the model)
\[
\left| \int \left[ C\left( \frac{g}{\phi} - 1 \right) - C\left( \frac{f}{\phi} - 1 \right) \right] \phi \, d\mu(y) \right| \leq C^* \int \left| \frac{g}{\phi} - \frac{f}{\phi} \right| \phi \, d\mu(y) = C^* \int |g - f| \, d\mu(y),
\]
where we have employed the notational convention (1.9). For simplicity, we therefore use (4.27) as a condition below. In general, we will require the following assumptions:

(P1) The parameter space $\Theta$ is locally compact.

(P2) There exists $N$ such that for $n > N$ and $\theta_1 \neq \theta_2$, with probability 1, $\max_{i \in 1, \ldots, n} |\phi(y_1, y_2 | X_{i1}, X_{i2}, \theta_1) - \phi(y_1, y_2 | X_{i1}, X_{i2}, \theta_2)| > 0$ on a set of non-zero dominating measure in $y$.

(P3) $\phi(y_1, y_2 | x_1, x_2, \theta)$ is continuous in $\theta$ for almost every $(x_1, x_2, y_1, y_2)$.

(P4) $D(f, \phi | x_1, x_2, \theta)$ is uniformly bounded over $f$, $x_1$, $x_2$ and $\theta$, and (4.27) holds.

(P5) For every $f$ there exists a compact set $S_f \subset \Theta$ and $N$ such that for $n \geq N$,
\[
\inf_{\theta \in S_f^c} D_n(f, \theta) > \inf_{\theta \in S_f} D_n(f, \theta).
\]

These assumptions combine those of Park and Basu (2004) for a general class of disparities with those of Cheng and Vidyashankar (2006), which relax the assumption of compactness of $\Theta$. Together, these provide the following results:

Theorem 4.1. Under Assumptions P1-P5, define
\[
T_n(f) = \arg\min_{\theta \in \Theta} D_n(f, \theta) \tag{4.28}
\]
for $n = 1, \ldots, \infty$ inclusive. Then

(i) For any $f \in \mathcal{F}$ there exists $\theta \in \Theta$ such that $T_n(f) = \theta$.

(ii) For $n \geq N$, for any $\theta$, $\theta = T_n(\phi(\cdot | \cdot, \theta))$ is unique.

(iii) If $T_n(f)$ is unique and $f_m \to f$ in $L_1$ for each $x$, then $T_n(f_m) \to T_n(f)$.

Proof. (i) Existence.
We first observe that it is sufficient to restrict the infimum in (4.28) to Sf . Let {θm : θm ∈ Sf } be a sequence such that θm → θ as as m → ∞. Since f (y1 , y2 |x1 , x2 ) f (y1 , y2 |x1 , x2 ) C − 1 φ(y1 , y2 |x1 , x2 , θm ) → C − 1 φ(y1 , y2 |x1 , x2 , θ) φ(y1 , y2 |x1 , x2 , θm ) φ(y1 , y2 |x1 , x2 , θ) by Assumption P3, using the bound on D(f, φ, θ) from Assumption P4 we have Dn (f, θm ) → Dn (f, θ) by the dominated convergence theorem. Hence Dn (f, t) is continuous in t and achieves its minimum for t ∈ Sf since Sf is compact. (ii) Uniqueness. This is a consequence of Assumption P2 and the unique minimum of C at 0. (iii) Continuity in f . For any sequence fm (·|x1 , x2 ) → f (·|x1 , x2 ) in L1 for every x as m → ∞ we have sup |Dn (fm , θ) − Dn (f, θ)| → 0 (4.29) θ∈Θ from Assumption P4. Now consider θm = Tn (fm ). We first observe that there exists M such that for m ≥ M , θm ∈ Sf otherwise from (4.29) and Assumption P5 Dn (fm , θm ) > inf Dn (fm , θ) θ∈Sf contradicting the definition of θm . Now suppose that θm does not converge to θ0 . By the compactness of Sf we can find a subsequence θm0 → θ∗ 6= θ0 implying Dn (f, θm0 ) → Dn (f, θ∗ ) from Assumption P3. Combining this with (4.29) implies Dn (f, θ∗ ) = Dn (f, θ0 ), contradicting the assumption of the uniqueness of Tn (f ). 17 Theorem 4.2. Let {(Xn1 , Xn2 , Yn1 , Yn2 ), n ≥ 1} be given as in Section 1 and define θn0 = arg min Dn (f, θ) θ∈Θ 0 θ∞ for every n including ∞. Further, assume that is unique in the sense that for every there exists δ such that 0 0 kθ − θ∞ k > ⇒ D∞ (f, θ) > D∞ (f, θ∞ )+δ ¯ then under assumptions D1-D3, K1 - K6, B1-B2 and P1-P5 with fn either of the estimators (1.4) or (1.8) 0 θ̂n = Tn (f¯n ) → θ∞ as n → ∞ almost surely. Similarly, 0 T̃n (f¯n ) = arg min D̃n (f, θ) → θ∞ as n → ∞ almost surely. θ∈Θ Proof. First, we observe that for every f , it is sufficient to restrict attention to Sf and that sup |Dn (f, θ) − D∞ (f, θ)| → 0 almost surely (4.30) θ∈Sf from the strong law of large numbers, the compactness respect to θ. Further, Z ∗ ¯ sup Dm (fn , θ) − Dm (f, θ) ≤ C sup of Sf and the assumed continuity of C and of φ with f¯n (y1 , y2 |x1 , x2 ) − f (y1 , y2 |x1 , x2 ) dµ(y) x∈X m∈N,θ∈Θ → 0 almost surely (4.31) where the convergence is obtained from Theorem 2.5 or Theorem 3.1 depending on the form of f¯n . 0 , then we can find > 0 and a subsequence θ̂n0 such that Suppose that θ̂n does not converge to θ∞ 0 kθ̂n0 − θ∞ k > for all n0 . However, on this subsequence Dn0 (fˆn0 , θ̂n0 ) = Dn0 (fˆn0 , θ0 ) + (Dn0 (f, θ0 ) − Dn0 (fˆn0 , θ0 )) + (D∞ (f, θ0 ) − Dn0 (f, θ0 )) + (D∞ (f, θ̂n0 ) − D∞ (f, θ0 )) + (Dn0 (f, θ̂n0 ) − D∞ (f, θ̂n0 )) + (Dn0 (fˆn0 , θ̂n0 ) − Dn0 (f, θ̂n0 )) ≤ Dn0 (fˆn0 , θ0 ) + δ − 2 sup |Dn0 (f, θ) − D(f, θ)| θ∈Θ − 2 sup Dn0 (fˆn0 , θ) − Dn0 (f, θ) θ∈Θ but from (4.30) and (4.31) we can find N so that for n0 ≥ N sup |Dn0 (f, θ) − D(f, θ)| ≤ θ∈Θ and δ 6 δ sup Dn0 (fˆn0 , θ) − Dn0 (f, θ) ≤ 6 θ∈Θ ¯ contradicting the optimality of θ̂n0 . The proof for T̃n (fn ) follows analogously. 18 5 Asymptotic Normality and Efficiency of Minimum Disparity Estimators for Conditional Models: Unrestricted Case In this section we demonstrate the asymptotic normality and efficiency of minimum conditional disparity estimators using the unrestricted conditional density estimate fˆn (y1 , y2 |x1 , x2 ). In order to simplify some of our expressions, we introduce the following notation, that for a column vector A we define the matrix AT T = AAT . This will be particularly useful in defining information matrices. 
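Before developing the asymptotic theory, we indicate how the estimator $T_n(\hat f_n)$ of Section 4 might be computed in practice. The following is a rough, non-optimized sketch under strong simplifying assumptions that are ours rather than the paper's: univariate continuous $x$ and $y$ with no discrete components, Gaussian kernels, a Gaussian linear model for $\phi$, the Hellinger choice of $C$, and integration replaced by a fixed uniform grid.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def hellinger_disparity(theta, X, Y, cx, cy, grid):
    """D_n(f_hat, theta) for C(x) = (sqrt(x + 1) - 1)^2 - 1: the conditional
    disparity, integrated on a uniform y-grid, averaged over observed X_i."""
    a, b, log_s = theta
    s = np.exp(log_s)                         # keep sigma positive
    step = grid[1] - grid[0]
    total = 0.0
    for xi in X:
        w = norm.pdf((xi - X) / cx)           # kernel weights in x
        # f_hat(y | xi) as in (1.4); the x-bandwidth constants cancel
        f = (w[None, :] * norm.pdf((grid[:, None] - Y[None, :]) / cy)
             ).sum(axis=1) / (cy * w.sum())
        p = np.maximum(norm.pdf(grid, loc=a + b * xi, scale=s), 1e-300)
        C = (np.sqrt(f / p) - 1.0) ** 2 - 1.0
        total += (C * p).sum() * step         # integral of C(f/p - 1) p dy
    return total / len(X)

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=150)
Y = 1.0 + 2.0 * X + rng.normal(scale=0.5, size=150)
grid = np.linspace(-6.0, 8.0, 400)
res = minimize(hellinger_disparity, x0=np.array([0.0, 0.0, 0.0]),
               args=(X, Y, 0.2, 0.2, grid), method="Nelder-Mead")
print(res.x)   # estimates of (a, b, log sigma); truth is (1, 2, log 0.5)
```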
The proof techniques employed here are an extension of those developed in i.i.d. settings in Beran (1977); Tamura and Boos (1986); Lindsay (1994); Park and Basu (2004). In particular we will require the following assumptions: (N1) Define Ψθ (x1 , x2 , y1 , y2 ) = then sup XZ ∇θ φ(y1 , y2 |x1 , x2 , θ) φ(y1 , y2 |x1 , x2 , θ) Ψθ (x1 , x2 , y1 , y2 )Ψθ (x1 , x2 , y1 , y2 )T f (y1 , y2 |x1 , x2 )dy1 < ∞ (x1 ,x2 )∈X y∈S y elementwise. Further, there exists ay > 0 such that XZ sup sup sup Ψθ (x1 + s, x2 , y1 + t, y2 )2 f (y1 , y2 |x1 , x2 )dy1 < ∞. (x1 ,x2 )∈X ktk≤ay ksk≤ay y∈S y and Z sup sup 2 (∇y Ψθ (x1 + t, x2 , y1 + s)) f (y1 |x1 , x2 )dy1 < ∞ sup (x1 ,x2 )∈X ktk≤ay ksk≤ay and Z sup sup 2 (∇x Ψθ (x1 + t, x2 , y1 + s)) f (y1 |x1 , x2 )dy1 < ∞. sup (x1 ,x2 )∈X ktk≤ay ksk≤ay (N2) There exists sequences bn and αn diverging to infinity such that (i) nc4ny2 b4n → 0 and n sup sup X Z Z (x1 ,x2 )∈X kuk>bn y ∈S 2 y Ψ2θ (x1 +cnx2 u, x2 , y1 +cny2 v, y2 )Ky2 (u)Kx2 (v)g(x1 , x2 , y1 , y2 )dvdy1 → 0 kvk>bn elementwise. (ii) sup(x1 ,x2 )∈X nP (kY1 − cny2 bn k > αn − c) → 0. (iii) sup X (x1 ,x2 )∈X y ∈S 2 y √ Z 1 d y x ncdnx 2 cny2 19 |Ψθ (x1 , x2 , y1 , y2 )|dy1 → 0 ky1 k≤αn (iv) sup sup sup sup X g(x1 + cnx s, x2 , y1 + cny t, y2 ) 2 2 = O(1). g(x1 , x2 , y1 , y2 ) (x1 ,x2 )∈X ktk≤bn ksk≤bn ky1 k<αn y ∈S 2 y p φ(y1 , y2 |x1 , x2 , θ)∇θ Ψθ (y1 , y2 , x1 , x2 ) < ∞. √ 2 (N4) C is either given by Hellinger distance C(x) = x + 1 − 1 − 1 or (N3) supy1 ,y2 ,x1 ,x2 A1 (r) = −C 00 (r − 1)r, A2 (r) = C(r − 1) − C 0 (r − 1)r, A3 (r) = C 00 (r − 1)r2 are all bounded in absolute value. Assumption N2i is a condition on the tails of Ψθ and Kx and Ky . As we demonstrate below, this allows us to truncate the kernels at bn ; this will prove mathematically convenient throughout the remainder of this section. Lemma 5.1. Under Assumption N2i, defining + + (u) = Ky (u)Ikuk<bn Knx (u) = Kx (u)Ikuk<bn , Kny + and and setting gn+ , fn+ and fn∗+ to be defined as (1.2), (1.4) and (1.8) with Kx and Ky replaced with Knx + Kny , we have the following convergences in probability X Z √ n sup Ψθ (x1 , x2 , y1 , y2 )(gn (x1 , x2 , y1 , y2 ) − gn+ (x1 , x2 , y1 , y2 ))dy1 → 0 (5.32) (x1 ,x2 )∈X y ∈S 2 y √ n X Z sup Ψθ (x1 , x2 , y1 , y2 )(fn (x1 , x2 , y1 , y2 ) − fn+ (x1 , x2 , y1 , y2 ))dy1 → 0 (5.33) (x1 ,x2 )∈X y ∈S 2 y √ n X Z sup p 2 gn+ (x1 , x2 , y1 , y2 ) dy1 → 0 (5.34) 2 q p + Ψθ (x1 , x2 , y1 , y2 ) fn (x1 , x2 , y1 , y2 ) − fn (x1 , x2 , y1 , y2 ) dy1 → 0 (5.35) Ψθ (x1 , x2 , y1 , y2 ) gn (x1 , x2 , y1 , y2 ) − q (x1 ,x2 )∈X y ∈S 2 y √ n X Z sup (x1 ,x2 )∈X y ∈S 2 y and in the homoscedastic case √ Z n sup Ψθ (x1 , x2 , y1 )(fn∗ (x1 , x2 , y1 ) − fn∗+ (x1 , x2 , y1 ))dy1 → 0 (5.36) 2 q p fn∗ (x1 , x2 , y1 ) − fn∗+ (x1 , x2 , y1 ) dy1 → 0 (5.37) (x1 ,x2 )∈X √ Z n sup Ψθ (x1 , x2 , y1 ) (x1 ,x2 )∈X Proof. Assumption N2i implies the convergence of the expected mean square of (5.32), and (5.32) therefore follows from Markov’s inequality. (5.36) follows from the same argument and (5.33) reduces to (5.32) by observing that Z √ gn (x1 , x2 , y1 , y2 ) − gn+ (x1 , x2 , y1 , y2 ) n sup Ψθ (x1 , x2 , y1 , y2 ) dµ(y) → 0 hn (x1 , x2 ) x∈X 20 in probability since min(x1 ,x2 )∈X hn (x1 , x2 ) is eventually bounded below almost surely and Z √ 1 1 sup + n Ψθ (x1 , x2 , y1 , y2 )gn+ (x1 , x2 , y1 , y2 )dµ(y) → 0 − hn (x1 , x2 ) x∈X hn (x1 , x2 ) since the second term is bounded in probability and 1 1 h+ (x , x ) − h(x1 , x2 ) → 0 and n 1 2 almost surely. 
For (5.34), (5.35) and (5.37) use the inequality √ 1 1 hn (x1 , x2 ) − h(x1 , x2 ) → 0 a− √ 2 b ≤ |a − b| and observe that substituting this reduces to (5.32), (5.33) and (5.36). Lemma 5.2. Let {(Xn1 , Xn2 , Yn1 , Yn2 ), n ≥ 1} be given as in Section 1, under assumptions D1-D2, E1, K1 - K6, B1-B2 and P1-P5 for any for any function J(y1 , y2 , x1 , x2 ) satisfying the conditions on Ψ in Assumptions N1-N2iv and Z Z X J(y1 , y2 , x1 , x2 )g(x1 , x2 , y1 , y2 )dx1 dy1 = 0 x2 ∈Sx ,y2 ∈Sy and X VJ = Z Z J T T (y1 , y2 , x1 , x2 )g(x1 , x2 , y1 , y2 )dx1 dy1 < ∞ x2 ∈Sx ,y2 ∈Sy elementwise, the following hold Z Z X √ n J(y1 , y2 , x1 , x2 )fˆn (y1 , y2 |x1 , x2 )ĥn (x1 , x2 )dy1 dx1 − Bn → N (0, VJ ) (5.38) y2 ∈Sy ,x2 ∈Sx in distribution and similarly n X Z X √ 1 J(y1 , y2 , Xi1 , Xi2 )fˆn (y1 , y2 |Xi1 , Xi2 )dy1 − B̃n → N (0, ṼJ ) n n i=1 (5.39) y2 ∈Sy in distribution where Z Z X Bn = J(y1 , y2 , x1 , x2 )Eĝn (x1 , x2 , y1 , y2 )dy1 dx1 y2 ∈Sy ,x2 ∈Sx = Z Z Z Z X J(y1 , y2 , x1 , x2 )g(x1 + cnx2 u, x2 , y1 + cny2 v, y2 )Kx (u) Ky (v) dudvdy1 dx1 . y2 ∈Sy ,x2 ∈Sx and B̃n = X Z R J(y1 , y2 , x1 , x2 )Eĝn (x1 , x2 , y1 , y2 )h(x1 , x2 )dy1 E ĥn (x1 , x2 ) y2 ∈Sy ,x2 ∈Sx = X y2 ∈Sy ,x2 ∈Sx Z RRR dx1 J(y1 , y2 , x1 , x2 )g(x1 + cnx2 u, x2 , y1 + cny2 v, y2 )Kx (u)Ky (v)h(x1 , x2 )dudvdy1 R dx1 h(x1 + cnx2 u, x2 )Kx (u)du 21 and ṼJ = VJ Z Z +3 Z T J(y1 , y2 , x1 , x2 )f (y1 , y2 |x1 , x2 )dµ(x) J(y1 , y2 , x1 , x2 )f (y1 , y2 |x1 , x2 )dµ(x) h(x1 , x2 )dν(x). + + Proof. Applying Lemma 5.1 we substitute Knx and Kny for Kx and Ky throughout. We begin by examining (5.38) Z Z √ n J(y1 , y2 , x1 , x2 )fˆn (y1 , y2 |x1 , x2 )ĥn (x1 , x2 )dµ(y)dν(x) Z Z √ = n J(y1 , y2 , x1 , x2 )ĝn (x1 , x2 , y1 , y2 )dµ(y)dν(x) n 1 X =√ n i=1 Z Z 1 d y x cdnx 2 cny2 + J(y1 , y2 , x1 , x2 )Knx x1 − Xi1 cnx2 + Kny y1 − Yi1 cny2 Ix2 (Xi2 )Iy2 (Yi2 )dµ(y)dν(x) from which Z Z √ n J(y1 , y2 , x1 , x2 )ĝn (x1 , x2 , y1 , y2 )dµ(y)dν(x) − Bn Z Z Z Z X 1 x1 − u y1 − v + + J(v, y2 , u, x2 )Knx = Kny dudvdUn (x1 , y1 |x2 , y2 ) dx dy cnx2 cny2 y2 ∈Sy ,x2 ∈Sx cnx2 cny2 where Un (x1 , y1 |x2 , y2 ) = √ n (Fn (x1 , y1 |x2 , y2 ) − F (x1 , y1 |x2 , y2 )) where Fn (x1 , y1 |x2 , y2 ) is the empirical conditional cumulative distribution function and F its population counterpart. We now examine " #2 Z Z Z 1 x − u y − v 1 1 + + E d dy J(v, y2 , u, x2 )Knx Kny dudv − J(y1 , y2 , x1 , x2 ) dUn (x1 , y1 |x2 , y2 ) x cnx2 cny2 cnx 2 cny2 Z 2 + ≤ sup(Kny ) (J(y1 + cny2 bn , y2 , x1 + cnx2 bn , x2 ) − J(y1 , y2 , x1 , x2 )) g(x1 , x2 , y1 , y2 )dx1 dy1 y ≤ K + C ∗ (bdnx cdnyx 2 + bdny c2d nx2 ) →0 from the Cauchy-Schwartz inequality and Assumption N1. The result for (5.38) now follows from the Central Limit Theorem. For (5.39), we first observe that n Z √ 1X 1 1 n J(y1 , y2 , Xi1 , Xi2 )ĝn (y1 , y2 |Xi1 , Xi2 ) − dµ(y) → 0 n i=1 hn (Xi1 , Xi2 ) Ehn (Xi1 , Xi2 ) almost surely from the boundedness in probability of n Z √ 1X n J(y1 , y2 , Xi1 , Xi2 )ĝn (y1 , y2 |Xi1 , Xi2 )dµ(y) n i=1 22 and the almost sure convergence 1 1 → 0. 
sup − Ehn (x1 , x2 ) (x1 ,x2 )∈X hn (x1 , x2 ) We now examine n Z 1 X ĝn (y1 , y2 |Xi1 , Xi2 ) √ J(y1 , y2 , Xi1 , Xi2 ) dµ(y) n i=1 E ĥn (Xi1 , Xi2 ) Xj1 −Xi1 + + n n Z Kny (u)IXi2 (Xj2 )IYi2 (Yj2 ) cnx2 1 X X J(Yj1 + cny2 u, Yj2 , Xi1 , Xi2 )Knx =√ du nn i=1 j=1 E ĥn (Xi1 , Xi2 ) n Z Z Z + J(Yj1 + cny2 u, Yj2 , Xj1 + cnx2 v, Xj2 )Kny (u) + 1 X =√ Knx (v) h(Xj1 + cn v, Xj2 )dudvdν(x) n j=1 E ĥn (Xj1 + cn v, Xj2 ) n Z Z + + J(y1 + cny2 u, y2 , Xi1 , Xi2 )Knx (u) Kny (u) 1 X +√ g(Xi1 + cnx2 v, Xi2 , y1 , y2 )dudµ(y) n i=1 E ĥn (Xi1 , Xi2 ) + op (1) Z n 1 X J(Yi1 , Yi2 , Xi1 , Xi2 ) + J(y1 , y2 , Xi1 , Xi2 )f (y1 , y2 |Xi1 , Xi2 )dν(y) =√ n i=1 (5.40) n 1 X +√ Rn (Yi1 , Yi2 , Xi1 , Xi2 ) + op (1) n i=1 where (5.40) is a consequence of Lemma 5.4 and Rn (Yi1 , Yi2 , Xi1 , Xi2 ) Z Z Z + ∇y J(Yi1 + cny2 u∗ , Yi2 , Xi1 + cnx2 v, Xi2 )Kny (u) + = cny2 Knx (v) h(Xi1 + cn v, Xi2 )dudvdν(x) E ĥn (Xi1 + cn v, Xi2 ) Z Z + + (u) ∇y J(y1 + cny2 u∗ , y2 , Xi1 , Xi2 )Knx (u) Kny g(Xi1 + cnx2 v, Xi2 , y1 , y2 )dudµ(y) + cny2 E ĥn (Xi1 , Xi2 ) Z Z Z + ∇x J(Yi1 + cny2 u, Yi2 , Xi1 + cnx2 v ∗ , Xi2 )Kny (u) + Knx (v) h(Xi1 + cn v, Xi2 )dudvdν(x) + cnx2 E ĥn (Xi1 + cn v, Xi2 ) Z Z + + J(y1 + cny2 u, y2 , Xi1 , Xi2 )Knx (v) Kny (u) + cnx2 ∇x g(Xi1 + cnXi2 v ∗ , Xi2 , y1 , y2 )dudvdµ(y) E ĥn (Xi1 , Xi2 ) with u∗ = su and v ∗ = tv for some s and t in [0, 1] and we observe that " #2 n 1 X Rn (Yi1 , Yi2 , Xi1 , Xi2 ) → 0 E √ n i=1 from assumptions N1 and N2iv. The result now follows from the convergence in distribution Z n 1 X √ J(Yi1 , Yi2 , Xi1 , Xi2 ) + J(y1 , y2 , Xi1 , Xi2 )f (y1 , y2 |Xi1 , Xi2 )dν(y) → N (0, ṼJ ). n i=1 23 We remark here that the bias Bn corresponds to the bias found in Tamura and Boos (1986) for multivariate observations. As observed there, the bias in the estimate ĝn is O(c2nx2 + c2ny2 ) and that of ĥn is O(c2nx2 ), d x regardless of the dimension of x1 and y1 . However the variance is of order n−1 cdnx c y (corresponding 2 ny2 to B2), meaning that for dx + dy > 3, the asymptotic bias in the Central Limit Theorem is √ Assumption nc2nx2 c2ny2 → ∞ and will not become zero when the variance is controlled. We will further need to restrict 2d y x to n−1 c2d nx2 cny2 , effectively reducing the unbiased central limit theorem to the cases where either dx = 1 or dy = 1. As in Tamura and Boos (1986) we also note that this bias is often small in practice. However, as we demonstrate in Section 6 in the case of a univariate, homoscedastic model, the use of f˜n (y|x1 , x2 ) can remove this bias in some circumstances. Lemma 5.3. Let {(Xn1 , Xn2 , Yn1 , Yn2 ), n ≥ 1} be given as in Section 1, under assumptions K1 - K6, B1-B4, and N1-N2iv for any function J(y1 , y2 , x1 , x2 ) satisfying the conditions on Ψ in Assumptions N1-N2iv √ n X Z Z sup q J(y1 , y2 , x1 , x2 ) s fˆn (y1 , y2 |x1 , x2 ) − (x1 ,x2 )∈X y ∈S 2 y Eĝn (x1 , x2 , y1 , y2 ) !2 E ĥn (x1 , x2 ) dy1 → 0 (5.41) in probability and further √ Z Z X n q J(y1 , y2 , x1 , x2 ) s fˆn (y1 , y2 |x1 , x2 ) − y2 ∈Sy ,x2 ∈Sx Eĝn (x1 , x2 , y1 , y2 ) ĥn (x1 , x2 ) !2 ĥn (x1 , x2 )dy1 dx1 → 0 (5.42) + + for Kx and Ky throughout. We begin by breaking and Kny Proof. Applying Lemma 5.1 we substitute Knx the integral in (5.41) into regions ky1 k ≥ αn and ky1 k < αn . 
For the first of these, we expand the square results in three terms First: Z √ X Eĝn (x1 , x2 , y1 , y2 ) n J(y1 , y2 , x1 , x2 ) dy1 E ĥn (x1 , x2 ) y2 ∈Sy ky1 k>αn √ X Z Z Z n 1 u − x1 u − y1 + + J(y , y , x , x )K ≤ − K g(u, x2 , v, y2 )dudvdy1 1 2 1 2 nx ny dx dy h cnx2 cny2 y2 ∈Sy cnx2 cny2 ky1 k>αn √ ≤ + o(1) n h− M sup 1/2 sup X Z ktk≤bn ksk≤bn y ∈S 2 y ky1 k>αn J(y1 + cny2 s, y2 , x1 + cnx2 t, x2 )2 g(x1 , x2 , y1 , y2 )dy1 where Z Z Z M= +2 +2 Knx (t)Kny (s)g(x1 1/2 + cnx2 s, x2 , y1 + cny2 s)dsdtdµ(y) the result now follows from Assumptions N1, N2ii and N2iii. 24 √ n Second: X Z y2 ∈Sy J(y1 , y2 , x1 , x2 )fˆn (y1 , y2 |x1 , x2 )dy1 ky1 k>αn 1/2 x1 −X1i 1 + n Z I (X ) x 2i dx Knx X X 2 cnx2 sup Ky (y) cnx √ n(y2 , x2 )J(Y1i + cny2 bn , y2 , x1 , x2 )2 2 ≤ dsdtdy1 n ĥn (x1 , x2 ) i=1 ky1 k>αn y2 ∈Sy + o(1) and we observe that from Assumption N2ii the final integral is non-zero with probability smaller than nP (ky1 − cny2 bn k > αn ) → 0. Third: s Z Z q X √ Eĝn (x1 , x2 , y1 , y2 ) 2 n J(y1 , y2 , x1 , x2 ) fˆn (y1 , y2 |x1 , x2 ) dy1 dx1 h(x1 , x2 ) y2 ∈Sy ,x2 ∈Sx ≤ √ 1/2 n X Z y2 ∈Sy ky1 k>αn J(y1 , y2 , x1 , x2 )2 fˆn (y1 , y2 |x1 , x2 )dy1 reducing to the case above. √ 2 √ 2 q Turning to the region ky1 k < αn , we use the identity ( a − b) ≤ (a − b) /a and add and subtract ĝn (x1 , x2 , y1 , y2 )/E ĥn (x1 , x2 ) to obtain √ n X Z q J(y1 , y2 , x1 , x2 ) s fˆn (y1 , y2 |x1 , x2 ) − y2 ∈Sy Eĝn (x1 , x2 , y1 , y2 ) E ĥn (x1 , x2 ) !2 dy1 √ Z (ĝn (x1 , x2 , y1 , y2 ) − Eĝn (x1 , x2 , y1 , y2 ))2 n J(y1 , y2 , x1 , x2 ) dy1 ≤ − h Eĝn (x1 , x2 , y1 , y2 ) ky1 k<αn Z √ X (ĥn (x1 , x2 ) − E ĥn (x1 , x2 ))2 + n J(y1 , y2 , x1 , x2 )fˆn (y1 , y2 , x1 , x2 )dy1 E ĥn (x1 , x2 ) y2 ∈Sy and we observe that the first term is positive and smaller in expectation than R +2 Z +2 Knx (t)Kny (s)g(x1 + cnx2 s, x2 , y1 + cny2 s, y2 )dsdt 1 + o(1) |J(y1 , y2 , x1 , x2 )| R + √ dx dy + Knx (t)Kny (s)g(x1 + cnx2 s, x2 , y1 + cny2 s, y2 )dsdt ncnx2 cny2 ky1 k<αn Z M g(x1 − 2cnx2 s, x2 , y1 − 2cny2 s, y2 ) ≤ √ d dy |J(y1 , y2 , x1 , x2 )|dy1 sup x g(x1 , x2 , y1 , y2 ) ksk≤b ,ktk≤b ky k<α ncnx c n n 1 n 2 ny2 which converges to zero from Assumptions N2iii and N2iv yielding convergence in probability. The second 25 term converges by bounding √ n sup (ĥn (x1 , x2 ) − E ĥn (x1 , x2 ))2 E ĥn (x1 , x2 ) (x1 ,x2 )∈X x 1 2 log cdx 2 ncdnx 2 ≤ − √ dnx x x h ncnx2 2 log cdnx 2 !2 sup |ĥn (x1 , x2 ) − E ĥn (x1 , x2 )| (x1 ,x2 )∈X (5.43) →0 almost surely from the boundedness of the last two terms on the right hand side of (5.43) (see Giné and Guillou, 2002) and assumption B4. To demonstrate (5.42) we observe that it is equal to Z Z p 2 p √ n J(y1 , y2 , x1 , x2 ) ĝn (x1 , x2 , y1 , y2 ) − Eĝn (x1 , x2 , y1 , y2 ) dµ(y)dν(x) → 0 in probability. This convergence can be demonstrated in the same manner as above. Theorem 5.1. 
Let {(Xn1 , Xn2 , Yn1 , Yn2 ), n ≥ 1} be given as in Section 1, under assumptions D1-D2, E1, K1 - K6, B1-B4, P1-P5 and N1 - N4 define θf = arg min D∞ (f, θ) θ∈Θ and H D (θ) = ∇2θ D∞ (f, θ) Z X f (y1 , y2 |x1 , x2 ) = A2 ∇2θ φ(y1 , y2 |x1 , x2 , θ)h(x1 , x2 )dx1 dy1 φ(y1 , y2 |x1 , x2 , θ) y2 ∈Sy ,x2 ∈Sx Z X ∇θ φ(y1 , y2 |x1 , x2 , θ)T T f (y1 , y2 |x1 , x2 ) h(x1 , x2 )dx1 dy1 + A3 φ(y1 , y2 |x1 , x2 , θ) φ(y1 , y2 |x1 , x2 , θ) y2 ∈Sy ,x2 ∈Sx I (θ) = H (θ)V D (θ)−1 H D (θ) I˜D (θ) = H D (θ)Ṽ D (θ)−1 H D (θ) D D then i √ h n Tn (fˆn ) − θf − Bn → N 0, I D (θf )−1 and i √ h n T̃n (fˆn ) − θf − B̃n → N 0, I˜D (θf )−1 in distribution where Bn , B̃n , V D (θ) and Ṽ D (θ) are obtained by substituting f (y1 , y2 |x1 , x2 ) ∇θ φ(y1 , y2 |x1 , x2 , θf ) J(y1 , y2 , x1 , x2 ) = A1 φ(y1 , y2 |x1 , x2 , θf ) φ(y1 , y2 |x1 , x2 , θf ) into the expressions for Bn , B̃n , VJ and ṼJ in Lemma 5.2. 26 (5.44) Here we note that in the case that f = φθ0 for some θ0 , the θf = θ0 and further since A1 (1) = A2 (1) = A3 (1) = 1 we have that I D (θf ) is given by the Fisher Information for φθ0 and further that I˜D (θf ) = I D (θf ) since Z ∇θ φ(y1 , y2 |x1 , x2 ) φ(y1 , y2 |x1 , x2 )dµ(y) = 0 φ(y1 , y2 |x1 , x2 ) at each x. Proof. We will define T̄n , f¯n (y1 , y2 |x1 , x2 ) and fK (y1 , y2 |x1 , x2 ) to take one of the following combinations of values f¯n fK T̄n Tn fˆn (y1 , y2 |x1 , x2 ) Eĝn (x1 , x2 , y1 , y2 )/ĥn (x1 , x2 ) T̃n fˆn (y1 , y2 |x1 , x2 ) Eĝn (x1 , x2 , y1 , y2 )/E ĥn (x1 , x2 ) Our arguments here follow those in Tamura and Boos (1986) and Park and Basu (2004). Since T̄n (f ) satisfies ∇θ Dn (f, T̄n (f )) = 0 we can write √ −1 ∇θ Dn (f¯n , θ0 ) n T̄n (f¯n ) − θ0 = − ∇2θ Dn (f¯n , θ+ ) for some θ+ between T̄n (f¯n ) and θ0 . It is therefore sufficient to demonstrate (i) ∇2θ Dn (f¯n , θ+ ) → H D (θ0 ) in probability. √ (ii) n ∇θ Dn (f¯n , θ0 ) − B̄n → N (0, V D (θ0 )−1 ) in distribution. with B̄n given by Bn or B̃n as appropriate. We begin with (i) where we observe that by Assumption N4 A2 (r) and A3 (r) are bounded and the result follows from Theorems 2.5 and 4.2 and the dominated convergence theorem. In the case of Hellinger distance #q Z " 2 ∇θ φ(y1 , y2 , x1 , x2 , θ) ∇θ φ(y1 , y2 , x1 , x2 , θ)T T 2 ¯ p ∇θ D(fn , φ|x1 , x2 , θ) = − f¯n (y1 , y2 |x1 , x2 )dµ(y) φ(y1 , y2 , x1 , x2 , θ)3/2 φ(y1 , y2 , x1 , x2 , θ) and the result follows from the uniform L1 convergence of f¯n (Theorem 2.4) and Assumption N3. 
Turning to (ii) where we observe that by the boundedness of C and the dominated convergence theorem we can write ∇θ Dn (f¯n , φ|x1 , x2 , θ) − Bn as ¯ Z fn (y1 , y2 |x1 , x2 ) ∇θ φ(y1 , y2 |x1 , x2 , θ)dµ(y) − B̄n A2 φ(y1 , y2 |x1 , x2 , θ) Z fK (y1 , y2 |x1 , x2 ) ∇θ φ(y1 , y2 |x1 , x2 , θ) ¯ = A1 fn (y1 , y2 |x1 , x2 ) − fK (y1 , y2 |x1 , x2 ) dµ(y) φ(y1 , y2 |x1 , x2 , θ) φ(y1 , y2 |x1 , x2 , θ) Z ¯ fn (y1 , y2 |x1 , x2 ) fK (y1 , y2 |x1 , x2 ) + A2 − A2 ∇θ φ(y1 , y2 |x1 , x2 , θ)dµ(y) φ(y1 , y2 |x1 , x2 , θ) φ(y1 , y2 |x1 , x2 , θ) ¯ Z fK (y1 , y2 |x1 , x2 ) fn (y1 , y2 |x1 , x2 ) fK (y1 , y2 |x1 , x2 ) − A1 − ∇θ φ(y1 , y2 |x1 , x2 , θ)dµ(y) φ(y1 , y2 |x1 , x2 , θ) φ(y1 , y2 |x1 , x2 , θ) φ(y1 , y2 |x1 , x2 , θ) 27 from a minor modification Lemma 25 of Lindsay (1994) we have that by the boundedness of A1 and A2 there is a constant B such that |A2 (r2 ) − A2 (s2 ) − (r2 − s2 )A1 (s2 )| ≤ (r2 − s2 )B substituting s r= f¯n (y1 , y2 |x1 , x2 ) , s= φ(y1 , y2 |x1 , x2 , θ) s fK (y1 , y2 |x1 , x2 ) φ(y1 , y2 |x1 , x2 , θ) we obtain ¯ Z fn (y1 , y2 |x1 , x2 ) A2 ∇θ φ(y1 , y2 |x1 , x2 , θ)dµ(y) − B̄n φ(y1 , y2 |x1 , x2 , θ) Z fK (y1 , y2 |x1 , x2 ) ∇θ φ(y1 , y2 |x1 , x2 , θ) ¯ = A1 fn (y1 , y2 |x1 , x2 ) − fK (y1 , y2 |x1 , x2 ) dµ(y) φ(y1 , y2 |x1 , x2 , θ) φ(y1 , y2 |x1 , x2 , θ) 2 q Z p ∇θ φ(y1 , y2 |x1 , x2 , θ) f¯n (y1 , y2 |x1 , x2 ) − fK (y1 , y2 |x1 , x2 ) dµ(y) +B φ(y1 , y2 |x1 , x2 , θ) The result now follows from Lemmas 5.2 and 5.3. For the special case of Hellinger distance, we observe that Z q ∇θ φ(y1 , y2 , x1 , x2 , θ) ¯ p ∇θ Dn (f¯n , φ|x1 , x2 , θ) = fn (y1 , y2 |x1 , x2 )dµ(y) φ(y1 , y2 , x1 , x2 , θ) √ √ √ √ √ √ and the result is obtained from applying the identity a − b = (a − b)2 /2 a + ( b − a)2 /2 a with a = fK (y1 , y2 |x1 , x2 ) and b = f¯n (y1 , y2 |x1 , x2 ) along with Assumption N3. Lemma 5.4. For any sequence of continuous square-integrable functions bn (x, y) with limit b(x, y) and (Xi , Yi , Zi ), i = 1, 2, ... a sequence of random variables with distribution G(x, y, z) = gx (x)gy (y)gz (z) with Z Z bn (x, y)gx (x)dx = 0, ∀y and b(x, y)gy (y)dy = 0, ∀x and Z 2 (bn (x, y) − b(x, y)) gx (x)gy (y) → 0 we have n n 1 XX √ bn (Xi , Yj ) → op (1) nn i=1 j=1 and further for any square integrable function a(x, y) Z n n 1 XX √ a(Xi , Yj ) − a(x, y)gx (x)gy (y)dxdy → N (0, Σ) nn i=1 j=1 in distribution where the asymptotic variance is given by 2 2 Z Z Z Z Σ= a(x, y)gx (x)dx gy (y)dy + a(x, y)gy (y)dy g( x)dx. 28 Finally, for any sequence of square integrable functions cn (x, y, z) → c(x, y, z) with zero univariate marginals, √ n n n 1 XXX cn (Xi , Yj , Zk ) → op (1) nn2 i=1 j=1 k=1 Proof. All the expressions above are U -statistics (e.g. Hoeffding, 1948); the results above are special cases of Theorem 1 in Weber (1983). 6 Asymptotic Normality and Efficiency of Minimum Disparity Estimators for Conditional Models: Homoscedastic Case For the homoscedastic case (1.1) with estimators (1.6 1.8), we can again obtain a central limit theorem. Although the calculations are direct,√it also produces efficient estimation in location scale families. In this framework, the analysis produces o( n) asymptotic bias when both dx and dy are 1 or 0, as opposed to the general case above in which only one can be non-zero. We speculate that if (1.6) is √replaced by a generalized additive model employing only univariate terms the bias is also asymptotically o( n) regardless of the dimension of x1 so long as the additivity assumptions remain true. The results below largely mimic those in section 5. 
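As in Section 5, it may help to see the homoscedastic fit in computational form before turning to the theory. The sketch below is our own illustration under the same simplifying assumptions as the earlier one (univariate $x$ and $y$, Gaussian kernels, Gaussian linear $\phi$, Hellinger $C$, grid integration); it differs only in plugging the restricted estimate $\tilde f_n$ of (1.6)-(1.8) into the disparity.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def hellinger_disparity_homosc(theta, X, Y, cx, cy, grid):
    """Hellinger disparity built from the homoscedastic estimate (1.8):
    a single residual density f*_n shifted by the Nadaraya-Watson mean."""
    a, b, log_s = theta
    s = np.exp(log_s)
    W = norm.pdf((X[:, None] - X[None, :]) / cx)      # NW weights, (1.6)
    m = (W * Y[None, :]).sum(axis=1) / W.sum(axis=1)  # m_hat at each X_i
    resid = Y - m                                     # residuals for (1.7)
    step = grid[1] - grid[0]
    total = 0.0
    for xi, mi in zip(X, m):
        # f_tilde(y | xi) = f*_n(y - m_hat(xi)), a residual KDE
        f = norm.pdf((grid[:, None] - mi - resid[None, :]) / cy
                     ).mean(axis=1) / cy
        p = np.maximum(norm.pdf(grid, loc=a + b * xi, scale=s), 1e-300)
        C = (np.sqrt(f / p) - 1.0) ** 2 - 1.0
        total += (C * p).sum() * step
    return total / len(X)

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=150)
Y = 1.0 + 2.0 * X + rng.normal(scale=0.5, size=150)
grid = np.linspace(-6.0, 8.0, 400)
res = minimize(hellinger_disparity_homosc, x0=np.array([0.0, 0.0, 0.0]),
               args=(X, Y, 0.2, 0.2, grid), method="Nelder-Mead")
print(res.x)
```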
In this case we need two further assumptions: H1 There exists ay > 0 such that 2 Z sup sup Ψθ (x1 , x2 , y1 ) (x1 ,x2 )∈X ksk≤ay (∇y f (y1 + s|x1 , x2 )) dy1 < ∞. f (y1 |x1 , x2 ) H2 There is a diverging sequence bn such that Z Z Z Ψ2θ (x1 + cnx2 u, x2 , y1 + cny2 v)Ky2 (v)Kx (u)2 f ∗ (y1 |x1 , x2 )dudvdy1 → 0. n sup (x1 ,x2 )∈X kuk>bn kvk>bn Lemma 6.1. Let {(Xn1 , Xn2 , Yn1 ), n ≥ 1} be given as in Section 1 with the restriction (1.1), under assumptions D1-D2, E1, K1 - K6, B1-B6 and P1-P5 for any function J(y1 , y2 , x1 , x2 ) satisfying the conditions on Ψ in Assumptions N1-N2iv with X Z Z J(y1 , x1 , x2 )f (y|x1 , x2 )h(x1 , x2 )dx1 dy1 = 0 x2 ∈Sx 29 and VJ1 = X Z Z T T J(y1 , x1 , x2 )f (y1 |x1 , x2 )dy1 h(x1 , x2 )dx1 < ∞, (6.45) x2 ∈Sx VJ2 = Z " X Z #T T J(e + m(x1 , x2 ), x1 , x2 )h(x1 , x2 )dx1 f ∗ (e)de < ∞, (6.46) h(x1 , x2 )dx1 < ∞, (6.47) x2 ∈Sx VJ3 = σ 2 X Z Z T T ∇y J(y1 , x1 , x2 )f (y1 |x1 , x2 )dy1 x2 ∈Sx VJ4 = σ 2 "Z X Z #T T ∇y J(y1 , x1 , x2 )f (y1 |x1 , x2 )h(x1 , x2 )dy1 dx1 < ∞, (6.48) x2 ∈Sx (6.49) with σ 2 = R e2 f ∗ (e)de. Define VJ = VJ1 + VJ2 + VJ3 − VJ4 then the following hold " # X Z Z √ n J(y1 , x1 , x2 )f˜n (y1 |x1 , x2 )ĥn (x1 , x2 )dy1 dx1 − Bn∗ → N (0, VJ ) (6.50) x2 ∈Sx in distribution and similarly # √ "X n Z n ∗ J(y1 , Xi1 , Xi2 )f˜n (y1 |Xi1 , Xi2 )dy1 − B̃n → N (0, VJ ) n i=1 (6.51) in distribution where X Z Z Z Bn∗ = J(y1 , x1 , x), f (y1 + cny2 u|x1 , x2 )Ky (u)h(x1 , x2 )dudy1 dx1 x2 ∈Sx + X Z Z x2 ∈Sx − X Z Z R ∇y J(y1 , x1 , x2 ) m(x1 + cnx2 u, x2 )h(x1 + cnx2 u, x2 )Kx (u)du R f (y1 |x1 , x2 )h(x1 , x2 )dx1 dy1 h(x1 + cnx2 v, x2 )Kx (v)dv ∇y J(y1 , x1 , x2 )m(x1 , x2 )f (y1 |x1 , x2 )h(x1 , x2 )dx1 dy1 x2 ∈Sx = X Z Z J(y1 , x1 , x2 )E f˜n∗ (y1 |x1 , x2 )h(x1 , x2 )dy1 dx1 x2 ∈Sx + X Z Z ∇y J(y1 , x1 , x2 ) [E m̂n (x1 , x2 ) − m(x1 , x2 )] f (y1 |x1 , x2 )h(x1 , x2 )dx1 dy1 x2 ∈Sx ∗ ∗ = Bn1 + Bn2 30 and B̃n∗ = X Z Z J(y1 , x1 , x2 )E f˜n∗ (y1 |x1 , x2 )E ĥn (x1 , x2 )dy1 dx1 x2 ∈Sx + X Z ∇y J(y1 , x1 , x2 ) [E m̂n (x1 , x2 ) − m(x1 , x2 )] f ∗ (y1 |x1 , x2 )E ĥn (x1 , x2 )dy1 dx1 x2 ∈Sx = ∗ B̃n1 ∗ + B̃n2 and in particular for dy = 1 and dx = 1, √ nBn∗ → 0 and √ nB̃n∗ → 0. (6.52) Proof. We give the proof for (6.51) below; (6.50) is demonstrated in a completely analogous manner. + + Applying Lemma 5.1 we substitute Knx and Kny for Kx and Ky throughout. We begin with (6.51) where we observe √ X n Z n J(y1 , Xi1 , Xi2 )f˜n (y1 |Xi1 , Xi2 )dy1 n i=1 ! n n Z 1 X1X e − Ẽjn + =√ J(e + m̂n (Xi1 , Xi2 ), Xi1 , Xi2 )Kny de cny2 n i=1 n j=1 n n Z 1 XX + = √ J(j + m(Xi1 , Xi2 ) + cny2 u + ên (Xi1 , Xi2 ) − ên (Xj1 , Xj2 ), Xi1 , Xi2 )Kny (u)du n n i=1 j=1 n Z n 1 XX + J(j + m(Xi1 , Xi2 ) + cny2 u, Xi1 , Xi2 )Kny (u)du = √ n n i=1 j=1 n Z n 1 XX + + √ ∇y J(j + m(Xi1 , Xi2 ) + cny2 u, Xi1 , Xi2 )Kny (u)du · [ên (Xi1 , Xi2 ) − ên (Xj1 , Xj2 )] n n i=1 j=1 ! n 1 X 2 +O √ en (Xi1 , Xi2 ) n i=1 where Ẽjn = Yi − m̂n (Xi1 , Xi2 ), i = Yi − m(Xi1 , Xi2 ) and ên (x1 , x2 ) = m̂n (x1 , x2 ) − m(x1 , x2 ) and we observe that the i are independent i1 , Xi2 ). of (X Pn 1 2 √ From here we observe that , O e (X , X ) → 0 in probability from Nadaraya (1989, Chapn i1 i2 i=1 n ter 3, Theorem 1.4). 31 Further n n 1 XX √ n j=1 i=1 Z + ∗ J(j + m(Xi1 , Xi2 ) + cny2 u, Xi1 , Xi2 )Kny (u)du − Bn1 n 1 X =√ n i=1 Z Z + ∗ J(e + m(Xi1 , Xi2 ), Xi1 , Xi2 )f ∗ (e + cny2 u)Kny (u)dude − Bn1 Z Z 1 + ∗ +√ J(j + m(x1 , x2 ) + cny2 u, x1 , x2 )Kny (u)h(x1 , x2 )dudν(x) − Bn1 n + op (1) from Lemma 5.4 below. 
These terms converge in mean square to n 1 X √ n i=1 Z J(e + m(Xi1 , Xi2 ), Xi1 , Xi2 )f ∗ (e)de + Z J(j + m(x1 , x2 ), x1 , x2 )h(x1 , x2 )dν(x) n 1 X =√ Q1 (i , Xi1 , Xi2 ) n i=1 These mean squares are guaranteed to be finite from assumptions N1 and N2i. For the middle term in this expansion, we define Z + L(, x1 , x) = ∇y J( + m(x1 , x2 ) + cny2 u, x1 , x2 )Kny (u)du for notational convenience. We also write Yk = m(Xk1 , Xk2 ) + k and divide ên (x1 , x2 ) into ên (x1 , x2 ) = ên1 (x1 , x2 ) + ên2 (x1 , x2 ) where ên1 (x1 , x2 ) = 1 x ncdnx 2 n K+ X i nx 1 x ncdnx 2 n m(X , X )K + X i1 i2 nx i=1 x−Xi1 cn Ix2 (Xi2 ) ĥn (Xi1 , Xi2 ) i=1 and ên2 (x1 , x2 ) = x−Xi1 cn ĥn (Xi1 , Xi2 ) 32 Ix2 (Xi2 ) − m(x1 , x2 ). We begin by examining terms involving ên1 and expand the summation: n n 1 XX √ L(j , Xi1 , Xi2 ) [ên1 (Xi1 , Xi2 ) − ên1 (Xj1 , Xj2 )] n n i=1 j=1 Xj1 −Xk1 + Xi1 −Xk1 + n X n X n IXj2 (Xk2 ) K I (X ) K X X k2 nx nx i2 cnXj2 cnXi2 1 − = k L(j , Xi1 , Xi2 ) x n5/2 cdnx ĥn (Xi1 , Xi2 ) ĥn (Xj1 , Xj2 ) 2 i=1 j=1 k=1 Z Z n + 1 X L(e, Xk1 + cnXk2 v, Xk2 )Knx (v)f ∗ (e)h(Xk1 + cnXk2 v, Xk2 ) =√ k dvde n E ĥn (Xk1 + cnXk2 v, Xk2 ) k=1 Z Z Z n + 1 X Knx (v)h(Xk1 + cnXk2 v, Xk2 ) −√ k L(e, x1 , x2 )f ∗ (e)h(x1 , x2 )dedν(x) dv n E ĥn (Xk1 + cnXk2 v, Xk2 ) k=1 + op (1) from Lemma 5.4. By taking Taylor expansions to remove terms involving cny2 and cnx2 , we observe that this converges in expected mean square to n n 1 X 1 X √ Q2 (k , Xk1 , Xk2 ) = √ k n n k=1 Z ∇y J(e + m(Xk1 , Xk2 ), Xk1 , Xk2 )f ∗ (e)de k=1 n 1 X −√ k n Z Z ∇y J(e + m(x1 , x2 ), x1 , x2 )f ∗ (e)h(x1 , x2 )dedν(x) . k=1 and we observe that these have expectation zero. The finiteness of these mean squares is obtains from Assumptions N1 and N2i. 33 Turning to terms involving ên2 , we observe that n n 1 XX L(j , Xi1 , Xi2 )ên2 (Xi1 , Xi2 ) n2 i=1 j=1 1 = x 6 n cdnx 2 = n X n X n X + m(Xk1 , Xk2 )Knx Xi1 −Xk1 cnXi2 IXi2 (Xk2 ) − m(Xi1 , xi2 ) ĥ (X , X ) n i1 i2 i=1 j=1 k=1 x01 −x1 + 0 Z Z m(x , x )K I (x )h(x , x ) n 1 2 2 1 2 x2 nx cnx0 X 2 0 0 0 L(j , x01 , x02 ) − m(x01 , x02 ) h(x1 , x2 )dx1 dx1 0 0 E ĥ (x , x ) n 1 2 j=1 1 x ncdnx 2 L(j , Xi1 , Xi2 ) + + m(Xk1 , Xk2 )Knx n Z Z 1 X L(e, x01 , x02 )f ∗ (e) x ncdnx 2 k=1 1 + dx ncnx2 n Z Z X x01 −Xk1 cnx0 2 E ĥn (x01 , x02 ) L(e, Xi1 , Xi2 )f ∗ (e) + m(x1 , x2 )Knx Xi1 −x1 cnXi2 I (Xk2 ) x02 0 0 0 − m(x01 , x02 ) h(x1 , x2 )dx1 de IXi2 (x2 )h(x1 , x2 ) E ĥn (Xi1 , Xi2 ) i=1 − m(Xi1 , Xi2 ) dx1 de + op (1) from Lemma 5.4. Here we observe that all three terms in the final expression are o(c2nx2 ) after change of variables and taking Taylor expansions. Thus we have n X n X √ 1 L(j , Xi1 , Xi2 )ên2 (Xi1 , Xi2 ) − Bn2 → 0 n 2 n i=1 j=1 ∗ ∗ in probability. We thus obtain Pna the result from collecting the bias terms Bn1 + Bn2 and applying the the 1 √ central limit theorem to n i=1 (Q1 (i , Xi1 , Xi2 ) + Q2 (i , Xi1 , Xi2 )). We note that (6.52) follows from a second-order Taylor approximation and condition B6. Lemma 6.2. Let {(Xn1 , Xn2 , Yn1 , n ≥ 1} be given as in Section 1, under assumptions K1 - K6, B1-B4 and N1-N2iv for any function J(y1 , x1 , x2 ) satisfying the conditions on Ψ in Assumptions N1-N2iv and H1 and H2, define Z E f˜∗ (y1 |x1 , x2 ) = f ∗ (y1 + cny u − m(x1 , x2 ))Ky (u)du 2 then √ X q Z f˜n (y1 |x1 , x2 ) − q E f˜∗ (y1 |x1 , x2 ) 2 dy1 → 0 (6.53) in probability and further when dy = 1, q 2 X Z p √ ∗ ∗ ˜ n sup J(y1 , x1 , x2 ) E f (y1 |x1 , x2 ) − f (y1 |x1 , x2 ) dy1 → 0 (6.54) n sup J(y1 , x1 , x2 ) (x1 ,x2 )∈X y ∈S ,x ∈S 2 y 2 x (x1 ,x2 )∈X y ∈S 2 y 34 Proof. 
Here we proceed as in the proof of Theorem 3.24 and for any continuous and bounded m̃(x1 , x2 ) define f˜(e, m̃) as in (3.25) along with m− = inf (x1 ,x2 )∈X (m̃(x1 , x2 ) − m(x1 , x2 )), m+ = (m̃(x1 , x2 ) − m(x1 , x2 )) sup (x1 ,x2 )∈X and observe that q 2 Z p ∗ ˜ J(e + m(x1 , x2 ), x1 , x2 ) f (e, m̃) − f (e) de ≤ (m+ − m− )2 sup 2 Z J(e + m(x1 , x2 ), x1 , x2 ) m− ≤η≤m+ (∇y f ∗ (e + η)) de. f ∗ (e) Thus, substituting m̂n for m̃ q 2 Z p ∗ ˜ J(e + m(x1 , x2 ), x1 , x2 ) f (e, m̂n ) − f (e) de → 0 almost surely from the convergence √ !2 n sup |m̂n (x1 , x2 ) − m(x1 , x2 )| →0 (x1 ,x2 )∈X Nadaraya (eg 1989, Chapter 3, Theorem 3.4). We now consider fˆn (e, m̃) as given in (3.26) and demonstrate that for any m̃ with m+ − m− < my we have q 2 Z q √ n J(e + m(x1 , x2 ), x1 , x2 ) fˆn (e, m̃) − E fˆn (e, m̃) de → 0 in probability. To do this we follow the proof of Lemma 5.3 with minor modifications. In particular, integrating over ky1 k > αn − c we complete the square and observe that Z √ n J(e + m(x1 , x2 ), x1 , x2 )E fˆ(e, m̃)de ky1 k>αn −c ≤ √ ≤ 2 nM sup ksk≤bn √ #1/2 ˜ J(e + m(x1 , x2 ) + cny2 s, x1 , x2 ) f (e, m̃)de "Z ke+m(x1 ,x2 )k>αn −c " n #1/2 Z sup J(y1 + η + cny2 s, x1 , x2 ) f (y1 − m(x1 , x2 ))dy1 sup M kηk<my ksk≤bn 2 ∗ ky1 k>αn −c →0 from Assumption N2ii with M = supy Ky . Similarly Z √ n J(y1 , x1 , x2 )fˆ(e, m̃)dy1 ky1 k>αn −c " n Z #1/2 X sup Ky (y) √ ≤ sup sup J(Yi + η + cny2 s, x1 , x2 )2 du n kηk<my ksk≤bn i=1 ky1 k>αn −c → 0. 35 √ √ On the region ky1 k < αn − c we make use of the identity ( a − b)2 ≤ (a − b)2 /a to obtain that √ q Z n J(e + m(x1 , x2 ), x1 , x2 ) fˆn (e, m̃) − q 2 ˆ E fn (e, m̃) de ky1 k≤αn −c ≤ √ Z n 2 fˆn (e, m̃) − E fˆn (e, m̃) J(e + m(x1 , x2 ), x1 , x2 ) de E fˆn (e, m̃) ky1 k≤αn −c which is smaller in expectation than √ +2 Kny (s)f˜(e + cny2 s, m̃)ds |J(e + m(x1 , x2 ), x1 , x2 )| R + de Kny (s)f˜(e + cny2 s, m̃)ds ky1 k≤αn −c Z 1 f ∗ (e + 2t) |J(e + m(x1 , x2 ), x1 , x2 )|de sup sup ≤ √ dy f ∗ (e) kek<αn ktk≤2cny2 bn ncny2 ky1 k≤αn Z 1 d ncnyy 2 →0 from Assumptions N2iii and N2iv. Finally, to demonstrate (6.54) we observe that √ q Z n J(e + m(x1 , x2 ), x1 , x2 ) f˜(e, m̃) − q E fˆn (e, m̃) ≤ √ nc2ny2 b2n Z J(e + m̃(x1 , x2 ), x1 , x2 ) 2 de ∇2 f˜(e + η) f˜(e) 2 de →0 from Assumptions B6, N2i and N1. We now observe that the above results do not depend on m̃ except in so far as sup |m̃(x1 , x2 )−m(x1 , x2 )| < c. Since this is almost surely true of m̂n for n sufficiently large, the result holds. Theorem 6.1. Let {(Xn1 , Xn2 , Yn1 ), n ≥ 1} be given as in Section 1 under model (1.1), under assumptions D1-D2, E1, K1 - K6, B1-B4, P1-P5, N1 - N4 and H1 - H2 define θf = arg min D∞ (f, θ) θ∈Θ and Ṽ D = VJ1 + VJ2 + VJ3 − VJ4 as given in (6.45-6.48) with ∇θ φ(y1 |x1 , x2 , θ) f (y1 |x1 , x2 ) J(y1 , x1 , x2 ) = A1 φ(y1 |x1 , x2 , θ) φ(y1 |x1 , x2 , θ) 36 and H D (θ) = ∇2θ D∞ (f, θ) X Z f (y1 |x1 , x2 ) A2 = ∇2θ φ(y1 |x1 , x2 , θ)h(x1 , x2 )dx1 dy1 φ(y1 |x1 , x2 , θ) x2 ∈Sx X Z f (y1 |x1 , x2 ) ∇θ φ(y1 |x1 , x2 , θ)T T + A3 h(x1 , x2 )dx1 dy1 φ(y1 |x1 , x2 , θ) φ(y1 |x1 , x2 , θ) x2 ∈Sx I (θ) = H (θ)Ṽ D (θ)−1 H D (θ) D D then i √ h n Tn (fˆn ) − θf − Bn∗ → N 0, I D (θf )−1 and i √ h n T̃n (fˆn ) − θf − B̃n∗ → N 0, I D (θf )−1 in distribution where Bn and B̃n are given in Lemma 6.1. Here we note that in the case that f = φθ0 for some θ0 , the θf = θ0 and further since A1 (1) = A2 (1) = A3 (1) = 1 we have that H D (θf ) is given by the Fisher Information for φθ0 . 
In fact, when a regression model with a location-scale family for residuals is proposed for $\phi_\theta$, $\tilde{V}^D(\theta_f)$ is also given by the Fisher Information, yielding an efficient estimate. Specifically, we assume that
\[
\phi_{\theta,\sigma}(y_1|x_1,x_2) = \frac{1}{\sigma}\phi\left(\frac{y_1 - m_\theta(x_1,x_2)}{\sigma}\right)
\]
in which case
\begin{align*}
\Psi_\theta(y_1,x_1,x_2) &= -\frac{\nabla_\theta m_\theta(x_1,x_2)}{\sigma}\,\frac{\phi'(\sigma^{-1}(y_1-m_\theta(x_1,x_2)))}{\phi(\sigma^{-1}(y_1-m_\theta(x_1,x_2)))}, \\
\Psi_\sigma(y_1,x_1,x_2) &= -\frac{1}{\sigma} - \frac{y_1-m_\theta(x_1,x_2)}{\sigma^2}\,\frac{\phi'(\sigma^{-1}(y_1-m_\theta(x_1,x_2)))}{\phi(\sigma^{-1}(y_1-m_\theta(x_1,x_2)))},
\end{align*}
with $\sigma$ scaled so that
\[
\int \frac{\phi'(e)^2}{\phi(e)}\,de = 1.
\]
Substituting $\Psi_\theta$ for $J$ in (6.45-6.48) with $f^*(e) = \phi(e)$ and $m(x_1,x_2) = m_{\theta_0}(x_1,x_2)$ yields
\[
\int \Psi_\sigma(y_1,x_1,x_2)\,\phi_{\theta,\sigma}(y_1,x_1,x_2)\,dy_1 = 0,
\]
from which $V_{J1} = 0$,
\[
V_{J2} = \frac{1}{\sigma^2}\left(\sum_{x_2\in S_x}\int \nabla_\theta m_\theta(x_1,x_2)h(x_1,x_2)\,dx_1\right)\left(\sum_{x_2\in S_x}\int \nabla_\theta m_\theta(x_1,x_2)h(x_1,x_2)\,dx_1\right)^T
\]
and, since
\[
\nabla_y\Psi_\theta(y_1,x_1,x_2) = -\frac{\nabla_\theta m_\theta(x_1,x_2)}{\sigma^2}\left[\frac{\phi''(\sigma^{-1}(y_1-m_\theta(x_1,x_2)))}{\phi(\sigma^{-1}(y_1-m_\theta(x_1,x_2)))} - \frac{\phi'(\sigma^{-1}(y_1-m_\theta(x_1,x_2)))^2}{\phi(\sigma^{-1}(y_1-m_\theta(x_1,x_2)))^2}\right],
\]
we have
\[
V_{J3} - V_{J4} = \frac{1}{\sigma^2}\left[\int\frac{\phi'(e)^2}{\phi(e)}\,de\sum_{x_2\in S_x}\int \left[\nabla_\theta m_\theta(x_1,x_2)\right]\left[\nabla_\theta m_\theta(x_1,x_2)\right]^T h(x_1,x_2)\,dx_1 - \left(\sum_{x_2\in S_x}\int \nabla_\theta m_\theta(x_1,x_2)h(x_1,x_2)\,dx_1\right)\left(\sum_{x_2\in S_x}\int \nabla_\theta m_\theta(x_1,x_2)h(x_1,x_2)\,dx_1\right)^T\right]
\]
giving
\[
\tilde{V}^D_{\theta,\theta} = \frac{1}{\sigma^2}\sum_{x_2\in S_x}\int \nabla_\theta m_\theta(x_1,x_2)\nabla_\theta m_\theta(x_1,x_2)^T\,h(x_1,x_2)\,dx_1.
\]
Similarly, we can demonstrate that
\[
\tilde{V}^D_{\sigma,\sigma} = \frac{4}{\sigma^4}, \qquad \tilde{V}^D_{\theta,\sigma} = 0,
\]
giving the usual Fisher Information.
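As a concrete check of the scaling condition $\int \phi'(e)^2/\phi(e)\,de = 1$: for the standard normal density, $\phi'(e)/\phi(e) = -e$, so the integral is $\int e^2\phi(e)\,de = 1$ and no rescaling is needed. A symbolic verification of this single fact (an illustration of ours, not from the paper):

```python
import sympy as sp

e = sp.symbols('e', real=True)
phi = sp.exp(-e**2 / 2) / sp.sqrt(2 * sp.pi)           # standard normal density
fisher = sp.integrate(sp.diff(phi, e)**2 / phi, (e, -sp.oo, sp.oo))
print(sp.simplify(fisher))                              # prints 1: the scaling holds
```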
Proof. Since our proof remains nearly identical to that of Theorem 5.1, up to the use of Lemmas 6.1 and 6.2, we will define $\bar{T}_n$, $\bar{f}_n(y_1|x_1,x_2)$ and $f_K(y_1|x_1,x_2)$ to take one of the following combinations of values:
\[
\left(\bar{T}_n,\ \bar{f}_n,\ f_K\right) = \left(T_n,\ \hat{f}_n(y_1|x_1,x_2),\ E\hat{f}_n(y_1|x_1,x_2)\right) \quad\text{or}\quad \left(\tilde{T}_n,\ \tilde{f}_n(y_1|x_1,x_2),\ E\tilde{f}^*_n(y_1|x_1,x_2)\right).
\]
Our arguments here follow those in Tamura and Boos (1986) and Park and Basu (2004). Since $\bar{T}_n(f)$ satisfies $\nabla_\theta D_n(f,\theta) = 0$, we can write
\[
\sqrt{n}\left(\bar{T}_n(\bar{f}_n) - \theta_0\right) = -\left[\nabla^2_\theta D_n(\bar{f}_n,\theta^+)\right]^{-1}\sqrt{n}\,\nabla_\theta D_n(\bar{f}_n,\theta_0)
\]
for some $\theta^+$ between $\bar{T}_n(\bar{f}_n)$ and $\theta_0$. It is therefore sufficient to demonstrate

(i) $\nabla^2_\theta D_n(\bar{f}_n,\theta^+) \to H^D(\theta_0)$ in probability;

(ii) $\sqrt{n}\left(\nabla_\theta D_n(\bar{f}_n,\theta_0) - \bar{B}_n\right) \to N(0,\tilde{V}^D(\theta_0))$ in distribution,

with $\bar{B}_n$ given by $B^*_n$ or $\tilde{B}^*_n$ as appropriate.

We begin with (i), where we observe that by Assumption N4, $A_2(r)$ and $A_3(r)$ are bounded, and the result follows from Theorems 2.5 or 3.1 and 4.2 and the dominated convergence theorem. In the case of Hellinger distance,
\[
\nabla^2_\theta D(\bar{f}_n,\phi|x_1,x_2,\theta) = -\int\left[\frac{\nabla^2_\theta\phi(y_1,x_1,x_2,\theta)}{\phi(y_1,x_1,x_2,\theta)^{1/2}} - \frac{\nabla_\theta\phi(y_1,x_1,x_2,\theta)\nabla_\theta\phi(y_1,x_1,x_2,\theta)^T}{2\,\phi(y_1,x_1,x_2,\theta)^{3/2}}\right]\sqrt{\bar{f}_n(y_1|x_1,x_2)}\,dy_1
\]
and the result follows from the uniform $L_1$ convergence of $\bar{f}_n$ (Theorem 2.4) and Assumption N3.

Turning to (ii), we observe that by the boundedness of $C$ and the dominated convergence theorem we can write $\nabla_\theta D_n(\bar{f}_n,\phi|x_1,x_2,\theta)$ as
\begin{align*}
\int A_2&\left(\frac{\bar{f}_n(y_1|x_1,x_2)}{\phi(y_1|x_1,x_2,\theta)}\right)\nabla_\theta\phi(y_1|x_1,x_2,\theta)\,dy_1 - \bar{B}_{n1} \\
&= \int A_1\left(\frac{f_K(y_1|x_1,x_2)}{\phi(y_1|x_1,x_2,\theta)}\right)\frac{\nabla_\theta\phi(y_1|x_1,x_2,\theta)}{\phi(y_1|x_1,x_2,\theta)}\left(\bar{f}_n(y_1|x_1,x_2) - f_K(y_1|x_1,x_2)\right)dy_1 \\
&\quad + \int\left[A_2\left(\frac{\bar{f}_n(y_1|x_1,x_2)}{\phi(y_1|x_1,x_2,\theta)}\right) - A_2\left(\frac{f_K(y_1|x_1,x_2)}{\phi(y_1|x_1,x_2,\theta)}\right)\right]\nabla_\theta\phi(y_1|x_1,x_2,\theta)\,dy_1 \\
&\quad - \int A_1\left(\frac{f_K(y_1|x_1,x_2)}{\phi(y_1|x_1,x_2,\theta)}\right)\left(\frac{\bar{f}_n(y_1|x_1,x_2)}{\phi(y_1|x_1,x_2,\theta)} - \frac{f_K(y_1|x_1,x_2)}{\phi(y_1|x_1,x_2,\theta)}\right)\nabla_\theta\phi(y_1|x_1,x_2,\theta)\,dy_1.
\end{align*}
From a minor modification of Lemma 25 of Lindsay (1994), we have by the boundedness of $A_1$ and $A_2$ that there is a constant $B$ such that
\[
\left|A_2(r^2) - A_2(s^2) - (r^2 - s^2)A_1(s^2)\right| \le (r-s)^2 B.
\]
Substituting
\[
r = \sqrt{\frac{\bar{f}_n(y_1|x_1,x_2)}{\phi(y_1|x_1,x_2,\theta)}}, \qquad s = \sqrt{\frac{f_K(y_1|x_1,x_2)}{\phi(y_1|x_1,x_2,\theta)}},
\]
we obtain
\begin{align*}
\int A_2\left(\frac{\bar{f}_n(y_1|x_1,x_2)}{\phi(y_1|x_1,x_2,\theta)}\right)&\nabla_\theta\phi(y_1|x_1,x_2,\theta)\,dy_1 - \bar{B}_n \\
&= \int A_1\left(\frac{f_K(y_1|x_1,x_2)}{\phi(y_1|x_1,x_2,\theta)}\right)\frac{\nabla_\theta\phi(y_1|x_1,x_2,\theta)}{\phi(y_1|x_1,x_2,\theta)}\left(\bar{f}_n(y_1|x_1,x_2) - f_K(y_1|x_1,x_2)\right)dy_1 \\
&\quad + B\int\left(\sqrt{\bar{f}_n(y_1|x_1,x_2)} - \sqrt{f_K(y_1|x_1,x_2)}\right)^2\frac{\nabla_\theta\phi(y_1|x_1,x_2,\theta)}{\phi(y_1|x_1,x_2,\theta)}\,dy_1.
\end{align*}
The result now follows from Lemmas 6.1 and 6.2 with the additional centering of $\bar{B}_{n2}$. For the special case of Hellinger distance, we observe that
\[
\nabla_\theta D_n(\bar{f}_n,\phi|x_1,x_2,\theta) = \int\frac{\nabla_\theta\phi(y_1,x_1,x_2,\theta)}{\sqrt{\phi(y_1,x_1,x_2,\theta)}}\sqrt{\bar{f}_n(y_1|x_1,x_2)}\,dy_1
\]
and the result is obtained by applying the identity
\[
\sqrt{a} - \sqrt{b} = \frac{a-b}{2\sqrt{a}} + \frac{(\sqrt{b}-\sqrt{a})^2}{2\sqrt{a}}
\]
with $a = f_K(y_1|x_1,x_2)$ and $b = \bar{f}_n(y_1|x_1,x_2)$, along with Assumption N3.

7 Robustness Properties

An important motivator for the study of disparity methods is that, in addition to providing the statistical efficiency demonstrated above, they are robust to contamination from outlying observations. Here we investigate the robustness of our estimates through the tools of influence functions and breakdown points. These have been studied for i.i.d. data in Beran (1977), Park and Basu (2004) and Lindsay (1994), and the extension to conditional models follows similar lines. In particular, we examine two models for contamination (a simulation sketch follows the list):

1. To mimic the "homoscedastic" case, we contaminate $g(x_1,x_2,y_1,y_2)$ with outliers independent of $(x_1,x_2)$. That is, we define the contaminating density
\[
g_{\epsilon,z}(x_1,x_2,y_1,y_2) = (1-\epsilon)g(x_1,x_2,y_1,y_2) + \epsilon\,\delta_z(y_1,y_2)h(x_1,x_2) \tag{7.55}
\]
where $\delta_z$ is a contamination density parameterized by $z$. Typically, we think of $\delta_z$ as having small support centered around $z$. This results in the conditional density
\[
f_{\epsilon,z}(y_1,y_2|x_1,x_2) = (1-\epsilon)f(y_1,y_2|x_1,x_2) + \epsilon\,\delta_z(y_1,y_2),
\]
which we think of as the result of smoothing a contaminated residual density. We note that we have not changed the marginal distribution of $(x_1,x_2)$ via this contamination, and that this model particularly applies to the case where only $y_1$ is present and the estimate (1.6-1.8) is employed.

2. In the more general setting, we set
\[
g_{\epsilon,z}(x_1,x_2,y_1,y_2) = (1-\epsilon)g(x_1,x_2,y_1,y_2) + \epsilon\,\delta_z(y_1,y_2)J_U(x_1,x_2)h(x_1,x_2) \tag{7.56}
\]
where $J_U(x_1,x_2)$ is the indicator of $(x_1,x_2)\in U$ scaled so that $h(x_1,x_2)J_U(x_1,x_2)$ is a distribution. This translates to the conditional density
\[
f_{\epsilon,z}(y_1,y_2|x_1,x_2) = \begin{cases} f(y_1,y_2|x_1,x_2) & (x_1,x_2)\notin U \\ (1-\epsilon)f(y_1,y_2|x_1,x_2) + \epsilon\,\delta_z(y_1,y_2) & (x_1,x_2)\in U \end{cases}
\]
which localizes contamination in covariate space. Note that the marginal distribution is now scaled differently in $U$.
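Both contamination schemes are straightforward to simulate. The sketch below draws from a simplified variant of (7.55) and (7.56), assuming a Gaussian base model and taking $\delta_z$ to be a narrow normal centered at $z$; for the localized model it contaminates with probability $\epsilon$ only inside $U$, which ignores the marginal reweighting in (7.56). All names and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_contaminated(n, eps, z, m, localized=False, U=(0.4, 0.6)):
    # Draw (x, y): the clean model is y = m(x) + N(0, 0.2^2) with x ~ Uniform(0, 1).
    x = rng.uniform(0, 1, n)
    y = m(x) + rng.normal(0, 0.2, n)
    if localized:
        # contamination confined to the covariate region U, as in (7.56)
        flip = (x > U[0]) & (x < U[1]) & (rng.uniform(size=n) < eps)
    else:
        # contamination independent of x, as in (7.55)
        flip = rng.uniform(size=n) < eps
    y[flip] = rng.normal(z, 0.05, flip.sum())   # replace by draws from delta_z
    return x, y

x, y = sample_contaminated(1000, eps=0.1, z=10.0, m=lambda t: np.sin(2 * np.pi * t))
```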
Naturally, the characterization (7.55) does not account for the effect of outliers on the Nadaraya-Watson estimator (1.6). If these are localized in covariate space, however, we can think of (1.8) as being approximately a mixture of the two cases above. As we will see, the distinction between these two models will not affect the basic properties below. Throughout, we will write $\delta_z(y_1,y_2|x_1,x_2)$ in place of $\delta_z(y_1,y_2)$ or $\delta_z(y_1,y_2)J_U(x_1,x_2)$ as appropriate, and $h(x_1,x_2)$ will be taken to be modified according to (7.56) where appropriate.

We must first place some conditions on $\delta_z$:

D1 $\delta_z$ is orthogonal in the limit to $f$; in particular,
\[
\lim_{z\to\infty}\sum_{y_2\in S_y}\int \delta_z(y_1,y_2|x_1,x_2)f(y_1,y_2|x_1,x_2)\,dy_1 = 0 \quad \forall\,(x_1,x_2).
\]

D2 $\delta_z$ is orthogonal in the limit to $\phi$:
\[
\lim_{z\to\infty}\sum_{y_2\in S_y}\int \delta_z(y_1,y_2|x_1,x_2)\phi(y_1,y_2|x_1,x_2,\theta)\,dy_1 = 0 \quad \forall\,(x_1,x_2).
\]

D3 $\phi$ becomes orthogonal to $f$ for large $\theta$:
\[
\lim_{\|\theta\|\to\infty}\sum_{y_2\in S_y}\int f(y_1,y_2|x_1,x_2)\phi(y_1,y_2|x_1,x_2,\theta)\,dy_1 = 0 \quad \forall\,(x_1,x_2).
\]

D4 $C(-1)$ and $C'(\infty)$ are both finite, or the disparity is Hellinger distance.

In the following result we use $T[f] = \arg\min_\theta D_\infty(f,\theta)$ for any $f$ in place of our estimate $\hat{\theta}$.

Theorem 7.1. Under assumptions D1-D4, under both contamination models (7.55) and (7.56), define $\epsilon^*$ to satisfy
\[
(1-2\epsilon^*)C'(\infty) = \inf_{\theta\in\Theta} D((1-\epsilon^*)f,\theta) - \lim_{z\to\infty}\inf_{\theta\in\Theta} D(\epsilon^*\delta_z,\theta) \tag{7.57}
\]
with $C'(\infty)$ replaced by 1 in the case of Hellinger distance. Then for $\epsilon < \epsilon^*$,
\[
\lim_{z\to\infty} T[f_{\epsilon,z}] = T[(1-\epsilon)f],
\]
and in particular the breakdown point is at least $\epsilon^*$:
\[
\sup_z \left\|T[f_{\epsilon,z}] - T[(1-\epsilon)f]\right\| < \infty.
\]

These results mimic Beran (1977) in demonstrating that the $\alpha$-level influence functions
\[
IF_\alpha(\delta_z) = \alpha^{-1}\left\|T[f_{\alpha,z}] - T[f]\right\|
\]
are bounded and tend to zero as $z\to\infty$. The breakdown point is a natural consequence.

Proof. We begin by observing that, by Assumption D1, for any fixed $\theta$,
\begin{align*}
D(f_{\epsilon,z},\theta) &= \int\!\!\int_{A_z(x)} C\left(\frac{f_{\epsilon,z}(y|x)}{\phi(y|x,\theta)} - 1\right)\phi(y|x,\theta)h(x)\,d\mu(y)\,d\nu(x) + \int\!\!\int_{A^c_z(x)} C\left(\frac{f_{\epsilon,z}(y|x)}{\phi(y|x,\theta)} - 1\right)\phi(y|x,\theta)h(x)\,d\mu(y)\,d\nu(x) \\
&= D_{A_z}(f_{\epsilon,z},\theta) + D_{A^c_z}(f_{\epsilon,z},\theta)
\end{align*}
where $A_z(x) = \{y : \max(f(y|x),\phi(y|x,\theta)) > \delta_z(y|x)\}$. We note that for any $\eta$, with $z$ sufficiently large,
\[
\sup_{x\in\mathcal{X}}\,\sup_{y\in A_z(x)}\delta_z(y|x) < \eta \quad\text{and}\quad \sup_{x\in\mathcal{X}}\,\sup_{y\in A^c_z(x)} f(y|x) < \eta,
\]
and thus, for sufficiently large $z$,
\[
\left|D(f_{\epsilon,z},\theta) - \left[D_{A_z}((1-\epsilon)f,\theta) + D_{A^c_z}(\epsilon\delta_z,\theta)\right]\right| \le |\eta|\sup_t |C'(t)|,
\]
hence
\[
\sup_\theta \left|D(f_{\epsilon,z},\theta) - \left[D_{A_z}((1-\epsilon)f,\theta) + D_{A^c_z}(\epsilon\delta_z,\theta)\right]\right| \to 0. \tag{7.58}
\]
We also observe that for any fixed $\theta$,
\[
D_{A^c_z}(\epsilon\delta_z,\theta) = \int\!\!\int_{A^c_z(x)} C\left(\frac{2\epsilon\delta_z(y|x)}{\phi(y|x,\theta)} - 1\right)\phi(y|x,\theta)h(x)\,d\mu(y)\,d\nu(x) - \epsilon\int\!\!\int_{A^c_z(x)} \delta_z(y|x)\,C'(t(y,x))\,d\mu(y)\,d\nu(x) \to \epsilon\, C'(\infty)
\]
for $t(y,x)$ between $\epsilon\delta_z(y|x)/\phi(y|x,\theta)$ and $2\epsilon\delta_z(y|x)/\phi(y|x,\theta)$, since $t(y,x)\to\infty$, $C(\cdot)$ and $C'(\cdot)$ are bounded, and $\int_{A^c_z(x)}\phi(y|x,\theta)\,d\mu(y)\to 0$. Similarly, $D_{A_z}((1-\epsilon)f,\theta) \to D((1-\epsilon)f,\theta)$, and thus
\[
D(f_{\epsilon,z},\theta) \to D((1-\epsilon)f,\theta) + \epsilon\,C'(\infty),
\]
which is minimized at $\theta = T[(1-\epsilon)f]$.

It remains to rule out divergent sequences $\|\theta_z\|\to\infty$. In this case, we define $B_z(x) = \{y : f(y|x) > \max(\delta_z(y|x),\phi(y|x,\theta_z))\}$ and note that from the arguments above
\[
D_{B_z}((1-\epsilon)f,\theta_z) \to (1-\epsilon)C'(\infty) \quad\text{and}\quad D_{B^c_z}(\epsilon\delta_z,\theta_z) \to D(\epsilon\delta_z,\theta_z),
\]
and hence
\[
\lim_{z\to\infty} D(f_{\epsilon,z},\theta_z) > \lim_{z\to\infty}\inf_{\theta\in\Theta} D(\epsilon\delta_z,\theta) + (1-\epsilon)C'(\infty) > \lim_{z\to\infty} D(f_{\epsilon,z},T[(1-\epsilon)f])
\]
from (7.57), yielding a contradiction.

In the case of Hellinger distance, we observe
\begin{align*}
\Big|D(f_{\epsilon,z},\theta) &- \big(D((1-\epsilon)f,\theta) + D(\epsilon\delta_z,\theta)\big)\Big| \\
&= \left|\int\!\!\int \phi^{\frac12}(y|x,\theta)\left[f^{\frac12}_{\epsilon,z}(y|x) - (1-\epsilon)^{\frac12}f^{\frac12}(y|x) - \epsilon^{\frac12}\delta_z^{\frac12}(y|x)\right]h(x)\,d\mu(y)\,d\nu(x)\right| \\
&\le \int\!\!\int \left(f^{\frac12}_{\epsilon,z}(y|x) - (1-\epsilon)^{\frac12}f^{\frac12}(y|x) - \epsilon^{\frac12}\delta_z^{\frac12}(y|x)\right)^2 h(x)\,d\mu(y)\,d\nu(x) \\
&= \int\!\!\int \left[2(1-\epsilon)f(y|x) + 2\epsilon\delta_z(y|x)\right]h(x)\,d\mu(y)\,d\nu(x) - 2\int\!\!\int f^{\frac12}_{\epsilon,z}(y|x)\left[(1-\epsilon)^{\frac12}f^{\frac12}(y|x) + \epsilon^{\frac12}\delta_z^{\frac12}(y|x)\right]h(x)\,d\mu(y)\,d\nu(x),
\end{align*}
where, by dividing the range of $y$ into $A_z(x)$ and $A^c_z(x)$ as above, we find that on $A_z(x)$, for any $\eta > 0$ and $z$ sufficiently large,
\[
(1-\epsilon)f(y|x) - f^{\frac12}_{\epsilon,z}(y|x)\left[(1-\epsilon)^{\frac12}f^{\frac12}(y|x) + \epsilon^{\frac12}\delta_z^{\frac12}(y|x)\right] \le \eta^{\frac12}f_{\epsilon,z}(y|x) + \eta,
\]
which, with the corresponding arguments on $A^c_z(x)$, yields (7.58). We further observe that for fixed $\theta$,
\[
D(\epsilon\delta_z,\theta) = 1 + \epsilon - 2\int\!\!\int \sqrt{\epsilon\,\delta_z(y|x)\phi(y|x,\theta)}\,h(x)\,d\mu(y)\,d\nu(x) \to 1 + \epsilon,
\]
and, for $\|\theta_z\|\to\infty$,
\[
D((1-\epsilon)f,\theta_z) = 2 - \epsilon - 2\sqrt{1-\epsilon}\int\!\!\int \sqrt{f(y|x)\phi(y|x,\theta_z)}\,h(x)\,d\mu(y)\,d\nu(x) \to 2 - \epsilon,
\]
from which the result follows by the same arguments as above.
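The conclusion of this theorem, that $T[f_{\epsilon,z}]$ returns to $T[(1-\epsilon)f]$ as the contamination moves off to infinity, is easy to see numerically. A sketch for a Gaussian location family under Hellinger distance, with $\epsilon = 0.1$ and $\delta_z$ a narrow normal at $z$; the grid, bounds and direct use of mixture densities in place of kernel estimates are illustrative simplifications of ours:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

y = np.linspace(-20, 40, 6000)
dy = y[1] - y[0]

def T_of(f_vals):
    # minimum-Hellinger-distance location functional T[f], computed on the grid
    obj = lambda th: np.sum((np.sqrt(f_vals) - np.sqrt(norm.pdf(y, th, 1.0))) ** 2) * dy
    return minimize_scalar(obj, bounds=(-10.0, 30.0), method='bounded').x

eps = 0.1
for z in [3.0, 6.0, 10.0, 20.0]:
    f_eps_z = (1 - eps) * norm.pdf(y, 0.0, 1.0) + eps * norm.pdf(y, z, 0.1)
    print(z, T_of(f_eps_z))   # drifts slightly at moderate z, returns near 0 as z grows
```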
These results extend Park and Basu (2004) and Beran (1977) in a number of ways, and a few remarks are warranted:

1. In Beran (1977), $\Theta$ is assumed to be compact, allowing $\theta_z$ to converge at least on a subsequence. This removes the $\|\theta_z\|\to\infty$ case, and the result can be shown for $\epsilon\in[0,1)$.

2. We have not assumed that the uncontaminated density $f$ is a member of the parametric class $\phi_\theta$. If $f = \phi_{\theta_0}$ for some $\theta_0$, then we observe that by Jensen's inequality
\[
D((1-\epsilon)\phi_{\theta_0},\theta) > C(-\epsilon) = D((1-\epsilon)\phi_{\theta_0},\theta_0),
\]
hence $T[(1-\epsilon)\phi_{\theta_0}] = \theta_0$. We can further bound $D(\epsilon\delta_z,\theta) > C(\epsilon-1)$, in which case (7.57) can be bounded by
\[
(1-2\epsilon^*)C'(\infty) \ge C(-\epsilon) - C(\epsilon-1),
\]
which is satisfied for $\epsilon = 1/2$. We note that, under the more general condition, if $(1-\epsilon)f$ is closer to the family $\phi_\theta$ than $\epsilon\delta_z$ at $\epsilon = 1/2$, the breakdown point will be greater than 1/2; in the reverse situation it will be smaller.

The influence function, defined in Hampel (1974) as the limit of $\alpha$-level influence functions, can be written, using the implicit function theorem, as $IF_\infty(\delta_z) = H^D(\theta_0)^{-1}F(\delta_z)$ for
\[
F(\delta_z) = \int\!\!\int A_3\left(\frac{f(y|x)}{\phi(y|x,\theta_0)}\right)\frac{\nabla_\theta\phi(y|x,\theta)}{\phi(y|x,\theta)}\left(\delta_z(y|x) - f(y|x)\right)h(x)\,d\mu(y)\,d\nu(x),
\]
which at $f = \phi_{\theta_0}$ becomes independent of $C$, reducing to
\[
IF_\infty(\delta_z) = I(\theta_0)^{-1}\int\!\!\int \frac{\nabla_\theta\phi(y|x,\theta_0)}{\phi(y|x,\theta_0)}\,\delta_z(y|x)\,d\mu(y)\,d\nu(x).
\]
This is the analogue of the result observed in Lindsay (1994) that the influence function does not differ across disparity methods at the parametric family, and corresponds to the observation in Beran (1977) that this limit does not distinguish robust estimators in this case.
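The contrast drawn here can also be seen numerically: the $\alpha$-level influence function of the minimum-Hellinger functional is bounded and redescends in $z$, whereas the linearized limit $IF_\infty(\delta_z)$, which for a Gaussian location model is approximately $z - \theta_0$, grows without bound. A sketch in the setup of the previous snippet, with parameters that are again illustrative:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

y = np.linspace(-20, 40, 6000)
dy = y[1] - y[0]

def T_of(f_vals):
    # minimum-Hellinger-distance location functional, as in the previous sketch
    obj = lambda th: np.sum((np.sqrt(f_vals) - np.sqrt(norm.pdf(y, th, 1.0))) ** 2) * dy
    return minimize_scalar(obj, bounds=(-10.0, 30.0), method='bounded').x

alpha = 0.05
f_clean = norm.pdf(y, 0.0, 1.0)
for z in [2.0, 5.0, 10.0, 20.0]:
    f_alpha_z = (1 - alpha) * f_clean + alpha * norm.pdf(y, z, 0.1)
    if_alpha = (T_of(f_alpha_z) - T_of(f_clean)) / alpha   # alpha-level influence
    print(z, if_alpha, z)   # if_alpha stays bounded; the linearized IF ~ z grows
```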
References

Beran, R. (1977). Minimum Hellinger distance estimates for parametric models. Annals of Statistics 5, 445-463.

Cheng, A.-L. and A. N. Vidyashankar (2006). Minimum Hellinger distance estimation for randomized play the winner design. Journal of Statistical Planning and Inference 136, 1875-1910.

Devroye, L. and G. Lugosi (2001). Combinatorial Methods in Density Estimation. New York: Springer.

Devroye, L. and L. Györfi (1985). Nonparametric Density Estimation: The L1 View. New York: Wiley.

Giné, E. and A. Guillou (2002). Rates of strong consistency for multivariate kernel density estimators. Ann. Inst. H. Poincaré Probab. Statist. 38, 907-922.

Hampel, F. R. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association 69, 383-393.

Hansen, B. E. (2004). Nonparametric conditional density estimation. Unpublished manuscript.

Hoeffding, W. (1948). A class of statistics with asymptotically normal distributions. Annals of Mathematical Statistics 19, 293-325.

Li, Q. and J. S. Racine (2007). Nonparametric Econometrics. Princeton: Princeton University Press.

Lindsay, B. G. (1994). Efficiency versus robustness: The case for minimum Hellinger distance and related methods. Annals of Statistics 22, 1081-1114.

Nadaraya, E. A. (1989). Nonparametric Estimation of Probability Densities and Regression Curves. Dordrecht: Kluwer.

Pak, R. J. and A. Basu (1998). Minimum disparity estimation in linear regression models: Distribution and efficiency. Annals of the Institute of Statistical Mathematics 50, 503-521.

Park, C. and A. Basu (2004). Minimum disparity estimation: Asymptotic normality and breakdown point results. Bulletin of Informatics and Cybernetics 36.

Tamura, R. N. and D. D. Boos (1986). Minimum Hellinger distance estimation for multivariate location and covariance. Journal of the American Statistical Association 81, 223-229.

Weber, N. C. (1983). Central limit theorems for a class of symmetric statistics. Math. Proc. Camb. Phil. Soc. 94, 307-313.