Consistency, Efficiency and Robustness of Conditional Disparity Methods

Giles Hooker and Anand Vidyashankar

November 18, 2012
Abstract

This report demonstrates the consistency and asymptotic normality of minimum-disparity estimators based on conditional density estimates. In particular, it provides $L_1$ consistency results for conditional density estimators, both unrestricted and under homoscedastic model restrictions. It defines a generic formulation of disparity estimates in conditionally-specified models and demonstrates their asymptotic consistency and normality. In particular, for regression models with more than one continuous response or covariate, we demonstrate that disparity estimators based on unrestricted conditional density estimates have a bias that does not vanish after $\sqrt{n}$ rescaling; however, for univariate homoscedastic models conditioned on a univariate continuous covariate, this bias can be removed.
1 Framework and Assumptions
Throughout the following, we assume that we observe $\{X_{n1}(\omega), X_{n2}(\omega), Y_{n1}(\omega), Y_{n2}(\omega),\ n \ge 1\}$, i.i.d. random variables with $X_{n1}(\omega) \in \mathbb{R}^{d_x}$, $X_{n2}(\omega) \in S_x$, $Y_{n1}(\omega) \in \mathbb{R}^{d_y}$, $Y_{n2}(\omega) \in S_y$ for countable sets $S_x$ and $S_y$, with joint distribution
$$g(x_1, x_2, y_1, y_2) = P(X_2 = x_2, Y_2 = y_2)\, P(X_1 \in dx_1, Y_1 \in dy_1 \mid x_2, y_2)$$
and define the marginal and conditional densities
$$h(x_1, x_2) = \sum_{y_2 \in S_y} \int g(x_1, x_2, y_1, y_2)\, dy_1, \qquad
f(y_1, y_2 | x_1, x_2) = \frac{g(x_1, x_2, y_1, y_2)}{h(x_1, x_2)}$$
on the support of $(x_1, x_2)$. Further, for a scalar $Y_{n1} \in \mathbb{R}$, removing $Y_{n2}$ (or setting it to be always identically 0) we define the conditional expectation
$$m(x_1, x_2) = \int y_1 f(y_1 | x_1, x_2)\, dy_1$$
and note that in homoscedastic regression models we write, for some density $f^*(e)$,
$$f(y_1 | x_1, x_2) = f^*(y_1 - m(x_1, x_2)). \qquad (1.1)$$
Within this context, the use of disparity methods to estimate parameters in linear regression was treated
in Pak and Basu (1998) from the point of view of placing a disparity on the score equations. We take a
considerably more direct approach here.
The following regularity structures may be assumed in the theorems below:
(D1) g is bounded and continuous in x1 and y1 .
(D2) All third derivatives of g with respect to x1 and y1 exist, are continuous and bounded.
(D3) The support of $h$, $\mathcal{X} \subset \mathbb{R}^{d_x} \otimes S_x$, is compact and
$$h^- = \inf_{(x_1, x_2) \in \mathcal{X}} h(x_1, x_2) > 0.$$
We note that under these conditions, continuity of h and f in x1 and y1 is inherited from g. In the case of
homoscedastic models we also assume
(E1) $\int |\nabla_e f^*(e)|\, de < \infty$.
We will form kernel estimates for these quantities as follows:
$$\hat g_n(y_1, y_2, x_1, x_2, \omega) = \frac{1}{n c_{nx_2}^{d_x} c_{ny_2}^{d_y}} \sum_{i=1}^n K_x\!\left(\frac{x_1 - X_{i1}(\omega)}{c_{nx_2}}\right) K_y\!\left(\frac{y_1 - Y_{i1}(\omega)}{c_{ny_2}}\right) I_{x_2}(X_{i2}(\omega)) I_{y_2}(Y_{i2}(\omega)) \qquad (1.2)$$

$$\hat h_n(x_1, x_2, \omega) = \frac{1}{n c_{nx_2}^{d_x}} \sum_{i=1}^n K_x\!\left(\frac{x_1 - X_{i1}(\omega)}{c_{nx_2}}\right) I_{x_2}(X_{i2}(\omega)) = \sum_{y_2 \in S_y} \int_{\mathbb{R}^{d_y}} \hat g_n(x_1, x_2, y_1, y_2, \omega)\, dy_1 \qquad (1.3)$$

$$\hat f_n(y_1, y_2 | x_1, x_2, \omega) = \frac{\hat g_n(x_1, x_2, y_1, y_2, \omega)}{\hat h_n(x_1, x_2, \omega)} \qquad (1.4)$$
where Kx and Ky are densities on the spaces Rdx and Rdy respectively. Further conditions on these are
detailed below.
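As a concrete numerical illustration of (1.2)-(1.4) (not part of the original development), the following sketch implements the mixed-type estimators for the simplest case $d_x = d_y = 1$ with Gaussian kernels; the data-generating model, bandwidth choice and variable names are illustrative assumptions only.

```python
import numpy as np
from scipy.stats import norm

def g_hat(x1, x2, y1, y2, X1, X2, Y1, Y2, cx, cy):
    """Joint estimate (1.2): Gaussian kernels on the continuous coordinates,
    indicator (exact) matching on the discrete coordinates."""
    match = (X2 == x2) & (Y2 == y2)
    return np.mean(match * norm.pdf((x1 - X1) / cx) * norm.pdf((y1 - Y1) / cy)) / (cx * cy)

def h_hat(x1, x2, X1, X2, cx):
    """Covariate estimate (1.3)."""
    return np.mean((X2 == x2) * norm.pdf((x1 - X1) / cx)) / cx

def f_hat(y1, y2, x1, x2, X1, X2, Y1, Y2, cx, cy):
    """Conditional density estimate (1.4) = (1.2) / (1.3)."""
    return g_hat(x1, x2, y1, y2, X1, X2, Y1, Y2, cx, cy) / h_hat(x1, x2, X1, X2, cx)

# Illustrative data: Y1 | X1 = x1, X2 = x2 ~ N(x1 + x2, 1), with Y2 degenerate at 0.
rng = np.random.default_rng(0)
n = 2000
X1 = rng.uniform(-1, 1, n)
X2 = rng.integers(0, 2, n)          # discrete covariate in {0, 1}
Y1 = X1 + X2 + rng.normal(size=n)
Y2 = np.zeros(n, dtype=int)
cx = cy = n ** (-1 / 6)             # an ad hoc bandwidth choice

print(f_hat(0.5, 0, 0.0, 1, X1, X2, Y1, Y2, cx, cy))   # kernel estimate of f(0.5, 0 | 0, 1)
print(norm.pdf(0.5, loc=1.0))                           # true value, approximately 0.352
```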
In the case of homoscedastic regression models we further define a homoscedastic conditional density via
the Nadaraya-Watson estimator:
$$\hat m_n(x_1, x_2, \omega) = \frac{\sum_{i=1}^n Y_{i1}(\omega)\, K_x\!\left(\frac{x_1 - X_{i1}(\omega)}{c_{nx_2}}\right) I_{x_2}(X_{i2}(\omega))}{\sum_{i=1}^n K_x\!\left(\frac{x_1 - X_{i1}(\omega)}{c_{nx_2}}\right) I_{x_2}(X_{i2}(\omega))} \qquad (1.6)$$

$$\hat f^*_n(e, \omega) = \frac{1}{n c_{ny_2}^{d_y}} \sum_{i=1}^n K_y\!\left(\frac{e - (Y_i(\omega) - \hat m_n(X_{i1}(\omega), X_{i2}(\omega)))}{c_{ny_2}}\right) \qquad (1.7)$$

$$\tilde f_n(y_1 | x_1, x_2, \omega) = \hat f^*_n\!\left(y_1 - \hat m_n(x_1, x_2, \omega), \omega\right) \qquad (1.8)$$
with notation cny2 maintained as a bandwidth for the sake of consistency.
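A minimal sketch of (1.6)-(1.8) under the same illustrative assumptions as above (univariate continuous coordinates, Gaussian kernels); the homoscedastic data-generating model and parameter values are ours and purely for illustration.

```python
import numpy as np
from scipy.stats import norm

def m_hat(x1, x2, X1, X2, Y1, cx):
    """Nadaraya-Watson estimate (1.6), using only observations with X2 = x2."""
    w = (X2 == x2) * norm.pdf((x1 - X1) / cx)
    return np.sum(w * Y1) / np.sum(w)

def f_star_hat(e, resid, cy):
    """Residual density estimate (1.7), built from the fitted residuals."""
    return np.mean(norm.pdf((e - resid) / cy)) / cy

def f_tilde(y1, x1, x2, X1, X2, Y1, resid, cx, cy):
    """Homoscedastic conditional density estimate (1.8)."""
    return f_star_hat(y1 - m_hat(x1, x2, X1, X2, Y1, cx), resid, cy)

rng = np.random.default_rng(0)
n = 1000
X1 = rng.uniform(-1, 1, n)
X2 = rng.integers(0, 2, n)
Y1 = X1 + X2 + rng.normal(size=n)           # homoscedastic model of the form (1.1)
cx = cy = n ** (-1 / 5)

# Residuals Y_i - m_hat(X_i1, X_i2), computed once and reused in (1.7)-(1.8).
resid = Y1 - np.array([m_hat(a, b, X1, X2, Y1, cx) for a, b in zip(X1, X2)])
print(f_tilde(0.5, 0.0, 1, X1, X2, Y1, resid, cx, cy))   # estimate of f(0.5 | 0, 1)
print(norm.pdf(0.5, loc=1.0))                             # true value under this illustrative model
```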
For the sake of notational compactness, we define measures $\nu$ and $\mu$ on $\mathbb{R}^{d_x} \otimes S_x$ and $\mathbb{R}^{d_y} \otimes S_y$ respectively, each given by the product of counting and Lebesgue measure, and define $y = (y_1, y_2)$ and $x = (x_1, x_2)$. Where needed, we will write, for any function $F(x_1, x_2, y_1, y_2)$,
$$\sum_{x_2 \in S_x,\, y_2 \in S_y} \int \int F(x_1, x_2, y_1, y_2)\, dx_1\, dy_1 = \int \int F(x, y)\, d\mu(y)\, d\nu(x). \qquad (1.9)$$
Throughout we will attempt to keep the notation explicit where possible and, in particular, will not make use of this convention in Sections 2 and 3, where the distinction between discrete-valued and continuous-valued random variables is particularly important.
Throughout we make the following assumptions on the kernels Kx and Ky :
(K1) Kx and Ky are densities on Rdx and Rdy respectively.
(K2) For some finite K + , supx1 ∈Rdx Kx (x1 ) < K + and supy1 ∈Rdy Ky (y1 ) < K +
(K3) $\|x_1\|^{2d_x} K_x(x_1) \to 0$ as $\|x_1\| \to \infty$ and $\|y_1\|^{2d_y} K_y(y_1) \to 0$ as $\|y_1\| \to \infty$.
(K4) Kx (x1 ) = Kx (−x1 ) and Ky (y1 ) = Ky (−y1 ).
(K5) $\int \|x_1\|^2 K_x(x_1)\, dx_1 < \infty$ and $\int \|y_1\|^2 K_y(y_1)\, dy_1 < \infty$.
(K6) Kx and Ky have bounded variation and finite modulus of continuity.
We also assume the following properties of the bandwidths. These will be given in terms of the number of observations falling at each combination of values of the discrete variables:
$$n(x_2) = \sum_{i=1}^n I_{x_2}(X_{2i}(\omega)), \qquad n(y_2) = \sum_{i=1}^n I_{y_2}(Y_{2i}(\omega)), \qquad n(x_2, y_2) = \sum_{i=1}^n I_{x_2}(X_{2i}(\omega)) I_{y_2}(Y_{2i}(\omega)).$$

(B1) $c_{nx_2} \to 0$, $c_{ny_2} \to 0$.

(B2) $n(x_2)\, c_{nx_2}^{d_x} \to \infty$ for all $x_2 \in S_x$ and $n(x_2, y_2)\, c_{nx_2}^{d_x} c_{ny_2}^{d_y} \to \infty$ for all $(x_2, y_2) \in S_x \otimes S_y$.

(B3) $n(x_2)\, c_{nx_2}^{2d_x} \to \infty$.

(B4) $n(x_2, y_2)\, c_{nx_2}^{2d_x} c_{ny_2}^{2d_y} \to \infty$.

(B5) $\sum_{n(x_2)=1}^\infty c_{nx_2}^{-d_x} e^{-\gamma n(x_2) c_{nx_2}^{d_x}} < \infty$ for all $\gamma > 0$.

(B6) $n(y_2)\, c_{ny_2}^4 \to 0$ if $d_y = 1$ and $n(x_2)\, c_{nx_2}^4 \to 0$ if $d_x = 1$,
where the sum is taken to be over all observations in the case that X2 or Y2 are singletons.
2 Consistency Results for Conditional Densities over Spaces of Mixed Types
In this section we will provide a number of L1 consistency results for kernel estimates of densities and
conditional densities of multivariate random variables in which some coordinates take values in Euclidean
space while others take values on a discrete set. Pointwise consistency of conditional density estimates of
this form can be found in, for example, Li and Racine (2007). However, we are unaware of equivalent L1
results which will be necessary for our development of conditional disparity-based inference. Throughout, we
have assumed that both the conditioning variable x and the response y are multivariate with both types of
coordinates. The specialization to univariate models, or to models with only discrete or only continuous variables in either $x$ or $y$ (and to unconditional densities), is readily seen to be covered by our results as well.
We begin with a result on densities for random variables taking discrete values.
Theorem 2.1. Let {Xn : n ≥ 1} denote a collection of i.i.d. discrete random variables with support S where
S is a countable set. Let
$$f(t) = P(X_1 = t), \quad t \in S,$$
and
$$f_n(t, \omega) = \frac{1}{n} \sum_{k=1}^n I_{\{t\}}(X_k(\omega));$$
then there exists a set $B$ with $P(B) = 1$ such that for $\omega \in B$,
$$\lim_{n \to \infty} \sum_{t \in S} |f(t) - f_n(t, \omega)| = 0.$$
Proof. By the strong law of large numbers, for each $t$ there exists a set $A_t$ with $P(A_t) = 0$ such that on $A_t^c$, $f_n(t, \omega)$ converges to $f(t)$. Let $B = \cap_{t \in S} A_t^c$. Then $P(B) = 1$ and for $\omega \in B$, $f_n(t, \omega)$ converges to $f(t)$ pointwise for all $t$.

Notice that $\sum_{t \in S} f_n(t, \omega) = 1$. Let $D_n = \{t \in S : f(t) \ge f_n(t, \omega)\}$. Hence it follows that
$$0 = \sum_{t \in S} (f(t) - f_n(t, \omega)) = \sum_{t \in S} (f(t) - f_n(t, \omega)) I_{D_n}(t) + \sum_{t \in S} (f(t) - f_n(t, \omega)) I_{D_n^c}(t).$$
Hence,
$$\sum_{t \in S} (f(t) - f_n(t, \omega)) I_{D_n}(t) = \sum_{t \in S} (f_n(t, \omega) - f(t)) I_{D_n^c}(t). \qquad (2.10)$$
Now,
$$\sum_{t \in S} |f(t) - f_n(t, \omega)| = \sum_{t \in S} |f(t) - f_n(t, \omega)| I_{D_n}(t) + \sum_{t \in S} |f(t) - f_n(t, \omega)| I_{D_n^c}(t) = \sum_{t \in S} (f(t) - f_n(t, \omega)) I_{D_n}(t) + \sum_{t \in S} (f_n(t, \omega) - f(t)) I_{D_n^c}(t).$$
Hence, using (2.10), we see that
$$\sum_{t \in S} |f(t) - f_n(t, \omega)| = 2 \sum_{t \in S} (f(t) - f_n(t, \omega)) I_{D_n}(t).$$
Alternatively, we can express this equality as
$$\sum_{t \in S} |f(t) - f_n(t, \omega)| = 2 \sum_{t \in S} (f(t) - f_n(t, \omega)) I_{D_n}(t) = 2 \int_S (f(t) - f_n(t, \omega)) I_{D_n}(t)\, d\nu(t),$$
where $\nu(\cdot)$ is the counting measure. Now fix $\omega \in B$ and set $a_n(t, \omega) = (f(t) - f_n(t, \omega)) I_{D_n}(t)$. Notice that $0 \le a_n(t, \omega) \le f(t)$ and $\int_S f(t)\, d\nu(t) = 1 < \infty$. Hence, by the dominated convergence theorem applied to the counting measure space $S$, we see that
$$\lim_{n \to \infty} \sum_{t \in S} |f(t) - f_n(t, \omega)| = 2 \lim_{n \to \infty} \int_S (f(t) - f_n(t, \omega)) I_{D_n}(t)\, d\nu(t) = 2 \int_S \lim_{n \to \infty} (f(t) - f_n(t, \omega)) I_{D_n}(t)\, d\nu(t) = 0.$$
Since $P(B) = 1$ and $\lim_{n \to \infty} \sum_{t \in S} |f(t) - f_n(t, \omega)| = 0$, we have almost sure $L_1$-consistency of the empirical probabilities.
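The convergence in Theorem 2.1 is easy to visualize numerically; the following sketch (illustrative only, using an arbitrary finite-support pmf of our choosing) tracks the $L_1$ distance between the empirical and true probabilities as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.5, 0.25, 0.125, 0.0625, 0.0625])    # an arbitrary pmf on S = {0, ..., 4}

for n in (100, 1_000, 10_000, 100_000):
    x = rng.choice(len(p), size=n, p=p)
    p_n = np.bincount(x, minlength=len(p)) / n       # empirical probabilities f_n(t)
    print(n, np.abs(p - p_n).sum())                  # sum_t |f(t) - f_n(t)| shrinks toward 0
```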
Using this, we can now provide the L1 convergence of densities of mixed types; pointwise and uniform
convergence is included for completeness.
Theorem 2.2. Let $\{(X_{n1}, X_{n2}), n \ge 1\}$ be given as in Section 1. Under Assumptions D1-D2, K1-K6 and B1-B2 there exists a set $B$ with $P(B) = 1$ such that for all $\omega \in B$,
$$\sum_{x_2 \in S_x} \int_{\mathbb{R}^{d_x}} \left|\hat h_n(x_1, x_2, \omega) - h(x_1, x_2)\right| dx_1 \to 0. \qquad (2.11)$$
For almost all $(x_1, x_2) \in \mathbb{R}^{d_x} \times S_x$, there exists a set $B_{(x_1,x_2)}$ with $P(B_{(x_1,x_2)}) = 1$ such that for all $\omega \in B_{(x_1,x_2)}$,
$$\hat h_n(x_1, x_2, \omega) \to h(x_1, x_2). \qquad (2.12)$$
Further, under Assumptions D3 and B5 there exists a set $B_s$ with $P(B_s) = 1$ such that for all $\omega \in B_s$,
$$\sup_{(x_1,x_2) \in \mathcal{X}} \left|\hat h_n(x_1, x_2, \omega) - h(x_1, x_2)\right| \to 0. \qquad (2.13)$$
Proof. We first re-write
$$\hat h_n(x_1, x_2, \omega) = \frac{n_{x_2}(\omega)}{n} \cdot \frac{1}{n_{x_2}(\omega) c_{nx_2}^{d_x}} \sum_{X_{i2}(\omega) = x_2} K_x\!\left(\frac{x_1 - X_{i1}(\omega)}{c_{nx_2}}\right) = \hat p_{x_2}(\omega)\, \tilde h_{x_2}(x_1, \omega)$$
and observe that for every $x_2$ there is a set $A_{x_2}$ with $P(A_{x_2}) = 1$ such that for $\omega \in A_{x_2}$, $n_{x_2}(\omega) \to \infty$, and hence from Devroye and Györfi (1985, Chap. 3, Theorem 1) there is a set $C_{x_2} \subset A_{x_2}$ with $P(C_{x_2}) = 1$ such that
$$\int \left|\tilde h_{x_2}(x_1, \omega) - h(x_1 | x_2)\right| dx_1 \to 0.$$
We can now write
$$\sum_{x_2 \in S_x} \int_{\mathbb{R}^{d_x}} \left|\hat h_n(x_1, x_2, \omega) - h(x_1, x_2)\right| dx_1 \le \sum_{x_2 \in S_x} p(x_2) \int_{\mathbb{R}^{d_x}} \left|\tilde h_{x_2}(x_1, \omega) - h(x_1 | x_2)\right| dx_1 + \sum_{x_2 \in S_x} |\hat p(x_2, \omega) - p(x_2)|,$$
where the second term converges for $\omega \in D$ with $P(D) = 1$ from Theorem 2.1 and the first term converges on $E = \cap_{x_2 \in S_x} C_{x_2}$ from the dominated convergence theorem applied with the observation that
$$\sum_{x_2 \in S_x} p(x_2) \int_{\mathbb{R}^{d_x}} \left|\tilde h_{x_2}(x_1, \omega) - h(x_1 | x_2)\right| dx_1 \le \sum_{x_2 \in S_x} p(x_2) \int_{\mathbb{R}^{d_x}} \left(\tilde h_{x_2}(x_1, \omega) + h(x_1 | x_2)\right) dx_1 \le 2.$$
Hence (2.11) holds for $\omega \in B = D \cap E$ with $P(B) = 1$.

To demonstrate (2.12), we observe that (2.11) requires that for almost all $(x_1, x_2)$, $\hat h_n(x_1, x_2, \omega) - h(x_1, x_2) \to 0$.

Finally, (2.13) follows from the finiteness of $S_x$ under Assumption D3 and the uniform convergence of kernel density estimates under Assumption B5 (see e.g. Nadaraya, 1989, Chap. 4).
The next theorem demonstrates convergence that is uniformly $L_1$; that is, the $L_1$ distance measured in $(y_1, y_2)$ at each $(x_1, x_2)$ converges uniformly over $(x_1, x_2)$. We begin by considering only the continuous-valued random variables, where $g(x_1, y_1)$ and $\hat g_n(x_1, y_1)$ will indicate the corresponding density and kernel density estimates.

Theorem 2.3. Let $\{(X_{n1}, Y_{n1}), n \ge 1\}$ be given as in Section 1. Under Assumptions D1-D2, K1-K6 and B1-B2, for almost all $x_1$ there exists a set $B_{x_1}$ with $P(B_{x_1}) = 1$ such that for all $\omega \in B_{x_1}$,
$$\int |\hat g_n(x_1, y_1, \omega) - g(x_1, y_1)|\, dy_1 \to 0. \qquad (2.14)$$
Further, if Assumptions D3 and B5 hold, there exists a set $B$ with $P(B) = 1$ such that for all $\omega \in B$,
$$\sup_{x_1 \in \mathcal{X}} \int |\hat g_n(x_1, y_1, \omega) - g(x_1, y_1)|\, dy_1 \to 0. \qquad (2.15)$$
Proof. For (2.14) we observe that
$$\int \left[\int |\hat g_n(x_1, y_1, \omega) - g(x_1, y_1)|\, dy_1\right] dx_1 = \int T(x_1)\, dx_1 \to 0$$
almost surely with $T(x_1) \ge 0$; see Devroye and Györfi (1985, Chap. 3, Theorem 1). Thus $T(x_1) \to 0$ for almost all $x_1$.
Turning to (2.15), first we demonstrate the result on the expectation of ĝn (x1 , y1 ). We observe that we
can choose bn(y2 ) → ∞ with bn(y2 ) cny2 → 0 so that
Z
sup
|Eĝn (x1 , y1 ) − g(x1 , y1 )| dy1
x1 ∈X
Z Z Z
K
(u)K
(v)g(x
+
c
u,
y
+
c
v)dudv
−
g(x
,
y
)
= sup
x
y
1
nx2
1
ny2
1 1 dy1
x1 ∈X
Z
|g(x01 , y10 ) − g(x1 , y1 )| dy1 + Ky (bn(y2 ) )
≤ sup sup
sup
sup
g(x1 , y1 )
x1 ∈X x01 ∈X y10 :ky10 −y1 k>cny2 bn(y2 )
x1 ∈X ,y1 ∈Rdy
≤ M bdny cdnyy 2 + Ky (bn(y2 ) )
→ 0.
We now observe that since X is compact, it can be given a disjoint covering. Further, since the modulus
of continuity of K is finite, for any , we can take a covering
X =
N
[n
Aj
j=1
and define
X
Nn
1
u − x1
(an1 (x1 ), . . . , aNn (x1 )) = arg min sup aj (x1 )IAj (u)
Kx
−
cnx2
a1 ,...,aNn u∈X cnx2
j=1
and
Kn+ (u, x) =
Nn
X
ajn (x1 )IAj (u)
j=1
such that
1
u − x1
+
sup sup Kx
− Kn (u, x1 ) <
cnx2
3
x∈X u∈X cnx2
x
for some constant C depending on where we also observe that for any j, supx ajn (x1 ) <
when Nn = C /cdnx
2
+ dx
K /cnx2 .
From here we define
n
1 X +
y1 − Y1i (ω)
ĝn∗ (x, y, ω) = dy
Kn (x1 , X1i (ω))Ky
cny2
ncny2 i=1
and observe that
Z
Z
sup
|ĝn (x1 , y1 , ω) − Eĝn (x1 , y1 )| dy1 ≤ sup
|ĝn (x1 , y1 , ω) − ĝn∗ (x1 , y1 , ω)| dy1
x1 ∈X
x1 ∈X
Z
+ sup
|Eĝn (x1 , y1 ) − Eĝn∗ (x1 , y1 )| dy1
x1 ∈X
Z
+ sup
|ĝn∗ (x1 , y1 , ω) − Eĝn∗ (x1 , y1 )| dy1
x1 ∈X
Z
2
≤
+ sup
|ĝn∗ (x1 , y1 , ω) − Eĝn∗ (x1 , y1 )| dy1 .
3
x1 ∈X
We thus only need to demonstrate that
Z
X ∗
∗
P sup
|ĝn (x1 , y1 , ω) − Eĝn (x1 , y1 )| dy1 > < ∞
x1 ∈X
n
and the result will follow from the Borel-Cantelli lemma.
To do this we observe that ĝn∗ (x1 , y1 , ω) is constant in x1 over partitions Aj and thus
Z
Z
sup
|ĝn∗ (x1 , y1 , ω) − Eĝn∗ (x1 , y1 )| dy1 ≤ max
|ĝn∗ (x1j , y1 , ω) − Eĝn∗ (x1j , y1 )| dy1
j∈1,...,Nn
x1 ∈X
= max qj ((X11 (ω), Y11 (ω)), . . . , (Xn1 (ω), Yn1 (ω)))
j
for any x1j ∈ Aj . Then
0
|qj ((x11 , y11 ), . . . , (xn1 , yn1 )) − qj ((x011 , y11
), . . . , (xn1 , yn1 ))|
Z 0
y
−
y11
y1 − y11
1
1
0
dy1
ajn (x11 )Ky
−
a
(x
)K
≤
jn 11
y
ncny
cny
cny
2
2
2
2 supx∈X aj (x)
≤
n
2K +
≤ dx
ncnx2
Hence by the bounded difference inequality
2d
2
ncnxx 2
P |qj ((X11 , Y11 ), . . . , (Xn1 , Yn1 )) − Eqj ((X11 , Y11 ), . . . , (Xn1 , Yn1 ))| >
< 2e− 8K +2 .
2
We now observe that
Z
Eqj ((X11 (ω), Y11 (ω)), . . . , (Xn1 (ω), Yn1 (ω))) ≤
q
2
∗
∗
min E gˆn (y1 , x1j ), E (ĝn (y1 , x1j ) − E gˆn (y1 , x1j )) dy1
∗
−dy /2 −dx
≤ Cg n−1/2 cny
cnx2
2
for some constant Cg since
E
(ĝn∗ (y1 , x1j )
∗
2
− E gˆn (y1 , x1j )) ≤
Z
1
d
ncnyy 2
ajn (x1 )2 Ky (u)2 g(x1 , y1 + cny2 v)dudv.
−d /2
−dx
Collecting terms, for n sufficiently large, Cg n−1/2 cny2y cnx
< /2 by Assumption B4 and
2
P
Z
sup
x1 ∈X
2
nc2
nx2 |ĝn∗ (x1 , y1 , ω) − Eĝn∗ (x1 , y1 )| dy1 > ≤ 2Nn e− 8K +2
2d
≤
which is summable under assumptions B5
2
x
2
2C − ncnx+2
8K
e
dx
cnx2
We can now readily extend the above result to the case of mixed data types.

Theorem 2.4. Let $\{(X_{n1}, X_{n2}, Y_{n1}, Y_{n2}), n \ge 1\}$ be given as in Section 1. Under Assumptions D1-D2, K1-K6 and B1-B2, for almost all $(x_1, x_2)$ there exists a set $B_{(x_1,x_2)}$ with $P(B_{(x_1,x_2)}) = 1$ such that for all $\omega \in B_{(x_1,x_2)}$,
$$\sum_{y_2 \in S_y} \int |\hat g_n(x_1, x_2, y_1, y_2, \omega) - g(x_1, x_2, y_1, y_2)|\, dy_1 \to 0. \qquad (2.16)$$
Further, if Assumptions D3 and B5 hold, there exists a set $B$ with $P(B) = 1$ such that for all $\omega \in B$,
$$\sup_{(x_1,x_2) \in \mathcal{X}} \sum_{y_2 \in S_y} \int |\hat g_n(x_1, x_2, y_1, y_2, \omega) - g(x_1, x_2, y_1, y_2)|\, dy_1 \to 0. \qquad (2.17)$$
Proof. (2.16) follows from the same arguments as (2.14) and the proof is omitted here.
To demonstrate (2.17), from assumption D3, Sx is finite and it is therefore sufficient to demonstrate the
result at each value of x2 . Writing
$$\tilde g_n(x_1, y_1 | x_2, y_2, \omega) = \frac{1}{n(x_2, y_2)\, c_{nx_2}^{d_x} c_{ny_2}^{d_y}} \sum_{i=1}^n K_y\!\left(\frac{y_1 - Y_{1i}(\omega)}{c_{ny_2}}\right) K_x\!\left(\frac{x_1 - X_{1i}(\omega)}{c_{nx_2}}\right) I_{x_2}(X_{2i}(\omega)) I_{y_2}(Y_{2i}(\omega))$$
and
$$\hat p(x_2, y_2) = \frac{n(x_2, y_2)}{n},$$
we can write
$$\sup_{x_1 \in \mathcal{X}} \sum_{y_2 \in S_y} \int |\hat g_n(x_1, x_2, y_1, y_2, \omega) - g(x_1, x_2, y_1, y_2)|\, dy_1 \le \sup_{x_1 \in \mathcal{X}} \sum_{y_2 \in S_y} p(x_2, y_2) \int |\tilde g_n(x_1, y_1 | x_2, y_2, \omega) - g(x_1, y_1 | x_2, y_2)|\, dy_1 + \sup_{x_1 \in \mathcal{X}} \sum_{y_2 \in S_y} |\hat p(x_2, y_2) - p(x_2, y_2)|\, \hat h_n(x_1, x_2, \omega).$$
From Theorem 2.2, $\hat h_n(x_1, x_2, \omega)$ is bounded and the second term converges almost surely. Further, the integral in the first term converges almost surely for each $y_2$ by Theorem 2.3. Using that
$$\sup_{x_1 \in \mathcal{X}} \sum_{y_2 \in S_y} p(x_2, y_2) \int |\tilde g_n(x_1, y_1 | x_2, y_2, \omega) - g(x_1, y_1 | x_2, y_2)|\, dy_1 \le \sup_{x_1 \in \mathcal{X}} \left(\hat h_n(x_1, x_2, \omega) + h(x_1, x_2)\right)$$
is also almost surely bounded, the result now follows from the dominated convergence theorem.
The results above can now be readily extended to equivalent L1 results for conditional densities.
Theorem 2.5. Let $\{(X_{n1}, X_{n2}, Y_{n1}, Y_{n2}), n \ge 1\}$ be given as in Section 1. Under Assumptions D1-D2, K1-K6 and B1-B2:

1. For almost all $x = (x_1, x_2) \in \mathbb{R}^{d_x} \otimes S_x$ there exists a set $B_x$ with $P(B_x) = 1$ such that for all $\omega \in B_x$,
$$\sum_{y_2 \in S_y} \int \left|\hat f_n(y_1, y_2 | x_1, x_2, \omega) - f(y_1, y_2 | x_1, x_2)\right| dy_1 \to 0. \qquad (2.18)$$

2. There exists a set $B_1$ with $P(B_1) = 1$ such that for all $\omega \in B_1$,
$$\sum_{x_2 \in S_x} \sum_{y_2 \in S_y} \int \int h(x_1, x_2) \left|\hat f_n(y_1, y_2 | x_1, x_2, \omega) - f(y_1, y_2 | x_1, x_2)\right| dy_1\, dx_1 \to 0. \qquad (2.19)$$

3. If, further, Assumptions D3 and B5 hold,
$$\sup_{(x_1,x_2) \in \mathcal{X}} \sum_{y_2 \in S_y} \int \left|\hat f_n(y_1, y_2 | x_1, x_2, \omega) - f(y_1, y_2 | x_1, x_2)\right| dy_1 \to 0. \qquad (2.20)$$
Proof. We begin with (2.18) and observe that
X Z fˆn (y1 , y2 |x1 , x2 , ω) − f (y1 , y2 |x1 , x2 ) dy1
(2.20)
(2.21)
y2 ∈Sy
≤
X Z ĝn (x1 , x2 , y1 , y2 , ω) − g(x1 , x2 , y1 , y2 ) dy1
h(x1 , x2 )
y2 ∈Sy
+
X Z
y2 ∈Sy
=
1
1
ĝn (x1 , x2 , y1 , y2 , ω) −
dy
ĥn (x1 , x2 , ω) h(x1 , x2 ) 1
Z
1
|ĝn (x1 , x2 , y1 , y2 , ω) − g(x1 , x2 , y1 , y2 )| dy1
h(x1 , x2 )
ĥn (x1 , x2 , ω) + 1 −
.
h(x1 , x2 ) (2.22)
For the first term in (2.22), we observe that h(x1 , x2 ) > 0 and the result follows from the first part of
Theorem 2.4.
For the second term, for almost all (x1 , x2 ), ĥn (x1 , x2 , ω) → h(x1 , x2 ) with probability 1 from the second
part of Theorem 2.2.
To demonstrate (2.19), we observe
X X Z
h(x1 , x2 ) fˆn (y1 , y2 |x1 , x2 , ω) − f (y1 , y2 |x1 , x2 ) dy1 dx1
x2 ∈Sx y2 ∈Sy
≤
X
X Z Z
|ĝn (x1 , x2 , y1 , y2 , ω) − g(x1 , x2 , y1 , y2 )| dy1 dx1
x2 ∈Sx y2 ∈Sy
+
X
X Z Z
x2 ∈Sx y2 ∈Sy
=
X
X Z Z
1
1
h(x1 , x2 )ĝn (x1 , x2 , y1 , y2 , ω) −
dy dx
ĥn (x1 , x2 , ω) h(x1 , x2 ) 1 1
|ĝn (x1 , x2 , y1 , y2 , ω) − g(x1 , x2 , y1 , y2 )| dy1 dx1
x2 ∈Sx y2 ∈Sy
+
X Z ĥn (x1 , x2 , ω) − h(x1 , x2 ) dx1
x2 ∈Sx
where both terms in the final line converge to zero with probability 1 from Theorem 2.2.
For (2.20), we follow the proof of (2.18) and observe that taking a supremum for the first term in (2.22)
gives
Z
1
|ĝn (x1 , x2 , y1 , y2 , ω) − g(x1 , x2 , y1 , y2 )| dy1
sup
(x1 ,x2 )∈X h(x1 , x2 )
Z
1
|ĝn (x1 , x2 , y1 , y2 , ω) − g(x1 , x2 , y1 , y2 )| dy1
≤ − sup
h (x1 ,x2 )∈X
→0
from Theorem 2.4 while the supremum over the second term converges to zero by applying Theorem 2.4.
3 Consistency Results for Homoscedastic Conditional Densities
In this section we study the conditional density estimates defined in Hansen (2004) for continuous-valued random variables in which a non-parametric location family is assumed:
$$f(y_1 | x_1, x_2) = f^*(y_1 - m(x_1, x_2))$$
for some $f^*$ that does not depend on $(x_1, x_2)$ and a regression function $m$. This is estimated according to (1.6)-(1.8). Here we show that essentially all the consistency properties for conditional density estimates demonstrated in Section 2 continue to hold for the estimate (1.6)-(1.8).
Lemma 3.1. Let $\{(X_{n1}, X_{n2}, Y_{n1}), n \ge 1\}$ be given as in Section 1 with the restriction (1.1). Under Assumptions D1-D3, K1-K6, B1-B2 and B5 there exists a set $B$ with $P(B) = 1$ such that for all $\omega \in B$,
$$\sup_{(x_1,x_2) \in \mathcal{X}} |\hat m_n(x_1, x_2, \omega) - m(x_1, x_2)| \to 0.$$

Proof. We observe that for any $x_2$, $\hat m_n(x_1, x_2, \omega)$ is a Nadaraya-Watson estimator using the data restricted to $X_2(\omega) = x_2$. Since the set $S_x$ is finite, the result follows from Theorem 2.1 of Nadaraya (1989, Chap. 4).
Theorem 3.1. Let $\{(X_{n1}, X_{n2}, Y_{n1}), n \ge 1\}$ be given as in Section 1 with the restriction (1.1). Under Assumptions D1-D3, E1, K1-K6, B1-B2 and B5 there exists a set $B$ with $P(B) = 1$ such that for all $\omega \in B$,
$$\int \left|\hat f^*_n(e, \omega) - f^*(e)\right| de \to 0 \qquad (3.23)$$
and
$$\sup_{(x_1,x_2) \in \mathcal{X}} \int \left|\tilde f_n(y_1 | x_1, x_2, \omega) - f^*(y_1 - m(x_1, x_2))\right| dy_1 \to 0. \qquad (3.24)$$
Proof. We begin by defining $\tilde f^*(e, \tilde m)$ to be the density of $Y_i(\omega) - \tilde m(X_{i1}, X_{i2})$ for any $\tilde m(x_1, x_2)$ with at least two continuous derivatives in $x_1$:
$$\tilde f^*(e, \tilde m) = \int f^*(e - z)\, f^+(z, \tilde m)\, dz \qquad (3.25)$$
where $f^+(e, \tilde m)$ gives the density of $\tilde m(x_1, x_2) - m(x_1, x_2)$ for $(x_1, x_2)$ distributed according to $h(x_1, x_2)$. We also define its kernel density estimate
$$\hat f_n(e, \tilde m, \omega) = \frac{1}{n c_{ny_2}^{d_y}} \sum_{i=1}^n K_y\!\left(\frac{e - (Y_i(\omega) - \tilde m(X_{i1}(\omega), X_{i2}(\omega)))}{c_{ny_2}}\right) \qquad (3.26)$$
where $\hat f_n(e, \hat m_n, \omega) = \hat f^*_n(e, \omega)$.

To demonstrate (3.23), we break the integral into
$$\int \left|\hat f_n(e, \hat m_n, \omega) - f^*(e)\right| de \le \int \left|\hat f_n(e, \hat m_n, \omega) - \tilde f^*(e, \hat m_n)\right| de + \int \left|\tilde f^*(e, \hat m_n) - f^*(e)\right| de.$$
We begin by observing that the second term converges by a mild generalization of Devroye and Lugosi
(2001, Theorem 9.1) given in Lemma 3.2 below with Gn (e) = f + (e, m̂n ) and the observation that Lemma
3.1 implies that the support of f + (e, m̂n ) shrinks to zero.
For the first term, we follow the arguments of Devroye and Lugosi (2001, Theorems 9.2-9.4) for m̂n replaced
with any bounded continuous m̃ and observe that the convergence can be given in terms that are independent
of m̃. In particular, following Devroye and Lugosi (2001) in using the bounded difference inequality we have
that for any g,
Z Z 2
ˆ
ˆ
P
fn (e, m̃, ω) − g(e) de − E fn (e, m̃, ω) − g(e) de > ≤ 2e− /2n
hence
Z Z ˆ
fn (e, m̃, ω) − f˜∗ (e, m̃) de − E fˆn (e, m̃, ω) − f˜∗ (e, m̃) de → 0
almost surely by the Borel-Cantelli lemma.
It is thus sufficient to establish the convergence of
Z Z Z
E fˆn (e, m̃, ω) − f˜(e, m̃) de ≤ E fˆn (e, m̃, ω) − f˜(e, m̃) de + E
12
ˆ
fn (e, m̃, ω) − E fˆn (e, m̃, ω) de.
For the first term, we replace fˆn (e, m̃, ω) with fˆn∗ (e, m̃, ω) by employing the kernel K + (e) = Ky (e)1kek<bn
with bn → ∞ in place of Ky so that
Z
Z ˆ
∗
ˆ
de
≤
Ky (e)de → 0
f
(e,
m̃,
ω)
−
E
f
(e,
m̃,
ω)
E n
n
e≥bn
and observe that replacing Gn with
d
K + (e/cny2 )/cnyy 2
in Lemma 3.2 provides convergence whenever
|m̂n (x1 , x2 , ω) − m(x1 , x2 )| ≤ L
sup
(x1 ,x2 )∈X
for some finite L. This occurs with probability 1.
Turning to the second term, we also make use of K + and show that
Z
Z ˆ
ˆ
E fn (e, m̃, ω) − E fn (e, m̃, ω) de ≤ 2 E fˆn (e, m̃, ω) − E fˆn (e, m̃, ω) de
+
!
r Z
2
˜
ˆ
ˆ
de
≤ 2 min f (e, m̃), E fn (e, m̃, ω) − E fn (e, m̃, ω)
Z + 2 E fˆn (e, m̃, ω) − f˜(e, m̃) de
where we observe that
2
E fˆn (e, m̃, ω) − E fˆn (e, m̃, ω) ≤
n
≤
Z
1
d
cnyy 2
1
2
K+
e−t
cny2
sup f˜(e, m̃)
d
ncnyy 2 e∈Rdy
Z
f˜(t, m̃)dt
K +2 (e)de
→0
from Assumption 2. The convergence is made uniform in m̃ from the observation that
sup f˜(e, m̃) ≤ sup f ∗ (e).
e∈Rdy
e∈Rdy
For (3.24) we observe that
Z ˆ∗
fn (y1 − m̂n (x1 , x2 , ω), ω) − f ∗ (y1 − m(x1 , x2 )) dy1
Z ≤ fˆn∗ (y1 − m̂n (x1 , x2 , ω), ω) − f ∗ (y1 − m̂n (x1 , x2 , ω)) dy1
Z
+ |f ∗ (y1 − m̂n (x1 , x2 , ω)) − f ∗ (y1 − m(x1 , x2 ))| dy1 .
The first term converges to zero from (3.23) while we bound the second term by
Z
sup |m̂n (x1 , x2 , ω) − m(x1 , x2 )| |∇y f ∗ (y1 )| dy → 0
(x1 ,x2 )∈X
from assumption E1 and Lemma 3.1.
Lemma 3.2. Let $G_n$ be a sequence of positive, integrable functions on $\mathbb{R}^d$ with $\int G_n(x)\, dx = 1$ and support contained in $[-M_n, M_n]^d$ with $M_n \to 0$. Then for any density $f$ on $\mathbb{R}^d$ we have
$$\lim_{n \to \infty} \int \left|\int f(x - y) G_n(y)\, dy - f(x)\right| dx = 0.$$
Further, defining the class of densities with support on $[-L, L]^d$ by $D_L = \{h : \|x\| > L \Rightarrow h(x) = 0\}$, we have
$$\lim_{n \to \infty} \sup_{h \in D_L} \int \left|\int \int f(x - y - z) h(z) G_n(y)\, dz\, dy - \int f(x - z) h(z)\, dz\right| dx = 0.$$
Proof. This proof is a generalization of Devroye and Lugosi (2001, Theorem 9.1) and follows from nearly
identical arguments. First, assuming the result is true on any dense subspace of functions g, we have
Z Z Z
Z
dx
f
(x
−
y
−
z)h(z)dzG
(y)dy
−
f
(x
−
z)h(z)dz
n
Z Z Z
f (x − y − z)Gn (y)dy − f (x − z) h(z)dzdx
≤
Z Z
Z
≤
|f (x − y) − g(x − y)| Gn (y)dydx + |f (x) − g(x)| dx
Z Z
+ g(x − y)Gn (y)dy − g(x) dx
Z
≤ 2 |f (x) − g(x)| dx + o(1)
where the first integral can be made as small as desired without reference to h(z). It is therefore sufficient
to prove the theorem in a dense subclass. In this case we take the class of Lipschitz densities with compact
support.
For any such f , let the support of f be contained in [−M ∗ M ∗ ]d and |f (y) − f (x)| ≤ Ckx − yk, x, y ∈ Rd .
Then we observe that for any h(z) ∈ DL
Z
Z
| (f (y − z) − f (x − z))h(z)dz| ≤ |f (y − z) − f (x − z)| h(z)dz
≤ |f (y) − f (x)|
≤ Ckx − yk
14
so that uniformly over DL .
Z
Z Z
f (x − y − z)h(z)Gn (y)dzdy − f (x − z)h(z)dz dx
Z
Z Z
≤ f (x − y − z)Gn (y)dzdy − f (x − z) h(z)dzdx
Z
Z Z
= f (x − y − z)Gn (y)dy − f (x − z) Gn (y)dy h(z)dzdx
Z Z
≤
|f (x − y − z) − f (x − z)| h(z)Gn (y)dzdydx
Z
Z
≤
Ckyk|Gn (y)|dydx
[−M ∗ −L−Mn ,M ∗ +L+Mn ]d
√
≤ (2M ∗ + 2L + 2Mn )d CMn d
→ 0.
4 Consistency of Minimum Disparity Estimators for Conditional Models
In this section we define minimum disparity estimators for the conditionally specified models based on the distributions and data defined in Section 1. A parametric model is given by
$$f(y_1, y_2 | x_1, x_2) = \phi(y_1, y_2 | x_1, x_2, \theta),$$
where we assume that the $(X_{i1}, X_{i2})$ are independently drawn from a distribution $h(x_1, x_2)$ which is not parametrically specified. For this model, the maximum likelihood estimator for $\theta$ given observations $(Y_i, X_i)$, $i = 1, \ldots, n$, is
$$\hat\theta_{MLE} = \arg\max_\theta \sum_{i=1}^n \log \phi(Y_{i1}, Y_{i2} | X_{i1}, X_{i2}, \theta)$$
with attendant asymptotic variance
$$I(\theta_0) = \sum_{y_2 \in S_y} \sum_{x_2 \in S_x} \int \int \nabla^2_\theta \left[\log \phi(y_1, y_2 | x_1, x_2, \theta_0)\right] \phi(y_1, y_2 | x_1, x_2, \theta_0)\, h(x_1, x_2)\, dy_1\, dx_1$$
when the specified parametric model is correct at $\theta = \theta_0$.
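For later comparison with the disparity-based estimators, the following sketch computes $\hat\theta_{MLE}$ for a hypothetical Gaussian regression family $\phi(y_1 | x_1, x_2, \theta) = N(a + b x_1 + c x_2, s^2)$; the family, parameterization and simulated data are illustrative assumptions of ours, not part of the paper.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_log_lik(theta, X1, X2, Y1):
    """Negative conditional log-likelihood for the illustrative family
    phi(y1 | x1, x2, theta) = N(a + b*x1 + c*x2, exp(log_s)^2)."""
    a, b, c, log_s = theta
    return -np.sum(norm.logpdf(Y1, loc=a + b * X1 + c * X2, scale=np.exp(log_s)))

rng = np.random.default_rng(2)
n = 500
X1 = rng.uniform(-1, 1, n)
X2 = rng.integers(0, 2, n)
Y1 = 1.0 + 2.0 * X1 - 0.5 * X2 + rng.normal(size=n)

# theta_hat_MLE = arg max_theta sum_i log phi(Y_i1 | X_i1, X_i2, theta)
fit = minimize(neg_log_lik, x0=np.zeros(4), args=(X1, X2, Y1))
print(fit.x)   # roughly (1.0, 2.0, -0.5, 0.0) for this simulated sample
```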
In the context of disparity estimation, for every value $(x_1, x_2)$ we define the conditional disparity between $f$ and $\phi$ as
$$D(f, \phi | x_1, x_2, \theta) = \sum_{y_2 \in S_y} \int C\!\left(\frac{f(y_1, y_2 | x_1, x_2)}{\phi(y_1, y_2 | x_1, x_2, \theta)} - 1\right) \phi(y_1, y_2 | x_1, x_2, \theta)\, dy_1;$$
these are combined over the observed $(X_{i1}, X_{i2})$ by
$$D_n(f, \theta) = \frac{1}{n} \sum_{i=1}^n D(f, \phi | X_{i1}, X_{i2}, \theta)$$
or
$$\tilde D_n(f, \theta) = \sum_{x_2 \in S_x} \int D(f, \phi | x_1, x_2, \theta)\, \hat h_n(x_1, x_2)\, dx_1,$$
with limiting case
$$D_\infty(f, \theta) = \sum_{x_2 \in S_x} \int D(f, \phi | x_1, x_2, \theta)\, h(x_1, x_2)\, dx_1.$$
We now define the conditional minimum disparity estimator as
$$\hat\theta^D_n = \arg\min_{\theta \in \Theta} D_n(\hat f_n, \theta)$$
with $\hat f_n$ defined as either (1.4) or (1.8) and $C$ a strictly convex function from $\mathbb{R}$ to $[-1, \infty)$ with a unique minimum at 0. Classical choices of $C$ include $e^{-x} - 1$, resulting in the negative exponential disparity (NED), and $\left(\sqrt{x + 1} - 1\right)^2 - 1$, which corresponds to Hellinger distance (HD).
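The following sketch illustrates the estimator $\hat\theta^D_n = \arg\min_\theta D_n(\hat f_n, \theta)$ using the Hellinger choice of $C$ above, the unrestricted estimate (1.4) (continuous $x_1$ and $y_1$ only, no discrete coordinates) and a hypothetical Gaussian regression family $\phi(y_1 | x_1, \theta) = N(\theta_0 + \theta_1 x_1, \theta_2^2)$; the integral over $y_1$ is approximated on a grid, and every modelling choice here is an illustrative assumption rather than the paper's prescription.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def C_hellinger(d):
    """The Hellinger-distance choice of C given above: C(d) = (sqrt(d + 1) - 1)^2 - 1."""
    return (np.sqrt(d + 1.0) - 1.0) ** 2 - 1.0

def f_hat_grid(y_grid, x1, X1, Y1, cx, cy):
    """Unrestricted conditional density estimate (1.4) evaluated on a grid of y values."""
    wx = norm.pdf((x1 - X1) / cx)
    ky = norm.pdf((y_grid[:, None] - Y1[None, :]) / cy) / cy
    return (ky * wx[None, :]).sum(axis=1) / wx.sum()

def D_n(theta, X1, F, y_grid):
    """D_n(f_hat_n, theta): the conditional disparity averaged over the observed X_i."""
    a, b, log_s = theta
    dy = y_grid[1] - y_grid[0]
    phi = norm.pdf(y_grid[None, :], loc=(a + b * X1)[:, None], scale=np.exp(log_s))
    phi = np.maximum(phi, 1e-300)            # guard against underflow far in the tails
    return np.mean(np.sum(C_hellinger(F / phi - 1.0) * phi, axis=1)) * dy

rng = np.random.default_rng(3)
n = 300
X1 = rng.uniform(-1, 1, n)
Y1 = 1.0 + 2.0 * X1 + rng.normal(size=n)
cx = cy = n ** (-1 / 6)
y_grid = np.linspace(-6.0, 8.0, 400)

# f_hat_n(. | X_i1) does not depend on theta, so it is computed once per observation.
F = np.array([f_hat_grid(y_grid, xi, X1, Y1, cx, cy) for xi in X1])
fit = minimize(D_n, x0=np.array([0.0, 0.0, 0.0]), args=(X1, F, y_grid), method="Nelder-Mead")
print(fit.x)   # roughly (1.0, 2.0, 0.0) up to smoothing bias when the model is correct
```

Replacing C_hellinger with $e^{-x} - 1$ in this sketch gives the corresponding NED-based estimate.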
Under this definition, we first establish the existence and consistency of $\hat\theta^D_n$. To do so we note that disparity results all rely on the boundedness of $D(f, \phi | X_i, \theta)$ over $\theta$ and $f$ and a condition of the form that, for any conditional densities $g$ and $f$,
$$\sup_{\theta \in \Theta} |D(g, \phi | X_{i1}, X_{i2}, \theta) - D(f, \phi | X_{i1}, X_{i2}, \theta)| \le K \sum_{y_2 \in S_y} \int |g(y_1, y_2 | X_{i1}, X_{i2}) - f(y_1, y_2 | X_{i1}, X_{i2})|\, dy_1. \qquad (4.27)$$
In the case of Hellinger distance (Beran, 1977), $D(g, \theta) < 2$ and (4.27) follows from Minkowski's inequality. For the alternate class of divergences studied in Park and Basu (2004), boundedness of $D$ is established from $\sup_{t \in [-1, \infty)} |C'(t)| \le C^* < \infty$, which also provides
$$\int \left[C\!\left(\frac{f(y_1, y_2 | x_1, x_2)}{\phi(y_1, y_2 | x_1, x_2, \theta)} - 1\right) - C\!\left(\frac{g(y_1, y_2 | x_1, x_2)}{\phi(y_1, y_2 | x_1, x_2, \theta)} - 1\right)\right] \phi(y_1, y_2 | x_1, x_2, \theta)\, d\mu(y)$$
$$\le C^* \int \left|\frac{g(y_1, y_2 | x_1, x_2)}{\phi(y_1, y_2 | x_1, x_2, \theta)} - \frac{f(y_1, y_2 | x_1, x_2)}{\phi(y_1, y_2 | x_1, x_2, \theta)}\right| \phi(y_1, y_2 | x_1, x_2, \theta)\, d\mu(y) = C^* \int |g(y_1, y_2 | x_1, x_2) - f(y_1, y_2 | x_1, x_2)|\, d\mu(y),$$
where we have employed the notational convention (1.9). For simplicity, we therefore use (4.27) as a condition below.
In general, we will require the following assumptions:

(P1) The parameter space $\Theta$ is locally compact.

(P2) There exists $N$ such that for $n > N$ and $\theta_1 \ne \theta_2$, with probability 1, $\max_{i \in 1,\ldots,n} |\phi(y_1, y_2 | X_{i1}, X_{i2}, \theta_1) - \phi(y_1, y_2 | X_{i1}, X_{i2}, \theta_2)| > 0$ on a non-zero set of dominating measure in $y$.

(P3) $\phi(y_1, y_2 | x_1, x_2, \theta)$ is continuous in $\theta$ for almost every $(x_1, x_2, y_1, y_2)$.

(P4) $D(f, \phi | x_1, x_2, \theta)$ is uniformly bounded over $f$, $x_1$, $x_2$ and $\theta$, and (4.27) holds.

(P5) For every $f$ there exists a compact set $S_f \subset \Theta$ and $N$ such that for $n \ge N$,
$$\inf_{\theta \in S_f^c} D_n(f, \theta) > \inf_{\theta \in S_f} D_n(f, \theta).$$
These assumptions combine those of Park and Basu (2004) for a general class of disparities with those of Cheng and Vidyashankar (2006), which relax the assumption of compactness of $\Theta$. Together, these provide the following results:

Theorem 4.1. Under Assumptions P1-P5, define
$$T_n(f) = \arg\min_{\theta \in \Theta} D_n(f, \theta) \qquad (4.28)$$
for $n = 1, \ldots, \infty$ inclusive. Then

(i) For any $f \in \mathcal{F}$ there exists $\theta \in \Theta$ such that $T_n(f) = \theta$.

(ii) For $n \ge N$, for any $\theta$, $\theta = T_n(\phi(\cdot|\cdot, \theta))$ is unique.

(iii) If $T_n(f)$ is unique and $f_m \to f$ in $L_1$ for each $x$, then $T_n(f_m) \to T_n(f)$.
Proof. (i) Existence. We first observe that it is sufficient to restrict the infimum in (4.28) to $S_f$. Let $\{\theta_m : \theta_m \in S_f\}$ be a sequence such that $\theta_m \to \theta$ as $m \to \infty$. Since
$$C\!\left(\frac{f(y_1, y_2 | x_1, x_2)}{\phi(y_1, y_2 | x_1, x_2, \theta_m)} - 1\right) \phi(y_1, y_2 | x_1, x_2, \theta_m) \to C\!\left(\frac{f(y_1, y_2 | x_1, x_2)}{\phi(y_1, y_2 | x_1, x_2, \theta)} - 1\right) \phi(y_1, y_2 | x_1, x_2, \theta)$$
by Assumption P3, using the bound on $D(f, \phi, \theta)$ from Assumption P4 we have $D_n(f, \theta_m) \to D_n(f, \theta)$ by the dominated convergence theorem. Hence $D_n(f, t)$ is continuous in $t$ and achieves its minimum for $t \in S_f$ since $S_f$ is compact.

(ii) Uniqueness. This is a consequence of Assumption P2 and the unique minimum of $C$ at 0.

(iii) Continuity in $f$. For any sequence $f_m(\cdot | x_1, x_2) \to f(\cdot | x_1, x_2)$ in $L_1$ for every $x$ as $m \to \infty$ we have
$$\sup_{\theta \in \Theta} |D_n(f_m, \theta) - D_n(f, \theta)| \to 0 \qquad (4.29)$$
from Assumption P4.

Now consider $\theta_m = T_n(f_m)$. We first observe that there exists $M$ such that for $m \ge M$, $\theta_m \in S_f$; otherwise, from (4.29) and Assumption P5,
$$D_n(f_m, \theta_m) > \inf_{\theta \in S_f} D_n(f_m, \theta),$$
contradicting the definition of $\theta_m$.

Now suppose that $\theta_m$ does not converge to $\theta_0 = T_n(f)$. By the compactness of $S_f$ we can find a subsequence $\theta_{m'} \to \theta^* \ne \theta_0$, implying $D_n(f, \theta_{m'}) \to D_n(f, \theta^*)$ from Assumption P3. Combining this with (4.29) implies $D_n(f, \theta^*) = D_n(f, \theta_0)$, contradicting the assumed uniqueness of $T_n(f)$.
Theorem 4.2. Let $\{(X_{n1}, X_{n2}, Y_{n1}, Y_{n2}), n \ge 1\}$ be given as in Section 1 and define
$$\theta^0_n = \arg\min_{\theta \in \Theta} D_n(f, \theta)$$
for every $n$ including $\infty$. Further, assume that $\theta^0_\infty$ is unique in the sense that for every $\epsilon$ there exists $\delta$ such that
$$\|\theta - \theta^0_\infty\| > \epsilon \Rightarrow D_\infty(f, \theta) > D_\infty(f, \theta^0_\infty) + \delta;$$
then under Assumptions D1-D3, K1-K6, B1-B2 and P1-P5, with $\bar f_n$ either of the estimators (1.4) or (1.8),
$$\hat\theta_n = T_n(\bar f_n) \to \theta^0_\infty \quad \text{as } n \to \infty \text{ almost surely}.$$
Similarly,
$$\tilde T_n(\bar f_n) = \arg\min_{\theta \in \Theta} \tilde D_n(\bar f_n, \theta) \to \theta^0_\infty \quad \text{as } n \to \infty \text{ almost surely}.$$
Proof. First, we observe that for every f , it is sufficient to restrict attention to Sf and that
sup |Dn (f, θ) − D∞ (f, θ)| → 0 almost surely
(4.30)
θ∈Sf
from the strong law of large numbers, the compactness
respect to θ.
Further,
Z
∗
¯
sup
Dm (fn , θ) − Dm (f, θ) ≤ C sup
of Sf and the assumed continuity of C and of φ with
f¯n (y1 , y2 |x1 , x2 ) − f (y1 , y2 |x1 , x2 ) dµ(y)
x∈X
m∈N,θ∈Θ
→ 0 almost surely
(4.31)
where the convergence is obtained from Theorem 2.5 or Theorem 3.1 depending on the form of f¯n .
0
, then we can find > 0 and a subsequence θ̂n0 such that
Suppose that θ̂n does not converge to θ∞
0
kθ̂n0 − θ∞
k > for all n0 . However, on this subsequence
Dn0 (fˆn0 , θ̂n0 ) = Dn0 (fˆn0 , θ0 ) + (Dn0 (f, θ0 ) − Dn0 (fˆn0 , θ0 )) + (D∞ (f, θ0 ) − Dn0 (f, θ0 ))
+ (D∞ (f, θ̂n0 ) − D∞ (f, θ0 ))
+ (Dn0 (f, θ̂n0 ) − D∞ (f, θ̂n0 )) + (Dn0 (fˆn0 , θ̂n0 ) − Dn0 (f, θ̂n0 ))
≤ Dn0 (fˆn0 , θ0 ) + δ
− 2 sup |Dn0 (f, θ) − D(f, θ)|
θ∈Θ
− 2 sup Dn0 (fˆn0 , θ) − Dn0 (f, θ)
θ∈Θ
but from (4.30) and (4.31) we can find N so that for n0 ≥ N
sup |Dn0 (f, θ) − D(f, θ)| ≤
θ∈Θ
and
δ
6
δ
sup Dn0 (fˆn0 , θ) − Dn0 (f, θ) ≤
6
θ∈Θ
¯
contradicting the optimality of θ̂n0 . The proof for T̃n (fn ) follows analogously.
5 Asymptotic Normality and Efficiency of Minimum Disparity Estimators for Conditional Models: Unrestricted Case

In this section we demonstrate the asymptotic normality and efficiency of minimum conditional disparity estimators using the unrestricted conditional density estimate $\hat f_n(y_1, y_2 | x_1, x_2)$. In order to simplify some of our expressions, we introduce the following notation: for a column vector $A$ we define the matrix
$$A^{TT} = A A^T.$$
The proof techniques employed here are an extension of those developed in i.i.d. settings in Beran (1977);
Tamura and Boos (1986); Lindsay (1994); Park and Basu (2004). In particular we will require the following
assumptions:
(N1) Define
$$\Psi_\theta(x_1, x_2, y_1, y_2) = \frac{\nabla_\theta \phi(y_1, y_2 | x_1, x_2, \theta)}{\phi(y_1, y_2 | x_1, x_2, \theta)};$$
then
$$\sup_{(x_1,x_2) \in \mathcal{X}} \sum_{y_2 \in S_y} \int \Psi_\theta(x_1, x_2, y_1, y_2) \Psi_\theta(x_1, x_2, y_1, y_2)^T f(y_1, y_2 | x_1, x_2)\, dy_1 < \infty$$
elementwise. Further, there exists $a_y > 0$ such that
$$\sup_{(x_1,x_2) \in \mathcal{X}} \sup_{\|t\| \le a_y} \sup_{\|s\| \le a_y} \sum_{y_2 \in S_y} \int \Psi_\theta(x_1 + s, x_2, y_1 + t, y_2)^2 f(y_1, y_2 | x_1, x_2)\, dy_1 < \infty,$$
and
$$\sup_{(x_1,x_2) \in \mathcal{X}} \sup_{\|t\| \le a_y} \sup_{\|s\| \le a_y} \int \left(\nabla_y \Psi_\theta(x_1 + t, x_2, y_1 + s)\right)^2 f(y_1 | x_1, x_2)\, dy_1 < \infty,$$
and
$$\sup_{(x_1,x_2) \in \mathcal{X}} \sup_{\|t\| \le a_y} \sup_{\|s\| \le a_y} \int \left(\nabla_x \Psi_\theta(x_1 + t, x_2, y_1 + s)\right)^2 f(y_1 | x_1, x_2)\, dy_1 < \infty.$$

(N2) There exist sequences $b_n$ and $\alpha_n$ diverging to infinity such that

(i) $n c^4_{ny_2} b^4_n \to 0$ and
$$n \sup_{(x_1,x_2) \in \mathcal{X}} \sum_{y_2 \in S_y} \int\!\!\int_{\|u\| > b_n,\, \|v\| > b_n} \Psi^2_\theta(x_1 + c_{nx_2} u, x_2, y_1 + c_{ny_2} v, y_2)\, K^2_y(u) K^2_x(v)\, g(x_1, x_2, y_1, y_2)\, dv\, dy_1 \to 0$$
elementwise.

(ii) $\sup_{(x_1,x_2) \in \mathcal{X}} n P(\|Y_1 - c_{ny_2} b_n\| > \alpha_n - c) \to 0$.

(iii)
$$\sup_{(x_1,x_2) \in \mathcal{X}} \sum_{y_2 \in S_y} \frac{1}{\sqrt{n c^{d_x}_{nx_2} c^{d_y}_{ny_2}}} \int_{\|y_1\| \le \alpha_n} |\Psi_\theta(x_1, x_2, y_1, y_2)|\, dy_1 \to 0.$$

(iv)
$$\sup_{(x_1,x_2) \in \mathcal{X}} \sup_{\|t\| \le b_n} \sup_{\|s\| \le b_n} \sup_{\|y_1\| < \alpha_n} \sum_{y_2 \in S_y} \frac{g(x_1 + c_{nx_2} s, x_2, y_1 + c_{ny_2} t, y_2)}{g(x_1, x_2, y_1, y_2)} = O(1).$$

(N3) $\sup_{y_1, y_2, x_1, x_2} \sqrt{\phi(y_1, y_2 | x_1, x_2, \theta)}\, \nabla_\theta \Psi_\theta(y_1, y_2, x_1, x_2) < \infty$.

(N4) $C$ is either given by Hellinger distance, $C(x) = \left(\sqrt{x + 1} - 1\right)^2 - 1$, or
$$A_1(r) = -C''(r - 1) r, \quad A_2(r) = C(r - 1) - C'(r - 1) r, \quad A_3(r) = C''(r - 1) r^2$$
are all bounded in absolute value.
Assumption N2i is a condition on the tails of Ψθ and Kx and Ky . As we demonstrate below, this allows
us to truncate the kernels at bn ; this will prove mathematically convenient throughout the remainder of this
section.
Lemma 5.1. Under Assumption N2i, defining
$$K^+_{nx}(u) = K_x(u)\, I_{\|u\| < b_n}, \qquad K^+_{ny}(u) = K_y(u)\, I_{\|u\| < b_n}$$
and setting $\hat g^+_n$, $\hat f^+_n$ and $\hat f^{*+}_n$ to be defined as (1.2), (1.4) and (1.8) with $K_x$ and $K_y$ replaced by $K^+_{nx}$ and $K^+_{ny}$, we have the following convergences in probability:
$$\sqrt{n} \sup_{(x_1,x_2) \in \mathcal{X}} \sum_{y_2 \in S_y} \int \Psi_\theta(x_1, x_2, y_1, y_2)\left(\hat g_n(x_1, x_2, y_1, y_2) - \hat g^+_n(x_1, x_2, y_1, y_2)\right) dy_1 \to 0 \qquad (5.32)$$
$$\sqrt{n} \sup_{(x_1,x_2) \in \mathcal{X}} \sum_{y_2 \in S_y} \int \Psi_\theta(x_1, x_2, y_1, y_2)\left(\hat f_n(y_1, y_2 | x_1, x_2) - \hat f^+_n(y_1, y_2 | x_1, x_2)\right) dy_1 \to 0 \qquad (5.33)$$
$$\sqrt{n} \sup_{(x_1,x_2) \in \mathcal{X}} \sum_{y_2 \in S_y} \int \Psi_\theta(x_1, x_2, y_1, y_2)\left(\sqrt{\hat g_n(x_1, x_2, y_1, y_2)} - \sqrt{\hat g^+_n(x_1, x_2, y_1, y_2)}\right)^2 dy_1 \to 0 \qquad (5.34)$$
$$\sqrt{n} \sup_{(x_1,x_2) \in \mathcal{X}} \sum_{y_2 \in S_y} \int \Psi_\theta(x_1, x_2, y_1, y_2)\left(\sqrt{\hat f_n(y_1, y_2 | x_1, x_2)} - \sqrt{\hat f^+_n(y_1, y_2 | x_1, x_2)}\right)^2 dy_1 \to 0 \qquad (5.35)$$
and in the homoscedastic case
$$\sqrt{n} \sup_{(x_1,x_2) \in \mathcal{X}} \int \Psi_\theta(x_1, x_2, y_1)\left(\hat f^*_n(x_1, x_2, y_1) - \hat f^{*+}_n(x_1, x_2, y_1)\right) dy_1 \to 0 \qquad (5.36)$$
$$\sqrt{n} \sup_{(x_1,x_2) \in \mathcal{X}} \int \Psi_\theta(x_1, x_2, y_1)\left(\sqrt{\hat f^*_n(x_1, x_2, y_1)} - \sqrt{\hat f^{*+}_n(x_1, x_2, y_1)}\right)^2 dy_1 \to 0. \qquad (5.37)$$
Proof. Assumption N2i implies the convergence of the expected mean square of (5.32), and (5.32) therefore
follows from Markov’s inequality. (5.36) follows from the same argument and (5.33) reduces to (5.32) by
observing that
Z
√
gn (x1 , x2 , y1 , y2 ) − gn+ (x1 , x2 , y1 , y2 )
n sup Ψθ (x1 , x2 , y1 , y2 )
dµ(y) → 0
hn (x1 , x2 )
x∈X
20
in probability since min(x1 ,x2 )∈X hn (x1 , x2 ) is eventually bounded below almost surely and
Z
√
1
1
sup +
n Ψθ (x1 , x2 , y1 , y2 )gn+ (x1 , x2 , y1 , y2 )dµ(y) → 0
−
hn (x1 , x2 ) x∈X hn (x1 , x2 )
since the second term is bounded in probability and
1
1
h+ (x , x ) − h(x1 , x2 ) → 0 and
n
1
2
almost surely.
For (5.34), (5.35) and (5.37) use the inequality
√
1
1
hn (x1 , x2 ) − h(x1 , x2 ) → 0
a−
√ 2
b ≤ |a − b| and observe that substituting this
reduces to (5.32), (5.33) and (5.36).
Lemma 5.2. Let $\{(X_{n1}, X_{n2}, Y_{n1}, Y_{n2}), n \ge 1\}$ be given as in Section 1. Under Assumptions D1-D2, E1, K1-K6, B1-B2 and P1-P5, for any function $J(y_1, y_2, x_1, x_2)$ satisfying the conditions on $\Psi$ in Assumptions N1-N2iv and
$$\sum_{x_2 \in S_x,\, y_2 \in S_y} \int \int J(y_1, y_2, x_1, x_2)\, g(x_1, x_2, y_1, y_2)\, dx_1\, dy_1 = 0$$
and
$$V_J = \sum_{x_2 \in S_x,\, y_2 \in S_y} \int \int J^{TT}(y_1, y_2, x_1, x_2)\, g(x_1, x_2, y_1, y_2)\, dx_1\, dy_1 < \infty$$
elementwise, the following hold:
$$\sqrt{n}\left(\sum_{y_2 \in S_y,\, x_2 \in S_x} \int \int J(y_1, y_2, x_1, x_2)\, \hat f_n(y_1, y_2 | x_1, x_2)\, \hat h_n(x_1, x_2)\, dy_1\, dx_1 - B_n\right) \to N(0, V_J) \qquad (5.38)$$
in distribution and, similarly,
$$\sqrt{n}\left(\frac{1}{n} \sum_{i=1}^n \sum_{y_2 \in S_y} \int J(y_1, y_2, X_{i1}, X_{i2})\, \hat f_n(y_1, y_2 | X_{i1}, X_{i2})\, dy_1 - \tilde B_n\right) \to N(0, \tilde V_J) \qquad (5.39)$$
in distribution, where
$$B_n = \sum_{y_2 \in S_y,\, x_2 \in S_x} \int \int J(y_1, y_2, x_1, x_2)\, E\hat g_n(x_1, x_2, y_1, y_2)\, dy_1\, dx_1 = \sum_{y_2 \in S_y,\, x_2 \in S_x} \int \int \int \int J(y_1, y_2, x_1, x_2)\, g(x_1 + c_{nx_2} u, x_2, y_1 + c_{ny_2} v, y_2)\, K_x(u) K_y(v)\, du\, dv\, dy_1\, dx_1,$$
$$\tilde B_n = \sum_{y_2 \in S_y,\, x_2 \in S_x} \int \frac{\int J(y_1, y_2, x_1, x_2)\, E\hat g_n(x_1, x_2, y_1, y_2)\, h(x_1, x_2)\, dy_1}{E\hat h_n(x_1, x_2)}\, dx_1 = \sum_{y_2 \in S_y,\, x_2 \in S_x} \int \frac{\int \int \int J(y_1, y_2, x_1, x_2)\, g(x_1 + c_{nx_2} u, x_2, y_1 + c_{ny_2} v, y_2)\, K_x(u) K_y(v)\, h(x_1, x_2)\, du\, dv\, dy_1}{\int h(x_1 + c_{nx_2} u, x_2)\, K_x(u)\, du}\, dx_1,$$
and
$$\tilde V_J = V_J + 3 \int \left[\int J(y_1, y_2, x_1, x_2)\, f(y_1, y_2 | x_1, x_2)\, d\mu(y)\right]^{TT} h(x_1, x_2)\, d\nu(x).$$
Proof. Applying Lemma 5.1 we substitute Knx
and Kny
for Kx and Ky throughout. We begin by examining
(5.38)
Z Z
√
n
J(y1 , y2 , x1 , x2 )fˆn (y1 , y2 |x1 , x2 )ĥn (x1 , x2 )dµ(y)dν(x)
Z Z
√
= n
J(y1 , y2 , x1 , x2 )ĝn (x1 , x2 , y1 , y2 )dµ(y)dν(x)
n
1 X
=√
n i=1
Z Z
1
d
y
x
cdnx
2 cny2
+
J(y1 , y2 , x1 , x2 )Knx
x1 − Xi1
cnx2
+
Kny
y1 − Yi1
cny2
Ix2 (Xi2 )Iy2 (Yi2 )dµ(y)dν(x)
from which
Z Z
√
n
J(y1 , y2 , x1 , x2 )ĝn (x1 , x2 , y1 , y2 )dµ(y)dν(x) − Bn
Z Z Z Z
X
1
x1 − u
y1 − v
+
+
J(v, y2 , u, x2 )Knx
=
Kny
dudvdUn (x1 , y1 |x2 , y2 )
dx dy
cnx2
cny2
y2 ∈Sy ,x2 ∈Sx cnx2 cny2
where
Un (x1 , y1 |x2 , y2 ) =
√
n (Fn (x1 , y1 |x2 , y2 ) − F (x1 , y1 |x2 , y2 ))
where Fn (x1 , y1 |x2 , y2 ) is the empirical conditional cumulative distribution function and F its population
counterpart. We now examine
"
#2
Z Z Z 1
x
−
u
y
−
v
1
1
+
+
E d dy
J(v, y2 , u, x2 )Knx
Kny
dudv − J(y1 , y2 , x1 , x2 ) dUn (x1 , y1 |x2 , y2 )
x
cnx2
cny2
cnx
2 cny2
Z
2
+
≤ sup(Kny
) (J(y1 + cny2 bn , y2 , x1 + cnx2 bn , x2 ) − J(y1 , y2 , x1 , x2 )) g(x1 , x2 , y1 , y2 )dx1 dy1
y
≤ K + C ∗ (bdnx cdnyx 2 + bdny c2d
nx2 )
→0
from the Cauchy-Schwartz inequality and Assumption N1. The result for (5.38) now follows from the Central
Limit Theorem.
For (5.39), we first observe that
n Z
√ 1X
1
1
n
J(y1 , y2 , Xi1 , Xi2 )ĝn (y1 , y2 |Xi1 , Xi2 )
−
dµ(y) → 0
n i=1
hn (Xi1 , Xi2 ) Ehn (Xi1 , Xi2 )
almost surely from the boundedness in probability of
n Z
√ 1X
n
J(y1 , y2 , Xi1 , Xi2 )ĝn (y1 , y2 |Xi1 , Xi2 )dµ(y)
n i=1
22
and the almost sure convergence
1
1
→ 0.
sup −
Ehn (x1 , x2 ) (x1 ,x2 )∈X hn (x1 , x2 )
We now examine
n Z
1 X
ĝn (y1 , y2 |Xi1 , Xi2 )
√
J(y1 , y2 , Xi1 , Xi2 )
dµ(y)
n i=1
E ĥn (Xi1 , Xi2 )
Xj1 −Xi1
+
+
n
n Z
Kny
(u)IXi2 (Xj2 )IYi2 (Yj2 )
cnx2
1 X X J(Yj1 + cny2 u, Yj2 , Xi1 , Xi2 )Knx
=√
du
nn i=1 j=1
E ĥn (Xi1 , Xi2 )
n Z Z Z
+
J(Yj1 + cny2 u, Yj2 , Xj1 + cnx2 v, Xj2 )Kny
(u) +
1 X
=√
Knx (v) h(Xj1 + cn v, Xj2 )dudvdν(x)
n j=1
E ĥn (Xj1 + cn v, Xj2 )
n Z Z
+
+
J(y1 + cny2 u, y2 , Xi1 , Xi2 )Knx
(u) Kny
(u)
1 X
+√
g(Xi1 + cnx2 v, Xi2 , y1 , y2 )dudµ(y)
n i=1
E ĥn (Xi1 , Xi2 )
+ op (1)
Z
n 1 X
J(Yi1 , Yi2 , Xi1 , Xi2 ) + J(y1 , y2 , Xi1 , Xi2 )f (y1 , y2 |Xi1 , Xi2 )dν(y)
=√
n i=1
(5.40)
n
1 X
+√
Rn (Yi1 , Yi2 , Xi1 , Xi2 ) + op (1)
n i=1
where (5.40) is a consequence of Lemma 5.4 and
Rn (Yi1 , Yi2 , Xi1 , Xi2 )
Z Z Z
+
∇y J(Yi1 + cny2 u∗ , Yi2 , Xi1 + cnx2 v, Xi2 )Kny
(u) +
= cny2
Knx (v) h(Xi1 + cn v, Xi2 )dudvdν(x)
E ĥn (Xi1 + cn v, Xi2 )
Z Z
+
+
(u)
∇y J(y1 + cny2 u∗ , y2 , Xi1 , Xi2 )Knx
(u) Kny
g(Xi1 + cnx2 v, Xi2 , y1 , y2 )dudµ(y)
+ cny2
E ĥn (Xi1 , Xi2 )
Z Z Z
+
∇x J(Yi1 + cny2 u, Yi2 , Xi1 + cnx2 v ∗ , Xi2 )Kny
(u) +
Knx (v) h(Xi1 + cn v, Xi2 )dudvdν(x)
+ cnx2
E ĥn (Xi1 + cn v, Xi2 )
Z Z
+
+
J(y1 + cny2 u, y2 , Xi1 , Xi2 )Knx
(v) Kny
(u)
+ cnx2
∇x g(Xi1 + cnXi2 v ∗ , Xi2 , y1 , y2 )dudvdµ(y)
E ĥn (Xi1 , Xi2 )
with u∗ = su and v ∗ = tv for some s and t in [0, 1] and we observe that
"
#2
n
1 X
Rn (Yi1 , Yi2 , Xi1 , Xi2 ) → 0
E √
n i=1
from assumptions N1 and N2iv. The result now follows from the convergence in distribution
Z
n 1 X
√
J(Yi1 , Yi2 , Xi1 , Xi2 ) + J(y1 , y2 , Xi1 , Xi2 )f (y1 , y2 |Xi1 , Xi2 )dν(y) → N (0, ṼJ ).
n i=1
We remark here that the bias $B_n$ corresponds to the bias found in Tamura and Boos (1986) for multivariate observations. As observed there, the bias in the estimate $\hat g_n$ is $O(c^2_{nx_2} + c^2_{ny_2})$ and that of $\hat h_n$ is $O(c^2_{nx_2})$, regardless of the dimensions of $x_1$ and $y_1$. However, the variance is of order $(n c^{d_x}_{nx_2} c^{d_y}_{ny_2})^{-1}$ (corresponding to Assumption B2), meaning that for $d_x + d_y > 3$ the asymptotic bias in the Central Limit Theorem, $\sqrt{n}\, c^2_{nx_2} c^2_{ny_2}$, diverges and will not become zero while the variance is controlled. We will further need $n c^{2d_x}_{nx_2} c^{2d_y}_{ny_2} \to \infty$ (Assumption B4), effectively reducing the unbiased central limit theorem to the cases where either $d_x = 1$ or $d_y = 1$. As in Tamura and Boos (1986), we also note that this bias is often small in practice. However, as we demonstrate in Section 6, in the case of a univariate, homoscedastic model the use of $\tilde f_n(y | x_1, x_2)$ can remove this bias in some circumstances.
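As an informal rate check (our own arithmetic, not from the original text), take a common bandwidth $c_n = n^{-\alpha}$ for both coordinates and compare the $\sqrt{n}$-scaled smoothing bias of order $\sqrt{n}\, c_n^2$ with the requirement $n\, c_n^{2d_x} c_n^{2d_y} \to \infty$ in Assumption B4:
\begin{align*}
d_x = d_y = 1:&\quad \sqrt{n}\,c_n^{2} \to 0 \iff \alpha > \tfrac14,
  \qquad n\,c_n^{4} \to \infty \iff \alpha < \tfrac14
  \quad\text{(no single $\alpha$ satisfies both)};\\
d_x + d_y = 1:&\quad \sqrt{n}\,c_n^{2} \to 0 \iff \alpha > \tfrac14,
  \qquad n\,c_n^{2} \to \infty \iff \alpha < \tfrac12
  \quad\text{(any $\alpha \in (\tfrac14, \tfrac12)$ works)}.
\end{align*}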
Lemma 5.3. Let $\{(X_{n1}, X_{n2}, Y_{n1}, Y_{n2}), n \ge 1\}$ be given as in Section 1. Under Assumptions K1-K6, B1-B4 and N1-N2iv, for any function $J(y_1, y_2, x_1, x_2)$ satisfying the conditions on $\Psi$ in Assumptions N1-N2iv,
$$\sqrt{n} \sup_{(x_1,x_2) \in \mathcal{X}} \sum_{y_2 \in S_y} \int J(y_1, y_2, x_1, x_2)\left(\sqrt{\hat f_n(y_1, y_2 | x_1, x_2)} - \sqrt{\frac{E\hat g_n(x_1, x_2, y_1, y_2)}{E\hat h_n(x_1, x_2)}}\right)^2 dy_1 \to 0 \qquad (5.41)$$
in probability and, further,
$$\sqrt{n} \sum_{y_2 \in S_y,\, x_2 \in S_x} \int \int J(y_1, y_2, x_1, x_2)\left(\sqrt{\hat f_n(y_1, y_2 | x_1, x_2)} - \sqrt{\frac{E\hat g_n(x_1, x_2, y_1, y_2)}{\hat h_n(x_1, x_2)}}\right)^2 \hat h_n(x_1, x_2)\, dy_1\, dx_1 \to 0. \qquad (5.42)$$

Proof. Applying Lemma 5.1 we substitute $K^+_{nx}$ and $K^+_{ny}$ for $K_x$ and $K_y$ throughout. We begin by breaking the integral in (5.41) into the regions $\|y_1\| \ge \alpha_n$ and $\|y_1\| < \alpha_n$. For the first of these, expanding the square results in three terms.
First:
Z
√ X
Eĝn (x1 , x2 , y1 , y2 )
n
J(y1 , y2 , x1 , x2 )
dy1
E ĥn (x1 , x2 )
y2 ∈Sy ky1 k>αn
√ X
Z Z
Z
n
1
u − x1
u − y1
+
+
J(y
,
y
,
x
,
x
)K
≤ −
K
g(u, x2 , v, y2 )dudvdy1
1 2
1
2
nx
ny
dx dy
h
cnx2
cny2
y2 ∈Sy cnx2 cny2 ky1 k>αn
√
≤
+ o(1)

n
h−
M  sup
1/2
sup
X Z
ktk≤bn ksk≤bn y ∈S
2
y
ky1 k>αn
J(y1 + cny2 s, y2 , x1 + cnx2 t, x2 )2 g(x1 , x2 , y1 , y2 )dy1 
where
Z Z Z
M=
+2
+2
Knx
(t)Kny
(s)g(x1
1/2
+ cnx2 s, x2 , y1 + cny2 s)dsdtdµ(y)
the result now follows from Assumptions N1, N2ii and N2iii.
24
√
n
Second:
X Z
y2 ∈Sy
J(y1 , y2 , x1 , x2 )fˆn (y1 , y2 |x1 , x2 )dy1
ky1 k>αn
1/2
x1 −X1i
1
+
n Z
I
(X
)
x
2i
dx Knx
X
X
2
cnx2
sup Ky (y) 
cnx

√
n(y2 , x2 )J(Y1i + cny2 bn , y2 , x1 , x2 )2 2
≤
dsdtdy1 

n
ĥn (x1 , x2 )
i=1 ky1 k>αn

y2 ∈Sy
+ o(1)
and we observe that from Assumption N2ii the final integral is non-zero with probability smaller than
nP (ky1 − cny2 bn k > αn ) → 0.
Third:
s
Z Z
q
X
√
Eĝn (x1 , x2 , y1 , y2 )
2
n
J(y1 , y2 , x1 , x2 ) fˆn (y1 , y2 |x1 , x2 )
dy1 dx1
h(x1 , x2 )
y2 ∈Sy ,x2 ∈Sx
≤
√
1/2

n
X Z
y2 ∈Sy
ky1 k>αn
J(y1 , y2 , x1 , x2 )2 fˆn (y1 , y2 |x1 , x2 )dy1 
reducing to the case above.
√ 2
√
2
q Turning to the region ky1 k < αn , we use the identity ( a − b) ≤ (a − b) /a and add and subtract
ĝn (x1 , x2 , y1 , y2 )/E ĥn (x1 , x2 ) to obtain
√
n
X Z
q
J(y1 , y2 , x1 , x2 )
s
fˆn (y1 , y2 |x1 , x2 ) −
y2 ∈Sy
Eĝn (x1 , x2 , y1 , y2 )
E ĥn (x1 , x2 )
!2
dy1
√ Z
(ĝn (x1 , x2 , y1 , y2 ) − Eĝn (x1 , x2 , y1 , y2 ))2
n
J(y1 , y2 , x1 , x2 )
dy1
≤ −
h
Eĝn (x1 , x2 , y1 , y2 )
ky1 k<αn
Z
√ X
(ĥn (x1 , x2 ) − E ĥn (x1 , x2 ))2
+ n
J(y1 , y2 , x1 , x2 )fˆn (y1 , y2 , x1 , x2 )dy1
E ĥn (x1 , x2 )
y2 ∈Sy
and we observe that the first term is positive and smaller in expectation than
R +2
Z
+2
Knx (t)Kny
(s)g(x1 + cnx2 s, x2 , y1 + cny2 s, y2 )dsdt
1
+ o(1)
|J(y1 , y2 , x1 , x2 )| R +
√ dx dy
+
Knx (t)Kny (s)g(x1 + cnx2 s, x2 , y1 + cny2 s, y2 )dsdt
ncnx2 cny2 ky1 k<αn
Z
M
g(x1 − 2cnx2 s, x2 , y1 − 2cny2 s, y2 )
≤ √ d dy
|J(y1 , y2 , x1 , x2 )|dy1
sup
x
g(x1 , x2 , y1 , y2 )
ksk≤b
,ktk≤b
ky
k<α
ncnx
c
n
n
1
n
2 ny2
which converges to zero from Assumptions N2iii and N2iv yielding convergence in probability. The second
25
term converges by bounding
√
n
sup
(ĥn (x1 , x2 ) − E ĥn (x1 , x2 ))2
E ĥn (x1 , x2 )
(x1 ,x2 )∈X
x
1 2 log cdx 2 ncdnx
2
≤ − √ dnx
x
x
h
ncnx2 2 log cdnx
2
!2
sup
|ĥn (x1 , x2 ) − E ĥn (x1 , x2 )|
(x1 ,x2 )∈X
(5.43)
→0
almost surely from the boundedness of the last two terms on the right hand side of (5.43) (see Giné and
Guillou, 2002) and assumption B4.
To demonstrate (5.42) we observe that it is equal to
Z Z
p
2
p
√
n
J(y1 , y2 , x1 , x2 )
ĝn (x1 , x2 , y1 , y2 ) − Eĝn (x1 , x2 , y1 , y2 ) dµ(y)dν(x) → 0
in probability. This convergence can be demonstrated in the same manner as above.
Theorem 5.1. Let $\{(X_{n1}, X_{n2}, Y_{n1}, Y_{n2}), n \ge 1\}$ be given as in Section 1. Under Assumptions D1-D2, E1, K1-K6, B1-B4, P1-P5 and N1-N4, define
$$\theta_f = \arg\min_{\theta \in \Theta} D_\infty(f, \theta)$$
and
$$H^D(\theta) = \nabla^2_\theta D_\infty(f, \theta) = \sum_{y_2 \in S_y,\, x_2 \in S_x} \int \int A_2\!\left(\frac{f(y_1, y_2 | x_1, x_2)}{\phi(y_1, y_2 | x_1, x_2, \theta)}\right) \nabla^2_\theta \phi(y_1, y_2 | x_1, x_2, \theta)\, h(x_1, x_2)\, dx_1\, dy_1$$
$$\qquad + \sum_{y_2 \in S_y,\, x_2 \in S_x} \int \int A_3\!\left(\frac{f(y_1, y_2 | x_1, x_2)}{\phi(y_1, y_2 | x_1, x_2, \theta)}\right) \frac{\nabla_\theta \phi(y_1, y_2 | x_1, x_2, \theta)^{TT}}{\phi(y_1, y_2 | x_1, x_2, \theta)}\, h(x_1, x_2)\, dx_1\, dy_1,$$
$$I^D(\theta) = H^D(\theta) V^D(\theta)^{-1} H^D(\theta), \qquad \tilde I^D(\theta) = H^D(\theta) \tilde V^D(\theta)^{-1} H^D(\theta);$$
then
$$\sqrt{n}\left[T_n(\hat f_n) - \theta_f - B_n\right] \to N\!\left(0, I^D(\theta_f)^{-1}\right)$$
and
$$\sqrt{n}\left[\tilde T_n(\hat f_n) - \theta_f - \tilde B_n\right] \to N\!\left(0, \tilde I^D(\theta_f)^{-1}\right)$$
in distribution, where $B_n$, $\tilde B_n$, $V^D(\theta)$ and $\tilde V^D(\theta)$ are obtained by substituting
$$J(y_1, y_2, x_1, x_2) = A_1\!\left(\frac{f(y_1, y_2 | x_1, x_2)}{\phi(y_1, y_2 | x_1, x_2, \theta_f)}\right) \frac{\nabla_\theta \phi(y_1, y_2 | x_1, x_2, \theta_f)}{\phi(y_1, y_2 | x_1, x_2, \theta_f)} \qquad (5.44)$$
into the expressions for $B_n$, $\tilde B_n$, $V_J$ and $\tilde V_J$ in Lemma 5.2.

Here we note that in the case that $f = \phi_{\theta_0}$ for some $\theta_0$, then $\theta_f = \theta_0$ and further, since $A_1(1) = A_2(1) = A_3(1) = 1$, we have that $I^D(\theta_f)$ is given by the Fisher information for $\phi_{\theta_0}$, and furthermore $\tilde I^D(\theta_f) = I^D(\theta_f)$ since
$$\int \frac{\nabla_\theta \phi(y_1, y_2 | x_1, x_2)}{\phi(y_1, y_2 | x_1, x_2)}\, \phi(y_1, y_2 | x_1, x_2)\, d\mu(y) = 0$$
at each $x$.
Proof. We will define $\bar T_n$, $\bar f_n(y_1, y_2 | x_1, x_2)$ and $f_K(y_1, y_2 | x_1, x_2)$ to take one of the following combinations of values:

$\bar T_n = T_n$: $\quad \bar f_n = \hat f_n(y_1, y_2 | x_1, x_2)$, $\quad f_K = E\hat g_n(x_1, x_2, y_1, y_2)/\hat h_n(x_1, x_2)$;

$\bar T_n = \tilde T_n$: $\quad \bar f_n = \hat f_n(y_1, y_2 | x_1, x_2)$, $\quad f_K = E\hat g_n(x_1, x_2, y_1, y_2)/E\hat h_n(x_1, x_2)$.
Since T̄n (f ) satisfies
∇θ Dn (f, T̄n (f )) = 0
we can write
√
−1
∇θ Dn (f¯n , θ0 )
n T̄n (f¯n ) − θ0 = − ∇2θ Dn (f¯n , θ+ )
for some θ+ between T̄n (f¯n ) and θ0 . It is therefore sufficient to demonstrate
(i) ∇2θ Dn (f¯n , θ+ ) → H D (θ0 ) in probability.
√ (ii) n ∇θ Dn (f¯n , θ0 ) − B̄n → N (0, V D (θ0 )−1 ) in distribution.
with B̄n given by Bn or B̃n as appropriate.
We begin with (i) where we observe that by Assumption N4 A2 (r) and A3 (r) are bounded and the result
follows from Theorems 2.5 and 4.2 and the dominated convergence theorem. In the case of Hellinger distance
#q
Z " 2
∇θ φ(y1 , y2 , x1 , x2 , θ) ∇θ φ(y1 , y2 , x1 , x2 , θ)T T
2
¯
p
∇θ D(fn , φ|x1 , x2 , θ) =
−
f¯n (y1 , y2 |x1 , x2 )dµ(y)
φ(y1 , y2 , x1 , x2 , θ)3/2
φ(y1 , y2 , x1 , x2 , θ)
and the result follows from the uniform L1 convergence of f¯n (Theorem 2.4) and Assumption N3.
Turning to (ii) where we observe that by the boundedness of C and the dominated convergence theorem
we can write ∇θ Dn (f¯n , φ|x1 , x2 , θ) − Bn as
¯
Z
fn (y1 , y2 |x1 , x2 )
∇θ φ(y1 , y2 |x1 , x2 , θ)dµ(y) − B̄n
A2
φ(y1 , y2 |x1 , x2 , θ)
Z
fK (y1 , y2 |x1 , x2 ) ∇θ φ(y1 , y2 |x1 , x2 , θ) ¯
= A1
fn (y1 , y2 |x1 , x2 ) − fK (y1 , y2 |x1 , x2 ) dµ(y)
φ(y1 , y2 |x1 , x2 , θ)
φ(y1 , y2 |x1 , x2 , θ)
Z ¯
fn (y1 , y2 |x1 , x2 )
fK (y1 , y2 |x1 , x2 )
+
A2
− A2
∇θ φ(y1 , y2 |x1 , x2 , θ)dµ(y)
φ(y1 , y2 |x1 , x2 , θ)
φ(y1 , y2 |x1 , x2 , θ)
¯
Z
fK (y1 , y2 |x1 , x2 )
fn (y1 , y2 |x1 , x2 )
fK (y1 , y2 |x1 , x2 )
− A1
−
∇θ φ(y1 , y2 |x1 , x2 , θ)dµ(y)
φ(y1 , y2 |x1 , x2 , θ)
φ(y1 , y2 |x1 , x2 , θ) φ(y1 , y2 |x1 , x2 , θ)
from a minor modification of Lemma 25 of Lindsay (1994) we have that by the boundedness of A1 and A2 there
is a constant B such that
|A2 (r2 ) − A2 (s2 ) − (r2 − s2 )A1 (s2 )| ≤ (r2 − s2 )B
substituting
s
r=
f¯n (y1 , y2 |x1 , x2 )
, s=
φ(y1 , y2 |x1 , x2 , θ)
s
fK (y1 , y2 |x1 , x2 )
φ(y1 , y2 |x1 , x2 , θ)
we obtain
¯
Z
fn (y1 , y2 |x1 , x2 )
A2
∇θ φ(y1 , y2 |x1 , x2 , θ)dµ(y) − B̄n
φ(y1 , y2 |x1 , x2 , θ)
Z
fK (y1 , y2 |x1 , x2 ) ∇θ φ(y1 , y2 |x1 , x2 , θ) ¯
= A1
fn (y1 , y2 |x1 , x2 ) − fK (y1 , y2 |x1 , x2 ) dµ(y)
φ(y1 , y2 |x1 , x2 , θ)
φ(y1 , y2 |x1 , x2 , θ)
2
q
Z
p
∇θ φ(y1 , y2 |x1 , x2 , θ)
f¯n (y1 , y2 |x1 , x2 ) − fK (y1 , y2 |x1 , x2 ) dµ(y)
+B
φ(y1 , y2 |x1 , x2 , θ)
The result now follows from Lemmas 5.2 and 5.3.
For the special case of Hellinger distance, we observe that
Z
q
∇θ φ(y1 , y2 , x1 , x2 , θ) ¯
p
∇θ Dn (f¯n , φ|x1 , x2 , θ) =
fn (y1 , y2 |x1 , x2 )dµ(y)
φ(y1 , y2 , x1 , x2 , θ)
√
√
√
√
√
√
and the result is obtained from applying the identity a − b = (a − b)2 /2 a + ( b − a)2 /2 a with
a = fK (y1 , y2 |x1 , x2 ) and b = f¯n (y1 , y2 |x1 , x2 ) along with Assumption N3.
Lemma 5.4. For any sequence of continuous square-integrable functions bn (x, y) with limit b(x, y) and
(Xi , Yi , Zi ), i = 1, 2, ... a sequence of random variables with distribution G(x, y, z) = gx (x)gy (y)gz (z) with
Z
Z
bn (x, y)gx (x)dx = 0, ∀y and
b(x, y)gy (y)dy = 0, ∀x
and
Z
2
(bn (x, y) − b(x, y)) gx (x)gy (y) → 0
we have
n
n
1 XX
√
bn (Xi , Yj ) → op (1)
nn i=1 j=1
and further for any square integrable function a(x, y)
Z
n
n 1 XX
√
a(Xi , Yj ) − a(x, y)gx (x)gy (y)dxdy → N (0, Σ)
nn i=1 j=1
in distribution where the asymptotic variance is given by
2
2
Z Z
Z Z
Σ=
a(x, y)gx (x)dx gy (y)dy +
a(x, y)gy (y)dy g( x)dx.
28
Finally, for any sequence of square integrable functions cn (x, y, z) → c(x, y, z) with zero univariate marginals,
√
n
n
n
1 XXX
cn (Xi , Yj , Zk ) → op (1)
nn2 i=1 j=1
k=1
Proof. All the expressions above are U -statistics (e.g. Hoeffding, 1948); the results above are special cases
of Theorem 1 in Weber (1983).
6 Asymptotic Normality and Efficiency of Minimum Disparity Estimators for Conditional Models: Homoscedastic Case

For the homoscedastic case (1.1) with estimators (1.6)-(1.8), we can again obtain a central limit theorem. Although the calculations are direct, it also produces efficient estimation in location-scale families. In this framework, the analysis produces an asymptotic bias of order $o(n^{-1/2})$ when both $d_x$ and $d_y$ are 1 or 0, as opposed to the general case above in which only one can be non-zero. We speculate that if (1.6) is replaced by a generalized additive model employing only univariate terms, the bias is also asymptotically $o(n^{-1/2})$ regardless of the dimension of $x_1$, so long as the additivity assumptions remain true.
The results below largely mimic those in Section 5. In this case we need two further assumptions:

H1 There exists $a_y > 0$ such that
$$\sup_{(x_1,x_2) \in \mathcal{X}} \sup_{\|s\| \le a_y} \int \Psi_\theta(x_1, x_2, y_1)\, \frac{\left(\nabla_y f(y_1 + s | x_1, x_2)\right)^2}{f(y_1 | x_1, x_2)}\, dy_1 < \infty.$$

H2 There is a diverging sequence $b_n$ such that
$$n \sup_{(x_1,x_2) \in \mathcal{X}} \int\!\!\int\!\!\int_{\|u\| > b_n,\, \|v\| > b_n} \Psi^2_\theta(x_1 + c_{nx_2} u, x_2, y_1 + c_{ny_2} v)\, K^2_y(v) K^2_x(u)\, f^*(y_1 | x_1, x_2)\, du\, dv\, dy_1 \to 0.$$
Lemma 6.1. Let {(Xn1 , Xn2 , Yn1 ), n ≥ 1} be given as in Section 1 with the restriction (1.1), under assumptions D1-D2, E1, K1 - K6, B1-B6 and P1-P5 for any function J(y1 , y2 , x1 , x2 ) satisfying the conditions on
Ψ in Assumptions N1-N2iv with
X Z Z
J(y1 , x1 , x2 )f (y|x1 , x2 )h(x1 , x2 )dx1 dy1 = 0
x2 ∈Sx
29
and
VJ1 =
X Z Z
T T
J(y1 , x1 , x2 )f (y1 |x1 , x2 )dy1
h(x1 , x2 )dx1 < ∞,
(6.45)
x2 ∈Sx
VJ2 =
Z " X Z
#T T
J(e + m(x1 , x2 ), x1 , x2 )h(x1 , x2 )dx1
f ∗ (e)de < ∞,
(6.46)
h(x1 , x2 )dx1 < ∞,
(6.47)
x2 ∈Sx
VJ3 = σ 2
X Z Z
T T
∇y J(y1 , x1 , x2 )f (y1 |x1 , x2 )dy1
x2 ∈Sx
VJ4 = σ 2
"Z
X Z
#T T
∇y J(y1 , x1 , x2 )f (y1 |x1 , x2 )h(x1 , x2 )dy1 dx1
< ∞,
(6.48)
x2 ∈Sx
(6.49)
with σ 2 =
R
e2 f ∗ (e)de. Define
VJ = VJ1 + VJ2 + VJ3 − VJ4
then the following hold
"
#
X Z Z
√
n
J(y1 , x1 , x2 )f˜n (y1 |x1 , x2 )ĥn (x1 , x2 )dy1 dx1 − Bn∗ → N (0, VJ )
(6.50)
x2 ∈Sx
in distribution and similarly
#
√ "X
n Z
n
∗
J(y1 , Xi1 , Xi2 )f˜n (y1 |Xi1 , Xi2 )dy1 − B̃n → N (0, VJ )
n i=1
(6.51)
in distribution where
X Z Z Z
Bn∗ =
J(y1 , x1 , x), f (y1 + cny2 u|x1 , x2 )Ky (u)h(x1 , x2 )dudy1 dx1
x2 ∈Sx
+
X Z Z
x2 ∈Sx
−
X Z Z
R
∇y J(y1 , x1 , x2 )
m(x1 + cnx2 u, x2 )h(x1 + cnx2 u, x2 )Kx (u)du
R
f (y1 |x1 , x2 )h(x1 , x2 )dx1 dy1
h(x1 + cnx2 v, x2 )Kx (v)dv
∇y J(y1 , x1 , x2 )m(x1 , x2 )f (y1 |x1 , x2 )h(x1 , x2 )dx1 dy1
x2 ∈Sx
=
X Z Z
J(y1 , x1 , x2 )E f˜n∗ (y1 |x1 , x2 )h(x1 , x2 )dy1 dx1
x2 ∈Sx
+
X Z Z
∇y J(y1 , x1 , x2 ) [E m̂n (x1 , x2 ) − m(x1 , x2 )] f (y1 |x1 , x2 )h(x1 , x2 )dx1 dy1
x2 ∈Sx
∗
∗
= Bn1
+ Bn2
30
and
B̃n∗ =
X Z Z
J(y1 , x1 , x2 )E f˜n∗ (y1 |x1 , x2 )E ĥn (x1 , x2 )dy1 dx1
x2 ∈Sx
+
X Z
∇y J(y1 , x1 , x2 ) [E m̂n (x1 , x2 ) − m(x1 , x2 )] f ∗ (y1 |x1 , x2 )E ĥn (x1 , x2 )dy1 dx1
x2 ∈Sx
=
∗
B̃n1
∗
+ B̃n2
and in particular for dy = 1 and dx = 1,
√
nBn∗ → 0 and
√
nB̃n∗ → 0.
(6.52)
Proof. We give the proof for (6.51) below; (6.50) is demonstrated in a completely analogous manner.
+
+
Applying Lemma 5.1 we substitute Knx
and Kny
for Kx and Ky throughout. We begin with (6.51) where
we observe
√ X
n Z
n
J(y1 , Xi1 , Xi2 )f˜n (y1 |Xi1 , Xi2 )dy1
n i=1
!
n
n Z
1 X1X
e − Ẽjn
+
=√
J(e + m̂n (Xi1 , Xi2 ), Xi1 , Xi2 )Kny
de
cny2
n i=1 n j=1
n
n Z
1 XX
+
= √
J(j + m(Xi1 , Xi2 ) + cny2 u + ên (Xi1 , Xi2 ) − ên (Xj1 , Xj2 ), Xi1 , Xi2 )Kny
(u)du
n n i=1 j=1
n Z
n
1 XX
+
J(j + m(Xi1 , Xi2 ) + cny2 u, Xi1 , Xi2 )Kny
(u)du
= √
n n i=1 j=1
n Z
n
1 XX
+
+ √
∇y J(j + m(Xi1 , Xi2 ) + cny2 u, Xi1 , Xi2 )Kny
(u)du · [ên (Xi1 , Xi2 ) − ên (Xj1 , Xj2 )]
n n i=1 j=1
!
n
1 X
2
+O √
en (Xi1 , Xi2 )
n i=1
where Ẽjn = Yi − m̂n (Xi1 , Xi2 ), i = Yi − m(Xi1 , Xi2 ) and ên (x1 , x2 ) = m̂n (x1 , x2 ) − m(x1 , x2 ) and we
observe that the i are independent
i1 , Xi2 ).
of (X
Pn
1
2
√
From here we observe that , O
e
(X
,
X
)
→ 0 in probability from Nadaraya (1989, Chapn
i1
i2
i=1
n
ter 3, Theorem 1.4).
31
Further
n
n
1 XX
√
n j=1 i=1
Z
+
∗
J(j + m(Xi1 , Xi2 ) + cny2 u, Xi1 , Xi2 )Kny
(u)du − Bn1
n
1 X
=√
n i=1
Z Z
+
∗
J(e + m(Xi1 , Xi2 ), Xi1 , Xi2 )f ∗ (e + cny2 u)Kny
(u)dude − Bn1
Z Z
1
+
∗
+√
J(j + m(x1 , x2 ) + cny2 u, x1 , x2 )Kny
(u)h(x1 , x2 )dudν(x) − Bn1
n
+ op (1)
from Lemma 5.4 below. These terms converge in mean square to
n
1 X
√
n i=1
Z
J(e + m(Xi1 , Xi2 ), Xi1 , Xi2 )f ∗ (e)de +
Z
J(j + m(x1 , x2 ), x1 , x2 )h(x1 , x2 )dν(x)
n
1 X
=√
Q1 (i , Xi1 , Xi2 )
n i=1
These mean squares are guaranteed to be finite from assumptions N1 and N2i.
For the middle term in this expansion, we define
Z
+
L(, x1 , x) = ∇y J( + m(x1 , x2 ) + cny2 u, x1 , x2 )Kny
(u)du
for notational convenience. We also write Yk = m(Xk1 , Xk2 ) + k and divide ên (x1 , x2 ) into
ên (x1 , x2 ) = ên1 (x1 , x2 ) + ên2 (x1 , x2 )
where
ên1 (x1 , x2 ) =
1
x
ncdnx
2
n K+
X
i nx
1
x
ncdnx
2
n m(X , X )K +
X
i1
i2
nx
i=1
x−Xi1
cn
Ix2 (Xi2 )
ĥn (Xi1 , Xi2 )
i=1
and
ên2 (x1 , x2 ) =
x−Xi1
cn
ĥn (Xi1 , Xi2 )
32
Ix2 (Xi2 )
− m(x1 , x2 ).
We begin by examining terms involving ên1 and expand the summation:
n
n
1 XX
√
L(j , Xi1 , Xi2 ) [ên1 (Xi1 , Xi2 ) − ên1 (Xj1 , Xj2 )]
n n i=1 j=1


Xj1 −Xk1
+
Xi1 −Xk1
+
n X
n X
n
IXj2 (Xk2 )
K
I
(X
)
K
X
X
k2
nx
nx
i2
cnXj2
cnXi2
1

−
=
k L(j , Xi1 , Xi2 ) 
x
n5/2 cdnx
ĥn (Xi1 , Xi2 )
ĥn (Xj1 , Xj2 )
2 i=1 j=1 k=1
Z Z
n
+
1 X
L(e, Xk1 + cnXk2 v, Xk2 )Knx
(v)f ∗ (e)h(Xk1 + cnXk2 v, Xk2 )
=√
k
dvde
n
E ĥn (Xk1 + cnXk2 v, Xk2 )
k=1
Z Z
Z
n
+
1 X
Knx
(v)h(Xk1 + cnXk2 v, Xk2 )
−√
k
L(e, x1 , x2 )f ∗ (e)h(x1 , x2 )dedν(x)
dv
n
E ĥn (Xk1 + cnXk2 v, Xk2 )
k=1
+ op (1)
from Lemma 5.4. By taking Taylor expansions to remove terms involving cny2 and cnx2 , we observe that
this converges in expected mean square to
n
n
1 X
1 X
√
Q2 (k , Xk1 , Xk2 ) = √
k
n
n
k=1
Z
∇y J(e + m(Xk1 , Xk2 ), Xk1 , Xk2 )f ∗ (e)de
k=1
n
1 X
−√
k
n
Z Z
∇y J(e + m(x1 , x2 ), x1 , x2 )f ∗ (e)h(x1 , x2 )dedν(x) .
k=1
and we observe that these have expectation zero. The finiteness of these mean squares is obtains from
Assumptions N1 and N2i.
33
Turning to terms involving ên2 , we observe that
n
n
1 XX
L(j , Xi1 , Xi2 )ên2 (Xi1 , Xi2 )
n2 i=1 j=1
1
=
x
6
n cdnx
2
=
n X
n X
n
X

+
m(Xk1 , Xk2 )Knx
Xi1 −Xk1
cnXi2

IXi2 (Xk2 )
− m(Xi1 , xi2 )
ĥ
(X
,
X
)
n
i1
i2
i=1 j=1 k=1


x01 −x1
+
0
Z
Z
m(x
,
x
)K
I
(x
)h(x
,
x
)
n
1
2
2
1
2
x2
nx
cnx0
X


2
0
0
0
L(j , x01 , x02 ) 
− m(x01 , x02 )

 h(x1 , x2 )dx1 dx1
0
0
E
ĥ
(x
,
x
)
n
1
2
j=1
1
x
ncdnx
2
L(j , Xi1 , Xi2 ) 

+
+
 m(Xk1 , Xk2 )Knx
n Z Z
1 X
L(e, x01 , x02 )f ∗ (e) 

x
ncdnx
2 k=1
1
+ dx
ncnx2
n Z Z
X
x01 −Xk1
cnx0
2
E ĥn (x01 , x02 )

L(e, Xi1 , Xi2 )f ∗ (e) 
+
m(x1 , x2 )Knx
Xi1 −x1
cnXi2

I (Xk2 )
x02

0
0
0
− m(x01 , x02 )
 h(x1 , x2 )dx1 de
IXi2 (x2 )h(x1 , x2 )
E ĥn (Xi1 , Xi2 )
i=1

− m(Xi1 , Xi2 ) dx1 de
+ op (1)
from Lemma 5.4. Here we observe that all three terms in the final expression are o(c2nx2 ) after change of
variables and taking Taylor expansions. Thus we have


n X
n
X
√
1
L(j , Xi1 , Xi2 )ên2 (Xi1 , Xi2 ) − Bn2  → 0
n 2
n i=1 j=1
∗
∗
in probability. We thus obtain
Pna the result from collecting the bias terms Bn1 + Bn2 and applying the the
1
√
central limit theorem to n i=1 (Q1 (i , Xi1 , Xi2 ) + Q2 (i , Xi1 , Xi2 )).
We note that (6.52) follows from a second-order Taylor approximation and condition B6.
Lemma 6.2. Let {(Xn1 , Xn2 , Yn1 , n ≥ 1} be given as in Section 1, under assumptions K1 - K6, B1-B4 and
N1-N2iv for any function J(y1 , x1 , x2 ) satisfying the conditions on Ψ in Assumptions N1-N2iv and H1 and
H2, define
Z
E f˜∗ (y1 |x1 , x2 ) = f ∗ (y1 + cny u − m(x1 , x2 ))Ky (u)du
2
then
\[
\sqrt{n}\,\sup_{(x_1,x_2)\in\mathcal{X}}\,\sum_{y_2\in S_y}\int J(y_1,x_1,x_2)
\left(\sqrt{\tilde f_n(y_1|x_1,x_2)} - \sqrt{E\tilde f^*(y_1|x_1,x_2)}\right)^2 dy_1 \rightarrow 0
\qquad (6.53)
\]
in probability and further, when $d_y = 1$,
\[
\sqrt{n}\,\sup_{(x_1,x_2)\in\mathcal{X}}\,\sum_{y_2\in S_y,\,x_2\in S_x}\int J(y_1,x_1,x_2)
\left(\sqrt{E\tilde f^*(y_1|x_1,x_2)} - \sqrt{f^*(y_1|x_1,x_2)}\right)^2 dy_1 \stackrel{p}{\rightarrow} 0.
\qquad (6.54)
\]
Proof. Here we proceed as in the proof of Theorem 3.24 and for any continuous and bounded m̃(x1 , x2 )
define f˜(e, m̃) as in (3.25) along with
\[
m_- = \inf_{(x_1,x_2)\in\mathcal{X}}\left(\tilde m(x_1,x_2) - m(x_1,x_2)\right), \qquad
m_+ = \sup_{(x_1,x_2)\in\mathcal{X}}\left(\tilde m(x_1,x_2) - m(x_1,x_2)\right)
\]
and observe that
\[
\int J(e + m(x_1,x_2),x_1,x_2)\left(\sqrt{\tilde f(e,\tilde m)} - \sqrt{f^*(e)}\right)^2 de
\le (m_+ - m_-)^2 \sup_{m_-\le\eta\le m_+}\int J(e + m(x_1,x_2),x_1,x_2)\,\frac{\left(\nabla_y f^*(e+\eta)\right)^2}{f^*(e)}\,de.
\]
Thus, substituting m̂n for m̃
\[
\int J(e + m(x_1,x_2),x_1,x_2)\left(\sqrt{\tilde f(e,\hat m_n)} - \sqrt{f^*(e)}\right)^2 de \rightarrow 0
\]
almost surely from the convergence
\[
\sqrt{n}\left(\sup_{(x_1,x_2)\in\mathcal{X}}\left|\hat m_n(x_1,x_2) - m(x_1,x_2)\right|\right)^2 \rightarrow 0
\]
(see, e.g., Nadaraya, 1989, Chapter 3, Theorem 3.4).
We now consider fˆn (e, m̃) as given in (3.26) and demonstrate that for any m̃ with m+ − m− < my we
have
\[
\sqrt{n}\int J(e + m(x_1,x_2),x_1,x_2)\left(\sqrt{\hat f_n(e,\tilde m)} - \sqrt{E\hat f_n(e,\tilde m)}\right)^2 de \rightarrow 0
\]
in probability. To do this we follow the proof of Lemma 5.3 with minor modifications. In particular, integrating over $\|y_1\| > \alpha_n - c$ we complete the square and observe that
\[
\sqrt{n}\int_{\|y_1\|>\alpha_n-c} J(e + m(x_1,x_2),x_1,x_2)\,E\hat f(e,\tilde m)\,de
\le \sqrt{n}\,M \sup_{\|s\|\le b_n}\left[\int_{\|e+m(x_1,x_2)\|>\alpha_n-c} J(e + m(x_1,x_2) + c_{ny_2}s,\,x_1,x_2)^2\,\tilde f(e,\tilde m)\,de\right]^{1/2}
\]
\[
\le \sqrt{n}\,M \sup_{\|\eta\|<m_y}\,\sup_{\|s\|\le b_n}\left[\int_{\|y_1\|>\alpha_n-c} J(y_1 + \eta + c_{ny_2}s,\,x_1,x_2)^2\,f^*(y_1 - m(x_1,x_2))\,dy_1\right]^{1/2}
\rightarrow 0
\]
from Assumption N2ii with $M = \sup_y K_y(y)$. Similarly
\[
\sqrt{n}\int_{\|y_1\|>\alpha_n-c} J(y_1,x_1,x_2)\,\hat f(e,\tilde m)\,dy_1
\le \frac{\sup_y K_y(y)}{\sqrt{n}}\,\sup_{\|\eta\|<m_y}\,\sup_{\|s\|\le b_n}
\left[\sum_{i=1}^{n}\int_{\|y_1\|>\alpha_n-c} J(Y_i + \eta + c_{ny_2}s,\,x_1,x_2)^2\,du\right]^{1/2} \rightarrow 0.
\]
On the region $\|y_1\| < \alpha_n - c$ we make use of the identity $(\sqrt{a} - \sqrt{b})^2 \le (a-b)^2/a$ to obtain that
\[
\sqrt{n}\int_{\|y_1\|\le\alpha_n-c} J(e + m(x_1,x_2),x_1,x_2)\left(\sqrt{\hat f_n(e,\tilde m)} - \sqrt{E\hat f_n(e,\tilde m)}\right)^2 de
\le \sqrt{n}\int_{\|y_1\|\le\alpha_n-c} J(e + m(x_1,x_2),x_1,x_2)\,
\frac{\left(\hat f_n(e,\tilde m) - E\hat f_n(e,\tilde m)\right)^2}{E\hat f_n(e,\tilde m)}\,de
\]
which is smaller in expectation than
\[
\frac{1}{\sqrt{n}\,c^{d_y}_{ny_2}}\int_{\|y_1\|\le\alpha_n-c}
\left|J(e + m(x_1,x_2),x_1,x_2)\right|\,
\frac{\int K^{+2}_{ny}(s)\,\tilde f(e + c_{ny_2}s,\tilde m)\,ds}{\int K^{+}_{ny}(s)\,\tilde f(e + c_{ny_2}s,\tilde m)\,ds}\,de
\le \frac{1}{\sqrt{n}\,c^{d_y}_{ny_2}}\int_{\|y_1\|\le\alpha_n}\left|J(e + m(x_1,x_2),x_1,x_2)\right|de\;
\sup_{\|e\|<\alpha_n}\,\sup_{\|t\|\le 2c_{ny_2}b_n}\frac{f^*(e+2t)}{f^*(e)} \rightarrow 0
\]
from Assumptions N2iii and N2iv.
Finally, to demonstrate (6.54) we observe that
\[
\sqrt{n}\int J(e + m(x_1,x_2),x_1,x_2)\left(\sqrt{\tilde f(e,\tilde m)} - \sqrt{E\hat f_n(e,\tilde m)}\right)^2 de
\le \sqrt{n}\,c^2_{ny_2}b^2_n\int J(e + \tilde m(x_1,x_2),x_1,x_2)\,
\frac{\left(\nabla^2\tilde f(e+\eta)\right)^2}{\tilde f(e)}\,de \rightarrow 0
\]
from Assumptions B6, N2i and N1.
We now observe that the above results do not depend on m̃ except in so far as sup |m̃(x1 , x2 )−m(x1 , x2 )| <
c. Since this is almost surely true of m̂n for n sufficiently large, the result holds.
Theorem 6.1. Let $\{(X_{n1},X_{n2},Y_{n1}),\,n \ge 1\}$ be given as in Section 1 under model (1.1). Under Assumptions D1-D2, E1, K1-K6, B1-B4, P1-P5, N1-N4 and H1-H2, define
\[
\theta_f = \arg\min_{\theta\in\Theta} D_\infty(f,\theta)
\]
and $\tilde V^D = V_{J1} + V_{J2} + V_{J3} - V_{J4}$ as given in (6.45-6.48) with
\[
J(y_1,x_1,x_2) = A_1\!\left(\frac{f(y_1|x_1,x_2)}{\phi(y_1|x_1,x_2,\theta)}\right)
\frac{\nabla_\theta\phi(y_1|x_1,x_2,\theta)}{\phi(y_1|x_1,x_2,\theta)}
\]
and
\[
H^D(\theta) = \nabla^2_\theta D_\infty(f,\theta)
= \sum_{x_2\in S_x}\int A_2\!\left(\frac{f(y_1|x_1,x_2)}{\phi(y_1|x_1,x_2,\theta)}\right)
\nabla^2_\theta\phi(y_1|x_1,x_2,\theta)\,h(x_1,x_2)\,dx_1\,dy_1
\]
\[
\quad + \sum_{x_2\in S_x}\int A_3\!\left(\frac{f(y_1|x_1,x_2)}{\phi(y_1|x_1,x_2,\theta)}\right)
\frac{\nabla_\theta\phi(y_1|x_1,x_2,\theta)\,\nabla_\theta\phi(y_1|x_1,x_2,\theta)^T}{\phi(y_1|x_1,x_2,\theta)}\,h(x_1,x_2)\,dx_1\,dy_1
\]
\[
I^D(\theta) = H^D(\theta)\,\tilde V^D(\theta)^{-1}\,H^D(\theta)
\]
then
\[
\sqrt{n}\left[T_n(\hat f_n) - \theta_f - B^*_n\right] \rightarrow N\!\left(0,\,I^D(\theta_f)^{-1}\right)
\]
and
\[
\sqrt{n}\left[\tilde T_n(\hat f_n) - \theta_f - \tilde B^*_n\right] \rightarrow N\!\left(0,\,I^D(\theta_f)^{-1}\right)
\]
in distribution, where $B^*_n$ and $\tilde B^*_n$ are given in Lemma 6.1.
Here we note that when $f = \phi_{\theta_0}$ for some $\theta_0$, then $\theta_f = \theta_0$, and further, since $A_1(1) = A_2(1) = A_3(1) = 1$, we have that $H^D(\theta_f)$ is given by the Fisher Information for $\phi_{\theta_0}$. In fact, when a regression model with a location-scale family for residuals is proposed for $\phi_\theta$, $\tilde V^D(\theta_f)$ is also given by the Fisher Information, yielding an efficient estimate. Specifically, we assume that
\[
\phi_{\theta,\sigma}(y_1|x_1,x_2) = \frac{1}{\sigma}\,\phi\!\left(\frac{y_1 - m_\theta(x_1,x_2)}{\sigma}\right)
\]
in which case
\[
\Psi_\theta(y_1,x_1,x_2) = \frac{\nabla_\theta m_\theta(x_1,x_2)}{\sigma}\,
\frac{\phi'\!\left(\sigma^{-1}(y_1 - m_\theta(x_1,x_2))\right)}{\phi\!\left(\sigma^{-1}(y_1 - m_\theta(x_1,x_2))\right)}
\]
\[
\Psi_\sigma(y_1,x_1,x_2) = -\frac{1}{\sigma} - \frac{y_1 - m_\theta(x_1,x_2)}{\sigma^2}\,
\frac{\phi'\!\left(\sigma^{-1}(y_1 - m_\theta(x_1,x_2))\right)}{\phi\!\left(\sigma^{-1}(y_1 - m_\theta(x_1,x_2))\right)}
\]
with σ scaled so that
\[
\int \frac{\phi'(e)^2}{\phi(e)}\,de = 1.
\]
Substituting $\Psi_\theta$ for $J$ in (6.45-6.48) with $f^*(e) = \phi(e)$ and $m(x_1,x_2) = m_{\theta_0}(x_1,x_2)$ yields
\[
\int \Psi_\sigma(y_1,x_1,x_2)\,\phi_{\theta,\sigma}(y_1|x_1,x_2)\,dy_1 = 0,
\]
from which
\[
V_{J1} = 0, \qquad
V_{J2} = \frac{1}{\sigma^2}\left(\sum_{x_2\in S_x}\int \nabla_\theta m_\theta(x_1,x_2)\,h(x_1,x_2)\,dx_1\right)
\left(\sum_{x_2\in S_x}\int \nabla_\theta m_\theta(x_1,x_2)\,h(x_1,x_2)\,dx_1\right)^{T}
\]
and since
\[
\nabla_y\Psi_\theta(y_1,x_1,x_2) = \frac{\nabla_\theta m_\theta(x_1,x_2)}{\sigma^2}
\left[\frac{\phi''\!\left(\sigma^{-1}(y_1 - m_\theta(x_1,x_2))\right)}{\phi\!\left(\sigma^{-1}(y_1 - m_\theta(x_1,x_2))\right)}
- \frac{\phi'\!\left(\sigma^{-1}(y_1 - m_\theta(x_1,x_2))\right)^2}{\phi\!\left(\sigma^{-1}(y_1 - m_\theta(x_1,x_2))\right)^2}\right]
\]
we have
\[
V_{J3} - V_{J4} = \sigma^2\left[\frac{1}{\sigma^3}\int\frac{\phi'(e)^2}{\phi(e)}\,de\right]^2
\left(\sum_{x_2\in S_x}\int \nabla_\theta m_\theta(x_1,x_2)\,\nabla_\theta m_\theta(x_1,x_2)^T\,h(x_1,x_2)\,dx_1
- \left(\sum_{x_2\in S_x}\int \nabla_\theta m_\theta(x_1,x_2)\,h(x_1,x_2)\,dx_1\right)
\left(\sum_{x_2\in S_x}\int \nabla_\theta m_\theta(x_1,x_2)\,h(x_1,x_2)\,dx_1\right)^{T}\right)
\]
giving
\[
\tilde V^D_{\theta,\theta} = \frac{1}{\sigma^2}\sum_{x_2\in S_x}\int \nabla_\theta m_\theta(x_1,x_2)\,\nabla_\theta m_\theta(x_1,x_2)^T\,h(x_1,x_2)\,dx_1.
\]
Similarly, we can demonstrate that
\[
\tilde V^D_{\sigma,\sigma} = \frac{4}{\sigma^4}, \qquad \tilde V^D_{\theta,\sigma} = 0
\]
giving the usual Fisher Information.
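As an illustration of this reduction (a special case added here for concreteness, not part of the development above), take $\phi$ to be the standard Gaussian density $\phi(e) = (2\pi)^{-1/2}e^{-e^2/2}$, so that $\phi'(e)/\phi(e) = -e$ and
\[
\int\frac{\phi'(e)^2}{\phi(e)}\,de = \int e^2\phi(e)\,de = 1,
\]
meaning no rescaling of $\sigma$ is required. The score and variance expressions above then become
\[
\Psi_\theta(y_1,x_1,x_2) = -\frac{\nabla_\theta m_\theta(x_1,x_2)}{\sigma}\cdot\frac{y_1 - m_\theta(x_1,x_2)}{\sigma}, \qquad
\tilde V^D_{\theta,\theta} = \frac{1}{\sigma^2}\sum_{x_2\in S_x}\int \nabla_\theta m_\theta(x_1,x_2)\,\nabla_\theta m_\theta(x_1,x_2)^T\,h(x_1,x_2)\,dx_1,
\]
the familiar Fisher Information for $\theta$ in a Gaussian regression model with variance $\sigma^2$.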
Proof. Since our proof remains nearly identical to that of Theorem 5.1, up to the use of Lemmas 6.1 and 6.2, we will define $\bar T_n$, $\bar f_n(y_1|x_1,x_2)$ and $f_K(y_1|x_1,x_2)$ to take one of the following combinations of values:
\[
\begin{array}{lll}
\bar T_n & \bar f_n(y_1|x_1,x_2) & f_K(y_1|x_1,x_2) \\
T_n & \tilde f_n(y_1|x_1,x_2) & E\tilde f^*_n(y_1|x_1,x_2) \\
\tilde T_n & \tilde f_n(y_1|x_1,x_2) & E\tilde f^*_n(y_1|x_1,x_2)
\end{array}
\]
Our arguments here follow those in Tamura and Boos (1986) and Park and Basu (2004).
Since $\bar T_n(f)$ satisfies
\[
\nabla_\theta D_n(f,\theta) = 0
\]
we can write
\[
\sqrt{n}\left(\bar T_n(\bar f_n) - \theta_0\right) = -\left[\nabla^2_\theta D_n(\bar f_n,\theta^+)\right]^{-1}\sqrt{n}\,\nabla_\theta D_n(\bar f_n,\theta_0)
\]
for some θ+ between T̄n (f¯n ) and θ0 . It is therefore sufficient to demonstrate
(i) $\nabla^2_\theta D_n(\bar f_n,\theta^+) \rightarrow H^D(\theta_0)$ in probability.
(ii) $\sqrt{n}\left(\nabla_\theta D_n(\bar f_n,\theta_0) - \bar B_n\right) \rightarrow N\!\left(0,\,\tilde V^D(\theta_0)\right)$ in distribution,
with $\bar B_n$ given by $B^*_n$ or $\tilde B^*_n$ as appropriate.
We begin with (i), where we observe that by Assumption N4, $A_2(r)$ and $A_3(r)$ are bounded and the result follows from Theorems 2.5 or 3.1 and 4.2 and the dominated convergence theorem. In the case of Hellinger distance
\[
\nabla^2_\theta D(\bar f_n,\phi|x_1,x_2,\theta) = \int\left[
\frac{\nabla_\theta\phi(y_1,x_1,x_2,\theta)\,\nabla_\theta\phi(y_1,x_1,x_2,\theta)^T}{\phi(y_1,x_1,x_2,\theta)^{3/2}}
- \frac{\nabla^2_\theta\phi(y_1,x_1,x_2,\theta)}{\sqrt{\phi(y_1,x_1,x_2,\theta)}}\right]
\sqrt{\bar f_n(y_1|x_1,x_2)}\,dy_1
\]
and the result follows from the uniform L1 convergence of f¯n (Theorem 2.4) and Assumption N3.
Turning to (ii), we observe that by the boundedness of $C$ and the dominated convergence theorem we can write $\nabla_\theta D_n(\bar f_n,\phi|x_1,x_2,\theta)$ as
\[
\int A_2\!\left(\frac{\bar f_n(y_1|x_1,x_2)}{\phi(y_1|x_1,x_2,\theta)}\right)\nabla_\theta\phi(y_1|x_1,x_2,\theta)\,dy_1 - \bar B_{n1}
\]
\[
= \int A_1\!\left(\frac{f_K(y_1|x_1,x_2)}{\phi(y_1|x_1,x_2,\theta)}\right)
\frac{\nabla_\theta\phi(y_1|x_1,x_2,\theta)}{\phi(y_1|x_1,x_2,\theta)}
\left(\bar f_n(y_1|x_1,x_2) - f_K(y_1|x_1,x_2)\right)dy_1
\]
\[
\quad + \int\left[A_2\!\left(\frac{\bar f_n(y_1|x_1,x_2)}{\phi(y_1|x_1,x_2,\theta)}\right)
- A_2\!\left(\frac{f_K(y_1|x_1,x_2)}{\phi(y_1|x_1,x_2,\theta)}\right)\right]\nabla_\theta\phi(y_1|x_1,x_2,\theta)\,dy_1
\]
\[
\quad - \int A_1\!\left(\frac{f_K(y_1|x_1,x_2)}{\phi(y_1|x_1,x_2,\theta)}\right)
\left(\frac{\bar f_n(y_1|x_1,x_2)}{\phi(y_1|x_1,x_2,\theta)} - \frac{f_K(y_1|x_1,x_2)}{\phi(y_1|x_1,x_2,\theta)}\right)
\nabla_\theta\phi(y_1|x_1,x_2,\theta)\,dy_1.
\]
From a minor modification of Lemma 25 of Lindsay (1994), by the boundedness of $A_1$ and $A_2$ there is a constant $B$ such that
\[
\left|A_2(r^2) - A_2(s^2) - (r^2 - s^2)A_1(s^2)\right| \le (r - s)^2 B;
\]
substituting
\[
r = \sqrt{\frac{\bar f_n(y_1|x_1,x_2)}{\phi(y_1|x_1,x_2,\theta)}}, \qquad
s = \sqrt{\frac{f_K(y_1|x_1,x_2)}{\phi(y_1|x_1,x_2,\theta)}}
\]
we obtain
\[
\int A_2\!\left(\frac{\bar f_n(y_1|x_1,x_2)}{\phi(y_1|x_1,x_2,\theta)}\right)\nabla_\theta\phi(y_1|x_1,x_2,\theta)\,dy_1 - \bar B_n
= \int A_1\!\left(\frac{f_K(y_1|x_1,x_2)}{\phi(y_1|x_1,x_2,\theta)}\right)
\frac{\nabla_\theta\phi(y_1|x_1,x_2,\theta)}{\phi(y_1|x_1,x_2,\theta)}
\left(\bar f_n(y_1|x_1,x_2) - f_K(y_1|x_1,x_2)\right)dy_1
\]
\[
\quad + B\int\left(\sqrt{\bar f_n(y_1|x_1,x_2)} - \sqrt{f_K(y_1|x_1,x_2)}\right)^2
\frac{\nabla_\theta\phi(y_1|x_1,x_2,\theta)}{\phi(y_1|x_1,x_2,\theta)}\,dy_1.
\]
The result now follows from Lemmas 6.1 and 6.2 with the additional centering of B̄n2 .
For the special case of Hellinger distance, we observe that
\[
\nabla_\theta D_n(\bar f_n,\phi|x_1,x_2,\theta) = \int \frac{\nabla_\theta\phi(y_1,x_1,x_2,\theta)}{\sqrt{\phi(y_1,x_1,x_2,\theta)}}\,
\sqrt{\bar f_n(y_1|x_1,x_2)}\,dy_1
\]
and the result is obtained from applying the identity
\[
\sqrt{a} - \sqrt{b} = \frac{a - b}{2\sqrt{a}} + \frac{\left(\sqrt{b} - \sqrt{a}\right)^2}{2\sqrt{a}}
\]
with $a = f_K(y_1|x_1,x_2)$ and $b = \bar f_n(y_1|x_1,x_2)$, along with Assumption N3.
7 Robustness Properties
An important motivator for the study of disparity methods is that in addition to providing statistical
efficiency as demonstrated above, they are also robust to contamination from outlying observations. Here we
investigate the robustness of our estimates through the tools of influence functions and breakdown points.
These have been studied for i.i.d. data in Beran (1977); Park and Basu (2004); Lindsay (1994) and the
extension to conditional models follows similar lines.
In particular, we examine two models for contamination:
1. To mimic the “homoscedastic” case, we contaminate $g(x_1,x_2,y_1,y_2)$ with outliers independent of $(x_1,x_2)$. That is, we define the contaminating density
\[
g_{\epsilon,z}(x_1,x_2,y_1,y_2) = (1-\epsilon)\,g(x_1,x_2,y_1,y_2) + \epsilon\,\delta_z(y_1,y_2)\,h(x_1,x_2)
\qquad (7.55)
\]
where $\delta_z$ is a contamination density parameterized by $z$. Typically, we think of $\delta_z$ as having small support centered around $z$. This results in the conditional density
\[
f_{\epsilon,z}(y_1,y_2|x_1,x_2) = (1-\epsilon)\,f(y_1,y_2|x_1,x_2) + \epsilon\,\delta_z(y_1,y_2)
\]
which we think of as the result of smoothing a contaminated residual density. Note that this contamination does not change the marginal distribution of $(x_1,x_2)$, and that it applies in particular to the case where only $y_1$ is present and the estimate (1.6-1.8) is employed.
2. In the more general setting, we set
\[
g_{\epsilon,z}(x_1,x_2,y_1,y_2) = (1-\epsilon)\,g(x_1,x_2,y_1,y_2) + \epsilon\,\delta_z(y_1,y_2)\,J_U(x_1,x_2)\,h(x_1,x_2)
\qquad (7.56)
\]
where $J_U(x_1,x_2)$ is the indicator of $(x_1,x_2)\in U$, scaled so that $h(x_1,x_2)J_U(x_1,x_2)$ is a distribution. This translates to the conditional density
\[
f_{\epsilon,z}(y_1,y_2|x_1,x_2) = \begin{cases}
f(y_1,y_2|x_1,x_2) & (x_1,x_2)\notin U \\
(1-\epsilon)\,f(y_1,y_2|x_1,x_2) + \epsilon\,\delta_z(y_1,y_2) & (x_1,x_2)\in U
\end{cases}
\]
which localizes contamination in covariate space. Note that the marginal distribution is now scaled differently in $U$.
Naturally, characterization (7.55) does not account for the effect of outliers on the Nadaraya-Watson estimator (1.6). If these are localized in covariate space, however, we can think of (1.8) as being approximately a mixture of the two cases above. As we will see, the distinction between these two will not affect the basic properties below. Throughout we will write $\delta_z(y_1,y_2|x_1,x_2)$ in place of $\delta_z(y_1,y_2)$ or $\delta_z(y_1,y_2)J_U(x_1,x_2)$ as appropriate, and $h(x_1,x_2)$ will be taken to be modified according to (7.56) where appropriate.
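As an informal illustration of contamination model (7.55) — a simulation sketch added here, not part of the formal development — the following Python fragment draws covariates, generates homoscedastic responses, and replaces a fraction $\epsilon$ of them with draws from a contamination density $\delta_z$ concentrated near $z$. The regression function, residual scale and contamination width are arbitrary placeholder choices.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def contaminated_sample(n, eps, z, tau=0.1):
    # Model (7.55): with probability 1 - eps the response follows the
    # homoscedastic model y = m(x) + e; with probability eps it is drawn
    # from delta_z, here approximated by a N(z, tau^2) density.
    # m(x) = sin(2*pi*x) is an arbitrary placeholder regression function.
    x = rng.uniform(0.0, 1.0, size=n)      # covariate; marginal h(x) unchanged
    e = rng.normal(0.0, 0.5, size=n)       # residual density f*(e)
    y = np.sin(2.0 * np.pi * x) + e        # uncontaminated responses
    outlier = rng.uniform(size=n) < eps    # contamination indicators
    y[outlier] = rng.normal(z, tau, size=int(outlier.sum()))
    return x, y

# Example: 10% contamination placed far from the bulk of the residuals.
x, y = contaminated_sample(n=500, eps=0.10, z=20.0)
print("observed contamination fraction:", float(np.mean(y > 10.0)))
\end{verbatim}
Contamination model (7.56) is obtained by applying the same replacement only to observations with $x$ in a chosen subset $U$.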
We must first place some conditions on δz :
D1 $\delta_z$ is orthogonal in the limit to $f$; in particular,
\[
\lim_{z\to\infty}\sum_{y_2\in S_y}\int \delta_z(y_1,y_2|x_1,x_2)\,f(y_1,y_2|x_1,x_2)\,dy_1 = 0, \quad \forall (x_1,x_2).
\]
D2 $\delta_z$ is orthogonal in the limit to $\phi$ (a concrete example is sketched after this list):
\[
\lim_{z\to\infty}\sum_{y_2\in S_y}\int \delta_z(y_1,y_2|x_1,x_2)\,\phi(y_1,y_2|x_1,x_2,\theta)\,dy_1 = 0, \quad \forall (x_1,x_2).
\]
D3 $\phi$ becomes orthogonal to $g$ for large $\theta$:
\[
\lim_{\|\theta\|\to\infty}\sum_{y_2\in S_y}\int f(y_1,y_2|x_1,x_2)\,\phi(y_1,y_2|x_1,x_2,\theta)\,dy_1 = 0, \quad \forall (x_1,x_2).
\]
D4 $C(-1)$ and $C'(\infty)$ are both finite, or the disparity is Hellinger distance.
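As an example of D1 and D2 (an illustration added here, not used in the proofs), suppose $d_y = 1$ with no discrete component and take $\delta_z(y_1) = \tau^{-1}K\!\left((y_1 - z)/\tau\right)$ for a fixed bounded density $K$ and fixed $\tau > 0$. If $f(\cdot|x_1,x_2)$ is bounded uniformly over $\mathcal{X}$ and $f(y_1|x_1,x_2)\to 0$ as $|y_1|\to\infty$ uniformly over $\mathcal{X}$, then
\[
\int \delta_z(y_1)\,f(y_1|x_1,x_2)\,dy_1
\le \sup_{|y_1|\ge z/2} f(y_1|x_1,x_2) + \sup_{y_1} f(y_1|x_1,x_2)\int_{|y_1 - z| > z/2}\delta_z(y_1)\,dy_1 \rightarrow 0
\]
as $z\to\infty$, verifying D1; D2 follows by the same argument whenever $\phi(\cdot|x_1,x_2,\theta)$ has the same boundedness and tail behaviour.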
In the following result we use $T[f] = \arg\min_{\theta\in\Theta} D_\infty(f,\theta)$ for any $f$ in place of our estimate $\hat\theta$.
Theorem 7.1. Under Assumptions D1-D4, under both contamination models (7.55) and (7.56), define $\epsilon^*$ to satisfy
\[
(1 - 2\epsilon^*)\,C'(\infty) = \inf_{\theta\in\Theta} D\!\left((1-\epsilon^*)f,\theta\right) - \lim_{z\to\infty}\inf_{\theta\in\Theta} D\!\left(\epsilon^*\delta_z,\theta\right)
\qquad (7.57)
\]
with $C'(\infty)$ replaced by 1 in the case of Hellinger distance. Then for $\epsilon < \epsilon^*$,
\[
\lim_{z\to\infty} T[f_{\epsilon,z}] = T[(1-\epsilon)f]
\]
and in particular the breakdown point is at least $\epsilon^*$:
\[
\sup_{z}\left\|T[f_{\epsilon,z}] - T[(1-\epsilon)f]\right\| < \infty.
\]
These results mimic Beran (1977) in demonstrating that the $\alpha$-level influence functions
\[
IF_\alpha(\delta_z) = \alpha^{-1}\left\|T[f_{\alpha,z}] - T[f]\right\|
\]
are bounded and tend to zero at infinity. The breakdown point is a natural consequence.
Proof. We begin by observing that by Assumption D1, for any fixed $\theta$,
\[
D(f_{\epsilon,z},\theta) = \int\!\!\int_{A_z(x)} C\!\left(\frac{f_{\epsilon,z}(y|x)}{\phi(y|x,\theta)} - 1\right)\phi(y|x,\theta)\,h(x)\,d\mu(y)\,d\nu(x)
+ \int\!\!\int_{A^c_z(x)} C\!\left(\frac{f_{\epsilon,z}(y|x)}{\phi(y|x,\theta)} - 1\right)\phi(y|x,\theta)\,h(x)\,d\mu(y)\,d\nu(x)
= D_{A_z}(f_{\epsilon,z},\theta) + D_{A^c_z}(f_{\epsilon,z},\theta)
\]
= DAz (f,z , θ) + DAcz (f,z , θ)
where Az (x) = {y : max(f (y|x), φ(y|x, θ)) > δz (y|x)}. We note that for any η with z sufficiently large that
sup
sup δz (y|x) < η and sup
sup f (y|x) < η
(x)∈X y∈Acz (x)
(x)∈X y∈Az (x)
and thus for sufficiently large $z$,
\[
\left|D(f_{\epsilon,z},\theta) - D_{A_z}((1-\epsilon)f,\theta) - D_{A^c_z}(\epsilon\delta_z,\theta)\right| \le |\eta|\,\sup_t|C'(t)|,
\]
hence
\[
\sup_\theta\left|D(f_{\epsilon,z},\theta) - D_{A_z}((1-\epsilon)f,\theta) - D_{A^c_z}(\epsilon\delta_z,\theta)\right| \rightarrow 0.
\qquad (7.58)
\]
We also observe that for any fixed $\theta$,
\[
D_{A^c_z}(\epsilon\delta_z,\theta) = \int\!\!\int_{A^c_z(x)} C\!\left(\frac{2\epsilon\delta_z(y|x)}{\phi(y|x,\theta)} - 1\right)\phi(y|x,\theta)\,h(x)\,d\mu(y)\,d\nu(x)
- \int\!\!\int_{A^c_z(x)} \epsilon\delta_z(y|x)\,C'(t(y,x))\,d\mu(y)\,d\nu(x)
\rightarrow \epsilon\,C'(\infty)
\]
for $t(y,x)$ between $\epsilon\delta_z(y|x)/\phi(y|x,\theta)$ and $2\epsilon\delta_z(y|x)/\phi(y|x,\theta)$, since $t(y,x)\to\infty$, $C(\cdot)$ and $C'(\cdot)$ are bounded and $\int_{A^c_z(x)}\phi(y|x,\theta)\,d\mu(y)\to 0$.
Similarly,
\[
D_{A_z}((1-\epsilon)f,\theta) \rightarrow D((1-\epsilon)f,\theta)
\]
and thus
\[
D(f_{\epsilon,z},\theta) \rightarrow D((1-\epsilon)f,\theta) + \epsilon\,C'(\infty),
\]
the left-hand side being minimized at $\theta = T[f_{\epsilon,z}]$.
It remains to rule out divergent sequences $\|\theta_z\|\to\infty$. In this case, we define $B_z(x) = \{y : f(y|x) > \max(\delta_z(y|x),\phi(y|x,\theta_z))\}$ and note that from the arguments above
\[
D_{B_z}((1-\epsilon)f,\theta_z) \rightarrow (1-\epsilon)\,C'(\infty)
\]
and
\[
D_{B^c_z}(\epsilon\delta_z,\theta_z) \rightarrow D(\epsilon\delta_z,\theta_z)
\]
and hence
\[
\lim_{z\to\infty} D(f_{\epsilon,z},\theta_z) > \lim_{z\to\infty}\inf_{\theta\in\Theta} D(\epsilon\delta_z,\theta) + (1-\epsilon)\,C'(\infty) > D(f_{\epsilon,z},T[(1-\epsilon)f])
\]
from (7.57), yielding a contradiction.
In the case of Hellinger distance, we observe
\[
\left|D(f_{\epsilon,z},\theta) - \left(D((1-\epsilon)f,\theta) + D(\epsilon\delta_z,\theta)\right)\right|
= 2\left|\int\!\!\int \phi^{\frac12}(y|x,\theta)\left(f^{\frac12}_{\epsilon,z}(y|x) - (1-\epsilon)^{\frac12}f^{\frac12}(y|x) - \epsilon^{\frac12}\delta^{\frac12}_z(y|x)\right)h(x)\,d\mu(y)\,d\nu(x)\right|
\]
\[
\le \int\!\!\int \left(f^{\frac12}_{\epsilon,z}(y|x) - (1-\epsilon)^{\frac12}f^{\frac12}(y|x) - \epsilon^{\frac12}\delta^{\frac12}_z(y|x)\right)^2 h(x)\,d\mu(y)\,d\nu(x)
\]
\[
= \int\!\!\int \left[2(1-\epsilon)f(y|x) + 2\epsilon\delta_z(y|x)\right]h(x)\,d\mu(y)\,d\nu(x)
- 2\int\!\!\int f^{\frac12}_{\epsilon,z}(y|x)\left((1-\epsilon)^{\frac12}f^{\frac12}(y|x) + \epsilon^{\frac12}\delta^{\frac12}_z(y|x)\right)h(x)\,d\mu(y)\,d\nu(x)
\]
where, by dividing the range of $y$ into $A_z(x)$ and $A^c_z(x)$ as above, we find that on $A_z(x)$, for any $\eta > 0$ and $z$ sufficiently large,
\[
\left|(1-\epsilon)f(y|x) - f^{\frac12}_{\epsilon,z}(y|x)\left((1-\epsilon)^{\frac12}f^{\frac12}(y|x) + \epsilon^{\frac12}\delta^{\frac12}_z(y|x)\right)\right|
\le \eta^{\frac12} f^{\frac12}_{\epsilon,z}(y|x) + \eta,
\]
which with the corresponding arguments on $A^c_z(x)$ yields (7.58). We further observe that for fixed $\theta$
\[
D(\epsilon\delta_z,\theta) = 1 + \epsilon - 2\sqrt{\epsilon}\int\!\!\int\sqrt{\delta_z(y|x)\,\phi(y|x,\theta)}\,h(x)\,d\mu(y)\,d\nu(x) \rightarrow 1 + \epsilon
\]
and for $\|\theta_z\|\to\infty$,
\[
D((1-\epsilon)f,\theta_z) = 2 - \epsilon - 2\sqrt{1-\epsilon}\int\!\!\int\sqrt{f(y|x)\,\phi(y|x,\theta_z)}\,h(x)\,d\mu(y)\,d\nu(x) \rightarrow 2 - \epsilon,
\]
from which the result follows from the same arguments as above.
These results extend Park and Basu (2004) and Beran (1977) in a number of ways, and a few remarks are warranted:
1. In Beran (1977), $\Theta$ is assumed to be compact, allowing $\theta_z$ to converge at least on a subsequence. This removes the $\|\theta_z\|\to\infty$ case and the result can be shown for $\epsilon\in[0,1)$.
2. We have not assumed that the uncontaminated density $f$ is a member of the parametric class $\phi_\theta$. If $f = \phi_{\theta_0}$ for some $\theta_0$, then we observe that by Jensen's inequality
\[
D((1-\epsilon)\phi_{\theta_0},\theta) \ge C(-\epsilon) = D((1-\epsilon)\phi_{\theta_0},\theta_0),
\]
hence $T[(1-\epsilon)\phi_{\theta_0}] = \theta_0$. We can further bound $D(\epsilon\delta_z,\theta) \ge C(\epsilon - 1)$, in which case (7.57) can be bounded by
\[
(1 - 2\epsilon^*)\,C'(\infty) \ge C(-\epsilon^*) - C(\epsilon^* - 1),
\]
which is satisfied for $\epsilon^* = 1/2$. We note that, more generally, if $(1-\epsilon)f$ is closer to the family $\phi_\theta$ than $\delta_z$ is at $\epsilon = 1/2$, the breakdown point will be greater than $1/2$; in the reverse situation it will be smaller. A worked Hellinger-distance version of this bound is sketched below.
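To make the preceding remark concrete in the Hellinger case (a calculation added here for illustration, using the squared-Hellinger form of $D$ employed in the proof of Theorem 7.1), take $f = \phi_{\theta_0}$ and $C'(\infty)$ replaced by 1. Then $\inf_\theta D((1-\epsilon)\phi_{\theta_0},\theta) = \left(1-\sqrt{1-\epsilon}\right)^2$, attained at $\theta = \theta_0$, while the Cauchy-Schwarz inequality gives $D(\epsilon\delta_z,\theta) \ge \left(1-\sqrt{\epsilon}\right)^2$ for every $\theta$ and $z$. Since
\[
(1 - 2\epsilon) - \left[\left(1-\sqrt{1-\epsilon}\right)^2 - \left(1-\sqrt{\epsilon}\right)^2\right] = 2\left(\sqrt{1-\epsilon} - \sqrt{\epsilon}\right) \ge 0 \quad\text{for } \epsilon \le 1/2,
\]
the defining relation (7.57) is consistent with a breakdown point of at least $1/2$ at the model, in agreement with the remark above.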
The influence function defined in Hampel (1974) as the limit of $\alpha$-level influence functions can be written, using the implicit function theorem, as
\[
IF_\infty(\delta_z) = H^D(\theta_0)^{-1} F(\delta_z)
\]
for
\[
F(\delta_z) = \int\!\!\int A_3\!\left(\frac{f(y|x)}{\phi(y|x,\theta_0)}\right)
\frac{\nabla_\theta\phi(y|x,\theta)}{\phi(y|x,\theta)}\left(\delta_z(y|x) - f(y|x)\right)h(x)\,d\mu(y)\,d\nu(x),
\]
which at $f = \phi_{\theta_0}$ becomes independent of $C$, reducing to
\[
IF_\infty(\delta_z) = I(\theta_0)^{-1}\int\!\!\int \frac{\nabla_\theta\phi(y|x,\theta_0)}{\phi(y|x,\theta_0)}\,\delta_z(y|x)\,h(x)\,d\mu(y)\,d\nu(x).
\]
This is the analogue of the result observed in Lindsay (1994) that the influence function does not differ across
different disparity models at the parametric family and corresponds to the observation in Beran (1977) that
this limit does not distinguish robust estimators in this case.
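As a purely numerical illustration of the boundedness of the $\alpha$-level influence functions above (a sketch added here; it drops covariates and uses a Gaussian location family, so the grid sizes, the contamination width and the parameter range are arbitrary choices rather than part of the theory), one can compute $T[f_{\alpha,z}]$ by minimizing the squared Hellinger distance on a grid:
\begin{verbatim}
import numpy as np

# Integration grid for y and a bounded parameter grid for theta.
y = np.linspace(-30.0, 80.0, 4001)
dy = y[1] - y[0]
thetas = np.linspace(-5.0, 5.0, 401)

def npdf(x, loc, scale):
    # Normal density, used both for the model and the contamination.
    return np.exp(-0.5 * ((x - loc) / scale) ** 2) / (scale * np.sqrt(2.0 * np.pi))

def hellinger_fit(density):
    # T[density]: the theta minimizing the squared Hellinger distance
    # between `density` (on the y grid) and the N(theta, 1) model.
    dists = [np.sum((np.sqrt(density) - np.sqrt(npdf(y, t, 1.0))) ** 2) * dy
             for t in thetas]
    return thetas[int(np.argmin(dists))]

def alpha_influence(alpha, z, tau=0.05):
    # IF_alpha(delta_z) with f = N(0,1) and delta_z approximated by N(z, tau^2).
    f = npdf(y, 0.0, 1.0)
    delta_z = npdf(y, z, tau)
    return abs(hellinger_fit((1 - alpha) * f + alpha * delta_z) - hellinger_fit(f)) / alpha

for z in [2.0, 5.0, 10.0, 50.0]:
    print(z, alpha_influence(alpha=0.05, z=z))
\end{verbatim}
As $z$ grows the contaminated fit returns to the uncontaminated one, so the printed values remain bounded and fall toward zero, mirroring the behaviour of $IF_\alpha(\delta_z)$ described above.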
References
Beran, R. (1977). Minimum Hellinger distance estimates for parametric models. Annals of Statistics 5,
445–463.
Cheng, A.-L. and A. N. Vidyashankar (2006). Minimum Hellinger distance estimation for randomized play
the winner design. Journal of Statistical Planning and Inference 136, 1875–1910.
Devroye, L. and G. Lugosi (2001). Combinatorial Methods in Density Estimation. New York: Springer.
Devroye, L. and L. Györfi (1985). Nonparametric Density Estimation: The L1 View. New York: Wiley.
Giné, E. and A. Guillou (2002). Rates of strong consistency for multivariate kernel density estimators. Ann.
Inst. H. Poincaré Probab. Statist. 38, 907–922.
Hampel, F. R. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association 69, 383–393.
Hansen, B. E. (2004). Nonparametric conditional density estimation.
Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Annals of Mathematical Statistics 19, 293–325.
Li, Q. and J. S. Racine (2007). Nonparametric Econometrics. Princeton: Princeton University Press.
Lindsay, B. G. (1994). Efficiency versus robustness: The case for minimum Hellinger distance and related
methods. Annals of Statistics 22, 1081–1114.
Nadaraya, E. A. (1989). Nonparametric Estimation of Probability Densities and Regression Curves. Dordrecht: Kluwer.
Pak, R. J. and A. Basu (1998). Minimum disparity estimation in linear regression models: Distribution and
efficiency. Annals of the Institute of Statistical Mathematics 50, 503–521.
Park, C. and A. Basu (2004). Minimum disparity estimation: Asymptotic normality and breakdown point
results. Bulletin of Informatics and Cybernetics 36.
Tamura, R. N. and D. D. Boos (1986). Minimum Hellinger distance estimation for multivariate location and covariance. Journal of the American Statistical Association 81, 223–229.
Weber, N. C. (1983). Central limit theorems for a class of symmetric statistics. Math. Proc. Camb. Phil. Soc. 94, 307–313.