
CONVERGENCE PROPERTIES OF AN EMPIRICAL ERROR CRITERION
FOR MULTIVARIATE DENSITY ESTIMATION
James Stephen Marron
University of North Carolina at Chapel Hill
AMS 1980 Subject Classifications: Primary 62G20; Secondary 62H12.

Key Words and Phrases: Nonparametric density estimation, kernel estimator, empirical error criterion, stochastic measures of accuracy.
ABSTRACT

For the purpose of comparing different nonparametric density estimators, Wegman (1972) introduced an empirical error criterion. In a recent paper by Hall (1982) it is shown that this empirical error criterion converges to the mean integrated square error. Here, in the case of kernel estimation, the results of Hall are improved in several ways; most notably, multivariate densities are treated and the range of allowable bandwidths is extended. The techniques used here are quite different from those of Hall, which demonstrates that the elegant Brownian Bridge approximation of Komlós, Major and Tusnády (1975) does not always give the strongest results possible.
1. INTRODUCTION
Consider the problem of estimating a probability density $f(x)$, defined on $\mathbb{R}^d$, using a sample of random vectors $X_1,\dots,X_n$ from $f$. If $\hat{f}(x) = \hat{f}(x, X_1,\dots,X_n)$ denotes some estimator of $f(x)$, a very popular means of measuring the accuracy of $\hat{f}$ is the Mean Integrated Square Error.
Given a nonnegative function $w(x)$, this error criterion is defined by

(1.1)    $\mathrm{MISE} = E \int [\hat{f}(x) - f(x)]^2\, w(x) f(x)\,dx.$

Writing the weight function in the form $w(x)f(x)$ is for notational convenience in the proofs of this paper. There is no loss of generality in this device because if a weight function $w^*(x)$ is desired, simply take $w(x) = w^*(x)f(x)^{-1}$.
In a survey paper, Wegman (1972) was interested in comparing the MISE of several different density estimators, in the case $d = 1$. Unfortunately, as Wegman pointed out, the computation of MISE can be quite tedious. Hence Wegman used the empirical error criterion

$$\widehat{\mathrm{MISE}} = n^{-1} \sum_{j=1}^{n} [\hat{f}(X_j) - f(X_j)]^2\, w(X_j),$$

and gave some heuristic justification.
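Wegman's criterion is straightforward to compute from a sample. The following is a minimal Python sketch; the function names, the vectorized-callable convention, and the default weight are illustrative assumptions, not part of the paper:

```python
import numpy as np

def empirical_mise(fhat, f, X, w=lambda x: np.ones(len(x))):
    """Wegman's empirical error criterion:
    n^{-1} * sum_j [fhat(X_j) - f(X_j)]^2 * w(X_j).

    Assumes `fhat`, `f`, and `w` are vectorized callables on an (n, d)
    array of sample points; `fhat` may be any density estimator.
    """
    X = np.atleast_2d(X)             # shape (n, d)
    resid = fhat(X) - f(X)           # estimation error at each sample point
    return np.mean(resid**2 * w(X))  # sample average of weighted squared errors
```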
Recently there has been some controversy regarding this practice. The difficulties have been essentially settled by Hall (1982), who has shown that Wegman's heuristics are valid if $\hat{f}(x)$ is either a kernel estimator or an orthogonal series estimator. Hall has shown that, in the case $d = 1$, under relatively mild conditions on $f$ and $w$, as $n \to \infty$,

$$\widehat{\mathrm{MISE}} = \mathrm{MISE} + o_P(\mathrm{MISE}).$$
In this paper, Hall's theorem 1 (dealing with kernel estimation) is extended in several ways. First, the assumptions made on $f$, $K$ and $w$ are substantially weaker here. Of more interest, the case of general dimension $d$ is treated here. Also of importance is the fact that the bandwidth of the kernel estimator satisfies much weaker restrictions than in Hall's theorem. This is vital to the results of Marron (1983), where an asymptotically efficient means of choosing the bandwidth is proposed.

It is interesting to note that the crude, "brute force" method of proof used in this paper gives stronger results than the elegant techniques employed by Hall.
2. ASSUMPTIONS AND STATEMENT OF THEOREM
Given a "bandwidth", h > 0, and a "kernel function", K, defined on lRd ,
the usual kernel estimator of f(x) is given by
"
(2.1)
=
f(x,h)
n
x-X.
1
K(~)
d
nh j=l
L
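For concreteness, here is a minimal Python sketch of (2.1). The function names and the choice of a Gaussian kernel are illustrative assumptions (the Gaussian kernel does satisfy the conditions (K.1)-(K.4) listed below):

```python
import numpy as np

def gaussian_kernel(u):
    """Standard multivariate normal density on R^d; one kernel satisfying (K.1)-(K.4)."""
    d = u.shape[-1]
    return np.exp(-0.5 * (u**2).sum(axis=-1)) / (2 * np.pi) ** (d / 2)

def kernel_estimate(x, X, h, K=gaussian_kernel):
    """Kernel estimator (2.1): f_hat(x, h) = (n h^d)^{-1} sum_j K((x - X_j) / h).

    `x` is a single point of shape (d,); `X` is the (n, d) data matrix.
    """
    n, d = X.shape
    return K((x - X) / h).sum() / (n * h**d)
```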
For the rest of this paper it is assumed that $K$ satisfies the following assumptions:

(K.1)    $\int K(x)\,dx = 1$,

(K.2)    $K$ is bounded,

(K.3)    $\int K(x)^2\,dx < \infty$,

(K.4)    $\limsup_{\|x\| \to \infty} \|x\|^d\,|K(x)| < \infty$,

where the integrals are taken over $\mathbb{R}^d$ and $\|\cdot\|$ denotes the usual Euclidean norm.
Assumption (K.1) is required to make $\hat{f}(x,h)$ integrate to 1. Assumptions (K.2), (K.3) and (K.4) are milder than those used in theorem 1 of Hall (1982).
It will also be assumed that the underlying density satisfies the following assumptions:

(f.1)    $f$ is bounded,

(f.2)    $f$ is uniformly continuous.

Assumption (f.1) is a consequence of (f.2), but it is required often enough in the upcoming proofs to be listed separately. Assumption (f.2) is much weaker than the bounded variation and differentiability assumptions used in theorem 1 of Hall.
Finally, it is assumed that

(w.1)    $w(x)$ is nonnegative,

(w.2)    $w(x)$ is bounded.

Hall deals only with the special case $w(x) = 1$.
Given sequences $\{a_n\}$ and $\{b_n\}$, it will be convenient to let the phrase "$h$ is between $a_n$ and $b_n$" mean that the sequence $h = h(n)$ satisfies

$$\lim_{n \to \infty} a_n h^{-1} = 0, \qquad \lim_{n \to \infty} h\, b_n^{-1} = 0.$$
The main theorem of this paper can now be stated.

Theorem 1: Under the assumptions (K.1)-(K.4), (f.1), (f.2), (w.1) and (w.2), if $\hat{f}$ is the kernel estimator of (2.1) and if $h$ is between $n^{-1/d}$ and 1, then as $n \to \infty$,

$$\widehat{\mathrm{MISE}} = \mathrm{MISE} + o_P(\mathrm{MISE}).$$
It is worth noting that in the case $d = 1$ the range of allowable $h$ is much larger than that of Hall (1982). This fact is vital to the results of Marron (1983).
3. PROOF OF THEOREM 1

First note that by the familiar variance and bias$^2$ decomposition (see Rosenblatt (1971) or (3.6) of Marron (1983)),

(3.1)    $\mathrm{MISE} = n^{-1}h^{-d}\left(\int f(x)^2 w(x)\,dx\right)\left(\int K(u)^2\,du\right) + o(n^{-1}h^{-d}) + \int\left[\int K(u)f(x-hu)\,du - f(x)\right]^2 w(x)f(x)\,dx.$
It will be convenient to let the bias$^2$ part of this expansion be denoted by

(3.2)    $s_f(h) = \int\left[\int K(u)f(x-hu)\,du - f(x)\right]^2 w(x)f(x)\,dx.$

Note that by (f.2),

(3.3)    $\lim_{h \to 0} s_f(h) = 0.$

It is seen in Marron (1983) that the rate of convergence of $s_f(h)$ to 0 provides a measure of the "smoothness" of $f$.
Next, for $j = 1,\dots,n$, it will be useful to define the "leave one out" kernel estimator:

(3.4)    $\hat{f}_j(x,h) = \frac{1}{(n-1)h^d} \sum_{i \neq j} K\!\left(\frac{x - X_i}{h}\right).$
A quantity which is more tractable than $\widehat{\mathrm{MISE}}$ is:

(3.5)    $\widetilde{\mathrm{MISE}} = n^{-1} \sum_{j=1}^{n} [\hat{f}_j(X_j,h) - f(X_j)]^2\, w(X_j).$
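In code, (3.5) differs from Wegman's criterion only in dropping the $j$-th observation from the estimate evaluated at $X_j$. A minimal sketch, reusing the conventions of the earlier snippets (the names are again illustrative assumptions):

```python
import numpy as np

def loo_kernel_estimate(j, X, h, K):
    """Leave-one-out estimator (3.4) evaluated at the point X_j."""
    n, d = X.shape
    Xi = np.delete(X, j, axis=0)  # drop the j-th observation
    return K((X[j] - Xi) / h).sum() / ((n - 1) * h**d)

def mise_tilde(X, h, K, f, w=lambda x: 1.0):
    """The more tractable criterion (3.5); `f` and `w` act on single points."""
    n = X.shape[0]
    resid2 = np.array(
        [(loo_kernel_estimate(j, X, h, K) - f(X[j])) ** 2 * w(X[j]) for j in range(n)]
    )
    return resid2.mean()
```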
Note that, for $j = 1,\dots,n$,

$$\hat{f}_j(x,h) - \hat{f}(x,h) = \frac{1}{n(n-1)h^d} \sum_{i \neq j} K\!\left(\frac{x - X_i}{h}\right) - \frac{1}{nh^d}\, K\!\left(\frac{x - X_j}{h}\right).$$

Hence, by (K.2), for $h$ between $n^{-1/d}$ and 1,

(3.6)    $\sup_{j=1,\dots,n}\ \sup_x\ |\hat{f}_j(x,h) - \hat{f}(x,h)| = O(n^{-1}h^{-d}).$
Theorem 1 is now a consequence of the following.
Theorem 2: Under the conditions of theorem 1,

$$\widetilde{\mathrm{MISE}} = \mathrm{MISE} + o_P(\mathrm{MISE}).$$

4. PROOF OF THEOREM 2

First, for $j = 1,\dots,n$, define

(4.1)    $U_j = \frac{[\hat{f}_j(X_j,h) - f(X_j)]^2\, w(X_j)}{\mathrm{MISE}}.$

Note that by (1.1), (3.1) and (3.6), for $h$ between $n^{-1/d}$ and 1,

(4.2)    $E\,U_j = \mathrm{MISE}^{-1} \int E[\hat{f}_j(x,h) - f(x)]^2 w(x)f(x)\,dx = \mathrm{MISE}^{-1} \int E[\hat{f}(x,h) - f(x)]^2 w(x)f(x)\,dx + O(n^{-1}h^{-d}) = 1 + o(1).$
The proof of Theorem 2 will be complete when a weak law of large numbers is established for the $U_j$.
For $i \neq j = 1,\dots,n$, define

(4.3)    $V_{ij} = \left[h^{-d} K\!\left(\frac{X_j - X_i}{h}\right) - f(X_j)\right] w(X_j)^{1/2}.$
It follows from (3.4) that, for $j = 1,\dots,n$,

$$[\hat{f}_j(X_j,h) - f(X_j)]^2\, w(X_j) = \left[(n-1)^{-1} \sum_{i \neq j} V_{ij}\right]^2.$$
Hence,

(4.4)    $\mathrm{var}(U_j) = (n-1)^{-4}\,\mathrm{MISE}^{-2}\,\mathrm{var}\left[\left(\sum_{i \neq j} V_{ij}\right)^2\right].$

Now if this last variance is expanded in terms of variances and covariances, there will be $(n-1)^4$ terms of the following types (where $i$, $i'$, $i''$, $i^*$, $j$ are all different):
    # of Terms      General Term
    O(n)            var(V_{ij}^2)
    O(n^2)          var(V_{ij} V_{i'j})
    O(n^2)          cov(V_{ij}^2, V_{i'j}^2)
    O(n^2)          cov(V_{ij}^2, V_{ij} V_{i'j})
    O(n^3)          cov(V_{ij}^2, V_{i'j} V_{i''j})
    O(n^3)          cov(V_{ij} V_{i'j}, V_{ij} V_{i''j})
    O(n^4)          cov(V_{ij} V_{i'j}, V_{i''j} V_{i*j})
Bounds will now be computed for each general term. Using (K.2), (K.3), (f.1), (w.2) and (4.3), as $h \to 0$,

(4.5)    $\mathrm{var}(V_{ij}^2) \le E\,V_{ij}^4 = \iint \left[h^{-3d}K(u)^4 - 4h^{-2d}K(u)^3 f(x) + 6h^{-d}K(u)^2 f(x)^2 - 4K(u)f(x)^3 + h^d f(x)^4\right] w(x)^2 f(x) f(x-hu)\,du\,dx = O(h^{-3d}).$

Similarly,

(4.6)    $\mathrm{var}(V_{ij}V_{i'j}) \le E\,V_{ij}^2 V_{i'j}^2 = \iiint \left[h^{-d}K\!\left(\frac{x-y}{h}\right) - f(x)\right]^2 \left[h^{-d}K\!\left(\frac{x-z}{h}\right) - f(x)\right]^2 w(x)^2 f(x)f(y)f(z)\,dx\,dy\,dz$
$= \int\left[\int\left[h^{-d}K(u)^2 - 2K(u)f(x) + h^d f(x)^2\right] f(x-hu)\,du\right]^2 w(x)^2 f(x)\,dx = O(h^{-2d}).$

Very similar computations show that

(4.7)    $E\,V_{ij}^2 = O(h^{-d}).$
It follows from the above that

$$\mathrm{cov}(V_{ij}^2,\, V_{ij} V_{i'j}) = O(h^{-2d}).$$

In the same way it is easily seen that

$$\mathrm{cov}(V_{ij}^2,\, V_{i'j} V_{i''j}) = O(h^{-d}), \qquad \mathrm{cov}(V_{ij} V_{i'j},\, V_{ij} V_{i''j}) = O(h^{-d}).$$
To bound the final term, note that

(4.8)    $E[V_{ij} \mid X_j] = \int \left[h^{-d}K\!\left(\frac{X_j - y}{h}\right) - f(X_j)\right] w(X_j)^{1/2}\, f(y)\,dy.$

Hence, by (3.2),

(4.9)    $E[V_{ij}V_{i'j}] = E\!\left(E[V_{ij} \mid X_j]^2\right) = \int\left[\int K(u)f(x-hu)\,du - f(x)\right]^2 w(x)f(x)\,dx = s_f(h).$
As another consequence of (4.8), note that, by (K.2), (f.1) and (w.2),

(4.10)    $\sup_x \left|E[V_{ij} \mid X_j = x]\right| = \sup_x \left|w(x)^{1/2}\left[\int K(u)f(x-hu)\,du - f(x)\right]\right| < \infty.$
It follows from the above that

$$E\!\left(E[V_{ij} \mid X_j]^4\right) \le \sup_x E[V_{ij} \mid X_j = x]^2\; E\!\left(E[V_{ij} \mid X_j]^2\right) = O(s_f(h)).$$

Hence,

$$\mathrm{cov}(V_{ij} V_{i'j},\, V_{i''j} V_{i^*j}) = O(s_f(h)).$$
Looking back to (4.4), it is now apparent that, for $h$ between $n^{-1/d}$ and 1,

$$\mathrm{var}(U_j) = \frac{O(n^{-1}h^{-d}) + O(s_f(h))}{\mathrm{MISE}^2}.$$

Hence, by (3.1) and (3.2), there is a constant $C$ so that, for $h$ between $n^{-1/d}$ and 1,

(4.11)    $n^{-1}\,\mathrm{var}(U_j) \le \frac{O(n^{-1}h^{-d}) + O(s_f(h))}{n\left[Cn^{-1}h^{-d} + s_f(h) + o(n^{-1}h^{-d})\right]^2} \le \frac{O(n^{-1}h^{-d})}{C^2 n^{-1}h^{-2d}} + \frac{O(s_f(h))}{2Ch^{-d}s_f(h)} = O(h^d) \to 0.$
Next, for $j \neq j'$, a similar bound will be computed for

(4.12)    $\mathrm{cov}(U_j, U_{j'}) = (n-1)^{-4}\,\mathrm{MISE}^{-2}\,\mathrm{cov}\!\left(\left[\sum_{i \neq j} V_{ij}\right]^2, \left[\sum_{i \neq j'} V_{ij'}\right]^2\right).$

Once again, by the linearity of covariance, the right hand covariance in (4.12) may be written as the sum of $(n-1)^4$ terms of the form (again assume $i$, $i'$, $i''$, $i^*$, $j$ and $j'$ are all different):
    # of Terms      General Term
    O(1)            cov(V_{j'j}^2, V_{jj'}^2)
    O(n)            cov(V_{j'j}^2, V_{ij'}^2)
    O(n)            cov(V_{ij}^2, V_{ij'}^2)
    O(n^2)          cov(V_{ij}^2, V_{i'j'}^2)
    O(n)            cov(V_{j'j}^2, V_{jj'} V_{ij'})
    O(n^2)          cov(V_{j'j}^2, V_{ij'} V_{i'j'})
    O(n)            cov(V_{ij}^2, V_{ij'} V_{jj'})
    O(n^2)          cov(V_{ij}^2, V_{ij'} V_{i'j'})
    O(n^2)          cov(V_{ij}^2, V_{jj'} V_{i'j'})
    O(n^3)          cov(V_{ij}^2, V_{i'j'} V_{i''j'})
    O(n)            cov(V_{j'j} V_{ij}, V_{jj'} V_{ij'})
    O(n^2)          cov(V_{j'j} V_{ij}, V_{jj'} V_{i'j'})
    O(n^2)          cov(V_{j'j} V_{ij}, V_{ij'} V_{i'j'})
    O(n^3)          cov(V_{j'j} V_{ij}, V_{i'j'} V_{i''j'})
    O(n^2)          cov(V_{ij} V_{i'j}, V_{ij'} V_{i'j'})
    O(n^3)          cov(V_{ij} V_{i'j}, V_{ij'} V_{i''j'})
    O(n^4)          cov(V_{ij} V_{i'j}, V_{i''j'} V_{i*j'})
These general terms will now be bounded in order of increasing difficulty. By the independence of $X_1,\dots,X_n$, the terms whose two factors involve disjoint sets of observations vanish:

$$\mathrm{cov}(V_{ij}^2, V_{i'j'}^2) = \mathrm{cov}(V_{ij}^2, V_{i'j'} V_{i''j'}) = \mathrm{cov}(V_{ij} V_{i'j}, V_{i''j'} V_{i^*j'}) = 0.$$

Using the Schwartz Inequality, (4.5) and (4.6), each term which appears $O(n)$ times may be bounded by $O(h^{-3d})$.
Using (4.7) and (4.10),

$$E[V_{j'j}^2 V_{ij'} V_{i'j'}] = E\!\left(V_{j'j}^2\, E[V_{ij'} \mid X_j, X_{j'}]\, E[V_{i'j'} \mid X_j, X_{j'}]\right) \le \sup_x E[V_{ij} \mid X_j = x]^2\; E(V_{j'j}^2) = O(h^{-d}).$$

Thus, by (3.3) and (4.9),

$$\mathrm{cov}(V_{j'j}^2,\, V_{ij'} V_{i'j'}) = O(h^{-d}).$$
Similarly, again using the Schwartz Inequality,

$$E(V_{j'j} V_{ij} V_{jj'} V_{i'j'}) = E\!\left(V_{j'j} V_{jj'}\, E[V_{ij} \mid X_j, X_{j'}]\, E[V_{i'j'} \mid X_j, X_{j'}]\right) \le \sup_x E[V_{ij} \mid X_j = x]^2\; E(V_{j'j} V_{jj'}) = O(h^{-d}),$$

from which it follows that

$$\mathrm{cov}(V_{j'j} V_{ij},\, V_{jj'} V_{i'j'}) = O(h^{-d}).$$

By the above techniques it is easy to see that

$$\mathrm{cov}(V_{j'j} V_{ij},\, V_{i'j'} V_{i''j'}) = O(s_f(h)).$$
The remaining terms will be a little more difficult. By (4.3),

$$E[V_{ij}^2 V_{ij'} V_{i'j'}] = E\!\left(V_{ij}^2 V_{ij'}\, E[V_{i'j'} \mid X_i, X_j, X_{j'}]\right)$$
$$= \iiint \left[h^{-d}K\!\left(\frac{y-x}{h}\right) - f(y)\right]^2 \left[h^{-d}K\!\left(\frac{z-x}{h}\right) - f(z)\right] w(y)\, w(z)^{1/2}\, E[V_{ij} \mid X_j = z]\, f(x)f(y)f(z)\,dx\,dy\,dz$$
$$= \iiint h^{-d}\left[K(u) - h^d f(x+hu)\right]^2 \left[K(v) - h^d f(x+hv)\right] w(x+hu)\, w(x+hv)^{1/2}\, E[V_{ij} \mid X_j = x+hv]\, f(x)f(x+hu)f(x+hv)\,dx\,du\,dv = O(h^{-d}),$$

and hence,

$$\mathrm{cov}(V_{ij}^2,\, V_{ij'} V_{i'j'}) = O(h^{-d}).$$
Similarly,

$$E[V_{ij}^2 V_{jj'} V_{i'j'}] = E\!\left(V_{ij}^2 V_{jj'}\, E[V_{i'j'} \mid X_i, X_j, X_{j'}]\right) = \iiint \left[h^{-d}K\!\left(\frac{y-x}{h}\right) - f(y)\right]^2 \left[h^{-d}K\!\left(\frac{z-y}{h}\right) - f(z)\right] w(y)\, w(z)^{1/2}\, E[V_{ij} \mid X_j = z]\, f(x)f(y)f(z)\,dx\,dy\,dz = O(h^{-d}),$$

and so,

$$\mathrm{cov}(V_{ij}^2,\, V_{jj'} V_{i'j'}) = O(h^{-d}).$$
By the same technique,

$$E[V_{j'j} V_{ij} V_{ij'} V_{i'j'}] = E\!\left(V_{j'j} V_{ij} V_{ij'}\, E[V_{i'j'} \mid X_j, X_{j'}, X_i]\right)$$
$$= \iiint \left[h^{-d}K\!\left(\frac{y-x}{h}\right) - f(y)\right]\left[h^{-d}K\!\left(\frac{y-z}{h}\right) - f(y)\right]\left[h^{-d}K\!\left(\frac{x-z}{h}\right) - f(x)\right] w(y)\, w(x)^{1/2}\, E[V_{ij} \mid X_j = x]\, f(x)f(y)f(z)\,dx\,dy\,dz = O(h^{-d}),$$

and hence,

$$\mathrm{cov}(V_{j'j} V_{ij},\, V_{ij'} V_{i'j'}) = O(h^{-d}).$$
The most difficult terms have been saved for last. It will be convenient to define the sets

$$A = \{(x,y): \|x-y\| > 2h^{1/2}\}, \qquad A^c = \mathbb{R}^d \times \mathbb{R}^d \setminus A.$$

Note that

(4.13)    $E(V_{ij} V_{i'j} V_{ij'} V_{i'j'}) = E\!\left(E[V_{ij} V_{ij'} \mid X_j, X_{j'}]^2\right) = \iint \left\{\int \left[h^{-d}K\!\left(\frac{x-z}{h}\right) - f(x)\right]\left[h^{-d}K\!\left(\frac{y-z}{h}\right) - f(y)\right] f(z)\,dz\right\}^2 w(x)w(y)f(x)f(y)\,dx\,dy = \mathrm{I} + \mathrm{II},$

where

$$\mathrm{I} = \iint_A \{\cdot\}^2\, w(x)w(y)f(x)f(y)\,dx\,dy, \qquad \mathrm{II} = \iint_{A^c} \{\cdot\}^2\, w(x)w(y)f(x)f(y)\,dx\,dy.$$

Note that, uniformly over $x$, $y$,

$$\int \left[h^{-d}K\!\left(\frac{x-z}{h}\right) - f(x)\right]\left[h^{-d}K\!\left(\frac{y-z}{h}\right) - f(y)\right] f(z)\,dz = O(h^{-d}).$$

Thus, since for each $x \in \mathbb{R}^d$ the set $\{y: (x,y) \in A^c\}$ has Lebesgue measure $O(h^{d/2})$,

(4.14)    $\mathrm{II} = O(h^{-3d/2}).$
Now given $(x,y) \in A$, note that the sets

$$A_1 = \{z: \|x-z\| < h^{1/2}\}, \qquad A_2 = \{z: \|y-z\| < h^{1/2}\}$$

are disjoint. Let $A_3 = \mathbb{R}^d \setminus (A_1 \cup A_2)$. Define $\mathrm{I}_1$, $\mathrm{I}_2$ and $\mathrm{I}_3$ by

$$\mathrm{I} = \iint_A \left[\int_{A_1} dz + \int_{A_2} dz + \int_{A_3} dz\right]^2 w(x)w(y)f(x)f(y)\,dx\,dy = \iint_A \left[\mathrm{I}_1 + \mathrm{I}_2 + \mathrm{I}_3\right]^2 w(x)w(y)f(x)f(y)\,dx\,dy.$$

Using (K.4), uniformly over $(x,y) \in A$ and $z \in A_1$, $h^{-d}|K((y-z)/h)| = O(h^{-d/2})$. Hence,

$$\sup_{(x,y) \in A} |\mathrm{I}_1| = \sup_{(x,y) \in A} \left|\int_{A_1} \left[h^{-d}K\!\left(\frac{x-z}{h}\right) - f(x)\right]\left[h^{-d}K\!\left(\frac{y-z}{h}\right) - f(y)\right] f(z)\,dz\right| = O(h^{\frac{1}{2}-d}).$$

Similarly,

$$\sup_A |\mathrm{I}_2| = O(h^{\frac{1}{2}-d}), \qquad \sup_A |\mathrm{I}_3| = O(h^{\frac{1}{2}-d}).$$

It follows from the above that

$$\mathrm{I} = O(h^{1-2d}).$$

Now from (4.13) and (4.14),

$$\mathrm{cov}(V_{ij} V_{i'j},\, V_{ij'} V_{i'j'}) = O(h^{-3d/2}) + O(h^{1-2d}) = o(h^{-2d}).$$
The above computations will prove useful in bounding the last term. Using the Schwartz Inequality and the independence of $X_j$ and $X_{j'}$,

$$|E(V_{ij} V_{i'j} V_{ij'} V_{i''j'})| = \left|E\!\left(E[V_{ij} V_{ij'} \mid X_j, X_{j'}]\, E[V_{i'j} \mid X_j]\, E[V_{i''j'} \mid X_{j'}]\right)\right|$$
$$\le \left[E\!\left(E[V_{ij} V_{ij'} \mid X_j, X_{j'}]^2\right)\right]^{1/2} \left[E\!\left(E[V_{i'j} \mid X_j]^2\, E[V_{i''j'} \mid X_{j'}]^2\right)\right]^{1/2}.$$

Note that the first factor on the right side appears in (4.13). Thus the computations following (4.13), together with (4.9), imply

$$\mathrm{cov}(V_{ij} V_{i'j},\, V_{ij'} V_{i''j'}) = o(h^{-d} s_f(h)).$$
Now looking back to (4.12), it is apparent that, for $h$ between $n^{-1/d}$ and 1,

$$\mathrm{cov}(U_j, U_{j'}) = \frac{o(n^{-2}h^{-2d}) + o(n^{-1}h^{-d}s_f(h))}{\mathrm{MISE}^2}.$$

Thus, by computations similar to (4.11), for $h$ between $n^{-1/d}$ and 1,

$$\mathrm{cov}(U_j, U_{j'}) \to 0.$$

It follows from this, together with (4.11), that for $h$ between $n^{-1/d}$ and 1,

$$\mathrm{var}\!\left(n^{-1}\sum_{j=1}^{n} U_j\right) \to 0,$$

and hence, by the Chebychev Inequality,

$$n^{-1}\sum_{j=1}^{n} U_j \to E(U_j)$$

in probability. Theorem 2 is an easy consequence of this, together with (3.5), (4.1) and (4.2).
ACKNOWLEDGEMENT

The author is grateful to David Ruppert for recommending the "brute force" techniques employed in this paper.
REFERENCES

Hall, P. (1982). Limit theorems for stochastic measures of the accuracy of nonparametric density estimators. Stoch. Processes Appl. 13, 11-25.

Komlós, J., Major, P. and Tusnády, G. (1975). An approximation of partial sums of independent random variables, and the sample distribution function. Z. Wahrsch. Verw. Geb. 32, 111-131.

Marron, J.S. (1983). An asymptotically efficient solution to the bandwidth problem of kernel density estimation. North Carolina Institute of Statistics, Mimeo Series #1518.

Rosenblatt, M. (1971). Curve estimates. Ann. Math. Statist. 42, 1815-1842.

Wegman, E.J. (1972). Nonparametric density estimation. II. A comparison of density estimation methods. J. Statist. Comput. Simul. 1, 225-245.