Härdle, Wolfgang and Marron, James Stephen, "Optimal Bandwidth Selection in Nonparametric Regression Function Estimation."

OPTIMAL BANDWIDTH SELECTION IN NONPARAMETRIC REGRESSION FUNCTION ESTIMATION

Wolfgang Härdle
Universität Heidelberg and University of North Carolina at Chapel Hill

James Stephen Marron
University of North Carolina at Chapel Hill
Keywords and Phrases: nonparametric regression estimation, kernel estimators, optimal bandwidth, smoothing parameter, cross-validation

AMS 1980 subject classifications: Primary 62G05, Secondary 62G20
ABSTRACT

In the setting of nonparametric regression estimation where the independent variables are random, kernel estimators are considered. It is seen that a certain cross-validated choice of the bandwidth is asymptotically equivalent to the bandwidth which minimizes a version of the Mean Integrated Square Error. Since no precise assumptions are made on the amount of "smoothness" of the unknown regression function, the estimators of this paper settle an open problem raised by Stone (1982).
1. INTRODUCTION

This paper presents a solution to the univariate version of an open problem raised in the Special Invited Paper (to the Annals of Statistics) of Stone (1982) (see his Question 3). In fact it will be seen that the results of this paper reach somewhat deeper than the level of Stone's question.
The setting is that of nonparametric regression estimation. Let $(X,Y)$, $(X_1,Y_1)$, $(X_2,Y_2), \ldots$ be independent random vectors with a common joint density function, $f_{X,Y}(x,y)$. Let $f(x)$ be the marginal density of $X$. Denote the regression curve of $Y$ on $X$ by
  $m(x) = E[Y \mid X = x] = \int y\, f_{X,Y}(x,y)\,dy \,/\, f(x)$.
The results of Stone (1982) may be interpreted, in the present setting, as follows. If very precise "smoothness" assumptions are made on $m(x)$, then there is an estimator of $m(x)$, depending on the "smoothness" of $m$, which optimizes (in a minimax sense, as $n \to \infty$) the exponent of the algebraic rate of convergence of an $L_2$ error criterion. Stone says such an estimator "achieves the optimal rate of convergence." In Question 3, Stone asks if there exists a single estimator which achieves the optimal rate of convergence uniformly over a certain continuum of different smoothness classes.
In this paper not only is an affirmative answer to this question provided, but in fact the results presented here go somewhat further. This is because not only is the exponent of algebraic convergence optimized, but in fact the constant coefficient is in some sense optimized as well.
The results of this paper use kernel estimators, which are defined as follows. Given a positive integer $n$, a "kernel function" $K(x)$, and a "bandwidth" $h > 0$, define, for $i = 1, \ldots, n$, the kernel weights
  $a_i(x) = n^{-1} h^{-1} K\!\left(\dfrac{x - X_i}{h}\right)$.
Then $m(x)$ is estimated by the following weighted average of $Y_1, \ldots, Y_n$, as proposed by Nadaraya (1964) and Watson (1964):
  $m^*(x) = \sum_{i=1}^n a_i(x) Y_i \,/\, \hat f(x)$,
where $\hat f(x)$ is the familiar Rosenblatt-Parzen estimator of $f(x)$ given by
  $\hat f(x) = \sum_{i=1}^n a_i(x)$.
In the case where the marginal density $f(x)$ is known, another reasonable estimate, studied by Johnston (1982), is given by
  $\hat m(x)/f(x) = \sum_{i=1}^n a_i(x) Y_i \,/\, f(x)$.
It should be noted that this estimator has asymptotic behavior which is, in general, slightly inferior to that of $m^*$. It is studied here because the nonrandom denominator makes it more tractable.
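To make these definitions concrete, here is a minimal numerical sketch of the kernel weights $a_i(x)$, the Rosenblatt-Parzen estimate $\hat f$, and the two estimators. The quartic kernel below is our own illustrative choice, not one prescribed by the paper.

```python
import numpy as np

def kernel_weights(x, X, h):
    """Kernel weights a_i(x) = n^-1 h^-1 K((x - X_i)/h); the quartic kernel
    here is an illustrative choice with compact support (cf. (A.6))."""
    n = len(X)
    u = (x - X) / h
    K = np.where(np.abs(u) <= 1, 15 / 16 * (1 - u ** 2) ** 2, 0.0)
    return K / (n * h)

def m_star(x, X, Y, h):
    """Nadaraya-Watson estimator m*(x), with the Rosenblatt-Parzen
    density estimate f_hat(x) = sum_i a_i(x) in the denominator."""
    a = kernel_weights(x, X, h)
    return a @ Y / a.sum()

def m_known_f(x, X, Y, h, f):
    """Johnston's variant m_hat(x)/f(x) for known marginal density f."""
    a = kernel_weights(x, X, h)
    return a @ Y / f(x)
```

For instance, `m_star(0.5, X, Y, 0.1)` evaluates the estimate at $x = 0.5$ with bandwidth $h = 0.1$.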
Given any estimator $m_n(x)$, a popular means of assessing the performance of $m_n(x)$ is the Mean Integrated Square Error, defined by
  $\mathrm{MISE} = E \int [m_n(x) - m(x)]^2 w(x)\,dx$,
where $w(x)$ is a nonnegative "weight function".
In the case $m_n = \hat m/f$, MISE may be easily analyzed by a variance/bias$^2$ decomposition. First define
(1.1)  $S(x) = E(Y^2 \mid X = x)$,
and assume $S$ and $f$ are uniformly continuous. Now, by straightforward computations very similar to those of Rosenblatt (1971), as $n \to \infty$, with $h = h(n) \to 0$,
(1.2)  $E\hat m(x) = \int K(u)\, m(x - hu)\, f(x - hu)\,du$,
       $\operatorname{var} \hat m(x) = n^{-1} h^{-1} S(x) f(x) \int K(u)^2\,du + o(n^{-1}h^{-1})$.
Hence,
(1.3)  $\mathrm{MISE} = E \int [\hat m(x)/f(x) - m(x)]^2 w(x)\,dx = n^{-1}h^{-1} \left(\int S(x) f(x)^{-1} w(x)\,dx\right) \int K(u)^2\,du + s_1(h) + o(n^{-1}h^{-1})$,
where the bias$^2$ contribution has been denoted
(1.4)  $s_1(h) = \int \left[\int K(u)\, m(x - hu)\, f(x - hu)\,du - m(x) f(x)\right]^2 f(x)^{-2} w(x)\,dx$.
Many authors have dealt with quantities similar to $s_1(h)$ by approximations which arise from assuming that $K$ has some vanishing moments and that $m$ and $f$ have a Taylor expansion. This technique is inadequate for the results of this paper because it gives only upper bounds on the bias$^2$ part of the MISE. The advantage of $s_1(h)$ is that it measures precisely the rate of convergence of the bias$^2$. Hence, it is apparent that $s_1(h)$ provides a measure of the quantity called "smoothness" which is perhaps superior to that of Stone (1982) and previous authors.
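To see how $s_1(h)$ behaves for a concrete model, (1.4) can be evaluated by direct numerical integration. The regression function, design density and Epanechnikov kernel below are hypothetical choices, made only so the sketch is self-contained; the weight $w = 1_{[0,1]} f$ anticipates the theorems of section 3.

```python
import numpy as np

def s1(h, m, f, w, K, u_grid, x_grid):
    """Numerically evaluate the bias^2 term (1.4):
    s1(h) = int [ int K(u) m(x-hu) f(x-hu) du - m(x) f(x) ]^2 f(x)^-2 w(x) dx."""
    du = u_grid[1] - u_grid[0]
    dx = x_grid[1] - x_grid[0]
    total = 0.0
    for x in x_grid:
        inner = np.sum(K(u_grid) * m(x - h * u_grid) * f(x - h * u_grid)) * du
        total += (inner - m(x) * f(x)) ** 2 / f(x) ** 2 * w(x) * dx
    return total

# Hypothetical model: X uniform on (-1, 2), m(x) = sin(2 pi x),
# Epanechnikov kernel, weight w = 1_[0,1] * f.
K = lambda u: np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)
m = lambda x: np.sin(2 * np.pi * x)
f = lambda x: np.where((x > -1) & (x < 2), 1 / 3, 0.0)
w = lambda x: f(x) if 0 <= x <= 1 else 0.0

u_grid = np.linspace(-1, 1, 201)
x_grid = np.linspace(0, 1, 201)
for h in (0.05, 0.1, 0.2):
    print(h, s1(h, m, f, w, K, u_grid, x_grid))  # shrinks as h -> 0
```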
In the case $m_n = m^*$, some modification of MISE is required because the moments of $m^*$ need not exist (see Rosenblatt (1969)). Since the results of this paper can be more easily described in terms of the estimator $\hat m/f$, the more difficult case of $m^*$ will be discussed in Section 2.
From (1.3) it is clear that if the bandwidth $h$ is chosen deterministically to asymptotically minimize MISE (for the estimator $\hat m/f$) then use must be made of functionals of the unknown $m(x)$. The theorems of this paper show that this difficulty may be overcome by using the data to specify $h$ through cross-validation. This technique was introduced in the setting of regression function estimation using splines by Wahba and Wold (1975). The idea is to try to choose $h$ to make $\hat m(X)/f(X)$ (or $m^*(X)$) an effective predictor of $Y$.
This is accomplished as follows. First, for $j = 1, \ldots, n$, define the "leave-one-out" estimators
(1.5)  $\hat m_j(x) = \sum_{i \ne j} a_i(x) Y_i$,  $\hat m^*_j(x) = \hat m_j(x) / \hat f(x)$.
Then form the estimated Residual Sums of Squares
  $\widehat{RSS} = n^{-1} \sum_{j=1}^n [Y_j - \hat m_j(X_j)/f(X_j)]^2$,
  $\widehat{RSS}{}^* = n^{-1} \sum_{j=1}^n [Y_j - \hat m^*_j(X_j)]^2$,
and take $\hat h$ (or $\hat h^*$) to minimize $\widehat{RSS}$ ($\widehat{RSS}{}^*$ respectively).
The reason for employing the leave-one-out estimators is that otherwise $\widehat{RSS}{}^*$ is trivially minimized at $h = 0$, and $\widehat{RSS}$ will yield similar pathological behavior.
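A minimal sketch of this cross-validation recipe for the estimator $m^*$ (so minimizing $\widehat{RSS}{}^*$); the bandwidth grid, kernel, and simulated model are illustrative assumptions, not part of the paper. The common factor $n^{-1}h^{-1}$ in the weights cancels in the Nadaraya-Watson ratio and is therefore omitted.

```python
import numpy as np

def cv_bandwidth(X, Y, h_grid):
    """Pick h minimizing the leave-one-out residual sum of squares for m*."""
    best_h, best_rss = None, np.inf
    for h in h_grid:
        U = (X[:, None] - X[None, :]) / h            # U[j, i] = (X_j - X_i) / h
        K = np.where(np.abs(U) <= 1, 0.75 * (1 - U ** 2), 0.0)
        np.fill_diagonal(K, 0.0)                     # leave observation j out
        denom = K.sum(axis=1)
        valid = denom > 0                            # skip X_j with no neighbors within h
        m_loo = (K @ Y)[valid] / denom[valid]
        rss = np.mean((Y[valid] - m_loo) ** 2)
        if rss < best_rss:
            best_h, best_rss = h, rss
    return best_h

rng = np.random.default_rng(0)
X = rng.uniform(-1, 2, 300)
Y = np.sin(2 * np.pi * X) + 0.3 * rng.standard_normal(300)
print(cv_bandwidth(X, Y, np.linspace(0.02, 0.5, 25)))
```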
In section 3 theorems are stated which show that a slight modification of the cross-validated bandwidths $\hat h$ and $\hat h^*$ have excellent asymptotic properties. In particular it is shown that choosing $\hat h$ to minimize $\widehat{RSS}$ is asymptotically equivalent to choosing $h$ to minimize MISE. A similar result will also be established for $\widehat{RSS}{}^*$, where MISE for $m^*$ is appropriately defined in section 2. Sections 4 and 5 contain the proofs of the optimality theorems for $\hat h$ and $\hat h^*$ respectively.
2. AN ERROR CRITERION FOR m*

As noted in section 1, MISE may not be a meaningful error criterion for the estimator $m_n = m^*$, because the moments of $m^*$ may fail to exist. This difficulty will be overcome by restricting expectation to an event whose probability tends to one.
First let $\{\underline{h}_n\}$ denote a sequence for which there is an $\varepsilon > 0$ so that
(2.1)  $\lim_{n \to \infty} \underline{h}_n\, n^{1/2 - \varepsilon} = \infty$,
and let $\{\bar h_n\}$ denote a sequence for which
(2.2)  $\lim_{n \to \infty} \bar h_n = 0$,  $\lim_{n \to \infty} \bar h_n \log n = \infty$.
Intuitively, $\underline{h}_n$ tends to 0 "just slower than" $n^{-1/2}$ and $\bar h_n$ "just barely" tends to 0.
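For instance, under the reading of (2.1) and (2.2) given above, the (hypothetical) sequences $\underline{h}_n = n^{-1/2 + 2\varepsilon}$ and $\bar h_n = (\log n)^{-1/2}$ qualify, since $\underline{h}_n n^{1/2-\varepsilon} = n^{\varepsilon} \to \infty$ while $\bar h_n \to 0$ and $\bar h_n \log n = (\log n)^{1/2} \to \infty$. A quick numerical check:

```python
import numpy as np

# Hypothetical admissible sequences: with eps = 0.05,
#   underbar_h(n) = n^(-1/2 + 2 eps): underbar_h * n^(1/2 - eps) = n^eps -> infinity  (2.1),
#   bar_h(n) = (log n)^(-1/2):        bar_h -> 0 and bar_h * log n = sqrt(log n) -> infinity  (2.2).
eps = 0.05
for n in 10 ** np.arange(2, 9):
    lo, hi = n ** (-0.5 + 2 * eps), np.log(n) ** -0.5
    print(f"n={n:>9}  underbar_h={lo:.5f}  underbar_h*n^(1/2-eps)={lo * n ** (0.5 - eps):8.2f}  "
          f"bar_h={hi:.4f}  bar_h*log(n)={hi * np.log(n):5.2f}")
```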
Next suppose that the marginal density $f(x)$ and the kernel $K(x)$ satisfy the assumptions of Theorem A of Silverman (1978). Note that the proof of that theorem may be easily extended to show
(2.3)  $\sup_{\underline{h}_n \le h \le \bar h_n}\ \sup_x\ |\hat f(x) - f(x)| \to 0$  a.s.
Next assume that there is a constant $\gamma > 0$ so that for $x \in \operatorname{supp}(w)$, the support of the weight function $w(x)$,
  $f(x) > \gamma$.
For $n = 1, 2, \ldots$ define the event
  $U_n = \{\hat f(x) > \gamma/2 \text{ for } h \in [\underline{h}_n, \bar h_n],\ x \in \operatorname{supp}(w)\}$,
and let $\bar U_n$ denote the complement of $U_n$. Note that, by (2.3), $P[\bar U_n] \to 0$. Also note that on the event $U_n$ there is no difficulty about existence of moments of $m^*$.
From the above it follows that, on the event $U_n$,
(2.4)  $m^*(x) - m(x) = [\hat m(x) - m(x)\hat f(x)]/\hat f(x)$
       $= [\hat m(x) - m(x)\hat f(x)]/f(x) + [\hat m(x) - m(x)\hat f(x)][f(x) - \hat f(x)]/f(x)\hat f(x)$
       $= [\hat m(x) - m(x)\hat f(x)]/f(x) + o_p(\hat m(x) - m(x)\hat f(x))$,
uniformly over $h \in [\underline{h}_n, \bar h_n]$, $x \in \operatorname{supp}(w)$.
Now for $i = 1, 2, \ldots$ define the residuals
(2.5)  $\varepsilon_i = Y_i - m(X_i)$.
Note that
(2.6)  $\hat m(x) - m(x)\hat f(x) = \sum_{i=1}^n a_i(x)\varepsilon_i + \sum_{i=1}^n a_i(x)[m(X_i) - m(x)]$.
Next, following the notation (1.1), let
  $V(x) = S(x) - m(x)^2$,
and assume $V$, $f$ and $m$ are uniformly continuous. A computation very similar to that leading to (1.2) yields
  $E[\hat m(x) - m(x)\hat f(x)]^2 = n^{-1}h^{-1} V(x) f(x) \int K(u)^2\,du + \left[\int K(u)[m(x - hu) - m(x)] f(x - hu)\,du\right]^2 + o(n^{-1}h^{-1})$.
Next for $n = 1, 2, \ldots$ let $E^*$ denote expectation over the event $U_n$. It follows from the above that
(2.7)  $\mathrm{MISE}^* = E^* \int [m^*(x) - m(x)]^2 w(x)\,dx = n^{-1}h^{-1} \left(\int V(x) f(x)^{-1} w(x)\,dx\right) \int K(u)^2\,du + s_2(h) + o(n^{-1}h^{-1})$,
where
(2.8)  $s_2(h) = \int \left[\int K(u)[m(x - hu) - m(x)] f(x - hu)\,du\right]^2 f(x)^{-2} w(x)\,dx$.
MISE* is the error criterion that will be used for $m^*$ in the optimality theorem of section 3.
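Since $E^*$ restricts expectation to $U_n$, MISE* lends itself to Monte Carlo approximation: simulate many samples, discard those on which $\hat f$ drops below $\gamma/2$ on the support of $w$, and average the integrated squared error of $m^*$. The data-generating model below is hypothetical, and the event $U_n$ is checked at the single bandwidth under consideration, a simplification of the definition above.

```python
import numpy as np

def mise_star_mc(n, h, n_rep=200, seed=0):
    """Monte Carlo approximation of MISE* = E* int [m*(x) - m(x)]^2 w(x) dx.
    Hypothetical model: X ~ U(-1, 2) (so f = 1/3, gamma = 1/3),
    Y = sin(2 pi X) + N(0, 0.3^2), weight w = 1_[0,1] * f."""
    rng = np.random.default_rng(seed)
    m = lambda x: np.sin(2 * np.pi * x)
    gamma = 1 / 3
    x_grid = np.linspace(0, 1, 101)
    dx = x_grid[1] - x_grid[0]
    ises = []
    for _ in range(n_rep):
        X = rng.uniform(-1, 2, n)
        Y = m(X) + 0.3 * rng.standard_normal(n)
        U = (x_grid[:, None] - X[None, :]) / h
        a = np.where(np.abs(U) <= 1, 0.75 * (1 - U ** 2), 0.0) / (n * h)
        fhat = a.sum(axis=1)
        if fhat.min() <= gamma / 2:        # sample lies off the event U_n: discard
            continue
        m_star = a @ Y / fhat
        ises.append(np.sum((m_star - m(x_grid)) ** 2 * (1 / 3)) * dx)  # w = f = 1/3 on [0,1]
    return np.mean(ises)

print(mise_star_mc(n=300, h=0.1))
```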
3. ASSUMPTIONS AND THEOREMS

The theorems of this paper require the following set of assumptions.
(A.1) Let $\{\underline{h}_n\}$ and $\{\bar h_n\}$ satisfy (2.1) and (2.2).
(A.2) There exists a $c < \infty$ and a sequence of positive constants $\{a_n\}$ such that (letting $f_Y$ denote the marginal density of $Y$)
  $\sup_{\underline{h}_n \le h \le \bar h_n} h^{-3} \int_{|y| > a_n} y^2 f_Y(y)\,dy \le c$  for $n = 1, 2, \ldots$,
  $\lim_{n \to \infty} \sup_{0 \le x \le 1} \int_{|y| > a_n} y^2 f(x,y)\,dy = 0$,
  $\lim_{n \to \infty} \sup_{\underline{h}_n \le h \le \bar h_n} n^{-1/2} h^{-1/2} a_n (\log n)^2 = 0$,
  $g_n(x) = \int_{-a_n}^{a_n} y^2 f(x,y)\,dy > 0$  for all $x \in [0,1]$, $n = 1, 2, \ldots$
(A.3) There are constants $M > 0$ and $\beta > \tfrac12 + \varepsilon$ (see (2.1)) so that, for any real numbers $x$ and $t$,
  $|m(x) - m(t)| \le M|x - t|^\beta$,
  $|S(x) - S(t)| \le M|x - t|^\beta$,
  $|f(x) - f(t)| \le M|x - t|^\beta$.
(A.4) Both $S(x)$ and $f(x)$ are of bounded variation.
(A.5) There is a constant $\gamma > 0$ so that for $x \in [0,1]$, $f(x) \ge \gamma$.
(A.6) The kernel function $K$ has compact support, has a derivative which is of bounded variation, and satisfies $\int K(x)\,dx = 1$.
Note that by (1.4), (2.8), (A.3), (A.4) and (A.5), as $h \to 0$,
  $s_1(h) = o(h)$  and  $s_2(h) = o(h)$.
Hence, from (1.3) and (2.7) the optimal $h$ for both MISE and MISE* is (asymptotically) contained in $[\underline{h}_n, \bar h_n]$.
It should be noted that, by taking $a_n = n^{1/4}(\log n)^{-3}$, a sufficient condition for (A.2) is that $Y$ has a moment of order $8 + \eta$ (some $\eta > 0$). This is substantially weaker than the boundedness conditions on $Y$ that have been imposed by a number of authors, starting with Nadaraya (1964).
It was seen in section 1 that the boundedness of $f$ away from 0 is very convenient. The choice of the interval $[0,1]$ in (A.5) is without loss of generality (by a simple rescaling argument). It should also be noted that Stone (1982) has made a similar assumption.
The reader may be surprised at the lack of "vanishing moment" assumptions on $K$ such as those introduced by Parzen (1962). The theorems of this paper are true under assumptions of this type, but such assumptions are not necessary.
Now since $f$ is known to be bounded away from 0 only on the interval $[0,1]$, define
  $J = \{\, j = 1, \ldots, n : X_j \in [0,1] \,\}$.
Next redefine the estimated Residual Sums of Squares
(3.1)  $\widehat{RSS} = n^{-1} \sum_{j \in J} [Y_j - \hat m_j(X_j)/f(X_j)]^2$,
       $\widehat{RSS}{}^* = n^{-1} \sum_{j \in J} [Y_j - \hat m^*_j(X_j)]^2$.
Using the notation (2.5), the actual Residual Sum of Squares may be written as
(3.2)  $RSS = n^{-1} \sum_{j \in J} \varepsilon_j^2$.
It is important to note that $RSS$ is independent of $h$.
The main theorems of this paper may now be stated.

Theorem 1: Under the assumptions (A.1) - (A.6),
(3.3)  $\widehat{RSS} = RSS + \mathrm{MISE} + o_p(\mathrm{MISE})$,
uniformly over $h \in [\underline{h}_n, \bar h_n]$, where the weight function (in MISE) is taken to be $w(x) = 1_{[0,1]}(x) f(x)$.

Theorem 2: Under the assumptions (A.1) - (A.6),
  $\widehat{RSS}{}^* = RSS + \mathrm{MISE}^* + o_p(\mathrm{MISE}^*)$,
uniformly over $h \in [\underline{h}_n, \bar h_n]$, where the weight function (in MISE*) is taken to be $w(x) = 1_{[0,1]}(x) f(x)$.
From these theorems it follows that choosing $h \in [\underline{h}_n, \bar h_n]$ to minimize $\widehat{RSS}$ (or $\widehat{RSS}{}^*$) is asymptotically the same as minimizing MISE (or MISE* respectively). Thus, as mentioned in section 1, not only is the exponent of algebraic convergence optimized, but in fact the constant coefficient is the best possible for the given kernel $K$ and weight function $w$.
To see how this provides an answer to Question 3 of Stone (1982), assume that for some $k \in \mathbb{Z}^+$ the kernel $K$ satisfies
  $\int u^j K(u)\,du = 0$  for $j = 1, \ldots, k$.
Then it is apparent from (1.4), (2.8) and Taylor's theorem that if both $f$ and $m$ satisfy Stone's smoothness condition (1.2) with $p = k + \beta$, then
  $s_1(h) = O(h^{2p})$  and  $s_2(h) = O(h^{2p})$,
uniformly (over functions satisfying (1.2)). Hence, by (1.3) and (2.7), letting $r = 2p/(2p+1)$,
  $\mathrm{MISE} = O(n^{-r})$,  $\mathrm{MISE}^* = O(n^{-r})$,
when $h \approx n^{-1/(2p+1)}$.
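For concreteness, a few values of these exponents (our arithmetic, simply evaluating the display above):

```python
# Optimal rate exponents: r = 2p/(2p+1), bandwidth of order n^(-1/(2p+1)).
for p in (1.0, 1.5, 2.0, 3.0):
    print(f"p = {p}:  MISE = O(n^-{2 * p / (2 * p + 1):.3f}),  h ~ n^-{1 / (2 * p + 1):.3f}")
```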
Now define the Integrated Square Error,
  $\mathrm{ISE} = \int_0^1 [m_n(x) - m(x)]^2 f(x)\,dx$,
where $m_n(x)$ is either $\hat m(x)/f(x)$ or $m^*(x)$. In either case, it follows from the Markov Inequality that
  $\lim_{c \to \infty} \limsup_{n \to \infty} \sup_{m,f} P[\mathrm{ISE} \ge c\, n^{-r}] = 0$.
Now using computations very similar to (3.11) - (3.13) of Stone (1982), this may be sharpened to
  $\lim_{n \to \infty} \sup_{m,f} P[\mathrm{ISE} \ge c\, n^{-r}] = 0$,
for some $c > 0$. This is the statement of Question 3 of Stone (1982).
The fact that MISE and MISE* are restricted to $[0,1]$ is also consistent with the results of Stone (1982). The choice of $w(x)$ used here diverges slightly from that of Stone but is very natural because here MISE is proportional to the conditional expected square error
  $E^*[(\hat m(X)/f(X) - m(X))^2 \mid X \in [0,1]]$,
and similarly for MISE*. It is apparent from (1.3) and (2.7) that the choice of $w(x)$ is irrelevant to optimizing the exponent of algebraic convergence. Hence, the estimators of this paper provide a solution to Question 3 of Stone (1982).
It should also be noted that the results of this paper concern a fixed regression function, $m$, while Stone (1982) works uniformly over classes of regression functions. However, the sup taken over the class $\Theta_r$ appearing in Stone's Question 3 is a consequence of the results of this paper, as long as his constant $K_2$ may be considered independent of $r$.
At first glance, the reader who is familiar with the literature on nonparametric regression estimation may be disturbed by the fact that the weight function, $w(x)$, is truncated off the interval $[0,1]$. In somewhat similar settings Gasser and Müller (1979) and Rice and Rosenblatt (1983) have reported that an untruncated MISE can be drastically influenced by an "endpoint effect". This is caused by inflation of the bias$^2$ part of the mean square error near the endpoints of the interval of support of $f(x)$. This effect makes MISE a poor measure of the performance of an estimator. Indeed, choosing $h$ to optimize such a MISE gives an estimator which is seen to be quite suboptimal everywhere except near the endpoints. Despite this discouraging fact, one may see with very little effort that this endpoint effect does not occur in the present setting. This is because here, unlike in the settings treated by the above authors, the marginal density is assumed to extend (and be "smooth") beyond the interval $[0,1]$, and data points from outside the interval are used in the estimators of this paper. Hence, in the present setting, MISE provides a very reasonable assessment of the performance of the estimator on the entire interval $[0,1]$.
Another approach to the above endpoint difficulties has been taken in the unpublished manuscript by Rice (1982), who assumes a somewhat restrictive "circular design", i.e., $m(x)$ and its first two derivatives agree at the endpoints of the support of $f(x)$. The advantage of this assumption is that MISE may now be taken over the entire support of $f$, instead of only over a subinterval as done here. The setting of that paper is also somewhat different from the present because there the independent variables are deterministic and in fact equally spaced. Rice's asymptotics avoid the smoothness questions raised by Stone (1982), but his paper contains some interesting Monte Carlo comparisons of several estimators which appear to be indistinguishable using the asymptotics of this paper.
Finally, it is noted that the fact that the results of this paper require minimization be performed over $h \in [\underline{h}_n, \bar h_n]$ may be somewhat disturbing to the experimenter with a fixed sample size. An interesting and worthwhile extension of the results of this paper would be to show that the cross-validated $\hat h$ (and $\hat h^*$) satisfies this restriction.
4. PROOF OF THEOREM 1

First it is convenient to define the estimated Mean Integrated Square Error
(4.1)  $\widehat{\mathrm{MISE}} = n^{-1} \sum_{j \in J} [\hat m(X_j)/f(X_j) - m(X_j)]^2$.
Wegman (1972) has used a density estimation analog of $\widehat{\mathrm{MISE}}$ for Monte Carlo comparisons of estimators. The fact that this would be a reasonable procedure in the present setting is established by
Lemma 1: Under the assumptions of Theorem 1,
  $\widehat{\mathrm{MISE}} = \mathrm{MISE} + o_p(\mathrm{MISE})$,
uniformly over $h \in [\underline{h}_n, \bar h_n]$.

The proof of Lemma 1 is not given here because this is Theorem 1 of Härdle (1983). Theorem 1 of this paper is an easy consequence of Lemma 1 and

Lemma 2: Under the assumptions of Theorem 1,
  $\widehat{RSS} = RSS + \widehat{\mathrm{MISE}} + o_p(\mathrm{MISE})$,
uniformly over $h \in [\underline{h}_n, \bar h_n]$.
Proof of Lemma 2:
From (3.1) and (3.2), by the addition and subtraction of $m(X_j)$, note that
(4.2)  $\widehat{RSS} = RSS + A_n + B_n$,
where
  $A_n = n^{-1} \sum_{j \in J} 2\varepsilon_j [m(X_j) - \hat m_j(X_j)/f(X_j)]$,
  $B_n = n^{-1} \sum_{j \in J} [m(X_j) - \hat m_j(X_j)/f(X_j)]^2$.
These quantities will be approximated in turn.
By (1.5) and (2.5), $A_n$ may be decomposed by
(4.3)  $-\tfrac12 A_n = A_{1n} + A_{2n}$,
where
  $A_{1n} = n^{-1} \sum_{j \in J} \varepsilon_j f(X_j)^{-1} \sum_{i \ne j} a_i(X_j) \varepsilon_i$,
  $A_{2n} = n^{-1} \sum_{j \in J} \varepsilon_j f(X_j)^{-1} \left[\sum_{i \ne j} a_i(X_j) m(X_i) - m(X_j) f(X_j)\right]$.
To approximate the term $A_{1n}$, note that by conditioning on $\{X_1, \ldots, X_n\}$,
  $E A_{1n} = 0$.
For $j \in J$, define
  $Z_j = \sum_{i \ne j} a_i(X_j) f(X_j)^{-1} \varepsilon_i \varepsilon_j$.
Note that
  $Z_j^2 = \sum_{i \ne j} a_i(X_j)^2 f(X_j)^{-2} \varepsilon_i^2 \varepsilon_j^2 + \sum_{i, i' \ne j,\, i \ne i'} a_i(X_j) a_{i'}(X_j) f(X_j)^{-2} \varepsilon_i \varepsilon_{i'} \varepsilon_j^2$,
and so, by the independence of the residuals $\{\varepsilon_j\}$, (A.3), (A.5) and a computation similar to (1.2), uniformly over $h \in [\underline{h}_n, \bar h_n]$,
  $E Z_j^2 = \sum_{i \ne j} E[a_i(X_j)^2 f(X_j)^{-2} \varepsilon_i^2 \varepsilon_j^2] \le \gamma^{-2} \sup_{0 \le x \le 1} S^2(x) \sum_{i \ne j} E[a_i(X_j)^2] = O(n^{-1}h^{-1})$.
By similar methods it is apparent that, for $j \ne j'$, uniformly over $h \in [\underline{h}_n, \bar h_n]$,
  $E[Z_j Z_{j'}] = \iint n^{-2} h^{-2} K\!\left(\frac{x-y}{h}\right) K\!\left(\frac{y-x}{h}\right) V(x) V(y)\, P[X \in [0,1]]^{-2}\,dx\,dy = O(n^{-2}h^{-1})$.
It follows from the above that $E A_{1n}^2 = O(n^{-2}h^{-1}) = o(n^{-2}h^{-2})$, and hence, by (1.3), uniformly over $h \in [\underline{h}_n, \bar h_n]$,
(4.4)  $A_{1n} = o_p(\mathrm{MISE})$.
To bound $A_{2n}$ a decomposition which is somewhat similar to the usual variance/bias$^2$ one is used,
(4.5)  $A_{2n} = A^v_{2n} + A^c_{2n} + A^b_{2n}$,
where
  $A^v_{2n} = n^{-1} \sum_{j \in J} \varepsilon_j f(X_j)^{-1} \left[\sum_{i \ne j} a_i(X_j) m(X_i) - E\big[\textstyle\sum_{i \ne j} a_i(X_j) m(X_i) \mid J, X_j\big]\right]$,
  $A^c_{2n} = n^{-1} \sum_{j \in J} \varepsilon_j f(X_j)^{-1} \left[E\big[\textstyle\sum_{i \ne j} a_i(X_j) m(X_i) \mid J, X_j\big] - \int K(u) f(X_j - hu) m(X_j - hu)\,du\right]$,
  $A^b_{2n} = n^{-1} \sum_{j \in J} \varepsilon_j f(X_j)^{-1} \left[\int K(u) f(X_j - hu) m(X_j - hu)\,du - m(X_j) f(X_j)\right]$.
Now by Chebyshev's Inequality, (A.3), (A.5) and appropriate conditioning arguments (in particular using $E[\varepsilon_j \mid X_j] = 0$), together with the notation
  $\bar S = \sup_{0 \le x \le 1} S(x)$,
  $P\big[|A^v_{2n}| > \eta\, n^{-1}h^{-1}\big] \le \eta^{-2} h^2 \bar S \gamma^{-2}\, E\left[\sum_{j \in J} E\Big[\big(\textstyle\sum_{i \ne j} a_i(X_j) m(X_i) - E[\sum_{i \ne j} a_i(X_j) m(X_i) \mid J, X_j]\big)^2 \,\Big|\, J, X_j\Big]\right]$.
But by computations very similar to (1.2) it is seen that, uniformly in $x$ and in $h \in [\underline{h}_n, \bar h_n]$,
  $E[(a_i(X_j) m(X_i))^2 \mid X_j = x] = O(n^{-2}h^{-1})$.
It follows from this that uniformly over $h \in [\underline{h}_n, \bar h_n]$,
(4.6)  $A^v_{2n} = o_p(n^{-1}h^{-1})$.
A similar technique will now be used to approximate $A^c_{2n}$:
  $P\big[|A^c_{2n}| > \eta\, n^{-1}h^{-1}\big] \le \eta^{-2} h^2 \bar S \gamma^{-2}\, E\left[\sum_{j \in J} \big(E[\textstyle\sum_{i \ne j} a_i(X_j) m(X_i) \mid J, X_j] - \int K(u) f(X_j - hu) m(X_j - hu)\,du\big)^2\right]$.
Next let $\#(J)$ denote the cardinality of $J$, let $C = \mathbb{R} \setminus [0,1]$, and note that
  $E\big[\textstyle\sum_{i \ne j} a_i(X_j) m(X_i) \mid J, X_j\big] = \dfrac{\#(J) - 1}{n} \displaystyle\int K(u)\, m(X_j - hu)\, \dfrac{f(X_j - hu)}{P[X \in [0,1]]}\,du + \dfrac{n - \#(J)}{n} \displaystyle\int_C K(u)\, m(X_j - hu)\, \dfrac{f(X_j - hu)}{P[X \in C]}\,du$.
But by the Central Limit Theorem,
  $\#(J)/n = P[X \in [0,1]] + O_p(n^{-1/2})$.
It now follows from the above that, uniformly over $h \in [\underline{h}_n, \bar h_n]$,
(4.7)  $A^c_{2n} = o_p(n^{-1}h^{-1})$.
Using the same technique on $A^b_{2n}$ yields
  $P\big[|A^b_{2n}| > \eta\, s_1(h)^{1/2}(nh)^{-1/2}\big] \le \eta^{-2} s_1(h)^{-1} nh\, n^{-2} \bar S\, E\left[\sum_{j \in J} E\Big[\big(\textstyle\int K(u) f(X_j - hu) m(X_j - hu)\,du - m(X_j) f(X_j)\big)^2 f(X_j)^{-2} \,\Big|\, J\Big]\right]$
  $= \eta^{-2} s_1(h)^{-1} nh\, n^{-2} \bar S\, E\left[\#(J) \displaystyle\int_0^1 \big(\textstyle\int K(u)[f(x - hu) m(x - hu) - f(x) m(x)]\,du\big)^2 \dfrac{f(x)^{-1}}{P[X \in [0,1]]}\,dx\right] = O(h)$.
Thus, from (1.4) and (3.3), uniformly over $h \in [\underline{h}_n, \bar h_n]$,
  $A^b_{2n} = o_p\big(s_1(h)^{1/2} (nh)^{-1/2}\big)$.
It follows from this together with (1.3), (4.5), (4.6) and (4.7) that, uniformly over $h \in [\underline{h}_n, \bar h_n]$,
  $A_{2n} = o_p(\mathrm{MISE})$.
This, (4.3) and (4.4) imply that, uniformly over $h \in [\underline{h}_n, \bar h_n]$,
(4.8)  $A_n = o_p(\mathrm{MISE})$.
Now for the term $B_n$ of (4.2), by another addition and subtraction, using the notation (4.1), write
(4.9)  $B_n = \widehat{\mathrm{MISE}} + B_{1n} + B_{2n}$,
where
  $B_{1n} = 2n^{-1} \sum_{j \in J} [m(X_j) - \hat m(X_j)/f(X_j)][\hat m(X_j) - \hat m_j(X_j)] f(X_j)^{-1}$,
  $B_{2n} = n^{-1} \sum_{j \in J} [\hat m(X_j) - \hat m_j(X_j)]^2 f(X_j)^{-2}$.
Now using Prop. 4 of Mack and Silverman (1982), uniformly over $h \in [\underline{h}_n, \bar h_n]$,
  $\sup_j |\hat m(X_j) - m(X_j)\hat f(X_j)| = o_p(1)$.
Thus, by the fact that
  $\hat m(X_j) - \hat m_j(X_j) = (nh)^{-1} K(0) Y_j$,
and by the assumptions (A.3) and (A.5), uniformly over $h \in [\underline{h}_n, \bar h_n]$,
  $B_{1n} = o_p(n^{-1}h^{-1}) = o_p(\mathrm{MISE})$,
  $B_{2n} = o_p(n^{-1}h^{-1}) = o_p(\mathrm{MISE})$.
It now follows from (4.2), (4.8) and (4.9) that, uniformly over $h \in [\underline{h}_n, \bar h_n]$,
  $\widehat{RSS} = RSS + \widehat{\mathrm{MISE}} + o_p(\mathrm{MISE})$,
which completes the proofs of Lemma 2 and Theorem 1.
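As an aside, the identity $\hat m(X_j) - \hat m_j(X_j) = (nh)^{-1} K(0) Y_j$ used above is also computationally convenient: the leave-one-out residuals behind $\widehat{RSS}$ can be obtained from a single full-data fit. A sketch under hypothetical choices of kernel and known density (and without the restriction to $j \in J$):

```python
import numpy as np

def loo_residuals_known_f(X, Y, h, f, K):
    """Leave-one-out residuals Y_j - m_hat_j(X_j)/f(X_j) from one full-data fit,
    using m_hat(X_j) - m_hat_j(X_j) = (nh)^-1 K(0) Y_j."""
    n = len(X)
    U = (X[:, None] - X[None, :]) / h
    mhat = K(U) @ Y / (n * h)              # full-data m_hat(X_j) = sum_i a_i(X_j) Y_i
    m_loo = mhat - K(0.0) * Y / (n * h)    # subtract the i = j term
    return Y - m_loo / f(X)

# Hypothetical choices: Epanechnikov kernel, X ~ U(-1, 2) so f = 1/3.
K = lambda u: np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)
f = lambda x: np.full_like(x, 1 / 3)
rng = np.random.default_rng(1)
X = rng.uniform(-1, 2, 200)
Y = np.sin(2 * np.pi * X) + 0.3 * rng.standard_normal(200)
print(np.mean(loo_residuals_known_f(X, Y, 0.15, f, K) ** 2))  # RSS-hat at h = 0.15
```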
5. PROOF OF THEOREM 2

This proof is very similar to the proof of Theorem 1 and only parts that are quite different will be given in detail here. Define
  $\widehat{\mathrm{MISE}}{}^* = n^{-1} \sum_{j \in J} [m^*(X_j) - m(X_j)]^2$.

Lemma 3: Under the assumptions of Theorem 2,
  $\widehat{\mathrm{MISE}}{}^* = \mathrm{MISE}^* + o_p(\mathrm{MISE}^*)$,
uniformly over $h \in [\underline{h}_n, \bar h_n]$.

Lemma 3 is not proved here because this is Theorem 2 of Härdle (1983). Theorem 2 of this paper follows immediately from Lemma 3 and

Lemma 4: Under the assumptions of Theorem 2,
  $\widehat{RSS}{}^* = RSS + \widehat{\mathrm{MISE}}{}^* + o_p(\mathrm{MISE}^*)$,
uniformly over $h \in [\underline{h}_n, \bar h_n]$.
Proof of Lemma 4:
As in section 4 write
(5.1)  $\widehat{RSS}{}^* = RSS + A^*_n + B^*_n$,
where
  $A^*_n = n^{-1} \sum_{j \in J} 2\varepsilon_j [m(X_j) - \hat m^*_j(X_j)]$,
  $B^*_n = n^{-1} \sum_{j \in J} [m(X_j) - \hat m^*_j(X_j)]^2$.
Now by the leave-one-out analogs of (2.4) and (2.6) write (on the event $U_n$)
  $-\tfrac12 A^*_n = A^*_{1n} + A^*_{2n} + o_p(A^*_{1n} + A^*_{2n})$,
uniformly over $h \in [\underline{h}_n, \bar h_n]$, where
  $A^*_{1n} = n^{-1} \sum_{j \in J} \varepsilon_j f(X_j)^{-1} \sum_{i \ne j} a_i(X_j) \varepsilon_i$,
  $A^*_{2n} = n^{-1} \sum_{j \in J} \varepsilon_j f(X_j)^{-1} \sum_{i \ne j} a_i(X_j) [m(X_i) - m(X_j)]$.
Note that by the methods used to approximate $A_{1n}$ in section 4, uniformly over $h \in [\underline{h}_n, \bar h_n]$,
  $A^*_{1n} = o_p(\mathrm{MISE}^*)$.
Again following section 4, write
  $A^*_{2n} = A^{*v}_{2n} + A^{*c}_{2n} + A^{*b}_{2n}$,
where
  $A^{*v}_{2n} = n^{-1} \sum_{j \in J} \varepsilon_j f(X_j)^{-1} \left[\sum_{i \ne j} a_i(X_j)(m(X_i) - m(X_j)) - E\big[\textstyle\sum_{i \ne j} a_i(X_j)(m(X_i) - m(X_j)) \mid J, X_j\big]\right]$,
  $A^{*c}_{2n} = n^{-1} \sum_{j \in J} \varepsilon_j f(X_j)^{-1} \left[E\big[\textstyle\sum_{i \ne j} a_i(X_j)(m(X_i) - m(X_j)) \mid J, X_j\big] - \int K(u)[m(X_j - hu) - m(X_j)] f(X_j - hu)\,du\right]$,
  $A^{*b}_{2n} = n^{-1} \sum_{j \in J} \varepsilon_j f(X_j)^{-1} \int K(u)[m(X_j - hu) - m(X_j)] f(X_j - hu)\,du$.
But now, by methods very similar to those used on $A^v_{2n}$, $A^c_{2n}$ and $A^b_{2n}$ in section 4, each of these terms is, uniformly over $h \in [\underline{h}_n, \bar h_n]$, $o_p(\mathrm{MISE}^*)$. Thus by (2.7), uniformly over $h \in [\underline{h}_n, \bar h_n]$,
  $A^*_{2n} = o_p(\mathrm{MISE}^*)$,
and so
(5.2)  $A^*_n = o_p(\mathrm{MISE}^*)$.
To approximate the term $B^*_n$, write
  $B^*_n = \widehat{\mathrm{MISE}}{}^* + B^*_{1n} + B^*_{2n}$,
where
  $B^*_{1n} = 2n^{-1} \sum_{j \in J} [m(X_j) - \hat m(X_j)/\hat f(X_j)][\hat m(X_j) - \hat m_j(X_j)] \hat f(X_j)^{-1}$,
  $B^*_{2n} = n^{-1} \sum_{j \in J} [\hat m(X_j) - \hat m_j(X_j)]^2 \hat f(X_j)^{-2}$.
Now approximations similar to (2.4), together with the methods used on $B_{1n}$ and $B_{2n}$ in section 4, yield, uniformly over $h \in [\underline{h}_n, \bar h_n]$,
  $B^*_n = \widehat{\mathrm{MISE}}{}^* + o_p(\mathrm{MISE}^*)$.
Lemma 4 is an easy consequence of this together with (5.1) and (5.2). This completes the proof of Theorem 2.
ACKNOWLEDGEMENT

The authors are grateful to Charles J. Stone for posing the problem solved in this paper, and to Raymond J. Carroll for several useful comments and suggestions.
REFERENCES

GASSER, T. and MÜLLER, H.G. (1979). Kernel estimation of regression functions. Smoothing Techniques for Curve Estimation. Lecture Notes in Math. 757, 23-68.

HÄRDLE, W. (1983). Approximations to the mean integrated square error with applications to optimal bandwidth selection for nonparametric regression estimators. North Carolina Institute of Statistics, Mimeo Series #1529.

JOHNSTON, G.J. (1982). Probabilities of maximal deviations for nonparametric regression function estimation. J. Mult. Anal. 12, 402-414.

MACK, Y.P. and SILVERMAN, B.W. (1982). Weak and strong uniform consistency of kernel regression estimates. Z. Wahrsch. Verw. Gebiete 61, 405-415.

NADARAYA, E.A. (1964). On estimating regression. Theor. Prob. Appl. 9, 141-142.

PARZEN, E. (1962). On estimation of a probability density function and mode. Ann. Math. Statist. 33, 1065-1076.

RICE, J. (1982). Bandwidth choice for nonparametric kernel regression. Unpublished manuscript.

RICE, J. and ROSENBLATT, M. (1983). Smoothing splines: regression, derivatives and deconvolution. Ann. Statist. 11, 141-156.

ROSENBLATT, M. (1969). Conditional probability density and regression estimators. Multivariate Analysis Vol. 2 (ed. Krishnaiah), 25-31.

ROSENBLATT, M. (1971). Curve estimates. Ann. Math. Statist. 42, 1815-1842.

SILVERMAN, B.W. (1978). Weak and strong uniform consistency of the kernel estimate of a density and its derivatives. Ann. Statist. 6, 177-184.

STONE, C.J. (1982). Optimal global rates of convergence for nonparametric regression. Ann. Statist. 10, 1040-1053.

WAHBA, G. and WOLD, S. (1975). A completely automatic french curve: fitting spline functions by cross-validation. Comm. Statist. 4, 1-17.

WATSON, G.S. (1964). Smooth regression analysis. Sankhyā Ser. A 26, 359-372.

WEGMAN, E.J. (1972). Nonparametric probability density estimation II. A comparison of density estimation methods. J. Statist. Comput. Simul. 1, 225-245.