AN EFFICIENT CROSS-VALIDATION ALGORITHM FOR WINDOW
WIDTH SELECTION FOR NONPARAMETRIC KERNEL REGRESSION
Jeff Racine
Visiting Assistant Professor
Department of Economics, 0508
9500 Gilman Drive
University of California, San Diego
La Jolla, CA 92093
Key Words and Phrases: kernel regression; window width selection; cross-validation; computational efficiency.
ABSTRACT
This paper presents an approach to cross-validated window width choice which greatly
reduces computation time, which can be used regardless of the nature of the kernel function,
and which avoids the use of the Fast Fourier Transform. This approach is developed for
window width selection in the context of kernel estimation of an unknown conditional mean.
1. INTRODUCTION.
Nonparametric density estimation procedures have provided many exciting new techniques
for statistical analysis and data exploration (for an excellent survey of density estimation
techniques, see Izenman [2]). However, it is well known that these techniques tend to be very
computationally intensive. It is therefore extremely important to use algorithms which are
computationally efficient in order to avoid excessively long calculation times.
Algorithms for nonparametric estimation can be decomposed into two parts: one for
the calculation of the kernel and another for window width choice. Silverman [6] (page 88)
states that ``One important factor in reducing the computer time is the choice of a kernel
that can be calculated very quickly.'' Having chosen a kernel that is efficient to compute,
one must then choose the window widths.
There is no generally accepted method for choosing the window widths. Methods currently
available include `subjective choice' and automatic methods such as the `plug-in',
`cross-validation' (CV), and `penalizing function' approaches (see Marron [3] for an
excellent survey). Härdle [1] (page 173) compared various automatic methods and found that
``The best overall performance, though, showed GCV (generalized cross-validation).''
One problem with the CV approach is that the CV function has to be repeatedly calculated
over a range of window widths, which in turn requires repeated evaluation of the kernel
function. The Fast Fourier Transform (FFT) has been applied for both estimation and window
width selection in the context of kernel estimation (see Silverman [5]). However, the
FFT suffers from two drawbacks: loss in precision due to discretization, and inapplicability
when the kernel function is defined to be zero over a range of its domain (as is the case
for the optimal kernel, the Epanechnikov kernel).
This paper presents an approach to cross-validated window width choice which greatly
reduces computation time, which can be used regardless of the kernel chosen, and which
avoids the use of FFTs. The paper proceeds in the following manner: Section 2.1 presents
an overview of the problem of window width selection, Section 2.2 presents an efficient
algorithm for cross-validated window width selection, Section 2.3 presents some examples,
while Section 3 summarizes and concludes.
2. A COMPUTATIONALLY EFFICIENT APPROACH.
A standard starting point for obtaining the window widths involves minimizing the approximate
mean integrated squared error (AMISE) of the kernel estimator. However, this yields a
formula for the window widths which depends in general on both the kernel function and the
unknown data generating process; that is, the resulting window width is not `operational'.
Below I consider the problem in more detail in the context of the estimation of a conditional
mean. For the purposes of this paper I restrict attention to bivariate conditional means for
notational simplicity; extension to the multivariate case follows directly.
2.1. AN OVERVIEW OF KERNEL REGRESSION.
Consider two continuous random variables $(Y, X)$ with realizations $\{y_i, x_i\}$, $i = 1, \ldots, n$.
The `regression' function is given by

$$ y_i = E(Y \mid x_i) + \epsilon_i = M(x_i) + \epsilon_i, \qquad i = 1, \ldots, n \qquad (1) $$
where $M(x_i)$ denotes the conditional mean evaluated at the realization $x_i$. The disturbances
$\epsilon_i$ are independent mean zero random variables satisfying $E[\epsilon_i] = 0$ and $E[\epsilon_i^2] = \sigma^2 < \infty$.
The conditional mean is defined as

$$ E(Y \mid x_i) = \int_{-\infty}^{\infty} y \, \frac{f(y, x_i)}{f(x_i)} \, dy \qquad (2) $$
If nonparametric techniques are used to estimate $f(y, x)$ and $f(x)$ in equation (2), the
resulting estimator $\hat{M}(x_i)$ is a nonparametric estimator of $M(x_i)$. The nonparametric kernel
estimator of $M(x_i)$ is
$$ \hat{M}(x_i) = \hat{E}(Y \mid x_i) = \int_{-\infty}^{\infty} y \, \frac{\hat{f}(y, x_i)}{\hat{f}(x_i)} \, dy = \sum_{j=1}^{n} y_j \, \frac{K\left( \frac{x_j - x_i}{h} \right)}{\sum_{j=1}^{n} K\left( \frac{x_j - x_i}{h} \right)} = \sum_{j=1}^{n} y_j W_j(x_i) \qquad (3) $$
where $K(\cdot)$ is the kernel function and $W_j(x_i)$ is a weight function which lies in the interval
$[0, 1]$. This estimator was proposed by Nadaraya [4] and Watson [7].
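To make the computation concrete, the following is a minimal sketch in C of the estimator
in equation (3). The Epanechnikov kernel and the function names are illustrative choices
for this sketch, not part of the original presentation.

    #include <math.h>

    /* Epanechnikov kernel K(z) = 0.75 (1 - z^2) for |z| < 1, zero otherwise. */
    double epanechnikov(double z)
    {
        return (fabs(z) < 1.0) ? 0.75 * (1.0 - z * z) : 0.0;
    }

    /* Nadaraya-Watson estimate of M(x0) as in equation (3): a kernel-weighted
       average of the y_j, with weights W_j(x0) summing to one. */
    double nw_estimate(const double *x, const double *y, int n, double x0, double h)
    {
        double num = 0.0, den = 0.0;
        int j;
        for (j = 0; j < n; j++) {
            double k = epanechnikov((x[j] - x0) / h);
            num += y[j] * k;
            den += k;
        }
        return (den > 0.0) ? num / den : 0.0; /* guard against an empty window */
    }

Note that because the Epanechnikov kernel vanishes outside $[-1, 1]$, this estimator is
exactly the case the FFT-based approach cannot handle.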
The value of the window width $h$ which, based on asymptotic expansions, minimizes the
AMISE of $\hat{M}(x_i)$ can be shown to be

$$ h = c \, \sigma_x \, n^{-1/5} \qquad (4) $$

where $c$ is a constant of proportionality and $\sigma_x$ is the standard deviation of $X$. The unknown
`scaling factor' $c$ depends in a non-trivial way on the kernel function $K(\cdot)$ and on the
underlying data generating process. The plug-in methods and CV methods can be thought
of as different approaches to obtaining the constant $c$. Obtaining the constant $c$ via the CV
approach involves defining the CV function and minimizing this function with respect to
$c$. This is strictly a numerical problem since $c$ is data dependent. The benefits of viewing
equation (4) as $\sigma_x n^{-1/5}$ scaled by an unknown constant will be clarified below.
Define the CV function to be

$$ S_c = \sum_{i=1}^{n} \left( y_i - \hat{M}(x_{-i}; c) \right)^2 \qquad (5) $$

where $\hat{M}(x_{-i}; c)$ denotes the `leave-one-out' estimator evaluated for a particular value of $c$.
$\hat{M}(x_{-i}; c)$ is obtained by omitting the realization $(y_i, x_i)$ from the estimator of $M(\cdot)$ at the point
$x_i$. The CV approach to window width selection selects that $c$ for which $S_c$ is minimized
for a given sample of data.
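As a sketch of what a direct implementation involves, the following C function evaluates
$S_c$, building on the illustrative kernel above; the signature is an assumption for this
sketch, and `sd_x' denotes the sample standard deviation of $X$, assumed precomputed.

    /* Direct leave-one-out CV function S_c of equation (5); each call requires
       O(n^2) kernel evaluations. */
    double cv_direct(const double *x, const double *y, int n, double c, double sd_x)
    {
        double h = c * sd_x * pow((double) n, -0.2); /* h = c sigma_x n^(-1/5), eq. (4) */
        double s = 0.0;
        int i, j;
        for (i = 0; i < n; i++) {
            double num = 0.0, den = 0.0, m, k;
            for (j = 0; j < n; j++) {
                if (j != i) { /* leave observation i out */
                    k = epanechnikov((x[j] - x[i]) / h);
                    num += y[j] * k;
                    den += k;
                }
            }
            m = (den > 0.0) ? num / den : 0.0; /* leave-one-out estimate at x_i */
            s += (y[i] - m) * (y[i] - m);
        }
        return s;
    }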
This CV function $S_c$ must be evaluated for many different values of $c$, and each evaluation
is of $O(n^2)$. That is, in general there are $n^2$ calculations required to estimate the CV function
for a given value of $c$. Consider, by way of example, a grid search method for obtaining
the optimal $c$. Clearly there exist preferred numerical search algorithms such as polynomial
interpolation, `golden rule' search and so on. Grid search, however, is most likely the easiest
means of demonstrating the algorithm proposed in this paper, and the suggested approach leads
to the same order algorithm regardless of the search method. If you wish to evaluate $S_c$ over
a grid of values for $c$ and there are $G$ points on the grid, then there are $G n^2$ calculations
required. For example, if there are 1,000 observations and you wish to evaluate $S_c$ over a
grid containing 100 points, there are $10^8$ calculations required.
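A grid search over $c$ then amounts to the following sketch; the grid bounds and the
coarseness parameter are illustrative assumptions (Section 2.3 uses a coarseness of 0.01).

    /* Grid search for the minimizing c: G evaluations of S_c, hence G*n^2
       calculations in total when the direct CV function is used. */
    double cv_grid_direct(const double *x, const double *y, int n, double sd_x,
                          int G, double coarseness)
    {
        double best_c = coarseness, best_s = 0.0;
        int g;
        for (g = 1; g <= G; g++) {
            double c = g * coarseness; /* grid points c = 0.01, 0.02, ... */
            double s = cv_direct(x, y, n, c, sd_x);
            if (g == 1 || s < best_s) {
                best_s = s;
                best_c = c;
            }
        }
        return best_c; /* the minimizing scaling factor over the grid */
    }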
2.2. THE PROPOSED ALGORITHM.
When evaluating the CV function it is desirable to use all information contained in the
sample. The problem is how to use all of this information while minimizing evaluation time.
The optimal window width can be viewed as $\sigma_x n^{-1/5}$ scaled by a constant. Including
the term $\sigma_x$ effectively `normalizes' the data, while including $n^{-1/5}$ ensures the proper rate
of convergence of $h$. Therefore, all that is needed is to scale $\sigma_x n^{-1/5}$ by the appropriate
constant while using all information contained in the sample.
Clearly, if you took a subsample from the data of a given size $n_s$ and used this to evaluate
the CV function for a given value of $c$, this would be faster than using the entire sample.
However, the subsample might not be representative of the entire sample, the resulting scaling
factor will display a fair bit of variability from sample to sample, and this approach ignores
information contained in the rest of the sample (the remaining $n - n_s$ observations). Taking
all mutually exclusive and exhaustive subsamples and evaluating the CV function for a given $c$
would address these problems and remains much faster than the direct approach based on the
entire sample. This is the approach taken in this paper. As will be demonstrated, the gains
in reduced calculation time can be quite dramatic.
Consider the improvements gained by taking this approach. Evaluating $S_c$ for a given
value of $c$ directly involves $n^2$ calculations. Breaking the sample into $s = n/n_s$ subsamples of
size $n_s$ involves $n_s^2$ calculations for each subsample, for a total of $s \, n_s^2 = (n/n_s) \, n_s^2 = n \, n_s$
calculations. To evaluate $S_c$ over a grid of $G$ points therefore requires $G n n_s$ calculations,
which is always less than the $G n^2$ required taking the direct approach for any $n_s < n$.
Consider a sample of 1,000 observations where, again, you wish to evaluate $S_c$ over a grid
containing 100 points. If $n_s = 100$, then there are $9 \times 10^7$ calculations saved!
The proposed algorithm proceeds in the following manner (a sketch in C follows the list):

1. Shuffle the sample of $n$ observations to remove potential order.

2. Divide the sample into $s$ mutually exclusive and exhaustive subsamples, each of size $n_s$.

3. For a given value of $c$, $c_0$, evaluate $S_{c_0}$ for each subsample. Call these $s$ values $S_{c_0}^{(k)}$, $k = 1, \ldots, s$.

4. For the given value $c_0$ take the mean of these $s$ values. Call this $\bar{S}_{c_0} = s^{-1} \sum_{k=1}^{s} S_{c_0}^{(k)}$.

5. Repeat these steps for all values of $c_0$ over the entire grid.

6. The cross-validated window width is that obtained using the value of $c_0$ for which $\bar{S}_{c_0}$ is a minimum over the grid.

Clearly, there is no need for such an approach if one has a relatively small number of
observations, since the direct approach is then not computationally burdensome. However, when
the number of observations is large, the savings in computation time can be dramatic.
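The following C sketch puts steps 1 through 6 together, building on the `cv_direct' sketch
above. The function names, the use of rand() for shuffling, and the contiguous split of the
shuffled arrays are assumptions for illustration, not the paper's own code.

    #include <stdlib.h>

    /* Step 1: Fisher-Yates shuffle of the paired sample to remove potential order. */
    void shuffle_pairs(double *x, double *y, int n)
    {
        int i, j;
        double t;
        for (i = n - 1; i > 0; i--) {
            j = rand() % (i + 1);
            t = x[i]; x[i] = x[j]; x[j] = t;
            t = y[i]; y[i] = y[j]; y[j] = t;
        }
    }

    /* Steps 2-4: mean of the subsample CV functions at a given c. Each of the
       s = n/ns subsamples costs O(ns^2), so one evaluation costs O(n*ns).
       Note that cv_direct forms h = c * sd_x * ns^(-1/5) from its own sample
       size argument, so the rate factor adapts to the subsample while the
       scaling factor c stays fixed. */
    double cv_efficient(const double *x, const double *y, int n, int ns,
                        double c, double sd_x)
    {
        int s = n / ns; /* number of mutually exclusive, exhaustive subsamples */
        double sbar = 0.0;
        int k;
        for (k = 0; k < s; k++)
            sbar += cv_direct(x + k * ns, y + k * ns, ns, c, sd_x);
        return sbar / s;
    }

    /* Steps 5-6: grid search for the minimizing c using the subsample mean. */
    double cv_grid_efficient(const double *x, const double *y, int n, int ns,
                             double sd_x, int G, double coarseness)
    {
        double best_c = coarseness, best_s = 0.0;
        int g;
        for (g = 1; g <= G; g++) {
            double c = g * coarseness;
            double s = cv_efficient(x, y, n, ns, c, sd_x);
            if (g == 1 || s < best_s) {
                best_s = s;
                best_c = c;
            }
        }
        return best_c;
    }

In this sketch the k-th subsample occupies x[k*ns] through x[k*ns + ns - 1]; when $n_s$ does
not divide $n$ evenly the remaining observations are ignored, a simplification made here only.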
2.3. APPLICATIONS.
In order to apply the above algorithm three issues must be addressed. The first issue involves
appropriate subsample sizes $n_s$. The second issue is concerned with the sample size $n$ above
which this approach should or should not be used. The third issue deals with the effect of the
above two concerns on the estimated model. I consider two simulated examples to try to
address these issues.

Example 1. The first conditional mean function considered is

$$ y_i = M(x_i) + \epsilon_i = 2.0 \sin(x_i) + \epsilon_i \qquad (6) $$

where $x_i \sim U(-10, 10)$ and $\epsilon_i \sim N(0, 1)$.
Example 2. The second function considered is one found in Härdle [1]. The model is

$$ y_i = M(x_i) + \epsilon_i = 1.0 - x_i + e^{-200.0 (x_i - 0.5)^2} + \epsilon_i \qquad (7) $$

where $\epsilon_i \sim N(0, 0.25)$ and $X \sim U(0, 1)$.
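For completeness, here is a hedged sketch of how the data for Example 2 might be simulated
in C; the use of rand() and the Box-Muller transform are illustrative choices, and the
variance of 0.25 corresponds to a standard deviation of 0.5.

    #include <stdlib.h>
    #include <math.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    /* Uniform draw on (0, 1). */
    static double runif01(void)
    {
        return (rand() + 1.0) / (RAND_MAX + 2.0);
    }

    /* Standard normal draw via the Box-Muller transform. */
    static double rnorm01(void)
    {
        return sqrt(-2.0 * log(runif01())) * cos(2.0 * M_PI * runif01());
    }

    /* Generate n observations from the model of equation (7). */
    void generate_example2(double *x, double *y, int n)
    {
        int i;
        for (i = 0; i < n; i++) {
            x[i] = runif01(); /* X ~ U(0, 1) */
            y[i] = 1.0 - x[i] + exp(-200.0 * (x[i] - 0.5) * (x[i] - 0.5))
                   + 0.5 * rnorm01(); /* epsilon ~ N(0, 0.25), sd = 0.5 */
        }
    }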
To investigate the nature of this approach to window width selection, I consider varying
the sample size $n$ and the subsample size $n_s$. The grid size $G$ was fixed at 100 points and
the `coarseness' of the grid was 0.01. The results are presented in the tables below.
TABLE I

             n_s = 10    25     50    100    250    500  1,000  2,500  5,000  10,000
n = 100        0.37     0.27   0.21   0.20
n = 250        0.33     0.20   0.19   0.20   0.20
n = 500        0.37     0.22   0.22   0.22   0.20   0.18
n = 1,000      0.31     0.22   0.20   0.19   0.18   0.17   0.18
n = 5,000      0.30     0.22   0.21   0.19   0.19   0.19   0.19   0.19   0.19
n = 10,000     0.31     0.21   0.19   0.19   0.19   0.19   0.19   0.19   0.19   0.19

TABLE I: The model used was $y_i = 2.0 \sin(x_i) + \epsilon_i$. The values in the table are the window
width constant $c_{opt}$ obtained by efficient CV. Entries are blank where $n_s > n$; the rightmost
entry in each row is that obtained from direct CV (equivalently, efficient CV when $n_s = n$).
TABLE II

             n_s = 10    25     50    100    250    500  1,000  2,500  5,000  10,000
n = 100        0.58     0.23   0.22   0.21
n = 250        0.57     0.33   0.27   0.29   0.23
n = 500        0.71     0.39   0.33   0.31   0.27   0.24
n = 1,000      0.61     0.40   0.33   0.30   0.26   0.24   0.26
n = 5,000      0.64     0.39   0.33   0.30   0.26   0.25   0.26   0.26   0.26
n = 10,000     0.66     0.37   0.31   0.29   0.28   0.26   0.25   0.24   0.26   0.26

TABLE II: The model used was $y_i = 1.0 - x_i + e^{-200.0 (x_i - 0.5)^2} + \epsilon_i$. The values in the
table are the window width constant $c_{opt}$ obtained by efficient CV. Entries are blank where
$n_s > n$; the rightmost entry in each row is that obtained from direct CV (equivalently,
efficient CV when $n_s = n$).
Next we must examine the `accuracy' of this approach versus the direct approach. That
is, how do the variations in the scaling factor $c$ which arise due to changing the subsample
size $n_s$ affect estimation? To assess this, the conditional means were estimated for Example
2 for the range of scaling factors $c$ arising due to changes in $n_s$ for $n_s \ge 50$. The values for
$n_s = 10$ were quite `noisy', but this noise tended to disappear quite rapidly as $n_s$ increased
to 25 or more. For subsample sizes of $n_s = 50$ and greater, the estimated conditional means
were extremely close for the range of $c$ found when $50 \le n_s \le n$, while for $n_s \ge 100$ the
estimated conditional means were virtually identical (these results were corroborated for a
variety of data sets I had `lying around'). The results for Example 2 for the sample sizes of
n = 1,000 and n = 5,000 are presented in Figures I and II respectively. The variation for
larger samples was so small as to be insignificant.
FIGURE I

[Plot of $\hat{M}(x_i)$ (vertical axis, 0 to 1.6) against $X$ (horizontal axis, 0 to 1).]

FIG I: The model used was $y_i = 1.0 - x_i + e^{-200.0 (x_i - 0.5)^2} + \epsilon_i$. The smooth line is the
actual curve. The other two lines represent the variation in fits for the range of efficient CV
constants $c$ for $50 \le n_s \le n$. The sample size was n = 1,000.
FIGURE II

[Plot of $\hat{M}(x_i)$ (vertical axis, 0 to 1.6) against $X$ (horizontal axis, 0 to 1).]

FIG II: The model used was $y_i = 1.0 - x_i + e^{-200.0 (x_i - 0.5)^2} + \epsilon_i$. The smooth line is the
actual curve. The other two lines represent the variation in fits for the range of efficient CV
constants $c$ for $50 \le n_s \le n$. The sample size was n = 5,000.
The following general conclusions are suggested by these two examples:
1. Subsamples of less than $n_s = 25$ should be avoided regardless of the sample size $n$.

2. Subsamples of $n_s = 25$ or $n_s = 50$ are appropriate provided the sample size $n$ is 500
or more.

3. The variability in the resulting scaling factors for a given subsample size $n_s$ increases
as the overall fit to the data worsens.
One point to note is that the values of $c$ for sample sizes n = 100 and n = 250 displayed
high variability. Therefore, evaluation of the optimal $c$ based on simply taking one subsample
from the data is not recommended.

These results suggest that for samples of n > 500 you can keep the subsample size $n_s$ constant
at, say, $n_s = 25$ or $n_s = 50$. The effect of this is to reduce the computational burden to an
$O(n)$ calculation versus the $O(n^2)$ calculation required by the direct approach.
The time savings gained by this approach can be quite dramatic. For example, for
$n_s = 25$ and n = 10,000, evaluating the CV function over a grid of G = 10 points takes
1.6 minutes on a 33 MHz 80386 computer equipped with a math co-processor (all programs
were written in ANSI C compiled using the Borland C++ compiler). The direct evaluation
of the CV function would take 1.07 hours.
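An end-to-end driver tying the earlier sketches together might look as follows; this is a
hypothetical usage example, assuming the `generate_example2', `shuffle_pairs', and
`cv_grid_efficient' sketches defined above.

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    #define N  10000
    #define NS 25

    int main(void)
    {
        static double x[N], y[N];
        double mean = 0.0, var = 0.0, sd_x, c;
        int i;

        generate_example2(x, y, N); /* simulate the data of equation (7) */
        shuffle_pairs(x, y, N);     /* step 1: remove potential order */

        for (i = 0; i < N; i++) mean += x[i];
        mean /= N;
        for (i = 0; i < N; i++) var += (x[i] - mean) * (x[i] - mean);
        sd_x = sqrt(var / (N - 1)); /* sample standard deviation of X */

        c = cv_grid_efficient(x, y, N, NS, sd_x, 100, 0.01);
        printf("c = %.2f, h = %.4f\n", c, c * sd_x * pow((double) N, -0.2));
        return 0;
    }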
A general strategy suggested by the above results is to consider subsamples of varying
sizes (for example, $n_s = 25, 50, 100$). If the resulting scaling factor is fairly constant across
these sizes, this suggests that the estimate is `reliable'. If it is not, consider a larger subsample.
A brief sketch of such a check follows.
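A minimal illustration of this check in C, again assuming the hypothetical
`cv_grid_efficient' sketch from Section 2.2:

    #include <stdio.h>

    /* Reliability check: compare the minimizing c across several subsample sizes. */
    void reliability_check(const double *x, const double *y, int n, double sd_x)
    {
        int sizes[3] = { 25, 50, 100 };
        int k;
        for (k = 0; k < 3; k++) {
            double c = cv_grid_efficient(x, y, n, sizes[k], sd_x, 100, 0.01);
            printf("n_s = %3d: c = %.2f\n", sizes[k], c);
        }
        /* If the printed c values are fairly constant, the estimate is `reliable';
           otherwise, repeat with a larger subsample size. */
    }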
Some kernel functions have CV functions which are not `smooth', and you may wish
to `explore' the CV function over a range of values of $c$. The approach put forth in this
paper is well suited to such exploration. For subsample sizes of $n_s = 25$ or more, the
resulting window width tends to lie quite close to that obtained by direct CV. Since the
reduction in calculation time can be very dramatic, this technique could be extremely useful
for exploratory purposes.
3. CONCLUSION.
This paper considers a cross-validation (CV) algorithm for efficient selection of window
widths in the context of nonparametric kernel estimation of an unknown conditional mean.

The `direct' CV approach involves repeated estimation of the CV function based on
the entire sample of data. This direct procedure is an $O(n^2)$ calculation. The approach
proposed in this paper involves breaking down the problem into a set of $O(n_s^2)$ calculations
where $n_s < n$. For large sample sizes, appropriate selection of the subsample size $n_s$ reduces
this to an $O(n)$ calculation.

Two examples are considered in this paper and some general conclusions are drawn based
on these and other data sets. First, the technique appears to work remarkably well provided
that the sample size is roughly greater than or equal to n = 500. Secondly, the subsample
size $n_s = 25$ appears to work quite well for a variety of situations. Increasing the subsample
size beyond $n_s = 100$ offers virtually no improvement when $n$ is fairly large (that is, 5,000
or more).

Finally, the technique can be used to quickly obtain cross-validated window widths without
resorting to Fast Fourier Transforms, and can be used for any kernel function whatsoever.
The proposed technique permits computationally efficient selection of cross-validated window
widths without the computational burden associated with the direct approach and does
so with virtually no loss in accuracy.
ACKNOWLEDGEMENTS
I would like to thank members of the Department of Economics at UCSD for providing
an exceptionally stimulating research environment and visitor program. I would also like to
thank Steve Marron for his helpful comments. This research is supported by the Natural
Sciences and Engineering Research Council of Canada.
BIBLIOGRAPHY
[1] W. Härdle. Applied Nonparametric Regression. Cambridge University Press, Cambridge, 1990.

[2] A. J. Izenman. Recent developments in nonparametric density estimation. Journal of
the American Statistical Association, 86(413):205-224, 1991.

[3] J. S. Marron. Automatic smoothing parameter selection: A survey. Empirical Economics, 1988.

[4] E. A. Nadaraya. On nonparametric estimates of density functions and regression curves.
Theory of Probability and its Applications, 10:186-190, 1965.

[5] B. W. Silverman. Algorithm AS 176: Kernel density estimation using the fast Fourier
transform. Applied Statistics, pages 166-172, 1982.

[6] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and
Hall, New York, 1986.

[7] G. S. Watson. Smooth regression analysis. Sankhyā, Series A, 26:359-372, 1964.