AN EFFICIENT CROSS-VALIDATION ALGORITHM FOR WINDOW WIDTH SELECTION FOR NONPARAMETRIC KERNEL REGRESSION

Jeff Racine
Visiting Assistant Professor
Department of Economics, 0508
9500 Gilman Drive
University of California, San Diego
La Jolla, CA 92093

Key Words and Phrases: kernel regression; window width selection; cross-validation; computational efficiency.

ABSTRACT

This paper presents an approach to cross-validated window width choice which greatly reduces computation time, which can be used regardless of the nature of the kernel function, and which avoids the use of the Fast Fourier Transform. This approach is developed for window width selection in the context of kernel estimation of an unknown conditional mean.

1. INTRODUCTION

Nonparametric density estimation procedures have provided many exciting new techniques for statistical analysis and data exploration (for an excellent survey of density estimation techniques, see Izenman [2]). However, it is well known that these techniques tend to be very computationally intensive. It is therefore extremely important to use computationally efficient algorithms to avoid excessively long calculation times.

Algorithms for nonparametric estimation can be decomposed into two parts: one for the calculation of the kernel and another for window width choice. Silverman [6] (page 88) states that "One important factor in reducing the computer time is the choice of a kernel that can be calculated very quickly." Having chosen a kernel that is efficient to compute, one must then choose the window widths.

There is no generally accepted method for choosing the window widths. Methods currently available include 'subjective choice' and automatic methods such as the 'plug-in', 'cross-validation' (CV), and 'penalizing function' approaches (see Marron [3] for an excellent survey). Härdle [1] (page 173) compared various automatic methods and found that "The best overall performance, though, showed GCV (generalized cross-validation)."
One problem with the CV approach is that the CV function has to be repeatedly calculated over a range of window widths, which in turn requires repeated evaluation of the kernel function. The Fast Fourier Transform (FFT) has been applied for both estimation and window width selection in the context of kernel estimation (see Silverman [5]). However, the FFT suffers from two drawbacks: loss in precision due to discretization, and inapplicability when the kernel function is defined to be zero over a range of its domain (which is the case for the optimal kernel, the Epanechnikov kernel).

This paper presents an approach to cross-validated window width choice which greatly reduces computation time, which can be used regardless of the kernel chosen, and which avoids the use of FFTs.

The paper proceeds in the following manner: Section 2.1 presents an overview of the problem of window width selection, Section 2.2 presents an efficient algorithm for cross-validated window width selection, Section 2.3 presents some examples, and Section 3 summarizes and concludes.

2. A COMPUTATIONALLY EFFICIENT APPROACH

A standard starting point for obtaining the window widths involves minimizing the approximate mean integrated square error (AMISE) of the kernel estimator. However, this yields a formula for the window widths which depends in general on both the kernel function and the unknown data generating process; that is, the resulting window width is not 'operational'. Below I consider the problem in more detail in the context of the estimation of a conditional mean. For the purposes of this paper I restrict attention to bivariate conditional means for notational simplicity; extension to the multivariate case follows directly.

2.1. AN OVERVIEW OF KERNEL REGRESSION

Consider two continuous random variables $(Y, X)$ with realizations $\{y_i, x_i\}$, $i = 1, \ldots, n$. The 'regression' function is given by

$$y_i = E(Y \mid x_i) + \epsilon_i = M(x_i) + \epsilon_i, \qquad i = 1, \ldots, n \tag{1}$$

where $M(x_i)$ denotes the conditional mean evaluated at the realization $x_i$. The disturbances $\epsilon_i$ are independent mean zero random variables satisfying $E[\epsilon_i] = 0$ and $E[\epsilon_i^2] = \sigma^2 < \infty$. The conditional mean is defined as

$$E(Y \mid x_i) = \int_{-\infty}^{\infty} y \, \frac{f(y, x_i)}{f(x_i)} \, dy \tag{2}$$

If nonparametric techniques are used to estimate $f(y, x)$ and $f(x)$ in equation (2), the resulting estimator $\hat{M}(x_i)$ is a nonparametric estimator of $M(x_i)$. The nonparametric kernel estimator of $M(x_i)$ is

$$\hat{M}(x_i) = \hat{E}(Y \mid x_i) = \int_{-\infty}^{\infty} y \, \frac{\hat{f}(y, x_i)}{\hat{f}(x_i)} \, dy = \sum_{j=1}^{n} y_j \left[ \frac{K\left(\frac{x_j - x_i}{h}\right)}{\sum_{l=1}^{n} K\left(\frac{x_l - x_i}{h}\right)} \right] = \sum_{j=1}^{n} y_j W_j(x_i) \tag{3}$$

where $K(\cdot)$ is the kernel function and $W_j(x_i)$ is a weight function which lies in the interval $[0, 1]$. This estimator was proposed by Nadaraya [4] and Watson [7].

The value of the window width $h$ which, based on asymptotic expansions, minimizes the AMISE of $\hat{M}(x_i)$ can be shown to be

$$h = c \, \sigma_x \, n^{-1/5} \tag{4}$$

where $c$ is a constant of proportionality and $\sigma_x$ is the standard deviation of $X$. The unknown 'scaling factor' $c$ depends in a non-trivial way on the kernel function $K(\cdot)$ and on the underlying data generating process. The plug-in methods and CV methods can be thought of as different approaches to obtaining the constant $c$.

Obtaining the constant $c$ via the CV approach involves defining the CV function and minimizing this function with respect to $c$. This is strictly a numerical problem since $c$ is data dependent. The benefits of viewing equation (4) as $\sigma_x n^{-1/5}$ scaled by an unknown constant will be clarified below. Define the CV function to be

$$S_c = \sum_{i=1}^{n} \left( y_i - \hat{M}(x_{-i}; c) \right)^2 \tag{5}$$

where $\hat{M}(x_{-i}; c)$ denotes the 'leave-one-out' estimator evaluated for a particular value of $c$. $\hat{M}(\cdot)$ is obtained by omitting the realization $(y_i, x_i)$ from the estimator of $M(\cdot)$ at the point $x_i$. The CV approach to window width selection selects that $c$ for which $S_c$ is minimized for a given sample of data.
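To make equations (3) to (5) concrete, the following is a minimal sketch of the direct CV function, not the author's ANSI C implementation. The function names (`gaussian_kernel`, `window_width`, `loo_estimate`, `cv_score`, `direct_cv`) are hypothetical, the Gaussian kernel is only an illustrative choice (any kernel function could be passed in), and the fallback used when no other observation receives weight is an assumption that the paper does not discuss.

```python
import math
import statistics


def gaussian_kernel(u):
    """Standard normal density; an illustrative choice of K, not prescribed by the paper."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def window_width(x, c):
    """h = c * sigma_x * n**(-1/5) as in equation (4), with sigma_x estimated from x."""
    return c * statistics.stdev(x) * len(x) ** (-0.2)


def loo_estimate(x, y, i, h, kernel=gaussian_kernel):
    """Leave-one-out kernel estimate of M(x_i): equation (3) with (y_i, x_i) omitted."""
    num = 0.0
    den = 0.0
    for j in range(len(x)):
        if j == i:
            continue
        w = kernel((x[j] - x[i]) / h)
        num += w * y[j]
        den += w
    # Assumption: fall back to the mean of y if no other point receives any weight.
    return num / den if den > 0.0 else statistics.fmean(y)


def cv_score(x, y, c, kernel=gaussian_kernel):
    """CV function S_c of equation (5); the nested loops are the O(n^2) cost."""
    h = window_width(x, c)
    return sum((y[i] - loo_estimate(x, y, i, h, kernel)) ** 2 for i in range(len(x)))


def direct_cv(x, y, grid, kernel=gaussian_kernel):
    """Grid search: return the value of c on the grid with the smallest S_c."""
    return min(grid, key=lambda c: cv_score(x, y, c, kernel))
```

Under these assumptions, a call such as `direct_cv(x, y, [0.01 * g for g in range(1, 101)])` evaluates `cv_score` once per grid point, so its cost grows as the grid size times $n^2$.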
This CV function $S_c$ must be evaluated for many different values of $c$, and each evaluation is of $O(n^2)$. That is, in general there are $n^2$ calculations required to estimate the CV function for a given value of $c$.

Consider, by way of example, a grid search method for obtaining the optimal $c$. Clearly there exist preferred numerical search algorithms such as polynomial interpolation, 'golden section' search and so on. Grid search is most likely the easiest means of demonstrating the algorithm proposed in this paper, and the suggested approach yields an algorithm of the same order regardless of the search method. If you wish to evaluate $S_c$ over a grid of values for $c$ and there are $G$ points on the grid, then there are $G n^2$ calculations required. For example, if there are 1,000 observations and you wish to evaluate $S_c$ over a grid containing 100 points, there are $100 \times 1{,}000^2 = 10^8$ calculations required.

2.2. THE PROPOSED ALGORITHM

When evaluating the CV function it is desirable to use all information contained in the sample. The problem is how to use all of this information while minimizing evaluation time.

The optimal window width can be viewed as $\sigma_x n^{-1/5}$ scaled by a constant. Including the term $\sigma_x$ effectively 'normalizes' the data, while including $n^{-1/5}$ ensures the proper rate of convergence of $h$. Therefore, all that is needed is to scale $\sigma_x n^{-1/5}$ by the appropriate constant while using all information contained in the sample.

Clearly, if you took a subsample from the data of a given size $n_s$ and used this to evaluate the CV function for a given value of $c$, this would be faster than using the entire sample. However, the subsample might not be representative of the entire sample, the resulting scaling factor will display a fair bit of variability from subsample to subsample, and this approach ignores information contained in the rest of the sample (the remaining $n - n_s$ observations).
Taking all mutually exclusive and exhaustive subsamples and evaluating the CV function for a given $c$ would address these problems and is much faster than using the direct approach based on the entire sample. This is the approach taken in this paper. As will be demonstrated, the gains in reduced calculation time can be quite dramatic.

Consider the improvements gained by taking this approach. Evaluating $S_c$ for a given value of $c$ directly involves $n^2$ calculations. Breaking the sample into $s$ $(= n/n_s)$ subsamples of size $n_s$ involves $n_s^2$ calculations for each subsample, for a total of $s \, n_s^2 = (n/n_s) \, n_s^2 = n \, n_s$ calculations. To evaluate $S_c$ over a grid of $G$ points therefore requires $G n n_s$ calculations, which is always less than the $G n^2$ required by the direct approach for any $n_s < n$. Consider a sample of 1,000 observations where, again, you wish to evaluate $S_c$ over a grid containing 100 points. If $n_s = 100$, the efficient approach requires $100 \times 1{,}000 \times 100 = 10^7$ calculations rather than $10^8$, so there are $9 \times 10^7$ calculations saved!

The proposed algorithm proceeds in the following manner:

1. Shuffle the sample of $n$ observations to remove potential order.
2. Divide the sample into $s$ mutually exclusive and exhaustive subsamples each of size $n_s$.
3. For a given value of $c$, $c_0$, evaluate $S_c$ for each subsample. Call these $s$ values $S_{c_0}^{s}$.
4. For the given value $c_0$ take the mean of these $s$ values. Call this $\bar{S}_{c_0} = s^{-1} \sum S_{c_0}^{s}$.
5. Repeat these steps for all values of $c_0$ over the entire grid.
6. The cross-validated window width is that obtained using the value of $c_0$ for which $\bar{S}_{c_0}$ is a minimum over the grid.

Clearly, there is no need for such an approach if one has a relatively small number of observations since the direct approach is not computationally burdensome. However, when the number of observations is large, the savings in computation time can be dramatic.

2.3. APPLICATIONS

In order to apply the above algorithm three issues must be addressed. The first issue involves appropriate subsample sizes $n_s$.
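The six steps listed in Section 2.2 can be sketched as follows. This is an illustration under stated assumptions rather than the author's implementation: the names `efficient_cv`, `cv_score` and `gaussian_kernel` are hypothetical, the Gaussian kernel is an arbitrary choice, each subsample uses $h = c \, \sigma_x \, n_s^{-1/5}$ with $\sigma_x$ estimated from that subsample (one reading of viewing equation (4) as a scaled constant), and any observations left over when $n_s$ does not divide $n$ are simply dropped.

```python
import math
import random
import statistics


def gaussian_kernel(u):
    """Standard normal density; an illustrative choice of K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def cv_score(x, y, c, kernel=gaussian_kernel):
    """Leave-one-out CV function S_c on (x, y), with h = c * sd(x) * len(x)**(-1/5)."""
    n = len(x)
    h = c * statistics.stdev(x) * n ** (-0.2)
    total = 0.0
    for i in range(n):
        num = den = 0.0
        for j in range(n):
            if j != i:
                w = kernel((x[j] - x[i]) / h)
                num += w * y[j]
                den += w
        # Assumption: fall back to the mean of y when no other point gets weight.
        fit = num / den if den > 0.0 else statistics.fmean(y)
        total += (y[i] - fit) ** 2
    return total


def efficient_cv(x, y, n_s, grid, seed=0, kernel=gaussian_kernel):
    """Return the grid value of c that minimizes the mean subsample CV function."""
    order = list(range(len(x)))
    random.Random(seed).shuffle(order)              # step 1: remove potential order
    s = len(order) // n_s                           # number of subsamples
    if s < 1 or n_s < 2:
        raise ValueError("need 2 <= n_s <= n")
    # Step 2: disjoint blocks of size n_s (assumption: leftovers are dropped).
    blocks = [order[k * n_s:(k + 1) * n_s] for k in range(s)]
    subsamples = [([x[i] for i in b], [y[i] for i in b]) for b in blocks]
    best_c, best_mean = None, math.inf
    for c0 in grid:                                 # step 5: loop over the grid
        scores = [cv_score(xs, ys, c0, kernel) for xs, ys in subsamples]  # step 3
        mean_score = sum(scores) / s                # step 4: mean over subsamples
        if mean_score < best_mean:                  # step 6: keep the minimizer
            best_c, best_mean = c0, mean_score
    return best_c
```

Under these assumptions the cross-validated window width for the full sample would be `efficient_cv(x, y, n_s, grid) * statistics.stdev(x) * len(x) ** (-0.2)`. Each grid point costs roughly $n \, n_s$ kernel evaluations instead of $n^2$, and setting `n_s = len(x)` reduces the sketch to direct CV.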
The second issue is concerned with the sample size $n$ above which this approach should or should not be used. The third issue deals with the effect of the above two concerns on the estimated model. I consider two simulated examples to try to address these issues.

Example 1. The first conditional mean function considered is

$$y_i = M(x_i) + \epsilon_i = 2.0 \sin(x_i) + \epsilon_i \tag{6}$$

where $x_i \sim U(-10, 10)$ and $\epsilon_i \sim N(0, 1)$.

Example 2. The second function considered is one found in Härdle [1]. The model is

$$y_i = M(x_i) + \epsilon_i = 1.0 - x_i + e^{-200.0 (x_i - 0.5)^2} + \epsilon_i \tag{7}$$

where $\epsilon_i \sim N(0, 0.25)$ and $X \sim U(0, 1)$.

To investigate the nature of this approach to window width selection, I consider varying the sample size $n$ and the subsample size $n_s$. The grid size $G$ was fixed at 100 points and the 'coarseness' of the grid was 0.01. The results are presented in the tables below.

TABLE I

                n_s =   10     25     50    100    250    500  1,000  2,500  5,000  10,000
n = 100               0.37   0.27   0.21   0.20
n = 250               0.33   0.20   0.19   0.20   0.20
n = 500               0.37   0.22   0.22   0.22   0.20   0.18
n = 1,000             0.31   0.22   0.20   0.19   0.18   0.17   0.18
n = 5,000             0.30   0.22   0.21   0.19   0.19   0.19   0.19   0.19   0.19
n = 10,000            0.31   0.21   0.19   0.19   0.19   0.19   0.19   0.19   0.19    0.19

TAB I: The model used was $y_i = 2.0 \sin(x_i) + \epsilon_i$. The values in the table are the window width constant $c_{opt}$ obtained by efficient CV. The rightmost entry in each row is that obtained from direct CV (or efficient CV when $n_s = n$).

TABLE II

                n_s =   10     25     50    100    250    500  1,000  2,500  5,000  10,000
n = 100               0.58   0.23   0.22   0.21
n = 250               0.57   0.33   0.27   0.29   0.23
n = 500               0.71   0.39   0.33   0.31   0.27   0.24
n = 1,000             0.61   0.40   0.33   0.30   0.26   0.24   0.26
n = 5,000             0.64   0.39   0.33   0.30   0.26   0.25   0.26   0.26   0.26
n = 10,000            0.66   0.37   0.31   0.29   0.28   0.26   0.25   0.24   0.26    0.26

TAB II: The model used was $y_i = 1.0 - x_i + e^{-200.0 (x_i - 0.5)^2} + \epsilon_i$. The values in the table are the window width constant $c_{opt}$ obtained by efficient CV.
The entries in the rightmost columns are those obtained from direct CV (or efficient CV when $n_s = n$).

Next we must examine the 'accuracy' of this approach versus the direct approach. That is, how do the variations in the scaling factor $c$ which arise due to changing the subsample size $n_s$ affect estimation? To assess this, the conditional means were estimated for example 2 for the range of scaling factors $c$ arising due to changes in $n_s$ for $n_s \geq 50$. The values for $n_s = 10$ were quite 'noisy', but this noise tended to disappear quite rapidly as $n_s$ increased to 25 or more. For subsample sizes of $n_s = 50$ and greater, the estimated conditional means were extremely close for the range of $c$ found when $50 \leq n_s \leq n$, while for $n_s \geq 100$ the estimated conditional means were virtually identical (these results were corroborated for a variety of data sets I had 'lying around'). The results for example 2 for the sample sizes of $n = 1{,}000$ and $n = 5{,}000$ are presented in figures 1 and 2 respectively. The variation for larger samples was so small as to be insignificant.

FIGURE I (plot not reproduced: $\hat{M}(x_i)$ on the vertical axis, from 0 to 1.6, against $X$ on the horizontal axis, from 0 to 1)

FIG I: The model used was $y_i = 1.0 - x_i + e^{-200.0 (x_i - 0.5)^2} + \epsilon_i$. The smooth line is the actual curve. The other two lines represent the variation in fits for the range of efficient CV constants $c$ for $50 \leq n_s \leq n$. The sample size was $n = 1{,}000$.

FIGURE II (plot not reproduced: $\hat{M}(x_i)$ on the vertical axis, from 0 to 1.6, against $X$ on the horizontal axis, from 0 to 1)

FIG II: The model used was $y_i = 1.0 - x_i + e^{-200.0 (x_i - 0.5)^2} + \epsilon_i$. The smooth line is the actual curve. The other two lines represent the variation in fits for the range of efficient CV constants $c$ for $50 \leq n_s \leq n$. The sample size was $n = 5{,}000$.

The following general conclusions are suggested by these two examples:

1. Subsamples less than $n_s = 25$ should be avoided regardless of the sample size $n$.
2. Subsamples of $n_s = 25$ or $n_s = 50$ are appropriate providing the sample size $n$ is 500 or more.
3. The variability in the resulting scaling factors for a given subsample size $n_s$ increases as the overall fit to the data worsens.

One point to note is that the values of $c$ for sample sizes $n = 100$ and $n = 250$ displayed high variability. Therefore, evaluation of the optimal $c$ based on simply taking one subsample from $n$ is not recommended.

These results suggest that for samples of $n > 500$ you can keep the subsample size $n_s$ constant at, say, $n_s = 25$ or $n_s = 50$. The effect of this is to reduce the computational burden to an $O(n)$ calculation versus the $O(n^2)$ calculation required by the direct approach. The time savings gained by this approach can be quite dramatic. For example, for $n_s = 25$ and $n = 10{,}000$, to evaluate the CV function for a grid of $G = 10$ points takes 1.6 minutes on a 33 MHz 80386 computer equipped with a math co-processor (all programs were written in ANSI C compiled using the Borland C++ compiler). The direct evaluation of the CV function would take 1.07 hours.

A general strategy suggested by the above results is to consider subsamples of varying sizes (for example, $n_s = 25, 50, 100$). If the resulting scaling factor is fairly constant, this suggests that the estimate is 'reliable'. If it is not, consider a larger subsample.

Some kernel functions have CV functions which are not 'smooth', and you may wish to 'explore' the CV function over a range of values of $c$. The approach put forth in this paper is well-suited to such exploration. For subsample sizes of $n_s = 25$ or more, the resulting window width tends to lie quite close to that obtained by direct CV. Since the reduction in calculation time can be very dramatic, this technique could be extremely useful for exploratory purposes.

3. CONCLUSION

This paper considers a cross-validation (CV) algorithm for efficient selection of window widths in the context of nonparametric kernel estimation of an unknown conditional mean.
The 'direct' CV approach involves repeated estimation of the CV function based on the entire sample of data. This direct procedure is an $O(n^2)$ calculation. The approach proposed in this paper involves breaking down the problem into a set of $O(n_s^2)$ calculations where $n_s < n$. For large sample sizes, appropriate selection of the subsample size $n_s$ reduces this to an $O(n)$ calculation.

Two examples are considered in this paper and some general conclusions are drawn based on these and other data sets. First, the technique appears to work remarkably well provided that the sample size $n$ is roughly 500 or more. Secondly, the subsample size $n_s = 25$ appears to work quite well for a variety of situations. Increasing the subsample size beyond $n_s = 100$ offers virtually no improvement when $n$ is fairly large (that is, 5,000 or more). Finally, the technique can be used to quickly obtain cross-validated window widths without resorting to Fast Fourier Transforms, and can be used for any kernel function whatsoever.

The proposed technique permits computationally efficient selection of cross-validated window widths without the computational burden associated with the direct approach and does so with virtually no loss in accuracy.

ACKNOWLEDGEMENTS

I would like to thank members of the Department of Economics at UCSD for providing an exceptionally stimulating research environment and visitor program. I would also like to thank Steve Marron for his helpful comments. This research is supported by the Natural Sciences and Engineering Research Council of Canada.

BIBLIOGRAPHY

[1] W. Härdle. Applied Nonparametric Regression. Cambridge University Press, New Rochelle, 1990.
[2] A. J. Izenman. Recent developments in nonparametric density estimation. Journal of the American Statistical Association, 86(413):205-224, 1991.
[3] J. S. Marron. Automatic smoothing parameter selection: A survey. Empirical Economics, 1988.
[4] E. A. Nadaraya. On nonparametric estimates of density functions and regression curves. Theory of Probability and Its Applications, 10:186-190, 1965.
[5] B. W. Silverman. Algorithm AS 176: Kernel density estimation using the fast Fourier transform. Applied Statistics, pages 166-172, 1982.
[6] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, New York, 1986.
[7] G. S. Watson. Smooth regression analysis. Sankhya, 26:15:175-184, 1964.