Computational aspects of nonparametric smoothing
with illustrations from the sm library

A.W. Bowman (a,*) and A. Azzalini (b)

(a) Department of Statistics, University of Glasgow, G12 8QQ Glasgow, UK
(b) Department of Statistical Sciences, University of Padua, Padua, Italy

* Corresponding author. E-mail address: [email protected] (A.W. Bowman).

Computational Statistics & Data Analysis 42 (2003) 545-560
www.elsevier.com/locate/csda

Received 1 January 2001; received in revised form 1 March 2002
Abstract
Smoothing techniques such as density estimation and nonparametric regression are widely
used in applied work and the basic estimation procedures can be implemented relatively easily
in standard statistical computing environments. However, computationally efficient procedures
quickly become necessary with large datasets, many evaluation points or more than one covariate.
Further computational issues arise in the use of smoothing techniques for inferential, rather than
simply descriptive, purposes. These issues are addressed in two ways: (i) by deriving efficient
matrix formulations of nonparametric smoothing methods and (ii) by describing further simple
modifications to these for the use of 'binned' data when sample sizes are large. The implications
for other graphical and inferential aspects of the estimators are also discussed. These issues are
dealt with in an algorithmic manner, to allow implementation in any programming environment,
but particularly those which are geared towards vector and matrix representations of data. Specific
examples of S-Plus code from the sm library of Bowman and Azzalini (Applied Smoothing
Techniques for Data Analysis: the Kernel Approach with S-Plus Illustrations, Oxford University
Press, Oxford, 1997) are given in an appendix as illustrations.
© 2002 Elsevier Science B.V. All rights reserved.
Keywords: Binning; Density estimation; Graphics; Kernel methods; Large datasets; Nonparametric regression;
Quadratic forms; sm library; Smoothing techniques; S-Plus
1. Introduction
Smoothing techniques such as density estimation and nonparametric regression have
become established tools in applied statistics. There is now a wide variety of texts
which describe these methods and a huge literature of research papers. Recent texts
include Green and Silverman (1994), Wand and Jones (1995), Fan and Gijbels (1996),
Simonoff (1996), Bowman and Azzalini (1997) and Schimek (2000a). A broader framework
for the case of regression, known as generalized additive models, is also described
by Hastie and Tibshirani (1990). Some statistical computing environments such as
S-Plus, Genstat and XLispStat include some general procedures for smoothing techniques,
and Simonoff (1996) and Schimek (2000b) give extensive references to more
specialist software. However, the simplest forms of nonparametric smoothing techniques
can be implemented relatively easily through elementary programming techniques.
Modern statistical computing environments are generally geared towards vector and
matrix representations of data. It is therefore a principal aim of this paper to provide simple
matrix formulations of smoothing techniques which allow efficient implementation in
this type of environment. A second aim of the paper is to address the computational
issues which arise when nonparametric methods are applied to large datasets. This issue
is also dealt with in the context of vector and matrix representations of the estimators.
Efficient matrix formulations of smoothing techniques are discussed in Section 2 of
the paper, for both density estimation and regression. The issue of binning and its implementation within a matrix framework is described in Section 3. Computational issues
also arise when techniques other than straightforward construction of the estimators
for descriptive purposes are required. Bowman and Azzalini (1997, Chapters 4 and 5)
describe a variety of testing procedures based on quadratic form calculations and the
implementation of these for binned data is described in Section 4. Some graphical issues
and other topics are also discussed in that section. Further discussion is given in Section 5.
The computational methods discussed in the paper have been implemented in the
sm library written to accompany the book by Bowman and Azzalini (1997). This
library contains an extensive collection of functions for different data structures and
statistical problems and version 2 now includes the binning techniques described in
the paper. Examples of S-Plus code from the core instructions of the two principal
functions, sm.density and sm.regression, are given in an appendix to illustrate the
ideas described in the paper. The principal emphasis is on the algebraic expressions
provided in the main body of the paper, to allow readers to implement the ideas in other
programming environments. In addition, they also provide specific guidance to those
who wish to amend the existing sm library functions or to implement the techniques
themselves in S-Plus.
2. Matrix formulations of smoothing techniques
2.1. Density estimation
The formula for computing a density estimate at the evaluation point z from a sample
y = (y_1, ..., y_n)^T of univariate data is

\[ \hat{f}(z) = \frac{1}{n} \sum_{j=1}^{n} w(z - y_j; h), \]
where the kernel function w(x; h) is itself a density function which is symmetric with
mean 0. The degree of smoothing applied to the data is controlled by the scale parameter h, which is conveniently parameterised as the standard deviation of the kernel
function. The detailed shape of the kernel function is relatively unimportant, and for
convenience a normal density function will subsequently be used in the illustrative
code in the appendix.
The calculation of the estimate may be expressed in vector form as

\[ \hat{f}(z) = \frac{1}{n} \, w^T 1_n, \]

where w denotes the vector with elements w(z - y_j; h) and 1_n denotes a vector of
length n which consists entirely of 1's. An estimate usually needs to be computed over
a vector of evaluation points (z_1, ..., z_k)^T. The entire calculation can then be expressed
in matrix form as

\[ \hat{f} = \frac{1}{n} W 1_n, \qquad (1) \]
where \hat{f} denotes the vector of estimated values and W denotes the matrix whose
(i, j)th element is w(z_i - y_j; h). There are some occasions when different weights
p = (p_1, ..., p_n)^T are associated with the observations; in the sequel it is assumed that
the weights are scaled to sum to 1. The estimate is then of the form

\[ \hat{f}(z_i) = \sum_{j=1}^{n} p_j \, w(z_i - y_j; h) \]

and the matrix formulation becomes

\[ \hat{f} = W p. \]

Further generality can be obtained by the use of a different smoothing parameter h_j in
the kernel function around each observation. This is easily accommodated by defining
the elements of W as w(z_i - y_j; h_j).
When the data are two-dimensional, {(y_{1j}, y_{2j}); j = 1, ..., n}, the simplest form of
the estimate at an evaluation point (z_1, z_2) is

\[ \hat{f}(z_1, z_2) = \frac{1}{n} \sum_{j=1}^{n} w(z_1 - y_{1j}; h_1) \, w(z_2 - y_{2j}; h_2). \]

This corresponds to the use of a kernel function which is a product of univariate
kernels. In matrix form, the vector of estimates at the evaluation points {(z_{1i}, z_{2i});
i = 1, ..., k} can still be expressed in the form (1), where the (i, j)th element of W is
now w(z_{1i} - y_{1j}; h_1) w(z_{2i} - y_{2j}; h_2). This can also be written as

\[ \hat{f} = \frac{1}{n} (W_1 * W_2) 1_n, \]
where the (i, j)th elements of W_1 and W_2 are w(z_{1i} - y_{1j}; h_1) and w(z_{2i} - y_{2j}; h_2)
respectively, and * denotes the Hadamard product, namely elementwise multiplication.
Where weights apply to the observations, this becomes

\[ \hat{f} = (W_1 * W_2) p. \]
With two-dimensional data, the evaluation points will often lie on a rectangular grid,
denoted by {(z_{1r}, z_{2s}); r, s = 1, ..., k}. For simplicity we will assume that the number of
grid points in each margin is identical; different numbers can be accommodated easily,
at the expense of a slightly more detailed notation. The form of the estimate at each
grid point,

\[ \hat{f}(z_{1r}, z_{2s}) = \frac{1}{n} \sum_{j=1}^{n} w(z_{1r} - y_{1j}; h_1) \, w(z_{2s} - y_{2j}; h_2), \]

allows the full set of estimates to be neatly computed in matrix form as

\[ \hat{f} = \frac{1}{n} W_1 W_2^T, \]

where the (i, j)th elements of W_1 and W_2 are now defined as w(z_{1i} - y_{1j}; h_1) and
w(z_{2i} - y_{2j}; h_2), respectively. This significantly reduces the dimensionality of the
constituent matrices. Where weights apply, these can be incorporated as

\[ \hat{f} = W_1 P W_2^T, \]

where P is a diagonal matrix with p as its diagonal vector.
2.2. Nonparametric regression
Smoothing techniques allow observed data {(x_j, y_j); j = 1, ..., n} to be used in
investigating a regression relationship through the model y = m(x) + ε, where m is
assumed only to be smooth. The concept of smooth local averaging leads immediately
to a 'local mean' approach to the construction of an estimate at an evaluation point
z, as

\[ \hat{m}(z) = \frac{\sum_{j=1}^{n} w(x_j - z; h) \, y_j}{\sum_{j=1}^{n} w(x_j - z; h)}, \]

as proposed by Nadaraya (1964) and Watson (1964). The concept of 'local linear'
estimation was proposed some time ago by Cleveland (1979) and others, and its
attractive properties were clearly argued by Fan and Gijbels (1992) and Fan (1993).
At a point of interest z, the estimate is computed as the intercept term \hat{\alpha} in the
least-squares problem

\[ \min_{\alpha, \beta} \sum_{j=1}^{n} \{ y_j - \alpha - \beta (x_j - z) \}^2 \, w(x_j - z; h). \]
Some simple algebra shows that

\[ \hat{m}(z) = \frac{1}{n} \sum_{j=1}^{n} \frac{\{ s_2(z; h) - s_1(z; h)(x_j - z) \} \, w(x_j - z; h) \, y_j}{s_2(z; h) \, s_0(z; h) - s_1(z; h)^2}, \]

where s_r(z; h) = \sum_j (x_j - z)^r w(x_j - z; h) / n.
In matrix notation, for estimation over a vector of evaluation points (z_1, ..., z_k)^T,
the starting point for both approaches is the matrix of basic kernel weights W whose
(i, j)th element is w(x_j - z_i; h). The 'smoothing matrix' S is defined by

\[ S = W / (W 1_n 1_n^T), \qquad (2) \]

where / denotes elementwise division of two matrices. The local mean estimator can
then be written in the simple form

\[ \hat{m} = S y, \]

where \hat{m} denotes the vector of fitted values at the evaluation points.

The matrix representation of the local linear case is a little more complex. The first
step is to define the matrix of differences D whose (i, j)th element is (x_j - z_i). The
k × n matrices S_0, S_1 and S_2 can then be defined as S_r = {(W * D^r) 1_n} 1_n^T, where
D^r denotes elementwise exponentiation. The vector of fitted values is again given by
\hat{m} = S y, where now

\[ S = (S_2 - D * S_1) * W / (S_2 * S_0 - S_1 * S_1). \qquad (3) \]
When weights apply to the observations, these can be viewed as attached to the
kernel functions w(x_j - z_i; h). For example, in the local linear case the least-squares
problem becomes

\[ \min_{\alpha, \beta} \sum_{j=1}^{n} \{ y_j - \alpha - \beta (x_j - z) \}^2 \, w(x_j - z; h) \, p_j. \]

The definitions of the smoothing matrices given in (2) and (3) therefore remain valid,
with W replaced by WP.
When there are two covariates, x_1 and x_2, the local mean estimator is defined by

\[ \hat{m}(z_1, z_2) = \frac{\sum_{j=1}^{n} w(x_{1j} - z_1; h_1) \, w(x_{2j} - z_2; h_2) \, y_j}{\sum_{j=1}^{n} w(x_{1j} - z_1; h_1) \, w(x_{2j} - z_2; h_2)} \]

and the local linear estimator as the value of \hat{\alpha} from the least-squares problem

\[ \min_{\alpha, \beta_1, \beta_2} \sum_{j=1}^{n} \{ y_j - \alpha - \beta_1 (x_{1j} - z_1) - \beta_2 (x_{2j} - z_2) \}^2 \, w(x_{1j} - z_1; h_1) \, w(x_{2j} - z_2; h_2). \qquad (4) \]

The basic kernel weight matrix can be written as W = W_1 * W_2, where the (i, j)th
elements of W_1 and W_2 are w(x_{1j} - z_{1i}; h_1) and w(x_{2j} - z_{2i}; h_2) respectively. The
smoothing matrix for the local mean estimator is then given again by (2). Furthermore,
when the evaluation points lie on a regular grid defined by {(z_{1r}, z_{2s}); r, s = 1, ..., k},
then a matrix
of fitted values can be computed as

\[ \hat{m} = (W_1 Y W_2^T) / (W_1 W_2^T), \]

where Y is a diagonal matrix with diagonal vector y, the division is elementwise, and
the (i, j)th elements of W_1 and W_2 are w(x_{1j} - z_{1i}; h_1) and w(x_{2j} - z_{2i}; h_2),
respectively. In the weighted case the corresponding expression for the numerator is
simply W_1 Y P W_2^T.
In the local linear case, for estimation at the point (z_1, z_2), let X_z denote the n × 3
matrix whose ith row is (1, (x_{1i} - z_1), (x_{2i} - z_2)), following the notation of Ruppert
and Wand (1994). If W_z denotes the diagonal matrix whose (i, i)th element is
w_i = w(x_{1i} - z_1; h_1) w(x_{2i} - z_2; h_2), then the solution of the weighted least-squares
problem (4) is

\[ (X_z^T W_z X_z)^{-1} X_z^T W_z y. \]

The local linear estimate is defined by the first element of this vector of length 3.
The elements of the 3 × 3 matrix A = (a_{ij}) = (X_z^T W_z X_z) are all of the form
\sum_i w_i (x_{1i} - z_1)^r (x_{2i} - z_2)^s, where r + s ≤ 2. To obtain the first element of the
least-squares solution we need only the first row of (X_z^T W_z X_z)^{-1}, denoted by
(b_1, b_2, b_3). By applying standard linear algebra results, reported for instance by Healy
(1986, Section 3.4), these can be written as

\[ b_1 = 1 \Big/ \Big[ a_{11} - \frac{1}{d} \{ (a_{12} a_{33} - a_{13} a_{23}) a_{12} + (a_{13} a_{22} - a_{12} a_{23}) a_{13} \} \Big], \]
\[ b_2 = \frac{b_1}{d} (a_{13} a_{23} - a_{12} a_{33}), \]
\[ b_3 = \frac{b_1}{d} (a_{12} a_{23} - a_{13} a_{22}), \qquad (5) \]

where d = a_{22} a_{33} - a_{23}^2. When the vector (b_1, b_2, b_3) is post-multiplied by X_z^T W_z
the result is a vector of length n whose ith element is {b_1 + b_2 (x_{1i} - z_1) + b_3 (x_{2i} - z_2)}.
The inner product of this vector with y produces the local linear estimate at (z_1, z_2).
This calculation can be vectorised easily.
A very important case is when the evaluation points lie on a rectangular grid. In
order to describe an efficient computational method, it is helpful first to observe that
X_z^T W_z y can be written as

\[ (c_1, c_2, c_3)^T = \Big( \sum_i y_i w_i, \; \sum_i y_i w_i (x_{1i} - z_1), \; \sum_i y_i w_i (x_{2i} - z_2) \Big)^T \]

and that the local linear estimate can then be represented as b_1 c_1 + b_2 c_2 + b_3 c_3. For
the rectangular grid {(z_{1r}, z_{2s}); r, s = 1, ..., k}, the difference matrices D_1, D_2 can be
defined to have (i, j)th elements (x_{1j} - z_{1i}) and (x_{2j} - z_{2i}), respectively. In this case,
the a_{ij} quantities can be computed in matrix form as

\[ a_{11} = W_1 W_2^T, \qquad a_{12} = (W_1 * D_1) W_2^T, \qquad a_{13} = W_1 (W_2 * D_2)^T, \]
\[ a_{22} = (W_1 * D_1^2) W_2^T, \qquad a_{23} = (W_1 * D_1)(W_2 * D_2)^T, \qquad a_{33} = W_1 (W_2 * D_2^2)^T, \]

where the basic kernel weight matrices W_1, W_2 are as defined just after (4). Matrices
d, b_1, b_2, b_3 can be computed using elementwise operations in the expressions (5), where
the 1 at the start of the expression for b_1 is replaced by 1_k 1_k^T. Finally, the matrix of
fitted values is computed as

\[ c_1 = W_1 (W_2 * Y)^T, \qquad c_2 = (W_1 * D_1)(W_2 * Y)^T, \qquad c_3 = W_1 (W_2 * Y * D_2)^T, \]
\[ \hat{m} = b_1 * c_1 + b_2 * c_2 + b_3 * c_3, \]

where Y here denotes the k × n matrix each of whose rows is y^T, so that the Hadamard
products are conformable, matching the construction used in the code of Appendix A.4.
In the case of two covariates, care has to be taken when plotting the estimate to
ensure that the smooth fitted surface is not plotted at points which extrapolate far
beyond the covariate values. In the one covariate case the plotting region can easily
be defined by the range of the data. In the two covariate case a simple analogue
is to plot within the convex hull of the covariate values. The computational method
described above produces a matrix of estimates over a rectangular grid. It remains to
mask this grid to match the convex hull of the covariate values. Denote the points on
the regular grid by (g_{1i}, g_{2i}) and the vertices of the convex hull of the covariate values
by (e_{1j}, e_{2j}), where the final vertex in the list coincides with the first. Let a denote the
vector difference between two consecutive vertices of the convex hull, e_{j+1} - e_j, and
b denote the vector difference g_i - e_j, for particular values of i and j. It is assumed
that the vertices of the convex hull are arranged in clockwise order. A grid point lies
within the convex hull if the angle between a and b is less than 180° for all values
of j defining the convex hull edges. The condition for this to happen is a_1 b_2 ≤ a_2 b_1.
In the original co-ordinates this can be expressed as g_{1i} a_{2j} - g_{2i} a_{1j} ≥ e_{1j} a_{2j} - e_{2j} a_{1j}.
In matrix notation, this can be written as

\[ G L F \geq J (E * L F), \qquad L = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}, \]

where the rows of the matrix G contain the co-ordinates of the grid points, the columns
of E contain the vertices of the convex hull, the columns of F contain the differences
between the vertices of the convex hull and J is a two-column matrix of 1's with
number of rows equal to the number of grid points. The inequality sign must be read
in an elementwise sense. Grid points which lie inside the convex hull are identified by
rows of the matrices on the left- and right-hand sides where the inequality holds for
all elements.
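Since the appendix gives no code for this masking step, the following sketch shows
one way of implementing it in S-Plus/R; the function hull.mask and its argument
names are hypothetical, and the built-in chull function is used to find the hull vertices.

   # A sketch of the convex hull masking described above; the function
   # name and arguments are illustrative, not part of the sm library.
   # Grid co-ordinates are in g1, g2 and covariate values in x1, x2.
   hull.mask <- function(g1, g2, x1, x2) {
      vert <- chull(x1, x2)        # indices of the hull vertices; apply
                                   # rev() if your implementation returns
                                   # them anti-clockwise, since the
                                   # inequality assumes a clockwise order
      vert <- c(vert, vert[1])     # close the polygon
      E    <- rbind(x1[vert], x2[vert])
      Fm   <- E[, -1] - E[, -ncol(E)]   # the matrix F: vertex differences
      E    <- E[, -ncol(E)]
      G    <- cbind(g1, g2)             # one grid point per row
      L    <- matrix(c(0, -1, 1, 0), ncol = 2)
      J    <- matrix(1, nrow(G), 2)
      lhs  <- G %*% L %*% Fm
      rhs  <- J %*% (E * (L %*% Fm))
      apply(lhs >= rhs, 1, all)    # TRUE for grid points inside the hull
   }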
3. Binning for large datasets
The matrix formulae described in Section 2 provide compact computational expressions
for nonparametric estimators. However, the size of many of the matrices is dependent
on the sample size n and this becomes inefficient as n increases. A number
of highly efficient smoothing algorithms are available, as discussed by Fan and Marron
(1994), for example. In particular, the concept of binning, where the raw data
are reduced to frequencies over a fine grid, is a practical solution. Härdle and Scott
(1992) and Wand and Jones (1995, Appendix D) provide good introductions to
different forms of this technique. Binning can be implemented with linear interpolation
and other refinements, and fast Fourier transform ideas can also be used to great effect,
as described by Silverman (1982) and Härdle (1987). However, if the number of grid
points is set to be large then the accuracy of the simplest form of binning can be
maintained at a high level without introducing complications into the computational
formulae.

When a fine grid is placed over the sample space, the original data can be recoded
as frequencies at grid locations and, for regression data, sample means and standard
deviations of the response variable. Bins containing no observations should be omitted
from the recoded list. The table below identifies the form of the recoded data explicitly
and establishes notation.
Estimate      Dimensionality   Original data           Recoded data

Density       1                {y_i}                   {ȳ_i, n_i}
              2                {y_1i, y_2i}            {ȳ_1i, ȳ_2i, n_i}

Regression    1                {x_i, y_i}              {x̄_i, ȳ_i, s_i, n_i}
              2                {x_1i, x_2i, y_i}       {x̄_1i, x̄_2i, ȳ_i, s_i, n_i}
The number of bins will be denoted by b. For density estimation, ȳ_i denotes bin
location. For regression, x̄_i denotes bin location while ȳ_i and s_i denote the mean and
standard deviation of the responses for the data in each bin. In all cases, n_i denotes
the bin frequencies. It is sometimes convenient to be able to refer to the individual
observations within each bin. A double subscript will be used so that, for example, y_ij
refers to observation j which lies in bin i.
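To make the recoding concrete, the following sketch bins one-covariate regression
data in S-Plus/R; the grid construction and all variable names are illustrative rather
than the sm library's internal code, and bin locations are taken here as the covariate
means within each bin.

   # A minimal sketch of simple binning for one-covariate regression data.
   nbins  <- 50                                 # number of bins; see Section 5
   breaks <- seq(min(x), max(x), length = nbins + 1)
   bin    <- cut(x, breaks, include.lowest = TRUE, labels = FALSE)
   xbar   <- tapply(x, bin, mean)               # bin locations
   ybar   <- tapply(y, bin, mean)               # bin means of the response
   ni     <- tapply(y, bin, length)             # bin frequencies
   s      <- sqrt(tapply(y, bin, var))          # bin standard deviations
   s[ni == 1] <- 0                              # variance undefined in singleton bins
   # Empty bins are omitted automatically, since tapply returns values
   # only for the bins which actually occur.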
For density estimation, it is immediately obvious from the definition of the estimators
that all that is required is to replicate the kernel function over each bin, according to
the bin frequencies, in order to mimic the form of the estimator from the original data.
All of the matrix formulae then apply with the elements of the weight vector defined
as p_i = n_i / \sum_i n_i. The definition of the local mean form of nonparametric regression
makes it clear that replication is sufficient to deal with this case too. From the discussion
of weights in Section 2.2, the smoothing matrix is then given by

\[ (W P) / (W P 1_b 1_b^T), \]

where P is the diagonal matrix with p as its diagonal vector. In the local linear case,
with a single covariate, the observations y_ij lying within bin i are all now regarded as
having the same covariate value x̄_i. The expansion

\[ \sum_{ij} \{ y_{ij} - \alpha - \beta(\bar{x}_i - z) \}^2 w(\bar{x}_i - z; h)
   = \sum_i n_i s_i^2 \, w(\bar{x}_i - z; h) + \sum_i n_i \{ \bar{y}_i - \alpha - \beta(\bar{x}_i - z) \}^2 w(\bar{x}_i - z; h) \]

shows that the estimation of α, and hence the construction of the nonparametric
regression curve, can be based solely on the bin means, locations and frequencies. The
same principle applies in the two covariate case. The matrix formulae discussed in
Section 2 then remain valid, with the basic kernel weight matrix W replaced by WP.
However, the dimensionality of the matrices involved in these manipulations is now
controlled by k and b rather than k and n, which is a very considerable reduction.
Clearly, the process of constructing the nonparametric estimates changes very little
in the presence of binning. However, the estimates themselves are often only the
starting points for further analyses. The effects of binning on computations related to
more inferential aspects of nonparametric models are explored in Section 4. In
preparation for that discussion, the influence of binning on the problem of estimating
the error variance in a nonparametric model is considered here. With a single covariate,
a simple nonparametric model is expressed as y = m(x) + ε, where the errors ε have
independent Normal distributions with constant variance σ². Although it is a relatively
straightforward matter to compute an estimate of σ² without binning, some inferential
uses of the estimates in the quadratic form methods to be described in Section 4
are impractical to implement without binning when the sample size is large. This is
because it is necessary to work with the specific matrix which defines the estimator
as a quadratic form, and this matrix is of size n × n in the unbinned case. Estimation
of σ² in a two covariate model is much more problematic, even without binning,
and there has so far been relatively little discussion of this problem in the research
literature.
A simple, but very useful, estimator of σ² was proposed by Rice (1984) as

\[ \hat{\sigma}^2 = \frac{1}{2(n-1)} \sum_{i=2}^{n} (y_i - y_{i-1})^2, \]

where it is assumed that the observations (x_i, y_i) have been ordered by x_i. The
behaviour of the estimator is made clear by considering

\[ E(y_i - y_{i-1})^2 = E\{ (y_i - m(x_i)) - (y_{i-1} - m(x_{i-1})) + (m(x_i) - m(x_{i-1})) \}^2
                     = 2\sigma^2 + \{ m(x_i) - m(x_{i-1}) \}^2. \]

The estimate \hat{\sigma}^2 is therefore inflated by a term which measures the variation in
the underlying function m, but the effect of this will rapidly diminish if the covariate
values spread more densely over the design space as the sample size increases.
If the means of binned data are used, the corresponding expansion is

\[ E(\bar{y}_i - \bar{y}_{i-1})^2 = \mathrm{var}(\bar{y}_i) + \mathrm{var}(\bar{y}_{i-1}) + \{ m(\bar{x}_i) - m(\bar{x}_{i-1}) \}^2
                                  = \Big( \frac{1}{n_i} + \frac{1}{n_{i-1}} \Big) \sigma^2 + \{ m(\bar{x}_i) - m(\bar{x}_{i-1}) \}^2. \]

An estimate of σ² can therefore be constructed as

\[ \hat{\sigma}_b^2 = \frac{1}{b-1} \sum_{i=2}^{b} \frac{(\bar{y}_i - \bar{y}_{i-1})^2}{(1/n_i + 1/n_{i-1})}. \]

However, information on σ² is also present in the bin sample variances. These two
sources of information can then be combined to form the estimate

\[ \hat{\sigma}^2 = \frac{(b-1)\,\hat{\sigma}_b^2 + \sum_{i=1}^{b} (n_i - 1) s_i^2}{n - 1}. \]

The denominator arises by combining the b - 1 'degrees of freedom' from the bin
means and the \sum_i (n_i - 1) degrees of freedom from the bin variances.
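A direct transcription of these formulae, assuming that xbar, ybar, s and ni hold the
recoded data of Section 3 in increasing order of xbar, might read as follows.

   # A sketch of the binned difference-based variance estimate above.
   b      <- length(ybar)
   n      <- sum(ni)
   d2     <- diff(ybar)^2 / (1/ni[-1] + 1/ni[-b])  # normalised squared differences
   sigb2  <- sum(d2) / (b - 1)
   sigma2 <- ((b - 1) * sigb2 + sum((ni - 1) * s^2)) / (n - 1)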
An alternative approach to estimating σ² was proposed by Gasser et al. (1986). This
involves the creation of 'pseudo-residuals'

\[ \tilde{\varepsilon}_i = \frac{x_{i+1} - x_i}{x_{i+1} - x_{i-1}} \, y_{i-1} + \frac{x_i - x_{i-1}}{x_{i+1} - x_{i-1}} \, y_{i+1} - y_i \]

as the difference between observation y_i and the linear interpolation from its two
neighbours. If the coefficients of y_{i-1} and y_{i+1} in this expression are denoted by a_i
and b_i, respectively, then the estimate of error variance is

\[ \hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=2}^{n-1} \frac{\tilde{\varepsilon}_i^2}{(a_i^2 + b_i^2 + 1)}, \]

where (a_i^2 + b_i^2 + 1) is the coefficient of σ² in var(\tilde{\varepsilon}_i). This second
difference approach should reduce the effect of the underlying function in inflating the
estimate. When this approach is applied to binned data the coefficients must be amended
to give

\[ \hat{\sigma}_b^2 = \frac{1}{b-2} \sum_{i=2}^{b-1} \frac{\tilde{\varepsilon}_i^2}{(a_i^2/n_{i-1} + b_i^2/n_{i+1} + 1/n_i)}. \]

This estimator can then be combined with the bin variances to produce

\[ \hat{\sigma}^2 = \frac{(b-2)\,\hat{\sigma}_b^2 + \sum_{i=1}^{b} (n_i - 1) s_i^2}{n - 2}. \]
4. Inferential issues in nonparametric regression
The standard error of a nonparametric regression curve at its evaluation points provides a very useful indication of the variability of the estimator. Since the vector of
estimated values \hat{m} can be expressed as Sȳ, the estimated variance is simply

\[ \widehat{\mathrm{var}}(\hat{m}) = \mathrm{diag}(S P^{-1} S^T) \, \hat{\sigma}^2, \]

where P is the diagonal matrix with the bin frequencies as its diagonal vector. This
refers to the case of binned data. The formula which applies in the absence of binning
is obtained by setting b = n and P = I.
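Given a smoothing matrix S constructed as in Section 2 from the binned data, together
with ni and sigma2 from the earlier sketches, this calculation is a one-line transcription:

   # Standard errors of the binned estimate mhat <- S %*% ybar.
   se <- sqrt(diag(S %*% diag(1/ni) %*% t(S)) * sigma2)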
Bowman and Young (1996) and Bowman and Azzalini (1997) discuss the use of
'reference bands' as a graphical display of the level of agreement between a nonparametric
curve estimate and a reference model of interest. Particular reference models
of interest include 'no effect', represented by a horizontal line, and a simple linear
regression. Since these are linear models they can be represented in the binned data
case as Hȳ, where H = X (X^T P X)^{-1} X^T P and X is the appropriate design matrix.
The widths of the reference bands are determined by the standard errors of \hat{m} - Hȳ
under the assumption that the linear model holds. This is easily derived from

\[ \widehat{\mathrm{var}}(\hat{m} - H\bar{y}) = \mathrm{diag}\{ (S - H) P^{-1} (S - H)^T \} \, \hat{\sigma}^2. \]

A variation applies in the 'no effect' case when the response data are in fact residuals
from a linear model with projection matrix H_1 and interest lies in detecting patterns
with respect to a further explanatory variable not yet included in the model. In the
unbinned case, the estimated variance matrix is then

\[ \widehat{\mathrm{var}}(\hat{m} - Hy) = (S - H)(I - H_1)(S - H)^T \, \hat{\sigma}^2. \]

In the binned case it becomes rather complex to compute the appropriate projection
matrix because of the need to track the design matrix entries of the binned residuals.
It is also possible to carry out formal tests which compare parametric and
nonparametric models in a global manner, or which compare two nonparametric models.
As an example, consider a comparison of a linear model with a nonparametric regression,
when there are one or two covariates. A general treatment can be given if the comparison
is formulated in terms of residuals from the linear model. The null hypothesis
is that these residuals have mean 0 while the alternative proposes the mean to be a
smooth function of the covariate(s), as a consequence of nonlinearity. By analogy with
an F-test for comparing linear models, a natural test statistic is

\[ F = \frac{\mathrm{RSS}_0 - \mathrm{RSS}_1}{\mathrm{RSS}_1}, \]

where in this case RSS_0 and RSS_1 denote the sums-of-squares of the linear model
residuals and the smoothed residuals, respectively. The residual sums-of-squares are
quadratic forms and a p-value for the test can be calculated numerically, as described
in Azzalini and Bowman (1993) and Bowman and Azzalini (1997, Chapter 5).
With binned data, it is convenient to adopt the notation e_ij = y_ij - ŷ(x̄_i), where ŷ(x̄_i)
denotes the fitted value from the linear model, and also to define ē_i = ȳ_i - ŷ(x̄_i). We
then have e_ij = (y_ij - ȳ_i) + ē_i and in matrix notation the smoothed residuals are given
by \hat{m} = S ē. The residual sums-of-squares can then be written as

\[ \mathrm{RSS}_0 = \sum_{ij} e_{ij}^2 = \sum_i (n_i - 1) s_i^2 + \sum_i n_i \bar{e}_i^2
                 = \sum_i (n_i - 1) s_i^2 + \bar{e}^T P \bar{e}, \]

\[ \mathrm{RSS}_1 = \sum_{ij} (e_{ij} - \hat{m}_i)^2 = \sum_i (n_i - 1) s_i^2 + \sum_i n_i (\bar{e}_i - \hat{m}_i)^2
                 = \sum_i (n_i - 1) s_i^2 + \bar{e}^T (I - S)^T P (I - S) \bar{e}. \]
The p-value for the test is calculated as

\[ p = P\left( \frac{\bar{e}^T P \bar{e} - \bar{e}^T A \bar{e}}{\bar{e}^T A \bar{e} + \sum_i (n_i - 1) s_i^2} > F_{\mathrm{obs}} \right), \]

where, for convenience, A denotes the matrix (I - S)^T P (I - S). The calculation can
be reformulated as

\[ p = P\Big( \bar{e}^T P \bar{e} - (1 + F_{\mathrm{obs}}) \bar{e}^T A \bar{e} - F_{\mathrm{obs}} \sum_i (n_i - 1) s_i^2 > 0 \Big)
     = P\Big( \bar{e}^T B \bar{e} - F_{\mathrm{obs}} \sum_i (n_i - 1) s_i^2 > 0 \Big), \]

where B denotes the matrix P - (1 + F_{\mathrm{obs}}) A.
Under the assumption that the data y_ij, and hence the residuals e_ij, follow a Normal
distribution, the distributional calculations on ē^T B ē follow from well-known results
on quadratic forms. Exact calculations can be performed using the eigenvalues of B,
while Kuonen (1999) describes a saddlepoint approximation approach. Johnson and
Kotz (1972) give the details of how the cumulants, and hence the moments, of ē^T B ē
may be calculated and this leads to simple but accurate distributional approximations.
The error variance σ² may be assumed to be 1 because its effect is scaled out in the
ratio form of the test statistic. The covariance matrix of ē is also required. This is easily
seen to be (I - X(X^T P X)^{-1} X^T P) P^{-1}, where X is the design matrix for the linear
model. The additional term F_obs \sum_i (n_i - 1) s_i^2 introduced by binning is independent
of ē, and \sum_i (n_i - 1) s_i^2 follows a \chi^2_{n-b} distribution. The first three cumulants
of this additional component are therefore F_obs (n - b), F_obs^2 2(n - b) and
F_obs^3 8(n - b). These cumulants can be combined with those of the quadratic form,
with the signs of the odd-order cumulants reversed to reflect the subtraction, to obtain
the moments of the whole expression. The probability calculation can then be completed
by matching moments with a shifted and scaled \chi^2 distribution, with the shift, scale
and degrees of freedom chosen to match the first three moments.
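To illustrate the moment-matching step, the following sketch assumes that the matrix
B = P - (1 + F_obs)A and the covariance matrix V of ē have been formed as above,
and that n, b, ni, s and Fobs are available; the cumulant formulae are the standard
ones for quadratic forms in Normal variables, and this is not the sm library's exact code.

   # Approximate p-value for ebar' B ebar - Fobs * sum((ni-1)*s^2) > 0.
   # Cumulants of ebar' B ebar, with ebar ~ N(0, V), are
   # kappa_r = 2^(r-1) (r-1)! tr((BV)^r); the chi-squared term enters
   # with alternating signs because it is subtracted.
   M  <- B %*% V
   k1 <- sum(diag(M)) - Fobs * (n - b)
   k2 <- 2 * sum(diag(M %*% M)) + Fobs^2 * 2 * (n - b)
   k3 <- 8 * sum(diag(M %*% M %*% M)) - Fobs^3 * 8 * (n - b)
   # Match a shifted and scaled chi-squared, a * chisq_df + c, whose
   # first three cumulants are a*df + c, 2*a^2*df and 8*a^3*df.
   a  <- k3 / (4 * k2)
   df <- k2 / (2 * a^2)
   cc <- k1 - a * df
   p  <- 1 - pchisq(-cc/a, df)   # P(expression > 0), assuming a > 0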
Other inferential problems, such as the comparison of two or more nonparametric
regression curves as described by Young and Bowman (1995), involve estimates of
the error variance σ² in the appropriate test statistics, using the expressions provided at
the end of Section 3. Similar quadratic form calculations, with appropriate adjustments
for binning, then also apply.
5. Discussion
The methods described in this paper provide efficient vector–matrix implementations
of standard smoothing procedures. There are, of course, many other types of smoothing
procedure in common use. Local logistic regression for binary data (Fan et al., 1995)
and quantile smoothing for survival data (Bowman and Wright, 2000) are particularly
important cases. Vector–matrix implementations of the algorithms for these cases follow
similar principles to those described above but the details have not been included
here.
An important issue in the application of smoothing techniques is the selection of appropriate values for the smoothing parameters. A wide variety of methods are described
in the literature but the implementations of these are rather specific to each procedure.
This topic has therefore not been addressed in the general principles discussed in the
paper.
Where binning is used to handle large datasets a decision is necessary on the number
of bins to be used. We have found that setting the number of bins to 8 log(n)/p, where
p denotes the number of dimensions (1 or 2), provides an effective empirical rule,
applied only when the sample size n is larger than 500. In the two-dimensional case,
the value produced by this rule refers to the number of bins in each dimension.
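In code, this rule might be expressed as follows; this is a sketch of the suggested
default, not necessarily the exact handling used in the sm library.

   # Empirical rule for the number of bins in each dimension.
   nbins <- if (n > 500) round(8 * log(n)/p) else 0   # 0 indicates no binning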
Software
The sm library is available from the following Web sites:
http://www.stats.gla.ac.uk/~adrian/sm and
http://www.stat.unipd.it/~azzalini/Book_sm/
Version 2 of the library includes the binning procedures discussed in the paper.
Appendix A. Implementations in S-Plus
The code listed here gives an indication of how the formulae described in the paper
can be implemented in S-Plus. The sm library for S-Plus is based on the same
principles but contains more complex code to allow greater generality.
A.1. Density estimation in one dimension
The following code constructs a density estimate est from the vector of data y
at the vector of evaluation points z, using a smoothing parameter h and a vector of
observation weights p whose elements sum to 1.
n   <- length(y)
k   <- length(z)
W   <- matrix(rep(z, n), ncol = n)
W   <- W - matrix(rep(y, k), ncol = n, byrow = T)
W   <- exp(-0.5 * (W/h)^2) / (sqrt(2*pi) * h)
est <- W %*% p
A.2. Density estimation in two dimensions
The following code constructs a density estimate est from the vectors of data y1
and y2 at the evaluation points z1, z2, using the vector h of length 2 as smoothing
parameter and a vector of observation weights p whose elements sum to 1. Two
forms of construction are provided, depending on whether the evaluation points are
provided in arbitrary point form or as short vectors defining the margins of a regular
grid.
n   <- length(y1)
k   <- length(z1)
W1  <- matrix(rep(z1, n), ncol = n)
W1  <- W1 - matrix(rep(y1, k), ncol = n, byrow = T)
W1  <- exp(-0.5 * (W1/h[1])^2) / (sqrt(2*pi) * h[1])
W2  <- matrix(rep(z2, n), ncol = n)
W2  <- W2 - matrix(rep(y2, k), ncol = n, byrow = T)
W2  <- exp(-0.5 * (W2/h[2])^2) / (sqrt(2*pi) * h[2])
est <- (W1 * W2) %*% p
If the evaluation points form a regular grid, defined by the marginal values in z1,
z2, each of length k, then the code above remains valid but the matrix of fitted values
is computed as
est <- W1 %*% diag(p) %*% t(W2)
A.3. Nonparametric regression in one dimension
The following code constructs a nonparametric regression estimate est from the
vectors of data x and y at the evaluation points z, using a smoothing parameter h and
a vector of observation weights p.
n  <- length(x)
k  <- length(z)
Wd <- matrix(rep(z, n), ncol = n)
Wd <- Wd - matrix(rep(x, k), ncol = n, byrow = T)
W  <- exp(-0.5 * (Wd/h)^2)
W  <- W * matrix(rep(p, k), ncol = n, byrow = T)
For local mean estimation, the smoothing matrix is computed as
den <- W %*% rep(1, n)
S <- W / matrix(rep(den,n), ncol = n)
For local linear estimation, the smoothing matrix is computed as
S0 <- W %*% rep(1, n)
S1 <- (W * Wd) %*% rep(1, n)
S2 <- (W * Wd^2) %*% rep(1, n)
W  <- W * (matrix(rep(S2, n), ncol = n) - Wd * matrix(rep(S1, n), ncol = n))
S  <- W / (matrix(rep(S2, n), ncol = n) * matrix(rep(S0, n), ncol = n)
           - matrix(rep(S1, n), ncol = n)^2)
In both cases, the vector of estimates is computed as
est <- as.vector (S %*% y)
A.4. Nonparametric regression in two dimensions
The following code constructs a nonparametric regression estimate est from the
vectors of data x1, x2 and y at the evaluation points z1, z2 defining the margins of
a regular grid, using the vector h of length 2 to define the smoothing parameters.
n   <- length(x1)
k   <- length(z1)
Wd1 <- matrix(rep(z1, n), ncol = n)
Wd1 <- Wd1 - matrix(rep(x1, k), ncol = n, byrow = T)
W1  <- exp(-0.5 * (Wd1/h[1])^2)
Wd2 <- matrix(rep(z2, n), ncol = n)
Wd2 <- Wd2 - matrix(rep(x2, k), ncol = n, byrow = T)
W2  <- exp(-0.5 * (Wd2/h[2])^2)
Wy  <- matrix(rep(y, k), ncol = n, byrow = T)
For local mean estimation, the matrix of estimates is computed as
est <- (W1 %*% t(W2*Wy)) / (W1 %*% t(W2))
For local linear estimation, the matrix of estimates is computed as
a11 <- W1 %*% t(W2)
a12 <- (W1*Wd1) %*% t(W2)
a13 <- W1 %*% t(W2*Wd2)
a22 <- (W1*Wd1^2) %*% t(W2)
a23 <- (W1*Wd1) %*% t(W2*Wd2)
a33 <- W1 %*% t(W2*Wd2^2)
d   <- a22*a33 - a23^2
b1  <- 1/(a11 - ((a12*a33 - a13*a23)*a12 + (a13*a22 - a12*a23)*a13)/d)
b2  <- ((a13*a23 - a12*a33)*b1)/d
b3  <- ((a12*a23 - a13*a22)*b1)/d
c1  <- W1 %*% t(W2*Wy)
c2  <- (W1*Wd1) %*% t(W2*Wy)
c3  <- W1 %*% t(W2*Wy*Wd2)
est <- b1*c1 + b2*c2 + b3*c3
References
Azzalini, A., Bowman, A.W., 1993. On the use of nonparametric regression for checking linear relationships.
J. Roy. Statist. Soc. Ser. B 55, 549–557.
Bowman, A.W., Azzalini, A., 1997. Applied Smoothing Techniques for Data Analysis: the Kernel Approach
With S-Plus Illustrations. Oxford University Press, Oxford.
Bowman, A.W., Wright, E., 2000. Graphical exploration of covariate effects on survival data through
nonparametric quantile curves. Biometrics 56, 563–570.
Bowman, A.W., Young, S.G., 1996. Graphical comparison of nonparametric curves. Appl. Statist. 45, 83–98.
Cleveland, W.S., 1979. Robust locally weighted regression and smoothing scatterplots. J. Amer. Statist.
Assoc. 74, 829–836.
Fan, J., 1993. Local linear regression smoothers and their minimax efficiencies. Ann. Statist. 21, 196–216.
Fan, J., Gijbels, I., 1992. Variable bandwidth and local linear regression smoothers. Ann. Statist. 20, 2008–
2036.
Fan, J., Gijbels, I., 1996. Local Polynomial Modelling and Its Applications. Chapman & Hall, London.
Fan, J., Marron, J.S., 1994. Fast implementations of nonparametric curve estimators. J. Comput. Graph.
Statist. 3, 35–56.
Fan, J., Heckman, N.E., Wand, M.P., 1995. Local polynomial kernel regression for generalized linear models
and quasi-likelihood functions. J. Amer. Statist. Assoc. 90, 141–150.
Gasser, T., Sroka, L., Jennen-Steinmetz, C., 1986. Residual variance and residual pattern in nonlinear
regression. Biometrika 73, 625–633.
Green, P.J., Silverman, B.W., 1994. Nonparametric Regression and Generalized Linear Models: A Roughness
Penalty Approach. Chapman & Hall, London.
Härdle, W., 1987. Algorithm AS 222: resistant smoothing using the fast Fourier transform. Appl. Statist. 36,
104–111.
Härdle, W., Scott, D.W., 1992. Smoothing by weighted average of rounded points. Comput. Statist. 7, 97–
128.
Hastie, T.J., Tibshirani, R.J., 1990. Generalized Additive Models. Chapman & Hall, London.
Healy, M.J.R., 1986. Matrices for Statistics. Oxford University Press, Oxford.
Johnson, N.L., Kotz, S., 1972. Distributions in Statistics: Continuous Univariate Distributions, Vol. II. Wiley,
New York.
Kuonen, D., 1999. Saddlepoint approximations for distributions of quadratic forms in normal variables.
Biometrika 86, 929–935.
Nadaraya, E.A., 1964. On estimating regression. Theory Probab. Appl. 10, 186–190.
Rice, J., 1984. Bandwidth choice for nonparametric regression. Ann. Statist. 12, 1215–1230.
Ruppert, D., Wand, M.P., 1994. Multivariate locally weighted least squares regression. Ann. Statist. 22,
1346–1370.
Schimek, M.G. (Ed.), 2000a. Smoothing and Regression. Approaches, Computation and Application. Wiley,
New York.
Silverman, B.W., 1982. Algorithm AS 176: kernel density estimation using the fast Fourier transform. Appl.
Statist. 31, 93–99.
Simonoff, J.S., 1996. Smoothing Methods in Statistics. Springer, New York.
Wand, M.P., Jones, M.C., 1995. Kernel Smoothing. Chapman & Hall, London.
Watson, G.S., 1964. Smooth regression analysis. Sankhyā Ser. A 26, 359–372.
Young, S.G., Bowman, A.W., 1995. Non-parametric analysis of covariance. Biometrics 51, 920–931.