Proc. R. Soc. A (2008) 464, 3175–3192
doi:10.1098/rspa.2008.0235
Published online 12 August 2008
A law of large numbers for nearest
neighbour statistics
By Dafydd Evans*

School of Computer Science, Cardiff University, 5 The Parade, Cardiff CF24 3AA, UK

*[email protected]

Received 5 June 2008; accepted 11 July 2008. This journal is © 2008 The Royal Society.
In practical data analysis, methods based on proximity (near-neighbour) relationships between sample points are important because these relations can be computed in time $O(n \log n)$ as the number of points $n \to \infty$. Associated with such methods is a class of random variables defined to be functions of a given point and its nearest neighbours in the sample. If the sample points are independent and identically distributed, the associated random variables will also be identically distributed but not independent. Despite this, we show that random variables of this type satisfy a strong law of large numbers, in the sense that their sample means converge to their expected values almost surely as the number of sample points $n \to \infty$.
Keywords: nearest neighbours; geometric probability; difference-based methods;
noise estimation
1. Introduction
Let $X = X_1, X_2, \ldots$ be a sequence of independent and identically distributed random vectors $X_i \in \mathbb{R}^d$, and let $X^n = (X_1, \ldots, X_n)$ denote the first $n$ points of the sequence. The common distribution of the $X_i$ will be called the sampling distribution and $X^n$ will be called a sample (of size $n$). Let $n_i(n,k)$ denote the index of the $k$-nearest neighbour of $X_i$ among the points of $X^n = (X_1, \ldots, X_n)$, where the distance between any two points is defined with respect to the metric induced by any $l_p$-norm on $\mathbb{R}^d$,
\[ \|(a_1, \ldots, a_d)\|_p = \begin{cases} \left( \sum_{j=1}^{d} |a_j|^p \right)^{1/p}, & 1 \le p < \infty, \\[4pt] \max_{j=1,\ldots,d} |a_j|, & p = \infty, \end{cases} \]
and equidistant points are ordered by their indices. Let $h : \mathbb{R}^{d\times(k+1)} \to \mathbb{R}$ be a measurable function. We define the identically distributed random variables
\[ h_{i,n}(X) = h\big( X_i, X_{n_i(n,1)}, \ldots, X_{n_i(n,k)} \big) \qquad (i = 1, \ldots, n), \tag{1.1} \]
so that $h_{i,n}(X)$ is a function of the sample point $X_i$ and its $k$ nearest neighbours among the points of $X^n$. Let $\mu_n$, $\sigma_n^2$ and $\kappa_n$, respectively, denote the mean, variance and kurtosis of the (identically distributed) random variables $h_{i,n}(X)$,
*[email protected]
Received 5 June 2008
Accepted 11 July 2008
3175
This journal is q 2008 The Royal Society
Downloaded from http://rspa.royalsocietypublishing.org/ on July 28, 2017
3176
D. Evans
and let $S_n$ and $H_n$, respectively, denote their sum and sample mean,
\[ S_n(X) = \sum_{i=1}^{n} h_{i,n}(X) \quad\text{and}\quad H_n(X) = \frac{1}{n} \sum_{i=1}^{n} h_{i,n}(X). \tag{1.2} \]
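As a concrete illustration (my addition, not part of the original paper), the following minimal Python sketch builds the statistics $h_{i,n}(X)$ and their sample mean $H_n$ for an arbitrary user-supplied function $h$, using a k-d tree for the $O(n \log n)$ neighbour queries mentioned in the abstract. The uniform sampling distribution and the particular choice of $h$ are assumptions made purely for the example.

```python
import numpy as np
from scipy.spatial import cKDTree

def sample_mean_nn_statistic(X, h, k=1):
    """H_n: the sample mean of h(X_i, k nearest neighbours of X_i)."""
    tree = cKDTree(X)
    # each point is returned as its own 0th neighbour, so query k+1 points
    _, idx = tree.query(X, k=k + 1)
    values = [h(X[i], X[idx[i, 1:]]) for i in range(len(X))]
    return np.mean(values)

# Example: h is the distance from a point to its first nearest neighbour.
rng = np.random.default_rng(0)
X = rng.random((1000, 2))  # i.i.d. uniform sample in the unit square
H_n = sample_mean_nn_statistic(X, lambda x, nbrs: np.linalg.norm(x - nbrs[0]))
print(H_n)  # roughly 0.5 / sqrt(1000) for uniform data (see section 2a)
```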
Our main result is that $H_n$ is a strongly consistent estimator for $\mu_n$ as $n \to \infty$.
Theorem 1.1. If the sampling distribution is continuous and if the kurtosis $\kappa_n$ is uniformly bounded for all $n \in \mathbb{N}$, then
\[ \frac{H_n - \mu_n}{\sigma_n} \to 0 \quad \text{a.s. as } n \to \infty. \tag{1.3} \]
Theorem 1.1 shows that the rate at which $H_n$ converges to $\mu_n$ is at least of the order of $O(\sigma_n)$ as $n \to \infty$. Let us define the normalized random variables
\[ \bar h_{i,n}(X) = \frac{h_{i,n}(X) - \mu_n}{\sigma_n}, \tag{1.4} \]
along with their sum $\bar S_n$ and sample mean $\bar H_n$,
\[ \bar S_n(X) = \sum_{i=1}^{n} \bar h_{i,n}(X) \quad\text{and}\quad \bar H_n(X) = \frac{1}{n} \sum_{i=1}^{n} \bar h_{i,n}(X). \tag{1.5} \]
If the sampling distribution is continuous, we show that the variance of $\bar S_n$ is of the asymptotic order $O(n)$ as $n \to \infty$.
Theorem 1.2. If the sampling distribution is continuous, then for all $n \ge 16k$,
\[ \operatorname{var}(\bar S_n) \le 2(n+1)(3 + 8k\beta)(3 + 64k)(\kappa_n + \kappa_{n+1}), \tag{1.6} \]
where $\beta(k,d,p) = k\lfloor d\,v_{d,p} \rfloor$ and $v_{d,p}$ is the volume of the unit ball in $(\mathbb{R}^d, \|\cdot\|_p)$.
To show that $|\bar H_n| \to 0$ a.s. as $n \to \infty$, we first show that the expected squared difference $E(\bar S_n - \bar S_{n+1})^2$ between successive terms of the sequence $\bar S_n$ is bounded independently of $n$. We then use the Efron–Stein inequality (Efron & Stein 1981; Steele 1986) to bound the variance $\operatorname{var}(\bar S_n)$, which by Chebyshev's inequality yields the weak convergence of $\bar H_n$ to zero as $n \to \infty$. Finally, we again use the bound on $E(\bar S_n - \bar S_{n+1})^2$ to prove strong convergence using a standard argument.
Similar results have previously appeared in the literature. In particular, Penrose & Yukich (2003) obtained a weak law of large numbers for functions of binomial point processes in $\mathbb{R}^d$, then applied the result to a number of different types of random proximity graph and to a number of functions, including the total edge length (under arbitrary weighting) and the number of components in the graph. Wade (2007) applied the general result of Penrose & Yukich (2003) to the total (power-weighted) edge length of several types of nearest neighbour graph on random point sets in $\mathbb{R}^d$, and gave explicit expressions for the limiting constants. In two recent papers, Penrose (2007) obtained a law of large numbers for a class of marked point processes and gave some applications to noise estimation, while Penrose & Wade (2008) have investigated the asymptotic properties of the random online nearest neighbour graph.
The methods of Penrose & Yukich (2003) depend on the notion of stabilization, which demands that the neighbourhood of a particular vertex is unaffected by changes in the neighbourhood of another vertex located outside some sufficiently large ball (the radius of this ball is called the radius of stabilization). Because we consider only $k$-nearest neighbour graphs, we will instead rely on the standard geometric fact that there exists a finite number $\beta = \beta(k,d,p)$ such that for any countable set of points in $(\mathbb{R}^d, \|\cdot\|_p)$, any point of the set can be among the first $k$ nearest neighbours of at most $\beta$ other points of the set.
Motivated by the need for a multidimensional goodness-of-fit test, Bickel & Breiman (1983) investigated the asymptotic properties of sums of bounded functions of the (first) nearest neighbour distances, and proved a central limit theorem for the random variables $f(X_i)\,\|X_i - X_{n_i(n,1)}\|$, where $f$ is the (unknown) sampling density. This was extended to $k$-nearest neighbour distances by Penrose (2000), with $k$ being allowed to increase as a fractional power of the number of points $n$.
Functions defined in terms of proximity relations exhibit 'local' dependence, which can often be represented by dependency graphs, first described by Petrovskaya & Leontovitch (1982). For a set of random variables $\{X_i : i \in V\}$, the graph $G(V,E)$ is said to be a dependency graph for the set if for any pair of disjoint sets $A_1, A_2 \subseteq V$ such that no edge has one endpoint in $A_1$ and the other in $A_2$, the $\sigma$-fields $\sigma\{X_i : i \in A_1\}$ and $\sigma\{X_i : i \in A_2\}$ are mutually independent. The following result is due to Baldi & Rinott (1989).
Theorem 1.3 (Baldi & Rinott 1989). Let $\{Z_i : i \in V\}$ be random variables having a dependency graph $G(V,E)$, and define $S = \sum_{i \in V} Z_i$. Let $D$ denote the maximal degree of $G$ and suppose $|Z_i| \le B$ almost surely. Then
\[ \left| P\!\left( \frac{S - E(S)}{\sqrt{\operatorname{var}(S)}} \le x \right) - \Phi(x) \right| \le 32\big(1 + \sqrt{6}\big) \left( \frac{|V|\,D^2 B^3}{\operatorname{var}(S)^{3/2}} \right)^{\!1/2}, \tag{1.7} \]
where $|V|$ is the cardinality of $V$ and $\Phi(x)$ is the distribution function of the standard normal $N(0,1)$.
Avram & Bertsimas (1993) applied theorem 1.3 to the length of the k-nearest
neighbour graph, the Delaunay triangulation and the Voronoi diagram of random
point sets. More recently, Penrose & Yukich (2001) used stabilization properties to
prove central limit theorems for functions of various types of (random) proximity
graphs, while Chen & Shao (2004) obtained central limit theorems for various
types of local dependence, and in particular improved on the result of Baldi &
Rinott (1989).
In the present context, vertices in the dependency graph corresponding to the random variables $h_{i,n}$ and $h_{j,n}$ are connected by an edge only if they share a common nearest neighbour. By lemma 4.2, the maximal degree of $G$ is therefore bounded independently of the number of points $n$. Under the additional assumption that the $h_{i,n}$ are bounded a.s., if we could show that there exist constants $c, \epsilon > 0$ such that $\operatorname{var}(S_n) \ge c\,n^{2/3+\epsilon}$, it would then follow that the standardized sum $(S_n - E(S_n))/\sqrt{\operatorname{var}(S_n)}$ converges to $N(0,1)$ in distribution as $n \to \infty$. This is beyond the scope of the present paper.
2. Applications
(a) Nearest neighbour distances
Let $X = X_1, X_2, \ldots$ be a sequence of independent and identically distributed random variables taking values in the unit cube $[0,1]^d$. Let $d_{i,n}(X)$ be the distance between the point $X_i$ and its $k$-nearest neighbour in the sample $X^n$, and let $D_n(X)$
denote the sample mean of the $d_{i,n}(X)$,
\[ d_{i,n}(X) = \|X_i - X_{n_i(n,k)}\| \quad\text{and}\quad D_n(X) = \frac{1}{n} \sum_{i=1}^{n} d_{i,n}(X). \tag{2.1} \]
Let $\mu_n^{(d)}$, $\sigma_n^{(d)}$ and $\kappa_n^{(d)}$, respectively, denote the mean, standard deviation and kurtosis of the (identically distributed) random variables $d_{i,n}(X)$.
Let $F$ denote the sampling distribution, $B_x(r)$ denote the ball of radius $r$ centred at $x \in \mathbb{R}^d$ and $u_x(r)$ denote its probability measure:
\[ u_x(r) = \int_{B_x(r)} \mathrm{d}F. \tag{2.2} \]
Suppose that $F$ satisfies a positive density condition (Gruber 2004), in the sense that there exist constants $a > 1$ and $\rho > 0$ such that $a^{-1} r^d \le u_x(r) \le a r^d$ for all $0 \le r \le \rho$. Suppose also that $F$ satisfies a smooth density condition, in the sense that its second partial derivatives are bounded at every point of $[0,1]^d$. Under these conditions, it is known (Evans et al. 2002) that for all $\epsilon > 0$, the moments of the $k$-nearest neighbour distance distribution satisfy
\[ E\big( d_{i,n}^{\,\alpha} \big) = \frac{\Gamma(k + \alpha/d)}{v_{d,p}^{\alpha/d}\, n^{\alpha/d}\, \Gamma(k)} \int f(x)^{(d-\alpha)/d}\,\mathrm{d}x \left( 1 + O\!\left( \frac{1}{n^{1/d - \epsilon}} \right) \right) \tag{2.3} \]
as $n \to \infty$, where $v_{d,p}$ is the volume of the unit ball in $(\mathbb{R}^d, \|\cdot\|_p)$ and $f(x) = F'(x)$ is the density of the sampling distribution. In particular, $\sigma_n^{(d)} = O(n^{-1/d})$ and $\kappa_n^{(d)} = O(1)$ as $n \to \infty$, so by theorem 1.1 it follows that
\[ n^{1/d}\,\big| D_n - \mu_n^{(d)} \big| \to 0 \quad \text{a.s. as } n \to \infty. \tag{2.4} \]
Thus, $D_n$ is a strongly consistent estimator for the expected $k$-nearest neighbour distance $\mu_n^{(d)}$, and its rate of convergence is of the asymptotic order $O(n^{-1/d})$ as the number of sample points $n \to \infty$. In fact, taking $\alpha = 1$ in (2.3) we have
\[ \lim_{n\to\infty} n^{1/d}\,\mu_n^{(d)} = \frac{\Gamma(k + 1/d)}{v_{d,p}^{1/d}\,\Gamma(k)} \int f(x)^{(d-1)/d}\,\mathrm{d}x, \tag{2.5} \]
so by theorem 1.1,
\[ \lim_{n\to\infty} n^{1/d}\,D_n(X) = \frac{\Gamma(k + 1/d)}{v_{d,p}^{1/d}\,\Gamma(k)} \int f(x)^{(d-1)/d}\,\mathrm{d}x \quad \text{a.s.} \tag{2.6} \]
(i) The Euclidean k-nearest neighbours graph
For any set of vertices $\{x_1, \ldots, x_n\}$ in $\mathbb{R}^d$, the (undirected) Euclidean $k$-nearest neighbours graph is constructed by including an edge between every point and its $k$ nearest neighbours in the set, where the nearest neighbour relations are defined with respect to the Euclidean metric ($p = 2$). Let $L_n(X)$ denote the total length of the Euclidean $k$-nearest neighbours graph of the random sample $X^n$,
\[ L_n(X) = \sum_{i=1}^{n} \sum_{\ell=1}^{k} \big\| X_i - X_{n_i(n,\ell)} \big\|_2. \tag{2.7} \]
Because the volume of the unit ball in $(\mathbb{R}^d, \|\cdot\|_2)$ is equal to $\pi^{d/2}/\Gamma(1 + d/2)$, subject to the above conditions on the sampling distribution it follows by (2.6) that
\[ \lim_{n\to\infty} L_n(X)/n^{(d-1)/d} = \alpha(k,d) \int f(x)^{(d-1)/d}\,\mathrm{d}x \quad \text{a.s.}, \tag{2.8} \]
where
\[ \alpha(k,d) = \frac{d}{d+1} \cdot \frac{\Gamma(k + 1 + 1/d)}{v_{d,2}^{1/d}\,\Gamma(k)}, \tag{2.9} \]
which agrees with theorem 2(b) of Wade (2007).
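The limiting constant (2.9) is easy to evaluate and to compare against a simulated graph length. The sketch below is an illustration I added, under the same uniform-density assumption as before (for which the integral in (2.8) is 1); it computes $L_n$ literally as the double sum (2.7).

```python
import numpy as np
from math import gamma, pi
from scipy.spatial import cKDTree

def alpha(k, d):
    """Limiting constant (2.9) for the total k-NN graph length."""
    v_d = pi ** (d / 2) / gamma(1 + d / 2)
    return (d / (d + 1)) * gamma(k + 1 + 1 / d) / (v_d ** (1 / d) * gamma(k))

d, k, n = 2, 3, 100_000
rng = np.random.default_rng(4)
X = rng.random((n, d))                  # uniform: integral of f^((d-1)/d) is 1
dist, _ = cKDTree(X).query(X, k=k + 1)
L_n = dist[:, 1:].sum()                 # total length (2.7), summed per direction
print(L_n / n ** ((d - 1) / d), "vs alpha(k,d) =", alpha(k, d))
```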
(b) Noise estimation
Non-parametric regression attempts to model the behaviour of some observable variable $Y \in \mathbb{R}$ in terms of another observable variable $X \in \mathbb{R}^d$, based only on a finite number of observations of the joint variable $(X,Y)$. The relationship between $X$ and $Y$ is often assumed to satisfy the additive hypothesis
\[ Y = E(Y \mid X) + R, \tag{2.10} \]
where $E(Y \mid X)$ is the regression function and $R$ is the residual variable (noise). Let $Z = Z_1, Z_2, \ldots$ be a sequence of independent and identically distributed observations $Z_i = (X_i, Y_i)$ of the joint variable $Z = (X,Y)$, let $X = X_1, X_2, \ldots$ and $Y = Y_1, Y_2, \ldots$ denote the marginal sequences, and let $n_i(n,k)$ denote the index of the $k$-nearest neighbour of $X_i$ among the points of the marginal sample $X^n = (X_1, \ldots, X_n)$. To estimate the $k$th moment $E(R^k)$ of the residual distribution ($k \in \mathbb{N}$),
we consider the random variables $g_{i,n}(Z)$ and their sample mean $G_n(Z)$, defined by
\[ g_{i,n}(Z) = \prod_{\ell=1}^{k} \big( Y_i - Y_{n_i(n,\ell)} \big) \quad\text{and}\quad G_n(Z) = \frac{1}{n} \sum_{i=1}^{n} g_{i,n}(Z). \tag{2.11} \]
Let $\mu_n^{(g)}$, $\sigma_n^{(g)}$ and $\kappa_n^{(g)}$, respectively, denote the mean, standard deviation and kurtosis of the (identically distributed) random variables $g_{i,n}(Z)$.
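To make the estimator concrete, here is a hedged Python sketch (mine, not the authors') of $G_n$ from (2.11). The synthetic regression function, the Gaussian noise and the zero-mean convention $E(R) = 0$ are assumptions of the illustration, chosen so that $E(R^k)$ is known exactly.

```python
import numpy as np
from scipy.spatial import cKDTree

def residual_moment_estimate(X, Y, k):
    """G_n from (2.11): estimates E(R^k) for the model Y = m(X) + R,
    assuming homoscedastic noise with E(R) = 0 (my convention here)."""
    _, idx = cKDTree(X).query(X, k=k + 1)   # column 0 is the point itself
    diffs = Y[:, None] - Y[idx[:, 1:]]      # Y_i - Y_{n_i(n,l)}, l = 1..k
    return diffs.prod(axis=1).mean()

rng = np.random.default_rng(3)
n = 50_000
X = rng.random((n, 2))
R = rng.normal(0.0, 0.1, n)                 # noise: N(0, 0.1^2)
Y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] ** 2 + R  # smooth regression function
print(residual_moment_estimate(X, Y, k=2), "vs E(R^2) =", 0.1 ** 2)
```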
Perhaps the main contribution of this paper is that theorem 1.1 extends to
possibly unbounded random variables, the only requirement being that their first
four moments are bounded. This allows the residual variable R in (2.10) to take
arbitrarily large values.
Suppose that the sampling distribution of the explanatory variables $X_i$ satisfies smooth and positive density conditions, the regression function $E(Y \mid X)$ satisfies a Lipschitz condition on $\mathbb{R}^d$ and the residual variable $R$ is independent of $X$ (homoscedastic). Under these conditions, Evans & Jones (2008) have shown that for $k \in \mathbb{N}$,
\[ \mu_n^{(g)} = E(R^k) + O\big( \mu_n^{(d)} \big) \quad \text{as } n \to \infty, \tag{2.12} \]
where $\mu_n^{(d)}$ is the expected $k$-nearest neighbour distance in the marginal sample $(X_1, \ldots, X_n)$, and the implied constant depends on the residual moments up to order $k-1$, the constant implied by the Lipschitz condition on the regression function, and the constant implied by the positive density condition on the sampling distribution.
If the first $4k$ moments of the residual distribution are bounded, it can be shown that $\sigma_n^{(g)} = O(1)$ and $\kappa_n^{(g)} = O(1)$ as $n \to \infty$. Hence, by (2.5), (2.12) and theorem 1.1, it follows that
\[ \big| G_n - \mu_n^{(g)} \big| \to 0 \quad \text{a.s. as } n \to \infty. \tag{2.13} \]
Thus, the sample mean $G_n$ is a strongly consistent estimator for the $k$th moment $E(R^k)$ of the residual distribution as $n \to \infty$.
3. The Efron–Stein inequality
Let $X_1, X_2, \ldots$ be a sequence of independent and identically distributed random vectors in $\mathbb{R}^d$, and let $g : \mathbb{R}^{d \times n} \to \mathbb{R}$ be a symmetric function of $n$ vectors in $\mathbb{R}^d$. The Efron–Stein inequality (Efron & Stein 1981) provides an upper bound for the variance of the statistic $Z = g(X_1, \ldots, X_n)$ in terms of the quantities
\[ Z_{(i)} = g(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_{n+1}) \quad\text{and}\quad Z_{(\cdot)} = \frac{1}{n+1} \sum_{i=1}^{n+1} Z_{(i)}. \tag{3.1} \]
Theorem 3.1 (Efron & Stein 1981). Let $X_1, X_2, \ldots, X_{n+1}$ be independent and identically distributed random vectors in $\mathbb{R}^d$, let $g : \mathbb{R}^{d \times n} \to \mathbb{R}$ be a symmetric function, and define the random variable $Z = g(X_1, \ldots, X_n)$. If $E(Z^2) < \infty$, then
\[ \operatorname{var}(Z) \le \sum_{i=1}^{n+1} E\big( Z_{(i)} - Z_{(\cdot)} \big)^2. \tag{3.2} \]
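A quick Monte Carlo illustration of theorem 3.1, added here for concreteness; the choice $Z = \max(X_1, \ldots, X_n)$ with uniform $X_i$ is an arbitrary symmetric statistic, not one used in the paper.

```python
import numpy as np

# Sanity check that var(Z) sits below the Efron-Stein bound (3.2)
# for Z = max(X_1, ..., X_n) with X_i uniform on (0, 1).
rng = np.random.default_rng(5)
n, trials = 50, 20_000

X = rng.random((trials, n + 1))
Z_full = X[:, :n].max(axis=1)          # Z = g(X_1, ..., X_n)

# Z_(i): the statistic recomputed on the n+1 points with point i left out.
totals = np.empty((trials, n + 1))
for i in range(n + 1):
    totals[:, i] = np.delete(X, i, axis=1).max(axis=1)
Z_dot = totals.mean(axis=1)            # Z_(.)

bound = ((totals - Z_dot[:, None]) ** 2).sum(axis=1).mean()
print("var(Z) =", Z_full.var(), "<= Efron-Stein bound =", bound)
```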
The Efron–Stein inequality has found a wide range of applications in statistics.
Our approach is partly based on the work of Reitzner (2003), who uses the inequality
to prove strong laws of large numbers for statistics related to random polytopes.
First we note that, because $X_1, \ldots, X_{n+1}$ are independent and identically distributed, it follows by symmetry on the indices that
\[ \operatorname{var}(Z) \le (n+1)\,E\big( Z_{(n+1)} - Z_{(\cdot)} \big)^2. \tag{3.3} \]
Furthermore, for any $a \in \mathbb{R}$, we have that
\[ \sum_{i=1}^{n+1} E\big( Z_{(i)} - Z_{(\cdot)} \big)^2 = \sum_{i=1}^{n+1} E\big( (Z_{(i)} - a) - (Z_{(\cdot)} - a) \big)^2 \tag{3.4} \]
\[ = \sum_{i=1}^{n+1} E\big( Z_{(i)} - a \big)^2 - (n+1)\,E\big( Z_{(\cdot)} - a \big)^2, \tag{3.5} \]
so inequalities (3.2) and (3.3) are preserved if we replace $Z_{(\cdot)}$ by any other function of $X_1, \ldots, X_{n+1}$. To prove theorem 1.1, we apply the Efron–Stein inequality to the sum $S_n(X)$,
\[ Z(X_1, \ldots, X_n) = \sum_{i=1}^{n} h_{i,n}(X) \tag{3.6} \]
\[ = \sum_{i=1}^{n} h\big( X_i, X_{n_i(n,1)}, \ldots, X_{n_i(n,k)} \big). \tag{3.7} \]
If $E(h_{i,n}^2) < \infty$ then $E(Z^2) < \infty$. Furthermore, because the points $X_1, \ldots, X_n$ are independent and identically distributed, it follows that $S_n = Z(X_1, \ldots, X_n)$ is invariant under permutations of its arguments. Thus, we may apply the Efron–Stein inequality to $S_n$. In this case $Z_{(n+1)} = S_n$, so by replacing $Z_{(\cdot)}$ by $S_{n+1}$ in (3.3), we obtain
\[ \operatorname{var}(S_n) \le (n+1)\,E(S_n - S_{n+1})^2. \tag{3.8} \]
An identical expression also holds for the normalized sum $\bar S_n$. Lemma 3.2 shows that the expected squared difference between successive values of $\bar S_n$ is bounded independently of the total number of points $n$. The proof is given in §4.
Lemma 3.2. If the sampling distribution is continuous, then for all $n \ge 16k$,
\[ E\big( \bar S_n - \bar S_{n+1} \big)^2 \le 2(3 + 8k\beta)(3 + 64k)(\kappa_n + \kappa_{n+1}). \tag{3.9} \]
The following corollary of lemma 3.2 follows immediately by (3.8) and Chebyshev's inequality, and asserts the weak consistency of the sample mean $H_n$, defined in (1.2), as an estimator for the true mean $\mu_n$ as $n \to \infty$ (provided the kurtosis $\kappa_n$ is uniformly bounded for all $n \in \mathbb{N}$).
Corollary 3.3. If the sampling distribution is continuous, then for all $n \ge 16k$,
\[ P\!\left( \left| \frac{H_n - \mu_n}{\sigma_n} \right| > \epsilon \right) \le \frac{4(3 + 8k\beta)(3 + 64k)(\kappa_n + \kappa_{n+1})}{n\epsilon^2}. \tag{3.10} \]
We prove lemma 3.2 using methods based on the approach of Bickel &
Breiman (1983) and Hitczenko et al. (1999). Having established lemma 3.2, the
proof of theorem 1.1 is then straightforward.
4. Proofs
(a) Standard results
We need two standard geometric results. The first of these concerns the expected probability measure of $k$-nearest neighbour balls. For $X_i \in X^n$, let $B_{i,n}(X) \subset \mathbb{R}^d$ denote the $k$-nearest neighbour ball of $X_i$ (with respect to the finite sample $X^n$), defined to be the ball centred at $X_i$ of radius equal to the distance from $X_i$ to its $k$-nearest neighbour in $X^n$, and let $u_{i,n}(X)$ denote its probability measure,
\[ u_{i,n}(X) = \int_{B_{i,n}(X)} \mathrm{d}F, \tag{4.1} \]
where $F$ is the (common) distribution function of the sample points $X_1, X_2, \ldots$. It is well known, see for example Percus & Martin (1998) or Evans et al. (2002), that provided the sampling distribution is continuous, the expected probability measure of any $k$-nearest neighbour ball (over all sample realizations) is equal to $k/n$.
Lemma 4.1. If the sampling distribution is continuous, then
\[ E(u_{i,n}) = k/n. \tag{4.2} \]
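Lemma 4.1 is easy to verify by simulation. In the sketch below (my illustration, not from the paper), the probability measure of the $k$-nearest neighbour ball of $X_1$ is itself estimated with a fresh uniform sample, and its average over many realizations should come out close to $k/n$.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(2)
n, k, d, trials = 50, 3, 2, 2000
measures = []
for _ in range(trials):
    X = rng.random((n, d))
    # radius of the k-NN ball of X_1 within the sample
    r = cKDTree(X).query(X[0], k=k + 1)[0][k]
    # estimate its probability measure with a fresh uniform sample
    Y = rng.random((20_000, d))
    measures.append(np.mean(np.linalg.norm(Y - X[0], axis=1) <= r))
print(np.mean(measures), "vs k/n =", k / n)  # both approx 0.06
```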
The second result concerns the maximum degree of any vertex in a $k$-nearest neighbour graph. It is well known that this number is bounded independently of the total number of vertices (e.g. Stone 1977; Bickel & Breiman 1983; Zeger & Gersho 1994; Yukich 1998).
Lemma 4.2. For every countable set of points in $\mathbb{R}^d$, any point can be among the first $k$ nearest neighbours of at most $\beta(k,d,p) = k\lfloor d\,v_{d,p} \rfloor$ other points of the set, where $v_{d,p}$ is the volume of the unit ball in $(\mathbb{R}^d, \|\cdot\|_p)$.
In particular, $\beta(k,d,2) = k\big\lfloor 2\pi^{d/2}/\Gamma(d/2) \big\rfloor$ and $\beta(k,d,\infty) = k\,d\,2^d$.
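The bound grows quickly with dimension, which is the 'curse of dimensionality' discussed in §5. A small sketch (my addition) evaluating $\beta(k,d,p)$ for the two cases given above:

```python
from math import gamma, pi, floor

def beta(k, d, p):
    """beta(k,d,p) = k * floor(d * v_{d,p}) for p = 2 or p = infinity."""
    if p == 2:
        v = pi ** (d / 2) / gamma(1 + d / 2)  # Euclidean unit-ball volume
    elif p == float("inf"):
        v = 2.0 ** d                          # unit ball is the cube [-1,1]^d
    else:
        raise NotImplementedError
    return k * floor(d * v)

for d in (1, 2, 3, 10):
    print(d, beta(1, d, 2), beta(1, d, float("inf")))  # e.g. d=2: 6 and 8
```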
(b) Adding a sample point
Let $\lambda_n$ denote the second moment of the random variables $h_{i,n}(X)$,
\[ \lambda_n = \sigma_n^2 + \mu_n^2. \tag{4.3} \]
Lemma 4.3. If the sampling distribution is continuous,
\[ E\!\left( \sum_{i=1}^{n} (h_{i,n} - h_{i,n+1})^2 \right) \le 2k(\lambda_n + \lambda_{n+1}). \tag{4.4} \]
Proof. Let $B_{i,n}(X) \subset \mathbb{R}^d$ denote the $k$-nearest neighbour ball of $X_i$ with respect to the finite sample $X^n = (X_1, \ldots, X_n)$, and suppose we add another (independent and identically distributed) point $X_{n+1}$ to the ensemble. If $X_{n+1} \notin B_{i,n}(X)$ (i.e. if the new point $X_{n+1}$ falls outside the current $k$-nearest neighbour ball of $X_i$), then $X_{n_i(n,\ell)} = X_{n_i(n+1,\ell)}$ for all $1 \le \ell \le k$, and hence
\[ h_{i,n} = h\big( X_i, X_{n_i(n,1)}, \ldots, X_{n_i(n,k)} \big) \tag{4.5} \]
\[ = h\big( X_i, X_{n_i(n+1,1)}, \ldots, X_{n_i(n+1,k)} \big) \tag{4.6} \]
\[ = h_{i,n+1}. \tag{4.7} \]
Thus, we have $h_{i,n} \ne h_{i,n+1}$ only if $X_{n+1} \in B_{i,n}(X)$. Let $J_{i,n+1}$ denote the indicator variable of the event $X_{n+1} \in B_{i,n}(X)$,
\[ J_{i,n+1}(X) = \begin{cases} 1, & \text{if } X_{n+1} \in B_{i,n}(X), \\ 0, & \text{otherwise.} \end{cases} \tag{4.8} \]
Then
\[ E\!\left( \sum_{i=1}^{n} (h_{i,n} - h_{i,n+1})^2 \right) = E\!\left( \sum_{i=1}^{n} (h_{i,n} - h_{i,n+1})^2\, J_{i,n+1} \right). \tag{4.9} \]
By lemma 4.1, the expected probability measure of $B_{i,n}(X)$ is equal to $k/n$. Hence, because the new point $X_{n+1}$ is assumed to be independent of the previously selected points $X_1, \ldots, X_n$, we have
\[ P(J_{i,n+1} = 1) = k/n \tag{4.10} \]
and thus
\[ E\!\left( \sum_{i=1}^{n} (h_{i,n} - h_{i,n+1})^2 \right) = \frac{k}{n} \sum_{i=1}^{n} E\big( (h_{i,n} - h_{i,n+1})^2 \mid J_{i,n+1} = 1 \big). \tag{4.11} \]
So that we can consider samples of sizes $n$ and $n+1$ separately, we apply the crude bound
\[ E\big( (h_{i,n} - h_{i,n+1})^2 \mid J_{i,n+1} = 1 \big) \le 2E\big( h_{i,n}^2 \mid J_{i,n+1} = 1 \big) + 2E\big( h_{i,n+1}^2 \mid J_{i,n+1} = 1 \big). \tag{4.12} \]
For samples of size $n$, because the random variable $h_{i,n}^2$ is determined only by the points $X_1, \ldots, X_n$, it follows that $h_{i,n}^2$ is independent of the event $X_{n+1} \in B_{i,n}$, and therefore
\[ E\big( h_{i,n}^2 \mid J_{i,n+1} = 1 \big) = E\big( h_{i,n}^2 \big) = \lambda_n. \tag{4.13} \]
For samples of size $n+1$, the event $X_{n+1} \in B_{i,n}$ is simply that the 'last' point of the sample $X_1, \ldots, X_{n+1}$ is among the first $k$ nearest neighbours of $X_i$, i.e. $X_{n+1} \in B_{i,n}$ if and only if $n_i(n+1,\ell) = n+1$ for some $1 \le \ell \le k$. Because the points $X_1, \ldots, X_{n+1}$ are independent and identically distributed, the order in which they are indexed is arbitrary. Hence, it follows by symmetry on the indices $j \ne i$ that the condition $n_i(n+1,\ell) = n+1$ cannot affect the expected value of $h_{i,n+1}^2$, so
\[ E\big( h_{i,n+1}^2 \mid J_{i,n+1} = 1 \big) = E\big( h_{i,n+1}^2 \big) = \lambda_{n+1}. \tag{4.14} \]
Thus, we have that
\[ E\big( (h_{i,n} - h_{i,n+1})^2 \mid J_{i,n+1} = 1 \big) \le 2(\lambda_n + \lambda_{n+1}) \tag{4.15} \]
and the result follows by (4.11). ∎
Lemma 4.4. If the sampling distribution is continuous, then for all $n \ge 16k$,
\[ \lambda_{n+1} \le 3\lambda_n. \tag{4.16} \]
Proof. Because the $h_{i,n}$ are identically distributed,
\[ \lambda_{n+1} = E\!\left( \frac{1}{n} \sum_{i=1}^{n} h_{i,n+1}^2 \right) \tag{4.17} \]
\[ = E\!\left( \frac{1}{n} \sum_{i=1}^{n} \big( h_{i,n} - (h_{i,n} - h_{i,n+1}) \big)^2 \right) \tag{4.18} \]
\[ \le E\!\left( \frac{2}{n} \sum_{i=1}^{n} \big( h_{i,n}^2 + (h_{i,n} - h_{i,n+1})^2 \big) \right) \tag{4.19} \]
\[ \le 2\lambda_n + \frac{2}{n}\, E\!\left( \sum_{i=1}^{n} (h_{i,n} - h_{i,n+1})^2 \right). \tag{4.20} \]
Hence by lemma 4.3,
\[ \lambda_{n+1} \le 2\lambda_n + \frac{4k}{n}(\lambda_n + \lambda_{n+1}). \tag{4.21} \]
Thus, for $n \ge 16k$ we have $4k/n \le 1/4$ and hence $\lambda_{n+1} \le 3\lambda_n$. ∎
Lemma 4.5. If the sampling distribution is continuous, then for all $n \ge 16k$,
\[ E\!\left( \sum_{i=1}^{n} (h_{i,n} - h_{i,n+1}) \right)^{\!2} \le 8k\beta\,\lambda_n. \tag{4.22} \]
Proof. Let $M_{n+1}$ be the set containing the index of every point in $X^{n+1}$ which has the new point $X_{n+1}$ among its first $k$ nearest neighbours,
\[ M_{n+1}(X) = \{ i : n_i(n+1,\ell) = n+1 \text{ for some } 1 \le \ell \le k \}. \tag{4.23} \]
By construction, if $i \notin M_{n+1}$ then $X_{n_i(n,\ell)} = X_{n_i(n+1,\ell)}$ for all $1 \le \ell \le k$, so $h_{i,n} \ne h_{i,n+1}$ only if $i \in M_{n+1}$. Furthermore, by lemma 4.2, the new point $X_{n+1}$ can become one of the first $k$ nearest neighbours of at most $\beta$ of the existing points $X_1, \ldots, X_n$, so $|M_{n+1}| \le \beta$. Hence, by the Cauchy inequality,
\[ \left( \sum_{i=1}^{n} (h_{i,n} - h_{i,n+1}) \right)^{\!2} = \left( \sum_{i \in M_{n+1}} (h_{i,n} - h_{i,n+1}) \right)^{\!2} \tag{4.24} \]
\[ \le |M_{n+1}| \sum_{i \in M_{n+1}} (h_{i,n} - h_{i,n+1})^2 \tag{4.25} \]
\[ \le \beta \sum_{i=1}^{n} (h_{i,n} - h_{i,n+1})^2, \tag{4.26} \]
and the result follows by lemmas 4.3 and 4.4. ∎
(c) An upper bound on $\operatorname{var}(S_n)$
Lemma 4.6. If the sampling distribution is continuous, then for all $n \ge 16k$,
\[ \text{(i)}\quad E(S_n - S_{n+1})^2 \le 2(3 + 8k\beta)\lambda_n \qquad\text{and}\qquad \text{(ii)}\quad \operatorname{var}(S_n) \le 2(n+1)(3 + 8k\beta)\lambda_n. \]
Proof. By definition,
\[ S_n - S_{n+1} = \sum_{i=1}^{n} (h_{i,n} - h_{i,n+1}) - h_{n+1,n+1}, \tag{4.27} \]
so
\[ (S_n - S_{n+1})^2 \le 2\left( \sum_{i=1}^{n} (h_{i,n} - h_{i,n+1}) \right)^{\!2} + 2h_{n+1,n+1}^2. \tag{4.28} \]
Taking expectations then applying lemmas 4.4 and 4.5,
\[ E(S_n - S_{n+1})^2 \le 2(3 + 8k\beta)\lambda_n \tag{4.29} \]
and by (3.8),
\[ \operatorname{var}(S_n) \le 2(n+1)(3 + 8k\beta)\lambda_n. \tag{4.30} \]
∎
(d) Normalization
To find an upper bound on $E(\bar S_{n+1} - \bar S_n)^2$, we first need to quantify the squared difference between successive values of $\mu_n$ and $\sigma_n$.
Lemma 4.7. If the sampling distribution is continuous, then for all $n \ge 16k$,
\[ \text{(i)}\quad (\mu_n - \mu_{n+1})^2 \le \frac{8k\beta}{n^2}\lambda_n \qquad\text{and}\qquad \text{(ii)}\quad (\sigma_n - \sigma_{n+1})^2 \le \frac{16k}{n}\lambda_n. \]
Proof. To prove (i), because the $h_{i,n}$ are identically distributed,
\[ (\mu_n - \mu_{n+1})^2 = \left[ E\!\left( \frac{1}{n} \sum_{i=1}^{n} (h_{i,n} - h_{i,n+1}) \right) \right]^2 \tag{4.31} \]
\[ \le \frac{1}{n^2}\, E\!\left( \sum_{i=1}^{n} (h_{i,n} - h_{i,n+1}) \right)^{\!2} \tag{4.32} \]
\[ \le \frac{8k\beta}{n^2}\lambda_n, \tag{4.33} \]
where the last inequality follows by lemma 4.5.
To prove (ii), let $\tilde X = (\tilde X_1, \tilde X_2, \ldots)$ be an independent copy of $X = (X_1, X_2, \ldots)$, and let $\tilde n_i(n,k)$ denote the index of the $k$-nearest neighbour of $\tilde X_i$ among the points of $\tilde X^n = (\tilde X_1, \ldots, \tilde X_n)$. We consider the random variables
\[ \tilde h_{i,n}(\tilde X) = h\big( \tilde X_i, \tilde X_{\tilde n_i(n,1)}, \ldots, \tilde X_{\tilde n_i(n,k)} \big). \tag{4.34} \]
Because $h_{i,n}$ and $\tilde h_{i,n}$ are independent and identically distributed,
\[ E\big( (h_{i,n} - \tilde h_{i,n})^2 \big) = E(h_{i,n}^2) - 2E(h_{i,n}\tilde h_{i,n}) + E(\tilde h_{i,n}^2) \tag{4.35} \]
\[ = E(h_{i,n}^2) - 2E(h_{i,n})E(\tilde h_{i,n}) + E(\tilde h_{i,n}^2) \tag{4.36} \]
\[ = E(h_{i,n}^2) - \mu_n^2 + E(\tilde h_{i,n}^2) - \mu_n^2 \tag{4.37} \]
\[ = 2\sigma_n^2. \tag{4.38} \]
By the Cauchy–Schwarz inequality,
\[ E\big( (h_{i,n} - \tilde h_{i,n})(h_{i,n+1} - \tilde h_{i,n+1}) \big) \le 2\sigma_n\sigma_{n+1}, \tag{4.39} \]
so
\[ 2(\sigma_n - \sigma_{n+1})^2 = 2\sigma_n^2 + 2\sigma_{n+1}^2 - 4\sigma_n\sigma_{n+1} \tag{4.40} \]
\[ \le E(h_{i,n} - \tilde h_{i,n})^2 + E(h_{i,n+1} - \tilde h_{i,n+1})^2 - 2E\big( (h_{i,n} - \tilde h_{i,n})(h_{i,n+1} - \tilde h_{i,n+1}) \big). \tag{4.41} \]
Thus, we have
\[ 2(\sigma_n - \sigma_{n+1})^2 \le E\Big( \big( (h_{i,n} - \tilde h_{i,n}) - (h_{i,n+1} - \tilde h_{i,n+1}) \big)^2 \Big) \tag{4.42} \]
\[ \le 2E\big( (h_{i,n} - h_{i,n+1})^2 + (\tilde h_{i,n} - \tilde h_{i,n+1})^2 \big) \tag{4.43} \]
and because $X$ and $\tilde X$ are identically distributed,
\[ (\sigma_n - \sigma_{n+1})^2 \le 2E(h_{i,n} - h_{i,n+1})^2. \tag{4.44} \]
Finally, because the $h_{i,n}$ are identically distributed,
\[ (\sigma_n - \sigma_{n+1})^2 \le \frac{2}{n}\, E\!\left( \sum_{i=1}^{n} (h_{i,n} - h_{i,n+1})^2 \right) \tag{4.45} \]
and (ii) follows by lemmas 4.3 and 4.4. ∎
(e) An upper bound on $\operatorname{var}(\bar S_n)$
Proof of lemma 3.2. We aim to show that for all $n \ge 16k$,
\[ E\big( \bar S_n - \bar S_{n+1} \big)^2 \le 2(3 + 8k\beta)(3 + 64k)(\kappa_n + \kappa_{n+1}). \tag{4.46} \]
First, we write
\[ \bar S_n - \bar S_{n+1} = \frac{S_n - n\mu_n}{\sigma_n} - \frac{S_{n+1} - (n+1)\mu_{n+1}}{\sigma_{n+1}} \tag{4.47} \]
\[ = \frac{1}{\sigma_n}\left( S_n - n\mu_n - \frac{\sigma_n}{\sigma_{n+1}}\big( S_{n+1} - (n+1)\mu_{n+1} \big) \right) \tag{4.48} \]
\[ = \frac{1}{\sigma_n}\left( (S_n - S_{n+1}) - n(\mu_n - \mu_{n+1}) + \mu_{n+1} - \frac{1}{\sigma_{n+1}}(\sigma_n - \sigma_{n+1})\big( S_{n+1} - (n+1)\mu_{n+1} \big) \right), \tag{4.49} \]
from which it follows that
\[ E\big( \bar S_n - \bar S_{n+1} \big)^2 \le \frac{4}{\sigma_n^2}\left( E(S_n - S_{n+1})^2 + n^2(\mu_n - \mu_{n+1})^2 + \mu_{n+1}^2 + \frac{1}{\sigma_{n+1}^2}(\sigma_n - \sigma_{n+1})^2 \operatorname{var}(S_{n+1}) \right). \tag{4.50} \]
Thus, by lemmas 4.4, 4.6 and 4.7 we obtain
\[ E\big( \bar S_n - \bar S_{n+1} \big)^2 \le 4(3 + 8k\beta)\left( 3\frac{\lambda_n}{\sigma_n^2} + 64k\frac{\lambda_{n+1}}{\sigma_{n+1}^2} \right). \tag{4.51} \]
Finally, because $\lambda_n/\sigma_n^2 \le \kappa_n^{1/2}$ and $\kappa_{n+1} \ge 1$,
\[ E\big( \bar S_n - \bar S_{n+1} \big)^2 \le 4(3 + 8k\beta)(3 + 64k)\,\kappa_n^{1/2}\kappa_{n+1}^{1/2}, \tag{4.52} \]
and the result follows by the fact that $2\kappa_n^{1/2}\kappa_{n+1}^{1/2} \le \kappa_n + \kappa_{n+1}$. ∎
Proof of theorem 1.2. By lemma 3.2 and (3.8),
\[ \operatorname{var}(\bar S_n) \le 2(n+1)(3 + 8k\beta)(3 + 64k)(\kappa_n + \kappa_{n+1}). \tag{4.53} \]
∎
(f) Strong convergence
Proof of theorem 1.1. We aim to show that $|\bar H_n| \to 0$ a.s. as $n \to \infty$. Suppose that there exists a finite constant $c > 0$ such that $\kappa_n < c$ for all $n \in \mathbb{N}$. Then by corollary 3.3, the probabilities $P(|\bar H_{n_m}| > \epsilon)$ are summable for the subsequence $n_m = m^2$,
\[ \sum_{m=1}^{\infty} P\big( |\bar H_{m^2}| > \epsilon \big) \le \frac{c'}{\epsilon^2} \sum_{m=1}^{\infty} \frac{1}{m^2} < \infty, \tag{4.54} \]
where $c' = 4c(3 + 8k\beta)(3 + 64k)$. Hence, by the Borel–Cantelli lemma it follows that $\bar H_{m^2} \to 0$ a.s. as $m \to \infty$. For $m^2 < n < (m+1)^2$, we write
\[ \bar H_n = \frac{1}{n}\Big( \bar S_{m^2} + \big( \bar S_n - \bar S_{m^2} \big) \Big) = \frac{m^2}{n}\left( \bar H_{m^2} + \frac{1}{m^2}\big( \bar S_n - \bar S_{m^2} \big) \right). \tag{4.55} \]
Let
\[ W_m = \max\left\{ \frac{1}{m^2}\big| \bar S_n - \bar S_{m^2} \big| : m^2 < n < (m+1)^2 \right\}. \tag{4.56} \]
Then $|\bar H_n| \le |\bar H_{m^2}| + W_m$ for all $m^2 < n < (m+1)^2$, so by (4.54) it is sufficient to show that $W_m \to 0$ a.s. as $m \to \infty$. Writing
\[ \bar S_n - \bar S_{m^2} = \big( \bar S_n - \bar S_{n-1} \big) + \big( \bar S_{n-1} - \bar S_{n-2} \big) + \cdots + \big( \bar S_{m^2+1} - \bar S_{m^2} \big), \tag{4.57} \]
we see that
\[ W_m \le \frac{1}{m^2} \sum_{j=m^2+1}^{(m+1)^2-1} \big| \bar S_j - \bar S_{j-1} \big|. \tag{4.58} \]
Hence by the Cauchy inequality, and using the fact that the sum has exactly $2m$ terms, it follows by lemma 3.2 that
\[ E(W_m^2) \le \frac{1}{m^4}\, E\left( \sum_{j=m^2+1}^{(m+1)^2-1} \big| \bar S_j - \bar S_{j-1} \big| \right)^{\!2} \tag{4.59} \]
\[ \le \frac{2}{m^3} \sum_{j=m^2+1}^{(m+1)^2-1} E\big( \bar S_j - \bar S_{j-1} \big)^2 \tag{4.60} \]
\[ \le \frac{16c(3 + 8k\beta)(3 + 64k)}{m^2}. \tag{4.61} \]
Thus, by Markov's inequality and the Borel–Cantelli lemma, $W_m \to 0$ a.s. as $m \to \infty$, which concludes the proof of theorem 1.1. ∎
5. Conclusion
For a sequence $X = X_1, X_2, \ldots$ of independent and identically distributed random vectors in $\mathbb{R}^d$, we have proved a strong law of large numbers for functions of a point and its nearest neighbours in the sample $X^n = (X_1, \ldots, X_n)$. We have also used the result to show that certain non-parametric difference-based estimators of residual moments are strongly consistent as the number of points $n \to \infty$ (provided the distribution of the explanatory vectors satisfies smooth and positive density conditions). Perhaps the most significant advance put forward in this paper is that the random variables need not be bounded, the only requirement being that their first four moments are bounded.
In practical applications, one area of concern is that the bound on $\operatorname{var}(\bar S_n)$ of theorem 1.2 increases exponentially with the dimension of the underlying space $\mathbb{R}^d$, which negatively affects the rate of convergence when $d$ is large. The exponential factor $\beta$ is introduced in the proof of lemma 4.5, where we have used the relatively crude bound of lemma 4.2. By contrast, the proof of lemma 4.3 uses the bound of lemma 4.1, which avoids the exponential factor. If we attempt to prove lemma 4.5 along the same lines as those followed in the proof of lemma 4.3, we arrive at the expression
\[ E\!\left( \sum_{i=1}^{n} (h_{i,n} - h_{i,n+1}) \right)^{\!2} = \sum_{i,j=1}^{n} E\big( (h_{i,n} - h_{i,n+1})(h_{j,n} - h_{j,n+1})\,J_{ij} \big), \tag{5.1} \]
where $J_{ij}(X)$ is the indicator variable of the event $X_{n+1} \in B_{i,n}(X) \cap B_{j,n}(X)$, and $B_{i,n}(X)$ and $B_{j,n}(X)$ are, respectively, the $k$-nearest neighbour balls of the points $X_i$ and $X_j$ in the sample $X^n$. At this point, we would like to show that $P(X_{n+1} \in B_{i,n} \cap B_{j,n})$ is at most equal to $c(d)/n$, where $c(d)$ does not increase exponentially with $d$. However, Dwyer (1995) showed that if $k = 1$ and $d \ge 7$, then for $n \gg d$ the expected number of edges $n_e$ in the random sphere of influence graph, or equivalently the expected number of pairs $(X_i, X_j)$ for which $B_{i,n}(X) \cap B_{j,n}(X) \ne \emptyset$, satisfies
\[ (0.324)\,2^d\, n \le E(n_e) \le (0.677)\,2^d\, n. \tag{5.2} \]
This suggests that the curse of dimensionality will not be easily lifted.
The author would like to thank the Royal Society for supporting his research through the
University Research Fellowship scheme.
Appendix A. Standard results
Proof of lemma 4.1. Let $u_{i,n}$ denote the probability measure of the $k$-nearest neighbour ball of $X_i \in X^n$, let $G(u)$ be the distribution function of $u_{i,n}$ and let $B_i(u)$ be the ball centred at $X_i$ of probability measure $u$. Because the $X_i$ are identically distributed,
\[ G(u) = P(u_{i,n} \le u) \tag{A 1} \]
\[ = P\big( B_i(u) \text{ contains at least } k \text{ points other than } X_i \big) \tag{A 2} \]
\[ = 1 - \sum_{j=0}^{k-1} P\big( B_i(u) \text{ contains exactly } j \text{ points other than } X_i \big) \tag{A 3} \]
\[ = 1 - \sum_{j=0}^{k-1} \binom{n-1}{j} u^j (1-u)^{n-j-1}. \tag{A 4} \]
Integrating by parts,
\[ E\big( u_{i,n}^{\,\alpha} \big) = \int_0^1 u^{\alpha}\,\mathrm{d}G(u) = 1 - \alpha \int_0^1 u^{\alpha-1} G(u)\,\mathrm{d}u \tag{A 5} \]
\[ = \alpha \sum_{j=0}^{k-1} \binom{n-1}{j} \int_0^1 u^{j+\alpha-1} (1-u)^{n-j-1}\,\mathrm{d}u. \tag{A 6} \]
The integral in (A 6) is the beta function $B(a,b) = \Gamma(a)\Gamma(b)/\Gamma(a+b)$ with parameters $a = j + \alpha$ and $b = n - j$. Thus, we obtain
\[ E\big( u_{i,n}^{\,\alpha} \big) = \frac{\alpha\,\Gamma(n)}{\Gamma(n+\alpha)} \sum_{j=0}^{k-1} \frac{\Gamma(j+\alpha)}{\Gamma(j+1)} = \frac{\Gamma(n)\,\Gamma(k+\alpha)}{\Gamma(n+\alpha)\,\Gamma(k)}. \tag{A 7} \]
In particular, $E(u_{i,n}) = k/n$. ∎
Lemma A.1. For every countable set of points in $\mathbb{R}^d$, any point can be the (first) nearest neighbour of at most $\lfloor d\,v_{d,p} \rfloor$ other points of the set, where $v_{d,p}$ is the volume (Lebesgue measure) of the unit ball in $(\mathbb{R}^d, \|\cdot\|_p)$.
Proof. Let $t > d\,v_{d,p}$ be an integer and suppose that $x_0$ is the nearest neighbour of every point in the set $\{x_1, \ldots, x_t\}$. We project the points $x_i$ onto the surface of the unit ball in $\mathbb{R}^d$, writing
\[ x_i = x_0 + r_i\,\xi_i \quad\text{where}\quad r_i = \|x_i - x_0\| \text{ and } \|\xi_i\| = 1. \tag{A 8} \]
Let $x_i$ and $x_j$ be two distinct points of $\{x_1, \ldots, x_t\}$ and suppose (without loss of generality) that $\|x_j - x_0\| \le \|x_i - x_0\|$. The vector $x_i - x_j$ can be expressed as
\[ x_i - x_j = r_i\xi_i - r_j\xi_j = (r_i - r_j)\xi_i + r_j(\xi_i - \xi_j), \tag{A 9} \]
so
\[ \|x_i - x_j\| \le (r_i - r_j)\|\xi_i\| + r_j\|\xi_i - \xi_j\|. \tag{A 10} \]
Hence, because $r_i \le \|x_i - x_j\|$ and $\|\xi_i\| = 1$ we have
\[ r_i \le (r_i - r_j) + r_j\|\xi_i - \xi_j\| \tag{A 11} \]
and therefore
\[ 1 \le \|\xi_i - \xi_j\|. \tag{A 12} \]
Thus, every point $\xi_i$ must be located within an otherwise empty region on the surface of the unit ball in $\mathbb{R}^d$, and the surface area of each of these regions must be at least equal to 1. By hypothesis, there are $t > d\,v_{d,p}$ points $\xi_1, \ldots, \xi_t$, so the total surface area covered by these disjoint regions must be greater than $d\,v_{d,p}$. However, because the total surface area of the unit ball is equal to $d\,v_{d,p}$, we have a contradiction, and thus we conclude that $t \le \lfloor d\,v_{d,p} \rfloor$. ∎
Lemma 8.4 of Yukich (1998) shows that lemma 4.2 follows from lemma A.1. An alternative proof is given here.
Proof of lemma 4.2. Let $t > k\,d\,v_{d,p}$ be an integer and suppose that $x_0$ is one of the $k$ nearest neighbours of every point in the set $\{x_1, \ldots, x_t\}$. First we choose $\xi_1$ to be the point of $\{x_1, \ldots, x_t\}$ furthest away from $x_0$, then eliminate this point $\xi_1$ along with all points that are closer to $\xi_1$ than $x_0$ is to $\xi_1$. At least one point is eliminated (namely the point $\xi_1$ itself), and because $x_0$ is one of the $k$ nearest neighbours of $\xi_1$ there can be at most $k-1$ other points closer to $\xi_1$ than $x_0$ is to $\xi_1$, so at most $k$ points are eliminated in total.
Next we repeat the procedure on the remaining points, choosing $\xi_2$ to be the furthest point away from $x_0$, then eliminating $\xi_2$ along with all points that are closer to $\xi_2$ than $x_0$ is to $\xi_2$. Because $\xi_2$ was not eliminated in the first round, it must be closer to $x_0$ than it is to $\xi_1$,
\[ \|\xi_2 - x_0\| \le \|\xi_2 - \xi_1\|. \tag{A 13} \]
We continue in this way, at each stage choosing $\xi_i$ to be the point (among those that remain) furthest away from $x_0$, and eliminating $\xi_i$ along with all points that are closer to $\xi_i$ than $x_0$ is to $\xi_i$. Because $\xi_i$ was not eliminated in the previous rounds, it must be closer to $x_0$ than to any of the points chosen previously,
\[ \|\xi_i - x_0\| \le \|\xi_i - \xi_j\| \quad \text{for all } j = 1, \ldots, i-1. \tag{A 14} \]
At least one point is eliminated at each stage, so the process must eventually terminate, say after $T$ steps, and we obtain a set of points $\{\xi_1, \ldots, \xi_T\}$, each having $x_0$ as its nearest neighbour. At most $k$ points are eliminated at each step, so a minimum of $t/k$ steps must be performed before the process terminates. By hypothesis, $t > k\,d\,v_{d,p}$, so $T > d\,v_{d,p}$. However, by lemma A.1 we know that any point $x_0$ can be the nearest neighbour of at most $\lfloor d\,v_{d,p} \rfloor$ other points. Thus we have a contradiction, and conclude that $t \le k\lfloor d\,v_{d,p} \rfloor$. ∎
For Euclidean space, Zeger & Gersho (1994) establish an alternative bound in terms of kissing numbers.
Corollary A.2. If $p = 2$ then for every countable set of points in $\mathbb{R}^d$, any point can be among the $k$ nearest neighbours of at most $k\,K(d)$ other points of the set, where $K(d)$ is the maximum kissing number in $\mathbb{R}^d$.
Proof. Following the proof of lemma A.1, by (A 12) we can place non-overlapping spheres of radius $1/2$ at each of the $t$ points $\xi_i$, and each of these will be tangent to the sphere of radius $1/2$ centred at the origin. This contradicts the fact that there can be at most $K(d)$ such tangent spheres. ∎
It is known that $K(d)$ satisfies the following inequalities, where the upper bound is due to Kabatiansky & Levenshtein (1978) and the lower bound is due to Wyner (1965),
\[ 2^{0.2075d(1+o(1))} \le K(d) \le 2^{0.401d(1+o(1))}. \]
References
Avram, F. & Bertsimas, D. 1993 On central limit theorems in geometrical probability. Ann. Appl. Prob. 3, 1033–1046. (doi:10.1214/aoap/1177005271)
Baldi, P. & Rinott, Y. 1989 On normal approximations of distributions in terms of dependency graphs. Ann. Prob. 17, 1646–1650. (doi:10.1214/aop/1176991178)
Bickel, P. J. & Breiman, L. 1983 Sums of functions of nearest neighbour distances, moment bounds, limit theorems and a goodness of fit test. Ann. Prob. 11, 185–214. (doi:10.1214/aop/1176993668)
Chen, L. H. Y. & Shao, Q. 2004 Normal approximation under local dependence. Ann. Prob. 32, 1985–2028. (doi:10.1214/009117904000000450)
Dwyer, R. A. 1995 The expected size of the sphere-of-influence graph. Comput. Geomet. Theory Appl. 5, 155–164. (doi:10.1016/0925-7721(94)00025-Q)
Efron, B. & Stein, C. 1981 The jackknife estimate of variance. Ann. Stat. 9, 586–596. (doi:10.1214/aos/1176345462)
Evans, D. & Jones, A. J. 2008 Non-parametric estimation of residual moments and covariance. Proc. R. Soc. A 464, 2831–2846. (doi:10.1098/rspa.2007.0195)
Evans, D., Jones, A. J. & Schmidt, W. M. 2002 Asymptotic moments of near neighbour distance distributions. Proc. R. Soc. A 458, 2839–2849. (doi:10.1098/rspa.2002.1011)
Gruber, P. M. 2004 Optimum quantization and its applications. Adv. Math. 186, 456–497. (doi:10.1016/j.aim.2003.07.017)
Hitczenko, P., Janson, S. & Yukich, J. E. 1999 On the variance of the random sphere of influence graph. Random Struct. Algor. 14, 139–152. (doi:10.1002/(SICI)1098-2418(199903)14:2<139::AID-RSA2>3.0.CO;2-E)
Kabatiansky, G. A. & Levenshtein, V. I. 1978 Bounds for packings on a sphere and in space. Probl. Peredachi Inf. 1, 3–25.
Penrose, M. D. 2000 Central limit theorems for k-nearest neighbour distances. Stoch. Proc. Appl. 85, 295–320. (doi:10.1016/S0304-4149(99)00080-0)
Penrose, M. D. 2007 Laws of large numbers in stochastic geometry with statistical applications. Bernoulli 13, 1124–1150. (doi:10.3150/07-BEJ5167)
Penrose, M. D. & Wade, A. R. 2008 Limit theory for the random on-line nearest-neighbor graph. Random Struct. Algor. 32, 125–156. (doi:10.1002/rsa.20185)
Penrose, M. D. & Yukich, J. E. 2001 Central limit theorems for some graphs in computational geometry. Ann. Appl. Prob. 11, 1005–1041. (doi:10.1214/aoap/1015345393)
Penrose, M. D. & Yukich, J. E. 2003 Weak laws of large numbers in geometric probability. Ann. Appl. Prob. 13, 277–303. (doi:10.1214/aoap/1042765669)
Percus, A. G. & Martin, O. C. 1998 Scaling universalities of kth nearest neighbor distances on closed manifolds. Adv. Appl. Math. 21, 424–436. (doi:10.1006/aama.1998.0607)
Petrovskaya, M. & Leontovitch, A. 1982 The central limit theorem for a sequence of random variables with a slowly growing number of dependencies. Theory Probab. Appl. 27, 815–825. (doi:10.1137/1127089)
Reitzner, M. 2003 Random polytopes and the Efron–Stein jackknife inequality. Ann. Prob. 31, 2136–2166. (doi:10.1214/aop/1068646381)
Steele, J. M. 1986 An Efron–Stein inequality for nonsymmetric statistics. Ann. Stat. 14, 753–758. (doi:10.1214/aos/1176349952)
Stone, C. J. 1977 Consistent nonparametric regression. Ann. Stat. 5, 595–620. (doi:10.1214/aos/1176343886)
Wade, A. R. 2007 Explicit laws of large numbers for random nearest-neighbour-type graphs. Adv. Appl. Prob. 39, 326–342. (doi:10.1239/aap/1183667613)
Wyner, A. D. 1965 Capabilities of bounded discrepancy decoding. Bell Syst. Tech. J. 44, 1061–1122.
Yukich, J. E. 1998 Probability theory of classical Euclidean optimization problems. Springer Lecture Notes in Mathematics, no. 1675. Berlin, Germany: Springer.
Zeger, K. & Gersho, A. 1994 The number of nearest neighbors in a Euclidean code. IEEE Trans. Inform. Theory 40, 1647–1649. (doi:10.1109/18.333884)