NONPARAMETRIC STATISTICS SUMMARY

ANDREW TULLOCH

1. Basics

Theorem (The Delta Method). Let Y_n be a sequence of random vectors in R^d such that for some
µ ∈ R^d and a random vector Z, we have n^{1/2}(Y_n − µ) →_d Z. If g : R^d → R is differentiable
at µ, then n^{1/2}(g(Y_n) − g(µ)) →_d ∇g(µ)^T Z.

Proof. For d = 1. Let g'(µ) = ∇g(µ), and define h : R → R by

    h(y) = (g(y) − g(µ))/(y − µ) if y ≠ µ,    h(y) = g'(µ) if y = µ.                     (1.1)

Then by the continuous mapping theorem and Slutsky's theorem,
n^{1/2}(g(Y_n) − g(µ)) = h(Y_n) n^{1/2}(Y_n − µ) →_d g'(µ)Z.

Let X_1, ..., X_n be iid on a probability space (Ω, F, P) with distribution function F. The
empirical distribution function F̂_n is defined by

    F̂_n(x) = (1/n) ∑_{i=1}^n I(X_i ≤ x).                                                 (1.2)

Theorem (Glivenko-Cantelli (1933) - The Fundamental Theorem of Statistics).

    sup_{x∈R} |F̂_n(x) − F(x)| →_{a.s.} 0.                                                (1.3)

Theorem (Dvoretzky-Kiefer-Wolfowitz). Let X_1, ..., X_n ∼ F iid. Then for every ε > 0,

    P( sup_{x∈R} |F̂_n(x) − F(x)| ≥ ε ) ≤ 2e^{−2nε²}.                                     (1.4)

Definition. For p ∈ (0, 1], the quantile function is defined by F^{−1}(p) = inf{x ∈ R : F(x) ≥ p}
and is left-continuous. The sample quantile function is F̂_n^{−1}(p) = inf{x ∈ R : F̂_n(x) ≥ p}.
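
As a quick illustration (my addition, not part of the original summary), both definitions can
be computed directly from sorted data; the helper names below are mine.

import numpy as np

def ecdf(data, x):
    """Empirical distribution function F_hat_n(x) = (1/n) #{i : X_i <= x}."""
    data = np.asarray(data)
    return np.mean(data[:, None] <= np.atleast_1d(x), axis=0)

def sample_quantile(data, p):
    """Sample quantile F_hat_n^{-1}(p) = inf{x : F_hat_n(x) >= p}, i.e. X_(ceil(np))."""
    xs = np.sort(np.asarray(data))
    k = int(np.ceil(len(xs) * p))     # smallest k with k/n >= p
    return xs[k - 1]                  # 1-based order statistic X_(k)

rng = np.random.default_rng(0)
X = rng.normal(size=1000)
print(ecdf(X, [0.0, 1.0]))            # roughly [0.50, 0.84] for standard normal data
print(sample_quantile(X, 0.5))        # roughly the true median 0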
Theorem. Let U_1, U_2, ..., U_n ∼ U(0, 1) iid and p ∈ (0, 1). Then

    √n (U_(⌈np⌉) − p) →_d N(0, p(1 − p)).                                                (1.5)

Proof. Let Y_1, ..., Y_{n+1} be iid Exp(1), let V_n = Y_1 + ··· + Y_{⌈np⌉} and
W_n = Y_{⌈np⌉+1} + ··· + Y_{n+1}. Note that V_n and W_n are independent and

    V_n / (V_n + W_n) =_d U_(⌈np⌉)                                                       (1.6)

by the previous proposition. Then

    √n (V_n/n − p) = √(⌈np⌉/n) · (V_n − ⌈np⌉)/√⌈np⌉ + (⌈np⌉ − np)/√n →_d N(0, p)          (1.7)

by the CLT and Slutsky's theorem. Similarly, √n (W_n/n − q) →_d N(0, q), where q = 1 − p. Then
by the delta method, with g(x, y) = x/(x + y),

    √n (U_(⌈np⌉) − p) =_d √n (g(V_n/n, W_n/n) − g(p, q))                                  (1.8)
                      →_d N(0, ∇g(p, q)^T diag(p, q) ∇g(p, q))                           (1.9)
                      =   N(0, p(1 − p)).                                                (1.10)

Theorem. Let p ∈ (0, 1) and let X_1, ..., X_n be iid F, where F is differentiable at F^{−1}(p)
with positive derivative f(F^{−1}(p)). Then

    √n (X_(⌈np⌉) − F^{−1}(p)) →_d N(0, p(1 − p)/f(F^{−1}(p))²).                           (1.11)

Proof. Let U_1, ..., U_n be iid U(0, 1), so that F^{−1}(U_(⌈np⌉)) =_d X_(⌈np⌉). Then by the
previous theorem and the delta method with g = F^{−1},

    √n (X_(⌈np⌉) − F^{−1}(p)) =_d √n (g(U_(⌈np⌉)) − g(p))                                 (1.12)
                              →_d N(0, p(1 − p)/f(F^{−1}(p))²).                          (1.13)
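
The asymptotic variance in (1.11) is easy to check by simulation. The following sketch (my
addition) compares the Monte Carlo variance of √n(X_(⌈np⌉) − F^{−1}(p)) for standard normal
data with p(1 − p)/f(F^{−1}(p))².

import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
n, p, reps = 1000, 0.25, 2000
k = int(np.ceil(n * p))                     # index of the order statistic X_(ceil(np))

samples = rng.normal(size=(reps, n))
emp_q = np.sort(samples, axis=1)[:, k - 1]  # X_(ceil(np)) in each replication

q = NormalDist().inv_cdf(p)                 # F^{-1}(p) for the standard normal
f_q = NormalDist().pdf(q)                   # f(F^{-1}(p))

print(n * emp_q.var())                      # Monte Carlo variance of sqrt(n) X_(ceil(np))
print(p * (1 - p) / f_q**2)                 # asymptotic variance from (1.11), about 1.86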
For the rest of this section, f̂_h denotes the kernel density estimator
f̂_h(x) = (1/n) ∑_{i=1}^n K_h(x − X_i), where K_h(u) = h^{−1} K(u/h) for a kernel K and
bandwidth h > 0.

Definition. Usually, we prefer to choose h to minimize some expression measuring how well f̂_h
estimates f as a function. We therefore define the Mean Integrated Squared Error (MISE) as

    MISE(f̂_h) = E ∫_{−∞}^{∞} {f̂_h(x) − f(x)}² dx                                         (1.14)
              = ∫_{−∞}^{∞} MSE(f̂_h(x)) dx                                                (1.15)
              = ∫_{−∞}^{∞} ((K_h ⋆ f)(x) − f(x))² dx
                + (1/n) ∫_{−∞}^{∞} ((K_h² ⋆ f)(x) − (K_h ⋆ f)²(x)) dx,                    (1.16)

where the interchange of expectation and integration in (1.15) is justified by Fubini's
theorem, as the integrand is non-negative.

Although exact, this expression depends on h in a complicated way. We therefore seek an
asymptotic approximation to clarify this dependence and facilitate an asymptotically optimal
choice of h. We need the following conditions:
    (i) f is twice differentiable, f' is bounded, and R(f'') = ∫_{−∞}^{∞} f''(x)² dx < ∞.
    (ii) h = h_n is a non-random sequence with h → 0 and nh → ∞ as n → ∞.
    (iii) K is non-negative, ∫_{−∞}^{∞} K(x) dx = 1, ∫_{−∞}^{∞} x K(x) dx = 0,
          µ₂(K) = ∫_{−∞}^{∞} x² K(x) dx < ∞, and R(K) < ∞.

Theorem. Assume that the previous conditions hold. Then, for all x ∈ R,

    MSE(f̂_h(x)) = R(K) f(x)/(nh) + (1/4) h⁴ µ₂²(K) f''(x)² + o(1/(nh) + h⁴)               (1.17)

as n → ∞.

Consider now minimizing the asymptotic MISE (AMISE) R(K)/(nh) + (1/4) h⁴ µ₂²(K) R(f'') with
respect to h, yielding the asymptotically optimal bandwidth

    h_AMISE = ( R(K) / (µ₂²(K) R(f'') n) )^{1/5}.                                         (1.18)

Substituting back, we obtain

    AMISE(f̂_{h_AMISE}) = (5/4) R(K)^{4/5} µ₂(K)^{2/5} R(f'')^{1/5} n^{−4/5}.              (1.19)

Notice the slower rate than the typical O(n^{−1}) parametric rate. Notice also that for "rough"
densities, with larger R(f''), we should use a smaller bandwidth, and that these densities are
harder to estimate.

Theorem. Assume the previous conditions (i), (ii), (iii), and that K is bounded. Then, for all
x ∈ R,

    n^{2/5} (f̂_{h_AMISE}(x) − f(x)) →_d N( (1/2) µ₂(K) f''(x), R(K) f(x) ).               (1.20)

Theorem. If f is the N(0, σ²) density, then R(f'') = (3/(8√π)) σ^{−5}. The normal scale rule
ĥ_NS consists of replacing R(f'') in h_AMISE with (3/(8√π)) σ̂^{−5}, where σ̂ is an estimate of
σ. This tends to oversmooth.
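
To make the bandwidth discussion concrete, here is a minimal kernel density estimator with a
Gaussian kernel and the normal scale bandwidth (my own sketch; plugging R(K) = 1/(2√π),
µ₂(K) = 1 and R(f'') = (3/(8√π)) σ̂^{−5} into (1.18) gives ĥ_NS = (4/3)^{1/5} σ̂ n^{−1/5} ≈
1.06 σ̂ n^{−1/5}). The function names are mine.

import numpy as np

def normal_scale_bandwidth(data):
    """h_NS = (4/3)^(1/5) * sigma_hat * n^(-1/5), the plug-in h_AMISE for a Gaussian kernel
    when R(f'') is estimated by pretending f is N(0, sigma^2)."""
    data = np.asarray(data)
    return (4.0 / 3.0) ** 0.2 * data.std(ddof=1) * len(data) ** (-0.2)

def kde(data, grid, h):
    """f_hat_h(x) = (1/(n h)) sum_i K((x - X_i)/h) with the standard normal kernel K."""
    data, grid = np.asarray(data), np.asarray(grid)
    u = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(2)
X = rng.normal(size=500)
grid = np.linspace(-3, 3, 61)
h = normal_scale_bandwidth(X)
print(h, kde(X, grid, h)[30])   # estimate at x = 0, close to 1/sqrt(2*pi) ≈ 0.399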
The choice of kernel is coupled with the choice of bandwidth, because if we replace K(x) by
(1/2) K(x/2) and we halve the bandwidth, the estimate is unchanged. We therefore fix the scale
by setting µ₂(K) = 1. Minimizing AMISE(f̂_h) over K then amounts to minimizing R(K) subject to

    ∫_{−∞}^{∞} K(x) dx = 1                                                               (1.21)
    ∫_{−∞}^{∞} x K(x) dx = 0                                                             (1.22)
    µ₂(K) = 1                                                                            (1.23)
    K(x) ≥ 0.                                                                            (1.24)

The solution is given by the Epanechnikov kernel (1969),

    K_E(x) = (3/(4√5)) (1 − x²/5) I(|x| ≤ √5).                                           (1.25)

The ratio R(K_E)/R(K) is called the efficiency of a kernel K, because it represents the ratio
of the sample sizes needed to obtain the same AMISE when using K_E compared with K.

    Kernel          Efficiency
    Epanechnikov    1.0
    Normal          0.951
    Triangular      0.986
    Uniform         0.930
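
The efficiencies in the table can be reproduced numerically: rescale each kernel so that
µ₂(K) = 1 (which multiplies R(K) by √µ₂(K)) and compare with R(K_E). The following check is my
own sketch, not part of the notes.

import numpy as np

def moments(kernel, lo, hi, m=200000):
    """Return (R(K), mu_2(K)) by a midpoint rule on [lo, hi]."""
    dx = (hi - lo) / m
    x = lo + dx * (np.arange(m) + 0.5)
    k = kernel(x)
    return (k**2).sum() * dx, (x**2 * k).sum() * dx

kernels = {
    "Epanechnikov": (lambda x: 3 / (4 * np.sqrt(5)) * (1 - x**2 / 5) * (np.abs(x) <= np.sqrt(5)),
                     -np.sqrt(5), np.sqrt(5)),
    "Normal":       (lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi), -8, 8),
    "Triangular":   (lambda x: (1 - np.abs(x)) * (np.abs(x) <= 1), -1, 1),
    "Uniform":      (lambda x: 0.5 * (np.abs(x) <= 1), -1, 1),
}

R_E, mu_E = moments(*kernels["Epanechnikov"])
for name, (k, lo, hi) in kernels.items():
    R, mu2 = moments(k, lo, hi)
    # efficiency = R(K_E)/R(K) after both kernels are rescaled to have mu_2 = 1
    print(name, round(R_E * np.sqrt(mu_E) / (R * np.sqrt(mu2)), 3))
# prints values matching the table: 1.0, 0.951, 0.986, 0.93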
Theorem. A natural estimator of the r-th derivative f^{(r)} of f is given by

    f̂_h^{(r)}(x) = (1/(n h^{r+1})) ∑_{i=1}^n K^{(r)}((x − X_i)/h),                        (1.26)

obtained by differentiating the standard KDE f̂_h. Under regularity conditions,

    MSE(f̂_h^{(r)}(x)) = R(K^{(r)}) f(x)/(n h^{2r+1}) + (1/4) h⁴ µ₂²(K) f^{(r+2)}(x)²
                        + o(1/(n h^{2r+1}) + h⁴).                                        (1.27)

This leads to an optimal bandwidth of order n^{−1/(2r+5)} and a rate of convergence of
n^{−4/(2r+5)}.

Theorem. It is possible to make the dominant integrated squared bias term of MISE(f̂_h) vanish
by choosing µ₂(K) = 0. This means we have to allow the kernel to take negative values, so the
resulting estimate need not be a density. We can set f̃_h(x) = max(f̂_h(x), 0) and then
renormalize, but then we lose smoothness. Nevertheless, we define K to be a k-th order kernel
if, writing µ_j(K) = ∫_{−∞}^{∞} x^j K(x) dx, we have

    µ₀(K) = 1,                                                                           (1.28)

µ_j(K) = 0 for j = 1, ..., k − 1, µ_k(K) ≠ 0, and

    ∫_{−∞}^{∞} |x|^k |K(x)| dx < ∞.                                                      (1.29)

If f has k continuous bounded derivatives with R(f^{(k)}) < ∞, then it is shown (example sheet)
that h_AMISE = c n^{−1/(2k+1)} and

    AMISE(f̂_{h_AMISE}) = O(n^{−2k/(2k+1)}).                                              (1.30)

Thus, under increasingly strong smoothness assumptions, convergence rates arbitrarily close to
the parametric rate of O(n^{−1}) can be obtained. The practical benefit of higher order kernels
is not always apparent, and the negativity/smoothness/bandwidth selection problems mean that
they are rarely used in practice.
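
As an illustration of the k-th order kernel definition (my addition), K₄(x) = (1/2)(3 − x²)φ(x),
with φ the standard normal density, is a standard fourth-order kernel; a quick numerical check
confirms µ₀ = 1, µ₁ = µ₂ = µ₃ = 0 and µ₄ ≠ 0.

import numpy as np

def K4(x):
    """Fourth-order Gaussian-based kernel (3 - x^2)/2 * phi(x)."""
    return 0.5 * (3 - x**2) * np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

# moments mu_j(K) = integral of x^j K(x) dx, midpoint rule on [-10, 10]
m = 400000
dx = 20.0 / m
x = -10 + dx * (np.arange(m) + 0.5)
for j in range(5):
    print(j, round(float((x**j * K4(x)).sum() * dx), 6))
# mu_0 = 1, mu_1 = mu_2 = mu_3 = 0 (up to numerical error), mu_4 = -3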
2. Nonparametric Regression

Assume a fixed design. The local polynomial estimator m̂_h(x; p) of degree p with kernel K and
bandwidth h is constructed by fitting a polynomial of degree p using weighted least squares,
where the weight K_h(x_i − x) is associated with the observation (x_i, Y_i). More precisely,
m̂_h(x; p) = β̂₀, where β̂ = (β̂₀, β̂₁, ..., β̂_p) minimizes

    ∑_{i=1}^n (Y_i − β₀ − β₁(x_i − x) − ··· − β_p(x_i − x)^p)² K_h(x_i − x)               (2.1)

over β ∈ R^{p+1}. The theory of weighted least squares gives

    (X^T K X) β̂ = X^T K y,                                                               (2.2)

where X is the design matrix with i-th row (1, x_i − x, ..., (x_i − x)^p) and
K = diag(K_h(x_1 − x), ..., K_h(x_n − x)).

For p = 0, a simple expression (the Nadaraya-Watson, or local constant, estimator) exists:

    m̂_h(x; 0) = ∑_{i=1}^n K_h(x_i − x) Y_i / ∑_{i=1}^n K_h(x_i − x).                     (2.3)

For p = 1, we call this a local linear estimator, and we have the explicit result

    m̂_h(x; 1) = (1/n) ∑_{i=1}^n [ (S_{2,h}(x) − S_{1,h}(x)(x_i − x))
                / (S_{2,h}(x) S_{0,h}(x) − S_{1,h}(x)²) ] K_h(x_i − x) Y_i               (2.4)

with

    S_{r,h}(x) = (1/n) ∑_{i=1}^n (x_i − x)^r K_h(x_i − x).                               (2.5)

All local polynomial estimators are of the form

    ∑_{i=1}^n W(x_i, x) Y_i.                                                             (2.6)

This type of estimator is called a linear estimator, and the set of weights {W(x_i, x)} is
called the effective kernel.

2.1. Mean Squared Error Approximations. For convenience, let x_i = i/n. We consider the
following conditions:
    (i) m is twice continuously differentiable on [0, 1] and is bounded, and the error
        variance function v is continuous.
    (ii) h = h_n, h_n → 0, nh → ∞.
    (iii) K is a non-negative probability density, symmetric, and zero outside [−1, 1], with
          R(K) = ∫ K²(x) dx < ∞ and µ₂(K) = ∫ x² K(x) dx < ∞.

Theorem. Under the previous conditions, for x ∈ (0, 1), we have

    MSE(m̂_h(x; 1)) = R(K) v(x)/(nh) + (1/4) h⁴ (m''(x))² µ₂²(K) + o(1/(nh) + h⁴).        (2.7)
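
The Nadaraya-Watson and local linear estimators in (2.3)-(2.5) translate directly into code.
The sketch below is my own, using the Epanechnikov kernel supported on [−1, 1]; the function
names are not from the notes.

import numpy as np

def K_h(u, h):
    """Scaled Epanechnikov kernel K_h(u) = K(u/h)/h with K supported on [-1, 1]."""
    t = u / h
    return 0.75 * (1 - t**2) * (np.abs(t) <= 1) / h

def nadaraya_watson(x_grid, x, y, h):
    """m_hat_h(x; 0), the local constant fit of (2.3)."""
    w = K_h(x_grid[:, None] - x[None, :], h)        # shape (len(grid), n)
    return (w * y).sum(axis=1) / w.sum(axis=1)

def local_linear(x_grid, x, y, h):
    """m_hat_h(x; 1), the local linear fit of (2.4)-(2.5)."""
    d = x[None, :] - x_grid[:, None]                # x_i - x, shape (len(grid), n)
    w = K_h(d, h)
    S0, S1, S2 = w.mean(axis=1), (d * w).mean(axis=1), (d**2 * w).mean(axis=1)
    num = ((S2[:, None] - S1[:, None] * d) * w * y).mean(axis=1)
    return num / (S2 * S0 - S1**2)

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(size=200))
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=200)
grid = np.linspace(0.05, 0.95, 19)
print(nadaraya_watson(grid, x, y, h=0.1)[:3])
print(local_linear(grid, x, y, h=0.1)[:3])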
2.2. Splines. Let n ≥ 3, and consider the fixed homoscedastic design

    Y_i = m(x_i) + σ ε_i,                                                                (2.8)

where the ε_i are iid with E(ε_i) = 0 and V(ε_i) = 1.

Another natural idea for estimating the regression curve m is to balance the fidelity of the
fit to the data against the roughness of the resulting curve. This can be done by minimizing

    ∑_{i=1}^n (Y_i − g̃(x_i))² + λ ∫ g̃''(x)² dx                                           (2.9)

over g̃ ∈ S₂[a, b], the set of twice continuously differentiable functions on [a, b]. Here λ is
a regularization parameter: as λ → ∞, the curve is very close to the linear regression line; as
λ → 0, the resulting curve closely fits the observations.

Definition. A cubic spline is a function g : [a, b] → R satisfying
    (i) g is a cubic polynomial on each of [a, x₁), (x₁, x₂), ..., (x_n, b];
    (ii) g is twice continuously differentiable on [a, b].

Proposition. For a given g = (g₁, ..., g_n)^T, there exists a unique natural cubic spline g
with knots x₁, ..., x_n such that g(x_i) = g_i for i = 1, ..., n. Moreover, there exists a
non-negative definite matrix K such that

    ∫_a^b g''(x)² dx = g^T K g.                                                          (2.10)

We call g the natural cubic spline interpolant to g at x₁, ..., x_n.

Theorem. Among all g̃ ∈ S₂[a, b] satisfying g̃(x_i) = g_i, i = 1, ..., n, the natural cubic
spline interpolant to g at x₁, ..., x_n uniquely minimizes

    ∫_a^b g̃''(x)² dx.                                                                    (2.11)

Recall that Y_i = m(x_i) + σ ε_i with m ∈ S₂[a, b] and a < x₁ < ··· < x_n < b. We seek to
minimize

    G_λ(g̃) = ∑_{i=1}^n (Y_i − g̃(x_i))² + λ ∫_a^b g̃''(x)² dx                              (2.12)

over g̃ ∈ S₂[a, b].

Theorem. For each λ > 0, there is a unique solution ĝ minimizing G_λ(g̃) over g̃ ∈ S₂[a, b].
This is the natural cubic spline with fitted values

    ĝ = (I + λK)^{−1} Y.                                                                 (2.13)

The cross-validation method validates the estimated curve by leaving out the i-th observation
and comparing the fit at the i-th point:

    CV(λ) = (1/n) ∑_{i=1}^n (Y_i − ĝ_{−i,λ}(x_i))²,                                      (2.14)

where ĝ_{−i,λ} is obtained by minimizing G_λ over all data points except the i-th,

    ∑_{j≠i} (Y_j − g̃(x_j))² + λ ∫_a^b g̃''(x)² dx.                                        (2.15)

Theorem.

    CV(λ) = (1/n) ∑_{i=1}^n ( (Y_i − ĝ_λ(x_i)) / (1 − A_ii) )²,                          (2.16)

where A = (I + λK)^{−1} and

    ∫ ĝ_λ''(x)² dx = ĝ_λ^T K ĝ_λ.                                                        (2.17)
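
A small numerical sketch of the spline smoother (my addition, not from the notes): the penalty
matrix K of (2.10) is built here with the banded Q, R construction described in Green and
Silverman (1994), and the smoother then applies (2.13) and the leave-one-out identity (2.16).

import numpy as np

def penalty_matrix(x):
    """K with integral of g''^2 = g^T K g for the natural cubic spline with knots x,
    via K = Q R^{-1} Q^T (banded construction, Green & Silverman 1994)."""
    n = len(x)
    h = np.diff(x)                                   # h_i = x_{i+1} - x_i
    Q = np.zeros((n, n - 2))
    R = np.zeros((n - 2, n - 2))
    for c in range(n - 2):
        Q[c, c], Q[c + 1, c], Q[c + 2, c] = 1 / h[c], -1 / h[c] - 1 / h[c + 1], 1 / h[c + 1]
        R[c, c] = (h[c] + h[c + 1]) / 3
        if c + 1 < n - 2:
            R[c, c + 1] = R[c + 1, c] = h[c + 1] / 6
    return Q @ np.linalg.solve(R, Q.T)

def smoothing_spline(x, y, lam):
    """Fitted values g_hat = (I + lambda K)^{-1} Y of (2.13) and the CV score of (2.16)."""
    A = np.linalg.inv(np.eye(len(x)) + lam * penalty_matrix(x))   # the smoother matrix
    g_hat = A @ y
    cv = np.mean(((y - g_hat) / (1 - np.diag(A))) ** 2)
    return g_hat, cv

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(size=60))
y = np.cos(3 * x) + 0.2 * rng.normal(size=60)
for lam in (1e-4, 1e-2, 1.0):
    print(lam, smoothing_spline(x, y, lam)[1])       # CV(lambda) for a few values of lambda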
3. Nearest Neighbour Classification

Definition. A function g : R^d → {0, 1} is called a classifier. If the distribution of (X, Y)
is known, we can minimize the risk P(g(X) ≠ Y) = L(g) over g : R^d → {0, 1}. The minimizer g*
is called the Bayes classifier, and L(g*) is called the Bayes risk.

Lemma. For a classifier g̃ which has the form

    g̃(x) = 1 if ν̂(x) > 1/2, and 0 otherwise,                                             (3.1)

where ν(x) = P(Y = 1 | X = x) and ν̂ is an estimate of ν, we have

    P(g̃(X) ≠ Y) − L* ≤ 2 E|ν̂(X) − ν(X)|.                                                 (3.2)

Definition (k-nearest neighbour classification). A k-nn classifier g_n is defined by

    g_n(x) = 1 if ∑_{i=1}^n W_{ni}(X) I(Y_i = 1) > ∑_{i=1}^n W_{ni}(X) I(Y_i = 0),
    and 0 otherwise,                                                                     (3.3)

which is equivalent to

    ∑_{i=1}^n W_{ni}(X) I(Y_i = 1) > 1/2  ⟺  ∑_{i=1}^n W_{ni}(X) Y_i > 1/2,               (3.4)

where

    W_{ni}(X) = 1/k if X_i is a k-nearest neighbour of X, and zero otherwise.            (3.5)

Definition. For a given distribution of (X, Y), we say g_n is consistent if
P(g_n(X) ≠ Y) − L* → 0. We say g_n is strongly consistent if

    P( lim_{n→∞} L(g_n) = L(g*) ) = 1.                                                   (3.6)

Theorem. If k → ∞ and k/n → 0, then for all distributions of (X, Y), the k-nn estimates g_n
are consistent.
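
A direct implementation of (3.3)-(3.5) (my own sketch, using Euclidean distances and a
majority vote over the k nearest training points):

import numpy as np

def knn_classify(X_train, y_train, X_new, k):
    """k-nn classifier g_n: predict 1 iff the average of Y over the k nearest
    neighbours exceeds 1/2, as in (3.3)-(3.5)."""
    X_train, X_new = np.atleast_2d(X_train), np.atleast_2d(X_new)
    dists = np.linalg.norm(X_new[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]       # indices of the k nearest neighbours
    votes = y_train[nearest].mean(axis=1)            # sum_i W_ni(X) Y_i with W_ni = 1/k
    return (votes > 0.5).astype(int)

rng = np.random.default_rng(5)
n = 400
X = rng.normal(size=(n, 2))
y = (X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=n) > 0).astype(int)   # noisy linear rule
X_test = np.array([[1.0, 1.0], [-1.0, -1.0]])
print(knn_classify(X, y, X_test, k=15))              # class 1 near (1,1), class 0 near (-1,-1)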
4. Minimax Lower Bounds

Definition. As a first attempt to understand a nonparametric estimation problem, we consider
the minimax risk

    R(Θ) = inf_{θ̃} sup_{θ∈Θ} E_θ L(θ̃, θ).                                                (4.1)

Definition. If we can find θ̂* which minimizes sup_{θ∈Θ} E_θ L(θ̃, θ) over all estimators θ̃, we
call θ̂* a minimax estimator. However, it is very difficult to find θ̂*. If c γ_n ≤ R(Θ) ≤ C γ_n,
we call γ_n the minimax rate of convergence.

Lemma (Le Cam's two points lemma). Let P be a family of probability measures on (X, A), and
let (Θ, d) be a pseudo-metric space, with

    d : Θ × Θ → [0, ∞)                                                                   (4.2)

satisfying

    d(θ₁, θ₂) = d(θ₂, θ₁),    d(θ₁, θ₂) + d(θ₂, θ₃) ≥ A d(θ₁, θ₃).                       (4.3)

Let θ : P → Θ, where θ(P) is the parameter of interest (P ∈ P). With θ₀ = θ(P₀) and
θ₁ = θ(P₁), under the two conditions
    (i) d(θ₀, θ₁) ≥ δ > 0,
    (ii) h²(P₀, P₁) ≤ C < 1,
where h²(P₀, P₁) is the squared Hellinger distance ∫ (√dP₀ − √dP₁)², we have, for all
estimators θ̃,

    sup_{P∈P} E_P d(θ̃, θ(P)) ≥ (Aδ/2)(1 − √C).                                           (4.4)

Theorem (Nonparametric regression). Let Y_i = m(x_i) + ε_i, with ε_i ∼ N(0, 1), x_i = i/n, and
m ∈ Θ, where Θ is the set of all twice continuously differentiable functions on [0, 1] with m''
bounded. Then for any estimator m̃ and any x₀ ∈ [0, 1],

    sup_{m∈Θ} E(m̃(x₀) − m(x₀))² ≥ C n^{−4/5}.                                            (4.5)

5. Extreme Value Theory

Let X₁, ..., X_n be an IID sample from a distribution function F, and denote
X_(n) = max{X₁, ..., X_n} the maximum order statistic. Without any normalization,
X_(n) → x* = inf{x : F(x) = 1}. This is not overly interesting, since the limit distribution is
degenerate (we call F non-degenerate if there does not exist a ∈ R such that F(x) = I(x ≥ a)).

We may ask if there exist {a_n} > 0, {b_n}, and a non-degenerate G such that

    P( (X_(n) − b_n)/a_n ≤ x ) → G(x)                                                    (5.1)

for all continuity points x of G. Classical extreme value theory starts by asking:
    (i) What kind of G appears in the limit of (5.1)?
    (ii) Can we characterize F such that (5.1) holds for a specific limit distribution G?
For the first question, we have the Extremal Types theorem. For the second question, we have
the "domain of attraction" problem.

Recall that P(X_(n) ≤ x) = F(x)^n. We say that F is in the domain of attraction of G
(F ∈ D(G)) if there exist {a_n} > 0, {b_n} and a non-degenerate G such that

    P( (X_(n) − b_n)/a_n ≤ x ) = F(a_n x + b_n)^n → G(x)                                 (5.2)

for all continuity points x of G, and we write F(a_n x + b_n)^n ↪ G(x).

We say that G₁ and G₂ are of the same type if G₁(ax + b) = G₂(x) for some a > 0, b. The next
lemma shows that if F ∈ D(G₁) and F ∈ D(G₂), then G₁ and G₂ are of the same type.

Lemma. Suppose X₁, ..., X_n is an IID sample from F and there exist {a_n} > 0, {b_n} and
non-degenerate G such that F(a_n x + b_n)^n ↪ G(x). Then there exist {α_n} > 0, {β_n} and
non-degenerate G* such that F(α_n x + β_n)^n ↪ G*(x) if and only if α_n/a_n → a for some a > 0
and (β_n − b_n)/a_n → b for some b. Then we can let G*(x) = G(ax + b).

Definition. G is max-stable if for every n ∈ N, there exist {a_n} > 0, {b_n} such that
G^n(a_n x + b_n) = G(x).

Theorem. D(G) is non-empty if and only if G is max-stable.

Theorem. If F ∈ D(G), then G must belong to one of the following distributions (within type):
    (i) Fréchet: G_{1,α}(x) = exp(−x^{−α}), x > 0, α > 0;
    (ii) Negative Weibull: G_{2,α}(x) = exp(−(−x)^α), x < 0, α > 0;
    (iii) Gumbel: G₃(x) = exp(−exp(−x)), x ∈ R.
Conversely, these distributions can appear as such limits in (5.1).

Remark.
    (i) Using X_(1) = −max{−X₁, ..., −X_n}, we have equivalent theorems in terms of normalized
        minima.
    (ii) Sometimes we cannot have a non-degenerate limit G for the normalized maxima - for
        example, X₁, ..., X_n ∼ Bern(1/2) and X_(n).
    (iii) We can combine these three types into the Generalized Extreme Value distribution
        (GEV)

        G(x; µ, σ, γ) = exp( −(1 + γ (x − µ)/σ)^{−1/γ} )                                 (5.3)

        with 1 + γ (x − µ)/σ > 0, µ ∈ R, γ ∈ R, σ > 0. Fréchet corresponds to γ > 0 with
        α = 1/γ, Negative Weibull to γ < 0 with α = −1/γ, and Gumbel corresponds to the limit
        γ → 0.
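
The three extreme value types above are conveniently handled through the single GEV form (5.3).
The following small function (my addition; the name gev_cdf is mine) evaluates the GEV
distribution function, treating γ = 0 as the Gumbel limit.

import numpy as np

def gev_cdf(x, mu=0.0, sigma=1.0, gamma=0.0):
    """GEV distribution function G(x; mu, sigma, gamma) of (5.3).
    gamma > 0: Frechet type; gamma < 0: negative Weibull type; gamma = 0: Gumbel limit."""
    z = (np.atleast_1d(np.asarray(x, dtype=float)) - mu) / sigma
    if gamma == 0.0:
        return np.exp(-np.exp(-z))                    # Gumbel: exp(-exp(-z))
    t = 1.0 + gamma * z
    inside = t > 0                                    # the support is {1 + gamma*z > 0}
    out = np.full_like(z, 0.0 if gamma > 0 else 1.0)  # value of the cdf off the support
    out[inside] = np.exp(-t[inside] ** (-1.0 / gamma))
    return out

x = np.array([0.5, 1.0, 2.0])
print(gev_cdf(x, gamma=1.0))    # Frechet type with alpha = 1/gamma = 1
print(gev_cdf(x, gamma=0.0))    # Gumbel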
5.1. Necessary and Sufficient Conditions for Convergence.

Definition. We say a function l : [C, ∞) → (0, ∞) is "slowly varying" if
lim_{x→∞} l(tx)/l(x) = 1 for all t > 0; for example, l(x) = log x, log log x, (log x)^α. We say
a function r_α : [C, ∞) → (0, ∞) is "regularly varying" with index α ∈ R if r_α(x) = x^{−α} l(x)
where l is slowly varying - so, for instance, r₂(x) = x^{−2} log x.

We define the expected residual lifetime as

    R(x) = E(X − x | X > x) = (1/(1 − F(x))) ∫_x^{x*} (1 − F(y)) dy,                     (5.4)

where x* = inf{x : F(x) = 1} and F̄(x) = 1 − F(x).

Theorem. F ∈ D(G_{1,α}) if and only if x* = ∞ and F̄(x) = x^{−α} l(x), where l is slowly
varying. We can choose b_n = 0 and a_n = F^{−1}(1 − 1/n), for which F^n(a_n x + b_n) ↪ G_{1,α}(x)
is satisfied.

F ∈ D(G_{2,α}) if and only if x* < ∞ and F̄(x* − 1/x) = x^{−α} l(x) for x > 0, with l slowly
varying. We can choose b_n = x* and a_n = x* − F^{−1}(1 − 1/n) for convergence.

F ∈ D(G₃) if and only if, for every t,

    F̄(x + t R(x)) / F̄(x) → e^{−t}   as x → x*.                                          (5.5)

We can choose b_n = F^{−1}(1 − 1/n) and a_n = R(b_n).
Lemma (?). Suppose there exist a_n > 0, b_n such that n(1 − F(a_n x + b_n)) → u(x). Then

    F^n(a_n x + b_n) ↪ exp(−u(x)).                                                       (5.6)
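
To illustrate the Fréchet case of the domain-of-attraction theorem above (my own sketch): for
the Pareto distribution F(x) = 1 − x^{−α}, x ≥ 1, we have x* = ∞ and F̄(x) = x^{−α} with l ≡ 1,
so with b_n = 0 and a_n = F^{−1}(1 − 1/n) = n^{1/α} the normalized maxima are approximately
G_{1,α}-distributed.

import numpy as np

rng = np.random.default_rng(6)
alpha, n, reps = 2.0, 2000, 5000

# Pareto(alpha) samples by inverse transform: X = (1 - U)^(-1/alpha), so F(x) = 1 - x^(-alpha)
U = rng.uniform(size=(reps, n))
X = (1.0 - U) ** (-1.0 / alpha)

a_n = n ** (1.0 / alpha)                 # a_n = F^{-1}(1 - 1/n), b_n = 0
M = X.max(axis=1) / a_n                  # normalized maxima

x0 = 1.5
print(np.mean(M <= x0))                  # empirical P(X_(n)/a_n <= x0)
print(np.exp(-x0 ** (-alpha)))           # Frechet limit G_{1,alpha}(x0), about 0.64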