Advanced Review

Orthogonal series density estimation

Sam Efromovich∗

∗Correspondence to: [email protected]
Department of Mathematical Sciences, The University of Texas at Dallas, Richardson, TX 75083, USA
DOI: 10.1002/wics.97
Orthogonal series density estimation is a powerful nonparametric estimation methodology that allows one to analyze and present data at hand without any prior opinion about the shape of an underlying density. The idea behind the construction of an adaptive orthogonal series density estimator is explained using the classical example of a direct sample from a univariate density. Data-driven estimators that have been used for years, as well as recently proposed procedures, are reviewed. Orthogonal series estimation is also known for its sharp minimax properties, which are explained. Furthermore, applications of the orthogonal series methodology to more complicated settings, including censored and biased data as well as estimation of the density of regression errors and of the conditional density, are also presented. © 2010 John Wiley & Sons, Inc. WIREs Comp Stat 2010 2 467–476
Estimation of the probability density function, or simply the density, is one of the most fundamental problems in statistics.1–13 A traditional parametric approach assumes that the density is known up to several parameters (like a normal density with unknown mean and variance); estimators of those parameters are then plugged into the density, creating a parametric density estimate. As a result, under the parametric approach the shape of the density is assumed to be known: in the normal density example, this means that only the location and scale of this famous bell-shaped curve are unknown. Nonparametric methodology relaxes this assumption; it relies solely on the data and lets the data speak for themselves.
The simplest setting is when a direct sample from
an underlying density is available. Let us formulate
it. Consider a univariate continuous random variable
X distributed according to a probability density f (x).
This means that for any practically interesting set A of real numbers we can find the probability (likelihood) that X belongs to this set via $\Pr(X \in A) = \int_A f(x)\,dx$; for instance, for $A = [a, b]$ we get $\Pr(a \le X \le b) = \int_a^b f(x)\,dx$. One then observes n independent realizations $X_1, X_2, \ldots, X_n$ of X, also called a sample of size n from X (or from f), and the aim is to find an estimate $\tilde f(x) = \tilde f(x; X_1, \ldots, X_n)$,
based on those observations, that fits the underlying
density f (x).
A customary and very natural use of density
estimates is to present a data set at hand, as well
as to make a kind of informal investigation of
properties of the data. A typical approach, used in the
literature, is to choose an interesting real data set and
then show how an estimator exhibits an underlying
density (data). Unfortunately, there is little to learn
from this approach because we do not know the
underlying density and thus do not know how well
the chosen estimator performs. A better way to learn
about possible samples from densities of interest and
about a particular estimator is to generate data from
those densities and then visualize the data, the corresponding histograms, and the estimates. Let us use this approach
for three specific densities and check what orthogonal
series estimation can offer.
Figure 1 exhibits three simulations, one per column, each of 50 observations shown by crosses, from three different underlying densities supported on [0, 1]. In the left column of diagrams the underlying density is uniform (it is equal to 1 for all x ∈ [0, 1]); for the others it is shown by the solid lines. The top row of diagrams shows how the most widely used nonparametric probability density estimator, the histogram, performs; a nice overview of the histogram can be found in Scott.14 The histograms shown in the top diagrams are the default ones recommended by the statistical software packages R and S-PLUS; in the bottom diagrams the bin width is half as long.
FIGURE 1 | Performance of the default R (and S-PLUS) histogram is shown in the top diagrams, and performance of the orthogonal series Universal estimator (dashed lines) is shown in the bottom diagrams for the same simulated samples of size n = 50. The underlying density in the left diagrams is uniform, and in the other diagrams it is shown by the solid line. The bottom diagrams exhibit the underlying samples using histograms with smaller bin width.
The main goal of any nonparametric estimator is to exhibit the shape of the density. Does the default histogram do a good job? For the left diagram the shape is plainly bimodal, far from the underlying horizontal line of the uniform density. However, this is not the fault of the histogram: the diagram below exhibits a histogram with smaller bins, and this histogram also exhibits at least two pronounced modes. For this particular sample, one needs to make the bin width much larger than the one chosen by R to get a histogram which resembles the uniform density. What we see here is a typical situation in nonparametric estimation where the
art (or science) of smoothing data is paramount. The
dashed line in the left bottom diagram, which is the
orthogonal series estimate, indicates the underlying
uniform density because it correctly smoothes the
data. It will be explained shortly how the estimate is
constructed.
The middle row of diagrams shows how the histogram and the orthogonal series estimate exhibit the underlying normal density shown by the solid line. The default histogram in the middle top diagram correctly exhibits the unimodal shape of the density, but the histogram is obviously skewed. Interestingly, here the histogram with shorter bins, shown in the
bottom diagram, exhibits a less skewed density. Note, however, that using wider bins here, a move that could benefit the uniform case, would hurt the normal case. As a result, each case requires its own correct smoothing. The orthogonal series density estimator,
shown in the bottom diagram by the dashed line,
exhibits a perfectly symmetric bell-shaped curve but
its magnitude is not large enough. (This is another
important lesson—while the sample size n = 50 is
practically asymptotic for parametric estimates, for
nonparametric estimates it is only the onset of reliable
estimation. The interested reader may use software5 to
get a feeling of what sample size is ‘right’ for estimating
different types of densities.) Finally, the right column
of diagrams exhibits results for the underlying bimodal
density (a mixture of two normal densities). Let us first
look at the right top histogram. It perfectly indicates
the bimodal nature of the density. However, if you
force yourself to forget for a moment that you know
the underlying bimodal density and quickly glance at
the left histogram for the uniform underlying density,
then you may feel a bit skeptical about the bimodal
shape of the default histogram. On the other hand,
the histogram with shorter bins, shown in the right
bottom diagram, presents the bimodal shape much
better, and the orthogonal series density estimate also
supports this conclusion. Please note that in all three
cases the orthogonal series estimator corresponds to a
histogram with correctly chosen bin width as well as
with correctly smoothed bins. This is exactly what a
good smooth nonparametric density estimator should
do.
How is the orthogonal series estimator used above constructed? This will be explained shortly; first, let us explain the pivotal idea of orthogonal series estimation of a univariate density, which is due to Čencov.15
Assume that the random variable X is supported
on [0, 1], that is, P(X ∈ [0, 1]) = 1, and that the
probability density f of X is square integrable. Then
the density may be approximated with any desired accuracy by a partial sum (truncated orthogonal series),

$$f_J(x) := \sum_{j=0}^{J} \theta_j \varphi_j(x), \quad 0 \le x \le 1, \quad \text{where} \quad \theta_j = \int_0^1 \varphi_j(x) f(x)\,dx. \tag{1}$$
Here, {ϕ j } is an orthonormal basis which may
be trigonometric, polynomial, spline, wavelet, and
so on. A discussion of different bases and their
properties can be found in Refs 3,5,16,17. To be specific, here and in what follows we consider the cosine basis $\{\varphi_0(x) = 1,\ \varphi_j(x) = \sqrt{2}\cos(\pi j x),\ j = 1, 2, \ldots\}$. The parameter J in Eq. (1) is called the cutoff, and $\theta_j$ is called the jth Fourier coefficient of f.
Note that $\theta_0 = \int_0^1 f(x)\,dx = P(X \in [0, 1]) = 1$; thus $\theta_0$ is known. Furthermore, because f is the density, the jth Fourier coefficient can be written as the expectation

$$\theta_j = \int_0^1 f(x)\varphi_j(x)\,dx = E\{\varphi_j(X)\}. \tag{2}$$
This is the pivotal result for any orthogonal series estimator because it implies the possibility of estimating $\theta_j$ via the sample mean estimator

$$\hat\theta_j = n^{-1}\sum_{l=1}^{n}\varphi_j(X_l). \tag{3}$$
Let us calculate the mean and the variance of this estimator. First, the estimator is unbiased because

$$E\{\hat\theta_j\} = \int_0^1 \varphi_j(x) f(x)\,dx = \theta_j. \tag{4}$$
The variance is easily calculated with the help of the elementary trigonometric identity $\cos^2(\alpha) = [1 + \cos(2\alpha)]/2$, which allows us to write

$$\mathrm{Var}(\hat\theta_j) = E(\hat\theta_j - \theta_j)^2 = n^{-1}\bigl[1 + 2^{-1/2}\theta_{2j} - \theta_j^2\bigr] =: n^{-1} d_j. \tag{5}$$
Fourier coefficients of any square integrable density decrease as j increases, due to Parseval's identity

$$\int_0^1 f^2(x)\,dx = 1 + \sum_{j=1}^{\infty}\theta_j^2. \tag{6}$$
As a result, $d_j \to d = 1$ as j increases, and this allows us to introduce an important rule of thumb, supported by the asymptotic theory, that the variance of the sample mean estimate $\hat\theta_j$ is $dn^{-1}$ (i.e., $d_j \approx d$), where d is called the coefficient of difficulty. For the considered case of direct observations d = 1, but later we will consider more complicated cases with indirect data where d > 1; as a result, we always keep d in formulas. This coefficient describes the complexity of the data at hand, which explains its name.
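To make relations (1)–(5) concrete, here is a minimal Python sketch; the function names and the simulated Beta sample are illustrative choices of this review's editing, not part of the original paper.

```python
import numpy as np

def cosine_basis(j, x):
    """Cosine basis on [0, 1]: phi_0(x) = 1, phi_j(x) = sqrt(2) cos(pi j x), j >= 1."""
    x = np.asarray(x, dtype=float)
    return np.ones_like(x) if j == 0 else np.sqrt(2.0) * np.cos(np.pi * j * x)

def empirical_coefficients(sample, J):
    """Sample-mean estimates (3) of theta_1, ..., theta_J; theta_0 = 1 is known."""
    return np.array([cosine_basis(j, sample).mean() for j in range(1, J + 1)])

rng = np.random.default_rng(0)          # illustrative simulated data
sample = rng.beta(2.0, 5.0, size=50)    # a density supported on [0, 1]
theta_hat = empirical_coefficients(sample, J=15)
# For direct data d = 1, so each theta_hat[j] has variance about d/n = 1/50 by Eq. (5).
```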
One of the attractive features of orthogonal
series estimation is the simplicity of considering
multivariate densities. To do this, a tensor-product
basis is used. Furthermore, both continuous and
 2010 Jo h n Wiley & So n s, In c.
469
Advanced Review
www.wiley.com/wires/compstats
discrete components can be considered; see a
discussion in Refs 4–6,10,11,18,19.
Similarly to parametric density estimators, nonparametric orthogonal series estimators are recommended only if they possess desired statistical properties; see a discussion in Refs 5,8,11–13,20,21. A density estimator $\hat f_n$ of the density f, based on a sample of size n, is called unbiased if $E\{\hat f_n(x)\} = f(x)$ for all x. It is shown by Rosenblatt22 that no bona fide unbiased density estimator exists for all continuous densities. As a result, a nonparametric density estimator $\hat f_n(x)$ can only be asymptotically unbiased for f, that is, $E\{\hat f_n(x)\} \to f(x)$ as $n \to \infty$. The next property to consider is pointwise consistency of a density estimator, meaning that $\hat f_n(x) \to f(x)$ in probability for every x. Another measure of performance, the most popular in the literature, is the mean integrated squared error (MISE), defined as $\mathrm{MISE} = E\{\int_{-\infty}^{\infty}[\hat f_n(x) - f(x)]^2\,dx\}$. This measure is often referred to as the $L_2$ approach. Some statisticians favor other $L_p$ approaches, in which $E\{[\int_{-\infty}^{\infty}|\hat f_n(x) - f(x)|^p\,dx]^{1/p}\}$ measures the performance of $\hat f_n$; in particular, the book3 argues in favor of the $L_1$ approach. Other popular measures of performance are defined via the Kullback–Leibler relative entropy and the Hellinger distance; see Refs 3,4.
Results (3)–(5) are the foundation of any orthogonal density estimation procedure. The next section presents the many estimation procedures suggested in the literature. The section Asymptotic Theory is devoted to asymptotic results, and the section Beyond the Standard Setting explains how more complicated settings, in particular those with indirectly observed data, can be explored via the orthogonal series approach.
ESTIMATORS
Choosing an Orthogonal Series
The choice primarily depends on the support of the density. In general, when the support is the real line (−∞, ∞) or the half line [0, ∞), Hermite and Laguerre series, respectively, are recommended; see Refs 3,17,23,24. If f has compact support, for instance [0, 1], then trigonometric (or Fourier) series are recommended. In particular, the cosine basis, defined in the introductory
a compact support, for instance [0, 1], then trigonometric (or Fourier) series are recommended. In particular, the cosine basis, defined in the introductory
section, is convenient because it nicely approximates
aperiodic densities. A discussion of trigonometric
bases can be found in Refs 2,3,5,11,25,26. Classical orthogonal polynomials, including Legendre,
Gegenbauer, Jacobi, and Chebyshev, are a popular choice as well; see Refs 17,27,28. Wavelet bases
are another option. Wavelets may help to visualize local changes in frequency of the density as well as its possible discontinuities.
A multiresolution wavelet series is written as

$$f(x) = \sum_{k=-\infty}^{\infty} s_{j_0 k}\,\phi_{j_0 k}(x) + \sum_{j=j_0}^{\infty}\sum_{k=-\infty}^{\infty} d_{jk}\,\psi_{jk}(x).$$

Here $\psi_{jk}(x) = 2^{j/2}\psi(2^j x - k)$ and $\phi_{jk}(x) = 2^{j/2}\phi(2^j x - k)$, where $\psi(x)$ is the wavelet function (mother wavelet), $\phi(x)$ is the scaling function (father wavelet), and $s_{jk}$ and $d_{jk}$ are wavelet coefficients.
A technical complication in employing wavelets for
density estimation is that no closed-form expression for the wavelet functions is available. A discussion of wavelet
bases and some encouraging simulation results can be
found in Refs 17,29–31.
Finally, let us note that in many cases the choice between a compact and an infinite support may be complicated; see a discussion in Refs 24,32. Wahba33 suggests that 'in many applications it might be preferable to assume the true density has compact support and to scale the data to interior of [0, 1].' This is very solid advice, and it will be used in what follows together with the cosine basis defined in the introductory section.
Estimation Procedures
Over the years, numerous estimation procedures have
been suggested to optimize density estimators and to
make them adaptive to the underlying smoothness
of the density. Nonetheless, all these procedures are
based on the pivotal relations (1)-(6) presented in the
introductory section. Furthermore, any orthogonal
series estimator can be written as

$$\hat f(x) = \hat f(x, \{\hat w_j\}) = 1 + \sum_{j=1}^{\infty}\hat w_j\hat\theta_j\varphi_j(x), \tag{7}$$
where $\hat\theta_j$ is the empirical Fourier coefficient (3), and $\hat w_j \in [0, 1]$ is a shrinkage coefficient. Let us present the most popular approaches to choosing the shrinkage coefficients.
Truncated Estimators
These are estimation procedures mimicking (1) with
ŵj = I(j ≤ J); here and in what follows I(·) is the
indicator function. Note that only one parameter—the
cutoff J —should be chosen. The underlying idea of
choosing the cutoff is as follows. Denote the truncated density estimator by $\tilde f_J(x) := 1 + \sum_{j=1}^{J}\hat\theta_j\varphi_j(x)$. Then the Parseval identity (6), together with Eq. (5), allows us to write

$$E\int_0^1[\tilde f_J(x) - f(x)]^2\,dx = \sum_{j=1}^{J}\mathrm{Var}(\hat\theta_j) + \sum_{j=J+1}^{\infty}\theta_j^2 = n^{-1}\sum_{j=1}^{J}d_j + \sum_{j=J+1}^{\infty}\theta_j^2. \tag{8}$$
As we see, the cutoff J + 1 is worse than the cutoff J if $n^{-1}d_{J+1} > \theta_{J+1}^2$. Of course, a squared Fourier coefficient $\theta_j^2$ is unknown, but its unbiased estimate $\hat\theta_j^2 - n^{-1}d_j$ can be used instead. Also remember Eq. (5) and the rule of thumb that $d_j$ may be considered equal to d = 1. Using these observations, Tarter and Kronmal34 proposed to choose $\hat J$ as the minimal integer J such that $2n^{-1}d_{J+i} > \hat\theta_{J+i}^2$ for all $i = 1, \ldots, r$; specifically, r = 2 is recommended. Diggle and Hall35 showed that this simple rule may imply small cutoffs for multimodal densities and suggested the following improvement. Choose an increasing function $\lambda(J)$; specifically, $\lambda(J) = 4J^{1/2}$ is recommended. The authors note that in many cases $\sum_{j=J+1}^{\lambda(J)J}\theta_j^2$ is a good approximation of the unknown term $\sum_{j=J+1}^{\infty}\theta_j^2$ in (8). Using the unbiased estimate $\hat\theta_j^2 - n^{-1}d_j$ of $\theta_j^2$, the recommended cutoff $\hat J$ minimizes $n^{-1}\sum_{j=1}^{J}d_j + \sum_{j=J+1}^{\lambda(J)J}[\hat\theta_j^2 - n^{-1}d_j]$.

A different procedure, avoiding estimation of the sum $\sum_{j=J+1}^{\infty}\theta_j^2$, was proposed by Hart.36 The Parseval identity allows us to write
$$\sum_{j=J+1}^{\infty}\theta_j^2 = \int_0^1 f^2(x)\,dx - 1 - \sum_{j=1}^{J}\theta_j^2. \tag{9}$$
Because $\int_0^1 f^2(x)\,dx$ is a fixed number, the data-driven cutoff is defined as the J which minimizes

$$\sum_{j=1}^{J}\bigl[2n^{-1}d_j - \hat\theta_j^2\bigr]. \tag{10}$$
Analysis of numerical experiments, devoted to
comparison of these estimators, can be found
in Ref 26. Overall, for densities with a simple
shape all these estimators perform similarly, but
the last two may perform better for multimodal
densities.
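As an illustration, Hart's rule (10) reduces to a cumulative-sum computation; the sketch below (function names are mine, and the usage line reuses theta_hat from the earlier sketch) also evaluates the resulting truncated estimate.

```python
import numpy as np

def hart_cutoff(theta_hat, n, d=1.0):
    """Return the J minimizing sum_{j<=J} (2 d / n - theta_hat_j^2), Eq. (10);
    J = 0, i.e., the uniform density on [0, 1], is allowed."""
    scores = np.concatenate(([0.0], np.cumsum(2.0 * d / n - theta_hat ** 2)))
    return int(np.argmin(scores))

def truncated_estimate(x, theta_hat, J):
    """f_tilde_J(x) = 1 + sum_{j=1}^{J} theta_hat_j phi_j(x), cosine basis."""
    x = np.asarray(x, dtype=float)
    fx = np.ones_like(x)
    for j in range(1, J + 1):
        fx += theta_hat[j - 1] * np.sqrt(2.0) * np.cos(np.pi * j * x)
    return fx

# Usage with theta_hat from the earlier sketch (n = 50, d = 1):
# J = hart_cutoff(theta_hat, n=50)
# grid = np.linspace(0.0, 1.0, 201); fx = truncated_estimate(grid, theta_hat, J)
```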
From the point of view of modern nonparametric curve estimation, Hart's estimator is a particular case of the penalized least squares model selection procedure, where $\hat J$ minimizes $[-\sum_{j=1}^{J}\hat\theta_j^2 + \mathrm{pen}(J)]$ with pen(J) being the penalty function. This general procedure is discussed in the recent book.7 Furthermore, in that book it is shown that unbiased cross-validation, another popular procedure for the data-driven choice of parameters (see Refs 37,38), is also a particular case of penalized least squares model selection.
Threshold Estimators
This is a more complicated and also potentially more rewarding estimation procedure. The simplest procedures use zero or unit weights $\hat w_j$. The first such procedure was proposed by Kronmal and Tarter,39 and it mimics the above-discussed rule of including the jth term if $\theta_j^2 > d_j n^{-1}$: namely, $\hat w_j = I(\hat\theta_j^2 > 2d_j n^{-1})$ is recommended. Unfortunately, neither asymptotic theory nor numerical simulations support this idea.26,40 Note that in modern statistics this estimator would be referred to as a hard-threshold estimator. Two types of thresholding have been proposed. Hard thresholding uses weights $\hat w_j = I(|\hat\theta_j| > t(j,n)\sqrt{d_j n^{-1}})$, with the threshold t(j, n) being a specified function. A closely related procedure is soft thresholding, where $\hat w_j = [(|\hat\theta_j| - t(j,n)\sqrt{d_j n^{-1}})/|\hat\theta_j|]\,I(|\hat\theta_j| > t(j,n)\sqrt{d_j n^{-1}})$. In simulations these two procedures perform similarly, but soft thresholding is simpler for theoretical analysis. The most popular threshold is $t(j,n) = c[\ln(n)]^{1/2}$, where c > 0 is a constant; $c = \sqrt{2}$ is a standard choice. Note that this threshold is significantly larger than the $t(j,n) = \sqrt{2}$ used by Kronmal and Tarter.39 For wavelet bases, thresholds depending on scale have also been proposed. See a discussion of these procedures in Refs 5,17,30,31. Furthermore, Efromovich5 shows that hard thresholding is a good choice for estimation of multivariate densities.
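In code, both thresholding rules reduce to a choice of weights for Eq. (7); a minimal sketch, assuming the coefficient of difficulty d is known and using the standard $c = \sqrt{2}$:

```python
import numpy as np

def hard_threshold_weights(theta_hat, n, d=1.0, c=np.sqrt(2.0)):
    """w_j = I(|theta_hat_j| > t sqrt(d/n)), with t = c sqrt(ln n)."""
    lam = c * np.sqrt(np.log(n)) * np.sqrt(d / n)
    return (np.abs(theta_hat) > lam).astype(float)

def soft_threshold_weights(theta_hat, n, d=1.0, c=np.sqrt(2.0)):
    """w_j = (|theta_hat_j| - lam) / |theta_hat_j| above the threshold, else 0."""
    lam = c * np.sqrt(np.log(n)) * np.sqrt(d / n)
    a = np.abs(theta_hat)
    return np.where(a > lam, (a - lam) / np.maximum(a, 1e-12), 0.0)
```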
A theoretical drawback of the above-defined thresholding procedures is that they cannot attain minimax rates of MISE convergence. To overcome this issue, blockwise thresholding has been proposed: all coefficients are grouped into nonoverlapping blocks, and the same shrinkage is then used within a block of coefficients. The idea of blocks was first proposed in Ref 41 and then further developed in Refs 29,42,43. This procedure will be explained in more detail shortly.
Mimicking of Oracle
This is an interesting and popular approach in modern nonparametric curve estimation, based on the idea of asking an oracle about the optimal shrinkage of the empirical Fourier coefficients $\hat\theta_j$. The oracle is allowed to know the underlying density, and this may yield a good estimation procedure; the statistician then tries to mimic the oracle estimator. A straightforward calculation, which is due to Watson,21 yields that $\min_w E(\theta_j - w\hat\theta_j)^2 = E(\theta_j - w_j^*\hat\theta_j)^2$, where

$$w_j^* = \frac{\theta_j^2}{\theta_j^2 + d_j n^{-1}}. \tag{11}$$
Note that, due to the Parseval identity, $w_j^*$ is the oracle's shrinkage that minimizes the MISE of the estimator $\hat f(x, \{w_j\})$ defined in Eq. (7). It is then natural to use a statistic in place of $w_j^*$; for instance, the unbiased estimate $\hat\theta_j^2 - d_j n^{-1}$ of $\theta_j^2$ can be plugged in. A review of other estimators mimicking $w_j^*$ can be found in the books.5,26 Good numerical outcomes have been reported, but it is also possible to show that any estimator based on term-by-term shrinkage is not asymptotically minimax. To overcome this issue, a blockwise shrinkage should be used. A direct calculation, which is due to Ref 41, yields

$$\min_W \sum_{j\in B} E(\theta_j - W\hat\theta_j)^2 = \sum_{j\in B} E(\theta_j - W^*\hat\theta_j)^2, \quad \text{where} \quad W^* = \frac{\sum_{j\in B}\theta_j^2}{\sum_{j\in B}[\theta_j^2 + d_j n^{-1}]}.$$

Furthermore, it is established that if the cardinality of the blocks is increasing, then the blockwise oracle is minimax over a large set of density classes and its MISE also attains the sharp minimax constant. Furthermore, it is possible to propose a statistic, used in place of $W^*$, which does not change that asymptotic property. This and other related results can be found in Refs 5,41,44,45.
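A plug-in version of the blockwise oracle can be sketched as follows; the particular block partition and the nonnegative plug-in statistic are natural choices of mine, not the exact recipe of Ref 41.

```python
import numpy as np

def blockwise_weights(theta_hat, blocks, n, d=1.0):
    """Plug-in for W*: within block B, replace sum_{j in B} theta_j^2 by the
    estimate sum theta_hat_j^2 - |B| d / n, truncated to stay nonnegative."""
    w = np.zeros_like(theta_hat)
    for B in blocks:                        # B: array of 0-based coefficient indices
        s2 = float(np.sum(theta_hat[B] ** 2))
        w[B] = max(0.0, 1.0 - len(B) * d / (n * s2)) if s2 > 0 else 0.0
    return w

def increasing_blocks(J):
    """Blocks of increasing cardinality, e.g. {1}, {2,3}, {4,5,6}, ... (0-based)."""
    blocks, start, size = [], 0, 1
    while start < J:
        blocks.append(np.arange(start, min(start + size, J)))
        start, size = start + size, size + 1
    return blocks
```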
Universal Estimator
Let us define an estimator, proposed in Refs 5,46, which combines the main underlying ideas of the above-introduced estimators; this is also the estimator used in the introductory section. The first step is to calculate a pilot estimate $\bar f(x) := 1 + \sum_{j=1}^{\hat J}\hat w_j\hat\theta_j\varphi_j(x)$, where $\hat w_j := \max(0,\, 1 - d/(n\hat\theta_j^2))$ and $\hat J$ minimizes Eq. (10) with $d_j = d$. Then two 'improvements' of the pilot estimator are made. The first is based on the idea of obtaining a good estimator for spatially inhomogeneous densities that may have several relatively large Fourier coefficients beyond the cutoff $\hat J$. This yields a modification based on a hard-threshold procedure for high-frequency Fourier coefficients. Namely,

$$\check f(x) := \bar f(x) + \sum_{j=\hat J+1}^{c_{JM}\hat J} I\{\hat\theta_j^2 > c_T\, d\ln(n)/n\}\,\hat\theta_j\varphi_j(x).$$

Here $c_{JM}$ and $c_T$ are two parameters that define, respectively, the maximal number of Fourier coefficients that may be included in the estimate and the coefficient in the hard-threshold procedure; their default values are 6 and 4. Note that a high-frequency Fourier coefficient is included only if it is extremely large.

The necessity of the second improvement is obvious: an orthogonal series density estimate may take on negative values. Fortunately, there is a simple remedy, the $L_2$-projection of $\check f$ onto the class of nonnegative densities,

$$\hat f(x) := \max(0,\, \check f(x) - c), \tag{12}$$

where the constant c is such that $\int_0^1 \hat f(x)\,dx = 1$.

Note that $\hat f(x)$ is completely data-driven. The estimator is called universal because in the software5 it is used for all statistical problems, including nonparametric regression, density estimation, spectral density estimation, filtering, and so on.
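The constant c in Eq. (12) can be found numerically because the mass of $\max(0, \check f - c)$ decreases monotonically in c; here is one possible grid-based sketch (the bisection scheme and iteration count are my choices).

```python
import numpy as np

def project_to_density(f_values, grid):
    """L2-projection (12): find c so that max(0, f_check - c) integrates to 1.
    `f_values` are values of the pilot estimate f_check on `grid` over [0, 1]."""
    def mass(c):
        return np.trapz(np.maximum(0.0, f_values - c), grid)
    lo, hi = float(f_values.min()) - 1.0, float(f_values.max())  # brackets c
    for _ in range(60):                      # bisection on the monotone mass(c)
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if mass(mid) > 1.0 else (lo, mid)
    c = 0.5 * (lo + hi)
    return np.maximum(0.0, f_values - c)
```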
Aggregation
This is an attractive procedure for constructing a good data-driven density estimator from several known density estimators. Suppose that there already exist M ≥ 2 competing density estimators $\hat p_1, \hat p_2, \ldots, \hat p_M$. In general, they can be parametric estimators, histograms, kernel estimators, or any others (not necessarily orthogonal series). A natural idea is then to look for a better estimator constructed by combining these M estimators in a suitable way. The improved estimator is referred to as an aggregate, and its construction is called aggregation.
The three main types of aggregation are model selection, convex aggregation, and linear aggregation. In all these cases, an aggregate is compared with an oracle-estimator that can be written as $\tilde p_\lambda = \sum_{j=1}^{M}\lambda_j\hat p_j$, $\lambda = (\lambda_1, \ldots, \lambda_M)$. The objective of model selection is to find an aggregate which is as good as the best among the M given estimators; in this case the oracle-estimator has a λ with a single element equal to 1 and the others equal to 0. The objective of convex aggregation is to suggest an aggregate which is at least as good as the best convex combination of the M given estimators; here $\lambda \in \mathbb{R}^M$, $\lambda_j \ge 0$, and $\sum_{j=1}^{M}\lambda_j \le 1$. The objective of linear aggregation is to suggest an aggregate which is at least as good as the best linear combination of the M given estimators; here $\lambda \in \mathbb{R}^M$.
As an example, let us explain how linear aggregation can be performed. Apply Gram–Schmidt orthonormalization to the set of known estimators $\{\hat p_1, \ldots, \hat p_M\}$ to get orthonormal functions $\{\phi_1, \ldots, \phi_N\}$, $N \le M$. Then, similarly to Eqs. (1)–(3), define the linear aggregate $\check p(x) = \sum_{j=1}^{N}\hat\lambda_j\phi_j(x)$, where $\hat\lambda_j = n^{-1}\sum_{l=1}^{n}\phi_j(X_l)$.
Note that this linear aggregate is the empirical orthogonal series estimator. Furthermore, it is possible to show that this aggregate performs the linear aggregation, that is, its MISE is close to the MISE of the linear oracle. Of course, in applications the M estimators are unknown. In this case, splitting the data is recommended: part of the observations is used for construction of the M estimators, and the remaining observations are used for aggregation. Known results are primarily theoretical in nature, but available numerical studies are encouraging. See a discussion in Refs 7,47–51.
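A grid-based sketch of this linear aggregation follows; the discrete Gram–Schmidt step (implemented via a QR decomposition) is my numerical stand-in for exact $L_2$ orthonormalization, and for simplicity the M estimators are assumed linearly independent.

```python
import numpy as np

def linear_aggregate(estimators, sample, grid):
    """Empirical linear aggregation of density estimators on [0, 1].
    `estimators` is a list of callables p_hat_m; orthonormalization is
    performed numerically on `grid`, and lambda_hat_j = mean of phi_j(X_l)."""
    dx = grid[1] - grid[0]
    V = np.array([p(grid) for p in estimators]).T   # columns: p_hat_m on grid
    Q, _ = np.linalg.qr(V * np.sqrt(dx))            # grid-orthonormal columns
    Phi = Q.T / np.sqrt(dx)                         # rows: functions phi_j on grid
    lam = np.array([np.interp(sample, grid, phi).mean() for phi in Phi])
    return lam @ Phi                                # aggregate evaluated on grid
```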
Bona Fide Estimation
The main drawback of orthogonal series density estimation, which is often discussed in the literature, is that a series approximation may not be a bona fide density: it can take on negative values and need not integrate to 1. Furthermore, in some cases restrictions such as unimodality, multimodality, or monotonicity may be known. It is shown in Refs 5,52,53 that in these cases the optimal estimation algorithm has two steps. The first step is orthogonal series estimation; the second is projection of the estimate onto the given class of densities. As an example, Eq. (12) is the $L_2$-projection of an estimate onto the class of all nonnegative densities.
ASYMPTOTIC THEORY
One of the attractive features of orthogonal series
estimation is that it allows one to develop a feasible
asymptotic theory. Let us explain a particular asymptotic result via the example of minimax density estimation over a Sobolev class of k-fold differentiable densities,

$$\mathcal F(k, Q) = \Bigl\{f : f(x) = 1 + \sum_{j=1}^{\infty}\theta_j\varphi_j(x),\ \sum_{j=1}^{\infty}(\pi j)^{2k}\theta_j^2 \le Q < \infty,\ f(x) \ge 0\Bigr\}, \quad k \ge 1.$$
We begin by establishing a simple lower bound for the minimax MISE of a linear density estimate $\hat f(x, \{w_j\})$ defined in Eq. (7). Using Eq. (11), write

$$\inf_{\{w_j\}}\ \sup_{f\in\mathcal F(k,Q)} E\int_0^1 \bigl(\hat f(x,\{w_j\}) - f(x)\bigr)^2 dx \ \ge\ \sup_{f\in\mathcal F(k,Q)}\ \inf_{\{w_j\}} E\int_0^1 \bigl(\hat f(x,\{w_j\}) - f(x)\bigr)^2 dx$$
$$\ge\ \sup_{f\in\mathcal F(k,Q)} E\int_0^1 \bigl(\hat f(x,\{w_j^*\}) - f(x)\bigr)^2 dx \ =\ \sup_{f\in\mathcal F(k,Q)} \sum_{j=1}^{\infty} \frac{d_j n^{-1}\theta_j^2}{\theta_j^2 + d_j n^{-1}} \ =\ n^{-2k/(2k+1)} P(k,Q,d)\,(1 + o_n(1)), \tag{13}$$

where $P(k,Q,d) = Q^{1/(2k+1)}\,[kd/(\pi(k+1))]^{2k/(2k+1)}\,(2k+1)^{1/(2k+1)}$ and $o_n(1) \to 0$ as $n \to \infty$. In relations (13) the first inequality always holds, the second inequality is due to Eq. (11), and the last two equalities are based on direct calculations. Furthermore, it is possible to show that the lower bound (13) holds for all possible estimators (not necessarily linear); see Refs 41,54,55. Furthermore, this lower bound is asymptotically sharp. Indeed, consider the data-driven blockwise orthogonal series estimator
$$\hat f(x, \{B_s, t_s\}) = 1 + \sum_{s=1}^{S}\hat W_s \sum_{j\in B_s}\hat\theta_j\varphi_j(x), \tag{14}$$

where the block shrinkage weight is $\hat W_s = \bigl(1 - dL_s/\bigl(n\sum_{j\in B_s}\hat\theta_j^2\bigr)\bigr)\, I\bigl(L_s^{-1}\sum_{j\in B_s}\hat\theta_j^2 > d(1+t_s)n^{-1}\bigr)$. Here the $B_s$ are consecutive blocks of the indexes $\{1, 2, \ldots\}$, $L_s$ is the cardinality of $B_s$ (the number of indexes in $B_s$), S is the largest integer such that $\sum_{s=1}^{S} L_s < n^{1/3}\ln(n)$, and the $t_s > 0$ are thresholds. This estimator is suggested in Ref 41, where it is established that if $\sum_{s=1}^{\infty} L_s^{-1} t_s^{-3} < \infty$, $L_{s+1}/L_s = 1 + o_s(1)$, and $t_s = o_s(1)$, then the MISE of the adaptive estimator (14) attains the lower bound (13).
The adaptive estimator (14) is called sharp minimax because its MISE attains the minimax rate $n^{-2k/(2k+1)}$, which depends on the smoothness of the estimated density (obviously unknown to the estimator), and it also attains the sharp constant P(k, Q, d). Interestingly, the latter property of attaining the sharp constant is paramount for supporting estimators proposed for small data sets; there are a number of publications5,45,46,56,57 which argue that for small samples the asymptotic constant may be more important than the rate. The fact that an orthogonal series estimator may attain both the optimal rate and the sharp constant boosts the feasibility of the orthogonal series methodology.
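To get a numerical feeling for the benchmark (13), its leading term is easy to evaluate; a small sketch in which the function name and example arguments are illustrative:

```python
import numpy as np

def minimax_mise_leading_term(n, k, Q, d=1.0):
    """Leading term n^{-2k/(2k+1)} P(k, Q, d) of the minimax MISE in (13)."""
    r = 2.0 * k / (2.0 * k + 1.0)
    P = (Q ** (1.0 / (2.0 * k + 1.0))
         * (k * d / (np.pi * (k + 1.0))) ** r
         * (2.0 * k + 1.0) ** (1.0 / (2.0 * k + 1.0)))
    return n ** (-r) * P

# e.g., twice-differentiable densities with Q = 1 and n = 50:
# minimax_mise_leading_term(50, 2, 1.0)
```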
BEYOND THE STANDARD SETTING
The aim of this section is to explain how the above-described approach for construction of orthogonal
series density estimators can be expanded to more
complicated settings. Remember that, for the case
of direct observations, the pivot of any orthogonal
series density estimate is the sample mean estimate of
Fourier coefficients and the corresponding coefficient
of difficulty d. As a result, for a more complicated
setting we need to understand how to suggest a
sample mean estimator of Fourier coefficients and then
evaluate the corresponding coefficient of difficulty.
Censored Data
Let X denote a lifetime random variable whose
observations may be right-censored. An observation
 2010 Jo h n Wiley & So n s, In c.
473
Advanced Review
www.wiley.com/wires/compstats
X is right-censored if all you know about X is that
it is greater than some given value. An example that
motivated the name is as follows. Suppose that X is a
person’s age at death (in years), and you know only
that it is larger than 56, in which case the time is
censored at 56 (this may occur if the person at age
56 moved to another country and can no longer be
traced).
Realizations of X are not available to us directly; instead, the data $(Y_l, \delta_l)$, $l = 1, 2, \ldots, n$, are given, where $Y_l = \min(X_l, T_l)$, $\delta_l = I(X_l \le T_l)$, and the $T_l$ are independent and identically distributed (iid) random variables, with survivor function $S^T(t)$, that 'censor' the random variable of interest X.
Let us find an appropriate sample mean estimate of the Fourier coefficients (2). Remember that X is supported on [0, 1], and suppose that $S^T(1) > 0$. For $y \in [0, 1]$ we can write $P(Y \le y, \delta = 1) = \int_0^y f^X(x)S^T(x)\,dx$, and this allows us to get $\theta_j = E\{\delta\,\varphi_j(Y)/S^T(Y)\}$. Because $\theta_j$ is written as an expectation, the sample mean estimate $\hat\theta_j := n^{-1}\sum_{l=1}^{n}\delta_l\varphi_j(Y_l)/S^T(Y_l)$ is an unbiased estimate of $\theta_j$. Furthermore, a straightforward calculation of the variance of this estimator allows us to conclude that $d_j \to d = \int_0^1 f(x)/S^T(x)\,dx$ as $j \to \infty$. Finally, if the survivor function $S^T$ is unknown, then the widely used product-limit (Kaplan–Meier) estimate can be used instead. According to Efromovich,52 this approach yields sharp minimax estimation of the density. Note that this result holds for the complicated case of indirect (here censored) data.
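A sketch of these censored-data formulas; for simplicity the censoring survivor function is passed as a known callable, whereas in practice its Kaplan–Meier estimate would be plugged in.

```python
import numpy as np

def censored_coefficients(Y, delta, S_T, J):
    """theta_hat_j = n^{-1} sum_l delta_l phi_j(Y_l) / S_T(Y_l) for the cosine
    basis; delta_l = 1 marks uncensored observations."""
    w = delta / S_T(Y)       # inverse-probability-of-censoring weights
    return np.array([np.mean(w * np.sqrt(2.0) * np.cos(np.pi * j * Y))
                     for j in range(1, J + 1)])
```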
Biased Data
This is a familiar indirect sampling setting where, in place of a sample from the density $f^X$ of X, one observes a sample $Y_1, \ldots, Y_n$ from the density $f^Y(y) := g(y)f^X(y)/\mu$, where g(y) is a given positive function and $\mu = \int_0^1 g(x)f^X(x)\,dx$ is the constant that makes $f^Y$ a bona fide density. Note that because $f^X$ is unknown, the parameter μ is also unknown.
Let us present an example of biased data.
Suppose that a researcher would like to know the
distribution of the ratio of alcohol in the blood of
liquor-intoxicated drivers traveling along a particular
highway. The data are available from routine police
reports on arrested drivers charged with driving under
the influence of alcohol (a routine report means
that there are no special police operations to test
all drivers). Because a more heavily intoxicated driver has a larger chance of attracting the attention of the police, it is clear that the data are length-biased toward higher ratios of alcohol in the blood. Thus, the researcher should make an appropriate adjustment in the method of estimating the underlying density of the ratio of alcohol in the blood of all intoxicated drivers.
For biased data it is not difficult to suggest a sample mean estimate of the Fourier coefficients of the underlying density $f^X$: $\hat\theta_j := \hat\mu n^{-1}\sum_{l=1}^{n}\varphi_j(Y_l)/g(Y_l)$. Because the parameter μ is unknown, we plug in its estimate $\hat\mu := [n^{-1}\sum_{l=1}^{n} 1/g(Y_l)]^{-1}$. The idea of this estimate is that $1/\hat\mu$ is an unbiased estimate of $1/\mu$: indeed, $E\{1/\hat\mu\} = E\{g^{-1}(Y)\} = \mu^{-1}\int_0^1 g(y)f^X(y)g^{-1}(y)\,dy = 1/\mu$. Finally, a direct calculation shows that the coefficient of difficulty for this problem is $d := \mu\int_0^1 f^X(y)g^{-1}(y)\,dy = \mu^2 E\{g^{-2}(Y)\}$, and its natural estimate is $\hat d := \hat\mu^2 n^{-1}\sum_{l=1}^{n} g^{-2}(Y_l)$. Efromovich58 shows that this approach implies sharp minimax estimation.
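In code the biased-data adjustment takes only a few lines; a minimal sketch, assuming the biasing function g is given as a callable:

```python
import numpy as np

def biased_coefficients(Y, g, J):
    """mu_hat from 1/mu = E{1/g(Y)}; then theta_hat_j = mu_hat * mean(phi_j(Y)/g(Y)).
    Also returns the estimated coefficient of difficulty d_hat."""
    gY = g(Y)
    mu_hat = 1.0 / np.mean(1.0 / gY)
    theta_hat = np.array([mu_hat * np.mean(np.sqrt(2.0) * np.cos(np.pi * j * Y) / gY)
                          for j in range(1, J + 1)])
    d_hat = mu_hat ** 2 * np.mean(gY ** -2.0)
    return theta_hat, d_hat
```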
Density of Regression Errors
This is one of the most challenging examples of indirect data. Consider a nonparametric heteroscedastic regression $Y_l = m(X_l) + \sigma(X_l)\varepsilon_l$, $l = 1, 2, \ldots, n$, where $m(x) = E(Y|X = x)$ is the regression function, $\sigma(x)$ is the scale (volatility) function, X is a random variable with design density h(x), and ε is the regression error. We are interested in estimating the density of the regression error in the presence of three unknown nuisance functions: (1) the regression function m(x); (2) the design density h(x) of the predictor X; and (3) the scale function σ(x).

A natural approach here is to estimate the regression and scale functions, calculate the residuals $\hat\varepsilon_l = [Y_l - \hat m(X_l)]/\hat\sigma(X_l)$, rescale them onto [0, 1], and then use an orthogonal series density estimator. Can this naive approach work? Efromovich56,59,60 gives a positive answer. Furthermore, it is shown that an orthogonal series estimator performs as well as an oracle that knows the underlying regression errors $\varepsilon_1, \ldots, \varepsilon_n$.
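The residual-based recipe can be sketched as follows; the affine rescaling map (and the need to undo it on the final density estimate) is my simplification of the rescaling step.

```python
import numpy as np

def rescaled_residuals(Y, X, m_hat, sigma_hat):
    """Residuals eps_hat = (Y - m_hat(X)) / sigma_hat(X), mapped affinely onto
    [0, 1] so that a [0, 1]-supported series estimator can be applied.
    Returns the rescaled residuals and the map (a, b); a density estimate
    f_hat(u) on [0, 1] transforms back as f(e) = f_hat((e - a)/(b - a))/(b - a)."""
    eps = (Y - m_hat(X)) / sigma_hat(X)
    a, b = float(eps.min()), float(eps.max())
    return (eps - a) / (b - a), (a, b)
```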
Conditional Density
Consider a setting where a sample of size n from the joint density $f^{XY}(x, y)$, supported on $[0, 1]^2$, is given. The problem is to estimate the conditional density f(y|x) of the response Y given the predictor X. Assume that the marginal density of the predictor, $p(x) := \int_0^1 f^{XY}(x, y)\,dy$, is positive on [0, 1]. Then the conditional density can be written as $f(y|x) = f^{XY}(x, y)/p(x)$. At first glance, the last equation leads to a simple estimation procedure: estimate the joint and marginal densities and consider the corresponding ratio. Unfortunately, in general this natural idea does not lead to minimax estimation whenever f(y|x) is smoother than p(x).
Instead, let us one more time use the orthogonal series approach. We begin with the orthogonal Fourier expansion of the conditional density as a function of two variables, $f(y|x) = \sum_{i,j=0}^{\infty}\theta_{ij}\varphi_i(x)\varphi_j(y)$, where $\theta_{ij} = \int_0^1\int_0^1 f(y|x)\varphi_i(x)\varphi_j(y)\,dx\,dy = E\{\varphi_i(X)\varphi_j(Y)[p(X)]^{-1}\}$. The last expression for the Fourier coefficient as an expectation yields the sample mean estimate $\hat\theta_{ij} = n^{-1}\sum_{l=1}^{n}\varphi_i(X_l)\varphi_j(Y_l)/p(X_l)$ of $\theta_{ij}$. Furthermore, a direct calculation shows that the coefficient of difficulty is $d = \int_0^1[1/p(x)]\,dx$. Note that the Cauchy–Schwarz inequality yields $d \ge 1$, with equality for the uniform p. If the marginal density p is unknown, then its estimate is used. It is shown in Ref 61 that this classical orthogonal series approach implies sharp minimax estimation of the conditional density.
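A sketch of the bivariate coefficient estimates; the marginal density p is passed as a callable, whereas in practice its estimate is used.

```python
import numpy as np

def conditional_coefficients(X, Y, p_hat, J):
    """theta_hat_ij = n^{-1} sum_l phi_i(X_l) phi_j(Y_l) / p_hat(X_l),
    for the tensor-product cosine basis with 0 <= i, j <= J."""
    def phi(j, t):
        t = np.asarray(t, dtype=float)
        return np.ones_like(t) if j == 0 else np.sqrt(2.0) * np.cos(np.pi * j * t)
    w = 1.0 / p_hat(X)
    return np.array([[np.mean(phi(i, X) * phi(j, Y) * w)
                      for j in range(J + 1)] for i in range(J + 1)])
```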
CONCLUSION
Orthogonal series density estimation is a powerful and universal methodology which can be used for a wide variety of settings with direct and indirect observations, for adaptive estimation of univariate, multivariate, and conditional densities, and for small and large samples. Data-driven procedures, which perform well in numerical studies and are supported by the asymptotic theory, are available. The orthogonal series approach is the foundation of the modern theory of sharp minimax estimation. An important applied property of any orthogonal series estimator is data reduction: a density estimate can be restored from a relatively small number of Fourier coefficients. Furthermore, a wide variety of bases, including trigonometric, polynomial, and wavelet, allows the data analyst to address any challenging shape of the probability density.
REFERENCES

1. Anderson GL, de Figueiredo RJP. An adaptive orthogonal series estimator for probability density functions. Ann Stat 1980, 8:347–376.
2. Čencov NN. Statistical Decision Rules and Optimum Inference. New York: Springer-Verlag; 1980.
3. Devroye L, Györfi L. Nonparametric Density Estimation: The L1 View. New York: John Wiley & Sons; 1985.
4. Devroye L, Lugosi G. Combinatorial Methods in Density Estimation. New York: Springer; 2001.
5. Efromovich S. Nonparametric Curve Estimation: Methods, Theory and Applications. New York: Springer; 1999.
6. Ibragimov IA, Khasminskii RZ. Statistical Estimation: Asymptotic Theory. New York: Springer; 1981.
7. Massart P. Concentration Inequalities and Model Selection, vol. 1896, Lecture Notes in Mathematics. New York: Springer; 2007.
8. Rosenblatt M. Curve estimates. Ann Math Stat 1971, 42:1815–1842.
9. Schwartz S. Estimation of probability density by an orthogonal series. Ann Math Stat 1967, 38:1262–1265.
10. Scott DW. Multivariate Density Estimation: Theory, Practice, and Visualization. New York: John Wiley & Sons; 1992.
11. Silverman BW. Density Estimation for Statistics and Data Analysis. London: Chapman & Hall; 1986.
12. Tapia RA, Thompson JR. Nonparametric Probability Density Estimation. Baltimore: Johns Hopkins University Press; 1978.
13. Wegman EJ. Nonparametric probability density estimation: I. A summary of available methods. Technometrics 1972, 14:533–546.
14. Scott DW. Histogram. WIREs Comp Stat 2010, 2(1):44–48.
15. Čencov NN. Evaluation of an unknown distribution density from observations. Soviet Math Dokl 1962, 3:1559–1562.
16. Hart JD. Nonparametric Smoothing and Lack-of-Fit Tests. New York: Springer; 1997.
17. Walter GG. Wavelets and Other Orthogonal Systems with Applications. London: CRC Press; 1994.
18. Hall P. Orthogonal series methods for both qualitative and quantitative data. Ann Stat 1983, 11:1004–1007.
19. Hendriks H. Nonparametric estimation of a probability density on a Riemannian manifold using Fourier expansions. Ann Stat 1990, 18:832–849.
20. Wahba G. Optimal convergence properties of variable knot, kernel, and orthogonal series methods for density estimation. Ann Stat 1975, 3:15–29.
21. Watson GS. Density estimation by orthogonal series. Ann Math Stat 1969, 38:1262–1265.
22. Rosenblatt M. Remarks on some non-parametric estimates of a density function. Ann Math Stat 1956, 27:832–837.
23. Hall P. Estimating a density on the positive half line by the method of orthogonal series. Ann Inst Stat Math 1980, 32:351–362.
24. Walter GG. Properties of Hermite series estimation of probability density. Ann Stat 1977, 5:1258–1264.
25. Hall P. On trigonometric series estimates of densities. Ann Stat 1981, 9:683–685.
26. Tarter ME, Lock MD. Model-Free Curve Estimation. New York: Chapman & Hall; 1993.
27. Buckland ST. Fitting density functions with polynomials. J R Stat Soc [Ser A] 1992, 41:63–76.
28. Rudzkis R, Radavicius M. Adaptive estimation of distribution density in the basis of algebraic polynomials. Theory Probab Appl 2005, 49:93–109.
29. Chicken E, Cai T. Block thresholding for density estimation: local and global adaptivity. J Multivar Anal 2005, 95:75–106.
30. Donoho D, Johnstone I, Kerkyacharian G, Picard D. Density estimation by wavelet thresholding. Ann Stat 1996, 24:508–539.
31. Härdle W, Kerkyacharian G, Picard D, Tsybakov A. Wavelets, Approximation, and Statistical Applications. New York: Springer; 1998.
32. Good IJ, Gaskins RA. Density estimation and bump-hunting by penalized likelihood method exemplified by scattering and meteorite data, with discussion. J Am Stat Assoc 1980, 75:42–73.
33. Wahba G. Data-based optimal smoothing of orthogonal series density estimates. Ann Stat 1981, 9:146–156.
34. Tarter ME, Kronmal RA. An introduction to the implementation and theory of nonparametric density estimation. Am Stat 1976, 30:105–112.
35. Diggle PJ, Hall P. The selection of terms in an orthogonal series density estimator. J Am Stat Assoc 1986, 81:230–233.
36. Hart JD. On the choice of truncation point in Fourier series density estimation. J Stat Comput Simul 1985, 21:95–116.
37. Hall P. Cross-validation and the smoothing of orthogonal series density estimators. J Multivar Anal 1987, 21:189–206.
38. Scott DW, Terrell GR. Biased and unbiased cross-validation in density estimation. J Am Stat Assoc 1987, 82:1131–1146.
39. Kronmal RA, Tarter ME. The estimation of probability densities and cumulatives by Fourier series methods. J Am Stat Assoc 1968, 63:925–952.
40. Crain BB. A note on density estimation using orthogonal expansions. Ann Stat 1973, 2:454–463.
41. Efromovich S. Nonparametric estimation of a density with unknown smoothness. Theory Probab Appl 1985, 30:557–568.
42. Hall P, Kerkyacharian G, Picard D. Block threshold rules for curve estimation using kernel and wavelet methods. Ann Stat 1998, 26:922–942.
43. Rigollet P. Adaptive density estimation using the blockwise Stein method. Bernoulli 2006, 12:351–370.
44. Efromovich S. Density estimation for the case of supersmooth measurement error. J Am Stat Assoc 1997, 92:526–535.
45. Efromovich S. Adaptive estimation of and oracle inequalities for probability densities and characteristic functions. Ann Stat 2008, 36:1127–1155.
46. Efromovich S. Adaptive orthogonal series density estimation for small samples. Comput Stat Data Anal 1996, 22:599–617.
47. Catoni O. Statistical Learning Theory and Stochastic Optimization, vol. 1851, Lecture Notes in Mathematics. New York: Springer; 2004.
48. Nemirovski A. Topics in Non-Parametric Statistics, vol. 1738, Lecture Notes in Mathematics. New York: Springer; 2000.
49. Rigollet P, Tsybakov A. Linear and convex aggregation of density estimators. Math Methods Stat 2007, 16:260–280.
50. Samarov A, Tsybakov A. Aggregation of density estimators and dimension reduction. In: Nair V, ed. Advances in Statistical Models and Inference, Essays in Honor of Kjell Doksum. Singapore: World Scientific; 2007, 233–251.
51. Yang Y. Mixing strategies for density estimation. Ann Stat 2000, 28:75–87.
52. Efromovich S. Density estimation under random censorship and order restrictions: from asymptotic to small samples. J Am Stat Assoc 2001, 96:667–685.
53. Gajek L. On improving density estimators which are not bona fide functions. Ann Stat 1986, 14:1612–1618.
54. Efromovich S. Lower bound for estimation of Sobolev densities of order less than 1/2. J Stat Plan Inference 2009, 139:2261–2268.
55. Golubev GK. Nonparametric estimation of smooth probability densities in L2. Prob Inform Transm 1992, 28:44–54.
56. Efromovich S. Adaptive estimation of error density in nonparametric regression with small sample size. J Stat Plan Inference 2007, 137:363–378.
57. Marron JS, Wand MP. Exact mean integrated squared error. Ann Stat 1992, 20:712–736.
58. Efromovich S. Density estimation for biased data. Ann Stat 2004, 32:1137–1161.
59. Efromovich S. Estimation of the density of regression errors. Ann Stat 2005, 33:2194–2227.
60. Efromovich S. Optimal nonparametric estimation of the density of regression errors with finite support. Ann Inst Stat Math 2007, 59:617–654.
61. Efromovich S. Conditional density estimation. Ann Stat 2007, 35:2504–2535.