Advanced Review

Orthogonal series density estimation

Sam Efromovich∗

Orthogonal series density estimation is a powerful nonparametric estimation methodology that allows one to analyze and present the data at hand without any prior opinion about the shape of the underlying density. The idea behind the construction of an adaptive orthogonal series density estimator is explained via the classical example of a direct sample from a univariate density. Data-driven estimators, both those that have been used for years and recently proposed procedures, are reviewed. Orthogonal series estimation is also known for its sharp minimax properties, which are explained. Furthermore, applications of the orthogonal series methodology to more complicated settings, including censored and biased data as well as estimation of the density of regression errors and of the conditional density, are also presented. © 2010 John Wiley & Sons, Inc. WIREs Comp Stat 2010 2 467–476

∗ Correspondence to: [email protected]
Department of Mathematical Sciences, The University of Texas at Dallas, Richardson, TX 75083, USA
DOI: 10.1002/wics.97

Estimation of the probability density function, or simply the density, is one of the most fundamental problems in statistics.1–13 A traditional parametric approach assumes that the density is known up to several parameters (like a normal density with unknown mean and variance); estimators of those parameters are then plugged into the density, creating a parametric density estimate. As a result, under the parametric approach the shape of the density is assumed to be known. For the normal density example, this means that only the location and scale of this famous bell-shaped curve are unknown. Nonparametric methodology relaxes this assumption: it relies solely on the data and lets the data speak for themselves.

The simplest setting is when a direct sample from an underlying density is available. Let us formulate it. Consider a univariate continuous random variable X distributed according to a probability density f(x). This phrase means that for any practically interesting set A of real numbers we can find the probability (likelihood) that X belongs to this set by the formula $\Pr(X \in A) = \int_A f(x)\,dx$. For instance, for A = [a, b] we get $\Pr(a \le X \le b) = \int_a^b f(x)\,dx$. Then one observes n independent realizations $X_1, X_2, \ldots, X_n$ of X, also called a sample of size n from X or f, and the aim is to find an estimate $\tilde f(x) = \tilde f(x; X_1, \ldots, X_n)$, based on those observations, that fits the underlying density f(x).

A customary and very natural use of density estimates is to present a data set at hand, as well as to make a kind of informal investigation of the properties of the data. A typical approach used in the literature is to choose an interesting real data set and then show how an estimator exhibits the underlying density (data). Unfortunately, there is little to learn from this approach, because we do not know the underlying density and thus do not know how well the chosen estimator performs. A better way to learn about possible samples from densities of interest, and about a particular estimator, is to generate data from those densities and then visualize the data, the corresponding histograms, and the estimates. Let us use this approach for three specific densities and check what orthogonal series estimation can offer. Figure 1 exhibits three simulations of 50 observations each, shown by crosses and arranged in columns, from three different underlying densities supported on [0, 1].
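Readers who wish to repeat this kind of experiment can do so in a few lines of R. The following is a minimal sketch only: the normal and mixture parameters are illustrative choices, not necessarily those behind Figure 1.

```r
# Simulate n = 50 observations from three test densities on [0, 1]
# (uniform, normal, and a two-component normal mixture) and draw
# the default R histogram for each sample.
set.seed(1)                                        # for reproducibility
n <- 50
x.unif  <- runif(n)                                # uniform on [0, 1]
x.norm  <- pmin(pmax(rnorm(n, 0.5, 0.15), 0), 1)   # normal, clipped to [0, 1]
comp    <- rbinom(n, 1, 0.5)                       # mixture component labels
x.bimod <- ifelse(comp == 1, rnorm(n, 0.3, 0.1), rnorm(n, 0.7, 0.1))
x.bimod <- pmin(pmax(x.bimod, 0), 1)
par(mfrow = c(1, 3))
hist(x.unif,  freq = FALSE, main = "Uniform")
hist(x.norm,  freq = FALSE, main = "Normal")
hist(x.bimod, freq = FALSE, main = "Bimodal")
```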
In the left column of diagrams, the underlying density is uniform (it is equal to 1 for all x ∈ [0, 1]); for the other columns it is shown by the solid lines. The top row of diagrams shows how the most widely used nonparametric probability density estimator, the histogram, performs; a nice overview of the histogram can be found in Scott.14 The histograms shown in the top diagrams are the default ones recommended by the statistical software packages R and S-PLUS; in the bottom diagrams the bin width is half as long.

[Figure 1 appears here: six diagrams in three columns, with Density on the vertical axes and [0, 1] on the horizontal axes.]

FIGURE 1 | Performance of the default R (and S-PLUS) histogram is shown in the top diagrams, and performance of the orthogonal series Universal estimator (dashed lines) is shown in the bottom diagrams for the same simulated samples of size n = 50. The underlying density in the left diagrams is uniform; in the other diagrams it is shown by the solid line. The bottom diagrams exhibit the underlying samples using histograms with smaller bin width.

The main goal of any nonparametric estimator is to exhibit the shape of the density. Does the default histogram do a good job? For the left diagram, the shape is plainly bimodal, far from the underlying horizontal line of the uniform density. However, this is not the fault of the histogram. The diagram below exhibits a histogram with smaller bins, and this histogram also exhibits at least two pronounced modes. For this particular sample, one needs to make the bin width much larger than the one chosen by R to get a histogram which resembles the uniform density. What we see here is a typical situation in nonparametric estimation, where the art (or science) of smoothing the data is paramount. The dashed line in the left bottom diagram, which is the orthogonal series estimate, indicates the underlying uniform density because it correctly smoothes the data. It will be explained shortly how the estimate is constructed.

The middle column of diagrams shows how the histogram and the orthogonal series estimate exhibit the underlying normal density shown by the solid line. The default histogram in the middle top diagram correctly exhibits the unimodal shape of the density, but the histogram is obviously skewed. Interestingly, here the histogram with shorter bins, shown in the bottom diagram, exhibits a less skewed density. Please note that here using wider bins, the move that could benefit the uniform density, would hurt the normal density case. As a result, each case requires its own correct smoothing. The orthogonal density estimator, shown in the bottom diagram by the dashed line, exhibits a perfectly symmetric bell-shaped curve, but its magnitude is not large enough. (This is another important lesson: while the sample size n = 50 is practically asymptotic for parametric estimates, for nonparametric estimates it is only the onset of reliable estimation.
The interested reader may use the software5 to get a feeling for what sample size is 'right' for estimating different types of densities.)

Finally, the right column of diagrams exhibits results for the underlying bimodal density (a mixture of two normal densities). Let us first look at the right top histogram. It perfectly indicates the bimodal nature of the density. However, if you force yourself to forget for a moment that you know the underlying bimodal density and quickly glance at the left histogram for the uniform underlying density, then you may feel a bit skeptical about the bimodal shape of the default histogram. On the other hand, the histogram with shorter bins, shown in the right bottom diagram, presents the bimodal shape much better, and the orthogonal series density estimate also supports this conclusion. Please note that in all three cases the orthogonal series estimator corresponds to a histogram with correctly chosen bin width as well as with correctly smoothed bins. This is exactly what a good smooth nonparametric density estimator should do.

How is the orthogonal series estimator used here constructed? This will be explained shortly; first let us explain the pivotal idea of orthogonal series estimation of a univariate density, which is due to Čencov.15 Assume that the random variable X is supported on [0, 1], that is, P(X ∈ [0, 1]) = 1, and that the probability density f of X is square integrable. Then the density may be approximated with any desired accuracy by a partial sum (truncated orthogonal series),

$$f_J(x) := \sum_{j=0}^{J} \theta_j \varphi_j(x), \quad 0 \le x \le 1, \quad \text{where } \theta_j = \int_0^1 \varphi_j(x) f(x)\,dx. \qquad (1)$$

Here $\{\varphi_j\}$ is an orthonormal basis which may be trigonometric, polynomial, spline, wavelet, and so on. A discussion of different bases and their properties can be found in Refs 3,5,16,17. Only to be specific, here and in what follows we consider the cosine basis $\{\varphi_0(x) = 1,\ \varphi_j(x) = \sqrt{2}\cos(\pi j x),\ j = 1, 2, \ldots\}$. The parameter J in Eq. (1) is called the cutoff, and $\theta_j$ is called the jth Fourier coefficient of f. Note that $\theta_0 = \int_0^1 f(x)\,dx = P(X \in [0, 1]) = 1$; thus $\theta_0$ is known. Furthermore, because f is the density, the jth Fourier coefficient can be written as an expectation,

$$\theta_j = \int_0^1 f(x)\varphi_j(x)\,dx = E\{\varphi_j(X)\}. \qquad (2)$$

This is the pivotal result for any orthogonal series estimator because it implies the possibility of estimating $\theta_j$ via the sample mean estimator

$$\hat\theta_j = n^{-1}\sum_{l=1}^{n} \varphi_j(X_l). \qquad (3)$$

Let us calculate the mean and the variance of this estimator. First, the estimator is unbiased because

$$E\{\hat\theta_j\} = \int_0^1 \varphi_j(x) f(x)\,dx = \theta_j. \qquad (4)$$

The variance is easily calculated with the help of the elementary trigonometric identity $\cos^2(\alpha) = [1 + \cos(2\alpha)]/2$, which allows us to write

$$\mathrm{Var}(\hat\theta_j) = E(\hat\theta_j - \theta_j)^2 = n^{-1}\big[1 + 2^{-1/2}\theta_{2j} - \theta_j^2\big] =: n^{-1} d_j. \qquad (5)$$

The Fourier coefficients of any square integrable density decrease as j increases due to Parseval's identity

$$\int_0^1 f^2(x)\,dx = 1 + \sum_{j=1}^{\infty} \theta_j^2. \qquad (6)$$

As a result, $d_j \to d = 1$ as j increases, and this allows us to introduce the important rule of thumb, supported by the asymptotic theory, that the variance of the sample mean estimate $\hat\theta_j$ is $dn^{-1}$ (i.e., $d_j \approx d$), where d is called the coefficient of difficulty. For the considered case of direct observations d = 1, but later we will consider more complicated cases with indirect data where d > 1. As a result, we always keep d in the formulas. This coefficient describes the complexity of the data at hand, and this explains the name.
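In R, the sample mean estimator (3) and the corresponding truncated estimate take only a few lines. The following is a minimal sketch for the cosine basis (the function names are ours):

```r
# Empirical Fourier coefficients (3) for the cosine basis on [0, 1]:
# theta.hat_j = n^{-1} sum_l phi_j(X_l), with phi_0 = 1 and
# phi_j(x) = sqrt(2) cos(pi j x) for j >= 1.
fourier.coef <- function(x, J) {
  theta <- numeric(J + 1)
  theta[1] <- 1                                  # theta_0 = 1 is known exactly
  for (j in seq_len(J))
    theta[j + 1] <- mean(sqrt(2) * cos(pi * j * x))
  theta
}

# Truncated series estimate f_J(x) = sum_{j=0}^J theta.hat_j phi_j(x),
# evaluated on a grid of points in [0, 1].
series.estimate <- function(x.grid, theta) {
  f <- rep(theta[1], length(x.grid))
  for (j in seq_len(length(theta) - 1))
    f <- f + theta[j + 1] * sqrt(2) * cos(pi * j * x.grid)
  f
}
```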
One of the attractive features of orthogonal series estimation is the simplicity with which multivariate densities can be considered. To do this, a tensor-product basis is used. Furthermore, both continuous and discrete components can be considered; see a discussion in Refs 4–6,10,11,18,19.

Similarly to parametric density estimators, nonparametric orthogonal series estimators are recommended only if they possess desired statistical properties; see a discussion in Refs 5,8,11–13,20,21. A density estimator $\hat f_n$ of the density f, based on a sample of size n, is called unbiased if $E\{\hat f_n(x)\} = f(x)$ for all x. It is shown by Rosenblatt22 that no bona fide unbiased density estimator exists for all continuous densities. As a result, a nonparametric density estimator $\hat f_n(x)$ can only be asymptotically unbiased for f, that is, $E\{\hat f_n(x)\} \to f(x)$ as $n \to \infty$. The next property to consider is pointwise consistency of a density estimator, where $\hat f_n(x) \to f(x)$ in probability for every x. Another measure of performance, the most popular one in the literature, is the mean integrated squared error (MISE), defined as $\mathrm{MISE} = E\{\int_{-\infty}^{\infty}[\hat f_n(x) - f(x)]^2\,dx\}$. This measure is often referred to as the L2 approach. Some statisticians favor other Lp approaches, where $E\{[\int_{-\infty}^{\infty}|\hat f_n(x) - f(x)|^p\,dx]^{1/p}\}$ measures the performance of $\hat f_n$. Specifically, the book3 argues in favor of the L1 approach. Other popular measures of performance are defined via the Kullback–Leibler relative entropy and the Hellinger distance; see Refs 3,4.

Results (3)–(5) are the foundation of any orthogonal density estimation procedure. The next section presents many estimation procedures suggested in the literature. The section Asymptotic Theory is devoted to asymptotic results. The section Beyond the Standard Setting explains how more complicated settings, in particular with indirectly observed data, can be explored via the orthogonal series approach.

ESTIMATORS

Choosing an Orthogonal Series

The choice primarily depends on the support of the density. In general, when the real line (−∞, ∞) or the half line [0, ∞) is the support, Hermite and Laguerre series are recommended; see Refs 3,17,23,24. If f has a compact support, for instance [0, 1], then trigonometric (or Fourier) series are recommended. In particular, the cosine basis, defined in the introductory section, is convenient because it nicely approximates aperiodic densities. A discussion of trigonometric bases can be found in Refs 2,3,5,11,25,26. Classical orthogonal polynomials, including Legendre, Gegenbauer, Jacobi, and Chebyshev, are a popular choice as well; see Refs 17,27,28. Wavelet bases are another option. Wavelets may help to visualize local changes in the frequency content of the density as well as its possible discontinuities. A multiresolution wavelet series is written as

$$f(x) = \sum_{k=-\infty}^{\infty} s_{j_0 k}\,\phi_{j_0 k}(x) + \sum_{j=j_0}^{\infty}\sum_{k=-\infty}^{\infty} d_{jk}\,\psi_{jk}(x).$$

Here $\psi_{jk}(x) = 2^{j/2}\psi(2^j x - k)$ and $\phi_{jk}(x) = 2^{j/2}\phi(2^j x - k)$, where $\psi(x)$ is the wavelet function (mother wavelet), $\phi(x)$ is the scaling function (father wavelet), and $s_{jk}$ and $d_{jk}$ are wavelet coefficients. A technical complication in employing wavelets for density estimation is that no closed formula for the wavelet functions is available. A discussion of wavelet bases and some encouraging simulation results can be found in Refs 17,29–31. Finally, let us note that in many cases the choice between a compact support and an infinite support may be complicated; see a discussion in Refs 24,32. Wahba33 suggests that 'in many applications it might be preferable to assume the true density has compact support and to scale the data to the interior of [0, 1].' This is very solid advice, and it will be used in what follows together with the cosine basis defined in the introductory section.
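Wahba's rescaling step is straightforward in R; the margin eps below is our own illustrative choice for keeping the rescaled data away from the endpoints:

```r
# Rescale a sample to the interior of [0, 1] before applying a
# cosine-series estimator (the margin eps = 0.01 is an ad hoc choice).
rescale01 <- function(x, eps = 0.01) {
  a <- min(x); b <- max(x)
  eps + (1 - 2 * eps) * (x - a) / (b - a)
}
```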
Estimation Procedures

Over the years, numerous estimation procedures have been suggested to optimize density estimators and to make them adaptive to the underlying smoothness of the density. Nonetheless, all these procedures are based on the pivotal relations (1)–(6) presented in the introductory section. Furthermore, any orthogonal series estimator can be written as

$$\hat f(x) = \hat f(x, \{\hat w_j\}) = 1 + \sum_{j=1}^{\infty} \hat w_j \hat\theta_j \varphi_j(x), \qquad (7)$$

where $\hat\theta_j$ is the empirical Fourier coefficient (3) and $\hat w_j \in [0, 1]$ is a shrinkage coefficient. Let us present the most popular approaches to choosing the shrinkage coefficients.

Truncated Estimators

These are estimation procedures mimicking (1) with $\hat w_j = I(j \le J)$; here and in what follows $I(\cdot)$ is the indicator function. Note that only one parameter, the cutoff J, should be chosen. The underlying idea of choosing the cutoff is as follows. Denote a truncated density estimator as $\tilde f_J(x) := 1 + \sum_{j=1}^{J}\hat\theta_j\varphi_j(x)$. Then the Parseval identity (6), together with Eq. (5), allows us to write

$$E\int_0^1 [\tilde f_J(x) - f(x)]^2\,dx = \sum_{j=1}^{J}\mathrm{Var}(\hat\theta_j) + \sum_{j=J+1}^{\infty}\theta_j^2 = n^{-1}\sum_{j=1}^{J} d_j + \sum_{j=J+1}^{\infty}\theta_j^2. \qquad (8)$$

As we see, the cutoff J + 1 is worse than the cutoff J if $n^{-1}d_{J+1} > \theta_{J+1}^2$. Of course, a squared Fourier coefficient $\theta_j^2$ is unknown, but its unbiased estimate $\hat\theta_j^2 - n^{-1}d_j$ can be used instead. Also remember Eq. (5) and the rule of thumb that $d_j$ may be considered equal to d = 1. Using these observations, Tarter and Kronmal34 proposed to choose $\hat J$ as the minimal integer J such that $2n^{-1}d_{J+i} > \hat\theta_{J+i}^2$ for all i = 1, ..., r; specifically, r = 2 is recommended. Diggle and Hall35 showed that this simple rule may imply small cutoffs for multimodal densities and suggested the following improvement. Choose an increasing function $\lambda(j)$; specifically, $\lambda(J) = 4J^{1/2}$ is recommended. The authors note that in many cases $\sum_{j=J+1}^{\lambda(J)J}\theta_j^2$ is a good approximation of the unknown term $\sum_{j=J+1}^{\infty}\theta_j^2$ in (8). Using the unbiased estimate $\hat\theta_j^2 - n^{-1}d_j$ of $\theta_j^2$, the recommended cutoff $\hat J$ minimizes $n^{-1}\sum_{j=1}^{J} d_j + \sum_{j=J+1}^{\lambda(J)J}[\hat\theta_j^2 - n^{-1}d_j]$.

A different procedure, which avoids estimation of the sum $\sum_{j=J+1}^{\infty}\theta_j^2$, was proposed by Hart.36 The Parseval identity allows us to write

$$\sum_{j=J+1}^{\infty}\theta_j^2 = \int_0^1 f^2(x)\,dx - 1 - \sum_{j=1}^{J}\theta_j^2. \qquad (9)$$

Because $\int_0^1 f^2(x)\,dx$ is a fixed number, the data-driven cutoff is defined as the J which minimizes

$$\sum_{j=1}^{J}\big[2n^{-1}d_j - \hat\theta_j^2\big]. \qquad (10)$$

An analysis of numerical experiments devoted to the comparison of these estimators can be found in Ref 26. Overall, for densities with a simple shape all these estimators perform similarly, but the last two may perform better for multimodal densities. From the point of view of modern nonparametric curve estimation, Hart's estimator is a particular case of the penalized least squares model selection procedure, where $\hat J$ minimizes $[-\sum_{j=1}^{J}\hat\theta_j^2 + \mathrm{pen}(J)]$ with pen(J) being the penalty function.
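Hart's rule (10) is a one-liner in R. The sketch below assumes the empirical coefficients have already been computed, for instance by the fourier.coef sketch given earlier:

```r
# Hart's data-driven cutoff: choose J minimizing
# sum_{j=1}^J (2 * d / n - theta.hat_j^2), where J = 0 gives an empty sum.
hart.cutoff <- function(theta.hat, n, d = 1) {
  J.max <- length(theta.hat) - 1                # theta.hat[1] holds theta_0
  crit <- cumsum(2 * d / n - theta.hat[-1]^2)   # criterion for J = 1, ..., J.max
  (0:J.max)[which.min(c(0, crit))]              # prepend the J = 0 value
}

# Example (x is a sample on [0, 1]):
# theta.hat <- fourier.coef(x, 20); J <- hart.cutoff(theta.hat, length(x))
```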
This general procedure is discussed in the recent book.7 Furthermore, in that book it is shown that unbiased cross-validation, which is another popular procedure for the data-driven choice of parameters (see Refs 37,38), is also a particular case of the penalized least squares model selection.

Threshold Estimators

This is a more complicated and also potentially more rewarding estimation procedure. The simplest procedures use zero or unit weights $\hat w_j$. The first such procedure was proposed by Kronmal and Tarter,39 and it mimics the above-discussed rule of including the jth term if $\theta_j^2 > d_j n^{-1}$. Namely, $\hat w_j = I(\hat\theta_j^2 > 2 d_j n^{-1})$ is recommended. Unfortunately, neither asymptotic theory nor numerical simulations support this idea.26,40 Note that in modern statistics this estimator would be referred to as a hard-threshold estimator.

Two types of thresholding have been proposed. Hard thresholding uses weights $\hat w_j = I\big(|\hat\theta_j| > t(j,n)\sqrt{d_j n^{-1}}\big)$, with the threshold t(j, n) being a specified function. A closely related procedure is soft thresholding, where $\hat w_j = \big[(|\hat\theta_j| - t(j,n)\sqrt{d_j n^{-1}})/|\hat\theta_j|\big]\, I\big(|\hat\theta_j| > t(j,n)\sqrt{d_j n^{-1}}\big)$. In simulations these two procedures perform similarly, but soft thresholding is simpler for the theoretical analysis. The most popular threshold is $t(j,n) = c[\ln(n)]^{1/2}$, where c > 0 is a constant; $c = \sqrt{2}$ is a standard choice. Note that this threshold is significantly larger than the constant threshold $t(j,n) = \sqrt{2}$ corresponding to the rule of Kronmal and Tarter.39 For wavelet bases, thresholds depending on the scale have also been proposed. See a discussion of these procedures in Refs 5,17,30,31. Furthermore, Efromovich5 shows that hard thresholding is a good choice for the estimation of multivariate densities.

A theoretical drawback of the above-defined thresholding procedures is that they cannot attain minimax rates of the MISE convergence. To overcome this issue, blockwise thresholding has been proposed. Namely, all coefficients are grouped into non-overlapping blocks, and then the same shrinkage is used for a whole block of coefficients. The idea of blocks was first proposed in Ref 41 and then further developed in Refs 29,42,43. This procedure will be explained in more detail shortly.
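Before turning to the blockwise rules, here is a minimal R sketch of the term-by-term hard- and soft-threshold weights defined above, with the standard choice $t(j,n) = \sqrt{2\ln(n)}$:

```r
# Hard- and soft-threshold weights with t(j, n) = sqrt(2 * log(n)),
# i.e., the standard choice c = sqrt(2); theta.hat[1] holds theta_0,
# which is kept untouched (its weight is 1).
threshold.weights <- function(theta.hat, n, d = 1, soft = FALSE) {
  lam <- sqrt(2 * log(n)) * sqrt(d / n)   # threshold on the |theta.hat_j| scale
  th  <- abs(theta.hat[-1])
  w <- if (soft) pmax(0, (th - lam) / th) else as.numeric(th > lam)
  c(1, w)                                 # unit weight for theta_0
}
# The weighted estimate is then 1 + sum_j w_j * theta.hat_j * phi_j(x).
```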
Mimicking of Oracle

This is an interesting and popular approach in modern nonparametric curve estimation, based on the idea of asking an oracle about the optimal shrinkage of the empirical Fourier coefficients $\hat\theta_j$. The oracle is allowed to know the underlying density, and this may yield a good estimation procedure. The statistician then tries to mimic the oracle estimator. A straightforward calculation, which is due to Watson,21 yields that $\min_w E(\theta_j - w\hat\theta_j)^2 = E(\theta_j - w_j^*\hat\theta_j)^2$, where

$$w_j^* = \frac{\theta_j^2}{\theta_j^2 + d_j n^{-1}}. \qquad (11)$$

Note that, due to the Parseval identity, $w_j^*$ is the oracle's shrinkage that minimizes the MISE of the estimator $\hat f(x, \{w_j\})$ defined in Eq. (7). It is then natural to use a statistic in place of $w_j^*$; for instance, the unbiased estimate $\hat\theta_j^2 - d_j n^{-1}$ of $\theta_j^2$ can be plugged in. A review of other mimicking estimators of $w_j^*$ can be found in the books.5,26 Good numerical outcomes have been reported, but it is also possible to show that any estimator based on term-by-term shrinkage is not asymptotically minimax. To overcome this issue, a blockwise shrinkage should be used. A direct calculation, which is due to Ref 41, yields $\min_W \sum_{j\in B} E(\theta_j - W\hat\theta_j)^2 = \sum_{j\in B} E(\theta_j - W^*\hat\theta_j)^2$, where

$$W^* = \frac{\sum_{j\in B}\theta_j^2}{\sum_{j\in B}\big[\theta_j^2 + d_j n^{-1}\big]}.$$

Furthermore, it is established that if the cardinality of the blocks is increasing, then the blockwise oracle is minimax over a large set of density classes and its MISE also attains the sharp minimax constant. Furthermore, it is possible to propose a statistic, used in place of $W^*$, which does not change this asymptotic property. These and other related results can be found in Refs 5,41,44,45.

Universal Estimator

Let us define an estimator, proposed in Refs 5,46, which combines the main underlying ideas of the above-introduced estimators; this is also the estimator used in the introductory section. The first step is to calculate a pilot estimate $\bar f(x) := 1 + \sum_{j=1}^{\hat J}\hat w_j\hat\theta_j\varphi_j(x)$. Here $\hat w_j := \max\big(0,\, 1 - d/(n\hat\theta_j^2)\big)$ and $\hat J$ minimizes Eq. (10) with $d_j = d$. Then two 'improvements' of the pilot estimate are made. The first is based on the idea of obtaining a good estimator for spatially inhomogeneous densities that may have several relatively large Fourier coefficients beyond the cutoff $\hat J$. This yields a modification based on a hard-threshold procedure for high-frequency Fourier coefficients. Namely,

$$\check f(x) := \bar f(x) + \sum_{j=\hat J+1}^{c_{JM}\hat J} I\{\hat\theta_j^2 > c_T\, d\ln(n)/n\}\,\hat\theta_j\varphi_j(x).$$

Here $c_{JM}$ and $c_T$ are two parameters that define, respectively, the maximal number of Fourier coefficients that may be included in the estimate and the coefficient in the hard-threshold procedure; their default values are 6 and 4. Note that a high-frequency Fourier coefficient is included only if it is extremely large. The necessity of the second improvement is obvious: an orthogonal series density estimate may take on negative values. Fortunately, there is a simple remedy, the L2-projection of $\check f$ onto the class of nonnegative densities,

$$\hat f(x) := \max(0,\, \check f(x) - c), \qquad (12)$$

where the constant c is such that $\int_0^1 \hat f(x)\,dx = 1$. Note that $\hat f(x)$ is completely data-driven. The estimator is called universal because in the software5 it is used for all statistical problems, including nonparametric regression, density estimation, spectral density estimation, filtration, and so on.

Aggregation

This is an attractive procedure for constructing a good data-driven density estimator from several known density estimators. Suppose that there already exist M ≥ 2 competing density estimators $\hat p_1, \hat p_2, \ldots, \hat p_M$. In general, they can be parametric, histogram, kernel, or any other estimators (not necessarily orthogonal series). A natural idea is then to look for a better estimator constructed by combining these M estimators in a suitable way. An improved estimator is referred to as an aggregate, and its construction is called aggregation. The three main types of aggregation are model selection, convex aggregation, and linear aggregation. In all these cases, an aggregate is compared with an oracle-estimator that can be written as $\tilde p_\lambda = \sum_{j=1}^{M}\lambda_j\hat p_j$, $\lambda = (\lambda_1, \ldots, \lambda_M)$. The objective of model selection is to find an aggregate which is as good as the best among the M given estimators; in this case the oracle-estimator has a λ with a single element equal to 1 and the others equal to 0. The objective of convex aggregation is to suggest an aggregate which is at least as good as the best convex combination of the M given estimators; here $\lambda \in \mathbb{R}^M$, $\lambda_j \ge 0$, and $\sum_{j=1}^{M}\lambda_j \le 1$. The objective of linear aggregation is to suggest an aggregate which is at least as good as the best linear combination of the M given estimators; here $\lambda \in \mathbb{R}^M$. As an example, let us explain how a linear aggregation can be performed. Apply the Gram–Schmidt orthonormalization to the set of known estimators $\{\hat p_1, \ldots, \hat p_M\}$ and obtain orthonormal functions $\{\phi_1, \ldots, \phi_N\}$, N ≤ M. Then, similarly to Eqs. (1)–(3), define a linear aggregate $\check p(x) = \sum_{j=1}^{N}\hat\lambda_j\phi_j(x)$, where $\hat\lambda_j = n^{-1}\sum_{l=1}^{n}\phi_j(X_l)$. Note that this linear aggregate is the empirical orthogonal series estimator. Furthermore, it is possible to show that this aggregate performs the linear aggregation, that is, its MISE is close to the MISE of the linear oracle. Of course, in applications the M estimators are unknown. In this case, splitting of the data is recommended, where part of the observations is used for the construction of the M estimators and the remaining observations are used for aggregation. Known results are primarily theoretical in nature, but the available numerical studies are encouraging. See a discussion in Refs 7,47–51.
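A grid-based R sketch of this linear aggregation follows. The Gram–Schmidt step is performed by a QR decomposition on a fine grid, so orthonormality holds only with respect to the grid approximation of the L2[0, 1] inner product; all names and numerical choices here are ours:

```r
# Linear aggregation of M density estimates given as functions on [0, 1].
# p.hat: a list of M functions; x: the sample used for aggregation.
aggregate.linear <- function(x, p.hat, grid = seq(0, 1, length.out = 501)) {
  P <- sapply(p.hat, function(p) p(grid))   # grid values, one column per estimate
  h <- grid[2] - grid[1]
  Phi <- qr.Q(qr(P)) / sqrt(h)              # approximately orthonormal in L2[0, 1]
  # Empirical coefficients lambda.hat_j = n^{-1} sum_l phi_j(X_l),
  # with phi_j evaluated at the data by linear interpolation.
  lam <- apply(Phi, 2, function(phi) mean(approx(grid, phi, xout = x)$y))
  list(grid = grid, fit = as.vector(Phi %*% lam))
}
```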
Bona Fide Estimation

The main drawback of orthogonal series density estimation, which is often discussed in the literature, is that a series approximation may not be a bona fide density. Namely, a series approximation can take on negative values and need not integrate to 1. Furthermore, in some cases restrictions such as unimodality, multimodality, or monotonicity may be known. It is shown in Refs 5,52,53 that in these cases the optimal estimation algorithm has two steps. The first step is an orthogonal series estimation. The second step is the projection of the estimate onto the given class of densities. As an example, Eq. (12) is the L2-projection of an estimate onto the class of all nonnegative densities.

ASYMPTOTIC THEORY

One of the attractive features of orthogonal series estimation is that it allows one to develop a feasible asymptotic theory. Let us explain a particular asymptotic result via the example of minimax density estimation over a Sobolev class of k-fold differentiable densities, $\mathcal F(k,Q) = \{f:\ f(x) = 1 + \sum_{j=1}^{\infty}\theta_j\varphi_j(x),\ \sum_{j=1}^{\infty}(\pi j)^{2k}\theta_j^2 \le Q < \infty,\ f(x) \ge 0\}$, $k \ge 1$. We begin by establishing a simple lower bound for the minimax MISE of a linear density estimate $\hat f(x, \{w_j\})$ defined in Eq. (7). Write, using Eq. (11),

$$\inf_{\{w_j\}}\ \sup_{f\in\mathcal F(k,Q)} E\int_0^1 \big(\hat f(x,\{w_j\}) - f(x)\big)^2 dx \ \ge\ \sup_{f\in\mathcal F(k,Q)}\ \inf_{\{w_j\}} E\int_0^1 \big(\hat f(x,\{w_j\}) - f(x)\big)^2 dx$$
$$\ge\ \sup_{f\in\mathcal F(k,Q)} E\int_0^1 \big(\hat f(x,\{w_j^*\}) - f(x)\big)^2 dx \ =\ \sup_{f\in\mathcal F(k,Q)}\ \sum_{j=1}^{\infty}\frac{d_j n^{-1}\theta_j^2}{\theta_j^2 + d_j n^{-1}} \ =\ n^{-2k/(2k+1)}\,P(k,Q,d)\,(1 + o_n(1)), \qquad (13)$$

where $P(k,Q,d) = Q^{1/(2k+1)}\big[kd/(\pi(k+1))\big]^{2k/(2k+1)}(2k+1)^{1/(2k+1)}$ and $o_n(1) \to 0$ as $n \to \infty$. In relations (13) the first inequality always holds, the second inequality is due to Eq. (11), and the last two equalities are based on direct calculations. Furthermore, it is possible to show that the lower bound (13) holds for all possible estimators (not necessarily linear); see Refs 41,54,55. Moreover, this lower bound is asymptotically sharp. Indeed, consider a data-driven blockwise orthogonal series estimator

$$\hat f(x, \{B_s, t_s\}) = 1 + \sum_{s=1}^{S}\hat w_s\sum_{j\in B_s}\hat\theta_j\varphi_j(x), \qquad (14)$$

where $\hat w_s = \big(1 - dL_s/(n\sum_{j\in B_s}\hat\theta_j^2)\big)\,I\big(L_s^{-1}\sum_{j\in B_s}\hat\theta_j^2 > d(1+t_s)n^{-1}\big)$. Here the $B_s$ are consecutive blocks of the indexes {1, 2, ...}, $L_s$ is the cardinality of $B_s$ (the number of indexes in $B_s$), S is the largest integer such that $\sum_{s=1}^{S} L_s < n^{1/3}\ln(n)$, and $t_s > 0$ are thresholds. This estimator is suggested in Ref 41, where it is established that if $\sum_{s=1}^{\infty} L_s^{-1} t_s^{-3} < \infty$, $L_{s+1}/L_s = 1 + o_s(1)$, and $t_s = o_s(1)$, then the MISE of the adaptive estimator (14) attains the lower bound (13). The adaptive estimator (14) is called sharp minimax because its MISE attains the minimax rate $n^{-2k/(2k+1)}$, which depends on the smoothness of the estimated density (obviously unknown to the estimator), and it also attains the sharp constant P(k, Q, d).
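The following R sketch implements the blockwise shrinkage (14). It is a minimal illustration only: for transparency we use dyadic blocks and the thresholds $t_s = 1/\ln(s + e)$, our own choices, whereas the theory above asks for slowly growing blocks with $L_{s+1}/L_s \to 1$:

```r
# Blockwise shrinkage (14): theta.hat[1] holds theta_0 = 1, and the
# remaining entries are shrunk blockwise; d is the coefficient of difficulty.
block.shrink <- function(theta.hat, n, d = 1) {
  J <- length(theta.hat) - 1
  out <- c(theta.hat[1], rep(0, J))
  lo <- 1; s <- 1
  while (lo <= J) {
    hi  <- min(2 * lo - 1, J)               # dyadic block B_s = {lo, ..., hi}
    L   <- hi - lo + 1
    t.s <- 1 / log(s + exp(1))              # vanishing thresholds (illustrative)
    idx <- (lo + 1):(hi + 1)                # positions of theta.hat_j, j in B_s
    sum2 <- sum(theta.hat[idx]^2)
    keep <- sum2 / L > d * (1 + t.s) / n    # block-level indicator in (14)
    w <- if (keep) 1 - d * L / (n * sum2) else 0
    out[idx] <- w * theta.hat[idx]
    lo <- hi + 1; s <- s + 1
  }
  out                                       # shrunken coefficients, ready for (7)
}
```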
Interestingly, the latter property of attaining the sharp constant is paramount for supporting estimators proposed for small data sets; there are a number of publications5,45,46,56,57 which argue that for small samples the asymptotic constant may be more important than the rate. The fact that an orthogonal series estimator may attain both the optimal rate and the sharp constant boosts the feasibility of the orthogonal series methodology.

BEYOND THE STANDARD SETTING

The aim of this section is to explain how the above-described approach to the construction of orthogonal series density estimators can be expanded to more complicated settings. Remember that, for the case of direct observations, the pivot of any orthogonal series density estimate is the sample mean estimate of the Fourier coefficients together with the corresponding coefficient of difficulty d. As a result, for a more complicated setting we need to understand how to suggest a sample mean estimator of the Fourier coefficients and then evaluate the corresponding coefficient of difficulty.

Censored Data

Let X denote a lifetime random variable whose observations may be right-censored. An observation X is right-censored if all you know about X is that it is greater than some given value. An example that motivated the name is as follows. Suppose that X is a person's age at death (in years), and you know only that it is larger than 56, in which case the time is censored at 56 (this may occur if the person at age 56 moved to another country and could no longer be traced). Realizations of X are not available to us directly; instead, the data $(Y_l, \delta_l)$, l = 1, 2, ..., n, are given, where $Y_l = \min(X_l, T_l)$, $\delta_l = I(X_l \le T_l)$, and the $T_l$ are independent and identically distributed (iid) random variables, with the survivor function $S^T(t)$, that 'censor' the random variable of interest X.

Let us find an appropriate sample mean estimate of the Fourier coefficients (2). Remember that X is supported on [0, 1], and suppose that $S^T(1) > 0$. For $y \in [0, 1]$ we can write $P(Y \le y,\ \delta = 1) = \int_0^y f(x) S^T(x)\,dx$, and this allows us to get $\theta_j = E\{\delta\,\varphi_j(Y)/S^T(Y)\}$. Because $\theta_j$ is written as an expectation, the sample mean estimate $\hat\theta_j := n^{-1}\sum_{l=1}^{n}\delta_l\varphi_j(Y_l)/S^T(Y_l)$ is an unbiased estimate of $\theta_j$. Furthermore, a straightforward calculation of the variance of this estimator allows us to conclude that $d_j \to d = \int_0^1 f(x)/S^T(x)\,dx$ as $j \to \infty$. Finally, if the survivor function $S^T$ is unknown, then the widely used product-limit (Kaplan–Meier) estimate can be used instead. According to Efromovich,52 this approach yields sharp minimax estimation of the density. Note that this result holds for the complicated case of indirect (here censored) data.
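A sketch of this recipe in R, using the Kaplan–Meier estimate of the censoring survivor function from the survival package; the function name, the way $S^T$ is evaluated at the observations, and the floor 0.001 are our own simplifications:

```r
# Sample-mean Fourier coefficients under right censoring:
# theta.hat_j = n^{-1} sum_l delta_l phi_j(Y_l) / S.T(Y_l).
library(survival)

fourier.coef.censored <- function(y, delta, J) {
  # Kaplan-Meier estimate of the censoring survivor function S^T:
  # a censoring time is the "event" when delta == 0.
  km  <- survfit(Surv(y, 1 - delta) ~ 1)
  S.T <- stepfun(km$time, c(1, km$surv))    # right-continuous step function
  s   <- pmax(S.T(y), 0.001)                # ad hoc guard against division by ~0
  theta <- numeric(J + 1); theta[1] <- 1
  for (j in seq_len(J))
    theta[j + 1] <- mean(delta * sqrt(2) * cos(pi * j * y) / s)
  theta
}
```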
Biased Data

This is a familiar indirect sampling setting where, in place of a sample from the density $f^X$ of X, one observes a sample $Y_1, \ldots, Y_n$ from the density $f^Y(y) := g(y) f^X(y)/\mu$, where g(y) is a given positive function and μ is the constant that makes $f^Y$ a bona fide density, $\mu = \int_0^1 g(x) f^X(x)\,dx$. Note that because $f^X$ is unknown, the parameter μ is also unknown.

Let us present an example of biased data. Suppose that a researcher would like to know the distribution of the ratio of alcohol in the blood of intoxicated drivers traveling along a particular highway. The data are available from routine police reports on arrested drivers charged with driving under the influence of alcohol (a routine report means that there are no special police operations to test all drivers). Because a more heavily intoxicated driver has a larger chance of attracting the attention of the police, it is clear that the data are length-biased toward higher ratios of alcohol in the blood. Thus, the researcher should make an appropriate adjustment in the method of estimation of the underlying density of the ratio of alcohol in the blood of all intoxicated drivers.

For biased data, it is not difficult to suggest a sample mean estimate of the Fourier coefficients of the underlying density $f^X$: $\hat\theta_j := \hat\mu\, n^{-1}\sum_{l=1}^{n}\varphi_j(Y_l)/g(Y_l)$. Because the parameter μ is unknown, we plug in its estimate $\hat\mu := \big[n^{-1}\sum_{l=1}^{n} 1/g(Y_l)\big]^{-1}$. The idea of this estimate is that $1/\hat\mu$ is an unbiased estimate of 1/μ. Indeed, $E\{g^{-1}(Y)\} = \mu^{-1}\int_0^1 g(y) f^X(y) g^{-1}(y)\,dy = 1/\mu$. Finally, a direct calculation shows that the coefficient of difficulty for this problem is $d := \mu\int_0^1 f^X(y) g^{-1}(y)\,dy = \mu^2 E\{g^{-2}(Y)\}$, and its natural estimate is $\hat d := \hat\mu^2 n^{-1}\sum_{l=1}^{n} g^{-2}(Y_l)$. Efromovich58 shows that this approach implies sharp minimax estimation.
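In R, the biased-data coefficients and the estimated coefficient of difficulty are a direct transcription of the formulas above (a minimal sketch; g is the known, positive bias function):

```r
# Fourier coefficients for biased data: mu.hat = [n^{-1} sum 1/g(Y_l)]^{-1},
# theta.hat_j = mu.hat * n^{-1} sum phi_j(Y_l) / g(Y_l).
fourier.coef.biased <- function(y, g, J) {
  w <- 1 / g(y)
  mu.hat <- 1 / mean(w)                  # 1/mu.hat is unbiased for 1/mu
  theta <- numeric(J + 1); theta[1] <- 1
  for (j in seq_len(J))
    theta[j + 1] <- mu.hat * mean(sqrt(2) * cos(pi * j * y) * w)
  theta
}

# Estimated coefficient of difficulty d.hat = mu.hat^2 * n^{-1} sum g^{-2}(Y_l).
difficulty.biased <- function(y, g) (1 / mean(1 / g(y)))^2 * mean(1 / g(y)^2)
```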
Density of Regression Errors

This is one of the most challenging examples of indirect data. Consider a nonparametric heteroscedastic regression $Y_l = m(X_l) + \sigma(X_l)\varepsilon_l$, l = 1, 2, ..., n, where $m(x) = E(Y|X = x)$ is the regression function, σ(x) is the scale (volatility) function, X is a random variable with design density h(x), and ε is the regression error. We are interested in estimating the density of the regression error in the presence of three unknown nuisance functions: (1) the regression function m(x); (2) the design density h(x) of the predictor X; and (3) the scale function σ(x). Here a natural approach is to estimate the regression and scale functions, calculate the residuals $\hat\varepsilon_l = [Y_l - \hat m(X_l)]/\hat\sigma(X_l)$, rescale them onto [0, 1], and then use an orthogonal series density estimator. Can this naive approach work? Efromovich56,59,60 gives a positive answer. Furthermore, it is shown there that an orthogonal series estimator performs as well as an oracle that knows the underlying regression errors $\varepsilon_1, \ldots, \varepsilon_n$.

Conditional Density

Consider a setting where a sample of size n from the joint density $f^{XY}(x, y)$, supported on [0, 1]², is given. The problem is to estimate the conditional density f(y|x) of the response Y given the predictor X. Assume that the marginal density of the predictor, $p(x) := \int_0^1 f^{XY}(x, y)\,dy$, is positive on [0, 1]. Then the conditional density can be written as $f(y|x) = f^{XY}(x, y)/p(x)$. At first glance, the last equation leads to a simple estimation procedure: estimate the joint and marginal densities and consider the corresponding ratio. Unfortunately, in general this natural idea does not lead to minimax estimation whenever f(y|x) is smoother than p(x). Instead, let us one more time use the orthogonal series approach. We begin with the orthogonal Fourier expansion of the conditional density as a function of two variables, $f(y|x) = \sum_{i,j=0}^{\infty}\theta_{ij}\varphi_i(x)\varphi_j(y)$, where $\theta_{ij} = \int_0^1\int_0^1 f(y|x)\varphi_i(x)\varphi_j(y)\,dx\,dy = E\{\varphi_i(X)\varphi_j(Y)[p(X)]^{-1}\}$. The last expression for the Fourier coefficient via an expectation yields the sample mean estimate $\hat\theta_{ij} = n^{-1}\sum_{l=1}^{n}\varphi_i(X_l)\varphi_j(Y_l)/p(X_l)$ of $\theta_{ij}$. Furthermore, a direct calculation shows that the coefficient of difficulty is $d = \int_0^1 [1/p(x)]\,dx$. Note that the Cauchy–Schwarz inequality yields d ≥ 1, with equality for the uniform p. If the marginal density p is unknown, then its estimate is used. It is shown in Ref 61 that this classical orthogonal series approach implies sharp minimax estimation of the conditional density.

CONCLUSION

Orthogonal series density estimation is a powerful and universal methodology which can be used in a wide variety of settings with direct and indirect observations, for adaptive estimation of univariate, multivariate, and conditional densities, and for small and large samples alike. Data-driven procedures, which perform well in numerical studies and are supported by the asymptotic theory, are available. The orthogonal series approach is the foundation of the modern theory of sharp minimax estimation. An important applied property of any orthogonal series estimator is data reduction: a density estimate can be restored from a relatively small number of Fourier coefficients. Furthermore, the wide variety of available bases, including trigonometric, polynomial, and wavelet bases, allows the data analyst to address any challenging shape of the probability density.

REFERENCES

1. Anderson GL, de Figueiredo RJP. An adaptive orthogonal series estimator for probability density functions. Ann Stat 1980, 8:347–376.
2. Čencov NN. Statistical Decision Rules and Optimum Inference. New York: Springer-Verlag; 1980.
3. Devroye L, Györfi L. Nonparametric Density Estimation: The L1 View. New York: John Wiley & Sons; 1985.
4. Devroye L, Lugosi G. Combinatorial Methods in Density Estimation. New York: Springer; 2001.
5. Efromovich S. Nonparametric Curve Estimation: Methods, Theory and Applications. New York: Springer; 1999.
6. Ibragimov IA, Khasminskii RZ. Statistical Estimation: Asymptotic Theory. New York: Springer; 1981.
7. Massart P. Concentration Inequalities and Model Selection, vol. 1896, Lecture Notes in Mathematics. New York: Springer; 2007.
8. Rosenblatt M. Curve estimates. Ann Math Stat 1971, 42:1815–1842.
9. Schwartz S. Estimation of probability density by an orthogonal series. Ann Math Stat 1967, 38:1262–1265.
10. Scott DW. Multivariate Density Estimation: Theory, Practice, and Visualization. New York: John Wiley & Sons; 1992.
11. Silverman BW. Density Estimation for Statistics and Data Analysis. London: Chapman & Hall; 1986.
12. Tapia RA, Thompson JR. Nonparametric Probability Density Estimation. Baltimore: Johns Hopkins University Press; 1978.
13. Wegman EJ. Nonparametric probability density estimation: I. A summary of available methods. Technometrics 1972, 14:533–546.
14. Scott DW. Histogram. WIREs Comp Stat 2010, 2(1):44–48.
15. Čencov NN. Evaluation of an unknown distribution density from observations. Soviet Math Dokl 1962, 3:1559–1562.
16. Hart JD. Nonparametric Smoothing and Lack-of-Fit Tests. New York: Springer; 1997.
17. Walter GG. Wavelets and Other Orthogonal Systems with Applications. London: CRC Press; 1994.
18. Hall P. Orthogonal series methods for both qualitative and quantitative data. Ann Stat 1983, 11:1004–1007.
19. Hendriks H. Nonparametric estimation of a probability density on a Riemannian manifold using Fourier expansions. Ann Stat 1990, 18:832–849.
20. Wahba G. Optimal convergence properties of variable knot, kernel, and orthogonal series methods for density estimation. Ann Stat 1975, 3:15–29.
21. Watson GS. Density estimation by orthogonal series. Ann Math Stat 1969, 38:1262–1265.
22. Rosenblatt M. Remarks on some non-parametric estimates of a density function. Ann Math Stat 1956, 27:832–837.
23. Hall P. Estimating a density on the positive half line by the method of orthogonal series. Ann Inst Stat Math 1980, 32:351–362.
24. Walter GG. Properties of Hermite series estimation of probability density. Ann Stat 1977, 5:1258–1264.
25. Hall P. On trigonometric series estimates of densities. Ann Stat 1981, 9:683–685.
26. Tarter ME, Lock MD. Model-Free Curve Estimation. New York: Chapman & Hall; 1993.
27. Buckland ST. Fitting density functions with polynomials. J R Stat Soc [Ser A] 1992, 41:63–76.
28. Rudzkis R, Radavicius M. Adaptive estimation of distribution density in the basis of algebraic polynomials. Theory Probab Appl 2005, 49:93–109.
29. Chicken E, Cai T. Block thresholding for density estimation: local and global adaptivity. J Multivar Anal 2005, 95:75–106.
30. Donoho D, Johnstone I, Kerkyacharian G, Picard D. Density estimation by wavelet thresholding. Ann Stat 1996, 24:508–539.
31. Härdle W, Kerkyacharian G, Picard D, Tsybakov A. Wavelets, Approximation, and Statistical Applications. New York: Springer; 1998.
32. Good IJ, Gaskins RA. Density estimation and bump-hunting by penalized likelihood method exemplified by scattering and meteorite data, with discussion. J Am Stat Assoc 1980, 75:42–73.
33. Wahba G. Data-based optimal smoothing of orthogonal series density estimates. Ann Stat 1981, 9:146–156.
34. Tarter ME, Kronmal RA. An introduction to the implementation and theory of nonparametric density estimation. Am Stat 1976, 30:105–112.
35. Diggle PJ, Hall P. The selection of terms in an orthogonal series density estimator. J Am Stat Assoc 1986, 81:230–233.
36. Hart JD. On the choice of truncation point in Fourier series density estimation. J Stat Comput Simul 1985, 21:95–116.
37. Hall P. Cross-validation and the smoothing of orthogonal series density estimators. J Multivar Anal 1987, 21:189–206.
38. Scott DW, Terrell GR. Biased and unbiased cross-validation in density estimation. J Am Stat Assoc 1987, 82:1131–1146.
39. Kronmal RA, Tarter ME. The estimation of probability densities and cumulatives by Fourier series methods. J Am Stat Assoc 1968, 63:925–952.
40. Crain BB. A note on density estimation using orthogonal expansions. Ann Stat 1973, 2:454–463.
41. Efromovich S. Nonparametric estimation of a density with unknown smoothness. Theory Probab Appl 1985, 30:557–568.
42. Hall P, Kerkyacharian G, Picard D. Block threshold rules for curve estimation using kernel and wavelet methods. Ann Stat 1998, 26:922–942.
43. Rigollet P. Adaptive density estimation using the blockwise Stein method. Bernoulli 2006, 12:351–370.
44. Efromovich S. Density estimation for the case of supersmooth measurement error. J Am Stat Assoc 1997, 92:526–535.
45. Efromovich S. Adaptive estimation of and oracle inequalities for probability densities and characteristic functions. Ann Stat 2008, 36:1127–1155.
46. Efromovich S. Adaptive orthogonal series density estimation for small samples. Comput Stat Data Anal 1996, 22:599–617.
47. Catoni O. Statistical Learning Theory and Stochastic Optimization, vol. 1851, Lecture Notes in Mathematics. New York: Springer; 2004.
48. Nemirovski A. Topics in Non-Parametric Statistics, vol. 1738, Lecture Notes in Mathematics. New York: Springer; 2000.
49. Rigollet P, Tsybakov A. Linear and convex aggregation of density estimators. Math Methods Stat 2007, 16:260–280.
50. Samarov A, Tsybakov A. Aggregation of density estimators and dimension reduction. In: Nair V, ed. Advances in Statistical Models and Inference, Essays in Honor of Kjell Doksum. Singapore: World Scientific; 2007, 233–251.
51. Yang Y. Mixing strategies for density estimation. Ann Stat 2000, 28:75–87.
52. Efromovich S. Density estimation under random censorship and order restrictions: from asymptotic to small samples. J Am Stat Assoc 2001, 96:667–685.
53. Gajek L. On improving density estimators which are not bona fide functions. Ann Stat 1986, 14:1612–1618.
54. Efromovich S. Lower bound for estimation of Sobolev densities of order less 1/2. J Stat Plan Inference 2009, 139:2261–2268.
55. Golubev GK. Nonparametric estimation of smooth probability densities in L2. Prob Inform Transm 1992, 28:44–54.
56. Efromovich S. Adaptive estimation of error density in nonparametric regression with small sample size. J Stat Plan Inference 2007, 137:363–378.
57. Marron JS, Wand MP. Exact mean integrated squared error. Ann Stat 1992, 20:712–736.
58. Efromovich S. Density estimation for biased data. Ann Stat 2004, 32:1137–1161.
59. Efromovich S. Estimation of the density of regression errors. Ann Stat 2005, 33:2194–2227.
60. Efromovich S. Optimal nonparametric estimation of the density of regression errors with finite support. Ann Inst Stat Math 2007, 59:617–654.
61. Efromovich S. Conditional density estimation. Ann Stat 2007, 35:2504–2535.