A Bernstein-von Mises theorem for smooth functionals in
semiparametric models
Ismaël Castillo and Judith Rousseau
Abstract
A Bernstein-von Mises theorem is derived for general semiparametric functionals. The
result is applied to a variety of semiparametric problems, in i.i.d. and non-i.i.d. situations.
In particular, new tools are developed to handle semiparametric bias, notably for nonlinear functionals and in cases where regularity is possibly low. Examples include the squared
L2 -norm in Gaussian white noise, non-linear functionals in density estimation, as well as functionals in autoregressive models. For density estimation, a systematic study of BvM results for
two important classes of priors is provided, namely random histograms and Gaussian process
priors.
1 Introduction
Bayesian approaches are often considered to be asymptotically close to frequentist likelihood-based approaches, so that the impact of the prior disappears as the information brought by the data (typically the number of observations) increases. This common knowledge is verified in most parametric models and is given a precise expression through the so-called Bernstein–von Mises theorem or property (hereafter BvM). This property says that, as the number of observations increases, the posterior distribution can be approximated by a Gaussian distribution centered at an efficient estimator of the parameter of interest and with variance the inverse of the Fisher information matrix of the whole sample, see for instance van der Vaart [42], Berger [4] or Ghosh and Ramamoorthi [30]. The situation becomes, however, more complicated in non- and semi-parametric models.
Semiparametric versions of the BvM property consider the behaviour of the marginal posterior in a parameter of interest, in models potentially containing an infinite-dimensional nuisance parameter. There, some care is typically needed in the choice of the nonparametric prior, and a variety of questions linked to prior choice and techniques of proof arise. Results on semiparametric BvM applicable to general models and/or general priors include Shen [41], Castillo [12], Rivoirard and Rousseau [40] and Bickel and Kleijn [5]. The variety of possible interactions between prior and model and the subtleties of prior choice are illustrated in the previous general papers and in recent results in specific models such as Kim [32], De Blasi and Hjort [21], Leahu [38], Knapik et al. [35], Castillo [13] and Kruijer and Rousseau [36]. In between semi- and non-parametric results, BvM results for parameters with growing dimension have been obtained in e.g. Ghosal [26], Boucheron and Gassiat [9] and Bontemps [8]. Finally, although there is no immediate analogue of the BvM property for infinite-dimensional parameters, as pointed out by Cox [20] and Freedman [24], some recent contributions have introduced possible notions of nonparametric BvM, see Castillo and Nickl [15] and also Leahu [38]. In fact, the results of the present paper are relevant for these, as discussed below.
For semiparametric BvM, it is of particular interest to obtain generic sufficient conditions that do not depend on the specific form of the considered model. In this paper, we provide a
general result, Theorem 2.1 in Section 2, on the existence of the BvM property for generic models
and functionals of the parameter. Let us briefly discuss the scope of our results, see Section 2
for precise definitions. Consider a model parameterised by η varying in a (subset of a) metric
space S equipped with a σ-field S. Let ψ : S → Rd , d ≥ 1, be a measurable functional of
interest and let Π be a probability distribution on S. Given observations Y n from the model, we
study the asymptotic posterior distribution of ψ(η), denoted Π[ψ(η) | Y^n]. Let N(0, V) denote the centered normal law with covariance matrix V. We give general conditions under which a BvM-type property is valid,

Π[√n(ψ(η) − ψ̂) | Y^n] ⇝ N(0, V),   (1.1)

as n → ∞ in probability, where ψ̂ is a (random) centering point and V a covariance matrix, both to be specified, and where ⇝ stands for weak convergence. An interesting and well-known
consequence of BvM is that posterior credible sets, such as equal-tail credible intervals, highest
posterior density regions or one-sided credible intervals are also confidence regions with the same
asymptotic coverage.
The contributions of the present paper can be grouped around the following aims:
1. Provide general conditions on the model and on the functional ψ to guarantee (1.1) to hold,
in a variety of frameworks both i.i.d. and non-i.i.d. This includes investigating how the
choice of the prior influences the centering ψ̂ and the variance V. This also includes studying the case
of nonlinear functionals, which involves specific techniques for the bias. This is done via
a Taylor-type expansion of the functional involving a linear term as well as, possibly, an
additional quadratic term.
2. In frameworks with low regularity, second order properties in the functional expansion may
become relevant. We study this as an application of the main Theorem in the important case
of estimation of the squared L2 -norm of an unknown regression function in the case where
the convergence rate for the functional is still parametric but where the ‘plug-in’ property
in the sense of Bickel and Ritov [7] is not necessarily satisfied.
3. Provide simple and ready-to-use sufficient conditions for BvM in the important example of
density estimation on the unit interval. We present extensions and refinements, in particular of results in Castillo [12] and Rivoirard and Rousseau [40], with respect, respectively, to the use of Gaussian process priors in the context of density estimation and to the handling of non-linear functionals. The class of random density histogram priors is also studied systematically, for the first time in the context of Bayesian semiparametrics.
4. Provide simple sufficient conditions on the prior for BvM to hold in a more complex example
involving dependent data, namely the nonlinear autoregressive model. To our knowledge this
is the first result of this type in such a model.
The techniques and results of the paper, as it turned out, have also been useful for different purposes in a recent series of works developing a multiscale approach for posteriors. In particular: a) to prove functional limiting results, such as Bayesian versions of Donsker's theorem, or more generally nonparametric BvM results as in Castillo and Nickl [16], a first step consists in proving the result for finite-dimensional projections: this is exactly asking for a semiparametric BvM to hold, and results from Section 4 can be directly applied; b) related to this is the study of many functionals simultaneously: this is used in the study of posterior contraction rates in the supremum norm in Castillo [14]. Finally, along the way, we shall also derive posterior rate results for Gaussian processes which are of independent interest, see Proposition 2 (check number) in the supplement [18].
Our results show that the most important condition is a no-bias condition, which will be seen to be essentially necessary. This condition is written in a non-explicit way in the general Theorem 2.1, since the study of such a condition depends heavily on the family of priors considered together with the statistical model. Extensive discussions of the implications of this no-bias condition are provided in the context of the white noise model and density models for two families of priors. In the examples we have considered, the main tool used to verify this condition consists in constructing a change of parameterisation of the form η → η + Γ/√n, for some given Γ depending on the functional of interest, which leaves the prior approximately unchanged.
Roughly speaking, for the no-bias condition to be valid, it is necessary that both η0 and Γ are
well approximated under the prior. If this condition is not verified, then BvM may not hold: an
example of this phenomenon is provided in Section 4.3.
Theorem 2.1 does not rely on a specific type of model, nor on a specific family of functionals.
In Section 3 it is applied to the study of a nonlinear functional in the white noise model, namely
the squared-norm of the signal. Applications to density estimation with three different types of
functionals and to an autoregressive model can be found respectively in Section 4 and Section 5.
Section 6 is devoted to proofs, together with the supplement [18].
Model, prior and notation
Let (Y^n, G^n, P^n_η : η ∈ S) be a statistical experiment, with observations Y^n sitting on a space Y^n equipped with a σ-field G^n, and where n is an integer quantifying the available amount of information. We typically consider the asymptotic framework n → ∞. We assume that S is equipped with a σ-field 𝒮, that S is a subset of a linear space and that, for all η ∈ S, the measures P^n_η are absolutely continuous with respect to a dominating measure μ_n. Denote by p^n_η the associated density and by ℓ_n(η) the log-likelihood. Let η0 denote the true value of the parameter and P^n_{η0} the frequentist distribution of the observations Y^n under η0. Throughout the paper we set P^n_0 := P^n_{η0} and P_0 := P^1_0. Similarly, E^n_0[·] and E_0[·] denote the expectation under P^n_0 and P_0 respectively, and E^n_η and E_η are the corresponding expectations under P^n_η and P_η. Given any prior probability Π on S, we denote by Π[· | Y^n] the associated posterior distribution on S, given by Bayes' formula: Π[B | Y^n] = ∫_B p^n_η(Y^n) dΠ(η) / ∫ p^n_η(Y^n) dΠ(η). Throughout the paper, we use the notation o_p in place of o_{P^n_0} for simplicity.
The quantity of interest in this paper is a functional ψ : S → R^d, d ≥ 1. We restrict to the case of real-valued functionals (d = 1), noting that the presented tools have natural multivariate counterparts, not detailed here for notational simplicity.
For η1, η2 in S, the Kullback-Leibler divergence between P^n_{η1} and P^n_{η2} is

KL(P^n_{η1}, P^n_{η2}) := ∫_{Y^n} log( dP^n_{η1}/dP^n_{η2} (y^n) ) dP^n_{η1}(y^n),

and the corresponding variance of the likelihood ratio is denoted by

V_n(P^n_{η1}, P^n_{η2}) := ∫_{Y^n} [ log( dP^n_{η1}/dP^n_{η2} (y^n) ) ]² dP^n_{η1}(y^n) − KL(P^n_{η1}, P^n_{η2})².
Let ‖·‖₂ and ⟨·,·⟩₂ denote respectively the L² norm and the associated inner product on [0, 1]. We also use ‖·‖₁ to denote the L¹ norm on [0, 1]. For all β ≥ 0, C^β denotes the class of β-Hölder functions on [0, 1], where β = 0 corresponds to the case of continuous functions. Let h(f1, f2) = (∫₀¹ (√f1 − √f2)² dμ)^{1/2} stand for the Hellinger distance between two densities f1 and f2 relative to a measure μ. For g integrable on [0, 1] with respect to Lebesgue measure, we often write ∫₀¹ g or ∫g instead of ∫₀¹ g(x)dx. For two real-valued functions A, B (defined on R or on N), we write A ≲ B if A/B is bounded and A ≍ B if |A/B| is bounded away from 0 and ∞.
2 Main result
In this section, we give the general theorem which provides sufficient conditions on the model, the
functional and the prior for BvM to be valid.
We say that the posterior distribution for the functional ψ(η) is asymptotically normal with centering ψ_n and variance V if, for β the bounded Lipschitz metric (also known as the Lévy-Prohorov metric) for weak convergence, see Section 1 in the supplement [18], and τ_n the mapping τ_n : η → √n(ψ(η) − ψ_n), it holds, as n → ∞, that

β(Π[· | Y^n] ∘ τ_n^{−1}, N(0, V)) → 0,   (2.1)

in P^n_0-probability, which we also denote Π[· | Y^n] ∘ τ_n^{−1} ⇝ N(0, V).
In models where an efficiency theory at rate √n is available, we say that the posterior distribution for the functional ψ(η) at η = η0 satisfies the BvM theorem if (2.1) holds with ψ_n = ψ̂_n + o_p(1/√n), for ψ̂_n a linear efficient estimator of ψ(η) and V the efficiency bound for estimating ψ(η). For instance, for i.i.d. models and a differentiable functional ψ with efficient influence function ψ̃_{η0}, see e.g. [42], Chap. 25, the efficiency bound is attained if V = P_0[ψ̃²_{η0}]. Let us now state the assumptions which will be required.
Let A_n be a sequence of measurable sets such that, as n → ∞,

Π[A_n | Y^n] = 1 + o_p(1).   (2.2)
We assume that there exists a Hilbert space (H, ⟨·,·⟩_L), with associated norm denoted ‖·‖_L, for which the inclusion A_n − η0 ⊂ H is satisfied for n large enough. Note that we do not necessarily assume that S ⊂ H, as H gives a local description of the parameter space near η0 only. Note also that H may depend on n. The norm ‖·‖_L typically corresponds to the LAN (locally asymptotically normal) norm as described in (2.3) below.
Let us first introduce some notation, which corresponds to expanding both the log-likelihood ℓ_n(η) := ℓ_n(η, Y^n) in the model and the functional of interest ψ(η). The two expansions have remainders R_n and r respectively.
LAN expansion. Write, for all η ∈ A_n,

ℓ_n(η) − ℓ_n(η0) = −(n/2)‖η − η0‖²_L + √n W_n(η − η0) + R_n(η, η0),   (2.3)

where (W_n(h), h ∈ H) is a collection of real random variables such that, P^n_0-almost surely, the mapping h → W_n(h) is linear, and such that for all h ∈ H, we have W_n(h) ⇝ N(0, ‖h‖²_L) as n → ∞.
Functional smoothness. Consider ψ₀⁽¹⁾ ∈ H and a self-adjoint linear operator ψ₀⁽²⁾ : H → H and write, for any η ∈ A_n,

ψ(η) = ψ(η0) + ⟨ψ₀⁽¹⁾, η − η0⟩_L + (1/2)⟨ψ₀⁽²⁾(η − η0), η − η0⟩_L + r(η, η0),   (2.4)

where there exists a positive constant C₁ such that

‖ψ₀⁽²⁾h‖_L ≤ C₁‖h‖_L for all h ∈ H, and ‖ψ₀⁽¹⁾‖_L ≤ C₁.   (2.5)
Note that neither formulation, of the functional smoothness or of the LAN expansion, is an assumption yet, since nothing is required so far on r(η, η0) or on R_n(η, η0). This is done in Assumption A. The norm ‖·‖_L is typically identified from a local asymptotic normality property of the model at the point η0. It is thus intrinsic to the considered statistical model. Next, the expansion of ψ around η0 is in terms of the latter norm: since this norm is intrinsic to the model, this can be seen as a canonical choice.
Consider two cases, depending on the value of ψ₀⁽²⁾ in (2.4). The first case corresponds to a first-order analysis of the problem. It ignores any potential non-linearity in the functional η → ψ(η) by considering a linear approximation with representer ψ₀⁽¹⁾ in (2.4) and shifting any remainder term into r.
Case A1. We set ψ₀⁽²⁾ = 0 in (2.4) and, for all η ∈ A_n and t ∈ R, define

η_t = η − tψ₀⁽¹⁾/√n.   (2.6)
Case A2. We allow for a nonzero second-order term ψ₀⁽²⁾ in (2.4). In this case we need a few more assumptions. One is simply the existence of some posterior convergence rate in the ‖·‖_L-norm. Suppose that, for some sequence ε_n = o(1) and A_n as in (2.2),

Π[η ∈ A_n ; ‖η − η0‖_L ≤ ε_n/2 | Y^n] = 1 + o_p(1).   (2.7)
Next, we assume that the action of the process W_n above can be approximated by an inner product, with a representer w_n, which will be particularly useful in defining a suitable path η_t enabling one to handle second-order terms. Suppose that there exists w_n ∈ H such that, for all h ∈ H,

W_n(h) = ⟨w_n, h⟩_L + ∆_n(h),  P^n_0-almost surely,   (2.8)

where the remainder term ∆_n(·) is such that

sup_{η∈A_n} ∆_n(ψ₀⁽²⁾(η − η0)) = o_p(1)   (2.9)

and where one further assumes that

⟨w_n, ψ₀⁽²⁾(ψ₀⁽¹⁾)⟩_L + ε_n‖w_n‖_L = o_p(√n).   (2.10)
Finally set, for all η ∈ A_n and w_n as in (2.8), for all t ∈ R,

η_t = η − tψ₀⁽¹⁾/√n − tψ₀⁽²⁾(η − η0)/(2√n) − tψ₀⁽²⁾w_n/(2n).   (2.11)
Assumption A. In Cases A1 and A2, with η_t defined by (2.6) and (2.11) respectively, assume that for all t ∈ R, η_t ∈ S for n large enough and that

sup_{η∈A_n} |t√n r(η, η0) + R_n(η, η0) − R_n(η_t, η0)| = o_p(1).   (2.12)

The suprema in the previous display may not be measurable; in this case one interprets the previous probability statements in terms of outer measure.
We then provide a characterisation of the asymptotic distribution of ψ(η). At first reading, one may set ψ₀⁽²⁾ = 0 in the next theorem: this provides a first-order result that will be used repeatedly in Sections 4 and 5. The complete statement allows for a second-order analysis via a possibly non-zero ψ₀⁽²⁾ and will be applied in Section 3.
Theorem 2.1. Consider a statistical model {P^n_η, η ∈ S}, a real-valued functional η → ψ(η) and ⟨·,·⟩_L, ψ₀⁽¹⁾, ψ₀⁽²⁾, W_n, w_n as defined above. Suppose that Assumption A is satisfied, and denote

ψ̂ = ψ(η0) + W_n(ψ₀⁽¹⁾)/√n + ⟨w_n, ψ₀⁽²⁾w_n⟩_L/(2n),   V_{0,n} = ‖ψ₀⁽¹⁾ − ψ₀⁽²⁾w_n/(2√n)‖²_L.

Let Π be a prior distribution on η. Let A_n be any measurable set such that (2.2) holds. Then for any real t, with η_t as in (2.11),

E^Π[e^{t√n(ψ(η)−ψ̂)} | Y^n, A_n] = e^{o_p(1) + (t²/2)V_{0,n}} · ∫_{A_n} e^{ℓ_n(η_t)−ℓ_n(η0)} dΠ(η) / ∫_{A_n} e^{ℓ_n(η)−ℓ_n(η0)} dΠ(η).   (2.13)

Moreover, if V_{0,n} = V₀ + o_p(1) for some V₀ > 0 and if, for some possibly random sequence of reals μ_n and any real t,

∫_{A_n} e^{ℓ_n(η_t)−ℓ_n(η0)} dΠ(η) / ∫_{A_n} e^{ℓ_n(η)−ℓ_n(η0)} dΠ(η) = e^{μ_n t}(1 + o_p(1)),   (2.14)

then the posterior distribution of √n(ψ(η) − ψ̂) − μ_n is asymptotically normal and mean-zero, with variance V₀.
The proof of Theorem 2.1 is given in Section 6.1.
Corollary 1. Under the conditions of Theorem 2.1, if (2.14) holds with μ_n = o_p(1) and ‖ψ₀⁽²⁾w_n‖_L = o_p(√n), then the posterior distribution of √n(ψ(η) − ψ̂) is asymptotically mean-zero normal, with variance ‖ψ₀⁽¹⁾‖²_L.
Assumption A ensures that the local behaviour of the likelihood resembles that in a Gaussian experiment with norm ‖·‖_L. An assumption of this type is expected, as the target distribution in the BvM theorem is Gaussian. As will be seen in the examples in Sections 3, 4 and 5, A_n is often a well-chosen subset of a neighbourhood of η0, with respect to a given metric, which need not be the LAN norm ‖·‖_L.
We note that, for simplicity, we restrict here to approximating paths η_t in (2.6) (first-order results) and (2.11) (second-order results) that are linear in the perturbation. This already covers quite a few interesting models. More generally, some models may be locally curved around η0, with a possibly non-linear form of approximating paths. A more general statement would possibly have an extra condition to control the curvature. Examining this type of examples is left for future work.
The central condition for applying Theorem 2.1 is (2.13). To check this condition, a possible
approach is to construct a change of parameter from η to ηt (or some parameter close enough
to η_t) which leaves the prior and A_n approximately unchanged. More formally, let ψ_n be an approximation of ψ₀⁽¹⁾, in a sense to be made precise below, and let Π^{ψ_n} := Π ∘ (τ^{ψ_n})^{−1} denote the image measure of Π under the mapping

τ^{ψ_n} : η → η − tψ_n/√n.
To check (2.13), one may for instance suppose that the measures Π^{ψ_n} and Π are mutually absolutely continuous and that the density dΠ/dΠ^{ψ_n} is close to the quantity e^{μ_n t} on A_n. This is the approach
we follow for various models and priors in the sequel. In particular, we prove that a functional
change of variable is possible for various classes of prior distributions. For instance, in density
estimation, Gaussian process priors and piecewise constant priors are considered and Propositions
1 and 3 below give a set of sufficient conditions that guarantee (2.13) for each class of priors.
In general, the construction of a feasible change of parameterisation heavily depends on the
structure of the prior model. We note that this change of parameter approach above only provides a
sufficient condition. For some priors, shifted measures may be far from being absolutely continuous,
even using approximations of the shifting direction: for such priors, one may have to compare the
integrals directly.
Remark 1. Here the main focus is on estimation of abstract semiparametric functionals ψ(η). Our results also apply to the case of separated semiparametric models where η = (ψ, f) and ψ(η) = ψ ∈ R, as considered in [12], with a weak convergence to the normal distribution instead of the strong convergence obtained in [12]. We have ψ(η) − ψ(η0) = ⟨η − η0, (1, −γ)⟩_L / Ĩ_{η0}, where γ is the least favorable direction and Ĩ_{η0} = ‖(1, −γ)‖²_L, see [12]. We can then choose ψ₀⁽¹⁾ = (1, −γ)/Ĩ_{η0} as in [12]. If γ = 0 (no loss of information), η_t = (ψ − tĨ_{η0}^{−1}/√n, f) and (2.13) is satisfied if π = π_ψ ⊗ π_f with π_ψ positive and continuous at ψ(η0), so that we obtain a result similar to Theorem 1 of [12]. In [12] a slightly weaker version of condition (2.12) is considered; however, the proof of Section 6.1 can be easily adapted, in the case of separated semiparametric models, so that the result holds under the weaker version of (2.12) as well.
Remark 2. As follows from the proof of Theorem 2.1, ψ₀⁽¹⁾ can be replaced by any element, say ψ̃, of H such that

⟨ψ̃, η − η0⟩_L = ⟨ψ₀⁽¹⁾, η − η0⟩_L,  ‖ψ̃‖_L = ‖ψ₀⁽¹⁾‖_L,

where ψ̃ may potentially depend on η. This proves useful when considering constrained spaces, as in the case of density estimation.
We now apply Theorem 2.1 in the cases of white noise, density and autoregressive models and
for various types of functionals and priors.
3 Applications to the white noise model
Consider the model

dY^n(t) = f(t)dt + n^{−1/2}dB(t),  t ∈ [0, 1],

where f ∈ L²[0, 1] and B is standard Brownian motion. Let (φ_k)_{k≥1} be an orthonormal basis of L²[0, 1] =: L². The model can be rewritten

Y_k = f_k + n^{−1/2}ε_k,  f_k = ∫₀¹ f(t)φ_k(t)dt,  ε_k ∼ N(0, 1) i.i.d., k ≥ 1.
The likelihood admits a LAN expansion, with η = f here, ‖·‖_L = ‖·‖₂ and R_n = 0:

ℓ_n(f) − ℓ_n(f0) = −(n/2)‖f − f0‖²₂ + √n W(f − f0),

where for any u ∈ L² = H with coefficients u_k = ∫₀¹ u(t)φ_k(t)dt, we set W(u) = Σ_{k≥1} ε_k u_k. In this model, consider the squared L²-norm as a functional of f. Set

ψ(f) = ‖f‖²₂ = ψ(f0) + 2⟨f0, f − f0⟩₂ + ‖f − f0‖²₂,

so that ψ₀⁽¹⁾ = 2f0, ψ₀⁽²⁾h = 2h and r(f, f0) = 0.
The functional has been extensively studied in the frequentist literature, see [6], [23], [37], [25] and
[10] to name but a few, as it is used in many testing problems. The verification of assumption A
and of condition (2.14) is prior-dependent and is considered within the proof of the next Theorem.
Suppose that the true function f0 belongs to the Sobolev class

W_β := {f ∈ L² : Σ_{k≥1} k^{2β}⟨f, φ_k⟩² < ∞}
of order β > 1/4. First, one should note that, while the case β > 1/2 can be treated using the first-order term of the expansion of the functional only (Case A1), the case 1/4 < β < 1/2 requires the conditions from Case A2, as the second-order term cannot be neglected. This is related to the fact that the so-called plug-in property of [7] does not work for β < 1/2. An analysis based on second-order terms as in Theorem 2.1 is thus required. The case β ≤ 1/4 is interesting too, but one obtains a rate slower than 1/√n, see e.g. Cai and Low [10] and references therein, and a BvM result in a strict sense does not hold. Although a BvM-type result can be obtained essentially with the tools developed here, its formulation is more complicated and this case will be treated elsewhere. When β > 1/4, a natural frequentist estimator of ψ(f) is
ψ̄ := ψ̄_n := Σ_{k=1}^{K_n} (Y_k² − 1/n),  with K_n = ⌊n/log n⌋.
Now define a prior Π on f by sampling each coordinate f_k, k ≥ 1, independently in the following way. Given a density ϕ on R and a sequence of positive real numbers (σ_k), set K_n = ⌊n/log n⌋ and

f_k ∼ (1/σ_k) ϕ(·/σ_k) if 1 ≤ k ≤ K_n, and f_k = 0 if k > K_n.   (3.1)

In particular, we focus on the cases where ϕ is either the standard Gaussian density or ϕ(x) = 1l_{[−M,M]}(x)/(2M), M > 0, called respectively Gaussian ϕ and Uniform ϕ.
Suppose that there exists M₀ > 0 such that, for any 1 ≤ k ≤ K_n,

|f_{0,k}|/σ_k ≤ M₀ and σ_k ≥ 1/√n.   (3.2)
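To make the setup concrete, here is a minimal Python sketch (illustrative code, not part of the paper; the values of β and σ_k are assumptions chosen for the example). It simulates the sequence model, forms the estimator ψ̄, and draws from the prior (3.1) with Gaussian ϕ:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10_000
K_n = int(n / np.log(n))                 # K_n = floor(n / log n)
k = np.arange(1, K_n + 1)

# A true signal in W_beta (here beta = 1, an illustrative choice)
f0 = k ** (-1.5)

# Sequence model: Y_k = f_{0,k} + eps_k / sqrt(n)
Y = f0 + rng.standard_normal(K_n) / np.sqrt(n)

# Estimator of psi(f) = ||f||_2^2:  psi_bar = sum_k (Y_k^2 - 1/n)
psi_bar = np.sum(Y ** 2 - 1.0 / n)

# Prior (3.1) with Gaussian phi and sigma_k = k^{-1/4}, which satisfies
# (3.2)-(3.3) for this f0; one prior draw of the first K_n coordinates:
sigma = k ** (-0.25)
f_prior = sigma * rng.standard_normal(K_n)

print(psi_bar, np.sum(f0 ** 2))          # psi_bar estimates ||f0||_2^2
```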
Theorem 3.1. Suppose the true function f0 belongs to the Sobolev space W_β of order β > 1/4. Let the prior Π and K_n be chosen according to (3.1) and let f0, {σ_k} satisfy (3.2). Consider the following choices for ϕ.

1. Gaussian ϕ. Suppose that, as n → ∞,

(1/(n√n)) Σ_{k=1}^{K_n} σ_k^{−2} = o(1).   (3.3)

2. Uniform ϕ. Suppose M > 4 ∨ (16M₀) and that, for any c > 0,

Σ_{k=1}^{K_n} σ_k e^{−cnσ_k²} = o(1).   (3.4)

Then, in P^n_{f0}-probability, as n → ∞,

Π[√n(ψ(f) − ψ̄ − 2K_n/n) | Y^n] ⇝ N(0, 4‖f0‖²₂).   (3.5)
The proof of Theorem 3.1 is given in Section 2.2 of the supplement [18].
Theorem 3.1 gives the BvM theorem for the non-linear functional ψ(f) = ∫f², up to a (known) bias term 2K_n/n. Indeed, it implies that the posterior distribution of ψ(f) − ψ̂_n = ψ(f) − ψ̄ − 2K_n/n is asymptotically Gaussian with mean 0 and variance 4‖f0‖²₂/n, which is the inverse of the efficient information (divided by n). Recall that ψ̄ is an efficient estimator when β > 1/4, see for instance [10]. Therefore, even though the posterior distribution of ψ(f) does not satisfy the BvM theorem per se, it can be modified a posteriori, by recentering with the known quantity 2K_n/n, to lead to a BvM theorem. Whether there exists a Bayesian nonparametric prior leading to a BvM for the functional ‖f‖²₂ without any bias term in general is unclear. However, if we restrict our attention to β > 1/2, a different choice of K_n can be made: in particular, K_n = √n/log n leads to a standard BvM property without bias term.
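For Gaussian ϕ, the recentering can also be seen directly by conjugacy: given Y_k, the coordinate posterior is f_k | Y_k ∼ N(nσ_k²Y_k/(1 + nσ_k²), σ_k²/(1 + nσ_k²)), so when nσ_k² is large the posterior mean of Σ_k f_k² is approximately ΣY_k² + K_n/n = ψ̄ + 2K_n/n. A minimal numerical check (illustrative code, under the same assumed choices as in the sketch above):

```python
import numpy as np

rng = np.random.default_rng(1)

n = 10_000
K_n = int(n / np.log(n))
k = np.arange(1, K_n + 1)
sigma2 = k ** (-0.5)                      # sigma_k = k^{-1/4}

f0 = k ** (-1.5)
Y = f0 + rng.standard_normal(K_n) / np.sqrt(n)

# Conjugate coordinate posteriors f_k | Y_k ~ N(m_k, v_k)
v = sigma2 / (1.0 + n * sigma2)
m = n * sigma2 * Y / (1.0 + n * sigma2)

# Monte Carlo draws of psi(f) = sum_k f_k^2 under the posterior
draws = ((m + np.sqrt(v) * rng.standard_normal((5000, K_n))) ** 2).sum(axis=1)

psi_bar = np.sum(Y ** 2 - 1.0 / n)
print(draws.mean(), psi_bar + 2 * K_n / n)   # these two nearly coincide
```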
Condition (3.2) can be interpreted as an undersmoothing condition: the true function should be at least as 'smooth' as the prior; for a fixed prior, it corresponds to intersecting the Sobolev regularity constraint on f0 with a Hölder-type constraint. It is used to verify the concentration of the posterior (2.7), see Lemma 3 of the supplement [18] (it is used here mostly for simplicity of presentation and can possibly be slightly improved). For instance, if σ_k ≳ k^{−1/4} for all k ≤ K_n, then condition (3.2) is valid for all f0 ∈ W_β with β > 1/4. Conditions (3.3) and (3.4) ensure that the prior is hardly modified by the change of parametrisation (2.11); they are verified in particular for any σ_k ≳ k^{−1/4}.
An interesting phenomenon appears when comparing the two examples of priors considered in Theorem 3.1. If σ_k = k^{−δ} for some δ ∈ R, condition (3.3) holds for any δ ≤ 1/4 in the Gaussian ϕ case, whereas (3.4) only requires δ < 1/2 in the Uniform ϕ case, this for any f0 in W_{1/4} intersected with the Hölder-type space {f0 : |f_{0,k}| ≤ M₀k^{−δ}, k ≥ 1}. One can conclude that fine details of the prior (here, the specific form of ϕ chosen, for given variances {σ_k²}) really matter for BvM to hold in this case. Indeed, it can be checked that the condition for the Gaussian prior is sharp: while the proof of Theorem 3.1 is an application of the general Theorem 2.1, a completely different proof can be given for Gaussian priors using conjugacy, similar in spirit to [35], leading to (3.3) as a necessary condition. Hence, choosing σ_k ≳ k^{−1/4} leads to a posterior distribution satisfying the BvM property adaptively over Sobolev balls with smoothness β > 1/4.
The introduced methodology also allows us to provide conditions under generic smoothness assumptions on ϕ. For instance, if the density ϕ of the prior is a Lipschitz function on R, then the conclusion of Theorem 3.1 holds when, as n → ∞,

(1/n) Σ_{k=1}^{K_n} σ_k^{−1} = o(1).   (3.6)

This last condition is not sharp in general (compare for instance with the sharp condition (3.3) in the Gaussian case), but provides a sufficient condition for a variety of prior distributions, including light- and heavy-tailed behaviours. For instance, if σ_k = k^{−δ}, then (3.6) asks for δ ≤ 0.
4 Application to the density model
The case of functionals of the density is another interesting application of Theorem 2.1. The case of linear functionals of the density was first considered in [40]. Here we obtain a broader version of Theorem 2.1 of [40], which weakens the assumptions in the case of linear functionals and also allows for nonlinear functionals.
4.1 Statement
Let Y^n = (Y1, . . . , Yn) be independent and identically distributed, having density f with respect to Lebesgue measure on the interval [0, 1]. In all of this section, we assume that the true density f0 belongs to the set F0 of all densities that are bounded away from 0 and ∞ on [0, 1]. Let us consider A_n = {f : ‖f − f0‖₁ ≤ ε_n}, where ε_n is a sequence decreasing to 0, or any set of the form A_n ∩ F_n, as long as P^n_0 Π(F_n^c | Y^n) → 0. Define

L²(f0) = {ϕ : [0, 1] → R, ∫₀¹ ϕ(x)²f0(x)dx < ∞}.
For any ϕ in L²(f0), let us write F0(ϕ) as shorthand for ∫₀¹ ϕ(x)f0(x)dx and set, for any positive density f on [0, 1],

η = log f,  η0 = log f0,  h = √n(η − η0).
Following [40], we have the LAN expansion

ℓ_n(η) − ℓ_n(η0) = √n F0(h) + (1/√n) Σ_{i=1}^n [h(Y_i) − F0(h)]
              = −(1/2)‖h‖²_L + W_n(h) + R_n(η, η0),

with the following notation, for any g in L²(f0),

‖g‖²_L = ∫₀¹ (g − F0(g))² f0,  W_n(g) = 𝔾_n g = (1/√n) Σ_{i=1}^n [g(Y_i) − F0(g)],   (4.1)

and R_n(η, η0) = √n F0(h) + (1/2)‖h‖²_L. Note that ‖·‖_L is a Hilbertian norm induced by the inner product ⟨g1, g2⟩_L = ∫g1g2f0, defined on the space H_T := {g ∈ L²(P_{f0}) : ∫gf0 = 0} ⊂ H = L²(f0), the so-called maximal tangent set at f0.
We consider functionals ψ(f) of the density f which are differentiable relative to (a dense subset of) the tangent set H_T, with efficient influence function ψ̃_{f0}, see [42], Chap. 25. In particular ψ̃_{f0} belongs to H_T, so F0(ψ̃_{f0}) = 0. We further assume that ψ̃_{f0} is bounded on [0, 1]. Set

ψ(f) − ψ(f0) = ⟨(f − f0)/f0, ψ̃_{f0}⟩_L + r̃(f, f0)
            = ⟨η − η0 − F0(η − η0), ψ̃_{f0}⟩_L + B(f, f0) + r̃(f, f0),  η = log f,   (4.2)

where B(f, f0) is the difference

B(f, f0) = ∫₀¹ (η − η0 − (f − f0)/f0)(x) ψ̃_{f0}(x) f0(x) dx,

and define r(f, f0) = B(f, f0) + r̃(f, f0).
Theorem 4.1. Let ψ be a differentiable functional relative to the tangent set H_T, with efficient influence function ψ̃_{f0} bounded on [0, 1]. Let r̃ be defined by (4.2). Suppose that for some ε_n → 0 it holds that

Π[f : ‖f − f0‖₁ ≤ ε_n | Y^n] → 1,   (4.3)

in P0-probability, and that, for A_n = {f : ‖f − f0‖₁ ≤ ε_n},

sup_{f∈A_n} r̃(f, f0) = o(1/√n).

Set η_t = η − (t/√n)ψ̃_{f0} − log ∫₀¹ e^{η − (t/√n)ψ̃_{f0}} and assume that, in P0-probability,

∫_{A_n} e^{ℓ_n(η_t)−ℓ_n(η0)} dΠ(η) / ∫_{A_n} e^{ℓ_n(η)−ℓ_n(η0)} dΠ(η) → 1.   (4.4)

Then, for ψ̂ any linear efficient estimator of ψ(f), the BvM theorem holds for the functional ψ. That is, the posterior distribution of √n(ψ(f) − ψ̂) is asymptotically Gaussian with mean 0 and variance ‖ψ̃_{f0}‖²_L, in P0-probability.
The semiparametric efficiency bound for estimating ψ is ‖ψ̃_{f0}‖²_L, and linear efficient estimators of ψ are those for which ψ̂ = ψ(f0) + 𝔾_n(ψ̃_{f0})/√n + o_p(1/√n), see e.g. van der Vaart [42], Chap. 25, so Theorem 4.1 yields the BvM theorem (with best possible limit distribution).
Remark 3. The L¹-distance between densities in Theorem 4.1 can be replaced by the Hellinger distance h(·,·), up to replacing ε_n by ε_n/√2.
Theorem 4.1 is proved in Section 6 and is deduced from Theorem 2.1 with ψ₀⁽²⁾ = 0 and ψ₀⁽¹⁾ = ψ̃_{f0} + t^{−1}√n log ∫₀¹ e^{η − (t/√n)ψ̃_{f0}}. The condition sup_{f∈A_n} r̃(f, f0) = o(1/√n), together with (4.3), implies Assumption A. It improves on Theorem 2.1 of [40] in the sense that an L¹ posterior concentration rate is required instead of a posterior concentration rate in terms of the LAN norm ‖·‖_L; it is also a generalisation to approximately linear functionals, which include the following examples.
Example 4.1 (Linear functionals). Let ψ(f) = ∫₀¹ f(x)a(x)dx, for some bounded function a. Then, writing ∫ as shorthand for ∫₀¹,

ψ(f) − ψ(f0) = ⟨(f − f0)/f0, a − ∫af0⟩_L,

with efficient influence function ψ̃_{f0} = a − ∫af0. In this case, r̃(f, f0) = 0.
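To verify the display, note that F0((f − f0)/f0) = ∫(f − f0) = 0, so (f − f0)/f0 belongs to the tangent set and ⟨(f − f0)/f0, a − ∫af0⟩_L = ∫(f − f0)(a − ∫af0) = ∫a(f − f0) = ψ(f) − ψ(f0); the expansion is thus exact and the remainder vanishes.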
Example 4.2 (Entropy functional). Let ψ(f) = ∫₀¹ f(x) log f(x)dx, for f bounded away from 0 and infinity. Then

ψ(f) − ψ(f0) = ⟨(f − f0)/f0, log f0 − ∫f0 log f0⟩_L + ∫f log(f/f0),

with efficient influence function ψ̃_{f0} = log f0 − ∫f0 log f0. In this case, r̃(f, f0) = ∫f log(f/f0). For the two types of priors considered below, under some smoothness assumptions on f0, it holds that sup_{f∈A_n} r̃(f, f0) = o(1/√n).
Example 4.3 (Square-root functional). Let ψ(f) = ∫₀¹ √f(x) dx, for f a bounded density. Then

ψ(f) − ψ(f0) = ⟨(f − f0)/f0, (1/2)(1/√f0 − ∫√f0)⟩_L + (1/2) ∫ ((f − f0)/√f0) · ((√f0 − √f)/(√f0 + √f)),

with efficient influence function ψ̃_{f0} = (1/2)(1/√f0 − ∫√f0). In this case, r̃(f, f0) = −∫ (√f0 − √f)²/(2√f0). In particular, the remainder term of the functional expansion is bounded by a constant times the square of the Hellinger distance between densities; hence, as soon as ε_n²√n = o(1), if A_n is written in terms of h, see Remark 3, one has sup_{f∈A_n} r̃(f, f0) = o(1/√n).
Example 4.4 (Power functional). Let ψ(f) = ∫₀¹ f(x)^q dx, for f a bounded density and q ≥ 2 an integer. Then

ψ(f) − ψ(f0) = ⟨(f − f0)/f0, qf0^{q−1} − q∫f0^q⟩_L + r̃(f, f0).

The remainder r̃(f, f0) is a sum of terms of the form ∫(f − f0)^{2+s} f0^{q−2−s}, for 0 ≤ s ≤ q − 2 an integer. For the two types of priors considered below, sup_{f∈A_n} r̃(f, f0) = o(1/√n), under some smoothness assumptions on f0.
We now consider two families of priors: random histograms and Gaussian process priors. For each family, we provide a key no-bias condition for BvM on functionals to be valid. For each, the idea is based on a certain functional change-of-variables formula. To simplify notation, we write ψ̃ = ψ̃_{f0} in the sequel.
4.2 Random histograms
For any k ∈ N*, consider the partition of [0, 1] defined by I_j = [(j − 1)/k, j/k) for j = 1, . . . , k. Denote by

H_k = {g ∈ L²[0, 1] : g(x) = Σ_{j=1}^k g_j 1l_{I_j}(x), g_j ∈ R, j = 1, . . . , k}

the set of all regular histograms with k bins on [0, 1]. Let S_k = {ω ∈ [0, 1]^k : Σ_{j=1}^k ω_j = 1} be the unit simplex in R^k and denote by H_k¹ the subset of H_k consisting of histograms which are densities on [0, 1]:

H_k¹ = {f ∈ L²[0, 1] : f(x) = f_{ω,k}(x) = k Σ_{j=1}^k ω_j 1l_{I_j}(x), (ω1, . . . , ω_k) ∈ S_k}.
A prior on H_k¹ is completely specified by the distributions of k and of (ω1, . . . , ω_k) given k. Conditionally on k, we consider a Dirichlet prior on ω = (ω1, . . . , ω_k):

ω ∼ D(α_{1,k}, . . . , α_{k,k}),  c₁k^{−a} ≤ α_{j,k} ≤ c₂,   (4.5)

for some fixed constants a, c₁, c₂ > 0 and any 1 ≤ j ≤ k.
Consider two situations: either a deterministic number of bins, with k = K_n = o(n), or, for π_k a distribution on the positive integers,

k ∼ π_k,  e^{−b₁k log k} ≤ π_k(k) ≤ e^{−b₂k log k},   (4.6)

for all k large enough and some 0 < b₂ < b₁ < ∞. Condition (4.6) is verified for instance by the Poisson distribution, which is commonly used in Bayesian nonparametric models, see for instance [3].
The set H_k is a closed subspace of L²[0, 1]. For any function h in L²[0, 1], consider its projection h_{[k]} in the L²-sense onto H_k. It holds that

h_{[k]} = k Σ_{j=1}^k (∫_{I_j} h) 1l_{I_j}.

Lemma 4 in the supplement [18] gathers useful properties of histograms.
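As a concrete illustration (hypothetical code, not from the paper; the Poisson choice for k and the test density f0 are assumptions made for this example), one can draw a random histogram from (4.5)-(4.6) and compute the projection h_{[k]} on a fine grid:

```python
import numpy as np

rng = np.random.default_rng(2)

def draw_histogram_prior(lam=5.0, a=1.0, c1=1.0):
    """One draw f_{omega,k} from (4.5)-(4.6): k from a Poisson-type law
    (whose tails are as in (4.6)), then omega | k ~ Dirichlet(alpha)."""
    k = 1 + rng.poisson(lam)
    alpha = np.full(k, c1 * k ** (-a))      # weights c1 * k^{-a}, as in (4.5)
    omega = rng.dirichlet(alpha)
    return k, omega                          # the density equals k * omega_j on I_j

def project(h, k, m=1000):
    """L2 projection onto H_k: h_[k] takes the value k * int_{I_j} h on I_j,
    i.e. the average of h over the bin; computed here by a Riemann sum."""
    x = (np.arange(k * m) + 0.5) / (k * m)
    return h(x).reshape(k, m).mean(axis=1)

k, omega = draw_histogram_prior()
f0 = lambda x: 1.0 + 0.5 * np.cos(2 * np.pi * x)   # a density on [0, 1]
print(k, project(f0, k)[:5])
```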
Let the functional ψ satisfy (4.2) with bounded efficient influence function ψ̃_{f0} = ψ̃ ≠ 0 and set, for k ≥ 1,

ψ̂_k = ψ(f_{0[k]}) + 𝔾_n ψ̃_{[k]}/√n,  V_k = ‖ψ̃_{[k]}‖²_L,
ψ̂ = ψ(f0) + 𝔾_n ψ̃/√n,  V = ‖ψ̃‖²_L,   (4.7)

with ‖·‖_L, 𝔾_n as in (4.1). Finally, for n ≥ 2, k ≥ 1, M > 0, denote

A_{n,k}(M) = {f ∈ H_k¹ : h(f, f_{0,[k]}) ≤ M ε_{n,k}},  with ε²_{n,k} = (k log n)/n.   (4.8)
(4.8)
In Section 6.3, we shall see that the posterior distribution of k concentrates on a deterministic
subset Kn of {1, · · · , bn/(log n)2 c} and that under the following technical condition on the weights,
as n → ∞,
k
X
√
(4.9)
sup
αj,k = o( n),
k∈Kn j=1
the conditional posterior distribution given k, concentrates on the sets An,k (M ). It can then be
checked that
i
h√
Π n(ψ − ψ̂) ≤ z|Y n
h√
i
X
√
Π[k | Y n ]Π n(ψ − ψ̂k ) ≤ z + n(ψ̂ − ψ̂k )|Y n , k + op (1)
=
k∈Kn
=
X
Π[k | Y n ]Φ((z +
√
p
n(ψ̂ − ψ̂k ))/ Vk ) + op (1).
k∈Kn
The last line expresses that the posterior is asymptotically close to a mixture of normals, and that the mixture reduces to the target law N(0, V) if V_k goes to V and √n(ψ̂ − ψ̂_k) to 0, uniformly for k in 𝒦_n. The last quantity can also be rewritten

√n(ψ̂_k − ψ̂) = √n(ψ(f_{0[k]}) − ψ(f0)) + 𝔾_n(ψ̃_{[k]} − ψ̃)
            = √n ∫(ψ̃ − ψ̃_{[k]})(f_{0[k]} − f0) + 𝔾_n(ψ̃_{[k]} − ψ̃) + o(1).

It is thus natural to ask for, and this is satisfied in most examples, see below,

max_{k∈𝒦_n} |‖ψ̃_{[k]}‖²_L − ‖ψ̃‖²_L| = o_p(1) and max_{k∈𝒦_n} 𝔾_n(ψ̃_{[k]} − ψ̃) = o_p(1).   (4.10)
This leads to the next Proposition, proved in Section 6.
Proposition 1. Let f0 belong to F0 and let the prior Π be defined by (4.5)–(4.9). Let the prior π_k be either the Dirac mass at k = K_n ≤ n/(log n)², or the law given in (4.6). Let 𝒦_n be a subset of {1, 2, . . . , n/(log n)²} such that Π(𝒦_n | Y^n) = 1 + o_p(1).
Consider estimating a functional ψ(f), with r̃ in (4.2), verifying (4.10) and, for any M > 0, with A_{n,k}(M) defined in (4.8),

sup_{k∈𝒦_n} sup_{f∈A_{n,k}(M)} √n r̃(f, f0) = o_p(1),   (4.11)

as n → ∞. Additionally, suppose

max_{k∈𝒦_n} |√n ∫(ψ̃ − ψ̃_{[k]})(f_{0[k]} − f0)| = o(1).   (4.12)

Then the BvM theorem for the functional ψ holds.
The core condition is (4.12), which can be seen as a no-bias condition. Condition (4.11) controls the remainder term of the expansion of ψ(f) around f0. Condition (4.10) is satisfied under very mild conditions: for its first part, it is enough that inf 𝒦_n goes to ∞ with n. For the second part, barely more than this typically suffices, using a simple empirical process argument, see Section 6.
The next theorem investigates the previous conditions under deterministic and random priors on k, for the functionals of Examples 4.1 to 4.4.
Theorem 4.2. Suppose f0 ∈ C^β, with β > 0. Let two priors Π₁, Π₂ be defined by (4.5)–(4.9), with the prior on k either the Dirac mass at k = K_n = ⌊n^{1/2}(log n)^{−2}⌋ for Π₁, or k ∼ π_k given by (4.6) for Π₂. Then

• Example 4.1, linear functionals ψ(f) = ∫af, under the prior Π₁ with deterministic k = K_n:
– if a(·) ∈ C^γ with γ + β > 1 for some γ > 0, then the BvM theorem holds for the functional ψ(f);
– if a(·) = 1l_{·≤z} for z ∈ [0, 1], then BvM holds for the functional ∫1l_{·≤z} f = F(z), the cumulative distribution function of f at z.

• Examples 4.2-4.3-4.4. For all β > 1/2, the BvM theorem holds for ψ(f), for both priors Π₁ (deterministic k) and Π₂ (random k).
Theorem 4.2 is proved in Section 6.3. From this proof it may be noted that different choices of K_n in some range lead to similar results for some examples. For instance, if ψ(f) = ∫ψf and ψ ∈ C^γ, choosing K_n = ⌊n/(log n)²⌋ implies that the BvM holds for all γ + β > 1/2.
Obtaining BvM in the case of a prior with random k in Example 4.1 is case-dependent. The answer lies in the respective approximation properties of both f0 and ψ̃_{f0} through the prior (note that a random-k prior typically adapts to the regularity of f0), and the no-bias condition (4.12) may not be satisfied if inf 𝒦_n is not large enough.
We present below a counterexample where BvM is proved to fail for a large class of true densities f0 when a prior with random k is chosen.
4.3 A semiparametric curse of adaptation: a counterexample for BvM under random-number-of-bins histogram priors
Consider a C¹, strictly increasing true function f0, say

f0′ ≥ ρ > 0 on [0, 1].   (4.13)
The following reasoning can be extended to any approximately monotone smooth function on [0, 1]. Consider estimation of the linear functional ψ(f) = ∫ψf. The BvM theorem is not satisfied if the bias term √n(ψ̂ − ψ̂_k) is predominant for all k's which are asymptotically given mass under the posterior. This will happen if, for all such k's,

−b_{n,k} = √n ∫ψ(f0 − f_{0[k]}) = √n ∫(ψ − ψ_{[k]})(f0 − f_{0[k]}) ≫ 1,

as n → ∞. To simplify the presentation, we restrict ourselves to the case of dyadic random histograms; in other words, the prior on k only puts mass on values k = 2^p, p ≥ 0. Then define ψ(x) as, for α > 0,

ψ(x) = Σ_{l≥0} Σ_{j=0}^{2^l−1} 2^{−l(1/2+α)} ψ^H_{lj}(x),   (4.14)

where ψ^H_{lj}(x) = 2^{l/2}ψ_{00}(2^l x − j) and ψ_{00}(x) = −1l_{[0,1/2]}(x) + 1l_{(1/2,1]}(x) is the mother wavelet of the Haar basis (we omit the scaling function 1l in the definition of ψ).
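The function (4.14) and the bias appearing below are easy to reproduce numerically. A minimal Python sketch (illustrative code, not from the paper; the choice of f0 and the truncation level of the series are assumptions made for the example):

```python
import numpy as np

def haar_psi(x, alpha, L):
    """psi(x) = sum_{l=0}^{L} sum_j 2^{-l(1/2+alpha)} psi^H_{lj}(x),
    the series in (4.14) truncated at level L."""
    out = np.zeros_like(x)
    for l in range(L + 1):
        u = 2 ** l * x - np.floor(2 ** l * x)       # position inside the support
        mother = np.where(u <= 0.5, -1.0, 1.0)       # psi_00 evaluated at u
        out += 2 ** (-l * (0.5 + alpha)) * 2 ** (l / 2.0) * mother
    return out

def bias(alpha, p, n, L=14, m=2 ** 16):
    """-b_{n,K} = sqrt(n) * int_0^1 f0 (psi - psi_[K]) with K = 2^p, by Riemann
    sum; the full series is truncated at a level L much larger than p."""
    x = (np.arange(m) + 0.5) / m
    f0 = 0.5 + x                                     # increasing density, f0' = 1
    return np.sqrt(n) * np.mean(f0 * (haar_psi(x, alpha, L) - haar_psi(x, alpha, p)))

print(bias(alpha=0.25, p=4, n=10 ** 6))              # of order sqrt(n) * K^{-alpha-1}
```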
Proposition 2. Let f0 be any function as in (4.13) and α, ψ as in (4.14). Let the prior be as in Theorem 4.2. Then there exists k₁ > 0 such that

Π[k < k₁(n/log n)^{1/3} | Y^n] = 1 + o_P(1),

and for all p ∈ N such that 2^p := K < k₁(n/log n)^{1/3}, the conditional posterior distribution of √n(ψ(f) − ψ̂ − b_{n,K})/√V_K given k = K converges in distribution to N(0, 1), in P^n_0-probability, with

b_{n,K} ≲ −√n K^{−α−1}.

In particular, the BvM property does not hold if α < 1/2.
Remark 4. For the considered f0, it can be checked that the posterior in fact concentrates on values of k such that k ≍ (n/log n)^{1/3}.

As soon as the regularities of the functional ψ(f) to be estimated and of the true function f0 are fairly different, taking an adaptive prior (with respect to f) can have disastrous effects, with a non-negligible bias appearing in the centering of the posterior distribution. As in the counterexample in Rivoirard and Rousseau [40], the BvM is ruled out because the posterior distribution concentrates on values of k that are too small and for which the bias b_{n,k} is not negligible. Note that for each of these functionals the BvM is violated for a large class of true densities f0. Some related phenomena in terms of rates are discussed in Knapik et al. [34] for linear functionals and adaptive priors in white noise inverse problems.
Let us sketch the proof of Proposition 2. It is not difficult to show (see the Appendix) that, since f0 ∈ C¹, the posterior concentrates on the set {f : ‖f − f0‖₁ ≤ M(n/log n)^{−1/3}, k ≤ k₁(n/log n)^{1/3}}, for some positive M and k₁. Since Haar wavelets are special cases of (dyadic) histograms, for any K ≥ 1 the best approximation of ψ within H_K is

ψ_{[K]}(x) = Σ_{l=0}^{p} Σ_{j=0}^{2^l−1} 2^{−l(1/2+α)} ψ^H_{lj}(x).

The semiparametric bias −b_{n,K} is equal to √n ∫₀¹ (f0 − f_{0,[K]})(ψ − ψ_{[K]}) = √n ∫₀¹ f0(ψ − ψ_{[K]}), which can be written, for any K ≥ 1,

−b_{n,K} = √n Σ_{l>p} Σ_{j=0}^{2^l−1} 2^{−l(1/2+α)} ∫₀¹ f0(x)ψ^H_{lj}(x)dx
        = √n Σ_{l>p} Σ_{j=0}^{2^l−1} 2^{−lα} ∫_{2^{−l}j}^{2^{−l}(j+1/2)} (f0(x + 2^{−l}/2) − f0(x))dx
        ≳ √n Σ_{l>p} 2^{−lα} 2^l 2^{−2l} ≳ √n K^{−α−1}.

Since Π(k ≤ n^{1/3} | Y^n) = 1 + o_p(1), we have that inf_{k≤n^{1/3}} (−b_{n,k}) → +∞ for all α < 1/2. Also, the sequence of real numbers {V_k}_{k≥1} stays bounded, while the supremum sup_{1≤k≤n^{1/3}} |𝔾_n(ψ̃ − ψ̃_{[k]})| is bounded by a constant times (log n)^{1/2} in probability, by a standard empirical process argument. This implies that

E^Π[e^{t√n(ψ(f)−ψ̂)} | Y^n, B_n] = (1 + o(1)) Σ_{k∈𝒦_n} e^{t²V_k/2 + t√n(ψ̂_k−ψ̂)} Π[k | Y^n] = o_p(1),

so that the posterior distribution is not asymptotically equivalent to N(0, ‖ψ̃‖²_L), and there exists M_n going to infinity such that

Π[√n|ψ(f) − ψ̂| > M_n | Y^n] = 1 + o_p(1).
4.4 Gaussian process priors
We now investigate the implications of Theorem 4.1 in the case of Gaussian process priors for the density f. Consider as a prior on f the distribution on densities generated by

f(x) = e^{W(x)} / ∫₀¹ e^{W(t)}dt,   (4.15)

where W is a zero-mean Gaussian process indexed by [0, 1] with continuous sample paths. The process W can also be viewed as a random element in the Banach space B of continuous functions on [0, 1] equipped with the sup-norm ‖·‖_∞, see [44] for precise definitions. We refer to [44], [43] and [11] for basic definitions on Gaussian priors and some convergence properties, respectively. Let K(x, y) = E[W(x)W(y)] denote the covariance kernel of the process and let (H, ‖·‖_H) denote the reproducing kernel Hilbert space (RKHS) of W.
Example 4.5 (Brownian motion released at 0). Consider the distribution induced by

W(x) = N + B_x,  x ∈ [0, 1],

where B_x is standard Brownian motion and N is an independent N(0, 1) variable. We use it as the distribution of W in (4.15). It can be seen, see [43], as a random element in the Banach space B = (C⁰, ‖·‖_∞) and its RKHS is

H^B = {c + ∫₀^· g(u)du : c ∈ R, g ∈ L²[0, 1]},

a Hilbert space with norm given by ‖c + ∫₀^· g(u)du‖²_{H^B} = c² + ∫₀¹ g(u)²du.
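A draw from the prior (4.15), with W the Brownian motion released at 0 of this example, can be sketched as follows (illustrative code on a discretisation grid, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_density(m=1000):
    """One draw of f = e^W / int e^W on a grid of [0, 1], with W = N + B_x."""
    dx = 1.0 / m
    # Brownian motion via cumulated Gaussian increments, released at N ~ N(0, 1)
    W = rng.standard_normal() + np.cumsum(np.sqrt(dx) * rng.standard_normal(m))
    eW = np.exp(W)
    return eW / (eW.sum() * dx)            # normalise so that f integrates to 1

f = sample_density()
print(f[:5], f.sum() / 1000)               # the second printed value is ~1
```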
Example 4.6 (Riemann-Liouville-type processes). Consider the distribution induced by, for α > 0 and x ∈ [0, 1],

W^α(x) = Σ_{k=0}^{⌊α⌋+1} Z_k x^k + ∫₀^x (x − s)^{α−1/2} dB_s,

where the Z_k are independent standard normal variables and B is an independent Brownian motion. The RKHS H^α of W^α can be obtained explicitly from that of Brownian motion, and is nothing but a Sobolev space of order α + 1/2, see [43], Theorem 4.1.
The concentration function of the Gaussian process in B at η0 = log f0 is defined, for any ε > 0, by, see [44],

φ_{η0}(ε) = −log Π(‖W‖_∞ ≤ ε) + inf_{h∈H : ‖h−η0‖_B < ε} (1/2)‖h‖²_H.

In van der Vaart and van Zanten [43], it is shown that the posterior contraction rate for such a prior is closely connected to a solution ε_n of

φ_{η0}(ε_n) ≤ nε_n²,  η0 = log f0.   (4.16)
Proposition 3. Suppose f0 verifies c₀ ≤ f0 ≤ C₀ on [0, 1], for some positive c₀, C₀. Let the prior Π on f be induced via a Gaussian process W as in (4.15) and let H denote its RKHS. Let ε_n → 0 verify (4.16). Consider estimating a functional ψ(f), with r̃ in (4.2) verifying

sup_{f∈A_n} r̃(f, f0) = o(1/√n),

for A_n such that Π(A_n | Y^n) = 1 + o_p(1) and A_n ⊂ {f : h(f, f0) ≤ ε_n}. Suppose that ψ̃_{f0} is continuous and that there exist a sequence ψ_n ∈ H and ζ_n → 0 such that

‖ψ_n − ψ̃_{f0}‖_∞ ≤ ζ_n, and ‖ψ_n‖_H ≤ √n ζ_n,   (4.17)

√n ε_n ζ_n → 0.   (4.18)

Then, for ψ̂ any linear efficient estimator of ψ(f), in P^n_0-probability, the posterior distribution of √n(ψ(f) − ψ̂) converges to a Gaussian distribution with mean 0 and variance ‖ψ̃_{f0}‖²_L, and the BvM theorem holds.
The proof is presented in Section 3.2 of the supplement [18]. We now investigate conditions
(4.17)-(4.18) for examples of Gaussian priors.
Theorem 4.3. Suppose that η0 = log f0 belongs to C^β, for some β > 0. Let Π₁, Π₂ be priors defined from a Gaussian process W via (4.15): for Π₁, we take W to be Brownian motion (released at 0), and for Π₂ we take W = W^α, a Riemann-Liouville-type process of parameter α > 0.

• Example 4.1, linear functionals ψ(f) = ∫af:
– if a(·) ∈ H^B, then the BvM theorem holds for the functional ψ(f) and prior Π₁. The same holds if a(·) ∈ H^α for prior Π₂;
– if a(·) ∈ C^μ, μ > 0, the BvM property holds for prior Π₂ if

α ∧ β > 1/2 + (α − μ) ∨ 0.

• Examples 4.3-4.4. Under the same condition as for the linear functional with μ = β, the BvM theorem holds for Π₂.
An immediate illustration of Theorem 4.3 is as follows. Consider the prior Π₁ built from Brownian motion. Then, for all linear functionals

ψ(f) = ∫₀¹ x^r f(x)dx,  r > 1/2,

the BvM theorem holds. Indeed, x → x^r, r > 1/2, belongs to H^B.
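Indeed, x^r = 0 + ∫₀^x ru^{r−1}du, so ‖x^r‖²_{H^B} = 0² + ∫₀¹ r²u^{2r−2}du = r²/(2r − 1), which is finite precisely when r > 1/2.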
To prove Theorem 4.3, one applies Proposition 3: it is enough to compute bounds for ε_n and ζ_n. This follows from the results on the concentration function for Riemann-Liouville-type processes obtained in Theorem 4 of [11]. For linear functionals ψ(f) = ∫af and a ∈ C^μ, one can take ε_n = n^{−(α∧β)/(2α+1)} and ζ_n = n^{−μ/(2α+1)}, up to some logarithmic factors. So (4.18) holds if α ∧ β > 1/2 + (α − μ) ∨ 0.
The square-root functional is similar to a linear functional with μ = β, since the remainder term in the expansion of the functional is of the order of the Hellinger distance. Indeed, since f0 is bounded away from 0 and ∞, the fact that η0 ∈ C^β implies that f0 ∈ C^β and √f0 ∈ C^β. For power functionals, the remainder term r(f, f0) is more complicated but is easily bounded by a linear combination of terms of the type

∫(f − f0)^{2+r} f0^{q−2−r} ≤ ‖f0‖_∞^{q−r−2} ‖f − f0‖_∞^r ∫(f − f0)².

Using Proposition 1 in the supplement [18], one obtains that, under the posterior distribution, ‖f − f0‖_∞ ≲ 1 and ‖f − f0‖₂ ≲ ε_n. So √n r(f, f0) = o(1) holds if √n ε_n² = o(1), which is the case since α ∧ β > 1/2.
5 Application to the nonlinear autoregressive model
Consider an autoregressive model in which one observes Y₁, . . . , Y_n given by

Y_{i+1} = f(Y_i) + ε_i,  ε_i ∼ N(0, 1) i.i.d.,   (5.1)

where ‖f‖_∞ ≤ L for a fixed given positive constant L and f belongs to a Hölder space C^β, β > 0. This example has been studied in particular in [29], and it is known that (Y_i, i = 1, . . . , n) is a homogeneous Markov chain and that, under these assumptions, for all f there exists a unique stationary distribution Q_f with density q_f with respect to Lebesgue measure. The transition density is p_f(y|x) = φ(y − f(x)). Denoting r(y) = (φ(y − L) + φ(y + L))/2, the transition density satisfies p_f(y|x) ≲ r(y) for all x, y ∈ R. Following [29], define the norms, for any s ≥ 2,

‖f − f0‖_{s,r} = (∫_R |f(x) − f0(x)|^s r(x)dx)^{1/s}.
As in [29], we consider a prior Π on f based on piecewise constant functions. Let us set a_n = b√(log n), where b > 0, and consider functions f of the form

f(x) := f_{ω,k}(x) = Σ_{j=0}^{k−1} ω_j 1l_{I_j}(x),  I_j = a_n([j/k, (j+1)/k] − 1/2).

A prior on k and on ω = (ω₀, . . . , ω_{k−1}) is then specified as follows. First draw k ∼ π_k, for π_k a law on the integers. Given k, the law of ω | k is supposed to have a Lebesgue density π_{ω|k} with support [−M, M]^k for some M > 0. Assume further that these laws satisfy, for 0 < c₂ ≤ c₁ < ∞ and C₁, C₂ > 0,

e^{−c₁K log K} ≤ π_k[k > K] ≤ e^{−c₂K log K}, for K large,   (5.2)
e^{−C₂k log k} ≲ π_{ω|k}(ω) ≤ C₁,  for all ω ∈ [−M, M]^k.

We consider the squared-weighted-L² norm functional ψ(f) = ∫_R f²(y)q_f(y)dy. As before, define

k_n(β) = ⌊(n/log n)^{1/(2β+1)}⌋,  ε_n(β) = (n/log n)^{−β/(2β+1)}.

For all bounded f0 and all k > 0, define

ω̃⁰_{[k]} = (ω̃₁⁰, . . . , ω̃_k⁰),  ω̃_j⁰ = ∫_{I_j} f0(x)q_{f0}(x)dx / ∫_{I_j} q_{f0}(x)dx;

these are the weights of the projection of f0 onto the weighted space L²(q_{f0}). We then have the following sufficient condition for the BvM to be valid.
Theorem 5.1. Consider the autoregressive model (5.1) and the prior (5.2). Assume that f0 ∈ C^β, with β > 1/2 and ‖f0‖_∞ < L, and assume that π_{ω|k} satisfies, for all t > 0 and all M₀ > 0,

sup_{‖ω − ω̃⁰_{[k]}‖_{2,r} ≤ M₀ε_n(β)} | π_{ω|k}(ω − tω̃⁰_{[k]}/√n) / π_{ω|k}(ω) − 1 | = o(1).   (5.3)

Then the posterior distribution of √n(ψ(f) − ψ̂) is asymptotically Gaussian with mean 0 and variance V₀, where

ψ̂ = ψ(f0) + (2/n) Σ_{i=1}^n ε_i f0(Y_{i−1}) + o_p(n^{−1/2}),  V₀ = 4‖f0‖²_{2,q_{f0}},

and the BvM is valid under the distribution associated with f0 and any initial distribution ν on R.
Theorem 5.1 is proved in Section 4 of the supplement [18]. The conditions (5.2) and (5.3) on the prior are satisfied in particular when k ∼ P(λ) and when, given k, the law of ω | k is the independent product of k laws U(−M, M). Theorem 5.1 is an application of the general Theorem 2.1, with A_n = {f_{ω,k} : k ≤ k₁k_n(β), ‖ω − ω̃⁰_{[k]}‖_{2,r} ≤ M₀ε_n(β)} and Assumption A implied by β > 1/2. Condition (5.3) is used to prove condition (2.13).
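For intuition, the model (5.1) and the functional ψ(f) = ∫f²q_f are easy to simulate: since q_f is the stationary density, ψ(f) is the long-run average of f²(Y_i) by ergodicity. A minimal sketch (illustrative code; the particular link function f0 is an assumption made for the example):

```python
import numpy as np

rng = np.random.default_rng(4)

def simulate_ar(f, n, y0=0.0):
    """Simulate Y_{i+1} = f(Y_i) + eps_i with eps_i ~ N(0, 1) i.i.d., as in (5.1)."""
    Y = np.empty(n)
    y = y0
    for i in range(n):
        y = f(y) + rng.standard_normal()
        Y[i] = y
    return Y

L = 1.0
f0 = lambda y: 0.9 * L * np.tanh(y)     # smooth, bounded, ||f0||_inf < L

Y = simulate_ar(f0, n=200_000)
print(np.mean(f0(Y) ** 2))              # approximates psi(f0) = int f0^2 q_{f0}
```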
6 Proofs

6.1 Proof of Theorem 2.1
Let the set A_n be as in Assumption A. Set

I_n := E^Π[e^{t√n(ψ(η)−ψ(η0))} | Y^n, A_n].
For the sake of conciseness, we prove the result in the case where ψ₀⁽²⁾ ≠ 0, since the other case is a simpler version of it. Using the LAN expansion (2.3) together with the expansion (2.4) of the functional ψ, one can write

I_n = ∫_{A_n} e^{√n t[⟨ψ₀⁽¹⁾, η−η0⟩_L + (1/2)⟨ψ₀⁽²⁾(η−η0), η−η0⟩_L] + ℓ_n(η) − ℓ_n(η0) + t√n r(η,η0)} dΠ(η) / ∫_{A_n} e^{−(n/2)‖η−η0‖²_L + √n W_n(η−η0) + R_n(η,η0)} dΠ(η).
Consider, for any real number t, η_t as defined in (2.11),

η_t = η − tψ₀⁽¹⁾/√n − tψ₀⁽²⁾(η − η0)/(2√n) − tψ₀⁽²⁾w_n/(2n).
Then, using (2.9)-(2.10) in Assumption A, on A_n,

ℓ_n(η_t) − ℓ_n(η0) − (ℓ_n(η) − ℓ_n(η0))
= −(n/2)(‖η_t − η0‖²_L − ‖η − η0‖²_L) + √n⟨w_n, η_t − η⟩_L + R_n(η_t, η0) − R_n(η, η0) + o_P(1)
= −t⟨w_n, ψ₀⁽¹⁾ + ψ₀⁽²⁾w_n/(2√n)⟩_L − (t²/2)‖ψ₀⁽¹⁾ + ψ₀⁽²⁾w_n/(2√n)‖²_L + √n t⟨ψ₀⁽¹⁾, η − η0⟩_L
  + (t√n/2)⟨ψ₀⁽²⁾(η − η0), η − η0⟩_L + R_n(η_t, η0) − R_n(η, η0) + o_P(1).
One deduces that on A_n, from (2.12) in Assumption A,

√n t[⟨ψ₀⁽¹⁾, η − η0⟩_L + (1/2)⟨ψ₀⁽²⁾(η − η0), η − η0⟩_L] + ℓ_n(η) − ℓ_n(η0) + √n t r(η, η0)
= ℓ_n(η_t) − ℓ_n(η0) + t⟨w_n, ψ₀⁽¹⁾ + ψ₀⁽²⁾w_n/(2√n)⟩_L + (t²/2)‖ψ₀⁽¹⁾ + ψ₀⁽²⁾w_n/(2√n)‖²_L + o_P(1).
We can then rewrite I_n as

I_n = e^{o_P(1) + (t²/2)‖ψ₀⁽¹⁾ + ψ₀⁽²⁾w_n/(2√n)‖²_L + t⟨w_n, ψ₀⁽¹⁾ + ψ₀⁽²⁾w_n/(2√n)⟩_L} · ∫_{A_n} e^{ℓ_n(η_t)−ℓ_n(η0)} dΠ(η) / ∫_{A_n} e^{ℓ_n(η)−ℓ_n(η0)} dΠ(η),

and Theorem 2.1 is proved using condition (2.14), together with the fact that, see Section 1 of the supplement [18], convergence of Laplace transforms for all t in probability implies convergence in distribution in probability.

6.2 Proof of Theorem 4.1
One can define ψ₀⁽¹⁾ = ψ̃_{f0} + c for any constant c, since the inner product associated with the LAN norm corresponds to recentered quantities. In particular, for all η = log f,

⟨ψ̃_{f0} + c, η − η0⟩_L = ∫(ψ̃_{f0} − P_{f0}ψ̃_{f0})(η − η0)f0,  ‖ψ̃_{f0} + c‖_L = ‖ψ̃_{f0}‖_L.
To check Assumption A, let us write

ψ₀⁽¹⁾ = ψ̃_{f0} + (√n/t) log( ∫₀¹ e^{η − (t/√n)ψ̃_{f0}}(x) dx ),   (6.1)

which depends on η but is of the form ψ̃_{f0} + c, see also Remark 2, and we study √n t r(η, η0) + R_n(η, η0) − R_n(η_t, η0) using the calculations of [40], pages 1504-1505. Indeed, writing h = √n(η − η0), we have

R_n(η, η0) − R_n(η_t, η0) = t⟨h, ψ̃_{f0}⟩_L − (t²/2)‖ψ̃_{f0}‖²_L + n log F[e^{−tψ̃_{f0}/√n}],
and expanding the last term as on page 1506 of [40] we obtain that

n log F[e^{−tψ̃_{f0}/√n}] = n log( 1 − (t/n)⟨h, ψ̃_{f0}⟩_L − (t/√n)B(f, f0) + (t²/2n)‖ψ̃_{f0}‖²_L + (t²/2n)(F − F0)(ψ̃²_{f0}) + O(n^{−3/2}) )
= −t⟨h, ψ̃_{f0}⟩_L − t√n B(f, f0) + (t²/2)‖ψ̃_{f0}‖²_L + O(‖f − f0‖₁ + n^{−1/2})
= −t⟨h, ψ̃_{f0}⟩_L − t√n B(f, f0) + (t²/2)‖ψ̃_{f0}‖²_L + o(1),

since |(F − F0)(ψ̃²_{f0})| ≤ ‖ψ̃_{f0}‖²_∞ ‖f − f0‖₁ ≲ ε_n on A_n. Finally, this implies that √n t r(η, η0) + R_n(η, η0) − R_n(η_t, η0) = o(1) uniformly over A_n, and Assumption A is satisfied.
6.3 Proof of Theorem 4.2
The first part of the proof consists in establishing that the posterior distribution on random histograms concentrates a) given the number of bins k, around the projection f_{0,[k]} of f0, and b) globally around f0 in terms of the Hellinger distance.
More precisely, a) there exist c, M > 0 such that

P^n_0[ ∃k ≤ n/log n : Π[f ∉ A_{n,k}(M) | Y^n, k] > e^{−ck log n} ] = o(1).   (6.2)

b) Suppose f0 ∈ C^β with 0 < β ≤ 1. If k_n(β) = (n/log n)^{1/(2β+1)} and ε_n(β) = k_n(β)^{−β}, then for k₁, M large enough,

Π[h(f0, f) ≤ Mε_n(β); k ≤ k₁k_n(β) | Y^n] = 1 + o_p(1).   (6.3)

Both results are new. As a)-b) are an intermediate step and concern rates rather than BvM per se, their proofs are given in the supplement [18].
We now prove that the BvM holds if there exists 𝒦_n such that Π(𝒦_n | Y^n) = 1 + o_p(1), and for which

sup_{k∈𝒦_n} √n|ψ̂ − ψ̂_k| = o_p(1),  sup_{k∈𝒦_n} |V_k − V| = o_p(1),   (6.4)

for all ψ(f) satisfying (4.2) with

sup_{k∈𝒦_n} sup_{f∈A_{n,k}(M)} √n r̃(f, f0) = o_p(1).   (6.5)
Consider first the case of a deterministic number of bins k = K_n. The study of the posterior distribution of √n(ψ(f) − ψ̂) is based on a slight modification of the proof of Theorem 4.1. Instead of taking the true f0 as base point for the LAN expansion, we take f_{0,[k]}. This enables one to write the main terms in the LAN expansion completely within H_k.
Let us define ψ̄_{(k)} := ψ_{[k]} − ∫ψ_{[k]}f_{0,[k]} = ψ̃_{[k]} − ∫ψ̃_{[k]}f_{0,[k]} and ψ̂_k = ψ(f_{0,[k]}) + W_n(ψ̄_{(k)})/√n. With the same notation as in Section 4, where indexation by k means that f0 is replaced by f_{0,[k]} (in ‖·‖_{L,k}, R_{n,k} etc., where one can note that for g ∈ H_k one has W_{n,k}(g) = W_n(g)),

t√n(ψ(f) − ψ̂_k) + ℓ_n(f) − ℓ_n(f_{0,[k]})
= −(n/2)‖log(f/f_{0,[k]}) − (t/√n)ψ̄_{(k)}‖²_{L,k} + √n W_n(log(f/f_{0,[k]}) − (t/√n)ψ̄_{(k)})
  + (t²/2)‖ψ̄_{(k)}‖²_{L,k} + t√n B_{n,k} + R_{n,k}(f, f_{0,[k]}).
Let us set f_{t,k} = f e^{−tψ̄_{(k)}/√n} / F(e^{−tψ̄_{(k)}/√n}). Then, using the same arguments as in Section 4, together with (C.2) and the fact that ∫ψ̄_{(k)}f_{0,[k]} = 0,

t√n(ψ(f) − ψ̂_k) + ℓ_n(f) − ℓ_n(f_{0,[k]}) = (t²/2)‖ψ̄_{(k)}‖²_{L,k} + ℓ_n(f_{t,k}) − ℓ_n(f_{0,[k]}) + o(1),

so that, choosing A_{n,k} = {ω ∈ S_k : ‖f_{ω,k} − f_{0[k]}‖₁ ≤ M√(k log n/n)}, we have

E^Π[e^{t√n(ψ(f)−ψ̂_k)} | Y^n, A_{n,k}] = e^{(t²/2)‖ψ̄_{(k)}‖²_{L,k} + o(1)} × ∫_{A_{n,k}} e^{ℓ_n(f_{t,k})−ℓ_n(f_{0[k]})} dΠ_k(f) / ∫_{A_{n,k}} e^{ℓ_n(f)−ℓ_n(f_{0[k]})} dΠ_k(f),

uniformly over k = o(n/log n). Within each model H_k, since f = f_{ω,k}, we can express f_{t,k} = k Σ_{j=1}^k ζ_j 1l_{I_j}, with

ζ_j = ω_j γ_j^{−1} / Σ_{j=1}^k ω_j γ_j^{−1},   (6.6)

where we have set, for 1 ≤ j ≤ k, γ_j = e^{tψ̄_j/√n} and ψ̄_j := k ∫_{I_j} ψ̄_{(k)}. Denote S_{γ^{−1}}(ω) = Σ_{j=1}^k ω_j γ_j^{−1}. Note that (6.6) implies S_{γ^{−1}}(ω) = S_γ(ζ)^{−1}. So,

Π_k(ω)/Π_k(ζ) = Π_{j=1}^k e^{t(α_{j,k}−1)ψ̄_j/√n} · S_γ(ζ)^{−Σ_{j=1}^k (α_{j,k}−1)}.
Let ∆ be the Jacobian of the change of variables, computed in Lemma 5 of the supplement [18]. Over the set A_{n,k}, it holds that

dΠ_k(ω) = Π_{j=1}^k e^{t(α_{j,k}−1)ψ̄_j/√n} S_γ(ζ)^{−Σ_{j=1}^k (α_{j,k}−1)} ∆(ζ) dΠ_k(ζ)
        = S_γ(ζ)^{−Σ_{j=1}^k α_{j,k}} e^{t Σ_{j=1}^k α_{j,k}ψ̄_j/√n} dΠ_k(ζ)
        = e^{t Σ_{j=1}^k α_{j,k}ψ̄_j/√n} ( 1 − (t/√n) ∫₀¹ ψ̄_{(k)}(f − f0) + O(n^{−1}) )^{Σ_{j=1}^k α_{j,k}} dΠ_k(ζ),

where we have used that

S_{γ^{−1}}(ω) = ∫₀¹ e^{−tψ̄_{(k)}/√n} f = 1 − (t/√n) ∫₀¹ ψ̄_{(k)}(f − f0) + O(n^{−1}).
Moreover, if ‖ω − ω⁰‖₁ ≤ M√(k log n)/√n, then

‖ζ − ω⁰‖₁ ≤ M√(k log n)/√n + 2|t|‖ψ̃‖_∞/√n ≤ (M + 1)√(k log n)/√n,

and vice versa. Hence, choosing M large enough (independent of k) such that

Π[‖ω − ω⁰‖₁ ≤ (M − 1)√(k log n/n) | Y^n, k] = 1 + o_p(1)

implies that, if Σ_{j=1}^k α_{j,k} = o(√n), noting that ‖ψ̄_{(k)}‖_{L,k} = ‖ψ̃_{[k]}‖_L,

E^Π[e^{t√n(ψ(f)−ψ̂_k)} | Y^n, A_{n,k}] = e^{t²‖ψ̃_{[k]}‖²_L/2}(1 + o(1)).   (6.7)

The last estimate is for the restricted distribution Π[· | Y^n, A_{n,k}], but (C.2) implies that the unrestricted version also follows. Since ‖ψ̃‖²_L is the efficiency bound for estimating ψ in the density model, (6.4) follows.
Now we turn to the random k case. The previous proof can be reproduced k by k; that is, one decomposes the posterior Π[· | Y^n, B_n], for B_n = ∪_{1≤k≤n} A_{n,k} ∩ {f = f_{ω,k}, k ∈ 𝒦_n}, into the mixture of the laws Π[· | Y^n, B_n, k] with weights Π[k | Y^n]. Combining the assumption on 𝒦_n and (C.2) yields Π[B_n | Y^n] = 1 + o_p(1). Now notice that in the present context (6.7) becomes

E^Π[e^{t√n(ψ(f)−ψ̂_k)} | Y^n, B_n, k] = E^Π[e^{t√n(ψ(f)−ψ̂_k)} | Y^n, A_{n,k}, k] = e^{t²‖ψ̃_{[k]}‖²_L/2}(1 + o(1)),

where it is important to note that the o(1) is uniform in k. This follows from the fact that the proof in the deterministic case holds for any given k less than n, and any dependence on k has been made explicit in that proof. Thus

E^Π[e^{t√n(ψ(f)−ψ̂)} | Y^n, B_n] = Σ_{k∈𝒦_n} E^Π[e^{t√n(ψ(f)−ψ̂)} | Y^n, A_{n,k}, k] Π[k | Y^n]
                             = (1 + o(1)) Σ_{k∈𝒦_n} e^{t²V_k/2 + t√n(ψ̂_k−ψ̂)} Π[k | Y^n].

Using (6.4), together with the continuous mapping theorem for the exponential function, yields that the last display converges in probability to e^{t²V/2} as n → ∞, which leads to the BvM theorem.
We apply this to the four examples. First, in the
√ case of Example 4.1 with deterministic
k =√Kn , we have by definition that r̃(f, f0 ) = 0 and n(ψ̂Kn − ψ̂) = bn,Kn + op (1) with bn,Kn =
O( nKn−β−γ ) = o(1) if β + γ > 1, when a ∈ C γ . On the other hand, if a(x) = 1lx≤z , for all β > 0,
Z
√ z
√
0
|bn,Kn | . n (f0 (x) − kwbKn zc )dx = O( nKn−(β+1) ) = o(1).
bKn zc/Kn
We now verify (6.4) together with (6.5) for Examples 4.2, 4.3 and 4.4. We present the proof
in the case Example 4.2, since the other two are treated similarly. Set, in the random k case
Kn = k ∈ [1, k1 kn (β)], ∃f ∈ Hk1 , h(f, f0 ) ≤ M εn (β) ,
for some k1 , M large enough so that Π[Kn | Y n ] = 1+op (1) from (C.2), with εn (β) = (n/ log n)−β/(2β+1) .
For β > 1/2, note that kε2n,k . kεn (β)2 = o(1), uniformly over k . kn (β). In the deterministic
case, simply set Kn = {Kn }.
First observe that for k ∈ Kn , the elements of the set {f ∈ Hk1 , h(f, f0 ) ≤ M εn (β)} are
bounded away from 0 and ∞. Indeed, p
since this is true for√f0 , writing the Hellinger distance
√
as a sum over the various bins leads to f (x) ≥ c0 − εn,k k which implies that f (x) ≥ c0 /2
2
for n large enough, since kεn = o(1). Similarly kf k∞ ≤ 2kf0 k∞ for n large. Now, by writing
log(f /f0 ) = 1 + (f − f0 )/f0 + ρ(f − f0 ), and using that f /f0 is bounded away from 0 and ∞, one
R1
easily checks that |r̃(f, f0 )| in Example 4.2 is bounded from above
by a multiple of 0 (f − f0 )2 ,
√
which itself is controlled by h(f, f0 )2 for f, f0 as before. Also nε2n,k = o(1) when β > 1/2,
which
√ implies (6.5). It is easy to adapt the above computations to the case where k = Kn =
O( n/(log n)2 ).
Next we check condition (6.4). Since ψ̃ = log f0 − ψ(f0 ), under the deterministic k-prior with
k = Kn = bn1/2 (log n)−2 c and β > 1/2,
Z 1
Z 1
√
2
ψ̃(f
−
f
)
=
(
ψ̃
−
ψ̃
)(f
−
f
)
0
0
0[k] [k]
0[k] . h (f0 , f0[k] ) = o(1/ n).
0
0
√
In that case the posterior distribution of n(ψ(f ) − ψ̂) is asymptotically Gaussian with mean 0
and variance kψ̃k2L , so the BvM theorem is valid.
Under the random k-prior, recall from the reasoning above that any f with h(f, f0 ) ≤ M εn (β)
is bounded from below and above, so the Hellinger and L2 -distances considered below are comparable. For a given k ∈ Kn , by definition there exists fk∗ ∈ Hk1 with h(f0 , fk∗ ) ≤ M εn (β), so using
21
(C.3),
h2 (f0 , f0[k] ) .
Z
0
1
(f0 − f0[k] )2 (x)dx ≤
Z
1
(f0 − fk∗ )2 (x)dx . h2 (f0 , fk∗ ) . ε2n (β).
0
This implies, using the same bound as in the deterministic-k case,
F0 ((ψ̃[k] − ψ̃)2 ) . h(f0 , f0[k] )2 = O(ε2n (β)),
2
) − F0 (ψ̃ 2 ) = o(1), uniformly over k ∈ Kn . To control the empirical process
and that F0 (ψ̃[k]
part of (6.4), that is the second part of (4.10), one uses e.g. Lemma 19.33 in [42], which provides
an upper-bound for the maximum, together with the last display. So, for random k, the BvM
theorem is satisfied if β > 1/2.
References
[1] Adamczak, R. (2008). A tail inequality for suprema of unbounded empirical processes with
applications to Markov chains. Electron. J. Probab., 13:no. 34, 1000–1034.
[2] Adamczak, R. and Bednorz, W. (2012). Exponential concentration inequalities for additive
functionals of Markov chains. Technical report, University of Warsaw.
[3] Arbel, J., Gayraud, G., and Rousseau, J. (2013). Bayesian adaptive optimal estimation using
a sieve prior. Scand. J. Statist., 40:549–570.
[4] Berger, J. O. (1985). Statistical decision theory and Bayesian analysis. Springer Series in
Statistics. Springer-Verlag, New York, second edition.
[5] Bickel, P. J. and Kleijn, B. J. K. (2012). The semiparametric Bernstein–von Mises theorem.
Ann. Statist., 40:206–237.
[6] Bickel, P. J. and Ritov, Y. (1988). Estimating integrated squared density derivatives: sharp
best order of convergence estimates. Sankhyā Ser. A, 50(3):381–393.
[7] Bickel, P. J. and Ritov, Y. (2003). Nonparametric estimators which can be “plugged-in”. Ann.
Statist., 31(4):1033–1053.
[8] Bontemps, D. (2011). Bernstein–von Mises theorems for Gaussian regression with increasing
number of regressors. Ann. Statist., 39:2557–2584.
[9] Boucheron, S. and Gassiat, E. (2009). A Bernstein-von Mises Theorem for discrete probability
distributions. Electron. J. Stat., 3:114–148.
[10] Cai, T. and Low, M. G. (2006). Optimal adaptive estimation of a quadratic functional. Ann.
Statist., 34:2298–2325.
[11] Castillo, I. (2008). Lower bounds for posterior rates with Gaussian process priors. Electronic
Journal of Statistics, 2:1281–1299.
[12] Castillo, I. (2012a). A semiparametric Bernstein–von Mises theorem for Gaussian process
priors. Probab. Theory Related Fields, 152(1-2):53–99.
[13] Castillo, I. (2012b). Semiparametric Bernstein–von Mises theorem and bias, illustrated with
Gaussian process priors. Sankhya A, 74(2):194–221.
[14] Castillo, I. (2014). On Bayesian supremum norm contraction rates. Ann. Statist. To appear.
[15] Castillo, I. and Nickl, R. (2013). Nonparametric Bernstein–von Mises theorems in Gaussian
white noise. Ann. Statist., 41(4):1999–2028.
22
[16] Castillo, I. and Nickl, R. (2014). On the Bernstein–von Mises phenomenon for nonparametric
Bayes procedures. Ann. Statist. To appear.
[17] Castillo, I. and Rousseau, J. (2013a). A general Bernstein–von Mises Theorem in semiparametric models. Technical report.
[18] Castillo, I. and Rousseau, J. (2013b). A general Bernstein–von Mises Theorem in semiparametric models: Supplementary material. Technical report.
[19] Cohen, A., Daubechies, I., and Vial, P. (1993). Wavelets on the interval and fast wavelet
transforms. Appl. Comput. Harmon. Anal., 1(1):54–81.
[20] Cox, D. D. (1993). An analysis of Bayesian inference for nonparametric regression. Ann.
Statist., 21(2):903–923.
[21] De Blasi, P. and Hjort, N. L. (2009). The Bernstein–von Mises theorem in semiparametric
competing risks models. J. Statist. Plann. Inference, 139(7):2316–2328.
[22] Dudley, R. M. (2002). Real analysis and probability, volume 74 of Cambridge Studies in
Advanced Mathematics. Cambridge University Press, Cambridge.
[23] Efromovich, S. and Low, M. (1996). On optimal adaptive estimation of a quadratic functional.
Ann. Statist., 24:11061125.
[24] Freedman, D. (1999). On the Bernstein–von Mises theorem with infinite-dimensional parameters. Ann. Statist., 27(4):1119–1140.
[25] Gayraud, G. and Tribouley, K. (1999). Wavelet methods to estimate an integrated quadratic
functional: adaptivity and asymptotic law. Statist. Probab. Lett., 44(2):109–122.
[26] Ghosal, S. (1999). Asymptotic normality of posterior distributions in high-dimensional linear
models. Bernoulli, 5(2):315–331.
[27] Ghosal, S., Ghosh, J. K., and van der Vaart, A. W. (2000). Convergence rates of posterior
distributions. Ann. Statist., 28(2):500–531.
[28] Ghosal, S. and van der Vaart, A. (2007a). Posterior convergence rates of Dirichlet mixtures
at smooth densities. Ann. Statist., 35(2):697–723.
[29] Ghosal, S. and van der Vaart, A. W. (2007b). Convergence rates of posterior distributions
for noniid observations. Ann. Statist., 35(1).
[30] Ghosh, J. and Ramamoorthi, R. (2003). Bayesian non parametrics. Springer-Verlag, New
York.
[31] Giné, E. and Nickl, R. (2011). Rates of contraction for posterior distributions in Lr -metrics,
1 ≤ r ≤ ∞. Ann. Statist., 39(6):2883–2911.
[32] Kim, Y. (2006). The Bernstein–von Mises theorem for the proportional hazard model. Ann.
Statist., 34(4):1678–1700.
[33] Kleijn, B. J. K. and van der Vaart, A. W. (2006). Misspecification in infinite dimensional
Bayesian statistics. Ann. Statist., 34:837–877.
[34] Knapik, B. T., Szabó, B. T., van der Vaart, A. W., and van Zanten, J. H. (2012). Bayes
procedures for adaptive inference in inverse problems for the white noise model. ArXiv e-prints.
[35] Knapik, B. T., van der Vaart, A. W., and van Zanten, J. H. (2011). Bayesian inverse problems
with Gaussian priors. Ann. Statist., 39(5):2626–2657.
23
[36] Kruijer, W. and Rousseau, J. (2013). Bayesian semiparametric estimation of the long-memory
parameter under FEXPpriors. Electronic Journal of Statistics, 7:2947–2969.
[37] Laurent, B. (1996). Efficient estimation of integral functionals of a density. Ann. Statist.,
24(2):659–681.
[38] Leahu, H. (2011). On the Bernstein–von Mises phenomenon in the Gaussian white noise
model. Electron. J. Stat., 5:373–404.
[39] Lifshits, M. and Simon, T. (2005). Small deviations for fractional stable processes. Ann. Inst.
H. Poincaré Probab. Statist., 41(4):725–752.
[40] Rivoirard, V. and Rousseau, J. (2012). On the Bernstein–von Mises theorem for linear functionals of the density. Ann. Statist., 40(3):1489–1523.
[41] Shen, X. (2002). Asymptotic normality of semiparametric and nonparametric posterior distributions. J. American Statist. Assoc., 97:222–235.
[42] van der Vaart, A. W. (1998). Asymptotic statistics, volume 3 of Cambridge Series in Statistical
and Probabilistic Mathematics. Cambridge University Press, Cambridge.
[43] van der Vaart, A. W. and van Zanten, H. (2008a). Rates of contraction of posterior distributions based on Gaussian process priors. Ann. Statist., 36(3):1435–1463.
[44] van der Vaart, A. W. and van Zanten, H. (2008b). Reproducing kernel Hilbert spaces of
Gaussian priors. IMS Collections, 3:200–222.
24
A
Appendix: Some weak convergence facts
We state some (certainly well-known) lemmas on weak convergence, in probability, of a sequence
of random probability measures on the real line. Proofs are included for the sake of completeness.
Let β be a distance which metrises weak convergence of probability measures on R, here for
convenience taken to be the bounded Lipschitz metric (see e.g. Dudley [22], Chap. 11). Let
Pn be a sequence of random probability measures on R. We say that Pn converges weakly in
P0 -probability to a fixed measure P on R if, as n → ∞, one has β(Pn , P ) → 0 in P0 -probability.
R tx
Lemma
1. Suppose
R tx that for any real t, the Laplace transform e dP (x) is finite, and that
R tx
e dPn (x)R → e dPR (x), in P0 -probability. Then, for any continuous and bounded real function
f , it holds f dPn → f dP , in P0 -probability.
Lemma 2. Under the conditions of Lemma 1, it holds β(Pn , P0 ) → 0 in P0 -probability. If P has
no atoms, we also have, in P0 -probability,
sup |Pn ((−∞, s]) − P ((−∞, s])| → 0,
s∈R
R
Proof of Lemma 1. Let L(t) := etx dP (x) = EP [etX ]. For M > 0,
Pn [|X| > M ] ≤ e−M EPn eXn + EPn e−Xn
≤ e−M L(1) + L(−1) + EPn eXn − L(1) + EPn e−Xn − L(−1) .
Let ε > 0 be fixed. Let M > 0 be such that e−M [L(1) + L(−1)] ≤ ε/2. Then
P0 (Pn [|X| > M ] > ε) ≤ P0 EPn eXn − L(1) > ε/4
+ P0 EPn e−Xn − L(−1) > ε/4 = o(1).
Let f be a given continuous and bounded real function and write
EPn f (X)1l|X|≤M = EP f (X)1l|X|≤M + EPn f (X)1l|X|≤M − EP f (X)1l|X|≤M
Over the compact set [−M, M ], Stone-Weierstrass’ theorem,
P applied to the algebra of finite linear
combinations of exponential functions of the form x → j αj etj x , shows that for any ε > 0 there
exists (Nε , αj , tj , j ≤ Nε ), such that
Nε
X
tj x sup f (x) −
αj e < ε/2.
|x|≤M j=1
Therefore one obtains
EPn f (X)1l|X|≤M − EP f (X)1l|X|≤M
Nε
Nε
X
X
t
X
t
X
j
j
≤ (EPn + EP ) 1l|X|≤M f (X) −
αj e + αj EPn [e ] − L(tj ) j=1
j=1
≤ ε/2 + oP0 (1).
Thus
R
f d(Pn − P ) = oP0 (1), for any continuous and bounded function f .
Proof of Lemma 2. For the first part of the statement, let us reason by contradiction and suppose
that β(Pn , P0 ) 9 0 in P0 -probability. Let {ψm } be a countable collection of elements in the
space BL(R) of bounded Lipschitz functions, dense in BL(R) for the supremum
R norm (not for the
BL-metric),
see
e.g.
Dudley
[22],
proof
of
Proposition
11.4.1.
By
Lemma
1,
ψm dPn converges
R
to ψm dP in probability for any m. Such convergence can be made into an almost sure one up
25
to
R subsequence Rextraction. By a diagonal argument, one then finds a subsequence φ(n) such that
ψm dPφ(n) → ψm dP for any possible m, almost surely. Let us now work on the event say Ω0
on which this happens. Let f be a given bounded-Lipschitz function on R. Let ε > 0 be arbitrary.
There exists an index m such that kf − ψm k∞ ≤ ε. Thus by the triangle inequality
Z
Z
| f d(P − Pφ(n) )| ≤ 2ε + | ψm d(P − Pφ(n) )|.
The last term converges to 0 on the event Ω0 . Since ε is arbitrary, this contradicts the fact that
β(Pn , P0 ) 9 0.
The second part of the statement follows from the fact that the collection A = {(−∞, s], s ∈ R}
forms a uniformity class for weak convergence. The ‘in-probability’ part of the convergence follows,
again for instance by a reasoning by contradiction via extraction of a subsequence along which
almost sure convergence holds, see also Castillo and Nickl [15] Section 4.2 for a similar argument
and a detailed discussion on uniformity classes on separable metric spaces.
B
B.1
Appendix: White noise model
A Lemma
The following result is a slight adaptation of a result in Castillo and Nickl [15] and provides a
contraction rate in L2 for the posterior in the Gaussian white noise model for any prior of the
form (3.1) in [17]. Let f0 ∈ L2 [0, 1] and set
X
Kn
2
ε2n =
+
f0,k
.
(B.1)
n
k>Kn
2
Lemma 3 (L -result in Castillo and Nickl [15]). Consider the Gaussian white noise model with
f0 ∈ L2 [0, 1].
R Let Π be defined by (3.1) in [17] and εn be defined by (B.1). Suppose (3.2) in [17]
holds, that R x2 ϕ(x)dx < ∞, and that there exist constants cϕ , Cϕ such that ϕ(x) ≤ Cϕ for all
real x and ϕ(x) ≥ cϕ for all x ∈ (−τ, τ ).
Then there exists C > 0 such that, as n → ∞,
(n)
Pf0 Π[f : kf − f0 k2 ≤ Cεn | Y n ] → 1.
Remark 5. Lemma 3 still holds if ϕ depends on k, n, as long as one can find Cϕ , cϕ , τ independent
of k, n satisfying the conditions of the Lemma.
B.2
Proof of Theorem 3.1
(1)
(2)
For the considered functional recall that we have set ψ0 = 2f0 and ψ0 f = 2f for f ∈ H = L2 .
(2)
Also, r = 0 in (2.4) and (2.5) in [17] holds. Since ψ0 is not the zero function, one needs to find
a candidate for wn in
P the case A2. Set wn,k = k if 1 ≤ k ≤ Kn and wn,k = 0 otherwise. In
particular, ∆n (h) = k>Kn hk k for any h in L2 .
Lemma 3pimplies, under (3.1) in [17] and β > 1/4, that the posterior concentrates at rate at
least εn = 2 Kn /n around f0 . Set An = {f : kf − f0 k2 ≤ 2εn }. Then (2.9) in [17] holds since
∆n (f − f0 ) = −∆n (f0 ) is independent of f and follows a Gaussian distribution with vanishing
variance. Also, (2.10) in [17] holds using the expression of εn and that Kn is a o(n).
Denote by P[Kn ] the orthogonal projector in L2 onto the first Kn coordinates. Let us compute
the centering ψ̂ from Theorem 2.1 of [17],
(1) √
(2)
ψ̂ = ψ(f0 ) + W (ψ0 )/ n + hwn , ψ0 wn iL /(2n)
=
Kn
X
√
Yk2 + kf0 − P[Kn ] f0 k2 + 2W (f0 − P[Kn ] f0 )/ n
k=1
√
√
= ψ̄ + n−1/2 Kn / n + o( nKn−2β ) + oP (1) .
26
√
As β > 1/4, our choice of Kn implies nKn−2β = o(1). From Theorem 2.1 of [17], it follows that
√
√
√
Ln (t, Y ) := E Π [et n(ψ(f )−ψ̂n ) | Y n , An ] equals, with ft = (1 − t/ n)f − (t/ n)f0 − twn /n,
√
R
2
h
i
−n
2 kft −f0 kL + nW (ft −f0 ) dΠ(f )
Kn
√ +op (1) t+[2kf0 k2 +op (1)]t2 An e
n
Ln (t, Y ) = e
.
R − n kf −f k2 +√nW (f −f )
0 L
0 dΠ(f )
e 2
R
Indeed, this is expression (2.13) in [17], up to the fact that in the denominator An is replaced by
R
. But we can do this without affecting the argument since the ratio of the two previous integrals
is nothing but Π(An | Y n ) = 1 + op (1).
−1
For any f in L2 , denote fn := P[Kn ] f and let Πn := Π ◦ P[K
. With Bn = {g ∈ RKn , kg −
n]
f0,n k2 ≤ 4ε2n − kf0 − f0,n k2 }, it holds
R
2 √
n
e− 2 kft,n −f0,n k + nW (ft,n −f0,n ) dΠn (fn )
2kf0 k2 t2 +op (1) bn Bn
Ln (t, Y ) = e
e
.
R − n kf −f k2 +√nW (f −f )
n
0,n dΠ (f )
e 2 n 0,n
n n
The term bn originates from the fact that the prior sets fk = 0 when k > Kn ,
√ X
n X 2
bn =
(f0,k − (ft,k − f0,k )2 ) + n
ft,k k .
2
k>Kn
k>Kn
From the definition of ft , one gets
X
X
√
2
2
n
(f0,k
− (ft,k − f0,k )2 ) = (−t2 − 2t n)
f0,k
.
k>Kn
√
k>Kn
n
X
ft,k k = −t
k>Kn
X
k f0,k .
k>Kn
Since β > 1/4 the first term is o(1) and the second a oP (1) using the regularity assumption on f0 .
It is thus enough to focus on
√
R
2
n
e− 2 kft,n −f0,n kL + nW (ft,n −f0,n ) dΠn (fn )
Bn
In := R − n kf −f k2 +√nW (f −f )
.
n
0,n dΠ (f )
e 2 n 0,n L
n n
Let us write In = Jn × Kn with
R
n
Bn
Jn = R
R
Kn =
√
2
e− 2 kft,n −f0,n kL +
nW (ft,n −f0,n )
√
2
−n
2 kft,n −f0,n kL + nW (ft,n −f0,n )
dΠn (fn )
e
dΠn (fn )
√
2
−n
kf
−f
k
+
nW
(f
−f
)
t,n
0,n
e 2 t,n 0,n L
dΠn (fn )
.
R − n kf −f k2 +√nW (f −f )
n
0,n
n
0,n
L
e 2
dΠn (fn )
Each integral appearing in Jn and Kn is an integral over RKn and can be rewritten using the
explicit form of the prior. Note that Kn can be split in a product of Kn ratios along each
coordinate, while Jn cannot because of the integrating set Bn which mixes the coordinates. In
integrals involving ft,n we make the affine change of variables which is the inverse of the mapping
ψn
:
RKn
{fk }
→
→ { 1−
√t
n
RK n
fk − t
f0,k
√
n
+
k
n
That is, we define the new variable gn = ψn (fn ). For simplicity denote
t
f0,k
k
ct = 1 − √
and
δk = δk (k ) = t √ +
,
n
n
n
The Jacobian of the change of variables is, since Kn = o(n),
n
c−K
= etKn /
t
27
√
n+t2 o(1)
.
}.
k ≤ Kn
Study of Jn
This leads to
R
Jn =
−1
ψn
(Bn )
QKn
R QKn
2
n
k=1
e− 2 (gk −f0,k )
n
√
+ nεk (gk −f0,k )
ϕ(
c−1
t gk −δk
)dgk
σk
√
−1
nεk (gk −f0,k ) ϕ( ct gk −δk )dg
k
σk
2+
− 2 (gk −f0,k )
k=1 e
.
Note that Jn coincides with Π̃n (ψn−1 (Bn ) | Y n ), where Π̃n is the distribution
Π̃n ∼
Kn
c−1 · −δ O
c−1
k
t
ϕ t
.
σk
σk
k=1
The new product prior Π̃n is a slightly (randomly) perturbed version of Πn . With high probability,
the induced perturbation is not too important. Set, for some D > 0 to be chosen,
Cn =
max |i | ≤ D log n .
1≤k≤Kn
Let us use the following standard concentration inequality for the sup-norm of a Gaussian vector.
For a large enough universal constant D,
2
P max |i | > D log n ≤ e− log n .
i=1,...,n
probability. Thus from the beginning one can work on Cn . On Cn ,
So the event Cnc has vanishing
√
we have |δk | ≤ t|f0,k |/ n + D log n/n. Thus, on Cn ,
(
)
Kn h 2
f0,k
g
log2 n i
Ct2 X
2
2
g : k − f0 k2 ≤ εn −
+
⊂ ψn−1 (Bn ).
ct
ct
n
n2
k=1
We deduce, since {Kn (log n/n)2 } ∨ n−1 = o(ε2n ), that
g : kg − f0 k22 ≤ 4ε2n (1 + o(1)) ⊂ ψn−1 (Bn ).
It thus follows that
Π̃n [g : kg − f0 k22 ≤ 4ε2n (1 + o(1)) | Y n ] ≤ Jn ≤ 1.
The integrating set in the last display is nonrandom and we need to prove a usual contraction
result for the posterior Π̃n [· | Y n ] in P0n -probability. To do so, we first start by restricting to the
event Cn . Given the data Y n , the quantity Π̃n is a fixed prior distribution of the product form
−1
with a coordinatewise unnormalised density equal to ϕ̃k := c−1
t ϕ(ct · −δk ). On Cn , both ct and δk
can respectively be made as close to 1 and 0 as wished, uniformly in k, for n large enough. Thus
the perturbed ϕ̃k also satisfies the conditions of Lemma 3, up to the use of different constants, see
Remark 5. Lemma 3 now yields Π̃n [g : kg − f0 k22 ≤ 4ε2n (1 + o(1)) | Y n ] → 1 in probability. Thus
Jn → 1 in probability.
√
n
Study of Kn We now show that Kn = c−K
(1 + op (1)) = etKn / n (1 + op (1)). We start also by
t
changing variables as above and the ratio splits into the product over 1 ≤ k ≤ Kn of the terms
R
c−1
t
Setting u =
√
n(gk − f0,k ) and v =
R
Bk (k ) :=
2
n
√
c−1 g −δ
e− 2 (gk −f0,k ) + nk (gk −f0,k ) ϕ( t σkk k )dgk
.
R − n (f −f )2 +√n (f −f ) fk
k
k
0,k ϕ(
e 2 k 0,k
σk )dfk
√
u2
n(fk − f0,k ), one needs to control
c−1 (f
√
+u/ n)−δ
k
e− 2 +k u ϕ( t 0,k σk
)du
Nk
√
=:
(k ).
R − v2 + v f0,k +v/ n
D
k
k
2
e
ϕ(
)dv
σk
28
Gaussian prior. Let ϕ be the standard Gaussian density. The term Bk (k ) can be computed
−1
σ −2 c−2
explicitly. If one denotes Σ2k,t = 1 + k n t
, it holds
Bk (k ) =
Σk,t
Σk,0
exp
i2 h
δk (k )c−1
c−2
f0,k
t f0,k
t √
√
k − σ 2 n +
2
nσk
k
.
i2
h
Σ2k,0
f0,k
√
exp
−
k
2
σ2 n
Σ2k,t
2
k
Under (3.2) in [17], tedious but simple computations lead to, for some C > 0,
K
Kn n
X
σk−2
1 X
2
log Bk (k ) ≤ C(1 + op (1)) √
f0,k +
.
n
n
k=1
k=1
Uniform prior. Consider the choice ϕ(u) = 1l|u|≤M with M > 16M . One can write Bk (k ) =
1 + ζk (k ) and then use the fact
Ef0 |
Kn
Y
Bk (k ) − 1| ≤
k=1
Kn
Y
(1 + Ef0 |ζk (k )|) − 1
k=1
PKn
≤e
k=1
Ef0 |ζk (k )|
− 1.
The quantity ζk (k ) = Bk (k ) − 1 admits the expression
√
R − u2 + u
c−1 (f +u/ n)−δk (k )
)du
e 2 k 1l[−M,M] ( t 0,k σk
√
−1
ζk (k ) =
R − u2 + u
n
f
+u/
0,k
e 2 k 1l[−M,M] (
)du
σk
R −a −
R c −
u2
( −bkk−kk + dkk−kk )e− 2 du
=
,
R dk −k − u2
2 du
e
−ak −k
with ak , bk , ck , dk defined by (we omit the dependence in k in the notation)
√
√
√
√
√
ak = Mσk n + f0,k n, bk = Mct σk n + f0,k n − δk (k ) nct
√
√
√
√
√
ck = Mct σk n − f0,k n + δk (k ) nct , dk = Mσk n − f0,k n.
In order to evaluate Ef0 |ζk (k )|, we distinguish the cases k > 0 and k < 0. We present only the
argument for k > 0, the other case is analogous up to a few changes in constants. We have (note
that bk , ck still depend on w)
Z 3a4k R −bk −w R dk −w − u22
(
+ ck −w )e
du − w2
| −akR−w
|e 2 dw
Ef0 |ζk (k )|1k >0 =
u2
dk −w
−
2 du
0
e
−ak −w
Z +∞ R −bk −w R dk −w − u22
(
+ ck −w )e
du − w2
| −akR−w
|e 2 dw.
+
2
u
dk −w
3ak
− 2 du
e
4
−ak −w
The first integral is bounded by noticing that the denominator is larger than a fixed positive
constant, uniformly in k, since M > 16M implies dk > 3ak /4. Then, the numerator is bounded
by the length of the integration interval times the largest value of the integrated function. Note
that in the considered domain, the bounds −bk − w and −ak − w stay below −ak /8 (for a large
enough n independently of k), while ck − w and dk − w stay above ak /8. Thus, for some constant
D,
Z 3a4k
2
2
Ef0 |ζk (k )|10<k <3ak /4 ≤
(|ak − bk | + |ck − dk |)e−Dnσk e−w /2 dw
0
2
≤ cσk e−Dnσk .
29
The second integral in the last but one display is bounded as follows. First, dk − 1 ≥ −ak because
M > 4, so for any real w,
Z dk −w
Z dk −w
2
(dk −w)2
(dk −1−w)2
u2
− u2
2
du ≥
e− 2 du ≥ e− 2
∧ e−
.
e
dk −1−w
−ak −w
2
The last inequality follows from the fact that the smallest value of u → e−u /2 on an interval of
size 1 is attained at one of the endpoints. Thus
Z ∞
(dk −1−w)2
(dk −w)2
w2
2
∨e
)e− 2 dw
Ef0 |ζk (k )|1k >3ak /4 ≤
(|ak − bk | + |ck − dk |)(e 2
3ak
4
Z
≤C
∞
d2
(dk −1)2
w
k
(σk + √ )(e 2 −wdk ∨ e 2 −w(dk −1) )dw
3ak
n
4
The term in factor of σk is bounded by
cσk d−1
k (e
d2
k
2
−
3ak
4
dk
∨e
(dk −1)2
2
−
3ak
4
(dk −1)
2
) ≤ cσk e−λdk ,
−x
with λ small enough constant. The term in factor of w is bounded
≤
QKn similarly using xe
−(1−r)x
Cr e
for all x ≥ 0, for small r > 0. Thus in order to have k=1 Bk (k ) = 1 + op (1), it is
enough that for any D > 0,
Kn
X
2
σk e−Dnσk = o(1).
k=1
Prior with ϕ Lipschitz. Using the same techniques and (3.2) in [17], one checks
1 |f0,k |
t
1
2
Ef0 |ζk (t)| ≤ Ct √
+ √ √ +
.
n σk
σk n n nσk
C
Appendix: Density estimation
C.1
Random histograms
We first recall some basic facts that will be used throughout the proofs on random histograms.
Lemma 4. Let k ≥ 1 be an integer.
1. (i) For any f ∈ Hk , it holds f[k] = f
2. (ii) For any density f on [0, 1], it holds f[k] ∈ Hk1 .
3. (iii) Let f ∈ Hk and g ∈ L2 [0, 1], then
Z 1
Z
fg =
0
1
Z
f g[k] =
0
1
f[k] g[k] .
(C.1)
0
4. (iv) Let g be a given function in C α , with α > 0. Then
kg − g[k] k∞ ≤ k −(α∧1) .
Proposition 4. There exist c, M > 0 such that
n
; Π [f ∈
/ An,k (M ) | Y n , k] > e−ck log n = o(1).
P0 ∃k ≤
log n
(C.2)
Suppose now f0 ∈ C β with 0 < β ≤ 1. If kn (β) = (n/ log n)1/(2β+1) and εn (β) = kn (β)−β , then
for k1 , M large enough,
Π [h(f0 , f ) ≤ M εn (β); k ≤ k1 kn (β) | Y n ] = 1 + op (1).
30
(C.3)
Proof of Proposition 1. This result is a simple application of Kleijn and van der Vaart [33], we
sketch the proof here. The notation fω,k is defined in Section 4.2 of the main text√ [17], together
n
}. Note
with Ij = ((j − 1)/k, j/k] Let k ≤ n/(log n)2 and An,k = {ω; h(f0[k] , fω,k ) ≤ M k√log
n
R
0
0
0
0
that for all k, ω = (ω1 , · · · , ωk ) with ωj = Ij f0 (x)dx, minimizes over Sk the Kullback-Leibler
divergence KL(f0 , fω,k ). We have from Lemma 4,
R
e`n (fω,k )−`n (f0[k] ) dΠk (ω)
c
Acn,k
n
Π An,k |Y , k = R ` (f )−` (f )
,
e n ω,k n 0[k] dΠk (ω)
!
Z 1
k
X
ωj0
f0[k] (x)
0
= KL(f0[k] , fω,k )
f0 (x) log
dx =
ωj log
fω,k (x)
ωj
0
j=1
and
4
f0[k] (x)
f0 (x) log
− KL(f0[k] , fω,k ) dx
fω,k (x)
0
4
k
X
f0[k] (x)
=
− KL(f0[k] , fω,k ) dx.
ωj0 log
fω,k (x)
j=1
Z
1
V4 (f0[k] , fω,k ) :=
p
so that considering the set Sn := {ω ∈ Sk ; |ωj − ωj0 | ≤ Cωj0 k log n/n}, then there exists C1 > 0
such that
k log n
k log n
1/2
Sn ⊂ ω; KL(f0[k] , fω,k ) ≤ C1
, V4 (f0[k] , fω,k )
≤ C1
n
n
and from Lemma 6.1 of [27], there exists c > 0 such that
Π[Sn ] ≥ e−ck log n ,
and condition 2.4 of Theorem 2.1 of Kleijn and van der Vaart [33] is satisfied. Moreover, since
Z 1 p
p
f0 (x)
2
d(fω1 , fω2 ) :=
( fω1 − fω2 )2 (x)
dx = h2 (fω1 , fω2 ),
f0[k]
0
Lemmas 2.1 and 2.3 of Kleijn and van der Vaart [33] imply that condition (2.5) of Theorem 2.1 of
Kleijn and van der Vaart [33] can be replaced by the usual Hellinger-entropy condition. Since the
ε-Hellinger entropy of Sk is bounded by a term of order k log(1/ε), we obtain for all k ≤ n/(log n)2 ,
P0 Π Acn,k |Y n , k > e−ak log n = O(1/(k 2 log n)),
for some a > 0, where the bound 1/(k 2 log n) comes from 1/(n2n,k )2 which is the usual bound
obtained from the proof of Theorem 1 of [29] with a Kullback-Leibler neighbourhood associated
with V4 . This implies (C.2).
Finally we prove (C.3) for f0 ∈ C β . This is a consequence of the fact that
h
i
2
Π k > k1 (n/ log n)1/(2β+1) ≤ e−c1 nεn (β)
for some c1 > 0 and that
h
i
2
Π k > k2 (n/ log n)1/(2β+1) ≥ e−c2 nεn (β)
for some c2 > 0, together with Theorem 1 of Ghosal et al. [27].
The following Lemma gives the Jacobian of the change of variable used in the proofs on random
histograms in [17], with the notation in use there.
31
Lemma 5. Denoting by ∆(ζ) the Jacobian of the change of variables
!
−1
ωk−1 γk−1
ω1 γ1−1
,...,
= (ζ1 , . . . , ζk−1 ) =: ζ T ,
(ω1 , . . . , ωk−1 ) →
Sγ −1 (ω)
Sγ −1 (ω)
it holds, with γ = (γ1 , . . . , γk−1 )T ,
∆(ζ) = Sγ (ζ)−k
k
Y
γj .
j=1
Proof. Simple calculations give that the matrix M of the change of variables, that is the matrix
of partial derivatives ∂ω/∂ζ, has general term mij , for 1 ≤ i, j ≤ k − 1, with
mij =
γi
γi ζi (γj − γk )
.
δij −
Sγ (ζ)
Sγ (ζ)2
Let Γ denote the diagonal matrix Diag(γ1 , . . . , γk−1 ) and Idk−1 the identity matrix of size k − 1.
Then
M = Sγ (ζ)−1 (Γ − Sγ (ζ)−1 (γζ).(γ − γk )T )
= Sγ (ζ)−1 Γ(Idk−1 − Sγ (ζ)−1 ζ.(γ − γk )T ).
It remains to compute the determinant det(M ) of M . For this note that for any vectors v, w in
Rk−1 , it holds
det(Idk−1 − vwT ) = 1 − wT v.
Deduce that ∆(ζ) = Sγ (ζ)−k+1 (1 − Sγ (ζ)−1 (γ − γk )T ζ) det(Γ). A direct computation shows that
the term in brackets equals γk Sγ (ζ)−1 .
C.2
Gaussian process priors
Proof of Proposition 3. Recall that we need only prove condition (4.4) in [17], since the posterior
concentration condition is a consequence of (4.16) in [17] together with the results of van der Vaart
and van Zanten [43].
Because ψ̃f0 might not belong to B, we cannot directly consider the change of measure from
√
W to W − tψ̃f0 / n. We first prove that under conditions (4.17) and (4.18) in [17]
sup (`n (ηt ) − `n (ηn )) = op (1),
(C.4)
η∈An
where An is a subset of {f, d(f0 , f ) ≤ εn }, where d(., .) is the Hellinger or the L1 distance and
Z 1
√
ψn
η−tψn / n
ηn = η − t √ − log
e
n
0
Define the following isometry associated to the Gaussian process W :
U : Vecth {t → K(·, t), t ∈ R} i → L2 (Ω)
p
p
X
X
η :=
ai K(·, ti ) →
ai W (ti ) =: U (η),
i=1
i=1
Pp(n)
and since by definition any h ∈ H is the limit of a sequence i=1 ai,n K(·, ti,n ), it can be extended
Pp(n)
into an isometry U : H → L2 (Ω). Then U h is the limit in L2 (Ω) of the sequence i=1 ai,n W (ti,n ),
so that U h is a Gaussian random variable with mean 0 and variance khk2H . Set Bn = {f ; d(f0 , f ) ≤
εn } as in van der Vaart and van Zanten [43], and define the event
√
An = {|U (ψn )| ≤ M nεn kψn kH } ∩ Bn .
32
R1 w
√
w
x
Here f satisfies
√ |U (ψn )| ≤ M nεn kψn kH is to be understood as f = e /( 0 e dx) and w ∈
{|U (ψn )| ≤ M nεn kψn kH }, with w a realisation of W . Since
M nε2
√
ψn
> M nεn ≤ 2e− 2 n ,
Π U
kψn kH
taking M large enough, we have, using van der Vaart and van Zanten [43],
Π [An |Y n ] = 1 + op (1).
We now study `n (ηn ) − `n (ηt ) on An . Using (4.17) in [17], on An ,
n
√ √
t X
`n (ηn ) − `n (ηt ) = √
(ψ̃f0 (Yi ) − ψn (Yi )) + n log Eη (e−tψ̃f0 / n ) − log Eη (e−tψn / n )
n i=1
!
2
2
E
(
ψ̃
−
ψ
)
√
tEη (ψ̃f0 − ψn )
0
n
f
0
√
+ t2
= tGn (ψ̃f0 − ψn ) − ntE0 (ψn ) + n −
+ o(1)
2n
n
Z
√
= tGn (ψ̃f0 − ψn ) + t n (f0 − fη )(ψ̃f0 − ψn ) + o(1).
R
Since (f0 − fη )(ψ̃f0 − ψn ) ≤ kψ̃f0 − ψn k∞ εn ≤ ζn εn and Gn (ψ̃f0 − ψn ) = op (1), we obtain,
using condition (4.18) in [17], that
sup |`n (ηn ) − `n (ηt )| = op (1)
η∈An
Hence to prove (4.4) in [17] it is enough to prove that, in P0n probability
R
e`n (ηn )−`n (η0 ) dΠ(η)
An
R
→ 1.
e`n (η)−`n (η0 ) dΠ(η)
(C.5)
Lemma 17 in Castillo [12] states that for all Φ : B → R measurable and for any g, h ∈ H, and any
ρ > 0,
h
i
h
i
2
E 1l|U (g)|≤ρ Φ(W − h) = E 1l|U (g)+hg,hi |≤ρ Φ(W )eU (−h)−khkH /2 .
(C.6)
H
R
Since kψn k∞ ≤√ kψ̃f0 k∞ + ζn , if w is such that h2 (fw , f0 ) ≤ ε2n with fw = ew ( ew )−1 , then
wn = w − tψn / n satisfies
h2 (fwn , f0 ) = h2 (fw , f0 ) + O(n−1 ) + O(εn n−1/2 ) ≤ 2ε2n
√
and vice versa if h(fwn , f0 ) ≤ εn / 2 then h(fw , f0 ) ≤ εn , for n large enough. fHence the numerator
of (C.5) can be bounded from above and below by terms in the form
Z
1l|U (ψn )|≤M √nεn kψn kH e`n (ηn )−`n (η0 ) dΠ(η)
√
η−tψn / n∈B̃n
with B̃n = {f, d(f, f0 ) ≤ 2εn } and B̃n = {f, d(f, f0 ) ≤ εn /2} respectively. Using (C.6), one
obtains
R
e`n (ηn )−`n (η0 ) dΠ(η)
An
R
e`n (η)−`n (η0 ) dΠ(η)
√
R
√
kψn k2
1l
e`n (η)−`n (η0 ) eU (−ψn / n) dΠ(η)
n kψn kH
− 2 H +o(1) B̃n |U (ψn )|≤M 2nε
R
≤e
e`n (η)−`n (η0 ) dΠ(η)
R
kψn k2
1l|U (ψn )|≤M √2nεn kψn kH e`n (η)−`n (η0 ) dΠ(η)
H
B̃
R
≤ e− 2 +o(1) n
,
e`n (η)−`n (η0 ) dΠ(η)
√
where the last inequality comes from the constraint |U (ψn )| ≤ M 2nεn kψn kH together with
(4.17)-(4.18) in [17]. A similar lower bound is also obtained. This concludes the proof of the
Proposition.
33
The next Proposition provides a rate of convergence in k·k∞ -norm and is used in [17] to handle
remainder terms for non-linear functionals.
Proposition 5. Let f0 be bounded away from 0 and ∞. Suppose that η0 = log f0 belongs to C β .
Let the prior Π be the law induced by a centered Gaussian process W in B = C 0 with RKHS H.
Let α > 0. Suppose that the process W takes values in C δ , for all δ < α and let εn → 0 satisfy
(4.17) in [17]. Suppose that for some Kn → ∞ and some 0 < γ < α, the sequence
p
√
(C.7)
ρn := εn Kn + nεn Kn−γ + Kn−β −→ 0
as n → ∞. Then for large enough M ,
Π(f : kf − f0 k2 ≤ M εn | X (n) ) → 1,
and, for any ρn defined by (C.7) such that ρn = o(1),
i
h
Π f : kf − f0 k∞ ≤ M ρn | X (n) → 1.
(C.8)
(C.9)
The condition on the path of W of Proposition 5 is satisfied for a great variety of Gaussian
processes, for instance for the Riemann-Liouville type processes (up to adding the polynomial
part, which does not affect the property) this is established in [39]. For the Riemann-Liouville
process indexed by α > 0, bounds on the concentration function have been obtained in [43]-[11],
leading to a rate εn = n−(α∧β)/(1+2α) up to logarithmic terms. Thus, taking n1/(2α+1) in (C.7)
leads to a rate
1
ρn = n
2
+s−α∧β
1+2α
,
for arbitrary s > 0 (corresponding to the choice s = α − γ in (C.7)). The rate ρn is some
(intermediate) sup-norm rate. The proof of Proposition 5 can thus be seen as an alternative route
to derive results such as the ones obtained in [31], here for slightly different priors (here one gets
an extra –arbitrary– s > 0 in the rate. It can be checked that in some examples one can in fact
take s = 0. Since this has no effect on the verification of the BvM theorem for functionals, we
refrain from stating such refinements).
Proof of Proposition 5. A useful tool in the proof is the existence of a localised wavelet basis on
[0, 1]. Let us start by introducing some related notation and stating useful properties of the basis.
For convenience we consider the basis constructed in Cohen et al. [19], that we call CDV-basis.
We take the standard notation {ψlk , l ≥ 0, 0 ≤ k ≤ 2l − 1}. The family {ψlk } forms a complete
orthonormal system of L2 [0, 1] and the basis elements can be chosen regular enough so that for a
given γ > 0, Hölder or Besov-norm of spaces of regularity up to γ can be characterized in terms
of wavelet coefficients. For any g in C γ [0, 1], if γ is not an integer, we have C γ = Bγ,∞,∞ and
denoting k · kγ the norm of C γ ,
kgkγ ≡ max
max
l≥0 0≤k≤2l −1
1
2l( 2 +γ) |hg, ψlk i2 |,
where ≡ means equivalence up to universal constants.
We shall further use two properties of the
P
CDV-basis, namely that it is localised in that 0≤k≤2l −1 kψlk k∞ . 2l/2 , and that the constant
function equal to 1 on [0, 1] is orthogonal to high-level wavelet functions, that is h1, ψlk i2 = 0, any
l ≥ L and any k, for L large enough, see Cohen et al. [19] p. 57.
Now we start the proof by recalling that, here f = fw = exp(w − c(w)) and that f0 is bounded
away from 0,
V (f0 , f ) ≥ ckw − c(w) − w0 k22 .
(C.10)
From Lemma 8 in Ghosal and van der Vaart [28], we know that, for some universal constant C,
max(V (f, f0 ), V (f0 , f )) ≤ Ch2 (f, f0 )(1 + log kf /f0 k∞ ).
and the term in brackets is bounded above by 1+log exp(kw0 −w+c(w)k∞ ) = 1+kw0 −w+c(w)k∞ .
34
On the other hand, it is possible to link the sup-norm kw0 − w + c(w)k∞ to the L2 -norm via
basis expansions. Fix γ < α with γ ∈
/ N. Since by assumption, W belongs to C δ for any δ < α, it
γ+2δ
belongs in particular to C
for small enough δ. The continuous embedding C γ+2δ ,→ Bγ+δ,δ−1 ,1
thus shows that W can be seen as a Gaussian random element in the separable Banach space
Bγ+δ,δ−1 ,1 [0, 1]. Thus Borell’s inequality in the form of Corollary 5.1 in van der Vaart and van
Zanten [44] for the Gaussian process W leads to
√
2
P(kW kγ+δ,δ−1 ,1 > M nεn ) . e−Cnεn ,
for any given C > 0, provided M is chosen large enough. The continuous embedding Bγ+δ,δ−1 ,1 ,→
Bγ,∞,∞ = C γ , any γ, δ > 0, γ ∈
/ N now implies that
√
2
P(kW kγ > M nεn ) . e−Cnεn .
√
Setting Fn = {w, kwkγ ≤ M nεn }, one deduces, similar to [43], that Π(Fn | Y n ) → 1. In the
sequel, we thus work on the set Fn .
Let us now expand w0 − w + c(w) onto the CDV wavelet basis on [0, 1]. Let Kn = n1/(2α+1)
and set Ln = log2 Kn . Then
X X
kw0 − w + c(w)k∞ = k
hw0 − w + c(w), ψlk i2 ψlk k∞
l≥0 0≤k≤2l −1
≤
X
2l/2
l≤Ln
+
X
l>Ln
2l/2
max
|hw0 − w + c(w), ψlk i2 |
(C.11)
max
|hw0 − w + c(w), ψlk i2 |.
(C.12)
0≤k≤2l −1
0≤k≤2l −1
By Cauchy-Schwarz inequality, and using the fact that the maximum
√ of squares is bounded above
by the sum of the squares, the term (C.11) is bounded above by Kn kw0 − w + c(w)k2 . For the
term (C.12), let us write hw0 − w + c(w), ψlk i2 = hw0 − w, ψlk i2 by orthogonality of constants to
high resolution wavelets. Next using the control of kwkγ on Fn ,
X
X
√
1
2l/2 max |hw, ψlk i2 | . kwkγ
2l/2 2−l( 2 +γ) . nεn Kn−γ .
l>Ln
0≤k≤2l −1
l>Ln
Similarly, using that w0 ∈ C β , one gets that the same quantity with w replaced by w0 is bounded
above by Kn−β .
Putting together the previous inequalities and (C.10), one obtains
p
√
ckw0 − w + c(w)k22 ≤ Ch2 (f, f0 ) 1 + o(1) + nεn Kn−γ + Kn kw0 − w + c(w)k2
√
≤ Ch2 (f, f0 ) 2 + o(1) + nεn Kn−γ + Kn kw0 − w + c(w)k22
√
. (1 + nεn Kn−γ )ε2n + (Kn ε2n )kw0 − w + c(w)k22 ,
where for the last inequality we have used that (4.16) in [17] implies posterior convergence in the
Hellinger distance at rate εn , as in van der Vaart and van Zanten [43]. Since Kn ε2n is a o(1) by
assumption, one obtains
(c/2)kw0 − ϕn + c(ϕn )k22 ≤ O(1)ε2n .
Inserting this bound back in the previous inequality kw0 − w + c(w)k∞ ≤ (C.11) + (C.12) in the
bound of (C.11) leads to
p
√
kw0 − w + c(w)k∞ ≤ Kn εn + nεn Kn−γ + Kn−β .
Conclude that kw0 − w + c(w)k∞ ≤ ρn .
35
The squared L2 -norm can be expressed as
Z
Z
2
(f − f0 ) = f02 (ew−c(w)−w0 − 1)2
From what precedes we know that with posterior probability tending to 1, the sup-norm of w −
c(w) − w0 is bounded. Therefore, the inequality |ex − 1| ≤ C|x|, valid for bounded x and C large
enough implies
Z
Z
(f − f0 )2 ≤ C 2 f02 (w − c(w) − w0 )2 . kw − c(w) − w0 k22 . ε2n
on a set of posterior probability tending to 1, using that f0 is bounded from above. For the result
in sup-norm, we again use the previous inequality to obtain, on a set of overwhelming posterior
probability,
kf − f0 k∞ = kf0 (ew−c(w)−w0 − 1)k∞ ≤ Ckw − c(w) − w0 k∞ .
D
Appendix: Autoregressive model, proof
Proof of Theorem 5.1. Since the model is uniformly geometrically ergodic, the choice of the initial
distribution does not matter and we can work without loss of generality under the stationary
distribution, denoted P0 .
Let An = {fω,k ; k ≤ k1 kn (β), kfω,k − f0 k2,r ≤ M εn (β)}. Following Ghosal and van der Vaart
[29] Section 7.4.1, we can prove that
Π [An |Y n ] = 1 + op (1).
(D.1)
R
Indeed, denote by I0 = [−an , an ]c and ω r = (ω1r , · · · , ωkr ) , with ωjr = r(Ij )−1 Ij f0 (x)r(x)dx, then
kf0 1lI0c ks,r ≤ M Φ(−an + M ) . n−b
2
(1−δ)/2
,
∀δ > 0
for n large enough, where Φ(·) is the cumulative distribution function of a standard Gaussian
distribution. We thus choose b2 (1 − δ) ≥ 2β/(2β + 1) for some δ > 0 arbitrarily small. Then, for
all f0 ∈ C β , all j ≥ 1 and any k such that L(an /k)β ≤ εn (β)/2,
π(kf − f0 ks,r ≤ εn (β)) ≥ πk (k)πω|k (kω − ω0 ks,r ≤ εn (β)/2)
choosing k = bk0 kn (β)c implies that
π(kf − f0 ks,r ≤ εn (β)) ≥ e−c1 kn (β) log n (εn (β)/(4L))kn (β) ≤ e−ckn (β) log n
for some c > 0 large enough. Moreover Π(k > k1 kn (β)) ≤ e−ck1 kn (β) log n so that if k1 is large
enough, combining the above results with Section 7.4.1 Ghosal and van der Vaart [29], we finally
obtain (D.1).
We now study the LAN expansion in the model. Conditioning on Y0 = y0 ,
`y0 (f ) − `y0 (f0 ) =
n
X
i (f (Yi−1 ) − f0 (Yi−1 )) −
i=1
where q0 = qf0
n−1
1X
(f0 (Yi ) − f (Yi ))2
2 i=0
√
n
= − kf0 − f k22,q0 + nWn (f − f0 ) + Rn (f, f0 )
2
Pn
and Wn (g) = n−1/2 i=1 i g(Yi−1 ) and
√
n−1
n
1 X
2
Rn (f, f0 ) = −
Gn ((f0 − f ) ) := −
(f0 (Yi ) − f (Yi ))2 − kf0 − f k22,q0 .
2
2 i=0
36
Next let us
√study the expansion of the functional ψ(f ). If β > 1/2, for all f ∈ An , kf −f0 k2,q0 .
εn (β) = o(1/ n) and since for all f such that kf k∞ ≤ L, r(y) . qf (y) . r(y), it holds
Z
ψ(f ) − ψ(f0 ) =2 q0 (y)(f − f0 )(y)f0 (y)dy
R
Z
√
+ 2 (qf − q0 )(y)(f − f0 )(y)f0 (y)dy + o(1/ n),
R
uniformly over An . Moreover, simple computations imply that
Z
|pf (y|x) − pf0 (y|x)| r(x)dxdy ≤ C(L)kf − f0 k2,r ,
R2
where C(L) is a constant depending only on L. Using the Markov property we obtain for all m ≥ 1
Z
Z (m)
(m)
|pf (y|x) − pf0 (y|x)| r(x)dxdy
pf (y|x) − pf0 (y|x) r(x)dxdy ≤ mk
R2
R2
≤ mC(L)kf − f0 k2,r ,
(m)
where pf (y|x) is the conditional distribution of Ym given Y0 = x. Since the Markov chain under
Pf is uniformly geometrically ergodic we can deduce choosing m = bC0 log nc := mn with C0 large
enough
kqf − q0 k1 . 2mn kf − f0 k2,r + 2ρmn . n log n
with ρ < 1 and independent of f (depending only on L). Hence, uniformly over An ,
Z
√
ψ(f ) = ψ(f0 ) + 2 q0 (y)(f − f0 )(y)f0 (y)dy + o(1/ n),
R
(1)
(2)
(1) √
so that ψ0 = 2f0 and ψ0 = 0. Set ft = f − tψ0 / n.
We now have to verify assumption A, case A1, i.e. control
2t2
Rn (ft , f0 ) − Rn (f, f0 ) = − √ Gn (f02 ) + 2tGn (f0 (f − f0 )).
n
Let k ≤ k1 kn (β), one can write, if f = fω,k ,
2t2
Rn (ft , f0 ) − Rn (f, f0 ) = − √ Gn (f02 ) + 2tGn (f0 (fω0 ,k − f0 ) + f0 (fω,k − fω0 ,k )).
n
Since kf0 k2,q0 ≤ kf0 k∞ ≤ L and since the Markov chain (Yi ) √
is geometrically uniformly ergodic
under the assumptions on f0 , we obtain that Gn (f02 ) = op ( n). Also we decompose f0 into
f0,an = f0 (x)1lx∈[−an ,an ] and f¯0,an = f0 1l[−an ,an ]c . We have fωr ,k − f0 = fωr ,k − f0,an − f¯0,an and
kf¯0,an k2,q0 . εn so that Gn (f0 f¯0,an ) = op (1). To control uniformly on k ≤ k1 kn (β), Gn (f0 (fω0 ,k −
f0,an )) we use Theorem 8 of Adamczak [1] which states that there exists a constant κ0 depending
on the Hölder constant K0 of f0 ,
P0 [|Gn (f0 (fωr ,k − f0,an ))| > t] ≤ exp −κ0 t2 k 2β
since the Markov chain (Yj )j is aperiodic, irreducible, satisfies the drift condition and since
n−1
X
f0 (xi )(fωr ,k − f0,an )(xi ) − f0 (yi )(fωr ,k − f0,an )(yi )]
i=0
≤ kf0 k∞ kfωr ,k − f0,an k∞ |{i; xi 6= yi }| ≤ kf0 k∞ K0 k −β |{i; xi 6= yi }|.
37
Therefore Gn (f0 (fωr ,k − f0,an )) = op (1) uniformly on {k̃n ≤ k ≤ k1 kn }, for any sequence k̃n
increasing to infinity.
R Now for all m0 > 0 and all k ≤ m0 such that fωr ,k ∈ An , writing h =
f0 (fωr ,k − f0,an ) − R f0 (fωr ,k − f0,an )q0 (y)dy, it holds
P0 [|Gn (h)| > δ] ≤
n−1
khk22,q0
1 XX
εn
+
E0 [h(Yi )h(Yj )] . 2 ,
δ2
n i=0 j>i
δ
so that Gn (f0 (fωr ,k − f0,an )) = op (1) uniformly on {1 ≤ k ≤ k1 kn (β)} ∩ {k; fωr ,k ∈ An }. We now
study Gn (f0 (fω,k − fωr ,k )) on An . We have
Gn (f0 (fω,k − fωr ,k )) =
k
X
(ωj − ωjr )Gn (f0 1lIj ).
j=1
We use Theorem 5 of Adamczak and Bednorz [2] with m = 1 the small set being the whole set so
that σ 2 ≤ E0 [f02 (Y1 )1lIj ] . r(Ij ), α = 1 and the constants a, b, c uniformly bounded in a similar
way. We present our bound in the case of a. As in Adamczak and Bednorz [2], we define
a = inf{c > 0; EP̄x exp |f0 (Y1 )1lIj − µ0,j |/c ≤ 2}, µ0,j = E0 [f0 (Yj )1lIj ] = O(r(Ij ))
where P̄x is the distribution of the split chain starting at x. For all c > 0,
EP̄x exp |f (Y1 )1lIj − µ0,j |/c ≤ r(Ij )eL/c + 1 ≤ 2
as soon as c ≥ a0 | log r(Ij )|−1 for some a0 > 0. For all j ≤ k and k ≤ k1 kn (β), one thus obtains
√
P0 |Gn (f0 1lIj )| > t . exp −κ1 nt| log r(Ij )|
nt2
√
+ exp −κ1
.
nr(Ij ) + nt| log(r(Ij ))|−1 log n
Note that by definition of an and a0 , nr(Ij ) & n1−2β/(2β+1)−δ for some δ arbitrarily small. Choose
t = t0 r(Ij )1/2 , t0 > 0, then with probability smaller than e−κ2 n
and κ2 > 0,
Gn (f0 (fω,k − fωr ,k )) .
k
X
(1−δ 0 )/(2β+1)
for some δ 0 > 0 small
1r(Ij )>rn /n |ωj − ωjr |r(Ij )1/2
j=1
√
.
kkω − ω r k2,r = o(1)
which implies that uniformly over An
Rn (ft , f0 ) − Rn (f, f0 ) = op (1)
and assumption A is verified.
We then need only prove (2.14) in the main text [17]. To do so we first make the change of
variables
0
0
ωt = ω − tω[k]
, ω[k]
= (ωj0 , j = 0, . . . , 2an k)
38
and compare `y0 (fω,t ) − `y0 (fωt ).
`y0 (fω,t ) − `y0 (fωt ) = −tn−1/2
n
X
0 ) +
i (f0 − fω[k]
i=1
t
+√
n
n−1
X
0 (Yi ))(fω − f0 )(Yi )
(f0 (Yi ) − fω[k]
i=0
= −tn−1/2
n
X
0 ) +
i (f0 − fω[k]
i=1
+
√
n−1
t2 X
2
0 (Yi ))
(f0 (Yi ) − fω[k]
2n i=0
t2
2
0 k
kf0 − fω[k]
2,q0
2
Z
nt
R
0 (y))(fω − f0 )(y)q0 (y)dy
(f0 (y) − fω[k]
t2
0 ) + tGn ((f0 − fω 0 )(fω − f0 )).
+ √ Gn (f0 − fω[k]
[k]
2 n
Using the above computations, on An
0 ) = op (1)
Gn (f0 (Yi ) − fω[k]
uniformly in k and
0 )(fω − f0 )) = Gn ((f0 − fω 0 )(fω − fω 0 )) + op (1)
Gn ((f0 − fω[k]
[k]
[k]
=
2a
nk
X
0 )1lI ) = op (1)
(ωj − ωj0 )Gn ((f0 − fω[k]
j
j=1
uniformly in k and over An . Combining these results with condition (5.3) in the main text [17]
concludes the proof of Theorem 5.1.
39
© Copyright 2026 Paperzz